Descriptive Statistics Python implementation

Ram Thiagu
4 min read · Aug 6, 2023


Descriptive Statistics

Measures of Central Tendency:

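  • Mean: The arithmetic average of a dataset: the sum of the values divided by their count.
  • Median: The middle value of a dataset when arranged in ascending order (the average of the two middle values when the count is even).
  • Mode: The value that appears most frequently in the dataset.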
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generating sample data
np.random.seed(42)
data = np.random.normal(loc=50, scale=10, size=1000)
categories = np.random.choice(['A', 'B', 'C'], size=1000)
df = pd.DataFrame({'Data': data, 'Category': categories})

# Measures of Central Tendency
mean_value = df['Data'].mean()
median_value = df['Data'].median()
# mode() returns all tied values in sorted order; take the first
mode_value = df['Data'].mode().values[0]

print(f"Mean: {mean_value}")
print(f"Median: {median_value}")
print(f"Mode: {mode_value}")
OUTPUT:

Mean: 50.193320558223256
Median: 50.25300612234888
Mode: 17.58732659930927
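
The mode printed above is simply the smallest value in the column: in continuous data every value is typically unique, so `mode()` returns all of them in sorted order and `values[0]` picks the first. A minimal sketch of two more informative alternatives, assuming the same seeded sample as above; the bin width of 5 and the names `n_modes`, `binned` and `modal_bin` are illustrative choices, not part of the original code:

```python
import numpy as np
import pandas as pd

# Recreate the same seeded sample used above
np.random.seed(42)
data = np.random.normal(loc=50, scale=10, size=1000)
categories = np.random.choice(['A', 'B', 'C'], size=1000)
df = pd.DataFrame({'Data': data, 'Category': categories})

# Every value is unique, so mode() returns all of them, sorted
n_modes = len(df['Data'].mode())
print(f"Number of 'modes' in raw data: {n_modes}")

# Option 1: mode of a categorical column, where repeats are expected
category_mode = df['Category'].mode().iloc[0]
print(f"Most frequent category: {category_mode}")

# Option 2: bin the continuous values first (width 5 is an arbitrary choice)
bins = np.arange(0, 105, 5)
binned = pd.cut(df['Data'], bins=bins)
modal_bin = binned.value_counts().idxmax()
print(f"Modal bin: {modal_bin}")
```

The modal bin depends on the chosen bin width, so it is best read as a rough indication of where the data is densest.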

Measures of Dispersion or Spread:

  • Variance: The average of the squared deviations of the data points from the mean.
  • Standard Deviation: The square root of the variance, indicating the spread of data around the mean in the same units as the data.
  • Range: The difference between the maximum and minimum values in the dataset.
  • Interquartile Range (IQR): The range between the first quartile (25th percentile) and the third quartile (75th percentile).
# Measures of Dispersion
variance_value = df['Data'].var()
std_deviation_value = df['Data'].std()
range_value = df['Data'].max() - df['Data'].min()
iqr_value = df['Data'].quantile(0.75) - df['Data'].quantile(0.25)

print(f"Variance: {variance_value}")
print(f"Standard Deviation: {std_deviation_value}")
print(f"Range: {range_value}")
print(f"Interquartile Range (IQR): {iqr_value}")
OUTPUT:

Variance: 95.88638535851024
Standard Deviation: 9.792159381796756
Range: 70.93998830723794
Interquartile Range (IQR): 12.955341809352817
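
One detail worth knowing: pandas computes the sample variance by default (dividing by n - 1), whereas NumPy's `np.var` defaults to the population variance (dividing by n). A small sketch of the difference, using a made-up five-value sample; the data and variable names are assumptions for illustration:

```python
import numpy as np
import pandas as pd

values = pd.Series([2.0, 4.0, 4.0, 5.0, 10.0])  # toy data, assumed for illustration
n = len(values)
squared_deviations = (values - values.mean()) ** 2

sample_variance = squared_deviations.sum() / (n - 1)   # divides by n - 1 (ddof=1)
population_variance = squared_deviations.sum() / n     # divides by n (ddof=0)

# pandas defaults to ddof=1, NumPy to ddof=0
print(f"Sample variance: {sample_variance} (pandas: {values.var()})")
print(f"Population variance: {population_variance} (NumPy: {np.var(values.to_numpy())})")
```

For large datasets such as the 1000-point sample above the two are nearly identical; the choice matters mostly for small samples.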

Probability Distributions:

  • Normal Distribution: A symmetrical bell-shaped distribution often encountered in real-world data.
  • Skewed Distributions: Asymmetric distributions with a longer tail on one side: to the right for positive skew, to the left for negative skew.
  • Uniform Distribution: When all outcomes are equally likely.
# Probability Distribution
normal_distribution = np.random.normal(loc=50, scale=10, size=10000)
plt.figure(figsize=(5, 3))
plt.hist(normal_distribution, bins=30, alpha=0.6)
plt.title("Normal Distribution")
plt.xlabel("Data")
plt.ylabel("Frequency")
plt.show()
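
Only the normal case is plotted above. As a rough sketch, the other two shapes from the list can be generated in the same way. The exponential distribution is used here as one example of a positively skewed distribution, and the seed and parameters are arbitrary choices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seed chosen arbitrarily

skewed = rng.exponential(scale=10, size=10000)       # long right tail
uniform = rng.uniform(low=0, high=100, size=10000)   # flat between 0 and 100

# In right-skewed data the mean is pulled above the median
print(f"Skewed mean: {skewed.mean():.2f}, median: {np.median(skewed):.2f}")
print(f"Skewed skewness: {pd.Series(skewed).skew():.2f}")

# In uniform data, equal-width bins hold roughly equal counts
counts, _ = np.histogram(uniform, bins=10, range=(0, 100))
print(f"Uniform bin counts: {counts}")
```

Passing `skewed` or `uniform` to `plt.hist` in place of `normal_distribution` shows the corresponding shapes.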

Correlation and Covariance:

  • Correlation: Measures the strength and direction of the linear relationship between two continuous variables.
  • The value of correlation lies between -1 and +1.
  • +1 -> perfect positive linear relationship: the two variables move together
  • -1 -> perfect negative linear relationship: the two variables move in opposite directions
  • 0 -> no linear relationship between the two variables (a nonlinear one may still exist)
  • Covariance: Measures the joint variability of two variables, i.e. how they vary together. Its sign gives the direction of the relationship, but its magnitude depends on the units of the variables, so a larger value does not by itself mean a stronger relationship.
  • The value of covariance can be any real number, from -∞ to +∞.
import pandas as pd

# Sample data
cc_data = {
    'X': [1, 2, 3, 4, 5],
    'Y': [2, 3, 5, 4, 6]
}

# Create a DataFrame from the sample data
cc_df = pd.DataFrame(cc_data)

# Calculate Pearson correlation coefficient
pearson_corr = cc_df['X'].corr(cc_df['Y'])
print(f"Pearson Correlation Coefficient: {pearson_corr}")

# Calculate covariance
covariance = cc_df['X'].cov(cc_df['Y'])
print(f"Covariance: {covariance}")
OUTPUT:

Pearson Correlation Coefficient: 0.8999999999999998
Covariance: 2.25
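
To make the link between the two measures concrete, here is a minimal sketch that computes both from their definitions on the same X and Y values, and then rescales Y to show that covariance depends on units while correlation does not. The factor of 100 and the variable names are illustrative assumptions:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2, 3, 5, 4, 6], dtype=float)

dx = x - x.mean()
dy = y - y.mean()

# Sample covariance: sum of products of deviations, divided by n - 1
covariance = (dx * dy).sum() / (len(x) - 1)

# Pearson correlation: covariance rescaled by both sample standard deviations
correlation = covariance / (x.std(ddof=1) * y.std(ddof=1))
print(f"Covariance: {covariance}, Correlation: {correlation}")

# Rescaling Y (e.g. a change of units) changes the covariance but not the correlation
y_scaled = y * 100
cov_scaled = np.cov(x, y_scaled)[0, 1]
corr_scaled = np.corrcoef(x, y_scaled)[0, 1]
print(f"Scaled covariance: {cov_scaled}, Scaled correlation: {corr_scaled}")
```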

Percentiles and Quartiles:

  • Percentiles: Divide the data into 100 equal parts, useful for finding specific values within the distribution.
  • Quartiles: Divide the data into four equal parts (Q1, Q2, Q3).
# Percentiles and Quartiles
percentile_25 = df['Data'].quantile(0.25)
percentile_50 = df['Data'].quantile(0.50) # Equivalent to the median
percentile_75 = df['Data'].quantile(0.75)

print(f"25th Percentile (Q1): {percentile_25}")
print(f"50th Percentile (Median): {percentile_50}")
print(f"75th Percentile (Q3): {percentile_75}")
OUTPUT:

25th Percentile (Q1): 43.524096945376485
50th Percentile (Median): 50.25300612234888
75th Percentile (Q3): 56.4794387547293
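
When a requested percentile falls between two data points, pandas' `quantile` interpolates linearly between them by default. A tiny sketch with made-up values, where the result can be checked by hand:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4])  # toy data, assumed for illustration

# Position of the 25th percentile: 0.25 * (n - 1) = 0.75,
# i.e. three quarters of the way from 1 to 2
quartiles = values.quantile([0.25, 0.50, 0.75])
print(quartiles)

# describe() reports the same quartiles alongside other summary statistics
summary = values.describe()
print(summary[['25%', '50%', '75%']])
```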

Outliers and Data Cleaning:

  • A common rule flags as outliers the data points that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. The factor 1.5 is a conventional threshold and can be adjusted for the specific use case.
# Outliers
lower_bound = percentile_25 - 1.5 * iqr_value
upper_bound = percentile_75 + 1.5 * iqr_value
outliers = df[(df['Data'] < lower_bound) | (df['Data'] > upper_bound)]
print("Outliers:")
print(outliers)
OUTPUT:

Outliers:
          Data Category
74   23.802549        A
179  77.201692        B
209  88.527315        B
262  17.587327        A
478  80.788808        A
646  23.031134        A
668  23.490302        B
755  76.323821        A
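
Since the threshold factor is adjustable, the rule can be wrapped in a small helper. This is a sketch only: the function name `iqr_outliers`, its parameter `k`, and the toy data are hypothetical and not part of the original code:

```python
import pandas as pd

def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values of `series` outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]

sample = pd.Series([10, 12, 11, 13, 12, 11, 50])  # toy data with one obvious outlier

print(iqr_outliers(sample))           # default fence, k=1.5
print(iqr_outliers(sample, k=30.0))   # a much wider fence flags nothing
```

A larger `k` makes the rule more conservative. Whether flagged points should be removed, capped, or kept depends on the use case.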

Measures of Skewness and Kurtosis:

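  • Skewness: Measures the asymmetry of a dataset’s distribution; 0 indicates symmetry, positive values a longer right tail, and negative values a longer left tail.
  • Kurtosis: Measures the weight of the tails of a dataset’s distribution relative to a normal distribution (often loosely described as peakedness or flatness).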
# Skewness and Kurtosis
skewness_value = df['Data'].skew()
kurtosis_value = df['Data'].kurt()

print(f"Skewness: {skewness_value}")
print(f"Kurtosis: {kurtosis_value}")
OUTPUT:

Skewness: 0.11697636882001361
Kurtosis: 0.07256220235414812
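
Note that pandas' `kurt()` reports excess kurtosis, so normally distributed data scores close to 0 rather than 3, which matches the value above. A short sketch comparing three distributions; the Laplace and uniform distributions, the seed, and the sample size are illustrative choices:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen arbitrarily
n = 100_000

samples = {
    'normal': rng.normal(size=n),     # reference: excess kurtosis near 0
    'laplace': rng.laplace(size=n),   # heavier tails: theoretical value 3
    'uniform': rng.uniform(size=n),   # lighter tails: theoretical value -1.2
}

kurtosis = {name: pd.Series(values).kurt() for name, values in samples.items()}
for name, value in kurtosis.items():
    print(f"{name}: excess kurtosis = {value:.2f}")
```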

Central Limit Theorem:

  • The central limit theorem (CLT) states that if you repeatedly draw sufficiently large random samples from a population with finite variance, the distribution of the sample means will be approximately normal, regardless of the shape of the population distribution.
import numpy as np
import matplotlib.pyplot as plt

# Population parameters
population_mean = 100 # Mean of the population
population_stddev = 20 # Standard deviation of the population
population_size = 10000 # Size of the population

# Generating the population data (normally distributed)
population_data = np.random.normal(loc=population_mean, scale=population_stddev, size=population_size)

# Number of samples to draw from the population
num_samples = 1000

# Sample size for each sample
sample_size = 30

# Calculate the mean of each sample (np.random.choice draws with replacement by default)
sample_means = [np.mean(np.random.choice(population_data, sample_size)) for _ in range(num_samples)]

# Plot the histogram of sample means
plt.figure(figsize=(5, 3))
plt.hist(sample_means, bins=30, alpha=0.6)
plt.axvline(x=population_mean, color='red', linestyle='dashed', linewidth=2, label='Population Mean')
plt.xlabel('Sample Means')
plt.ylabel('Frequency')
plt.title('Central Limit Theorem in Action')
plt.legend()
plt.show()
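
The population above is itself normal, so its sample means would be normal at any sample size. The theorem is more striking with a non-normal population. The following is a sketch under assumed settings (an exponential population with mean 10, 2000 samples per sample size, and a hypothetical helper `sample_means`), tracking how the skewness of the sample means shrinks as the sample size grows:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # seed chosen arbitrarily

# A strongly right-skewed population (exponential, mean 10); an assumed example
population = rng.exponential(scale=10, size=100_000)

def sample_means(sample_size, num_samples=2000):
    """Draw `num_samples` samples with replacement and return their means."""
    draws = rng.choice(population, size=(num_samples, sample_size))
    return draws.mean(axis=1)

for sample_size in (2, 30, 200):
    means = sample_means(sample_size)
    skew = pd.Series(means).skew()
    print(f"n={sample_size:>3}: mean of means={means.mean():.2f}, "
          f"std of means={means.std(ddof=1):.2f}, skewness={skew:.2f}")
```

The skewness should move toward 0 as the sample size increases, and the spread of the sample means should shrink roughly in proportion to the population standard deviation divided by the square root of the sample size.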
