Descriptive Statistics Python implementation

4 min read · Aug 6, 2023

Descriptive Statistics

Measures of Central Tendency:

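Mean: The average value of a dataset (the sum of the values divided by their count).
Median: The middle value of a dataset when arranged in ascending order (the average of the two middle values when the count is even).
Mode: The value that appears most frequently in the dataset. It is most useful for discrete or categorical data.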

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Generating sample data
np.random.seed(42)
data = np.random.normal(loc=50, scale=10, size=1000)
categories = np.random.choice(['A', 'B', 'C'], size=1000)
df = pd.DataFrame({'Data': data, 'Category': categories})

mean_value = df['Data'].mean()
median_value = df['Data'].median()
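# mode() returns every most-frequent value, sorted in ascending order.
# The floats generated here are all distinct, so each one ties as a mode
# and element [0] is simply the smallest value in the data.
mode_value = df['Data'].mode().values[0]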

print(f"Mean: {mean_value}")
print(f"Median: {median_value}")
print(f"Mode: {mode_value}")

OUTPUT:

Mean: 50.193320558223256
Median: 50.25300612234888
Mode: 17.58732659930927
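
The mode printed above is far from the mean and median because continuous data rarely repeats a value. The following is a minimal sketch of that behavior, using hypothetical toy values that are not part of the dataset above:

```python
import pandas as pd

# Hypothetical toy values, chosen only for illustration
continuous = pd.Series([3.7, 1.2, 2.9, 4.4])  # every value is unique
discrete = pd.Series([2, 3, 3, 5, 3, 2])      # 3 occurs most often

# With no repeats, every value ties as a mode; pandas returns them sorted
all_modes = continuous.mode()
first_mode = all_modes.iloc[0]                # same as the minimum

# With repeats there is a single, meaningful mode
discrete_mode = discrete.mode().iloc[0]

print(len(all_modes), first_mode, discrete_mode)
```

For continuous data like the 'Data' column, one option is to round or bin the values first, or to take the mode of the 'Category' column instead.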

Measures of Dispersion or Spread:

Variance: The average squared deviation of the data points from the mean. pandas var() returns the sample variance (it divides by n - 1).
Standard Deviation: The square root of the variance, indicating the spread of data around the mean in the original units.
Range: The difference between the maximum and minimum values in the dataset.
Interquartile Range (IQR): The range between the first quartile (25th percentile) and the third quartile (75th percentile).

# Measures of Dispersion
variance_value = df['Data'].var()
std_deviation_value = df['Data'].std()
range_value = df['Data'].max() - df['Data'].min()
iqr_value = df['Data'].quantile(0.75) - df['Data'].quantile(0.25)

print(f"Variance: {variance_value}")
print(f"Standard Deviation: {std_deviation_value}")
print(f"Range: {range_value}")
print(f"Interquartile Range (IQR): {iqr_value}")

OUTPUT:

Variance: 95.88638535851024
Standard Deviation: 9.792159381796756
Range: 70.93998830723794
Interquartile Range (IQR): 12.955341809352817
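
To make the definitions concrete, here is a small sketch that computes the variance by hand. The toy values are hypothetical. It shows the difference between dividing by n (population variance) and by n - 1 (sample variance, which is what pandas var() and std() use by default):

```python
import math

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical toy data
n = len(values)
mean = sum(values) / n

# Sum of squared deviations from the mean
squared_devs = sum((v - mean) ** 2 for v in values)

population_variance = squared_devs / n    # divide by n     (ddof=0)
sample_variance = squared_devs / (n - 1)  # divide by n - 1 (ddof=1)
sample_std = math.sqrt(sample_variance)

print(f"Population variance: {population_variance}")
print(f"Sample variance: {sample_variance:.4f}")
print(f"Sample standard deviation: {sample_std:.4f}")
```

The two variances converge as n grows, so the distinction matters mostly for small samples.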

Probability Distributions:

Normal Distribution: A symmetrical bell-shaped distribution often encountered in real-world data.
Skewed Distributions: Asymmetric distributions with a longer tail on one side. Positive (right) skew has a long right tail with most values concentrated on the left; negative (left) skew is the reverse.
Uniform Distribution: When all outcomes are equally likely.

# Probability Distribution
normal_distribution = np.random.normal(loc=50, scale=10, size=10000)
plt.figure(figsize=(5, 3))
plt.hist(normal_distribution, bins=30, alpha=0.6)
plt.title("Normal Distribution")
plt.xlabel("Data")
plt.ylabel("Frequency")
plt.show()
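
The plot above covers only the normal case. As a rough sketch of the other two shapes, the snippet below draws a uniform sample and a right-skewed sample. The exponential distribution is used here only as one convenient example of positive skew, and the seed and parameters are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for repeatability

# Uniform: every value in [0, 100) is equally likely
uniform_sample = rng.uniform(low=0, high=100, size=10000)

# Exponential: positively skewed, with a long right tail
skewed_sample = rng.exponential(scale=10, size=10000)

uniform_mean, uniform_median = uniform_sample.mean(), np.median(uniform_sample)
skewed_mean, skewed_median = skewed_sample.mean(), np.median(skewed_sample)

print(f"Uniform: mean={uniform_mean:.2f}, median={uniform_median:.2f}")
print(f"Skewed:  mean={skewed_mean:.2f}, median={skewed_median:.2f}")
```

In the skewed sample the long right tail pulls the mean above the median, while in the uniform sample the two stay close. Either array can be passed to plt.hist() in the same way as the normal sample.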

Correlation and Covariance:

Correlation: Measures the strength and direction of the linear relationship between two continuous variables.
The value of correlation lies between -1 and +1.
+1 -> perfect positive linear relationship (the two variables move together)
-1 -> perfect negative linear relationship (one variable rises as the other falls)
0 -> no linear relationship between the two variables (a nonlinear relationship may still exist)

Covariance: Measures the joint variability of two variables, i.e. whether they tend to move in the same direction (positive) or in opposite directions (negative). Its magnitude depends on the units of the variables, so a larger number does not by itself mean a stronger relationship.
The value of covariance lies between -∞ and +∞.

import pandas as pd

# Sample data
cc_data = {
    'X': [1, 2, 3, 4, 5],
    'Y': [2, 3, 5, 4, 6]
}

# Create a DataFrame from the sample data
cc_df = pd.DataFrame(cc_data)

# Calculate Pearson correlation coefficient
pearson_corr = cc_df['X'].corr(cc_df['Y'])
print(f"Pearson Correlation Coefficient: {pearson_corr}")

# Calculate covariance
covariance = cc_df['X'].cov(cc_df['Y'])
print(f"Covariance: {covariance}")

OUTPUT:

Pearson Correlation Coefficient: 0.8999999999999998
Covariance: 2.25
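
The two numbers above are linked: the Pearson correlation is the covariance divided by the product of the two standard deviations. Here is a by-hand sketch using the same X and Y values. The helper function names are my own, not pandas API:

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 3, 5, 4, 6]

def sample_cov(a, b):
    """Sample covariance (divides by n - 1, like pandas cov())."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / (len(a) - 1)

def pearson(a, b):
    """Covariance rescaled by both standard deviations."""
    return sample_cov(a, b) / math.sqrt(sample_cov(a, a) * sample_cov(b, b))

cov_xy = sample_cov(x, y)
corr_xy = pearson(x, y)

# Rescaling Y (e.g. a change of units) changes the covariance but not the correlation
y_scaled = [10 * v for v in y]
cov_scaled = sample_cov(x, y_scaled)
corr_scaled = pearson(x, y_scaled)

print(f"cov={cov_xy}, corr={corr_xy:.2f}")
print(f"scaled cov={cov_scaled}, scaled corr={corr_scaled:.2f}")
```

This is why correlation is usually preferred for comparing the strength of relationships: it is unit-free, while covariance is not.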

Percentiles and Quartiles:

Percentiles: Divide the data into 100 equal parts, useful for finding specific values within the distribution.
Quartiles: Divide the data into four equal parts (Q1, Q2, Q3).

# Percentiles and Quartiles
percentile_25 = df['Data'].quantile(0.25)
percentile_50 = df['Data'].quantile(0.50)  # Equivalent to the median
percentile_75 = df['Data'].quantile(0.75)

print(f"25th Percentile (Q1): {percentile_25}")
print(f"50th Percentile (Median): {percentile_50}")
print(f"75th Percentile (Q3): {percentile_75}")

OUTPUT:

25th Percentile (Q1): 43.524096945376485
50th Percentile (Median): 50.25300612234888
75th Percentile (Q3): 56.4794387547293
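
A percentile often falls between two data points. pandas quantile() handles this with linear interpolation by default. The sketch below reproduces that rule by hand on hypothetical toy data; linear_quantile is an illustrative helper, not a library function:

```python
import math

def linear_quantile(values, q):
    """Quantile via linear interpolation between the two nearest sorted values."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)  # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

toy = [1, 2, 3, 4]  # hypothetical toy data
q1 = linear_quantile(toy, 0.25)  # position 0.75, between 1 and 2
q2 = linear_quantile(toy, 0.50)  # position 1.5, between 2 and 3
q3 = linear_quantile(toy, 0.75)  # position 2.25, between 3 and 4

print(q1, q2, q3)
```

Other interpolation rules exist and give slightly different answers on small datasets, so it is worth stating which one was used when reporting quartiles.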

Outliers and Data Cleaning:

Outliers are data points that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. The factor 1.5 is a common convention, but it can be adjusted for the specific use case.

# Outliers
outliers = df[(df['Data'] < percentile_25 - 1.5 * iqr_value) | (df['Data'] > percentile_75 + 1.5 * iqr_value)]
print("Outliers:")
print(outliers)

OUTPUT:

Outliers:
          Data Category
74   23.802549        A
179  77.201692        B
209  88.527315        B
262  17.587327        A
478  80.788808        A
646  23.031134        A
668  23.490302        B
755  76.323821        A
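
Once outliers are identified, the cleaning step is a separate decision. Two common options are dropping the flagged points or clipping them to the fence values. The sketch below shows both on hypothetical toy data; which one is appropriate depends on the analysis:

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])  # hypothetical toy data; 95 is extreme

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Option 1: drop points outside the fences
cleaned = data[(data >= lower_fence) & (data <= upper_fence)]

# Option 2: keep every point but clip extreme values to the fences
clipped = np.clip(data, lower_fence, upper_fence)

print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Dropped: {cleaned}")
print(f"Clipped: {clipped}")
```

Dropping changes the sample size, while clipping keeps it but distorts the tails, so neither should be applied without checking whether the extreme values are errors or genuine observations.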

Measures of Skewness and Kurtosis:

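Skewness: Measures the asymmetry of a dataset’s distribution (0 for a perfectly symmetric distribution, positive for a longer right tail, negative for a longer left tail).
Kurtosis: Measures how heavy the tails of a dataset’s distribution are compared with a normal distribution. pandas kurt() reports excess kurtosis, so a normal distribution scores approximately 0.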

# Skewness and Kurtosis
skewness_value = df['Data'].skew()
kurtosis_value = df['Data'].kurt()

print(f"Skewness: {skewness_value}")
print(f"Kurtosis: {kurtosis_value}")

OUTPUT:

Skewness: 0.11697636882001361
Kurtosis: 0.07256220235414812
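
Both measures come from the central moments of the data. The sketch below uses the plain moment-based formulas on hypothetical toy data. Note that pandas skew() and kurt() apply a small-sample bias correction, so their results will differ somewhat from these values, especially for small n:

```python
values = [1, 2, 3, 4, 5]  # hypothetical, perfectly symmetric toy data
n = len(values)
mean = sum(values) / n

def central_moment(k):
    """k-th central moment: the average of (value - mean) ** k."""
    return sum((v - mean) ** k for v in values) / n

m2, m3, m4 = central_moment(2), central_moment(3), central_moment(4)

skewness = m3 / m2 ** 1.5          # 0 for symmetric data
excess_kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal distribution

print(f"Skewness: {skewness:.4f}")
print(f"Excess kurtosis: {excess_kurtosis:.4f}")
```

The negative excess kurtosis here reflects tails that are lighter than a normal distribution's, which is expected for evenly spaced values.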

Central Limit Theorem:

The central limit theorem (CLT) states that if you repeatedly draw sufficiently large independent samples from a population with finite variance, the distribution of the sample means will be approximately normal, regardless of the shape of the population's distribution.

import numpy as np
import matplotlib.pyplot as plt

# Population parameters
population_mean = 100  # Mean of the population
population_stddev = 20  # Standard deviation of the population
population_size = 10000  # Size of the population

# Generating the population data (normally distributed)
population_data = np.random.normal(loc=population_mean, scale=population_stddev, size=population_size)

# Number of samples to draw from the population
num_samples = 1000

# Sample size for each sample
sample_size = 30

# Calculate the sample means for each sample
sample_means = [np.mean(np.random.choice(population_data, sample_size)) for _ in range(num_samples)]

# Plot the histogram of sample means
plt.figure(figsize=(5, 3))
plt.hist(sample_means, bins=30, alpha=0.6)
plt.axvline(x=population_mean, color='red', linestyle='dashed', linewidth=2, label='Population Mean')
plt.xlabel('Sample Means')
plt.ylabel('Frequency')
plt.title('Central Limit Theorem in Action')
plt.legend()
plt.show()
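
The population above is itself normal, so the result is not surprising. A stronger illustration starts from a clearly non-normal population. The sketch below assumes an exponential (right-skewed) population with arbitrary parameters and checks that the sample means still center on the population mean, with a spread close to the population standard deviation divided by the square root of the sample size:

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed for repeatability

# A right-skewed, non-normal population (illustrative parameters)
population = rng.exponential(scale=20, size=100_000)

num_samples = 2000
sample_size = 30

# Draw all samples at once (with replacement): one row per sample
samples = rng.choice(population, size=(num_samples, sample_size))
sample_means = samples.mean(axis=1)

# Spread of the sample means predicted by the CLT
expected_std = population.std() / np.sqrt(sample_size)

print(f"Population mean: {population.mean():.2f}")
print(f"Mean of sample means: {sample_means.mean():.2f}")
print(f"Std of sample means: {sample_means.std():.2f} (expected about {expected_std:.2f})")
```

Plotting sample_means with plt.hist() as in the block above should give a roughly bell-shaped histogram even though the population is strongly skewed.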


Written by Ram Thiagu
