Understanding Basic Statistical Terminology with Python
In this post, we’ll explore some fundamental statistical concepts using Python and explain them in detail. We’ll be working with a dataset of student scores in an exam, and we’ll use Python’s statistics module for the calculations and the matplotlib library for visualization.
Let’s start by importing the necessary libraries and defining our dataset:
import matplotlib.pyplot as plt
import statistics
Student data
# Student scores in an exam
student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]
Mean, Median, and Mode
The mean is the average value of the dataset. The median is the middle value when the data is sorted in ascending order; with an even number of values, as here, it is the average of the two middle values. The mode is the most frequent value in the dataset.
mean_score = statistics.mean(student_scores)
median_score = statistics.median(student_scores)
mode_score = statistics.mode(student_scores)
print("Mean:", mean_score)      # Mean: 83.7
print("Median:", median_score)  # Median: 85.0
print("Mode:", mode_score)      # Mode: 85
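To make the three definitions concrete, here is a minimal sketch that computes them by hand. The helper names manual_mean, manual_median, and manual_mode are illustrative and not part of the statistics module; the results should agree with the library calls for this dataset.

```python
from collections import Counter

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

def manual_mean(values):
    # Sum of all values divided by how many there are
    return sum(values) / len(values)

def manual_median(values):
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # Odd count: the single middle value
        return ordered[mid]
    # Even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

def manual_mode(values):
    # most_common(1) returns [(value, count)] for the most frequent value
    return Counter(values).most_common(1)[0][0]

print(manual_mean(student_scores))    # 83.7
print(manual_median(student_scores))  # 85.0
print(manual_mode(student_scores))    # 85
```

Writing the median out this way also shows why the result is 85.0 rather than 85: with ten scores, the two middle values are averaged, which produces a float.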
Standard Deviation and Variance
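Standard deviation measures how far data points typically spread from the mean; it is the square root of the variance. Variance is the average of the squared differences from the mean. Note that statistics.stdev and statistics.variance compute the sample versions, which divide by n - 1 rather than n; statistics.pstdev and statistics.pvariance give the population versions.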
std_deviation = statistics.stdev(student_scores)
variance = statistics.variance(student_scores)
print("Standard Deviation:", std_deviation)  # Standard Deviation: ~5.478
print("Variance:", variance)                 # Variance: ~30.011
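The difference between the sample and population formulas is easiest to see when the variance is computed by hand. The sketch below is only an illustration (the variable names are my own); it divides the same sum of squared differences by n - 1 and by n, then checks each result against the corresponding statistics function.

```python
import math
import statistics

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

n = len(student_scores)
mean = sum(student_scores) / n
# Sum of squared differences from the mean
squared_diffs = sum((x - mean) ** 2 for x in student_scores)

sample_variance = squared_diffs / (n - 1)  # what statistics.variance computes
population_variance = squared_diffs / n    # what statistics.pvariance computes

assert math.isclose(sample_variance, statistics.variance(student_scores))
assert math.isclose(population_variance, statistics.pvariance(student_scores))

print("Sample variance:", round(sample_variance, 3))          # 30.011
print("Population variance:", round(population_variance, 3))  # 27.01
print("Sample std dev:", round(math.sqrt(sample_variance), 3))
```

Use the sample versions when the scores are a subset drawn from a larger group, and the population versions when the list contains every value of interest.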
Range and Quartiles
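The range is the difference between the maximum and minimum values in the dataset. Quartiles are the three cut points (Q1, Q2, Q3) that divide the sorted data into four equal parts. Here Q1 is computed as the median of the lower half of the sorted scores and Q3 as the median of the upper half; with ten values, the halves are simply the first five and last five sorted scores.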
range_score = max(student_scores) - min(student_scores)
sorted_scores = sorted(student_scores)
q1 = statistics.median(sorted_scores[:len(sorted_scores)//2])
q2 = statistics.median(sorted_scores)
q3 = statistics.median(sorted_scores[len(sorted_scores)//2:])
print("Range:", range_score)  # Range: 16
print("Q1:", q1)              # Q1: 78
print("Q2 (Median):", q2)     # Q2 (Median): 85.0
print("Q3:", q3)              # Q3: 88
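The median-of-halves approach is only one of several quartile conventions. As a point of comparison, the sketch below uses statistics.quantiles (available from Python 3.8), which interpolates between data points and supports an "exclusive" (default) and an "inclusive" method. Its Q1 and Q3 may differ slightly from the values above, so treat the choice of method as an assumption to state when reporting quartiles.

```python
import statistics

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

# n=4 asks for quartiles; the function returns the three cut points [Q1, Q2, Q3]
q_exclusive = statistics.quantiles(student_scores, n=4)  # method="exclusive" by default
q_inclusive = statistics.quantiles(student_scores, n=4, method="inclusive")

print("Exclusive method:", q_exclusive)
print("Inclusive method:", q_inclusive)

# Whatever the method, Q2 is the median of the data
print("Q2 matches median:", q_exclusive[1] == statistics.median(student_scores))
```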
Interquartile Range (IQR)
The Interquartile Range (IQR) is the range between the first and third quartiles, measuring the spread of data.
iqr = q3 - q1
print("Interquartile Range (IQR):", iqr)  # Interquartile Range (IQR): 10
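One common use of the IQR, sketched here as an illustration rather than as part of the original walkthrough, is flagging potential outliers: values more than 1.5 × IQR below Q1 or above Q3. The 1.5 multiplier is a convention, and the names lower_fence and upper_fence are my own.

```python
import statistics

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

# Quartiles via the median-of-halves method
sorted_scores = sorted(student_scores)
half = len(sorted_scores) // 2
q1 = statistics.median(sorted_scores[:half])
q3 = statistics.median(sorted_scores[half:])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles (a conventional threshold)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [s for s in student_scores if s < lower_fence or s > upper_fence]
print(lower_fence, upper_fence, outliers)  # 63.0 103.0 []
```

For this dataset every score falls inside the fences, so no value is flagged.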
Correlation Coefficient
The correlation coefficient (Pearson’s r) measures the strength and direction of the linear relationship between two variables. We’ll calculate the correlation coefficient between hours studied and test scores.
def correlation_coefficient(x, y):
    """Return the Pearson correlation coefficient of x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population covariance: the mean product of deviations (note the division by n)
    covariance = sum((x[i] - mean_x) * (y[i] - mean_y) for i in range(n)) / n
    # Population standard deviations, using the same divisor n
    std_dev_x = (sum((xi - mean_x) ** 2 for xi in x) / n) ** 0.5
    std_dev_y = (sum((yi - mean_y) ** 2 for yi in y) / n) ** 0.5
    # The result always lies between -1 and 1
    correlation = covariance / (std_dev_x * std_dev_y)
    return correlation

hours_studied = [4, 6, 3, 5, 7]
test_scores = [85, 90, 82, 88, 92]
correlation = correlation_coefficient(hours_studied, test_scores)
print("Correlation between hours studied and test scores:", correlation)
# Correlation between hours studied and test scores: ~0.9944
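A quick sanity check for any correlation routine is that the result must lie between -1 and 1. The sketch below computes Pearson’s r directly from the sums of deviations (the 1/n factors in the covariance and the standard deviations cancel) and, where available, compares it with statistics.correlation, which was added in Python 3.10. The function name pearson_r is illustrative.

```python
import math
import statistics

hours_studied = [4, 6, 3, 5, 7]
test_scores = [85, 90, 82, 88, 92]

def pearson_r(x, y):
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of products and squares of deviations from the means
    sum_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sum_xx = sum((xi - mean_x) ** 2 for xi in x)
    sum_yy = sum((yi - mean_y) ** 2 for yi in y)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

r = pearson_r(hours_studied, test_scores)
print(round(r, 4))  # 0.9944

# A valid correlation coefficient is always within [-1, 1]
assert -1.0 <= r <= 1.0

# Cross-check against the standard library on Python 3.10+
if hasattr(statistics, "correlation"):
    assert math.isclose(r, statistics.correlation(hours_studied, test_scores))
```

A value this close to 1 indicates a strong positive linear relationship in this small sample: students who studied longer tended to score higher.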
Scatter Plot Visualization
Lastly, we’ll visualize the relationship between hours studied and test scores using a scatter plot.
plt.scatter(hours_studied, test_scores)
plt.xlabel('Hours Studied')
plt.ylabel('Test Scores')
plt.title('Hours Studied vs. Test Scores')
plt.show()
Original link: Statistical Essentials for Data Analysts: A Beginner’s Guide