Statistical Essentials for Data Analysts: A Beginner’s Guide

Understanding Basic Statistical Terminologies with Python

In this post, we’ll explore some fundamental statistical concepts using Python and explain them in detail. We’ll be working with a dataset of student scores in an exam, and we’ll use Python’s statistics module and matplotlib library for visualization.

Let’s start by importing the necessary libraries and defining our dataset:

import matplotlib.pyplot as plt
import statistics


# Student scores in an exam
student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]


Mean, Median, and Mode
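The mean is the average of the dataset: the sum of the values divided by their count. The median is the middle value when the data is sorted in ascending order; with an even number of values, it is the average of the two middle values. The mode is the most frequent value in the dataset.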

mean_score = statistics.mean(student_scores)
median_score = statistics.median(student_scores)
mode_score = statistics.mode(student_scores)

print("Mean:", mean_score)  # Mean: 83.7
print("Median:", median_score)  # Median: 85.0
print("Mode:", mode_score)  # Mode: 85

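To make the three definitions concrete, here is a minimal sketch that computes them by hand. The helper names (manual_mean, manual_median, manual_mode) are illustrative and not part of any library; they are only meant to mirror what the statistics module does for this dataset.

```python
from collections import Counter

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

def manual_mean(values):
    # Sum of the values divided by how many there are
    return sum(values) / len(values)

def manual_median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd count: the single middle value
    # even count: average of the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2

def manual_mode(values):
    # most_common(1) returns [(value, count)] for the most frequent value;
    # ties are broken by first appearance
    return Counter(values).most_common(1)[0][0]

print(manual_mean(student_scores))    # 83.7
print(manual_median(student_scores))  # 85.0
print(manual_mode(student_scores))    # 85
```

With ten scores the median is the average of the fifth and sixth sorted values, which are both 85 here.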

Standard Deviation and Variance
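Standard deviation measures how far the data points typically lie from the mean, while variance is the average of the squared differences from the mean; the standard deviation is the square root of the variance. Note that statistics.stdev and statistics.variance compute the sample versions, which divide by n - 1 rather than n.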

std_deviation = statistics.stdev(student_scores)
variance = statistics.variance(student_scores)

print("Standard Deviation:", std_deviation)  # Standard Deviation: approximately 5.478
print("Variance:", variance)  # Variance: approximately 30.011

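The difference between the sample and population conventions is easy to miss, so the following sketch computes both by hand. The variable names are illustrative; statistics.pvariance and statistics.pstdev are the standard-library functions for the population versions.

```python
import math

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

n = len(student_scores)
mean = sum(student_scores) / n
# Sum of squared differences from the mean
squared_diffs = sum((x - mean) ** 2 for x in student_scores)

sample_variance = squared_diffs / (n - 1)  # what statistics.variance computes
population_variance = squared_diffs / n    # what statistics.pvariance computes

print(f"{sample_variance:.3f}")                 # 30.011
print(f"{math.sqrt(sample_variance):.3f}")      # 5.478
print(f"{population_variance:.3f}")             # 27.010
print(f"{math.sqrt(population_variance):.3f}")  # 5.197
```

Use the sample versions when the scores are a sample from a larger group, and the population versions when the list contains every value of interest.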

Range and Quartiles
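The range is the difference between the maximum and minimum values in the dataset. Quartiles are three cut points (Q1, Q2, and Q3) that divide the sorted data into four parts with an equal number of values; Q2 is the median. Here we compute Q1 and Q3 as the medians of the lower and upper halves of the sorted data.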

range_score = max(student_scores) - min(student_scores)
sorted_scores = sorted(student_scores)
n = len(sorted_scores)
# Q1 and Q3 are the medians of the lower and upper halves;
# (n + 1) // 2 skips the middle value when the count is odd
q1 = statistics.median(sorted_scores[:n // 2])
q2 = statistics.median(sorted_scores)
q3 = statistics.median(sorted_scores[(n + 1) // 2:])

print("Range:", range_score)  # Range: 16
print("Q1:", q1)  # Q1: 78
print("Q2 (Median):", q2)  # Q2 (Median): 85.0
print("Q3:", q3)  # Q3: 88

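There is more than one convention for computing quartiles, and they can give slightly different answers on small datasets. As a sketch, statistics.quantiles (available from Python 3.8) supports two interpolation methods; the variable names below are illustrative.

```python
import statistics

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

# method="exclusive" (the default) treats the data as a sample from a larger population
q_exclusive = statistics.quantiles(student_scores, n=4)
# method="inclusive" treats the data as the whole population
q_inclusive = statistics.quantiles(student_scores, n=4, method="inclusive")

print("Exclusive:", q_exclusive)  # Exclusive: [78.0, 85.0, 88.5]
print("Inclusive:", q_inclusive)  # Inclusive: [78.5, 85.0, 87.25]
```

All conventions agree on the median here, but Q1 and Q3 shift a little depending on how values between data points are interpolated, so it is worth stating which method was used when reporting quartiles.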

Interquartile Range (IQR)
The Interquartile Range (IQR) is the difference between the third and first quartiles (Q3 - Q1). It measures the spread of the middle 50% of the data.

iqr = q3 - q1
print("Interquartile Range (IQR):", iqr)  # Interquartile Range (IQR): 10

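Because the IQR only looks at the middle half of the data, it is much less sensitive to extreme values than the range. The sketch below illustrates this by adding one hypothetical very low score (20, invented purely for illustration) and comparing both measures; quartile_halves and spread are illustrative helper names.

```python
import statistics

def quartile_halves(values):
    """Q1 and Q3 as medians of the lower and upper halves (middle value skipped when the count is odd)."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[:n // 2]
    upper = ordered[(n + 1) // 2:]
    return statistics.median(lower), statistics.median(upper)

def spread(values):
    """Return (range, IQR) for a list of numbers."""
    q1, q3 = quartile_halves(values)
    return max(values) - min(values), q3 - q1

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]
with_outlier = student_scores + [20]  # hypothetical outlier, not part of the original data

print(spread(student_scores))  # (16, 10)
print(spread(with_outlier))    # (72, 10)
```

The single outlier more than quadruples the range, while the IQR stays at 10.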

Correlation Coefficient
The correlation coefficient measures the strength and direction of the linear relationship between two variables, on a scale from -1 to 1. We’ll calculate the correlation coefficient between hours studied and test scores.

def correlation_coefficient(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population covariance: average product of the deviations (note the division by n)
    covariance = sum((x[i] - mean_x) * (y[i] - mean_y) for i in range(n)) / n
    # Population standard deviations, using the same divisor n
    std_dev_x = (sum((xi - mean_x) ** 2 for xi in x) / n) ** 0.5
    std_dev_y = (sum((yi - mean_y) ** 2 for yi in y) / n) ** 0.5
    correlation = covariance / (std_dev_x * std_dev_y)
    return correlation

hours_studied = [4, 6, 3, 5, 7]
test_scores = [85, 90, 82, 88, 92]

correlation = correlation_coefficient(hours_studied, test_scores)
print("Correlation between hours studied and test scores:", correlation)
# Correlation between hours studied and test scores: approximately 0.9944

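A quick sanity check for any correlation routine is that the result stays between -1 and 1 and equals exactly 1 for a perfectly linear, increasing relationship. The sketch below uses an illustrative helper, pearson_r, and a made-up second variable (linear_in_hours) to show this; it also cross-checks against statistics.correlation, which is only available from Python 3.10.

```python
import math
import statistics

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n factors in the covariance and in both standard deviations cancel,
    # so plain sums are enough here
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(ss_x * ss_y)

hours_studied = [4, 6, 3, 5, 7]
test_scores = [85, 90, 82, 88, 92]

r = pearson_r(hours_studied, test_scores)
print(round(r, 4))  # 0.9944

# Hypothetical variable that is an exact linear function of hours studied
linear_in_hours = [2 * h + 1 for h in hours_studied]
print(round(pearson_r(hours_studied, linear_in_hours), 4))  # 1.0

# statistics.correlation was added in Python 3.10; skip the cross-check on older versions
if hasattr(statistics, "correlation"):
    print(math.isclose(r, statistics.correlation(hours_studied, test_scores)))  # True
```

A value of about 0.99 indicates a very strong positive linear relationship in this small sample; it does not by itself show that studying longer causes higher scores.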

Scatter Plot Visualization
Lastly, we’ll visualize the relationship between hours studied and test scores using a scatter plot.

plt.scatter(hours_studied, test_scores)
plt.xlabel('Hours Studied')
plt.ylabel('Test Scores')
plt.title('Hours Studied vs. Test Scores')
plt.show()


