Statistical Essentials for Data Analysts: A Beginner’s Guide

Published 10 months ago

Understanding Basic Statistical Terminology with Python

In this post, we’ll explore some fundamental statistical concepts and explain each one in detail. We’ll work with a small dataset of student exam scores, using Python’s statistics module for the calculations and the matplotlib library for visualization.

Let’s start by importing the necessary libraries and defining our dataset:

import matplotlib.pyplot as plt
import statistics

Student data:

# Student scores in an exam
student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

Mean, Median, and Mode
The mean is the arithmetic average of the dataset. The median is the middle value when the data are sorted in ascending order; with an even number of values, it is the average of the two middle values. The mode is the most frequent value in the dataset.

mean_score = statistics.mean(student_scores)
median_score = statistics.median(student_scores)
mode_score = statistics.mode(student_scores)

print("Mean:", mean_score)      # Mean: 83.7
print("Median:", median_score)  # Median: 85.0
print("Mode:", mode_score)      # Mode: 85
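
To make the definitions concrete, the sketch below recomputes the three measures by hand on the same scores. The helper names (manual_mean, manual_median, manual_modes) are illustrative assumptions and do not come from the original post.

```python
from collections import Counter

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

def manual_mean(values):
    # Arithmetic average: sum of the values divided by their count
    return sum(values) / len(values)

def manual_median(values):
    # Middle value of the sorted data; average the two middle values when the count is even
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def manual_modes(values):
    # Every value that shares the highest frequency (there can be more than one)
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

print("Mean:", manual_mean(student_scores))
print("Median:", manual_median(student_scores))
print("Modes:", manual_modes(student_scores))
```

One design point worth knowing: when several values tie for the highest count, statistics.mode (Python 3.8 and later) returns only the first one it encounters, while statistics.multimode returns all of them, as manual_modes does here.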

Standard Deviation and Variance
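Variance is the average of the squared differences from the mean. Standard deviation is the square root of the variance, so it measures the dispersion of data points around the mean in the same units as the data. Note that statistics.stdev and statistics.variance compute the sample versions, which divide by n - 1 rather than n.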

std_deviation = statistics.stdev(student_scores)
variance = statistics.variance(student_scores)

print("Standard Deviation:", std_deviation)  # Standard Deviation: 5.478 (rounded)
print("Variance:", variance)                 # Variance: 30.011 (rounded)
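
The difference between the sample and population formulas is easy to miss, so here is a minimal sketch that computes both by hand on the same scores. The variable names are illustrative assumptions; the only library functions referenced are ones that exist in the statistics module.

```python
import math

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

n = len(student_scores)
mean_score = sum(student_scores) / n

# Sum of squared differences from the mean
squared_diffs = sum((score - mean_score) ** 2 for score in student_scores)

sample_variance = squared_diffs / (n - 1)  # same formula as statistics.variance
population_variance = squared_diffs / n    # same formula as statistics.pvariance
sample_std = math.sqrt(sample_variance)    # same formula as statistics.stdev

print("Sample variance:", round(sample_variance, 3))
print("Population variance:", round(population_variance, 3))
print("Sample standard deviation:", round(sample_std, 3))
```

Use the sample versions when the scores are a sample from a larger group, and the population versions (statistics.pvariance, statistics.pstdev) when the data cover the whole group of interest.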

Range and Quartiles
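The range is the difference between the maximum and minimum values in the dataset. Quartiles are the three cut points (Q1, Q2, and Q3) that divide the sorted data into four parts of equal size. Here we compute Q1 and Q3 as the medians of the lower and upper halves of the sorted data.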

range_score = max(student_scores) - min(student_scores)
sorted_scores = sorted(student_scores)
n = len(sorted_scores)
# Median-of-halves method: when n is odd, the middle value is left out of both halves
q1 = statistics.median(sorted_scores[:n // 2])
q2 = statistics.median(sorted_scores)
q3 = statistics.median(sorted_scores[(n + 1) // 2:])

print("Range:", range_score)  # Range: 16
print("Q1:", q1)              # Q1: 78
print("Q2 (Median):", q2)     # Q2 (Median): 85.0
print("Q3:", q3)              # Q3: 88
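
There is more than one convention for computing quartiles. As a hedged comparison, the sketch below uses statistics.quantiles, which is available in Python 3.8 and later and interpolates between data points, so its Q1 and Q3 can differ slightly from the median-of-halves values. The variable names exclusive_q and inclusive_q are assumptions made for this example.

```python
import statistics

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

# n=4 requests quartiles; the result is the list of three cut points [Q1, Q2, Q3]
exclusive_q = statistics.quantiles(student_scores, n=4)  # default method="exclusive"
inclusive_q = statistics.quantiles(student_scores, n=4, method="inclusive")

print("Exclusive method:", exclusive_q)
print("Inclusive method:", inclusive_q)
```

None of these conventions is wrong; what matters is stating which one was used, since small datasets can show visible differences between them.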

Interquartile Range (IQR)
The interquartile range (IQR) is the difference between the third and first quartiles (Q3 - Q1). It measures the spread of the middle 50% of the data and is less sensitive to extreme values than the full range.

iqr = q3 - q1
print("Interquartile Range (IQR):", iqr)  # Interquartile Range (IQR): 10
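
One common use of the IQR, not covered in the original post, is flagging possible outliers: values more than 1.5 × IQR below Q1 or above Q3. The sketch below is a minimal illustration under that convention; the function name iqr_outliers, the parameter k, and the extra score of 40 are all hypothetical.

```python
import statistics

student_scores = [85, 78, 92, 88, 76, 80, 85, 90, 85, 78]

def iqr_outliers(values, k=1.5):
    # Quartiles computed as medians of the lower and upper halves of the sorted data
    ordered = sorted(values)
    n = len(ordered)
    q1 = statistics.median(ordered[:n // 2])
    q3 = statistics.median(ordered[(n + 1) // 2:])
    iqr = q3 - q1
    # Values further than k * IQR outside the quartiles are flagged
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

print(iqr_outliers(student_scores))         # no score falls outside the fences
print(iqr_outliers(student_scores + [40]))  # the hypothetical score of 40 is flagged
```

The multiplier 1.5 is a rule of thumb rather than a strict definition, so treat flagged values as candidates for a closer look, not as errors.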

Correlation Coefficient
The correlation coefficient (Pearson’s r) measures the strength and direction of the linear relationship between two variables. It always lies between -1 and 1. We’ll calculate the correlation coefficient between hours studied and test scores.

def correlation_coefficient(x, y):
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Population covariance: average product of the paired deviations from the means
    covariance = sum((x[i] - mean_x) * (y[i] - mean_y) for i in range(n)) / n
    # Population standard deviations (divide by n, matching the covariance)
    std_dev_x = (sum((xi - mean_x) ** 2 for xi in x) / n) ** 0.5
    std_dev_y = (sum((yi - mean_y) ** 2 for yi in y) / n) ** 0.5
    correlation = covariance / (std_dev_x * std_dev_y)
    return correlation

hours_studied = [4, 6, 3, 5, 7]
test_scores = [85, 90, 82, 88, 92]

correlation = correlation_coefficient(hours_studied, test_scores)
print("Correlation between hours studied and test scores:", correlation)
# Correlation between hours studied and test scores: 0.9944 (rounded)
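
Because a correlation coefficient must lie between -1 and 1, that range gives a quick sanity check on any implementation. The sketch below is an equivalent formulation that works directly with sums of deviations, so the 1/n factors cancel; the function name pearson_r is an assumption made for this example.

```python
import math

hours_studied = [4, 6, 3, 5, 7]
test_scores = [85, 90, 82, 88, 92]

def pearson_r(x, y):
    # Pearson's r from raw sums of deviations; dividing each sum by n would cancel out
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_yy = sum((yi - mean_y) ** 2 for yi in y)
    return s_xy / math.sqrt(s_xx * s_yy)

r = pearson_r(hours_studied, test_scores)
print("Pearson's r:", round(r, 4))

# Sanity check: a valid correlation coefficient is never outside [-1, 1]
assert -1.0 <= r <= 1.0
```

On Python 3.10 and later, statistics.correlation(hours_studied, test_scores) offers a built-in cross-check for the same quantity.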

Scatter Plot Visualization
Lastly, we’ll visualize the relationship between hours studied and test scores using a scatter plot.

plt.scatter(hours_studied, test_scores)
plt.xlabel('Hours Studied')
plt.ylabel('Test Scores')
plt.title('Hours Studied vs. Test Scores')
plt.show()

Original article: Statistical Essentials for Data Analysts: A Beginner’s Guide
