Data Science With Python: Where And How To Start (4 Part Series)
1 Reading and Manipulating Your Dataset With Pandas
2 Reading and Manipulating Your Dataset With Pandas (2)
3 How to prove that your cat is fat (with statistics and python)
4 Beginners’ journey to machine learning
You have a very, very fat and lazy cat named Bogla. He’s so lazy that he often falls asleep in his food bowl. You keep yelling at him not to eat so much tuna but he pays absolutely no attention to you, and has no clue how fat he is. So you decide to lecture him with statistical proofs. You weigh him and oh my God, he’s already 6 kg.
You look up online how to prove statistically that your cat is too fat. The advice is to collect more ‘data’, which means you need the weights of some other cats. So you start collecting the information by calling other cat parents. You write a small Python script that uses the ‘pandas’ library to turn the pairs of cat names and weights into a row-and-column structure:
# Import the pandas library
import pandas as pd
# Initialize a list of [name, weight in kg] pairs
weight_data = [['tom', 3.0], ['pumpkin', 3.51], ['bonk', 4.2], ['thunder', 5.5], ['oreo', 4.73], ['nya', 5], ['kitkat', 4.55], ['bubbles', 4.9], ['sparkle', 6.29], ['pebbles', 3.72]]
# Create the pandas DataFrame
df_weight = pd.DataFrame(weight_data, columns=['Name', 'Weight(kg)'])
# Display the DataFrame (a notebook shows the last expression automatically)
df_weight
Now, you decide to put the data into a type of graph called a ‘Histogram’. To make a histogram, you need to understand the concept of bins or buckets. A bin is like a group. For example, when you think of a person’s age, if the person is below 13 years, you call him a kid; if he is between 13 and 19, you call him a teen; and if he is more than 19 years old, you usually call him a grown-up. Here the age ranges of 0-12, 13-19 and 20+ can be considered as bins.
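The age example can be sketched with pandas, which has a `cut` function for sorting values into bins. The ages below are made up for illustration, and the upper limit of 120 is an arbitrary assumption standing in for "no upper limit".

```python
import pandas as pd

# Hypothetical ages, invented for this illustration only
ages = pd.Series([4, 12, 13, 19, 20, 35])

# Bins are (0, 12], (12, 19] and (19, 120]; each bin includes its right edge
age_groups = pd.cut(ages, bins=[0, 12, 19, 120], labels=["kid", "teen", "grown up"])

print(age_groups.tolist())
# ['kid', 'kid', 'teen', 'teen', 'grown up', 'grown up']
```

Every age lands in exactly one bin, which is the property a histogram relies on.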
To make a histogram of the weights of your friends’ cats, you decide to make 5 bins. The lightest cat is Tom, who is 3 kg, and the heaviest is Sparkle, who is 6.29 kg. So all the cats are ‘distributed’ within this 3 to 6.29 kg range. If you want to make 5 equal bins within this range of (6.29 - 3) or 3.29 kg, each bin will be (3.29 / 5) or 0.658 kg wide. Therefore, Tom, who is 3 kg, will be in the first bin. Pumpkin (3.51 kg) is also in the first bin, which spreads from 3 to 3.658 kg. The second bin runs from 3.658 to (3.658 + 0.658) or 4.316 kg, so Bonk, who is 4.20 kg, will be in the second bin. In the same way, you put all of the cats into bins and count how many cats each bin contains. Bins 1, 2, 3, 4 and 5 contain 2, 2, 3, 2 and 1 cats respectively. You can do the whole process with a small Python script.
import seaborn as sns
# Draw a histogram of the weights using 5 equal-width bins
sns.histplot(data=df_weight, x="Weight(kg)", bins=5)
# Alternative: bins='auto' lets the library decide how many bins you need
# sns.histplot(data=df_weight, x="Weight(kg)", bins='auto')
You see, most of the cats (in this case, 3 cats) have their weight in the 3rd bin which covers the range of 4.316 to (4.316+0.658) or 4.974 kg.
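The bin arithmetic can also be checked without drawing anything. This is a minimal sketch that uses NumPy's `histogram` function on the same ten weights; the variable names are chosen here for illustration.

```python
import numpy as np

# Cat weights in kg, copied from the dataset in this article
weights = np.array([3.0, 3.51, 4.2, 5.5, 4.73, 5, 4.55, 4.9, 6.29, 3.72])

# Five equal-width bins spanning the lightest to the heaviest cat
counts, edges = np.histogram(weights, bins=5)

print(np.round(edges, 3))  # bin boundaries: 3, 3.658, 4.316, 4.974, 5.632, 6.29
print(counts)              # cats per bin: 2, 2, 3, 2, 1
```

The boundaries and counts agree with the numbers worked out by hand.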
There are more cats out there, aren’t there? There might be cats weighing less than 3 kg, and so on. It would be great if you could estimate, from the data you have, how likely other cats are to be fat. Here, the weight of a cat is a “Random Variable”, and the set of all possible weights is the “Sample Space”. A “Distribution” describes all the possible values of the random variable (in this case, the weight of a cat) and how often each value is likely to come up (for example, a cat weighing 6 kg). Now let’s write a few lines of code to see the distribution of our dataset.
# `height` sets the plot height in inches (older seaborn versions called this argument `size`)
sns.FacetGrid(df_weight, height=6) \
    .map(sns.kdeplot, "Weight(kg)") \
    .add_legend()
From this graph, you can clearly see that the curve peaks between 4 and 5 kg, which means most of the cats in your neighbourhood are within this range.
Here the curve you see is (almost) a ‘Normal Distribution’. An actual normal distribution is symmetric about the mean, kind of like a bell. That means the mean sits at the center of the curve, and the right and left sides look like mirror images. Here is a picture of a normal distribution:
You might already know the meaning of mean, median and mode. Mean is the average value of all your data, median is the middle point of data that separates the higher half and lower half of your dataset and mode is the number that occurs the most. In a normal distribution, all three of them are at the same point, and that point is marked in our image with a dotted line.
Now the problem is, your distribution curve was not totally symmetrical. Curves sometimes lean to the left or right. We call this asymmetry ‘Skewness’. As a rule of thumb, skewness is positive or negative depending on where the mean sits relative to the median. If the mean is greater than the median (mean > median, that is, the mean lies to the right of the median in the graph), the skewness is positive. In a negatively skewed distribution, the mean is less than the median (mean < median, that is, the mean lies to the left of the median).
For our dataset:
import numpy as np
# Mean and median of the cat weights
mean = np.mean(df_weight['Weight(kg)'])
median = np.median(df_weight['Weight(kg)'])
print(mean, median)
Here the mean is 4.54 and the median is 4.64, so mean < median. By this rule of thumb, the data is slightly negatively skewed, though with only ten cats the gap is small.
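The mean-versus-median comparison is only a quick check. As a cross-check, the sketch below also computes the moment-based skewness coefficient (the third central moment divided by the variance raised to the power 1.5), which is the usual textbook definition. The variable names are chosen here for illustration.

```python
import numpy as np

# Cat weights in kg, copied from the dataset in this article
weights = np.array([3.0, 3.51, 4.2, 5.5, 4.73, 5, 4.55, 4.9, 6.29, 3.72])

mean = weights.mean()
median = np.median(weights)

# Moment-based skewness: mean cubed deviation over (mean squared deviation)^1.5
deviations = weights - mean
skewness = np.mean(deviations**3) / np.mean(deviations**2) ** 1.5

print(round(mean, 2), round(median, 2), round(skewness, 2))
# roughly 4.54 4.64 0.13
```

For this dataset the coefficient comes out slightly positive, because Sparkle at 6.29 kg stretches the right tail, even though the mean is below the median. When the two checks disagree like this, the skew is probably too weak to call from ten cats.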
Anyway, you now know from the graph that your cat is somewhat fat. How fat exactly is he? Here comes the concept of ‘Percentile’ to save your day. What is a percentile? Suppose there is a cat of 4.95 kg and he is fatter than 75% of the cats. Then the value 4.95 is the ‘75th percentile’. The 25th and 75th percentiles are also called the first and third quartiles, and the difference between them (75th percentile - 25th percentile) is called the interquartile range.
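Percentiles for the ten cats can be computed with NumPy. This sketch uses `np.percentile` with its default linear interpolation, so the 75th percentile comes out near 4.975 kg; the 4.95 kg figure corresponds to taking the midpoint between the two neighbouring cats instead.

```python
import numpy as np

# Cat weights in kg, copied from the dataset in this article
weights = np.array([3.0, 3.51, 4.2, 5.5, 4.73, 5, 4.55, 4.9, 6.29, 3.72])

# 25th and 75th percentiles (first and third quartiles)
q1, q3 = np.percentile(weights, [25, 75])

# Interquartile range: the spread of the middle half of the cats
iqr = q3 - q1

print(round(q1, 3), round(q3, 3), round(iqr, 3))
# roughly 3.84 4.975 1.135
```

With so few cats, the interpolation choice shifts the percentiles a little, but the overall picture is the same.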
Now let’s check how many cats have lower weight than your cat.
# Fraction of cats in the dataset that weigh less than your 6 kg cat
(df_weight["Weight(kg)"] < 6).mean()
Whoa, 0.9! That means your cat is fatter than 90% of the cats in your dataset.
You finally know how to yell at your cat in statistics, but your journey wasn’t smooth. In the beginning, while collecting data, you accidentally wrote Tom’s weight as ‘0.3’ instead of ‘3.0’, and all the graphs were messed up.
You didn’t know why there was some blank space in the histogram. You looked it up online and came across the concept of ‘Outliers’. An outlier is simply a data point that differs markedly from the rest. To check for outliers in your data, you used box plots. In a box plot, the box shows the middle half of the data (in this case, the most common cat weights), and the whiskers show the lower and upper values that are still acceptable. The unacceptable values are shown as dots.
sns.boxplot(y='Weight(kg)', data=df_weight)
From this you saw there was a value that didn’t belong in your dataset. You can find the value using the interquartile range (IQR) mentioned earlier. If the 25th percentile is Q1 and the 75th percentile is Q3, then IQR = Q3 - Q1, and anything higher than Q3 + 1.5 × IQR or lower than Q1 - 1.5 × IQR is an outlier.
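# 25th and 75th percentiles; `method` needs NumPy 1.22+ (older versions use `interpolation`)
Q1 = np.percentile(df_weight['Weight(kg)'], 25, method='midpoint')
Q3 = np.percentile(df_weight['Weight(kg)'], 75, method='midpoint')
IQR = Q3 - Q1
# Lower and upper limits of the 1.5 x IQR rule
low = Q1 - 1.5 * IQR
up = Q3 + 1.5 * IQR
# Collect every weight that falls outside the limits
outlier = []
for x in df_weight['Weight(kg)']:
    if x > up or x < low:
        outlier.append(x)
print('outlier in the dataset is', outlier)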
This is how you knew you made a mistake while writing Tom’s weight.
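To see the rule catch the typo, here is a small self-contained sketch that runs the same check on the weights with and without the mistake. The helper name `iqr_outliers` is made up for this example, and it uses NumPy's default linear interpolation rather than the midpoint setting, so the limits differ slightly while the conclusion stays the same.

```python
import numpy as np

def iqr_outliers(values):
    """Return the values outside the 1.5 x IQR limits (illustrative helper)."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])  # default linear interpolation
    iqr = q3 - q1
    low, up = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return values[(values < low) | (values > up)].tolist()

# Correct weights, and the same list with Tom mistyped as 0.3 instead of 3.0
correct = [3.0, 3.51, 4.2, 5.5, 4.73, 5, 4.55, 4.9, 6.29, 3.72]
with_typo = [0.3] + correct[1:]

print(iqr_outliers(correct))    # []
print(iqr_outliers(with_typo))  # [0.3]
```

Only the mistyped weight is flagged; Sparkle at 6.29 kg is heavy but still inside the limits.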
Now go yell at your cat.