The quality of a model's output depends heavily on the quality of its input data, so the input data needs further handling before it is used. One way to improve data quality is to apply data cleaning techniques.
What is Data Cleaning?
Data cleaning is a technique that aims to improve data quality by identifying and eliminating errors and inconsistencies in data.
Here I will share some common data cleaning methods and provide simple implementations showing how to apply them.
There are 4 commonly used methods in data cleaning:
- Scaling Feature Values
- Handling Extreme Outliers
- Binning
- Scrubbing
Scaling Feature Values
Feature scaling is the process of normalizing the range of features in a dataset. In real datasets, the ranges of feature values vary greatly. If one feature has a much wider range than the others, it will dominate the calculations of the algorithm used. Therefore, the ranges of all features should be normalized so that each feature contributes comparably. Several techniques can be used to perform feature scaling, including:
Absolute Maximum Scaling
Absolute Maximum Scaling is a scaling technique based on the absolute maximum value of each feature. The stages of this technique are:
- Determine the maximum absolute value of the feature in the data set.
- Divide all values in the column by the maximum value.
import matplotlib.pyplot as plt

def max_absolute_scaling(data):
    # Determine the maximum absolute value
    max_abs_value = max(map(abs, data))
    # Divide each value by the maximum absolute value
    scaled_data = []
    for x in data:
        scaled_data.append(x / max_abs_value)
    return scaled_data

data = [3, -1, 6, 2, -4]
scaled_data = max_absolute_scaling(data)
print("Data:", data)
print("Scaled Data:", scaled_data)

plt.plot(data, "red", label="Original Data")
plt.plot(scaled_data, "blue", label="Scaled Data")
plt.legend()
plt.show()
Output:
Data: [3, -1, 6, 2, -4]
Scaled Data: [0.5, -0.16666666666666666, 1.0, 0.3333333333333333, -0.6666666666666666]
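If you already use scikit-learn, its MaxAbsScaler performs the same computation on array-like input. A minimal sketch, assuming scikit-learn is installed and treating the list as a single feature column:

import numpy as np
from sklearn.preprocessing import MaxAbsScaler

data = [3, -1, 6, 2, -4]
# scikit-learn expects a 2D array: one column per feature
X = np.array(data).reshape(-1, 1)
scaled = MaxAbsScaler().fit_transform(X)
print(scaled.ravel().tolist())  # same values as the manual version above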
Min-Max Scaling
Min-Max Scaling is a scaling technique in which each value in the dataset is reduced by the minimum value and then divided by the range of the dataset (maximum minus minimum). After applying this technique, all feature values lie between 0 and 1. Like absolute maximum scaling, this technique is susceptible to outliers.
import matplotlib.pyplot as plt

def min_max_scaling(data):
    # Determine the maximum and minimum values
    max_value = max(data)
    min_value = min(data)
    # Subtract the minimum value from each value,
    # then divide by the range of the dataset
    scaled_data = []
    for x in data:
        scaled_data.append((x - min_value) / (max_value - min_value))
    return scaled_data

data = [3, -1, 6, 2, -4]
scaled_data = min_max_scaling(data)
print("Data:", data)
print("Scaled Data:", scaled_data)

plt.plot(data, "red", label="Original Data")
plt.plot(scaled_data, "blue", label="Scaled Data")
plt.legend()
plt.show()
Output:
Data: [3, -1, 6, 2, -4]
Scaled Data: [0.7, 0.3, 1.0, 0.6, 0.0]
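The same transformation is available as MinMaxScaler in scikit-learn; a minimal sketch, assuming scikit-learn is installed:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = [3, -1, 6, 2, -4]
X = np.array(data).reshape(-1, 1)  # one feature column
scaled = MinMaxScaler().fit_transform(X)
print(scaled.ravel().tolist())  # values between 0 and 1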
Normalization
Normalization (mean normalization) is a scaling technique similar to min-max scaling, but each feature value is reduced by the mean of the dataset instead of the minimum. The result of the subtraction is then divided by the range of the dataset values.
import matplotlib.pyplot as plt
from statistics import mean

def normalization(data):
    # Determine the maximum, minimum, and mean values
    max_value = max(data)
    min_value = min(data)
    mean_value = mean(data)
    # Subtract the mean from each value,
    # then divide by the range of the dataset
    scaled_data = []
    for x in data:
        scaled_data.append((x - mean_value) / (max_value - min_value))
    return scaled_data

data = [3, -1, 6, 2, -4]
scaled_data = normalization(data)
print("Data:", data)
print("Scaled Data:", scaled_data)

plt.plot(data, "red", label="Original Data")
plt.plot(scaled_data, "blue", label="Scaled Data")
plt.legend()
plt.show()
Output:
Data: [3, -1, 6, 2, -4]
Scaled Data: [0.18, -0.22000000000000003, 0.48, 0.08, -0.52]
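scikit-learn has no dedicated mean-normalization scaler, but the same formula can be applied in a vectorized way with pandas; a minimal sketch, assuming pandas is available:

import pandas as pd

s = pd.Series([3, -1, 6, 2, -4])
# (x - mean) / (max - min), applied to the whole Series at once
scaled = (s - s.mean()) / (s.max() - s.min())
print(scaled.tolist())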
Standardization (Z-score Normalization)
Standardization is a scaling technique in which the mean is subtracted from each feature value and the result is divided by the standard deviation; the resulting value is usually called the z-score. After this technique is applied, the feature is centered around a mean of 0 with a standard deviation of 1. This technique is suitable when the feature is approximately normally distributed, such as salary or age.
import matplotlib.pyplot as plt
from statistics import mean, stdev

def standardization(data):
    # Determine the mean and standard deviation
    mean_value = mean(data)
    stdev_value = stdev(data)
    # Subtract the mean from each value,
    # then divide by the standard deviation
    scaled_data = []
    for x in data:
        scaled_data.append((x - mean_value) / stdev_value)
    return scaled_data

data = [3, -1, 6, 2, -4]
scaled_data = standardization(data)
print("Data:", data)
print("Scaled Data:", scaled_data)

plt.plot(data, "red", label="Original Data")
plt.plot(scaled_data, "blue", label="Scaled Data")
plt.legend()
plt.show()
Output:
Data: [3, -1, 6, 2, -4]
Scaled Data: [0.469476477861571, -0.5738045840530313, 1.2519372742975226, 0.20865621238292043, -1.3562653804889828]
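In scikit-learn this corresponds to StandardScaler. Note that StandardScaler uses the population standard deviation (ddof=0), while statistics.stdev above uses the sample standard deviation (ddof=1), so the numbers will differ slightly. A minimal sketch, assuming scikit-learn is installed:

import numpy as np
from sklearn.preprocessing import StandardScaler

data = [3, -1, 6, 2, -4]
X = np.array(data).reshape(-1, 1)  # one feature column
scaled = StandardScaler().fit_transform(X)
print(scaled.ravel().tolist())  # z-scores based on the population std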
Robust Scaling
In the Robust Scaling technique, the median is subtracted from each data point and the result is divided by the Inter Quartile Range (IQR). The IQR is the difference between the upper quartile (Q3) and the lower quartile (Q1).
import matplotlib.pyplot as plt
import pandas as pd

def robust_scaling(data):
    series = pd.Series(data)
    # Determine the median and IQR
    q1, median, q3 = series.quantile([0.25, 0.5, 0.75])
    IQR = q3 - q1
    # Subtract the median from each value,
    # then divide by the IQR
    scaled_data = []
    for x in data:
        scaled_data.append((x - median) / IQR)
    return scaled_data

data = [3, -1, 6, 2, -4]
scaled_data = robust_scaling(data)
print("Data:", data)
print("Scaled Data:", scaled_data)

plt.plot(data, "red", label="Original Data")
plt.plot(scaled_data, "blue", label="Scaled Data")
plt.legend()
plt.show()
Output:
Data: [3, -1, 6, 2, -4]
Scaled Data: [0.25, -0.75, 1.0, 0.0, -1.5]
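The equivalent scikit-learn transformer is RobustScaler, which by default centers on the median and scales by the 25th-75th percentile range; a minimal sketch, assuming scikit-learn is installed:

import numpy as np
from sklearn.preprocessing import RobustScaler

data = [3, -1, 6, 2, -4]
X = np.array(data).reshape(-1, 1)  # one feature column
scaled = RobustScaler().fit_transform(X)
print(scaled.ravel().tolist())  # (x - median) / IQR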
Scaling to Vector Unit Length
Scaling to Vector Unit Length is a scaling technique that transforms the components of a feature vector so that the transformed vector has a length of 1. In this technique, each feature value is divided by the length (Euclidean norm) of the vector.
This technique can only be applied if ||X|| > 0.
import matplotlib.pyplot as plt
import numpy as np

def vector_normalization(data):
    vector = np.array(data)
    # Determine the length (Euclidean norm) of the vector
    magnitude = np.linalg.norm(vector)
    # Normalize the vector to unit length
    scaled_data = vector / magnitude
    return scaled_data

data = [3, -1, 6, 2, -4]
scaled_data = vector_normalization(data)
print("Data:", data)
print("Scaled Data:", scaled_data.tolist())

plt.plot(data, "red", label="Original Data")
plt.plot(scaled_data, "blue", label="Scaled Data")
plt.legend()
plt.show()
Output:
Data: [3, -1, 6, 2, -4]
Scaled Data: [0.3692744729379982, -0.12309149097933272, 0.7385489458759964, 0.24618298195866545, -0.4923659639173309]
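scikit-learn exposes this as sklearn.preprocessing.normalize (or the Normalizer transformer), which scales each row of a 2D array to unit norm; a minimal sketch, assuming scikit-learn is installed and treating the list as a single sample:

from sklearn.preprocessing import normalize

data = [3, -1, 6, 2, -4]
# normalize works row-wise, so pass the list as one row (one sample)
scaled = normalize([data], norm="l2")
print(scaled[0].tolist())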
Handling Extreme Outliers
Outliers are values that differ greatly from the majority of the data in a dataset. Outliers may represent natural variation in the population, but in many cases they are caused by errors in the data collection process, such as incorrect data entry, equipment failure, or other measurement errors. If outliers are not handled, they can distort the results of statistical analysis and reduce the accuracy of the model being developed.
Outlier Detection
Outlier Detection with Sorting Methods
The sorting method is the simplest way to detect outliers. Quantitative data can be sorted from low to high, and values that are suspiciously low or high can be spotted manually. In Python, sorting can be done with the sorted function.
data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
sorted_data = sorted(data)
print("Data:", data)
print("Sorted Data:", sorted_data)
Output:
Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
Sorted Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 26, 27, 100]
The value 100 is very high compared to the rest of the data, so it can be flagged as an outlier. However, this method is less reliable because the outlier is not determined using any statistical calculation.
Outlier Detection using the Histogram Method
Histograms can be used to visualize data and check whether a set of data contains outlier values. Data that lies far outside the main body of the distribution is flagged as an outlier. The drawback of this method is the same as that of the sorting method: outliers are identified only by visual inspection of the data.
import matplotlib.pyplot as plt

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
plt.hist(data)
plt.show()
Output:
In the histogram above, most of the values are concentrated in the range from roughly 0 to 40. However, one value, 100, is separated from the rest of the data and can be categorized as an outlier.
Outlier Detection with Box-Plot
A box plot is a graphical summary of a sample distribution that describes the shape of the distribution (skewness), its central tendency, and its spread. Five statistical measures can be read from a box plot: the minimum, maximum, Q1, median, and Q3. Values that fall outside the box and whiskers can be categorized as outliers.
import matplotlib.pyplot as plt

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
plt.boxplot(data, vert=False)
plt.title("Detecting outliers using Boxplot")
plt.xlabel('Data')
plt.show()
Output:
The value 100, which lies outside the whisker, can be categorized as an outlier.
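If you want the flagged points programmatically rather than reading them off the chart, matplotlib.cbook.boxplot_stats computes the same whisker statistics that the box plot draws; a minimal sketch, assuming a recent matplotlib version:

from matplotlib import cbook

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
# boxplot_stats returns one dict per dataset; 'fliers' holds the points
# that fall outside the whiskers (by default 1.5 * IQR beyond Q1/Q3)
stats = cbook.boxplot_stats(data)[0]
print("Outliers:", list(stats["fliers"]))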
Outlier Detection with Z-Score
The criterion for detecting outliers with the z-score is that any data point whose z-score lies more than 3 standard deviations from the mean is an outlier. The stages of this technique are:
- For all data points, calculate the z-score value using the formula (Xi-mean)/std.
- Initialize the threshold value=3 and mark data points that have an absolute z-score value greater than the threshold as outliers.
import statistics as s

def detect_outliers_zscore(data):
    # Threshold initialization
    thres = 3
    # Determine the mean and standard deviation
    mean = s.mean(data)
    std = s.stdev(data)
    outliers = []
    # Compute the z-score of each value
    for i in data:
        z_score = (i - mean) / std
        # Check whether the value is an outlier
        if abs(z_score) > thres:
            outliers.append(i)
    return outliers

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
outliers = detect_outliers_zscore(data)
print("Outliers:", outliers)
Output:
Outliers: [100]
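The z-scores can also be computed in a single call with scipy.stats.zscore; a minimal sketch, assuming SciPy is installed (ddof=1 matches the sample standard deviation used above):

from scipy import stats

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
# z-score of every point; ddof=1 uses the sample standard deviation
z = stats.zscore(data, ddof=1)
outliers = [x for x, zi in zip(data, z) if abs(zi) > 3]
print("Outliers:", outliers)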
Outlier Detection with Inter Quartile Range (IQR)
Based on the Inter Quartile Range (IQR), a data point is detected as an outlier if it lies more than 1.5 times the IQR above Q3 or below Q1. The stages in detecting outliers with the IQR are:
- Sort the data in ascending order
- Calculate 1st and 3rd quartiles (Q1, Q3)
- Calculate the value of IQR=Q3-Q1
- Calculate the lower limit = (Q1 - 1.5*IQR) and the upper limit = (Q3 + 1.5*IQR)
- For every value in the dataset, check whether it lies below the lower limit or above the upper limit; if so, mark it as an outlier.
import pandas as pd

def detect_outliers_iqr(data):
    series = pd.Series(data)
    # Determine Q1, Q3, and the IQR
    Q1, Q3 = series.quantile([0.25, 0.75])
    IQR = Q3 - Q1
    # Determine the lower and upper bounds
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = []
    # Check each value against the bounds
    for i in data:
        if i < lower_bound or i > upper_bound:
            outliers.append(i)
    return outliers

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
outliers = detect_outliers_iqr(data)
print("Outliers:", outliers)
Output:
Outliers: [100]
Outlier Handling
After outliers have been detected in the dataset, the next stage is to handle them. There are several ways to handle the detected outliers.
Trimming
In this technique, the detected outliers are simply removed from the dataset. However, this method is generally not considered best practice because it discards information.
def trimming(data, outlier):
    # Keep only the values that are not in the outlier list
    new_data = [x for x in data if x not in outlier]
    return new_data

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
outlier = [100]
new_data = trimming(data, outlier)
print("Data:", data)
print("New Data:", new_data)
Output:
Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
New Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 26, 27]
Quantile-Based Flooring and Capping
In this technique, outliers are handled by capping values above the 90th percentile at the 90th-percentile value and raising values below the 10th percentile to the 10th-percentile value.
import pandas as pd

def handle_quantile_outlier(data):
    series = pd.Series(data)
    # Determine the 10th and 90th percentiles
    P10 = series.quantile(0.1)
    P90 = series.quantile(0.9)

    new_data = []
    # Replace values below P10 with P10 and values above P90 with P90
    for x in data:
        if x < P10:
            x = P10
        elif x > P90:
            x = P90
        new_data.append(x)
    return new_data

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
new_data = handle_quantile_outlier(data)
print("Data:", data)
print("New Data:", new_data)
Output:
Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
New Data: [17.1, 17.1, 18, 19, 20, 21, 22, 23, 24, 26.9, 26, 26.9]
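For pandas users, the same flooring and capping can be written in a single call with Series.clip; this is a minimal sketch, assuming the same sample data as above.

import pandas as pd

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
series = pd.Series(data)

# Clip every value into the [P10, P90] range in one call
capped = series.clip(lower=series.quantile(0.1), upper=series.quantile(0.9))
print(capped.tolist())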
Mean/Median Imputation
Because the mean is strongly influenced by outliers, it is usually better to replace the detected outliers with the median.
import statistics as s

def handle_median_outlier(data, outlier):
    # Determine the median of the data
    median = s.median(data)

    # Replace each outlier value with the median
    new_data = []
    for x in data:
        if x in outlier:
            new_data.append(median)
        else:
            new_data.append(x)
    return new_data

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
outlier = [100]
new_data = handle_median_outlier(data, outlier)
print("Data:", data)
print("New Data:", new_data)
Output:
Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
New Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 21.5, 26, 27]
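If the data already lives in a pandas Series, the same median replacement can be expressed with mask; this is a minimal sketch, assuming the same sample data and outlier list as above.

import pandas as pd

s = pd.Series([16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27])
outlier = [100]

# Wherever the value is flagged as an outlier, substitute the median
s_clean = s.mask(s.isin(outlier), s.median())
print(s_clean.tolist())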
Log Transformation
Log transformation is a common technique for reducing skew in a distribution and making it more symmetric. It compresses extreme values, so the data becomes closer to normally distributed. Note that the logarithm is only defined for strictly positive values.
import math
import matplotlib.pyplot as plt

def handle_log_outlier(data):
    new_data = []
    # Transform each value with a base-10 logarithm
    for x in data:
        new_data.append(math.log(x, 10))
    return new_data

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
new_data = handle_log_outlier(data)
print("Data:", data)
print("New Data:", new_data)

plt.plot(data, color='red', label='Data')
plt.plot(new_data, color='blue', label='New Data')
plt.legend()
plt.show()
Output:
Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
New Data: [1.2041199826559246, 1.2304489213782739, 1.2552725051033058, 1.2787536009528289, 1.301029995663981, 1.322219294733919, 1.3424226808222062, 1.3617278360175928, 1.380211241711606, 2.0, 1.414973347970818, 1.4313637641589871]
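Because the logarithm is undefined at zero, a common variant when the data may contain zeros is numpy's log1p, which computes log(1 + x). This is a minimal sketch using hypothetical values that include a zero.

import numpy as np

data = [0, 16, 17, 18, 100]   # hypothetical data containing a zero
new_data = np.log1p(data)     # natural log of (1 + x), defined at x = 0
print(new_data.tolist())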
Binning
Data binning is a method of grouping continuous numerical values into discrete intervals called "bins" or "groups". Binning can simplify the data distribution and assist in statistical analysis and visualization. Several techniques are commonly used to bin data, including:
Equal Width Binning
This technique groups data into a predetermined number of bins of equal width. Although simple, it does not work well for data with a skewed distribution, since most values can end up in only a few bins.
import pandas as pd

data = {'age': [16, 17, 18, 19, 20, 21, 22, 23, 24, 20, 26, 27]}
df = pd.DataFrame(data)

num_bins = 4

# Calculate the bin width
bin_width = (df['age'].max() - df['age'].min()) / num_bins

# Create the bin edges
bin_edges = [df['age'].min() + i * bin_width for i in range(num_bins + 1)]

df['age_bins'] = pd.cut(df['age'], bins=bin_edges, include_lowest=True, right=True)
print(df)
Output:
    age         age_bins
0    16  (15.999, 18.75]
1    17  (15.999, 18.75]
2    18  (15.999, 18.75]
3    19    (18.75, 21.5]
4    20    (18.75, 21.5]
5    21    (18.75, 21.5]
6    22    (21.5, 24.25]
7    23    (21.5, 24.25]
8    24    (21.5, 24.25]
9    20    (18.75, 21.5]
10   26    (24.25, 27.0]
11   27    (24.25, 27.0]
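As a side note, pd.cut can also compute equal-width edges by itself when it is given an integer number of bins; this is a minimal sketch reusing the df and num_bins defined above (the edges pandas picks may differ very slightly at the lower boundary).

# Let pandas derive the equal-width edges from the data range
df['age_bins'] = pd.cut(df['age'], bins=num_bins)
print(df)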
Equal Frequency Binning
In this technique, data is grouped into bins so that each bin contains approximately the same number of data points. It is useful when an even frequency across bins matters, and it copes well with outliers and skewed data.
import pandas as pd

data = {'age': [16, 17, 18, 19, 20, 21, 22, 23, 24, 20, 26, 27]}
df = pd.DataFrame(data)

# Split the data into three bins with roughly equal counts
df['age_bins'] = pd.qcut(df['age'], q=3)
print(df)
Output:
    age          age_bins
0    16  (15.999, 19.667]
1    17  (15.999, 19.667]
2    18  (15.999, 19.667]
3    19  (15.999, 19.667]
4    20  (19.667, 22.333]
5    21  (19.667, 22.333]
6    22  (19.667, 22.333]
7    23    (22.333, 27.0]
8    24    (22.333, 27.0]
9    20  (19.667, 22.333]
10   26    (22.333, 27.0]
11   27    (22.333, 27.0]
Quantile Binning
In this technique, data is grouped using percentile values as the bin boundaries (e.g. the 25th, 50th, and 75th percentiles).
import pandas as pd
import numpy as np

data = {'age': [16, 17, 18, 19, 20, 21, 22, 23, 24, 20, 26, 27]}
df = pd.DataFrame(data)

# Define the percentiles used as bin boundaries (quartiles in this case)
percentiles = [0, 25, 50, 75, 100]

# Calculate the bin edges from those percentiles
bin_edges = np.percentile(df['age'], percentiles)

df['age_bins'] = pd.cut(df['age'], bins=bin_edges, include_lowest=True)
print(df)
Output:
    age         age_bins
0    16  (15.999, 18.75]
1    17  (15.999, 18.75]
2    18  (15.999, 18.75]
3    19    (18.75, 20.5]
4    20    (18.75, 20.5]
5    21    (20.5, 23.25]
6    22    (20.5, 23.25]
7    23    (20.5, 23.25]
8    24    (23.25, 27.0]
9    20    (18.75, 20.5]
10   26    (23.25, 27.0]
11   27    (23.25, 27.0]
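The same quartile binning can also be written with pd.qcut by passing the quantile levels directly instead of precomputed edges; this is a minimal sketch reusing the df above, which should produce the same bins.

# Quartile binning via qcut, using quantile levels rather than explicit edges
df['age_bins'] = pd.qcut(df['age'], q=[0, 0.25, 0.5, 0.75, 1.0])
print(df)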
Scrubbing
Data scrubbing is the process of correcting or deleting incomplete, incorrect, inaccurate, or duplicated data in a dataset. Carrying out this process helps improve data consistency, accuracy, and reliability.
Deleting Duplicate Data
Deleting duplicate records is one way to perform data scrubbing. Duplicates often appear when the dataset is assembled from several different sources.
import pandas as pd

data = {'age': [15, 17, 23, 22, 17],
        'height': [155, 162, 165, 170, 162]}
df = pd.DataFrame(data)
print("Data:")
print(df)

# Check for duplicated rows
duplicate_data = df[df.duplicated()]
print("Duplicate Data:")
print(duplicate_data)

# Delete the duplicate rows
df = df.drop_duplicates()
print("New Data:")
print(df)

Output:

Data:
   age  height
0   15     155
1   17     162
2   23     165
3   22     170
4   17     162
Duplicate Data:
   age  height
4   17     162
New Data:
   age  height
0   15     155
1   17     162
2   23     165
3   22     170
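By default drop_duplicates compares all columns and keeps the first occurrence; the subset and keep parameters change that behaviour. A minimal, self-contained sketch using the same sample data:

import pandas as pd

df = pd.DataFrame({'age': [15, 17, 23, 22, 17],
                   'height': [155, 162, 165, 170, 162]})

# Compare only the 'age' column and keep the last occurrence of each duplicate
print(df.drop_duplicates(subset=['age'], keep='last'))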
Handling Missing Data
In real cases, datasets usually contain a lot of missing data. The causes vary widely, ranging from data corruption to device failure while recording measurements.
Deleting Missing Data
Missing data can be handled by deleting the rows or columns that contain NULL values. The example below drops rows; a column-wise sketch follows the output.
import pandas as pd
import numpy as np

data = {'age': [15, 17, 23, np.nan, 17],
        'height': [155, 162, np.nan, 170, 162]}
df = pd.DataFrame(data)
print("Data:")
print(df)

# Delete the rows that contain missing data
df.dropna(axis=0, inplace=True)
print("New Data:")
print(df)
Output:
Data:
    age  height
0  15.0   155.0
1  17.0   162.0
2  23.0     NaN
3   NaN   170.0
4  17.0   162.0
New Data:
    age  height
0  15.0   155.0
1  17.0   162.0
4  17.0   162.0
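Columns can be removed in the same way by passing axis=1; this is a minimal sketch on a hypothetical frame where only one column has a missing value (in the example above both columns contain a NaN, so both would be dropped).

import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [15, 17, np.nan],      # hypothetical data
                   'height': [155, 162, 170]})

# axis=1 drops every column that contains at least one missing value
print(df.dropna(axis=1))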
Pros:
• Training only on rows with no missing values can produce a more robust model.
Cons:
• A lot of information is lost.
• Works poorly if the percentage of missing values is large relative to the dataset.
Filling in Missing Data with Mean/Median/Mode Values
Missing values in a numeric column can be replaced with the mean, median, or mode of the other values in that column. Unlike the previous method, this technique does not discard any rows.
import pandas as pd
import numpy as np

data = {'age': [15, 17, 23, np.nan, 17],
        'height': [155, 162, 165, 170, 162]}
df = pd.DataFrame(data)
print("Data:")
print(df)

df_filled_mean = df.copy()
df_filled_median = df.copy()
df_filled_mode = df.copy()

# Determine the mean/median/mode of the column
mean = df['age'].mean()
median = df['age'].median()
mode = df['age'].mode().values[0]

# Fill the missing values with the mean/median/mode
df_filled_mean['age'] = df_filled_mean['age'].fillna(mean)
df_filled_median['age'] = df_filled_median['age'].fillna(median)
df_filled_mode['age'] = df_filled_mode['age'].fillna(mode)

print("New Data:")
print("Filled mean:")
print(df_filled_mean)
print("Filled median:")
print(df_filled_median)
print("Filled mode:")
print(df_filled_mode)

Output:

Data:
    age  height
0  15.0     155
1  17.0     162
2  23.0     165
3   NaN     170
4  17.0     162
New Data:
Filled mean:
    age  height
0  15.0     155
1  17.0     162
2  23.0     165
3  18.0     170
4  17.0     162
Filled median:
    age  height
0  15.0     155
1  17.0     162
2  23.0     165
3  17.0     170
4  17.0     162
Filled mode:
    age  height
0  15.0     155
1  17.0     162
2  23.0     165
3  17.0     170
4  17.0     162
Pros:
• Prevents the data loss that comes with deleting rows or columns.
• Works well with small data sets and is easy to implement.
Cons:
• Only works with numeric continuous variables.
• May cause data leakage if the fill statistics are computed before splitting off the test data.
Filling in Missing Data in Categorical Columns
When missing values occur in a categorical column (whether its categories are strings or numbers), they can be filled in with the most frequent category. If a large share of the values is missing, it is often better to replace them with a new category instead (a sketch of that option follows the output below).
import pandas as pd
import numpy as np

data = {'age': [15, 17, 23, 20, 17],
        'impression': ['good', 'fair', 'fair', 'very good', np.nan]}
df = pd.DataFrame(data)
print("Data:")
print(df)

# The most frequent category in the column
most_category = df['impression'].mode().values[0]

# Fill the missing values with that category
df['impression'] = df['impression'].fillna(most_category)
print("New Data:")
print(df)
Output:
Data:
   age impression
0   15       good
1   17       fair
2   23       fair
3   20  very good
4   17        NaN
New Data:
   age impression
0   15       good
1   17       fair
2   23       fair
3   20  very good
4   17       fair
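When a large share of the values is missing, the text above suggests introducing a new category instead of using the mode; this is a minimal sketch of that option, reusing the same data and a hypothetical 'unknown' label.

import pandas as pd
import numpy as np

df = pd.DataFrame({'age': [15, 17, 23, 20, 17],
                   'impression': ['good', 'fair', 'fair', 'very good', np.nan]})

# Treat missing impressions as their own category
df['impression'] = df['impression'].fillna('unknown')
print(df)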
Pros:
• Prevents the data loss that comes with deleting rows or columns.
• Works well with small data sets and is easy to implement.
• Avoids losing information by adding a unique category for missing values.
Cons:
• Only works with categorical variables.
• The extra categories become extra encoded features, which may hurt model performance.
Data Type Conversion
Most machine learning models cannot work with categorical data directly, so categorical data needs to be converted into numerical form. One technique for this is one-hot encoding, which represents each categorical value as a binary vector.
import numpy as np

# Categorical data to be converted
colors = ["red", "green", "yellow", "red", "blue"]

# List of all possible colors
total_colors = ["red", "green", "blue", "black", "yellow"]

# Map each color to a numeric index
mapping = {}
for x in range(len(total_colors)):
    mapping[total_colors[x]] = x

one_hot_encode = []

# Convert each value into a binary vector
for c in colors:
    arr = list(np.zeros(len(total_colors), dtype=int))
    arr[mapping[c]] = 1
    one_hot_encode.append(arr)

print(one_hot_encode)
Output:
[[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1], [1, 0, 0, 0, 0], [0, 0, 1, 0, 0]]
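If you already work with pandas, the same encoding can be produced with its built-in get_dummies function. The snippet below is a minimal sketch of that approach using the same example data as above; note that the columns come out in alphabetical order and only categories that actually appear in the data get a column, so the result differs slightly from the manual version.

import pandas as pd

colors = ["red", "green", "yellow", "red", "blue"]

# pandas creates one 0/1 column per category found in the data
one_hot_df = pd.get_dummies(pd.Series(colors), dtype=int)
print(one_hot_df)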
Deleting Irrelevant Data
Data is considered irrelevant when it does not relate to the problem being studied. Such attributes can simply be dropped from the dataset:
import pandas as pd

data = {'age': [15, 17, 23],
        'email': ['ahmad@gmail.com', 'putra@yahoo.com', 'tegar@gmail.com']}
df = pd.DataFrame(data)
print("Data:")
print(df)

# Remove the irrelevant attribute
df = df.drop('email', axis=1)
print("New Data:")
print(df)

Output:

Data:
   age            email
0   15  ahmad@gmail.com
1   17  putra@yahoo.com
2   23  tegar@gmail.com
New Data:
   age
0   15
1   17
2   23
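When several attributes are irrelevant, pandas can drop them in a single call using the columns argument. This is a small sketch of that idea; the phone column is a hypothetical extra attribute added only for illustration.

import pandas as pd

# 'email' and the hypothetical 'phone' column are assumed irrelevant here
data = {'age': [15, 17, 23],
        'email': ['ahmad@gmail.com', 'putra@yahoo.com', 'tegar@gmail.com'],
        'phone': ['0811', '0812', '0813']}
df = pd.DataFrame(data)

# Drop several irrelevant attributes at once
df = df.drop(columns=['email', 'phone'])
print(df)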
Avoiding Structural Errors
Structural errors include typos, inconsistent naming conventions, inconsistent capitalization, and so on. The following example fixes inconsistent capitalization in a categorical feature:
import pandas as pd

data = {'age': [15, 17, 23, 20, 17],
        'impression': ['good', 'Fair', 'fair', 'Very good', 'Good']}
df = pd.DataFrame(data)
print("Data:")
print(df)

# Normalize capitalization to lowercase
df['impression'] = df['impression'].str.lower()
print("New Data:")
print(df)
Output:
Data:
   age impression
0   15       good
1   17       Fair
2   23       fair
3   20  Very good
4   17       Good
New Data:
   age impression
0   15       good
1   17       fair
2   23       fair
3   20  very good
4   17       good
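Capitalization is only one kind of structural error; stray whitespace and common misspellings can be cleaned up in a similar way with pandas string methods. Below is a minimal sketch, where the example values and the typo mapping ('god' to 'good') are assumptions for illustration, not part of the dataset above.

import pandas as pd

# The trailing space and the 'god' misspelling are assumed typos for illustration
data = {'impression': ['good ', 'Fair', ' god', 'very good']}
df = pd.DataFrame(data)

# Trim whitespace, normalize case, then fix known misspellings
df['impression'] = df['impression'].str.strip().str.lower()
df['impression'] = df['impression'].replace({'god': 'good'})
print(df)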
Closing
In conclusion, applying effective data cleaning methods not only improves the reliability of your analyses but also supports informed decision-making, ensuring that your data-driven work rests on a solid foundation of accuracy and integrity.