Laid-Back Approaches to Clean Data

Because the output of a model depends heavily on the data it is given, the input data needs further handling before it is used. One way to improve data quality is to apply data cleaning techniques.

What is Data Cleaning?

Data cleaning is a technique that aims to improve data quality by identifying and eliminating errors and inconsistencies in data.

Here I will share some data cleaning methods and provide a simple implementation of each.

There are 4 commonly used methods in data cleaning:

  1. Scaling Feature Values
  2. Handling Extreme Outliers
  3. Binning
  4. Scrubbing

Scaling Feature Values

Feature scaling is the process of normalizing the range of the features in a dataset. In real cases, the ranges of feature values vary greatly. If one feature has a much wider range than the others, that feature will dominate the calculations of the algorithm used. Therefore, the ranges of all features should be normalized so that each feature makes a comparable contribution. There are several techniques that can be used to perform feature scaling, including:
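To make the "wide range dominates" point concrete, here is a small sketch using Euclidean distance. The customers, their values, and the helper names (`euclidean`, `rescale`) are all made up for illustration, and the rescaling used is min-max scaling, which is only one of several possible choices.

```python
import math

# Hypothetical customers described by (age in years, yearly income).
a = (25, 50_000)
b = (60, 50_500)   # very different age, similar income
c = (26, 51_000)   # almost the same age, slightly higher income

def euclidean(p, q):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

# On raw values the income column decides everything:
# a looks closer to b than to c, although b is 35 years older.
raw_ab = euclidean(a, b)
raw_ac = euclidean(a, c)

def rescale(points):
    """Min-max scale every column to [0, 1] (assumes no column is constant)."""
    columns = list(zip(*points))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(p, lows, highs))
        for p in points
    ]

# After rescaling, age and income contribute on the same scale,
# and a ends up closer to c than to b.
sa, sb, sc = rescale([a, b, c])
scaled_ab = euclidean(sa, sb)
scaled_ac = euclidean(sa, sc)

print("raw distances:   ", raw_ab, raw_ac)
print("scaled distances:", scaled_ab, scaled_ac)
```

Any distance-based calculation would show the same effect; Euclidean distance is just the simplest one to read.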

Absolute Maximum Scaling

Absolute Maximum Scaling is a scaling technique that is carried out based on the absolute maximum value of each feature. The stages of this technique are:

  1. Determine the maximum absolute value of the feature in the data set.
  2. Divide all values in the column by that maximum absolute value.

    import matplotlib.pyplot as plt

    def max_absolute_scaling(data):
        # Determine the maximum absolute value
        max_abs_value = max(map(abs, data))

        # Divide each value by the maximum absolute value
        scaled_data = []
        for x in data:
            scaled_data.append(x/max_abs_value)

        return scaled_data

    data = [3, -1, 6, 2, -4]
    scaled_data = max_absolute_scaling(data)
    print("Data:", data)
    print("Scaled Data:", scaled_data)

    plt.plot(data, "red", label="Original Data")
    plt.plot(scaled_data, "blue", label="Scaled Data")
    plt.legend()
    plt.show()

Output:

Data: [3, -1, 6, 2, -4]
Scaled Data: [0.5, -0.16666666666666666, 1.0, 0.3333333333333333, -0.6666666666666666]
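The steps above talk about a column, while the example scales a single list. As a rough sketch of how the same idea could be applied to a small table, the function below scales each column by its own maximum absolute value. The name `max_absolute_scaling_columns` and the sample rows are my own, and the sketch assumes no column consists only of zeros.

```python
def max_absolute_scaling_columns(rows):
    """Scale every column of a list of rows by that column's max absolute value.

    Assumes each column contains at least one non-zero value.
    """
    columns = list(zip(*rows))
    max_abs = [max(abs(v) for v in col) for col in columns]
    return [[v / m for v, m in zip(row, max_abs)] for row in rows]

# Hypothetical table with two features on very different scales.
rows = [
    [3, 200],
    [-1, -800],
    [6, 400],
]
scaled_rows = max_absolute_scaling_columns(rows)

for original, scaled in zip(rows, scaled_rows):
    print(original, "->", scaled)
```

After scaling, every column lies between -1 and 1, and the sign of each value is preserved.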

Min-Max Scaling

Min-Max Scaling is a scaling technique that subtracts the minimum value from each value in the dataset and then divides the result by the range of the dataset (maximum - minimum). By applying this technique, all feature values will be between 0 and 1. Like Absolute Maximum Scaling, this technique depends on the most extreme values in the data, so its weakness is that it is susceptible to outliers.

    import matplotlib.pyplot as plt

    def min_max_scaling(data):
        # Determine the maximum and minimum values
        max_value = max(data)
        min_value = min(data)

        # Subtract the minimum value from each value,
        # then divide by the range of dataset values
        scaled_data = []
        for x in data:
            scaled_data.append(
                (x-min_value)/(max_value-min_value)
            )

        return scaled_data

    data = [3, -1, 6, 2, -4]
    scaled_data = min_max_scaling(data)
    print("Data:", data)
    print("Scaled Data:", scaled_data)

    plt.plot(data, "red", label="Original Data")
    plt.plot(scaled_data, "blue", label="Scaled Data")
    plt.legend()
    plt.show()

Output:

Data: [3, -1, 6, 2, -4]
Scaled Data: [0.7, 0.3, 1.0, 0.6, 0.0]
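To illustrate the outlier problem, here is a minimal sketch that adds one extreme value to the same data. The value 1000 is made up for the demonstration, and the function is a compact rewrite of the one above.

```python
def min_max_scaling(data):
    """Scale values to [0, 1] using the minimum and the range of the data."""
    lo, hi = min(data), max(data)
    return [(x - lo) / (hi - lo) for x in data]

data = [3, -1, 6, 2, -4]
with_outlier = data + [1000]  # hypothetical extreme value

scaled = min_max_scaling(with_outlier)
print([round(v, 4) for v in scaled])

# The five original values are squeezed into a tiny interval near 0,
# because the single outlier stretches the range.
print("largest original value after scaling:", max(scaled[:5]))
```

The outlier itself becomes 1.0, while all five original values fall below 0.01, so the differences between them almost disappear.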

Normalization

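Normalization (often called mean normalization) is a scaling technique that is similar to min-max scaling, except that the average value of the dataset, rather than the minimum, is subtracted from each feature value. The result of the subtraction is then divided by the range of dataset values, so the scaled values are centered on 0.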

    import matplotlib.pyplot as plt
    from statistics import mean

    def normalization(data):
        # Determine maximum, minimum, average values
        max_value = max(data)
        min_value = min(data)
        mean_value = mean(data)

        # Subtract the average value from each value,
        # then divide by the range of dataset values
        scaled_data = []
        for x in data:
            scaled_data.append(
                (x-mean_value)/(max_value-min_value)
            )

        return scaled_data

    data = [3, -1, 6, 2, -4]
    scaled_data = normalization(data)
    print("Data:", data)
    print("Scaled Data:", scaled_data)

    plt.plot(data, "red", label="Original Data")
    plt.plot(scaled_data, "blue", label="Scaled Data")
    plt.legend()
    plt.show()

Output:

Data: [3, -1, 6, 2, -4]
Scaled Data: [0.18, -0.22000000000000003, 0.48, 0.08, -0.52]
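As a quick sanity check, the sketch below verifies two properties that follow from the formula: the scaled values average to (approximately) zero, and the distance between the largest and smallest scaled value is 1. The function name `mean_normalization` is my own; the logic is the same as above.

```python
from statistics import mean

def mean_normalization(data):
    """Subtract the mean from each value, then divide by the range."""
    mean_value = mean(data)
    value_range = max(data) - min(data)
    return [(x - mean_value) / value_range for x in data]

scaled = mean_normalization([3, -1, 6, 2, -4])

# Up to floating-point rounding, the scaled values are centered on zero...
print("mean of scaled values:", mean(scaled))
# ...and they span an interval of width 1.
print("spread of scaled values:", max(scaled) - min(scaled))
```

Unlike min-max scaling, the values are not confined to 0..1; they sit somewhere between -1 and 1 depending on where the mean lies within the range.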

Standardization (Z-score Normalization)

Standardization is a scaling technique that subtracts the average from each feature value and divides the result by the standard deviation; the resulting value is usually called the z-score. The result of this technique is a feature that is centered on 0 and has a standard deviation of 1. This technique is most suitable when the feature is approximately normally distributed, for example salary or age in datasets where those values are roughly bell-shaped.

    import matplotlib.pyplot as plt
    from statistics import mean, stdev

    def standardization(data):
        # Determine the average value, standard deviation
        mean_value = mean(data)
        stdev_value = stdev(data)

        # Subtract the average value from each value,
        # then divide by the standard deviation
        scaled_data = []
        for x in data:
            scaled_data.append(
                (x-mean_value)/(stdev_value)
            )

        return scaled_data

    data = [3, -1, 6, 2, -4]
    scaled_data = standardization(data)
    print("Data:", data)
    print("Scaled Data:", scaled_data)

    plt.plot(data, "red", label="Original Data")
    plt.plot(scaled_data, "blue", label="Scaled Data")
    plt.legend()
    plt.show()

Output:

Data: [3, -1, 6, 2, -4]
Scaled Data: [0.469476477861571, -0.5738045840530313, 1.2519372742975226, 0.20865621238292043, -1.3562653804889828]
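One detail worth knowing: `statistics.stdev` computes the sample standard deviation (dividing by n - 1), while `statistics.pstdev` computes the population standard deviation (dividing by n). The sketch below shows both variants side by side; the `population` flag and the function name `standardize` are my own additions for illustration.

```python
from statistics import mean, stdev, pstdev

def standardize(data, population=False):
    """Return z-scores using the sample (default) or population deviation."""
    mu = mean(data)
    sigma = pstdev(data) if population else stdev(data)
    return [(x - mu) / sigma for x in data]

data = [3, -1, 6, 2, -4]
sample_z = standardize(data)
population_z = standardize(data, population=True)

print("sample z-scores:    ", sample_z)
print("population z-scores:", population_z)

# Each result has a standard deviation of 1 under its own definition.
print(stdev(sample_z), pstdev(population_z))
```

On a dataset this small the two sets of z-scores differ noticeably; on large datasets the difference becomes negligible. It mainly matters when comparing results against another tool that may use the other definition.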

Robust Scaling

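In the Robust Scaling technique, the median is subtracted from each value and the result is divided by the interquartile range (IQR). The IQR is the difference between the upper quartile (Q3) and the lower quartile (Q1). Because the median and the quartiles are much less affected by extreme values than the minimum, maximum, or mean, this technique is more robust to outliers.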

    import matplotlib.pyplot as plt
    import pandas as pd

    def robust_scaling(data):
        series = pd.Series(data)

        # Determine the median value, IQR
        q1, median, q3 = series.quantile([0.25, 0.5, 0.75])
        IQR = q3 - q1

        # Subtract the median value from each value,
        # then divide by the IQR value
        scaled_data = []
        for x in data:
            scaled_data.append(
                (x-median)/IQR
            )

        return scaled_data

    data = [3, -1, 6, 2, -4]
    scaled_data = robust_scaling(data)
    print("Data:", data)
    print("Scaled Data:", scaled_data)

    plt.plot(data, "red", label="Original Data")
    plt.plot(scaled_data, "blue", label="Scaled Data")
    plt.legend()
    plt.show()

Output:

Data: [3, -1, 6, 2, -4]
Scaled Data: [0.25, -0.75, 1.0, 0.0, -1.5]
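To see how this behaves with an outlier, here is a small comparison against min-max scaling. It is only a sketch: it uses `statistics.quantiles` with `method="inclusive"` from the standard library instead of pandas (on this data it gives the same quartiles), and the outlier value 1000 is made up.

```python
from statistics import quantiles

def robust_scaling(data):
    """Subtract the median, then divide by the interquartile range."""
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return [(x - median) / iqr for x in data]

def min_max_scaling(data):
    """Scale values to [0, 1] using the minimum and the range."""
    lo, hi = min(data), max(data)
    return [(x - lo) / (hi - lo) for x in data]

data = [3, -1, 6, 2, -4]
with_outlier = data + [1000]  # hypothetical extreme value

robust = robust_scaling(with_outlier)
minmax = min_max_scaling(with_outlier)

# Spread of the five original values after each kind of scaling.
robust_spread = max(robust[:5]) - min(robust[:5])
minmax_spread = max(minmax[:5]) - min(minmax[:5])

print("robust spread: ", robust_spread)
print("min-max spread:", minmax_spread)
```

With robust scaling the original values stay clearly separated and the outlier simply becomes a very large number, while min-max scaling compresses the original values into almost a single point.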

Scaling to Vector Unit Length

Scaling to Vector Unit Length is a scaling technique that transforms the components of a feature vector so that the transformed vector has a length of 1. In this technique, each feature value is divided by the length (Euclidean norm) of the vector.

This technique can only be applied if the vector length is greater than zero, that is, ||X|| > 0; a vector of all zeros cannot be scaled this way.
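The condition on the vector length can be checked explicitly before dividing. The following is a minimal sketch using only the standard library; the function name `unit_length_scaling` and the choice to raise a `ValueError` for a zero vector are my own assumptions, not the only reasonable way to handle that case.

```python
import math

def unit_length_scaling(data):
    """Divide each value by the Euclidean length of the vector."""
    length = math.sqrt(sum(x * x for x in data))
    if length == 0:
        # ||X|| = 0: every component is zero, so there is no direction to keep.
        raise ValueError("cannot scale a zero vector to unit length")
    return [x / length for x in data]

scaled = unit_length_scaling([3, -1, 6, 2, -4])
scaled_length = math.sqrt(sum(x * x for x in scaled))

print("Scaled Data:", scaled)
print("Length after scaling:", scaled_length)  # 1, up to rounding
```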

import matplotlib.pyplot as plt
import numpy as np

def vector_normalization(data):
    vector = np.array(data)

    # Determine the length (Euclidean norm) of the vector
    magnitude = np.linalg.norm(vector)

    # Normalize the vector to unit length
    scaled_data = vector / magnitude

    return scaled_data

data = [3, -1, 6, 2, -4]
scaled_data = vector_normalization(data)
print("Data:", data)
print("Scaled Data:", scaled_data.tolist())
plt.plot(data, "red", label="Original Data")
plt.plot(scaled_data, "blue", label="Scaled Data")
plt.legend()
plt.show()

Output:

Data: [3, -1, 6, 2, -4]
Scaled Data: [0.3692744729379982, -0.12309149097933272, 0.7385489458759964, 0.24618298195866545, -0.4923659639173309]
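
The condition ||X|| > 0 can be checked explicitly. The following is a minimal sketch that uses only the standard library (an assumption made for portability; the example above uses NumPy). The function name unit_length is hypothetical. It rejects the zero vector and then confirms that the scaled vector has a length of 1.

```python
import math

def unit_length(data):
    # Euclidean (L2) length of the feature vector
    norm = math.sqrt(sum(x * x for x in data))

    # The technique is undefined for a vector of length 0
    if norm == 0:
        raise ValueError("cannot scale a zero-length vector")

    return [x / norm for x in data]

scaled = unit_length([3, -1, 6, 2, -4])

# The length of the scaled vector should be 1 (up to rounding error)
print(math.sqrt(sum(x * x for x in scaled)))
```

Raising an error is only one possible design choice here; returning the input unchanged would be another, depending on how the rest of the pipeline treats all-zero features.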

Handling Extreme Outlier

Outliers are values that differ greatly from the majority of the data in the data set. Outliers may represent natural variation in the population. Often, however, they are caused by errors in the data collection process, such as incorrect data entry, equipment failure, or other measurement errors. If outliers are not handled, they can distort the results of statistical analysis and reduce the accuracy of the model being developed.

Outlier Detection

Outlier Detection with Sorting Methods

The sorting method is the simplest way to detect outliers. Quantitative data is sorted from low to high and then inspected manually for very low or very high values. In Python, sorting can be done with the built-in sorted function.

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
sorted_data = sorted(data)
print("Data:", data)
print("Sorted Data:", sorted_data)

Output:

Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
Sorted Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 26, 27, 100]

The value 100 is very high compared to the rest of the data, so it can be flagged as an outlier. However, this method is imprecise because outliers are identified by eye rather than by a statistical calculation.

Outlier Detection using the Histogram Method

Histograms help visualize data and reveal whether a data set contains outliers. Data that lies apart from the main body of the distribution is detected as an outlier. This method has the same disadvantage as the sorting method: outliers are determined only from visual observation of the data.

import matplotlib.pyplot as plt

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
plt.hist(data)
plt.show()

Output:

In the histogram produced by this code, most of the data values are concentrated at the low end of the axis, below 40. One value, 100, is separated from the rest of the data, so it can be categorized as an outlier.
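
The same picture can be approximated in text form. The sketch below counts the values per bin by hand, assuming 10 equal-width bins between the minimum and the maximum (which, as far as I know, matches the default behavior of plt.hist). The variable names are illustrative only.

```python
data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]

bins = 10  # assumed to match the default bin count of plt.hist
low, high = min(data), max(data)
width = (high - low) / bins

counts = [0] * bins
for x in data:
    # The maximum value is placed in the last bin
    index = min(int((x - low) / width), bins - 1)
    counts[index] += 1

# Print one row per bin, with one "#" per value in that bin
for i, count in enumerate(counts):
    left = low + i * width
    print(f"{left:6.1f} to {left + width:6.1f}: {'#' * count}")
```

The first two bins hold eleven of the twelve values, the last bin holds a single value, and every bin in between is empty. That gap is what makes 100 stand out.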

Outlier Detection with Box-Plot

A box plot is a graphical summary of a sample distribution. It shows the shape of the distribution (skewness), a measure of central tendency, and a measure of the spread of the observations. Five statistical measures can be read from a box plot: minimum, maximum, Q1, median, and Q3. Values that lie beyond the whiskers can be categorized as outliers.

import matplotlib.pyplot as plt

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
plt.boxplot(data, vert=False)
plt.title("Detecting outliers using Boxplot")
plt.xlabel('Data')
plt.show()

Output:

The value 100 lies beyond the whisker, so it can be categorized as an outlier.
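
The numbers behind the plot can also be computed directly. This is a rough sketch using the standard statistics module; it assumes that the whiskers extend to the most extreme data points within 1.5 * IQR of the box, which to my knowledge is Matplotlib's default, and that quartiles are linearly interpolated.

```python
import statistics as s

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]

# method="inclusive" interpolates linearly between data points
q1, median, q3 = s.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Whiskers reach the most extreme data points within 1.5 * IQR of the box
inside = [x for x in data if q1 - 1.5 * iqr <= x <= q3 + 1.5 * iqr]
whisker_low, whisker_high = min(inside), max(inside)

# Anything beyond the whiskers would be drawn as an individual point
fliers = [x for x in data if x < whisker_low or x > whisker_high]

print("Q1, median, Q3:", q1, median, q3)
print("Whiskers:", whisker_low, whisker_high)
print("Points beyond the whiskers:", fliers)
```

Under these assumptions, the "minimum" and "maximum" read from the plot are the whisker ends (16 and 27), not the raw extremes of the data.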

Outlier Detection with Z-Score

With the z-score, a data point is considered an outlier if it lies more than 3 standard deviations from the mean, that is, if its absolute z-score is greater than 3. The stages of this technique are:

  1. For every data point, calculate the z-score using the formula (Xi - mean) / std.
  2. Set the threshold value to 3 and mark data points whose absolute z-score is greater than the threshold as outliers.

import statistics as s

def detect_outliers_zscore(data):
    # Threshold initialization
    thres = 3

    # Determine the mean and the (sample) standard deviation
    mean = s.mean(data)
    std = s.stdev(data)

    outliers = []
    # Determine the z-score of each data point
    for i in data:
        z_score = (i - mean) / std

        # Check whether the data point is an outlier
        if abs(z_score) > thres:
            outliers.append(i)

    return outliers

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
outliers = detect_outliers_zscore(data)
print("Outliers:", outliers)

Output:

Outliers: [100]
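
It can help to look at the z-scores themselves rather than only the final list. The sketch below prints each value with its z-score. It also prints the bound (n - 1) / sqrt(n), which is the largest absolute z-score any point can reach in a data set of n points when the sample standard deviation is used. The bound is added here as a side note and is not part of the original method.

```python
import statistics as s

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]

mean = s.mean(data)
std = s.stdev(data)  # sample standard deviation (divides by n - 1)

# z-score of every data point
z_scores = [(x - mean) / std for x in data]
for x, z in zip(data, z_scores):
    print(x, round(z, 2))

# With the sample standard deviation, no z-score in a data set
# of n points can exceed (n - 1) / sqrt(n)
n = len(data)
max_z = (n - 1) / n ** 0.5
print("Largest possible |z|:", round(max_z, 2))
```

For this data set the value 100 only just passes the threshold of 3. In data sets with 10 points or fewer the bound stays below 3, so a threshold of 3 would never flag anything there and a lower threshold may be worth considering.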

Outlier Detection with Inter Quartile Range (IQR)

Based on the Inter Quartile Range (IQR), a data point is detected as an outlier if it lies more than 1.5 times the IQR above Q3 or below Q1. The stages in determining outliers with IQR are:

  1. Sort the data set in ascending order
  2. Calculate the 1st and 3rd quartiles (Q1, Q3)
  3. Calculate the value of IQR = Q3 - Q1
  4. Calculate the lower limit = (Q1 - 1.5*IQR) and the upper limit = (Q3 + 1.5*IQR)
  5. For all data in the data set, check whether any data is below the lower limit or above the upper limit. Then mark that data as an outlier.

import pandas as pd

def detect_outliers_iqr(data):
    series = pd.Series(data)

    # Determine Q1, Q3, IQR values
    Q1, Q3 = series.quantile([0.25, 0.75])
    IQR = Q3 - Q1

    # Determine lower bound and upper bound
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    outliers = []
    # Compare each data point with the bounds
    for i in data:
        # Check whether the data point is an outlier
        if i < lower_bound or i > upper_bound:
            outliers.append(i)

    return outliers

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
outliers = detect_outliers_iqr(data)
print("Outliers:", outliers)

Output:

Outliers: [100]

Outlier Handling

Once outliers have been detected in the data set, the next stage is to handle them. There are several ways to do this.

Trimming

With trimming, the detected outliers are removed from the data set. This is often not the best choice, however, because the removed values may represent natural variation in the population.

def trimming(data, outlier):
    # Keep only the values that are not listed as outliers
    new_data = [x for x in data if x not in outlier]

    return new_data

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
outlier = [100]
new_data = trimming(data, outlier)
print("Data:", data)
print("New Data:", new_data)

Output:

Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
New Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 26, 27]

Quantile Based Flooring and Capping

In this technique, values above the 90th percentile are capped at the 90th percentile value, and values below the 10th percentile are floored at the 10th percentile value.

import pandas as pd

def handle_quantile_outlier(data):
    series = pd.Series(data)

    # Determine the 10th and 90th percentiles
    P10 = series.quantile(0.1)
    P90 = series.quantile(0.9)
    new_data = []

    # Replace the data value with P10 for data < P10
    # and with P90 for data > P90
    for x in data:
        if x < P10:
            x = P10
        elif x > P90:
            x = P90
        new_data.append(x)

    return new_data

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
new_data = handle_quantile_outlier(data)
print("Data:", data)
print("New Data:", new_data)

Output:

Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
New Data: [17.1, 17.1, 18, 19, 20, 21, 22, 23, 24, 26.9, 26, 26.9]
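
The loop above can also be written without an explicit loop. The following is a minimal sketch, assuming the data fits in a pandas Series; it uses `Series.clip` with the same 10th and 90th percentiles, and the names `p10`, `p90`, and `capped` are chosen here for illustration only.

```python
import pandas as pd

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
series = pd.Series(data)

# Same 10th and 90th percentiles as in the loop version
p10 = series.quantile(0.1)
p90 = series.quantile(0.9)

# clip() raises values below p10 to p10 and lowers values above p90 to p90
capped = series.clip(lower=p10, upper=p90)

print("Data:", data)
print("New Data:", capped.tolist())
```

Values that already lie between the two percentiles are left unchanged, so only the extremes are affected.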

Mean/Median Imputation

The mean is strongly affected by outliers, so it is generally better to replace outliers with the median, which is far less sensitive to extreme values.
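
To see how differently the two statistics react, the short sketch below compares the mean and the median of the sample data with and without the value 100. The variable `without_outlier` is introduced here only for the comparison.

```python
import statistics as s

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]

# The same data with the extreme value removed
without_outlier = [x for x in data if x != 100]

print("Mean with outlier:", s.mean(data))                    # 27.75
print("Mean without outlier:", s.mean(without_outlier))
print("Median with outlier:", s.median(data))                # 21.5
print("Median without outlier:", s.median(without_outlier))  # 21
```

A single extreme value shifts the mean by several units, while the median moves by only half a unit.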

import statistics as s

def handle_median_outlier(data, outlier):
    # Determine the median value
    median = s.median(data)

    # Replace every outlier with the median value.
    # Looping over the data once keeps the result the same length
    # as the input, even when several outliers are given.
    outliers = set(outlier)
    new_data = []
    for x in data:
        if x in outliers:
            new_data.append(median)
        else:
            new_data.append(x)

    return new_data

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
outlier = [100]
new_data = handle_median_outlier(data, outlier)
print("Data:", data)
print("New Data:", new_data)

Output:

Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
New Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 21.5, 26, 27]

Log Transformation

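Log transformation is a common technique for reducing the skew of a distribution and making it more symmetric. Because the logarithm compresses large values much more than small ones, extreme values are pulled closer to the rest of the data, which often brings the distribution closer to normal. Note that the logarithm is only defined for positive values.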
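
One way to check the effect on skew is to measure it before and after the transformation. The sketch below is an illustration only; it assumes pandas and NumPy are available and uses `Series.skew` as the skewness measure.

```python
import numpy as np
import pandas as pd

data = pd.Series([16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27])

# Base-10 logarithm of every value (all values must be positive)
log_data = np.log10(data)

# Sample skewness before and after the transformation
raw_skew = data.skew()
log_skew = log_data.skew()

print("Skew before:", raw_skew)
print("Skew after:", log_skew)
```

For this sample the skew gets smaller but does not disappear, since the value 100 is still the largest after the transformation.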

import math
import matplotlib.pyplot as plt

def handle_log_outlier(data):
    new_data = []
    # Transform each data value with a base-10 logarithm
    for x in data:
        new_data.append(math.log(x, 10))

    return new_data

data = [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
new_data = handle_log_outlier(data)
print("Data:", data)
print("New Data:", new_data)
plt.plot(data, 'red', label='Data')
plt.plot(new_data, 'blue', label='New Data')
plt.legend()
plt.show()

Output:

Data: [16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27]
New Data: [1.2041199826559246, 1.2304489213782739, 1.2552725051033058, 1.2787536009528289, 1.301029995663981, 1.322219294733919, 1.3424226808222062, 1.3617278360175928, 1.380211241711606, 2.0, 1.414973347970818, 1.4313637641589871]

Binning

Data binning is a method of grouping continuous numerical values into discrete intervals called “bins” or “groups”. Binning can be used to simplify the distribution of the data and to assist in statistical analysis and visualization. Several techniques are commonly used to bin data, including:

Equal Width Binning

This technique groups data into intervals (bins) of equal width, with the number of bins chosen in advance. Although the method is simple, it works poorly on data with a skewed distribution, because most values end up in a few bins while the other bins stay nearly empty.

import pandas as pd

data = {'age': [16, 17, 18, 19, 20, 21, 22, 23, 24, 20, 26, 27]}

df = pd.DataFrame(data)

num_bins = 4

# Calculate the bin width
bin_width = (df['age'].max() - df['age'].min()) / num_bins

# Create bin limits
bin_edges = [df['age'].min() + i * bin_width for i in range(num_bins + 1)]

# Assign each age to its bin
df['age_bins'] = pd.cut(df['age'], bins=bin_edges, include_lowest=True, right=True)

print(df)

Output:

    age         age_bins
0    16  (15.999, 18.75]
1    17  (15.999, 18.75]
2    18  (15.999, 18.75]
3    19    (18.75, 21.5]
4    20    (18.75, 21.5]
5    21    (18.75, 21.5]
6    22    (21.5, 24.25]
7    23    (21.5, 24.25]
8    24    (21.5, 24.25]
9    20    (18.75, 21.5]
10   26    (24.25, 27.0]
11   27    (24.25, 27.0]
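
The weakness on skewed data can be seen by binning the sample that contains the value 100. This is a small sketch, assuming that passing an integer to `pd.cut` is acceptable for creating the equal-width bins; the names `ages` and `counts` are illustrative.

```python
import pandas as pd

# The ages used in the outlier examples, including the extreme value 100
ages = pd.Series([16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27])

# Passing an integer asks pd.cut for that many equal-width bins
bins = pd.cut(ages, bins=4)

# Count the values that fall into each bin, in bin order
counts = bins.value_counts().sort_index()
print(counts)
```

Eleven of the twelve values land in the first bin, the two middle bins are empty, and the last bin holds only the outlier.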

Equal Frequency Binning

In this technique, data is grouped into bins so that each bin holds approximately the same number of data points. It is useful when it is important to keep the frequency roughly the same across bins. This binning method also copes well with outliers and skewed data.

import pandas as pd

data = {'age': [16, 17, 18, 19, 20, 21, 22, 23, 24, 20, 26, 27]}

df = pd.DataFrame(data)

# Split the ages into 3 bins with roughly the same number of values
df['age_bins'] = pd.qcut(df['age'], q=3)

print(df)

Output:

    age          age_bins
0    16  (15.999, 19.667]
1    17  (15.999, 19.667]
2    18  (15.999, 19.667]
3    19  (15.999, 19.667]
4    20  (19.667, 22.333]
5    21  (19.667, 22.333]
6    22  (19.667, 22.333]
7    23    (22.333, 27.0]
8    24    (22.333, 27.0]
9    20  (19.667, 22.333]
10   26    (22.333, 27.0]
11   27    (22.333, 27.0]
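
To illustrate how this method behaves with an outlier, the sketch below applies `pd.qcut` to the sample that contains the value 100 and counts the values per bin. The names `ages` and `counts` are chosen for this example only.

```python
import pandas as pd

# The ages used in the outlier examples, including the extreme value 100
ages = pd.Series([16, 17, 18, 19, 20, 21, 22, 23, 24, 100, 26, 27])

# Three bins whose edges are placed so that the counts are balanced
bins = pd.qcut(ages, q=3)

# Count the values that fall into each bin, in bin order
counts = bins.value_counts().sort_index()
print(counts)
```

Each bin receives four values; the outlier only stretches the upper edge of the last bin instead of leaving other bins empty.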

Quantile Binning

In this technique, data is grouped based on percentile values. The bin boundaries are set at chosen percentiles (e.g. the 25th, 50th, and 75th percentiles).

import pandas as pd
import numpy as np

data = {'age': [16, 17, 18, 19, 20, 21, 22, 23, 24, 20, 26, 27]}
df = pd.DataFrame(data)

# Define the percentiles for the bin boundaries
percentiles = [0, 25, 50, 75, 100]  # In this case, quartiles are used

# Compute the bin edges from those percentiles
bin_edges = np.percentile(df['age'], percentiles)

df['age_bins'] = pd.cut(df['age'], bins=bin_edges, include_lowest=True)

print(df)

Output:

    age         age_bins
0    16  (15.999, 18.75]
1    17  (15.999, 18.75]
2    18  (15.999, 18.75]
3    19    (18.75, 20.5]
4    20    (18.75, 20.5]
5    21    (20.5, 23.25]
6    22    (20.5, 23.25]
7    23    (20.5, 23.25]
8    24    (23.25, 27.0]
9    20    (18.75, 20.5]
10   26    (23.25, 27.0]
11   27    (23.25, 27.0]
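
The same quartile grouping can probably be expressed more compactly with `pd.qcut`, which accepts a list of quantiles directly. The following is a sketch under that assumption; the labels "Q1" to "Q4" are hypothetical names added here for readability.

```python
import pandas as pd

ages = pd.Series([16, 17, 18, 19, 20, 21, 22, 23, 24, 20, 26, 27])

# Quantiles are given as fractions between 0 and 1 (here: quartiles)
quartile = pd.qcut(ages, q=[0, 0.25, 0.5, 0.75, 1.0],
                   labels=["Q1", "Q2", "Q3", "Q4"])

print(quartile.tolist())
```

With labels in place of intervals, each age is tagged with the quartile it falls into, which can be easier to read in later analysis.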

Scrubbing

Data scrubbing is the process of correcting or deleting incomplete, incorrect, inaccurate, or repetitive data in a dataset. It helps improve the consistency, accuracy, and reliability of the data.

Deleting Repetitive Data

Deleting duplicate records is one way to perform data scrubbing. Duplicates often appear when the dataset is combined from several different sources.

import pandas as pd

data = {'age': [15, 17, 23, 22, 17],
        'height': [155, 162, 165, 170, 162]}
df = pd.DataFrame(data)

print("Data:")
print(df)

# Check if data is duplicated
duplicate_data = df[df.duplicated()]

print("Duplicate Data:")
print(duplicate_data)

# Delete duplicate data
df = df.drop_duplicates()
print("New Data:")
print(df)

Output:

Data:
   age  height
0   15     155
1   17     162
2   23     165
3   22     170
4   17     162
Duplicate Data:
   age  height
4   17     162
New Data:
   age  height
0   15     155
1   17     162
2   23     165
3   22     170
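
Sometimes rows should count as duplicates even when only some columns match. The sketch below uses hypothetical data in which two rows share an age but differ in height, and relies on the `subset` and `keep` parameters of `drop_duplicates`.

```python
import pandas as pd

# Hypothetical data: rows 1 and 4 share the same age but differ in height
data = {'age': [15, 17, 23, 22, 17],
        'height': [155, 162, 165, 170, 163]}
df = pd.DataFrame(data)

# No row is an exact copy of another, so nothing is flagged here
exact_duplicates = df[df.duplicated()]
print("Exact duplicates:", len(exact_duplicates))

# Compare on 'age' only and keep the first row of each group
deduplicated = df.drop_duplicates(subset=['age'], keep='first')
print(deduplicated)
```

Whether matching on a subset of columns is appropriate depends on the dataset; here it is only meant to show the parameters.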

Handling Missing Data

In real cases, a dataset usually contains a lot of missing data. The causes vary widely, ranging from data corruption to device failure while recording measurements.

Deleting Missing Data

Missing data can be handled by deleting the rows or columns that contain NULL (NaN) values.
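
Before deleting anything, it can help to count the missing values and compare the two options. This is a minimal sketch; the `weight` column and its values are hypothetical and were added only so that one column has no missing values.

```python
import pandas as pd
import numpy as np

# 'weight' is a hypothetical extra column without missing values
data = {'age': [15, 17, 23, np.nan, 17],
        'height': [155, 162, np.nan, 170, 162],
        'weight': [50, 55, 60, 65, 55]}
df = pd.DataFrame(data)

# Count the missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# axis=0 drops rows with missing values, axis=1 drops columns instead
rows_dropped = df.dropna(axis=0)
columns_dropped = df.dropna(axis=1)

print(rows_dropped)
print(columns_dropped)
```

Dropping rows keeps every column but loses two records here, while dropping columns keeps every record but leaves only `weight`.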

import pandas as pd
import numpy as np

data = {'age': [15, 17, 23, np.nan, 17],
        'height': [155, 162, np.nan, 170, 162]}
df = pd.DataFrame(data)

print("Data:")
print(df)

# Delete the rows that contain missing data
df.dropna(axis=0, inplace=True)

print("New Data:")
print(df)

Output:

Data:
   age  height
0  15.0   155.0
1  17.0   162.0
2  23.0     NaN
3   NaN   170.0
4  17.0   162.0
New Data:
   age  height
0  15.0   155.0
1  17.0   162.0
4  17.0   162.0

Pros:
• A model trained on data with all missing values removed can be robust, because it learns only from complete observations.
Cons:
• A lot of information may be lost.
• Works poorly when missing values make up a large share of the data set.
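
Before deleting anything, it can help to measure how much of each column is actually missing, and to drop a mostly empty column instead of many rows. The following is a minimal sketch of that idea. The extra `weight` column, the 50% cutoff, and the names `missing_share` and `max_missing_share` are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [15, 17, 23, np.nan, 17],
    'height': [155, 162, np.nan, 170, 162],
    'weight': [np.nan, np.nan, np.nan, 60, np.nan],
})

# Share of missing values per column (0.0 = complete, 1.0 = empty)
missing_share = df.isna().mean()

# Hypothetical cutoff: drop columns that are more than half empty
max_missing_share = 0.5
sparse_columns = missing_share[missing_share > max_missing_share].index
df_reduced = df.drop(columns=sparse_columns)

# Then delete the rows that still contain a missing value
df_clean = df_reduced.dropna(axis=0)

print(missing_share)
print(df_clean)
```

Dropping the sparse column first keeps three complete rows here, whereas deleting rows straight away would leave none, because every row has at least one missing value.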

Fill in Missing Data with Mean/Median/Mode Values

Missing values in a numeric column can be replaced with the mean, median, or mode of the remaining values in that column. Unlike the previous method, this technique does not discard any rows or columns.

import pandas as pd
import numpy as np

data = {'age': [15, 17, 23, np.nan, 17],
        'height': [155, 162, 165, 170, 162]}
df = pd.DataFrame(data)

print("Data:")
print(df)

df_filled_mean = df.copy()
df_filled_median = df.copy()
df_filled_mode = df.copy()

# Determine the mean/median/mode value
mean = df['age'].mean()
median = df['age'].median()
mode = df['age'].mode().values[0]

# Fill in the data with the mean/median/mode value
# (assigning the result back avoids chained in-place modification)
df_filled_mean['age'] = df_filled_mean['age'].fillna(mean)
df_filled_median['age'] = df_filled_median['age'].fillna(median)
df_filled_mode['age'] = df_filled_mode['age'].fillna(mode)

print("New Data:")
print("Filled mean:")
print(df_filled_mean)
print("Filled median:")
print(df_filled_median)
print("Filled mode:")
print(df_filled_mode)

Output:

Data:
    age  height
0  15.0     155
1  17.0     162
2  23.0     165
3   NaN     170
4  17.0     162
New Data:
Filled mean:
    age  height
0  15.0     155
1  17.0     162
2  23.0     165
3  18.0     170
4  17.0     162
Filled median:
    age  height
0  15.0     155
1  17.0     162
2  23.0     165
3  17.0     170
4  17.0     162
Filled mode:
    age  height
0  15.0     155
1  17.0     162
2  23.0     165
3  17.0     170
4  17.0     162

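Pros:
• Avoids the data loss that comes from deleting rows or columns.
• Works well with small data sets and is easy to implement.
Cons:
• Only works with numeric continuous variables.
• May cause data leakage, for example when the fill value is computed on the full data set before it is split into training and test sets.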
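
One common way to reduce the leakage risk is to compute the fill value from the training rows only and reuse it for the test rows. The sketch below assumes a simple positional split (the first four rows for training); the split and the names `train`, `test`, and `train_median` are illustrative assumptions, not part of the original example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [15, 17, 23, np.nan, 17, np.nan],
    'height': [155, 162, 165, 170, 162, 158],
})

# Hypothetical split: first four rows for training, the rest for testing
train = df.iloc[:4].copy()
test = df.iloc[4:].copy()

# Compute the fill value from the training rows only
train_median = train['age'].median()

# Reuse that same value for both splits
train['age'] = train['age'].fillna(train_median)
test['age'] = test['age'].fillna(train_median)

print(train)
print(test)
```

The test rows never contribute to the statistic, so nothing about them leaks into the values the model is trained on.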

Fill in Missing Data in Categorical Columns

When data is missing from a categorical column, whether its values are strings or numbers, the missing entries can be filled in with the most frequent category. If a large share of the values is missing, they can instead be replaced with a new, dedicated category.

import pandas as pd
import numpy as np

data = {'age': [15, 17, 23, 20, 17],
        'impression': ['good', 'fair', 'fair', 'very good', np.nan]}
df = pd.DataFrame(data)

print("Data:")
print(df)

# Find the most frequent category
most_category = df['impression'].mode().values[0]

# Fill in the missing data with the most frequent category
# (assigning the result back avoids chained in-place modification)
df['impression'] = df['impression'].fillna(most_category)

print("New Data:")
print(df)

Output:

Data:
   age        impression
0    15         good
1    17         fair
2    23         fair
3    20    very good
4    17          NaN
New Data:
   age        impression
0    15         good
1    17         fair
2    23         fair
3    20    very good
4    17         fair

Pros:
• Avoids the data loss that comes from deleting rows or columns.
• Works well with small data sets and is easy to implement.
• Filling with a dedicated new category keeps the information that a value was missing.
Cons:
• Only works with categorical variables.
• A new category adds an extra feature once the column is encoded, which may hurt model performance.
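
When many entries are missing, the second option described above (a new category) can be sketched as follows. The label `'unknown'` is an assumption chosen for illustration; any value that does not collide with an existing category would do.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [15, 17, 23, 20, 17],
    'impression': ['good', np.nan, 'fair', np.nan, np.nan],
})

# Placeholder label for missing entries (the name 'unknown' is an assumption)
df['impression'] = df['impression'].fillna('unknown')

print(df['impression'].value_counts())
```

Compared with filling in the most frequent category, this keeps the missing rows distinguishable from rows that genuinely had that value.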

Data Type Conversion

Most machine learning models cannot work directly with categorical data, so it needs to be converted into numerical data. One technique that can be used is one-hot encoding, which represents each category as a binary vector: a 1 in the position assigned to that category and a 0 everywhere else.

import numpy as np

# Categorical data to be converted
colors = ["red", "green", "yellow", "red", "blue"]

# List of all possible colors
total_colors = ["red", "green", "blue", "black", "yellow"]

# Map each color to a numeric index
mapping = {}
for x in range(len(total_colors)):
    mapping[total_colors[x]] = x

one_hot_encode = []

# Build a binary vector for each value
for c in colors:
    # tolist() yields plain Python ints, so the printed result stays readable
    arr = np.zeros(len(total_colors), dtype=int).tolist()
    arr[mapping[c]] = 1
    one_hot_encode.append(arr)

print(one_hot_encode)

Output:

[[1, 0, 0, 0, 0], [0, 1, 0, 0, 0], [0, 0, 0, 0, 1], [1, 0, 0, 0, 0], [0, 0, 1, 0, 0]]
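
For data that already lives in pandas, the same encoding can be sketched with `pd.get_dummies` instead of a hand-written loop. This is an alternative to the manual approach above, not a replacement for it; the names `color_series` and `one_hot` are assumptions for illustration.

```python
import pandas as pd

colors = ["red", "green", "yellow", "red", "blue"]
total_colors = ["red", "green", "blue", "black", "yellow"]

# Declare the full category list so that unseen colors ("black") still get a column
color_series = pd.Series(pd.Categorical(colors, categories=total_colors))

# One column per category; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(color_series, dtype=int)

print(one_hot)
```

Declaring the categories up front keeps the column order and count fixed, so the rows line up with the vectors produced by the manual loop.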

Deleting Irrelevant Data

Data is considered irrelevant when it has no bearing on the problem being studied.

import pandas as pd

data = {'age': [15, 17, 23],
        'email': ['ahmad@gmail.com', 'putra@yahoo.com', 'tegar@gmail.com']}
df = pd.DataFrame(data)

print("Data:")
print(df)

# Remove irrelevant attributes
df = df.drop('email', axis=1)

print("New Data:")
print(df)

Output:

Data:
   age            email
0   15  ahmad@gmail.com
1   17  putra@yahoo.com
2   23  tegar@gmail.com
New Data:
   age
0   15
1   17
2   23

Avoiding Structural Errors

Structural errors include typos, inconsistent naming conventions, inconsistent capitalization, and so on. The following example normalizes the capitalization of a categorical feature:

import pandas as pd

data = {'age': [15, 17, 23, 20, 17],
        'impression': ['good', 'Fair', 'fair', 'Very good', 'Good']}
df = pd.DataFrame(data)

print("Data:")
print(df)

# Normalize letter capitalization
df['impression'] = df['impression'].str.lower()

print("New Data:")
print(df)

Output:

Data:
   age    impression
0    15         good
1    17         Fair
2    23         fair
3    20    Very good
4    17         Good
New Data:
   age    impression
0    15         good
1    17         fair
2    23         fair
3    20    very good
4    17         good
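
Capitalization is only one kind of structural error. Stray whitespace and typos can be handled in a similar way, as the sketch below suggests. The sample values and the `typo_fixes` lookup table are made up for illustration; in practice such a table would have to be built by inspecting the actual data.

```python
import pandas as pd

df = pd.DataFrame({
    'age': [15, 17, 23, 20, 17],
    'impression': [' good', 'Fair ', 'fiar', 'Very  good', 'GOOD'],
})

# Hypothetical lookup table from known misspellings to the intended label
typo_fixes = {'fiar': 'fair'}

df['impression'] = (
    df['impression']
    .str.strip()                            # remove leading/trailing spaces
    .str.lower()                            # unify capitalization
    .str.replace(r'\s+', ' ', regex=True)   # collapse repeated spaces
    .replace(typo_fixes)                    # map known typos to the right label
)

print(df)
```

After these steps the column contains only the three labels `good`, `fair`, and `very good`.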

Closing

In conclusion, employing effective data cleaning methods not only enhances the reliability of your analyses but also paves the way for informed decision-making, ensuring that your data-driven journey is built on a solid foundation of accuracy and integrity.

Original article: Laid-Back Approaches to Clean Data
