Reading and Manipulating Your Dataset With Pandas (2)

Data Science With Python: Where And How To Start (4 Part Series)

1 Reading and Manipulating Your Dataset With Pandas
2 Reading and Manipulating Your Dataset With Pandas (2)
3 How to prove that your cat is fat (with statistics and python)
4 Beginners’ journey to machine learning

Manipulation

Let’s say you need to see only one column of your dataframe. To see the ‘fixed acidity’ column of our dataset, you need to write:

df['fixed acidity']

Enter fullscreen mode Exit fullscreen mode

If you add a condition to this column, for example, if you want to see the rows that has a fixed acidity higher than 9:

df[df['fixed acidity']>9]

Enter fullscreen mode Exit fullscreen mode

Sometimes you might need rows with multiple conditions added to columns:

df[(df['fixed acidity']>9) & (df['citric acid']>0.5)]

Enter fullscreen mode Exit fullscreen mode

If you need to find specific columns:

df.loc[:,['volatile acidity', 'chlorides']]

Enter fullscreen mode Exit fullscreen mode

You may want to add conditions with them too, for example, you may want to see the ‘volatile acidity’ and ‘chlorides’ content of those rows that have a ‘fixed acidity’ of 9.2:

df.loc[df['fixed acidity'] == 9.2, ['fixed acidity','volatile acidity', 'chlorides']]

Enter fullscreen mode Exit fullscreen mode

You can view the rows for specific indices (as discussed in the previous chapter) too, like this:

df.loc[0:3, ['volatile acidity', 'chlorides']]

Enter fullscreen mode Exit fullscreen mode

Now, if you want to locate a specific value, for example, the alchohol content of the wine of 0th row:

df['alcohol'].loc[0]

Enter fullscreen mode Exit fullscreen mode

and you will get a value of 9.4

You can find locate a row using its index too:

df.iloc[100]

Enter fullscreen mode Exit fullscreen mode

Now if you want to pinpoint a value within this, for example, the 1st attribute (volatile acidity in this case) of the 100th row, try:

df.iloc[100][1]

Enter fullscreen mode Exit fullscreen mode

and you will get 0.61 as expected.

You can locate specific consecutive rows and columns using this iloc command, for example, first three columns of 3rd to 7th row:

df.iloc[3:8, 0:3]

Enter fullscreen mode Exit fullscreen mode

and non consecutive rows and columns too:

df.iloc[[71, 122, 400], [0, 2]]

Enter fullscreen mode Exit fullscreen mode

What if you want to add a new column to your dataframe? Let’s add a ‘new column’ containing the word ‘hi’ for all rows:

df['new column'] = 'hi'
df.head()

Enter fullscreen mode Exit fullscreen mode

Let’s try changing the value of ‘new column’ of 0th index of the dataframe using iloc from ‘hi’ to ‘bye’:

df.iloc[0, df.columns.get_loc('new column')]= 'bye'
df.head()

Enter fullscreen mode Exit fullscreen mode

Now let’s try to find the word starts with ‘by’ (that we just have added) and replace it with ‘hello’:

df['new column'].loc[df['new column'].str.startswith('by')] = 'hello'
df.head()

Enter fullscreen mode Exit fullscreen mode

You can also replace null values of your data using pandas. We do not have any null values here, so let’s introduce a null value first. Let’s replace the string ‘hello’ with null. To do so, we would need the numpy library.

import numpy as np
df['new column'].loc[df['new column'].str.startswith('hel')] = np.nan
df.head()

Enter fullscreen mode Exit fullscreen mode

To check the number of null values, you can use the isna() method like this:

df.isna().sum()

Enter fullscreen mode Exit fullscreen mode

This isna() method can also be used to locate the null value like this:

pd.isna(df.head())

Enter fullscreen mode Exit fullscreen mode

Let’s replace the null value with ‘hey’.

df.fillna(value='hey', inplace=True)
df.head()

Enter fullscreen mode Exit fullscreen mode

If you want to drop null values, use the dropna() method.

Now we will try to create a new dataframe using a loop, where one column of the new dataframe would look the same as the ‘new column’ of our dataframe df.

rows = []
for i in range(df.shape[0]):
     rows.append(['hi', 'bye'])
df_new = pd.DataFrame(rows, columns=["new column 2", "new column 3"])
df_new.iloc[0, df_new.columns.get_loc('new column 2')]= 'hey'
df_new.head()

Enter fullscreen mode Exit fullscreen mode

You can merge these two dataframes using their common attributes:

df_merged = df.merge(df_new, left_on='new column', right_on='new column 2')
df_merged.head()

Enter fullscreen mode Exit fullscreen mode

You can make necessary variations in your merging operations by dropping mismatched attributes, or by using a column with common name and so on.

You can also group your dataframes:

df.groupby(['volatile acidity', 'chlorides']).count().head()

Enter fullscreen mode Exit fullscreen mode

You can also group the dataframes using other attributes like sum.

When you are done with manipulation of your dataframes, you are ready to visualize your data.

Data Science With Python: Where And How To Start (4 Part Series)

1 Reading and Manipulating Your Dataset With Pandas
2 Reading and Manipulating Your Dataset With Pandas (2)
3 How to prove that your cat is fat (with statistics and python)
4 Beginners’ journey to machine learning

原文链接:Reading and Manipulating Your Dataset With Pandas (2)

© 版权声明
THE END
喜欢就支持一下吧
点赞10 分享
评论 抢沙发

请登录后发表评论

    暂无评论内容