Python dictionaries and Pandas DataFrames are among the most frequently used data structures for working with data. The Pandas DataFrame is a popular, standard data structure for tabular data and advanced data analysis. In this article, we will get hands-on practice with how to
- create,
- manipulate,
- select,
- add,
- update,
- delete
data in dictionaries and dataframes.
List
First, let’s talk about a basic Python data type: the list. Imagine that we work for the World Bank and want to keep track of the population of each country.
Let’s say we have 2021 population data for each country:
- India (1,393,409,030),
- Burma (54,806,010),
- Thailand (69,950,840),
- Singapore (5,453,570), and so on.
These data are based on Population Data | The World Bank.
To keep track of which population belongs to which country, we create two lists as follows, with the names of the countries in the same order as the populations.
# lists
countries = ['India', 'Burma', 'Thailand', 'Singapore']
populations = [1393409030, 54806010, 69950840, 5453570]
Now suppose that we want to get the population of Burma. First, we have to figure out where in the list Burma is, so that we can use this position to get the correct population. We will use the list method index() to get the position.
burma_index = countries.index('Burma')
print(burma_index)
Output:
1
We get 1 as the index of 'Burma' because Python list indexes start from 0. Now, we can use this index to subset the populations list, to get the population corresponding to Burma.
print(populations[burma_index])
output:
54806010
As expected, we get 54806010, the population of Burma.
Motivation for Dictionaries
So we have two lists, and used the index to connect corresponding elements in both lists. It worked, but it’s a pretty terrible approach: it’s not convenient and not intuitive. Wouldn’t it be easier if we had a way to connect each country directly to its population, without using an index?
Dictionary
This is where the “dictionary” comes into play. Let’s convert this population data to a dictionary. To create the dictionary, we need curly brackets. Next, inside the curly brackets, we have a bunch of what are called key:value pairs.
my_dict = {
"key1":"value1",
"key2":"value2",
}
In our case,
- the keys are the country names, and
- the values are the corresponding populations.
The first key is India, and its corresponding value is 1,393,409,030. Notice the colon that separates the key and value here. Let’s do the same thing for the three other key-value pairs, and store the dictionary under the name country_population.
country_population = {'India':1393409030, 'Burma':54806010, 'Thailand':69950840, 'Singapore':5453570}
If we want to find the population for Burma, we simply type country_population, and then the string "Burma" inside square brackets.
print(country_population["Burma"])
output:
54806010
In other words, we pass the key in square brackets, and we get the corresponding value. This approach is not only intuitive, it’s also very efficient, because Python can make the lookup of these keys very fast, even for huge dictionaries.
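If the data already sits in the two lists from the list section, the dictionary does not have to be typed out again. The following is a minimal sketch that pairs the lists with Python's built-in zip() and dict(); it assumes both lists have the same length and the same order.

```python
countries = ['India', 'Burma', 'Thailand', 'Singapore']
populations = [1393409030, 54806010, 69950840, 5453570]

# zip() pairs the i-th country with the i-th population;
# dict() turns each pair into a key:value entry.
country_population = dict(zip(countries, populations))

print(country_population['Burma'])
```

The lookup by key now replaces the two-step "find the index, then subset" approach shown earlier.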
Create a Dictionary
We will create a dictionary of countries and capitals, where the country names are the keys and the capitals are the corresponding values.
- With the strings in countries and capitals, create a dictionary called asia with 4 key:value pairs. Beware of capitalization! Strings in the code are case-sensitive.
- Print out asia to see if the result is what we expected.
# From the strings in countries and capitals, create a dictionary called asia
asia = {'India':'New Delhi', 'Burma':'Yangon', 'Thailand':'Bangkok', 'Singapore':'Singapore'}
# Print
print(asia)
# Print type of asia
print(type(asia))
output:
{'India': 'New Delhi', 'Burma': 'Yangon', 'Thailand': 'Bangkok', 'Singapore': 'Singapore'}
<class 'dict'>
Great! <class 'dict'> means that asia is an instance of the dict class, in other words a dictionary. Classes are outside this article’s scope, and we will explain them in another article that focuses on classes. Now that we’ve built our first dictionary, let’s see how to work with it.
Manipulating a Dictionary
If the keys of a dictionary are chosen wisely, accessing the values in a dictionary is easy and intuitive. For example, to get the capital for India from asia, we can use 'India' as the key.
print(asia['India'])
output:
New Delhi
We can check which keys are in asia by calling the keys() method on asia.
# Print out the keys in asia
print(asia.keys())
# Print out the value that belongs to key 'Burma'
print(asia['Burma'])
output:
dict_keys(['India', 'Burma', 'Thailand', 'Singapore'])
Yangon
Earlier, we created the dictionary country_population, which is basically a set of key:value pairs. We can easily access the population of Burma by passing the key in square brackets, like this.
country_population = {'India':1393409030, 'Burma':54806010, 'Thailand':69950840, 'Singapore':5453570}
print(country_population['Burma'])
output:
54806010
Note: For this lookup to work properly, the keys in a dictionary should be unique.
If we try to add another key:value pair to country_population with the same key, Burma, for example,
country_population = {'India':1393409030, 'Burma':54806010, 'Thailand':69950840, 'Singapore':5453570, 'Burma':54800000,}
we’ll see that the resulting country_population dictionary still contains four pairs. The last pair that we specified in the curly brackets ('Burma':54800000) was kept in the resulting dictionary.
country_population = {'India':1393409030, 'Burma':54806010, 'Thailand':69950840, 'Singapore':5453570, 'Burma':54800000,}
print(country_population)
output:
{'India': 1393409030, 'Burma': 54800000, 'Thailand': 69950840, 'Singapore': 5453570}
Let’s see how we can add more data to a dictionary that already exists.
Add data to a Dictionary
Our country_population dictionary currently does not have China’s data. We want to add "China":1412360000 to country_population.
# Before adding China data
country_population = {'India':1393409030, 'Burma':54806010, 'Thailand':69950840, 'Singapore':5453570}
print(country_population)
output:
{'India': 1393409030, 'Burma': 54806010, 'Thailand': 69950840, 'Singapore': 5453570}
To add this information, simply write the key "China" in square brackets after the dictionary name and assign the population 1412360000 to it with the equals sign.
# After adding China data
country_population["China"] = 1412360000
print(country_population)
output:
{'India': 1393409030, 'Burma': 54806010, 'Thailand': 69950840, 'Singapore': 5453570, 'China': 1412360000}
Now if you check out country_population again, indeed, China is in there. To check this with code, you can also write 'China' in country_population, which gives us True if the key China is in there. Note that the key 'China' is a string and is case-sensitive.
print('China' in country_population)
output:
True
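The membership check above matters because square brackets fail for a key that is not in the dictionary. As a small sketch (the key 'Japan' is only a hypothetical example of a missing key), the following contrasts a case mismatch, a failed lookup, and the dictionary method get(), which returns a default value instead of raising an error.

```python
country_population = {'India': 1393409030, 'Burma': 54806010,
                      'Thailand': 69950840, 'Singapore': 5453570,
                      'China': 1412360000}

# Keys are case-sensitive: 'china' is not the same key as 'China'.
print('china' in country_population)

# Square brackets raise a KeyError when the key does not exist ...
try:
    country_population['Japan']
except KeyError as err:
    print('missing key:', err)

# ... while get() returns None, or a default that we pass ourselves.
print(country_population.get('Japan'))
print(country_population.get('Japan', 0))
```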
Update data in a Dictionary
With the syntax dict_name[key] = value, we can also change values, for example, to update the population of China to 1412000000. Because each key in a dictionary is unique, Python knows that we’re not trying to create a new pair, but want to update the pair that’s already in there.
country_population["China"] = 1412000000
print(country_population)
output:
{'India': 1393409030, 'Burma': 54806010, 'Thailand': 69950840, 'Singapore': 5453570, 'China': 1412000000}
Delete data from a Dictionary
Suppose now that we want to remove China from the dictionary. We can do this with del, again pointing to China inside square brackets. If we print country_population again, China is no longer in our dictionary.
del country_population['China']
print(country_population)
output:
{'India': 1393409030, 'Burma': 54806010, 'Thailand': 69950840, 'Singapore': 5453570}
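Besides del, dictionaries also offer the pop() method, which removes a key and hands back its value. The sketch below is an alternative to the del statement used above, not a replacement for it; 'Japan' is again only a hypothetical missing key.

```python
country_population = {'India': 1393409030, 'Burma': 54806010,
                      'Thailand': 69950840, 'Singapore': 5453570,
                      'China': 1412000000}

# pop() removes the pair and returns the value that was stored.
removed = country_population.pop('China')
print(removed)

# With a second argument, pop() returns that default for a missing key
# instead of raising a KeyError (which del would do).
missing = country_population.pop('Japan', None)
print(missing)
print(country_population)
```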
List vs Dictionary
Using lists and dictionaries is pretty similar. We can select, update and remove values with square brackets. There are some big differences though. The list is a sequence of values that are indexed by a range of numbers. The dictionary, on the other hand, is indexed by unique keys.
| | List | Dictionary |
---|---|---|
Select, update, remove | use [] | use [] |
Indexed by | range of numbers | unique keys |
Use | for a collection of values where order matters and entire subsets are selected | as a lookup table with unique keys |
When to use which one? Well, if we have a collection of values where the order matters, and we want to easily select entire subsets of data, we’ll want to go with a list.
If, on the other hand, we need some sort of look up table, where looking for data should be fast and where we can specify unique keys, a dictionary is the preferred option.
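To make the comparison concrete, here is a short sketch using the country data from earlier: the list is sliced by position to get a subset, and the dictionary is queried by key.

```python
countries = ['India', 'Burma', 'Thailand', 'Singapore']
capitals = {'India': 'New Delhi', 'Burma': 'Yangon',
            'Thailand': 'Bangkok', 'Singapore': 'Singapore'}

# List: order matters, and a slice selects an entire subset by position.
first_two = countries[:2]
print(first_two)

# Dictionary: a lookup table, addressed by a unique key rather than a position.
thai_capital = capitals['Thailand']
print(thai_capital)
```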
Nested Dictionaries
Remember lists? They can contain anything, even other lists. Well, the same holds for dictionaries. Dictionaries can contain key:value pairs where the values are again dictionaries.
As an example, have a look at the following code, which shows another version of asia, the dictionary we’ve been working with all along. The keys are still the country names, but the values are dictionaries that contain more information than just the capital.
# Dictionary of dictionaries
asia = {'India': {'capital':'New Delhi', 'population':1393409030},
'Burma': {'capital':'Yangon', 'population':54806010},
'Thailand': {'capital':'Bangkok', 'population':69950840},
'Singapore': {'capital':'Singapore', 'population':5453570},
}
It’s perfectly possible to chain square brackets to select elements. To fetch the population for Burma from asia, we write:
print(asia['Burma']['population'])
output:
54806010
- Use chained square brackets to select and print out the capital of Burma.
# Print out the capital of Burma
print(asia['Burma']['capital'])
output:
Yangon
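Nested dictionaries can be extended and traversed in the same way as flat ones. The sketch below adds China as a new inner dictionary (using the figures that appear elsewhere in this article) and then loops over all countries with the items() method.

```python
asia = {'India': {'capital': 'New Delhi', 'population': 1393409030},
        'Burma': {'capital': 'Yangon', 'population': 54806010},
        'Thailand': {'capital': 'Bangkok', 'population': 69950840},
        'Singapore': {'capital': 'Singapore', 'population': 5453570}}

# The value assigned to the new key is itself a dictionary.
asia['China'] = {'capital': 'Beijing', 'population': 1412360000}

# items() yields each country name together with its inner dictionary.
for country, info in asia.items():
    print(country, info['capital'], info['population'])
```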
Great! It’s time to learn about a new data structure!
Tabular dataset examples
As data scientists, we’ll often be working with tons of data. The form of this data can vary greatly, but we can usually boil it down to a tabular structure, that is, a table like in a spreadsheet. Let’s have a look at some examples.
Suppose we’re working in a chemical factory and have a ton of temperature measurements to analyze. This data can come in the following form:
temperature | measured at | location |
---|---|---|
76 | 2021-03-01 12:00:01 | chamber 1 |
86 | 2021-03-01 12:00:01 | chamber 2 |
72 | 2021-03-01 12:00:01 | chamber 1 |
88 | 2021-03-01 12:00:01 | chamber 2 |
- every row is a measurement, or an observation, and
- columns are different variables.
For each measurement, there is the temperature, but also the date and time of the measurement, and the location.
Another example: we have information on India, Burma, Thailand and so on. We can again build a table with this data.
Country | Capital | Population |
---|---|---|
India | New Delhi | 1393409030 |
Burma | Yangon | 54806010 |
Thailand | Bangkok | 69950840 |
Singapore | Singapore | 5453570 |
China | Beijing | 1412360000 |
Each row is an observation and represents a country. Each observation has the same variables: the country name, the capital and the population.
Datasets in Python
To start working on this data in Python, we’ll need some kind of rectangular data structure. How about the 2D NumPy array? Well, it’s an option, but not necessarily the best one. There are different data types and NumPy arrays are not great at handling these.
Datasets containing different data types
In the above data, the country and capital are string types while the population is an integer type. Our datasets will typically comprise different data types, so we need a tool that’s better suited. To easily and efficiently handle this data, there’s the Pandas package.
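To see why mixed types are awkward in a NumPy array, consider this small sketch. It assumes NumPy and pandas are installed; the two rows are taken from the country table above. NumPy stores every element of an array with one common type, so the populations end up as strings, whereas a DataFrame keeps a separate type per column.

```python
import numpy as np
import pandas as pd

rows = [['India', 'New Delhi', 1393409030],
        ['Burma', 'Yangon', 54806010]]

# One dtype for the whole array: the integers are converted to strings.
arr = np.array(rows)
print(arr.dtype)
print(type(arr[0, 2]))

# A DataFrame infers a dtype for each column separately.
df = pd.DataFrame(rows, columns=['country', 'capital', 'population'])
print(df.dtypes)
```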
Pandas
Pandas is
- an open source library,
- built on the NumPy package,
- a provider of easy-to-use data structures, and
- a high-level data manipulation tool,
making it very interesting for data scientists all over the world. In pandas, we store tabular data in an object called a DataFrame. Have a look at the Pandas DataFrame version of the data:
DataFrame
| | Country | Capital | Population |
---|---|---|---|
IND | India | New Delhi | 1393409030 |
MMR | Myanmar | Yangon | 54806010 |
THA | Thailand | Bangkok | 69950840 |
SGP | Singapore | Singapore | 5453570 |
CHN | China | Beijing | 1412360000 |
The rows represent the observations, and the columns represent the variables. Also notice that each row has a unique row label: IND for India, MMR for Myanmar, and so on. The columns, or variables, also have labels: country, capital, and so on. Notice that the values in the different columns have different types. But how can we create this DataFrame in the first place? Well, there are different ways.
Create a DataFrame from Dictionary
First of all, we can build it manually, starting from a dictionary. Using the distinctive curly brackets, we create key value pairs. The keys are the column labels, and the values are the corresponding columns, in list form.
asia_dict = {
'country':['India', 'Myanmar', 'Thailand', 'Singapore', 'China'],
'capital':['New Delhi', 'Yangon', 'Bangkok', 'Singapore', 'Beijing'],
'population':[1393409030,54806010,69950840, 5453570, 1412360000]
}
After importing the pandas package as pd, we can create a DataFrame from the dictionary using pd.DataFrame.
import pandas as pd
asia_df = pd.DataFrame(asia_dict)
print(type(asia_df))
print(asia_df)
output:
<class 'pandas.core.frame.DataFrame'>
country capital population
0 India New Delhi 1393409030
1 Myanmar Yangon 54806010
2 Thailand Bangkok 69950840
3 Singapore Singapore 5453570
4 China Beijing 1412360000
If we check out asia_df now, we see that Pandas assigned some automatic row labels, 0 up to 4. To specify them manually, we can set the index attribute of asia_df to a list with the correct labels.
asia_df.index = ['IND', 'MMR', 'THA', 'SGP', 'CHN']
print(asia_df)
output:
country capital population
IND India New Delhi 1393409030
MMR Myanmar Yangon 54806010
THA Thailand Bangkok 69950840
SGP Singapore Singapore 5453570
CHN China Beijing 1412360000
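As an alternative to assigning to the index attribute afterwards, the row labels can also be supplied when the DataFrame is built. A minimal sketch using the index parameter of pd.DataFrame, with the same dictionary as above:

```python
import pandas as pd

asia_dict = {
    'country': ['India', 'Myanmar', 'Thailand', 'Singapore', 'China'],
    'capital': ['New Delhi', 'Yangon', 'Bangkok', 'Singapore', 'Beijing'],
    'population': [1393409030, 54806010, 69950840, 5453570, 1412360000],
}

# The index list must have one label per row, in row order.
asia_df = pd.DataFrame(asia_dict, index=['IND', 'MMR', 'THA', 'SGP', 'CHN'])

print(asia_df.index.tolist())
print(asia_df)
```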
The resulting asia_df DataFrame is the same one as we saw before. Using a dictionary approach is fine, but what if we’re working with tons of data, which is typically the case for a data scientist? Well, we won’t build the DataFrame manually. Instead, we import data from an external file that contains all this data.
Create a DataFrame from CSV file
Suppose the countries’ data that we used before comes in the form of a CSV file called countries.csv. CSV is short for comma-separated values. The countries.csv file used in this article can be downloaded at this link.
Let’s try to import this data using the Pandas read_csv function. We pass the path to the CSV file as an argument.
# Raw string (r'...') so that backslashes such as \t are not treated as escape sequences
countries = pd.read_csv(r'path\to\countries.csv')
print(countries)
output:
Unnamed: 0 country capital population
0 IND India New Delhi 1393409030
1 MMR Myanmar Yangon 54806010
2 THA Thailand Bangkok 69950840
3 SGP Singapore Singapore 5453570
4 CHN China Beijing 1412360000
If we print countries, there’s still something wrong. The row labels are seen as a column. To solve this, we’ll have to tell the read_csv function that the first column contains the row labels. We do this by setting the index_col argument, like this.
# Raw string (r'...') so that backslashes such as \t are not treated as escape sequences
countries = pd.read_csv(r'path\to\countries.csv', index_col=0)
print(countries)
output:
country capital population
IND India New Delhi 1393409030
MMR Myanmar Yangon 54806010
THA Thailand Bangkok 69950840
SGP Singapore Singapore 5453570
CHN China Beijing 1412360000
This time countries nicely contains the row and column labels. The read_csv function features many more arguments that allow us to customize our data importing. Check out its documentation for more details.
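To try the effect of index_col without downloading anything, the CSV content can be simulated in memory. In this sketch the text in csv_text is an assumption about how countries.csv is laid out (an unnamed first column holding the row labels), and io.StringIO stands in for the file.

```python
import io
import pandas as pd

# Assumed layout of countries.csv: the first header cell is empty.
csv_text = """,country,capital,population
IND,India,New Delhi,1393409030
MMR,Myanmar,Yangon,54806010
THA,Thailand,Bangkok,69950840
"""

# Without index_col, the labels become an ordinary column named 'Unnamed: 0'.
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# With index_col=0, the first column is used as the row labels.
countries = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(countries)
```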
Indexing and selecting data in DataFrames
Indexing makes it easy to access columns, rows and single elements in our DataFrame. There are numerous ways in which we can index and select data from DataFrames. We’re going to see how to use
- square brackets [], and
- the advanced data access methods loc and iloc,
that make Pandas extra powerful.
Access data using square brackets [ ]
Suppose that we only want to select the country column from countries. How do we do this with square brackets? Well, we type countries, and then the column label inside square brackets. Python prints out the entire column, together with the row labels.
print(countries['country'])
output:
IND India
MMR Myanmar
THA Thailand
SGP Singapore
CHN China
Name: country, dtype: object
But there’s something strange here. The last line says Name: country, dtype: object. We’re clearly not dealing with a regular DataFrame here. Let’s find out the type of the object that gets returned, with the type function as follows.
print(type(countries['country']))
output:
<class 'pandas.core.series.Series'>
So we’re dealing with a Pandas Series here. In a simplified sense, we can think of the Series as a 1-dimensional array that can be labeled, just like the DataFrame. If we put together a bunch of Series, we can create a DataFrame.
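The idea that a DataFrame is a bunch of Series put together can be sketched as follows. The example builds two labeled Series by hand and joins them side by side with pd.concat; the variable names are only illustrative.

```python
import pandas as pd

labels = ['IND', 'MMR', 'THA']

# Each Series is a labeled 1-dimensional array; its name becomes the column label.
country = pd.Series(['India', 'Myanmar', 'Thailand'], index=labels, name='country')
capital = pd.Series(['New Delhi', 'Yangon', 'Bangkok'], index=labels, name='capital')

# axis=1 places the Series next to each other as columns.
combined = pd.concat([country, capital], axis=1)

print(type(combined))
print(combined)
```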
If we want to select the country column but keep the data in a DataFrame, we’ll need double square brackets, like this.
print(countries[['country']])
output:
country
IND India
MMR Myanmar
THA Thailand
SGP Singapore
CHN China
If we check the type of this result, we see that it is a DataFrame.
print(type(countries[['country']]))
output:
<class 'pandas.core.frame.DataFrame'>
Note that the single bracket version gives a Pandas Series, while the double bracket version gives a Pandas DataFrame.
We can easily extend this call to select two columns, country and capital, for example. If we look at it from a different angle, we’re actually putting a list with column labels inside another set of square brackets, and end up with a sub-DataFrame containing only the country and capital columns.
print(countries[['country', 'capital']])
output:
country capital
IND India New Delhi
MMR Myanmar Yangon
THA Thailand Bangkok
SGP Singapore Singapore
CHN China Beijing
We can also use the same square brackets to select rows from a DataFrame. The way to do it is by specifying a slice. To get the second and third rows of countries, we use the slice 1:3. Remember that the end of the slice is exclusive and that the index starts at zero.
print(countries[1:3])
output:
country capital population
MMR Myanmar Yangon 54806010
THA Thailand Bangkok 69950840
Square brackets work, but they only offer limited functionality. Ideally, we’d want something similar to 2D NumPy arrays.
To do a similar thing with Pandas, we have two options.
- loc is label-based, which means that we have to specify rows and columns based on their row and column labels.
- iloc is integer-position based, which means that we have to specify rows and columns by their integer positions.
Let’s start with loc.
Access data using loc
Let’s have another look at the countries DataFrame, and try to get the row for Myanmar. We put the label of the row of interest in square brackets after loc.
print(countries.loc['MMR'])
output:
country Myanmar
capital Yangon
population 54806010
Name: MMR, dtype: object
We get a Pandas Series, containing all the row’s information, rather inconveniently shown on different lines.
To get a DataFrame, we have to put the 'MMR' string inside another pair of brackets.
print(countries.loc[['MMR']])
output:
country capital population
MMR Myanmar Yangon 54806010
Selecting Rows using loc
We can also select multiple rows at the same time. Suppose we want to also include India and Thailand. Simply add some more row labels to the list.
print(countries.loc[['MMR', 'IND', 'THA']])
output:
country capital population
MMR Myanmar Yangon 54806010
IND India New Delhi 1393409030
THA Thailand Bangkok 69950840
So far we have only selected entire rows, which is something we could also do with the basic square brackets. The difference here is that we can extend our selection with a comma and a specification of the columns of interest.
Selecting Rows & Columns using loc
Let’s extend the previous call to only include the country and capital columns. We add a comma, and a list of column labels we want to keep.
print(countries.loc[['MMR', 'IND', 'THA'], ['country', 'capital']])
output:
country capital
MMR Myanmar Yangon
IND India New Delhi
THA Thailand Bangkok
Only the intersection of the selected rows and columns gets returned.
Selecting Columns using loc
We can also use loc to select all rows but only specific columns. Simply replace the first list that specifies the row labels with a colon, a slice going from beginning to end.
print(countries.loc[:, ['country', 'capital']])
output:
country capital
IND India New Delhi
MMR Myanmar Yangon
THA Thailand Bangkok
SGP Singapore Singapore
CHN China Beijing
This time, the result contains all rows, but only two columns.
So, let’s take a step back. Simple square brackets, as in countries[['country', 'capital']], work fine if we want to get columns. To get rows, we can use slicing, as in countries[1:4].
- row access: countries[1:4]
- column access: countries[['country', 'capital']]
The loc indexer is more versatile: we can select rows, columns, but also rows and columns at the same time. When we use loc, subsetting becomes remarkably simple.
- row access: countries.loc[['MMR', 'IND', 'THA']]
- column access: countries.loc[:, ['country', 'capital']]
- row and column access: countries.loc[['MMR', 'IND', 'THA'], ['country', 'capital']]
The only difference is that we use labels with loc, not the positions of the elements. If we want to subset Pandas DataFrames based on their position, or index, we’ll need the iloc indexer.
Access data using iloc
With loc, we use the 'MMR' string in double square brackets to get a DataFrame, like this.
print(countries.loc[['MMR']])
output:
country capital population
MMR Myanmar Yangon 54806010
With iloc, we use the integer position 1 instead of the label MMR. The results are the same as with loc: single brackets return a Series, and double brackets return a DataFrame.
# return Series type
print(countries.iloc[1])
output:
country Myanmar
capital Yangon
population 54806010
Name: MMR, dtype: object
# return DataFrame type
print(countries.iloc[[1]])
output:
country capital population
MMR Myanmar Yangon 54806010
Selecting Rows using iloc
To get the rows for Myanmar, India and Thailand, the code looks like this when using loc:
print(countries.loc[['MMR', 'IND', 'THA']])
output:
country capital population
MMR Myanmar Yangon 54806010
IND India New Delhi 1393409030
THA Thailand Bangkok 69950840
With iloc, we can use a list of integer positions (in the order we want) to get the same result.
print(countries.iloc[[1,0,2]])
output:
country capital population
MMR Myanmar Yangon 54806010
IND India New Delhi 1393409030
THA Thailand Bangkok 69950840
Selecting Rows & Columns using iloc
To keep only the country and capital columns, we did the following with loc:
print(countries.loc[['IND', 'MMR', 'THA'],['country', 'capital']])
output:
country capital
IND India New Delhi
MMR Myanmar Yangon
THA Thailand Bangkok
With iloc, we put the indexes 0 and 1 in a list after the comma, referring to the country and capital columns.
print(countries.iloc[[0, 1, 2], [0, 1]])
output:
country capital
IND India New Delhi
MMR Myanmar Yangon
THA Thailand Bangkok
Selecting Columns using iloc
Finally, we can keep all rows and keep only the country and capital columns in a similar fashion. With loc, this is how it’s done.
print(countries.loc[:,['country', 'capital']])
output:
country capital
IND India New Delhi
MMR Myanmar Yangon
THA Thailand Bangkok
SGP Singapore Singapore
CHN China Beijing
For iloc, it’s like this.
print(countries.iloc[:,[0,1]])
output:
country capital
IND India New Delhi
MMR Myanmar Yangon
THA Thailand Bangkok
SGP Singapore Singapore
CHN China Beijing
loc and iloc are pretty similar; the only difference is how we refer to columns and rows. We aced indexing and selecting data from Pandas DataFrames!
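One practical consequence of referring to rows by label versus by position shows up with slices. As a small sketch on a rebuilt copy of the countries data: a label slice with loc includes its end label, while a position slice with iloc excludes its end position, just like regular Python slicing.

```python
import pandas as pd

countries = pd.DataFrame(
    {'country': ['India', 'Myanmar', 'Thailand', 'Singapore', 'China'],
     'capital': ['New Delhi', 'Yangon', 'Bangkok', 'Singapore', 'Beijing']},
    index=['IND', 'MMR', 'THA', 'SGP', 'CHN'])

# loc slices by label and includes both endpoints.
by_label = countries.loc['MMR':'SGP']
print(by_label.index.tolist())

# iloc slices by position and excludes the end position.
by_position = countries.iloc[1:3]
print(by_position.index.tolist())
```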
Update data in a DataFrame
Updating data in a DataFrame is similar to selecting data from one. First we select the data we want to update, and then we assign new data to it. In the following, we will update the country name Myanmar to Myanmar(Burma). Note that we can do it using loc or iloc.
# Before updating data
print(countries)
output:
country capital population
IND India New Delhi 1393409030
MMR Myanmar Yangon 54806010
THA Thailand Bangkok 69950840
SGP Singapore Singapore 5453570
CHN China Beijing 1412360000
- Change Myanmar to Myanmar(Burma)
# Update data using loc
countries.loc[['MMR'], ['country']] = 'Myanmar(Burma)'
print(countries)
output:
country capital population
IND India New Delhi 1393409030
MMR Myanmar(Burma) Yangon 54806010
THA Thailand Bangkok 69950840
SGP Singapore Singapore 5453570
CHN China Beijing 1412360000
- Change Myanmar(Burma) to Myanmar
# Update data using iloc
countries.iloc[[1], [0]] = 'Myanmar'
print(countries)
output:
country capital population
IND India New Delhi 1393409030
MMR Myanmar Yangon 54806010
THA Thailand Bangkok 69950840
SGP Singapore Singapore 5453570
CHN China Beijing 1412360000
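When only one cell changes, the lists around the labels are not required. The sketch below, on a rebuilt three-row copy of the data, updates a single value with scalar labels in loc and then with the at accessor, which pandas provides for reading or writing exactly one value by label.

```python
import pandas as pd

countries = pd.DataFrame(
    {'country': ['India', 'Myanmar', 'Thailand'],
     'capital': ['New Delhi', 'Yangon', 'Bangkok']},
    index=['IND', 'MMR', 'THA'])

# One row label and one column label address exactly one cell.
countries.loc['MMR', 'country'] = 'Myanmar(Burma)'
print(countries.loc['MMR', 'country'])

# at is the label-based accessor for a single value.
countries.at['MMR', 'country'] = 'Myanmar'
print(countries.at['MMR', 'country'])
```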
Delete data in DataFrame
While cleaning a dataset, we might want to remove some rows of data from a DataFrame. We can do it by using the drop method on the DataFrame. Let’s try to remove the China row from the DataFrame.
# Before delete/drop data
print(countries)
output:
country capital population
IND India New Delhi 1393409030
MMR Myanmar(Burma) Yangon 54806010
THA Thailand Bangkok 69950840
SGP Singapore Singapore 5453570
CHN China Beijing 1412360000
# we pass ['CHN'], telling drop which row/column labels to remove
# axis=0 means we want to drop row(s)
# inplace=True means the drop modifies the original DataFrame
countries.drop(['CHN'], axis=0, inplace=True)
print(countries)
output:
country capital population
IND India New Delhi 1393409030
MMR Myanmar(Burma) Yangon 54806010
THA Thailand Bangkok 69950840
SGP Singapore Singapore 5453570
Printing countries shows that the row we wanted to remove is no longer in the countries DataFrame. Next, let’s try to remove the population column from the DataFrame.
# we pass ["population"], telling drop which row/column labels to remove
# axis=1 means we want to drop column(s)
# inplace=True means the drop modifies the original DataFrame
countries.drop(["population"], axis=1, inplace=True)
print(countries)
output:
country capital
IND India New Delhi
MMR Myanmar(Burma) Yangon
THA Thailand Bangkok
SGP Singapore Singapore
As we expected, the population column is dropped from the DataFrame.
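If the original data should stay untouched, drop can be called without inplace=True, in which case it returns a new DataFrame. The following sketch also uses the index and columns keyword arguments of drop, which are an alternative to specifying axis; the DataFrame is a rebuilt copy of the data used above.

```python
import pandas as pd

countries = pd.DataFrame(
    {'country': ['India', 'Myanmar', 'China'],
     'capital': ['New Delhi', 'Yangon', 'Beijing'],
     'population': [1393409030, 54806010, 1412360000]},
    index=['IND', 'MMR', 'CHN'])

# Without inplace=True, drop returns a modified copy.
smaller = countries.drop(index=['CHN'], columns=['population'])

print(smaller)
print(countries.shape)  # the original still has all rows and columns
```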
Add data to DataFrame
What if we want to add data to a DataFrame? We can do it using square brackets []. Let’s try to add back the population data we dropped in the previous section.
# before adding data
print(countries)
output:
country capital
IND India New Delhi
MMR Myanmar(Burma) Yangon
THA Thailand Bangkok
SGP Singapore Singapore
# Add population column data
# the length of the column data needs to be the same as the number of rows in the DataFrame
countries["population"] = [1393409030, 54806010, 69950840, 5453570]
print(countries)
output:
country capital population
IND India New Delhi 1393409030
MMR Myanmar(Burma) Yangon 54806010
THA Thailand Bangkok 69950840
SGP Singapore Singapore 5453570
Great! Do note that pandas does not know which population value belongs to which country and will add the values in the order we give. Now, let’s add our China row back to the countries DataFrame. Since the new row has the index label CHN, we add it using loc.
countries.loc['CHN'] = ['China', 'Beijing', 1412360000]
print(countries)
output:
country capital population
IND India New Delhi 1393409030
MMR Myanmar(Burma) Yangon 54806010
THA Thailand Bangkok 69950840
SGP Singapore Singapore 5453570
CHN China Beijing 1412360000
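The earlier caveat that a plain list is added purely by position can be sidestepped by assigning a labeled Series instead. In this sketch, which rebuilds the four-row DataFrame without its population column, the Series values are deliberately given in a different order; pandas matches them to rows by index label.

```python
import pandas as pd

countries = pd.DataFrame(
    {'country': ['India', 'Myanmar', 'Thailand', 'Singapore'],
     'capital': ['New Delhi', 'Yangon', 'Bangkok', 'Singapore']},
    index=['IND', 'MMR', 'THA', 'SGP'])

# The order here differs from the row order of the DataFrame.
population = pd.Series({'SGP': 5453570, 'IND': 1393409030,
                        'THA': 69950840, 'MMR': 54806010})

# Assigning a Series aligns its values on the index labels.
countries['population'] = population
print(countries)
```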
Super! We have now seen how to create, select, add, update and delete data in Python dictionaries and Pandas DataFrames.