I recently completed the US States Game project from the Python Pro Bootcamp course on Udemy, in which we used the Pandas library to read data from a .csv file and build the game's functionality.
Since there is a lot you can do with Pandas, I wanted to dive into a few of the things I learned about it and solidify the concepts I learned. I also want to make note that Pandas has really great documentation and I encourage you to check it out: https://pandas.pydata.org/docs
Installing & Importing Pandas into Your Project
Installing and importing Pandas is easy. For this post, I will be working in PyCharm.
Start by importing Pandas into your project by writing the following line of code at the top of your file:
import pandas
In PyCharm, you may see a red squiggly line under pandas, which means the Pandas package needs to be installed. On macOS, you can click the red light bulb that pops up when you hover over it and select ‘Install package pandas’. On Windows, you can install the package by opening the ‘Python Packages’ window, searching for Pandas, and installing the current stable version (this can be found in the docs). At the time of this writing, the current version is 2.1.4.
Note: We can also write import pandas as pd so that we can use the shorthand pd in our code instead of writing out pandas each time we use it.
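As a small illustration of the alias, here is a sketch that builds a tiny DataFrame through pd. The scores data is made up for this example and is not part of the project.

```python
import pandas as pd

# Hypothetical sample data, used only to show the alias in action
scores = pd.DataFrame({"name": ["Ada", "Grace"], "score": [90, 95]})

# shape is a (rows, columns) tuple
print(scores.shape)
```

Either spelling works; pd is simply the conventional shorthand you will see in most Pandas examples.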
Handling Import Errors
I did not come across any errors using Pandas in PyCharm, but when I opened my project in VS Code, I ran into errors when running the program. This was because VS Code was using the system Python interpreter rather than a virtual environment. It is recommended to use Pandas from a virtual environment (venv) rather than the system environment. I won’t go into great detail here, but you’ll want to make sure your Python interpreter is set up in a virtual environment, as recommended in the documentation.
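If you are not sure which interpreter your editor is running, a quick check like the one below can help. This is only a sketch using the standard library; the variable names are my own, and it assumes a virtual environment created with venv (or a recent virtualenv).

```python
import importlib.util
import sys

# Path of the interpreter that is running this script
print(sys.executable)

# Inside a venv, sys.prefix points at the environment and differs from sys.base_prefix
in_venv = sys.prefix != sys.base_prefix
print("virtual environment:", in_venv)

# find_spec returns None when the package cannot be imported from this interpreter
pandas_available = importlib.util.find_spec("pandas") is not None
print("pandas installed:", pandas_available)
```

If the printed path is not the interpreter inside your project's environment, switching the interpreter in your editor's settings is usually the fix.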
Basic Operations
Creating a DataFrame
Pandas uses a data structure called a DataFrame, which is a two-dimensional, table-like structure with labeled rows and columns. You can think of it as a dictionary in which each key is a column name and each value is that column's data. You can create a DataFrame manually or from a .csv file.
To create a DataFrame manually, we can start by creating a dictionary of data:
user_list = {
    "first name": ["Scott", "Kevin", "Johnny"],
    "last name": ["Woodruff", "Bong", "Cosmic"],
    "email address": ["scott@example.com", "kevin@example.com", "johnny@example.com"],
    "user id": [123, 234, 345]
}
Then we will create a variable called user_data and call pandas.DataFrame on our user_list. After that, we can call the to_csv() method on our user_data DataFrame object and pass in the name of the file we want to create:
user_data = pandas.DataFrame(user_list)
user_data.to_csv("user_data.csv")
After we hit run, we can see that there is now a .csv file called user_data in the same directory as our main.py file. Opening it up, we can see that our key names are the column names and each row is indexed. All of the values are separated by commas, as .csv stands for Comma-Separated Values.
We can also open it up in a spreadsheet and see the same values laid out in columns.
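The row index ends up in the file as an unnamed first column. If you would rather leave it out, to_csv() accepts index=False. Here is a rough sketch using a shortened version of the dictionary above; it calls to_csv() without a file name, which returns the CSV text instead of writing a file, so the difference is easy to see.

```python
import pandas as pd

# Shortened version of the user data, just for illustration
user_list = {
    "first name": ["Scott", "Kevin", "Johnny"],
    "user id": [123, 234, 345],
}
user_data = pd.DataFrame(user_list)

# With no path argument, to_csv() returns the CSV text as a string
with_index = user_data.to_csv()
without_index = user_data.to_csv(index=False)

print(with_index)
print(without_index)
```

The first printout starts each row with its index number; the second contains only the columns from the dictionary.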
Let’s say we already have a .csv file of data that we want to use. To start working with that data in your project, you can place the .csv file into your project folder and then use the read_csv() function. Here, I have a sample dataset of housing info that I grabbed from PyDataset and saved as a .csv file, just to demonstrate.
Note: you can access sample datasets with PyDataset by adding from pydataset import data to the top of your file and choosing the data you want to work with. You can find more info on how to do this here. It’s pretty easy.
To start working with our data, we need to call read_csv on our dataset file and save the result into a variable:
data = pandas.read_csv("housing_data.csv")
Make sure that the spelling and file path match the file exactly; otherwise, you’ll get an error.
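For a local file that doesn't exist, the error Pandas raises is Python's built-in FileNotFoundError. A minimal sketch of catching it, assuming a made-up file name that is not in the project folder:

```python
import pandas as pd

try:
    # Hypothetical file name that does not exist, to trigger the error on purpose
    data = pd.read_csv("no_such_file.csv")
except FileNotFoundError as error:
    data = None
    print("Could not open the file:", error)
```

Whether to catch the error or just let the program stop is up to you; for a small script, reading the traceback is often enough to spot the typo.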
Now that we have our housing data in a variable, we can work with it. You’ll notice that if you print data, the output shows a nice table with the column names and values.
You can confirm that we’ve created a DataFrame by checking the type of data; the output in the console says it’s a pandas.core.frame.DataFrame.
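Here is a sketch of a few quick ways to inspect a DataFrame. Since the housing file isn't included with this post, the example uses three made-up rows with the same column names as a stand-in.

```python
import pandas as pd

# Hypothetical stand-in for housing_data.csv: three made-up rows
data = pd.DataFrame({
    "price": [42000.0, 38500.0, 49500.0],
    "bedrooms": [3, 2, 3],
    "bathrms": [1, 1, 1],
})

print(type(data))                      # the class of the object
print(isinstance(data, pd.DataFrame))  # a check that doesn't depend on the printed name
print(data.shape)                      # (rows, columns)
print(list(data.columns))              # column names
print(data.head(2))                    # first two rows
```

head() is handy with larger files, since it shows only the first few rows instead of the whole table.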
Accessing Data by Column or Row
Each column in a DataFrame is called a Series, and there are loads of different things you can do with them. There is a whole section in the docs dedicated to Series here.
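Pandas exposes column names as attributes, so we can access a whole column using dot notation after our DataFrame variable name. This works only when the column name is a valid Python identifier and doesn't clash with an existing DataFrame attribute; otherwise, use bracket notation such as data["price"]. In our case, if we want to get only the column of prices, we can write: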
print(data.price)
This will return just the items in the price column:
To access just a particular row of data, we’ll use brackets after our variable and, inside them, compare a column to the value we’re looking for using ==. Pandas returns every row where that comparison is true. For example:
print(data[data.price == 38500.0])
This will return the row of data where the price column equals 38500.0, as indicated in the code.
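To see what happens under the hood, the sketch below splits the lookup into two steps. It uses a few made-up rows as a stand-in for the housing data, and the names mask and matches are my own.

```python
import pandas as pd

# Hypothetical stand-in for the housing data
data = pd.DataFrame({
    "price": [42000.0, 38500.0, 49500.0],
    "bedrooms": [3, 2, 3],
})

# The comparison produces a boolean Series: one True/False per row
mask = data["price"] == 38500.0
print(mask.tolist())

# Indexing with the mask keeps only the rows marked True
matches = data[mask]
print(matches)
```

Because the result is itself a DataFrame, it can contain several rows if more than one home has that price.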
Filtering and manipulating data
Now that we know how to obtain specific columns and rows, let’s use that to filter our data. Let’s say we want to find the lowest and highest home prices in this list. We can use the min() and max() methods for this.
print(data.price.max())
print(data.price.min())
OUTPUT:
190000.0
25000.0
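min() and max() give back only the price itself. If you also want the rest of the row for that home, one option is idxmin() and idxmax() combined with .loc. This is a sketch on made-up rows, not the real housing file.

```python
import pandas as pd

# Hypothetical stand-in for the housing data
data = pd.DataFrame({
    "price": [42000.0, 25000.0, 190000.0],
    "bedrooms": [3, 2, 4],
})

# idxmin()/idxmax() return the index label of the smallest/largest value
cheapest_label = data["price"].idxmin()
priciest_label = data["price"].idxmax()

# .loc looks up a full row by its index label
print(data.loc[cheapest_label])
print(data.loc[priciest_label])
```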
Now say we are looking for a house and would like to have a list of only the homes that fit our budget and criteria. Let’s create a list that only contains homes that are under $50,000 and have 2 bedrooms and 1 bathroom.
Step 1: set the criteria as variables
max_price = 50000.0
num_bedrooms = 2
num_bathrooms = 1
Step 2: Apply the filters
filtered_data = data[(data['price'] < max_price) &
(data['bedrooms'] == num_bedrooms) &
(data['bathrms'] == num_bathrooms)]
Step 3: Create new .csv file with the filtered values
filtered_data.to_csv("our_home_choices.csv")
Our new file will appear in our project folder. We can open it to see that it includes only homes that are under $50,000 and have 2 bedrooms and 1 bathroom:
We can also check to see how many records there are in our new list by running the following line of code:
print(len(filtered_data))
In this case, it returns 61. So we can see that we have 61 homes to choose from with the criteria that we set. This can be narrowed down further in the same way by adding conditions on other columns.
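The filter above uses ==, so it matches homes with exactly 2 bedrooms and 1 bathroom. If you want "at least" instead, you can swap in >=. Here is a sketch on a handful of made-up rows; the variable names min_bedrooms, min_bathrooms, and at_least are my own.

```python
import pandas as pd

# Hypothetical stand-in for the housing data
data = pd.DataFrame({
    "price": [42000.0, 38500.0, 49500.0, 60500.0, 45000.0],
    "bedrooms": [3, 2, 3, 4, 1],
    "bathrms": [1, 1, 2, 1, 1],
})

max_price = 50000.0
min_bedrooms = 2
min_bathrooms = 1

# Each condition needs its own parentheses because & binds tighter than < and >=
at_least = data[(data["price"] < max_price) &
                (data["bedrooms"] >= min_bedrooms) &
                (data["bathrms"] >= min_bathrooms)]

print(len(at_least))
```

In this made-up data, the fourth home is over budget and the fifth has only one bedroom, so both are filtered out.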
Some Other Useful Methods
to_dict() Method
The to_dict() method is used to convert a DataFrame into a dictionary. This can be particularly useful when you need to transform your data into a format that is more compatible with certain types of processing. Through its orient parameter, the method offers various layouts for the result, such as organizing it by columns, by rows, or by index.
Example:
# Converting the entire DataFrame to a dictionary
data_dict = data.to_dict()
# Converting the DataFrame to a dictionary with a specific orientation
data_dict_oriented = data.to_dict(orient='records')  # Each row becomes a dictionary
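To make the orientations more concrete, here is a sketch on a two-row made-up DataFrame that prints three of them side by side.

```python
import pandas as pd

# Hypothetical two-row stand-in for the housing data
data = pd.DataFrame({
    "price": [42000.0, 38500.0],
    "bedrooms": [3, 2],
})

# Default orientation: {column name -> {index -> value}}
by_column = data.to_dict()
print(by_column)

# orient="records": a list with one dictionary per row
by_row = data.to_dict(orient="records")
print(by_row)

# orient="list": {column name -> list of values}
by_list = data.to_dict(orient="list")
print(by_list)
```

The "records" layout tends to be the most convenient when you want to loop over rows one at a time.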
to_list() Method
The to_list() method is used with Pandas Series objects to convert the values of the Series into a list. This is particularly useful when you need to extract the data from a DataFrame column for use in a context where a list is more appropriate, such as in loops, certain calculations, or data visualizations.
Example:
# Converting a DataFrame column to a list
price_list = data['price'].to_list()
# Using the list for further operations
average_price = sum(price_list) / len(price_list)
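For an average specifically, a Series also has a mean() method, so converting to a list first is optional. The sketch below compares the two approaches on made-up prices.

```python
import pandas as pd

# Hypothetical prices standing in for the housing data
data = pd.DataFrame({"price": [42000.0, 38500.0, 49500.0]})

# Plain Python list of the column's values
price_list = data["price"].to_list()
average_from_list = sum(price_list) / len(price_list)

# Series.mean() computes the same average without leaving Pandas
average_from_series = data["price"].mean()

print(price_list)
print(average_from_list, average_from_series)
```

The list version is still useful when the next step expects a regular Python list, such as a loop or a function from another library.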
Both of these methods, to_dict() and to_list(), are part of Pandas’ powerful suite of tools that make data manipulation and conversion simple and efficient, allowing for a smooth workflow between different data formats and structures.
Conclusion
I hope you enjoyed learning a bit about some of the things you can do with the Python Pandas library. This is just the tip of the iceberg, as there are many more capabilities to explore. I am excited to continue learning and sharing with you. I also encourage you to explore the documentation and experiment with your own datasets!
If you have any questions or if there’s anything else you’d like to know about Pandas, please don’t hesitate to reach out. I’ll do my best to write a post about it. Writing also helps me learn more!
Thanks for reading and happy coding!
Connect with me:
Twitter: @sarah_schlueter
Discord: sarahmarie73