Welcome to the world of data analysis with Pandas! If you’re new to programming or data analytics, don’t worry—this guide is designed to be simple, friendly, and hands-on. Pandas is a powerful Python library that makes working with data as easy as playing with a spreadsheet—but way more fun! By the end of this article, you’ll understand the essentials of Pandas, from loading data to analyzing it, and we’ll even build a small project together, complete with a simple graph. Let’s dive in!
What is Pandas?
Pandas is a Python library that helps you work with data in tables (called DataFrames) and lists (called Series). Think of it as a supercharged version of Excel or Google Sheets, where you can clean, explore, and analyze data with code. Whether you’re handling sales figures, survey results, or anything tabular, Pandas has your back.
To use Pandas, you’ll need Python installed on your computer. If you don’t have it yet, download it from python.org, then install Pandas by running this in your terminal or command prompt:
pip install pandaspip install pandaspip install pandas
Enter fullscreen mode Exit fullscreen mode
1. Getting Started with Pandas
Let’s kick things off by importing Pandas and creating some basic data structures.
<span># Import Pandas with the nickname 'pd' (a common shortcut) </span><span>import</span> <span>pandas</span> <span>as</span> <span>pd</span><span># Create a Series (a single column of data) </span><span>series</span> <span>=</span> <span>pd</span><span>.</span><span>Series</span><span>([</span><span>10</span><span>,</span> <span>20</span><span>,</span> <span>30</span><span>],</span> <span>index</span><span>=</span><span>[</span><span>'</span><span>a</span><span>'</span><span>,</span> <span>'</span><span>b</span><span>'</span><span>,</span> <span>'</span><span>c</span><span>'</span><span>])</span><span># Create a DataFrame (a table with rows and columns) </span><span>data</span> <span>=</span> <span>{</span><span>'</span><span>Name</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>Alice</span><span>'</span><span>,</span> <span>'</span><span>Bob</span><span>'</span><span>],</span> <span>'</span><span>Age</span><span>'</span><span>:</span> <span>[</span><span>25</span><span>,</span> <span>30</span><span>]}</span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>(</span><span>data</span><span>)</span><span># Show the results </span><span>print</span><span>(</span><span>"</span><span>Series:</span><span>"</span><span>)</span><span>print</span><span>(</span><span>series</span><span>)</span><span>print</span><span>(</span><span>"</span><span>\n</span><span>DataFrame:</span><span>"</span><span>)</span><span>print</span><span>(</span><span>df</span><span>)</span><span># Import Pandas with the nickname 'pd' (a common shortcut) </span><span>import</span> <span>pandas</span> <span>as</span> <span>pd</span> <span># Create a Series (a single column of data) </span><span>series</span> <span>=</span> <span>pd</span><span>.</span><span>Series</span><span>([</span><span>10</span><span>,</span> <span>20</span><span>,</span> <span>30</span><span>],</span> <span>index</span><span>=</span><span>[</span><span>'</span><span>a</span><span>'</span><span>,</span> <span>'</span><span>b</span><span>'</span><span>,</span> <span>'</span><span>c</span><span>'</span><span>])</span> <span># Create a DataFrame (a table with rows and columns) </span><span>data</span> <span>=</span> <span>{</span><span>'</span><span>Name</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>Alice</span><span>'</span><span>,</span> <span>'</span><span>Bob</span><span>'</span><span>],</span> <span>'</span><span>Age</span><span>'</span><span>:</span> <span>[</span><span>25</span><span>,</span> <span>30</span><span>]}</span> <span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>(</span><span>data</span><span>)</span> <span># Show the results </span><span>print</span><span>(</span><span>"</span><span>Series:</span><span>"</span><span>)</span> <span>print</span><span>(</span><span>series</span><span>)</span> <span>print</span><span>(</span><span>"</span><span>\n</span><span>DataFrame:</span><span>"</span><span>)</span> <span>print</span><span>(</span><span>df</span><span>)</span># Import Pandas with the nickname 'pd' (a common shortcut) import pandas as pd # Create a Series (a single column of data) series = pd.Series([10, 20, 30], index=['a', 'b', 'c']) # Create a DataFrame (a table with rows and columns) data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]} df = pd.DataFrame(data) # Show the results print("Series:") print(series) print("\nDataFrame:") print(df)
Enter fullscreen mode Exit fullscreen mode
Breakdown:
- Importing Pandas:
import pandas as pd
lets us use Pandas functions with the shorthandpd
. - Series: A Series is like a labeled list. Here, we gave it numbers (10, 20, 30) with labels (‘a’, ‘b’, ‘c’).
- DataFrame: A DataFrame is a table. We made one with two columns: ‘Name’ and ‘Age’.
- Output: The Series shows values with their indices, and the DataFrame looks like a neat table.
2. Loading and Saving Data
Data analysts often work with files like CSV or Excel. Pandas makes this a breeze.
<span># Create a small DataFrame </span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>A</span><span>'</span><span>:</span> <span>[</span><span>1</span><span>,</span> <span>2</span><span>],</span> <span>'</span><span>B</span><span>'</span><span>:</span> <span>[</span><span>3</span><span>,</span> <span>4</span><span>]})</span><span># Save it to a CSV file </span><span>df</span><span>.</span><span>to_csv</span><span>(</span><span>'</span><span>my_data.csv</span><span>'</span><span>,</span> <span>index</span><span>=</span><span>False</span><span>)</span> <span># 'index=False' skips row numbers </span><span># Load it back </span><span>loaded_df</span> <span>=</span> <span>pd</span><span>.</span><span>read_csv</span><span>(</span><span>'</span><span>my_data.csv</span><span>'</span><span>)</span><span>print</span><span>(</span><span>"</span><span>Loaded DataFrame:</span><span>"</span><span>)</span><span>print</span><span>(</span><span>loaded_df</span><span>)</span><span># Create a small DataFrame </span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>A</span><span>'</span><span>:</span> <span>[</span><span>1</span><span>,</span> <span>2</span><span>],</span> <span>'</span><span>B</span><span>'</span><span>:</span> <span>[</span><span>3</span><span>,</span> <span>4</span><span>]})</span> <span># Save it to a CSV file </span><span>df</span><span>.</span><span>to_csv</span><span>(</span><span>'</span><span>my_data.csv</span><span>'</span><span>,</span> <span>index</span><span>=</span><span>False</span><span>)</span> <span># 'index=False' skips row numbers </span> <span># Load it back </span><span>loaded_df</span> <span>=</span> <span>pd</span><span>.</span><span>read_csv</span><span>(</span><span>'</span><span>my_data.csv</span><span>'</span><span>)</span> <span>print</span><span>(</span><span>"</span><span>Loaded DataFrame:</span><span>"</span><span>)</span> <span>print</span><span>(</span><span>loaded_df</span><span>)</span># Create a small DataFrame df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]}) # Save it to a CSV file df.to_csv('my_data.csv', index=False) # 'index=False' skips row numbers # Load it back loaded_df = pd.read_csv('my_data.csv') print("Loaded DataFrame:") print(loaded_df)
Enter fullscreen mode Exit fullscreen mode
Breakdown:
- Saving:
to_csv()
writes the DataFrame to a file namedmy_data.csv
. - Loading:
read_csv()
reads the file back into a DataFrame. - Other Formats: You can also use
pd.read_excel()
,pd.read_json()
, ordf.to_excel()
for different file types.
3. Exploring Your Data
Before analyzing, you need to know what’s in your data.
<span># Sample DataFrame </span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>Name</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>Alice</span><span>'</span><span>,</span> <span>'</span><span>Bob</span><span>'</span><span>],</span> <span>'</span><span>Age</span><span>'</span><span>:</span> <span>[</span><span>25</span><span>,</span> <span>30</span><span>]})</span><span># See the first row </span><span>print</span><span>(</span><span>"</span><span>First row:</span><span>"</span><span>)</span><span>print</span><span>(</span><span>df</span><span>.</span><span>head</span><span>(</span><span>1</span><span>))</span><span># Check structure </span><span>print</span><span>(</span><span>"</span><span>\n</span><span>Info:</span><span>"</span><span>)</span><span>print</span><span>(</span><span>df</span><span>.</span><span>info</span><span>())</span><span># Get stats for numbers </span><span>print</span><span>(</span><span>"</span><span>\n</span><span>Stats:</span><span>"</span><span>)</span><span>print</span><span>(</span><span>df</span><span>.</span><span>describe</span><span>())</span><span># Sample DataFrame </span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>Name</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>Alice</span><span>'</span><span>,</span> <span>'</span><span>Bob</span><span>'</span><span>],</span> <span>'</span><span>Age</span><span>'</span><span>:</span> <span>[</span><span>25</span><span>,</span> <span>30</span><span>]})</span> <span># See the first row </span><span>print</span><span>(</span><span>"</span><span>First row:</span><span>"</span><span>)</span> <span>print</span><span>(</span><span>df</span><span>.</span><span>head</span><span>(</span><span>1</span><span>))</span> <span># Check structure </span><span>print</span><span>(</span><span>"</span><span>\n</span><span>Info:</span><span>"</span><span>)</span> <span>print</span><span>(</span><span>df</span><span>.</span><span>info</span><span>())</span> <span># Get stats for numbers </span><span>print</span><span>(</span><span>"</span><span>\n</span><span>Stats:</span><span>"</span><span>)</span> <span>print</span><span>(</span><span>df</span><span>.</span><span>describe</span><span>())</span># Sample DataFrame df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]}) # See the first row print("First row:") print(df.head(1)) # Check structure print("\nInfo:") print(df.info()) # Get stats for numbers print("\nStats:") print(df.describe())
Enter fullscreen mode Exit fullscreen mode
Breakdown:
-
head(1)
: Shows the first row (you can change the number). -
info()
: Lists columns, their data types (e.g., ‘object’ for text, ‘int64’ for numbers), and if anything’s missing. -
describe()
: Gives stats like average and max, but only for numeric columns like ‘Age’.
4. Selecting Data
Pandas lets you pick exactly what you need.
<span># Sample DataFrame </span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>Name</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>Alice</span><span>'</span><span>,</span> <span>'</span><span>Bob</span><span>'</span><span>],</span> <span>'</span><span>Age</span><span>'</span><span>:</span> <span>[</span><span>25</span><span>,</span> <span>30</span><span>]})</span><span># Pick one column </span><span>print</span><span>(</span><span>"</span><span>Names only:</span><span>"</span><span>)</span><span>print</span><span>(</span><span>df</span><span>[</span><span>'</span><span>Name</span><span>'</span><span>])</span><span># Pick rows where Age > 25 </span><span>print</span><span>(</span><span>"</span><span>\n</span><span>People over 25:</span><span>"</span><span>)</span><span>print</span><span>(</span><span>df</span><span>[</span><span>df</span><span>[</span><span>'</span><span>Age</span><span>'</span><span>]</span> <span>></span> <span>25</span><span>])</span><span># Sample DataFrame </span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>Name</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>Alice</span><span>'</span><span>,</span> <span>'</span><span>Bob</span><span>'</span><span>],</span> <span>'</span><span>Age</span><span>'</span><span>:</span> <span>[</span><span>25</span><span>,</span> <span>30</span><span>]})</span> <span># Pick one column </span><span>print</span><span>(</span><span>"</span><span>Names only:</span><span>"</span><span>)</span> <span>print</span><span>(</span><span>df</span><span>[</span><span>'</span><span>Name</span><span>'</span><span>])</span> <span># Pick rows where Age > 25 </span><span>print</span><span>(</span><span>"</span><span>\n</span><span>People over 25:</span><span>"</span><span>)</span> <span>print</span><span>(</span><span>df</span><span>[</span><span>df</span><span>[</span><span>'</span><span>Age</span><span>'</span><span>]</span> <span>></span> <span>25</span><span>])</span># Sample DataFrame df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]}) # Pick one column print("Names only:") print(df['Name']) # Pick rows where Age > 25 print("\nPeople over 25:") print(df[df['Age'] > 25])
Enter fullscreen mode Exit fullscreen mode
Breakdown:
- Column Selection:
df['Name']
grabs the ‘Name’ column as a Series. - Row Filtering:
df[df['Age'] > 25]
keeps only rows where the condition is true (here, just Bob).
5. Cleaning Data
Real data is messy—Pandas helps fix it.
<span># Import NumPy for missing values (NaN) </span><span>import</span> <span>numpy</span> <span>as</span> <span>np</span><span># DataFrame with missing and duplicate data </span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>A</span><span>'</span><span>:</span> <span>[</span><span>1</span><span>,</span> <span>np</span><span>.</span><span>nan</span><span>,</span> <span>3</span><span>],</span> <span>'</span><span>B</span><span>'</span><span>:</span> <span>[</span><span>4</span><span>,</span> <span>4</span><span>,</span> <span>5</span><span>]})</span><span># Fill missing values with 0 </span><span>df</span><span>[</span><span>'</span><span>A</span><span>'</span><span>]</span> <span>=</span> <span>df</span><span>[</span><span>'</span><span>A</span><span>'</span><span>].</span><span>fillna</span><span>(</span><span>0</span><span>)</span><span># Remove duplicate rows in 'B' </span><span>df</span> <span>=</span> <span>df</span><span>.</span><span>drop_duplicates</span><span>(</span><span>subset</span><span>=</span><span>'</span><span>B</span><span>'</span><span>)</span><span>print</span><span>(</span><span>"</span><span>Cleaned DataFrame:</span><span>"</span><span>)</span><span>print</span><span>(</span><span>df</span><span>)</span><span># Import NumPy for missing values (NaN) </span><span>import</span> <span>numpy</span> <span>as</span> <span>np</span> <span># DataFrame with missing and duplicate data </span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>A</span><span>'</span><span>:</span> <span>[</span><span>1</span><span>,</span> <span>np</span><span>.</span><span>nan</span><span>,</span> <span>3</span><span>],</span> <span>'</span><span>B</span><span>'</span><span>:</span> <span>[</span><span>4</span><span>,</span> <span>4</span><span>,</span> <span>5</span><span>]})</span> <span># Fill missing values with 0 </span><span>df</span><span>[</span><span>'</span><span>A</span><span>'</span><span>]</span> <span>=</span> <span>df</span><span>[</span><span>'</span><span>A</span><span>'</span><span>].</span><span>fillna</span><span>(</span><span>0</span><span>)</span> <span># Remove duplicate rows in 'B' </span><span>df</span> <span>=</span> <span>df</span><span>.</span><span>drop_duplicates</span><span>(</span><span>subset</span><span>=</span><span>'</span><span>B</span><span>'</span><span>)</span> <span>print</span><span>(</span><span>"</span><span>Cleaned DataFrame:</span><span>"</span><span>)</span> <span>print</span><span>(</span><span>df</span><span>)</span># Import NumPy for missing values (NaN) import numpy as np # DataFrame with missing and duplicate data df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 4, 5]}) # Fill missing values with 0 df['A'] = df['A'].fillna(0) # Remove duplicate rows in 'B' df = df.drop_duplicates(subset='B') print("Cleaned DataFrame:") print(df)
Enter fullscreen mode Exit fullscreen mode
Breakdown:
- Missing Values:
np.nan
is a missing value.fillna(0)
replaces it with 0. - Duplicates:
drop_duplicates()
keeps only the first occurrence of repeated values in ‘B’.
6. Manipulating Data
Change your data however you like.
<span># Sample DataFrame </span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>Sales</span><span>'</span><span>:</span> <span>[</span><span>100</span><span>,</span> <span>200</span><span>],</span> <span>'</span><span>Region</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>North</span><span>'</span><span>,</span> <span>'</span><span>South</span><span>'</span><span>]})</span><span># Add a new column </span><span>df</span><span>[</span><span>'</span><span>Tax</span><span>'</span><span>]</span> <span>=</span> <span>df</span><span>[</span><span>'</span><span>Sales</span><span>'</span><span>]</span> <span>*</span> <span>0.1</span><span># Sort by Sales </span><span>df</span> <span>=</span> <span>df</span><span>.</span><span>sort_values</span><span>(</span><span>'</span><span>Sales</span><span>'</span><span>)</span><span>print</span><span>(</span><span>"</span><span>Updated DataFrame:</span><span>"</span><span>)</span><span>print</span><span>(</span><span>df</span><span>)</span><span># Sample DataFrame </span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>Sales</span><span>'</span><span>:</span> <span>[</span><span>100</span><span>,</span> <span>200</span><span>],</span> <span>'</span><span>Region</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>North</span><span>'</span><span>,</span> <span>'</span><span>South</span><span>'</span><span>]})</span> <span># Add a new column </span><span>df</span><span>[</span><span>'</span><span>Tax</span><span>'</span><span>]</span> <span>=</span> <span>df</span><span>[</span><span>'</span><span>Sales</span><span>'</span><span>]</span> <span>*</span> <span>0.1</span> <span># Sort by Sales </span><span>df</span> <span>=</span> <span>df</span><span>.</span><span>sort_values</span><span>(</span><span>'</span><span>Sales</span><span>'</span><span>)</span> <span>print</span><span>(</span><span>"</span><span>Updated DataFrame:</span><span>"</span><span>)</span> <span>print</span><span>(</span><span>df</span><span>)</span># Sample DataFrame df = pd.DataFrame({'Sales': [100, 200], 'Region': ['North', 'South']}) # Add a new column df['Tax'] = df['Sales'] * 0.1 # Sort by Sales df = df.sort_values('Sales') print("Updated DataFrame:") print(df)
Enter fullscreen mode Exit fullscreen mode
Breakdown:
- New Column:
df['Tax']
creates a column based on a calculation. - Sorting:
sort_values('Sales')
arranges rows from low to high Sales.
7. Grouping and Aggregating
Summarize data by categories.
<span># Sample DataFrame </span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>Region</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>North</span><span>'</span><span>,</span> <span>'</span><span>South</span><span>'</span><span>,</span> <span>'</span><span>North</span><span>'</span><span>],</span> <span>'</span><span>Sales</span><span>'</span><span>:</span> <span>[</span><span>100</span><span>,</span> <span>150</span><span>,</span> <span>200</span><span>]})</span><span># Total sales by region </span><span>sales_by_region</span> <span>=</span> <span>df</span><span>.</span><span>groupby</span><span>(</span><span>'</span><span>Region</span><span>'</span><span>)[</span><span>'</span><span>Sales</span><span>'</span><span>].</span><span>sum</span><span>()</span><span>print</span><span>(</span><span>"</span><span>Sales by Region:</span><span>"</span><span>)</span><span>print</span><span>(</span><span>sales_by_region</span><span>)</span><span># Sample DataFrame </span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>Region</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>North</span><span>'</span><span>,</span> <span>'</span><span>South</span><span>'</span><span>,</span> <span>'</span><span>North</span><span>'</span><span>],</span> <span>'</span><span>Sales</span><span>'</span><span>:</span> <span>[</span><span>100</span><span>,</span> <span>150</span><span>,</span> <span>200</span><span>]})</span> <span># Total sales by region </span><span>sales_by_region</span> <span>=</span> <span>df</span><span>.</span><span>groupby</span><span>(</span><span>'</span><span>Region</span><span>'</span><span>)[</span><span>'</span><span>Sales</span><span>'</span><span>].</span><span>sum</span><span>()</span> <span>print</span><span>(</span><span>"</span><span>Sales by Region:</span><span>"</span><span>)</span> <span>print</span><span>(</span><span>sales_by_region</span><span>)</span># Sample DataFrame df = pd.DataFrame({'Region': ['North', 'South', 'North'], 'Sales': [100, 150, 200]}) # Total sales by region sales_by_region = df.groupby('Region')['Sales'].sum() print("Sales by Region:") print(sales_by_region)
Enter fullscreen mode Exit fullscreen mode
Breakdown:
-
groupby('Region')
: Groups rows by the ‘Region’ column. -
sum()
: Adds up ‘Sales’ for each group. North: 100 + 200 = 300; South: 150.
8. Merging Data
Combine multiple datasets.
<span># Two small DataFrames </span><span>df1</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>ID</span><span>'</span><span>:</span> <span>[</span><span>1</span><span>,</span> <span>2</span><span>],</span> <span>'</span><span>Name</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>Alice</span><span>'</span><span>,</span> <span>'</span><span>Bob</span><span>'</span><span>]})</span><span>df2</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>ID</span><span>'</span><span>:</span> <span>[</span><span>1</span><span>,</span> <span>3</span><span>],</span> <span>'</span><span>Score</span><span>'</span><span>:</span> <span>[</span><span>85</span><span>,</span> <span>90</span><span>]})</span><span># Merge them on 'ID' </span><span>merged</span> <span>=</span> <span>pd</span><span>.</span><span>merge</span><span>(</span><span>df1</span><span>,</span> <span>df2</span><span>,</span> <span>on</span><span>=</span><span>'</span><span>ID</span><span>'</span><span>,</span> <span>how</span><span>=</span><span>'</span><span>inner</span><span>'</span><span>)</span><span>print</span><span>(</span><span>"</span><span>Merged DataFrame:</span><span>"</span><span>)</span><span>print</span><span>(</span><span>merged</span><span>)</span><span># Two small DataFrames </span><span>df1</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>ID</span><span>'</span><span>:</span> <span>[</span><span>1</span><span>,</span> <span>2</span><span>],</span> <span>'</span><span>Name</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>Alice</span><span>'</span><span>,</span> <span>'</span><span>Bob</span><span>'</span><span>]})</span> <span>df2</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>ID</span><span>'</span><span>:</span> <span>[</span><span>1</span><span>,</span> <span>3</span><span>],</span> <span>'</span><span>Score</span><span>'</span><span>:</span> <span>[</span><span>85</span><span>,</span> <span>90</span><span>]})</span> <span># Merge them on 'ID' </span><span>merged</span> <span>=</span> <span>pd</span><span>.</span><span>merge</span><span>(</span><span>df1</span><span>,</span> <span>df2</span><span>,</span> <span>on</span><span>=</span><span>'</span><span>ID</span><span>'</span><span>,</span> <span>how</span><span>=</span><span>'</span><span>inner</span><span>'</span><span>)</span> <span>print</span><span>(</span><span>"</span><span>Merged DataFrame:</span><span>"</span><span>)</span> <span>print</span><span>(</span><span>merged</span><span>)</span># Two small DataFrames df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']}) df2 = pd.DataFrame({'ID': [1, 3], 'Score': [85, 90]}) # Merge them on 'ID' merged = pd.merge(df1, df2, on='ID', how='inner') print("Merged DataFrame:") print(merged)
Enter fullscreen mode Exit fullscreen mode
Breakdown:
-
merge()
: Linksdf1
anddf2
where ‘ID’ matches. -
how='inner'
: Keeps only rows where ‘ID’ exists in both (here, just ID 1).
9. Pivot Tables
Rearrange data for insights.
<span># Sample DataFrame </span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>Date</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>2023-01</span><span>'</span><span>,</span> <span>'</span><span>2023-01</span><span>'</span><span>],</span> <span>'</span><span>Product</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>A</span><span>'</span><span>,</span> <span>'</span><span>B</span><span>'</span><span>],</span> <span>'</span><span>Sales</span><span>'</span><span>:</span> <span>[</span><span>100</span><span>,</span> <span>150</span><span>]})</span><span># Pivot to spread Products across columns </span><span>pivot</span> <span>=</span> <span>df</span><span>.</span><span>pivot</span><span>(</span><span>index</span><span>=</span><span>'</span><span>Date</span><span>'</span><span>,</span> <span>columns</span><span>=</span><span>'</span><span>Product</span><span>'</span><span>,</span> <span>values</span><span>=</span><span>'</span><span>Sales</span><span>'</span><span>)</span><span>print</span><span>(</span><span>"</span><span>Pivot Table:</span><span>"</span><span>)</span><span>print</span><span>(</span><span>pivot</span><span>)</span><span># Sample DataFrame </span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>Date</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>2023-01</span><span>'</span><span>,</span> <span>'</span><span>2023-01</span><span>'</span><span>],</span> <span>'</span><span>Product</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>A</span><span>'</span><span>,</span> <span>'</span><span>B</span><span>'</span><span>],</span> <span>'</span><span>Sales</span><span>'</span><span>:</span> <span>[</span><span>100</span><span>,</span> <span>150</span><span>]})</span> <span># Pivot to spread Products across columns </span><span>pivot</span> <span>=</span> <span>df</span><span>.</span><span>pivot</span><span>(</span><span>index</span><span>=</span><span>'</span><span>Date</span><span>'</span><span>,</span> <span>columns</span><span>=</span><span>'</span><span>Product</span><span>'</span><span>,</span> <span>values</span><span>=</span><span>'</span><span>Sales</span><span>'</span><span>)</span> <span>print</span><span>(</span><span>"</span><span>Pivot Table:</span><span>"</span><span>)</span> <span>print</span><span>(</span><span>pivot</span><span>)</span># Sample DataFrame df = pd.DataFrame({'Date': ['2023-01', '2023-01'], 'Product': ['A', 'B'], 'Sales': [100, 150]}) # Pivot to spread Products across columns pivot = df.pivot(index='Date', columns='Product', values='Sales') print("Pivot Table:") print(pivot)
Enter fullscreen mode Exit fullscreen mode
Breakdown:
-
pivot()
: Turns ‘Product’ values into columns, with ‘Sales’ as the data. One row for ‘2023-01’ with A and B as columns.
10. Time Series
Work with dates and times.
<span># Sample DataFrame </span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>Date</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>2023-01-01</span><span>'</span><span>,</span> <span>'</span><span>2023-02-01</span><span>'</span><span>],</span> <span>'</span><span>Sales</span><span>'</span><span>:</span> <span>[</span><span>100</span><span>,</span> <span>200</span><span>]})</span><span># Convert Date to datetime </span><span>df</span><span>[</span><span>'</span><span>Date</span><span>'</span><span>]</span> <span>=</span> <span>pd</span><span>.</span><span>to_datetime</span><span>(</span><span>df</span><span>[</span><span>'</span><span>Date</span><span>'</span><span>])</span><span># Set Date as index </span><span>df</span><span>.</span><span>set_index</span><span>(</span><span>'</span><span>Date</span><span>'</span><span>,</span> <span>inplace</span><span>=</span><span>True</span><span>)</span><span># Monthly total </span><span>monthly</span> <span>=</span> <span>df</span><span>.</span><span>resample</span><span>(</span><span>'</span><span>M</span><span>'</span><span>).</span><span>sum</span><span>()</span><span>print</span><span>(</span><span>"</span><span>Monthly Sales:</span><span>"</span><span>)</span><span>print</span><span>(</span><span>monthly</span><span>)</span><span># Sample DataFrame </span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>Date</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>2023-01-01</span><span>'</span><span>,</span> <span>'</span><span>2023-02-01</span><span>'</span><span>],</span> <span>'</span><span>Sales</span><span>'</span><span>:</span> <span>[</span><span>100</span><span>,</span> <span>200</span><span>]})</span> <span># Convert Date to datetime </span><span>df</span><span>[</span><span>'</span><span>Date</span><span>'</span><span>]</span> <span>=</span> <span>pd</span><span>.</span><span>to_datetime</span><span>(</span><span>df</span><span>[</span><span>'</span><span>Date</span><span>'</span><span>])</span> <span># Set Date as index </span><span>df</span><span>.</span><span>set_index</span><span>(</span><span>'</span><span>Date</span><span>'</span><span>,</span> <span>inplace</span><span>=</span><span>True</span><span>)</span> <span># Monthly total </span><span>monthly</span> <span>=</span> <span>df</span><span>.</span><span>resample</span><span>(</span><span>'</span><span>M</span><span>'</span><span>).</span><span>sum</span><span>()</span> <span>print</span><span>(</span><span>"</span><span>Monthly Sales:</span><span>"</span><span>)</span> <span>print</span><span>(</span><span>monthly</span><span>)</span># Sample DataFrame df = pd.DataFrame({'Date': ['2023-01-01', '2023-02-01'], 'Sales': [100, 200]}) # Convert Date to datetime df['Date'] = pd.to_datetime(df['Date']) # Set Date as index df.set_index('Date', inplace=True) # Monthly total monthly = df.resample('M').sum() print("Monthly Sales:") print(monthly)
Enter fullscreen mode Exit fullscreen mode
Breakdown:
-
to_datetime()
: Makes ‘Date’ a proper date type. -
set_index()
: Uses ‘Date’ as row labels. -
resample('M')
: Groups by month (‘M’) and sums ‘Sales’.
11. SQL in Pandas
Pandas can connect to databases or mimic SQL queries.
<span># Sample DataFrame </span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>id</span><span>'</span><span>:</span> <span>[</span><span>1</span><span>,</span> <span>2</span><span>],</span> <span>'</span><span>product</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>Laptop</span><span>'</span><span>,</span> <span>'</span><span>Mouse</span><span>'</span><span>],</span> <span>'</span><span>amount</span><span>'</span><span>:</span> <span>[</span><span>1000</span><span>,</span> <span>20</span><span>]})</span><span># Filter like SQL: WHERE amount > 50 </span><span>result</span> <span>=</span> <span>df</span><span>[</span><span>df</span><span>[</span><span>'</span><span>amount</span><span>'</span><span>]</span> <span>></span> <span>50</span><span>]</span><span>print</span><span>(</span><span>"</span><span>Filtered DataFrame:</span><span>"</span><span>)</span><span>print</span><span>(</span><span>result</span><span>)</span><span># Sample DataFrame </span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>({</span><span>'</span><span>id</span><span>'</span><span>:</span> <span>[</span><span>1</span><span>,</span> <span>2</span><span>],</span> <span>'</span><span>product</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>Laptop</span><span>'</span><span>,</span> <span>'</span><span>Mouse</span><span>'</span><span>],</span> <span>'</span><span>amount</span><span>'</span><span>:</span> <span>[</span><span>1000</span><span>,</span> <span>20</span><span>]})</span> <span># Filter like SQL: WHERE amount > 50 </span><span>result</span> <span>=</span> <span>df</span><span>[</span><span>df</span><span>[</span><span>'</span><span>amount</span><span>'</span><span>]</span> <span>></span> <span>50</span><span>]</span> <span>print</span><span>(</span><span>"</span><span>Filtered DataFrame:</span><span>"</span><span>)</span> <span>print</span><span>(</span><span>result</span><span>)</span># Sample DataFrame df = pd.DataFrame({'id': [1, 2], 'product': ['Laptop', 'Mouse'], 'amount': [1000, 20]}) # Filter like SQL: WHERE amount > 50 result = df[df['amount'] > 50] print("Filtered DataFrame:") print(result)
Enter fullscreen mode Exit fullscreen mode
Breakdown:
- SQL-Like:
df[df['amount'] > 50]
is likeSELECT * FROM df WHERE amount > 50
. - Database Option: Use
pd.read_sql("SELECT * FROM table", connection)
to pull from a real database (e.g., MySQL).
Project: Analyzing Store Sales with a Simple Plot
Let’s put it all together with a beginner-friendly project! We’ll analyze a small store’s sales and add a basic bar chart using Matplotlib to visualize our results.
Project Setup
- Install Python: Download from python.org if you haven’t.
- Install Pandas: Run
pip install pandas
in your terminal. - Install Matplotlib: Run
pip install matplotlib
to add plotting (we’ll use it briefly). - Pick an Editor: Use Notepad, VS Code, or Jupyter Notebook.
- Create a File: Save this code as
store_sales.py
or run it in a notebook.
The Code
<span># Import Pandas for data and Matplotlib for plotting </span><span>import</span> <span>pandas</span> <span>as</span> <span>pd</span><span>import</span> <span>matplotlib.pyplot</span> <span>as</span> <span>plt</span><span># Step 1: Create our dataset </span><span>data</span> <span>=</span> <span>{</span><span>'</span><span>Date</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>2023-01-01</span><span>'</span><span>,</span> <span>'</span><span>2023-01-01</span><span>'</span><span>,</span> <span>'</span><span>2023-01-02</span><span>'</span><span>,</span> <span>'</span><span>2023-01-02</span><span>'</span><span>],</span><span>'</span><span>Item</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>Pen</span><span>'</span><span>,</span> <span>'</span><span>Notebook</span><span>'</span><span>,</span> <span>'</span><span>Pen</span><span>'</span><span>,</span> <span>'</span><span>Pencil</span><span>'</span><span>],</span><span>'</span><span>Quantity</span><span>'</span><span>:</span> <span>[</span><span>5</span><span>,</span> <span>2</span><span>,</span> <span>3</span><span>,</span> <span>4</span><span>],</span><span>'</span><span>Price</span><span>'</span><span>:</span> <span>[</span><span>1.0</span><span>,</span> <span>2.5</span><span>,</span> <span>1.0</span><span>,</span> <span>0.5</span><span>]</span><span>}</span><span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>(</span><span>data</span><span>)</span> <span># Turn the dictionary into a table </span><span>print</span><span>(</span><span>"</span><span>Our dataset:</span><span>"</span><span>)</span><span>print</span><span>(</span><span>df</span><span>)</span><span># Step 2: Explore the data </span><span>print</span><span>(</span><span>"</span><span>\n</span><span>First 2 rows:</span><span>"</span><span>)</span> <span># Show first 2 rows </span><span>print</span><span>(</span><span>df</span><span>.</span><span>head</span><span>(</span><span>2</span><span>))</span><span>print</span><span>(</span><span>"</span><span>\n</span><span>Dataset info:</span><span>"</span><span>)</span> <span># Check structure </span><span>print</span><span>(</span><span>df</span><span>.</span><span>info</span><span>())</span><span># Step 3: Calculate total sales per row </span><span>df</span><span>[</span><span>'</span><span>Total</span><span>'</span><span>]</span> <span>=</span> <span>df</span><span>[</span><span>'</span><span>Quantity</span><span>'</span><span>]</span> <span>*</span> <span>df</span><span>[</span><span>'</span><span>Price</span><span>'</span><span>]</span> <span># New column: Quantity × Price </span><span>print</span><span>(</span><span>"</span><span>\n</span><span>With Total column:</span><span>"</span><span>)</span><span>print</span><span>(</span><span>df</span><span>)</span><span>total_sales</span> <span>=</span> <span>df</span><span>[</span><span>'</span><span>Total</span><span>'</span><span>].</span><span>sum</span><span>()</span> <span># Add up all totals </span><span>print</span><span>(</span><span>f</span><span>"</span><span>\n</span><span>Total sales: $</span><span>{</span><span>total_sales</span><span>}</span><span>"</span><span>)</span><span># Step 4: Find total quantity sold per item </span><span>item_sales</span> <span>=</span> <span>df</span><span>.</span><span>groupby</span><span>(</span><span>'</span><span>Item</span><span>'</span><span>)[</span><span>'</span><span>Quantity</span><span>'</span><span>].</span><span>sum</span><span>()</span> <span># Total quantity per item </span><span>print</span><span>(</span><span>"</span><span>\n</span><span>Quantity sold per item:</span><span>"</span><span>)</span><span>print</span><span>(</span><span>item_sales</span><span>)</span><span># Step 5: Make a simple bar chart of item sales </span><span>plt</span><span>.</span><span>bar</span><span>(</span><span>item_sales</span><span>.</span><span>index</span><span>,</span> <span>item_sales</span><span>)</span> <span># Items on x-axis, quantities on y-axis </span><span>plt</span><span>.</span><span>title</span><span>(</span><span>'</span><span>Total Quantity Sold Per Item</span><span>'</span><span>)</span> <span># Add a title </span><span>plt</span><span>.</span><span>xlabel</span><span>(</span><span>'</span><span>Item</span><span>'</span><span>)</span> <span># Label for x-axis </span><span>plt</span><span>.</span><span>ylabel</span><span>(</span><span>'</span><span>Quantity Sold</span><span>'</span><span>)</span> <span># Label for y-axis </span><span>plt</span><span>.</span><span>show</span><span>()</span> <span># Display the plot </span><span># Step 6: Average sales per day </span><span>df</span><span>[</span><span>'</span><span>Date</span><span>'</span><span>]</span> <span>=</span> <span>pd</span><span>.</span><span>to_datetime</span><span>(</span><span>df</span><span>[</span><span>'</span><span>Date</span><span>'</span><span>])</span> <span># Make Date a proper date </span><span>daily_sales</span> <span>=</span> <span>df</span><span>.</span><span>groupby</span><span>(</span><span>'</span><span>Date</span><span>'</span><span>)[</span><span>'</span><span>Total</span><span>'</span><span>].</span><span>sum</span><span>()</span> <span># Total sales per day </span><span>print</span><span>(</span><span>"</span><span>\n</span><span>Total sales per day:</span><span>"</span><span>)</span><span>print</span><span>(</span><span>daily_sales</span><span>)</span><span>avg_daily_sales</span> <span>=</span> <span>daily_sales</span><span>.</span><span>mean</span><span>()</span> <span># Average of daily totals </span><span>print</span><span>(</span><span>f</span><span>"</span><span>\n</span><span>Average daily sales: $</span><span>{</span><span>avg_daily_sales</span><span>}</span><span>"</span><span>)</span><span># Step 7: Save the results </span><span>df</span><span>.</span><span>to_csv</span><span>(</span><span>'</span><span>store_sales.csv</span><span>'</span><span>,</span> <span>index</span><span>=</span><span>False</span><span>)</span> <span># Save to a CSV file </span><span>print</span><span>(</span><span>"</span><span>\n</span><span>Data saved to </span><span>'</span><span>store_sales.csv</span><span>'</span><span>!</span><span>"</span><span>)</span><span># Import Pandas for data and Matplotlib for plotting </span><span>import</span> <span>pandas</span> <span>as</span> <span>pd</span> <span>import</span> <span>matplotlib.pyplot</span> <span>as</span> <span>plt</span> <span># Step 1: Create our dataset </span><span>data</span> <span>=</span> <span>{</span> <span>'</span><span>Date</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>2023-01-01</span><span>'</span><span>,</span> <span>'</span><span>2023-01-01</span><span>'</span><span>,</span> <span>'</span><span>2023-01-02</span><span>'</span><span>,</span> <span>'</span><span>2023-01-02</span><span>'</span><span>],</span> <span>'</span><span>Item</span><span>'</span><span>:</span> <span>[</span><span>'</span><span>Pen</span><span>'</span><span>,</span> <span>'</span><span>Notebook</span><span>'</span><span>,</span> <span>'</span><span>Pen</span><span>'</span><span>,</span> <span>'</span><span>Pencil</span><span>'</span><span>],</span> <span>'</span><span>Quantity</span><span>'</span><span>:</span> <span>[</span><span>5</span><span>,</span> <span>2</span><span>,</span> <span>3</span><span>,</span> <span>4</span><span>],</span> <span>'</span><span>Price</span><span>'</span><span>:</span> <span>[</span><span>1.0</span><span>,</span> <span>2.5</span><span>,</span> <span>1.0</span><span>,</span> <span>0.5</span><span>]</span> <span>}</span> <span>df</span> <span>=</span> <span>pd</span><span>.</span><span>DataFrame</span><span>(</span><span>data</span><span>)</span> <span># Turn the dictionary into a table </span><span>print</span><span>(</span><span>"</span><span>Our dataset:</span><span>"</span><span>)</span> <span>print</span><span>(</span><span>df</span><span>)</span> <span># Step 2: Explore the data </span><span>print</span><span>(</span><span>"</span><span>\n</span><span>First 2 rows:</span><span>"</span><span>)</span> <span># Show first 2 rows </span><span>print</span><span>(</span><span>df</span><span>.</span><span>head</span><span>(</span><span>2</span><span>))</span> <span>print</span><span>(</span><span>"</span><span>\n</span><span>Dataset info:</span><span>"</span><span>)</span> <span># Check structure </span><span>print</span><span>(</span><span>df</span><span>.</span><span>info</span><span>())</span> <span># Step 3: Calculate total sales per row </span><span>df</span><span>[</span><span>'</span><span>Total</span><span>'</span><span>]</span> <span>=</span> <span>df</span><span>[</span><span>'</span><span>Quantity</span><span>'</span><span>]</span> <span>*</span> <span>df</span><span>[</span><span>'</span><span>Price</span><span>'</span><span>]</span> <span># New column: Quantity × Price </span><span>print</span><span>(</span><span>"</span><span>\n</span><span>With Total column:</span><span>"</span><span>)</span> <span>print</span><span>(</span><span>df</span><span>)</span> <span>total_sales</span> <span>=</span> <span>df</span><span>[</span><span>'</span><span>Total</span><span>'</span><span>].</span><span>sum</span><span>()</span> <span># Add up all totals </span><span>print</span><span>(</span><span>f</span><span>"</span><span>\n</span><span>Total sales: $</span><span>{</span><span>total_sales</span><span>}</span><span>"</span><span>)</span> <span># Step 4: Find total quantity sold per item </span><span>item_sales</span> <span>=</span> <span>df</span><span>.</span><span>groupby</span><span>(</span><span>'</span><span>Item</span><span>'</span><span>)[</span><span>'</span><span>Quantity</span><span>'</span><span>].</span><span>sum</span><span>()</span> <span># Total quantity per item </span><span>print</span><span>(</span><span>"</span><span>\n</span><span>Quantity sold per item:</span><span>"</span><span>)</span> <span>print</span><span>(</span><span>item_sales</span><span>)</span> <span># Step 5: Make a simple bar chart of item sales </span><span>plt</span><span>.</span><span>bar</span><span>(</span><span>item_sales</span><span>.</span><span>index</span><span>,</span> <span>item_sales</span><span>)</span> <span># Items on x-axis, quantities on y-axis </span><span>plt</span><span>.</span><span>title</span><span>(</span><span>'</span><span>Total Quantity Sold Per Item</span><span>'</span><span>)</span> <span># Add a title </span><span>plt</span><span>.</span><span>xlabel</span><span>(</span><span>'</span><span>Item</span><span>'</span><span>)</span> <span># Label for x-axis </span><span>plt</span><span>.</span><span>ylabel</span><span>(</span><span>'</span><span>Quantity Sold</span><span>'</span><span>)</span> <span># Label for y-axis </span><span>plt</span><span>.</span><span>show</span><span>()</span> <span># Display the plot </span> <span># Step 6: Average sales per day </span><span>df</span><span>[</span><span>'</span><span>Date</span><span>'</span><span>]</span> <span>=</span> <span>pd</span><span>.</span><span>to_datetime</span><span>(</span><span>df</span><span>[</span><span>'</span><span>Date</span><span>'</span><span>])</span> <span># Make Date a proper date </span><span>daily_sales</span> <span>=</span> <span>df</span><span>.</span><span>groupby</span><span>(</span><span>'</span><span>Date</span><span>'</span><span>)[</span><span>'</span><span>Total</span><span>'</span><span>].</span><span>sum</span><span>()</span> <span># Total sales per day </span><span>print</span><span>(</span><span>"</span><span>\n</span><span>Total sales per day:</span><span>"</span><span>)</span> <span>print</span><span>(</span><span>daily_sales</span><span>)</span> <span>avg_daily_sales</span> <span>=</span> <span>daily_sales</span><span>.</span><span>mean</span><span>()</span> <span># Average of daily totals </span><span>print</span><span>(</span><span>f</span><span>"</span><span>\n</span><span>Average daily sales: $</span><span>{</span><span>avg_daily_sales</span><span>}</span><span>"</span><span>)</span> <span># Step 7: Save the results </span><span>df</span><span>.</span><span>to_csv</span><span>(</span><span>'</span><span>store_sales.csv</span><span>'</span><span>,</span> <span>index</span><span>=</span><span>False</span><span>)</span> <span># Save to a CSV file </span><span>print</span><span>(</span><span>"</span><span>\n</span><span>Data saved to </span><span>'</span><span>store_sales.csv</span><span>'</span><span>!</span><span>"</span><span>)</span># Import Pandas for data and Matplotlib for plotting import pandas as pd import matplotlib.pyplot as plt # Step 1: Create our dataset data = { 'Date': ['2023-01-01', '2023-01-01', '2023-01-02', '2023-01-02'], 'Item': ['Pen', 'Notebook', 'Pen', 'Pencil'], 'Quantity': [5, 2, 3, 4], 'Price': [1.0, 2.5, 1.0, 0.5] } df = pd.DataFrame(data) # Turn the dictionary into a table print("Our dataset:") print(df) # Step 2: Explore the data print("\nFirst 2 rows:") # Show first 2 rows print(df.head(2)) print("\nDataset info:") # Check structure print(df.info()) # Step 3: Calculate total sales per row df['Total'] = df['Quantity'] * df['Price'] # New column: Quantity × Price print("\nWith Total column:") print(df) total_sales = df['Total'].sum() # Add up all totals print(f"\nTotal sales: ${total_sales}") # Step 4: Find total quantity sold per item item_sales = df.groupby('Item')['Quantity'].sum() # Total quantity per item print("\nQuantity sold per item:") print(item_sales) # Step 5: Make a simple bar chart of item sales plt.bar(item_sales.index, item_sales) # Items on x-axis, quantities on y-axis plt.title('Total Quantity Sold Per Item') # Add a title plt.xlabel('Item') # Label for x-axis plt.ylabel('Quantity Sold') # Label for y-axis plt.show() # Display the plot # Step 6: Average sales per day df['Date'] = pd.to_datetime(df['Date']) # Make Date a proper date daily_sales = df.groupby('Date')['Total'].sum() # Total sales per day print("\nTotal sales per day:") print(daily_sales) avg_daily_sales = daily_sales.mean() # Average of daily totals print(f"\nAverage daily sales: ${avg_daily_sales}") # Step 7: Save the results df.to_csv('store_sales.csv', index=False) # Save to a CSV file print("\nData saved to 'store_sales.csv'!")
Enter fullscreen mode Exit fullscreen mode
Breakdown:
- Step 1: We created a DataFrame with sales data (dates, items, quantities, prices).
- Step 2: Explored it with
head()
andinfo()
to understand our dataset. - Step 3: Added a ‘Total’ column (e.g., 5 pens × $1 = $5) and summed it for overall sales ($15).
- Step 4: Grouped by ‘Item’ to see total quantities sold (e.g., Pens: 8 units).
- Step 5: Used Matplotlib’s
plt.bar()
to make a bar chart:-
item_sales.index
(item names) goes on the x-axis. -
item_sales
(quantities) goes on the y-axis. - Added labels and a title, then
plt.show()
displays it. You’ll see a window pop up with bars for each item!
-
- Step 6: Calculated daily sales (Jan 1: $10, Jan 2: $5) and their average ($7.5).
- Step 7: Saved our work to
store_sales.csv
.
Run this code, and you’ll see the outputs in your console plus a bar chart window showing quantities sold per item (Pens tallest at 8, Notebook shortest at 2). You’ll also get a CSV file in your folder!
Conclusion
Congratulations! You’ve just taken your first steps into the exciting world of data analysis with Pandas. From creating and manipulating DataFrames to exploring, cleaning, and visualizing data, you now have the foundational skills to tackle real-world datasets. The hands-on project showed how Pandas transforms raw numbers into meaningful insights—like total sales or item quantities—complete with a simple chart to bring it all to life. As you continue your journey, experiment with larger datasets, explore advanced features like joins or time series analysis, and let Pandas empower you to uncover stories hidden in the data. Happy analyzing!
暂无评论内容