Performing Basic Data Analysis with Python and SQL
In today’s data-driven world, being able to extract insights from data is a valuable skill. In this article, we’ll explore how to perform basic data analysis using Python and SQL. We’ll connect to a SQLite database, execute SQL queries, and visualize the results using Python libraries.
Database Schema:
The database contains information about sales transactions made by a company. Here’s what each field represents:
- TransactionID (INTEGER, PRIMARY KEY):
- This field uniquely identifies each transaction. It’s an integer value and serves as the primary key for the table.
- ProductID (INTEGER):
- This field represents the unique identifier for each product sold in the transaction. It’s also an integer value.
- ProductName (TEXT):
- This field stores the name or description of the product sold. It’s a text field allowing for variable-length strings.
- Quantity (INTEGER):
- This field indicates the quantity of the product sold in the transaction. It’s an integer value.
- PricePerUnit (REAL):
- This field represents the price per unit of the product sold. It’s a real number (floating-point value) to accommodate decimal prices.
- SalesDate (TEXT):
- This field stores the date of the sales transaction. It’s represented as text and typically follows a specific date format.
Explanation:
The provided schema simulates a basic sales database, commonly used in retail or e-commerce settings. Each transaction (identified by TransactionID) involves the sale of one or more products. For each product sold, the database records the ProductID, ProductName, Quantity sold, PricePerUnit, and the date of the transaction (SalesDate).
Purpose:
This database schema and the accompanying data were created for the purpose of practicing database management and performing data analysis tasks,it is not real-world data.But it provides a simplified representation of sales transactions, making it suitable for learning SQL queries, data manipulation, and analysis.Note:
It’s important to note that in a real-world scenario, databases and data structures may be more complex, and additional considerations such as data integrity, normalization, and indexing would be taken into account. However, for learning and practice purposes, this schema serves as a good starting point.
Setting Up the Environment
First, let’s ensure we have the necessary tools installed. We’ll need Python, SQLite, and the matplotlib library for data visualization.
pip install matplotlib
pip install sqlite3
Enter fullscreen mode Exit fullscreen mode
Connecting to the Database
We’ll start by creating a SQLite database and populating it with some sample data. We’ll then connect to this database using Python’s sqlite3
library.
import sqlite3
import matplotlib.pyplot as plt
# Connect to the database (in-memory database for this example) conn = sqlite3.connect(':memory:')
cursor = conn.cursor()
Enter fullscreen mode Exit fullscreen mode
Creating the Database Schema and Inserting Data
We’ll define a schema for our database containing information about sales transactions. The schema includes fields such as TransactionID
, ProductID
, ProductName
, Quantity
, PricePerUnit
, and SalesDate
.
# Creating a table cursor.execute('''CREATE TABLE IF NOT EXISTS sales ( TransactionID INTEGER PRIMARY KEY, ProductID INTEGER, ProductName TEXT, Quantity INTEGER, PricePerUnit REAL, SalesDate TEXT )''')
# Inserting sample data sales_data = [
(1, 101, 'Product A', 10, 20.0, '2024-01-15'),
(2, 102, 'Product B', 15, 25.0, '2024-01-20'),
(3, 101, 'Product A', 8, 20.0, '2024-02-05'),
(4, 103, 'Product C', 12, 30.0, '2024-02-10'),
(5, 102, 'Product B', 20, 25.0, '2024-03-02'),
(6, 104, 'Product D', 5, 35.0, '2024-03-10')
]
cursor.executemany('INSERT INTO sales VALUES (?, ?, ?, ?, ?, ?)', sales_data)
Enter fullscreen mode Exit fullscreen mode
Performing Data Analysis Tasks
Now that our database is set up, let’s perform some basic data analysis tasks using SQL queries.
Task 1: Total Sales Revenue for Each Product
We’ll calculate the total sales revenue for each product by multiplying the Quantity
with PricePerUnit
and summing it up.
query1 = '''SELECT ProductID, ProductName, SUM(Quantity * PricePerUnit) AS TotalRevenue FROM sales GROUP BY ProductID'''
cursor.execute(query1)
result = cursor.fetchall()
Enter fullscreen mode Exit fullscreen mode
Visualizing the Results
We can visualize the results using Python libraries such as matplotlib
.
# Plotting pie chart for Task 1 labels = [row[1] for row in result]
revenues = [row[2] for row in result]
plt.pie(revenues, labels=labels, autopct='%1.1f%%')
plt.title('→Task 1: Total sales revenue distribution')
plt.show()
Enter fullscreen mode Exit fullscreen mode
Task 2: Top 3 Best-Selling Products
We’ll find the top 3 best-selling products based on the total quantity sold.
query2 = '''SELECT ProductID, ProductName, SUM(Quantity) AS TotalQuantity FROM sales GROUP BY ProductID ORDER BY TotalQuantity DESC LIMIT 3'''
cursor.execute(query2)
result = cursor.fetchall()
Enter fullscreen mode Exit fullscreen mode
output
| ProductID | ProductName | TotalQuantity |
|-----------|-------------|---------------|
| 102 | Product B | 35 |
| 101 | Product A | 18 |
| 103 | Product C | 12 |
Enter fullscreen mode Exit fullscreen mode
Task 3: Average Price Per Unit for All Products
We’ll calculate the average price per unit for all products.
query3 = '''SELECT AVG(PricePerUnit) AS AveragePricePerUnit FROM sales'''
cursor.execute(query3)
result = cursor.fetchall()
Enter fullscreen mode Exit fullscreen mode
output
| AveragePricePerUnit |
|----------------------|
| 26.666666666666668 |
Enter fullscreen mode Exit fullscreen mode
Conclusion
In this article, we’ve demonstrated how to connect to a SQLite database using Python, execute SQL queries, and perform basic data analysis tasks. By leveraging Python libraries such as sqlite3
and matplotlib
, we can efficiently analyze data and gain valuable insights.
暂无评论内容