In the world of big data, Spark has become a pivotal tool for handling and processing large datasets efficiently. However, if you’re a data scientist or a data analyst accustomed to the simplicity and power of Pandas, you might find transitioning to Spark a bit daunting. That’s where the Pandas API on Spark comes in! It brings the familiar Pandas syntax to the Spark ecosystem, allowing you to leverage the distributed computing power of Spark while working with a Pandas-like interface.
Why Use the Pandas API on Spark?
The Pandas API on Spark allows you to:
- Handle Larger-Than-Memory Data: Work with datasets that exceed the memory capacity of a single machine.
- Leverage Distributed Computing: Benefit from the parallel processing power of a Spark cluster.
- Use Familiar Syntax: Transition smoothly from Pandas to Spark without having to learn a completely new API.
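To make that last point concrete, here is a minimal side-by-side sketch. The file name `example.csv` and the columns `category` and `value` are hypothetical placeholders; the point is that the same line of code works in both libraries, but the second version runs as distributed Spark jobs:

```python
import pandas as pd
import pyspark.pandas as ps

# Plain Pandas: runs on a single machine, entirely in memory
pdf = pd.read_csv('example.csv')
print(pdf.groupby('category')['value'].mean())

# Pandas API on Spark: same syntax, executed on the Spark cluster
psdf = ps.read_csv('example.csv')
print(psdf.groupby('category')['value'].mean())
```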
Setting Up Your Environment
To get started, we’ll use Docker to set up a local PySpark environment. Open your terminal and run the following command:
```bash
docker run -it -p 8888:8888 jupyter/pyspark-notebook
```
Once the container is running, check the terminal output for the access URLs and open the one starting with http://127.0.0.1:8888 (it includes an authentication token) in your browser to reach your PySpark-ready Jupyter environment.
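Once you are inside a notebook, a quick sanity check confirms that Spark and the Pandas API are available. This is just a minimal sketch; the exact version printed will vary with the image:

```python
import pyspark
import pyspark.pandas as ps

print(pyspark.__version__)  # Spark version bundled with the image
print(ps.range(5))          # tiny distributed DataFrame with ids 0..4
```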
Getting the Data
We’ll use a dataset from Kaggle for this example. You can find the dataset here: Students Performance Dataset. Download the CSV file and place it in the appropriate location within your Docker container (you can drag and drop it into the Jupyter file-browser tab).
Processing Data with Pandas API on Spark
With the environment set up and the file in the right place, you can run the following code to read, clean, visualize, and save the data to S3.
Step 1: Import Libraries and Initialize Spark Session
```python
!pip install boto3 plotly

import pandas as pd
import numpy as np
import pyspark.pandas as ps
from pyspark.sql import SparkSession
import boto3

spark = SparkSession.builder.appName("PandasOnSparkExample").getOrCreate()
```
Step 2: Read Data from CSV
```python
columns = ['StudentID', 'Age', 'Gender', 'Ethnicity', 'ParentalEducation',
           'StudyTimeWeekly', 'Absences', 'Tutoring', 'ParentalSupport',
           'Extracurricular', 'Sports', 'Music', 'Volunteering', 'GPA', 'GradeClass']

psdf = ps.read_csv('Student_performance_data _.csv', names=columns, header=0)
```
Step 3: Exploring the Data
Check the first few rows of the dataset to ensure it’s loaded correctly:
```python
print(psdf.head())
```
Print column names and data types:
```python
print(psdf.columns)
print(psdf.dtypes)
```
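For a quick statistical overview of the numeric columns (count, mean, standard deviation, min, max), `describe()` works just as it does in Pandas:

```python
# Summary statistics, computed as distributed Spark jobs
print(psdf.describe())
```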
Step 4: Handling Missing Data
Handle missing data by either dropping rows with missing values:
```python
psdf_cleaned = psdf.dropna()
print(psdf_cleaned.head())
```
Or filling them with a specific value:
```python
psdf_filled = psdf.fillna(value=0)
print(psdf_filled.head())
```
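Before picking either strategy, it helps to know how much data is actually missing. A minimal sketch that counts nulls per column:

```python
# Number of missing values in each column
print(psdf.isnull().sum())
```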
Step 5: Data Manipulations and Insights
Group your data and apply aggregate functions:
```python
grouped_psdf = psdf.groupby('Gender').mean()
print(grouped_psdf)
```
Sort your DataFrame by values:
```python
sorted_psdf = psdf.sort_values(by='GPA', ascending=False)
print(sorted_psdf.head())
```
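Boolean filtering also carries over directly from Pandas. For example, to look only at high performers (the 3.5 GPA threshold here is just an illustration):

```python
# Filter rows with a boolean mask, exactly as in Pandas
top_students = psdf[psdf['GPA'] > 3.5]
print(top_students[['StudentID', 'GPA', 'StudyTimeWeekly']].head())
```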
Step 6: Visualization
Plot the distribution of weekly study time as a histogram. Here we convert the single column to Pandas first and use its standard (Matplotlib-based) plotting:
```python
psdf['StudyTimeWeekly'].to_pandas().plot(kind='hist')
```
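Alternatively, pandas-on-Spark has its own plotting API, which uses Plotly as its default backend (this is why we installed plotly earlier), so you can plot without converting the column to Pandas yourself:

```python
# Rendered with the Plotly backend; no manual to_pandas() needed
psdf['StudyTimeWeekly'].plot.hist()
```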
Step 7: Save as Compressed Parquet and Upload to S3
Save the DataFrame as a compressed Parquet file:
```python
# Note: psdf.to_parquet() writes a Spark-style directory of part files.
# Converting to Pandas first produces the single local file that the
# boto3 upload below expects (fine here, since this dataset is small).
parquet_file = 'student_data.parquet.gzip'
psdf.to_pandas().to_parquet(parquet_file, compression='gzip')
```
Upload the Parquet file to S3 using boto3 (this assumes your AWS credentials are configured, for example via environment variables):
```python
s3_bucket = 'your-s3-bucket-name'
s3_key = 'path/to/save/student_data.parquet.gzip'

# Initialize an S3 client
s3 = boto3.client('s3')

# Upload the file to S3
s3.upload_file(parquet_file, s3_bucket, s3_key)
print(f"File uploaded to s3://{s3_bucket}/{s3_key}")
```
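As a sanity check, you can read the Parquet file back into a pandas-on-Spark DataFrame (locally here; reading straight from S3 would need extra Hadoop/S3 configuration, which is out of scope for this post):

```python
# Read the local Parquet file back and verify its contents
psdf_check = ps.read_parquet(parquet_file)
print(psdf_check.head())
```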
Conclusion
The Pandas API on Spark bridges the gap between Pandas and Spark, offering you the best of both worlds. Whether you’re handling massive datasets or looking to scale your data processing pipelines effortlessly, this API empowers you to harness the full power of Spark with the simplicity of Pandas.
Try it out and supercharge your data analytics workflow today!
For more details, you can refer to Spark’s official documentation.
Happy data wrangling!