How to Analyze time series data in Python pandas

Note: This post was originally published on https://bignumworks.com

Introduction

In this article, we will explore how to analyze time series data with Python’s Pandas Data Analysis library.

Let’s first understand what we mean by time series data. In simple terms, a time series is a set of observations taken over a period of time. It arises whenever data is recorded at regular intervals. For example…

Here, we have sales data recorded monthly for multiple locations.
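
As a minimal sketch of what such data can look like, the snippet below builds a small table of monthly sales for two locations. The store names and all figures are invented purely for illustration and are not taken from any real dataset.

```python
import pandas as pd

# Hypothetical figures for two made-up locations (illustration only).
sales = pd.DataFrame(
    {
        "Month": pd.to_datetime(["2017-01-31", "2017-02-28", "2017-03-31"]),
        "Store A": [120, 135, 150],
        "Store B": [80, 95, 90],
    }
).set_index("Month")

# Each row is one observation date; each column is one location.
print(sales)
```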

Time series data analysis is extremely important for making informed business and policy decisions and plans. It is used to

  • find past trends
  • use past trends to forecast future trends
  • find cyclic variations
  • discover seasonal variations

For example, a retailer can discover the seasonal and cyclic sales trends of its inventory and plan accordingly.

Our Methodology

  • We will start by importing the necessary Python modules.
  • Then we will create a dataset containing time series data.
  • We will transform the data so that it can be indexed on the date column.
  • We will then demonstrate how to select and filter data based on date and time.
  • Finally, we will visualize the time series data.

Setting up your machine

For the code demo in this article, we are using Python 3.x and the Python modules Pandas and matplotlib. We use Pandas for reading in and analyzing the data, and matplotlib for data visualization. An easy way to get all the required packages is to install the Anaconda Python distribution for data science.
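
Once everything is installed, a quick check like the following sketch can confirm that the interpreter and both libraries are importable. It only prints version strings; the article does not depend on any particular version numbers.

```python
import sys

import matplotlib
import pandas as pd

# Print the versions that are actually installed on this machine.
print("Python    :", sys.version.split()[0])
print("pandas    :", pd.__version__)
print("matplotlib:", matplotlib.__version__)
```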

Importing Python Modules

We start by importing the Pandas, matplotlib and datetime modules. We import the datetime module because we will need some of its methods later. We also set matplotlib to inline mode so that the plots are shown right in the Jupyter Notebook.

from datetime import datetime
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as pyplot

Introducing our dataset

For this code demo, we are creating a sample employees dataset with two columns

  • Date field
  • No of Employees

This data is a quarterly count of employees for the last 8 quarters.

We create a data dictionary with column names as keys and records as the values. We then pass this dict object to the Pandas DataFrame constructor to create a DataFrame in which the data is stored in tabular format.

employees = {'Date': ['2015-09-30', '2015-12-31', '2016-03-31', '2016-06-30', 
                 '2016-09-30', '2016-12-31', '2017-03-31', '2017-06-30'],
        'Employees': [9, 15, 26, 15, 15, 14, 26, 25]}
df = pd.DataFrame(employees, columns = ['Date', 'Employees'])

Now that we have our data in a Pandas DataFrame, let’s print it out to see how it looks.

print(df)
         Date  Employees
0  2015-09-30          9
1  2015-12-31         15
2  2016-03-31         26
3  2016-06-30         15
4  2016-09-30         15
5  2016-12-31         14
6  2017-03-31         26
7  2017-06-30         25

As we can see, we have the data with two columns Date and Employees. Also, notice that there is a numeric row index which has been set automatically while importing data into a DataFrame. We would have to change this later.

Pandas Time and Date methods

Let’s get some information about the dataset, such as the data types of the columns.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 2 columns):
Date         8 non-null object
Employees    8 non-null int64
dtypes: int64(1), object(1)
memory usage: 136.0+ bytes

We see that the date column has the object type, which means it is stored as text. Date and time data in text format is not very useful for time series analysis. We first need to convert it to a datetime datatype.

We do this conversion by calling Pandas to_datetime method on the date column. We assign the converted date column back to the original Pandas DataFrame, thereby replacing the old text date column with the new datetime type column.

df['Date'] = pd.to_datetime(df['Date'])

We check the conversion to a datetime datatype by calling the info method again on our DataFrame.

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8 entries, 0 to 7
Data columns (total 2 columns):
Date         8 non-null datetime64[ns]
Employees    8 non-null int64
dtypes: datetime64[ns](1), int64(1)
memory usage: 168.0 bytes

And we see that the date column now shows up as datetime datatype.
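
As a hedged aside, the conversion can be made stricter, and the converted values can then be inspected through the `.dt` accessor. The sketch below uses a separate sample Series (not the article's DataFrame) with one deliberately malformed entry, and assumes the dates follow the `YYYY-MM-DD` layout.

```python
import pandas as pd

# Sample values; the last entry is deliberately malformed for illustration.
raw = pd.Series(["2015-09-30", "2016-03-31", "2017-06-30", "not a date"])

# An explicit format documents the expected layout; errors="coerce"
# turns values that cannot be parsed into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
print("Unparseable values:", parsed.isna().sum())

# Once the column is datetime64, the .dt accessor exposes date components.
valid = parsed.dropna()
parts = pd.DataFrame(
    {
        "year": valid.dt.year,
        "quarter": valid.dt.quarter,
        "month_name": valid.dt.month_name(),
    }
)
print(parts)
```

Coercing bad values to NaT makes them easy to count and drop, whereas the default behavior would stop at the first value that fails to parse.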

There is one more thing we need to do before we can start exploring the time series properties. We need to set the index of our DataFrame to the date column. We do this by assigning the date series to the index attribute of the DataFrame. We also delete the date column, as we don’t need it separately from the index.

df.index = df['Date']
del df['Date']

Let’s print out the DataFrame to see how it looks after setting the index to the date column.

print(df)
            Employees
Date                 
2015-09-30          9
2015-12-31         15
2016-03-31         26
2016-06-30         15
2016-09-30         15
2016-12-31         14
2017-03-31         26
2017-06-30         25
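
The index assignment and `del` used above can also be written as a single `set_index` call. The sketch below shows this alternative on a shortened copy of the data; the name `df_alt` is only used here to avoid overwriting the article's `df`.

```python
import pandas as pd

# Shortened copy of the article's data, kept separate from df.
employees = {
    "Date": ["2015-09-30", "2015-12-31", "2016-03-31"],
    "Employees": [9, 15, 26],
}
df_alt = pd.DataFrame(employees)
df_alt["Date"] = pd.to_datetime(df_alt["Date"])

# set_index moves the column into the index and removes it from the
# columns in one step, returning a new DataFrame.
df_alt = df_alt.set_index("Date")

print(df_alt)
```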

Querying time series data

Now we are set to start exploring our time series data. We will start with showing some data querying techniques.

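Here we are querying and filtering the data and selecting only those records that are from the year 2016.

# Use .loc for row selection by partial date string; plain df['2016']
# is treated as a column lookup in recent pandas versions.
df.loc['2016']
            Employees
Date
2016-03-31         26
2016-06-30         15
2016-09-30         15
2016-12-31         14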

Notice that because we converted the Date column to datetime and set it as the index, we can run all kinds of date-based queries. Above, we just passed 2016 and Pandas understood that we are looking for records from the year 2016.
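
The same kind of partial date string also works at month level and in ranges. The following sketch rebuilds the sample data under the hypothetical name `df_q` so that it runs on its own, and uses `.loc` for the row lookups.

```python
import pandas as pd

# Rebuild the article's sample data with a DatetimeIndex.
df_q = pd.DataFrame(
    {"Employees": [9, 15, 26, 15, 15, 14, 26, 25]},
    index=pd.to_datetime(
        ["2015-09-30", "2015-12-31", "2016-03-31", "2016-06-30",
         "2016-09-30", "2016-12-31", "2017-03-31", "2017-06-30"]
    ),
)
df_q.index.name = "Date"

# A single month: every row whose date falls in June 2016.
june_2016 = df_q.loc["2016-06"]

# A range of months: both endpoints are inclusive, so the end of the
# slice covers all of June 2016.
first_half_2016 = df_q.loc["2016-01":"2016-06"]

print(june_2016)
print(first_half_2016)
```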

Next we are selecting all records from June 2016 onwards.

df[datetime(2016, 6, 30):]
            Employees
Date
2016-06-30         15
2016-09-30         15
2016-12-31         14
2017-03-31         26
2017-06-30         25
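
Slicing is not the only option: a boolean mask built from the index gives the same kind of filter and extends to conditions that are not a contiguous range. This is a sketch on a rebuilt copy of the sample data; the names `df_m`, `from_mid_2016` and `year_end` are introduced only for this example.

```python
import pandas as pd

# Rebuild the article's sample data with a DatetimeIndex.
df_m = pd.DataFrame(
    {"Employees": [9, 15, 26, 15, 15, 14, 26, 25]},
    index=pd.to_datetime(
        ["2015-09-30", "2015-12-31", "2016-03-31", "2016-06-30",
         "2016-09-30", "2016-12-31", "2017-03-31", "2017-06-30"]
    ),
)
df_m.index.name = "Date"

# Same rows as the slice above: everything from 30 June 2016 onwards.
from_mid_2016 = df_m[df_m.index >= pd.Timestamp("2016-06-30")]

# A non-contiguous selection: only the quarters that end in December.
year_end = df_m[df_m.index.month == 12]

print(from_mid_2016)
print(year_end)
```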

Time Series data visualized

Here we visualize our time series data by calling the plot method on the DataFrame. Since the dates are already set as the index, we simply call the plot method in its simplest form to get this plot.

df.plot();
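
For a slightly more polished chart, the Axes object returned by `plot` can be used to add labels. The sketch below is one possible way to do this; the title, axis labels and the output file name `employees.png` are illustrative choices, and the `Agg` backend line is only there so the script also runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in Jupyter

import pandas as pd

# Rebuild the article's sample data with a DatetimeIndex.
df_plot = pd.DataFrame(
    {"Employees": [9, 15, 26, 15, 15, 14, 26, 25]},
    index=pd.to_datetime(
        ["2015-09-30", "2015-12-31", "2016-03-31", "2016-06-30",
         "2016-09-30", "2016-12-31", "2017-03-31", "2017-06-30"]
    ),
)

# DataFrame.plot returns a matplotlib Axes that can be customized further.
ax = df_plot.plot(marker="o", legend=False, title="Employees per quarter")
ax.set_xlabel("Quarter end")
ax.set_ylabel("Number of employees")

# Hypothetical output file name; in a notebook the figure is shown inline.
ax.figure.savefig("employees.png")
```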

Conclusion

We touched on some of the Pandas time series analysis capabilities. We transformed and manipulated a dataset containing time series data, converting it to a proper time series format using Pandas built-in methods. We then learned how to explore and filter the data using Pandas date and time functionality. Finally, we saw an easy way to visualize time series data.
