Data Science for Developers (2 Part Series)
1 Using R-style Data Pipelines in Notebooks
2 A JupyterHub Showcase: DevOps Intelligence
Mutate Pandas data frames simply and elegantly with data pipelines.
Overview
This post shows how mutating data frames can be written more elegantly (and thus understood more easily) by using data pipelines. R users know this concept from the dplyr
package, and Python offers a similar one named dfply
.
Setting the Stage
We start off with some global definitions…
import numpy as np
import pandas as pd
The sample data (about OS package deployments) is read into the raw_data
dataframe from a CSV file.
raw_data = pd.read_csv("../assets/data/cmdb-packages.csv", sep=',')
print(' of Records: {}\n'.format(len(raw_data)))
for name in raw_data.columns[1:]:
if not name.startswith('Last '):
print(name, '=', list(sorted(set(raw_data[name].fillna('')))))
raw_data.head(3).transpose()
of Records: 146
Distribution = ['Debian 8.11', 'Debian 8.6', 'Debian 8.9', 'jessie']
Architecture = ['amd64']
Environment = ['', 'Canary', 'DEV', 'LIVE', 'QA']
Team = ['Automation', 'Big Data', 'Email', 'Ops App1', 'Ops Linux', 'Persistence', 'Platform']
Installed version = ['41.15-2(amd64)', '42.28-2(amd64)', '42.44-1(amd64)', '45.11-1(amd64)', '48.33-1(amd64)']
0 | 1 | 2 | |
---|---|---|---|
CMDB_Id | 274656589 | 153062618 | 282201163 |
Distribution | jessie | jessie | jessie |
Architecture | amd64 | amd64 | amd64 |
Environment | DEV | DEV | LIVE |
Team | Ops App1 | Ops App1 | Ops App1 |
Last seen | 2019-02-18 11:43 | 2019-02-18 11:56 | 2019-02-18 12:04 |
Last modified | 2019-02-18 11:43 | 2019-02-18 11:56 | 2019-02-18 12:04 |
Installed version | 42.28-2(amd64) | 42.28-2(amd64) | 48.33-1(amd64) |
The map_distro()
helper function is used in the following sections to clean up the Distribution
column.
def map_distro(name):
"""Helper to create canonical OS names."""
return (name.split('.', 1)[0]
.replace('Debian 7', 'wheezy')
.replace('Debian 8', 'jessie')
.replace('Debian 9', 'stretch')
.replace('Debian 10', 'buster')
.replace('squeeze', 'Squeeze [6]')
.replace('wheezy', 'Wheezy [7]')
.replace('jessie', 'Jessie [8]')
.replace('stretch', 'Stretch [9]')
.replace('buster', 'Buster [10]')
)
Data Cleaning With Pandas
This code cleans up the imported data using the Pandas API.
To get sensible version statistics, we split off the auxiliary information in the version column (anything after -
), leaving just the upstream part of the version string. The environment classifier is also cleaned up a little, and distributions are mapped to a canonical set of names. Some unused columns are dropped.
Finally, a subset of unique version samples is selected.
data = raw_data
data = data.assign(Version=data['Installed version'].str.split('-', 1, expand=True)[0])
data = data.assign(Environment=data.Environment.fillna('UNDEFINED').str.upper())
data = data.assign(Distribution=data.Distribution.apply(map_distro))
data = data.drop(columns=['CMDB_Id', 'Last seen', 'Last modified', 'Installed version'])
data = data.drop_duplicates(subset='Version', keep='first')
data.transpose()
0 | 2 | 26 | 45 | 62 | |
---|---|---|---|---|---|
Distribution | Jessie [8] | Jessie [8] | Jessie [8] | Jessie [8] | Jessie [8] |
Architecture | amd64 | amd64 | amd64 | amd64 | amd64 |
Environment | DEV | LIVE | LIVE | UNDEFINED | DEV |
Team | Ops App1 | Ops App1 | Platform | Platform | |
Version | 42.28 | 48.33 | 41.15 | 45.11 | 42.44 |
Data Cleaning With Pipelines
This does the exact same processing as the code above, but is arguably more readable and maintained more easily:
- It has less boilerplate, and makes the use of pipelined processing transparent.
- Each step clearly states what it does to the data.
- When steps are copied into other pipelines, the
X
placeholder ensures you use the data of this pipeline (the code is more DRY).
from dfply import *
piped = (raw_data
>> mutate(Version=X['Installed version'].str.split('-', 1, expand=True)[0])
>> mutate(Environment=X.Environment.fillna('UNDEFINED').str.upper())
>> mutate(Distribution=X.Distribution.apply(map_distro))
>> drop(X.CMDB_Id, X['Last seen'], X['Last modified'], X['Installed version'])
>> distinct(X.Version)
)
piped.transpose()
The result is identical to the pure Pandas code, as expected.
0 | 2 | 26 | 45 | 62 | |
---|---|---|---|---|---|
Distribution | Jessie [8] | Jessie [8] | Jessie [8] | Jessie [8] | Jessie [8] |
Architecture | amd64 | amd64 | amd64 | amd64 | amd64 |
Environment | DEV | LIVE | LIVE | UNDEFINED | DEV |
Team | Ops App1 | Ops App1 | Platform | Platform | |
Version | 42.28 | 48.33 | 41.15 | 45.11 | 42.44 |
To learn more about dfply
, read the dplyr-style Data Manipulation with Pipes in Python blog post, which has more examples.
Reference Links
dfply
- dplyr-style Data Manipulation with Pipes in Python – Towards Data Science
- kieferk/dfply: dplyr-style piping operations for Pandas dataframes
Alternatives
- has2k1/plydata: A grammar for data manipulation in Python
- shaypal5/pdpipe: Easy pipelines for Pandas dataframes
Data Science for Developers (2 Part Series)
1 Using R-style Data Pipelines in Notebooks
2 A JupyterHub Showcase: DevOps Intelligence
暂无评论内容