Back in August of last year (roughly 8 months ago), I hunched over my desk at 4 am desperate to fire off a post before boarding a flight the next morning. The article was titled Creating Database Schemas: a Job for Robots, or Perhaps Pandas. It was my intent at the time to solve a common annoyance: creating database tables out of raw data, without the obnoxious process of explicitly setting each column’s datatype. I had a few leads that led me to believe I had the answer… boy was I wrong.
The task seems somewhat reasonable from the surface. Surely we can spot columns where the data is always in integers, or match the expected format of a date, right? If anything, we’ll fall back to text or varchar and call it a day. Hell, even MongoDB’s Compass does a great job of this by merely uploading a CSV… this has got to be some trivial task handled by third-party libraries by now.
For one reason or another, searching for a solution to this problem almost always comes up empty. Software developers probably have little need for dynamically generated tables if their applications run solely on self-defined models. Full-time Data Scientists have access to plenty of expensive tools which seem to claim this functionality, yet it all seems so… inaccessible.
Is This NOT a Job For Pandas?
From my experience, no. Pandas does offer hope but doesn’t seem to get the job done quite right. Let’s start with a dataset so you can see what I mean. Here’s a bunch of fake identities I’ll be using to mimic the outcome I experienced when working with real data:
id | initiated | hiredate | firstname | lastname | title | department | location | country | type | |
---|---|---|---|---|---|---|---|---|---|---|
100035435 | 2015-12-11T09:16:20.722-08:00 | 3/22/67 | GretchenRMorrow@jourrapide.com | Gretchen | Morrow | Power plant operator | Physical Product | Britling Cafeterias | United Kingdom | Employee |
100056435 | 2015-12-15T10:11:24.604-08:00 | 6/22/99 | ElizabethLSnow@armyspy.com | Elizabeth | Snow | Oxygen therapist | Physical Product | Grade A Investment | United States of America | Employee |
100037955 | 2015-12-16T14:31:32.765-08:00 | 5/31/74 | AlbertMPeterson@einrot.com | Albert | Peterson | Psychologist | Physical Product | Grass Roots Yard Services | United States of America | Employee |
100035435 | 2016-01-20T11:15:47.249-08:00 | 9/9/69 | JohnMLynch@dayrep.com | John | Lynch | Environmental hydrologist | Physical Product | Waccamaw’s Homeplace | United States of America | Employee |
100057657 | 2016-01-21T12:45:38.261-08:00 | 4/9/83 | TheresaJCahoon@teleworm.us | Theresa | Cahoon | Personal chef | Physical Product | Cala Foods | United States of America | Employee |
100056747 | 2016-02-01T11:25:39.317-08:00 | 6/26/98 | KennethHPayne@dayrep.com | Kenneth | Payne | Central office operator | Frontline | Magna Consulting | United States of America | Employee |
100035435 | 2016-02-01T11:28:11.953-08:00 | 4/16/82 | LeifTSpeights@fleckens.hu | Leif | Speights | Staff development director | Frontline | Rivera Property Maintenance | United States of America | Employee |
100035435 | 2016-02-01T12:21:01.756-08:00 | 8/6/80 | JamesSRobinson@teleworm.us | James | Robinson | Scheduling clerk | Frontline | Discount Furniture Showcase | United States of America | Employee |
100074688 | 2016-02-01T13:29:19.147-08:00 | 12/14/74 | AnnaDMoberly@jourrapide.com | Anna | Moberly | Playwright | Physical Product | The Wiz | United States of America | Employee |
100665778 | 2016-02-04T14:40:05.223-08:00 | 9/13/66 | MarjorieBCrawford@armyspy.com | Marjorie | Crawford | Court, municipal, and license clerk | Physical Product | The Serendipity Dip | United Kingdom | Employee |
100876876 | 2016-02-24T12:39:25.872-08:00 | 12/19/67 | LyleCHackett@fleckens.hu | Lyle | Hackett | Airframe mechanic | Physical Product | Infinity Investment Plan | United States of America | Employee |
100658565 | 2016-02-29T15:52:12.933-08:00 | 11/17/83 | MaryJDensmore@jourrapide.com | Mary | Densmore | Employer relations representative | Frontline | One-Up Realtors | United States of America | Employee |
100766547 | 2016-03-01T12:32:53.357-08:00 | 10/1/87 | CindyRDiaz@armyspy.com | Cindy | Diaz | Student affairs administrator | Physical Product | Mr. AG’s | United States of America | Employee |
100045677 | 2016-03-02T12:07:44.264-08:00 | 8/16/65 | AndreaTLigon@einrot.com | Andrea | Ligon | Railroad engineer | Central Growth | Robinson Furniture | United States of America | Employee |
There are some juicy datatypes in there: integers , timestamps , dates , strings…. and those are only the first four columns! Let’s load this thing into a DataFrame and see what information we can get that way:
import pandas as pd
csv = 'data/fake.csv'
workers_df = pd.read_csv(csv, header=0, encoding='utf-8')
meta = workers_df.info(verbose=True)
print(meta)
Using Pandas’ info()
should do the trick! This returns a list of columns and their data types:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22 entries, 0 to 21
Data columns (total 11 columns):
id 14 non-null float64
initiated 14 non-null object
hiredate 14 non-null object
email 14 non-null object
firstname 14 non-null object
lastname 14 non-null object
title 14 non-null object
department 14 non-null object
location 14 non-null object
country 14 non-null object
type 14 non-null object
dtypes: float64(1), object(10)
memory usage: 2.0+ KB
None
…Or not. What is this garbage? Only one of our 11 columns identified a data type, and it was incorrectly listed as a float! Okay, so maybe Pandas doesn’t have a secret one-liner for this. So who does?
What about PySpark?
It’s always been a matter of time before we’d turn to Apache’s family of aged data science products. Hadoop, Spark, Kafka… all of them have a particular musty stench about them that tastes like “I feel like I should be writing in Java right now.” Heads up: they do want you to write in Java. Misery loves company.
Nonetheless, PySpark does support reading data as DataFrames in Python, and also comes with the elusive ability to infer schemas. Installing Hadoop and Spark locally still kind of sucks for solving this one particular problem. Cue Databricks: a company that spun off from the Apache team way back in the day, and offers free cloud notebooks integrated with- you guessed it: Spark.
With Databricks, we can upload our CSV and load it into a DataFrame by spinning up a free notebook. The source looks something like this:
# File location and type
file_location = "/FileStore/tables/fake.csv"
file_type = "csv"
# CSV options
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","
df = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.option("header", first_row_is_header) \
.option("sep", delimiter) \
.load(file_location)
display(df)
Let’s see out the output looks:
df:pyspark.sql.dataframe.DataFrame
id:integer
initiated:timestamp
hiredate:string
email:string
firstname:string
lastname:string
title:string
department:string
location:string
country:string
type:string
Not bad! We correctly ‘upgraded’ our ID from float to integer, and we managed to get the timestamp correct also. With a bit of messing around, we could probably have even gotten the date correct too, given that we stated the format beforehand.
A look at the Databricks Notebook interface.
And Yet, This Still Kind of Sucks
Even though we can solve our problem in a notebook, we still haven’t solved the use case: I want a drop-in solution to create tables out of CSVs… whenever I want! I want to accomplish this while writing any app, at the drop of a hat without warning. I don’t want to install Hadoop and have Java errors coming back at me through my terminal. Don’t EVER let me see Java in my terminal. UGH:
py4j.protocol.Py4JJavaError: An error occurred while calling o43.count.
: java.lang.IllegalArgumentException: Unsupported class file major version 55
at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:166)
at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:148)
at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:136)
at org.apache.xbean.asm6.ClassReader.<init>(ClassReader.java:237)
at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:49)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:517)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:500)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:500)
at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175)
at org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238)
at org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631)
Python’s “tableschema” Library
Thankfully, there’s at least one other person out there who has shared this desire. That brings us to tableschema, a not-quite-perfect-but-perhaps-good-enough library to gunsling data like some kind of wild data cowboy. Let’s give it a go:
import csv
from tableschema import Table
data = 'data/fake.csv'
schema = infer(data, limit=500, headers=1, confidence=0.85)
print(schema)
If our dataset is particularly large, we can use the limit
attribute to limit the sample size to the first X number of rows. Another nice feature is the confidence
attribute: a 0-1 ratio for allowing casting errors during the inference. Here’s what comes back:
{
"fields": [{
"name": "id",
"type": "integer",
"format": "default"
}, {
"name": "initiated",
"type": "string",
"format": "default"
}, {
"name": "hiredate",
"type": "date",
"format": "default"
}, {
"name": "email",
"type": "string",
"format": "default"
}, {
"name": "firstname",
"type": "string",
"format": "default"
}, {
"name": "lastname",
"type": "string",
"format": "default"
}, {
"name": "title",
"type": "string",
"format": "default"
}, {
"name": "department",
"type": "string",
"format": "default"
}, {
"name": "location",
"type": "string",
"format": "default"
}, {
"name": "country",
"type": "string",
"format": "default"
}, {
"name": "type",
"type": "string",
"format": "default"
}],
"missingValues": [""]
}
Hey, that’s good enough for me! Now let’s automate the shit out this.
Creating a Table in SQLAlchemy With Our New Schema
I’m about to throw a bunch in your face right here. Here’s a monster of a class:
from sqlalchemy import create_engine
import config
import pandas as pd
import psycopg2
from tableschema import Table, infer, Schema
from functions.recursivejson import extract_values
from sqlalchemy.types import Integer, Text, Date
class CreateTablesFromCSVs:
"""Infer a table schema from a CSV."""
__uri = config.PG_URI
__engine = create_engine(__uri, convert_unicode=True, echo=True)
__data = 'data/fake.csv'
@classmethod
def get_data(cls):
"""Pull latest data."""
test_df = pd.read_csv(cls.__data, header=0, encoding='utf-8')
return test_df
@classmethod
def get_schema_from_csv(cls, csv):
"""Infers schema from CSV."""
table = Table(csv)
table.infer(limit=500, confidence=0.55)
schema = table.schema.descriptor
names = cls.get_column_names(schema, 'name')
datatypes = cls.get_column_datatypes(schema, 'type')
schema_dict = dict(zip(names, datatypes))
return schema_dict
@classmethod
def get_column_names(cls, schema, key):
"""Get names of columns."""
names = extract_values(schema, key)
return names
@classmethod
def get_column_datatypes(cls, schema, key):
"""Convert schema to recognizable by SQLAlchemy."""
values = extract_values(schema, key)
for i, value in enumerate(values):
if value == 'integer':
values[i] = Integer
elif value == 'string':
values[i] = Text
elif value == 'date':
values[i] = Date
return values
@classmethod
def create_new_table(cls, data, schema):
"""Create new table from CSV and generated schema."""
workday_table.to_sql('faketable',
con=cls.__engine,
schema='testschema',
if_exists='replace',
chunksize=300,
dtype=schema)
data = CreateTablesFromCSVs.get_schema_from_csv()
schema = CreateTablesFromCSVs.get_schema_from_csv(data)
CreateTablesFromCSVs.create_new_table(data, schema)
The first thing worth mentioning is I’m importing a function from my personal secret library to extract values from JSON objects. I’ve spoken about it before.
Let’s break down this class:
-
get_data()
reads our CSV into a Pandas DataFrame. -
get_schema_from_csv()
kicks off building a Schema that SQLAlchemy can use to build a table. -
get_column_names()
simply pulls column names as half our schema. -
get_column_datatypes()
manually replaces the datatype names we received from tableschema and replaces them with SQLAlchemy datatypes. -
create_new_table
Uses a beautiful marriage between Pandas and SQLAlchemy to create a table in our database with the correct datatypes mapped.
Promising Potential, Room to Grow
While tableschema works some of the time, it isn’t perfect. The base of what we accomplish still stands: we now have a reliable formula for how we would create schemas on the fly if we trust our schemas to be accurate.
Just wait until next time when we introduce Google BigQuery into the mix.
原文链接:From CSVs to Tables: Infer Schema Data Types From Raw Spreadsheets
暂无评论内容