TLDR: Today the Python dev team at MongoDB is thrilled (stoked!) to announce the beta release of PyMongoArrow, a PyPI package supporting CPython 3.6+. This release adds several new APIs that will be of interest to developers who use NumPy, Pandas, or Apache Arrow-based frameworks to analyze data stored in MongoDB.
—
As the name suggests, PyMongoArrow leverages Apache Arrow to offer fast and easy conversion of MongoDB query result sets to multiple numerical data formats popular among Python developers, including NumPy ndarrays and Pandas DataFrames.
As reference points for our implementation, we also took a look at BigQuery’s Pandas integration, Pandas methods for handling JSON/semi-structured data, the Snowflake Python connector, and Dask.DataFrame.
How it Works
PyMongoArrow relies upon a user-specified data schema to marshal query result sets into tabular form. Users can define the schema by instantiating pymongoarrow.api.Schema using a mapping of field names to type-specifiers, e.g.:
from datetime import datetime
from pymongoarrow.api import Schema
schema = Schema({'_id': int, 'amount': float, 'last_updated': datetime})
There are multiple permissible type-identifiers for each supported BSON type. For a full list of supported types and associated type-identifiers, see the PyMongoArrow documentation.
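To make the idea of schema-driven marshalling concrete, here is a minimal pure-Python sketch. It is not PyMongoArrow's implementation: the helper documents_to_columns is hypothetical, plain dicts stand in for the Schema class, and lists stand in for Arrow arrays. It only shows how a mapping of field names to types can decide which fields are kept and how values are coerced.

```python
from datetime import datetime


# Hypothetical helper, not part of PyMongoArrow: it only illustrates how a
# schema can drive the conversion of documents into columns.
def documents_to_columns(documents, schema):
    """Turn a list of dict documents into a dict of column lists.

    `schema` maps field names to Python types. Fields that are not in the
    schema are dropped; a field missing from a document becomes None.
    """
    columns = {name: [] for name in schema}
    for doc in documents:
        for name, type_ in schema.items():
            value = doc.get(name)
            if value is not None and not isinstance(value, type_):
                value = type_(value)  # e.g. int 21 -> float 21.0
            columns[name].append(value)
    return columns


# A plain dict standing in for a schema (assumption for this sketch).
schema = {'_id': int, 'amount': float, 'last_updated': datetime}
documents = [
    {'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1), 'note': 'ignored'},
    {'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11)},
    {'_id': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9)},  # no amount
]

columns = documents_to_columns(documents, schema)
print(columns['amount'])  # [21.0, 16.0, None]
```

The point of the sketch is that the schema, not the documents, determines the shape of the output: the extra 'note' field is dropped and every column has one entry per document.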
Give it a Try
You can install PyMongoArrow on your local machine using pip:
$ python -m pip install pymongoarrow
Or, to use it with MongoDB Atlas:
$ python -m pip install pymongoarrow
$ python -m pip install "pymongo[srv]>=3.11,<4"
(To use PyMongoArrow with MongoDB Atlas’ mongodb+srv:// URIs, users must install PyMongo with the srv extra in addition to installing PyMongoArrow.)
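If you are unsure whether your environment is ready for a mongodb+srv:// URI, a small standard-library check along these lines may help. This is only a sketch: the function names and the example host are hypothetical, and it assumes that the srv extra is satisfied by dnspython, whose import name is dns.

```python
import importlib.util


def uses_srv_scheme(uri):
    """Return True if the connection string uses the mongodb+srv:// scheme."""
    return uri.startswith('mongodb+srv://')


def srv_extra_installed():
    """Return True if a top-level `dns` module (dnspython) can be found.

    Assumption: the srv extra of PyMongo is satisfied by dnspython.
    """
    return importlib.util.find_spec('dns') is not None


def check_uri(uri):
    """Return a short hint about whether the URI can be used as-is."""
    if uses_srv_scheme(uri) and not srv_extra_installed():
        return 'install "pymongo[srv]" before connecting'
    return 'ok'


# The Atlas host below is a made-up placeholder, not a real cluster.
print(check_uri('mongodb://localhost:27017'))
print(check_uri('mongodb+srv://cluster0.example.mongodb.net/'))
```

The second result depends on whether dnspython is present in the environment where the snippet runs.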
Insert some test data
To follow along with the examples below, start by adding the following test data to a MongoDB cluster:
from datetime import datetime
from pymongo import MongoClient
client = MongoClient()
client.db.data.insert_many([
{'_id': 1, 'amount': 21, 'last_updated': datetime(2020, 12, 10, 1, 3, 1)},
{'_id': 2, 'amount': 16, 'last_updated': datetime(2020, 7, 23, 6, 7, 11)},
{'_id': 3, 'amount': 3, 'last_updated': datetime(2021, 3, 10, 18, 43, 9)},
{'_id': 4, 'amount': 0, 'last_updated': datetime(2021, 2, 25, 3, 50, 31)}])
Quick Examples of How it Works
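PyMongoArrow’s find methods become available on PyMongo collections after calling pymongoarrow.monkey.patch_all(). With that in place, you can run a find operation to load all records with an amount greater than zero as a: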
pandas.DataFrame
df = client.db.data.find_pandas_all({'amount': {'$gt': 0}}, schema=schema)
numpy.ndarray
ndarrays = client.db.data.find_numpy_all({'amount': {'$gt': 0}}, schema=schema)
In this case, the return value is a dictionary whose keys are field names and whose values are the corresponding arrays.
pyarrow.Table
arrow_table = client.db.data.find_arrow_all({'amount': {'$gt': 0}}, schema=schema)
Developers who create an Arrow table directly can then take advantage of Arrow’s other capabilities: for example, serializing data and sending it to workers (as in a Dask workflow), using PyArrow’s APIs to write a queried dataset to Parquet or CSV, or passing it to any of the many PyPI packages that operate on Arrow-formatted data. For example, to write the table referenced by the variable arrow_table to a Parquet file named example.parquet, you’d run:
import pyarrow.parquet as pq
pq.write_table(arrow_table, 'example.parquet')
Other Items of Note:
- Originally, we intended to build a new API that worked exclusively with Pandas, but Pandas did not provide a stable C API that we could use. Meanwhile, we sort of fell in love with Apache Arrow. The Apache Arrow project defines a set of standards that address long-standing inefficiencies in the processing and transport of large datasets in high-performance applications. Conversion of Arrow tables to various formats was simple and fast. Since Arrow is a language-independent standard, our Arrow integration will make it easier for developers to move data from MongoDB into a wide variety of OLAP systems.
- Currently we are only distributing pre-built binaries for x86_64 architectures, but we are planning to add more soon. Please feel free to express your preference on GitHub!
- This library is in the early stages of development, so the API may change in the future. We definitely want to continue expanding it.
Photo credit: shiyang xu on Unsplash