JSON in data science projects: tips & tricks

Some useful tips and libraries for manipulating JSON in your data science projects.

The standard json module

Python has a standard module called json that lets you quickly manipulate JSON files.

Loading

import json

with open("data/example.json", "r") as f:
    data = json.load(f)

data
# [{'id': 0, 'content': [0.0, 0.0, 1.0]}, {'id': 1, 'content': [0.0, 1.0, 0.0]}] 

Saving

First tip: when working with textual data, the ensure_ascii=False option is very useful.
It writes accents and other non-ASCII characters as-is when saving, instead of escaping them as \uXXXX sequences:

with open("data/example.json", "w") as f:
    json.dump(data, f, ensure_ascii=False)
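
To see what the option changes, here is a small sketch using json.dumps on an in-memory record (the record itself is made up for illustration). Both forms decode to the same data; only the text on disk differs.

```python
import json

# Hypothetical record containing accented characters.
record = {"name": "Zoé", "city": "Orléans"}

escaped = json.dumps(record)                       # default: ensure_ascii=True
readable = json.dumps(record, ensure_ascii=False)  # keep non-ASCII characters

print(escaped)   # {"name": "Zo\u00e9", "city": "Orl\u00e9ans"}
print(readable)  # {"name": "Zoé", "city": "Orléans"}

# Both strings decode back to the same Python object.
assert json.loads(escaped) == json.loads(readable) == record
```

When writing with ensure_ascii=False, it is probably worth passing encoding="utf-8" to open() as well, since the default file encoding depends on the platform.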

Second tip: the indent option of json.dump pretty-prints the saved file, using the given number of spaces per nesting level.

with open("data/example.json", "w") as f:
    json.dump(data, f, ensure_ascii=False, indent=2)
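
As a rough illustration of the trade-off, the sketch below serializes the same sample data with indent=2 and in the most compact form the standard library offers (the separators argument). The indented version is easier to read and diff, while the compact one is smaller.

```python
import json

data = [{"id": 0, "content": [0.0, 0.0, 1.0]}]

# Human-readable: one element per line, 2 spaces per nesting level.
pretty = json.dumps(data, indent=2)

# Compact: no whitespace after "," and ":".
compact = json.dumps(data, separators=(",", ":"))

print(pretty)
print(compact)  # [{"id":0,"content":[0.0,0.0,1.0]}]
print(len(pretty), len(compact))
```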

Issues related to numpy

Data science projects often use numpy.
However, numpy objects are not JSON-serializable and therefore require conversion to standard python objects in order to be saved:

import numpy as np

data = np.array([[0., 0., 1.], [0., 1., 0.]])

with open("data/numpy.json", "w") as f:
    json.dump(data, f, ensure_ascii=False)

---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[12], line 6
      3 data = np.array([[0., 0., 1.], [0., 1., 0.]])
      5 with open("data/numpy.json", "w") as f:
----> 6 json.dump(data, f, ensure_ascii=False)

TypeError: Object of type ndarray is not JSON serializable

By converting the array into a list, the data object can be saved:

with open("data/numpy.json", "w") as f:
    json.dump(data.tolist(), f, ensure_ascii=False)

But that’s not very practical…

One solution is to create a custom JSONEncoder
which converts the numpy.ndarray using its tolist method at save time:

class NumpyJSONEncoder(json.JSONEncoder):
    """JSONEncoder to store Python dicts or lists containing numpy arrays."""

    def default(self, obj):
        """Transform numpy arrays into a JSON-serializable object such as a list.

        See: https://docs.python.org/3/library/json.html#json.JSONEncoder.default
        """
        if isinstance(obj, np.ndarray):
            return obj.tolist()

        # Defer to the base class, which raises TypeError for unsupported types.
        return super().default(obj)

with open("data/numpy.json", "w") as f:
    json.dump(data, f, ensure_ascii=False, cls=NumpyJSONEncoder)
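
If you would rather not subclass JSONEncoder, json.dump and json.dumps also accept a default function. Below is a minimal sketch, assuming that duck typing on a tolist method is acceptable for your data. The name to_serializable is a hypothetical helper, and array.array from the standard library stands in for a numpy array so that the snippet runs without numpy installed.

```python
import json
from array import array


def to_serializable(obj):
    """Fallback for json.dump(s): convert any object exposing tolist()."""
    if hasattr(obj, "tolist"):
        return obj.tolist()
    # Mirror the standard error for everything else.
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


# array.array is only a stand-in here (assumption for illustration).
payload = {"id": 0, "content": array("d", [0.0, 0.0, 1.0])}

encoded = json.dumps(payload, default=to_serializable)
print(encoded)  # {"id": 0, "content": [0.0, 0.0, 1.0]}
```

np.ndarray and numpy scalar types such as np.float32 also expose tolist, so the same function should cover them too.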

The orjson library

orjson is a fast JSON library for Python; its own benchmarks rank it as the fastest available. It natively serializes dataclass, datetime, numpy and UUID objects.

A few things to remember when working with orjson:

  • There is no load or dump function; you have to use loads and dumps instead.
  • Some features must be enabled with option flags, such as orjson.OPT_SERIALIZE_NUMPY to serialize numpy objects.

import orjson

with open("data/example.json", "rb") as f:
    data = orjson.loads(f.read())

with open("data/example.json", "wb") as f:
    f.write(orjson.dumps(data, option=orjson.OPT_SERIALIZE_NUMPY))

Note that orjson.dumps returns bytes rather than str, so the JSON file is opened in binary mode (hence rb and wb).

Performance

orjson claims to serialize numpy.ndarray 4 to 12 times faster than the standard library. This can be
checked by comparing the two methods described above:

data = {i: np.random.randn(100) for i in range(100)}

def save_json():
    with open("data/fast.json", "w") as f:
        json.dump(data, f, ensure_ascii=False, cls=NumpyJSONEncoder)

def save_orjson():
    with open("data/orfast.json", "wb") as f:
        f.write(orjson.dumps(data, option=orjson.OPT_SERIALIZE_NUMPY|orjson.OPT_NON_STR_KEYS))

%timeit save_json()
%timeit save_orjson()

# 15.5 ms ± 251 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# 1.15 ms ± 66.3 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In this example, orjson is more than 10 times faster (15.5 ms versus 1.15 ms per call). Note that the OPT_NON_STR_KEYS option is needed here
because the dictionary keys are integers: by default, orjson only accepts str keys.

(Tests run with python 3.11)
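
%timeit is an IPython magic, so it only works in a notebook or an IPython shell. In a plain script, the standard timeit module can do a similar job. The sketch below is only a template: best_ms is a hypothetical helper, and the two functions being timed are stand-ins that use the standard library alone. Swap in save_json and save_orjson to reproduce the comparison above.

```python
import json
import timeit

# Stand-in data: 100 lists of 100 floats, with string keys.
data = {str(i): [float(j) for j in range(100)] for i in range(100)}


def dump_compact():
    return json.dumps(data, separators=(",", ":"))


def dump_indented():
    return json.dumps(data, indent=2)


def best_ms(func, number=10, repeat=3):
    """Best per-call time in milliseconds over `repeat` runs of `number` calls."""
    timings = timeit.repeat(func, number=number, repeat=repeat)
    return min(timings) / number * 1000


for func in (dump_compact, dump_indented):
    print(f"{func.__name__}: {best_ms(func):.3f} ms per call")
```

Taking the minimum over several runs is a common way to reduce noise from other processes; the numbers will of course vary by machine.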

FastAPI and orjson

The FastAPI documentation contains [a guide to using orjson](https://fastapi.tiangolo.com/advanced/custom-response/?h=orjson#use-orjsonresponse) to serialize JSON responses.

This is particularly useful for APIs that expose machine learning models, whose outputs are often numpy.ndarray:

import numpy as np

from fastapi import FastAPI
from fastapi.responses import ORJSONResponse

app = FastAPI()

@app.get("/random-vector", response_class=ORJSONResponse)
async def get_random_vector():
    return ORJSONResponse(np.random.randn(100))

JSON Lines

When working with the JSON format, it’s not uncommon to manipulate collections of objects:

[
  {"id": 0, "name": "toto"},
  {"id": 1, "name": "titi"}
]

To be valid JSON, the objects must be contained in an array, hence the square brackets around the objects in the collection. However, this is impractical for reading large volumes of data, as you have to parse the entire file and load everything into memory.

This can be remedied by using the [JSON Lines](https://jsonlines.org/) format. It simply places one JSON object per line, so that you can iterate over the objects without having to parse the entire
collection all at once.

{"id": 0, "name": "toto"}
{"id": 1, "name": "titi"}
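
The format is simple enough to read and write with the standard json module alone. Here is a minimal sketch, where write_jsonl and read_jsonl are hypothetical helper names and an in-memory buffer stands in for a real file. The reader is a generator, so only one object is held in memory at a time.

```python
import io
import json

records = [{"id": 0, "name": "toto"}, {"id": 1, "name": "titi"}]


def write_jsonl(items, fp):
    """Write one JSON object per line to a text file object."""
    for item in items:
        fp.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_jsonl(fp):
    """Lazily yield one decoded object per non-empty line."""
    for line in fp:
        line = line.strip()
        if line:
            yield json.loads(line)


# io.StringIO stands in for a file opened in text mode.
buffer = io.StringIO()
write_jsonl(records, buffer)

buffer.seek(0)
for obj in read_jsonl(buffer):
    print(obj)
```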

The [jsonlines](https://jsonlines.readthedocs.io/en/latest/index.html) library is very useful for manipulating
such files. It can also be combined with orjson.

import jsonlines
import orjson

with jsonlines.open("data/many_examples.jsonl", "r", loads=orjson.loads) as reader:
    for obj in reader:
        print(obj)

# {'id': 0, 'name': 'toto'}
# {'id': 1, 'name': 'titi'}

TL;DR

A brief summary of the tips seen here:

  • Use ensure_ascii=False when working with the standard json library.
  • Consider the orjson library for the performance and functionality it offers.
  • Consider the JSON Lines format for collections of JSON objects.

Original article: JSON in data science projects: tips & tricks
