Save Pandas objects to HDF5

What is HDF5?

HDF stands for “Hierarchical Data Format”, and it was designed to store very large amounts of data. It was originally developed at the National Center for Supercomputing Applications and is now supported by The HDF Group, a non-profit corporation.

Why use HDF5?

  • At its core, HDF5 is a binary file format specification.

  • It can store many datasets together with user-defined metadata, it offers optimized I/O, and it lets you query its contents.

  • Many programming languages have tools to work with HDF.

  • HDF5 allows datasets to live in a nested tree structure. In effect, HDF5 is a file system within a file. The ‘folders’ inside this file system are called groups, and sometimes nodes or keys (these terms are often used interchangeably).
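To make the tree structure and the query ability from the list above concrete, here is a minimal sketch using pandas. The file name, the key names (`/raw/run_1`, `/processed/run_1`) and the filter are made up for illustration, and the example assumes that PyTables is installed, since pandas relies on it for HDF5 support.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'tree_demo.h5')  # hypothetical file name

    with pd.HDFStore(path, mode='w') as store:
        # Keys look like paths; intermediate groups are created as needed.
        store.put('/raw/run_1', df)
        # format='table' with data_columns makes column 'A' queryable on disk.
        store.put('/processed/run_1', df * 10,
                  format='table', data_columns=['A'])

        # List every pandas object stored in the file.
        keys = sorted(store.keys())
        # Load only the rows where A > 10 instead of the whole dataset.
        subset = store.select('/processed/run_1', where='A > 10')

print(keys)
print(subset)
```

Note that queries with `where` only work for objects stored with `format='table'`; the default fixed format is faster to write but cannot be filtered on disk.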

Toolbox

There are at least three Python packages which can handle HDF5 files: h5py, pytables, and pandas.
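Because all of these packages read and write the same file format, a file written with one can be inspected with another. The sketch below writes a DataFrame with pandas and then walks the groups of the file with PyTables directly. The file name is hypothetical, and the exact internal layout pandas uses under `/new_dataset` is an implementation detail that may differ between versions.

```python
import os
import tempfile

import pandas as pd
import tables  # PyTables

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'toolbox_demo.h5')  # hypothetical file name

    # Write with pandas...
    df.to_hdf(path, key='/new_dataset', mode='w')

    # ...and inspect the same file with PyTables.
    with tables.open_file(path, mode='r') as h5:
        # walk_groups yields the starting group first, then its descendants.
        group_paths = [g._v_pathname for g in h5.walk_groups('/')]

print(group_paths)
```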

There are also a few tools to visualize them: HDFView (Java), HDF Compass (Python), and ViTables (Python). They are available in the Ubuntu repositories, but often they don’t work as expected.

Fortunately, ViTables is available in the conda-forge package channel and works flawlessly.

Example #1: Dump a DataFrame

import pandas as pd

# Create an example DataFrame
data = {'A': [1, 2, 3], 'B': [4, 5, 6]}
df = pd.DataFrame(data)

# Write the DataFrame to a new file (mode='w' overwrites an existing file)
with pd.HDFStore('test.h5', mode='w') as f:
    f.put(key='/new_dataset', value=df)
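To check that the dump worked, the DataFrame can be read back. This is a small sketch, not part of the original example: it writes to a temporary directory so that it runs on its own, and it uses `pd.read_hdf`, which opens the file, loads a single key, and closes the file again.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.h5')

    # Same write step as in the example above.
    with pd.HDFStore(path, mode='w') as f:
        f.put(key='/new_dataset', value=df)

    # Load the dataset stored under the given key.
    restored = pd.read_hdf(path, key='/new_dataset')

print(restored)
```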

Example #2: Write metadata

One of the most interesting aspects of HDF5 is its ability to store metadata*.

meta = {'date': '21/06/2019', 'author': 'epassaro'}

# Attach the metadata dictionary to the node that holds the dataset
with pd.HDFStore('test.h5', mode='a') as f:
    f.get_storer('/new_dataset').attrs.metadata = meta
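Reading the metadata back follows the same path in reverse. The sketch below is self-contained, so it repeats the write step in a temporary directory. The attribute name `metadata` is simply the one chosen in the example above, not a reserved name.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
meta = {'date': '21/06/2019', 'author': 'epassaro'}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.h5')

    # Write the dataset and attach the metadata, as in the example above.
    with pd.HDFStore(path, mode='w') as f:
        f.put(key='/new_dataset', value=df)
        f.get_storer('/new_dataset').attrs.metadata = meta

    # Reopen read-only and fetch the attribute from the same node.
    with pd.HDFStore(path, mode='r') as f:
        loaded_meta = f.get_storer('/new_dataset').attrs.metadata

print(loaded_meta)
```

A dictionary is not a native HDF5 type, so PyTables serializes it before storing it. Tools outside the Python ecosystem may therefore not display it in readable form, while plain strings and numbers are more portable.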

* the good old FITS format can do this as well!

Example #3: Write metadata to root (“/”)

meta = {'date': '21/06/2019', 'author': 'epassaro'}

# Attributes on the root group describe the file as a whole
with pd.HDFStore('test.h5', mode='a') as f:
    f.root._v_attrs['author'] = meta['author']
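Root attributes can be read back through the same `_v_attrs` object. The following sketch assumes the PyTables attribute interface that pandas exposes through `f.root`. It writes to a temporary directory so that it runs on its own, and it lists the user-defined attributes with `_f_list('user')`.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.h5')

    # Write a dataset and one root-level attribute.
    with pd.HDFStore(path, mode='w') as f:
        f.put(key='/new_dataset', value=df)
        f.root._v_attrs['author'] = 'epassaro'

    with pd.HDFStore(path, mode='r') as f:
        # Convert to a plain Python string; PyTables may return a NumPy string.
        author = str(f.root._v_attrs['author'])
        # Names of the user-defined attributes, excluding PyTables' own ones.
        user_attrs = f.root._v_attrs._f_list('user')

print(author)
print(user_attrs)
```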
