I/O
HDF5 is a format often used in computational physics and other data-science applications because it can store huge amounts of structured numerical data. Many datasets can be stored in a single file, categorized, linked together, and so on. A variety of python modules leverage HDF5 for input and output; often they rely on h5py or PyTables, pythonic interfaces that interoperate with numpy but do not natively support arbitrary python objects.
More complex data structures therefore need some other route into HDF5, and a variety of choices are possible. Any python object can be pickled and stored as a binary blob in HDF5, but the resulting blobs are not usable outside of python. The pandas data analysis module can read_hdf and export to_hdf, but even though the data is written in a usable way, the data layouts are nontrivial to read without pandas.
We aim for a happy medium by providing a class, H5able. Other python classes that contain a variety of data fields can inherit from it so that they can be easily written to HDF5.
An H5able object will be saved as a group that contains properties written into groups and datasets, with the same name as the property itself.
If a property is one of a slew of known types it will be written natively as an H5 field; otherwise it will be pickled.
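The write rule above can be sketched with the standard library alone. In this minimal illustration a nested dict stands in for an HDF5 group, and `NATIVE_TYPES`, `write_property`, `write_object`, and `Measurement` are hypothetical names that are not part of tdg; the library's real list of natively written types is longer.

```python
import pickle

# Types this toy example treats as natively storable (an assumption;
# the library's own list is longer).
NATIVE_TYPES = (int, float, str)


def write_property(group, key, value):
    """Store value under group[key]; a plain dict stands in for an HDF5 group."""
    if isinstance(value, dict):
        # Containers become sub-groups that are written recursively.
        subgroup = group.setdefault(key, {})
        for k, v in value.items():
            write_property(subgroup, str(k), v)
    elif isinstance(value, NATIVE_TYPES):
        group[key] = value
    else:
        # Fallback: an opaque blob that only python can read back.
        group[key] = pickle.dumps(value)


def write_object(group, obj):
    """Write every instance attribute under its own name."""
    for name, value in vars(obj).items():
        write_property(group, name, value)


class Measurement:
    def __init__(self):
        self.steps = 100
        self.beta = 0.5
        self.tags = {'run': 1}
        self.window = range(3)  # not a native type here, so it is pickled


store = {}
write_object(store, Measurement())
```

After this runs, `store['steps']`, `store['beta']`, and `store['tags']` hold readable values, while `store['window']` holds bytes that only `pickle.loads` can interpret, which is the trade-off the fallback accepts.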
- class tdg.h5.H5able
Bases: object
- to_h5(group, _top=True)
Write the object as an HDF5 group. Member data will be stored as groups or datasets inside group, with the same name as the property itself.
Note
PEP8 considers _single_leading_underscores as weakly marked for internal use. All of these properties will be stored in a single group named _.
- classmethod from_h5(group, strict=True, _top=True)
Construct a fresh object from the HDF5 group.
Warning
If there is no known strategy for writing data to HDF5, objects will be pickled.
Loading pickled data received from untrusted sources can be unsafe.
See: https://docs.python.org/3/library/pickle.html for more.
- The data types that are not pickled are
  - H5able
  - int
  - float
  - dict, list
  - numpy.ndarray
  - torch.tensor, torch.Size
  - A subset of other torch objects, including torch.distributions.Distribution.
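To make the warning about pickled data concrete, here is a harmless demonstration of why unpickling untrusted bytes is unsafe: a pickle can name any callable to be run at load time. The `Payload` class is purely illustrative; a hostile blob could name something far less benign than `print`.

```python
import pickle


class Payload:
    def __reduce__(self):
        # Tell pickle to "reconstruct" this object by calling print(...).
        return (print, ("this ran during unpickling",))


blob = pickle.dumps(Payload())

# Loading the blob calls print; no Payload object is ever rebuilt.
result = pickle.loads(blob)
```

Here `result` is simply whatever the named callable returned, so the loader has executed code chosen by whoever produced the blob.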
To provide custom methods for H5ing otherwise-unknown types that cannot be made H5able, a user can write a small strategy.
A strategy is an instance-free class with just static methods applies, write, and read.
Suppose the user had an Example class; a strategy might look like
    class ExampleStrategy(H5Data, name='example'):
        '''
        The name is stored as metadata and then used to look up this strategy
        '''

        @staticmethod
        def applies(value):
            r'''
            Parameters
            ----------
            value: any value at all

            Returns
            -------
            bool: is ``value`` H5able using this interface?
            '''
            return isinstance(value, Example)

        @staticmethod
        def write(group, key, value):
            r'''
            Parameters
            ----------
            group: an H5py group in which to store the Example object in value
            key: a string to name the object
            value: an Example to store into ``group/key``.
            '''
            group['property'] = value.example_property

        @staticmethod
        def read(group):
            r'''
            Parameters
            ----------
            group: an H5py group

            Returns
            -------
            Example: that was previously written with write.
            '''
            example_property = group['property']
            return Example(example_property)
However, it is probably simplest in most circumstances to just inherit from H5able.
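The `name='example'` keyword in the class statement suggests that strategies register themselves when they are defined. One way such a name-keyed registry could work, which is an assumption and not necessarily how tdg implements it, uses `__init_subclass__`. In this sketch a dict stands in for an HDF5 group, and `Strategy`, `IntegerStrategy`, `write`, and `read` are hypothetical names.

```python
class Strategy:
    # Maps the stored metadata name to the strategy class.
    registry = {}

    def __init_subclass__(cls, name, **kwargs):
        # Runs once per subclass definition and receives the class keyword.
        super().__init_subclass__(**kwargs)
        Strategy.registry[name] = cls


class IntegerStrategy(Strategy, name='integer'):

    @staticmethod
    def applies(value):
        return isinstance(value, int)

    @staticmethod
    def write(group, key, value):
        group[key] = {'data': value}

    @staticmethod
    def read(entry):
        return entry['data']


def write(group, key, value):
    # Use the first registered strategy that claims the value.
    for name, strategy in Strategy.registry.items():
        if strategy.applies(value):
            strategy.write(group, key, value)
            # Record which strategy wrote the entry, as metadata.
            group[key]['strategy'] = name
            return
    raise TypeError(f'no strategy applies to {type(value).__name__}')


def read(group, key):
    entry = group[key]
    # The stored name selects the strategy that knows how to read it.
    return Strategy.registry[entry['strategy']].read(entry)


store = {}
write(store, 'steps', 7)
```

Keeping the name next to the data means the reader never has to guess a type from the stored layout: `read(store, 'steps')` looks up `'integer'` and hands the entry to `IntegerStrategy.read`.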
See tdg/io.py for the strategies for int, float, dict, numpy.ndarray, torch.tensor, torch.Size, and H5able.
If the H5able strategy is desired but the class cannot be made to inherit from H5able, just create a new strategy that inherits from H5ableStrategy and overrides the applies method.
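This pattern is ordinary subclassing: inherit the writing logic and replace only the type test. A minimal sketch follows, in which `Storable` and `StorableStrategy` are stand-ins for H5able and H5ableStrategy (whose real implementations are not reproduced here), `Vendor` plays a class that cannot be edited, and a dict stands in for an HDF5 group.

```python
class Storable:
    """Stand-in for H5able."""


class StorableStrategy:
    """Stand-in for H5ableStrategy: stores every attribute by name."""

    @staticmethod
    def applies(value):
        return isinstance(value, Storable)

    @staticmethod
    def write(group, key, value):
        group[key] = dict(vars(value))


class Vendor:
    """Pretend third-party class that cannot inherit from Storable."""

    def __init__(self, mass):
        self.mass = mass


class VendorStrategy(StorableStrategy):
    # write is inherited unchanged; only the type test is replaced.

    @staticmethod
    def applies(value):
        return isinstance(value, Vendor)


store = {}
particle = Vendor(mass=1.5)
if VendorStrategy.applies(particle):
    VendorStrategy.write(store, 'particle', particle)
```

The base strategy still rejects `Vendor` instances, so the new strategy extends coverage without changing how existing classes are handled.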
One can also provide custom strategies and to_h5 and from_h5 methods.
For example, tdg.ensemble.GrandCanonical adds the ability to read from_h5 only the data related to a subset of configurations, and the ability to extend_h5 with new configurations or measurements, using a custom ObservableStrategy, even though the data is torch data.