scalarstop.datablob

Group together and name your training, validation, and test sets.

The classes in this module group your data into the training, validation, and test sets used for training machine learning models. They also record the hyperparameters used to process each dataset.

The DataBlob subclass name and hyperparameters are used to create a unique content-addressable name that makes it easy to keep track of many datasets at once.

Module Contents

Classes

DataBlob

Subclass this to group your training, validation, and test sets for training machine learning models.

DataFrameDataBlob

Subclass this to transform a pandas.DataFrame into your training, validation, and test sets.

AppendDataBlob

Subclass this to create a new DataBlob that extends an existing DataBlob.

class DataBlob(*, hyperparams: Optional[Union[Mapping[str, Any], HyperparamsType]] = None, **kwargs)

Subclass this to group your training, validation, and test sets for training machine learning models.

Here is how to use DataBlob to group your training, validation, and test sets:

  1. Subclass DataBlob with a class name that describes your dataset in general. In this example, we’ll use MyDataBlob as the class name.

  2. Define a dataclass decorated with @sp.dataclass at MyDataBlob.Hyperparams that describes the hyperparameters involved in processing your dataset. An instance of this dataclass will be available at MyDataBlob.hyperparams.

  3. Override the methods DataBlob.set_training(), DataBlob.set_validation(), and DataBlob.set_test() to generate tf.data.Dataset pipelines representing your training, validation, and test sets.

Those three steps roughly look like:

>>> import tensorflow as tf
>>> import scalarstop as sp
>>>
>>> class MyDataBlob(sp.DataBlob):
...
...     @sp.dataclass
...     class Hyperparams(sp.HyperparamsType):
...         cols: int
...
...     def _data(self):
...         x = tf.random.uniform(shape=(10, self.hyperparams.cols))
...         y = tf.round(tf.random.uniform(shape=(10, 1)))
...         return tf.data.Dataset.zip((
...             tf.data.Dataset.from_tensor_slices(x),
...             tf.data.Dataset.from_tensor_slices(y),
...         ))
...
...     def set_training(self):
...         return self._data()
...
...     def set_validation(self):
...         return self._data()
...
...     def set_test(self):
...         return self._data()
>>>

In the example above, the training, validation, and test sets are created with exactly the same code. In practice, you will create them from different inputs.

Now we create an instance of our subclass so we can start using it.

>>> datablob = MyDataBlob(hyperparams=dict(cols=3))
>>> datablob
<sp.DataBlob MyDataBlob-bn5hpc7ueo2uz7as1747tetn>

DataBlob instances are given a unique name by hashing together the class name with the instance’s hyperparameters.

>>> datablob.name
'MyDataBlob-bn5hpc7ueo2uz7as1747tetn'
>>>
>>> datablob.group_name
'MyDataBlob'
>>>
>>> datablob.hyperparams
MyDataBlob.Hyperparams(cols=3)
>>>
>>> sp.enforce_dict(datablob.hyperparams)
{'cols': 3}

We save exactly one instance of each tf.data.Dataset pipeline in the properties DataBlob.training, DataBlob.validation, and DataBlob.test.

>>> datablob.training
<ZipDataset shapes: ((3,), (1,)), types: (tf.float32, tf.float32)>
>>>
>>> datablob.validation
<ZipDataset shapes: ((3,), (1,)), types: (tf.float32, tf.float32)>
>>>
>>> datablob.test
<ZipDataset shapes: ((3,), (1,)), types: (tf.float32, tf.float32)>

DataBlob objects have some methods for applying tf.data transformations to the training, validation, and test sets at the same time:

  • Batching. DataBlob.batch() will batch the training, validation, and test sets at the same time. If you call DataBlob.batch() with the keyword argument with_tf_distribute=True, your input batch size will be multiplied by the number of replicas in your tf.distribute strategy.

  • Caching. DataBlob.cache() will cache the training, validation, and test sets in memory once you iterate over them. This is useful if your tf.data.Dataset pipelines do something computationally expensive each time you iterate over them. (Batching and caching are sketched together after this list.)

  • Saving/loading to/from the filesystem. DataBlob.save() saves the training, validation, and test sets to a path on the filesystem. A saved DataBlob can be loaded back with the classmethod DataBlob.from_exact_path().
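
For example, here is a minimal sketch of the batching and caching transformations, continuing with the datablob defined above. Each call returns a new DataBlob, so the calls can be chained:

>>> batched_datablob = datablob.batch(2)
>>> cached_datablob = datablob.cache()
>>> batched_and_cached_datablob = datablob.batch(2).cache()

Saving and loading are demonstrated below.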

>>> import os
>>> import tempfile
>>> tempdir = tempfile.TemporaryDirectory()
>>>
>>> datablob = datablob.save(tempdir.name)
>>>
>>> os.listdir(tempdir.name)
['MyDataBlob-bn5hpc7ueo2uz7as1747tetn']
>>> path = os.path.join(tempdir.name, datablob.name)
>>> loaded_datablob = MyDataBlob.from_exact_path(path)
>>> loaded_datablob
<sp.DataBlob MyDataBlob-bn5hpc7ueo2uz7as1747tetn>

Alternatively, if you have the hyperparameters of the DataBlob but not the name, you can use the classmethod DataBlob.from_filesystem().

>>> loaded_datablob_2 = MyDataBlob.from_filesystem(
...    hyperparams=dict(cols=3),
...    datablobs_directory=tempdir.name,
... )
>>> loaded_datablob_2
<sp.DataBlob MyDataBlob-bn5hpc7ueo2uz7as1747tetn>
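
If you are not sure whether a DataBlob with these hyperparameters has already been saved, the classmethod DataBlob.from_filesystem_or_new() behaves like DataBlob.from_filesystem(), but constructs a new instance when no saved copy is found. A minimal sketch, reusing the hyperparameters and temporary directory from above:

>>> loaded_or_new_datablob = MyDataBlob.from_filesystem_or_new(
...    hyperparams=dict(cols=3),
...    datablobs_directory=tempdir.name,
... )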

(And now let's clean up the temporary directory from above.)

>>> tempdir.cleanup()

Hyperparams: Type[HyperparamsType]

hyperparams: HyperparamsType

classmethod from_filesystem(cls, *, hyperparams: Optional[Union[Mapping[str, Any], HyperparamsType]] = None, datablobs_directory: str)

Load a DataBlob from the filesystem, calculating the filename from the hyperparameters.

Parameters
  • hyperparams – The hyperparameters of the DataBlob that we want to load.

  • datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.

classmethod from_filesystem_or_new(cls, *, hyperparams: Optional[Union[Mapping[str, Any], HyperparamsType]] = None, datablobs_directory: str, **kwargs)

Load a DataBlob from the filesystem, calculating the filename from the hyperparameters. Create a new DataBlob if we cannot find a saved one on the filesystem.

Parameters
  • hyperparams – The hyperparameters of the DataBlob that we want to load.

  • datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.

  • **kwargs – Other keyword arguments that you need to pass to your __init__().

static from_exact_path(path: str) → scalarstop.datablob.DataBlob

Load a DataBlob from a directory on the filesystem.

property name(self) → str

The name of this specific dataset.

property group_name(self) → str

The group name of this dataset.

This is typically the DataBlob subclass’s class name.

Conceptually, the group name is the name for all DataBlobs that share the same code but have different hyperparameters.
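
For example, two instances of the MyDataBlob class defined above that differ only in their hyperparameters share a group_name but not a name. A quick sketch:

>>> blob_a = MyDataBlob(hyperparams=dict(cols=3))
>>> blob_b = MyDataBlob(hyperparams=dict(cols=4))
>>> blob_a.group_name == blob_b.group_name
True
>>> blob_a.name == blob_b.name
False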

set_training(self) → tf.data.Dataset

Create a tf.data.Dataset for the training set.

property training(self) → tf.data.Dataset

A tf.data.Dataset instance representing the training set.

set_validation(self) → tf.data.Dataset

Create a tf.data.Dataset for the validation set.

property validation(self) → tf.data.Dataset

A tf.data.Dataset instance representing the validation set.

set_test(self) → tf.data.Dataset

Create a tf.data.Dataset for the test set.

property test(self) → tf.data.Dataset

A tf.data.Dataset instance representing the test set.

batch(self, batch_size: int, *, with_tf_distribute: bool = False) → scalarstop.datablob.DataBlob

Batch this DataBlob.

cache(self) → scalarstop.datablob.DataBlob

Cache this DataBlob into memory before iterating over it.

save_hook(self, *, subtype: str, path: str) → None

Override this method to run additional code when saving this DataBlob to disk.
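
For instance, here is a hypothetical sketch of a subclass that writes an extra note file whenever it is saved. The exact subtype and path values passed to this hook depend on how save() invokes it, so treat this purely as an illustration:

>>> class MyDataBlobWithSaveHook(MyDataBlob):
...     def save_hook(self, *, subtype: str, path: str) -> None:
...         super().save_hook(subtype=subtype, path=path)
...         # Hypothetical extra step: record which subtype was saved and where.
...         with open(os.path.join(path, "save_hook_note.txt"), "w") as handle:
...             handle.write(f"saved {subtype} to {path}")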

save(self, datablobs_directory: str, *, ignore_existing: bool = False) → scalarstop.datablob.DataBlob

Save this DataBlob to disk.

Parameters
  • datablobs_directory – The directory where you plan on storing all of your DataBlobs. This method will save this DataBlob in a subdirectory of datablobs_directory with the same name as DataBlob.name.

  • ignore_existing – Set this to True to ignore if there is already a DataBlob at the given path.

Returns

Returns self, enabling you to place this call in a chain.

class DataFrameDataBlob(*, hyperparams: Optional[Union[Mapping[str, Any], HyperparamsType]] = None, **kwargs)

Bases: scalarstop.datablob.DataBlob

Subclass this to transform a pandas.DataFrame into your training, validation, and test sets.

DataBlob is useful when you want to manually define your tf.data pipelines and their input tensors.

However, if your input tensors are in a fixed-size list or DataFrame that you want to slice into a training, validation, and test set, then you might find DataFrameDataBlob handy.

Here is how to use it:

  1. Subclass DataFrameDataBlob with a class name that describes your dataset.

  2. Override DataFrameDataBlob.set_dataframe() and have it return a single DataFrame that contains all of the inputs for your training, validation, and test sets. The DataFrame should have one column representing training samples and another column representing training labels.

  3. Override DataFrameDataBlob.transform() and define a method that transforms an arbitrary DataFrame of inputs into a tf.data.Dataset pipeline that represents the actual dataset needed for training and evaluation.

We control how the DataFrame is split with the class attributes DataFrameDataBlob.training_fraction and DataFrameDataBlob.validation_fraction. By default, 60 percent of the DataFrame is marked for the training set, 20 percent for the validation set, and the remainder for the test set.

Roughly, this looks like:

>>> import pandas as pd
>>> import tensorflow as tf
>>> import scalarstop as sp
>>>
>>> class MyDataFrameDataBlob(sp.DataFrameDataBlob):
...    samples_column: str = "samples"
...    labels_column: str = "labels"
...    training_fraction: float = 0.6
...    validation_fraction: float = 0.2
...
...    @sp.dataclass
...    class Hyperparams(sp.HyperparamsType):
...        length: int = 0
...
...    def set_dataframe(self):
...        samples = list(range(self.hyperparams.length))
...        labels = list(range(self.hyperparams.length))
...        return pd.DataFrame({self.samples_column: samples, self.labels_column: labels})
...
...    def transform(self, dataframe: pd.DataFrame):
...        return tf.data.Dataset.zip((
...                tf.data.Dataset.from_tensor_slices(dataframe[self.samples_column]),
...                tf.data.Dataset.from_tensor_slices(dataframe[self.labels_column]),
...        ))
>>> datablob2 = MyDataFrameDataBlob(hyperparams=dict(length=10))

And you can use the resulting object in all of the same ways as we’ve demonstrated with DataBlob subclass instances above.
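
For example, the naming and hyperparams properties behave just as they did for DataBlob. A quick sketch:

>>> datablob2.group_name
'MyDataFrameDataBlob'
>>> sp.enforce_dict(datablob2.hyperparams)
{'length': 10}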

samples_column: str = 'samples'

labels_column: str = 'labels'

training_fraction: float = 0.6

validation_fraction: float = 0.2

Hyperparams: Type[HyperparamsType]

hyperparams: HyperparamsType

static from_exact_path(path: str) → Union[DataBlob, DataFrameDataBlob]

Load a DataFrameDataBlob from a directory on the filesystem.

set_dataframe(self) → pandas.DataFrame

Create a new pandas.DataFrame that contains all of the data for the training, validation, and test sets.

property dataframe(self) → pandas.DataFrame

A pandas.DataFrame that represents the entire training, validation, and test set.
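
Continuing the example above, a quick sketch inspecting the combined DataFrame of ten samples and labels:

>>> datablob2.dataframe.shape
(10, 2)
>>> list(datablob2.dataframe.columns)
['samples', 'labels']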

set_training_dataframe(self) → pandas.DataFrame

Sets the pandas.DataFrame for the training set.

By default, this method slices the pandas.DataFrame you have supplied to set_dataframe().

Alternatively, you can override set_training_dataframe(), set_validation_dataframe(), and set_test_dataframe() directly, as sketched below.

Returns

Returns a pandas.DataFrame.
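
As a sketch of that alternative, here is a hypothetical subclass of the MyDataFrameDataBlob example above that overrides all three split methods with explicit row ranges (the exact slices are arbitrary and chosen only for illustration):

>>> class MyExplicitSplitDataBlob(MyDataFrameDataBlob):
...     def set_training_dataframe(self):
...         return self.dataframe.iloc[:6]
...
...     def set_validation_dataframe(self):
...         return self.dataframe.iloc[6:8]
...
...     def set_test_dataframe(self):
...         return self.dataframe.iloc[8:]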

property training_dataframe(self) → pandas.DataFrame

A pandas.DataFrame representing training set input tensors.

set_validation_dataframe(self) → pandas.DataFrame

Sets the pandas.DataFrame for the validation set.

By default, this method slices the pandas.DataFrame you have supplied to set_dataframe().

Alternatively, you can override set_training_dataframe(), set_validation_dataframe(), and set_test_dataframe() directly.

Returns

Returns a pandas.DataFrame.

property validation_dataframe(self) → pandas.DataFrame

A pandas.DataFrame representing validation set input tensors.

set_test_dataframe(self) → pandas.DataFrame

Sets the pandas.DataFrame for the test set.

By default, this method slices the DataFrame you have supplied to set_dataframe().

Alternatively, you can override set_training_dataframe(), set_validation_dataframe(), and set_test_dataframe() directly.

Returns

Returns a pandas.DataFrame.

property test_dataframe(self) → pandas.DataFrame

A pandas.DataFrame representing test set input tensors.

transform(self, dataframe: pandas.DataFrame) → tf.data.Dataset

Transforms any input tensors into an output tf.data.Dataset.

set_training(self) → tf.data.Dataset

Create a tf.data.Dataset for the training set.

set_validation(self) → tf.data.Dataset

Create a tf.data.Dataset for the validation set.

set_test(self) → tf.data.Dataset

Create a tf.data.Dataset for the test set.

save_hook(self, *, subtype: str, path: str) → None

Override this method to run additional code when saving this DataBlob to disk.

classmethod from_filesystem(cls, *, hyperparams: Optional[Union[Mapping[str, Any], HyperparamsType]] = None, datablobs_directory: str)

Load a DataBlob from the filesystem, calculating the filename from the hyperparameters.

Parameters
  • hyperparams – The hyperparameters of the DataBlob that we want to load.

  • datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.

classmethod from_filesystem_or_new(cls, *, hyperparams: Optional[Union[Mapping[str, Any], HyperparamsType]] = None, datablobs_directory: str, **kwargs)

Load a DataBlob from the filesystem, calculating the filename from the hyperparameters. Create a new DataBlob if we cannot find a saved one on the filesystem.

Parameters
  • hyperparams – The hyperparameters of the DataBlob that we want to load.

  • datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.

  • **kwargs – Other keyword arguments that you need to pass to your __init__().

property name(self) → str

The name of this specific dataset.

property group_name(self) → str

The group name of this dataset.

This is typically the DataBlob subclass’s class name.

Conceptually, the group name is the name for all DataBlobs that share the same code but have different hyperparameters.

property training(self) → tf.data.Dataset

A tf.data.Dataset instance representing the training set.

property validation(self) → tf.data.Dataset

A tf.data.Dataset instance representing the validation set.

property test(self) → tf.data.Dataset

A tf.data.Dataset instance representing the test set.

batch(self, batch_size: int, *, with_tf_distribute: bool = False) → scalarstop.datablob.DataBlob

Batch this DataBlob.

cache(self) → scalarstop.datablob.DataBlob

Cache this DataBlob into memory before iterating over it.

save(self, datablobs_directory: str, *, ignore_existing: bool = False) → scalarstop.datablob.DataBlob

Save this DataBlob to disk.

Parameters
  • datablobs_directory – The directory where you plan on storing all of your DataBlobs. This method will save this DataBlob in a subdirectory of datablobs_directory with the same name as DataBlob.name.

  • ignore_existing – Set this to True to ignore if there is already a DataBlob at the given path.

Returns

Returns self, enabling you to place this call in a chain.

class AppendDataBlob(*, parent: scalarstop.datablob.DataBlob, hyperparams: Optional[Union[Mapping[str, Any], HyperparamsType]] = None)

Bases: scalarstop.datablob.DataBlob

Subclass this to create a new DataBlob that extends an existing DataBlob.

The AppendDataBlob class is useful when you have an existing DataBlob or DataFrameDataBlob with most, but not all, of the functionality you need. If you are implementing multiple data pipelines that share a common, compute-intensive first step, you can implement that first step as a DataBlob, save it to the filesystem, and then implement each pipeline as an AppendDataBlob subclass that loads and extends it.

Let’s begin by creating a DataBlob that we will use as a parent for an AppendDataBlob.

>>> import tensorflow as tf
>>> import scalarstop as sp
>>>
>>> class MyDataBlob(sp.DataBlob):
...
...     @sp.dataclass
...     class Hyperparams(sp.HyperparamsType):
...         length: int
...
...     def _data(self):
...         length = self.hyperparams.length
...         x = tf.data.Dataset.from_tensor_slices(list(range(0, length)))
...         y = tf.data.Dataset.from_tensor_slices(list(range(length, length * 2)))
...         return tf.data.Dataset.zip((x, y))
...
...     def set_training(self):
...         return self._data()
...
...     def set_validation(self):
...         return self._data()
...
...     def set_test(self):
...         return self._data()
>>>

And then we create an instance of the datablob and save it to the filesystem.

>>> import os
>>> import tempfile
>>> tempdir = tempfile.TemporaryDirectory()
>>>
>>> datablob = MyDataBlob(hyperparams=dict(length=5))
>>> datablob
<sp.DataBlob MyDataBlob-dac936v7mb1ue9phjp6tc3sb>
>>>
>>> list(datablob.training.as_numpy_iterator())
[(0, 5), (1, 6), (2, 7), (3, 8), (4, 9)]
>>>
>>> datablob = datablob.save(tempdir.name)
>>>
>>> os.listdir(tempdir.name)
['MyDataBlob-dac936v7mb1ue9phjp6tc3sb']

Now, let’s say that we want to create an AppendDataBlob that takes in any input DataBlob or DataFrameDataBlob and multiplies every number in every tensor by a constant.

>>> class MyAppendDataBlob(sp.AppendDataBlob):
...
...     @sp.dataclass
...     class Hyperparams(sp.AppendHyperparamsType):
...         coefficient: int
...
...     hyperparams: "MyAppendDataBlob.Hyperparams"
...
...     def __init__(self, *, parent: sp.DataBlob, hyperparams):
...         hyperparams_dict = sp.enforce_dict(hyperparams)
...         if hyperparams_dict["coefficient"] < 1:
...             raise ValueError("Coefficient is too low.")
...         super().__init__(parent=parent, hyperparams=hyperparams_dict)
...
...     def _wrap_tfdata(self, tfdata: tf.data.Dataset) -> tf.data.Dataset:
...         return tfdata.map(
...             lambda x, y: (
...                 x * self.hyperparams.coefficient,
...                 y * self.hyperparams.coefficient,
...             )
...         )
>>>
>>> append = MyAppendDataBlob(parent=datablob, hyperparams=dict(coefficient=3))
>>> list(append.training.as_numpy_iterator())
[(0, 15), (3, 18), (6, 21), (9, 24), (12, 27)]

(And now let’s clean up the temporary directory that we created earlier.)

>>> tempdir.cleanup()

Parameters
  • parent – The DataBlob to extend.

  • hyperparams – Additional hyperparameters to add on top of the existing hyperparameters from the parent DataBlob.

Hyperparams: Type[AppendHyperparamsType]

hyperparams: HyperparamsType

property parent(self) → scalarstop.datablob.DataBlob

The parent DataBlob.
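
Continuing the example from the class docstring above, a quick sketch (assuming the parent is stored exactly as it was passed in):

>>> append.parent.name == datablob.name
True
>>> append.parent.group_name
'MyDataBlob'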

set_training(self) → tf.data.Dataset

Create a tf.data.Dataset for the training set.

property training(self) → tf.data.Dataset

A tf.data.Dataset instance representing the training set.

set_validation(self) → tf.data.Dataset

Create a tf.data.Dataset for the validation set.

property validation(self) → tf.data.Dataset

A tf.data.Dataset instance representing the validation set.

set_test(self) → tf.data.Dataset

Create a tf.data.Dataset for the test set.

property test(self) → tf.data.Dataset

A tf.data.Dataset instance representing the test set.

classmethod from_filesystem(cls, *, hyperparams: Optional[Union[Mapping[str, Any], HyperparamsType]] = None, datablobs_directory: str)

Load a DataBlob from the filesystem, calculating the filename from the hyperparameters.

Parameters
  • hyperparams – The hyperparameters of the DataBlob that we want to load.

  • datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.

classmethod from_filesystem_or_new(cls, *, hyperparams: Optional[Union[Mapping[str, Any], HyperparamsType]] = None, datablobs_directory: str, **kwargs)

Load a DataBlob from the filesystem, calculating the filename from the hyperparameters. Create a new DataBlob if we cannot find a saved one on the filesystem.

Parameters
  • hyperparams – The hyperparameters of the DataBlob that we want to load.

  • datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.

  • **kwargs – Other keyword arguments that you need to pass to your __init__().

static from_exact_path(path: str) → scalarstop.datablob.DataBlob

Load a DataBlob from a directory on the filesystem.

property name(self) → str

The name of this specific dataset.

property group_name(self) → str

The group name of this dataset.

This is typically the DataBlob subclass’s class name.

Conceptually, the group name is the name for all DataBlobs that share the same code but have different hyperparameters.

batch(self, batch_size: int, *, with_tf_distribute: bool = False) → scalarstop.datablob.DataBlob

Batch this DataBlob.

cache(self) → scalarstop.datablob.DataBlob

Cache this DataBlob into memory before iterating over it.

save_hook(self, *, subtype: str, path: str) → None

Override this method to run additional code when saving this DataBlob to disk.

save(self, datablobs_directory: str, *, ignore_existing: bool = False) → scalarstop.datablob.DataBlob

Save this DataBlob to disk.

Parameters
  • datablobs_directory – The directory where you plan on storing all of your DataBlobs. This method will save this DataBlob in a subdirectory of datablobs_directory with the same name as DataBlob.name.

  • ignore_existing – Set this to True to ignore if there is already a DataBlob at the given path.

Returns

Returns self, enabling you to place this call in a chain.