scalarstop.datablob

Group together and name your training, validation, and test sets.

The classes in this module group your data into the training, validation, and test sets used for training machine learning models. They also record the hyperparameters used to process each dataset.

The DataBlob subclass name and hyperparameters are used to create a unique content-addressable name that makes it easy to keep track of many datasets at once.

Module Contents

Classes

DataBlobBase

The abstract base class describing the properties common to all DataBlobs.

DataBlob

Subclass this to group your training, validation, and test sets for training machine learning models.

DataFrameDataBlob

Subclass this to transform a pandas.DataFrame into your training, validation, and test sets.

AppendDataBlob

Subclass this to create a new DataBlob that extends an existing DataBlob.

DistributedDataBlob

Wraps a DataBlob to create a TensorFlow tf.distribute.DistributedDataset.

class DataBlobBase(*, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, **kwargs)

Bases: scalarstop._single_namespace.SingleNamespace

The abstract base class describing the properties common to all DataBlobs.

Parameters

hyperparams – The hyperparameters to initialize this class with.

Hyperparams :Type[scalarstop.hyperparams.HyperparamsType]
set_training(self) Any

Creates and returns a new object representing the training set.

property training(self) Any

An object representing the training set.

set_validation(self) Any

Creates and returns a new object representing the validation set.

property validation(self) Any

An object representing the validation set.

set_test(self) Any

Creates and returns a new object representing the test set.

property test(self) Any

An object representing the test set.

classmethod calculate_name(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None) str

Calculate the hashed name of this object, given the hyperparameters.

This classmethod can be used to calculate what an object’s name would be without actually having to call __init__().

property group_name(self) str

The “group” name is this object’s Python class name.

property name(self) str

The group (class) name and a calculated hash of the hyperparameters.

property hyperparams(self) scalarstop.hyperparams.HyperparamsType

Returns a HyperparamsType instance containing hyperparameters.

property hyperparams_flat(self) Dict[str, Any]

Returns a Python dictionary of “flattened” hyperparameters.

AppendDataBlob objects modify a “parent” DataBlob, nesting the parent’s Hyperparams within the AppendDataBlob’s own Hyperparams.

This makes it hard to look up a given hyperparams key. A value at parent_datablob.hyperparams.a is stored at child_datablob.hyperparams.parent.hyperparams.a.

This hyperparams_flat property provides all nested hyperparams keys as a flat Python dictionary. If a child AppendDataBlob has a hyperparameter key that conflicts with the parent, the child’s value overwrites the parent’s value.
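
As a purely illustrative sketch (the hyperparameter name a and the child_datablob variable below are hypothetical, following the nesting described above):

>>> # child_datablob.hyperparams.parent.hyperparams.a   # nested lookup through the parent
>>> # child_datablob.hyperparams_flat["a"]              # the same value, via the flat dictionary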

class DataBlob(*, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, **kwargs)

Bases: DataBlobBase

Subclass this to group your training, validation, and test sets for training machine learning models.

Here is how to use DataBlob to group your training, validation, and test sets:

  1. Subclass DataBlob with a class name that describes your dataset in general. In this example, we’ll use MyDataBlob as the class name.

  2. Define a dataclass using the @sp.dataclass decorator at MyDataBlob.Hyperparams. We’ll define an instance of this dataclass at MyDataBlob.hyperparams. This describes the hyperparameters involved in processing your dataset.

  3. Override the methods DataBlob.set_training(), DataBlob.set_validation(), and DataBlob.set_test() to generate tf.data.Dataset pipelines representing your training, validation, and test sets.

Those three steps roughly look like:

>>> import tensorflow as tf
>>> import scalarstop as sp
>>>
>>> class MyDataBlob(sp.DataBlob):
...
...     @sp.dataclass
...     class Hyperparams(sp.HyperparamsType):
...             cols: int
...
...     def _data(self):
...             x = tf.random.uniform(shape=(10, self.hyperparams.cols))
...             y = tf.round(tf.random.uniform(shape=(10,1)))
...             return tf.data.Dataset.zip((
...                     tf.data.Dataset.from_tensor_slices(x),
...                     tf.data.Dataset.from_tensor_slices(y),
...             ))
...
...     def set_training(self):
...         return self._data()
...
...     def set_validation(self):
...         return self._data()
...
...     def set_test(self):
...         return self._data()
>>>

In the example above, the training, validation, and test sets are created with the exact same code. In practice, you’ll create them from different inputs.

Now we create an instance of our subclass so we can start using it.

>>> datablob = MyDataBlob(hyperparams=dict(cols=3))
>>> datablob
<sp.DataBlob MyDataBlob-bn5hpc7ueo2uz7as1747tetn>

DataBlob instances are given a unique name by hashing together the class name with the instance’s hyperparameters.

>>> datablob.name
'MyDataBlob-bn5hpc7ueo2uz7as1747tetn'
>>>
>>> datablob.group_name
'MyDataBlob'
>>>
>>> datablob.hyperparams
MyDataBlob.Hyperparams(cols=3)
>>>
>>> sp.enforce_dict(datablob.hyperparams)
{'cols': 3}

We save exactly one instance of each tf.data.Dataset pipeline in the properties DataBlob.training, DataBlob.validation, and DataBlob.test.

>>> datablob.training
<ZipDataset element_spec=(TensorSpec(shape=(3,), dtype=tf.float32, name=None), TensorSpec(shape=(1,), dtype=tf.float32, name=None))>
>>>
>>> datablob.validation
<ZipDataset element_spec=(TensorSpec(shape=(3,), dtype=tf.float32, name=None), TensorSpec(shape=(1,), dtype=tf.float32, name=None))>
>>>
>>> datablob.test
<ZipDataset element_spec=(TensorSpec(shape=(3,), dtype=tf.float32, name=None), TensorSpec(shape=(1,), dtype=tf.float32, name=None))>

DataBlob objects have some methods for applying tf.data transformations to the training, validation, and test sets at the same time (a short chaining example follows this list):

  • Batching. DataBlob.batch() will batch the training, validation, and test sets at the same time. If you call DataBlob.batch() with the keyword argument with_tf_distribute=True, your input batch size will be multiplied by the number of replicas in your tf.distribute strategy.

  • Caching. DataBlob.cache() will cache the training, validation, and test sets in memory once you iterate over them. This is useful if your tf.data.Dataset pipelines do something computationally expensive each time you iterate over them.

  • Saving/loading to/from the filesystem. DataBlob.save() saves the training, validation, and test sets to a path on the filesystem. This can be loaded back with the classmethod DataBlob.from_exact_path().
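
For example, these methods each return a new DataBlob, so they can be chained. Here is a minimal sketch continuing the doctest above (the batch size of 2 is an arbitrary choice):

>>> transformed_datablob = datablob.cache().batch(2)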

>>> import os
>>> import tempfile
>>> tempdir = tempfile.TemporaryDirectory()
>>>
>>> datablob = datablob.save(tempdir.name)
>>>
>>> os.listdir(tempdir.name)
['MyDataBlob-bn5hpc7ueo2uz7as1747tetn']
>>> path = os.path.join(tempdir.name, datablob.name)
>>> loaded_datablob = MyDataBlob.from_exact_path(path)
>>> loaded_datablob
<sp.DataBlob MyDataBlob-bn5hpc7ueo2uz7as1747tetn>

Alternatively, if you have the hyperparameters of the DataBlob but not the name, you can use the classmethod DataBlob.from_filesystem().

>>> loaded_datablob_2 = MyDataBlob.from_filesystem(
...    hyperparams=dict(cols=3),
...    datablobs_directory=tempdir.name,
... )
>>> loaded_datablob_2
<sp.DataBlob MyDataBlob-bn5hpc7ueo2uz7as1747tetn>

(and now let’s clean up the temporary directory from above)

>>> tempdir.cleanup()
Parameters

hyperparams – The hyperparameters to initialize this class with.

Hyperparams :Type[scalarstop.hyperparams.HyperparamsType]
classmethod from_filesystem(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1)

Loads a DataBlob from the filesystem, calculating the filename from the hyperparameters.

Parameters
  • hyperparams – The hyperparameters of the model that we want to load.

  • datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.

classmethod from_filesystem_distributed(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, cache: bool = False, repeat: Union[bool, int, None] = True, per_replica_batch_size: Optional[int] = None, tf_distribute_strategy: Optional[tf.distribute.Strategy] = None) DistributedDataBlob

Loads a sharded DataBlob from the filesystem, automatically splitting the shards among the input workers of a tf.distribute.Strategy.

Parameters
  • hyperparams – The hyperparameters of the model that we want to load.

  • datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.

  • cache – Whether to cache the DataBlob in memory. If repeat is also enabled, then caching will happen before repeating.

  • repeat – Repeats the DataBlob after loading it. Set to True to enable infinite repeating. Set to a positive integer n to repeat the DataBlob n times. Set to False to disable repeating.

  • per_replica_batch_size – The batch size for each individual tf.distribute replica. This is the global batch size divided by tf.distribute.Strategy.num_replicas_in_sync.

  • tf_distribute_strategy – The tf.distribute.Strategy subclass to use. Optionally, this method will detect if it is already inside a tf.distribute.Strategy.scope() context manager.
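
A hedged usage sketch (the MirroredStrategy, the "datablobs" directory path, and the batch size below are illustrative assumptions, not values required by ScalarStop):

>>> strategy = tf.distribute.MirroredStrategy()
>>> distributed_datablob = MyDataBlob.from_filesystem_distributed(
...     hyperparams=dict(cols=3),
...     datablobs_directory="datablobs",
...     per_replica_batch_size=2,
...     tf_distribute_strategy=strategy,
... )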

classmethod metadata_from_filesystem(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str) scalarstop.datablob_metadata.DataBlobMetadata

Loads this DataBlob’s DataBlobMetadata from the filesystem, calculating the filename from the hyperparameters.

Parameters
  • hyperparams – The hyperparameters of the model that we want to load.

  • datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.

classmethod from_filesystem_or_new(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1, **kwargs)

Load a DataBlob from the filesystem, calculating the filename from the hyperparameters. Create a new DataBlob if we cannot find a saved one on the filesystem.

Parameters
  • hyperparams – The hyperparameters of the model that we want to load.

  • datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.

  • **kwargs – Other keyword arguments that you need to pass to your __init__().

static from_exact_path(path: str, *, shard_offset: Optional[int] = None, shard_quantity: int = 1) DataBlob

Load a DataBlob from a directory on the filesystem.

classmethod from_exact_path_distributed(cls, *, path: str, cache: bool = False, repeat: Union[bool, int, None] = True, per_replica_batch_size: Optional[int] = None, tf_distribute_strategy: Optional[tf.distribute.Strategy] = None) DistributedDataBlob
Parameters
  • path – The exact location of the saved DataBlob on the filesystem.

  • cache – Whether to cache the DataBlob in memory. If repeat is also enabled, then caching will happen before repeating.

  • repeat – Repeats the DataBlob after loading it. Set to True to enable infinite repeating. Set to a positive integer n to repeat the DataBlob n times. Set to False to disable repeating.

  • per_replica_batch_size – The batch size for each individual tf.distribute replica. This is the global batch size divided by tf.distribute.Strategy.num_replicas_in_sync.

  • tf_distribute_strategy – The tf.distribute.Strategy subclass to use. Optionally, this method will detect if it is already inside a tf.distribute.Strategy.scope() context manager.

static metadata_from_exact_path(path: str) scalarstop.datablob_metadata.DataBlobMetadata

Loads this DataBlob’s DataBlobMetadata from a directory on the filesystem.

exists_in_datablobs_directory(self, datablobs_directory: str) bool

Returns True if this DataBlob was already saved within datablobs_directory.

Parameters

datablobs_directory – The parent directory of all of your saved DataBlobs.

Returns

Returns True if we found a DataBlob metadata file at the expected location.

set_training(self) tf.data.Dataset

Create a tf.data.Dataset for the training set.

property training(self) tf.data.Dataset

A tf.data.Dataset instance representing the training set.

set_validation(self) tf.data.Dataset

Create a tf.data.Dataset for the validation set.

property validation(self) tf.data.Dataset

A tf.data.Dataset instance representing the validation set.

set_test(self) tf.data.Dataset

Create a tf.data.Dataset for the test set.

property test(self) tf.data.Dataset

A tf.data.Dataset instance representing the test set.

batch(self, batch_size: int, *, training: bool = True, validation: bool = True, test: bool = True, with_tf_distribute: bool = False) DataBlob

Batch this DataBlob.

Parameters
  • batch_size – The number of items to collect into a batch.

  • training – Whether to batch the training set. Defaults to True.

  • validation – Whether to batch the validation set. Defaults to True.

  • test – Whether to batch the test set. Defaults to True.

  • with_tf_distribute – Whether to consider tf.distribute auto-data sharding when calculating the batch size.

cache(self, *, training: bool = True, validation: bool = True, test: bool = True, precache_training: bool = False, precache_validation: bool = False, precache_test: bool = False) DataBlob

Cache this DataBlob into memory before iterating over it.

By default, this creates a DataBlob containing a TensorFlow CacheDataset for each of the training, validation, and test tf.data.Datasets.

But these datasets do not load into memory until the first time you completely iterate over one of them, from start to end. If you want to immediately load your training, validation, or test sets, you can set precache_training, precache_validation, and/or precache_test to True.

Parameters
  • training – Lazily cache the training set in CPU memory. Defaults to True.

  • validation – Lazily cache the validation set in CPU memory. Defaults to True.

  • test – Lazily cache the test set in CPU memory. Defaults to True.

  • precache_training – Eagerly cache the training set into memory. Defaults to False.

  • precache_validation – Eagerly cache the validation set into memory. Defaults to False.

  • precache_test – Eagerly cache the test set into memory. Defaults to False.
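
A minimal sketch continuing the doctest objects above (eagerly caching only the training set is an arbitrary choice):

>>> cached_datablob = datablob.cache(precache_training=True)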

prefetch(self, buffer_size: int, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob

Creates a DataBlob that prefetches elements for performance.

Parameters
  • buffer_size – The maximum number of elements that will be buffered when prefetching. If the value tf.data.experimental.AUTOTUNE is used, then the buffer size is dynamically tuned.

  • training – Apply the prefetch operator to the training set. Defaults to True.

  • validation – Apply the prefetch operator to the validation set. Defaults to True.

  • test – Apply the prefetch operator to the test set. Defaults to True.

repeat(self, count: Optional[int] = None, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob

Repeats this DataBlob.

Parameters
  • count – Represents the number of times that the elements in the tf.data.Dataset should be repeated. The default behavior (if count is None or -1) is for the dataset to be repeated indefinitely.

  • training – Apply the repeat operator to the training set. Defaults to True.

  • validation – Apply the repeat operator to the validation set. Defaults to True.

  • test – Apply the repeat operator to the test set. Defaults to True.
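
A minimal sketch (repeating the training set twice while leaving the validation and test sets unrepeated is an arbitrary choice):

>>> repeated_datablob = datablob.repeat(2, validation=False, test=False)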

repeat_interleaved(self, count: int, cycle_length: Optional[int] = None, block_length: Optional[int] = None, num_parallel_calls: Optional[int] = None, deterministic: Optional[bool] = None, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob

Repeats this DataBlob, but in interleaved order.

Parameters
  • count – Represents the number of times that the elements in the tf.data.Dataset should be repeated. This must be a finite integer greater than 0. It cannot be a negative number, None, or an infinite value.

  • training – Apply the repeat operator to the training set. Defaults to True.

  • validation – Apply the repeat operator to the validation set. Defaults to True.

  • test – Apply the repeat operator to the test set. Defaults to True.

with_options(self, options: tf.data.Options, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob

Apply a tf.data.Options object to this DataBlob.

Parameters
  • options – The tf.data.Options object to apply.

  • training – Apply the options to the training set. Defaults to True.

  • validation – Apply the options to the validation set. Defaults to True.

  • test – Apply the options to the test set. Defaults to True.

save_hook(self, *, subtype: str, path: str) None

Override this method to run additional code when saving this DataBlob to disk.

save(self, datablobs_directory: str, *, ignore_existing: bool = False, num_shards: int = 1, save_load_version: int = _DEFAULT_SAVE_LOAD_VERSION) DataBlob

Save this DataBlob to disk.

Parameters
  • datablobs_directory – The directory where you plan on storing all of your DataBlobs. This method will save this DataBlob in a subdirectory of datablobs_directory with the same name as DataBlob.name.

  • ignore_existing – Set this to True to ignore whether there is already a DataBlob at the given path.

  • save_load_version – The version of the ScalarStop saving/loading protocol to use.

Returns

Return self, enabling you to place this call in a chain.
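
A hedged sketch of saving into multiple shards for later distributed loading (the "datablobs" directory name and the shard count of 4 are assumptions; see from_filesystem_distributed() above):

>>> sharded_datablob = datablob.save("datablobs", num_shards=4)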

classmethod calculate_name(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None) str

Calculate the hashed name of this object, given the hyperparameters.

This classmethod can be used to calculate what an object’s name would be without actually having to call __init__().

property group_name(self) str

The “group” name is this object’s Python class name.

property name(self) str

The group (class) name and a calculated hash of the hyperparameters.

property hyperparams(self) scalarstop.hyperparams.HyperparamsType

Returns a HyperparamsType instance containing hyperparameters.

property hyperparams_flat(self) Dict[str, Any]

Returns a Python dictionary of “flattened” hyperparameters.

AppendDataBlob objects modify a “parent” DataBlob, nesting the parent’s Hyperparams within the AppendDataBlob’s own Hyperparams.

This makes it hard to look up a given hyperparams key. A value at parent_datablob.hyperparams.a is stored at child_datablob.hyperparams.parent.hyperparams.a.

This hyperparams_flat property provides all nested hyperparams keys as a flat Python dictionary. If a child AppendDataBlob has a hyperparameter key that conflicts with the parent, the child’s value overwrites the parent’s value.

class DataFrameDataBlob(*, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, **kwargs)

Bases: DataBlob

Subclass this to transform a pandas.DataFrame into your training, validation, and test sets.

DataBlob is useful when you want to manually define your tf.data pipelines and their input tensors.

However, if your input tensors are in a fixed-size list or DataFrame that you want to slice into a training, validation, and test set, then you might find DataFrameDataBlob handy.

Here is how to use it:

  1. Subclass DataFrameDataBlob with a class name that describes your dataset.

  2. Override DataFrameDataBlob.set_dataframe() and have it return a single DataFrame that contains all of the inputs for your training, validation, and test sets. The DataFrame should have one column representing training samples and another column representing training labels.

  3. Override DataFrameDataBlob.transform() and define a method that transforms an arbitrary DataFrame of inputs into a tf.data.Dataset pipeline that represents the actual dataset needed for training and evaluation.

The class attributes DataFrameDataBlob.training_fraction and DataFrameDataBlob.validation_fraction define what fraction of the DataFrame goes into each split. By default, 60 percent of the DataFrame is marked for the training set, 20 percent for the validation set, and the remainder for the test set.

Roughly, this looks like:

>>> import pandas as pd
>>> import tensorflow as tf
>>> import scalarstop as sp
>>>
>>> class MyDataFrameDataBlob(sp.DataFrameDataBlob):
...    samples_column: str = "samples"
...    labels_column: str = "labels"
...    training_fraction: float = 0.6
...    validation_fraction: float = 0.2
...
...    @sp.dataclass
...    class Hyperparams(sp.HyperparamsType):
...        length: int = 0
...
...    def set_dataframe(self):
...        samples = list(range(self.hyperparams.length))
...        labels = list(range(self.hyperparams.length))
...        return pd.DataFrame({self.samples_column: samples, self.labels_column: labels})
...
...    def transform(self, dataframe: pd.DataFrame):
...        return tf.data.Dataset.zip((
...                tf.data.Dataset.from_tensor_slices(dataframe[self.samples_column]),
...                tf.data.Dataset.from_tensor_slices(dataframe[self.labels_column]),
...        ))
>>> datablob2 = MyDataFrameDataBlob(hyperparams=dict(length=10))

And you can use the resulting object in all of the same ways as we’ve demonstrated with DataBlob subclass instances above.
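
For example, a hedged sketch inspecting the default 60/20/20 split from the doctest above (the exact row counts depend on how ScalarStop rounds the split boundaries, so none are asserted here):

>>> training_rows = len(datablob2.training_dataframe)
>>> validation_rows = len(datablob2.validation_dataframe)
>>> test_rows = len(datablob2.test_dataframe)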

Parameters

hyperparams – The hyperparameters to initialize this class with.

samples_column :str = samples
labels_column :str = labels
training_fraction :float = 0.6
validation_fraction :float = 0.2
Hyperparams :Type[scalarstop.hyperparams.HyperparamsType]
static from_exact_path(path: str, *, shard_offset: Optional[int] = None, shard_quantity: int = 1) Union[DataBlob, DataFrameDataBlob]

Load a DataFrameDataBlob from a directory on the filesystem.

set_dataframe(self) pandas.DataFrame

Create a new pandas.DataFrame that contains all of the data for the training, validation, and test sets.

property dataframe(self) pandas.DataFrame

A pandas.DataFrame that represents the entire training, validation, and test set.

set_training_dataframe(self) pandas.DataFrame

Sets the pandas.DataFrame for the training set.

By default, this method slices the pandas.DataFrame returned by set_dataframe().

Alternatively, you can choose to directly override set_training_dataframe(), set_validation_dataframe(), and set_test_dataframe().

Returns

Returns a pandas.DataFrame.
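
A hedged sketch of that alternative, reusing MyDataFrameDataBlob from the doctest above (the row boundaries below are arbitrary and assume the dataframe property documented in this class):

>>> class MyManualSplitDataBlob(MyDataFrameDataBlob):
...
...     def set_training_dataframe(self):
...         return self.dataframe.iloc[:6]
...
...     def set_validation_dataframe(self):
...         return self.dataframe.iloc[6:8]
...
...     def set_test_dataframe(self):
...         return self.dataframe.iloc[8:]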

property training_dataframe(self) pandas.DataFrame

A pandas.DataFrame representing training set input tensors.

set_validation_dataframe(self) pandas.DataFrame

Sets the pandas.DataFrame for the validation set.

By default, this method slices the pandas.DataFrame returned by set_dataframe().

Alternatively, you can choose to directly override set_training_dataframe(), set_validation_dataframe(), and set_test_dataframe().

Returns

Returns a pandas.DataFrame.

property validation_dataframe(self) pandas.DataFrame

A pandas.DataFrame representing validation set input tensors.

set_test_dataframe(self) pandas.DataFrame

Sets the pandas.DataFrame for the test set.

By default, this method slices the pandas.DataFrame returned by set_dataframe().

Alternatively, you can choose to directly override set_training_dataframe(), set_validation_dataframe(), and set_test_dataframe().

Returns

Returns a pandas.DataFrame.

property test_dataframe(self) pandas.DataFrame

A pandas.DataFrame representing test set input tensors.

transform(self, dataframe: pandas.DataFrame) tf.data.Dataset

Transforms any input tensors into an output tf.data.Dataset.

set_training(self) tf.data.Dataset

Create a tf.data.Dataset for the training set.

set_validation(self) tf.data.Dataset

Create a tf.data.Dataset for the validation set.

set_test(self) tf.data.Dataset

Create a tf.data.Dataset for the test set.

save_hook(self, *, subtype: str, path: str) None

Override this method to run additional code when saving this DataBlob to disk.

classmethod from_filesystem(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1)

Loads a DataBlob from the filesystem, calculating the filename from the hyperparameters.

Parameters
  • hyperparams – The hyperparameters of the model that we want to load.

  • datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.

classmethod from_filesystem_distributed(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, cache: bool = False, repeat: Union[bool, int, None] = True, per_replica_batch_size: Optional[int] = None, tf_distribute_strategy: Optional[tf.distribute.Strategy] = None) DistributedDataBlob

Loads a sharded DataBlob from the filesystem, automatically splitting the shards among the input workers of a tf.distribute.Strategy.

Parameters
  • hyperparams – The hyperparameters of the model that we want to load.

  • datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.

  • cache – Whether to cache the DataBlob in memory. If repeat is also enabled, then caching will happen before repeating.

  • repeat – Repeats the DataBlob after loading it. Set to True to enable infinite repeating. Set to a positive integer n to repeat the DataBlob n times. Set to False to disable repeating.

  • per_replica_batch_size – The batch size for each individual tf.distribute replica. This is the global batch size divided by tf.distribute.Strategy.num_replicas_in_sync.

  • tf_distribute_strategy – The tf.distribute.Strategy subclass to use. Optionally, this method will detect if it is already inside a tf.distribute.Strategy.scope() context manager.

classmethod metadata_from_filesystem(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str) scalarstop.datablob_metadata.DataBlobMetadata

Loads this DataBlob’s DataBlobMetadata from the filesystem, calculating the filename from the hyperparameters.

Parameters
  • hyperparams – The hyperparameters of the model that we want to load.

  • datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.

classmethod from_filesystem_or_new(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1, **kwargs)

Load a DataBlob from the filesystem, calculating the filename from the hyperparameters. Create a new DataBlob if we cannot find a saved one on the filesystem.

Parameters
  • hyperparams – The hyperparameters of the model that we want to load.

  • datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.

  • **kwargs – Other keyword arguments that you need to pass to your __init__().

classmethod from_exact_path_distributed(cls, *, path: str, cache: bool = False, repeat: Union[bool, int, None] = True, per_replica_batch_size: Optional[int] = None, tf_distribute_strategy: Optional[tf.distribute.Strategy] = None) DistributedDataBlob
Parameters
  • path – The exact location of the saved DataBlob on the filesystem.

  • cache – Whether to cache the DataBlob in memory. If repeat is also enabled, then caching will happen before repeating.

  • repeat – Repeats the DataBlob after loading it. Set to True to enable infinite repeating. Set to a positive integer n to repeat the DataBlob n times. Set to False to disable repeating.

  • per_replica_batch_size – The batch size for each individual tf.distribute replica. This is the global batch size divided by tf.distribute.Strategy.num_replicas_in_sync.

  • tf_distribute_strategy – The tf.distribute.Strategy subclass to use. Optionally, this method will detect if it is already inside a tf.distribute.Strategy.scope() context manager.

static metadata_from_exact_path(path: str) scalarstop.datablob_metadata.DataBlobMetadata

Loads this DataBlob’s DataBlobMetadata from a directory on the filesystem.

exists_in_datablobs_directory(self, datablobs_directory: str) bool

Returns True if this DataBlob was already saved within datablobs_directory.

Parameters

datablobs_directory – The parent directory of all of your saved DataBlobs.

Returns

Returns True if we found a DataBlob metadata file at the expected location.

property training(self) tf.data.Dataset

A tf.data.Dataset instance representing the training set.

property validation(self) tf.data.Dataset

A tf.data.Dataset instance representing the validation set.

property test(self) tf.data.Dataset

A tf.data.Dataset instance representing the test set.

batch(self, batch_size: int, *, training: bool = True, validation: bool = True, test: bool = True, with_tf_distribute: bool = False) DataBlob

Batch this DataBlob.

Parameters
  • batch_size – The number of items to collect into a batch.

  • training – Whether to batch the training set. Defaults to True.

  • validation – Whether to batch the validation set. Defaults to True.

  • test – Whether to batch the test set. Defaults to True.

  • with_tf_distribute – Whether to consider tf.distribute auto-data sharding when calculating the batch size.

cache(self, *, training: bool = True, validation: bool = True, test: bool = True, precache_training: bool = False, precache_validation: bool = False, precache_test: bool = False) DataBlob

Cache this DataBlob into memory before iterating over it.

By default, this creates a DataBlob containing a TensorFlow CacheDataset for each of the training, validation, and test tf.data.Datasets.

But these datasets do not load into memory until the first time you completely iterate over one of them, from start to end. If you want to immediately load your training, validation, or test sets, you can set precache_training, precache_validation, and/or precache_test to True.

Parameters
  • training – Lazily cache the training set in CPU memory. Defaults to True.

  • validation – Lazily cache the validation set in CPU memory. Defaults to True.

  • test – Lazily cache the test set in CPU memory. Defaults to True.

  • precache_training – Eagerly cache the training set into memory. Defaults to False.

  • precache_validation – Eagerly cache the validation set into memory. Defaults to False.

  • precache_test – Eagerly cache the test set into memory. Defaults to False.

prefetch(self, buffer_size: int, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob

Creates a DataBlob that prefetches elements for performance.

Parameters
  • buffer_size – The maximum number of elements that will be buffered when prefetching. If the value tf.data.experimental.AUTOTUNE is used, then the buffer size is dynamically tuned.

  • training – Apply the prefetch operator to the training set. Defaults to True.

  • validation – Apply the prefetch operator to the validation set. Defaults to True.

  • test – Apply the prefetch operator to the test set. Defaults to True.

repeat(self, count: Optional[int] = None, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob

Repeats this DataBlob.

Parameters
  • count – Represents the number of times that the elements in the tf.data.Dataset should be repeated. The default behavior (if count is None or -1) is for the dataset to be repeated indefinitely.

  • training – Apply the repeat operator to the training set. Defaults to True.

  • validation – Apply the repeat operator to the validation set. Defaults to True.

  • test – Apply the repeat operator to the test set. Defaults to True.

repeat_interleaved(self, count: int, cycle_length: Optional[int] = None, block_length: Optional[int] = None, num_parallel_calls: Optional[int] = None, deterministic: Optional[bool] = None, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob

Repeats this DataBlob, but in interleaved order.

Parameters
  • count – Represents the number of times that the elements in the tf.data.Dataset should be repeated. This must be a finite integer greater than 0. It cannot be a negative number, None, or an infinite value.

  • training – Apply the repeat operator to the training set. Defaults to True.

  • validation – Apply the repeat operator to the validation set. Defaults to True.

  • test – Apply the repeat operator to the test set. Defaults to True.

with_options(self, options: tf.data.Options, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob

Apply a tf.data.Options object to this DataBlob.

Parameters
  • options – The tf.data.Options object to apply.

  • training – Apply the options to the training set. Defaults to True.

  • validation – Apply the options to the validation set. Defaults to True.

  • test – Apply the options to the test set. Defaults to True.

save(self, datablobs_directory: str, *, ignore_existing: bool = False, num_shards: int = 1, save_load_version: int = _DEFAULT_SAVE_LOAD_VERSION) DataBlob

Save this DataBlob to disk.

Parameters
  • datablobs_directory – The directory where you plan on storing all of your DataBlobs. This method will save this DataBlob in a subdirectory of datablobs_directory with the same name as DataBlob.name.

  • ignore_existing – Set this to True to ignore whether there is already a DataBlob at the given path.

  • save_load_version – The version of the ScalarStop saving/loading protocol to use.

Returns

Return self, enabling you to place this call in a chain.

classmethod calculate_name(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None) str

Calculate the hashed name of this object, given the hyperparameters.

This classmethod can be used to calculate what an object’s name would be without actually having to call __init__().

property group_name(self) str

The “group” name is this object’s Python class name.

property name(self) str

The group (class) name and a calculated hash of the hyperparameters.

property hyperparams(self) scalarstop.hyperparams.HyperparamsType

Returns a HyperparamsType instance containing hyperparameters.

property hyperparams_flat(self) Dict[str, Any]

Returns a Python dictionary of “flattened” hyperparameters.

AppendDataBlob objects modify a “parent” DataBlob, nesting the parent’s Hyperparams within the AppendDataBlob’s own Hyperparams.

This makes it hard to look up a given hyperparams key. A value at parent_datablob.hyperparams.a is stored at child_datablob.hyperparams.parent.hyperparams.a.

This hyperparams_flat property provides all nested hyperparams keys as a flat Python dictionary. If a child AppendDataBlob has a hyperparameter key that conflicts with the parent, the child’s value overwrites the parent’s value.

class AppendDataBlob(*, parent: DataBlob, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, **kwargs)

Bases: DataBlob

Subclass this to create a new DataBlob that extends an existing DataBlob.

The AppendDataBlob class is useful when you have an existing DataBlob or DataFrameDataBlob with most, but not all, of the functionality you need. If you are implementing multiple data pipelines that share a common compute-intensive first step, you can implement the common first step as a DataBlob that you save to and load from the filesystem, and implement each pipeline as an AppendDataBlob subclass that extends it.

Let’s begin by creating a DataBlob that we will use as a parent for an AppendDataBlob.

>>> import tensorflow as tf
>>> import scalarstop as sp
>>>
>>> class MyDataBlob(sp.DataBlob):
...
...     @sp.dataclass
...     class Hyperparams(sp.HyperparamsType):
...             length: int
...
...     def _data(self):
...         length = self.hyperparams.length
...         x = tf.data.Dataset.from_tensor_slices(list(range(0, length)))
...         y = tf.data.Dataset.from_tensor_slices(list(range(length, length * 2)))
...         return tf.data.Dataset.zip((x, y))
...
...     def set_training(self):
...         return self._data()
...
...     def set_validation(self):
...         return self._data()
...
...     def set_test(self):
...         return self._data()
>>>

And then we create an instance of the datablob and save it to the filesystem.

>>> import os
>>> import tempfile
>>> tempdir = tempfile.TemporaryDirectory()
>>>
>>> datablob = MyDataBlob(hyperparams=dict(length=5))
>>> datablob
<sp.DataBlob MyDataBlob-dac936v7mb1ue9phjp6tc3sb>
>>>
>>> list(datablob.training.as_numpy_iterator())
[(0, 5), (1, 6), (2, 7), (3, 8), (4, 9)]
>>>
>>> datablob = datablob.save(tempdir.name)
>>>
>>> os.listdir(tempdir.name)
['MyDataBlob-dac936v7mb1ue9phjp6tc3sb']

Now, let’s say that we want to create an AppendDataBlob that takes in any input DataBlob or DataFrameDataBlob and multiplies every number in every tensor by a constant.

>>> class MyAppendDataBlob(sp.AppendDataBlob):
...
...     @sp.dataclass
...     class Hyperparams(sp.AppendHyperparamsType):
...          coefficient: int
...
...     hyperparams: "MyAppendDataBlob.Hyperparams"
...
...     def __init__(self, *, parent: sp.DataBlob, hyperparams):
...         hyperparams_dict = sp.enforce_dict(hyperparams)
...         if hyperparams_dict["coefficient"] < 1:
...             raise ValueError("Coefficient is too low.")
...         super().__init__(parent=parent, hyperparams=hyperparams_dict)
...
...     def _wrap_tfdata(self, tfdata: tf.data.Dataset) -> tf.data.Dataset:
...          return tfdata.map(
...              lambda x, y: (
...                  x * self.hyperparams.coefficient,
...                  y * self.hyperparams.coefficient,
...               )
...          )
>>>
>>> append = MyAppendDataBlob(parent=datablob, hyperparams=dict(coefficient=3))
>>> list(append.training.as_numpy_iterator())
[(0, 15), (3, 18), (6, 21), (9, 24), (12, 27)]

(And now let’s clean up the temporary directory that we created earlier.)

>>> tempdir.cleanup()
Parameters
  • parent – The DataBlob to extend.

  • hyperparams – Additional hyperparameters to add on top of the existing hyperparameters from the parent DataBlob.

Hyperparams :Type[scalarstop.hyperparams.AppendHyperparamsType]
classmethod create_append_hyperparams(cls, *, parent: DataBlob, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None)

Combine the hyperparams from the parent DataBlob with the hyperparams meant for this AppendDataBlob.

classmethod calculate_name_from_parent(cls, *, parent: DataBlob, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None)

Calculate the hashed name of this AppendDataBlob, given the hyperparameters and the parent DataBlob.

classmethod from_filesystem_with_parent(cls, *, parent: DataBlob, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1)

Load an AppendDataBlob from the filesystem, calculating the filename from the parent and the hyperparameters.

classmethod from_filesystem_or_new_with_parent(cls, *, parent: DataBlob, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]], datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1, **kwargs)

Load an AppendDataBlob from the filesystem, calculating the filename from the hyperparameters. Create a new AppendDataBlob if we cannot find a saved one on the filesystem.

Parameters
  • parent – The parent DataBlob to extend.

  • hyperparams – The hyperparameters of the DataBlob that we want to load.

  • datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name, the parent DataBlob, and the hyperparams.
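
A hedged usage sketch, reusing the parent datablob and MyAppendDataBlob from the doctest above (the "datablobs" directory path is a hypothetical place where the AppendDataBlob may already have been saved):

>>> append_2 = MyAppendDataBlob.from_filesystem_or_new_with_parent(
...     parent=datablob,
...     hyperparams=dict(coefficient=3),
...     datablobs_directory="datablobs",
... )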

property parent(self) DataBlob

The parent DataBlob.

set_training(self) tf.data.Dataset

Create a tf.data.Dataset for the training set.

property training(self) tf.data.Dataset

A tf.data.Dataset instance representing the training set.

set_validation(self) tf.data.Dataset

Create a tf.data.Dataset for the validation set.

property validation(self) tf.data.Dataset

A tf.data.Dataset instance representing the validation set.

set_test(self) tf.data.Dataset

Create a tf.data.Dataset for the test set.

property test(self) tf.data.Dataset

A tf.data.Dataset instance representing the test set.

classmethod from_filesystem(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1)

Loads a DataBlob from the filesystem, calculating the filename from the hyperparameters.

Parameters
  • hyperparams – The hyperparameters of the model that we want to load.

  • datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.

classmethod from_filesystem_distributed(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, cache: bool = False, repeat: Union[bool, int, None] = True, per_replica_batch_size: Optional[int] = None, tf_distribute_strategy: Optional[tf.distribute.Strategy] = None) DistributedDataBlob

Loads a sharded DataBlob from the filesystem, automatically splitting the shards among the input workers of a tf.distribute.Strategy.

Parameters
  • hyperparams – The hyperparameters of the model that we want to load.

  • datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.

  • cache – Whether to cache the DataBlob in memory. If repeat is also enabled, then caching will happen before repeating.

  • repeat – Repeats the DataBlob after loading it. Set to True to enable infinite repeating. Set to a positive integer n to repeat the DataBlob n times. Set to False to disable repeating.

  • per_replica_batch_size – The batch size for each individual tf.distribute replica. This is the global batch size divided by tf.distribute.Strategy.num_replicas_in_sync.

  • tf_distribute_strategy – The tf.distribute.Strategy subclass to use. Optionally, this method will detect if it is already inside a tf.distribute.Strategy.scope() context manager.

classmethod metadata_from_filesystem(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str) scalarstop.datablob_metadata.DataBlobMetadata

Loads this DataBlob’s DataBlobMetadata from the filesystem, calculating the filename from the hyperparameters.

Parameters
  • hyperparams – The hyperparameters of the model that we want to load.

  • datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.

classmethod from_filesystem_or_new(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1, **kwargs)

Load a DataBlob from the filesystem, calculating the filename from the hyperparameters. Create a new DataBlob if we cannot find a saved one on the filesystem.

Parameters
  • hyperparams – The hyperparameters of the model that we want to load.

  • datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.

  • **kwargs – Other keyword arguments that you need to pass to your __init__().

static from_exact_path(path: str, *, shard_offset: Optional[int] = None, shard_quantity: int = 1) DataBlob

Load a DataBlob from a directory on the filesystem.

classmethod from_exact_path_distributed(cls, *, path: str, cache: bool = False, repeat: Union[bool, int, None] = True, per_replica_batch_size: Optional[int] = None, tf_distribute_strategy: Optional[tf.distribute.Strategy] = None) DistributedDataBlob
Parameters
  • path – The exact location of the saved DataBlob on the filesystem.

  • cache – Whether to cache the DataBlob in memory. If repeat is also enabled, then caching will happen before repeating.

  • repeat – Repeats the DataBlob after loading it. Set to True to enable infinite repeating. Set to a positive integer n to repeat the DataBlob n times. Set to False to disable repeating.

  • per_replica_batch_size – The batch size for each individual tf.distribute replica. This is the global batch size divided by tf.distribute.Strategy.num_replicas_in_sync.

  • tf_distribute_strategy – The tf.distribute.Strategy subclass to use. Optionally, this method will detect if it is already inside a tf.distribute.Strategy.scope() context manager.

static metadata_from_exact_path(path: str) scalarstop.datablob_metadata.DataBlobMetadata

Loads this DataBlob’s DataBlobMetadata from a directory on the filesystem.

exists_in_datablobs_directory(self, datablobs_directory: str) bool

Returns True if this DataBlob was already saved within datablobs_directory.

Parameters

datablobs_directory – The parent directory of all of your saved DataBlobs.

Returns

Returns True if we found a DataBlob metadata file at the expected location.

batch(self, batch_size: int, *, training: bool = True, validation: bool = True, test: bool = True, with_tf_distribute: bool = False) DataBlob

Batch this DataBlob.

Parameters
  • batch_size – The number of items to collect into a batch.

  • training – Whether to batch the training set. Defaults to True.

  • validation – Whether to batch the validation set. Defaults to True.

  • test – Whether to batch the test set. Defaults to True.

  • with_tf_distribute – Whether to consider tf.distribute auto-data sharding when calculating the batch size.

cache(self, *, training: bool = True, validation: bool = True, test: bool = True, precache_training: bool = False, precache_validation: bool = False, precache_test: bool = False) DataBlob

Cache this DataBlob into memory before iterating over it.

By default, this creates a DataBlob containing a TensorFlow CacheDataset for each of the training, validation, and test tf.data.Datasets.

But these datasets do not load into memory until the first time you completely iterate over one of them, from start to end. If you want to immediately load your training, validation, or test sets, you can set precache_training, precache_validation, and/or precache_test to True.

Parameters
  • training – Lazily cache the training set in CPU memory. Defaults to True.

  • validation – Lazily cache the validation set in CPU memory. Defaults to True.

  • test – Lazily cache the test set in CPU memory. Defaults to True.

  • precache_training – Eagerly cache the training set into memory. Defaults to False.

  • precache_validation – Eagerly cache the validation set into memory. Defaults to False.

  • precache_test – Eagerly cache the test set into memory. Defaults to False.

prefetch(self, buffer_size: int, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob

Creates a DataBlob that prefetches elements for performance.

Parameters
  • buffer_size – The maximum number of elements that will be buffered when prefetching. If the value tf.data.experimental.AUTOTUNE is used, then the buffer size is dynamically tuned.

  • training – Apply the prefetch operator to the training set. Defaults to True.

  • validation – Apply the prefetch operator to the validation set. Defaults to True.

  • test – Apply the prefetch operator to the test set. Defaults to True.

repeat(self, count: Optional[int] = None, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob

Repeats this DataBlob.

Parameters
  • count – Represents the number of times that the elements in the tf.data.Dataset should be repeated. The default behavior (if count is None or -1) is for the dataset to be repeated indefinitely.

  • training – Apply the repeat operator to the training set. Defaults to True.

  • validation – Apply the repeat operator to the validation set. Defaults to True.

  • test – Apply the repeat operator to the test set. Defaults to True.

repeat_interleaved(self, count: int, cycle_length: Optional[int] = None, block_length: Optional[int] = None, num_parallel_calls: Optional[int] = None, deterministic: Optional[bool] = None, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob

Repeats this DataBlob, but in interleaved order.

Parameters
  • count – Represents the number of times that the elements in the tf.data.Dataset should be repeated. This must be a finite integer greater than 0. It cannot be a negative number, None, or an infinite value.

  • training – Apply the repeat operator to the training set. Defaults to True.

  • validation – Apply the repeat operator to the validation set. Defaults to True.

  • test – Apply the repeat operator to the test set. Defaults to True.

with_options(self, options: tf.data.Options, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob

Apply a tf.data.Options object to this DataBlob.

Parameters
  • options – The tf.data.Options object to apply.

  • training – Apply the options to the training set. Defaults to True.

  • validation – Apply the options to the validation set. Defaults to True.

  • test – Apply the options to the test set. Defaults to True.

save_hook(self, *, subtype: str, path: str) None

Override this method to run additional code when saving this DataBlob to disk.

save(self, datablobs_directory: str, *, ignore_existing: bool = False, num_shards: int = 1, save_load_version: int = _DEFAULT_SAVE_LOAD_VERSION) DataBlob

Save this DataBlob to disk.

Parameters
  • datablobs_directory – The directory where you plan on storing all of your DataBlobs. This method will save this DataBlob in a subdirectory of datablobs_directory with the same name as DataBlob.name.

  • ignore_existing – Set this to True to ignore whether there is already a DataBlob at the given path.

  • save_load_version – The version of the ScalarStop saving/loading protocol to use.

Returns

Return self, enabling you to place this call in a chain.

classmethod calculate_name(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None) str

Calculate the hashed name of this object, given the hyperparameters.

This classmethod can be used to calculate what an object’s name would be without actually having to call __init__().

property group_name(self) str

The “group” name is this object’s Python class name.

property name(self) str

The group (class) name and a calculated hash of the hyperparameters.

property hyperparams(self) scalarstop.hyperparams.HyperparamsType

Returns a HyperparamsType instance containing hyperparameters.

property hyperparams_flat(self) Dict[str, Any]

Returns a Python dictionary of “flattened” hyperparameters.

AppendDataBlob objects modify a “parent” DataBlob, nesting the parent’s Hyperparams within the AppendDataBlob’s own Hyperparams.

This makes it hard to look up a given hyperparams key. A value at parent_datablob.hyperparams.a is stored at child_datablob.hyperparams.parent.hyperparams.a.

This hyperparams_flat property provides all nested hyperparams keys as a flat Python dictionary. If a child AppendDataBlob has a hyperparameter key that conflicts with the parent, the child’s value overwrites the parent’s value.

class DistributedDataBlob(*, name: str, group_name: str, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, hyperparams_class: Type[scalarstop.hyperparams.HyperparamsType], cache: bool = False, repeat: Union[bool, int, None] = True, per_replica_batch_size: Optional[int] = None, tf_distribute_strategy: Optional[tf.distribute.Strategy] = None)

Bases: DataBlobBase

Wraps a DataBlob to create a TensorFlow tf.distribute.DistributedDataset.

A DataBlob contains three TensorFlow tf.data.Dataset pipelines, representing a training, validation, and test set. The DistributedDataBlob wraps the creation of a DataBlob to turn each tf.data.Dataset into a tf.distribute.DistributedDataset which is used to distribute a dataset across multiple workers according to a tf.distribute.Strategy.

If you have saved a DataBlob to the filesystem with DataBlob.save(), then you can automatically load the DataBlob from the filesystem as a DistributedDataBlob using the classmethod DataBlob.from_filesystem_distributed() or DataBlob.from_exact_path_distributed().

For more fine-grained control, you can subclass DistributedDataBlob and override DistributedDataBlob.new_sharded_datablob() with your own DataBlob creation and sharding logic. Optionally, you can also subclass DistributedDataBlob.transform_datablob() to change how DistributedDataBlob handles repeating and batching. Finally, you can also subclass DistributedDataBlob.postprocess_tfdata() to make changes to individual tf.data.Dataset instances rather than the DataBlob as a whole.
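
A hedged sketch of that subclassing approach. It assumes a DataBlob subclass named MyDataBlob that was saved in shards under a hypothetical "datablobs" directory, and it assumes that shard_offset and shard_quantity map onto the input pipeline id and count in this way:

>>> class MyDistributedDataBlob(sp.DistributedDataBlob):
...
...     def new_sharded_datablob(self, ctx: tf.distribute.InputContext) -> sp.DataBlob:
...         # Load only the shards assigned to this input pipeline (assumed mapping).
...         return MyDataBlob.from_filesystem(
...             hyperparams=self.hyperparams,
...             datablobs_directory="datablobs",  # hypothetical path
...             shard_offset=ctx.input_pipeline_id,
...             shard_quantity=ctx.num_input_pipelines,
...         )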

Parameters
  • name – The name of the wrapped DataBlob.

  • group_name – The group name of the wrapped DataBlob.

  • hyperparams – The hyperparameters of the wrapped DataBlob.

  • hyperparams_class – The HyperparamsType class that hyperparams instances are created from.

  • cache – Whether to cache the DataBlob in memory. If repeat is also enabled, then caching will happen before repeating.

  • repeat – Repeats the DataBlob after loading it. Set to True to enable infinite repeating. Set to a positive integer n to repeat the DataBlob n times. Set to False to disable repeating.

  • per_replica_batch_size – The batch size for each individual tf.distribute replica. This is the global batch size divided by tf.distribute.Strategy.num_replicas_in_sync.

  • tf_distribute_strategy – The tf.distribute.Strategy subclass to use. Optionally, this method will detect if it is already inside a tf.distribute.Strategy.scope() context manager.

Hyperparams :Type[scalarstop.hyperparams.HyperparamsType]
abstract new_sharded_datablob(self, ctx: tf.distribute.InputContext) DataBlob

Override this method in a subclass to return a sharded DataBlob.

Parameters

ctx – A tf.distribute.InputContext instance. The attribute tf.distribute.InputContext.input_pipeline_id returns the current input pipeline. The attribute tf.distribute.InputContext.num_input_pipelines returns the total number of distributed input pipelines in the current tf.distribute.Strategy.

transform_datablob(self, datablob: DataBlob, ctx: tf.distribute.InputContext) DataBlob

Transforms an already-initialized DataBlob to add repeating and sharding logic.

Parameters
  • datablob – The already-initialized DataBlob.

  • ctx – A tf.distribute.InputContext instance. The attribute tf.distribute.InputContext.input_pipeline_id returns the current input pipeline. The attribute tf.distribute.InputContext.num_input_pipelines returns the total number of distributed input pipelines in the current tf.distribute.Strategy.

Returns

Returns a DataBlob that has been modified by repeating, batching, or another transformation.

postprocess_tfdata(self, tfdata: tf.data.Dataset, ctx: tf.distribute.InputContext) tf.data.Dataset

Performs additional tf.data.Dataset transformations before turning them into tf.distribute.DistributedDataset instances.

Currently, the implementation in DistributedDataBlob does nothing, but it is available for you to override in a subclass.

Parameters
  • tfdata – The input tf.data.Dataset instance to transform.

  • ctx – A tf.distribute.InputContext instance. The attribute tf.distribute.InputContext.input_pipeline_id returns the current input pipeline. The attribute tf.distribute.InputContext.num_input_pipelines returns the total number of distributed input pipelines in the current tf.distribute.Strategy.

Returns

Returns a transformed tf.data.Dataset.

property tf_distribute_strategy(self) tf.distribute.Strategy

Returns the currently-active tf.distribute.Strategy.

set_training(self) tf.distribute.DistributedDataset

Creates a new tf.distribute.DistributedDataset for the training set.

property training(self) tf.distribute.DistributedDataset

A tf.distribute.DistributedDataset instance for the training set.

set_validation(self) tf.distribute.DistributedDataset

Creates a new tf.distribute.DistributedDataset for the validation set.

property validation(self) tf.distribute.DistributedDataset

A tf.distribute.DistributedDataset instance for the validation set.

set_test(self) tf.distribute.DistributedDataset

Creates a new tf.distribute.DistributedDataset for the test set.

property test(self) tf.distribute.DistributedDataset

A tf.distribute.DistributedDataset instance for the test set.

classmethod calculate_name(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None) str

Calculate the hashed name of this object, given the hyperparameters.

This classmethod can be used to calculate what an object’s name would be without actually having to call __init__().

property group_name(self) str

The “group” name is this object’s Python class name.

property name(self) str

The group (class) name and a calculated hash of the hyperparameters.

property hyperparams(self) scalarstop.hyperparams.HyperparamsType

Returns a HyperparamsType instance containing hyperparameters.

property hyperparams_flat(self) Dict[str, Any]

Returns a Python dictionary of “flattened” hyperparameters.

AppendDataBlob objects modify a “parent” DataBlob, nesting the parent’s Hyperparams within the AppendDataBlob’s own Hyperparams.

This makes it hard to look up a given hyperparams key. A value at parent_datablob.hyperparams.a is stored at child_datablob.hyperparams.parent.hyperparams.a.

This hyperparams_flat property provides all nested hyperparams keys as a flat Python dictionary. If a child AppendDataBlob has a hyperparameter key that conflicts with the parent, the child’s value overwrites the parent’s value.