scalarstop.datablob¶
Group together and name your training, validation, and test sets.
The classes in this module group your data into training, validation, and test sets for training machine learning models. We also record the hyperparameters used to process the dataset. The DataBlob subclass name and hyperparameters are used to create a unique content-addressable name that makes it easy to keep track of many datasets at once.
Module Contents¶
Classes¶
DataBlobBase: The abstract base class describing the properties common to all DataBlobs.
DataBlob: Subclass this to group your training, validation, and test sets for training machine learning models.
DataFrameDataBlob: Subclass this to transform a pandas.DataFrame into your training, validation, and test sets.
AppendDataBlob: Subclass this to create a new DataBlob that extends an existing DataBlob.
DistributedDataBlob: Wraps a DataBlob to create a TensorFlow tf.distribute.DistributedDataset.
- class DataBlobBase(*, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, **kwargs)¶
Bases:
scalarstop._single_namespace.SingleNamespace
The abstract base class describing the properties common to all DataBlobs.
- Parameters
hyperparams – The hyperparameters to initialize this class with.
- Hyperparams :Type[scalarstop.hyperparams.HyperparamsType]¶
- set_training(self) Any ¶
Creates and returns a new object representing the training set.
- property training(self) Any ¶
An object representing the training set.
- set_validation(self) Any ¶
Creates and returns a new object representing the validation set.
- property validation(self) Any ¶
An object representing the validation set.
- set_test(self) Any ¶
Creates and returns a new object representing the test set.
- property test(self) Any ¶
An object representing the test set.
- classmethod calculate_name(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None) str ¶
Calculate the hashed name of this object, given the hyperparameters.
This classmethod can be used to calculate what an object's name would be without actually having to call __init__().
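For example, you can compute the name a DataBlob would receive before constructing it. A minimal sketch, assuming the MyDataBlob subclass with a cols hyperparameter that is defined later on this page:

>>> MyDataBlob.calculate_name(hyperparams=dict(cols=3))
'MyDataBlob-bn5hpc7ueo2uz7as1747tetn'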
- property hyperparams(self) scalarstop.hyperparams.HyperparamsType ¶
Returns a
HyperparamsType
instance containing hyperparameters.
- property hyperparams_flat(self) Dict[str, Any] ¶
Returns a Python dictionary of “flattened” hyperparameters.
AppendDataBlob objects modify a "parent" DataBlob, nesting the parent's Hyperparams within the AppendDataBlob's own Hyperparams.
This makes it hard to look up a given hyperparams key. A value at parent_datablob.hyperparams.a is stored at child_datablob.hyperparams.parent.hyperparams.a.
This hyperparams_flat property provides all nested hyperparams keys as a flat Python dictionary. If a child AppendDataBlob has a hyperparameter key that conflicts with the parent, the child's value will overwrite the parent's value.
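As a sketch of the difference, assume a hypothetical child AppendDataBlob whose parent defines a hyperparameter a=1 and which adds its own hyperparameter b=2:

>>> child_datablob.hyperparams.parent.hyperparams.a  # nested lookup
1
>>> child_datablob.hyperparams_flat["a"]  # flattened lookup
1
>>> child_datablob.hyperparams_flat["b"]
2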
- class DataBlob(*, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, **kwargs)¶
Bases:
DataBlobBase
Subclass this to group your training, validation, and test sets for training machine learning models.
Here is how to use DataBlob to group your training, validation, and test sets:
1. Subclass DataBlob with a class name that describes your dataset in general. In this example, we'll use MyDataBlob as the class name.
2. Define a dataclass using the @sp.dataclass decorator at MyDataBlob.Hyperparams. We'll define an instance of this dataclass at MyDataBlob.hyperparams. This describes the hyperparameters involved in processing your dataset.
3. Override the methods DataBlob.set_training(), DataBlob.set_validation(), and DataBlob.set_test() to generate tf.data.Dataset pipelines representing your training, validation, and test sets.
Those three steps roughly look like:
>>> import tensorflow as tf
>>> import scalarstop as sp
>>>
>>> class MyDataBlob(sp.DataBlob):
...
...     @sp.dataclass
...     class Hyperparams(sp.HyperparamsType):
...         cols: int
...
...     def _data(self):
...         x = tf.random.uniform(shape=(10, self.hyperparams.cols))
...         y = tf.round(tf.random.uniform(shape=(10, 1)))
...         return tf.data.Dataset.zip((
...             tf.data.Dataset.from_tensor_slices(x),
...             tf.data.Dataset.from_tensor_slices(y),
...         ))
...
...     def set_training(self):
...         return self._data()
...
...     def set_validation(self):
...         return self._data()
...
...     def set_test(self):
...         return self._data()
>>>
In our above example, our training, validation, and test sets are created with the exact same code. In practice, you’ll be creating them with different inputs.
Now we create an instance of our subclass so we can start using it.
>>> datablob = MyDataBlob(hyperparams=dict(cols=3))
>>> datablob
<sp.DataBlob MyDataBlob-bn5hpc7ueo2uz7as1747tetn>
DataBlob instances are given a unique name by hashing together the class name with the instance's hyperparameters.

>>> datablob.name
'MyDataBlob-bn5hpc7ueo2uz7as1747tetn'
>>>
>>> datablob.group_name
'MyDataBlob'
>>>
>>> datablob.hyperparams
MyDataBlob.Hyperparams(cols=3)
>>>
>>> sp.enforce_dict(datablob.hyperparams)
{'cols': 3}
We save exactly one instance of each tf.data.Dataset pipeline in the properties DataBlob.training, DataBlob.validation, and DataBlob.test.

>>> datablob.training
<ZipDataset element_spec=(TensorSpec(shape=(3,), dtype=tf.float32, name=None), TensorSpec(shape=(1,), dtype=tf.float32, name=None))>
>>>
>>> datablob.validation
<ZipDataset element_spec=(TensorSpec(shape=(3,), dtype=tf.float32, name=None), TensorSpec(shape=(1,), dtype=tf.float32, name=None))>
>>>
>>> datablob.test
<ZipDataset element_spec=(TensorSpec(shape=(3,), dtype=tf.float32, name=None), TensorSpec(shape=(1,), dtype=tf.float32, name=None))>
DataBlob objects have some methods for applying tf.data transformations to the training, validation, and test sets at the same time:
- Batching. DataBlob.batch() will batch the training, validation, and test sets at the same time. If you call DataBlob.batch() with the keyword argument with_tf_distribute=True, your input batch size will be multiplied by the number of replicas in your tf.distribute strategy.
- Caching. DataBlob.cache() will cache the training, validation, and test sets in memory once you iterate over them. This is useful if your tf.data.Dataset pipelines are doing something computationally expensive each time you iterate over them.
- Saving/loading to/from the filesystem. DataBlob.save() saves the training, validation, and test sets to a path on the filesystem. This can be loaded back with the classmethod DataBlob.from_exact_path().
>>> import os
>>> import tempfile
>>> tempdir = tempfile.TemporaryDirectory()
>>>
>>> datablob = datablob.save(tempdir.name)
>>>
>>> os.listdir(tempdir.name)
['MyDataBlob-bn5hpc7ueo2uz7as1747tetn']
>>> path = os.path.join(tempdir.name, datablob.name)
>>> loaded_datablob = MyDataBlob.from_exact_path(path)
>>> loaded_datablob
<sp.DataBlob MyDataBlob-bn5hpc7ueo2uz7as1747tetn>
Alternatively, if you have the hyperparameters of the DataBlob but not the name, you can use the classmethod DataBlob.from_filesystem().

>>> loaded_datablob_2 = MyDataBlob.from_filesystem(
...     hyperparams=dict(cols=3),
...     datablobs_directory=tempdir.name,
... )
>>> loaded_datablob_2
<sp.DataBlob MyDataBlob-bn5hpc7ueo2uz7as1747tetn>
(and now let’s clean up the temporary directory from above)
>>> tempdir.cleanup()
- Parameters
hyperparams – The hyperparameters to initialize this class with.
- Hyperparams :Type[scalarstop.hyperparams.HyperparamsType]¶
- classmethod from_filesystem(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1)¶
Loads a DataBlob from the filesystem, calculating the filename from the hyperparameters.
- Parameters
hyperparams – The hyperparameters of the model that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
- classmethod from_filesystem_distributed(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, cache: bool = False, repeat: Union[bool, int, None] = True, per_replica_batch_size: Optional[int] = None, tf_distribute_strategy: Optional[tf.distribute.Strategy] = None) DistributedDataBlob ¶
Loads a sharded DataBlob from the filesystem, automatically splitting the shards among the input workers of a tf.distribute.Strategy.
- Parameters
hyperparams – The hyperparameters of the model that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
cache – Whether to cache the DataBlob in memory. If repeat is also enabled, then caching will happen before repeating.
repeat – Repeats the DataBlob after loading it. Set to True to enable infinite repeating. Set to a positive integer n to repeat the DataBlob n times. Set to False to disable repeating.
per_replica_batch_size – The batch size for each individual tf.distribute replica. This is the global batch size divided by tf.distribute.Strategy.num_replicas_in_sync.
tf_distribute_strategy – The tf.distribute.Strategy subclass to use. Optionally, this method will detect if it is already inside a tf.distribute.Strategy.scope() context manager.
- classmethod metadata_from_filesystem(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str) scalarstop.datablob_metadata.DataBlobMetadata ¶
Loads this DataBlob's DataBlobMetadata from the filesystem, calculating the filename from the hyperparameters.
- Parameters
hyperparams – The hyperparameters of the model that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
- classmethod from_filesystem_or_new(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1, **kwargs)¶
Load a DataBlob from the filesystem, calculating the filename from the hyperparameters. Create a new DataBlob if we cannot find a saved one on the filesystem.
- Parameters
hyperparams – The hyperparameters of the model that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
**kwargs – Other keyword arguments that you need to pass to your __init__().
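A typical call looks like the sketch below, reusing the MyDataBlob class and tempdir from the examples above. The DataBlob is loaded from disk if a saved copy exists and is otherwise created from scratch:

>>> datablob_3 = MyDataBlob.from_filesystem_or_new(
...     hyperparams=dict(cols=3),
...     datablobs_directory=tempdir.name,
... )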
- static from_exact_path(path: str, *, shard_offset: Optional[int] = None, shard_quantity: int = 1) DataBlob ¶
Load a
DataBlob
from a directory on the filesystem.
- classmethod from_exact_path_distributed(cls, *, path: str, cache: bool = False, repeat: Union[bool, int, None] = True, per_replica_batch_size: Optional[int] = None, tf_distribute_strategy: Optional[tf.distribute.Strategy] = None) DistributedDataBlob ¶
- Parameters
path – The exact location of the saved DataBlob on the filesystem.
cache – Whether to cache the DataBlob in memory. If repeat is also enabled, then caching will happen before repeating.
repeat – Repeats the DataBlob after loading it. Set to True to enable infinite repeating. Set to a positive integer n to repeat the DataBlob n times. Set to False to disable repeating.
per_replica_batch_size – The batch size for each individual tf.distribute replica. This is the global batch size divided by tf.distribute.Strategy.num_replicas_in_sync.
tf_distribute_strategy – The tf.distribute.Strategy subclass to use. Optionally, this method will detect if it is already inside a tf.distribute.Strategy.scope() context manager.
- static metadata_from_exact_path(path: str) scalarstop.datablob_metadata.DataBlobMetadata ¶
Loads this DataBlob's DataBlobMetadata from a directory on the filesystem.
- exists_in_datablobs_directory(self, datablobs_directory: str) bool ¶
Returns True if this DataBlob was already saved within datablobs_directory.
- Parameters
datablobs_directory – The parent directory of all of your saved DataBlobs.
- Returns
Returns True if we found a DataBlob metadata file at the expected location.
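For example, continuing the examples above, you might skip an expensive save when the DataBlob is already on disk. A minimal sketch:

>>> if not datablob.exists_in_datablobs_directory(tempdir.name):
...     datablob = datablob.save(tempdir.name)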
- set_training(self) tf.data.Dataset ¶
Create a
tf.data.Dataset
for the training set.
- property training(self) tf.data.Dataset ¶
A
tf.data.Dataset
instance representing the training set.
- set_validation(self) tf.data.Dataset ¶
Create a
tf.data.Dataset
for the validation set.
- property validation(self) tf.data.Dataset ¶
A
tf.data.Dataset
instance representing the validation set.
- set_test(self) tf.data.Dataset ¶
Create a
tf.data.Dataset
for the test set.
- property test(self) tf.data.Dataset ¶
A
tf.data.Dataset
instance representing the test set.
- batch(self, batch_size: int, *, training: bool = True, validation: bool = True, test: bool = True, with_tf_distribute: bool = False) DataBlob ¶
Batch this DataBlob.
- Parameters
batch_size – The number of items to collect into a batch.
training – Whether to batch the training set. Defaults to True.
validation – Whether to batch the validation set. Defaults to True.
test – Whether to batch the test set. Defaults to True.
with_tf_distribute – Whether to consider tf.distribute auto-data sharding when calculating the batch size.
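For instance, a sketch that batches all three sets, and another that leaves the test set unbatched, using the datablob from the examples above:

>>> batched = datablob.batch(2)
>>> batched_no_test = datablob.batch(2, test=False)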
- cache(self, *, training: bool = True, validation: bool = True, test: bool = True, precache_training: bool = False, precache_validation: bool = False, precache_test: bool = False) DataBlob ¶
Cache this DataBlob into memory before iterating over it.
By default, this creates a DataBlob containing a TensorFlow CacheDataset for each of the training, validation, and test tf.data.Datasets.
But these datasets do not load into memory until the first time you completely iterate over one, from start to end. If you want to immediately load your training, validation, or test sets, you can set precache_training, precache_validation, and/or precache_test to True.
- Parameters
training – Lazily cache the training set in CPU memory. Defaults to True.
validation – Lazily cache the validation set in CPU memory. Defaults to True.
test – Lazily cache the test set in CPU memory. Defaults to True.
precache_training – Eagerly cache the training set into memory. Defaults to False.
precache_validation – Eagerly cache the validation set into memory. Defaults to False.
precache_test – Eagerly cache the test set into memory. Defaults to False.
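For example, a sketch that lazily caches all three sets but eagerly loads the validation set into memory right away:

>>> cached = datablob.cache(precache_validation=True)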
- prefetch(self, buffer_size: int, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob ¶
Creates a DataBlob that prefetches elements for performance.
- Parameters
buffer_size – The maximum number of elements that will be buffered when prefetching. If the value tf.data.experimental.AUTOTUNE is used, then the buffer size is dynamically tuned.
training – Apply the prefetch operator to the training set. Defaults to True.
validation – Apply the prefetch operator to the validation set. Defaults to True.
test – Apply the prefetch operator to the test set. Defaults to True.
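A short sketch using TensorFlow's autotuned buffer size:

>>> prefetched = datablob.prefetch(tf.data.experimental.AUTOTUNE)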
- repeat(self, count: Optional[int] = None, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob ¶
Repeats this DataBlob.
- Parameters
count – Represents the number of times that the elements in the tf.data.Dataset should be repeated. The default behavior (if count is None or -1) is for the dataset to be repeated indefinitely.
training – Apply the repeat operator to the training set. Defaults to True.
validation – Apply the repeat operator to the validation set. Defaults to True.
test – Apply the repeat operator to the test set. Defaults to True.
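For example, a sketch that repeats only the training set twice:

>>> repeated = datablob.repeat(2, validation=False, test=False)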
- repeat_interleaved(self, count: int, cycle_length: Optional[int] = None, block_length: Optional[int] = None, num_parallel_calls: Optional[int] = None, deterministic: Optional[bool] = None, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob ¶
Repeats this DataBlob, but in interleaved order.
- Parameters
count – Represents the number of times that the elements in the tf.data.Dataset should be repeated. This must be a finite integer greater than 0. It cannot be a negative number, None, or an infinite value.
training – Apply the repeat operator to the training set. Defaults to True.
validation – Apply the repeat operator to the validation set. Defaults to True.
test – Apply the repeat operator to the test set. Defaults to True.
- with_options(self, options: tf.data.Options, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob ¶
Apply a tf.data.Options object to this DataBlob.
- Parameters
options – The tf.data.Options object to apply.
training – Apply the options to the training set. Defaults to True.
validation – Apply the options to the validation set. Defaults to True.
test – Apply the options to the test set. Defaults to True.
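A sketch that disables tf.data auto-sharding on all three sets, a common adjustment when feeding already-sharded data to tf.distribute:

>>> options = tf.data.Options()
>>> options.experimental_distribute.auto_shard_policy = (
...     tf.data.experimental.AutoShardPolicy.OFF
... )
>>> datablob_with_options = datablob.with_options(options)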
- save_hook(self, *, subtype: str, path: str) None ¶
Override this method to run additional code when saving this
DataBlob
to disk.
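As a sketch of an override, assuming that subtype names the dataset split being saved and path is the directory it was saved into (both assumptions about the arguments this module passes in), a subclass could drop a marker file next to each split:

>>> class MyDataBlobWithHook(MyDataBlob):
...     def save_hook(self, *, subtype: str, path: str) -> None:
...         # Hypothetical marker file; the filename is illustrative only.
...         with open(os.path.join(path, "marker.txt"), "w") as f:
...             f.write(f"{self.name}:{subtype}")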
- save(self, datablobs_directory: str, *, ignore_existing: bool = False, num_shards: int = 1, save_load_version: int = _DEFAULT_SAVE_LOAD_VERSION) DataBlob ¶
Save this DataBlob to disk.
- Parameters
datablobs_directory – The directory where you plan on storing all of your DataBlobs. This method will save this DataBlob in a subdirectory of datablobs_directory with the same name as DataBlob.name.
ignore_existing – Set this to True to ignore if there is already a DataBlob at the given path.
save_load_version – The version of the ScalarStop save/load protocol to use.
- Returns
Returns self, enabling you to place this call in a chain.
- classmethod calculate_name(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None) str ¶
Calculate the hashed name of this object, given the hyperparameters.
This classmethod can be used to calculate what an object's name would be without actually having to call __init__().
- property hyperparams(self) scalarstop.hyperparams.HyperparamsType ¶
Returns a
HyperparamsType
instance containing hyperparameters.
- property hyperparams_flat(self) Dict[str, Any] ¶
Returns a Python dictionary of “flattened” hyperparameters.
AppendDataBlob objects modify a "parent" DataBlob, nesting the parent's Hyperparams within the AppendDataBlob's own Hyperparams.
This makes it hard to look up a given hyperparams key. A value at parent_datablob.hyperparams.a is stored at child_datablob.hyperparams.parent.hyperparams.a.
This hyperparams_flat property provides all nested hyperparams keys as a flat Python dictionary. If a child AppendDataBlob has a hyperparameter key that conflicts with the parent, the child's value will overwrite the parent's value.
- class DataFrameDataBlob(*, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, **kwargs)¶
Bases:
DataBlob
Subclass this to transform a pandas.DataFrame into your training, validation, and test sets.
DataBlob is useful when you want to manually define your tf.data pipelines and their input tensors. However, if your input tensors are in a fixed-size list or DataFrame that you want to slice into a training, validation, and test set, then you might find DataFrameDataBlob handy.
Here is how to use it:
1. Subclass DataFrameDataBlob with a class name that describes your dataset.
2. Override DataFrameDataBlob.set_dataframe() and have it return a single DataFrame that contains all of the inputs for your training, validation, and test sets. The DataFrame should have one column representing training samples and another column representing training labels.
3. Override DataFrameDataBlob.transform() and define a method that transforms an arbitrary DataFrame of inputs into a tf.data.Dataset pipeline that represents the actual dataset needed for training and evaluation.
We define what fractions of the DataFrame to split off with the class attributes DataFrameDataBlob.training_fraction and DataFrameDataBlob.validation_fraction. By default, 60 percent of the DataFrame is marked for the training set, 20 percent for the validation set, and the remainder of the DataFrame for the test set.
Roughly, this looks like:
>>> import pandas as pd
>>> import tensorflow as tf
>>> import scalarstop as sp
>>>
>>> class MyDataFrameDataBlob(sp.DataFrameDataBlob):
...     samples_column: str = "samples"
...     labels_column: str = "labels"
...     training_fraction: float = 0.6
...     validation_fraction: float = 0.2
...
...     @sp.dataclass
...     class Hyperparams(sp.HyperparamsType):
...         length: int = 0
...
...     def set_dataframe(self):
...         samples = list(range(self.hyperparams.length))
...         labels = list(range(self.hyperparams.length))
...         return pd.DataFrame({self.samples_column: samples, self.labels_column: labels})
...
...     def transform(self, dataframe: pd.DataFrame):
...         return tf.data.Dataset.zip((
...             tf.data.Dataset.from_tensor_slices(dataframe[self.samples_column]),
...             tf.data.Dataset.from_tensor_slices(dataframe[self.labels_column]),
...         ))
>>> datablob2 = MyDataFrameDataBlob(hyperparams=dict(length=10))
And you can use the resulting object in all of the same ways as we've demonstrated with DataBlob subclass instances above.
- Parameters
hyperparams – The hyperparameters to initialize this class with.
- samples_column :str = samples¶
- labels_column :str = labels¶
- training_fraction :float = 0.6¶
- validation_fraction :float = 0.2¶
- Hyperparams :Type[scalarstop.hyperparams.HyperparamsType]¶
- static from_exact_path(path: str, *, shard_offset: Optional[int] = None, shard_quantity: int = 1) Union[DataBlob, DataFrameDataBlob] ¶
Load a
DataFrameDataBlob
from a directory on the filesystem.
- set_dataframe(self) pandas.DataFrame ¶
Create a new
pandas.DataFrame
that contains all of the data for the training, validation, and test sets.
- property dataframe(self) pandas.DataFrame ¶
A
pandas.DataFrame
that represents the entire training, validation, and test set.
- set_training_dataframe(self) pandas.DataFrame ¶
Sets the pandas.DataFrame for the training set.
By default, this method slices the pandas.DataFrame you have supplied to set_dataframe().
Alternatively, you can choose to directly override set_training_dataframe(), set_validation_dataframe(), and set_test_dataframe().
- Returns
Returns a pandas.DataFrame.
- property training_dataframe(self) pandas.DataFrame ¶
A
pandas.DataFrame
representing training set input tensors.
- set_validation_dataframe(self) pandas.DataFrame ¶
Sets the pandas.DataFrame for the validation set.
By default, this method slices the pandas.DataFrame you have supplied to set_dataframe().
Alternatively, you can choose to directly override set_training_dataframe(), set_validation_dataframe(), and set_test_dataframe().
- Returns
Returns a pandas.DataFrame.
- property validation_dataframe(self) pandas.DataFrame ¶
A
pandas.DataFrame
representing validation set input tensors.
- set_test_dataframe(self) pandas.DataFrame ¶
Sets the pandas.DataFrame for the test set.
By default, this method slices the pandas.DataFrame you have supplied to set_dataframe().
Alternatively, you can choose to directly override set_training_dataframe(), set_validation_dataframe(), and set_test_dataframe().
- Returns
Returns a pandas.DataFrame.
- property test_dataframe(self) pandas.DataFrame ¶
A
pandas.DataFrame
representing test set input tensors.
- transform(self, dataframe: pandas.DataFrame) tf.data.Dataset ¶
Transforms any input tensors into an output
tf.data.Dataset
.
- set_training(self) tf.data.Dataset ¶
Create a
tf.data.Dataset
for the training set.
- set_validation(self) tf.data.Dataset ¶
Create a
tf.data.Dataset
for the validation set.
- set_test(self) tf.data.Dataset ¶
Create a
tf.data.Dataset
for the test set.
- save_hook(self, *, subtype: str, path: str) None ¶
Override this method to run additional code when saving this
DataBlob
to disk.
- classmethod from_filesystem(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1)¶
Loads a DataBlob from the filesystem, calculating the filename from the hyperparameters.
- Parameters
hyperparams – The hyperparameters of the model that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
- classmethod from_filesystem_distributed(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, cache: bool = False, repeat: Union[bool, int, None] = True, per_replica_batch_size: Optional[int] = None, tf_distribute_strategy: Optional[tf.distribute.Strategy] = None) DistributedDataBlob ¶
Loads a sharded DataBlob from the filesystem, automatically splitting the shards among the input workers of a tf.distribute.Strategy.
- Parameters
hyperparams – The hyperparameters of the model that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
cache – Whether to cache the DataBlob in memory. If repeat is also enabled, then caching will happen before repeating.
repeat – Repeats the DataBlob after loading it. Set to True to enable infinite repeating. Set to a positive integer n to repeat the DataBlob n times. Set to False to disable repeating.
per_replica_batch_size – The batch size for each individual tf.distribute replica. This is the global batch size divided by tf.distribute.Strategy.num_replicas_in_sync.
tf_distribute_strategy – The tf.distribute.Strategy subclass to use. Optionally, this method will detect if it is already inside a tf.distribute.Strategy.scope() context manager.
- classmethod metadata_from_filesystem(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str) scalarstop.datablob_metadata.DataBlobMetadata ¶
Loads this DataBlob's DataBlobMetadata from the filesystem, calculating the filename from the hyperparameters.
- Parameters
hyperparams – The hyperparameters of the model that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
- classmethod from_filesystem_or_new(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1, **kwargs)¶
Load a DataBlob from the filesystem, calculating the filename from the hyperparameters. Create a new DataBlob if we cannot find a saved one on the filesystem.
- Parameters
hyperparams – The hyperparameters of the model that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
**kwargs – Other keyword arguments that you need to pass to your __init__().
- classmethod from_exact_path_distributed(cls, *, path: str, cache: bool = False, repeat: Union[bool, int, None] = True, per_replica_batch_size: Optional[int] = None, tf_distribute_strategy: Optional[tf.distribute.Strategy] = None) DistributedDataBlob ¶
- Parameters
path – The exact location of the saved DataBlob on the filesystem.
cache – Whether to cache the DataBlob in memory. If repeat is also enabled, then caching will happen before repeating.
repeat – Repeats the DataBlob after loading it. Set to True to enable infinite repeating. Set to a positive integer n to repeat the DataBlob n times. Set to False to disable repeating.
per_replica_batch_size – The batch size for each individual tf.distribute replica. This is the global batch size divided by tf.distribute.Strategy.num_replicas_in_sync.
tf_distribute_strategy – The tf.distribute.Strategy subclass to use. Optionally, this method will detect if it is already inside a tf.distribute.Strategy.scope() context manager.
- static metadata_from_exact_path(path: str) scalarstop.datablob_metadata.DataBlobMetadata ¶
Loads this DataBlob's DataBlobMetadata from a directory on the filesystem.
- exists_in_datablobs_directory(self, datablobs_directory: str) bool ¶
Returns True if this DataBlob was already saved within datablobs_directory.
- Parameters
datablobs_directory – The parent directory of all of your saved DataBlobs.
- Returns
Returns True if we found a DataBlob metadata file at the expected location.
- property training(self) tf.data.Dataset ¶
A
tf.data.Dataset
instance representing the training set.
- property validation(self) tf.data.Dataset ¶
A
tf.data.Dataset
instance representing the validation set.
- property test(self) tf.data.Dataset ¶
A
tf.data.Dataset
instance representing the test set.
- batch(self, batch_size: int, *, training: bool = True, validation: bool = True, test: bool = True, with_tf_distribute: bool = False) DataBlob ¶
Batch this DataBlob.
- Parameters
batch_size – The number of items to collect into a batch.
training – Whether to batch the training set. Defaults to True.
validation – Whether to batch the validation set. Defaults to True.
test – Whether to batch the test set. Defaults to True.
with_tf_distribute – Whether to consider tf.distribute auto-data sharding when calculating the batch size.
- cache(self, *, training: bool = True, validation: bool = True, test: bool = True, precache_training: bool = False, precache_validation: bool = False, precache_test: bool = False) DataBlob ¶
Cache this DataBlob into memory before iterating over it.
By default, this creates a DataBlob containing a TensorFlow CacheDataset for each of the training, validation, and test tf.data.Datasets.
But these datasets do not load into memory until the first time you completely iterate over one, from start to end. If you want to immediately load your training, validation, or test sets, you can set precache_training, precache_validation, and/or precache_test to True.
- Parameters
training – Lazily cache the training set in CPU memory. Defaults to True.
validation – Lazily cache the validation set in CPU memory. Defaults to True.
test – Lazily cache the test set in CPU memory. Defaults to True.
precache_training – Eagerly cache the training set into memory. Defaults to False.
precache_validation – Eagerly cache the validation set into memory. Defaults to False.
precache_test – Eagerly cache the test set into memory. Defaults to False.
- prefetch(self, buffer_size: int, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob ¶
Creates a DataBlob that prefetches elements for performance.
- Parameters
buffer_size – The maximum number of elements that will be buffered when prefetching. If the value tf.data.experimental.AUTOTUNE is used, then the buffer size is dynamically tuned.
training – Apply the prefetch operator to the training set. Defaults to True.
validation – Apply the prefetch operator to the validation set. Defaults to True.
test – Apply the prefetch operator to the test set. Defaults to True.
- repeat(self, count: Optional[int] = None, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob ¶
Repeats this DataBlob.
- Parameters
count – Represents the number of times that the elements in the tf.data.Dataset should be repeated. The default behavior (if count is None or -1) is for the dataset to be repeated indefinitely.
training – Apply the repeat operator to the training set. Defaults to True.
validation – Apply the repeat operator to the validation set. Defaults to True.
test – Apply the repeat operator to the test set. Defaults to True.
- repeat_interleaved(self, count: int, cycle_length: Optional[int] = None, block_length: Optional[int] = None, num_parallel_calls: Optional[int] = None, deterministic: Optional[bool] = None, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob ¶
Repeats this DataBlob, but in interleaved order.
- Parameters
count – Represents the number of times that the elements in the tf.data.Dataset should be repeated. This must be a finite integer greater than 0. It cannot be a negative number, None, or an infinite value.
training – Apply the repeat operator to the training set. Defaults to True.
validation – Apply the repeat operator to the validation set. Defaults to True.
test – Apply the repeat operator to the test set. Defaults to True.
- with_options(self, options: tf.data.Options, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob ¶
Apply a tf.data.Options object to this DataBlob.
- Parameters
options – The tf.data.Options object to apply.
training – Apply the options to the training set. Defaults to True.
validation – Apply the options to the validation set. Defaults to True.
test – Apply the options to the test set. Defaults to True.
- save(self, datablobs_directory: str, *, ignore_existing: bool = False, num_shards: int = 1, save_load_version: int = _DEFAULT_SAVE_LOAD_VERSION) DataBlob ¶
Save this DataBlob to disk.
- Parameters
datablobs_directory – The directory where you plan on storing all of your DataBlobs. This method will save this DataBlob in a subdirectory of datablobs_directory with the same name as DataBlob.name.
ignore_existing – Set this to True to ignore if there is already a DataBlob at the given path.
save_load_version – The version of the ScalarStop save/load protocol to use.
- Returns
Returns self, enabling you to place this call in a chain.
- classmethod calculate_name(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None) str ¶
Calculate the hashed name of this object, given the hyperparameters.
This classmethod can be used to calculate what an object's name would be without actually having to call __init__().
- property hyperparams(self) scalarstop.hyperparams.HyperparamsType ¶
Returns a
HyperparamsType
instance containing hyperparameters.
- property hyperparams_flat(self) Dict[str, Any] ¶
Returns a Python dictionary of “flattened” hyperparameters.
AppendDataBlob objects modify a "parent" DataBlob, nesting the parent's Hyperparams within the AppendDataBlob's own Hyperparams.
This makes it hard to look up a given hyperparams key. A value at parent_datablob.hyperparams.a is stored at child_datablob.hyperparams.parent.hyperparams.a.
This hyperparams_flat property provides all nested hyperparams keys as a flat Python dictionary. If a child AppendDataBlob has a hyperparameter key that conflicts with the parent, the child's value will overwrite the parent's value.
- class AppendDataBlob(*, parent: DataBlob, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, **kwargs)¶
Bases:
DataBlob
Subclass this to create a new DataBlob that extends an existing DataBlob.
The AppendDataBlob class is useful when you have an existing DataBlob or DataFrameDataBlob with most, but not all, of the functionality you need. If you are trying to implement multiple data pipelines that share a common compute-intensive first step, you can implement your pipelines as AppendDataBlob subclasses with the common first step as a DataBlob that you save and load to/from the filesystem.
Let's begin by creating a DataBlob that we will use as a parent for an AppendDataBlob.

>>> import tensorflow as tf
>>> import scalarstop as sp
>>>
>>> class MyDataBlob(sp.DataBlob):
...
...     @sp.dataclass
...     class Hyperparams(sp.HyperparamsType):
...         length: int
...
...     def _data(self):
...         length = self.hyperparams.length
...         x = tf.data.Dataset.from_tensor_slices(list(range(0, length)))
...         y = tf.data.Dataset.from_tensor_slices(list(range(length, length * 2)))
...         return tf.data.Dataset.zip((x, y))
...
...     def set_training(self):
...         return self._data()
...
...     def set_validation(self):
...         return self._data()
...
...     def set_test(self):
...         return self._data()
>>>
And then we create an instance of the datablob and save it to the filesystem.
>>> import os
>>> import tempfile
>>> tempdir = tempfile.TemporaryDirectory()
>>>
>>> datablob = MyDataBlob(hyperparams=dict(length=5))
>>> datablob
<sp.DataBlob MyDataBlob-dac936v7mb1ue9phjp6tc3sb>
>>>
>>> list(datablob.training.as_numpy_iterator())
[(0, 5), (1, 6), (2, 7), (3, 8), (4, 9)]
>>>
>>> datablob = datablob.save(tempdir.name)
>>>
>>> os.listdir(tempdir.name)
['MyDataBlob-dac936v7mb1ue9phjp6tc3sb']
Now, let's say that we want to create an AppendDataBlob that takes in any input DataBlob or DataFrameDataBlob and multiplies every number in every tensor by a constant.

>>> class MyAppendDataBlob(sp.AppendDataBlob):
...
...     @sp.dataclass
...     class Hyperparams(sp.AppendHyperparamsType):
...         coefficient: int
...
...     hyperparams: "MyAppendDataBlob.Hyperparams"
...
...     def __init__(self, *, parent: sp.DataBlob, hyperparams):
...         hyperparams_dict = sp.enforce_dict(hyperparams)
...         if hyperparams_dict["coefficient"] < 1:
...             raise ValueError("Coefficient is too low.")
...         super().__init__(parent=parent, hyperparams=hyperparams_dict)
...
...     def _wrap_tfdata(self, tfdata: tf.data.Dataset) -> tf.data.Dataset:
...         return tfdata.map(
...             lambda x, y: (
...                 x * self.hyperparams.coefficient,
...                 y * self.hyperparams.coefficient,
...             )
...         )
>>>
>>> append = MyAppendDataBlob(parent=datablob, hyperparams=dict(coefficient=3))
>>> list(append.training.as_numpy_iterator())
[(0, 15), (3, 18), (6, 21), (9, 24), (12, 27)]
(And now let’s clean up the temporary directory that we created earlier.)
>>> tempdir.cleanup()
- Parameters
parent – The parent DataBlob that this AppendDataBlob extends.
hyperparams – The hyperparameters to initialize this class with.
- Hyperparams :Type[scalarstop.hyperparams.AppendHyperparamsType]¶
- classmethod create_append_hyperparams(cls, *, parent: DataBlob, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None)¶
Combine the hyperparams from the parent
DataBlob
with the hyperparams meant for thisAppendDataBlob
.
- classmethod calculate_name_from_parent(cls, *, parent: DataBlob, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None)¶
Calculate the hashed name of this
AppendDataBlob
, given the hyperparameters and the parentDataBlob
.
- classmethod from_filesystem_with_parent(cls, *, parent: DataBlob, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1)¶
Load an AppendDataBlob from the filesystem, calculating the filename from the parent and the hyperparameters.
- classmethod from_filesystem_or_new_with_parent(cls, *, parent: DataBlob, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]], datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1, **kwargs)¶
Load an AppendDataBlob from the filesystem, calculating the filename from the hyperparameters. Create a new AppendDataBlob if we cannot find a saved one on the filesystem.
- set_training(self) tf.data.Dataset ¶
Create a
tf.data.Dataset
for the training set.
- property training(self) tf.data.Dataset ¶
A
tf.data.Dataset
instance representing the training set.
- set_validation(self) tf.data.Dataset ¶
Create a
tf.data.Dataset
for the validation set.
- property validation(self) tf.data.Dataset ¶
A
tf.data.Dataset
instance representing the validation set.
- set_test(self) tf.data.Dataset ¶
Create a
tf.data.Dataset
for the test set.
- property test(self) tf.data.Dataset ¶
A
tf.data.Dataset
instance representing the test set.
- classmethod from_filesystem(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1)¶
Loads a DataBlob from the filesystem, calculating the filename from the hyperparameters.
- Parameters
hyperparams – The hyperparameters of the model that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
- classmethod from_filesystem_distributed(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, cache: bool = False, repeat: Union[bool, int, None] = True, per_replica_batch_size: Optional[int] = None, tf_distribute_strategy: Optional[tf.distribute.Strategy] = None) DistributedDataBlob ¶
Loads a sharded DataBlob from the filesystem, automatically splitting the shards among the input workers of a tf.distribute.Strategy.
- Parameters
hyperparams – The hyperparameters of the model that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
cache – Whether to cache the DataBlob in memory. If repeat is also enabled, then caching will happen before repeating.
repeat – Repeats the DataBlob after loading it. Set to True to enable infinite repeating. Set to a positive integer n to repeat the DataBlob n times. Set to False to disable repeating.
per_replica_batch_size – The batch size for each individual tf.distribute replica. This is the global batch size divided by tf.distribute.Strategy.num_replicas_in_sync.
tf_distribute_strategy – The tf.distribute.Strategy subclass to use. Optionally, this method will detect if it is already inside a tf.distribute.Strategy.scope() context manager.
- classmethod metadata_from_filesystem(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str) scalarstop.datablob_metadata.DataBlobMetadata ¶
Loads this DataBlob's DataBlobMetadata from the filesystem, calculating the filename from the hyperparameters.
- Parameters
hyperparams – The hyperparameters of the model that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
- classmethod from_filesystem_or_new(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1, **kwargs)¶
Load a DataBlob from the filesystem, calculating the filename from the hyperparameters. Create a new DataBlob if we cannot find a saved one on the filesystem.
- Parameters
hyperparams – The hyperparameters of the model that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
**kwargs – Other keyword arguments that you need to pass to your __init__().
- static from_exact_path(path: str, *, shard_offset: Optional[int] = None, shard_quantity: int = 1) DataBlob ¶
Load a
DataBlob
from a directory on the filesystem.
- classmethod from_exact_path_distributed(cls, *, path: str, cache: bool = False, repeat: Union[bool, int, None] = True, per_replica_batch_size: Optional[int] = None, tf_distribute_strategy: Optional[tf.distribute.Strategy] = None) DistributedDataBlob ¶
- Parameters
path – The exact location of the saved DataBlob on the filesystem.
cache – Whether to cache the DataBlob in memory. If repeat is also enabled, then caching will happen before repeating.
repeat – Repeats the DataBlob after loading it. Set to True to enable infinite repeating. Set to a positive integer n to repeat the DataBlob n times. Set to False to disable repeating.
per_replica_batch_size – The batch size for each individual tf.distribute replica. This is the global batch size divided by tf.distribute.Strategy.num_replicas_in_sync.
tf_distribute_strategy – The tf.distribute.Strategy subclass to use. Optionally, this method will detect if it is already inside a tf.distribute.Strategy.scope() context manager.
- static metadata_from_exact_path(path: str) scalarstop.datablob_metadata.DataBlobMetadata ¶
Loads this DataBlob's DataBlobMetadata from a directory on the filesystem.
- exists_in_datablobs_directory(self, datablobs_directory: str) bool ¶
Returns True if this DataBlob was already saved within datablobs_directory.
- Parameters
datablobs_directory – The parent directory of all of your saved DataBlobs.
- Returns
Returns True if we found a DataBlob metadata file at the expected location.
- batch(self, batch_size: int, *, training: bool = True, validation: bool = True, test: bool = True, with_tf_distribute: bool = False) DataBlob ¶
Batch this DataBlob.
- Parameters
batch_size – The number of items to collect into a batch.
training – Whether to batch the training set. Defaults to True.
validation – Whether to batch the validation set. Defaults to True.
test – Whether to batch the test set. Defaults to True.
with_tf_distribute – Whether to consider tf.distribute auto-data sharding when calculating the batch size.
- cache(self, *, training: bool = True, validation: bool = True, test: bool = True, precache_training: bool = False, precache_validation: bool = False, precache_test: bool = False) DataBlob ¶
Cache this DataBlob into memory before iterating over it.
By default, this creates a DataBlob containing a TensorFlow CacheDataset for each of the training, validation, and test tf.data.Datasets.
But these datasets do not load into memory until the first time you completely iterate over one, from start to end. If you want to immediately load your training, validation, or test sets, you can set precache_training, precache_validation, and/or precache_test to True.
- Parameters
training – Lazily cache the training set in CPU memory. Defaults to True.
validation – Lazily cache the validation set in CPU memory. Defaults to True.
test – Lazily cache the test set in CPU memory. Defaults to True.
precache_training – Eagerly cache the training set into memory. Defaults to False.
precache_validation – Eagerly cache the validation set into memory. Defaults to False.
precache_test – Eagerly cache the test set into memory. Defaults to False.
- prefetch(self, buffer_size: int, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob ¶
Creates a DataBlob that prefetches elements for performance.
- Parameters
buffer_size – The maximum number of elements that will be buffered when prefetching. If the value tf.data.experimental.AUTOTUNE is used, then the buffer size is dynamically tuned.
training – Apply the prefetch operator to the training set. Defaults to True.
validation – Apply the prefetch operator to the validation set. Defaults to True.
test – Apply the prefetch operator to the test set. Defaults to True.
- repeat(self, count: Optional[int] = None, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob ¶
Repeats this DataBlob.
- Parameters
count – Represents the number of times that the elements in the tf.data.Dataset should be repeated. The default behavior (if count is None or -1) is for the dataset to be repeated indefinitely.
training – Apply the repeat operator to the training set. Defaults to True.
validation – Apply the repeat operator to the validation set. Defaults to True.
test – Apply the repeat operator to the test set. Defaults to True.
- repeat_interleaved(self, count: int, cycle_length: Optional[int] = None, block_length: Optional[int] = None, num_parallel_calls: Optional[int] = None, deterministic: Optional[bool] = None, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob ¶
Repeats this DataBlob, but in interleaved order.
- Parameters
count – Represents the number of times that the elements in the tf.data.Dataset should be repeated. This must be a finite integer greater than 0. It cannot be a negative number, None, or an infinite value.
training – Apply the repeat operator to the training set. Defaults to True.
validation – Apply the repeat operator to the validation set. Defaults to True.
test – Apply the repeat operator to the test set. Defaults to True.
- with_options(self, options: tf.data.Options, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob ¶
Apply a tf.data.Options object to this DataBlob.
- Parameters
options – The tf.data.Options object to apply.
training – Apply the options to the training set. Defaults to True.
validation – Apply the options to the validation set. Defaults to True.
test – Apply the options to the test set. Defaults to True.
- save_hook(self, *, subtype: str, path: str) None ¶
Override this method to run additional code when saving this
DataBlob
to disk.
- save(self, datablobs_directory: str, *, ignore_existing: bool = False, num_shards: int = 1, save_load_version: int = _DEFAULT_SAVE_LOAD_VERSION) DataBlob ¶
Save this DataBlob to disk.
- Parameters
datablobs_directory – The directory where you plan on storing all of your DataBlobs. This method will save this DataBlob in a subdirectory of datablobs_directory with the same name as DataBlob.name.
ignore_existing – Set this to True to ignore if there is already a DataBlob at the given path.
save_load_version – The version of the ScalarStop save/load protocol to use.
- Returns
Returns self, enabling you to place this call in a chain.
- classmethod calculate_name(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None) str ¶
Calculate the hashed name of this object, given the hyperparameters.
This classmethod can be used to calculate what an object's name would be without actually having to call __init__().
- property hyperparams(self) scalarstop.hyperparams.HyperparamsType ¶
Returns a
HyperparamsType
instance containing hyperparameters.
- property hyperparams_flat(self) Dict[str, Any] ¶
Returns a Python dictionary of “flattened” hyperparameters.
AppendDataBlob objects modify a "parent" DataBlob, nesting the parent's Hyperparams within the AppendDataBlob's own Hyperparams.
This makes it hard to look up a given hyperparams key. A value at parent_datablob.hyperparams.a is stored at child_datablob.hyperparams.parent.hyperparams.a.
This hyperparams_flat property provides all nested hyperparams keys as a flat Python dictionary. If a child AppendDataBlob has a hyperparameter key that conflicts with the parent, the child's value will overwrite the parent's value.
- class DistributedDataBlob(*, name: str, group_name: str, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, hyperparams_class: Type[scalarstop.hyperparams.HyperparamsType], cache: bool = False, repeat: Union[bool, int, None] = True, per_replica_batch_size: Optional[int] = None, tf_distribute_strategy: Optional[tf.distribute.Strategy] = None)¶
Bases:
DataBlobBase
Wraps a DataBlob to create a TensorFlow tf.distribute.DistributedDataset.
A DataBlob contains three TensorFlow tf.data.Dataset pipelines, representing a training, validation, and test set. The DistributedDataBlob wraps the creation of a DataBlob to turn each tf.data.Dataset into a tf.distribute.DistributedDataset, which is used to distribute a dataset across multiple workers according to a tf.distribute.Strategy.
If you have saved a DataBlob to the filesystem with DataBlob.save(), then you can automatically load the DataBlob from the filesystem as a DistributedDataBlob using the classmethod DataBlob.from_filesystem_distributed() or DataBlob.from_exact_path_distributed().
For more fine-grained control, you can subclass DistributedDataBlob and override DistributedDataBlob.new_sharded_datablob() with your own DataBlob creation and sharding logic. Optionally, you can also subclass DistributedDataBlob.transform_datablob() to change how DistributedDataBlob handles repeating and batching. Finally, you can also subclass DistributedDataBlob.postprocess_tfdata() to make changes to individual tf.data.Dataset instances rather than the DataBlob as a whole.
- Parameters
name – The name of the wrapped DataBlob.
group_name – The group name of the wrapped DataBlob.
hyperparams – The hyperparameters of the wrapped DataBlob.
hyperparams_class – The HyperparamsType class that hyperparams instances are created from.
cache – Whether to cache the DataBlob in memory. If repeat is also enabled, then caching will happen before repeating.
repeat – Repeats the DataBlob after loading it. Set to True to enable infinite repeating. Set to a positive integer n to repeat the DataBlob n times. Set to False to disable repeating.
per_replica_batch_size – The batch size for each individual tf.distribute replica. This is the global batch size divided by tf.distribute.Strategy.num_replicas_in_sync.
tf_distribute_strategy – The tf.distribute.Strategy subclass to use. Optionally, this method will detect if it is already inside a tf.distribute.Strategy.scope() context manager.
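A hedged sketch of the usual entry point, reusing the MyDataBlob example from earlier on this page and assuming it was saved under tempdir.name; the strategy choice is illustrative:

>>> strategy = tf.distribute.MirroredStrategy()
>>> distributed = MyDataBlob.from_filesystem_distributed(
...     hyperparams=dict(cols=3),
...     datablobs_directory=tempdir.name,
...     per_replica_batch_size=2,
...     tf_distribute_strategy=strategy,
... )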
- Hyperparams :Type[scalarstop.hyperparams.HyperparamsType]¶
- abstract new_sharded_datablob(self, ctx: tf.distribute.InputContext) DataBlob ¶
Subclass this method to return a sharded DataBlob.
- Parameters
ctx – A tf.distribute.InputContext instance. The attribute tf.distribute.InputContext.input_pipeline_id returns the ID of the current input pipeline. The attribute tf.distribute.InputContext.num_input_pipelines returns the total number of distributed input pipelines in the current tf.distribute.Strategy.
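A minimal sketch of an override, assuming saved_path is a hypothetical variable pointing at a DataBlob previously saved with multiple shards:

>>> class MyDistributedDataBlob(sp.DistributedDataBlob):
...     def new_sharded_datablob(self, ctx: tf.distribute.InputContext) -> sp.DataBlob:
...         # Load only this input pipeline's shard of the saved DataBlob.
...         return sp.DataBlob.from_exact_path(
...             saved_path,
...             shard_offset=ctx.input_pipeline_id,
...             shard_quantity=ctx.num_input_pipelines,
...         )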
- transform_datablob(self, datablob: DataBlob, ctx: tf.distribute.InputContext) DataBlob ¶
Transforms an already-initialized DataBlob to add repeating and sharding logic.
- Parameters
datablob – The already-initialized DataBlob.
ctx – A tf.distribute.InputContext instance. The attribute tf.distribute.InputContext.input_pipeline_id returns the ID of the current input pipeline. The attribute tf.distribute.InputContext.num_input_pipelines returns the total number of distributed input pipelines in the current tf.distribute.Strategy.
- Returns
Returns a DataBlob that has been modified by repeating, batching, or another transformation.
- postprocess_tfdata(self, tfdata: tf.data.Dataset, ctx: tf.distribute.InputContext) tf.data.Dataset ¶
Performs additional tf.data.Dataset transformations before turning them into tf.distribute.DistributedDataset instances.
Currently, the implementation in DistributedDataBlob does nothing, but it is available for you to subclass and change.
- Parameters
tfdata – The input tf.data.Dataset instance to transform.
ctx – A tf.distribute.InputContext instance. The attribute tf.distribute.InputContext.input_pipeline_id returns the ID of the current input pipeline. The attribute tf.distribute.InputContext.num_input_pipelines returns the total number of distributed input pipelines in the current tf.distribute.Strategy.
- Returns
Returns a transformed tf.data.Dataset.
- property tf_distribute_strategy(self) tf.distribute.Strategy ¶
Returns the currently-active
tf.distribute.Strategy
.
- set_training(self) tf.distribute.DistributedDataset ¶
Creates a new
tf.distribute.DistributedDataset
for the training set.
- property training(self) tf.distribute.DistributedDataset ¶
A
tf.distribute.DistributedDataset
instance for the training set.
- set_validation(self) tf.distribute.DistributedDataset ¶
Creates a new
tf.distribute.DistributedDataset
for the validation set.
- property validation(self) tf.distribute.DistributedDataset ¶
A
tf.distribute.DistributedDataset
instance for the validation set.
- set_test(self) tf.distribute.DistributedDataset ¶
Creates a new
tf.distribute.DistributedDataset
for the test set.
- property test(self) tf.distribute.DistributedDataset ¶
A
tf.distribute.DistributedDataset
instance for the test set.
- classmethod calculate_name(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None) str ¶
Calculate the hashed name of this object, given the hyperparameters.
This classmethod can be used to calculate what an object's name would be without actually having to call __init__().
- property hyperparams(self) scalarstop.hyperparams.HyperparamsType ¶
Returns a
HyperparamsType
instance containing hyperparameters.
- property hyperparams_flat(self) Dict[str, Any] ¶
Returns a Python dictionary of “flattened” hyperparameters.
AppendDataBlob objects modify a "parent" DataBlob, nesting the parent's Hyperparams within the AppendDataBlob's own Hyperparams.
This makes it hard to look up a given hyperparams key. A value at parent_datablob.hyperparams.a is stored at child_datablob.hyperparams.parent.hyperparams.a.
This hyperparams_flat property provides all nested hyperparams keys as a flat Python dictionary. If a child AppendDataBlob has a hyperparameter key that conflicts with the parent, the child's value will overwrite the parent's value.