scalarstop.datablob
Group together and name your training, validation, and test sets.
The classes in this module are used to group together data into training, validation, and test sets used for training machine learning models. We also record the hyperparameters used to process the dataset.
The DataBlob subclass name and hyperparameters
are used to create a unique content-addressable name
that makes it easy to keep track of many datasets at once.
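As a rough illustration of content-addressable naming, the sketch below hashes a class name together with its hyperparameters. The helper `sketch_name`, the use of JSON and SHA-256, and the 24-character truncation are all assumptions made for this example; they are not ScalarStop's actual hashing scheme.

```python
import hashlib
import json


def sketch_name(class_name: str, hyperparams: dict) -> str:
    """Hypothetical helper: derive a stable name from a class name and hyperparameters."""
    # Serialize with sorted keys so that key order does not change the hash.
    payload = json.dumps(
        {"class": class_name, "hyperparams": hyperparams}, sort_keys=True
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    # Truncate the digest to keep the name short; the length is arbitrary here.
    return f"{class_name}-{digest[:24]}"


# Same class and hyperparameters (in any key order) give the same name.
name_a = sketch_name("MyDataBlob", {"cols": 3, "rows": 10})
name_b = sketch_name("MyDataBlob", {"rows": 10, "cols": 3})
# Changing a hyperparameter gives a different name.
name_c = sketch_name("MyDataBlob", {"cols": 4, "rows": 10})
```

The point of the design is that the name can be recomputed from the class and hyperparameters alone, so a saved dataset can be located without remembering its name.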
Module Contents
Classes
DataBlobBase | The abstract base class describing the properties common to all DataBlobs.
DataBlob | Subclass this to group your training, validation, and test sets for training machine learning models.
DataFrameDataBlob | Subclass this to transform a pandas.DataFrame into your training, validation, and test sets.
Subclass this to create a new |
|
Wraps a |
- class DataBlobBase(*, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, **kwargs)¶
Bases: scalarstop._single_namespace.SingleNamespace
The abstract base class describing the properties common to all DataBlobs.
- Parameters
hyperparams – The hyperparameters to initialize this class with.
- Hyperparams :Type[scalarstop.hyperparams.HyperparamsType]¶
- set_training(self) Any¶
Creates and returns a new object representing the training set.
- property training(self) Any¶
An object representing the training set.
- set_validation(self) Any¶
Creates and returns a new object representing the validation set.
- property validation(self) Any¶
An object representing the validation set.
- set_test(self) Any¶
Creates and returns a new object representing the test set.
- property test(self) Any¶
An object representing the test set.
- classmethod calculate_name(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None) str¶
Calculate the hashed name of this object, given the hyperparameters.
This classmethod can be used to calculate what an object’s name would be without actually having to call __init__().
- property hyperparams(self) scalarstop.hyperparams.HyperparamsType¶
Returns a HyperparamsType instance containing hyperparameters.
- property hyperparams_flat(self) Dict[str, Any]¶
Returns a Python dictionary of “flattened” hyperparameters.
AppendDataBlob objects modify a “parent” DataBlob, nesting the parent’s Hyperparams within the AppendDataBlob’s own Hyperparams.
This makes it hard to look up a given hyperparams key. A value at parent_datablob.hyperparams.a is stored at child_datablob.hyperparams.parent.hyperparams.a.
This hyperparams_flat property provides all nested hyperparams keys as a flat Python dictionary. If a child AppendDataBlob has a hyperparameter key that conflicts with the parent, the child’s value will overwrite the parent’s value.
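The flattening rule described above can be sketched with plain dictionaries. The helper `flatten_hyperparams` and the `{"parent": {"hyperparams": ...}}` layout are assumptions that mirror the `hyperparams.parent.hyperparams.a` path mentioned in the text; the real property operates on `HyperparamsType` objects and may differ in detail.

```python
from typing import Any, Dict, Mapping


def flatten_hyperparams(hyperparams: Mapping[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: merge nested parent hyperparameters into one flat dict."""
    flat: Dict[str, Any] = {}
    parent = hyperparams.get("parent")
    if parent is not None:
        # Collect the parent's keys first so the child can overwrite them below.
        flat.update(flatten_hyperparams(parent["hyperparams"]))
    for key, value in hyperparams.items():
        if key != "parent":
            flat[key] = value
    return flat


parent_hyperparams = {"a": 1, "b": 2}
child_hyperparams = {
    "b": 20,  # conflicts with the parent's "b"
    "c": 3,
    "parent": {"name": "Parent-abc", "hyperparams": parent_hyperparams},
}
flat = flatten_hyperparams(child_hyperparams)
```

Here `flat` holds the parent's `a`, the child's `c`, and the child's value for the conflicting key `b`.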
- class DataBlob(*, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, **kwargs)¶
Bases: DataBlobBase
Subclass this to group your training, validation, and test sets for training machine learning models.
Here is how to use DataBlob to group your training, validation, and test sets:
1. Subclass DataBlob with a class name that describes your dataset in general. In this example, we’ll use MyDataBlob as the class name.
2. Define a dataclass using the @sp.dataclass decorator at MyDataBlob.Hyperparams. We’ll define an instance of this dataclass at MyDataBlob.hyperparams. This describes the hyperparameters involved in processing your dataset.
3. Override the methods DataBlob.set_training(), DataBlob.set_validation(), and DataBlob.set_test() to generate tf.data.Dataset pipelines representing your training, validation, and test sets.
Those three steps roughly look like:
>>> import tensorflow as tf
>>> import scalarstop as sp
>>>
>>> class MyDataBlob(sp.DataBlob):
...
...     @sp.dataclass
...     class Hyperparams(sp.HyperparamsType):
...         cols: int
...
...     def _data(self):
...         x = tf.random.uniform(shape=(10, self.hyperparams.cols))
...         y = tf.round(tf.random.uniform(shape=(10,1)))
...         return tf.data.Dataset.zip((
...             tf.data.Dataset.from_tensor_slices(x),
...             tf.data.Dataset.from_tensor_slices(y),
...         ))
...
...     def set_training(self):
...         return self._data()
...
...     def set_validation(self):
...         return self._data()
...
...     def set_test(self):
...         return self._data()
>>>
In our above example, our training, validation, and test sets are created with the exact same code. In practice, you’ll be creating them with different inputs.
Now we create an instance of our subclass so we can start using it.
>>> datablob = MyDataBlob(hyperparams=dict(cols=3))
>>> datablob
<sp.DataBlob MyDataBlob-bn5hpc7ueo2uz7as1747tetn>
DataBlob instances are given a unique name by hashing together the class name with the instance’s hyperparameters.
>>> datablob.name
'MyDataBlob-bn5hpc7ueo2uz7as1747tetn'
>>>
>>> datablob.group_name
'MyDataBlob'
>>>
>>> datablob.hyperparams
MyDataBlob.Hyperparams(cols=3)
>>>
>>> sp.enforce_dict(datablob.hyperparams)
{'cols': 3}
We save exactly one instance of each tf.data.Dataset pipeline in the properties DataBlob.training, DataBlob.validation, and DataBlob.test.
>>> datablob.training
<ZipDataset element_spec=(TensorSpec(shape=(3,), dtype=tf.float32, name=None), TensorSpec(shape=(1,), dtype=tf.float32, name=None))>
>>>
>>> datablob.validation
<ZipDataset element_spec=(TensorSpec(shape=(3,), dtype=tf.float32, name=None), TensorSpec(shape=(1,), dtype=tf.float32, name=None))>
>>>
>>> datablob.test
<ZipDataset element_spec=(TensorSpec(shape=(3,), dtype=tf.float32, name=None), TensorSpec(shape=(1,), dtype=tf.float32, name=None))>
DataBlob objects have some methods for applying tf.data transformations to the training, validation, and test sets at the same time:
- Batching. DataBlob.batch() will batch the training, validation, and test sets at the same time. If you call DataBlob.batch() with the keyword argument with_tf_distribute=True, your input batch size will be multiplied by the number of replicas in your tf.distribute strategy.
- Caching. DataBlob.cache() will cache the training, validation, and test sets in memory once you iterate over them. This is useful if your tf.data.Dataset pipelines are doing something computationally expensive each time you iterate over them.
- Saving/loading to/from the filesystem. DataBlob.save() saves the training, validation, and test sets to a path on the filesystem. This can be loaded back with the classmethod DataBlob.from_exact_path().
>>> import os
>>> import tempfile
>>> tempdir = tempfile.TemporaryDirectory()
>>>
>>> datablob = datablob.save(tempdir.name)
>>>
>>> os.listdir(tempdir.name)
['MyDataBlob-bn5hpc7ueo2uz7as1747tetn']
>>> path = os.path.join(tempdir.name, datablob.name)
>>> loaded_datablob = MyDataBlob.from_exact_path(path)
>>> loaded_datablob
<sp.DataBlob MyDataBlob-bn5hpc7ueo2uz7as1747tetn>
Alternatively, if you have the hyperparameters of the DataBlob but not the name, you can use the classmethod DataBlob.from_filesystem().
>>> loaded_datablob_2 = MyDataBlob.from_filesystem(
...     hyperparams=dict(cols=3),
...     datablobs_directory=tempdir.name,
... )
>>> loaded_datablob_2
<sp.DataBlob MyDataBlob-bn5hpc7ueo2uz7as1747tetn>
(and now let’s clean up the temporary directory from above)
>>> tempdir.cleanup()
- Parameters
hyperparams – The hyperparameters to initialize this class with.
- Hyperparams :Type[scalarstop.hyperparams.HyperparamsType]¶
- classmethod from_filesystem(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1)¶
Loads a DataBlob from the filesystem, calculating the filename from the hyperparameters.
- Parameters
hyperparams – The hyperparameters of the DataBlob that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
- classmethod from_filesystem_distributed(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, cache: bool = False, repeat: Union[bool, int, None] = True, per_replica_batch_size: Optional[int] = None, tf_distribute_strategy: Optional[tf.distribute.Strategy] = None) DistributedDataBlob¶
Loads a sharded DataBlob from the filesystem, automatically splitting the shards among the input workers of a tf.distribute.Strategy.
- Parameters
hyperparams – The hyperparameters of the DataBlob that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
cache – Whether to cache the DataBlob in memory. If repeat is also enabled, then caching will happen before repeating.
repeat – Repeats the DataBlob after loading it. Set to True to enable infinite repeating. Set to a positive integer n to repeat the DataBlob n times. Set to False to disable repeating.
per_replica_batch_size – The batch size for each individual tf.distribute replica. This is the global batch size divided by tf.distribute.Strategy.num_replicas_in_sync.
tf_distribute_strategy – The tf.distribute.Strategy subclass to use. Optionally, this method will detect if it is already inside a tf.distribute.Strategy.scope() context manager.
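The relationship between the per-replica and global batch sizes is simple arithmetic, sketched below in plain Python. The two helper functions are hypothetical names for this example, and raising an error on an uneven division is an assumption of the sketch rather than documented ScalarStop behavior.

```python
def to_global_batch_size(per_replica_batch_size: int, num_replicas_in_sync: int) -> int:
    """Hypothetical helper: each replica gets its own batch, so the totals multiply."""
    return per_replica_batch_size * num_replicas_in_sync


def to_per_replica_batch_size(global_batch_size: int, num_replicas_in_sync: int) -> int:
    """Hypothetical helper: split a global batch evenly across replicas."""
    if global_batch_size % num_replicas_in_sync != 0:
        # Assumption for this sketch: refuse batch sizes that do not divide evenly.
        raise ValueError("global batch size must be divisible by the replica count")
    return global_batch_size // num_replicas_in_sync


# With 4 replicas, a per-replica batch of 32 corresponds to a global batch of 128.
global_size = to_global_batch_size(32, 4)
per_replica_size = to_per_replica_batch_size(global_size, 4)
```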
- classmethod metadata_from_filesystem(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str) scalarstop.datablob_metadata.DataBlobMetadata¶
Loads this DataBlob’s DataBlobMetadata from the filesystem, calculating the filename from the hyperparameters.
- Parameters
hyperparams – The hyperparameters of the DataBlob that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
- classmethod from_filesystem_or_new(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1, **kwargs)¶
Loads a DataBlob from the filesystem, calculating the filename from the hyperparameters. Creates a new DataBlob if we cannot find a saved one on the filesystem.
- Parameters
hyperparams – The hyperparameters of the DataBlob that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
**kwargs – Other keyword arguments that you need to pass to your __init__().
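The load-or-create pattern can be sketched without TensorFlow. Everything below is hypothetical: `load_or_new`, `save_record`, and the `data.json` filename stand in for whatever ScalarStop actually writes to disk, and the sketch only illustrates the control flow of looking for a saved copy before building a new one.

```python
import json
import os
import tempfile
from typing import Callable, List, Tuple


def save_record(datablobs_directory: str, name: str, data: List[int]) -> None:
    """Hypothetical helper: store data in a subdirectory named after the object."""
    directory = os.path.join(datablobs_directory, name)
    os.makedirs(directory, exist_ok=True)
    with open(os.path.join(directory, "data.json"), "w", encoding="utf-8") as handle:
        json.dump(data, handle)


def load_or_new(
    datablobs_directory: str, name: str, build: Callable[[], List[int]]
) -> Tuple[List[int], str]:
    """Hypothetical helper: return saved data if present, otherwise build it fresh."""
    path = os.path.join(datablobs_directory, name, "data.json")
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            return json.load(handle), "loaded"
    # Nothing saved under this name, so fall back to constructing a new object.
    return build(), "new"


with tempfile.TemporaryDirectory() as directory:
    first, first_source = load_or_new(directory, "MyDataBlob-example", lambda: [1, 2, 3])
    save_record(directory, "MyDataBlob-example", first)
    second, second_source = load_or_new(directory, "MyDataBlob-example", lambda: [9, 9, 9])
```

The second call ignores its `build` callable because a saved copy already exists under the same name.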
- static from_exact_path(path: str, *, shard_offset: Optional[int] = None, shard_quantity: int = 1) DataBlob¶
Load a DataBlob from a directory on the filesystem.
- classmethod from_exact_path_distributed(cls, *, path: str, cache: bool = False, repeat: Union[bool, int, None] = True, per_replica_batch_size: Optional[int] = None, tf_distribute_strategy: Optional[tensorflow.distribute.get_strategy] = None) DistributedDataBlob¶
- Parameters
path – The exact location of the saved DataBlob on the filesystem.
cache – Whether to cache the DataBlob in memory. If repeat is also enabled, then caching will happen before repeating.
repeat – Repeats the DataBlob after loading it. Set to True to enable infinite repeating. Set to a positive integer n to repeat the DataBlob n times. Set to False to disable repeating.
per_replica_batch_size – The batch size for each individual tf.distribute replica. This is the global batch size divided by tf.distribute.Strategy.num_replicas_in_sync.
tf_distribute_strategy – The tf.distribute.Strategy subclass to use. Optionally, this method will detect if it is already inside a tf.distribute.Strategy.scope() context manager.
- static metadata_from_exact_path(path: str) scalarstop.datablob_metadata.DataBlobMetadata¶
Loads this DataBlob’s DataBlobMetadata from a directory on the filesystem.
- exists_in_datablobs_directory(self, datablobs_directory: str) bool¶
Returns True if this DataBlob was already saved within datablobs_directory.
- Parameters
datablobs_directory – The parent directory of all of your saved DataBlobs.
- Returns
Returns True if we found a DataBlob metadata file at the expected location.
- set_training(self) tf.data.Dataset¶
Create a tf.data.Dataset for the training set.
- property training(self) tf.data.Dataset¶
A tf.data.Dataset instance representing the training set.
- set_validation(self) tf.data.Dataset¶
Create a tf.data.Dataset for the validation set.
- property validation(self) tf.data.Dataset¶
A tf.data.Dataset instance representing the validation set.
- set_test(self) tf.data.Dataset¶
Create a tf.data.Dataset for the test set.
- property test(self) tf.data.Dataset¶
A tf.data.Dataset instance representing the test set.
- batch(self, batch_size: int, *, training: bool = True, validation: bool = True, test: bool = True, with_tf_distribute: bool = False) DataBlob¶
Batch this DataBlob.
- Parameters
batch_size – The number of items to collect into a batch.
training – Whether to batch the training set. Defaults to True.
validation – Whether to batch the validation set. Defaults to True.
test – Whether to batch the test set. Defaults to True.
with_tf_distribute – Whether to consider tf.distribute auto-data sharding when calculating the batch size.
- cache(self, *, training: bool = True, validation: bool = True, test: bool = True, precache_training: bool = False, precache_validation: bool = False, precache_test: bool = False) DataBlob¶
Cache this DataBlob into memory before iterating over it.
By default, this creates a DataBlob containing a TensorFlow CacheDataset for each of the training, validation, and test tf.data.Datasets.
But these datasets do not load into memory until the first time you completely iterate over one, from start to end. If you want to immediately load your training, validation, or test sets, you can set precache_training, precache_validation, and/or precache_test to True.
- Parameters
training – Lazily cache the training set in CPU memory. Defaults to True.
validation – Lazily cache the validation set in CPU memory. Defaults to True.
test – Lazily cache the test set in CPU memory. Defaults to True.
precache_training – Eagerly cache the training set into memory. Defaults to False.
precache_validation – Eagerly cache the validation set into memory. Defaults to False.
precache_test – Eagerly cache the test set into memory. Defaults to False.
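The difference between lazy and eager caching can be illustrated in plain Python. `CachedSource` below is a hypothetical stand-in written for this sketch, not a TensorFlow or ScalarStop class: in lazy mode the cache is filled only after one complete pass, while `precache=True` pays the cost up front.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class CachedSource:
    """Hypothetical stand-in for a cached dataset; not a TensorFlow class."""

    def __init__(self, make_iterable: Callable[[], Iterable[int]], precache: bool = False):
        self._make_iterable = make_iterable
        self._cache: Optional[List[int]] = None
        if precache:
            # Eager mode: read the whole source immediately.
            self._cache = list(make_iterable())

    def __iter__(self) -> Iterator[int]:
        if self._cache is not None:
            yield from self._cache
            return
        seen: List[int] = []
        for item in self._make_iterable():
            seen.append(item)
            yield item
        # Only a complete pass, from start to end, populates the cache.
        self._cache = seen


calls = {"count": 0}


def expensive() -> Iterable[int]:
    # Count how many times the underlying source is actually read.
    calls["count"] += 1
    return range(3)


lazy = CachedSource(expensive)
calls_before_iteration = calls["count"]
first_pass = list(lazy)
second_pass = list(lazy)
lazy_calls = calls["count"]

eager = CachedSource(expensive, precache=True)
eager_calls = calls["count"] - lazy_calls
```

In this sketch the lazy source is read once across both passes, and the eager source is read once before any iteration happens.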
- prefetch(self, buffer_size: int, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob¶
Creates a DataBlob that prefetches elements for performance.
- Parameters
buffer_size – The maximum number of elements that will be buffered when prefetching. If the value tf.data.experimental.AUTOTUNE is used, then the buffer is dynamically tuned.
training – Apply the prefetch operator to the training set. Defaults to True.
validation – Apply the prefetch operator to the validation set. Defaults to True.
test – Apply the prefetch operator to the test set. Defaults to True.
- repeat(self, count: Optional[int] = None, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob¶
Repeats this DataBlob.
- Parameters
count – Represents the number of times that the elements in the tf.data.Dataset should be repeated. The default behavior (if count is None or -1) is for the dataset to be repeated indefinitely.
training – Apply the repeat operator to the training set. Defaults to True.
validation – Apply the repeat operator to the validation set. Defaults to True.
test – Apply the repeat operator to the test set. Defaults to True.
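The `count` semantics described above can be mimicked with `itertools` on an ordinary list. The `repeat` function here is a hypothetical pure-Python analogue written for illustration; it is not how ScalarStop or `tf.data` implements repetition.

```python
import itertools
from typing import Iterable, Iterator, Optional


def repeat(elements: Iterable[int], count: Optional[int] = None) -> Iterator[int]:
    """Hypothetical analogue of a dataset repeat operator on a plain iterable."""
    items = list(elements)
    if count is None or count == -1:
        # No count (or -1) means the elements repeat indefinitely.
        return itertools.cycle(items)
    # A finite count yields the full sequence that many times in a row.
    return itertools.chain.from_iterable(itertools.repeat(items, count))


finite = list(repeat([1, 2, 3], count=2))
# An infinite iterator must be truncated before it is materialized.
infinite_head = list(itertools.islice(repeat([1, 2, 3]), 7))
```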
- repeat_interleaved(self, count: int, cycle_length: Optional[int] = None, block_length: Optional[int] = None, num_parallel_calls: Optional[int] = None, deterministic: Optional[bool] = None, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob¶
Repeats this DataBlob, but in interleaved order.
- Parameters
count – Represents the number of times that the elements in the tf.data.Dataset should be repeated. This must be a finite integer greater than 0. It cannot be a negative number, None, or an infinite value.
training – Apply the repeat operator to the training set. Defaults to True.
validation – Apply the repeat operator to the validation set. Defaults to True.
test – Apply the repeat operator to the test set. Defaults to True.
- with_options(self, options: tf.data.Options, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob¶
Apply a tf.data.Options object to this DataBlob.
- Parameters
options – The tf.data.Options object to apply.
training – Apply the options to the training set. Defaults to True.
validation – Apply the options to the validation set. Defaults to True.
test – Apply the options to the test set. Defaults to True.
- save_hook(self, *, subtype: str, path: str) None¶
Override this method to run additional code when saving this DataBlob to disk.
- save(self, datablobs_directory: str, *, ignore_existing: bool = False, num_shards: int = 1, save_load_version: int = _DEFAULT_SAVE_LOAD_VERSION) DataBlob¶
Save this DataBlob to disk.
- Parameters
datablobs_directory – The directory where you plan on storing all of your DataBlobs. This method will save this DataBlob in a subdirectory of datablobs_directory with the same name as DataBlob.name.
ignore_existing – Set this to True to ignore if there is already a DataBlob at the given path.
save_load_version – The version of the ScalarStop save/load protocol to use.
- Returns
Returns self, enabling you to place this call in a chain.
- classmethod calculate_name(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None) str¶
Calculate the hashed name of this object, given the hyperparameters.
This classmethod can be used to calculate what an object’s name would be without actually having to call __init__().
- property hyperparams(self) scalarstop.hyperparams.HyperparamsType¶
Returns a HyperparamsType instance containing hyperparameters.
- property hyperparams_flat(self) Dict[str, Any]¶
Returns a Python dictionary of “flattened” hyperparameters.
AppendDataBlob objects modify a “parent” DataBlob, nesting the parent’s Hyperparams within the AppendDataBlob’s own Hyperparams.
This makes it hard to look up a given hyperparams key. A value at parent_datablob.hyperparams.a is stored at child_datablob.hyperparams.parent.hyperparams.a.
This hyperparams_flat property provides all nested hyperparams keys as a flat Python dictionary. If a child AppendDataBlob has a hyperparameter key that conflicts with the parent, the child’s value will overwrite the parent’s value.
- class DataFrameDataBlob(*, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, **kwargs)¶
Bases: DataBlob
Subclass this to transform a pandas.DataFrame into your training, validation, and test sets.
DataBlob is useful when you want to manually define your tf.data pipelines and their input tensors.
However, if your input tensors are in a fixed-size list or DataFrame that you want to slice into a training, validation, and test set, then you might find DataFrameDataBlob handy.
Here is how to use it:
1. Subclass DataFrameDataBlob with a class name that describes your dataset.
2. Override DataFrameDataBlob.set_dataframe() and have it return a single DataFrame that contains all of the inputs for your training, validation, and test sets. The DataFrame should have one column representing training samples and another column representing training labels.
3. Override DataFrameDataBlob.transform() and define a method that transforms an arbitrary DataFrame of inputs into a tf.data.Dataset pipeline that represents the actual dataset needed for training and evaluation.
We define what fraction of the DataFrame to split with the class attributes DataFrameDataBlob.training_fraction and DataFrameDataBlob.validation_fraction. By default, 60 percent of the DataFrame is marked for the training set, 20 percent for the validation set, and the remainder of the DataFrame for the test set.
Roughly, this looks like:
>>> import pandas as pd
>>> import tensorflow as tf
>>> import scalarstop as sp
>>>
>>> class MyDataFrameDataBlob(sp.DataFrameDataBlob):
...     samples_column: str = "samples"
...     labels_column: str = "labels"
...     training_fraction: float = 0.6
...     validation_fraction: float = 0.2
...
...     @sp.dataclass
...     class Hyperparams(sp.HyperparamsType):
...         length: int = 0
...
...     def set_dataframe(self):
...         samples = list(range(self.hyperparams.length))
...         labels = list(range(self.hyperparams.length))
...         return pd.DataFrame({self.samples_column: samples, self.labels_column: labels})
...
...     def transform(self, dataframe: pd.DataFrame):
...         return tf.data.Dataset.zip((
...             tf.data.Dataset.from_tensor_slices(dataframe[self.samples_column]),
...             tf.data.Dataset.from_tensor_slices(dataframe[self.labels_column]),
...         ))
>>> datablob2 = MyDataFrameDataBlob(hyperparams=dict(length=10))
And you can use the resulting object in all of the same ways as we’ve demonstrated with DataBlob subclass instances above.
- Parameters
hyperparams – The hyperparameters to initialize this class with.
- samples_column :str = samples¶
- labels_column :str = labels¶
- training_fraction :float = 0.6¶
- validation_fraction :float = 0.2¶
- Hyperparams :Type[scalarstop.hyperparams.HyperparamsType]¶
- static from_exact_path(path: str, *, shard_offset: Optional[int] = None, shard_quantity: int = 1) Union[DataBlob, DataFrameDataBlob]¶
Load a DataFrameDataBlob from a directory on the filesystem.
- set_dataframe(self) pandas.DataFrame¶
Create a new pandas.DataFrame that contains all of the data for the training, validation, and test sets.
- property dataframe(self) pandas.DataFrame¶
A pandas.DataFrame that represents the entire training, validation, and test set.
- set_training_dataframe(self) pandas.DataFrame¶
Sets the pandas.DataFrame for the training set.
By default, this method slices the pandas.DataFrame you have supplied to set_dataframe().
Alternatively, you can choose to directly override set_training_dataframe(), set_validation_dataframe(), and set_test_dataframe().
- Returns
Returns a pandas.DataFrame.
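The default slicing by `training_fraction` and `validation_fraction` can be sketched on a plain list of row indices. The helper `split_by_fraction` is hypothetical, and both the use of `round()` for the boundaries and the choice of contiguous, unshuffled slices are assumptions of this sketch; the library's exact rounding and ordering may differ.

```python
from typing import List, Sequence, Tuple


def split_by_fraction(
    rows: Sequence[int],
    training_fraction: float = 0.6,
    validation_fraction: float = 0.2,
) -> Tuple[List[int], List[int], List[int]]:
    """Hypothetical helper: cut rows into contiguous training/validation/test slices."""
    # Assumption: boundaries are rounded to the nearest whole row.
    training_end = round(len(rows) * training_fraction)
    validation_end = training_end + round(len(rows) * validation_fraction)
    training = list(rows[:training_end])
    validation = list(rows[training_end:validation_end])
    # Whatever remains after the first two slices becomes the test set.
    test = list(rows[validation_end:])
    return training, validation, test


training_rows, validation_rows, test_rows = split_by_fraction(list(range(10)))
```

With ten rows and the default fractions, this yields six training rows, two validation rows, and two test rows.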
- property training_dataframe(self) pandas.DataFrame¶
A pandas.DataFrame representing training set input tensors.
- set_validation_dataframe(self) pandas.DataFrame¶
Sets the pandas.DataFrame for the validation set.
By default, this method slices the pandas.DataFrame you have supplied to set_dataframe().
Alternatively, you can choose to directly override set_training_dataframe(), set_validation_dataframe(), and set_test_dataframe().
- Returns
Returns a pandas.DataFrame.
- property validation_dataframe(self) pandas.DataFrame¶
A pandas.DataFrame representing validation set input tensors.
- set_test_dataframe(self) pandas.DataFrame¶
Sets the pandas.DataFrame for the test set.
By default, this method slices the pandas.DataFrame you have supplied to set_dataframe().
Alternatively, you can choose to directly override set_training_dataframe(), set_validation_dataframe(), and set_test_dataframe().
- Returns
Returns a pandas.DataFrame.
- property test_dataframe(self) pandas.DataFrame¶
A pandas.DataFrame representing test set input tensors.
- transform(self, dataframe: pandas.DataFrame) tf.data.Dataset¶
Transforms any input tensors into an output tf.data.Dataset.
- set_training(self) tf.data.Dataset¶
Create a tf.data.Dataset for the training set.
- set_validation(self) tf.data.Dataset¶
Create a tf.data.Dataset for the validation set.
- set_test(self) tf.data.Dataset¶
Create a tf.data.Dataset for the test set.
- save_hook(self, *, subtype: str, path: str) None¶
Override this method to run additional code when saving this DataBlob to disk.
- classmethod from_filesystem(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1)¶
Loads a DataBlob from the filesystem, calculating the filename from the hyperparameters.
- Parameters
hyperparams – The hyperparameters of the DataBlob that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
- classmethod from_filesystem_distributed(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, cache: bool = False, repeat: Union[bool, int, None] = True, per_replica_batch_size: Optional[int] = None, tf_distribute_strategy: Optional[tf.distribute.Strategy] = None) DistributedDataBlob¶
Loads a sharded DataBlob from the filesystem, automatically splitting the shards among the input workers of a tf.distribute.Strategy.
- Parameters
hyperparams – The hyperparameters of the DataBlob that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
cache – Whether to cache the DataBlob in memory. If repeat is also enabled, then caching will happen before repeating.
repeat – Repeats the DataBlob after loading it. Set to True to enable infinite repeating. Set to a positive integer n to repeat the DataBlob n times. Set to False to disable repeating.
per_replica_batch_size – The batch size for each individual tf.distribute replica. This is the global batch size divided by tf.distribute.Strategy.num_replicas_in_sync.
tf_distribute_strategy – The tf.distribute.Strategy subclass to use. Optionally, this method will detect if it is already inside a tf.distribute.Strategy.scope() context manager.
- classmethod metadata_from_filesystem(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str) scalarstop.datablob_metadata.DataBlobMetadata¶
Loads this DataBlob’s DataBlobMetadata from the filesystem, calculating the filename from the hyperparameters.
- Parameters
hyperparams – The hyperparameters of the DataBlob that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
- classmethod from_filesystem_or_new(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1, **kwargs)¶
Loads a DataBlob from the filesystem, calculating the filename from the hyperparameters. Creates a new DataBlob if we cannot find a saved one on the filesystem.
- Parameters
hyperparams – The hyperparameters of the DataBlob that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
**kwargs – Other keyword arguments that you need to pass to your __init__().
- classmethod from_exact_path_distributed(cls, *, path: str, cache: bool = False, repeat: Union[bool, int, None] = True, per_replica_batch_size: Optional[int] = None, tf_distribute_strategy: Optional[tensorflow.distribute.get_strategy] = None) DistributedDataBlob¶
- Parameters
path – The exact location of the saved DataBlob on the filesystem.
cache – Whether to cache the DataBlob in memory. If repeat is also enabled, then caching will happen before repeating.
repeat – Repeats the DataBlob after loading it. Set to True to enable infinite repeating. Set to a positive integer n to repeat the DataBlob n times. Set to False to disable repeating.
per_replica_batch_size – The batch size for each individual tf.distribute replica. This is the global batch size divided by tf.distribute.Strategy.num_replicas_in_sync.
tf_distribute_strategy – The tf.distribute.Strategy subclass to use. Optionally, this method will detect if it is already inside a tf.distribute.Strategy.scope() context manager.
- static metadata_from_exact_path(path: str) scalarstop.datablob_metadata.DataBlobMetadata¶
Loads this DataBlob’s DataBlobMetadata from a directory on the filesystem.
- exists_in_datablobs_directory(self, datablobs_directory: str) bool¶
Returns True if this DataBlob was already saved within datablobs_directory.
- Parameters
datablobs_directory – The parent directory of all of your saved DataBlobs.
- Returns
Returns True if we found a DataBlob metadata file at the expected location.
- property training(self) tf.data.Dataset¶
A tf.data.Dataset instance representing the training set.
- property validation(self) tf.data.Dataset¶
A tf.data.Dataset instance representing the validation set.
- property test(self) tf.data.Dataset¶
A tf.data.Dataset instance representing the test set.
- batch(self, batch_size: int, *, training: bool = True, validation: bool = True, test: bool = True, with_tf_distribute: bool = False) DataBlob¶
Batch this DataBlob.
- Parameters
batch_size – The number of items to collect into a batch.
training – Whether to batch the training set. Defaults to True.
validation – Whether to batch the validation set. Defaults to True.
test – Whether to batch the test set. Defaults to True.
with_tf_distribute – Whether to consider tf.distribute auto-data sharding when calculating the batch size.
- cache(self, *, training: bool = True, validation: bool = True, test: bool = True, precache_training: bool = False, precache_validation: bool = False, precache_test: bool = False) DataBlob¶
Cache this DataBlob into memory before iterating over it.
By default, this creates a DataBlob containing a TensorFlow CacheDataset for each of the training, validation, and test tf.data.Datasets.
But these datasets do not load into memory until the first time you completely iterate over one, from start to end. If you want to immediately load your training, validation, or test sets, you can set precache_training, precache_validation, and/or precache_test to True.
- Parameters
training – Lazily cache the training set in CPU memory. Defaults to True.
validation – Lazily cache the validation set in CPU memory. Defaults to True.
test – Lazily cache the test set in CPU memory. Defaults to True.
precache_training – Eagerly cache the training set into memory. Defaults to False.
precache_validation – Eagerly cache the validation set into memory. Defaults to False.
precache_test – Eagerly cache the test set into memory. Defaults to False.
- prefetch(self, buffer_size: int, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob¶
Creates a DataBlob that prefetches elements for performance.
- Parameters
buffer_size – The maximum number of elements that will be buffered when prefetching. If the value tf.data.experimental.AUTOTUNE is used, then the buffer is dynamically tuned.
training – Apply the prefetch operator to the training set. Defaults to True.
validation – Apply the prefetch operator to the validation set. Defaults to True.
test – Apply the prefetch operator to the test set. Defaults to True.
- repeat(self, count: Optional[int] = None, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob¶
Repeats this DataBlob.
- Parameters
count – Represents the number of times that the elements in the tf.data.Dataset should be repeated. The default behavior (if count is None or -1) is for the dataset to be repeated indefinitely.
training – Apply the repeat operator to the training set. Defaults to True.
validation – Apply the repeat operator to the validation set. Defaults to True.
test – Apply the repeat operator to the test set. Defaults to True.
- repeat_interleaved(self, count: int, cycle_length: Optional[int] = None, block_length: Optional[int] = None, num_parallel_calls: Optional[int] = None, deterministic: Optional[bool] = None, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob¶
Repeats this DataBlob, but in interleaved order.
- Parameters
count – Represents the number of times that the elements in the tf.data.Dataset should be repeated. This must be a finite integer greater than 0. It cannot be a negative number, None, or an infinite value.
training – Apply the repeat operator to the training set. Defaults to True.
validation – Apply the repeat operator to the validation set. Defaults to True.
test – Apply the repeat operator to the test set. Defaults to True.
- with_options(self, options: tf.data.Options, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob¶
Apply a tf.data.Options object to this DataBlob.
- Parameters
options – The tf.data.Options object to apply.
training – Apply the options to the training set. Defaults to True.
validation – Apply the options to the validation set. Defaults to True.
test – Apply the options to the test set. Defaults to True.
- save(self, datablobs_directory: str, *, ignore_existing: bool = False, num_shards: int = 1, save_load_version: int = _DEFAULT_SAVE_LOAD_VERSION) DataBlob¶
Save this DataBlob to disk.
- Parameters
datablobs_directory – The directory where you plan on storing all of your DataBlobs. This method will save this DataBlob in a subdirectory of datablobs_directory with the same name as DataBlob.name.
ignore_existing – Set this to True to ignore the case where there is already a DataBlob at the given path.
save_load_version – The ScalarStop version for the ScalarStop protocol.
- Returns
Return self, enabling you to place this call in a chain.
- classmethod calculate_name(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None) str¶
Calculate the hashed name of this object, given the hyperparameters.
This classmethod can be used to calculate what an object’s name would be without actually having to call __init__().
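The idea of a content-addressable name can be sketched without constructing any object. The hashing scheme below (SHA-256 over JSON with sorted keys, truncated to 16 hex characters) is purely an assumption for illustration. ScalarStop's actual algorithm and name format are not described here, and calculate_name_sketch is a hypothetical helper.

```python
import hashlib
import json
from typing import Any, Mapping, Optional


def calculate_name_sketch(
    class_name: str, hyperparams: Optional[Mapping[str, Any]] = None
) -> str:
    """Derive a deterministic name from a class name and its hyperparameters."""
    # Serialize with sorted keys so that key order does not change the name.
    payload = json.dumps(dict(hyperparams or {}), sort_keys=True)
    # Assumed scheme: a truncated SHA-256 digest of the serialized hyperparams.
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
    return f"{class_name}-{digest}"


name_a = calculate_name_sketch("MyDataBlob", {"length": 5, "seed": 1})
name_b = calculate_name_sketch("MyDataBlob", {"seed": 1, "length": 5})
name_c = calculate_name_sketch("MyDataBlob", {"length": 6, "seed": 1})
```

Identical hyperparameters give identical names regardless of key order, while any changed value gives a different name. That property is what lets a name double as a lookup key for saved datasets.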
- property hyperparams(self) scalarstop.hyperparams.HyperparamsType¶
Returns a HyperparamsType instance containing hyperparameters.
- property hyperparams_flat(self) Dict[str, Any]¶
Returns a Python dictionary of “flattened” hyperparameters.
AppendDataBlob objects modify a “parent” DataBlob, nesting the parent’s Hyperparams within the AppendDataBlob’s own Hyperparams.
This makes it hard to look up a given hyperparams key. A value at parent_datablob.hyperparams.a is stored at child_datablob.hyperparams.parent.hyperparams.a.
This hyperparams_flat property provides all nested hyperparams keys as a flat Python dictionary. If a child AppendDataBlob has a hyperparameter key that conflicts with the parent, the child’s value will overwrite the parent’s value.
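The flattening rule can be sketched with plain dictionaries. The nested layout used here (a "parent" key holding a mapping with its own "hyperparams" key) is an assumption modeled on the child_datablob.hyperparams.parent.hyperparams.a path mentioned above, and flatten_hyperparams is a hypothetical helper, not the library's implementation.

```python
from typing import Any, Dict, Mapping


def flatten_hyperparams(hyperparams: Mapping[str, Any]) -> Dict[str, Any]:
    """Collapse nested parent hyperparams into one flat dictionary."""
    flat: Dict[str, Any] = {}
    parent = hyperparams.get("parent")
    if parent is not None:
        # Start from the parent's (recursively flattened) hyperparams.
        flat.update(flatten_hyperparams(parent["hyperparams"]))
    # Child keys are applied last, so they overwrite conflicting parent keys.
    flat.update({key: value for key, value in hyperparams.items() if key != "parent"})
    return flat


child = {
    "coefficient": 3,
    "length": 99,  # Conflicts with the parent's "length".
    "parent": {"hyperparams": {"length": 5, "seed": 1}},
}
flat = flatten_hyperparams(child)
```

In this sketch the conflicting key length resolves to the child's value, which matches the overwrite rule described above.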
- class AppendDataBlob(*, parent: DataBlob, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, **kwargs)¶
Bases: DataBlob
Subclass this to create a new DataBlob that extends an existing DataBlob.
The AppendDataBlob class is useful when you have an existing DataBlob or DataFrameDataBlob with most, but not all, of the functionality you need. If you are trying to implement multiple data pipelines that share a common compute-intensive first step, you can implement your pipelines as AppendDataBlob subclasses with the common first step as a DataBlob that you save and load to/from the filesystem.
Let’s begin by creating a DataBlob that we will use as a parent for an AppendDataBlob.
>>> import tensorflow as tf
>>> import scalarstop as sp
>>>
>>> class MyDataBlob(sp.DataBlob):
...
...     @sp.dataclass
...     class Hyperparams(sp.HyperparamsType):
...         length: int
...
...     def _data(self):
...         length = self.hyperparams.length
...         x = tf.data.Dataset.from_tensor_slices(list(range(0, length)))
...         y = tf.data.Dataset.from_tensor_slices(list(range(length, length * 2)))
...         return tf.data.Dataset.zip((x, y))
...
...     def set_training(self):
...         return self._data()
...
...     def set_validation(self):
...         return self._data()
...
...     def set_test(self):
...         return self._data()
>>>
And then we create an instance of the datablob and save it to the filesystem.
>>> import os
>>> import tempfile
>>> tempdir = tempfile.TemporaryDirectory()
>>>
>>> datablob = MyDataBlob(hyperparams=dict(length=5))
>>> datablob
<sp.DataBlob MyDataBlob-dac936v7mb1ue9phjp6tc3sb>
>>>
>>> list(datablob.training.as_numpy_iterator())
[(0, 5), (1, 6), (2, 7), (3, 8), (4, 9)]
>>>
>>> datablob = datablob.save(tempdir.name)
>>>
>>> os.listdir(tempdir.name)
['MyDataBlob-dac936v7mb1ue9phjp6tc3sb']
Now, let’s say that we want to create an AppendDataBlob that takes in any input DataBlob or DataFrameDataBlob and multiplies every number in every tensor by a constant.
>>> class MyAppendDataBlob(sp.AppendDataBlob):
...
...     @sp.dataclass
...     class Hyperparams(sp.AppendHyperparamsType):
...         coefficient: int
...
...     hyperparams: "MyAppendDataBlob.Hyperparams"
...
...     def __init__(self, *, parent: sp.DataBlob, hyperparams):
...         hyperparams_dict = sp.enforce_dict(hyperparams)
...         if hyperparams_dict["coefficient"] < 1:
...             raise ValueError("Coefficient is too low.")
...         super().__init__(parent=parent, hyperparams=hyperparams_dict)
...
...     def _wrap_tfdata(self, tfdata: tf.data.Dataset) -> tf.data.Dataset:
...         return tfdata.map(
...             lambda x, y: (
...                 x * self.hyperparams.coefficient,
...                 y * self.hyperparams.coefficient,
...             )
...         )
>>>
>>> append = MyAppendDataBlob(parent=datablob, hyperparams=dict(coefficient=3))
>>> list(append.training.as_numpy_iterator())
[(0, 15), (3, 18), (6, 21), (9, 24), (12, 27)]
(And now let’s clean up the temporary directory that we created earlier.)
>>> tempdir.cleanup()
- Parameters
- Hyperparams :Type[scalarstop.hyperparams.AppendHyperparamsType]¶
- classmethod create_append_hyperparams(cls, *, parent: DataBlob, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None)¶
Combine the hyperparams from the parent DataBlob with the hyperparams meant for this AppendDataBlob.
- classmethod calculate_name_from_parent(cls, *, parent: DataBlob, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None)¶
Calculate the hashed name of this AppendDataBlob, given the hyperparameters and the parent DataBlob.
- classmethod from_filesystem_with_parent(cls, *, parent: DataBlob, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1)¶
Load an AppendDataBlob from the filesystem, calculating the filename from the parent and the hyperparameters.
- classmethod from_filesystem_or_new_with_parent(cls, *, parent: DataBlob, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]], datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1, **kwargs)¶
Load an AppendDataBlob from the filesystem, calculating the filename from the hyperparameters. Create a new AppendDataBlob if we cannot find a saved one on the filesystem.
- set_training(self) tf.data.Dataset¶
Create a tf.data.Dataset for the training set.
- property training(self) tf.data.Dataset¶
A tf.data.Dataset instance representing the training set.
- set_validation(self) tf.data.Dataset¶
Create a tf.data.Dataset for the validation set.
- property validation(self) tf.data.Dataset¶
A tf.data.Dataset instance representing the validation set.
- set_test(self) tf.data.Dataset¶
Create a tf.data.Dataset for the test set.
- property test(self) tf.data.Dataset¶
A tf.data.Dataset instance representing the test set.
- classmethod from_filesystem(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1)¶
Loads a DataBlob from the filesystem, calculating the filename from the hyperparameters.
- Parameters
hyperparams – The hyperparameters of the DataBlob that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
- classmethod from_filesystem_distributed(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, cache: bool = False, repeat: Union[bool, int, None] = True, per_replica_batch_size: Optional[int] = None, tf_distribute_strategy: Optional[tf.distribute.Strategy] = None) DistributedDataBlob¶
Loads a sharded DataBlob from the filesystem, automatically splitting the shards among the input workers of a tf.distribute.Strategy.
- Parameters
hyperparams – The hyperparameters of the DataBlob that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
cache – Whether to cache the DataBlob in memory. If repeat is also enabled, then caching will happen before repeating.
repeat – Repeats the DataBlob after loading it. Set to True to enable infinite repeating. Set to a positive integer n to repeat the DataBlob n times. Set to False to disable repeating.
per_replica_batch_size – The batch size for each individual tf.distribute replica. This is the global batch size divided by tf.distribute.Strategy.num_replicas_in_sync.
tf_distribute_strategy – The tf.distribute.Strategy subclass to use. Optionally, this method will detect if it is already inside a tf.distribute.Strategy.scope() context manager.
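The relationship between the global batch size and per_replica_batch_size is simple arithmetic, sketched below. The helper name and the divisibility check are assumptions added for illustration. They are not part of ScalarStop or TensorFlow.

```python
def per_replica_batch_size(global_batch_size: int, num_replicas_in_sync: int) -> int:
    """Divide a global batch size evenly across synchronized replicas."""
    if num_replicas_in_sync < 1:
        raise ValueError("num_replicas_in_sync must be at least 1")
    if global_batch_size % num_replicas_in_sync != 0:
        # Assumption of this sketch: refuse sizes that do not divide evenly.
        raise ValueError("global batch size must be divisible by the replica count")
    return global_batch_size // num_replicas_in_sync


# For example, a global batch of 256 spread over 8 replicas.
size = per_replica_batch_size(global_batch_size=256, num_replicas_in_sync=8)
```

With a single replica the two values coincide, so the same code path works on one device.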
- classmethod metadata_from_filesystem(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str) scalarstop.datablob_metadata.DataBlobMetadata¶
Loads this DataBlob’s DataBlobMetadata from the filesystem, calculating the filename from the hyperparameters.
- Parameters
hyperparams – The hyperparameters of the DataBlob that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
- classmethod from_filesystem_or_new(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, datablobs_directory: str, shard_offset: Optional[int] = None, shard_quantity: int = 1, **kwargs)¶
Load a DataBlob from the filesystem, calculating the filename from the hyperparameters. Create a new DataBlob if we cannot find a saved one on the filesystem.
- Parameters
hyperparams – The hyperparameters of the DataBlob that we want to load.
datablobs_directory – The parent directory of all of your saved DataBlobs. The exact filename is calculated from the class name and hyperparams.
**kwargs – Other keyword arguments that you need to pass to your __init__().
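The load-or-create pattern behind from_filesystem_or_new() can be illustrated with the standard library alone. Everything in this sketch is hypothetical: the record.json file, the load_or_new and save_record helpers, and the dictionary standing in for a DataBlob. It shows only the control flow, not ScalarStop's on-disk format.

```python
import json
import os
import tempfile
from typing import Any, Dict, Tuple


def load_or_new(
    directory: str, name: str, hyperparams: Dict[str, Any]
) -> Tuple[Dict[str, Any], bool]:
    """Return (record, loaded_from_disk) for the given name."""
    path = os.path.join(directory, name, "record.json")
    if os.path.exists(path):
        # A saved copy exists under the calculated name, so reuse it.
        with open(path, encoding="utf-8") as handle:
            return json.load(handle), True
    # Nothing saved yet: build a fresh record from the hyperparameters.
    return {"name": name, "hyperparams": hyperparams}, False


def save_record(directory: str, record: Dict[str, Any]) -> None:
    """Write the record into a subdirectory named after the record."""
    target = os.path.join(directory, record["name"])
    os.makedirs(target, exist_ok=True)
    with open(os.path.join(target, "record.json"), "w", encoding="utf-8") as handle:
        json.dump(record, handle)


with tempfile.TemporaryDirectory() as tmp:
    first, first_loaded = load_or_new(tmp, "MyDataBlob-abc", {"length": 5})
    save_record(tmp, first)
    second, second_loaded = load_or_new(tmp, "MyDataBlob-abc", {"length": 5})
```

The first call builds a new record because nothing is on disk. The second call finds the saved copy and skips the rebuild, which is the saving the pattern is meant to deliver for expensive pipelines.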
- static from_exact_path(path: str, *, shard_offset: Optional[int] = None, shard_quantity: int = 1) DataBlob¶
Load a DataBlob from a directory on the filesystem.
- classmethod from_exact_path_distributed(cls, *, path: str, cache: bool = False, repeat: Union[bool, int, None] = True, per_replica_batch_size: Optional[int] = None, tf_distribute_strategy: Optional[tensorflow.distribute.get_strategy] = None) DistributedDataBlob¶
- Parameters
path – The exact location of the saved DataBlob on the filesystem.
cache – Whether to cache the DataBlob in memory. If repeat is also enabled, then caching will happen before repeating.
repeat – Repeats the DataBlob after loading it. Set to True to enable infinite repeating. Set to a positive integer n to repeat the DataBlob n times. Set to False to disable repeating.
per_replica_batch_size – The batch size for each individual tf.distribute replica. This is the global batch size divided by tf.distribute.Strategy.num_replicas_in_sync.
tf_distribute_strategy – The tf.distribute.Strategy subclass to use. Optionally, this method will detect if it is already inside a tf.distribute.Strategy.scope() context manager.
- static metadata_from_exact_path(path: str) scalarstop.datablob_metadata.DataBlobMetadata¶
Loads this DataBlob’s DataBlobMetadata from a directory on the filesystem.
- exists_in_datablobs_directory(self, datablobs_directory: str) bool¶
Returns True if this DataBlob was already saved within datablobs_directory.
- Parameters
datablobs_directory – The parent directory of all of your saved DataBlobs.
- Returns
Returns True if we found a DataBlob metadata file at the expected location.
- batch(self, batch_size: int, *, training: bool = True, validation: bool = True, test: bool = True, with_tf_distribute: bool = False) DataBlob¶
Batch this DataBlob.
- Parameters
batch_size – The number of items to collect into a batch.
training – Whether to batch the training set. Defaults to True.
validation – Whether to batch the validation set. Defaults to True.
test – Whether to batch the test set. Defaults to True.
with_tf_distribute – Whether to consider tf.distribute auto-data sharding when calculating the batch size.
- cache(self, *, training: bool = True, validation: bool = True, test: bool = True, precache_training: bool = False, precache_validation: bool = False, precache_test: bool = False) DataBlob¶
Cache this DataBlob into memory before iterating over it.
By default, this creates a DataBlob containing a TensorFlow CacheDataset for each of the training, validation, and test tf.data.Datasets.
But these datasets do not load into memory until the first time you completely iterate over one, from start to end. If you want to immediately load your training, validation, or test sets, you can set precache_training, precache_validation, and/or precache_test to True.
- Parameters
training – Lazily cache the training set in CPU memory. Defaults to True.
validation – Lazily cache the validation set in CPU memory. Defaults to True.
test – Lazily cache the test set in CPU memory. Defaults to True.
precache_training – Eagerly cache the training set into memory. Defaults to False.
precache_validation – Eagerly cache the validation set into memory. Defaults to False.
precache_test – Eagerly cache the test set into memory. Defaults to False.
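The difference between lazy caching and eager precaching can be sketched in plain Python. CachedSource is a hypothetical class written for this illustration. It imitates the behavior described above (the cache fills only after one complete pass, unless precaching is requested) and is not how TensorFlow or ScalarStop implement caching.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional


class CachedSource:
    """Hypothetical stand-in for a cached dataset."""

    def __init__(self, produce: Callable[[], Iterable[Any]], precache: bool = False):
        self._produce = produce
        self._cache: Optional[List[Any]] = None
        if precache:
            # Eager: pay the full cost now, before anyone iterates.
            self._cache = list(produce())

    def __iter__(self) -> Iterator[Any]:
        if self._cache is not None:
            yield from self._cache
            return
        collected: List[Any] = []
        for item in self._produce():
            collected.append(item)
            yield item
        # Reached only when the consumer iterated from start to end.
        self._cache = collected


calls: List[str] = []


def expensive() -> List[int]:
    calls.append("run")  # Record every time the source is recomputed.
    return [1, 2, 3]


lazy = CachedSource(expensive)
calls_before_iteration = len(calls)
first_pass = list(lazy)
second_pass = list(lazy)
calls_after_two_passes = len(calls)

eager = CachedSource(expensive, precache=True)
calls_after_eager_init = len(calls)
```

In the lazy case the source runs once, during the first full pass. In the eager case it runs inside the constructor, which trades a slower setup for a predictable first epoch.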
- prefetch(self, buffer_size: int, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob¶
Creates a DataBlob that prefetches elements for performance.
- Parameters
buffer_size – The maximum number of elements that will be buffered when prefetching. If the value tf.data.experimental.AUTOTUNE is used, then the buffer is dynamically tuned.
training – Apply the prefetch operator to the training set. Defaults to True.
validation – Apply the prefetch operator to the validation set. Defaults to True.
test – Apply the prefetch operator to the test set. Defaults to True.
- repeat(self, count: Optional[int] = None, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob¶
Repeats this DataBlob.
- Parameters
count – Represents the number of times that the elements in the tf.data.Dataset should be repeated. The default behavior (if count is None or -1) is for the dataset to be repeated indefinitely.
training – Apply the repeat operator to the training set. Defaults to True.
validation – Apply the repeat operator to the validation set. Defaults to True.
test – Apply the repeat operator to the test set. Defaults to True.
- repeat_interleaved(self, count: int, cycle_length: Optional[int] = None, block_length: Optional[int] = None, num_parallel_calls: Optional[int] = None, deterministic: Optional[bool] = None, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob¶
Repeats this DataBlob, but in interleaved order.
- Parameters
count – Represents the number of times that the elements in the tf.data.Dataset should be repeated. This must be a finite integer greater than 0. It cannot be a negative number, None, or an infinite value.
training – Apply the repeat operator to the training set. Defaults to True.
validation – Apply the repeat operator to the validation set. Defaults to True.
test – Apply the repeat operator to the test set. Defaults to True.
- with_options(self, options: tf.data.Options, *, training: bool = True, validation: bool = True, test: bool = True) DataBlob¶
Apply a tf.data.Options object to this DataBlob.
- Parameters
options – The tf.data.Options object to apply.
training – Apply the options to the training set. Defaults to True.
validation – Apply the options to the validation set. Defaults to True.
test – Apply the options to the test set. Defaults to True.
- save_hook(self, *, subtype: str, path: str) None¶
Override this method to run additional code when saving this DataBlob to disk.
- save(self, datablobs_directory: str, *, ignore_existing: bool = False, num_shards: int = 1, save_load_version: int = _DEFAULT_SAVE_LOAD_VERSION) DataBlob¶
Save this DataBlob to disk.
- Parameters
datablobs_directory – The directory where you plan on storing all of your DataBlobs. This method will save this DataBlob in a subdirectory of datablobs_directory with the same name as DataBlob.name.
ignore_existing – Set this to True to ignore the case where there is already a DataBlob at the given path.
save_load_version – The ScalarStop version for the ScalarStop protocol.
- Returns
Return self, enabling you to place this call in a chain.
- classmethod calculate_name(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None) str¶
Calculate the hashed name of this object, given the hyperparameters.
This classmethod can be used to calculate what an object’s name would be without actually having to call __init__().
- property hyperparams(self) scalarstop.hyperparams.HyperparamsType¶
Returns a HyperparamsType instance containing hyperparameters.
- property hyperparams_flat(self) Dict[str, Any]¶
Returns a Python dictionary of “flattened” hyperparameters.
AppendDataBlob objects modify a “parent” DataBlob, nesting the parent’s Hyperparams within the AppendDataBlob’s own Hyperparams.
This makes it hard to look up a given hyperparams key. A value at parent_datablob.hyperparams.a is stored at child_datablob.hyperparams.parent.hyperparams.a.
This hyperparams_flat property provides all nested hyperparams keys as a flat Python dictionary. If a child AppendDataBlob has a hyperparameter key that conflicts with the parent, the child’s value will overwrite the parent’s value.
- class DistributedDataBlob(*, name: str, group_name: str, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None, hyperparams_class: Type[scalarstop.hyperparams.HyperparamsType], cache: bool = False, repeat: Union[bool, int, None] = True, per_replica_batch_size: Optional[int] = None, tf_distribute_strategy: Optional[tensorflow.distribute.get_strategy] = None)¶
Bases: DataBlobBase
Wraps a DataBlob to create a TensorFlow tf.distribute.DistributedDataset.
A DataBlob contains three TensorFlow tf.data.Dataset pipelines, representing a training, validation, and test set. The DistributedDataBlob wraps the creation of a DataBlob to turn each tf.data.Dataset into a tf.distribute.DistributedDataset, which is used to distribute a dataset across multiple workers according to a tf.distribute.Strategy.
If you have saved a DataBlob to the filesystem with DataBlob.save(), then you can automatically load the DataBlob from the filesystem as a DistributedDataBlob using the classmethod DataBlob.from_filesystem_distributed() or DataBlob.from_exact_path_distributed().
For more fine-grained control, you can subclass DistributedDataBlob and override DistributedDataBlob.new_sharded_datablob() with your own DataBlob creation and sharding logic. Optionally, you can also override DistributedDataBlob.transform_datablob() to change how DistributedDataBlob handles repeating and batching. Finally, you can also override DistributedDataBlob.postprocess_tfdata() to make changes to individual tf.data.Dataset instances rather than the DataBlob as a whole.
- Parameters
name – The name of the wrapped DataBlob.
group_name – The group name of the wrapped DataBlob.
hyperparams – The hyperparameters of the wrapped DataBlob.
hyperparams_class – The HyperparamsType class that hyperparams instances are created from.
cache – Whether to cache the DataBlob in memory. If repeat is also enabled, then caching will happen before repeating.
repeat – Repeats the DataBlob after loading it. Set to True to enable infinite repeating. Set to a positive integer n to repeat the DataBlob n times. Set to False to disable repeating.
per_replica_batch_size – The batch size for each individual tf.distribute replica. This is the global batch size divided by tf.distribute.Strategy.num_replicas_in_sync.
tf_distribute_strategy – The tf.distribute.Strategy subclass to use. Optionally, this method will detect if it is already inside a tf.distribute.Strategy.scope() context manager.
- Hyperparams :Type[scalarstop.hyperparams.HyperparamsType]¶
- abstract new_sharded_datablob(self, ctx: tf.distribute.InputContext) DataBlob¶
Override this method to return a sharded DataBlob.
- Parameters
ctx – A tf.distribute.InputContext instance. The attribute tf.distribute.InputContext.input_pipeline_id returns the current input pipeline. The attribute tf.distribute.InputContext.num_input_pipelines returns the total number of distributed input pipelines in the current tf.distribute.Strategy.
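The two InputContext attributes are enough to split data so that each input pipeline reads a disjoint part. The sketch below uses plain lists and an index-modulo rule. That rule is only one possible strategy and is an assumption of this example, as is the shard_for_pipeline helper. A real override could just as well shard by file.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard_for_pipeline(
    elements: Sequence[T], num_input_pipelines: int, input_pipeline_id: int
) -> List[T]:
    """Keep the elements whose index maps to this input pipeline."""
    if not 0 <= input_pipeline_id < num_input_pipelines:
        raise ValueError("input_pipeline_id must be in [0, num_input_pipelines)")
    return [
        element
        for index, element in enumerate(elements)
        if index % num_input_pipelines == input_pipeline_id
    ]


data = list(range(10))
# One shard per pipeline, as if three input pipelines were active.
shards = [shard_for_pipeline(data, 3, pipeline_id) for pipeline_id in range(3)]
```

Every element lands in exactly one shard, so the pipelines together cover the dataset without overlap.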
- transform_datablob(self, datablob: DataBlob, ctx: tf.distribute.InputContext) DataBlob¶
Transforms an already-initialized DataBlob to add repeating and sharding logic.
- Parameters
datablob – The already-initialized DataBlob.
ctx – A tf.distribute.InputContext instance. The attribute tf.distribute.InputContext.input_pipeline_id returns the current input pipeline. The attribute tf.distribute.InputContext.num_input_pipelines returns the total number of distributed input pipelines in the current tf.distribute.Strategy.
- Returns
Returns a DataBlob that has been modified by repeating, batching, or another transformation.
- postprocess_tfdata(self, tfdata: tf.data.Dataset, ctx: tf.distribute.InputContext) tf.data.Dataset¶
Performs additional tf.data.Dataset transformations before turning them into tf.distribute.DistributedDataset instances.
Currently, the implementation in DistributedDataBlob does nothing, but is available for you to override.
- Parameters
tfdata – The input tf.data.Dataset instance to transform.
ctx – A tf.distribute.InputContext instance. The attribute tf.distribute.InputContext.input_pipeline_id returns the current input pipeline. The attribute tf.distribute.InputContext.num_input_pipelines returns the total number of distributed input pipelines in the current tf.distribute.Strategy.
- Returns
Returns a transformed tf.data.Dataset.
- property tf_distribute_strategy(self) tf.distribute.Strategy¶
Returns the currently-active tf.distribute.Strategy.
- set_training(self) tf.distribute.DistributedDataset¶
Creates a new tf.distribute.DistributedDataset for the training set.
- property training(self) tf.distribute.DistributedDataset¶
A tf.distribute.DistributedDataset instance for the training set.
- set_validation(self) tf.distribute.DistributedDataset¶
Creates a new tf.distribute.DistributedDataset for the validation set.
- property validation(self) tf.distribute.DistributedDataset¶
A tf.distribute.DistributedDataset instance for the validation set.
- set_test(self) tf.distribute.DistributedDataset¶
Creates a new tf.distribute.DistributedDataset for the test set.
- property test(self) tf.distribute.DistributedDataset¶
A tf.distribute.DistributedDataset instance for the test set.
- classmethod calculate_name(cls, *, hyperparams: Optional[Union[Mapping[str, Any], scalarstop.hyperparams.HyperparamsType]] = None) str¶
Calculate the hashed name of this object, given the hyperparameters.
This classmethod can be used to calculate what an object’s name would be without actually having to call __init__().
- property hyperparams(self) scalarstop.hyperparams.HyperparamsType¶
Returns a HyperparamsType instance containing hyperparameters.
- property hyperparams_flat(self) Dict[str, Any]¶
Returns a Python dictionary of “flattened” hyperparameters.
AppendDataBlob objects modify a “parent” DataBlob, nesting the parent’s Hyperparams within the AppendDataBlob’s own Hyperparams.
This makes it hard to look up a given hyperparams key. A value at parent_datablob.hyperparams.a is stored at child_datablob.hyperparams.parent.hyperparams.a.
This hyperparams_flat property provides all nested hyperparams keys as a flat Python dictionary. If a child AppendDataBlob has a hyperparameter key that conflicts with the parent, the child’s value will overwrite the parent’s value.