`scalarstop.train_store`

Persists DataBlob, ModelTemplate, and Model metadata to a database.

What database should I use?

Currently the TrainStore supports saving metadata and metrics to either a SQLite or a PostgreSQL database. If you are doing all of your work on a single machine, a SQLite database is easier to set up. But if you are training machine learning models on multiple machines, you should use a PostgreSQL database instead of SQLite. SQLite allows only one writer at a time, so it is not well suited to handling many concurrent writes.

How can I extend the `TrainStore`?

The TrainStore does not implement absolutely every type of query that you might want to perform on your training metrics. However, we directly expose our SQLAlchemy engine, connection, and tables in the TrainStore attributes TrainStore.engine, TrainStore.connection, and TrainStore.table.
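As a rough illustration of the paragraph above, the sketch below counts the rows in each table using only the exposed TrainStore.connection and TrainStore.table attributes. It assumes the tables are ordinary SQLAlchemy Table objects, so that Table.select() is available, and it takes the four table names from the table property documented on this page. The helper count_rows_per_table is hypothetical and not part of ScalarStop.

```python
def count_rows_per_table(train_store):
    """Return {table_name: row_count} for a TrainStore-like object.

    Hypothetical helper: it relies only on the documented
    ``connection`` and ``table`` attributes.
    """
    table_names = ("datablob", "model_template", "model", "model_epoch")
    counts = {}
    for table_name in table_names:
        table = getattr(train_store.table, table_name)
        # Table.select() builds a plain "SELECT ... FROM <table>" statement.
        result = train_store.connection.execute(table.select())
        counts[table_name] = len(result.fetchall())
    return counts
```

Fetching every row just to count it is wasteful on large tables; a real query would more likely use an aggregate, but this keeps the sketch free of extra imports.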

Module Contents

Classes

TrainStore

Loads and saves names, hyperparameters, and training metrics from DataBlob, ModelTemplate, and Model objects.

class TrainStore(connection_string: str, *, table_name_prefix: Optional[str] = None, echo: bool = False)

Loads and saves names, hyperparameters, and training metrics from DataBlob, ModelTemplate, and Model objects.

Create a TrainStore instance connected to an external database.

Use this constructor if you want to connect to a PostgreSQL database. If you want to use a SQLite file as the database, you should instead use the TrainStore.from_filesystem() classmethod.

Parameters

connection_string – A SQLAlchemy database connection string for connecting to a database. A typical PostgreSQL connection string looks like "postgresql://username:password@hostname:port/database", with the port defaulting to 5432.
table_name_prefix – A string prefix to add to all of the table names we generate. This allows multiple installations of ScalarStop to share the same database.
echo – Set to True to print out the SQL statements that the TrainStore executes.
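The connection string format described above can be assembled with the standard library. This is a minimal sketch, assuming you want the username and password percent-encoded so that characters such as "@" or "/" do not break the URL; make_postgresql_url and the example credentials are hypothetical.

```python
from urllib.parse import quote


def make_postgresql_url(username, password, hostname, database, port=5432):
    """Build a "postgresql://username:password@hostname:port/database" string.

    Hypothetical helper, not part of ScalarStop.
    """
    # safe="" forces every reserved character, including "/", to be encoded.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"postgresql://{user}:{secret}@{hostname}:{port}/{database}"
```

The resulting string would then be passed as the connection_string argument of the TrainStore constructor.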

classmethod from_filesystem(cls, *, filename: str, table_name_prefix: Optional[str] = None, echo: bool = False) → TrainStore

Use a SQLite3 database file on the local filesystem as the train store.

Parameters

filename – The filename of the SQLite3 file.
table_name_prefix – A string prefix to add to all of the table names we generate. This allows multiple installations of ScalarStop to share the same database.
echo – Set to True to print out the SQL statements that the TrainStore executes.

property table(self) → _TrainStoreTables

References to the sqlalchemy.schema.Table objects representing our database tables.

Currently, four tables are available as attributes of this property:

datablob
model_template
model
model_epoch

property engine(self) → sqlalchemy.engine.Engine

The currently active sqlalchemy.engine.Engine.

This is useful if you want to write custom SQLAlchemy code on top of TrainStore.

property connection(self) → sqlalchemy.engine.Connection

The currently active sqlalchemy.engine.Connection.

This is useful if you want to write custom SQLAlchemy code on top of TrainStore.

insert_datablob(self, datablob: scalarstop.datablob.DataBlobBase, *, ignore_existing: bool = False) → None

Logs the DataBlob name, group name, and hyperparams to the TrainStore.

This also supports inserting other subclasses of DataBlobBase, such as DistributedDataBlob.

Parameters

datablob – A DataBlob instance whose name and hyperparameters we want to record in the database.
ignore_existing – Set this to True to ignore if a DataBlob with the same name is already in the database, in which case this function will do nothing. Note that DataBlob instances are supposed to be immutable, so TrainStore does not implement updating them.

insert_datablob_by_str(self, *, name: str, group_name: str, hyperparams: Any, ignore_existing: bool = False)

Logs the DataBlob name, group name, and hyperparams to the TrainStore.

Parameters

name – Your DataBlob name.
group_name – Your DataBlob group name.
hyperparams – Your DataBlob hyperparameters.
ignore_existing – Set this to True to ignore if a DataBlob with the same name is already in the database, in which case this function will do nothing. Note that DataBlob instances are supposed to be immutable, so TrainStore does not implement updating them.

list_datablobs(self, *, datablob_name: Optional[Union[str, Sequence[str]]] = None, datablob_group_name: Optional[Union[str, Sequence[str]]] = None) → pandas.DataFrame

Returns a pandas.DataFrame listing the DataBlob names in the database.

If you call this method without any arguments, it will list ALL of the DataBlobs in the database. You can narrow down your results by providing ONE (but not both) of the arguments below.

Parameters

datablob_name – Either a single DataBlob name or a list of names to select.
datablob_group_name – Either a single DataBlob group name or a list of group names to select.
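Because only one of the two filters may be supplied, a thin wrapper can make that rule explicit at the call site. The sketch below is hypothetical (list_datablobs_by is not part of ScalarStop), and since this page does not say what TrainStore does when both filters are given, it simply rejects that case up front.

```python
def list_datablobs_by(train_store, *, name=None, group_name=None):
    """Call ``list_datablobs`` with at most one of the two filters.

    Hypothetical wrapper around the documented keyword arguments.
    """
    if name is not None and group_name is not None:
        raise ValueError("Provide either name or group_name, not both.")
    if name is not None:
        return train_store.list_datablobs(datablob_name=name)
    if group_name is not None:
        return train_store.list_datablobs(datablob_group_name=group_name)
    # No filter: list every DataBlob in the database.
    return train_store.list_datablobs()
```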

insert_model_template(self, model_template, *, ignore_existing: bool = False)

Logs the ModelTemplate name, group name, and hyperparams to the TrainStore.

Parameters

model_template – A ModelTemplate instance whose name and hyperparameters we want to record in the database.
ignore_existing – Set this to True to ignore if a ModelTemplate with the same name is already in the database, in which case this function will do nothing. Note that ModelTemplate instances are supposed to be immutable, so TrainStore does not implement updating them.

insert_model_template_by_str(self, *, name: str, group_name: str, hyperparams, ignore_existing: bool = False)

Logs the ModelTemplate name, group name, and hyperparams to the TrainStore.

Parameters

name – Your ModelTemplate name.
group_name – Your ModelTemplate group name.
hyperparams – Your ModelTemplate hyperparameters.
ignore_existing – Set this to True to ignore if a ModelTemplate with the same name is already in the database, in which case this function will do nothing. Note that ModelTemplate instances are supposed to be immutable, so TrainStore does not implement updating them.

list_model_templates(self, *, model_template_name: Optional[Union[str, Sequence[str]]] = None, model_template_group_name: Optional[Union[str, Sequence[str]]] = None)

Returns a pandas.DataFrame listing ALL of the rows in the ModelTemplate table.

If you call this method without any arguments, it will list ALL of the ModelTemplates in the database. You can narrow down your results by providing ONE (but not both) of the arguments below.

Parameters

model_template_name – Either a single ModelTemplate name or a list of names to select.
model_template_group_name – Either a single ModelTemplate group name or a list of group names to select.

insert_model(self, model, *, ignore_existing: bool = False)

Logs the Model name, DataBlob, and ModelTemplate to the TrainStore.

Parameters

model – A Model instance whose name and hyperparameters we want to record in the database.
ignore_existing – Set this to True to ignore if a Model with the same name is already in the database, in which case this function will do nothing. The TrainStore does not implement the updating of Model name or hyperparameters. The only way to change a Model is to log more epochs.

insert_model_by_str(self, *, name: str, model_class_name: str, datablob_name: str, model_template_name: str, ignore_existing: bool = False) → None

Logs the Model name, DataBlob, and ModelTemplate to the TrainStore.

Parameters

name – The Model name.
model_class_name – The Model subclass name used. If you are using KerasModel, then this value is the string "KerasModel".
datablob_name – The DataBlob name used to create the Model instance.
model_template_name – The ModelTemplate name used to create the Model instance.
ignore_existing – Set this to True to ignore if a Model with the same name is already in the database, in which case this function will do nothing. The TrainStore does not implement the updating of Model name or hyperparameters. The only way to change a Model is to log more epochs.
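The three *_by_str methods can be combined to record a training run without constructing the actual objects. The sketch below is one possible arrangement, not a prescribed workflow: register_run is a hypothetical helper, the dictionary layout of its arguments is invented for illustration, and inserting the DataBlob and ModelTemplate before the Model is an assumption that the model row refers to the other two by name.

```python
def register_run(train_store, *, datablob, model_template, model_name,
                 model_class_name):
    """Record a DataBlob, a ModelTemplate, and a Model by name.

    ``datablob`` and ``model_template`` are plain dicts with the keys
    "name", "group_name", and "hyperparams" (a hypothetical layout).
    """
    # ignore_existing=True makes each call a no-op if the row already exists,
    # so the helper can be re-run safely.
    train_store.insert_datablob_by_str(
        name=datablob["name"],
        group_name=datablob["group_name"],
        hyperparams=datablob["hyperparams"],
        ignore_existing=True,
    )
    train_store.insert_model_template_by_str(
        name=model_template["name"],
        group_name=model_template["group_name"],
        hyperparams=model_template["hyperparams"],
        ignore_existing=True,
    )
    train_store.insert_model_by_str(
        name=model_name,
        model_class_name=model_class_name,
        datablob_name=datablob["name"],
        model_template_name=model_template["name"],
        ignore_existing=True,
    )
```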

list_models(self, *, datablob_name: Optional[Union[str, Sequence[str]]] = None, datablob_group_name: Optional[Union[str, Sequence[str]]] = None, model_template_name: Optional[Union[str, Sequence[str]]] = None, model_template_group_name: Optional[Union[str, Sequence[str]]] = None) → pandas.DataFrame

Returns a pandas.DataFrame listing ALL of the rows in the Model table.

If you call this method without any arguments, it will list ALL of the Models in the database. Optionally, you can narrow down the results with the following values.

Note that you can provide either datablob_name or datablob_group_name, but not both.

Similarly, you can provide either model_template_name or model_template_group_name, but not both.

Parameters

datablob_name – Either a single DataBlob name or a list of names to select.
datablob_group_name – Either a single DataBlob group name or a list of group names to select.
model_template_name – Either a single ModelTemplate name or a list of names to select.
model_template_group_name – Either a single ModelTemplate group name or a list of group names to select.

list_models_grouped_by_epoch_metric(self, *, metric_name: str, metric_direction: str, datablob_name: Optional[Union[str, Sequence[str]]] = None, datablob_group_name: Optional[Union[str, Sequence[str]]] = None, model_template_name: Optional[Union[str, Sequence[str]]] = None, model_template_group_name: Optional[Union[str, Sequence[str]]] = None) → pandas.DataFrame

Returns a pandas.DataFrame listing ALL of the rows in the Model table AND a metric from the model’s best-performing epoch.

You provide this method with a model epoch metric name and whether to maximize or minimize this, and then it returns all of the models and the best metric value.

Note that you can provide either datablob_name or datablob_group_name, but not both.

Similarly, you can provide either model_template_name or model_template_group_name, but not both.

Parameters

metric_name – The name of one of the metrics tracked when training a model. This might be a value like "loss" or "val_accuracy".
metric_direction – Set this to "min" if the metric you picked in metric_name is a value where lower values are better, such as "loss". Set this to "max" if higher values of your metric are better, such as "accuracy".
datablob_name – Either a single DataBlob name or a list of names to select.
datablob_group_name – Either a single DataBlob group name or a list of group names to select.
model_template_name – Either a single ModelTemplate name or a list of names to select.
model_template_group_name – Either a single ModelTemplate group name or a list of group names to select.

Returns a pandas.DataFrame with the following columns:

model_name
model_class_name
model_last_modified
datablob_name
datablob_group_name
model_template_name
model_template_group_name
sort_metric_value
ModelTemplate hyperparameter names prefixed with mth__
DataBlob hyperparameter names prefixed with dbh__
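To make the meaning of metric_name and metric_direction concrete, here is a plain-Python illustration of picking each model's best epoch value. It is not the query TrainStore runs; best_metric_per_model and its input format are hypothetical, and skipping epochs that lack the metric is an assumption made for this sketch.

```python
def best_metric_per_model(epochs, metric_name, metric_direction):
    """Return {model_name: best value of ``metric_name``}.

    ``epochs`` is an iterable of (model_name, metrics_dict) pairs,
    a hypothetical stand-in for rows of the model_epoch table.
    """
    if metric_direction not in ("min", "max"):
        raise ValueError('metric_direction must be "min" or "max".')
    pick = min if metric_direction == "min" else max
    best = {}
    for model_name, metrics in epochs:
        if metric_name not in metrics:
            continue  # Assumption: epochs without the metric are ignored.
        value = metrics[metric_name]
        if model_name in best:
            best[model_name] = pick(best[model_name], value)
        else:
            best[model_name] = value
    return best
```

The values this produces play the role of the sort_metric_value column listed above.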

insert_model_epoch(self, *, epoch_num: int, model_name: str, metrics, steps_per_epoch: Optional[int] = None, validation_steps_per_epoch: Optional[int] = None, ignore_existing: bool = False) → None

Logs a new epoch for a Model to the TrainStore.

Parameters

epoch_num – The epoch number that we are adding.
model_name – The name of the Model that we are training.
metrics – A dictionary of metric names and values to save.
steps_per_epoch – The number of training steps that count as one epoch. Defaults to None, which means that an epoch is defined by how long it takes for the model’s DataBlob training dataset to be exhausted.
validation_steps_per_epoch – The number of validation steps that count as one epoch. Defaults to None, which means that an epoch is defined by how long it takes for the model’s DataBlob validation dataset to be exhausted.
ignore_existing – Set this to True to ignore if the database already has a row with the same (model_name, epoch_num) pair.
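When metrics are collected outside of ScalarStop's own training loop, each epoch can be logged with one call per epoch. This is a minimal sketch under stated assumptions: log_history is a hypothetical helper, history is a list with one metrics dictionary per epoch, and numbering epochs from 1 is a choice made here rather than something this page specifies.

```python
def log_history(train_store, model_name, history, *, first_epoch=1):
    """Log one row per entry of ``history`` and return the number logged.

    ``history`` is a list of {metric_name: value} dicts, one per epoch.
    """
    for offset, metrics in enumerate(history):
        train_store.insert_model_epoch(
            epoch_num=first_epoch + offset,
            model_name=model_name,
            metrics=metrics,
            # Skip (model_name, epoch_num) pairs that are already stored.
            ignore_existing=True,
        )
    return len(history)
```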

bulk_insert_model_epochs(self, model) → None

Insert a list of Model epochs at once.

This method will politely ignore if the database already contains rows with the same model name and epoch number.

Currently this method only works if you are using either SQLite or PostgreSQL as the backing database.

Parameters: model – The Model with the epochs that we want to save.

list_model_epochs(self, model_name: Optional[Union[str, Sequence[str]]] = None) → pandas.DataFrame

Returns a pandas.DataFrame listing Model epochs.

By default, this lists ALL epochs in the database for ALL models. You can narrow down the search with the following arguments.

Parameters: model_name – Specify a single model name or a list of model names whose epochs we are interested in.

get_current_epoch(self, model_name: str) → int

Returns how many epochs a given Model has been trained for.

Returns 0 if the given model is not registered in the TrainStore.

This information is also saved in the directory created when a Model instance is saved to the filesystem and is available in the attribute current_epoch.
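One use of this value is deciding how much training is left when resuming a run. The helper below is a hypothetical sketch; it relies only on the documented behavior that an unregistered model reports 0 epochs.

```python
def remaining_epochs(train_store, model_name, target_epochs):
    """Return how many more epochs are needed to reach ``target_epochs``."""
    completed = train_store.get_current_epoch(model_name)
    # Never return a negative number if the model is already past the target.
    return max(0, target_epochs - completed)
```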

get_best_model(self, *, metric_name: str, metric_direction: str, datablob_name: Optional[Union[str, Sequence[str]]] = None, datablob_group_name: Optional[Union[str, Sequence[str]]] = None, model_template_name: Optional[Union[str, Sequence[str]]] = None, model_template_group_name: Optional[Union[str, Sequence[str]]] = None) → _ModelMetadata

Return metadata about the model with the best performance on a metric.

This method queries the database, looking for the Model with the best performance on the metric you specified in the parameter metric_name. By default, this searches ALL models in the database. Most likely, you will want to narrow down your search using the arguments below.

Note that you can provide either datablob_name or datablob_group_name, but not both.

Similarly, you can provide either model_template_name or model_template_group_name, but not both.

Parameters

metric_name – The name of one of the metrics tracked when training a model. This might be a value like "loss" or "val_accuracy".
metric_direction – Set this to "min" if the metric you picked in metric_name is a value where lower values are better, such as "loss". Set this to "max" if higher values of your metric are better, such as "accuracy".
datablob_name – Either a single DataBlob name or a list of names to select.
datablob_group_name – Either a single DataBlob group name or a list of group names to select.
model_template_name – Either a single ModelTemplate name or a list of names to select.
model_template_group_name – Either a single ModelTemplate group name or a list of group names to select.

Returns a dataclass with the following attributes:

model_name
model_class_name
model_epoch_metrics
model_last_modified
datablob_name
datablob_group_name
datablob_hyperparams
datablob_hyperparams_flat
model_template_name
model_template_group_name
model_template_hyperparams
sort_metric_name
sort_metric_value
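The returned dataclass can be read like any other object. The sketch below formats a one-line summary from three of the attributes listed above; describe_best_model is a hypothetical helper, and the extra filters are simply forwarded as the documented keyword arguments.

```python
def describe_best_model(train_store, *, metric_name, metric_direction,
                        **filters):
    """Return a short summary of the best model for a metric.

    ``filters`` may hold the documented keyword arguments such as
    ``datablob_group_name`` or ``model_template_name``.
    """
    best = train_store.get_best_model(
        metric_name=metric_name,
        metric_direction=metric_direction,
        **filters,
    )
    return f"{best.model_name}: {best.sort_metric_name}={best.sort_metric_value}"
```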

close(self) → None

Close the database connection.

This is also called by the context manager’s __exit__() method.
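Since the TrainStore can be used as a context manager, a with block is a convenient way to make sure the connection is closed even if logging fails. This is a minimal sketch; log_epoch_then_close is a hypothetical helper, and it does not rely on what __enter__ returns.

```python
def log_epoch_then_close(train_store, *, model_name, epoch_num, metrics):
    """Log a single epoch, then close the TrainStore's connection."""
    # Leaving the with block triggers __exit__(), which calls close().
    with train_store:
        train_store.insert_model_epoch(
            epoch_num=epoch_num,
            model_name=model_name,
            metrics=metrics,
        )
```

After this function returns, the store's connection is closed, so the caller should not reuse the same train_store object.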
