scalarstop.train_store
Persists DataBlob, ModelTemplate, and Model metadata to a database.
What database should I use?
Currently, the TrainStore supports saving metadata and metrics to either a SQLite or a PostgreSQL database.
If you are doing all of your work on a single machine, a SQLite database is easier to set up. But if you are training machine learning models on multiple machines, you should use a PostgreSQL database instead of SQLite. The SQLite database is not optimal for handling multiple concurrent writes.
How can I extend the TrainStore?
The TrainStore does not implement every type of query that you might want to perform on your training metrics. However, we directly expose our SQLAlchemy engine, connection, and tables in the TrainStore attributes TrainStore.engine, TrainStore.connection, and TrainStore.table.
Module Contents
Classes
- class TrainStore(connection_string: str, *, table_name_prefix: Optional[str] = None, echo: bool = False)
Loads and saves names, hyperparameters, and training metrics from DataBlob, ModelTemplate, and Model objects.
Create a TrainStore instance connected to an external database.
Use this constructor if you want to connect to a PostgreSQL database. If you want to use a SQLite file as the database, you should instead use the TrainStore.from_filesystem() classmethod.
- Parameters
connection_string – A SQLAlchemy database connection string for connecting to a database. A typical PostgreSQL connection string looks like "postgresql://username:password@hostname:port/database", with the port defaulting to 5432.
table_name_prefix – A string prefix to add to all of the table names we generate. This allows multiple installations of ScalarStop to share the same database.
echo – Set to True to print out the SQL statements that the TrainStore executes.
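As a minimal sketch of the connection string format described above, the helper below assembles a PostgreSQL URL from its parts. The function name build_postgres_url and the example credentials are hypothetical and are not part of ScalarStop. The credentials are percent-encoded so that characters such as @ or : cannot be mistaken for URL delimiters.

```python
from urllib.parse import quote


def build_postgres_url(username, password, hostname, database, port=5432):
    """Assemble a SQLAlchemy-style PostgreSQL URL (hypothetical helper)."""
    # Percent-encode the credentials so characters such as "@" or ":"
    # cannot be mistaken for URL delimiters.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"postgresql://{user}:{secret}@{hostname}:{port}/{database}"


# Example values only; substitute your own host and credentials.
url = build_postgres_url("trainer", "p@ss:word", "db.example.com", "metrics")
print(url)
```

The resulting string is what you would pass as connection_string to the TrainStore constructor.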
- classmethod from_filesystem(cls, *, filename: str, table_name_prefix: Optional[str] = None, echo: bool = False) -> TrainStore
Use a SQLite3 database file on the local filesystem as the train store.
- Parameters
filename – The filename of the SQLite3 file.
table_name_prefix – A string prefix to add to all of the table names we generate. This allows multiple installations of ScalarStop to share the same database.
echo – Set to True to print out the SQL statements that the TrainStore executes.
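To illustrate why a table name prefix lets several installations share one database, here is a sketch that uses Python's built-in sqlite3 module rather than ScalarStop itself. The one-column schema, the helper create_tables, and the prefixes are hypothetical; ScalarStop's real tables and naming scheme may differ.

```python
import sqlite3

# The four table names documented for the TrainStore.table property.
BASE_TABLES = ("datablob", "model_template", "model", "model_epoch")


def create_tables(conn, table_name_prefix=None):
    """Create one set of simplified, hypothetical tables per prefix."""
    prefix = table_name_prefix or ""
    names = [prefix + base for base in BASE_TABLES]
    for name in names:
        # Table names cannot be bound as SQL parameters, so they are
        # formatted in as quoted identifiers; only pass trusted prefixes.
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{name}" (name TEXT PRIMARY KEY)')
    return names


conn = sqlite3.connect(":memory:")
team_a = create_tables(conn, table_name_prefix="team_a_")
team_b = create_tables(conn, table_name_prefix="team_b_")

# Both sets of tables now coexist in the same database without colliding.
all_tables = sorted(
    row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
```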
- property table(self) -> _TrainStoreTables
References to the sqlalchemy.schema.Table objects representing our database tables.
Currently, there are four tables that are attributes of this property:
datablob
model_template
model
model_epoch
- property engine(self) -> sqlalchemy.engine.Engine
The currently active sqlalchemy.engine.Engine.
This is useful if you want to write custom SQLAlchemy code on top of TrainStore.
- property connection(self) -> sqlalchemy.engine.Connection
The currently active sqlalchemy.engine.Connection.
This is useful if you want to write custom SQLAlchemy code on top of TrainStore.
- insert_datablob(self, datablob: scalarstop.datablob.DataBlobBase, *, ignore_existing: bool = False) -> None
Logs the DataBlob name, group name, and hyperparams to the TrainStore.
This also supports inserting other subclasses of DataBlobBase, such as DistributedDataBlob.
- Parameters
datablob – A DataBlob instance whose name and hyperparameters we want to record in the database.
ignore_existing – Set this to True to do nothing if a DataBlob with the same name is already in the database. Note that DataBlob instances are supposed to be immutable, so TrainStore does not implement updating them.
- insert_datablob_by_str(self, *, name: str, group_name: str, hyperparams: Any, ignore_existing: bool = False)
Logs the DataBlob name, group name, and hyperparams to the TrainStore.
- Parameters
name – Your DataBlob name.
group_name – Your DataBlob group name.
hyperparams – Your DataBlob hyperparameters.
ignore_existing – Set this to True to do nothing if a DataBlob with the same name is already in the database. Note that DataBlob instances are supposed to be immutable, so TrainStore does not implement updating them.
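The ignore_existing behavior can be pictured with a small sqlite3 sketch. This is not ScalarStop's implementation: the table layout and the helper insert_datablob_row are hypothetical, and raising an error on a duplicate name when ignore_existing is False is an assumption, since the documentation above does not say what happens in that case.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datablob (name TEXT PRIMARY KEY, group_name TEXT, hyperparams TEXT)"
)


def insert_datablob_row(conn, *, name, group_name, hyperparams, ignore_existing=False):
    """Insert one row; optionally do nothing when the name already exists."""
    # "INSERT OR IGNORE" is SQLite syntax that skips rows violating a
    # uniqueness constraint instead of raising an error.
    verb = "INSERT OR IGNORE" if ignore_existing else "INSERT"
    conn.execute(
        f"{verb} INTO datablob (name, group_name, hyperparams) VALUES (?, ?, ?)",
        (name, group_name, json.dumps(hyperparams, sort_keys=True)),
    )


insert_datablob_row(conn, name="mnist-abc", group_name="mnist", hyperparams={"batch_size": 32})
# A second insert with the same name is a no-op when ignore_existing=True;
# the stored hyperparameters are not updated.
insert_datablob_row(
    conn, name="mnist-abc", group_name="mnist", hyperparams={"batch_size": 64},
    ignore_existing=True,
)

try:
    insert_datablob_row(conn, name="mnist-abc", group_name="mnist", hyperparams={})
    duplicate_rejected = False
except sqlite3.IntegrityError:
    # Assumption: without ignore_existing, a duplicate name is an error.
    duplicate_rejected = True

rows = conn.execute("SELECT name, hyperparams FROM datablob").fetchall()
```

Because the existing row is left untouched, the sketch mirrors the point made above that stored entries are treated as immutable.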
- list_datablobs(self, *, datablob_name: Optional[Union[str, Sequence[str]]] = None, datablob_group_name: Optional[Union[str, Sequence[str]]] = None) -> pandas.DataFrame
Returns a pandas.DataFrame listing the DataBlob names in the database.
If you call this method without any arguments, it will list ALL of the DataBlobs in the database. You can narrow down your results by providing ONE (but not both) of the arguments datablob_name and datablob_group_name.
- insert_model_template(self, model_template, *, ignore_existing: bool = False)
Logs the ModelTemplate name, group name, and hyperparams to the TrainStore.
- Parameters
model_template – A ModelTemplate instance whose name and hyperparameters we want to record in the database.
ignore_existing – Set this to True to do nothing if a ModelTemplate with the same name is already in the database. Note that ModelTemplate instances are supposed to be immutable, so TrainStore does not implement updating them.
- insert_model_template_by_str(self, *, name: str, group_name: str, hyperparams, ignore_existing: bool = False)
Logs the ModelTemplate name, group name, and hyperparams to the TrainStore.
- Parameters
name – Your ModelTemplate name.
group_name – Your ModelTemplate group name.
hyperparams – Your ModelTemplate hyperparameters.
ignore_existing – Set this to True to do nothing if a ModelTemplate with the same name is already in the database. Note that ModelTemplate instances are supposed to be immutable, so TrainStore does not implement updating them.
- list_model_templates(self, *, model_template_name: Optional[Union[str, Sequence[str]]] = None, model_template_group_name: Optional[Union[str, Sequence[str]]] = None)
Returns a pandas.DataFrame listing ALL of the rows in the ModelTemplate table.
If you call this method without any arguments, it will list ALL of the ModelTemplates in the database. You can narrow down your results by providing ONE (but not both) of the below arguments.
- Parameters
model_template_name – Either a single ModelTemplate name or a list of names to select.
model_template_group_name – Either a single ModelTemplate group name or a list of group names to select.
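The "one but not both" rule and the "single name or list of names" convention used by the list methods can be sketched in plain Python. The helpers normalize_filter and pick_filter are hypothetical, and raising ValueError when both filters are given is an assumption; the documentation does not state which exception ScalarStop raises.

```python
from typing import Optional, Sequence, Union

NameFilter = Optional[Union[str, Sequence[str]]]


def normalize_filter(value: NameFilter) -> Optional[list]:
    """Turn a single name or a sequence of names into a list (or None)."""
    if value is None:
        return None
    if isinstance(value, str):
        # A bare string is one name, not a sequence of characters.
        return [value]
    return list(value)


def pick_filter(*, name: NameFilter = None, group_name: NameFilter = None):
    """Return (column, values) for at most one filter; reject both at once."""
    if name is not None and group_name is not None:
        raise ValueError("Provide either name or group_name, but not both.")
    if name is not None:
        return "name", normalize_filter(name)
    if group_name is not None:
        return "group_name", normalize_filter(group_name)
    return None, None  # No filter: list everything.
```

Checking for str before treating the value as a sequence matters, because a Python string is itself a sequence of characters.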
- insert_model(self, model, *, ignore_existing: bool = False)
Logs the Model name, DataBlob, and ModelTemplate to the TrainStore.
- Parameters
model – A Model instance whose name and hyperparameters we want to record in the database.
ignore_existing – Set this to True to do nothing if a Model with the same name is already in the database. The TrainStore does not implement updating a Model name or hyperparameters. The only way to change a Model is to log more epochs.
- insert_model_by_str(self, *, name: str, model_class_name: str, datablob_name: str, model_template_name: str, ignore_existing: bool = False) -> None
Logs the Model name, DataBlob, and ModelTemplate to the TrainStore.
- Parameters
name – The Model name.
model_class_name – The Model subclass name used. If you are using KerasModel, then this value is the string "KerasModel".
datablob_name – The DataBlob name used to create the Model instance.
model_template_name – The ModelTemplate name used to create the Model instance.
ignore_existing – Set this to True to do nothing if a Model with the same name is already in the database. The TrainStore does not implement updating a Model name or hyperparameters. The only way to change a Model is to log more epochs.
- list_models(self, *, datablob_name: Optional[Union[str, Sequence[str]]] = None, datablob_group_name: Optional[Union[str, Sequence[str]]] = None, model_template_name: Optional[Union[str, Sequence[str]]] = None, model_template_group_name: Optional[Union[str, Sequence[str]]] = None) -> pandas.DataFrame
Returns a pandas.DataFrame listing ALL of the rows in the Model table.
If you call this method without any arguments, it will list ALL of the Models in the database. Optionally, you can narrow down the results with the following values.
Note that you can provide either datablob_name or datablob_group_name, but not both.
Similarly, you can provide either model_template_name or model_template_group_name, but not both.
- Parameters
datablob_name – Either a single DataBlob name or a list of names to select.
datablob_group_name – Either a single DataBlob group name or a list of group names to select.
model_template_name – Either a single ModelTemplate name or a list of names to select.
model_template_group_name – Either a single ModelTemplate group name or a list of group names to select.
- list_models_grouped_by_epoch_metric(self, *, metric_name: str, metric_direction: str, datablob_name: Optional[Union[str, Sequence[str]]] = None, datablob_group_name: Optional[Union[str, Sequence[str]]] = None, model_template_name: Optional[Union[str, Sequence[str]]] = None, model_template_group_name: Optional[Union[str, Sequence[str]]] = None) -> pandas.DataFrame
Returns a pandas.DataFrame listing ALL of the rows in the Model table AND a metric from each model's best-performing epoch.
You provide this method with a model epoch metric name and whether to maximize or minimize it, and it returns all of the models along with each model's best value for that metric.
Note that you can provide either datablob_name or datablob_group_name, but not both.
Similarly, you can provide either model_template_name or model_template_group_name, but not both.
- Parameters
metric_name – The name of one of the metrics tracked when training a model. This might be a value like "loss" or "val_accuracy".
metric_direction – Set this to "min" if the metric you picked in metric_name is one where lower values are better, such as "loss". Set this to "max" if higher values of your metric are better, such as "accuracy".
datablob_name – Either a single DataBlob name or a list of names to select.
datablob_group_name – Either a single DataBlob group name or a list of group names to select.
model_template_name – Either a single ModelTemplate name or a list of names to select.
model_template_group_name – Either a single ModelTemplate group name or a list of group names to select.
Returns a pandas.DataFrame with the following columns:
model_name
model_class_name
model_last_modified
datablob_name
datablob_group_name
model_template_name
model_template_group_name
sort_metric_value
ModelTemplate hyperparameter names prefixed with mth__
DataBlob hyperparameter names prefixed with dbh__
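The idea of picking each model's best epoch for a metric can be sketched with a GROUP BY query in Python's built-in sqlite3 module. This is an illustration only: the long-format model_epoch table (one row per metric value) and the helper best_metric_per_model are hypothetical, and ScalarStop's real schema and query are likely different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE model_epoch (model_name TEXT, epoch_num INTEGER, metric_name TEXT, value REAL)"
)
conn.executemany(
    "INSERT INTO model_epoch VALUES (?, ?, ?, ?)",
    [
        ("model-a", 1, "loss", 0.9),
        ("model-a", 2, "loss", 0.4),
        ("model-a", 3, "loss", 0.6),
        ("model-b", 1, "loss", 0.7),
        ("model-b", 2, "loss", 0.5),
    ],
)


def best_metric_per_model(conn, *, metric_name, metric_direction):
    """Return [(model_name, sort_metric_value)] ordered best-first."""
    if metric_direction not in ("min", "max"):
        raise ValueError("metric_direction must be 'min' or 'max'")
    # The aggregate and sort order come from a fixed whitelist, never from
    # user input, so formatting them into the SQL text is safe.
    aggregate = "MIN" if metric_direction == "min" else "MAX"
    order = "ASC" if metric_direction == "min" else "DESC"
    query = (
        f"SELECT model_name, {aggregate}(value) AS sort_metric_value "
        "FROM model_epoch WHERE metric_name = ? "
        f"GROUP BY model_name ORDER BY sort_metric_value {order}"
    )
    return conn.execute(query, (metric_name,)).fetchall()


lowest_loss = best_metric_per_model(conn, metric_name="loss", metric_direction="min")
highest_loss = best_metric_per_model(conn, metric_name="loss", metric_direction="max")
```

With metric_direction="min", each model is represented by its lowest recorded value and the models are sorted ascending, so the first row is the best model for that metric.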
- insert_model_epoch(self, *, epoch_num: int, model_name: str, metrics, steps_per_epoch: Optional[int] = None, validation_steps_per_epoch: Optional[int] = None, ignore_existing: bool = False) -> None
Logs a new epoch for a Model to the TrainStore.
- Parameters
epoch_num – The epoch number that we are adding.
model_name – The name of the Model that we are training.
metrics – A dictionary of metric names and values to save.
steps_per_epoch – The number of training steps that count as one epoch. Defaults to None, which means that an epoch is defined by how long it takes for the model's DataBlob training dataset to be exhausted.
validation_steps_per_epoch – The number of validation steps that count as one epoch. Defaults to None, which means that an epoch is defined by how long it takes for the model's DataBlob validation dataset to be exhausted.
ignore_existing – Set this to True to do nothing if the database already has a row with the same (model_name, epoch_num) pair.
- bulk_insert_model_epochs(self, model) -> None
Insert a list of Model epochs at once.
This method silently skips any epoch whose model name and epoch number are already in the database.
Currently this method only works if you are using either SQLite or PostgreSQL as the backing database.
- Parameters
model – The Model with the epochs that we want to save.
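A bulk insert that skips existing rows can be sketched with sqlite3 as follows. The schema, the helper bulk_insert_epochs, and the choice to number epochs from 1 are assumptions made for illustration and are not taken from ScalarStop. SQLite spells the skip as INSERT OR IGNORE; PostgreSQL offers INSERT ... ON CONFLICT DO NOTHING for the same purpose.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE model_epoch ("
    "model_name TEXT, epoch_num INTEGER, metrics TEXT, "
    "PRIMARY KEY (model_name, epoch_num))"
)


def bulk_insert_epochs(conn, model_name, history):
    """Insert one row per epoch, skipping (model_name, epoch_num) duplicates."""
    rows = [
        (model_name, epoch_num, json.dumps(metrics, sort_keys=True))
        for epoch_num, metrics in enumerate(history, start=1)
    ]
    # executemany submits every row in one call; OR IGNORE skips rows whose
    # primary key is already present instead of raising an error.
    conn.executemany("INSERT OR IGNORE INTO model_epoch VALUES (?, ?, ?)", rows)


history = [{"loss": 0.9}, {"loss": 0.5}]
bulk_insert_epochs(conn, "model-a", history)
# Train one more epoch and save the whole history again: only epoch 3 is new.
history.append({"loss": 0.4})
bulk_insert_epochs(conn, "model-a", history)

epoch_count = conn.execute("SELECT COUNT(*) FROM model_epoch").fetchone()[0]
```

Saving the full history repeatedly is safe under this scheme, because rows that already exist are left untouched.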
- list_model_epochs(self, model_name: Optional[Union[str, Sequence[str]]] = None) -> pandas.DataFrame
Returns a pandas.DataFrame listing Model epochs.
By default, this lists ALL epochs in the database for ALL models. You can narrow down the search with the following arguments.
- Parameters
model_name – Specify a single model name or a list of model names whose epochs we are interested in.
- get_current_epoch(self, model_name: str) -> int
Returns how many epochs a given Model has been trained for.
Returns 0 if the given model is not registered in the TrainStore.
This information is also saved in the directory created when a Model instance is saved to the filesystem and is available in the attribute current_epoch.
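The "returns 0 for an unknown model" behavior falls out naturally from a counting query, as this sqlite3 sketch shows. The two-column table and the helper count_epochs are hypothetical, and counting rows is only one plausible way to define the current epoch; ScalarStop may compute it differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE model_epoch (model_name TEXT, epoch_num INTEGER)")
conn.executemany(
    "INSERT INTO model_epoch VALUES (?, ?)",
    [("model-a", 1), ("model-a", 2), ("model-a", 3)],
)


def count_epochs(conn, model_name):
    """Return the number of logged epochs, or 0 for an unknown model."""
    # COUNT(*) over zero matching rows yields 0, so no special case is needed.
    row = conn.execute(
        "SELECT COUNT(*) FROM model_epoch WHERE model_name = ?", (model_name,)
    ).fetchone()
    return row[0]


trained = count_epochs(conn, "model-a")
unknown = count_epochs(conn, "not-registered")
```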
- get_best_model(self, *, metric_name: str, metric_direction: str, datablob_name: Optional[Union[str, Sequence[str]]] = None, datablob_group_name: Optional[Union[str, Sequence[str]]] = None, model_template_name: Optional[Union[str, Sequence[str]]] = None, model_template_group_name: Optional[Union[str, Sequence[str]]] = None) -> _ModelMetadata
Return metadata about the model with the best performance on a metric.
This method queries the database, looking for the Model with the best performance on the metric you specified in the parameter metric_name. By default, this searches ALL models in the database. Most likely, you will want to narrow down your search using the arguments below.
Note that you can provide either datablob_name or datablob_group_name, but not both.
Similarly, you can provide either model_template_name or model_template_group_name, but not both.
- Parameters
metric_name – The name of one of the metrics tracked when training a model. This might be a value like "loss" or "val_accuracy".
metric_direction – Set this to "min" if the metric you picked in metric_name is one where lower values are better, such as "loss". Set this to "max" if higher values of your metric are better, such as "accuracy".
datablob_name – Either a single DataBlob name or a list of names to select.
datablob_group_name – Either a single DataBlob group name or a list of group names to select.
model_template_name – Either a single ModelTemplate name or a list of names to select.
model_template_group_name – Either a single ModelTemplate group name or a list of group names to select.
- Returns a dataclass with the following attributes:
model_name
model_class_name
model_epoch_metrics
model_last_modified
datablob_name
datablob_group_name
datablob_hyperparams
datablob_hyperparams_flat
model_template_name
model_template_group_name
model_template_hyperparams
sort_metric_name
sort_metric_value