Configuration basics#

The Ablator framework uses a configuration system to define everything related to training machine learning models, from the model architecture to the environment in which it is trained.

Ablator can dynamically create a hierarchical configuration by composition. You can either override it through YAML config files and the command line, or work directly with Python objects and classes. Refer to these examples or the last two sections of this tutorial to see how to implement these two methods.

Configuration categories#

For our framework, configuration is organized into different categories:

  • Running configuration (either for training a single model or training multiple models in parallel)

  • Model configuration

  • Training configuration

  • Optimizer configuration

  • Scheduler configuration

Most of these are used together for ablator to work seamlessly.

RunConfig#

RunConfig is used to configure the experiment environment, e.g. where to store experiment artifacts (loss, accuracy, other evaluation metrics), the device to use (GPU, CPU), and when to run validation steps or log progress during the experiment.

The table below summarizes the parameters, both required and customizable. Note that RunConfig requires TrainConfig and ModelConfig to be included during initialization; these are covered in the next sections of this tutorial.

| Parameter | Usage |
| --- | --- |
| experiment_dir | Location to store experiment artifacts. |
| random_seed | Random seed. |
| train_config | Training configuration (see TrainConfig for more details). |
| model_config | Model configuration (see ModelConfig for more details). |
| keep_n_checkpoints | Number of latest checkpoints to keep. |
| tensorboard | Whether to use TensorboardLogger. |
| amp | Whether to use automatic mixed precision when running on GPU. |
| device | Device to run on. |
| verbose | Verbosity level. |
| eval_subsample | Fraction of the dataset to use for evaluation. |
| metrics_n_batches | Maximum number of batches stored in every tag (train, eval, test) for evaluation. |
| metrics_mb_limit | Maximum number of megabytes stored in every tag (train, eval, test) for evaluation. |
| early_stopping_iter | Maximum allowed difference between the current iteration and the last iteration with the best metric before applying early stopping. Early stopping is triggered if the difference (current_itr - best_itr) exceeds early_stopping_iter. If set to None, early stopping is not applied. |
| eval_epoch | The epoch interval between two evaluations. |
| log_epoch | The epoch interval between two logging events. |
| init_chkpt | Path to a checkpoint to initialize the model with. |
| warm_up_epochs | Number of epochs treated as warm-up epochs. |
| divergence_factor | If cur_loss > best_loss * divergence_factor, the model is considered to have diverged. |
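As a preview, here is a minimal, hedged sketch of putting a RunConfig together. TrainConfig and ModelConfig are covered in the next sections, so the nested values below are illustrative placeholders only.

from ablator import RunConfig, TrainConfig, ModelConfig, OptimizerConfig

# Illustrative placeholders; TrainConfig and ModelConfig are explained in
# the sections that follow.
run_config = RunConfig(
    train_config=TrainConfig(
        dataset="test",
        batch_size=128,
        epochs=2,
        optimizer_config=OptimizerConfig(name="sgd", arguments={"lr": 0.1}),
        scheduler_config=None,
    ),
    model_config=ModelConfig(),
    experiment_dir="/tmp/dir",
    device="cpu",
    eval_epoch=1,
    log_epoch=1,
)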

ParallelConfig#

ParallelConfig is a subclass of RunConfig. It introduces additional arguments to configure parallel training and enable horizontal scaling of a single experiment, such as the number of trials, the maximum number of trials to run concurrently, the target metrics to optimize, and more.

| Parameter | Usage |
| --- | --- |
| total_trials | Total number of trials. |
| concurrent_trials | Number of trials to run concurrently. |
| search_space | Search space for hyperparameter search, e.g. {"train_config.optimizer_config.arguments.lr": SearchSpace(value_range=[0, 10], value_type="int")}. |
| optim_metrics | Metrics to optimize, e.g. {"val_loss": "min"}. |
| search_algo | Type of search algorithm. |
| ignore_invalid_params | Whether to ignore invalid parameters when sampling. |
| remote_config | Remote storage configuration. |
| gcp_config | GCP configuration. |
| gpu_mb_per_experiment | GPU resources (in megabytes) to assign to each experiment. |
| cpus_per_experiment | CPU resources to assign to each experiment. |

It’s worth mentioning search_space, which is used to define a set of continuous or categorical/discrete values for a hyperparameter that you want to ablate. Refer to Search Space basics to learn more about how to use it for ablation; a brief sketch of a ParallelConfig using a search space follows below.
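Putting several of these arguments together, a hedged sketch of a ParallelConfig for a small HPO run might look like the following. It assumes train_config and model_config objects like those defined later in this tutorial; all other values are illustrative.

from ablator import ParallelConfig, SearchSpace

# train_config and model_config are defined in later sections; the values
# here are illustrative only.
parallel_config = ParallelConfig(
    train_config=train_config,
    model_config=model_config,
    total_trials=10,
    concurrent_trials=2,
    optim_metrics={"val_loss": "min"},
    search_space={
        "train_config.optimizer_config.arguments.lr": SearchSpace(
            value_range=[0.001, 0.1], value_type="float"
        ),
    },
)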

ModelConfig#

This configuration can be used to add parameters specific to the model you’re using. A sample use case is when you want to try different model sizes, numbers of layers, activation functions, etc. You can do this by creating a custom ModelConfig class for the model and including these parameters. One advantage of this is that ablator can create a search space over the parameters and then run hyperparameter optimization.

There are two steps required after defining a custom model config class for your model:

  • Pass the custom config to the model’s constructor so the model can be built using the parameters defined in the custom config.

  • Create a custom running config class (decorated with the configclass decorator) that updates the model_config argument to the proper type, e.g. MyModelConfig in the example below (since the model_config attribute of the running configuration, RunConfig or ParallelConfig, is originally of type ModelConfig).

Note that in the model config class, arguments can be defined with the Stateless or Derived data types. These are custom Python annotations that define attributes to which the experiment state is agnostic.

  • Stateless is used if a variable can take different value assignments between trials or experiments. For example, the learning rate should be stateless, since we can resume training a model with a different learning rate. Note that if you declare a variable as Stateless, it must be assigned an initial value before launching the experiment.

  • Derived attributes are Stateful but are undecided at the start of the experiment. Their values are determined by internal experiment processes that can depend on other experimental attributes, e.g. a model input size that depends on the dataset.

  • Stateful is the opposite of Stateless, i.e. its value must be the same between different experiments. For example, when you continue training a paused model, the model architecture should stay the same (number of layers, output size). Stateful variables, defined as a primitive datatype, are required at initialization.

Below is an example of a simple one-layer neural network model, with configuration for the input size (to be inferred), the hidden layer dimension, activation function, and dropout rate (all stateful), and the learning rate (stateless).

from ablator import RunConfig, ModelConfig, Stateless, Derived, configclass

import torch.nn as nn
import torch

class MyModelConfig(ModelConfig):
    inp_size: Derived[int]  # inferred from the dataset before training starts
    lr: Stateless[float]    # may differ between trials or experiments
    hidden_dim: int         # stateful: fixed across experiments
    activation: str         # stateful
    dropout: float          # stateful

@configclass
class CustomRunConfig(RunConfig):
    model_config: MyModelConfig  # narrow model_config to the custom type

class MyCustomModel(nn.Module):
    def __init__(self, config: MyModelConfig) -> None:
        super().__init__()
        # Build the model from the values defined in the custom config.
        self.linear = nn.Linear(config.inp_size, config.hidden_dim)
        self.dropout = nn.Dropout(config.dropout)
        if config.activation == "relu":
            self.activate = nn.ReLU()
        elif config.activation == "elu":
            self.activate = nn.ELU()

    def forward(self, x: torch.Tensor):
        out = self.linear(x)
        out = self.dropout(out)
        out = self.activate(out)

        # Return a dictionary of outputs and a (dummy) scalar loss.
        return {"preds": out, "labels": out}, x.sum().abs()

# inp_size is Derived, so it is not passed here; it is inferred at runtime.
model_config = MyModelConfig(lr=0.01, hidden_dim=100, activation="relu", dropout=0.3)
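To see how the pieces fit together, here is a minimal, hedged sketch of building the model by hand. In a real run, ablator resolves Derived fields such as inp_size from the dataset before the model is constructed; the manual assignment below is purely illustrative.

# Illustrative only: ablator normally infers inp_size from the dataset.
model_config.inp_size = 32
model = MyCustomModel(model_config)

out, loss = model(torch.randn(4, 32))  # a batch of 4 random input vectors
print(out["preds"].shape)              # expected: torch.Size([4, 100])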

TrainConfig#

This configuration class defines everything related to the main training process of your model, including the dataset name, batch size, number of epochs, optimizer, and scheduler. Two important attributes to mention are optimizer_config and scheduler_config. As the names suggest, they configure the optimizer and scheduler used in the training process.

| Parameter | Usage |
| --- | --- |
| dataset | Dataset name; may be used in custom dataset loader functions. |
| batch_size | Batch size. |
| epochs | Number of epochs to train. |
| optimizer_config | Optimizer configuration (see OptimizerConfig for more details). |
| scheduler_config | Scheduler configuration (see SchedulerConfig for more details). |
| rand_weights_init | Whether to initialize model weights randomly. |

OptimizerConfig and SchedulerConfig#

OptimizerConfig is a config class that lets users choose the optimizer they want. Currently, we support the SGD, Adam, and AdamW optimizers.

SchedulerConfig, on the other hand, can be used for scheduling learning rate updates in the training process.

Both of these config classes have similar arguments:

| Parameter | Usage |
| --- | --- |
| name | The type of the scheduler or optimizer; any of ['None', 'step', 'cycle', 'plateau'] for schedulers and ['sgd', 'adam', 'adamw'] for optimizers. |
| arguments | The arguments for the scheduler or optimizer, specific to that type. |

The table below shows how arguments can be defined for each type of optimizer:

| Optimizer type | Arguments |
| --- | --- |
| sgd | weight_decay: weight decay rate. momentum: momentum factor. |
| adam | betas: coefficients for computing running averages of the gradient and its square (default (0.5, 0.9)). weight_decay: weight decay rate (default 0.0). |
| adamw | betas: coefficients for computing running averages of the gradient and its square (default (0.9, 0.999)). eps: term added to the denominator to improve numerical stability (default 1e-8). weight_decay: weight decay rate (default 0.0). |
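For example, a hedged sketch of an Adam optimizer configuration with explicit arguments (the argument names follow the table above; the values are illustrative):

from ablator import OptimizerConfig

# Adam with explicit betas and weight decay; values are illustrative.
adam_config = OptimizerConfig(
    name="adam",
    arguments={"lr": 1e-3, "betas": (0.9, 0.999), "weight_decay": 1e-4},
)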

The table below shows how arguments can be defined for each type of scheduler:

| Scheduler type | Arguments |
| --- | --- |
| cycle | max_lr: upper learning rate boundary in the cycle. total_steps: total number of steps to run the scheduler in a cycle. step_when: the step type at which scheduler.step() should be invoked ('train', 'val', or 'epoch'). |
| plateau | patience: number of epochs with no improvement after which the learning rate is reduced. min_lr: a lower bound on the learning rate. mode: one of 'min', 'max', or 'auto'; defines the direction of optimization so the learning rate can be adjusted accordingly, i.e. when a certain metric stops improving. factor: factor by which the learning rate is reduced (new_lr = lr * factor). threshold: threshold for measuring the new optimum, to focus only on significant changes. verbose: if True, prints a message to stdout for each update. step_when: the step type at which the scheduler should be invoked ('train', 'val', or 'epoch'). |
| step | step_size: period of learning rate decay (default 1). gamma: multiplicative factor of learning rate decay (default 0.99). step_when: the step type at which the scheduler should be invoked ('train', 'val', or 'epoch'). |
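Similarly, a hedged sketch of a plateau scheduler that reduces the learning rate once the monitored metric stops improving (argument names follow the table above; values are illustrative):

from ablator import SchedulerConfig

# Reduce the learning rate when the monitored metric plateaus.
plateau_config = SchedulerConfig(
    name="plateau",
    arguments={"patience": 5, "min_lr": 1e-5, "mode": "min", "factor": 0.5},
)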

The following code snippet shows how to initialize the optimizer and scheduler configurations used in the rest of this tutorial:

from ablator import OptimizerConfig, SchedulerConfig

optimizer_config = OptimizerConfig(name="sgd", arguments={"lr": 0.1})
scheduler_config = SchedulerConfig(name="cycle", arguments={"max_lr": 0.5, "total_steps": 50})

Now let’s combine everything. The Ablator trainer requires a model wrapper and a running config at initialization; after that, the experiment can be launched via trainer.launch(). Note that this tutorial focuses only on defining the running configuration run_config; for the Ablator trainer, refer to Prototyping models and HPO.

Take the code snippet below as an example: train_config sets up the dataset, batch size, and epochs, and references the optimizer and scheduler configurations. Next, the config object combines train_config and model_config, along with runtime settings like verbosity and device.

from ablator import TrainConfig

train_config = TrainConfig(
    dataset="test",
    batch_size=128,
    epochs=2,
    optimizer_config=optimizer_config,
    scheduler_config=scheduler_config,
)

config = CustomRunConfig(
    train_config=train_config,
    model_config=model_config,  # the MyModelConfig instance created earlier
    verbose="silent",
    device="cpu",
)

With the configuration created, we are halfway to running an ablation experiment with the ablator trainer.

from ablator import ParallelTrainer

# model_wrapper is covered in the next chapter; run_config is the running
# configuration (for ParallelTrainer, a ParallelConfig).
trainer = ParallelTrainer(wrapper=model_wrapper, run_config=run_config)
trainer.launch()

In the next chapter, you will learn how to create the model wrapper, the other half that’s left. We will start with training a single model.

Different methods to define running configurations#

There are three ways to provide values to the configurations: named arguments, file-based, or dictionary-based. All the examples in the previous sections use the named-arguments method. Now let’s look at how the file-based and dictionary-based methods work.

File-based#

File-based configuration is a way to create simple configuration files, passing configuration values in a single YAML file. Then, based on the type of running configuration you want, you can use the RunConfigClass.load(path/to/yaml/file) method to create a configuration with the values provided in the config file.

To write these config files, simply follow key: value syntax (each pair on its own line). The following example shows what a config YAML file looks like; we will name it config.yaml:

experiment_dir: "/tmp/dir"
train_config:
  dataset: test
  batch_size: 128
  epochs: 2
  optimizer_config:
    name: sgd
    arguments:
      lr: 0.1
  scheduler_config:
    name: cycle
    arguments:
      max_lr: 0.5
      total_steps: 50
model_config:
  inp_size: 50
  hidden_dim: 100
  activation: "relu"
  dropout: 0.15
verbose: "silent"
device: "cpu"

We can see that the outermost arguments are from RunConfig. Also note how train_config, which corresponds to the TrainConfig object in the running config, has its arguments defined one level below (indented). So the first rule to follow is that the arguments come from the running config class, either RunConfig or ParallelConfig, so make sure you use the right set of arguments. The second rule is that any argument that is itself a config class should be indented one level from its parent config class.

Now in your code, a single line is all that is required to load these values and create the config object:

config = CustomRunConfig.load("path/to/yaml/file")

Note that since we created a custom running configuration class, CustomRunConfig, tied to the custom model config in the previous sections, we use CustomRunConfig.load("path/to/yaml/file") to load the configuration from file. Otherwise, if you’re not creating any subclasses, RunConfig.load("path") or ParallelConfig.load("path") also works.

Dictionary based#

Another alternative is similar to the file-based method, but configurations are defined in a dictionary instead of a YAML file, and the dictionary is then passed (as keyword arguments) to the running configuration at initialization:

configuration = {
    "experiment_dir": "/tmp/dir",
    "train_config": {
        "dataset": "test",
        "batch_size": 128,
        "epochs": 2,
        "optimizer_config":{
            "name": "sgd",
            "arguments": {
                "lr": 0.1
            }
        },
        "scheduler_config":{
            "name": "cycle",
            "arguments":{
                "max_lr": 0.5,
                "total_steps": 50
            }
        }
    },
    "model_config": {
        "inp_size": 50,
        "hidden_dim": 100,
        "activation": "relu",
        "dropout": 0.15
    },
    "verbose": "silent",
    "device": "cpu"
}

config = CustomRunConfig(
    **configuration
)

Conclusion#

Now that you know how to define running configurations, you can start creating your own prototype. In the next chapter, we will learn how to write a prototype for your model, combine it with the running configuration, and launch the experiment.