tensorpack.train package

exception tensorpack.train.StopTraining[source]

Bases: BaseException

An exception thrown to stop training.

class tensorpack.train.Trainer(config=None)[source]

Bases: object

Base class for a trainer.


config is only for compatibility reasons in case you’re using custom trainers with old-style API. You should never use config.


The number of the currently ongoing epoch.

An epoch is defined to cover the moment before calling before_epoch until after calling trigger_epoch, i.e., in the trigger_epoch of epoch 3, self.epoch_num is 3. Keep this in mind if you use self.epoch_num in a callback.


The tensorflow global_step, i.e. how many times hooked_sess.run has been called.


  1. global_step is incremented after each hooked_sess.run returns from TF runtime.

  2. If you make zero calls or more than one call to hooked_sess.run in one run_step(), local_step and global_step may increment at different speeds.
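The relationship between the two counters can be illustrated with a toy model. This is a minimal sketch, not tensorpack's implementation: ToyCounters and its methods are hypothetical names, and the only behavior assumed is what the notes above state, namely that global_step advances once per hooked_sess.run call while local_step advances once per finished run_step().

```python
class ToyCounters:
    """Toy model of the step counters (hypothetical, for illustration only)."""

    def __init__(self):
        self.global_step = 0  # one increment per simulated hooked_sess.run
        self.local_step = 0   # one increment per finished run_step()

    def hooked_sess_run(self):
        # Stand-in for hooked_sess.run: global_step grows when the call returns.
        self.global_step += 1

    def run_step(self, runs_per_step):
        # A custom run_step() may call hooked_sess.run zero, one or many times.
        for _ in range(runs_per_step):
            self.hooked_sess_run()
        self.local_step += 1


default = ToyCounters()
custom = ToyCounters()
for _ in range(5):
    default.run_step(runs_per_step=1)  # the default: one session run per step
    custom.run_step(runs_per_step=2)   # e.g. two session runs per step

print(default.local_step, default.global_step)
print(custom.local_step, custom.global_step)
```

With one session run per step the counters stay in lockstep; with two runs per step global_step grows twice as fast as local_step.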

initialize(session_creator, session_init)[source]

Initialize self.sess and self.hooked_sess. Must be called after callbacks are set up.

is_chief = True

The number of steps that have finished in the current epoch.

main_loop(steps_per_epoch, starting_epoch, max_epoch)[source]

Run the main training loop.

Parameters:steps_per_epoch, starting_epoch, max_epoch (int) –

Register callbacks to the trainer. It can only be called before Trainer.train().

Parameters:cb (Callback or [Callback]) – a callback or a list of callbacks
Returns:succeed or not

Defines what to do in one iteration. The default is: self.hooked_sess.run(self.train_op).

The behavior can be changed by either defining what is train_op, or overriding this method.

setup_callbacks(callbacks, monitors)[source]

Set up callbacks and monitors. Must be called after the main graph is built.

train(callbacks, monitors, session_creator, session_init, steps_per_epoch, starting_epoch=1, max_epoch=9999999)[source]

Implemented by three lines:

self.setup_callbacks(callbacks, monitors)
self.initialize(session_creator, session_init)
self.main_loop(steps_per_epoch, starting_epoch, max_epoch)

You can call these methods yourself if you need finer control over the details.
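The three-phase structure can be sketched with a toy trainer. Everything here is a stand-in under stated assumptions rather than tensorpack's code: ToyTrainer and its stop_at_step argument are hypothetical, StopTraining is redefined locally, max_epoch is assumed to be inclusive, and raising StopTraining is assumed to end the loop cleanly.

```python
class StopTraining(BaseException):
    """Local stand-in for tensorpack.train.StopTraining."""


class ToyTrainer:
    """Hypothetical trainer that only records which phases ran."""

    def __init__(self, stop_at_step=None):
        self.log = []
        self.steps_run = 0
        self.epoch_num = 0
        self.stop_at_step = stop_at_step

    def setup_callbacks(self, callbacks, monitors):
        self.log.append("setup_callbacks")

    def initialize(self, session_creator, session_init):
        self.log.append("initialize")

    def run_step(self):
        # One iteration; a real trainer would run the training op here.
        self.steps_run += 1
        if self.steps_run == self.stop_at_step:
            raise StopTraining()

    def main_loop(self, steps_per_epoch, starting_epoch, max_epoch):
        self.log.append("main_loop")
        try:
            # Assumption: max_epoch is the last epoch that runs.
            for epoch in range(starting_epoch, max_epoch + 1):
                self.epoch_num = epoch
                for _ in range(steps_per_epoch):
                    self.run_step()
        except StopTraining:
            self.log.append("stopped")

    def train(self, callbacks, monitors, session_creator, session_init,
              steps_per_epoch, starting_epoch=1, max_epoch=9999999):
        # The same three calls as listed above.
        self.setup_callbacks(callbacks, monitors)
        self.initialize(session_creator, session_init)
        self.main_loop(steps_per_epoch, starting_epoch, max_epoch)


trainer = ToyTrainer(stop_at_step=7)
trainer.train([], [], None, None, steps_per_epoch=5, max_epoch=3)
print(trainer.log, trainer.steps_run, trainer.epoch_num)
```

Calling the three methods separately instead of train() gives the same sequence, with room to insert custom work between the phases.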

train_with_defaults(_sentinel=None, callbacks=None, monitors=None, session_creator=None, session_init=None, steps_per_epoch=None, starting_epoch=1, max_epoch=9999999, extra_callbacks=None)[source]

Same as train(), except:

  1. Add extra_callbacks to callbacks. The default value for extra_callbacks is DEFAULT_CALLBACKS().

  2. Default value for monitors is DEFAULT_MONITORS().

  3. Provide default values for every option except steps_per_epoch.
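The way extra_callbacks combines with callbacks can be sketched in plain Python. This is an illustration only: resolve_callbacks and toy_default_callbacks are hypothetical helpers, and strings stand in for real Callback objects. It assumes that None means "use the defaults" and that the final list is callbacks + extra_callbacks, as described for TrainConfig.

```python
def toy_default_callbacks():
    # Stand-in for DEFAULT_CALLBACKS(); names taken from the list in this page.
    return ["MovingAverageSummary", "ProgressBar",
            "MergeAllSummaries", "RunUpdateOps"]


def resolve_callbacks(callbacks=None, extra_callbacks=None):
    """Return the callback list that would be used in the end (toy version)."""
    callbacks = list(callbacks or [])
    if extra_callbacks is None:
        # Left unspecified: fall back to the defaults.
        extra_callbacks = toy_default_callbacks()
    return callbacks + list(extra_callbacks)


with_defaults = resolve_callbacks(["ModelSaver"])
without_defaults = resolve_callbacks(["ModelSaver"], extra_callbacks=[])
print(with_defaults)
print(without_defaults)
```

Passing an empty list for extra_callbacks is the way, in this sketch, to opt out of all defaults while keeping your own callbacks.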

class tensorpack.train.TrainConfig(dataflow=None, data=None, model=None, callbacks=None, extra_callbacks=None, monitors=None, session_creator=None, session_config=None, session_init=None, starting_epoch=1, steps_per_epoch=None, max_epoch=99999, **kwargs)[source]

Bases: object

A collection of options to be used for single-cost trainers.

__init__(dataflow=None, data=None, model=None, callbacks=None, extra_callbacks=None, monitors=None, session_creator=None, session_config=None, session_init=None, starting_epoch=1, steps_per_epoch=None, max_epoch=99999, **kwargs)[source]
  • dataflow (DataFlow) –

  • data (InputSource) –

  • model (ModelDesc) –

  • callbacks (list) – a list of Callback to perform during training.

  • extra_callbacks (list) –

    the same as callbacks. This argument is only used to provide the defaults in addition to callbacks. The list of callbacks that will be used in the end is callbacks + extra_callbacks.

    It is usually left as None and the default value for this option will be the return value of train.DEFAULT_CALLBACKS(). You can override it when you don’t like any of the default callbacks.

  • monitors (list) – a list of TrainingMonitor. Defaults to the return value of train.DEFAULT_MONITORS().

  • session_creator (tf.train.SessionCreator) – Defaults to sesscreate.NewSessionCreator() with the config returned by tfutils.get_default_sess_config().

  • session_config (tf.ConfigProto) – when session_creator is None, use this to create the session.

  • session_init (SessionInit) – how to initialize variables of a session. Defaults to doing nothing.

  • starting_epoch (int) – The index of the first epoch.

  • steps_per_epoch (int) – the number of steps (defined by Trainer.run_step()) to run in each epoch. Defaults to the input data size.

  • max_epoch (int) – maximum number of epochs to run.

class tensorpack.train.AutoResumeTrainConfig(always_resume=True, **kwargs)[source]

Bases: tensorpack.train.config.TrainConfig

Same as TrainConfig, but does the following to automatically resume training:

  1. If a checkpoint was found in logger.get_logger_dir(), set session_init option to load it.

  2. If a JSON history was found in logger.get_logger_dir(), try to load the epoch number from it and set the starting_epoch option to continue training.

You can choose whether the two options above overwrite user-provided arguments, as explained below.

__init__(always_resume=True, **kwargs)[source]
  • always_resume (bool) – If False, user-provided arguments session_init and starting_epoch will take priority. Otherwise, resume will take priority.

  • kwargs – same as in TrainConfig.


The main goal of this class is to let a training job resume without changing any line of code or any command-line argument. This is why it is sometimes useful to let resume take priority over user-provided arguments:

If your training starts from a pretrained model, you would want it to use user-provided model loader at the beginning, but a “resume” model loader when the job was interrupted and restarted.
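The priority rule controlled by always_resume can be sketched as a small decision function. This is a hypothetical helper, not the class's actual code: pick_option and its argument names are invented for illustration, None stands for "not provided" or "nothing found in the logger directory", and the function applies equally to session_init and starting_epoch.

```python
def pick_option(user_value, resumed_value, always_resume=True):
    """Decide which value wins for one option (toy version).

    user_value:    what the caller passed, or None if not provided.
    resumed_value: what was recovered from the log directory, or None.
    """
    if resumed_value is None:
        # Nothing to resume from: the user-provided value is all there is.
        return user_value
    if always_resume or user_value is None:
        # Resume takes priority, or the user expressed no preference.
        return resumed_value
    # always_resume=False: user-provided arguments take priority.
    return user_value


first_run = pick_option("pretrained-model", None)
restarted = pick_option("pretrained-model", "latest-checkpoint")
user_wins = pick_option("pretrained-model", "latest-checkpoint",
                        always_resume=False)
print(first_run, restarted, user_wins)
```

The first two calls mirror the pretrained-model scenario above: the same arguments load the pretrained model on the first run and the checkpoint after a restart.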


Return the default callbacks, which will be used in TrainConfig and Trainer.train_with_defaults(). They are:

  1. MovingAverageSummary()

  2. ProgressBar()

  3. MergeAllSummaries()

  4. RunUpdateOps()


Return the default monitors, which will be used in TrainConfig and Trainer.train_with_defaults(). They are:

  1. TFEventWriter()

  2. JSONWriter()

  3. ScalarPrinter()

tensorpack.train.launch_train_with_config(config, trainer)[source]

Train with a TrainConfig and a Trainer, to present a simple training interface. It basically does the following 3 things (and you can easily do them by yourself if you need more control):

  1. Set up the input with automatic prefetching, from config.data or config.dataflow.

  2. Call trainer.setup_graph with the input as well as config.model.

  3. Call trainer.train with the rest of the attributes of config.



launch_train_with_config(
    config, SyncMultiGPUTrainerParameterServer(8, ps_device='gpu'))
tensorpack.train.apply_default_prefetch(input_source_or_dataflow, trainer)[source]

Apply a set of default rules to make a fast InputSource.

  • input_source_or_dataflow (InputSource | DataFlow) –

  • trainer (Trainer) –



class tensorpack.train.SingleCostTrainer(config=None)[source]

Bases: tensorpack.train.tower.TowerTrainer

Base class for single-cost trainer.

A single-cost trainer has a setup_graph() method which takes (inputs_desc, input, get_cost_fn, get_opt_fn) and builds the training operations from them.

To use a SingleCostTrainer object, call trainer.setup_graph(…); trainer.train(…).

setup_graph(inputs_desc, input, get_cost_fn, get_opt_fn)[source]

Responsible for building the main training graph for single-cost training.

  • inputs_desc ([InputDesc]) –

  • input (InputSource) –

  • get_cost_fn ([tf.Tensor] -> tf.Tensor) – callable, takes some input tensors and returns a cost tensor.

  • get_opt_fn (-> tf.train.Optimizer) – callable which returns an optimizer. Will only be called once.


get_cost_fn will be the tower function. It must follow the rules of tower functions.

class tensorpack.train.TowerTrainer(config=None)[source]

Bases: tensorpack.train.base.Trainer

Base trainers for models that can be built by calling a tower function under a TowerContext.

This is required by some features that replicate the model automatically, e.g. creating a predictor.

To use features of TowerTrainer, set tower_func and use it to build the graph. Note that tower_func can only be set once per instance.

get_predictor(input_names, output_names, device=0)[source]

Returns a callable predictor built under TowerContext(is_training=False).

  • input_names (list) – list of input names, matching the inputs declared for the trainer.

  • output_names (list) – list of tensor names without the tower prefix.

  • device (int) – build the predictor on device ‘/gpu:{device}’ or use -1 for ‘/cpu:0’.


Returns:an OnlinePredictor.


# in the graph:
interesting_tensor = tf.identity(x, name='fun')
# in _setup_graph callback method:
self._predictor = self.trainer.get_predictor(['input1'], ['fun'])
# After session is initialized (see Tutorials - Write a Callback), can use it by:
outputs = self._predictor(inputs)

The CycleGAN example and DQN example have more concrete use of this method.


Returns – list[InputDesc]: metainfo about the inputs to the tower.


A TowerFuncWrapper instance. A callable which takes some input tensors and builds one replicate of the model.

Returns – a TowerTensorHandles object, to access the tower handles by either indices or names.

It is accessible only after the graph is set up.

class tensorpack.train.SimpleTrainer(config=None)[source]

Bases: tensorpack.train.tower.SingleCostTrainer

Single-GPU single-cost single-tower trainer.

class tensorpack.train.QueueInputTrainer(config=None)[source]

Bases: tensorpack.train.trainers.SimpleTrainer


Return a default multi-GPU trainer, if you don’t care about the details. It may not be the most efficient one for your task.

Parameters:gpus (list[int]) – list of GPU ids.
class tensorpack.train.SyncMultiGPUTrainerReplicated(gpus, average=True, mode=None, use_nccl=None)[source]

Bases: tensorpack.train.tower.SingleCostTrainer

Data-parallel training in “replicated” mode, where each GPU contains a replicate of the whole model. It will build one tower on each GPU under its own variable scope. Each gradient update is averaged or summed across GPUs through NCCL.

It is an equivalent of --variable_update=replicated in tensorflow/benchmarks.

grads: #GPU number of lists of (g, v). Synchronized gradients on each device, available after build().
Though on different devices, they should contain the same values.
__init__(gpus, average=True, mode=None, use_nccl=None)[source]
  • gpus (int or [int]) – list of GPU ids.

  • average (bool) – whether to average or sum gradients.

  • mode (str or None) – Gradient aggregation mode. Supported values: [‘nccl’, ‘hierarchical’, ‘cpu’]. Defaults to picking one automatically by heuristics. These modes may have slight (within 5%) differences in speed.
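The effect of the average option can be shown with plain numbers. This is a toy sketch, not the trainer's implementation: aggregate_gradients is a hypothetical helper, floats stand in for gradient tensors, and the actual cross-GPU communication (NCCL or otherwise) is not modeled.

```python
def aggregate_gradients(tower_grads, average=True):
    """Combine per-tower gradients variable by variable (toy version).

    tower_grads: one list per tower, each holding one gradient per variable.
    """
    nr_tower = len(tower_grads)
    # zip(*...) groups the gradients of the same variable across towers.
    summed = [sum(per_variable) for per_variable in zip(*tower_grads)]
    if average:
        return [g / nr_tower for g in summed]
    return summed


# Two towers, two variables each.
tower_grads = [[1.0, 2.0],
               [3.0, 4.0]]
averaged = aggregate_gradients(tower_grads, average=True)
summed = aggregate_gradients(tower_grads, average=False)
print(averaged, summed)
```

Summing instead of averaging multiplies the effective step size by the number of towers, so the learning rate usually needs to be chosen with this option in mind.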

devices = None
class tensorpack.train.SyncMultiGPUTrainerParameterServer(gpus, ps_device=None)[source]

Bases: tensorpack.train.tower.SingleCostTrainer

Data-parallel training in ‘ParameterServer’ mode. It builds one tower on each GPU with shared variable scope. It synchronizes the gradients computed from each tower, averages them and applies them to the shared variables.

It is an equivalent of --variable_update=parameter_server in tensorflow/benchmarks.

grads: list of (g, v). Averaged gradients, available after build()
__init__(gpus, ps_device=None)[source]
  • gpus ([int]) – list of GPU ids.

  • ps_device – either ‘gpu’ or ‘cpu’, where variables are stored. The default value is subject to change.

devices = None
class tensorpack.train.AsyncMultiGPUTrainer(gpus, scale_gradient=True)[source]

Bases: tensorpack.train.tower.SingleCostTrainer

Data-parallel training with async update. It builds one tower on each GPU with shared variable scope. Every tower computes the gradients and independently applies them to the variables, without synchronizing and averaging across towers.

__init__(gpus, scale_gradient=True)[source]
  • gpus ([int]) – list of GPU ids.

  • scale_gradient (bool) – if True, will scale each gradient by 1.0/nr_gpu.
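Why scaling by 1.0/nr_gpu can be useful is easiest to see with a single scalar weight. The sketch below is a simplification under stated assumptions: async_updates is a hypothetical helper, plain SGD is assumed, and the towers apply their updates one after another, whereas real asynchronous updates are unordered and may use stale values.

```python
def async_updates(weight, tower_grads, learning_rate, scale_gradient=True):
    """Apply each tower's gradient independently to a shared weight (toy)."""
    nr_gpu = len(tower_grads)
    for grad in tower_grads:
        if scale_gradient:
            # Each tower contributes only 1/nr_gpu of its gradient.
            grad = grad * (1.0 / nr_gpu)
        weight -= learning_rate * grad
    return weight


grads = [2.0, 2.0, 2.0, 2.0]  # four towers that happen to agree
scaled = async_updates(1.0, grads, learning_rate=0.5, scale_gradient=True)
unscaled = async_updates(1.0, grads, learning_rate=0.5, scale_gradient=False)
print(scaled, unscaled)
```

With scaling, the four independent updates add up to roughly one single-GPU step; without it, the weight moves four times as far for the same learning rate.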

devices = None
class tensorpack.train.DistributedTrainerParameterServer(gpus, server, caching_device='cpu')[source]

Bases: tensorpack.train.trainers.DistributedTrainerBase

Distributed parameter server training. A single copy of the parameters is scattered across the PS. Gradients across GPUs are averaged within the worker, and applied to PS. Each worker also caches the variables for reading.

It is an equivalent of --variable_update=parameter_server in tensorflow/benchmarks. However this implementation hasn’t been well tested. It probably still has issues in model saving, etc. Check ResNet-Horovod for fast and correct distributed examples.


1. Gradients are not averaged across workers, but applied to PS variables directly (either with or without locking depending on the optimizer).

__init__(gpus, server, caching_device='cpu')[source]
  • gpus ([int]) – list of GPU ids.

  • server (tf.train.Server) – the server with ps and workers.

  • caching_device (str) – either ‘cpu’ or ‘gpu’. The device to cache variables copied from PS

class tensorpack.train.DistributedTrainerReplicated(gpus, server)[source]

Bases: tensorpack.train.trainers.DistributedTrainerBase

Distributed replicated training. Each worker process builds the same model on one or more GPUs. Gradients across GPUs are averaged within the worker, and get synchronously applied to the global copy of variables located on PS. Then each worker copies the latest variables from PS back to its local copy.

It is an equivalent of --variable_update=distributed_replicated in tensorflow/benchmarks. Note that the performance of this trainer is still not satisfactory. Check ResNet-Horovod for fast and correct distributed examples.


  1. Gradients are not averaged across workers, but applied to PS variables directly (either with or without locking depending on the optimizer).

  2. Some details about collections: all variables created inside tower will become local variables, and a clone will be made in global variables for all trainable/model variables.


# Create the server object like this:
import tensorflow as tf

hosts = ['host1.com', 'host2.com']
cluster_spec = tf.train.ClusterSpec({
    'ps': [h + ':2222' for h in hosts],
    'worker': [h + ':2223' for h in hosts]
})
# args.job and args.task are the --job and --task command-line options used below.
server = tf.train.Server(
    cluster_spec, job_name=args.job, task_index=args.task)
# initialize trainer with this server object

# Start training like this:
(host1)$ ./train.py --job worker --task 0
(host1)$ CUDA_VISIBLE_DEVICES= ./train.py --job ps --task 0
(host2)$ ./train.py --job worker --task 1
(host2)$ CUDA_VISIBLE_DEVICES= ./train.py --job ps --task 1
__init__(gpus, server)[source]
  • gpus (list[int]) – list of GPU ids.

  • server (tf.train.Server) – the server with ps and workers.

class tensorpack.train.HorovodTrainer(average=True)[source]

Bases: tensorpack.train.tower.SingleCostTrainer

Horovod trainer, supports multi-GPU and distributed training.

To use for multi-GPU training:

CUDA_VISIBLE_DEVICES=0,1,2,3 mpirun -np 4 --output-filename mylog python train.py

To use for distributed training:

/path/to/mpirun -np 8 -H server1:4,server2:4 -bind-to none -map-by slot --output-filename mylog -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py

(Add other environment variables you need by -x, e.g. PYTHONPATH, PATH)


  1. If using all GPUs, you can always skip the CUDA_VISIBLE_DEVICES option.

  2. Due to the use of MPI, training is less informative (no progress bar).

  3. MPI often fails to kill all processes. Be sure to check it.

Parameters:average (bool) – whether to average or sum the gradients across processes.