tensorpack.train package

exception tensorpack.train.StopTraining[source]

Bases: BaseException

An exception thrown to stop training.

class tensorpack.train.TrainConfig(dataflow=None, data=None, model=None, callbacks=None, extra_callbacks=None, monitors=None, session_creator=None, session_config=None, session_init=None, starting_epoch=1, steps_per_epoch=None, max_epoch=99999, nr_tower=1, tower=None, **kwargs)[source]

Bases: object

A collection of options to be used for trainers.

__init__(dataflow=None, data=None, model=None, callbacks=None, extra_callbacks=None, monitors=None, session_creator=None, session_config=None, session_init=None, starting_epoch=1, steps_per_epoch=None, max_epoch=99999, nr_tower=1, tower=None, **kwargs)[source]
  • dataflow (DataFlow) –

  • data (InputSource) –

  • model (ModelDescBase) –

  • callbacks (list) – a list of Callback to perform during training.

  • extra_callbacks (list) – the same as callbacks. This argument is only used to provide the defaults in addition to callbacks. The defaults are MovingAverageSummary(), ProgressBar(), MergeAllSummaries(), RunUpdateOps(). The list of callbacks that will be used in the end is callbacks + extra_callbacks.

  • monitors (list) – a list of TrainingMonitor. Defaults to TFEventWriter(), JSONWriter(), ScalarPrinter().

  • session_creator (tf.train.SessionCreator) – Defaults to sesscreate.NewSessionCreator() with the config returned by tfutils.get_default_sess_config().

  • session_config (tf.ConfigProto) – when session_creator is None, use this to create the session.

  • session_init (SessionInit) – how to initialize variables of a session. Defaults to do nothing.

  • starting_epoch (int) – The index of the first epoch.

  • steps_per_epoch (int) – the number of steps (defined by Trainer.run_step()) to run in each epoch. Defaults to the input data size.

  • max_epoch (int) – maximum number of epochs to run training.

  • nr_tower (int) – number of training towers, used by multigpu trainers.

  • tower ([int]) – list of training towers, given as relative GPU ids.
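The way callbacks and extra_callbacks combine can be sketched in plain Python (the helper name and the string stand-ins below are illustrative, not tensorpack internals):

```python
def resolve_callbacks(callbacks, extra_callbacks=None):
    # mirrors TrainConfig: the list used in the end is
    # callbacks + extra_callbacks, where extra_callbacks defaults to
    # the four callbacks listed above
    if extra_callbacks is None:
        extra_callbacks = ["MovingAverageSummary()", "ProgressBar()",
                           "MergeAllSummaries()", "RunUpdateOps()"]
    return list(callbacks) + list(extra_callbacks)

print(resolve_callbacks(["ModelSaver()"]))
```

Passing extra_callbacks=[] is how you would opt out of the defaults entirely.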

class tensorpack.train.Trainer(config=None)[source]

Bases: object

Base class for a trainer.


config is kept only for compatibility in case you’re using custom trainers with the old-style API. You should never use config.


epoch_num

The number of the currently ongoing epoch.

An epoch covers the span from just before before_epoch is called until after trigger_epoch is called; i.e., in the trigger_epoch of epoch 3, self.epoch_num is 3. Keep this in mind if you need to use self.epoch_num in your callbacks.


global_step

The tensorflow global_step, i.e. how many times hooked_sess.run has been called.


  1. global_step is incremented after each hooked_sess.run returns from TF runtime.

  2. If you make zero or more than one call to hooked_sess.run in one run_step(), local_step and global_step may increase at different speeds.
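The relationship between epoch_num, local_step and global_step can be shown with a toy loop (a simplified illustration, not tensorpack’s actual implementation; here one run_step() makes exactly one hooked_sess.run call):

```python
class ToyTrainer:
    def __init__(self):
        self.epoch_num = 0    # the currently ongoing epoch
        self.local_step = 0   # steps finished in the current epoch
        self.global_step = 0  # total number of hooked_sess.run calls

    def run_step(self):
        # one iteration; stands in for self.hooked_sess.run(self.train_op)
        self.global_step += 1

    def main_loop(self, steps_per_epoch, starting_epoch, max_epoch):
        for self.epoch_num in range(starting_epoch, max_epoch + 1):
            for self.local_step in range(steps_per_epoch):
                self.run_step()

t = ToyTrainer()
t.main_loop(steps_per_epoch=5, starting_epoch=1, max_epoch=3)
print(t.epoch_num, t.global_step)  # 3 15
```

If run_step() made two hooked_sess.run calls instead, global_step would grow twice as fast as local_step, which is exactly note 2 above.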

initialize(session_creator, session_init)[source]

Initialize self.sess and self.hooked_sess. Must be called after callbacks are setup.

is_chief = True

local_step

The number of steps that have finished in the current epoch.

main_loop(steps_per_epoch, starting_epoch, max_epoch)[source]

Run the main training loop.

  • steps_per_epoch (int) –

  • starting_epoch (int) –

  • max_epoch (int) –

register_callback(cb)[source]

Register a callback to the trainer. It can only be called before Trainer.train().


run_step()[source]

Defines what to do in one iteration. The default is: self.hooked_sess.run(self.train_op).

The behavior can be changed by either defining what is train_op, or overriding this method.

setup_callbacks(callbacks, monitors)[source]

Setup callbacks and monitors. Must be called after the main graph is built.

train(callbacks, monitors, session_creator, session_init, steps_per_epoch, starting_epoch=1, max_epoch=9999999)[source]

Implemented by:

self.setup_callbacks(callbacks, monitors)
self.initialize(session_creator, session_init)
self.main_loop(steps_per_epoch, starting_epoch, max_epoch)

You can call those methods yourself to have finer control over the details if needed.

train_with_defaults(callbacks=None, monitors=None, session_creator=None, session_init=None, steps_per_epoch=None, starting_epoch=1, max_epoch=9999999)[source]

Same as train(), but will:

  1. Append DEFAULT_CALLBACKS() to callbacks.

  2. Append DEFAULT_MONITORS() to monitors.

  3. Provide default values for every option except steps_per_epoch.


DEFAULT_MONITORS()[source]

Return the default monitors, which will be used in TrainConfig and Trainer.train_with_defaults(). They are:

  1. TFEventWriter()

  2. JSONWriter()

  3. ScalarPrinter()


DEFAULT_CALLBACKS()[source]

Return the default callbacks, which will be used in TrainConfig and Trainer.train_with_defaults(). They are:

  1. MovingAverageSummary()

  2. ProgressBar()

  3. MergeAllSummaries()

  4. RunUpdateOps()

tensorpack.train.launch_train_with_config(config, trainer)[source]

Train with a TrainConfig and a Trainer, to mimic the old training interface. It basically does the following 3 things (and you can easily do them by yourself):

  1. Setup the InputSource with automatic prefetching, for config.data or config.dataflow.

  2. Call trainer.setup_graph with the InputSource, as well as config.model.

  3. Call trainer.train with the rest of the attributes of config.



# with the old trainer:
SyncMultiGPUTrainerParameterServer(config, ps_device='gpu').train()
# with the new trainer:
launch_train_with_config(
    config, SyncMultiGPUTrainerParameterServer(towers, ps_device='gpu'))
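The three steps above can be mimicked with a toy dispatcher (FakeTrainer and the call recording are purely illustrative; the real function passes the InputSource, the model’s inputs and the optimizer through):

```python
class FakeTrainer:
    # records which of the three steps run, and in what order
    def __init__(self):
        self.calls = []

    def setup_graph(self, *args):
        self.calls.append("setup_graph")

    def train(self, *args):
        self.calls.append("train")

def toy_launch_train_with_config(config, trainer):
    input_source = ("prefetching input for", config.get("data"))  # step 1
    trainer.setup_graph(config.get("model"), input_source)        # step 2
    trainer.train(config.get("callbacks"))                        # step 3

trainer = FakeTrainer()
toy_launch_train_with_config({"data": "df", "model": "m"}, trainer)
print(trainer.calls)  # ['setup_graph', 'train']
```

The point of the helper is ordering and wiring; as the docstring says, you can easily perform the same three calls yourself.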
tensorpack.train.apply_default_prefetch(input_source_or_dataflow, trainer, towers)[source]

Apply a set of default rules to make a fast InputSource.

  • input_source_or_dataflow (InputSource | DataFlow) –

  • trainer (Trainer) –

  • towers ([int]) – list of GPU ids.

class tensorpack.train.SingleCostTrainer(config=None)[source]

Bases: tensorpack.train.tower.TowerTrainer

Base class for single-cost trainer.

Single-cost trainer has a setup_graph() method which takes (inputs_desc, input, get_cost_fn, get_opt_fn), and builds the training operations from them.

To use a SingleCostTrainer object, call trainer.setup_graph(…); trainer.train(…).

setup_graph(inputs_desc, input, get_cost_fn, get_opt_fn)[source]

Responsible for building the main training graph for single-cost training.

  • inputs_desc ([InputDesc]) –

  • input (InputSource) –

  • get_cost_fn ([tf.Tensor] -> tf.Tensor) – callable, takes some input tensors and returns a cost tensor.

  • get_opt_fn (-> tf.train.Optimizer) – callable which returns an optimizer. Will only be called once.


  1. get_cost_fn will always be called under a TowerContext, which will contain information about reuse, training/inference, scope name, etc.

  2. get_cost_fn might get called multiple times for data-parallel training or inference.

  3. To respect variable reuse, use tf.get_variable instead of tf.Variable in get_cost_fn.
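Notes 2–3 above can be illustrated without TensorFlow: get_cost_fn runs once per tower, and fetching variables through a create-or-reuse lookup (the role tf.get_variable plays) keeps a single copy shared across towers. All names below are illustrative:

```python
variables = {}

def get_variable(name, initial_value):
    # like tf.get_variable under reuse: create on the first call,
    # return the existing variable on later calls
    return variables.setdefault(name, initial_value)

def get_cost_fn(tower_inputs):
    w = get_variable("w", 2.0)            # shared across all towers
    return sum(x * w for x in tower_inputs)

# data-parallel training: the cost function is called once per GPU
towers = [[1.0, 2.0], [3.0, 4.0]]
costs = [get_cost_fn(t) for t in towers]
print(costs, len(variables))  # [6.0, 14.0] 1
```

Had get_cost_fn created a fresh variable each call (the tf.Variable pitfall in note 3), every tower would train its own disconnected copy of w.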

class tensorpack.train.TowerTrainer(config=None)[source]

Bases: tensorpack.train.base.Trainer

Base trainers for models that can be built by calling a tower function under a TowerContext.

This is required by some features that replicate the model automatically, e.g. creating a predictor.

get_predictor(input_names, output_names, device=0)[source]

Returns a callable predictor built under TowerContext(is_training=False).

  • input_names (list) – list of input names.

  • output_names (list) – list of output names.

  • device (int) – build the predictor on device ‘/gpu:{device}’ or use -1 for ‘/cpu:0’.


Returns: an OnlinePredictor.


inputs_desc

Returns: list[InputDesc] – metainfo about the inputs to the tower.

tower_func = None

Parameters: tower_func (TowerFuncWrapper) –

towers

Returns: a TowerTensorHandles object, to access the tower handles by either indices or names.

class tensorpack.train.SimpleTrainer(config=None)[source]

Bases: tensorpack.train.tower.SingleCostTrainer

Single-GPU single-cost single-tower trainer.

class tensorpack.train.QueueInputTrainer(config=None)[source]

Bases: tensorpack.train.trainers.SimpleTrainer


tensorpack.train.SyncMultiGPUTrainer(gpus)[source]

Return a default multi-GPU trainer, if you don’t care about the details. It may not be the most efficient one for your task.

Parameters: gpus (list[int]) – list of GPU ids.
class tensorpack.train.SyncMultiGPUTrainerReplicated(gpus)[source]

Bases: tensorpack.train.tower.SingleCostTrainer

Data-parallel training in “replicated” mode, where each GPU contains a replica of the whole model. It will build one tower on each GPU under its own variable scope. Each gradient update is averaged across all GPUs through NCCL.

See https://www.tensorflow.org/performance/benchmarks for details.

Parameters: gpus ([int]) – list of GPU ids.
class tensorpack.train.SyncMultiGPUTrainerParameterServer(gpus, ps_device='gpu')[source]

Bases: tensorpack.train.tower.SingleCostTrainer

Data-parallel training in “ParameterServer” mode. It builds one tower on each GPU with a shared variable scope. It synchronizes the gradients computed from each tower, averages them, and applies them to the shared variables.

See https://www.tensorflow.org/performance/benchmarks for details.

__init__(gpus, ps_device='gpu')[source]
  • gpus ([int]) – list of GPU ids.

  • ps_device – either ‘gpu’ or ‘cpu’, where variables are stored. Setting it to ‘cpu’ might help when #gpu >= 4.

class tensorpack.train.AsyncMultiGPUTrainer(gpus, scale_gradient=True)[source]

Bases: tensorpack.train.tower.SingleCostTrainer

Data-parallel training with async update. It builds one tower on each GPU with shared variable scope. Every tower computes the gradients and independently applies them to the variables, without synchronizing or averaging across towers.

__init__(gpus, scale_gradient=True)[source]
  • gpus ([int]) – list of GPU ids.

  • scale_gradient (bool) – if True, will scale each gradient by 1.0/nr_gpu.
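The effect of scale_gradient can be sketched in plain Python (an illustration, not tensorpack’s implementation): scaling each tower’s gradient by 1.0/nr_gpu keeps the magnitude of the asynchronous updates comparable to a single synchronous averaged update.

```python
def maybe_scale_gradients(grads, nr_gpu, scale_gradient=True):
    # each tower applies its own gradients; with scale_gradient=True
    # every gradient is first multiplied by 1.0 / nr_gpu
    if not scale_gradient:
        return list(grads)
    return [g * (1.0 / nr_gpu) for g in grads]

print(maybe_scale_gradients([4.0, 8.0], nr_gpu=4))  # [1.0, 2.0]
```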

class tensorpack.train.DistributedTrainerReplicated(gpus, server)[source]

Bases: tensorpack.train.tower.SingleCostTrainer

Distributed replicated training. Each worker process builds the same model on one or more GPUs. Gradients across GPUs are averaged within the worker, and get synchronously applied to the global copy of variables located on the PS. Then each worker copies the latest variables from the PS back to its local copies.

See https://www.tensorflow.org/performance/benchmarks for details.


Gradients are not averaged across workers, but applied to PS variables directly (either with or without locking depending on the optimizer).


# Create the server object like this:
hosts = ['host1.com', 'host2.com']
cluster_spec = tf.train.ClusterSpec({
    'ps': [h + ':2222' for h in hosts],
    'worker': [h + ':2223' for h in hosts]})
server = tf.train.Server(
    cluster_spec, job_name=args.job, task_index=args.task)
# initialize the trainer with this server object

# Start training like this:
(host1)$ train.py --job worker --task 0
(host1)$ train.py --job ps --task 0
(host2)$ train.py --job worker --task 1
(host2)$ train.py --job ps --task 1
__init__(gpus, server)[source]
  • gpus (list[int]) – list of GPU ids.

  • server (tf.train.Server) – the server with ps and workers.

class tensorpack.train.HorovodTrainer[source]

Bases: tensorpack.train.tower.SingleCostTrainer

Horovod trainer, supporting both multi-GPU and distributed training.

To use for multi-GPU training:

CUDA_VISIBLE_DEVICES=0,1,2,3 mpirun -np 4 --output-filename mylog python train.py

To use for distributed training:

/path/to/mpirun -np 8 -H server1:4,server2:4 -bind-to none -map-by slot --output-filename mylog -x LD_LIBRARY_PATH -x CUDA_VISIBLE_DEVICES=0,1,2,3 python train.py


  1. If using all GPUs, you can always skip the CUDA_VISIBLE_DEVICES option.

  2. Regarding performance, horovod is expected to be slightly slower than native tensorflow for multi-GPU training, but faster for distributed training.

  3. Due to the use of MPI, training is less informative (no progress bar). It’s recommended to use other multi-GPU trainers for single-node experiments, and use horovod to scale to multiple nodes.