tensorpack.train package

class tensorpack.train.Trainer(config)[source]

Bases: object

Base class for a trainer.

config

TrainConfig – the config used in this trainer.

model

ModelDesc – alias for config.model.

sess

tf.Session – the current session in use.

hooked_sess

tf.train.MonitoredSession – the session with hooks.

monitors

Monitors – the monitors. Other callbacks can use it for logging.

local_step

int – the number of (tensorpack) steps that have finished in the current epoch.

__init__(config)[source]
Parameters:config (TrainConfig) – the train config.
epoch_num

The number of the currently ongoing epoch.

An epoch is defined to cover the period from before calling before_epoch until after calling trigger_epoch; i.e., inside the trigger_epoch of epoch 3, self.epoch_num is 3. Keep this in mind if you need to use self.epoch_num in your callbacks.
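
For example, a callback that reads the epoch number (a minimal sketch; PrintEpochNum is hypothetical, while Callback and _trigger_epoch are the standard tensorpack callback interface):

from tensorpack.callbacks import Callback

class PrintEpochNum(Callback):
    def _trigger_epoch(self):
        # here self.trainer.epoch_num equals the epoch that just finished
        print("Epoch {} finished".format(self.trainer.epoch_num))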

get_predictor(input_names, output_names, tower=0)[source]

Returns a callable predictor built under a tower context with is_training=False. Note that this method is only valid when this trainer has a ModelDesc.

Parameters:
  • input_names (list) – list of input tensor names

  • output_names (list) – list of output tensor names

  • tower (int) – build the predictor on device ‘/gpu:{tower}’ or use -1 for ‘/cpu:0’.

Returns:

an OnlinePredictor.
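
For example (a sketch; the tensor names 'input' and 'output' are placeholders for whatever your ModelDesc defines):

pred = trainer.get_predictor(['input'], ['output'])  # built on /gpu:0
# call it like a function: returns one numpy array per output name
outputs = pred(input_data)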

global_step

The TensorFlow global_step, i.e. how many times hooked_sess.run has been called.

Note

  1. global_step is incremented after each hooked_sess.run returns from TF runtime.

  2. If you make zero calls, or more than one call, to hooked_sess.run inside one run_step(), local_step and global_step may increment at different speeds.

is_chief = True
main_loop()[source]

Run the main training loop.

register_callback(cb)[source]

Register a callback to the trainer. It can only be called before Trainer.train() gets called.

register_monitor(mon)[source]

Register a monitor to the trainer. It can only be called before Trainer.train() gets called.

run_step()[source]

Defines what to do in one iteration. The default is: self.hooked_sess.run(self.train_op).

The behavior can be changed by either defining what is train_op, or overriding this method.
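
For example, a trainer that runs the training op twice per step could override it as follows (a minimal sketch; MyTrainer is hypothetical):

class MyTrainer(Trainer):
    def run_step(self):
        # two hooked_sess.run calls per step, so global_step will
        # advance twice as fast as local_step (see the note under global_step)
        self.hooked_sess.run(self.train_op)
        self.hooked_sess.run(self.train_op)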

setup()[source]

Setup the trainer and be ready for the main loop.

train()[source]

Start training.

exception tensorpack.train.StopTraining[source]

Bases: BaseException

An exception thrown to stop training.

class tensorpack.train.TrainConfig(dataflow=None, data=None, model=None, callbacks=None, extra_callbacks=None, monitors=None, session_creator=None, session_config=None, session_init=None, starting_epoch=1, steps_per_epoch=None, max_epoch=99999, nr_tower=1, tower=None, predict_tower=None, **kwargs)[source]

Bases: object

Config for trainer.

__init__(dataflow=None, data=None, model=None, callbacks=None, extra_callbacks=None, monitors=None, session_creator=None, session_config=None, session_init=None, starting_epoch=1, steps_per_epoch=None, max_epoch=99999, nr_tower=1, tower=None, predict_tower=None, **kwargs)[source]

Note

Which fields are necessary depends on the specific trainer. Most existing trainers in tensorpack require one of dataflow or data, as well as model, to be present in the config.

Parameters:
  • dataflow (DataFlow) – the dataflow to train with.

  • data (InputSource) – an InputSource to train with. Use either this or dataflow.

  • model (ModelDescBase) – the model to train.

  • callbacks (list) – a list of Callback to perform during training.

  • extra_callbacks (list) – the same as callbacks. This argument is only used to provide the defaults in addition to callbacks. The defaults are MovingAverageSummary(), ProgressBar(), MergeAllSummaries(), RunUpdateOps(). The list of callbacks that will be used in the end is callbacks + extra_callbacks.

  • monitors (list) – a list of TrainingMonitor. Defaults to TFEventWriter(), JSONWriter(), ScalarPrinter().

  • session_creator (tf.train.SessionCreator) – Defaults to sesscreate.NewSessionCreator() with the config returned by tfutils.get_default_sess_config().

  • session_config (tf.ConfigProto) – when session_creator is None, use this to create the session.

  • session_init (SessionInit) – how to initialize variables of a session. Defaults to do nothing.

  • starting_epoch (int) – The index of the first epoch.

  • steps_per_epoch (int) – the number of steps (defined by Trainer.run_step()) to run in each epoch. Defaults to the input data size.

  • max_epoch (int) – maximum number of epoch to run training.

  • nr_tower (int) – number of training towers, used by multi-GPU trainers.

  • tower ([int]) – list of training towers in relative GPU id.

callbacks
nr_tower
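
A typical config might look like the sketch below (MyModel and my_dataflow stand for your own ModelDesc and DataFlow; ModelSaver is one of the standard callbacks):

from tensorpack import TrainConfig, SimpleTrainer
from tensorpack.callbacks import ModelSaver

config = TrainConfig(
    model=MyModel(),           # your ModelDesc
    dataflow=my_dataflow,      # or: data=SomeInputSource(...)
    callbacks=[ModelSaver()],  # the default extra_callbacks are appended
    steps_per_epoch=1000,
    max_epoch=100,
)
SimpleTrainer(config).train()
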
class tensorpack.train.DistributedTrainerReplicated(config, server)[source]

Bases: tensorpack.train.multigpu.MultiGPUTrainerBase

Distributed replicated training. Each worker process builds the same model on one or more GPUs. Gradients across GPUs are averaged within each worker and synchronously applied to the global copies of the variables located on the PS. Each worker then copies the latest variables from the PS back to its local replica.

See https://www.tensorflow.org/performance/benchmarks for details.

Note

Gradients are not averaged across workers, but applied to PS variables directly (either with or without locking depending on the optimizer).

Example

import tensorflow as tf
from tensorpack.tfutils import get_default_sess_config

hosts = ['host1.com', 'host2.com']
cluster_spec = tf.train.ClusterSpec({
    'ps': [h + ':2222' for h in hosts],
    'worker': [h + ':2223' for h in hosts]
})
# config is a TrainConfig; args.job / args.task come from your own argparse options
server = tf.train.Server(
    cluster_spec, job_name=args.job, task_index=args.task,
    config=get_default_sess_config())
DistributedTrainerReplicated(config, server).train()

# then start the jobs on each host:
# (host1)$ train.py --job worker --task 0
# (host1)$ train.py --job ps --task 0
# (host2)$ train.py --job worker --task 1
# (host2)$ train.py --job ps --task 1
__init__(config, server)[source]
Parameters:
  • config (TrainConfig) – Must contain ‘model’ and ‘data’.

  • server (tf.train.Server) – the server object with ps and workers

tensorpack.train.QueueInputTrainer(config, input_queue=None)[source]

A wrapper trainer which automatically wraps config.dataflow in a QueueInput. It is equivalent to SimpleTrainer(config) with config.data = QueueInput(dataflow).

Parameters:
  • config (TrainConfig) – Must contain ‘model’ and ‘dataflow’.

  • input_queue (tf.QueueBase) – an input queue. Defaults to the QueueInput default.
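
That is, the following two calls are equivalent (a sketch; model and df stand for your ModelDesc and DataFlow):

from tensorpack import TrainConfig, SimpleTrainer, QueueInput, QueueInputTrainer

QueueInputTrainer(TrainConfig(model=model, dataflow=df)).train()
# is the same as:
SimpleTrainer(TrainConfig(model=model, data=QueueInput(df))).train()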

class tensorpack.train.MultiGPUTrainerBase(config)[source]

Bases: tensorpack.train.base.Trainer

Base class for multi-GPU training.

static build_on_multi_tower(towers, func, devices=None, use_vs=None)[source]
Parameters:
  • towers – list of gpu relative ids

  • func – a lambda to be called inside each tower

  • devices – a list of devices to use. Defaults to the GPUs specified by towers.

  • use_vs (list[bool]) – list of use_vs values to pass to TowerContext.

Returns:

List of outputs of func, evaluated on each tower.
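
For example (a sketch; build_one_tower stands for any function that builds the per-tower graph and returns some tensors):

# func is called once inside the TowerContext of each of GPU 0 and 1;
# the result is [output_of_tower0, output_of_tower1]
outputs = MultiGPUTrainerBase.build_on_multi_tower(
    towers=[0, 1],
    func=lambda: build_one_tower())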

class tensorpack.train.LeastLoadedDeviceSetter(worker_device, ps_devices)[source]

Bases: object

Helper class to assign variables on the least loaded ps-device.

__init__(worker_device, ps_devices)[source]
Parameters:
  • worker_device – the device to use for compute ops.

  • ps_devices – a list of devices to use for Variable ops.
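
It is meant to be used as a device function for tf.device (a sketch):

import tensorflow as tf

setter = LeastLoadedDeviceSetter(
    worker_device='/gpu:0',
    ps_devices=['/gpu:0', '/gpu:1'])
with tf.device(setter):
    # Variables created here go to whichever ps_device currently holds
    # the fewest variable bytes; all other ops go to worker_device.
    build_graph()  # hypothetical graph-building function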

class tensorpack.train.SyncMultiGPUTrainerReplicated(config, gpu_prefetch=True)[source]

Bases: tensorpack.train.multigpu.MultiGPUTrainerBase

Data-parallel multi-GPU trainer where each GPU holds a replica of the whole model. It builds one tower on each GPU under its own variable scope. Each gradient update is averaged across all GPUs through NCCL.

See https://www.tensorflow.org/performance/benchmarks for details.

__init__(config, gpu_prefetch=True)[source]
Parameters: config, gpu_prefetch – same as in SyncMultiGPUTrainerParameterServer.
static get_post_init_ops()[source]

Copy values of variables on GPU 0 to other GPUs.

static setup_graph(model, input, tower)[source]
Parameters:
  • model (ModelDesc) – the model to build.

  • input (InputSource) – the InputSource to train with.

  • tower ([int]) – list of training towers in relative GPU ids.
Returns:

tf.Operation – the training op

[Callback]: the callbacks to be added

class tensorpack.train.SyncMultiGPUTrainerParameterServer(config, ps_device='gpu', gpu_prefetch=True)[source]

Bases: tensorpack.train.multigpu.MultiGPUTrainerBase

A data-parallel multi-GPU trainer. It builds one tower on each GPU with a shared variable scope. It synchronizes the gradients computed from each tower, averages them, and applies the result to the shared variables.

See https://www.tensorflow.org/performance/benchmarks for details.

__init__(config, ps_device='gpu', gpu_prefetch=True)[source]
Parameters:
  • config (TrainConfig) – Must contain ‘model’ and either one of ‘data’ or ‘dataflow’.

  • ps_device – either ‘gpu’ or ‘cpu’, where variables are stored. Setting it to ‘cpu’ might help when #gpu >= 4.

  • gpu_prefetch (bool) – whether to prefetch the data to each GPU. Usually improves performance.
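
Typical usage might look like this sketch (MyModel and df are placeholders for your ModelDesc and DataFlow):

from tensorpack import TrainConfig

config = TrainConfig(model=MyModel(), dataflow=df, nr_tower=4)
# store variables on CPU, which may help with 4 or more GPUs
SyncMultiGPUTrainerParameterServer(config, ps_device='cpu').train()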

static setup_graph(model, input, ps_device, tower)[source]
Parameters:
  • model (ModelDesc) – the model to build.

  • input (InputSource) – the InputSource to train with.

  • ps_device – either ‘gpu’ or ‘cpu’, where variables are stored.

  • tower ([int]) – list of training towers in relative GPU ids.
Returns:

tf.Operation – the training op

[Callback]: the callbacks to be added

class tensorpack.train.AsyncMultiGPUTrainer(config, scale_gradient=True)[source]

Bases: tensorpack.train.multigpu.MultiGPUTrainerBase

A data-parallel multi-GPU trainer. It builds one tower on each GPU with shared variable scope. Every tower computes the gradients and independently applies them to the variables, without synchronizing and averaging across towers.

__init__(config, scale_gradient=True)[source]
Parameters:
  • config (TrainConfig) – Must contain ‘model’ and either one of ‘data’ or ‘dataflow’.

  • scale_gradient (bool) – if True, will scale each gradient by 1.0/nr_gpu.

static setup_graph(model, input, scale_gradient, tower)[source]
Parameters:
  • model (ModelDesc) – the model to build.

  • input (InputSource) – the InputSource to train with.

  • scale_gradient (bool) – if True, scale each gradient by 1.0/nr_gpu.

  • tower ([int]) – list of training towers in relative GPU ids.
Returns:

tf.Operation – the training op

[Callback]: the callbacks to be added

tensorpack.train.SyncMultiGPUTrainer(config)[source]

Alias for SyncMultiGPUTrainerParameterServer(config, ps_device='gpu'), as this is the most commonly used synchronous multi-GPU trainer (though it is not necessarily more efficient than SyncMultiGPUTrainerReplicated).

class tensorpack.train.SimpleTrainer(config)[source]

Bases: tensorpack.train.base.Trainer

A naive single-tower, single-cost demo trainer. It simply builds one tower and minimizes model.cost. It supports both InputSource and DataFlow.

When a DataFlow is given instead of an InputSource, the InputSource used will be FeedInput(df) (no prefetch).

__init__(config)[source]
Parameters:config (TrainConfig) – Must contain ‘model’ and either one of ‘data’ or ‘dataflow’.
static setup_graph(model, input)[source]

Setup the graph for SimpleTrainer. It simply builds one tower and optimizes model.cost.

Parameters:
  • model (ModelDesc) – the model to build.

  • input (InputSource) – the InputSource to train with.
Returns:

tf.Operation – the training op

[Callback]: the callbacks to be added