TensorFlow and tensorpack follow the “define-and-run” paradigm, so training consists of two steps:
Define: Build the graph for the model. Users can call any TensorFlow functions to set up the graph, and may or may not use tensorpack's ModelDesc or other utilities to build it. The goal of this step is to define “what to run” in later training steps.
Run: Train the model (the Trainer.train() method):
Finalize graph, initialize session.
Run the training loop.
Trainers aim to simplify the above two steps by exploiting some universal patterns.
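To make the define/run split concrete, here is a toy sketch in plain Python. It is not how TensorFlow or tensorpack is implemented; the `Node`, `placeholder`, `mul` and `sub` names are hypothetical and exist only for this illustration. The point is that building the graph performs no arithmetic, and the same graph can then be run many times with different inputs.

```python
class Node:
    """A deferred computation: nothing is evaluated until run() is called."""

    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def run(self, feed):
        # Evaluate the input nodes first, then this node.
        args = [node.run(feed) for node in self.inputs]
        return self.fn(feed, *args)


def placeholder(name):
    # A node whose value is looked up in the feed dict at run time.
    return Node(lambda feed: feed[name])


def mul(a, b):
    return Node(lambda feed, x, y: x * y, a, b)


def sub(a, b):
    return Node(lambda feed, x, y: x - y, a, b)


# Define: build the graph once. No arithmetic happens here.
x = placeholder("x")
w = placeholder("w")
target = placeholder("target")
error = sub(mul(w, x), target)
loss = mul(error, error)

# Run: evaluate the same graph repeatedly with different inputs.
losses = [
    loss.run({"x": 2.0, "w": w_val, "target": 6.0})
    for w_val in (1.0, 2.0, 3.0)
]
```

In this toy, the "define" phase only wires objects together, and the "run" phase is an ordinary loop over a fixed graph, which is the part a trainer can automate.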
Assumptions of Base Trainer
Q: What types of training can you do with tensorpack?
A: Anything that runs in a loop.
In research we do training of various kinds. Tensorpack trainers avoid making assumptions about what type of training you want to do. For example, unlike Keras, tensorpack does not wrongly assume that:
Your training data is batched
Your training is gradient-based optimization
Your data has
You want to evaluate on zero or one validation dataset
… and more
The only assumption is that your training follows this pattern:
for epoch_num in range(starting_epoch, max_epoch):
    for local_step in range(steps_per_epoch):
        run_step()  # do something
In other words, the assumptions are:
Training is running some iterations. Tensorpack base trainer implements the logic of running the iterations. Users or derived trainers should implement what the iterations are.
The concept of “epoch”, i.e. we assume that the iterations run in nested for-loops. In fact, the steps per epoch can be any number and it only affects the schedule of callbacks. In other words, an “epoch” in tensorpack is the default period to run callbacks (validation, summary, checkpoint, etc.). So this assumption effectively puts no extra constraints.
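The two assumptions above can be sketched in plain Python. This is only an illustration under stated assumptions, not tensorpack's actual base trainer: the class names `LoopTrainer` and `CountingTrainer` and the `epoch_callbacks` list are hypothetical. The base class owns the nested loop, a subclass supplies `run_step()`, and callbacks fire once per epoch.

```python
class LoopTrainer:
    """Toy base trainer: it owns the loop, subclasses define the step."""

    def __init__(self, steps_per_epoch, starting_epoch, max_epoch):
        self.steps_per_epoch = steps_per_epoch
        self.starting_epoch = starting_epoch
        self.max_epoch = max_epoch
        self.global_step = 0
        # Callables invoked as cb(epoch_num, global_step) after each epoch.
        self.epoch_callbacks = []

    def run_step(self):
        raise NotImplementedError("subclasses define what one iteration does")

    def train(self):
        # Same loop bounds as the pseudocode above.
        for epoch_num in range(self.starting_epoch, self.max_epoch):
            for local_step in range(self.steps_per_epoch):
                self.run_step()
                self.global_step += 1
            for callback in self.epoch_callbacks:
                callback(epoch_num, self.global_step)


class CountingTrainer(LoopTrainer):
    """A step that is not gradient-based at all: it just counts."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.counter = 0

    def run_step(self):
        self.counter += 1


trainer = CountingTrainer(steps_per_epoch=5, starting_epoch=1, max_epoch=4)
log = []
trainer.epoch_callbacks.append(lambda epoch, step: log.append((epoch, step)))
trainer.train()
```

Changing `steps_per_epoch` in this sketch changes only how often the callbacks run, which mirrors the point that an “epoch” is just the default callback period.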
Tensorpack implements a few builtin trainers for single-dataloader single-cost gradient-based optimization, as this is the most common type of task. If your training follows this pattern, you only need to select a trainer, and use it with its training interface.
The simplest example of such a trainer is SimpleTrainer. All it does is build your model (which you have to provide) once (or twice if inference is needed by callbacks) and minimize its cost.
For data-parallel multi-GPU training, different multi-GPU trainers implement different distribution strategies. They take care of device placement, gradient averaging and synchronization in an efficient way, which is why multi-GPU training in tensorpack is up to 5x faster than Keras.
It takes only one line of code change to use them.
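A hedged sketch of what that one-line change might look like is given below. It assumes tensorpack (with a compatible TensorFlow) is installed and that `config` is a training config you have already built. The helper names `make_trainer` and `train` are hypothetical wrappers added for illustration; `SimpleTrainer`, `SyncMultiGPUTrainerReplicated` and `launch_train_with_config` are taken from tensorpack's public API, and `SyncMultiGPUTrainerReplicated` is only one of several multi-GPU trainers.

```python
def make_trainer(num_gpus):
    """Pick a trainer based on the number of GPUs (illustrative helper)."""
    # Imported lazily so this file can be loaded without tensorpack installed.
    from tensorpack import SimpleTrainer, SyncMultiGPUTrainerReplicated

    if num_gpus <= 1:
        # Single-GPU (or CPU) training.
        return SimpleTrainer()
    # The "one-line change": swap in a data-parallel multi-GPU trainer.
    return SyncMultiGPUTrainerReplicated(num_gpus)


def train(config, num_gpus):
    """Launch training; `config` is assumed to be built elsewhere."""
    from tensorpack import launch_train_with_config

    launch_train_with_config(config, make_trainer(num_gpus))
```

The rest of the training script (model, input source, callbacks) is expected to stay the same when the trainer is swapped.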
Note some common confusions when using these trainers:
In each iteration, instead of taking one input tensor for all GPUs and splitting it, tensorpack trainers let each GPU take its own tensors from the input. Therefore, the total batch size across all GPUs is (batch size of input source) * #GPU. You may want to change steps_per_epoch or the learning rate appropriately according to the total batch size.
Splitting a tensor for data-parallel training (as done by frameworks like Keras) makes no sense at all. First, it wastes time doing the split, because typically the data was first concatenated by the user. Second, it puts an unnecessary shape constraint on the data: the inputs on each GPU need to have compatible shapes.
The tower function (your model code) will get called once on each GPU, so it must follow some rules of tower functions.