Summary and Logging¶
During training, everything other than the iterations themselves is executed through callbacks. This tutorial explains how summaries and logging are handled in callbacks and how you can customize them. The default logging behavior should be good enough for normal use cases, so you may skip this tutorial.
This is how TensorFlow summaries eventually get logged/saved/printed:
1. What to Log: Define what you want to log in the graph, by just calling tf.summary.xxx. When you call tf.summary.xxx in your graph code, TensorFlow adds an op to the tf.GraphKeys.SUMMARIES collection (by default). Tensorpack will further remove summaries (in the default collection) that do not come from the first training tower. (A sketch of this follows the list.)
2. When to Log: The MergeAllSummaries callback is one of the default callbacks. It runs the ops in the tf.GraphKeys.SUMMARIES collection (by default) every epoch (by default), and writes the results to the monitors.
3. Where to Log: Several monitors are enabled by default.
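For example, a minimal sketch of the "what to log" step in graph code (the function and tensor names here are placeholders for illustration, not tensorpack API):

```python
import tensorflow as tf

def build_graph(logits, label):
    # Hypothetical graph code: compute a scalar loss tensor.
    cost = tf.reduce_mean(
        tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=label))
    # Adds a summary op to the tf.GraphKeys.SUMMARIES collection (the default),
    # which the default MergeAllSummaries callback runs and sends to the monitors.
    tf.summary.scalar('cost', cost)
    return cost
```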
All the “what, when, where” can be customized either in the graph or through the callback/monitor settings:
1. You can call tf.summary.xxx(collections=[...]) to add your custom summaries to a different collection.
2. You can use the MergeAllSummaries(key=...) callback to write a different collection of summaries to the monitors.
3. You can use MergeAllSummaries(period=...) to make the callback execute more or less frequently.
4. You can tell the trainer to use different monitors. (A combined sketch of these options follows this list.)
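Putting these together, a hedged sketch (the collection name 'fast_summaries' and the accuracy tensor are made up for this example; TFEventWriter and ScalarPrinter are among tensorpack's built-in monitors):

```python
import tensorflow as tf
from tensorpack import TrainConfig, MergeAllSummaries, TFEventWriter, ScalarPrinter

# "What": put a cheap, frequently-logged summary in its own collection.
# `accuracy` stands for some scalar tensor defined in your graph.
tf.summary.scalar('batch-accuracy', accuracy, collections=['fast_summaries'])

config = TrainConfig(
    # ... model, dataflow, etc. ...
    callbacks=[
        # "When": run this collection every 10 steps, in addition to the
        # default MergeAllSummaries that handles tf.GraphKeys.SUMMARIES.
        MergeAllSummaries(period=10, key='fast_summaries'),
    ],
    # "Where": write summaries to an event file and print scalars to terminal.
    monitors=[TFEventWriter(), ScalarPrinter()],
)
```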
The design goal of disentangling “what, when, where” is to make components reusable: if you have M items to log (possibly from different places, not necessarily the graph) and N backends to log your data to, you automatically obtain all M×N combinations. That said, if you only care about logging one specific tensor in the graph (e.g., for debugging purposes), you can check out the FAQ for easier options.
Noisy TensorFlow Summaries¶
Since TF summaries are evaluated infrequently (every epoch) by default, if the content is data-dependent (e.g., training loss), the infrequently-sampled values could have high variance. To address this issue, you can:
1. Change “When to Log”: log more frequently, but note that certain large summaries can be expensive to log. You may want to use a separate collection for frequent logging.
2. Change “What to Log”: you can call tfutils.summary.add_moving_summary on scalar tensors, which will summarize the moving average of those scalars instead of their instantaneous values. The moving averages are updated every step by the MovingAverageSummary callback (enabled by default). (A sketch follows the list.)
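A minimal sketch of the second option (the cost tensor is a placeholder for any scalar defined in your graph):

```python
from tensorpack.tfutils.summary import add_moving_summary

# `cost` stands for a scalar tensor in your graph. Instead of summarizing
# its noisy per-step value, this summarizes an exponential moving average,
# updated every step by the default MovingAverageSummary callback.
add_moving_summary(cost)
```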
Other Logging Data¶
Besides TensorFlow summaries, a callback can also write other data to the monitor backend at any time once training has started. As long as the type of data is supported, it will be dispatched to, and logged by, the same backends.
As a result, tensorboard will show not only summaries in the graph, but also your custom data. For example, a precise validation error often needs to be computed manually, outside the TensorFlow graph. With a uniform monitor backend, this number will show up in tensorboard as well.
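For instance, a hedged sketch of such a callback (the class name and the compute_validation_error helper are made up for illustration; put_scalar is the monitors' method for scalar data):

```python
from tensorpack import Callback

class ValidationErrorLogger(Callback):
    """Logs a number computed manually, outside the TensorFlow graph."""

    def _trigger_epoch(self):
        # Hypothetical helper that evaluates the model outside the graph.
        err = compute_validation_error()
        # Dispatched to all enabled monitors: the tensorboard event file,
        # terminal printing, JSON, etc.
        self.trainer.monitors.put_scalar('val-error', err)
```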
It is also easy to send data to online logging services for experiment management and reproducibility.
Sending logging data to WandB is even simpler; you only need to do:
```python
import wandb; wandb.init(..., sync_tensorboard=True)
```