What is DataFlow¶
DataFlow is a library to build Python iterators for efficient data loading.
Definition: A DataFlow is something that has a
get_data() generator method,
A datapoint is a list of Python objects which is called the
components of a datapoint.
Example: to train on MNIST dataset, you may need a DataFlow with a
that yields datapoints (lists) of two components:
a numpy array of shape (64, 28, 28), and an array of shape (64,).
Composition of DataFlow¶
One good thing about having a standard interface is to be able to provide the greatest code reusability. There are a lot of existing DataFlow utilities in tensorpack, which you can use to compose complex DataFlow with a long data pipeline. A common pipeline usually would read from disk (or other sources), apply augmentations, group into batches, prefetch data, etc. A simple example is as the following:
# a DataFlow you implement to produce [tensor1, tensor2, ..] lists from whatever sources: df = MyDataFlow(dir='/my/data', shuffle=True) # resize the image component of each datapoint df = AugmentImageComponent(df, [imgaug.Resize((225, 225))]) # group data into batches of size 128 df = BatchData(df, 128) # start 3 processes to run the dataflow in parallel df = PrefetchDataZMQ(df, 3)
You can find more complicated DataFlow in the ResNet training script with all the data preprocessing.
Unless you are working with standard data types (image folders, LMDB, etc),
you would usually want to write the base DataFlow (
MyDataFlow in the above example) for your data format.
See another tutorial
for simple instructions on writing a DataFlow.
Once you have the base reader, all the existing DataFlows are ready for you to complete
the rest of the data pipeline.
It's easy: write everything in pure Python, and reuse existing utilities. On the contrary, writing data loaders in TF operators is usually painful, and performance is hard to tune.
It's fast: see Efficient DataFlow on how to build a fast DataFlow with parallel prefetching. If you're using DataFlow with tensorpack, also see Input Pipeline tutorial on how tensorpack further accelerates data loading in the graph.
Nevertheless, tensorpack support data loading with native TF operators / TF datasets as well.
Use DataFlow outside Tensorpack¶
DataFlow is independent of both tensorpack and TensorFlow.
import tensorpack.dataflow, you don't even have to install TensorFlow.
You can simply use it as a data processing pipeline and plug it into any other frameworks.
To use a DataFlow independently, you will need to call
reset_state() first to initialize it,
and then use the generator however you want:
df = SomeDataFlow() df.reset_state() generator = df.get_data() for dp in generator: # dp is now a list. do whatever