Write a DataFlow¶
First, make sure you know about Python's generators and the yield keyword.
If you don't, look them up first.
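As a quick refresher (not part of the original tutorial), a generator function uses yield to produce values lazily, one at a time, instead of building and returning a whole list:

```python
def count_up(n):
    # A generator: nothing runs until someone iterates over it.
    for i in range(n):
        yield i

# Iterating (here via list()) drives the generator to completion.
values = list(count_up(3))  # [0, 1, 2]
```

This lazy, one-datapoint-at-a-time behavior is exactly what a DataFlow builds on.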
Write a Source DataFlow¶
There are several existing DataFlows, e.g. ImageFromFile, DataFromList, which you can use if your data format is simple. In general, however, you will probably need to write a source DataFlow that produces data for your task, and then compose it with existing modules (e.g. mapping, batching, prefetching, …).
The easiest way to create a DataFlow that loads custom data is to wrap a custom generator, e.g.:
def my_data_loader():
    # load data from somewhere with Python, and yield them
    for k in range(100):
        yield [my_array, my_label]

df = DataFromGenerator(my_data_loader)
To write a more complicated DataFlow, you need to inherit the base DataFlow class.
Usually, you just need to implement the
__iter__() method, which yields a datapoint each time.
import numpy as np

class MyDataFlow(DataFlow):
    def __iter__(self):
        # load data from somewhere with Python, and yield them
        for k in range(100):
            digit = np.random.rand(28, 28)
            label = np.random.randint(10)
            yield [digit, label]

df = MyDataFlow()
df.reset_state()
for datapoint in df:
    print(datapoint[0], datapoint[1])
Optionally, you can implement the __len__() and reset_state() methods.
The detailed semantics of these three methods are explained
in the API documentation.
If you're writing a complicated DataFlow, make sure to read the API documentation
for the semantics.
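To make the roles of the three methods concrete, here is a self-contained sketch in plain Python (it deliberately does not inherit tensorpack's DataFlow, and uses the standard library's random instead of numpy, so the shapes and the 28×28 nested-list "digit" are only illustrative):

```python
import random

class MyDataFlow:
    """Sketch of a DataFlow-like class implementing all three methods."""

    def __init__(self, size=100):
        self._size = size
        self._rng = None

    def reset_state(self):
        # Called once before iteration starts. A typical use is to
        # (re-)initialize RNGs, e.g. so forked worker processes
        # don't all produce identical data.
        self._rng = random.Random()

    def __len__(self):
        # Optional: the number of datapoints per epoch, if known.
        return self._size

    def __iter__(self):
        # Yields one datapoint (a list of components) at a time.
        for _ in range(self._size):
            digit = [[self._rng.random() for _ in range(28)] for _ in range(28)]
            label = self._rng.randrange(10)
            yield [digit, label]

df = MyDataFlow(size=5)
df.reset_state()
points = list(df)
```

The exact contracts (when reset_state() is called, whether __len__() must be accurate) are defined by the API documentation, not by this sketch.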
DataFlow implementations for several well-known datasets are provided in the dataflow.dataset module, which you can take as references.
More Data Processing¶
You can put any data processing you need in the source DataFlow you write, or you can write a new DataFlow for data processing on top of the source DataFlow, e.g.:
class ProcessingDataFlow(DataFlow):
    def __init__(self, ds):
        self.ds = ds

    def reset_state(self):
        self.ds.reset_state()

    def __iter__(self):
        for datapoint in self.ds:
            # do something
            yield new_datapoint
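To show the composition end to end, here is a minimal runnable sketch using plain Python classes (not the actual tensorpack base class; the flows, their names, and the "double the label" processing are made up for illustration):

```python
class SourceFlow:
    """A toy source: yields [index, label] datapoints."""

    def reset_state(self):
        pass  # nothing to initialize in this toy example

    def __iter__(self):
        for k in range(3):
            yield [k, k * 10]

class ProcessingFlow:
    """Wraps another flow and transforms each datapoint."""

    def __init__(self, ds):
        self.ds = ds

    def reset_state(self):
        # Propagate the reset to the wrapped flow.
        self.ds.reset_state()

    def __iter__(self):
        for datapoint in self.ds:
            # Example processing: double the second component.
            yield [datapoint[0], datapoint[1] * 2]

df = ProcessingFlow(SourceFlow())
df.reset_state()
out = list(df)  # [[0, 0], [1, 20], [2, 40]]
```

Note that reset_state() is forwarded to the wrapped DataFlow, so resetting the outermost flow resets the whole chain.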