Write a DataFlow¶
First, make sure you know about Python’s generators and the `yield` keyword.
If you don’t, learn about them first.
Write a Source DataFlow¶
There are several existing DataFlows, e.g. ImageFromFile and DataFromList, which you can use if your data format is simple. In general, though, you will probably need to write a source DataFlow to produce data for your task, and then compose it with other DataFlows (e.g. mapping, batching, prefetching, …).
The easiest way to create a DataFlow to load custom data, is to wrap a custom generator, e.g.:
```python
def my_data_loader():
    # load data from somewhere with Python, and yield them
    for k in range(100):
        yield [my_array, my_label]

df = DataFromGenerator(my_data_loader)
```
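Since `DataFromGenerator` simply wraps the generator, you can sanity-check the generator on its own before wrapping it. A minimal runnable sketch, where the 28×28 float arrays and integer labels are illustrative assumptions:

```python
import numpy as np

def my_data_loader():
    # yield [array, label] datapoints; shapes and ranges are made up for illustration
    for k in range(100):
        my_array = np.random.rand(28, 28)
        my_label = np.random.randint(10)
        yield [my_array, my_label]

# iterate the raw generator directly to check what DataFromGenerator would see
datapoints = list(my_data_loader())
print(len(datapoints))         # 100
print(datapoints[0][0].shape)  # (28, 28)
```

Once the generator behaves as expected, pass it (the function itself, not a call to it) to `DataFromGenerator`.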
To write more complicated DataFlow, you need to inherit the base `DataFlow` class.
Usually, you just need to implement the
`__iter__()` method, which yields a datapoint every time.
```python
import numpy as np

class MyDataFlow(DataFlow):
    def __iter__(self):
        # load data from somewhere with Python, and yield them
        for k in range(100):
            digit = np.random.rand(28, 28)
            label = np.random.randint(10)
            yield [digit, label]

df = MyDataFlow()
df.reset_state()
for datapoint in df:
    print(datapoint[0], datapoint[1])
```
Optionally, you can implement the `__len__()` and `reset_state()` methods.
The detailed semantics of these three methods are explained
in the API documentation.
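As a sketch of what all three methods can look like together, the class below implements `__iter__()`, `__len__()`, and `reset_state()`. A plain Python class stands in for tensorpack’s `DataFlow` base class so the snippet is self-contained, and the size and shapes are illustrative assumptions:

```python
import numpy as np

class MyDataFlow:  # would subclass tensorpack's DataFlow in real code
    def __init__(self, size=100):
        self._size = size
        self.rng = None

    def reset_state(self):
        # (re-)initialize the RNG; tensorpack calls this once per worker process
        self.rng = np.random.RandomState()

    def __len__(self):
        # the number of datapoints per epoch, if known
        return self._size

    def __iter__(self):
        for _ in range(self._size):
            yield [self.rng.rand(28, 28), self.rng.randint(10)]

df = MyDataFlow(size=10)
df.reset_state()
print(len(df))        # 10
print(len(list(df)))  # 10
```

Resetting the RNG inside `reset_state()` matters when the DataFlow is forked into multiple processes, so that workers don’t produce identical random sequences.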
If you’re writing a complicated DataFlow, make sure to read the API documentation
for the semantics.
DataFlow implementations for several well-known datasets are provided in the dataflow.dataset module. You can take them as examples.
More Data Processing¶
You can put any data processing you need in the source DataFlow you write, or you can write a new DataFlow for data processing on top of the source DataFlow, e.g.:
```python
class ProcessingDataFlow(DataFlow):
    def __init__(self, ds):
        self.ds = ds

    def reset_state(self):
        self.ds.reset_state()

    def __iter__(self):
        for datapoint in self.ds:
            # do something to produce new_datapoint
            yield new_datapoint
```
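To see the composition pattern end to end, here is a toy version; plain classes stand in for the tensorpack base class, and the “processing” step (scaling the first component) is purely an illustrative assumption:

```python
class SourceDataFlow:
    def reset_state(self):
        pass

    def __iter__(self):
        for k in range(5):
            yield [k, k * k]

class ProcessingDataFlow:
    def __init__(self, ds):
        self.ds = ds

    def reset_state(self):
        # forward the call so the source can initialize itself too
        self.ds.reset_state()

    def __iter__(self):
        for datapoint in self.ds:
            # example processing: scale the first component
            yield [datapoint[0] * 10, datapoint[1]]

df = ProcessingDataFlow(SourceDataFlow())
df.reset_state()
print(list(df))  # [[0, 0], [10, 1], [20, 4], [30, 9], [40, 16]]
```

Note that `reset_state()` is forwarded to the wrapped DataFlow; forgetting to do this is a common source of bugs when stacking DataFlows.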