tensorpack.dataflow package

class tensorpack.dataflow.DataFlow[source]

Bases: object

Base class for all DataFlow

get_data()[source]

The method to generate datapoints.

Yields:list – The datapoint, i.e. list of components.
reset_state()[source]

Reset state of the dataflow. It has to be called before producing datapoints.

For example, RNG has to be reset if used in the DataFlow, otherwise it won’t work well with prefetching, because different processes will have the same RNG state.

size()[source]
Returns:int – size of this data flow.
Raises:NotImplementedError if this DataFlow doesn’t have a size.
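
Example (an illustrative sketch, not from the library; MyIntegers and its data are made up):

class MyIntegers(DataFlow):
    """ Produce datapoints [i, i * i] for i in [0, n). """
    def __init__(self, n):
        self._n = n

    def get_data(self):
        for i in range(self._n):
            yield [i, i * i]   # a datapoint is a list of components

    def size(self):
        return self._n

ds = MyIntegers(100)
ds.reset_state()    # has to be called before producing datapoints
for dp in ds.get_data():
    print(dp)
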
class tensorpack.dataflow.ProxyDataFlow(ds)[source]

Bases: tensorpack.dataflow.base.DataFlow

Base class for DataFlow that proxies another. Every method is proxied to self.ds unless overridden by a subclass.

__init__(ds)[source]
Parameters:ds (DataFlow) – DataFlow to proxy.
class tensorpack.dataflow.RNGDataFlow[source]

Bases: tensorpack.dataflow.base.DataFlow

A DataFlow with RNG

reset_state()[source]

Reset the RNG

exception tensorpack.dataflow.DataFlowTerminated[source]

Bases: BaseException

An exception indicating that the DataFlow is unable to produce any more data, i.e. something went wrong and calling get_data() can no longer give a valid iterator. In most DataFlows this will never be raised.

class tensorpack.dataflow.TestDataSpeed(ds, size=5000)[source]

Bases: tensorpack.dataflow.base.ProxyDataFlow

Test the speed of some DataFlow

__init__(ds, size=5000)[source]
Parameters:
  • ds (DataFlow) – the DataFlow to test.

  • size (int) – number of datapoints to fetch.

get_data()[source]

Will run testing at the beginning, then produce data normally.

start()[source]

Alias of start_test.

start_test()[source]

Start testing with a progress bar.
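
Example (a minimal usage sketch; ds is assumed to be any existing DataFlow):

TestDataSpeed(ds, size=1000).start()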

class tensorpack.dataflow.PrintData(ds, num=1, label=None, name=None)[source]

Bases: tensorpack.dataflow.base.ProxyDataFlow

Behave like an identity mapping, but print shape and range of the first few datapoints.

Example

To enable this debugging output, you should place it somewhere in your dataflow like

def get_data():
    ds = CaffeLMDB('path/to/lmdb')
    ds = SomeInscrutableMappings(ds)
    ds = PrintData(ds, num=2)
    return ds
ds = get_data()

The output looks like:

[0110 09:22:21 @common.py:589] DataFlow Info:
datapoint 0<2 with 4 components consists of
   dp 0: is float of shape () with range [0.0816501893251]
   dp 1: is ndarray of shape (64, 64) with range [0.1300, 0.6895]
   dp 2: is ndarray of shape (64, 64) with range [-1.2248, 1.2177]
   dp 3: is ndarray of shape (9, 9) with range [-0.6045, 0.6045]
datapoint 1<2 with 4 components consists of
   dp 0: is float of shape () with range [5.88252075399]
   dp 1: is ndarray of shape (64, 64) with range [0.0072, 0.9371]
   dp 2: is ndarray of shape (64, 64) with range [-0.9011, 0.8491]
   dp 3: is ndarray of shape (9, 9) with range [-0.5585, 0.5585]
__init__(ds, num=1, label=None, name=None)[source]
Parameters:
  • ds (DataFlow) – input DataFlow.

  • num (int) – number of datapoints to print.

  • name (str, optional) – name to identify this DataFlow.

class tensorpack.dataflow.BatchData(ds, batch_size, remainder=False, use_list=False)[source]

Bases: tensorpack.dataflow.base.ProxyDataFlow

Stack datapoints into batches. It produces datapoints of the same number of components as ds, but each component has one new extra dimension of size batch_size. The batch can be either a list of original components, or (by default) a numpy array of original components.

__init__(ds, batch_size, remainder=False, use_list=False)[source]
Parameters:
  • ds (DataFlow) – When use_list=False, the components of ds must be either scalars or np.ndarray, and have to be consistent in shapes.

  • batch_size (int) – batch size

  • remainder (bool) – When the remaining datapoints in ds are not enough to form a batch, whether or not to also produce the remaining data as a smaller batch. If set to False, all produced datapoints are guaranteed to have the same batch size.

  • use_list (bool) – if True, each component will contain a list of datapoints instead of a numpy array with an extra dimension.

get_data()[source]
Yields:Batched data by stacking each component on an extra 0th dimension.
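
Example (a usage sketch; ds0 is a hypothetical DataFlow producing [image, label] datapoints of consistent shapes):

ds = BatchData(ds0, batch_size=64, remainder=False)
# every component of each produced datapoint now has an extra batch dimension of size 64
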
class tensorpack.dataflow.BatchDataByShape(ds, batch_size, idx)[source]

Bases: tensorpack.dataflow.common.BatchData

Group datapoints of the same shape together into batches. Unlike BatchData, it doesn't require the input DataFlow to be homogeneous: the input can have datapoints of different shapes, and batches will be formed from those that have the same shape.

Note

It is implemented by a dict{shape -> datapoints}. Datapoints of uncommon shapes may never be enough to form a batch and never get generated.

__init__(ds, batch_size, idx)[source]
Parameters:
  • ds (DataFlow) – input DataFlow. dp[idx] has to be an np.ndarray.

  • batch_size (int) – batch size

  • idx (int) – dp[idx].shape will be used to group datapoints. Other components are assumed to have the same shape.

class tensorpack.dataflow.FixedSizeData(ds, size)[source]

Bases: tensorpack.dataflow.base.ProxyDataFlow

Generate data from another DataFlow, but with a fixed total count. The iterator state of the underlying DataFlow will be kept if not exhausted.

__init__(ds, size)[source]
Parameters:
  • ds (DataFlow) – input DataFlow.

  • size (int) – the fixed total count of datapoints to produce.
class tensorpack.dataflow.MapData(ds, func)[source]

Bases: tensorpack.dataflow.base.ProxyDataFlow

Apply a mapper/filter on the DataFlow.

Note

  1. Please make sure func doesn’t modify the components unless you’re certain it’s safe.

  2. If you discard some datapoints, ds.size() will be incorrect.

__init__(ds, func)[source]
Parameters:
  • ds (DataFlow) – input DataFlow

  • func (datapoint -> datapoint | None) – takes a datapoint and returns a new datapoint. Return None to discard this datapoint.
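
Example (a sketch; ds0 is a hypothetical DataFlow producing [image, label] datapoints):

def keep_nonzero_label(dp):
    # return None to discard datapoints whose label is 0
    return None if dp[1] == 0 else dp

ds = MapData(ds0, keep_nonzero_label)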

class tensorpack.dataflow.MapDataComponent(ds, func, index=0)[source]

Bases: tensorpack.dataflow.common.MapData

Apply a mapper/filter on a datapoint component.

Note

  1. This dataflow itself doesn’t modify the datapoints. But please make sure func doesn’t modify the components unless you’re certain it’s safe.

  2. If you discard some datapoints, ds.size() will be incorrect.

__init__(ds, func, index=0)[source]
Parameters:
  • ds (DataFlow) – input DataFlow.

  • func (TYPE -> TYPE|None) – takes dp[index] and returns a new value for dp[index]. Return None to discard this datapoint.

  • index (int) – index of the component.

class tensorpack.dataflow.RepeatedData(ds, nr)[source]

Bases: tensorpack.dataflow.base.ProxyDataFlow

Take datapoints from another DataFlow and repeat the whole DataFlow a certain number of times, i.e.: dp1, dp2, …, dpn, dp1, dp2, …, dpn

__init__(ds, nr)[source]
Parameters:
  • ds (DataFlow) – input DataFlow

  • nr (int) – number of times to repeat ds. Set to -1 to repeat ds infinite times.

size()[source]
Raises:ValueError when nr == -1.
class tensorpack.dataflow.RepeatedDataPoint(ds, nr)[source]

Bases: tensorpack.dataflow.base.ProxyDataFlow

Take datapoints from another DataFlow and produce each of them a certain number of times, i.e.: dp1, dp1, …, dp1, dp2, …, dp2, …

__init__(ds, nr)[source]
Parameters:
  • ds (DataFlow) – input DataFlow

  • nr (int) – number of times to repeat each datapoint.

class tensorpack.dataflow.RandomChooseData(df_lists)[source]

Bases: tensorpack.dataflow.base.RNGDataFlow

Randomly choose from several DataFlow. Stop producing when any of them is exhausted.

__init__(df_lists)[source]
Parameters:df_lists (list) – a list of DataFlow, or a list of (DataFlow, probability) tuples. Probabilities must sum to 1 if used.
class tensorpack.dataflow.RandomMixData(df_lists)[source]

Bases: tensorpack.dataflow.base.RNGDataFlow

Perfectly mix datapoints from several DataFlows using their size(). Will stop when all DataFlows are exhausted.

__init__(df_lists)[source]
Parameters:df_lists (list) – a list of DataFlow. All DataFlow must implement size().
class tensorpack.dataflow.JoinData(df_lists)[source]

Bases: tensorpack.dataflow.base.DataFlow

Join the components from each DataFlow.

Examples:

df1 produces: [c1, c2]
df2 produces: [c3, c4]
joined: [c1, c2, c3, c4]
__init__(df_lists)[source]
Parameters:df_lists (list) – a list of DataFlow. When these dataflows have different sizes, JoinData will stop when any of them is exhausted.
class tensorpack.dataflow.ConcatData(df_lists)[source]

Bases: tensorpack.dataflow.base.DataFlow

Concatenate several DataFlow. Produce datapoints from each DataFlow and go to the next when one DataFlow is exhausted.

__init__(df_lists)[source]
Parameters:df_lists (list) – a list of DataFlow.
tensorpack.dataflow.SelectComponent(ds, idxs)[source]

Select / reorder components from datapoints.

Parameters:
  • ds (DataFlow) – input DataFlow.

  • idxs (list[int]) – a list of component indices.

Example:

original df produces: [c1, c2, c3]
idxs: [2,1]
this df: [c3, c2]
class tensorpack.dataflow.LocallyShuffleData(ds, buffer_size, nr_reuse=1, shuffle_interval=None)[source]

Bases: tensorpack.dataflow.base.ProxyDataFlow, tensorpack.dataflow.base.RNGDataFlow

Maintain a pool to buffer datapoints, and shuffle before producing them. This can be used as an alternative when a complete random read is too expensive or impossible for the data source.

__init__(ds, buffer_size, nr_reuse=1, shuffle_interval=None)[source]
Parameters:
  • ds (DataFlow) – input DataFlow.

  • buffer_size (int) – size of the buffer.

  • nr_reuse (int) – reuse each datapoint this many times to improve speed, which may hurt your model.

  • shuffle_interval (int) – shuffle the buffer after this many datapoints have gone through it. Frequent shuffles of a large buffer may affect speed, while infrequent shuffles may reduce randomness. Defaults to buffer_size / 3.
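
Example (a sketch; the LMDB path is illustrative):

ds = LMDBData('/path/to/data.lmdb', shuffle=False)
ds = LocallyShuffleData(ds, buffer_size=50000)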

class tensorpack.dataflow.CacheData(ds, shuffle=False)[source]

Bases: tensorpack.dataflow.base.ProxyDataFlow

Cache the first pass of a DataFlow completely in memory, and produce from the cache thereafter.

__init__(ds, shuffle=False)[source]
Parameters:
  • ds (DataFlow) – input DataFlow.

  • shuffle (bool) – whether to shuffle the datapoints before producing them.

class tensorpack.dataflow.HDF5Data(filename, data_paths, shuffle=True)[source]

Bases: tensorpack.dataflow.base.RNGDataFlow

Zip data from different paths in an HDF5 file.

Warning

The current implementation will load all data into memory. (TODO)

__init__(filename, data_paths, shuffle=True)[source]
Parameters:
  • filename (str) – h5 data file.

  • data_paths (list) – list of h5 paths to zip. For example ['images', 'labels'].

  • shuffle (bool) – shuffle all data.

class tensorpack.dataflow.LMDBData(lmdb_path, shuffle=True, keys=None)[source]

Bases: tensorpack.dataflow.base.RNGDataFlow

Read an LMDB database and produce (k, v) raw string pairs.

__init__(lmdb_path, shuffle=True, keys=None)[source]
Parameters:
  • lmdb_path (str) – a directory or a file.

  • shuffle (bool) – shuffle the keys or not.

  • keys (list[str] or str) –

    list of str as the keys, used only when shuffle is True. It can also be a format string e.g. {:0>8d} which will be formatted with the indices from 0 to total_size - 1.

    If not provided, it will then look in the database for __keys__ which dump_dataflow_to_lmdb() used to store the list of keys. If still not found, it will iterate over the database to find all the keys.

class tensorpack.dataflow.LMDBDataDecoder(lmdb_data, decoder)[source]

Bases: tensorpack.dataflow.common.MapData

Read an LMDB database and produce decoded outputs.

__init__(lmdb_data, decoder)[source]
Parameters:
  • lmdb_data – a LMDBData instance.

  • decoder (k,v -> dp | None) – a function taking k, v and returning a datapoint, or returning None to discard.

class tensorpack.dataflow.LMDBDataPoint(*args, **kwargs)[source]

Bases: tensorpack.dataflow.common.MapData

Read an LMDB file and produce deserialized datapoints. It only accepts the database produced by tensorpack.dataflow.dftools.dump_dataflow_to_lmdb(), which uses tensorpack.utils.serialize.dumps() for serialization.

Example

ds = LMDBDataPoint("/data/ImageNet.lmdb", shuffle=False)

# alternatively:
ds = LMDBData("/data/ImageNet.lmdb", shuffle=False)
ds = LocallyShuffleData(ds, 50000)
ds = LMDBDataPoint(ds)
__init__(*args, **kwargs)[source]
Parameters:args, kwargs – same as in LMDBData.
tensorpack.dataflow.CaffeLMDB(lmdb_path, shuffle=True, keys=None)[source]

Read a Caffe LMDB file where each value contains a caffe.Datum protobuf. Produces datapoints of the format: [HWC image, label].

Note that Caffe LMDB format is not efficient: it stores serialized raw arrays rather than JPEG images.

Parameters:lmdb_path, shuffle, keys – same as in LMDBData.
Returns:a LMDBDataDecoder instance.

Example

ds = CaffeLMDB("/tmp/validation", keys='{:0>8d}')
class tensorpack.dataflow.SVMLightData(filename, shuffle=True)[source]

Bases: tensorpack.dataflow.base.RNGDataFlow

Read X,y from a svmlight file, and produce [X_i, y_i] pairs.

__init__(filename, shuffle=True)[source]
Parameters:
  • filename (str) – input file

  • shuffle (bool) – shuffle the data

class tensorpack.dataflow.TFRecordData(path, size=None)[source]

Bases: tensorpack.dataflow.base.DataFlow

Produce datapoints from a TFRecord file, assuming each record is serialized by serialize.dumps(). This class works with dftools.dump_dataflow_to_tfrecord().

__init__(path, size=None)[source]
Parameters:
  • path (str) – path to the tfrecord file

  • size (int) – total number of records, because this metadata is not stored in the tfrecord file.

class tensorpack.dataflow.ImageFromFile(files, channel=3, resize=None, shuffle=False)[source]

Bases: tensorpack.dataflow.base.RNGDataFlow

Produce images read from a list of files.

__init__(files, channel=3, resize=None, shuffle=False)[source]
Parameters:
  • files (list) – list of file paths.

  • channel (int) – 1 or 3. Will convert grayscale to RGB images if channel==3.

  • resize (tuple) – int or (h, w) tuple. If given, resize the image.

class tensorpack.dataflow.AugmentImageComponent(ds, augmentors, index=0, copy=True)[source]

Bases: tensorpack.dataflow.common.MapDataComponent

Apply image augmentors on 1 image component.

__init__(ds, augmentors, index=0, copy=True)[source]
Parameters:
  • ds (DataFlow) – input DataFlow.

  • augmentors (AugmentorList) – a list of imgaug.ImageAugmentor to be applied in order.

  • index (int) – the index of the image component to be augmented in the datapoint.

  • copy (bool) – Some augmentors modify the input images. When copy is True, a copy will be made before any augmentors are applied, to keep the original images unmodified. Turn it off to save time when you know it's OK.
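
Example (a sketch; ds0 and the augmentor list are illustrative, not a recommendation):

from tensorpack.dataflow import imgaug

augs = [imgaug.Flip(horiz=True), imgaug.Brightness(30)]
ds = AugmentImageComponent(ds0, augs, index=0)   # augment the image at dp[0]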

class tensorpack.dataflow.AugmentImageCoordinates(ds, augmentors, img_index=0, coords_index=1, copy=True)[source]

Bases: tensorpack.dataflow.common.MapData

Apply image augmentors on an image and a list of coordinates. Coordinates must be an Nx2 floating point array, where each row is (x, y).

__init__(ds, augmentors, img_index=0, coords_index=1, copy=True)[source]
Parameters:
  • ds (DataFlow) – input DataFlow.

  • augmentors (AugmentorList) – a list of imgaug.ImageAugmentor to be applied in order.

  • img_index (int) – the index of the image component to be augmented.

  • coords_index (int) – the index of the coordinate component to be augmented.

  • copy (bool) – Some augmentors modify the input images. When copy is True, a copy will be made before any augmentors are applied, to keep the original images unmodified. Turn it off to save time when you know it's OK.

class tensorpack.dataflow.AugmentImageComponents(ds, augmentors, index=(0, 1), coords_index=(), copy=True)[source]

Bases: tensorpack.dataflow.common.MapData

Apply image augmentors on several components, with shared augmentation parameters.

Example

ds = MyDataFlow()   # produce [image(HWC), segmask(HW), keypoint(Nx2)]
ds = AugmentImageComponents(
    ds, augs,
    index=(0,1), coords_index=(2,))
__init__(ds, augmentors, index=(0, 1), coords_index=(), copy=True)[source]
Parameters:
  • ds (DataFlow) – input DataFlow.

  • augmentors (AugmentorList) – a list of imgaug.ImageAugmentor instance to be applied in order.

  • index – tuple of indices of the image components.

  • coords_index – tuple of indices of the coordinates components.

  • copy (bool) – Some augmentors modify the input images. When copy is True, a copy will be made before any augmentors are applied, to keep the original images unmodified. Turn it off to save time when you know it's OK.

class tensorpack.dataflow.PrefetchData(ds, nr_prefetch, nr_proc)[source]

Bases: tensorpack.dataflow.base.ProxyDataFlow

Prefetch data from a DataFlow using Python multiprocessing utilities. It forks the process calling __init__() and collects datapoints from ds in each worker process through a Python multiprocessing.Queue.

Note

  1. The underlying dataflow worker will be forked multiple times when nr_proc>1. As a result, unless the underlying dataflow is fully shuffled, the data distribution produced by this dataflow will be different. (e.g. you are likely to see duplicated datapoints at the beginning)

  2. This is significantly slower than PrefetchDataZMQ when data is large.

  3. When nesting like this: PrefetchDataZMQ(PrefetchData(df, nr_proc=a), nr_proc=b), a total of a instances of the df worker process will be created. This is different from the behavior of PrefetchDataZMQ.

  4. reset_state() is a no-op; it will not be called in the worker processes.

__init__(ds, nr_prefetch, nr_proc)[source]
Parameters:
  • ds (DataFlow) – input DataFlow.

  • nr_prefetch (int) – size of the queue to hold prefetched datapoints.

  • nr_proc (int) – number of processes to use.

class tensorpack.dataflow.PrefetchDataZMQ(ds, nr_proc=1, hwm=50)[source]

Bases: tensorpack.dataflow.base.ProxyDataFlow

Prefetch data from a DataFlow using multiple processes, with ZeroMQ for communication. It forks the process calling reset_state() and collects datapoints from ds in each worker process through a ZeroMQ IPC pipe.

Note

  1. The underlying dataflow worker will be forked multiple times when nr_proc>1. As a result, unless the underlying dataflow is fully shuffled, the data distribution produced by this dataflow will be different. (e.g. you are likely to see duplicated datapoints at the beginning)

  2. Once reset_state() is called, this dataflow becomes not fork-safe. i.e., if you fork an already reset instance of this dataflow, it won’t be usable in the forked process.

  3. When nesting like this: PrefetchDataZMQ(PrefetchDataZMQ(df, nr_proc=a), nr_proc=b), a total of a * b instances of the df worker process will be created. Also in this case, some zmq pipes cannot be cleaned at exit.

  4. By default, a UNIX named pipe will be created in the current directory. However, certain non-local filesystems such as NFS/GlusterFS/AFS don't always support pipes. You can change the directory with export TENSORPACK_PIPEDIR=/other/dir. In particular, you can use somewhere under '/tmp', which is usually local.

    Note that some non-local FS may appear to support pipes and the code may appear to run, but crash with bizarre errors. Also note that ZMQ limits the maximum length of the pipe path. If you hit the limit, you can set the directory to a softlink which points to a local directory.

  5. Calling reset_state() more than once is a no-op, i.e. the worker processes won't be restarted.

__init__(ds, nr_proc=1, hwm=50)[source]
Parameters:
  • ds (DataFlow) – input DataFlow.

  • nr_proc (int) – number of processes to use.

  • hwm (int) – the zmq “high-water mark” for both sender and receiver.

reset_state()[source]

All forked dataflows will be reset once and only once in the spawned processes. Calling this method again does nothing more.
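
Example (a typical usage sketch; ds0 is a hypothetical DataFlow):

ds = PrefetchDataZMQ(ds0, nr_proc=4)   # run 4 copies of ds0 in separate processes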

class tensorpack.dataflow.PrefetchOnGPUs(ds, gpus)[source]

Bases: tensorpack.dataflow.prefetch.PrefetchDataZMQ

Similar to PrefetchDataZMQ, but prefetch with each process having its own CUDA_VISIBLE_DEVICES variable mapped to one GPU.

__init__(ds, gpus)[source]
Parameters:
  • ds (DataFlow) – input DataFlow.

  • gpus (list[int]) – list of GPUs to use. Will also start this number of processes.

class tensorpack.dataflow.ThreadedMapData(ds, nr_thread, map_func, buffer_size=200, strict=False)[source]

Bases: tensorpack.dataflow.base.ProxyDataFlow

Same as MapData, but start threads to run the mapping function. This is useful when the mapping function is the bottleneck, but you don’t want to start processes for the entire dataflow pipeline.

Note

  1. There is a tiny communication overhead with threads, but you should still avoid starting many threads in your main process to reduce GIL contention.

    The threads will only start in the process which calls reset_state(). Therefore you can use PrefetchDataZMQ(ThreadedMapData(...), 1) to reduce GIL contention.

  2. Threads run in parallel and can take different time to run the mapping function. Therefore the order of datapoints won’t be preserved, and datapoints from one pass of df.get_data() might get mixed with datapoints from the next pass.

    You can use strict mode, where ThreadedMapData.get_data() is guaranteed to produce the exact set of datapoints which df.get_data() produces, although the order of data still isn't preserved.

__init__(ds, nr_thread, map_func, buffer_size=200, strict=False)[source]
Parameters:
  • ds (DataFlow) – the dataflow to map

  • nr_thread (int) – number of threads to use

  • map_func (callable) – datapoint -> datapoint | None

  • buffer_size (int) – number of datapoints in the buffer

  • strict (bool) – use “strict mode”, see notes above.
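
Example (a sketch following the note above; the [filename, label] format and cv2 decoding are assumptions):

import cv2

def decode(dp):
    # dp is assumed to be [filename, label]; decode the image inside the worker threads
    img = cv2.imread(dp[0])
    return [img, dp[1]]

ds = ThreadedMapData(ds0, nr_thread=16, map_func=decode, buffer_size=200)
ds = PrefetchDataZMQ(ds, 1)   # move the threads to one separate process to reduce GIL contention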

class tensorpack.dataflow.FakeData(shapes, size=1000, random=True, dtype='float32', domain=(0, 1))[source]

Bases: tensorpack.dataflow.base.RNGDataFlow

Generate fake data of given shapes

__init__(shapes, size=1000, random=True, dtype='float32', domain=(0, 1))[source]
Parameters:
  • shapes (list) – a list of lists/tuples. Shapes of each component.

  • size (int) – size of this DataFlow.

  • random (bool) – whether to randomly generate data every iteration. Note that merely generating the data could sometimes be time-consuming!

  • dtype (str or list) – data type as string, or a list of data types.

  • domain (tuple or list) – (min, max) tuple, or a list of such tuples
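
Example (a sketch for benchmarking an input pipeline; the shapes and dtypes are illustrative):

ds = FakeData([[224, 224, 3], []], size=1000, random=False, dtype=['uint8', 'int32'])
ds = BatchData(ds, 32)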

class tensorpack.dataflow.DataFromQueue(queue)[source]

Bases: tensorpack.dataflow.base.DataFlow

Produce data from a queue

__init__(queue)[source]
Parameters:queue (queue) – a queue with a get() method.
class tensorpack.dataflow.DataFromList(lst, shuffle=True)[source]

Bases: tensorpack.dataflow.base.RNGDataFlow

Produce data from a list

__init__(lst, shuffle=True)[source]
Parameters:
  • lst (list) – input list.

  • shuffle (bool) – shuffle data.

class tensorpack.dataflow.DataFromGenerator(gen, size=None)[source]

Bases: tensorpack.dataflow.base.DataFlow

Wrap a generator into a DataFlow.
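
Example (a sketch; the generator is made up):

def my_gen():
    for i in range(100):
        yield [i]

ds = DataFromGenerator(my_gen(), size=100)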

tensorpack.dataflow.send_dataflow_zmq(df, addr, hwm=50, print_interval=100, format=None)[source]

Run DataFlow and send data to a ZMQ socket addr. It will dump and send each datapoint to this addr with a PUSH socket. This function never returns unless an error is encountered.

Parameters:
  • df (DataFlow) – Will infinitely loop over the DataFlow.

  • addr – a ZMQ socket addr.

  • hwm (int) – high water mark

class tensorpack.dataflow.RemoteDataZMQ(addr1, addr2=None)[source]

Bases: tensorpack.dataflow.base.DataFlow

Produce data from ZMQ PULL socket(s). See http://tensorpack.readthedocs.io/en/latest/tutorial/efficient-dataflow.html#distributed-dataflow

cnt1, cnt2

int – number of datapoints received from addr1 and addr2 respectively.

__init__(addr1, addr2=None)[source]
Parameters:addr1,addr2 (str) – addr of the socket to connect to. Use both if you need two protocols (e.g. both IPC and TCP). I don’t think you’ll ever need 3.
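
Example (a sketch of the sender/receiver pair; the socket address is illustrative):

# in the data-producing process:
send_dataflow_zmq(df, 'ipc:///tmp/ipc-socket')

# in the training process:
ds = RemoteDataZMQ('ipc:///tmp/ipc-socket')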

tensorpack.dataflow.dftools module

tensorpack.dataflow.dftools.dump_dataflow_to_process_queue(df, size, nr_consumer)[source]

Convert a DataFlow to a multiprocessing.Queue. The DataFlow will only be reset in the spawned process.

Parameters:
  • df (DataFlow) – the DataFlow to dump.

  • size (int) – size of the queue

  • nr_consumer (int) – number of consumers of the queue. The producer will add this many DIE sentinels to the end of the queue.

Returns:

tuple(queue, process) – The process will take data from df and fill the queue, once you start it. Each element in the queue is (idx, dp). idx can be the DIE sentinel when df is exhausted.

tensorpack.dataflow.dftools.dump_dataflow_to_lmdb(df, lmdb_path, write_frequency=5000)[source]

Dump a DataFlow to an LMDB database, where the keys are indices and the values are serialized datapoints. The output database can be read directly by tensorpack.dataflow.LMDBDataPoint.

Parameters:
  • df (DataFlow) – the DataFlow to dump.

  • lmdb_path (str) – output path. Either a directory or an mdb file.

  • write_frequency (int) – the frequency to write back data to disk.
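
Example (a sketch; ds0 and the output path are assumptions):

from tensorpack.dataflow import dftools

dftools.dump_dataflow_to_lmdb(ds0, '/path/to/output.lmdb')
# the result can then be read back with:
ds = LMDBDataPoint('/path/to/output.lmdb', shuffle=True)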

tensorpack.dataflow.dftools.dump_dataflow_to_tfrecord(df, path)[source]

Dump all datapoints of a DataFlow to a TensorFlow TFRecord file, using serialize.dumps() to serialize.

Parameters:
  • df (DataFlow) –

  • path (str) – the output file path