Data Readers

A data reader is the main interface of MLIO for reading datasets. A dataset is a collection of one or more data stores all of which contain data in the same format (e.g. CSV or RecordIO-protobuf). By instantiating a subclass of DataReader such as a CsvReader or a RecordIOProtobufReader a dataset can be read in batches.
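
As a rough illustration of the flow described above, the sketch below builds a dataset from a directory of CSV files and reads it batch by batch. It assumes the mlio Python package is installed. The function name count_batches and the directory argument are hypothetical, and mlio.list_files is not described in this section; it is used here on the assumption that it returns the data stores matching the pattern.

```python
def count_batches(csv_dir, batch_size=200):
    """Read every batch of the CSV files under csv_dir and count them."""
    # Imported lazily so that the sketch can be loaded without MLIO installed.
    import mlio

    # Assumption: list_files() returns the data stores matching the pattern.
    dataset = mlio.list_files(csv_dir, pattern="*.csv")

    params = mlio.DataReaderParams(dataset=dataset, batch_size=batch_size)
    reader = mlio.CsvReader(params)

    num_batches = 0
    for example in reader:  # each item is an Example holding one batch
        num_batches += 1
    return num_batches
```

Calling count_batches("/path/to/csv_data") would then return the number of Examples produced for that directory.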

DataReader

Represents an abstract base class for all data reader types.

Methods

read_schema

Reads the Schema of the underlying dataset.

read_schema()

read_example

Reads the next Example from the underlying dataset. If the end of the dataset is reached, returns None.

read_example()

peek_example

Peeks the next Example from the underlying dataset without consuming it. Calling read_example afterwards will return the same example.

peek_example()

reset

Resets the state of the data reader. Calling read_example() the next time will start reading from the beginning of the dataset.

reset()

__iter__

All data readers are iterable and can be used in contexts such as for loops, list comprehensions, and generator expressions.
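
The interplay of read_example, peek_example, reset, and iteration can be sketched with a small stand-in. ListReader below is hypothetical and only mimics the documented behavior over an in-memory list so that the snippet runs without MLIO; the same calling pattern should apply to a real reader.

```python
class ListReader:
    """Hypothetical stand-in that mimics the documented DataReader methods."""

    def __init__(self, examples):
        self._examples = list(examples)
        self._pos = 0

    def read_example(self):
        example = self.peek_example()
        if example is not None:
            self._pos += 1
        return example

    def peek_example(self):
        if self._pos < len(self._examples):
            return self._examples[self._pos]
        return None

    def reset(self):
        self._pos = 0

    def __iter__(self):
        return self

    def __next__(self):
        example = self.read_example()
        if example is None:
            raise StopIteration
        return example


def read_epochs(reader, num_epochs):
    """Drain the reader num_epochs times, resetting between epochs."""
    epochs = []
    for _ in range(num_epochs):
        batches = []
        while True:
            example = reader.read_example()
            if example is None:  # end of the dataset
                break
            batches.append(example)
        epochs.append(batches)
        reader.reset()  # the next read starts from the beginning again
    return epochs


reader = ListReader(["batch0", "batch1", "batch2"])
first = reader.peek_example()   # does not consume the example
same = reader.read_example()    # returns the example that was peeked
reader.reset()
epochs = read_epochs(reader, num_epochs=2)
reader_as_list = list(reader)   # readers are iterable as well
```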

Properties

num_bytes_read

Gets the number of bytes read from the dataset. The returned number won't include the size of the discarded parts of the dataset such as comment blocks.

The returned number can be greater than expected because MLIO reads ahead in the dataset in the background.

CsvReader

Represents a data reader for reading CSV datasets. Inherits from DataReader.

CsvReader(data_reader_params : DataReaderParams, csv_params : CsvReaderParams = None)

RecordIOProtobufReader

Represents a data reader for reading RecordIO-protobuf datasets.

RecordIOProtobufReader(data_reader_params : DataReaderParams)

ImageReader

Represents a data reader for reading image datasets in JPEG and PNG formats.

ImageReader(data_reader_params : DataReaderParams, image_reader_params : ImageReaderParams)

DataReaderParams

Contains the common parameters used by all data readers.

All constructor parameters described below have a same-named read/write accessor property. Note, though, that due to a shortcoming in pybind11-based language bindings, values cannot be added to container types via properties; updates must instead be made via assignment.
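
The container caveat can be modeled in plain Python. ParamsModel below is a hypothetical class, not part of MLIO, whose property hands out a fresh copy on every access; it is an assumption that this by-value conversion is what causes the behavior described in the note.

```python
class ParamsModel:
    """Hypothetical model of a binding whose property returns a copy."""

    def __init__(self, dataset):
        self._dataset = list(dataset)

    @property
    def dataset(self):
        # A fresh list is handed out on every access, as a by-value
        # container conversion would do.
        return list(self._dataset)

    @dataset.setter
    def dataset(self, value):
        self._dataset = list(value)


params = ParamsModel(["a.csv"])

# Mutating the returned container only changes a temporary copy.
params.dataset.append("b.csv")
after_append = params.dataset

# Reassigning the whole container is what actually updates the parameter.
params.dataset = params.dataset + ["b.csv"]
after_assign = params.dataset
```

With the params types in this document the same pattern applies: build the new container first, then assign it to the property.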

DataReaderParams(dataset : Sequence[DataStore],
                 batch_size : int,
                 num_prefetched_examples : int = 0,
                 num_parallel_reads : int = 0,
                 last_example_handling : LastExampleHandling = LastExampleHandling.NONE,
                 bad_example_handling : BadExampleHandling = BadExampleHandling.ERROR,
                 warn_bad_instances : bool = True,
                 num_instances_to_skip : int = 0,
                 num_instances_to_read : Optional[int] = None,
                 shard_index : int = 0,
                 num_shards : int = 0,
                 sample_ratio : Optional[float] = None,
                 shuffle_instances : bool = False,
                 shuffle_window : int = 0,
                 shuffle_seed : Optional[int] = None,
                 reshuffle_each_epoch : bool = True)
  • dataset: A sequence of DataStore instances that together form the dataset to read from.
  • batch_size: A number indicating how many data instances should be packed into a single Example.
  • num_prefetched_examples: The number of Examples to prefetch in background to accelerate reading. If zero, defaults to the number of processor cores.
  • num_parallel_reads: The number of parallel reads. If not specified, it equals num_prefetched_examples. When a large number of Examples should be prefetched, this parameter can be used to avoid thread oversubscription.
  • last_example_handling: See LastExampleHandling.
  • bad_example_handling: See BadExampleHandling.
  • warn_bad_instances: A boolean value indicating whether a warning will be output for each bad instance.
  • num_instances_to_skip: The number of data instances to skip from the beginning of the dataset.
  • num_instances_to_read: The number of data instances to read. The rest of the dataset will be ignored.
  • shard_index: The index of the shard to read.
  • num_shards: The number of shards the dataset should be split into. The reader will only read 1/num_shards of the dataset.
  • sample_ratio: A ratio between zero and one indicating how much of the dataset should be read. The dataset will be sampled based on this number.
  • shuffle_instances: A boolean value indicating whether to shuffle the data instances while reading from the dataset.
  • shuffle_window: The number of data instances to buffer and sample from. The selected data instances will be replaced with new data instances read from the dataset. A value of zero means perfect shuffling and requires loading the whole dataset into memory first.
  • shuffle_seed: The seed that will be used for initializing the sampling distribution. If not specified, a random seed will be generated internally.
  • reshuffle_each_epoch: A boolean value indicating whether the dataset should be reshuffled after every reset() call.
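
The shuffle_window description can be read as a buffered sampling scheme. The sketch below is one plausible interpretation in plain Python, not MLIO's actual implementation; the function windowed_shuffle and its seed argument are hypothetical names chosen to mirror shuffle_window and shuffle_seed.

```python
import random


def windowed_shuffle(instances, shuffle_window, seed=None):
    """Yield instances in shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    source = iter(instances)

    if shuffle_window == 0:
        # Perfect shuffling: the whole dataset must be held in memory.
        everything = list(source)
        rng.shuffle(everything)
        yield from everything
        return

    # Fill the buffer with the first shuffle_window instances.
    buffer = []
    for item in source:
        buffer.append(item)
        if len(buffer) == shuffle_window:
            break

    # Emit a random buffered instance and refill its slot from the source.
    for item in source:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item

    # The source is exhausted; drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(windowed_shuffle(range(10), shuffle_window=4, seed=0))
perfect = list(windowed_shuffle(range(10), shuffle_window=0, seed=0))
```

A small window keeps memory use bounded but only mixes instances that are close to each other in the dataset; a window of zero mixes everything at the cost of loading the whole dataset.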

CsvReaderParams

Contains the parameters used by CsvReader.

All constructor parameters described below have a same-named read/write accessor property. Note, though, that due to a shortcoming in pybind11-based language bindings, values cannot be added to container types via properties; updates must instead be made via assignment.

CsvReaderParams(column_names : Sequence[str] = None,
                name_prefix : str = "",
                use_columns : Set[str] = None,
                use_columns_by_index : Set[int] = None,
                default_data_type : Optional[DataType] = None,
                column_types : Dict[str, DataType] = None,
                column_types_by_index : Dict[int, DataType] = None,
                header_row_index : Optional[int] = 0,
                has_single_header : bool = False,
                dedupe_column_names : bool = True,
                delimiter : str = ',',
                quote_char : str = '"',
                comment_char : str = None,
                allow_quoted_new_lines : bool = False,
                skip_blank_lines : bool = True,
                encoding : str = None,
                max_field_length : Optional[int] = None,
                max_field_length_handling : MaxFieldLengthHandling = MaxFieldLengthHandling.ERROR,
                max_line_length : Optional[int] = None,
                parser_params : ParserParams = None)
  • column_names: The column names. If the dataset has a header and header_row_index is specified, this list can be left empty to infer the column names from the dataset.
  • name_prefix: The prefix to prepend to column names.
  • use_columns: The columns that should be read. The rest of the columns will be skipped.
  • use_columns_by_index: The columns, specified by index, that should be read. The rest of the columns will be skipped.
  • default_data_type: The data type for columns for which no explicit data type is specified via column_types or column_types_by_index. If not specified, the column data types will be inferred from the dataset.
  • column_types: The mapping between columns and data types by name.
  • column_types_by_index: The mapping between columns and data types by index.
  • header_row_index: The index of the row that should be treated as the header of the dataset. If column_names is empty, the column names will be inferred from that row. If neither header_row_index nor column_names is specified, the column ordinal positions will be used as column names. Each data store in the dataset should have its header at the same index.
  • has_single_header: A boolean value indicating whether the dataset has a header row only in the first data store.
  • dedupe_column_names: A boolean value indicating whether duplicate columns should be renamed. If true, duplicate columns 'X', ..., 'X' will be renamed to 'X', 'X_1', 'X_2', ..., 'X_N'.
  • delimiter: The delimiter character.
  • quote_char: The character used for quoting field values.
  • comment_char: The comment character. Lines that start with the comment character will be skipped.
  • allow_quoted_new_lines: A boolean value indicating whether quoted fields can be multiline. Note that enabling this option will slow down the reading speed.
  • skip_blank_lines: A boolean value indicating whether to skip empty lines.
  • encoding: The text encoding to use. If not specified, it will be inferred from the preamble of the text; otherwise, falls back to UTF-8. The specified encoding should be a valid name that is accepted by iconv(1).
  • max_field_length: The maximum number of characters that will be read in a field. Any characters beyond this limit will be handled using the strategy specified in max_field_length_handling.
  • max_field_length_handling: See MaxFieldLengthHandling.
  • max_line_length: The maximum length of a row. A row longer than this threshold will cause the data reader to fail.
  • parser_params: See ParserParams.
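
The renaming rule described for dedupe_column_names can be sketched as follows. This is an illustrative reimplementation, not MLIO's code; in particular it does not handle the case where a generated name such as 'X_1' collides with a column that already exists, and how MLIO treats that case is not stated here.

```python
def dedupe_column_names(names):
    """Rename repeated names to 'X', 'X_1', 'X_2', ... in order of appearance."""
    seen = {}
    result = []
    for name in names:
        count = seen.get(name, 0)
        result.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    return result


renamed = dedupe_column_names(["id", "X", "X", "label", "X"])
# renamed == ["id", "X", "X_1", "label", "X_2"]
```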

ImageReaderParams

Contains the parameters used by ImageReader.

All constructor parameters described below have a same-named read/write accessor property. Note, though, that due to a shortcoming in pybind11-based language bindings, values cannot be added to container types via properties; updates must instead be made via assignment.

ImageReaderParams(image_frame : ImageFrame = ImageFrame.NONE,
                  resize : Optional[int] = None,
                  image_dimensions : Sequence[int] = None,
                  to_rgb : bool = False)
  • image_frame: See ImageFrame.
  • resize: Scales the shorter edge of the image to this value before applying other augmentations.
  • image_dimensions: The dimensions of the output image in (channels, height, width) order.
  • to_rgb: A boolean value for converting from BGR (OpenCV default) to RGB color scheme.
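
The effect of resize on image dimensions can be worked through with a little arithmetic. The sketch assumes that the aspect ratio is preserved and that the longer edge is rounded to the nearest integer; neither detail is stated above, and resize_shorter_edge is a hypothetical helper.

```python
def resize_shorter_edge(height, width, resize):
    """Return (height, width) after scaling the shorter edge to `resize`."""
    scale = resize / min(height, width)
    if height <= width:
        return resize, round(width * scale)
    return round(height * scale), resize


landscape = resize_shorter_edge(480, 640, 256)  # (256, 341)
portrait = resize_shorter_edge(640, 480, 256)   # (341, 256)
```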

ParserParams

Contains the parameters used for parsing dataset features.

All constructor parameters described below have a same-named read/write accessor property. Note, though, that due to a shortcoming in pybind11-based language bindings, values cannot be added to container types via properties; updates must instead be made via assignment.

ParserParams(nan_values : Set[str] = None, number_base : int = 10)
  • nan_values: For a floating-point parse operation, holds the set of strings that should be treated as NaN.
  • number_base: For a number parse operation, specifies the base of the number in its string representation.
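
What the two parser parameters mean can be shown with plain Python parsing. The helpers parse_float and parse_int are hypothetical and only model the described semantics; they are not MLIO's parser.

```python
import math


def parse_float(text, nan_values=frozenset()):
    """Parse text as a float, mapping any string in nan_values to NaN."""
    if text in nan_values:
        return math.nan
    return float(text)


def parse_int(text, number_base=10):
    """Parse text as an integer written in the given base."""
    return int(text, number_base)


missing = parse_float("n/a", nan_values={"n/a", "NULL"})
value = parse_float("1.5", nan_values={"n/a", "NULL"})
hex_value = parse_int("ff", number_base=16)
```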

Example

Represents a batch returned by read_example() of a data reader. It contains a collection of Tensor instances corresponding to each feature in the dataset and an associated Schema instance describing the dataset.

Example(schema : Schema, features : Sequence[Tensor])
  • schema: The schema of the dataset.
  • features: The data of the batch in the form of one Tensor instance per dataset feature.

Functions

__len__, __contains__, __getitem__

Example implements the Python sequence protocol. The features of an example can be retrieved by both index and name:

# Either by name
lbl = example["label"]
# or by index
lbl = example[1]

__iter__

The features of an Example instance can be iterated:

for feature in example:
    ...

Properties

schema

Gets the Schema instance describing the dataset.

padding

Gets the padding of the batch. If it is greater than zero, the last N elements in the batch dimension, where N is the padding, are zero-initialized. This is typically the case for the last batch read from a dataset if the size of the dataset is not evenly divisible by the batch size.
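
On the consumer side, the padding value tells how many trailing rows to ignore. The sketch below uses a plain list of rows in place of a Tensor, and valid_rows is a hypothetical helper.

```python
def valid_rows(batch_rows, padding):
    """Drop the trailing `padding` rows that were zero-filled by the reader."""
    if padding == 0:
        return list(batch_rows)
    return list(batch_rows[:len(batch_rows) - padding])


# A batch of size 4 whose last two rows are padding.
rows = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
kept = valid_rows(rows, padding=2)
```

The explicit zero check avoids the common slicing pitfall where rows[:-0] yields an empty list.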

Schema

Describes the attributes of a dataset.

Schema(attrs : Sequence[Attribute])
  • attrs: A sequence of Attribute instances describing the attributes of the dataset.

Functions

get_index

Returns the index of the attribute with the specified name.

get_index(name : str)
  • name: The name of the attribute.

Properties

descriptors

Gets the list of attributes.

Attribute

Describes an attribute which defines a measurable property of a dataset.

Attribute(name : str,
          data_type : DataType,
          shape : Sequence[int],
          strides : Sequence[int] = None,
          sparse : bool = False)
  • name: The name of the attribute.
  • data_type: The data type of the attribute.
  • shape: The shape of the attribute.
  • strides: The strides, if any, of the attribute.
  • sparse: A boolean value indicating whether the attribute is sparse or dense.

Properties

name

Gets the name of the attribute.

data_type

Gets the data type of the attribute.

shape

Gets the shape of the attribute.

sparse

Gets a boolean value indicating whether the attribute is sparse or dense.

Enumerations

LastExampleHandling

Specifies how the last Example of a dataset should be handled if the dataset size is not evenly divisible by the batch size.

| Value | Description |
|-------|-------------|
| NONE | Return an Example where the size of the batch dimension is less than the requested batch size. |
| DROP | Drop the last Example. |
| DROP_WARN | Drop the last Example and warn. |
| PAD | Pad the features of the Example with zero so that the size of the batch dimension equals the requested batch size. |
| PAD_WARN | Pad the features of the Example with zero so that the size of the batch dimension equals the requested batch size and warn. |
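
The difference between the modes is easiest to see on a dataset whose size is not a multiple of the batch size. The sketch below models the modes with plain strings and lists; make_batches is hypothetical, and the warnings emitted by the _WARN variants are left out.

```python
def make_batches(instances, batch_size, handling="NONE"):
    """Split instances into batches; return a list of (batch, padding) pairs."""
    batches = []
    for start in range(0, len(instances), batch_size):
        batch = list(instances[start:start + batch_size])
        missing = batch_size - len(batch)
        if missing == 0 or handling == "NONE":
            batches.append((batch, 0))
        elif handling in ("DROP", "DROP_WARN"):
            continue  # the short last batch is discarded
        elif handling in ("PAD", "PAD_WARN"):
            batches.append((batch + [0] * missing, missing))
        else:
            raise ValueError(f"unknown handling: {handling}")
    return batches


data = [1, 2, 3, 4, 5]
as_is = make_batches(data, 2, "NONE")    # last batch is [5]
dropped = make_batches(data, 2, "DROP")  # last batch is omitted
padded = make_batches(data, 2, "PAD")    # last batch is [5, 0], padding 1
```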

BadExampleHandling

Specifies how an Example that contains erroneous data should be handled.

| Value | Description |
|-------|-------------|
| ERROR | Raise an error. |
| SKIP | Skip the Example. |
| SKIP_WARN | Skip the Example and warn. |
| PAD | Skip bad instances, pad the Example to the batch size. |
| PAD_WARN | Skip bad instances, pad the Example to the batch size, and warn. |
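
One way to read the table is that SKIP discards the whole Example, while PAD discards only the bad instances and fills the gap. The sketch below models that reading on plain lists; handle_bad and the is_bad predicate are hypothetical, and the warnings of the _WARN variants are omitted.

```python
def handle_bad(batch, is_bad, handling="ERROR"):
    """Return (batch, padding), or None when the whole batch is skipped."""
    good = [inst for inst in batch if not is_bad(inst)]
    if len(good) == len(batch):
        return batch, 0
    if handling == "ERROR":
        raise ValueError("batch contains a bad instance")
    if handling in ("SKIP", "SKIP_WARN"):
        return None  # the entire Example is skipped
    if handling in ("PAD", "PAD_WARN"):
        padding = len(batch) - len(good)
        return good + [0] * padding, padding
    raise ValueError(f"unknown handling: {handling}")


batch = [1, -1, 3]
skipped = handle_bad(batch, lambda x: x < 0, "SKIP")
padded = handle_bad(batch, lambda x: x < 0, "PAD")
```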

ImageFrame

Specifies what image frame to use for reading an image dataset.

| Value | Description |
|-------|-------------|
| NONE | For reading raw image files in JPEG or PNG format. |
| RECORDIO | For reading MXNet RecordIO based image files. |

MaxFieldLengthHandling

Specifies how a field should be handled when its length exceeds max_field_length.

| Value | Description |
|-------|-------------|
| TREAT_AS_BAD | Treat the corresponding row as bad. |
| TRUNCATE | Truncate the field. |
| TRUNCATE_WARN | Truncate the field and warn. |
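
The three strategies can be sketched on a single field value. The helper apply_max_field_length is hypothetical and models the modes with plain strings; how MLIO reports a bad row to the reader is outside the scope of this sketch.

```python
import warnings


def apply_max_field_length(field, max_field_length, handling):
    """Return (field, is_bad) after enforcing the length limit."""
    if max_field_length is None or len(field) <= max_field_length:
        return field, False
    if handling == "TREAT_AS_BAD":
        return field, True  # the caller treats the whole row as bad
    if handling == "TRUNCATE_WARN":
        warnings.warn(f"field truncated to {max_field_length} characters")
    if handling in ("TRUNCATE", "TRUNCATE_WARN"):
        return field[:max_field_length], False
    raise ValueError(f"unknown handling: {handling}")


truncated = apply_max_field_length("abcdefgh", 3, "TRUNCATE")
flagged = apply_max_field_length("abcdefgh", 3, "TREAT_AS_BAD")
```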

Exceptions

| Type | Description |
|------|-------------|
| DataReaderError | Thrown when the dataset cannot be read. Inherits from MLIOError. |
| SchemaError | Thrown when the dataset has a schema error. Inherits from DataReaderError. |
| InvalidInstanceError | Thrown when the dataset contains an invalid data instance. Inherits from DataReaderError. |
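
Because SchemaError and InvalidInstanceError both inherit from DataReaderError, handlers should be ordered from most to least specific. The classes below are stand-ins that only mirror the documented hierarchy so the snippet runs without MLIO; that MLIOError derives directly from Exception is an assumption.

```python
class MLIOError(Exception):
    """Stand-in for the base error type."""


class DataReaderError(MLIOError):
    """Stand-in: the dataset cannot be read."""


class SchemaError(DataReaderError):
    """Stand-in: the dataset has a schema error."""


class InvalidInstanceError(DataReaderError):
    """Stand-in: the dataset contains an invalid data instance."""


def classify(error):
    """Show which handler a raised error reaches, most specific first."""
    try:
        raise error
    except SchemaError:
        return "schema"
    except InvalidInstanceError:
        return "invalid instance"
    except DataReaderError:
        return "reader"


kinds = [
    classify(SchemaError()),
    classify(InvalidInstanceError()),
    classify(DataReaderError()),
]
```

Placing the DataReaderError handler first would catch all three error types and hide the more specific ones.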