Record Readers

Record readers are relatively low-level constructs in the MLIO API. They allow reading raw binary records from an InputStream instance.

As of today, the only publicly available record reader type is ParquetRecordReader, which reads Parquet files as in-memory blobs that can be passed to Apache Arrow.

RecordReader

Represents the abstract base class for all record reader types.

Methods

read_record

Reads the next Record from the underlying InputStream. If the end of the stream is reached, returns None.

read_record()

peek_record

Peeks at the next Record in the underlying InputStream without consuming it. A subsequent call to read_record returns the same record.

peek_record()

__iter__

All record readers are iterable and can be used in contexts such as for loops, list comprehensions, and generator expressions.
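The read, peek, and iterate behavior described above can be tried without MLIO installed. The following is only a sketch: ListRecordReader is a hypothetical in-memory stand-in, not an MLIO class, and it assumes that peek_record, like read_record, returns None at the end of the stream.

```python
class ListRecordReader:
    """Hypothetical stand-in that mimics the documented RecordReader contract."""

    def __init__(self, records):
        self._records = list(records)
        self._pos = 0

    def peek_record(self):
        # Return the next record without consuming it.
        # Assumption: None is returned once the stream is exhausted.
        if self._pos < len(self._records):
            return self._records[self._pos]
        return None

    def read_record(self):
        # Return the next record and advance; None at the end of the stream.
        record = self.peek_record()
        if record is not None:
            self._pos += 1
        return record

    def __iter__(self):
        # Yield records until read_record() signals the end of the stream.
        while (record := self.read_record()) is not None:
            yield record


reader = ListRecordReader([b"first", b"second", b"third"])

peeked = reader.peek_record()      # does not consume the record
first = reader.read_record()       # returns the record that was peeked
rest = [bytes(r) for r in reader]  # iteration drains the remaining records
at_end = reader.read_record()      # None once the stream is exhausted
```

With a real reader, the same calls would be made on an object constructed from an InputStream instead of a list.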

ParquetRecordReader

Reads Parquet files from an underlying InputStream and returns them as binary blobs via Record instances. Inherits from RecordReader.

ParquetRecordReader(strm : InputStream)
  • strm: The input stream from which to read the Parquet files.

This class is meant to be used with input streams that can potentially contain more than one Parquet file. For example, a SageMakerPipe data store pointing to an S3 location with more than one Parquet file should use ParquetRecordReader to extract them from the input stream.

Conventional data stores such as Files don't need to use ParquetRecordReader. A data store containing only a single Parquet file can be converted directly into an Arrow file via the mlio.integ.arrow.as_arrow_file() function.
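As a rough illustration of the multi-file case, the sketch below turns each record from a reader into a separate file-like object. It relies only on the reader being iterable and on records supporting the buffer protocol. The helper name parquet_blobs and the fake stream are hypothetical, and plain bytes objects stand in for Record instances so that the example runs without MLIO.

```python
import io


def parquet_blobs(reader):
    """Yield one in-memory file object per record (one Parquet file each)."""
    for record in reader:
        # memoryview() works on any object exposing the buffer protocol;
        # bytes() copies the data so the blob outlives the record.
        yield io.BytesIO(bytes(memoryview(record)))


# Stand-in for a stream holding two Parquet files. Real Parquet files
# start and end with the 4-byte magic "PAR1"; the middle is placeholder text.
fake_stream = [b"PAR1<file one>PAR1", b"PAR1<file two>PAR1"]

blobs = list(parquet_blobs(fake_stream))
magics = [blob.getvalue()[:4] for blob in blobs]
```

In real use, the iterable would be a ParquetRecordReader and each blob would be handed to Apache Arrow for decoding; that step is left out here because it depends on libraries outside this sketch.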

Record

Represents a binary blob containing the raw bytes of a data instance. It supports the Python Buffer protocol.

Properties

kind

Gets the kind of the record, indicating whether it is a complete or a partial record.

Enumerations

RecordKind

Specifies the kind of a record.

Value     Description
COMPLETE  The record contains a complete data instance.
BEGIN     The record contains the beginning of a data instance.
MIDDLE    The record contains the middle of a data instance.
END       The record contains the end of a data instance.
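One way to read the four kinds is that a data instance arrives either as a single COMPLETE record or as a BEGIN, zero or more MIDDLE, and an END record in sequence. The sketch below reassembles such runs under that assumption. The RecordKind enum and FakeRecord class here are local stand-ins that mirror the documented names, not the MLIO types, and assemble is a hypothetical helper.

```python
from enum import Enum


class RecordKind(Enum):
    # Stand-in mirroring the documented values; not the real MLIO enum.
    COMPLETE = 0
    BEGIN = 1
    MIDDLE = 2
    END = 3


class FakeRecord(bytes):
    """Hypothetical record: bytes (buffer protocol) plus a `kind` attribute."""

    def __new__(cls, data, kind=RecordKind.COMPLETE):
        obj = super().__new__(cls, data)
        obj.kind = kind
        return obj


def assemble(records):
    """Yield complete payloads, joining BEGIN/MIDDLE/END runs."""
    parts = None  # chunks of the data instance currently being assembled
    for record in records:
        chunk = bytes(memoryview(record))
        if record.kind in (RecordKind.COMPLETE, RecordKind.BEGIN):
            if parts is not None:
                raise ValueError("new instance started inside a partial one")
            if record.kind is RecordKind.COMPLETE:
                yield chunk
            else:
                parts = [chunk]
        else:  # MIDDLE or END
            if parts is None:
                raise ValueError("MIDDLE/END record without a BEGIN record")
            parts.append(chunk)
            if record.kind is RecordKind.END:
                yield b"".join(parts)
                parts = None
    if parts is not None:
        raise ValueError("stream ended inside a partial data instance")


stream = [
    FakeRecord(b"whole"),
    FakeRecord(b"he", RecordKind.BEGIN),
    FakeRecord(b"ll", RecordKind.MIDDLE),
    FakeRecord(b"o", RecordKind.END),
]
payloads = list(assemble(stream))
```

Whether a given reader ever emits partial records is not stated here, so the kind of each record should be checked before treating its bytes as a whole data instance.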

Exceptions

Type                 Description
RecordError          Thrown when the record cannot be read. Inherits from MLIOError.
CorruptRecordReader  Thrown when the record is corrupt. Inherits from RecordError.
CorruptRecordHeader  Thrown when the record has a corrupt header. Inherits from CorruptRecordReader.
CorruptRecordFooter  Thrown when the record has a corrupt footer. Inherits from CorruptRecordReader.
RecordTooLargeError  Thrown when the record is larger than a threshold. Inherits from RecordError.
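Because the exceptions form a hierarchy, the order of except clauses matters: a handler for a derived type has to come before the handler for its base. The sketch below shows this with locally defined stand-in classes that copy three of the names and the inheritance listed above. They are not imported from MLIO, and classify_read is a hypothetical helper.

```python
class MLIOError(Exception):
    """Stand-in for the documented base error type."""


class RecordError(MLIOError):
    """Stand-in: the record cannot be read."""


class RecordTooLargeError(RecordError):
    """Stand-in: the record is larger than a threshold."""


def classify_read(read):
    """Call `read` and report which documented failure, if any, occurred."""
    try:
        read()
        return "ok"
    except RecordTooLargeError:
        # Most specific type first; otherwise RecordError would catch it.
        return "too large"
    except RecordError:
        return "unreadable"


def too_large():
    raise RecordTooLargeError("record exceeds the size threshold")


def unreadable():
    raise RecordError("record cannot be read")


results = [
    classify_read(lambda: None),
    classify_read(too_large),
    classify_read(unreadable),
]
```

In real code, `read` would be a call such as a reader's read_record method, and the exception types would come from the library rather than being defined locally.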