diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 97b9198..99e96a8 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -4,11 +4,11 @@ Thanks for contributing to IterTable! Here are some guidelines to help you get ## Questions -Feel free to use the issue tracker to ask questions! We don't currently have a separate mailing list or active chat tool. +Questions and ideas can be submitted to the [Django Data Wizard discussion board](https://github.com/wq/django-data-wizard/discussions). ## Bug Reports -Bug reports can take any form as long as there is enough information to diagnose the problem. To speed up response time, try to include the following whenever possible: +Bug reports can be submitted to either [IterTable issues](https://github.com/wq/itertable/issues) or [Django Data Wizard issues](https://github.com/wq/django-data-wizard/issues). Reports can take any form as long as there is enough information to diagnose the problem. To speed up response time, try to include the following whenever possible: * Versions of Fiona and/or Pandas, if applicable * Expected (or ideal) behavior * Actual behavior @@ -18,9 +18,10 @@ Bug reports can take any form as long as there is enough information to diagnose Pull requests are very welcome and will be reviewed and merged as time allows. To speed up reviews, try to include the following whenever possible: * Reference the issue that the PR fixes (e.g. [#3](https://github.com/wq/itertable/issues/3)) * Failing test case fixed by the PR - * If the PR provides new functionality, update [the documentation](https://github.com/wq/itertable/blob/master/docs/) + * If the PR provides new functionality, update [the documentation](https://github.com/wq/django-data-wizard/tree/main/docs/itertable) * Ensure the PR passes lint and unit tests. 
This happens automatically, but you can also run these locally with the following commands: ```bash -./runtests.sh # run the test suite -LINT=1 ./runtests.sh # run code style checking +python -m unittest discover -s tests -t . -v # run the test suite +flake8 # run code style checking +``` diff --git a/README.md b/README.md index 51f693a..e6c00f8 100644 --- a/README.md +++ b/README.md @@ -17,128 +17,34 @@ for row in load_file("example.xlsx"): [![Tests](https://github.com/wq/itertable/actions/workflows/test.yml/badge.svg)](https://github.com/wq/itertable/actions/workflows/test.yml) [![Python Support](https://img.shields.io/pypi/pyversions/itertable.svg)](https://pypi.python.org/pypi/itertable) -> **Note:** Prior to version 2.0, IterTable was **wq.io**, a submodule of the [wq framework]. The package has been renamed to avoid confusion with the wq framework website (). -Similarly, IterTable's `*IO` classes have been renamed to `*Iter`, as the API is not intended to match that of Python's `StringIO` or other `io` classes. - -```diff -- from wq.io import CsvFileIO -- data = CsvFileIO(filename='data.csv') -+ from itertable import CsvFileIter -+ data = CsvFileIter(filename='data.csv') -``` - -## Getting Started - -```bash -# Recommended: create virtual environment -# python3 -m venv venv -# . venv/bin/activate - -python3 -m pip install itertable - -# GIS support (Fiona & Shapely) -python3 -m pip install itertable[gis] - -# Excel 97-2003 (.xls) support -python3 -m pip install itertable[oldexcel] -# (xlsx support is enabled by default) - -# Pandas integration -python3 -m pip install itertable[pandas] -``` - -## Overview - -IterTable provides a general purpose API for loading, iterating over, and writing tabular datasets. The goal is to avoid needing to remember the unique usage of e.g. [csv], [openpyxl], or [xml.etree] every time one needs to work with external data. 
Instead, IterTable abstracts these libraries into a consistent interface that works as an [iterable] of [namedtuples]. Whenever possible, the field names for a dataset are automatically determined from the source file, e.g. the column headers in an Excel spreadsheet. - -```python -from itertable import ExcelFileIter -data = ExcelFileIter(filename='example.xlsx') -for row in data: - print(row.name, row.date) -``` - -IterTable provides a number of built-in classes like the above, including a `CsvFileIter`, `XmlFileIter`, and `JsonFileIter`. There is also a convenience function, `load_file()`, that attempts to automatically determine which class to use for a given file. - -```python -from itertable import load_file -data = load_file('example.csv') -for row in data: - print(row.name, row.date) -``` - -All of the included `*FileIter` classes support both reading and writing to external files. - -### Network Client - -IterTable also provides network-capable equivalents of each of the above classes, to facilitate loading data from third party webservices. - -```python -from itertable import JsonNetIter -class WebServiceIter(JsonNetIter): - url = "http://example.com/api" - -data = WebServiceIter(params={'type': 'all'}) -for row in data: - print(row.timestamp, row.value) -``` - -The powerful [requests] library is used internally to load data over HTTP. - -### Pandas Analysis - -When [Pandas] is installed (via `itertable[pandas]`), the `as_dataframe()` method on itertable classes can be used to create a [DataFrame], enabling more extensive analysis possibilities. - -```python -instance = WebServiceIter(params={'type': 'all'}) -df = instance.as_dataframe() -print(df.value.mean()) -``` - -### GIS Support - -When [Fiona] and [Shapely] are installed (via `itertable[gis]`), itertable can also open and create shapefiles and other OGR-compatible geographic data formats. 
- -```python -from itertable import ShapeIter -data = ShapeIter(filename='sites.shp') -for id, site in data.items(): - print(id, site.geometry.wkt) -``` - -More information on IterTable's gis support is available [here][gis]. - -### Command-Line Interface - -IterTable provides a simple CLI for rendering the content of a file or Iter class. This can be useful for e.g. inspecting a file or for integrating a shell automation workflow. The default output is CSV, but can be changed to JSON by setting `-f json`. - -```bash -python3 -m itertable example.json # JSON to CSV -python3 -m itertable -f json example.csv # CSV to JSON -python3 -m itertable example.xlsx "start_row=5" -python3 -m itertable http://example.com/example.csv -python3 -m itertable itertable.CsvNetIter "url=http://example.com/example.csv" -``` - -### Extending IterTable - -It is straightforward to [extend IterTable][custom] to support arbitrary formats. Each provided class is composed of a [BaseIter][base] class and mixin classes ([loaders], [parsers], and [mappers]) that handle the various steps of the process. 
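The division of labor between `BaseIter` and its mixins can be sketched with a minimal, stdlib-only pipeline. The classes below only mimic the load → parse → map steps described above; they are illustrative stand-ins, not the real itertable base classes:

```python
from collections import namedtuple
import csv

# Illustrative sketch of the load -> parse -> map composition; these are
# NOT the real itertable classes, just stand-ins showing the same shape.

class FileLoader:
    def load(self):
        # Produce a file-like object for the parser to consume
        self.file = open(self.filename)

class CsvParser:
    def parse(self):
        # Extract the dataset as a list of dicts
        self.data = list(csv.DictReader(self.file))
        self.file.close()

class TupleMapper:
    def usable_item(self, row):
        # Map each dict onto a namedtuple for attribute access
        Row = namedtuple('Row', row.keys())
        return Row(**row)

class BaseIter:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)  # save arguments as properties
        self.load()
        self.parse()

    def __iter__(self):
        return (self.usable_item(row) for row in self.data)

# Mix the pieces together, BaseIter last
class MyCsvIter(FileLoader, CsvParser, TupleMapper, BaseIter):
    pass
```

An instance such as `MyCsvIter(filename='example.csv')` would then iterate over namedtuples, in the same spirit as the built-in classes.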
- -[wq framework]: https://wq.io/ -[csv]: https://docs.python.org/3/library/csv.html -[openpyxl]: https://openpyxl.readthedocs.io/en/stable/ -[xml.etree]: https://docs.python.org/3/library/xml.etree.elementtree.html -[iterable]: https://docs.python.org/3/glossary.html#term-iterable -[namedtuples]: https://docs.python.org/3/library/collections.html#collections.namedtuple -[requests]: http://python-requests.org/ -[Pandas]: http://pandas.pydata.org/ -[DataFrame]: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html -[Fiona]: https://github.com/Toblerity/Fiona -[Shapely]: https://github.com/Toblerity/Shapely - -[custom]: https://github.com/wq/itertable/blob/master/docs/about.md -[base]: https://github.com/wq/itertable/blob/master/docs/base.md -[loaders]: https://github.com/wq/itertable/blob/master/docs/loaders.md -[parsers]: https://github.com/wq/itertable/blob/master/docs/parsers.md -[mappers]: https://github.com/wq/itertable/blob/master/docs/mappers.md -[gis]: https://github.com/wq/itertable/blob/master/docs/gis.md +### [Documentation][docs] + +[**Installation**][installation] + +[**API**][api] +
+[CLI][cli] +• +[GIS][gis] + +[**Extending IterTable**][custom] +
+[BaseIter][base] +• +[Loaders][loaders] +• +[Parsers][parsers] +• +[Mappers][mappers] + +[docs]: https://django-data-wizard.wq.io/itertable/ + +[installation]: https://django-data-wizard.wq.io/itertable/#getting-started +[api]: https://django-data-wizard.wq.io/itertable/#overview +[cli]: https://django-data-wizard.wq.io/itertable/#command-line-interface +[custom]: https://django-data-wizard.wq.io/itertable/custom +[base]: https://django-data-wizard.wq.io/itertable/base +[loaders]: https://django-data-wizard.wq.io/itertable/loaders +[parsers]: https://django-data-wizard.wq.io/itertable/parsers +[mappers]: https://django-data-wizard.wq.io/itertable/mappers +[gis]: https://django-data-wizard.wq.io/itertable/gis diff --git a/docs/about.md b/docs/about.md deleted file mode 100644 index e91b273..0000000 --- a/docs/about.md +++ /dev/null @@ -1,68 +0,0 @@ -Extending IterTable -=================== - -[IterTable] provides a consistent interface for working with data from a variety of common formats. However, it is not possible to support every conceivable file format and data structure in a single library. Because of this, IterTable is designed to be customized and extended. To facilitate fully modular customization, the IterTable APIs are designed as combinations of a `BaseIter` class and several mixin classes. - -The `BaseIter` class and mixins break the process into several steps: - -1. The [BaseIter][base] class initializes each instance, saving any passed arguments as properties on the instance, then immediately triggering the next two steps. -2. A [Loader][loaders] mixin loads an external resource into a file-like object and saves it to a `file` property on the instance. -3. A [Parser][parsers] mixin extracts data from the `file` property and saves it to a `data` property, which should almost always be a `list` of `dict`s. -4. 
After initialization, the BaseIter class and a [Mapper][mappers] mixin provide a transparent interface for iterating over the instance's `data`, e.g. by transforming each row into a `namedtuple` for convenience. - -These steps and their corresponding classes are detailed in the following pages. - -When writing to a file, the above steps are done more or less in reverse: the [Mapper][mappers] transforms data back into the `dict` format used in the `data` list; and the [Parser][parsers] dumps the data into a file-like object prepared by the [Loader][loaders] which then writes the output file. - -There are a number of pre-mixed classes directly exported by the [itertable module]. By convention, each pre-mixed class has a suffix "Iter", e.g. `ExcelFileIter`. The class names provide hints to the mixins that were used in their creation: for example, `JsonFileIter` extends `FileLoader`, `JsonParser`, `TupleMapper`, and `BaseIter`. Note that all of the pre-mixed classes extend `TupleMapper`, and all Iter classes extend `BaseIter` by definition. - -To extend IterTable, you can subclass one of these pre-mixed classes: - -```python -from itertable import JsonFileIter - -class MyJsonFileIter(JsonFileIter): - def parse(self): - # custom parsing code... -``` - -... or, subclass one of the mixins and mix your own class: - -```python -# Load base classes -from itertable.base import BaseIter -from itertable.loaders import FileLoader -from itertable.parsers import JsonParser -from itertable.mappers import TupleMapper - -# Equivalent: -# from itertable import BaseIter, FileLoader, JsonParser, TupleMapper - -# Define custom mixin class -class MyJsonParser(JsonParser): - def parse(self): - # custom parsing code ... - -# Mix together a usable Iter class -class MyJsonFileIter(FileLoader, MyJsonParser, TupleMapper, BaseIter): - pass -``` - -Note that the order of classes is important: `BaseIter` should always be listed last to ensure the correct method resolution order. 
- -You can then use your new class like any other Iter class: - -```python -for record in MyJsonFileIter(filename='file.json'): - print(record.id) -``` - -[IterTable]: https://github.com/wq/itertable -[custom]: https://github.com/wq/itertable/blob/master/docs/about.md -[base]: https://github.com/wq/itertable/blob/master/docs/base.md -[loaders]: https://github.com/wq/itertable/blob/master/docs/loaders.md -[parsers]: https://github.com/wq/itertable/blob/master/docs/parsers.md -[mappers]: https://github.com/wq/itertable/blob/master/docs/mappers.md -[gis]: https://github.com/wq/itertable/blob/master/docs/gis.md - -[itertable module]: https://github.com/wq/itertable/blob/master/itertable/__init__.py diff --git a/docs/base.md b/docs/base.md deleted file mode 100644 index 54c48af..0000000 --- a/docs/base.md +++ /dev/null @@ -1,92 +0,0 @@ -The BaseIter class -================== - -> Source: [`itertable.base`][itertable.base] - -The `BaseIter` class forms the core of [IterTable]'s built-in classes, and should always be extended when [defining custom classes][custom]. `BaseIter` serves two primary functions: - - * Initializing the class and orchestrating the [load][loaders] and [parse][parsers] mixin tasks - * Providing a convenient `iterable` interface for working with the parsed data (with support from a [mapper][mappers] mixin) - -To accomplish these functions, BaseIter contains a number of methods and properties: - - 1. Synchronization methods and configuration properties. These are discussed below. - 2. Stub functions meant to be overridden by the mixin classes. - 3. Magic methods to facilitate iteration and data manipulation. These should rarely need to be called directly or overridden. - -## Methods - - name | purpose -------|-------- -`refresh()` | Triggers the [load][loaders] and [parse][parsers] mixins to ensure the dataset is ready for iteration. Called automatically when the class is initialized. 
`copy(other_io, save=True)` | Copy the entire dataset to another Iter instance, which presumably uses a different loader or parser. This method provides a means of converting data between formats. Any existing data on the other Iter instance will be erased. If `save` is `True` (the default), the `save()` method on the other Iter will be immediately triggered after the data is copied. -`sync(other_io, save=True)` | Like `copy()`, but uses `key_field` (see below) to update existing records in the other Iter rather than replacing the entire dataset. If a key is not found it is added automatically. -`as_dataframe()` | Generates a [Pandas DataFrame] containing the data in the Iter instance. Useful for more complex data analysis tasks. Requires Pandas, which is not installed by default. - -## Properties - - name | purpose -------|-------- -`field_names` | The field or column names in the dataset. This can usually be determined automatically. -`key_field` | A "primary key" on the dataset. If `key_field` is set, the Iter will behave more like a dictionary, e.g. the default iteration will be over the key field values instead of over the rows. -`nested` | Boolean indicating whether the Iter has a two-tiered API (see below). -`tabular` | Boolean indicating whether the dataset comes from an inherently tabular file format (e.g. a spreadsheet). See [Parsers][parsers] for more details. - -### Assigning Values to Properties - -Most properties (including mixin properties) can be set by passing them as arguments when initializing the class. However, in general it is better to create a subclass with the properties pre-set. - -```python -# Works, but less re-usable -instance = CustomIter(field_names=['id','name']) - -# Usually better -class MyCustomIter(CustomIter): - field_names = ['id', 'name'] -instance = MyCustomIter() -``` - -The main exception to this rule is for properties that are almost guaranteed to be different every time the Iter is instantiated, e.g. 
[FileLoader][loaders]'s `filename` property. - -### Nested Iters - -IterTable supports the notion of "nested" tables containing two levels of iteration. This is best illustrated by example: - -```python - -instance = MyNestedIter(option1=value) -for group in instance: - print(group.group_name) - for row in group.data: - print(row.date, row.value) -``` - -For compatibility with tools that expect only a single-level table (e.g. [Django Data Wizard]), nested tables can be "flattened" using a function from `itertable.util`: - -```python -from itertable.util import flattened -instance = flattened(MyNestedIter, option1=value) -for row in instance: - print(row.group_name, row.date, row.value) -``` - -To be compatible with `flattened()`, nested Iter classes need to have the following characteristics: - 1. `nested = True` - 2. Extend `TupleMapper` - 3. Each mapped row should have a `data` property pointing to a nested Iter class instance. - -Note that none of the pre-mixed Iter classes in IterTable are nested. The [climata library] provides a number of examples of nested Iter classes. 
- -[itertable.base]: https://github.com/wq/itertable/blob/master/itertable/base.py - -[IterTable]: https://github.com/wq/itertable -[custom]: https://github.com/wq/itertable/blob/master/docs/about.md -[base]: https://github.com/wq/itertable/blob/master/docs/base.md -[loaders]: https://github.com/wq/itertable/blob/master/docs/loaders.md -[parsers]: https://github.com/wq/itertable/blob/master/docs/parsers.md -[mappers]: https://github.com/wq/itertable/blob/master/docs/mappers.md -[gis]: https://github.com/wq/itertable/blob/master/docs/gis.md - -[Pandas DataFrame]: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html -[Django Data Wizard]: https://github.com/wq/django-data-wizard -[climata library]: https://github.com/heigeo/climata diff --git a/docs/gis.md b/docs/gis.md deleted file mode 100644 index 8d2963c..0000000 --- a/docs/gis.md +++ /dev/null @@ -1,84 +0,0 @@ -Geospatial support -================== - -> Source: [`itertable.gis`][itertable.gis] - -[IterTable] includes a [gis submodule][itertable.gis] with a number of extensions for working with geospatial data. This submodule requires [Fiona], [Shapely], and [GeoPandas], which can be installed by specifying `itertable[gis]`. itertable.gis provides a Fiona-powered [loader][loaders] and [parser][parsers], as well as three Shapely and GeoPandas-powered [mapper][mappers] classes. These are combined with a GIS-aware [BaseIter][base] extension to provide a set of three pre-mixed base classes, described below. - -To leverage all of these features: -```bash -python3 -m pip install itertable[gis] -``` - -### GisIter - -The `GisIter` class (and the corresponding `GisMapper` mixin) provide an API similar to `TupleMapper`, but with a `geometry` field on each row containing the [GeoJSON-like objects] returned from Fiona. 
- -```python -from itertable.gis import GisIter -data = GisIter(filename='sites.shp') -for id, site in data.items(): - print(id, site.name, site.geometry['type']) -``` - -Note that all of the gis Iter classes assume a `key_field` of "id" and will behave like a `dict` (See [BaseIter][base]). - -### ShapeIter - -The `ShapeIter` class (and corresponding `ShapeMapper` mixin) replaces the GeoJSON-like `geometry` attribute with a [Shapely geometry object] for convenient manipulation and computation. - -```python -from itertable.gis import ShapeIter -data = ShapeIter(filename='sites.shp') -for id, site in data.items(): - print(id, site.name, site.geometry.area) -``` - -### WktIter - -The `WktIter` class (and corresponding `WktMapper` mixin) replaces the Shapely `geometry` attribute with a [WKT string] to simplify use with other libraries. - -```python -from itertable.gis import WktIter -data = WktIter(filename='sites.shp') -for id, site in data.items(): - OrmModel.objects.create(name=site.name, geometry=site.geometry) -``` - -### GeoDataFrame - -Like all IterTable classes, the gis Iter classes provide an `as_dataframe()` function for Pandas-powered analysis. - -```python -from itertable.gis import ShapeIter -data = ShapeIter(filename='sites.shp') -df = data.as_dataframe() -df.plot() -``` - -### Syncing gis Iter classes - -All gis Iter classes support the `sync()` operation (see [BaseIter][base]). Additional care is taken to ensure the Shapely metadata (other than the driver) is synced together with the data. 
- -```python -source = ShapeIter(filename="source.shp") -dest = ShapeIter(filename="dest.geojson") -source.sync(dest) -``` - -[itertable.gis]: https://github.com/wq/itertable/blob/master/itertable/gis/ - -[IterTable]: https://github.com/wq/itertable -[custom]: https://github.com/wq/itertable/blob/master/docs/about.md -[base]: https://github.com/wq/itertable/blob/master/docs/base.md -[loaders]: https://github.com/wq/itertable/blob/master/docs/loaders.md -[parsers]: https://github.com/wq/itertable/blob/master/docs/parsers.md -[mappers]: https://github.com/wq/itertable/blob/master/docs/mappers.md -[gis]: https://github.com/wq/itertable/blob/master/docs/gis.md - -[Fiona]: https://github.com/Toblerity/Fiona -[Shapely]: https://github.com/Toblerity/Shapely -[GeoPandas]: http://geopandas.org/ -[GeoJSON-like objects]: http://toblerity.org/fiona/manual.html#data-model -[Shapely geometry object]: http://toblerity.org/shapely/manual.html#geometric-objects -[WKT String]: http://en.wikipedia.org/wiki/Well-known_text diff --git a/docs/loaders.md b/docs/loaders.md deleted file mode 100644 index 992f112..0000000 --- a/docs/loaders.md +++ /dev/null @@ -1,64 +0,0 @@ -Loaders -======= - -> Source: [`itertable.loaders`][itertable.loaders] - -[IterTable]'s `Loader` [mixin classes][custom] facilitate loading an external resource from the local filesystem or from the web into a file-like object. A loader is essentially just a class with `load()` and `save()` methods defined. 
The canonical example is `FileLoader`, represented in its entirety below: - -```python -class FileLoader(BaseLoader): - filename = None - read_mode = 'r' - write_mode = 'w+' - - def load(self): - try: - self.file = open(self.filename, self.read_mode) - self.empty_file = False - except IOError: - self.file = StringIO() - self.empty_file = True - - def save(self): - file = open(self.filename, self.write_mode) - self.dump(file) - file.close() -``` - -As can be seen above, every `Loader`'s `load()` method should take no arguments, instead determining what to load based on properties on the class instance. (Remember that the [BaseIter][base] class provides a convenient method for setting class properties on initialization). `load()` should set two properties on the class: - - * `file`, a file-like object that will be accessed by the parser - * `empty_file`, a boolean indicating that the file was empty or nonexistent (used to short-circuit the parser and avoid errors) - -To support file output, loaders should define a `save()` method, which should prepare a file-like object for writing, call `self.dump()` with the output file, and perform any needed wrap-up operations. - -### Built-In Loaders - -There are six built-in loader classes defined in [itertable.loaders]. - -name | purpose ------|--------- -`FileLoader` | Loads text data from a local file (e.g. a CSV or XML file). Expects a `filename` property to be set on the class instance. -`BinaryFileLoader` | Loads binary data from the local filesystem (e.g. an Excel spreadsheet). Expects a `filename` property. -`ZipFileLoader` | Opens a local zip file and extracts a single inner file. If there is more than one inner file, the `inner_filename` property should be set. If the inner file is binary, `inner_binary` should be set to `True`. -`StringLoader` | Loads data to and from a `string` property that should be set on the class instance. 
-`NetLoader` | Loads data over HTTP(S) and expects a `url` property to be set or computed by the class or instance. -`ZipNetLoader` | Loads a zip file over HTTP(S) and extracts a single inner file. If there is more than one inner file, the `inner_filename` property should be set. If the inner file is binary, `inner_binary` should be set to `True`. - -`NetLoader` and `ZipNetLoader` support optional HTTP `username`, `password`, and URL `params` properties. Note that `save()` is not implemented for these loaders. - -### Custom Loaders - -The built-in loaders should be enough for many use cases. The most common use for a custom loader is to encapsulate a number of `NetLoader` options into a reusable mixin class. For example, the [climata library] defines a `WebserviceLoader` for this purpose. - -[itertable.loaders]: https://github.com/wq/itertable/blob/master/itertable/loaders.py - -[IterTable]: https://github.com/wq/itertable -[custom]: https://github.com/wq/itertable/blob/master/docs/about.md -[base]: https://github.com/wq/itertable/blob/master/docs/base.md -[loaders]: https://github.com/wq/itertable/blob/master/docs/loaders.md -[parsers]: https://github.com/wq/itertable/blob/master/docs/parsers.md -[mappers]: https://github.com/wq/itertable/blob/master/docs/mappers.md -[gis]: https://github.com/wq/itertable/blob/master/docs/gis.md - -[climata library]: https://github.com/heigeo/climata diff --git a/docs/mappers.md b/docs/mappers.md deleted file mode 100644 index 6f97bb6..0000000 --- a/docs/mappers.md +++ /dev/null @@ -1,93 +0,0 @@ -Mappers -======= - -> Source: [`itertable.mappers`][itertable.mappers] - - -[IterTable]'s `Mapper` [mixin classes][custom] are used to make code for working with a loaded dataset more readable. This is accomplished by "mapping" each item in the dataset to a "usable item". Mappers are used during iteration, after the [parser][parsers] has created the `data` object as a `list` of `dict`s. 
The primary mapper class is `TupleMapper`, which converts each `dict` in the dataset into a [namedtuple] so fields can be accessed as `row.name` instead of `row['name']`. - -```python -from itertable import ExcelFileIter - -# Loader and Parser do their work here -instance = ExcelFileIter(filename='example.xlsx') - -# Mapper does its work here -for row in instance: - print(row.name, row.date) - -# You can also access the unmapped data directly -for row in instance.data: - print(row['name'], row['date']) - -``` - -A mapper class should have two methods to accomplish the mapping: - -name | purpose ------|--------- -`usable_item(row)` | Convert the source `dict` into a "usable item", e.g. a `namedtuple`. (This is just the method name, it's not meant to imply that dicts are unusable.) -`parse_usable_item(item)` | Convert a usable item back into the source `dict` format. This is needed for full read+write support. - -IterTable's [built-in mapper classes][mappers] build on this foundation and on each other. - -### BaseMapper -`BaseMapper` breaks down `usable_item` and `parse_usable_item` into functions that work on each field individually. All of the functions are effectively no-ops and meant to be overridden. The usable item `BaseMapper` returns is still a `dict`. - -name | purpose ------|--------- -`map_field(field)` | Map a field name into its "usable" equivalent -`unmap_field(field)` | Map a "usable" field name back to its original name -`map_value(field, value)` | Map a field value into its "usable" equivalent -`unmap_value(field, value)` | Map a "usable" value back into the source value - -### DictMapper -`DictMapper` extends `BaseMapper` with two simple dictionaries that facilitate field and value mapping. The usable item `DictMapper` returns is still a `dict`. 
- -name | purpose ------|--------- -`field_map` | Map of fields to their usable equivalents: `{"source_field1": "usable_name1", "source_field2": "usable_name2"}` -`value_map` | Map of values to be replaced with usable equivalents. If a value is not found in the map it will be preserved as is. - -### TupleMapper -`TupleMapper` extends `DictMapper` with a `usable_item()` that returns a [namedtuple] instead of a `dict`. Since `namedtuple` field names cannot contain spaces or punctuation, `TupleMapper` automatically computes a `field_map` with compatible values. `TupleMapper` defines the following method to facilitate adding new records. - -name | purpose ------|--------- -`create(**kwargs)` | Create an instance of the internal `namedtuple` class with values for each field given as keyword arguments. This can be passed to `append()`, which will update the underlying dataset as shown in the example below. - -```python -from itertable import CsvFileIter -instance = CsvFileIter(filename="example.csv") -# len(instance) == len(instance.data) == 2 - -record = instance.create(name='test', value=123) -instance.append(record) -instance.save() - -# len(instance) == len(instance.data) == 3 - -``` - -### TimeSeriesMapper - -`TimeSeriesMapper` extends `TupleMapper` with a `map_value()` implementation that automatically converts string dates into [datetime] objects. It can also automatically convert string numbers into `float`s. Two properties are used to configure `TimeSeriesMapper`: - -name | purpose ------|--------- -`date_formats` | A list of [format strings] to use when attempting to parse dates. 
-`map_floats` | Whether to attempt to map string numbers into floats (default `True`) - -[itertable.mappers]: https://github.com/wq/itertable/blob/master/itertable/mappers.py - -[IterTable]: https://github.com/wq/itertable -[custom]: https://github.com/wq/itertable/blob/master/docs/about.md -[base]: https://github.com/wq/itertable/blob/master/docs/base.md -[loaders]: https://github.com/wq/itertable/blob/master/docs/loaders.md -[parsers]: https://github.com/wq/itertable/blob/master/docs/parsers.md -[mappers]: https://github.com/wq/itertable/blob/master/docs/mappers.md -[gis]: https://github.com/wq/itertable/blob/master/docs/gis.md - -[namedtuple]: https://docs.python.org/3/library/collections.html#collections.namedtuple -[datetime]: https://docs.python.org/3/library/datetime.html -[format strings]: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior diff --git a/docs/parsers.md b/docs/parsers.md deleted file mode 100644 index fd72646..0000000 --- a/docs/parsers.md +++ /dev/null @@ -1,138 +0,0 @@ -Parsers -======= - -> Source: [`itertable.parsers`][itertable.parsers] - - -[IterTable]'s `Parser` [mixin classes][custom] facilitate parsing data from a loaded `file` object into a `list` of `dict`s. A parser is essentially just a class with `parse()` and `dump()` methods defined. In general, a parser class should just provide a wrapper around a third-party API (e.g. [csv], [xml.etree] or [xlrd]). A hypothetical parser class would look like this: - -```python -from some_library import some_api - -class HypotheticalParser(BaseParser): - def parse(self): - self.data = some_api.load(self.file) - - def dump(self, file): - some_api.dump(self.data, file) -``` - -As can be seen in the above example, the `parse()` function takes no arguments, instead assuming `self.file` has already been defined by a [Loader][loaders] mixin. The data object should be defined as a `list` of `dict`s (e.g. `[{"id":1},{"id":2}]`). 
If the result returned by the API has some other structure, it should be processed to match the expected format. The `dump()` function should accept a writable file handle as an argument and use the API to write the data object back to the file. - -## Extending Parser Classes -There are two main ways in which parser classes are customized. One way is to define a completely new class to support a file format or API not currently supported by the built-in IterTable parsers. The other way, which is much more common, is to extend or change the behavior of an existing parser. With that in mind, each of the built-in parser classes is discussed below together with common customization options and techniques. - -### Non-Tabular Parsers -Two of the built-in parsers are used for file formats that are *not* inherently tabular and can describe arbitrary data structures. While these file data formats are not inherently tabular, they often are used represent table-like data. These parsers directly extend `BaseParser` and have the `tabular` property set to `False`. - -> Non-tabular file formats allow for some records to have more fields than others. By default, IterTable only searches the first record when automatically determining field names. This can cause issues with [TupleMapper][mappers] in particular which expects consistent field names throughout the dataset. If this happens to you, set `scan_fields = True` on your class to tell IterTable to scan the entire dataset when determining field names. - -#### [JsonParser][itertable.parsers.text] - -The JSON parser is a simple wrapper around Python's built-in [json] API. `JsonParser` assumes the result of `json.load(self.file)` will either be an array or an object with an array somewhere in an inner property (in which case `namespace` should be set). Each item in the array is assumed to be a relatively flat key-value mapping. The keys of the first item in the array will be assumed to be the same for the rest of the items. 
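The `namespace` behavior described above can be sketched with a small stdlib-only example. The `parse_json` helper here is hypothetical (not the real `JsonParser` implementation); it only illustrates how a dotted path resolves to the inner array of records:

```python
import io
import json

def parse_json(file, namespace=None):
    # Load the document, then walk the dotted namespace path (if any)
    # down to the inner array of records
    data = json.load(file)
    if namespace:
        for key in namespace.split('.'):
            data = data[key]
    return data

doc = io.StringIO('{"meta": {}, "records": [{"id": 1}, {"id": 2}]}')
rows = parse_json(doc, namespace="records")
print(rows)  # → [{'id': 1}, {'id': 2}]
```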
-
-`JsonParser` supports the following options, specified as properties on the class or instance:
-
-##### Properties
-
-name | purpose
------|---------
-`namespace` | The dotted path to the array within the JSON object. For example, if the expected JSON is of the form `{"records":[{"id":1},{"id":2}]}`, the namespace should be `"records"`.
-`indent` | Used by the `dump()` method, which passes it on to `json.dump` to "pretty-print" the output JSON file.
-
-#### [XmlParser][itertable.parsers.text]
-
-The XML parser is a simple wrapper around Python's built-in [xml.etree] API. While it can be adapted to work with arbitrary XML documents, it assumes a basic structure like the following:
-
-```xml
-<root>
-  <item>
-    <id>1</id>
-    <value>val</value>
-  </item>
-  <item>
-    <id>2</id>
-    <value>val</value>
-  </item>
-</root>
-```
-
-In addition to the `parse()` and `dump()` methods, `XmlParser` provides row-level methods, described below.
-
-##### Properties & Methods
-
-name | purpose
------|---------
-`root_tag` | The name of the top-level XML tag. Determined automatically by `parse()`; only required for `dump()`.
-`item_tag` | The name of the series tag. Defaults to the name of the first child tag under the root. `parse()` will search for all instances of `item_tag` (whether explicitly specified or computed) and call `parse_item()` on each result. Required for `dump_item()`.
-`parse_item(elem)` | If overridden, should return a `dict` corresponding to the item. The default implementation assumes each property is specified as an inner tag name; XML attributes are ignored.
-`dump_item(obj)` | The inverse of `parse_item()`; if overridden, should accept a `dict` and return an `Element` instance.
-
-### Tabular Parsers
-
-The tabular parsers are geared toward spreadsheets and other tabular data formats.
These formats are differentiated from the non-tabular formats in that there is typically a single grid structure encompassing the entire file, and the field names / column headings are listed only once (usually, but not always, in the first row of the file).
-
-The tabular parsers extend [itertable.base.TableParser][itertable.parsers.base], which defines the following properties:
-
-name | purpose
------|-----------
-`tabular = True` | The `tabular` property signals the presence of the other properties below. It is checked by [Django Data Wizard] when importing data.
-`header_row` | The location of the column headers within the table. This is often `0` (the first row), but can be determined automatically by examining the first few rows of the table.
-`max_header_row` | The maximum number of rows to scan when looking for the column headers. The default is 20.
-`start_row` | The first row containing actual data. This defaults to `header_row` + 1. Useful when there are one or two empty rows between the column headers and the data in a spreadsheet.
-`extra_data` | A sparse matrix containing any data found in the cells above the header row. The format is `{row: {col: "Data"}}`. Currently only supported by `ExcelParser`.
-
-#### [CsvParser][itertable.parsers.text]
-
-`CsvParser` uses Python's [csv] module to provide a CSV-capable `TableParser`. It leverages [SkipPreludeReader][itertable.parsers.readers], a customized [DictReader] that adds support for files with extra "prelude" text before the actual header row.
-
-##### Properties & Methods
-
-name | purpose
------|-----------
-`delimiter` | Column separator; the default is `,`.
-`quotechar` | Quotation character for text values containing spaces or delimiters; the default is `"`.
-`reader_class()` | Method returning an uninstantiated `DictReader` class for use in parsing the data. The default implementation returns a subclass of `SkipPreludeReader` that passes along the `max_header_row` option.
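To illustrate the prelude-skipping idea, here is a standalone sketch built directly on the [csv] module. The header-detection heuristic (first row within `max_header_row` rows that splits into multiple columns) is an illustrative simplification, not `SkipPreludeReader`'s actual logic.

```python
import csv
import io

def read_with_prelude(text, max_header_row=20, delimiter=",", quotechar='"'):
    """Skip any 'prelude' text before the header row, then parse as CSV.

    Illustrative heuristic (not IterTable's actual implementation):
    treat the first row within max_header_row rows that splits into
    more than one column as the header row.
    """
    lines = text.splitlines()
    header_row = 0
    for i, line in enumerate(lines[:max_header_row]):
        if len(line.split(delimiter)) > 1:
            header_row = i
            break
    reader = csv.DictReader(
        io.StringIO("\n".join(lines[header_row:])),
        delimiter=delimiter,
        quotechar=quotechar,
    )
    return list(reader)

# "Survey Results" is prelude text above the real header row
rows = read_with_prelude("Survey Results\nid,name\n1,Alice\n2,Bob\n")
# rows == [{"id": "1", "name": "Alice"}, {"id": "2", "name": "Bob"}]
```

Files without a prelude also work, since the first line then matches the heuristic immediately.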
-
-#### [ExcelParser (`WorkbookParser`)][itertable.parsers.xls]
-`ExcelParser` provides support for `.xlsx` files via the [openpyxl] module, while `OldExcelParser` supports `.xls` if [xlrd] and [xlwt] are installed. `ExcelParser` and `OldExcelParser` extend a somewhat more generic `WorkbookParser`, with the idea that the latter could eventually be extended to other "workbook"-style formats like ODS.
-
-> Note: In previous versions of IterTable, `ExcelParser` relied on [xlrd] to support both `.xlsx` and `.xls` formats. Now that xlrd has dropped `.xlsx` support, `ExcelParser` has been rewritten to use [openpyxl], which only supports `.xlsx` files. The old `xlrd`-based `ExcelParser` class has been renamed to `OldExcelParser`. (In most cases, you can just use `itertable.load_file()`, which automatically determines whether to use `ExcelParser` or `OldExcelParser`.)
-
-##### Properties
-
-name | purpose
------|---------
-`sheet_name` | Determines which sheet to load data from in a multi-sheet workbook. Defaults to `0` (the first sheet).
-
-##### Methods
-
-name | purpose
------|---------
-`sheet_names` | List the available sheets in the workbook (declared as a `@property` method).
-`parse_workbook()` | Load `self.file` into a `Workbook` or equivalent class and save it to `self.workbook`.
-`parse_worksheet(name)` | Load the specified worksheet into memory and save an array of row objects to `self.worksheet`.
-`parse_row(row)` | Convert the given row object into a `dict`, usually by mapping the column header to the value in each cell.
-`get_value(cell)` | Retrieve the actual value from the cell.
-
-The methods listed above are called in turn by `parse()`, which is defined by `WorkbookParser`. Working implementations of these methods are defined in `ExcelParser` and `OldExcelParser`.
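The call chain described above can be sketched as follows. This is an illustrative skeleton, not IterTable's actual `WorkbookParser`; the toy list-based backend stands in for a real openpyxl- or xlrd-based subclass, and it simplifies by assuming the header is the first row.

```python
class WorkbookParserSketch:
    """Illustrative skeleton of the parse() call chain (not IterTable's code)."""

    sheet_name = 0  # index or name of the sheet to load

    def parse(self):
        self.parse_workbook()
        name = self.sheet_name
        if isinstance(name, int):
            name = self.sheet_names[name]
        self.parse_worksheet(name)
        # Simplification: assume the header is the first row (the real
        # parser can scan for it via header_row / max_header_row)
        header, *rows = self.worksheet
        self.field_names = [self.get_value(cell) for cell in header]
        self.data = [self.parse_row(row) for row in rows]

    def parse_row(self, row):
        # Map each column header to the value in the corresponding cell
        return {
            name: self.get_value(cell)
            for name, cell in zip(self.field_names, row)
        }

    # Concrete subclasses implement the backend-specific methods:
    def parse_workbook(self):
        raise NotImplementedError

    def parse_worksheet(self, name):
        raise NotImplementedError

    def get_value(self, cell):
        raise NotImplementedError

class ListWorkbookParser(WorkbookParserSketch):
    """Toy backend where a 'workbook' is a dict of sheet name -> rows."""

    def __init__(self, book):
        self.book = book

    def parse_workbook(self):
        self.workbook = self.book
        self.sheet_names = list(self.book)

    def parse_worksheet(self, name):
        self.worksheet = self.workbook[name]

    def get_value(self, cell):
        return cell  # plain values; a real parser unwraps Cell objects

parser = ListWorkbookParser({"Sheet1": [["id", "name"], [1, "Alice"]]})
parser.parse()
# parser.data == [{"id": 1, "name": "Alice"}]
```

Splitting the work across `parse_workbook()`, `parse_worksheet()`, `parse_row()`, and `get_value()` is what lets `ExcelParser` and `OldExcelParser` share the orchestration in `parse()` while swapping out only the backend-specific pieces.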
- -[itertable.parsers]: https://github.com/wq/itertable/blob/master/itertable/parsers/ -[itertable.parsers.base]: https://github.com/wq/itertable/blob/master/itertable/parsers/base.py -[itertable.parsers.readers]: https://github.com/wq/itertable/blob/master/itertable/parsers/readers.py -[itertable.parsers.text]: https://github.com/wq/itertable/blob/master/itertable/parsers/text.py -[itertable.parsers.xls]: https://github.com/wq/itertable/blob/master/itertable/parsers/xls.py - -[IterTable]: https://github.com/wq/itertable -[custom]: https://github.com/wq/itertable/blob/master/docs/about.md -[base]: https://github.com/wq/itertable/blob/master/docs/base.md -[loaders]: https://github.com/wq/itertable/blob/master/docs/loaders.md -[parsers]: https://github.com/wq/itertable/blob/master/docs/parsers.md -[mappers]: https://github.com/wq/itertable/blob/master/docs/mappers.md -[gis]: https://github.com/wq/itertable/blob/master/docs/gis.md - -[csv]: https://docs.python.org/3/library/csv.html -[xml.etree]: https://docs.python.org/3/library/xml.etree.elementtree.html -[xlrd]: http://www.python-excel.org/ -[json]: https://docs.python.org/3/library/json.html -[Django Data Wizard]: https://github.com/wq/django-data-wizard -[DictReader]: https://docs.python.org/3/library/csv.html#csv.DictReader -[xlwt]: http://www.python-excel.org/ -[openpyxl]: https://openpyxl.readthedocs.io/en/stable/