Redesign of DataSet API #45

Closed
hughleat opened this issue Jan 26, 2021 · 4 comments
Labels: Datasets (Issues relating to datasets), Enhancement (New feature or request), RPC


hughleat commented Jan 26, 2021

Redesign the Dataset class to not depend on tarballs and particular data structures.

Currently the datasets are hard-coded as tarballs (see https://github.com/facebookresearch/CompilerGym/blob/development/compiler_gym/envs/llvm/datasets.py); these are later unpacked into a particular format in which the directory structure is very important.

This means that we have to curate them. E.g. we can't pull benchmarks from AnghaBench directly; we have to host a tarball somewhere.

Also, if we had random program generators, e.g. Csmith or CLGen, we couldn't really work with them.

Instead, add methods to the Dataset class to install and extract:

class Dataset:
  ...
  def name(self) -> str: ...
  def install(self, path: Path): ...                 # e.g. ~/.compiler_gym/datasets/<name>
  def benchmark_ids(self) -> Iterable[Any]: ...      # Possibly lazy list of benchmark names
  def benchmarks(self) -> Iterable[Benchmark]: ...   # Possibly lazy list of benchmarks
  def benchmark(self, id: Any) -> Benchmark: ...

class TarballDataset(Dataset):
  def __init__(self, name: str, url: str, **kwargs): ...
  def install(self, p): download_unpack(p, self.url)
  def benchmark_ids(self): return get_filenames_from_install_dir()
  ...

class Anghabench(Dataset):
  ...
  def name(self): return "anghabench"
  def install(self, p): git_clone_into(p, "https://github.com/brenocfg/AnghaBench.git")

class CSmith(Dataset):
  def install(self, p): install_csmith_binary(p)
  def benchmark_ids(self): return range_all_ints     # every Csmith seed
  def benchmark(self, id: int): return csmith_for_seed(id)

At the same time, something like this would free datasets to have their own directory structure. This might be useful when considering input data for correctness and performance. Sometimes multiple benchmarks will share things.

At init time, the gym could look in the dataset dir (e.g. ~/.compiler_gym/datasets, or an env var such as COMPILER_GYM_DIR). Any Python scripts in there could be run to register datasets.
Programmatically, people could register their own datasets outside of that common mechanism.
A command line tool, install_dataset <url>, could fetch a script from the url, drop it in the dataset dir, then run install(). A sketch of this registration mechanism is given below.
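
A minimal sketch of what that registration mechanism might look like (all names here, e.g. register_dataset and load_registered_datasets, are hypothetical and not an existing API; COMPILER_GYM_DIR follows the env var suggested above):

    # Hypothetical registry sketch; none of these names exist in CompilerGym yet.
    import importlib.util
    import os
    from pathlib import Path

    _DATASETS = {}  # name -> Dataset instance

    def register_dataset(dataset):
        """Programmatic registration, outside the script-scanning mechanism."""
        _DATASETS[dataset.name()] = dataset

    def load_registered_datasets():
        """At init time, execute every Python script in the dataset dir so it
        can call register_dataset() as a side effect of being imported."""
        dataset_dir = Path(os.environ.get("COMPILER_GYM_DIR",
                                          Path.home() / ".compiler_gym" / "datasets"))
        for script in dataset_dir.glob("*.py"):
            spec = importlib.util.spec_from_file_location(script.stem, script)
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)  # the script registers its datasets here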

hughleat added the Enhancement (New feature or request) label Jan 26, 2021
@ChrisCummins

Thanks @hughleat, I think that a python-side dataset API makes a lot of sense, and I like the class hierarchy you're proposing. I think a prerequisite for this is to figure out the role of the backend service in managing datasets, as at the moment the frontend Python code and backend service have slightly jumbled and overlapping roles. We could shift the responsibility of managing benchmarks from the service to the frontend by:

  1. Remove the list-of-all-benchmarks from the BenchmarkFactory in the backend service.
  2. Remove the benchmark management methods from the service API.
  3. Make the benchmark a compulsory parameter for starting episodes (see the sketch below).
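
For point 3, a rough sketch of what a compulsory benchmark parameter could look like from the user's side (the exact call signature is an assumption, not a settled API):

    # Assumed usage only; the final signature may differ.
    import compiler_gym

    env = compiler_gym.make("llvm-v0")
    env.reset(benchmark="benchmark://cBench-v0/crc32")  # an episode cannot start without a benchmark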

Cheers,
Chris

ChrisCummins added a commit that referenced this issue Jan 29, 2021
Add semantics validation for cBench benchmarks. This is achieved by
adding a new validation callback mechanism that, when invoked,
compiles the given cBench benchmark to a binary and executes it using
prepared datasets. The output of the program, along with any generated
output files, is differentially tested against a copy of the program
compiled without optimizations. A change in program behavior that is
detected by this mechanism is reported.

Calling `compiler_gym.validate_state()` on a benchmark that supports
semantics validation will automatically run it.
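
A hedged usage sketch of that entry point (only the validate_state(env, state) signature is taken from this thread; the rest is assumed for illustration):

    import compiler_gym

    env = compiler_gym.make("llvm-v0")
    env.reset(benchmark="benchmark://cBench-v0/ghostscript")
    result = compiler_gym.validate_state(env, env.state)  # runs the registered validation callbacks
    print(result)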

The core of the implementation is in
compiler_gym/envs/llvm/dataset.py. It defines a set of library
functions so that these validation callbacks can be defined ad-hoc for
cBench in quite a succinct form, e.g.:

    validator(
        benchmark="benchmark://cBench-v0/ghostscript",
        cmd="$BIN -sDEVICE=ppm -dNOPAUSE -dQUIET -sOutputFile=output.ppm -- 1.ps",
        data=["office_data/1.ps"],
        outs=["output.ppm"],
        linkopts=["-lm", "-lz"],
        pre_execution_callback=setup_ghostscript_library_files,
    )

As part of #45 we may want to make a public API similar to this and
move it into the dataset definitions.

Multiple validation callbacks can be defined for a single
benchmark. Where a benchmark matches multiple validators, they are
executed in parallel.

Compiling binaries from cBench benchmarks requires that the bitcodes
be compiled against the system-specific standard library, so this
patch also splits the cBench dataset into macOS and Linux versions.

ChrisCummins commented Feb 14, 2021

I think it would also be a good idea to wrap the Benchmark proto with a Python class that can add extra functionality, like the ability to validate benchmark behavior. To begin with, something simple like:

class Benchmark(object):
  def __init__(self, proto: BenchmarkProto): ...
  def sha1(self) -> bytes: ...                    # Name for caching benchmark + any other attributes
  def program_data(self) -> BenchmarkProto: ...   # The data that the service needs
  def program_data_sha1(self) -> bytes: ...       # Used for caching benchmarks on the service side
  def is_validatable(self) -> bool: ...
  def validation_callbacks(self) -> List[Callable[[CompilerEnv], Optional[str]]]: ...  # Run any ad-hoc validation, e.g. difftest, valgrind, etc.

ChrisCummins added a commit that referenced this issue Feb 24, 2021
Add a new CompilerEnv.validate() method that replaces the previous
validate_state(env, state) call. This is a stepping stone to enabling
a more flexible API for custom benchmark validation routines.

https://github.com/facebookresearch/CompilerGym/issues/45
ChrisCummins added the Datasets (Issues relating to datasets) label Feb 26, 2021
ChrisCummins added a commit that referenced this issue Feb 26, 2021
In preparation for introducing a new Dataset class.

Issue #45.
ChrisCummins modified the milestones: v0.1.4, v0.1.5 Mar 2, 2021
ChrisCummins self-assigned this Jul 13, 2021
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
With the new dataset API, enumerating the benchmarks is not
advised (the list may be infinite), and there is now no need to
install datasets ahead of time.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This test is flaky, and the functionality tested here will be removed
in facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
A benchmark represents the particular program that is being compiled.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This extends the LLVM data archive to include the following additional
binaries:

    bin/llc
    bin/llvm-as
    bin/llvm-bcanalyzer
    bin/llvm-config
    bin/llvm-dis
    bin/llvm-mca

This also moves the location of the unpacked archive to llvm-v0 (with
a version suffix), and fixes a race condition in the download logic.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
Decode the binary data from the manifest.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This adds python operator overloads that alias to existing methods to
make the Dataset class "feel" more like a regular python dictionary:

     >>> len(dataset)  # equivalent to dataset.n
     23

     >>> for benchmark in dataset:  # iterate over the class directly
     ...     pass

     >>> dataset["cbench-v1/crc32"]  # key a benchmark

This also renames Dataset.n to Dataset.size for consistency with other
containers like np.ndarray, and returns math.inf if the number of
benchmarks is infinite, not a negative integer. The advantage of
math.inf is that it will poison any integer arithmetic, e.g.

     >>> sum(d.size for d in datasets)
     inf

if any one of the datasets has an infinite size. With a negative
number, this would instead compute a regular integer value.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This patch makes two simplifications to the Datasets API:

1) It removes the random-benchmark selection logic from
`Dataset.benchmark()`. Now, calling `benchmark()` requires a URI. If
you wish to select a benchmark randomly, you can implement this random
selection yourself. The idea is that random benchmark selection is a
minor use case that introduces quite a bit of complexity into the
implementation.

2) It removes the `Union[str, Dataset]` argument types from `Datasets`
methods. Now, only a string is permitted. This makes the argument
types easier to understand. If the user has a `Dataset` instance
that they would like to use, they can explicitly pass in
`dataset.name`.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This is to start the transition from the LegacyDatasets to the new
Datasets API.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This adds new Dataset class implementations of some of the LLVM
datasets. The original LegacyDatasets are still used for now; they
will be migrated once everything is in place.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This differs from the previous version in that it downloads the
original C++ sources and compiles them on-demand, rather than
downloading prepared bitcodes.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This adds two new datasets, csmith-v0 and llvm-stress-v0, that are
parametrized program generators. csmith-v0 uses Csmith to generate C99
programs that are then lowered to bitcode. llvm-stress-v0 generates
random LLVM-IR.

Both generators were developed to stress test compilers, so they have
an above-average chance that a generated benchmark will cause the
compiler to enter an unexpected state.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This adds a dataset of 1k OpenCL kernels that were used in the paper:

    Cummins, Chris, Pavlos Petoumenos, Zheng Wang, and Hugh
    Leather. "Synthesizing benchmarks for predictive modeling." In
    2017 IEEE/ACM International Symposium on Code Generation and
    Optimization (CGO), pp. 86-99. IEEE, 2017.

The OpenCL kernels are compiled on-demand.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
The dataset is from:

    da Silva, Anderson Faustino, Bruno Conde Kind, José Wesley de
    Souza Magalhaes, Jerônimo Nunes Rocha, Breno Campos Ferreira
    Guimaraes, and Fernando Magno Quinão Pereira. "ANGHABENCH: A Suite
    with One Million Compilable C Benchmarks for Code-Size Reduction."
    In 2021 IEEE/ACM International Symposium on Code Generation and
    Optimization (CGO), pp. 378-390. IEEE, 2021.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This adds the new Dataset implementation of the cBench dataset. The
validation logic isn't super tidy and could be cleaned up a bit; it's
just copied over from //compiler_gym/envs/llvm:legacy_datasets.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
No longer permit the benchmark name to be missing for a benchmark
URI to be considered well formed, as we no longer support dataset-only
URIs.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
We no longer require running compiler_gym.bin.datasets to download a
dataset for testing.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This updates the documentation of the getting started guide,
tutorials, API reference etc to the new dataset API.

In general, this means simplifying things, as we no longer need to
explain how to download and manage datasets.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This switches over the `CompilerEnv` environment to use the new
dataset API, dropping the `LegacyDataset` class.

Background
----------

Since the very first prototype of CompilerGym, a `Benchmark` protocol
buffer has been used to provide a serializable representation of
benchmarks that can be passed back and forth between the service and
the frontend.

Initially, it was up to the compiler service to maintain the set of
available benchmarks, exposing the available benchmarks with a
`GetBenchmarks()` RPC method, and allowing new benchmarks to be added
using an `AddBenchmarks()` method.

This was fine for the initial use case of shipping a handful of
benchmarks and allowing ad-hoc new benchmarks to be added, but for
managing larger sets of benchmarks, a *datasets* abstraction was
added.

Initial Datasets abstraction
----------------------------

To add support for managing large sets of programs, a
[Dataset](https://github.com/facebookresearch/CompilerGym/blob/49c10d77d1c1b1297a1269604584a13c10434cbb/compiler_gym/datasets/dataset.py#L20)
tuple was added that describes a set of programs, and a link to a
tarball containing those programs. The tarball is required to have a
JSON file containing metadata, and a directory containing the
benchmarks, one file per benchmark. A set of operations was added to
the frontend command line to make downloading and unpacking these
tarballs easier:

https://github.com/facebookresearch/CompilerGym/blob/49c10d77d1c1b1297a1269604584a13c10434cbb/compiler_gym/bin/datasets.py#L5-L133

Problems with this approach
---------------------------

(1) **Leaky abstraction** Both the environment and backend service
have to know about datasets. This means duplicated logic, and adds a
maintenance burden of keeping the C++/Python logic in sync.

(2) **Inflexible** Only supports environments in which a single file
represents a benchmark. No support for multi-file benchmarks,
benchmarks that are compiled on-demand, etc.

(3) **O(n) space and time overhead** on each service instance, where *n*
is the total number of benchmarks. At init time, each service needs to
recursively scan a directory tree to build a list of available
benchmarks. This list must be kept in memory. This adds startup time,
and also causes cache invalidation issues when multiple environment
instances are modifying the underlying filesystem.

New Dataset API
---------------

This commit changes the ownership model so that the *Environment* owns
the benchmarks and datasets, not the service. This uses the new
`Dataset` class hierarchy that has been added in previous pull
requests: facebookresearch#190, facebookresearch#191, facebookresearch#192, facebookresearch#200, facebookresearch#201.

Now, the backend has no knowledge of "datasets". Instead the service
simply keeps a small cache of benchmarks that it has seen. If a
session request has a benchmark URI that is not in this cache, the
service returns a "resource not found" error and the frontend logic
can then respond by sending it a copy of the benchmark as a
`Benchmark` proto. The service is free to cache this for future use,
and can empty the cache whenever it wants.
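
A schematic of that frontend flow (NotFoundError, start_episode, add_benchmark, and benchmark.proto are placeholder names for whatever the RPC layer actually exposes):

    class NotFoundError(Exception):
        """Placeholder for the service's 'resource not found' error."""

    def start_episode(service, benchmark):
        try:
            return service.start_session(benchmark.uri)
        except NotFoundError:
            # Service-side cache miss: ship the Benchmark proto, then retry once.
            service.add_benchmark(benchmark.proto)
            return service.start_session(benchmark.uri)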

This new approach has a few key benefits:

(1) By moving all of the datasets logic into the frontend, it becomes
much easier for users to define their own datasets.

(2) Reduces compiler service startup time as it removes the need for
each service to do a recursive filesystem sweep.

(3) Removes the requirement that the set of benchmarks is fully
enumerable, allowing for program generators that can produce a
theoretically infinite number of benchmarks.

(4) Adds support for lazily-compiled datasets of programs that are
generated on-demand.

(5) Removes the need to download datasets ahead of time. Datasets can
now be installed on-demand.

Summary of changes
------------------

(1) Changes the type of `env.benchmark` from a string to a `Benchmark`
instance.

(2) Makes `env.benchmark` a mandatory attribute. If no benchmark is
provided at init time, one is chosen deterministically. If you wish to
select a random benchmark, use `env.datasets.benchmark()`.

(3) `env.fork()` no longer requires `env.reset()` to have been called
first. It will call `env.reset()` if required.

(4) `env.benchmark = None` is no longer a valid way of requesting a
random benchmark. If you would like a random benchmark, you must now
roll your own random picker using `env.datasets.benchmark_uris()` and
similar.

(5) Deprecates all `LegacyDataset` operations, changing their behavior
to no-ops, and removing the class.

(6) Renames `cBench` to `cbench` to be consistent with the lower-case
naming convention of gym. The old `cBench` datasets are kept around
but are marked deprecated to encourage migration.

Migrating to the new interface
------------------------------

To migrate existing code to the new interface:

(1) Update references to `cBench-v[01]` to `cbench-v1`.

(2) Review code that accesses the `env.benchmark` property and update
to `env.benchmark.uri` if a string name is required.

(3) Review code that calls `env.reset()` without first setting a
benchmark. Previously, calling `env.reset()` would select a random
benchmark. Now, `env.reset()` always selects the last used benchmark,
or a predetermined default if none is specified.

(4) Review code that relies on `env.benchmark` being `None` to select
benchmarks randomly. Now, `env.benchmark` is always set to the
previously used benchmark, or a predetermined default benchmark if
none has been provided (see the sketch after this list).

(5) Remove calls to `env.require_dataset()`.
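
Expanding on point (4), a minimal sketch of rolling your own random picker (capping the URI iterator, since some datasets are infinite):

    import itertools
    import random

    import compiler_gym

    env = compiler_gym.make("llvm-v0")
    # benchmark_uris() may be lazy or infinite, so take a bounded sample first.
    uris = list(itertools.islice(env.datasets.benchmark_uris(), 10_000))
    env.reset(benchmark=random.choice(uris))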

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This replaces the boolean `hidden` value with a `deprecated` message,
which is emitted automatically on a call to `install()`.

Issue facebookresearch#45. Fixes facebookresearch#219.