Redesign of DataSet API #45

Closed
hughleat opened this issue Jan 26, 2021 · 4 comments
Labels: Datasets (Issues relating to datasets), Enhancement (New feature or request), RPC


hughleat commented Jan 26, 2021

Redesign the Dataset class to not depend on tarballs and particular data structures.

Currently the datasets are hard-coded as tarballs (see https://github.com/facebookresearch/CompilerGym/blob/development/compiler_gym/envs/llvm/datasets.py); these are later unpacked into a particular format in which the directory structure is very important.

This means that we have to curate them. E.g. we can't pull benchmarks from AnghaBench directly; we have to host a tarball somewhere.

Also, if we had random program generators, e.g. Csmith or CLGen, we couldn't really work with them.

Instead, add methods to the Dataset class to install and extract:

class Dataset:
  ...
  def name(self) -> str: ...
  def install(self, path: Path): ...                 # e.g. ~/.compiler_gym/datasets/<name>
  def benchmark_ids(self) -> Iterable[Any]: ...      # Possibly lazy list of benchmark names
  def benchmarks(self) -> Iterable[Benchmark]: ...   # Possibly lazy list of benchmarks
  def benchmark(self, id: Any) -> Benchmark: ...

class TarballDataset(Dataset):
  def __init__(self, name: str, url: str, **kwargs): ...
  def install(self, p): download_unpack(p, self.url)
  def benchmark_ids(self): return get_filenames_from_install_dir()
  ...

class Anghabench(Dataset):
  ...
  def name(self): return "anghabench"
  def install(self, p): git_clone_into(p, "https://github.com/brenocfg/AnghaBench.git")

class CSmith(Dataset):
  def install(self, p): install_csmith_binary(p)
  def benchmark_ids(self): return range_all_ints     # every Csmith seed
  def benchmark(self, id: int): return csmith_for_seed(id)

At the same time, something like this would free datasets to have their own directory structure. This might be useful when considering input data for correctness and performance. Sometimes multiple benchmarks will share things.

At init time, the gym could look in the dataset dir (e.g. ~/.compiler_gym/datasets, or an env var such as COMPILER_GYM_DIR). Any Python scripts in there could be run to register datasets.
Programmatically, people could register their own datasets outside of that common mechanism.
A command line tool, install_dataset <url>, could fetch a script from the url, drop it in the dataset dir, then run install(). A sketch of this registration mechanism is given below.
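
A minimal sketch of what that registration mechanism might look like (all names here, e.g. register_dataset and load_registered_datasets, are hypothetical and not an existing API; COMPILER_GYM_DIR follows the env var suggested above):

    # Hypothetical registry sketch; none of these names exist in CompilerGym yet.
    import importlib.util
    import os
    from pathlib import Path

    _DATASETS = {}  # name -> Dataset instance

    def register_dataset(dataset):
        """Programmatic registration, outside the script-scanning mechanism."""
        _DATASETS[dataset.name()] = dataset

    def load_registered_datasets():
        """At init time, execute every Python script in the dataset dir so it
        can call register_dataset() as a side effect of being imported."""
        dataset_dir = Path(os.environ.get("COMPILER_GYM_DIR",
                                          Path.home() / ".compiler_gym" / "datasets"))
        for script in dataset_dir.glob("*.py"):
            spec = importlib.util.spec_from_file_location(script.stem, script)
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)  # the script registers its datasets here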

hughleat added the Enhancement (New feature or request) label Jan 26, 2021
@ChrisCummins

Thanks @hughleat, I think that a python-side dataset API makes a lot of sense, and I like the class hierarchy you're proposing. I think a prerequisite for this is to figure out the role of the backend service in managing datasets, as at the moment the frontend Python code and backend service have slightly jumbled and overlapping roles. We could shift the responsibility of managing benchmarks from the service to the frontend by:

  1. Remove the list-of-all-benchmarks from the BenchmarkFactory in the backend service.
  2. Remove the benchmark management methods from the service API.
  3. Make the benchmark a compulsory parameter for starting episodes (see the sketch below).
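
For point 3, a rough sketch of what a compulsory benchmark parameter could look like from the user's side (the exact call signature is an assumption, not a settled API):

    # Assumed usage only; the final signature may differ.
    import compiler_gym

    env = compiler_gym.make("llvm-v0")
    env.reset(benchmark="benchmark://cBench-v0/crc32")  # an episode cannot start without a benchmark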

Cheers,
Chris

ChrisCummins added a commit that referenced this issue Jan 29, 2021
Add semantics validation for cBench benchmarks. This is achieved by
adding a new validation callback mechanism that, when invoked,
compiles the given cBench benchmark to a binary and executes it using
prepared datasets. The output of the program, along with any generated
output files, is differentially tested against a copy of the program
compiled without optimizations. A change in program behavior that is
detected by this mechanism is reported.

Calling `compiler_gym.validate_state()` on a benchmark that supports
semantics validation will automatically run it.
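
A hedged usage sketch of that entry point (only the validate_state(env, state) signature is taken from this thread; the rest is assumed for illustration):

    import compiler_gym

    env = compiler_gym.make("llvm-v0")
    env.reset(benchmark="benchmark://cBench-v0/ghostscript")
    result = compiler_gym.validate_state(env, env.state)  # runs the registered validation callbacks
    print(result)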

The core of the implementation is in
compiler_gym/envs/llvm/dataset.py. It defines a set of library
functions so that these validation callbacks can be defined ad-hoc for
cBench in quite a succinct form, e.g.:

    validator(
        benchmark="benchmark://cBench-v0/ghostscript",
        cmd="$BIN -sDEVICE=ppm -dNOPAUSE -dQUIET -sOutputFile=output.ppm -- 1.ps",
        data=["office_data/1.ps"],
        outs=["output.ppm"],
        linkopts=["-lm", "-lz"],
        pre_execution_callback=setup_ghostscript_library_files,
    )

As part of #45 we may want to make a public API similar to this and
move it into the dataset definitions.

Multiple validation callbacks can be defined for a single
benchmark. Where a benchmark matches multiple validators, they are
executed in parallel.

Compiling binaries from cBench benchmarks requires that the bitcodes
be compiled against the system-specific standard library, so this
patch also splits the cBench dataset into macOS and Linux versions.

ChrisCummins commented Feb 14, 2021

I think it would also be a good idea to wrap the Benchmark proto with a Python class that can add extra functionality, like the ability to validate benchmark behavior. To begin with, something simple like:

class Benchmark(object):
  def __init__(self, proto: BenchmarkProto): ...
  def sha1(self) -> bytes: ...                    # Name for caching benchmark + any other attributes
  def program_data(self) -> BenchmarkProto: ...   # The data that the service needs
  def program_data_sha1(self) -> bytes: ...       # Used for caching benchmarks on the service side
  def is_validatable(self) -> bool: ...
  def validation_callbacks(self) -> List[Callable[[CompilerEnv], Optional[str]]]: ...  # Run any ad-hoc validation, e.g. difftest, valgrind, etc.

ChrisCummins added a commit that referenced this issue Feb 24, 2021
Add a new CompilerEnv.validate() method that replaces the previous
validate_state(env, state) call. This is a stepping stone to enabling
a more flexible API for custom benchmark validation routines.

https://github.com/facebookresearch/CompilerGym/issues/45
ChrisCummins added the Datasets (Issues relating to datasets) label Feb 26, 2021
ChrisCummins added a commit that referenced this issue Feb 26, 2021
In preparation for introducing a new Dataset class.

Issue #45.
ChrisCummins modified the milestones: v0.1.4, v0.1.5 Mar 2, 2021
ChrisCummins self-assigned this Jul 13, 2021
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
With the new dataset API, enumerating the benchmarks is not
advised (the list may be infinite), and there is now no need to
install datasets ahead of time.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This test is flaky, and the functionality tested here will be removed
in facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
A benchmark represents the particular program that is being compiled.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This extends the LLVM data archive to include the following additional
binaries:

    bin/llc
    bin/llvm-as
    bin/llvm-bcanalyzer
    bin/llvm-config
    bin/llvm-dis
    bin/llvm-mca

This also moves the location of the unpacked archive to llvm-v0 (with
a version suffix), and fixes a race condition in the download logic.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
Decode the binary data from the manifest.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This adds python operator overloads that alias to existing methods to
make the Dataset class "feel" more like a regular python dictionary:

     >>> len(dataset)  # equivalent to dataset.n
     23

     >>> for benchmark in dataset:  # iterate over the class directly
     ...     pass

     >>> dataset["cbench-v1/crc32"]  # key a benchmark

This also renames Dataset.n to Dataset.size for consistency with other
containers like np.ndarray, and returns math.inf if the number of
benchmarks is infinite, not a negative integer. The advantage of
math.inf is that it will poison any integer arithmetic, e.g.

     >>> sum(d.size for d in datasets)
     inf

if any one of the datasets has an infinite size. With a negative
number, this would instead compute a regular integer value.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This patch makes two simplifications to the Datasets API:

1) It removes the random-benchmark selection logic from
`Dataset.benchmark()`. Now, calling `benchmark()` requires a URI. If
you wish to select a benchmark randomly, you can implement this random
selection yourself. The idea is that random benchmark selection is a
minor use case that introduces quite a bit of complexity into the
implementation.

2) It removes the `Union[str, Dataset]` argument types from `Datasets`
methods. Now, only a string is permitted. This makes the argument
types easier to understand. If the user has a `Dataset` instance
that they would like to use, they can explicitly pass in
`dataset.name`.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This is to start the transition from the LegacyDatasets to the new
Datasets API.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This adds new Dataset class implementations of some of the LLVM
datasets. The original LegacyDatasets are still used for now; they
will be migrated once everything is in place.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This differs from the previous version in that it downloads the
original C++ sources and compiles them on-demand, rather than
downloading prepared bitcodes.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This adds two new datasets, csmith-v0 and llvm-stress-v0, that are
parametrized program generators. csmith-v0 uses Csmith to generate C99
programs that are then lowered to bitcode. llvm-stress-v0 generates
random LLVM-IR.

Both generators were developed to stress test compilers, so they have
an above-average chance that a generated benchmark will cause the
compiler to enter an unexpected state.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This adds a dataset of 1k OpenCL kernels that were used in the paper:

    Cummins, Chris, Pavlos Petoumenos, Zheng Wang, and Hugh
    Leather. "Synthesizing benchmarks for predictive modeling." In
    2017 IEEE/ACM International Symposium on Code Generation and
    Optimization (CGO), pp. 86-99. IEEE, 2017.

The OpenCL kernels are compiled on-demand.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
The dataset is from:

    da Silva, Anderson Faustino, Bruno Conde Kind, José Wesley de
    Souza Magalhaes, Jerônimo Nunes Rocha, Breno Campos Ferreira
    Guimaraes, and Fernando Magno Quinão Pereira. "ANGHABENCH: A Suite
    with One Million Compilable C Benchmarks for Code-Size Reduction."
    In 2021 IEEE/ACM International Symposium on Code Generation and
    Optimization (CGO), pp. 378-390. IEEE, 2021.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This adds the new Dataset implementation of the cBench dataset. The
validation logic isn't super tidy and could be cleaned up a bit; it's
just copied over from //compiler_gym/envs/llvm:legacy_datasets.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
No longer permit the benchmark name to be missing for a benchmark
URI to be considered well formed, as we no longer support dataset-only
URIs.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
We no longer require running compiler_gym.bin.datasets to download a
dataset for testing.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This updates the documentation of the getting started guide,
tutorials, API reference etc to the new dataset API.

In general, this means simplifying things, as we no longer need to
explain how to download and manage datasets.

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This switches over the `CompilerEnv` environment to use the new
dataset API, dropping the `LegacyDataset` class.

Background
----------

Since the very first prototype of CompilerGym, a `Benchmark` protocol
buffer has been used to provide a serializable representation of
benchmarks that can be passed back and forth between the service and
the frontend.

Initially, it was up to the compiler service to maintain the set of
available benchmarks, exposing the available benchmarks with a
`GetBenchmarks()` RPC method, and allowing new benchmarks to be added
using an `AddBenchmarks()` method.

This was fine for the initial use case of shipping a handful of
benchmarks and allowing ad-hoc new benchmarks to be added, but for
managing larger sets of benchmarks, a *datasets* abstraction was
added.

Initial Datasets abstraction
----------------------------

To add support for managing large sets of programs, a
[Dataset](https://github.com/facebookresearch/CompilerGym/blob/49c10d77d1c1b1297a1269604584a13c10434cbb/compiler_gym/datasets/dataset.py#L20)
tuple was added that describes a set of programs, and a link to a
tarball containing those programs. The tarball is required to have a
JSON file containing metadata, and a directory containing the
benchmarks, one file per benchmark. A set of operations was added to
the frontend command line to make downloading and unpacking these
tarballs easier:

https://github.com/facebookresearch/CompilerGym/blob/49c10d77d1c1b1297a1269604584a13c10434cbb/compiler_gym/bin/datasets.py#L5-L133

Problems with this approach
---------------------------

(1) **Leaky abstraction** Both the environment and backend service
have to know about datasets. This means duplicated logic, and adds a
maintenance burden of keeping the C++/Python logic in sync.

(2) **Inflexible** Only supports environments in which a single file
represents a benchmark. No support for multi-file benchmarks,
benchmarks that are compiled on-demand, etc.

(3) **O(n) space and time overhead** on each service instance, where *n*
is the total number of benchmarks. At init time, each service needs to
recursively scan a directory tree to build a list of available
benchmarks. This list must be kept in memory. This adds startup time,
and also causes cache invalidation issues when multiple environment
instances are modifying the underlying filesystem.

New Dataset API
---------------

This commit changes the ownership model so that the *Environment* owns
the benchmarks and datasets, not the service. This uses the new
`Dataset` class hierarchy that has been added in previous pull
requests: facebookresearch#190, facebookresearch#191, facebookresearch#192, facebookresearch#200, facebookresearch#201.

Now, the backend has no knowledge of "datasets". Instead the service
simply keeps a small cache of benchmarks that it has seen. If a
session request has a benchmark URI that is not in this cache, the
service returns a "resource not found" error and the frontend logic
can then respond by sending it a copy of the benchmark as a
`Benchmark` proto. The service is free to cache this for future use,
and can empty the cache whenever it wants.
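
A schematic of that frontend flow (NotFoundError, start_episode, add_benchmark, and benchmark.proto are placeholder names for whatever the RPC layer actually exposes):

    class NotFoundError(Exception):
        """Placeholder for the service's 'resource not found' error."""

    def start_episode(service, benchmark):
        try:
            return service.start_session(benchmark.uri)
        except NotFoundError:
            # Service-side cache miss: ship the Benchmark proto, then retry once.
            service.add_benchmark(benchmark.proto)
            return service.start_session(benchmark.uri)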

This new approach has a few key benefits:

(1) By moving all of the datasets logic into the frontend, it becomes
much easier for users to define their own datasets.

(2) Reduces compiler service startup time as it removes the need for
each service to do a recursive filesystem sweep.

(3) Removes the requirement that the set of benchmarks is fully
enumerable, allowing for program generators that can produce a
theoretically infinite number of benchmarks.

(4) Adds support for lazily-compiled datasets of programs that are
generated on-demand.

(5) Removes the need to download datasets ahead of time. Datasets can
now be installed on-demand.

Summary of changes
------------------

(1) Changes the type of `env.benchmark` from a string to a `Benchmark`
instance.

(2) Makes `env.benchmark` a mandatory attribute. If no benchmark is
provided at init time, one is chosen deterministically. If you wish to
select a random benchmark, use `env.datasets.benchmark()`.

(3) `env.fork()` no longer requires `env.reset()` to have been called
first. It will call `env.reset()` if required.

(4) `env.benchmark = None` is no longer a valid way of requesting a
random benchmark. If you would like a random benchmark, you must now
roll your own random picker using `env.datasets.benchmark_uris()` and
similar.

(5) Deprecates all `LegacyDataset` operations, changing their behavior
to no-ops, and removing the class.

(6) Renames `cBench` to `cbench` to be consistent with the lower-case
naming convention of gym. The old `cBench` datasets are kept around
but are marked deprecated to encourage migration.

Migrating to the new interface
------------------------------

To migrate existing code to the new interface:

(1) Update references to `cBench-v[01]` to `cbench-v1`.

(2) Review code that accesses the `env.benchmark` property and update
to `env.benchmark.uri` if a string name is required.

(3) Review code that calls `env.reset()` without first setting a
benchmark. Previously, calling `env.reset()` would select a random
benchmark. Now, `env.reset()` always selects the last used benchmark,
or a predetermined default if none is specified.

(4) Review code that relies on `env.benchmark` being `None` to select
benchmarks randomly. Now, `env.benchmark` is always set to the
previously used benchmark, or a predetermined default benchmark if
none has been provided (see the sketch after this list).

(5) Remove calls to `env.require_dataset()`.
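
Expanding on point (4), a minimal sketch of rolling your own random picker (capping the URI iterator, since some datasets are infinite):

    import itertools
    import random

    import compiler_gym

    env = compiler_gym.make("llvm-v0")
    # benchmark_uris() may be lazy or infinite, so take a bounded sample first.
    uris = list(itertools.islice(env.datasets.benchmark_uris(), 10_000))
    env.reset(benchmark=random.choice(uris))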

Issue facebookresearch#45.
bwasti pushed a commit to bwasti/CompilerGym that referenced this issue Aug 3, 2021
This replaces the boolean `hidden` value with a `deprecated` message,
which is emitted automatically on a call to `install()`.

Issue facebookresearch#45. Fixes facebookresearch#219.