Replace C-code generation and compilation backend #312

brandonwillard · 2021-03-02T01:00:41Z

The text-based C-code generation and compilation backend in Aesara is difficult to use, debug, maintain, and extend. We need to fix that ASAP.

Cython is a well-established Python-to-C transpiler that provides a much cleaner, automatic means of generating the same kind of Python C API code that's written by hand in Aesara. Here are some possible benefits of replacing our current C implementations with Cython-generated C code:

- We could make more of our logic transparent to pure-Python readers.
- We would automatically benefit from updates and new features provided by Cython over time (e.g. Python C API version and capability updates).
- We could use Cython's build and code caching features, which could have much better support for different platforms and environments.
- We could attempt automatic conversion of Python-only Op implementations, resulting in more C-only code (i.e. fewer calls to-and-from Python/C during graph evaluation).
- C-level interactions with NumPy are much easier in Cython, so we might, for example, be able to generate C code for all the Subtensor*/indexing operations with a little bit of Cython (instead of our very limited C implementations for only certain types of indexing).

This general idea has been brought up in numerous different locations, so I'm creating this issue as a means of collecting all the relevant details, ideas, discussions, requirements, etc., into one place.

Related issues:

twiecki · 2021-03-02T07:49:40Z

Totally agree. I suppose we first need some functionality to enable usage of Cython.

brandonwillard · 2021-03-02T22:18:28Z

It's not actually that difficult to use Cython-generated code in Aesara right now. For instance, an Op can create its own thunk—via Op.make_thunk—that calls out to a Cython-generated extension. This is what the Scan Op does, and it's the approach used by the old example referenced in pymc-devs/pytensor#10.
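The `Op.make_thunk` route described above can be pictured with a minimal, self-contained sketch. It does not import Aesara: `ToyOp` and its `make_thunk(input_cells, output_cells)` signature are hypothetical simplifications, not the real method, and `math.hypot` merely stands in for a function exported by a Cython-built extension.

```python
import math


class ToyOp:
    """Hypothetical stand-in for an Op that builds its own thunk."""

    def __init__(self, compiled_fn):
        # `compiled_fn` plays the role of a Cython-generated function
        self.compiled_fn = compiled_fn

    def make_thunk(self, input_cells, output_cells):
        # Each cell is a one-element list holding a single value
        compiled_fn = self.compiled_fn

        def thunk():
            args = [cell[0] for cell in input_cells]
            output_cells[0][0] = compiled_fn(*args)

        return thunk


x_cell, y_cell, out_cell = [3.0], [4.0], [None]
thunk = ToyOp(math.hypot).make_thunk([x_cell, y_cell], [out_cell])
thunk()
print(out_cell[0])
```

Note that every evaluation in this sketch still passes through a Python-level call to `thunk` before reaching the compiled function.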

My impression is that this approach isn't the best, because it doesn't use Aesara's C-based thunk machinery. That machinery is presumably faster than the corresponding pure-Python machinery, perhaps due to reduced Python-to-C and C-to-Python overhead, among other things.

Aesara graph evaluation primer

For anyone who's not familiar with the idea of a "thunk" in Aesara, this paragraph might help.

Simply put, a "thunk" is an argumentless function that calls an Op's implementation code (either C or Python) with Aesara's input and output storage arrays (i.e. plain lists with entries for each graph node/Apply's inputs and outputs).

Here's a simple example:

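from aesara.graph.op import Op

# Storage lists shared between the thunk and whoever reads the results
inputs = [1, 2]
outputs = [None]

class SomeOp(Op):
    # Simplified signature for illustration; the real `Op.perform`
    # also receives the `Apply` node
    def perform(self, inputs, outputs):
        outputs[0] = inputs[0] + inputs[1]

# The storage lists are bound as default arguments, so the thunk
# can be called without arguments
def a_thunk(inputs=inputs, outputs=outputs):
    SomeOp().perform(inputs, outputs)

a_thunk()

# `outputs` is now `[3]`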

Those storage arrays make up the graph's memory model, and they're stored inside a thunk function's closure. When the thunk is evaluated those output arrays are populated with the computed values. A thunk is created for each node/Apply in a graph, and, when a node's output is used as the input to another node, the output storage array of the first node will be used as the input to the second.

Continuing from the previous example:

other_outputs = [None]

class SomeOtherOp(Op):
    def perform(self, inputs, outputs):
        outputs[0] = inputs[0]**2

def another_thunk(inputs=outputs, outputs=other_outputs):
    SomeOtherOp().perform(inputs, outputs)

# This thunk depends on the output of the previous thunk
another_thunk()

This lets Linkers create a thunk for each Op in a graph, and the VM classes can then evaluate those thunks very easily. Returning the contents of the output storage arrays that correspond to the desired outputs of the graph gives the kind of results produced by aesara.function.

Here's what aesara.function produces—in a nutshell:

# Using the example thunks above, we can create a function 
# that computes the graph for `(a + b)**2`
def compiled_graph_fn(a, b):
    inputs[0] = a
    inputs[1] = b
    for thunk in [a_thunk, another_thunk]:
        thunk()
    return other_outputs[0]

The for-loop in that example function is the job of the VMs, and the Linkers walk a graph and create the thunks. aesara.function creates a Function object that simply orchestrates the use of those two.
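To make the division of labour concrete, here is a self-contained toy version of the linker/VM split. It is only a sketch: `link`, `make_function`, and the `(fn, input names, output name)` node format are invented for illustration and are not Aesara APIs. Storage cells are one-element lists shared between the thunks that write and read them.

```python
import operator


def link(nodes, storage):
    """Toy linker: build one thunk per (fn, input names, output name) node."""
    thunks = []
    for fn, in_names, out_name in nodes:
        in_cells = [storage[name] for name in in_names]
        out_cell = storage[out_name]

        # Bind the cells as defaults so each thunk takes no arguments
        def thunk(fn=fn, in_cells=in_cells, out_cell=out_cell):
            out_cell[0] = fn(*(cell[0] for cell in in_cells))

        thunks.append(thunk)
    return thunks


def make_function(nodes, input_names, output_name):
    """Toy compiled function: fill inputs, run the thunks, read the output."""
    names = set(input_names) | {output_name}
    for _, in_names, out_name in nodes:
        names.update(in_names, [out_name])
    storage = {name: [None] for name in names}
    thunks = link(nodes, storage)

    def fn(*args):
        for name, value in zip(input_names, args):
            storage[name][0] = value
        for thunk in thunks:  # the "VM" loop
            thunk()
        return storage[output_name][0]

    return fn


# Graph for (a + b)**2, with nodes listed in evaluation order
nodes = [
    (operator.add, ["a", "b"], "s"),
    (lambda s: s**2, ["s"], "out"),
]
f = make_function(nodes, ["a", "b"], "out")
print(f(1, 2))
```

The intermediate cell `"s"` is the output storage of the first node and the input storage of the second, which is the sharing described earlier.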

How compiled C code is used

There are a few places where the C and Python thunks are clearly distinguished. In the CVM (aka lazylinker_c.CLazyLinker from the C extension end), which is generally used whenever the C toolchain is available, C thunks are treated specially (see here): a variable is set that signals the use of a special CVM.c_call. There doesn't seem to be much to it, just a pointer to the thunk's C function and that function's data/arguments.

From the Python side (e.g. when graph evaluation is performed using the pure Python VM Stack), there's a _CThunk class that appears to do the same thing as the CVM.c_call within _CThunk.__call__. It uses the run_cthunk function that's implemented in C here and exposed to Python via the cutils_ext extension. Ultimately run_cthunk uses the same pointers in the same way as CVM.c_call.

From what I can tell, _CThunks are exclusively created by the CLinker, which is briefly used by COp.make_c_thunk (called from the standard entry point Op.make_thunk) to make its thunks. Aside from the questionable need for an entirely distinct CLinker class and/or object in this situation, it seems like the whole situation could be as simple as obtaining—and using—those pointers.

Regarding those thunk pointers, they seem to come from CLinker.cthunks_factory, which kicks off the C-code compilation process—the same one that we're considering replacing here (e.g. with Cython, or at least some use of distutils's compilation code). In Python, those thunk pointers are PyCapsule objects, and they can be easily created/accessed in Cython.
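For readers who haven't handled a PyCapsule before, the following sketch shows the general idea from plain Python using `ctypes` and the CPython C API. Everything specific here is an assumption made for illustration: the `int thunk(void *data)` signature, the capsule name `example.thunk`, and the helper `run_capsule` are not taken from Aesara's code, and the "C function" is really a Python callback wrapped by `ctypes`.

```python
import ctypes

# Assumed C signature for illustration: int thunk(void *data)
THUNK_TYPE = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_void_p)

# Some C-level state that the thunk will modify through its data pointer
state = ctypes.c_int(0)


def _impl(data):
    # `data` arrives as an integer address; increment the int stored there
    ctypes.c_int.from_address(data).value += 1
    return 0  # 0 signals success in this sketch


c_thunk = THUNK_TYPE(_impl)  # keep a reference so the callback stays alive

api = ctypes.pythonapi
api.PyCapsule_New.restype = ctypes.py_object
api.PyCapsule_New.argtypes = [ctypes.c_void_p, ctypes.c_char_p, ctypes.c_void_p]
api.PyCapsule_GetPointer.restype = ctypes.c_void_p
api.PyCapsule_GetPointer.argtypes = [ctypes.py_object, ctypes.c_char_p]

NAME = b"example.thunk"  # must outlive the capsule, which stores only a pointer
fn_addr = ctypes.cast(c_thunk, ctypes.c_void_p).value
capsule = api.PyCapsule_New(fn_addr, NAME, None)


def run_capsule(capsule, data_addr):
    """Unwrap the function pointer from the capsule and call it."""
    addr = api.PyCapsule_GetPointer(capsule, NAME)
    return THUNK_TYPE(addr)(data_addr)


status = run_capsule(capsule, ctypes.addressof(state))
print(type(capsule).__name__, status, state.value)
```

In Cython the same wrapping and unwrapping would be done directly with `PyCapsule_New` and `PyCapsule_GetPointer`, without the `ctypes` layer.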

brandonwillard · 2021-03-09T16:18:55Z

For anyone who wants to try this (e.g. @aseyboldt for #327), take a look at how COp creates C thunks. I think the relevant parts start here (i.e. CLinker.make_thunk), so we might need to jump into whatever happens there.

twiecki · 2021-03-09T16:25:52Z

https:/pymc-devs/aesara/blob/erfcx_c/aesara/graph/op.py#L552 doesn't work.

brandonwillard · 2021-03-09T16:32:30Z

Actually, it looks like it might be as simple as creating an aesara.link.c.basic._CThunk object. In order to do that, we'll need valid _CThunk arguments for Cython/Numba-generated functions.

The self.__compile__(...) step is how we normally generate those arguments, but it goes through the irrelevant process of compiling str-derived C code and creating extensions. Regardless, the cthunk value returned by CLinker.__compile__ is a PyCapsule object, module is a module-type object, in_storage and out_storage are lists of aesara.link.basic.Containers, and error_storage is just a list of Nones.

The first two values (i.e. the PyCapsule and module objects) seem obtainable from Cython/Numba, so it looks like we'll only need to reproduce the storage and error array creation steps.
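As a rough picture of the storage and error array creation steps, here is a sketch that uses a hypothetical stand-in for `aesara.link.basic.Container`. The `Container` fields, the `make_storage` helper, and the three-slot error list are all assumptions made for illustration; the comment above only establishes that the storages are lists of containers and that the error storage is a list of `None`s.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    """Hypothetical stand-in: a named, one-element storage cell."""

    name: str
    storage: list = field(default_factory=lambda: [None])


def make_storage(input_names, output_names):
    """Build the three storage arguments for a single thunk."""
    in_storage = [Container(name) for name in input_names]
    out_storage = [Container(name) for name in output_names]
    # Assumed layout: one slot each for an error's type, value, and traceback
    error_storage = [None, None, None]
    return in_storage, out_storage, error_storage


in_storage, out_storage, error_storage = make_storage(["x", "y"], ["z"])
in_storage[0].storage[0] = 1.5
print(in_storage, out_storage, error_storage)
```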

brandonwillard added the enhancement, help wanted, question, important, refactor, and C-backend labels Mar 2, 2021

brandonwillard pinned this issue Mar 2, 2021

brandonwillard changed the title ~~Replace C-code generation and processing backend~~ Replace C-code generation and compilation backend Mar 2, 2021

brandonwillard mentioned this issue Mar 14, 2021

C Elemwise implementation doesn't broadcast variables #335

Closed

brandonwillard closed this as completed Apr 16, 2021

aesara-devs locked and limited conversation to collaborators Apr 16, 2021

brandonwillard unpinned this issue Apr 16, 2021

This issue was moved to a discussion.
