
release gil in pixelize_cylinder #5018

Open

chrishavlin wants to merge 4 commits into main from release_gil_in_pixelize_cylinder
Conversation

chrishavlin
Contributor

release the gil!

Surprisingly, this seemed to work with minimal effort... it took the pixelization time from 53s to 27s for one example (see this Slack thread).
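For context, the general shape of the change is to wrap the hot loop in a nogil block once the loop body only touches typed memoryviews and C scalars. A minimal sketch (not the actual diff in this PR):

    # minimal_nogil_sketch.pyx -- illustrative only, not pixelize_cylinder itself
    cimport cython
    cimport numpy as np

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def fill_buffer(np.float64_t[:, :] buff, np.float64_t[:] radius):
        # assumes buff.shape[0] == radius.shape[0]
        cdef Py_ssize_t i, j
        # The loop only touches typed memoryviews and C doubles,
        # so the GIL can be released for its whole duration.
        with nogil:
            for i in range(buff.shape[0]):
                for j in range(buff.shape[1]):
                    buff[i, j] = radius[i] * radius[i]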

@neutrinoceros
Member

Releasing the GIL is likely a good idea here, but I also suspect that the abysmal performance in this routine may be fixable independently (possibly by reverting an earlier patch of mine, which I'll dig up for my future review). Anyway, I'll try to review in detail in the coming days. Thanks!

@chrishavlin
Contributor Author

If you figure out a separate fix, I'm happy to go with it and close this one!

@chrishavlin
Contributor Author

Oh, especially since the image tests seem to be failing and it was not as simple as I hoped :)

@chrishavlin
Contributor Author

OK, I think I see the issue, but I won't have time to fix it for a while... so I'll hold off in case you devise a separate performance fix.

@neutrinoceros
Member

Oh wow, these diffs are catastrophic. I'll try to tackle the performance issue from another angle, and I'll move this to draft if you don't mind.

@neutrinoceros marked this pull request as draft on October 5, 2024 08:07
@matthewturk
Member

These lines:

                     ptbounds[0] = ptbounds[0] % (2*np.pi)
                     ptbounds[1] = ptbounds[1] % (2*np.pi)

are using Python math operations inside a tight loop. What if we used a constant for PI instead of np.pi, which I believe will do a python-level lookup?
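Concretely, something like this (a rough sketch only; the diff further down does essentially the same thing with its twoPI constant): evaluate the Python-level np.pi attribute lookup once, store it in a C double, and keep the loop body in C.

    import numpy as np
    cimport numpy as np

    def wrap_angles(np.float64_t[:] theta):
        cdef Py_ssize_t i
        # Do the Python-level lookup of np.pi once, outside the loop,
        # and keep only a C double in the hot path.
        cdef np.float64_t twoPI = 2 * np.pi
        for i in range(theta.shape[0]):
            theta[i] = theta[i] % twoPI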

@cphyc
Member

cphyc commented Oct 8, 2024

> ... are using Python math operations inside a tight loop. What if we used a constant for PI instead of np.pi, which I believe will do a python-level lookup?

With the latest version, they are actually expanded to

(__pyx_v_ptbounds[0]) = fmod((__pyx_v_ptbounds[0]), (2.0 * __pyx_v_npPI));

so there might be a small speedup to be found by precomputing 2π ahead of time, but I'm not sure this is super important.

cphyc previously approved these changes Oct 8, 2024
sintheta = math.sin(theta_i)
while r_i < r0 + dr_i:
if rmax <= r_i:
with nogil:
Member

Is this a case where we'd want to use

for i in prange(radius.shape[0], nogil=True):
   pass

Contributor Author
@chrishavlin Oct 10, 2024

Sure! I think prange should work here and could give an extra performance boost, but I'll wait until the rest of the PR is actually working before I change it.

Contributor Author

OK, well, using prange might be a bit more work than I want to put in right now -- lots of compilation errors along the lines of:

      yt/utilities/lib/pixelization_routines.pyx:635:20: Cannot read reduction variable in loop body
      
      Error compiling Cython file:
      ------------------------------------------------------------
      ...
                  while r_i < r0 + dr_i:
                      if rmax <= r_i:
                          r_i += r_inc
                          continue
                      y = r_i * costheta
                      x = r_i * sintheta
                          ^
      ------------------------------------------------------------
      
      yt/utilities/lib/pixelization_routines.pyx:636:20: Cannot read reduction variable in loop body
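For what it's worth, the rule being tripped here (per the Cython docs): inside prange, a variable that gets updated with an in-place operator such as += is treated as a reduction variable, and a reduction variable cannot also be read in the loop body, which is exactly the r_i += r_inc followed by y = r_i * costheta pattern above. A toy example that does compile, where the per-thread temporary is only ever assigned with plain =:

    # prange_sketch.pyx -- illustrative only, not this routine's loop
    cimport cython
    cimport numpy as np
    from cython.parallel import prange

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def scale_and_sum(np.float64_t[:] radius, np.float64_t factor):
        cdef Py_ssize_t i
        cdef np.float64_t r
        cdef np.float64_t total = 0.0
        for i in prange(radius.shape[0], nogil=True):
            r = radius[i] * factor   # plain assignment: r is thread-private
            radius[i] = r
            total += r               # in-place update: total becomes a reduction
                                     # and must not be read inside the loop
        return total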

(Two review threads on yt/utilities/lib/pixelization_routines.pyx were resolved and are now outdated.)
@matthewturk
Member

> > are using Python math operations inside a tight loop. What if we used a constant for PI instead of np.pi, which I believe will do a python-level lookup?
>
> With the latest version, they are actually expanded to
>
>     (__pyx_v_ptbounds[0]) = fmod((__pyx_v_ptbounds[0]), (2.0 * __pyx_v_npPI));
>
> so there might be a small speedup to be found by precomputing 2π ahead of time, but I'm not sure this is super important.

Huh. I guess I wasn't on the latest version! For me it did a ton of unpacking and Python type coercion. I guess that's one of the advantages of upgrading, eh?

Are our performance tests being run on the latest version? (i.e., do you see the slowness, or is what I am referring to showing up in our benchmarks?)

@chrishavlin
Contributor Author

Still expect the image diffs to fail -- I'll look later today, and if it's a quick fix (which I think it is) I'll put it in and do some performance tests.

@@ -566,7 +566,8 @@ def pixelize_cylinder(np.float64_t[:,:] buff,
cdef np.float64_t r_inc, theta_inc
cdef np.float64_t costheta, sintheta
cdef int i, i1, pi, pj
cdef np.float64_t twoPI = 2 * np.pi
cdef np.float64_t onePI = np.pi
Member
@neutrinoceros Oct 10, 2024

here's the direct way to "import" the constant from numpy at the C level

cdef extern from "numpy/npy_math.h":
    double NPY_PI

(taken from https://github.com/yt-project/yt/pull/4948/files#r1692841029)
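i.e., wiring it in would look something like this (a sketch assuming the twoPI/onePI declarations from the hunk above):

    cimport numpy as np

    cdef extern from "numpy/npy_math.h":
        double NPY_PI

    # pure C constants: no Python attribute lookup left in the hot loop
    cdef np.float64_t onePI = NPY_PI
    cdef np.float64_t twoPI = 2.0 * NPY_PI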

Member

(NPY_2_PI is also available)

Contributor Author

Hah, will import NPY_2_PI as well in a minute, after tests run this round... I think this one will pass...

Contributor Author

oh wait, NPY_2_PI is 2/pi, not 2*pi ?

Member

aaaah, missed that, sorry

@chrishavlin force-pushed the release_gil_in_pixelize_cylinder branch from 9c76d68 to ad92f03 on October 10, 2024 19:25
Comment on lines +668 to +672
if ptbounds[0] < 0:
    ptbounds[0] += NPY_PI
ptbounds[1] = ptbounds[1] % twoPI
if ptbounds[1] < 0:
    ptbounds[1] += NPY_PI
Contributor Author

So I'm not sure why I needed to add the check for negative numbers here, but after switching from ptbounds[1] % (2*np.pi) to ptbounds[1] % twoPI, the operation returns numbers in the range (-pi, pi).

Member

I think we need to understand this bit, and possibly come up with a way to perform the computation that doesn't involve branching; hard-wiring a somewhat expensive calculation generally hurts performance less than ruining branch prediction.
I'm writing this comment early in my review but I'll try to come up with more insight.

Member

I'm betting that the reason you're seeing a change in behavior is that the function is decorated with @cython.cdivision(True): on main the Python % is used, while with twoPI you get a C operator instead.

http://docs.cython.org/en/latest/src/tutorial/caveats.html
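To spell the caveat out (hypothetical helpers, only to illustrate the two behaviours): with cdivision(True), % on C doubles compiles to C fmod, whose result takes the sign of the dividend, while Python's % returns a result with the sign of the divisor.

    # cython: cdivision=True
    from libc.math cimport fmod

    cpdef double c_style_mod(double x, double m):
        # With cdivision enabled this is plain fmod:
        # c_style_mod(-0.5, 2*pi) returns -0.5 (sign follows x).
        return x % m

    cpdef double py_style_mod(double x, double m):
        # Reproduces Python's % semantics: the result has the sign of m,
        # so py_style_mod(-0.5, 2*pi) returns 2*pi - 0.5.
        cdef double r = fmod(x, m)
        if r != 0.0 and (r < 0.0) != (m < 0.0):
            r += m
        return r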

Member

Can you try this patch?

diff --git a/yt/utilities/lib/pixelization_routines.pyx b/yt/utilities/lib/pixelization_routines.pyx
index 194ce7af6..6728375d0 100644
--- a/yt/utilities/lib/pixelization_routines.pyx
+++ b/yt/utilities/lib/pixelization_routines.pyx
@@ -664,12 +664,8 @@ def pixelize_cylinder(np.float64_t[:,:] buff,
                             ptbounds[1] = fmax(ptbounds[1], corners[i1])
 
                         # shift to a [0, PI] interval
-                        ptbounds[0] = ptbounds[0] % twoPI
-                        if ptbounds[0] < 0:
-                            ptbounds[0] += NPY_PI
-                        ptbounds[1] = ptbounds[1] % twoPI
-                        if ptbounds[1] < 0:
-                            ptbounds[1] += NPY_PI
+                        ptbounds[0] = (ptbounds[0]+twoPI) % twoPI
+                        ptbounds[1] = (ptbounds[1]+twoPI) % twoPI
 
                         if prbounds[0] >= rmin and prbounds[1] <= rmax and \
                            ptbounds[0] >= tmin and ptbounds[1] <= tmax:

The result should be the same in all cases, but it avoids branching, so I expect it to be faster.
If this works, we'll want to add a comment explaining why it's done this way.
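If it helps, here's a quick throwaway check of that claim (a hypothetical helper, assuming the incoming bounds are within (-2π, 2π)): the branch-free form should match a Python-style modulo, i.e. what main's ptbounds % (2*np.pi) produces.

    # cython: cdivision=True
    from libc.math cimport M_PI

    def check_branchless_mod(int n=10000):
        cdef double twoPI = 2.0 * M_PI
        cdef double x, branchless, expected
        cdef int i
        for i in range(n):
            # sample angles across (-twoPI, twoPI)
            x = -twoPI + 2.0 * twoPI * (i + 0.5) / n
            branchless = (x + twoPI) % twoPI   # suggested branch-free form (C fmod here)
            expected = x % twoPI               # C fmod: sign follows x ...
            if expected < 0:
                expected += twoPI              # ... shifted to match Python's % on main
            if abs(branchless - expected) > 1e-12:
                return False
        return True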

@chrishavlin
Contributor Author

Here's a performance comparison between this branch and main, looking at how the timing scales as the number of cells in r and theta increases.

Full code for the test case is over in this repo. The test dataset is a 2D cylindrical dataset built with:

import numpy as np
import yt

rs = np.random.default_rng()  # random generator; stands in for the one used in the linked repo

def get_2d_ds(n_r, n_theta):
    shp = (n_r, n_theta, 1)
    bbox = np.array([[0, 1], [0, np.pi * 2], [-0.5, 0.5]])
    data = {'density': rs.random(shp)}
    ds = yt.load_uniform_grid(data,
                              shp,
                              bbox=bbox,
                              geometry='cylindrical',
                              axis_order=('r', 'theta', 'z'))
    return ds

And I'm timing a SlicePlot here:

def plot_ds(ds):
    slc = yt.SlicePlot(ds, 'z', ('stream', 'density'))
    slc.render()

The first and second columns in the figure below are the elapsed time in seconds, averaged over 10 iterations, for the main branch and this PR's branch. The final column is the percent speedup.

[figure: timing results for main vs. this branch at each resolution]

Average speedup over all resolutions is about 33%.

@chrishavlin marked this pull request as ready for review on October 11, 2024 18:05
@matthewturk
Member

I like the comparison. I immediately thought, "what about in AMR?", then realized this is likely a better general-purpose diagnostic.

@neutrinoceros
Member

I looked for the most recent patches to this routine, from ~3 years ago, when I was focused on correctness and possibly neglecting performance. Here's what I found, in case it's relevant here:

#3782 and #3818 may have impacted performance, so it might be worth reverting them and seeing whether performance changes measurably.
If #3818 is costly, it could be argued that it should be reverted, as it really comes down to an arbitrary choice between two equally valid approaches, with converging results in the limit of infinite resolution.
#3782 actually seems more likely to have a negative impact, but it's a lot less trivial to decide what to do if so. In any case, I think it's worth investigating now; if performance reaches an acceptable level with this PR, maybe we don't need to do anything more. I'd just like it to be an informed decision.
