
release gil in pixelize_cylinder #5018

Open

chrishavlin wants to merge 4 commits into main from release_gil_in_pixelize_cylinder
Conversation

chrishavlin
Contributor

release the gil!

Surprisingly, this seemed to work with minimal effort... it took the pixelization time from 53s to 27s for one example (see this Slack thread).
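For context, the general shape of the change is to wrap the hot loop in a nogil block once the loop body only touches typed memoryviews and C scalars. A minimal sketch (not the actual diff in this PR):

    # minimal_nogil_sketch.pyx -- illustrative only, not pixelize_cylinder itself
    cimport cython
    cimport numpy as np

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def fill_buffer(np.float64_t[:, :] buff, np.float64_t[:] radius):
        # assumes buff.shape[0] == radius.shape[0]
        cdef Py_ssize_t i, j
        # The loop only touches typed memoryviews and C doubles,
        # so the GIL can be released for its whole duration.
        with nogil:
            for i in range(buff.shape[0]):
                for j in range(buff.shape[1]):
                    buff[i, j] = radius[i] * radius[i]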

@neutrinoceros
Member

Releasing the GIL is likely a good idea here, but I also suspect that the abysmal performance in this routine may be fixable independently (possibly by reverting an earlier patch of mine, which I'll dig up for my future review). Anyway, I'll try to review in detail in the coming days. Thanks!

@chrishavlin
Contributor Author

If you figure out a separate fix, I'm happy to go with it and close this one!

@chrishavlin
Contributor Author

Oh, especially since the image tests seem to be failing and it was not as simple as I hoped :)

@chrishavlin
Contributor Author

OK, I think I see the issue, but I won't have time to fix it for a while... so I'll hold off in case you devise a separate performance fix.

@neutrinoceros
Member

Oh wow, these diffs are catastrophic. I'll try to tackle the performance issue from another angle, and I'll move this to draft if you don't mind.

@neutrinoceros marked this pull request as draft on October 5, 2024 08:07
@matthewturk
Member

These lines:

                     ptbounds[0] = ptbounds[0] % (2*np.pi)
                     ptbounds[1] = ptbounds[1] % (2*np.pi)

are using Python math operations inside a tight loop. What if we used a constant for PI instead of np.pi, which I believe will do a python-level lookup?
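Concretely, something like this (a rough sketch only; the diff further down does essentially the same thing with its twoPI constant): evaluate the Python-level np.pi attribute lookup once, store it in a C double, and keep the loop body in C.

    import numpy as np
    cimport numpy as np

    def wrap_angles(np.float64_t[:] theta):
        cdef Py_ssize_t i
        # Do the Python-level lookup of np.pi once, outside the loop,
        # and keep only a C double in the hot path.
        cdef np.float64_t twoPI = 2 * np.pi
        for i in range(theta.shape[0]):
            theta[i] = theta[i] % twoPI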

@cphyc
Member

cphyc commented Oct 8, 2024

> ... are using Python math operations inside a tight loop. What if we used a constant for PI instead of np.pi, which I believe will do a python-level lookup?

With the latest version, they are actually expanded to

(__pyx_v_ptbounds[0]) = fmod((__pyx_v_ptbounds[0]), (2.0 * __pyx_v_npPI));

so there might be a small speedup to be found by precomputing 2π ahead of time, but I'm not sure this is super important.

cphyc previously approved these changes Oct 8, 2024
sintheta = math.sin(theta_i)
while r_i < r0 + dr_i:
if rmax <= r_i:
with nogil:
Member

Is this a case where we'd want to use

for i in prange(radius.shape[0], nogil=True):
   pass

Contributor Author
@chrishavlin Oct 10, 2024

Sure! I think prange should work here and could give an extra performance boost, but I'll wait until the rest of the PR is actually working before I change it.

Contributor Author

OK, well, using prange might be a bit more work than I want to put in right now -- lots of compilation errors along the lines of:

      yt/utilities/lib/pixelization_routines.pyx:635:20: Cannot read reduction variable in loop body
      
      Error compiling Cython file:
      ------------------------------------------------------------
      ...
                  while r_i < r0 + dr_i:
                      if rmax <= r_i:
                          r_i += r_inc
                          continue
                      y = r_i * costheta
                      x = r_i * sintheta
                          ^
      ------------------------------------------------------------
      
      yt/utilities/lib/pixelization_routines.pyx:636:20: Cannot read reduction variable in loop body
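For what it's worth, the rule being tripped here (per the Cython docs): inside prange, a variable that gets updated with an in-place operator such as += is treated as a reduction variable, and a reduction variable cannot also be read in the loop body, which is exactly the r_i += r_inc followed by y = r_i * costheta pattern above. A toy example that does compile, where the per-thread temporary is only ever assigned with plain =:

    # prange_sketch.pyx -- illustrative only, not this routine's loop
    cimport cython
    cimport numpy as np
    from cython.parallel import prange

    @cython.boundscheck(False)
    @cython.wraparound(False)
    def scale_and_sum(np.float64_t[:] radius, np.float64_t factor):
        cdef Py_ssize_t i
        cdef np.float64_t r
        cdef np.float64_t total = 0.0
        for i in prange(radius.shape[0], nogil=True):
            r = radius[i] * factor   # plain assignment: r is thread-private
            radius[i] = r
            total += r               # in-place update: total becomes a reduction
                                     # and must not be read inside the loop
        return total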

(Two review threads on yt/utilities/lib/pixelization_routines.pyx were resolved and are now outdated.)
@matthewturk
Member

> > are using Python math operations inside a tight loop. What if we used a constant for PI instead of np.pi, which I believe will do a python-level lookup?
>
> With the latest version, they are actually expanded to
>
>     (__pyx_v_ptbounds[0]) = fmod((__pyx_v_ptbounds[0]), (2.0 * __pyx_v_npPI));
>
> so there might be a small speedup to be found by precomputing 2π ahead of time, but I'm not sure this is super important.

Huh. I guess I wasn't on the latest version! For me it did a ton of unpacking and Python type coercion. I guess that's one of the advantages of upgrading, eh?

Are our performance tests being run on the latest version? (i.e., do you see the slowness, or is what I am referring to showing up in our benchmarks?)

@chrishavlin
Contributor Author

Still expect the image diffs to fail -- I'll look later today, and if it's a quick fix (which I think it is) I'll put it in and do some performance tests.

@@ -566,7 +566,8 @@ def pixelize_cylinder(np.float64_t[:,:] buff,
cdef np.float64_t r_inc, theta_inc
cdef np.float64_t costheta, sintheta
cdef int i, i1, pi, pj
cdef np.float64_t twoPI = 2 * np.pi
cdef np.float64_t onePI = np.pi
Member
@neutrinoceros Oct 10, 2024

here's the direct way to "import" the constant from numpy at the C level

cdef extern from "numpy/npy_math.h":
    double NPY_PI

(taken from https://github.com/yt-project/yt/pull/4948/files#r1692841029)
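i.e., wiring it in would look something like this (a sketch assuming the twoPI/onePI declarations from the hunk above):

    cimport numpy as np

    cdef extern from "numpy/npy_math.h":
        double NPY_PI

    # pure C constants: no Python attribute lookup left in the hot loop
    cdef np.float64_t onePI = NPY_PI
    cdef np.float64_t twoPI = 2.0 * NPY_PI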

Member

(NPY_2_PI is also available)

Contributor Author

Hah, will import NPY_2_PI as well in a minute, after tests run this round... I think this one will pass...

Contributor Author

oh wait, NPY_2_PI is 2/pi, not 2*pi ?

Member

aaaah, missed that, sorry

@chrishavlin force-pushed the release_gil_in_pixelize_cylinder branch from 9c76d68 to ad92f03 on October 10, 2024 19:25
Comment on lines +668 to +672
if ptbounds[0] < 0:
    ptbounds[0] += NPY_PI
ptbounds[1] = ptbounds[1] % twoPI
if ptbounds[1] < 0:
    ptbounds[1] += NPY_PI
Contributor Author

So I'm not sure why I needed to add the check for negative numbers here, but after switching from ptbounds[1] % (2*np.pi) to ptbounds[1] % twoPI, the operation returns numbers in the range (-pi, pi).

Member

I think we need to understand this bit, and possibly come up with a way to perform the computation that doesn't involve branching; hard-wiring a somewhat expensive calculation generally hurts performance less than ruining branch prediction.
I'm writing this comment early in my review but I'll try to come up with more insight.

Member

I'm betting that the reason you're seeing a change in behavior is that the function is decorated with @cython.cdivision(True): on main the Python % is used, while with twoPI you get a C operator instead.

http://docs.cython.org/en/latest/src/tutorial/caveats.html
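To spell the caveat out (hypothetical helpers, only to illustrate the two behaviours): with cdivision(True), % on C doubles compiles to C fmod, whose result takes the sign of the dividend, while Python's % returns a result with the sign of the divisor.

    # cython: cdivision=True
    from libc.math cimport fmod

    cpdef double c_style_mod(double x, double m):
        # With cdivision enabled this is plain fmod:
        # c_style_mod(-0.5, 2*pi) returns -0.5 (sign follows x).
        return x % m

    cpdef double py_style_mod(double x, double m):
        # Reproduces Python's % semantics: the result has the sign of m,
        # so py_style_mod(-0.5, 2*pi) returns 2*pi - 0.5.
        cdef double r = fmod(x, m)
        if r != 0.0 and (r < 0.0) != (m < 0.0):
            r += m
        return r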

Member

Can you try this patch?

diff --git a/yt/utilities/lib/pixelization_routines.pyx b/yt/utilities/lib/pixelization_routines.pyx
index 194ce7af6..6728375d0 100644
--- a/yt/utilities/lib/pixelization_routines.pyx
+++ b/yt/utilities/lib/pixelization_routines.pyx
@@ -664,12 +664,8 @@ def pixelize_cylinder(np.float64_t[:,:] buff,
                             ptbounds[1] = fmax(ptbounds[1], corners[i1])
 
                         # shift to a [0, PI] interval
-                        ptbounds[0] = ptbounds[0] % twoPI
-                        if ptbounds[0] < 0:
-                            ptbounds[0] += NPY_PI
-                        ptbounds[1] = ptbounds[1] % twoPI
-                        if ptbounds[1] < 0:
-                            ptbounds[1] += NPY_PI
+                        ptbounds[0] = (ptbounds[0]+twoPI) % twoPI
+                        ptbounds[1] = (ptbounds[1]+twoPI) % twoPI
 
                         if prbounds[0] >= rmin and prbounds[1] <= rmax and \
                            ptbounds[0] >= tmin and ptbounds[1] <= tmax:

The result should be the same in all cases, but it avoids branching, so I expect it to be faster.
If this works, we'll want to add a comment explaining why it's done this way.
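If it helps, here's a quick throwaway check of that claim (a hypothetical helper, assuming the incoming bounds are within (-2π, 2π)): the branch-free form should match a Python-style modulo, i.e. what main's ptbounds % (2*np.pi) produces.

    # cython: cdivision=True
    from libc.math cimport M_PI

    def check_branchless_mod(int n=10000):
        cdef double twoPI = 2.0 * M_PI
        cdef double x, branchless, expected
        cdef int i
        for i in range(n):
            # sample angles across (-twoPI, twoPI)
            x = -twoPI + 2.0 * twoPI * (i + 0.5) / n
            branchless = (x + twoPI) % twoPI   # suggested branch-free form (C fmod here)
            expected = x % twoPI               # C fmod: sign follows x ...
            if expected < 0:
                expected += twoPI              # ... shifted to match Python's % on main
            if abs(branchless - expected) > 1e-12:
                return False
        return True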

@chrishavlin
Contributor Author

Here's a performance comparison between this branch and main, looking at how the timing scales as the number of cells in r and theta increases.

Full code for the test case is over in this repo. The test dataset is a 2D cylindrical dataset built with:

import numpy as np
import yt

rs = np.random.default_rng()  # random generator; stands in for the one used in the linked repo

def get_2d_ds(n_r, n_theta):
    shp = (n_r, n_theta, 1)
    bbox = np.array([[0, 1], [0, np.pi * 2], [-0.5, 0.5]])
    data = {'density': rs.random(shp)}
    ds = yt.load_uniform_grid(data,
                              shp,
                              bbox=bbox,
                              geometry='cylindrical',
                              axis_order=('r', 'theta', 'z'))
    return ds

And I'm timing a SlicePlot here:

def plot_ds(ds):
    slc = yt.SlicePlot(ds, 'z', ('stream', 'density'))
    slc.render()

The first and second columns in the figure below are the elapsed time in seconds, averaged over 10 iterations, for the main branch and this PR's branch. The final column is the percent speedup.

[figure: timing results for main vs. this branch at each resolution]

Average speedup over all resolutions is about 33%.

@chrishavlin marked this pull request as ready for review on October 11, 2024 18:05
@matthewturk
Member

I like the comparison. I immediately thought, "what about in AMR?", then realized this is likely a better general-purpose diagnostic.

@neutrinoceros
Member

I looked for the most recent patches to this routine, from ~3 years ago, when I was focused on correctness and possibly neglecting performance. Here's what I found, in case it's relevant here:

#3782 and #3818 may have impacted performance, so it might be worth reverting them and seeing whether performance changes measurably.
If #3818 is costly, it could be argued that it should be reverted, as it really comes down to an arbitrary choice between two equally valid approaches, with converging results in the limit of infinite resolution.
#3782 actually seems more likely to have a negative impact, but it's a lot less trivial to decide what to do if so. In any case, I think it's worth investigating now; if performance reaches an acceptable level with this PR, maybe we don't need to do anything more. I'd just like it to be an informed decision.
