
Improve perf of _set_continuous! & _set_discrete! #25

Merged (2 commits) on Jun 18, 2024

Conversation

albert-de-montserrat
Contributor

Adds a Vararg signature at the end of _set_continuous! and _set_discrete! so that the compiler can specialize.

@kernel inbounds = true function _set_continuous!(dst, grid, loc, fun::F, args::Vararg{Any, N}) where {F, N}
    I = @index(Global, NTuple)
    dst[I...] = fun(coord(grid, loc, I...)..., args...)
end

@kernel inbounds = true function _set_discrete!(dst, grid, loc, fun::F, args::Vararg{Any, N}) where {F, N}
    I = @index(Global, NTuple)
    dst[I...] = fun(grid, loc, I..., args...)
end
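For context, this is the general Julia specialization heuristic at work, not code from this PR: Julia's compiler may avoid specializing on function and `Vararg` arguments when they are only forwarded to another call, which is what happens inside the code the `@kernel` macro generates. Spelling out the `F` and `N` type parameters in the signature forces one compiled method instance per function and argument count. A minimal sketch of the two patterns, using made-up toy functions (`apply_generic!`, `apply_forced!` are not part of Chmy):

```julia
# May be compiled generically: `fun` and `args` are just forwarded, so the
# heuristic can skip specialization, leaving runtime dispatch and allocations.
apply_generic!(dst, fun, args...) = (dst[1] = fun(args...); dst)

# Forces specialization: `F` and `N` appear as type parameters in the
# signature, so each (function, arity) pair gets its own compiled method.
apply_forced!(dst, fun::F, args::Vararg{Any, N}) where {F, N} =
    (dst[1] = fun(args...); dst)
```

Both definitions compute the same result; only the compilation strategy differs.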

Example:

using Chmy, Chmy.Architectures, Chmy.Grids, Chmy.Fields, Chmy.BoundaryConditions, Chmy.GridOperators, Chmy.KernelLaunch

backend=CPU(); nxy=(126, 126)
arch = Arch(backend)
# geometry
grid   = UniformGrid(arch; origin=(-1, -1), extent=(2, 2), dims=nxy)
launch = Launcher(arch, grid; outer_width=(16, 8))
# allocate fields
C = Field(backend, grid, Center())
init_incl(x, y, x0, y0, r, in, out) = ifelse((x - x0)^2 + (y - y0)^2 < r^2, in, out)

#main:

using Chairmarks
julia> @b set!($(C, grid, init_incl)...; parameters=(x0=0.0, y0=0.0, r=0.1, in=1, out=0))
705.100 μs (112084 allocs: 1.980 MiB)

This PR

julia> @b set!($(C, grid, init_incl)...; parameters=(x0=0.0, y0=0.0, r=0.1, in=1, out=0))
8.650 μs (70 allocs: 7.266 KiB)

@luraess
Member

luraess commented Jun 18, 2024

@utkinis can you x-check. LGTM

@youwuyou
Collaborator

Looks cool! How is the performance boost achieved? I don't think I've fully understood how the compiler optimizes it here such that fewer memory allocations are required. @albert-de-montserrat

@albert-de-montserrat
Contributor Author

albert-de-montserrat commented Jun 18, 2024

It's briefly explained here. Tbh it's usually not that bad; I am surprised to see so many allocations. But the profiler shows some runtime dispatch that I can't really pin down with @descend, since the macro does not go very deep into KA kernels.

I guess this is just good practice to avoid surprising perf drops anyway. Same as when passing functions around
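A caveat worth noting when hunting this down (standard Julia tooling, not part of this PR; `passthrough` and `inner` below are made-up names): introspection macros like `@code_warntype` show *specialized* code even when the compiler would not actually specialize the call, so they can hide exactly this problem. The Julia manual's performance tips instead suggest inspecting the method's compiled specializations directly (`Base.specializations` requires Julia >= 1.10):

```julia
# `f` is only forwarded here, never called directly, so it is a candidate
# for the no-specialization heuristic.
passthrough(f, args...) = inner(f, args...)
inner(f, args...) = f(args...)

passthrough(sin, 1.0)

# List the compiled method instances; an abstract signature like
# `passthrough(::Function, ::Vararg{Any})` means the call was not specialized.
Base.specializations(@which passthrough(sin, 1.0))
```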

@utkinis
Member

utkinis commented Jun 18, 2024

Thanks @albert-de-montserrat for the fix! I always keep forgetting about these specialisation rules =)

Interestingly, on GPU this code doesn't result in unstable code, otherwise it wouldn't compile. I wonder if the GPU specialisation rules are stricter...

Looks good to me, will merge as soon as CI finishes

@utkinis utkinis merged commit 8bd81f7 into PTsolvers:main Jun 18, 2024
9 checks passed