Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Address sanitizer tests in the CI pipeline #2246

Merged
merged 27 commits into from
May 8, 2024
Merged

Address sanitizer tests in the CI pipeline #2246

merged 27 commits into from
May 8, 2024

Conversation

jblueh
Copy link
Contributor

@jblueh jblueh commented Mar 21, 2024

Proposed Changes

This PR automates the address sanitizer analysis that lead to the fixes in #2212. I'll make the additions step by step, to see if things in the CI pipeline work as expected.

Related Work

su2code/Docker-Builds#24, #2212, #2068, #2213

PR Checklist

  • I am submitting my contribution to the develop branch.
  • My contribution generates no new compiler warnings (try with --warnlevel=3 when using meson).
  • My contribution is commented and consistent with SU2 style (https://su2code.github.io/docs_v7/Style-Guide/).
  • I used the pre-commit hook to prevent dirty commits and used pre-commit run --all to format old commits.
  • I have added a test case that demonstrates my contribution, if necessary.
  • I have updated appropriate documentation (Tutorials, Docs Page, config_template.cpp), if necessary.

@jblueh
Copy link
Contributor Author

jblueh commented Mar 25, 2024

Discussion #2213 is about fixes for the address sanitizer detections in the harmonic_balance and hb_rans_preconditioning test cases (see https:/su2code/SU2/actions/runs/8422887468/job/23065483005?pr=2246).

@pcarruscag
Copy link
Member

Maybe we can avoid creating another instance is the surface_movement pointer is not nullptr when the init function is called? In CDriver we would just make sure all surface_movement are initialized to nullptr.

@jblueh
Copy link
Contributor Author

jblueh commented Mar 25, 2024

Maybe we can avoid creating another instance is the surface_movement pointer is not nullptr when the init function is called? In CDriver we would just make sure all surface_movement are initialized to nullptr.

Yeah, that would be an alternative to creating surface_movement per instance. It's the easier fix but I am not sure if the resulting behaviour would be correct either. @maxaehle have you tried this?

@jblueh
Copy link
Contributor Author

jblueh commented Apr 3, 2024

The address sanitizer reports a heap buffer overflow for this piece of code.

I presume the issue is the access SpanValuesDonor[kSpan + 1] in the case that kSpan == nSpanDonor - 1 so that kSpan references the last array entry already. Below is a stripped-down version of the corresponding code.

...
for (jSpan = 0; jSpan < nSpanDonor; jSpan++) {
  ...
  if (...)
    kSpan = jSpan;
  ...
}
... SpanValuesDonor[kSpan + 1] ...

Is someone familiar with the CInterface code and could make recommendations for a fix?

Below is the address sanitizer report, see also here.

==2308==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x60f0008c96a0 at pc 0x558b6435ae51 bp 0x7ffc1cebf680 sp 0x7ffc1cebf670
READ of size 8 at 0x60f0008c96a0 thread T0
    #0 0x558b6435ae50 in CInterface::PreprocessAverage(CGeometry*, CGeometry*, CConfig const*, CConfig const*, unsigned short) ../SU2_CFD/src/interfaces/CInterface.cpp:314
    #1 0x558b6373d23a in CDriver::PreprocessTurbomachinery(CConfig**, CGeometry****, CSolver*****, CInterface***, bool) ../SU2_CFD/src/drivers/CDriver.cpp:2725
    #2 0x558b63770d76 in CDriver::CDriver(char*, unsigned short, int, bool) ../SU2_CFD/src/drivers/CDriver.cpp:250
    #3 0x558b637f54c2 in CMultizoneDriver::CMultizoneDriver(char*, unsigned short, int) ../SU2_CFD/src/drivers/CMultizoneDriver.cpp:35
    #4 0x558b63615b53 in main ../SU2_CFD/src/SU2_CFD.cpp:135
    #5 0x7f0d8c503082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082)
    #6 0x558b6367005d in _start (/github/workspace/install/bin/SU2_CFD+0x63405d)

0x60f0008c96a0 is located 0 bytes to the right of 176-byte region [0x60f0008c95f0,0x60f0008c96a0)
allocated by thread T0 here:
    #0 0x7f0d8ed00787 in operator new[](unsigned long) ../../../../src/libsanitizer/asan/asan_new_delete.cc:107
    #1 0x558b656d1590 in CPhysicalGeometry::ComputeNSpan(CConfig*, unsigned short, unsigned short, bool) ../Common/src/geometry/CPhysicalGeometry.cpp:4806
    #2 0x558b6373c1df in CDriver::PreprocessTurbomachinery(CConfig**, CGeometry****, CSolver*****, CInterface***, bool) ../SU2_CFD/src/drivers/CDriver.cpp:2649
    #3 0x558b63770d76 in CDriver::CDriver(char*, unsigned short, int, bool) ../SU2_CFD/src/drivers/CDriver.cpp:250
    #4 0x558b637f54c2 in CMultizoneDriver::CMultizoneDriver(char*, unsigned short, int) ../SU2_CFD/src/drivers/CMultizoneDriver.cpp:35
    #5 0x558b63615b53 in main ../SU2_CFD/src/SU2_CFD.cpp:135
    #6 0x7f0d8c503082 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x24082)

@pcarruscag
Copy link
Member

@joshkellyjak can you take a look at the issue above? Thanks

@joshkellyjak
Copy link
Contributor

I think the issue is looping jSpan between 0 and nSpanDonor-1, I think it should be looped between 1 and nSpanDonor-2 (as is done in the higher level of the for loop) as the hub and shroud are handled seperately and there is no interpolation performed here. Admittedly, this is probably the area of the turbomachinery code I am most unfamiliar with.

@jblueh
Copy link
Contributor Author

jblueh commented Apr 3, 2024

I think the issue is looping jSpan between 0 and nSpanDonor-1, I think it should be looped between 1 and nSpanDonor-2 (as is done in the higher level of the for loop) as the hub and shroud are handled seperately and there is no interpolation performed here. Admittedly, this is probably the area of the turbomachinery code I am most unfamiliar with.

Thanks @joshkellyjak! I'll try this.

@joshkellyjak
Copy link
Contributor

Interesting that only one of the tests fails - to be honest I was expecting all the turbomachinery tests to fail, does this solve the heap overflow?

@jblueh
Copy link
Contributor Author

jblueh commented Apr 3, 2024

The address sanitizer records are clean with this fix 👍

@alecappiello
Copy link
Contributor

Interesting that only one of the tests fails - to be honest I was expecting all the turbomachinery tests to fail, does this solve the heap overflow?

Hi all, I've just tried to run the Aachen parallel regression test locally (code from SU2 feature_asan, restart files from SU2/TestCases) and I get a pretty different outcome than the PR here.

residuals I get locally:
| 9| -15.378838| -15.209087| -15.081009| -13.857694| -12.724170| -10.052202|

residuals I see as a result of the regression test:
| 9| -9.511426| -8.866578| -9.325615| -7.546349| -7.420281| -4.042154|

Also, feature_asan doesn't have any Aachen_turbine folder among the testcases.

@jblueh
Copy link
Contributor Author

jblueh commented Apr 15, 2024

Hi @alecappiello, thanks for joining the discussion.

Hi all, I've just tried to run the Aachen parallel regression test locally (code from SU2 feature_asan, restart files from SU2/TestCases) and I get a pretty different outcome than the PR here.

residuals I get locally: | 9| -15.378838| -15.209087| -15.081009| -13.857694| -12.724170| -10.052202|

If you use your local environment for building and running (as opposed to the containers), that can lead to differences, too (although it's not a good sign if they are large). So I think the question is rather, with the changes in the feature_asan branch, do you see reasonable results and/or the familiar residuals in your local environment?

@bigfooted reports that the Aachen turbine case in current develop is not deterministic (see here and here), that could be an issue, too. If it is due to memory issues, the changes in this PR might fix this.

Also, feature_asan doesn't have any Aachen_turbine folder among the testcases.

I see this Aachen turbine folder, is it not the correct one?

@jblueh
Copy link
Contributor Author

jblueh commented Apr 17, 2024

@alecappiello The Aachen turbine case is in the feature_asan branch since 046de67, could it be that you tested with an older version of the branch?

@alecappiello
Copy link
Contributor

@alecappiello The Aachen turbine case is in the feature_asan branch since 046de67, could it be that you tested with an older version of the branch?

Hi @jblueh, I double-checked and you're right. Sorry for that. I'm installing the right one now.

@alecappiello
Copy link
Contributor

@bigfooted reports that the Aachen turbine case in current develop is not deterministic (see here and here), that could be an issue, too. If it is due to memory issues, the changes in this PR might fix this.

I'm not sure about what is the cause of this poorly deterministic behaviour. I can imagine that the convergence level of the restart plays a role. I'll be running some tests with better converged restart files and let you guys know

@jblueh
Copy link
Contributor Author

jblueh commented Apr 18, 2024

I'll be running some tests with better converged restart files and let you guys know

@alecappiello As the restart files were not obtained with the changes in this branch, maybe they are no longer consistent and need to be updated?

@jblueh
Copy link
Contributor Author

jblueh commented May 6, 2024

@alecappiello Any news from your side?

@jblueh jblueh changed the title [WIP] Address sanitizer tests in the CI pipeline Address sanitizer tests in the CI pipeline May 8, 2024
@jblueh
Copy link
Contributor Author

jblueh commented May 8, 2024

To get this PR going, I have reverted the proposed fix c97610b for now and excluded the Aachen turbine test case from address sanitizer testing. I'll create an issue about this.

@jblueh jblueh merged commit a1ff242 into develop May 8, 2024
34 of 35 checks passed
@jblueh jblueh deleted the feature_asan branch May 8, 2024 11:25
@jblueh jblueh mentioned this pull request May 8, 2024
@alecappiello
Copy link
Contributor

@alecappiello Any news from your side?

Sorry for my late reply. Unfortunately, I don't have great news.
I've done some tests locally, but I couldn't find a reason why the Aachen test case should behave differently.

I have tried restarting the calculation by means of feature_asan from:

  • Aachen restart files obtained with better convergence of SU2_CFD
  • Aachen restart files obtained with a setup of the CFD calculation similar to the Jones turbine, that is, same turbulence model, spacial fourier disable for the giles bc.
    In both cases I still get a big jump in the residuals, although smaller in the second case.

@alecappiello
Copy link
Contributor

@alecappiello As the restart files were not obtained with the changes in this branch, maybe they are no longer consistent and need to be updated?

To my understanding they should work as the others...
Why do you expect these restart to be unconsistent while the others work?

@jblueh jblueh mentioned this pull request May 29, 2024
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants