Memory Consumption of Sensitive Detectors #1285

Open
1 task done
s6anloes opened this issue Jun 28, 2024 · 23 comments

@s6anloes

s6anloes commented Jun 28, 2024

Check duplicate issues.

  • Checked for duplicates

Goal

I'm trying to understand the memory usage of sensitive volumes in dd4hep. I have a detector with a large number of sensitive volumes which seem to have a large impact on the memory consumption. More details are given below.

Operating System and Version

Centos 7

compiler

GCC 12.2.0

ROOT Version

6.28/10

DD4hep Version

1.28

Reproducer

To install the dual-readout calorimeter geometry:
Note: the export command will need to be executed in each new shell

source /cvmfs/sw.hsf.org/key4hep/setup.sh
git clone --single-branch --branch dd4hep_github_issue https://github.com/s6anloes/DDDRCaloTubes.git
cd DDDRCaloTubes
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=../install/ ..
make install -j6
cd ../install/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$PWD/lib64

To run the simulation with all fibres marked as sensitive detector and monitor the memory usage via htop:
Note: With the full geometry, this will take ~10 GB of memory and about 10 minutes to build the geometry (at around 3.5 minutes you should see the memory usage start to increase gradually)

cd ../DRdetector/DRcalo/compact
ddsim --compactFile DDDRCaloTubes.xml -N 1 -G --steeringFile steering.py --outputFile=test.root --part.userParticleHandler="" &
htop

Then to run the simulation without fibres being sensitive:
In the DDDRCaloTubes.xml file, change the "sensitive" value to false on lines 333 and 335 and run the same command. This should take less than 1 GB of memory.
Note: the simulation will take slightly longer, because optical photons are propagated instead of killed like in the custom sensitive detector action

Additional context

I have been trying to improve the memory consumption of the calorimeter for some time now and have had meetings with, and feedback from, some of the experts. In a recent FCC Full Sim Working Group meeting I presented some studies of the memory consumption. Mainly I show the CPU and memory usage as a function of time, recorded with the psrecord software. There you can also find one slide on the geometry and volume hierarchy of the detector.

The slides are mainly about trying different options to improve the memory consumption, but one important point was also the discrepancy in memory consumption between the ddsim and dd4hep2root commands. While running ddsim takes 10GB of memory, dd4hep2root takes less than 1GB.

In this meeting, a colleague working on the 'monolithic' version of the geometry suggested running the simulation without any volume marked as sensitive. And indeed, this seems to be the cause of the large discrepancy.
A colleague said that in Geant4 the sensitive detector is linked to the logical volume, so even if the volume is placed many times (as is the case for the fibres in my geometry), there is still just one sensitive volume. This does not seem to be the case in dd4hep, where the sensitive volume appears to be tied to the placed volume.
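To make the scaling argument concrete, here is a minimal sketch of the arithmetic. The copy numbers are hypothetical and not taken from the DDDRCaloTubes geometry; the point is only that anything stored per placed volume grows with the product of the copy numbers, while anything stored per logical volume does not.

```python
from math import prod

# Copies of each volume inside its mother, from the top of the hierarchy down
# to the fibre. All numbers are hypothetical and only illustrate the scaling.
copies_per_level = {"stave": 360, "tower": 2, "tube": 1000, "fibre": 1}

# Per logical volume: the fibre volume is created once, however often it is placed.
logical_fibre_volumes = 1

# Per placed volume: every distinct path from the world down to a fibre counts
# separately, so the copy numbers multiply.
placed_fibres = prod(copies_per_level.values())

print(f"logical volumes: {logical_fibre_volumes}, placed fibres: {placed_fibres}")
```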

I wonder if this is something that can be changed, as it causes a problem with geometries with many small sensitive volumes.

Sidenote: this issue is somewhat related to issue #1173, where Sarah Eno was also looking at the memory consumption of the dual-readout calorimeter. While I think there is still some optimisation possible for my geometry (I'm not using all possible symmetries at the moment), this doesn't seem to be the main problem here

@s6anloes s6anloes added the bug label Jun 28, 2024
@andresailer
Member

Hi @s6anloes ,

Can you reduce your example to something that still shows the scaling behaviour, but runs in a much shorter time and uses no more than, for example, 1 GB of memory?

Thanks,
Andre

@s6anloes
Author

Hi Andre,
yes I can do this by reducing the number of towers placed. I have uploaded this to a branch called dd4hep_github_issue. It should take just around 1.2GB now and be done in two minutes. I have updated the Reproducer to clone this branch only.

While doing this I have found something, that might be of interest:

When thinking about how to make a smaller scale example for you to test, there are essentially two ways to place just a small number of towers

  1. Placing only one stave (fixed phi) over the full theta range. The geometry would look something like this (tower size increased for visualisation)
    stave
  2. Placing only one (or two) towers in the stave (fixed theta effectively) and repeatedly placing it in phi, such that you end up with a ring of towers, like this:
    ring

There is one significant difference between the two scenarios:
The towers within one stave (so towers for different theta) differ slightly from one another geometry-wise, except for a forward-backward symmetry at eta=0 (theta=90 deg). The enveloping trapezoid shape is certainly different for each tower. However, the tubes and fibres within the various towers are a different story. I try to reuse the volumes for the tubes and fibres throughout the simulation by creating them once and storing them in a map. If a tube of a given length needs to be placed, I first check whether this volume already exists in the map and, if so, place it in the tower. For instance, at the centre of each tower we expect the tubes to all have the same length across the towers, because the tubes reach all the way from the back to the front face. Only on the sides (the wings of the tower), where we need to stagger the tubes to get the overall shape, do we expect the lengths to differ from one tower to the next.
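The reuse scheme described above can be sketched as a small cache keyed by tube length. This is an illustrative Python stand-in, not the code in DRconstructor.cpp: `VolumeCache`, the `build` callback and the rounding of the key are all assumptions. Rounding is there because floating-point lengths that differ only by numerical noise would otherwise create separate volumes.

```python
class VolumeCache:
    """Create each tube volume once per distinct length and reuse it afterwards."""

    def __init__(self, build):
        self._build = build      # hypothetical factory: length -> volume object
        self._volumes = {}       # rounded length -> volume
        self.created = 0         # how many volumes were actually built

    def get(self, length_mm, precision=3):
        # Round the key so lengths equal up to numerical noise share one volume.
        key = round(length_mm, precision)
        if key not in self._volumes:
            self._volumes[key] = self._build(key)
            self.created += 1
        return self._volumes[key]


cache = VolumeCache(build=lambda length: {"shape": "tube", "length_mm": length})
central_a = cache.get(2000.0)    # central tube of one tower
central_b = cache.get(2000.0)    # central tube of another tower: reused
wing = cache.get(1987.5)         # staggered wing tube: new volume
print(cache.created)
```

Note that such a cache only limits the number of distinct volume objects. The number of placements, and therefore of distinct paths to a fibre, is unchanged.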

How is this important?

Let's first look at Scenario 2, placing a ring of towers. When using 1deg by 1deg towers, we are placing one single stave volume 360 times in different phi rotations. The volumes are all identical. This is the geometry I have now pushed to the new branch for you to test. And from the plots below, you can see that there is still a significant difference between running the simulation with sensitive volumes or without.
Running with sensitive volumes:
plot_ddsim_onelayer

Running without sensitive volumes:
plot_ddsim_nosens_onelayer

You can see from the blue line that, after the geometry has been converted to Geant4, the memory rises much higher in the case with sensitive volumes (on a smaller scale now than with the full geometry).

In Scenario 1 (one full stave) things look different though. If you compare running with and without sensitive volumes, the memory consumption is the same.
Running with sensitive volumes:
plot_ddsim_onestave

Running without sensitive volumes:
plot_ddsim_nosens_onestave

So what is different now? Well, probably that the towers are all different volumes. So the issue might be related to whether or not this volume already exists in the memory. But in this case it would still be surprising though to see no difference, since the volumes for the tubes and fibres are reused and therefore should already exist in the memory. Also it's wrong to say that all towers are different, because of the symmetry at theta=90deg. This should contribute by a factor of two, since the tower is created once and placed twice within the stave. Only difference is the position and rotation.

I don't really know what this means however, or if this is even relevant to the underlying issue. I just thought I should share this in case I'm onto something.

@andresailer andresailer self-assigned this Jul 1, 2024
@andresailer
Member

I think the issue is that there is an entry for each sensitive element with its unique path.

m_entries.emplace(code,path);

@MarkusFrankATcernch
Contributor

MarkusFrankATcernch commented Jul 1, 2024

Yes. I confirm this. There is an entry for each path to allow lookups using the touchable history.
But: how else would you perform the lookup ?
The only alternative is walking down the tree using strings. This has huge run-time hits.
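As a rough illustration of why one entry per path is both convenient and expensive, the sketch below builds such a table for a small hypothetical hierarchy. The tuple path, the integer code and the direction of the map are stand-ins chosen for this example and are not DD4hep types; only the "one entry per unique path" idea comes from the discussion above.

```python
from itertools import product


def build_lookup(levels):
    """Return {path: code} with one entry per unique placement path.

    levels lists (name, copies) pairs from the top of the hierarchy down to
    the sensitive volume.
    """
    copy_ranges = [range(copies) for _, copies in levels]
    table = {}
    for code, copy_numbers in enumerate(product(*copy_ranges)):
        path = tuple(f"{name}_{n}" for (name, _), n in zip(levels, copy_numbers))
        table[path] = code
    return table


levels = [("stave", 36), ("tower", 2), ("tube", 50)]   # hypothetical counts
table = build_lookup(levels)

# A hit arrives with its full path (the analogue of the touchable history),
# and the lookup is a single hash access instead of a walk down the tree.
hit_path = ("stave_3", "tower_1", "tube_7")
print(len(table), table[hit_path])
```

The lookup is cheap at run time, but the table holds one key per placed sensitive volume, so its size follows the product of the copy numbers.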

@andresailer
Member

I guess one shouldn't add a DetElement for each fiber, but depend on a segmentation to give a number to the fiber in the tower?

@s6anloes
Author

s6anloes commented Jul 1, 2024

I'm not sure how this could work. Currently we need to mark the fibres as sensitive for signal generation. We have had some discussion with Sanghyun, whose geometry propagates optical photons, but it takes several minutes to simulate one event. Something we would like to avoid

@MarkusFrankATcernch
Contributor

This path reflects the path of volumes. Having a DetElement at each level makes things worse, but already the
"unfolded" tree with all these little sensitive volumes makes a huge tree with vectors of volume IDs as lookup keys.
It is well possible that one would have to somehow develop an alternative lookup mechanism for certain types
of readouts.

In DD4hep such situations are meant to be handled by relatively large sensitive volumes, with the little sensitive
elements handled by a segmentation. Example: a wafer is a sensitive volume, the pixels on the wafer are not sensitive
volumes, but are handled by the segmentation.

In this case the envelope of fibres would be the sensitive volume and the individual fibres would then be handled by a
segmentation. Whether such an ad hoc approach is reasonable I cannot tell. Alternatively one tries to find a model which describes
such a setup efficiently.

@s6anloes
Author

s6anloes commented Jul 1, 2024

I think I understand how this approach might work, although I have one question. You say the envelope of the fibre would be the sensitive volume; I guess this means the mother volume. For our geometry, the sensible choice of sensitive volume would be three levels higher in the hierarchy (the great-grandmother volume), since each fibre core is placed within a cladding volume within a tube volume. The tower would then be the large sensitive volume which can be segmented. Would this approach still work? It is not clear to me how sensitive volumes treat daughter volumes, grand-daughter volumes and so on. Are these volumes no longer 'sensitive', in the sense that the sensitive detector action would not be called for steps in such a daughter volume?

@MarkusFrankATcernch
Contributor

MarkusFrankATcernch commented Jul 1, 2024

This is sort of the idea behind the segmentation concept.
You would get the energy deposit in the grand-grandmother volume and compute the fiber from the location of the
energy deposit within this volume.
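A minimal sketch of the idea, assuming the fibres sit on a regular square grid centred on the local origin of the sensitive volume. The function name, the grid layout and all numbers are hypothetical, and this is not the DD4hep segmentation interface. A staggered tube layout would need a different index formula.

```python
import math


def fibre_index(x, y, pitch, n_cols, n_rows):
    """Map a local hit position (x, y) to a fibre number on a square grid.

    Returns None when the position lies outside the grid.
    """
    col = math.floor(x / pitch + n_cols / 2)
    row = math.floor(y / pitch + n_rows / 2)
    if not (0 <= col < n_cols and 0 <= row < n_rows):
        return None
    return row * n_cols + col


# Hypothetical 10 x 10 grid with 2 mm pitch, positions in mm.
print(fibre_index(0.5, 0.5, pitch=2.0, n_cols=10, n_rows=10))
print(fibre_index(10.5, 0.0, pitch=2.0, n_cols=10, n_rows=10))
```

With this approach only the enclosing volume is registered as sensitive, and the fibre number is computed per hit rather than stored per placement.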

If this works for fibers (which I guess are thin cylinders) I cannot tell, because there is some space between them filled probably with some glue. This then would not be handled correctly by Geant4, because the glue has different material characteristics than the fibers.

@MarkusFrankATcernch
Contributor

@s6anloes
I tried to somewhat understand the code here: https://github.com/s6anloes/DDDRCaloTubes/blob/689347a36627b16471c012551fc3a7caa250bbe5/DRdetector/DRcalo/src/DRconstructor.cpp
Depending on the granularity these are really a lot of volumes since apparently in theta things cannot be re-used,
but must be recreated.

Nevertheless: Do you know where the memory really goes ?

  • Is it the ROOT geometry ?
  • Is it the Geant4 geometry when it gets translated ?
  • According to your presentation mentioned above it is not the Geant4VolumeManager.
  • How are the dd4hep2root results to be understood ? Why does it need suddenly so much less memory?

So where does the memory really go?

@andresailer
Member

I ran heaptrack to monitor allocations

PYTHONMALLOC=malloc heaptrack  python `which ddsim` --compactFile ../DRdetector/DRcalo/compact/DDDRCaloTubes.xml -N1 -G --part.userParticleHandler=''

I ran it with and without the couple of volumes set as sensitive. As far as I can tell, the line I link to above is the main difference between the two runs. This is a bit complicated because I have never used heaptrack before, and the recursion makes the call stacks a little bit broader.

@s6anloes
Author

s6anloes commented Jul 2, 2024

@MarkusFrankATcernch

Hmm, these are really good questions that I wish I knew the answer to. I'm not really an expert on these things, so if you know any way I can figure this out, it would be greatly appreciated.
The only thing I can tell you is that the steady, linear increase in memory starts the moment dd4hep prints the output: "successfully converted geometry to Geant4...".
Since not nearly as much memory is used in the dd4hep2root command, my understanding was that it was probably the Geant4 geometry, and not the ROOT geometry.

@MarkusFrankATcernch
Contributor

MarkusFrankATcernch commented Jul 2, 2024

@s6anloes
Well.... when it says "successfully converted geometry to Geant4..." I think Geant4 is far from having finished its setup.
All the voxelization business and I do not know what other internal details are then still going on which may require
loads and loads of memory. There are certainly internal caches to speed up tracking etc. What cannot be avoided is the fact that there are 2 geometries in memory: the TGeo geometry and the Geant4 geometry.
All this will probably happen when the geometry gets closed just before the event simulation starts and probably is entirely independent of dd4hep.

Now for the facts:

  • the memory usage of dd4hep2root does not mean a lot. In your detector constructor you do not really use
    DetElements to build the structural hierarchy. Hence the overhead of dd4hep itself should be quite small: actually it is only the Volume extensions which are supplied to ROOT. This should be small. It would be interesting, though, to see how the dd4hep2root geometry differs from the true dd4hep geometry without Geant4.
  • If you do not convert to Geant4 you cannot simulate. Hence this is a must unless Geant4 is loaded from a GDML dump of the ROOT geometry. This has its own difficulties, but may be considered to debug the memory usage.
  • This leaves as the only significant free element the Geant4VolumeManager.
    If you only look at the memory usage with and without sensitive volumes in your plots above, i.e. with a populated or an unpopulated Geant4VolumeManager, the memory usage is about the same. This suggests to me that the effect of the Geant4VolumeManager is smallish.

So where does the memory go?
One probably can only go through the main steps of setting up Geant4 with the debugger and see in the setup where the memory jumps....
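Short of stepping through with a debugger, one can also record the process memory at the boundaries of each setup stage. The following is a minimal sketch using only the Python standard library. `StageMemoryLog` and the stage labels are hypothetical, and it records the peak resident set size, which never decreases, so it shows where memory jumps but not where it is released.

```python
import resource


def peak_rss_mb():
    # ru_maxrss is reported in kilobytes on Linux (in bytes on macOS).
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0


class StageMemoryLog:
    """Record peak RSS at named points and report the increase per stage."""

    def __init__(self):
        self.records = []

    def mark(self, label):
        self.records.append((label, peak_rss_mb()))

    def deltas(self):
        return [(cur[0], cur[1] - prev[1])
                for prev, cur in zip(self.records, self.records[1:])]


log = StageMemoryLog()
log.mark("start")
payload = [float(i) for i in range(1_000_000)]   # stand-in for a setup stage
log.mark("after stand-in stage")
for label, delta in log.deltas():
    print(f"{label}: +{delta:.1f} MB peak RSS")
```

In a real run the `mark` calls would be placed around the geometry conversion, the volume manager setup and the run initialisation.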

@MarkusFrankATcernch
Contributor

@s6anloes , @andresailer
I do not have /cvmfs/sw.hsf.org/ , but it should also run on any LCG view -- not?

Apparently the LCG views miss the Geant4 data tables:

#13 0x00007efdaad65299 in G4Exception (originOfException=originOfException@entry=0x7efdab1f8d2c "G4NuclideTable", exceptionCode=exceptionCode@entry=0x7efdab1f8d83 "PART70001", severity=severity@entry=FatalException, description=description@entry=0x7efdab1f8d66 "ENSDFSTATE.dat is not found.") at /build/jenkins/workspace/lcg_release_pipeline/build/projects/Geant4-11.2.1/src/Geant4/11.2.1/source/global/management/src/G4Exception.cc:115
#14 0x00007efdab189210 in G4NuclideTable::GenerateNuclide (this=this

Do I miss some environment or is LCG_106 incomplete?

@andresailer
Member

source /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/setup.sh
ddsim --compactFile $DD4hepINSTALL/DDDetectors/compact/SiD.xml -N 2 -G

Works for me on lxplus.
Do you maybe also not have /cvmfs/geant4.cern.ch?

G4ENSDFSTATEDATA=/cvmfs/geant4.cern.ch/share/data/G4ENSDFSTATE2.3

@MarkusFrankATcernch
Contributor

Yes this is the problem: /cvmfs/geant4.cern.ch is missing.
I thought the idea of the LCG views is to have everything together in a compact form ?

@andresailer
Member

It seems geant4 cvmfs is the hidden dependency.
But you probably have those datafiles then on some LHCb CVMFS repo?

@MarkusFrankATcernch
Contributor

There are more problems. I tried to build on lxplus, but there I got a clash with python between system python and hsf python:

CMake Error at /cvmfs/sw.hsf.org/key4hep/releases/2024-03-10/x86_64-almalinux9-gcc11.3.1-opt/cmake/3.27.9-4qfmfr/share/cmake-3.27/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find Python: Found unsuitable version "3.10", but required is
  exact version "3.10.13" (found /usr/include/python3.11, )
Call Stack (most recent call first):
  /cvmfs/sw.hsf.org/key4hep/releases/2024-03-10/x86_64-almalinux9-gcc11.3.1-opt/cmake/3.27.9-4qfmfr/share/cmake-3.27/Modules/FindPackageHandleStandardArgs.cmake:598 (_FPHSA_FAILURE_MESSAGE)
  /cvmfs/sw.hsf.org/key4hep/releases/2024-03-10/x86_64-almalinux9-gcc11.3.1-opt/cmake/3.27.9-4qfmfr/share/cmake-3.27/Modules/FindPython/Support.cmake:3824 (find_package_handle_standard_args)
  /cvmfs/sw.hsf.org/key4hep/releases/2024-03-10/x86_64-almalinux9-gcc11.3.1-opt/cmake/3.27.9-4qfmfr/share/cmake-3.27/Modules/FindPython.cmake:574 (include)
  /cvmfs/sw.hsf.org/key4hep/releases/2024-03-10/x86_64-almalinux9-gcc11.3.1-opt/dd4hep/1.28-q6ea5f/cmake/DD4hepBuild.cmake:693 (FIND_PACKAGE)
  /cvmfs/sw.hsf.org/key4hep/releases/2024-03-10/x86_64-almalinux9-gcc11.3.1-opt/dd4hep/1.28-q6ea5f/cmake/DD4hepConfig.cmake:62 (DD4HEP_SETUP_ROOT_TARGETS)
  CMakeLists.txt:35 (find_package)

@peterkostka
Contributor

peterkostka commented Jul 2, 2024 via email

@MarkusFrankATcernch
Contributor

MarkusFrankATcernch commented Jul 2, 2024

Here are some results from simply using top:

Invocation of TGeo alone:

TGeo:  geoPluginRun -input /scratch/online/frankm/SW/DDDRCaloTubes/install/share/compact/DDDRCaloTubes.xml -interactive -ui

    PID    PPID  PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ USER      P COMMAND                                                                                           
1472811 1469076  20   0  744700 568888 432088 T   0.0   0.1   0:20.13 frankm   58 geoPluginRun -input /scratch/online/frankm/SW/DDDRCaloTubes/install/share/compact/DDDRCaloTubes.+ 

Virt: 700 MB Resident: 569 MB

Tests involving Geant4:

    PID    PPID  PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ USER      P COMMAND
Start of DetectorImp::init
1485953 1485861  20   0 1037628 760188 503960 t   0.0   0.1   0:10.87 frankm   47 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+ 

End of DetectorImp::init
1485953 1485861  20   0 1037628 760188 503944 t   0.0   0.1   0:10.88 frankm   47 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+ 

Start of dd4hep::DetectorImp::endDocument
1485953 1485861  20   0 1039112 761852 504348 t   0.0   0.1   0:10.92 frankm   47 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+ 

End of dd4hep::DetectorImp::endDocument
1485953 1485861  20   0 1040084 763004 504348 t  85.0   0.1   0:26.60 frankm   47 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+ 

Before Geant4Converter:
1485219 1485112  20   0 1097056 807788 530032 t   0.0   0.2   0:28.08 frankm   53 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+ 

After Geant4Converter:
1485219 1485112  20   0 1099176 809900 530144 t   0.0   0.2   0:44.64 frankm   53 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+ 

Before Geant4VolumeManager:
1485219 1485112  20   0 1099176 809900 530140 t   0.0   0.2   0:44.64 frankm   53 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+ 

After Geant4VolumeManager:
1485219 1485112  20   0 1505076   1.2g 530136 t   0.0   0.2   1:45.48 frankm   53 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+ 

After dd4hep::sim::Geant4Exec::initialize 
1485219 1485112  20   0 1511056   1.2g 530240 t   0.0   0.2   1:46.10 frankm   53 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+

Start of dd4hep::sim::Geant4Exec::run
1485219 1485112  20   0 1511928   1.2g 530428 t   0.0   0.2   1:46.11 frankm   53 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+ 

After first event:
1485219 1485112  20   0 1520236   1.2g 532180 t   0.0   0.2   1:48.15 frankm   53 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+ 

Hence:

  • pure ROOT uses about 700 MB virtual/570 MB resident memory
  • Geant4 uses on top about 400 MB virtual/240 MB resident
  • The Geant4 volume manager uses on top another 400 MB virtual/400 MB resident (total of 1.2 GB resident)
    The rest can more or less be neglected.
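The increments quoted above can be recomputed from the top output. The sketch below copies the VIRT and RES columns (top reports them in KiB by default) and takes differences between stages. Two caveats: the "TGeo alone" row comes from a separate geoPluginRun process, so the first difference also includes loading Python and the Geant4 libraries, and the last RES value is only shown as "1.2g", so no exact resident difference is computed for it.

```python
# (stage, VIRT in KiB, RES in KiB) copied from the top output above.
samples = [
    ("TGeo alone", 744700, 568888),
    ("after Geant4Converter", 1099176, 809900),
    ("after Geant4VolumeManager", 1505076, None),   # RES only shown as "1.2g"
]


def deltas(rows, column):
    """Difference of one column between consecutive stages (None if unknown)."""
    out = []
    for prev, cur in zip(rows, rows[1:]):
        if prev[column] is None or cur[column] is None:
            out.append((cur[0], None))
        else:
            out.append((cur[0], cur[column] - prev[column]))
    return out


virt = deltas(samples, 1)
res = deltas(samples, 2)
for (stage, dv), (_, dr) in zip(virt, res):
    dr_text = "n/a" if dr is None else f"{dr / 1024:.0f} MiB"
    print(f"{stage}: +{dv / 1024:.0f} MiB virtual, +{dr_text} resident")
```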

Hence a possible strategy would be:

  • Try to make the Geant4VolumeManager configurable and do not recurse into certain subdetectors.
  • For this detector use a specialized sensitive detector construct, which does not use CellID lookups using the Geant4VolumeManager. The whole concept will not work for so many sensitive elements. Somehow a new CellID lookup has to be thought of.

None of this is impossible, but it requires significant work and is not done in an afternoon. We can develop this as a common effort provided several persons work together.....

@s6anloes
Author

s6anloes commented Jul 2, 2024

How do you know Geant4 is 240MB resident memory? The jump between calling TGeo alone and after Geant4Converter may be 240MB, but it is already close to this at the before Geant4Converter stage, no?

How did you get this output? I would be interested to see how this scales with the full (or at least more complete) geometry.
But I guess it does track with what we have seen, that the main culprit is the Geant4VolumeManager given the difference when running with and without sensitive detectors.

@MarkusFrankATcernch
In this comment your last point confuses me. It is kind of the opposite of what I was trying to communicate, except for this one geometry, which is not the one you have been testing.

@MarkusFrankATcernch
Contributor

@s6anloes So what?
This is the cost of loading G4. Loading all these libraries is far from free even if nothing is done with them (yet).
The volume conversion in this case is apparently not very expensive.

@BrieucF

BrieucF commented Jul 3, 2024

Regarding @s6anloes's description here: #1285 (comment). Shall we first try to understand why Scenario 1 shows no difference with and without sensitive volumes, while Scenario 2 shows a significant difference? Can someone explain that to me?
