I started a new cluster using aws-eda-slurm-cluster with ParallelCluster 3.11.0 (though I suspect this happens with any version, since I have logs from 3.9.1 suggesting the same behavior there).
When I submit jobs, I get error messages in the HeadNode's /var/log/slurmctld.log:
[2024-10-19T06:22:38.009] error: Node od-r7a-2xl-dy-od-r7a-2xl-2 appears to have a different slurm.conf than the slurmctld. This could cause issues with communication and functionality. Please review both files and make sure they are the same. If this is expected ignore, and set DebugFlags=NO_CONF_HASH in your slurm.conf.
It seems the head node starts up first, and I suspect some aws-eda-slurm-cluster configuration happens after head node startup. This is only a guess.
The launch time of my HeadNode is 2024/10/17 14:08 GMT-7, yet slurm.conf and the other files in /opt/slurm/etc show modification times of Oct 17 14:15, e.g.:
$ ls -l *.conf
-rw-r--r-- 1 root root 249 Oct 17 14:15 cgroup.conf
-rw-r--r-- 1 root root 174 Oct 17 14:15 gres.conf
-rw-r--r-- 1 root root 2136 Oct 17 14:15 slurm.conf
-rw-r--r-- 1 root root 177 Oct 17 14:15 slurm_parallelcluster_cgroup.conf
-rw-r--r-- 1 root root 3703 Oct 17 14:15 slurm_parallelcluster.conf
-rw-r--r-- 1 root root 3270 Oct 17 14:15 slurm_parallelcluster_gres.conf
-rw-r--r-- 1 root root 168 Oct 17 14:15 slurm_parallelcluster_slurmdbd.conf
I thought it might be caused by the files being modified to include the new cluster's name in paths, but even files that don't contain the cluster name, e.g. slurm_parallelcluster_slurmdbd.conf, show the 14:15 timestamp.
Nonetheless, I can make the error message go away by running the following on the HeadNode:
sudo scontrol reconfigure
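As a quick sanity check before (or after) the reconfigure, the two copies of slurm.conf can be compared by content hash. The sketch below uses MD5 purely as a stand-in for slurmctld's internal config-hash check, not Slurm's actual algorithm; the /opt/slurm/etc/slurm.conf path is from the listing above, and running it on both the head node and a compute node to compare digests is left to the reader:

```python
import hashlib

def conf_digest(path):
    """Return an MD5 digest of a config file's bytes, as a rough
    stand-in for the config-hash comparison slurmctld performs."""
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

# Hypothetical usage: run on the HeadNode and on a compute node,
# then compare the two printed digests by hand.
#   print(conf_digest("/opt/slurm/etc/slurm.conf"))
```

If the digests match but the error persists, the compute node is likely still holding the pre-update config in memory, which is consistent with the reconfigure clearing the error.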
I'm reporting this here rather than in the parallelcluster issue tracker because I don't want to believe this is prevalent in standard parallelcluster deployments.
Reproduce:
1. Start a new cluster.
2. Submit a job on the new cluster.
3. Observe /var/log/slurmctld.log on the HeadNode.