To cut to the conclusion: in VirtualBox, the number of virtual CPUs has to be set to the product of the values below.
On a physical server you would just set them appropriately, taking hyper-threading into account.
The values below are the Slurm defaults, but OpenHPC's config sets them quite a bit higher:
Sockets=1 CoresPerSocket=1 ThreadsPerCore=1
/etc/slurm/slurm.conf.ohpc
Sockets=2 CoresPerSocket=8 ThreadsPerCore=2
Sockets has nothing to do with TCP sockets; it is the number of physical CPU sockets (chips) in the node.
Since the trend these days is a single multi-core CPU, leaving it at 1 should be fine.
CoresPerSocket is the number of physical cores inside one physical CPU (socket).
ThreadsPerCore is the number of hardware threads per core; with Intel Hyper-Threading enabled this is 2.
Sockets Number of physical processor sockets/chips on the node (e.g. "2"). If Sockets is omitted, it will be inferred from CPUs, CoresPerSocket, and ThreadsPerCore. NOTE: If you have multi-core processors, you will likely need to specify these parameters. Sockets and SocketsPerBoard are mutually exclusive. If Sockets is specified when Boards is also used, Sockets is interpreted as SocketsPerBoard rather than total sockets. The default value is 1.
CoresPerSocket Number of cores in a single physical processor socket (e.g. "2"). The CoresPerSocket value describes physical cores, not the logical number of processors per socket. NOTE: If you have multi-core processors, you will likely need to specify this parameter in order to optimize scheduling. The default value is 1.
ThreadsPerCore Number of logical threads in a single physical core (e.g. "2"). Note that Slurm can allocate resources to jobs down to the resolution of a core. If your system is configured with more than one thread per core, execution of a different job on each thread is not supported unless you configure SelectTypeParameters=CR_CPU plus CPUs; do not configure Sockets, CoresPerSocket or ThreadsPerCore. A job can execute one task per thread from within one job step or execute a distinct job step on each of the threads. Note also that if you are running with more than 1 thread per core and running the select/cons_res or select/cons_tres plugin, then you will want to set the SelectTypeParameters variable to something other than CR_CPU to avoid unexpected results. The default value is 1.
CPUs: Count of processors on each compute node. If CPUs is omitted, it will be inferred from: Sockets, CoresPerSocket, and ThreadsPerCore.
Sockets: Number of physical processor sockets/chips on the node. If Sockets is omitted, it will be inferred from: CPUs, CoresPerSocket, and ThreadsPerCore.
CoresPerSocket: Number of cores in a single physical processor socket. The CoresPerSocket value describes physical cores, not the logical number of processors per socket.
ThreadsPerCore: Number of logical threads in a single physical core.
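For a quick sanity check, compare what slurmd actually detects on the node with what slurm.conf claims; the vCPU count given to the VM should equal the product Sockets x CoresPerSocket x ThreadsPerCore. A minimal sketch, assuming the openhpc-[1-2] node names used later in this post and a 4-vCPU VM:

# on the compute node: print the topology slurmd detects (CPUs, Sockets, CoresPerSocket, ThreadsPerCore)
slurmd -C
# cross-check against the kernel's view
lscpu | grep -E 'Socket|Core|Thread|^CPU\(s\)'
# in VirtualBox, give the VM as many vCPUs as the product above (VM must be powered off; "openhpc-1" is the assumed VM name)
VBoxManage modifyvm "openhpc-1" --cpus 4
# matching node definition in slurm.conf (1 x 4 x 1 = 4 CPUs)
NodeName=openhpc-[1-2] Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN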
# srun mpi_hello_world
srun: error: openhpc-1: task 0: Exited with exit code 2
slurmstepd: error: couldn't chdir to `/root/src/mpitutorial/tutorials/mpi-hello-world/code': No such file or directory: going to /tmp instead
slurmstepd: error: execve(): mpi_hello_world: No such file or directory
Uh... what?!
Looks like I'll have to rebuild the VNFS image. :(
# srun mpi_hello_world
srun: error: openhpc-1: task 0: Exited with exit code 127
mpi_hello_world: error while loading shared libraries: libmpi.so.40: cannot open shared object file: No such file or directory
# ldconfig -v | grep mpi
ldconfig: Can't stat /libx32: No such file or directory
ldconfig: Path `/usr/lib' given more than once
ldconfig: Path `/usr/lib64' given more than once
ldconfig: Can't stat /usr/libx32: No such file or directory
/opt/ohpc/pub/mpi/mpich-ucx-gnu9-ohpc/3.3.2/lib:
libmpicxx.so.12 -> libmpicxx.so.12.1.8
libmpi.so.12 -> libopa.so
libmpifort.so.12 -> libmpifort.so.12.1.8
Now libmpi.so.12 is out of the picture and it complains that libmpi.so.40 is missing instead... hmm.
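libmpi.so.40 is the Open MPI 4.x runtime, so the binary was most likely built against openmpi4 while only the mpich-ucx stack is visible on the node. A rough way to check, assuming the binary name used here:

# which MPI runtime the binary actually wants
ldd ./mpi_hello_world | grep -i mpi
# what the module environment currently provides
module list
module avail openmpi4 mpich
# is the library present anywhere under the OpenHPC tree on the compute node?
find /opt/ohpc/pub/mpi -name 'libmpi.so.40*' 2>/dev/null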
$ srun mpi_hello_world
mpi_hello_world: error while loading shared libraries: libmpi.so.40: cannot open shared object file: No such file or directory
srun: error: openhpc-1: task 0: Exited with exit code 127
$ srun mpi_hello_world
mpi_hello_world: error while loading shared libraries: libgfortran.so.5: cannot open shared object file: No such file or directory
srun: error: openhpc-1: task 0: Exited with exit code 127
libgfortran.i686 : Fortran runtime
libquadmath-devel.x86_64 : GCC __float128 support
ucx.x86_64 : UCX is a communication library implementing high-performance messaging
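Because the compute nodes boot a stateless image, missing runtime libraries have to be installed into the chroot and the VNFS rebuilt. A sketch of the usual Warewulf/OpenHPC flow, assuming $CHROOT points at the compute chroot that the image was built from:

# add the missing runtimes (package names from the search above) into the compute chroot
yum -y --installroot=$CHROOT install libgfortran ucx
# rebuild the VNFS image and reboot the compute nodes so they pick it up
wwvnfs --chroot $CHROOT
pdsh -w openhpc-[1-2] reboot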
Anyway, asked to run 1 task on 2 nodes, it cannot spread a single task across two nodes, so it dropped down to one node and ran only on openhpc-1.
$ srun -N 2 -n 1 mpi_hello_world
srun: Warning: can't run 1 processes on 2 nodes, setting nnodes to 1
Hello world from processor openhpc-1, rank 0 out of 1 processors
Asked for 2 tasks on 2 nodes, it ran one task on each of openhpc-1 and openhpc-2.
$ srun -N 2 -n 2 mpi_hello_world
Hello world from processor openhpc-1, rank 0 out of 1 processors
Hello world from processor openhpc-2, rank 0 out of 1 processors
But with 3 tasks on 2 nodes, anything beyond that count just gets queued and never runs, so it looks like yet another configuration problem... :(
$ srun -N 2 -n 3 mpi_hello_world
srun: Requested partition configuration not available now
srun: job 84 queued and waiting for resources
^Csrun: Job allocation 84 has been revoked
srun: Force Terminated job 84
slurmd and munged do spike a little, though..
Maybe because the job finishes almost instantly, it never even shows up in top.
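For something that exits this quickly, interactive top is the wrong tool; batch-mode sampling catches it. A small aside, nothing OpenHPC-specific:

# sample every half second for ~10 seconds and keep only the daemons of interest
top -b -d 0.5 -n 20 | grep -E 'slurmd|munged'
# or, with the sysstat package installed
pidstat 1 10 | grep -E 'slurmd|munged'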
+
Checking /var/log/slurmctld.log on the controller side,
it complains that the configuration claims more resources than the system actually provides..
[2020-12-23T02:40:11.527] error: Node openhpc-1 has low socket*core*thread count (1 < 4)
[2020-12-23T02:40:11.527] error: Node openhpc-1 has low cpu count (1 < 4)
[2020-12-23T02:40:11.527] error: _slurm_rpc_node_registration node=openhpc-1: Invalid argument
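In other words, slurm.conf claims 4 CPUs per node while each VM only reports 1, so slurmctld rejects the registration. After fixing the mismatch (either more vCPUs or a smaller NodeName line), roughly this sequence pushes the change through; node names assumed as in this post:

scontrol reconfigure                                  # make the controller re-read slurm.conf
pdsh -w openhpc-[1-2] systemctl restart slurmd        # compute nodes re-register with the new topology
scontrol update nodename=openhpc-[1-2] state=resume   # clear the resulting drain/invalid state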
+
After bumping the VM to 4 CPUs in VirtualBox and adjusting the configuration to match, everything works as expected.
$ srun -N 2 -n 4 mpi_hello_world
Hello world from processor openhpc-1, rank 0 out of 1 processors
Hello world from processor openhpc-1, rank 0 out of 1 processors
Hello world from processor openhpc-2, rank 0 out of 1 processors
Hello world from processor openhpc-2, rank 0 out of 1 processors
What is that rank about, though..
$ srun -N 2 -n 8 mpi_hello_world
Hello world from processor openhpc-2, rank 0 out of 1 processors
Hello world from processor openhpc-2, rank 0 out of 1 processors
Hello world from processor openhpc-1, rank 0 out of 1 processors
Hello world from processor openhpc-2, rank 0 out of 1 processors
Hello world from processor openhpc-1, rank 0 out of 1 processors
Hello world from processor openhpc-1, rank 0 out of 1 processors
Hello world from processor openhpc-1, rank 0 out of 1 processors
Hello world from processor openhpc-2, rank 0 out of 1 processors
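Every task printing "rank 0 out of 1 processors" suggests each srun task came up as its own single-process MPI job instead of joining one MPI_COMM_WORLD. Launching through a PMI interface the MPI library was built for, or through OpenHPC's prun wrapper, is the usual remedy; a hedged sketch, not verified on this setup:

# see which process-management interfaces this Slurm build offers
srun --mpi=list
# launch with an explicit PMI (pmi2 here is an assumption; it depends on how the MPI stack was built)
srun --mpi=pmi2 -N 2 -n 4 ./mpi_hello_world
# or do it the way the OpenHPC guide does, via prun inside an allocation
salloc -N 2 -n 4
prun ./mpi_hello_world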
In any case, once you go past the number of available cores, the job just waits indefinitely as shown below,
so in practice it effectively never runs?
$ srun -N 2 -n 12 mpi_hello_world
srun: Requested partition configuration not available now
srun: job 92 queued and waiting for resources
^Csrun: Job allocation 92 has been revoked
srun: Force Terminated job 92
# srun -n 2 -N 2 --pty /bin/bash
srun: Required node not available (down, drained or reserved)
srun: job 5 queued and waiting for resources
^Csrun: Job allocation 5 has been revoked
srun: Force Terminated job 5
# sinfo -all
Tue Dec 22 02:54:55 2020
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES   STATE NODELIST
normal*      up 1-00:00:00 1-infinite   no EXCLUSIV        all      2 drained openhpc-[1-2]
# sacct -a
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2                  bash     normal     (null)          0  CANCELLED      0:0
3                  bash     normal     (null)          0  CANCELLED      0:0
4                  bash     normal     (null)          0  CANCELLED      0:0
5                  bash     normal     (null)          0  CANCELLED      0:0
6            mpi_hello+     normal     (null)          0  CANCELLED      0:0
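With both nodes drained, the recorded reason explains why, and the nodes have to be resumed explicitly once the cause is fixed; node names as above:

# show the drain reason for each node
sinfo -R
scontrol show node openhpc-1 | grep -i reason
# bring the nodes back once the underlying problem is resolved
scontrol update nodename=openhpc-[1-2] state=resume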
4.1 Development Tools
4.2 Compilers
yum -y install ohpc-autotools
yum -y install EasyBuild-ohpc
yum -y install hwloc-ohpc
yum -y install spack-ohpc
yum -y install valgrind-ohpc
yum -y install gnu9-compilers-ohpc
4.3 MPI Stacks
yum -y install mpich-ucx-gnu9-ohpc
module avail mpich
4.4 Performance Tools
yum -y install ohpc-gnu9-perf-tools
4.5 Setup default development environment
yum -y install lmod-defaults-gnu9-openmpi4-ohpc
4.6 3rd Party Libraries and Tools
yum -y install ohpc-gnu9-mpich-parallel-libs
yum -y install ohpc-gnu9-openmpi4-parallel-libs
4.7 Optional Development Tool Builds
yum -y install intel-compilers-devel-ohpc
yum -y install intel-mpi-devel-ohpc
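After the development stack is in place, a quick check that the compiler and MPI modules resolve saves surprises later; a sketch using the gnu9/mpich names installed above:

module avail                    # lmod should list gnu9, mpich, openmpi4, ...
module load gnu9 mpich
which mpicc && mpicc --version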
5 Resource Manager Startup
systemctl enable munge
systemctl enable slurmctld
systemctl start munge
systemctl start slurmctld
+
yum -y install pdsh-ohpc
pdsh -w $compute_prefix[1-2] systemctl start munge
pdsh -w $compute_prefix[1-2] systemctl start slurmd
scontrol update nodename=c[1-4] state=idle
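Once munge and the Slurm daemons are up, the usual checks are that a munge credential validates across nodes and that sinfo reports the nodes; a sketch using the openhpc-1 host name from this post instead of the guide's c[1-4]:

# munge round trip, locally and against a compute node
munge -n | unmunge
munge -n | ssh openhpc-1 unmunge
# scheduler view
systemctl status munge slurmctld
sinfo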
7 Run a Test Job
useradd -m test
wwsh file resync passwd shadow group
pdsh -w $compute_prefix[1-2] /warewulf/bin/wwgetfiles
7.1 Interactive execution
su - test
mpicc -O3 /opt/ohpc/pub/examples/mpi/hello.c
srun -n 8 -N 2 --pty /bin/bash
[test@c1 ~]$ prun ./a.out
7.2 Batch execution
cp /opt/ohpc/pub/examples/slurm/job.mpi .
cat job.mpi
#!/bin/bash
#SBATCH -J test # Job name
#SBATCH -o job.%j.out # Name of stdout output file (%j expands to jobId)
#SBATCH -N 2 # Total number of nodes requested
#SBATCH -n 16 # Total number of mpi tasks requested
#SBATCH -t 01:30:00 # Run time (hh:mm:ss) - 1.5 hours
# Launch MPI-based executable
prun ./a.out
sbatch job.mpi
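After sbatch reports the job id, the job can be watched and its output file (job.&lt;jobid&gt;.out, from the %j pattern above) inspected; &lt;jobid&gt; is a placeholder:

squeue -u test              # pending/running state
sacct -j <jobid>            # final state and exit code
cat job.<jobid>.out         # stdout collected by the #SBATCH -o line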
# yum install ohpc-gnu9-perf-tools
Last metadata expiration check: 0:00:55 ago on Mon 14 Dec 2020 09:28:27 PM.
Error:
Problem: package ohpc-gnu9-perf-tools-2.0-47.1.ohpc.2.0.x86_64 requires scalasca-gnu9-mpich-ohpc, but none of the providers can be installed
- package scalasca-gnu9-mpich-ohpc-2.5-2.3.ohpc.2.0.x86_64 requires lmod-ohpc >= 7.6.1, but none of the providers can be installed
- cannot install the best candidate for the job
- nothing provides lua-filesystem needed by lmod-ohpc-8.2.10-15.1.ohpc.2.0.x86_64
- nothing provides lua-posix needed by lmod-ohpc-8.2.10-15.1.ohpc.2.0.x86_64
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)
# yum -y install gnu9-compilers-ohpc
Last metadata expiration check: 0:01:36 ago on Mon 14 Dec 2020 09:28:27 PM.
Package gnu9-compilers-ohpc-9.3.0-15.1.ohpc.2.0.x86_64 is already installed.
Dependencies resolved.
Nothing to do.
Complete!
# yum -y install mpich-ucx-gnu9-ohpc
Last metadata expiration check: 0:01:46 ago on Mon 14 Dec 2020 09:28:27 PM.
Package mpich-ucx-gnu9-ohpc-3.3.2-13.1.ohpc.2.0.x86_64 is already installed.
Dependencies resolved.
Nothing to do.
Complete!
# yum -y install ohpc-gnu9-perf-tools
Last metadata expiration check: 0:02:45 ago on Mon 14 Dec 2020 09:28:27 PM.
Error:
Problem: package ohpc-gnu9-perf-tools-2.0-47.1.ohpc.2.0.x86_64 requires scalasca-gnu9-mpich-ohpc, but none of the providers can be installed
- package scalasca-gnu9-mpich-ohpc-2.5-2.3.ohpc.2.0.x86_64 requires lmod-ohpc >= 7.6.1, but none of the providers can be installed
- cannot install the best candidate for the job
- nothing provides lua-filesystem needed by lmod-ohpc-8.2.10-15.1.ohpc.2.0.x86_64
- nothing provides lua-posix needed by lmod-ohpc-8.2.10-15.1.ohpc.2.0.x86_64
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)
# yum install lmod-ohpc
Last metadata expiration check: 0:03:12 ago on Mon 14 Dec 2020 09:28:27 PM.
Error:
Problem: cannot install the best candidate for the job
- nothing provides lua-filesystem needed by lmod-ohpc-8.2.10-15.1.ohpc.2.0.x86_64
- nothing provides lua-posix needed by lmod-ohpc-8.2.10-15.1.ohpc.2.0.x86_64
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)
# yum -y install lmod-defaults-gnu9-openmpi4-ohpc
Last metadata expiration check: 0:03:47 ago on Mon 14 Dec 2020 09:28:27 PM.
Error:
Problem: package lmod-defaults-gnu9-openmpi4-ohpc-2.0-4.1.ohpc.2.0.noarch requires lmod-ohpc, but none of the providers can be installed
- conflicting requests
- nothing provides lua-filesystem needed by lmod-ohpc-8.2.10-15.1.ohpc.2.0.x86_64
- nothing provides lua-posix needed by lmod-ohpc-8.2.10-15.1.ohpc.2.0.x86_64
(try to add '--skip-broken' to skip uninstallable packages or '--nobest' to use not only best candidate packages)
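lua-filesystem and lua-posix are not in the base CentOS 8 repos; as far as I can tell they come from the PowerTools repository (and OpenHPC 2.x also expects EPEL), so enabling those and retrying is the likely fix. A hedged sketch; the repo id is spelled PowerTools on older 8.x point releases and powertools on newer ones:

yum -y install dnf-plugins-core epel-release
yum config-manager --set-enabled powertools   # or PowerTools, depending on the CentOS 8 minor release
yum -y install lmod-ohpc
yum -y install ohpc-gnu9-perf-tools lmod-defaults-gnu9-openmpi4-ohpc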