For now, execution still fails -_ㅠ
scontrol update NodeName=c[1-5] state=RESUME
sinfo -all
srun -n8 hellompi.o
sacct -a
[링크 : https://groups.io/g/OpenHPC-users/topic/srun_required_node_not/74202339...]
# srun -n 2 -N 2 --pty /bin/bash
srun: Required node not available (down, drained or reserved)
srun: job 5 queued and waiting for resources
^Csrun: Job allocation 5 has been revoked
srun: Force Terminated job 5
# sinfo -all
Tue Dec 22 02:54:55 2020
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE NODELIST
normal*      up 1-00:00:00 1-infinite   no EXCLUSIV        all      2     drained openhpc-[1-2]
# sacct -a
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2                  bash     normal     (null)          0  CANCELLED      0:0
3                  bash     normal     (null)          0  CANCELLED      0:0
4                  bash     normal     (null)          0  CANCELLED      0:0
5                  bash     normal     (null)          0  CANCELLED      0:0
6            mpi_hello+     normal     (null)          0  CANCELLED      0:0
The nodes are stuck in the drain state..
[링크 : https://stackoverflow.com/questions/22480627/what-does-the-state-drain-mean]
They say changing the state should fix it, but it doesn't work for me..
Supposedly you kill all the jobs and then set the state to idle, but how do I kill the jobs?
[링크 : https://stackoverflow.com/questions/29535118/how-to-undrain-slurm-nodes-in-drain-state]
slurm job cancel (killing a job)
The normal method to kill a Slurm job is:
$ scancel <jobid>
You can find your jobid with the following command:
$ squeue -u $USER
If the job id is 1234567 then to kill the job:
$ scancel 1234567
[링크 : https://researchcomputing.princeton.edu/faq/how-to-kill-a-slurm-job]
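In this case the jobs were tied to the drained openhpc nodes, so something along these lines should find and clear them (a minimal sketch using the node names from the sinfo output above; squeue -w filters by node, scancel -u cancels everything owned by a user):

$ squeue -w openhpc-[1-2]
$ scancel -u $USER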
A job whose state is CANCELLED has already been cancelled, so scancel will not cancel it again.
$ scancel -v 8
scancel: Terminating job 8
scancel: error: Kill job error on job id 8: Job/step already completing or completed
I kept wondering why it wasn't working; was there a limit on the core and thread counts?
# sinfo -R -v
Reason=Low socket*core*thread count, Low CPUs
[링크 : https://groups.io/g/OpenHPC-users/topic/slurmd_in_compute_nodes/22449264?p=]
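That reason means the node is registering with fewer sockets*cores*threads (and CPUs) than the slurm.conf entry promises for it. A quick way to compare the two sides, as a sketch (openhpc-1 being one of the nodes above):

# slurmd -C                      (on the compute node: what the hardware really has)
# scontrol show node openhpc-1   (on the head node: what slurm.conf currently claims)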
+
Oh yeah~
Originally it was 2 8 2 or so, but after changing it to 1 1 1,
NodeName=openhpc-[1-2] Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
the nodes actually switch to idle?
# scontrol update Nodename=openhpc-[1-2] state=idle
# sinfo -all
Tue Dec 22 03:49:23 2020
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE NODELIST
normal*      up 1-00:00:00 1-infinite   no EXCLUSIV        all      2       idle* openhpc-[1-2]
Still can't actually run anything though ㅠㅠ
+
It looks like this is because slurmd is not running on the compute nodes.
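If that is the cause, starting it the usual systemd way on each compute node should be enough (a sketch, assuming the OpenHPC packages installed slurmd as a systemd service):

# systemctl enable --now slurmd
# systemctl status slurmd
# journalctl -u slurmd -n 20     (recent slurmd log lines, in case it dies right away)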
But...
Still doesn't work, all the same ㅠㅠ It's one mountain after another.
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
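This timeout usually means the controller cannot reach slurmd on the compute node at all. A couple of hedged checks (6818 is Slurm's default SlurmdPort; firewalld is the CentOS 8 default firewall, which the OpenHPC recipe normally disables on compute nodes):

# scontrol show config | grep SlurmdPort
# nc -zv openhpc-1 6818          (from the head node: is the port reachable?)
# systemctl stop firewalld       (on the compute node, if the port turns out to be blocked)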
+
Hmm...
[2020-12-22T04:18:08.435] error: Node openhpc-1 has low socket*core*thread count (1 < 32)
[2020-12-22T04:18:08.435] error: Node openhpc-1 has low cpu count (1 < 32)
[2020-12-22T04:18:08.435] error: _slurm_rpc_node_registration node=openhpc-1: Invalid argument
What on earth do these values need to be for it to run properly...
# slurmd -C
NodeName=openhpc-1 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=968
NodeName=openhpc-[1-2] Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
The line below is the original value. By the formula above that's 2*8*2 = 32, so is that where the 32 threshold in the error comes from?
NodeName=c[1-4] Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
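Reading the error that way: slurmctld rejects the registration because the node's real counts (1 CPU) are lower than what the slurm.conf entry advertises (2*8*2 = 32). The usual fix is to make the NodeName line match what slurmd -C reports on the node itself, keep slurm.conf identical on the head node and the compute nodes, and restart the daemons. A sketch that just reuses the slurmd -C output above verbatim:

NodeName=openhpc-[1-2] CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=968 State=UNKNOWN

# systemctl restart slurmctld    (head node)
# systemctl restart slurmd       (compute nodes)
# scontrol update NodeName=openhpc-[1-2] State=RESUME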