For now, execution still fails -_ㅠ
scontrol update NodeName=c[1-5] state=RESUME
sinfo -all
srun -n8 hellompi.o
sacct -a
[링크 : https://groups.io/g/OpenHPC-users/topic/srun_required_node_not/74202339...]
# srun -n 2 -N 2 --pty /bin/bash
srun: Required node not available (down, drained or reserved)
srun: job 5 queued and waiting for resources
^Csrun: Job allocation 5 has been revoked
srun: Force Terminated job 5
# sinfo -all
Tue Dec 22 02:54:55 2020
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE NODELIST
normal*      up 1-00:00:00 1-infinite   no EXCLUSIV        all      2     drained openhpc-[1-2]
# sacct -a
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
2                  bash     normal     (null)          0  CANCELLED      0:0
3                  bash     normal     (null)          0  CANCELLED      0:0
4                  bash     normal     (null)          0  CANCELLED      0:0
5                  bash     normal     (null)          0  CANCELLED      0:0
6            mpi_hello+     normal     (null)          0  CANCELLED      0:0
The nodes are stuck in the drain state..
[링크 : https://stackoverflow.com/questions/22480627/what-does-the-state-drain-mean]
They say changing the state should fix it, but it doesn't work for me..
Supposedly you kill all the jobs and then set the state to idle, but how do I kill the jobs?
[링크 : https://stackoverflow.com/questions/29535118/how-to-undrain-slurm-nodes-in-drain-state]
slurm job cancel (killing a job)
The normal method to kill a Slurm job is:
$ scancel <jobid>
You can find your jobid with the following command:
$ squeue -u $USER
If the job id is 1234567 then to kill the job:
$ scancel 1234567
[링크 : https://researchcomputing.princeton.edu/faq/how-to-kill-a-slurm-job]
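In this case the jobs were tied to the drained openhpc nodes, so something along these lines should find and clear them (a minimal sketch using the node names from the sinfo output above; squeue -w filters by node, scancel -u cancels everything owned by a user):

$ squeue -w openhpc-[1-2]
$ scancel -u $USER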
A job whose state is CANCELLED has already been cancelled, so scancel will not cancel it again.
$ scancel -v 8
scancel: Terminating job 8
scancel: error: Kill job error on job id 8: Job/step already completing or completed
I kept wondering why it wasn't working; was there a limit on the core and thread counts?
# sinfo -R -v
Reason=Low socket*core*thread count, Low CPUs
[링크 : https://groups.io/g/OpenHPC-users/topic/slurmd_in_compute_nodes/22449264?p=]
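That reason means the node is registering with fewer sockets*cores*threads (and CPUs) than the slurm.conf entry promises for it. A quick way to compare the two sides, as a sketch (openhpc-1 being one of the nodes above):

# slurmd -C                      (on the compute node: what the hardware really has)
# scontrol show node openhpc-1   (on the head node: what slurm.conf currently claims)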
+
Oh yeah~
Originally it was 2 8 2 or so, but after changing it to 1 1 1,
NodeName=openhpc-[1-2] Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
the nodes actually switch to idle?
# scontrol update Nodename=openhpc-[1-2] state=idle
# sinfo -all
Tue Dec 22 03:49:23 2020
PARTITION AVAIL  TIMELIMIT   JOB_SIZE ROOT OVERSUBS     GROUPS  NODES       STATE NODELIST
normal*      up 1-00:00:00 1-infinite   no EXCLUSIV        all      2       idle* openhpc-[1-2]
Still can't actually run anything though ㅠㅠ
+
It looks like this is because slurmd is not running on the compute nodes.
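If that is the cause, starting it the usual systemd way on each compute node should be enough (a sketch, assuming the OpenHPC packages installed slurmd as a systemd service):

# systemctl enable --now slurmd
# systemctl status slurmd
# journalctl -u slurmd -n 20     (recent slurmd log lines, in case it dies right away)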
But...
Still doesn't work, all the same ㅠㅠ It's one mountain after another.
srun: error: slurm_receive_msgs: Socket timed out on send/recv operation
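This timeout usually means the controller cannot reach slurmd on the compute node at all. A couple of hedged checks (6818 is Slurm's default SlurmdPort; firewalld is the CentOS 8 default firewall, which the OpenHPC recipe normally disables on compute nodes):

# scontrol show config | grep SlurmdPort
# nc -zv openhpc-1 6818          (from the head node: is the port reachable?)
# systemctl stop firewalld       (on the compute node, if the port turns out to be blocked)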
+
Hmm...
[2020-12-22T04:18:08.435] error: Node openhpc-1 has low socket*core*thread count (1 < 32)
[2020-12-22T04:18:08.435] error: Node openhpc-1 has low cpu count (1 < 32)
[2020-12-22T04:18:08.435] error: _slurm_rpc_node_registration node=openhpc-1: Invalid argument
What on earth do these values need to be for it to run properly...
# slurmd -C
NodeName=openhpc-1 CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=968
NodeName=openhpc-[1-2] Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
The line below is the original value. By the formula above that's 2*8*2 = 32, so is that where the 32 threshold in the error comes from?
NodeName=c[1-4] Sockets=2 CoresPerSocket=8 ThreadsPerCore=2 State=UNKNOWN
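Reading the error that way: slurmctld rejects the registration because the node's real counts (1 CPU) are lower than what the slurm.conf entry advertises (2*8*2 = 32). The usual fix is to make the NodeName line match what slurmd -C reports on the node itself, keep slurm.conf identical on the head node and the compute nodes, and restart the daemons. A sketch that just reuses the slurmd -C output above verbatim:

NodeName=openhpc-[1-2] CPUs=1 Boards=1 SocketsPerBoard=1 CoresPerSocket=1 ThreadsPerCore=1 RealMemory=968 State=UNKNOWN

# systemctl restart slurmctld    (head node)
# systemctl restart slurmd       (compute nodes)
# scontrol update NodeName=openhpc-[1-2] State=RESUME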