Huh...
It runs fine on a Raspberry Pi 3B with the 64-bit OS:
pi@raspberrypi:~ $ mpirun
--------------------------------------------------------------------------
mpirun could not find anything to do.

It is possible that you forgot to specify how many processes to run
via the "-np" argument.
--------------------------------------------------------------------------
pi@raspberrypi:~ $ mpirun --version
mpirun (Open MPI) 4.1.0

Report bugs to http://www.open-mpi.org/community/help/
but on the odroid c2 (aarch64) it errors out.
I'm trying to track it down with strace and the like. The package just won't work, which is annoying -_T
odroid c2
geteuid() = 1000
openat(AT_FDCWD, "/proc/cpuinfo", O_RDONLY) = 3
--- SIGILL {si_signo=SIGILL, si_code=ILL_ILLOPC, si_addr=0x7fa18b7cfc} ---
+++ killed by SIGILL +++
Illegal instruction
Raspberry Pi
geteuid() = 1000
openat(AT_FDCWD, "/proc/cpuinfo", O_RDONLY) = 3
close(3) = 0
getuid() = 1000
geteuid() = 1000
Seriously, it dies just from opening /proc/cpuinfo? What is going on?
+
Running it under gdb, it dies out of nowhere inside libopen-pal.so.40. What on earth is PAL doing that would raise an illegal instruction?
$ gdb mpirun
GNU gdb (Ubuntu 9.2-0ubuntu1~20.04.1) 9.2
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "aarch64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from mpirun...
(No debugging symbols found in mpirun)
(gdb) r
Starting program: /usr/bin/mpirun
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".

Program received signal SIGILL, Illegal instruction.
0x0000007fb7ec5cfc in ?? () from /usr/lib/aarch64-linux-gnu/libopen-pal.so.40
Are my eyes going? I misread it as open-pam.. -_-
Anyway, the culprit right now is libopen-pal, also called OPAL.
Open MPI is reportedly built from three libraries: liboshmem (the OpenSHMEM layer), libmpi, and libopen-pal.
That is, they are compiled into separate libraries: liboshmem, libmpi, libopen-pal with a strict dependency order: OSHMEM depends on OMPI, OMPI depends on OPAL.
[link: https://docs.open-mpi.org/en/v5.0.x/developers/terminology.html]
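OPAL (Open Portable Access Layer) is the layer closest to the OS and hardware, and it apparently also handles checkpointing and restarting programs..
To pause a program and resume it, it would need to know the processor pretty intimately..
Flip that around: maybe the aarch64 build used on the odroid-c2 just wasn't built to fit an Amlogic-based AP?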
Open PAL can involuntarily checkpoint and restart sequential programs. Doing so requires that Open PAL was compiled with thread support and that the back-end checkpointing systems are available at run-time.
[link: https://www.open-mpi.org/doc/v4.1/man7/opal_crs.7.php]
+
Time to build openmpi-4.1.0 on the odroid c2 with make -j4.
I was installing gdb partway through, so a clean run would probably have been around 24 minutes?
real    25m20.199s
user    52m9.680s
sys     5m27.920s
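A quick sanity check on those numbers: time reports wall-clock time as real and CPU time summed over all threads as user and sys, so (user + sys) / real is roughly how many cores were busy on average. A minimal sketch of that arithmetic; the helper names are made up for illustration.

```c
/* Hypothetical helpers for reading `time` output. */

/* Converts a "XmY.YYYs" field into seconds. */
static double to_seconds(int minutes, double seconds)
{
    return minutes * 60.0 + seconds;
}

/* Average number of CPUs kept busy: total CPU time / wall-clock time. */
double effective_parallelism(double real_s, double user_s, double sys_s)
{
    return (user_s + sys_s) / real_s;
}

/* The make -j4 run above, in seconds. */
double build_parallelism(void)
{
    return effective_parallelism(to_seconds(25, 20.199),
                                 to_seconds(52, 9.680),
                                 to_seconds(5, 27.920));
}
```

For the run above this comes out to about 2.27, so on average a bit more than half of the four cores were busy during make -j4.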
+
I went all out(?) and built it myself, finally got it running, and why does that error show up AGAIN T_T
Program received signal SIGILL, Illegal instruction.
0x0000007fb7ec236c in opal_timer_linux_find_freq () from /usr/local/lib/libopen-pal.so.40
This time the error shows a function name, so let's trace it.
$ grep -rn "opal_timer_linux_find_freq" .
Binary file ./opal/mca/timer/linux/.libs/timer_linux_component.o matches
Binary file ./opal/mca/timer/linux/.libs/libmca_timer_linux.a matches
./opal/mca/timer/linux/timer_linux_component.c:105:static int opal_timer_linux_find_freq(void)
./opal/mca/timer/linux/timer_linux_component.c:203:    ret = opal_timer_linux_find_freq();
$ cat opal/mca/timer/linux/timer_linux_component.c
static int opal_timer_linux_find_freq(void)
{
    FILE *fp;
    char *loc;
    float cpu_f;
    int ret;
    char buf[1024];

    fp = fopen("/proc/cpuinfo", "r");
    if (NULL == fp) {
        return OPAL_ERR_IN_ERRNO;
    }

    opal_timer_linux_freq = 0;

#if OPAL_ASSEMBLY_ARCH == OPAL_ARM64
    opal_timer_linux_freq = opal_sys_timer_freq();
#endif
$ grep -rni "opal_sys_timer_freq" .
./mca/timer/linux/timer_linux_component.c:121:        opal_timer_linux_freq = opal_sys_timer_freq();
./include/opal/sys/arm64/timer.h:36:opal_sys_timer_freq(void)
$ cat openmpi-4.1.0/opal/include/opal/sys/arm64/timer.h
static inline opal_timer_t
opal_sys_timer_freq(void)
{
    opal_timer_t freq;
    __asm__ __volatile__ ("mrs %0, CNTFRQ_EL0" : "=r" (freq));
    return (opal_timer_t)(freq);
}
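This also explains the strace output. fopen("/proc/cpuinfo") is the openat(...) = 3 we saw, and on ARM64 the very next step is opal_sys_timer_freq(), i.e. an mrs from CNTFRQ_EL0, so the process dies before it ever reaches close(3). Opening /proc/cpuinfo wasn't what killed it.
And even the register name looks scary?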
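Before digging through manuals, one cheap experiment is to run just this one instruction in a tiny program, outside Open MPI, and catch the SIGILL instead of dying. This is a sketch that assumes Linux and GCC. probe_cntfrq is a hypothetical name, and the SIGILL handler plus sigsetjmp/siglongjmp is a generic trick, not something Open MPI does. One point from the Arm register description may matter here: at EL0 (user space), reading CNTFRQ_EL0 is only allowed if the kernel has enabled EL0 counter access in CNTKCTL_EL1 (EL0PCTEN or EL0VCTEN). Otherwise the read traps to the kernel. I haven't verified whether that is what happens with this C2 kernel, so treat it as an assumption.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
static sigjmp_buf probe_env;

/* SIGILL handler: jump back into probe_cntfrq() instead of dying. */
static void on_sigill(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);
}
#endif

/* Hypothetical probe: reads CNTFRQ_EL0 the same way opal_sys_timer_freq()
 * does. Returns 1 and stores the value in *freq if the read worked,
 * 0 if it raised SIGILL, and -1 when not built for AArch64 or when the
 * SIGILL handler could not be installed. */
int probe_cntfrq(uint64_t *freq)
{
#if defined(__aarch64__)
    struct sigaction sa, old;
    volatile int ok = 0;
    uint64_t v = 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_sigill;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGILL, &sa, &old) != 0)
        return -1;

    if (sigsetjmp(probe_env, 1) == 0) {
        __asm__ __volatile__("mrs %0, cntfrq_el0" : "=r"(v));
        ok = 1;                        /* only reached if no SIGILL */
    }
    sigaction(SIGILL, &old, NULL);     /* restore the previous handler */
    if (ok && freq != NULL)
        *freq = v;
    return ok;
#else
    (void)freq;
    return -1;
#endif
}
```

If you call it from a main() on each board, I'd expect 1 on the Raspberry Pi, since mpirun works there. A 0 on the C2 would show that the instruction itself is refused in user space, independent of Open MPI.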
MRS
Move System Register.
Syntax
MRS Xt, (systemreg|Sop0_op1_Cn_Cm_op2)
Where:
Xt: Is the 64-bit name of the general-purpose destination register.
systemreg: Is a System register name.
The System register names are defined in 'AArch64 System Registers' in the System Register XML.
op0: Is an unsigned immediate, and can be either 2 or 3.
op1: Is a 3-bit unsigned immediate, in the range 0 to 7.
Cn: Is a name Cn, with n in the range 0 to 15.
Cm: Is a name Cm, with m in the range 0 to 15.
op2: Is a 3-bit unsigned immediate, in the range 0 to 7.
Usage
Move System Register allows the PE to read an AArch64 System register into a general-purpose register.
[link: https://developer.arm.com/documentation/dui0801/h/A64-General-Instructions/MRS]
Accessing CNTFRQ_EL0
Accesses to this register use the following encodings in the System register encoding space:
MRS <Xt>, CNTFRQ_EL0
op0    op1     CRn      CRm      op2
0b11   0b011   0b1110   0b0000   0b000
[link: https://developer.arm.com/documentation/ddi0601/2022-03/AArch64-Registers/CNTFRQ-EL0--Counter-timer-Frequency-register]
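The op0/op1/Cn/Cm/op2 fields in the MRS description are literally bit fields of the 32-bit instruction word. The sketch below packs them the way I read the A64 MRS encoding: bits 31..20 fixed at 1101 0101 0011, bit 19 = op0 - 2, then op1, CRn, CRm, op2 and Rt. encode_mrs is a made-up helper. CNTFRQ_EL0 is op0=3, op1=3, CRn=14, CRm=0, op2=0, so mrs x0, CNTFRQ_EL0 should come out as 0xd53be000.

```c
#include <stdint.h>

/* Hypothetical helper: encodes "MRS Xt, S<op0>_<op1>_C<n>_C<m>_<op2>".
 * Layout: bits 31..20 = 1101 0101 0011, bit 19 = op0 - 2 (op0 is 2 or 3),
 * bits 18..16 = op1, 15..12 = CRn, 11..8 = CRm, 7..5 = op2, 4..0 = Rt. */
uint32_t encode_mrs(unsigned op0, unsigned op1, unsigned crn,
                    unsigned crm, unsigned op2, unsigned rt)
{
    return 0xD5300000u
         | (((uint32_t)(op0 - 2u) & 1u) << 19)
         | (((uint32_t)op1 & 7u) << 16)
         | (((uint32_t)crn & 15u) << 12)
         | (((uint32_t)crm & 15u) << 8)
         | (((uint32_t)op2 & 7u) << 5)
         | ((uint32_t)rt & 31u);
}
```

At the SIGILL in gdb, you can check that the faulting instruction really is this mrs by comparing the word against x/i $pc (or x/wx $pc).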
The two instructions you show above are therefore -
MSR HPFAR_EL2, X0
MSR PSTATEField_SP, #0
[link: https://reverseengineering.stackexchange.com/questions/14617/arm-understanding-msr-mrs-instructions]
So it's a pretty ordinary(?) instruction that even u-boot uses.. then what gives?
unsigned long timer_read_counter(void)
{
    unsigned long cntpct;
    unsigned long temp;

    isb();
    asm volatile("mrs %0, cntpct_el0" : "=r" (cntpct));
    asm volatile("mrs %0, cntpct_el0" : "=r" (temp));
    while (temp != cntpct) {
        asm volatile("mrs %0, cntpct_el0" : "=r" (cntpct));
        asm volatile("mrs %0, cntpct_el0" : "=r" (temp));
    }

    return cntpct;
}
[link: https://github.com/qemu/u-boot/blob/master/arch/arm/cpu/armv8/generic_timer.c]
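u-boot's timer_read_counter() only returns raw ticks from cntpct_el0. To turn ticks into time you need the tick rate, which is what CNTFRQ_EL0 holds. Presumably that is why OPAL's timer component reads it at startup. Here is a minimal sketch of the conversion; ticks_to_ns is a hypothetical name and not an Open MPI or u-boot function.

```c
#include <stdint.h>

/* Hypothetical helper: converts a counter delta (e.g. the difference of two
 * timer_read_counter()-style reads) into nanoseconds, given the counter
 * frequency in Hz as reported by CNTFRQ_EL0. Splitting into whole seconds
 * plus a remainder keeps ticks * 1e9 from overflowing for large deltas;
 * the remainder term assumes freq_hz is well below ~1.8e10 Hz. */
uint64_t ticks_to_ns(uint64_t ticks, uint64_t freq_hz)
{
    uint64_t secs, rem;

    if (freq_hz == 0)       /* frequency not set: nothing sensible to do */
        return 0;
    secs = ticks / freq_hz;
    rem = ticks % freq_hz;
    return secs * 1000000000ull + rem * 1000000000ull / freq_hz;
}
```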
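Does %0 mean the dst from before? Roughly: %0 is a placeholder for operand number 0 in the operand lists, where outputs are numbered first and inputs after them. In timer.h the only operand is the output "=r" (freq), so there %0 is indeed the destination register. In the example below there are no outputs, so %0 is the input.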
int val = 50;
__asm volatile ("MOV R0, %0" : : "r"(val) : "r0");   /* r0 is overwritten, so list it as clobbered */
==> (compiler output)
ldr r3, [r7, #4]
mov r0, r3
[link: https://dhpark1212.tistory.com/entry/ARM-GCC-Inline-assembly-coding]
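To see the numbering with more than one operand: in GCC extended asm, operands are numbered in the order they are listed, outputs first and then inputs. With one output and two inputs, %0 is the output and %1 and %2 are the inputs. Below is a small sketch; asm_add is a made-up name. The AArch64 branch is the interesting one, and the x86-64 branch is only there so the example also builds on a PC.

```c
#include <stdint.h>

/* Adds two numbers via inline asm to show operand numbering:
 * %0 -> "=r"(sum) (the only output), %1 -> "r"(a), %2 -> "r"(b). */
uint64_t asm_add(uint64_t a, uint64_t b)
{
    uint64_t sum;
#if defined(__aarch64__)
    __asm__ ("add %0, %1, %2" : "=r"(sum) : "r"(a), "r"(b));
#elif defined(__x86_64__)
    /* AT&T syntax puts the destination last, so %0 comes at the end here. */
    __asm__ ("lea (%1,%2), %0" : "=r"(sum) : "r"(a), "r"(b));
#else
    sum = a + b;   /* no inline asm for other targets in this sketch */
#endif
    return sum;
}
```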
The C2 uses Amlogic's Cortex-A53. Is that really so different?
CPU comparison:
- ODROID-C2: Amlogic S905 SoC, 4 x ARM Cortex-A53 1.5GHz, 64bit ARMv8 Architecture @28nm
- ODROID-C1+: Amlogic S805 SoC, 4 x ARM Cortex-A5 1.5GHz, 32bit ARMv7 Architecture @28nm
- RPi 3 Model B: Broadcom BCM2837, 4 x ARM Cortex-A53 1.2GHz, 64bit ARMv8 Architecture @40nm
[link: https://www.hardkernel.com/ko/shop/odroid-c2/]
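One way to check whether the cores themselves differ is to compare the MIDR-derived fields that arm64 Linux prints in /proc/cpuinfo ("CPU implementer", "CPU part"). Arm's Cortex-A53 reports part 0xd03. The sketch below pulls out the first "CPU part" value; both function names are hypothetical. If both boards print 0xd03, the core design is the same, and the difference would have to come from somewhere else. That is an inference, not something checked here.

```c
#include <stdio.h>
#include <string.h>

/* Returns the first "CPU part" value in /proc/cpuinfo-style text,
 * or -1 if there is none (x86, for example, has no such field). */
long cpuinfo_cpu_part(const char *text)
{
    const char *p = strstr(text, "CPU part");
    unsigned long part;

    if (p == NULL)
        return -1;
    p = strchr(p, ':');
    /* %lx skips leading blanks and accepts an optional 0x prefix. */
    if (p == NULL || sscanf(p + 1, "%lx", &part) != 1)
        return -1;
    return (long)part;
}

/* Reads a cpuinfo file (normally "/proc/cpuinfo") and parses it. */
long cpu_part_from_file(const char *path)
{
    static char buf[65536];
    size_t n;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;
    n = fread(buf, 1, sizeof buf - 1, fp);
    fclose(fp);
    buf[n] = '\0';
    return cpuinfo_cpu_part(buf);
}
```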