Last time I enabled arithmetic acceleration through VFP in a program I wrote, and just in case, I'm now working backwards to see which instructions were actually used.
111a0: f35668e8  vsub.i16  q11, q11, q12
1120c: f35318a1  vsub.i16  d17, d19, d17
1172c: f2600de8  vsub.f32  q8, q8, q12
11730: f2644de8  vsub.f32  q10, q10, q12
11784: ee377a46  vsub.f32  s14, s14, s12
117a8: ee755ac6  vsub.f32  s11, s11, s12
117c0: ee744ac6  vsub.f32  s9, s9, s12
117d8: ee355a46  vsub.f32  s10, s10, s12
117f0: ee755ac6  vsub.f32  s11, s11, s12
1180c: ee744ac6  vsub.f32  s9, s9, s12
11818: ee355a46  vsub.f32  s10, s10, s12
11824: ee356ac6  vsub.f32  s12, s11, s12
11844: f2600de8  vsub.f32  q8, q8, q12
11848: f2644de8  vsub.f32  q10, q10, q12
118a8: ee344a67  vsub.f32  s8, s8, s15
118d4: ee744ae7  vsub.f32  s9, s9, s15
118ec: ee355a67  vsub.f32  s10, s10, s15
11908: ee755ae7  vsub.f32  s11, s11, s15
11918: ee344a67  vsub.f32  s8, s8, s15
11924: ee744ae7  vsub.f32  s9, s9, s15
11930: ee355a67  vsub.f32  s10, s10, s15
1193c: ee757ae7  vsub.f32  s15, s11, s15
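For the record, a listing like the one above can be reproduced by disassembling the binary with the cross toolchain's objdump and grepping for the mnemonic; the toolchain prefix and binary name here are assumptions, not what I actually typed:

arm-linux-gnueabihf-objdump -d ./app | grep vsub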
Disassembling gives the listing above; once I pass over the ones like vsub.i16 that plain NEON could cover anyway, only vsub.f32 is left.
Now I'm wondering whether vsub.f32 belongs to NEON or to VFP, so I'm looking it up. (Judging from the encodings, the f2… forms on q registers look like NEON and the ee… forms on s registers look like VFP, but let's confirm.)
VSUB (floating-point)
Floating-point subtract. This instruction can be scalar, vector, or mixed, but VFP vector mode and mixed mode are deprecated.
[링크 : https://developer.arm.com/documentation/dui0489/i/neon-and-vfp-programming/vsub--floating-point-]
Instruction   Section                     Instruction set
V{Q}SUB       V{Q}SUB, VSUBL and VSUBW    NEON
VSUB          VSUB                        VFP
So the Cortex-A9's NEON MPE implements the Advanced SIMD and VFP extensions,
but does not provide hardware support for the IEEE754 operations listed below.
Is it because of "round floating-point number to nearest integer-valued in floating-point number"
that gcc needs -ffast-math switched on before these instructions get activated? (My hedged guess at the real reason: ARMv7 NEON flushes denormals to zero and so is not fully IEEE754-compliant, which is why GCC only auto-vectorizes float loops with NEON under -funsafe-math-optimizations, a flag that -ffast-math implies.)
IEEE754 standard compliance
The IEEE754 standard provides a number of implementation choices. The ARM Architecture Reference Manual describes the choices that apply to the Advanced SIMD and VFPv3 architectures. The Cortex-A9 NEON MPE implements the ARMv7 Advanced SIMD and VFP extensions. It does not provide hardware support for the following IEEE754 operations:
- remainder
- round floating-point number to nearest integer-valued in floating-point number
- binary-to-decimal conversion
- decimal-to-binary conversion
- direct comparison of single-precision and double-precision values
- any extended-precision operations.
[링크 : https://developer.arm.com/documentation/ddi0409/e/programmers-model/ieee754-standard-compliance]
+
Comparing the builds by option again,
somehow the -ffast-math one disassembles to more code.. so why is it faster?
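For reference, here is the loop reconstructed from the source lines interleaved in the two listings below (first the -ffast-math build, then the plain one). Only READ_SIZE, data, avg_0, std_0 and the three source lines come from the listings; the element type, the wrapper function, and the build flags are my assumptions:

/* assumed builds:
 *   plain     : gcc -O2 -mfpu=vfpv3 stddev.c
 *   fast-math : gcc -O2 -mfpu=neon -ffast-math stddev.c
 * (exact -mfpu/-mfloat-abi values depend on the toolchain) */
#define READ_SIZE 4096               /* placeholder value */
static short data[READ_SIZE];        /* int16 samples, judging by vld2.16/vmovl.s16 */

float sq_dev_sum(float avg_0)        /* hypothetical wrapper name */
{
    float std_0 = 0.0f;
    for (int i = 0; i < READ_SIZE; i += 2) {   /* even-indexed channel only */
        float diff = data[i] - avg_0;
        std_0 += diff * diff;                  /* sum of squared deviations */
    }
    return std_0;
}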
for (int i = 0; i < READ_SIZE; i += 2)
   11710: f3f48c46  vdup.32 q12, d6[0]
     float diff = data[i] - avg_0;
   11714: f46c434d  vld2.16 {d20-d23}, [ip]!
   11718: f2d00a34  vmovl.s16 q8, d20
   1171c: e151000c  cmp r1, ip
   11720: f2d04a35  vmovl.s16 q10, d21
   11724: f3fb0660  vcvt.f32.s32 q8, q8
   11728: f3fb4664  vcvt.f32.s32 q10, q10
   1172c: f2600de8  vsub.f32 q8, q8, q12
   11730: f2644de8  vsub.f32 q10, q10, q12
     std_0 += diff * diff;
   11734: f3400df0  vmul.f32 q8, q8, q8
   11738: f2440df4  vmla.f32 q8, q10, q10
   1173c: f2422de0  vadd.f32 q9, q9, q8
for (int i = 0; i < READ_SIZE; i += 2)
   1177c: e15e000c  cmp lr, ip
     float diff = data[i] - avg_0;
   11780: ee072a90  vmov s15, r2
   11784: eef87ae7  vcvt.f32.s32 s15, s15
   11788: ee777ac6  vsub.f32 s15, s15, s12
     std_0 += diff * diff;
   1178c: ee077aa7  vmla.f32 s14, s15, s15
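The "longer but faster" part seems to come down to throughput: in the -ffast-math build, one vld2.16 {d20-d23} pulls in 16 int16 samples and de-interleaves them into two streams of 8, and every q-register operation then works on 4 floats at once, so one vector iteration covers 8 trips through the scalar loop. In NEON intrinsics the vectorized body would look roughly like this; this is a sketch of what the compiler appears to have generated, not the actual source:

#include <arm_neon.h>

/* One vector iteration over the even-indexed stream: 8 int16 samples
 * are widened to int32, converted to float, and accumulated as
 * sum((x - avg)^2) four lanes at a time. */
float32x4_t step(const int16_t *p, float32x4_t acc, float avg)
{
    float32x4_t vavg = vdupq_n_f32(avg);                 /* vdup.32 q12, d6[0]  */
    int16x8x2_t s = vld2q_s16(p);                        /* vld2.16 {d20-d23}   */
    int32x4_t lo = vmovl_s16(vget_low_s16(s.val[0]));    /* vmovl.s16 q8, d20   */
    int32x4_t hi = vmovl_s16(vget_high_s16(s.val[0]));   /* vmovl.s16 q10, d21  */
    float32x4_t f0 = vsubq_f32(vcvtq_f32_s32(lo), vavg); /* vcvt + vsub.f32     */
    float32x4_t f1 = vsubq_f32(vcvtq_f32_s32(hi), vavg);
    float32x4_t sq = vmulq_f32(f0, f0);                  /* vmul.f32 q8, q8, q8 */
    sq = vmlaq_f32(sq, f1, f1);                          /* vmla.f32 q8, q10, q10 */
    return vaddq_f32(acc, sq);                           /* vadd.f32 q9, q9, q8 */
}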
Hmmmm.. assembly is hard -_ㅠ
Instruction   Section                                      Instruction set
VMLA          VMUL, VMLA, VMLS, VNMUL, VNMLA, and VNMLS    VFP
VMLA{L}       VMUL{L}, VMLA{L}, and VMLS{L} (by scalar)    NEON