I enabled VFP-accelerated arithmetic in the program I wrote last time, so just in case, I'm working backwards to find out which instructions it actually used.
   111a0:  f35668e8  vsub.i16  q11, q11, q12
   1120c:  f35318a1  vsub.i16  d17, d19, d17
   1172c:  f2600de8  vsub.f32  q8, q8, q12
   11730:  f2644de8  vsub.f32  q10, q10, q12
   11784:  ee377a46  vsub.f32  s14, s14, s12
   117a8:  ee755ac6  vsub.f32  s11, s11, s12
   117c0:  ee744ac6  vsub.f32  s9, s9, s12
   117d8:  ee355a46  vsub.f32  s10, s10, s12
   117f0:  ee755ac6  vsub.f32  s11, s11, s12
   1180c:  ee744ac6  vsub.f32  s9, s9, s12
   11818:  ee355a46  vsub.f32  s10, s10, s12
   11824:  ee356ac6  vsub.f32  s12, s11, s12
   11844:  f2600de8  vsub.f32  q8, q8, q12
   11848:  f2644de8  vsub.f32  q10, q10, q12
   118a8:  ee344a67  vsub.f32  s8, s8, s15
   118d4:  ee744ae7  vsub.f32  s9, s9, s15
   118ec:  ee355a67  vsub.f32  s10, s10, s15
   11908:  ee755ae7  vsub.f32  s11, s11, s15
   11918:  ee344a67  vsub.f32  s8, s8, s15
   11924:  ee744ae7  vsub.f32  s9, s9, s15
   11930:  ee355a67  vsub.f32  s10, s10, s15
   1193c:  ee757ae7  vsub.f32  s15, s11, s15
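Looking at the disassembly above, if I skip the ones like vsub.i16 that look like they could be NEON, the only thing left is vsub.f32.
I wanted to know whether vsub.f32 belongs to NEON or to VFP, so I'm looking that up.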
VSUB (floating-point)
Floating-point subtract. This instruction can be scalar, vector, or mixed, but VFP vector mode and mixed mode are deprecated.
[Link: https://developer.arm.com/documentation/dui0489/i/neon-and-vfp-programming/vsub--floating-point-]
| Instruction | Section | Instruction set |
|---|---|---|
| V{Q}SUB | V{Q}SUB, VSUBL and VSUBW | NEON |
| VSUB | VSUB | VFP |
[Link: https://developer.arm.com/documentation/den0018/a/NEON-and-VFP-Instruction-Summary/List-of-all-NEON-and-VFP-instructions]
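The raw encodings in the vsub listing already hint at the answer: the q/d-register forms start with 0xf2/0xf3 (the Advanced SIMD data-processing space), while the s-register forms start with 0xee and have 0xa in bits 11:8 (the VFP coprocessor space). So vsub.f32 seems to exist in both instruction sets, and the register type tells them apart. Below is a rough sketch of that check in C. The names `classify_a32` and `insn_class` are made up for this post, it assumes ARM (A32) state rather than Thumb, and it is a heuristic rather than a decoder: a few Advanced SIMD instructions that are encoded in the coprocessor 10/11 space (for example vdup from a core register) would be reported as VFP.

```c
#include <stdint.h>

/* Hypothetical labels for this sketch, not an official ARM classification. */
enum insn_class { INSN_OTHER, INSN_NEON, INSN_VFP };

/* Roughly classify one 32-bit ARM (A32) instruction word. */
enum insn_class classify_a32(uint32_t w)
{
    /* Advanced SIMD data processing: bits[31:25] == 1111001 (0xf2.., 0xf3..) */
    if ((w & 0xFE000000u) == 0xF2000000u)
        return INSN_NEON;

    /* Advanced SIMD element/structure load/store (vld2 etc.):
       bits[31:24] == 11110100 and bit 20 == 0 */
    if ((w & 0xFF100000u) == 0xF4000000u)
        return INSN_NEON;

    /* VFP data processing and register transfers: condition field is not
       1111, bits[27:24] == 1110, bits[11:9] == 101 (coprocessor 10 or 11) */
    if ((w >> 28) != 0xFu && (w & 0x0F000E00u) == 0x0E000A00u)
        return INSN_VFP;

    return INSN_OTHER;
}
```

Feeding it words from the listing, `0xf2600de8` (vsub.f32 q8, q8, q12) comes back as NEON and `0xee377a46` (vsub.f32 s14, s14, s12) as VFP.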
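The Cortex-A9 NEON MPE implements the Advanced SIMD and VFP extensions,
but it says the following IEEE754 operations are not provided in hardware.
Is it because of "round floating-point number to nearest integer-valued in floating-point number"
that -ffast-math has to be turned on in gcc before the VFP instructions get enabled?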
IEEE754 standard compliance
The IEEE754 standard provides a number of implementation choices. The ARM Architecture Reference Manual describes the choices that apply to the Advanced SIMD and VFPv3 architectures.
The Cortex-A9 NEON MPE implements the ARMv7 Advanced SIMD and VFP extensions. It does not provide hardware support for the following IEEE754 operations:
- remainder
- round floating-point number to nearest integer-valued in floating-point number
- binary-to-decimal conversion
- decimal-to-binary conversion
- direct comparison of single-precision and double-precision values
- any extended-precision operations.
[Link: https://developer.arm.com/documentation/ddi0409/e/programmers-model/ieee754-standard-compliance]
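On the -ffast-math question, a likelier explanation than the missing rounding operation: as far as I understand the GCC ARM option documentation, with -mfpu=neon the auto-vectorizer does not emit NEON floating-point instructions unless -funsafe-math-optimizations is given (and -ffast-math turns that on), because ARMv7 NEON treats denormals as zero. On top of that, vectorizing a float accumulation changes the order of the additions, and float addition is not associative, so the compiler needs permission to reorder. A minimal sketch of the second point follows; `sum_left` and `sum_right` are names invented for illustration, and the `volatile` temporaries are only there to force each intermediate to be rounded to float.

```c
/* Add three floats as (a + b) + c. */
float sum_left(float a, float b, float c)
{
    volatile float t = a + b;   /* force rounding to single precision */
    return t + c;
}

/* Add the same three floats as a + (b + c). */
float sum_right(float a, float b, float c)
{
    volatile float t = b + c;   /* force rounding to single precision */
    return a + t;
}
```

With a = 1e8f, b = -1e8f, c = 1.0f, `sum_left` gives 1.0f while `sum_right` gives 0.0f, because 1.0f is lost when it is added to -1e8f first. A strictly IEEE-conforming build has to keep the source order, which rules out splitting the sum across vector lanes.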
+
Comparing the builds again, option by option,
somehow the -ffast-math one has the longer disassembly for this part.. so why is it faster?
for (int i = 0; i < READ_SIZE; i += 2)
   11710:  f3f48c46  vdup.32  q12, d6[0]
float diff = data[i] - avg_0;
   11714:  f46c434d  vld2.16  {d20-d23}, [ip]!
   11718:  f2d00a34  vmovl.s16  q8, d20
   1171c:  e151000c  cmp  r1, ip
   11720:  f2d04a35  vmovl.s16  q10, d21
   11724:  f3fb0660  vcvt.f32.s32  q8, q8
   11728:  f3fb4664  vcvt.f32.s32  q10, q10
   1172c:  f2600de8  vsub.f32  q8, q8, q12
   11730:  f2644de8  vsub.f32  q10, q10, q12
std_0 += diff * diff;
   11734:  f3400df0  vmul.f32  q8, q8, q8
   11738:  f2440df4  vmla.f32  q8, q10, q10
   1173c:  f2422de0  vadd.f32  q9, q9, q8

for (int i = 0; i < READ_SIZE; i += 2)
   1177c:  e15e000c  cmp  lr, ip
float diff = data[i] - avg_0;
   11780:  ee072a90  vmov  s15, r2
   11784:  eef87ae7  vcvt.f32.s32  s15, s15
   11788:  ee777ac6  vsub.f32  s15, s15, s12
std_0 += diff * diff;
   1178c:  ee077aa7  vmla.f32  s14, s15, s15
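A likely reason the longer listing is the faster one: the first listing works on q registers, and each q register holds four floats, so one trip through that loop body handles several samples, while the second listing uses s registers and handles exactly one sample per trip. More instructions per iteration, but far fewer iterations. For reference, here is a sketch of the loop that both listings appear to come from. The function name, the `n` parameter (standing in for READ_SIZE), and the `int16_t` element type are my assumptions; the type is inferred from the vld2.16 / vmovl.s16 instructions.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of squared deviations from avg_0 over the even-indexed samples.
   Hypothetical reconstruction; only the loop body comes from the listing. */
float sum_sq_dev_even(const int16_t *data, size_t n, float avg_0)
{
    float std_0 = 0.0f;
    for (size_t i = 0; i < n; i += 2) {
        float diff = data[i] - avg_0;   /* int16 -> float, then subtract */
        std_0 += diff * diff;           /* multiply-accumulate */
    }
    return std_0;
}
```

To compare the two code paths, one could build this once with something like `-O3 -mfpu=neon` and once with `-O3 -mfpu=neon -ffast-math` and run `objdump -S` on both; the exact optimization level used for the listings above is not known, so treat those flags as a guess.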
Hmmmmm.. assembly is hard -_ㅠ
| Instruction | Section | Instruction set |
|---|---|---|
| VMLA | VMUL, VMLA, VMLS, VNMUL, VNMLA, and VNMLS | VFP |
| VMLA{L} | VMUL{L}, VMLA{L}, and VMLS{L} (by scalar) | NEON |
[Link: https://developer.arm.com/documentation/den0018/a/NEON-and-VFP-Instruction-Summary/List-of-all-NEON-and-VFP-instructions]