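I read that -O3 automatically adds -ftree-vectorize.
Anyway, since I only did the computation and never printed the result, the compiler treated the loop as unused code and removed it, so no vadd showed up and I was stuck on that for quite a while.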
$ g++ -O3 -mavx autovector.cpp -fopt-info-vec-all
autovector.cpp:22:22: missed: couldn't vectorize loop
autovector.cpp:25:19: missed: not vectorized: complicated access pattern.
autovector.cpp:23:21: missed: couldn't vectorize loop
autovector.cpp:25:14: missed: not vectorized: complicated access pattern.
autovector.cpp:16:23: optimized: loop vectorized using 32 byte vectors
autovector.cpp:10:5: note: vectorized 1 loops in function.
autovector.cpp:15:43: missed: statement clobbers memory: now = std::chrono::_V2::system_clock::now ();
autovector.cpp:27:77: missed: statement clobbers memory: D.189348 = std::chrono::_V2::system_clock::now ();
autovector.cpp:28:2: missed: statement clobbers memory: __assert_fail ("result[2] == ( 2.0f + 0.1335f)+( 1.50f*2.0f + 0.9383f)-(0.33f*2.0f+0.1172f)+3*(float)(noTests-1)", "autovector.cpp", 28, "int main()");
/usr/include/c++/9/ostream:570:18: missed: statement clobbers memory: std::__ostream_insert<char, std::char_traits<char> > (&cout, "CG> message -channel \"exercise results\" Time used: ", 51);
/usr/include/c++/9/ostream:221:29: missed: statement clobbers memory: _46 = std::basic_ostream<char>::_M_insert<double> (&cout, _42);
/usr/include/c++/9/ostream:570:18: missed: statement clobbers memory: std::__ostream_insert<char, std::char_traits<char> > (_46, "s, N * noTests=", 15);
autovector.cpp:29:112: missed: statement clobbers memory: _35 = std::basic_ostream<char>::operator<< (_46, 2000000000);
/usr/include/c++/9/ostream:113:13: missed: statement clobbers memory: std::endl<char, std::char_traits<char> > (_35);
/usr/include/c++/9/iostream:74:25: missed: statement clobbers memory: std::ios_base::Init::Init (&__ioinit);
/usr/include/c++/9/iostream:74:25: missed: statement clobbers memory: __cxa_atexit (__dt_comp , &__ioinit, &__dso_handle);
$ gcc -mcpu=native -march=native -Q --help=target
The following options are target specific:
  -mabi=                        aapcs-linux
  -mabort-on-noreturn           [disabled]
  -mandroid                     [disabled]
  -mapcs                        [disabled]
  -mapcs-frame                  [disabled]
  -mapcs-reentrant              [disabled]
  -mapcs-stack-check            [disabled]
  -march=                       armv7ve+vfpv3-d16
  -marm                         [enabled]
  -masm-syntax-unified          [disabled]
  -mbe32                        [enabled]
  -mbe8                         [disabled]
  -mbig-endian                  [disabled]
  -mbionic                      [disabled]
  -mbranch-cost=                -1
  -mcallee-super-interworking   [disabled]
  -mcaller-super-interworking   [disabled]
  -mcmse                        [disabled]
  -mcpu=                        cortex-a7
  -mfix-cortex-m3-ldrd          [disabled]
  -mflip-thumb                  [disabled]
  -mfloat-abi=                  hard
  -mfp16-format=                none
  -mfpu=                        vfp
  -mglibc                       [enabled]
  -mhard-float
  -mlittle-endian               [enabled]
  -mlong-calls                  [disabled]
  -mmusl                        [disabled]
  -mneon-for-64bits             [disabled]
  -mpic-data-is-text-relative   [enabled]
  -mpic-register=
  -mpoke-function-name          [disabled]
  -mprint-tune-info             [disabled]
  -mpure-code                   [disabled]
  -mrestrict-it                 [disabled]
  -msched-prolog                [enabled]
  -msingle-pic-base             [disabled]
  -mslow-flash-data             [disabled]
  -msoft-float
  -mstructure-size-boundary=    8
  -mthumb                       [disabled]
  -mthumb-interwork             [disabled]
  -mtls-dialect=                gnu
  -mtp=                         cp15
  -mtpcs-frame                  [disabled]
  -mtpcs-leaf-frame             [disabled]
  -mtune=
  -muclibc                      [disabled]
  -munaligned-access            [enabled]
  -mvectorize-with-neon-double  [disabled]
  -mvectorize-with-neon-quad    [enabled]
  -mword-relocations            [disabled]

  Known ARM ABIs (for use with the -mabi= option):
    aapcs aapcs-linux apcs-gnu atpcs iwmmxt

  Known __fp16 formats (for use with the -mfp16-format= option):
    alternative ieee none

  Known ARM FPUs (for use with the -mfpu= option):
    auto crypto-neon-fp-armv8 fp-armv8 fpv4-sp-d16 fpv5-d16 fpv5-sp-d16 neon
    neon-fp-armv8 neon-fp16 neon-vfpv3 neon-vfpv4 vfp vfp3 vfpv2 vfpv3
    vfpv3-d16 vfpv3-d16-fp16 vfpv3-fp16 vfpv3xd vfpv3xd-fp16 vfpv4 vfpv4-d16

  Valid arguments to -mtp=:
    auto cp15 soft

  Known floating-point ABIs (for use with the -mfloat-abi= option):
    hard soft softfp

  TLS dialect to use:
    gnu gnu2
[Link: https://www.raspberrypi.org/forums/viewtopic.php?t=155461]
[Link: https://www.codingame.com/playgrounds/283/sse-avx-vectorization/autovectorization]
+
$ cat neon.c
#include <stdio.h>

void main()
{
	int a[256];
	int b[256];
	int c[256];
	int i;

	for(i = 0; i < 256; i++)
	{
		a[i] = b[i] + c[i];
	}

	printf("%d %d %d\n", a[0], b[0], c[0]);
}
$ gcc -O3 neon.c -mfpu=neon
$ objdump -d a.out | grep v
   10320: e1a01000  mov r1, r0
   10328: e1a0300d  mov r3, sp
   1032c: f4610add  vld1.64 {d16-d17}, [r1 :64]!
   10330: f4622add  vld1.64 {d18-d19}, [r2 :64]!
   10334: f26008e2  vadd.i32 q8, q8, q9
   10338: f4430add  vst1.64 {d16-d17}, [r3 :64]!
   10360: e3a0b000  mov fp, #0
   10364: e3a0e000  mov lr, #0
   1036c: e1a0200d  mov r2, sp
   1043c: e3a03001  mov r3, #1
   10454: e1a07000  mov r7, r0
   1046c: e1a08001  mov r8, r1
   10470: e1a09002  mov r9, r2
   10480: e3a04000  mov r4, #0
   1048c: e1a02009  mov r2, r9
   10490: e1a01008  mov r1, r8
   10494: e1a00007  mov r0, r7
$ gcc neon.c -mfpu=neon
$ objdump -d a.out | grep v
   10318: e3a0b000  mov fp, #0
   1031c: e3a0e000  mov lr, #0
   10324: e1a0200d  mov r2, sp
   103f4: e3a03001  mov r3, #1
   10418: e3a03000  mov r3, #0
   10490: e1a00000  nop ; (mov r0, r0)
   104a4: e1a07000  mov r7, r0
   104bc: e1a08001  mov r8, r1
   104c0: e1a09002  mov r9, r2
   104d0: e3a04000  mov r4, #0
   104dc: e1a02009  mov r2, r9
   104e0: e1a01008  mov r1, r8
   104e4: e1a00007  mov r0, r7
Added -fopt-info-vec-all. Probably because of the -all suffix, it prints an enormous amount of output.
With just -fopt-info-vec, it cleanly reports that the loop was vectorized.
$ gcc neon.c -mfpu=neon -fopt-info-vec -O3
neon.c:10:2: note: loop vectorized