Posts in the 'embeded/ARM' category


Posted by 구차니

SVE(Scalable Vector Extension)

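At first glance it looks as if only the name changed, NEON up to armv7 and SVE from armv8 on. That is not quite right, though: armv8 (AArch64) still has NEON (Advanced SIMD), and SVE is a separate, optional extension added alongside it.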

Scalable Vector Extension (SVE) is a vector extension for the A64 instruction set of the Armv8-A architecture. Armv9-A builds on SVE with the SVE2 extension. Unlike other SIMD architectures, SVE and SVE2 do not define the size of the vector registers, but constrain it to a range of possible values, from a minimum of 128 bits up to a maximum of 2048 in 128-bit wide units. Therefore, any CPU vendor can implement the extension by choosing the vector register size that better suits the workloads the CPU is targeting. The design of SVE and SVE2 guarantees that the same program can run on different implementations of the instruction set architecture without the need to recompile the code.

[Link : https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions]

[Link : https://developer.arm.com/Architectures/SVE]
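
To get a feel for what "the same program runs on different vector sizes" means, here is a small sketch in plain C. It is not real SVE code (no intrinsics, no SVE hardware needed); it only models the idea. The names add_i32_vla and vl_bits are made up for this note, and vl_bits stands in for the vector length that a real SVE program would discover at run time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Portable model of a vector-length-agnostic loop.
 * vl_bits is a stand-in for the hardware vector length (an assumption of
 * this sketch; real SVE code queries it instead of taking it as a parameter).
 */
void add_i32_vla(int32_t *dst, const int32_t *a, const int32_t *b,
                 size_t n, unsigned vl_bits)
{
    /* SVE allows 128..2048 bits, in multiples of 128. */
    assert(vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0);

    size_t lanes = vl_bits / 32;    /* number of 32-bit lanes per vector */

    for (size_t i = 0; i < n; i += lanes) {
        for (size_t lane = 0; lane < lanes; lane++) {
            /* Predicate: lanes past the end of the array stay inactive. */
            if (i + lane < n)
                dst[i + lane] = a[i + lane] + b[i + lane];
        }
    }
}
```

Calling it with vl_bits set to 128 or to 2048 gives the same result for the same input; only the number of outer iterations changes. That is the property the quote above describes.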


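eMMC partition alignment

Aligning partitions to the erase block size is supposed to be a good idea, but how do I find out what that size is?

Is there no way to check it from Linux, without looking at the datasheet?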

Try to align to eMMC erasure block size. It usually equals 0.5, 1, 2, 4, 8 MiB depending on eMMC datasheet. If you find block size alignment too much memory wasting, then stick to the page size, generally found in the range of 4..16 KiB.

[Link : https://unix.stackexchange.com/questions/248939/how-to-achieve-optimal-alignment-for-emmc-partition]
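
As for checking from Linux: the kernel's MMC driver exposes erase_size and preferred_erase_size attributes under /sys/block/mmcblkN/device/, so reading those may be enough (the device name depends on the board, and I have not compared the values against a datasheet). Once a size is known, the alignment itself is just rounding up. A minimal sketch, where align_up and bytes_to_sectors are names made up for this note and the 4 MiB figure in the usage remark is only an example:

```c
#include <assert.h>
#include <stdint.h>

/* Round offset up to the next multiple of align (align must be a power of two). */
uint64_t align_up(uint64_t offset, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (offset + align - 1) & ~(align - 1);
}

/* Convert a byte offset into 512-byte sectors, the unit most partition tools use. */
uint64_t bytes_to_sectors(uint64_t bytes)
{
    return bytes / 512u;
}
```

For example, with a 4 MiB erase block, a partition that would otherwise start just past 1 MiB gets pushed to the 4 MiB boundary, which is sector 8192.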


arm asm rev

On ARMv6 and above, you can just use the rev instruction, but I assume that you're not allowed to do that for whatever reason.

[Link : https://stackoverflow.com/questions/2755171/arm-assembly-converting-endianness]

REV
Reverse the byte order in a word.

Syntax
REV{cond} Rd, Rn

where:

cond
is an optional condition code.

Rd
is the destination register.

Rn
is the register holding the operand.

[Link : https://developer.arm.com/documentation/dui0473/m/arm-and-thumb-instructions/rev]

unsigned int foo(unsigned int a)
{
    return __builtin_bswap32(a);
}

[Link : https://stackoverflow.com/questions/35133829/does-arm-gcc-have-a-builtin-function-for-the-assembly-rev-instruction]

[Link : https://teus.me/726]

It is a gcc built-in function, and it has a sibling(?) called __builtin_bswap16 as well.

Built-in Function: uint32_t __builtin_bswap32 (uint32_t x)
Similar to __builtin_bswap16, except the argument and return types are 32-bit.

[Link : https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html]

Uh.. if this does not get vectorized.. I would have to brute-force it with a plain for loop...?! That would be a bust?!?!?

Supported GCC non-vector built-in functions
Last updated: 2023-07-13

IBM® Open XL C/C++ for AIX® 17.1.1 supports the following GCC non-vector built-in functions.

[Link : https://www.ibm.com/docs/en/openxl-c-and-cpp-aix/17.1.0?topic=functions-supported-gcc-non-vector-built-in]
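
For reference, this is roughly what the "plain for loop" fallback looks like without the built-in. It is a sketch, not taken from any of the links above; bswap32_portable and bswap32_buffer are names made up here. Compilers often recognize the shift-and-mask pattern and may turn it into a single rev, but that depends on the compiler and the flags, so check the disassembly rather than assume it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit word using shifts and masks. */
uint32_t bswap32_portable(uint32_t x)
{
    return (x >> 24) |
           ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) |
           (x << 24);
}

/* Reverse the byte order of every word in a buffer, in place. */
void bswap32_buffer(uint32_t *buf, size_t count)
{
    assert(buf != NULL || count == 0);
    for (size_t i = 0; i < count; i++)
        buf[i] = bswap32_portable(buf[i]);
}
```

Swapping 0x11223344 this way gives 0x44332211, the same thing REV does to a register.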


cortex-a53

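What I mostly work with are the A9 (zynq z7020, imx6q) and the A53 (imx8mp),

and even the A53 is already about ten years old.. it was the first of the A5x series.

In benchmarks it was not as fast as I expected; was it all down to clock speed? 2.3 DMIPS/MHz, lower than the A9..

Raspberry Pi-class boards use the A7, and that one is also only usable because the process lets it clock high; at 1.9 DMIPS/MHz its performance is not great.

The A9 clocks low and runs hot(!), but at 2.5 DMIPS/MHz its baseline design is surprisingly good.

(Which of course does not mean that real performance matches the design figure.)

But the A9 supports out-of-order execution while the A53 is in-order!! What is going on!!

Put differently, does that mean you need an A57 or later, rather than an A53, to get something reasonably(?) usable?

[Link : https://en.wikipedia.org/wiki/List_of_ARM_processors]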
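
The figures above are per MHz, so the "clock speed" effect is just a multiplication. A tiny sketch to make that concrete; estimated_dmips is a name made up here, and the clock values in the remark below are illustrative assumptions, not measurements of any of the boards mentioned.

```c
#include <assert.h>

/* Estimated Dhrystone throughput = (DMIPS per MHz) * (core clock in MHz). */
double estimated_dmips(double dmips_per_mhz, double clock_mhz)
{
    assert(dmips_per_mhz > 0.0 && clock_mhz > 0.0);
    return dmips_per_mhz * clock_mhz;
}
```

With an assumed 800 MHz for an A9 and 1600 MHz for an A53, this gives about 2000 versus about 3680 DMIPS per core: the A53 comes out ahead overall even though it does slightly less work per clock.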


aarch64 vector register

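Compared with armv7, armv8 (aarch64) seems to integrate SIMD more tightly:

the separate SIMD mnemonics (such as vadd) are gone, and the same instruction simply takes vector or scalar registers as src and dst.

This is something I looked up a while ago because it seemed odd, and then just left sitting.

The add instruction stays the same; it is told to treat the vector register v0 as 4s, that is four 32-bit elements (probably signed int? for a plain add the signedness makes no difference),

and to add all of them in one go.

The operand order should be dst, src1, src2, so read it as v0.4s = v0.4s + v1.4s.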

add v0.4s, v0.4s, v1.4s
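
Written out in plain C, the lane-wise meaning of that instruction looks like the sketch below. It only models the arithmetic (no intrinsics); add_4s is a name made up for this note, and unsigned lanes are used so that the wrap-around on overflow is well defined in C.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Lane-wise model of: add vd.4s, vn.4s, vm.4s
 * Each 128-bit register is viewed as four independent 32-bit lanes,
 * and each lane wraps modulo 2^32.
 */
void add_4s(uint32_t vd[4], const uint32_t vn[4], const uint32_t vm[4])
{
    assert(vd && vn && vm);
    for (int lane = 0; lane < 4; lane++)
        vd[lane] = vn[lane] + vm[lane];
}
```

Passing the same array as vd and vn mirrors the instruction above, where v0 is both destination and first source.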

2021.06.30 - [embeded/raspberry pi] - aarch, armv8 asimd build (neon)

scalar

The ordinary(?) Q / D / S / H / B views (128 / 64 / 32 / 16 / 8 bits)

[Link : https://developer.arm.com/documentation/den0024/a/ARMv8-Registers/NEON-and-floating-point-registers/Scalar-register-sizes]

vector

D is presumably a 64-bit (8-byte) element, the size of a double,

and the register can operate on even those two at a time..

[Link : https://developer.arm.com/documentation/den0024/a/ARMv8-Registers/NEON-and-floating-point-registers/Vector-register-sizes]


arm vsub operator

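In the program I wrote last time I tried enabling acceleration through VFP, and out of curiosity I am working backwards to see which instructions it actually ended up using.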

   111a0: f35668e8 vsub.i16 q11, q11, q12
   1120c: f35318a1 vsub.i16 d17, d19, d17
   1172c: f2600de8 vsub.f32 q8, q8, q12
   11730: f2644de8 vsub.f32 q10, q10, q12
   11784: ee377a46 vsub.f32 s14, s14, s12
   117a8: ee755ac6 vsub.f32 s11, s11, s12
   117c0: ee744ac6 vsub.f32 s9, s9, s12
   117d8: ee355a46 vsub.f32 s10, s10, s12
   117f0: ee755ac6 vsub.f32 s11, s11, s12
   1180c: ee744ac6 vsub.f32 s9, s9, s12
   11818: ee355a46 vsub.f32 s10, s10, s12
   11824: ee356ac6 vsub.f32 s12, s11, s12
   11844: f2600de8 vsub.f32 q8, q8, q12
   11848: f2644de8 vsub.f32 q10, q10, q12
   118a8: ee344a67 vsub.f32 s8, s8, s15
   118d4: ee744ae7 vsub.f32 s9, s9, s15
   118ec: ee355a67 vsub.f32 s10, s10, s15
   11908: ee755ae7 vsub.f32 s11, s11, s15
   11918: ee344a67 vsub.f32 s8, s8, s15
   11924: ee744ae7 vsub.f32 s9, s9, s15
   11930: ee355a67 vsub.f32 s10, s10, s15
   1193c: ee757ae7 vsub.f32 s15, s11, s15

Disassembling gives the above. Setting aside the ones like vsub.i16 that look like they would be NEON anyway, only vsub.f32 is left.

I am trying to find out whether vsub.f32 belongs to NEON or to VFP.

VSUB (floating-point)
Floating-point subtract.
This instruction can be scalar, vector, or mixed, but VFP vector mode and mixed mode are deprecated.

[Link : https://developer.arm.com/documentation/dui0489/i/neon-and-vfp-programming/vsub--floating-point-]

Instruction   Section                       Instruction set
V{Q}SUB       V{Q}SUB, VSUBL and VSUBW      NEON
VSUB          VSUB                          VFP

[Link : https://developer.arm.com/documentation/den0018/a/NEON-and-VFP-Instruction-Summary/List-of-all-NEON-and-VFP-instructions]
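
The mnemonic alone does not settle it, because vsub.f32 exists in both instruction sets. The register type and the encoding do: in the listing, the q-register forms start with f2.. and the s-register forms start with ee... Below is a rough classifier for ARM-state (A32) instruction words, based on my reading of the ARMv7-A encoding tables. It only covers the two data-processing classes; loads/stores and register transfers fall into "other". classify_a32 and the enum names are made up here.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { ENC_OTHER, ENC_NEON_DP, ENC_VFP_DP } enc_class;

/* Classify a 32-bit ARM-state (A32) instruction word. */
enc_class classify_a32(uint32_t insn)
{
    /* Advanced SIMD (NEON) data-processing: bits[31:25] == 0b1111001. */
    if ((insn >> 25) == 0x79u)
        return ENC_NEON_DP;

    /* VFP data-processing: cond != 0b1111, bits[27:24] == 0b1110,
     * coprocessor field bits[11:9] == 0b101, bit[4] == 0. */
    if ((insn >> 28) != 0xFu &&
        ((insn >> 24) & 0xFu) == 0xEu &&
        ((insn >> 9) & 0x7u) == 0x5u &&
        ((insn >> 4) & 0x1u) == 0u)
        return ENC_VFP_DP;

    return ENC_OTHER;
}

/* Human-readable name for a class, handy when printing a listing. */
const char *enc_class_name(enc_class c)
{
    static const char *const names[] = { "other", "NEON", "VFP" };
    assert((unsigned)c <= (unsigned)ENC_VFP_DP);
    return names[c];
}
```

Fed with words from the listing, f2600de8 (vsub.f32 q8, q8, q12) comes out as NEON and ee377a46 (vsub.f32 s14, s14, s12) as VFP, so the binary contains both kinds.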

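The Cortex-A9 NEON MPE implements the Advanced SIMD and VFP extensions, but

it says the following IEEE754 operations are not provided in hardware.

Is it because of "round floating-point number to nearest integer-valued in floating-point number"

that -ffast-math has to be turned on in gcc before the VFP instructions get enabled?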

IEEE754 standard compliance
The IEEE754 standard provides a number of implementation choices. The ARM Architecture Reference Manual describes the choices that apply to the Advanced SIMD and VFPv3 architectures.

The Cortex-A9 NEON MPE implements the ARMv7 Advanced SIMD and VFP extensions. It does not provide hardware support for the following IEEE754 operations:

remainder
round floating-point number to nearest integer-valued in floating-point number
binary-to-decimal conversion
decimal-to-binary conversion
direct comparison of single-precision and double-precision values
any extended-precision operations.

[Link : https://developer.arm.com/documentation/ddi0409/e/programmers-model/ieee754-standard-compliance]

+

Comparing again with the different options,

the disassembly of the -ffast-math build is somehow longer.. so why is it faster?

for (int i = 0; i < READ_SIZE; i += 2)
   11710: f3f48c46 vdup.32 q12, d6[0]
float diff = data[i] - avg_0;
   11714: f46c434d vld2.16 {d20-d23}, [ip]!
   11718: f2d00a34 vmovl.s16 q8, d20
   1171c: e151000c cmp r1, ip
   11720: f2d04a35 vmovl.s16 q10, d21
   11724: f3fb0660 vcvt.f32.s32 q8, q8
   11728: f3fb4664 vcvt.f32.s32 q10, q10
   1172c: f2600de8 vsub.f32 q8, q8, q12
   11730: f2644de8 vsub.f32 q10, q10, q12
std_0 += diff * diff;
   11734: f3400df0 vmul.f32 q8, q8, q8
   11738: f2440df4 vmla.f32 q8, q10, q10
   1173c: f2422de0 vadd.f32 q9, q9, q8

for (int i = 0; i < READ_SIZE; i += 2)
   1177c: e15e000c cmp lr, ip
float diff = data[i] - avg_0;
   11780: ee072a90 vmov s15, r2
   11784: eef87ae7 vcvt.f32.s32 s15, s15
   11788: ee777ac6 vsub.f32 s15, s15, s12
std_0 += diff * diff;
   1178c: ee077aa7 vmla.f32 s14, s15, s15

Hmmmmmm.. assembly is hard -_ㅠ
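
A likely explanation, as far as I understand the GCC ARM options documentation: with -mfpu=neon, GCC's auto-vectorizer does not generate floating-point NEON operations unless -funsafe-math-optimizations (which -ffast-math implies) is given, because NEON treats denormals as zero and so is not fully IEEE 754 compliant. The q-register version is longer, but each pass handles several samples at once. The sketch below reconstructs the loop from the source lines shown in the disassembly and writes the reassociated form out by hand. The function names, the short element type and the four-lane split are assumptions for illustration, not the compiler's exact output.

```c
#include <assert.h>
#include <stddef.h>

/* Strict left-to-right accumulation over every second sample (i += 2). */
float sum_sq_diff_scalar(const short *data, size_t n, float avg)
{
    assert(data != NULL || n == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < n; i += 2) {
        float diff = (float)data[i] - avg;
        acc += diff * diff;
    }
    return acc;
}

/* Same sum, reassociated into four partial accumulators (one per 32-bit lane). */
float sum_sq_diff_4lane(const short *data, size_t n, float avg)
{
    assert(data != NULL || n == 0);
    float lane[4] = { 0.0f, 0.0f, 0.0f, 0.0f };
    size_t i = 0;

    /* Main loop: four strided samples per iteration, like one q register. */
    for (; i + 8 <= n; i += 8) {
        for (size_t k = 0; k < 4; k++) {
            float diff = (float)data[i + 2 * k] - avg;
            lane[k] += diff * diff;
        }
    }

    /* Horizontal add of the lanes, then a scalar tail for the leftovers. */
    float acc = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    for (; i < n; i += 2) {
        float diff = (float)data[i] - avg;
        acc += diff * diff;
    }
    return acc;
}
```

The two functions add the same terms in a different order. For general float data the results can differ in the last bits, which is exactly the kind of change the compiler is not allowed to make on its own without the relaxed-math flags.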

Instruction   Section                                          Instruction set
VMLA          VMUL, VMLA, VMLS, VNMUL, VNMLA, and VNMLS        VFP
VMLA{L}       VMUL{L}, VMLA{L}, and VMLS{L} (by scalar)        NEON

[Link : https://developer.arm.com/documentation/den0018/a/NEON-and-VFP-Instruction-Summary/List-of-all-NEON-and-VFP-instructions]


ARM NEON SLP

I was wondering what SLP is, but the gcc/gnu documentation itself does not spell it out.

Does it mean parallelizing (parallelism) across data wider than a single word (superword level)?

Superword-Level Parallelism (SLP) vectorizer

[Link : https://rcor.me/papers/cgo19snslp.pdf]

[Link : https://llvm.org/docs/Vectorizers.html#slp-vectorizer]

Example 20: Basic block SLP with multiple types, loads with different offsets, misaligned load, and not-affine accesses:

void foo (int * __restrict__ dst, short * __restrict__ src,
          int h, int stride, short A, short B)
{
  int i;
  for (i = 0; i < h; i++)
    {
      dst[0] += A*src[0] + B*src[1];
      dst[1] += A*src[1] + B*src[2];
      dst[2] += A*src[2] + B*src[3];
      dst[3] += A*src[3] + B*src[4];
      dst[4] += A*src[4] + B*src[5];
      dst[5] += A*src[5] + B*src[6];
      dst[6] += A*src[6] + B*src[7];
      dst[7] += A*src[7] + B*src[8];
      dst += stride;
      src += stride;
    }
}

[Link : https://gcc.gnu.org/projects/tree-ssa/vectorization.html#slp]
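
As I read the LLVM page, the idea is to take several similar, independent scalar statements in straight-line code and merge them into one vector operation, rather than vectorizing across loop iterations. A much smaller sketch than Example 20, with the "packed" version written by hand to show the shape the pass is aiming for. axpb_scalar and axpb_packed are names made up here, and the arrays only model vector lanes.

```c
#include <assert.h>

/* Scalar form: four isomorphic statements on adjacent elements, no loop. */
void axpb_scalar(int *dst, const int *src, int a, int b)
{
    assert(dst && src);
    dst[0] = a * src[0] + b;
    dst[1] = a * src[1] + b;
    dst[2] = a * src[2] + b;
    dst[3] = a * src[3] + b;
}

/* Packed form: the same work as one 4-lane multiply and one 4-lane add. */
void axpb_packed(int *dst, const int *src, int a, int b)
{
    assert(dst && src);
    int va[4] = { a, a, a, a };     /* broadcast a into every lane */
    int vb[4] = { b, b, b, b };     /* broadcast b into every lane */
    int prod[4];

    for (int lane = 0; lane < 4; lane++)    /* stands for one vector multiply */
        prod[lane] = va[lane] * src[lane];
    for (int lane = 0; lane < 4; lane++)    /* stands for one vector add */
        dst[lane] = prod[lane] + vb[lane];
}
```

Both functions compute the same four results; the second just groups the work the way a vector unit would execute it.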


cortex a9 ptm

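PTM stands for Program Trace Macrocell and, as the name says, it traces the program flow,

so it does not seem to offer a way to trace data only.

In environments that do not support data trace, such as the Cortex-A9 PTM, the ITM offers a way to try data tracing, albeit in a limited form.

[Link : https://www.epnc.co.kr/news/articleView.html?idxno=45715]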

PTM interface
The Cortex-A9 processor optionally implements a Program Trace Macrocell (PTM) interface, that is compliant with the Program Flow Trace (PFT) instruction-only architecture protocol. Waypoints, changes in the program flow or events such as changes in context ID, are output to enable the trace to be correlated with the code image.

[Link : https://developer.arm.com/documentation/100511/0401/functional-description/about-the-functions/ptm-interface]


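openOCD and JTAG

It suddenly struck me.. JTAG is a standard, so why is it different for every vendor? While searching on that,

I am not sure whether it can really be used as a common tool, but

seeing that openOCD can reportedly program an FPGA even without Vivado and the like,

it seems fair to say that openOCD supports all sorts of JTAG adapters.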

We decided to support both urJTAG and the well known OpenOCD out of the box.

Supported devices
The list of supported devices is constantly being expanded and here is a small selection of the supported devices.

ARM7TDMI » fx LPC2148, AT91SAM7
ARM720T » fx LH79520, EP7312
ARM9TDMI
ARM920T » fx S3C2410, S3C2440
ARM922T
ARM926EJS » fx S3C2412, STN8811, STN8815
ARM966E » fx STR91XF
ARM11 » fx S3C6400, OMAP2420, MSM7200
ARM1136
ARM1156
ARM1176
CORTEX-M1 » fx LPC11 series
CORTEX-M3 » fx LM3S series, STM32F1/F2/F3 series, LPC17 series
CORTEX-M4 » fx STM32F4
CORTEX-A8 » fx OMAP3530 BeagleBoard
CORTEX-A8 » fx DM3730 BeagleBoard-xM
CORTEX-A9 » fx OMAP4430 PandaBoard
XSCALE » fx PXA255, PXA270, IXP42X
MARVEL » fx FEROCEON CPU CORE
FPGA » fx Xilinx Spartan, Virtex or Altera Cyclone, Stratix
CPLD » fx Xilinx CoolRunner or Altera MAX

Technical details
The board itself is 5 by 5 cm with a USB B connector at one side and JTAG and IO headers at another.

The JTAG port supports a wide range of voltages, as it is connected to a couple of voltage translators (74LVC2T45). This makes the uniJTAG even more universal, as you can use it together with any JTAG'able device, running at 1.2V to 5.5V.

The IO header can be used as 8 single controllable IO's, or it can be used as a full standard UART port. With a jumper you can choose whether the IO's should be at a 5V level, or a 3.3V level.

The board has also an onboard EEProm for storing the FT2232 configurations, so the uniJTAG is a plug and play solution, and it automatically enumerates as a JTAG and a Serial device.

[Link : http://www.tkjelectronics.dk/?p=products&product=unijtag]

Using a different JTAG adapter..

$ lsusb
Bus 001 Device 006: ID 0403:6014 Future Technology Devices International, Ltd FT232H Single HS USB-UART/FIFO IC
Bus 001 Device 028: ID 09fb:6001 Altera Blaster

$ sudo /opt/openocd/bin/openocd -d \
             -f /opt/openocd/share/openocd/scripts/interface/ftdi/digilent_jtag_smt2.cfg \
             -f /opt/openocd/share/openocd/scripts/cpld/xilinx-xc6s.cfg \
             -c "adapter_khz 1000"

$ sudo /opt/openocd/bin/openocd \
             -f /opt/openocd/share/openocd/scripts/interface/altera-usb-blaster.cfg \
             -f /opt/openocd/share/openocd/scripts/cpld/xilinx-xc6s.cfg \
             -c "adapter_khz 1000; init; xc6s_program xc6s.tap; pld load 0 ./ise/top.bit ; exit"

[Link : https://tomverbeure.github.io/2019/09/15/Loading-a-Spartan-6-bitstream-with-openocd.html]

