i.MX 8M PLUS tensorflow NPU

2021. 10. 13. 14:38


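I downloaded LF_v5.10.52-2.1.0_images_IMX8MPEVK.zip and wrote the image to an SD card.

After booting, the paths turned out to be a bit different from what I expected.

It ships TensorFlow Lite 2.5.0... so is that version usable?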

# cd /usr/bin/tensorflow-lite-2.5.0/examples
# ./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite
STARTING!
Log parameter values verbosely: [0]
Graph: [mobilenet_v1_1.0_224_quant.tflite]
Use VXdelegate : [0]
Loaded model mobilenet_v1_1.0_224_quant.tflite
The input model file size (MB): 4.27635
Initialized session in 1.807ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=4 first=167959 curr=162606 min=162606 max=167959 avg=164253 std=2159

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=50 first=162727 curr=163003 min=162308 max=163308 avg=162758 std=190

Inference timings in us: Init: 1807, First inference: 167959, Warmup (avg): 164253, Inference (avg): 162758
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=2.51562 overall=8.64062

# ./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --use_nnapi=true
STARTING!
Log parameter values verbosely: [0]
Graph: [mobilenet_v1_1.0_224_quant.tflite]
Use NNAPI: [1]
NNAPI accelerators available: [vsi-npu]
Use VXdelegate : [0]
Loaded model mobilenet_v1_1.0_224_quant.tflite
INFO: Created TensorFlow Lite delegate for NNAPI.
Explicitly applied NNAPI delegate, and the model graph will be completely executed by the delegate.
The input model file size (MB): 4.27635
Initialized session in 4.183ms.
Running benchmark for at least 1 iterations and at least 0.5 seconds but terminate if exceeding 150 seconds.
count=1 curr=4649626

Running benchmark for at least 50 iterations and at least 1 seconds but terminate if exceeding 150 seconds.
count=360 first=2665 curr=2733 min=2632 max=2783 avg=2715.67 std=16

Inference timings in us: Init: 4183, First inference: 4649626, Warmup (avg): 4.64963e+06, Inference (avg): 2715.67
Note: as the benchmark tool itself affects memory footprint, the following is only APPROXIMATE to the actual memory footprint of the model at runtime. Take the information at your discretion.
Peak memory footprint (MB): init=2.59766 overall=30.1836
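
To put the two runs side by side, the "Inference (avg)" figures can be pulled out of the summary lines and divided. This is only a rough sketch: the helper name inference_avg_us is made up for illustration, and the two summary lines are copied from the output above rather than captured live.

```shell
# Extract the "Inference (avg)" value (in microseconds) from a
# benchmark_model summary line read on stdin.
inference_avg_us() {
    sed -n 's/.*Inference (avg): \([0-9.e+]*\).*/\1/p'
}

# Summary lines copied from the CPU run and the --use_nnapi=true run above.
cpu_line='Inference timings in us: Init: 1807, First inference: 167959, Warmup (avg): 164253, Inference (avg): 162758'
npu_line='Inference timings in us: Init: 4183, First inference: 4649626, Warmup (avg): 4.64963e+06, Inference (avg): 2715.67'

cpu_us=$(printf '%s\n' "$cpu_line" | inference_avg_us)
npu_us=$(printf '%s\n' "$npu_line" | inference_avg_us)

# Ratio of steady-state inference times (CPU / NPU), one decimal place.
speedup=$(awk -v c="$cpu_us" -v n="$npu_us" 'BEGIN { printf "%.1f", c / n }')
echo "CPU ${cpu_us} us, NPU ${npu_us} us, speedup ${speedup}x"
```

The ratio only covers steady-state inference. The first NNAPI inference (4649626 us) is excluded from it.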

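Trying label_image, I'm not sure what the warm-up actually does, but the invoke() call itself is short.

Something before it seems to take a long time, though: the whole run takes more than 4 seconds longer than running on the CPU alone.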

# time ./label_image -w 1
INFO: Loaded model ./mobilenet_v1_1.0_224_quant.tflite
INFO: resolved reporter
INFO: invoked
INFO: average time: 43.865 ms
INFO: 0.764706: 653 military uniform
INFO: 0.121569: 907 Windsor tie
INFO: 0.0156863: 458 bow tie
INFO: 0.0117647: 466 bulletproof vest
INFO: 0.00784314: 835 suit

real    0m0.142s
user    0m0.385s
sys     0m0.020s

# time ./label_image -w 1 -a 1
INFO: Loaded model ./mobilenet_v1_1.0_224_quant.tflite
INFO: resolved reporter
INFO: Created TensorFlow Lite delegate for NNAPI.
INFO: Use NNAPI acceleration.
INFO: Applied NNAPI delegate.
INFO: invoked
INFO: average time: 2.797 ms
INFO: 0.768627: 653 military uniform
INFO: 0.105882: 907 Windsor tie
INFO: 0.0196078: 458 bow tie
INFO: 0.0117647: 466 bulletproof vest
INFO: 0.00784314: 835 suit

real    0m4.748s
user    0m4.648s
sys     0m0.092s
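
The gap between the two runs can be read straight off the "real" lines. The sketch below converts the time-style durations to seconds and subtracts them. The helper name to_seconds is hypothetical, and the two durations are the ones printed above.

```shell
# Convert a `time`-style duration such as "0m4.748s" to seconds.
to_seconds() {
    printf '%s\n' "$1" | awk -F'[ms]' '{ printf "%.3f", $1 * 60 + $2 }'
}

cpu_real=$(to_seconds 0m0.142s)   # label_image -w 1
npu_real=$(to_seconds 0m4.748s)   # label_image -w 1 -a 1

# Extra wall-clock time spent when the NNAPI delegate is enabled.
extra=$(awk -v a="$npu_real" -v b="$cpu_real" 'BEGIN { printf "%.3f", a - b }')
echo "NNAPI run took ${extra} s longer"
```

The difference is close to the 4649626 us "First inference" that benchmark_model reported with --use_nnapi=true, which suggests the extra time is spent in the first invoke rather than in model loading.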

What follows looks like it comes from an older document written for version 2.1.0.

$ cd /usr/bin/tensorflow-lite-2.1.0/examples
$ ./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite
$ ./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --use_nnapi=true

./lbl_img -i grace_hopper.bmp -l labels.txt -w 1
./lbl_img -i grace_hopper.bmp -l labels.txt -w 1 -a 1

[Link: https://www.mouser.com/pdfDocs/AN12964.pdf]

Damn, those options don't match what the help output shows at all?!

# ./label_image --help
ERROR: usage: ./label_image <flags>
Flags:
        --num_threads=1                 int32   optional        number of threads used for inference on CPU.
        --max_delegated_partitions=0    int32   optional        Max number of partitions to be delegated.
        --min_nodes_per_partition=0     int32   optional        The minimal number of TFLite graph nodes of a partition that has to be reached for it to be delegated.A negative value or 0 means to use the default choice of each delegate.
        --num_threads=1                 int32   optional        number of threads used for inference on CPU.
        --max_delegated_partitions=0    int32   optional        Max number of partitions to be delegated.
        --min_nodes_per_partition=0     int32   optional        The minimal number of TFLite graph nodes of a partition that has to be reached for it to be delegated.A negative value or 0 means to use the default choice of each delegate.
        --use_xnnpack=false             bool    optional        use XNNPack
        --use_nnapi=false               bool    optional        use nnapi delegate api
        --nnapi_execution_preference=   string  optional        execution preference for nnapi delegate. Should be one of the following: fast_single_answer, sustained_speed, low_power, undefined
        --nnapi_execution_priority=     string  optional        The model execution priority in nnapi, and it should be one of the following: default, low, medium and high. This requires Android 11+.
        --nnapi_accelerator_name=       string  optional        the name of the nnapi accelerator to use (requires Android Q+)
        --disable_nnapi_cpu=true        bool    optional        Disable the NNAPI CPU device
        --nnapi_allow_fp16=false        bool    optional        Allow fp32 computation to be run in fp16

    static struct option long_options[] = {
        {"accelerated", required_argument, nullptr, 'a'},
        {"allow_fp16", required_argument, nullptr, 'f'},
        {"count", required_argument, nullptr, 'c'},
        {"verbose", required_argument, nullptr, 'v'},
        {"image", required_argument, nullptr, 'i'},
        {"labels", required_argument, nullptr, 'l'},
        {"tflite_model", required_argument, nullptr, 'm'},
        {"profiling", required_argument, nullptr, 'p'},
        {"threads", required_argument, nullptr, 't'},
        {"input_mean", required_argument, nullptr, 'b'},
        {"input_std", required_argument, nullptr, 's'},
        {"num_results", required_argument, nullptr, 'r'},
        {"max_profiling_buffer_entries", required_argument, nullptr, 'e'},
        {"warmup_runs", required_argument, nullptr, 'w'},
        {"gl_backend", required_argument, nullptr, 'g'},
        {"hexagon_delegate", required_argument, nullptr, 'j'},
        {"xnnpack_delegate", required_argument, nullptr, 'x'},
        {nullptr, 0, nullptr, 0}};

[Link: https://github.com/tensorflow/tensorflow/blob/v2.5.0/tensorflow/lite/examples/label_image/label_image.cc]

So... how was the library built to make that possible?

# ldd label_image
        linux-vdso.so.1 (0x0000ffffa0989000)
        libtensorflow-lite.so.2.5.0 => /usr/lib/libtensorflow-lite.so.2.5.0 (0x0000ffffa05ab000)
        libm.so.6 => /lib/libm.so.6 (0x0000ffffa0501000)
        libstdc++.so.6 => /usr/lib/libstdc++.so.6 (0x0000ffffa032a000)
        libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x0000ffffa0305000)
        libc.so.6 => /lib/libc.so.6 (0x0000ffffa0190000)
        /lib/ld-linux-aarch64.so.1 (0x0000ffffa0957000)
        libtim-vx.so => /usr/lib/libtim-vx.so (0x0000ffffa00c7000)
        libdl.so.2 => /lib/libdl.so.2 (0x0000ffffa00b1000)
        libpthread.so.0 => /lib/libpthread.so.0 (0x0000ffffa0082000)
        librt.so.1 => /lib/librt.so.1 (0x0000ffffa006a000)
        libovxlib.so.1.1.0 => /usr/lib/libovxlib.so.1.1.0 (0x0000ffff9fcd1000)
        libOpenVX.so.1 => /usr/lib/libOpenVX.so.1 (0x0000ffff9fa7e000)
        libVSC.so => /usr/lib/libVSC.so (0x0000ffff9eae2000)
        libGAL.so => /usr/lib/libGAL.so (0x0000ffff9e91b000)
        libArchModelSw.so => /usr/lib/libArchModelSw.so (0x0000ffff9e8f3000)
        libNNArchPerf.so => /usr/lib/libNNArchPerf.so (0x0000ffff9e8d0000)
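
One way to narrow the listing down is to keep only the lines for the OpenVX-related libraries. This is a small sketch: the filter name npu_libs is made up, the library names are taken from the ldd output above, and treating exactly these five as the accelerator stack is my assumption.

```shell
# Keep only the ldd lines that mention the OpenVX-related libraries
# seen in the listing above (the choice of names is an assumption).
npu_libs() {
    grep -E 'libtim-vx|libovxlib|libOpenVX|libGAL|libVSC'
}

# On the board this would be: ldd ./label_image | npu_libs
# Here three lines copied from the output above stand in for it.
sample='        libm.so.6 => /lib/libm.so.6 (0x0000ffffa0501000)
        libtim-vx.so => /usr/lib/libtim-vx.so (0x0000ffffa00c7000)
        libOpenVX.so.1 => /usr/lib/libOpenVX.so.1 (0x0000ffff9fa7e000)'
printf '%s\n' "$sample" | npu_libs
```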

The PRELU operator itself seems to be supported, so is the output size mismatch the cause?

INFO: Use NNAPI acceleration.
WARNING: Operator RESIZE_BILINEAR (v3) refused by NNAPI delegate: Operator refused due performance reasons.
INFO: Applied NNAPI delegate.
W [vsi_nn_op_eltwise_setup:178]Output size mismatch, expect 917504, but got 50176
E [setup_node:448]Setup node[52] PRELU fail
W [vsi_nn_op_eltwise_setup:178]Output size mismatch, expect 917504, but got 50176
E [setup_node:448]Setup node[52] PRELU fail
ERROR: NN API returned error ANEURALNETWORKS_BAD_DATA at line 4151 while running computation.

ERROR: Node number 56 (TfLiteNnapiDelegate) failed to invoke.

ERROR: Failed to invoke tflite!

[Link: https://www.nxp.com/docs/en/user-guide/IMX-MACHINE-LEARNING-UG.pdf]

In the code, warm-up is just a single invoke() call, and that call takes about 4649 ms.

Running once without any warm-up takes roughly the same amount of time.

root@imx8mpevk:/usr/bin/tensorflow-lite-2.5.0/examples# time ./label_image -a 1 -w 0 -p 1 -c 1
INFO: Loaded model ./mobilenet_v1_1.0_224_quant.tflite
INFO: resolved reporter
INFO: Created TensorFlow Lite delegate for NNAPI.
INFO: Use NNAPI acceleration.
INFO: Applied NNAPI delegate.
INFO: invoked
INFO: average time: 4649.78 ms
INFO: 0.768627: 653 military uniform
INFO: 0.105882: 907 Windsor tie
INFO: 0.0196078: 458 bow tie
INFO: 0.0117647: 466 bulletproof vest
INFO: 0.00784314: 835 suit

real    0m4.757s
user    0m4.655s
sys     0m0.096s
root@imx8mpevk:/usr/bin/tensorflow-lite-2.5.0/examples# time ./label_image -a 1 -w 0 -p 1 -c 4
INFO: Loaded model ./mobilenet_v1_1.0_224_quant.tflite
INFO: resolved reporter
INFO: Created TensorFlow Lite delegate for NNAPI.
INFO: Use NNAPI acceleration.
INFO: Applied NNAPI delegate.
INFO: invoked
INFO: average time: 1164.36 ms
INFO: 0.768627: 653 military uniform
INFO: 0.105882: 907 Windsor tie
INFO: 0.0196078: 458 bow tie
INFO: 0.0117647: 466 bulletproof vest
INFO: 0.00784314: 835 suit

real    0m4.768s
user    0m4.663s
sys     0m0.092s
root@imx8mpevk:/usr/bin/tensorflow-lite-2.5.0/examples# time ./label_image -a 1 -w 0 -p 1 -c 10000
INFO: Loaded model ./mobilenet_v1_1.0_224_quant.tflite
INFO: resolved reporter
INFO: Created TensorFlow Lite delegate for NNAPI.
INFO: Use NNAPI acceleration.
INFO: Applied NNAPI delegate.
INFO: invoked
INFO: average time: 3.30189 ms
INFO: 0.768627: 653 military uniform
INFO: 0.105882: 907 Windsor tie
INFO: 0.0196078: 458 bow tie
INFO: 0.0117647: 466 bulletproof vest
INFO: 0.00784314: 835 suit

real    0m33.128s
user    0m7.516s
sys     0m1.590s
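
The three averages fit a simple model: one slow first invoke plus fast ones after it, all averaged together. The sketch below backs out the per-invoke time for the remaining runs. It assumes the first invoke costs the same 4649.78 ms in every run (the -c 1 figure), which was not measured separately, and the helper name steady_ms is made up.

```shell
# Estimate the steady-state time per invoke (ms) from a reported average.
#   $1 = loop count (-c), $2 = reported average in ms,
#   $3 = assumed duration of the first invoke in ms
steady_ms() {
    awk -v n="$1" -v avg="$2" -v first="$3" \
        'BEGIN { printf "%.2f", (n * avg - first) / (n - 1) }'
}

first_ms=4649.78   # average reported by the -c 1 run above

echo "-c 4     : $(steady_ms 4 1164.36 "$first_ms") ms per later invoke"
echo "-c 10000 : $(steady_ms 10000 3.30189 "$first_ms") ms per later invoke"
```

Both estimates land within a fraction of a millisecond of the 2.7 ms average that benchmark_model reported, so the large averages at small -c values are just the first invoke being spread over few runs.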

It looks like inference goes through OpenVX, and according to the user guide the result of the first graph processing can be stored on disk.

11.3 Hardware accelerators warmup time
For both Arm NN and TensorFlow Lite, the initial execution of model inference takes longer time, because of the model graph initialization needed by the GPU/NPU hardware accelerator. The initialization phase is known as warmup. This time duration can be decreased for subsequent application that runs by storing on disk the information resulted from the initial OpenVX graph processing. The following environment variables should be used for this purpose:
VIV_VX_ENABLE_CACHE_GRAPH_BINARY: flag to enable/disable OpenVX graph caching
VIV_VX_CACHE_BINARY_GRAPH_DIR: set location of the cached information on disk
For example, set these variables on the console in this way:
export VIV_VX_ENABLE_CACHE_GRAPH_BINARY="1"
export VIV_VX_CACHE_BINARY_GRAPH_DIR=`pwd`
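
Instead of exporting the two variables for the whole shell, they can be scoped to a single command. The wrapper below is only a sketch: the function name run_with_graph_cache and the /tmp/vx-cache directory are hypothetical, and I have not measured here whether the second run actually gets faster on this image.

```shell
# Run a command with the OpenVX graph cache variables from the user guide
# set only for that command.
#   $1 = cache directory (created if missing), remaining args = command
run_with_graph_cache() {
    cache_dir=$1
    shift
    mkdir -p "$cache_dir" || return 1
    VIV_VX_ENABLE_CACHE_GRAPH_BINARY=1 \
    VIV_VX_CACHE_BINARY_GRAPH_DIR="$cache_dir" \
    "$@"
}

# Intended use on the board (not executed here); if caching works as the
# guide describes, the second run should skip most of the warm-up:
#   time run_with_graph_cache /tmp/vx-cache ./label_image -a 1 -w 0 -c 1
#   time run_with_graph_cache /tmp/vx-cache ./label_image -a 1 -w 0 -c 1
```

Scoping the variables this way keeps CPU-only runs in the same shell unaffected, which makes before/after timing comparisons easier.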

[Link: https://www.nxp.com/docs/en/user-guide/IMX-MACHINE-LEARNING-UG.pdf]


Posted by 구차니
