K230 RVV Performance Optimization Description#
Preface#
Overview#
This document mainly introduces the impact of RVV on model inference performance.
Audience#
This document is mainly intended for the following personnel:
Technical Support Engineers
Software Development Engineers
Revision History#
| Document Version | Modification Description | Modifier | Date |
|---|---|---|---|
| V1.0 | Initial Version | Yang Haoqi | 2023/08/04 |
1. Overview#
In recent years, the rapid development of the AI field has produced a steady stream of new neural network models and new operators. However, the iteration cycle of AI chips is much longer than that of AI models, so many of these new operators cannot be accelerated directly by the AI chip; some older operators are likewise unsuited to hardware acceleration. These operators therefore fall back to the CPU, which means CPU performance becomes a factor in the final performance of a deployed model. The RVV extension is an important means of improving RISC-V CPU performance. The K230 uses the Xuantie C908 dual-core processor, whose large core implements the RVV 1.0 extension and can significantly speed up operators that run on the CPU.
To perform model inference on the K230, a model in the `.kmodel` format is required. `.kmodel` is the format produced when nncase compiles `ONNX` and `TFLite` models, and it targets development boards produced by the company and its partners. nncase supports common neural network operators, but some operators cannot be accelerated by the K230's KPU and must be executed on the CPU.
2. RVV Application Scenarios#
Currently, the most widely researched and applied neural network architecture is the `Transformer`, whose model structure differs significantly from that of a `CNN`. Many AI chips designed around `CNN` workloads cannot fully accelerate the `Transformer`. Below are the operator execution statistics for the `Decoder` model of a `Transformer`, with and without RVV optimization enabled.
2.1 Without RVV Optimization#
| stackvm tensor op | count | time consumption (ms) | percentage (%) |
|---|---|---|---|
| softmax | 5 | 1749.61 | 88.6574 |
| where | 4 | 199.432 | 10.1058 |
| EXTCALL | 65 | 16.099 | 0.815779 |
| layer_norm | 7 | 5.81 | 0.294408 |
| gather | 2 | 0.393 | 0.0199144 |
| STLOCAL | 212 | 0.391 | 0.019813 |
| LDC_I4 | 241 | 0.388 | 0.019661 |
| reduce_arg | 1 | 0.336 | 0.017026 |
| reshape | 26 | 0.281 | 0.014239 |
| LDLOCAL | 149 | 0.26 | 0.0131749 |
| LDNULL | 106 | 0.166 | 0.00841166 |
| LDTENSOR | 29 | 0.103 | 0.00521929 |
| LEA_GP | 58 | 0.097 | 0.00491525 |
| LDDATATYPE | 29 | 0.07 | 0.00354709 |
| LDARG | 5 | 0.008 | 0.000405381 |
| RET | 1 | 0.004 | 0.000202691 |
| LDTUPLE | 1 | 0.003 | 0.000152018 |
| total | 941 | 1973.45 | 100 |
2.2 With RVV Optimization#
| stackvm tensor op | count | time consumption (ms) | percentage (%) |
|---|---|---|---|
| softmax | 5 | 25.722 | 55.6175 |
| EXTCALL | 65 | 16.179 | 34.9831 |
| layer_norm | 7 | 0.967 | 2.0909 |
| where | 4 | 0.912 | 1.97198 |
| gather | 2 | 0.39 | 0.84328 |
| LDC_I4 | 241 | 0.386 | 0.834631 |
| STLOCAL | 212 | 0.379 | 0.819495 |
| reduce_arg | 1 | 0.34 | 0.735167 |
| LDLOCAL | 149 | 0.259 | 0.560024 |
| reshape | 26 | 0.243 | 0.525428 |
| LDNULL | 106 | 0.17 | 0.367583 |
| LEA_GP | 58 | 0.103 | 0.222712 |
| LDTENSOR | 29 | 0.103 | 0.222712 |
| LDDATATYPE | 29 | 0.076 | 0.164331 |
| LDARG | 5 | 0.011 | 0.0237848 |
| RET | 1 | 0.005 | 0.0108113 |
| LDTUPLE | 1 | 0.003 | 0.00648677 |
| total | 941 | 46.248 | 100 |
2.3 Performance Analysis and Description#
In the model inference above, the KPU of the K230 does not provide hardware acceleration for `softmax`, `layer_norm`, `where`, `gather`, `reduce_arg`, or `reshape`, so these operators run on the C908 CPU. RVV optimization has been implemented for `softmax`, `layer_norm`, and `where`, with significant performance improvement.
Below are the pie charts showing the proportion of each operator’s time consumption in model inference before and after RVV optimization.
Below is the performance comparison of related operators before and after RVV optimization.
From the comparison above, enabling RVV optimization greatly improves the inference performance of CPU operators and significantly reduces the overall model inference time (from 1973 ms to 46 ms). After RVV optimization, the `softmax` operator time drops to about 25.7 ms, `layer_norm` to about 0.97 ms, and `where` to about 0.91 ms. The overall model inference time is shortened by roughly 97.6%, which is of high value in actual model deployment.
2.4 RVV Optimization Example#
2.4.1 RVV Code#
For the specific implementation, please refer to `layer_norm` in `nncase` here. Reading it requires some familiarity with the RISC-V base instructions and the V (vector) extension instructions.
The calculation formula for `layer_norm` is as follows:

y = (x − E[x]) / sqrt(Var[x] + ϵ) * γ + β
The overall calculation process can be found in the `layernorm_impl` function. For better readability, the RVV-optimized code is split into three parts:

1. Calculate `E[x]`; refer to the `get_mean` function.
2. Calculate `Var[x]`; refer to the `get_var` function.
3. Perform the `layer_norm` calculation according to the formula above; refer to the `layer_norm_update1` function.
Since multiplication is less time-consuming than division, the formula in step 3 is transformed to use `rsqrt` instead of `sqrt`, so that the division becomes a multiplication. A scalar sketch of the three steps is shown below.
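For readers who want to follow the structure without parsing RVV assembly, here is a minimal scalar C++ sketch of the three steps and the `rsqrt`-style transform. It only illustrates the computation order; the function and variable names are ours and do not correspond to the actual nncase implementation.

```cpp
#include <cmath>
#include <cstddef>

// Illustrative scalar layer_norm over one normalized axis of length n:
// y = (x - E[x]) * rsqrt(Var[x] + eps) * gamma + beta
void layer_norm_scalar(const float *x, float *y, const float *gamma,
                       const float *beta, size_t n, float eps) {
    // Step 1: mean (what get_mean computes in the RVV version).
    float mean = 0.f;
    for (size_t i = 0; i < n; ++i)
        mean += x[i];
    mean /= static_cast<float>(n);

    // Step 2: variance (what get_var computes in the RVV version).
    float var = 0.f;
    for (size_t i = 0; i < n; ++i) {
        float d = x[i] - mean;
        var += d * d;
    }
    var /= static_cast<float>(n);

    // Step 3: normalize. The per-element division by sqrt(var + eps) is
    // replaced by one reciprocal square root followed by multiplications.
    float r = 1.f / std::sqrt(var + eps); // rsqrt(var + eps)
    for (size_t i = 0; i < n; ++i)
        y[i] = (x[i] - mean) * r * gamma[i] + beta[i];
}
```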
2.4.2 Core Code Explanation#
Below is an explanation of the core code in `get_mean`. This fragment repeatedly loads elements of the array at address `a1` and accumulates their sum into `v0`; the final average is written to `ret`. It uses RVV's vector load and vector reduction-sum instructions to perform the summation, thereby improving computational performance.
"vle32.v v8, (a1);" // Load 32-bit vector at a1 address into v8 register
"sub a0,a0, t0;" // a0 -= t0, used for loop control count
"slli t1, t0, 2;" // t1 = t0 << 2, as each float32 is 4 bytes, so address increases by 4*t0
"vfredsum.vs v0,v8,v0;" // v0 += v8, vector accumulate sum into v0
"add a1, a1, t1;" // a1 += t1, update load address
"bnez a0, XXXXXX%=;" // If a0 != 0, jump to loop start address
"vfmv.f.s f0, v0;" // Move vector accumulate result from v0 to f0
"fcvt.s.w f1, %[avl];" // Convert avl to float and save to f1
"fdiv.s %[ret], f0, f1;" // ret = f0/f1, i.e., calculate average
2.4.3 Adding RVV Operator Process#
In the following process, paths are relative to the nncase root directory; a hedged sketch of the resulting layout follows the list.

- Function Declaration: `src/Native/src/kernels/stackvm/optimized/opt_ops.h`
- Operator Implementation:
  - General Optimization: `src/Native/src/kernels/stackvm/optimized`
  - x86 Optimization: `src/Native/src/kernels/stackvm/optimized/x86_64`
  - RVV Optimization: `src/Native/src/kernels/stackvm/optimized/riscv64`
- Logical Call: `src/Native/src/kernels/stackvm/tensor_ops.cpp`
- Modify CMakeLists: `src/Native/src/kernels/stackvm/optimized/CMakeLists.txt`
  - General Optimization: add the source file name at line 15
  - Platform-specific Optimization: add the source file name at line 44
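As a rough illustration of how these pieces fit together, below is a heavily simplified sketch of the layout described above. The operator name `my_op`, the `optimized_my_op` signature, and the call site are all hypothetical and do not reproduce the actual nncase declarations; `__riscv_vector` is the macro compilers predefine when the V extension is enabled. The new source file must also be added to the optimized `CMakeLists.txt` as noted above so that it is actually built.

```cpp
#include <cstddef>

// --- opt_ops.h: declare the optimized kernel (hypothetical name/signature) ---
void optimized_my_op(const float *input, float *output, size_t len);

// --- optimized/riscv64/my_op.cpp: RVV-specific implementation ---
void optimized_my_op(const float *input, float *output, size_t len) {
#ifdef __riscv_vector
    // RVV inline assembly or intrinsics would go here.
#endif
    // Scalar body kept only so this sketch stays self-contained and compilable.
    for (size_t i = 0; i < len; ++i)
        output[i] = input[i];
}

// --- tensor_ops.cpp: the logical call site invokes the optimized kernel ---
void my_op(const float *input, float *output, size_t len) {
    optimized_my_op(input, output, len);
}
```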
2.5 Tips#
If you encounter operators that are not yet covered by RVV optimization and need them to be supported, feel free to open issues and submit PRs on nncase.