Setup and Walk-through#
Vectorization#
We will divide the computation into vectors that run on the NEON units in our ARM cores. Fig. 16 shows the high level block diagram of the ARM core in Ultra96 board.
The Ultra96 boards have ARM Cortex A-53 cores. It’s a 2-way decode, in-order core with one 64-bit NEON SIMD unit, as shown in Table 4.
Cortex A-53 |
|
---|---|
ARM ISA |
ARMv8 (32/64-bit) |
Decoder Width |
2 micro-ops |
Maximum Pipeline Length |
8 |
Integer Add |
2 |
Integer Mul |
1 |
Load/Store Units |
1 |
Branch Units |
1 |
FP/NEON ALUs |
1x64-bit |
L1 Cache |
8KB-64KB I$ + 8KB-64KB D$ |
L2 Cache |
128KB - 2MB (Optional) |
We will use the NEON Intrinsics API to program the NEON Units in our cores. An intrinsic behaves syntactically like a function, but the compiler translates it to a specific instruction that is inlined in the code. In the following sections, we will guide you through reading the NEON Programmer’s guide and learning to use these APIs.
Obtaining the Code#
In the previous homework, we dealt with a streaming application that
compressed a video stream, and explored how to implement coarse-grain data-level parallelism
and pipeline parallelism using std::threads
to speedup the application. For this homework,
we will use the same application and implement fine-grain, data-level
parallelism on a vector architecture; we will explore both auto
vectorization with the compiler and hand-crafted NEON vector intrinsics.
On you local machine, clone the
ese532_code
repository using the following command:git clone https://github.com/icgrp/ese532_code.git
If you already have it cloned, pull in the latest changes using:
cd ese532_code/ git pull origin master
The code you will use for homework submission is in the
hw4
directory. The directory structure looks like this:hw4/ assignment/ Makefile common/ App.h Constants.h Stopwatch.h Utilities.h Utilities.cpp src/ App.cpp Compress.cpp Differentiate.cpp Filter.cpp Scale.cpp neon_example/ Example.cpp data/ Input.bin Golden.bin
Environment Setup#
Setup your Ultra96 like you did in HW3, and copy the hw4
directory onto the Ultra96.
Running the Code#
There are 3 targets, which we will build in the Ultra96. You can build all of them by executing
make all
in thehw4/assignment
directory. You can build separately by:make baseline
and./baseline
to run the project with no vectorization ofFilter_vertical
function.make neon_filter
and./neon_filter
to run the project withFilter_vertical
vectorized (you will modify the vectorized code later).make example
and./example
to run the neon example.
The
data
folder contains the input data,Input.bin
, which has 200 frames of size \(960\) by \(540\) pixels, where each pixel is a byte.Golden.bin
contains the expected output. Each program uses this file to see if there is a mismatch between your program’s output and the expected output.The
assignment/common
folder has header files and helper functions used by the four parts.You will mostly be working with the code in the
assignment/src
folder.
Working with NEON#
We are going to do some reading from the arm developer website articles and the NEON Programmer’s Guide in the following sections.
Basics#
Read Introducing Neon for Armv8-a and answer the following questions. We have given you the answers, however make sure you do the reading! Knowing where to look in a programmer’s guide is a skill by itself and we want to learn it now rather than later.
1. Give an example of a SISD instruction.
add r0, r5
and any instruction from the ARM and Thumb-2 ISA quick reference guide
2. Give an example of a SIMD instruction.
add v10.4s, v8.4s, v9.4s
and any instruction from the NEON quick reference guide
3. What is the size of a register in a Armv8-A NEON unit?
128-bit
4. What does a NEON register contain?
vectors of elements of the same data type
5. How many sizes of NEON vectors are there and what are those sizes?
Two sizes: 64-bit and 128-bit NEON vectors
6. What is a lane?
The same element position in the input and output registers is referred to as a lane.
7. How many lanes are there in a uint16x8_t NEON vector data type?
8
8. How many lanes are there in a uint32x2_t NEON vector data type?
2
9. Can there be a carry or overflow from one lane to another?
No.
Read NEON and floating-point registers and answer the following questions:
1. How many NEON registers are there in ARMv8 and what are they labeled as?
32 128-bit NEON registers, labeled as V0-V31.
2. What is the difference between an operand labeled v0.16b and an operand labeled q0?
v0.16b is a vector register and has 16 lanes with each lane having 1 byte. q0 is a scalar register of 128-bits.
3. Are registered labeled b0, h0, s0, d0, q0 separate registers?
No, all of them belong to the same register v0. They are qualified names for registers when a NEON instruction operate on scalar data.
Read chapter four from the NEON Programmer's Guide
and answer the following questions:
1. Where are the NEON Intrinsics declared?
in arm_neon.h
header file
2. What NEON data type are you going to use for an unsigned char array of size 16 elements?
uint8x16_t. It will got to Q register.
3. When should you use intrinsics with ‘q’ suffix vs intrinsics without ‘q’ suffix?
When the input and output vectors are 64-bit vectors, don’t use intrinsics with ‘q’ suffix. When the input and output vectors are 128-bit vectors, do use intrinsics with ‘q’ suffix.
Coding with NEON Intrinsics#
Read chapter four from the NEON Programmer’s Guide and answer the following questions. Use the Neon Intrinsics Reference website to find and understand any instruction.
Tip
This will help you in coding for your homework.
1. Which intrinsic should you use to duplicate a scalar value to a variable of type uint16x8_t?
vdupq_n_u16
2. Which intrinsic should you use to load 16 bytes from a pointer to a variable of type uint8x16_t?
vld1q_u8
3. Which intrinsic should you use to add two vectors of type uint8x8_t without overflowing?
vaddl_u8
4. Which intrinsic should you use to get the first 8 lanes (low) of a variable of type uint8x16_t?
vget_low_u8
5. Which intrinsic should you use to get the second 8 lanes (high) of a variable of type uint8x16_t?
vget_high_u8
6. Which intrinsic should you use to multiply two vectors of type uint16x8_t?
vmulq_u16
7. Which intrinsic should you use to multiply two vectors of type uint16x8_t and accumlate the result to a variable of type uint16x8_t?
vmlaq_u16
8. Which intrinsic should you use to shift a variable of type uint16x8_t to the right?
vshrq_n_u16
9. Which intrinsic should you use to cast the uint8_t values in a variable of type uint8x8_t to be uint16_t?
vmovl_u8
10. Which intrinsic should you use to cast the uint16_t values in a variable of type uint16x8_t to be uint8_t?
vmovn_u16
11. Which intrinsic should you use to join two uint8x8_t vectors into a uint8x16_t vector?
vcombine_u8
12. Which intrinsic should you use to store data from a uint8x16_t variable to a pointer?
vst1q_u8
Optimization:#
Read section 2.1.10, 2.8, and chapter 5 from the NEON Programmer’s Guide.
Watch the talk: Taming ARMv8 NEON: from theory to benchmark results
Read (supplemental) Optimizing C Code with Neon Intrinsics
Read (supplemental) Coding for NEON
Read (supplemental) Neon Intrinsics Chromium Case Study
Read (supplemental) Program Optimization through Loop Vectorization