Homework Submission
Your writeup should follow the writeup guidelines and include your answers to the following questions:
Teamwork
As the difficulty of homework is ramping up, we encourage you to spend a moment planning on how to tackle the homework as a team.
Describe which tasks of this homework you will perform, which tasks will be performed by your teammate(s), and which tasks you will perform together (e.g., pair programming, where you both sit together at the same terminal). Motivate your task distribution. (5 lines)
Give an estimate of the duration of each of the tasks. (5 lines)
Record the actual time spent on tasks as you work through the assignment.
Explain how you will make sure that the lessons and knowledge gained from the exercises are shared with everybody in the team. (3 lines)
Compiler Optimizations
Before we dive into the vector optimizations, we will investigate the effects of different levels of compiler optimizations.
Table 5
| Optimization level | Latency (ns) | Code size (bytes) |
| --- | --- | --- |
| -O0 | | |
| -O1 | | |
| -O2 | | |
| -O3 | | |
| -Os | | |
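A latency figure in nanoseconds, like the one the table asks for, can be obtained by timing a region of code with a monotonic clock. The following is only a minimal sketch: `average_latency_ns` is a hypothetical helper and not part of the hw4 sources, which may already report their own timings.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper (not from the hw4 code): returns the average
// wall-clock latency of `work` in nanoseconds over `repeats` calls.
template <typename Work>
std::int64_t average_latency_ns(Work&& work, int repeats) {
    assert(repeats > 0);
    // steady_clock is monotonic, so it is not affected by clock adjustments.
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();
    const auto total =
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    return total.count() / repeats;
}
```

Averaging over several calls reduces the influence of one-off effects such as cold caches, at the cost of hiding the latency of the first call.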
Important
You will compile all code in this homework directly on the Ultra96. The g++ compiler on the Ultra96 is the ARM compiler.
You should edit your code on your host computer (vim on the Ultra96 doesn’t work properly). Every time you edit, you can scp your revised code, or:
use Remote Explorer in VSCode to open a connection to the Ultra96, or
if on Windows, use MobaXterm to directly edit the files on the device.
Make sure that you are able to keep track of your edited files. Given there is no internet connection on the Ultra96 at the moment, you should copy back results as needed and version control your code using git repositories on your host computer.
Measure the latency and size of the baseline target at the different optimization levels. Put your measurements in a table like Table 5. You can change the optimization level by editing the CXXFLAGS in the hw4 Makefile.
Include the assembly code of the innermost loop of Filter_horizontal at optimization level -O0 in your report. Use the following command to get the assembly and then look for Filter_horizontal in Filter_O1.s:
g++ -S -O0 -mcpu=native -fno-tree-vectorize Filter.cpp -o /dev/stdout | c++filt > Filter_O1.s
Note
-fno-tree-vectorize disables automatic vectorization. We will look at automatic vectorization in the next section.
Include the assembly code of the innermost loop of Filter_horizontal at optimization level -O2 in your report.
Based on the machine code of questions 2.2 and 2.3, explain the most important difference between the -O0 and -O2 versions. (2 lines)
Hint
Leading questions:
For each case (-O0, -O2), how many times does the loop read the variable i?
For each case (-O0, -O2), how many times does the loop read and write the variable Sum?
Why is the -O2 loop able to avoid recalculating Y*INPUT_WIDTH+X inside the loop body?
What else is the -O2 loop able to avoid reading from memory or recalculating?
How is the -O2 loop able to perform fewer operations?
Why would you want to use optimization level -O0? (3 lines)
Hint
Compile the code with -O3 and track the values of the variables X, Y, and i as you step through Filter_horizontal.
Include the assembly code of the innermost loop of Filter_horizontal at optimization level -O3 in your report.
Based on the machine code of questions 2.3 and 2.6, explain the most important difference between the -O2 and -O3 versions. (1 line)
What are two drawbacks of using a higher optimization level? (5 lines)
Automatic Vectorization
The easiest way to take advantage of vector instructions is by using the automatic vectorization feature of the GCC compiler, which automatically generates NEON instructions from loops. Automatic vectorization in GCC is sparsely documented in the GCC documentation. Although we are not using the ARM compiler, the ARM compiler user guide may give some more insight on how to style your code for auto vectorization. This talk on GCC vectorization may also be useful.
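As a rough illustration of a loop shape that auto-vectorizers generally handle well, consider the sketch below: a simple counted loop, unit-stride accesses, no cross-iteration dependencies, and pointers marked as non-overlapping. `add_bias` is a hypothetical function and not part of the hw4 sources, and `__restrict` is a GCC/Clang extension rather than standard C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical example, not taken from the hw4 code.
// __restrict (a GCC/Clang extension) promises that dst and src do not
// overlap, which removes one common obstacle to vectorization.
void add_bias(std::uint8_t* __restrict dst,
              const std::uint8_t* __restrict src,
              std::size_t n,
              std::uint8_t bias) {
    assert(dst != nullptr && src != nullptr);
    // Each iteration is independent of the others and touches consecutive
    // bytes, so several iterations can be mapped onto one vector operation.
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = static_cast<std::uint8_t>(src[i] + bias);
    }
}
```

GCC can report which loops it vectorized when the `-fopt-info-vec` option is added to the compile flags, which is one way to check whether a loop written in this style was actually picked up.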
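Table 6: Vectorization Speedup Summary
| Stage | Baseline: Latency (ns) | Baseline: Suitability (Y/N) | Baseline: Ideal Vectorization Speedup | Baseline with SIMD: Latency (ns) | Baseline with SIMD: Speedup | Baseline with SIMD Modified: Latency (ns) | Baseline with SIMD Modified: Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Scale | | | | | | | |
| Filter_horizontal | | | | | | | |
| Filter_vertical | | | | | | | |
| Differentiate | | | | | | | |
| Compress | | | | | | | |
| Overall | | N/A | | | | | |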
Report the latency of each stage of the baseline application at -O3. (Start a table like Table 6; we will continue to fill in this table throughout this problem.)
Based on your understanding of the C code, which loops in the streaming stages of the application have sufficient data parallelism for vectorization? Motivate your answer. (Mark suitability by filling in Yes or No in the suitability column of Table 6; add explanation in 2–5 lines after table.)
Identify the critical path lower bound for Filter_vertical in terms of compute operations. Focus on the data path. Ignore control flow and offset computations. You may assume associativity for integer arithmetic. (5 lines)
Hint
Consider only the dependencies in the computation. What happens if you unroll the loops completely?
What is the size of the (non-index) multiplications performed in Filter_vertical? (How many bits for each of the input operands? How many bits are necessary to hold the output?) (one line)
Report the resource capacity lower bound for Filter_vertical. Focus on the computation and the computation size identified in question 3.d while computing resource capacity; you may ignore control flow and addressing computations. There are many resources that may limit the performance. (5 lines)
Hint
As with any resource capacity lower bound analysis, you may have multiple resources and may need to consider them each to identify the one that is most constraining.
You will need to review the NEON architecture (which we discussed in class and in Setup and Walk-through) and reason about what resources it has available to be used on each cycle. Think about how vectorization could exploit the set of computations a NEON unit can do in parallel.
What speedup do you expect your application can achieve if the compiler is able to achieve the resource bound identified in 3e? (5 lines)
Hint
Remember Amdahl’s Law; think about critical path lower bounds and resource capacity lower bounds.
(Fill in the ideal vectorization speedup column in Table 6; separately show Amdahl’s Law calculation for overall speedup.)
We will now enable the vectorization in g++. You can enable it by removing the -fno-tree-vectorize flag from the CXXFLAGS in the hw4 Makefile. -O3 optimization automatically turns on the flag -ftree-vectorize, which vectorizes your code.
(You do not need to modify code for 3.7 and 3.8. Just report the speedup for the given code with vectorization.)
Report the speedup of the vectorized code with respect to the baseline. (Fill in the “Baseline with SIMD” columns in Table 6.)
Explain the discrepancy between your measured and ideal performance based on the optimization of Filter_horizontal. (3 lines)
Hint
Look at the size of the multiplications in the assembly code.
To read this code, you probably need to understand the relation between Q and V registers. Perhaps useful:
Show how you can resolve the issue that you identified in the previous problem. (1 line) Include the assembly code of Filter_vertical after you have resolved the issue.
Report the speedup with respect to the baseline after resolving the issue in both Filter_horizontal and Filter_vertical. (Fill in the “Baseline with SIMD Modified” columns in Table 6.)
NEON Intrinsics Example
Review the Setup and Walk-through to learn about NEON intrinsics.
Review the code in the hw4/assignment/neon_example directory. Note how the Neon version instantiates Neon vector intrinsics to perform the operation. Convince yourself the C version and Neon version perform the same computation. (no turn in)
Build and run the code by doing make example and ./example.
Report the speedup for the Neon version compared to the C version. (1 line)
Review the assembly code produced for both the C and Neon versions. Based on the assembly code, explain how the Neon version is able to achieve the speedup you observed compared to the C version. Include assembly code to support your description. (probably 3–5 lines of description in addition to snippets of assembly)
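To make the relation between a scalar loop and its intrinsics counterpart concrete, here is a small sketch of a byte-wise vector add. It is a hypothetical example and not the code in hw4/assignment/neon_example. The NEON path is guarded by `__ARM_NEON` so the same file also builds on a host without NEON, where only the scalar loop runs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical example: out[i] = a[i] + b[i] with 8-bit wrap-around.
void add_u8(const std::uint8_t* a, const std::uint8_t* b,
            std::uint8_t* out, std::size_t n) {
    assert(a != nullptr && b != nullptr && out != nullptr);
    std::size_t i = 0;
#if defined(__ARM_NEON)
    // Process 16 bytes per iteration: one 128-bit load per input,
    // one vector add, one 128-bit store.
    for (; i + 16 <= n; i += 16) {
        const uint8x16_t va = vld1q_u8(a + i);
        const uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(out + i, vaddq_u8(va, vb));
    }
#endif
    // Scalar tail (and the whole loop when NEON is unavailable).
    for (; i < n; ++i) {
        out[i] = static_cast<std::uint8_t>(a[i] + b[i]);
    }
}
```

The scalar tail handles lengths that are not a multiple of 16, so the function gives the same result on both paths.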
Using NEON Intrinsics
You will now accelerate the Scale function using NEON intrinsics. Accelerate this function by using vector loads and stores. If you look at Filter_vertical in Filter.cpp right after the #ifdef VECTORIZED, you will see an implementation of Filter_vertical using NEON intrinsics, which may help you become more familiar with using intrinsics. This page should give you some idea about how to exploit certain vector loads to help perform Scale. You can use this page to help you find documentation for particular intrinsics. You can use this page to help you figure out how to work with different NEON datatypes, especially for those that use structs.
Explain your strategy for accelerating Scale, and include a screenshot of your function in the report. You will also submit code for this section (see the Deliverables section).
Compile the target baseline with -O3 but autovectorization turned off with -fno-tree-vectorize. Run it and report the latency of Scale.
Compile the target baseline with -O3 but this time with autovectorization. Run it and report the latency of Scale.
Compile the target neon. Run it and report the latency of Scale.
How much faster was your NEON implementation than the two baseline implementations?
Reflection
Reflect on the cooperation in your team.
Compare your actual time on tasks with your original estimates. (table with 1-2 line explanation of major discrepancies)
Reflect on your task decomposition (1.1). Were you able to complete the task as you originally planned? What aspects of your original task distribution worked well and why? Did you refine the plan during the assignment? How and why? In hindsight, how should you have distributed the tasks? (paragraph)
What was the most useful thing you learned from, or by working with, your teammate? (2–4 lines)
What do you believe was the most useful thing that you were able to contribute to your team? (1–3 lines)
Deliverables
In summary, upload the following to their respective links on Canvas:
a tarball containing the hw4 source code with your modified neon intrinsics code.
Quick linux commands for tar files
# Compress
tar -cvzf <file_name.tgz> directory_to_compress/
# Decompress
tar -xvzf <file_name.tgz>
writeup in pdf.