Homework Submission


Your writeup should follow the writeup guidelines and include your answers to the following questions:

Teamwork

As the difficulty of the homework is ramping up, we encourage you to spend a moment planning how to tackle it as a team.
1. Describe which tasks of this homework you will perform, which tasks will be performed by your teammate(s), and which tasks you will perform together (e.g., pair programming, where you both sit together at the same terminal). Motivate your task distribution. (5 lines)
2. Give an estimate of the duration of each of the tasks. (5 lines)
3. Record the actual time spent on tasks as you work through the assignment.
4. Explain how you will make sure that the lessons and knowledge gained from the exercises are shared with everybody in the team. (3 lines)

Compiler Optimizations

Before we dive into the vector optimizations, we will investigate the effects of different levels of compiler optimizations.

Table 5: Latency and Code Size per Optimization Level

| Optimization level | Latency (ns) | Code size (bytes) |
|---|---|---|
| `-O0` | | |
| `-O1` | | |
| `-O2` | | |
| `-O3` | | |
| `-Os` | | |

Important
- You will compile all code in this homework directly on the Ultra96. The g++ compiler on the Ultra96 is a native ARM compiler, so it generates ARM code without cross-compilation.
- You should edit your code on your host computer (vim on the Ultra96 doesn’t work properly). Every time you edit, you can scp your revised code, or:
  - use Remote Explorer in VSCode to open a connection to the Ultra96, or
  - if on Windows, use MobaXterm to edit the files directly on the device.
- Make sure that you keep track of your edited files. Since the Ultra96 has no internet connection at the moment, you should copy results back as needed and version control your code using git repositories on your host computer.
1. Measure the latency and size of the baseline target at the different optimization levels. Put your measurements in a table like Table 5. You can change the optimization level by editing the CXXFLAGS in the hw4 Makefile.
2. Include the assembly code of the innermost loop of Filter_horizontal at optimization level -O0 in your report. Use the following command to get the assembly and then look for Filter_horizontal in Filter_O0.s:
```
g++ -S -O0 -mcpu=native -fno-tree-vectorize Filter.cpp -o /dev/stdout | c++filt > Filter_O0.s
```
  Note
  
  -fno-tree-vectorize disables automatic vectorization. We will look at automatic vectorization in the next section.
3. Include the assembly code of the innermost loop of Filter_horizontal at optimization level -O2 in your report.
4. Based on the machine code of questions 2.2 and 2.3, explain the most important difference between the -O0 and -O2 versions. (2 lines)
  Hint
  
  Leading questions:
  - for each case (-O0, -O2), how many times does the loop read the variable i?
  - for each case (-O0, -O2), how many times does the loop read and write the variable Sum?
  - why is the -O2 loop able to avoid recalculating Y*INPUT_WIDTH+X inside the loop body?
  - what else is the -O2 loop able to avoid reading from memory or recalculating?
  - how is the -O2 loop able to perform fewer operations?
5. Why would you want to use optimization level -O0? (3 lines)
  
  Hint
  
  Compile the code with -O3 and track the values of the variables X, Y, and i as you step through Filter_horizontal.
6. Include the assembly code of the innermost loop of Filter_horizontal at optimization level -O3 in your report.
7. Based on the machine code of questions 2.3 and 2.6, explain the most important difference between the -O2 and -O3 versions. (1 line)
8. What are two drawbacks of using a higher optimization level? (5 lines)
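
The latency measurement in question 1 can be sketched as follows. This is a minimal illustration using the standard `std::chrono` clock, not the timing code that ships with hw4; the helper name `mean_latency_ns` and the averaging over several runs are assumptions made for this example.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical helper (not part of hw4): calls `work` `repeats` times and
// returns the mean wall-clock time per call in nanoseconds.
template <typename Fn>
std::int64_t mean_latency_ns(Fn&& work, int repeats) {
    assert(repeats > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();
    const auto total =
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    return total.count() / repeats;
}
```

Averaging over a few runs smooths out one-off effects such as cold caches. When timing at -O2 or -O3, make sure the result of the timed work is actually used, or the compiler may remove the work entirely.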

Automatic Vectorization

The easiest way to take advantage of vector instructions is the automatic vectorization feature of the GCC compiler, which generates NEON instructions from loops. Automatic vectorization is only sparsely covered in the GCC documentation. Although we are using GCC rather than Arm's own compiler, the ARM compiler user guide may give more insight into how to write code that auto-vectorizes well. This talk on GCC vectorization may also be useful.
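
As a rough illustration of code that compilers tend to vectorize well, consider the sketch below. It is a hypothetical example and not taken from hw4: the function `add_offset` and its parameters are invented here, and `__restrict` is a GCC/Clang extension rather than standard C++.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical example (not from hw4): adds a constant to every pixel.
// `__restrict` promises the compiler that `in` and `out` do not overlap,
// which removes one common obstacle to vectorization.
void add_offset(const std::uint8_t* __restrict in,
                std::uint8_t* __restrict out,
                std::size_t n, std::uint8_t offset) {
    assert(in != nullptr && out != nullptr);
    // Simple counted loop with unit stride, no function calls, no early
    // exits, and no dependence between iterations.
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = static_cast<std::uint8_t>(in[i] + offset);
    }
}
```

GCC's `-fopt-info-vec` and `-fopt-info-vec-missed` options report which loops were vectorized and which were not, which can help when checking whether a loop was restructured successfully.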

Vectorization Speedup Summary

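Table 6: Vectorization Speedup Summary

| Stage | Baseline: Latency (ns) | Baseline: Suitability (Y/N) | Baseline: Ideal Vectorization Speedup | Baseline with SIMD: Latency (ns) | Baseline with SIMD: Speedup | Baseline with SIMD Modified: Latency (ns) | Baseline with SIMD Modified: Speedup |
|---|---|---|---|---|---|---|---|
| `Scale` | | | | | | | |
| `Filter_horizontal` | | | | | | | |
| `Filter_vertical` | | | | | | | |
| `Differentiate` | | | | | | | |
| `Compress` | | | | | | | |
| Overall | | N/A | | | | | |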

1. Report the latency of each stage of the baseline application at -O3. (Start a table like Table 6; we will continue to fill in this table throughout this problem.)
2. Based on your understanding of the C code, which loops in the streaming stages of the application have sufficient data parallelism for vectorization? Motivate your answer. (Mark suitability by filling in Yes or No in the suitability column of Table 6; add an explanation in 2–5 lines after the table.)
3. Identify the critical path lower bound for Filter_vertical in terms of compute operations. Focus on the data path. Ignore control flow and offset computations. You may assume associativity for integer arithmetic. (5 lines)
  
  Hint
  
  Consider only the dependencies in the computation. What happens if you unroll the loops completely?
4. What is the size of the (non-index) multiplications performed in Filter_vertical? (How many bits for each of the input operands? How many bits are necessary to hold the output?) (1 line)
5. Report the resource capacity lower bound for Filter_vertical. Focus on the computation and the computation size identified in question 3.4 while computing resource capacity; you may ignore control flow and addressing computations. There are many resources that may limit the performance. (5 lines)
  
  Hint
  - As with any resource capacity lower bound analysis, you may have multiple resources and may need to consider each of them to identify the one that is most constraining.
  - You will need to review the NEON architecture (which we discussed in class and in Setup and Walk-through) and reason about what resources it has available to be used on each cycle. Think about how vectorization could exploit the set of computations a NEON unit can do in parallel.
6. Calculate the ideal vectorization speedup for each stage and fill in Table 6. Additionally, what speedup do you expect your application to achieve if the compiler is able to achieve the ideal vectorization speedup? (5 lines)
  
  Hint
  
  For each stage, identify how many operations can run in vector parallel on the NEON. (Part 3) How does that reduce the resource bound? How does this reduce the overall number of cycles? (Part 4) You should consider both critical path lower bounds and resource capacity lower bounds.
  
  Remember Amdahl’s Law for speedup.
  
  (Fill in the ideal vectorization speedup column in Table 6; separately show the Amdahl’s Law calculation for overall speedup.)

We will now enable vectorization in g++. You can enable it by removing the -fno-tree-vectorize flag from the CXXFLAGS in the hw4 Makefile. -O3 optimization automatically turns on the flag -ftree-vectorize, which vectorizes your code. (You do not need to modify code for 3.7 and 3.8; just report the speedup for the given code with vectorization.)

7. Report the speedup of the vectorized code with respect to the baseline. (Fill in the “Baseline with SIMD” columns in Table 6.)
8. Explain the discrepancy between your measured and ideal performance based on the optimization of Filter_horizontal. (3 lines)
  
  Hint
  - Look at the size of the multiplications in the assembly code.
  - To read this code, you probably need to understand the relation between Q and V registers.
9. Show how you can resolve the issue that you identified in the previous problem. (1 line) Include the assembly code of Filter_vertical after you have resolved the issue.
10. Report the speedup with respect to the baseline after resolving the issue in both Filter_horizontal and Filter_vertical. (Fill in the “Baseline with SIMD Modified” columns in Table 6.)
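
The Amdahl's Law calculation for the overall speedup can be sketched as below. With per-stage baseline latencies T_i and per-stage speedups S_i, the overall speedup is sum(T_i) / sum(T_i / S_i). The `Stage` struct and the function name are hypothetical, introduced only for this illustration; the numbers you feed in are your own measurements and estimates.

```cpp
#include <cassert>
#include <vector>

// Per-stage data supplied by the caller.
struct Stage {
    double baseline_ns;  // baseline latency of the stage
    double speedup;      // speedup of this stage alone (1.0 = not accelerated)
};

// Amdahl's Law generalized to several stages:
//   overall = sum(T_i) / sum(T_i / S_i)
double overall_speedup(const std::vector<Stage>& stages) {
    double before = 0.0;
    double after = 0.0;
    for (const Stage& s : stages) {
        assert(s.speedup > 0.0);
        before += s.baseline_ns;
        after += s.baseline_ns / s.speedup;
    }
    return before / after;
}
```

As a made-up example, an application with a 100 ns stage that is not accelerated and a 300 ns stage that is sped up 4x goes from 400 ns to 175 ns, an overall speedup of about 2.29 rather than 4.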

NEON Intrinsics Example

Review the Setup and Walk-through to learn about NEON intrinsics.
1. Review the code in the hw4/assignment/neon_example directory. Note how the Neon version instantiates Neon vector intrinsics to perform the operation. Convince yourself the C version and Neon version perform the same computation. (no turn in)
2. Build and run the code by doing make example and ./example.
3. Report the speedup for the Neon version compared to the C version. (1 line)
4. Review the assembly code produced for both the C and Neon versions. Based on the assembly code, explain how the Neon version is able to achieve the speedup you observed compared to the C version. Include assembly code to support your description. (probably 3–5 lines of description in addition to snippets of assembly)
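
For orientation, the sketch below shows the general shape of a scalar loop next to a version written with NEON intrinsics. It is not the code in hw4/assignment/neon_example; the function names are invented here, and the assumption that the length is a multiple of 16 is a simplification. A scalar fallback is included so the snippet also builds on targets without NEON.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Plain C++ version: one element per iteration.
void add_scalar(const std::uint8_t* a, const std::uint8_t* b,
                std::uint8_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = static_cast<std::uint8_t>(a[i] + b[i]);
    }
}

// NEON version: 16 elements per iteration. Falls back to the scalar loop
// when not compiled for a NEON-capable target.
void add_neon(const std::uint8_t* a, const std::uint8_t* b,
              std::uint8_t* out, std::size_t n) {
    assert(n % 16 == 0);  // simplifying assumption: no tail handling
#if defined(__ARM_NEON)
    for (std::size_t i = 0; i < n; i += 16) {
        const uint8x16_t va = vld1q_u8(a + i);  // load 16 bytes
        const uint8x16_t vb = vld1q_u8(b + i);
        // Lane-wise add (wraps modulo 256), then store 16 bytes.
        vst1q_u8(out + i, vaddq_u8(va, vb));
    }
#else
    add_scalar(a, b, out, n);
#endif
}
```

Both functions produce identical output; the difference is how many elements each instruction processes.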

Using NEON Intrinsics

You will now accelerate the Scale function using NEON intrinsics, specifically vector loads and stores. If you look at Filter_vertical in Filter.cpp right after the #ifdef VECTORIZED, you will see an implementation of Filter_vertical using NEON intrinsics, which may help you become more familiar with using intrinsics. This page should give you some idea about how to exploit certain vector loads to help perform Scale. You can use this page to help you find documentation for particular intrinsics. You can use this page to help you figure out how to work with different NEON datatypes, especially those that use structs.
1. Explain your strategy for accelerating Scale, and include a screenshot of your function in the report. You will also submit code for this section (see the Deliverables section).
2. Compile the target baseline with -O3 but autovectorization turned off with -fno-tree-vectorize. Run it and report the latency of Scale.
3. Compile the target baseline with -O3 but this time with autovectorization. Run it and report the latency of Scale.
4. Compile the target neon. Run it and report the latency of Scale.
5. How much faster was your NEON implementation than each of the two baseline implementations?
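
To illustrate how NEON struct datatypes behave, here is a minimal sketch of a de-interleaving load. It is a generic example and not a solution for Scale: the helper `split_even_odd` and its parameters are hypothetical, and the requirement that the pair count be a multiple of 16 is a simplification. A scalar fallback keeps the snippet buildable on targets without NEON.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper (not from hw4): splits an interleaved buffer
// e0 o0 e1 o1 ... into its even-indexed and odd-indexed elements.
void split_even_odd(const std::uint8_t* in, std::uint8_t* even,
                    std::uint8_t* odd, std::size_t n_pairs) {
    assert(n_pairs % 16 == 0);  // simplifying assumption: no tail handling
#if defined(__ARM_NEON)
    for (std::size_t i = 0; i < n_pairs; i += 16) {
        // vld2q_u8 reads 32 bytes and de-interleaves them into a struct of
        // two vectors: val[0] holds bytes 0, 2, 4, ... and val[1] holds
        // bytes 1, 3, 5, ...
        const uint8x16x2_t pairs = vld2q_u8(in + 2 * i);
        vst1q_u8(even + i, pairs.val[0]);
        vst1q_u8(odd + i, pairs.val[1]);
    }
#else
    // Scalar fallback with the same result on non-NEON targets.
    for (std::size_t i = 0; i < n_pairs; ++i) {
        even[i] = in[2 * i];
        odd[i] = in[2 * i + 1];
    }
#endif
}
```

The struct type `uint8x16x2_t` is simply two `uint8x16_t` vectors accessed through its `val` array, so each half can be passed to ordinary single-vector intrinsics.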

Reflection

Reflect on the cooperation in your team.
1. Compare your actual time on tasks with your original estimates. (table with 1-2 line explanation of major discrepancies)
2. Reflect on your task decomposition (1.1). Were you able to complete the task as you originally planned? What aspects of your original task distribution worked well and why? Did you refine the plan during the assignment? How and why? In hindsight, how should you have distributed the tasks? (paragraph)
3. What was the most useful thing you learned from, or while working with, your teammate? (2–4 lines)
4. What do you believe was the most useful thing that you were able to contribute to your team? (1–3 lines)

Deliverables

In summary, upload the following to their respective links on Canvas:

- A tarball containing the hw4 source code with your modified NEON intrinsics code.
  
  Quick Linux commands for tar files:
```
# Compress
tar -cvzf <file_name.tgz> directory_to_compress/
# Decompress
tar -xvzf <file_name.tgz>
```
- Your writeup as a PDF.
