Homework Submission
Your writeup should follow the writeup guidelines and include your answers to the following questions:
Baseline
Get the source code and run the baseline project on the Ultra96 as shown in Setup and Walk-through. Copy the hw3 directory to the Ultra96: scp -r hw3 root@10.10.7.1:/home/root/
Determine the throughput of baseline in pictures per second. This is your baseline. We use -O2 for the baseline, so you should keep using -O2 for the rest of the homework. Ignore overhead such as loading and storing pictures for this and the following questions. (1 line)
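Throughput in pictures per second is the number of frames processed divided by the elapsed processing time. The following is a minimal sketch of that calculation using std::chrono. The names pictures_per_second and measure_throughput are hypothetical and are not part of the hw3 sources, which may already provide their own timing code.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper (not from the hw3 sources): converts a frame count and
// an elapsed duration into pictures per second.
double pictures_per_second(std::size_t frames,
                           std::chrono::duration<double> elapsed) {
    assert(elapsed.count() > 0.0);
    return static_cast<double>(frames) / elapsed.count();
}

// Times only the processing callable, so that loading and storing pictures
// can be kept outside the measured region.
template <typename Work>
double measure_throughput(std::size_t frames, Work&& process_all_frames) {
    const auto start = std::chrono::steady_clock::now();
    process_all_frames();
    const auto stop = std::chrono::steady_clock::now();
    return pictures_per_second(frames, stop - start);
}
```

For example, 200 frames processed in 4 seconds would give 50 pictures per second. Wrapping only the processing loop in the callable keeps file I/O out of the measurement, as the question asks.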
Coarse-grain data parallelism
We will parallelize the application by processing half of each picture on core 0 and the other half on core 1, a form of coarse-grain, data-level parallelism. The initial implementation can be found in hw3/assignment/coarse_grain. We have already parallelized Scale for you. Note that the ARM cores do not support hyper-threading, so the core indices 0 and 1 passed to pin_thread_to_cpu refer to two different physical cores.
Can we parallelize all streaming functions in our application, i.e. Filter_horizontal, Filter_vertical, Differentiate, and Compress, in the same way as Scale? Motivate your answer. Assume that we synchronize our cores between each producer-consumer pair. (3 lines)
What speedup do you expect from parallelizing the functions that you considered parallelizable in the previous question? [Include an equation for the expected parallel runtime and show the equation you use for computing the speedup as well as your final, numeric result. Report both the per-function speedup and the overall application speedup.] (5–7 lines)
Complete the implementation by parallelizing the functions that you considered parallelizable in the previous question. Provide the relevant sections of code in your report.
Measure the throughput of your parallel implementation.
Validate your results. Make sure that your parallel version produces the same answers as the original serial version. Explain how you validated your results; report any discrepancies in your final implementation. (1 line)
Compare your measurement with your ideal, expected speedup. (1 line)
If your measured speedup differs from the ideal, expected speedup, what effects are likely to be responsible for the difference? (1–3 lines)
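The expected-speedup reasoning in the questions above can be sketched as a small calculation: stages that can be split run in their serial time divided by the number of cores, the remaining stages keep their serial time, and the speedup is the serial total over the parallel total. This is only an idealized sketch. The Stage struct and the function names are illustrative, the stage times must come from your own measurements, and synchronization cost is ignored.

```cpp
#include <cassert>
#include <vector>

// One stage of the streaming application with its measured serial time.
// The struct and field names are illustrative, not taken from the hw3 code.
struct Stage {
    double serial_seconds;  // measured time on one core
    bool parallelizable;    // whether the stage can be split across cores
};

// Ideal parallel runtime: parallelizable stages divide evenly over the cores,
// the rest keep their serial time. Synchronization cost is ignored.
double expected_parallel_seconds(const std::vector<Stage>& stages, int cores) {
    assert(cores >= 1);
    double total = 0.0;
    for (const Stage& s : stages) {
        total += s.parallelizable ? s.serial_seconds / cores : s.serial_seconds;
    }
    return total;
}

// Overall speedup: serial runtime divided by the ideal parallel runtime.
double expected_speedup(const std::vector<Stage>& stages, int cores) {
    double serial = 0.0;
    for (const Stage& s : stages) serial += s.serial_seconds;
    return serial / expected_parallel_seconds(stages, cores);
}
```

With made-up times of 6 s for a parallelizable stage and 1 s for a serial one, two cores give an ideal runtime of 4 s and an overall speedup of 1.75, even though the parallelized stage itself speeds up by 2.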
Pipelining
As an alternative to coarse-grain, data-level parallelism, we will investigate a pipelined implementation in this question. The initial implementation can be found in hw3/assignment/pipeline_2_cores. The provided stream has only \(200\) frames, but assume in your performance computations that you are dealing with a stream of infinite length.
Report the throughput of the initial pipelined implementation on 2 cores in pictures per second. (1 line)
What is the best performance that one could theoretically achieve with a pipelined mapping of the streaming application on 2 cores over the single core solution? (1 line)
Hint
Where is the bottleneck? How does pipelining help in hiding the bottleneck?
Describe the mapping that achieves the best performance. (3 lines)
Review the provided code and explain how it handles filling and draining the pipeline of operators. That is, when the application starts, there is data only for the first stage in the pipeline (Scale) and none for the later stages. After the input data has been consumed by the Scale stage, the later stages will still have data to process. How does the code ensure that the program runs correctly to completion on all data? (4–6 lines)
Review the provided code. Explain how you can adjust the PIPELINE_PAR parameter (in Filter.cpp) to maximize throughput. Do not find a better PIPELINE_PAR experimentally; derive it analytically. (2–3 lines)
Adapt the implementation by changing the PIPELINE_PAR parameter to optimize the pipeline tasks, or implement your own mapping to optimize them. Include the sections of the code that you modified in your report.
Validate your results. Report how you validated them and any discrepancies. (1 line)
Report the throughput of your new application in pictures per second. (1–2 lines)
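For an infinitely long stream, the steady-state throughput of a pipeline is set by its slowest stage. The sketch below illustrates that idea for a chain of operators split into two contiguous groups, one per core: it tries every cut point and keeps the one whose slower core is fastest. The Split struct and function names are hypothetical, the per-operator times must come from your own measurements, and this does not describe how PIPELINE_PAR is interpreted in the provided code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Result of splitting a chain of operators into two contiguous pipeline stages.
struct Split {
    std::size_t first_on_core1;  // operators [0, first_on_core1) run on core 0
    double bottleneck_seconds;   // per-picture time of the slower core
};

// Tries every contiguous split of the operator chain and keeps the one whose
// slower core is fastest. Per-operator times are per picture, in seconds.
Split best_two_core_split(const std::vector<double>& op_seconds) {
    assert(op_seconds.size() >= 2);
    double total = 0.0;
    for (double t : op_seconds) total += t;

    Split best{1, total};
    double core0 = 0.0;
    for (std::size_t cut = 1; cut < op_seconds.size(); ++cut) {
        core0 += op_seconds[cut - 1];
        const double bottleneck = std::max(core0, total - core0);
        if (bottleneck < best.bottleneck_seconds) best = {cut, bottleneck};
    }
    return best;
}

// Steady-state throughput of the pipeline in pictures per second.
double pipeline_throughput(const Split& split) {
    return 1.0 / split.bottleneck_seconds;
}
```

With made-up operator times of 1, 3, 2, and 2 seconds per picture, the best cut places the first two operators on core 0, each core then needs 4 seconds per picture, and the pipeline sustains 0.25 pictures per second compared with 0.125 on a single core.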
Let’s investigate the performance if we incorporate the optimized pipeline into a video broadcast server. The input data is read from an interface with \(80\) MB/s throughput. \(75\%\) of the traffic is video traffic that is compressed using our pipeline (running on 2 processors). Assume the 2 cores can pipeline the process perfectly. The remaining \(25\%\) is other traffic that we protect with an error correction code (ECC) running on a dedicated hardware unit that adds \(10\%\) overhead in size. The hardware ECC unit can process \(30\) MB/s. The outputs of the ECC unit and the compression pipeline are sent to a single \(2\)-Gigabit/s Ethernet port.
Draw a streaming dataflow diagram for the network server. Indicate throughput and data transfer ratios where applicable.
What is the maximum throughput that the server can achieve? (10 lines)
Where is the bottleneck? (1 line)
How much smaller do we have to make the kernel (FILTER_LENGTH) of Filter to move the bottleneck? Do not find a FILTER_LENGTH experimentally; derive it analytically. (7 lines)
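One way to reason about the server questions above is to bound the total input rate by each unit's capacity divided by the fraction of input traffic that reaches it. The following sketch encodes that reasoning. The Server struct and max_input_rate are hypothetical names, the pipeline capacity and compression ratio are not given in the assignment and must come from your own measurements, and the Ethernet figure assumes 2 Gbit/s = 250 MB/s with 1 MB taken as 10^6 bytes.

```cpp
#include <algorithm>
#include <cassert>

// Parameters of the broadcast server. Rates are in MB/s. The pipeline capacity
// and compression ratio must come from your own measurements; they are not
// given in the assignment text.
struct Server {
    double input_rate;         // interface rate (80 MB/s in the assignment)
    double video_fraction;     // share of traffic sent to the pipeline (0.75)
    double pipeline_capacity;  // input rate the compression pipeline sustains
    double compression_ratio;  // pipeline output bytes per input byte
    double ecc_capacity;       // input rate the ECC unit sustains (30 MB/s)
    double ecc_expansion;      // ECC output bytes per input byte (1.10)
    double output_capacity;    // Ethernet port (assumed 250 MB/s for 2 Gbit/s)
};

// Largest total input rate (MB/s) that no unit in the dataflow graph rejects.
double max_input_rate(const Server& s) {
    assert(s.video_fraction > 0.0 && s.video_fraction < 1.0);
    const double other_fraction = 1.0 - s.video_fraction;
    // Output bytes produced per input byte, summed over both branches.
    const double output_per_input = s.video_fraction * s.compression_ratio
                                  + other_fraction * s.ecc_expansion;
    return std::min({s.input_rate,
                     s.pipeline_capacity / s.video_fraction,
                     s.ecc_capacity / other_fraction,
                     s.output_capacity / output_per_input});
}
```

Whichever term of the minimum is smallest identifies the bottleneck. With a purely illustrative pipeline capacity of 30 MB/s and a compression ratio of 0.5, the pipeline term is 40 MB/s and limits the server; your measured numbers will differ.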
CDC Parallel
Note
This section is optional, but it will help you with the final project.
Building on the techniques and observations from the previous parts, create a data-parallel implementation of your CDC function from homework 2 that uses four ARM A53 cores to achieve a parallel speedup. The starter code can be found in hw3/assignment/cdc_parallel. If you do not see a speedup with four cores, increase the size of the .txt file.
What is the best performance that one could theoretically achieve with a data-parallel mapping of CDC on 4 cores over the single core solution? (1 line)
Describe the data-parallel mapping that achieves the best performance. Try to achieve the best speedup over the single core solution.
Implement your design and include your code in your report.
Hint
Use the techniques shown in the walk-through!
You can divide the input file equally among the threads. However, make sure that the hash calculation for any window of characters that straddles the regions of two threads is accounted for; in other words, think about the indices you pass to your threads. You should not miss a chunk break because of the way you split the input between cores.
Report speedup obtained and relate it to your solution. (3–5 lines)
Validate your design and report on any discrepancies.
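The hint about indices can be illustrated with a small sketch: each thread tests the boundary positions in its own range, but the hash window for a position near the start of that range reaches back into the previous thread's bytes, which is safe because the input is only read. Everything here is an assumption made for illustration: WIN, MODULUS, and window_hash are placeholders rather than your homework 2 hash, and the sketch ignores any minimum or maximum chunk-size rule, which would add a sequential dependence.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

constexpr std::size_t WIN = 16;   // hypothetical window size
constexpr unsigned MODULUS = 64;  // hypothetical boundary condition

// Placeholder hash over the WIN bytes ending at position `end` (inclusive).
// Stands in for the homework 2 hash, which is not reproduced here.
unsigned window_hash(const std::string& data, std::size_t end) {
    unsigned h = 0;
    for (std::size_t i = end + 1 - WIN; i <= end; ++i)
        h = h * 31u + static_cast<unsigned char>(data[i]);
    return h;
}

// Finds boundary positions in [begin, end). The window reaches back up to
// WIN - 1 bytes before `begin`, so boundaries near a split are not missed.
std::vector<std::size_t> find_boundaries(const std::string& data,
                                         std::size_t begin, std::size_t end) {
    std::vector<std::size_t> out;
    for (std::size_t i = std::max(begin, WIN - 1); i < end; ++i)
        if (window_hash(data, i) % MODULUS == 0) out.push_back(i);
    return out;
}

// Splits the positions evenly over `threads` workers and concatenates the
// per-thread results in order.
std::vector<std::size_t> find_boundaries_parallel(const std::string& data,
                                                  unsigned threads) {
    assert(threads >= 1);
    std::vector<std::vector<std::size_t>> partial(threads);
    std::vector<std::thread> workers;
    const std::size_t n = data.size();
    for (unsigned t = 0; t < threads; ++t) {
        const std::size_t begin = n * t / threads;
        const std::size_t end = n * (t + 1) / threads;
        workers.emplace_back([&data, &partial, t, begin, end] {
            partial[t] = find_boundaries(data, begin, end);
        });
    }
    for (std::thread& w : workers) w.join();

    std::vector<std::size_t> all;
    for (const auto& p : partial) all.insert(all.end(), p.begin(), p.end());
    return all;
}
```

Comparing find_boundaries_parallel(data, 4) against find_boundaries(data, 0, data.size()) on the same input is one way to validate that no chunk break is lost at a split. On Linux, building with g++ typically needs the -pthread flag.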
Deliverables
In summary, upload the following to their respective links on Canvas:
a tarball containing the 4 projects with your modified code.
Quick Linux commands for tar files
# Compress
tar -cvzf <file_name.tgz> directory_to_compress/
# Decompress
tar -xvzf <file_name.tgz>
a writeup in PDF.