Homework Submission
Your writeup should follow the writeup guidelines. Your writeup should include your answers to the questions below. Even if a certain bullet is just a “step”, please include it in your report and leave the bullet blank for the sake of easy grading.
Note
Note that the last part of this assignment could take longer than the previous parts.
Accelerating the Filter horizontal
Create a new Vitis HLS project and add the provided source files. Select xczu3eg-sbva484-1-i in the device selection. Use a 150 MHz clock, and select the Vitis Kernel Flow Target for the Flow Target.
Does Filter_horizontal offer any opportunity for data reuse? What is the smallest buffer that we can use? (3 lines)
What is the optimal order for traversing the input data (column-wise or row-wise)? Assume that the input and output are stored in a BRAM. Motivate your answer. (3 lines)
Create a function Filter_horizontal_HW that is a version of Filter_horizontal_SW that you modified based on the insights from the previous two questions. You don’t have to use the streams at this point. Include the code in your report.
Pipeline the loop body of Filter_horizontal_HW. Write a testbench to verify Filter_horizontal_HW. Similar to the one we used in HW5, the testbench should compare the result of Filter_horizontal_SW and Filter_horizontal_HW and exit your program with a value of 1 if the output is not correct. If the output is correct, the testbench can simply print out “TEST PASSED”. The input of the functions can be arbitrary values. Verify that your test function works. Include the testbench in your report. What is the latency (in cycles) that Vitis HLS predicts? (1 line)
Note
Make sure you’ve selected the correct top function for the synthesis. Also check that you are not forcing another function as the top function in your constraints file like directives.tcl. You can create multiple Solutions in Vitis HLS for convenience.
Note
Remember that malloc() is not synthesizable. You can use a user-defined macro to separate simulation code from synthesis code, as shown in the HLS user guide.
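The testbench and the macro guard described above can be sketched roughly as follows. This is only an illustration under assumptions, not the reference solution: the frame sizes, the 3-tap average, the function signatures, the full-row scratch buffer, the macro name NO_SYNTH, and the helper run_testbench are all made up for this sketch, and the two filter bodies are placeholders for your own Filter_horizontal_SW and Filter_horizontal_HW.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical sizes; the real ones come from the provided source files.
constexpr int WIDTH = 16, HEIGHT = 4, TAPS = 3;
constexpr int OUT_WIDTH = WIDTH - (TAPS - 1);

// Placeholder reference: 3-tap horizontal average (stand-in for Filter_horizontal_SW).
void Filter_horizontal_SW(const unsigned char *in, unsigned char *out) {
  for (int y = 0; y < HEIGHT; y++)
    for (int x = 0; x < OUT_WIDTH; x++) {
      unsigned sum = 0;
      for (int i = 0; i < TAPS; i++) sum += in[y * WIDTH + x + i];
      out[y * OUT_WIDTH + x] = static_cast<unsigned char>(sum / TAPS);
    }
}

// Placeholder function under test (stand-in for your Filter_horizontal_HW).
void Filter_horizontal_HW(const unsigned char *in, unsigned char *out) {
#ifdef NO_SYNTH
  // Simulation-only path: dynamic allocation is acceptable on the host.
  unsigned char *row = static_cast<unsigned char *>(std::malloc(WIDTH));
#else
  // Synthesis path: a fixed-size array that can be mapped to on-chip memory.
  unsigned char row[WIDTH];
#endif
  for (int y = 0; y < HEIGHT; y++) {
    for (int x = 0; x < WIDTH; x++) row[x] = in[y * WIDTH + x];
    for (int x = 0; x < OUT_WIDTH; x++) {
      unsigned sum = 0;
      for (int i = 0; i < TAPS; i++) sum += row[x + i];
      out[y * OUT_WIDTH + x] = static_cast<unsigned char>(sum / TAPS);
    }
  }
#ifdef NO_SYNTH
  std::free(row);
#endif
}

// Returns 0 on success and 1 on the first mismatch.
int run_testbench() {
  static unsigned char input[WIDTH * HEIGHT];
  static unsigned char sw[OUT_WIDTH * HEIGHT], hw[OUT_WIDTH * HEIGHT];
  // Arbitrary but deterministic input values.
  for (int i = 0; i < WIDTH * HEIGHT; i++)
    input[i] = static_cast<unsigned char>((i * 37 + 11) & 0xFF);
  Filter_horizontal_SW(input, sw);
  Filter_horizontal_HW(input, hw);
  for (int i = 0; i < OUT_WIDTH * HEIGHT; i++)
    if (sw[i] != hw[i]) {
      std::printf("TEST FAILED at index %d\n", i);
      return 1;
    }
  std::printf("TEST PASSED\n");
  return 0;
}
```

In an actual testbench, main would simply return run_testbench(), so that a mismatch makes the program exit with a value of 1.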
Accelerating the Filter vertical
Let’s continue with accelerating Filter_vertical_HW. We could store pixels that are used multiple times in a buffer that is mapped to a local memory. Assuming we still produce the output pixels in the same order as Filter_horizontal_HW, what is the smallest buffer that we can use? Motivate your answer. (3 lines)
What is the optimal order for traversing the input data (column-wise or row-wise) with respect to FPGA on-chip memory usage? Assume that the input and output data are stored in a BRAM. Motivate your answer. (3 lines)
Create a function Filter_vertical_HW that is a version of Filter_vertical_SW that you modified based on the insights from the previous two questions. You don’t have to use the streams yet. Include the code in your report.
Pipeline the loop body of Filter_vertical_HW. Write a testbench to verify Filter_vertical_HW. What is the latency (in cycles) that Vitis HLS predicts? (1 line)
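Pipelining a loop body, as asked for both filter functions, comes down to placing a pragma inside the loop. The sketch below shows the placement on a hypothetical element-wise function (scale_HW and N are made-up names, not part of the provided sources) rather than on the filter itself. A standard C++ compiler ignores the pragma, so the same function still runs unchanged in a testbench.

```cpp
constexpr int N = 64;  // hypothetical trip count

// Hypothetical example: out[i] = (in[i] * gain) >> 8.
void scale_HW(const unsigned char in[N], unsigned char out[N],
              unsigned char gain) {
  for (int i = 0; i < N; i++) {
    // Ask Vitis HLS to start a new loop iteration every clock cycle.
#pragma HLS PIPELINE II=1
    out[i] = static_cast<unsigned char>((in[i] * gain) >> 8);
  }
}
```

In a loop nest, the pragma normally goes in the body of the loop you want pipelined. The latency and the achieved initiation interval then appear in the synthesis report.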
hls::stream
Write a verification function for Filter_HW. Verify that your test function works. Include the test function in your report.
Create a function Filter_HW that connects both parts of the filter together. Store the intermediate results in a local array. Include Filter_HW in your report. Use the default data movers. Also include the testbench’s output in your report. What is the expected latency (in cycles) of Filter_HW?
We could replace the local array in Filter_HW with a stream. Assume that the stream requires no resources for buffering. What impact do you expect that will have on the resource consumption? Quantify your answer. (3 lines)
Replace the local array with an hls::stream object and insert a dataflow pragma into Filter_HW. The hls::stream class is declared in hls_stream.h. Modify the remaining functions as necessary. Include Filter_HW and any other significant changes in your report.
Hint
We are concerned with streaming now, and that could merit a reconsideration of how we traverse the data.
What is the latency of Filter_HW that Vitis HLS predicts? Make sure you verify your code. (1 line)
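Replacing a local array with an hls::stream under a dataflow pragma can be sketched as follows. This is a minimal illustration under assumptions: the two stages (stage_a, stage_b), the top function two_stage_HW, and the size N are hypothetical and do trivial per-pixel work instead of filtering. The small stand-in class is there only so the sketch compiles on a host without Vitis HLS; it mimics just the read(), write(), and empty() members and is not the real implementation.

```cpp
#if __has_include(<hls_stream.h>)
#include <hls_stream.h>  // the real class when compiled with Vitis HLS
#else
#include <queue>
// Host-only stand-in (assumption: only read/write/empty are needed here).
namespace hls {
template <typename T>
class stream {
  std::queue<T> q_;

 public:
  void write(const T &v) { q_.push(v); }
  T read() {
    T v = q_.front();
    q_.pop();
    return v;
  }
  bool empty() const { return q_.empty(); }
};
}  // namespace hls
#endif

constexpr int N = 32;  // hypothetical number of pixels

// First stage: produces one value per input pixel into the stream.
static void stage_a(const unsigned char *in, hls::stream<unsigned char> &mid) {
  for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
    mid.write(static_cast<unsigned char>(in[i] >> 1));
  }
}

// Second stage: consumes the values in the same order they were produced.
static void stage_b(hls::stream<unsigned char> &mid, unsigned char *out) {
  for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
    out[i] = static_cast<unsigned char>(mid.read() + 1);
  }
}

// Top function: the stream takes the place of an intermediate local array.
void two_stage_HW(const unsigned char *in, unsigned char *out) {
#pragma HLS DATAFLOW
  hls::stream<unsigned char> mid;
  stage_a(in, mid);
  stage_b(mid, out);
}
```

A stream can only be read in the order it was written, so the consumer has to traverse the data in the same order as the producer, which is what the hint above is pointing at.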
Moving on to HW
Partition Filter_HW in a Load-Compute-Store pattern as we did in HW6 (Partition the Code into a Load-Compute-Store pattern). Verify the code and include the final code in the report.
Export your Filter_HW as a .xo file and build the .xclbin file as we did in HW5. Create a host code and include the other functions, such as scale, differentiate, and compress, so that they run on the ARM core. Run the Filter function on the FPGA. Use the same Input.bin as input data and Golden.bin from HW3 to verify the output. Use O2 as the optimization level when compiling the host code. Include the host code in the report.
Note
Refer to the Makefile and the host code we used for the previous HWs. Collect the data before transferring it to the Filter kernel, and collect the data back after the kernel computation to feed it into the next stage, compress. You want to enable an out-of-order queue to overlap communication and computation.
Note
Don’t worry too much about the performance for now. In this question, we just want you to integrate the HW kernel with the rest of the application running on the CPU.
Report the application latency to process 200 frames. Compare it with the baseline application latency from HW3 (1 line).
How can you run the other stages on the processor concurrently with the Filter kernel on the FPGA? What speedup do you expect to achieve?
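The Load-Compute-Store partitioning asked for at the start of this section can be sketched as three functions connected by streams. As before, this is a sketch under assumptions and not the HW6 code: the names load, compute, store, lcs_HW, and N are hypothetical, the compute stage just inverts each pixel, and the stand-in stream class only exists so the example compiles without Vitis HLS.

```cpp
#if __has_include(<hls_stream.h>)
#include <hls_stream.h>  // the real class when compiled with Vitis HLS
#else
#include <queue>
// Host-only stand-in (assumption: only read/write/empty are needed here).
namespace hls {
template <typename T>
class stream {
  std::queue<T> q_;

 public:
  void write(const T &v) { q_.push(v); }
  T read() {
    T v = q_.front();
    q_.pop();
    return v;
  }
  bool empty() const { return q_.empty(); }
};
}  // namespace hls
#endif

constexpr int N = 40;  // hypothetical number of pixels

// Load: the only stage that reads from the input memory.
static void load(const unsigned char *in, hls::stream<unsigned char> &s) {
  for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
    s.write(in[i]);
  }
}

// Compute: touches only streams (placeholder work: invert each pixel).
static void compute(hls::stream<unsigned char> &s_in,
                    hls::stream<unsigned char> &s_out) {
  for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
    s_out.write(static_cast<unsigned char>(255 - s_in.read()));
  }
}

// Store: the only stage that writes to the output memory.
static void store(hls::stream<unsigned char> &s, unsigned char *out) {
  for (int i = 0; i < N; i++) {
#pragma HLS PIPELINE II=1
    out[i] = s.read();
  }
}

// Top function: the three stages run as a dataflow pipeline.
void lcs_HW(const unsigned char *in, unsigned char *out) {
#pragma HLS DATAFLOW
  hls::stream<unsigned char> s_in, s_out;
  load(in, s_in);
  compute(s_in, s_out);
  store(s_out, out);
}
```

Keeping all memory accesses in load and store leaves the compute stage free of interface concerns, so it can be verified and tuned on its own.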
Deliverables
In summary, upload the following to their respective links in Canvas:
Writeup in PDF.