Homework Submission#

Your writeup should follow the writeup guidelines. Your writeup should include your answers to the questions below. Even if a certain question(like 1-(c) or 1-(f)) is just a “step”, please include it in your report and leave the bullet blank for the sake of easy grading.

  1. Accelerator Interface

    1. Run CPU version on Ultra96, and report the latency.

    2. For FPGA version, run the code, copy the Vitis Analyzer files to your computer and open them with Vitis Analyzer. Click Profile Summary, and then Summary to see Total application runtime and Total kernel runtime. Click Kernels & Compute Units to see only the kernel execution time. We will check these three latencies throughout this HW. Report the latencies.

    3. In the Vitis Analyzer, open the application timeline. Zoom in at the beginning of the kernel execution, and provide a screenshot in the write up. Based on the analyzer, suggest at least two ways of improving the performance of the FPGA code.

    4. We will now modify the kernel code.

      • In terminal, make sure you correctly sourced the settings, and open Vitis HLS, with:

        vitis_hls
        
      • Click on open project and browse to the your build generated directory: hw6/apps/mmult/_x/kernel/mmult_fpga/mmult_fpga and click open.

    5. Partition the HLS code into Load-Compute-Store Pattern as can be seen in this example

    and this tutorial. Enable dataflow with HLS DATAFLOW pragma and use hls::stream to pass data between Load, Compute, Store functions. Make additional changes to the code to achieve II=1.

    1. Make sure to run C simulation and verify that your HLS code is functionally correct. Provide the code in your report. Also, provide the screenshot of Performance & Resource Estimates table in the Synthesis Summary Report. Because you have Load, Compute, Store functions, expand each function in the table to show that you achieved II=1.

    2. Rebuild the project with the dataflow-enabled kernel, copy the binaries and boot files, reboot and test. This will take about 30 minutes to build. Report the latencies. Provide a screenshot of the relevant section of Application Trace from Vitis Analyzer.

    3. Our initial FPGA host code uses an in-order command queue. Find out how to use an out-of-order command queue to get overlap between communication and computation. Make the necessary change in the Host.cpp and provide the change in the report. Build the project with the modified host code. Report the three latencies. Provide a screenshot from Vitis Analyzer. We expect you to see something like Fig. 20 or Fig. 21.

      ../_images/comp_comm_overlap.png

      Fig. 20 Communication and Computation overlap#

      ../_images/comp_comm_overlap_2.png

      Fig. 21 Communication and Computation overlap when a kernel runtime is longer#

    4. Use the following command in your host machine. Report the clocks, memory ports and resources that are available on the platform:

      platforminfo $PLATFORM_REPO_PATHS/u96v2_sbc_base.xpfm
      
    5. Assign separate ports to the mmult_fpga, then rebuild and run the kernel. This will take about 30 minutes to build. Report the latencies. Provide a screenshot of the relevant section of Application Trace from Vitis Analyzer. Does assigning multiple ports on Ultra96 have any impact on your design? Save/Move the hw6/apps/mmult/_x folder of the project to somewhere else before doing the next question. We will use the outputs from this question in the next part.

      Hint

      • Learn about how to add multiple ports from here

      • Under hw6/apps/mmult/fpga there is a file called design.cfg. In that file you will need to add the commands to map the kernel arguments to the ports under the [connectivity] section. You can find the commands in the tutorial linked above. You can use the memory ports available from part i. There is no need to create a separate connectivity.cfg file and modify the Makefile.

      • Read this paper to find out how to efficiently use the ports on Ultra96 (optional).

    6. Learn about how to use multiple compute units from here and apply it to your design. Note that you need to modify your host code to get multiple compute units working. Use 2 mmult_fpga units. This can be done by modifying design.cfg.
      Rebuild the FPGA version, copy the binaries and boot files, reboot and test. This will take about >30 minutes to build (While this is building, you can work on Part 2: Analyze Implementation). Report the latencies. Provide a screenshot of the relevant section of Application Trace from Vitis Analyzer. If you are run out of the FPGA resources, report it.

    7. Put your latency results from the CPU version in part a, and your FPGA implementations in parts b, g, j, and k into a table, and show the speedup relative to the CPU version.

  2. Analyze Implementation

    In this question, we will investigate what the FPGA implementation of the matrix multiplication (Part 1k) look like using Vivado (not Vivado HLS). Vivado is part of the Vitis installation.

    1. Report how many resources and utilization percentage of each type (BlockRAM, DSP unit, flip-flop, and LUT) the implementation (Part 1k) consumes. You can find this information in the Implementation tab on the left hand side. Click Report Utilization under Open Implemented Design. Launch Vivado using the following commands and open the project you saved/moved from the location hw6/apps/mmult/_x/link/vivado/vpl/prj/prj.xpr. (4 lines)

      • In terminal, make sure you correctly sourced the settings, and open Vivado, with:

        vivado
        
    2. Report the expected power consumption of this design by clicking Report Power of the Implementation tab. (1 line)

    3. On the left top corner, you will see IP Integrator. Click Open Block Design under IP Integrator. Open the Address Editor by choosing the corresponding tab above the block design. In which memory region is the control interface of the accelerator wrapper mmult_fpga_1 mapped? This region is used for such communication as starting the accelerator and querying its status. Writes and reads by the ARM processor are to this region are sent over an AXI4-Lite bus to the accelerator wrapper, which handles them and controls the accelerator. (1 line)

    4. Open the timing report by going to the Implementation tab and pressing Design Timing Summary from the Timing tab. Click on the number next to Worst Negative Slack. Look at the Path Properties. Report in which of the hardware modules that we saw in the block design the path begins and ends. (1 line)

    5. Include a screenshot of the critical path in your writeup. Zoom in to make sure all elements of the path are clearly visible. Indicate the type of each element (e.g. LUT, flip-flop, carry chain) on the screenshot.

    6. Highlight the accelerators in green, the interconnect (M_AXI and S_AXI) in yellow. You can do this by right- clicking the modules in the Netlist view and selecting Highlight Leaf Cells. Include a screenshot of the entire device in your report.

Deliverables#

In summary, upload the following in their respective links in canvas:

  • writeup in pdf.