# CE515 Advanced Processor Architecture and SoC Design

This course is divided into two parts. The first part deals with SoC design and programming and is supervised by David Hély. The second part focuses on the SIMD extension of the ARM architecture. The goal of this lab is to implement and develop algorithms (functions) optimized for the NEON extension included in the ARMv7-A and ARMv7-R architectures. Slides of the course can be found here. This course requires basic knowledge of microcontroller architecture and classical C programming. Processing will run on a Xilinx Zynq 7000 SoC, which contains an ARM Cortex-A9 MPCore. The performance of the NEON engine will be compared with that of a classical implementation (a C function executed on the ARM core).

## Introduction

Figure 1: Zynq 7000 architecture

Many peripherals generate (or accept) data whose size is smaller than the processor register size (e.g., a 16-bit ADC with a 32-bit processor). When the CPU processes such data one element at a time, the processor registers are not fully used, which reduces the efficiency of the system. SIMD technology uses a single instruction to perform the same operation on multiple data elements at the same time, which increases the execution speed.

The NEON extension was introduced with the ARMv7 architecture. This extension adds a set of registers (64 and 128 bits wide) and a SIMD instruction set [1]. NEON instructions and ARM instructions are processed at the same time. The exact execution time depends on the Cortex core used by the SoC [2], but all NEON instructions can be processed by the NEON engine. Classical instructions (ARM, Thumb, Thumb-2) are described in the reference manual [3].
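To make the SIMD idea concrete, here is a minimal sketch that adds eight 16-bit integers with a single NEON instruction through the intrinsics of `arm_neon.h`. The function name `add8_s16()` is hypothetical. The scalar branch is only there so the same file also builds on a machine without NEON. With GCC, the NEON branch is expected to require an option such as `-mfpu=neon`.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Adds two groups of eight 16-bit integers, lane by lane.
 * add8_s16() is a hypothetical name used for illustration only. */
void add8_s16(const int16_t *a, const int16_t *b, int16_t *out)
{
#if defined(__ARM_NEON)
    int16x8_t va = vld1q_s16(a);          /* load 8 lanes into a 128-bit register */
    int16x8_t vb = vld1q_s16(b);
    vst1q_s16(out, vaddq_s16(va, vb));    /* one instruction adds all 8 lanes */
#else
    /* Scalar fallback: eight separate additions. */
    for (int i = 0; i < 8; i++) {
        out[i] = (int16_t)(a[i] + b[i]);
    }
#endif
}
```

On the scalar path the loop performs eight additions, while the NEON path performs them in one `vaddq_s16` operation on a 128-bit register.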

During this lab, we will use the Zybo development board. This board includes a Zynq 7000 System on Chip. This SoC is composed of an FPGA and a Cortex-A9 MPCore. Each core includes one NEON engine. Figure 1 presents the system architecture. The first four exercises evaluate the performance of simple processing algorithms without any operating system (bare-metal or standalone). The final exercise uses the NEON extension in a Linux environment.

## Standalone Application

In this first exercise, we will develop, build, and run a simple application on the Zybo board.

• Download the `void` project.
• Extract the project into a new directory on your computer.
• Observe the structure of the project and open the `main.c` source file in an editor.
```
#include <stdio.h>
#include "platform.h"
#include "xil_printf.h"

int main()
{
    init_platform();

    print("Hello World\n\r");

    cleanup_platform();
    return 0;
}
```

The `init_platform()` function is defined in the `platform.c` file and enables the cache memory of the Cortex-A9 MPCore. This function call is not mandatory. `printf()` sends its argument to the UART of the Zynq 7000. This function also supports the classical formatting of numbers.

• Open a terminal (`cmd.exe`) and go to your project location (using `cd`).
• Execute the following script: `C:\Xilinx\Vitis\2019.2\settings64.bat`. This script modifies the PATH environment variable so that the Xilinx tools can be called directly from the command line.
• The build process is fully automated using `make`. Open the `Makefile` in your editor:
```
CC=arm-none-eabi-gcc
CFLAGS=-mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=hard -c -Wall -I./include
LDFLAGS=-mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=hard -Wl,-build-id=none -specs=Xilinx.spec -Wl,-T -Wl,lscript.ld -Llib
LIBS=-Wl,--start-group,-lxil,-lgcc,-lc,--end-group

all: main.elf

main.o: main.c
	$(CC) $(CFLAGS) -o main.o main.c

platform.o: platform.c
	$(CC) $(CFLAGS) -o platform.o platform.c

main.elf: main.o platform.o
	$(CC) $(LDFLAGS) -o main.elf main.o platform.o $(LIBS)

clean:
	rm *.o main.elf
```

The first four lines are simple variable definitions. The rest of the file is composed of five rules. The first line of each rule names the target and its dependencies. For each rule, `make` compares the modification date of each dependency with that of the target. If one of the dependencies has been modified after the target, `make` runs the command(s) located after the first line (each of which must begin with a tab).

• To build the project invoke `make` at the prompt. You should observe the following output:
```
$ make
arm-none-eabi-gcc -c -Wall -I./include -o main.o main.c
arm-none-eabi-gcc -c -Wall -I./include -o platform.o platform.c
arm-none-eabi-gcc -mcpu=cortex-a9 -mfpu=vfpv3 -mfloat-abi=hard -Wl,-build-id=none -specs=Xilinx.spec -Wl,-T -Wl,lscript.ld -Llib -o main.elf main.o platform.o -Wl,--start-group,-lxil,-lgcc,-lc,--end-group
$
```

If the build succeeds, the executable file `main.elf` is generated. If it fails, the errors in the source files (or in the Makefile) have to be corrected.

• Check that the file `main.elf` has been generated in your project directory. Note that this file has been built for the ARM architecture and cannot be executed on your computer.

So far, we have successfully built our first application. The following steps transfer the executable file into the memory of the Zynq 7000 and run the program:

• Before transferring the executable file to the board, check that jumper JP5 is in the JTAG position and connect the PROG UART connector to the PC. Finally, switch on the SW4 switch (the LD11 diode should light up).

• Start the `xsct` software in a terminal:
```
$ xsct
```
• Type the following commands in `xsct` (one at a time):
```
connect -host localhost -port 3121
targets
targets 2
source ps7_init.tcl
ps7_init
dow main.elf
```

The function `ps7_init()` activates the clock, the memory controller, and the peripherals (UART). It has to be called before writing to the memory.

• In another window, observe the serial port with a terminal emulator (e.g. `putty` on Windows).

Use the following link settings: baud rate 115200, 8 data bits, 1 stop bit, no parity check.

• In `xsct`, now type the following commands:
```
con
stop
```

and observe the results on the serial port.

• Validate the correct execution of the program.

Before continuing with the next exercises, open three terminals: the first one will be used to build the executable file (with `make`); the second one will be used to transfer the program to the board (with `xsct`); the last one will be used to observe the results. Note that, for the second terminal, a single `xsct` session can be used to transfer multiple programs, and that the start command will automatically detect if a new version of the program is available and transfer it to the board.

## Timing

Programming with NEON is not straightforward, and many implementations can be evaluated. In order to compare the performance of different possible implementations, timing has to be used to accurately estimate the execution time of a given section of code. The Zynq 7000 provides a real-time timer which can be controlled using the Xilinx library. This timer is incremented at half the Cortex-A9 clock frequency.

• Add the header file `xtime_l.h` to `main.c` by adding the line:
```
#include <xtime_l.h>
```
• In the `main()` function, declare a variable of type `XTime`.
• Use the functions `XTime_SetTime(XTime Xtime)` and `XTime_GetTime(XTime *Xtime)` to set or get the value of the timer.
• Add a dummy loop in `main.c` and determine the execution time of the associated processing. Note that the processor frequency is given by the macro `XPAR_CPU_CORTEXA9_CORE_CLOCK_FREQ_HZ` defined in the file `xparameters.h`.
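The conversion from timer ticks to a duration can be sketched as follows. The helper `ticks_to_us()` is a hypothetical name, not part of the Xilinx library. It relies only on the statement above that the timer counts at half the CPU clock frequency. The comment shows how it might be combined with `XTime_GetTime()` on the board.

```c
#include <stdint.h>

/* Converts a tick difference of the timer into microseconds.
 * Assumption: the timer is incremented at half the CPU clock frequency.
 * ticks_to_us() is a hypothetical helper, not a Xilinx function.
 *
 * Possible use on the board (not compiled here):
 *     XTime t0, t1;
 *     XTime_GetTime(&t0);
 *     ... code under test ...
 *     XTime_GetTime(&t1);
 *     us = ticks_to_us(t1 - t0, XPAR_CPU_CORTEXA9_CORE_CLOCK_FREQ_HZ);
 */
double ticks_to_us(uint64_t ticks, uint32_t cpu_hz)
{
    double timer_hz = (double)cpu_hz / 2.0;    /* timer runs at half the CPU clock */
    return (double)ticks * 1.0e6 / timer_hz;
}
```

Measuring the difference between two readings, rather than resetting the timer with `XTime_SetTime()`, keeps the measurement independent of any other code that reads the same timer.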

We are now ready to develop our first programs using the NEON engine. In each exercise, we will implement a simple algorithm in C and evaluate its performance in terms of execution time. Then, we will implement the same algorithm using NEON instructions and compare the execution times.

## Sum of an array

• Create a new project by copying the `void` directory under a different name.
• Include the file `stdint.h`, which gives access to integer types of specified widths.
• Declare an array of size 2048 where each element is a 16-bit signed integer and initialize the array to a default value.
• Add a function `sum_c()` which computes the sum of this array.
• Determine the execution time of this function (this result will be used as a reference later).
• Add a new file `neon.c` to your project and write the function `sum_ni()` which computes the sum of the same array using NEON intrinsics.
• Modify the `Makefile` to add a new rule for the file `neon.c` (don't forget the options `-mcpu`, `-mfloat-abi` and `-mfpu`). Generate the new executable file and check the correctness of your new function.
• Determine its execution time and compare it with the one obtained using the classical implementation in C. Rebuild your application using different optimization options (`O1`, `O2` and `O3`).
• Observe the assembly routine generated by the compiler with the command:
```
$ arm-none-eabi-gcc -S neon.c -o neon.s
```

Locate the NEON instructions and compare them to the ones in `neon.c`.
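One possible shape for the two functions is sketched below. It is not the reference solution. The signatures (a pointer plus a length, and a 32-bit result) are assumptions, since the exercise does not fix them. The NEON path uses the pairwise add-and-accumulate intrinsic `vpadalq_s16`, which widens the 16-bit lanes into 32-bit accumulators. A scalar branch keeps the file buildable without NEON.

```c
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Plain C reference. Assumption: a 32-bit accumulator is wide enough
 * (2048 elements of at most 32767 fit easily). */
int32_t sum_c(const int16_t *x, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++) {
        acc += x[i];
    }
    return acc;
}

/* NEON intrinsics version, processing 8 elements per iteration. */
int32_t sum_ni(const int16_t *x, int n)
{
    int i = 0;
    int32_t acc = 0;
#if defined(__ARM_NEON)
    int32x4_t vacc = vdupq_n_s32(0);
    for (; i + 8 <= n; i += 8) {
        int16x8_t v = vld1q_s16(x + i);
        vacc = vpadalq_s16(vacc, v);      /* add adjacent pairs, accumulate in 32 bits */
    }
    /* Reduce the four 32-bit partial sums to one scalar. */
    int32x2_t half = vadd_s32(vget_low_s32(vacc), vget_high_s32(vacc));
    half = vpadd_s32(half, half);
    acc = vget_lane_s32(half, 0);
#endif
    /* Remaining elements (all of them when NEON is not available). */
    for (; i < n; i++) {
        acc += x[i];
    }
    return acc;
}
```

Comparing the value returned by `sum_ni()` with the one returned by `sum_c()` on the same array is a simple way to check correctness before measuring execution time.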

## Matrix Multiplication

We consider the multiplication of 2 matrices ${\displaystyle A}$ and ${\displaystyle B}$ of dimension 4x4 into a third matrix ${\displaystyle C}$.

• Create a new project.
• Write on a piece of paper the expression of each element ${\displaystyle c_{i,j}}$.
• Modify the `main()` function to declare two initialized matrices `A` and `B` and a matrix `C` (which is not initialized), each of size 4x4, where each element is a float (32 bits).
• Add a function `mat_product_c()` to compute the matrix product of `A` and `B` and store the result in the `C` matrix.
• Validate the result by executing your function and measure the execution time of the C function.
• Add a file `neon.c` and write a function `mat_product_ni()` to perform the matrix multiplication using NEON intrinsics. Check the correctness of your program.
• Determine the execution time and the gain that we can obtain by using the NEON extension.
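As a starting point, each element of the product is ${\displaystyle c_{i,j}=\sum _{k=1}^{4}a_{i,k}b_{k,j}}$. The sketch below shows one way to write both functions. It assumes row-major `float [4][4]` arrays, and the parameter lists are a guess since the exercise does not fix them. The NEON path builds each row of `C` as a sum of the rows of `B` scaled by the elements of one row of `A`.

```c
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Plain C reference: C = A * B for 4x4 row-major matrices. */
void mat_product_c(float A[4][4], float B[4][4], float C[4][4])
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            float acc = 0.0f;
            for (int k = 0; k < 4; k++) {
                acc += A[i][k] * B[k][j];
            }
            C[i][j] = acc;
        }
    }
}

/* NEON intrinsics version: row i of C = sum over k of A[i][k] * (row k of B). */
void mat_product_ni(float A[4][4], float B[4][4], float C[4][4])
{
#if defined(__ARM_NEON)
    float32x4_t b0 = vld1q_f32(B[0]);
    float32x4_t b1 = vld1q_f32(B[1]);
    float32x4_t b2 = vld1q_f32(B[2]);
    float32x4_t b3 = vld1q_f32(B[3]);
    for (int i = 0; i < 4; i++) {
        float32x4_t row = vmulq_n_f32(b0, A[i][0]);   /* row  = b0 * a_i0 */
        row = vmlaq_n_f32(row, b1, A[i][1]);          /* row += b1 * a_i1 */
        row = vmlaq_n_f32(row, b2, A[i][2]);
        row = vmlaq_n_f32(row, b3, A[i][3]);
        vst1q_f32(C[i], row);
    }
#else
    mat_product_c(A, B, C);   /* scalar fallback when NEON is not available */
#endif
}
```

This formulation avoids transposing `B`: each 128-bit register holds one complete row, so all four elements of a row of `C` are produced together.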

## Edge Detection

Figure: Element position

The goal of this exercise is to detect a large variation between the values of 2 neighboring elements in a 2D array. We wish to store the result in a new 2D array composed of only 2 values, 0 or 1 (0 if the neighboring elements are close in value, 1 otherwise).

For each element ${\displaystyle C}$, we can compute the quantity ${\displaystyle |G|}$:

${\displaystyle |G|=|E-W|+|N-S|}$

where ${\displaystyle |G|}$ can be viewed as an approximation of the magnitude of the gradient.

If ${\displaystyle |G|}$ is higher than a threshold value, then an edge is detected (and we store a 1 in the output array). If ${\displaystyle |G|}$ is lower than the threshold value, a 0 has to be placed inside the output array.
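The rule above can be sketched in C as follows. Several points are assumptions that the text leaves open: the mapping of N, S, W, E to array indices (N is the row above, W the column to the left), the treatment of border elements (set to 0 because a neighbor is missing), the choice of 0 when ${\displaystyle |G|}$ equals the threshold, and the parameter list of `edge_c()`.

```c
#include <stdint.h>
#include <stdlib.h>

#define N_ROWS 10
#define N_COLS 10

/* Reference edge detector: y[i][j] = 1 when |E - W| + |N - S| > threshold.
 * Assumptions: N = x[i-1][j], S = x[i+1][j], W = x[i][j-1], E = x[i][j+1];
 * border elements, which lack a neighbor, are set to 0. */
void edge_c(uint8_t x[N_ROWS][N_COLS], uint8_t y[N_ROWS][N_COLS], int threshold)
{
    for (int i = 0; i < N_ROWS; i++) {
        for (int j = 0; j < N_COLS; j++) {
            if (i == 0 || j == 0 || i == N_ROWS - 1 || j == N_COLS - 1) {
                y[i][j] = 0;
                continue;
            }
            int g = abs(x[i][j + 1] - x[i][j - 1])     /* |E - W| */
                  + abs(x[i - 1][j] - x[i + 1][j]);    /* |N - S| */
            y[i][j] = (g > threshold) ? 1 : 0;
        }
    }
}
```

Note that ${\displaystyle |G|}$ can reach 510 for 8-bit inputs, so the intermediate value is kept in an `int` rather than in a `uint8_t`.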

• Create a new project.
• Modify the `main()` function to declare two 2D arrays `x` and `y` of size 10x10 where each element is an 8-bit unsigned integer. Initialize the input array `x` to 0 for the upper triangular part and 100 for the lower triangular part.
• Add a function `edge_c()` to compute the values in the output array `y` (edge detection).
• Check the result by executing your program and determine the execution time of the C function.
• Add the file `neon.c` and write the function `edge_ni()` to perform the edge detection and store the result in the output array.
• Determine the execution time and the gain obtained using the NEON extension.
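For a 10x10 array, the eight interior elements of a row fit exactly into one 64-bit NEON register, which suggests the following sketch of `edge_ni()`. It is only one possible approach, not the reference solution. It assumes that N is the row above and W the column to the left, that border elements are set to 0, and that the threshold is passed as an 8-bit value. The absolute differences are widened to 16 bits so that their sum cannot overflow.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Edge detection on a 10x10 array, one interior row (columns 1..8) at a time.
 * Assumptions: borders are set to 0; an edge is detected when |G| > threshold. */
void edge_ni(uint8_t x[10][10], uint8_t y[10][10], uint8_t threshold)
{
    memset(y, 0, 10 * 10);                        /* borders stay at 0 */
    for (int i = 1; i < 9; i++) {
#if defined(__ARM_NEON)
        uint8x8_t e = vld1_u8(&x[i][2]);          /* east neighbors of columns 1..8 */
        uint8x8_t w = vld1_u8(&x[i][0]);          /* west neighbors */
        uint8x8_t n = vld1_u8(&x[i - 1][1]);      /* north neighbors */
        uint8x8_t s = vld1_u8(&x[i + 1][1]);      /* south neighbors */
        /* |E - W| + |N - S|, widened to 16 bits */
        uint16x8_t g = vaddq_u16(vabdl_u8(e, w), vabdl_u8(n, s));
        /* lanes become 0xFFFF where g > threshold, 0 elsewhere */
        uint16x8_t m = vcgtq_u16(g, vdupq_n_u16(threshold));
        /* narrow to 8 bits (0xFF or 0) and shift to obtain 1 or 0 */
        vst1_u8(&y[i][1], vshr_n_u8(vmovn_u16(m), 7));
#else
        /* Scalar fallback when NEON is not available. */
        for (int j = 1; j < 9; j++) {
            int g = abs(x[i][j + 1] - x[i][j - 1]) + abs(x[i - 1][j] - x[i + 1][j]);
            y[i][j] = (g > threshold) ? 1 : 0;
        }
#endif
    }
}
```

A larger image would need an inner loop over blocks of 8 (or 16) columns plus a scalar tail, which this fixed-size sketch deliberately leaves out.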

## Linux Application

Figure: Edge Detection

The objective of this part is to start a Linux distribution on the Zybo development board and to evaluate the performance of our edge detection application on a bitmap image. Several distributions are available for the Zynq 7000 SoC. Some of them provide a file system only in RAM. Others can use a file system stored on an external flash memory (SD card). The Linaro distribution is Ubuntu-based and provides the same tools as a desktop distribution (graphical user interface). Before performing the following operations, check that the Zybo board is switched off. This exercise can also be carried out with a Raspberry Pi 2 or 3 (`rpi.zip`).

• Insert the SD card in the slot located under the board.
• Place jumper JP5 in the SD position. The board will now start from the SD card.
• Connect a USB cable between the UART connector and the PC. Connect the USB hub to the OTG connector.
• Connect the HDMI output to a screen by using an adaptor.
• Switch on the board. You should see the kernel startup messages on the UART output, followed by the root prompt.
• Go to the `/home/linaro` directory and modify the existing project to add your edge detection function (add the C version and the NEON version).
• Build the project with the Makefile.
• Start the program with the name of a bitmap image as a second argument. The program should generate the image `output.bmp`.
• We wish to analyze the performance of our program with `gprof`. This tool periodically interrupts the program under test to evaluate the load of each function on the processor. In order to increase the accuracy of the measurement, add a loop to repeat the whole process 100 times. Rebuild your program using the `-pg` option.
• Execute your newly generated program (with the `edge_c()` function). Notice that running the program now generates a `gmon.out` file. This file contains the profiling information. Finally, execute the following command:
`$ gprof test gmon.out > res`
• Open the `res` file in an editor to see the results.
• Replace the `edge_c()` function with the `edge_ni()` function and profile the program again.
• If you have enough time, replace the functions `color2gray_c()` and `gray2color_c()` with new functions using the NEON engine.

## Conclusion

At the end of these labs, you should now be able to develop, build, and optimize an application using the NEON engine present in Cortex-A processors.

## References

1. NEON Programmer's Guide Version 1.0 (ARM DEN0018A)
2. Cortex-A9 NEON Media Processing Engine, Technical Reference Manual (ARM DDI 0409)
3. ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition (ARM DDI 0406)