Making the Blackfin Perform
Optimizing source code to run on a specific platform can be challenging. While code creation tools are improving, they have not kept up with the rapidly increasing functionality and complexity of hardware. The more complex the hardware architecture, the harder it is to program in assembly language. This creates the need for abstraction through a robust C compiler or operating system. However, not all compilers handle source code in the same manner, and it can take many iterations of rewriting your C source to achieve efficient output.
While trial and test is a valuable method of optimizing source code, there are other methods. Two in particular are allowing the compiler to use built-in functions, and linking in C-callable, architecture-specific libraries. Combining built-in functions with core-specific libraries improves code performance without requiring the developer to get involved in the complexity of the architecture.
The Analog Devices Blackfin processor is supported by open source tools including a gcc compiler and the uClinux kernel. Recent efforts to incorporate built-in functions into the compiler and to port several signal-processing-specific libraries have greatly improved code efficiency.
Two code examples show how to accomplish this. First is a demonstration of how easy it is to measure performance. Then a short dot-product example is reviewed, walking through the trial-and-test methodology that many software developers still follow. Finally, a short Fast Fourier Transform example demonstrates the relative ease of including someone else's C-callable library functions in your application.
Measuring Performance
Blackfin, like many modern processor architectures, has a complex memory hierarchy: on-chip L1 (configurable as cache or SRAM), on-chip L2, and external L3 memory. How memory is configured, and where code runs from, affects execution speed more than any other parameter in the system. In a typical system, cache is turned on and the code in question is run once beforehand, pre-loading it into cache so that the time spent filling the cache is not measured. The basic method is:
#include <stdio.h>
#include <time.h>

void foo(void)
{
    /* code to be measured */
}

int main(void)
{
    clock_t start, stop, interval;

    foo();                           /* run once - load into cache */
    start = clock();                 /* capture clock */
    foo();                           /* take measurement */
    stop = clock();                  /* capture clock */
    interval = stop - start;         /* find the time */
    printf("%ld\n", (long)interval); /* print the results */
    return 0;
}
The most accurate way to capture the clock on the Blackfin is to read the processor's free-running CYCLES register, which counts core clocks directly and therefore gives an exact cycle count for the code under test.
If the measured period is long, the 32-bit CYCLES register may overflow, in which case the CYCLES2 register, which holds the upper 32 bits of the cycle counter, can be used as well. The hardware has a very convenient overflow-protection feature: a read of CYCLES (the LSW) latches the current value of CYCLES2 into a shadow register, and a subsequent read of CYCLES2 returns the MSW corresponding to that CYCLES read. Therefore, no overflow check is required when dealing with the internal cycle counter. See pages 21-24 of the ADSP-BF53x/BF56x Programmer's Reference. The full 64-bit counter can be read with:
static inline unsigned long long read_cycles (void)
{
    unsigned long long t0;
    asm volatile ("%0=cycles; %H0=cycles2;" : "=d" (t0));
    return t0;
}
To use:
unsigned long long start, end;
start = read_cycles();
foo(); /* measure here */
end = read_cycles();
Example - Dot Product
A dot product is a very common mathematical operation in many signal processing algorithms which takes two vectors and returns a scalar quantity. For example:
If a = (a1, a2, ..., an) and b = (b1, b2, ..., bn), then the dot product is:

    a · b = a1*b1 + a2*b2 + ... + an*bn
In generic, portable C code, this is a simple loop that multiplies corresponding elements of the two arrays and accumulates the sum.
Measured with the cycle counter above, this portable version takes 3838 cycles on the first (uncached) run and about 1470 cycles once cached (in each case, a lower cycle count is better).
Notice how the first run takes many more cycles than the later ones, since the function and data are not yet in the instruction and data caches. Once everything is in cache, instructions and data are fetched at CCLK (core clock) rates.
Since many signal processing applications rely on the processor's native multiply-accumulate (MAC) instruction, the next step is to use assembly. Even then, a straightforward assembly implementation may not approach the theoretical limit of the processor, which in the Blackfin case is two MACs per cycle (half a cycle per MAC).
Rewriting the inner loop in hand-coded Blackfin assembly, using the parallel multiply-accumulate instruction together with dual data loads, brings the cached run down from 1470 to 166 cycles.
This is more than an 8x improvement in performance! However, the code is now completely specific to the Blackfin and cannot run on any other architecture, and it is still more than 3x slower than the theoretical limit. Since the function and data are already being served from cache, the remaining optimizations must come from elsewhere.
The next step is to move the cycles measurement into the same function that performs the dot product, so the overhead of a function call around the multiply-accumulate loop is no longer included in the measurement.
This gets closer to the theoretical maximum, but is still roughly twice the theoretical minimum. This is due to the memory structure of the processor. The main part of the hardware loop is the parallel instruction A1 += R0.H * R1.H, A0 += R0.L * R1.L || R0 = [I0++] || R1 = [I1++]; where the two loads come through I0 and I1. If both pointers reference the same memory bank (see the memory overview), each access is stalled one cycle, because the two load data buses are served from L1 Bank A and L1 Bank B respectively.
The remaining stall can be removed by ensuring the two vectors are placed in different L1 data banks, so that both loads complete in a single cycle. With the vectors separated, the output is close to the theoretical maximum.
A 100-point dot product should take 50 core clock cycles on a Blackfin: 100 multiply-accumulates at two MACs per cycle. The code runs four test cases and reduces the execution time from 3838 cycles to 53 through the steps above.
Each test runs 10 times; in several of the tests you can see the cycle count falling over successive runs as the instruction and data caches are loaded.
This example showed that even hand-written assembly can run 2x to 4x slower than necessary if you do not understand the chip's memory structure, or simply because of the overhead of a function call. With a little optimization and some hand-coded assembly, it is possible to get full performance from the chip.
Writing hand-optimized assembly normally sounds terrible, and most people do not enjoy it much. However, for a few "inner loop" routines that run many times and fit entirely in cache, it can be worth the effort.
For those not wanting to write their own assembler routines, many pre-written signal processing algorithms are available.
Example - FFT
Another good example of a commonly used function is the Fast Fourier Transform (FFT). It is an efficient algorithm to compute the discrete Fourier transform (DFT) and its inverse. FFTs are of great importance to a wide variety of applications, from signal processing to solving partial differential equations to algorithms for quickly multiplying large integers.
Given a sequence x0, x1, ..., xN-1 of N complex numbers, the DFT is:

    Xk = sum(n = 0 .. N-1) xn * e^(-2*pi*i*k*n/N),   for k = 0, ..., N-1
Evaluating these sums directly takes O(N^2) arithmetic operations, whereas an FFT computes the same result in O(N log N). For N = 1024 that is roughly a million operations versus about ten thousand.
There are many algorithms for computing an FFT, Cooley-Tukey being one of the most common.
A straightforward, generic C implementation was benchmarked first.
Performing this function on a 1024-point dataset takes over 1.2 million cycles:
1261633 1273733 1268853 1250113 1268218 1270938 1249993 1268398 1267898 1251448
An alternative solution is to use the cfft_fr16 function from the libbfdsp library:
void cfft_fr16(const complex_fract16 in[],   /* input sequence */
               complex_fract16 t[],          /* temporary working buffer */
               complex_fract16 out[],        /* output sequence */
               const complex_fract16 w[],    /* twiddle sequence */
               int wst,                      /* twiddle factor stride */
               int n,                        /* number of FFT points */
               int block_exponent,           /* block exponent of output data */
               int scale_method);            /* scaling method desired:
                                                0-none, 1-static, 2-dynamic */
The two implementations are functionally equivalent, but there is more than a 50x difference in performance: the library function takes only 18,400 cycles.
As another point of reference, the following link points to a page that optimizes an FFT with compiler intrinsics and inline assembly: example-builtin
Conclusion
From these examples we can see that the more complex the algorithm, the greater the opportunity for optimization. Moving the dot product from C to assembly decreased the run time from 1470 to 166 cycles, but only after modifying the memory layout did it reach 53 cycles, close to the theoretical minimum of 50. In the FFT example, simply using the provided library made more than a 50x difference in performance. The Blackfin library contains a broad collection of functions commonly required by signal processing applications, including companders, filters, and Fast Fourier Transform (FFT) functions, so it provides the basic building blocks for many algorithms.
Developers who examine and use the provided run-time libraries will end up with code that outperforms generic implementations; however, it will not always be portable to other processor architectures without falling back to the unoptimized C code.
Related Links
on-chip_sram
optimization
built-in_functions
How to Make your Blackfin Fly Part 1
How to Make your Blackfin Fly Part 2
Acknowledgements
David Rowe helped put this together.