Using On-Chip SRAM Memory

Blackfin processors support a hierarchical memory model with different performance and size parameters, depending on the memory location within the hierarchy. Level 1 (L1) memories interconnect closely and efficiently with the core for best performance. Separate blocks of L1 memory can be accessed simultaneously through multiple bus systems. Instruction memory is separated from data memory, but unlike classical Harvard architectures, all L1 memory blocks are accessed through one unified addressing scheme. Portions of L1 memory can be configured to function as cache memory. Some Blackfin derivatives also feature on-chip Level 2 (L2) memories. Based on a von Neumann architecture, L2 memories have a unified purpose and can freely store instructions and data. Although L2 memories still reside inside the CCLK clock domain, they take multiple CCLK cycles to access. The processors also support an external memory space that includes asynchronous memory space for static RAM devices and synchronous memory space for dynamic RAM such as SDRAM devices. For details on the memory architecture, see the section on cache. For memory size, population, and off-chip memory interfaces, refer to the specific Blackfin Processor Hardware Reference manual for your derivative.

Using the L1 memory blocks well is key to running the Blackfin effectively and efficiently. Simply turning cache on uses only half of the available L1 SRAM in the system, and can lead to cache pollution or cache thrashing. To prevent these issues and make the best use of L1, you need to experiment with placing code and data in L1 (non-cached) memory and measure how that affects your criteria for overall system performance.

Allocating and managing L1 SRAM data banks A and B separately can sometimes yield a 2x performance improvement, since loads from both banks can occur simultaneously.
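A minimal sketch of this idea follows. It assumes the l1_data_A and l1_data_B variable attributes described under "Functions and Data in L1" below, and it assumes that GCC predefines __bfin__ when targeting Blackfin. The macro names IN_BANK_A and IN_BANK_B, the array names, and dot_product() are hypothetical and chosen only for illustration. On any other target the macros expand to nothing, so the code still builds as plain C. Whether the compiler actually issues paired loads depends on the optimizer and is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper macros: apply the Blackfin placement attributes only
 * when building for Blackfin (assumes GCC predefines __bfin__ there). */
#ifdef __bfin__
#define IN_BANK_A __attribute__ ((l1_data_A))
#define IN_BANK_B __attribute__ ((l1_data_B))
#else
#define IN_BANK_A
#define IN_BANK_B
#endif

#define TAPS 64

/* The two operands live in different banks so that one element of each
 * can, in principle, be loaded in the same cycle. */
int16_t coeffs[TAPS] IN_BANK_A;
int16_t samples[TAPS] IN_BANK_B;

/* Multiply-accumulate over two arrays of n elements. */
int32_t dot_product(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    size_t i;

    for (i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}
```

A caller would fill coeffs and samples and then call dot_product(coeffs, samples, TAPS). To see whether the split helps, compare timings against a build that places both arrays in the same bank.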

Using On-Chip SRAM in User Space

Applications in L1

Currently, you can either put the whole code segment and the whole data segment of an executable or shared library into L1 SRAM, or use GCC attributes to put specific functions and variables into L1 SRAM.

Applications must be compiled with bfin-linux-uclibc-gcc (FDPIC ELF executables) to be placed in L1 memory. bfin-uclinux-gcc (flat format) does not support placing individual pieces into L1 memory.

Putting whole code segment and whole data segment into L1

Usage
  1. To put application code into L1 SRAM, add the option -fno-jump-tables to gcc when compiling, and add the options -pie -Wl,--sep-code -Wl,--code-in-l1 -Wl,-z,now -shared-libgcc to gcc when linking.
  2. To put shared library code into L1 SRAM, add the option -fno-jump-tables to gcc when compiling, and add the options -Wl,--sep-code -Wl,--code-in-l1 -Wl,-z,now -shared-libgcc to gcc when linking.
  3. To put application or shared library data into L1 SRAM, add the option -Wl,--data-in-l1 to gcc when linking.
Test Code
#include <stdio.h>
 
void foo ()
{
    return;
}
 
int a;
 
int main(int argc, char *argv[])
{
    /* Due to FDPIC ELF, the address of foo is the address of the FDPIC entry
     * rather than the address of foo itself.
     */
    printf("foo=%p a=%p\n", ((void **)foo)[0], (void *)&a);
    return 0;
}

To put the whole data and code sections into L1:

bfin-linux-uclibc-gcc -pie -Wl,-sep-code -Wl,-code-in-l1,-z,now -Wl,-data-in-l1 test.c -o test

And result:

root:/>./test
foo=0xffa00734 a=0xff803f90

Functions and Data in L1

Usage
  1. To put a function into L1 SRAM, add the following GCC attribute to its declaration:
        void foo () __attribute__ ((l1_text));
     Also pass the option -fno-jump-tables to GCC when compiling the file that contains the definition of foo ().
  2. To put a global variable into L1 SRAM, add the following GCC attribute to its definition:
        int a __attribute__ ((l1_data_A));

All possible attributes for variables are:

  • l1_data_A: put the variable into L1 data bank A SRAM
  • l1_data_B: put the variable into L1 data bank B SRAM
  • l1_data: put the variable into L1 data bank A or bank B SRAM

Only global variables, file local variables and function local static variables can be put into L1 DATA SRAM.
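The three kinds of variable that qualify can be sketched as follows. This is an illustration only: the macro names PLACE_TEXT, PLACE_DATA_A, PLACE_DATA_B and PLACE_DATA, the variables, and scale_sample() are hypothetical, and the guard assumes that GCC predefines __bfin__ when targeting Blackfin. On other targets the macros expand to nothing, so the file still compiles.

```c
#include <stdint.h>

/* Hypothetical wrappers around the attributes listed above. They are active
 * only on Blackfin (assumes GCC predefines __bfin__ for that target). */
#ifdef __bfin__
#define PLACE_TEXT   __attribute__ ((l1_text))
#define PLACE_DATA_A __attribute__ ((l1_data_A))
#define PLACE_DATA_B __attribute__ ((l1_data_B))
#define PLACE_DATA   __attribute__ ((l1_data))
#else
#define PLACE_TEXT
#define PLACE_DATA_A
#define PLACE_DATA_B
#define PLACE_DATA
#endif

/* Global variable: eligible for L1 data SRAM. */
uint32_t frame_count PLACE_DATA_A;

/* File-local variable: eligible for L1 data SRAM. */
static int16_t gain PLACE_DATA_B = 3;

/* Function requested in L1 instruction SRAM. Compile this file with
 * -fno-jump-tables, as described above. */
PLACE_TEXT
int32_t scale_sample(int16_t x);

int32_t scale_sample(int16_t x)
{
    /* Function-local static variable: eligible for L1 data SRAM. */
    static uint32_t calls PLACE_DATA;

    /* An automatic variable such as y lives on the stack, so it cannot
     * take one of these attributes. */
    int32_t y = (int32_t)x * gain;

    calls++;
    frame_count = calls;
    return y;
}
```

Each call to scale_sample() multiplies its argument by gain and records the running call count in frame_count.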

Test code
#include <stdio.h>
 
__attribute__ ((l1_text))
void foo ()
{
    return; 
}
 
int a __attribute__ ((l1_data_A));
 
int main(int argc, char *argv[])
{
    /* Due to FDPIC ELF, the address of foo is the address of the FDPIC entry
     * rather than the address of foo itself.
     */
    printf("foo=%p a=%p\n", ((void **)foo)[0], (void *)&a);
    return 0;
}

To build:

bfin-linux-uclibc-gcc -fno-jump-tables test.c -o test

The option -fno-jump-tables tells the compiler not to use jump tables for switch statements, even where they would be more efficient than other code generation strategies.

And result:

root:/var> ./test 
foo=0xffa00624 a=0xff803e58

Dynamically Allocating

The kernel now provides three Blackfin-specific system calls, which are exported by uClibc. They can be used from both FDPIC and FLAT binaries.

To use these three functions, include bfin_sram.h:

#include <bfin_sram.h>
 
void *sram_alloc (size_t size, unsigned long flags);
int sram_free (void *addr);
void *dma_memcpy (void *dest, const void *src, size_t size);

sram_alloc

void *sram_alloc (size_t size, unsigned long flags)

  • flags can be any one of the following values, or a combination of them:
    1. L1_INST_SRAM Allocate instruction SRAM
    2. L1_DATA_A_SRAM Allocate data A bank SRAM [Currently, data B bank SRAM is allocated if data A bank cannot be allocated.]
    3. L1_DATA_B_SRAM Allocate data B bank SRAM
    4. L1_DATA_SRAM Allocate data A or B bank SRAM
    5. L2 Allocate L2 SRAM (data or instruction)
  • size is the amount of memory required, in bytes.
  • The return value is the address of the first byte of the allocated memory on success, or NULL on failure.

If using SMP Linux on BF561, make sure you do not place any stacks (like pthreads) in L2. See anomaly 05000428 for more information.
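Because sram_alloc () can fail when L1 is full, callers often want a fallback. The sketch below tries L1 data SRAM first and falls back to the ordinary heap. It is one possible pattern, not part of the API: struct fast_buf, fast_buf_alloc() and fast_buf_free() are hypothetical names, and the guard assumes that GCC predefines __bfin__ when targeting Blackfin. On other targets only the malloc path is compiled.

```c
#include <stddef.h>
#include <stdlib.h>

#ifdef __bfin__
#include <bfin_sram.h>   /* sram_alloc(), sram_free(), L1_DATA_SRAM */
#endif

/* Hypothetical handle that remembers which allocator owns the block,
 * so that it is later released with the matching call. */
struct fast_buf {
    void *ptr;
    size_t size;
    int in_sram;   /* non-zero if ptr came from sram_alloc() */
};

/* Try L1 data SRAM first, then fall back to the heap.
 * Returns 0 on success, -1 if both allocators fail. */
int fast_buf_alloc(struct fast_buf *buf, size_t size)
{
    buf->ptr = NULL;
    buf->size = size;
    buf->in_sram = 0;

#ifdef __bfin__
    buf->ptr = sram_alloc(size, L1_DATA_SRAM);
    if (buf->ptr != NULL) {
        buf->in_sram = 1;
        return 0;
    }
#endif

    buf->ptr = malloc(size);
    return buf->ptr != NULL ? 0 : -1;
}

/* Release the block with the allocator that produced it.
 * Returns 0 on success, -1 if sram_free() reports a failure. */
int fast_buf_free(struct fast_buf *buf)
{
    int rc = 0;

    if (buf->ptr == NULL)
        return 0;

    if (buf->in_sram) {
#ifdef __bfin__
        rc = sram_free(buf->ptr);
#endif
    } else {
        free(buf->ptr);
    }

    buf->ptr = NULL;
    buf->in_sram = 0;
    return rc;
}
```

Recording the origin of the block matters because the address passed to sram_free () has to be one that sram_alloc () returned. A heap pointer must go to free () instead.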

sram_free

int sram_free (void *addr)

  • addr is the address of the first byte of the SRAM memory block. It has to be a value returned by sram_alloc ().
  • The return value is 0 on success, or -1 on failure.

dma_memcpy

void *dma_memcpy (void *dest, const void *src, size_t size)

Like memcpy, except that dma_memcpy () copies memory through DMA, so it can be used to copy into or out of L1 instruction SRAM.

The return value is dest on success, or NULL on failure.
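The sketch below wraps dma_memcpy () and adds a read-back check. It is an illustration under assumptions: l1_copy() and l1_copy_verified() are hypothetical names, and the guard assumes that GCC predefines __bfin__ when targeting Blackfin. On other targets the wrapper falls back to memcpy so the logic can be exercised on a host. The check copies the data back out through the same path into an ordinary scratch buffer rather than reading the destination directly, on the assumption that the destination may be L1 instruction SRAM.

```c
#include <stddef.h>
#include <string.h>

#ifdef __bfin__
#include <bfin_sram.h>   /* dma_memcpy() */
#endif

/* Copy size bytes from src to dest. On Blackfin the copy goes through DMA,
 * so dest or src may be in L1 instruction SRAM. Returns dest on success,
 * or NULL on failure. */
void *l1_copy(void *dest, const void *src, size_t size)
{
    if (dest == NULL || src == NULL)
        return NULL;
#ifdef __bfin__
    return dma_memcpy(dest, src, size);
#else
    return memcpy(dest, src, size);
#endif
}

/* Copy src to dest, copy dest back into scratch, and compare scratch with
 * src. scratch must be ordinary data memory of at least size bytes.
 * Returns 0 if the round trip matches, -1 otherwise. */
int l1_copy_verified(void *dest, const void *src, void *scratch, size_t size)
{
    if (scratch == NULL)
        return -1;
    if (l1_copy(dest, src, size) == NULL)
        return -1;
    if (l1_copy(scratch, dest, size) == NULL)
        return -1;
    return memcmp(scratch, src, size) == 0 ? 0 : -1;
}
```

A typical use would be to allocate a destination with sram_alloc (size, L1_INST_SRAM) and then call l1_copy_verified() with a scratch buffer on the heap.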

Test Code

#include <stdio.h>
#include <bfin_sram.h>
 
int main(int argc, char *argv[])
{
    char *sram = sram_alloc(256, L1_INST_SRAM);
    if (sram == NULL) {
        printf("sram_alloc failed\n");
        return 1;
    }
    printf("sram=%p\n", (void *)sram);
    sram_free(sram);
    return 0;
}

To build as BFLT:

bfin-uclinux-gcc -Wl,-elf2flt test.c -o test

These functions also work for FDPIC:

bfin-linux-uclibc-gcc test.c -o test

And result:

root:/var> ./test 
sram=0xffa00630

Application Stack in L1 Scratchpad

The scratchpad is a 4 KB area of L1 memory that can only be used for data. One possible use is to hold the application stack. For flat binaries, this can be done with bfin-uclinux-flthdr:

bfin-uclinux-flthdr -u -s 3500 my-program

This command will change the binary my-program to allocate 3500 bytes of stack space and place it in L1 memory. Note that the kernel will refuse to load the binary if the amount of stack reserved in the flat header, plus the space taken up by command line arguments and environment variables, is larger than the 4k available in scratchpad memory.

Multiple programs can use this feature at the same time, but in such a situation, context switches involve copying the stack in and out of scratchpad memory. This makes it unlikely to be a performance win.

In threaded applications, only one thread can have the stack in scratchpad memory. When using threads, it is better to use the sram_alloc functionality to allocate a stack in normal L1 data SRAM and then use e.g. pthread_attr_setstackaddr before creating the thread.

Note: This feature is not currently supported for the FDPIC format.