
Cache Management

Memory Overview

Blackfin processors have a unified 4G byte address range that spans a combination of on-chip and off-chip memory and memory-mapped I/O resources. The memory model is hierarchical, with different performance and size parameters depending on the memory's location within the hierarchy, and with some of the address space dedicated to internal, on-chip resources. The processor populates portions of this internal memory space with:

  • Level 1 (L1) Static Random Access Memories (SRAM) - L1 memories connect closely and efficiently to the core for best performance. Separate blocks of L1 memory can be accessed simultaneously through multiple bus systems. Instruction memory is separated from data memory, but unlike classical Harvard architectures, all L1 memory blocks are accessed through one unified addressing scheme.
    • Instruction SRAM (0-80k): only accessible as instruction SRAM. No core loads/stores may target these locations.
    • Data SRAM (0-64k): only accessible as data SRAM. Code cannot execute from this memory.
    • Scratchpad SRAM (4k): only accessible as data SRAM and cannot be configured as cache memory. This memory cannot be accessed by the DMA controller.
  • Level 1 (L1) Cache - Portions of L1 memory can be configured to function as cache. This memory is accessed at full processor speed.
    • Instruction cache (0-16k) : a 4-way set-associative cache.
    • Data cache (0 - 32k): a 2-way set-associative cache.
  • Level 2 (L2) Static Random Access Memories (SRAM) - L2 memories, available on select Blackfin derivatives, feature a von Neumann architecture. L2 memories have a unified purpose and can freely store instructions and data. Although L2 memories still reside inside the CCLK clock domain, they take multiple CCLK cycles to access. On all existing Blackfin derivatives with L2, L2 memory cannot be configured as cache.
  • A set of memory-mapped registers (MMRs)
  • A boot Read-Only Memory (ROM)
  • Level 3 (L3) memory is the external memory space, which includes asynchronous memory space for static memory devices such as flash, and synchronous memory space for dynamic RAM such as SDRAM or DDR devices.

Differences between L1 SRAM and Cache

Since off-chip SDRAM/DDR (L3) access time (133MHz) is much slower than the maximum core processor speed (600 MHz or more), if all memory accesses were limited to L3 timing, the processor could not take advantage of its own higher clock speeds, because it would be waiting for data to be read from or written to external memory. To overcome this problem, on-chip L1 memory is added, giving a 4 to 6 times speed advantage over L3.

The L1 memory system provides high bandwidth and low latency (single clock cycle) performance. Because SRAM provides deterministic access time and very high throughput, signal processing systems have traditionally achieved performance improvements by providing fast on-chip SRAM which operates at core clock speed. However, using on-chip L1 SRAM requires the application or algorithm to be tightly coupled to the targeted device, since each Blackfin derivative has different amounts of L1 SRAM.

The architecture of the L1 SRAM is known as a modified Harvard architecture: data and instruction memories are separate and have separate buses to the core, but share one unified addressing scheme. Standard applications, which obtain memory through the usual mechanisms (heap, stack, bss, data, or rodata sections), are unable to use this memory. An application that wants to use L1 SRAM must make non-standard calls to allocate L1 data memory, and define non-standard sections so the kernel knows which portion of the application to load into L1 instruction memory.

The instruction and data caches (SRAMs with cache control hardware) provide high performance while working within the standard programming model. Since caches eliminate the need to explicitly manage data movement into and out of L1 memories, code can be ported to or developed for the processor quickly, without requiring performance optimization for the memory organization. This is the model in which all standard Linux applications work. On many Blackfin derivatives, there is L1 instruction and data SRAM which cannot be configured as cache and is therefore unused by standard Linux applications. It is possible to rewrite a Linux application to take advantage of the L1 SRAM found on the Blackfin, but that application is then targeted specifically at the Blackfin and will not compile for other architectures (however, this type of optimization is common in embedded environments).

Different Blackfin processors have different amounts of each L1 and L2 (while the amount of L3 depends on the board design). For a complete list, check the Product Selection Table.

  • Not all the above Blackfin derivatives are supported by Linux. For a list of supported processors, check out the features page.
  • Lower speed grade parts are available, check datasheet for details.
  • L1 memory cache blocks may be configured as a mix of SRAM and/or cache in 16k increments. For example a 32k data cache can be configured as a 16k data cache, and an extra 16k of SRAM.

Blackfin memory architecture

Since the Blackfin processor has multiple memory buses and treats L1 separately from L3, the core is able to do up to four core memory accesses per core clock cycle:

  • one 64-bit instruction fetch,
  • two 32-bit data loads (assuming the loads are from different L1 banks, more about this later), and
  • one pipelined 32-bit data store.

This also allows simultaneous system DMA, cache maintenance, and core accesses. The combination of these increases the throughput and performance of the Blackfin processor, provided everything is configured properly and your code is laid out to take advantage of it.

In the following sections, different strategies for increasing performance will be described. This will include how cache works, and how to ensure proper configuration of cache, as well as how to put different parts of the kernel or your application into L1 instruction or data SRAM.

The processor provides a dedicated 4K byte bank of scratchpad data SRAM. The scratchpad is independent of the configuration of the other L1 memory banks and cannot be configured as cache or targeted by DMA. Typical applications use the scratchpad data memory where speed is critical. For example, small stacks can be placed in scratchpad memory for the fastest context switching during interrupt handling. This is not done for the Linux kernel, since Linux kernel stacks are 8k; the exception processing stack, however, can be placed in L1 since it is only 1k.

Cache Overview

Unfortunately, to make cache work well on the Blackfin, you need to understand the details of how it operates.

Instruction Cache

The L1 instruction cache (when enabled) is a 16 kbyte, 4-way set-associative cache. To improve the average access latency for critical code sections, each Way or line of the cache can be locked independently. When the memory is configured as cache, it cannot be accessed directly.

When cache is enabled, only memory regions further specified as cacheable (by the CPLBs) will be cached.

Cache Organization

To understand the cache organization, and how to lay out your application, it is worth understanding how the cache is structured, and how physical memory is mapped in cache.

The 16 kbytes of instruction cache are split into 4 x 4 kbyte sub-banks. There is a one-to-one mapping between a physical address and the specific 4k sub-bank: bits 12 and 13 of the physical address select which sub-bank is the target, so every 4k a different sub-bank is used. This helps with large linear applications. However, two frequently called functions whose physical addresses happen to be exactly 16k apart (bits 12 and 13 are the same) will compete for the same 4k sub-bank of instruction cache; this is resolved by the set associativity, since they can both fit in cache in different Ways.

These 4k sub-banks are split into 1k Ways, known as WAY 0, WAY 1, WAY 2, and WAY 3 (because we like to be innovative when it comes to naming things). The Way to use is chosen by a least recently used (LRU) algorithm, which determines which cache line should be replaced when a cache miss occurs.

In each Way the cache consists of a collection of 32 cache lines. Each cache line is made up of a tag component and a data component.

  • The tag component incorporates a 20-bit address tag, least recently used (LRU) bits, a Valid bit, and a Line Lock bit.
  • The data component is made up of four 64-bit words of instruction data. (256 bits in total)

The address tag consists of the upper 18 bits plus bits 11 and 10 of the physical address. Bits 12 and 13 of the physical address are not part of the address tag; instead, these bits are used to identify the 4K byte memory sub-bank targeted for the access. This means the mapping from physical address to cache line looks like this:

  • Bits [31:14] - address tag (upper 18 bits)
  • Bits [13:12] - 4K byte sub-bank select
  • Bits [11:10] - address tag (lower 2 bits)
  • Bits [9:5] - line (set index within the sub-bank)
  • Bits [4:0] - byte offset within the 32-byte line

To determine which way things are in, you must check the cache line tag.
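
As an illustration of this mapping, the bit fields above can be pulled apart in C. This is only a sketch of the arithmetic (the structure and function names are invented, not anything in the kernel):

#include <stdint.h>
#include <stdio.h>

/* Decompose a physical instruction address according to the bit
 * layout described above (illustrative only). */
struct icache_index {
    unsigned tag;      /* bits [31:14] plus bits [11:10] - the 20-bit address tag */
    unsigned subbank;  /* bits [13:12] - which 4k sub-bank                        */
    unsigned line;     /* bits [9:5]   - set index within the sub-bank            */
    unsigned byte;     /* bits [4:0]   - byte offset within the 32-byte line      */
};

static struct icache_index decompose(uint32_t addr)
{
    struct icache_index idx;

    idx.tag     = ((addr >> 14) << 2) | ((addr >> 10) & 0x3);
    idx.subbank = (addr >> 12) & 0x3;
    idx.line    = (addr >> 5) & 0x1f;
    idx.byte    = addr & 0x1f;
    return idx;
}

int main(void)
{
    /* Two addresses exactly 16k apart land in the same sub-bank and
     * the same line, so they compete for the four Ways. */
    struct icache_index a = decompose(0x2000);
    struct icache_index b = decompose(0x2000 + 0x4000);

    printf("sub-bank %u line %u  vs  sub-bank %u line %u\n",
           a.subbank, a.line, b.subbank, b.line);
    return 0;
}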

Cache Hits and Misses

A cache hit occurs when the address for an instruction fetch request from the core matches a valid entry in the cache. Specifically, a cache hit is determined by comparing the upper 18 bits and bits 11 and 10 of the instruction fetch address to the address tags of valid lines currently stored in a cache set. The cache set (cache line across ways) is selected using bits 9 through 5 of the instruction fetch address. If the address-tag compare operation results in a match in any of the four ways and the respective cache line is valid, a cache hit occurs. If the address-tag compare operation does not result in a match in any of the four ways, or the respective line is not valid, a cache miss occurs.

When a cache miss occurs, the instruction memory unit generates a cache line fill access to retrieve the missing cache line from memory that is external to the core. The address for the external memory access is the address of the target instruction word. When a cache miss occurs, the core halts until the target instruction word is returned from external memory.

Cache Line Fills

A cache line fill consists of fetching 32 bytes of data from memory. The operation starts when the instruction memory unit requests a line-read data transfer on its external read-data port. This is a burst of four 64-bit words of data into the line fill buffer, which translates between the 64-bit cache bus and the bus width of the External Access Bus (EAB). The address for the read transfer is the address of the target instruction word. When responding to a line-read request from the instruction memory unit, the external memory returns the target instruction 64-bit word first. After it has returned the target instruction word, the next three 64-bit words are fetched in sequential address order. This fetch wraps around if necessary, as shown below:

Target Word Fetching Order for Next Three Words
WD0 WD0, WD1, WD2, WD3
WD1 WD1, WD2, WD3, WD0
WD2 WD2, WD3, WD0, WD1
WD3 WD3, WD0, WD1, WD2
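
The wrap-around is just modulo-4 ordering starting from the target word; a tiny standalone illustration of the table above:

#include <stdio.h>

/* Print the fetch order for each possible target 64-bit word. */
int main(void)
{
    int target, i;

    for (target = 0; target < 4; target++) {
        printf("WD%d:", target);
        for (i = 0; i < 4; i++)
            printf(" WD%d", (target + i) % 4);
        printf("\n");
    }
    return 0;
}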

A complete cache line must always be brought into the cache. This is why the compiler tries to align functions, and the targets of branches, on 256-bit boundaries - so that the previous function, or instructions that are not likely to be executed, are not loaded into the cache, wasting cycles and valuable cache locations. The downside is that this increases the number of NOPs in the application.

Line Fill Buffer

As the new cache line is retrieved from external memory, each 64-bit word is buffered in a four-entry line fill buffer before it is written to a 4K byte memory bank within L1 memory. The line fill buffer allows the core to access the data from the new cache line as the line is being retrieved from external memory, rather than having to wait until the line has been written into the cache. While the bus between the fill buffer and the cache is always 64 bits wide, the width of the bus to external or L2 memory varies between derivatives.

Cache Line Replacement

When configured as cache, bits 9 through 5 of the instruction physical address are used as the index to select the cache set for the tag-address compare operation. If the tag-address compare operation results in a cache miss, the Valid and LRU bits for the selected set are examined by a cache line replacement unit to determine the entry to use for the new cache line, that is, whether to use Way0, Way1, Way2, or Way3.

The cache line replacement unit first checks for invalid entries (that is, entries having their Valid bit cleared). If only a single invalid entry is found, that entry is selected for the new cache line. If multiple invalid entries are found, the replacement entry for the new cache line is selected based on the following priority:

  • Way0 first
  • Way1 next
  • Way2 next
  • Way3 last

When no invalid entries are found, the cache replacement logic uses an LRU algorithm.
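
The selection logic described above can be pictured with a small sketch (purely illustrative; the real decision is made in hardware, and the LRU bookkeeping here is simplified):

/* Pick the Way to fill on a cache miss: prefer invalid entries
 * (Way0 before Way1 before Way2 before Way3), otherwise fall back
 * to the least recently used Way. */
struct line_state {
    int valid;
    int lru_age;   /* larger value = less recently used (simplified) */
};

int pick_victim_way(const struct line_state set[4])
{
    int way, victim = 0;

    /* Invalid entries are used first; the lowest-numbered Way wins. */
    for (way = 0; way < 4; way++)
        if (!set[way].valid)
            return way;

    /* No invalid entry: choose the least recently used Way. */
    for (way = 1; way < 4; way++)
        if (set[way].lru_age > set[victim].lru_age)
            victim = way;
    return victim;
}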

Instruction cache tests

We can look at Dhrystone, add the -falign-functions=n and -falign-jumps=m compiler flags to the compile-time options, and look at the results:

Dhrystone Results (rows: -falign-functions=n, columns: -falign-jumps=m)

          m=1        m=2        m=4        m=8        m=16       m=32
n=1    629722.9   629722.9   629722.9   630119.8   629722.9   630119.8
n=2    629722.9   630119.8   629722.9   629722.9   629722.9   630119.8
n=4    629722.9   629722.9   629722.9   630119.8   629722.9   630119.8
n=8    630517.0   630517.0   630517.0   629722.9   630119.8   630119.8
n=16   630517.0   630517.0   630517.0   629722.9   629722.9   630119.8
n=32   630517.0   630517.0   630517.0   630119.8   630119.8   629722.9

We can observe a few things from this test:

  • Dhrystone is a bad test for this: since it fits completely in instruction and data cache, the differences are nearly insignificant.
  • alignment of less than 2 bytes is not possible on an architecture which has a minimum instruction length of 16 bits
  • We can see a small increase in performance as function entry points become more aligned to the start of a cache line.
  • We can see small decreases in performance as jumps become aligned, as we execute more NOPs to get to the jumps.

Since function prologues are similar to:

00001dd0 <_main>:
    1dd0:       4a e1 01 00     P2.H=1 <__start-0x3>;
    1dd4:       e3 05           [--SP] = (R7:4, P5:3);  /* 8 cycle instruction */
    1dd6:       0a e1 98 51     P2.L=5198 <_DoNNetIteration+0xe30>;
    1dda:       30 30           R6=R0;

The instruction that saves all call-preserved registers to the stack ([--SP] = (R7:4, P5:3);) is normally in the first 8 bytes of the function entry point (within the first 64 bits of the instruction cache line read), and creates 8 x 32-bit writes to the data cache, each consuming a single core clock cycle (assuming that the data cache is configured in write-back mode and the victim cache line does not have to be flushed). If the data cache must be flushed, this push instruction stalls until the external bus is available (after the completion of the loading of the 256-bit instruction cache line).

This is why the Linux exception stacks and common interrupt entry points are all located in L1 data or instruction SRAM: these paths execute without having to access the external bus, which is comparatively very slow.

Data Cache

The use of cache can help speed up SDRAM accesses by storing a cache line of 256 bits (32 bytes) at a time in the much faster on-chip SRAM. This SRAM acts as a sort of mirror or copy of the slower external SDRAM. When you want a byte of SDRAM which has been tagged as cacheable, you will actually get a 32 byte cache line (aligned to a 32 byte boundary) read into cache at the same time. So if the first byte was on a 32 byte boundary, it will take a short while to get that byte, but the remaining 31 bytes will act as if they were stored in the on-chip SRAM rather than the slower external SDRAM.

So the first byte will cost a bit of time, but later bytes have a much faster access time. See the memory optimization wiki page for a more detailed analysis of data cache performance and how fills and flushes add cycles to your application.

This works well when you are reading some data but when you are in a read / modify / write cycle there are some additional complexities.

If you write to some cached memory the data in the cache will be updated so subsequent reads will see the new value but the actual SDRAM may not contain the new value. Similarly, if you read from some cached memory, the read will occur from cache, and if something like a DMA operation has updated SDRAM, the read will not reflect the proper value.

This may not be a problem for CPU-based memory access, but as soon as you start to use DMA to access this data, problems will start to arise (see DMA considerations below).

The possibility of a memory location having two values at the same time is called a coherency problem.

If the new value is in cache some process must be used to transfer the new value to the actual SDRAM.

To help reduce this problem, cache can be configured in three different modes (Cache mode is selected by the DCPLB descriptors):

  • Write-through with cache line allocation only on reads
  • Write-through with cache line allocation on both reads and writes
  • Write-back which allocates cache lines on both reads and writes

For each store operation, write-through caches initiate a write to external memory immediately upon the write to cache. If the cache line is replaced or explicitly flushed by software, the contents of the cache line are invalidated rather than written back to external memory. Writes are a bit slower but reads are as fast as they can be.

A write-back cache does not write to external memory until the line is replaced by a load operation that needs the line. You do not have control over when the external memory update will happen.

Whether the cache is in write-back or write-through mode, software can flush a cache line, causing that area of cache to be written back to external memory.

Data Cache Organization

Cache Operation

This is a very complex topic but here is a simplified version of how it works.

When an SDRAM address is marked as cacheable, the hardware does a special search in the cache memory to see if this address is already in cache. This is called finding a cache Tag. Any given data memory address is given the option of 2 different cache memory locations (called Ways). If a match is found then the data is in cache and no costly SDRAM access is needed. (The instruction cache has 4 Ways.)

Dirty Data

If there is no Tag match then the system will look at one of the 2 Ways or cache blocks to use for a given address. Not every possible SDRAM address is given its own cache location; only a few address bits are used to identify the possible cache location options for any given address. If none of those are available then an existing cache slot is reused for that SDRAM data access. The selected cache slot is called a victim. If this slot contains modified data that has not yet been written back to SDRAM, the data is called “dirty” and needs to be written back to SDRAM before the victim slot is made available for the new cache line. If the cache is marked as write-through, the SDRAM will have already been updated, so the dirty bit can be ignored.

Once the victim cache slot is freed, it is time to load the new data into it. The new data is actually loaded into a temporary buffer before being transferred to the actual cache memory. Whenever the selected address's contents are loaded into the temporary buffer, the read operation is completed, even before the rest of the cache line is read from SDRAM and before the cache data is moved from the temporary buffer to the true cache location. You can see why hardware designers have some real challenges here.

The cache ways can be locked to prevent swapping out of data so that a small section of SDRAM can be moved into cache and kept there while it is being used. The cost of doing this is increased use of the remaining ways for other data operations.

Making it all work

The secret of using this is careful code design with very tight data operation loops. As you get to a given data location it is better to do all the operations needed on that item before moving on to the next item.

Loops like this

char a[4096];
char b[4096];
char c[4096];
int i;

for (i = 0; i < 4096; i++) {
    a[i] = b[i];
}
for (i = 0; i < 4096; i++) {
    a[i] += c[i];
}

would not be as efficient as

for (i = 0; i < 4096; i++) {
    a[i] = b[i] + c[i];
}

When the variables a[0], b[0] and c[0] are read into cache, you also get everything up to a[31], b[31] and c[31] at the same time.

Another secret is to keep data structures on a cache boundary (32 bytes). If the data structure is less than 32 bytes and it is on a boundary, then when you get the first byte the whole structure is contained in a single cache line, so you do not need two cache lines to store it. Keep all commonly used data items in a structure close together to minimize the number of cache lines required to use the structure. Some architectures define


to help with this.

In general keep all your most frequently used code and data items close together in your applications.
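
As a concrete illustration of keeping a structure on a cache-line boundary, GCC's aligned attribute can be used; the structure here is invented for the example:

#include <stdint.h>

/* A small, frequently accessed state block, aligned so that it starts
 * on a 32-byte boundary and therefore fits in a single cache line. */
struct channel_state {
    uint32_t in_count;
    uint32_t out_count;
    int16_t  gain;
    int16_t  offset;
    uint32_t flags;
} __attribute__((aligned(32)));

/* In an array of such structures each element starts on its own cache
 * line, so touching one element never drags in a neighbour's data. */
struct channel_state channels[8];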

Cache Performance

Cachebench is an application to empirically determine some parameters of an architecture's memory subsystem. It includes 8 different tests, including read, write, and read/modify/write.

DMA and Cache Considerations

All of this works really well if JUST the CPU is looking at the data. The CPU has the cache hardware at its disposal, so it can take best advantage of the cache system. The Blackfin has many peripherals (SPORT, PPI …) that rely on DMA to read in high-throughput I/O data.

A DMA process knows nothing about any data caches. It directly reads and writes to SDRAM locations or to the selected peripherals.

This means that you can have a selected memory location in cache, but the actual SDRAM location has been modified by a DMA transfer and now holds different data from the in-cache copy. You can attempt to read the data as many times as you like, but you will not normally see the updated contents of the SDRAM (written by the DMA operation).


The invalidate operation allows the cache entry to be updated from the SDRAM memory area the next time an attempt is made to read the address.

This function forces a reread of the SDRAM even though the SDRAM address contents may have already been held in cache memory.

NOTE that the Blackfin invalidate also first performs a flush of any dirty cache data. This is not a problem when using write-through operation, but needs to be considered otherwise.

The whole cache may be invalidated or just an address range.

The invalidate function can be used for a range of addresses even if the addresses are not actually in the cache.
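
In kernel code this usually means invalidating the buffer before the CPU reads data that a peripheral has just written by DMA. A hedged sketch of the receive side, assuming the blackfin_dcache_invalidate_range() helper declared in include/asm/cacheflush.h (check that header for the exact names and prototypes on your kernel version):

#include <asm/cacheflush.h>

/* Buffer that a peripheral fills by DMA (size and names invented). */
static unsigned char rx_buf[1024];

static void rx_dma_complete(void)
{
    /*
     * The DMA engine has written new data straight into SDRAM.
     * Throw away any stale cached copy before the CPU reads it.
     * (As noted above, on Blackfin the invalidate also flushes any
     * dirty lines in the range first.)
     */
    blackfin_dcache_invalidate_range((unsigned long)rx_buf,
                                     (unsigned long)rx_buf + sizeof(rx_buf));

    /* Reads of rx_buf[] now see what the DMA controller wrote. */
}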

Here are the results from a simple test.

DMA Test Results

insmod /lib/modules/
 buf 1 before copy
 00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15
 DMA_D0 dma irq 28 status 1

DMA status 0x0
 buf 2 after dma copy where is the data ??
 00  00  00  00  00  00  00  00  00  00  00  00  00  00  00  00

 buf 2 after invalidate OK an invalidate was needed
 00  01  02  03  04  05  06  07  08  09  10  11  12  13  14  15

root:~> cat /proc/interrupts
  6:      14033   BFIN Timer Tick
 14:          0   rtc
 21:          0   BFIN_UART_RX
 22:       2281   BFIN_UART_TX
 28:          1   MEM DMA0
 40:         45   eth0
Err:          0

Flushing Cache

With the kernel configuration option CONFIG_BLKFIN_WT, the cache is configured as Write Through.

This adds a penalty to SDRAM writes in that each write will be flushed to SDRAM.

NOTE There are chip configuration options to also bypass the caching of data on writes.

This penalty is offset by the fact that after a write, the data is not only held in cache but is actually present in memory.

This means that if you are setting up a buffer to be sent to a peripheral via DMA action then you will not have to explicitly flush the buffer before starting the DMA transfer.

Remember, however, that CONFIG_BLKFIN_WT is only a configuration option.

You may have a situation where the kernel is compiled with this option turned off. Your code may then suddenly seem to stop working.

When setting up a buffer for DMA output, it is always safe to flush the cache to memory. The flush has no effect (and costs nothing extra) when CONFIG_BLKFIN_WT is enabled.

Look in linux-2.6.x/include/asm/cacheflush.h for more details
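
For example, a hedged sketch of the transmit side of a driver, assuming the blackfin_dcache_flush_range() helper from that header (again, check the header for the exact name and prototype):

#include <asm/cacheflush.h>

/* Buffer the CPU fills and a peripheral then reads out by DMA. */
static unsigned char tx_buf[1024];

static void start_tx_dma(void)
{
    /* ... fill tx_buf[] with the data to send ... */

    /*
     * Push any cached (possibly dirty) copies of the buffer out to
     * SDRAM so the DMA engine reads what the CPU just wrote.  With
     * CONFIG_BLKFIN_WT (write-through) this has no effect, but it is
     * always safe to do.
     */
    blackfin_dcache_flush_range((unsigned long)tx_buf,
                                (unsigned long)tx_buf + sizeof(tx_buf));

    /* ... program the DMA channel and start the transfer ... */
}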

Instruction Cache

In addition to the data cache, the Blackfin has an instruction cache. When enabled, this is used when executing code: as instructions are read from memory they are read into cache, just like data reads. The instruction cache is 4-way set associative.

When loading executables into SDRAM memory the instruction cache should be invalidated to allow any old cached content to be removed before trying to execute the new code.

The new code may be loaded using data writes through the data cache, so the instruction cache does not register the new data.

The linux-2.6.x/fs/binfmt_flat.c file contains an example.

You may also need to flush the icache in the case of handling signals where the return stack to user space is modified.

Look for an example in linux-2.6.x/arch/blackfin/kernel/signal.c.
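
In both cases the pattern looks roughly like the sketch below; flush_icache_range() is the interface declared in cacheflush.h (listed below), and the surrounding function is invented for illustration:

#include <asm/cacheflush.h>

/*
 * After copying or patching executable code through the data side
 * (loading a bFLT binary, or rewriting the signal return stub), make
 * sure the instruction cache holds no stale lines for that region
 * before jumping to it.
 */
static void finish_code_load(unsigned long start, unsigned long len)
{
    flush_icache_range(start, start + len);
}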

The cache flushing kernel functions are in the following files:

  • linux-2.6.x/arch/blackfin/mach-common/cache.S
  • linux-2.6.x/include/asm/cacheflush.h

A drastic system call that flushes all of cache is also in linux-2.6.x/arch/blackfin/kernel/sys_bfin.c.

Cache Setup and Control

The Blackfin has detailed control over the amount of SRAM that can be set up as either data or instruction cache. Some of the fast data SRAM area can be set up as data storage, and you may want to use some of the instruction SRAM area for code. The default setup uses the maximum data (32K out of 64K) and instruction (16K out of 32K) cache allocations from the on-chip SRAM.

The cache allocations are set up in the file linux-2.6.x/arch/blackfin/mach-common/cacheinit.S.

The IMEM_CONTROL and DMEM_CONTROL registers control how much SRAM is used for cache.

IMEM_CONTROL also controls which, if any, of the 4 instruction ways are locked.

Kernel Configuration

The Blackfin Processors have their Kernel configuration options set up in the Blackfin Specific Features section.

Select the following from the Kernel Configuration Menu

Blackfin Processor Options  --->
  --- Cache Support
  [*] Enable ICACHE
  [*] Enable DCACHE
  [ ] Enable Cache Locking
      Policy (Write back)  --->


There are 16 register sets that control the cacheability of external memory.

(CPLB = Cacheability Protection Lookaside Buffer.) There are 16 Data and 16 Instruction CPLBs in the system.

Each register set contains an address register and a data register. The address register provides a match for the top 22 bits [31:10] of a given address, and the data register contains control information for that address.

The control information describes, for example, the following (a sketch of such a descriptor pair appears after the note below):

  • range ( 1K, 4K , 1M, 4M ) of that address
  • can that address be cached
  • is the register set locked
  • user / supervisor read / write access permissions

NOTE that the Instruction and Data CPLBs have different control options.
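
Conceptually, each descriptor pair can be pictured as in the sketch below. The field and flag names are invented for illustration; the real definitions live in the kernel's CPLB headers and in the tables in setup.c:

#include <stdint.h>

/*
 * Illustrative model of one CPLB descriptor pair: an address register
 * whose top 22 bits [31:10] are matched against the access address,
 * and a data register holding the control bits for that region.
 * The flag values below are made up for this sketch.
 */
#define EX_PAGE_SIZE_4MB   (1u << 0)   /* region size select                 */
#define EX_CACHEABLE       (1u << 1)   /* accesses to this region may cache  */
#define EX_LOCKED          (1u << 2)   /* entry is never chosen as a victim  */
#define EX_USER_RD         (1u << 3)   /* user-mode reads permitted          */

struct cplb_entry {
    uint32_t addr;   /* base address; only bits [31:10] take part in the match */
    uint32_t data;   /* control information for the region                     */
};

/* Example: mark the first 4MB of SDRAM cacheable and user readable. */
static const struct cplb_entry sdram_entry = {
    .addr = 0x00000000,
    .data = EX_PAGE_SIZE_4MB | EX_CACHEABLE | EX_USER_RD,
};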

When cache is enabled the CPLB buffers must be set up and used.

There are only 16 Instruction CPLB Register Sets and 16 Data CPLB Register Sets. There may be a need for more memory descriptors in a system configuration.

When an access to a memory location is requested that is not represented by a current Instruction or Data CPLB an exception is raised.

The exception handler is given the task of determining if the exception occurred due to a protection violation or due to an unmatched address.

The cplb manager is given the task of loading a new entry into the core register set from one of the sets defined by the system configuration.

A set of configuration register sets is set up in the file linux-2.6.x/arch/blackfin/kernel/setup.c.

The code in linux-2.6.x/arch/blackfin/mach-common/cplbmgr.S is given the job of looking through the current core registers to find an eviction victim and replacing that victim with an entry from the configuration tables.

A kernel panic is triggered if no entry is found for the address that is being accessed or if no victim can be found.

CPLB lines can also be locked into the core registers to prevent them being swapped out.

The code in linux-2.6.x/arch/blackfin/mach-common/cplbhdlr.S services the exception and invokes either the protection violation handler or the cplbmgr code to initiate a CPLB replacement.

Reading /proc/cplbinfo will show the current CPLB entries.

Using L1 Memory

The L1 data and instruction memory sections can be used for data and code.

If a code section is marked as residing in the L1 memory area, it will be compiled and loaded into SDRAM by the system, but then transferred to L1 memory during the boot process.

In fact, the only way to get code into the L1 instruction memory is by using a DMA process to copy it. In an assembler file, the following example shows how to identify code that needs to be transferred to L1 instruction memory.

// extract from arch/blackfin/mach-bf533/head.S

.section .text.l1
	p0.h = hi(SIC_IWR);
	p0.l = lo(SIC_IWR);
	r0.l = 0x1;
	[p0] = r0;

This is the actual relocation call from the same file

	/*Put The Code for PLL Programming and SDRAM Programming in L1 ISRAM*/
	call _bf53x_relocate_l1_mem;

The actual memory copy to the L1 RAM is done in bf53x_relocate_l1_mem(), in trunk/arch/blackfin/kernel/setup.c.
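
In C, kernel code achieves the same thing by placing a function into the .text.l1 section with a compiler attribute. A hedged sketch (the function name is invented, and the kernel provides its own wrapper macros for this, so check the current headers rather than copying this literally):

/* Place this function's code in the .text.l1 section so the boot-time
 * relocation copies it into L1 instruction SRAM. */
void __attribute__((section(".text.l1"), noinline)) timer_critical_path(void)
{
    /* ... time-critical work that should run from L1 ... */
}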

The linker magic to make the L1 code and data components have an address in L1 memory but reside in the image is in the linker script under arch/blackfin/kernel/. This generates the as-linked absolute addresses in L1 memory but locates the sections inside the SDRAM kernel image.

References are made to the as loaded addresses to give the relocation code a chance to find the area to move to L1 memory.

	 __l1_lma_start = .;
	.text_l1 L1_CODE_START :
		AT ( __l1_lma_start )
		. = ALIGN(4) ;
		 __stext_l1 = . ;

		. = ALIGN(4) ;
		 __etext_l1 = . ;

	.data_l1 L1_DATA_A_START :
		AT ( __l1_lma_start + SIZEOF(.text_l1) )
		. = ALIGN(4) ;
		 __sdata_l1 = . ;
		 __edata_l1 = . ;

		. = ALIGN(4) ;
		 __sbss_l1 = . ;

		. = ALIGN(4) ;
		 __ebss_l1 = . ;

	. = __l1_lma_start + SIZEOF(.text_l1) + SIZEOF(.data_l1) ;
	.data (__l1_lma_start + SIZEOF(.text_l1) + SIZEOF(.data_l1)) :	AT ( __l1_lma_start + SIZEOF(.text_l1) + SIZEOF(.data_l1) )

More Information

Please refer to this page kernel_space_memory_allocation for some more details on cache management.
