Inter Core Communication

Introduction

Parallel processing refers to the concept of speeding up the execution of a program by dividing the program into multiple fragments that can execute simultaneously, each on its own processor. A program being executed across n processors might execute n times faster than it would using a single processor.

Traditionally, multiple processors were provided within a specially designed “parallel computer”; along these lines, Linux now supports SMP systems in which multiple processors share a single memory and bus interface within a single computer. It is also possible for a group of computers (for example, a group of PCs each running Linux) to be interconnected by a network to form a parallel-processing cluster. The third alternative for parallel computing using Linux is to use the multimedia instruction extensions (i.e., MMX) to operate in parallel on vectors of integer data. Finally, it is also possible to use a Linux system as a “host” for a specialized attached parallel processing compute engine. All these approaches are discussed in detail in the Parallel-Processing HOWTO; the fourth (a specialized attached parallel processing compute engine) is described here.

This document describes some design/architecture ideas on how to make Core B easier to use within a Linux framework. The initial implementation can be found in the Blackfin kernel and uClinux-dist source repositories under the folder named icc. If you would like to provide feedback, please do so on the forums.

The classical trade-off between system performance and ease of programming is one of the primary differentiators between general-purpose operating systems (GPOS) and real-time operating systems (RTOS). GPOSes tend to provide a higher degree of resource abstraction. This improves application portability and ease of development, and increases system robustness through software modularity and isolation of resources.
This makes a GPOS ideal for addressing general-purpose system components such as networking, user interface and display management. However, this abstraction sacrifices the fine-grained control of system resources required to meet the performance goals of computationally intensive algorithms such as signal processing code. For this level of control, developers typically turn to a real-time operating system (RTOS), or program directly on bare metal.

Use Cases

There are various use cases for loading bare metal applications, or applications running under a real-time OS, into Core B and using it like a hardware accelerator.

Compute Accelerator

Various tasks that normally run under the Linux kernel can be accelerated.

Video Accelerator

Running an optimized H.264, MPEG or WMV video codec on Core B, with mplayer running on Core A.

- mplayer runs in Linux on Core A
- the Linux kernel manages 100% of the peripherals (including LCD)
- H.264 decoder on Core B

The two pass the raw bitstream and the decoded video through the CPU→DSP framework. The decoder does nothing except decode the video stream into some frame buffers. mplayer opens an H.264 bitstream from either a file on disk or a connection over the network.

Video

- mplayer runs in Linux on Core A
- the Linux kernel manages some of the peripherals (not including LCD)
- H.264 decoder on Core B

mplayer passes the raw bitstream to the decoder, and the decoded video is passed directly to the PPI (Parallel Peripheral Interface) from Core B. The DSP code should negotiate a proper DMA/IRQ/GPIO/DRAM resource allocation with the Linux kernel through the Blackfin DSP framework.

Crypto Accelerator

The Linux Crypto API (Crypto_API_(Linux)) offers hardware acceleration support.
Real Time Task

There are times when the hard real-time performance offered by the Linux kernel or by ADEOS is not enough for the application. In those cases, you can still run a thin RTOS (VDK, uCos, etc.) on Core B, and Linux on Core A.

Shared Memory based Inter-core Communication Protocol

This section is intended to define the communication well enough that different implementations can successfully communicate.

Shared Memory

There is a fixed number of shared variables with sizes and addresses known to each processor. There is a shared variable for the basic message queue. Protocols that use this queue may require additional shared variables or may require individual processors to have a pool of shareable memory from which buffers can be allocated.

If processors have different word sizes and address maps then addresses of shared buffers and the size of addressable units could differ, and the protocol would need to define a common address representation and addressable unit. We also define types that have at least 16 bits and 32 bits, each with a size larger than or equal to the smallest addressable unit. We will use the data types:

    typedef 'some unsigned integer' sm_unit_t;    // defined in specifics
    typedef 'some unsigned integer' sm_uint16_t;  // defined in specifics
    typedef 'some unsigned integer' sm_uint32_t;  // defined in specifics
    typedef 'some integral type' sm_address_t;    // defined in specifics

Specifics for Blackfin

Both cores on BF561 and BF60x have the same address space and are byte addressed.

    typedef uint8_t sm_unit_t;
    typedef uint16_t sm_uint16_t;
    typedef uint32_t sm_uint32_t;
    typedef void *sm_address_t;

Cache policy

One of the assumptions of the MCAPI/ICC protocol is that the payload buffer received on one core is located in a memory region managed (owned) by the other core. Core 0 should set up write-through CPLB entries for the memory region managed by core 1.
With write-through entries, an invalidate instruction on core 0 does not write stale cached data back over the MCAPI payload buffer sent by core 1, and does not discard unrelated data that shares a cache line at the MCAPI payload boundary. Which CPLB entries (WT/WB) are set up for the same memory region on core 1 does not matter, because core 1 should flush the MCAPI payload buffer before sending.

For example, on BF609:

mem addr , Owner , Core0 cache , Core1 cache
0~0x3FFFFF , Core0 , WB , WT
0x400000~0x800000 , Core1 , WT , WB

Atomic access

The specific part defines a type which may be read and written atomically. Two operations are defined on the type, read and write, and it may hold values of sm_uint16_t.

    // defined in specifics
    typedef 'some type' sm_atomic_t;
    sm_uint16_t sm_read_atomic(volatile sm_atomic_t *);
    void sm_write_atomic(volatile sm_atomic_t *, sm_uint16_t);

Atomic means that if one core writes a variable and another core reads it, the value read is either the value before the write or the value after it, and never some third value caused by the write being only half completed when the read occurred.

Atomic operations on a single core must also be ordered with respect to each other so the following logic holds.

    // initial values
    sm_atomic_t a = 0, b = 0;

    on processor 0:
        sm_write_atomic(&a, 1);
        while (sm_read_atomic(&b) == 0)
            ;

    on processor 1:
        sm_write_atomic(&b, 1);
        x = sm_read_atomic(&a);
        assert(x == 1); // because read(b) must follow write(a) on processor 0

Specifics for Blackfin

On BF561 and BF60x both cores use the same bus to L2 and the EBIU. L2 memory is 64 bits wide and memory attached to the EBIU is at least 16 bits wide, so an uncached 16-bit write to L2 or L3 is atomic.

    typedef uint16_t sm_atomic_t;

    static inline void sm_write_atomic(volatile sm_atomic_t *a, sm_uint16_t v)
    {
        *a = v;
    }

    static inline sm_uint16_t sm_read_atomic(volatile sm_atomic_t *a)
    {
        return *a;
    }

Interrupts

Each processor must be able to raise interrupts on the other processor.
We use one interrupt on each core which indicates that some action is required. The interrupt handler works out what the action is from the channel state, so all modifications to shared data should be visible to both processors by the time the interrupt handler is entered.

The mechanism for initialising interrupt handlers and clearing the interrupt source is necessarily processor and environment specific. The initialisation sequence described below requires the interrupt to be initially masked, which is usually the case.

Specifics for Blackfin

CPU , master core ICC interrupt , slave core ICC interrupt
BF561 , core supplemental interrupt 0 , core supplemental interrupt 0
BF60x , SEC soft interrupt 0 , SEC soft interrupt 1

The protocol does not define the core interrupt vectors used to handle these interrupts or whether they are shared with other interrupt sources, as this is a decision local to the environment running on the core.

Modifications to shared data are made visible to the other core before raising the interrupt by ensuring any cached writes are flushed from the cache and by executing an SSYNC instruction to drain the write buffer. The interrupted core is responsible for ensuring that its initial reads of shared data are not served from cache.

Message passing

The protocol is for two-way communication between two processors. If there are more processors in the system then the protocol could be used for separate two-way channels between each pair of processors.

There are four message queues: two in each direction, one for high priority messages and the other for standard priority. Message queues are circular buffers containing SM_MSGQ_LEN fixed-size messages.

    typedef struct {
        sm_atomic_t sent;
        sm_atomic_t received;
        sm_msg_t buf[SM_MSGQ_LEN];
    } sm_msgq_t;

The size and content of sm_msg_t are defined in Part 2 below. SM_MSGQ_LEN is a constant. For efficiency it should be a power of two.

The message queue uses a lockless protocol.
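A minimal single-threaded sketch of such a lockless queue follows, using the `sent` and `received` counters of `sm_msgq_t`; the sender and receiver rules it follows are spelled out in the text after it. The names `demo_msg_t`, `demo_msgq_t`, `demo_count`, `demo_send` and `demo_recv` are hypothetical and not part of the icc sources, the message type is a simplified stand-in for `sm_msg_t`, and the cache flush, SSYNC and interrupt steps are only marked by comments.

```c
#include <stdint.h>

#define SM_MSGQ_LEN 4u              /* assumed length; a power of two */

typedef struct {                    /* simplified stand-in for sm_msg_t */
    uint16_t dst_ep, src_ep;
    uint32_t type;
} demo_msg_t;

typedef struct {
    uint16_t sent;                  /* advanced only by the sender */
    uint16_t received;              /* advanced only by the receiver */
    demo_msg_t buf[SM_MSGQ_LEN];
} demo_msgq_t;

/* Number of queued messages; stays correct when the 16-bit counters wrap. */
static uint16_t demo_count(const demo_msgq_t *q)
{
    return (uint16_t)(q->sent - q->received);
}

/* Returns 0 on success, -1 if the queue is full (the caller would block). */
static int demo_send(demo_msgq_t *q, const demo_msg_t *m)
{
    if (demo_count(q) >= SM_MSGQ_LEN)
        return -1;
    q->buf[q->sent % SM_MSGQ_LEN] = *m;   /* fill the slot first */
    q->sent++;                            /* then publish it */
    /* real code: flush/SSYNC, then raise 'Action Required' on the peer */
    return 0;
}

/* Returns 0 and copies the oldest message to *out, or -1 if empty. */
static int demo_recv(demo_msgq_t *q, demo_msg_t *out)
{
    if (demo_count(q) == 0)
        return -1;
    *out = q->buf[q->received % SM_MSGQ_LEN];
    q->received++;                        /* releases the slot to the sender */
    return 0;
}
```

Besides efficiency, a power-of-two length keeps the slot index continuous when a 16-bit counter wraps from 65535 to 0, because 65536 is then a multiple of the queue length.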
The sender always writes a message at sent % SM_MSGQ_LEN and then increments sent, and the receiver always reads from received % SM_MSGQ_LEN and then increments received. The number of messages in the queue is (sm_uint16_t)(sent - received). The counters are unsigned so, thanks to modulo arithmetic, this is true even if received > sent because sent has wrapped round.

Before sending a message the sender checks that there is space available in the buffer. If space is available the sender writes the message to buf[sent % SM_MSGQ_LEN], increments sent, then raises the 'Action Required' interrupt on the other processor. If no space is available in the buffer the calling process must block.

The handler for the 'Action Required' interrupt causes a receiver for both the high and standard priority queues to run. Whether the receivers execute within the handler or are just scheduled to run once it returns is environment dependent.

The receiver checks the number of messages in the buffer. If there are any, it reads the message at buf[received % SM_MSGQ_LEN] and tries to deliver it. If successful it increments received and raises the 'Action Required' interrupt on the sending core.

The interrupt handler also checks whether space has become available in the queues on which processes are blocked. The definition of a process, the mechanism for blocking a process, and the method of dealing with the race condition between the sender blocking and the receiver raising the interrupt are processor and environment specific and outside the scope of the protocol. For example, in a bare metal environment there is only one “process” and it can block by spinning on a variable that is set by the interrupt handler, whereas other environments would use operating system primitives.

A message channel between a pair of processors is composed of 4 message queues.

    typedef struct {
        volatile sm_msgq_t msgq[2][2];
    } sm_channel_t;

Within each pair, the processor with the smaller cpu_id receives on msgq[priority][0] and sends on msgq[priority][1]. The processor with the larger cpu_id does the reverse.
It receives on msgq[priority][1] and sends on msgq[priority][0]. The message queue id can be identified from the current cpu, the destination cpu and the cpu which sent the inter-processor interrupt.

    if (cur_cpuid == ipi_src_cpuid || cur_cpuid == des_cpuid)
        BUG();
    recv_msgq_id = cur_cpuid < ipi_src_cpuid ? 0 : 1;
    send_msgq_id = cur_cpuid < des_cpuid ? 1 : 0;

Each processor receives high priority messages on the queue msgq[0][recv_msgq_id], and standard priority messages on msgq[1][recv_msgq_id]. The send queues are indexed in the same way using send_msgq_id.

If there are N processors in the architecture, there should be a channel array of (N - 1) * N / 2 channels. The channel array exists at a known location in shared memory. The channel id can be identified from the current cpu and the remote cpu.

Message Channel ID Table

Processor ID , 0 , 1 , 2 , 3
0 , NA , 0 , 1 , 2
1 , NA , NA , 3 , 4
2 , NA , NA , NA , 5
3 , NA , NA , NA , NA

    #define CPU_NUM 4
    #define CHANNEL_NUM ((CPU_NUM - 1) * CPU_NUM / 2)

    sm_channel_t channels[CHANNEL_NUM];
    int8_t channel_table[CPU_NUM][CPU_NUM] = {
        {-1,  0,  1,  2},
        {-1, -1,  3,  4},
        {-1, -1, -1,  5},
        {-1, -1, -1, -1},
    };

    // the table is upper triangular, so index it with the smaller cpu id first
    channel_id = cur_cpuid < remote_cpuid ? channel_table[cur_cpuid][remote_cpuid]
                                          : channel_table[remote_cpuid][cur_cpuid];
    if (channel_id < 0 || channel_id >= CHANNEL_NUM)
        BUG();
    channel = &channels[channel_id];

Each message queue is statically initialised with received and sent containing the value 0. When a processor starts running, the 'Action Required' interrupt is masked. Before attempting to send the first message, a handler is installed and the interrupt is unmasked.

A message queue can be written to before the receiver has initialized its interrupts. If it fills up, the 'Action Required' signal is raised but not serviced until the receiver unmasks the interrupt.

Specifics for Blackfin

A single block of four message queues is held in the shared variable msgq at a known address. Its start address should be at a fixed position known to code running on all processors.
    #define MSGQ_START_ADDR 0xFEB00000 // in BF561 and BF60x L2 SRAM

    typedef struct {
        volatile sm_msgq_t msgq[2][2];
    } sm_channel_t;

    static sm_channel_t *sm_ch = (sm_channel_t *)MSGQ_START_ADDR;

Core A receives messages on sm_ch->msgq[priority][0].
Core B receives messages on sm_ch->msgq[priority][1].

Message Format

    typedef sm_uint16_t sm_endpoint_t;

    typedef struct {
        sm_endpoint_t dst_ep, src_ep;
        sm_uint32_t type;
        sm_uint32_t length;
        sm_address_t payload;
    } sm_msg_t;

The fields dst_ep and src_ep denote endpoints. The meaning of an endpoint is application dependent. The receiver should inspect the dst_ep field to decide how to process the message. The src_ep field indicates the sender, which may be meaningful to the receiving endpoint.

type is an unsigned 32-bit integer value that indicates a message type defined in one of the higher level protocols and is mainly interpreted by the endpoint. The top eight bits of the value indicate the protocol and the low 24 bits the subtype.

    // compose type enumeration value from protocol & subtype
    #define SM_MSG_TYPE(protocol, subtype) (((sm_uint32_t)(protocol) << 24) | (subtype))
    // extract subtype from type enumeration value
    #define SM_MSG_SUBTYPE(type) ((type) & 0xffffff)
    // extract protocol from type enumeration value
    #define SM_MSG_PROTOCOL(type) (((type) >> 24) & 0xff)

An endpoint may recognise more than one protocol. The receiver must know the protocols recognised by each endpoint. When dst_ep has the value 0xffff the message is broadcast to every endpoint which recognises the protocol encoded in the type field.

The meaning of length and payload depends on the value of type and is interpreted by the endpoint.

Specifics for BF561

On BF561, message reads and writes to a queue in L2 are more efficient if 'sm_msg_t' is aligned on a 64-bit boundary. An aligned access should take 14 rather than 21 cycles. The type declaration for the VisualDSP compiler should use pragma align:

    typedef struct {
    #pragma align 8
        ...
    } sm_msg_t;

All protocol types

    enum {
        SP_GENERAL = 0,
        SP_CORE_CONTROL,
        SP_TASK_MANAGER,
        SP_RES_MANAGER,
        SP_PACKET,
        SP_SESSION_PACKET,
        SP_SCALAR,
        SP_SESSION_SCALAR,
        SP_MAX,
    };

Standard Message Types

All protocols should recognise the standard message types. A couple of common error conditions are covered by standard messages. These are sent with the same priority as the message to which they are responding.

SM_BAD_ENDPOINT

All endpoints should recognise the message:

    SM_BAD_ENDPOINT = SM_MSG_TYPE(0, 0)

This may be sent in response to a message sent by this endpoint to indicate that the dst_ep field was invalid. The SM_BAD_ENDPOINT message has its src_ep field set to the invalid endpoint id, its length to 0, and its payload to the type value of the original message.

The SM_BAD_ENDPOINT message may not be sent in all environments. If endpoints can be created dynamically it may be more appropriate to queue the message until the endpoint is created.

SM_BAD_MSG

All endpoints should recognise and be able to send the message:

    SM_BAD_MSG = SM_MSG_TYPE(0, 1)

When an endpoint receives a message with a type field it does not expect, it should return an SM_BAD_MSG message with the payload set to the type value it did not recognise and the length field set to 0. The message queue layer should also return SM_BAD_MSG if a message with an invalid protocol value is sent to an endpoint. Either 0 or the endpoint's known protocol is valid.

SM_QUERY_MSG

All endpoints should recognise and be able to send the message:

    SM_QUERY_MSG = SM_MSG_TYPE(0, 2)

SM_QUERY_MSG and SM_QUERY_ACK_MSG messages are used to query the status of a remote endpoint. A query message should set the dst_ep field and the type field. When an endpoint receives an SM_QUERY_MSG, it should return an SM_QUERY_ACK_MSG. The message queue layer should return SM_QUERY_NOEP_MSG if the endpoint has not been created.
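As an illustration of the standard messages described so far, the sketch below builds SM_BAD_ENDPOINT and SM_BAD_MSG replies and checks the "0 or the endpoint's known protocol" rule. The helpers `make_bad_endpoint`, `make_bad_msg` and `protocol_is_valid` are hypothetical names, `uintptr_t` stands in for `sm_address_t`, and routing the reply back to the original `src_ep` is an assumption rather than something the protocol text states.

```c
#include <stdint.h>

typedef uint16_t sm_endpoint_t;

typedef struct {
    sm_endpoint_t dst_ep, src_ep;
    uint32_t type;
    uint32_t length;
    uintptr_t payload;   /* stand-in for sm_address_t; carries a type value here */
} sm_msg_t;

#define SM_MSG_TYPE(protocol, subtype) (((uint32_t)(protocol) << 24) | (subtype))
#define SM_MSG_SUBTYPE(type)  ((type) & 0xffffff)
#define SM_MSG_PROTOCOL(type) (((type) >> 24) & 0xff)

#define SM_BAD_ENDPOINT SM_MSG_TYPE(0, 0)
#define SM_BAD_MSG      SM_MSG_TYPE(0, 1)

/* Reply for a message whose dst_ep does not exist (hypothetical helper). */
static sm_msg_t make_bad_endpoint(const sm_msg_t *orig)
{
    sm_msg_t r;
    r.dst_ep  = orig->src_ep;   /* assumption: reply goes to the original sender */
    r.src_ep  = orig->dst_ep;   /* the invalid endpoint id */
    r.type    = SM_BAD_ENDPOINT;
    r.length  = 0;
    r.payload = orig->type;     /* type value of the original message */
    return r;
}

/* Reply for a message whose type the endpoint does not recognise. */
static sm_msg_t make_bad_msg(const sm_msg_t *orig)
{
    sm_msg_t r = make_bad_endpoint(orig);
    r.type = SM_BAD_MSG;        /* same layout: length 0, payload = bad type */
    return r;
}

/* Either protocol 0 or the endpoint's own protocol is valid. */
static int protocol_is_valid(uint32_t type, uint32_t ep_protocol)
{
    uint32_t p = SM_MSG_PROTOCOL(type);
    return p == 0 || p == ep_protocol;
}
```

The message queue layer could call `protocol_is_valid` before dispatching and answer with `make_bad_msg` when the check fails.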
SM_QUERY_ACK_MSG

All endpoints should recognise and be able to send the message:

    SM_QUERY_ACK_MSG = SM_MSG_TYPE(0, 3)

An SM_QUERY_ACK_MSG message should set the src_ep field and the type field.

SM_QUERY_NOEP_MSG

All endpoints should recognise and be able to send the message:

    SM_QUERY_NOEP_MSG = SM_MSG_TYPE(0, 4)

SM_NOTIFY_EP_CREATE_MSG

All endpoints should recognise and be able to send the message:

    SM_NOTIFY_EP_CREATE_MSG = SM_MSG_TYPE(0, 5)

When a new endpoint has been created, it should send an SM_NOTIFY_EP_CREATE_MSG to notify the remote message queue layer. SM_NOTIFY_EP_CREATE_MSG should set the src_ep field.

Communication Protocols

The communication protocols defined in the DSP bridge framework are as follows.

Protocol type , value , Protocol Name
SP_CORE_CONTROL , 1 , Core Control Protocol
SP_TASK_MANAGER , 2 , Task Manager Protocol
SP_RES_MANAGER , 3 , Resource Manager Protocol
SP_PACKET , 4 , Connectionless Packet Transfer Protocol
SP_SESSION_PACKET , 5 , Connection based Packet Transfer Protocol
SP_SCALAR , 6 , Connectionless Scalar Transfer Protocol
SP_SESSION_SCALAR , 7 , Connection based Scalar Transfer Protocol

Core Control Protocol

The core control protocol is a simple set of messages for controlling a slave core.

message , value , sent by , meaning
SM_CORE_START , SM_MSG_TYPE(SP_CORE_CONTROL, 0) , Master , Change slave state from stopped to started
SM_CORE_STARTED , SM_MSG_TYPE(SP_CORE_CONTROL, 1) , Slave , In response to SM_CORE_START once started
SM_CORE_STOP , SM_MSG_TYPE(SP_CORE_CONTROL, 2) , Master , Change slave state from started to stopped
SM_CORE_STOPPED , SM_MSG_TYPE(SP_CORE_CONTROL, 3) , Slave , In response to SM_CORE_STOP once stopped
SM_CORE_RESET , SM_MSG_TYPE(SP_CORE_CONTROL, 4) , Master , Put slave in stopped state if not already stopped and reset state including PC
SM_CORE_RESETED , SM_MSG_TYPE(SP_CORE_CONTROL, 5) , Slave , In response to SM_CORE_RESET once reset

All messages are sent with high priority.
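The core control table can be read as a small state machine on the slave. The sketch below is one possible reading, not the icc implementation: `core_state_t` and `slave_handle` are hypothetical names, the actual start, stop and reset actions are left as comments, and replying SM_CORE_STARTED to a SM_CORE_START received while already started is an assumption.

```c
#include <stdint.h>

#define SM_MSG_TYPE(protocol, subtype) (((uint32_t)(protocol) << 24) | (subtype))
#define SP_CORE_CONTROL 1

enum {
    SM_CORE_START   = SM_MSG_TYPE(SP_CORE_CONTROL, 0),
    SM_CORE_STARTED = SM_MSG_TYPE(SP_CORE_CONTROL, 1),
    SM_CORE_STOP    = SM_MSG_TYPE(SP_CORE_CONTROL, 2),
    SM_CORE_STOPPED = SM_MSG_TYPE(SP_CORE_CONTROL, 3),
    SM_CORE_RESET   = SM_MSG_TYPE(SP_CORE_CONTROL, 4),
    SM_CORE_RESETED = SM_MSG_TYPE(SP_CORE_CONTROL, 5),
};

typedef enum { CORE_STOPPED, CORE_STARTED } core_state_t;

/*
 * Handle one master request on the slave.
 * Returns 1 and stores the reply type in *reply if the request was
 * recognised, 0 otherwise (the caller could then answer SM_BAD_MSG).
 */
static int slave_handle(core_state_t *state, uint32_t type, uint32_t *reply)
{
    switch (type) {
    case SM_CORE_START:
        *state = CORE_STARTED;      /* start executing (omitted) */
        *reply = SM_CORE_STARTED;
        return 1;
    case SM_CORE_STOP:
        *state = CORE_STOPPED;      /* stop executing (omitted) */
        *reply = SM_CORE_STOPPED;
        return 1;
    case SM_CORE_RESET:
        *state = CORE_STOPPED;      /* also reset PC and other state (omitted) */
        *reply = SM_CORE_RESETED;
        return 1;
    default:
        return 0;
    }
}
```

Since all core control messages use high priority, the replies would be posted on the high priority queue as well.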
Task Manager Protocol

The task manager protocol is a simple set of messages to run and kill a task on the slave cores.

message , value , sent by , meaning
SM_TASK_RUN , SM_MSG_TYPE(SP_TASK_MANAGER, 0) , Master , Ask the slave core to execute a task, giving the function addresses and parameters of init and exit. The addresses and parameters are stored in a payload buffer allocated by the master.
SM_TASK_RUNNING , SM_MSG_TYPE(SP_TASK_MANAGER, 1) , Slave , In response to SM_TASK_RUN. The task id or 0 is stored in payload. The master can free the payload buffer after receiving this response.
SM_TASK_KILL , SM_MSG_TYPE(SP_TASK_MANAGER, 2) , Master , Ask the slave core to stop running the task with the id given in payload.
SM_TASK_KILLED , SM_MSG_TYPE(SP_TASK_MANAGER, 3) , Slave , In response to SM_TASK_KILL once the core has returned to idle. The task id or 0 is stored in payload.

All messages are sent with high priority.

Resource Manager Protocol

How the application uses the resource manager protocol depends on how the predefined partition of shared resources is set up for all cores. A predefined (static) partition may be more suitable for systems that do not need to allocate and free resources dynamically. Different implementations can make their own decision.
message , value , sent by , meaning
SM_RES_MGR_REQUEST , SM_MSG_TYPE(SP_RES_MANAGER, 0) , slave , request shared resources
SM_RES_MGR_REQUEST_OK , SM_MSG_TYPE(SP_RES_MANAGER, 1) , master , request succeeds for all resources in the slave's request list
SM_RES_MGR_REQUEST_FAIL , SM_MSG_TYPE(SP_RES_MANAGER, 2) , master , request fails for at least one resource in the slave's request list
SM_RES_MGR_FREE , SM_MSG_TYPE(SP_RES_MANAGER, 3) , slave , free reserved resources
SM_RES_MGR_FREE_DONE , SM_MSG_TYPE(SP_RES_MANAGER, 4) , master , free done
SM_RES_MGR_EXPIRE , SM_MSG_TYPE(SP_RES_MANAGER, 5) , master , ask slave to stop using the resources
SM_RES_MGR_EXPIRE_DONE , SM_MSG_TYPE(SP_RES_MANAGER, 6) , slave ,
SM_RES_MGR_LIST , SM_MSG_TYPE(SP_RES_MANAGER, 7) , slave , request a list of all shared resources of a type, no payload
SM_RES_MGR_LIST_OK , SM_MSG_TYPE(SP_RES_MANAGER, 8) , master , reply with a list of all available shared resources of a resource type in the payload buffer
SM_RES_MGR_LIST_DONE , SM_MSG_TYPE(SP_RES_MANAGER, 9) , slave , finished accessing this list buffer

The same payload address should be returned in all reply messages, while the list request message has no payload. All messages are sent with normal priority.

    enum {
        SM_RES_MGR_REQUEST = SM_MSG_TYPE(SP_RES_MANAGER, 0),
        SM_RES_MGR_REQUEST_OK,
        SM_RES_MGR_REQUEST_FAIL,
        SM_RES_MGR_FREE,
        SM_RES_MGR_FREE_DONE,
        SM_RES_MGR_EXPIRE,
        SM_RES_MGR_EXPIRE_DONE,
        SM_RES_MGR_LIST,
        SM_RES_MGR_LIST_OK,
        SM_RES_MGR_LIST_DONE,
        SM_RES_MGR_MAX,
    };

The resource manager service should bind to endpoint 0 on each processor. Slave applications and the slave OS should always request all types of shared resources from this endpoint in the master OS.

    // resource manager service endpoint
    #define EP_RESMGR_SERVICE 0

The ID of a shared resource is unique among all kinds of resources. The upper 4 bits indicate the type of the shared resource, while the remaining 12 bits are the index within the given type group.
There are at most 16 (2^4) types, of which only four (peripheral, GPIO, system IRQ and DMA) are defined so far. For each type, there can be at most 4096 (2^12) individual resources. The SM_RES_MGR messages use payload to pass the resource ID, and use length to hold the 32-bit address of the resource description data if the resource type is RESMGR_TYPE_PERIPHERAL.

    // resource types
    enum {
        RESMGR_TYPE_PERIPHERAL = 0,
        RESMGR_TYPE_GPIO,
        RESMGR_TYPE_SYS_IRQ,
        RESMGR_TYPE_DMA,
        RESMGR_TYPE_MAX,
    };

    #define RES_TYPE_OFFSET 12
    #define RES_TYPE_MASK 0xF
    #define RES_SUBID_MASK 0xFFF

    // compose resource id from resource type & sub id
    #define RESMGR_ID(type, subid) (((type) << RES_TYPE_OFFSET) | ((subid) & RES_SUBID_MASK))
    // extract resource subid from resource id
    #define RESMGR_SUBID(id) ((id) & RES_SUBID_MASK)
    // extract resource type from resource id
    #define RESMGR_TYPE(id) (((id) >> RES_TYPE_OFFSET) & RES_TYPE_MASK)

The address of the resource description data should be put in the length field of the message. The data has the following format.

    typedef struct {
        uint8_t label[32];         // resource device owner name
        uint16_t count;            // number of resources in the array below
        uint32_t resources_array;  // address of the resource ID array
    } resources_t;

Resource manager API declarations:

    int sm_request_resource(uint32_t dst_cpu, uint32_t resource_id, resources_t *data);
    int sm_free_resource(uint32_t dst_cpu, uint32_t resource_id, resources_t *data);

Peripherals type

For the peripherals type, the peripheral name and list are passed in the resource description data.
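As a quick check of the resource ID layout described above (4 type bits over 12 index bits), the following sketch composes an ID and splits it again. The macros mirror the ones in the text with fully parenthesised arguments; `resmgr_id_is_valid` is a hypothetical helper added only for illustration.

```c
#include <stdint.h>

enum {
    RESMGR_TYPE_PERIPHERAL = 0,
    RESMGR_TYPE_GPIO,
    RESMGR_TYPE_SYS_IRQ,
    RESMGR_TYPE_DMA,
    RESMGR_TYPE_MAX,
};

#define RES_TYPE_OFFSET 12
#define RES_TYPE_MASK   0xF
#define RES_SUBID_MASK  0xFFF

/* Arguments are parenthesised so that expressions can be passed safely. */
#define RESMGR_ID(type, subid) \
    ((((type) & RES_TYPE_MASK) << RES_TYPE_OFFSET) | ((subid) & RES_SUBID_MASK))
#define RESMGR_SUBID(id) ((id) & RES_SUBID_MASK)
#define RESMGR_TYPE(id)  (((id) >> RES_TYPE_OFFSET) & RES_TYPE_MASK)

/* Hypothetical helper: an ID is usable if it fits 16 bits and its type is defined. */
static int resmgr_id_is_valid(uint32_t id)
{
    return id <= 0xFFFF && RESMGR_TYPE(id) < RESMGR_TYPE_MAX;
}
```

For instance, DMA channel 20 encodes as `RESMGR_ID(RESMGR_TYPE_DMA, 20)`, that is type 3 in the upper 4 bits and index 20 in the lower 12 bits.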
Example of requesting and freeing a peripherals type resource:

    unsigned short bfin_peripheral_list[] = {P_SPI1_SCK, P_SPI1_MISO, P_SPI1_MOSI, 0};
    resources_t bfin_peri_res = {
        .label = "bfin-spi1",
    };

    bfin_peri_res.count = 3;
    bfin_peri_res.resources_array = (uint32_t)bfin_peripheral_list;
    COREB_DEBUG(1, "request resource id %s\n", bfin_peri_res.label);
    ret = sm_request_resource(EP_RESMGR_SERVICE, RESMGR_ID(RESMGR_TYPE_PERIPHERAL, 0), &bfin_peri_res);
    if (ret) {
        COREB_DEBUG(1, "request peri resource failed\n");
    }
    ret = sm_free_resource(EP_RESMGR_SERVICE, RESMGR_ID(RESMGR_TYPE_PERIPHERAL, 0), &bfin_peri_res);
    if (ret) {
        COREB_DEBUG(1, "free peri resource failed\n");
    }

GPIO, IRQ and DMA type

The generic map of the GPIOs, system IRQs and DMA channels to their IDs should be defined for each architecture. The order of the resources in the Blackfin Hardware Reference Manual (HRM) can serve as a reference for the generic map.

Specifics for BF561

GPIO ID , GPIO in bf561 HRM
0 , PF0
1 , PF1
… , …
47 , PF47

System IRQ ID , System IRQ in bf561 HRM
0 , PLL_WAKEUP
1 , DMA1_ERROR
2 , DMA2_ERROR
3 , IMDMA_ERROR
4 , PPI0_ERROR
5 , PPI1_ERROR
6 , SPORT0_ERROR
7 , SPORT1_ERROR
8 , SPI0_ERROR
9 , UART0_ERROR
10 , RESERVED
11 , DMA1_CH0
12 , DMA1_CH1
13 , DMA1_CH2
14 , DMA1_CH3
15 , DMA1_CH4
16 , DMA1_CH5
17 , DMA1_CH6
18 , DMA1_CH7
19 , DMA1_CH8
20 , DMA1_CH9
21 , DMA1_CH10
22 , DMA1_CH11
23 , DMA2_CH0
24 , DMA2_CH1
25 , DMA2_CH2
26 , DMA2_CH3
27 , DMA2_CH4
28 , DMA2_CH5
29 , DMA2_CH6
30 , DMA2_CH7
31 , DMA2_CH8
32 , DMA2_CH9
33 , DMA2_CH10
34 , DMA2_CH11
35 , TIMER0
36 , TIMER1
37 , TIMER2
38 , TIMER3
39 , TIMER4
40 , TIMER5
41 , TIMER6
42 , TIMER7
43 , TIMER8
44 , TIMER9
45 , TIMER10
46 , TIMER11
47 , PF0_PF15_A
48 , PF0_PF15_B
49 , PF16_PF31_A
50 , PF16_PF31_B
51 , PF32_PF47_A 52 , PF32_PF47_B 53 , DMA1_MDMA (Memory Direct Memory Access)_STREAM0 54 , DMA1_MDMA (Memory Direct Memory Access)_STREAM1 55 , DMA2_MDMA (Memory Direct Memory Access)_STREAM0 56 , DMA2_MDMA (Memory Direct Memory Access)_STREAM1 57 , IMDMA (Intercore Memory Direct Memory Access)_STREAM0 58 , IMDMA (Intercore Memory Direct Memory Access)_STREAM0 59 , WATCHDOG 60 , RESERVED 61 , RESERVED 62 , RESERVED (SUPPLE_0 is reserved by DSP (Digital Signal Processor) bridge framework 63 , SUPPLE_1 DMA (Direct Memory Access) ID , DMA (Direct Memory Access) in bf561 HRM ((Blackfin) Hardware Reference Manual) 0 , DMA1_PPI0 1 , DMA1_PPI1 2 , RESERVED 3 , RESERVED 4 , RESERVED 5 , RESERVED 6 , RESERVED 7 , RESERVED 8 , RESERVED 9 , RESERVED 10 , RESERVED 11 , RESERVED 12 , DMA1_MEM_STREAM0_DES 13 , DMA1_MEM_STREAM0_SRC 14 , DMA1_MEM_STREAM1_DES 15 , DMA1_MEM_STREAM1_SRC 16 , DMA2_SPORT0_RX 17 , DMA2_SPORT0_TX 18 , DMA2_SPORT1_RX 19 , DMA2_SPORT1_TX 20 , DMA2_SPI0 21 , DMA2_UART0_RX 22 , DMA2_UART0_TX 23 , RESERVED 24 , RESERVED 25 , RESERVED 26 , RESERVED 27 , RESERVED 28 , DMA2_MEM_STREAM0_DES 39 , DMA2_MEM_STREAM0_SRC 30 , DMA2_MEM_STREAM1_DES 31 , DMA2_MEM_STREAM1_SRC 32 , IMDMA (Intercore Memory Direct Memory Access)_MEM_STREAM0_DES 33 , IMDMA (Intercore Memory Direct Memory Access)_MEM_STREAM0_SRC 34 , IMDMA (Intercore Memory Direct Memory Access)_MEM_STREAM1_DES 35 , IMDMA (Intercore Memory Direct Memory Access)_MEM_STREAM1_SRC Specifics for BF609 GPIO (General Purpose Input/Output) ID , GPIO (General Purpose Input/Output) in bf609 HRM ((Blackfin) Hardware Reference Manual) 0 , GPIO0 1 , GPIO1 … , … 112 , GPIO112 System IRQ (Interrupt request) ID , System IRQ (Interrupt request) in b609 HRM ((Blackfin) Hardware Reference Manual) 0 , IRQ (Interrupt request)_SEC_ERR 1 , IRQ (Interrupt request)_CGU_EVT 2 , IRQ (Interrupt request)_WATCH0 3 , IRQ (Interrupt request)_WATCH1 4 , IRQ (Interrupt request)_L2CTL0_ECC_ERR 5 , IRQ (Interrupt request)_L2CTL0_ECC_WARN 6 , 
IRQ (Interrupt request)_C0_DBL_FAULT 7 , IRQ (Interrupt request)_C1_DBL_FAULT 8 , IRQ (Interrupt request)_C0_HW_ERR 9 , IRQ (Interrupt request)_C1_HW_ERR 10 , IRQ (Interrupt request)_C0_NMI (Non-Maskable Interrupt)_L1_PARITY_ERR 11 , IRQ (Interrupt request)_C1_NMI (Non-Maskable Interrupt)_L1_PARITY_ERR 12 , IRQ (Interrupt request)_TIMER0 13 , IRQ (Interrupt request)_TIMER1 14 , IRQ (Interrupt request)_TIMER2 15 , IRQ (Interrupt request)_TIMER3 16 , IRQ (Interrupt request)_TIMER4 17 , IRQ (Interrupt request)_TIMER5 18 , IRQ (Interrupt request)_TIMER6 19 , IRQ (Interrupt request)_TIMER7 20 , IRQ (Interrupt request)_TIMER_STAT 21 , IRQ (Interrupt request)_PINT0 22 , IRQ (Interrupt request)_PINT1 23 , IRQ (Interrupt request)_PINT2 24 , IRQ (Interrupt request)_PINT3 25 , IRQ (Interrupt request)_PINT4 26 , IRQ (Interrupt request)_PINT5 27 , IRQ (Interrupt request)_CNT 28 , IRQ (Interrupt request)_PWM0_TRIP 29 , IRQ (Interrupt request)_PWM0_SYNC 30 , IRQ (Interrupt request)_PWM1_TRIP 31 , IRQ (Interrupt request)_PWM1_SYNC 32 , IRQ (Interrupt request)_TWI0 33 , IRQ (Interrupt request)_TWI1 34 , IRQ (Interrupt request)_SOFT0 35 , IRQ (Interrupt request)_SOFT1 36 , IRQ (Interrupt request)_SOFT2 37 , IRQ (Interrupt request)_SOFT3 38 , IRQ (Interrupt request)_ACM_EVT_MISS 39 , IRQ (Interrupt request)_ACM_EVT_COMPLETE 40 , IRQ (Interrupt request)_CAN0_RX 41 , IRQ (Interrupt request)_CAN0_TX 42 , IRQ (Interrupt request)_CAN0_STAT 43 , IRQ (Interrupt request)_SPORT0_TX 44 , IRQ (Interrupt request)_SPORT0_TX_STAT 45 , IRQ (Interrupt request)_SPORT0_RX 46 , IRQ (Interrupt request)_SPORT0_RX_STAT 47 , IRQ (Interrupt request)_SPORT1_TX 48 , IRQ (Interrupt request)_SPORT1_TX_STAT 49 , IRQ (Interrupt request)_SPORT1_RX 50 , IRQ (Interrupt request)_SPORT1_RX_STAT 51 , IRQ (Interrupt request)_SPORT2_TX 52 , IRQ (Interrupt request)_SPORT2_TX_STAT 53 , IRQ (Interrupt request)_SPORT2_RX 54 , IRQ (Interrupt request)_SPORT2_RX_STAT 55 , IRQ (Interrupt request)_SPI0_TX 56 , IRQ (Interrupt 
request)_SPI0_RX
57 , IRQ_SPI0_STAT
58 , IRQ_SPI1_TX
59 , IRQ_SPI1_RX
60 , IRQ_SPI1_STAT
61 , IRQ_RSI
62 , IRQ_RSI_INT0
63 , IRQ_RSI_INT1
64 , IRQ_SDU
65 , DMA12 Data Reserved
66 , Reserved
67 , Reserved
68 , IRQ_EMAC0_STAT
69 , EMAC0 Power Reserved
70 , IRQ_EMAC1_STAT
71 , EMAC1 Power Reserved
72 , IRQ_LP0
73 , IRQ_LP0_STAT
74 , IRQ_LP1
75 , IRQ_LP1_STAT
76 , IRQ_LP2
77 , IRQ_LP2_STAT
78 , IRQ_LP3
79 , IRQ_LP3_STAT
80 , IRQ_UART0_TX
81 , IRQ_UART0_RX
82 , IRQ_UART0_STAT
83 , IRQ_UART1_TX
84 , IRQ_UART1_RX
85 , IRQ_UART1_STAT
86 , IRQ_MDMA0_SRC_CRC0
87 , IRQ_MDMA0_DEST_CRC0/IRQ_MDMAS0
88 , IRQ_CRC0_DCNTEXP
89 , IRQ_CRC0_ERR
90 , IRQ_MDMA1_SRC_CRC1
91 , IRQ_MDMA1_DEST_CRC1/IRQ_MDMAS1
92 , IRQ_CRC1_DCNTEXP
93 , IRQ_CRC1_ERR
94 , IRQ_MDMA2_SRC
95 , IRQ_MDMA2_DEST/IRQ_MDMAS2
96 , IRQ_MDMA3_SRC
97 , IRQ_MDMA3_DEST/IRQ_MDMAS3
98 , IRQ_EPPI0_CH0
99 , IRQ_EPPI0_CH1
100 , IRQ_EPPI0_STAT
101 , IRQ_EPPI2_CH0
102 , IRQ_EPPI2_CH1
103 , IRQ_EPPI2_STAT
104 , IRQ_EPPI1_CH0
105 , IRQ_EPPI1_CH1
106 , IRQ_EPPI1_STAT
107 , IRQ_PIXC_CH0
108 , IRQ_PIXC_CH1
109 , IRQ_PIXC_CH2
110 , IRQ_PIXC_STAT
111 , IRQ_PVP_CPDOB
112 , IRQ_PVP_CPDOC
113 , IRQ_PVP_CPSTAT
114 , IRQ_PVP_CPCI
115 , IRQ_PVP_STAT0
116 , IRQ_PVP_MPDO
117 , IRQ_PVP_MPDI
118 , IRQ_PVP_MPSTAT
119 , IRQ_PVP_MPCI
120 , IRQ_PVP_CPDOA
121 , IRQ_PVP_STAT1
122 , IRQ_USB_STAT
123 , IRQ_USB_DMA
124 , IRQ_TRU_INT0
125 , IRQ_TRU_INT1
126 , IRQ_TRU_INT2
127 , IRQ_TRU_INT3
128 , IRQ_DMAC0_ERROR
129 , IRQ_CGU0_ERROR
130 , Reserved
131 , IRQ_DPM
132 , Reserved
133 , IRQ_SWU0
134 , IRQ_SWU1
135 , IRQ_SWU2
136 , IRQ_SWU3
137 , IRQ_SWU4
138 , IRQ_SWU5
139 , IRQ_SWU6

DMA ID , DMA in bf609 HRM
0 , CH_SPORT0_TX
1 , CH_SPORT0_RX
2 , CH_SPORT1_TX
3 , CH_SPORT1_RX
4 , CH_SPORT2_TX
5 , CH_SPORT2_RX
6 , CH_SPI0_TX
7 , CH_SPI0_RX
8 , CH_SPI1_TX
9 , CH_SPI1_RX
10 , CH_RSI
11 , CH_SDU
13 , CH_LP0
14 , CH_LP1
15 , CH_LP2
16 , CH_LP3
17 , CH_UART0_TX
18 , CH_UART0_RX
19 , CH_UART1_TX
20 , CH_UART1_RX
21 , CH_MEM_STREAM0_SRC_CRC0/CH_MEM_STREAM0_SRC
22 , CH_MEM_STREAM0_DEST_CRC0/CH_MEM_STREAM0_DEST
23 , CH_MEM_STREAM1_SRC_CRC1/CH_MEM_STREAM1_SRC
24 , CH_MEM_STREAM1_DEST_CRC1/CH_MEM_STREAM1_DEST
25 , CH_MEM_STREAM2_SRC
26 , CH_MEM_STREAM2_DEST
27 , CH_MEM_STREAM3_SRC
28 , CH_MEM_STREAM3_DEST
29 , CH_EPPI0_CH0
30 , CH_EPPI0_CH1
31 , CH_EPPI2_CH0
32 , CH_EPPI2_CH1
33 , CH_EPPI1_CH0
34 , CH_EPPI1_CH1
35 , CH_PIXC_CH0
36 , CH_PIXC_CH1
37 , CH_PIXC_CH2
38 , CH_PVP_CPDOB
39 , CH_PVP_CPDOC
40 , CH_PVP_CPSTAT
41 , CH_PVP_CPCI
42 , CH_PVP_MPDO
43 , CH_PVP_MPDI
44 , CH_PVP_MPSTAT
45 , CH_PVP_MPCI
46 , CH_PVP_CPDOA

An example of requesting and freeing other resource types:

    ret = sm_request_resource(EP_RESMGR_SERVICE, RESMGR_ID(RESMGR_TYPE_GPIO, 40), 0);
    if (ret)
        COREB_DEBUG(1, "request resource failed\n");
    ret = sm_request_resource(EP_RESMGR_SERVICE, RESMGR_ID(RESMGR_TYPE_SYS_IRQ, 52), 0);
    if (ret)
        COREB_DEBUG(1, "request resource failed\n");
    ret = sm_request_resource(EP_RESMGR_SERVICE, RESMGR_ID(RESMGR_TYPE_DMA, 20), 0);
    if (ret)
        COREB_DEBUG(1, "request resource failed\n");

    sm_free_resource(EP_RESMGR_SERVICE, RESMGR_ID(RESMGR_TYPE_GPIO, 40), 0);
    sm_free_resource(EP_RESMGR_SERVICE, RESMGR_ID(RESMGR_TYPE_SYS_IRQ, 52), 0);
    sm_free_resource(EP_RESMGR_SERVICE, RESMGR_ID(RESMGR_TYPE_DMA, 20), 0);

Packet Transfer Protocol

The packet transfer protocol transfers data between processors via locally allocated buffers. It is built on top of the message protocol described earlier. Each processor must be able to access the other processor's local memory pool via a proper CPLB configuration.

This protocol is connectionless. An endpoint registered on one processor may receive packets sent from any src_enp on the other processors.

To send a packet, the packet protocol:
1. Allocates a buffer of the packet size from the local memory management system.
2. Writes the packet data into this buffer and flushes its data cache.
3. Sends an SM_PACKET_READY message with the buffer address and length to the given endpoint on the other processor.
4. Queues this packet buffer into a sent packet list.
5. After receiving the SM_PACKET_CONSUMED message, finds the buffer in the sent packet list by the received address and frees it back to the local memory management system.
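The sender-side bookkeeping in these steps can be sketched in C as follows. This is a minimal host-side sketch under stated assumptions, not the ICC implementation: `demo_send_packet`, `demo_on_consumed`, `demo_post_msg` and the `demo_sent_pkt` list are hypothetical names, `malloc` stands in for the local memory management system, and the cache flush and the write to the shared message queue are reduced to a comment and a stub.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical message kinds; the real values come from SM_MSG_TYPE(). */
enum { DEMO_PACKET_READY = 0, DEMO_PACKET_CONSUMED = 1 };

/* One entry of the sent packet list (hypothetical layout). */
struct demo_sent_pkt {
    void *buf;
    uint32_t len;
    struct demo_sent_pkt *next;
};

static struct demo_sent_pkt *demo_sent_list; /* packets awaiting CONSUMED */
static int demo_sent_count;

/* Stub: a real sender would append to the shared queue and raise the IPI. */
static int demo_post_msg(int type, uint32_t dst_ep, void *payload, uint32_t len)
{
    (void)type; (void)dst_ep; (void)payload; (void)len;
    return 0;
}

/* Allocate, fill, announce and remember a packet. Returns the buffer or NULL. */
static void *demo_send_packet(uint32_t dst_ep, const void *data, uint32_t len)
{
    struct demo_sent_pkt *p;
    void *buf;

    assert(data != NULL && len > 0);
    p = malloc(sizeof(*p));
    buf = malloc(len);
    if (!p || !buf) {
        free(p);
        free(buf);
        return NULL;
    }
    memcpy(buf, data, len);
    /* A real sender would flush the data cache for this buffer here. */
    if (demo_post_msg(DEMO_PACKET_READY, dst_ep, buf, len) != 0) {
        free(p);
        free(buf);
        return NULL;
    }
    p->buf = buf;
    p->len = len;
    p->next = demo_sent_list;
    demo_sent_list = p;
    demo_sent_count++;
    return buf;
}

/* On CONSUMED: find the buffer by its address and free it. */
static int demo_on_consumed(void *payload)
{
    struct demo_sent_pkt **pp;

    for (pp = &demo_sent_list; *pp; pp = &(*pp)->next) {
        if ((*pp)->buf == payload) {
            struct demo_sent_pkt *p = *pp;
            *pp = p->next;
            free(p->buf);
            free(p);
            demo_sent_count--;
            return 0;
        }
    }
    return -1; /* address not found in the sent list */
}
```

Keeping the buffer on a list until the CONSUMED message arrives is what lets the receiver read the sender's memory directly: the sender never reuses or frees the buffer while the other core may still be looking at it.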
To receive a packet in ICC for a bare metal application:
1. Receive a message of type SM_PACKET_READY with the packet address and length in the message interrupt handler.
2. Notify the message loop to dispatch it to the application that is bound to the same endpoint as des_enp in the message.
3. The application can process the sender's buffer directly, or allocate a local buffer from the local memory pool and copy the data for future use.
4. After the application's dispatch callback returns, invalidate the data cache of the sender's buffer and send SM_PACKET_CONSUMED with the same payload address back to the sender.

To receive a packet in ICC for an OS:
1. Receive a message of type SM_PACKET_READY with the packet address and length in the message interrupt handler.
2. Allocate a local buffer from the OS.
3. Copy the packet data from the sender's buffer to the local buffer.
4. Append the local buffer to a received packet list indexed by the des_ep in the message.
5. Invalidate the data cache of the sender's buffer and send SM_PACKET_CONSUMED with the same payload address back to the sender.
6. Dispatch the message to the application that is bound to the same endpoint as des_ep in the message.

Messages that deliver packets have normal priority.

message type , value , meaning
SM_PACKET_READY , SM_MSG_TYPE(SP_PACKET, 0) , The sender allocates memory for the packet. len = packet length; payload = address of the packet buffer allocated by the sender
SM_PACKET_CONSUMED , SM_MSG_TYPE(SP_PACKET, 1) , The receiver has finished processing the arriving packet and the sender can free its memory. len = packet length; payload = address of the packet buffer allocated by the sender
SM_PACKET_ERROR , SM_MSG_TYPE(SP_PACKET, 2) , Signals an error. The payload field is an error code, len = 0. Both sides free the local buffers in the received packet list.
SM_PACKET_ERROR_ACK , SM_MSG_TYPE(SP_PACKET, 3) , Response to a received ERROR.

Endpoint reserved for broadcast packets:

    /*
     * Protocol layer should dispatch packet of des_ep 0xFFFF to all receivers.
     * Receivers should not bind to this endpoint.
     */
    #define EP_PACKET_BORADCAST 0xFFFF

Endpoint reserved for the debug information service:

    /*
     * Debug information service should bind to 0 end point on each processor.
     * Senders should not bind to this endpoint.
     */
    #define EP_PACKET_DEBUG_INFO 0

Session Packet Transfer Protocol

The session packet transfer protocol establishes a connection between two endpoints on different processors to transfer data via locally allocated buffers. It is built on top of the message protocol. Each processor must be able to access the other processor's local memory pool via a proper CPLB configuration.

In this protocol, a connection must be established before packets can be delivered. The server must bind to a listening endpoint in advance. After receiving a connection request message, the server creates a session with an endpoint pair made of the src_enp in the connection request and a new free local endpoint. The application can then deliver packets over this session, while the server goes back to monitoring the listening endpoint. The session is closed only after a connection close request and its ACK have been received by either party.

Broadcast data is not supported in this protocol.

Messages for the session packet protocol have normal priority.

message type , value , meaning
SM_SESSION_PACKET_CONNECT , SM_MSG_TYPE(SP_SESSION_PACKET, 0) , After allocating a new session and binding it to a local endpoint, the client sends a connection request to the server.
SM_SESSION_PACKET_CONNECT_ACK , SM_MSG_TYPE(SP_SESSION_PACKET, 1) , The server allocates a new session and responds to the connection request. After the client receives the ACK, it considers the connection established and starts to transfer data over this session. No payload.
SM_SESSION_PACKET_CONNECT_DONE , SM_MSG_TYPE(SP_SESSION_PACKET, 2) , The client sends the connection established status back to the server after receiving the ACK and before the real data transfer. No payload. After the server receives DONE, it considers the connection established and wakes up the application or thread to transfer data on the new session.
SM_SESSION_PACKET_ACTIVE , SM_MSG_TYPE(SP_SESSION_PACKET, 3) , After the connection succeeds, the client sends this message at a minute-level interval and waits for the ACK to keep the connection active. No payload.
SM_SESSION_PACKET_ACTIVE_ACK , SM_MSG_TYPE(SP_SESSION_PACKET, 4) , The server answers the active tick message to keep the connection active. No payload.
SM_SESSION_PACKET_CLOSE , SM_MSG_TYPE(SP_SESSION_PACKET, 5) , Either party in the session can send a connection close request to the other. No payload. After receiving CLOSE, free the session.
SM_SESSION_PACKET_CLOSE_ACK , SM_MSG_TYPE(SP_SESSION_PACKET, 6) , Response to the connection close request. No payload. After receiving the ACK, free the session.
SM_SESSION_PACKET_READY , SM_MSG_TYPE(SP_SESSION_PACKET, 7) , The sender allocates memory for the packet. len = packet length; payload = address of the packet buffer allocated by the sender
SM_SESSION_PACKET_COMSUMED , SM_MSG_TYPE(SP_SESSION_PACKET, 8) , The receiver has finished processing the arriving packet and the sender can free its memory. len = packet length; payload = address of the packet buffer allocated by the sender
SM_SESSION_PACKET_ERROR , SM_MSG_TYPE(SP_SESSION_PACKET, 9) , Signals an error. The payload field is an error code, len = 0. Both sides free the local buffers in the connection's received data list.
SM_SESSION_PACKET_ERROR_ACK , SM_MSG_TYPE(SP_SESSION_PACKET, 10) , Response to a received ERROR.

To enable the session packet protocol without a standard socket stack, you need at least a simple stack library (API) to:
- create and free a session which binds to a local endpoint/cpuid pair
- listen on a server session
- initiate a connection request to a remote endpoint/cpuid pair and bind the session to this pair.
- accept a connection and allocate a new session which binds to the service endpoint and the remote endpoint/cpuid pair in the request.
- read and write data via this session.

This library may differ on cores with different DSP bridge implementations.

Scalar Transfer Protocol

Scalar transfer provides an efficient method to transmit scalars (8-bit, 16-bit, 32-bit and 64-bit variants) between endpoints. It is built on top of the message protocol described earlier. The packet protocol passes a reference to locally allocated buffers through the ICC message fields (payload, length). To transmit scalars efficiently, the payload and length fields of the ICC sm_msg are instead used to pass two 32-bit scalar values directly.

message type , value , meaning
SM_SCALAR_READY_8 , SM_MSG_TYPE(SP_SCALAR, 0) ,
SM_SCALAR_READY_16 , SM_MSG_TYPE(SP_SCALAR, 1) ,
SM_SCALAR_READY_32 , SM_MSG_TYPE(SP_SCALAR, 2) ,
SM_SCALAR_READY_64 , SM_MSG_TYPE(SP_SCALAR, 3) ,
SM_SCALAR_CONSUMED , SM_MSG_TYPE(SP_SCALAR, 4) ,
SM_SCALAR_ERROR , SM_MSG_TYPE(SP_SCALAR, 5) ,
SM_SCALAR_ERROR_ACK , SM_MSG_TYPE(SP_SCALAR, 6) ,

Session Scalar Transfer Protocol

Like scalar transfer, session scalar transfer also transmits scalars (8-bit, 16-bit, 32-bit and 64-bit variants) between endpoints. It is built on top of the message protocol described earlier. In this protocol, a connection must be established before scalar data can be delivered.
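Both scalar protocols carry the value inside the message itself rather than in a separate buffer. The sketch below shows one way a 64-bit scalar could be split across the two 32-bit message fields. It assumes, without confirmation from the ICC sources, that the low word travels in the payload field and the high word in the length field; `demo_scalar_msg` and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical view of the two 32-bit sm_msg fields reused for scalars. */
struct demo_scalar_msg {
    uint32_t payload; /* assumed: low 32 bits of the scalar  */
    uint32_t length;  /* assumed: high 32 bits of the scalar */
};

/* Split a 64-bit scalar across the two 32-bit fields. */
static struct demo_scalar_msg demo_pack_u64(uint64_t value)
{
    struct demo_scalar_msg m;

    m.payload = (uint32_t)(value & 0xFFFFFFFFu);
    m.length = (uint32_t)(value >> 32);
    return m;
}

/* Rebuild the 64-bit scalar on the receiving side. */
static uint64_t demo_unpack_u64(struct demo_scalar_msg m)
{
    return ((uint64_t)m.length << 32) | (uint64_t)m.payload;
}

/* Narrower scalars fit in the payload field alone; length stays zero here. */
static struct demo_scalar_msg demo_pack_u16(uint16_t value)
{
    struct demo_scalar_msg m;

    m.payload = value;
    m.length = 0;
    return m;
}
```

Because no buffer is allocated, no cache flush or sent packet list is needed for the value itself, which is where the efficiency gain over the packet protocols comes from.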
message type , value , meaning
SM_SESSION_SCALAR_READY_8 , SM_MSG_TYPE(SP_SESSION_SCALAR, 0) ,
SM_SESSION_SCALAR_READY_16 , SM_MSG_TYPE(SP_SESSION_SCALAR, 1) ,
SM_SESSION_SCALAR_READY_32 , SM_MSG_TYPE(SP_SESSION_SCALAR, 2) ,
SM_SESSION_SCALAR_READY_64 , SM_MSG_TYPE(SP_SESSION_SCALAR, 3) ,
SM_SESSION_SCALAR_COMSUMED , SM_MSG_TYPE(SP_SESSION_SCALAR, 4) ,
SM_SESSION_SCALAR_ERROR , SM_MSG_TYPE(SP_SESSION_SCALAR, 5) ,
SM_SESSION_SCALAR_ERROR_ACK , SM_MSG_TYPE(SP_SESSION_SCALAR, 6) ,
SM_SESSION_SCALAR_CONNECT , SM_MSG_TYPE(SP_SESSION_SCALAR, 7) ,
SM_SESSION_SCALAR_CONNECT_ACK , SM_MSG_TYPE(SP_SESSION_SCALAR, 8) ,
SM_SESSION_SCALAR_CONNECT_DONE , SM_MSG_TYPE(SP_SESSION_SCALAR, 9) ,
SM_SESSION_SCALAR_ACTIVE , SM_MSG_TYPE(SP_SESSION_SCALAR, 10) ,
SM_SESSION_SCALAR_ACTIVE_ACK , SM_MSG_TYPE(SP_SESSION_SCALAR, 11) ,
SM_SESSION_SCALAR_CLOSE , SM_MSG_TYPE(SP_SESSION_SCALAR, 12) ,
SM_SESSION_SCALAR_CLOSE_ACK , SM_MSG_TYPE(SP_SESSION_SCALAR, 13) ,

Inter-core communication framework design for Linux and bare metal applications

This section describes a framework to be implemented on Linux that will use the above communication protocols.

Design Goal

The design goal is to control Core B from both userspace and the kernel in as generic a way as possible, so that a user can load/start/stop/reload any acceleration or RTOS task. To accomplish this, we lean on the OSI network model, which we review here to provide a little context.

The OSI model was developed by the International Organization for Standardization (ISO) as a guideline for developing standards to enable the interconnection of dissimilar computing devices. It is important to understand that the OSI model is not itself a communication standard. In other words, it is not an agreed-on method that governs how data is sent and received; it is only a guideline for developing such standards. It would be difficult to overstate the importance of the OSI model.
Virtually all vendors and users of products which must communicate over the network understand how important it is that their products adhere to and fully support the networking standards this model has generated. When a vendor's products adhere to the standards the OSI model has generated, connecting those products to other vendors' products is relatively simple. Conversely, the further a vendor departs from those standards, the more difficult it becomes to connect that vendor's products to those of other vendors. In addition, if a vendor were to depart from the communication standards the model has engendered, software development efforts would be very difficult because the vendor would have to build every part of all necessary software, rather than being able to build on the existing work of other vendors. In the “Core B” scenario, the implications are the same. By providing standard communications methods, and allowing people to build on these standard methods, it will make interoperability higher, at the same time as lowering development costs. 
Layer 7: Application Layer
- Defines interface to user processes for communication and data transfer in network
- Provides standardized services such as virtual terminal, file and job transfer and operations
- In the Core B model - this is a Linux application on Core A talking to an RTOS application on Core B via Linux standard methods

Layer 6: Presentation Layer
- Masks the differences of data formats between dissimilar systems
- Specifies architecture-independent data transfer format
- Encodes and decodes data; encrypts and decrypts data; compresses and decompresses data
- In the Core B model - this is responsible for

Layer 5: Session Layer
- Manages user sessions and dialogues
- Controls establishment and termination of logic links between users
- Reports upper layer errors
- In the Core B model - this is responsible for

Layer 4: Transport Layer
- Manages end-to-end message delivery in network
- Provides reliable and sequential packet delivery through error recovery and flow control mechanisms
- Provides connectionless oriented packet delivery
- In the Core B model - this is responsible for

Layer 3: Network Layer
- Determines how data are transferred between network devices
- Routes packets according to unique network device addresses
- Provides flow and congestion control to prevent network resource depletion
- In the Core B model - this is responsible for

Layer 2: Data Link Layer
- Defines procedures for operating the communication links
- Frames packets
- Detects and corrects packet transmission errors
- In the Core B model - this is the mechanics of using layer 1 so that both cores know how to pass data back and forth without losing data.
Layer 1: Physical Layer
- Defines physical means of sending data over network devices
- Interfaces between network medium and devices
- Defines optical, electrical and mechanical characteristics
- In the Core B model - this is the physical addresses of common memory (and cache flushing if necessary), mailboxes, interrupts, and other things necessary to pass data between Core A and Core B.

Summary

The basic communication framework is intended to allow both message and stream based communication, in either a synchronous or an asynchronous way. A simple API is defined and libraries are provided for:
- Linux userspace applications
- Linux kernel modules
- Core B code

At this time, only the layer 1 to layer 3 protocols are defined; anything higher than passing raw data back and forth is implementation and user application dependent.

< Layer 5 ~ 7 > user data, user defined commands, statistics counters
< Layer 3 ~ 4 > Packet, connection packet, data stream, core control, resource manager in local runtime allocated memory
< Layer 2 > Message queue in memory at a fixed known address
< Layer 1 > Inter-processor interrupt and shared memory

User Interface

Two kinds of interface are available: one for Linux applications and one for bare metal applications.

Linux User Interface

From the point of view of a Linux user, the icc is a device driver that controls the DSP devices and bridges the programs running on the DSPs and the Linux user applications. The program running on the DSP is an ELF non-relocatable binary. It can be loaded by the icc driver at the request of the Linux user application. The kernel icc driver builds a packet list for each registered endpoint. Packets from the DSP side are copied and added to this list, where they wait for the user application to fetch them. If the DSP device is opened in non-blocking mode, poll with the select system call or register the signal SIG_DSP_PACKET_ARRIVE, and do the real message receiving operation in the application.

Control of DSP device

DSP bridge ioctl commands are executed through the combined efforts of the main CPU and the DSP device.
- CMD_DSP_LOAD - Load an ELF non-relocatable binary into the reserved memory of a specified DSP device.
- CMD_DSP_START - Wake up the DSP device and make it execute the user binary from the start address.
- CMD_DSP_STOP - Stop the DSP device from executing the user binary and make it sleep in the idle loop.
- CMD_DSP_RESET - Reinitialize the DSP device resources and make it sleep in the idle loop.

    char *pathname; /* path of the ELF binary to load */

    ioctl(fd, CMD_DSP_LOAD, pathname);
    ioctl(fd, CMD_DSP_START, NULL);
    ioctl(fd, CMD_DSP_STOP, NULL);
    ioctl(fd, CMD_DSP_RESET, NULL);

Connectionless Packet Communication

The network layer interface transfers buffers among the Linux user application, the kernel driver, and the program running on the DSP core.

    /*
     * remote_ep - destination end point in sending operation, local endpoint which receiver binds to
     * local_ep - sender's endpoint in sending operation, should be 0 in receiving operation
     * buf_len is used to indicate the actual data size to send or have been received.
     * type - packet protocol type, connectionless or connection packet, SP_PACKET or SP_SESSION_PACKET
     */
    struct sm_packet {
        sm_uint32_t session_idx;
        sm_uint32_t local_ep;
        sm_uint32_t remote_ep;
        sm_uint32_t type;
        sm_uint32_t dst_cpu;
        sm_uint32_t src_cpu;
        sm_uint32_t buf_len;
        void *buf;
    };

ioctl commands:
- CMD_SM_CREATE - create a local endpoint as defined by the packet.
- CMD_SM_SEND - send a packet to the destination CPU.
- CMD_SM_RECV - receive a packet from the local endpoint.
- CMD_SM_CONNECT - connect to a remote endpoint, paired with the local endpoint.
- CMD_SM_SHUTDOWN - shut down a local session, disconnect and free all its resources.

    struct sm_packet pkt;
    char buf[64] = "1234567890abcdef";

    memset(&pkt, 0, sizeof(struct sm_packet));
    pkt.local_ep = 9;
    pkt.remote_ep = 5;
    pkt.type = SP_PACKET;
    pkt.dst_cpu = 1;
    pkt.buf_len = 16;
    pkt.buf = buf;

    ioctl(fd, CMD_SM_CREATE, &pkt);
    ioctl(fd, CMD_SM_SEND, &pkt);
    ioctl(fd, CMD_SM_SHUTDOWN, &pkt);

Connection based Packet Communication

    struct sm_packet pkt;
    char buf[64] = "1234567890abcdef";

    memset(&pkt, 0, sizeof(struct sm_packet));
    pkt.local_ep = 9;
    pkt.remote_ep = 6;
    pkt.type = SP_SESSION_PACKET;
    pkt.dst_cpu = 1;
    pkt.buf_len = 16;
    pkt.buf = buf;

    printf("sp packet %d\n", pkt.type);
    printf("begin create ep\n");
    ioctl(fd, CMD_SM_CREATE, &pkt);
    printf("finish create ep session index = %d\n", pkt.session_idx);
    ioctl(fd, CMD_SM_CONNECT, &pkt);
    ioctl(fd, CMD_SM_SEND, &pkt);
    ioctl(fd, CMD_SM_RECV, &pkt);
    /* get buffer from pkt.buf */
    ioctl(fd, CMD_SM_SHUTDOWN, &pkt);

DSP User Interface

DSP initialization

The icc device node is /dev/icc. When the Core B DSP binary is loaded by the icc driver, the driver starts each DSP so that it initializes its CPLB and event controller properly. The IPI interrupt is configured specifically for DSP bridge message and control notification. After initialization is done, the DSP devices sleep in an idle loop at IRQ level 15. The DSP initialization and idle loop code and data reside in memory shared by all DSPs and the main CPU.

Application initialization

Each DSP application should implement two entry points (icc_task_init, icc_task_exit). icc_task_init is where the DSP application registers its endpoint and its protocol based packet dispatch functions. The DSP runs this entry point in EVT7 mode when it is asked to start by a task run message.
The DSP application should either call icc_wait() to wait for incoming messages or register session handler callbacks via the registration API sm_registe_session_handler(). After task_init it exits to EVT15 and waits for new messages to handle. icc_task_exit is where the DSP application stops running and exits with cleanup.

sample1

    sm_uint32_t __icc_task_data session_index;

    void icc_task_init(int argc, char *argv[])
    {
        struct sm_session *session;
        void *buf;
        int len;
        int ret;
        int src_ep, src_cpu;

        session_index = sm_create_session(LOCAL_SESSION, SP_PACKET);
        coreb_msg("%s() %s %s index %d\n", __func__, argv[0], argv[1], session_index);
        if (session_index >= 32)
            coreb_msg("create session failed\n");

        while (1) {
            coreb_msg("task loop\n");
            if (icc_wait()) {
                ret = sm_recv_packet(session_index, &src_ep, &src_cpu, &buf, len);
                if (ret <= 0) {
                    coreb_msg("recv packet failed\n");
                    continue; /* nothing received, so nothing to handle or release */
                }
                /* handle payload */
                coreb_msg("processing msg %s\n", buf);
                if (*(char *)buf == '1') {
                    int len = 64;
                    int dst_ep = src_ep;
                    int dst_cpu = src_cpu;
                    void *send_buf = sm_send_request(len, session_index);

                    coreb_msg("coreb send buf %x\n", send_buf);
                    if (!send_buf) {
                        coreb_msg("NO MEM\n");
                    } else {
                        /* reply to the sender of the packet */
                        memset(send_buf, 0, len);
                        strcpy(send_buf, "finish");
                        sm_send_packet(session_index, dst_ep, dst_cpu, send_buf, len);
                    }
                } else {
                    coreb_msg("msg payload %s \n", buf);
                }
                sm_recv_release(buf, len, session_index);
            }
        }
        coreb_msg("%s() end\n", __func__);
    }

    void icc_task_exit(void)
    {
        sm_destroy_session(session_index);
    }

sample2

    /* session index shared by the entry points and the handler */
    static sm_uint32_t index;

    int default_session_handle(struct sm_message *msg, struct sm_session *session);

    void icc_task_init(int argc, char *argv[])
    {
        struct sm_session *session;

        index = sm_create_session(LOCAL_SESSION, SP_PACKET);
        coreb_msg("%s() %s %s index %d\n", __func__, argv[0], argv[1], index);
        if (index >= 32)
            coreb_msg("create session failed\n");
        session = &coreb_info.icc_info.sessions_table[index];
        sm_registe_session_handler(index, default_session_handle);
        coreb_msg("%s() end\n", __func__);
    }

    void icc_task_exit(void)
    {
        sm_destroy_session(index);
    }

    int default_session_handle(struct sm_message *msg, struct sm_session *session)
    {
        void *buf;
        sm_uint32_t len;
        int ret;

        coreb_msg(" %s session %d msg %s \n", __func__, session->local_ep, msg->payload);
        coreb_msg("dst %d dstep %d, src %d, srcep %d\n", msg->dst, msg->dst_ep, msg->src, msg->src_ep);
        ret = sm_recv_packet(index, &buf, len);
        if (ret <= 0) {
            coreb_msg("recv packet failed\n");
            return ret;
        }
        /* handle payload */
        coreb_msg("processing msg %s\n", buf);
        if (*(char *)buf == '1') {
            int len = 64;
            int dst_ep = msg->src_ep;
            int dst_cpu = msg->src;
            void *send_buf = sm_send_request(len, session);

            coreb_msg("coreb send buf %x\n", send_buf);
            if (!send_buf) {
                coreb_msg("NO MEM\n");
            } else {
                /* reply to the sender of the packet */
                memset(send_buf, 0, len);
                *(char *)send_buf = 'f';
                sm_send_packet(index, dst_ep, dst_cpu, send_buf, len);
            }
        } else {
            coreb_msg("msg payload %s \n", buf);
        }
        sm_recv_release(buf, len, session);
        return 0;
    }

packet transfer

int sm_send_packet(sm_uint32_t session_idx, sm_uint32_t dst_ep, sm_uint32_t dst_cpu, void *buf, sm_uint32_t len)
- send a packet from the DSP application to the Linux side

int sm_recv_packet(sm_uint32_t session_idx, void **buf, sm_uint32_t len)
- receive a packet from the icc message queue into the DSP application

manage message buffer

void *sm_send_request(sm_uint32_t size, struct sm_session *session)
- prepare a message buffer before sending a packet; the message buffer is freed automatically after the message is handled

void sm_recv_release(void *addr, sm_uint32_t size, struct sm_session *session)
- after the DSP application finishes handling the packet, call sm_recv_release to free the message buffer

Connectionless Packet Communication

The DSP application calls register_packet_dispatch_callback in its main entry point to register its packet dispatch function and the sender's cleanup function. The registered packet receive callback functions are invoked in EVT15 mode (IPEND = 0x8000) as well.
    /*
     * endpoint - bind to a local endpoint to receive incoming packet.
     * src_cpuid - processor who sends the incoming packet.
     * src_enp - source endpoint of the incoming packet.
     * len - the length of the buffer.
     * packet - the buffer pointer.
     */
    int sm_register_session_handler(sm_uint32_t session_idx, void (*handle)(struct sm_message *message, struct sm_session *session))

Connection based Packet Communication

sm_connect_session(session_idx, remote_ep, dst_cpu);
- connect the local endpoint to a remote endpoint on another processor

sm_close_session(session_idx, remote_ep, dst_cpu);
- close a connection between two endpoints

After a session is connected, sending and receiving data works the same way as packet transfer, using sm_send_packet() and sm_recv_packet().

A Simple Example

This example is based on the network layer communication APIs defined for both the Linux application and the bare metal DSP application.

Linux APP
A simple packet sample

Core 1 Bare Metal Application
Core 1 bare metal sample

Link the DSP application

The bare metal application should be linked with the DSP bridge library in order to interact with the Linux application properly. The offset address of the entry main() can be discovered by the DSP bridge kernel module when loading.

The compile command, when compiling on a Linux host, looks like:

    $ bfin-elf-gcc -T coreb.lds -mcpu=bf561 -D__DSP__ coreb.c dsp_bridge.a -o coreb.bin

The linker script coreb.lds looks like:

    MEMORY
    {
      MEM_L1_CODE : ORIGIN = 0xFF600000, LENGTH = 0x4000
      MEM_L1_CODE_CACHE : ORIGIN = 0xFF610000, LENGTH = 0x4000
      MEM_L1_SCRATCH : ORIGIN = 0xFF700000, LENGTH = 0x1000
      MEM_L1_DATA_B : ORIGIN = 0xFF500000, LENGTH = 0x8000
      MEM_L1_DATA_A : ORIGIN = 0xFF400000, LENGTH = 0x8000
      MEM_L2 : ORIGIN = 0xFEB00000, LENGTH = 0x20000
    }

    OUTPUT_FORMAT("elf32-bfin", "elf32-bfin", "elf32-bfin")
    OUTPUT_ARCH(bfin)
    ENTRY(_main)

    SECTIONS
    {
      .text_l1 :
      {
        /*
         * Here is the reserved jump instruction to jump to the Linux
         * dsp device driver core B init code.
         */
        . = MEM_L1_CODE + 0x10;
        *(.l1.text)
      } >MEM_L1_CODE =0

      .text :
      {
        /*
         * Here is the static shared message queues between core A and B.
         */
        . = MEM_L2 + 0x40;
        *(.text.*)
      } >MEM_L2 =0

      .l2 : { *(.l2 .l2.*) } >MEM_L2 =0
      .data_l1 : { *(.l1.data) } >MEM_L1_DATA_A =0
      .data : { *(.data .data.*) } >MEM_L2

      .bss :
      {
        __bss_start = .;
        *(.bss .bss.*)
        __bss_end = .;
      } >MEM_L2

      __stack_end = ORIGIN(MEM_L1_SCRATCH) + LENGTH(MEM_L1_SCRATCH);
    }

Implementation approach

The following aspects are described:
- Initializing the DSP CPLB and event controller.
- Loading and controlling the DSP bare metal application via the core control protocol.
- Dispatching packets via the packet transfer protocol.
- ICC library for bare metal applications.
- ICC Linux driver for Linux applications.

Load and control bare metal application

The DSP bridge relies on endpoint 0 to control the DSP application status via the core control protocol. The message dispatch loop on the DSP core can react to the core control commands.

    enum { EP_CORE_CONTROL = 0 };

The bare metal application is loaded by a user space loader into the core B memory space. The main and dispatch entries in the application and its dsp_bridge library are resolved by the loader. The loader informs the dsp_driver of these entry addresses.

Dispatch packet

The icc session layer manages the user space packet send/recv sessions. The sm_session data structure:

    struct sm_session {
        struct list_head rx_messages; /* rx queue sm message */
        struct list_head tx_messages;
        uint32_t n_avail;
        uint32_t n_uncompleted;
        uint32_t local_ep;
        uint32_t remote_ep; /* remote ep */
        uint32_t type;
        pid_t pid;
        uint32_t flags;
        int (*handle)(struct sm_message *msg, struct sm_session *session);
        struct sm_proto *proto_ops;
        uint32_t queue_priority;
        wait_queue_head_t rx_wait;
    } __attribute__((__aligned__(4)));

If the icc queue is full, a packet send blocks on the icc queue's tx_wait wait queue until the tx queue is no longer full.
A packet receive blocks on the session's rx_wait queue if there is no available message, until the IPI wakes up the icc queue thread to handle the incoming message. The thread receives the message into a packet and wakes up the packet recv process sleeping on the rx_wait queue. message_queue_thread is the kernel thread that handles incoming messages; the remote IPI wakes up this thread.

On core running Linux

To send a packet from an application, the packet buffer must first be allocated in user space. Its pointer is then passed to the kernel system call. The kernel code also allocates a buffer in kernel space and copies the user data in. After that, the DSP bridge driver appends a packet ready message with the packet address and length to the shared message queue in L2 memory and links the packet buffer to the sent list.

When a packet ready message arrives, the DSP bridge driver allocates a kernel buffer and copies the packet in. It links this buffer to the receiving list of the registered endpoint and sends a packet consumed message with the received packet address back.

When a packet consumed message arrives, the DSP bridge driver frees the packet with the received address from the sent list.

On core running bare metal application

Sending a packet from a bare metal application is similar to sending under Linux, except that there is no packet copying between the application and the DSP bridge library, which is linked into the bare metal application. The library appends a packet ready message with the packet address to the message queue.

When the message IPI is triggered, the DSP core wakes up from the message dispatch loop in the DSP bridge driver. If the message does not belong to the core control protocol, it jumps to the entry of the upper layer dispatch loop in the DSP bridge library. This dispatch loop handles all message protocols except for the core control protocol.
Its symbol name is known to the DSP bridge driver, and its address in the bare metal application is resolved at the loading stage.

When a packet ready message arrives, the dispatch loop in the DSP bridge library looks up the registered callback for the destination endpoint in the message. It then invokes the callback with the source endpoint, packet length and packet address. After the application finishes processing the packet, it sends a packet consumed message with the received packet address back. The application must not access this packet after it exits the callback.

When a packet consumed message arrives, the dispatch loop in the DSP bridge library calls the application callback to free the packet with the received address.

ICC bare metal library

Local system interrupt Supp0 and core interrupt EVT15 are always reserved for the DSP bridge library. The following APIs are to be implemented in the DSP bridge library:
- memory allocation against local L1, L2, DRAM
- register message callback handlers for packet transfer and resource manager protocols
- socket-like interface (optional)
- register local exception, hardware error and core timer callback handlers (optional)
- map system interrupt to a given local core interrupt (optional)

Run and test ICC

The ICC framework is now enabled for both BF561 and BF609. Steps to run and test the initial ICC implementation for Linux and bare metal can be found at test_icc.

Load ICC applications to slave core

Load bare metal apps to core B

You should follow the example under the folder icc_utils/. The ICC stub (main event loop) for core B should be loaded by icc_utility from the Linux filesystem before loading any further core B ICC applications. This can be done in /etc/rc or any time later. You can also build the ICC stub and the applications for core B into one ELF binary and load it at once after kernel bootup.
Load RTOS to core B

You can either boot the RTOS in the same way as the ICC stub, or boot it directly from the proper address in NOR flash with the help of u-boot/kernel.

Debug ICC applications

GDB (GNU Debugger) with gdbserver over Ethernet or UART is the only way to debug a Linux application on core A. For an RTOS on core B, debugging depends on the application debugging tools available in that RTOS. For a bare metal application on core B, only JTAG tools are applicable, such as GDB with gdbproxy over JTAG. To debug two ICC applications on two cores, you have to run two debugging instances concurrently and step through each application individually.

Process to debug ICC applications:
1. Build the Linux distribution with the ICC driver and utility.
2. Attach GDB and gdbproxy to core B.
3. Boot Linux on core A.
4. Build the ICC stub/apps and copy them to the Linux filesystem via Ethernet.
5. Load the ICC stub and application to core B from Linux.
6. Set a breakpoint on core B and run.
7. Load the Linux application under gdbserver and GDB.
8. Set a breakpoint in the Linux app and run.
9. Debug.
10. Revise the ICC app source code and go to step 3.

Wrap ICC framework by MCAPI

To better address the issue of proprietary Inter-Processor Communication (IPC), the Multicore Association (MCA) created an API-based standard called the Multicore Communication API (MCAPI). MCAPI is used in AMP configurations that require communication and synchronization between multiple operating system instances. MCAPI defines three fundamental communication types:
1. messages - connection-less datagrams
2. packet channels - connection-oriented, uni-directional, FIFO (first-in first-out) packet streams
3. scalar channels - connection-oriented, single-word, uni-directional, FIFO packet streams

MCAPI overview

MCAPI Domains

An MCAPI domain comprises one or more MCAPI nodes in a multicore topology and is used for routing purposes. Potential uses for domains:
- separation between different transports

MCAPI Nodes

An MCAPI node is a logical abstraction that can be mapped to many entities, including but not limited to: a process, a thread, an instance of an operating system, a hardware accelerator, or a processor core.

MCAPI Endpoints

MCAPI endpoints are socket-like communication termination points.

MCAPI Channels

Channels provide point-to-point FIFO connections between a pair of endpoints. MCAPI channels are unidirectional.

MCAPI implementation concerns
- Link management

MCAPI implementation on top of ICC

The MCAPI specification is both an API and a communications semantics specification. It does not define the link management, device model or wire protocol underneath it. In our use case, we will implement the MCAPI 2.0 APIs on top of the ICC protocol. The domain field in our implementation will be used to separate different transport types (0 for the ICC protocol in our case). In the long term there will be transport types other than ICC, following the new multi-core architecture. The node ID will be used to identify processor cores (e.g. 0 for core A and 1 for core B on BF561). The port ID will be used to map to the ICC session used to communicate with the other end on another core.

MCAPI and transport layer: MCAPI application interfaces; initialize and finalize MCAPI, create endpoints, and manage MCAPI data communication between two endpoints. We can implement MCAPI on top of ICC by modifying the transport layer. The core A node and the core B node will be statically created on core A and core B; each is a logical abstraction instance of a core node (or OS node).
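The field assignments described for this implementation (domain 0 for the ICC transport, node 0 for core A and node 1 for core B on BF561, port ID mapped to an ICC session) can be sketched in C as follows. This is only an illustration under stated assumptions: the struct, the function names, the table size and the use of a plain integer as an ICC session handle are all hypothetical and do not come from the MCAPI or ICC sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Values taken from the text; the identifier names are hypothetical. */
enum { DOMAIN_ICC = 0 };                          /* domain 0: ICC transport */
enum { NODE_CORE_A = 0, NODE_CORE_B = 1, NODE_COUNT = 2 };

#define MAX_PORTS  16                             /* assumed table size */
#define NO_SESSION (-1)

/* An endpoint named by its domain, node and port fields. */
struct endpoint {
    uint32_t domain;
    uint32_t node;
    uint32_t port;
};

/* Port ID -> ICC session handle (a plain int in this sketch). */
static int port_to_session[MAX_PORTS];

/* Mark every port as unbound. */
void port_table_init(void)
{
    for (size_t i = 0; i < MAX_PORTS; i++)
        port_to_session[i] = NO_SESSION;
}

/* Bind a port to a session; returns 0 on success, -1 if invalid or in use. */
int bind_port(uint32_t port, int session)
{
    if (port >= MAX_PORTS || session < 0 ||
        port_to_session[port] != NO_SESSION)
        return -1;
    port_to_session[port] = session;
    return 0;
}

/* Resolve an endpoint to its session, or NO_SESSION if it cannot be mapped. */
int session_for_endpoint(const struct endpoint *ep)
{
    assert(ep != NULL);
    if (ep->domain != DOMAIN_ICC || ep->node >= NODE_COUNT ||
        ep->port >= MAX_PORTS)
        return NO_SESSION;
    return port_to_session[ep->port];
}
```

After port_table_init() and bind_port(5, 2), session_for_endpoint() returns 2 for the endpoint {DOMAIN_ICC, NODE_CORE_B, 5}, while an endpoint in any other domain resolves to NO_SESSION.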
Then MCAPI ports can be implemented on top of ICC sessions; each endpoint maps to an ICC session. MCAPI endpoints, each identified by a (domain, node, port) tuple, map to their counterparts in the ICC layer, so data delivery between a pair of MCAPI endpoints can be implemented on top of ICC.

resource management: OS-specific resource management layer; manages shared memory, semaphore synchronization, the ICC dev interface, and the device node implementation on top of ICC.

ICC: inter-core communication layer based on shared memory and inter-core interrupts.

physical layer: physical data delivery layer; can be L2 shared memory, link port, etc.

Run and test MCAPI 2.0

Steps to run and test the MCAPI 2.0 implementation for Linux and bare metal on BF561 or BF609 can be found at test_mcapi.

Other implementations
- ARM-DSP Bridge
- DSP Gateway