world leader in high performance signal processing
Trace: » dotprod

The Inner Product

As previously described the dot product or inner product is a critical element of signal processing. Its used in many different DSP applications, its interesting to note that things like DFT's, matrix multiplications and convolutions are all based on this primitive operation. dot product

This is the code from making_the_blackfin_perform.

int dot_asm_cycles(short *x, short *y, int len)
{
    int before,after;
	int dot;

	__asm__("I0 = %3;\n\t"
		"I1 = %4;\n\t"
		"A1 = A0 = 0;\n\t"
		"R0 = [I0++] || R1 = [I1++];\n\t"
		"R2 = CYCLES;\n\t"
		"%1 = R2;\n\t"
		"LOOP dot%= LC0 = %5 >> 1;\n\t"
		"LOOP_BEGIN dot%=;\n\t"
		"A1 += R0.H * R1.H, A0 += R0.L * R1.L || R0 = [I0++] || R1 = [I1++];\n\t"
		"LOOP_END dot%=;\n\t"
		"R2 = CYCLES;\n\t"
		"%2 = R2;\n\t"
		"R0 = (A0 += A1);\n\t"
		"%0 = R0 >> 1;\n\t"	/* correct for left shift during multiply */
		: "=&d"(dot), "=&d"(before), "=&d"(after)
		: "a"(x), "a"(y), "a"(len)
		: "I0", "I1", "A1", "A0", "R0", "R1", "R2");
	return dot;
}

The code given for the dot product on another page making_the_blackfin_perform is a little miss leading. This code representative of a dot product however the code by itself is not very useful in that it forces the alignment issue for all the computations. To draw a better picture of this lets say we have a vector x0,x1,x2,x3…xn and a set of coefficients c0,c1,..,cm. It is true that the dot_asm_cycles function performs the inner product for all sub vectors of X that are even however when the displacement into X is odd the code crashes. This problem is typical on many SIMD based architectures.

There are many different solutions to this problem on Blackfin, we will look at a one of them. First make sure you understand the problem of odd displacement data accesses and why thats an issue. If its not clear I guess you think of it this way Blackfin enforces strict data alignment and we as application programmers need to abide by that rule. To make full use of the Dual Compute capabilities of the Blackfin one needs to feed it data. And to feed the beast one needs to load a pair of coefficients and a pair of X's to perform the computation. Atleast that is how the previous inner loop worked here it is for reference.

		"LOOP_BEGIN dot%=;\n\t"
		"A1 += R0.H * R1.H, A0 += R0.L * R1.L || R0 = [I0++] || R1 = [I1++];\n\t"
		"LOOP_END dot%=;\n\t"

In the above inner loop R0 loads coefficients and R1 loads X's actually it doesn't matter which is which.

Lets first look at splitting the computation which currently we pay for anyways. We actually pay 1 cycle to keep the dot product working on a single element.

		"R0 = (A0 += A1);\n\t"

If we split the computation with out any details that A0+=A1 instruction vanishes that in itself is a savings. And now A0 computes 1 dot product, and A1 computes another dot product. Performing multiple dot products is rarely a problem considering a dot product is typically performed more than once anyways. If you look at a convolution Fir filtering IIR filtering, Matrixing etc they can all split the computation such that we compute 2 consecutive dot products in the inner loop.

This technique is one of the key architectural advantages of Blackfin. Here we will show you why the register halves have so much flexibility when working with load/store and mac instructions.

{x_0},{x_1},{x_2},{x_3},{cdots},{x_{n-1}} {c_0},{c_1},{c_2},{cdots},{c_{m-1}}

Then we combine the sums in pairs

{c_0}{x_0}+{c_1}{x_1}+{c_2}{x_2}... {c_0}{x_1}+{c_1}{x_2}+{c_2}{x_3}...

R0=*X++;

Here we load R0 with a pair of X's namely x0 and x1, where R0.h=x1 and R0.l=x0 due to the little endian hardware feature.

R1.L=*C++;

Here we load R1.L with the low order coef.

If we look at the series of multiplication we need to perform we get the following:

x0*c0, x1*c0
x1*c1, x2*c1
x2*c2, x3*c2
x3*c3, x4*c3
...

Which results in the recognition of the following MAC sequence:

  A1 += R0.L*R1.L, A0 += R0.H*R1.L || R1.L = *C++ || R0.L = *X++;
  A1 += R0.H*R1.L, A0 += R0.L*R1.L || R1.L = *C++ || R0.H = *X++;

Now lets replace C and X with there respective instructions, I0 will be the input pointer, and I1 will be the pointer to the coefficients. These two instructions make up a proper blackfin inner product.

  A1 += R0.L*R1.L, A0 += R0.H*R1.L || R1.L = W[I1++] || R0.L = W[I0++];
  A1 += R0.H*R1.L, A0 += R0.L*R1.L || R1.L = W[I1++] || R0.H = W[I0++];

Putting this into a loop

  lsetup (.0f,.1f) lc0=p0>>2;
  .0:  A1 += R0.L*R1.L, A0 += R0.H*R1.L || R1.L = W[I1++] || R0.L = W[I0++];
  .1:  A1 += R0.H*R1.L, A0 += R0.L*R1.L || R1.L = W[I1++] || R0.H = W[I0++];

Adding the loop prologue:

  A1=A0=0 || R1.L=W[I1++] || R0 = [I0++];
  lsetup (.0f,.1f) lc0=p0>>2;
  .0:  A1 += R0.L*R1.L, A0 += R0.H*R1.L || R1.L = W[I1++] || R0.L = W[I0++];
  .1:  A1 += R0.H*R1.L, A0 += R0.L*R1.L || R1.L = W[I1++] || R0.H = W[I0++];

The reader should notice that we have not split the first load of the samples x0,x1 which is not a requirement unless we are operating on miss aligned delay lines. Considering we are always working in pairs this should never be the case unless we have a programming error some where else.

Adding the prologue:

  A1=A0=0 || R1.L=W[I1++] || R0 = [I0++];
  lsetup (.0,.1) lc0=p0>>2;
  .0:  A1 += R0.L*R1.L, A0 += R0.H*R1.L || R1.L = W[I1++] || R0.L = W[I0++];
  .1:  A1 += R0.H*R1.L, A0 += R0.L*R1.L || R1.L = W[I1++] || R0.H = W[I0++];
  R0.H=(A1 += R0.L*R1.L), R0.L=(A0 += R0.H*R1.L);

The sequence R0.L=W[I0++], R0.H=W[I0++] forms a ping buffer in R0 where the samples are read into. This type of idiom is used all over the blackfin architecture to form a pipeline of data with respect to the computations.

unsigned realdot (short *x, short *y, int len)
{
    unsigned R0;
    asm(
  "A1=A0=0 || R1.L=W[%2++] || R0 = [%1++];\n\t"
  "lsetup (.0,.1) lc0=%3>>2;\n\t"
  ".0:  A1 += R0.L*R1.L, A0 += R0.H*R1.L || R1.L = W[%2++] || R0.L = W[%1++];\n\t"
  ".1:  A1 += R0.H*R1.L, A0 += R0.L*R1.L || R1.L = W[%2++] || R0.H = W[%1++];\n\t"
  "R0.H=(A1 += R0.L*R1.L), R0.L=(A0 += R0.H*R1.L);\n\t"
  : "=d" (R0) : "b" (x), "b" (y), "a" (len)
  );
    return R0;
}


short samples[100];
short coef[5] = { 0x100,0x100,0x200,0x100,0x110 };
main ()
{
    unsigned X;
    int i;

    for (i=0;i<100;i++)
        samples[i]=0x100*i;

    for (i=0;i<48;i++) {
        X = realdot (&samples[2*i], coef, 5);
        printf ("%08x\n", X);
    }
}