
Compiler Optimization Page

This page started out as a place to describe how to use the builtins of gcc, which I couldn't seem to find any documentation on other than the page which points you here. So I set out to unwind the performance of the compiler with respect to a simple fixed point FFT algorithm.

Pages which might also be of interest are ffmpeg, multimedia, making_the_blackfin_perform, and data cache.

See Builtin Intrinsic Compiler Library.

EXAMPLE BUILTIN As applied to the FFT

“WORK IN PROGRESS”

So that there is no confusion: all of the following code depends on the data structures below, the basics of which are copied from the libavcodec of ffmpeg so that we can drop these optimizations directly into that library.

typedef signed int int32_t;
typedef signed short int16_t;
typedef unsigned short uint16_t;
typedef signed long long int64_t;
typedef float FFTSample;

typedef struct {
    float re,im;
} FFTComplex;
typedef struct {
    int32_t re,im;
} FFTComplex32;
typedef struct {
    int16_t re,im;
} FFTComplex16;

typedef struct {
    int nbits;
    int inverse;
    FFTComplex   *exptab;
    FFTComplex   *exptabr4;
    FFTComplex16 *exptab16;
    uint16_t *revtab;
} FFTContext;
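As an aside, here is a minimal sketch of how such a context might be set up for an N-point transform. It is purely illustrative (the real libavcodec init also fills the twiddle tables, whose contents depend on the FFT convention and direction, so they are left out here); only the size fields and the bit-reversal table revtab are filled in.

#include <stdlib.h>

/* Illustrative sketch only: fill in the size fields and the bit-reversal
   table of an FFTContext.  The twiddle tables (exptab, exptab16, ...) are
   left to the real library code. */
static int fft_init_sketch (FFTContext *c, int nbits, int inverse)
{
    int N = 1 << nbits;
    int i, j, b;

    c->nbits   = nbits;
    c->inverse = inverse;
    c->revtab  = malloc (N * sizeof (uint16_t));
    if (!c->revtab)
        return -1;

    for (i = 0; i < N; i++) {
        /* reverse the low 'nbits' bits of i */
        for (j = 0, b = 0; b < nbits; b++)
            j = (j << 1) | ((i >> b) & 1);
        c->revtab[i] = j;
    }
    return 0;
}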

To start with we will take a very simple implementation of the radix-2 FFT. It is based on the Cooley-Tukey factorization of exp(-2*pi*j*x/N), which just takes advantage of the sign change of elements that are separated by pi. The following implementation uses 1.15 fixed point arithmetic to approximate the FFT. You will see the explicit scaling in the body of the inner loop.
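As a quick illustration of that scaling (a minimal sketch, not part of the FFT code itself): a 1.15 value stores a real number in [-1, 1) scaled by 32768, the 32-bit product of two such values carries 30 fractional bits, and adding the bias 0x4000 before shifting right by 15 rounds the product back to 15 fractional bits instead of truncating it.

/* Minimal 1.15 helpers, for illustration only. */
static int16_t q15_from_double (double x)  { return (int16_t)(x * 32768.0); } /* assumes -1.0 <= x < 1.0 */
static double  q15_to_double   (int16_t q) { return q / 32768.0; }

/* The explicit scaling used in the inner loop below: multiply two 1.15
   values, add the rounding bias 0x4000, and shift the product back to 1.15. */
static int16_t q15_mul_round (int16_t a, int16_t b)
{
    return (int16_t)(((int32_t)a * b + 0x4000) >> 15);
}

With that in mind, here is the radix-2 implementation: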

void fft16 (FFTContext *c, FFTComplex16 *X)
{
    FFTComplex16 *W = c->exptab16;
    int lgN         = c->nbits;
    int N           = 1<<lgN;
    int s, j, k, m, w, ws, hm;
    int32_t wwr,wwi;
    int32_t tr,ti;

    for (m=2,s=N>>1; s>=1; s=s>>1, m=m<<1) {
        ws = s;
        w  = 0;
        hm = m>>1;
        for (j=0; j<hm; j++) {
            wwr=W[w].re;
            wwi=W[w].im;
            w+=ws;
            for (k=j; k<N; k+=m) {
                int k2 = k+hm;

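                /* complex multiply: t = X[k2] * W[w], each 1.15 product rounded back to 1.15 */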
                tr        = (X[k2].re*wwr + 0x4000)>>15;
                tr       -= (X[k2].im*wwi + 0x4000)>>15;
                ti        = (X[k2].re*wwi + 0x4000)>>15;
                ti       += (X[k2].im*wwr + 0x4000)>>15;

                X[k2].re  = X[k].re - tr;
                X[k2].im  = X[k].im - ti;

                X[k].re   = X[k].re + tr;
                X[k].im   = X[k].im + ti;
            }
        }
    }
}

This produces the following inner loop:

.L20:
	R2 = W [P1] (X);
	R1 = W [P1+2] (X);
	R3 = R4;
	R0 = R5;
	R3 *= R2;
	R0 *= R1;
	R2 *= R5;
	R1 *= R4;
	R3 = R3 + R6;
	R0 = R0 + R6;
	R0 >>>= 15;
	R3 >>>= 15;
	R3 = R3 - R0 (NS) ||
	R0 = W [P2] (X) ||
	nop;
	R2 = R2 + R6;
	R1 = R1 + R6;
	R0 = R0 - R3;
	R1 >>>= 15;
	R2 >>>= 15;
	W [P1] = R0;
	R2 = R2 + R1 (NS) ||
	R0 = W [P2+2] (X) ||
	nop;
	R0 = R0 - R2;
	W [P1+2] = R0;
	R0 = W [P2] (X);
	R3 = R3 + R0 (NS) ||
	R0 = W [P2+2] (X) ||
	nop;
	R2 = R2 + R0 (NS) ||
	W [P2] = R3 ||
	nop;
	R0 = P3;
	R7 = R7 + R0 (NS) ||
	W [P2+2] = R2 ||
	nop;
	R0 = R7 - R0;
	R1 = I0;
	cc =R1<=R0;
	P1 = P1 + P0;
	P2 = P2 + P0;
	if !cc jump .L20 (bp);

This can be improved by replacing the multiplication operations with a builtin. In particular we will be using multr_fr1x16, which multiplies two 1.15 fixed point numbers together and rescales the result with biased rounding; this is essentially the same rounding and shift written out by hand above.

The modification is shown below:

                tr        = __builtin_bfin_multr_fr1x16 (X[k2].re, wwr);
                tr       -= __builtin_bfin_multr_fr1x16 (X[k2].im, wwi);
                ti        = __builtin_bfin_multr_fr1x16 (X[k2].re, wwi);
                ti       += __builtin_bfin_multr_fr1x16 (X[k2].im, wwr);

This produces improved DSP code which utilizes the Blackfin 16 bit multiplier. The previous variant used the microcontroller multiply, which takes 3 clocks to execute.

.L410:
	R1 = W [P1] (X);
	R3 = W [P1+2] (X);
	R2.L = R1.L * R5.L ;
	R0.L = R3.L * R6.L ;
	R0 = R0.L (X);
	R2 = R2.L (X);
	R2 = R2 - R0 (NS) ||
	R0 = W [P2] (X) ||
	nop;
	R1.L = R1.L * R6.L ;
	R3.L = R3.L * R5.L ;
	R0 = R0 - R2;
	R1 = R1.L (X);
	R3 = R3.L (X);
	W [P1] = R0;
	R1 = R1 + R3 (NS) ||
	R0 = W [P2+2] (X) ||
	nop;
	R0 = R0 - R1;
	W [P1+2] = R0;
	R0 = W [P2] (X);
	R2 = R2 + R0 (NS) ||
	R0 = W [P2+2] (X) ||
	nop;
	R1 = R1 + R0 (NS) ||
	W [P2] = R2 ||
	nop;
	R7 = R7 + R4 (NS) ||
	W [P2+2] = R1 ||
	nop;
	R0 = R7 - R4;
	R1 = I0;
	cc =R1<=R0;
	P1 = P1 + P0;
	P2 = P2 + P0;
	if !cc jump .L410 (bp);

The next level of optimization can be achieved by using packed data types.

Let's create a packed vector of shorts. The following data type is how the programmer tells the compiler about the SIMD data type.

typedef short __attribute__ ((mode(V2HI))) v2hi;

Now when we declare things to be v2hi, the compiler knows how to use the Blackfin DUAL MAC architecture. A simple example:

v2hi a, b, c;
c=a+b;

would produce something like
        R0 = R0 +|+ R1;
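A slightly more complete, compilable sketch (the function name is mine, purely for illustration):

typedef short __attribute__ ((mode(V2HI))) v2hi;   /* same typedef as above */

/* Adds two packed pairs of 16-bit values; on the Blackfin this can compile
   down to a single dual 16-bit add such as R0 = R0 +|+ R1. */
v2hi add_pairs (v2hi a, v2hi b)
{
    return a + b;
}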

Obviously we need something a little more complex, so let's turn back to our complex FFT example:

void fft16_2 (FFTContext *c, FFTComplex16 *Xin)
{
    v2hi *X         = (v2hi *) Xin;
    v2hi *W         = (v2hi *) c->exptab16;
    int lgN         = c->nbits;
    int N           = 1<<lgN;
    int s, j, k, m, w, ws, hm;
    v2hi ww;
    v2hi t;

    for (m=2,s=N>>1; s>=1; s=s>>1, m=m<<1) {
        ws = s;
        w  = 0;
        hm = m>>1;
        for (j=0; j<hm; j++) {
            ww=W[w];
            w+=ws;
            for (k=j; k<N; k+=m) {
                int k2 = k+hm;

                t = __builtin_bfin_cmplx_mul (X[k2], ww);
                X[k2] = X[k] - t;
                X[k]  = X[k] + t;
            }
        }
    }
}

In the above example we have replaced the explicit complex multiply with the builtin bfin cmplx_mul primitive 1), and we have replaced the butterfly with a simple subtract and add operation. The compiler knows how to do the rest, producing:

	R0 = [P1];
	A0 = R0.L * R2.L, A1 = R0.L * R2.H (W32) ||
	R1 = [P2] ||
	nop;
	R0.L = (A0 -= R0.H * R2.H), R0.H = (A1 += R0.H * R2.L) ;
	R1 = R1 -|- R0;
	[P1] = R1;
	R1 = [P2];
	R3 = R3 + R7;
	R1 = R1 +|+ R0;
	R0 = R3 - R7 (NS) ||
	[P2] = R1 ||
	nop;
	cc =R6<=R0;
	P1 = P1 + P0;
	P2 = P2 + P0;
	if !cc jump .L393 (bp);
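Reading the MAC instructions above, a plain-C model of what this packed butterfly computes on one even/odd pair is sketched here (my own sketch for clarity, reusing q15_mul_round from earlier; the real instructions accumulate at full precision and round once on extraction, so the rounding can differ slightly from this per-product model):

/* Reference model of the packed butterfly: t = Xo * W, Xo = Xe - t,
   Xe = Xe + t, everything in 1.15 arithmetic. */
static void butterfly16_model (FFTComplex16 *Xe, FFTComplex16 *Xo,
                               FFTComplex16 W)
{
    int32_t tr = q15_mul_round (Xo->re, W.re) - q15_mul_round (Xo->im, W.im);
    int32_t ti = q15_mul_round (Xo->re, W.im) + q15_mul_round (Xo->im, W.re);

    Xo->re = Xe->re - tr;   Xo->im = Xe->im - ti;
    Xe->re = Xe->re + tr;   Xe->im = Xe->im + ti;
}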

Now the optimal inner loop is 3 machine cycles: 2 cycles of multiplies for the complex multiply and 1 cycle for the add/subtracts. As of this writing I couldn't find a dual add/sub DSP intrinsic in the table, so I'm going to use a single inlined assembler instruction.

void fft16_3 (FFTContext *c, FFTComplex16 *Xin)
{
    v2hi *X         = (v2hi *) Xin;
    v2hi *W         = (v2hi *) c->exptab16;
    int lgN         = c->nbits;
    int N           = 1<<lgN;
    int s, j, k, m, w, ws, hm;
    v2hi ww;
    v2hi t;

    for (m=2,s=N>>1; s>=1; s=s>>1, m=m<<1) {
        ws = s;
        w  = 0;
        hm = m>>1;
        for (j=0; j<hm; j++) {
            ww=W[w];
            w+=ws;
            for (k=j; k<N; k+=m) {
                int k2 = k+hm;

                t = __builtin_bfin_cmplx_mul (X[k2], ww);
                asm ("%0=%2+|+%3, %1=%2-|-%3;"
                     : "=d" (X[k]), "=d" (X[k2]) : "d" (X[k]), "d" (t));
            }
        }
    }
}

This produces the tighter inner loop:

	R0 = [P1];
	A0 = R0.L * R2.L, A1 = R0.L * R2.H (W32) ||
	R1 = [P2] ||
	nop;
	R3 = R3 + R7;
	R0.L = (A0 -= R0.H * R2.H), R0.H = (A1 += R0.H * R2.L) ;
#APP
	R1=R1+|+R0, R0=R1-|-R0;
#NO_APP
	[P2] = R1;
	[P1] = R0;
	R0 = R3 - R7;
	cc =R6<=R0;
	P1 = P1 + P0;
	P2 = P2 + P0;
	if !cc jump .L393 (bp);

Now we are only 4x away from optimal DSP performance.

So this is pretty cool: we have developed a very compact representation for a radix-2 butterfly. The radix-2 butterfly is:

        t = __builtin_bfin_cmplx_mul (X[k2], ww);
        asm ("%0=%2+|+%3, %1=%2-|-%3;"
             : "=d" (X[k]), "=d" (X[k2]) : "d" (X[k]), "d" (t));

There is beauty in simplicity. I guess this whole thing should be a macro (a sketch of one follows the FFmpeg snippet below). If one looks at the code typically found inside of FFmpeg, you will find the following:

            for(l = nblocks; l < np2; l += nblocks) {
                CMUL(tmp_re, tmp_im, exptab[l].re, exptab[l].im, q->re, q->im);
                BF(p->re, p->im, q->re, q->im,
                   p->re, p->im, tmp_re, tmp_im);
                p++;
                q++;
            }

This code actually appears in many functions, including WMA, AC3 and others.
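On the Blackfin, the compact two-line butterfly above could be wrapped up and dropped in place of the CMUL/BF pair. A hypothetical macro (the name BUTTERFLY16 is mine, not part of FFmpeg or of the Blackfin headers) might look like:

/* Hypothetical packed radix-2 butterfly: pe and po point at v2hi complex
   values, w is the packed 1.15 twiddle factor.  Just a sketch. */
#define BUTTERFLY16(pe, po, w)                                        \
    do {                                                              \
        v2hi t_ = __builtin_bfin_cmplx_mul (*(po), (w));              \
        asm ("%0=%2+|+%3, %1=%2-|-%3;"                                \
             : "=d" (*(pe)), "=d" (*(po)) : "d" (*(pe)), "d" (t_));   \
    } while (0)

The body of the inner loop in fft16_3 above would then reduce to BUTTERFLY16 (&X[k], &X[k2], ww);.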

Re-architecting the inner loop to be a more deterministic candidate for zero overhead loops should bring the complexity of the inner loop way down.

To do this I have chosen to rename X[k] and X[k2] to ie and io respectively, where 'e' stands for even pointer and 'o' stands for odd pointer. By factoring the loop index out of the equation we should be able to get an lsetup (hardware loop) for the body of the loop. Here is the new code:

void fft16_4 (FFTContext *c, FFTComplex16 *Xin)
{
    v2hi *X = (v2hi *) Xin;
    v2hi *W = (v2hi *) c->exptab16;
    v2hi *ie,*io;
    int lgN         = c->nbits;
    int N           = 1<<lgN;
    int s, j, k, m, w, ws, hm;
    v2hi ww;
    v2hi t;

    for (m=2,s=N>>1; s>=1; s=s>>1, m=m<<1) {
        ws = s;
        w  = 0;
        hm = m>>1;
        for (j=0; j<hm; j++) {
            ww=W[w];
            w+=ws;
            ie = &X[j];
            io = &X[j+hm];

            for (k=0; k<s; k++) {

                t = __builtin_bfin_cmplx_mul (*io, ww);
                asm ("%0=%2+|+%3, %1=%2-|-%3;"
                     : "=d" (*ie), "=d" (*io) : "d" (*ie), "d" (t));
                io += m;
                ie += m;
            }
        }
    }
}

This produces the following inner loop:

	R0 = [P2];
	R3 += 1;
	A0 = R0.L * R2.L, A1 = R0.L * R2.H (W32) ||
	R1 = [P1] ||
	nop;
	cc =R7<=R3;
	R0.L = (A0 -= R0.H * R2.H), R0.H = (A1 += R0.H * R2.L) ;
#APP
	R1=R1+|+R0, R0=R1-|-R0;
#NO_APP
	[P1] = R1;
	[P2] = R0;
	P1 = P1 + P0;
	P2 = P2 + P0;
	if !cc jump .L406 (bp);

Well, the lsetup didn't happen, but we managed to squeeze one more cycle out of the inner loop.

1) I wish the _Complex short multiply could map directly to this