
Open Source Multimedia Optimizations

Pages which might also be of interest are ffmpeg, multimedia, making_the_blackfin_perform, memcpy and data cache.

To understand this document and some of the work being done for multimedia on Blackfin you need to understand which pieces of software I will be talking about. The software modules which I'm interested in are libavformat, libavcodec, libavutil and libswscale.


libavformat is the library whose primary responsibility is to read and write AVPackets. These packets can come from devices or files. This module demultiplexes or multiplexes multiple channels into or out of a multimedia stream. libavformat handles about 175 different video formats.


libavcodec is the library responsible for compression and decompression of various formats. It handles roughly 225 different audio and video formats.

The following represents the low-level pixel and video optimizations currently implemented for Blackfin.

file: lib/ffmpeg/ffmpeg-svn-9768/libavcodec/bfin/dsputil_bfin.c



libswscale is a software video scaler. It is a key system component that handles pixel arrangement. Pixel arrangement and format are critical when talking to real-world devices like LCDs, TVs or cameras. The software scaler performs operations like up and down scaling and aspect ratio adjustments. libswscale handles 50 different pixel formats, including gray scale, palettized video, yuv and rgb formats.

uyvy to yuv420

In order to optimize video that originates from the ST sensor VS6524, one needs to look at how the samples originate from the device. The optimized Blackfin routine converts uyvy into yuv420p. yuv420p is a planar format and is typically used in video compression.
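The Blackfin routine itself is not reproduced on this page. As a rough illustration of what the conversion does, here is a minimal portable C sketch. It is not the Blackfin code: the function name uyvy_to_yuv420p and its parameters are hypothetical, the packed byte order is assumed to be U0 Y0 V0 Y1, and chroma is simply taken from the even lines (averaging the two lines would be another valid choice).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reference helper, not the Blackfin routine.
 * Converts packed UYVY (bytes U0 Y0 V0 Y1 per pixel pair) to planar
 * YUV 4:2:0. Chroma from odd lines is dropped. */
static void uyvy_to_yuv420p(const uint8_t *src, int src_stride,
                            uint8_t *y, int y_stride,
                            uint8_t *u, uint8_t *v, int c_stride,
                            int width, int height)
{
    assert(width > 0 && height > 0);
    assert(width % 2 == 0 && height % 2 == 0);

    for (int row = 0; row < height; row++) {
        const uint8_t *s = src + (size_t)row * src_stride;
        uint8_t *yl = y + (size_t)row * y_stride;
        uint8_t *ul = u + (size_t)(row / 2) * c_stride;
        uint8_t *vl = v + (size_t)(row / 2) * c_stride;

        for (int x = 0; x < width / 2; x++) {
            /* Every line contributes luma. */
            yl[2 * x]     = s[4 * x + 1];
            yl[2 * x + 1] = s[4 * x + 3];
            /* Only even lines contribute chroma (4:2:0 subsampling). */
            if ((row & 1) == 0) {
                ul[x] = s[4 * x];
                vl[x] = s[4 * x + 2];
            }
        }
    }
}
```

For a QVGA frame the call would use a source stride of 2*320 bytes, a luma stride of 320 and a chroma stride of 160.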


Before the optimization:

routine              cycles*10  cycles/pel  mips
encode frame         151060695  196.7       377.7
ld_picture           13535007   17.6        33.8
sws_scale            164501059  214.2       411.3
v4l_mm_read_picture  21657946   28.2        54.1


After the optimization:

routine              cycles*10  cycles/pel  mips
encode frame         151060695  196.7       377.7
ld_picture           13535007   17.6        33.8
sws_scale            14981578   19.5        37.5
v4l_mm_read_picture  21657946   28.2        54.1

With sws_scale reduced from 214.2 to 19.5 cycles/pel, we can now encode twice as many frames of video.

The Infamous memcpy

memcpy is one of the most costly functions that a multimedia system uses. Believe it or not, it is my biggest nut to crack today. This is because memcpy, when used in a data cache environment on a Blackfin processor, has some very interesting properties. Visit the memcpy page for details of how this simple little function has caused me hours of entertainment.

At the bottom of the system is this concept that we need to copy video or bits around to make sure we have a unique copy at all times. In the case of video arriving from a device such as a frame grabber that has several buffers, a memory copy is not required. So let me see if I can draw a picture of this for you. The video 4 linux driver produces a video frame which is mmap'ed into the process's address space. This video frame is copied from the mmap'ed memory buffer to another location, which is allocated via av_new_packet in libavutil/utils.c. The memory is copied byte by byte (quad), costing roughly 20 c/pel because the entire frame has to move from SDRAM location 1 to SDRAM location 2. The copy also pollutes the cache, so the real cost is higher than the measured 20 c/pel: the entire cache is invalidated by the copy.

To make the story even more interesting, there is a very nice function called load_input_picture which does the same copy of the buffer from one place to another, byte by byte (quad), adding another 20 c/pel to the encoding process. As of this writing of June 2006 we take roughly 200 c/pel to encode a picture, so this is very significant: each copy is 10% of the load.

To eliminate the first copy in grab.c we need to assert a couple of conditions, then copy the reference to the input picture into the AVPacket instead of memcpy'ing the entire image. This is safe if the video 4 linux driver uses multiple buffers, so we check for that.

+    if (s->gb_buffers.frames > 1) {
+        av_free (pkt->data);
+        pkt->destruct= av_destruct_packet_nofree;
+        pkt->data = ptr;
+        pkt->size = s->frame_size;
+    }
+    else
+        memcpy(buf, ptr, s->frame_size);

That's the easy one. The hard one is buried inside libavcodec/mpegvideo.c, and there is no interface yet to handle it. The best solution for this internal copy is an asynchronous 2D DMA memory copy.

Unfortunately this particular memory copy is a key component of the video compression algorithm. The motion estimation and motion compensation algorithms use padded reference frames so that objects in the video can be tracked when they leave the frame. This padding requires that each line be copied one at a time, with a hole of 32 samples between lines for the luma (Y) channel.

The padded bands are filled with pixels from the image in a specification-dependent way. That said, all the specs that I have seen seem to do it the same way: the border pixels are replicated into the padded region. This feature is typically named unrestricted motion vector support, and disabling it reduces the output quality of your video.

  memcpy(dst, src, w);
  dst += dst_stride;
  src += src_stride;

The code fragment above is what copies the frame, one line per iteration. Unfortunately we can't use dma_memcpy, because that routine generates a full context switch to kernel mode, which costs around 1k cycles. That overhead is too large for a copy of only w bytes per call (w < 720).
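To make the line-by-line copy and the border replication concrete, here is a minimal C sketch. It is an illustration under assumptions, not the code in libavcodec/mpegvideo.c: the helper name copy_with_edges and the pad parameter are hypothetical, and the padding width is left to the caller.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy a w x h plane into the interior of a padded
 * buffer, then replicate the border pixels into a margin of `pad`
 * samples on every side. dst points at the top-left corner of the
 * padded buffer, which has h + 2*pad rows of dst_stride bytes. */
static void copy_with_edges(uint8_t *dst, int dst_stride,
                            const uint8_t *src, int src_stride,
                            int w, int h, int pad)
{
    assert(w > 0 && h > 0 && pad >= 0);
    assert(dst_stride >= w + 2 * pad);

    uint8_t *row = dst + (size_t)pad * dst_stride + pad;
    for (int y = 0; y < h; y++) {
        memcpy(row, src, (size_t)w);              /* the line itself   */
        memset(row - pad, src[0], (size_t)pad);   /* left edge pixel   */
        memset(row + w, src[w - 1], (size_t)pad); /* right edge pixel  */
        row += dst_stride;
        src += src_stride;
    }

    /* Replicate the first and last padded lines upward and downward. */
    uint8_t *first = dst + (size_t)pad * dst_stride;
    uint8_t *last  = first + (size_t)(h - 1) * dst_stride;
    for (int y = 0; y < pad; y++) {
        memcpy(dst + (size_t)y * dst_stride, first, (size_t)w + 2 * pad);
        memcpy(last + (size_t)(y + 1) * dst_stride, last, (size_t)w + 2 * pad);
    }
}
```

Each iteration of the first loop is one short memcpy of w bytes, which is why a per-call overhead of around 1k cycles would dominate the cost.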

Enhancing grab.c video4linux

The starting reference is a basic CIF input sequence, with CCLK=500 MHz, SCLK=100 MHz, and the cache configured as write-back.

ffmpeg -f rawvideo -s 352x288 -i /var/t4s.cif

This yields 22 fps (June 2007). The default encoding bitrate of ffmpeg is 200 kbps.

Now we look at the performance of the webcam setup, which uses the ST device. This setup is described in more detail on our ffmpeg wiki page.

The particular device driver setup I'm using for these experiments is:

modprobe i2c-bfin-twi
modprobe blackfin_cam force_palette=3
modprobe snd_ad1836

Note that on the VLC side you need to open udp://@ to match the far end.

ffmpeg -f video4linux -s 320x240 -i /dev/video0 -f mpegts udp:

This yields 25 fps (June 2007) at the default encoding bitrate of 200 kbps. A close inspection of top shows ffmpeg consuming 98% of the processor, so we just barely meet the encoding budget at 500 MHz. Let's insert some telemetry with the START_TIMER and STOP_TIMER performance analysis tools included in ffmpeg/libavutil/common.h. Specifically, we want to look at how many cycles it takes to encode a video frame and how many cycles it takes to move data from the sensor to the software scaler.

routine              cycles*10  cycles/pel  mips
encode frame         151060695  196.7       377.7
ld_picture           13535007   17.6        33.8
sws_scale            14981578   19.5        37.5
v4l_mm_read_picture  21657946   28.2        54.1

By eliminating the extra frame copy and delivering the video frame directly to the system, we get a noticeable improvement in performance, bringing the overall processor load down to 81%.

routine              cycles*10  cycles/pel  mips
encode frame         151060695  196.7       377.7
ld_picture           13535007   17.6        33.8
sws_scale            14981578   19.5        37.5
v4l_mm_read_picture  765244     0.996       1.9

As of Sept 4, 2007, the same experiment from June results in ~88% processor utilization on a 500 MHz BF537 for QVGA encode at 25 fps. This is roughly 229 c/pel.

routine      cycles   c/pel
Motion Comp  14475.7  56
Motion Est   14722.5  57
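The 229 c/pel figure can be reproduced from the processor load, the core clock, the frame rate and the frame size. The sketch below shows that arithmetic in C. The function names are hypothetical, and it assumes that the "mips" column in the tables on this page means millions of cycles per second at QVGA and 25 fps. That reading is consistent with the numbers but is not stated explicitly.

```c
#include <assert.h>

/* Cycles spent per pixel, given the fraction of the core that is busy. */
static double cycles_per_pel(double load, double cclk_hz, double fps,
                             int width, int height)
{
    assert(load >= 0.0 && cclk_hz > 0.0 && fps > 0.0);
    assert(width > 0 && height > 0);
    return load * cclk_hz / (fps * (double)width * (double)height);
}

/* Millions of cycles per second implied by a cycles/pel figure. */
static double mips_from_cpel(double cpel, double fps, int width, int height)
{
    assert(fps > 0.0 && width > 0 && height > 0);
    return cpel * (double)width * (double)height * fps / 1e6;
}
```

For example, cycles_per_pel(0.88, 500e6, 25.0, 320, 240) gives about 229.2, and mips_from_cpel(196.7, 25.0, 320, 240) gives about 377.7, matching the "encode frame" row.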

A quick note: to send mpeg4 video you need to tell ffmpeg to use the mpeg4 video codec.

 /u/dev/b/ffmpeg -b 300000 -f video4linux -s 320x240 -i /dev/video0 -vcodec mpeg4 -f mpegts udp:yoda:1234

Current Performance Analysis

Here is the current mips breakdown.

routine              cycles*10  cycles/pel  mips
sws_scale            14981578   19.5        37.5
v4l_mm_read_picture  765244     0.996       1.9
encode frame         151060695  196.7       377.7
ld_picture           13535007   17.6        33.8
motionest            55477924   72.24       138.7
encode_mb            306530     119.74      229.9
encode_bits          34432      13.45       25.8
mc                   171640     67.05       128.8
dctq                 52166      20.4        39.1

As of 11/13/2007, 'top' shows ffmpeg consuming 75% of the 500 MHz BF537 for the following command.

/u/dev/9768/ffmpeg -s qvga -f video4linux -i /dev/video0  -b 300k -f mpegts  udp:

Even with a 75% load, the ffmpeg/video4linux system is dropping one frame per second. You can see this with the command line option -v 10 applied prior to -s qvga.

Audio Decode Performance Analysis and Optimizations

MP3 Audio Decode Analysis

The audio configuration used for this analysis is based on the lavc (libavcodec) audio codecs. So when building something like mplayer you need to use --disable-mp3lib, and to ensure reasonable performance use --disable-mpegaudio-hp.

The following is some performance analysis around MP3 audio decode. The table is based on 2 channel audio sampled at 44.1 kHz.

routine  #       cycles     w/MULH opt
audec    4096    2146985.6  878949.7
imdct    16383   151114.4   45585.5
dct32    262131  8592.0     1709.4
                 87 mips    38 mips

By optimizing one small aspect of the audio decoder we get basically a 2x improvement in performance.

#define MULH(X,Y) ({ int xxo;                           \
    asm (                                               \
        "a1 = %2.L * %1.L (FU);\n\t"                    \
        "a1 = a1 >> 16;\n\t"                            \
        "a1 += %2.H * %1.L (IS,M);\n\t"                 \
        "a0 = %1.H * %2.H, a1+= %1.H * %2.L (IS,M);\n\t"\
        "a1 = a1 >>> 16;\n\t"                           \
        "%0 = (a0 += a1);\n\t"                          \
        : "=d" (xxo) : "d" (X), "d" (Y)); xxo; })

Now let's look at what happens when we drop the contribution of L*L, which accounts for the accuracy of the LSB.

routine  #       cycles     w/MULH opt  w/MULH no L*L
audec    4096    2146985.6  878949.7    850732.0
imdct    16383   151114.4   45585.5     40229.9
dct32    262131  8592.0     1709.4      1421.4
                 87 mips    38 mips     36 mips

Here we drop the Lo*Lo partial product, which saves another 2 mips. The loss of precision is not perceptible, at least to me. If you need higher precision, just ensure --enable-mpegaudio-hp is set when configuring lavc.

#define MULH(X,Y) ({ int xxo;                           \
    asm (                                               \
        "a1 = %2.H * %1.L (IS,M);\n\t"                  \
        "a0 = %1.H * %2.H, a1+= %1.H * %2.L (IS,M);\n\t"\
        "a1 = a1 >>> 16;\n\t"                           \
        "%0 = (a0 += a1);\n\t"                          \
        : "=d" (xxo) : "d" (X), "d" (Y)); xxo; })
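For readers without a Blackfin at hand, the two macros can be modelled in portable C. This is a sketch of the arithmetic as I read it, not a replacement for the assembly: the function names are hypothetical, and the code assumes that right-shifting a negative signed value is an arithmetic shift, which holds on mainstream compilers. MULH is taken to mean the high 32 bits of the signed 64-bit product, built here from 16-bit partial products. Dropping the L*L term can make the result smaller by at most one LSB.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: high 32 bits of the signed 64-bit product. */
static int32_t mulh_ref(int32_t x, int32_t y)
{
    return (int32_t)(((int64_t)x * (int64_t)y) >> 32);
}

/* Same value from 16-bit partial products, including L*L. */
static int32_t mulh_parts(int32_t x, int32_t y)
{
    int32_t  xh = x >> 16, yh = y >> 16;        /* signed high halves  */
    uint32_t xl = x & 0xffff, yl = y & 0xffff;  /* unsigned low halves */

    int64_t acc = (int64_t)((xl * yl) >> 16);   /* L*L, logical shift  */
    acc += (int64_t)yh * xl;                    /* signed * unsigned   */
    acc += (int64_t)xh * yl;                    /* signed * unsigned   */
    return xh * yh + (int32_t)(acc >> 16);      /* arithmetic shift    */
}

/* Approximation without the L*L term: equal to the exact result
 * or exactly one LSB below it. */
static int32_t mulh_no_ll(int32_t x, int32_t y)
{
    int32_t  xh = x >> 16, yh = y >> 16;
    uint32_t xl = x & 0xffff, yl = y & 0xffff;

    int64_t acc = (int64_t)yh * xl + (int64_t)xh * yl;
    return xh * yh + (int32_t)(acc >> 16);
}
```

The one-LSB bound follows because the omitted term, (xl * yl) >> 16, is always between 0 and 65534, so it can add at most one carry into the final 16-bit shift.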

AAC Audio Decode Analysis

AAC decode uses the faad2 library, which by default uses double precision floating point arithmetic, so an AAC decode will consume an entire 500 MHz core and still have unacceptable quality. faad2 needs to be configured with --enable-faad-fixed to get an AAC decoder that currently consumes 165 mips for a 44100 Hz 2 channel decode.

Using the MULH primitive developed above in libfaad2 reduces the complexity by roughly 2x, bringing the load down to 85 mips for the same content.

#ifdef bfin
#define _MulHigh(X,Y) ({ int xxo;                       \
    asm (                                               \
        "a1 = %2.H * %1.L (IS,M);\n\t"                  \
        "a0 = %1.H * %2.H, a1+= %1.H * %2.L (IS,M);\n\t"\
        "a1 = a1 >>> 16;\n\t"                           \
        "%0 = (a0 += a1);\n\t"                          \
        : "=d" (xxo) : "d" (X), "d" (Y)); xxo; })
