Free Hardware Implementation of Theora Videoencoder

The content below is downloaded from www.free-it.de/archiv/talks_2005/paper-11081/paper-11081.html, copyright(s) of the source apply.


Europe's largest GNU/Linux trade fair and conference
Conference DVD 2005

Andrey Filippov

Elphel, Inc.

This contribution is licensed under the GNU Free Documentation License (International).

Abstract

FPGAs are excellent devices for carrying the achievements of Free Software into the hardware world, and the Elphel model 333 camera is a project in this area. As the next step after the previous Elphel design, which used a reconfigurable Xilinx FPGA for fast JPEG/motion JPEG compression, the model 333 uses the new Xilinx Spartan 3 to implement a more advanced Ogg Theora video encoder that is capable of processing 1.3 Mpix images at 30 fps; larger images (up to 4.5 Mpix) can be served at a proportionally lower frame rate. Motion compensation is not currently implemented, but the required bandwidth for typical video security applications (where the camera does not move) is still much lower than that needed for conventional motion JPEG streams. Theora was chosen because it provides better compression than MPEG-2 while its use does not require payment of license fees for the patents involved. The camera code (both software and FPGA) is available under the GNU GPL.


Free Hardware Implementation of Theora Videoencoder

Andrey Filippov (andrey@elphel.com)

Background

The model 333 camera is the third generation of cameras designed at Elphel, Inc. The first one (model 303, in 2001) used the Axis ETRAX100LX Linux-optimized 32-bit 100 MHz processor and a simple one-time programmable FPGA to interface the CMOS image sensor. That camera was designed for single-frame (not video) applications where the speed of software JPEG compression was not critical: one 1280x1024 frame required 5 seconds of CPU time.

The next Elphel camera (model 313, in 2003) targeted more general network camera applications, trying to combine the high resolution images from CMOS sensors with the highest frame rate they could provide. At that time sensors were capable of 1280x1024 at 15 frames per second, 75 times faster than the model 303 software-only compression could handle (one frame per 5 seconds, or 0.2 fps). To meet that challenge I decided to use a reconfigurable FPGA, a technology that was new to me at that time. Before starting the actual design I looked for existing proprietary solutions (“IP cores”) with similar specs to see whether the goal was achievable and how big and fast an FPGA was needed.

It did not take long to find a good candidate, the Xilinx Spartan-2e 300K-gate chip (supported by development tools that Xilinx offers as a free download), and soon I was ready to start FPGA hacking on my first complex FPGA design. In three months there was a camera that could compress 15 frames per second at 1280x1024 resolution, more than any other network camera at that time. The flexibility of the reconfigurable FPGA was essential, as many features were added and bugs fixed after the hardware was released, much as is done with software updates. The frame rate at 1280x1024 resolution increased to 22 fps, and support for higher resolution sensors (2, 3 and even 11 megapixels) was added when such sensors became available.

FPGA-based video processing in the model 313 camera demonstrated that such an approach can significantly enhance the performance of software-based embedded systems: a camera with a total power consumption of just 3 watts has JPEG compression performance equivalent to that of a 2.5-3.0 GHz PC-type computer. While having “hardware” speed, such a system retains the flexibility of a software product.

When the model 313 design was mostly ready (mostly, as FPGA/software code is never really finished) it turned out that 98% of the FPGA resources were used. Later, new sensors became available that were faster than the model 313 JPEG encoder, so it was a good time to think about a hardware upgrade.

Project goals

Requirements to an advanced camera

Network cameras are replacing the older analog video cameras used for security applications. This legacy in many cases determines what customers expect from the cameras, but an advanced network camera can provide new functionality if it has a combination of the following three features:

  • high resolution is important so that a camera with a wide-angle lens will not miss an important event, as low resolution ones placed on rotating (pan/tilt) platforms do;

  • high frame rate is not a new feature compared to the legacy analog cameras, but it is more difficult to maintain in multi-megapixel cameras;

  • low bit rate is especially critical for systems that combine high resolution with high frame rate. Without advanced compression methods the amount of data produced by such cameras can easily saturate the network and fill up the hard drives used for archiving.

Currently the model 333 camera is the only one that combines all three features listed above. Many cameras have a high frame rate but low resolution, 640x480 or even less. Some cameras implement MPEG-2 or even MPEG-4 video compression that provides a low bit rate, but it is also available only in low resolution cameras. Most multi-megapixel cameras have a low frame rate (compression is done by software); only a few, the Elphel model 313 and, recently, new cameras by another manufacturer, combine high resolution with a high frame rate. All of them use motion JPEG, which is relatively easy to implement but cannot provide efficient compression of video data.

Selection of the video compression algorithm

Most high resolution network cameras use motion JPEG to compress the video frames: the computational complexity is relatively low (the same as that of a decoder) and it is free to implement, but the compression efficiency is low because it does not exploit the similarity between subsequent frames.

Video compression algorithms are usually asymmetrical and need more computations for the encoder than for the decoder – that makes camera design more challenging.

Unfortunately, computational complexity is not the only problem with a free implementation of high performance video encoders. The most popular formats (MPEG-2, MPEG-4) are not free to implement and their usage requires payment of licensing fees. The license fee is what they call “reasonable”, and it really is small compared to the hardware costs, but it is still a hassle for users. Being an opponent of the idea of software patents, I did not want to support it financially.

So I had to look harder for a better alternative, and it was not that difficult to find one: by the time I was ready to start the code design (the hardware had already been tested by running the MJPEG code ported from the model 313 camera), a rather stable version of a new video codec was already available.

Ogg Theora, developed by the Xiph.org Foundation (based on VP3 made by On2 Technologies), competes in efficiency with MPEG encoders. It is royalty-free and is now supported by most players included with many GNU/Linux distributions: Xine, Kaffeine, Noatun, RealPlayer 10.

Challenges of the high performance video encoding

High resolution combined with a high frame rate already requires a lot of processing power: the earlier model 313 camera, which served motion JPEG, needed a PC with a 2.5 GHz CPU to match its performance. Asymmetrical video compression at an even higher frame rate would need computational performance too high to be implemented in the camera by a universal processor, which makes pure software solutions impractical.

Another possible solution could be a dedicated compressor IC. Some are available, but most support only low resolution video (640x480 or less). Development of a new custom ASIC is expensive and time consuming, and inevitably lags behind new compression algorithms. Such a solution is not flexible: no upgrades or corrections of errors are possible after the ICs are built.

As a result – there are no other network cameras capable of providing all three features (high resolution, high frame rate and low bit rate) simultaneously.

Most of the “hard job” of the video encoding in model 333 camera is done by a million-gate Xilinx Spartan 3 reconfigurable FPGA that supplements the general purpose processor that runs GNU/Linux and is responsible for the overall camera operation and streaming of the hardware-compressed video over the network.

The reconfigurable FPGA in the model 333 camera is able to outperform a hypothetical 10 GHz universal PC that would need at least a hundred times more electrical power. It makes it possible to combine “hardware” speed with the flexibility of software, allows the use of new emerging video formats before they are finalized, and lets the system performance be improved continually during the product lifetime.

Because FPGA design is so similar to software design, we apply the highly productive FOSS development model to the hardware (FPGA) code development and release it under a free license (GNU GPL). This gives our customers the freedom to modify our products to better fit their specific requirements.

How FPGA programming is similar to the software development and what is different?

Some simple FPGAs are designed in a way similar to schematic designs, with circuit diagrams where virtual components are used instead of real counters, registers, memories, and so on. This technique does not work well for complex designs, and most of them now look very much like software projects, with source files containing preprocessor directives, modules, functions, variable declarations and operators.

There are two main hardware description languages in the industry: VHDL and Verilog HDL, and Verilog was chosen for Elphel designs.

The source code in Verilog will not look alien to the software developer, and if you open it for example with one of the KDE code editors – it will highlight the Verilog syntax.

So is FPGA programming the same as writing software? Ideally, probably yes, but it is still very different, especially for those who came to this frontier area from the software side rather than the hardware side. As in the earlier days of software design, it is very important to understand what your code will be compiled into, even when using high level directives.

There are at least two fundamental differences between FPGA and software code:

  • hardware (including the FPGA internal cells) is semi-analog and

  • all parts of the FPGA operate at the same time

Propagation delays for the signals are analog and depend on multiple factors, including the distance between the data source and the receiver and the number of inputs connected to one output. So accounting for delays and specifying timing constraints is an integral part of the overall project, together with the functional design. Because of the physical nature of the hardware, in some cases a design can stop functioning properly after recompilation with a different seed or after modifications in unrelated modules: changes in mapping (assigning primitive cells to functional modules) and layout influence timing delays.

In software, different instructions are usually processed at different times; HDL code is compiled so that different operators are mapped to different parts of a chip and are usually active simultaneously.

At the top level, parallelism is usually achieved by designing the system so that the data flow processing is split into small conveyor-like operations. In an FPGA such pipeline-type parallelism is usually more effective than separating the data flow into several sub-streams (e.g. splitting the image into separate areas and processing each with a different module instance), because it is easier to implement a single specific operation in hardware than a complex sequence of instructions of the kind software-driven processors execute.

The data rate in each link of the chain remains the same, so after a complex algorithm is split into a sequence of simpler procedures it is possible to plan the required resources for each module and actually implement them. Some modules might be simple; some require manual optimization, careful planning of the data dependencies and assignment of register operations to particular time slots before the coding can start. A good example of such a module is an 8-point discrete cosine transform (DCT). A chain of two such modules (with a memory buffer between them) is capable of a 2-D 8x8 DCT, one of the most computationally intensive parts of the baseline JPEG, MPEG and Ogg Theora codecs.
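The conveyor idea can be modelled in software. The Python sketch below is only an illustration, not the camera's Verilog: `run_pipeline` and the toy stages are hypothetical names, and each loop iteration stands for one clock tick in which every stage register is loaded from its predecessor.

```python
def run_pipeline(stages, samples):
    """Clock a chain of stages; on every tick each stage holds a different sample.

    None is used as the "empty register" marker, so samples must not be None.
    """
    regs = [None] * len(stages)          # one register after each stage
    out = []
    # Extra empty ticks at the end let the last samples drain out of the pipe.
    for sample in list(samples) + [None] * len(stages):
        if regs[-1] is not None:
            out.append(regs[-1])         # result leaving the last stage
        # Update from the last stage backwards so that each stage sees the
        # value its predecessor held before this clock edge.
        for i in range(len(stages) - 1, 0, -1):
            regs[i] = None if regs[i - 1] is None else stages[i](regs[i - 1])
        regs[0] = None if sample is None else stages[0](sample)
    return out


# Three toy stages: add one, double, subtract three.
results = run_pipeline([lambda x: x + 1, lambda x: x * 2, lambda x: x - 3],
                       [1, 2, 3])
```

Once the pipe is full, one result leaves per tick no matter how many stages there are; more stages add latency but do not reduce throughput.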
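The two-pass structure can be sketched in Python as follows. This is a floating-point textbook DCT-II given for illustration only; the codecs define their own fixed-point variants, and the function names are hypothetical. The transposition between the two passes plays the role of the memory buffer between the two 8-point modules.

```python
import math


def dct8(v):
    """Orthonormal 8-point DCT-II of a sequence of 8 numbers."""
    out = []
    for k in range(8):
        scale = math.sqrt(1 / 8) if k == 0 else math.sqrt(2 / 8)
        out.append(scale * sum(v[n] * math.cos(math.pi * (2 * n + 1) * k / 16)
                               for n in range(8)))
    return out


def transpose(m):
    """Swap rows and columns of a square list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def dct8x8(block):
    """2-D DCT of an 8x8 block as two chained 1-D passes."""
    rows = [dct8(r) for r in block]             # first pass, along the rows
    cols = [dct8(c) for c in transpose(rows)]   # buffer is read back transposed
    return transpose(cols)


flat = dct8x8([[10] * 8 for _ in range(8)])     # a constant block has only DC
```

For the constant block above all the energy ends up in the DC coefficient (80 with this normalization), and every AC coefficient is zero up to rounding error.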

Modern FPGAs are mature devices that have many features common to different families and manufacturers. They have:

  • Small universal cells that have several register bits and programmable lookup tables to implement logical functions. Usually they have alternative functions such as small memories or shift registers, and have circuitry to implement fast carry chains for implementing counters and adders.

  • Programmable wiring resources - usually there are 3-4 different types of wires – from “street level” to connect adjacent cells to “highways” and “freeways” for long distance links and global wires (usually clock signals) that are available to each cell on a chip (or a fraction of it).

  • Many FPGA devices have additional high-level hard-wired modules such as memory blocks, multipliers and even complete processors.
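As an illustration of the lookup-table cells in the first item, the Python sketch below models a single LUT as a truth table stored in an integer. The 4-input width and the names `lut4` and `make_init` are assumptions made for the example; the text above does not state the LUT size.

```python
def lut4(init, a, b, c, d):
    """Output bit of a 4-input LUT whose 16-bit truth table is `init`."""
    index = a | (b << 1) | (c << 2) | (d << 3)   # inputs select one table bit
    return (init >> index) & 1


def make_init(func):
    """Build the 16-bit truth table for any boolean function of 4 inputs."""
    init = 0
    for index in range(16):
        bits = [(index >> i) & 1 for i in range(4)]
        if func(*bits):
            init |= 1 << index
    return init


# "Programming" the cell as a 4-input parity (XOR) gate.
parity_init = make_init(lambda a, b, c, d: a ^ b ^ c ^ d)
```

Changing the stored table changes the logic function without changing the cell itself, which is what makes the device reconfigurable.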

Camera hardware

Like its predecessor, the model 333 camera circuitry (Figure 1) is designed to be flexible and allow future modifications (e.g. new sensor support) without any changes to the board. It consists of a universal computer with an Axis Communications ETRAX100LX 32-bit Linux-optimized processor with embedded Ethernet MAC and other ports, 32MB of system SDRAM and 16MB of flash. The FPGA plays a central role in video processing and is connected to the rest of the system using a 32-bit wide system bus, providing both PIO and DMA access. Some of the FPGA I/O pins are connected directly to the connector that is used to interface interchangeable sensor boards. All these pins are reprogrammable, and it is easy to change the pin assignments to match the control signals of different sensors, including those that were not available when the camera board was released.

Figure 1. Model 333 Camera Block Diagram

A programmable clock generator adds to the flexibility of the design: all of the system clock frequencies are adjustable by software.

The maximal data transfer rate, even in DMA mode (80 Mbytes/sec), is still not enough to provide scratch pad memory access for video compression, and the on-chip FPGA memory is far smaller than needed for high resolution images (it is only 54 Kbytes in a million-gate Spartan-3). The need for fast access to a large memory led to the addition of a dedicated 32MB DDR SDRAM, connected to the FPGA, that supplements its computational resources with a large scratch pad memory.

FPGA code

The FPGA implementation of the Ogg Theora encoder is based on the format specification written by Xiph.org and currently available at http://www.theora.org/doc/Theora_I_spec.pdf. The code in the Elphel model 333 camera is primarily targeted at applications where the camera does not move, and so motion compensation is not implemented yet. The other shortcut is the omission of the loop filter, which improves visual image quality at high compression ratios (low quality) by making the borders between 8x8 pixel blocks less visible. The camera uses a reconfigurable FPGA, so it will be possible to add this functionality in future code releases.

The first step was to adjust the overall algorithm to the hardware implementation and present it as a chain of processing modules (Figure 2). Special attention was needed to organize data transfer to and from external memory because, in contrast to the highly parallel distributed internal resources of the FPGA, that memory is a single physical device with a limited I/O bandwidth.

Figure 2. FPGA Code Diagram

Memory (DDR SDRAM) controller

Calculation of the required scratch pad memory bandwidth indicated that approximately 95% of the peak bandwidth of a 16-bit DDR SDRAM running at 125 MHz (475 MB/s of the 500 MB/s available) is needed to satisfy the requirements of the FPGA processing blocks. Such high efficiency (95%) of SDRAM bandwidth usage is difficult to achieve with universal random-access memory controllers, but a controller synthesized for this design can rely on the particular data structures it serves.

The single-port SDRAM has to serve multiple FPGA modules that operate in parallel, and the moderate size internal memory blocks (about 2 KB each) turn out to be very useful: they are used as individual FIFO buffers, so each of the eight channels can provide or accept data at the same time as the others. These channels are:
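The peak figure follows directly from the bus parameters. The short Python calculation below reproduces it, assuming that MB here means 10^6 bytes and that the DDR bus moves two words per clock cycle; the variable names are illustrative.

```python
BUS_BYTES = 16 // 8            # 16-bit data bus
CLOCK_HZ = 125_000_000         # 125 MHz memory clock
TRANSFERS_PER_CLOCK = 2        # double data rate: both clock edges carry data

peak_bytes_per_s = BUS_BYTES * TRANSFERS_PER_CLOCK * CLOCK_HZ
required_bytes_per_s = 475_000_000          # figure quoted in the text
utilization = required_bytes_per_s / peak_bytes_per_s
```

This gives a peak of 500 MB/s and a required utilization of 0.95, leaving only about 5% of the bus cycles for overhead such as row activation and refresh.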

  • sensor data to SDRAM in line-scan order;

  • correction data from SDRAM to FPN (fixed-pattern noise) elimination module;

  • raw pixel data from SDRAM to the compressor in 20x20 pixel tiles;

  • CPU PIO data to/from SDRAM;

  • reference YCbCr frame data from SDRAM to the compressor;

  • current YCbCr frame data from the compressor to memory;

  • intermediate encoded data tokens from compressor to SDRAM;

  • reordered data tokens from SDRAM to the final compressor stage
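How one memory port can serve several buffered channels can be sketched as below. The round-robin policy, the burst length and the class and function names are assumptions made for illustration; the text does not describe the arbitration scheme actually used in the camera.

```python
from collections import deque


class Channel:
    """One FPGA-side channel with its own FIFO (stands in for a block RAM)."""

    def __init__(self, name, burst=4):
        self.name = name
        self.burst = burst        # words moved per SDRAM grant (hypothetical)
        self.fifo = deque()


def arbitrate(channels, grants):
    """Serve up to `grants` bursts, visiting the channels in round-robin order."""
    log = []
    idx = 0
    idle = 0                      # consecutive channels found empty
    while len(log) < grants and idle < len(channels):
        ch = channels[idx]
        idx = (idx + 1) % len(channels)
        if ch.fifo:
            count = min(ch.burst, len(ch.fifo))
            log.append((ch.name, [ch.fifo.popleft() for _ in range(count)]))
            idle = 0
        else:
            idle += 1
    return log


sensor = Channel("sensor")
sensor.fifo.extend([1, 2, 3, 4, 5, 6])
tokens = Channel("tokens")
tokens.fifo.extend([10, 11])
served = arbitrate([sensor, tokens], grants=10)
```

Because each channel fills or drains its own FIFO independently, the processing modules keep running while the single SDRAM port works through the queued bursts one at a time.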

Video Compressor Data Flow

The Theora video encoder shares some parts of its algorithm with JPEG and MPEG: they all use an 8x8 DCT to convert pixel values to spatial frequencies. Each 8x8 pixel block is converted, and the 64 resulting coefficients are ordered from the lowest frequency (starting with the average value, usually referred to as DC) to the highest frequencies (AC1 to AC63) in zig-zag order. This presents the same information (the DCT by itself is lossless), but the coefficients most important for visual perception go first, followed by less and less important ones, whereas all 64 original values (pixels) were of equal importance. That makes it possible to transmit the coefficient data with different precision: the most valuable coefficients (DC and low frequency components) with higher precision, the less valuable ones with less precision. This is achieved by the process of quantization: dividing the coefficients by an array of values and rounding the results to the nearest integer. The opposite process (dequantization) is multiplication of the quantized values by the same array.

With real-life images many of the quantized coefficients become zeros, especially the last (high spatial frequency) ones. Grouping the zero coefficients together allows efficient encoding of zero runs.

Theora encoding takes one extra step: all the coefficients in a frame are globally reordered, so that in the output bitstream the DC coefficients from all blocks go first, then the first AC coefficients from all blocks, and so on, ending with the highest frequency coefficients. This makes encoding more efficient, but requires buffering of the whole frame data between the DCT and the bitstream output. As shown in the FPGA code diagram (Figure 2), the compressor is split into two parts (Stage 1 and Stage 2) that use the external SDRAM to buffer data between them.

Compressor Stage 1 receives Bayer-encoded pixel data (as produced behind the mosaic color filters) from the SDRAM, where it was stored in line-scan order by the sensor interface module. The data is read as overlapping 20x20 pixel tiles in scan order, waiting (if needed) for the data from the sensor to become available. The Bayer-to-YCbCr 4:2:0 converter calculates 16x16 pixels of intensity (Y) using 5x5 pixels for interpolation. Color data has lower resolution (the sensor does not provide RGB for each pixel anyway): two 8x8 blocks of color components (Cb and Cr) are provided in addition to the four 8x8 pixel blocks of the Y component.
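The ordering and quantization steps can be sketched in Python as follows. The zig-zag pattern generated here is the conventional JPEG-style one and the quantizer values in the example are made up; the exact tables are defined by the Theora specification and are not reproduced here.

```python
def zigzag_order(n=8):
    """(row, col) pairs of an n x n block in conventional zig-zag order."""
    def key(rc):
        r, c = rc
        d = r + c                      # index of the anti-diagonal
        # Odd diagonals are walked downwards (increasing row),
        # even diagonals upwards (increasing column).
        return (d, r if d % 2 else c)
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)


def quantize(coeffs, table):
    """Divide each coefficient by its table entry and round to an integer.

    Note: Python's round() rounds exact halves to the nearest even integer.
    """
    return [int(round(c / q)) for c, q in zip(coeffs, table)]


def dequantize(levels, table):
    """Multiply the quantized levels by the same table."""
    return [l * q for l, q in zip(levels, table)]


order = zigzag_order()
levels = quantize([100, -37, 4, 2], [16, 11, 10, 16])   # hypothetical values
restored = dequantize(levels, [16, 11, 10, 16])
```

In this example the levels are `[6, -3, 0, 0]` and the restored coefficients `[96, -33, 0, 0]`: the difference from the input is the information discarded by the lossy step.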
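A simplified Python illustration of zero-run grouping is given below. The pair format and the `"EOB"` marker are a generic scheme chosen for clarity, not the Theora token set, which is described further on.

```python
def zero_runs(levels):
    """Turn a coefficient list into (zeros_before, value) pairs plus an end marker."""
    pairs = []
    run = 0
    for v in levels:
        if v == 0:
            run += 1                 # extend the current run of zeros
        else:
            pairs.append((run, v))   # emit the run together with the value
            run = 0
    pairs.append("EOB")              # trailing zeros collapse into one marker
    return pairs


grouped = zero_runs([6, -3, 0, 0, 1, 0, 0, 0])
```

Eight coefficients shrink to three pairs and one end marker here; with 64 coefficients per block and mostly zero high frequencies the saving is much larger.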
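The global reordering amounts to transposing the data from block-major to coefficient-major order, as in this minimal Python sketch (the function name is hypothetical, and short lists stand in for the 64 coefficients of a block):

```python
def reorder_by_index(blocks):
    """blocks: equal-length coefficient lists, one per 8x8 block, in zig-zag order.

    Returns all first coefficients (DC), then all second ones, and so on.
    """
    return [blk[i] for i in range(len(blocks[0])) for blk in blocks]


stream = reorder_by_index([[1, 2, 3, 4],
                           [5, 6, 7, 8],
                           [9, 10, 11, 12]])
```

The first output value of the last block cannot be followed by the second value of the first block until every block has been processed, which is why a whole frame has to be buffered.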
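The tile size is consistent with the interpolation window: a 5x5 kernel needs two extra pixels on each side of a 16x16 output area. The Python sketch below works this out, assuming a symmetric margin; how the camera treats tiles at the frame edges is not described in the text, so the windows computed here may extend beyond the frame.

```python
MACROBLOCK = 16                   # luma pixels produced per tile side
KERNEL = 5                        # interpolation window, from the text
MARGIN = KERNEL // 2              # extra pixels needed on each side (assumed symmetric)
TILE = MACROBLOCK + 2 * MARGIN    # sensor pixels read per tile side


def tile_origins(width, height):
    """Top-left corners of the 16x16 output macroblocks in scan order."""
    return [(x, y)
            for y in range(0, height, MACROBLOCK)
            for x in range(0, width, MACROBLOCK)]


def tile_window(x, y):
    """Sensor-pixel rectangle (left, top, right, bottom) read for one macroblock."""
    return (x - MARGIN, y - MARGIN,
            x + MACROBLOCK + MARGIN, y + MACROBLOCK + MARGIN)


origins = tile_origins(1280, 1024)
```

Neighbouring windows overlap by four pixels, so each tile re-reads part of the data of its neighbours; this is part of the memory bandwidth budget discussed earlier.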

The next step depends on how the frame is coded; currently only 2 of the 8 Theora coding modes are supported:

  • INTRA (similar to JPEG or MPEG “i-frames”, also called “key frames” or “golden frames”) where all the frame data is encoded without dependence on information from any other frames and

  • INTER NOMV (no motion vectors) where only the difference between current and previous frame is encoded (similar to MPEG “p-frames”).

Depending on the type of the frame, YCbCr data goes to the 8x8 Forward DCT either directly (INTRA) or after the reference frame is subtracted (INTER). The DCT output is fed to the Quantizer, which uses FPGA embedded block RAM to store up to 8 alternative tables.

Next the data is split into two branches; one of them is needed to calculate the reference frame for the next INTER frame. The decoder on the other side does not have access to the original previous frame, only to the decoded one (after lossy compression), so the encoder has to apply exactly the same algorithm as the decoder does. That includes the Dequantizer and the 8x8 Inverse DCT. The output is combined with the reference frame data, delayed by the Bypass buffer so that it arrives at the same time as the processed data, and is fed to the multiplexer that selects the IDCT output (INTRA frames), the sum of the IDCT output and the reference frame (INTER) or just the reference frame (for the blocks that were not coded). The multiplexer output goes to the SDRAM controller and the data is then written to memory as a new reference frame.

The second branch of the video data from the quantizer goes to the DC Prediction module, where the data from already encoded blocks is used to extrapolate (predict) the DC coefficient for the current one. Only the difference between the actual and predicted value is encoded; it is usually much smaller than the value itself. AC coefficients pass unchanged; they are just re-sequenced in reverse zig-zag order.

The next module in the chain is the Coefficient Encoder. According to the Theora specification each coefficient (or several of them together if there is a zero run) is encoded as a 5-bit token with 0 to 10 additional bits. In the FPGA implementation this data is stored in external memory as fixed-width (12-bit) “pre-tokens” that can easily be converted into variable-length ones later. The tokens include single coefficient ones as well as more complex ones, such as “2...3 zeros followed by +/-2...3”. Seven tokens are reserved for end of block (EOB) runs, when multiple subsequent blocks have only zeros for all quantized DCT coefficients with an index equal to or greater than the current one. That information is not available at this stage; EOB tokens are determined later, after stage 1 of the compressor has processed the whole frame and stored the intermediate results (pre-tokens) in external memory. This data is written in the sequence of the compressor stage 1 processing: the outer loop iterates through 16x16 pixel macroblocks in scan order, the next loop through the six 8x8 pixel blocks (four intensity ones and two color ones), and the inner loop goes through the 64 coefficients (some are empty in the case of zero runs).

Compressor Stage 2 receives pre-tokens in a sequence that is very different from how they were written to external memory. Now the outer loop goes through the DCT coefficient indexes (from DC to the highest frequency AC63); for each index the color planes (brightness Y, then color Cb and Cr) are iterated; for each color plane the superblocks (32x32 pixels or 4x4 blocks) are scanned from left to right, then bottom to top; and in each superblock the blocks are iterated in a Hilbert pattern (in this case it looks like a capital omega with a dip on the top). The memory controller uses a specially designed memory map and access sequence so that in both writing and reading all the DDR SDRAM bus cycles are used for data transfer.

The re-sequenced tokens are fed to the EOB Run Extractor; at this stage it is possible to generate this group of tokens. Both types of tokens (the coefficient and the EOB ones) are processed by the Huffman Encoder, which uses individual groups of tables for different color planes and DCT index ranges. The variable length tokens and additional parameters are combined in the Bitstream Packager module into 16-bit words that are buffered in embedded block RAM and sent out from the FPGA to the system memory through the 32-bit DMA channel.
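The reason the encoder must rebuild its reference from quantized data can be shown with a small Python sketch. The DCT and IDCT are left out for brevity, a single scalar quantizer `q` is used, and the function names are hypothetical.

```python
def encode_inter(current, reference, q):
    """Return (levels, new_reference) for one block given as flat pixel lists."""
    residual = [c - r for c, r in zip(current, reference)]
    levels = [int(round(d / q)) for d in residual]     # lossy step
    decoded = [l * q for l in levels]                  # what a decoder recovers
    new_reference = [r + d for r, d in zip(reference, decoded)]
    return levels, new_reference


def decode_inter(levels, reference, q):
    """Decoder side: add the dequantized difference to its own reference."""
    return [r + l * q for r, l in zip(reference, levels)]


ref = [100, 100, 100, 100]
cur = [103, 111, 100, 91]
levels, enc_ref = encode_inter(cur, ref, q=4)
dec_ref = decode_inter(levels, ref, q=4)
```

The encoder's new reference matches what the decoder reconstructs, not the original frame. If the encoder kept the original pixels as its reference instead, the two sides would drift apart with every INTER frame.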
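The principle of DC prediction can be shown with a deliberately simplified Python sketch in which the predictor is just the DC value of the previous block. This is an assumption made for illustration; the prediction rule defined by the Theora specification uses several neighbouring blocks and is not reproduced here.

```python
def dc_differences(dc_values):
    """Replace each DC value by its difference from the previous one."""
    prev = 0
    out = []
    for dc in dc_values:
        out.append(dc - prev)
        prev = dc
    return out


def dc_restore(diffs):
    """Inverse operation, as a decoder would perform it."""
    prev = 0
    out = []
    for d in diffs:
        prev += d
        out.append(prev)
    return out


diffs = dc_differences([50, 52, 51, 51, 60])
```

Apart from the first value, the differences are small numbers clustered around zero, which the entropy coder can represent with short codes.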
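The nested read order can be written down as a generator in Python. The block table below is an assumption based on the coded-order figure of the Theora specification and should be checked against it; `stage2_order` and the plane sizes are hypothetical, and partially filled superblocks at the frame edges are ignored.

```python
# Block order inside a 4x4 superblock as (column, row), with row 0 at the bottom.
# Assumed from the Theora specification; verify against the spec before relying on it.
HILBERT_4X4 = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 2), (0, 3), (1, 3), (1, 2),
               (2, 2), (2, 3), (3, 3), (3, 2), (3, 1), (2, 1), (2, 0), (3, 0)]


def stage2_order(planes):
    """planes: dict mapping plane name to (superblock columns, superblock rows).

    Yields (coefficient index, plane, block column, block row) in read order.
    """
    for index in range(64):                           # DC first, AC63 last
        for name, (sb_cols, sb_rows) in planes.items():
            for sb_row in range(sb_rows):             # bottom to top
                for sb_col in range(sb_cols):         # left to right
                    for bx, by in HILBERT_4X4:
                        yield index, name, sb_col * 4 + bx, sb_row * 4 + by


# A tiny frame: 2x1 superblocks of Y, 1x1 each of Cb and Cr.
sequence = list(stage2_order({"Y": (2, 1), "Cb": (1, 1), "Cr": (1, 1)}))
```

Every step of the table moves to an edge-adjacent block, which is the defining property of the Hilbert pattern; the path climbs the left side, dips in the middle of the top row and descends on the right.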
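Packing variable-length codes into fixed 16-bit words can be sketched in Python as below. The MSB-first bit order and the zero padding of the last word are assumptions made for the example, and the code values are arbitrary rather than real Huffman codes.

```python
def pack_bits(codes):
    """codes: iterable of (value, bit_length). Returns a list of 16-bit words."""
    words = []
    acc = 0        # bits collected so far, newest bits at the low end
    nbits = 0      # number of valid bits in acc
    for value, length in codes:
        acc = (acc << length) | (value & ((1 << length) - 1))
        nbits += length
        while nbits >= 16:
            nbits -= 16
            words.append((acc >> nbits) & 0xFFFF)   # emit the oldest 16 bits
        acc &= (1 << nbits) - 1                     # keep only the leftover bits
    if nbits:
        words.append((acc << (16 - nbits)) & 0xFFFF)  # zero-pad the last word
    return words


packed = pack_bits([(0b101, 3), (0b11111, 5), (0xAB, 8), (1, 1)])
```

The four codes above total 17 bits, so they fill one complete word (0xBFAB) and spill a single bit into a second, padded word (0x8000).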

Conclusion

Modern FPGAs are powerful enough to handle such complex tasks as a real time video compression – traditional application area of custom ASICs. FPGA implementation of Theora video encoder runs at 30fps for 1280x1024 resolution, 12fps for full 2048x1536 and several hundred frames per second for smaller windows – that makes Elphel model 333 the first high resolution network camera to support true video compression.
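The two full-frame figures correspond to nearly the same pixel throughput, as a short Python calculation shows (the names are illustrative):

```python
def pixel_rate(width, height, fps):
    """Pixels processed per second at a given frame size and rate."""
    return width * height * fps


rate_1280 = pixel_rate(1280, 1024, 30)
rate_2048 = pixel_rate(2048, 1536, 12)
ratio = rate_2048 / rate_1280
```

Both cases come to just under 40 million pixels per second (a ratio of 0.96), which suggests that the frame rate is limited by pixel throughput and therefore scales inversely with the frame area.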

All the FPGA code (Verilog source, constraints and project settings) of Elphel products comes under the GNU GPL, and FPGA development does not require expensive tools: Xilinx provides software that is free to download. These tools can be used to compile the code into an actual image that can be loaded into the on-board FPGA, so a software developer can use the camera (or similar products based on reconfigurable devices) to experiment with FPGA design.

The Theora video codec is a new one and is not yet as popular as traditional MPEG, but it can be freely used and modified without any royalty payments. That makes it the only high performance video codec suitable for inclusion in most GNU/Linux distributions, and it is already supported by many of the available video players.

We hope that our products will help to introduce FPGA design to software developers and will contribute to the acceptance of Ogg Theora in the new application area of network cameras.

 
© 2005 LinuxTag e.V.