Imaging solutions with Free Software & Open Hardware

Who's online

There are currently 0 users online.

Subscribe to Elphel Development Blog feed Elphel Development Blog
Updated: 11 min 44 sec ago

Elphel at CVPR 2018

Wed, 06/20/2018 - 00:12

Elphel booth at CVPR 2018 Expo

Elphel is presenting at CVPR 2018 EXPO: Long Range Passive 3D Reconstruction System,  providing NN with custom training sets, enabling realtime 3D-aware machine learning systems.

Please visit us at CVPR, Booth 132.

You can also find our presentation and related articles on wiki page



Capturing Aircraft Position with the Long Range Quad Stereo Camera

Sun, 05/06/2018 - 12:57

Figure 1. Aircraft positions during descent captured with the quad stereo camera. Each animation frame corresponds to the available 3-D model.

While we continue to work on the multi-sensor stereo camera hardware (we plan to double the number of sensors to capture single-exposure HDR image sets) and develop code to get the ground truth data for the CNN training, we had some fun testing the camera for capturing aircraft position in 3-D space. Completely passive method, of course.

We found a suitable spot about 2.5 km from the beginning of the runway 34L of the Salt Lake City international airport (exact location is shown in the model viewer) so approaching aircraft would pass almost over our heads. With the 60°(H)×45°(V) field of view of the camera aircraft are visible when they are 270 m away horizontally.

Camera was set to capture synchronized image sets at 5 frames per second (same as in Fig. 1 animation), each set was processed independently without any relation to the other ones (no optical flow or motion vectors). Our existing (not airplane-aware) prototype software generated the scene 3-D models, they in turn were used to measure the aircraft positions. Result accuracy may be improved by using directly the intermediate Disparity Space Image (DSI) data – current reconstruction code tries to fit some flat surfaces to approximate the DSI data. At long distances there is insufficient data about the shape of the objects, so at ~500 m range (Fig. 2) the software used a fronto-parallel surface (distances to the aircraft nose, center, and tail are the same) and at ~600m it selected a tilted surface in the wrong direction, so the measured distance to the tail is larger than the one measured to the nose and center.

Figure 2. Ground distance and altitude of the three points sampled.

The measurements accuracy is still good, considering that the disparity of 1 image pixel corresponds for this short baseline camera to 527 meters range, at 800 m 1/10 pixel disparity precision corresponds to 120 m uncertainty. Ground speed and the descend rate (indicated in the animation) are calculated between the first captured set and the current one, so the last value averages over the full 6 seconds.

This method of capturing aircraft position and speed does not require that it travels directly to/from the camera, it can fly in any direction. There could also be multiple aircraft (such as UAVs) in the same camera view. Another 3-D scene (Fig. 3) captured from the same spot perpendicular to the flight direction shows an aircraft at 1600 m from the camera at 110 m altitude. According to the map the center line of the 34R runway is 14% farther (at 1820 m) from the camera position. That corresponds to 0.045 pix disparity error in the distance measurement, so calculated ground speed and altitude would also be 14% lower than actual.

Figure 3. Aircraft traveling perpendicular to the view direction.

Each of the provided 3-D scene models has the corresponding “set of 4 images” link on the settings page that can be viewed as an animation, manually iterated over or downloaded – the one for Fig. 3 scene shows how a 0.33 pixel wiggle of the FedEx MD-11 image looks like (images are zoomable and pannable). Here are the direct links for image sets of the first (1.75 pix disparity) and the last (0.7 pix disparity) of the scene models of the approaching B737 shown in Fig. 1.

These four images are not used in the actual reconstruction process, they are rendered from the source raw (JP4 format) ones for the specified disparity (D0.0 in the file names show that disparity was 0.0 pixels) for viewing or evaluating with alternative software. These images are not rectilinear and have residual common (for all four) radial distortion. They have optical aberrations corrected by the space-variant deconvolution, sub-camera residual misalignment is field-corrected (such misalignment may be caused by an uneven heating of the camera tubular structure by the direct sunlight). The images are partially rectified to the common radial distortion model that bests fits simultaneously all four sub-cameras. The remaining radial distortion is compensated during final generation of the 3-D model. These images have visible grid artifacts caused by the Bayer pattern – no de-mosaic processing is involved because it hurts linear operations over the images, and it is not adequate for the peripheral image areas (see earlier post). Each color component is processed separately for the phase correlation, when combining texture images these artifacts are reduced as they are not coherent between the channels. Any additional non-linear improvements of the Bayer mosaic suppression can be applied to the texture generation step, not to the data used for correlation.

Implementing a linux driver for an image sensor for NC393

Fri, 04/20/2018 - 12:48

Fig.1 MT9F002


This post briefly covers implementation of a driver for On Semi’s MT9F002 14MPx image sensor for 10393 system boards – the steps are more or less general. The driver is included in the latest software/firmware image (20180416). The implemented features are programmable:

  • window size
  • horizontal & vertical mirror
  • color gains
  • exposure
  • fps and trigger-synced ports
  • frame-based commands sequence allowing to change settings of any image up to 16 frames ahead (didn’t need to be implemented as it’s the common part of the driver for all sensors)
  • auto cable phase adjustment during init for cables of various lengths


  • Get the register reference document for MT9F002
  • Setup the Eclipse IDE for linux kernel development – see instructions


Capturing images without the driver

There is a set of scripts used for verification in software (on PC) and hardware (on the target board). Among other things they can be used to establish initial communication with a sensor, run registers initialization and acquire the first images. The process for MT9F002 is described on our wiki.

Driver implementation

In general the driver can be split into 2 parts:

  • common for all sensors – sets up working with fpga frame-based command sequencers (see NC393 development progress and the future plans).
  • sensor specific – commands that do actual communication with a sensor. This is the part needs to be implemented.

For a sensor there are a few tables and a few functions, the most important ones are:


  • mt9f002_pages – a ‘page’ is the upper byte of a 16-bit register address of the sensor (for 8-bit reg addresses the page number is 0x00), the table should list all available pages in the sensor – or at least the most important ones – they are used to program fpga’s 256-max records per port tables used to write commands and at least one record is used to read from port – so in general for a setup w/o a MUX board the limit of allowed pages per sensor is 256-1 = 255 (with a MUX + 3x sensors on a single port it is 256-2(mux write & read)-4(3x sensor read + 1x broadcast read))/4 = 62)
  • mt9f002_par2addr – used convert 256 8-bit parameters available as P_SENSOR_REGSxxx to 16-bit register address
  • mt9f002_ahead_tab – sequencer commands latencies – leave default, they are needed on the final stages of development

functions (they are registered on top of the common ones):

  • mt9f002_pgm_detectsensor – remove reset read sensor id
  • mt9f002_pgm_initsensor – init registers that select interface, sensor’s PLL multipliers/dividers and other required settings
  • mt9f002_pgm_exposure – program exposure
  • mt9f002_pgm_window – program image width and height
  • mt9f002_pgm_limitfps – limit fps based on compressor capabilities and trigger period settings
  • mt9f002_pgm_gains – program color gains
  • mt9f002_pgm_triggermode – set sensor in triggered mode
  • mt9f002_pgm_sensorregs – write sensor registers directly

See this article for more details.


Dual Quad-Camera Rig for Capturing Image Sets

Tue, 03/20/2018 - 00:35

Figure 1. Dual quad-camera rig mounted on a car

Following the plan laid out in the earlier post we’ve built a camera rig for capturing training/testing image sets. The rig consists of the two quad cameras as shown in Figure 1. Four identical Sensor Front Ends (SFE) 10338E of each camera use 5 MPix MT9P006 image sensors, we will upgrade the cameras to 18 MPix SFE later this year, the circuit boards 103981 are in production now.

We decided to use cameras with different baselines: 150 mm and 258 mm, the distance between the quad-camera centers is 1256 mm. The distance map calculated from the larger of the two cameras will have higher precision, than the distance map from the smaller one. For the same processing algorithms and the same object the distance resolution is reverse proportional to the baseline. Even higher resolution will be achieved by fusing depth maps from the two cameras together. The higher precision results may be used for the cost functions of the lower distance resolution camera instead of the unavailable real ground truth data. To make the rig sufficiently rigid the cameras are mounted on the aluminum tube (OD=57 mm), and that tube is attached to the rest of the mount through the vibration isolators.

Each of the two cameras uses matched set of the lenses with focal length variation of less than 1% to simplify frequency domain mutual image rectification. Cameras still need fine alignment and calibration, after that we will start image sets capturing.

This year Computer Vision and Pattern Recognition conference (CVPR 2018) will be held in Salt Lake City, Utah. We plan to participate in the CVPR exhibition and we will be happy to see visitors at Elphel office that is less than 10 km from the conference venue.

High Resolution Multi-Vew Stereo: Tile Processor and Convolutional Neural Network

Mon, 02/05/2018 - 22:08

Figure 1. Multi-board setup for the TP+CNN prototype

This article describes our next steps that will continue the year-long research on high resolution multi-view stereo for long distance ranging and 3D reconstruction. We plan to fuse the methods of high resolution images calibration and processing, already emulated functionality of the Tile Processor (TP), developed RTL code for its implementation and the Convolutional Neural Network (CNN). Compared to the CNN alone this approach promises over a hundred times reduction in the number of input features without sacrificing universality of the end-to-end processing. The TP part of the system is responsible for the high resolution aspects of the image acquisition (such as optical aberrations correction and image rectification), preserves deep sub-pixel super-resolution using efficient implementation of the 2-D linear transforms. Tile processor is independent of any training, only a few hyperparameters define its operation, all the application-specific processing and “decision making” is delegated to the CNN.

Machine Learning for 3D Scene Reconstruction

Machine learning is an active development area, and its applications to the 3D scene reconstruction stimulated by the autonomous vehicles including self-driving cars is no exception. Use of the CNNs to extract surfaces from the random-dot stereograms was published as early as 1992[1]. Most of the modern researches use standard image sets: Middlebury stereo data set[2] for high resolution near objects and KITTI[3] for the longer range applications. KITTI images are acquired from a moving car, they have attached ground truth data captured by the LIDAR. This image set uses binocular pairs and has relatively low resolution (1.4 MPix) compared to the modern image sensors, and still most of the CNN architectures require from seconds to thousands of seconds even when implemented with GPU devices and so are not yet suitable for the real-time applications.

Most of the CNNs[4] input raw pixel data and perform unary feature extraction in the parallel (one for each image in a stereo set) subnets, merge features and perform additional processing of the resulting 3d data. This is a so-called “siamese” network architecture that benefits from sharing parameters between the identical subnetworks. It is common to put most resources to the unary part of the processing resulting in truncating of the common stage of the processing that can consist of just a single layer(Fast Architecture in [4]). Efficient implementation in [5] limits CNN processing to just generate DSI and use traditional methods of DSI enhancement such as semi-global matching [6], other architectures split network after exchanging features and generate depth maps for each of the stereo images individually[7].

Figure 2. 2D MDCT basis functions (¼ of all MCLT ones for N=8)

Convolutional Neural Networks and the Frequency Domain Processing

Early layers of the various CNNs (and the eye retina too) are very general and even remind the basis functions (Figure 2) of the two-dimensional Fourier (DFT), cosine/sine (DCT and DST) and wavelet (DWT) transforms so it is no surprise that there are works that explore combinations of such transforms and the neural networks. Some of them [89] exploit energy concentration property of these transforms that makes possible popular image compression such as JPEG. Others [1011] evaluate efficiency of the available Fast Fourier Transform implementations to speed-up convolutions by converting image data to the frequency domain and then applying the pointwise multiplication according to the convolution-multiplication property. Improvement is modest, as the frequency domain calculations are most efficient for the large windows, while most modern CNNs use small ones, such as 3×3, where Winograd algorithm is more efficient.

Tile Processor and the High Resolution Multi-View Camera

Multi-view high resolution cameras present a special case where frequency domain processing results may lead to the reduction of the CNN input features by two orders of magnitude compared to raw pixel input. As described in the earlier post Tile Processor is using efficient Modulated Complex Lapped Transform (MCLT) conversion of the Bayer mosaic (color) high resolution image data to the frequency domain, simultaneously providing subpixel resolution shift for image rectification. Frequency domain processing includes space-variant optical aberration correction (required for the high resolution small format image sensors), phase correlation for image pairs and textures processing if it is required in addition to the distance (disparity) measurement. After channels/pairs merging the frequency domain data is converted back to the pixel domain where the tile correlation maximums are located and measured. Each overlapping tile is 16×16 pixels, the tile period (corresponding to stride in CNN terms) is 8×8 pixels. In the case when only the disparity value and confidence are output from each tile it makes 2 features. If the raw pixels were input, for 4 channels it would be 4*8*8 = 256 features or 128 times more. It may be useful to increase the number of features and supplement average disparity for all 4 pairs of quad camera with separate and horizontal and vertical pairs to improve foreground-background separation, in that case the feature reduction would still be over 40.

Conversion of the raw pixels to the Disparity Space Image (DSI) involves significant reduction of the (X,Y) resolution, when using four of the 18 MPix (4912×3684) imagers the DSI resolution is just 614×460, – isn’t it a waste of the sensor resolution? No, it is not:

  • a deep sub-pixel resolution for disparity measurement needed for long-distance ranging requires matching of the large image areas anyway
  • most of the image area for the most real-world images corresponds to smooth 3-d surfaces where assumption of a common disparity value for a tile is reasonable.
  • the initial image resolution is preserved by the TP when source images are converted to the textures (simultaneously improving quality as the data from 4 rectified images is averaged)
  • pixel-accurate distance map may be restored by extra processing the pixel data for selected tiles where depth discontinuity is detected, then assigning each pixel to one of the available surfaces.

Advantages of the TP+CNN over the End-to-End CNN

Significant (42..128) reduction of the input features is not the only advantage of the TP+CNN combination over the CNN alone. Being “convolutional” CNNs depend on translation symmetry, the groups of related pixels are treated the same way regardless of their localization in the image. That is only an approximation, especially when dealing with the high resolution images and extracting subpixel disparity values. This divergence from the strict convolution model is caused by the optical aberrations and distortions and requires use of the space-variant convolution instead, or performing complete aberration correction and image rectification before the images are fed to the network. Otherwise both the complexity of the network and amount of training data would increase dramatically. Image rectification with pixel (or slightly better) precision is a common task in stereo processing. It involves interpolation and re-sampling of the pixel data, the process that leads to the phase noise introduction, especially harmful when deep super-resolution of the matched images is required. Tile processor implementation combines multiple operations (fractional pixel image shifts, optical aberrations correction, phase correlation of the matched pairs), TP avoids image re-sampling from the sensor pixel grid by replacing it with the phase rotation in the frequency domain.

Final step that reduces the number of features that are sent from the TP to the CNN is extraction of the disparity value by calculation of the argmax of the phase correlation data. This function has to be calculated with subpixel resolution for the data defined on the integer pixel grid. Certain biases are possible, and the TP implementation offers trade-off between speed and accuracy. The result disparity value is a sum of the pre-applied disparity (implemented as a phase rotation in the frequency domain on top of the integer pixel shift) and the argmax value (correlation maximum offset from zero). When higher accuracy is required, a second iteration may be performed by applying the full disparity from the first iteration, then the residual argmax offset will be close to zero and less subject to bias.

System Performance Estimation and the Prototype Setup

Optimal system for the real-time high resolution 3-D scene reconstruction and ranging would require development of the application-specific SoC. If used with a set of four 18 MPix image sensors (such as ON Semiconductor AR1820HS) and a single ×16 1600 MHz DDR4 memory device, the 16 nm technology process, the TP subsystem will be capable of 10 Hz operation covering the full 4912×3684 frames reserving half of the memory bandwidth for other then TP operations.

We plan to emulate such system using available NC393 camera electronic and optical-mechanical components, including multiple 10393 system boards based on Xilinx Zynq 7030 SoC. Each such board has a GigE port and four identical sensor ports routed directly to the FPGA I/O pads allowing flexible assignment of pin functions. Typical applications include up to 8 differential LVDS pairs, clock pair, I²C and clock input. The same connectors can be used for high-speed communication between the 10393 boards. Partitioning the system into multiple boards will allow to fit the required TP functionality into smaller FPGAs, then send the result features (614×460×6) over the GigE to a workstation with GPU for the experiments with different CNN implementations. The system bandwidth will be lower than that of the application-specific SoC, the 10 Hz operation will be possible with 5 MPix sensors (2.5 Hz with 18 MPix).

Inter-board connections are shown in Figure 1 (just the connections, the actual prototype camera will look more like in Figure 3, but with a wider body). Five to seven of the 10393 boards are arranged in 2 layers. Four layer 1 boards use one of the sensor ports to receive image data from the attached sensor, perform image conditioning, flat-field correction and store data in the dedicated DDR3 memory. They later read the data as 16×16 pixels overlapping tiles, calculate the tile centers using calibration data and requested location and nominal disparity from the data received over the GigE. Each tile is transformed to the frequency domain, the data is subject to the space-variant aberration correction. The result frequency domain tiles are output through three remaining sensor ports that are reconfigured to be LVDS transmitters. The layer 2 boards simultaneously receive frequency domain data through all 4 of their sensor ports from the layer 1 and perform phase correlation (pointwise multiplication followed by normalization) on the image pairs. There could be just a single layer 2 board, or up to 3 (limited by the available layer 1 ports) to perform different types of correlations in parallel (all 4 pairs combined, two vertical pairs and separately 2 horizontal pairs for better foreground/background separation. The results of the frequency domain calculations are then transformed to the pixel domain and the argmax is calculated. Then argmax value is used to calculate the full tile disparity, and the corresponding correlation value – as a disparity confidence. The pair (disparity, confidence) for each tile is then sent over GigE to the CNN implemented on a workstation computer.

Figure 3. Quad sensor camera for image sets acquisition

Image Sets for Training and Testing

While the TP functionality is already tested with the software emulation, and the efficient implementation is developed, more research is needed for the CNN part of the system. Available image sets, such as KITTI[3] have insufficient resolution (1.4 MPix) and they use different spatial arrangement of the cameras. We plan to capture high resolution quad camera image sets using available NC393-based cameras that will be upgraded from 5 MPix to 18 MPix sensors of the same 1/2.3″ format so the optical-mechanical design will remain the same. As we are primarily interested in long distance ranging (few hundreds to thousands meters), use of the LIDARs to capture ground truth data is not practical. Instead we plan to mount a pair of identical quad cameras (with the baseline of 150mm) on a car 1500 mm apart, pointed in the same direction, so when the 3D measurements from these quad cameras are fused, the accuracy of the composite distance data would be ten times better, because the effective baseline will be 1500mm. Of course that method has some limitations (it will not help to improve data from the poorly textured objects) but it will provide higher absolute distance resolution that can be used for the loss function during CNN training. Data from the individual quad cameras will be used for training and testing of the network.

All acquired images, related calibration data and software will be available online under GNU GPL.


[1] Becker, Suzanna, and Geoffrey E. Hinton. “Self-organizing neural network that discovers surfaces in random-dot stereograms.” Nature 355.6356 (1992): 161.

[2] Scharstein, Daniel, et al. “High-resolution stereo datasets with subpixel-accurate ground truth.” German Conference on Pattern Recognition. Springer, Cham, 2014.

[3] Menze, Moritz, and Andreas Geiger. “Object scene flow for autonomous vehicles.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

[4] J. Zbontar and Y. LeCun, “Stereo matching by training a convolutional neural network to compare image patches,” Journal of Machine Learning Research, vol. 17, no. 1-32, p. 2, 2016.

[5] W. Luo, A. G. Schwing, and R. Urtasun, “Efficient deep learning for stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5695–5703, 2016.

[6] H. Hirschmuller, “Accurate and efficient stereo processing by semi-global matching and mutual information,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2, pp. 807–814, IEEE, 2005.

[7] J A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, “End-to-end learning of geometry and context for deep stereo regression,” arXiv preprint arXiv:1703.04309, 2017.

[8] Sihag, Saurabh, and Pranab Kumar Dutta. “Faster method for Deep Belief Network based Object classification using DWT.” arXiv preprint arXiv:1511.06276 (2015).

[9] Ulicny, Matej, and Rozenn Dahyot. “On using CNN with DCT based Image Data.” Proceedings of the 19th Irish Machine Vision and Image Processing conference IMVIP 2017

[10] Vasilache, Nicolas, et al. “Fast convolutional nets with fbfft: A GPU performance evaluation.” arXiv preprint arXiv:1412.7580 (2014).

[11] Lavin, Andrew, and Scott Gray. “Fast algorithms for convolutional neural networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

Photo Finish

Tue, 01/30/2018 - 14:53

Since 2005 and the older 333 model, Elphel cameras have a Photo Finish mode. First, it was ported to 353 generation, and then from 353 to 393 camera systems.

In this mode the camera samples scan lines and delivers composite images as video frames. Due to the Bayer pattern of the sensor the minimal sample height is 2 lines. The max fps for the minimal sample height is 2300 line pairs per second. The max width of a composite frame can be up to 16384px (is determined by WOI_HEIGHT). A sequence of these frames can be simply joined together without any missing scan lines.

Current firmware (20180130) includes a photo finish demo:

A couple notes for 393 photo finish implementation:

  • works in JP4 format (COLOR=5). Because in this format demosaicing is not done it does not require extra scan lines, which simplified fpga’s logic.
  • fps is controlled:
    • by exposure for the sensor in the freerun mode (TRIG=0, delivers max fps possible)
    • by external or internal trigger period for the sensor in the snapshot mode (TRIG=4, a bit lower fps than in freerun)

See our wiki’s Photo-finish article for instructions and examples.

Efficient Complex Lapped Transform Implementation for the Space-Variant Frequency Domain Calculations of the Bayer Mosaic Color Images

Mon, 01/08/2018 - 02:00

This post continues discussion of the small tile space-variant frequency domain (FD) image processing in the camera, it demonstrates that modulated complex lapped transform (MCLT) of the Bayer mosaic color images requires almost 3 times less computational resources than that of the full RGB color data.

“Small Tile” and “Space Variant”

Why “small tile“? Most camera images have short (up to few pixels) correlation/mutual information span related to the acquisition system properties – optical aberrations cause a single scene object point influence a small area of the sensor pixels. When matching multiple images increase of the window size reduces the lateral (x,y) resolution, so many of the 3d reconstruction algorithms do not use any windows at all, and process every pixel individually. Other limitation on the window size comes from the fact that FD conversions (Fourier and similar) in Cartesian coordinates are shift-invariant, but are sensitive to scale and rotation mismatch. So targeting say 0.1 pixel disparity accuracy the scale mismatch should not cause error accumulation over window width exceeding that value. With 8×8 tiles (16×16 overlapped) acceptable scale mismatch (such as focal length variations) should be under 1%. That tolerance is reasonable, but it can not get much tighter.

What is “space variant“? One of the most universal operations performed in the FD is convolution (also related to correlation) that exploits convolution-multiplication property. Mathematically convolution applies the same operation to each of the points of the source data, so shifted object of the source image produces just a shifted result after convolution. In the physical world it is a close approximation, but not an exact one. Stars imaged by a telescope may have sharper images in the center, but more blurred in the peripheral areas. While close (angularly) stars produce almost the same shape images, the far ones do not. This does not invalidate convolution approach completely, but requires kernel to (smoothly) vary over the input images [12], makes it a space-variant kernel.

Figure 1. Complex Lapped Transform with DCT-IV/DST-IV: time-domain aliasing cancellation (TDAC) property. a) selection of overlapping input subsequences 2*N-long, multiplication by sine window; b) creating N-long sequences for DCT-IV (left) and DST-IV (right); c) (after frequency domain processing) extending N-long sequence using DCT-IV boundary conditions (DST-IV processing is similar); d) second multiplication by sine window; e) combining partial data

There is another issue related to the space-variant kernels. Fractional pixel shifts are required for multiple steps of the processing: aberration correction (obvious in the case of the lateral chromatic aberration), image rectification before matching that accounts for lens optical distortion, camera orientation mismatch and epipolar geometry transformations. Traditionally it is handled by the image rectification that involves re-sampling of the pixel values for a new grid using some type of the interpolation. This process distorts the signal data and introduces non-linear errors that reduce accuracy of the correlation, that is important for subpixel disparity measurements. Our approach completely eliminates resampling and combines integer pixel shift in the pixel domain and delegates the residual fractional pixel shift (±0.5 pix) to the FD, where it is implemented as a cosine/sine phase rotator. Multiple sources of the required pixel shift are combined for each tile, and then a single phase rotation is performed as a last step of pixel domain to FD conversion.

Frequency Domain Conversion with the Modulated Complex Lapped Transform

Modulated Complex Lapped Transform (MCLT)[3] can be used to split input sequence into overlapping fractions, processed separately and then recombined without block artifacts. Popular application is the signal compression where “processed separately” means compressed by the encoder (may be lossy) and then reconstructed by the decoder. MCLT is similar to the MDCT that is implemented with DCT-IV, but it additionally preserves and allows frequency domain modification of the signal phase. This feature is required for our application (fractional pixel shifts and asymmetrical lens aberrations modify phase), and MCLT includes both MDCT and MDST (that use DCT-IV and DST-IV respectively). For the image processing (2d conversion) four sub-transforms are needed:

  • horizontal DCT-IV followed by vertical DCT-IV
  • horizontal DST-IV followed by vertical DCT-IV
  • horizontal DCT-IV followed by vertical DST-IV
  • horizontal DST-IV followed by vertical DST-IV

MathJax.Hub.Config({ extensions: ["mml2jax.js"], jax: ["input/MathML","output/HTML-CSS"], "HTML-CSS": { availableFonts: ["TeX"], scale: 75 }, MMLorHTML: { prefer: { MSIE: "HTML", Firefox: "MML", Opera: "HTML", other: "HTML" } } }); Reversibility of the Modified Discrete Cosine and Sine Transforms

Figure 1 illustrates the principle of TDAC (time-domain aliasing cancellation) that restores initial data from the series of individually converted subsections after seemingly lossy transformations. Step a) shows the 2*N long (in our case N=8) data subsets extraction, each subset is multiplied by a window function. It is a half period of the sine function, but other windows are possible as long as they satisfy the Princen-Bradley condition. Each of the sections (a…h) corresponds to N/2 of the input samples, color gradients indicate input order. Figure 2b has seemingly lossy result of MDCT (left) and MDST (right) performed on the 2*N long input (a,b,c,d) resulting in N-long (– c ~ –d,a– b ~ ) – tilde indicates time-reversal of the subsequence (image has the gradient direction reversed too). Upside-down pieces indicate subtraction, up-right – addition. Each of the 2*N → N transforms is irreversible by itself, TDAC depends on the neighbor sections.

Figure 1c shows the first step of original sequence restoration, it extends N-long sequence using DCT-IV boundary conditions – it continues symmetrically around left boundary, and anti-symmetrically around the right one (both around half-sample from the first/last one), behaving like a first quadrant (0 to π/2) of the cosine function. Conversions for DST branch are not shown, they are similar, just extended like a first quadrant of the sine function.

The result sequence of step c) (now 2*N long again) is multiplied by the window function for the second time in step d), the two added terms of the last column (d and c~) are swapped for clarity.

Figure 2. Space-variant phase shift (extra overlap), source-centered window. Distorted reconstruction in the center

The last image e) places result of the step d) and those of similarly processed subsequences (c,d,e,f) and (e,f,g,h) to the same time line. Fraction (d~ ,c~ ) of the first block compensates (–d~ ,– c~ ) of the second, and (f~,e~ ) of the second – (–f~,–e~ ) of the third. As the window satisfies Princen-Bradley condition, and it is applied twice, the c columns of the first and second block when added result in c segment of the original sequence. The same is true for the d, e, and f columns. First and last N samples are not restored as there are no respective left and right neighbors to be processed.

Space-variant shift and TDAC property

Modified discrete cosine and sine transforms exhibit a perfect reconstruction property when input sequence is split into regular overlapping intervals and this is the case for many applications, such as audio and video compression. But what happens in the case of a space-variant shift? As the total shift is split into integer and symmetrical fractional part even smooth variation of the required shift for the neighbor sample sequences there will be places where the required shift crosses ±0.5 pix boundaries. This will cause the overlap to be N±1 instead of exactly N and the remaining fractional pixel shift will jump from +0.5 pix to -0.5pix or vice versa.

Figure 3. Space-variant phase shift (reduced overlap), source-centered window. Distorted reconstruction in the center

Such condition is illustrated in Figure 2, where fractional pixel shift is increased to ±1.0 instead of ±0.5 to avoid signal shape distortion caused by a fractional shift. The sine window is also modified accordingly to have zeros on both ends and so to have a flat top, but it still satisfies Princen-Bradley condition. The sawtooth waveform represents input signal that has large pedestal = 10 and a full amplitude of 3. The two left input intervals (1 to 16 and 9 to 24) are shown in red, the two right ones (15 to 30 and 23 to 38) are blue. These intervals have extra overlap (N+2) between red and blue ones. Dotted waveforms show window functions centered around input intervals, dashed – inputs multiplied by the windows. Solid lines on the top plot show result of the FD rotation resulting in 1 shift to the right for the red subsequences, to the left – by the blue ones.

The bottom plot of Figure 2 is for the “rectified image”, where the overlapping intervals are spread evenly (0-15, 8-23, 16-31, 24-39), the restored data is multiplied by the windows again (red and blue solid lines) and then added together (dark green waveform) in an attempt to restore the original sawtooth.

Figure 4. Space-variant phase shift (extra overlap), modified window. Correct reconstruction

There is an obvious problem in the center where the peak is twice higher than the center sawtooth (dotted waveform). It is actually a “wrinkle” caused by the input signal pedestal that was pushed to the center by the increased overlap of the input, not by the sawtooth shape. The FD phase shift moved windowed input signal, not just the input signal. So even if the input signal was constant, left two windows would be shifted right by 1, and right ones – left by one sample, distorting their sum. Figure 3 illustrates the opposite case where the input subsequences have reduced rather than increased overlap and FD phase rotation moves windowed data away from the center – the restored signal has a dip in the middle.

This problem can be corrected if the first window function used for input signal multiplication takes the FD shift into account, so that the input windows after phase rotation match the output ones. Signals similar to those in Figure 2, but with appropriately offset input windows (dotted waveforms are by 1 pixel asymmetrical) are presented in Figure 4. It shows perfect reconstruction (dark green line) of the offset input sawtooth signal.

Figure 5. Space-variant phase shift (reduced overlap), modified window. Correct reconstruction[/caption] Figure ? illustrates the correct reconstruction of the input data of Figure 3 made with the input window function modified to match the FD phase rotation. --> Implementation of the MCLT for the Monochrome 16×16 Pixel Tiles

Figure 5. MCLT converter for the 16×16 pixel tiles

The MCLT converter illustrated in Figure 5 takes an array of 256 pixel samples and processes them at the rate of 1 pixel per clock, resulting in 4 of the 64 (8×8) arrays representing FD transformation of this tile. Converter incorporates phase rotator that is equivalent to the fractional pixel shifter with 1/128 pixel resolution. Multiple pixel tiles tiles can be processed immediately after each other or with a minimal gap of 16 pixels needed to restore the pipeline state. Fold sequencer receives start signal and simultaneously two 7-bit fractional pixel X and Y shifts in 1/128 increments (-64 to +63). Pixel data is received from the read-only port of the dual-port memory (input tile buffer) filled from the external DDR memory. Fold sequencer generates buffer addresses (including memory page number) and can be configured for various buffer latency.

Figure 6. DCT-IV and DST-IV basis functions for N=8

Fold sequencer simultaneously generates X and Y addresses for the 2-port window ROM that generates window function (it is a half-sine as discussed above) values that are combined by a multiplier as the 2-d window function is separable. Each of the dimensions calculate window values with appropriate subpixel shift to allow space-variant FD processing. Mapping from the 16×16 to 8×8 tiles is performed according to Figure 1b for each direction, resulting in four 8×8 tiles for DCT/DCT (horizontal/vertical), DST/DCT, DCT/DST and DST/DST future processing, Each of the source pixels contribute to all 4 of the 8×8 arrays, and each corresponding elements of the 4 output arrays share the same 4 contributing pixels – just with different signs. That allows to iterate through the source pixels in groups of 4 once, multiply them by appropriate window values and in 4 cycles add/subtract them in 4 accumulators. These values are registered and multiplexed during the next 4 cycles feeding 512×25 DTT input buffer (DTT stands for Discrete Trigonometric Transform – a collective name for both DCT and DST).

“Folded” data stored in the 512×25 DTT input buffer is fed to the 2-dimensional 8×8 pixel DTT module. It is similar to the one described in the DCT-IV implementation blog post, it was just modified to allow all 4 DTT variants, not just the DCT/DCT described there. This was done using the property of DCT-IV/DST-IV that DST-IV can be calculated as DCT-IV if the input sequence is reversed (x0 ↔ x7, x1 ↔ x6, x2 ↔ x5, x3 ↔ x4) and sign of the odd output samples (y1, y3, y5, y7) is inverted. This property can be seen by comparing plots of the basis functions in Figure 6, the proof will be later in the text.

Figure 7. Frequency domain phase rotator

Another memory buffer is needed after the 2d DTT module as it processes one of four 64-pixel transforms at a time, and the phase rotator (Figure 7) needs simultaneous access to all 4 components of the same FD sample. These 4 components are shown on the left of the diagram, they are multiplied by 4 sine/cosine values (shown as CH, SH, CV and SV) and then combined by the adders/subtracters. Total number of multiplications is 16, and they have to be performed it 4 clock cycles to maintain the same data rate through all the MCLT converter, so 4 multiplier-accumulator modules are needed. One FD point calculation uses 4 different sine/cosine coefficients, so a single ROM is sufficient. The phase rotator uses horizontal and vertical fractional pixel shift values for the second time (first was for the window function) and combines them with the multiplexed horizontal and vertical indexes (3 bits each) as address inputs of the coefficient ROM. Rotator provides rotated coefficients that correspond to the MCLT transform of the pixel data shifted in both horizontal and vertical directions by up to ±0.5 pix at a rate of 1 coefficient per clock cycle, providing both output data and address to the external multi-page memory buffer.

Modulated Complex Lapped Transform for the Bayer Mosaic Color Images

Fig. 8. Lateral chromatic aberration and the Bayer mosaic: a) monochrome (green) PSF, b) composite color PSF, c) Bayer mosaic of the sensor (direction of aberration shown), d) distorted mosaic matching chromatic aberration of b).

Most camera applications use color image sensors that provide Bayer mosaic color data: one red, one blue and two diagonal green pixels in each 2×2 pixel group. In old times image senor pixel density was below the lenses optical resolution and all the cameras performed color interpolation (usually bilinear) of the “missing” colors. This process implied that each red pixel, for example, had four green neighbors (up, down, left and right) at equal distance, and 4 blue pixels located diagonally. With the modern high-resolution sensors it is not the case, possible distortions are shown on Figure 8 (copied from the earlier post). More elaborate “de-mosaic” processing involves non-linear operations that would influence sub-pixel correlation results.

As the simple de-mosaic procedures can not be applied to the high resolution sensors without degrading the images, we treat each color subchannel individually, merging results after performing the optical aberration correction in the FD, or at least compensating the lateral chromatic aberration that causes pixels x/y shift.

Channel separation means that when converting data to the FD the input data arrays are decimated: for green color only half of the values (located in a checkerboard pattern) are non-zero, and for red and blue subchannels – only quarter of all pixels have non-zero values. The relative phase of the remaining pixels depends on the required integer pixel offset (only ±0.5 are delegated to the FD) and may have 2 values (black vs. white checkerboard cells) for green, and 4 different values (odd/even for each of the horizontal/vertical directions independently) for red and blue. MCLT can be performed on the sparse input arrays the same way as on the complete 16×16 ones, low-pass filters (LPF) may be applied on the later processing stages (LPF may be applied to the deconvolution kernels used for aberration correction during camera calibration). It is convenient to multiply red and blue values by 2 to compensate for the smaller number of participating pixels compared to the green sub-channel.

Direct approach to calculate FD transformation of the color mosaic input image would be to either run the monochrome converter (Figure 5) three times (in 256*3=768 clock cycles) masking out different pixel pattern each time, or implement the same module (including the input pixel buffer) three times to achieve 256 clock cycle operation. And actually the 16×16 pixel buffer used in monochrome converter is not sufficient – even the small lateral chromatic aberration leads to mismatch of the 16×16 tiles for different colors. That would require using larger – 18×18 (for minimal aberration) pixel buffer or larger if that aberration may exceed a full pixel.

Luckily MCLT that consists of MDCT and MDST, they it turn use DCT-IV and DST-IV that have a very convenient property when processing the Bayer mosaic data. Almost the same implementation as used for the monochrome converter can transform color data at the same 256 clock cycles. The only part that requires more resources is the final phase rotator – it has to output 3*4*64 values in 256 clock cycles so three instances of the same rotator are required to provide the full load of the rest of the circuitry. This almost (not including rotators) triple reduction of the MCLT calculation resources is based on “folding” of the input data for the DTT inputs and the following DCT-IV/DST-IV relation.

Relation between DCT-IV and DST-IV

DST-IV is equivalent to DCT-IV of the input sequence, where all odd input samples are multiplied by -1, and the result values are read in the reversed order. In the matrix form it is shown in (1) below:

.equation td{text-align:center; vertical-align:middle; } td.eq-no{ width:5%; } table.equation { width:100%; } DST IV = [ 0       1       1       1       ⋅⋅⋅       1       0 ] ⋅ DCT IV ⋅ [ 1       0   -1           1           ⋅⋅⋅   0       -1 ] (1)

Equations (2) and (3) show definitions of DCT-IV and DST-IV[4]:

DCT IV (k) = 2N ⋅ ∑ l=0 N–1 cos( π N ⋅(l+12) ⋅(k+12) ) (2) DST IV (k) = 2N ⋅ ∑ l=0 N–1 sin( π N ⋅(l+12) ⋅(k+12) ) (3)

The modified by (1) DST-IV can be re-writted as the two separate sums for even (l=2*m) and odd (l=2*m+1) input samples and by replacing output samples k with reversed (N-1-k). Then after removing full periods (n*2*π) and applying trigonometric identities it can be converted to the same value as DST-IV:

DCT mod IV (k) = 2N ⋅ ( ∑ m=0 N/2–1 cos( π N ⋅(2⋅m+12) ⋅((N–1–k)+12) ) – ∑ m=0 N/2–1 cos( π N ⋅(2⋅m+32) ⋅((N–1–k)+12) ) ) = 2N ⋅ ( ∑ m=0 N/2–1 cos( π N ⋅( 12⋅N – (2⋅m+12) ⋅(k+12) ) ) – ∑ m=0 N/2–1 cos( π N ⋅( 32⋅N – (2⋅m+32) ⋅(k+12) ) ) ) = 2N ⋅ ( ∑ m=0 N/2–1 cos( –π2 +π N ⋅(2⋅m+12) ⋅(k+12) ) – ∑ m=0 N/2–1 cos( π2 + π N ⋅(2⋅m+32) ⋅(k+12) ) ) = 2N ⋅ ( ∑ m=0 N/2–1 sin( π N ⋅(2⋅m+12) ⋅(k+12) ) + ∑ m=0 N/2–1 sin( π N ⋅(2⋅m+32) ⋅(k+12) ) ) = DST IV (k) (4) Two-dimensional MCLT Symmetries for the Bayer Mosaic Sub-arrays

As shown in Figure 1b, the 2*N-long input sequence of four fragments (a,b,c,d) is folded in N-long sequence (5) for DCT-IV input:

(a,b,c,d) ⟶ (– c ~ –d,a– b ~ ) (5)

and (6) – for DST-IV:

(a,b,c,d) ⟶ ( c ~ –d,a+ b ~ ) (6)

where tilde “~” over the name indicates reversal of the segment. Such direction reversal for the sequences of even length (N/2=4 in our case) swaps odd- and even-numbered samples. Each of the halves of each of (5) and (6) show that both have the same (–d,a) for direct and differ in sign for reversed: (– c~ , –b~ ) and ( c~ , b~ ) terms. Each of the input samples appears exactly once in each of the DCT-IV and DST inputs. Even samples of the input sequence contribute identically to the even samples of both output sequences (through a an d) and contribute with the opposite signs to the odd output samples (through b and c). And similarly the odd-numbered input samples contribute identically to the odd-numbered output samples and with the opposite sign to the even ones.

So, sparse input sequence with only non-zero values at even positions result in both DCT-IV and DST-IV having the same values at even positions and multiplied by -1 – in the odd. Odd-only input sequences result in same values in odd positions and negatives – in the even ones.

Now this property can be combined to the previously discussed DST-IV to DCT-IV relation and extended to the two-dimensional Bayer mosaic case. We can start with the red and blue channels that have 1-in-4 non-zero pixels. If the non-zero pixels are in the even columns, then after first horizontal pass the DST-IV output will differ from the DCT-IV only by the output samples order according to (1), as even input values are the same, and odd are negated. Reversed output coefficients order for the horizontal pass means that the DST-IV will be just horizontally flipped with respect to the DCT-IV one. If the non-zero input was for the odd columns, then the DST-IV output will be reversed and negated compared to the DCT-IV one.

The second (vertical) pass is applied similarly. If original pattern had non-zero even rows, the result of vertical DST-IV would be the same as those of the DCT-IV after a vertical flip. If the odd rows were non-zero instead – the result will be vertically flipped and negated with respect of the DCT-IV one.

This result means that for the 1-in-4 Bayer mosaic array only one DCT-IV/DCT-IV transform is need. The three other DTT combinations may be obtained by flipping the DCT-IV/DCT-IV output horizontally and/or vertically and possibly negating all the data. Reversal of the readout order does not require additional hardware resources (just a few gates for a little fancier memory address counter). Data negation is also “free” as it can easily be absorbed by the phase rotator that already has adders/subtracters and just needs an extra inversion sign control. Flips do not depend on the odd/even columns/rows, only negation depends on them.

Green channel has non-zero values in either (0,0) and (1,1) or (0,1) and (1,0) positions. So it can be considered as a sum of two 1-in-4 channels described above. In the (0,0) case neither horizontal, no vertical DST-IV inverts sign, and (1,1) inverts both, so the DST-IV/DST-IV does not have any inversion, and it will stay true for the green color – combination of (0,0) and (1,1). We can not compare DCT-IV/DCT-IV with ether DST-IV/DCT-IV or DCT-IV/DST-IV, but these two will have the same sign compared to each other (zero or double negation). So the DST-IV/DST-IV will be double-flipped (both horizontally and vertically) version of the DCT-IV/DCT-IV output, and DCT-IV/DST-IV – double flipped version of DST-IV/DCT-IV, and calculation for this green pattern requires just two of the 4 DTT operations. Similarly, for the (0,1)/(1,0) green pattern DST-IV/DST-IV will be double-flipped and negated version of the DCT-IV/DCT-IV output, and DCT-IV/DST-IV – double-flipped and negated version of DST-IV/DCT-IV.

Combining results for all three color components: reg, blue and green, we need total of four DTT operations on 8×8 tiles: one for red, one for blue and 2 for green component instead of twelve if each pixel had full RGB value.

Implementation of the MCLT for the Bayer Mosaic Pixel Tiles

Figure 9. MCLT converter for the Bayer mosaic tiles

MCLT converter for the Bayer Mosaic data shown in Figure 9 is similar to the already described converter for the monochrome data. The first difference is that now the Fold sequencer addresses larger source tiles – it is run-time configurable to be 16×16, 18×18, 20×20 or 22×22 pixels. Each tile may have different size – this functionality can be used to optimize external memory access: use smaller tiles in the center areas of the sensor with lower lateral chromatic aberrations, read larger tiles in the peripheral areas with higher aberrations. The extended tile should accommodate all three color shifted 16×16 blocks as shown in Figure 10.

Figure 10. Example color channels 16×16 pixel tiles mismatch due to the chromatic aberration

The start input triggers calculation of all 3 colors, it can appear either each 256-th clock cycle or it needs to wait for at least 16 cycles after the end of the previous tile input if started asynchronously. X/Y offsets are provided individually for each color channel and they are stored in a register file inside the module, the integer top-left corner offset is provided separately for each color to simplify internal calculations.

Dual-port window ROM is the same as in the monochrome module, both horizontal and vertical window components are combined by a multiplier and then the result is multiplied by the pixel data received from the external memory buffer with configurable latency.

There are only two (instead the four for monochrome) accumulators required because each of the “folded” elements of the 8×8 DTT inputs has two contributor source pixels at most (for green color), while red and blue use just a single accumulator. Red color is processed first, then blue, and finally green. For the first ones only the DCT-IV/DCT-IV input data is prepared (in 64 clock cycles each), green color requires DST-IV/DCT-IV block additionally (128 clock cycles).

DTT output data goes to three parallel dual-port memory buffers. Red and blue channels each require just single 64-element array that later provides 4 different values in 4 clock cycles for horizontally and vertically flipped data. Green channel requires 2*64 element array. These buffers are filled one at a time (red, blue, green) and each of them feed corresponding phase rotator. Rotator outputs need 256 cycles to provide 4*64=256 FD values, the symmetry exploited during DTT conversion is lost at this stage. The three phase rotators can provide output addresses/data asynchronously or the green and blue outputs can be delayed and output simultaneously with the green channel, sharing the external memory addresses.

MCLT Converter for the Bayer Mosaic Tiles Simulation and Comparison to the Software Post-processing results

Figure 11. MCLT converter for the Bayer mosaic simulation

Results of the simulation of the MCLT converter are shown in Figure 11. Simulation ran with the clock frequency set to 100 MHz, the synthesized code should be able to run at at least 250 MHz in the Zynq SoC of the NC393 camera – at the same rate as is currently used for the data compression – all the memories and DSPs are fully buffered. Simulation show processing of the 3 tiles: two consecutive and the third one after a minimal pause. The data input was provided by the Java code that was written for the post-processing of the camera images, the same code generated intermediate results and the output of the MCLT conversion. Java used code involved double precision floating point calculations, while the RTL code is based on fixed-point calculations with the precision of the architecture DSP primitives, so there is no bit-to-bit match of the data, instead there is a difference that corresponds to the width of the used data words.


The RTL code (licensed under GNU GPLv3+) developed for the MCLT-based frequency domain conversion of the Bayer color mosaic images uses almost the same resources as the monochrome image transformation for the tiles of the same size – three times less than a full RGB image would require. Input window function modification that accounts for the two-dimensional fractional pixel shift and the post-transform phase rotators allow to avoid re-sampling for the image rectification that degrades sub-pixel resolution.

This module is simulated with Icarus Verilog (a free software simulator), results are compared to those of the software post-processing used for the 3-d scene reconstruction of multiple image sets, described in the “Long range multi-view stereo camera with 4 sensors” post.

The MCLT module for the Bayer color mosaic images is not yet tested in the FPGA of the camera – it needs more companion code to implement a full processing chain that will generate disparity space images (DSI) required for the real-time 3d scene reconstruction and/or output of the aberration-corrected image textures. But it constitutes a critical part of the Tile Processor project and brings the overall system completion closer.


[1] Thiébaut, Éric, et al. “Spatially variant PSF modeling and image deblurring.” SPIE Astronomical Telescopes+ Instrumentation. International Society for Optics and Photonics, 2016. pdf

[2] Řeřábek, M., and P. Pata. “The space variant PSF for deconvolution of wide-field astronomical images.” SPIE Astronomical Telescopes+ Instrumentation. International Society for Optics and Photonics, 2008.pdf

[3] Malvar, Henrique. “A modulated complex lapped transform and its applications to audio processing.” Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on. Vol. 3. IEEE, 1999.pdf

[4] Britanak, Vladimir, Patrick C. Yip, and Kamisetty Ramamohan Rao. Discrete cosine and sine transforms: general properties, fast algorithms and integer approximations. Academic Press, 2010.

Updated Poky to Rocko

Wed, 12/20/2017 - 17:15

We have updated the Yocto build system to Poky Rocko released back in October. Here’s a short summary table of the updates:

before after Poky 2.0 (Jethro) 2.4 (Rocko) gcc 5.3.0 7.2.0 linux kernel 4.0 4.9

Other packages got updates as well:

  • apache2-2.4.18 => apache2-2.4.29
  • php-5.6.16 => php-5.6.31
  • udev-182 changed to eudev-3.2.2, etc.

This new version is in the rocko branch for now but will be merged into master after some transition period (and the current master will be moved to jethro branch). Below are a few tips for future updates.


It was relatively simple to update the recipes and build the images. Most of the packages were updated to newer versions. One noticeable change is tmp/sysroots/ is not a common location anymore – instead there is the recipe-sysroot/ directory in each recipe’s build directory. Also, our headers didn’t get to the sysroots which got fixed by having:
sysroot_stage_all_append() {
    sysroot_stage_dir ${WORKDIR}/headers/include ${SYSROOT_DESTDIR}/usr/include-uapi
We had this task in Jethro but there was another variable used ${STAGING_DIR_TARGET} (instead of ${SYSROOT_DESTDIR})

Linux kernel

It took most of the time to update it.

Luckily our drivers didn’t break but there were a few problems:

    1. Serial output

First, we merged our drivers and patches to make the kernel get built at least. After building the kernel the serial output didn’t work – the fix was to switch to earlycon in bootargs in the device tree (in 4.0 ealyprintk still worked):
chosen {
  bootargs = "cma=128M console=ttyPS0,115200 root=/dev/mmcblk0p2 rw earlyprintk rootwait rootfstype=ext4";
  linux,stdout-path = "/amba@0/serial@e0000000";
chosen {
  bootargs = "earlycon cma=128M root=/dev/mmcblk0p2 rw rootwait rootfstype=ext4";
  stdout-path = "serial0:115200n8";

    2. Ethernet

Next, there was no ethernet. After some time of debugging the problem was with the device tree again:
phy3: phy@3 {
  compatible = "atheros,8035";
  device_type = "ethernet-phy";
  reg = <0x3>;
phy3: phy@3 {
  /* Atheros 8035 */
  compatible = "ethernet-phy-id004d.d072";
  /* compatible = "ethernet-phy-ieee802.3-c22";*/
  device_type = "ethernet-phy";
  reg = <0x3>;
The compatible field has to be in a certain format. See the docs.
The next thing has Xilinx switched from xlnx,ps7-ethernet-1.00.a to cdns,zynq-gem (see arch/arm/boot/dts/zynq-7000.dtsi). Didn’t track when – both drivers work. The cdns,zynq-gem is in the mainline, the other is not.
And the last change to the ethernet driver was adding the fixup for Atheros 8035, the network chip we use.

    3. Bitstream

Then the bitstream wouldn’t get programmed. It was Xilinx’s problem – new gcc was opting away some of the variables (dma_done) in the driver – they fixed the kernel but didn’t change the hash of the linux-xlnx in the recipe yet ( So, it’s probably coming to 2017.4.

    4. Other minor things

Xilinx has changed some of its default config variables and in 4.9 there are a few new dependencies here and there.
The changes made to our config (elphel393.cfg) which is added to the default config for zynq devices:
# This option is for FPGA part
# prints time before messages
# dependency for DYNAMIC_DEBUG=y
# turned off because old:



udev package switched from udev-182 changed to eudev-3.2.2. udev-182 can copy static device files from /lib/udev/devices/, eudev-3.2.2 does not do this – created a new rule:
ACTION=="add", RUN+="/usr/bin/rsync -a /lib/udev/devices/ /dev/"
This rule adds up ~6 secs to boot time for some reason vs almost nothing if run from the camera init script

Other notes
  • Reminder: the system boots into initramfs first and runs init built by initramfs-live-boot. The script runs switch_root in the end. udev daemon gets killed and restarted

Assembling Long Range Stereo Camera

Wed, 12/13/2017 - 23:06

MNC393-XCAM partial assembly and parts

The long anticipated parts for the Long range camera have arrived!

The mechanical parts for the MNC393-XCAM – Long Range Multi-view Stereo Camera are machined, tested, and ready to be anodized. This enables us to have the X-camera assembled before the winter holidays. The holiday break will provide a good opportunity to test the camera, capture new photos, and create robust 3D models from calibrated images.
The titanium X-frame of the camera ensures thermal stability required for continuous accuracy of 3D measurements. The aluminum enclosure and sealed lens filters weatherproof the system allowing for the proposed outdoor use of the camera.
We intend to assemble two cameras: one with a 150 mm distance between the sensors and another with a longer baseline. The expected accuracy for the camera with the shorter baseline is greater than 10% at a 200 meter distance. We have achieved 10% accuracy with H-camera with calibrated sensors, even though the 3D-printed parts were not thermally stable and some error was accumulated over time. It was a very pleasant surprise that the software was still able to deal with somewhat un-calibrated images and detect distances very accurately, creating impressive 3D-scenes: Scene_viewer The second camera will have a 280 mm distance between sensors, which is determined by the longest FPC cables we can use without signal losses. It promises to double the measured distance with the same degree of accuracy, therefore an extremely long range 3D-scenes will be produced.

The Long Range Multi-View Stereo Camera with 4 sensors MNC393-XCAM is planned for release in early 2018.

Drupal, WordPress, Mediawiki, Mail Archive, Gitlab – all in one web site

Sun, 12/03/2017 - 23:23
Multiple Subdomains for the Same Web Site

It is a common case when a company or organization uses multiple content management systems (CMS) and specialized web application to organize its web presence. We describe here how Elphel handles such CMS variety and provide the source code that can be customized for other similar sites.

We currently use the following CMS and web applications:

  • Drupal as a general purpose CMS for the main site
  • WordPress for the development blogs
  • Mediawiki for the wiki-based documentation.
  • Mailman (self hosted) and Mail Archive (external site) for the mailing list that is our main channel of the user technical support
  • Gitlab CE for the code and other version-controlled content. We used Github but switched to self-hosted Gitlab CE following FSF recommendations
  • Other customized versions of FLOSS web applications, such as OSTicket for support tickets and FrontAccounting for inventory and production
  • In-house developed free software web applications, such as x3dom-based 3D scene and map viewer, 3D mechanical assembly viewer available for the assemblies and mechanical components on the wiki, WebGL Panorama Viewer/Editor.

Figure 1. Elphel web site on a desktop screen

When users visit company web site they expect to have well organized and logically solid navigation to get the information they came for. There are multiple ways how to combine several applications – for example, it is possible to add wiki to WordPress or use it with Gitlab repositories. Of course it is theoretically possible to develop a web site where all the information in available in these applications will be properly cross-linked and stay current. It may also be possible to customize one of the CMS to handle all the needs, but that would require significant web development resources. For a small company following that path the required number of web developers may easily exceed the number of hardware, FPGA and software developers.

On the other hand the small company size should not be an excuse – why the web site visitors can not find the information they are looking for. For us it was a very clear signal when our long time users asked if we can email them mechanical drawings of the camera boards and components. For us that meant a bad web site usability – for years we have this information for some 900+ standard and machined parts available in our wiki: DXF and PDF for 2-d drawings, STEP files for 3d in our wiki: 393 series and previous 353 series. Camera mechanical assemblies can be navigated with the x3dom-based viewer.

Improving Navigation of the Composite Web Site

As we were not going to give up simultaneous use of multiple content management systems (we needed functionality of each of them) we were looking for the following improvements of information accessibility:

  • to have a common search over all subdomains always available (looking glass icon in the top right corner)
  • as we can not cross-link properly all the information, then at least we have to communicate the idea that there are multiple subdomains to our visitors.

Figure 2. Site search for ‘ERS’ (acronym for Electronic Rolling Shutter)

We did try to launch simultaneous search over the main (Drupal) site, wiki, blog (WordPress) and Mail Archive using site-specific requests, but that was available only on the main site, and for Mail Archive we could not add it as it is hosted on the external site. An obvious solution to have site-wide search always accessible, including external sites was to use iframe elements. Figure 2 shows results of simultaneous search for “ERS”, each linking to the corresponding subdomain tab.

Figure 3. Elphel web site on a small screen

Immitating Multiple Application Windows

As we have to use iframe elements for multi-site search we can use the framed layout to tell visitors that instead of a single web site with a consistent navigation there are in fact multiple subdomains, each having different structure and navigation. Instead of a single browser window or tab we are presenting the web site as a stack of individual windows on a desktop as shown on Figure 1. Most web sites do not use full width of the modern computer desktop and have useful content width of just 800-1100 pixels – that gives some room to show parts of the subdomain pages that are out of focus in our case. For the not as wide screens (or browser windows) such as on mobile devices, the page layout is different (Figure 3) – subdomains that are currently out of focus are shown as colored horizontal bars. Navigation is rather intuitive – mouse click on the background tab brings it to focus, following links that lead to the other subdomain (e.g. from the blog tab to wiki) brings the corresponding page to focus and shows linked page there, avoiding the case when there will be two wiki tabs and no blog ones. External links open in new browser tabs – that also helps to keep multi-site structure intact.

Each subdomain page title (Main, Blog, Wiki/Docs…) is a link to the content shown in the current iframe element. Following that link or using “Copy Link Location” from a context menu opens just that subdomain page alone or copies its address to use it as a bookmark or send over email or post in social media.

Redirection and Canonical URLs

Next step after selecting the overall page layout was to decide – how should we show the framed version to the visitors and what URLs to have as canonical for different pages? At that point we had multiple subdomains: for the main site (Drupal), for the wiki (Mediawiki), for the blogs (WordPress). The default domain was reserved for the top page to open individual subdomains in the iframe elements.

One way to deal with it was to permanently redirect* to*,* to*, and so on, and then to rely on to process the request and open subdomains in respective iframe elements. That would always load all the stack of the pages and may be both intrusive and annoying – visitors may be already aware of the structure of the web site and want just the specific blog post or wiki page. And as iframes are routinely used for serving ads, their functionality may be disabled in the browser of a visitor.

We use the original URLs as canonical (e.g.*, not*), so search engines will return them as the results, and then redirect (with temporary 302) requests depending on the referrer. If referrer (HTTP_REFERER) is provided in the request and it does not belong to Elphel domain – open the framed version with focus on the tab with the requested URL. Otherwise, with no referrer (directly entered, received via email, disabled for privacy reasons) or when opening the link from other Elphel page (such as already framed version) the page is opened alone.

Below is the relevant portion of the Apache 2 .htaccess file for the (there are also WordPress-specific rules and enforcing SSL access (use of https instead of insecure http protocol). Lines 5..7 are to disable redirection for certain groups of addresses. Line 2 discloses the fact (by adding “Vary: Referer” to the headers) that response depends on the referrer page.

#Tell search engines we do not cloak Header always add Vary: Referer RewriteCond %{REQUEST_FILENAME} !-f RewriteCond "%{HTTP_REFERER}" "!^((.*)elphel\.com(.*)|)$" RewriteCond %{REQUEST_URI} !category.*feed RewriteCond %{REQUEST_URI} !^/[0-9]+\..+\.cpaneldcv$ <span style="color:#999999">RewriteCond %{REQUEST_URI} !^/\.well-known/pki-validation/[A-F0-9]{32}\.txt(?:\ Comodo\ DCV)?$</span> RewriteRule ^(.*)${REQUEST_URI} [L,R=302]

Such redirection is obviously impossible for external sites as Mail Archive and we did not use it for – in both cases if visitors followed those links from the search engines results, they would not expect such pages to be parts of the well cross-linked company web site.


Source code: iframed

The main frame is a single page with iframe elements. The iframes’ addresses are hard coded in the index.php.
To avoid, where possible, displaying the same content in multiple frames window.postMessage() method is used. Link click events in an iframe are blocked and the link data is passed to the main frame via postMessage for analysis following focusing on an iframe with a matching domain and refreshing the links or opening a new tab for external websites.

The following code was added to each controlled subdomain (see elphel_messenger.js):

<script src=""></script> <script> ElphelMessenger.init(); </script>

Usability ans SEO

We are trying to improve usability of the company web site that effectively is a collection of subdomains each running different CMS without major investments into the web design. At the same time we were trying to avoid mistakes in SEO (were we do not have much experience) that would make Elphel invisible for the search engines. Our SEO concerns were caused by the following reasons:

  • the top (almost like a “frameset” in older HTML days) page itself has very little content – most is provided in the included iframe elements
  • the served content depends on the referrer address – that might be considered as “cloaking.”

Knowing nothing how algorithms are distinguishing between cloaking and legitimate referrer redirects (such as visitors geographical location) we assumed that cloaking shows indexing bots some information that is not served to the human visitors. Our case is different – visitors get either exactly the same information as the bots, or the same one plus additional. We indicated that the served content depends on the referrer URL in the response header (so smart bot can try variable values for “referer”). Testing the new web site for half year did not show any negative effect on the web site visibility in the search engines.

But the main question still remains – did we reach our goal to improve web site usability?

  • Were we able to communicate the idea that the site consists of multiple loosely-connected subdomains with different navigation?
  • Is the framed site navigation intuitive or annoying?
  • Does the combined search over multiple subdomains do its job and does it behave as users expect?

Only our visitors can tell us this, user comments are always welcome. And while we do not have a one-click distribution to use this solution for other similar sites (some customization will be needed) our code is available in the Git repository iframed.

Developing with Eclipse CDT and Yocto – Linux kernel and applications

Wed, 11/22/2017 - 17:32

Elphel uses embedded GNU/Linux distribution based on Yocto. For most of our development (excluding just mechanical and PCB design) we use universal Eclipse IDE: for FPGA development, Linux kernel drivers development, embedded applications and web applications, for editing LaTeX texts. And we use this popular IDE for delivering pre-configured projects to our users to make it easier for them to start efficient modification of the initial camera software and then initiate the new projects.

We tried to use Yocto plugin but were not able to configure it for kernel development, and the kernel drivers development is one of the the largest and probably is the most difficult part of the camera software development, the part were we need Eclipse IDE assistance most.

One of the major challenges of using code analysis tools of Eclipse CDT with the Linux kernel is that there are so many files that define the same names. These files are selected during the build process, and for correct code analysis Eclipse CDT has to reproduce rather complex Linux system of configuration, multi-level macro defines to resolve references in the source code. We were able to solve this problem (to some extent), but it required a fair amount of manual tweaking and was not universal – developing applications would require different modifications.

Figure 1. Excluded source directories in the Navigator (left) panel are crossed out.

At the same time all the software components in the distribution are built with the powerful Bitbake build system, and existing ”recipes” and the invoked Makefiles “know” which files to use. Following DRY (“don’t repeat yourself”) principle we removed references to “make” command in the project build and replaced them with bitbake command “bitbake <target> -c compile -f” running in the Eclipse console. As we did that for the main build command (CDT Builder) the console output is parsed for errors and warnings, results appear as problem markers in “Problems” view and in the source code. To help Eclipse CDT (and users who navigate the source code with it) limit attention to only files and directories that are actually used in the bitbake build process we implemented the following trick (and coded it as script):

  • Initialize source/headers directories with bitbake, so it “knows” that everything needs to be rebuilt for the project
  • Create a list of the source files (resolving symlinks when needed) and “touch” them, setting modification timestamps. This action prepares the files so the next (first after modification) file access will be recorded as access timestamp. Record the current time.
  • Wait a few seconds to reliably distinguish if each file was accessed after modification
  • Run bitbake build (“bitbake <target> -c compile -f”)
  • Scan all the files from the previously created source list and generate ”include_list” of those that were accessed during the build process.
  • As CDT accepts only “exclude” filters in this context, recursively combine full source list and include_list to generate “exclude_list” pruning all the branches that have nothing to include and replacing them with the full branch reference
  • Apply the generated exclusion list to the CDT project file “.cproject”

In the case of such complex system as Linux kernel there still remain several incorrectly resolved links, because different parts of the code may use different header files that share the same names, but there are not many of them, they can be handled manually.

Elphel wiki page Eclipse_CDT_projects_with_bitbake has more details on using Elphel camera projects with Eclipse IDE.

Natural environments in 3D with Elphel camera and Blender

Mon, 09/25/2017 - 13:44

Setting 3D camera on the rock at Cape Alava

Testing 3D camera on a road trip

In August of 2017, my family and I went on a trip to the Pacific Northwest, partially for a much needed vacation, but equally as importantly, to test my dad’s new 3D camera. My dad had been designing calibrated multi-sensor cameras for as long as I can remember, and since February was working determinately on developing principally new algorithms for reconstructing a 3D model from a set of 4 simultaneously taken photographs . Now that the camera and the software were ready, there was no better time to test it.

The main portion of our trip was spent at the Olympic Peninsula in Washington. The gorgeous and complex natural setting was perfect for testing the abilities and limits of the camera. The camera lenses are arranged in a square configuration, each lens the same distance apart from the other. Such configuration is mimicking the position of human eyes, while adding the vertical pair helps measure distances to horizontal objects as well as vertical ones. Our depth perception, comes entirely from our brain’s ability to combine together two (from each eye) images to create a 3D space within our mind. The camera operates very similarly, using parallax, the technique that has been used throughout time to calculate distances using fairly simple geometry. Elphel 3D camera takes four individual images, which is more precise than two, and our software program calculates the distances to each object in the scene, combining data from 4 photographs and creates 3D model of the scene. You can explore our models with Elphel 3D model viewer: Also the models can be opened with 3D modeling program, such as Blender, and as a result you appear to be standing in a 3D realm, experiencing the environment, in a whole new sense .

screenshot of the 3D model of ocean waves

Although the camera is meant to be a long-range photogrammetric camera, capable of accurately measuring distances at 200 meters and farther, we were fascinated with the idea of creating realistic 3D environments where we can virtually walk through. Throughout most of the testing we chose locations which would include the complex organic forms of nature, as to test the camera’s ability to work with finer details and non-geometric forms . Naturally, we shot parts of the Olympic National Park rain forest, with very exciting results. We also photographed the ocean waves crashing onto the shores rocks, and many more natural beauties. While my brother and I were taking the photographs, many times we would have to scale rocks, laptop and tripod in hand, in order to get the proper location. The process could get tedious, but at times was oddly exciting. At one point, we wanted to get an up-close shot of the waves just as they were arching over the rocks, but to do so, I had to act as a shield in case the water got too close to the camera, ready to leap in front of it at any time. Another time, my brother packed the camera into his backpack and biked to the most northwestern point on the US mainland, Cape Flattery, and took a series of images there. My dad had also used the images we had taken as the test data to find and fix bugs in his software.

All in all, the experience was really helpful, the vacation was rejuvenating, and the results were astounding, and it gave me the feeling that my family and I were creating something of the future. I don’t really know how this whole idea wouldn’t be considered exciting. Imagine, if you could just take a picture, and have it turned into a 3D space. Not only does the idea itself seem like the fantastic inventions of any quality science fiction (aka super cool), but it’s application to the real world are endless.

Working with Elphel models in Blender

We have selected our favorite scenes to create a virtual path through the Olympic rain forest. Each model can be generated with a specified level of details, usually 500 meshes is enough for many scenes, however the rain forest looks more realistic when it is created with 2000 meshes.

$("#forest01").player(1); Tutorial

The procedure for importing and combining 3D model files in Blender, a Free software for 3D modeling and animation, is fairly simple.
Elphel example 3D meshes can be downloaded from by opening the desired scene, pressing the download icon (↓) and extracting archive into directory on your computer.

3D scene download

Download Blender form this link:, and follow the installation instructions available for Linux, Windows, and Mac.
Once Blender has been opened, the scene must be cleared. This is done by pressing “a” until everything is highlighted and then pressing “x”

Clear scene

The 3D model file can then be opened in Blender by selecting the .obj option in the import menu and then selecting the downloaded *.obj file.

Import menu in Blender

Follow these instructions if the computer struggles with graphics:
On right side of the window you will find the modifier options on the properties panel. Select the decimate modifier and set a ratio that works for your computer, this ratio will reduce the number of triangles but might also seriously warp textures. Do not press the apply button as viewport performance has already been improved.

Decimate modifier reduces the number of triangles

You will find that the imported mesh is gray; to view the textures change the viewport render mode to textured at the bottom of the 3d viewport (this may take a while on slower computers).

Texture mode

Now that textures are enabled, simulated lighting must be disabled. To do this hover mouse over the viewport and press “n” to open the properties region. Under the display tab check the “shadeless” box (this option is only available if the viewport render mode is set to textured).

Imported mesh with textures

The procedure can be repeated to import more models and manipulate them in Blender creating panoramic view of the city streets, path in the forest and other realistic 3D environments.

Long range multi-view stereo camera with 4 sensors

Wed, 09/20/2017 - 23:40

Figure 1. Four sensor stereo camera model

Four-camera stereo rig prototype is capable of measuring distances thousands times exceeding the camera baseline over wide (60 by 45 degrees) field of view. With 150 mm distance between lenses it provides ranging data at 200 meters with 10% accuracy, production units will have higher accuracy. Initial implementation uses software post-processing, but the core part of the software (tile processor) is designed as FPGA simulation and will be moved to the actual FPGA of the camera for the real time applications.

Scroll down or just hyper-jump to Scene viewer for the links to see example images and reconstructed scenes.


Most modern advances in the area of the visual 3d reconstruction are related to structure from motion (SfM) where high quality models are generated from the image sequences, including those from the uncalibrated cameras (such as cellphone ones). Another fast growing applications depend on the active ranging with either LIDAR scanning technology or time of flight (ToF) sensors.

Each of these methods has its limitations and while widespread smart phone cameras attracted most of the interest in the algorithms and software development, there are some applications where the narrow baseline (distance between the sensors is much smaller, than the distance to the objects) technology has advantages.

Such applications include autonomous vehicle navigation where other objects in the scene are moving and 3-d data is needed immediately (not when the complete image sequence is captured), and the elements to be ranged are ahead of the vehicle so previous images would not help much. ToF sensors are still very limited in range (few meters) and the scanning LIDAR systems are either slow to update or have very limited field of view. Passive (visual only) ranging may be desired for military applications where the system should stay invisible by not shining lasers around.

Technology snippets Narrow baseline and subpixel resolution

The main challenge for the narrow baseline systems is that the distance resolution is much worse than the lateral one. The minimal resolved 3d element, voxel is very far from resembling a cube (as 2d pixels are usually squares) – with the dimensions we use: pixel size – 0.0022 mm, lens focal length f = 4.5 mm and the baseline of 150 mm such voxel at 100 m distance is 50 mm high by 50 mm wide and 32 meters deep. The good thing is that while the lateral resolution generally is just one pixel (can be better only with additional knowledge about the object), the depth resolution can be improved with reasonable assumptions by an order of magnitude by using subpixel resolution. It is possible when there are multiple shifted images of the same object (that for such high range to baseline ratio can safely be assumed fronto-parallel) and every object is presented in each image by multiple pixels. With 0.1 pixel resolution in disparity (or shift between the two images) the depth dimension of the voxel at 100 m distance is 3.2 meters. And as we need multiple pixel objects for the subpixel disparity resolution, the voxel lateral dimensions increase (there is a way to restore the lateral resolution to a single pixel in most cases). With fixed-width window for the image matching we use 8×8 pixel grid (16×16 pixel overlapping tiles) similar to what is used by some image/video compression algorithms (such as JPEG) the voxel dimensions at 100 meter range become 0.4 m x 0.4 m x 3.2 m. Still not a cube, but the difference is significantly less dramatic.

Subpixel accuracy and the lens distortions

Matching images with subpixel accuracy requires that lens optical distortion of each lens is known and compensated with the same or better precision. Most popular way to present lens distortions is to use radial distortion model where relation of distorted and ideal pin-hole camera image is expressed as polynomial of point radius, so in polar coordinates the angle stays the same while the radius changes. Fisheye lenses are better described with “f-theta” model, where linear radial distance in the focal plane corresponds to the angle between the lens axis and ray to the object.

Such radial models provide accurate results only with ideal lens elements and when such elements are assembled so that the axis of each individual lens element precisely matches axes of the other elements – both in position and orientation. In the real lenses each optical element has minor misalignment, and that limits the radial model. For the lenses we had dealt with and with 5MPix sensors it was possible to get down to 0.2 – 0.3 pixels, so we supplemented the radial distortion described by the analytical formula with the table-based residual image correction. Such correction reduced the minimal resolved disparity to 0.05 – 0.08 pixels.

Fixed vs. variable window image matching and FPGA

Modern multi-view stereo systems that work with wide baselines use elaborate algorithms with variable size windows when matching image pairs, down to single pixels. They aggregate data from the neighbor pixels at later processing stages, that allows them to handle occlusions and perspective distortions that make paired images different. With the narrow baseline system, ranging objects at distances that are hundreds to thousands times larger than the baseline, the difference in perspective distortions of the images is almost always very small. And as the only way to get subpixel resolution requires matching of many pixels at once anyway, use of the fixed size image tiles instead of the individual pixels does not reduce flexibility of the algorithm much.

Processing of the fixed-size image tiles promises significant advantage – hardware-accelerated pixel-level tile processing combined with the higher level software that operates with the per-tile data rather than with per-pixel one. Tile processing can be implemented within the FPGA-friendly stream processing paradigm leaving decision making to the software. Matching image tiles may be implemented using methods similar to those used for image and especially video compression where motion vector estimation is similar to calculation of the disparity between the stereo images and similar algorithms may be used, such as phase-only correlation (PoC).

Two dimensional array vs. binocular and inline camera rigs

Usually stereo cameras or fixed baseline multi-view stereo are binocular systems, with just two sensors. Less common systems have more than two lenses positioned along the same line. Such configurations improve the useful camera range (ability to measure near and far objects) and reduce ambiguity when dealing with periodic object structures. Even less common are the rigs where the individual cameras form a 2d structure.

In this project we used a camera with 4 sensors located in the corners of a square, so they are not co-linear. Correlation-based matching of the images depends on the detailed texture in the matched areas of the images – perfectly uniform objects produce no data for depth estimation. Additionally some common types of image details may be unsuitable for certain orientations of the camera baselines. Vertical concrete pole can be easily correlated by the two horizontally positioned cameras, but if the baseline is turned vertical, the same binocular camera rig would fail to produce disparity value. Similar is true when trying to capture horizontal features with the horizontal binocular system – such predominantly horizontal features are common when viewing near flat horizontal surfaces at high angles of incidents (almost parallel to the view direction).

With four cameras we process four image pairs – 2 horizontal (top and bottom) and 2 vertical (right and left), and depending on the application requirements for particular image region it is possible to combine correlation results of all 4 pairs, or just horizontal and vertical separately. When all 4 baselines have equal length it is easier to combine image data before calculating the precise location of the correlation maximums – 2 pairs can be combined directly, and the 2 others after rotating tiles by 90 degrees (swapping X and Y directions, transposing the tiles 2d arrays).

Image rectification and resampling

Many implementations of the multi-view stereo processing start with the image rectification that involves correction for the perspective and lens distortions, projection of the individual images to the common plane. Such projection simplifies image tiles matching by correlation, but as it involves resampling of the images, it either reduces resolution or requires upsampling and so increases required memory size and processing complexity.

This implementation does not require full de-warping of the images and related resampling with fractional pixel shifts. Instead we split geometric distortion of each lens into two parts:

  • common (average) distortion of all four lenses approximated by analytical radial distortion model, and
  • small residual deviation of each lens image transformation from the common distortion model

Common radial distortion parameters are used to calculate matching tile location in each image, and while integer rounded pixel shifts of the tile centers are used directly when selecting input pixel windows, the fractional pixel remainders are preserved and combined with the other image shifts in the FPGA tile processor. Matching of the images is performed in this common distorted space, the tile grid is also mapped to this presentation, not to the fully rectified rectilinear image.

Small individual lens deviations from the common distortion model are smooth 2-d functions over the 2-d image plane, they are interpolated from the calibration data stored for the lower resolution grid.

We use low distortion sorted lenses with matching focal lengths to make sure that the scale mismatch between the image tiles is less than tile size in the target subpixel intervals (0.1 pix). Low distortion requirement extends the distances range to the near objects, because with the higher disparity values matching tiles in the different images land to the differently distorted areas. Focal length matching allows to use modulated complex lapped transform (CLT) that similar to discrete Fourier transform (DFT) is invariant to shifts, but not to scaling (log-polar coordinates are not applicable here, as such transformation would deny shift invariance).

Enhancing images by correcting optical aberrations with space-variant deconvolution

Matching of the images acquired with the almost identical lenses is rather insensitive to the lens aberrations that degrade image quality (mostly reduce sharpness), especially in the peripheral image areas. Aberration correction is still needed to get sharp textures in the result 3d models over full field of view, the resolution of the modern sensors is usually better than what lenses can provide. Correction can be implemented with space-variant (different kernels for different areas of the image) deconvolution, we routinely use it for post-processing of Eyesis4π images. The DCT-based implementation is described in the earlier blog post.

Space-variant deconvolution kernels can absorb (be combined with during calibration processing) the individual lens deviations from the common distortion model, described above. Aberration correction and image rectification to the common image space can be performed simultaneously using the same processing resources.

Two dimensional vs. single dimensional matching along the epipolar lines

Common approach for matching image pairs is to replace the two-dimensional correlation with a single-dimensional task by correlating pixels along the epipolar lines that are just horizontal lines for horizontally built binocular systems with the parallel optical axes. Aggregation of the correlation maximums locations between the neighbor parallel lines of pixels is preformed in the image pixels domain after each line is processed separately.

For tile-based processing it is beneficial to perform a full 2-d correlation as the phase correlation is performed in the frequency domain, and after the pointwise multiplication during aberration correction the image tiles are already available in the 2d frequency domain. Two dimensional correlation implies aggregation of data from multiple scan lines, it can tolerate (and be used to correct) small lens misalignments, with appropriate filtering it can be used to detect (and match) linear features.

Implementation Prototype camera

Experimental camera looks similar to Elphel regular H-camera – we just incorporated different sensor front ends (3d CAD model) that are used in Eyesis4π and added adjustment screws to align optical axes of the lenses (heading and tilt) and orientations of the image sensors (roll). Sensors are 5 Mpix 1/2″ format On Semiconductor MT9P006, lenses – Evetar N125B04530W.

We selected lenses with the same focal length within 1%, and calibrated the camera using our standard camera rotation machine and the target pattern. As we do not yet have production adjustment equipment and software, the adjustment took several iterations: calibrating the camera and measuring extrinsic parameters of each sensor front end, then rotating each of the adjustment screws according to spreadsheet-calculated values, and then re-running the whole calibration process again. Finally the calibration results: radial distortion parameters, SFE extrinsic parameters, vignetting and deconvolution kernels were converted to the form suitable for run-time application (now – during post-processing of the captured images).

Figure 2. Camera block diagram

This prototype still uses 3d-printed parts and such mounts proved to be not stable enough, so we had to add field calibration and write code for bundle adjustment of the individual imagers orientations from the 2-d correlation data for each of the 4 individual pairs.

Camera performance depends on the actual mechanical stability, software-compensation can only partially mitigate this misalignment problem and the precision of the distance measurements was reduced when cameras went off by more than 20 pixels after being carried in a backpack. Nevertheless the scene reconstruction remained possible.


Multi-view stereo rigs are capable of capturing dynamic scenes so our goal is to make a real-time system with most of the heavy-weight processing be done in the FPGA.

One of the major challenges here is how to combine parallel and stream processing capabilities of the FPGA with the flexibility of the software needed for implementation of the advanced 3d reconstruction algorithms. This approach is to use the FPGA-based tile processor to perform uniform operations on the lists of “tiles” – fixed square overlapping windows in the images. FPGA processes tile data at the pixel level, while software operates the whole tiles.

Figure 2 shows the overall block diagram of the camera, Figure 3 illustrates details of the tile processor.

Figure 3. FPGA tile processor

Initial implementation does not contain actual FPGA processing, so far we only tested in FPGA some of the core functions – two dimensional 8×8 DCT-IV needed for both 16×16 CLT and ICLT. Current code consists of the two separate parts – one part (tile processor) simulates what will be moved to the FPGA (it handles image tiles at the pixel level), and the other one is what will remain software – it operates on the tile level and does not deal with the individual pixels. These two parts interact using shared system memory, tile processor has exclusive access to the dedicated image buffer and calibration data.

Each tile is 16×16 pixels square with 8 pixel overlap, software prepares tile list including:

  • tile center X,Y (for the virtual “center” image),
  • center disparity, so the each of the 4 image tiles will be shifted accordingly, and
  • the code of operation(s) to be performed on that tile.

Figure 4. Correlation processor

Tile processor performs all or some (depending on the tile operation codes) of the following operations:

  • Reads the tile tasks from the shared system memory.
  • Calculates locations and loads image and calibration data from the external image buffer memory (using on-chip memory to cache data as the overlapping nature of the tiles makes each pixel to participate on average in 4 neighbor tiles).
  • Converts tiles to frequency domain using CLT based on 2d DCT-IV and DST-IV.
  • Performs aberration correction in the frequency domain by pointwise multiplication by the calibration kernels.
  • Calculates correlation-related data (Figure 4) for the tile pairs, resulting in tile disparity and disparity confidence values for all pairs combined, and/or more specific correlation types by pointwise multiplication, inverse CLT to the pixel domain, filtering and local maximums extraction by quadratic interpolation or windowed center of mass calculation.
  • Calculates combined texture for the tile (Figure 5), using alpha channel to mask out pixels that do not match – this is the way how to effectively restore single-pixel lateral resolution after aggregating individual pixels to tiles. Textures can be combined after only programmed shifts according to specified disparity, or use additional shift calculated in the correlation module.
  • Calculates other integral values for the tiles (Figure 5), such as per-channel number of mismatched pixels – such data can be used for quick second-level (using tiles instead of pixels) correlation runs to determine which 3d volumes potentially have objects and so need regular (pixel-level) matching.
  • Finally tile processor saves results: correlation values and/or texture tile to the shared system memory, so software can access this data.

Figure 5. Texture processor

Single tile processor operation deals with the scene objects that would be projected to this tile’s 16×16 pixels square on the sensor of the virtual camera located in the center between the four actual physical cameras. The single pass over the tile data is limited not just laterally, but in depth also because for the tiles to correlate they have to have significant overlap. 50% overlap corresponds to the correlation offset range of ±8 pixels, better correlation contrast needs 75% overlap or ±4 pixels. The tile processor “probes” not all the voxels that project to the same 16×16 window of the virtual image, but only those that belong to the certain distance range – the distances that correspond to the disparities ±4 pixels from the value provided for the tile.

That means that a single processing pass over a tile captures data in a disparity space volume, or a macro-voxel of 8 pixels wide by 8 pixels high by 8 pixels deep (considering the central part of the overlapping volumes). And capturing the whole scene may require multiple passes for the same tile with different disparity. There are ways how to avoid full range disparity sweep (with 8 pixel increments) for all tiles – following surfaces and detecting occlusions and discontinuities, second-level correlation of tiles instead of the individual pixels.

Another reason for the multi-pass processing of the same tile is to refine the disparity measured by correlation. When dealing with subpixel coordinates of the correlation maximums – either located by quadratic approximation or by some form of center of mass evaluation, the calculated values may have bias and disparity histograms reveal modulation with the pixel period. Second “refine” pass, where individual tiles are shifted by the disparity measured in the previous pass reduces the residual offset of the correlation maximum to a fraction of a pixel and mitigates this type of bias. Tile shift here means a combination of the integer pixel shift of the source images and the fractional (in the ±0.5 pixel range) shift that is performed in the frequency domain by multiplication by the cosine/sine phase rotator.

Total processing time and/or required FPGA resources linearly depend on the number of required tile processor operations and the software may use several methods to reduce this number. In addition to the two approaches mentioned above (following surfaces and second-level correlation) it may be possible to reduce the field of view to a smaller area of interest, predict current frame scene from the previous frames (as in 2d video compression) – tile processor paradigm preserves flexibility of the various algorithms that may be used in the scene 3d reconstruction software stack.

Scene viewer

The viewer for the reconstructed scenes is here: (viewer source code).

Figure 6. 3d+map index page

Index page shows a map (you may select from several providers) with the markers for the locations of the captured scenes. On the left there is a vertical ribbon of the thumbnails – you may scroll it with a mouse wheel or by dragging.

Thumbnails are shown only for the markers that fit on screen, so zooming in on the map may reduce number of the visible thumbnails. When you select some thumbnail, the corresponding marker opens on the map, and one or several scenes are shown – one line per each scene (identified by the Unix timestamp code with fractional seconds) captured at the same locations.

The scene that matches the selected thumbnail is highlighted (as 4-th line in the Figure 6). Some scenes have different versions of reconstruction from the same source images – they are listed in the same line (like first line in the Figure 6). Links lead to the viewers of the selected scene/version.

Figure 7. Selection of the map / satellite imagery provider

We do not have ground truth models for the captured scenes build with the active scanners. Instead as the most interesting is ranging of the distant objects (hundreds of meters) it is possible to use publicly available satellite imagery and match it to the captured models. We had ideal view from Elphel office window – each crack on the pavement was visible in the satellite images so we could match them with the 3d model of the scene. Unfortunately they ruined it recently by replacing asphalt :-).

The scene viewer combines x3dom representation of the 3d scene and the re-sizable overlapping map view. You may switch the map imagery provider by clicking on the map icon as shown in the Figure 7.

The scene and map views are synchronized to each other, there are several ways of navigation in either 3d or map area:

  • drag the 3d view to rotate virtual camera without moving;
  • move cross-hair ⌖ icon in the map view to rotate camera around vertical axis;
  • toggle ⇅ button and adjust camera view elevation;
  • use scroll wheel over the 3d area to change camera zoom (field of view is indicated on the map);
  • drag with middle button pressed in the 3d view to move camera perpendicular to the view direction;
  • drag the the camera icon (green circle) on the map to move camera horizontally;
  • toggle ⇅ button and move the camera vertically;
  • press a hotkey t over the 3d area to reset to the initial view: set azimuth and elevation same as captured;
  • press a hotkey r over the 3d area to set view azimuth as captured, elevation equal to zero (horizontal view).

Figure 8. 3D model to map comparison

Comparison of the 3d scene model and the map uses ball markers. By default these markers are one meter in diameter, the size can be changed on the settings () page.

Moving pointer over the 3d area with Ctrl key pressed causes the ball to follow the cursor at a distance where the view line intersects the nearest detected surface in the scene. It simultaneously moves the corresponding marker over the map view and indicates the measured distance.

Ctrl-click places the ball marker on the 3d scene and on the map. It is then possible to drag the marker over the map and read the ground truth distance. Dragging the marker over the 3d scene updates location on the map, but not the other way around, in edit mode mismatch data is used to adjust the captured scene location and orientation.

Program settings used during reconstruction limit the scene far distance to z = 1000 meters, all more distant objects are considered to be located at infinity. X3d allows to use images at infinity using backdrop element, but it is not flexible enough and is not supported by some other programs. In most models we place infinity textures to a large billboard at z = 10,000 meters, and it is where the ball marker will appear if placed on the sky or other far objects.

Figure 9. Settings and link to four images

The settings page () shown in the Figure 9 has a link to the four-image viewer (Figure 10). These four images correspond to the captured views and are almost “raw images” used for scene reconstruction. These images were subject to the optical aberration correction and are partially rectified – they are rendered as if they were captured by the same camera that has only strictly polynomial radial distortion.

Such images are not actually used in the reconstruction process, they are rendered only for the debug and demonstration purposes. The equivalent data exists in the tile processor only in the frequency domain form as an intermediate result, and was subject to just linear processing (to avoid possible unintended biases) so the images have some residual locally-checkerboard pattern that is due to the Bayer mosaic filter (discussed in the earlier blog). Textures that are generated from the combination of all four images have the contrast of such pattern significantly lower. It is possible to add some non-linear filtering at the very last stage of the texture generation.

Each scene model has a download link for the archive that contains the model itself as *.x3d file and Wavefront *.obj and *.mtl as well as the corresponding RGBA texture files as PNG images. Initially I missed the fact that x3d and obj formats have opposite direction of surface normals for the same triangular faces, so almost half of the Wavefront files still have incorrect (opposite direction) surface normals.


Our initial plan was to test algorithms for the tile processor before implementing them in FPGA. The tile processor provides data for the disparity space image (DSI) – confidence value of having certain disparity for specified 2d position in the image, it also generates texture tiles.

When the tile processor code was written and tested, we still needed some software to visualize the results. DSI itself seemed promising (much better coverage than what I had with earlier experiments with binocular images), but when I tried to convert these textured tiles into viewable x3d model directly, it was a big disappointment. Result did not look like a 3d scene – there were many narrow triangles that made sense only when viewed almost directly from the camera actual location, a small lateral viewpoint movement – and the image was falling apart into something unrecognizable.

Figure 10. Four channel images (click for actual viewer with zoom/pan capability)

I was not able to find ready to use code and the plan to write a quick demo for the tile processor and generated DSI seemed less and less realistic. Eventually it took at least three times longer to get somewhat usable output than to develop DCT-based tile processor code itself.

Current software is still incomplete, lacks many needed features (it even does not cut off background so wires over the sky steal a lot of surrounding space), it runs slow (several minutes per single scene), but it does provide a starting point to evaluate performance of the long range 4-camera multi-view stereo system. Much of the intended functionality does not work without more parameter tuning, but we decided to postpone improvements to the next stage (when we will have cameras that are more stable mechanically) and instead try to capture more of very different scenes, process them in batch mode (keeping the same parameter values for all new scenes) and see what will be the output.

As soon as the program was able to produce somewhat coherent 3d model from the very first image set captured through Elphel office window, Oleg Dzhimiev started development of the web application that allows to match the models with the map data. After adding more image sets I noticed that the camera calibration did not hold. Each individual sub-camera performed nicely (they use thermally compensated mechanical design), but their extrinsic parameters did change and we had to add code for field calibration that uses image themselves. The best accuracy in disparity measurement over the field of view still requires camera poses to match ones used at full calibration, so later scenes with more developed misalignment (>20 pixels) are less precise than earlier (captured in Salt Lake City).

We do not have an established method to measure ranging precision for different distances to object – the disparity values are calculated together with the confidence and in lower confidence areas the accuracy is lower, including places where no ranging is possible due to the complete absence of the visible details in the images. Instead it is possible to compare distances in various scene models to those on the map and see where such camera is useful. With 0.1 pixel disparity resolution and 150 mm baseline we should be able to measure 300 m distances with 10% accuracy, and for many captured scene objects it already is not much worse. We now placed orders to machine the new camera parts that are needed to build a more mechanically stable rig. And parallel to upgrading the hardware, we’ll start migrating the tile processor code from Java to Verilog.

And what’s next? Elphel goal is to provide our users with the high performance hackable products and freedom to modify them in the ways and for the purposes we could not imagine ourselves. But it is fun to fantasize about at least some possible applications:

  • Obviously, self-driving cars – increased number of cameras located in a 2d pattern (square) results in significantly more robust matching even with low-contrast textures. It does not depend on sequential scanning and provides simultaneous data over wide field of view. Calculated confidence of distance measurements tells when alternative (active) ranging methods are needed – that would help to avoid infamous accident with a self-driving car that went under a truck.
  • Visual odometry for the drones would also benefit from the higher robustness of image matching.
  • Rovers on Mars or other planets using low-power passive (visual based) scene reconstruction.
  • Maybe self-flying passenger multicopters in the heavy 3d traffic? Sure they will all be equipped with some transponders, but what about aerial roadkills? Like a flock of geese that forced water landing.
  • High speed boating or sailing over uneven seas with active hydrofoils that can look ahead and adjust to the future waves.
  • Landing on the asteroids for physical (not just Bitcoin) mining? With 150 mm baseline such camera can comfortably operate within several hundred meters from the object, with 1.5 m that will scale to kilometers.
  • Cinematography: post-production depth of field control that would easily beat even the widest format optics, HDR with a pair of 4-sensor cameras, some new VFX?
  • Multi-spectral imaging where more spatially separate cameras with different bandpass filters can be combined to the same texture in the 3d scene.
  • Capturing underwater scenes and measuring how far the sea creatures are above the bottom.

Current video stream latency and a way to reduce it

Tue, 07/11/2017 - 11:33

Fig.1 Live stream latency testing

Recently we had an inquiry whether our cameras are capable of streaming low latency video. The short answer is yes, the camera’s average output latency for 1080p at 30 fps is ~16 ms. It is possible to reduce it to almost 0.5 ms with a few changes to the driver.

However the total latency of the system, from capturing to displaying, includes delays caused by network, pc, software and display.

In the results of the experiment (similar to this one) these delays contribute the most (around 40-50 ms) to the stream latency – at least, for the given equipment.



Measure the total latency of a live stream over network from 10393 camera.

  • Camera: NC393-F-CS
    • Resolution@fps: 1080p@30fps,  720p@60fps
    • Compression quality: 90%
    • Exposure time: 1.7 ms
    • Stream formats: mjpeg, rtsp
    • Sensor: MT9P001, 5MPx, 1/2.5″
    • Lens: Computar f=5mm, f/1.4, 1/2″
  • PC: Shuttle box, i7, 16GB RAM, GeForce GTX 560 Ti
  • Display: ASUS VS24A, 60Hz (=16.7ms), 5ms gtg
  • OS: Kubuntu 16.04
  • Network connection: 1Gbps, direct camera-PC via cable
  • Applications:
    • gstreamer
    • chrome, firefox
    • mplayer
    • vlc
  • Stopwatch: basic javascript


Notes table{ border-collapse: collapse; } td{ padding:0px 5px; border:1px solid black; } th{ padding:5px; border:1px solid black; background:rgba(220,220,220,0.5); }

Fig.2 Image sensor is rotated 90°[/caption] It's important to keep in mind that displays are refreshed progressively - so, if an image sensor is a rolling shutter type - their scan directions will coinside. -->
Table 1: Transfer times and data rate

Resolution/fps Image size1, KB Transfer time2, ms Data rate3, Mbps 720p/60 250 2 120 1080p/30 500 4 120

1 – average compressed (90%) image size
2 – time it takes to transfer a single image over network. Jitter is unknown. t = Image_size*1Gbps
3 – required bandwidth: rate = fps*Image_size

Camera output latency calculation

All numbers are for the given lens, sensor and camera setup and parameters. Briefly.

Because of ERS each row’s latency is different. See tables 2 and 3.
Table 2: tROW and tTR

Resolution tROW1, us tTR2, us 720p 22.75 13.33 1080p 29.42 20 full res (2592×1936) 36.38 27

1 – row time, see datasheet. tROW = f(Width)
2 – time it takes to transfer a row over sensor cable, clock = 96MHz. tTR = Width/96MHz
Table 3: Average latency and the whole range.

Resolution tERS avg1, ms tERS whole range2, ms 720p 8 0.01-16 1080p 16 0.02-32

1 – average latency
2 – min – last row latency, max – 1st row latency


tEXP < 1 ms – typical exposure time for outdoors. A display is bright enough to set 1.7 ms with the gains maxed.


The compressor is implemented in fpga and works 3x times faster but needs a stripe of 20 rows in memory. Thus, the compressor will finish ~20/3*tROW after the whole image is read out.

tCMP = 20/3*tROW


tCAM = tERS + tEXP + tCMP

Since the image is read and compressed by fpga logic of the Zynq and this pipeline has been simulated we can be sure in these numbers.
Table 4: Average output latency + exposure

Resolution tCAM, ms 720p 9.9 1080p 17.9 Stopwatch accuracy

Not accurate. For simplicity, we will rely on the camera’s internal clock that time stamps every image, and take the javascript timer readings as unique labels, thus not caring what time they are showing.


Fig.2 1080p 30fps

Fig.3 720p 60fps

GStreamer has shown the best results among the tested programs.
Since the camera fps is discrete the result is a multiple of 1/fps (see this article):

  • 30 fps => 33.3 ms
  • 60 fps => 16.7 ms


Resolution/fps Total Latency, ms Network+PC+SW latency, ms 720p@60fps 33.3-50 23.4-40.1 1080p@30fps 33.3-66.7 15.4-48.8


Possible improvements Camera

Currently, the driver waits for the interrupt from the compressor that indicates the image is fully compressed and ready for transfer. Meanwhile one does not have to wait for the whole image but start the transfer when the minimum of the compressed is data ready.

There are 3 more interrupts related to the image pipeline events. One of them is “compression started” – switching to it can reduce the output latency to (10+20/3)*tROW or 0.4 ms for 720p and 0.5 ms for 1080p.

Other hardware and software

In addition to the most obvious improvements:

  • For wifi: use 5GHz over 2.4GHz – smaller jitter, non-overlapping channels
  • Lower latency software: for mjpeg use gstreamer or vlc (takes an extra effort to setup) over chrome or firefox because they do extra buffering

Lapped MDCT-based image conditioning with optical aberrations correction, color conversion, edge emphasis and noise reduction

Thu, 01/19/2017 - 21:55
jQuery(window).load(function(){ jQuery("#sravnitel_0").sravnitel({ images: ["","","","","",""], titles: ["Raw image, standard bilinear 3x3 pixels kernel de-mosaic, annotated","Raw image, standard bilinear 3x3 pixels kernel de-mosaic","No color conversion, low pass filtered RGB only","Colors converted and low-pass filtered, no nonlinear processing","Edge emphasis enabled, no noise suppression","Fully processed with noise suppression"], showtitles: true, width: 750, height: 560, index_l: 0, index_r: 5, zoom: 1, center_x: 1800, center_y: 1150, }); });

Fig.1. Image comparison of the different processing stages output Results of the processing of the color image

Previous blog post “Lens aberration correction with the lapped MDCT” described our experiments with the lapped MDCT[1] for optical aberration corrections of a single color channel and separation of the asymmetrical kernel into a small asymmetrical part for direct convolution and a larger symmetrical one to be applied in the frequency domain of the MDCT. We supplemented this processing chain with additional steps of the image conditioning to evaluate the overall quality of the of the results and feasibility of the MDCT approach for processing in the camera FPGA.

Image comparator in Fig.1 allows to see the difference between the images generated from the results of the several stages of the processing. It makes possible to compare any two of the image layers by either sliding the image separator or by just clicking on the image – that alternates right/left images. Zoom is controlled by the scroll wheel (click on the zoom indicator fits image), pan – by dragging.

Original image was acquired with Elphel model 393 camera with 5 Mpix MT9P006 image sensor and Sunex DSL227 fisheye lens, saved in jp4 format as a raw Bayer data at 98% compression quality. Calibration was performed with the Java program using calibration pattern visible in the image itself. The program is designed to work with the low-distortion lenses so fisheye was a stretch and the calibration kernels near the edges are just replicated from the ones closer to the center, so aberration correction is only partial in those areas.

First two layers differ just by added annotations, they both show output of a simple bilinear demosaic processing, same as generated by the camera when running in JPEG mode. Next layers show different stages of the processing, details are provided later in this blog post.

Linear part of the image conditioning: convolution and color conversion

Correction of the optical aberrations in the image can be viewed as convolution of the raw image array with the space-variant kernels derived from the optical point spread functions (PSF). In the general case of the true space-variant kernels (different for each pixel) it is not possible to use DFT-based convolution, but when the kernels change slowly and the image tiles can be considered isoplanatic (areas where PSF remains the same to the specified precision) it is possible to apply the same kernel to the whole image tile that is processed with DFT (or combined convolution/MDCT in our case). Such approach is studied in deep for astronomy [2],[3] (where they almost always have plenty of δ-function light sources to measure PSF in the field of view :-)).

The procedure described below is a combination of the sparse kernel convolution in the space domain with the lapped MDCT processing making use of its perfect (only approximate with the variant kernels) reconstruction property, but it still implements the same convolution with the variant kernels.

Signal flow is presented in Fig.2. Input signal is the raw image data from the sensor sampled through the color filter array organized as a standard Bayer mosaic: each 2×2 pixel tile includes one of the red and blue samples, and 2 of the green ones.

In addition to the image data the process depends on the calibration data – pairs of asymmetrical and symmetrical kernels calculated during camera calibration as described in the previous blog post.

Fig.2. Signal flow of the linear part of MDCT-based image conditioning

Image data is processed in the following sequence of the linear operations, resulting in intensity (Y) and two color difference components:

  1. Input composite signal is split by colors into 3 separate channels producing sparse data in each.
  2. Each channel data is directly convolved with a small (we used just four non-zero elements) asymmetrical kernel AK, resulting in a sequence of 16×16 pixel tiles, overlapping by 8 pixels (input pixels are not limited to 16×16 tiles).
  3. Each tile is multiplied by a window function, folded and converted with 8×8 pixel DCT-IV[4] – equivalent of the 16×16->8×8 MDCT.
  4. 8×8 result tiles are multiplied by symmetrical kernels (SK) – equivalent of convolving the pre-MDCT signal.
  5. Each channel is subject to the low-pass filter that is implemented by multiplying in the frequency domain as these filters are indeed symmetrical. The cutoff frequency is different for the green (LPF1) and other (LPF2) colors as there are more source samples for the first. That was the last step before inverse transformation presented in the previous blog post, now we continued with a few more.
  6. Natural images have strong correlation between different color channels so most image processing (and compression) algorithms involve converting the pure color channels into intensity (Y) and two color difference signals that have lower bandwidth than intensity. There are different standards for the color conversion coefficients and here we are free to use any as this process is not a part of a matched encoder/decoder pair. All such conversions can be represented as a 3×3 matrix multiplication by the (r,g,b) vector.
  7. Two of the output signals – color differences are subject to an additional bandwidth limiting by LPF3.
  8. IMDCT includes 8×8 DCT-IV, unfolding 8×8 into 16×16 tiles, second multiplication by the window function and accumulation of the overlapping tiles in the pixel domain.
Nonlinear image enhancement: edge emphasis, noise reduction

For some applications the output data is already useful – ideally it has all the optical aberrations compensated so the remaining far-reaching inter-pixel correlation caused by a camera system is removed. Next steps (such as stereo matching) can be done on- (or off-) line, and the algorithms do not have to deal with the lens specifics. Other applications may benefit from additional processing that improves image quality – at least the perceived one.

Such processing may target the following goals:

  1. To reduce remaining signal modulation caused by the Bayer pattern (each source pixel carries data about a single color component, not all 3), trying to remove it by a LPF would blur the image itself.
  2. Detect and enhance edges, as most useful high-frequency elements represent locally linear features
  3. Reduce visible noise in the uniform areas (such as blue sky) where significant (especially for the small-pixel sensors) noise originates from the shot noise of the pixels. This noise is amplified by the aberration correction that effectively increases the high frequency gain of the system.

Some of these three goals overlap and can be addressed simultaneously – edge detection can improve de-mosaic quality and reduce related colored artifacts on the sharp edges if the signal is blurred along the edges and simultaneously sharpened in the orthogonal direction. Areas that do not have pronounced linear features are likely to be uniform and so can be low-pass filtered.

The non-linear processing produces modified pixel value using 3×3 pixel array centered around the current pixel. This is a two-step process:

  • First the 3×3 center-symmetric matrices (one for Y, another for color) of coefficients are calculated using the Y channel data, then
  • they are applied to the Y and color components by replacing the pixel value with the inner product of the calculated coefficients and the original data.

Signal flow for one channel is presented in Fig.3:

Fig.3 Non-linear image processing: edge emphasis and noise reduction

  1. Four inner products are calculated for the same 9-sample Y data and the shown matrices (corresponding to second derivatives along vertical, horizontal and the two diagonal directions).
  2. Each of these values is squared and
  3. the following four 3×3 matrices are multiplied by these values. Matrices are symmetrical around the center, so gray-colored cells do not need to be calculated.
  4. Four matrices are then added together and scaled by a variable parameter K1. The first two matrices are opposite to each other, and so are the second two. So if the absolute value of the two orthogonal second derivatives are equal (no linear features detected), the corresponding matrices will annihilate each other.
  5. A separate 3×3 matrix representing a weighted running average, scaled by K2 is added for noise reduction.
  6. The sum of the positive values is compared to a specified threshold value, and if it exceed it – all the matrix is proportionally scaled down – that makes different line directions to “compete” against each other and against the blurring kernel.
  7. The sum of all 9 elements of the calculated array is zero, so the default unity kernel is added and when correction coefficients are zeros, the result pixels will be the same as the input ones.
  8. Inner product of the calculated 9-element array and the input data is calculated and used as a new pixel value. Two of the arrays are created from the same Y channel data – one for Y and the other for two color differences, configurable parameters (K1, K2, threshold and the smoothing matrix) are independent in these two cases.
Next steps How much is it possible to warp?

The described method of the optical aberration correction is tested with the software implementation that uses only operations that can be ported to the FPGA, so we are almost ready to get back to to Verilog programming. One more thing to try before is to see if it is possible to merge this correction with a minor distortion correction. DFT and DCT transforms are not good at scaling the images (when using the same pixel grid). It is definitely not possible no rectify large areas of the fisheye images, but maybe small (fraction of a pixel per tile) stretching can still be absorbed in the same step with shifting? This may have several implications.

Single-step image rectification

It would be definitely attractive to eliminate additional processing step and save FPGA resources and/or decrease the processing time. But there is more than that – re-sampling degrades image resolution. For that reason we use half-pixel grid for the offline processing, but it increases amount of data 4 times and processing resources – at least 4 times also.

When working with the whole pixel grid (as we plan to implement in the camera FPGA) we already deal with the partial pixel shifts during convolution for aberration correction, so it would be very attractive to combine these two fractional pixel shifts into one (calibration process uses half-pixel grid) and so to avoid double re-sampling and related image degrading.

Using analytical lens distortion model with the precision of the pixel mapping

Another goal that seems achievable is to absorb at least the table-based pixel mapping. Real lenses can only to some precision be described by an analytical formula of a radial distortion model. Each element can have errors and the multi-lens assembly can inevitably has some mis-alignments – all that makes the lenses different and deviate from a perfect symmetry of the radial model. When we were working with (rather low distortion) wide angle lenses Evetar N125B04530W we were able to get to 0.2-0.3 pix root mean square of the reprojection error in a 26-lens camera system when using radial distortion model with up to the 8-th power of the radial polynomial (insignificant improvement when going from 6-th to the 8-th power). That error was reduced to 0.05..0.07 pixels when we implemented table-based pixel mapping for the remaining (after radial model) distortions. The difference between one of the standard lens models – polynomial for the low-distortion ones and f-theta for fisheye and “f-theta” lenses (where angle from optical axis approximately linearly depends on the distance from the center in the focal plane) is rather small, so it is a good candidate to be absorbed by the convolution step. While this will not eliminate re-sampling when the image will be rectified, this distortion correction process will have a simple analytical formula (already supported by many programs) and will not require a full pixel mapping table.

High resolution Z-axis (distance) measurement with stereo matching of multiple images

Image rectification is an important precondition to perform correlation-based stereo matching of two or more images. When finding the correlation between the images of a relatively large and detailed object it is easy to get resolution of a small fraction of a pixel. And this proportionally increases the distance measurement precision for the same base (distance between the individual cameras). Among other things (such as mechanical and thermal stability of the system) this requires precise measurement of the sub-camera distortions over the overlapping field of view.

When correlating multiple images the far objects (most challenging to get precise distance information) have low disparity values (may be just few pixels), so instead of the complete rectification of the individual images it may be sufficient to have a good “mutual rectification”, so the processed images of the object at infinity will match on each of the individual images with the same sub-pixel resolution as we achieved for off-line processing. This will require to mechanically orient each sub-camera sensor parallel to the others, point them in the same direction and preselect lenses for matching focal length. After that (when the mechanical match is in reasonable few percent range) – perform calibration and calculate the convolution kernels that will accommodate the remaining distortion variations of the sub-cameras. In this case application of the described correction procedure in the camera will result in the precisely matched images ready for correlation.

These images will not be perfectly rectified, and measured disparity (in pixels) as well as the two (vertical and horizontal) angles to the object will require additional correction. But this X/Y resolution is much less critical than the resolution required for the Z-measurements and can easily tolerate some re-sampling errors. For example, if a car at a distance of 20 meters is viewed by a stereo camera with 100 mm base, then the same pixel error that corresponds to a (practically negligible) 10 mm horizontal shift will lead to a 2 meter error (10%) in the distance measurement.


[1] Malvar, Henrique S. Signal processing with lapped transforms. Artech House, 1992.

[2] Thiébaut, Éric, et al. “Spatially variant PSF modeling and image deblurring.” SPIE Astronomical Telescopes+ Instrumentation. International Society for Optics and Photonics, 2016. pdf

[3] Řeřábek, M., and P. Pata. “The space variant PSF for deconvolution of wide-field astronomical images.” SPIE Astronomical Telescopes+ Instrumentation. International Society for Optics and Photonics, 2008.pdf

[4] Britanak, Vladimir, Patrick C. Yip, and Kamisetty Ramamohan Rao. Discrete cosine and sine transforms: general properties, fast algorithms and integer approximations. Academic Press, 2010.

Lens aberration correction with the lapped MDCT

Sat, 01/07/2017 - 18:19

Modern small-pixel image sensors exceed resolution of the lenses, so it is the optics of the camera, not the raw sensor “megapixels” that define how sharp are the images, especially in the off-center areas. Multi-sensor camera systems that depend on the tiled images do not have any center areas, so overall system resolution may be as low as that of is its worst part.

Fig. 1. Lateral chromatic aberration and Bayer mosaic: a) monochrome (green) PSF, b) composite color PSF, c) Bayer mosaic of the sensor, d) distorted mosaic for the chromatic aberration of b).

De-mosaic processing and chromatic aberrations

Our current cameras role is to preserve the raw sensor data while providing some moderate compression, all the image correction is applied during post-processing. Handling the lens aberration has to be done before color conversion (or de-mosaicing). When converting Bayer data to color images most cameras start with the calculation of the “missing” colors in the RG/GB pattern using 3×3 or 5×5 kernels, this procedure relies on the specific arrangement of the color filters.

Each of the red and blue pixels has 4 green ones at the same distance (pixel pitch) and 4 of the opposite (R for B and B for R) color at the equidistant diagonal locations. Fig.1. shows how lateral chromatic aberration disturbs these relations.

Fig.1a is the point-spread function (PSF) of the green channel of the sensor. The resolution of the PSF measurement is twice higher than the pixel pitch, so the lens is not that bad – horizontal distance between the 2 greens in Fig.1c corresponds to 4 pixels of Fig.1a. It is also clearly visible that the PSF is elongated and the radial resolution in this part of the image is better than the tangential one (lens center is left-down).

Fig.1b shows superposition of the 3 color channels: blue center is shifted up-and-right by approximately 2 PSF pixels (so one actual pixel period of the sensor) and the red one – half-pixel left-and-down from the green center. So the point light of a star, centered around some green pixel will not just spread uniformly to the two “R”s and two “B”s shown connected with lines in Fig.1c, but the other ones and in different order. Fig.1d illustrates the effective positions of the sensor pixels that match the lens aberration.

Aberrations correction at post-processing stage

When we perform off-line image correction we start with separating each color channel and re-sampling it at twice the pixel pitch frequency (adding zero sample between each measured one) – this increase allows to shift image by a fraction of a pixel both preserving resolution and not introducing the phase errors that may be visually OK but hurt when relying on sub-pixel resolution during correlation of images.

Next is the conversion of the full image into the overlapping square tiles to the frequency domain using 2-d DFT, then multiplication by the inverted PSF kernels – individual for each color channel and each part of the whole image (calibration procedure provides a 2-d array of PSF kernels). Such multiplication in the frequency domain is equivalent to (much more computationally expensive) image convolution (or deconvolution as the desired result is to reduce the effect of the convolution of the ideal image with the PSF of the actual lens). This is possible because of the famous convolution-multiplication property of Fourier transform and its discrete versions.

After each color channel tile is corrected and the phases of color components match (lateral chromatic aberration is compensated) it is the time when the data may be subject to non-linear processing that relies on the properties of the images (like detection of lines and edges) to combine the color channels trying to achieve highest spacial resolution and not to introduce color artifacts. Our current software performs it while data is in the frequency domain, before the inverse Fourier transform and merging the lapped tiles to the restored image.

Fig.2. Histogram of difference between original image and after direct and inverse MDCT (with 8×8 pixels DCT-IV)

MDCT of an image – there and back again

It would be very appealing to use DCT-based MDCT instead of DFT for aberration correction. With just 8×8 point DCT-IV it may be possible to calculate direct 16×16 -> 8×8 MDCT and 8×8 -> 16×16 IMDCT providing perfect reconstruction of the image. 8×8 pixels DCT should be able to handle convolution kernels with 8 pixel radius – same would require 16×16 pixels DFT. I knew there will be a challenge to handle non-symmetrical kernels but first I gave a try to a 2-d MDCT to convert and reconstruct back a camera image that way. I was not able to find an efficient Java implementation of the DCT-IV so I had to write some code following the algorithms presented in [1].

That worked nicely – when I obtained a histogram of the difference between the original image (pixel values were in the range of 0 to 255) and the restored one – IMDCT(MDCT(original)) it demonstrated negligible error. Of course I had to discard 8 pixel border of the image added by replication before the procedure – these border pixels do not belong to 4 overlapping tiles as all internal ones and so can not be reconstructed.

When this will be done in the camera FPGA the error will be higher – DCT implementation there uses just an integer DSP – not capable of the double precision calculations as the Java code. But for the small 8×8 transformations it should be rather easy to manage calculation precision to the required level.

Convolution with MDCT

It was also easy to implement a low-pass symmetrical filter by multiplying 8×8 pixel MDCT output tiles by a DCT-III transform of the desired convolution kernel. To convolve f ☼ g you need to multiply DCT_IV(f) by DCT_III(g) in the transform domain [2], but that does not mean that DCT-III has also be implemented in the FPGA – the de-convolution kernels can be prepared during off-line calibration and provided to the camera in the required form.

But not much more can be done for the convolution with asymmetric kernels – they either require additional DST (so DCT and DST) of the image and/or padding data with extra zeros [3],[4] – all that reduces advantage of DCT compared to DFT. Asymmetric kernels are required for the lens aberration corrections and Fig.1 shows two cases not easily suitable for MDCT:

  • lateral chromatic aberrations (or just shift in the image domain) – Fig.1b and
  • “diagonal” kernels (Fig.1a) – not an even function of each of the vertical and horizontal axes.

Fig.3. Convolution kernel factorization: a) required asymmetrical and shifted kernel, b) 4-point direct convolution with (sparse) Bayer color channel data, c) symmetric convolution kernel for MDCT, d) symmetric kernel – DCT-III of c) to multiply DCT-IV kernels of the image.

Symmetric kernels are like what you can do with a twice folded piece of paper, cut to some shape and unfolded, with folds oriented strictly vertically and horizontally.

Factorization of the convolution

Another way to handle convolution with non-symmetrical kernels is to split it in two – first convolve with an asymmetrical one directly and then – use MDCT and symmetrical kernel. The input data for combined convolution is split Bayer data, so each color channel receives sparse sequence – green one has only half non-zero elements and red and blue – only 1/4 such pixels. In the case of half-pixel grid (to handle fraction-pixel shifts) the relative amount of non-zero pixels is four times smaller, so the total number of multiplications is the same as for the whole-pixel grid.

The goal of such factorization is to minimize the number of the non-zero elements in the asymmetrical kernel, imposing no restrictions on the symmetrical one. Factorization does not have to be absolutely precise – the effect of deconvolution is limited by several factors, most important being the amplification of the sensor noise (such as shot noise). The required number of non-zero pixel may vary with the type of the distortion, for the lens we experimented with (Sunex DSL227 fisheye) just 4 pixels were sufficient to achieve 2-4% error for each of the kernel tiles. Four pixel kernels make it 1 multiplication per each of the red and blue pixels and 2 multiplications per green. As the kernels are calculated during the camera off-line calibration it should be possible to simultaneously generate scheduling of the the DSP and buffer memories to additionally reduce the required run-time FPGA resources.

Fig.3 illustrates how the deconvolution kernel for the aberration correction is split into two for the consecutive convolutions. Fig.1a shows the required deconvolution kernel determined during the existing calibration procedure. This kernel is shown far off-center even for the green channel – it appeared near the edge of the fish-eye lens field of view as the current lens model is based on the radial polynomial and is not efficient for the fish-eye (f-theta) lenses, so aberration correction by deconvolution had to absorb that extra shift. As the convolution kernel has fixed non-zero elements, the computation complexity does not depend on the maximal kernel dimensions. Fig.3b shows the determined asymmetric convolution kernel of 4 pixels, and Fig.3c – the kernel for symmetric convolution with MDCT, the unique 8×8 pixels part of it (inside of the red square) is replicated to the other3 quadrants by mirroring along row 0 and column 0 because of the whole pixel even symmetry – right boundary condition for DCT-III. Fig.3d contains result of the DCT-III applied to the data shown in Fig.3c.

Fig.4. Symmetric convolution kernel tiles in MDCT domain. Full image (click to open) has peripheral kernels replicated as there was no calibration data outside of the fisheye lens filed of view

There should be some more efficient ways to find optimal combinations of the two kernels, currently I used a combination of the Levenberg-Marquardt Algorithm (LMA) that minimizes approximation error (root mean square of the differences between the given kernel and the result of the convolution of the two calculated) and adding/replacing pixels in the asymmetrical kernel, sorting the variants for the best LMA fit. Experimental code ( for the kernel calculation is in the same GitHub repository.

Each kernel tile is processed independently of the neighbors, so while the aberration deconvolution kernels are changing smoothly between the adjacent tiles, the individual asymmetrical (for direct convolution with Bayer signal data) and symmetrical (for convolution by multiplication in the MDCT space) may change dramatically (see Fig.4). But when the direct convolution is applied before the window multiplication to the source pixels that contribute to a 16×16 pixel MDCT overlapping tile, then the result (after IMDCT) depends on the convolution of the two kernels, not the individual ones.

Deconvolving the test image

Next step was to apply the convolution to the test image, see if there are any visible blocking (or other) artifacts and if the image sharpness was improved. Only a single (green) channel was tested as there is no DCT-based color conversion code in this program yet. Program was tested with the whole pixel grid (not half pixel) so some reduction of sharpness caused by fractional pixel shift was expected. For the comparison “before/after” aberration correction I used two pairs – one with the raw Bayer data (half of the pixels are black in a checker-board pattern) and the other – with the Bayer pattern after 0.4 pix low-pass filter to reduce the checkerboard pattern. Without this filtering image would be either twice darker or (as on these pictures) saturated at lower levels (checkerboard 0/255 alternating pixels result in average gray level of only half of the full range).

Fig.5. Alternating images of a segment (green channel only): low-pass filter of the Bayer mosaic before and after deconvolution. Click image to show comparison with raw Bayer component.
Raw Bayer
Bayer data, low pass filter, sigma = 0.4 pix

Fig.5 shows animated GIF of a fraction of the whole image, clicking the image shows comparison to the raw Bayer (with the limited gray level), caption links the full size images for these 3 modes.

No de-noise code is used, so amplification of the pixel shot noise is clearly visible, especially on the uniform surfaces, but aliasing cancellation remained functional even with abrupt changing of the convolution kernels as ones shown in Fig.4.


Algorithms suitable for FPGA implementation are tested with the simulation code. Processing of the images subject to the typical optical aberration of the fisheye lens DSL227 does not add significantly to the computational complexity compared to the pure symmetric convolution using lapped MDCT based on the 8×8 pixels two-dimensional DCT-IV.

This solution can be used as a first stage of the real time image correction and rectification, capable of sub-pixel resolution in multiple application areas, such as 3-d reconstruction and autonomous navigation.


[1] Plonka, Gerlind, and Manfred Tasche. “Fast and numerically stable algorithms for discrete cosine transforms.” Linear algebra and its applications 394 (2005): 309-345.
[2] Martucci, Stephen A. “Symmetric convolution and the discrete sine and cosine transforms.” IEEE Transactions on Signal Processing 42.5 (1994): 1038-1051. pdf
[3] Suresh, K., and T. V. Sreenivas. “Linear filtering in DCT IV/DST IV and MDCT/MDST domain.” Signal Processing 89.6 (2009): 1081-1089. Abstract and full text pdf.
[4] Reju, Vaninirappuputhenpurayil Gopalan, Soo Ngee Koh, and Ing Yann Soon. “Convolution using discrete sine and cosine transforms.” IEEE Signal Processing Letters 14.7 (2007): 445. pdf
[5] Malvar, Henrique S. “Extended lapped transforms: Properties, applications, and fast algorithms.” IEEE Transactions on Signal Processing 40.11 (1992): 2703-2714.

Measuring SSD interrupt delays

Thu, 12/22/2016 - 18:56

Sometimes we need to test disks connected to camera and we developed a small test program for this purpose. This program basically resembles camogm (in-camera recording program) in its operation and allows us to write repeating blocks of data containing counter value and then check the consistency of the data written. This program works directly with disk driver and collects some statistics during its operation. Disk driver, among other things, measures the time between two events: when write command is issued and when command completion interrupt from controller is received. This time can be used to measure disk write speed as the amount of data sent to disk with each command is also known. In general, this time slightly floats around its average value given that the amount of data written with each command is almost the same. But long run tests have shown that sometimes the interrupt return time after write command can be much longer then the average time.

We decided to investigate this situation in a little bit more details and tested two SSDs with our test program. The disks used for tests were SanDisk SD8SMAT128G1122 and Crucial CT250MX200SSD6, both were connected to eSATA camera port over M.2 SSD adapter. We used these disks before and they demonstrated different performance during recording. We ran camogm_test to write 3 MB blocks of data in cyclic mode. The program collected delayed interrupt times reported by driver as well as the amount of data written since the last delay event. The processed results of the test are in the following table:

Disk Average IRQ reception time, ms Standard deviation, ms Average IRQ delay time, ms Standard deviation, ms Data recorded since last IRQ delay, GB Standard deviation, GB CT250MX200SSD6 (250 GB) 11.9 1.1 804 12.7 499.7 111.7 SD8SMAT128G1122 (128 GB) 19.3 4.8 113 6.5 231.5 11.5

The delayed interrupt times of these disks are considerably different although the difference in average interrupt times which reflect disk write speeds is not that big. It is interesting to notice that the amount of data written to disk between two consecutive interrupt delays is almost twice the total disk size. smartctl reported the increase of Runtime_Bad_Block attribute for CT250MX200SSD6 after each delay but the delays occurred each time on different LBAs. Unfortunately, SD8SMAT128G1122 does not have such parameter in its smartctl attributes and it is difficult to compare the two disks by this parameter.

DCT type IV implementation

Sat, 12/17/2016 - 00:15

As we finished with the basic camera functionality and tested the first Eyesis4π built with the new 10393 system boards (it is smaller, requires less power and, is faster) we are moving forward with the in-camera image processing. We plan to combine our current camera calibration methods that require off-line post processing and the real-time image correction using the camera own FPGA resources. This project development will require switching between the actual FPGA coding and the software implementation of the same algorithms before going to the next step – software is still easier to design. The first part was in FPGA realm – it was to implement the fundamental image processing block that we already know we’ll be using and see how much of the resources it needs.

DCT type IV as a building block for in-camera image processing

We consider a small (8×8 pixel) DCT-IV to be a universal block for conditioning of the raw acquired images. Such operations as lens optical aberrations correction, color conversion (de-mosaic) in the presence of the lateral chromatic aberration, image rectification (de-warping) are easier to perform in the frequency domain using convolution-multiplication property and other algorithms.

In post-processing we use DFT (Discrete Fourier Transform) over rather large (64×64 to 512×512) tiles, but that would be too much for the in-camera processing. First is the tile size – for good lenses we do not need that large convolution kernels. Additionally we plan to combine several processing steps into one (based on our off-line post-processing experience) and so we do not need to sub-sample images – in our current software we double resolution of the raw images at the beginning and scale back the final result to reduce image degradation caused by re-sampling.

The second area where we plan to reduce computations is the replacement of the DFT with the DCT that is designed to be fed with the pure real data and so requires less arithmetic operations than DFT that processes complex input values.

Why “type IV” of the DCT?

Fig.1. Signal flow graph for DCT-IV

We already have DCT type II implemented for the JPEG/JP4 compression, and we still needed another one. Type IV is used in audio compression because it can be converted to a modified discrete cosine transform (MDCT) – a procedure when multiple overlapped windows are processed one at a time and the results are seamlessly combined without any block artifacts that are familiar for the JPEG with low settings of the compression quality. We too need lapped transform to process large images with relatively small (much smaller than the image itself) convolution kernels, and DCT-IV is a perfect fit. 8-point DCT-IV allows to implement transformation of 16-point segments with 8-point overlap in a reversible manner – the inverse transformation of 8-point data may be converted to 16-point overlapping segments, and being added together these segments result in the original data.

There is a price though to pay for switching from DFT to DCT – the convolution-multiplication property being so straightforward in FFT gets complicated for DCT[1]. While convolving with symmetrical kernels is still simple (just the kernel has to be transformed differently, but it is anyway done off-line in our case), the arbitrary kernel convolution (or just a shift in image space needed to compensate the lateral chromatic aberration) requires both DCT-IV and DST-IV transformed data. DST-IV can be calculated with the same DCT-IV modules (just by reversing the direction of input data and alternating the sign of the output samples), but it still requires additional hardware resources and/or more processing time. Luckily it is only needed for the direct (image domain to frequency domain) transform, the inverse transform IDCT-IV (frequency to image) does not require DST. And IDCT-IV is actually the same as the direct DCT-IV, so we can again instantiate the same module.

Most of the two-dimensional transforms combine 1-d transform modules (because DCT is a separable transform), so we too started with just an 8-point DCT. There are multiple known factorizations for such algorithm[2] and we used one of them (based on BinDCT-IV) shown in Fig.1.

Fig.2. Simplified diagram of Xilinx DSP48E1 primitive (only used functionality is shown)

DSP primitive in Xilinx Zynq

This algorithm is implemented with a pair of DSP48E1[3] primitives shown in Fig.2. This primitive is flexible and allows to configure different functionality, the diagram contains only the blocks and connections used in the current project. The central part is the multiplier (signed 18 bits by signed 25 bits) with inputs from a pair of multiplexed B registers (B1 and B2, 18 bits wide) and the pre-adder AD register (25 bits). The AD register stores sum/difference of the 25-bit D-register and a multiplexed pair of 25-bit A1 and A2 registers. Any of the inputs can be replaced by zero, so AD can receive D, A1, A2, -A1, -A2, D+A1, D-A1, D+A2 and D-A2 values. Result of the multiplier (43 bits) is stored in the M register and the data from M is combined with the 48-bit output accumulator register P. Final adder can add or subtract M to/from one of the P, 48-bit C-register or just 0, so the output P register can receive +/-M, P+/-M and C+/-M. The wrapper module dsp_ma_preadd_c.v reduces the number of DSP48E1 signals and parameters to those required for the project and in addition to the primitive instance have a simple model of the DSP slice to allow simulation without the DSP48E1 source code for convenience.

Fig.3. One-dimensional 8-point DCT-IV implementation

8-point DCT-IV transform

The DCT-IV implementation module (Fig.3.) operates in 16 clocks cycles (2 clock periods per data item) and the input/output permutations are not included – they can be absorbed in the data source and destination memories. Current implementation does not implement correct rounding and saturation to save resources – such processing can be added to the outputs after analysis for particular application data widths. This module is not in the coder/decoder signal chain so bit-accuracy is not required.

Data is output each other cycle (so two such modules can easily be used to increase bandwidth), while input data is scrambled more, some of the items have to appear twice in a 16-cycle period. This implementation uses two of the DSP48E1 primitives connected in series. First one implements the left half of the Fig.1. graph – 3 rotators (marked R8, and two of R4), four adders, and four subtracters, The second one corresponds to the right half with R1, R5, R9, R13, four adders, and four subtracters. Two of the small memories (register files) – 2 locations before the first DSP and 4 locations before the second effectively increase the number of the DSP internal D registers. The B inputs of the DSPs receive cosine coefficients, the same ROM provides values for both DSP stages.

The diagram shows just the data paths, all the DSP control signals as well as the memories write and read addresses are generated at the defined times decoded from the 16-cycle period. The decoder is based on the spreadsheet draft of the design.

Fig.4. Two-dimensional 8×8 DCT-IV

Two-dimensional 8×8 points DCT-IV

Next diagram Fig.4. show a two-dimensional DCT type IV implementation using four of the 1-d 8-point DCT-IV modules described above. Input data arrives continuously in line-scan order, next 64-item block may follow either immediately or after a delay of at least 16 cycles so the pipelines phases are correctly restarted. Two of the input 8×25 memories (width can be reduced to match input data, 25 is the width of the DSP48E1 inputs) are used to re-order the input data.As each of the 1-d DCT modules require input data at more than a half cycles (see bottom of Fig.3) interleaving with the common memory for both channels is not possible, so each channel has to have a dedicated one. First of the two DCT modules convert even lines of 8 points, the other one – odd lines. The latency of the data output from the RAM in the second channel is made 1 cycle longer, so the output data from the channels also arrive at odd/even time slots and can be multiplexed to a common transpose buffer memory. Minimal size of the buffer is 2 of the 64 item pages (width can be reduced to match application requirements), but having just a two-page buffer increases the minimal pause time between blocks (if they are not immediate), with a four page buffer (and BRAM primitives are larger even if just halves are used) the minimal non-immediate delay of the 16 cycles of a 1-d module is still valid.

The second (vertical) pass is similar to the first (horizontal) one, it also has individual small memories for input data reordering and 2 output de-scrambler memories. It is possible to use a single stage, but the memory should hold at least 17 items (>16) and the primitives are 16-deep, and I believe that splitting in series makes it easier for the placer/router tools to implement the design.

Next steps

Now when the 8×8 point DCT-IV is designed and simulated the next step is to switch to the Java coding (add to our ImageJ plugin for camera calibration and image post-processing), convert calibration data to the form suitable for the future migration to FPGA and try the processing based on the chosen 8×8 DCT-IV. When satisfied with the results – continue with the FPGA coding.


[1] Martucci, Stephen A. “Symmetric convolution and the discrete sine and cosine transforms.” IEEE Transactions on Signal Processing 42.5 (1994): 1038-1051. pdf

[2] Rao, K. Ramamohan, and Ping Yip. Discrete cosine transform: algorithms, advantages, applications. Academic press, 2014.

[3] 7 Series DSP48E1 Slice, UG479 (v1.9), Xilinx, Sep. 2016. pdf

Using a flash with a CMOS image sensor: ERS and GRR modes

Mon, 10/24/2016 - 17:56
Operation modes in conventional CMOS image sensors with the electronic rolling shutter

Flash test setup

Most of the CMOS image sensors have Electronic Rolling Shutter – the images are acquired by scanning line by line. Their strengths and weaknesses are well known and extremely wide usage made the technology somewhat perfect – Andrey might have already said this somewhere before.

There are CMOS sensors with a Global Shutter but (for the same formats) their pixels’ dynamic range is lower, the readout noise is bigger and the frame rates are lower as well. Here are a few links:

So, the typical sensor with ERS may support 3 modes of operation:

  • Electronic Rolling Shutter (ERS) Continuous
  • Electronic Rolling Shutter (ERS) Snapshot
  • Global Reset Release (GRR) Snapshot

GRR Snapshot was available in the 10353 cameras but ourselves we never tried it – one should have write directly to the sensor’s register to turn it on. But now it is tested and working in 10393s available through the TRIG (0x14) parameter.

MT9P001 sensor

Further, I will be writing about ON Semi’s MT9P001 image sensor focusing on snapshot modes. The operation modes are described in the sensor’s datasheet. In short:

In ERS Snapshot mode (Fig.1,3), exposure time is constant across all rows but each next row’s exposure start is delayed by tROW (row readout time) from the previous one (and so is the exposure end).

In GRR Snapshot mode (Fig.2,4), the exposure of all rows starts at the same moment but each next row is exposed by tROW longer than the previous one. This mode is good when a flash use is needed.

The difference between ERS Snapshot and Continuous is that in the latter mode the sensor doesn’t wait for a trigger and starts new image while still finishing reading the previous one. It provides the highest frame rate (Fig.5).

Fig.1 Electronic Rolling Shutter (ERS) Snapshot mode

Fig.2 Global Reset Release (GRR) Snapshot mode

Fig.3 ERS mode, whole frame

Fig.4 GRR mode, whole frame

Fig.5 Sensor operation modes, frame sequence

Here are some of the actual parameters of MT9P001:

Parameter Value Active pixels  2592h x 1944v tROW 33.5 μs Frame readout time (Nrows x tROW)  1944 x 36.38 μs ~ 72 ms Test setup
  • NC393L-389
  • 9xLEDs
  • Fan (Copal F251R, 25×25 mm, rotating at 5500-8000 RPM)

The LEDs were powered & controlled by the camera’s external trigger output, the delay and duration of which are programmable.

The flash duration was set to 20 μs to catch, without the motion blur, the fan’s blades are marked with stickers – 5500-8000 RPM that is 0.5-0.96° per 20 μs. There was not enough light from the LEDs, so the setup is placed in dark environment and the camera color gains were set to 8 (ISO ~800-1000) – the images are a bit noisy.

The trigger period was set to 250 ms – and the synced LEDs were blinking for each frame.

The information on how to program the NC393 camera to generate trigger signal, fps, change sensor’s operation modes (ERS/GRR) can be found here.

Fig.6a Setup: screen, camera view

Fig.6b Setup: fan

Fig.6c Setup: fan, camera view

Flash in ERS mode Fig.7a Fig.7b Fig.7c

In Fig.7a to expose all rows to the flash the exposure needs to be programmed so the 1st row’s end of exposure will exceed the last row’s start of exposure and the flash delayed until the exposure start of the last row. That makes the single row exposure 72ms+tflash.
Note: there is no ERS effect for moving objects – provided, of course, that the flash is much brighter than the other light sources that will be reducing the contrast during the 72ms frame time.

In Fig.7b the exposure is shorter than the frame readout time – the flash delay can be any – the result is a brighter band on the image as shown in the example below.

Another way to expose all rows is to keep the flash on from the 1st row start until the last row end (Fig.7c) – that’s as good as keeping the flash on all the time.


Diagram Screen Fan Exposure time, ms 5 20 Flash duration, ms 0.02 (20μs) 0.02 Flash delay, ms 40 40 Comments The fan blades are motion blurred in the rows not affected by the delayed 20μs flash. The flash delay is set so the affected rows appear in the middle of the image. Exposure time defines the width of the bright rows band. Flash in GRR mode Fig.8a Fig.8b

In GRR mode the flash does not need to be delayed and the exposure of the 1st row can be as low as tflash but the last row will be exposed for tflash+72ms (Fig.8a). If the scene is uniformly illuminated the the image tends to be darker in the top and getting brighter in the bottom. GRR is very useful with a flash lamp.
Note: No ERS effect (as in Fig.7a case).

Fig.8b just shows what happens if the flash is delayed until frame is read out.


Diagram Screen Fan Exposure time, ms 0.1 0.1 Flash duration, ms 0.02 0.02 Flash delay, ms 40 30 Comments The fan blades are motion blurred in the rows not affected by the delayed 20μs flash. All of the rows not read out before the flash are affected.

Diagram Screen Fan Exposure time, ms 0.1 0.1 Flash duration, ms 0.02 0.02 Flash delay, ms 0 0 Comments Fan is rotating. No motion blur. In GRR if flash is not delayed the whole image is affected by the flash. Brighter environment = lower contrast.

Diagram Screen Fan Exposure time, ms 5 10 Flash duration, ms 0.02 0.02 Flash delay, ms 0 0 Comments Fan is rotating. 100 times longer exposure compared to the previous example – the environment is relatively dark. Conclusions
  • ERS Continuous – max fps, constant exposure, not synced
  • ERS Snapshot – constant exposure, synced
  • GRR Snapshot – synced, use this mode with flash

Elphel presenting at ORCONF 2016, An open source digital design conference

Sun, 10/02/2016 - 23:12

On October 8th, 2016 Andrey will be presenting his work on VDT – Free Software Environment for FPGA Development at an open source digital design conference, ORCONF 2016. ORCONF 2016

The conference will take place in Bologna, Italy, and we are glad for the possibility to meet some of European users of Elphel cameras, and to connect with the community of developers excited about open source design, free software and open hardware.

Elphel will be present at the conference by Andrey Filippov from USA headquarters and Alexadre Poltorak, founder of Swiss 3D4Pi mobile mapping company, working closely with Elphel to integrate Eyesis4Pi, stereophotogrammetric camera, for the purpose of image based 3D reconstruction applications. Andrey will bring and demonstrate the new multisensor NC393 H-camera and Alexandre plans to take some panoramic footage with Eyesis4Pi camera, while in Bologna.