Imaging solutions with Free Software & Open Hardware

Who's online

There are currently 0 users online.

Subscribe to Elphel Development Blog feed Elphel Development Blog
Updated: 37 min 47 sec ago

Photogrammetric Calibration of the Quad Thermal Camera

Tue, 06/18/2019 - 19:46

Figure 1. Calibration of the quad visible range + quad LWIR camera.

We’ve got the first results of the photogrammetric calibration of the composite (visible+LWIR) Talon camera described int the previous post and tested the new pattern. In the first experiments with the new software code we’ve got average reprojection error for LWIR of 0.067 pix, while the visible quad camera subsystem was calibrated down to 0.036 pix. In the process of calibration we acquired 3 sequences of 8-frame (4 visible + 4 LWIR) image sets, 60 sets from each of the 3 camera positions: 7.4m from the target on the target center line, and 2 side views from 4.5m from the target and 2.2 m right and left from the center line. From each position camera axis was scanned in the range of ±40° horizontally and ±25° vertically.

In the first stage of calibration the acquired images (Figure 2 shows both LWIR and visible range images) were processed to extract the detected pattern grid node. Each node’s pixel coordinates (x,y), a pair of the pattern UV indexes and the contrast are stored in the form of multi-layer TIFF files – processed images from Figure 2 are shown on animated GIFs, Figure 3 is for LWIR and Figure 4 – for the visible range images. Each of the images has a link to a respective TIFF file that has metadata visible with ImageJ program.

Figure 2. LWIR and visible range images of the pattern.

The software for the calibration had to be modified, as it did not have provisions for handling different resolution sensors in the same system. Additionally it was designed for the smaller pattern cell size so each image had larger number of detected nodes. We also had to use other method of absolute pattern matching (finding out which of the multiple almost identical measure nodes corresponds to the particular (e.g. center) node of the physical pattern. Earlier, with the fine pitch patterns, we used software-controlled laser pointers, with the current approach we used the fact that the visible subsystem was already calibrated – that provided sufficient data for absolute matching. Only one image for each camera position had manually marked center pattern cell – that mark simplified the processing.

Before LWIR data was used the visible rage images were processed with Levenberg-Marquardt algorithm (LMA) and the system knew precise position and orientation of the camera during capturing of each image set. After preliminary adjustment with idealized (uniform and flat) pattern grid it was possible to simultaneously refine each physical pattern node location in 3d, similarly as we did earlier (Figures 4 and 5 in this post). Using large cell pattern, the visible cameras bundle adjustment resulted in 0.11 pixels when using just radial distortion model. This was improved to 0.036 pix when using non-radial correction to compensate other irregularities of the lenses.

When LWIR data was added and each image set had known extrinsic parameters, the absolute grid matching was performed in the following way. Starting with the image with the smallest distance from the target (so the wrong matching of the grid cell would cause attitude error larger than initial precision of the system assembly) the attitude angles of each of the 4 LWIR subcameras was found. Then for each additional image LMA was run with all but 2 parameters known and fixed – camera base position X and Y coordinates (in the plane parallel to the target). As the position was already known from the visible range subsystem, the X and Y difference was matched to the physical pattern dimensions to determine how many cells (if any) the assumed grid match is offset. That gave absolute grid match for all of the LWIR images, and bundle adjustment of the LWIR modules intrinsic (focal length, offset from the center pixel, radial distortion coefficient) as well as the modules location in the camera coordinate system and their attitudes was possible. It resulted in 0.097 pixels reprojection error with just the radial distortion model. We did not expect much from the non-radial correction as the sensor resolution is low, but it still gave some improvement – the residual error dropped down to 0.067 pix. It is likely this value can be improved more with better tweaking of the processing parameters – the current result was achieved in the first run.

Figure 3. Processed pattern grid from LWIR data (click image to download ImageJ Tiff stack).

Figure 4. Processed pattern grid from visible range camera data (click image to download ImageJ Tiff stack).

These calibration results are not that exciting compared to what we will get with the LWIR 3D model reconstruction, but it gives us confidence that we are on the right track, and our goal is possible. Having a much higher resolution sub-system conveniently provides ground truth data for training of the neural network – something that was much more difficult to achieve with the high resolution visible range only systems.

3D Perception with Thermal Cameras

Thu, 06/06/2019 - 17:33

Figure 1. Oleg carrying combined LWIR and visible range camera.

While working on extremely long range 3D visible range cameras we realized that the passive nature of such 3D reconstruction that does not involve flashing lasers or LEDs like LIDARs and Time-of-Flight (ToF) cameras can be especially useful for thermal (a.k.a LWIR – long wave infrared) vision applications. SBIR contract granted at the AF Pitch Day provided initial funding for our research in this area and made it possible.

We are now in the middle of the project and there is a lot of the development ahead, but we have already tested all the hardware and modified our earlier code to capture and detect calibration pattern in the acquired images.

Printed Calibration Patterns for LWIR

Figure 2. Initial LWIR pattern detection test with a small (100×100 pix) FLIR handheld camera.

For the LWIR imaging we planned to use approach similar to what we have developed for the visible range cameras, and the first step is the precise photogrammetric camera calibration. Before engaging in the project we made sure that we can capture images of the pattern in LWIR wavelength range, the same pattern image should also be registered by the visible range camera to be used as the ground truth data. For the regular cameras there are many ways and materials to create high quality test patterns, our current setup uses 7.0 meters by 3.0 meters pattern printed on five sheets of self-adhesive film attached to the wall. This pattern illuminated by the regular bright fluorescent lamps did not register any image by a LWIR camera (we borrowed a small, handheld FLIR camera, used for HVAC applications).

In the next (successful) test we used a larger scale pattern printed on paper that was not bonded to the wall. The pattern was illuminated by a pair of 500W halogen lamps (see Figure 2). The lamp radiation — both visible and infrared (mostly infrared) was heating the image and so the black painted areas on paper became hotter than the white ones as shown in the FLIR camera image (it is upside down as the camera was held in a vise in that position). The measured temperature difference between white and black areas, illuminated from the distance of 1.5–2 meters was in the range of 2–4°K.

Figure 3. LWIR calibration pattern design.

This quick test proved that with reasonable amount of high power halogen lamps we can achieve sufficient temperature contrast for the photogrammetric calibration. It still has some limitations — while the amount of heat absorbed by the pattern paint can be calculated from the data registered by the visible range cameras, the pattern surface temperature also depends on less controlled effects – air convection and heat transfer to the wall where the paper touches it. These factors reduce accuracy of the calibration, to mitigate their influence we designed a pattern where the ground truth relative surface temperature can be calculated from the measured visible range images and a few parameters that describe thermal properties of the materials.

Calibration Pattern Design

We used 5mm thick presentation type foam boards that consist of the polystyrene foam layer between two paper surface sheets. As we found out later, it is important to use both black and white paints during the printing process – unpainted paper surface that looks matte in visible light behaves like a mirror for the LWIR radiation and reflects the lamps used for illumination. You may find such reflection in Figure 5 – the leftmost panel that was built first does not have white paint. A thin (t=0.8 mm) aluminum sheets are glued to the back surfaces of the foam boards, and the forced airflow equalizes their temperatures. Figure 4 reveals the back side of the pattern module – thicker (1/2″ and 1″) machined foam boards make ribs that provide structural integrity of the pattern modules. These ribs are attached to the aluminum side of the pattern sheets with epoxy resin, their other side is reinforced with the fiberglass tape. Soft foam along the perimeter of the modules allows another use of the fans that provide the forced airflow – the developed vacuum is used to firmly press the modules to the smooth (covered with the visible range pattern film) wall. Timelapse video below illustrates the process of building and installation of the LWIR calibration pattern.

Figure 4. A stack of two face-down LWIR pattern modules.

Figure 3 illustrates temperature calculation for the uniformly illuminated pattern, this method will be used to derive ground truth thermal data from the registered visible range images. For a straight line on a surface of the pattern the absorbed light power produces a step function. The temperature of the underlying aluminum is a blurred version of that function, with the sigma depending on the thicknesses and thermal conductivity ratios of the foam and aluminum. With the used parameters amplitude of the aluminum temperature variations is below 1%. For the very thin foam the surface temperature would be just a sum of the aluminum temperature and a step function, the actual surface temperature distribution is blurred by the amount proportional to the foam thickness as the heat flow under the black/white pattern borders is not exactly orthogonal to the surface.

Calibration pattern period is 357mm, approximately four times larger than that of the pattern we use for the visible range cameras calibration. This size is selected to work with the low resolution prototype system (160×120 pix in each of the four channels) and still is suitable for the higher resolution LWIR imagers of up to 1280×960. Figure 5 shows the view of the pattern by the visible range (left) and LWIR (right) cameras from the distance of 3 meters (top) and 7 meters (bottom) from the target. The field of view of the already calibrated visible range cameras is slightly larger that that of the LWIR, so ground truth data is available over all LWIR sensors area.

Figure 5. LWIR calibration pattern captured by visible range (2592×1936) and LWIR (160×120) cameras from the distance of 3 meters (above) and 7 meters (below).

Visible + LWIR combo camera

The main goal of our current project is to determine to what extent methods that we currently use for visible range camera photogrammetric calibration, distortion and aberration correction, deep subpixel disparity calculation and neural network training are applicable to the LWIR systems. Generally LWIR sensors have lower resolution than the regular visible and near infrared (VNIR) ones, are expensive and subject to export control, so with the limited time and resources available we selected small (160×120) FLIR Lepton-3.5 sensor modules available at Digi-Key. These modules have integrated germanium lenses and mechanical shutters that are essential for the photometric calibration. The sensors have limited frame rate (9 FPS) that puts them outside of the scope of export regulations. We plan use higher resolution sensors after evaluating the prototype system and achieving the goal of 0.1 pix (or better) disparity resolution.

The overall prototype camera design (Figure 6) is rather straightforward: we used already calibrated X-shaped quad camera and added another one with four LWIR sensors. FLIR Leton-3.5 modules do not support externally triggered operation, so I hoped that if all 4 sensors are fed with the same clock, and the reset signal is de-asserted simultaneously, then all channels will stay in sync and output frames simultaneously. Later this assumption turned out to be true, the difference between the channels after startup was in the range of ±10μs and it did not change over time. These small offset values are likely related to the analog PLL lock, they do not require any additional correction as they are more than a thousand times shorter than the effective pixel integration time. The visible range quad camera is triggered by the LWIR one. We measured internal LWIR module latency by capturing image of the rotating object and used this data to program the cameras triggering parameters to achieve simultaneous image capturing by all 8 of the visible range and thermal sensors.

Figure 6. Prototype combo camera: 4×VNIR+4×LWIR

Lepton-3.5 sensor module is small enough so we were able to fit it together with the power conditioning circuitry and the configuration EEPROM on the same dimension PCB as our regular sensor front ends (SFE) of 28mm×15mm and then mount in the standard SFE body. The board electrical and mechanical design is documented in Elphel Wiki.

Lepton module is controlled over the I2C bus that was already supported in Elphel NC393 series cameras, the video output is provided over SPI-based VoSPI interface, so some FPGA development along with the related drivers and applications code was required. Due to the digitized microbolometers output instead of that of a photodiodes in regular CMOS sensors, the rationale behind the use of the gamma compression that allows virtually lossless 8-bit representation of 12-bit pixel ADC output does not hold. That in turn required full 16-bit output rather than the JP4 image format that we were using for over a decade. We implemented Tiff format (both 8 and 16 bits per pixel) that is now available for all other supported image sensors, not only for the LWIR ones.

Photogrammetric calibration software

We are now working on adapting our existing code base to work with the LWIR low resolution sensors, and the photogrammetric calibration is the first step before we can perform 3D scene reconstruction and train a DNN to get higher disparity measurement accuracy. Our existing calibration workflow consists of the following steps:

  1. Capturing of a few hundred image sets of the calibration pattern from a 3 different camera locations, using switchable laser pointers for absolute matching of the periodic pattern cells in the camera images.
  2. Extracting pattern grid key points positions in the individual images with deep subpixel accuracy (0.03-0.05 pix) combining both pixel domain and frequency domain image processing.
  3. Bundle adjustment of the camera intrinsic and extrinsic parameters assuming lenses radial distortion models, physical pattern grid points correction with Levenberg-Marquardt algorithm (LMA).
  4. Continuing bundle adjusting allowing non-radial lens distortions and perform photometric calibration of the sensors (flat field, lens vignetting correction).
  5. Applying calculated distortions to the synthesized pattern grid and calculating Point Spread Function (PSF) to be used for space-variant deconvolution for aberration correction.

Figure 7. Image set of 4 color and 4 LWIR images from the frame sequence captured by Talon camera.

As of now we just finished adaptation/re-implementation of the first two steps to work with LWIR sensors. Compared to the above workflow we do not use laser pointers for LWIR, but the visible range quad camera is already calibrated and can provide absolute matching of the pattern images. The major challenge in step 2 was to make the software work when there are very few (or only one) pattern cell available in each processing window – our previously developed software could comfortably use many pattern cells.

3D perception with LWIR

When the remaining steps of the photogrammetric calibration will be finished, we will use the quad camera 3D scene reconstruction software to build the 3D scenes from the captured LWIR images. The paired visible range quad camera has significantly higher resolution and its disparity data will be used as ground truth for the network training. Here we will benefit from the thermal imaging difference from the traditional image intensifier-based night vision cameras that can only operate in very low light conditions. LWIR sensors can operate simultaneously with the visible range ones (and many systems fuse their images to effectively enhance LWIR resolution).

Impatient to get real life scene imagery even before the software is ready, we took the first prototype camera to Southern Utah. We call this prototype “Talon” for the T-38 training jet where the visible range quad camera plays the role of the instructor that sits slightly behind the LWIR “the student”. We captured synchronized visible/LWIR images in a slot canyon as well as the LWIR-only sets in the pitch black darkness — it was a New Moon that night.

The raw LWIR images themselves do not look that exciting – there are many much higher resolution ones used by the military, law enforcement and even hunters. The things will change when we’ll be able to generate the LWIR 3D scenes.

Timelapse video: Building the LWIR Calibration Pattern

GPU Implementation of the Tile Processor

Thu, 10/11/2018 - 15:10

After we coupled the Tile Processor (TP) that performs quad camera image conditioning and produces 2D phase correlation in space-invariant form with the neural network[1], the TP remained the bottleneck of the tandem. While the inferred network uses GPU and produces disparity output in 0.5 sec (more than 80% of this time is used for the data transfer), the TP required tens of seconds to run on CPU as a multithreaded Java application. When converted to run on the GPU, similar operation takes just 0.087 seconds for four 5 MPix images, and it is likely possible to optimize the code farther — this is our first experience with Nvidia® CUDA™.

Implementation Starting with CUDA™ and JCUDA

Before starting development of the GPU code we verified that the GPU acceleration is possible for the main program written in Java by evaluating demo ImageJ plugin[2]. Next was to get into development with Nvidia® CUDA™ using Nsight[3] plugin for Eclipse IDE. Nsight offered an option to import sample projects and I started to learn CUDA™ with one of them – dct8x8. That project uses “Runtime API” and consists of a mixture of C/C++ code for the CPU and for the GPU. This mixed mode is not directly compatible with JCUDA[4], but if the “kernel” (code executed by the GPU) is kept separate from the CPU one, it is possible to develop/debug/test the program in Nsight IDE and then use the same file containing just the GPU kernel(s) with JCUDA. What needs to be changed is the portion of the non-GPU C/C++ code that transfers data between the computer system (CPU) memory and the GPU dedicated memory physically located on the graphic card.

Guided by the Nvidia® dct8x8 sample project I first implemented similar DCT-IV[5] and DST-IV needed for the Complex Lapped Transform (CLT) used by the TP for conversion to the frequncy domain and back (sample project contained code only for the DCT-II (direct) and DCT-II (inverse) used in JPEG and related applications. After several iterations I made those programs to run almost as fast as the highly optimized code in the Nvidia® sample, and the next step was to implement complete CLT and aberration correction for the Bayer mosaic images, following the approach we used for the RTL[6]. Debugging was simplified by the fact that the same algorithm was already tested in Java, and the intermediate data arrays could be compared between the Java and CUDA™ outputs.

Kernel for the Complex Lapped Transform of the Bayer Mosaic Images

The first implemented kernel is taking the following inputs:

  • Four of the 5 MPix Bayer mosaic images.
  • Four of the per-camera arrays of the space variant (stride 16) CLT deconvolution kernels already converted to the frequency domain (this data is reused for multiple image sets).
  • Additional kernel fractional pixel x,y offsets and their derivatives to interpolate required image shifts between the grid nodes of the space-variant kernels
  • List of the tiles to process that contains tiles location on the tile grid and the fractional pixel X,Y shift calculated externally from the required disparity for the known radial lens distortions. This list is used because not all the tiles need to be processed on each pass, when building the depth map only some tiles need to be processed, and some tiles require multiple iterations with different disparity values.

Output from the first kernel is a set of per-camera (4), per tile (324×242 for the 2592×1936 images), per color component (3) of 4×8×8 arrays representing CLT frequency domain transformation of each of the 16×16 (stride 8) pixels tiles of the source images. The transformation sequence is described in the earlier post[6], it assumes the following stages:

  • Finding the closest de-convolution kernel for the specified tile index and offsets.
  • Calculation of the full horizontal and vertical offset that combines requested image tile center and the interpolated kernel data.
  • Splitting X,Y offsets into rounded integer and fractional pixel shifts. Integer part is used to select position of the 16×16 window in the source image, fractional one is applied later in the form of the phase shift in the frequency domain. It is also applied to the window function.
  • Folding 16×16 image tiles into the 8×8 pixel ones for the 2D DCT-IV/DST-IV conversions, using the shifted 2D half-sine window function.
  • Calculating the CLT layers: DCT-IV/DCT-IV, DST-IV/DCT-IV, DCT-IV/DST-IV and DST-IV/DST-IV (horizontal pass/vertical pass). For the Bayer mosaic input, of the 12 (3 colors by 4 CLT layers) transforms only 4 are actually needed, other 8 are restored using the symmetry of transforms of sparse (1 in 4 non-zero for red and blue, 1 in 2 for green color components) inputs.
  • Element-wise multiplication of the converted image tile by the kernel tile, equivalent to the pixel-domain convolution.
  • Applying the residual fractional pixel shift (in the range of +/-0.5 pix in each direction) implemented as a phase rotation. Such shift does not involve re-sampling as the space domain shift would, and so it does not introduce any related quantization noise.
  • Optional 2D low pass filter that can be combined with the convolution kernels.
Kernels Performance Evaluation

When I went through all the processing pipeline, and made sure the results match the Java code output. I measured the execution time and was disappointed – ~5 seconds per set of 4 images wasn’t what I expected. Using profiler I soon realized that my understanding of CUDA™ was wrong – all the participating threads have to execute exactly the same code, it is not like in multi-threaded CPU code or multiple simultaneously operating modules in RTL. Re-writing the code to eliminate divergence reduced execution time to 0.9 s. Not too exciting, but still significant gain over the CPU alone. Another kernel for inverse transform to convert from the frequency domain back to the images added 0.6 seconds. The inverse MCLT produces overlapping 16×16 (stride 8) tiles that need to be added together to result in a full picture, implementation uses 4 (stride 2 in each direction) passes over the image, with each pass free of any overlaps between tiles allowing asynchronous parallel execution.

After spending about two weeks troubleshooting the code kernel code as a part of the C++ application I was expecting to encounter more difficult to pinpoint problems when adding these GPU kernels to the Java application. There are multiple data arrays to be fed to the GPU with no convenience of having the debugger at hand.

In reality it was much easier – JCUDA[4] does a wonderful job and it took me just a few hours to convert the code to run as a GPU accelerator for the Java program. I already had the Java code to convert images and convolution kernels to the data arrays that were passed to the C++ test application via binary files, only what remained to be done was to flatten multi-dimensional arrays and to replace an array of struct – it had a mixture of integer (tile indices) and float (offsets) members. With that done, there were a couple bugs left, but they were clearly reported in the Java stack trace output.

I do not know if it is possible to use #include directives in the code compiled from JCUDA, but as the source code is anyway first read from the file to a Java String before it is sent to the compiler, I just concatenated all the needed (currently just 2) files as strings. I also added “#define JCUDA” to the concatenated string before the files content, as well as some other numeric defines (like image dimensions) shared between the Java and the GPU code. I enclosed all the includes and the duplicate parameter defines within #ifndef JCUDA in the GPU kernel source files, that made the same source code files to work in both Nsight IDE and in Java application with nvrtcCreateProgram(). Soon I got the complete output images and verified they are the same as generated by the Java code.

And in the end I’ve got an unexpected and wonderful “reward”. When stepping over the critical GPU-related Java code lines I was expecting 1.5 second delay for the execution of the two kernels, but could not notice any. I enclosed each GPU kernel call with extra cuCtxSynchronize() thinking it did not wait for completion – still no visible delay. Then I put a loop to run kernels 100 times and got ~6 seconds for the first kernel. Something was obviously wrong – but the output images were correct. I went back to the IDE and made sure I have “Release” (not “Debug”) in seemingly every relevant place, but still it was running slow. Then I launched built binaries from the command line and discovered, that while “Debug” version is as slow as when run from the IDE, the “Release” version is 20.1 times faster, same performance as with JCUDA.

Kernel description Debug mode execution time Release mode execution time Convert 4 images to FD and deconvolve with kernels 1073.4 ms 57.0 ms Convert 4 images from FD to RGB slices 665.3 ms 29.5 ms Total 1738.7 ms 86.5 ms

These run times were measured for the “GeForce GTX 1050 Ti” with compute capability 6.1, 4GB memory.


The GPU code implemented and tested so far does not include the 2D phase correlation kernel need for the depth map generation, I started with just the rectified images as the pictures usually show if something is wrong in the processing. Phase correlation kernel will be easy to implement (the first kernel will stay the same) and the execution time will be approximately the same. Then we will work on coupling this code with the Tensorflow inferred network and feed the 2D correlation data directly from the GPU device memory. That will result in the near real-time performance of the whole system.

Links to the source code are provided below[7],[8],[9] (the code is also mirrored at Github). It needs to be cleaned up and may eventually be released as a library after more functionality will be added. We believe this code will be useful for variety of imaging applications both coupled with the ML systems or traditional. The phase correlation can be used for multiple view camera systems (as in our case) or for the optical flow processing when matching images acquired by the same camera in subsequent frames.


[1] “Neural network doubled effective baseline of the stereo camera”

[2] “How to create an ImageJ Plugin using JCuda”

[3] Nvidia®Nsight™ Eclipse Edition

[4] Java bindings for Nvidia®CUDA™

[5] Wikipedia article: “Discrete Cosine Transform”

[6] “Efficient Complex Lapped Transform Implementation for the Space-Variant Frequency Domain Calculations of the Bayer Mosaic Color Images”

[7] Source code of the 8×8 DCT-II,DST-II,DCT-IV and DST-IV for GPU

[8] Source code of the Tile Processor GPU implementation

[9] Source code of the Java class to integrate GPU acceleration with JCUDA

Neural network doubled effective baseline of the stereo camera

Wed, 09/05/2018 - 14:47

Figure 1. Network diagram. One of the tested configurations is shown.

Neural network connected to the output of the Tile Processor (TP) reduced the disparity error twice from the previously used heuristic algorithms. The TP corrects optical aberrations of the high resolution stereo images, rectifies images, and provides 2D correlation outputs that are space-invariant and so can be efficiently processed with the neural network.

Tile Processor and the 2D correlation output

TP receives raw Bayer mosaic data from four (or more) camera sensors and combines several chained operations in the frequency domain, eliminating re-sampling errors. TP operates with the fixed windows of 16×16 Bayer mosaic pixels (stride 8 in each direction). While the camera does not have any moving parts, TP operates similarly to the human vision – it calculates correlations for the requested “target disparity” that is analogous to the eye convergence. The difference is that each image tile may have independently set target disparity.

At this stage of the project we evaluated generation of the depth map using 2D correlation outputs for the tiles and the tile target disparity. Of the six possible correlation pairs of the four camera sensors (top, bottom, left, right and two diagonals) we used four, combining (by averaging) horizontal and vertical pairs together. Of the full 2D correlation results we preserved the center 9×9 pixels, so each tile provides 4×9×9 tensor as shown in Figure 1. Five megapixel sensors output 2592(H)×1936(V) pixels making it 324(H)×242(V) tiles. Target disparity for each tile is calculated using existing heuristic-based program, and the network is trained to improve that value using the 2D correlation data.

Neural network and the data sets

Suggested network architecture consists of the two stages – first stage processes each tile output without any interaction with the neighbors, the second will be convolutional enhancing disparity prediction for each tile by using information from the neighbors. We now use a Siamese type network with samples combining correlation outputs (and corresponding target disparities) from 5×5 tile regions, with network trained to predict disparity of the center tile only. While it is less efficient when processing continuous depth maps, it gives us higher training flexibility by allowing to adjust frequency of the “smooth” samples and the samples with discontinuities of different types.

We had 266 processed scenes to use for training and testing. Each scene provides up 324×242=78408 tiles, of them about half are usable (not the featureless sky and not the near objects that we did not used in this project). When capturing images camera was running at 5 frames per second, so we put aside first image from each second for testing (20%) and used the rest for training. Trying to equalize batches we calculated 2D histogram (disparity-confidence) over all images, divided disparity-confidence area into equal percentile regions and then created batches from the data files selecting random tiles – one from each of those regions.

For the ground truth we used disparity/confidence pairs captured by a dual camera rig that has a baseline almost 5 times longer than that of a single camera. Initial cost function was just a weighted (by ground truth confidence) squared disparity difference, later we clipped it by a certain value to reduce influence of a rather common case when the dual camera rig measurement (ground truth) and that of a single camera (used as target disparity) matched different (foreground or background) objects for the same tile resulting in multi-pixel disparity differences.

The training results were evaluated in 2 ways. First we compared costs (total one minimized and partials) for the training data, test data independently generated but from the same source files already used for training, and from the fresh images that were laid aside for testing. Difference between the tests was used as an indication of overfitting. In addition we compared the heuristic output for the whole test images (same data was also used as a target disparity input for the network) and the network output, calculating accuracy gain for different criteria: tiles with certain disparity range and correlation strength (confidence). And it is this gain for far tiles (less than 5 pix disparity or farther than 100 meters) that let us claim that the network doubled disparity resolution that is equivalent of using twice longer baseline of the stereo camera.

Noticing that overfitting starts to develop (by comparing costs corresponding to the test data) we performed the following actions:

  • Applied right/left mirroring of the source data. There may be insufficient up/down and transposition symmetry so we did not rely on replicating source data that way
  • First layer of the stage 1 subnets receives image-like 2D correlation data that has certain properties, so for regularization we added costs for the Laplacians of the first layer relative weights with the zero boundary conditions
  • Processing in the stage 2 should provide reasonable results even when only the data from the same (center) tile is available. Adding 8 (for 3×3 kernel) neighbors and then 16 (to the total of 5×5) should improve result, so we added 2 more instances (sharing all weights) of the stage 2 – one with all zeros but the center (of 25 stage 1 subnets) input, and the other with the center 3×3 non-zero inputs. And then calculated cost for the error of each of these outputs and mixed them in the final cost used for optimization.

These methods combined allowed to postpone the beginning of overfitting and improved convergence – the results did not go worse (at least significantly) when the training ran for too long.

Figure 2. Results for far flat objects (> 1000 m). X3D , images


Figures 2 and 3 illustrate results of the depth map processing with the neural network. Each has 6 sub plots following the common pattern:

  • Top right “Ground truth confidence” contains the full scene correlation strength and the red frame that identifies what part of the scene is analyzed in the other subplots
  • Top left “Ground truth disparity map” shows tile disparity values as measured by a wide baseline dual camera rig. The disparity units as shown on the vertical bar to the right are pixels of the main camera (not the long baseline rig), same units are used on all the remaining subplots
  • Middle left “Heuristic disparity map” shows result of the existing processing of the 2D tile correlation output using heuristic algorithms. These values are used as “target disparity” to calculate 2D correlation for each tile and the correlation results (as 4×9×9 tensors) are fed to the network.
  • Bottom left “Network disparity output” shows the network output for the tiles that do have ground truth data, tiles output that can not be verified is blanked (shown in white)
  • Middle right shows mismatch between the heuristic disparity map (middle left) and the ground truth disparity (top left)
  • Bottom right shows similar errors of the network disparity prediction

Image captions provide the links to the x3d models of the same scene (models are built with the existing software that uses what is considered here as “ground truth” and do not rely on the network processing). Another link (Images↗) in the captions leads to the camera image viewer for the scene. That viewer shows all 8 camera images, the first 4 correspond to the processed data, the last 4 are captured by a second camera used for the ground truth measurements. These images are not raw (raw Bayer mosaic images are also provided as explained in the wiki page), they are calculated for the target disparity set to 0 for all tiles. Images are the result of the space-variant deconvolution for aberration correction, they are being subject to the linear operations only, so at certain zoom levels they have visible modulation caused by the Bayer mosaic, and color de-mosaic artifacts in the periodic grid areas as there are no (nonlinear) demosaic operations in the processing pipeline.

Figure 3. Buildings in the range of 680-2200m. X3D , images

Evaluating disparity maps of the flat horizontal surface

The scene fragment in Figure 2 shows rather flat surface from 1000 meters (~0.5 pix disparity) to infinity (mountain ridge is over 20,000 meters – beyond the resolution of the camera). The disparity noise of the middle row (heuristic output) is greatly reduced by the network (bottom row) and it is not just the low-pass filtering – the transition to the near objects (yellow) is not blurred. The network output reveals some low frequency disparity fluctuations that are not present in the ground truth data (top subplot). These fluctuations are caused by the correlations between the Bayer patterns of the individual sensors, each pattern being distorted by the optical aberration correction. These errors may be reduced by feeding all 6 individual correlation pairs to the network – currently two horizontal (top and bottom) and two vertical (left and right) pairs are reduced by averaging. Combining them non-linearly in the network may increase S/N ratio. Additionally this noise may be reduced by adding more sensors without increasing the overall camera dimensions.

Disparity maps in the urban environment

Figure 3 contains the scene fragment captured while driving northbound along the State Street in Salt Lake City. The buildings visible there (pseudo-colors other than white and yellow) are from 680 meters to 2200 meters ahead (2200 m is the Utah State Capitol around tile at [170,113] – it is visible on the ground truth (top left) and network output (bottom left) subplots.

The challenge here was to prevent the network from blurring the depth map between objects at different distances. While smooth disparity gradients are possible (like the street pavement) most detectable disparity variations at long distances are due to the discontinuities caused by the overlapping objects. When disparity difference is small (1 pixel or less) the correlation argmax for the tile containing both foreground and background pixels would be (incorrectly) somewhere between the foreground and the background values. For the purpose of the next stage of the scene 3d reconstruction – fitting planes – it would be better if such edge tile would be assigned to either foreground or background. And this “cutting corners” on the depth map did happen with the initial cost function based on L2 (cost proportional to squared difference to ground truth disparity weighted with the ground truth confidence) alone.

Figure 4. Buildings in the range of 130-200m from the same scene as in Figure 3. X3D , images [/caption] -->

Cost function for preserving edges in the depth map

Tweaking the cost function significantly improved performance – the result is visible by comparing middle right and bottom right subplots along the vertical building edge at horizontal tile 163 and vertical between 100 and 111 of the Figure 3. The difference between the heuristic disparity and the ground truth shows a visible tile column that is more negative than the tiles around it (the disparity of the foreground object was reduced by correlation). That column completely disappears in the bottom right subplot as the edge sharpness is restored by the trained network.

The cost function was modified in the following way:

  • First the disparity difference was leaky-clipped to 0.3 pixels. Larger errors are usually caused by matching different objects – ground truth may be measured for the distant background while the target disparity matched closer foreground (or vice versa).
  • The second modification specifically added extra cost for “cutting corners” (blurring edges). The average value of the 8 neighbors’ ground truth disparity is calculated for each tile, and outputs falling between the ground truth disparity and the average disparity generate additional cost.

Next steps

Increase of the depth resolution twice was just a low hanging fruit – this project is our first hands-on experience with the neural networks. When visitors of our booth at CVPR-2018 were amazed by just a few percents range error at 2000 meters in the interactive X3D model, we had to explain that the model uses data captured by a pair of such cameras and we plan to use it as a ground truth for the neural network training. And that we hope that eventually a single 258 mm quad camera using the neural network will provide the data as accurate as the existing dual camera rig with 1256 mm baseline. We are not there yet, just half-way, but believe that our original estimate was correct and that goal is reachable.

Add more data

The next immediate step will be adding more training data and possibly reimplementing the TP code to use GPU (this code exist now in Java for CPU and in Verilog for FPGA/VLSI). CPU TP implementation is a bottleneck in the current pipeline, so far we processed only few percents of the captured imagery. Larger dataset will allow to try deeper/wider networks

We will evaluate results of feeding all 6 pairs to the network and experiment with cost functions to reduce noise caused by the correlation of the Bayer patterns of the individual sensors
and maintain space-invariance of the TP output.

Generate output for field calibration

Currently the network outputs only the single scalar per each tile – disparity value. It is important to train the network to generate the confidence value of the disparity it outputs. Additionally the heuristic program we use now can provide misalignment data for each tile (“lazy eye”) – such data is used for field calibration of the camera system by bundle adjustment of the individual sensors attitude. Such output can also be received from the network.

Then we will work on switching other parts of the 3d scene reconstruction to use the neural networks and there are at least two areas that rely not just on straightforward math but use a lot of heuristics.

Target disparity calculation with NN

One such area is selection of the target disparity for each tile. With the small TP window the correlation output provides valid result only within just a few disparity pixels of the center – preprogrammed shift. Full disparity sweep would be expensive so current software uses a combination of methods – scans all tiles at infinity, then uses that data to create low resolution images and correlates them to identify potentially occupied 3D volumes, then grows measured tiles by predicting disparity for the new tiles and measuring correlation. This prediction depends heavily on heuristics and seems to be a good application area to use the network.

Building 3D surfaces

Another area is building 3D surfaces from tile disparities, it is highly heuristic-infested too. Current software uses “supertiles” (overlapping areas of 16×16 with stride 8 tiles) to build multiple “plates” using eigenvalues/eigenvectors of the covariance matrices build of the tile data. Then such plates in the neighboring supertiles are matched to each other and merged if they are likely to belong to the same 3D surface. These plates simultaneously use both the Disparity Space Image (DSI) and the world 3D coordinates. The DSI coordinates are native to the camera and the measurement accuracy can be expressed in the pixels of DSI. The real world coordinates, on the other hand characterize likelihood of such 3d object to exist. The plate merging considers both DSI distance (to be withing measurement accuracy) and world 3D (comparing angles and linear distances between planes to merge). After merging the supertile plates into 3D surfaces, each tile is re-evaluated and assigned to one of them.

Pixel-accurate texture edges

The last step of building realistic 3D scene models was not implemented in the current software – just the provisions for it in the tile processor. It is the restoration of the pixel-accurate edges in the output textures from the tile-accurate depth map and the texture tiles output from the TP.

Hardware improvements

In parallel we plan to improve the hardware and the image capturing process. We will make a light enclosure and a more rigid frame to eliminate the need for additional field calibration, caused by minor variations (mostly thermal) of the camera sensors attitudes. We will try to use fusion of multiple scene models to calculate 3D ground truth data instead of the dual camera rig that we use now. This rig relies on the really far objects (tens of thousands meters) to be able to calibrate itself. We are lucky to have mountain ridges visible around in Salt Lake City, but even they are often too close to be considered infinity.

A fun project

And a fun project – we will try to capture a flock of birds in the air and see how well we can measure and track the 3D coordinates of the multitude of the small flying objects – both without and with the neural network.

CVPR 2018 – from Elphel’s perspective

Sat, 07/21/2018 - 13:32

In this blog article  we will recall the most interesting results of Elphel participation at CVPR 2018 Expo, the conversations we had with visitor’s at the booth, FAQs as well as unusual questions, and what we learned from it. In addition we will explain our current state of development as well as our near and far goals, and how the exhibition helps to achieve them.
The Expo lasted from June 19-21, and each day had it’s own focus and results, so this article is organized chronologically.

Day One: The best show ever!

June 19, CVPR 2018, booth 132

While we are standing nervously at our booth, thinking: “Is there going to be any interest? Will people come, will they ask questions?”, the first poster session starts and a wave of visitors floods the exhibition floor. Our first guest at the booth spends 30 minutes, knowledgeably inquiring about Elphel’s long-range 3D technology and leaves his business card, saying that he is very impressed. This was a good start of a very busy day full of technical discussions. CVPR is the first exhibition we have participated in where we did not have any problems explaining our projects.

The most common questions that were asked:

Q: Why does your camera have 4 lenses?
A: With 4 lenses we have 4 stereo-pairs: horizontal stereo-pair is responsible for vertical features (it is the same as in regular stereo-camera); the vertical pair detects horizontal features, while the 2 diagonal pairs ensure that almost any edge is detected. (Elphel Presentation, p.8).

  • At least 4 non-colinear sensors
  • Native 2-D image matching
  • Lossless rectifcation in the frequency domain
  • Deep sub-pixel super-resolution
  • Adaptable to the higher level goals processing Long Range High Resolution MVS system measuring distances thousands times exceeding the baseline over wide (60° x 45°) FoV
  • -->Q: Why don’t you use lidars?
    A: Lidars are not capable of long range distance measurements. The practical maximum range of  a lidar is 150-200 meters, while we target 500-2000 meters with a 150 mm stereo-base.(Elphel Presentation, p.7)

    Q: How do you deal with thermal expansion?
    A: there are 2 parts to this question: 1) thermal expansion of the calibrated lens is taken care of by the design of the sensor-lens module; 2) Thermal expansion of the whole camera  is also an issue, especially with the larger stereo-base. The titanium tubes that form the cross of the X-Camera expand more on the “sunny side”, and the whole camera bends inward, like a flower. When taking images under the hot Utah sun, we had to wrap the tubes with insulation, covered in aluminum foil.
    After some consideration we decided to remove the insulation wrapping for the CVPR show, because it makes the camera more presentable, and were surprised that this question came up so many times – we almost put the insulation cover back on.

    Q: Do you build custom lenses?
    A: We pre-select standard M12 lenses, because the quality of the lens can vary even within the same model. We have developed a full calibration process and post-processing software to compensate for optical aberrations, allowing to preserve the full sensor resolution over the camera FOV. Read more about aberration correction on Elphel blog.

    Q: What is your part of the technology?
    A: Everything: we have all parts: Hardware – cameras with aberration corrected lenses and thermally compensated sensor-lens module; a state of the art calibration facility and process; 3D-reconstruction software (Free Software, GNU GPL).

    Q: What is the accuracy of your photogrammetry:
    A: We comfortably work with 1/10 of a pixel.

    Q: How dense is your 3D scene?
    A: Currently: 324 x 242 overlapping tiles, with no dead zones.

    Q: Does it build a 3D model in real-time?
    A: Currently the 3D model is reconstructed in software in the post-processing of the acquired images. This software is tested for FPGA, with which we can achieve 12fps 3D-reconstruction on the camera. Our final goal is to design and manufacture custom ASIC with 3D processing on the chip. (Elphel Presentation, p.17)

    The last visitor of the day from a large European company stopped at our booth around 6pm, 30 minutes before the end for the day, and we spend another hour talking and showing Elphel’s camera and high-resolution accurate images it produces. The lights on the floor went down, but we stayed and explain the technology behind our results and future plans. It was a nice day, well spent.

    Day 2: Neural network focus.

    End-to-End vs. Augmented

    Our current results:
    1. 500 meters, high resolution, 3D reconstruction with 5% accuracy with 150mm stereo-base (quad-stereo camera), passive (image-based)
    2. 2000 meters with 5% accuracy with 2 quad-stereo cameras based at 1256 mm from each other
    3. 3D-Reconstruction is done in post-processing with software simulated for FPGA porting. With FPGA – 12fps in 3D

    Our near future goals:

    1. Train the neural network with the current 3D-image sets to achieve same results as in (2) – 2000 meter 3D reconstruction, with just one quad-sensor camera
    2. Port software to FPGA to achieve real-time 3D reconstruction (12 fps)

    Our long-term goals:

    1. Manufacture custom ASIC for 3D-reconstruction to reconstruct 3D scenes on the fly with video frame rate. The small size of the custom chip will allow us to mass produce a long-range 3D reconstruction camera for industrial and commercial purposes. Smaller form-factor, inexpensive, and fast.
    2. Scalable design: a smaller stereo-base with 4 sensors – phone application (200 meter accurate 3D reconstruction for consumer applications); larger stereo-base: extremely long range passive 3D reconstruction – military applications, drone applications, etc.
    CNNs are not efficient for the real time high-resolution 3-D image processing.

    Elphel’s approach to 3D reconstruction with the help of neural network is different from the mainstream one. This is partially due to our expertise is building calibrated hardware, which already produces robust 3D measurements. Therefore we see the value in pre-processing the images before feeding them to the NN  similar to the eye pre-procesing visual information before sending it to the brain. The human eye receives 150 MPix of data, processes it on the retina, and sends only 1Mpix to the brain – 150 times reduction.

    One question came up twice, both times from the conference attendees, about the difficulty the NN has to process high-resolution images. The current approach to NN training recommends to use low resolution raw images (640×480), because the NN has to be trained on a large amount of images (thousands). The last 6 slides of Elphel’s presentation talk about augmenting the NN with Elphel high-performance Tile Processor, that significantly reduces the amount of data, while leaving the “decision making” to the network.

    We argue that the End-to-End approach is less efficient for high-resolution image processing then NN augmented with training-tree, linear image processing.

    Elphel training sets for Neural Network:

    Elphel offers unique image sets for neural network training. These are high-resolution, space-invariant data sets.
    Elphel is seeking collaboration with machine learning research teams working in the area of ML applications for 3D object classification, localization and tracking.


    Day 3: Autonomous Vehicle Focus:

    The exhibition is slowing down. Some booths start packing at midday. Elphel, however is busier than ever.
    We have talked to about every self-driving car company that had it’s presence at CVPR. We also talked to other large corporations doing research in perception and 3D-reconstruction.
    We start by visiting our neighbors – other exhibitors at the Expo, we are very curious what they develop, for which applications, can it be combined with our technology, can we learn something from them, and can they learn from us? It’s a collaborative spirit, with which we start conversation. One of our closest neighbors is Mapbox – an open source mapping platform for custom designed maps, so we start with them. We use Open street Map as well as other maps with our Scene Viewer for 3D-reconstructed scenes for locating scenes on the map as well as for the ground truth data. During lunch break at we add Mapbox tiles to our scene viewer. Mapbox exhibitors are pleased that it worked so well for us. Now they are also very interested in our technology!

    Then we walk to other booths, learn about them and present Elphel.
    Some of the responses we’ve got when we introduced our technology and our current results:

    • Researcher from Apple: “You have the most unique technology on this Expo!”
    • Autonomous Vehicle company A: “I can not believe you can accurately measure distances at 500 meters with just a small base of 150mm? With just 5-10% error? It is simply impossible! I have to come see it!” After visiting our booth seeing the demo he asks: “Can you demonstrate your technology on the car? It has to be in Singapore!”
    • Autonomous Vehicle company B: ” We are happy with the use of the Lidars on the city streets, with close range perception and 30 mph driving speed. But on the highway, where the speed if much higher and the distance much farther the Lidar’s output is too sparse. Passive, long-range 3D reconstruction would be a perfect solution.”
    • Autonomous Vehicle company C: “How dense is your 3D scene? Oh, it is dense, then we can use it.”
    • Autonomous Vehicle company D: “We are building the perception solution for self-driving trucks. We need long-range distance measurement.”
    • Autonomous Vehicle company E: “We have 16 different types of sensors on the car right now. I don’t think we have room to put another camera. Besides the area behind the wind shield is very valuable, we already have 4 cameras there – there is no room. By the way, we re-evaluate our approach every 6 months, and we just did that, so let’s, maybe, talk in 6 months again?”
      Interesting thought – why would anyone need unobstructed windshield in a self-driving car? They did not explain that.
    View of the CVPR 2018 Expo “From Elphel Perspective”:

    creating 3-D model of the CVPR 2018 Expo

    3D model view of CVPR 2018, made by Elphel 3D X-Camera

    One last thing to do before the show is over is to get the 3D-X-Cam up high and build the 3D model of the CVPR 2018 EXPO in “real-time”!

    Two Dimensional Phase Correlation as Neural Network Input for 3D Imaging

    Fri, 07/20/2018 - 19:10

    We uploaded an image set with 2D correlation data together with the import Python code for experiments with the neural networks and are now looking for collaboration with those who would love to apply their DL experience to the new kind of input data. More data will follow and we welcome feedback to make this data set more useful.

    The application area we are interested in is an extremely long distance 3D scene reconstruction and ranging with the distance to baseline ratio of 1000:1 to 10,000:1 while preserving wide field of view. Earlier post describes aircraft distance and velocity measurements with up to 3,000:1 distance-to-baseline ratio and 60°(H)×45°(V) field of view.

    Figure 1. Space-invariant 2D phase correlation data

    Data set: source images, 2D correlation tiles, and X3D scene models

    The data set contains 2D phase correlation output calculated from the 2592×1936 Bayer mosaic source images captured by the quad stereo camera, and Disparity Space Image (DSI) calculated from a pair of such cameras. Longer baseline provides higher range resolution and this DSI is serving as the ground truth for a single quad camera. The source images as well as all the used software is also provided under the GNU/GPLv3 license. DSI is organized as a 324×242 array – each sample is calculated from the corresponding 16×16 tile. Tiles are overlapping (as shown in Figure 3) with stride 8.

    These data sets are provided together with the fully reconstructed X3D scene models, viewable in the browser (Wavefront OBJ files are also generated). The scene models are different from the raw DSI as the next software stages generate meshes, and that frequently leads to over-simplification of the original DSI (so most fronto parallel objects in the scene provide better range accuracy when probed near the bottom). Each scene is accompanied with the multi-layer TIFF file (*-DSI_COMBO.tiff) that allows to see the difference between the measured DSI and the one used in the rendered X3D model. File format and Python import software is documented in Oleg’s post “Reading quad stereo TIFF image stacks in Python and formatting data for TensorFlow”, the data files are directly viewable with ImageJ.

    The first goal of applying neural network to this data is to improve DSI generation from the raw 2D correlation compared to what we can do now with the traditional code by using data from all correlation pairs and the neighbor tiles to propagate information along the objects edges and amplify correlation in low textured areas by pooling the correlation outputs from multiple tiles.

    While there are no moving parts in the system, the overall process of measuring the DSI resembles human/animal stereo vision, where the eyes are converged for the expected object distance, and the visual processing deals with only a small residual mismatch, constantly adjusting the convergence. The difference here is that each tile can be individually “converged” for the requested disparity – this is done by combining integer pixel shift of the tile input window with the accurate fractional pixel shift by a phase rotator in the frequency domain to avoid any re-sampling errors. Currently prediction of the “convergence” (expected disparity) for the next tiles to measure is performed by a mixture of several algorithms (to avoid an expensive full disparity scan over the whole image). This process of predicting disparity is likely to be improved by a properly designed and trained network too, the one that will use the tile context.

    As the current processing is recurrent (repeated “eye convergence” / mismatch processing), there are several options how to organize the training data. We decided to focus on sub-pixel matching, so for each scene with calculated ground truth DSI (from the dual camera rig) the correlation output is measured for the target disparity (convergence) in the range of minus 1 pixel to plus 1 pixel with the step of 0.1 pixel in the currently provided data sets. It makes total of 21 files of the correlation outputs. This partial disparity sweep around the ground truth is an alternative to a full disparity sweep that would require too much storage.

    In addition to the raw Bayer images scene models have links to the JPEG images, corresponding to each sub-camera view. These images have aberrations corrected and they are partially rectified for the disparity = 0 (infinity). They are not rectilinear, and still have small radial distortion common to all of them, it is compensated during generation of the x3d models. The images have visible modulation caused by the Bayer mosaic – they were not subject to any (destructive) demosaic processing – only to the linear one. Each color channel is separately de-convolved with a set of space-variant kernels.

    Figure 2. Dual quad camera rig

    Available data series

    Initial data contains scenes captured with the car mounted dual quad camera rig (Figure 2) in two series:

    • Salt Lake City State Street
    • Salt Lake City overlook (northern part of the city in the direction of the airport)

    The first series (80 scenes) was captured while driving north at a rate of 5 frames per second. Only the first four seconds (20 scenes) of each 20 second interval are posted to reduce the amount of uploaded data. The second series (59 scenes) was captured while standing and then slowly turning and starting to move.

    More data will be added, we are open to suggestions – what data should be included to make the whole data set more useful. All available series combined are available at this link: . Number of the scenes is definitely too small to extract the semantic information, but it may be enough to train the network to build DSI using small samples of the correlation tiles. At least to start with.

    The frequently asked questions about this dataset Why is there a special format of the data set, why not just provide the raw images?

    We do provide raw Bayer images captured by the camera in JP4 format, but due to significant for the high-resolution images optical aberrations, complete end-to-end processing would require space-variant coefficients that to our understanding would require enormous resources and a very large number of images for training. Instead we propose to separate the space-variant pre-processing and provide space-invariant data to the network. And the 2D correlation data is the earliest space-invariant data available with this approach.

    Figure 3. Anisotropy of the 2D correlation. Red and blue squares show two overlapping tiles in the source images and in the calculated 2D correlation.

    Why do you use four cameras for stereo when most systems have just two?

    In the case when the source image contains nicely textured in all directions areas, the correlation output contains a sharp round spot and subpixel argmax calculation is not very difficult – polynomial interpolation provides good results. Unfortunately many real-life objects do not have convenient texture, and poor texture is Achilles’ heel of many vision-based 3D reconstruction systems. On the other hand discontinuity between foreground and background objects usually results in sharp edges that can be used for matching, and it is usually safe to assume that lack of visible features means that the surface is smooth and depth can be reconstructed by averaging over larger area. The edges may have any direction, but binocular systems with horizontal baselines are only sensitive to the vertical features. Figure 3. shows correlation output of a distant airplane where the correlation spots are stretched in the direction of the edges. The four-sensor camera in Figure 1 contains 6 pairs, and after combining horizontal and vertical there are 4 directions left, so for any edge direction there is a pair with a almost perpendicular to it.

    Why don’t you use LIDARs for the ground truth data?

    Most available LIDARs have a rather short range of 150-250 meters, and we are interested in measuring larger scenes. So instead we decided to use a pair of similar quad cameras positioned at ~5 times larger distance than the quad camera baseline, so the measurements it provides are also 5 times more accurate and can be used for training as ground truth. Binocular system does have limitations as described above, so it will not provide accurate data for the horizontal features (like electrical wires so common over Salt Lake City).

    Figure 4. Field calibration results

    What are the current limitations of the hardware?

    Individual quad cameras use precisely thermally compensated sensor front ends with the image plane movement relative to the sensor of less than ±0.02μ/°C, but the overall mount made of titanium tubes does bend when heated by the sun radiation from one side. Just few tenths of a pixel, but it is too much for the application, so until we have a new design we have to use field calibration and bundle-adjust individual sub-camera attitudes using the image data itself. It allows to keep misalignment to ~0.05 pix.

    Dual camera rig (Figure 2) as currently implemented has its limitations that we temporarily mitigate with the field calibration. We are working on the rig improvements too – one of the mechanical issues is that when driving at freeway speeds we noticed 55 Hz torsion oscillations (two cameras rotating in opposite directions at the ends of the 1.5 m aluminum tube) of 0.4 pix amplitude. So until we will strengthen the rig we had to limit data set to the images captured at lower speeds to keep vibration-caused mismatch under 0.1 pix.

    Figure 4. shows results of the field calibration for the image series captured during driving along the State Street. Each quad camera (main with the 258 mm baseline and auxiliary with the 150 mm one) had adjusted its sensor front end attitude angles using the image data, same with the attitude of the auxiliary camera as a whole with respect to the main one (baseline of 1256 mm). The results of the distance measurements obtained from this long baseline pair were then used as ground truth for the individual quad cameras. Relatively strong correction required for the first ~20 samples was caused by the changing temperature in the camera connection tubes after slowing down from the freeway speed to the city one.

    Electronic rolling shutter (ERS) limitations. We really love small format high-resolution image sensors perfected by the cellphone industry for the image quality and sensitivity, but that performance currently requires Electronic Rolling Shutter (ERS) that leads to image distortions of the moving objects (or caused by the moving/rotating camera itself). In the presence of movement of the objects (or camera egomotion) the measured disparity depends on the time difference between the moments the same feature is captured by the sub-cameras. The sensors are mechanically aligned to be parallel within 5 pixels that corresponds to 0.2ms of the readout time. That means that horizontal pairs capture all objects almost simultaneously regardless of disparity, for the vertical and diagonal pairs that time difference increases proportionally to the disparity, but for the far objects (where disparity accuracy is most important) that difference is still small (0.03ms for each disparity pixel). Currently posted image sets do not have any ERS correction, and that somewhat deteriorates performance for the near objects, but when considering just the egomotion (both translation and rotation) this effect can be easily compensated, at least for the uniform rotation on movement.

    In the posted image set ERS is not an issue for small (<5 pix) disparities, we will add software compensation over the full disparity range of the scenes for uniform movements, and later will use IMU to compensate higher frequency movements also.

    Reading quad stereo TIFF image stacks in Python and formatting data for TensorFlow

    Fri, 07/20/2018 - 14:10

    Fig.1 TIFF image stack

    The input is a <filename>.tiff – a TIFF image stack generated by ImageJ Java plugin (using bioformats) with Elphel-specific information in ImageJ written TIFF tags.

    Reading and formatting image data for the Tensorflow can be split into the following subtasks:

    1. convert a TIFF image stack into a NumPy array
    2. extract information from the TIFF header tags
    3. reshape/perform a few array manipulations based on the information from the tags.

    To do this we have created a few Python scripts (see that use Pillow, Numpy, Matplotlib, etc..

    ~$ python3 <filename>.tiff
    It will print header info in the terminal and display the layers (and decoded values) using Matplotlib.


    Subtasks 1. and 2. are simply done by Pillow. As for 3., if it was a simple TIFF image stack there would be no need to have extra scripts.

    If the layer were just stacked along depth the shape would be (2178, 2916, 5). The script reshapes the data into tiles and puts the values into a separate array, so the output is 2 Numpy arrays:

    • image data shape: (242, 324, 9, 9, 4) with the layers along the last dimension
    • values data shape: (242, 324, 3)

    And they are ready to be used in the Tensorflow.

    For more details and example output, see this wiki page or inspect the source. The most detailed general info is probably in the slides from our presentation for CVPR2018 – format description starts from p.19.

    File samples

    Actual files can be found here or go to 3d+biquad, open individual models and hit the light green button to ‘Download source files for ml’.

    Without using Python displaying layers and decoding tags is natively supported by ImageJ or Fiji. To read the tags exiftool or tiffinfo linux programs can be used.

    Other libraries

    Other Python libraries that could have worked (but examples for Pillow were easier to find):

    Elphel at CVPR 2018

    Wed, 06/20/2018 - 00:12

    Elphel booth at CVPR 2018 Expo

    Elphel is presenting at CVPR 2018 EXPO: Long Range Passive 3D Reconstruction System,  providing NN with custom training sets, enabling realtime 3D-aware machine learning systems.

    Please visit us at CVPR, Booth 132.

    You can also find our presentation and related articles on wiki page



    Capturing Aircraft Position with the Long Range Quad Stereo Camera

    Sun, 05/06/2018 - 12:57

    Figure 1. Aircraft positions during descent captured with the quad stereo camera. Each animation frame corresponds to the available 3-D model.

    While we continue to work on the multi-sensor stereo camera hardware (we plan to double the number of sensors to capture single-exposure HDR image sets) and develop code to get the ground truth data for the CNN training, we had some fun testing the camera for capturing aircraft position in 3-D space. Completely passive method, of course.

    We found a suitable spot about 2.5 km from the beginning of the runway 34L of the Salt Lake City international airport (exact location is shown in the model viewer) so approaching aircraft would pass almost over our heads. With the 60°(H)×45°(V) field of view of the camera aircraft are visible when they are 270 m away horizontally.

    Camera was set to capture synchronized image sets at 5 frames per second (same as in Fig. 1 animation), each set was processed independently without any relation to the other ones (no optical flow or motion vectors). Our existing (not airplane-aware) prototype software generated the scene 3-D models, they in turn were used to measure the aircraft positions. Result accuracy may be improved by using directly the intermediate Disparity Space Image (DSI) data – current reconstruction code tries to fit some flat surfaces to approximate the DSI data. At long distances there is insufficient data about the shape of the objects, so at ~500 m range (Fig. 2) the software used a fronto-parallel surface (distances to the aircraft nose, center, and tail are the same) and at ~600m it selected a tilted surface in the wrong direction, so the measured distance to the tail is larger than the one measured to the nose and center.

    Figure 2. Ground distance and altitude of the three points sampled.

    The measurements accuracy is still good, considering that the disparity of 1 image pixel corresponds for this short baseline camera to 527 meters range, at 800 m 1/10 pixel disparity precision corresponds to 120 m uncertainty. Ground speed and the descend rate (indicated in the animation) are calculated between the first captured set and the current one, so the last value averages over the full 6 seconds.

    This method of capturing aircraft position and speed does not require that it travels directly to/from the camera, it can fly in any direction. There could also be multiple aircraft (such as UAVs) in the same camera view. Another 3-D scene (Fig. 3) captured from the same spot perpendicular to the flight direction shows an aircraft at 1600 m from the camera at 110 m altitude. According to the map the center line of the 34R runway is 14% farther (at 1820 m) from the camera position. That corresponds to 0.045 pix disparity error in the distance measurement, so calculated ground speed and altitude would also be 14% lower than actual.

    Figure 3. Aircraft traveling perpendicular to the view direction.

    Each of the provided 3-D scene models has the corresponding “set of 4 images” link on the settings page that can be viewed as an animation, manually iterated over or downloaded – the one for Fig. 3 scene shows how a 0.33 pixel wiggle of the FedEx MD-11 image looks like (images are zoomable and pannable). Here are the direct links for image sets of the first (1.75 pix disparity) and the last (0.7 pix disparity) of the scene models of the approaching B737 shown in Fig. 1.

    These four images are not used in the actual reconstruction process, they are rendered from the source raw (JP4 format) ones for the specified disparity (D0.0 in the file names show that disparity was 0.0 pixels) for viewing or evaluating with alternative software. These images are not rectilinear and have residual common (for all four) radial distortion. They have optical aberrations corrected by the space-variant deconvolution, sub-camera residual misalignment is field-corrected (such misalignment may be caused by an uneven heating of the camera tubular structure by the direct sunlight). The images are partially rectified to the common radial distortion model that bests fits simultaneously all four sub-cameras. The remaining radial distortion is compensated during final generation of the 3-D model. These images have visible grid artifacts caused by the Bayer pattern – no de-mosaic processing is involved because it hurts linear operations over the images, and it is not adequate for the peripheral image areas (see earlier post). Each color component is processed separately for the phase correlation, when combining texture images these artifacts are reduced as they are not coherent between the channels. Any additional non-linear improvements of the Bayer mosaic suppression can be applied to the texture generation step, not to the data used for correlation.

    Implementing a linux driver for an image sensor for NC393

    Fri, 04/20/2018 - 12:48

    Fig.1 MT9F002


    This post briefly covers implementation of a driver for On Semi’s MT9F002 14MPx image sensor for 10393 system boards – the steps are more or less general. The driver is included in the latest software/firmware image (20180416). The implemented features are programmable:

    • window size
    • horizontal & vertical mirror
    • color gains
    • exposure
    • fps and trigger-synced ports
    • frame-based commands sequence allowing to change settings of any image up to 16 frames ahead (didn’t need to be implemented as it’s the common part of the driver for all sensors)
    • auto cable phase adjustment during init for cables of various lengths


    • Get the register reference document for MT9F002
    • Setup the Eclipse IDE for linux kernel development – see instructions


    Capturing images without the driver

    There is a set of scripts used for verification in software (on PC) and hardware (on the target board). Among other things they can be used to establish initial communication with a sensor, run registers initialization and acquire the first images. The process for MT9F002 is described on our wiki.

    Driver implementation

    In general the driver can be split into 2 parts:

    • common for all sensors – sets up working with fpga frame-based command sequencers (see NC393 development progress and the future plans).
    • sensor specific – commands that do actual communication with a sensor. This is the part needs to be implemented.

    For a sensor there are a few tables and a few functions, the most important ones are:


    • mt9f002_pages – a ‘page’ is the upper byte of a 16-bit register address of the sensor (for 8-bit reg addresses the page number is 0x00), the table should list all available pages in the sensor – or at least the most important ones – they are used to program fpga’s 256-max records per port tables used to write commands and at least one record is used to read from port – so in general for a setup w/o a MUX board the limit of allowed pages per sensor is 256-1 = 255 (with a MUX + 3x sensors on a single port it is 256-2(mux write & read)-4(3x sensor read + 1x broadcast read))/4 = 62)
    • mt9f002_par2addr – used convert 256 8-bit parameters available as P_SENSOR_REGSxxx to 16-bit register address
    • mt9f002_ahead_tab – sequencer commands latencies – leave default, they are needed on the final stages of development

    functions (they are registered on top of the common ones):

    • mt9f002_pgm_detectsensor – remove reset read sensor id
    • mt9f002_pgm_initsensor – init registers that select interface, sensor’s PLL multipliers/dividers and other required settings
    • mt9f002_pgm_exposure – program exposure
    • mt9f002_pgm_window – program image width and height
    • mt9f002_pgm_limitfps – limit fps based on compressor capabilities and trigger period settings
    • mt9f002_pgm_gains – program color gains
    • mt9f002_pgm_triggermode – set sensor in triggered mode
    • mt9f002_pgm_sensorregs – write sensor registers directly

    See this article for more details.


    Dual Quad-Camera Rig for Capturing Image Sets

    Tue, 03/20/2018 - 00:35

    Figure 1. Dual quad-camera rig mounted on a car

    Following the plan laid out in the earlier post we’ve built a camera rig for capturing training/testing image sets. The rig consists of the two quad cameras as shown in Figure 1. Four identical Sensor Front Ends (SFE) 10338E of each camera use 5 MPix MT9P006 image sensors, we will upgrade the cameras to 18 MPix SFE later this year, the circuit boards 103981 are in production now.

    We decided to use cameras with different baselines: 150 mm and 258 mm, the distance between the quad-camera centers is 1256 mm. The distance map calculated from the larger of the two cameras will have higher precision, than the distance map from the smaller one. For the same processing algorithms and the same object the distance resolution is reverse proportional to the baseline. Even higher resolution will be achieved by fusing depth maps from the two cameras together. The higher precision results may be used for the cost functions of the lower distance resolution camera instead of the unavailable real ground truth data. To make the rig sufficiently rigid the cameras are mounted on the aluminum tube (OD=57 mm), and that tube is attached to the rest of the mount through the vibration isolators.

    Each of the two cameras uses matched set of the lenses with focal length variation of less than 1% to simplify frequency domain mutual image rectification. Cameras still need fine alignment and calibration, after that we will start image sets capturing.

    This year Computer Vision and Pattern Recognition conference (CVPR 2018) will be held in Salt Lake City, Utah. We plan to participate in the CVPR exhibition and we will be happy to see visitors at Elphel office that is less than 10 km from the conference venue.

    High Resolution Multi-Vew Stereo: Tile Processor and Convolutional Neural Network

    Mon, 02/05/2018 - 22:08

    Figure 1. Multi-board setup for the TP+CNN prototype

    This article describes our next steps that will continue the year-long research on high resolution multi-view stereo for long distance ranging and 3D reconstruction. We plan to fuse the methods of high resolution images calibration and processing, already emulated functionality of the Tile Processor (TP), developed RTL code for its implementation and the Convolutional Neural Network (CNN). Compared to the CNN alone this approach promises over a hundred times reduction in the number of input features without sacrificing universality of the end-to-end processing. The TP part of the system is responsible for the high resolution aspects of the image acquisition (such as optical aberrations correction and image rectification), preserves deep sub-pixel super-resolution using efficient implementation of the 2-D linear transforms. Tile processor is independent of any training, only a few hyperparameters define its operation, all the application-specific processing and “decision making” is delegated to the CNN.

    Machine Learning for 3D Scene Reconstruction

    Machine learning is an active development area, and its applications to the 3D scene reconstruction stimulated by the autonomous vehicles including self-driving cars is no exception. Use of the CNNs to extract surfaces from the random-dot stereograms was published as early as 1992[1]. Most of the modern researches use standard image sets: Middlebury stereo data set[2] for high resolution near objects and KITTI[3] for the longer range applications. KITTI images are acquired from a moving car, they have attached ground truth data captured by the LIDAR. This image set uses binocular pairs and has relatively low resolution (1.4 MPix) compared to the modern image sensors, and still most of the CNN architectures require from seconds to thousands of seconds even when implemented with GPU devices and so are not yet suitable for the real-time applications.

    Most of the CNNs[4] input raw pixel data and perform unary feature extraction in the parallel (one for each image in a stereo set) subnets, merge features and perform additional processing of the resulting 3d data. This is a so-called “siamese” network architecture that benefits from sharing parameters between the identical subnetworks. It is common to put most resources to the unary part of the processing resulting in truncating of the common stage of the processing that can consist of just a single layer(Fast Architecture in [4]). Efficient implementation in [5] limits CNN processing to just generate DSI and use traditional methods of DSI enhancement such as semi-global matching [6], other architectures split network after exchanging features and generate depth maps for each of the stereo images individually[7].

    Figure 2. 2D MDCT basis functions (¼ of all MCLT ones for N=8)

    Convolutional Neural Networks and the Frequency Domain Processing

    Early layers of the various CNNs (and the eye retina too) are very general and even remind the basis functions (Figure 2) of the two-dimensional Fourier (DFT), cosine/sine (DCT and DST) and wavelet (DWT) transforms so it is no surprise that there are works that explore combinations of such transforms and the neural networks. Some of them [89] exploit energy concentration property of these transforms that makes possible popular image compression such as JPEG. Others [1011] evaluate efficiency of the available Fast Fourier Transform implementations to speed-up convolutions by converting image data to the frequency domain and then applying the pointwise multiplication according to the convolution-multiplication property. Improvement is modest, as the frequency domain calculations are most efficient for the large windows, while most modern CNNs use small ones, such as 3×3, where Winograd algorithm is more efficient.

    Tile Processor and the High Resolution Multi-View Camera

    Multi-view high resolution cameras present a special case where frequency domain processing results may lead to the reduction of the CNN input features by two orders of magnitude compared to raw pixel input. As described in the earlier post Tile Processor is using efficient Modulated Complex Lapped Transform (MCLT) conversion of the Bayer mosaic (color) high resolution image data to the frequency domain, simultaneously providing subpixel resolution shift for image rectification. Frequency domain processing includes space-variant optical aberration correction (required for the high resolution small format image sensors), phase correlation for image pairs and textures processing if it is required in addition to the distance (disparity) measurement. After channels/pairs merging the frequency domain data is converted back to the pixel domain where the tile correlation maximums are located and measured. Each overlapping tile is 16×16 pixels, the tile period (corresponding to stride in CNN terms) is 8×8 pixels. In the case when only the disparity value and confidence are output from each tile it makes 2 features. If the raw pixels were input, for 4 channels it would be 4*8*8 = 256 features or 128 times more. It may be useful to increase the number of features and supplement average disparity for all 4 pairs of quad camera with separate and horizontal and vertical pairs to improve foreground-background separation, in that case the feature reduction would still be over 40.

    Conversion of the raw pixels to the Disparity Space Image (DSI) involves significant reduction of the (X,Y) resolution, when using four of the 18 MPix (4912×3684) imagers the DSI resolution is just 614×460, – isn’t it a waste of the sensor resolution? No, it is not:

    • a deep sub-pixel resolution for disparity measurement needed for long-distance ranging requires matching of the large image areas anyway
    • most of the image area for the most real-world images corresponds to smooth 3-d surfaces where assumption of a common disparity value for a tile is reasonable.
    • the initial image resolution is preserved by the TP when source images are converted to the textures (simultaneously improving quality as the data from 4 rectified images is averaged)
    • pixel-accurate distance map may be restored by extra processing the pixel data for selected tiles where depth discontinuity is detected, then assigning each pixel to one of the available surfaces.

    Advantages of the TP+CNN over the End-to-End CNN

    Significant (42..128) reduction of the input features is not the only advantage of the TP+CNN combination over the CNN alone. Being “convolutional” CNNs depend on translation symmetry, the groups of related pixels are treated the same way regardless of their localization in the image. That is only an approximation, especially when dealing with the high resolution images and extracting subpixel disparity values. This divergence from the strict convolution model is caused by the optical aberrations and distortions and requires use of the space-variant convolution instead, or performing complete aberration correction and image rectification before the images are fed to the network. Otherwise both the complexity of the network and amount of training data would increase dramatically. Image rectification with pixel (or slightly better) precision is a common task in stereo processing. It involves interpolation and re-sampling of the pixel data, the process that leads to the phase noise introduction, especially harmful when deep super-resolution of the matched images is required. Tile processor implementation combines multiple operations (fractional pixel image shifts, optical aberrations correction, phase correlation of the matched pairs), TP avoids image re-sampling from the sensor pixel grid by replacing it with the phase rotation in the frequency domain.

    Final step that reduces the number of features that are sent from the TP to the CNN is extraction of the disparity value by calculation of the argmax of the phase correlation data. This function has to be calculated with subpixel resolution for the data defined on the integer pixel grid. Certain biases are possible, and the TP implementation offers trade-off between speed and accuracy. The result disparity value is a sum of the pre-applied disparity (implemented as a phase rotation in the frequency domain on top of the integer pixel shift) and the argmax value (correlation maximum offset from zero). When higher accuracy is required, a second iteration may be performed by applying the full disparity from the first iteration, then the residual argmax offset will be close to zero and less subject to bias.

    System Performance Estimation and the Prototype Setup

    Optimal system for the real-time high resolution 3-D scene reconstruction and ranging would require development of the application-specific SoC. If used with a set of four 18 MPix image sensors (such as ON Semiconductor AR1820HS) and a single ×16 1600 MHz DDR4 memory device, the 16 nm technology process, the TP subsystem will be capable of 10 Hz operation covering the full 4912×3684 frames reserving half of the memory bandwidth for other then TP operations.

    We plan to emulate such system using available NC393 camera electronic and optical-mechanical components, including multiple 10393 system boards based on Xilinx Zynq 7030 SoC. Each such board has a GigE port and four identical sensor ports routed directly to the FPGA I/O pads allowing flexible assignment of pin functions. Typical applications include up to 8 differential LVDS pairs, clock pair, I²C and clock input. The same connectors can be used for high-speed communication between the 10393 boards. Partitioning the system into multiple boards will allow to fit the required TP functionality into smaller FPGAs, then send the result features (614×460×6) over the GigE to a workstation with GPU for the experiments with different CNN implementations. The system bandwidth will be lower than that of the application-specific SoC, the 10 Hz operation will be possible with 5 MPix sensors (2.5 Hz with 18 MPix).

    Inter-board connections are shown in Figure 1 (just the connections, the actual prototype camera will look more like in Figure 3, but with a wider body). Five to seven of the 10393 boards are arranged in 2 layers. Four layer 1 boards use one of the sensor ports to receive image data from the attached sensor, perform image conditioning, flat-field correction and store data in the dedicated DDR3 memory. They later read the data as 16×16 pixels overlapping tiles, calculate the tile centers using calibration data and requested location and nominal disparity from the data received over the GigE. Each tile is transformed to the frequency domain, the data is subject to the space-variant aberration correction. The result frequency domain tiles are output through three remaining sensor ports that are reconfigured to be LVDS transmitters. The layer 2 boards simultaneously receive frequency domain data through all 4 of their sensor ports from the layer 1 and perform phase correlation (pointwise multiplication followed by normalization) on the image pairs. There could be just a single layer 2 board, or up to 3 (limited by the available layer 1 ports) to perform different types of correlations in parallel (all 4 pairs combined, two vertical pairs and separately 2 horizontal pairs for better foreground/background separation. The results of the frequency domain calculations are then transformed to the pixel domain and the argmax is calculated. Then argmax value is used to calculate the full tile disparity, and the corresponding correlation value – as a disparity confidence. The pair (disparity, confidence) for each tile is then sent over GigE to the CNN implemented on a workstation computer.

    Figure 3. Quad sensor camera for image sets acquisition

    Image Sets for Training and Testing

    While the TP functionality is already tested with the software emulation, and the efficient implementation is developed, more research is needed for the CNN part of the system. Available image sets, such as KITTI[3] have insufficient resolution (1.4 MPix) and they use different spatial arrangement of the cameras. We plan to capture high resolution quad camera image sets using available NC393-based cameras that will be upgraded from 5 MPix to 18 MPix sensors of the same 1/2.3″ format so the optical-mechanical design will remain the same. As we are primarily interested in long distance ranging (few hundreds to thousands meters), use of the LIDARs to capture ground truth data is not practical. Instead we plan to mount a pair of identical quad cameras (with the baseline of 150mm) on a car 1500 mm apart, pointed in the same direction, so when the 3D measurements from these quad cameras are fused, the accuracy of the composite distance data would be ten times better, because the effective baseline will be 1500mm. Of course that method has some limitations (it will not help to improve data from the poorly textured objects) but it will provide higher absolute distance resolution that can be used for the loss function during CNN training. Data from the individual quad cameras will be used for training and testing of the network.

    All acquired images, related calibration data and software will be available online under GNU GPL.


    [1] Becker, Suzanna, and Geoffrey E. Hinton. “Self-organizing neural network that discovers surfaces in random-dot stereograms.” Nature 355.6356 (1992): 161.

    [2] Scharstein, Daniel, et al. “High-resolution stereo datasets with subpixel-accurate ground truth.” German Conference on Pattern Recognition. Springer, Cham, 2014.

    [3] Menze, Moritz, and Andreas Geiger. “Object scene flow for autonomous vehicles.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.

    [4] J. Zbontar and Y. LeCun, “Stereo matching by training a convolutional neural network to compare image patches,” Journal of Machine Learning Research, vol. 17, no. 1-32, p. 2, 2016.

    [5] W. Luo, A. G. Schwing, and R. Urtasun, “Efficient deep learning for stereo matching,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5695–5703, 2016.

    [6] H. Hirschmuller, “Accurate and efficient stereo processing by semi-global matching and mutual information,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2, pp. 807–814, IEEE, 2005.

    [7] J A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry, “End-to-end learning of geometry and context for deep stereo regression,” arXiv preprint arXiv:1703.04309, 2017.

    [8] Sihag, Saurabh, and Pranab Kumar Dutta. “Faster method for Deep Belief Network based Object classification using DWT.” arXiv preprint arXiv:1511.06276 (2015).

    [9] Ulicny, Matej, and Rozenn Dahyot. “On using CNN with DCT based Image Data.” Proceedings of the 19th Irish Machine Vision and Image Processing conference IMVIP 2017

    [10] Vasilache, Nicolas, et al. “Fast convolutional nets with fbfft: A GPU performance evaluation.” arXiv preprint arXiv:1412.7580 (2014).

    [11] Lavin, Andrew, and Scott Gray. “Fast algorithms for convolutional neural networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

    Photo Finish

    Tue, 01/30/2018 - 14:53

    Since 2005 and the older 333 model, Elphel cameras have a Photo Finish mode. First, it was ported to 353 generation, and then from 353 to 393 camera systems.

    In this mode the camera samples scan lines and delivers composite images as video frames. Due to the Bayer pattern of the sensor the minimal sample height is 2 lines. The max fps for the minimal sample height is 2300 line pairs per second. The max width of a composite frame can be up to 16384px (is determined by WOI_HEIGHT). A sequence of these frames can be simply joined together without any missing scan lines.

    Current firmware (20180130) includes a photo finish demo:

    A couple notes for 393 photo finish implementation:

    • works in JP4 format (COLOR=5). Because in this format demosaicing is not done it does not require extra scan lines, which simplified fpga’s logic.
    • fps is controlled:
      • by exposure for the sensor in the freerun mode (TRIG=0, delivers max fps possible)
      • by external or internal trigger period for the sensor in the snapshot mode (TRIG=4, a bit lower fps than in freerun)

    See our wiki’s Photo-finish article for instructions and examples.

    Efficient Complex Lapped Transform Implementation for the Space-Variant Frequency Domain Calculations of the Bayer Mosaic Color Images

    Mon, 01/08/2018 - 02:00

    This post continues discussion of the small tile space-variant frequency domain (FD) image processing in the camera, it demonstrates that modulated complex lapped transform (MCLT) of the Bayer mosaic color images requires almost 3 times less computational resources than that of the full RGB color data.

    “Small Tile” and “Space Variant”

    Why “small tile“? Most camera images have short (up to few pixels) correlation/mutual information span related to the acquisition system properties – optical aberrations cause a single scene object point influence a small area of the sensor pixels. When matching multiple images increase of the window size reduces the lateral (x,y) resolution, so many of the 3d reconstruction algorithms do not use any windows at all, and process every pixel individually. Other limitation on the window size comes from the fact that FD conversions (Fourier and similar) in Cartesian coordinates are shift-invariant, but are sensitive to scale and rotation mismatch. So targeting say 0.1 pixel disparity accuracy the scale mismatch should not cause error accumulation over window width exceeding that value. With 8×8 tiles (16×16 overlapped) acceptable scale mismatch (such as focal length variations) should be under 1%. That tolerance is reasonable, but it can not get much tighter.

    What is “space variant“? One of the most universal operations performed in the FD is convolution (also related to correlation) that exploits convolution-multiplication property. Mathematically convolution applies the same operation to each of the points of the source data, so shifted object of the source image produces just a shifted result after convolution. In the physical world it is a close approximation, but not an exact one. Stars imaged by a telescope may have sharper images in the center, but more blurred in the peripheral areas. While close (angularly) stars produce almost the same shape images, the far ones do not. This does not invalidate convolution approach completely, but requires kernel to (smoothly) vary over the input images [12], makes it a space-variant kernel.

    Figure 1. Complex Lapped Transform with DCT-IV/DST-IV: time-domain aliasing cancellation (TDAC) property. a) selection of overlapping input subsequences 2*N-long, multiplication by sine window; b) creating N-long sequences for DCT-IV (left) and DST-IV (right); c) (after frequency domain processing) extending N-long sequence using DCT-IV boundary conditions (DST-IV processing is similar); d) second multiplication by sine window; e) combining partial data

    There is another issue related to the space-variant kernels. Fractional pixel shifts are required for multiple steps of the processing: aberration correction (obvious in the case of the lateral chromatic aberration), image rectification before matching that accounts for lens optical distortion, camera orientation mismatch and epipolar geometry transformations. Traditionally it is handled by the image rectification that involves re-sampling of the pixel values for a new grid using some type of the interpolation. This process distorts the signal data and introduces non-linear errors that reduce accuracy of the correlation, that is important for subpixel disparity measurements. Our approach completely eliminates resampling and combines integer pixel shift in the pixel domain and delegates the residual fractional pixel shift (±0.5 pix) to the FD, where it is implemented as a cosine/sine phase rotator. Multiple sources of the required pixel shift are combined for each tile, and then a single phase rotation is performed as a last step of pixel domain to FD conversion.

    Frequency Domain Conversion with the Modulated Complex Lapped Transform

    Modulated Complex Lapped Transform (MCLT)[3] can be used to split input sequence into overlapping fractions, processed separately and then recombined without block artifacts. Popular application is the signal compression where “processed separately” means compressed by the encoder (may be lossy) and then reconstructed by the decoder. MCLT is similar to the MDCT that is implemented with DCT-IV, but it additionally preserves and allows frequency domain modification of the signal phase. This feature is required for our application (fractional pixel shifts and asymmetrical lens aberrations modify phase), and MCLT includes both MDCT and MDST (that use DCT-IV and DST-IV respectively). For the image processing (2d conversion) four sub-transforms are needed:

    • horizontal DCT-IV followed by vertical DCT-IV
    • horizontal DST-IV followed by vertical DCT-IV
    • horizontal DCT-IV followed by vertical DST-IV
    • horizontal DST-IV followed by vertical DST-IV

    MathJax.Hub.Config({ extensions: ["mml2jax.js"], jax: ["input/MathML","output/HTML-CSS"], "HTML-CSS": { availableFonts: ["TeX"], scale: 75 }, MMLorHTML: { prefer: { MSIE: "HTML", Firefox: "MML", Opera: "HTML", other: "HTML" } } }); Reversibility of the Modified Discrete Cosine and Sine Transforms

    Figure 1 illustrates the principle of TDAC (time-domain aliasing cancellation) that restores initial data from the series of individually converted subsections after seemingly lossy transformations. Step a) shows the 2*N long (in our case N=8) data subsets extraction, each subset is multiplied by a window function. It is a half period of the sine function, but other windows are possible as long as they satisfy the Princen-Bradley condition. Each of the sections (a…h) corresponds to N/2 of the input samples, color gradients indicate input order. Figure 2b has seemingly lossy result of MDCT (left) and MDST (right) performed on the 2*N long input (a,b,c,d) resulting in N-long (– c ~ –d,a– b ~ ) – tilde indicates time-reversal of the subsequence (image has the gradient direction reversed too). Upside-down pieces indicate subtraction, up-right – addition. Each of the 2*N → N transforms is irreversible by itself, TDAC depends on the neighbor sections.

    Figure 1c shows the first step of original sequence restoration, it extends N-long sequence using DCT-IV boundary conditions – it continues symmetrically around left boundary, and anti-symmetrically around the right one (both around half-sample from the first/last one), behaving like a first quadrant (0 to π/2) of the cosine function. Conversions for DST branch are not shown, they are similar, just extended like a first quadrant of the sine function.

    The result sequence of step c) (now 2*N long again) is multiplied by the window function for the second time in step d), the two added terms of the last column (d and c~) are swapped for clarity.

    Figure 2. Space-variant phase shift (extra overlap), source-centered window. Distorted reconstruction in the center

    The last image e) places result of the step d) and those of similarly processed subsequences (c,d,e,f) and (e,f,g,h) to the same time line. Fraction (d~ ,c~ ) of the first block compensates (–d~ ,– c~ ) of the second, and (f~,e~ ) of the second – (–f~,–e~ ) of the third. As the window satisfies Princen-Bradley condition, and it is applied twice, the c columns of the first and second block when added result in c segment of the original sequence. The same is true for the d, e, and f columns. First and last N samples are not restored as there are no respective left and right neighbors to be processed.

    Space-variant shift and TDAC property

    Modified discrete cosine and sine transforms exhibit a perfect reconstruction property when input sequence is split into regular overlapping intervals and this is the case for many applications, such as audio and video compression. But what happens in the case of a space-variant shift? As the total shift is split into integer and symmetrical fractional part even smooth variation of the required shift for the neighbor sample sequences there will be places where the required shift crosses ±0.5 pix boundaries. This will cause the overlap to be N±1 instead of exactly N and the remaining fractional pixel shift will jump from +0.5 pix to -0.5pix or vice versa.

    Figure 3. Space-variant phase shift (reduced overlap), source-centered window. Distorted reconstruction in the center

    Such condition is illustrated in Figure 2, where fractional pixel shift is increased to ±1.0 instead of ±0.5 to avoid signal shape distortion caused by a fractional shift. The sine window is also modified accordingly to have zeros on both ends and so to have a flat top, but it still satisfies Princen-Bradley condition. The sawtooth waveform represents input signal that has large pedestal = 10 and a full amplitude of 3. The two left input intervals (1 to 16 and 9 to 24) are shown in red, the two right ones (15 to 30 and 23 to 38) are blue. These intervals have extra overlap (N+2) between red and blue ones. Dotted waveforms show window functions centered around input intervals, dashed – inputs multiplied by the windows. Solid lines on the top plot show result of the FD rotation resulting in 1 shift to the right for the red subsequences, to the left – by the blue ones.

    The bottom plot of Figure 2 is for the “rectified image”, where the overlapping intervals are spread evenly (0-15, 8-23, 16-31, 24-39), the restored data is multiplied by the windows again (red and blue solid lines) and then added together (dark green waveform) in an attempt to restore the original sawtooth.

    Figure 4. Space-variant phase shift (extra overlap), modified window. Correct reconstruction

    There is an obvious problem in the center where the peak is twice higher than the center sawtooth (dotted waveform). It is actually a “wrinkle” caused by the input signal pedestal that was pushed to the center by the increased overlap of the input, not by the sawtooth shape. The FD phase shift moved windowed input signal, not just the input signal. So even if the input signal was constant, left two windows would be shifted right by 1, and right ones – left by one sample, distorting their sum. Figure 3 illustrates the opposite case where the input subsequences have reduced rather than increased overlap and FD phase rotation moves windowed data away from the center – the restored signal has a dip in the middle.

    This problem can be corrected if the first window function used for input signal multiplication takes the FD shift into account, so that the input windows after phase rotation match the output ones. Signals similar to those in Figure 2, but with appropriately offset input windows (dotted waveforms are by 1 pixel asymmetrical) are presented in Figure 4. It shows perfect reconstruction (dark green line) of the offset input sawtooth signal.

    Figure 5. Space-variant phase shift (reduced overlap), modified window. Correct reconstruction[/caption] Figure ? illustrates the correct reconstruction of the input data of Figure 3 made with the input window function modified to match the FD phase rotation. --> Implementation of the MCLT for the Monochrome 16×16 Pixel Tiles

    Figure 5. MCLT converter for the 16×16 pixel tiles

    The MCLT converter illustrated in Figure 5 takes an array of 256 pixel samples and processes them at the rate of 1 pixel per clock, resulting in 4 of the 64 (8×8) arrays representing FD transformation of this tile. Converter incorporates phase rotator that is equivalent to the fractional pixel shifter with 1/128 pixel resolution. Multiple pixel tiles tiles can be processed immediately after each other or with a minimal gap of 16 pixels needed to restore the pipeline state. Fold sequencer receives start signal and simultaneously two 7-bit fractional pixel X and Y shifts in 1/128 increments (-64 to +63). Pixel data is received from the read-only port of the dual-port memory (input tile buffer) filled from the external DDR memory. Fold sequencer generates buffer addresses (including memory page number) and can be configured for various buffer latency.

    Figure 6. DCT-IV and DST-IV basis functions for N=8

    Fold sequencer simultaneously generates X and Y addresses for the 2-port window ROM that generates window function (it is a half-sine as discussed above) values that are combined by a multiplier as the 2-d window function is separable. Each of the dimensions calculate window values with appropriate subpixel shift to allow space-variant FD processing. Mapping from the 16×16 to 8×8 tiles is performed according to Figure 1b for each direction, resulting in four 8×8 tiles for DCT/DCT (horizontal/vertical), DST/DCT, DCT/DST and DST/DST future processing, Each of the source pixels contribute to all 4 of the 8×8 arrays, and each corresponding elements of the 4 output arrays share the same 4 contributing pixels – just with different signs. That allows to iterate through the source pixels in groups of 4 once, multiply them by appropriate window values and in 4 cycles add/subtract them in 4 accumulators. These values are registered and multiplexed during the next 4 cycles feeding 512×25 DTT input buffer (DTT stands for Discrete Trigonometric Transform – a collective name for both DCT and DST).

    “Folded” data stored in the 512×25 DTT input buffer is fed to the 2-dimensional 8×8 pixel DTT module. It is similar to the one described in the DCT-IV implementation blog post, it was just modified to allow all 4 DTT variants, not just the DCT/DCT described there. This was done using the property of DCT-IV/DST-IV that DST-IV can be calculated as DCT-IV if the input sequence is reversed (x0 ↔ x7, x1 ↔ x6, x2 ↔ x5, x3 ↔ x4) and sign of the odd output samples (y1, y3, y5, y7) is inverted. This property can be seen by comparing plots of the basis functions in Figure 6, the proof will be later in the text.

    Figure 7. Frequency domain phase rotator

    Another memory buffer is needed after the 2d DTT module as it processes one of four 64-pixel transforms at a time, and the phase rotator (Figure 7) needs simultaneous access to all 4 components of the same FD sample. These 4 components are shown on the left of the diagram, they are multiplied by 4 sine/cosine values (shown as CH, SH, CV and SV) and then combined by the adders/subtracters. Total number of multiplications is 16, and they have to be performed it 4 clock cycles to maintain the same data rate through all the MCLT converter, so 4 multiplier-accumulator modules are needed. One FD point calculation uses 4 different sine/cosine coefficients, so a single ROM is sufficient. The phase rotator uses horizontal and vertical fractional pixel shift values for the second time (first was for the window function) and combines them with the multiplexed horizontal and vertical indexes (3 bits each) as address inputs of the coefficient ROM. Rotator provides rotated coefficients that correspond to the MCLT transform of the pixel data shifted in both horizontal and vertical directions by up to ±0.5 pix at a rate of 1 coefficient per clock cycle, providing both output data and address to the external multi-page memory buffer.

    Modulated Complex Lapped Transform for the Bayer Mosaic Color Images

    Fig. 8. Lateral chromatic aberration and the Bayer mosaic: a) monochrome (green) PSF, b) composite color PSF, c) Bayer mosaic of the sensor (direction of aberration shown), d) distorted mosaic matching chromatic aberration of b).

    Most camera applications use color image sensors that provide Bayer mosaic color data: one red, one blue and two diagonal green pixels in each 2×2 pixel group. In old times image senor pixel density was below the lenses optical resolution and all the cameras performed color interpolation (usually bilinear) of the “missing” colors. This process implied that each red pixel, for example, had four green neighbors (up, down, left and right) at equal distance, and 4 blue pixels located diagonally. With the modern high-resolution sensors it is not the case, possible distortions are shown on Figure 8 (copied from the earlier post). More elaborate “de-mosaic” processing involves non-linear operations that would influence sub-pixel correlation results.

    As the simple de-mosaic procedures can not be applied to the high resolution sensors without degrading the images, we treat each color subchannel individually, merging results after performing the optical aberration correction in the FD, or at least compensating the lateral chromatic aberration that causes pixels x/y shift.

    Channel separation means that when converting data to the FD the input data arrays are decimated: for green color only half of the values (located in a checkerboard pattern) are non-zero, and for red and blue subchannels – only quarter of all pixels have non-zero values. The relative phase of the remaining pixels depends on the required integer pixel offset (only ±0.5 are delegated to the FD) and may have 2 values (black vs. white checkerboard cells) for green, and 4 different values (odd/even for each of the horizontal/vertical directions independently) for red and blue. MCLT can be performed on the sparse input arrays the same way as on the complete 16×16 ones, low-pass filters (LPF) may be applied on the later processing stages (LPF may be applied to the deconvolution kernels used for aberration correction during camera calibration). It is convenient to multiply red and blue values by 2 to compensate for the smaller number of participating pixels compared to the green sub-channel.

    Direct approach to calculate FD transformation of the color mosaic input image would be to either run the monochrome converter (Figure 5) three times (in 256*3=768 clock cycles) masking out different pixel pattern each time, or implement the same module (including the input pixel buffer) three times to achieve 256 clock cycle operation. And actually the 16×16 pixel buffer used in monochrome converter is not sufficient – even the small lateral chromatic aberration leads to mismatch of the 16×16 tiles for different colors. That would require using larger – 18×18 (for minimal aberration) pixel buffer or larger if that aberration may exceed a full pixel.

    Luckily MCLT that consists of MDCT and MDST, they it turn use DCT-IV and DST-IV that have a very convenient property when processing the Bayer mosaic data. Almost the same implementation as used for the monochrome converter can transform color data at the same 256 clock cycles. The only part that requires more resources is the final phase rotator – it has to output 3*4*64 values in 256 clock cycles so three instances of the same rotator are required to provide the full load of the rest of the circuitry. This almost (not including rotators) triple reduction of the MCLT calculation resources is based on “folding” of the input data for the DTT inputs and the following DCT-IV/DST-IV relation.

    Relation between DCT-IV and DST-IV

    DST-IV is equivalent to DCT-IV of the input sequence, where all odd input samples are multiplied by -1, and the result values are read in the reversed order. In the matrix form it is shown in (1) below:

    .equation td{text-align:center; vertical-align:middle; } td.eq-no{ width:5%; } table.equation { width:100%; } DST IV = [ 0       1       1       1       ⋅⋅⋅       1       0 ] ⋅ DCT IV ⋅ [ 1       0   -1           1           ⋅⋅⋅   0       -1 ] (1)

    Equations (2) and (3) show definitions of DCT-IV and DST-IV[4]:

    DCT IV (k) = 2N ⋅ ∑ l=0 N–1 cos( π N ⋅(l+12) ⋅(k+12) ) (2) DST IV (k) = 2N ⋅ ∑ l=0 N–1 sin( π N ⋅(l+12) ⋅(k+12) ) (3)

    The modified by (1) DST-IV can be re-writted as the two separate sums for even (l=2*m) and odd (l=2*m+1) input samples and by replacing output samples k with reversed (N-1-k). Then after removing full periods (n*2*π) and applying trigonometric identities it can be converted to the same value as DST-IV:

    DCT mod IV (k) = 2N ⋅ ( ∑ m=0 N/2–1 cos( π N ⋅(2⋅m+12) ⋅((N–1–k)+12) ) – ∑ m=0 N/2–1 cos( π N ⋅(2⋅m+32) ⋅((N–1–k)+12) ) ) = 2N ⋅ ( ∑ m=0 N/2–1 cos( π N ⋅( 12⋅N – (2⋅m+12) ⋅(k+12) ) ) – ∑ m=0 N/2–1 cos( π N ⋅( 32⋅N – (2⋅m+32) ⋅(k+12) ) ) ) = 2N ⋅ ( ∑ m=0 N/2–1 cos( –π2 +π N ⋅(2⋅m+12) ⋅(k+12) ) – ∑ m=0 N/2–1 cos( π2 + π N ⋅(2⋅m+32) ⋅(k+12) ) ) = 2N ⋅ ( ∑ m=0 N/2–1 sin( π N ⋅(2⋅m+12) ⋅(k+12) ) + ∑ m=0 N/2–1 sin( π N ⋅(2⋅m+32) ⋅(k+12) ) ) = DST IV (k) (4) Two-dimensional MCLT Symmetries for the Bayer Mosaic Sub-arrays

    As shown in Figure 1b, the 2*N-long input sequence of four fragments (a,b,c,d) is folded in N-long sequence (5) for DCT-IV input:

    (a,b,c,d) ⟶ (– c ~ –d,a– b ~ ) (5)

    and (6) – for DST-IV:

    (a,b,c,d) ⟶ ( c ~ –d,a+ b ~ ) (6)

    where tilde “~” over the name indicates reversal of the segment. Such direction reversal for the sequences of even length (N/2=4 in our case) swaps odd- and even-numbered samples. Each of the halves of each of (5) and (6) show that both have the same (–d,a) for direct and differ in sign for reversed: (– c~ , –b~ ) and ( c~ , b~ ) terms. Each of the input samples appears exactly once in each of the DCT-IV and DST inputs. Even samples of the input sequence contribute identically to the even samples of both output sequences (through a an d) and contribute with the opposite signs to the odd output samples (through b and c). And similarly the odd-numbered input samples contribute identically to the odd-numbered output samples and with the opposite sign to the even ones.

    So, sparse input sequence with only non-zero values at even positions result in both DCT-IV and DST-IV having the same values at even positions and multiplied by -1 – in the odd. Odd-only input sequences result in same values in odd positions and negatives – in the even ones.

    Now this property can be combined to the previously discussed DST-IV to DCT-IV relation and extended to the two-dimensional Bayer mosaic case. We can start with the red and blue channels that have 1-in-4 non-zero pixels. If the non-zero pixels are in the even columns, then after first horizontal pass the DST-IV output will differ from the DCT-IV only by the output samples order according to (1), as even input values are the same, and odd are negated. Reversed output coefficients order for the horizontal pass means that the DST-IV will be just horizontally flipped with respect to the DCT-IV one. If the non-zero input was for the odd columns, then the DST-IV output will be reversed and negated compared to the DCT-IV one.

    The second (vertical) pass is applied similarly. If original pattern had non-zero even rows, the result of vertical DST-IV would be the same as those of the DCT-IV after a vertical flip. If the odd rows were non-zero instead – the result will be vertically flipped and negated with respect of the DCT-IV one.

    This result means that for the 1-in-4 Bayer mosaic array only one DCT-IV/DCT-IV transform is need. The three other DTT combinations may be obtained by flipping the DCT-IV/DCT-IV output horizontally and/or vertically and possibly negating all the data. Reversal of the readout order does not require additional hardware resources (just a few gates for a little fancier memory address counter). Data negation is also “free” as it can easily be absorbed by the phase rotator that already has adders/subtracters and just needs an extra inversion sign control. Flips do not depend on the odd/even columns/rows, only negation depends on them.

    Green channel has non-zero values in either (0,0) and (1,1) or (0,1) and (1,0) positions. So it can be considered as a sum of two 1-in-4 channels described above. In the (0,0) case neither horizontal, no vertical DST-IV inverts sign, and (1,1) inverts both, so the DST-IV/DST-IV does not have any inversion, and it will stay true for the green color – combination of (0,0) and (1,1). We can not compare DCT-IV/DCT-IV with ether DST-IV/DCT-IV or DCT-IV/DST-IV, but these two will have the same sign compared to each other (zero or double negation). So the DST-IV/DST-IV will be double-flipped (both horizontally and vertically) version of the DCT-IV/DCT-IV output, and DCT-IV/DST-IV – double flipped version of DST-IV/DCT-IV, and calculation for this green pattern requires just two of the 4 DTT operations. Similarly, for the (0,1)/(1,0) green pattern DST-IV/DST-IV will be double-flipped and negated version of the DCT-IV/DCT-IV output, and DCT-IV/DST-IV – double-flipped and negated version of DST-IV/DCT-IV.

    Combining results for all three color components: reg, blue and green, we need total of four DTT operations on 8×8 tiles: one for red, one for blue and 2 for green component instead of twelve if each pixel had full RGB value.

    Implementation of the MCLT for the Bayer Mosaic Pixel Tiles

    Figure 9. MCLT converter for the Bayer mosaic tiles

    MCLT converter for the Bayer Mosaic data shown in Figure 9 is similar to the already described converter for the monochrome data. The first difference is that now the Fold sequencer addresses larger source tiles – it is run-time configurable to be 16×16, 18×18, 20×20 or 22×22 pixels. Each tile may have different size – this functionality can be used to optimize external memory access: use smaller tiles in the center areas of the sensor with lower lateral chromatic aberrations, read larger tiles in the peripheral areas with higher aberrations. The extended tile should accommodate all three color shifted 16×16 blocks as shown in Figure 10.

    Figure 10. Example color channels 16×16 pixel tiles mismatch due to the chromatic aberration

    The start input triggers calculation of all 3 colors, it can appear either each 256-th clock cycle or it needs to wait for at least 16 cycles after the end of the previous tile input if started asynchronously. X/Y offsets are provided individually for each color channel and they are stored in a register file inside the module, the integer top-left corner offset is provided separately for each color to simplify internal calculations.

    Dual-port window ROM is the same as in the monochrome module, both horizontal and vertical window components are combined by a multiplier and then the result is multiplied by the pixel data received from the external memory buffer with configurable latency.

    There are only two (instead the four for monochrome) accumulators required because each of the “folded” elements of the 8×8 DTT inputs has two contributor source pixels at most (for green color), while red and blue use just a single accumulator. Red color is processed first, then blue, and finally green. For the first ones only the DCT-IV/DCT-IV input data is prepared (in 64 clock cycles each), green color requires DST-IV/DCT-IV block additionally (128 clock cycles).

    DTT output data goes to three parallel dual-port memory buffers. Red and blue channels each require just single 64-element array that later provides 4 different values in 4 clock cycles for horizontally and vertically flipped data. Green channel requires 2*64 element array. These buffers are filled one at a time (red, blue, green) and each of them feed corresponding phase rotator. Rotator outputs need 256 cycles to provide 4*64=256 FD values, the symmetry exploited during DTT conversion is lost at this stage. The three phase rotators can provide output addresses/data asynchronously or the green and blue outputs can be delayed and output simultaneously with the green channel, sharing the external memory addresses.

    MCLT Converter for the Bayer Mosaic Tiles Simulation and Comparison to the Software Post-processing results

    Figure 11. MCLT converter for the Bayer mosaic simulation

    Results of the simulation of the MCLT converter are shown in Figure 11. Simulation ran with the clock frequency set to 100 MHz, the synthesized code should be able to run at at least 250 MHz in the Zynq SoC of the NC393 camera – at the same rate as is currently used for the data compression – all the memories and DSPs are fully buffered. Simulation show processing of the 3 tiles: two consecutive and the third one after a minimal pause. The data input was provided by the Java code that was written for the post-processing of the camera images, the same code generated intermediate results and the output of the MCLT conversion. Java used code involved double precision floating point calculations, while the RTL code is based on fixed-point calculations with the precision of the architecture DSP primitives, so there is no bit-to-bit match of the data, instead there is a difference that corresponds to the width of the used data words.


    The RTL code (licensed under GNU GPLv3+) developed for the MCLT-based frequency domain conversion of the Bayer color mosaic images uses almost the same resources as the monochrome image transformation for the tiles of the same size – three times less than a full RGB image would require. Input window function modification that accounts for the two-dimensional fractional pixel shift and the post-transform phase rotators allow to avoid re-sampling for the image rectification that degrades sub-pixel resolution.

    This module is simulated with Icarus Verilog (a free software simulator), results are compared to those of the software post-processing used for the 3-d scene reconstruction of multiple image sets, described in the “Long range multi-view stereo camera with 4 sensors” post.

    The MCLT module for the Bayer color mosaic images is not yet tested in the FPGA of the camera – it needs more companion code to implement a full processing chain that will generate disparity space images (DSI) required for the real-time 3d scene reconstruction and/or output of the aberration-corrected image textures. But it constitutes a critical part of the Tile Processor project and brings the overall system completion closer.


    [1] Thiébaut, Éric, et al. “Spatially variant PSF modeling and image deblurring.” SPIE Astronomical Telescopes+ Instrumentation. International Society for Optics and Photonics, 2016. pdf

    [2] Řeřábek, M., and P. Pata. “The space variant PSF for deconvolution of wide-field astronomical images.” SPIE Astronomical Telescopes+ Instrumentation. International Society for Optics and Photonics, 2008.pdf

    [3] Malvar, Henrique. “A modulated complex lapped transform and its applications to audio processing.” Acoustics, Speech, and Signal Processing, 1999. Proceedings., 1999 IEEE International Conference on. Vol. 3. IEEE, 1999.pdf

    [4] Britanak, Vladimir, Patrick C. Yip, and Kamisetty Ramamohan Rao. Discrete cosine and sine transforms: general properties, fast algorithms and integer approximations. Academic Press, 2010.

    Updated Poky to Rocko

    Wed, 12/20/2017 - 17:15

    We have updated the Yocto build system to Poky Rocko released back in October. Here’s a short summary table of the updates:

    before after Poky 2.0 (Jethro) 2.4 (Rocko) gcc 5.3.0 7.2.0 linux kernel 4.0 4.9

    Other packages got updates as well:

    • apache2-2.4.18 => apache2-2.4.29
    • php-5.6.16 => php-5.6.31
    • udev-182 changed to eudev-3.2.2, etc.

    This new version is in the rocko branch for now but will be merged into master after some transition period (and the current master will be moved to jethro branch). Below are a few tips for future updates.


    It was relatively simple to update the recipes and build the images. Most of the packages were updated to newer versions. One noticeable change is tmp/sysroots/ is not a common location anymore – instead there is the recipe-sysroot/ directory in each recipe’s build directory. Also, our headers didn’t get to the sysroots which got fixed by having:
    sysroot_stage_all_append() {
        sysroot_stage_dir ${WORKDIR}/headers/include ${SYSROOT_DESTDIR}/usr/include-uapi
    We had this task in Jethro but there was another variable used ${STAGING_DIR_TARGET} (instead of ${SYSROOT_DESTDIR})

    Linux kernel

    It took most of the time to update it.

    Luckily our drivers didn’t break but there were a few problems:

        1. Serial output

    First, we merged our drivers and patches to make the kernel get built at least. After building the kernel the serial output didn’t work – the fix was to switch to earlycon in bootargs in the device tree (in 4.0 ealyprintk still worked):
    chosen {
      bootargs = "cma=128M console=ttyPS0,115200 root=/dev/mmcblk0p2 rw earlyprintk rootwait rootfstype=ext4";
      linux,stdout-path = "/amba@0/serial@e0000000";
    chosen {
      bootargs = "earlycon cma=128M root=/dev/mmcblk0p2 rw rootwait rootfstype=ext4";
      stdout-path = "serial0:115200n8";

        2. Ethernet

    Next, there was no ethernet. After some time of debugging the problem was with the device tree again:
    phy3: phy@3 {
      compatible = "atheros,8035";
      device_type = "ethernet-phy";
      reg = <0x3>;
    phy3: phy@3 {
      /* Atheros 8035 */
      compatible = "ethernet-phy-id004d.d072";
      /* compatible = "ethernet-phy-ieee802.3-c22";*/
      device_type = "ethernet-phy";
      reg = <0x3>;
    The compatible field has to be in a certain format. See the docs.
    The next thing has Xilinx switched from xlnx,ps7-ethernet-1.00.a to cdns,zynq-gem (see arch/arm/boot/dts/zynq-7000.dtsi). Didn’t track when – both drivers work. The cdns,zynq-gem is in the mainline, the other is not.
    And the last change to the ethernet driver was adding the fixup for Atheros 8035, the network chip we use.

        3. Bitstream

    Then the bitstream wouldn’t get programmed. It was Xilinx’s problem – new gcc was opting away some of the variables (dma_done) in the driver – they fixed the kernel but didn’t change the hash of the linux-xlnx in the recipe yet ( So, it’s probably coming to 2017.4.

        4. Other minor things

    Xilinx has changed some of its default config variables and in 4.9 there are a few new dependencies here and there.
    The changes made to our config (elphel393.cfg) which is added to the default config for zynq devices:
    # This option is for FPGA part
    # prints time before messages
    # dependency for DYNAMIC_DEBUG=y
    # turned off because old:



    udev package switched from udev-182 changed to eudev-3.2.2. udev-182 can copy static device files from /lib/udev/devices/, eudev-3.2.2 does not do this – created a new rule:
    ACTION=="add", RUN+="/usr/bin/rsync -a /lib/udev/devices/ /dev/"
    This rule adds up ~6 secs to boot time for some reason vs almost nothing if run from the camera init script

    Other notes
    • Reminder: the system boots into initramfs first and runs init built by initramfs-live-boot. The script runs switch_root in the end. udev daemon gets killed and restarted

    Assembling Long Range Stereo Camera

    Wed, 12/13/2017 - 23:06

    MNC393-XCAM partial assembly and parts

    The long anticipated parts for the Long range camera have arrived!

    The mechanical parts for the MNC393-XCAM – Long Range Multi-view Stereo Camera are machined, tested, and ready to be anodized. This enables us to have the X-camera assembled before the winter holidays. The holiday break will provide a good opportunity to test the camera, capture new photos, and create robust 3D models from calibrated images.
    The titanium X-frame of the camera ensures thermal stability required for continuous accuracy of 3D measurements. The aluminum enclosure and sealed lens filters weatherproof the system allowing for the proposed outdoor use of the camera.
    We intend to assemble two cameras: one with a 150 mm distance between the sensors and another with a longer baseline. The expected accuracy for the camera with the shorter baseline is greater than 10% at a 200 meter distance. We have achieved 10% accuracy with H-camera with calibrated sensors, even though the 3D-printed parts were not thermally stable and some error was accumulated over time. It was a very pleasant surprise that the software was still able to deal with somewhat un-calibrated images and detect distances very accurately, creating impressive 3D-scenes: Scene_viewer The second camera will have a 280 mm distance between sensors, which is determined by the longest FPC cables we can use without signal losses. It promises to double the measured distance with the same degree of accuracy, therefore an extremely long range 3D-scenes will be produced.

    The Long Range Multi-View Stereo Camera with 4 sensors MNC393-XCAM is planned for release in early 2018.

    Drupal, WordPress, Mediawiki, Mail Archive, Gitlab – all in one web site

    Sun, 12/03/2017 - 23:23
    Multiple Subdomains for the Same Web Site

    It is a common case when a company or organization uses multiple content management systems (CMS) and specialized web application to organize its web presence. We describe here how Elphel handles such CMS variety and provide the source code that can be customized for other similar sites.

    We currently use the following CMS and web applications:

    • Drupal as a general purpose CMS for the main site
    • WordPress for the development blogs
    • Mediawiki for the wiki-based documentation.
    • Mailman (self hosted) and Mail Archive (external site) for the mailing list that is our main channel of the user technical support
    • Gitlab CE for the code and other version-controlled content. We used Github but switched to self-hosted Gitlab CE following FSF recommendations
    • Other customized versions of FLOSS web applications, such as OSTicket for support tickets and FrontAccounting for inventory and production
    • In-house developed free software web applications, such as x3dom-based 3D scene and map viewer, 3D mechanical assembly viewer available for the assemblies and mechanical components on the wiki, WebGL Panorama Viewer/Editor.

    Figure 1. Elphel web site on a desktop screen

    When users visit company web site they expect to have well organized and logically solid navigation to get the information they came for. There are multiple ways how to combine several applications – for example, it is possible to add wiki to WordPress or use it with Gitlab repositories. Of course it is theoretically possible to develop a web site where all the information in available in these applications will be properly cross-linked and stay current. It may also be possible to customize one of the CMS to handle all the needs, but that would require significant web development resources. For a small company following that path the required number of web developers may easily exceed the number of hardware, FPGA and software developers.

    On the other hand the small company size should not be an excuse – why the web site visitors can not find the information they are looking for. For us it was a very clear signal when our long time users asked if we can email them mechanical drawings of the camera boards and components. For us that meant a bad web site usability – for years we have this information for some 900+ standard and machined parts available in our wiki: DXF and PDF for 2-d drawings, STEP files for 3d in our wiki: 393 series and previous 353 series. Camera mechanical assemblies can be navigated with the x3dom-based viewer.

    Improving Navigation of the Composite Web Site

    As we were not going to give up simultaneous use of multiple content management systems (we needed functionality of each of them) we were looking for the following improvements of information accessibility:

    • to have a common search over all subdomains always available (looking glass icon in the top right corner)
    • as we can not cross-link properly all the information, then at least we have to communicate the idea that there are multiple subdomains to our visitors.

    Figure 2. Site search for ‘ERS’ (acronym for Electronic Rolling Shutter)

    We did try to launch simultaneous search over the main (Drupal) site, wiki, blog (WordPress) and Mail Archive using site-specific requests, but that was available only on the main site, and for Mail Archive we could not add it as it is hosted on the external site. An obvious solution to have site-wide search always accessible, including external sites was to use iframe elements. Figure 2 shows results of simultaneous search for “ERS”, each linking to the corresponding subdomain tab.

    Figure 3. Elphel web site on a small screen

    Immitating Multiple Application Windows

    As we have to use iframe elements for multi-site search we can use the framed layout to tell visitors that instead of a single web site with a consistent navigation there are in fact multiple subdomains, each having different structure and navigation. Instead of a single browser window or tab we are presenting the web site as a stack of individual windows on a desktop as shown on Figure 1. Most web sites do not use full width of the modern computer desktop and have useful content width of just 800-1100 pixels – that gives some room to show parts of the subdomain pages that are out of focus in our case. For the not as wide screens (or browser windows) such as on mobile devices, the page layout is different (Figure 3) – subdomains that are currently out of focus are shown as colored horizontal bars. Navigation is rather intuitive – mouse click on the background tab brings it to focus, following links that lead to the other subdomain (e.g. from the blog tab to wiki) brings the corresponding page to focus and shows linked page there, avoiding the case when there will be two wiki tabs and no blog ones. External links open in new browser tabs – that also helps to keep multi-site structure intact.

    Each subdomain page title (Main, Blog, Wiki/Docs…) is a link to the content shown in the current iframe element. Following that link or using “Copy Link Location” from a context menu opens just that subdomain page alone or copies its address to use it as a bookmark or send over email or post in social media.

    Redirection and Canonical URLs

    Next step after selecting the overall page layout was to decide – how should we show the framed version to the visitors and what URLs to have as canonical for different pages? At that point we had multiple subdomains: for the main site (Drupal), for the wiki (Mediawiki), for the blogs (WordPress). The default domain was reserved for the top page to open individual subdomains in the iframe elements.

    One way to deal with it was to permanently redirect* to*,* to*, and so on, and then to rely on to process the request and open subdomains in respective iframe elements. That would always load all the stack of the pages and may be both intrusive and annoying – visitors may be already aware of the structure of the web site and want just the specific blog post or wiki page. And as iframes are routinely used for serving ads, their functionality may be disabled in the browser of a visitor.

    We use the original URLs as canonical (e.g.*, not*), so search engines will return them as the results, and then redirect (with temporary 302) requests depending on the referrer. If referrer (HTTP_REFERER) is provided in the request and it does not belong to Elphel domain – open the framed version with focus on the tab with the requested URL. Otherwise, with no referrer (directly entered, received via email, disabled for privacy reasons) or when opening the link from other Elphel page (such as already framed version) the page is opened alone.

    Below is the relevant portion of the Apache 2 .htaccess file for the (there are also WordPress-specific rules and enforcing SSL access (use of https instead of insecure http protocol). Lines 5..7 are to disable redirection for certain groups of addresses. Line 2 discloses the fact (by adding “Vary: Referer” to the headers) that response depends on the referrer page.

    #Tell search engines we do not cloak Header always add Vary: Referer RewriteCond %{REQUEST_FILENAME} !-f RewriteCond "%{HTTP_REFERER}" "!^((.*)elphel\.com(.*)|)$" RewriteCond %{REQUEST_URI} !category.*feed RewriteCond %{REQUEST_URI} !^/[0-9]+\..+\.cpaneldcv$ <span style="color:#999999">RewriteCond %{REQUEST_URI} !^/\.well-known/pki-validation/[A-F0-9]{32}\.txt(?:\ Comodo\ DCV)?$</span> RewriteRule ^(.*)${REQUEST_URI} [L,R=302]

    Such redirection is obviously impossible for external sites as Mail Archive and we did not use it for – in both cases if visitors followed those links from the search engines results, they would not expect such pages to be parts of the well cross-linked company web site.


    Source code: iframed

    The main frame is a single page with iframe elements. The iframes’ addresses are hard coded in the index.php.
    To avoid, where possible, displaying the same content in multiple frames window.postMessage() method is used. Link click events in an iframe are blocked and the link data is passed to the main frame via postMessage for analysis following focusing on an iframe with a matching domain and refreshing the links or opening a new tab for external websites.

    The following code was added to each controlled subdomain (see elphel_messenger.js):

    <script src=""></script> <script> ElphelMessenger.init(); </script>

    Usability ans SEO

    We are trying to improve usability of the company web site that effectively is a collection of subdomains each running different CMS without major investments into the web design. At the same time we were trying to avoid mistakes in SEO (were we do not have much experience) that would make Elphel invisible for the search engines. Our SEO concerns were caused by the following reasons:

    • the top (almost like a “frameset” in older HTML days) page itself has very little content – most is provided in the included iframe elements
    • the served content depends on the referrer address – that might be considered as “cloaking.”

    Knowing nothing how algorithms are distinguishing between cloaking and legitimate referrer redirects (such as visitors geographical location) we assumed that cloaking shows indexing bots some information that is not served to the human visitors. Our case is different – visitors get either exactly the same information as the bots, or the same one plus additional. We indicated that the served content depends on the referrer URL in the response header (so smart bot can try variable values for “referer”). Testing the new web site for half year did not show any negative effect on the web site visibility in the search engines.

    But the main question still remains – did we reach our goal to improve web site usability?

    • Were we able to communicate the idea that the site consists of multiple loosely-connected subdomains with different navigation?
    • Is the framed site navigation intuitive or annoying?
    • Does the combined search over multiple subdomains do its job and does it behave as users expect?

    Only our visitors can tell us this, user comments are always welcome. And while we do not have a one-click distribution to use this solution for other similar sites (some customization will be needed) our code is available in the Git repository iframed.

    Developing with Eclipse CDT and Yocto – Linux kernel and applications

    Wed, 11/22/2017 - 17:32

    Elphel uses embedded GNU/Linux distribution based on Yocto. For most of our development (excluding just mechanical and PCB design) we use universal Eclipse IDE: for FPGA development, Linux kernel drivers development, embedded applications and web applications, for editing LaTeX texts. And we use this popular IDE for delivering pre-configured projects to our users to make it easier for them to start efficient modification of the initial camera software and then initiate the new projects.

    We tried to use Yocto plugin but were not able to configure it for kernel development, and the kernel drivers development is one of the the largest and probably is the most difficult part of the camera software development, the part were we need Eclipse IDE assistance most.

    One of the major challenges of using code analysis tools of Eclipse CDT with the Linux kernel is that there are so many files that define the same names. These files are selected during the build process, and for correct code analysis Eclipse CDT has to reproduce rather complex Linux system of configuration, multi-level macro defines to resolve references in the source code. We were able to solve this problem (to some extent), but it required a fair amount of manual tweaking and was not universal – developing applications would require different modifications.

    Figure 1. Excluded source directories in the Navigator (left) panel are crossed out.

    At the same time all the software components in the distribution are built with the powerful Bitbake build system, and existing ”recipes” and the invoked Makefiles “know” which files to use. Following DRY (“don’t repeat yourself”) principle we removed references to “make” command in the project build and replaced them with bitbake command “bitbake <target> -c compile -f” running in the Eclipse console. As we did that for the main build command (CDT Builder) the console output is parsed for errors and warnings, results appear as problem markers in “Problems” view and in the source code. To help Eclipse CDT (and users who navigate the source code with it) limit attention to only files and directories that are actually used in the bitbake build process we implemented the following trick (and coded it as script):

    • Initialize source/headers directories with bitbake, so it “knows” that everything needs to be rebuilt for the project
    • Create a list of the source files (resolving symlinks when needed) and “touch” them, setting modification timestamps. This action prepares the files so the next (first after modification) file access will be recorded as access timestamp. Record the current time.
    • Wait a few seconds to reliably distinguish if each file was accessed after modification
    • Run bitbake build (“bitbake <target> -c compile -f”)
    • Scan all the files from the previously created source list and generate ”include_list” of those that were accessed during the build process.
    • As CDT accepts only “exclude” filters in this context, recursively combine full source list and include_list to generate “exclude_list” pruning all the branches that have nothing to include and replacing them with the full branch reference
    • Apply the generated exclusion list to the CDT project file “.cproject”

    In the case of such complex system as Linux kernel there still remain several incorrectly resolved links, because different parts of the code may use different header files that share the same names, but there are not many of them, they can be handled manually.

    Elphel wiki page Eclipse_CDT_projects_with_bitbake has more details on using Elphel camera projects with Eclipse IDE.

    Natural environments in 3D with Elphel camera and Blender

    Mon, 09/25/2017 - 13:44

    Setting 3D camera on the rock at Cape Alava

    Testing 3D camera on a road trip

    In August of 2017, my family and I went on a trip to the Pacific Northwest, partially for a much needed vacation, but equally as importantly, to test my dad’s new 3D camera. My dad had been designing calibrated multi-sensor cameras for as long as I can remember, and since February was working determinately on developing principally new algorithms for reconstructing a 3D model from a set of 4 simultaneously taken photographs . Now that the camera and the software were ready, there was no better time to test it.

    The main portion of our trip was spent at the Olympic Peninsula in Washington. The gorgeous and complex natural setting was perfect for testing the abilities and limits of the camera. The camera lenses are arranged in a square configuration, each lens the same distance apart from the other. Such configuration is mimicking the position of human eyes, while adding the vertical pair helps measure distances to horizontal objects as well as vertical ones. Our depth perception, comes entirely from our brain’s ability to combine together two (from each eye) images to create a 3D space within our mind. The camera operates very similarly, using parallax, the technique that has been used throughout time to calculate distances using fairly simple geometry. Elphel 3D camera takes four individual images, which is more precise than two, and our software program calculates the distances to each object in the scene, combining data from 4 photographs and creates 3D model of the scene. You can explore our models with Elphel 3D model viewer: Also the models can be opened with 3D modeling program, such as Blender, and as a result you appear to be standing in a 3D realm, experiencing the environment, in a whole new sense .

    screenshot of the 3D model of ocean waves

    Although the camera is meant to be a long-range photogrammetric camera, capable of accurately measuring distances at 200 meters and farther, we were fascinated with the idea of creating realistic 3D environments where we can virtually walk through. Throughout most of the testing we chose locations which would include the complex organic forms of nature, as to test the camera’s ability to work with finer details and non-geometric forms . Naturally, we shot parts of the Olympic National Park rain forest, with very exciting results. We also photographed the ocean waves crashing onto the shores rocks, and many more natural beauties. While my brother and I were taking the photographs, many times we would have to scale rocks, laptop and tripod in hand, in order to get the proper location. The process could get tedious, but at times was oddly exciting. At one point, we wanted to get an up-close shot of the waves just as they were arching over the rocks, but to do so, I had to act as a shield in case the water got too close to the camera, ready to leap in front of it at any time. Another time, my brother packed the camera into his backpack and biked to the most northwestern point on the US mainland, Cape Flattery, and took a series of images there. My dad had also used the images we had taken as the test data to find and fix bugs in his software.

    All in all, the experience was really helpful, the vacation was rejuvenating, and the results were astounding, and it gave me the feeling that my family and I were creating something of the future. I don’t really know how this whole idea wouldn’t be considered exciting. Imagine, if you could just take a picture, and have it turned into a 3D space. Not only does the idea itself seem like the fantastic inventions of any quality science fiction (aka super cool), but it’s application to the real world are endless.

    Working with Elphel models in Blender

    We have selected our favorite scenes to create a virtual path through the Olympic rain forest. Each model can be generated with a specified level of details, usually 500 meshes is enough for many scenes, however the rain forest looks more realistic when it is created with 2000 meshes.

    $("#forest01").player(1); Tutorial

    The procedure for importing and combining 3D model files in Blender, a Free software for 3D modeling and animation, is fairly simple.
    Elphel example 3D meshes can be downloaded from by opening the desired scene, pressing the download icon (↓) and extracting archive into directory on your computer.

    3D scene download

    Download Blender form this link:, and follow the installation instructions available for Linux, Windows, and Mac.
    Once Blender has been opened, the scene must be cleared. This is done by pressing “a” until everything is highlighted and then pressing “x”

    Clear scene

    The 3D model file can then be opened in Blender by selecting the .obj option in the import menu and then selecting the downloaded *.obj file.

    Import menu in Blender

    Follow these instructions if the computer struggles with graphics:
    On right side of the window you will find the modifier options on the properties panel. Select the decimate modifier and set a ratio that works for your computer, this ratio will reduce the number of triangles but might also seriously warp textures. Do not press the apply button as viewport performance has already been improved.

    Decimate modifier reduces the number of triangles

    You will find that the imported mesh is gray; to view the textures change the viewport render mode to textured at the bottom of the 3d viewport (this may take a while on slower computers).

    Texture mode

    Now that textures are enabled, simulated lighting must be disabled. To do this hover mouse over the viewport and press “n” to open the properties region. Under the display tab check the “shadeless” box (this option is only available if the viewport render mode is set to textured).

    Imported mesh with textures

    The procedure can be repeated to import more models and manipulate them in Blender creating panoramic view of the city streets, path in the forest and other realistic 3D environments.

    Long range multi-view stereo camera with 4 sensors

    Wed, 09/20/2017 - 23:40

    Figure 1. Four sensor stereo camera model

    Four-camera stereo rig prototype is capable of measuring distances thousands times exceeding the camera baseline over wide (60 by 45 degrees) field of view. With 150 mm distance between lenses it provides ranging data at 200 meters with 10% accuracy, production units will have higher accuracy. Initial implementation uses software post-processing, but the core part of the software (tile processor) is designed as FPGA simulation and will be moved to the actual FPGA of the camera for the real time applications.

    Scroll down or just hyper-jump to Scene viewer for the links to see example images and reconstructed scenes.


    Most modern advances in the area of the visual 3d reconstruction are related to structure from motion (SfM) where high quality models are generated from the image sequences, including those from the uncalibrated cameras (such as cellphone ones). Another fast growing applications depend on the active ranging with either LIDAR scanning technology or time of flight (ToF) sensors.

    Each of these methods has its limitations and while widespread smart phone cameras attracted most of the interest in the algorithms and software development, there are some applications where the narrow baseline (distance between the sensors is much smaller, than the distance to the objects) technology has advantages.

    Such applications include autonomous vehicle navigation where other objects in the scene are moving and 3-d data is needed immediately (not when the complete image sequence is captured), and the elements to be ranged are ahead of the vehicle so previous images would not help much. ToF sensors are still very limited in range (few meters) and the scanning LIDAR systems are either slow to update or have very limited field of view. Passive (visual only) ranging may be desired for military applications where the system should stay invisible by not shining lasers around.

    Technology snippets Narrow baseline and subpixel resolution

    The main challenge for the narrow baseline systems is that the distance resolution is much worse than the lateral one. The minimal resolved 3d element, voxel is very far from resembling a cube (as 2d pixels are usually squares) – with the dimensions we use: pixel size – 0.0022 mm, lens focal length f = 4.5 mm and the baseline of 150 mm such voxel at 100 m distance is 50 mm high by 50 mm wide and 32 meters deep. The good thing is that while the lateral resolution generally is just one pixel (can be better only with additional knowledge about the object), the depth resolution can be improved with reasonable assumptions by an order of magnitude by using subpixel resolution. It is possible when there are multiple shifted images of the same object (that for such high range to baseline ratio can safely be assumed fronto-parallel) and every object is presented in each image by multiple pixels. With 0.1 pixel resolution in disparity (or shift between the two images) the depth dimension of the voxel at 100 m distance is 3.2 meters. And as we need multiple pixel objects for the subpixel disparity resolution, the voxel lateral dimensions increase (there is a way to restore the lateral resolution to a single pixel in most cases). With fixed-width window for the image matching we use 8×8 pixel grid (16×16 pixel overlapping tiles) similar to what is used by some image/video compression algorithms (such as JPEG) the voxel dimensions at 100 meter range become 0.4 m x 0.4 m x 3.2 m. Still not a cube, but the difference is significantly less dramatic.

    Subpixel accuracy and the lens distortions

    Matching images with subpixel accuracy requires that lens optical distortion of each lens is known and compensated with the same or better precision. Most popular way to present lens distortions is to use radial distortion model where relation of distorted and ideal pin-hole camera image is expressed as polynomial of point radius, so in polar coordinates the angle stays the same while the radius changes. Fisheye lenses are better described with “f-theta” model, where linear radial distance in the focal plane corresponds to the angle between the lens axis and ray to the object.

    Such radial models provide accurate results only with ideal lens elements and when such elements are assembled so that the axis of each individual lens element precisely matches axes of the other elements – both in position and orientation. In the real lenses each optical element has minor misalignment, and that limits the radial model. For the lenses we had dealt with and with 5MPix sensors it was possible to get down to 0.2 – 0.3 pixels, so we supplemented the radial distortion described by the analytical formula with the table-based residual image correction. Such correction reduced the minimal resolved disparity to 0.05 – 0.08 pixels.

    Fixed vs. variable window image matching and FPGA

    Modern multi-view stereo systems that work with wide baselines use elaborate algorithms with variable size windows when matching image pairs, down to single pixels. They aggregate data from the neighbor pixels at later processing stages, that allows them to handle occlusions and perspective distortions that make paired images different. With the narrow baseline system, ranging objects at distances that are hundreds to thousands times larger than the baseline, the difference in perspective distortions of the images is almost always very small. And as the only way to get subpixel resolution requires matching of many pixels at once anyway, use of the fixed size image tiles instead of the individual pixels does not reduce flexibility of the algorithm much.

    Processing of the fixed-size image tiles promises significant advantage – hardware-accelerated pixel-level tile processing combined with the higher level software that operates with the per-tile data rather than with per-pixel one. Tile processing can be implemented within the FPGA-friendly stream processing paradigm leaving decision making to the software. Matching image tiles may be implemented using methods similar to those used for image and especially video compression where motion vector estimation is similar to calculation of the disparity between the stereo images and similar algorithms may be used, such as phase-only correlation (PoC).

    Two dimensional array vs. binocular and inline camera rigs

    Usually stereo cameras or fixed baseline multi-view stereo are binocular systems, with just two sensors. Less common systems have more than two lenses positioned along the same line. Such configurations improve the useful camera range (ability to measure near and far objects) and reduce ambiguity when dealing with periodic object structures. Even less common are the rigs where the individual cameras form a 2d structure.

    In this project we used a camera with 4 sensors located in the corners of a square, so they are not co-linear. Correlation-based matching of the images depends on the detailed texture in the matched areas of the images – perfectly uniform objects produce no data for depth estimation. Additionally some common types of image details may be unsuitable for certain orientations of the camera baselines. Vertical concrete pole can be easily correlated by the two horizontally positioned cameras, but if the baseline is turned vertical, the same binocular camera rig would fail to produce disparity value. Similar is true when trying to capture horizontal features with the horizontal binocular system – such predominantly horizontal features are common when viewing near flat horizontal surfaces at high angles of incidents (almost parallel to the view direction).

    With four cameras we process four image pairs – 2 horizontal (top and bottom) and 2 vertical (right and left), and depending on the application requirements for particular image region it is possible to combine correlation results of all 4 pairs, or just horizontal and vertical separately. When all 4 baselines have equal length it is easier to combine image data before calculating the precise location of the correlation maximums – 2 pairs can be combined directly, and the 2 others after rotating tiles by 90 degrees (swapping X and Y directions, transposing the tiles 2d arrays).

    Image rectification and resampling

    Many implementations of the multi-view stereo processing start with the image rectification that involves correction for the perspective and lens distortions, projection of the individual images to the common plane. Such projection simplifies image tiles matching by correlation, but as it involves resampling of the images, it either reduces resolution or requires upsampling and so increases required memory size and processing complexity.

    This implementation does not require full de-warping of the images and related resampling with fractional pixel shifts. Instead we split geometric distortion of each lens into two parts:

    • common (average) distortion of all four lenses approximated by analytical radial distortion model, and
    • small residual deviation of each lens image transformation from the common distortion model

    Common radial distortion parameters are used to calculate matching tile location in each image, and while integer rounded pixel shifts of the tile centers are used directly when selecting input pixel windows, the fractional pixel remainders are preserved and combined with the other image shifts in the FPGA tile processor. Matching of the images is performed in this common distorted space, the tile grid is also mapped to this presentation, not to the fully rectified rectilinear image.

    Small individual lens deviations from the common distortion model are smooth 2-d functions over the 2-d image plane, they are interpolated from the calibration data stored for the lower resolution grid.

    We use low distortion sorted lenses with matching focal lengths to make sure that the scale mismatch between the image tiles is less than tile size in the target subpixel intervals (0.1 pix). Low distortion requirement extends the distances range to the near objects, because with the higher disparity values matching tiles in the different images land to the differently distorted areas. Focal length matching allows to use modulated complex lapped transform (CLT) that similar to discrete Fourier transform (DFT) is invariant to shifts, but not to scaling (log-polar coordinates are not applicable here, as such transformation would deny shift invariance).

    Enhancing images by correcting optical aberrations with space-variant deconvolution

    Matching of the images acquired with the almost identical lenses is rather insensitive to the lens aberrations that degrade image quality (mostly reduce sharpness), especially in the peripheral image areas. Aberration correction is still needed to get sharp textures in the result 3d models over full field of view, the resolution of the modern sensors is usually better than what lenses can provide. Correction can be implemented with space-variant (different kernels for different areas of the image) deconvolution, we routinely use it for post-processing of Eyesis4π images. The DCT-based implementation is described in the earlier blog post.

    Space-variant deconvolution kernels can absorb (be combined with during calibration processing) the individual lens deviations from the common distortion model, described above. Aberration correction and image rectification to the common image space can be performed simultaneously using the same processing resources.

    Two dimensional vs. single dimensional matching along the epipolar lines

    Common approach for matching image pairs is to replace the two-dimensional correlation with a single-dimensional task by correlating pixels along the epipolar lines that are just horizontal lines for horizontally built binocular systems with the parallel optical axes. Aggregation of the correlation maximums locations between the neighbor parallel lines of pixels is preformed in the image pixels domain after each line is processed separately.

    For tile-based processing it is beneficial to perform a full 2-d correlation as the phase correlation is performed in the frequency domain, and after the pointwise multiplication during aberration correction the image tiles are already available in the 2d frequency domain. Two dimensional correlation implies aggregation of data from multiple scan lines, it can tolerate (and be used to correct) small lens misalignments, with appropriate filtering it can be used to detect (and match) linear features.

    Implementation Prototype camera

    Experimental camera looks similar to Elphel regular H-camera – we just incorporated different sensor front ends (3d CAD model) that are used in Eyesis4π and added adjustment screws to align optical axes of the lenses (heading and tilt) and orientations of the image sensors (roll). Sensors are 5 Mpix 1/2″ format On Semiconductor MT9P006, lenses – Evetar N125B04530W.

    We selected lenses with the same focal length within 1%, and calibrated the camera using our standard camera rotation machine and the target pattern. As we do not yet have production adjustment equipment and software, the adjustment took several iterations: calibrating the camera and measuring extrinsic parameters of each sensor front end, then rotating each of the adjustment screws according to spreadsheet-calculated values, and then re-running the whole calibration process again. Finally the calibration results: radial distortion parameters, SFE extrinsic parameters, vignetting and deconvolution kernels were converted to the form suitable for run-time application (now – during post-processing of the captured images).

    Figure 2. Camera block diagram

    This prototype still uses 3d-printed parts and such mounts proved to be not stable enough, so we had to add field calibration and write code for bundle adjustment of the individual imagers orientations from the 2-d correlation data for each of the 4 individual pairs.

    Camera performance depends on the actual mechanical stability, software-compensation can only partially mitigate this misalignment problem and the precision of the distance measurements was reduced when cameras went off by more than 20 pixels after being carried in a backpack. Nevertheless the scene reconstruction remained possible.


    Multi-view stereo rigs are capable of capturing dynamic scenes so our goal is to make a real-time system with most of the heavy-weight processing be done in the FPGA.

    One of the major challenges here is how to combine parallel and stream processing capabilities of the FPGA with the flexibility of the software needed for implementation of the advanced 3d reconstruction algorithms. This approach is to use the FPGA-based tile processor to perform uniform operations on the lists of “tiles” – fixed square overlapping windows in the images. FPGA processes tile data at the pixel level, while software operates the whole tiles.

    Figure 2 shows the overall block diagram of the camera, Figure 3 illustrates details of the tile processor.

    Figure 3. FPGA tile processor

    Initial implementation does not contain actual FPGA processing, so far we only tested in FPGA some of the core functions – two dimensional 8×8 DCT-IV needed for both 16×16 CLT and ICLT. Current code consists of the two separate parts – one part (tile processor) simulates what will be moved to the FPGA (it handles image tiles at the pixel level), and the other one is what will remain software – it operates on the tile level and does not deal with the individual pixels. These two parts interact using shared system memory, tile processor has exclusive access to the dedicated image buffer and calibration data.

    Each tile is 16×16 pixels square with 8 pixel overlap, software prepares tile list including:

    • tile center X,Y (for the virtual “center” image),
    • center disparity, so the each of the 4 image tiles will be shifted accordingly, and
    • the code of operation(s) to be performed on that tile.

    Figure 4. Correlation processor

    Tile processor performs all or some (depending on the tile operation codes) of the following operations:

    • Reads the tile tasks from the shared system memory.
    • Calculates locations and loads image and calibration data from the external image buffer memory (using on-chip memory to cache data as the overlapping nature of the tiles makes each pixel to participate on average in 4 neighbor tiles).
    • Converts tiles to frequency domain using CLT based on 2d DCT-IV and DST-IV.
    • Performs aberration correction in the frequency domain by pointwise multiplication by the calibration kernels.
    • Calculates correlation-related data (Figure 4) for the tile pairs, resulting in tile disparity and disparity confidence values for all pairs combined, and/or more specific correlation types by pointwise multiplication, inverse CLT to the pixel domain, filtering and local maximums extraction by quadratic interpolation or windowed center of mass calculation.
    • Calculates combined texture for the tile (Figure 5), using alpha channel to mask out pixels that do not match – this is the way how to effectively restore single-pixel lateral resolution after aggregating individual pixels to tiles. Textures can be combined after only programmed shifts according to specified disparity, or use additional shift calculated in the correlation module.
    • Calculates other integral values for the tiles (Figure 5), such as per-channel number of mismatched pixels – such data can be used for quick second-level (using tiles instead of pixels) correlation runs to determine which 3d volumes potentially have objects and so need regular (pixel-level) matching.
    • Finally tile processor saves results: correlation values and/or texture tile to the shared system memory, so software can access this data.

    Figure 5. Texture processor

    Single tile processor operation deals with the scene objects that would be projected to this tile’s 16×16 pixels square on the sensor of the virtual camera located in the center between the four actual physical cameras. The single pass over the tile data is limited not just laterally, but in depth also because for the tiles to correlate they have to have significant overlap. 50% overlap corresponds to the correlation offset range of ±8 pixels, better correlation contrast needs 75% overlap or ±4 pixels. The tile processor “probes” not all the voxels that project to the same 16×16 window of the virtual image, but only those that belong to the certain distance range – the distances that correspond to the disparities ±4 pixels from the value provided for the tile.

    That means that a single processing pass over a tile captures data in a disparity space volume, or a macro-voxel of 8 pixels wide by 8 pixels high by 8 pixels deep (considering the central part of the overlapping volumes). And capturing the whole scene may require multiple passes for the same tile with different disparity. There are ways how to avoid full range disparity sweep (with 8 pixel increments) for all tiles – following surfaces and detecting occlusions and discontinuities, second-level correlation of tiles instead of the individual pixels.

    Another reason for the multi-pass processing of the same tile is to refine the disparity measured by correlation. When dealing with subpixel coordinates of the correlation maximums – either located by quadratic approximation or by some form of center of mass evaluation, the calculated values may have bias and disparity histograms reveal modulation with the pixel period. Second “refine” pass, where individual tiles are shifted by the disparity measured in the previous pass reduces the residual offset of the correlation maximum to a fraction of a pixel and mitigates this type of bias. Tile shift here means a combination of the integer pixel shift of the source images and the fractional (in the ±0.5 pixel range) shift that is performed in the frequency domain by multiplication by the cosine/sine phase rotator.

    Total processing time and/or required FPGA resources linearly depend on the number of required tile processor operations and the software may use several methods to reduce this number. In addition to the two approaches mentioned above (following surfaces and second-level correlation) it may be possible to reduce the field of view to a smaller area of interest, predict current frame scene from the previous frames (as in 2d video compression) – tile processor paradigm preserves flexibility of the various algorithms that may be used in the scene 3d reconstruction software stack.

    Scene viewer

    The viewer for the reconstructed scenes is here: (viewer source code).

    Figure 6. 3d+map index page

    Index page shows a map (you may select from several providers) with the markers for the locations of the captured scenes. On the left there is a vertical ribbon of the thumbnails – you may scroll it with a mouse wheel or by dragging.

    Thumbnails are shown only for the markers that fit on screen, so zooming in on the map may reduce number of the visible thumbnails. When you select some thumbnail, the corresponding marker opens on the map, and one or several scenes are shown – one line per each scene (identified by the Unix timestamp code with fractional seconds) captured at the same locations.

    The scene that matches the selected thumbnail is highlighted (as 4-th line in the Figure 6). Some scenes have different versions of reconstruction from the same source images – they are listed in the same line (like first line in the Figure 6). Links lead to the viewers of the selected scene/version.

    Figure 7. Selection of the map / satellite imagery provider

    We do not have ground truth models for the captured scenes build with the active scanners. Instead as the most interesting is ranging of the distant objects (hundreds of meters) it is possible to use publicly available satellite imagery and match it to the captured models. We had ideal view from Elphel office window – each crack on the pavement was visible in the satellite images so we could match them with the 3d model of the scene. Unfortunately they ruined it recently by replacing asphalt :-).

    The scene viewer combines x3dom representation of the 3d scene and the re-sizable overlapping map view. You may switch the map imagery provider by clicking on the map icon as shown in the Figure 7.

    The scene and map views are synchronized to each other, there are several ways of navigation in either 3d or map area:

    • drag the 3d view to rotate virtual camera without moving;
    • move cross-hair ⌖ icon in the map view to rotate camera around vertical axis;
    • toggle ⇅ button and adjust camera view elevation;
    • use scroll wheel over the 3d area to change camera zoom (field of view is indicated on the map);
    • drag with middle button pressed in the 3d view to move camera perpendicular to the view direction;
    • drag the the camera icon (green circle) on the map to move camera horizontally;
    • toggle ⇅ button and move the camera vertically;
    • press a hotkey t over the 3d area to reset to the initial view: set azimuth and elevation same as captured;
    • press a hotkey r over the 3d area to set view azimuth as captured, elevation equal to zero (horizontal view).

    Figure 8. 3D model to map comparison

    Comparison of the 3d scene model and the map uses ball markers. By default these markers are one meter in diameter, the size can be changed on the settings () page.

    Moving pointer over the 3d area with Ctrl key pressed causes the ball to follow the cursor at a distance where the view line intersects the nearest detected surface in the scene. It simultaneously moves the corresponding marker over the map view and indicates the measured distance.

    Ctrl-click places the ball marker on the 3d scene and on the map. It is then possible to drag the marker over the map and read the ground truth distance. Dragging the marker over the 3d scene updates location on the map, but not the other way around, in edit mode mismatch data is used to adjust the captured scene location and orientation.

    Program settings used during reconstruction limit the scene far distance to z = 1000 meters, all more distant objects are considered to be located at infinity. X3d allows to use images at infinity using backdrop element, but it is not flexible enough and is not supported by some other programs. In most models we place infinity textures to a large billboard at z = 10,000 meters, and it is where the ball marker will appear if placed on the sky or other far objects.

    Figure 9. Settings and link to four images

    The settings page () shown in the Figure 9 has a link to the four-image viewer (Figure 10). These four images correspond to the captured views and are almost “raw images” used for scene reconstruction. These images were subject to the optical aberration correction and are partially rectified – they are rendered as if they were captured by the same camera that has only strictly polynomial radial distortion.

    Such images are not actually used in the reconstruction process, they are rendered only for the debug and demonstration purposes. The equivalent data exists in the tile processor only in the frequency domain form as an intermediate result, and was subject to just linear processing (to avoid possible unintended biases) so the images have some residual locally-checkerboard pattern that is due to the Bayer mosaic filter (discussed in the earlier blog). Textures that are generated from the combination of all four images have the contrast of such pattern significantly lower. It is possible to add some non-linear filtering at the very last stage of the texture generation.

    Each scene model has a download link for the archive that contains the model itself as *.x3d file and Wavefront *.obj and *.mtl as well as the corresponding RGBA texture files as PNG images. Initially I missed the fact that x3d and obj formats have opposite direction of surface normals for the same triangular faces, so almost half of the Wavefront files still have incorrect (opposite direction) surface normals.


    Our initial plan was to test algorithms for the tile processor before implementing them in FPGA. The tile processor provides data for the disparity space image (DSI) – confidence value of having certain disparity for specified 2d position in the image, it also generates texture tiles.

    When the tile processor code was written and tested, we still needed some software to visualize the results. DSI itself seemed promising (much better coverage than what I had with earlier experiments with binocular images), but when I tried to convert these textured tiles into viewable x3d model directly, it was a big disappointment. Result did not look like a 3d scene – there were many narrow triangles that made sense only when viewed almost directly from the camera actual location, a small lateral viewpoint movement – and the image was falling apart into something unrecognizable.

    Figure 10. Four channel images (click for actual viewer with zoom/pan capability)

    I was not able to find ready to use code and the plan to write a quick demo for the tile processor and generated DSI seemed less and less realistic. Eventually it took at least three times longer to get somewhat usable output than to develop DCT-based tile processor code itself.

    Current software is still incomplete, lacks many needed features (it even does not cut off background so wires over the sky steal a lot of surrounding space), it runs slow (several minutes per single scene), but it does provide a starting point to evaluate performance of the long range 4-camera multi-view stereo system. Much of the intended functionality does not work without more parameter tuning, but we decided to postpone improvements to the next stage (when we will have cameras that are more stable mechanically) and instead try to capture more of very different scenes, process them in batch mode (keeping the same parameter values for all new scenes) and see what will be the output.

    As soon as the program was able to produce somewhat coherent 3d model from the very first image set captured through Elphel office window, Oleg Dzhimiev started development of the web application that allows to match the models with the map data. After adding more image sets I noticed that the camera calibration did not hold. Each individual sub-camera performed nicely (they use thermally compensated mechanical design), but their extrinsic parameters did change and we had to add code for field calibration that uses image themselves. The best accuracy in disparity measurement over the field of view still requires camera poses to match ones used at full calibration, so later scenes with more developed misalignment (>20 pixels) are less precise than earlier (captured in Salt Lake City).

    We do not have an established method to measure ranging precision for different distances to object – the disparity values are calculated together with the confidence and in lower confidence areas the accuracy is lower, including places where no ranging is possible due to the complete absence of the visible details in the images. Instead it is possible to compare distances in various scene models to those on the map and see where such camera is useful. With 0.1 pixel disparity resolution and 150 mm baseline we should be able to measure 300 m distances with 10% accuracy, and for many captured scene objects it already is not much worse. We now placed orders to machine the new camera parts that are needed to build a more mechanically stable rig. And parallel to upgrading the hardware, we’ll start migrating the tile processor code from Java to Verilog.

    And what’s next? Elphel goal is to provide our users with the high performance hackable products and freedom to modify them in the ways and for the purposes we could not imagine ourselves. But it is fun to fantasize about at least some possible applications:

    • Obviously, self-driving cars – increased number of cameras located in a 2d pattern (square) results in significantly more robust matching even with low-contrast textures. It does not depend on sequential scanning and provides simultaneous data over wide field of view. Calculated confidence of distance measurements tells when alternative (active) ranging methods are needed – that would help to avoid infamous accident with a self-driving car that went under a truck.
    • Visual odometry for the drones would also benefit from the higher robustness of image matching.
    • Rovers on Mars or other planets using low-power passive (visual based) scene reconstruction.
    • Maybe self-flying passenger multicopters in the heavy 3d traffic? Sure they will all be equipped with some transponders, but what about aerial roadkills? Like a flock of geese that forced water landing.
    • High speed boating or sailing over uneven seas with active hydrofoils that can look ahead and adjust to the future waves.
    • Landing on the asteroids for physical (not just Bitcoin) mining? With 150 mm baseline such camera can comfortably operate within several hundred meters from the object, with 1.5 m that will scale to kilometers.
    • Cinematography: post-production depth of field control that would easily beat even the widest format optics, HDR with a pair of 4-sensor cameras, some new VFX?
    • Multi-spectral imaging where more spatially separate cameras with different bandpass filters can be combined to the same texture in the 3d scene.
    • Capturing underwater scenes and measuring how far the sea creatures are above the bottom.