07/24/18 [x3domlet][master] by Oleg Dzhimiev: testing old functions
07/24/18 [x3domlet][master] by Oleg Dzhimiev: debug
07/24/18 [x3domlet][master] by Oleg Dzhimiev: fixed align heading crossing south
07/24/18 [x3domlet][master] by Oleg Dzhimiev: fixed align heading crossing south
07/24/18 [x3domlet][master] by Oleg Dzhimiev: print old tilt/roll in apply dialog
07/24/18 [x3domlet][master] by Oleg Dzhimiev: proper reset view for tilt/roll
07/24/18 [wiki][10393] by Andrey.filippov: License – updated the CERN OHL v1.1 license link
07/24/18 [wiki][Template:Cad4a, Template:Cad4b, Template:Cad4c, Template:Cad4c assembly] by Andrey.filippov: Usage – updated the CERN OHL v1.1 license link
07/23/18 [x3domlet][master] by Oleg Dzhimiev: tmp
07/23/18 [x3domlet][master] by Oleg Dzhimiev: hid one more tooltip
07/23/18 [x3domlet][master] by Oleg Dzhimiev: fixing alignment application
07/23/18 [x3domlet][master] by Oleg Dzhimiev: fixing base point and marker sliding
07/23/18 [x3domlet][master] by Oleg Dzhimiev: temporary hide markers when dragging
07/23/18 [x3domlet][master] by Oleg Dzhimiev: test fix
CVPR 2018 – from Elphel’s perspective
In this blog article we will recall the most interesting results of Elphel’s participation at the CVPR 2018 Expo, the conversations we had with visitors at the booth, the frequently asked questions as well as the unusual ones, and what we learned from it all. In addition, we will explain our current state of development, our near-term and long-term goals, and how the exhibition helps us achieve them.
The Expo ran from June 19 to 21, and each day had its own focus and results, so this article is organized chronologically.
June 19, CVPR 2018, booth 132
While we are standing nervously at our booth, thinking: “Is there going to be any interest? Will people come, will they ask questions?”, the first poster session starts and a wave of visitors floods the exhibition floor. Our first guest at the booth spends 30 minutes, knowledgeably inquiring about Elphel’s long-range 3D technology, and leaves his business card, saying that he is very impressed. This was a good start to a very busy day full of technical discussions. CVPR is the first exhibition we have participated in where we did not have any problems explaining our projects.
Q: Why does your camera have 4 lenses?
A: With 4 lenses we get stereo pairs in 4 directions: the horizontal pair is responsible for vertical features (the same as in a regular stereo camera); the vertical pair detects horizontal features, while the 2 diagonal pairs ensure that almost any edge is detected. (Elphel Presentation, p.8)
Q: Why not use a lidar?
A: Lidars are not capable of long-range distance measurements. The practical maximum range of a lidar is 150-200 meters, while we target 500-2000 meters with a 150 mm stereo-base. (Elphel Presentation, p.7)
Q: How do you deal with thermal expansion?
A: There are 2 parts to this question: 1) thermal expansion of the calibrated lens is taken care of by the design of the sensor-lens module; 2) thermal expansion of the whole camera is also an issue, especially with the larger stereo-base. The titanium tubes that form the cross of the X-Camera expand more on the “sunny side”, and the whole camera bends inward, like a flower. When taking images under the hot Utah sun, we had to wrap the tubes with insulation covered in aluminum foil.
After some consideration we decided to remove the insulation wrapping for the CVPR show, because it makes the camera more presentable, and were surprised that this question came up so many times – we almost put the insulation cover back on.
Q: Do you build custom lenses?
A: We pre-select standard M12 lenses, because the quality of the lens can vary even within the same model. We have developed a full calibration process and post-processing software to compensate for optical aberrations, allowing us to preserve the full sensor resolution over the camera FOV. Read more about aberration correction on the Elphel blog.
Q: What is your part of the technology?
A: Everything – we build all the parts: hardware (cameras with aberration-corrected lenses and thermally compensated sensor-lens modules), a state-of-the-art calibration facility and process, and 3D-reconstruction software (Free Software, GNU GPL).
Q: What is the accuracy of your photogrammetry?
A: We comfortably work with 1/10 of a pixel.
Q: How dense is your 3D scene?
A: Currently: 324 x 242 overlapping tiles, with no dead zones.
Q: Does it build a 3D model in real-time?
A: Currently the 3D model is reconstructed in software as post-processing of the acquired images. This software is simulated for FPGA porting, with which we can achieve 12 fps 3D reconstruction on the camera. Our final goal is to design and manufacture a custom ASIC with 3D processing on the chip. (Elphel Presentation, p.17)
The last visitor of the day, from a large European company, stopped at our booth around 6pm, 30 minutes before the end of the day, and we spent another hour talking and showing Elphel’s camera and the high-resolution, accurate images it produces. The lights on the floor went down, but we stayed and explained the technology behind our results and our future plans. It was a nice day, well spent.
Day 2: Neural Network Focus. End-to-End vs. Augmented
Our current results:
- 500 meters, high-resolution 3D reconstruction with 5% accuracy and a 150 mm stereo-base (quad-stereo camera), passive (image-based)
- 2000 meters with 5% accuracy with 2 quad-stereo cameras placed 1256 mm apart
- 3D reconstruction is done in post-processing with software simulated for FPGA porting; with the FPGA – 12 fps in 3D
Our near future goals:
- Train the neural network with the current 3D image sets to achieve the same results as in (2) – 2000-meter 3D reconstruction with just one quad-sensor camera
- Port software to FPGA to achieve real-time 3D reconstruction (12 fps)
Our long-term goals:
- Manufacture a custom ASIC for 3D reconstruction to reconstruct 3D scenes on the fly at video frame rate. The small size of the custom chip will allow us to mass-produce a long-range 3D-reconstruction camera for industrial and commercial purposes: smaller form factor, inexpensive, and fast.
- Scalable design: a smaller stereo-base with 4 sensors – phone applications (accurate 3D reconstruction to 200 meters for consumer applications); a larger stereo-base – extremely long-range passive 3D reconstruction for military applications, drone applications, etc.
Elphel’s approach to 3D reconstruction with the help of a neural network is different from the mainstream one. This is partially due to our expertise in building calibrated hardware, which already produces robust 3D measurements. Therefore we see value in pre-processing the images before feeding them to the NN, similar to the eye pre-processing visual information before sending it to the brain. The human eye receives 150 MPix of data, processes it on the retina, and sends only 1 MPix to the brain – a 150-times reduction.
One question came up twice, both times from conference attendees, about the difficulty an NN has processing high-resolution images. The current approach to NN training recommends using low-resolution raw images (640×480), because the NN has to be trained on a large number of images (thousands). The last 6 slides of Elphel’s presentation talk about augmenting the NN with the Elphel high-performance Tile Processor, which significantly reduces the amount of data while leaving the “decision making” to the network.
We argue that the end-to-end approach is less efficient for high-resolution image processing than an NN augmented with training-free, linear image processing.
Elphel training sets for neural networks
Elphel offers unique image sets for neural network training. These are high-resolution, space-invariant data sets.
Elphel is seeking collaboration with machine learning research teams working in the area of ML applications for 3D object classification, localization and tracking.
Day 3: Autonomous Vehicle Focus
The exhibition is slowing down. Some booths start packing at midday. Elphel, however, is busier than ever.
We have talked to just about every self-driving car company that had a presence at CVPR. We have also talked to other large corporations doing research in perception and 3D reconstruction.
We start by visiting our neighbors – other exhibitors at the Expo. We are very curious: what do they develop, and for which applications? Can it be combined with our technology? Can we learn something from them, and can they learn from us? It is a collaborative spirit with which we start conversations. One of our closest neighbors is Mapbox – an open-source mapping platform for custom-designed maps – so we start with them. We use OpenStreetMap as well as other maps with our Scene Viewer for 3D-reconstructed scenes, both for locating scenes on the map and for the ground truth data. During the lunch break we add Mapbox tiles to our scene viewer. The Mapbox exhibitors are pleased that it worked so well for us. Now they are also very interested in our technology!
Then we walk to other booths, learn about them and present Elphel.
Some of the responses we got when we introduced our technology and our current results:
- Researcher from Apple: “You have the most unique technology on this Expo!”
- Autonomous Vehicle company A: “I cannot believe you can accurately measure distances at 500 meters with just a small base of 150 mm? With just 5-10% error? It is simply impossible! I have to come see it!” After visiting our booth and seeing the demo he asks: “Can you demonstrate your technology on the car? It has to be in Singapore!”
- Autonomous Vehicle company B: “We are happy with the use of Lidars on city streets, with close-range perception and 30 mph driving speed. But on the highway, where the speed is much higher and the distances much farther, the Lidar’s output is too sparse. Passive, long-range 3D reconstruction would be a perfect solution.”
- Autonomous Vehicle company C: “How dense is your 3D scene? Oh, it is dense, then we can use it.”
- Autonomous Vehicle company D: “We are building the perception solution for self-driving trucks. We need long-range distance measurement.”
- Autonomous Vehicle company E: “We have 16 different types of sensors on the car right now. I don’t think we have room to put another camera. Besides, the area behind the windshield is very valuable, and we already have 4 cameras there – there is no room. By the way, we re-evaluate our approach every 6 months, and we just did that, so let’s, maybe, talk in 6 months again?”
Interesting thought – why would anyone need an unobstructed windshield in a self-driving car? They did not explain that.
Creating a 3D model of the CVPR 2018 Expo
3D model view of CVPR 2018, made by Elphel 3D X-Camera
One last thing to do before the show is over is to get the 3D-X-Cam up high and build the 3D model of the CVPR 2018 EXPO in “real-time”!
Two Dimensional Phase Correlation as Neural Network Input for 3D Imaging
We uploaded an image set with 2D correlation data, together with the Python import code, for experiments with neural networks, and we are now looking for collaboration with those who would love to apply their DL experience to this new kind of input data. More data will follow, and we welcome feedback to make this data set more useful.
The application area we are interested in is extremely long-distance 3D scene reconstruction and ranging with a distance-to-baseline ratio of 1000:1 to 10,000:1, while preserving a wide field of view. An earlier post describes aircraft distance and velocity measurements with up to a 3,000:1 distance-to-baseline ratio and a 60°(H)×45°(V) field of view.
Figure 1. Space-invariant 2D phase correlation data
Data set: source images, 2D correlation tiles, and X3D scene models
The data set contains the 2D phase correlation output calculated from the 2592×1936 Bayer mosaic source images captured by the quad stereo camera, and a Disparity Space Image (DSI) calculated from a pair of such cameras. The longer baseline provides higher range resolution, and this DSI serves as the ground truth for a single quad camera. The source images as well as all the software used are also provided under the GNU GPLv3 license. The DSI is organized as a 324×242 array – each sample is calculated from the corresponding 16×16 tile. Tiles overlap (as shown in Figure 3) with a stride of 8.
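As a rough illustration of the tile geometry (a minimal sketch: only the 16×16 tile size, the stride of 8 and the 324×242 grid come from the text above; the exact window placement near the image borders is an assumption), the DSI grid can be mapped to overlapping pixel windows like this:

import numpy as np

TILE, STRIDE = 16, 8            # 16x16 tiles, overlapping with stride 8
WIDTH, HEIGHT = 2592, 1936      # Bayer mosaic source image size

# DSI grid size: one sample per stride step
grid_w, grid_h = WIDTH // STRIDE, HEIGHT // STRIDE   # 324 x 242

def tile_window(tx, ty):
    # Top-left corner of the 16x16 window for DSI sample (tx, ty);
    # neighboring windows overlap by TILE - STRIDE = 8 pixels.
    x0, y0 = tx * STRIDE, ty * STRIDE
    return x0, y0, x0 + TILE, y0 + TILE

print(grid_w, grid_h)        # 324 242
print(tile_window(10, 10))   # (80, 80, 96, 96)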
These data sets are provided together with fully reconstructed X3D scene models, viewable in a browser (Wavefront OBJ files are also generated). The scene models differ from the raw DSI because the next software stages generate meshes, which frequently leads to over-simplification of the original DSI (so most fronto-parallel objects in the scene provide better range accuracy when probed near the bottom). Each scene is accompanied by a multi-layer TIFF file (*-DSI_COMBO.tiff) that allows one to see the difference between the measured DSI and the one used in the rendered X3D model. The file format and the Python import software are documented in Oleg’s post “Reading quad stereo TIFF image stacks in Python and formatting data for TensorFlow”; the data files are directly viewable with ImageJ.
The first goal of applying a neural network to this data is to improve DSI generation from the raw 2D correlation compared to what we can do now with the traditional code: using data from all correlation pairs and the neighboring tiles to propagate information along object edges, and amplifying correlation in low-textured areas by pooling the correlation outputs from multiple tiles.
While there are no moving parts in the system, the overall process of measuring the DSI resembles human/animal stereo vision, where the eyes converge on the expected object distance and the visual processing deals with only a small residual mismatch, constantly adjusting the convergence. The difference here is that each tile can be individually “converged” for the requested disparity – this is done by combining an integer pixel shift of the tile input window with an accurate fractional pixel shift by a phase rotator in the frequency domain, to avoid any re-sampling errors. Currently the prediction of the “convergence” (expected disparity) for the next tiles to measure is performed by a mixture of several algorithms (to avoid an expensive full disparity scan over the whole image). This process of predicting disparity is likely to be improved by a properly designed and trained network too, one that uses the tile context.
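The fractional part of that per-tile “convergence” can be illustrated with a generic frequency-domain phase rotator (a sketch of the principle only, not Elphel’s actual tile-processor implementation): multiplying the tile’s FFT by a linear phase ramp shifts it by a sub-pixel amount without re-sampling any pixel values.

import numpy as np

def fractional_shift(tile, dx, dy):
    # Shift a 2D tile by a sub-pixel amount (dx, dy) using a phase rotation
    # in the frequency domain, so no pixel re-sampling (interpolation) occurs.
    h, w = tile.shape
    fy = np.fft.fftfreq(h)[:, None]    # vertical spatial frequencies
    fx = np.fft.fftfreq(w)[None, :]    # horizontal spatial frequencies
    ramp = np.exp(-2j * np.pi * (fx * dx + fy * dy))
    return np.fft.ifft2(np.fft.fft2(tile) * ramp).real

# Integer pixels of the target disparity are handled by moving the tile
# window; the fractional remainder is applied with the phase rotator:
tile = np.random.rand(16, 16)
shifted = fractional_shift(tile, dx=0.3, dy=0.0)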
As the current processing is recurrent (repeated “eye convergence” / mismatch processing), there are several options for how to organize the training data. We decided to focus on sub-pixel matching, so for each scene with a calculated ground-truth DSI (from the dual camera rig) the correlation output is measured for the target disparity (convergence) in the range of minus 1 pixel to plus 1 pixel, with a step of 0.1 pixel in the currently provided data sets. That makes a total of 21 files of correlation outputs. This partial disparity sweep around the ground truth is an alternative to a full disparity sweep that would require too much storage.
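The sweep itself is just 21 evenly spaced target-disparity offsets around the ground truth; a tiny sketch (the offsets come from the text, the file-name pattern is purely illustrative):

import numpy as np

# target-disparity offsets around the ground truth: -1.0, -0.9, ..., +1.0 pixels
offsets = np.round(np.linspace(-1.0, 1.0, 21), 1)

# one correlation-output file per offset (naming here is hypothetical)
files = ["scene-CORR%+.1f.tiff" % off for off in offsets]
print(len(files))   # 21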
In addition to the raw Bayer images, the scene models have links to the JPEG images corresponding to each sub-camera view. These images have aberrations corrected and are partially rectified for disparity = 0 (infinity). They are not rectilinear and still have a small radial distortion common to all of them; it is compensated during generation of the X3D models. The images have visible modulation caused by the Bayer mosaic – they were not subject to any (destructive) demosaic processing, only to linear processing. Each color channel is separately de-convolved with a set of space-variant kernels.
Figure 2. Dual quad camera rig
Available data series
Initial data contains scenes captured with the car-mounted dual quad camera rig (Figure 2) in two series:
- Salt Lake City State Street ↗
- Salt Lake City overlook (northern part of the city in the direction of the airport) ↗
The first series (80 scenes) was captured while driving north at a rate of 5 frames per second. Only the first four seconds (20 scenes) of each 20-second interval are posted, to reduce the amount of uploaded data. The second series (59 scenes) was captured while standing and then slowly turning and starting to move.
More data will be added, and we are open to suggestions about what data should be included to make the whole data set more useful. All available series combined are available at this link: ↗. The number of scenes is definitely too small to extract semantic information, but it may be enough to train a network to build the DSI using small samples of the correlation tiles – at least to start with.
The frequently asked questions about this data set
Why is there a special format of the data set – why not just provide the raw images?
We do provide the raw Bayer images captured by the camera in JP4 format, but due to optical aberrations that are significant for high-resolution images, complete end-to-end processing would require space-variant coefficients, which to our understanding would require enormous resources and a very large number of images for training. Instead we propose to separate out the space-variant pre-processing and provide space-invariant data to the network. The 2D correlation data is the earliest space-invariant data available with this approach.
Figure 3. Anisotropy of the 2D correlation. Red and blue squares show two overlapping tiles in the source images and in the calculated 2D correlation.
Why do you use four cameras for stereo when most systems have just two?
When the source image contains areas that are nicely textured in all directions, the correlation output contains a sharp round spot and the sub-pixel argmax calculation is not very difficult – polynomial interpolation provides good results. Unfortunately many real-life objects do not have convenient texture, and poor texture is the Achilles’ heel of many vision-based 3D reconstruction systems. On the other hand, the discontinuity between foreground and background objects usually results in sharp edges that can be used for matching, and it is usually safe to assume that a lack of visible features means the surface is smooth and depth can be reconstructed by averaging over a larger area. The edges may have any direction, but binocular systems with horizontal baselines are only sensitive to vertical features. Figure 3 shows the correlation output for a distant airplane, where the correlation spots are stretched in the direction of the edges. The four-sensor camera in Figure 1 contains 6 pairs, and after combining the horizontal and the vertical pairs there are 4 baseline directions, so for any edge direction there is a pair with a baseline almost perpendicular to it.
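A small sketch of that geometry (the square layout of the four sub-cameras is an assumption of the sketch, not a camera specification): enumerating the six pairs shows two horizontal, two vertical and two diagonal baselines, i.e. four distinct directions, so any edge is within roughly 22.5° of being perpendicular to some baseline.

import numpy as np
from itertools import combinations

# assumed square layout of the four sub-cameras (unit side), indexed 0..3
sensors = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (1, 1)}

for a, b in combinations(sensors, 2):              # 6 pairs in total
    dx = sensors[b][0] - sensors[a][0]
    dy = sensors[b][1] - sensors[a][1]
    direction = np.degrees(np.arctan2(dy, dx)) % 180.0   # baseline direction
    print("pair %d-%d: baseline direction %5.1f deg" % (a, b, direction))

# -> 0, 90, 45, 135, 90, 0 degrees: horizontal and vertical pairs double up,
#    leaving 4 distinct baseline directions for the correlation.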
Why don’t you use LIDARs for the ground truth data?
Most available LIDARs have a rather short range of 150-250 meters, and we are interested in measuring larger scenes. So instead we decided to use a pair of similar quad cameras positioned at ~5 times the quad camera baseline from each other, so the measurements this pair provides are also 5 times more accurate and can be used as ground truth for training. A binocular system does have the limitations described above, so it will not provide accurate data for horizontal features (like the electrical wires so common over Salt Lake City).
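The ~5× accuracy gain follows directly from the standard stereo range relation (a short sketch using the 258 mm and 1256 mm baselines mentioned in the field-calibration discussion below):

# disparity d = f * B / Z  (f: focal length in pixels, B: baseline, Z: range),
# so a fixed disparity error dd gives a relative range error
#     dZ / Z = Z * dd / (f * B)
# which is inversely proportional to the baseline B.

B_QUAD = 0.258    # main quad camera baseline, meters
B_RIG = 1.256     # dual quad camera rig baseline, meters

print("accuracy gain of the rig over a single quad camera: %.1fx" % (B_RIG / B_QUAD))
# -> ~4.9x, which is why the rig DSI can serve as ground truth for the quad cameras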
Figure 4. Field calibration results
What are the current limitations of the hardware?
The individual quad cameras use precisely thermally compensated sensor front ends, with image plane movement relative to the sensor of less than ±0.02μ/°C, but the overall mount made of titanium tubes does bend when heated by the sun from one side. It is just a few tenths of a pixel, but that is too much for this application, so until we have a new design we have to use field calibration and bundle-adjust the individual sub-camera attitudes using the image data itself. This keeps the misalignment to ~0.05 pix.
The dual camera rig (Figure 2) as currently implemented has its own limitations that we temporarily mitigate with field calibration. We are working on rig improvements too – one of the mechanical issues is that when driving at freeway speeds we noticed 55 Hz torsion oscillations (the two cameras rotating in opposite directions at the ends of the 1.5 m aluminum tube) with 0.4 pix amplitude. So until we strengthen the rig we had to limit the data set to images captured at lower speeds, to keep the vibration-caused mismatch under 0.1 pix.
Figure 4 shows the results of the field calibration for the image series captured while driving along State Street. Each quad camera (the main one with the 258 mm baseline and the auxiliary one with the 150 mm baseline) had its sensor front end attitude angles adjusted using the image data; the same was done for the attitude of the auxiliary camera as a whole with respect to the main one (baseline of 1256 mm). The results of the distance measurements obtained from this long-baseline pair were then used as ground truth for the individual quad cameras. The relatively strong correction required for the first ~20 samples was caused by the changing temperature in the camera connection tubes after slowing down from freeway speed to city speed.
Electronic rolling shutter (ERS) limitations. We really love the small-format, high-resolution image sensors perfected by the cellphone industry for their image quality and sensitivity, but that performance currently requires an electronic rolling shutter (ERS), which leads to image distortions of moving objects (or distortions caused by the moving/rotating camera itself). In the presence of object movement (or camera egomotion) the measured disparity depends on the time difference between the moments the same feature is captured by the sub-cameras. The sensors are mechanically aligned to be parallel within 5 pixels, which corresponds to 0.2 ms of readout time. That means the horizontal pairs capture all objects almost simultaneously regardless of disparity; for the vertical and diagonal pairs that time difference increases proportionally to the disparity, but for far objects (where disparity accuracy is most important) the difference is still small (0.03 ms for each disparity pixel). The currently posted image sets do not have any ERS correction, and that somewhat degrades performance for near objects, but when considering just the egomotion (both translation and rotation) this effect can be easily compensated, at least for uniform rotation or movement.
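A back-of-the-envelope sketch of the ERS skew for the vertical and diagonal pairs, using only the 0.03 ms-per-disparity-pixel figure quoted above (the object speed in the example is arbitrary):

LINE_SKEW_MS = 0.03   # capture-time difference per disparity pixel (vertical/diagonal pairs)

def ers_skew_ms(disparity_pix):
    return disparity_pix * LINE_SKEW_MS

# Example: an object moving at 20 m/s seen at 5 pixels of disparity
skew_s = ers_skew_ms(5) / 1000.0                                     # 0.15 ms
print("apparent displacement: %.1f mm" % (20.0 * skew_s * 1000.0))   # ~3 mm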
In the posted image sets ERS is not an issue for small (<5 pix) disparities. We will add software compensation over the full disparity range of the scenes for uniform movements, and later we will use an IMU to compensate for higher-frequency movements as well.
Reading quad stereo TIFF image stacks in Python and formatting data for TensorFlow
Fig.1 TIFF image stack
The input is a <filename>.tiff – a TIFF image stack generated by an ImageJ Java plugin (using Bio-Formats), with Elphel-specific information in ImageJ-written TIFF tags.
Reading and formatting image data for TensorFlow can be split into the following subtasks:
1. convert a TIFF image stack into a NumPy array
2. extract information from the TIFF header tags
3. reshape / perform a few array manipulations based on the information from the tags
To do this we have created a few Python scripts (see python3-imagej-tiff: imagej_tiff.py) that use Pillow, NumPy, Matplotlib, etc.
Usage:
~$ python3 imagej_tiff.py <filename>.tiff
It will print the header info in the terminal and display the layers (and decoded values) using Matplotlib.
Subtasks 1 and 2 are handled simply by Pillow. As for 3, if this were a plain TIFF image stack there would be no need for extra scripts.
If the layers were just stacked along depth, the shape would be (2178, 2916, 5). Instead, the script reshapes the data into tiles and puts the per-tile values into a separate array, so the output is 2 NumPy arrays:
- image data shape: (242, 324, 9, 9, 4) with the layers along the last dimension
- values data shape: (242, 324, 3)
And they are ready to be used in TensorFlow.
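A minimal sketch of subtasks 1 and 3 (reading the stack with Pillow and reshaping the image layers into 9×9 tiles); the ImageJ tag parsing and the exact packing of the per-tile values are Elphel-specific and handled by imagej_tiff.py, so they are only hinted at here:

import numpy as np
from PIL import Image

def read_tiff_stack(path):
    # Read a multi-page TIFF into a (height, width, layers) NumPy array.
    img = Image.open(path)
    layers = []
    for i in range(getattr(img, "n_frames", 1)):
        img.seek(i)
        layers.append(np.array(img, dtype=np.float32))
    return np.stack(layers, axis=-1)               # e.g. (2178, 2916, 5)

def to_tiles(stack, tile=9, n_image_layers=4):
    # Reshape the image layers into (rows, cols, tile, tile, layers),
    # e.g. (242, 324, 9, 9, 4) for a (2178, 2916, 5) input stack.
    h, w, _ = stack.shape
    rows, cols = h // tile, w // tile              # 2178/9 = 242, 2916/9 = 324
    img = stack[..., :n_image_layers]
    img = img.reshape(rows, tile, cols, tile, n_image_layers)
    return img.transpose(0, 2, 1, 3, 4)

# stack = read_tiff_stack("model-ML_DATA.tiff")    # file name is illustrative
# tiles = to_tiles(stack)                          # (242, 324, 9, 9, 4)
# The (242, 324, 3) per-tile values are encoded in the remaining layer;
# see imagej_tiff.py for the actual decoding.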
For more details and example output, see this wiki page or inspect the source. The most detailed general info is probably in the slides from our presentation for CVPR 2018 – the format description starts on p.19.
File samples
Actual files can be found here, or go to 3d+biquad, open individual models and hit the light green button to ‘Download source files for ml’.
Without using Python, displaying layers and decoding tags is natively supported by ImageJ or Fiji. To read the tags, the exiftool or tiffinfo Linux programs can be used.
Other libraries
Other Python libraries that could have worked (but examples for Pillow were easier to find):