Imaging solutions with Free Software & Open Hardware


Lapped MDCT-based image conditioning with optical aberrations correction, color conversion, edge emphasis and noise reduction

Thu, 01/19/2017 - 21:55
[Interactive image comparator – layers: 1) Raw image, standard bilinear 3x3 pixels kernel de-mosaic, annotated; 2) Raw image, standard bilinear 3x3 pixels kernel de-mosaic; 3) No color conversion, low pass filtered RGB only; 4) Colors converted and low-pass filtered, no nonlinear processing; 5) Edge emphasis enabled, no noise suppression; 6) Fully processed with noise suppression]

Fig.1. Image comparison of the output of the different processing stages of the color image

The previous blog post, “Lens aberration correction with the lapped MDCT”, described our experiments with the lapped MDCT[1] for optical aberration correction of a single color channel and the separation of the asymmetrical kernel into a small asymmetrical part for direct convolution and a larger symmetrical one to be applied in the frequency domain of the MDCT. We supplemented this processing chain with additional image conditioning steps to evaluate the overall quality of the results and the feasibility of the MDCT approach for processing in the camera FPGA.

The image comparator in Fig.1 shows the difference between the images generated from the results of several stages of the processing. It makes it possible to compare any two of the image layers, either by sliding the image separator or by just clicking on the image (which alternates the right/left images). Zoom is controlled by the scroll wheel (clicking the zoom indicator fits the image), pan – by dragging.

The original image was acquired with an Elphel model 393 camera with a 5 Mpix MT9P006 image sensor and a Sunex DSL227 fisheye lens, and saved in JP4 format as raw Bayer data at 98% compression quality. Calibration was performed with the Java program using the calibration pattern visible in the image itself. The program is designed to work with low-distortion lenses, so the fisheye was a stretch, and the calibration kernels near the edges are just replicated from the ones closer to the center – aberration correction is therefore only partial in those areas.

The first two layers differ just by the added annotations; they both show the output of a simple bilinear de-mosaic processing, the same as generated by the camera when running in JPEG mode. The next layers show the different stages of the processing; details are provided later in this blog post.

Linear part of the image conditioning: convolution and color conversion

Correction of the optical aberrations in the image can be viewed as convolution of the raw image array with space-variant kernels derived from the optical point spread functions (PSF). In the general case of truly space-variant kernels (different for each pixel) it is not possible to use DFT-based convolution, but when the kernels change slowly and the image tiles can be considered isoplanatic (areas where the PSF remains the same to the specified precision), it is possible to apply the same kernel to the whole image tile that is processed with the DFT (or the combined convolution/MDCT in our case). Such an approach is studied in depth in astronomy [2],[3] (where there are almost always plenty of δ-function light sources to measure the PSF across the field of view :-)).

The procedure described below is a combination of sparse kernel convolution in the space domain with the lapped MDCT processing, making use of its perfect reconstruction property (only approximate with the space-variant kernels); it still implements the same convolution with the variant kernels.

Signal flow is presented in Fig.2. The input signal is the raw image data from the sensor, sampled through the color filter array organized as a standard Bayer mosaic: each 2×2 pixel tile includes one red, one blue and two green samples.

In addition to the image data the process depends on the calibration data – pairs of asymmetrical and symmetrical kernels calculated during camera calibration as described in the previous blog post.

Fig.2. Signal flow of the linear part of MDCT-based image conditioning

Image data is processed in the following sequence of the linear operations, resulting in intensity (Y) and two color difference components:

  1. The input composite signal is split by color into 3 separate channels, producing sparse data in each.
  2. Each channel is directly convolved with a small asymmetrical kernel AK (we used just four non-zero elements), resulting in a sequence of 16×16 pixel tiles overlapping by 8 pixels (the input pixels are not limited to the 16×16 tiles).
  3. Each tile is multiplied by a window function, folded and converted with an 8×8 pixel DCT-IV[4] – the equivalent of a 16×16->8×8 MDCT.
  4. The 8×8 result tiles are multiplied by the symmetrical kernels (SK) – the equivalent of convolving the pre-MDCT signal.
  5. Each channel is subject to a low-pass filter, implemented by multiplication in the frequency domain (these filters are indeed symmetrical). The cutoff frequency is different for the green (LPF1) and the other (LPF2) colors, as there are more source samples for the former. That was the last step before the inverse transformation presented in the previous blog post; now we continue with a few more.
  6. Natural images have strong correlation between the color channels, so most image processing (and compression) algorithms convert the pure color channels into intensity (Y) and two color difference signals that have lower bandwidth than the intensity. There are different standards for the color conversion coefficients, and here we are free to use any of them, as this process is not part of a matched encoder/decoder pair. All such conversions can be represented as a multiplication of the (r,g,b) vector by a 3×3 matrix.
  7. Two of the output signals – the color differences – are subject to additional bandwidth limiting by LPF3.
  8. IMDCT includes the 8×8 DCT-IV, unfolding of the 8×8 into 16×16 tiles, a second multiplication by the window function and accumulation of the overlapping tiles in the pixel domain.
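To make steps 4-7 above more concrete, here is a minimal per-tile sketch in Java. It is an illustration written for this post, not our actual simulation code: the 8×8 frequency-domain tiles, the SK kernels and the LPF coefficients are assumed to be already prepared elsewhere, and the 3×3 color matrix is just an example of a possible conversion.

// Hypothetical per-tile illustration of steps 4-7. All arrays are 8x8 = 64-element
// frequency-domain tiles (after window, fold and DCT-IV); sk* are the symmetric
// aberration-correction kernels and lpf* the low-pass filters, both from calibration.
public class TileConditioning {
    static final double[][] RGB2YC = { // example conversion matrix (JPEG-style), any 3x3 works here
        { 0.299,  0.587,  0.114},      // Y
        {-0.169, -0.331,  0.500},      // color difference 1
        { 0.500, -0.419, -0.081}};     // color difference 2

    /** Element-wise multiplication in the MDCT (frequency) domain - equivalent to convolution. */
    static double[] mul(double[] tile, double[] kernel) {
        double[] r = new double[tile.length];
        for (int i = 0; i < tile.length; i++) r[i] = tile[i] * kernel[i];
        return r;
    }

    /** Steps 4-7: aberration correction, LPF, color conversion, extra LPF for the color differences. */
    static double[][] conditionTile(double[] r, double[] g, double[] b,
                                    double[] skR, double[] skG, double[] skB,
                                    double[] lpf1, double[] lpf2, double[] lpf3) {
        double[] rc = mul(mul(r, skR), lpf2);   // red:   SK, then LPF2
        double[] gc = mul(mul(g, skG), lpf1);   // green: SK, then LPF1 (higher cutoff)
        double[] bc = mul(mul(b, skB), lpf2);   // blue:  SK, then LPF2
        double[][] yc = new double[3][r.length];
        for (int i = 0; i < r.length; i++) {    // 3x3 color conversion - linear, so valid per frequency sample
            for (int c = 0; c < 3; c++) {
                yc[c][i] = RGB2YC[c][0] * rc[i] + RGB2YC[c][1] * gc[i] + RGB2YC[c][2] * bc[i];
            }
        }
        yc[1] = mul(yc[1], lpf3);               // extra bandwidth limiting of the two
        yc[2] = mul(yc[2], lpf3);               // color difference channels
        return yc;                              // to be fed to the IMDCT (step 8)
    }
}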
Nonlinear image enhancement: edge emphasis, noise reduction

For some applications this output data is already useful – ideally it has all the optical aberrations compensated, so the remaining far-reaching inter-pixel correlation caused by the camera system is removed. The next steps (such as stereo matching) can be done on-line (or off-line), and the algorithms do not have to deal with the lens specifics. Other applications may benefit from additional processing that improves image quality – at least the perceived one.

Such processing may target the following goals:

  1. Reduce the remaining signal modulation caused by the Bayer pattern (each source pixel carries data about a single color component, not all 3) – trying to remove it with just an LPF would blur the image itself.
  2. Detect and enhance edges, as most useful high-frequency elements represent locally linear features.
  3. Reduce visible noise in the uniform areas (such as blue sky), where significant (especially for the small-pixel sensors) noise originates from the pixel shot noise. This noise is amplified by the aberration correction that effectively increases the high frequency gain of the system.

Some of these three goals overlap and can be addressed simultaneously – edge detection can improve de-mosaic quality and reduce the related color artifacts on sharp edges if the signal is blurred along the edges and simultaneously sharpened in the orthogonal direction. Areas that do not have pronounced linear features are likely to be uniform and so can be low-pass filtered.

The non-linear processing produces a modified pixel value using a 3×3 pixel array centered around the current pixel. This is a two-step process:

  • First, the 3×3 center-symmetric matrices of coefficients (one for Y, another for color) are calculated using the Y channel data, then
  • they are applied to the Y and color components by replacing the pixel value with the inner product of the calculated coefficients and the original data.

Signal flow for one channel is presented in Fig.3:

Fig.3 Non-linear image processing: edge emphasis and noise reduction

  1. Four inner products are calculated for the same 9-sample Y data and the shown matrices (corresponding to the second derivatives along the vertical, horizontal and the two diagonal directions).
  2. Each of these values is squared, and
  3. the following four 3×3 matrices are multiplied by these values. The matrices are symmetrical around the center, so the gray-colored cells do not need to be calculated.
  4. The four matrices are then added together and scaled by a variable parameter K1. The first two matrices are opposite to each other, and so are the last two, so if the absolute values of the two orthogonal second derivatives are equal (no linear feature detected), the corresponding matrices annihilate each other.
  5. A separate 3×3 matrix representing a weighted running average, scaled by K2, is added for noise reduction.
  6. The sum of the positive values is compared to a specified threshold, and if it exceeds it, the whole matrix is proportionally scaled down – that makes different line directions “compete” against each other and against the blurring kernel.
  7. The sum of all 9 elements of the calculated array is zero, so the default unity kernel is added; when the correction coefficients are zeros, the result pixels will be the same as the input ones.
  8. The inner product of the calculated 9-element array and the input data is used as the new pixel value. Two such arrays are created from the same Y channel data – one for Y and the other for the two color differences; the configurable parameters (K1, K2, threshold and the smoothing matrix) are independent in these two cases.
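The following Java sketch condenses this flow. It is only an illustration: the directional second-derivative detectors are the obvious textbook ones, while the weighting matrices M and the smoothing term are placeholders standing in for the actual coefficients shown in Fig.3.

// Illustrative sketch of the 3x3 edge-emphasis / noise-reduction kernel construction.
// The M and SMOOTH matrices below are plausible placeholders only - the real
// coefficient matrices are the ones shown in Fig.3 and are not reproduced here.
public class NonlinKernel {
    // Step 1: second-derivative detectors (vertical, horizontal, two diagonals), 3x3 stored row-major
    static final double[][] D = {
        { 0,  1,  0,   0, -2,  0,   0,  1,  0},  // d2/dy2
        { 0,  0,  0,   1, -2,  1,   0,  0,  0},  // d2/dx2
        { 1,  0,  0,   0, -2,  0,   0,  0,  1},  // d2 along "\" diagonal
        { 0,  0,  1,   0, -2,  0,   1,  0,  0}}; // d2 along "/" diagonal
    // Step 3: matrices weighted by the squared detector outputs;
    // M[1] = -M[0] and M[3] = -M[2], so the pairs annihilate when the orthogonal derivatives are equal.
    static final double[][] M = {
        { 0, 0.25, 0,  -0.25, 0, -0.25,  0, 0.25, 0},   // placeholder: blur along vertical, sharpen across
        { 0,-0.25, 0,   0.25, 0,  0.25,  0,-0.25, 0},
        { 0.25, 0,-0.25,   0, 0, 0,  -0.25, 0, 0.25},   // placeholder pair for the diagonals
        {-0.25, 0, 0.25,   0, 0, 0,   0.25, 0,-0.25}};
    // Step 5: zero-sum smoothing term (weighted running average minus the center pixel)
    static final double[] SMOOTH = {
        0.0625, 0.125, 0.0625,  0.125, -0.75, 0.125,  0.0625, 0.125, 0.0625};

    static double dot(double[] a, double[] b) {
        double s = 0; for (int i = 0; i < 9; i++) s += a[i] * b[i]; return s;
    }

    /** Build the 3x3 kernel from the 9 Y samples around the pixel and apply it to 9 samples of Y or color. */
    static double apply(double[] y9, double[] data9, double k1, double k2, double threshold) {
        double[] kernel = new double[9];
        for (int d = 0; d < 4; d++) {                     // steps 1-4: detect, square, weigh, accumulate
            double w = dot(D[d], y9); w *= w;             // squared second derivative
            for (int i = 0; i < 9; i++) kernel[i] += k1 * w * M[d][i];
        }
        for (int i = 0; i < 9; i++) kernel[i] += k2 * SMOOTH[i]; // step 5: noise reduction term
        double posSum = 0;                                // step 6: normalize if the correction is too strong
        for (double v : kernel) if (v > 0) posSum += v;
        if (posSum > threshold) for (int i = 0; i < 9; i++) kernel[i] *= threshold / posSum;
        kernel[4] += 1.0;                                 // step 7: add the unity kernel (center tap)
        return dot(kernel, data9);                        // step 8: new pixel value
    }
}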
Next steps

How much is it possible to warp?

The described method of optical aberration correction has been tested with a software implementation that uses only operations that can be ported to the FPGA, so we are almost ready to get back to Verilog programming. One more thing to try first is to see if it is possible to merge this correction with a minor distortion correction. DFT and DCT transforms are not good at scaling images (when using the same pixel grid). It is definitely not possible to rectify large areas of the fisheye images, but maybe small (fraction of a pixel per tile) stretching can still be absorbed in the same step with the shifting? This may have several implications.

Single-step image rectification

It would definitely be attractive to eliminate an additional processing step and save FPGA resources and/or decrease the processing time. But there is more to it than that – re-sampling degrades image resolution. For that reason we use a half-pixel grid for the off-line processing, but it increases the amount of data 4 times, and the processing resources – at least 4 times as well.

When working with the whole-pixel grid (as we plan to implement in the camera FPGA) we already deal with partial pixel shifts during convolution for aberration correction, so it would be very attractive to combine these two fractional pixel shifts into one (the calibration process uses a half-pixel grid) and so avoid double re-sampling and the related image degradation.

Using analytical lens distortion model with the precision of the pixel mapping

Another goal that seems achievable is to absorb at least the table-based pixel mapping. Real lenses can be described by an analytical formula of a radial distortion model only to some precision. Each element can have errors, and the multi-lens assembly inevitably has some misalignments – all that makes the lenses different and deviating from the perfect symmetry of the radial model. When we were working with (rather low distortion) wide angle Evetar N125B04530W lenses, we were able to get to 0.2-0.3 pix root mean square reprojection error in a 26-lens camera system when using a radial distortion model with up to the 8-th power of the radial polynomial (with insignificant improvement when going from the 6-th to the 8-th power). That error was reduced to 0.05..0.07 pixels when we implemented table-based pixel mapping for the remaining (after the radial model) distortions. The difference between one of the standard lens models – polynomial for the low-distortion lenses, f-theta for fisheye and “f-theta” lenses (where the angle from the optical axis depends approximately linearly on the distance from the center in the focal plane) – and the actual lens is rather small, so it is a good candidate to be absorbed by the convolution step. While this will not eliminate re-sampling when the image is rectified, this distortion correction process will have a simple analytical formula (already supported by many programs) and will not require a full pixel mapping table.

High resolution Z-axis (distance) measurement with stereo matching of multiple images

Image rectification is an important precondition for correlation-based stereo matching of two or more images. When finding the correlation between the images of a relatively large and detailed object it is easy to get a resolution of a small fraction of a pixel, and this proportionally increases the distance measurement precision for the same base (distance between the individual cameras). Among other things (such as mechanical and thermal stability of the system) this requires precise measurement of the sub-camera distortions over the overlapping field of view.

When correlating multiple images, the far objects (the most challenging to get precise distance information for) have low disparity values (maybe just a few pixels), so instead of complete rectification of the individual images it may be sufficient to have a good “mutual rectification”, so that the processed images of an object at infinity match on each of the individual images with the same sub-pixel resolution that we achieved for off-line processing. This requires mechanically orienting each sub-camera sensor parallel to the others, pointing them in the same direction, and preselecting lenses for matching focal length. After that (when the mechanical match is within a reasonable few percent) – perform calibration and calculate the convolution kernels that will accommodate the remaining distortion variations of the sub-cameras. In this case application of the described correction procedure in the camera will result in precisely matched images ready for correlation.

These images will not be perfectly rectified, and the measured disparity (in pixels) as well as the two (vertical and horizontal) angles to the object will require additional correction. But this X/Y resolution is much less critical than the resolution required for the Z-measurements and can easily tolerate some re-sampling errors. For example, if a car at a distance of 20 meters is viewed by a stereo camera with a 100 mm base, then the same pixel error that corresponds to a (practically negligible) 10 mm horizontal shift will lead to a 2 meter (10%) error in the distance measurement.
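A quick sanity check of this example with small-angle geometry (not a full error model):

\theta_{disp} = B / Z = 0.1\,\mathrm{m} / 20\,\mathrm{m} = 5\,\mathrm{mrad}
\delta\theta = 0.01\,\mathrm{m} / 20\,\mathrm{m} = 0.5\,\mathrm{mrad}
\delta Z / Z \approx \delta\theta / \theta_{disp} = 0.5 / 5 = 10\% \;\Rightarrow\; \delta Z \approx 2\,\mathrm{m}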

References

[1] Malvar, Henrique S. Signal processing with lapped transforms. Artech House, 1992.

[2] Thiébaut, Éric, et al. “Spatially variant PSF modeling and image deblurring.” SPIE Astronomical Telescopes+ Instrumentation. International Society for Optics and Photonics, 2016. pdf

[3] Řeřábek, M., and P. Pata. “The space variant PSF for deconvolution of wide-field astronomical images.” SPIE Astronomical Telescopes+ Instrumentation. International Society for Optics and Photonics, 2008. pdf

[4] Britanak, Vladimir, Patrick C. Yip, and Kamisetty Ramamohan Rao. Discrete cosine and sine transforms: general properties, fast algorithms and integer approximations. Academic Press, 2010.

Lens aberration correction with the lapped MDCT

Sat, 01/07/2017 - 18:19

Modern small-pixel image sensors exceed the resolution of the lenses, so it is the optics of the camera, not the raw sensor “megapixels”, that defines how sharp the images are, especially in the off-center areas. Multi-sensor camera systems that depend on tiled images do not have any center areas, so the overall system resolution may be as low as that of its worst part.

Fig. 1. Lateral chromatic aberration and Bayer mosaic: a) monochrome (green) PSF, b) composite color PSF, c) Bayer mosaic of the sensor, d) distorted mosaic for the chromatic aberration of b).

De-mosaic processing and chromatic aberrations

Our current cameras' role is to preserve the raw sensor data while providing some moderate compression; all the image correction is applied during post-processing. Handling the lens aberrations has to be done before color conversion (or de-mosaicing). When converting Bayer data to color images, most cameras start with the calculation of the “missing” colors in the RG/GB pattern using 3×3 or 5×5 kernels; this procedure relies on the specific arrangement of the color filters.

Each of the red and blue pixels has 4 green ones at the same distance (pixel pitch) and 4 pixels of the opposite (R for B and B for R) color at the equidistant diagonal locations. Fig.1 shows how lateral chromatic aberration disturbs these relations.

Fig.1a is the point-spread function (PSF) of the green channel of the sensor. The resolution of the PSF measurement is twice that of the pixel pitch, so the lens is not that bad – the horizontal distance between the 2 greens in Fig.1c corresponds to 4 pixels of Fig.1a. It is also clearly visible that the PSF is elongated and the radial resolution in this part of the image is better than the tangential one (the lens center is down and to the left).

Fig.1b shows the superposition of the 3 color channels: the blue center is shifted up-and-right by approximately 2 PSF pixels (so one actual pixel period of the sensor) and the red one – half a pixel left-and-down from the green center. So the point light of a star centered on some green pixel will not just spread uniformly to the two “R”s and two “B”s shown connected with lines in Fig.1c, but to other pixels and in a different order. Fig.1d illustrates the effective positions of the sensor pixels that match the lens aberration.

Aberrations correction at post-processing stage

When we perform off-line image correction we start with separating each color channel and re-sampling it at twice the pixel pitch frequency (adding a zero sample between each pair of measured ones) – this increase allows shifting the image by a fraction of a pixel while both preserving resolution and not introducing the phase errors that may be visually OK but hurt when relying on sub-pixel resolution during correlation of images.

Next is the conversion of the overlapping square tiles of the full image into the frequency domain using a 2-d DFT, then multiplication by the inverted PSF kernels – individual for each color channel and each part of the whole image (the calibration procedure provides a 2-d array of PSF kernels). Such multiplication in the frequency domain is equivalent to the (much more computationally expensive) image convolution (or deconvolution, as the desired result is to reduce the effect of the convolution of the ideal image with the PSF of the actual lens). This is possible because of the famous convolution-multiplication property of the Fourier transform and its discrete versions.

After each color channel tile is corrected and the phases of the color components match (lateral chromatic aberration is compensated), the data may be subject to non-linear processing that relies on the properties of the images (like detection of lines and edges) to combine the color channels, trying to achieve the highest spatial resolution without introducing color artifacts. Our current software performs this while the data is still in the frequency domain, before the inverse Fourier transform and the merging of the lapped tiles into the restored image.

Fig.2. Histogram of difference between original image and after direct and inverse MDCT (with 8×8 pixels DCT-IV)

MDCT of an image – there and back again

It would be very appealing to use DCT-based MDCT instead of DFT for aberration correction. With just an 8×8 point DCT-IV it may be possible to calculate the direct 16×16 -> 8×8 MDCT and the 8×8 -> 16×16 IMDCT, providing perfect reconstruction of the image. An 8×8 pixel DCT should be able to handle convolution kernels with an 8 pixel radius – the same would require a 16×16 pixel DFT. I knew there would be a challenge in handling non-symmetrical kernels, but first I gave a try to a 2-d MDCT to convert and reconstruct back a camera image that way. I was not able to find an efficient Java implementation of the DCT-IV, so I had to write some code following the algorithms presented in [1].

That worked nicely – the histogram of the difference between the original image (pixel values were in the range of 0 to 255) and the restored one – IMDCT(MDCT(original)) – demonstrated negligible error. Of course, I had to discard the 8 pixel border of the image added by replication before the procedure – these border pixels do not belong to 4 overlapping tiles as all the internal ones do, and so cannot be reconstructed.
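For readers who want to repeat this “there and back again” experiment without the full application, here is a minimal 1-d Java sketch of the same lapped transform (sine window, 16→8 fold plus DCT-IV for the MDCT, scaled DCT-IV plus unfold for the IMDCT, then overlap-add). It was written for this post and is not the project code.

// Minimal 1-d MDCT/IMDCT demo: N = 8 bins, 16-sample overlapping windows.
// Interior samples are reconstructed (almost) exactly; the first and last N samples
// are not, just like the discarded 8-pixel border mentioned above.
public class MdctDemo {
    static final int N = 8;

    static double[] dct4(double[] x) {           // naive O(N^2) DCT-IV
        double[] y = new double[N];
        for (int k = 0; k < N; k++)
            for (int n = 0; n < N; n++)
                y[k] += x[n] * Math.cos(Math.PI / N * (n + 0.5) * (k + 0.5));
        return y;
    }

    static double[] window() {                   // sine window, satisfies the MDCT (Princen-Bradley) condition
        double[] w = new double[2 * N];
        for (int n = 0; n < 2 * N; n++) w[n] = Math.sin(Math.PI / (2 * N) * (n + 0.5));
        return w;
    }

    static double[] mdct(double[] x16) {         // fold 16 windowed samples into 8, then DCT-IV
        double[] f = new double[N];
        int h = N / 2;
        for (int i = 0; i < h; i++) {
            f[i]     = -x16[N + h - 1 - i] - x16[N + h + i]; // -c_reversed - d
            f[h + i] =  x16[i]             - x16[N - 1 - i]; //  a - b_reversed
        }
        return dct4(f);
    }

    static double[] imdct(double[] X8) {         // DCT-IV (scaled), then unfold 8 -> 16
        double[] f = dct4(X8);
        for (int i = 0; i < N; i++) f[i] *= 2.0 / N;
        double[] z = new double[2 * N];
        int h = N / 2;
        for (int i = 0; i < h; i++) {
            z[i]         =  f[h + i];            //  second half of the folded data
            z[h + i]     = -f[N - 1 - i];        // -reversed second half
            z[N + i]     = -f[h - 1 - i];        // -reversed first half
            z[N + h + i] = -f[i];                // -first half
        }
        return z;
    }

    public static void main(String[] args) {
        double[] w = window();
        double[] src = new double[128], rec = new double[128];
        for (int i = 0; i < src.length; i++) src[i] = 100 * Math.sin(0.37 * i) + 50 * Math.cos(0.05 * i * i);
        for (int off = 0; off + 2 * N <= src.length; off += N) {        // lapped processing, hop = N
            double[] tile = new double[2 * N];
            for (int i = 0; i < 2 * N; i++) tile[i] = src[off + i] * w[i];
            double[] back = imdct(mdct(tile));
            for (int i = 0; i < 2 * N; i++) rec[off + i] += back[i] * w[i]; // window again and overlap-add
        }
        double maxErr = 0;
        for (int i = N; i < src.length - N; i++) maxErr = Math.max(maxErr, Math.abs(rec[i] - src[i]));
        System.out.println("max interior reconstruction error = " + maxErr); // double-precision rounding level
    }
}

The printed maximum interior error stays at the floating-point rounding level, matching the negligible error seen in the histogram above.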

When this is done in the camera FPGA the error will be higher – the DCT implementation there uses just an integer DSP, not capable of the double precision calculations of the Java code. But for the small 8×8 transforms it should be rather easy to manage the calculation precision to the required level.

Convolution with MDCT

It was also easy to implement a low-pass symmetrical filter by multiplying the 8×8 pixel MDCT output tiles by the DCT-III transform of the desired convolution kernel. To convolve f ☼ g you need to multiply DCT_IV(f) by DCT_III(g) in the transform domain [2], but that does not mean that DCT-III also has to be implemented in the FPGA – the de-convolution kernels can be prepared during off-line calibration and provided to the camera in the required form.

But not much more can be done for convolution with asymmetric kernels – they require either an additional DST (so both DCT and DST) of the image and/or padding the data with extra zeros [3],[4] – all of which reduces the advantage of DCT compared to DFT. Asymmetric kernels are required for the lens aberration corrections, and Fig.1 shows two cases not easily suitable for MDCT:

  • lateral chromatic aberrations (or just shift in the image domain) – Fig.1b and
  • “diagonal” kernels (Fig.1a) – not an even function of each of the vertical and horizontal axes.

Fig.3. Convolution kernel factorization: a) required asymmetrical and shifted kernel, b) 4-point direct convolution with (sparse) Bayer color channel data, c) symmetric convolution kernel for MDCT, d) symmetric kernel – DCT-III of c) to multiply DCT-IV kernels of the image.

Symmetric kernels are like what you can do with a twice folded piece of paper, cut to some shape and unfolded, with folds oriented strictly vertically and horizontally.

Factorization of the convolution

Another way to handle convolution with non-symmetrical kernels is to split it in two – first convolve with an asymmetrical kernel directly, and then use MDCT and a symmetrical kernel. The input data for the combined convolution is the split Bayer data, so each color channel receives a sparse sequence – the green one has only half of its elements non-zero, and red and blue – only 1/4 of such pixels. In the case of the half-pixel grid (to handle fractional-pixel shifts) the relative amount of non-zero pixels is four times smaller, so the total number of multiplications is the same as for the whole-pixel grid.

The goal of such factorization is to minimize the number of non-zero elements in the asymmetrical kernel, imposing no restrictions on the symmetrical one. The factorization does not have to be absolutely precise – the effect of deconvolution is limited by several factors, the most important being the amplification of the sensor noise (such as shot noise). The required number of non-zero pixels may vary with the type of distortion; for the lens we experimented with (Sunex DSL227 fisheye) just 4 pixels were sufficient to achieve 2-4% error for each of the kernel tiles. Four-pixel kernels make it 1 multiplication per each of the red and blue pixels and 2 multiplications per green. As the kernels are calculated during the camera off-line calibration, it should be possible to simultaneously generate scheduling of the DSP and buffer memories to additionally reduce the required run-time FPGA resources.

Fig.3 illustrates how the deconvolution kernel for the aberration correction is split into two for the consecutive convolutions. Fig.3a shows the required deconvolution kernel determined during the existing calibration procedure. This kernel is shown far off-center even for the green channel – it appeared near the edge of the fisheye lens field of view, as the current lens model is based on the radial polynomial and is not efficient for the fisheye (f-theta) lenses, so aberration correction by deconvolution had to absorb that extra shift. As the convolution kernel has a fixed number of non-zero elements, the computation complexity does not depend on the maximal kernel dimensions. Fig.3b shows the determined asymmetric convolution kernel of 4 pixels, and Fig.3c – the kernel for symmetric convolution with MDCT; the unique 8×8 pixel part of it (inside the red square) is replicated to the other 3 quadrants by mirroring along row 0 and column 0 because of the whole pixel even symmetry – the right boundary condition for DCT-III. Fig.3d contains the result of the DCT-III applied to the data shown in Fig.3c.

Fig.4. Symmetric convolution kernel tiles in MDCT domain. Full image (click to open) has peripheral kernels replicated as there was no calibration data outside of the fisheye lens field of view

There should be more efficient ways to find optimal combinations of the two kernels; currently I used a combination of the Levenberg-Marquardt algorithm (LMA), which minimizes the approximation error (root mean square of the differences between the given kernel and the result of the convolution of the two calculated ones), and adding/replacing pixels in the asymmetrical kernel, sorting the variants for the best LMA fit. Experimental code for the kernel calculation (FactorConvKernel.java) is in the same GitHub repository.
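For reference, evaluating a candidate factorization is straightforward – convolve the two kernels and compare the result with the target. Below is a simplified Java sketch of just that check; the search itself (LMA plus pixel selection) lives in FactorConvKernel.java and is not reproduced here.

// Relative RMS error between the target deconvolution kernel and the convolution
// of the candidate asymmetric (sparse) and symmetric kernels. All kernels are
// square 2-d arrays and the target is assumed to be at least as large as the
// convolution result, with the centers aligned.
public class FactorError {
    static double[][] convolve(double[][] a, double[][] b) {
        int n = a.length + b.length - 1;
        double[][] r = new double[n][n];
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a.length; j++)
                for (int k = 0; k < b.length; k++)
                    for (int l = 0; l < b.length; l++)
                        r[i + k][j + l] += a[i][j] * b[k][l];
        return r;
    }

    /** Relative RMS of (asym * sym - target). */
    static double relativeRms(double[][] asym, double[][] sym, double[][] target) {
        double[][] c = convolve(asym, sym);
        int ofs = (c.length - target.length) / 2;   // align the kernel centers
        double e2 = 0, t2 = 0;
        for (int i = 0; i < target.length; i++)
            for (int j = 0; j < target.length; j++) {
                double d = ((i + ofs >= 0) && (i + ofs < c.length) ? c[i + ofs][j + ofs] : 0.0) - target[i][j];
                e2 += d * d;
                t2 += target[i][j] * target[i][j];
            }
        return Math.sqrt(e2 / t2);
    }
}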

Each kernel tile is processed independently of its neighbors, so while the aberration deconvolution kernels change smoothly between adjacent tiles, the individual asymmetrical (for direct convolution with the Bayer signal data) and symmetrical (for convolution by multiplication in the MDCT space) kernels may change dramatically (see Fig.4). But when the direct convolution is applied before the window multiplication to the source pixels that contribute to a 16×16 pixel MDCT overlapping tile, the result (after IMDCT) depends on the convolution of the two kernels, not on the individual ones.

Deconvolving the test image

The next step was to apply the convolution to the test image, and to see if there are any visible blocking (or other) artifacts and if the image sharpness was improved. Only a single (green) channel was tested, as there is no DCT-based color conversion code in this program yet. The program was tested with the whole pixel grid (not half pixel), so some reduction of sharpness caused by the fractional pixel shift was expected. For the “before/after” aberration correction comparison I used two pairs – one with the raw Bayer data (half of the pixels are black in a checker-board pattern) and the other with the Bayer pattern after a 0.4 pix low-pass filter to reduce the checkerboard pattern. Without this filtering the image would be either twice darker or (as on these pictures) saturated at lower levels (checkerboard 0/255 alternating pixels result in an average gray level of only half of the full range).

Fig.5. Alternating images of a segment (green channel only): low-pass filter of the Bayer mosaic before and after deconvolution. Click image to show comparison with raw Bayer component.
Raw Bayer
Bayer data, low pass filter, sigma = 0.4 pix
Deconvolved

Fig.5 shows an animated GIF of a fraction of the whole image; clicking the image shows a comparison to the raw Bayer (with the limited gray level), and the caption links the full size images for these 3 modes.

No de-noise code is used, so the amplification of the pixel shot noise is clearly visible, especially on the uniform surfaces, but aliasing cancellation remained functional even with abruptly changing convolution kernels such as the ones shown in Fig.4.

Conclusions

The algorithms suitable for FPGA implementation are tested with the simulation code. Processing of the images subject to the typical optical aberrations of the DSL227 fisheye lens does not add significantly to the computational complexity compared to the pure symmetric convolution using lapped MDCT based on the 8×8 pixel two-dimensional DCT-IV.

This solution can be used as a first stage of the real time image correction and rectification, capable of sub-pixel resolution in multiple application areas, such as 3-d reconstruction and autonomous navigation.

References

[1] Plonka, Gerlind, and Manfred Tasche. “Fast and numerically stable algorithms for discrete cosine transforms.” Linear algebra and its applications 394 (2005): 309-345.
[2] Martucci, Stephen A. “Symmetric convolution and the discrete sine and cosine transforms.” IEEE Transactions on Signal Processing 42.5 (1994): 1038-1051. pdf
[3] Suresh, K., and T. V. Sreenivas. “Linear filtering in DCT IV/DST IV and MDCT/MDST domain.” Signal Processing 89.6 (2009): 1081-1089. Abstract and full text pdf.
[4] Reju, Vaninirappuputhenpurayil Gopalan, Soo Ngee Koh, and Ing Yann Soon. “Convolution using discrete sine and cosine transforms.” IEEE Signal Processing Letters 14.7 (2007): 445. pdf
[5] Malvar, Henrique S. “Extended lapped transforms: Properties, applications, and fast algorithms.” IEEE Transactions on Signal Processing 40.11 (1992): 2703-2714.

Measuring SSD interrupt delays

Thu, 12/22/2016 - 18:56


Sometimes we need to test disks connected to the camera, and we developed a small test program for this purpose. This program basically resembles camogm (the in-camera recording program) in its operation and allows us to write repeating blocks of data containing a counter value and then check the consistency of the data written. The program works directly with the disk driver and collects some statistics during its operation. The disk driver, among other things, measures the time between two events: when a write command is issued and when the command completion interrupt from the controller is received. This time can be used to measure the disk write speed, as the amount of data sent to the disk with each command is also known. In general, this time floats slightly around its average value, given that the amount of data written with each command is almost the same. But long-run tests have shown that sometimes the interrupt return time after a write command can be much longer than the average time.

We decided to investigate this situation in a bit more detail and tested two SSDs with our test program. The disks used for the tests were a SanDisk SD8SMAT128G1122 and a Crucial CT250MX200SSD6, both connected to the camera's eSATA port over an M.2 SSD adapter. We had used these disks before, and they demonstrated different performance during recording. We ran camogm_test to write 3 MB blocks of data in cyclic mode. The program collected the delayed interrupt times reported by the driver as well as the amount of data written since the last delay event. The processed results of the test are in the following table:

                                         CT250MX200SSD6 (250 GB)   SD8SMAT128G1122 (128 GB)
Average IRQ reception time, ms           11.9                      19.3
Standard deviation, ms                   1.1                       4.8
Average IRQ delay time, ms               804                       113
Standard deviation, ms                   12.7                      6.5
Data recorded since last IRQ delay, GB   499.7                     231.5
Standard deviation, GB                   111.7                     11.5

The delayed interrupt times of these disks are considerably different, although the difference in the average interrupt times, which reflect the disk write speeds, is not that big. It is interesting to notice that the amount of data written to disk between two consecutive interrupt delays is almost twice the total disk size. smartctl reported an increase of the Runtime_Bad_Block attribute for the CT250MX200SSD6 after each delay, but the delays occurred each time on different LBAs. Unfortunately, the SD8SMAT128G1122 does not have such a parameter among its smartctl attributes, and it is difficult to compare the two disks by this parameter.
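As a side note, the average interrupt reception times translate directly into the sustained write speeds for the 3 MB blocks used in the test: 3 MB / 11.9 ms ≈ 250 MB/s for the CT250MX200SSD6 and 3 MB / 19.3 ms ≈ 155 MB/s for the SD8SMAT128G1122.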

DCT type IV implementation

Sat, 12/17/2016 - 00:15

As we finished with the basic camera functionality and tested the first Eyesis4π built with the new 10393 system boards (it is smaller, requires less power and is faster), we are moving forward with the in-camera image processing. We plan to combine our current camera calibration methods that require off-line post-processing and the real-time image correction using the camera's own FPGA resources. This project development will require switching between the actual FPGA coding and the software implementation of the same algorithms before going to the next step – software is still easier to design. The first part was in the FPGA realm – it was to implement the fundamental image processing block that we already know we'll be using and see how much of the resources it needs.

DCT type IV as a building block for in-camera image processing

We consider a small (8×8 pixel) DCT-IV to be a universal block for conditioning of the raw acquired images. Such operations as lens optical aberration correction, color conversion (de-mosaic) in the presence of lateral chromatic aberration, and image rectification (de-warping) are easier to perform in the frequency domain using the convolution-multiplication property and other algorithms.

In post-processing we use the DFT (Discrete Fourier Transform) over rather large (64×64 to 512×512) tiles, but that would be too much for the in-camera processing. First is the tile size – for good lenses we do not need such large convolution kernels. Additionally, we plan to combine several processing steps into one (based on our off-line post-processing experience), and so we do not need to sub-sample images – in our current software we double the resolution of the raw images at the beginning and scale back the final result to reduce the image degradation caused by re-sampling.

The second area where we plan to reduce computations is the replacement of the DFT with the DCT, which is designed to be fed with pure real data and so requires fewer arithmetic operations than the DFT that processes complex input values.

Why “type IV” of the DCT?

Fig.1. Signal flow graph for DCT-IV

We already have DCT type II implemented for the JPEG/JP4 compression, and we still needed another one. Type IV is used in audio compression because it can be converted to a modified discrete cosine transform (MDCT) – a procedure where multiple overlapped windows are processed one at a time and the results are seamlessly combined without any of the block artifacts that are familiar from JPEG at low settings of the compression quality. We too need a lapped transform to process large images with relatively small (much smaller than the image itself) convolution kernels, and DCT-IV is a perfect fit. An 8-point DCT-IV makes it possible to transform 16-point segments with 8-point overlap in a reversible manner – the inverse transformation of the 8-point data may be converted to 16-point overlapping segments, and being added together these segments result in the original data.

There is a price to pay, though, for switching from DFT to DCT – the convolution-multiplication property, so straightforward for the FFT, gets complicated for the DCT[1]. While convolving with symmetrical kernels is still simple (just the kernel has to be transformed differently, but it is anyway done off-line in our case), arbitrary kernel convolution (or just a shift in image space needed to compensate the lateral chromatic aberration) requires both DCT-IV and DST-IV transformed data. DST-IV can be calculated with the same DCT-IV modules (just by reversing the direction of the input data and alternating the sign of the output samples), but it still requires additional hardware resources and/or more processing time. Luckily it is only needed for the direct (image domain to frequency domain) transform; the inverse transform IDCT-IV (frequency to image) does not require DST. And IDCT-IV is actually the same as the direct DCT-IV, so we can again instantiate the same module.

Most two-dimensional transforms combine 1-d transform modules (because the DCT is a separable transform), so we too started with just an 8-point DCT. There are multiple known factorizations for such an algorithm[2] and we used one of them (based on BinDCT-IV), shown in Fig.1.
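In software the separability boils down to “transform the rows, then transform the columns”; a small Java illustration follows (using a naive O(N²) 1-d DCT-IV, not the hardware factorization of Fig.1):

// 2-d 8x8 DCT-IV built from 1-d passes: rows first, then columns (via a transpose).
public class Dct2d {
    static final int N = 8;

    static double[] dct4(double[] x) {            // naive 1-d DCT-IV, O(N^2)
        double[] y = new double[N];
        for (int k = 0; k < N; k++)
            for (int n = 0; n < N; n++)
                y[k] += x[n] * Math.cos(Math.PI / N * (n + 0.5) * (k + 0.5));
        return y;
    }

    static double[][] dct4_2d(double[][] tile) {  // tile is N x N
        double[][] rows = new double[N][];
        for (int i = 0; i < N; i++) rows[i] = dct4(tile[i]);        // horizontal pass
        double[][] out = new double[N][N];
        for (int j = 0; j < N; j++) {                               // vertical pass on the transposed data
            double[] col = new double[N];
            for (int i = 0; i < N; i++) col[i] = rows[i][j];
            double[] t = dct4(col);
            for (int i = 0; i < N; i++) out[i][j] = t[i];
        }
        return out;
    }
}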

Fig.2. Simplified diagram of Xilinx DSP48E1 primitive (only used functionality is shown)

DSP primitive in Xilinx Zynq

This algorithm is implemented with a pair of DSP48E1[3] primitives shown in Fig.2. This primitive is flexible and allows configuring different functionality; the diagram contains only the blocks and connections used in the current project. The central part is the multiplier (signed 18 bits by signed 25 bits) with inputs from a pair of multiplexed B registers (B1 and B2, 18 bits wide) and the pre-adder AD register (25 bits). The AD register stores the sum/difference of the 25-bit D register and a multiplexed pair of 25-bit A1 and A2 registers. Any of the inputs can be replaced by zero, so AD can receive D, A1, A2, -A1, -A2, D+A1, D-A1, D+A2 and D-A2 values. The result of the multiplier (43 bits) is stored in the M register, and the data from M is combined with the 48-bit output accumulator register P. The final adder can add or subtract M to/from one of P, the 48-bit C register, or just 0, so the output P register can receive ±M, P±M and C±M. The wrapper module dsp_ma_preadd_c.v reduces the number of DSP48E1 signals and parameters to those required for the project and, in addition to the primitive instance, has a simple model of the DSP slice to allow simulation without the DSP48E1 source code, for convenience.
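Functionally, the used subset can be thought of as the following toy model – a rough Java paraphrase of such a simulation model, ignoring pipeline registers, bit widths and timing (the actual wrapper, dsp_ma_preadd_c.v, is Verilog):

// Toy behavioral model of the DSP48E1 subset used here: pre-adder (D +/- {0,A1,A2}),
// multiplier (AD x {B1,B2}), M register, and the output accumulator P = {0,P,C} +/- M.
// The real 25x18 -> 43 -> 48 bit widths are only hinted at by the long types.
public class DspModel {
    long a1, a2, b1, b2, d, c;   // input registers
    long ad, m, p;               // pre-adder, multiplier and accumulator registers

    void preAdd(int selA, boolean negA, boolean useD) {        // AD <= (useD ? D : 0) +/- {0, A1, A2}
        long a = (selA == 1) ? a1 : (selA == 2) ? a2 : 0;
        ad = (useD ? d : 0) + (negA ? -a : a);
    }

    void multiply(boolean useB1) {                             // M <= AD * (B1 or B2)
        m = ad * (useB1 ? b1 : b2);
    }

    void accumulate(int selZ, boolean subtract) {              // P <= {0, P, C} +/- M
        long z = (selZ == 1) ? p : (selZ == 2) ? c : 0;
        p = subtract ? z - m : z + m;
    }
}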

Fig.3. One-dimensional 8-point DCT-IV implementation

8-point DCT-IV transform

The DCT-IV implementation module (Fig.3) operates in 16 clock cycles (2 clock periods per data item); the input/output permutations are not included – they can be absorbed in the data source and destination memories. The current implementation does not implement correct rounding and saturation, to save resources – such processing can be added to the outputs after analysis of the particular application data widths. This module is not in the coder/decoder signal chain, so bit-accuracy is not required.

Data is output every other cycle (so two such modules can easily be used to increase bandwidth), while the input data is scrambled more – some of the items have to appear twice in a 16-cycle period. This implementation uses two of the DSP48E1 primitives connected in series. The first one implements the left half of the Fig.1 graph – 3 rotators (marked R8 and two of R4), four adders, and four subtracters. The second one corresponds to the right half with R1, R5, R9, R13, four adders, and four subtracters. Two small memories (register files) – 2 locations before the first DSP and 4 locations before the second – effectively increase the number of the DSP internal D registers. The B inputs of the DSPs receive the cosine coefficients; the same ROM provides values for both DSP stages.

The diagram shows just the data paths, all the DSP control signals as well as the memories write and read addresses are generated at the defined times decoded from the 16-cycle period. The decoder is based on the spreadsheet draft of the design.

Fig.4. Two-dimensional 8×8 DCT-IV

Two-dimensional 8×8 points DCT-IV

The next diagram, Fig.4, shows a two-dimensional DCT type IV implementation using four of the 1-d 8-point DCT-IV modules described above. Input data arrives continuously in line-scan order; the next 64-item block may follow either immediately or after a delay of at least 16 cycles so the pipeline phases are correctly restarted. Two of the input 8×25 memories (the width can be reduced to match the input data, 25 is the width of the DSP48E1 inputs) are used to re-order the input data. As each of the 1-d DCT modules requires input data for more than half of the cycles (see bottom of Fig.3), interleaving with a common memory for both channels is not possible, so each channel has to have a dedicated one. The first of the two DCT modules converts the even lines of 8 points, the other one – the odd lines. The latency of the data output from the RAM in the second channel is made 1 cycle longer, so the output data from the two channels also arrive at odd/even time slots and can be multiplexed to a common transpose buffer memory. The minimal size of the buffer is 2 of the 64-item pages (the width can be reduced to match application requirements), but having just a two-page buffer increases the minimal pause time between blocks (if they are not immediate); with a four-page buffer (and BRAM primitives are larger even if just halves are used) the minimal non-immediate delay of the 16 cycles of a 1-d module is still valid.

The second (vertical) pass is similar to the first (horizontal) one; it also has individual small memories for input data reordering and 2 output de-scrambler memories. It would be possible to use a single stage, but the memory would have to hold at least 17 items (>16) while the register-file primitives are 16 deep, and I believe that splitting the reordering in series makes it easier for the placer/router tools to implement the design.

Next steps

Now that the 8×8 point DCT-IV is designed and simulated, the next step is to switch to the Java coding (add it to our ImageJ plugin for camera calibration and image post-processing), convert the calibration data to the form suitable for the future migration to the FPGA, and try the processing based on the chosen 8×8 DCT-IV. When satisfied with the results – continue with the FPGA coding.

References

[1] Martucci, Stephen A. “Symmetric convolution and the discrete sine and cosine transforms.” IEEE Transactions on Signal Processing 42.5 (1994): 1038-1051. pdf

[2] Rao, K. Ramamohan, and Ping Yip. Discrete cosine transform: algorithms, advantages, applications. Academic press, 2014.

[3] 7 Series DSP48E1 Slice, UG479 (v1.9), Xilinx, Sep. 2016. pdf

Using a flash with a CMOS image sensor: ERS and GRR modes

Mon, 10/24/2016 - 17:56
Operation modes in conventional CMOS image sensors with the electronic rolling shutter

Flash test setup

Most CMOS image sensors have an Electronic Rolling Shutter – the images are acquired by scanning line by line. Their strengths and weaknesses are well known, and extremely wide usage has made the technology somewhat perfect – Andrey might have already said this somewhere before.

There are CMOS sensors with a Global Shutter but (for the same formats) their pixels’ dynamic range is lower, the readout noise is bigger and the frame rates are lower as well. Here are a few links:

So, the typical sensor with ERS may support 3 modes of operation:

  • Electronic Rolling Shutter (ERS) Continuous
  • Electronic Rolling Shutter (ERS) Snapshot
  • Global Reset Release (GRR) Snapshot

GRR Snapshot was available in the 10353 cameras, but we never tried it ourselves – one had to write directly to the sensor's register to turn it on. Now it is tested and working in the 10393, available through the TRIG (0x14) parameter.


MT9P001 sensor


Further, I will be writing about ON Semi’s MT9P001 image sensor focusing on snapshot modes. The operation modes are described in the sensor’s datasheet. In short:

In ERS Snapshot mode (Fig.1,3), exposure time is constant across all rows but each next row’s exposure start is delayed by tROW (row readout time) from the previous one (and so is the exposure end).

In GRR Snapshot mode (Fig.2,4), the exposure of all rows starts at the same moment, but each next row is exposed tROW longer than the previous one. This mode is good when a flash is used.

The difference between ERS Snapshot and Continuous is that in the latter mode the sensor doesn't wait for a trigger and starts a new image while still reading out the previous one. It provides the highest frame rate (Fig.5).

Fig.1 Electronic Rolling Shutter (ERS) Snapshot mode

Fig.2 Global Reset Release (GRR) Snapshot mode

Fig.3 ERS mode, whole frame

Fig.4 GRR mode, whole frame

Fig.5 Sensor operation modes, frame sequence

Here are some of the actual parameters of MT9P001:

Parameter                            Value
Active pixels                        2592h x 1944v
tROW                                 33.5 μs
Frame readout time (Nrows x tROW)    1944 x 36.38 μs ~ 72 ms

Test setup
  • NC393L-389
  • 9xLEDs
  • Fan (Copal F251R, 25×25 mm, rotating at 5500-8000 RPM)

The LEDs were powered & controlled by the camera’s external trigger output, the delay and duration of which are programmable.

The flash duration was set to 20 μs to catch the fan blades (marked with stickers) without motion blur – at 5500-8000 RPM that is about 0.66-0.96° of rotation per 20 μs. There was not enough light from the LEDs, so the setup was placed in a dark environment and the camera color gains were set to 8 (ISO ~800-1000) – the images are a bit noisy.

The trigger period was set to 250 ms – and the synced LEDs were blinking for each frame.

The information on how to program the NC393 camera to generate trigger signal, fps, change sensor’s operation modes (ERS/GRR) can be found here.

Fig.6a Setup: screen, camera view

Fig.6b Setup: fan

Fig.6c Setup: fan, camera view

Flash in ERS mode

Fig.7a, Fig.7b, Fig.7c

In Fig.7a, to expose all rows to the flash, the exposure needs to be programmed so that the 1st row's end of exposure exceeds the last row's start of exposure, and the flash is delayed until the exposure start of the last row. That makes the single row exposure 72 ms + tflash.
Note: there is no ERS effect for moving objects – provided, of course, that the flash is much brighter than the other light sources that reduce the contrast during the 72 ms frame time.
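With the MT9P001 numbers from the table above (and ignoring the trigger latency), the Fig.7a case works out to approximately:

t_{delay} \approx N_{rows} \cdot t_{ROW} \approx 72\,\mathrm{ms}, \qquad t_{exp} \ge N_{rows} \cdot t_{ROW} + t_{flash} \approx 72\,\mathrm{ms} + 20\,\mu\mathrm{s}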

In Fig.7b the exposure is shorter than the frame readout time – the flash delay can be anything – the result is a brighter band on the image, as shown in the example below.

Another way to expose all rows is to keep the flash on from the 1st row start until the last row end (Fig.7c) – that’s as good as keeping the flash on all the time.

Example:

(Images: timing diagram, screen view, fan view)

                     Screen         Fan
Exposure time, ms    5              20
Flash duration, ms   0.02 (20 μs)   0.02
Flash delay, ms      40             40

Comments: The fan blades are motion blurred in the rows not affected by the delayed 20 μs flash. The flash delay is set so the affected rows appear in the middle of the image. The exposure time defines the width of the bright band of rows.

Flash in GRR mode

Fig.8a, Fig.8b

In GRR mode the flash does not need to be delayed, and the exposure of the 1st row can be as short as tflash, but the last row will be exposed for tflash + 72 ms (Fig.8a). If the scene is uniformly illuminated, the image tends to be darker at the top and brighter toward the bottom. GRR is very useful with a flash lamp.
Note: No ERS effect (as in the Fig.7a case).

Fig.8b just shows what happens if the flash is delayed until frame is read out.

Examples:

(Images: timing diagram, screen view, fan view)

                     Screen   Fan
Exposure time, ms    0.1      0.1
Flash duration, ms   0.02     0.02
Flash delay, ms      40       30

Comments: The fan blades are motion blurred in the rows not affected by the delayed 20 μs flash. All of the rows not read out before the flash are affected.

(Images: timing diagram, screen view, fan view)

                     Screen   Fan
Exposure time, ms    0.1      0.1
Flash duration, ms   0.02     0.02
Flash delay, ms      0        0

Comments: Fan is rotating. No motion blur. In GRR, if the flash is not delayed, the whole image is affected by the flash. Brighter environment = lower contrast.

(Images: timing diagram, screen view, fan view)

                     Screen   Fan
Exposure time, ms    5        10
Flash duration, ms   0.02     0.02
Flash delay, ms      0        0

Comments: Fan is rotating. 100 times longer exposure compared to the previous example – the environment is relatively dark.

Conclusions
  • ERS Continuous – max fps, constant exposure, not synced
  • ERS Snapshot – constant exposure, synced
  • GRR Snapshot – synced, use this mode with flash
Links

Elphel presenting at ORCONF 2016, An open source digital design conference

Sun, 10/02/2016 - 23:12

On October 8th, 2016 Andrey will be presenting his work on VDT – Free Software Environment for FPGA Development – at ORCONF 2016, an open source digital design conference.

The conference will take place in Bologna, Italy, and we are glad for the possibility to meet some of the European users of Elphel cameras, and to connect with the community of developers excited about open source design, free software and open hardware.

Elphel will be represented at the conference by Andrey Filippov from the USA headquarters and Alexandre Poltorak, founder of the Swiss 3D4Pi mobile mapping company, who works closely with Elphel to integrate Eyesis4Pi, a stereophotogrammetric camera, for image based 3D reconstruction applications. Andrey will bring and demonstrate the new multisensor NC393 H-camera, and Alexandre plans to take some panoramic footage with the Eyesis4Pi camera while in Bologna.

NC393 development progress and the future plans

Mon, 09/19/2016 - 13:41

Since we started to deliver the first NC393 series cameras in May we have been working on the camera software – the original version was rather limited. While it was capable of serving images/video over the network and recording them on the internal m.2 SSD, it did not have the advanced image acquisition control (through the GUI and programmatically) that was standard for the earlier NC353 series. Now the core functionality is operational, and in a month we plan to have the remaining parts (inter-camera synchronization, working with multiple sensors per port with the 10359 multiplexer, GPS+IMU logging) online too. The FPGA code is already ported, but it needs to be tested, and a fair amount of troubleshooting, identifying the problems and weeding out the bugs is still left to be done.

Fig 1. Four camvc instances for the four channels of NC393 camera

Users of earlier Elphel cameras can easily recognize the familiar camvc web interface – Fig. 1 shows a screenshot of four instances of this interface controlling the 4 sensors of an NC393 camera in “H” configuration.

This web application tests multiple underlying pieces of software in the camera: the FPGA code, the Linux kernel drivers that control the low level of the camera operation and handle 8 interrupts from the imaging subsystem (the NC353 camera processor had just one), the PHP extension to interact with the drivers, the image server, the histogram visualization program, the autoexposure and white balance daemons, as well as multiple PHP scripts and Javascript code. Luckily, the higher the level, the fewer changes we needed in the code from the NC353 (in most cases just a single new parameter – the sensor port – had to be introduced), but the debugging process included going through all the levels of code – bug chasing could start from the Javascript code, go to the PHP code, then to the PHP extension, to the kernel driver, to direct FPGA control from Python code (bypassing the drivers), to simulating the Verilog code with Cocotb. Then, when the problem was identified and the HDL code corrected (it usually required several more iterations with simulation), the top level programs were tested again with the new FPGA bitstream. And this is the time when the integration of all the development in the same Eclipse IDE really pays off – easy code navigation, making changes to programs in different languages – while the software was rebuilt and the results transferred to the target system automatically.

Camera core software

The NC393 camera software aims at the same goals as the previous models – allow full speed operation of the imagers while minimizing the real-time requirements to the software on two levels:

  • kernel level (tolerate large delays when waiting for the interrupts to be served) and
  • application level – allow even scripting languages to keep up with the hardware

Interrupt latency is usually not a problem when working with full frame multi-megapixel images, but the camera can also operate a small window at high FPS. Many operations with the sensor (like changing resolution or image size) require coordinated updating of the sensor internal registers (usually over the I²C connection), the parameters of the sensor-to-memory FPGA channel (with appropriate latency), the parameters of the memory-to-compressor channel, and the parameters of the compressor itself. Additionally, the camera software should provide the modified image headers (reflecting the new window size) when the acquired image is recorded or requested over the network.

Application software just needs to tell at what frame number it needs the new window size, and the kernel plus FPGA code will take care of the rest. Slow software should just tell it in advance, so the camera code and the sensor itself have enough time to execute the request. Multiple parameter modifications designated for a specific frame will be applied almost simultaneously, even if frame sync pulses were received from the sensor while the application was sending the new data.

Image-derived data remains available long after the image is acquired

Similar things happen with the data received from the sensor – the image itself and the histograms (they are used for the automatic exposure adjustment and white balancing). The application does not need to read them before the next frame data arrives – compressed images are kept in a large (64MB per port) ring buffer in the system memory – it can keep a record of several seconds of images. Histograms (for up to 4 different windows inside the full image for each sensor port) are preserved for 15 frames after being acquired and transferred over DMA to the system memory. A subset of essential acquisition parameters and image metadata (needed for Exif output) are preserved for 2048 and 511 frames respectively.

Fig 2. Interaction of the image sensor, FPGA, kernel drivers and user space applications

FPGA frame-based command sequencers

There are 2 sequencers for each of the four sensor ports on the FPGA level – they do not use any of the CPU resources:

  • I²C sequencers handle the relatively slow I²C commands to be sent to the sensor; usually these commands need to arrive before the start of the next frame,
  • Command sequencers perform writes to the memory-mapped registers and so control the FPGA operation. These operations need to happen in guaranteed time just after the start of frame, before the corresponding subsystems begin to process the incoming image data.

Both are synchronized by the “start of frame” signals from the sensor; each sequencer has 16 frame pages, and each page contains 64 command slots.

The sequencers allow absolute (modulo 16) frame addressing and addressing relative to the current frame. Writing to the current frame (zero offset) is interpreted as “ASAP”, and the commands are issued immediately, not synchronized by the start of frame. Additionally, if the commands were written too late and the frame sync arrived before they were executed, they will still be processed before the next frame's slot page is activated.
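A hypothetical helper just to show the page arithmetic (this is not the actual driver code):

// Illustration of the sequencer page selection: 16 frame pages, absolute (modulo 16)
// or relative (to the current frame) addressing, zero relative offset meaning "ASAP".
public class SeqAddress {
    static int pageForAbsolute(long absoluteFrame) {
        return (int) (absoluteFrame & 0xF);            // absolute frame number modulo 16
    }

    static int pageForRelative(long currentFrame, int offset) {
        if (offset == 0) {
            // ASAP: commands for the current frame are issued immediately,
            // not synchronized by the start of frame
            return (int) (currentFrame & 0xF);
        }
        return (int) ((currentFrame + offset) & 0xF);  // future frame page (offset < 16)
    }
}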

Kernel support of the image frame parameters

There are many frame-related parameters that control image acquisition in the camera, including various sensor register settings, parameters that control gamma conversion, the image format for recording to video memory (DDR3 dedicated to the FPGA, not shared with the CPU), the compressor format, signal gains, color saturations, compression quality, coring parameters, histogram window size and position. There is no such thing as “the current frame parameters” in the camera: at any given moment the sensor may be programmed for a certain image size, while its output data reflects the previous frame format, and the compressor is still not finished with an even earlier image. That means that the camera should be aware of multiple sets of the same parameters, each applicable to a certain frame (identified by an absolute frame number). In that case the sensor “now” is receiving not the “current” frame parameters, but the frame parameters of a frame that will happen 2 frame intervals later.

Current implementation keeps parameters (all parameters are unsigned long) in a 16-element ring buffer, each element being a

/** Parameters block, maintained for each frame (0..15 in NC393) of each sensor channel */
struct framepars_t {
        unsigned long pars[927];      ///< parameter values (indexed by P_* constants)
        unsigned long functions;      ///< each bit specifies function to be executed (triggered by some parameters change)
        unsigned long modsince[31];   ///< parameters modified after this frame - each bit corresponds to one element in par[960] (bit 31 is not used)
        unsigned long modsince32;     ///< parameters modified after this frame, super index - non-zero elements in modsince[31] (bit 31 is not used)
        unsigned long mod[31];        ///< modified parameters - each bit corresponds to one element in par[960] (bit 31 is not used)
        unsigned long mod32;          ///< super index - non-zero elements in mod[31] (bit 31 is not used)
};

Interrupt-driven processing of the parameters takes CPU time (in contrast with the FPGA sequencers described before), so the processing should be efficient and not iterate through almost a thousand entries for each interrupt. It is also not practical to copy a full set of parameters from the previous frame. The parameters structure for each frame includes a mod[31] array where each element stores a bit field that describes modification of 32 consecutive parameters, and a single mod32 represents each mod element as a single bit. So mod32 == 0 means that there were no changes (as is true for the majority of frames) and there is nothing to do for the interrupt service routine. The additional fields modsince[31] and modsince32 indicate that there were changes to the parameter after this frame. They are used to initialize a new (15 frames ahead of “now”) frame entry in the ring buffer. The buffer is modulo 16, so parameters for [this_frame + 15] share the same memory address as [this_frame - 1], and if a parameter is not “modified since” (as is true for the majority of parameters), nothing has to be done for it when advancing this_frame.

There is a configurable parameter that tells parameter processing at interrupts how far to look ahead into the future (Fig. 2 shows frames that are too far in the future hatched). The function starts with the current frame and proceeds into the future (up to the specified limit) looking for modified, but not yet processed, parameters. Processing of the modified parameters involves calling up to 32 “generic” (sensor-agnostic) functions and up to 32 of their sensor-specific variants. Each parameter that triggers some action when modified is assigned a bitmask of functions to schedule on change, and when the parameter is written to the buffer, the functions field for the frame is OR-ed, so during the interrupt only this single field has to be considered.

Processing parameters in a frame scans all the bits in functions (in a defined order, starting from the LSB, generic first); the functions involve verification and calculation of derivative values and writing data to the FPGA command and I²C sequencers (deep green and blue on Fig. 2 show the newly added commands to the sequencers). Additionally, some actions may schedule other parameter changes to be processed at a later frame.
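
A minimal Python sketch of the bookkeeping described above (illustrative only – the actual driver is written in C and its field layout matches the framepars_t structure shown earlier):

PARS_PER_WORD = 32

def mark_modified(frame, index, func_mask):
    """Record that parameter 'index' of 'frame' changed and OR the functions to run."""
    frame['mod'][index // PARS_PER_WORD] |= 1 << (index % PARS_PER_WORD)
    frame['mod32'] |= 1 << (index // PARS_PER_WORD)
    frame['functions'] |= func_mask          # single field checked by the ISR

def process_frame(frame, handlers):
    """ISR-side processing: nothing to do for most frames, otherwise LSB-first scan."""
    if frame['mod32'] == 0 and frame['functions'] == 0:
        return                               # the common case - no changes
    funcs = frame['functions']
    bit = 0
    while funcs:
        if funcs & 1:
            handlers[bit](frame)             # generic first, sensor-specific variants follow
        funcs >>= 1
        bit += 1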

User space applications and the frame parameters

Applications see the frame parameters through a character device driver that supports write, mmap, and (overloaded) lseek:

  • write operation allows setting a list of parameters and applying these changes to a particular frame as a single transaction (see the sketch after this list)
  • mmap provides read access to all the frame parameters for up to 15 frames in the future; parameter defines are provided through the header files under kernel include/uapi, so applications (such as the PHP extension) can access them by symbolic names.
  • lseek is heavily overloaded, especially for positive offsets to SEEK_END – such commands initiate special actions in this driver, such as waiting for a specific frame. It is used in part instead of the ioctl command, because lseek is immediately supported in most languages while ioctl often requires special extensions.
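
Below is a minimal user-space sketch of this interface; the device node name, the write() record layout and the lseek offset constant are placeholders – the real values are defined by the driver and the uapi headers mentioned above:

import os, struct

FRAMEPARS_DEV = "/dev/frameparsall"   # placeholder node name
LSEEK_FRAME_WAIT_REL = 0x100          # placeholder: the real overloaded SEEK_END offsets differ
P_EXPOS, P_QUALITY = 100, 101         # placeholder parameter indices (real P_* values come from the headers)

fd = os.open(FRAMEPARS_DEV, os.O_RDWR)

# waiting for a future frame: positive offsets to SEEK_END trigger driver actions
os.lseek(fd, LSEEK_FRAME_WAIT_REL, os.SEEK_END)

# apply several parameters to one future frame as a single transaction
target_frame = 1234
payload = struct.pack("<L", target_frame)
for index, value in ((P_EXPOS, 40000), (P_QUALITY, 90)):
    payload += struct.pack("<LL", index, value)
os.write(fd, payload)
os.close(fd)
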
Communicating image data to the user space

Similar to the handling of the frame acquisition and processing parameters, which deals with the future and lets even slow applications control the process in a frame-accurate way, other kernel drivers use the FPGA code features to give applications sufficient time to process acquired data before it is overwritten by newer data. These drivers use a similar character device interface with mmap for data access and lseek for control; some use write to send data to the driver.

  • circbuf driver provides access to the compressed image data in any of the four 64MB ring buffers that contain data compressed by the FPGA (the FPGA also provides the microsecond-accurate timestamp and the image size). Each image is 32-byte aligned, and the FPGA skips an additional 32 bytes after each frame. The compressor interrupt service routine (located in sensor_common.c) fills this area with some of the image acquisition metadata.
  • histograms driver handles the histograms for the acquired images. Histograms are calculated in the FPGA on the image-to-memory path and so are active even if the compressor is stopped. There are 3 types of histogram data that may be needed by the applications, and only the first one (direct) is provided by the FPGA over DMA; the two others (derivative) are calculated in the driver and cached, so an application request for the same derivative histogram does not require re-calculation. Histograms are calculated for the pixels after gamma-conversion even if raw (2 bytes/pixel) data is recorded, so table indices are always in the range of 0 to 255.
    • direct histograms are provided by the FPGA, which maintains data for 16 consecutive (last acquired) frames, for each of the 4 color channels (2 separate green ones), for each of the sensor ports and sub-channels (when multiplexers are used). Each frame data contains 256*4=1024 unsigned long (32 bit) values.
    • cumulative histograms contain the corresponding cumulative values; each element equals the sum of the direct histogram values from 0 to the specified index. When divided by the value at index 255 (the total number of pixels of this color channel = 1/4 of all pixels in the WOI), the result tells what fraction of all pixels have values less than or equal to the given index (see the sketch after this list).
    • percentiles are reversed cumulative histograms: they tell the pixel level for which a certain fraction of all pixels has a value equal to or below it. These values refer to non-linear (gamma-converted) pixel values, so automatic exposure also uses reversed gamma tables and interpolates between the two values in the percentile table.
  • jpeghead driver generates JPEG/JP4 headers that need to be concatenated with the compressed output from circbuf (and with the end-of-image 0xff/0xd9 marker) to make a complete image file
  • exif driver manipulates Exif data in the camera – it stores frame-variable Exif data for the last acquired frames in a 512-element ring buffer, allows specifying and setting additional Exif fields, and provides mmap read access to the metadata.
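
The derivative histogram math mentioned in the list above can be summarized with a short Python sketch (the driver itself does this in C and caches the results):

def cumulative(direct):
    """direct: 256 bin counts for one color channel -> running sums."""
    total, cum = 0, []
    for count in direct:
        total += count
        cum.append(total)
    return cum

def percentile(cum, fraction):
    """Smallest gamma-converted level with at least 'fraction' of the pixels at or below it."""
    threshold = fraction * cum[255]     # cum[255] == pixels of this color channel (1/4 of the WOI)
    for level, value in enumerate(cum):
        if value >= threshold:
            return level
    return 255
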
Camera applications

Current applications include

  • Elphel PHP extension allows multiple PHP scripts to work in the camera, providing the server side of web application functionality, such as camvc.
  • imgsrv is a fast image server that bypasses the camera web server and transfers images and metadata without copying extra data – the network controller sends data over DMA from the same buffer where the FPGA delivered the compressed data (also over DMA). Each sensor port has a corresponding instance of imgsrv, serving a different network port.
  • camogm allows simultaneous recording of image data from multiple channels at up to 220 MB/s
  • autoexposure is an auto exposure and white balance daemon that uses image histograms for the specified WOI to adjust exposure time, sensor analog gains and signal gain coefficients in the FPGA.
  • pnghist is a CGI program that visualizes histograms as PNG images, it supports several histogram presentation modes.

Other applications that were available in the earlier NC353 series cameras (such as RTP/RTSP video streamer) will be ported shortly.

Future plans

The NC393 camera has 12 times higher performance than the earlier NC353 series, and porting the functionality of the NC353 is much more than just tweaking the FPGA code and the drivers – large portions had to be redesigned completely. The camera FPGA project includes provisions for advanced image processing, and that changed the foundation of the camera code. That said, it is much more exciting to move forward and implement functionality that did not exist before, but we had to finish that “boring” part first. And now, as it is coming closer, I would like to share our future development plans and invite others who may be interested to cooperate.

New sensors

NC393 was designed to have maximal flexibility in the sensor interface – this is what we learned from our experience with the 303-313-333-353 series of cameras. So far NC393 is tested with one parallel interface sensor and one with a 4-lane HiSPI interface (both have links to the circuit diagrams). Each port can use 8 lanes + clock (9 differential pairs) and several more control/clock signals. Larger/faster sensors may use multiple sensor ports and so multiply the available interface lines.
It will be interesting to try high sensitivity large pixel E2V sensors and ToF technology. TI OPT8241 seems to be a good fit for NC393, but the OPT8241 I²C register map is not provided.

Quadcopters flying Star Wars style

Most quadcopters use brushless DC motors (BLDC) that may be tricky to control. Integrated motor controllers that detect rotor position using the voltage on the power coils or external sensors (and so emulate ancient physical brushes) work fine when you apply only moderate variations to the rotation speed, but may fail if you need to change the output fast and in a precisely calculated manner. An FPGA can handle such calculations better and leave CPU resources for the high level tasks. I would imagine such motor control to include some tiny FPGA paired with high-current MOSFET drivers attached to the motors, then use lightweight SATA cables (such as the 3M 5602 series) to connect them to the NC393 daughter board. NC393 already has a dual ARM CPU, so it can use existing free software to fly drones and take video/images at the same time. Making it not just fly, but do “tricks” will be really exciting.

Image processing and High Level Synthesis (HLS) alternative

The NC393 FPGA design started around a 16-channel memory access subsystem optimized for 2D data. Common memory may not be the most modern approach to parallel processing, but when the bulk memory (0.5GB of DDR3) is a single device, it has to be shared between the channels, and not all the module connections can be converted to simple stream protocols. Even before we started to add image processing, we already have to maintain two separate bitstreams – one for the parallel sensors, and the other for the HiSPI (serial) ones. They can not be made run-time programmable, as even the voltage levels are different, to say nothing of the fact that both interfaces together will not fit into the Zynq FPGA – we are already balancing around 80% of slice utilization. Theoretically NC393 can use two parallel and two serial sensors (the two pairs of sensor ports use two separate I/O banks with individually programmable supply voltage), but that adds even more variants to the top level module configuration and matching constraints files, and makes the code less readable.

Things will get even more complicated when more active memory channels are involved in the processing, especially when the inter-synchronization of the different modules processing multi-sensor 2D data is more complex than just stream in/stream out.

When processing multi-view scenes we will start with de-warping followed by FFT to implement correlation between the 4 simultaneous images and so significantly reduce the ambiguity of a stereo-pair correlation. In parallel with working on the Verilog code for the new modules I plan to try to reduce the complexity of the inter-module connections, making them more flexible and easier to maintain. I would love to use something higher level, but unfortunately there is nothing for me to embrace and use.

Why I do not believe in HLS

Focusing on the algorithmic level and leaving RTL implementation to the software is definitely a good idea, but the task is much more ambitious than to try to replace GCC or GNU/Linux operating system that even most proprietary and encryption-loving companies have to use. The gap between the algorithms and RTL code is wider than between the C code and the Assembler for the CPU, regardless of some nice demos with the Sobel filter applied to the live video stream or similar simple processing.

One of the major handicaps of the existing approach is an obsession with making modern reprogrammable FPGA code mimic the fixed-function hardware integrated circuits popular in the last century. To be software-like is much more powerful than to look like some old hardware. Certainly, separation of the application levels and use of standard APIs are important, but they are most beneficial in the mature areas. In the new ones I consider it a beauty of coding to be able to freely cross any implementation levels, break some good programming practices, adjust things here and there, redesign and start over, and balance overall performance and structure to create something new. Feature and interface freezes will come later.

So what to use instead?

I do not yet know what it should be exactly, but I would borrow Python decorators and the functionality of Verilog generate operators. Instead of just instantiating “black boxes” with rigid interfaces, allow the wrapper code (both automatically generated and hand-crafted) to get inside the instantiated modules' code and modify it for particular instances. “Decoration” here means generation of modified module code for specific instances – something like programmatic parametrization (modifying the code, not just the parameter values, even those that direct generate operators).
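
A purely hypothetical Python sketch of such “decoration” – nothing like this exists in the current code base, it only illustrates the idea of per-instance code modification:

import re

def decorate_module(verilog_src, instance_suffix, replacements):
    """Return a copy of a Verilog module with a new name and textual per-instance edits."""
    src = re.sub(r'\bmodule\s+(\w+)',
                 lambda m: 'module %s_%s' % (m.group(1), instance_suffix),
                 verilog_src, count=1)
    for old, new in replacements.items():    # edits go beyond parameter values - any code fragment
        src = src.replace(old, new)
    return src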

Elphel FPGA code is source code based – there are zero “black boxes” in the design. And as all the code (109579 lines of it) is available, it is accessible to software too, and “robots” can analyze it and make it easier to manage. We would like to have them as “helpers”, not as “wizards” who can offer just a few choices among the pre-programmed options.

To some extent we already have such “helpers” – our current Python code “understands” Verilog parameter definitions in the source code, including some calculations of the derivative ones. That makes it possible for the Python programs running in the camera to use the same register addresses and bit fields as defined for the FPGA code implemented in the current bitstream.

When the cameras became capable of running FPGA code controlled by the Python program and we were ready to develop kernel drivers, we added extra functionality to the existing Python code. Now it is able not just to read Verilog parameters for itself, but also to generate C code to facilitate driver development. This converter is not a compiler-like program that takes Verilog input and generates C header files. It is still a human-coded program that retrieves the parameter values from the Verilog code and helps the developer by providing the familiar content-assist functionality of the IDE, detecting and flagging misspelled parameter names in PyDev (the Eclipse IDE plugin for Python), and re-generating output when the Verilog source is modified.
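
A toy Python sketch of the general idea (the real converter in the camera code base is far more elaborate and also handles derivative parameters; the parameter names here are made up):

import re

PARAM_RE = re.compile(r"parameter\s+(\w+)\s*=\s*(\d+|'h[0-9a-fA-F]+)")

def parse_verilog_parameters(text):
    """Extract simple 'parameter NAME = value' definitions into a dict."""
    params = {}
    for name, value in PARAM_RE.findall(text):
        params[name] = int(value[2:], 16) if value.startswith("'h") else int(value)
    return params

def c_header(params, guard="X393_PARAMETERS_H"):
    """Emit matching #define lines for the kernel drivers."""
    lines = ["#ifndef %s" % guard, "#define %s" % guard]
    lines += ["#define %-24s 0x%x" % (n, v) for n, v in sorted(params.items())]
    lines.append("#endif")
    return "\n".join(lines)

print(c_header(parse_verilog_parameters("parameter STATUS_ADDR = 'h1a; parameter FRAME_BITS = 4;")))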

We also used Python to generate Verilog code for the AHCI implementation – it seemed more convenient than native Verilog generate statements: wrapping Verilog in Python and generating clean (easy for human analysis) Verilog code that can be used in the wave viewer and in the implementation tools' timing analysis. It will be quite natural to make the Python programs understand more of the Verilog code, help us manage the structure, and generate the matching constraints files that FPGA implementation tools require in addition to the HDL code. FPGA professionals probably use TCL scripts for that; it may be a nice language, but I never used it outside of FPGA scripting, so it is always a problem for me to recall how to use it when coming back to FPGA coding after long interruptions.

I did look at MyHDL of course, but it is not exactly what I need. MyHDL tries to replace Verilog completely and the structural modeling part of it suffers from the focus on RTL. I just want Python to help me with Verilog code, not to replace it (similar to how I do not think that Verilog is the best language to simulate CPU activities). I love Cocotb more – even its gentle name (COroutine based COsimulation) tells me that it is not “instead of” but “in addition to”. Cocotb does not have a ready solution for this project either (it was never a goal of this program) so here is an interesting project to implement.

There are several specific cases that I would like to be handled by the implementation.

  • add new functionally horizontal connections in a clean way between hierarchical objects: add outputs all the way up to the common parent module, wires at the top, and then inputs all the way down to the destination. Of course it is usually better to avoid such extra connections, but their traces in module ports help to keep them under control. Such connections may be just temporary and later removed, or be a start of adding new functionality to the involved modules.
  • generate a low footprint debug network to selected hierarchical modules and generate target Python code to probe/modify registers through this network accessing data by the HDL hierarchical names.
  • control the destiny of the decorators – either keep them as separate pieces of code or merge with the original source and make the result HDL code a new “co-designed” source.

And this is what I plan to start with (in parallel to adding new Verilog code). Try to combine existing pieces of the solution and make it a complete one.

Reaching 220 MB/s sustained write speed with SATA-2 controller

Tue, 09/13/2016 - 16:51
Introduction

Elphel cameras use camogm, a user space application, for recording acquired images to disk storage. The application was developed to use such storage devices as disk drives or USB drives mounted in the operating system. The Elphel393 model cameras have a SATA-2 controller implemented in the FPGA, a system driver for this controller, and they can be equipped with an SSD drive. We were interested in performing write speed tests using the SATA controller and a couple of M.2 SSDs to find out the top disk bandwidth camogm can use during image recording. Our initial approach was to try a commonly accepted method of using the hdparm and dd system utilities. The first disk was a SanDisk SD8SMAT128G1122. According to the manufacturer specification [pdf], this is a low power disk for embedded applications and it can show 182 MB/s sequential write speed in SATA-3 mode. We had the following:

~# hdparm -t /dev/sda2
/dev/sda2:
 Timing buffered disk reads: 274 MB in 3.02 seconds = 90.70 MB/sec
~# time sh -c "dd if=/dev/zero of=/dev/sda2 bs=500M count=1 && sync"
1+0 records in
1+0 records out
real    0m6.096s
user    0m0.000s
sys     0m5.860s

which results in a total write speed of around 82 MB/s.

The second disk was a Crucial CT250MX200SSD6 [pdf] and its sequential write speed should be 500 MB/s in SATA-3 mode. We had the following:

~# hdparm -t /dev/sda2
/dev/sda2:
 Timing buffered disk reads: 236 MB in 3.01 seconds = 78.32 MB/sec
~# time sh -c "dd if=/dev/zero of=/dev/sda2 bs=500M count=1 && sync"
1+0 records in
1+0 records out
real    0m6.376s
user    0m0.010s
sys     0m5.040s

which results in a total write speed of around 78 MB/s. Our preliminary tests had shown that the controller can achieve a 200 MB/s write speed. Taking this into consideration, the performance figures obtained were not very promising, so we decided to add one new feature to the latest version of camogm – the ability to write data to a raw storage device. A raw storage device is a disk or a disk partition accessed directly, bypassing any operating system caches and buffers. This type of access can potentially improve I/O performance but requires additional effort to implement data management in software.
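
For reference, here is a minimal Python illustration of what “raw” (unbuffered) access means, assuming a scratch partition /dev/sda2 that holds no valuable data; this is not the camogm code, which manages its own data layout on the raw partition:

import mmap, os, time

BLOCK = 1024 * 1024                                    # 1 MiB per write
buf = mmap.mmap(-1, BLOCK)                             # page-aligned buffer, required by O_DIRECT
buf.write(b"\xa5" * BLOCK)

fd = os.open("/dev/sda2", os.O_WRONLY | os.O_DIRECT | os.O_SYNC)
start = time.time()
for _ in range(100):                                   # write 100 MiB
    os.write(fd, buf)
os.close(fd)
print("raw write speed: %.1f MB/s" % (100 * BLOCK / (time.time() - start) / 1e6))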

First approach

In the first attempt we tried to bypass the file system and used a device file (/dev/sda in our case) in camogm for I/O operations. We compared the CPU load and I/O wait time during write operations to a partition with an ext4 file system and to a device file. dstat turned out to be a very handy tool for generating system resource statistics. The statistics were collected during 3 periods of operation: in idle mode before writing, during writing, and in idle mode after writing. All these periods can be clearly seen on the figures below. We also changed the quality parameter which affects the resulting size of the JPEG files. Files with the quality parameter set to 80 were around 1 MB in size and files with the quality parameter set to 90 were almost 2 MB in size.


As expected, the figures show that the device file write operation takes less CPU time than the same operation with a file system, because there are no file system operations and caches involved.


CPU wait for disk IO on the figures means the percentage of time the CPU waits for an I/O operation to complete. Here the camogm process spends more CPU time waiting for data to be written during device file operations than during file system operations, and again this can be explained by the fact that caching on the file system level is not used.

We also measured the time camogm spent on writing each individual file to device file and to files on ext4 file system.


The clear patterns on the figures correspond to several sensor channels used during recording and each channel produced JPEG files different in size from the other channels. As we have already seen, file system caching has its influence on the results and the difference in overall write time becomes less obvious when the size of files increases.

Although the tests had shown that writing data to the file system and to the device file had different overall performance, we could not achieve any significant performance gain that would narrow the gap between the initial results and the preliminary write speed data. We decided to try another approach: only pass commands to the disk driver and write the data from the disk driver itself.

Second approach

The idea behind this approach is simple. We already have the JPEG data in a circular buffer in memory, and the disk driver only needs pointers to the data we want to write at any given moment in time. camogm was modified to pass those pointers and some meta information to the driver via its sysfs interface. We modified our AHCI driver as well to add new functions. The driver accepts a command from camogm, aligns the data buffers to a predefined boundary and the frame as a whole to a physical sector boundary, and places the command in the command queue. Commands are picked from the command queue right after the current disk transaction is complete. We measured the time spent by the driver preparing a new command, waiting for an interrupt after a command had been issued, and waiting for a new command to arrive. The total data size per transaction was around 9.5 MB in the case of the SD8SMAT128G1122 and around 3 MB in the case of the CT250MX200SSD6. The disks were installed in cameras with 14 Mpix and 5 Mpix sensors respectively.
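
The exact alignment rules live in the AHCI driver and are more involved; the following Python fragment only sketches the padding arithmetic, assuming a 512-byte physical sector:

SECTOR = 512

def align_up(value, boundary):
    return (value + boundary - 1) & ~(boundary - 1)

def frame_on_disk(chunk_sizes, lba_start):
    """Return (bytes occupied on disk, next free LBA) for one frame made of several chunks."""
    total = sum(chunk_sizes)
    padded = align_up(total, SECTOR)       # the frame as a whole ends on a sector boundary
    return padded, lba_start + padded // SECTOR

padded, next_lba = frame_on_disk([1048573, 3145725], lba_start=2048)   # -> 4194304 bytes, LBA 10240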


These figures show that the time spent in the driver on command preparation is almost negligible in comparison to the time spent waiting for the write command to complete, and this was exactly what we finally wanted to get. We could achieve almost 160 MB/s write speed for the SD8SMAT128G1122 and around 220 MB/s for the CT250MX200SSD6. Here is a summary of the results obtained in the different modes of writing for the two test disks:

Disk write performance

Disk              | File system access | Device file access | Raw driver access
SD8SMAT128G1122   | 82 MB/s            | 90 MB/s            | 160 MB/s
CT250MX200SSD6    | 78 MB/s            | –                  | 220 MB/s

CT250MX200SSD6 was not tested in device file access mode as it was clear that this method did not fit our needs.

Disk access sharing

One of the problems we had to solve while working on the driver was disk access sharing between the operating system and the driver during recording. The disk in the camera had two partitions: one was formatted to an ext4 file system and mounted in the operating system, and the other was used as a data buffer for camogm. It is possible that some user space application could access the mounted partition while camogm is writing data to the disk data buffer, and this situation should be handled correctly. camogm, as a top priority process, should always have the full disk bandwidth, and other system processes should be granted access only during periods of time when camogm is waiting for the next frame. libata has a built-in command deferral mechanism and we used it in the driver to decide whether a system process should have access to the disk or the command should be deferred. To use this mechanism, we added our function to the ATA port operations structure:

static struct ata_port_operations ahci_elphel_ops = {
        ...
        .qc_defer       = elphel_qc_defer,
};

This function is called every time a new system command arrives and the driver can defer the command in case it is busy writing data.

A web interface for a simpler and more flexible Linux kernel dynamic debug controlling

Thu, 09/08/2016 - 18:40

Along with the documentation there is a number of articles explaining the dynamic debug (dyndbg) feature of the Linux kernel, like this one or this. However, we haven't found anything that extends the basic functionality, so we created a web interface using JavaScript and PHP on top of dyndbg.

Fig.1 debugfs-webgui

In most cases it all works fine – when writing a Linux driver you:
1. insert pr_debug()/dev_dbg() calls for debug messaging,
2. compile the kernel with dyndbg enabled (CONFIG_DYNAMIC_DEBUG=y),
3. then just ‘echo‘ query strings or ‘cat‘ files with commands to switch the debug messages on/off at runtime (a scripted variant is sketched after the examples below). Examples:

  • single:

echo -n 'file svcsock.c line 1603 +pfmt' > <debugfs>/dynamic_debug/control

  • batch file:

cat query-batch-file > <debugfs>/dynamic_debug/control
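
The same control file can also be driven from a small Python script; a sketch assuming debugfs is mounted at /sys/kernel/debug and using purely illustrative queries:

CONTROL = "/sys/kernel/debug/dynamic_debug/control"

queries = [
    "file svcsock.c line 1603 +pfmt",
    "module mymodule func my_isr -p",        # illustrative query, not from a real driver
]

with open(CONTROL, "w") as ctl:
    for q in queries:
        ctl.write(q + "\n")                  # each line is one dyndbg query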

When it is all small, enabling/disabling a whole file or a function is not a problem. When the driver grows big with lots of debug messages, or a few drivers interact with each other, it becomes more convenient to have multiple configurations with certain debug lines on or off. As the source code changes the lines get shifted, and so the batch files require editing.

If the target system (embedded or not) has a network connection and a web server (Apache2 + PHP), a quite simple solution is to add a web interface to the dynamic debug. The one we have developed has the following features:

  • allows having multiple configurations for each file
  • displays only files of interest
  • updates debug configuration for modified files where debug lines got shifted
  • keeps/updates the current config (in json format) in tmpfs – saves to disk on button click
  • p, f, l, m, t flags are supported

Get the source code then proceed with the README.md.

I will not have to learn SystemVerilog

Mon, 07/11/2016 - 18:32

Or at least the larger (verification) part of it – interfaces, packages and a few other synthesizable features are very useful to reduce the size of Verilog code and make it easier to maintain. We are now able to run the production target system Python code with Cocotb simulation over BSD sockets.

Client-server simulation of NC393 with Cocotb


Previous workflow

Before switching to Cocotb our FPGA-related workflow involved:

  1. Creating RTL design code
  2. Writing Verilog tests
  3. Running simulations
  4. Synthesizing and creating bitfile
  5. Re-writing test code to run on the target system in Python
  6. Developing kernel drivers to support the FPGA functionality
  7. Developing applications that access FPGA functionality through the kernel drivers

Of course the steps are not that linear: there are hundreds of loops between steps 1 and 3 (editing RTL source after finding errors at step 3), almost as many from 5 to 1 (when the problems reveal themselves during hardware testing), but a few are noticed only at steps 6 or 7. Steps 2, 5, and 6+7 involve a gross violation of the DRY principle, especially the first two. The last steps sufficiently differ from step 5 as their purpose is different – while the Python tests are made to reveal potential problems, including infrequent conditions, the drivers only use a subset of the functionality and try to “hide” problems – perform recovery actions to maintain operation of the device after an abnormal condition occurs.

We already tried to mitigate these problems – a significant part of the design flexibility is achieved through parametrized modules. Parameters are used to define the register map and register bit fields – they are among the most frequently modified items when new functionality is added. Python code in the camera is able to read and process Verilog parameter include files when running on the target system, and while generating C header files for the kernel drivers, so here the DRY principle stands. Changes in any parameter definitions in the Verilog files will be automatically propagated to both the Python and the C code.

But it is definitely not enough. Steps 2 and 5 may involve tens of thousands of lines of code, and a large part of the Python code is virtually a literal translation of the Verilog original. All our FPGA-based systems (and likely it is true for most other applications) involve symbiotic operation of the FPGA and some general purpose processor. In Xilinx Zynq they are on the same chip; in our earlier designs they were connected on the PCB. Most of the volume of the Verilog test code is the simulation of the CPU running some code. This code interacts with the rest of the design through writes/reads of the memory-mapped control/status registers, as well as the system memory when the FPGA is a master sending/receiving data over DMA.

This is one of the reasons I hesitated to learn the verification functionality of SystemVerilog. There are tons of computer programming languages that may be a better fit to simulate program activity of the CPU (this is what they do naturally). Currently the most convenient for bringing new hardware to life seems to be Python, so I was interested in trying Cocotb. If I knew it was that easy I would probably have started earlier, but having a rather large volume of existing Verilog test code I kept postponing the switch.

Trying Cocotb

Two weeks ago I gave it a try. First I prepared the instruments – integrated Cocotb into VDT, made sure that the Eclipse console output is clickable for the problems reported by the simulator, for the simulator output, for errors in the Python code, and for the source links in Cocotb logs. I used the Cocotb version of the JPEG encoder that has Python code for simulation – just added configuration options for VDT and fixed the code to reduce the number of warning markers that VDT generated. Here is the version that can be imported as an Eclipse+VDT project.

Converting x393 camera project to Cocotb simulation

Next was to convert the simulation of our x393 camera project to use Cocotb. For that I was looking not just to replace the Verilog test code with Python, but to use for simulation the same Python program that we already have running on the target hardware. The program already had a “dry run” option for development on a host computer; that part had to be modified to access the simulator. I needed some way to effectively isolate the Python code that is linked to the simulator from the code of the target system, and BSD sockets provide a good match for that. The part of the program that uses Cocotb modules (and is subject to the special requirements on Python coroutines to work with the simulator) plays the role of the server. The other part (linked to the target system program) replaces memory accesses, sends the request parameters over the socket connection, and waits for the response from the server. The only commands other than memory access that are currently implemented are “finish” (to complete simulation and analyze the results in the wave viewer – GtkWave), “flush” (flush file writes – similar to cache flushes on real hardware) and an interruptible (by the simulated system interrupt outputs) wait for a specified time. Simulation time is frozen between requests from the client, so the target system has to specifically let the simulated system run for a certain time (or until it generates an interrupt).

Simulation client

Below is an example of the modification to the target code memory write (full source). The X393_CLIENT is True: branch is for the old dry run (NOP) mode, the second one (elif not X393_CLIENT is None:) is for the simulation server, and the last one accesses real memory over /dev/mem.

def write_mem (self,addr, data,quiet=1):
    """
    Write 32-bit word to physical memory
    @param addr - physical byte address
    @param data - 32-bit data to write
    @param quiet - reduce output
    """
    if X393_CLIENT is True:
        print ("simulated: write_mem(0x%x,0x%x)"%(addr,data))
        return
    elif not X393_CLIENT is None:
        if quiet < 1:
            print ("remote: write_mem(0x%x,0x%x)"%(addr,data))
        X393_CLIENT.write(addr, [data])
        if quiet < 1:
            print ("remote: write_mem done" )
        return
    with open("/dev/mem", "r+b") as f:
        page_addr=addr & (~(self.PAGE_SIZE-1))
        page_offs=addr-page_addr
        mm = self.wrap_mm(f, page_addr)
        packedData=struct.pack(self.ENDIAN+"L",data)
        d=struct.unpack(self.ENDIAN+"L",packedData)[0]
        mm[page_offs:page_offs+4]=packedData
        if quiet <2:
            print ("0x%08x <== 0x%08x (%d)"%(addr,d,d))

There is not much magic in initializing X393_CLIENT class instance:

print("Creating X393_CLIENT")
try:
    X393_CLIENT= x393Client(host=dry_mode.split(":")[0], port=int(dry_mode.split(":")[1]))
    print("Created X393_CLIENT")
except:
    X393_CLIENT= True
    print("Failed to create X393_CLIENT")

And all the sockets handling code is less than 100 lines (source):

import json
import socket

class SocketCommand():
    command=None
    arguments=None
    def __init__(self, command=None, arguments=None): # , debug=False):
        self.command = command
        self.arguments=arguments
    def getCommand(self):
        return self.command
    def getArgs(self):
        return self.arguments
    def getStart(self):
        return self.command == "start"
    def getStop(self):
        return self.command == "stop"
    def getWrite(self):
        return self.arguments if self.command == "write" else None
    def getWait(self):
        return self.arguments if self.command == "wait" else None
    def getFlush(self):
        return self.command == "flush"
    def getRead(self):
        return self.arguments if self.command == "read" else None
    def setStart(self):
        self.command = "start"
    def setStop(self):
        self.command = "stop"
    def setWrite(self,arguments):
        self.command = "write"
        self.arguments=arguments
    def setWait(self,arguments): # wait irq mask, timeout (ns)
        self.command = "wait"
        self.arguments=arguments
    def setFlush(self): #flush memory file (use when sync_for_*
        self.command = "flush"
    def setRead(self,arguments):
        self.command = "read"
        self.arguments=arguments
    def toJSON(self,val=None):
        if val is None:
            return json.dumps({"cmd":self.command,"args":self.arguments})
        else:
            return json.dumps(val)
    def fromJSON(self,jstr):
        d=json.loads(jstr)
        try:
            self.command=d['cmd']
        except:
            self.command=None
        try:
            self.arguments=d['args']
        except:
            self.arguments=None

class x393Client():
    def __init__(self, host='localhost', port=7777):
        self.PORT = port
        self.HOST = host   # Symbolic name meaning all available interfaces
        self.cmd= SocketCommand()
    def communicate(self, snd_str):
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.connect((self.HOST, self.PORT))
        sock.send(snd_str)
        reply = sock.recv(16384) # limit reply to 16K
        sock.close()
        return reply
    def start(self):
        self.cmd.setStart()
        print("start->",self.communicate(self.cmd.toJSON()))
    def stop(self):
        self.cmd.setStop()
        print("stop->",self.communicate(self.cmd.toJSON()))
    def write(self, address, data):
        self.cmd.setWrite([address,data])
        rslt = self.communicate(self.cmd.toJSON())
    def waitIrq(self, irqMask,wait_ns):
        self.cmd.setWait([irqMask,wait_ns])
        rslt = self.communicate(self.cmd.toJSON())
    def flush(self):
        self.cmd.setFlush()
    def read(self, address):
        self.cmd.setRead(address)
        rslt = self.communicate(self.cmd.toJSON())
        return json.loads(rslt)

Simulation server

The server code is larger (it now has 360 lines) but it is rather simple too. It runs in the Cocotb environment (coroutines that yield to the simulator have “@cocotb.coroutine” decorations), receives and responds to the commands over the socket. When the command involves writing, it compares the requested address to the pre-defined ranges and either sends data over one of the master AXI channels (defined in x393interfaces.py) or writes to the “system memory” – just a file system file with the offset corresponding to the specified address.

elif self.cmd.getWrite():
    ad = self.cmd.getWrite()
    self.dut._log.debug('Received WRITE, 0x%0x: %s'%(ad[0],hex_list(ad[1])))
    if ad[0] in self.RESERVED:
        if ad[0] == self.INTM_ADDRESS:
            self.int_mask = ad[1][0]
        rslt = 0
    elif (ad[0] >= self.memlow) and (ad[0] < self.memhigh):
        addr = ad[0]
        self._memfile.seek(addr)
        for data in ad[1]: # currently only single word is supported
            sdata=struct.pack("<L",data)
            self._memfile.write(sdata)
            self.dut._log.debug("0x%08x => 0x%08x"%(data,addr))
            addr += 4
        rslt = 0
    elif (ad[0] >= 0x40000000) and (ad[0] < 0x80000000):
        rslt = yield self.maxigp0.axi_write(address = ad[0],
                                            value = ad[1],
                                            byte_enable = None,
                                            id = self.writeID,
                                            dsize = 2,
                                            burst = 1,
                                            address_latency = 0,
                                            data_latency = 0)
        self.dut._log.debug('maxigp0.axi_write yielded %s'%(str(rslt)))
        self.writeID = (self.writeID+1) & self.writeIDMask
    elif (ad[0] >= 0xc0000000) and (ad[0] < 0xfffffffc):
        self.ps_sbus.write_reg(ad[0],ad[1][0])
        rslt = 0
    else:
        self.dut._log.info('Write address 0x%08x is outside of maxgp0, not yet supported'%(ad[0]))
        rslt = 0
    self.dut._log.info('WRITE 0x%08x <= %s'%(ad[0],hex_list(ad[1], max_items = 4)))
    self.soc_conn.send(self.cmd.toJSON(rslt)+"\n")
    self.dut._log.debug('Sent rslt to the socket')

Similarly read commands acquire data from either AXI read channel or from the same memory image file. Data is sent to this file over the AXI slave interface by the simulated device.

Top Verilog module

The remaining part of the conversion from plain Verilog to Cocotb simulation is the top Verilog file – x393_dut.v. It contains an instance of the actual synthesized module (x393_i) and Verilog simulation modules of the connected peripherals. These modules can also be replaced by Python ones (and some eventually will be), but others, like the Micron DDR3 memory model, are easier to use as they are provided by the chip manufacturer.

Python modules can access hierarchical nodes in the design, but to keep things cleaner all the design inputs and outputs are routed to/from the outputs/inputs of the x393_dut module. In the case of Xilinx Zynq that involves connecting internal nodes – Zynq considers CPU interface not as I/O, but as an empty module (PS7) instantiated in the design.

Screenshot of the simulation with client (black console) and server (in Eclipse IDE)


Conclusions

Conversion to Python simulation was simple, considering rather large amount of project (Python+Verilog) code – about 100K lines. After preparing the tools it took just one week and now we have the same code running both on the real hardware and in the simulator.

Splitting the simulation into a client/server duo makes it easy to use any other programming language on the client side – not just the Python of our choice. Unix sockets provide convenient means for that. The address decoder (which decides what interface to use for a received memory access request) is better kept on the server (simulator) side of the socket connection, not on the client. This minimizes changes to the target code, and the server plays the role of the memory-mapped system bus, behaving as the real hardware does.

Are there any performance penalties compared to all-Verilog simulation? None visible in our designs. Simulation (and Icarus Verilog is a single-threaded application) is the most time-consuming part – for our application it is about 8,000,000 times slower than the modeled hardware. Useful simulations (all-Verilog) for the camera run for 15-40 minutes with tiny 64×32 pixel images. If we ran a normal set of 14 MPix frames it would take about a week for the first images to appear at the output. The same Python code on the target runs for a fraction of a second, so even though the simulator is stopped while Python runs, the combined execution time does not noticeably change for the Python+Verilog vs. all-Verilog mode. It would be nice to try Verilator in addition to Icarus. While it is not a full Verilog simulator (it can not handle 'bx and 'bz values, just '0' and '1') it is much faster.

Tutorial 02: Eclipse-based FPGA development environment for Elphel cameras

Sun, 05/22/2016 - 16:20

Elphel cameras offer unique capabilities – they are high performance systems out of the box and have all the firmware and FPGA code distributed under GNU General Public Licenses making it possible for users to modify any part of the code. The project does not use any “black boxes” or encrypted modules, so it is simulated with the free software tools and user has access to every net in the design. We are trying to do our best to make this ‘hackability’ not just a theoretical possibility, but a practical one.

Current camera FPGA project contains over 400 files under version control and almost 100K lines of HDL (Verilog) code, there are also constraints files, tool configurations, so we need to provide means for convenient navigation and modification of the project by the users.

We are starting a series of tutorials to facilitate acquaintance with this project, and here is the first one that shows how to install and configure the software. This tutorial is made with a fresh Kubuntu 16.04 LTS distribution installed on a virtual machine – this is the flavor of GNU/Linux we use ourselves, so it is easier for us to help others in case of problems, but it should also be easy to install the software on other GNU/Linux systems.

Later we plan to show how to navigate code and view/modify tool parameters with VDT plugin, run simulation and implementation tools. Next will be a “Hello world” module added to the camera code base, then some simple module that accesses the video memory.

Video resolution is 1600×900 pixels, so full screen view is recommended.

Download links for: video and captions.

Running this software does not require an actual camera, so it may help our potential users to evaluate the software capabilities and see if it matches their requirements before purchasing the actual hardware. We will also be able to provide remote access to the cameras in our office for experimenting with them.

3D Print Your Camera Freedom

Tue, 05/10/2016 - 13:31

Two weeks ago we were making photos of our first production NC393 camera to post an announcement of the new product availability. We got all the mechanical parts and most of the electronic boards (the 14MPix version will be available shortly) and put them together. A nice looking camera, powered by a high performance SoC (dual ARM plus FPGA), packaged in a lightweight aluminum extrusion body, providing different options for various environments – indoors, outdoors, on board a UAV or even in open space with no air (cooling is important when you run most of the FPGA resources at full speed). Tons of potential possibilities, but the finished camera did not seem too exciting – there are so many similar looking devices available.

NC393 camera, front view

NC393 camera, back panel view. Includes DC power input (12-36V and 20-75V options), GigE, microSD card (bootable), microUSB(type B) connector for a system console with reset and boot source selection, USB/eSATA combo connector, microUSB(type A) and 2.5mm 4-contact barrel connector for external synchronization I/O

NC393 assembled boards: 10393(system board), 10385 (power supply board), 10389(interface board), 10338e (sensor board) and 103891 - synchronization adapter board, view from 10389. m.2 2242 SSD shown, bracket for the 2260 format provided. 10389 internal connectors include inter-camera synchronization and two of 3.3VDC+5.0VDC+I2C+USB ones.

NC393 assembled boards: 10393(system board), 10385 (power supply board), 10389(interface board), 10338e (sensor board) and 103891 - synchronization adapter board, view from 10385

10393 system board attached to the heat frame, view from the heat frame. There is a large aluminum heat spreader attached to the other side of the frame with thermal conductive epoxy that provides heat transfer from the CPU without the use of any spring load. Other heat dissipating components use heat pads.

10393 system board attached to the heat frame, view from the 10393 board

10393 system board, view from the processor side


An obvious reason for our dissatisfaction is that the single-sensor camera uses just one of the four available sensor ports. Of course it is possible to use more of the freed FPGA resources for single-image processing, but it is not what you can use out of the box. Many of our users buy camera components and arrange them in their custom setup themselves – that does not have a single-sensor limitation and it matches our goals – make it easy to develop a custom system, or sculpture the camera to meet your ideas as stated on our web site. We would like to open the cameras to those who do not have the capabilities for advanced mechanical design and manufacturing or just want to try new camera ideas immediately after receiving the product.

Why multisensor?

One simple answer can be “because we can” – the CPU+FPGA based camera system can simultaneously handle multiple small image sensors we love – sensors perfected by the cellphone industry. Of course it is also possible to connect one large (high resolution/high FPS) sensor or even to use a multiple camera system for one really fast sensor – we did such a trick with the NC323 camera for book scanning – but we believe that the future is with multiple view systems that combine images from several synchronized cameras using computational photography, rather than with traditional large lens/large sensor cameras.

Multi-sensor systems can acquire high-resolution panoramic images in a single shot (or offer full sphere live video), and they can be used for image-based 3-d reconstruction that in many cases provides much superior quality to the now traditional LIDAR-based scanners, which can not output cinematographic quality 3-d scenes. They can be used to capture HDR video by combining data for the same voxels rather than pixels. Such systems can easily beat the shallow depth of field of traditional large format cameras and offer the possibility of post-production focus distance adjustment. Applications are virtually endless, and while at Elphel we are developing such multi-sensor systems, our main products are still the high-performance camera systems hackable at any imaginable level.

Prototype of the 21-sensor 3D HDR 6K cinematographic camera

Eyesis4π stereophotogrammetric camera

NC353-369-PHG3 3-sensor camera, view demo (mouse scroll changes disparity, shift-scroll - zoom)

Spherical view camera with two fish eye lenses

Two sensor stereo camera with two interchangeable C/CS-mount lenses

SCINI project underwater remotely operated vehicle (ROV)

A helmet-mounted panoramic camera by HomeSide 720°

Quadcopter using multisensor camera for navigation, by 'Autonomous Aerospace' team in Krasnoyarsk, Russia

Multisensor R5 camera by Google

Hackable by Design

To have all documentation open and released under free licenses such as GNU GPL and CERN OHL is a precondition, but it is not sufficient. The hackable products must be designed to be used that way and we strive to provide this functionality to our users. This is true for the products themselves and for the required tools, so we had to go as far as to develop software for FPGA tools integration with the popular Eclipse IDE and replace closed source manufacturer code that is not compatible with the free software Verilog simulators.

Same is true for the camera mechanical parts – users need to be able to reconfigure not just the firmware, FPGA code or rearrange the electronic components, but to change the physical layout of their systems. One popular solution to this challenge is to offer modular camera systems, but unfortunately this approach has its limits. It is similar to Lego® sets (where kids can assemble just one object already designed by the manufacturer) vs. Lego® bricks where the possibilities are limited by the imagination only. Often camera modularity is more about marketing (suggesting that you can start with a basic less expensive set and later buy more parts) than about the real user freedom.

We too provide modular components and try to maintain compatibility between the generations of modules – new Elphel cameras can directly interface more than a decade old sensor boards and this does not prevent them from simultaneously supporting modern sensor interfaces. Physical dimensions and shapes of the camera electronic boards also remain the same – they just pack more performance in the same volume as newer components become available. Being in the business of developing hackable cameras for 15 years, we realize that the modularity alone is not a magic bullet. Luckily now there are other possibilities.

3d printing camera parts

The 3D printing process offers freedom in the material world, but so far we have been pessimistic about its use for camera components where microns often matter. Some of the camera modules use invar (a metal alloy that has an almost zero thermal expansion coefficient at normal temperatures) elements to compensate for thermal expansion, and PLA plastic parts seem rather alien here. Nevertheless it is possible to meet the requirements of the camera mechanical design even with this material. In some cases it is sufficient to have a precise and stable sensor/lens combination – the sensor front end (SFE); small fluctuations in the mutual position/orientation of the individual SFEs may be compensated using the image data itself in the overlapping areas. It is possible to design a composite structure that combines metal elements of simple shape (such as aluminum, thin wall stainless steel tubes or even small diameter invar rods) and printed elements of complex shape. Modern fiber-reinforced materials for 3D printing promise to improve mechanical stability and reduce thermal expansion of the finished parts.

This technology perfectly fits the hackable multi-sensor systems and fills an important missing part of “sculpturing” the user camera. 3D printing is slow and we can not print every camera, but that is not really needed. While we certainly can print some parts, we are counting on the fact that this technology is now available in most parts of the world where we ship the products, and the parts can be manufactured by the end user. We anticipate that many of the customer designs, being experimental by nature, will need later modifications; building the parts by the user can save on overseas shipments too.

We expect that the users will design their own parts, but we will try to make their job easier and provide modifiable design examples and fragments that can be used in their parts. This idea of incorporating 3D printing technology into Elphel products is just 2 weeks old and we prepared several quick design prototypes to try it – below are some examples of our first generation of such camera parts.

Panoramic camera with perfect stitching - it uses 2 center cameras to measure distances

Stereo camera with 4 sensors having 1:3:2 bases providing all integer 1 to 6 multiples of 43mm in the lens pairs

Rectangular arranged 4-sensor stereo camera, adjustable bases

Short-base (48mm from center) 4-sensor camera

Printed adapter for the SFE of the 4-sensor panoramic camera

Printed adapter for the short-base 4-sensor camera

Various 3-d printed camera parts

It takes about 3 hours to print one SFE adapter


Deliverables

“3d print your camera freedom” – we really mean that. It is not about printing a camera or its body. You can always get a complete camera in one of the available configurations packaged in a traditional all-metal body if that matches your ideas; printing just adds freedom to the mechanical design.

We will continue to provide the full spectrum of camera components such as assembled boards and sensor front ends, as well as complete cameras in multiple configurations. For the 3D printed versions we will have the models and reusable design fragments posted online. We will be able to print some parts and ship factory assembled cameras. In some cases we may be able to help with the mechanical design, but we try to avoid doing any custom design ourselves. We consider our job done well if we are not needed to modify anything for the end user. Currently we use one of the proprietary mechanical CAD programs, so we do not have fully editable models and can only provide exported STEP files of the complete parts and interface fragments that can be incorporated in the user designs.

We would like to learn how to do this in FreeCAD – then it will be possible to provide usable source files and detailed instructions on how to customize them. The FreeCAD environment can be used to create custom generator scripts in Python – this powerful feature helped us to convert all our mechanical design files into x3d models that can be viewed and navigated in the browser (video tutorial). This web-based system proved to be not just a good presentation tool but also more convenient for parts navigation than the CAD program itself; we use it regularly for that purpose.

Maybe we will be able to find somebody who is both experienced in mechanical design with FreeCAD and interested in multi-sensor camera systems to cooperate with us on this project?

Tutorial 01: Access to Elphel camera documentation from 3D model

Thu, 04/21/2016 - 20:08

We have created a short video tutorial to help our users navigate through the 3D models of Elphel cameras. Cameras can be virtually taken apart and put back together, which helps to understand the camera configuration and access information about every camera component. Please feel free to comment on the video quality and usefulness, as we are launching a series of tutorials about cameras, software modifications, FPGA development on the 10393 camera board, etc., and we would like to receive feedback on them.

Description:

In this video we will show how the 3D model of the Elphel NC393 camera can be used to view the camera, understand the components it is made of, take it apart and put it back together, and get access to each part’s documentation.

The camera model is made using X3Dom technology and is auto-generated from the STEP files used for production.

In your browser you can open the link to one of the camera assemblies from the Elphel wiki page.

The buttons on the right list all camera components.

You can click on one of the buttons and the component will be selected on the model. Click again and the part will be selected without the rest of the model.
From here, using the buttons at the bottom of the screen, you can open the part in a new window, look for the part on the Elphel wiki, or hide the part and see the rest of the model.
Eventually you can return to the whole model by clicking on the part button once more, or use the reset model button at the top left corner.

You can also select a part by clicking on it in the model.

To deselect it, click again.

A right click removes the part, so you can get access to the insides of the camera.

Once you have selected a part you can look for more information about it on the Elphel wiki.

For a selected board you can type the board name in the wiki search and get access to the board description, circuit diagram, parts list and PCB layout.

All Elphel software is Free Software distributed under the GNU/GPL license, and Elphel camera designs are open hardware distributed under the CERN Open Hardware License.

Synchronizing Verilog, Python and C

Wed, 03/30/2016 - 14:04

Elphel NC393, like all the previous camera models, relies on the intimate cooperation of the FPGA programmed in Verilog HDL and the software that runs on a general purpose CPU. Just as the FPGA manufacturers increase the speed and density of their devices, so does the complexity of the Elphel cameras. The FPGA code consists of hundreds of files and tens of thousands of lines of code, and it is constantly modified during the lifetime of the product, both by us and by our users, to adapt the cameras to their applications. In most cases, if it is not just a bug fix or a minor improvement of previously implemented functionality, the software (and multiple layers of it) needs to be aware of the changes. This is both the power and the challenge of such hybrid systems, and synchronization of the changes is an important issue.

Verilog parameters

The Verilog code of the camera consists of parameterized modules; we try to use parameters and generate operators in most cases, but `define macros and `ifdef conditional directives are still used to switch some global options (like synthesis vs. compilation, various debug levels). The Eclipse-based VDT that we use for FPGA development is aware of the parameters, and when the code instantiates a parameterized module that has parameter-dependent port widths, VDT verifies that the instance ports match the signals connected to them and warns the developer if that is not the case. Many parameters are routed through the levels of the hierarchy so the deeper instances can be controlled from a single header file, making it obvious which parameters influence which module operations. Some parameters are specified directly, while some have to be calculated – this is the case for the register address decoders of the same module instances for different channels: such channels have the same relative address maps, but different base addresses. Most of the camera parameters (not counting the trivial ones where the module instance parameters are defined by the nature of the code) are contained in a single x393_parameters.vh header file. There are more than six hundred of them there, and most influence the software API.

Development cycle

When implementing new camera FPGA functionality, we start with simulation – always. Sometimes very small changes can be applied to the code, synthesized and tested in the actual hardware, but it almost never works this way – bypassing the simulation step. So far all the simulation we use consists of plain old Verilog test benches (such as this or that) – not even SystemVerilog. Most likely, for simulating CPU+FPGA devices the ideal would be to use a software programming language to model the CPU side of the SoC and keep Verilog (or VHDL, for those who prefer it) for the FPGA. Something like cocotb may work, especially since we are already manually translating Verilog into Python, but we are not there yet.

Translating Verilog to Python

So the next step is, as I just mentioned, manual translation of the Verilog tasks and functions used in simulation into Python code that can run on the actual hardware. The result does not look extremely Pythonic, as I try to follow the already tested Verilog code, but it is OK. Not all of the translation is manual – we use the import_verilog_parameters.py module to “understand” the parameters defined in the Verilog files (including the simple arithmetic and logical operations used to generate derivative parameters/localparams in the Verilog code) and get the values from the same source, which reduces the possibility of accidentally using old software with a modified FPGA implementation. As the parameters are known to the program only at run time, PyDev (running, btw, in the same Eclipse IDE as the VDT – just as a different “perspective”) cannot catch misspelled parameter names. So the program has an option to modify itself and generate pre-defines for each of the parameters. Only the top part of the vrlg module is human-written; everything below line 120 is automatically generated (and has to be re-generated only after adding new parameters to the Verilog source).

Hardware testing with Python programs

When the Verilog code is manually translated (or while new parts of the code are being translated or developed from scratch) it is possible to operate the actual camera. The top module is still called test_mcntrl, as it started with DDR3 memory calibration using the Levenberg-Marquardt algorithm (luckily it needs to run just once – it takes the camera 10 minutes to do the full calibration this way).

This program keeps track of the Verilog parameters and macros, exposes all the functions (with names not beginning with the underscore character), extracts docstrings from the code and combines them with the generated list of function parameters and their default values, and provides regexp search/help for the functions (a must when there are hundreds of such functions). The following is the output of such a help command run in the camera:

x393 +0.043s--> help w.*_sensor_r
=== write_sensor_reg16 === defined in x393_sensor.X393Sensor, /usr/local/bin/x393_sensor.py: 496)
Write i2c register in immediate mode
@param num_sensor - sensor port number (0..3), or "all" - same to all sensors
@param reg_addr16 - 16-bit register address (page+low byte, for MT9P006 high byte is an 8-bit slave address = 0x90)
@param reg_data16 - 16-bit data to write to sensor register
Usage: write_sensor_reg16 <num_sensor> <reg_addr16> <reg_data16>
x393 +0.010s-->

And the same command in the PyDev console window of the Eclipse IDE – “simulated” means that the program could not detect the FPGA, so it is not running on the target hardware:

x393(simulated) +0.121s--> help w.*_sensor_r
=== write_sensor_reg16 === defined in x393_sensor.X393Sensor, /home/andrey/git/x393/py393/x393_sensor.py: 496)
Write i2c register in immediate mode
@param num_sensor - sensor port number (0..3), or "all" - same to all sensors
@param reg_addr16 - 16-bit register address (page+low byte, for MT9P006 high byte is an 8-bit slave address = 0x90)
@param reg_data16 - 16-bit data to write to sensor register
Usage: write_sensor_reg16 <num_sensor> <reg_addr16> <reg_data16>
x393(simulated) +0.001s-->

The Python program was also used for the initial development of the AHCI SATA controller (before it was possible to add it as a Linux kernel platform driver), but the number of parameters there is much smaller, and most of the addresses are defined by the AHCI standard.

Synchronizing parameters with the kernel drivers

The next step is to update/redesign/develop the Linux kernel drivers to support the camera functionality. Learning the lessons from the previous camera models (where the software was growing with the hardware incrementally), we are trying to minimize manual intervention in the process of synchronizing the different layers of code (including the “hardware” one). The previous camera interface to the FPGA consisted of hand-crafted files such as x353.h. It started from x313.h (for the NC313 – our first camera based on an Axis CPU and a Xilinx FPGA; the same was used in the NC323 that scanned many billions of book pages), was modified for the NC333 and later for our previous NC353 used in the car-mounted panoramic cameras that captured most of the world’s roads.

Each time the files were modified to accommodate the new hardware, it was always a challenge to add extra bits to the memory controller addresses or to the image frame widths and heights (they are now all 16-bit wide – enough for multi-gigapixel sensors). With the Python modules already knowing all the current values of the Verilog parameters that define the software interface, it was natural to generate the C files needed to interface the hardware in the same environment.

Implementation of the register access in the FPGA

The memory-mapped registers in the camera share the same access mechanism – they use the MAXIGP0 (CPU master, general purpose, channel 0) AXI port available in the SoC, generously mapped to 1/4 of the whole 32-bit address range (0x40000000..0x7fffffff). While logically all the locations are 32-bit wide, some use just 1 byte or even no data at all – any write to such an address causes a defined action.

Internally the commands are distributed to the target modules over a tree of byte-parallel buses that tolerate register insertion; at the endpoints they are converted to parallel format by cmd_deser.v instances. The status data from the modules (sent by status_generate.v) is routed as messages (also in byte-parallel format, to reduce the required FPGA routing resources) to a single block memory that can be read over AXI by the CPU with zero delay. The status generation by the subsystems is individually programmed to be either on demand (in response to a write operation by the CPU) or automatic when the register data changes. While this write and read mechanism is common, the nature of the registers and data may be very different, as the project combines many modules designed at different times for different purposes. All the memory-mapped locations in the design fall into 3 categories:

  • Read-only registers that allow reading status from the various modules, DMA pointers and other small data items.
  • Read/write registers – the ones where the result of writing does not depend on any context. The full write register address range has a parallel shadow memory block, so reading from such an address returns the data that was last written there.
  • Write-only registers – all other registers, where the write action depends on the context. Some modules expose large tables through a pair of address/data locations in the address map; many others have independent bit fields with a corresponding “set” bit, so the internal values are modified only for the selected fields (see the sketch below).
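A hedged illustration of that “set” bit pattern (all the field names below are invented for the example – they are not part of the real register map):

/* Illustrative only - not a real x393 register layout */
typedef union {
    struct {
        u32 gain     : 8;  /* new analog gain value          */
        u32 set_gain : 1;  /* 1 - apply the 'gain' field     */
        u32 exposure :16;  /* new exposure value             */
        u32 set_exp  : 1;  /* 1 - apply the 'exposure' field */
        u32          : 6;
    };
    struct {
        u32 d32      :32;  /* whole register as a single word */
    };
} example_wo_ctrl_t;
/* writing with set_gain==1 and set_exp==0 changes only the gain,
 * leaving the exposure setting inside the module untouched       */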
Register access as C11 anonymous members

All the registers in the design are 32-bit wide and aligned to 4-byte ranges, even though not all of them use all the bits. Another common feature of the register model used is that some modules exist in multiple instances with evenly spaced base addresses, and some of them have a 2-level hierarchy (channel and sub-channel), where the address is a sum of the category base address, the relative register address and a linear combination of the two indices.
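As a hedged sketch of this addressing scheme (the names and numeric values below are invented for the illustration and are not taken from the real generated headers), a two-index register address macro could look like this:

/* Illustrative only - invented names and values */
#define EXAMPLE_BASE        0x40001000  /* category base address        */
#define EXAMPLE_REG_OFFS    0x0008      /* relative register address    */
#define EXAMPLE_CHN_STRIDE  0x0100      /* spacing between channels     */
#define EXAMPLE_SUB_STRIDE  0x0004      /* spacing between sub-channels */

#define EXAMPLE_REG_ADDR(chn, sub) \
    (EXAMPLE_BASE + EXAMPLE_REG_OFFS + \
     EXAMPLE_CHN_STRIDE * (chn) + EXAMPLE_SUB_STRIDE * (sub))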

An individual C typedef is generated for each set of registers that have different meanings of the bit fields – this way it is possible to benefit from compiler type checking. All the types fit into 32 bits, and as in many cases the same hardware register can accept alternative interpretations of individual bit fields, we use unions of anonymous (to make the access expressions shorter) bit-field structures.

Here is a generated example of such typedef code (full source):

// I2C contol/table data
typedef union {
    struct {
        u32 tbl_addr:       8; // [ 7: 0] (0) Address/length in 64-bit words (<<3 to get byte address)
        u32         :      20;
        u32 tbl_mode:       2; // [29:28] (3) Should be 3 to select table address write mode
        u32         :       2;
    };
    struct {
        u32 rah:            8; // [ 7: 0] (0) High byte of the i2c register address
        u32 rnw:            1; // [    8] (0) Read/not write i2c register, should be 0 here
        u32 sa:             7; // [15: 9] (0) Slave address in write mode
        u32 nbwr:           4; // [19:16] (0) Number of bytes to write (1..10)
        u32 dly:            8; // [27:20] (0) Bit delay - number of mclk periods in 1/4 of the SCL period
        u32 /*tbl_mode*/:   2; // [29:28] (2) Should be 2 to select table data write mode
        u32         :       2;
    };
    struct {
        u32 /*rah*/:        8; // [ 7: 0] (0) High byte of the i2c register address
        u32 /*rnw*/:        1; // [    8] (0) Read/not write i2c register, should be 1 here
        u32         :       7;
        u32 nbrd:           3; // [18:16] (0) Number of bytes to read (1..18, 0 means '8')
        u32 nabrd:          1; // [   19] (0) Number of address bytes for read (0 - one byte, 1 - two bytes)
        u32 /*dly*/:        8; // [27:20] (0) Bit delay - number of mclk periods in 1/4 of the SCL period
        u32 /*tbl_mode*/:   2; // [29:28] (2) Should be 2 to select table data write mode
        u32         :       2;
    };
    struct {
        u32 sda_drive_high: 1; // [    0] (0) Actively drive SDA high during second half of SCL==1 (valid with drive_ctl)
        u32 sda_release:    1; // [    1] (0) Release SDA early if next bit ==1 (valid with drive_ctl)
        u32 drive_ctl:      1; // [    2] (0) 0 - nop, 1 - set sda_release and sda_drive_high
        u32 next_fifo_rd:   1; // [    3] (0) Advance I2C read FIFO pointer
        u32         :       8;
        u32 cmd_run:        2; // [13:12] (0) Sequencer run/stop control: 0,1 - nop, 2 - stop, 3 - run
        u32 reset:          1; // [   14] (0) Sequencer reset all FIFO (takes 16 clock pulses), also - stops i2c until run command
        u32         :      13;
        u32 /*tbl_mode*/:   2; // [29:28] (0) Should be 0 to select controls
        u32         :       2;
    };
    struct {
        u32 d32:           32; // [31: 0] (0) cast to u32
    };
} x393_i2c_ctltbl_t;

Some member names in the example above are commented out (like /*tbl_mode*/ in lines 398, 408 and 420). This is done because some bit fields (in this case bits [29:28]) have the same meaning in all the alternative structures, and auto-generating complex union/structure combinations to create valid C code with each member having a unique name would produce rather clumsy code. Instead, this script makes sure that same-named members really designate the same bit fields, and then makes them anonymous while preserving the names for a human reader. The last member (u32 d32:32;) is added to each union, making it possible to access any of them as an unsigned 32-bit variable without casting.
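As a usage sketch (the field values below are purely illustrative, not a recipe for a real sensor), such a union is filled through its named bit fields and then handed to the hardware as a single 32-bit word:

/* Illustrative sketch only - field values are made up for this example */
x393_i2c_ctltbl_t tbl_data;
tbl_data.d32      = 0;     /* start from an all-zero register value        */
tbl_data.rah      = 0x00;  /* high byte of the i2c register address        */
tbl_data.rnw      = 0;     /* 0 - this table entry describes a write       */
tbl_data.sa       = 0x48;  /* 7-bit slave address (example value)          */
tbl_data.nbwr     = 2;     /* write 2 data bytes                           */
tbl_data.dly      = 100;   /* bit delay in mclk periods per 1/4 SCL period */
tbl_data.tbl_mode = 2;     /* 2 selects table data write mode              */
/* the whole word is then passed to the hardware as tbl_data.d32,
 * e.g. through one of the generated accessor functions shown below */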

And this is a snippet of the generator code that produced the typedef above:

def _enc_i2c_tbl_wmode(self):
    dw=[]
    dw.append(("rah",      vrlg.SENSI2C_TBL_RAH,    vrlg.SENSI2C_TBL_RAH_BITS,  0, "High byte of the i2c register address"))
    dw.append(("rnw",      vrlg.SENSI2C_TBL_RNWREG, 1,                          0, "Read/not write i2c register, should be 0 here"))
    dw.append(("sa",       vrlg.SENSI2C_TBL_SA,     vrlg.SENSI2C_TBL_SA_BITS,   0, "Slave address in write mode"))
    dw.append(("nbwr",     vrlg.SENSI2C_TBL_NBWR,   vrlg.SENSI2C_TBL_NBWR_BITS, 0, "Number of bytes to write (1..10)"))
    dw.append(("dly",      vrlg.SENSI2C_TBL_DLY,    vrlg.SENSI2C_TBL_DLY_BITS,  0, "Bit delay - number of mclk periods in 1/4 of the SCL period"))
    dw.append(("tbl_mode", vrlg.SENSI2C_CMD_TAND,   2,                          2, "Should be 2 to select table data write mode"))
    return dw

The vrlg.* values used above are in turn read from the x393_parameters.vh Verilog file:

//i2c page table bit fields
parameter SENSI2C_TBL_RAH =        0, // high byte of the register address
parameter SENSI2C_TBL_RAH_BITS =   8,
parameter SENSI2C_TBL_RNWREG =     8, // read register (when 0 - write register
parameter SENSI2C_TBL_SA =         9, // Slave address in write mode
parameter SENSI2C_TBL_SA_BITS =    7,
parameter SENSI2C_TBL_NBWR =      16, // number of bytes to write (1..10)
parameter SENSI2C_TBL_NBWR_BITS =  4,
parameter SENSI2C_TBL_NBRD =      16, // number of bytes to read (1 - 8) "0" means "8"
parameter SENSI2C_TBL_NBRD_BITS =  3,
parameter SENSI2C_TBL_NABRD =     19, // number of address bytes for read (0 - 1 byte, 1 - 2 bytes)
parameter SENSI2C_TBL_DLY =       20, // bit delay (number of mclk periods in 1/4 of SCL period)
parameter SENSI2C_TBL_DLY_BITS=    8,

The auto-generated files also include x393.h; it provides other constant definitions (like valid values for the bit fields) – lines 301..303 – and declarations of the functions used to access the registers. The names of the functions for read-only and write-only registers are derived from the symbolic address names by converting them to lower case; the ones that deal with read/write registers get set_ and get_ prefixes attached.

#define X393_CMPRS_CBIT_CMODE_JPEG18  0x00000000 // Color 4:2:0
#define X393_CMPRS_CBIT_FRAMES_SINGLE 0x00000000 // Use single-frame buffer
#define X393_CMPRS_CBIT_FRAMES_MULTI  0x00000001 // Use multi-frame buffer

// Compressor control
void               x393_cmprs_control_reg (x393_cmprs_mode_t d,  int cmprs_chn); // Program compressor channel operation mode
void               set_x393_cmprs_status  (x393_status_ctrl_t d, int cmprs_chn); // Setup compressor status report mode
x393_status_ctrl_t get_x393_cmprs_status  (int cmprs_chn);

The register access functions are implemented with readl() and writel(); this is the corresponding section of the x393.c file:

// Compressor control

// Program compressor channel operation mode
void x393_cmprs_control_reg (x393_cmprs_mode_t d, int cmprs_chn)
{
    writel(d.d32, mmio_ptr + (0x1800 + 0x40 * cmprs_chn));
}

// Setup compressor status report mode
void set_x393_cmprs_status (x393_status_ctrl_t d, int cmprs_chn)
{
    writel(d.d32, mmio_ptr + (0x1804 + 0x40 * cmprs_chn));
}

x393_status_ctrl_t get_x393_cmprs_status (int cmprs_chn)
{
    x393_status_ctrl_t d;
    d.d32 = readl(mmio_ptr + (0x1804 + 0x40 * cmprs_chn));
    return d;
}
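A hedged sketch of how a driver might use these generated accessors (the mapping size, the error handling and treating x393_cmprs_mode_t as having a d32 member are assumptions of this illustration, not the actual elphel393 driver code):

/* Sketch only - not the actual Elphel driver code */
#include <linux/io.h>
#include <linux/errno.h>
#include "x393.h"               /* generated types and accessor functions */

void __iomem *mmio_ptr;         /* in the real code this pointer is defined
                                 * and initialized where x393.c expects it */

static int example_cmprs_setup(void)
{
    x393_cmprs_mode_t  mode;
    x393_status_ctrl_t status_ctrl;

    mmio_ptr = ioremap(0x40000000, 0x3000);  /* window size is illustrative */
    if (!mmio_ptr)
        return -ENOMEM;

    mode.d32 = 0;                            /* set the needed bit fields here */
    x393_cmprs_control_reg(mode, 0);         /* write-only register, channel 0 */

    status_ctrl = get_x393_cmprs_status(0);  /* read back a read/write register */
    set_x393_cmprs_status(status_ctrl, 0);
    return 0;
}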

There are two other header files generated from the same data. One of them (x393_defs.h) is just an alternative way to represent the register addresses – instead of the getter and setter functions it defines preprocessor macros:

// Compressor control
#define X393_CMPRS_CONTROL_REG(cmprs_chn) (0x40001800 + 0x40 * (cmprs_chn)) // Program compressor channel operation mode, cmprs_chn = 0..3, data type: x393_cmprs_mode_t (wo)
#define X393_CMPRS_STATUS(cmprs_chn)      (0x40001804 + 0x40 * (cmprs_chn)) // Setup compressor status report mode, cmprs_chn = 0..3, data type: x393_status_ctrl_t (rw)

The last generated file – x393_map.h uses the preprocessor macro format to provide a full ordered address map of all the available registers for all channels and sub-channels. It is intended to be used just as a reference for developers, not as an actual include file.

Conclusions

The generated code for the Elphel NC393 camera is definitely very hardware-specific; its main purpose is to encapsulate as much of the hardware interface details as possible and so reduce the dependence of the higher software layers on modifications of the HDL code. Such tasks are common to other projects that involve CPU/FPGA tandems, and a similar approach to organizing the software/hardware interface may be useful there too.

NAND flash support for Xilinx Zynq in U-Boot SPL

Fri, 03/18/2016 - 17:40
Overview
  • Target board: Elphel 10393 (Xilinx Zynq 7Z030) with 1GB NAND flash
  • U-Boot final image files (both support NAND flash commands):
    • boot.bin - SPL image – loaded by Xilinx Zynq BootROM into OCM, no FSBL required
    • u-boot-dtb.img - full image – loaded by boot.bin into RAM
  • Build environment and dependencies (for details see this article):


 

The story

First of all, Ezynq was updated to use the mainstream U-Boot to remove an extra agent (u-boot-xlnx) from the dependency chain. But since the flash driver for Xilinx Zynq hasn’t made it to the mainstream yet, it was copied into Ezynq’s source tree for U-Boot. When building, this source tree is copied over the U-Boot source files. We will make a proper patch someday.

Full image (u-boot-dtb.img)

Next, support for the flash and its commands was added to the board configuration for the full U-Boot image. Required defines:

include/configs/elphel393.h (from zynq-common.h in u-boot-xlnx):
#define CONFIG_NAND_ZYNQ
#ifdef CONFIG_NAND_ZYNQ
#define CONFIG_CMD_NAND_LOCK_UNLOCK /*zynq driver doesn't have lock/unlock commands*/
#define CONFIG_SYS_MAX_NAND_DEVICE 1
#define CONFIG_SYS_NAND_SELF_INIT
#define CONFIG_SYS_NAND_ONFI_DETECTION
#define CONFIG_MTD_DEVICE
#endif
#define CONFIG_MTD

NOTE: the original Zynq NAND flash driver for U-Boot (zynq_nand.c) doesn’t have lock/unlock commands. The same applies to pl35x_nand.c in the kernel they provide. By design, on power-on the NAND flash chip on the 10393 is locked (write protected). While these commands were added to both drivers, there is no need to unlock in U-Boot, as all of the writing will be performed from the OS booted from either flash or a micro SD card. Some other designs with NAND flash do not have the flash locked on power-on.

And configs/elphel393_defconfig:

CONFIG_CMD_NAND=y

There are a few more small modifications to add the driver to the build – see ezynq/u-boot-tree. Anyway, it worked on the board. Easy. Type “nand” in the U-Boot terminal for the available commands.

SPL image (boot.bin)

Then the changes for the SPL image were made.

Currently U-Boot is run twice to build both images. For the SPL run it sets CONFIG_SPL_BUILD, and the results are found in the spl/ folder. So, in general, anyone who would like to build U-Boot with an SPL supporting NAND flash for some other board should check common/spl/spl_nand.c for the required functions; they are:

nand_spl_load_image()
nand_init() /*no need if drivers/mtd/nand.c is included in the SPL build*/
nand_deselect() /*usually an empty function*/

And drivers/mtd/nand/ – for SPL driver examples; there are not too many of them for some reason.

For nand_init() I included drivers/mtd/nand.c – it calls board_nand_init(), which is found in the driver for the full image – zynq_nand.c.

Defines in include/configs/elphel393.h:

#define CONFIG_SPL_NAND_ELPHEL393
#define CONFIG_SYS_NAND_U_BOOT_OFFS 0x100000 /*look-up in dts!*/
#define CONFIG_SPL_NAND_SUPPORT
#define CONFIG_SPL_NAND_DRIVERS
#define CONFIG_SPL_NAND_INIT
#define CONFIG_SPL_NAND_BASE
#define CONFIG_SPL_NAND_ECC
#define CONFIG_SPL_NAND_BBT
#define CONFIG_SPL_NAND_IDS
/* Load U-Boot to this address */
#define CONFIG_SYS_NAND_U_BOOT_DST CONFIG_SYS_TEXT_BASE
#define CONFIG_SYS_NAND_U_BOOT_START CONFIG_SYS_NAND_U_BOOT_DST

CONFIG_SYS_NAND_U_BOOT_OFFS 0x100000 is the offset in the flash where u-boot-dtb.img is written – this is done from the OS. The flash partitions are defined in the device tree for the kernel.

Again a few small modifications (KConfigs and makefiles) to include everything in the build – see ezynq/u-boot-tree.

NOTES:

  • Before, boot.bin was about 60K (out of the 192K available). After everything was included the size is 110K. Well, it fits, so the driver can be optimized some time in the future to keep only what is needed – init and read.
  • drivers/mtd/nand/nand_base.c – kzalloc would hang the board – had to change it in the SPL build.
  • drivers/mtd/nand/zynq_nand.c – added a timeout to some flash functions (NAND_CMD_RESET) – this addresses the case when the board has the flash width configured (through MIO pins) but does not carry flash, or the flash cannot be detected for some reason. Without the timeout such boards hang.
Other Notes
  • With U-Boot moving to Kbuild it is not clear what will happen to CONFIG_EXTRA_ENV_SETTINGS – a multi-line define.
  • Current U-Boot uses a stripped down device tree – added to Ezynq.
  • The ideal scenario is to boot from SPL straight to OS – the falcon mode (CONFIG_SPL_OS_BOOT). Consider in future.
  • Tertiary Program Loader (TPL) – no plans.

 

Free FPGA: Reimplement the primitives models

Fri, 03/18/2016 - 16:42

We added the AHCI SATA controller Verilog code to the rest of the camera FPGA project; together they now use 84% of the Zynq slices. Building the FPGA bitstream file requires proprietary tools, but all the simulation can be done with just Free Software – Icarus Verilog and GTKWave. Unfortunately it is not possible to distribute a complete set of the files needed – our code instantiates a few FPGA primitives (hard-wired modules of the FPGA) that have a proprietary license.

Please help us to free the FPGA devices for developers by re-implementing the primitives as Verilog modules under GNU GPLv3+ license – in that case we’ll be able to distribute a complete self-sufficient project. The models do not need to provide accurate timing – in many cases (like in ours) just the functional simulation is quite sufficient (combined with the vendor static timing analysis). Many modules are documented in Xilinx user guides, and you may run both the original and replacement models through the simulation tests in parallel, making sure the outputs produce the same signals. It is possible that such designs can be used as student projects when studying Verilog.

Models we are looking for

The camera project includes more than 200 Verilog files, and it depends on just 29 primitives from the Xilinx simulation library (total number of the files there is 214):

  • BUFG.v
  • BUFH.v
  • BUFIO.v
  • BUFMR.v
  • BUFR.v
  • DCIRESET.v
  • GLBL.v
  • IBUF.v
  • IBUFDS_GTE2.v
  • IBUFDS.v
  • IDELAYCTRL.v
  • IDELAYE2_FINEDELAY.v
  • IDELAYE2.v
  • IOBUF_DCIEN.v
  • IOBUF.v
  • IOBUFDS_DCIEN.v
  • ISERDESE1.v *
  • MMCME2_ADV.v
  • OBUF.v
  • OBUFT.v
  • OBUFTDS.v
  • ODDR.v
  • ODELAYE2_FINEDELAY.v
  • OSERDESE1.v *
  • PLLE2_ADV.v
  • PS7.v
  • PULLUP.v
  • RAMB18E1.v
  • RAMB36E1.v

This is just a raw list of the unisims modules referenced in the design. It includes PS7.v – a placeholder model of the ARM processing system; the modules for AXI functionality simulation are already included in the project. That implementation is incomplete, but sufficient for the camera simulation, and it can be used for other Zynq-based projects. Some primitives are very simple (like DCIRESET), some are much more complex. Two modules (ISERDESE1.v and OSERDESE1.v) in the project are the open-source replacements for the encrypted models of the enhanced hardware in Zynq (ISERDESE2.v and OSERDESE2.v) – we used a simple `ifdef wrapper that selects the reduced (but sufficient for us) functionality of the earlier open source model for simulation and the current “black box” for synthesis.

The list above includes all the files we need for our current project; as soon as Free Software replacements become available we will be able to distribute a self-sufficient project. Other FPGA development projects may need other primitives, so ideally we would like to see all of the primitives get free simulation models.

Why is it important

Elphel is developing high-performance products based on FPGA designs that we believe are created for Freedom. We share all the code with our users under the GNU General Public License version 3 (or later), but the project depends on proprietary tools distributed by vendors who have a monopoly on the tools for their silicon.

There are very interesting projects (like icoBOARD) that use smaller devices with a completely Free toolchain (Yosys), but the work of those developers is seriously complicated by the non-cooperation of the FPGA vendors. I hope that in the future there will be laws that limit the monopoly of the device manufacturers and require complete documentation for the products they release to the public. There are advanced patent laws that can protect the FPGA manufacturers and their inventions from competitors; there is no real need for them to fight their users by hiding the documentation for their products.

Otherwise this secrecy and “Security through Obscurity” will eventually (and rather soon) lead to a very insecure world where all those self-driving cars and “smart homes” will obey not us, but the “bad guys”, as today’s software malware descends to ever deeper hardware levels. It is very naive to believe that the manufacturers are the ultimate masters with complete control of “their” devices of ever-growing complexity. Unfortunately they do not realize this and are still living in 20th-century dreams, treating their users as kids who can only play with “Lego blocks” and believe in powerful wizards who pretend to know everything.

We use proprietary toolchain for implementation, but exclusively Free tools – for simulation

Our projects require devices that are more advanced than those that can already be programmed with independently designed Free Software tools, so we have to use the proprietary ones. Freeing the simulation seems achievable, and we made a step in this direction – we made the whole project simulation possible with Free Software. Working with the HDL code and simulating it takes the largest part of the FPGA design cycle – in our experience 2/3 to 3/4 – and only the remaining part involves running the toolchain and testing/troubleshooting the hardware. The last step (hardware troubleshooting) can also be done without any proprietary software – we never used any in this project, which utilizes most of the Xilinx Zynq FPGA resources. The combination of Verilog modules and extensible Python programs that run on the target devices proved to be a working and convenient solution that keeps the developer in full control of the process. These programs read the Verilog header files with the parameter definitions to synchronize register and bit field addresses between the hardware and the software that uses them.

Important role of the device primitives models

Modern FPGAs include many hard-wired embedded modules that supplement the uniform “sea of gates” – the addition of such modules significantly increases the performance of the device while preserving its flexibility. The modules include memory blocks, DSP slices, PLL circuits, serial-to-parallel and parallel-to-serial converters, programmable delays, high-speed serial transceivers, processor cores and more. Some modules can be automatically extracted by the synthesis software from the source HDL code, but in many cases we have to instantiate such primitives directly, and the code then directly references the device primitives.

The fewer primitives are directly instantiated in the project, the more portable (not tied to a particular FPGA architecture) it is; but in some cases the synthesis tools (which are proprietary, so not fixable by the users) incorrectly extract the primitives, and in others the module functionality is so specific to the device that the synthesis tool will not even try to recognize it in behavioral Verilog code.

Even open source proprietary modules are inconvenient

In earlier days Xilinx provided all of their primitive models as source code (but under a non-free license), so it was possible to use Free Software tools to simulate the design. But even then it was not very convenient for either our users or ourselves.

It is not possible to distribute the proprietary code with the projects, so our users had to register with the FPGA manufacturer, download the multi-gigabyte software distribution and agree to the specific license terms before they were able to extract the primitive models missing from our project repository. The software license includes a requirement to install mandatory spyware to which you give permission to transfer your files to the manufacturer – this may be unacceptable for many of our users.

It is also inconvenient for us. The primitive models provided by the manufacturer sometimes have problems – they either do not match the actual hardware or lack full compatibility with the simulator programs we use. In such cases we provided patches that could be applied to the manufacturer’s code. If Xilinx kept the models in a public Git repository, we could base our patches on particular tags or commits, but that is not the case, and the manufacturer/software provider reserves the right to change the distributed files at any time without notice. So we have to update the patches to keep the simulation working even when we did not change a single line of our own code.

Encrippled modules are unacceptable

When I started working on the FPGA design for Zynq I was surprised to notice that Xilinx had abandoned the practice of providing source code for the simulation models of the device primitives. The new versions of the older primitives (such as ISERDESE2.v and OSERDESE2.v instead of the previous ISERDESE1.v and OSERDESE1.v) now come in encrippled (crippled by encryption) form, while they were provided as source before. And it is likely this alarming tendency will continue – many proprietary vendors hide the source code just because they are not so proud of its quality and cannot resist the temptation to encrypt it instead of removing the obsolete statements and updating the code to modern standards.

Such code is not just inconvenient, it is completely unacceptable for our design process. The first obvious reason is that it is not compatible with the most important development tool – a simulator. Xilinx provides decryption keys to trusted vendors of proprietary simulators and I do not have plans to abandon my choice of the tool just because the FPGA manufacturer prefers a different one.

Personally I would not use any “black boxes” even if Icarus supported them – FPGA design is already complex enough without spending extra time of your life guessing why a “black box” behaves differently than expected. And all the “black boxes” and “wizards” are always limited and never match the real hardware 100%. That is fine when they cover most of the cases and you have the ability to peek inside when something goes wrong, so you can isolate the bug and (if it is actually a bug of the model, not of your code) report it precisely and find the solution with the manufacturer’s support. Reporting problems in the form “my design does not work with your black box” is rather useless even when you provide all your code – it is a difficult task for the support team to troubleshoot a mixture of your and their code – something you could do better yourself.

So far we have used two different solutions to handle encrypted modules. In one case, when an older non-crippled model was available, we just used the older version for the new hardware; the other case required a complete re-implementation of the GTX serial transceiver model. The current model still has many limitations even with its 3000+ lines of code, but it proved to be sufficient for the SATA controller development.

Additional permission under GNU GPL version 3 section 7

GNU General Public License Version 3 offers a tool to apply the license in the still “grey area” of FPGA code. When we used the earlier GPLv2 for the FPGA projects we realized that it was more a statement of intentions than a binding license – the FPGA bitstream as well as the simulation inevitably combined free and proprietary components. That was OK for us as the copyright holders, but it would make it impossible for others to distribute their derivative projects in a GPL-compliant way. Version 3 has a Section 7 that can be used to grant permission to distribute derivative projects that depend on the non-free components still needed to:

  1. generate a bitstream (equivalent to a software “binary”) file and
  2. simulate the design with Free Software tools

The GPL requirement to provide other components under the same license terms when distributing the combined work remains in force – it is not possible to mix this code with any other non-free code. The following is our wording of the additional permission as included in every Verilog file header in Elphel FPGA projects.

Additional permission under GNU GPL version 3 section 7:
If you modify this Program, or any covered work, by linking or combining it
with independent modules provided by the FPGA vendor only (this permission
does not extend to any 3-rd party modules, "soft cores" or macros) under
different license terms solely for the purpose of generating binary "bitstream"
files and/or simulating the code, the copyright holders of this Program give
you the right to distribute the covered work without those independent modules
as long as the source code for them is available from the FPGA vendor free of
charge, and there is no dependence on any encrypted modules for simulating of
the combined code. This permission applies to you if the distributed code
contains all the components and scripts required to completely simulate it
with at least one of the Free Software programs.

Available documentation for Xilinx FPGA primitives

Xilinx has User Guide files available for download on their web site; some of the following links include a release version and may change in the future. These files provide valuable information needed to re-implement the simulation models.

  • UG953 Vivado Design Suite 7 Series FPGA and Zynq-7000 All Programmable SoC Libraries Guide lists all the primitives, their I/O ports and attributes
  • UG474 7 Series FPGAs Configurable Logic Block has description of the CLB primitives
  • UG473 7 Series FPGAs Memory Resources has description for Block RAM modules, ports, attributes and operation of these modules
  • UG472 7 Series FPGAs Clocking Resources provides information for the clock buffering (BUF*) primitives and clock management tiles – MMCM and PLL primitives of the library
  • UG471 7 Series FPGAs SelectIO Resources covers advanced I/O primitives, including DCI, programmable I/O delays elements and serializers/deserializers, I/O FIFO elements
  • UG476 7 Series FPGAs GTX/GTH Transceivers is dedicated to the high speed serial transceivers. Simulation models for these modules are partially re-implemented for use in AHCI SATA Controller.

AHCI platform driver

Mon, 03/14/2016 - 20:16
AHCI PLATFORM DRIVER

In kernels prior to 2.6.x AHCI was only supported through PCI and hence required custom patches to support a platform AHCI implementation. All modern kernels have SATA support as part of the AHCI framework, which significantly simplifies driver development. Platform drivers follow the standard driver model convention, which is described in Documentation/driver-model/platform.txt in the kernel source tree, and provide methods called during discovery or enumeration in their platform_driver structure. This structure is used to register the platform driver and is passed to the module_platform_driver() helper macro, which replaces the module_init() and module_exit() functions. We redefined the probe() and remove() methods of platform_driver in our driver to initialize/deinitialize the resources defined in the device tree and to allocate/deallocate memory for a driver-specific structure. We also opted for the resource-managed function devm_kzalloc(), as it seems to be the preferred way of resource allocation in modern drivers. The memory allocated with a resource-managed function is associated with the device and will be freed automatically after the driver is unloaded.
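A minimal sketch of such a platform driver skeleton (the driver name, the compatible string and the allocation size are illustrative assumptions, and the error handling is reduced to the essentials – this is not the actual ahci_elphel.c code):

/* Sketch only - illustrative names, not the Elphel driver */
#include <linux/module.h>
#include <linux/platform_device.h>
#include <linux/of.h>
#include <linux/slab.h>

static int example_probe(struct platform_device *pdev)
{
    /* resources described in the device tree are parsed here; driver-private
     * data allocated with devm_kzalloc() is freed automatically on unload   */
    void *priv = devm_kzalloc(&pdev->dev, 64, GFP_KERNEL);
    if (!priv)
        return -ENOMEM;
    platform_set_drvdata(pdev, priv);
    return 0;
}

static int example_remove(struct platform_device *pdev)
{
    /* devm-managed resources are released automatically */
    return 0;
}

static const struct of_device_id example_of_match[] = {
    { .compatible = "elphel,example-ahci" },   /* illustrative string */
    { }
};
MODULE_DEVICE_TABLE(of, example_of_match);

static struct platform_driver example_driver = {
    .probe  = example_probe,
    .remove = example_remove,
    .driver = {
        .name           = "example-ahci",
        .of_match_table = example_of_match,
    },
};
module_platform_driver(example_driver);   /* replaces module_init()/module_exit() */
MODULE_LICENSE("GPL");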

HARDWARE LIMITATIONS

As Andrey has already pointed out in his post, the current implementation of the AHCI controller has several limitations, and our platform driver is affected by two of them.
First, there is a deviation from the AHCI specification which should be considered during platform driver implementation. The specification defines that the host bus adapter uses system memory for the Command List Structure, Received FIS Structure and Command Tables. The common approach in platform drivers is to allocate a block of system memory with a single dmam_alloc_coherent() call, set pointers to the different structures inside this block and store these pointers in the port-specific structure ahci_port_priv. The first two of these structures in x393_sata are stored in FPGA RAM blocks and mapped to register memory, as it was easier to make them this way. Thus we need to allocate a block of system memory for the Command Tables only and set the other pointers to predefined addresses.
Second, and the most significant one from the driver’s point of view, proved to be the single implemented command slot. Low level drivers assume that all 32 slots in the Command List Structure are implemented and explicitly use the last slot for internal commands in the ata_exec_internal_sg() function, as shown in the following code snippet:
struct ata_queued_cmd *qc;
unsigned int tag, preempted_tag;
 
if (ap->ops->error_handler)
    tag = ATA_TAG_INTERNAL;
else
    tag = 0;
qc = __ata_qc_from_tag(ap, tag);

ATA_TAG_INTERNAL is defined in libata.h and is reserved for internal commands. We wanted to keep all the code of our driver in our own sources and make as few changes to the existing Linux drivers as possible, to simplify further development and upgrades to newer kernels. So we decided that substituting the command tag in our own code that handles command preparation would be the easiest way to fix this issue.

DRIVER STRUCTURES

Proper platform driver initialization requires that several structures be prepared and passed to the platform functions during driver probing. One of them is scsi_host_template, which serves as a direct interface between middle level drivers and low level drivers. Most AHCI drivers use the default AHCI_SHT macro to fill the structure with predefined values. This structure contains a field called .can_queue which is of particular interest to us: it sets the maximum number of simultaneous commands the host bus adapter can accept, and this is the way to tell middle level drivers that our controller has only one command slot. The scsi_host_template structure was redefined in our driver as follows:
static struct scsi_host_template ahci_platform_sht = {
    AHCI_SHT(DRV_NAME),
    .can_queue = 1,
    .sg_tablesize = AHCI_MAX_SG,
    .dma_boundary = AHCI_DMA_BOUNDARY,
    .shost_attrs = ahci_shost_attrs,
    .sdev_attrs = ahci_sdev_attrs,
};
Unfortunately, the ATA layer driver does not take into consideration the value we set in this template and uses a hard-coded tag value for its internal commands, as I pointed out earlier, so we had to fix this in the command preparation handler.
ata_port_operations is another important driver structure, as it controls how the low level driver interfaces with the upper layers. This structure is defined as follows:
static struct ata_port_operations ahci_elphel_ops = {
    .inherits = &ahci_ops,
    .port_start = elphel_port_start,
    .qc_prep = elphel_qc_prep,
};
The port start and command preparation handlers were redefined to add some implementation-specific code. .port_start is used to allocate memory for the Command Table and to set the pointers to the Command List Structure and Received FIS Structure. We decided to use streaming DMA mapping instead of the coherent DMA mapping used in the generic AHCI driver, as explained in Andrey’s article. .qc_prep is used to change the tag of the current command and to organize proper access to the DMA-mapped buffer.
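A hedged illustration of the tag substitution idea (this is not the actual elphel_qc_prep() code – the real handler also builds the command FIS and PRDT and deals with the DMA-mapped buffer):

/* Illustration only: redirect the libata internal command to the single
 * hardware command slot that is actually implemented.                   */
static void elphel_qc_prep(struct ata_queued_cmd *qc)
{
    unsigned int tag = qc->tag;

    if (tag == ATA_TAG_INTERNAL)
        tag = 0;        /* use the only implemented slot */

    /* ... fill command header slot 'tag' (instead of qc->tag) and the
     *     corresponding command table here ... */
}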

PERFORMANCE CONSIDERATIONS

We used debug code in the driver along with profiling code in the controller to estimate the overall performance and found out that the upper driver layers introduce significant delays into the command execution sequence. The delay between the last DMA transaction in a sequence and the next command could be as high as 2 ms. There are various sources of overhead which could lead to such delays, for instance file system operations and context switches in the operating system. We will try to use read/write operations on a raw device to improve performance.

LINKS

AHCI/SATA stack under GNU GPL
GitHub: AHCI driver source code

AHCI/SATA stack under GNU GPL

Sat, 03/12/2016 - 16:14

The implementation includes an AHCI SATA host adapter in Verilog under GNU GPLv3+ and a software driver for GNU/Linux running on Xilinx Zynq. The complete project is simulated with Icarus Verilog; no encrypted modules are required.

This concludes the last major FPGA development step in our race against the finished camera parts and boards already arriving at the Elphel facility, before the NC393 can be shipped to our customers.

Fig. 1. AHCI Host Adapter block diagram


Why did we need SATA?

Elphel cameras started as network cameras – devices attached to and controlled over the Ethernet; the previous generations used a 100Mbps connection (limited by the SoC hardware), and the NC393 uses GigE. But this bandwidth is still not sufficient, as many camera applications require high image quality (compared to “raw”) without the compression artifacts that are always present (even if not noticeable to the human viewer) with video codecs. Recording video/images to some storage media is definitely an option, and we used it in the older cameras too, but the SoC IDE controller limited the recording speed to just 16MB/s. That was about twice the 100Mb/s network, but still a bottleneck for the system in many cases. The NC393 can generate 12 times the pixel rate of the NC353 (4 simultaneous channels instead of a single one, each running 3 times faster), so we need about 200MB/s recording speed (12 × 16MB/s ≈ 192MB/s) to keep the same compression quality at the increased maximal frame rate; an even higher recording rate, which modern SSDs are capable of, is very desirable.

Fig.2. SATA routing: a) Camera records data to the internal SSD; b) Host computer connects directly to the internal SSD; c) Camera records to the external mass storage device

The most universal ways to attach a mass storage device to the camera would be USB, SATA and PCIe. USB 2.0 is too slow, and USB 3.0 is not available in the Xilinx Zynq that we use. So what remains are SATA and PCIe. Both interfaces are possible to implement in Zynq, but PCIe (being faster, as it uses multiple lanes) is good for internal storage, while SATA (in the form of eSATA) can be used to connect external storage devices too. We may consider adding PCIe capability to boost the recording speed, but for the initial implementation SATA seems to be more universal, especially when using a trick we tested in the Eyesis series of cameras for fast unloading of the recorded data.

Routing SATA in the camera

It is a solution similar to USB On-The-Go (a similar term for SATA is used for unrelated devices), where the same connector is used to interface a smartphone to a host PC (the PC is the host, the smartphone – the device) and to connect a keyboard or another device when the phone becomes a host. In contrast to USB cables, eSATA ones have always had identical connectors on both ends, so nothing prevented physically linking two computers or two external drives together. As eSATA does not carry power it is safe to do, but nothing will work – two computers will not talk to each other, and the storage devices will not be able to copy data between themselves. One of the reasons is that the two signal pairs in a SATA cable are uni-directional – pair A is an output for the host and an input for the device, pair B – the opposite.

The camera uses a Vitesse (now Microsemi) VSC3304 crosspoint switch (Eyesis uses the larger VSC3312) that has a very useful feature – reversible I/O ports, so the same physical pins can be configured as inputs or outputs, making it possible to use a single eSATA connector in both host and device mode. Additionally, the VSC3304 allows changing the output signal level (eSATA requires a higher swing than internal SATA) and performing analog signal correction on both inputs and outputs, facilitating signal integrity between the attached SATA devices.

Aren’t SATA implementations for Xilinx Zynq already available?

Yes and no. When starting the NC393 development I contacted Ashwin Mendon, who already had SATA-2 working on a Xilinx Virtex. The code is available on OpenCores under the GNU GPL license, and there is an article published by IEEE. The article turned out to be very useful for our work, but the code itself had to be mostly re-written – it was for different hardware, and we were not able to simulate the core as it depends on Xilinx proprietary encrypted primitives – a feature not compatible with the free software simulators we use.

Other implementations we could find (including complete commercial solution for Xilinx Zynq) have licenses not compatible with the GNU GPLv3+, and as the FPGA code is “compiled” to a single “binary” (bitstream file) it is not possible to mix free and proprietary code in the same design.

Implementation

The SATA host adapter is implemented for the Elphel NC393 camera; the 10393 system board documentation is on our wiki page. The Verilog code is hosted on GitHub, and the GNU/Linux driver ahci_elphel.c is also there (it is the only hardware-specific driver file required). The repository contains a complete setup for simulation with Icarus Verilog and for synthesis/implementation with the Xilinx tools as a VDT (Eclipse IDE plugin) project.

Current limitations

The current project was designed to be a minimal useful implementation with provisions for future enhancements. Here is the list of what is not yet done:

  • It is only SATA2 (3Gb/s) while the hardware is SATA3 (6Gb/s) capable. We will definitely work on SATA3 after we complete the migration to the new camera platform. Most of the project modules are already designed for the higher data rate.
  • No scrambling of outgoing primitives, only recognizing incoming ones. Generation of CONTp is optional by SATA standard, but we will definitely add this as it reduces EMI and we already implemented multiple hardware measures in this direction. Most likely we will need it for the CE certification.
  • No FIS-based switching for port multipliers.
  • Single command slot, and no NCQ. This functionality is optional in AHCI, but it will be added – not much is missing in the current design.
  • No power management. We will look for the best way to handle it as some of the hardware control (like DevSleep) requires i2c communication with the interface board, not directly under FPGA control. Same with the crosspoint switch.

There is also a deviation from the AHCI standard that I first considered temporary, but now think will stay this way. AHCI specifies that the Command List structure (an array of 32 8-DWORD command headers) and the 256-byte Received FIS structure are stored in system memory. On the other hand, these structures need non-paged memory, are rather small and require access from both the CPU and the hardware. In x393_sata these structures are mapped to the register memory (stored in the FPGA RAM blocks) – not to the regular system memory. When working on the AHCI driver we noticed that it is even simpler to do it that way. The command tables themselves, which involve more data passing from the software to the device (especially the PRDT – physical region descriptor tables generated from the scatter-gather lists of the allocated data memory), are stored in system memory as required and are read into the hardware by the DMA engine of the controller.

As of today the code is not yet cleaned up from the temporary debug additions. It will all be done in the next couple of weeks, as we need to combine this code with the large camera-specific code – the SATA controller (~6% of the FPGA resources) was developed separately from the rest of the code (~80% of the resources), as that makes both simulation and synthesis iterations much faster.

Extras

This implementation includes some additional functionality controlled by Verilog `ifdef directives. Two full block RAM primitives are used for capturing data in the controller. One of these “datascopes” captures incoming data right after the 10b/8b decoder – it can store either 1024 samples of the incoming data (16 bits of data plus attributes) or a compact form where each 32-bit primitive is decoded and the result is a 5-bit primitive/error number. In that case 6*1024 primitives are recorded – 3 times longer than the longest FIS.

Another 4KB memory block is used for profiling – the controller timestamps and records the first 5 DWORDs of each incoming and outgoing FIS; additionally, it timestamps software writes to a specific location, allowing mixed software/hardware profiling.

This project implements run-time access to the primitive attributes using the Xilinx DRP port of the GTX elements; the same interface is used to programmatically change the logical values of the configuration inputs, making it significantly simpler to guess how the partially documented attributes change the device functionality. We will definitely need this when upgrading to SATA3.

Code description

Top connections

The controller uses 3 differential I/O pads of the device – one input pair (RX in Fig.1) and one output pair (TX) make up a SATA port, and an additional dedicated input pair (CLK) provides the 150MHz clock that synchronizes most of the controller and the transmit channel of the Zynq GTX module. The 10393 board uses an SI53338 spread-spectrum capable programmable clock to drive this input.

Xilinx conventions say that the top level module should instantiate the SoC Processing System PS7 (I would rather consider connections to the PS7 as I/O ports), so the top module does exactly that and connects the AXI ports of the actual design top module to the MAXIGP1 and SAXIHP3 ports of the PS7; IRQF2P[0] provides the interrupt signal to the CPU. MAXIGP1 is one of the two 32-bit AXI ports where the CPU is the master – it is used for PIO access to the controller register memory (and to read out debug information). SAXIHP3 is one of the 4 “high performance” 64-bit wide paths; this port is used by the controller DMA engine to transfer command tables and data to/from the device. The port numbers are selected to match the ones unused in the camera-specific code; other designs may have different assignments.

Clocks and clock domains

The current SATA2 implementation uses 4 different clock domains; some may be shared with other unrelated modules or have the same source.

  1. aclk is used in the MAXIGP1 channel and in the part of the MAXI REGISTERS module that synchronizes the AXI-facing port of the dual-port block RAM implementing the controller registers. 150 MHz (the maximal permitted frequency) is used; it is generated from one of the PS7 FPGA clocks
  2. hclk is used in the AXI HP3 channel, DMA Control and the synchronizing parts of the H2D CCD FIFO (host-to-device cross-clock-domain FIFO), D2H CCD FIFO and AFI ABORT modules. 150 MHz (the maximal permitted frequency) is used, same as for aclk
  3. mclk is used throughout most of the other modules of the controller, except parts of the GTX, COMMA, 10b8 and the input parts of the ELASTIC. For the current SATA2 implementation it is 75MHz; this clock is derived from the external clock input and is not synchronous with the first two
  4. xclk – a source-synchronous clock extracted from the incoming SATA data. It drives the COMMA and 10b8 modules; ELASTIC allows data to cross the clock boundary by adding/removing ALIGNp primitives
ahci_sata_layers

The two lower layers of the stack (phy and link) that are independent of the controller system interface (AHCI) are instantiated in the ahci_sata_layers.v module, together with the 2 FIFO buffers for the D2H (incoming) and H2D (outgoing) data.

SATA PHY

The SATA PHY layer contains the OOB (Out Of Band) state machine responsible for handling the COMRESET, COMINIT and COMWAKE signals; the rest is just a wrapper for the functionality of the Xilinx GTX transceiver. This device includes both high-speed elements and some blocks that can be synthesized using the FPGA fabric. Xilinx does not provide the source code for the GTX simulation module, and we were not able to match the hardware operation to the documentation, so in the current design we use only those parts of the GTXE2_CHANNEL primitive that cannot be replaced by the fabric. The other modules are implemented as regular Verilog code included in the x393_sata project. There is a gtx_wrap module in the design that has the same input/output ports as the primitive, allowing selection of which features are handled by the primitive and which by the Verilog code, without changing the rest of the design.
The GTX primitive itself cannot be simulated with the tools we use, so the simulation module was replaced, and a Verilog `ifdef directive switches between the simulation model and the non-free primitive for synthesis. We used the same approach earlier with other Xilinx proprietary primitives.

Link

The Link module implements the SATA link state machine, scrambles/descrambles the data, calculates the CRC for transmitted data and verifies the CRC of the received data. SATA does not transmit and receive data simultaneously (only control primitives), so the CRC and scrambler modules each have a single instance providing dual functionality. This module required the most troubleshooting and modifications while testing the hardware with different SSDs – at some stages the controller worked with some of them, but not with others.

ahci_top

The other modules of the design are included in ahci_top. The largest of them is the DMA engine, shown as a separate block in Fig.1.

DMA

The DMA engine makes use of one of the Zynq 64-bit AXI HP ports. This channel includes FIFO buffers on the data and address subchannels (4 in total), which makes interfacing rather simple. The hard task is resetting the channels after a failed communication between the controller and the device – even reloading the bitstream and resetting the FPGA does not help (actually it makes things even worse). I searched the Xilinx support forum and found that similar questions were only discussed between the users; there was no authoritative recommendation from the Xilinx staff. I added the axi_hp_abort module that watches over the I/O transactions and keeps track of what was sent to the FIFO buffers, so it is able to complete transactions and drain the buffers when requested.

The DMA module reads the command table and saves the command data in a memory block to be later read by the FIS TRANSMIT module; it then reads the scatter-gather memory descriptors (PRDT, with pre-fetch supported if enabled) and reads/writes the data itself, combining the fragments.

On the controller side, the data that goes out towards the device (H2D CCD FIFO) and the data coming from the device (D2H CCD FIFO) need to cross the clock boundary between hclk and mclk and handle alignment issues. AXI HP operates in 64-bit mode, data to/from the link layer is 32-bit wide, and AHCI allows alignment to an even number of bytes (16 bits). When reading from the device, the cross-clock-domain FIFO module does this in a single step, combining incoming 32-bit DWORDs into 64-bit ones and using a barrel shifter (with 16-bit granularity) to align the data to the 64-bit memory QWORDs – the AXI HP channel provides a per-byte write mask that makes this rather easy. The H2D data is converted in 2 steps: first it crosses the clock domain boundary, being simultaneously transformed to 32-bit data with a 2-bit word mask that tells which of the two words in each DWORD are valid. An additional WORD STUFFER module operates in the mclk domain and consolidates the incoming sparse DWORDs into full outgoing DWORDs to be sent to the link layer.
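
To illustrate the D2H alignment step described above, here is a minimal Python behavioral model (not the actual Verilog implementation; the function name and data layout are purely illustrative): incoming 32-bit DWORDs are packed into 64-bit QWORDs, shifted by the 16-bit start offset within the first QWORD, and a per-byte write mask is produced for the AXI HP channel.

    # Behavioral model (illustration only) of the D2H data alignment:
    # pack incoming 32-bit DWORDs into 64-bit QWORDs, barrel-shift by the
    # 16-bit start offset and produce a per-byte write mask for AXI HP.
    def align_d2h(dwords, start_word):
        """dwords: 32-bit values from the link layer;
        start_word: offset (0..3) in 16-bit words within the first QWORD."""
        words = []
        for dw in dwords:                      # split each DWORD into two 16-bit words
            words.append(dw & 0xFFFF)
            words.append((dw >> 16) & 0xFFFF)
        words = [None] * start_word + words    # barrel shift with 16-bit granularity
        qwords = []
        for i in range(0, len(words), 4):      # 4 x 16-bit words per 64-bit QWORD
            chunk = words[i:i + 4] + [None] * (4 - len(words[i:i + 4]))
            qw, mask = 0, 0
            for j, w in enumerate(chunk):
                if w is not None:
                    qw |= w << (16 * j)
                    mask |= 0b11 << (2 * j)    # two byte-enable bits per valid word
            qwords.append((qw, mask))          # (data, per-byte write strobes)
        return qwords

    # Three DWORDs written starting at a 16-bit offset of 1 in the first QWORD:
    print([(hex(q), bin(m)) for q, m in align_d2h([0x11112222, 0x33334444, 0x55556666], 1)])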

AHCI

The rest of the ahci_top module is shown as the AHCI block. The AHCI standard specifies multiple registers and register groups that an HBA has. It is intended for PCI devices, but the same registers can be used even when no PCI bus is physically present. The base address is programmed differently, but the relative register addressing remains the same.

MAXI REGISTERS

The MAXI REGISTERS module provides the register functionality and allows data to cross the clock domain boundary. The register memory is made of a dual-port block RAM module; an additional block RAM (used as a ROM) is pre-initialized to make each bit field of the register bank RW (read/write), RO (read only), RWC (read, write 1 to clear) or RW1 (read, write 1 to set), as specified by AHCI. This initialization is handled by the Python program create_ahci_registers.py, which also generates the ahci_localparams.vh include file providing symbolic names for addressing register fields in the Verilog code of other modules and in simulation test benches. The same file is also used in the camera to allow access to the hardware registers by name.
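
As an illustration of the field types, here is a minimal Python model of the write semantics defined by AHCI (the function and the example values are hypothetical, not taken from create_ahci_registers.py):

    # Illustrative model of the AHCI register bit-field write semantics:
    # RW, RO, RWC (write 1 to clear), RW1 (write 1 to set).
    def apply_write(current, written, ftype, mask):
        """Return the new value of a bit field after a software write.
        current, written: register values; mask: bits belonging to this field."""
        if ftype == "RW":                       # software sets the bits directly
            return (current & ~mask) | (written & mask)
        if ftype == "RO":                       # software writes are ignored
            return current
        if ftype == "RWC":                      # writing 1 clears the bit
            return current & ~(written & mask)
        if ftype == "RW1":                      # writing 1 sets the bit
            return current | (written & mask)
        raise ValueError("unknown field type: " + ftype)

    # Example: clearing an interrupt status bit (RWC) by writing 1 to it
    status = 0b0101
    status = apply_write(status, 0b0001, "RWC", 0b0001)
    print(bin(status))  # 0b100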

Each write access to the register space generates a write event that crosses the clock boundary and reaches the HBA logic; it is also used to start the AHCI FSM even if it is in the reset state.

The second port of the register memory operates in the mclk domain and allows register reads and writes by the other AHCI submodules (FIS RECEIVE, which writes registers, FIS TRANSMIT and CONTROL STATUS).

The same module also provides access to the debug registers and allows reading of the data acquired by the “datascope”.

CONTROL STATUS

The control/status module maintains the “live” registers/bits that the controller needs to react to when they are changed by the software, and it also reacts to various events in different parts of the controller. The updated register values are written to the software-accessible register bank.

This module generates the interrupt request to the processor as specified in the AHCI standard, using one of the FPGA-to-CPU interrupt lines (IRQF2P) available in the Zynq.

AHCI FSM

The AHCI state machine implements the AHCI layer using a programmable sequencer. Each state traverses two stages: actions and conditions. The first stage triggers single-cycle pulses that are distributed to the appropriate modules (currently 52 in total). Some actions require just one cycle, others wait for a “done” response from the destination. The conditions phase involves freezing the logical conditions (now 44 in total) and then going through them in the order specified in the AHCI documentation. The state description for the machine is provided in an assembler-like format inside the Python program ahci_fsm_sequence.py, which generates Verilog code for the action_decoder.v and condition_mux.v modules instantiated in ahci_fsm.v.
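
The idea of the two-stage sequencer can be sketched in a few lines of Python (the state names and table format below are hypothetical and only illustrate the actions-then-conditions ordering; they are not the actual ahci_fsm_sequence.py syntax):

    # Hypothetical state table: each state lists the action pulses to fire,
    # then ordered (condition, next_state) pairs evaluated on frozen conditions.
    STATES = {
        "P:Idle": {
            "actions":    ["clear_busy"],
            "conditions": [("cmd_issued", "P:FetchCmd"),
                           ("fis_received", "P:ReceiveFIS"),
                           (None, "P:Idle")],          # default transition
        },
        "P:FetchCmd": {
            "actions":    ["fetch_cmd"],
            "conditions": [(None, "P:Idle")],
        },
        "P:ReceiveFIS": {
            "actions":    ["get_fis"],
            "conditions": [(None, "P:Idle")],
        },
    }

    def step(state, condition_values, fire_action):
        """Run one FSM step: issue the action pulses, then pick the next state
        from the conditions frozen at the start of the conditions phase."""
        for a in STATES[state]["actions"]:
            fire_action(a)                              # single-cycle pulse
        frozen = dict(condition_values)                 # freeze the conditions
        for cond, nxt in STATES[state]["conditions"]:
            if cond is None or frozen.get(cond):
                return nxt
        return state

    # Example step
    print(step("P:Idle", {"cmd_issued": True}, lambda a: None))  # -> P:FetchCmd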

The output listing of the FSM generator is saved to ahci_fsm_sequence.lst. The debug output registers include the address of the last FSM transition, so this listing can be used to locate problems during hardware testing. It is possible to update the generated FSM sequence at run time using the registers designated as vendor-specific in the controller I/O space.

FIS RECEIVE

The FIS RECEIVE module processes the incoming FISes (DMA Setup FIS, PIO Setup FIS, D2H Register FIS, Set Device Bits FIS, unknown FIS), updates the required registers and saves the FISes in the appropriate areas of the received FIS structure. For an incoming data FIS it consumes just the header DWORD and redirects the rest to the D2H CCD FIFO of the DMA module. This module also implements the word counters (PRD byte count and the decrementing transfer counter); these counters are shared with the transmit channel.

FIS TRANSMIT

The FIS TRANSMIT module recognizes the following commands received from the AHCI FSM: fetch_cmd, cfis_xmit, atapi_xmit and dx_xmit, following the prefetch condition bit. The first command (fetch_cmd) requests the DMA engine to read in the command table and optionally to prefetch the PRD memory descriptors. After one of the cfis_xmit or atapi_xmit commands the command data is read from the DMA module memory and transmitted to the link layer to be sent to the device. When processing dx_xmit this module sends just the header DWORD and transfers control to the DMA engine, continuing to maintain the PRD byte count and the decrementing transfer counter.

FPGA resources used

According to the “report_utilization” Xilinx Vivado command, the current design uses:

  • 1358 (6.91%) slices
  • 9.5 (3.58%) Block RAM tiles
  • 7 (21.88%) BUFGCTRL
  • 2 (40%) PLLE2_ADV

The resource usage will be reduced as there are debug features not yet disabled. One of the PLLE2_ADV instances uses a clock already available in the rest of the x393 code (150MHz for MAXIGP1 and SAXIHP3); the other PLL that produces the 75MHz transmit-synchronous clock can probably be eliminated too. Two of the block RAM tiles are used for capturing incoming primitives and profiling data – this functionality is not needed in the production version. More resources may be saved if we are able to use the hard-wired 10b/8b decoder, 8b/10b encoder, comma alignment and elastic buffer primitives of the Xilinx GTXE2_CHANNEL.

Testing the hardware

Testing with Python programs

All the initial work with the actual hardware was done with a Python script that started as a reimplementation of the same functionality used when simulating the project. Most of it is in x393sata.py, which imports x393_vsc3304.py to control the VSC3304 crosspoint switch. This setup turned out to be very useful for troubleshooting, starting from the initial testing of the SSD connection (the switch can route the SSD to a desktop computer), then for verifying the OOB exchange (the only part visible on my oscilloscope) – the switch was set to connect the SSD to the Zynq and to use the eSATA connector pins to duplicate the signals between the devices, so probing did not change the electrical characteristics of the active lines. The Python program made it possible to detect communication errors, modify GTX attributes over DRP and capture incoming data to reproduce similar conditions with the simulator. Step by step it became possible to receive the signature FIS and then run the identify command. In these tests I used a large area of the system memory that was reserved as a video ring buffer and set up as “coherent” DMA memory. We were not able to make it really “coherent” – the command data transmitted to the device (the controller reads it from the system memory as a master) often contained just zeros, as the real data written by the CPU got stuck in either one of the caches or in the DDR memory controller write buffer. These errors only went away when we abandoned the use of the coherent memory allocation and switched to streaming DMA with explicit synchronization via dma_sync_*_for_cpu/dma_sync_*_for_device.
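
For reference, PIO access to the memory-mapped controller registers from user-space Python can be as simple as the sketch below (the base address and register offsets are hypothetical placeholders, not the actual x393sata.py implementation; the real values live in the x393 code):

    # Minimal sketch of PIO access to memory-mapped controller registers via
    # /dev/mem (requires root). MAXI_BASE and the offsets are placeholders.
    import mmap, os, struct

    MAXI_BASE = 0x80000000      # hypothetical MAXIGP1 window base
    PAGE      = 0x1000

    def reg_rw(offset, value=None):
        fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
        try:
            mem = mmap.mmap(fd, PAGE, offset=MAXI_BASE)
            try:
                if value is not None:                       # 32-bit register write
                    mem[offset:offset + 4] = struct.pack("<L", value)
                return struct.unpack("<L", mem[offset:offset + 4])[0]
            finally:
                mem.close()
        finally:
            os.close(fd)

    # Example (hypothetical offsets): read a status register, then write a command
    # status = reg_rw(0x28)
    # reg_rw(0x18, 0x1)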

AHCI driver for GNU/Linux

Mikhail Karpenko is preparing a post about the software driver, and as expected this development stage revealed new controller errors that were not detected by just manually launching commands through the Python program. When we mounted the SSD and started to copy gigabyte-sized files, the controller reported fake CRC errors – and it happened with one SSD, but not with the other. Using the data capturing modules it was not difficult to catch the conditions that caused the errors and then reproduce them with the simulator; one of the last bugs detected was that the link layer incorrectly handled single incoming HOLD primitives (a rather unlikely condition).

Performance results

The first performance testing turned out to be rather discouraging – ‘dd’ reported a rate under 100 MB/s. At that point I added profiling code to the controller, and the data rate for the raw transfers (I tried a command that involved reading 24 of the 8KB FISes), measured from the sending of the command FIS to the receiving of the D2H register FIS confirming the transfer, was 198MB/s – about 80% of the maximum for SATA2. Profiling the higher levels of the software, we noticed that there is virtually no overlap between the hardware and software operation. It is definitely possible to improve the result, but the fact that the software slowed the operation down by a factor of two suggests that even if the requests and their processing were done in parallel, it would consume 100% of the CPU power. Yes, there are two cores and the clock frequency can be increased (the current boards use a speed grade 2 Zynq, while the software still assumes speed grade 1 for compatibility with the first prototype), but it would still be a big waste in the camera. So we will likely bypass the file system for sequential recording of video/images and use the second partition of the SSD for raw recording, especially as we will record directly from the video buffer in the system memory, so there is no dealing with scatter-gather descriptors and no need to synchronize the system memory as no cache is involved. The memory controller is documented as being self-coherent, so reading the same memory while it is being written to through a different channel will cause the write operation to be performed first.

Conclusions and future plans

We have achieved useful functionality of the camera SATA controller, allowing recording to the internal high capacity m.2 SSD, so all the hardware is tested and cameras can be shipped to users. Future upgrades (including SATA3) will be released in the same way as the other camera software. On the software side we will first need to upgrade our camogm recorder to reduce CPU usage during recording and provide 100% load to the SATA controller (rather easy when recording a continuous memory buffer). Later (it will be more important after the SATA3 implementation) we may optimize the controller even more and try to short-cut the video compressor outputs directly to the SATA controller, using the system memory as a buffer only when the SSD is not ready to receive data (they do take “timeouts”).

We hope that this project will be useful for other developers who are interested in Free Software solutions and prefer the Real Verilog Code (RVC) to all those “wizards”, “black boxes” and “IP”.

Software tools used (and not)

Elphel designs and builds high performance cameras, striving to provide our users/developers with design freedom at every possible level. We do not use any binary-only modules or other hidden information in our designs – everything we know ourselves is posted online, usually on GitHub and Elphel Wiki. When developing for the FPGA, which unfortunately still depends on proprietary tools, we limit ourselves to tools that are free to download, so that we are in exactly the same position as many of our users. We can not require our users (and consider it immoral) to purchase expensive tools to be able to modify the free software code for the hardware they purchased from Elphel, so no “Chipscopes” or other fancy proprietary tools were used in the development of this project.

Keeping information free is a precondition, but it alone is not sufficient for many users to be able to effectively develop new functionality for the products – it also has to be easy to do. In the area of FPGA design (a very powerful tool resulting in performance that is not possible with software alone) we think of our users as smart people, but not necessarily professional FPGA developers. Like ourselves.

Fig.3 FPGA development with VDT

We learned a lesson from our previous FPGA projects, which depended too much on particular releases of the Xilinx tools and were difficult to maintain even for ourselves. Our current code is easier to use, port and support; we tried to minimize dependence on the particular tools used and put together what we think is a better development environment. I believe that the “Lego blocks” style is not the most productive way to develop FPGA projects, and it is definitely not the only one possible.

Treating HDL code similarly to software code is no less powerful a paradigm, and in my opinion the development tools should not pretend to be “wizards” who know better than me what I am allowed (or not allowed) to do, but should rather act like gentle secretaries or helpers who can take over much of the routine work, remind me about important events and provide appropriate suggestions (when asked for). Such behavior is even more important if this particular activity is not the only one you do and you may come back to it after a long break. A good IDE should be like that – help you navigate the code, catch problems early, be useful with default settings, but provide the capabilities to fine-tune the functionality according to personal preferences. It is also important to provide a familiar environment; this is why we use the same Eclipse IDE for Verilog, Python, C/C++, Java and more. All our projects come with initial project settings files that can be imported into this IDE (supplemented by the appropriate plugins) so you can immediately start development from the point where we currently left it.

For FPGA development Elphel provides VDT – a powerful tool that includes deep Verilog support and integrates the free software simulator Icarus Verilog with the GitHub repository and the popular GTKWave for visualizing simulation results. It comes with preconfigured support for the FPGA vendors’ proprietary synthesis and implementation tools and allows the addition of other tools without requiring modification of the plugin code. The SATA project uses the Xilinx Vivado command line tools (not the Vivado GUI); support for several other FPGA tools is also available.

NC393 camera is fit for flight

Thu, 02/11/2016 - 15:49

The components for 10393 and other related circuit boards for the new NC393 camera series have been ordered and contract manufacturing (CM) is ready to assemble the first batch of camera boards.

In the meantime, the extruded parts that will be made into the NC393 camera body have been received at Elphel. The extrusion looks very slick, with thin, 1mm walls made out of strong 6061-T6 aluminium, and weighs only 55g. The camera’s new lightweight design is suitable for use on a small aircraft. The heat frame responsible for cooling the powerful processor has also been extruded.

We are very pleased with the performance of Profile Precision Extrusions, located in Phoenix, Arizona, which has delivered a very accurate product ahead of the proposed schedule. We can now proudly engrave “Made in USA” on the camera, as even the camera body parts are made in the United States.

Of course, we did try to order the extrusion in China, but the intricately detailed profile is difficult to extrude and the tolerances were hard to meet, so when Profile Precision was recommended to us by local extrusion facilities we were happy to discover the outstanding quality this company offers.

While waiting for the extruded parts we have been playing with another new toy: the 3D printer. We have been creating prototypes of various camera models of the NC393 series. The cameras are designed and modelled in a 3D virtual environment, and can be viewed and even taken apart by mouse click thanks to X3dom technology. The next step is to build actual parts on the 3D printer and physically assemble the camera prototypes, which will allow us to start using the prototypes in the physical world: finding what features are missing, and correcting and finalizing the design. For example, when the mini-panoramic NC393-4PI4 camera prototype was assembled, it became clear that it needs the 4 fins (now seen on the final model) to protect the lenses from touching the surfaces as well as to provide shade from the sun. NC393-4PI4 and NC393-4PI4-IMU-GPS are small 360 degree panoramic cameras assembled with 4 fish-eye lenses, especially suitable for interior panoramic applications.

The prototypes are not as slick as the actual aluminium bodies, but they give a very good idea of what the actual cameras will look like.

As of today, the 10393 and other boards are in production, the prototypes are being built and tested for design functionality, and the aluminium extrusions have been received. With all this taken care of, we are now less than one month away from the NC393 being offered for sale; the first cameras will be distributed to the loyal Elphel customers who placed and pre-paid their orders several weeks ago.

X3D assemblies from any CAD

Mon, 12/21/2015 - 03:09
Converting mechanical assemblies to X3D models from STEP (ISO 10303) files

Like all manufacturing companies we use a mechanical CAD program to design our products. We would love to use Free Software programs for that, but so far even FreeCAD carries a warning on its download page: “FreeCAD is under heavy development and might not be ready for production use”. We have to use proprietary tools; our choice was a program that natively runs on the GNU/Linux we use on our computers. This program generates STEP files that we can send to virtually any machine shop (local or overseas) and expect to receive manufactured parts that match our design. For the last 6 years we have kept the CAD models for all the camera parts on Elphel Wiki, hoping they might be needed not only by the machine shops we order parts from, but also by our users to incorporate (or modify) our products in their systems.

All mechanical CAD programs can export STEP, so we can use this format for assemblies

The STEP file export is quite adequate for production, but it would be convenient for our users (including ourselves) to be able to easily navigate through the complex assemblies. Theoretically STEP can handle assemblies too, but I got the impression that the CAD program owners are not that interested in interoperability – they want everybody to use their program, and the interoperability scope is limited to a simplified scheme: CAD (theirs) -> CAM (any), and the assembly structure is often lost when generating output files. When we tried to export the Eyesis4π camera as a STEP file it grew to more than 0.5GB in size, and when imported (even by the same program) it resulted in over 1800 solids without any hierarchy or even the part names. Additionally the colors were lost when the STEP file was imported back, and it is understandable – CAD programs need to be able to produce STEP files (otherwise they would be completely useless), but the importing requirements are more relaxed. Having no control over the proprietary program output, we had to find a way to use the CAM files (in STEP format) in a different way than the CAD providers intended and recreate the assembly structure ourselves.

FreeCAD as the environment for model conversion

FreeCAD seemed to us the best choice for the next step regardless of its “not ready for production” status, as it has the great advantage of being FLOSS with excellent support for Python access to its functionality (through macros and a nice Python console). First I looked for a possibility to export data as X3D and was impressed that the FreeCAD macro that does that – export_x3d.py – has less than 100 lines of code. It did not export the colored faces of electronic components on the PCB, but that was something we could definitely fix ourselves.

Having working color output was the first step toward a more ambitious project – feed the program a library of STEP files of components and a flat STEP assembly file. The program should recognize each object in the assembly by comparing it with the known parts, replace it with a reference to the library part and provide the translation and rotation. There are multiple ways to deal with this task and I will describe what we did later in the post; in short – it just worked. We fed the program a library of the 800+ part files that we had (some custom, some just standard fasteners from McMaster) together with the assembly file, and it recognized almost all of the objects and placed them correctly, so Oleg Dzhimiev was able to start working on the viewer for navigating the models using the x3dom technology while I continued working on the converter.

Links to the converted models

Here is a link to the Elphel Wiki page Elphel camera assemblies. This page opens multiple designs – they include the new NC393 camera models (for which we have not yet received all the mechanical parts) as well as our current products for which we already had the needed CAD files.

We have not tried to convert design data exported by other mechanical CAD software, and it would be interesting to know if this program can help users of other CAD systems. We tried to make it agnostic to the source of the STEP files, but it does require the ability to export files with a specified color of the faces (AP214 has this possibility while AP203 does not). Color information is needed anyway as a proxy for materials/finish to distinguish between different parts that have exactly the same geometry; we also use it to hint the orientation of the parts in the assembly.

There are multiple ways the program can be improved, but at least for our project it is already usable. And we hope it is not just for us.

Technical details

As soon as we verified that FreeCAD can import our STEP files and that it is not that difficult to generate the X3D models, we started the freecad_x3d project on GitHub. The x3d_step_assy.py macro runs in FreeCAD and generates X3D files from the STEP input; the rest of the repository is the viewer for the produced models.

Indexing the STEP part files

The first thing the program does is scan and index all the STEP models of the parts, saving the information needed for matching them to the assembly objects. Opening a STEP file in FreeCAD is a very slow process (especially in the GUI mode that is required to access the object color information), so this step significantly speeds up the subsequent processing of assembly files. The recorded part-invariant information includes the center of gravity (center of volume, to be precise) location, volume, surface area and gyration radii provided by FreeCAD. If the part has differently colored faces, the centers for each color are recorded too. Additionally a list of up to 18 vertex coordinates is calculated and added – these vertices are later tested to be inside (or near to) the objects in the assembly. Currently these vertices are selected as having maximal and minimal values for each of the 3 coordinates as well as for their sums and differences.
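
A simplified sketch of this indexing step is shown below (it assumes FreeCAD's Part module and its Volume, Area and CenterOfMass properties on solids; the dictionary layout and the function name are illustrative, not the actual macro format):

    # Simplified indexing sketch (to be run inside FreeCAD): read a STEP part
    # file, record orientation-invariant properties of its largest solid and
    # pickle them. The dictionary layout is illustrative only.
    import pickle
    import Part  # FreeCAD Part module

    def index_part(step_path, pickle_path):
        shape = Part.read(step_path)                 # import the STEP geometry
        solids = sorted(shape.Solids, key=lambda s: s.Volume, reverse=True)
        main = solids[0]                             # largest solid is used for matching
        c = main.CenterOfMass                        # center of volume
        info = {
            "volume": main.Volume,
            "area":   main.Area,
            "center": (c.x, c.y, c.z),
            # gyration radii and per-color face centers would be added here too
        }
        with open(pickle_path, "wb") as f:
            pickle.dump(info, f)
        return info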

Normally each part model consists of just one solid object, but in practice this is not always the case. The CAD program we use generates an extra “tube” object for each thread, and sometimes we create multiple solids intentionally, like making a two-solid photographic UV protection filter as a frame and a glass. This allows us to selectively change the solid/wireframe state when working in the CAD program. The current implementation saves information about each solid in a part and places the largest (currently by volume) solid first (at index 0); the matching uses only the first solid, and that leads to false positives in the reporting of objects that do not have any matches to parts. “False” – because these unmatched objects will still appear in the X3D model as they are included in the individual part models. Removing such false-positive objects from the report is definitely possible, but it was not a big hassle to manually inspect them in the FreeCAD 3D view.

All this information is recorded in Python pickle format, one file for each STEP file. When the program needs to process an assembly, it first verifies that each STEP part file has a corresponding pickle file and (re)calculates the ones that are either missing or outdated (older than the STEP model).
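
The staleness check itself can be as simple as comparing file modification times (the file naming here is just an example):

    # Recalculate the pickle only if it is missing or older than the STEP model.
    import os

    def pickle_is_stale(step_path, pickle_path):
        return (not os.path.exists(pickle_path) or
                os.path.getmtime(pickle_path) < os.path.getmtime(step_path))

    # for path in step_files:
    #     if pickle_is_stale(path, path + ".pickle"):
    #         index_part(path, path + ".pickle")   # see the indexing sketch above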

Generation of the X3D files for each part model

The next step after indexing the STEP models of the parts is to generate the individual parts in X3D format. The program uses the color information available for each object face after import in GUI mode and uses it in the generation of the X3D XML data. It wraps each object in an X3D “Group” node to combine the possible multiple objects in a part and to provide the bounding box information, and then adds the outermost “Transform” node with zero translation and rotation – it can be used by the viewer program to move and rotate the object. Currently the viewer reads the group bounding box center and moves the top object in the opposite direction for convenient rotation. The imported STEP files may have large offsets of the models from the (0,0,0) point; if this is not corrected, the viewer may try to rotate the object around a point that is far off-screen.
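
A minimal sketch of this wrapping (the geometry placeholder and attribute values are illustrative; the real content comes from the tessellated FreeCAD shapes):

    # Wrap a part's geometry in <Transform><Group> as described above.
    # The Shape node below is a placeholder for the real tessellated geometry.
    import xml.etree.ElementTree as ET

    def wrap_part(geometry_nodes, bbox_center, bbox_size):
        transform = ET.Element("Transform", {
            "translation": "0 0 0",
            "rotation": "0 0 1 0",
        })
        group = ET.SubElement(transform, "Group", {
            "bboxCenter": " ".join("%g" % c for c in bbox_center),
            "bboxSize":   " ".join("%g" % s for s in bbox_size),
        })
        group.extend(geometry_nodes)        # one <Shape> per colored face set
        return transform

    # Example with a dummy Shape node:
    print(ET.tostring(wrap_part([ET.Element("Shape")], (0, 0, 10.5), (20, 30, 21)),
                      encoding="unicode"))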

Similarly to the generation of the pickle files, the program only generates part X3D models if they do not exist or are older than the input STEP files. We noticed that at this stage FreeCAD often segfaults (regardless of the version) and it seems to be related to the GUI. Luckily you only have to load this many files once, and if FreeCAD crashes you may just restart it and the macro will continue generating the new files.

Selection of the parts candidates for the assembly objects

Opening a complex assembly STEP file in FreeCAD can take a while (one of our models took 40 minutes to open), so please be patient. The part matching takes about half that time, so the program offers two options – use the currently active document in FreeCAD, or start from the file path and open it.

When all the assembly data is available, the program indexes each object, extracting parameters similar to those of the parts – volume, area, inertial properties, and the centers of each color (if present). It then uses this data to create a list of part candidates for each assembly object, requiring that the orientation-invariant parameters of each object exported as part of the assembly match (to a configurable precision) those of the same part exported individually. If colors are available, the total area of each color is compared too, but a match is allowed if only the shape is the same, as the CAD program may allow changing the object color in the assembly, making it different from that of the library part. If several parts match an assembly object, a better color match disqualifies the other shape-only candidates, so it is possible to color-code same-shape parts.
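
A sketch of this candidate test (the set of invariants and the default tolerance are illustrative):

    # Compare orientation-invariant parameters of an assembly object against
    # library parts, each to a configurable relative precision.
    def relative_match(a, b, tolerance):
        scale = max(abs(a), abs(b), 1e-12)
        return abs(a - b) <= tolerance * scale

    def find_candidates(obj_info, parts_index, tolerance=0.001):
        """obj_info / parts_index entries: dicts with 'volume', 'area', 'gyration'."""
        candidates = []
        for name, part in parts_index.items():
            if (relative_match(obj_info["volume"], part["volume"], tolerance) and
                relative_match(obj_info["area"],   part["area"],   tolerance) and
                all(relative_match(go, gp, tolerance)
                    for go, gp in zip(obj_info["gyration"], part["gyration"]))):
                candidates.append(name)
        return candidates

    # Example
    parts = {"screw_m3": {"volume": 50.2, "area": 120.5, "gyration": (1.2, 3.4, 3.4)}}
    print(find_candidates({"volume": 50.21, "area": 120.48, "gyration": (1.2, 3.4, 3.4)}, parts))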

Matching of the assembly object to the part orientation

The next step of the assembly-to-parts decomposition is to determine the part position and orientation that match the assembly object. In most cases there will be no more than one candidate for each object, but if there are several the program will try them all and use the first match. It is very easy to find the translation of the part – just use the vector between the already known centers of volume – but it is trickier to find the correct orientation. There are multiple ways to match orientations, and the program can definitely be improved. We chose a rather simple approach that requires modification of some parts, which is rather easy as the part models are created by us. The number of parts that required modification is rather small, the modification has to be done only once per part (not for each assembly), and it does not invalidate the model for CAM usage.

This approach uses the offsets of the “centers of gravity” of the faces of each color (even a single-colored object may have the center of all faces offset from the center of volume) and then the principal axes of gyration provided by FreeCAD. The color offsets are used first, then supplemented by the gyration axes; each step verifies that the vector is non-zero and that the next one is not co-linear with the first. Only two orthogonal vectors are needed; the third one needed for the rotation matrix is calculated as a cross product of the first two. Using the gyration axes has an ambiguity even when all 3 gyration radii are different and so the axes are reliably calculated: they do not provide the sign, only the line of direction. The same asymmetrical object can be oriented in 4 different ways (alternating the signs of two of the 3 axes) and the program tries each of them. Initially I tried to compare the volume of the boolean intersection of the two objects, which should be the same as the volume of a single object if they match, but for some of our STEP models FreeCAD refused to calculate the intersection, so I used the isinside() function instead, which calculates whether a given point is inside the object to a certain precision, and so can be used to verify that all of the vertices saved for the part object end up “inside” the assembly object (actually on its border) after the transformation matrix is applied. Unfortunately even that had an exception – for one of the objects one vertex was returning “False” with any tolerance, even one larger than the object size. In that rare case the program tries to move the test point around by a precision-length vector, and that modification worked – FreeCAD returned “True” for the isinside() call.
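
The rotation-matrix construction and the sign ambiguity can be illustrated with a short numpy sketch (a pure-math illustration of the approach, independent of the FreeCAD API):

    # Build a rotation matrix from two non-colinear reference vectors (e.g. a
    # color offset and a gyration axis) and enumerate the sign variants that
    # arise because the axes only define a line of direction, not its sign.
    import numpy as np

    def basis_from_vectors(v1, v2):
        """Orthonormal right-handed basis: x along v1, z = x cross v2, y = z cross x."""
        x = np.array(v1, dtype=float)
        x /= np.linalg.norm(x)
        z = np.cross(x, np.array(v2, dtype=float))
        z /= np.linalg.norm(z)                 # fails if v1 and v2 are colinear
        y = np.cross(z, x)
        return np.column_stack((x, y, z))      # columns are the basis vectors

    def sign_variants(v1, v2):
        """Orientation candidates obtained by flipping the signs of the reference vectors."""
        for s1 in (1, -1):
            for s2 in (1, -1):
                yield basis_from_vectors(s1 * np.array(v1, dtype=float),
                                         s2 * np.array(v2, dtype=float))

    # Rotation that maps the part reference frame onto the assembly object frame;
    # each candidate would then be checked with isinside() on the saved vertices.
    part_basis   = basis_from_vectors((1, 0, 0), (0, 1, 0))
    object_basis = next(sign_variants((0, 1, 0), (-1, 0, 0)))
    R = object_basis @ part_basis.T
    print(np.round(R, 6))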

When the color hints are required in the part models

Using just the gyration principal axes fails when the object has some symmetry (point or axis). Consider a regular socket head screw. Unless it is a really short one it will have one small and two equal large gyration radii, and the axis for the small gyration radius can be reliably found (it is just the regular axis of the screw), but the two perpendicular ones are arbitrary and may be different for the part and for the assembly object. That may leave the hex socket with an incorrect orientation, but usually this hex hole orientation is unimportant. So here we slightly cheated – the test vertices selected for verification with isinside() are some of the outermost ones of the solid (we selected vertices that have maximal/minimal values of each of the coordinates and of the sums/differences of their pairs and of all three), and the hex hole does not contain any of them. Most of the fasteners we use are such socket head ones; this approach would not work for hex bolts and nuts – they need to have one of the hex faces colored.

There are other objects that require color hints in the part model, like a square plate with no holes or with symmetrical holes, or a turned (round) part with symmetrical holes in it – two of the gyration radii are the same and the corresponding axes can not be unambiguously determined. You may color one of the side faces of the square, or color the inside of a hole, to break the symmetry. If the part does not have individually selectable faces in the CAD program, you may create a small colored cylinder or box, align it with one of the flat faces, boolean-cut it from the object and then boolean-add it back. The resulting object has the same shape for CAM, but it will have a colored square or circle on one of its faces – sufficient for an unambiguous definition of the orientation.
