Kinect Fusion
Kinect for Windows 1.7, 1.8
Tech Specs
Kinect Fusion can process data either on a DirectX 11 compatible GPU with C++ AMP, or on the CPU, by setting the reconstruction processor type during reconstruction volume creation. The CPU processor is best suited to offline processing, as only modern DirectX 11 GPUs will enable real-time and interactive frame rates during reconstruction.
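As a concrete illustration, here is a minimal C++ sketch of choosing the processor type when creating the reconstruction volume. It uses the NuiFusionCreateReconstruction API named later in this document; the volume dimensions are illustrative, the processor-type enum spellings should be verified against NuiKinectFusionApi.h, and error handling is trimmed.

#include <NuiKinectFusionApi.h>

INuiFusionReconstruction* CreateVolume(bool useGpu)
{
    NUI_FUSION_RECONSTRUCTION_PARAMETERS params = {};
    params.voxelsPerMeter = 256;  // 256 voxels represent 1m in the real world
    params.voxelCountX = 512;     // 512/256 = 2m wide
    params.voxelCountY = 384;     // 384/256 = 1.5m high
    params.voxelCountZ = 512;     // 512/256 = 2m deep

    INuiFusionReconstruction* pVolume = nullptr;
    HRESULT hr = NuiFusionCreateReconstruction(
        &params,
        useGpu ? NUI_FUSION_RECONSTRUCTION_PROCESSOR_TYPE_AMP  // DirectX 11 GPU via C++ AMP
               : NUI_FUSION_RECONSTRUCTION_PROCESSOR_TYPE_CPU, // CPU, for offline processing
        -1,       // -1 selects the default device for the chosen processor type
        nullptr,  // start from an identity world-to-camera transform
        &pVolume);
    return SUCCEEDED(hr) ? pVolume : nullptr;
}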
Recommended Hardware
Desktop PC with 3GHz or better multicore processor and a graphics card with 2GB or more of dedicated onboard
memory. Kinect Fusion has been tested for high-end scenarios on an NVIDIA GeForce GTX 680 and an AMD Radeon HD 7850.
Note: It is possible to use Kinect Fusion on laptop-class DirectX 11 GPU hardware, but this typically runs significantly slower than desktop-class hardware. In general, aim to process at the same frame rate as the Kinect sensor (30fps) to enable the most robust camera pose tracking.
Processing Pipeline
As shown in Figure 2, the Kinect Fusion processing pipeline involves several steps to go from the raw depth to a 3D
reconstruction:
Figure 2. Kinect Fusion Pipeline
The first stage is depth map conversion. This takes the raw depth from Kinect and converts it into floating point depth in meters, followed by an optional conversion to an oriented point cloud, which consists of 3D points/vertices in the camera coordinate system and the surface normals (the orientation of the surface) at these points, for use with the AlignPointClouds function.
The second stage calculates the global/world camera pose (its location and orientation) and tracks this pose as the sensor moves in each frame, using an iterative alignment algorithm, so the system always knows the current sensor pose relative to the initial starting frame. There are two algorithms in Kinect Fusion. The first is NuiFusionAlignPointClouds, which can either be used to align point clouds calculated from the reconstruction with new incoming point clouds from the Kinect camera depth, or standalone (for example, to align two separate cameras viewing the same scene). The second is AlignDepthToReconstruction, which provides more accurate camera tracking results when working with a reconstruction volume; however, this may be less robust to objects which move in a scene. If tracking breaks in this scenario, realign the camera with the last tracked pose and tracking should typically continue.
The third stage is fusing or integration of the depth data from the known sensor pose into a single
volumetric representation of the space around the camera. This integration of the depth data is performed per-frame, continuously, with a running average to reduce noise, yet handle some dynamic change in the scene (such as small objects being removed or added). As a moving sensor sees a surface from slightly different viewpoints, any gaps or holes where depth data is not present in the original Kinect image can also be filled in (e.g. you can move the sensor around an object to fill in its rear), and surfaces are continuously refined with newer, higher resolution data as the camera approaches the surface more closely.
The reconstruction volume can be raycast from a sensor pose (typically, but not limited to, the current Kinect sensor pose), and this resultant point cloud can be shaded to render a visible image of the 3D reconstruction volume.
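The raycast-and-shade step can be sketched as follows. This is a minimal sketch, not the samples' exact rendering code: it assumes pVolume from the creation sketch above, a 640x480 depth stream, and that an identity world-to-color transform and a null normals frame are acceptable here; check NuiKinectFusionApi.h for the exact signatures.

// Create reusable frames for the raycast point cloud and the shaded output.
NUI_FUSION_IMAGE_FRAME* pPointCloud = nullptr;
NUI_FUSION_IMAGE_FRAME* pShaded = nullptr;
NuiFusionCreateImageFrame(NUI_FUSION_IMAGE_TYPE_POINT_CLOUD, 640, 480, nullptr, &pPointCloud);
NuiFusionCreateImageFrame(NUI_FUSION_IMAGE_TYPE_COLOR, 640, 480, nullptr, &pShaded);

// Raycast the volume from the current sensor pose...
Matrix4 worldToCamera;
pVolume->GetCurrentWorldToCameraTransform(&worldToCamera);
pVolume->CalculatePointCloud(pPointCloud, &worldToCamera);

// ...then shade the point cloud into a displayable color image.
Matrix4 worldToColor = {};                 // identity transform, used here for simplicity
worldToColor.M11 = worldToColor.M22 = worldToColor.M33 = worldToColor.M44 = 1.0f;
NuiFusionShadePointCloud(pPointCloud, &worldToCamera, &worldToColor,
                         pShaded, nullptr);  // surface normals image not needed here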
Typical volume sizes that can be scanned are up to around 8m^3. Typical real-world voxel resolutions can be up to around 12mm per voxel. However, it is not possible to have both of these simultaneously (see the Reconstruction Volume section below).
Interested in the research behind Kinect Fusion, or more technical detail about the algorithms? See the Microsoft
Research Kinect Fusion project page for publications and a video.
Tracking
Kinect Fusion tracking uses only the depth stream from the Kinect sensor. Tracking relies on there being enough
variation in the depth in each frame so that it can match up what it sees between frames and calculate the pose
difference. If you point the Kinect at a single planar wall, or a mostly planar scene, there will not be enough depth
variation for tracking to succeed. Cluttered scenes work the best, so if you are trying to scan an environment, try
scattering some objects around if the tracking is problematic.
There are two tracking algorithms implemented in Kinect Fusion: the AlignDepthFloatToReconstruction function and the AlignPointClouds function. It is possible to use either for camera pose tracking; however, if you are creating a reconstruction volume, the AlignDepthFloatToReconstruction function will likely perform more accurate tracking. In contrast, the AlignPointClouds function can also be used standalone, without a reconstruction volume, to align two point clouds (see the interface comments for more information on standalone use). Note that internally the high-level ProcessFrame function in the INuiFusionReconstruction interface uses AlignDepthFloatToReconstruction.
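A minimal sketch of tracking one frame with AlignDepthFloatToReconstruction might look like the following. The parameter order and the alignment-energy output are recalled from the SDK headers, so treat the exact signature as an assumption to verify; pDepthFloat is assumed to hold the current depth frame already converted to floating-point meters.

// Use the last tracked pose as the starting point for iterative alignment.
Matrix4 worldToCamera;
pVolume->GetCurrentWorldToCameraTransform(&worldToCamera);

FLOAT alignmentEnergy = 0.0f;
HRESULT hr = pVolume->AlignDepthFloatToReconstruction(
    pDepthFloat,       // depth in floating-point meters
    7,                 // maxAlignIterationCount: iterative alignment steps
    nullptr,           // optional NUI_FUSION_IMAGE_TYPE_FLOAT delta-from-reference image
    &alignmentEnergy,  // residual alignment cost for this frame
    &worldToCamera);   // pose hint: the last tracked pose

if (FAILED(hr))
{
    // Tracking lost: keep the last tracked pose and guide the user to
    // re-align the camera with it; tracking should typically then continue.
}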
The AlignPointClouds tracking algorithm optionally outputs an ARGB visible image of the camera tracking algorithm alignment results, currently color-coded by per-pixel algorithm output. This may be used as input to additional vision algorithms such as object segmentation. Values vary depending on whether the pixel was a valid pixel used in tracking (inlier) or failed in different tests (outlier). 0xff000000 indicates an invalid input vertex (e.g. from 0 input depth), or one where no correspondences occur between point cloud images. Outlier vertices rejected due to too large a distance between vertices are coded as 0xff008000. Outlier vertices rejected due to too large a difference in normal angle between point clouds are coded as 0xff800000. Inliers are color shaded depending on the residual energy at that point, with more saturated colors indicating more discrepancy between vertices, and less saturated colors (i.e. more white) representing less discrepancy, or less information at that pixel. In good tracking the majority of pixels will appear white, with typically small amounts of red and blue around some objects. If you see large amounts of red, blue or green across the whole image, this indicates that tracking is likely lost, or that there is drift in the camera pose. Resetting the reconstruction will also reset the tracking here.
The AlignDepthFloatToReconstruction tracking algorithm optionally outputs an image of type NUI_FUSION_IMAGE_TYPE_FLOAT of the camera tracking algorithm alignment results. The image describes how well each depth pixel aligns with the reconstruction model. This may be processed to create a color rendering, or may be used as input to additional vision algorithms such as object segmentation. These residual values are normalized -1 to 1 and represent the alignment cost/energy for each pixel. Larger magnitude values (either positive or negative) represent more discrepancy, and lower values represent less discrepancy or less information at that pixel. Note that if valid depth exists, but no reconstruction model exists behind the depth pixels, 0 values indicating perfect alignment will be returned for that area. In contrast, where no valid depth occurs, 1 values will always be returned.
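For example, the FLOAT residual image could be mapped to a grayscale rendering as sketched below. The NUI_FUSION_IMAGE_FRAME / INuiFrameTexture access pattern follows the SDK samples, but treat the member names as assumptions to verify against the headers; pDeltaFrame is assumed to be the delta-from-reference frame requested from AlignDepthFloatToReconstruction.

#include <cmath>
#include <vector>

// Lock the frame texture to read the per-pixel float residuals.
NUI_LOCKED_RECT lockedRect;
pDeltaFrame->pFrameTexture->LockRect(0, &lockedRect, nullptr, 0);
const float* residuals = reinterpret_cast<const float*>(lockedRect.pBits);

std::vector<BYTE> gray(pDeltaFrame->width * pDeltaFrame->height);
for (size_t i = 0; i < gray.size(); ++i)
{
    // Residuals are normalized to [-1, 1]; render 0 (perfect alignment, or no
    // model behind the pixel) as white and larger |residual| as darker.
    gray[i] = static_cast<BYTE>(255.0f * (1.0f - fabsf(residuals[i])));
}
pDeltaFrame->pFrameTexture->UnlockRect(0);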
Constraints on Tracking
Kinect Fusion depends on depth variation in the scene to perform its camera tracking. Scenes must have sufficient depth variation in view and in the model to be able to track successfully. Small, slow movements in both translation and rotation are best for maintaining stable tracking. Dropped frames can adversely affect tracking, as a dropped frame can effectively lead to twice the translational and rotational movement between processed frames. When using AlignDepthFloatToReconstruction it is typically possible to guide the user to realign the camera with the last tracked position and orientation and resume tracking.
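One plausible way to structure that recovery behavior is sketched below. The 100-frame threshold is an arbitrary illustrative choice, ShowStatus is a hypothetical UI helper, and the ResetReconstruction signature should be verified against the headers.

// hr is assumed to be the result of this frame's tracking call (for example,
// the AlignDepthFloatToReconstruction sketch above).
static int consecutiveTrackingFailures = 0;

if (FAILED(hr))
{
    ++consecutiveTrackingFailures;
    // The volume keeps the last tracked pose, so prompt the user to move the
    // sensor back toward that position and orientation.
    ShowStatus(L"Tracking lost - re-align the camera with the last tracked pose");
    if (consecutiveTrackingFailures > 100)
    {
        // Sustained failure: reset the volume, which also resets tracking.
        pVolume->ResetReconstruction(nullptr, nullptr);
        consecutiveTrackingFailures = 0;
    }
}
else
{
    consecutiveTrackingFailures = 0;
}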
Reconstruction Volume
The reconstruction volume is made up of small cubes in space, which are generally referred to as voxels. You can
specify the size of the volume on creation by passing a NUI_FUSION_RECONSTRUCTION_PARAMETERS structure to
the NuiFusionCreateReconstruction function. The number of voxels that can be created depends on the amount of
memory available to be allocated on your reconstruction device, and typically up to around 640x640x640 =
262144000 voxels can be created in total on devices with 1.5GB of memory or more. The aspect ratio of this volume
can be arbitrary; however, you should aim to match the volume voxel dimensions to the shape of the area in the real
world you aim to scan.
The voxelsPerMeter member scales the size that 1 voxel represents in the real world. For example, a cubic 384x384x384 volume can either represent a 3m cube in the real world if you set the voxelsPerMeter member to 128vpm (as 384/128 = 3), where each voxel is 3m/384 = 7.8mm across, or a 1.5m cube if you set it to 256vpm (384/256 = 1.5), where each voxel is 1.5m/384 = 3.9mm across. This combination of voxels in the x,y,z axes and voxels per meter enables you to specify volumes with different sizes and resolutions, but note that it is a trade-off: with a fixed number of voxels that you can create, you cannot create a volume which represents both a very large real-world volume and a very high resolution.
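The arithmetic is simple enough to capture in a small helper, sketched here; VolumeExtent is just an illustrative struct, not an SDK type.

// Physical extent and per-voxel size implied by reconstruction parameters.
struct VolumeExtent { float widthM, heightM, depthM, voxelSizeMm; };

VolumeExtent ExtentFor(const NUI_FUSION_RECONSTRUCTION_PARAMETERS& p)
{
    return {
        p.voxelCountX / p.voxelsPerMeter,  // e.g. 384 / 128vpm = 3m wide
        p.voxelCountY / p.voxelsPerMeter,
        p.voxelCountZ / p.voxelsPerMeter,
        1000.0f / p.voxelsPerMeter         // e.g. 1000 / 128vpm = 7.8mm per voxel side
    };
}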
On GPUs the maximum contiguous memory block that can typically be allocated is around 1GB, which limits the reconstruction resolution to approximately 640^3 (262144000) voxels. Similarly, although CPUs typically have more total memory available than a GPU, heap memory fragmentation may prevent very large GB-sized contiguous memory block allocations. If you need very high resolution together with a large real-world volume size, multiple volumes or multiple devices may be a possible solution.
Note
If you are doing interactive reconstruction on a GPU, the memory requirement applies to the video memory on
that GPU. If you are doing offline reconstruction on a CPU, the memory requirement applies to the main memory
of that machine.
The ProcessFrame function in the INuiFusionReconstruction interface performs the depth conversion, camera tracking, and depth integration steps together in a single call, avoiding the separate upload and readback for the individual steps as would occur when calling them separately. Here, the reconstruction is only updated if the camera tracking is successful.
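A per-frame loop using this single-call path might look like the following minimal sketch. pDepthPixels and depthPixelCount are assumed inputs for the frame's extended depth buffer, and the parameter order of DepthToDepthFloatFrame and ProcessFrame is recalled from the 1.8 headers, so verify it against NuiKinectFusionApi.h.

// Convert the raw depth frame to floating-point depth in meters.
pVolume->DepthToDepthFloatFrame(
    pDepthPixels, depthPixelCount * sizeof(NUI_DEPTH_IMAGE_PIXEL),
    pDepthFloat,
    0.35f,   // minDepthClip in meters
    8.0f,    // maxDepthClip in meters
    FALSE);  // mirrorDepth

Matrix4 worldToCamera;
pVolume->GetCurrentWorldToCameraTransform(&worldToCamera);

// Track the camera and integrate the frame in one call; the volume is only
// updated when camera tracking succeeds.
HRESULT hr = pVolume->ProcessFrame(
    pDepthFloat,
    7,               // maxAlignIterationCount
    200,             // maxIntegrationWeight: length of the temporal running average
    &worldToCamera); // pose hint from the previous frame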
Please note that for environment scanning, cluttered scenes enable the best camera tracking, so add objects to scenes which are mostly planar. To extract a mesh of a particular scene or object in a static scene, these added objects can be removed later manually using 3rd-party mesh processing tools.
Kinect Fusion Basics
Figure 3. Kinect Fusion Basics applications scanning desktops (left: D2D, right: WPF)
1. First ensure your machine meets the minimum specifications (see above), then start the application and you should see a window similar to that in Figure 3.
2. Point the sensor at a scene, such as a desktop. You should see a representation of the desktop appear in the Kinect Fusion sample image. Currently in the sample the resolution is hardcoded in the constructor to 512x384x512 voxels, with 256 voxels per meter, which is a 2m wide x 1.5m high x 2m deep reconstruction volume.
3. Press the reset button if the sensor loses track (see the status bar at the bottom of the window for lost-track messages).
Kinect Fusion Explorer
1. First ensure your machine meets the minimum specifications (see above), then start the application and you should see a window similar to that in Figure 4.
2. The images in the window are: the raw Kinect depth at top right, the camera tracking results at bottom right (see the Tracking section above), and the main raycast, shaded view into the reconstruction volume from the camera pose in the large image on the left.
3. Point the sensor at a scene, such as a desktop. You should see a representation of the desktop appear in the Kinect Fusion sample image. Currently in the sample the resolution starts at 512x384x512 voxels, with 256 voxels per meter, which is a 2m wide x 1.5m high x 2m deep reconstruction volume.
4. Press the reset button (shown below in Figure 6) if the sensor loses track (see the status bar at the bottom of the window for lost-track messages).
Figure 6. Depth Threshold Sliders, Reset Button, Create Mesh Button, and other configuration in Kinect Fusion Explorer-D2D
5. Change the depth threshold parameters by moving the depth threshold sliders shown at the bottom of Figure 6. Note how this will clip the depth image in the top right corner. Kinect Fusion requires depth to work, so you need to have valid depth within the region of the reconstruction volume. The sliders start out at a minimum of 0.35m (near the minimum Kinect sensing distance) and a maximum of 8m (near the maximum Kinect sensing distance), and can be used for things such as background or foreground removal.
6. Try playing with the additional configuration boxes, such as Display Surface Normals, Near Mode and Mirror Depth. Note that mirroring depth will reset the reconstruction. Pause Integration will stop the reconstruction volume integrating depth data, and is useful if you have fully reconstructed your scene and now only want to track the camera pose rather than update the scene in the volume (it will also run faster without integration).
Figure 7. Reconstruction Volume Settings in Kinect Fusion Explorer-D2D
7. The reconstruction volume settings visible in Figure 7 enable you to change the real-world size and shape of the reconstruction volume. Try playing around, and see how both the X,Y,Z volume dimensions and the voxels per meter affect the size in the real world and the visible resolution of the volume.
8. The Maximum Integration Weight slider controls the temporal averaging of data into the reconstruction volume. Increasing it gives a more detailed reconstruction, but one which takes longer to average and adapts more slowly to change. Decreasing it makes the volume respond faster to change in the depth (e.g. objects moving), but is noisier overall.
9. Click the Create Mesh button in Figure 6. The meshes output by the Kinect Fusion Explorer sample are the very first step to 3D printing a replica of objects and scenes you scan. Note that most 3D printers require meshes to be closed and watertight (without holes) to be able to print. Typically the steps required for 3D printing involve manual cleaning/removal of extraneous geometry, then insertion of 3D geometry to close holes. Some popular 3D design and editing, or CAD, software packages can perform hole filling automatically. We recommend using the binary STL mesh file output when your scan is high resolution or the intended target is a 3D printer, as the file size is smaller than the ASCII .obj format. A sketch of extracting a mesh and writing binary STL follows the note below.
Note
STL is a unitless format, and different mesh applications interpret the positions as being in different units. In our sample, we assume each unit is 1 meter.
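Here is the mesh extraction and STL output sketch referenced above. It assumes the INuiFusionMesh accessors behave as in the SDK samples (with vertices stored three per triangle and one normal per vertex); treat those details, and SaveMeshToStl itself, as illustrative rather than the sample's actual code.

#include <cstdio>

HRESULT SaveMeshToStl(INuiFusionReconstruction* pVolume, const char* path)
{
    INuiFusionMesh* pMesh = nullptr;
    HRESULT hr = pVolume->CalculateMesh(1, &pMesh);  // voxelStep 1 = full resolution
    if (FAILED(hr)) return hr;

    const Vector3* vertices = nullptr;
    const Vector3* normals = nullptr;
    pMesh->GetVertices(&vertices);
    pMesh->GetNormals(&normals);
    UINT triangleCount = pMesh->VertexCount() / 3;   // assumed per-triangle vertex storage

    // Binary STL: 80-byte header, uint32 triangle count, then 50 bytes/triangle.
    FILE* f = fopen(path, "wb");
    if (!f) { pMesh->Release(); return E_FAIL; }
    char header[80] = "Kinect Fusion mesh (binary STL, 1 unit = 1 meter)";
    fwrite(header, sizeof(header), 1, f);
    fwrite(&triangleCount, sizeof(triangleCount), 1, f);

    for (UINT t = 0; t < triangleCount; ++t)
    {
        fwrite(&normals[t * 3], sizeof(Vector3), 1, f);   // one facet normal
        fwrite(&vertices[t * 3], sizeof(Vector3), 3, f);  // three vertices, in meters
        unsigned short attribute = 0;                     // unused attribute byte count
        fwrite(&attribute, sizeof(attribute), 1, f);
    }
    fclose(f);
    pMesh->Release();
    return S_OK;
}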
Head Scanning
The head scanning sample demonstrates how to leverage a combination of Kinect Fusion and Face Tracking to scan high resolution models of faces and heads.
Tips
Add clutter at different depths to scenes when environment scanning to improve problematic tracking.
Mask out background pixels to only focus on the object when scanning small objects with a static sensor.
Don't move the sensor too fast or jerkily.
Don't get too close to objects and surfaces you are scanning; monitor the Kinect depth image.
As we only rely on depth, illumination is not an issue; it even works in the dark.
Some objects may not appear in the depth image as they absorb or reflect too much IR light; try scanning from different angles (especially perpendicular to the surface) to reconstruct them.
If limited processing power is available, prefer smaller voxel resolution volumes with faster, better tracking over high resolution volumes with slow and worse tracking.
If surfaces do not disappear from the volume when something moves, make sure the sensor sees valid depth behind them; if there is 0 depth in the image, the system does not know that it can remove these surfaces, as it is also possible that something very close to the sensor (inside the minimum sensing distance) is occluding the view.