Wednesday, December 18, 2013

MS Kinect microphone array geometry

This post provides the geometrical parameters of the MS Kinect microphone array that you need if you want to do acoustic beamforming from the 4 channels without using the Microsoft SDK (for example if you are using Linux, libfreenect, or OpenNI).

The image below is a drawing of the internal parts of the Kinect, with the microphones shown in purple:
As you can see, the microphones are not spaced at regular intervals. You can check the exact position of the microphones without disassembling the Kinect by simply opening the plastic covers at the bottom of the device with a flat-head screwdriver (see the next pictures).
The microphones sit in the small rubber cubes highlighted in the following picture, which provide some mechanical isolation from the body of the device:
Finally, the next drawing shows the positions of the microphones, measured in a reference frame whose origin is at the device's middle point, with the x-axis along the device's length and the y-axis pointing in the viewing direction of the cameras. The drawing should be read as if looking at the device from above (I used an Xbox 360 Kinect MODEL 1473; the product codes returned by lsusb on Linux are: 045e:02ae Camera, 045e:02ad Audio, 045e:02c2 Motor):

In text form, the coordinates (x, y) of each microphone in cm are listed below (see the sketch after the list for one way to use them):
channel 1: (11.3, 2)
channel 2: (-3.6, 2)
channel 3: (-7.6, 2)
channel 4: (-11.3, 2)
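As an example of how these numbers can be used, here is a minimal Python sketch that turns the coordinates into per-channel delays for a simple delay-and-sum beamformer. The speed of sound, the 16 kHz sample rate and the sign convention are assumptions of mine, not something dictated by the Kinect:

import numpy as np

# Microphone x positions in metres, in channel order (the cm values above / 100).
mic_x = np.array([0.113, -0.036, -0.076, -0.113])

SPEED_OF_SOUND = 343.0   # m/s, roughly, at room temperature
SAMPLE_RATE = 16000      # Hz; use the rate of your actual recordings

def steering_delays(angle_deg):
    """Per-channel delays, in samples, for a delay-and-sum beamformer.

    The angle is measured from the y axis (broadside). For a far-field
    source, the relative arrival-time difference at a microphone with
    coordinate x is x * sin(angle) / c; the sign only fixes which side
    of the array counts as positive.
    """
    tau = mic_x * np.sin(np.radians(angle_deg)) / SPEED_OF_SOUND
    tau -= tau.min()          # shift so that all delays are non-negative
    return tau * SAMPLE_RATE

# Example: delays (in samples) to steer the beam 30 degrees off broadside.
print(steering_delays(30.0))

Shifting each channel by its delay and summing the four signals gives the beamformed output; fractional-sample delays are normally handled by interpolation or in the frequency domain.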
The mapping between microphones and audio channels was determined by playing a tone on my telephone next to each microphone while recording the audio with libfreenect's wavrecord tool. I then picked the channel with the highest sound intensity with the help of Audacity (see the screenshot below).
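If you prefer to do this check programmatically instead of looking at the waveforms, a sketch along these lines should work (it assumes one mono integer-PCM wav file per channel, as produced by wavrecord; the file names are placeholders):

import wave
import numpy as np

def rms(path):
    """Root-mean-square amplitude of a mono integer-PCM wav file."""
    with wave.open(path, "rb") as w:
        width = w.getsampwidth()               # bytes per sample
        raw = w.readframes(w.getnframes())
    dtype = {2: np.int16, 4: np.int32}[width]  # assumes 16- or 32-bit samples
    samples = np.frombuffer(raw, dtype=dtype).astype(np.float64)
    return np.sqrt(np.mean(samples ** 2))

files = ["channel1.wav", "channel2.wav", "channel3.wav", "channel4.wav"]
levels = [rms(f) for f in files]
print("loudest channel:", files[int(np.argmax(levels))])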


Appendix

Some useful sox commands for handling the files produced by wavrecord:
1) Merge the four channels into a single wav file (note, however, that the program you use to play back the resulting file might interpret some channels as surround or low-frequency channels):
sox channel1.wav channel2.wav channel3.wav channel4.wav --channels 4 -M output_4ch.wav

2) Create a stereo file by selecting only the two outermost microphones (the left and right channels of the output correspond to left and right in the video from the camera):
sox channel1.wav channel4.wav -M --channels 2 output_stereo.wav

3) Create a stereo file by mixing all four channels with weights determined by the microphone positions. Note that this is not necessarily better than option 2. If we call x the vector of x coordinates of the mics, in channel order, we have
x = [11.3 -3.6 -7.6 -11.3]
If we call ldx the distance of each microphone from the left microphone (ch1), normalized by the distance between the two outermost microphones:
ldx = [0.0 0.6593 0.8363 1.0]
we can define, somewhat arbitrarily, the left weights lw as (1-ldx)^2 and additionally impose that the weights should sum to 1:
lw = [0.875 0.1016 0.0235 0.0]
If we repeat the same procedure for the right channel (measuring distances from ch4), we obtain:
rw = [0.0 0.2037 0.3277 0.4686]
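For completeness, these weights can be reproduced with a few lines of Python (the printed values match the ones above up to rounding):

import numpy as np

x = np.array([11.3, -3.6, -7.6, -11.3])   # mic x coordinates in cm, channel order

ldx = (x[0] - x) / (x[0] - x[-1])         # normalized distance from the left mic (ch1)
rdx = 1.0 - ldx                           # normalized distance from the right mic (ch4)

lw = (1.0 - ldx) ** 2
lw /= lw.sum()                            # left weights, forced to sum to 1
rw = (1.0 - rdx) ** 2
rw /= rw.sum()                            # right weights, forced to sum to 1

print(np.round(lw, 4))   # [0.875  0.1016 0.0235 0.    ]
print(np.round(rw, 4))   # [0.     0.2037 0.3277 0.4686]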
Then we can use sox to produce the mixture with the following command (in the remix specification, the channel numbers refer to the order of the input files, so 1 is channel1.wav and 4 is channel4.wav):

sox channel1.wav channel2.wav channel3.wav channel4.wav -M --channels 2 output_stereo_mixture.wav remix 1v0.875,2v0.1016,3v0.0235 2v0.2037,3v0.3277,4v0.4686