Sound Source Localization Assumptions

I've been reading a number of research papers and other material on sound source localization as I attempt to build such a system (e.g. for traffic jam vs. open road classification), especially those about Time Difference of Arrival (TDoA) estimation, combining evidence from multiple microphone pairs, and calibrating the locations of the microphones in an array.  Some of those papers have been clear about their assumptions, which helps in understanding the limits of their designs.

The approach I'm taking involves digitizing the microphone inputs and computing the cross-correlation of the signals recorded by a pair of microphones at various time delays (e.g. how does the sound that arrived at microphone 1 compare to the sound that arrived at microphone 2 delayed by 1 ms, or delayed by 3 samples, etc.).  Let's see what assumptions I'm making...
  1. The speed of sound is constant across the region of interest (sources to microphones) over the short term (e.g. 30 minutes). In reality it varies with temperature, humidity, and air pressure, and wind increases the effective speed in the direction of the wind and decreases it in the opposite direction.
  2. For planar sound sources (e.g. cars on surface roads over fairly flat ground), we have at least 3 microphones in the plane of the sound.
  3. For non-planar sources (e.g. aircraft, or people sitting and standing in a conference room), we have at least 4 microphones in a sensible non-planar arrangement; for example, 4 microphones in a tetrahedral arrangement gives us the best omni-directional resolution, because poor resolution from one pair (e.g. when the sound source is on the same line as the two microphones, but not between them) is compensated for by the other pairs.
  4. Not all microphones need to be paired up (e.g. we might have 4 microphones in one array, making up 6 pairs, and then another 3 microphones in another array, contributing another 3 pairs), so not all constraints below apply to every possible microphone pair, just those pairs used for TDoA estimation.
  5. For my "traffic detection, at a distance" project (the far-field), I want to be able to distinguish between cars in the two lanes of the road (coming and going from the congested intersection), out to a distance of around 250 meters.  The road width is about 25 feet, or roughly 7.6 meters (measured from an aerial image), so a separation of around 2 meters would be enough to distinguish the two directions.  That amounts to an angular resolution of about 0.5 degrees... pretty darn small.  I've not done calculations yet to show whether this is feasible.
  6. For near-field applications (e.g. aiming a camera at a speaker in a conference room), we'd like to be able to distinguish objects 30cm apart, out to around 5m, or an angular resolution of about 3.5 degrees.
  7. The microphones in each pair are not too near each other, else our position resolution will be diminished, because the maximum TDoA is small. For example, with microphones 1 inch apart, the maximum TDoA is about 74 microseconds (taking the speed of sound as roughly 343 m/s); with sound recorded at CD rate (44,100 samples per second, about 23 microseconds per sample), a sound can arrive at the second microphone at most 3 samples later than at the first, so at best we can say that a sound arrived from one of 7 (3*2 + 1) "directions" (broadly defined) relative to the microphone pair.
  8. The microphones in each pair have similar frequency response, else it is hard to compare their signals to compute TDoA.
  9. The microphones in each pair have similar response to sounds arriving from any direction; this is actually very hard to achieve, as microphones typically have some directionality to their response, even those known as omni-directional. If we have omni-directional microphones that all have the same orientation, then a source far away will appear to come from essentially the same direction relative to each microphone's orientation; but for a nearby sound source, the sound might arrive from very different directions relative to each microphone's orientation, which can change both the amplitude response and the frequency response.
  10. The analog-to-digital conversion (ADC) runs at the same sample rate for all microphones.
  11. The path from microphone through ADC has a similar response for both microphones in a pair (e.g. they have the same gain and electrical noise).
  12. The signal of interest is large enough by the time it is picked up by the microphones and converted to digital to be distinguished from noise (i.e. the signal-to-noise ratio is high enough, whatever that is).
  13. The sampled data is aligned (i.e. we have a straightforward means to get all the samples recorded "at the same time").  For my experiments so far this was easy because I'm using a 4-channel recorder, but for a more complete prototype I'll need a new solution.  My friend Bent recommends, based on his personal experience, the Motu 8pre, which has 8 microphone inputs; even better, he is willing to loan me his for some experiments!
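
A rough way to check the arithmetic behind items 4 through 7 is a few lines of Python. This is only a sketch: the speed of sound (343 m/s, roughly dry air at 20 C) is my assumption, and the function names are made up for illustration.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumption: dry air at about 20 C
SAMPLE_RATE_HZ = 44_100     # CD rate, as in item 7


def angular_resolution_deg(separation_m, distance_m):
    """Angle subtended by two sources separation_m apart, seen from distance_m away."""
    return math.degrees(math.atan2(separation_m, distance_m))


def max_tdoa_s(mic_spacing_m, speed=SPEED_OF_SOUND_M_S):
    """Largest possible delay: the sound travels along the line through both microphones."""
    return mic_spacing_m / speed


def distinct_delays(mic_spacing_m, sample_rate=SAMPLE_RATE_HZ, speed=SPEED_OF_SOUND_M_S):
    """Number of whole-sample delays (-n..+n) that one microphone pair can report."""
    max_samples = math.floor(max_tdoa_s(mic_spacing_m, speed) * sample_rate)
    return 2 * max_samples + 1


def mic_pairs(n_mics):
    """Number of distinct pairs within one array of n_mics microphones."""
    return math.comb(n_mics, 2)


print("far-field resolution (deg):", angular_resolution_deg(2.0, 250.0))
print("near-field resolution (deg):", angular_resolution_deg(0.3, 5.0))
print("max TDoA at 1 inch (microseconds):", max_tdoa_s(0.0254) * 1e6)
print("distinct delays at 1 inch:", distinct_delays(0.0254))
print("pairs in a 4-mic and a 3-mic array:", mic_pairs(4), mic_pairs(3))
```

Changing the spacing passed to `distinct_delays` (say, to 0.5 for half a meter) shows how quickly the number of resolvable delays grows as the microphones move apart.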
I'm sure I'm making even more assumptions than I've included here (and some of these are really "desired features").
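
For reference, the cross-correlation idea described before the list can be sketched in plain Python. This is a brute-force, time-domain version over whole-sample lags, written for clarity rather than speed; it is not taken from any of the papers, and the function names and the synthetic test signal are purely illustrative.

```python
import math


def cross_correlation(a, b, lag):
    """Sum of a[i] * b[i + lag] over the indices where both samples exist."""
    start = max(0, -lag)
    stop = min(len(a), len(b) - lag)
    return sum(a[i] * b[i + lag] for i in range(start, stop))


def estimate_delay_samples(a, b, max_lag):
    """Return the lag (in samples) at which b best matches a.

    A positive result means the sound reached microphone b later than microphone a.
    """
    lags = range(-max_lag, max_lag + 1)
    return max(lags, key=lambda lag: cross_correlation(a, b, lag))


# Synthetic check: a short decaying burst, and a copy of it delayed by 3 samples.
burst = [math.sin(0.9 * n) * math.exp(-0.05 * n) for n in range(64)]
mic1 = burst + [0.0] * 8
mic2 = [0.0] * 3 + burst + [0.0] * 5

delay = estimate_delay_samples(mic1, mic2, max_lag=4)
print(delay, "samples, or", delay / 44_100 * 1e6, "microseconds at CD rate")
```

The `max_lag` argument is where assumption 7 shows up: there is no point searching lags larger than the microphone spacing allows, and real recordings will also run into assumptions 8 through 13 (matched responses, enough signal above the noise, aligned samples) in ways this clean synthetic signal does not.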





