Two-channel noisy recordings of a moving speaker within a limited area
Motivation
This task addresses a natural situation in which the target is a speaking person whose location is confined to a specific area; for example, a speaker seated in a noisy meeting room. The speaker's position is distant from the microphones (say, more than 1 meter) and changes due to small movements of the speaker's head. The goal is to remove typical noise (e.g., babble noise) from the recorded speech. We assume that two microphones are available. For such a situation, a priori information may be provided in the form of noise-free recordings of the target from several (fixed) positions within the assumed area. For example, such recordings could be obtained during speaker-only intervals. How effectively can we exploit this a priori knowledge to enhance recordings of the speaker when noise is present and his/her position is not perfectly known and may change within the limited area?
Results
The results are available here
Scenario
The target is a loudspeaker positioned within a 30x30 cm area. The loudspeaker always faces two microphones that are 2 meters distant from the center of the area. Details of the scenario are given in the following figure.
Development dataset
Download dev16.zip
Download dev44.1.zip

For training, the dataset contains noise-free recordings of utterances played by the loudspeaker while it stood still (without movement) in one of 16 fixed positions within the target area. The file names have the format dev_position_<xx>.wav, where <xx> is the index of the position.
In addition, there are four recordings during which the loudspeaker was moved over four positions. A video of the first recording is available for illustration here
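As a minimal convenience sketch, the 16 development file names can be enumerated directly from the stated naming convention. The zero-padded two-digit index is an assumption on my part; adjust the format if the archive uses plain integers.

```python
# Enumerate the 16 noise-free development files following the stated
# convention dev_position_<xx>.wav (zero-padded index is assumed here).
dev_files = [f"dev_position_{i:02d}.wav" for i in range(1, 17)]

for name in dev_files:
    print(name)  # dev_position_01.wav ... dev_position_16.wav
```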

Test dataset
Download test16.zip
Download test44.1.zip

The dataset contains five noisy recordings of the loudspeaker moving within the area. The file names have the format test_<set>_x_x_x_x_mix.wav, where <set> is the index of the recording (A, B, C, D, or E). Here, the trajectory of the movement is not revealed.
Tasks
The participants are encouraged to submit:
- enhanced (de-noised) test as well as development noisy recordings
- estimated trajectories of the loudspeaker in terms of sequences of position indices (mandatory)
Submissions
Each participant should make his/her results available online in the form of a tarball called <YourName>_<dataset>.zip. The files containing the enhanced utterances should be named: <dataset>_<set>_x_x_x_x_enh.wav
where <dataset> is either dev or test, <set> is A, B, C, D, or E, and x_x_x_x are the estimated positions of the target during the movement.
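To make the naming scheme concrete, the following sketch builds a submission file name from an estimated trajectory. The helper name and the example position indices are hypothetical; only the name pattern comes from the rules above.

```python
def submission_name(dataset: str, subset: str, positions: list[int]) -> str:
    """Build the required file name <dataset>_<set>_x_x_x_x_enh.wav
    from an estimated trajectory given as a list of position indices."""
    assert dataset in ("dev", "test")
    trajectory = "_".join(str(p) for p in positions)
    return f"{dataset}_{subset}_{trajectory}_enh.wav"

print(submission_name("test", "A", [3, 7, 12, 5]))
# test_A_3_7_12_5_enh.wav
```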
Each participant should then send an email to "zbynek.koldovsky (at) tul.cz" providing:
- contact information (name, affiliation)
- basic information about his/her algorithm, including its average running time (in seconds per test excerpt and per GHz of CPU) and a bibliographical reference if possible
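Since the requested running time is normalized per test excerpt and per GHz of CPU, here is a small sketch of that arithmetic; the function name and the example numbers are illustrative only.

```python
def normalized_runtime(total_seconds: float, n_excerpts: int, cpu_ghz: float) -> float:
    """Running time in seconds per test excerpt and per GHz of CPU,
    as requested in the submission email."""
    return total_seconds / (n_excerpts * cpu_ghz)

# Example: 150 s total for 5 excerpts on a 3.0 GHz CPU
print(normalized_runtime(150.0, 5, 3.0))  # 10.0 s per excerpt per GHz
```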
- the URL of the tarball(s)
Evaluation criteria
The evaluation will be done with the perceptual evaluation toolkit PEASS v2.0.
Licensing issues
All files are distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license.
The recordings are authored by Emmanuel Vincent, Zbynek Koldovsky, and Jiri Malek.
Back to Audio source separation top