Two-channel noisy recordings of a moving speaker within a limited area
Motivation
This task focuses on a natural situation where the target is a talking person whose location is confined to a specific area. For example, the target could be a speaker seated in a noisy meeting room. The speaker's position is distant from the microphones (say, more than 1 meter) and changes due to small movements of the speaker's head. The goal is to remove typical noise (e.g., babble noise) from the recorded speech. We assume that two microphones are available.

For such a situation, a priori information may be provided in the form of noise-free recordings of the target from several fixed positions within the assumed area. For example, such recordings could be obtained during speaker-only intervals. How effectively can we exploit this a priori knowledge to enhance recordings of the speaker when noise is present and his/her position is not perfectly known and may change within the limited area?
Results
The results are available here.
Scenario
The target is a loudspeaker located within a 30x30 cm area. The loudspeaker is always directed towards two microphones placed 2 meters from the center of the area. Details of the scenario are given in the following figure.
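To get a feel for the geometry, the following sketch computes the spread of inter-channel time differences of arrival (TDOAs) over the 30x30 cm area. The microphone spacing is not specified in the text above (it appears only in the figure), so the value used here is an assumption, as are the coordinates.

```python
import numpy as np

C = 343.0          # speed of sound (m/s)
FS = 16000         # sampling rate of the 16-bit dataset (Hz)
MIC_SPACING = 0.1  # assumed microphone spacing (m); not specified in the text

# Microphones on the x-axis, 2 m from the center of the target area.
mics = np.array([[-MIC_SPACING / 2, 0.0], [MIC_SPACING / 2, 0.0]])
center = np.array([0.0, 2.0])

# Sample candidate source positions over the 30x30 cm area.
xs, ys = np.meshgrid(np.linspace(-0.15, 0.15, 50), np.linspace(-0.15, 0.15, 50))
points = center + np.stack([xs.ravel(), ys.ravel()], axis=1)

# Inter-channel TDOA for each candidate position.
d0 = np.linalg.norm(points - mics[0], axis=1)
d1 = np.linalg.norm(points - mics[1], axis=1)
tdoa = (d0 - d1) / C
print(f"TDOA range: {tdoa.min() * 1e6:.1f} .. {tdoa.max() * 1e6:.1f} microseconds "
      f"({tdoa.min() * FS:.2f} .. {tdoa.max() * FS:.2f} samples at {FS} Hz)")
```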
Development dataset
- Download dev16.zip (14 MB) (Development dataset, 16 kHz, 16 bits)
- Download dev44.1.zip (63 MB) (Development dataset, 44.1 kHz, 24 bits)
For training, the dataset contains noise-free recordings of utterances played by the loudspeaker while it was standing still in one of 16 fixed positions within the target area. The file names have the format dev_position_<xx>.wav, where <xx> is the index of the position.
Next, there are four recordings during which the loudspeaker was moved over four positions. A video of the first recording is available for illustration here. The file names have the format dev_<set>_<positions>_{sim,src,noi,mix}.wav, where <set> is the index of the recording (A, B, C, or D), <positions> contains the indices of the four positions passed during the movement, and {sim,src,noi,mix} denote, respectively, the target source images, the source signal of the target, the noise, and the noisy recording (sim + noi).
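As a quick sanity check, a sketch like the following verifies that the noisy recording is indeed the sum of the target source images and the noise. The file names shown are hypothetical; substitute the actual position indices of the recording you downloaded.

```python
import numpy as np
import soundfile as sf  # pip install soundfile

# Hypothetical file names: substitute the actual position indices.
sim, fs = sf.read("dev_A_1_2_3_4_sim.wav")
noi, _ = sf.read("dev_A_1_2_3_4_noi.wav")
mix, _ = sf.read("dev_A_1_2_3_4_mix.wav")

# The mixture should equal sim + noi up to quantization error.
err = np.max(np.abs(mix - (sim + noi)))
print(f"max |mix - (sim + noi)| = {err:.2e}")
```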
Test dataset
- Download test16.zip (3 MB) (Test dataset, 16 kHz, 16 bits)
- Download test44.1.zip (12 MB) (Test dataset, 44.1 kHz, 24 bits)
The dataset contains five noisy recordings of the loudspeaker moving within the area. The file names have the format test_<set>_x_x_x_x_mix.wav, where <set> is the index of the recording (A, B, C, D, or E). Here, the trajectory of the movement is not revealed.
Tasks
The participants are encouraged to submit:
- Enhanced (de-noised) versions of the test as well as the development noisy recordings
- Estimated trajectories of the loudspeaker in terms of sequences of position indices (mandatory); one possible baseline is sketched after this list
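The following minimal baseline sketch illustrates one way to estimate the trajectory; it is not a prescribed method. It assumes that the inter-channel TDOA alone distinguishes the 16 positions, that the training file indices are zero-padded, and that a frame length of 8192 samples is reasonable; the mixture file name is hypothetical. It estimates a per-frame delay via GCC-PHAT and matches it against reference delays computed from the noise-free training recordings.

```python
import numpy as np
import soundfile as sf  # pip install soundfile

FRAME = 8192  # frame length in samples (assumption)

def gcc_phat_delay(x, y, max_lag=16):
    """Estimate the inter-channel delay (in samples) via GCC-PHAT."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    spec = X * np.conj(Y)
    spec /= np.abs(spec) + 1e-12               # PHAT weighting
    cc = np.fft.irfft(spec, n=n)
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return int(np.argmax(cc)) - max_lag

def frame_delays(stereo, frame=FRAME):
    """Per-frame inter-channel delay of a two-channel signal."""
    return [gcc_phat_delay(stereo[i:i + frame, 0], stereo[i:i + frame, 1])
            for i in range(0, len(stereo) - frame, frame)]

# Reference delay for each of the 16 fixed positions, from the noise-free
# training recordings (zero-padded indices are an assumption).
ref = {}
for idx in range(1, 17):
    sig, fs = sf.read(f"dev_position_{idx:02d}.wav")
    ref[idx] = np.median(frame_delays(sig))

# Assign each frame of a noisy mixture to the position with the nearest delay.
mix, fs = sf.read("dev_A_1_2_3_4_mix.wav")     # hypothetical file name
trajectory = [min(ref, key=lambda k: abs(ref[k] - d)) for d in frame_delays(mix)]
print(trajectory)
```

In practice the TDOA spread over the area may be too small to separate all 16 positions reliably, so a matching based on full relative transfer functions or on spatial covariance could be considered instead.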
Submissions
Each participant should make his/her results available online in the form of an archive called <YourName>_<dataset>.zip. The files containing the enhanced utterances should be named <dataset>_<set>_x_x_x_x_enh.wav, where <dataset> is either dev or test, <set> is A, B, C, D, or E, and x_x_x_x are the estimated positions of the target during the movement.
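For illustration, here is a sketch of packaging a submission following this naming scheme. The participant name and position estimates are hypothetical placeholders, and the enhanced .wav files must already exist on disk.

```python
import zipfile

# Hypothetical values: replace with your own name and estimated positions.
name, dataset = "YourName", "test"
estimates = {"A": (1, 5, 9, 13), "B": (2, 6, 10, 14)}

with zipfile.ZipFile(f"{name}_{dataset}.zip", "w") as z:
    for s, pos in estimates.items():
        p = "_".join(str(i) for i in pos)
        z.write(f"{dataset}_{s}_{p}_enh.wav")  # e.g. test_A_1_5_9_13_enh.wav
```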
Each participant should then send an email to "zbynek.koldovsky (at) tul.cz" providing:
- contact information (name, affiliation)
- basic information about his/her algorithm, including its average running time (in seconds per test excerpt and per GHz of CPU) and a bibliographical reference if possible
- the URL of the tarball(s)
Evaluation criteria
The evaluation will be done using the perceptual evaluation toolkit PEASS v2.0.
Licensing issues
All files are distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license. The files submitted by participants will be made available on a website under the terms of the same license.

The recordings are authored by Emmanuel Vincent, Zbynek Koldovsky, and Jiri Malek.