Two-channel noisy recordings of a moving speaker within a limited area
Motivation
This task focuses on a natural situation where the target is a talking person whose location is confined to a specific area. For example, the target could be a speaker seated in a noisy meeting room. The speaker's position is distant from the microphones (say, more than 1 meter) and changes due to small movements of the speaker's head. The goal is to remove typical noise (e.g., babble noise) from the recorded speech. We assume that two microphones are available.

For such a situation, a priori information may be provided in the form of noise-free recordings of the target from several fixed positions within the assumed area. For example, such recordings could be obtained during speaker-only intervals. How effectively can we exploit this a priori knowledge to enhance recordings of the speaker when noise is present and his/her position is not perfectly known and may change within the limited area?
Results
The results are available here.
Scenario
The target is a loudspeaker located within a 30x30 cm area. The loudspeaker is always directed towards two microphones placed 2 meters from the center of the area. Details of the scenario are given in the following figure.
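To get a feel for the geometry, the following sketch computes the spread of inter-channel time differences of arrival (TDOAs) over the 30x30 cm area. The microphone spacing is not specified in the text above (it appears only in the figure), so the value used here is an assumption, as are the coordinates.

```python
import numpy as np

C = 343.0          # speed of sound (m/s)
FS = 16000         # sampling rate of the 16-bit dataset (Hz)
MIC_SPACING = 0.1  # assumed microphone spacing (m); not specified in the text

# Microphones on the x-axis, 2 m from the center of the target area.
mics = np.array([[-MIC_SPACING / 2, 0.0], [MIC_SPACING / 2, 0.0]])
center = np.array([0.0, 2.0])

# Sample candidate source positions over the 30x30 cm area.
xs, ys = np.meshgrid(np.linspace(-0.15, 0.15, 50), np.linspace(-0.15, 0.15, 50))
points = center + np.stack([xs.ravel(), ys.ravel()], axis=1)

# Inter-channel TDOA for each candidate position.
d0 = np.linalg.norm(points - mics[0], axis=1)
d1 = np.linalg.norm(points - mics[1], axis=1)
tdoa = (d0 - d1) / C
print(f"TDOA range: {tdoa.min() * 1e6:.1f} .. {tdoa.max() * 1e6:.1f} microseconds "
      f"({tdoa.min() * FS:.2f} .. {tdoa.max() * FS:.2f} samples at {FS} Hz)")
```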
Development dataset
- Download dev16.zip (14 MB) (Development dataset, 16 kHz, 16 bits)
- Download dev44.1.zip (63 MB) (Development dataset, 44.1 kHz, 24 bits)
For training, the dataset contains noise-free recordings of utterances played by the loudspeaker while it was standing still in one of 16 fixed positions within the target area. The file names have the format dev_position_<xx>.wav, where <xx> is the index of the position.
Next, there are four recordings during which the loudspeaker was moved over four positions. A video of the first recording is available for illustration here. The file names have the format dev_<set>_<positions>_{sim,src,noi,mix}.wav, where <set> is the index of the recording (A, B, C, or D), <positions> contains the indices of the four positions passed during the movement, and {sim,src,noi,mix} denote, respectively, the target source images, the source signal of the target, the noise, and the noisy recording (sim + noi).
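As a quick sanity check, a sketch like the following verifies that the noisy recording is indeed the sum of the target source images and the noise. The file names shown are hypothetical; substitute the actual position indices of the recording you downloaded.

```python
import numpy as np
import soundfile as sf  # pip install soundfile

# Hypothetical file names: substitute the actual position indices.
sim, fs = sf.read("dev_A_1_2_3_4_sim.wav")
noi, _ = sf.read("dev_A_1_2_3_4_noi.wav")
mix, _ = sf.read("dev_A_1_2_3_4_mix.wav")

# The mixture should equal sim + noi up to quantization error.
err = np.max(np.abs(mix - (sim + noi)))
print(f"max |mix - (sim + noi)| = {err:.2e}")
```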
Test dataset
- Download test16.zip (3 MB) (Test dataset, 16 kHz, 16 bits)
- Download test44.1.zip (12 MB) (Test dataset, 44.1 kHz, 24 bits)
The dataset contains five noisy recordings of the loudspeaker moving within the area. The file names have the format test_<set>_x_x_x_x_mix.wav, where <set> is the index of the recording (A, B, C, D, or E). Here, the trajectory of the movement is not revealed.
Tasks
The participants are encouraged to submit:
- Enhanced (de-noised) versions of the test as well as the development noisy recordings
- Estimated trajectories of the loudspeaker in terms of sequences of position indices (mandatory); one possible baseline is sketched after this list
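The following minimal baseline sketch illustrates one way to estimate the trajectory; it is not a prescribed method. It assumes that the inter-channel TDOA alone distinguishes the 16 positions, that the training file indices are zero-padded, and that a frame length of 8192 samples is reasonable; the mixture file name is hypothetical. It estimates a per-frame delay via GCC-PHAT and matches it against reference delays computed from the noise-free training recordings.

```python
import numpy as np
import soundfile as sf  # pip install soundfile

FRAME = 8192  # frame length in samples (assumption)

def gcc_phat_delay(x, y, max_lag=16):
    """Estimate the inter-channel delay (in samples) via GCC-PHAT."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    spec = X * np.conj(Y)
    spec /= np.abs(spec) + 1e-12               # PHAT weighting
    cc = np.fft.irfft(spec, n=n)
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return int(np.argmax(cc)) - max_lag

def frame_delays(stereo, frame=FRAME):
    """Per-frame inter-channel delay of a two-channel signal."""
    return [gcc_phat_delay(stereo[i:i + frame, 0], stereo[i:i + frame, 1])
            for i in range(0, len(stereo) - frame, frame)]

# Reference delay for each of the 16 fixed positions, from the noise-free
# training recordings (zero-padded indices are an assumption).
ref = {}
for idx in range(1, 17):
    sig, fs = sf.read(f"dev_position_{idx:02d}.wav")
    ref[idx] = np.median(frame_delays(sig))

# Assign each frame of a noisy mixture to the position with the nearest delay.
mix, fs = sf.read("dev_A_1_2_3_4_mix.wav")     # hypothetical file name
trajectory = [min(ref, key=lambda k: abs(ref[k] - d)) for d in frame_delays(mix)]
print(trajectory)
```

In practice the TDOA spread over the area may be too small to separate all 16 positions reliably, so a matching based on full relative transfer functions or on spatial covariance could be considered instead.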
Submissions
Each participant should make his/her results available online in the form of an archive called <YourName>_<dataset>.zip. The files containing the enhanced utterances should be named <dataset>_<set>_x_x_x_x_enh.wav, where <dataset> is either dev or test, <set> is A, B, C, D, or E, and x_x_x_x are the estimated positions of the target during the movement.
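For illustration, here is a sketch of packaging a submission following this naming scheme. The participant name and position estimates are hypothetical placeholders, and the enhanced .wav files must already exist on disk.

```python
import zipfile

# Hypothetical values: replace with your own name and estimated positions.
name, dataset = "YourName", "test"
estimates = {"A": (1, 5, 9, 13), "B": (2, 6, 10, 14)}

with zipfile.ZipFile(f"{name}_{dataset}.zip", "w") as z:
    for s, pos in estimates.items():
        p = "_".join(str(i) for i in pos)
        z.write(f"{dataset}_{s}_{p}_enh.wav")  # e.g. test_A_1_5_9_13_enh.wav
```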
Each participant should then send an email to "zbynek.koldovsky (at) tul.cz" providing:
- contact information (name, affiliation)
- basic information about his/her algorithm, including its average running time (in seconds per test excerpt and per GHz of CPU) and a bibliographical reference if possible
- the URL of the tarball(s)
Evaluation criteria
The evaluation will be done using the perceptual evaluation toolkit PEASS v2.0.
Licensing issues
All files are distributed under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 license. The files submitted by participants will be made available on a website under the terms of the same license.

The recordings are authored by Emmanuel Vincent, Zbynek Koldovsky, and Jiri Malek.