Two-channel mixtures of speech and real-world background noise


We propose to repeat the task "Two-channel mixtures of speech and real-world background noise" without the CHiME corpus, because the reference speech data have already been provided in the second CHiME challenge.

Introduction

This task aims to evaluate denoising and direction-of-arrival (DOA) estimation techniques using the SiSEC 2010 noisy speech dataset.

Description of the dataset

We consider two-channel mixtures of one speech source and real-world background noise sampled at 16 kHz.

These data are part of the SiSEC 2010 noisy speech dataset. Background noise signals were recorded with a pair of omnidirectional microphones spaced 8.6 cm apart in six different public environments:
  • Su1: subway car moving
  • Su2: subway car standing at station
  • Ca1: cafeteria 1
  • Ca2: cafeteria 2 (a different cafeteria from Ca1)
  • Sq1: square 1
  • Sq2: square 2 (a different square from Sq1)
and in two different positions within each environment:
  • Ce: center (except in Su1 and Su2)
  • Co: corner
Two recordings identified by a letter (A or B) were made in each case. Mixtures were then generated by adding a speech signal to the background noise signal. For the reverberant environments Su and Ca, the speech signals were recorded in an office room using the same microphone pair. For the outdoor environment Sq, the speech signals were mixed anechoically through simulation. The distance between the sound source and the array centroid was 1.0 m for female speech and 0.8 m for male speech. The direction of arrival (DOA) of the speech source was different in each mixture and the signal-to-noise ratio (SNR) was drawn randomly between -17 and +12 dB.
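To illustrate what this geometry implies for DOA estimation, the following sketch computes the largest possible time difference of arrival (TDOA) between the two microphones. It assumes a speed of sound of 343 m/s and a far-field source, neither of which is stated in the dataset description. The function name max_tdoa_samples is chosen for this example only.

```python
SPEED_OF_SOUND = 343.0   # m/s, assumed (not given in the dataset description)
MIC_SPACING = 0.086      # m, microphone spacing from the dataset description
SAMPLE_RATE = 16000      # Hz, sampling rate from the dataset description


def max_tdoa_samples(spacing=MIC_SPACING, fs=SAMPLE_RATE, c=SPEED_OF_SOUND):
    """Largest inter-microphone delay in samples (source on the array axis)."""
    return spacing / c * fs


print(round(max_tdoa_samples(), 2))  # prints 4.01
```

Under these assumptions the delay stays within roughly ±4 samples, so sub-sample TDOA resolution is likely to matter when mapping a TDOA to a DOA.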

Test data

Download the test set (13 MB)

The data consist of 20 stereo WAV audio files that can be imported into Matlab using the wavread command. These files are named test_<env>_<cond>_<take>_mix.wav, where
  • <env>: noise environment (Su1, Su2, Ca1, Ca2, Sq1, Sq2)
  • <cond>: recording condition (Ce, Co)
  • <take>: take (A, B)
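The naming scheme above can be enumerated programmatically. The sketch below is a minimal illustration that assumes every combination exists except the center (Ce) position in the two subway environments, which is consistent with the stated total of 20 files. The function name list_test_mixtures is hypothetical.

```python
ENVS = ["Su1", "Su2", "Ca1", "Ca2", "Sq1", "Sq2"]
CONDS = ["Ce", "Co"]
TAKES = ["A", "B"]


def list_test_mixtures():
    """Return the expected test mixture file names."""
    names = []
    for env in ENVS:
        for cond in CONDS:
            # The dataset description excludes the center position in Su1 and Su2
            if cond == "Ce" and env.startswith("Su"):
                continue
            for take in TAKES:
                names.append(f"test_{env}_{cond}_{take}_mix.wav")
    return names


print(len(list_test_mixtures()))  # prints 20
```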

Development data

Download the development set (24 MB)

The data consist of 36 WAV audio files and 10 text files. These files are named as follows:
  • dev_<env>_<cond>_<take>_src.wav: single-channel speech signal
  • dev_<env>_<cond>_<take>_sim.wav: two-channel spatial image of the speech source
  • dev_<env>_<cond>_<take>_noi.wav: two-channel spatial image of the background noise
  • dev_<env>_<cond>_<take>_mix.wav: two-channel mixture signal
  • dev_<env>_<cond>_<take>_DOA.txt: DOA of the speech source (see the SiSEC 2010 wiki for the convention adopted to measure DOA)
where
  • <env>: noise environment (Su1, Ca1, Sq1)
  • <cond>: recording condition (Ce, Co)
  • <take>: take (A, B)
Since the source DOAs were measured geometrically in the Su1 and Ca1 environments, they may contain a measurement error of up to a few degrees. In contrast, there is no such error in the Sq environment, because the spatial images of the speech source were simulated. The Co condition of the Ca1 environment has only take A.

Tasks and reference software

We propose the following 3 tasks:
  1. speaker DOA estimation: estimate the DOA of the speech source
  2. speech signal estimation: estimate the single-channel speech source
  3. speech and noise spatial image estimation: decompose the mixture signal into two two-channel signals corresponding to the speech source and the background noise

Participants are welcome to use some of the Matlab reference software below to build their own algorithms:
  • stft_multi.m: multichannel STFT
  • istft_multi.m: multichannel inverse STFT
  • example_denoising.m: TDOA estimation by GCC-PHATmax, ML target and noise variance estimation under a diffuse noise model, and multichannel Wiener filtering
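For readers unfamiliar with GCC-PHAT, the sketch below shows the basic idea: whiten the cross-spectrum of the two channels so that only its phase remains, then pick the lag that maximizes the resulting correlation. It is not the code of example_denoising.m. It assumes a single frame, circular correlation, and integer lags, and it uses a naive O(N²) DFT to avoid external dependencies. All function names are hypothetical.

```python
import cmath
import math
import random


def dft(x):
    """Naive discrete Fourier transform (fine for short frames only)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def gcc_phat_delay(left, right, max_lag):
    """Return the integer lag (in samples) by which `right` lags `left`."""
    n = len(left)
    spec_l, spec_r = dft(left), dft(right)
    # Cross-spectrum with PHAT weighting: keep the phase, discard the magnitude
    cross = []
    for a, b in zip(spec_l, spec_r):
        c = b * a.conjugate()
        mag = abs(c)
        cross.append(c / mag if mag > 1e-12 else 0j)
    # Inverse DFT evaluated only at the physically plausible lags
    best_lag, best_val = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        val = sum(c * cmath.exp(2j * math.pi * k * lag / n)
                  for k, c in enumerate(cross)).real
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag


# Synthetic check: the right channel is the left one circularly delayed by 3 samples
rng = random.Random(0)
n = 64
left = [rng.uniform(-1.0, 1.0) for _ in range(n)]
right = [left[(t - 3) % n] for t in range(n)]
print(gcc_phat_delay(left, right, max_lag=4))
```

A practical implementation would use an FFT, average over many frames, and interpolate around the peak to obtain sub-sample delays.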

Submission

Each participant is asked to submit the results of their algorithm for task 2 and/or task 3 over all or part of the mixtures in the development dataset and the test dataset. Results for task 1 may also be submitted if available.

Each participant should make their results available online in the form of a tarball with the following file naming convention:
  • <dataset>_<env>_<cond>_<take>_src.wav: single-channel speech signal
  • <dataset>_<env>_<cond>_<take>_sim.wav: two-channel spatial image of the speech source
  • <dataset>_<env>_<cond>_<take>_noi.wav: two-channel spatial image of the background noise
  • <dataset>_<env>_<cond>_<take>_DOA.txt: DOA of the speech source
where <dataset> is test or dev.
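Before packing the tarball, file names can be checked against the convention above. The following sketch uses a regular expression built from the environment, condition, and take labels listed on this page. The names RESULT_NAME and is_valid_result_name are hypothetical, and the check covers only the name format, not whether a given combination actually exists in the dataset.

```python
import re

# <dataset>_<env>_<cond>_<take>_<kind>.<ext>, with .wav for signals and .txt for DOA
RESULT_NAME = re.compile(
    r"^(test|dev)_(Su1|Su2|Ca1|Ca2|Sq1|Sq2)_(Ce|Co)_(A|B)_"
    r"(?:(?:src|sim|noi)\.wav|DOA\.txt)$"
)


def is_valid_result_name(name):
    """Return True if `name` follows the submission naming convention."""
    return RESULT_NAME.match(name) is not None


print(is_valid_result_name("test_Su1_Co_A_src.wav"))  # prints True
print(is_valid_result_name("test_Su1_Co_A_mix.wav"))  # prints False
```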

Each participant should then send an email to "onono (at) nii.ac.jp" and "zbynek.koldovsky (at) tul.cz" providing:
  • contact information (name, affiliation)
  • basic information about the algorithm, including its average running time (in seconds per test excerpt and per GHz of CPU) and a bibliographical reference if possible
  • the URL of the tarball

The submitted audio files will be made available on a website under the terms of the Licensing section below.

Evaluation criteria

We propose to use the same evaluation criteria as in SiSEC 2010, except that the order of the estimated sources must be recovered.

The estimated speaker DOAs in task 1 will be evaluated in terms of the absolute difference from the true DOAs.
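A minimal sketch of this criterion is given below. It assumes that estimated and true DOAs are available as plain numbers keyed by mixture name and expressed in the same unit and convention, and it does not handle angle wrap-around. The function name doa_abs_errors is hypothetical.

```python
def doa_abs_errors(estimated, reference):
    """Absolute DOA error per mixture, in the unit of the inputs.

    Both arguments map a mixture name to a DOA value; mixtures missing
    from either mapping are skipped.
    """
    return {name: abs(estimated[name] - reference[name])
            for name in estimated if name in reference}
```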

The estimated speech signals in task 2 will be evaluated via the energy ratio criteria defined in the BSS_EVAL toolbox, which allow for arbitrary filtering between the estimated source and the true source.

The estimated speech and noise spatial image signals in task 3 will be evaluated via the energy ratio criteria introduced for the Stereo Audio Source Separation Evaluation Campaign and via the perceptually motivated criteria in the PEASS toolkit.
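As a rough illustration of an energy ratio criterion on spatial images, the sketch below computes an overall signal-to-distortion ratio (SDR), taken here as the ratio of the true image energy to the energy of the total estimation error, summed over channels and samples. This is a simplified stand-in for the evaluation toolboxes, not a replacement: it does not split the error into spatial, interference, and artifact components, and the function name image_sdr_db is hypothetical.

```python
import math


def image_sdr_db(true_image, est_image):
    """Overall SDR in dB between a true and an estimated spatial image.

    Each image is a list of channels; each channel is a list of samples.
    """
    signal = 0.0
    error = 0.0
    for true_ch, est_ch in zip(true_image, est_image):
        for s, e in zip(true_ch, est_ch):
            signal += s * s
            error += (e - s) ** 2
    if error == 0.0:
        return math.inf      # perfect estimate
    if signal == 0.0:
        return -math.inf     # silent reference with a non-zero estimate
    return 10.0 * math.log10(signal / error)
```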

Performance will be compared to that of ideal binary masking as a benchmark (i.e. binary masks providing maximum SDR), computed over an STFT or a cochleagram.
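The idea behind such a mask can be sketched as follows. If a time-frequency bin of the mixture is the sum of a speech coefficient S and a noise coefficient N, keeping the bin leaves a squared error of |N|² on the speech estimate, while zeroing it leaves |S|². Keeping the bin only when |S|² > |N|² therefore minimizes the squared error in the transform domain. The sketch assumes a single channel, coefficients stored as nested lists of complex numbers, and hypothetical function names; it is not the benchmark implementation.

```python
def ideal_binary_mask(speech_tf, noise_tf):
    """Return a 0/1 mask that keeps the bins where speech dominates noise.

    speech_tf, noise_tf: lists of frames, each a list of complex coefficients.
    """
    return [[1 if abs(s) ** 2 > abs(n) ** 2 else 0
             for s, n in zip(s_frame, n_frame)]
            for s_frame, n_frame in zip(speech_tf, noise_tf)]


def apply_mask(mask, mix_tf):
    """Multiply each mixture coefficient by the corresponding mask value."""
    return [[m * x for m, x in zip(m_frame, x_frame)]
            for m_frame, x_frame in zip(mask, mix_tf)]
```

For two-channel data, one option is to sum the energies over both channels before comparing, so that the same mask is applied to each channel.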

The above performance criteria and benchmarks are respectively implemented in
An example use is given in example_denoising.m.

Licensing issues

All files are distributed under the terms of the Creative Commons Attribution-Noncommercial-ShareAlike 3.0 license. The files to be submitted by participants will be made available on a website under the terms of the same license.

Public environment data were authored by Ngoc Q. K. Duong and Nobutaka Ito.

Potential Participants

  • Shoko Araki (araki.shoko (a) lab_ntt_co_jp)
  • Dorothea Kolossa (dorothea_kolossa (a) ruhr-uni-bochum_de)
  • Alexey Ozerov (alexey_ozerov (a) inria_fr)
  • Francesco Nesta (nesta (a) fbk_eu)
  • Armin Sehr (sehr (a) nt_e-technik_uni-erlangen_de)
  • Ngoc Duong
  • Jani Even
  • Hiroshi Saruwatari
  • Dang Hai Tran Vu
  • Hiroshi Sawada

Task proposed by the Audio Committee
