AISHELL8-FISHEYE: A Fisheye Audio-Visual Dataset for Target Speaker Extraction with Distortion-Aware Baselines

Paper, Authors, and Abstract
Authors
Peijun Yang
photon65537@whu.edu.cn
Wuhan University
Zhan Jin
zhan.jin@whu.edu.cn
Wuhan University
Ming Li
ming.li.cuhksz@gmail.com
The Chinese University of Hong Kong
Hui Bu
buhui@aishelldata.com
Beijing AISHELL Technology Co., Ltd.
Juan Liu
liujuan@whu.edu.cn
Wuhan University
Abstract:

Audio-visual target speaker extraction remains a challenging task, particularly under visually distorted conditions such as those produced by fisheye cameras. The task has significant practical applications, such as recovering higher-quality speech from footage recorded during police criminal investigations, journalists' undercover investigations, and by wearable first-person cameras. To address the lack of dedicated datasets for fisheye audio-visual speech processing, we introduce AISHELL8-FISHEYE, the first fisheye audio-visual dataset for target speaker extraction based on lip movement cues. The dataset consists of video captured by both fisheye and normal cameras, together with the corresponding audio recordings, and features speakers in wide-angle scenes with significant radial distortion. Alongside the dataset, we propose a novel baseline model that uses a distortion-aware branch, pre-trained on fisheye head pose estimation, to extract distortion-robust features; these are fused with lip motion features extracted by a ResNet-based encoder. The fused features are passed to a speech separation module, which extracts the target speaker's speech. Experimental results demonstrate that our distortion-aware approach significantly outperforms conventional methods on fisheye data.
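
The abstract does not state how the distortion-aware features and the lip motion features are fused. As a rough illustration only, the sketch below assumes frame-wise concatenation, a common fusion choice that may differ from the paper's. The function name `fuse_features`, the variable names, and the toy feature sizes are all hypothetical.

```python
from typing import List

Frame = List[float]  # one feature vector for one video frame


def fuse_features(distortion_feats: List[Frame], lip_feats: List[Frame]) -> List[Frame]:
    """Concatenate two per-frame feature streams frame by frame.

    Assumption: both streams are already aligned to the same video frames.
    Concatenation is an illustrative choice, not the paper's confirmed method.
    """
    if len(distortion_feats) != len(lip_feats):
        raise ValueError("both streams must cover the same number of frames")
    # List addition joins the two vectors of each frame into one longer vector.
    return [d + l for d, l in zip(distortion_feats, lip_feats)]


# Toy example: 3 frames, 2-dim distortion-aware features, 4-dim lip features.
distortion = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
lips = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
fused = fuse_features(distortion, lips)
print(len(fused), len(fused[0]))  # number of frames, fused feature size
```

In a pipeline of the kind the abstract describes, a fused sequence like this would serve as the visual conditioning input to the speech separation module.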

Audio & Video Samples

Audio recordings and the corresponding fisheye video. Each audio track was extracted from the video it was originally recorded with.

Sample 01
utterance 1
Sample 02
utterance 2
Sample 03
utterance 3

Extraction Demo

Audio-visual target speaker extraction: click to play the mixture and the extracted result.

Mixture vs. Extracted Output Demo 01
Audio-only playback.
Mixture vs. Extracted Output Demo 02
Audio-only playback.
Mixture vs. Extracted Output Demo 03
Audio-only playback.

Download

Click the icon to open the dataset page.

Hugging Face

License: CC BY-NC 4.0