Audio-visual target speaker extraction remains a challenging task, particularly under visually distorted conditions such as those captured by fisheye cameras. It has significant practical implications, enabling the extraction of higher-quality speech in scenarios such as police criminal investigations, journalists' undercover reporting, and wearable first-person cameras. To address the lack of dedicated datasets for fisheye audio-visual speech processing, we introduce AISHELL8-FISHEYE, the first fisheye audio-visual dataset for target speaker extraction based on lip movement cues. The dataset consists of video captured by both fisheye and conventional cameras, together with the corresponding audio recordings, featuring speakers in wide-angle scenes with significant radial distortion. Alongside the dataset, we propose a novel baseline model that leverages a distortion-aware branch, pre-trained on fisheye head pose estimation, to extract distortion-robust features; these are fused with lip motion features extracted by a ResNet-based encoder, and the fused features are passed to a speech separation module to extract the target speaker's speech. Experimental results demonstrate that our distortion-aware approach significantly outperforms conventional methods on fisheye data.
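To make the visual front-end concrete, below is a minimal PyTorch sketch of the fusion step described above: a distortion-aware branch (assumed to be pre-trained on fisheye head pose estimation) and a ResNet-style lip-motion encoder produce per-frame features that are concatenated and projected before being handed to the speech separation module. All class names, layer choices, and dimensions (e.g. DistortionAwareLipFusion, lip_dim, distort_dim) are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class DistortionAwareLipFusion(nn.Module):
    """Sketch of the visual front-end: distortion-aware branch + lip encoder,
    fused frame-by-frame. Dimensions and layer choices are assumptions."""

    def __init__(self, lip_dim=512, distort_dim=128, fused_dim=256):
        super().__init__()
        # Stand-in for the distortion-aware branch; in the paper this is
        # pre-trained on fisheye head pose estimation (assumed simplified here).
        self.distortion_branch = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(1, 5, 5), stride=(1, 2, 2), padding=(0, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),   # pool spatial dims, keep time
            nn.Flatten(start_dim=2),              # -> (B, 32, T)
            nn.Conv1d(32, distort_dim, 1),
        )
        # Stand-in for the ResNet-based lip-motion encoder over lip crops.
        self.lip_encoder = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),
            nn.Flatten(start_dim=2),              # -> (B, 64, T)
            nn.Conv1d(64, lip_dim, 1),
        )
        # Simple concatenation + 1x1 projection fusion (assumed fusion strategy).
        self.fuse = nn.Conv1d(lip_dim + distort_dim, fused_dim, 1)

    def forward(self, fisheye_frames, lip_frames):
        # fisheye_frames: (B, 3, T, H, W) full fisheye frames
        # lip_frames:     (B, 1, T, h, w) grayscale lip-region crops
        d = self.distortion_branch(fisheye_frames)   # (B, distort_dim, T)
        l = self.lip_encoder(lip_frames)             # (B, lip_dim, T)
        v = self.fuse(torch.cat([d, l], dim=1))      # (B, fused_dim, T)
        return v  # per-frame visual cue fed to the speech separation module


if __name__ == "__main__":
    model = DistortionAwareLipFusion()
    frames = torch.randn(2, 3, 25, 112, 112)  # 1 s of fisheye video at 25 fps
    lips = torch.randn(2, 1, 25, 88, 88)      # corresponding lip crops
    print(model(frames, lips).shape)          # torch.Size([2, 256, 25])
```

In practice the fused visual features would be upsampled or aligned to the audio frame rate before conditioning the separation module; that alignment step is omitted here.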
The fisheye video and its corresponding audio; the audio track originally embedded in the video has been extracted and is provided separately.
Audio-visual target speaker extraction: click to play the mixture and the extracted result.