Audio-visual target speaker extraction remains a challenging task, particularly under visually distorted conditions such as those captured by fisheye cameras. It has significant practical implications, enabling the extraction of higher-quality speech in scenarios such as police criminal investigations, journalists' undercover reporting, and wearable first-person cameras. To address the lack of dedicated datasets for fisheye audio-visual speech processing, we introduce AISHELL8-FISHEYE, the first fisheye audio-visual dataset for target speaker extraction based on lip movement cues. The dataset consists of video captured by both fisheye and conventional cameras, together with the corresponding audio recordings, featuring speakers in wide-angle scenes with significant radial distortion. Alongside the dataset, we propose a novel baseline model that leverages a distortion-aware branch, pre-trained on fisheye head pose estimation, to extract distortion-robust features; these are fused with lip motion features extracted by a ResNet-based encoder, and the fused features are passed to a speech separation module to extract the target speaker's speech. Experimental results demonstrate that our distortion-aware approach significantly outperforms conventional methods on fisheye data.
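To make the visual front-end concrete, below is a minimal PyTorch sketch of the fusion step described above: a distortion-aware branch (assumed to be pre-trained on fisheye head pose estimation) and a ResNet-style lip-motion encoder produce per-frame features that are concatenated and projected before being handed to the speech separation module. All class names, layer choices, and dimensions (e.g. DistortionAwareLipFusion, lip_dim, distort_dim) are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class DistortionAwareLipFusion(nn.Module):
    """Sketch of the visual front-end: distortion-aware branch + lip encoder,
    fused frame-by-frame. Dimensions and layer choices are assumptions."""

    def __init__(self, lip_dim=512, distort_dim=128, fused_dim=256):
        super().__init__()
        # Stand-in for the distortion-aware branch; in the paper this is
        # pre-trained on fisheye head pose estimation (assumed simplified here).
        self.distortion_branch = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(1, 5, 5), stride=(1, 2, 2), padding=(0, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),   # pool spatial dims, keep time
            nn.Flatten(start_dim=2),              # -> (B, 32, T)
            nn.Conv1d(32, distort_dim, 1),
        )
        # Stand-in for the ResNet-based lip-motion encoder over lip crops.
        self.lip_encoder = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),
            nn.Flatten(start_dim=2),              # -> (B, 64, T)
            nn.Conv1d(64, lip_dim, 1),
        )
        # Simple concatenation + 1x1 projection fusion (assumed fusion strategy).
        self.fuse = nn.Conv1d(lip_dim + distort_dim, fused_dim, 1)

    def forward(self, fisheye_frames, lip_frames):
        # fisheye_frames: (B, 3, T, H, W) full fisheye frames
        # lip_frames:     (B, 1, T, h, w) grayscale lip-region crops
        d = self.distortion_branch(fisheye_frames)   # (B, distort_dim, T)
        l = self.lip_encoder(lip_frames)             # (B, lip_dim, T)
        v = self.fuse(torch.cat([d, l], dim=1))      # (B, fused_dim, T)
        return v  # per-frame visual cue fed to the speech separation module


if __name__ == "__main__":
    model = DistortionAwareLipFusion()
    frames = torch.randn(2, 3, 25, 112, 112)  # 1 s of fisheye video at 25 fps
    lips = torch.randn(2, 1, 25, 88, 88)      # corresponding lip crops
    print(model(frames, lips).shape)          # torch.Size([2, 256, 25])
```

In practice the fused visual features would be upsampled or aligned to the audio frame rate before conditioning the separation module; that alignment step is omitted here.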
The fisheye video and its corresponding audio; the audio track originally embedded in the video has been extracted and is provided separately.
Audio-visual target speaker extraction: click to play the mixture and the extracted result.