N号馆 (N Hall), Issue 7 materials

2020-02-27

1.AI in Audio-Visual Processing MA LEI 2018/10/24
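2.Self introduction • 2011: graduated from China University of Mining and Technology with a bachelor's degree in mechanical engineering. • June 2014: graduated from the Robotics Institute at Beihang University (Beijing University of Aeronautics and Astronautics) with a master's degree; the main research topic was medical robotics. • April 2018: graduated from the Biomedical Engineering program at the University of Tokyo with a Ph.D. in engineering; doctoral research covered medical image processing and navigation systems for surgical robots. • Currently a principal researcher at Lenovo Research Japan, Yamato Lab (the birthplace of the ThinkPad). Current research directions are next-generation intelligent human-machine interaction and computer vision. • Interests: travel, reading the latest technology news, space-themed films and TV series, good food. MA LEI, Ph.D.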
3.Contents • Background – Era of AI – Temporal co-occurrence of audio and visual signal • AI based audio-visual processing – Audio-visual feature alignment – Speaker detection – Lip reading – Voice separation • Human machine interface – Voice interaction
5.Era of Artificial Intelligence • AI brings breakthroughs in many important fields – Object detection – Cancer detection – Speech recognition – Self-driving – Robotics – … • Big companies are all in on AI – Google, Huawei, Samsung… • AI in audio-visual processing – Image processing and audio processing have each been studied extensively – What happens when audio and images are combined?
6.Temporal co-occurrence of audio and visual signal • Two example videos • Co-occurrence of audio and visual signal (figure: motion and audio plotted against time)
7.Single-modality feature • Vision feature: three stacked 3D convolution layers • Audio feature: three stacked 1D convolution layers
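A minimal sketch of the two feature paths on this slide, written in plain Python without a deep-learning library: a 1D convolution slides over audio samples in time, and a 3D convolution slides over a video clip in time, height and width. The function names, kernel values and toy inputs are illustrative assumptions, not taken from the slides; a real network would learn its kernels and stack several such layers with nonlinearities in between.

```python
def conv1d(signal, kernel):
    """Valid 1D convolution (cross-correlation form, as used in neural layers)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def conv3d(clip, kernel):
    """Valid 3D convolution over a clip indexed as clip[t][y][x]."""
    kt, ky, kx = len(kernel), len(kernel[0]), len(kernel[0][0])
    n_t, n_y, n_x = len(clip), len(clip[0]), len(clip[0][0])
    out = []
    for t in range(n_t - kt + 1):
        plane = []
        for y in range(n_y - ky + 1):
            row = []
            for x in range(n_x - kx + 1):
                acc = 0.0
                for dt in range(kt):
                    for dy in range(ky):
                        for dx in range(kx):
                            acc += clip[t + dt][y + dy][x + dx] * kernel[dt][dy][dx]
                row.append(acc)
            plane.append(row)
        out.append(plane)
    return out


# Toy audio: a step in amplitude; the difference kernel responds at the step.
audio = [0.0, 0.0, 1.0, 1.0, 1.0]
audio_feature = conv1d(audio, [-1.0, 1.0])

# Toy video: 3 frames of 2x2 pixels whose brightness equals the frame index.
clip = [[[float(t)] * 2 for _ in range(2)] for t in range(3)]
# Temporal-difference kernel of shape 2x1x1: later frame minus earlier frame.
temporal_kernel = [[[-1.0]], [[1.0]]]
video_feature = conv3d(clip, temporal_kernel)
```

The 3D kernel spans time as well as space, which is why it can respond to motion rather than only to appearance in a single frame.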
9.Self-supervision of audio-visual training
10.Audio-visual feature alignment • Objective – Align the audio and visual signals – Latency between the audio and visual signals is introduced by capture, processing and transmission • Solution – Align the audio signal to the visual signal using an audio-visual network.
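The solution on this slide is a learned audio-visual network. As a much simpler stand-in, and not the method from the slides, the sketch below estimates the latency by cross-correlating a per-frame visual motion signal with a per-frame audio energy signal and picking the lag with the highest score. The function names and the toy signals are hypothetical.

```python
def estimate_audio_delay(motion, audio, max_lag):
    """Return the lag (in frames) by which the audio trails the motion.

    A positive result means the audio arrives late and should be shifted
    earlier by that many frames to line up with the video.
    """
    m_mean = sum(motion) / len(motion)
    a_mean = sum(audio) / len(audio)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for t, m in enumerate(motion):
            u = t + lag
            if 0 <= u < len(audio):
                # Mean-removed product: large when both signals peak together.
                score += (m - m_mean) * (audio[u] - a_mean)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def shift_audio(audio, lag):
    """Shift audio earlier by `lag` frames, padding the gap with zeros."""
    n = len(audio)
    return [audio[t + lag] if 0 <= t + lag < n else 0.0 for t in range(n)]


# Toy signals: the audio events trail the motion events by two frames.
motion = [0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0]
audio = [0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0]
lag = estimate_audio_delay(motion, audio, max_lag=4)
aligned = shift_audio(audio, lag)
```

Hand-made signals like these only work when motion and sound energy rise together; a learned network is meant to handle cases where that simple relation does not hold.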
11.Audio-visual feature alignment • Align audio frame with video frame • Relate sound with the motion in video (figure: audio-visual network for aligning the audio-visual feature) Owens, Andrew, and Alexei A. Efros. "Audio-visual scene analysis with self-supervised multisensory features." arXiv preprint arXiv:1804.03641 (2018).
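The cited Owens and Efros paper trains its network without manual labels by asking whether a clip's audio is temporally aligned with its frames, with misaligned examples made by shifting the audio. The sketch below shows one way such training pairs could be generated. The function name, the shift range, and the use of rotation on a fixed-length toy clip are assumptions for illustration, not the paper's exact procedure.

```python
import random


def make_alignment_pairs(frames, audio, min_shift, max_shift, rng):
    """Build one aligned (label 1) and one misaligned (label 0) example.

    `frames` and `audio` are equal-length lists, one item per time step.
    The misaligned example reuses the same frames with the audio rotated
    by a random shift, so the label comes from the data itself.
    """
    if len(frames) != len(audio):
        raise ValueError("frames and audio must cover the same time steps")
    shift = rng.randint(min_shift, max_shift)
    if rng.random() < 0.5:
        shift = -shift
    k = shift % len(audio)
    shifted = audio[k:] + audio[:k]
    positive = (frames, audio, 1)
    negative = (frames, shifted, 0)
    return positive, negative, shift


rng = random.Random(0)  # seeded so the toy run is repeatable
frames = ["f0", "f1", "f2", "f3", "f4", "f5"]
audio = ["a0", "a1", "a2", "a3", "a4", "a5"]
positive, negative, shift = make_alignment_pairs(frames, audio, 2, 3, rng)
```

A classifier trained to tell the two labels apart has to relate sound to visible motion, which is the feature the later slides reuse.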
12.Audio-visual feature alignment • Demo video
13.Speaker detection • Speaker detection is a specific application of audio-visual feature alignment. (figures: pipeline to generate the audio-visual data; audio-visual network for speaker) Chung, Joon Son, and Andrew Zisserman. "Out of time: automated lip sync in the wild." In Asian Conference on Computer Vision, pp. 251-263. Springer, Cham, 2016.
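One way to read speaker detection as an application of alignment: embed the audio and the mouth region of each visible face at every time step, then pick the face whose embeddings stay closest to the audio. In the sketch below the embeddings are hard-coded toy vectors standing in for the output of a trained audio-visual network; the function names and face identifiers are hypothetical.

```python
import math


def mean_distance(audio_emb, face_emb):
    """Average Euclidean distance between per-step audio and face embeddings."""
    dists = [
        math.sqrt(sum((a - f) ** 2 for a, f in zip(a_vec, f_vec)))
        for a_vec, f_vec in zip(audio_emb, face_emb)
    ]
    return sum(dists) / len(dists)


def detect_speaker(audio_emb, faces):
    """Return (face_id, distance) for the face best synchronized with the audio."""
    scores = {fid: mean_distance(audio_emb, emb) for fid, emb in faces.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Toy embeddings: three time steps, two dimensions each.
audio_emb = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
faces = {
    "left": [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)],   # mouth not moving with the sound
    "right": [(1.0, 0.0), (0.0, 1.0), (1.0, 0.0)],  # mostly tracks the audio
}
speaker, distance = detect_speaker(audio_emb, faces)
```

In practice a threshold on the winning distance would also be needed, so that no face is selected when the speaker is off screen.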
14.Speaker detection
15.Lip reading • 2001: A Space Odyssey
16.Lip reading • Objective – Transform lip motion during speech into text • Solution – Use voice dictation results as labeled data to train the lip reading network • Application – Understand the user's intent in noisy environments – Spying??
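The solution on this slide uses speech dictation output as training labels. A minimal sketch of that idea follows: each lip clip is paired with the text a speech recognizer produced for the same clip, and only confident transcripts are kept. The `Dictation` record, its confidence field, and the 0.9 threshold are hypothetical; the slides do not say how unreliable transcripts are handled.

```python
from dataclasses import dataclass


@dataclass
class Dictation:
    clip_id: str
    text: str
    confidence: float  # assumed recognizer score in [0, 1]


def build_lip_reading_labels(lip_clips, dictations, min_confidence=0.9):
    """Pair lip clips with dictated text, keeping only confident transcripts."""
    by_id = {d.clip_id: d for d in dictations}
    dataset = []
    for clip_id, frames in lip_clips.items():
        d = by_id.get(clip_id)
        if d is not None and d.confidence >= min_confidence and d.text.strip():
            # Normalize the label so the same phrase always maps to one target.
            dataset.append((frames, d.text.strip().lower()))
    return dataset


lip_clips = {
    "c1": ["frame0", "frame1"],
    "c2": ["frame0"],
    "c3": ["frame0"],  # no dictation result available for this clip
}
dictations = [
    Dictation("c1", " Open the door ", 0.97),
    Dictation("c2", "turn left", 0.41),  # too uncertain to use as a label
]
dataset = build_lip_reading_labels(lip_clips, dictations)
```

Filtering by confidence trades dataset size for label quality, which matters because recognizer errors would otherwise be learned as if they were correct text.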
17.Lip reading • LipNet: