N号馆 (Hall N), Session 7 materials

2020-02-27

  • 1.AI in Audio-Visual Processing, MA LEI, 2018/10/24
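  • 2.Self-introduction
      • 2011: graduated from China University of Mining and Technology with a bachelor's degree in mechanical engineering
      • June 2014: graduated from the Robotics Institute of Beihang University with a master's degree; main research topic was medical robots
      • April 2018: graduated from the biomedical engineering program at the University of Tokyo with a Ph.D. in engineering; doctoral research covered medical image processing and surgical robot navigation systems
      • Currently a principal researcher at Lenovo's Japan research institute, Yamato Lab (the birthplace of the ThinkPad); current research areas are next-generation intelligent human-machine interaction and computer vision
      • Hobbies: travel, reading the latest technology news, space-themed films and TV series, good food
      • MA LEI, Ph.D.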
  • 3.Contents
      • Background
          – Era of AI
          – Temporal co-occurrence of audio and visual signals
      • AI-based audio-visual processing
          – Audio-visual feature alignment
          – Speaker detection
          – Lip reading
          – Voice separation
      • Human-machine interface
          – Voice interaction
  • 5.Era of Artificial Intelligence
      • AI has brought breakthroughs in many important fields
          – Object detection
          – Cancer detection
          – Speech recognition
          – Self-driving
          – Robotics
          – …
      • Big companies are all in on AI
          – Google, Huawei, Samsung…
      • AI in audio-visual processing
          – Image processing and audio processing have each been studied extensively
          – What happens when audio and images are combined?
  • 6.Temporal co-occurrence of audio and visual signals
      • Two example videos
      • Co-occurrence of audio and visual signals
      • [Figure: motion and audio plotted against time]
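The co-occurrence of motion and sound on this slide can be made concrete with two hand-crafted signals. The sketch below is an illustration only, not the method used in the talk. It assumes grayscale frames stored as nested lists, uses frame differencing as a stand-in for "motion" and per-interval RMS as a stand-in for "audio", and all function names are hypothetical.

```python
import math


def motion_energy(frames):
    """Mean absolute pixel change between consecutive grayscale frames."""
    energies = []
    for prev, curr in zip(frames, frames[1:]):
        diffs = [
            abs(a - b)
            for row_prev, row_curr in zip(prev, curr)
            for a, b in zip(row_prev, row_curr)
        ]
        energies.append(sum(diffs) / len(diffs))
    return energies


def audio_energy(samples, samples_per_frame):
    """RMS loudness of each audio chunk that spans one video frame interval."""
    energies = []
    last_start = len(samples) - samples_per_frame
    for start in range(0, last_start + 1, samples_per_frame):
        chunk = samples[start:start + samples_per_frame]
        energies.append(math.sqrt(sum(s * s for s in chunk) / len(chunk)))
    return energies


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

A correlation close to 1 between the two energy curves indicates that loud moments and moving moments tend to coincide, which is the cue the later slides exploit.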
  • 7.Single-modality features
      • Vision feature: three stacked 3D convolutions
      • Audio feature: three stacked 1D convolutions
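To show what the two stacks on this slide compute, here is a minimal single-channel sketch in plain Python. It is an assumption-laden illustration: the averaging kernels stand in for learned weights, nonlinearities and multiple channels are omitted, and the names `conv1d`, `conv3d`, `audio_stream` and `vision_stream` are hypothetical.

```python
def conv1d(signal, kernel, stride=1):
    """Valid 1D convolution (cross-correlation form, as in deep-learning libraries)."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def conv3d(video, kernel):
    """Valid 3D convolution over a [time][height][width] nested list."""
    kt, kh, kw = len(kernel), len(kernel[0]), len(kernel[0][0])
    t_out = len(video) - kt + 1
    h_out = len(video[0]) - kh + 1
    w_out = len(video[0][0]) - kw + 1
    return [
        [
            [
                sum(
                    video[t + a][y + b][x + c] * kernel[a][b][c]
                    for a in range(kt)
                    for b in range(kh)
                    for c in range(kw)
                )
                for x in range(w_out)
            ]
            for y in range(h_out)
        ]
        for t in range(t_out)
    ]


def audio_stream(signal):
    """Three stacked 1D convolutions; the fixed kernel is a placeholder for learned weights."""
    for _ in range(3):
        signal = conv1d(signal, [0.5, 0.5], stride=2)
    return signal


def vision_stream(video):
    """Three stacked 3D convolutions; the fixed kernel is a placeholder for learned weights."""
    kernel = [[[0.125, 0.125], [0.125, 0.125]] for _ in range(2)]
    for _ in range(3):
        video = conv3d(video, kernel)
    return video
```

The 3D kernel slides over time as well as space, so the vision stream can respond to motion, while the 1D kernel slides over time only, which matches the shape of a raw audio waveform.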
  • 9.Self-supervision of audio-visual training
  • 10.Audio-visual feature alignment
      • Objective
          – Align the audio and visual signals
          – Capture, processing and transmission introduce latency between the audio and visual signals
      • Solution
          – Align the audio signal to the visual signal using an audio-visual network
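The alignment objective can be illustrated without a neural network. The sketch below swaps in classical cross-correlation for the audio-visual network named on the slide: it assumes that one activity value per video frame is already available for each modality (for example motion energy and audio loudness), searches for the lag at which the two curves agree best, and then shifts the audio by that lag. The function names are hypothetical.

```python
def centered(xs):
    """Subtract the mean so that silence or stillness does not dominate the score."""
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]


def best_lag(video_env, audio_env, max_lag):
    """Return the lag (in frames) at which audio_env[i + lag] best matches video_env[i].

    A positive result means the audio arrives late relative to the video.
    """
    v, a = centered(video_env), centered(audio_env)
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        products = [
            v[i] * a[i + lag] for i in range(len(v)) if 0 <= i + lag < len(a)
        ]
        if not products:
            continue
        score = sum(products) / len(products)
        if score > best_score:
            best, best_score = lag, score
    return best


def shift_audio(audio_env, lag, fill=0.0):
    """Advance the audio by `lag` frames so that it lines up with the video."""
    n = len(audio_env)
    return [audio_env[i + lag] if 0 <= i + lag < n else fill for i in range(n)]
```

A learned audio-visual network plays the role of the hand-written score here: instead of multiplying two energy curves, it judges from the raw frames and waveform whether a candidate offset looks synchronized.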
  • 11.Audio-visual feature alignment
      • Align audio frames with video frames
      • Relate sound to the motion in the video
      • [Figure: audio-visual network for aligning the audio-visual features]
      • Owens, Andrew, and Alexei A. Efros. "Audio-visual scene analysis with self-supervised multisensory features." arXiv preprint arXiv:1804.03641 (2018).
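Training such an alignment network needs no manual labels, because the video itself says which audio belongs to which frames. The sketch below shows one way to sample training examples under that idea: an aligned pair takes audio and video from the same position, and a misaligned pair takes the audio from a shifted position. The function name, the 50/50 sampling ratio and the `min_shift` parameter are assumptions for illustration, not values taken from the cited paper.

```python
import random


def sample_sync_pair(num_frames, clip_len, min_shift, rng):
    """Return (video_start, audio_start, label) for one training example.

    label is 1 when audio and video come from the same position (aligned)
    and 0 when the audio window is shifted by at least `min_shift` frames.
    """
    last_start = num_frames - clip_len
    if last_start < 2 * min_shift:
        raise ValueError("video too short to build a misaligned example")
    video_start = rng.randint(0, last_start)
    if rng.random() < 0.5:
        return video_start, video_start, 1
    # Every start position far enough from the video window is a valid negative.
    candidates = [
        s for s in range(last_start + 1) if abs(s - video_start) >= min_shift
    ]
    return video_start, rng.choice(candidates), 0
```

A binary classifier trained on such pairs has to notice whether the sound matches the visible motion, which is what pushes it to learn joint audio-visual features.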
  • 12.Audio-visual feature alignment
      • Demo video
  • 13.Speaker detection
      • Speaker detection is a specific application of audio-visual feature alignment.
      • [Figure: pipeline to generate the audio-visual data]
      • [Figure: audio-visual network for speaker detection]
      • Chung, Joon Son, and Andrew Zisserman. "Out of time: automated lip sync in the wild." In Asian Conference on Computer Vision, pp. 251-263. Springer, Cham, 2016.
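Once a network produces comparable audio and face embeddings, picking the speaker reduces to a nearest-match search. The sketch below assumes that such embeddings already exist, one per time window, and that a smaller distance means better lip sync. The function names, the dictionary layout and the `max_distance` threshold are hypothetical.

```python
import math


def l2_distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def detect_speaker(audio_embs, face_tracks, max_distance):
    """Return the id of the face whose lip motion best matches the audio.

    audio_embs:  list of audio embeddings, one per time window
    face_tracks: dict mapping face id -> list of visual embeddings per window
    Returns None when no face is closer than `max_distance` on average,
    for example when the speaker is off screen.
    """
    best_face, best_dist = None, max_distance
    for face_id, face_embs in face_tracks.items():
        dists = [l2_distance(a, f) for a, f in zip(audio_embs, face_embs)]
        if not dists:
            continue
        mean_dist = sum(dists) / len(dists)
        if mean_dist < best_dist:
            best_face, best_dist = face_id, mean_dist
    return best_face
```

Averaging over several windows makes the decision more stable than trusting a single window, since a silent face can match the audio by chance for a moment.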
  • 14.Speaker detection
  • 15.Lip reading
      • 2001: A Space Odyssey
  • 16.Lip reading
      • Objective
          – Transform lip motion during speech into text
      • Solution
          – Use voice dictation results as labeled data to train a lip-reading network
      • Application
          – Understand the user's intent in noisy environments
          – Spying??
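The "dictation as labels" idea can be sketched as a small data-preparation step. The code below assumes that a speech recognizer has already returned timed text segments with a confidence score. The `Dictation` record, its field names and the `min_confidence` threshold are hypothetical and do not correspond to any particular recognizer's API.

```python
import math
from dataclasses import dataclass


@dataclass
class Dictation:
    """One recognized utterance (hypothetical record layout)."""
    start_s: float      # start time in seconds
    end_s: float        # end time in seconds
    text: str           # recognized text, used as the training label
    confidence: float   # recognizer confidence in [0, 1]


def build_lip_dataset(dictations, fps, min_confidence=0.9):
    """Turn dictation results into (first_frame, last_frame, text) samples.

    Low-confidence segments are dropped so that recognition errors do not
    become wrong labels for the lip-reading network.
    """
    samples = []
    for d in dictations:
        if d.confidence < min_confidence:
            continue
        first_frame = int(d.start_s * fps)        # round down to include the onset
        last_frame = math.ceil(d.end_s * fps)     # round up to include the ending
        samples.append((first_frame, last_frame, d.text))
    return samples
```

Each sample then points at the video frames whose mouth region would be cropped and paired with the text. Collecting the labels while the audio is clean lets the lip-reading network take over later when the environment is noisy.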
  • 17.Lip reading: LipNet