Multimodal Transformer Models for Joint Text-Visual Processing

Speaker: Pradeep Natarajan (Amazon)

Date and Time: March 29 at 3:30pm CT

Place: 2405 SC or Zoom


I will present key results from recent work in Alexa AI on multimodal processing, in particular joint text-visual modeling. This includes our recent work on incorporating cues from signals such as object detection and dialog history to improve retrieval quality. Finally, I will present our latest work on developing a single encoder model that can process text-only, visual-only, or text+visual inputs, in contrast to most existing architectures, which require modality-specific encoders before the modalities are combined.


Pradeep Natarajan is a Senior Principal Applied Scientist at Amazon’s Alexa AI division. He has over 20 years of experience developing and deploying large-scale machine learning systems in diverse modalities, including computer vision, language understanding, speech recognition, and financial time series analysis. His work has been published in leading venues including CVPR, ECCV, ACL, EMNLP, ICASSP, and Interspeech. He has served as a Principal Investigator for large DARPA and IARPA programs and developed industry-leading technology for analyzing unstructured visual and audio data. He also served as the Head of Machine Learning at Citadel Investment Group from 2014 to 2018, deploying successful trading strategies across multiple financial instruments using non-linear models. He joined the Alexa AI team in 2018 and has been leading efforts to develop computer vision technology that enhances Alexa's voice-based interactions and to leverage large language models in multiple applications across Alexa.