The Challenge of Adversarial Text and How to Extract from It: Extracting Phone Numbers from Human Trafficking Ads

Speaker: Nate Chambers (US Naval Academy)

Date and Time: 10am CT, July 24

Place: TBD


Adversarial text is written with obfuscated words and characters for the purpose of fooling machine-learned extractors. Illicit domains like human trafficking often employ such text obfuscation techniques. This talk will address the challenge of extracting phone numbers from such noisy text, like "3wõn7_28tree(øne)_573", but more broadly it will discuss the NLP challenge of dealing with Unicode in any domain. With very little available training data, how can today's neural models learn to generalize to the diversity of noise available to an adversarial writer? This talk will present a couple of solutions to this challenge, focusing on character-based neural models that use typical NLP architectures like LSTMs and CRFs, but that also draw inspiration from the vision community to perform image recognition of the characters with CNNs. I'll first present results from our paper that received the Best Paper Award at the Workshop on Noisy User-generated Text, exploring extraction from short text snippets, and then show some simple steps to expand it to full document extraction.
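
To make the obfuscation concrete, here is a minimal rule-based sketch in Python that decodes the example string above. It is not the speaker's method (the talk is about learned, character-based neural models); the lookup tables and the function names strip_accents and naive_digits are hypothetical and were chosen only to cover this one example. It shows why hand-written rules are brittle: each new character substitution or misspelling needs another table entry.

```python
import re
import unicodedata

# Hypothetical lookup tables, written only to cover the example string.
# "ø" has no Unicode decomposition, so it needs an explicit mapping.
CONFUSABLES = str.maketrans({"ø": "o", "Ø": "O"})
NUMBER_WORDS = {"won": "1", "one": "1", "tree": "3", "three": "3"}

# Try longer words first so a longer spelling is never shadowed by a shorter one.
WORD_PATTERN = re.compile("|".join(sorted(NUMBER_WORDS, key=len, reverse=True)))


def strip_accents(text):
    """Decompose characters (NFKD) and drop combining marks, e.g. 'õ' -> 'o'."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_digits(text):
    """Return the ASCII digits recovered from an obfuscated string."""
    text = strip_accents(text.translate(CONFUSABLES)).lower()
    text = WORD_PATTERN.sub(lambda m: NUMBER_WORDS[m.group(0)], text)
    return "".join(re.findall(r"[0-9]", text))


print(naive_digits("3wõn7_28tree(øne)_573"))  # prints 3172831573
```

An adversarial writer can defeat this baseline with a single unseen substitution, which is the generalization problem the abstract raises.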


Nate Chambers is an Associate Professor in his 10th year at the US Naval Academy in Annapolis, MD. The Naval Academy is a 4-year undergraduate college where he teaches core computer science courses as well as an NLP elective. He received his Ph.D. from Stanford in 2011 with a dissertation on learning event scripts and schemas from large amounts of text. His main research has alternated between script learning and temporal reasoning, including the development of the temporal ordering system CAEVO and annotation projects like TimeBank Dense and CaTeRS. He is excited that DARPA recently started a new program (KAIROS) focused on learning this type of knowledge. While he continues to focus on events and time, he also enjoys working on side projects with undergraduates, and this talk is one of those successes.