Detecting Identity of Authors from Lexical Elements and Cognitive Topics (DIALECT)
Navy SBIR FY2012.1


Sol No.: Navy SBIR FY2012.1
Topic No.: N121-080
Topic Title: Detecting Identity of Authors from Lexical Elements and Cognitive Topics (DIALECT)
Proposal No.: N121-080-0319
Firm: Aptima, Inc.
12 Gill Street
Suite 1400
Woburn, Massachusetts 01801
Contact: Charlotte Shabarekh
Phone: (781) 496-2465
Web Site: aptima.com
Abstract: Exploiting the anonymous nature of the internet, terrorists are able to cloak their identities when authoring blogs, posting to chatrooms and sending tweets by using pseudonyms and creating multiple usernames. This makes it difficult to ascertain the true author of a web post, and to determine whether posts made under different profiles or on different websites can be attributed to the same author. Detecting Identity of Authors from Lexical Elements and Cognitive Topics (DIALECT) addresses the challenge of authorship attribution facing intelligence analysts working with Open-Source Intelligence (OSINT). Using an inherently language-independent approach, DIALECT automatically learns a profile of linguistic, idiosyncratic and content-based features that form a unique fingerprint for an author. Additionally, DIALECT uses social science theory to guide the core machine learning algorithm's selection of dialectal and semantic features for distinguishing which cultural, tribal, religious or political groups the author belongs to. By associating authors with their socio-cultural groups, DIALECT provides insight into the authors' cognitive processes, such as their political leanings and ideological affiliations. By modeling feature sets at both the individual author and group levels, DIALECT is able to attribute documents to groups, even when it is unable to determine the specific author.
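The abstract does not specify DIALECT's actual features or learning algorithm. As an illustration only, the general idea of a language-independent author fingerprint can be sketched with character n-gram counts compared by cosine similarity, a common baseline for authorship attribution. All function names, the trigram size and the sample texts below are assumptions made for this sketch, not part of DIALECT.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams; no tokenizer or language model is needed."""
    text = " ".join(text.lower().split())  # lowercase and collapse whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_profiles(labeled_docs, n=3):
    """Sum n-gram counts per label into one profile each.

    A label may be an author or a group, so the same code covers both levels.
    """
    profiles = {}
    for label, text in labeled_docs:
        profiles.setdefault(label, Counter()).update(char_ngrams(text, n))
    return profiles


def attribute(text, profiles, n=3):
    """Return (best_label, score) for the profile most similar to the text."""
    query = char_ngrams(text, n)
    scores = {label: cosine(query, profile) for label, profile in profiles.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical sample data, invented for this sketch.
training = [
    ("author_a", "gonna post this l8r, gonna b gr8 m8"),
    ("author_b", "I shall post this later; it shall be splendid indeed."),
]
profiles = build_profiles(training)
label, score = attribute("gonna b l8r m8", profiles)
```

Passing group labels instead of author labels to `build_profiles` gives group-level profiles, which mirrors the abstract's point that a document can be attributed to a group even when the individual author cannot be determined.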
Benefits: DIALECT will both enhance analysts' productivity and increase their situational awareness by providing them with automated tools to cluster anonymous or deceptively signed documents by inferred author, by the author's inferred demographic group or by inferred subject (topic). When these clusters are viewed together, they provide context for one another, and a comprehensive picture emerges of what is being discussed and by whom. When the document metadata is considered, DIALECT is able to determine when the discussions took place, giving the analyst a timeline for tracking which topics were being discussed and by whom. Most powerfully, DIALECT will provide analysts with profiles (feature sets) of authors that include their linguistic features, idiosyncratic writing style, topics of discussion, socio-cultural/political affiliation and associated metadata (e.g., login information, websites to which they post). These profiles provide insight into the authors' cognitive processes, at both the individual and group levels, giving context to analysts who are assessing the threats posed by these individuals or groups. DIALECT has clear benefits not just to the intelligence community, but also to criminal justice agencies that use authorship attribution for digital forensics and to commercial industry for organizing search engine results.
