Open Campus initiative brings natural language processing to cyber research
September 30, 2014
- ARL's Open Campus initiative provided the ideal vehicle for CISD researchers to bring in a visiting doctoral student, specializing in machine learning for privacy and security technologies, to advance the potential of automated linguistic-type analysis, or stylometry, for authorship attribution of source code.
Do authors of source code have a unique identifying style—a "coding fingerprint"—that can be learned from a few samples of their work and used to identify them?
This question has implications for protecting intellectual property as well as for identifying malware authors and tracking the evolution of malware. It spurred a cross-cutting summer project that used Open Campus to bring disciplines, institutions, and experts together.
Aware that methods from text analytics can strengthen cyber analytics, U.S. Army Research Laboratory researchers Dr. Clare Voss, a computational linguist, and Dr. Richard Harang, a network-science researcher, met to define a seed problem with Drexel University Professor Rachel Greenstadt, a leader in electronic-privacy and information-security research who has pioneered methods for authorship attribution in both prose and source code.
The laboratory's Open Campus initiative provided the ideal vehicle for Melissa Holland and Bill Glodek, both from ARL's Computational and Information Sciences Directorate, to bring in Professor Greenstadt's doctoral student Aylin Caliskan-Islam.
Caliskan-Islam is currently specializing in machine learning for privacy and security technologies, supported by a Drexel College of Engineering Fellowship and Brocade and Xerox scholarships in honor of Grace Hopper for Women in Computing.
Enriching the collaboration were ARL computational linguist Jeffrey Micher and College Qualified Leaders student Andrew Liu, a computer-science undergraduate student at the University of Maryland, College Park, currently in the Advanced Cybersecurity Experience for Students, or ACES, Honors Program.
The summer project sought to advance the potential of automated linguistic-type analysis, or stylometry, for authorship attribution of source code.
The project drew on Google Code Jam competition results, openly available computer programs written by 11,000 participants in Google's annual international competitions since 2003, to investigate new ways of representing coding style through syntactic, lexical, and layout features inspired by natural language processing (NLP).
Advanced machine-learning techniques were applied to these features to develop author-recognition algorithms, ultimately achieving a recognition accuracy of 91 percent using just five source code samples per author.
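To give a rough sense of what syntactic, lexical, and layout features might look like in practice, here is a minimal sketch. It is not the project's feature set, which this article does not describe. It assumes Python source code as input (Code Jam entries come in many languages) and uses Python's standard `ast` module for the syntactic part. The function name `extract_style_features` and every feature name are hypothetical.

```python
import ast
import re
from collections import Counter


def extract_style_features(source: str) -> dict[str, float]:
    """Map one Python source string to a small dictionary of style features.

    Illustrative only: the feature names and choices here are assumptions,
    not the features used in the ARL/Drexel project.
    """
    lines = source.splitlines() or [""]
    features: dict[str, float] = {}

    # Layout features: how the code is arranged on the page.
    features["layout:blank_line_ratio"] = (
        sum(1 for ln in lines if not ln.strip()) / len(lines)
    )
    features["layout:avg_line_length"] = sum(len(ln) for ln in lines) / len(lines)
    features["layout:tab_indent_ratio"] = (
        sum(1 for ln in lines if ln.startswith("\t")) / len(lines)
    )

    # Lexical features: word-level habits. The regex picks up every
    # word-like token, including keywords and words inside comments.
    words = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source)
    if words:
        features["lexical:avg_word_length"] = sum(map(len, words)) / len(words)
        features["lexical:underscore_word_ratio"] = (
            sum(1 for w in words if "_" in w) / len(words)
        )
    else:
        features["lexical:avg_word_length"] = 0.0
        features["lexical:underscore_word_ratio"] = 0.0

    # Syntactic features: relative frequency of each syntax-tree node type.
    # ast.parse raises SyntaxError if the source is not valid Python.
    nodes = [type(node).__name__ for node in ast.walk(ast.parse(source))]
    for name, count in Counter(nodes).items():
        features[f"syntactic:{name}"] = count / len(nodes)

    return features
```

Feature dictionaries like these, computed for a handful of samples per author, would then serve as the input to whatever classifier is trained to recognize authors.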
This is a significant improvement over previous stylometry methods, which have achieved author-recognition accuracies in the 65–77 percent range. Of the features with information gain in the best-performing Open Campus experiments, 57 percent were syntactic.
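Information gain measures how much knowing a feature's value reduces uncertainty about who the author is. The sketch below uses the standard entropy-based definition, IG = H(author) - H(author | feature), for a feature that takes discrete values. It is an illustration under that assumption: the article does not say how the project computed information gain, and a continuous feature would first need to be binned. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy, in bits, of a sequence of labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )


def information_gain(feature_values, authors):
    """Reduction in author entropy from knowing a discrete feature's value."""
    # Group the author labels by the feature value seen in each sample.
    groups = defaultdict(list)
    for value, author in zip(feature_values, authors):
        groups[value].append(author)

    # Weighted average entropy of the author labels within each group.
    conditional = sum(
        (len(group) / len(authors)) * entropy(group) for group in groups.values()
    )
    return entropy(authors) - conditional
```

For example, with four samples from two authors, a feature that splits the samples exactly by author yields a gain of 1 bit, while a feature whose values are spread evenly across both authors yields a gain of 0.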
"The ease and speed with which we could get an Open Campus participant settled and working here in ARL really helped facilitate collaboration, and allowed us to go from a project idea to publishable results in just a few months," said Harang.
Noting how Open Campus worked to pave the way for new insights and more effective methods in source-code stylometry, Caliskan-Islam commented: "When I arrived at ARL in June, I was immediately introduced to people in the NLP and cyber-security labs and soon received an extended visitor pass and access to research facilities. I was able to start working efficiently from the first day. The research environment here is productive, inspiring, and open to collaboration."
"I learned a lot of new, useful things from the experts in different fields and got valuable insights into research that helps the Army," added Caliskan-Islam.