Real-Time News Analysis (RTNA) Scraper Assessment

Report No. ARL-TN-295
Authors: Christine E. Slocum; Ann E. M. Brodeen
Date/Pages: September 2007; 22 pages
Abstract: An assessment was conducted to evaluate the performance of the Real-Time News Analysis Scraper application, which is used to extract article body text from online news sources. The application's performance was evaluated by determining the integrity of the scraped text, a metric found by calculating the output's similarity to text manually selected from the same articles by a human control group. Levenshtein's edit-distance algorithm was implemented to calculate a normalized similarity score for each pair of scraped and manually selected article texts; the normalized scores were direct indicators of integrity. The Scraper was found to perform unacceptably overall because the majority of scraped articles experienced integrity loss exceeding the established threshold. Results of the assessment were insufficiently detailed to give causal explanations for the Scraper's observed performance. Recommendations were not made for the application's improvement; however, a protocol was outlined in detail for a follow-on assessment.
Distribution: Approved for public release
Last Update / Reviewed: September 1, 2007
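
Illustrative note: the abstract describes scoring each scraped/manual text pair with a normalized Levenshtein similarity. The sketch below shows one common way to compute such a score. It is not taken from the report: the function names are hypothetical, and the normalization (one minus the edit distance divided by the length of the longer text) is an assumption, since the abstract does not state which normalization or threshold was used.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # iterate rows over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(scraped: str, manual: str) -> float:
    """Score in [0, 1]; 1.0 means identical texts.

    Assumed normalization (not confirmed by the report):
    1 - distance / length of the longer text.
    """
    longest = max(len(scraped), len(manual))
    if longest == 0:
        return 1.0  # two empty texts are treated as identical
    return 1.0 - levenshtein(scraped, manual) / longest
```

Under this assumed normalization, a scraped text identical to the manual selection scores 1.0, and the score falls toward 0.0 as more character edits are needed to turn one text into the other.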