The Effect of Training Data Set Composition on the Performance of a Neural Image Caption Generator

Report No. ARL-TR-8124
Authors: Abigail Wilson and Adrienne Raglin
Date/Pages: September 2017; 20 pages
Abstract: This research seeks to determine how many images of a particular object a training data set must contain to reach caption quality saturation in neural image caption generators. Understanding the relationship between caption quality and the size and composition of training data sets could improve the efficiency of model training and lead to the development of data sets optimized for different tasks. We hypothesize that increasing a neural network's exposure to an object will improve its performance up to a point, after which caption quality will saturate, and that this saturation point may vary with the object's visual homogeneity. We trained several image captioning models with an existing codebase, Neuraltalk2, on subsets of the Microsoft Common Objects in Context data set, each containing a precise number of images of selected common object categories (e.g., "cat" and "pizza"). Performance at different levels of exposure to the selected objects was compared using two automated scoring metrics: the Metric for Evaluation of Translation with Explicit Ordering (METEOR) and Consensus-Based Image Description Evaluation (CIDEr). The data indicate that increasing the number of images of a particular object in the training data set improved performance up to 1,500 images, but not beyond that.
Distribution: Approved for public release

Last Update / Reviewed: September 1, 2017
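
The abstract describes training on subsets that contain a precise number of images of a chosen object category. The following is a minimal sketch of how such a subset might be selected, not the procedure used in the report. It assumes the annotations are available as a dictionary in the COCO instances layout (keys "images", "annotations", and "categories"); the function name build_subset, its parameters, the choice to keep every image that lacks the object, and the toy data are all hypothetical and serve only as illustration.

```python
import random


def build_subset(coco, category_name, n_target, seed=0):
    """Return sorted image ids for a hypothetical training subset.

    The subset holds exactly `n_target` images that contain the named
    category, plus every image that does not contain it. Keeping all
    of the other images is an assumption made for this sketch.
    """
    cat_ids = {c["id"] for c in coco["categories"] if c["name"] == category_name}
    if not cat_ids:
        raise ValueError(f"unknown category: {category_name!r}")

    all_ids = {img["id"] for img in coco["images"]}
    # Images with at least one annotation of the target category.
    with_obj = {
        a["image_id"] for a in coco["annotations"] if a["category_id"] in cat_ids
    } & all_ids
    without_obj = all_ids - with_obj

    if n_target > len(with_obj):
        raise ValueError(
            f"requested {n_target} images, only {len(with_obj)} contain {category_name!r}"
        )

    # Seeded sampling so the same subset can be rebuilt later.
    rng = random.Random(seed)
    chosen = rng.sample(sorted(with_obj), n_target)
    return sorted(without_obj | set(chosen))


# Toy annotations in the assumed layout (not real data).
toy = {
    "categories": [{"id": 1, "name": "cat"}, {"id": 2, "name": "pizza"}],
    "images": [{"id": i} for i in range(1, 7)],
    "annotations": [
        {"image_id": 1, "category_id": 1},
        {"image_id": 2, "category_id": 1},
        {"image_id": 3, "category_id": 1},
        {"image_id": 4, "category_id": 2},
        {"image_id": 5, "category_id": 2},
    ],
}

subset = build_subset(toy, "cat", n_target=2)
print(len(subset))
```

Repeating the call with increasing values of n_target (while holding the seed fixed) would give a series of subsets that differ only in their exposure to the chosen object, which is the kind of comparison the abstract describes.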