Standard train-dev-test splits, used to benchmark multiple models against each other, are
ubiquitous in Natural Language Processing (NLP). In this setup, the training data is
used to train the model, the development
set to evaluate different versions of the proposed model(s) during development, and the
test set to confirm the answers to the main research question(s). However, the introduction
of neural networks in NLP has led to a different use of these standard splits; the development set is now often used for model selection during the training procedure. Because of
this, comparing multiple versions of the same
model during development leads to overestimation of performance on the development data. As a consequence,
people have started to compare an increasing
number of models on the test data, leading to
faster overfitting and “expiration” of our test
sets. We propose to use a tune set when developing neural network methods, which can be
used for model selection during training, so that
different versions of a new model can safely be
compared on the development data.
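
As an illustration of the intended four-way split, a minimal sketch in Python; the split ratios, seed, and function name are hypothetical choices for illustration, not taken from the paper:

```python
# Sketch of a four-way split: the tune set is used for checkpoint selection /
# early stopping during training, the dev set only for comparing model
# variants, and the test set only for the final evaluation.
# Split ratios and the random seed are illustrative assumptions.
import random


def four_way_split(examples, ratios=(0.7, 0.1, 0.1, 0.1), seed=42):
    """Shuffle the data and split it into train/tune/dev/test partitions."""
    assert abs(sum(ratios) - 1.0) < 1e-9
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    n = len(examples)
    cut1 = int(ratios[0] * n)
    cut2 = cut1 + int(ratios[1] * n)
    cut3 = cut2 + int(ratios[2] * n)
    return {
        "train": examples[:cut1],      # fit model parameters
        "tune": examples[cut1:cut2],   # early stopping / model picking
        "dev": examples[cut2:cut3],    # compare model variants
        "test": examples[cut3:],       # used once, for the final evaluation
    }


if __name__ == "__main__":
    data = [f"sentence {i}" for i in range(1000)]
    splits = four_way_split(data)
    print({name: len(part) for name, part in splits.items()})
```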