In Example 1, LLMs infer from the dataset name that the data concerns anaphor agreement and include this information in the instruction. In Example 2, LLMs create a paraphrase identification task by understanding the relationship between the fields "sentence1" and "sentence2" implied in the dataset description. Under the description-unaware setting, as in Example 3, tasks can be generated based solely on the names of the data fields.
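The sketch below illustrates one way such metadata-driven generation could be prompted. It is a minimal sketch, not the paper's exact pipeline: the prompt wording, the `generate_tasks` helper, the model choice, and the use of the OpenAI chat API are all illustrative assumptions.

```python
# Hypothetical sketch: turn dataset metadata into an LLM prompt that asks for
# (instruction, input, output) task definitions. Prompt text and model choice
# are assumptions for illustration only.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_tasks(dataset_name, description, fields, examples):
    """Ask an LLM to propose instruction-tuning tasks from dataset metadata."""
    prompt = (
        f"Dataset name: {dataset_name}\n"
        f"Description: {description or 'N/A'}\n"
        f"Data fields: {', '.join(fields)}\n"
        f"Sample instances: {json.dumps(examples[:2])}\n\n"
        "Propose NLP tasks this dataset could support. For each task, return a "
        "JSON object with keys 'instruction', 'input_fields', and 'output_field'."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Description-aware case (as in Example 2): the description implies paraphrase identification.
print(generate_tasks(
    "glue/mrpc",
    "Pairs of sentences annotated for whether they are paraphrases.",
    ["sentence1", "sentence2", "label"],
    [{"sentence1": "He left.", "sentence2": "He departed.", "label": 1}],
))
```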
We first evaluate models trained with Dynosaur on Super-NI (a.k.a. NIV2) to examine their ability to solve NLP tasks. We fine-tune T5-3B and LLAMA-7B with different datasets and compare performance on Super-NI and User-Instruction-252. On Super-NI, both models fine-tuned with Dynosaur data outperform those trained with Alpaca, Instruction GPT-4, and Dolly, all of which are much more expensive to collect. In particular, training T5-3B with Dynosaur improves over the baselines by 2.5 to 22 ROUGE-L points.
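As a rough illustration of the evaluation protocol, the sketch below scores model predictions against gold references with ROUGE-L. The field layout, the best-reference averaging scheme, and the use of the `rouge_score` package are assumptions; this is not necessarily the exact evaluation script used for Super-NI.

```python
# Minimal sketch of ROUGE-L evaluation against reference outputs.
# The averaging scheme (best-matching reference per example) is an assumption.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l(predictions, references):
    """Average ROUGE-L F1; each example may have several gold references."""
    total = 0.0
    for pred, refs in zip(predictions, references):
        total += max(scorer.score(ref, pred)["rougeL"].fmeasure for ref in refs)
    return 100 * total / len(predictions)

print(rouge_l(["the cat sat"], [["a cat sat down", "the cat sat"]]))
```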
Dynosaur targets task solving and contains fewer instructions on user assistance (such as writing emails or organizing data). Nevertheless, on User-Instruction-252, Dynosaur can serve as additional training data and achieves higher performance than training with Alpaca or Instruction GPT-4 alone.
We also calculate the cost of generating all Dynosaur instructions as well as the subset used for Super-NI fine-tuning. Dynosaur delivers better performance at a much lower generation cost.
We also recruit human annotators to evaluate the generated (instruction, input, output) pairs. Our data is judged completely correct in 79% of instances, a substantial improvement over the 54% reported for Self-Instruct.
Since Dynosaur can keep growing as new tasks arrive, an important question is how to adapt an already trained instruction-following model to new tasks without suffering from catastrophic forgetting.
Experiments on Super-NI show that replay is an effective way to improve generalization and mitigate forgetting. For instruction tuning, we further propose selecting replay tasks based on instruction representations. Results show that selecting tasks with the most diverse instruction representations outperforms selection based on data representation diversity.
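Below is a minimal sketch of one way diversity-based replay selection could work, assuming instructions are embedded with a sentence encoder and a greedy farthest-point heuristic picks the replay set. The encoder choice, the heuristic, and the function name are illustrative assumptions rather than the paper's exact procedure.

```python
# Hypothetical sketch: pick k replay tasks whose instruction embeddings are
# maximally spread out (greedy farthest-point selection). Encoder and heuristic
# are assumptions for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer

def select_diverse_replay_tasks(instructions, k, encoder_name="all-MiniLM-L6-v2"):
    encoder = SentenceTransformer(encoder_name)
    embs = encoder.encode(instructions, normalize_embeddings=True)

    selected = [0]  # start from an arbitrary task
    # Distance from every instruction to the closest already-selected one.
    min_dist = np.linalg.norm(embs - embs[0], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))  # farthest from the current replay set
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(embs - embs[nxt], axis=1))
    return selected

# Usage: choose 2 of these task instructions to replay during continual tuning.
tasks = [
    "Determine whether two sentences are paraphrases.",
    "Decide if a pair of sentences mean the same thing.",
    "Answer the question given the passage.",
    "Translate the sentence from English to French.",
]
print(select_diverse_replay_tasks(tasks, k=2))
```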
@article{yin2023dynosaur,
title={Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation},
author={Yin, Da and Liu, Xiao and Yin, Fan and Zhong, Ming and Bansal, Hritik and Han, Jiawei and Chang, Kai-Wei},
journal={arXiv preprint arXiv:2305.14327},
year={2023}
}