In Example 1, LLMs infer from the dataset name that the data concerns anaphor agreement and include this information in the instruction. In Example 2, LLMs create a paraphrase identification task by understanding the relationship between the fields "sentence1" and "sentence2" implied in the dataset description. Under the description-unaware setting, as in Example 3, tasks can be generated based solely on the names of the data fields.
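The sketch below illustrates one way such metadata-driven generation could be prompted. It is a minimal sketch, not the paper's exact pipeline: the prompt wording, the `generate_tasks` helper, the model choice, and the use of the OpenAI chat API are all illustrative assumptions.

```python
# Hypothetical sketch: turn dataset metadata into an LLM prompt that asks for
# (instruction, input, output) task definitions. Prompt text and model choice
# are assumptions for illustration only.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_tasks(dataset_name, description, fields, examples):
    """Ask an LLM to propose instruction-tuning tasks from dataset metadata."""
    prompt = (
        f"Dataset name: {dataset_name}\n"
        f"Description: {description or 'N/A'}\n"
        f"Data fields: {', '.join(fields)}\n"
        f"Sample instances: {json.dumps(examples[:2])}\n\n"
        "Propose NLP tasks this dataset could support. For each task, return a "
        "JSON object with keys 'instruction', 'input_fields', and 'output_field'."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Description-aware case (as in Example 2): the description implies paraphrase identification.
print(generate_tasks(
    "glue/mrpc",
    "Pairs of sentences annotated for whether they are paraphrases.",
    ["sentence1", "sentence2", "label"],
    [{"sentence1": "He left.", "sentence2": "He departed.", "label": 1}],
))
```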
We first evaluate models trained with Dynosaur on Super-NI (a.k.a. NIV2) to examine their ability to solve NLP tasks. We fine-tune T5-3B and LLAMA-7B with different datasets and compare performance on Super-NI and User-Instruction-252. On Super-NI, both models fine-tuned with Dynosaur data outperform those trained with Alpaca, Instruction GPT-4, and Dolly, all of which are much more expensive to collect. In particular, training T5-3B with Dynosaur improves over the baselines by 2.5 to 22 ROUGE-L points.
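As a rough illustration of the evaluation protocol, the sketch below scores model predictions against gold references with ROUGE-L. The field layout, the best-reference averaging scheme, and the use of the `rouge_score` package are assumptions; this is not necessarily the exact evaluation script used for Super-NI.

```python
# Minimal sketch of ROUGE-L evaluation against reference outputs.
# The averaging scheme (best-matching reference per example) is an assumption.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def rouge_l(predictions, references):
    """Average ROUGE-L F1; each example may have several gold references."""
    total = 0.0
    for pred, refs in zip(predictions, references):
        total += max(scorer.score(ref, pred)["rougeL"].fmeasure for ref in refs)
    return 100 * total / len(predictions)

print(rouge_l(["the cat sat"], [["a cat sat down", "the cat sat"]]))
```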
Dynosaur targets task solving and contains fewer instructions on user assistance (such as writing emails or organizing data). Nevertheless, on User-Instruction-252, Dynosaur can serve as additional training data and achieves higher performance than training with Alpaca or Instruction GPT-4 alone.
We also calculate the cost of generating all Dynosaur instructions as well as the subset used for Super-NI fine-tuning. Dynosaur delivers better performance at a much lower generation cost.
We also recruit human annotators to evaluate the generated (instruction, input, output) pairs. Our data is judged completely correct in 79% of instances, a substantial improvement over the 54% reported for Self-Instruct.
Since Dynosaur can keep growing as new tasks arrive, an important question is how to adapt an already trained instruction-following model to new tasks without suffering from catastrophic forgetting.
Experiments on Super-NI show that replay is an effective way to improve generalization and mitigate forgetting. For instruction tuning, we further propose selecting replay tasks based on instruction representations. Results show that selecting tasks with the most diverse instruction representations outperforms selection based on data representation diversity.
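Below is a minimal sketch of one way diversity-based replay selection could work, assuming instructions are embedded with a sentence encoder and a greedy farthest-point heuristic picks the replay set. The encoder choice, the heuristic, and the function name are illustrative assumptions rather than the paper's exact procedure.

```python
# Hypothetical sketch: pick k replay tasks whose instruction embeddings are
# maximally spread out (greedy farthest-point selection). Encoder and heuristic
# are assumptions for illustration.
import numpy as np
from sentence_transformers import SentenceTransformer

def select_diverse_replay_tasks(instructions, k, encoder_name="all-MiniLM-L6-v2"):
    encoder = SentenceTransformer(encoder_name)
    embs = encoder.encode(instructions, normalize_embeddings=True)

    selected = [0]  # start from an arbitrary task
    # Distance from every instruction to the closest already-selected one.
    min_dist = np.linalg.norm(embs - embs[0], axis=1)
    while len(selected) < k:
        nxt = int(np.argmax(min_dist))  # farthest from the current replay set
        selected.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(embs - embs[nxt], axis=1))
    return selected

# Usage: choose 2 of these task instructions to replay during continual tuning.
tasks = [
    "Determine whether two sentences are paraphrases.",
    "Decide if a pair of sentences mean the same thing.",
    "Answer the question given the passage.",
    "Translate the sentence from English to French.",
]
print(select_diverse_replay_tasks(tasks, k=2))
```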
@article{yin2023dynosaur,
title={Dynosaur: A Dynamic Growth Paradigm for Instruction-Tuning Data Curation},
author={Yin, Da and Liu, Xiao and Yin, Fan and Zhong, Ming and Bansal, Hritik and Han, Jiawei and Chang, Kai-Wei},
journal={arXiv preprint arXiv:2305.14327},
year={2023}
}