In NLP, pretrain–finetune and prompt-tuning techniques can improve the performance of GPT-3-scale models on a variety of tasks, but the zero-shot performance of such large models remains unimpressive. To further unlock model performance in the zero-shot setting, Google researchers including Quoc Le trained a 137-billion-parameter autoregressive language model (Base LM) and applied to it a new technique called instruction tuning. The results show that the instruction-tuned model's zero-shot performance on unseen tasks such as natural language inference, reading comprehension, and open-domain QA surpasses GPT-3's few-shot performance.
Large language models (LMs) have proven well suited to few-shot learning. For example, OpenAI's GPT-3, with 175 billion parameters, can not only answer questions, translate, and write articles well, but also has some ability at mathematical calculation. Without fine-tuning, it achieves state-of-the-art performance on multiple NLP benchmarks.
However, large language models like GPT-3 are much less impressive at zero-shot learning. For example, GPT-3's zero-shot performance on reading comprehension, question answering, and natural language inference is considerably worse than its few-shot performance.
In this paper, Quoc Le and fellow Google researchers explore a simple method for improving the zero-shot performance of large language models, thereby broadening their reach. Their key idea is that NLP tasks can be described via natural language instructions, for example "Is the sentiment of this movie review positive or negative?" or "Translate 'how are you' into Chinese."
The study took a 137B-parameter pretrained model and performed instruction tuning on it, tuning the model on more than 60 NLP tasks expressed via natural language instructions. They call the resulting model Finetuned LAnguage Net, or FLAN.
Paper: https://arxiv.org/pdf/2109.01652.pdf
GitHub: https://github.com/google-research/flan
To evaluate FLAN's zero-shot performance on unseen tasks, the study grouped NLP tasks into clusters by task type, held out each cluster for evaluation in turn, and instruction-tuned FLAN on all remaining clusters. As shown in Figure 1, to evaluate FLAN's ability to perform natural language inference, the model was instruction-tuned on a range of other NLP tasks (such as commonsense reasoning, translation, and sentiment analysis). Because this setup ensures that FLAN never sees a natural language inference task during instruction tuning, its ability to perform zero-shot natural language inference can then be evaluated.
The evaluation shows that FLAN significantly improves the zero-shot performance of the base 137B-parameter model. On 19 of the 25 evaluation tasks, zero-shot FLAN outperforms zero-shot 175B-parameter GPT-3, and on many tasks (such as ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze) it even significantly outperforms few-shot GPT-3. Ablation studies found that increasing the number of task clusters used in instruction tuning improves performance on unseen tasks, and that the benefits of instruction tuning only emerge when the model is sufficiently large.
These empirical results highlight the ability of language models to perform tasks described via natural language instructions. More broadly, as shown in Figure 2, instruction tuning combines characteristics of the pretrain–finetune and prompting paradigms, using supervision via finetuning to improve the language model's responses to inference-time text interactions.
FLAN: Improving zero-shot learning with instruction tuning
The motivation of instruction tuning is to improve the language model's ability to respond to NLP instructions: using supervision to teach the LM to perform tasks described via instructions, so that the model learns to follow instructions even for tasks it has not seen before. To evaluate performance on unseen tasks, the study groups tasks into clusters by task type and holds out one task cluster for evaluation while instruction-tuning on the remaining clusters.
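The leave-one-cluster-out setup described above can be sketched as follows. This is a minimal illustration of the evaluation protocol; the cluster and task names are illustrative stand-ins, not the paper's exact list.

```python
# Minimal sketch of the leave-one-cluster-out evaluation setup used by FLAN.
# Cluster/task names here are illustrative, not the paper's exact taxonomy.
TASK_CLUSTERS = {
    "nli": ["anli_r1", "rte", "cb"],
    "sentiment": ["sst2", "imdb"],
    "translation": ["wmt14_fr_en", "wmt16_de_en"],
    "commonsense": ["copa", "hellaswag"],
}

def leave_one_cluster_out(held_out):
    """Return (instruction-tuning tasks, evaluation tasks) for one run."""
    tuning = [task for cluster, tasks in TASK_CLUSTERS.items()
              if cluster != held_out for task in tasks]
    evaluation = list(TASK_CLUSTERS[held_out])
    return tuning, evaluation

tuning, evaluation = leave_one_cluster_out("nli")
# No NLI task appears in the instruction-tuning mixture, so measured
# NLI performance is genuinely zero-shot.
assert not set(tuning) & set(evaluation)
```

To evaluate every cluster, this split is simply repeated with each cluster held out in turn.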
Tasks and templates
The study aggregated 62 text datasets publicly available on TensorFlow Datasets (covering both language understanding and language generation tasks). Figure 3 below shows all datasets used in the study; each dataset is assigned to one of twelve task clusters, where the datasets in each cluster share the same task type.
The study defines a task as a particular set of input–output pairs given by a dataset. For each task, the researchers manually wrote ten unique templates that describe the task with natural language instructions. Most of the ten templates describe the original task, but to increase diversity the researchers also provided, for each task, up to three templates that "turned the task around." Figure 4 below gives several instruction templates for a natural language inference task.
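Applying such templates can be sketched as below. The two templates are stand-ins in the spirit of the paper's Figure 4, not FLAN's exact templates; the premise/hypothesis pair is an illustrative NLI example.

```python
# Illustrative sketch: turning one NLI example into several instruction
# prompts via templates. Templates are stand-ins, not FLAN's exact ones.
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? OPTIONS: yes, no",
    "{premise}\nBased on the paragraph above, can we conclude that "
    "\"{hypothesis}\"? OPTIONS: yes, no",
]

example = {
    "premise": "Russian cosmonaut Valery Polyakov set the record for the "
               "longest continuous amount of time spent in space.",
    "hypothesis": "Russians hold the record for the longest stay in space.",
}

# Each template yields a different natural-language phrasing of the same task.
prompts = [template.format(**example) for template in NLI_TEMPLATES]
```

During instruction tuning, each training example is formatted with a randomly chosen template for its task, so the model sees many phrasings of the same underlying input–output pair.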
Model architecture and pretraining. The experiments use a dense, left-to-right, decoder-only transformer language model with 137B parameters. The model is pretrained on a collection of web documents (including documents containing computer code), dialogue data, and Wikipedia, tokenized into 2.81T BPE tokens with a 32K-token vocabulary using the SentencePiece library (Kudo & Richardson, 2018). About 10% of the pretraining data is non-English. This dataset is not as clean as GPT-3's training set, and it also mixes in dialogue and code.
The researchers evaluated FLAN's performance on a number of tasks including natural language inference, reading comprehension, open-domain QA, commonsense reasoning, coreference resolution, and translation. For each task, they report the mean and standard error of performance over all templates, which represents FLAN's expected performance given a typical natural language instruction.
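The per-task reporting scheme just described amounts to the following computation. This is a minimal sketch; the ten per-template scores are made-up numbers, not results from the paper.

```python
import statistics

# Sketch of the per-task reporting: mean and standard error of the metric
# over all ten templates for one task. Scores below are made-up numbers.
template_scores = [71.2, 69.8, 73.0, 70.5, 72.1, 68.9, 71.7, 70.0, 72.4, 69.5]

mean = statistics.mean(template_scores)
# Standard error = sample standard deviation / sqrt(number of templates).
stderr = statistics.stdev(template_scores) / len(template_scores) ** 0.5
print(f"{mean:.1f} ± {stderr:.1f}")  # prints "70.9 ± 0.4"
```

Reporting the spread over templates, rather than cherry-picking one template, reflects how sensitive the model is to the exact wording of the instruction.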
Natural language inference tasks
Table 1 below shows the natural language inference results of different models: given a premise and a hypothesis, the model must determine whether the hypothesis is true given that the premise is true. As can be seen, FLAN performs strongly in all cases.
Although the results on CB and RTE show high variance across templates, FLAN, without any prompt engineering, still significantly outperforms both zero-shot and few-shot GPT-3 on four datasets. With the best dev template, FLAN outperforms few-shot GPT-3 on five datasets. FLAN even surpasses supervised BERT on the ANLI-R3 dataset.
Reading comprehension and open-domain QA tasks
In reading comprehension tasks, the model is asked to answer questions about a given passage; the results are shown in Table 2. FLAN significantly outperforms GPT-3 on BoolQ and OBQA. With the best dev template, FLAN slightly outperforms few-shot GPT-3 on the MultiRC dataset.
On open-domain QA tasks, FLAN significantly outperforms both zero-shot and few-shot GPT-3 on the ARC-easy and ARC-challenge datasets. On the Natural Questions dataset, FLAN outperforms zero-shot GPT-3 but is weaker than few-shot GPT-3.
Commonsense reasoning and coreference resolution tasks
Table 3 below shows the results of different models on five commonsense reasoning datasets. FLAN outperforms GPT-3 on the StoryCloze dataset and performs comparably to GPT-3 on the CoPA and PiQA datasets. But on the HellaSwag and ReCoRD datasets, both Base LM and FLAN are weaker than GPT-3.
On the two coreference resolution tasks, FLAN with the best dev template outperforms zero-shot GPT-3 on the Winogrande dataset, but on the WSC273 dataset, both Base LM and FLAN are weaker than GPT-3.
The researchers also tested FLAN's machine translation performance on the three datasets evaluated in the GPT-3 paper: WMT'14 French–English, and WMT'16 German–English and Romanian–English.
The results are shown in Table 4 below. Base LM's zero-shot translation performance is weak, but its few-shot translation results are comparable to GPT-3's. FLAN outperforms few-shot Base LM on five of the six evaluation metrics. Similar to GPT-3, FLAN shows strong performance when translating into English, and compares favorably against supervised translation baselines there.
Since the core question of this paper is how to improve the model's zero-shot performance on unseen tasks, the study's first ablation experiment examined how the number of clusters and tasks used in instruction tuning affects performance.
Figure 5 shows the experimental results. As expected, the researchers observed that average performance across the three held-out clusters improves as additional clusters and tasks are added to instruction tuning (with the exception of the sentiment analysis cluster), demonstrating that the proposed instruction tuning method helps improve zero-shot performance on new tasks.
Figure 6 below shows that, for larger models, instruction tuning fills some model capacity but also teaches these models the ability to follow instructions, allowing the model to use its remaining capacity to generalize to new tasks.