Zero-shot performance surpasses few-shot: Google's new 137-billion-parameter model is stronger than GPT-3

TechWeb 2021-09-15 04:51:33

In NLP, pretrain–finetune and prompt-tuning techniques can improve the performance of GPT-3-scale models on a variety of tasks, but the zero-shot performance of such large models remains unremarkable. To further exploit model performance in the zero-shot setting, researchers from Quoc Le's team at Google trained a 137-billion-parameter autoregressive language model, Base LM, and applied a new instruction tuning technique to it. The results show that the instruction-tuned model's zero-shot performance on previously unseen task types such as natural language inference, reading comprehension, and open-domain question answering exceeds GPT-3's few-shot performance.

Large-scale language models (LMs) have proven to work well on few-shot learning tasks. For example, GPT-3, proposed by OpenAI, has 175 billion parameters; it can not only answer questions, translate, and write articles well, but also has some mathematical ability, achieving state-of-the-art performance on multiple NLP benchmarks without fine-tuning.

However, large-scale language models like GPT-3 are far less impressive at zero-shot learning. For example, GPT-3's zero-shot performance on reading comprehension, question answering, and natural language inference is much worse than its few-shot performance.

In this paper, researchers from Quoc Le's team at Google explore a simple method to improve the zero-shot performance of large language models and thereby broaden their reach. Their intuition is that NLP tasks can be described via natural language instructions, such as "Is the sentiment of this movie review positive or negative?" or "Translate 'how are you' into Chinese."

The study takes a 137B-parameter pretrained model and performs instruction tuning, tuning it on more than 60 NLP tasks expressed via natural language instructions. They call the resulting model FLAN, for Finetuned LAnguage Net.

Paper: https://arxiv.org/pdf/2109.01652.pdf
GitHub: https://github.com/google-research/flan

To evaluate FLAN's zero-shot performance on unseen tasks, the study groups NLP tasks into clusters by task type and evaluates on each cluster in turn, while instruction-tuning FLAN on all the other clusters. As shown in Figure 1, to evaluate FLAN's ability to perform natural language inference, the model is instruction-tuned on a range of other NLP tasks (such as commonsense reasoning, translation, and sentiment analysis). Because this setup ensures that FLAN never sees a natural language inference task during instruction tuning, its ability to perform zero-shot natural language inference can then be measured.


The evaluation shows that FLAN significantly improves the zero-shot performance of the base 137B-parameter model. On 19 of the 25 evaluation tasks, zero-shot FLAN outperforms zero-shot GPT-3 (175B parameters), and on many tasks (such as ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze) it even substantially outperforms few-shot GPT-3. Ablation studies found that increasing the number of task clusters used in instruction tuning improves performance on unseen tasks, and that the benefits of instruction tuning emerge only at sufficient model scale.

These empirical results highlight the ability of language models to perform tasks described via natural language instructions. More broadly, as shown in Figure 2, instruction tuning combines characteristics of the pretrain–finetune and prompting paradigms, using finetuning supervision to improve the language model's ability to respond to text interactions at inference time.

FLAN: Improving zero-shot learning with instruction tuning

The motivation of instruction tuning is to improve a language model's ability to respond to NLP instructions: supervision is used to teach the LM to perform tasks described via instructions, so that the model learns to follow instructions even for tasks it has not seen before. To evaluate performance on unseen tasks, the study groups tasks into clusters by task type and holds out one task cluster for evaluation while instruction-tuning on the remaining clusters.
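The cluster hold-out procedure can be sketched as follows; the cluster names and dataset lists here are illustrative placeholders, not the paper's actual twelve-cluster split:

```python
# Illustrative task clusters (names and datasets are placeholders,
# not the paper's actual clustering).
CLUSTERS = {
    "nli": ["anli_r1", "rte", "cb"],
    "sentiment": ["sst2", "imdb"],
    "translation": ["wmt14_fr_en", "wmt16_de_en"],
}

def heldout_splits(clusters):
    """For each cluster, instruction-tune on every *other* cluster's tasks
    and hold this cluster out for zero-shot evaluation."""
    for held_out in clusters:
        train_tasks = [task
                       for name, tasks in clusters.items() if name != held_out
                       for task in tasks]
        yield held_out, train_tasks
```

For example, when "nli" is held out, the model is tuned only on sentiment and translation tasks, so its NLI evaluation is genuinely zero-shot.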

Tasks and templates

The study aggregates 62 text datasets publicly available on TensorFlow Datasets, covering both language understanding and language generation tasks. Figure 3 below shows all datasets used in the study; each dataset is assigned to one of twelve task clusters, where the datasets in a cluster share the same task type.


The study defines a task as a particular set of input-output pairs given by a dataset. For each task, the researchers manually composed ten unique templates that describe the task using natural language instructions. Most of the ten templates describe the original task, but to increase diversity, for each task the researchers also included up to three templates that "turned the task around". Figure 4 below gives several instruction templates for a natural language inference task.
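As a rough sketch, a template simply wraps each dataset example in an instruction; the templates below are invented for illustration and are not the paper's actual wording:

```python
# Hypothetical instruction templates for an NLI task (illustrative only).
NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Answer yes or no.",
    "{premise}\nBased on the paragraph above, can we conclude that "
    "\"{hypothesis}\"?",
    # A "turned around" template asks the model to generate text
    # instead of classifying it.
    "Write a sentence that must be true if the following is true: {premise}",
]

def instantiate(example, templates):
    """Render one dataset example through every instruction template."""
    return [t.format(**example) for t in templates]
```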

Training details

Model architecture and pretraining. The experiments use a dense, left-to-right, decoder-only transformer language model with 137B parameters. The model is pretrained on a collection of web documents (including documents containing computer code), dialogue data, and Wikipedia, tokenized into 2.81T BPE tokens with a 32K-token vocabulary using the SentencePiece library (Kudo & Richardson, 2018). Roughly 10% of the pretraining data is non-English. This dataset is not as clean as GPT-3's training set, and it also mixes in dialogue and code.
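The production tokenizer is SentencePiece, but the core BPE idea, repeatedly merging the most frequent adjacent symbol pair, can be sketched in a few lines. This is a toy illustration, not the actual 32K-vocabulary tokenizer:

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Toy byte-pair encoding: greedily merge the most frequent adjacent
    symbol pair, returning the learned merge rules in order."""
    vocab = Counter(tuple(w) for w in words)  # symbol-tuple -> frequency
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])  # apply the merge
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] += freq
        vocab = merged
    return merges
```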

Experimental results

The researchers evaluated FLAN's performance on a number of tasks, including natural language inference, reading comprehension, open-domain question answering, commonsense reasoning, coreference resolution, and translation. For each task, they report the mean and standard error of performance across all templates, representing FLAN's expected performance given a typical natural language instruction.
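Averaging over templates can be sketched as below; `scores` is a hypothetical list of per-template accuracies for a single task:

```python
import math

def template_stats(scores):
    """Mean and standard error of per-template scores for one task,
    summarizing expected performance under a 'typical' instruction."""
    n = len(scores)
    mean = sum(scores) / n
    sample_var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(sample_var / n)  # standard error of the mean
```

For example, per-template accuracies of 0.60, 0.70, and 0.80 give a mean of 0.70 with a standard error of about 0.058.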

Natural language inference tasks

Table 1 below shows the natural language inference results for different models: given a premise and a hypothesis, the model must determine whether the hypothesis is true whenever the premise is true. As the table shows, FLAN performs strongly in all cases.

Although results on CB and RTE show high variance across templates, FLAN, without any prompt engineering, still significantly outperforms both zero-shot and few-shot GPT-3 on four datasets. With the best dev template, FLAN outperforms few-shot GPT-3 on five datasets. FLAN even surpasses supervised BERT on the ANLI-R3 dataset.

Reading comprehension and open-domain question answering tasks

In reading comprehension tasks, the model is asked to answer questions about a given passage; the results are shown in Table 2. FLAN significantly outperforms GPT-3 on BoolQ and OBQA. With the best dev template, FLAN slightly outperforms few-shot GPT-3 on the MultiRC dataset.

For open-domain question answering, FLAN significantly outperforms both zero-shot and few-shot GPT-3 on the ARC-easy and ARC-challenge datasets. On the Natural Questions dataset, FLAN outperforms zero-shot GPT-3 but is weaker than few-shot GPT-3.

Commonsense reasoning and coreference resolution tasks

Table 3 below shows the results of different models on five commonsense reasoning datasets. FLAN outperforms GPT-3 on the StoryCloze dataset and is comparable to GPT-3 on CoPA and PiQA. But on the HellaSwag and ReCoRD datasets, both Base LM and FLAN are weaker than GPT-3.

On the two coreference resolution tasks, FLAN with the best dev template outperforms zero-shot GPT-3 on the Winogrande dataset, but on WSC273, both Base LM and FLAN are weaker than GPT-3.

Translation

The researchers also tested FLAN's machine translation performance on the three datasets evaluated in the GPT-3 paper: WMT'14 French-English, and WMT'16 German-English and Romanian-English.

The results are shown in Table 4. Base LM's zero-shot translation performance is weak, but its few-shot translation results are comparable to GPT-3's. FLAN outperforms few-shot Base LM on five of the six evaluation metrics. Like GPT-3, FLAN shows strong performance when translating into English, and it compares favorably against the supervised translation baselines.

Other experiments

Since the core question of this paper is how to improve the model's zero-shot performance on unseen tasks, the study's first ablation examines the effect of the number of clusters and tasks used in instruction tuning.

Figure 5 shows the experimental results. As expected, the researchers observed that average performance on the three held-out clusters improves as additional clusters and tasks are added to instruction tuning (with the exception of the sentiment analysis cluster), confirming that the proposed instruction tuning approach helps zero-shot performance on new tasks.


Figure 6 shows that for larger models, instruction tuning consumes some model capacity, but it also teaches these models to follow instructions, allowing them to use their remaining capacity to generalize to new tasks.


Copyright notice
This article was created by TechWeb; please include a link to the original when reposting.
https://fheadline.com/2021/09/20210909113919072p.html