On September 19, 「Shenzhou」, a pre-training model developed by the Tencent QQ Browser Laboratory, topped the Chinese language understanding evaluation benchmark CLUE, setting a new industry record and becoming the first pre-training model to surpass the human level on the overall Chinese natural language understanding evaluation.
As one of the most authoritative evaluation benchmarks in the field of Chinese language understanding, CLUE covers ten semantic analysis and understanding subtasks, including text similarity, classification, natural language inference, and reading comprehension. Relying on its top-tier language understanding capability, the QQ Browser 「Shenzhou」 model topped the CLUE 1.0 overall, classification, and reading comprehension leaderboards, breaking three world records. Its total leaderboard score exceeded 85.88 points, 0.271 points above the human benchmark score.
Building the ten-billion-parameter 「Shenzhou」 1.0 model
Natural language processing and understanding (NLP & NLU) is a core competency in the content field and a core direction of sustained AI development; its applications cover search, recommendation, business algorithms, and many other areas of AI.
In both academia and industry today, the pre-training (pretrain) + fine-tuning (finetune) + distillation (distill) pipeline has become the new paradigm for semantic understanding. BERT is widely used as the base pre-training model in related algorithms, and on that basis, a stronger pre-training model brings gains to every semantic understanding capability.
The 「Shenzhou」 natural language pre-training model is the result of independent research by the Tencent QQ Browser Laboratory in 2021. Working with the Tencent QQ Browser search and content algorithm teams, the lab built on the Motian (skyscraper) pre-training model that topped CLUE in June and carried out a series of further innovations: a cross-layer decayed attention residual connection algorithm, instance-wise self-distillation applied to pre-training model training, and an autoregressive MLM training strategy, among others. On this basis, knowledge enhancement was carried out through a second stage of pre-training, further improving the pre-training model's effectiveness.
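The cross-layer decayed attention residual is not publicly specified, so the following is only a hedged sketch of one plausible reading: a RealFormer-style residual on attention scores, where each layer's raw scores receive a decayed residual from the previous layer's scores before the softmax. The shapes, weights, and decay factor are illustrative assumptions, not the lab's actual algorithm.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention_with_residual(raw_scores, prev_scores, decay=0.5):
    """Combine this layer's raw attention scores (per query, per key) with a
    decayed residual of the previous layer's scores, then normalize."""
    combined = [
        [r + decay * p for r, p in zip(r_row, p_row)]
        for r_row, p_row in zip(raw_scores, prev_scores)
    ]
    weights = [softmax(row) for row in combined]
    return weights, combined  # `combined` is carried to the next layer

# Two toy layers, two queries, two keys; layer 1 has no residual to receive.
layer1_scores = [[1.0, 0.2], [0.3, 0.9]]
layer2_scores = [[0.5, 0.5], [0.1, 1.2]]
w1, carry = attention_with_residual(layer1_scores, [[0.0, 0.0], [0.0, 0.0]])
w2, _ = attention_with_residual(layer2_scores, carry)
```

Under this reading, attention information propagates across layers through the residual path while the decay factor keeps early-layer scores from dominating deep layers.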
Large-scale deep learning models have proven effective across the board, but training a ten-billion-parameter bidirectional autoencoding model has always been a challenge. The 「Shenzhou」 model uses the ZeRO partitioning scheme to shard the ten-billion-parameter model across N cards, and combines FP16 training and gradient checkpointing to further reduce GPU memory usage. At the communication layer, TCP was replaced with GPUDirect RDMA, greatly improving communication efficiency, and a gradient aggregation algorithm further reduced communication volume.
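The memory arithmetic behind ZeRO-style sharding can be illustrated with a back-of-the-envelope calculation. This is a sketch that follows the standard mixed-precision accounting (2-byte FP16 parameters and gradients, 12 bytes per parameter of FP32 Adam optimizer state); the 64-GPU count is an assumption for illustration, not a figure from the article.

```python
def per_gpu_memory_gb(n_params: float, n_gpus: int, zero_stage: int) -> float:
    """Approximate per-GPU memory (GB) for model states only.

    ZeRO stage 1 shards optimizer states across GPUs; stage 2 also shards
    gradients; stage 3 also shards the FP16 parameters themselves.
    """
    params = 2.0 * n_params   # FP16 weights
    grads = 2.0 * n_params    # FP16 gradients
    optim = 12.0 * n_params   # FP32 master weights + Adam momentum/variance
    if zero_stage >= 1:
        optim /= n_gpus
    if zero_stage >= 2:
        grads /= n_gpus
    if zero_stage >= 3:
        params /= n_gpus
    return (params + grads + optim) / 1024**3

# A 10B-parameter model holds ~149 GB of model states on a single GPU with no
# sharding -- far beyond any single card -- but sharding everything across
# 64 GPUs brings the per-GPU footprint down to a few GB.
unsharded = per_gpu_memory_gb(10e9, 64, 0)
fully_sharded = per_gpu_memory_gb(10e9, 64, 3)
```

FP16 training halves the bytes per weight and gradient, and gradient checkpointing (not modeled here) trades recomputation for activation memory on top of this.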
Finally, through this industry-leading training capability, the QQ Browser Laboratory trained Shenzhou, a ten-billion-parameter bidirectional autoencoding pre-training model. With 「Shenzhou」's pre-training capability, simply updating a model under this paradigm improves results on almost all semantic understanding tasks, giving it broad applicability. Moreover, 「Shenzhou」's pre-training capability is the foundation of multimodal pre-training, helping to improve the overall effectiveness of multimodal pre-training for video understanding. At the same time, Shenzhou is delivered through Tencent's existing middle-platform systems, further expanding its reach.
The first model to surpass the human level in Chinese natural language understanding
On various English leaderboards, machines surpassed humans some time ago. Chinese, the most widely spoken language in the world, poses many contextual problems more complex than English, such as word segmentation and morphological and syntactic differences, making language understanding harder overall. The industry's Chinese language understanding capability has remained some distance behind humans (HUMAN on the leaderboard). Once natural language understanding deepens to the human level, technology in Chinese-language settings can take on work that traditionally only humans could do.
Combining a large number of model optimization and acceleration algorithms from the Tencent PCG Venus machine learning platform, 「Shenzhou」 built ten-billion-parameter training capability on top of the earlier billion-parameter 「small」 model Motian (skyscraper), and through extensive optimization ultimately produced the industry's best model for Chinese language understanding.
To further verify the pre-training model's effectiveness, the Tencent QQ Browser Laboratory evaluated it on the Chinese language understanding evaluation benchmark CLUE. On September 19, 2021, it topped the CLUE 1.0 overall, classification, and reading comprehension leaderboards, all above the human level, while simultaneously taking first place on the CLUE 1.1 overall, classification, and reading comprehension leaderboards.
At present, 「Shenzhou」 already supports dozens of semantic algorithm applications in QQ Browser search, Tencent Kandian information feeds, and QQ Browser novel scenarios, achieving significant gains across multiple business scenarios. Via the Tencent search middle platform and large content middle platform, it also reaches Tencent News, Tencent Video, Weishi, and other Tencent PCG business scenarios.
Setting new records on 27 NLP benchmark tasks
Based on 「Shenzhou」's leading Chinese natural language understanding capability, the Tencent QQ Browser Laboratory team achieved the best results on 27 public Chinese natural language datasets, with problem types covering all aspects of natural language, including document retrieval, event extraction, opinion extraction, natural language inference, semantic similarity, classification, machine reading comprehension, and named entity recognition.
Beyond this, the 「Shenzhou」 pre-training model can bring stronger overall effectiveness to all semantic understanding capabilities, including but not limited to the following scenarios:
Efficient industry solutions, such as question-bank understanding in education and in-vehicle conversation scenarios;
Annotation assistance: in content review, customer service, medical consultation Q&A, and other fields, reducing unnecessary human interaction and annotation through semantics and knowledge;
Improved semantics in multimodal scenarios, optimizing multimodal alignment.
「Shenzhou」 pre-training data
「Shenzhou」 draws on the large base of training data used for the Motian (skyscraper) model, including Penguin content, novels, encyclopedias of various kinds, news, and community Q&A. On top of this, a large volume of Internet web-page data was introduced and refined through precise cleaning, ensuring sufficient data volume while avoiding the model drift that low-quality data would cause.
Self-distillation pre-training algorithm
Knowledge distillation (Knowledge Distillation) transfers the knowledge of a trained teacher model (Teacher Model) to a student model (Student Model) through distillation in order to improve the student model's performance; the student model usually has fewer parameters. Self-distillation (Self-Distillation), by contrast, keeps the number of parameters unchanged and continuously improves a model's performance by distilling the model into itself.
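The teacher-to-student transfer is commonly framed as minimizing the KL divergence between the two models' temperature-softened output distributions. A minimal pure-Python sketch, assuming the standard soft-target formulation (the logits are made-up numbers, not outputs of the actual models):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened softmax over a list of logits."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) between temperature-softened distributions."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

teacher = [3.0, 1.0, 0.2]
student = [2.5, 1.2, 0.3]
loss = distill_loss(teacher, student)  # positive; shrinks as student matches teacher
```

In self-distillation, teacher and student are the same set of parameters, so the same loss is computed between two outputs of a single model rather than between two models.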
Self-distillation is already widely used in the CV and NLP fields, and the generality of its benefits has been verified. In pre-training, however, standard self-distillation has not been widely adopted, mainly because pre-training already consumes enormous time and resources: standard self-distillation needs several rounds of model training, prediction, and distillation before it works well, which is very time-consuming and clearly unsuitable for pre-training.
Inspired by ALBEF's momentum distillation and the R-Drop technique (ICLR 2021), the QQ Browser Laboratory team explored layer-wise and instance-wise self-distillation in pre-training models, aiming to minimize time and resource consumption by performing self-distillation online during training and thereby rapidly improving the model. Experimental results show that both methods improve generalization on downstream tasks; instance-wise self-distillation works better, but its GPU memory consumption is also higher.
On the left of the figure below is layer-wise self-distillation: during training, the output of each layer is distilled toward the model's final output to continuously improve performance. On the right is instance-wise self-distillation, which exploits the randomness of dropout: the same input produces two different outputs, and the model distills against itself online to improve rapidly.
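The instance-wise variant can be sketched in a few lines of pure Python: run the same input through a dropout-randomized layer twice and add a symmetric KL term that pulls the two predictive distributions together (the R-Drop-style formulation). The layer, weights, and dropout rate below are illustrative assumptions, not the actual model.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

def forward(x, weights, rng, p_drop=0.5):
    """One toy linear layer with inverted dropout on the input features."""
    kept = [xi / (1 - p_drop) if rng.random() > p_drop else 0.0 for xi in x]
    return softmax([sum(w * k for w, k in zip(row, kept)) for row in weights])

rng = random.Random(0)
x = [0.5, -1.0, 2.0]
W = [[0.3, -0.2, 0.8], [0.1, 0.4, -0.5]]

p1 = forward(x, W, rng)  # two stochastic passes over the same input
p2 = forward(x, W, rng)  # differ only in the dropout masks
self_distill_loss = 0.5 * (kl(p1, p2) + kl(p2, p1))  # symmetric KL regularizer
```

In training, this regularizer is added to the usual task loss, so the two dropout-perturbed "views" of the model teach each other at no extra training-round cost.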
Introducing knowledge-graph-enhanced pre-training to strengthen knowledge understanding
For a pre-training model to understand real-world knowledge, it needs to be "fed" more knowledge, and the industry has explored ways to introduce knowledge into pre-training. 「Shenzhou」 carries out further in-depth optimization for knowledge enhancement: building knowledge-graph data from search together with encyclopedia corpora, the team tried three knowledge tasks, namely distant-supervision relation classification, similar-entity replacement prediction, and triple-to-text mask prediction.
The figure below shows examples of the three knowledge tasks. Experiments show that all three tasks can effectively introduce knowledge and bring good improvements on downstream knowledge tasks.
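As a concrete illustration of the third task, one plausible way to construct a triple-to-text mask prediction example is to mask the tail entity's mention in an aligned sentence and ask the model to recover it from the head, relation, and context. The triple, sentence, and separator format below are made-up illustrations, not the lab's actual data format.

```python
MASK = "[MASK]"

def build_triple_mask_example(triple, sentence):
    """Pair a (head, relation, tail) triple with a sentence, masking the
    tail entity's mention so it becomes the prediction target."""
    head, relation, tail = triple
    masked_sentence = sentence.replace(tail, MASK)
    return {
        "input": f"{head} [SEP] {relation} [SEP] {masked_sentence}",
        "label": tail,
    }

example = build_triple_mask_example(
    ("Beijing", "capital_of", "China"),
    "Beijing is the capital of China.",
)
```

Training on such examples forces the model to use the structured triple, not just surface co-occurrence, to fill in the masked entity.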
Optimizing to avoid parameter forgetting
Knowledge tasks can drive the pre-training model to learn knowledge-related parameters, but they also easily cause the original parameters to be forgotten and the model's generality to decline. A common remedy for parameter forgetting is to train both the MLM task and the knowledge tasks on the input knowledge corpus.
Although this practice slows parameter forgetting, the knowledge corpus is monotonous and regular, so introducing MLM alone still cannot prevent the model from degrading in general scenarios. To address this, the team introduced a two-way corpus input mechanism: the general pre-training corpus and the knowledge-task corpus are combined as two input streams that share the encoder's parameters and are trained jointly. This preserves the diversity of the MLM task's corpus input while reducing the impact that the regular, knowledge-task-laden encyclopedia corpus has on the model.
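The two-way input mechanism can be sketched as a joint-training loop: each step draws one batch from the general corpus and one from the knowledge corpus, runs both through a shared encoder, and sums an MLM loss with a knowledge-task loss. Every function below is an illustrative placeholder standing in for the real components, not the lab's implementation.

```python
def shared_encoder(batch):
    """Stand-in for the Transformer encoder shared by both input streams."""
    return [x * 2.0 for x in batch]

def mlm_loss(hidden):
    """Placeholder MLM loss on encoder outputs."""
    return sum(hidden) * 0.1

def knowledge_task_loss(hidden):
    """Placeholder knowledge-task loss on encoder outputs."""
    return sum(hidden) * 0.2

# Toy "batches": one stream of general text, one stream of knowledge corpus.
general_corpus = [[1.0, 2.0], [0.5, 0.5]]
knowledge_corpus = [[2.0, 2.0], [1.0, 0.0]]

step_losses = []
for general_batch, knowledge_batch in zip(general_corpus, knowledge_corpus):
    h_general = shared_encoder(general_batch)      # the same encoder
    h_knowledge = shared_encoder(knowledge_batch)  # serves both streams
    step_losses.append(mlm_loss(h_general) + knowledge_task_loss(h_knowledge))
```

Because the encoder parameters receive gradients from both streams in every step, the general-corpus MLM signal counterbalances the regularity of the knowledge corpus.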
Experimental results show that two-way input brings an average improvement of more than 0.5% across multiple downstream tasks compared with joint learning on the encyclopedia corpus alone. With two-way input, all three knowledge tasks above improve the pre-training model on downstream tasks. Among them, distant-supervision relation classification and triple-to-text mask prediction raise the EM metric on reading comprehension tasks by 0.7% on average, and on natural language inference tasks there are improvements ranging from 0.15% to 0.3%.
At present, 「Shenzhou」 has been gradually applied to QQ Browser search, Kandian information feeds, novels, and other scenarios. As Shenzhou improves further and combines with business practice, it will continue to transform QQ Browser's search capability, understanding the needs behind what users express and serving users' intentions in the most intelligent and thorough way.