At WAVE SUMMIT 2021, its deep learning developer summit, Baidu open-sourced ERNIE-UNIMO, a pre-trained model that unifies language and vision. Its core method, UNIMO, has been accepted as an oral long paper in the main conference of ACL 2021, a top NLP venue.
At the ACL 2021 paper-sharing session held by Heart of the Machine (Synced) on July 31, Li Wei, the first author of the paper, presented the work in detail. Interested readers can click "read the original" to watch the replay.
Can an AI system, like a person, use one unified brain-like model to achieve a general ability that integrates perception and cognition? Starting from this question, Baidu proposed UNIMO, an attempt to build a single pre-training model that covers multiple modalities.
Paper: https://arxiv.org/abs/2012.15409
Code: https://github.com/PaddlePaddle/ERNIE/tree/develop/ernie-unimo
UNIMO is the first method to learn from large amounts of text, image, and image-text pair data at the same time. Through cross-modal contrastive learning, it brings language knowledge and visual knowledge into a unified representation in which each reinforces the other. Across four types of scenarios (language understanding, language generation, cross-modal understanding, and cross-modal generation) covering 13 tasks, UNIMO surpasses mainstream text pre-training models and cross-modal pre-training models. It also topped authoritative leaderboards such as VQA for visual question answering and aNLI for text reasoning, showing that non-parallel single-modality data such as text and images allows language knowledge and visual knowledge to enhance each other. UNIMO is now gradually being deployed in Baidu products.
Introduction to the UNIMO method
Big data is one of the key foundations for the success of deep learning. Depending on the modality of the data, deep learning applications generally fall into three areas: natural language processing on text data, vision applications on visual data, and cross-modal applications on image-text data. Clearly, the human brain does not learn each modality in isolation. For example, after seeing a picture, the brain automatically recalls related language knowledge, and vice versa. Mastering data of many modalities lets humans learn language, visual, and speech knowledge so that they reinforce one another, and display a high level of intelligence through a single unified model. So can a deep-learning-based AI system learn from heterogeneous modal data at the same time, as people do? If it can, that would push back the limits on how much large-scale data deep learning can exploit, and further improve the general AI capability of integrating perception and cognition.
Baidu proposes UNIMO, a unified pre-training method for heterogeneous modal data. It trains on text, image, and image-text pair data simultaneously and learns a unified semantic representation of text and images, so that a single model can handle multiple single-modal and cross-modal downstream tasks. The core module of UNIMO is a Transformer network. During training, the three kinds of data (text, images, and image-text pairs) are randomly mixed together: an image is converted into a sequence of objects, text is converted into a sequence of tokens, and an image-text pair is converted into the concatenation of an object sequence and a token sequence. UNIMO processes all three types of input in a unified way. It performs self-supervised learning based on mask prediction over the object sequence or token sequence, and cross-modal contrastive learning based on the image-text pair data, which together yield a unified representation of images and text. This joint learning also lets textual knowledge and visual knowledge enhance each other, improving both text and visual semantic representations.
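To make the input construction concrete, here is a minimal sketch of how the three kinds of data could be turned into one sequence format and then masked for prediction. It is an illustration only, not the ERNIE-UNIMO code: the names `build_sequence`, `mask_sequence`, and `MASK`, the `"obj"`/`"tok"` tags, and the 15% masking rate are all assumptions, and objects are plain strings here where a real model would use region features.

```python
import random

MASK = "[MASK]"  # hypothetical placeholder for a masked position


def build_sequence(tokens=None, objects=None):
    """Build one input sequence from text, an image, or an image-text pair.

    Text only  -> token sequence
    Image only -> object sequence
    Pair       -> object sequence concatenated with token sequence
    """
    if tokens is None and objects is None:
        raise ValueError("need tokens, objects, or both")
    seq = []
    if objects is not None:
        seq += [("obj", o) for o in objects]
    if tokens is not None:
        seq += [("tok", t) for t in tokens]
    return seq


def mask_sequence(seq, mask_prob=0.15, seed=0):
    """Randomly mask positions; return the masked sequence and the targets.

    mask_prob=0.15 is an assumed value for illustration.
    """
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, (kind, value) in enumerate(seq):
        if rng.random() < mask_prob:
            masked.append((kind, MASK))
            targets[i] = value  # what the model should predict at position i
        else:
            masked.append((kind, value))
    return masked, targets


if __name__ == "__main__":
    pair = build_sequence(tokens=["a", "dog"], objects=["region_0", "region_1"])
    print(mask_sequence(pair, mask_prob=0.5))
```

Because all three data types end up in the same sequence format, the same Transformer and the same mask-prediction objective can be applied to each of them.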
The biggest challenge in unified pre-training across heterogeneous modalities is bridging the semantic gap between them so that their semantic representations can be unified. As illustrated in the accompanying figure, UNIMO proposes a cross-modal contrastive learning method that jointly performs contrastive learning over associated image-text pairs, text data, and image data. Specifically, UNIMO uses text rewriting to augment the image-text pairs, producing a large number of positive examples and hard negative examples. To make better use of the text-only and image-only data, UNIMO also performs text and image retrieval to obtain related images and text as additional positive examples. Using these varied positive examples and high-quality hard negatives, UNIMO contrasts them in a unified semantic space and thereby learns accurately aligned cross-modal semantic representations.
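The idea of contrasting positives against hard negatives in a shared space can be sketched with a generic InfoNCE-style loss. This is a simplified illustration under stated assumptions, not the exact objective from the paper: the function names, the use of cosine similarity, and the temperature value are hypothetical, and the vectors stand in for embeddings that the model would produce.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def contrastive_loss(anchor, positives, negatives, temperature=0.1):
    """InfoNCE-style loss: low when positives are closer to the anchor
    than negatives are. temperature=0.1 is an assumed value."""
    pos = sum(math.exp(cosine(anchor, p) / temperature) for p in positives)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


if __name__ == "__main__":
    image = [1.0, 0.0]          # embedding of an image (made-up numbers)
    caption = [0.9, 0.1]        # matching caption: positive example
    rewritten = [0.2, 0.8]      # rewritten, mismatched caption: hard negative
    print(contrastive_loss(image, [caption], [rewritten]))
```

Minimizing such a loss pulls matching image and text representations together and pushes mismatched ones apart, which is the alignment effect the paragraph above describes.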
UNIMO experimental results
In the experiments, UNIMO is jointly trained on large amounts of text, image, and image-text pair data and evaluated on a range of single-modal and cross-modal downstream tasks. For pre-training data, the text corpus totals 54 GB and includes Wikipedia, BookCorpus, and OpenWebText; the image data consists of 1.7 million images crawled from the Internet; and the image-text pair data includes COCO Caption, Visual Genome, Conceptual Caption, and SBU Caption. Downstream tasks include cross-modal tasks such as image-text retrieval, visual question answering, image caption generation, and visual inference, as well as text tasks such as text classification, reading comprehension, text summarization, and question generation. As for model size, the Base model is built on a 12-layer Transformer and the Large model uses 24 layers.
On cross-modal tasks, the paper mainly compares against recent cross-modal pre-training models such as ViLBERT, UNITER, Oscar, and Villa. The results show that UNIMO consistently outperforms these earlier pre-training models on image-text retrieval (Flickr), visual inference (SNLI-VE), visual question answering (VQA), and image caption generation (COCO Caption), which demonstrates that the unified pre-trained UNIMO model handles a variety of cross-modal tasks effectively.
Notably, UNIMO can also handle text-only tasks. Previous cross-modal pre-training models degrade sharply on text-only tasks, with some tasks dropping by 10 to 20 points or more. UNIMO, by contrast, achieves good results on a range of text understanding and generation tasks, including text classification, text inference, text summarization, reading comprehension, and question generation, surpassing classic text models such as RoBERTa, XLNet, and UniLM.
To verify that unified learning across modalities is necessary in UNIMO, the paper runs ablation experiments. When text data is left out of pre-training, UNIMO's performance on cross-modal tasks drops. When image-text pair data and image data are left out, its performance on text tasks drops as well. This shows that UNIMO's unified learning approach lets textual knowledge and visual knowledge enhance each other and improves task performance.
UNIMO supports a variety of text and cross-modal tasks. It can search for images by text and for text by image, generate text descriptions from pictures, generate pictures from text descriptions, and answer questions about picture content. It also supports language-only tasks such as text reasoning, reading comprehension, and text generation. Results on practical application tasks indicate that UNIMO lets vision and language enhance each other and so delivers better application results. Some of this technology has already begun to be deployed in Baidu Search, helping users find images and videos that better match their needs. Let's look at some example results on real tasks.
Cross-modal retrieval: searching for images by text and for text by image
UNIMO can retrieve relevant images from a text description, or retrieve relevant text descriptions from an image. Judging by the results, UNIMO understands the semantics of both text and images more accurately and retrieves better-matching images or text.
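The article does not say how retrieval candidates are scored. One common approach, assumed here purely for illustration, is to embed the query and the candidates in the same semantic space and rank candidates by cosine similarity. The `retrieve` function, the file names, and the vectors below are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(query_vec, candidates, top_k=1):
    """Rank candidates (name -> embedding) by similarity to the query."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine(query_vec, candidates[name]),
        reverse=True,
    )
    return ranked[:top_k]


if __name__ == "__main__":
    # Made-up image embeddings; a real system would get them from the model.
    images = {"dog.jpg": [0.9, 0.1], "car.jpg": [0.1, 0.9]}
    text_query = [1.0, 0.0]  # made-up embedding of a text query
    print(retrieve(text_query, images))
```

The same function covers both directions: pass a text embedding with image candidates to search images by text, or an image embedding with text candidates to search text by image.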
Cross-modal question answering
UNIMO also supports asking questions about picture content in natural language. It understands the objects and concepts in the picture and combines them with the background knowledge the model has learned to give accurate answers.
Cross-modal generation: generating images from text
UNIMO can generate a corresponding picture from a text description. The results show that UNIMO aligns visual and linguistic attributes and concepts well, which allows it to generate accurate and clear pictures.
With UNIMO, Baidu has for the first time proposed a pre-training method that unifies language and vision. It offers a new paradigm for unified-modal learning and breaks down the boundaries between text, image, and image-text pair data, so that machines can, like people, exploit large-scale heterogeneous modal data, learn language knowledge and visual knowledge that enhance each other, and thereby achieve a general AI capability that integrates perception and cognition. Unified learning over heterogeneous modalities may be one of the key milestones on the road to artificial general intelligence. Baidu plans to do more work on unified-modal learning in the future, so stay tuned.