Recently, a tweet by Turing Award winner and deep learning pioneer Yann LeCun drew wide discussion online.
In the tweet, LeCun wrote: "Deep learning is not as impressive as you think, because it is merely interpolation resulting from curve fitting. But in high-dimensional spaces, there is no interpolation. In high-dimensional spaces, everything is extrapolation."
LeCun was retweeting Harvard cognitive scientist Steven Pinker, who wrote: "The universal approximation theorem nicely explains why neural networks work and why they often don't. Only by understanding Andre Ye's universal approximation theorem can you understand neural networks."
The Andre Ye whom Pinker mentioned is the author of the article "You Don't Understand Neural Networks Until You Understand the Universal Approximation Theorem", which we introduce here. Although the article was published last year, it remains very helpful for understanding neural networks.
In the mathematical theory of artificial neural networks, the universal approximation theorem states that artificial neural networks can approximate arbitrary functions. The theorem usually refers to feedforward neural networks, and the target function is usually a continuous function whose inputs and outputs lie in Euclidean space. However, some studies have extended the theorem to other types of neural networks, such as convolutional neural networks, radial basis function networks, and other specialized architectures.
The theorem means that neural networks can approximate arbitrarily complex functions to any desired accuracy. But it does not tell us how to choose the network's parameters (weights, number of neurons, number of layers, and so on) to achieve the target function we want to approximate.
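To make this concrete, here is a minimal sketch (not from the original article) of the theorem in action: a single hidden layer of sigmoid neurons approximating sin(x) on an interval. For simplicity the hidden weights are random and only the output weights are fitted, by least squares rather than by gradient descent; the width of 100 neurons is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Target: a continuous function on a bounded interval.
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(x)

# One hidden layer of sigmoid neurons with random weights and biases;
# only the output weights are fitted, by least squares.
n_hidden = 100
W = rng.normal(0, 2, (1, n_hidden))
b = rng.normal(0, 2, n_hidden)
H = 1.0 / (1.0 + np.exp(-(x @ W + b)))        # hidden-layer activations
w_out, *_ = np.linalg.lstsq(H, y, rcond=None)

mse = float(np.mean((H @ w_out - y) ** 2))
print(f"MSE with {n_hidden} hidden neurons: {mse:.6f}")
```

Even this crude training scheme drives the error near zero, which is exactly what the theorem promises a wide enough single hidden layer can do.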
In 1989, George Cybenko first stated and proved the universal approximation theorem for feedforward networks with a single hidden layer of arbitrary width and a sigmoid activation function. Two years later, in 1991, Kurt Hornik showed that the choice of activation function is not the key; rather, it is the multilayer, multi-neuron feedforward architecture itself that makes a neural network a universal approximator.
Most importantly, the theorem explains why neural networks seem so clever. Understanding it is a key step toward a deep understanding of neural networks.
Any continuous function on a compact (bounded and closed) set can be approximated by a piecewise function. Take the sine wave between -3 and 3: it can be approximated by three pieces, two quadratic functions and one linear function, as shown in the figure below.
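As a rough illustration (the exact pieces in the article's figure are not given, so the choices below are hypothetical), the two quadratics can be taken as Taylor expansions around the sine's extrema at ±π/2 and the linear piece as y = x near the origin:

```python
import numpy as np

def piecewise_sin(x):
    """Approximate sin(x) on [-3, 3] with two quadratics and one line.
    These particular pieces are illustrative, not taken from the article."""
    if x < -1:                                # quadratic around the trough at -pi/2
        return -1 + (x + np.pi / 2) ** 2 / 2
    elif x > 1:                               # quadratic around the peak at pi/2
        return 1 - (x - np.pi / 2) ** 2 / 2
    else:                                     # linear piece near the origin
        return x

grid = np.linspace(-3, 3, 601)
max_err = max(abs(piecewise_sin(t) - np.sin(t)) for t in grid)
print(f"max error on [-3, 3]: {max_err:.3f}")
```

Three crude pieces already keep the error below 0.2 everywhere on the interval; more pieces would shrink it further.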
Cybenko's characterization, however, is more specific, because the pieces can be constant: the function is essentially fitted with steps. With enough constant regions (steps), we can reasonably estimate the function within a given range.
Based on this approximation, we can build a network that treats neurons as steps. The weights and biases act as "gates" that determine where an input falls and which neurons should fire; a network with enough neurons can simply carve a function into many constant regions to estimate it.
For an input that falls in a neuron's active region, magnifying the weight to a large value drives the final output close to 1 (when computed with the sigmoid function). If the input does not belong to that region, pushing the weight toward negative infinity yields a final output close to 0. Using the sigmoid as a kind of gate that signals whether a neuron is active, any function can be approximated almost perfectly given enough neurons. In multidimensional spaces, Cybenko generalized the idea: each neuron controls a hypercube of the multidimensional input space.
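A quick sketch of this mechanism: with a large weight, a sigmoid neuron behaves like a step function, and subtracting two shifted steps yields a "bump" that is roughly 1 on one region and 0 elsewhere (the weight value 100 and the interval [1, 2] below are illustrative choices, not from the article):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

k = 100.0  # large weight: sigmoid(k * (x - c)) approximates a step at x = c

def bump(x):
    """~1 on [1, 2], ~0 elsewhere: the difference of two sharp sigmoid steps."""
    return sigmoid(k * (x - 1)) - sigmoid(k * (x - 2))

for x in (0.5, 1.5, 2.5):
    print(f"bump({x}) = {bump(x):.4f}")
```

Summing many such bumps, each scaled by the function's value on its region, reproduces the constant-piece construction in Cybenko's proof.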
The key to the universal approximation theorem is not that it captures some intricate mathematical relationship between input and output, but that it splits a complex function, using simple linear operations, into many small, much simpler parts, each handled by one neuron.
Since Cybenko's original proof, many refinements have appeared in the literature, for example versions of the universal approximation theorem for different activation functions (such as ReLU) or different architectures (recurrent networks, convolutional networks, and so on).
Regardless, all of this work revolves around one idea: neural networks gain their power from the number of neurons. Each neuron monitors a pattern or region of the feature space, whose size is determined by how many neurons the network has. With fewer neurons, each one must cover more space, so the approximation degrades. But as neurons are added, whatever the activation function, any function can be stitched together from many small pieces.
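The effect of neuron count can be seen directly. Here is a small sketch using ReLU neurons, with the same simplifying assumptions as before (random hidden weights, least-squares output weights, widths 3 and 50 chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200).reshape(-1, 1)
y = np.sin(x)

def fit_mse(n_hidden):
    """Training MSE of a random-feature ReLU network of the given width."""
    W = rng.normal(0, 2, (1, n_hidden))
    b = rng.normal(0, 2, n_hidden)
    H = np.maximum(0.0, x @ W + b)                 # ReLU hidden layer
    w_out, *_ = np.linalg.lstsq(H, y, rcond=None)
    return float(np.mean((H @ w_out - y) ** 2))

mse_small, mse_large = fit_mse(3), fit_mse(50)
print(f"3 neurons: {mse_small:.4f}   50 neurons: {mse_large:.6f}")
```

With 3 neurons each must cover a wide stretch of the input space, and the fit is coarse; with 50 neurons the function is stitched together from many small linear pieces and the error collapses.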
Generalization and extrapolation
One might point out that the universal approximation theorem is simple, perhaps a little too simple (at least conceptually). Neural networks can recognize digits, generate music, and so on, and often appear intelligent, yet they are really just elaborate approximators.
A neural network is designed to model a complex mathematical function over given data points. Neural networks are good approximators, but if the input falls outside the training range, they break down. This is similar to a finite Taylor series approximation, which can fit a sine wave within a certain range but fails outside it.
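The Taylor analogy is easy to check numerically: a truncated Taylor series for sin tracks the function closely near its expansion point but diverges wildly outside that range, much as a network fails outside its training range (the evaluation points 1 and 8 are arbitrary illustrative choices):

```python
import math

def taylor_sin(x, n_terms=5):
    """Truncated Taylor series of sin(x) around 0 (first n_terms odd powers)."""
    return sum((-1) ** k * x ** (2 * k + 1) / math.factorial(2 * k + 1)
               for k in range(n_terms))

print(f"inside range:  taylor_sin(1) = {taylor_sin(1):.6f}, sin(1) = {math.sin(1):.6f}")
print(f"outside range: taylor_sin(8) = {taylor_sin(8):.1f},  sin(8) = {math.sin(8):.3f}")
```

Near 0 the truncated series is accurate to many decimal places; at x = 8 it is off by more than two orders of magnitude.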
Extrapolation, the ability to make reasonable predictions outside the training range, is not what neural networks are designed for. From the universal approximation theorem we learn that a neural network is not really intelligent but an estimator hidden behind a multidimensional disguise, one that looks unremarkable in two or three dimensions.
The practical significance of the theorem
Of course, the universal approximation theorem assumes that neurons can be added without limit, which is infeasible in practice. It is also impractical to search the network's nearly infinite combinations of parameters for the best-performing one. Moreover, the theorem assumes only a single hidden layer, whereas adding more hidden layers greatly increases the potential for complexity and universal approximation.
Instead, machine learning engineers rely on intuition and experience to design a network architecture suited to the problem at hand, so that it approximates well in high-dimensional space, knowing that such a network exists while also trading off against computational cost.