Deep Learning is a sub-field of Machine Learning that replicates the human brain's ability to learn by example. Deep learning networks use hierarchical representations: they learn increasingly abstract representations of a problem, layer by layer.
For example, in the case of images, the model first learns edges, then corners, colour differences, and finally whole objects. Deep learning has gained a lot of attention lately because it achieves tremendous results in the fields it is applied to. Yet the idea is not new: the first paper describing a neural network was published in 1943, in which Warren McCulloch and Walter Pitts modelled how the brain's neurons work using electrical circuits.
Deep Learning is therefore not a new topic; the main reason behind its recent popularity is the increase in processing speed and memory capacity, the previous lack of which was one of the main obstacles keeping deep networks from the reputation they have today. In essence, deep networks are trained by feeding them large amounts of data, after which the network can make decisions based on what it has learned. This simple idea has led to breakthroughs in many fields, first in Image Classification and later in Text Classification. In Image Classification, Convolutional Neural Networks (CNNs) are the usual choice, as they fit the task naturally and already achieve great results. For Natural Language Processing (NLP) tasks, the state of the art keeps shifting due to the continuous battle between CNNs and RNNs.
Convolutional neural networks (CNNs) of Deep Learning
One of the main and most widely used types of Deep Neural Networks (DNNs) is the CNN. The CNN was originally created for computer vision; however, thanks to its hierarchical representation it has also proven very effective in NLP. Simply put, a basic CNN takes an input of fixed size, an image in the case of computer vision or a fixed number of words in the case of NLP, and performs multiple operations on it in the form of convolutions and max-pooling to produce a feature space, which is then fed to a fully connected layer that acts as the classifier of the network.
The convolutional layer is the basic building block of a CNN: it multiplies the input by weights, sliding a mask of learnable weights over the whole input. The max-pooling layer then reduces the size of the input by keeping only the maximum value in a window that slides over the input. This structure can be seen as a hierarchy in which the input is transformed and reduced in several steps until it gradually reaches the output; each of these steps can serve a function, such as detecting edges in the early layers in the case of images.
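As a minimal sketch of the two operations just described, the following uses plain numpy with hypothetical sizes and random weights (a real CNN would learn the mask by backpropagation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy input: a sequence of 8 positions, each a 4-dimensional vector
# (e.g. word embeddings in NLP).
x = rng.normal(size=(8, 4))

# A single convolutional filter of width 3: a mask of learnable weights
# that slides over the whole input, multiplying and summing at each step.
w = rng.normal(size=(3, 4))
b = 0.1

# Convolution: slide the mask over every valid window of the input.
conv = np.array([np.sum(x[i:i + 3] * w) + b for i in range(8 - 3 + 1)])  # shape (6,)

# Max-pooling with window 2: keep only the maximum in each window,
# reducing the size of the feature map.
pooled = np.array([conv[i:i + 2].max() for i in range(0, 6, 2)])  # shape (3,)

print(conv.shape, pooled.shape)
```

A real network stacks many such filters and repeats convolution and pooling several times, which is exactly the hierarchical reduction described above.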
Recursive neural network (RNN) of Deep Learning
A recursive neural network (RNN) is a kind of deep neural network created by applying the same set of weights recursively over a structured input, producing a structured prediction over variable-size input structures, or a scalar prediction on it, by traversing a given structure in topological order.
RNNs have been successful, for instance, in learning sequence and tree structures in natural language processing, mainly continuous phrase and sentence representations based on word embeddings. Recursive networks were first introduced to learn distributed representations of structure, such as logical terms. Note that in NLP the abbreviation RNN is also commonly used for recurrent neural networks, which apply the same weights over the successive elements of a sequence; it is this sequential variant that is usually contrasted with CNNs.
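The idea of applying one shared set of weights recursively over a structured input can be sketched as follows. The parse tree, word vectors, and dimensions are hypothetical, and a real recursive network would learn the composition matrix by backpropagation:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4  # embedding dimension (hypothetical)

# One shared set of composition weights, applied at every tree node.
W = rng.normal(scale=0.5, size=(d, 2 * d))
b = np.zeros(d)

def compose(left, right):
    """Apply the same weights to merge two child vectors into a parent vector."""
    return np.tanh(W @ np.concatenate([left, right]) + b)

# Toy word vectors for a four-word sentence (hypothetical values).
words = {w: rng.normal(size=d) for w in ["the", "movie", "was", "great"]}

# Traverse the parse tree ((the movie) (was great)) bottom-up, in
# topological order, reusing the same weights at each step:
np_phrase = compose(words["the"], words["movie"])
vp_phrase = compose(words["was"], words["great"])
sentence = compose(np_phrase, vp_phrase)

print(sentence.shape)  # a fixed-size vector for a variable-size structure
```

Because the same weights are reused at every node, the network handles trees of any shape while producing a fixed-size representation at the root.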
The choice between CNN and RNN of Deep Learning
Theoretically speaking, a CNN is a hierarchical network and an RNN is a sequential network. This means a CNN should be better at feature-extraction tasks such as sentiment analysis, where a sentence can be understood from a few key phrases, while an RNN should be better at sequence-modelling tasks such as Machine Translation, as it can capture the order and flow of a sentence rather than just its features. In practice, however, this does not always hold: on some datasets a CNN may perform better and on others an RNN may, even for the same task. In addition, even when one model outperforms the other, the difference between them is usually small, so it is worth considering other factors, such as training and classification speed, when choosing between the two models.
One of the main problems with using sparse representations with neural networks is that the dimensionality of the sparse vector is not constant; it depends on the size of the vocabulary of the text. It was therefore important to find a method that solves this problem, which is why Word Embeddings were proposed. Many forms of word embeddings exist, but the most common and widely used to date is word2vec. The word2vec method exploits the fact that similar words occur in similar contexts, building a one-layer network to predict the similarity between words.
Afterwards, only the weights of this network are reused in the main neural network used for classification. Each word is then represented by a vector of continuous numbers rather than 0's and 1's, and similar words lie next to each other in this vector space. Moreover, the authors made a remarkable observation: semantic relations are preserved. For example, taking the vector of "King", subtracting the vector of "man", and adding the vector of "woman" yields a result close to the vector of "Queen", which means similar relationships between words are kept.
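The "King" analogy can be illustrated with hand-made toy vectors in which one axis encodes gender and the other royalty; real word2vec vectors are learned from text, not constructed like this:

```python
import numpy as np

# Hand-made toy embeddings: axis 0 encodes gender, axis 1 royalty.
# These values are illustrative only, not learned embeddings.
vocab = {
    "king":     np.array([1.0, 1.0]),
    "queen":    np.array([0.0, 1.0]),
    "man":      np.array([1.0, 0.0]),
    "woman":    np.array([0.0, 0.0]),
    "prince":   np.array([1.0, 0.9]),
    "princess": np.array([0.0, 0.9]),
}

def nearest(v, exclude):
    """Vocabulary word whose vector is closest to v (Euclidean distance)."""
    return min((w for w in vocab if w not in exclude),
               key=lambda w: np.linalg.norm(vocab[w] - v))

# king - man + woman = [0, 1], which is exactly the toy vector of "queen":
result = nearest(vocab["king"] - vocab["man"] + vocab["woman"],
                 exclude={"king", "man", "woman"})
print(result)  # queen
```

The subtraction removes the "male" component and the addition restores the "female" one, leaving the royalty component untouched; in real embeddings the same offsets emerge approximately rather than exactly.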
This holds not only for word meanings but also for grammatical structure: the difference between "craziness" and "crazy" is nearly the same as the difference between "brightness" and "bright". A problem arose, however, when trying to represent a whole text or sentence, where the only option was to average or max-pool the word vectors of the sentence to get a single vector representing the whole sentence or text; this approach leads to a loss of information and slightly worse accuracy. To solve this, the Paragraph Vector was introduced, which embeds a whole paragraph into a vector, also using a neural network.
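The two simple pooling options mentioned above can be sketched as follows, using hypothetical word vectors:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical pretrained word vectors for a short sentence.
sentence = ["deep", "learning", "is", "effective"]
vectors = {w: rng.normal(size=5) for w in sentence}
stacked = np.stack([vectors[w] for w in sentence])  # shape (4, 5)

# Averaging: one vector for the whole sentence; word order is lost.
avg_vec = stacked.mean(axis=0)

# Max-pooling: keep the strongest activation per dimension instead.
max_vec = stacked.max(axis=0)

print(avg_vec.shape, max_vec.shape)  # both (5,)
```

Both reductions discard word order, which is precisely the information loss the Paragraph Vector was designed to avoid.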
Furthermore, both word and paragraph vectors use backpropagation with stochastic gradient descent to learn the weights of the network. An advantage of using word embeddings is that they generalize over the whole vocabulary, and it is not necessary to train a network every time, as several reliable pre-trained embeddings are available online. When using such an embedding for pre-training, there are two options: static, where the weights of the embedding are fixed and not updated while training the classifier network, and dynamic, where the weights are updated during training and the embedding only serves as an initialization. Each method has its own advantages: in some cases the static variant gives better results, and in other cases it is the other way around.
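The static/dynamic distinction can be sketched as a single flag on the embedding update; the matrix sizes and the gradient below are placeholders, not values from any real training run:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical pretrained embedding matrix (vocabulary of 6 words, dim 4).
pretrained = rng.normal(size=(6, 4))

def train_step(embeddings, grad, lr=0.1, static=True):
    """One gradient step on the embedding table.

    static=True:  weights stay fixed (pretrained vectors used as-is).
    static=False: weights are fine-tuned along with the classifier.
    """
    if static:
        return embeddings
    return embeddings - lr * grad

# Placeholder gradient standing in for the classifier's backpropagated signal.
grad = rng.normal(size=pretrained.shape)

static_emb = train_step(pretrained, grad, static=True)
dynamic_emb = train_step(pretrained, grad, static=False)

print(np.allclose(static_emb, pretrained))   # True: unchanged
print(np.allclose(dynamic_emb, pretrained))  # False: updated
```

In a real framework the same choice is usually a one-line freeze flag on the embedding layer rather than an explicit branch like this.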
Document Classification with Neural Networks
Contrary to what most people believe, Document Classification is not limited to recurrent neural networks (RNNs). It is true that RNNs are designed for sequential input, and text is one such sequence. However, in some cases Convolutional Neural Networks (CNNs) achieve better performance than their RNN rivals.
The choice of the proper network depends entirely on the problem itself. What is certain is that RNNs are slower and harder to train than CNNs, which counts in favour of the CNNs. Moreover, in some cases a combination of CNNs and RNNs leads to a better solution than either of them alone.
Other networks to consider for document classification are Long Short-Term Memory networks (LSTMs) and their bidirectional variant (Bi-LSTM), both of which are in fact variations of the RNN. They have been shown to achieve better results on some problems.
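A single LSTM time step can be sketched in numpy to show the gating mechanism that distinguishes it from a plain recurrent cell; all dimensions and weights here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
d_in, d_h = 3, 4  # input and hidden sizes (hypothetical)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One weight matrix per gate: forget, input, output, and candidate cell.
Wf, Wi, Wo, Wc = (rng.normal(scale=0.5, size=(d_h, d_in + d_h)) for _ in range(4))

def lstm_step(x, h, c):
    """One LSTM step: gates decide what to forget, store, and expose."""
    z = np.concatenate([x, h])
    f = sigmoid(Wf @ z)                  # forget gate
    i = sigmoid(Wi @ z)                  # input gate
    o = sigmoid(Wo @ z)                  # output gate
    c_new = f * c + i * np.tanh(Wc @ z)  # updated cell state
    h_new = o * np.tanh(c_new)           # new hidden state
    return h_new, c_new

# Run over a toy sequence; the cell state carries information forward,
# which is what lets LSTMs remember over long spans of text.
h, c = np.zeros(d_h), np.zeros(d_h)
for x in rng.normal(size=(5, d_in)):
    h, c = lstm_step(x, h, c)

print(h.shape)
```

A Bi-LSTM simply runs a second such cell over the sequence in reverse and concatenates the two hidden states at each position.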