Recurrent Neural Networks
The previous chapter showed how a deep learning model, specifically the CNN, could be applied to images. The process could be decoupled into a feature extractor that figures out the optimal hidden-state representation of the input (in this case a vector of feature maps) and a classifier (typically a fully connected layer). This chapter focuses on the hidden-state representation of other forms of data and explores RNNs.

RNNs are especially useful for analyzing sequences, which makes them particularly helpful for natural language processing and time series analysis. Even images can be thought of as a subset of sequence data: if we shuffle the rows and columns (or channels), the image becomes unrecognizable. This is not the case for spreadsheet data, for example. CNNs, however, have a very weak notion of order, and the kernel size for a convolution is typically in the single digits. As these convolutions are stacked on top of each other, the receptive field increases, but the signal also gets dampened. This means that CNNs typically capture only local spatial relationships, such as a nose or an eye. In Figure 7-1, we can imagine that we have shuffled an image, preserving order only within local groups, but most CNNs will still classify it the same, even though the result makes no sense overall.
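As a rough illustration of this weak notion of order, the sketch below shuffles an image in local blocks and asks a pretrained ImageNet CNN for its prediction. This is not code from the chapter: a stock Keras ResNet50 stands in for the network used in Figure 7-1, and the 56-pixel block size and the file name cat.jpg are arbitrary assumptions.

import numpy as np
from tensorflow.keras.applications.resnet50 import (
    ResNet50, preprocess_input, decode_predictions)
from tensorflow.keras.preprocessing import image

def shuffle_in_blocks(img_array, block=56):
    # Shuffle the rows and columns of a 224x224 image in groups of
    # `block` pixels, so order is preserved only within each local block.
    n = img_array.shape[0] // block
    row_order = np.random.permutation(n)
    col_order = np.random.permutation(n)
    rows = np.concatenate(
        [img_array[i * block:(i + 1) * block] for i in row_order], axis=0)
    return np.concatenate(
        [rows[:, j * block:(j + 1) * block] for j in col_order], axis=1)

model = ResNet50(weights="imagenet")
img = image.load_img("cat.jpg", target_size=(224, 224))  # any test image
x = image.img_to_array(img)

for name, arr in [("original", x), ("block-shuffled", shuffle_in_blocks(x))]:
    batch = preprocess_input(np.expand_dims(arr.copy(), axis=0))
    preds = model.predict(batch)
    print(name, decode_predictions(preds, top=1)[0])

In practice the top prediction often survives this kind of local shuffling, which is exactly the weakness the figure is meant to show.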
Figure 7-1. CNNs have a weak concept of order, as can be seen by applying ResNet-121 trained on ImageNet to a shuffled image

For some other forms of data, the relationship between members of the sequence becomes even more important. Music, text, time series data, and more all depend heavily on a clear representation of history. For example, the sentence "I did not watch this movie yesterday but I did really like it" differs from "I did watch this movie yesterday but I did not really like it," or even "This is a lie—I really did not like the movie I watched yesterday." Not surprisingly, word order is key. For a CNN to capture a relationship across so many words, the kernel size has to be much larger than the number of hidden units required for an RNN to capture the same relationship (and at some point, it will no longer be possible).

To see why we need a new deep learning structure for these kinds of sequences, let's first examine what happens if we try to hack together a basic neural network to predict the last digit of a sequence. If we imagine that we have sequences of numbers (from 0-9) such as [0, 1, 2, 3, 4] and [9, 8, 7, 6, 5], we can represent each number as a 10-dimensional vector
that is one-hot encoded. For example, the number 2 could be encoded as [0, 0, 1, 0, 0, 0, 0, 0, 0, 0] and 6 as [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]. To train a network to predict the last digit of the sequence, we can attempt two different approaches. First, we can concatenate the four one-hot encoded vectors.
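To make this setup concrete, here is a minimal sketch of the encoding step, assuming NumPy; the helper name one_hot and the variable names are illustrative rather than taken from the chapter's code.

import numpy as np

def one_hot(digit, num_classes=10):
    # Encode a digit 0-9 as a 10-dimensional one-hot vector.
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[digit] = 1.0
    return vec

sequences = [[0, 1, 2, 3, 4], [9, 8, 7, 6, 5]]

# First approach: concatenate the four one-hot vectors of each sequence
# into a single 40-dimensional input; the fifth digit is the target.
X = np.stack([np.concatenate([one_hot(d) for d in seq[:-1]])
              for seq in sequences])
y = np.array([seq[-1] for seq in sequences])

print(one_hot(2))  # [0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]
print(X.shape, y)  # (2, 40) [4 5]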