From the course: Artificial Intelligence Foundations: Neural Networks

Convolutional neural networks (CNN)

- [Instructor] In previous videos, we've defined neural networks and looked at their components and how neural networks learn patterns in data. We specifically looked at multilayer perceptrons. In this video, we'll look at three additional neural network types. After completing this introductory video, you will be able to describe convolutional neural networks, commonly referred to as CNNs, describe recurrent neural networks, commonly referred to as RNNs, and describe transformer neural networks. There are many other types of neural networks, but they are out of scope for our course.

Let's begin by describing use cases for convolutional neural networks. Image classification and object detection, as in self-driving cars or determining whether a person is smiling, are all examples of use cases for convolutional neural networks. CNNs have other use cases as well, such as processing natural language and other complex image classification problems.

In an earlier video, you learned about the feed-forward neural network, also known as the multilayer perceptron, where you have a series of fully connected layers. But this type of network is not suitable for the use cases previously mentioned. And why is that? A digital image is an array of pixels, and each pixel is characterized by its x, y coordinates and its value. Shown here as discrete points on a rectangular grid is a digital representation of a 2D image on the left and a 3D image on the right. Machines view images as 2D arrays of numbers in order to decipher them. If we include color, the image becomes a 3D array. To a computer, an image is just another array of numbers. Each object has its own pattern, and that is what the computer will use to identify an object in an image.

The machine takes this image as input and provides a classification output, similar to the process followed by the human brain: is it a one or not? In this image, notice that capturing the rows alone requires four inputs into four nodes, just for the number one. What if the machine had to recognize an eight, or determine whether a color image of a dog was a dog or a cat? Images represent a large input for a neural network. They can have hundreds of thousands of pixels and up to three color channels. In a classic, fully connected network, this requires a huge number of connections and network parameters, so the classic neural network architecture is inefficient for computer vision tasks.

This is where the feed-forward convolutional neural network architecture comes into play. CNNs take a regular image as input and provide a classification output, similar to the process followed by the human brain. This image shows a CNN predicting cat as an output when presented with an image of both a dog and a cat. CNNs contain five types of layers: input, convolution, pooling, fully connected, and output. The input layer holds the raw pixel values of an image. The input, convolution, and pooling layers are where feature extraction is done.

Think of the convolutional layer as a filter that passes over the image, scanning a few pixels at a time. The convolutional layer is the core building block of a CNN, and it is where the majority of computation occurs. It requires a few components: input data, a filter, and a feature map. It extracts different features from an image, such as the cheeks, ears, nose, and eyes, and then creates a feature map. When we talk about filters in convolutional neural networks, we are talking about the weights. These are the feature detectors. These filters determine which pixels or parts of the image the model will focus on. The filter is always smaller than the input data, and a dot product is performed between the filter and the patch of input it covers. A feature map is what we get after a filter has passed over the pixel values of an input image. It is what the convolutional layer sees after passing the filter over the image, and this sliding dot product is what we call a convolution operation in deep learning. If we add numbers to the input map, the dot product of the input map and the convolution filter on the first scan would look like the image shown here. After a full scan of the input, you can see that a five-by-five input is reduced to a three-by-three feature map.
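To make that convolution operation concrete, here is a minimal NumPy sketch of the sliding dot product, using a five-by-five input and a three-by-three filter to match the example above. The pixel and filter values are made up for illustration; in a real CNN, the filter weights are learned during training.

    import numpy as np

    # A hypothetical 5x5 single-channel input (values made up for illustration).
    image = np.array([
        [3, 0, 1, 2, 7],
        [1, 5, 8, 9, 3],
        [2, 7, 2, 5, 1],
        [0, 1, 3, 1, 7],
        [4, 2, 1, 6, 2],
    ])

    # A 3x3 filter: the weights, or feature detector.
    kernel = np.array([
        [1, 0, -1],
        [1, 0, -1],
        [1, 0, -1],
    ])

    # Slide the filter over the image one pixel at a time (stride 1) and take
    # the dot product at each position. A 5x5 input and a 3x3 filter yield a
    # 3x3 feature map, as described above.
    out_size = image.shape[0] - kernel.shape[0] + 1
    feature_map = np.zeros((out_size, out_size))
    for i in range(out_size):
        for j in range(out_size):
            patch = image[i:i + 3, j:j + 3]
            feature_map[i, j] = np.sum(patch * kernel)

    print(feature_map)  # 3x3 array of feature activations

Each entry of the feature map is the dot product of the filter with the patch of the image directly beneath it, which is exactly the scan described above.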
The pooling layer takes small rectangular blocks from the convolutional layer and subsamples each one to produce a single output from that block. There are several ways to do this pooling, such as taking the average, the maximum, or a learned linear combination of the neurons in the block. Essentially, pooling layers reduce the spatial size of the input, making it easier to process the data, and they help reduce the number of parameters and computations present in the network. There are two main types of pooling: max pooling and average pooling. Shown here is max pooling, which takes the maximum value from each block of the feature map, and average pooling, which takes the average value. Pooling layers are typically used after convolutional layers in order to reduce the size of the input before it is fed into a fully connected layer.

The flatten layer takes the output of the previous layers and turns it into a single vector that can be input to the next stage. A fully connected layer then takes all the neurons in the previous layer and connects them to every single one of its own neurons. Fully connected layers are typically used towards the end of a CNN, when the goal is to take the features learned by the previous layers and use them to make predictions. For example, if we were using a CNN to classify images of animals, the final fully connected layer might take the features learned by the previous layers and use them to classify an image as containing a dog, cat, bird, or duck.

CNNs are often used for image recognition and classification tasks. For example, CNNs can be used to identify objects in an image or to classify an image as being a cat or a dog. CNNs can also be used for more complex tasks, such as generating descriptions of an image or identifying points of interest in an image. CNNs can even be used for time series data, such as audio or text.
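Putting the five layer types together, here is a minimal sketch of such an animal classifier using the Keras API from TensorFlow. The input size, layer widths, and the four classes follow the dog, cat, bird, duck example above, but the specific values are illustrative assumptions, not settings from the course.

    import tensorflow as tf

    # A minimal CNN with the five layer types described above: input,
    # convolution, pooling, fully connected (flatten + dense), and output.
    # Layer sizes here are illustrative choices, not values from the course.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(64, 64, 3)),                       # input: 64x64 RGB image
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu"),   # convolution: feature extraction
        tf.keras.layers.MaxPooling2D((2, 2)),                    # pooling: downsample feature maps
        tf.keras.layers.Flatten(),                               # flatten feature maps into one vector
        tf.keras.layers.Dense(64, activation="relu"),            # fully connected layer
        tf.keras.layers.Dense(4, activation="softmax"),          # output: dog, cat, bird, or duck
    ])

    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.summary()  # prints the layer stack and parameter counts

Note how the convolution and pooling layers do feature extraction, while the flatten and dense layers at the end turn those features into a prediction, mirroring the pipeline described in this video.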
