Are Convolutional Networks obsolete?

The new networks on the block

Computer Vision has arguably been dominated by convolutional networks since 1998, when Yann LeCun's LeNet beat all other models on digit recognition - an event many consider the start of the modern deep learning era.

Vision Transformers (ViTs) finally beat CNNs at image classification in 2020. The runaway success of the Transformer architecture has since spawned multiple AI startups aiming to turn AI into profitable consumer products.

So the question is, should you forget about CNNs?

ConvNext : Making convolutions great again

Not quite. CNNs returned to the top spot in image classification in 2022.

The authors called their new architecture ConvNext - a convnet for the 2020s. They identified the major differences between the best practices for CNNs and ViTs and tried to bring the strengths of ViTs into CNNs. Some of the changes were as follows (a rough sketch of the resulting block follows the list) -

1) Using depthwise convolutions (spatial convolutions that act on each channel separately, followed by a 1x1 convolution to mix the channels)

2) A more efficient ordering of the bottleneck layers, yielding the same or a larger number of channels with less computation

3) Increasing the kernel size from 3x3 to 7x7

4) Fewer activation layers

5) Replacing Batch Normalization (BN) with Layer Normalization (LN), and reducing the number of normalization layers to one per block

6) Using a separate normalization + downsampling layer between ResNet blocks.
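
To make these changes concrete, here is a minimal PyTorch-style sketch of what such a block might look like. The channel count, expansion factor, and names are my own illustrative choices rather than the paper's official code, but the structure (7x7 depthwise convolution, a single LayerNorm, 1x1 expand/project, one GELU) follows the list above.

```python
import torch
import torch.nn as nn

class ConvNextStyleBlock(nn.Module):
    """Rough sketch of a ConvNext-style block (not the official implementation)."""
    def __init__(self, dim=96, expansion=4):
        super().__init__()
        # 1) + 3): depthwise 7x7 convolution - one filter per channel
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        # 5): a single LayerNorm per block, applied over the channel dimension
        self.norm = nn.LayerNorm(dim)
        # 2): 1x1 convolutions (written as linear layers) that expand then project the channels
        self.pwconv1 = nn.Linear(dim, expansion * dim)
        self.act = nn.GELU()                     # 4): the only activation in the block
        self.pwconv2 = nn.Linear(expansion * dim, dim)

    def forward(self, x):                        # x: (N, C, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                # to (N, H, W, C) so LayerNorm/Linear act on channels
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        x = x.permute(0, 3, 1, 2)                # back to (N, C, H, W)
        return x + residual
```

Note that everything between the depthwise convolution and the residual addition acts per spatial position, which is why the 1x1 convolutions can be written as plain linear layers.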

You might think many of these changes seem minor, bordering on trivial. I agree - I don't trust random hyperparameter tweaks that yield 0.5% improvements. So instead, I will distill a few broader intuitions about CNNs that we already had and that this paper confirms.

Intuition 1 : Larger embeddings are better

Many of the changes made in ConvNext are actually driving towards one specific effect that should come as no surprise: larger vector spaces are better than smaller ones. Of course, larger vector spaces increase computational cost, so clever optimizations are required to make them tractable.

ResNet introduced bottleneck layers, where the spatial convolutions use a small number of channels for efficiency and 1x1 convolutions increase the number of channels to get a highly expressive vector space. The ConvNext paper further optimizes the ordering of these layers to get larger vector spaces for fewer FLOPs.
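
As a rough illustration of this trade-off, here is a back-of-the-envelope multiply-accumulate count per output pixel. The channel counts are my own assumption, loosely inspired by a ConvNext-T-sized stage; the exact figures are not from the paper.

```python
# Multiply-accumulates (MACs) per output pixel for two ways of spending compute.

def dense_conv_macs(k, c_in, c_out):
    """MACs per spatial position for a standard k x k convolution."""
    return k * k * c_in * c_out

def inverted_bottleneck_macs(k, dim, expansion=4):
    """Depthwise k x k conv + 1x1 expand + 1x1 project."""
    depthwise = k * k * dim              # one k x k filter per channel
    expand = dim * expansion * dim       # 1x1: dim -> 4*dim
    project = expansion * dim * dim      # 1x1: 4*dim -> dim
    return depthwise + expand + project

dim = 96
print("dense 3x3, 96 -> 96 channels:   ", dense_conv_macs(3, dim, dim))      # 82944
print("depthwise 7x7 + 1x1s (4x wider):", inverted_bottleneck_macs(7, dim))  # 78432
```

With these assumed numbers, the depthwise-plus-1x1 ordering touches a hidden space four times wider while costing slightly fewer multiply-accumulates than an ordinary 3x3 convolution that never leaves 96 channels.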

Intuition 2 : Hacks work until they don't

I hate Batch Normalization.

It is the hackiest of hacks, with very little justification. Changing feature values based on a randomly sampled mini-batch never made sense to me.

If you are lucky, it should do nothing. If you have truly out-of-distribution test data (e.g. medical data), BN often decreases performance.

But the worst outcome is what actually happened: BN improved training in some cases due to peculiar idiosyncrasies of the datasets and architectures, and as a result, millions of hours and dollars were spent studying and optimizing a random hack.

The ConvNext authors found that using LN beats BN (normalizing each sample independently of the rest of the batch is far simpler and makes logical sense). Hopefully, I will never have to deal with BN again.
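
A quick PyTorch sketch of why LN feels more principled to me: in training mode, BN's output for a given image depends on whatever else happens to be in the mini-batch, while LN's does not. This is just an illustrative check of the two normalization layers, not code from the paper.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
C = 8
x = torch.randn(4, C, 16, 16)          # a "batch" of 4 images

# LayerNorm over the channel dimension: each spatial position of each image
# is normalized on its own, so the result cannot depend on the rest of the batch.
ln = nn.LayerNorm(C)
ln_batch = ln(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
ln_single = ln(x[:1].permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
print(torch.allclose(ln_batch[:1], ln_single))   # True

# BatchNorm in training mode uses statistics of whatever mini-batch it sees,
# so the same image gets different features depending on its batch-mates.
bn = nn.BatchNorm2d(C).train()
bn_batch = bn(x)
bn_single = bn(x[:1])
print(torch.allclose(bn_batch[:1], bn_single))   # False
```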

Intuition 3 : Quality vs Quantity of non-linearity

When I was first learning about neural networks, the general wisdom was to stuff in as many non-linear activation layers as possible. All the magic was in the non-linear activations, and more was supposedly better. The first part is more or less correct: without non-linearities, the best you can do is a simple linear regression.

The second idea, that more activations are better, comes from the multilayer perceptron, in other words a fully connected neural network. Two fully connected layers without a non-linear activation between them can always be rewritten as a single layer. Therefore, every layer must have an activation, or you are just wasting compute for no gain.
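
A tiny NumPy check makes the collapse explicit; the layer sizes here are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(16)

# Two stacked linear layers with no activation in between...
W1, b1 = rng.standard_normal((32, 16)), rng.standard_normal(32)
W2, b2 = rng.standard_normal((8, 32)), rng.standard_normal(8)
two_layers = W2 @ (W1 @ x + b1) + b2

# ...are exactly equivalent to a single linear layer.
W, b = W2 @ W1, W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))   # True
```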

In principle, the same is true of convolutional layers. However, convolutional layers are more flexible than dense layers: we have spatial, depthwise, and 1x1 convolutions, bottlenecks, and normalization layers. Recall that a spatial convolution with many channels can be made cheaper by using a spatial convolution with fewer channels followed by a 1x1 convolution to increase the number of channels. In such a case, we might be better off treating this two-layer system as one effective convolution and adding only one activation after it.

This is what the ConvNext authors found: they retain only one activation per ResNet block, as seen in the image above.

Intuition 4 : Convolutions still got it

The ConvNext paper shows that CNNs are not obsolete, and are arguably still the best at image classification. The authors note that the success of ViTs can be attributed to their flexibility, which allows them to borrow many of the inductive biases of CNNs. Some neuroscience experiments suggest that human and animal brains might also have similar inductive biases in their visual cortex.

All these observations suggest that even if CNNs eventually do become obsolete, it will likely be because some new architecture can mimic CNNs where needed and generalize beyond them where possible.

————————————————————————————-

There you have it: my intuitions for why CNNs are still great at vision tasks and what we have learnt about improving them. For more thoughts on AI/ML, subscribe below and follow me on Twitter.