Depthwise Separable Convolutions

Recently(ish) there has been a development called "depthwise separable convolutions." Unfortunately, it's hard to find a resource that explains clearly what they are. You could just go off of the formulae, of course, but those look like a pain. Here's a description in words.

Let's say our input is 128x128x16.

Regular convolution: We can think of a 3x3 convolution (with "same" padding) as follows. A single kernel has size 3x3x16 (matching the depth of the input). We walk it over our input as usual (over each of the 128x128 positions). At each step, each element in the kernel is multiplied by the corresponding element in the input, and all 3*3*16=144 elements are summed to produce one output. After walking over the whole input, we have a feature map of size 128x128x1. If we do this 32 times (that is, use 32 kernels), we stack them to get an output of 128x128x32.
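The description above can be sketched directly in NumPy. This is a minimal, deliberately slow reference implementation (the function name and loop structure are my own, not from any library); the point is that each output value is a sum over all 3*3*16 products:

```python
import numpy as np

def conv2d_same(x, kernels):
    """Regular 'same' convolution, written out naively.
    x: (H, W, C_in); kernels: (K, K, C_in, C_out) with K odd."""
    H, W, C_in = x.shape
    K, _, _, C_out = kernels.shape
    p = K // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))  # "same" padding
    out = np.zeros((H, W, C_out))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + K, j:j + K, :]  # (K, K, C_in) window
            # Each output channel sums ALL K*K*C_in products into one number.
            out[i, j, :] = np.tensordot(patch, kernels, axes=([0, 1, 2], [0, 1, 2]))
    return out
```

With an input of shape (128, 128, 16) and kernels of shape (3, 3, 16, 32), this returns the (128, 128, 32) output described above (smaller shapes work the same way).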

Pointwise convolution: This is just a regular convolution, but our kernels are always 1x1(x16).
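One nice consequence of the 1x1 kernel: a pointwise convolution is just a matrix multiply applied independently at every pixel. A minimal sketch (function name is illustrative):

```python
import numpy as np

def pointwise_conv2d(x, kernels):
    """Pointwise (1x1) convolution.
    x: (H, W, C_in); kernels: (C_in, C_out).
    Each pixel's channel vector is multiplied by the same matrix."""
    return x @ kernels  # matmul broadcasts over the H and W dimensions
```

So a pointwise convolution mixes information across channels, but never across spatial positions.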

Depthwise convolution: We use a single kernel. Say it has size 3x3x16. We walk it over the input as usual, but instead of summing all 3*3*16 elements at each step, we sum only the 3*3 elements within each layer, and keep each layer's result in its own output layer. Thus we get a 128x128x16 output even though we have only one kernel.
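In code, the only change from the regular convolution is which axes get summed: we multiply elementwise and sum over the spatial axes only, leaving the channel axis alone. A minimal sketch (again, naming is my own):

```python
import numpy as np

def depthwise_conv2d_same(x, kernel):
    """Depthwise 'same' convolution with a single kernel.
    x: (H, W, C); kernel: (K, K, C) with K odd."""
    H, W, C = x.shape
    K = kernel.shape[0]
    p = K // 2
    xp = np.pad(x, ((p, p), (p, p), (0, 0)))
    out = np.zeros((H, W, C))
    for i in range(H):
        for j in range(W):
            patch = xp[i:i + K, j:j + K, :]
            # Sum over the K x K spatial window ONLY -- no sum over channels,
            # so each input layer produces its own output layer.
            out[i, j, :] = (patch * kernel).sum(axis=(0, 1))
    return out
```

Note the contrast with the regular convolution: spatial mixing happens, but the 16 channels never talk to each other.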

Depthwise separable convolution: This is just a depthwise convolution followed by a pointwise convolution.
