
I understand convolution filters when applied to an image (e.g. a 224×224 image with 3 in-channels transformed by 56 filters of 5×5 conv into a 224×224 image with 56 out-channels). The key is that there are 56 different filters, each with 5×5×3 weights, which together produce an output image of 224×224, 56 (the term after the comma is the number of output channels).
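The 2D case described above can be sketched in PyTorch like this (a minimal sketch; the 224×224, 3-channel, 56-filter numbers come from the paragraph above, and the batch size of 1 is my own choice):

```python
import torch
import torch.nn as nn

# 56 filters, each 5x5x3 -> 56 output channels.
# padding=2 keeps the 224x224 spatial size with a 5x5 kernel.
conv = nn.Conv2d(in_channels=3, out_channels=56, kernel_size=5, padding=2)

x = torch.randn(1, 3, 224, 224)  # [batch, in_channels, height, width]
y = conv(x)
print(y.shape)            # torch.Size([1, 56, 224, 224])
print(conv.weight.shape)  # torch.Size([56, 3, 5, 5]) -- one 5x5x3 kernel per filter
```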

But I can't seem to understand how a conv1d filter works in seq2seq models on a sequence of characters. One of the models I was looking at, https://arxiv.org/pdf/1712.05884.pdf, has a "post-net layer [...] comprised of 512 filters with shape 5×1" that operates on an 80-d spectrogram frame (i.e. 80 different float values per frame), and the result of the filter is a 512-d frame.

  • I don't understand what in_channels and out_channels mean in the PyTorch conv1d definition. For images I can easily understand what in-channels/out-channels mean, but for a sequence of 80-float-value frames I'm at a loss. What do they mean in the context of a seq2seq model like the one above?

  • How do 512 filters of shape 5×1 applied to 80 float values produce 512 float values?

  • Wouldn't a 5×1 filter operating on 80 float values just produce 80 float values (by taking 5 consecutive values at a time out of those 80)? How many weights in total do these 512 filters have?

The layer when printed in pytorch shows up as:

(conv): Conv1d(80, 512, kernel_size=(5,), stride=(1,), padding=(2,))

and the parameters in this layer show up as:

postnet.convolutions.0.0.conv.weight : 512x80x5 = 204800
  • Shouldn't the weights in this layer instead be 512*5*1 as it only has 512 filters each of which is 5x1?
desertnaut
Joe Black

1 Answer


Intro explanation

Basically, Conv1d is just like Conv2d, but instead of "sliding" a rectangular window across the image (say 3×3 for kernel_size=3) you "slide" a kernel (say of size 3) across a vector (say of length 256). This is the case when in_channels and out_channels are both equal to 1, which is the basic one.
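That basic case can be sketched like this (a minimal sketch; the length-256 vector and kernel size 3 match the example numbers above, padding=1 is my own choice to keep the length unchanged):

```python
import torch
import torch.nn as nn

# Simplest case: one input channel, one output channel,
# a size-3 kernel sliding along a length-256 vector.
conv = nn.Conv1d(in_channels=1, out_channels=1, kernel_size=3, padding=1)

x = torch.randn(1, 1, 256)  # [batch, in_channels, length]
y = conv(x)
print(y.shape)  # torch.Size([1, 1, 256])
```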

Below you can see Conv1d sliding across 3 in_channels (x-axis, y-axis, z-axis) over time steps.

1D Convolution

You could add depth to the kernel (just like you did for 2D convolution with the 5×5×3 cube), which would make it 5×3 (5 is the kernel size, 3 is the number of in_channels). There can be out_channels of those kernels (e.g. 56 out_channels), so the final produced sequence is 56 × sequence_length.
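A sketch of that 3-in-channel, 56-out-channel case (the sequence length of 100 and padding=2 are my own choices for illustration):

```python
import torch
import torch.nn as nn

# kernel_size=5 with 3 in_channels -> each filter is 5x3;
# 56 such filters -> 56 output channels.
conv = nn.Conv1d(in_channels=3, out_channels=56, kernel_size=5, padding=2)

x = torch.randn(1, 3, 100)  # [batch, in_channels, sequence_length]
y = conv(x)
print(y.shape)            # torch.Size([1, 56, 100])
print(conv.weight.shape)  # torch.Size([56, 3, 5]) -- one 5x3 kernel per filter
```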

Questions

[...] post-net layer is comprised of 512 filters with shape 5×1 [...] that operates on a spectrogram frame 80-d (means 80 different float values in the frame), and the result of filter is a 512-d frame.

So your input is 80-d (instead of 3 axes like above), kernel_size is the same (5) and out_channels is 512. So the input could look something like [64, 80, 256] (for [batch, in_channels, length]) and the output would be [64, 512, 256] (provided padding of 2 was used on both sides, as in the printed layer).
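This matches the layer printed in the question; a minimal sketch checking the shapes (the batch size of 64 and length of 256 are just the example numbers above):

```python
import torch
import torch.nn as nn

# The printed layer: Conv1d(80, 512, kernel_size=(5,), stride=(1,), padding=(2,))
conv = nn.Conv1d(in_channels=80, out_channels=512, kernel_size=5, stride=1, padding=2)

x = torch.randn(64, 80, 256)  # [batch, in_channels, length]
y = conv(x)
print(y.shape)  # torch.Size([64, 512, 256])
```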

I don't understand what in_channels, out_channels mean in pytorch conv1d definition as in images I can easily understand what in-channels/out-channels mean, but for sequence of 80-float values frames I'm at loss. What do they mean in the context of seq2seq model like this above?

I guess that was answered above. The main point is: the sequence isn't 80 float values! The sequence can be of any length (just like an image can be of any size when you pass it to a convolution); here in_channels is 80.

How do 512, 5x1 filters on 80 float values produce 512 float values?

512 × sequence_length values are produced from 80 × sequence_length inputs.

Shouldn't the weights in this layer instead be 512*5*1 as it only has 512 filters each of which is 5x1?

In PyTorch, in your case, weights would be of shape torch.Size([512, 80, 5]). They could be torch.Size([512, 1, 5]) if you have one input channel, but in this case there are 80 of them.
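A quick sketch confirming the parameter count reported in the question (512 × 80 × 5 = 204800):

```python
import torch.nn as nn

conv = nn.Conv1d(in_channels=80, out_channels=512, kernel_size=5, padding=2)

# Each of the 512 filters has depth 80 (one slice per in_channel) and width 5.
print(conv.weight.shape)    # torch.Size([512, 80, 5])
print(conv.weight.numel())  # 204800 = 512 * 80 * 5
```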

Szymon Maszke
  • Interesting, thanks for your comment. It seems like what you're saying is that conv1d with filter size 5 will operate each time on 5 _different_ frames (each with 80 different float values)? I wonder if my understanding is correct? Very different from the notion of conv on an image, where a conv of filter size 5x5 operates on the same image, just a different 5x5 slice of pixels along all in-channels. – Joe Black May 30 '20 at 20:30
  • If I understand correctly, it means conv on an image is intuitively different from conv1d on a sequence, as each conv1d on a sequence operates on different input frames in the sequence, whereas in conv2d it's on the same image. I wonder if it typically causes similar confusion for others too, or did I miss some intuitive key that explains this inconsistency between conv1d and conv on an image? – Joe Black May 30 '20 at 20:32
  • @JoeBlack `conv1d of filter size 5 will operate every time on 5 different frames (each with 80 different float values)` - yes, indeed that's the case if you specify input correctly. `very different [...] just different 5x5 slice of pixels along all in-channels` - IMO it's not very different, it acts across all different `5` values along all `in_channels` in this case (as it's `1D`). `conv on image intuitively is different than conv1d on sequence` - it operates on the same frames, just along the first dimension. – Szymon Maszke May 30 '20 at 22:19
  • @JoeBlack `wonder if it typically causes others similar confusion too` - actually I found the confusion other way around (hard to grasp `2D` case, `1D` being easier), to me your way of thinking is extraordinary if that makes you feel better (not judging ofc!). xd – Szymon Maszke May 30 '20 at 22:21
  • 1
    Simpler example: https://jinglescode.github.io/2020/11/01/how-convolutional-layers-work-deep-learning-neural-networks/ – Danijel Jun 06 '23 at 11:24