Using youtube-dl --write-auto-sub we get a file like this:

WEBVTT
Kind: captions
Language: en
Style:
::cue(c.colorCCCCCC) { color: rgb(204,204,204);
 }
::cue(c.colorE5E5E5) { color: rgb(229,229,229);
 }
##

00:00:00.030 --> 00:00:02.619 align:start position:0%

<c.colorE5E5E5>because<00:00:00.630><c> then</c><00:00:00.780><c> media</c><00:00:01.079><c> tries</c><00:00:01.380><c> to</c><00:00:01.589><c> sell</c><00:00:01.800><c> chips</c></c><c.colorCCCCCC><00:00:02.129><c> a</c></c>

00:00:02.619 --> 00:00:02.629 align:start position:0%
<c.colorE5E5E5>because then media tries to sell chips</c><c.colorCCCCCC> a
 </c>

00:00:02.629 --> 00:00:05.869 align:start position:0%
<c.colorE5E5E5>because then media tries to sell chips</c><c.colorCCCCCC> a
lot<00:00:03.629><c> of</c><00:00:03.870><c> chips</c></c><c.colorE5E5E5><00:00:04.200><c> into</c></c><c.colorCCCCCC><00:00:04.560><c> the</c></c><c.colorE5E5E5><00:00:04.890><c> Android</c><00:00:05.279><c> Market</c><00:00:05.700><c> and</c></c>

00:00:05.869 --> 00:00:05.879 align:start position:0%
lot of chips<c.colorE5E5E5> into</c><c.colorCCCCCC> the</c><c.colorE5E5E5> Android Market and
 </c>

00:00:05.879 --> 00:00:08.900 align:start position:0%
lot of chips<c.colorE5E5E5> into</c><c.colorCCCCCC> the</c><c.colorE5E5E5> Android Market and
NVIDIA</c><c.colorCCCCCC><00:00:06.600><c> has</c></c><c.colorE5E5E5><00:00:06.839><c> been</c><00:00:07.109><c> the</c><00:00:07.350><c> single</c><00:00:07.980><c> worst</c><00:00:08.280><c> company</c></c>

00:00:08.900 --> 00:00:08.910 align:start position:0%
NVIDIA<c.colorCCCCCC> has</c><c.colorE5E5E5> been the single worst company
 </c>

00:00:08.910 --> 00:00:14.420 align:start position:0%
NVIDIA<c.colorCCCCCC> has</c><c.colorE5E5E5> been the single worst company
we've<00:00:09.090><c> ever</c><00:00:09.389><c> dealt</c><00:00:09.719><c> with</c><00:00:09.870><c> so</c><00:00:10.620><c> Nvidia</c><00:00:11.090><c> fuck</c><00:00:12.090><c> you</c></c>

webvtt-py can be used to extract the color and timing information, but why does YouTube generate repeated captions? And what is the best way to get the plaintext caption? I've tried ignoring all cues that last only 0.010 seconds, but the remaining cues still overlap: the text at the end of one cue is repeated at the beginning of the next.

qwr
