2

I got an interesting project to do! I'm thinking about converting an srt file into a csv/xls file.

a srt file would look like this:

1
00:00:00,104 --> 00:00:02,669
Hi, I'm shell-scripting.

2
00:00:02,982 --> 00:00:04,965
I'm not sure if it would work,
but I'll try it!

3
00:00:05,085 --> 00:00:07,321
There must be a way to do it!

while I want to output it into a csv file like this:

"1","00:00:00,104","00:00:02,669","Hi, I'm shell-scripting."   
"2","00:00:02,982","00:00:04,965","I'm not sure if it would work"
,,,"but I'll try it!"
"3","00:00:05,085","00:00:07,321","There must be a way to do it!"

So as you can see, each subtitle takes up two rows. My thinking would be using grep to put the srt data into the xls, and then use awk to format the xls file.

What do you guys think? How am I suppose to write it? I tried

$grep filename.srt > filename.xls

It seems that all the data including the time codes and the subtitle words ended up all in column A of the xls file...but I want the words to be in column B...How would awk be able to help with the formatting?

Thank you in advance! :)

Mark Setchell
  • 191,897
  • 31
  • 273
  • 432
Penny
  • 1,218
  • 1
  • 13
  • 32
  • Your output does not look like "Comma Separated Values" to me. But anyway .. I suggest you think through the process YOU would use to manually make the format changes you describe. I.e. "open a file", "recognize start of record", "record start and end times", "step through lines of text", "loop back to step 2", etc, writing your program IN ENGLISH. Once you have a procedural understanding of your problem, you can begin to convert it to code. – ghoti Aug 21 '15 at 05:30

4 Answers4

5
$ cat tst.awk
BEGIN { RS=""; FS="\n"; OFS=","; q="\""; s=q OFS q }
{
    split($2,a,/ .* /)
    print q $1 s a[1] s a[2] s $3 q
    for (i=4;i<=NF;i++) {
        print "", "", "", q $i q
    }
}

$ awk -f tst.awk file
"1","00:00:00,104","00:00:02,669","Hi, I'm shell-scripting."
"2","00:00:02,982","00:00:04,965","I'm not sure if it would work,"
,,,"but I'll try it!"
"3","00:00:05,085","00:00:07,321","There must be a way to do it!"
Ed Morton
  • 188,023
  • 17
  • 78
  • 185
1

I think something like this should do it quite nicely:

awk -v RS= -F'\n' '
   { 
      sub(" --> ","\x7c",$2)                 # change "-->" to "|"
      printf "%s|%s|%s\n",$1,$2,$3           # print scene, time start, time stop, description
      for(i=4;i<=NF;i++)printf "|||%s\n",$i  # print remaining lines of description
   }' file.srt

The -v RS= sets the Record Separator to blank lines. The -F'\n' sets the Field Separator to new lines.

The sub() replaces the "-->" with a pipe symbol (|).

The first three fields are then printed separated by pipes, and then there is a little loop to print the remaining lines of description, inset by three pipe symbols to make them line up.

Output

1|00:00:00,104|00:00:02,669|Hi, I'm shell-scripting.
2|00:00:02,982|00:00:04,965|I'm not sure if it would work,
|||but I'll try it!
3|00:00:05,085|00:00:07,321|There must be a way to do it!

As I am feeling like having some more fun with Perl and Excel, I took the above output and parsed it in Perl and wrote a real Excel XLSX file. Of course, there is no real need to use awk and Perl so ideally one would re-cast the awk and integrate it into the Perl since the latter can write Excel files while the former cannot. Anyway here is the Perl.

#!/usr/bin/perl
use strict;
use warnings;

use Excel::Writer::XLSX;
my $DEBUG=0; 
my $workbook  = Excel::Writer::XLSX->new('result.xlsx');
my $worksheet = $workbook->add_worksheet();
my $row=0; 

while(my $line=<>){
   $row++;                                   # move down a line in Excel worksheet
   chomp $line;                              # strip CR
   my @f=split /\|/, $line;                  # split fields of line into array @f[], on pipe symbols (|)
   for(my $j=0;$j<scalar @f;$j++){           # loop through all fields
     my $cell= chr(65+$j) . $row;            # calcuate Excell cell, starting at A1 (65="A")
     $worksheet->write($cell,$f[$j]);        # write to spreadsheet
     printf "%s:%s ",$cell,$f[$j] if $DEBUG;
   }
   printf "\n" if $DEBUG;
}

$workbook->close;

Output

enter image description here

Mark Setchell
  • 191,897
  • 31
  • 273
  • 432
  • Thank you so much for the code! I tried the perl, it did an amazing job! Since I'm not familiar with perl, I'm trying to read and understand you code. I tried to run your perl code without the awk code, and the result.xls has all the content in column A. The good news is one line of the source srt takes up one row in the result.xls. (I want to show you picture here, but seems like I can't, sorry for the inconvenience..) How would the result.xls show up like the picture that you pasted here? I think I have to revise your perl code.If you have time, would you please let me now how? Thank you! – Penny Aug 22 '15 at 03:13
  • I have added another answer that does the whole lot in one go in Perl for you. Please have a look. – Mark Setchell Aug 22 '15 at 21:38
1

My other answer was half awk and half Perl, but, given that awk can't write Excel spreadsheets whereas Perl can, it seems daft to require you to master both awk and Perl when Perl is perfectly capable of doing it all on its own... so here goes in Perl:

#!/usr/bin/perl
use strict;
use warnings;

use Excel::Writer::XLSX;
my $workbook  = Excel::Writer::XLSX->new('result.xlsx');
my $worksheet = $workbook->add_worksheet();
my $ExcelRow=0; 
local $/ = "";   # set paragraph mode, so we read till next blank line as one record

while(my $para=<>){
   $ExcelRow++;                               # move down a line in Excel worksheet
   chomp $para;                               # strip CR
   my @lines=split /\n/, $para;               # split paragraph into lines on linefeed character
   my $scene = $lines[0];                     # pick up scene number from first line of para
   my ($start,$end)=split / --> /,$lines[1];  # pick up start and end time from second line
   my $cell=sprintf("A%d",$ExcelRow);         # work out cell
   $worksheet->write($cell,$scene);           # write scene to spreadsheet column A
   $cell=sprintf("B%d",$ExcelRow);            # work out cell
   $worksheet->write($cell,$start);           # write start time to spreadsheet column B
   $cell=sprintf("C%d",$ExcelRow);            # work out cell
   $worksheet->write($cell,$end);             # write end time to spreadsheet column C
   $cell=sprintf("D%d",$ExcelRow);            # work out cell
   $worksheet->write($cell,$lines[2]);        # write description to spreadsheet column D
   for(my $i=3;$i<scalar @lines;$i++){        # output additional lines of description
      $ExcelRow++;
      $cell=sprintf("D%d",$ExcelRow);         # work out cell
      $worksheet->write($cell,$lines[$i]);
   }
}

$workbook->close;

Save the above on a file called srt2xls and then make it executable with the command:

chmod +x srt2xls

Then you can run it with

./srt2xls < SomeFileile.srt

and it will give you this spreadsheet called result.xlsx

enter image description here

Mark Setchell
  • 191,897
  • 31
  • 273
  • 432
  • Thank you so much Mark!! I'm going to try it tomorrow and see how it goes! But thank you for it in advance! I will definitely check and try my best to understand your code! :) – Penny Aug 23 '15 at 05:53
  • Hey Mark, I tried your code, but I have a problem with my Writer-Excel-XLSX module. I downloaded the ActivePerl 5.18 and it is in my library folder, and I installed the Writer-Excel-XLSX by using $ppm install Excel-Writer-XLSX.However, when I tried to run your code, I received with the error message: Can't locate Excel/Writer/XLSX.pm in @INC (you may need to install the Excel::Writer::XLSX module) (INC contains: /...)I know that the excel module is not in the INC Path and I tried different ways to try to include it, but no luck...Do you know how? Sorry to trouble you so much! – Penny Aug 23 '15 at 22:39
  • Two options... either insert a line after `use warnings;` that says `lib '/path/to/wherever/you/installed/XLSmodule';` or go to `Control Panel->System` and set up an Environment Variable called PERL5LIB that also contains the directory where you installed the Excel module. See http://community.activestate.com/forum-topic/how-update-inc – Mark Setchell Aug 23 '15 at 22:58
  • Hi Mark, I added the two lines: use FindBin; use lib "$FindBin::Bin/../Excel-Writer-XLSX"; to your above code after use warnings, but the @INC still can't find the module...Did I type it right? I'm not sure what the ".." should contain. I found the Excel folder under the path /Users/Library/ActivePerl-5.18/lib/Excel. Is it supposed to be the path? – Penny Aug 23 '15 at 23:48
  • Have you mixed up `cygwin` and `ActiveState` Perl on a Windows box or something? You talk about `shell` and `awk` and `grep` in your question but seem to be on Windows using ActiveState Perl? Have you installed the Excel module inside ActiveState but called the cygwin Perl or something like that? – Mark Setchell Aug 24 '15 at 17:25
  • Hi Mark, thank you for being so patient with me! I was using the OS X system instead of windows. I have downloaded ActiveState Perl.dmg and it (ActiveState Perl-5.18) was in my /Users/username/Library/ActiveState Perl-5.18. When I open it's folder, there are four other folders: bin, etc, html, lib. When I open lib, The Excel folder is in it. Then I realize maybe the excel.pm is not the right one, so I moved the Excel-Writer-XLSX-0.85 folder under lib. I tried your code, and the same issue about @INC still happened. So I was wondering if the pm is not in the right folder or a naming issue? Thx! – Penny Aug 24 '15 at 18:50
  • Why would you install ActiveState Perl on OS X? It comes with a perfectly good built-in Perl? – Mark Setchell Aug 24 '15 at 18:55
  • Because I wanted to download the excel.pm and I found that activeperl has the PPM where I can use ppm install command to download the modules...when I typed in perl -v, the command gives me the perl 5.16 version. So maybe the default perl it is using is still the built-in version? The INC paths are Built under darwin @INC: /opt/local/lib/perl5/site_perl/5.16.3/darwin-thread-multi-2level ...So should I add the module here? I'm trying to find the built-in perl 5 folder, so I can put the module in. Hope that I am on the right track? Just not sure if I can find the folder. – Penny Aug 24 '15 at 19:13
  • I update my Apple-supplied, standard OSX Perl with `sudo perl -MCPAN -e 'install Excel::Writer::XLSX'` – Mark Setchell Aug 24 '15 at 19:18
  • I used the above command and I think the module was downloaded! Yay! But the problem is @INC still contains library/perl/5.18, which is weird, because I ran $ sudo /usr/local/ActivePerl-5.18/bin/ap-uninstall. The result is that the INC path still goes to the wrong directory-->Can't locate Excel/Writer/XLSX.pm in INC (you may need to install the Excel::Writer::XLSX module) (INC contains: /Users/venga-penny/srt-xls/../Excel-Writer-XLSX /Library/Perl/5.18/darwin-thread-multi-2level /Library/Perl/5.18... – Penny Aug 24 '15 at 19:34
  • Thank you Mark for your help! It helped a lot. – Penny Aug 27 '15 at 00:49
0

Since you want to convert the srt into csv. below is awk command

 awk '{gsub(" --> ","\x22,\x22");if(NF!=0){if(j<3)k=k"\x22"$0"\x22,";else{k="\x22"$0"\x22 ";l=1}j=j+1}else j=0;if(j==3){print k;k=""}if(l==1){print ",,,"k ;l=0;k=""}}' inputfile > output.csv

detail veiw of awk

awk '{
       gsub(" --> ","\x22,\x22"); 
       if(NF!=0)
         {
           if(j<3)
              k=k"\x22"$0"\x22,";
           else
            {
              k="\x22"$0"\x22 ";
              l=1
            }
          j=j+1
         }
        else
          j=0;
        if(j==3)
          { 
            print k;
            k=""
          }
        if(l==1)
          {
            print ",,,"k;
            l=0;
            k=""
          }
    }' inputfile > output.csv

take the output.csv on windows platform and then open with microsoft excel and save it as .xls extension.

Shravan Yadav
  • 1,297
  • 1
  • 14
  • 26
  • Hi Shranvan, Thank you so much for writing the code! I actually wanted to transfer the data from the srt file to an excel, so the "|" and "----------" are just things that i used to imitate an excel format to have you guys better understand how I want the data to appear in an excel. Now I feel the difficulty of doing it, because when I output the srt into the csv file, it was comma delimited, so the subtitles seem to parse in a wrong way...So I am a bit confused now about how to do it... – Penny Aug 21 '15 at 06:57
  • Hey Shranvan, I tried to output the srt into a xls file. That looks much better.. My code is: $grep [0-9] filename.srt > filename.xls But I'm not sure how to format the xls file, meaning how to put the subtitle words into the B column for example, how does awk be able to help? To further explain, if I just use grep to output the subtitles, then all of them would end up all in the A column...but I want the words to be in the B column. Thank you very much!! :) – Penny Aug 21 '15 at 07:10