How can I extract only the text (stripping out timecodes) from SubRip .srt files?

Question

I'd like to use text only from a subtitle for further processing.

So, opening an .srt file would load this:

1
00:00:10,500 --> 00:00:13,000
Elephant's Dream

2
00:00:15,000 --> 00:00:18,000
At the left we can see...

Then, after stripping/extracting, the result would be

Elephant's Dream
At the left we can see...

I want to strip out all the numbering and timecode, so the output would consist only of plain text in the exact same order as the original subtitle, and store the result in a variable for further processing.

public void open_file()
{
    JFileChooser filechooser = new JFileChooser();
    filechooser.setFileSelectionMode(JFileChooser.FILES_ONLY);
    int i = filechooser.showOpenDialog(this);
    if (i == JFileChooser.CANCEL_OPTION)
        return;
    File OpenFile = filechooser.getSelectedFile();
    if (OpenFile == null || OpenFile.getName().equals(""))
    {
        JOptionPane.showMessageDialog(this, "choose file", "Error", JOptionPane.ERROR_MESSAGE);
        return;
    }
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader bufferedreader = new BufferedReader(new FileReader(OpenFile))) {
        StringBuffer stringbuffer = new StringBuffer();
        String Row;
        // Copy the file line by line, unchanged, into the buffer
        while ((Row = bufferedreader.readLine()) != null)
            stringbuffer.append(Row + "\n");
        textArea.setText(stringbuffer.toString());
        String SubText = textArea.getText();
    } catch (FileNotFoundException ex) {
        JOptionPane.showMessageDialog(null, "File not found" + ex);
    } catch (IOException ex) {
        JOptionPane.showMessageDialog(null, "IO Error" + ex);
    }
}

I've made a method (as above) to open and load an existing srt file and put it into a String variable (named SubText above).

To extract the text, all I know is that I have to use the numbering, the timecodes, and the blank lines as start and end points, but I have no idea how to write code that detects the numbering and timecodes in the text.

How should I accomplish this in Java? I'm using NetBeans, by the way.

You could maybe skip the first two lines then read one line, skip 3 lines, read one line, skip 3 lines etc. — assylias, Sep 17 '14 at 10:35
well, the problem is that, some text are sometimes more than one line, so i can't just "skip 3 line" over and over — MIMB, Sep 17 '14 at 11:00

TedTrippin · Accepted Answer · 2014-09-17T11:56:36.383


The format is simple: subtitles are separated by a blank line, so all you need to do is skip the first 2 lines of each subtitle and then read everything until you get to a blank line.

So replace your while loop with something like this...

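    while (...) {
        // Skip the subtitle number and the timecode line
        String lineNumber = bufferedreader.readLine();
        String time = bufferedreader.readLine();
        String text;
        // Collect text lines until the blank line that ends this subtitle
        while (!(text = bufferedreader.readLine()).equals(""))
            stringbuffer.append(text).append("\n");
    }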

Be sure to add your own end of file check.
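
For reference, here is one way the loop above could look with the end-of-file check filled in. It is only a sketch: the class name `SrtTextExtractor` and the method `extractText` are made up for illustration, and it assumes a well-formed file in which every subtitle starts with a number line followed by a timecode line. Because `readLine()` returns `null` at end of file, each read is checked for `null` before the line is used, so a file that does not end with a blank line is still handled.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

// Hypothetical helper, not part of the original answer.
class SrtTextExtractor {

    /** Returns only the subtitle text lines, one per line, in original order. */
    static String extractText(BufferedReader reader) throws IOException {
        StringBuilder text = new StringBuilder();
        String line;
        // Each pass of the outer loop consumes one subtitle block.
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) {
                continue; // tolerate extra blank lines between blocks
            }
            // 'line' holds the subtitle number; the next line is the timecode.
            if (reader.readLine() == null) {
                break; // file ended right after a subtitle number
            }
            // Read text lines until a blank line or the end of the file.
            while ((line = reader.readLine()) != null && !line.trim().isEmpty()) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String srt = "1\n00:00:10,500 --> 00:00:13,000\nElephant's Dream\n\n"
                   + "2\n00:00:15,000 --> 00:00:18,000\nAt the left we can see...\n";
        System.out.print(extractText(new BufferedReader(new StringReader(srt))));
    }
}
```

In the question's method, the returned string could be passed to `textArea.setText(...)` in place of the raw file contents.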

edited Sep 17 '14 at 11:56

answered Sep 17 '14 at 10:52


my while loop in there is used for write down untouched srt in a textarea, with while loop to read and append texts until the end of file (as condition). when i tried to implement your while loop so that my while loop write down stripped srt, textarea only filled with blank line. do i have to change my while condition? – MIMB Sep 17 '14 at 11:53
Oops! Got logic wrong way round, should have been !equals(""). I've now corrected. – TedTrippin Sep 17 '14 at 11:58
thanks a lot sir :) it worked! BTW, because of my while loop condition, i only need one bufferedreader skip. and, do you have any idea how can i remove html tags in srt file? (such as or ) – MIMB Sep 17 '14 at 14:20
already answered here http://stackoverflow.com/questions/240546/remove-html-tags-from-a-string – TedTrippin Sep 17 '14 at 16:55
i am using Jsoup lib and i've used parse method on a string, but it doesn't do anything. am i using wrong method to strip html tags? – MIMB Sep 17 '14 at 19:00
my bad, i am using the wrong method. using clean method and all those tags are gone, but also all the \n (blank line). how can i avoid this? – MIMB Sep 17 '14 at 19:17
Clean the line first then add a return after. – TedTrippin Sep 18 '14 at 08:21
funny how i've been working on this all night just to no avail, but simple advice from you enlighten me so easily.. once again, THANK YOU SIR :) – MIMB Sep 18 '14 at 09:23
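
The "clean the line first, then add a return after" advice from the comments can be sketched as follows. To keep the example self-contained it does not use Jsoup; it swaps in a simple regular expression that removes anything that looks like a tag. That is an assumption made for illustration: it is not an HTML parser, and it would also remove ordinary text that happens to sit between `<` and `>`. The names `SrtTagStripper`, `stripTags`, and `cleanLines` are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper illustrating per-line cleaning; uses a regex, not Jsoup.
class SrtTagStripper {

    // Matches anything shaped like a tag, e.g. <i>, </b>, <font color="red">.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    /** Removes tag-like sequences from a single line. */
    static String stripTags(String line) {
        return TAG.matcher(line).replaceAll("");
    }

    /** Cleans each line separately, then re-adds the line break after it. */
    static String cleanLines(String text) {
        StringBuilder out = new StringBuilder();
        for (String line : text.split("\n")) {
            out.append(stripTags(line)).append('\n');
        }
        return out.toString();
    }
}
```

Cleaning line by line keeps the line breaks under the caller's control, which is the point of the advice: the cleaner never sees the `\n`, so it cannot remove it.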
