R: Extracting time from srt (subtitles) file

Question

I need to calculate the speech rate of each subtitle line. The content of the srt (subtitles) file looks like this:

1
00:00:19,000 --> 00:00:21,989
I'm Annita McVeigh and welcome to Election Today where we'll bring you

2
00:00:22,000 --> 00:00:23,989
the latest from the campaign trail, plus debate and analysis.

3
00:00:24,000 --> 00:00:28,989
The Liberal Democrats promise to protect the pay of millions

For example, it takes 4 seconds 989 milliseconds to say the 10 words "The Liberal Democrats promise to protect the pay of millions". The average speech rate of these 10 words is 498.9 milliseconds per word.

How do I read the srt file into a dataframe with startTime, endTime, textString and wordCount as columns and one row per subtitle line, like below?

startTime <- c("00:00:19,000", "00:00:22,000", "00:00:24,000")
endTime <- c("00:00:21,989", "00:00:23,989", "00:00:28,989")
textString <- c("I'm Annita McVeigh and welcome to Election Today where we'll bring you", "the latest from the campaign trail, plus debate and analysis.", "The Liberal Democrats promise to protect the pay of millions")
wordCount <- c(12, 10, 10)

rate.df <- data.frame(startTime, endTime, textString, wordCount)

How do I subtract startTime from endTime in R, when time is presented in the form of hour:minute:second,millisecond?

I succeeded in the task using MS Excel, but I have too much data to use Excel for this task. — Ninjadog, Apr 10 '16 at 16:23

digEmAll · Accepted Answer · 2016-04-11T13:02:36.893

Here's a possible solution (the code is pretty self-explanatory):

text="

1
00:00:19,000 --> 00:00:21,989
I'm Annita McVeigh and welcome to Election Today where we'll bring you

2
00:00:22,000 --> 00:00:23,989
the latest from the campaign trail, 
plus debate 
and analysis.



3
00:00:24,000 --> 00:00:28,989
The Liberal Democrats promise to protect 
the pay of millions"

con <- textConnection(text)
lines <- readLines(con)
close(con)

# the previous lines of code are just to replicate your case, and
# they should be replaced by the following single line in the real case
# lines <- readLines(srtFileName)

listOfEntries <- 
lapply(split(seq_along(lines),cumsum(grepl("^\\s*$",lines))),function(blockIdx){
    block <- lines[blockIdx]
    block <- block[!grepl("^\\s*$",block)]
    if(length(block) == 0){
      return(NULL)
    }
    if(length(block) < 3){
      # skip malformed blocks: block[3:length(block)] would index backwards here
      warning("a block not respecting srt standards has been found")
      return(NULL)
    }
    return(data.frame(id=block[1], 
                      times=block[2], 
                      textString=paste0(block[3:length(block)],collapse="\n"),
                      stringsAsFactors = FALSE))
  })
m <- do.call(rbind,listOfEntries)


# split start and end times
tmp <- do.call(rbind,strsplit(m[,'times'],' --> '))
m$startTime <- tmp[,1]
m$endTime <- tmp[,2]

# parse times
tmp <- do.call(rbind,lapply(strsplit(m$startTime,':|,'),as.numeric))
m$fromSeconds  <- tmp %*% c(60*60,60,1,1/1000)

tmp <- do.call(rbind,lapply(strsplit(m$endTime,':|,'),as.numeric))
m$toSeconds  <- tmp %*% c(60*60,60,1,1/1000)

# compute time difference in seconds
m$timeDiffInSecs <- m$toSeconds - m$fromSeconds

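# word count: number of runs of non-word characters, plus one
# (note: an apostrophe splits "I'm" into two words, and trailing punctuation
# such as a final "." is counted as one more separator)
m$wordCount <- vapply(gregexpr("\\W+",m$textString),length,0) + 1

# or if you consider "I'm" a single word you can remove the occurrences of ', e.g.:
#m$wordCount <- vapply(gregexpr("\\W+",gsub("'","",m$textString)),length,0) + 1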

m$millisecsPerWord <- m$timeDiffInSecs * 1000 / m$wordCount
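
As an aside, the two time-parsing steps above repeat the same logic, so they could be folded into one small helper. This is only a sketch of that idea; the name `srt_to_seconds` is hypothetical and not part of the original answer, and it assumes every timestamp has the `HH:MM:SS,mmm` form shown in the question.

```r
# Hypothetical helper (not in the original answer): convert srt timestamps
# of the form "HH:MM:SS,mmm" to seconds.
srt_to_seconds <- function(x) {
  # split each timestamp on ":" or "," into hours, minutes, seconds, millis
  parts <- do.call(rbind, lapply(strsplit(x, "[:,]"), as.numeric))
  # weight each column and sum; as.vector() drops the n x 1 matrix shape
  as.vector(parts %*% c(60 * 60, 60, 1, 1 / 1000))
}

start <- srt_to_seconds(c("00:00:19,000", "00:00:24,000"))
end   <- srt_to_seconds(c("00:00:21,989", "00:00:28,989"))
end - start
```

The differences come out at about 2.989 and 4.989 seconds, in line with the timeDiffInSecs values for entries 1 and 3. With such a helper, the time difference could be written as `srt_to_seconds(m$endTime) - srt_to_seconds(m$startTime)`.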

Result:

> m
  id                         times                                                             textString
2  1 00:00:19,000 --> 00:00:21,989 I'm Annita McVeigh and welcome to Election Today where we'll bring you
3  2 00:00:22,000 --> 00:00:23,989      the latest from the campaign trail, \nplus debate \nand analysis.
6  3 00:00:24,000 --> 00:00:28,989         The Liberal Democrats promise to protect \nthe pay of millions
     startTime      endTime fromSeconds toSeconds timeDiffInSecs wordCount millisecsPerWord
2 00:00:19,000 00:00:21,989          19    21.989          2.989        14         213.5000
3 00:00:22,000 00:00:23,989          22    23.989          1.989        11         180.8182
6 00:00:24,000 00:00:28,989          24    28.989          4.989        10         498.9000
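
Note that the wordCount values above (14 and 11 for the first two entries) differ from the 12 and 10 given in the question: the `\\W+` pattern splits "I'm" and "we'll" at the apostrophe, and the final "." of entry 2 is counted as one more separator. If a word is taken to be any whitespace-delimited token, one possible alternative is to split on whitespace instead. A minimal sketch under that assumption; `count_words` is a hypothetical name, not part of the original answer:

```r
# Hypothetical helper: count whitespace-delimited tokens in each string.
count_words <- function(x) {
  # trim leading/trailing whitespace so it does not create an empty token,
  # then split on runs of whitespace (spaces, tabs, newlines)
  lengths(strsplit(trimws(x), "\\s+"))
}

subtitles <- c(
  "I'm Annita McVeigh and welcome to Election Today where we'll bring you",
  "the latest from the campaign trail, \nplus debate \nand analysis.",
  "The Liberal Democrats promise to protect \nthe pay of millions"
)
count_words(subtitles)
```

For these three entries this gives 12, 10 and 10, matching the counts in the question. It could be plugged in as `m$wordCount <- count_words(m$textString)` before computing millisecsPerWord.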

Oh. That's amazing! Thank you so much, digEmAll! The codes are just beautiful! — Ninjadog, Apr 10 '16 at 18:27
