
I have hundreds of automatic machine transcripts of video and audio files. I have every transcript in five formats: JSON, XML, SRT, VTT, TXT. (Click here to see example files.) The JSON and XML files contain the most comprehensive data, including speaker ID, confidence level, and timecodes.

I am looking for a way to mine or search this data to find words and phrases. I need to be able to submit a Boolean search query, then click a result and play the video/audio file at the timecode of the text result. The only necessary Boolean operators are NOT, AND, OR (just like an online search engine). Example search: ("baseball bat" AND park) OR soccer

I'm thinking of a fairly simple interface.

Basic options:

  • Search box
  • Minimum confidence level slider

Ideas for advanced options:

  • Speaker: "Bob,Joe,Bill" (that is, speaker must be one of these)
  • Maximum time allowed between words in AND search: X.X seconds
  • Maximum time allowed between words in exact phrase search: X.X seconds
  • Words in exact phrase search must have same speaker: ON/OFF
  • Words between AND must have same speaker: ON/OFF
  • Words between OR must have same speaker: ON/OFF
  • Words between AND must be found within chronological order: ON/OFF
  • Ignore punctuation: ON/OFF
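To make the intended semantics of these options concrete, here is a minimal Python sketch of an AND search between two words. It assumes each transcript has already been loaded into a list of word records; the `Word` fields and the `and_matches` function are hypothetical names chosen for illustration, not the actual layout of the JSON/XML files.

```python
from dataclasses import dataclass

@dataclass
class Word:
    # Hypothetical record; real transcript files may use other field names.
    text: str
    start: float       # seconds from the start of the media file
    confidence: float  # 0.0 to 1.0
    speaker: str

def normalize(text, ignore_punctuation=True):
    """Lowercase a word and optionally strip punctuation."""
    text = text.lower()
    if ignore_punctuation:
        text = "".join(ch for ch in text if ch.isalnum())
    return text

def and_matches(words, term_a, term_b, max_gap=5.0, min_confidence=0.0,
                speakers=None, same_speaker=False, in_order=False,
                ignore_punctuation=True):
    """Return (start_a, start_b) pairs where both terms occur within max_gap seconds."""
    usable = [w for w in words
              if w.confidence >= min_confidence
              and (speakers is None or w.speaker in speakers)]
    hits_a = [w for w in usable if normalize(w.text, ignore_punctuation) == term_a]
    hits_b = [w for w in usable if normalize(w.text, ignore_punctuation) == term_b]
    results = []
    for a in hits_a:
        for b in hits_b:
            if abs(a.start - b.start) > max_gap:
                continue  # too far apart in time
            if same_speaker and a.speaker != b.speaker:
                continue  # spoken by different people
            if in_order and b.start < a.start:
                continue  # term_b was spoken before term_a
            results.append((a.start, b.start))
    return results

# Made-up sample data, only to exercise the options.
words = [
    Word("Soccer,", 1.0, 0.95, "Bob"),
    Word("park", 2.5, 0.90, "Bob"),
    Word("bat", 4.0, 0.40, "Joe"),
    Word("park.", 30.0, 0.85, "Joe"),
]

print(and_matches(words, "soccer", "park", max_gap=5.0))
```

Each returned pair carries the start times of both words, so a player could seek to the earlier of the two when a result is clicked.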

Simply put, I need something like Agent Ransack with timecodes and, if possible, some miscellaneous options. I know this is a very specific and complex request. :) Can you give me any leads on this idea? I don't want to reinvent the wheel. Which software/command line program/engine comes closest to being able to do all this? Perhaps I can adapt it from there.

Thanks!

grgoelyk

1 Answer


You can implement such a system on top of Solr/Lucene (http://lucene.apache.org/solr). However, you will need some experience with it to implement the features you require.

For an open-source implementation of speech archival and indexing, have a look at Matterhorn.

Details on Matterhorn's speech indexing are available in a presentation.

However, this is not the only way to implement such functionality. You can also build it yourself with the language of your choice and simple tools; Ruby, PHP, or Node.js will all work here.
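As a rough illustration of the "simple tools" route, here is a minimal sketch in Python (used here only for brevity; the same idea carries over to Ruby, PHP, or Node.js). It evaluates a NOT/AND/OR query with quoted phrases at the file level, then looks up timecodes for a word. It assumes each transcript has already been reduced to a list of `(word, start_seconds)` pairs; that structure, the file names, and the function names are hypothetical, and error handling for malformed queries is omitted.

```python
import re

def normalize(text):
    """Lowercase and drop punctuation other than apostrophes."""
    return re.sub(r"[^\w']", "", text.lower())

def term_files(term, transcripts):
    """Files containing a word or an exact quoted phrase (consecutive words)."""
    phrase = [normalize(t) for t in term.strip('"').split()]
    n = len(phrase)
    found = set()
    for name, words in transcripts.items():
        tokens = [normalize(text) for text, _start in words]
        if any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1)):
            found.add(name)
    return found

def search(query, transcripts):
    """Evaluate a Boolean query and return the set of matching file names."""
    # Tokens are quoted phrases, parentheses, or bare words/operators.
    tokens = re.findall(r'"[^"]*"|\(|\)|[^\s()"]+', query)
    all_files = set(transcripts)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():            # lowest precedence
        result = parse_and()
        while peek() == "OR":
            take()
            result = result | parse_and()
        return result

    def parse_and():
        result = parse_not()
        while peek() == "AND":
            take()
            result = result & parse_not()
        return result

    def parse_not():           # NOT, parentheses, or a single term
        if peek() == "NOT":
            take()
            return all_files - parse_not()
        if peek() == "(":
            take()
            result = parse_or()
            take()             # consume the closing ")"
            return result
        return term_files(take(), transcripts)

    return parse_or()

def word_times(word, transcripts):
    """List (file_name, start_seconds) for every occurrence of a word."""
    return [(name, start)
            for name, words in transcripts.items()
            for text, start in words
            if normalize(text) == normalize(word)]

# Made-up sample data standing in for parsed transcript files.
transcripts = {
    "game.json": [("Baseball", 0.0), ("bat", 0.4), ("in", 0.9), ("the", 1.0), ("park.", 1.2)],
    "match.json": [("Soccer", 3.0), ("tonight", 3.5)],
    "walk.json": [("A", 0.0), ("park", 0.5), ("bench", 1.0)],
}

print(sorted(search('("baseball bat" AND park) OR soccer', transcripts)))
print(word_times("park", transcripts))
```

This scans every transcript on each query, which is fine for a few hundred files; for a larger archive, an inverted index (which is what Solr/Lucene maintains for you) would avoid the repeated scans.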

Nikolay Shmyrev