I have hundreds of machine-generated transcripts of video and audio files. Each transcript exists in five formats: JSON, XML, SRT, VTT, and TXT. (Click here to see example files.) The JSON and XML files contain the most comprehensive data, including speaker ID, confidence level, and timecodes.
I am looking for a way to mine or search this data to find words and phrases. I need to be able to submit a Boolean search query, then click a result and play the video/audio file at the timecode of the text result. The only necessary Boolean operators are NOT, AND, OR (just like an online search engine). Example search: ("baseball bat" AND park) OR soccer
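To make the query semantics concrete, here is a rough Python sketch of the kind of Boolean matching I have in mind. It is only an illustration, not a proposed implementation: the function names (`tokenize`, `parse`, `matches`) are made up for this post, operators must be written in upper case, and it tests a plain text string rather than my actual transcript files.

```python
import re

# Tokens: parentheses, "quoted phrases", or bare words/operators.
TOKEN_RE = re.compile(r'\(|\)|"[^"]*"|[^\s()"]+')

def tokenize(query):
    return TOKEN_RE.findall(query)

def parse(tokens):
    """Recursive-descent parser. Precedence: NOT > AND > OR."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def or_expr():
        node = and_expr()
        while peek() == "OR":
            take()
            node = ("OR", node, and_expr())
        return node

    def and_expr():
        node = not_expr()
        while peek() == "AND":
            take()
            node = ("AND", node, not_expr())
        return node

    def not_expr():
        tok = peek()
        if tok is None or tok in (")", "AND", "OR"):
            raise ValueError(f"unexpected token: {tok!r}")
        if tok == "NOT":
            take()
            return ("NOT", not_expr())
        if tok == "(":
            take()
            node = or_expr()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            take()
            return node
        take()
        # Strip the quotes from phrases and compare case-insensitively.
        return ("TERM", tok.strip('"').lower())

    tree = or_expr()
    if pos != len(tokens):
        raise ValueError(f"unexpected token: {tokens[pos]!r}")
    return tree

def matches(tree, text):
    """Evaluate a parsed query against one piece of text."""
    op = tree[0]
    if op == "TERM":
        # Whole-word (or whole-phrase) match.
        pattern = r"\b" + re.escape(tree[1]) + r"\b"
        return re.search(pattern, text.lower()) is not None
    if op == "NOT":
        return not matches(tree[1], text)
    if op == "AND":
        return matches(tree[1], text) and matches(tree[2], text)
    return matches(tree[1], text) or matches(tree[2], text)

query = parse(tokenize('("baseball bat" AND park) OR soccer'))
print(matches(query, "He swung a baseball bat in the park"))
print(matches(query, "They played baseball in the park"))
```

With this sketch, the first sentence should match (phrase plus "park") and the second should not (no exact phrase "baseball bat" and no "soccer").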
I'm thinking of a fairly simple interface.
Basic options:
- Search box
- Minimum confidence level slider
Ideas for advanced options:
- Speaker: "Bob,Joe,Bill" (that is, speaker must be one of these)
- Maximum time allowed between words in AND search: X.X seconds
- Maximum time allowed between words in exact phrase search: X.X seconds
- Words in exact phrase search must have same speaker: ON/OFF
- Words joined by AND must have same speaker: ON/OFF
- Words joined by OR must have same speaker: ON/OFF
- Words joined by AND must appear in chronological order: ON/OFF
- Ignore punctuation: ON/OFF
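As a rough illustration of how a few of these options could interact, here is a Python sketch of an exact phrase search over word-level records. Everything in it is an assumption made for this post: the `Word` fields (`text`, `start`, `end`, `speaker`, `confidence`) are a hypothetical layout, not the actual schema of my JSON/XML files, and `find_phrase` is a made-up helper, not part of any existing tool.

```python
import string
from dataclasses import dataclass

@dataclass
class Word:
    # Hypothetical per-word record; real transcript fields may differ.
    text: str
    start: float       # seconds from start of media
    end: float         # seconds from start of media
    speaker: str
    confidence: float  # 0.0 to 1.0

def normalize(token, ignore_punctuation=True):
    token = token.lower()
    return token.strip(string.punctuation) if ignore_punctuation else token

def find_phrase(words, phrase, max_gap=1.0, same_speaker=True,
                min_confidence=0.0, ignore_punctuation=True):
    """Return the start time of every occurrence of `phrase`."""
    target = [normalize(t, ignore_punctuation) for t in phrase.split()]
    if not target:
        return []
    hits = []
    for i in range(len(words) - len(target) + 1):
        window = words[i:i + len(target)]
        # Minimum confidence level: every word must meet the threshold.
        if any(w.confidence < min_confidence for w in window):
            continue
        # Exact phrase: consecutive words must equal the query words.
        if [normalize(w.text, ignore_punctuation) for w in window] != target:
            continue
        # Words in exact phrase search must have same speaker.
        if same_speaker and len({w.speaker for w in window}) > 1:
            continue
        # Maximum time allowed between adjacent words in the phrase.
        if any(b.start - a.end > max_gap for a, b in zip(window, window[1:])):
            continue
        hits.append(window[0].start)
    return hits

words = [
    Word("A", 0.0, 0.2, "Bob", 0.90),
    Word("baseball", 0.3, 0.8, "Bob", 0.95),
    Word("bat,", 0.9, 1.2, "Bob", 0.90),
    Word("baseball", 5.0, 5.4, "Bob", 0.90),
    Word("bat", 8.0, 8.3, "Joe", 0.90),
]
print(find_phrase(words, "baseball bat"))
print(find_phrase(words, "baseball bat", max_gap=3.0, same_speaker=False))
```

With the sample data, the default call should report only the occurrence starting at 0.3 seconds. Relaxing the gap to 3.0 seconds and turning off the same-speaker rule should also pick up the one starting at 5.0 seconds. The returned start time is what a player would seek to when a result is clicked.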
Simply put, I need something like Agent Ransack with timecodes and, if possible, some miscellaneous options. I know this is a very specific and complex request. :) Can you give me any leads on this idea? I don't want to reinvent the wheel. Which software, command-line program, or search engine comes closest to being able to do all this? Perhaps I can adapt it from there.
Thanks!