I'm writing a JavaScript bookmarklet as a side project for work (I don't code for a living and am very much a beginner). It scans through a cnn.com transcript and picks out the names and titles of the live guests, excluding those who are played from tape.
To do this I grab the page's text, then use replace() with a regex to remove the text between BEGIN VIDEO CLIP and END VIDEO CLIP, and then use another regular expression to scan for everything that matches the NAME, TITLE: format. It works like a charm on some transcripts and fails miserably on others. Here's my code:
    (function () {
        var webPage = document.body.innerText;
        var tape = webPage.replace(/(BEGIN VIDEO CLIP)([\s\S]*)(END VIDEO CLIP)|(BEGIN VIDEOTAPE)([\s\S]*)(END VIDEOTAPE)/g, "");
        var searchForGuests = /[A-Z ].+,[A-Z0-9 ].+:/g;
        var guests = tape.match(searchForGuests).join("; ");
        alert("Guests: " + guests);
    })();
As an example, when applied to http://transcripts.cnn.com/TRANSCRIPTS/1303/05/pmt.01.html, it alerts only the name of the host (Piers Morgan), even though there are several live guests. Is my regex the problem? I've been testing in RegExr and, as far as I can tell, I'm not using anything that JavaScript doesn't support.
It should work on any of the transcripts listed at http://transcripts.cnn.com/transcripts.
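
A likely suspect is the greedy [\s\S]* in the tape-removal regex: even with the g flag, it matches from the first BEGIN VIDEO CLIP all the way to the last END VIDEO CLIP, so any live guests who speak between two clips are deleted along with the tape. Below is a minimal sketch of a lazy version ([\s\S]*?), not a confirmed fix for every transcript. The sample text, the speaker names in it, and the helper names stripTape and findGuests are made up for illustration; the guest pattern is the same one used in the code above.

```javascript
// Hypothetical sample standing in for document.body.innerText.
var sample = [
    "PIERS MORGAN, HOST: Good evening.",
    "(BEGIN VIDEO CLIP)",
    "JOHN DOE, TAPED SPEAKER: This was recorded earlier.",
    "(END VIDEO CLIP)",
    "JANE ROE, LIVE GUEST: Thanks for having me.",
    "(BEGIN VIDEO CLIP)",
    "SAM POE, TAPED REPORTER: Also recorded earlier.",
    "(END VIDEO CLIP)",
    "ALEX LEE, LIVE ANALYST: Happy to be here."
].join("\n");

// Lazy quantifier (*?): each match stops at the first END marker
// that follows its BEGIN marker, instead of the last one on the page.
function stripTape(text) {
    var tapeBlocks = /BEGIN VIDEO CLIP[\s\S]*?END VIDEO CLIP|BEGIN VIDEOTAPE[\s\S]*?END VIDEOTAPE/g;
    return text.replace(tapeBlocks, "");
}

// Same NAME, TITLE: pattern as in the question.
// match() returns null when nothing matches, so fall back to an empty array.
function findGuests(text) {
    var matches = stripTape(text).match(/[A-Z ].+,[A-Z0-9 ].+:/g);
    return matches || [];
}

console.log("Guests: " + findGuests(sample).join("; "));
// Guests: PIERS MORGAN, HOST:; JANE ROE, LIVE GUEST:; ALEX LEE, LIVE ANALYST:
```

In the bookmarklet itself, findGuests(document.body.innerText) would take the place of the sample, with alert() instead of console.log(). With the original greedy pattern, the same sample would lose JANE ROE, because she speaks between the first and last clip markers.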