
I'm writing a JavaScript bookmarklet as a side project for work (I don't code for a living and am very much a beginner).

It scans through a cnn.com transcript and picks out the names and titles of the live guests, excluding those that are played from tape.

To do this, I grab the page text, then use replace() with a regex to remove the text between BEGIN VIDEO CLIP and END VIDEO CLIP, and then use another regular expression to find everything that matches the NAME, TITLE: format. It works like a charm on some transcripts and fails miserably on others. Here's my code:

(function () {
    var webPage = document.body.innerText;
    var tape = webPage.replace(/(BEGIN VIDEO CLIP)([\s\S]*)(END VIDEO CLIP)|(BEGIN VIDEOTAPE)([\s\S]*)(END VIDEOTAPE)/g, "");
    var searchForGuests = /[A-Z ].+,[A-Z0-9 ].+:/g;
    var guests = tape.match(searchForGuests).join("; ");
    alert("Guests: " + guests)
})();

As an example, when applied to http://transcripts.cnn.com/TRANSCRIPTS/1303/05/pmt.01.html, it alerts only the name of the host (Piers Morgan), even though there are several live guests. Is my regex the problem? I've been testing in Regexr, and as far as I can tell I'm not using anything that is invalid in JavaScript.

It should work on any of the following transcripts: http://transcripts.cnn.com/transcripts.

babyjordan
  • First off, .+ matches any run of characters (other than line breaks). I'm guessing you want something like `/[A-Z ]+,[A-Z0-9 ]+:/g`, since the names and titles are all in caps. – srosh Mar 08 '13 at 20:46
  • and use [regex.exec](https://developer.mozilla.org/en-US/docs/JavaScript/Reference/Global_Objects/RegExp/exec) – srosh Mar 08 '13 at 20:48

2 Answers


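The major problem here is probably the greedy [\s\S]*, which will match and remove too much: a single match runs from the first BEGIN VIDEO CLIP all the way to the last END VIDEO CLIP, so any live guests who speak between two clips are removed as well. Try [\s\S]*? instead. The ? added after the * makes it match as little as possible (instead of as much as possible).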
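
To make the difference concrete, here is a minimal sketch comparing the two quantifiers. The sample text and its speaker names are made up for illustration and are not taken from a real transcript.

```javascript
// Hypothetical sample text (assumption: not from a real transcript).
var sample = [
    "JOHN DOE, LIVE GUEST ONE: Hello.",
    "(BEGIN VIDEO CLIP)",
    "TAPED PERSON, ON TAPE: Recorded remarks.",
    "(END VIDEO CLIP)",
    "JANE ROE, LIVE GUEST TWO: Hi there.",
    "(BEGIN VIDEO CLIP)",
    "OTHER PERSON, ALSO ON TAPE: More recorded remarks.",
    "(END VIDEO CLIP)",
    "MORGAN: Welcome back."
].join("\n");

// Greedy: one match spans from the first BEGIN to the last END,
// so JANE ROE (a live guest between the two clips) is removed too.
var greedy = sample.replace(/BEGIN VIDEO CLIP[\s\S]*END VIDEO CLIP/g, "");

// Lazy: each match stops at the nearest END, so only the clips are removed.
var lazy = sample.replace(/BEGIN VIDEO CLIP[\s\S]*?END VIDEO CLIP/g, "");

var greedyHasJane = greedy.indexOf("JANE ROE") !== -1; // false
var lazyHasJane = lazy.indexOf("JANE ROE") !== -1;     // true
```

The same change applies to the BEGIN VIDEOTAPE / END VIDEOTAPE alternative in the original pattern.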

Qtax

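In your searchForGuests regex, try ^([A-Za-z0-9, ]+(?=:)) with the g and m flags, so that ^ matches at the start of every line rather than only at the start of the text.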

If your text is this:

TOM COUGHLIN, NFL COACH: Preparation is the key to success. 
MORGAN: Plus he's worn out his Oscar welcome but she's Hollywood's golden girl, Kristin Chenoweth. 

It'll return these matches:

TOM COUGHLIN, NFL COACH
MORGAN
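
As a rough sketch of how that pattern could be applied to the two lines above (the variable names are illustrative, and the `|| []` guard is an addition, since match() returns null when nothing matches):

```javascript
var text = [
    "TOM COUGHLIN, NFL COACH: Preparation is the key to success.",
    "MORGAN: Plus he's worn out his Oscar welcome but she's Hollywood's golden girl, Kristin Chenoweth."
].join("\n");

// g: find every match; m: let ^ match at the start of each line.
// The lookahead (?=:) requires a colon but leaves it out of the match.
var speakers = text.match(/^([A-Za-z0-9, ]+(?=:))/gm) || [];

// speakers is ["TOM COUGHLIN, NFL COACH", "MORGAN"]
var summary = "Guests: " + speakers.join("; ");
```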
Amy
  • I'd only like to match the first time the name appears, with the full title. Also, sometimes the titles have more than one comma... Michael Jordan, NBA Player, Chicago Bulls. – babyjordan Mar 08 '13 at 20:54
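
One possible way to meet both requirements from that comment is sketched below. It is not part of either answer: the function name firstFullTitles and the sample text are hypothetical, and the pattern assumes that names and titles are written in capitals, as noted in the comments on the question.

```javascript
// Hypothetical helper (assumption: names and titles are all caps).
function firstFullTitles(text) {
    // Name, a comma, then a title that may itself contain commas, then a colon.
    // Lines such as "MORGAN:" have no comma and are skipped.
    var re = /^([A-Z][A-Z.' -]*),([A-Z0-9,.'& -]+):/gm;
    var seen = {};
    var guests = [];
    var m;
    while ((m = re.exec(text)) !== null) {
        var name = m[1].trim();
        // Keep only the first appearance of each name.
        if (!Object.prototype.hasOwnProperty.call(seen, name)) {
            seen[name] = true;
            guests.push(m[0].slice(0, -1).trim()); // drop the trailing colon
        }
    }
    return guests;
}

// Made-up sample text for illustration.
var sampleLines = [
    "MICHAEL JORDAN, NBA PLAYER, CHICAGO BULLS: It's great to be here.",
    "MORGAN: Welcome.",
    "JORDAN: Thank you.",
    "MICHAEL JORDAN, NBA PLAYER, CHICAGO BULLS: One more thing."
].join("\n");

var firstAppearances = firstFullTitles(sampleLines);
// firstAppearances is ["MICHAEL JORDAN, NBA PLAYER, CHICAGO BULLS"]
```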