
I'm trying to extract TV show names from a txt file using R.

I have loaded the file and assigned it to a variable called txt. Now I'm trying to use a regular expression to extract just the information I want.

The lines I want to extract look like this:

SHOW: Game of Thrones 7:00 PM EST
SHOW: The Outsider 3:00 PM EST
SHOW: Don't Be a Menace to South Central While Drinking Your Juice In The Hood 10:00 AM EST

and so on. There are about 320 shows and I want to extract all 320 of them.

So far, I've come up with this.

library(stringr)

pattern <- "SHOW:\\s\\w*"
str_extract_all(txt, pattern)

However, it doesn't extract the entire title as I intended (e.g., it extracts "SHOW: Game" instead of "SHOW: Game of Thrones"). If I only needed that one show, I would just use "SHOW:\\s\\w*\\s\\w*\\s\\w*" to match the word count, but I want to extract all shows in txt, including those with longer and shorter names.

How should I write the regular expression to get the intended result?

2 Answers


Does this work? It uses lookarounds:

str_extract(st, '(?<=SHOW: )(.*)(?= \\d{1,2}:.. [PA]M ...)')
[1] "Game of Thrones"                                                         
[2] "The Outsider"                                                            
[3] "Don't Be a Menace to South Central While Drinking Your Juice In The Hood"

With SHOW:

str_extract(st, '(.*)(?= \\d{1,2}:.. [PA]M ...)')
[1] "SHOW: Game of Thrones"                                                         
[2] "SHOW: The Outsider"                                                            
[3] "SHOW: Don't Be a Menace to South Central While Drinking Your Juice In The Hood"

Data:

st
[1] "SHOW: Game of Thrones 7:00 PM EST"                                                          
[2] "SHOW: The Outsider 3:00 PM EST"                                                             
[3] "SHOW: Don't Be a Menace to South Central While Drinking Your Juice In The Hood 10:00 AM EST"
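To see what the lookarounds contribute, here is a minimal sketch that runs the same idea with and without them. It is not part of the original answer; the variable names `with_context` and `titles` are made up for illustration, and the minutes are written as `\\d{2}` rather than `..` as an assumption about the time format.

```r
library(stringr)

# Two of the sample strings from the data above
st <- c("SHOW: Game of Thrones 7:00 PM EST",
        "SHOW: The Outsider 3:00 PM EST")

# Without lookarounds, the "SHOW: " prefix and the time are part of the match
with_context <- str_extract(st, "SHOW: .* \\d{1,2}:\\d{2} [PA]M")

# With lookarounds, the same context must be present but is not returned
titles <- str_extract(st, "(?<=SHOW: ).*(?= \\d{1,2}:\\d{2} [PA]M)")

print(with_context)
print(titles)
```

Both calls require the same surrounding text; only the second one leaves it out of the result, which is what "zero-length" means in practice.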
Karthik S
  • Also works very well! What do "?<=SHOW" and "?=" do? – purpleskater Nov 04 '20 at 16:32
  • @purpleskater, "?<=" is a positive lookbehind and "?=" is a positive lookahead. They check whether a pattern exists at that position; if it does, the text it matched is not returned with the result. They are also called zero-length assertions because they don't consume any characters. So here we are not asking R to return what is inside the lookarounds, only what lies between them, which in our case is the show name. – Karthik S Nov 04 '20 at 17:02
  • @purpleskater, if any of the solutions have met your requirement, can you upvote or accept them. https://stackoverflow.com/help/someone-answers – Karthik S Nov 04 '20 at 17:29

You could get the value without lookarounds by matching SHOW: and capturing the title in group 1, matching as few characters as possible up to the first time that is followed by AM or PM.

\bSHOW:\s+(.*?)\s+\d{1,2}:\d{1,2}\s+[AP]M\b

Explanation

  • \bSHOW:\s+ A word boundary, match SHOW: and 1+ whitespace chars
  • (.*?) Capture group 1, match as few characters as possible (non-greedy)
  • \s+\d{1,2}:\d{1,2} Match 1+ whitespace chars, 1-2 digits : 1-2 digits
  • \s+[AP]M\b Match 1+ whitespace chars followed by either AM or PM and a word boundary
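The effect of the non-greedy quantifier is easiest to see on a string that contains more than one listing. The sketch below is an illustration under that assumption; the combined string `x` and the names `lazy` and `greedy` are hypothetical and do not come from the question.

```r
library(stringr)

# Hypothetical input: two listings in a single string
x <- "SHOW: Game of Thrones 7:00 PM EST SHOW: The Outsider 3:00 PM EST"

# Non-greedy: the group stops at the first time followed by AM/PM
lazy <- str_match(x, "\\bSHOW:\\s+(.*?)\\s+\\d{1,2}:\\d{1,2}\\s+[AP]M\\b")[, 2]

# Greedy: the group runs on to the last time followed by AM/PM
greedy <- str_match(x, "\\bSHOW:\\s+(.*)\\s+\\d{1,2}:\\d{1,2}\\s+[AP]M\\b")[, 2]

print(lazy)
print(greedy)
```

When every listing is its own element of a character vector, both forms give the same titles; the difference only shows up when several listings share one string.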


library(stringr)

txt <- c("SHOW: Game of Thrones 7:00 PM EST", "SHOW: The Outsider 3:00 PM EST", "SHOW: Don't Be a Menace to South Central While Drinking Your Juice In The Hood 10:00 AM EST")
pattern <- "\\bSHOW:\\s+(.*?)\\s+\\d{1,2}:\\d{1,2}\\s+[AP]M\\b"
str_match(txt, pattern)[,2]

Output

[1] "Game of Thrones"                                                         
[2] "The Outsider"                                                            
[3] "Don't Be a Menace to South Central While Drinking Your Juice In The Hood"
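If the file was loaded as one long string rather than one element per line, `str_match()` only returns the first show. A possible sketch for that case uses `str_match_all()`. It assumes `txt` is a single newline-separated string (for example built with `paste(readLines(path), collapse = "\n")`, where `path` is a placeholder); the name `shows` is made up for illustration.

```r
library(stringr)

# Assumption: the whole file is held in one string, with one listing per line
txt <- "SHOW: Game of Thrones 7:00 PM EST\nSHOW: The Outsider 3:00 PM EST\n"

pattern <- "\\bSHOW:\\s+(.*?)\\s+\\d{1,2}:\\d{1,2}\\s+[AP]M\\b"

# str_match_all() returns a list with one matrix per input string;
# column 2 of that matrix holds capture group 1 for every match
shows <- str_match_all(txt, pattern)[[1]][, 2]

print(shows)
print(length(shows))
```

Checking `length(shows)` against the expected number of listings (about 320 in the question) is a quick way to spot lines that the pattern missed.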

If you want to include SHOW, it can be part of the capturing group.

\b(SHOW:.*?)\s+\d{1,2}:\d{1,2}\s+[AP]M\b
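In R the backslashes need to be doubled, as in the earlier code. A short sketch of this variant, with the illustrative names `pattern_with_label` and `labelled`:

```r
library(stringr)

txt <- c("SHOW: Game of Thrones 7:00 PM EST",
         "SHOW: The Outsider 3:00 PM EST")

# Same pattern as above, but SHOW: is now inside capture group 1
pattern_with_label <- "\\b(SHOW:.*?)\\s+\\d{1,2}:\\d{1,2}\\s+[AP]M\\b"
labelled <- str_match(txt, pattern_with_label)[, 2]

print(labelled)
```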


The fourth bird