0

I am battling with regex and I can't figure it out.

I have a bid data base extracted from last.fm (www.lastfm.com). The file is a .txt file where each column from each line is delimited by "," (comma) with over 1.7 GB and there are some characters messing up the reading into R. Until now I managed to understand where everything goes wrong and the main problem comes from " (quotation marks) inside other quotation marks.

To elucidate, here is an example of the .txt file when readLines is applied.

[1] "user,\"Method Man & Redman\",\"Da Rockwilder\",0,2012,2,10,8,0,41"       
[2] "user,\"Method Man & Redman\",\"Y.O.U.\",0,2012,2,10,7,56,25"             
[3] "user,\"Method Man & Redman\",\"Blackout\",0,2012,2,10,7,51,53"           
[4] "user,\"Chuckie\",\"Who Is Ready To Jump (Club Mix)\",0,2012,2,10,7,40,12"
[5] "user,\"Opgezwolle\",\"Volle Kracht\",0,2012,2,10,7,36,31"                
[6] "user,\"Opgezwolle\",\"Ut Is Wat Het Is\",0,2012,2,10,7,33,25"

Basically this becomes a data frame with 10 columns: username, "Artist", "Track", loved (0/1), year, month, day, hour, minute, second

The above example can easily be read without any problems but I get problems when something like this happens:

[1] "user,\"Fall Out Boy\",\"\"The Take Over, The Breaks Over\"\",0,2010,4,17,7,11,37"
[2] "user,\"Gare du Nord\",\"I Want Love 12\" Remix\",0,2011,6,12,19,32,33"

In the first case, due to the double quotation marks, the comma in the name of the track makes this into two different columns and instead of the 10 columns I get 11 columns. On the second case, the 12" leaves the string "open" and only stops until it finds a similar case. When this happens, I loose several lines of the data frame.

What I want as a solution? I want to remove all the " (quotations marks) except the ones that surround the name of the Artist and name of the Track.

Output: The output would have in total four (4) " (quotation marks) in each line. "Artist" and "Track Name". So the output for those 2 lines that give me problem would be:

[1] "user,\"Fall Out Boy\",\"The Take Over, The Breaks Over\",0,2010,4,17,7,11,37"
[2] "user,\"Gare du Nord\",\"I Want Love 12 Remix\",0,2011,6,12,19,32,33"

I tried to use Regex with gsub and gstring but I can't get it to extract only the " marks that are in excess.

If this is too complicated, something that would extract all the " except the first 3 (quotation marks around Artist name and first quotation mark around Track name) and the last one (quotation mark at the end of Track name), might work for most of the cases (and I would do the rest manually). I am assuming here that no Artist name contains quotation marks.

Any help would be appreciated and if you need any further explanation or data please let me know.

JMarchante
  • 123
  • 1
  • 4
  • 11

2 Answers2

4

Use negative lookarounds to remove all the \" which are neither preceded nor followed by commas.

(?<!,)\\"(?!,)

DEMO

> x <- c('user,\"Fall Out Boy\",\"\"The Take Over, The Breaks Over\"\",0,2010,4,17,7,11,37', 'user,\"Gare du Nord\",\"I Want Love 12\" Remix\",0,2011,6,12,19,32,33')
> gsub("(?<!,)\\\"(?!,)", "", x, perl=T)
[1] "user,\"Fall Out Boy\",\"The Take Over, The Breaks Over\",0,2010,4,17,7,11,37"
[2] "user,\"Gare du Nord\",\"I Want Love 12 Remix\",0,2011,6,12,19,32,33" 

Notice that there needs to be an extra backslash in the pattern argument, because backslashes are escape operators in both R and the regex-engine.

Avinash Raj
  • 172,303
  • 28
  • 230
  • 274
  • Nice example. I hope my edit to get the proper English logical expression is acceptable. It fails without the perl=TRUE argument. – IRTFM Oct 30 '14 at 00:05
  • Thanks for your comment. I'd remove the `(*SKIP)(*F)` part. – Avinash Raj Oct 30 '14 at 00:28
  • Run both answers through my database and this was the "winner". After screening 28 Million lines, I ended up with 1199 lines where I had more quotations than wanted. This happened because in this examples the quotations were preceded or followed by commas. Here is an example: `"User,\"Frédéric Chopin\",\"Prelude, in D sharp Major Raindrop\", Op. 28-15\",0,2012,1,24,14,52,15"` Thank you so much! – JMarchante Oct 30 '14 at 00:45
  • That just looks like an error in construction: `Op. 28-15\"` should have been: `\"Op. 28-15\"` – IRTFM Oct 30 '14 at 00:56
  • I agree with you @BondedDust but I am not responsible for the track names that come from last.fm. If it helps to understand, the original line was: `"User,\"Frédéric Chopin\",\"Prelude, in D sharp Major \"Raindrop\", Op. 28-15\",0,2012,2,7,10,19,42"` So track name should be "Prelude, in D sharp Major "Raindrop", Op. 28-15" – JMarchante Oct 30 '14 at 01:04
  • Oh. Embedded double-quotes, some with trailing interior commas. I doubt that either of our solutions are that smart. – IRTFM Oct 30 '14 at 01:07
  • 1
    I am extremely happy with the answers provided and I can easily deal with 1199 lines. So, once again, thank you all for the quick reply. – JMarchante Oct 30 '14 at 01:14
2

Character classes with alphanumeric and double quote and backreferences can do it:

gsub("([ 0-9a-zA-Z\"])(\\\")([ 0-9a-zA-Z\"])", "\\1\\3",test)

[1] "user,\"Fall Out Boy\",\"The Take Over, The Breaks Over\",0,2010,4,17,7,11,37"
[2] "user,\"Gare du Nord\",\"I Want Love 12 Remix\",0,2011,6,12,19,32,33"     

Could also consider:

gsub("([ [:alpha:][:digit:]\"])(\\\")([ [:alpha:][:digit:]\"\"])", 
     "\\1\\3", test)

Basically removing any double-quote mark that is flanked both sides by a class that doesn't have a comma in it. Would break down if there were spaces between your quoting-marks and the correct separating marks. The ?regex page describes your options for using character classes. The parentheses are the delimiters for backreferences: first backref is '\\1' and refers to the characters matched by the character class inside the first paired parentheses: ([ [:alpha:][:digit:]\"]). By omitting the middle backreference from the replacement argument the matching double-quotes get eliminated.

IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • Just made a small test and until now it works. Can you take me a bit further through the steps so I can understand the rationale behind it? I will try with a bigger dataset meanwhile and see if it saves my day. Thank you so much! – JMarchante Oct 29 '14 at 23:57
  • I just run your code on my dataset. It worked much better than what I managed to do but the answer of @Avinash Raj made a better job. Here is an example where your code failed to extract the quotation marks: `"user,\"CNBC's\",\"CNBC's The Suze Orman Show\"- Full Show for 01/28/2012\",0,2012,2,1,21,58,55"` Thank you for your kind answer. – JMarchante Oct 30 '14 at 00:51
  • Yep, you would need to add "-" to the character class definition. – IRTFM Oct 30 '14 at 00:55