
I have a text file that contains words that are concatenated where they should not be. Below is an example of the text file:

Gangnam S.'s Reviewof JOEY Eaton Centre - Toronto (4/5) on Yelp. JOEY Eaton Centre 86 reviews Rating Details Categories:Restaurants Canadian (New) Nightlife Bars Sports Bars Canadian (New); Sports Bars 1 Dundas St W Toronto;ON M5G 1Z3 Neighbourhood: Downtown Core (647) 352-5639 http://www.joeyrestaurants.com AddPhotos Hours: Mon-Sun 11 am - 2 am Good for Kids: No Accepts Credit Cards: Yes Parking: Garage; StreetAttire: Casual Good for Groups: Yes Price Range: $ Takes Reservations: Yes Delivery: No Take Away: YesWaiter Service: Yes Outdoor Seating: Yes Wi-Fi: Free Good For: Dinner Alcohol: Full Bar Noise Level:Average Ambience: Trendy Has TV: Yes Caters: No First to Review Karen G. Edit Business Info Send to FriendBookmark Write a Review 86 reviews for JOEY Eaton Centre Reviews Matching: Search Reviews ReviewHighlights ...I had to get the Killer Ahi Tuna Tacos - seared rare with... In 3 reviews Try the Lobster Ravioli orLobster Grilled Cheese. In 8 reviews ...ordered the Bombay Butter Chicken - served with toasted... In 7reviews Loading... Sort by: Yelp Sort | Date | Rating | Elites' | Facebook Friends' Facebook Friends FromReviewers You're Following Reviews from Your Friends 86 reviews in English Review from Catherine J. Elite'12 11 friends 26 reviews Catherine J. Markham; ON 11/21/2012 A bar the size of a warehouse and a lineupto match; but leap over Joey's welcome mat and you'll get a great introduction to the city. There's a couplereasons to enjoy this joint: 1) Size. It's big.

What would be an efficient way to clean up this text in R by splitting the improperly concatenated words?

Thanks,

Butch

Wai Ha Lee
I have tried the splitWords() function from the tmt package, but it takes a very long time on a long string like this one, because it needs to check every possible split. – Butch Jones Sep 17 '15 at 22:12

1 Answer


If the improperly concatenated words always consist of a first word ending in a lowercase letter followed by a second word starting with an uppercase letter, then the following will work, assuming your text is in txt:

gsub("([a-z])([A-Z])", "\\1 \\2", txt)

e.g.

> txt <- "FriendBookmark Write a Review 86 reviews for JOEY Eaton Centre Reviews Matching: Search Reviews ReviewHighlights"
> gsub("([a-z])([A-Z])", "\\1 \\2", txt)
[1] "Friend Bookmark Write a Review 86 reviews for JOEY Eaton Centre Reviews Matching: Search Reviews Review Highlights"
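The sample text also has joins that the lowercase-to-uppercase pattern does not catch, such as "Categories:Restaurants", "Toronto;ON" and "7reviews". As a rough sketch (the two extra patterns are my own assumptions about what counts as a boundary, not something confirmed for the whole file), the same gsub() approach can be chained:

```r
txt <- "Categories:Restaurants Toronto;ON In 7reviews YesWaiter Service"

# Lowercase letter followed directly by an uppercase letter
out <- gsub("([a-z])([A-Z])", "\\1 \\2", txt)
# Colon or semicolon followed directly by a letter (assumed boundary)
out <- gsub("([:;])([A-Za-z])", "\\1 \\2", out)
# Digit followed directly by a lowercase letter (assumed boundary)
out <- gsub("([0-9])([a-z])", "\\1 \\2", out)

out
```

Each rule can produce false positives: the first splits names like "McDonald", and the digit rule splits ordinals like "1st", so it is worth checking the output on a sample before running it over the whole file.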

Unfortunately, words that are joined without a "camelCase" boundary are harder to separate. For instance, splitting "couplereasons" would require tokenizing the text and looking up partial words in a dictionary, and even that would not be conclusive. How do you parse "theresits" - "there sits" or "the resits"?
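To illustrate the dictionary idea and the ambiguity, here is a minimal sketch. The function segment_all() and the tiny dict vector are hypothetical, made up for this example; a real word list would be far larger. It returns every way of splitting a token into dictionary words:

```r
# Hypothetical toy dictionary; a real one would be a large word list.
dict <- c("couple", "reasons", "the", "there", "sits", "resits")

# Return every way of splitting `word` into words found in `dict`.
# Returns an empty list when no complete split exists.
segment_all <- function(word, dict) {
  n <- nchar(word)
  if (n == 0) return(list(character(0)))
  results <- list()
  for (i in seq_len(n)) {
    prefix <- substr(word, 1, i)
    if (prefix %in% dict) {
      # Recurse on the remainder and prepend the matched prefix
      for (rest in segment_all(substr(word, i + 1, n), dict)) {
        results[[length(results) + 1]] <- c(prefix, rest)
      }
    }
  }
  results
}

sapply(segment_all("couplereasons", dict), paste, collapse = " ")
sapply(segment_all("theresits", dict), paste, collapse = " ")
```

With this dictionary, "couplereasons" has a single split, while "theresits" comes back with both "the resits" and "there sits", so something else (word frequencies, for example) would have to pick between them. The recursion is not memoized and can be slow on long inputs, so it is probably best applied only to individual tokens that fail a dictionary lookup rather than to the whole string.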

Ken Benoit