
I am scraping a very long forum thread, and I want to come up with a database that has columns containing the following info: date / full post text / quoted user / quoted text / clean text

The clean text should be each user's post, without the quotations if they are replying to anyone. If the post is not a reply, I would leave it as NA. The following is an invented post, with an invented user, to illustrate what I have managed to do so far:

post <- "Meow1 wrote: »\nToday is gonna be the day that they're gonna throw it back to you?\nBy now you should've somehow Realized what you gotta do\n\n\nI don't believe that anybody Feels the way I do, about you now\nMeow1 wrote: »\nI'm sure you've heard it all before But you never really had a doubt\n\n\nBecause maybe, you're gonna be the one that saves me\nMeow1 wrote: »\nAnd after all, you're my wonderwall\n\n\nAnd all the lights that lead us there are blinding"

Then I try to pull out the quoted user (Meow1) and it works:

QuotedUser_1<-ifelse(grepl('wrote:', post), gsub('\\s*wrote.*$', '', post), NA) 
QuotedUser_1
[1] "Meow1"

Then I wrote this code to pull out the quoted text and the clean text:

Quotedtext_1<- ifelse(grepl('wrote:', post), gsub('^.*wrote\\s*|\\s*\\n\\n\\n.*$', '', post), NA)

It works when there is only one quoted text, but otherwise it only returns the last quoted bit (in the example, 'And after all, you're my wonderwall'). The same goes for the clean text, which only returns the last reply:

Clean_text<- sub('^.*\\n\\n\\n\\s*|\\s*wrote.*', '', post)

If anyone has a suggestion to improve the code, so that I can have a vector with all the quotations, and a vector with all the replies, I would be very grateful...

Cheers

G5W
Nuria

1 Answer


Are you sure you cannot scrape the author and text information separately? Without the source it's difficult to know, but I guess they can be obtained with different CSS selectors, which would make it much easier to split the data. If not, it might be helpful to look into str_locate_all (from the stringr package), which lets you locate all occurrences of e.g. "wrote:" and split the string accordingly.
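To illustrate the "locate every occurrence and split" idea, here is a minimal sketch. It uses base R (gregexpr, regmatches, strsplit) rather than str_locate_all, so it is a variant of the suggestion, not the suggestion itself. It assumes that every quote header has the form "<user> wrote: »" followed by a newline, that user names contain no spaces, and that three newlines separate the quoted text from the reply, as in the invented post from the question. The names quoted_user, quoted_text, clean_text and result are made up for the example.

```r
# Invented post from the question ("\u00bb" is the » character).
post <- paste0(
  "Meow1 wrote: \u00bb\nToday is gonna be the day that they're gonna throw it back to you?\n",
  "By now you should've somehow Realized what you gotta do\n\n\n",
  "I don't believe that anybody Feels the way I do, about you now\n",
  "Meow1 wrote: \u00bb\nI'm sure you've heard it all before But you never really had a doubt\n\n\n",
  "Because maybe, you're gonna be the one that saves me\n",
  "Meow1 wrote: \u00bb\nAnd after all, you're my wonderwall\n\n\n",
  "And all the lights that lead us there are blinding"
)

# All quoted users: the run of non-space characters right before " wrote:".
quoted_user <- regmatches(post, gregexpr("\\S+(?= wrote:)", post, perl = TRUE))[[1]]

# Split the post at every "<user> wrote: ..." header line.
# The first piece is whatever precedes the first quote (empty here), so drop it.
chunks <- strsplit(post, "\\S+ wrote:[^\\n]*\\n", perl = TRUE)[[1]][-1]

# Within each chunk, the quoted text comes before the first triple newline
# and the reply comes after it ((?s) lets "." match newlines too).
quoted_text <- trimws(sub("(?s)\\n{3}.*$", "", chunks, perl = TRUE))
clean_text  <- trimws(sub("(?s)^.*?\\n{3}", "", chunks, perl = TRUE))

# One row per quote/reply pair.
result <- data.frame(quoted_user, quoted_text, clean_text, stringsAsFactors = FALSE)
print(result)
```

Because the split happens once per header, each quote stays paired with the reply that follows it, which the single greedy gsub calls in the question cannot do. If a chunk contains no triple newline, both sub calls return the chunk unchanged, so real data would need an extra check for that case.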

TomS
  • Do you mind posting your solution? Maybe someone else will have a similar problem in the future and would be happy to read here how to solve it :) – TomS Oct 04 '17 at 05:17
  • I know, and I will (although the code might make everyone cry, as dirty as it is). I want to keep the scraped material anonymous, so I need to recreate the script. I will do it this weekend :) – Nuria Oct 04 '17 at 07:55
  • Thanks a lot! A close-to-cry solution is always better than no solution. At least you have a proper starting point for improvement – TomS Oct 04 '17 at 09:51