
I’m a beginner in web scraping and am trying to learn how to automate collecting data from the web by submitting search terms.

The specific problem I’m working on is as follows:

Given the Stack Overflow homepage, https://stackoverflow.com/, I want to submit a search for the term “web scraping” and collect, in a list, the links to all resulting questions together with the content of each question.

Is it possible to scrape these results?

My plan is to create a list of terms:

term <- c("web scraping", "crawler", "web spider")

submit a search for each term, and collect both the title and the content of each question.

Of course, the process should be repeated for each page of results.
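
To make the plan concrete, here is a minimal sketch of what I have in mind, using rvest. Several things in it are assumptions rather than facts about the site: the `q` and `page` query parameters in the search URL, the `a.question-link` CSS selector (a hypothetical placeholder; the real selector has to be found by inspecting the results page), and the helper names `build_search_url`, `extract_questions` and `scrape_term`, which are my own. The parsing step is demonstrated on a made-up HTML fragment so it runs without touching the network.

```r
library(rvest)  # re-exports read_html() from xml2

term <- c("web scraping", "crawler", "web spider")

# Build the search URL for one query and one results page.
# ASSUMPTION: the site accepts `q` and `page` as query parameters.
build_search_url <- function(query, page = 1) {
  paste0("https://stackoverflow.com/search?q=",
         utils::URLencode(query, reserved = TRUE),
         "&page=", page)
}

# Pull question titles and absolute links out of one parsed results page.
# ASSUMPTION: `link_css` is a hypothetical selector, not the site's real markup.
extract_questions <- function(page_html, link_css = "a.question-link",
                              base_url = "https://stackoverflow.com") {
  links <- html_elements(page_html, link_css)
  data.frame(
    title = html_text2(links),
    url   = xml2::url_absolute(html_attr(links, "href"), base_url),
    stringsAsFactors = FALSE
  )
}

# Fetch several result pages for one query and stack the results.
# Defined here but not called, because it needs network access.
scrape_term <- function(query, pages = 1:2) {
  do.call(rbind, lapply(pages, function(p) {
    Sys.sleep(1)  # pause between requests
    extract_questions(read_html(build_search_url(query, p)))
  }))
}

# Offline demonstration on a made-up HTML fragment
sample_html <- read_html('
  <div><a class="question-link" href="/questions/1/example-one">Example one</a></div>
  <div><a class="question-link" href="/questions/2/example-two">Example two</a></div>')
demo <- extract_questions(sample_html)

# Search URLs for the first two result pages of every term
urls <- unlist(lapply(term, function(q) {
  vapply(1:2, function(p) build_search_url(q, p), character(1))
}))
```

If this approach is sound, `lapply(term, scrape_term)` would give one data frame of titles and links per term, and the content of each question could then be collected by calling `read_html()` on every value in the `url` column and extracting the question body with another selector.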

Unfortunately, being relatively new to web scraping, I'm not sure what to do. I’ve already downloaded some packages to scrape the web (rvest, RCurl, XML, RCrawler).

Thanks for your help

Luigi
  • This *could* be done using rvest, but have you considered using the [Stack Exchange API](https://api.stackexchange.com/docs) instead? There's a /search api available, and a separate site called [Stack Apps](https://stackapps.com/) where you can ask questions and view the documentation if you're interested in going this route. – Bill the Lizard Apr 13 '18 at 18:09
  • You are likely not trying to scrape SO but are more likely not willing to identify the site you are trying to scrape due to ToS violations. Please provide a MWE for the real URL. – hrbrmstr Apr 17 '18 at 02:09
  • @Bill the Lizard I'm aware of the API but would like to develop a process which can be used for any kind of web scraping task – Luigi Apr 17 '18 at 14:29
  • @hrbrmstr I cannot understand what your point means – Luigi Apr 17 '18 at 14:32
  • Scraping google/bing/etc is a violation of both robots.txt and terms of service on those sites. Same for rotten tomatoes, imdb, amazon, ebay, and a plethora of others. You aren't providing the real site URL(s) so it's very likely you're trying to pwn ethical SO contributors to help you do something unethical. – hrbrmstr Oct 14 '18 at 03:39

0 Answers