
I want to create an app that randomly accesses pages from another site. This site has more than 40,000 pages and does not have an API.

How can I collect the URLs of all these 40,000 pages? Copying and pasting them would take forever.

All of these pages follow the same structure, e.g. site.com/directory/1.html, site.com/directory/2.html, etc.

  • Already been answered - http://stackoverflow.com/questions/2804467/spider-a-website-and-return-urls-only – PressingOnAlways Feb 11 '17 at 04:39
  • @PressingOnAlways That seems directed specifically towards `wget`. The OP has tagged this with JavaScript. – E. Sundin Feb 11 '17 at 04:42
  • The OP contemplated copying and pasting all the URLs, suggesting that he can post-process the data. I suggest using wget or some other established method of grabbing the URLs and importing them into your application. I do not see the need to reinvent a web-scraping bot. – PressingOnAlways Feb 11 '17 at 04:47

2 Answers


PhantomJS is great for this. Or you could learn Node.js and set up a 'scraper' that basically grabs each page's HTML via a GET request and parses it with something like cheerio (jQuery for the server side).
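
Here's a rough, minimal sketch of that approach (my own illustration, not part of the original answer). It assumes Node.js with the `cheerio` package installed (`npm install cheerio`); the URL and the selector are placeholders:

```javascript
// Minimal sketch: fetch one page over HTTPS and pull out its links.
// The URL and selector below are placeholders, not from the question.
const https = require('https');
const cheerio = require('cheerio');

function fetchPage(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      let html = '';
      res.on('data', (chunk) => { html += chunk; });
      res.on('end', () => resolve(html));
    }).on('error', reject);
  });
}

fetchPage('https://site.com/directory/1.html')
  .then((html) => {
    const $ = cheerio.load(html);      // cheerio gives a jQuery-like API on the server
    $('a').each((i, el) => {
      console.log($(el).attr('href')); // print every link found on the page
    });
  })
  .catch(console.error);
```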

Your question is pretty broad, as there are many ways to sink a ship. You just gotta pick a tool and go at it. Good luck!

matt

There are multiple tools you could use for this, in different environments. You could achieve it with:

  • Node.js - the environment
  • request - the HTTP request tool
  • cheerio - the HTML parsing tool, which supports jQuery-like selectors such as $("a.somelink-selector")
  • Perhaps the async library, to more easily control how many requests you make at a time (a rough sketch combining these follows this list)
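
A rough sketch tying those pieces together (my own illustration, not from the answer itself). It assumes `npm install request cheerio async`; the URL pattern comes from the question and the selector is a placeholder:

```javascript
const request = require('request');
const cheerio = require('cheerio');
const async = require('async');

// Build the candidate URLs from the known numbering scheme.
const urls = [];
for (let i = 1; i <= 40000; i++) {
  urls.push(`http://site.com/directory/${i}.html`);
}

// Fetch at most 5 pages at a time so the target site isn't hammered.
async.eachLimit(urls, 5, (url, done) => {
  request(url, (err, res, body) => {
    if (err || res.statusCode !== 200) return done(); // skip failed requests
    const $ = cheerio.load(body);
    $('a.somelink-selector').each((i, el) => {
      console.log($(el).attr('href'));                // collect matching links
    });
    done();
  });
}, (err) => {
  if (err) console.error(err);
  else console.log('All pages processed.');
});
```
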
E. Sundin