
I want to create an app that randomly accesses pages from another site. This site has more than 40,000 pages and does not have an API.

How can I collect the URLs of all these 40,000 pages? Copying and pasting them would take forever.

All of these pages follow the same structure, e.g. site.com/directory/1.html, site.com/directory/2.html, etc.

  • Already been answered - http://stackoverflow.com/questions/2804467/spider-a-website-and-return-urls-only – PressingOnAlways Feb 11 '17 at 04:39
  • @PressingOnAlways That seems directed specifically towards `wget`. The OP has tagged this with JavaScript. – E. Sundin Feb 11 '17 at 04:42
  • The OP contemplated copying and pasting all the URLs, suggesting that he can post-process the data. I suggest using wget or some other established method of grabbing the URLs and importing them into your application. I do not see the need to reinvent a web-scraping bot. – PressingOnAlways Feb 11 '17 at 04:47

2 Answers


PhantomJS is great for this. Or you could learn Node.js and set up a 'scraper' that basically grabs each page's HTML via a GET request and parses it with something like cheerio (jQuery for the server side).
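
Here's a rough, minimal sketch of that approach (my own illustration, not part of the original answer). It assumes Node.js with the `cheerio` package installed (`npm install cheerio`); the URL and the selector are placeholders:

```javascript
// Minimal sketch: fetch one page over HTTPS and pull out its links.
// The URL and selector below are placeholders, not from the question.
const https = require('https');
const cheerio = require('cheerio');

function fetchPage(url) {
  return new Promise((resolve, reject) => {
    https.get(url, (res) => {
      let html = '';
      res.on('data', (chunk) => { html += chunk; });
      res.on('end', () => resolve(html));
    }).on('error', reject);
  });
}

fetchPage('https://site.com/directory/1.html')
  .then((html) => {
    const $ = cheerio.load(html);      // cheerio gives a jQuery-like API on the server
    $('a').each((i, el) => {
      console.log($(el).attr('href')); // print every link found on the page
    });
  })
  .catch(console.error);
```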

Your question is pretty broad, as there are many ways to sink a ship. You just gotta pick a tool and go at it. Good luck!

matt

There are multiple tools you could use for this, in different environments. You could achieve it with:

  • Node.js - the environment
  • request - the HTTP request tool
  • cheerio - the HTML parsing tool, which supports jQuery-like selectors such as $("a.somelink-selector")
  • Perhaps the async library, to more easily control how many requests you make at a time (a rough sketch combining these follows this list)
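
A rough sketch tying those pieces together (my own illustration, not from the answer itself). It assumes `npm install request cheerio async`; the URL pattern comes from the question and the selector is a placeholder:

```javascript
const request = require('request');
const cheerio = require('cheerio');
const async = require('async');

// Build the candidate URLs from the known numbering scheme.
const urls = [];
for (let i = 1; i <= 40000; i++) {
  urls.push(`http://site.com/directory/${i}.html`);
}

// Fetch at most 5 pages at a time so the target site isn't hammered.
async.eachLimit(urls, 5, (url, done) => {
  request(url, (err, res, body) => {
    if (err || res.statusCode !== 200) return done(); // skip failed requests
    const $ = cheerio.load(body);
    $('a.somelink-selector').each((i, el) => {
      console.log($(el).attr('href'));                // collect matching links
    });
    done();
  });
}, (err) => {
  if (err) console.error(err);
  else console.log('All pages processed.');
});
```
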
E. Sundin