
I want to automatically grab some content from a page.

I wonder if it is possible:

  1. Run my own JavaScript on the page after the page is loaded (I use Firefox; I don't have the ability to change the content of the page, I just want to run JS in my browser). The script will use getElementById or a similar method to get the link to the next page.

  2. Run JavaScript to collect the content I'm interested in (some URLs) on that page and store those URLs in a local file.

  3. Go to the next page (the next page really will be loaded in my browser, but I do not need to intervene at all) and repeat steps 1 and 2, until there is no next page.

The classic way to do this is to write a Perl script using LWP or a PHP script using cURL, etc. But that all runs on the server side. I wonder if I can do it client side.

Minghui Yu
  • There isn't a way to write directly to a file on the client side; it's a security risk, so you need to use Ajax or a page submit to write files. If you don't own the pages, and if you don't mind re-running the JavaScript on each page manually (i.e. running your script through Firebug), you *could* do all this, but I think it would be time-consuming. Sounds like web crawling, and I'm pretty certain you would be better off doing it on the server side. – scrappedcola Aug 28 '12 at 22:36
  • Hi @scrappedcola: I may not have expressed my intent well. I want to run my JavaScript through something like Firebug (but Firebug cannot do that; it can only debug JS that comes with the page). If it cannot write to a local file, then writing to the console is also okay; I can copy & paste. Though not exactly what I need, I can live with that too. Thanks. – Minghui Yu Aug 28 '12 at 22:41
  • You can run JS through Firebug. Open up Firebug and click on the Console tab. At the bottom there is a command line (it has a red box next to >>> and a red box on the right-hand side). Click the red box on the right-hand side (it has a white arrow on it). Paste your JS code into the box and click Run at the bottom of that box. See: http://getfirebug.com/commandline – scrappedcola Aug 28 '12 at 22:43
  • You don't even need to cut and paste the entire code. You could write a short one-liner that adds your JS file to the page, as long as you have the file hosted somewhere (IIS and localhost would probably work; use your machine's fully qualified name). See the sketch below. – scrappedcola Aug 28 '12 at 22:45

2 Answers


I do something rather similar, actually.

By using GreaseMonkey, you can write a user script that will interact with the pages however you need. You can get the next-page link and scrape things as you like.

You can also store any data locally, within Firefox, through some new functions called GM_getValue and GM_setValue.
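A sketch of how that might look (the key name savedLinks is my assumption; stored values are strings, so the array is JSON-encoded):

// Read what earlier pages stored, add this page's findings, write it back.
var saved = JSON.parse(GM_getValue('savedLinks', '[]'));
saved.push(location.href); // or whichever URLs you collected on this page
GM_setValue('savedLinks', JSON.stringify(saved));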

I take the lazy way out. I just generate a long list of the URLs that I find while navigating the pages, do a crude document.write, and dump out my list of URLs as a batch file that runs wget.

At that point I copy and paste the batch file, then run it.
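A minimal sketch of that dump step, assuming the URLs have already been collected into an array (the sample URLs are placeholders):

var urls = ['http://example.com/file1.zip', 'http://example.com/file2.zip'];
document.write('<pre>' + urls.map(function (u) {
  return 'wget "' + u + '"';
}).join('\n') + '</pre>');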

If you need to run this often enough that it should be automated, there used to be a way to turn GreaseMonkey scripts into Firefox extensions, which have access to more power.

Another option is, currently AFAIK, Chrome-only: you can collect whatever information you need, build a large file from it, then use the download attribute of a link to get a single-click save, as sketched below.
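A sketch of that approach, assuming the URLs were collected earlier (the variable and file name are mine):

var urls = ['http://example.com/a', 'http://example.com/b']; // collected earlier
var blob = new Blob([urls.join('\n')], { type: 'text/plain' });
var a = document.createElement('a');
a.href = URL.createObjectURL(blob);
a.download = 'links.txt'; // the download attribute names the saved file
a.textContent = 'Save collected links';
document.body.appendChild(a);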

Update

I was going to share the full code for what I was doing, but it was so tied to a particular website that it wouldn't really have helped -- so I'll go for a more "general" solution.

Warning: this code was typed on the fly and may not be entirely correct.

// Define the container
// If you are crawling multiple pages, you'd want to load this from
// localStorage.
var savedLinks = [];

// Walk through the document and build the links.
for (var i = 0; i < document.links.length; i++) {
  var link = document.links[i];

  var data = {
    url: link.href,         // anchor elements expose their target via .href
    desc: link.textContent  // the link's visible text
  };

  savedLinks.push(data);
}

// Here you'd want to save your data via localStorage.
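// A hedged sketch (the key name 'savedLinks' is an assumption): merge
// what earlier pages stored with this page's finds and write it back.
var stored = JSON.parse(localStorage.getItem('savedLinks') || '[]');
localStorage.setItem('savedLinks', JSON.stringify(stored.concat(savedLinks)));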


// If not on the last page, find the 'next' button and load the next page
// [load next page here]
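// A hedged sketch: many sites mark the pager link with rel="next";
// adjust the selector to match the site you are actually crawling.
var nextLink = document.querySelector('a[rel="next"]');
if (nextLink) {
  location.href = nextLink.href;
}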

// If we *are* on the last page, use document.write to output our list.
//
// Note: document.write totally destroys the current document.  It really is quite
// an ugly way to do it, but in this case it works.
if (!nextLink) {
  document.write(JSON.stringify(savedLinks, null, 2));
}
Jeremy J Starcher
  • Totally forgot about GreaseMonkey. – scrappedcola Aug 28 '12 at 22:46
  • @Jeremy Could you explain more of the particulars of what you're doing? What are you trying to do here? What do you do with "document.write"? – Coldblackice Mar 10 '14 at 07:44
  • @JeremyJStarcher Thanks! So out of curiosity, where did you type this code to use it -- console, JS scratchpad, Firebug, new Greasemonkey script, etc.? – Coldblackice Mar 11 '14 at 02:16
  • @Coldblackice That all depends on your needs. In my case I had it as part of a GreaseMonkey script so that it could walk the entire remote site and collect all sorts of information. However, if you only need to snag info off one page, the console would be awkward but would work. – Jeremy J Starcher Mar 11 '14 at 02:23
  • @JeremyJStarcher Well, I'll often find a site with data to scrape or links to click, e.g., a page listing countries/populations. Rather than mouse-selecting and copying/cleaning in Excel, I'd like to use Javascript to do the work -- selecting/scraping the data into a list, or opening each of the links in new tabs. So in an on-the-fly fashion, tailoring it to the respective page. Would the console be the best "on-the-fly" route to interact with a page like this? – Coldblackice Mar 11 '14 at 05:53
  • @Coldblackice That is really a matter of opinion. Personally? I use a GreaseMonkey script ... mostly because I find editing the script in a real editor SOOO much nicer than trying to write something in the console. You can even edit the script right there in the browser's working directory (for firefox, anyways. I think chrome is the same way) so you only have to reload the page and it will also reload your script. – Jeremy J Starcher Mar 11 '14 at 06:01

Selenium/WebDriver will let you write a simple Java/Ruby/PHP app that launches Firefox and uses its JavaScript engine to interact with the page in the browser.
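The answer suggests Java/Ruby/PHP; to stay in one language here, a hedged sketch with the Node.js selenium-webdriver bindings (the URL is a placeholder):

const { Builder, By } = require('selenium-webdriver');

(async function crawl() {
  // Launch a real Firefox instance driven by WebDriver.
  const driver = await new Builder().forBrowser('firefox').build();
  try {
    await driver.get('https://example.com/list?page=1');
    // Collect every link's href from the live, JS-rendered page.
    for (const a of await driver.findElements(By.css('a'))) {
      console.log(await a.getAttribute('href'));
    }
  } finally {
    await driver.quit();
  }
})();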

Or, if the web page does not require JavaScript to make the content you're interested in available, you could use an HTML parser in your favourite language and leave the browser out of it.
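For instance, a sketch in Node.js (the cheerio package is one such parser; the URL is a placeholder):

var https = require('https');
var cheerio = require('cheerio');

// Fetch the raw HTML and parse it without ever opening a browser.
https.get('https://example.com/list', function (res) {
  var html = '';
  res.on('data', function (chunk) { html += chunk; });
  res.on('end', function () {
    var $ = cheerio.load(html);
    $('a').each(function () {
      console.log($(this).attr('href'));
    });
  });
});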

If you want to do it in JavaScript in Firefox, you could probably do it in a Greasemonkey script.

Nicholas Albion