
Is it possible to download the contents of a website—a set of HTML pages—straight to memory, without writing it out to disk?

I have a cluster of machines with 24 GB of RAM installed each, but I'm limited by a disk quota of several hundred MB. I was thinking of redirecting the output of wget to some kind of in-memory structure without storing the contents on disk. The other option is to create my own version of wget, but maybe there is a simple way to do it with pipes.

Also, what would be the best way to run this download in parallel (the cluster has >20 nodes)? I can't use the file system in this case.

– alex
  • Are you fetching that *set of pages* using `wget --recursive`? – Alex Jasmin Jan 11 '10 at 21:01
  • Well, it sounds like the solution is getting a higher quota. When faced with a similar problem, I purchased really big disks for the machine and had the sysadmins set them up for me. :) – brian d foy Jan 12 '10 at 11:30

4 Answers


See wget download options:

‘-O file’

‘--output-document=file’

The documents will not be written to the appropriate files, but all will be concatenated together and written to file. If ‘-’ is used as file, documents will be printed to standard output, disabling link conversion. (Use ‘./-’ to print to a file literally named ‘-’.)

If you want to read the files into a Perl program, you can invoke wget using backticks.
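A minimal sketch of that approach (the URL here is only a placeholder): have wget write the document to standard output with -O - and capture it with backticks.

use strict;
use warnings;

# Capture wget's output straight into a Perl scalar; nothing is written to disk.
my $url     = 'http://www.example.com/';   # placeholder URL
my $content = `wget -q -O - $url`;         # -q: quiet, -O -: send the document to stdout
die "wget failed\n" if $? != 0;

printf "Fetched %d bytes\n", length $content;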

Depending on what you really need to do, you might be able to get by just using LWP::Simple's get.

use LWP::Simple;
my $content = get("http://www.example.com/");
die "Couldn't get it!" unless defined $content;

Update: I had no idea you could implement your own file system in Perl using Fuse and Fuse.pm. See also Fuse::InMemory.

– Sinan Ünür

Are you root? You could just use a tmpfs.
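A minimal sketch of that route, assuming a RAM-backed tmpfs is mounted at /dev/shm (see the comments below) and using a placeholder URL:

use strict;
use warnings;
use LWP::Simple;

# Store the download on a tmpfs (RAM-backed) path instead of the quota'd disk.
my $url  = 'http://www.example.com/';   # placeholder URL
my $dir  = '/dev/shm/pages';            # assumes /dev/shm is a tmpfs mount
my $dest = "$dir/example.html";

mkdir $dir unless -d $dir;
my $status = getstore($url, $dest);     # getstore returns the HTTP status code
die "Download failed (HTTP $status)\n" unless is_success($status);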

Re your edit: you're not CPU bound, so you don't need to use every machine. You can use xargs -n SOME_NUMBER to split your list of root URLs, assuming there are several.

But if you are keen on sharing memory, you can set up a cluster memcache and mount it on every machine with memcachefs.

– Tobu
  • It so happens that Linux has a `tmpfs` mounted at `/dev/shm`, accessible for everyone (not just root). Not that you *should* abuse it for this purpose, but... ;-) – ephemient Jan 11 '10 at 21:03
  • Depending on your setup and your distro, you may or may not have tmpfs already mounted at /dev/shm. – davr Jan 11 '10 at 21:10
  • @ephemient You are a bad person. (incidentally, /var/lock also works) – Tobu Jan 11 '10 at 21:11
  • @davr: Any Linux distribution using Glibc≥2.2 and that wishes to be POSIX.1-compliant has `/dev/shm` mounted; Glibc implements POSIX shared memory (`shm_open`) via files in that directory. – ephemient Jan 11 '10 at 22:13

If you a) are already using Perl, b) want to download HTML, and c) want to parse it, I always recommend LWP and HTML::TreeBuilder.
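For instance, a small sketch of that combination (the URL is a placeholder): fetch the page into memory with LWP and walk its links with HTML::TreeBuilder.

use strict;
use warnings;
use LWP::Simple;
use HTML::TreeBuilder;

# Fetch the page into memory, parse it, and print the links it contains.
my $html = get('http://www.example.com/');   # placeholder URL
die "Couldn't fetch the page\n" unless defined $html;

my $tree = HTML::TreeBuilder->new_from_content($html);
for my $anchor ($tree->look_down(_tag => 'a')) {
    my $href = $anchor->attr('href');
    print "$href\n" if defined $href;
}
$tree->delete;   # release the parse tree when done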

– Leonardo Herrera

wget <url> -O -

This will write the contents of the URL to standard output, which can then be captured in memory.

– mob