
BACKGROUND: Every few weeks I need to read about ten thousand URLs for search engine optimization purposes (on sites I own/manage). In the future, the total number of URLs will grow by a factor of 20 to support other language versions of my site.

When I use Rebol, it takes about a second per URL to read and process, around three hours in total. To reduce the time to completion, I want to split the job into a number of smaller batches that can execute simultaneously on multiple (local) interpreters.

My current thinking is that my script will write a number of .r files which, when launched in separate processes, will each process a subset of the URL list.
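Roughly, I'm imagining something like this (an untested sketch only; the %urls.txt input, the worker-*.r / result-*.txt names, and the fixed batch size are placeholders rather than my real script):

    REBOL [Title: "Batch launcher (sketch)"]

    ; Hypothetical input: %urls.txt holds one URL per line.
    urls: read/lines %urls.txt
    batch-size: 1250   ; roughly 10,000 URLs split across 8 worker processes

    i: 0
    while [not tail? urls] [
        i: i + 1
        batch: copy/part urls batch-size
        urls: skip urls batch-size

        ; Write a small worker script that reads only its own subset of URLs.
        worker: to-file rejoin ["worker-" i ".r"]
        write worker rejoin [
            "REBOL []^/"
            "foreach url " mold batch " [^/"
            "    attempt [write/append " mold to-file rejoin ["result-" i ".txt"] " read to-url url]^/"
            "]^/"
        ]
        launch worker   ; start a separate interpreter process on the worker script
    ]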

QUESTION: I'm wondering if there are any tips or warnings/restrictions for launching interpreter processes in this manner. For http reads, I expect to launch fewer than 10 interpreters.

Happy to share my script and insights as I learn more.

Edoc
  • Did you use an event-based approach (a.k.a. no-wait)? – sqlab Sep 16 '15 at 06:48
  • No, I did not. Used just the standard 'read. – Edoc Sep 16 '15 at 13:26
  • Did you check UniServe? It is available in the Cheyenne source archive. – endo64 Sep 19 '15 at 13:27
  • No, I will take a look. – Edoc Sep 19 '15 at 14:56
  • When I check the docs (I'm not using Rebol 3 at the moment, http://www.rebol.com/docs/core23/rebolcore-14.html), I get a "not enough memory" error when I use a no-wait port with this approach: while [wait tp data: copy tp] [append content data] – Edoc Sep 19 '15 at 14:57
  • How did you open your http port? You have to use open/no-wait/direct, or you will keep copying data you have already read that the port is buffering. – sqlab Sep 25 '15 at 11:54
  • Ah, thanks. From that Ports documentation it wasn't clear that I could only use /no-wait with /direct. Now that I have it working (see the sketch after these comments), at first blush it doesn't seem significantly faster than using 'read. – Edoc Sep 25 '15 at 20:19
  • The http scheme handles the connection synchronously at opening. You would have to roll your own asynchronous connection to gain an advantage. You will not fetch a single connection or URL any faster, but you should see an advantage if you open more than one and handle them together. The best place to do this is in your own /awake function. – sqlab Sep 26 '15 at 06:44
  • If the larger part of the work is processing the data, asynchronous handling of the TCP connections will not bring a significant advantage. – sqlab Sep 26 '15 at 06:50
  • Does this help? https://www.youtube.com/watch?v=ktnu5HXpFvY – Graham Chiu Sep 30 '15 at 07:58
  • Thanks for the demo, Graham. But no, it doesn't help. :) Not to worry, though, I'm not stuck; I just need to spend some time rewriting my humble script to split up the work for a configurable number of interpreters. – Edoc Oct 02 '15 at 14:01
  • Could it be as simple as using launch to create copies of the script that process subsets of URLs? – grantwparks Sep 19 '16 at 03:25
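
For reference, the read loop discussed in the comments above looks roughly like this once the port is opened with both /no-wait and /direct, as sqlab suggests (a minimal sketch; the URL is only an example):

    port: open/direct/no-wait http://www.rebol.com/docs/core23/rebolcore-14.html
    content: make string! 100000

    ; wait blocks until more data arrives; copy returns whatever is available,
    ; and none once the connection closes, which ends the loop.
    while [wait port  data: copy port] [
        append content data
    ]
    close port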

0 Answers