
BACKGROUND: Every few weeks I need to read about ten thousand URLs for search engine optimization purposes (on sites I own/manage). In the future, the total number of URLs will grow by a factor of 20 to support other language versions of my site.

When I use Rebol, it takes about a second per URL to read and process, around three hours in total. To reduce the time to completion, I want to split the job into a number of smaller batches that can execute simultaneously on multiple (local) interpreters.

My current thinking is that my script will write a number of .r files which, when launched in separate processes, will each process a subset of the URL list.
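Roughly, I'm imagining something like this (an untested sketch only; the %urls.txt input, the worker-*.r / result-*.txt names, and the fixed batch size are placeholders rather than my real script):

    REBOL [Title: "Batch launcher (sketch)"]

    ; Hypothetical input: %urls.txt holds one URL per line.
    urls: read/lines %urls.txt
    batch-size: 1250   ; roughly 10,000 URLs split across 8 worker processes

    i: 0
    while [not tail? urls] [
        i: i + 1
        batch: copy/part urls batch-size
        urls: skip urls batch-size

        ; Write a small worker script that reads only its own subset of URLs.
        worker: to-file rejoin ["worker-" i ".r"]
        write worker rejoin [
            "REBOL []^/"
            "foreach url " mold batch " [^/"
            "    attempt [write/append " mold to-file rejoin ["result-" i ".txt"] " read to-url url]^/"
            "]^/"
        ]
        launch worker   ; start a separate interpreter process on the worker script
    ]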

QUESTION: I'm wondering if there are any tips or warnings/restrictions for launching interpreter processes in this manner. For http reads, I expect to launch fewer than 10 interpreters.

Happy to share my script and insights as I learn more.

Edoc
  • Did you use an event-based approach (a.k.a. no-wait)? – sqlab Sep 16 '15 at 06:48
  • No, I did not. Used just the standard 'read. – Edoc Sep 16 '15 at 13:26
  • Did you check UniServe? It is available in the Cheyenne source archive. – endo64 Sep 19 '15 at 13:27
  • No, I will take a look. – Edoc Sep 19 '15 at 14:56
  • When I check the docs (I'm not using Rebol 3 at the moment, http://www.rebol.com/docs/core23/rebolcore-14.html), I get a "not enough memory" error when I use a no-wait port with this approach: while [wait tp data: copy tp] [append content data] – Edoc Sep 19 '15 at 14:57
  • How did you open your http port? You have to use open/no-wait/direct, or you will keep copying data you have already read that the port is buffering. – sqlab Sep 25 '15 at 11:54
  • Ah, thanks. From that Ports documentation it wasn't clear that I could only use /no-wait with /direct. Now that I have it working (see the sketch after these comments), at first blush it doesn't seem significantly faster than using 'read. – Edoc Sep 25 '15 at 20:19
  • The http scheme handles the connection synchronously at opening. You would have to roll your own asynchronous connection to gain an advantage. You will not fetch a single connection or URL any faster, but you should see an advantage if you open more than one and handle them together. The best place to do this is in your own /awake function. – sqlab Sep 26 '15 at 06:44
  • If the larger part of the work is processing the data, asynchronous handling of the TCP connections will not bring a significant advantage. – sqlab Sep 26 '15 at 06:50
  • Does this help? https://www.youtube.com/watch?v=ktnu5HXpFvY – Graham Chiu Sep 30 '15 at 07:58
  • Thanks for the demo, Graham. But no, it doesn't help. :) Not to worry, though, I'm not stuck; I just need to spend some time rewriting my humble script to split up the work for a configurable number of interpreters. – Edoc Oct 02 '15 at 14:01
  • Could it be as simple as using launch to create copies of the script that process subsets of URLs? – grantwparks Sep 19 '16 at 03:25
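
For reference, the read loop discussed in the comments above looks roughly like this once the port is opened with both /no-wait and /direct, as sqlab suggests (a minimal sketch; the URL is only an example):

    port: open/direct/no-wait http://www.rebol.com/docs/core23/rebolcore-14.html
    content: make string! 100000

    ; wait blocks until more data arrives; copy returns whatever is available,
    ; and none once the connection closes, which ends the loop.
    while [wait port  data: copy port] [
        append content data
    ]
    close port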

0 Answers