0

What is the fastest way to download a webpage's source into a memo component? I use the Indy and HttpCli components.

The problem is that I have a listbox filled with more than 100 sites. My program downloads each site's source into a memo and parses that source for mp3 files. It is something like a Google music search program; it uses Google queries to make Google searches easier.

I started reading about threads, which led to my question: can I create an IdHttp instance in a thread with a parsing function and tell it to parse half of the sites in the listbox?

So basically, when a user clicks parse, the main thread should do:

for i := 0 to listbox1.items.count div 2 - 1 do
    get and parse

and the other thread should do:

for i := form1.listbox1.items.count div 2 to form1.listbox1.items.count - 1 do
    get and parse

so they would add parsed content to form1.listbox2 at the same time. Or is it maybe easier to start two IdHttp instances in the main thread: one for the first half of the sites and the other for the second?

For this: should I use Indy or Synapse?

Thom A
  • I would suggest you read the documentation about what Synchronize does, and make each thread ask for one (and only one) URL when it starts and each time after it has handled one URL. If the websites use XHTML I would also check MSXML2_TLB's DOMDocument.load method to see if loading and parsing performs well. – Stijn Sanders Nov 06 '11 at 18:53

3 Answers

9

I would create a thread that can read a single URL and process its content. You can then decide how many of those threads you want to run at the same time. Your computer will allow quite a number of connections, so if those 100 sites have different hostnames, it is not a problem to run 10 or 20 at the same time. Too many threads is overkill, but too few is a waste of processor time.

You can tweak this process even further by having separate threads for downloading and processing, so that you can have a number of threads constantly downloading content. Downloading is not very processor intensive. It is basically waiting for a response, so you can easily have a relatively large number of download threads, while a couple of other worker threads can grab items from the pool of results and process them.
But splitting downloading and processing will make it a little bit more complex, and I don't think you're up to that challenge yet.

Because currently, you have some other problems. First, you should not access VCL components from a thread, because the VCL is not thread-safe. If you need information from a listbox in a thread, you will either need to use Synchronize in the thread to make a 'safe' call to the main thread, or you will have to pass the information needed before you start the thread. The latter is more efficient, because code executed using Synchronize actually runs in the main thread, making your multi-threading less efficient.
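The "pass the information before you start the thread" approach could look roughly like the sketch below. It is only an illustration, not tested against any particular Delphi or Indy version: the unit name, `TPageFetchThread`, and the `TPageFetchedEvent` callback type are hypothetical names. It assumes Indy's `TIdHTTP.Get` overload that returns the response body as a string.

```pascal
unit PageFetchThread;

interface

uses
  Classes, SysUtils, IdHTTP;

type
  // Hypothetical callback type; the form supplies a method with this signature.
  TPageFetchedEvent = procedure(const AUrl, AHtml: string) of object;

  TPageFetchThread = class(TThread)
  private
    FUrl: string;
    FHtml: string;
    FOnFetched: TPageFetchedEvent;
    procedure ReportResult;
  protected
    procedure Execute; override;
  public
    constructor Create(const AUrl: string; AOnFetched: TPageFetchedEvent);
  end;

implementation

constructor TPageFetchThread.Create(const AUrl: string;
  AOnFetched: TPageFetchedEvent);
begin
  // Everything the thread needs is copied in here, so Execute never has
  // to read the listbox. Fields are set before the inherited constructor
  // so that they are in place whenever the thread starts running.
  FUrl := AUrl;
  FOnFetched := AOnFetched;
  FreeOnTerminate := True;
  inherited Create(False);
end;

procedure TPageFetchThread.Execute;
var
  Http: TIdHTTP;
begin
  Http := TIdHTTP.Create(nil);
  try
    try
      FHtml := Http.Get(FUrl); // download into a string, not a memo
    except
      on E: Exception do
        FHtml := ''; // assumption: a failed download is reported as empty
    end;
  finally
    Http.Free;
  end;
  // Only the final hand-over touches the main thread.
  if Assigned(FOnFetched) then
    Synchronize(ReportResult);
end;

procedure TPageFetchThread.ReportResult;
begin
  // Runs in the main thread, so the handler may update VCL controls.
  FOnFetched(FUrl, FHtml);
end;

end.
```

The form would read its listbox in the main thread and start one thread per entry, for example `TPageFetchThread.Create(ListBox1.Items[I], PageFetched)`, where `PageFetched` is a hypothetical form method matching `TPageFetchedEvent`. Starting all 100 at once is probably too many, so some cap on the number of running threads would still be needed.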

But my attention actually was drawn to the first line, "download webpage source into memo component". Don't do that! Don't load those results in a memo for processing. Automatic processing can best be done in memory, outside of visual controls. Using strings, streams, or even stringlists for processing a text is way faster than using a memo.
A stringlist has some overhead as well, but it indexes the lines in the same way (TMemoStrings, which is the Lines property of a Memo, and TStringList both descend from TStrings), so if you have code that makes use of this, it will be quite easy to convert it to TStringList.

GolezTrol
  • +1 nice approach, and thanks for pointing out that using a memo control is a bad idea. I have found a very nice thread-safe stringlist, which would be handy here (TThreadStringList by Tilo Eckert): http://www.swissdelphicenter.ch/torry/showcode.php?id=2167 It's a wrapper around TStringList that uses a critical section to ensure safe access to the underlying stringlist. – Chris Thornton Nov 07 '11 at 14:50
  • It's indeed convenient to use TThreadStringList for the list of items to download. Each downloaded item can be pushed to a separate TThreadStringList for processing. That way, you can split downloading and processing, like I suggested, without too much hassle. – GolezTrol Nov 07 '11 at 16:33
  • Thanks for the advice not to use a memo. I am now using a TStringList for the webpage source text. I'm a beginner with Delphi threads, so I fully understand what you mean, but it will take some time before I am able to code that. I use Google to get the sites, and after some time Google blocks my requests (anti-bot protection), so downloading and parsing at the same time is not a smart idea for me. Thanks for your answer... – Danijel Maksimovic Maxa Nov 07 '11 at 17:37
  • FYI, Indy has its own `TIdThreadSafeStringList` class. Look at the various classes available in the `IdThreadSafe.pas` unit. – Remy Lebeau Nov 07 '11 at 19:55
  • @Danijel: if Google is flagging you because you download data too often, then simply slow down how often you run your downloading code. Google has no way of knowing how fast you parse the data after you have downloaded it. – Remy Lebeau Nov 07 '11 at 19:57
5

I would suggest doing ALL of the parsing in threads; don't have the main thread do any parsing at all. The main thread should only manage the UI. Don't parse the HTML from a TMemo. Have each thread download to a TStream or String and then parse from that directly. Use TIdSync or TIdNotify to send parsing results to the UI for display (if speed is important, use TIdNotify, which is asynchronous and does not make the thread wait for the main thread). Involving the UI components in your parsing logic will slow it down.
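A minimal sketch of the TIdNotify route might look like this. It assumes the Indy 10 `IdSync` unit, where `TIdNotify` exposes an overridable `DoNotify` and a `Notify` method, and where the notification object is freed by Indy after `DoNotify` has run. `TParseResultNotify` and its `Post` helper are hypothetical names for illustration only.

```pascal
unit ParseNotify;

interface

uses
  Classes, StdCtrls, IdSync;

type
  // Hypothetical helper: carries one parsed line from a worker thread
  // to a listbox that is owned by the main thread.
  TParseResultNotify = class(TIdNotify)
  private
    FTarget: TListBox;
    FText: string;
  protected
    procedure DoNotify; override;
  public
    class procedure Post(ATarget: TListBox; const AText: string);
  end;

implementation

procedure TParseResultNotify.DoNotify;
begin
  // Executed in the context of the main thread, so touching the
  // listbox here is safe.
  FTarget.Items.Add(FText);
end;

class procedure TParseResultNotify.Post(ATarget: TListBox;
  const AText: string);
var
  N: TParseResultNotify;
begin
  N := TParseResultNotify.Create;
  N.FTarget := ATarget; // only stored here, never used in the worker
  N.FText := AText;
  N.Notify;             // queues the call and returns without waiting
end;

end.
```

A worker thread would then call something like `TParseResultNotify.Post(Form1.ListBox2, Link)` for each result. The sketch assumes the listbox outlives all pending notifications; a real program would need to guard against the form being destroyed while notifications are still queued.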

Remy Lebeau
  • If the parser is not just about parsing, but is doing some data processing, it may not be 100% multi-thread safe. IMHO parsing will be much faster than downloading. – Arnaud Bouchez Nov 07 '11 at 06:37
  • It's just a download-and-parse thing, plus StringReplace to replace %20, %3D, etc. in the listbox... – Danijel Maksimovic Maxa Nov 07 '11 at 18:59
  • All of that can be done without involving the UI until the final result is ready to be displayed. Gather the URLs into a TStringList, spawn threads as needed for the list entries, where each thread downloads into a String or TMemoryStream (TIdHTTP supports both), parses the data, and posts the result to the main thread using TIdNotify. That is thread-safe and avoids unnecessary UI bottlenecks. – Remy Lebeau Nov 07 '11 at 19:54
  • I am a beginner; I started reading about threads two days ago. This is what I'm able to do and understand so far [ http://pastebin.com/bL0Ezfgu ]. Any good examples would help me understand threads better. Thanks :) – Danijel Maksimovic Maxa Nov 07 '11 at 20:54
4

Indy and Synapse are both multi-thread ready. I'd recommend using Synapse, which is much lighter than Indy and will be sufficient for your purpose. Do not forget about the HTTP APIs provided by Microsoft.

Simple implementation:

  • One thread per URI;
  • Each thread gets the data using one HTTP communication;
  • Then each thread parses the data;
  • Then use Synchronize to refresh the UI.

Perhaps my favorite:

  • Define a number of maximum threads to be used (e.g. 8);
  • Each of these threads will maintain a remote connection (this is the purpose of HTTP/1.1 and can really make a difference about speed);
  • All requests are retrieved by those threads one by one - do not pre-assign URLs to threads, but retrieve a new URL from a global list when a thread has finished one (not every URL takes the same time);
  • The threads may wait until another URI is added to the global list (e.g. using Sleep(100) or a semaphore);
  • Then parse and update the UI in the main GUI thread, using a dedicated Windows message (WM_USER+...) - parsing will be fast IMHO (and remember that UI refresh can be slow - take a look at the BeginUpdate-EndUpdate methods for instance) - I found out that a Windows message (with the associated HTML data) is more efficient than using Synchronize, which blocks the background thread;
  • Another option is to do the parsing in the background thread, just after having retrieved the data from its URI - perhaps not worth it (only if your parser is slow), and you may come into multi-threading issues if your parser/data processor is not 100% thread-safe.

The 2nd is how popular so-called "download managers" are implemented.
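Handing the downloaded HTML to the main thread with a dedicated message could be sketched as follows. The message number `WM_PAGE_DONE`, the `TPageResult` record, and both helper routines are hypothetical names chosen for this illustration. The idea is that the worker allocates a small record, posts its address, and the receiver takes ownership and frees it.

```pascal
unit PageMessages;

interface

uses
  Windows, Messages;

const
  WM_PAGE_DONE = WM_USER + 1; // hypothetical application-defined message

type
  PPageResult = ^TPageResult;
  TPageResult = record
    Url: string;
    Html: string;
  end;

// Called from a worker thread. ATarget is a window handle that was read
// in the main thread and handed to the worker before it started.
procedure PostPageResult(ATarget: HWND; const AUrl, AHtml: string);

// Called from the message handler in the main thread. Copies the data
// out and frees the record that PostPageResult allocated.
procedure TakePageResult(const Msg: TMessage; out AUrl, AHtml: string);

implementation

procedure PostPageResult(ATarget: HWND; const AUrl, AHtml: string);
var
  P: PPageResult;
begin
  New(P);
  P^.Url := AUrl;
  P^.Html := AHtml;
  // PostMessage returns immediately, so the worker is not blocked.
  // If posting fails, nobody will receive the record: free it here.
  if not PostMessage(ATarget, WM_PAGE_DONE, 0, LPARAM(P)) then
    Dispose(P);
end;

procedure TakePageResult(const Msg: TMessage; out AUrl, AHtml: string);
var
  P: PPageResult;
begin
  P := PPageResult(Msg.LParam);
  try
    AUrl := P^.Url;
    AHtml := P^.Html;
  finally
    Dispose(P);
  end;
end;

end.
```

On the receiving side, the form would declare a handler such as `procedure WMPageDone(var Msg: TMessage); message WM_PAGE_DONE;`, call `TakePageResult` in it, and then parse and fill the listbox between `BeginUpdate` and `EndUpdate`. One caveat with this sketch: the VCL can recreate a form's window handle, so a handle obtained with `AllocateHWnd` may be a more robust target than `Form1.Handle`.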

When you deal with multithreading, you'll have to "protect" your shared resources (lists, e.g.). Use a TCriticalSection to access any global list (e.g. the URI list), and release the lock as soon as possible.
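A protected global list might be sketched like this. `TUrlQueue` and its method names are hypothetical; the sketch uses only `TStringList` and `TCriticalSection` from the standard `Classes` and `SyncObjs` units. Each lock is held just long enough to add or remove one entry.

```pascal
unit UrlQueue;

interface

uses
  Classes, SyncObjs;

type
  // Hypothetical wrapper: a list of pending URLs shared by all threads.
  TUrlQueue = class
  private
    FLock: TCriticalSection;
    FUrls: TStringList;
  public
    constructor Create;
    destructor Destroy; override;
    procedure Add(const AUrl: string);
    // Removes the next URL and returns True, or returns False if the
    // list is currently empty.
    function TryNext(out AUrl: string): Boolean;
  end;

implementation

constructor TUrlQueue.Create;
begin
  inherited Create;
  FLock := TCriticalSection.Create;
  FUrls := TStringList.Create;
end;

destructor TUrlQueue.Destroy;
begin
  FUrls.Free;
  FLock.Free;
  inherited Destroy;
end;

procedure TUrlQueue.Add(const AUrl: string);
begin
  FLock.Acquire;
  try
    FUrls.Add(AUrl);
  finally
    FLock.Release;
  end;
end;

function TUrlQueue.TryNext(out AUrl: string): Boolean;
begin
  FLock.Acquire;
  try
    Result := FUrls.Count > 0;
    if Result then
    begin
      // Reading and deleting happen under the same lock, so two
      // threads can never receive the same URL.
      AUrl := FUrls[0];
      FUrls.Delete(0);
    end;
  finally
    FLock.Release; // released before any download starts
  end;
end;

end.
```

The main thread would fill the queue from the listbox before starting the workers. Each worker's `Execute` could then loop with `while (not Terminated) and Queue.TryNext(Url) do`, downloading one URL per iteration. No URL is ever pre-assigned to a particular thread.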

And try to test your implementation on several computers and networks, with concurrent access, and on diverse operating systems. Debugging multi-threaded applications can be difficult, so the simpler the implementation, the better: that is the reason why I recommend making the download part multi-threaded, but letting the main thread process the data (which won't be huge, so it should be fast).

Arnaud Bouchez
  • Can you provide some simple code showing how to retrieve a URL from the list? I don't know how to split 100 URLs from the listbox between 8 threads... I can put the link in a variable and send it to the thread before thread.resume, but how do I give it a link after I have started it? Thanks – Danijel Maksimovic Maxa Nov 09 '11 at 21:23
  • @DanijelMaksimovicMaxa The URIs are just stored in a global `TStringList`, which is read by every thread when it is free to download a new file. You *do not* assign URIs to the threads, but you let each thread ask the list for any remaining URI to be downloaded. You must protect the access to the list with a TCriticalSection, in order to avoid two threads retrieving the same URI at once. – Arnaud Bouchez Jan 27 '12 at 12:45