
I have a problem with my HTML scraper. The scraper is a multithreaded Java application built on HtmlUnit; by default it runs with 128 threads. In short, it works as follows: it takes a site URL from a big text file, pings the URL, and if the site is accessible it parses it, finds specific HTML blocks, saves all URL and block info (including the HTML code) into the corresponding tables in the database, and goes on to the next site. The database is MySQL 5.1; there are 4 InnoDB tables and 4 views. The tables have numeric indexes on the fields used in joins. I also have a web interface for browsing and searching the parsed data (for searching I use Sphinx with delta indexes), written in CodeIgniter.
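A worker thread's per-URL logic looks roughly like this (a simplified sketch, not the real code; the XPath expression and class names are made up):

```java
import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

// Simplified per-URL job; the real scraper runs 128 of these concurrently.
public class ScraperWorker implements Runnable {
    private final String url; // taken from the big text file

    public ScraperWorker(String url) {
        this.url = url;
    }

    @Override
    public void run() {
        WebClient webClient = new WebClient();
        try {
            // Fetch the page; throws if the site is not accessible.
            HtmlPage page = webClient.getPage(url);

            // Find the specific HTML blocks (XPath is a placeholder).
            List<?> blocks = page.getByXPath("//div[@class='target-block']");
            for (Object block : blocks) {
                // save the URL and block info, including the HTML code,
                // into the corresponding MySQL tables (omitted)
            }
        } catch (Exception e) {
            // unreachable site or parse failure: skip to the next URL
        } finally {
            webClient.closeAllWindows(); // release HtmlUnit resources
        }
    }
}
```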

Server configuration:

CPU: Xeon Quad Core X3440 2.53 GHz
RAM: 4 GB
HDD: 1TB SATA
OS: Ubuntu Server 10.04

Some MySQL config:

key_buffer = 256M
max_allowed_packet = 16M
thread_stack = 192K
thread_cache_size = 128
max_connections = 400
table_cache = 64
query_cache_limit = 2M
query_cache_size = 128M

The JVM runs with default parameters except for the following options:

-Xms1024m -Xmx1536m -XX:-UseGCOverheadLimit -XX:NewSize=500m -XX:MaxNewSize=500m -XX:SurvivorRatio=6 -XX:PermSize=128M -XX:MaxPermSize=128m -XX:ErrorFile=/var/log/java/hs_err_pid_%p.log 

When the database was empty, the scraper processed 18 URLs per second and was stable enough. But after 2 weeks, now that the urls table contains 384929 records (~25% of all processed URLs) and takes up 8.2 GB, the Java application began to work very slowly and crashes every 1-2 minutes. I guess the reason is MySQL, which cannot handle the growing load (the parser performs 2 + 4*BLOCK_NUMBER queries for every processed URL; Sphinx updates its delta indexes every 10 minutes; I don't count the web interface, because it's used by only one person). Maybe it rebuilds indexes very slowly? But the MySQL and scraper logs (the latter also contains all uncaught exceptions) are empty. What do you think?
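(For scale: at the initial 18 URLs/s this is 18 × (2 + 4 × BLOCK_NUMBER) queries per second; if BLOCK_NUMBER were, say, 5, that would be about 18 × 22 ≈ 400 queries per second against MySQL.)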

c1tru55
  • Can you give more details of the crash? Is it a JVM crash, or are you getting an error like OutOfMemoryError? Have you tried memory profiling your application or increasing the maximum memory? – Peter Lawrey Jan 17 '12 at 13:51
  • It's not an OutOfMemoryError; the application just shuts down silently after a few minutes (maybe due to MySQL). At that time the web interface is not responding and SQL queries run very slowly (300 s and more). I tried increasing the max memory but it doesn't help. – c1tru55 Jan 18 '12 at 05:20

3 Answers


I'd recommend running the following just to check a few status things; putting that output here would help as well:

  1. dmesg
  2. top (check resident vs. virtual memory per process)
technocrat
  • **top** `VIRT RES SHR %CPU %MEM COMMAND` `823m 53m 2960 460 1.3 mysqld` `3094m 1.9g 10m 329 49.1 java` – c1tru55 Jan 18 '12 at 05:56
  • wow yeah, Java is definitely up there. Did you find anything conclusive in dmesg? It should show which thread died. Also, have you noticed a trend in memory usage for either of those programs yet? If you run top like this `top -p[pid],[pid]` you'll be able to watch those two exclusively. If the Java application is crashing every 1-2 minutes and its RAM usage hits 1.9g at some point within those 1-2 minutes, it might indicate a memory leak. – technocrat Jan 18 '12 at 14:35

So the application becomes unresponsive? (Not the same as a crash at all.) I would check that all your resources are free, e.g. do a jstack to check whether any threads are tied up.

Check in MySQL that you have the expected number of connections. If you continuously create connections in Java and don't clean them up, the database will run slower and slower.
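For example, a minimal sketch of the cleanup pattern (not the poster's actual code; the table, credentials and JDBC URL are made up):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class UrlDao {
    // Hypothetical connection settings; needs MySQL Connector/J on the classpath.
    private static final String DB_URL = "jdbc:mysql://localhost:3306/scraper";

    public void saveUrl(String url) throws SQLException {
        Connection con = DriverManager.getConnection(DB_URL, "user", "pass");
        try {
            PreparedStatement ps =
                    con.prepareStatement("INSERT INTO urls (url) VALUES (?)");
            try {
                ps.setString(1, url);
                ps.executeUpdate();
            } finally {
                ps.close();
            }
        } finally {
            con.close(); // always release the connection, even on failure
        }
    }
}
```

Better still is to reuse one connection per worker thread or use a connection pool, so 128 threads don't open and close connections on every URL.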

Peter Lawrey

Thank you all for your advice; MySQL was indeed the cause of the problem. By enabling the slow query log in my.cnf I saw that one of the queries, which executes on every iteration, was taking 300 s (one field used for searching was not indexed).
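For anyone hitting the same thing, the relevant my.cnf lines were roughly these (the log path is illustrative):

```
log_slow_queries = /var/log/mysql/mysql-slow.log
long_query_time  = 1
```

The fix was simply adding an index (`ALTER TABLE ... ADD INDEX`) on the field used for searching.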

c1tru55