0

I have a few million documents. What I am trying to do is simple, process the documents to extract the information I need and load it into a database. I am doing it in Python and using SQLAlchemy. Also I am using multiprocessing to make use of all the cores on my machine. The documents are XML with huge chunks of text. The database is MySQL with a custom relation schema defined.

However, it runs very slow and loads only about 50k documents in 6-7 hours.

Is there any way that I can speed this task up?

tshepang
  • 12,111
  • 21
  • 91
  • 136
y2p
  • 4,791
  • 10
  • 40
  • 56
  • 1
    does putting the extracted data in mysql a must? What are you going to do with it later? – Guy Gavriely Jan 20 '14 at 19:28
  • Do you have any indication of what the bottleneck is? ie, timeit results for the processing/inserting/etc? Some common speedups are: turn off autocommit(!), turn off (console) printing of logs, break things into smart batching, change the number of threads you're using... – a p Jan 20 '14 at 19:38
  • @Guy: The data is going to be used by the rest of my team later, for different purposes. So I need it accessible and queryable – y2p Jan 20 '14 at 19:40
  • @ap: `autocommit` is off, no `logging`. What do you mean by smart batching? I did play with the number of threads and the cpu_count() too, didn't make a lot of difference. – y2p Jan 20 '14 at 19:52

1 Answers1

0

sometimes RDBMS is not the answer, one sign for such situation is if your data has no relations to one another, for example, if every document stands by itself.

if you'd like to have some unstructured data searchable, consider building a searchable index using pylucene

or maybe put the data in some non-rel database like mongodb

in any case, try to identify what part of your system is slowing down the process, my guess would be the database or the file system, if this is mysql all you can do is throwing more hardware on it.

another way to optimize a system that use IO extensively is to switch to async programming using a library like twisted but it has some learning curve, so better make 100% sure its needed.

Guy Gavriely
  • 11,228
  • 6
  • 27
  • 42