I have a few million documents. What I am trying to do is simple: process the documents, extract the information I need, and load it into a database. I am doing this in Python with SQLAlchemy, and I am using multiprocessing
to make use of all the cores on my machine. The documents are XML with huge chunks of text. The database is MySQL with a custom relational schema.
However, it runs very slowly: it loads only about 50k documents in 6-7 hours.
Is there any way that I can speed this task up?
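
To make the setup concrete, here is a minimal sketch of a pipeline of this shape: worker processes parse the XML, and the parent process writes the extracted rows to the database in batches. It is illustrative only and not the actual code in question. The `<title>` and `<body>` element names, the `docs` table, the batch size, and the helper names `extract_fields`, `batched`, and `load` are all hypothetical, and the standard-library `sqlite3` module stands in for MySQL via SQLAlchemy so the example is self-contained.

```python
import sqlite3
import xml.etree.ElementTree as ET
from itertools import islice
from multiprocessing import Pool


def extract_fields(xml_text):
    """Parse one XML document and return the (title, body) pair to store.

    The <title> and <body> element names are assumptions for illustration.
    """
    root = ET.fromstring(xml_text)
    return (root.findtext("title", default=""), root.findtext("body", default=""))


def batched(iterable, size):
    """Yield successive lists of at most `size` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def load(xml_docs, conn, batch_size=1000, workers=4):
    """Parse in worker processes; insert from the parent, one commit per batch."""
    # Hypothetical single-table schema; the real schema is relational and custom.
    conn.execute("CREATE TABLE IF NOT EXISTS docs (title TEXT, body TEXT)")
    with Pool(workers) as pool:
        # imap streams results back in order without holding every row in memory.
        rows = pool.imap(extract_fields, xml_docs, chunksize=100)
        for chunk in batched(rows, batch_size):
            with conn:  # the connection context manager commits once per batch
                conn.executemany("INSERT INTO docs VALUES (?, ?)", chunk)


if __name__ == "__main__":
    docs = [f"<doc><title>t{i}</title><body>text {i}</body></doc>" for i in range(5000)]
    conn = sqlite3.connect(":memory:")
    load(docs, conn)
    print(conn.execute("SELECT COUNT(*) FROM docs").fetchone()[0])
```

Keeping the database writes in one process and committing once per batch, rather than once per document, is a design choice of this sketch, not a confirmed description of the original code. Whether it changes the throughput depends on where the time actually goes (parsing versus inserting).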