
I have a list of about 200,000 entities, and I need to query a specific RESTful API for each of them, ending up with all 200,000 entities saved as JSON in text files. The naive way is to go through the list, query the API one entity at a time, append the returned JSON to a list, and when it's done, write everything to a text file. Something like:

import simplejson
from apiWrapper import api
from entities import listEntities  # list of the 200,000 entities

a = api()
fullEntityList = []
for entity in listEntities:
    fullEntityList.append(a.getFullEntity(entity))

with open("fullEntities.txt", "w") as f:
    simplejson.dump(fullEntityList, f)

Obviously this is not reliable, as 200,000 queries to the API will take about 10 hours or so, and something will probably cause an error before it ever gets to write the file. I guess the right way is to write it in chunks, but I'm not sure how to implement that. Any ideas? Also, I cannot do this with a database.

leonsas

3 Answers


I would recommend writing them to a SQLite database. This is the way I do it for my own tiny web spider applications, because you can query the keys quite easily and check which ones you have already retrieved. That way, your application can easily continue where it left off, in particular if you get some 1000 new entries added next week.

Do design "recovery" into your application from the beginning. If there is some unexpected exception (Say, a timeout due to network congestion), you don't want to have to restart from the beginning, but only those queries you have not yet successfully retrieved. At 200.000 queries, an uptime of 99.9% means you have to expect 200 failures!

For space efficiency and performance it will likely pay off to use a compressed format, such as compressing the json with zlib before dumping it into the database blob.
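
A rough, untested sketch of what that could look like, reusing the api and listEntities names from the question (and assuming each entity value can be used as a text key):

import sqlite3
import zlib
import simplejson
from apiWrapper import api
from entities import listEntities #list of the 200,000 entities

conn = sqlite3.connect("entities.db")
conn.execute("CREATE TABLE IF NOT EXISTS entity (key TEXT PRIMARY KEY, json BLOB)")

a = api()
for entity in listEntities:
    # skip entities that were already retrieved, so a crash only costs the current query
    if conn.execute("SELECT 1 FROM entity WHERE key = ?", (entity,)).fetchone():
        continue
    try:
        data = a.getFullEntity(entity)
    except Exception:
        continue  # leave it for the next run or a dedicated retry pass
    blob = sqlite3.Binary(zlib.compress(simplejson.dumps(data).encode("utf-8")))
    conn.execute("INSERT INTO entity (key, json) VALUES (?, ?)", (entity, blob))
    conn.commit()  # at most one query is lost if the process dies here
conn.close()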

SQLite is a good choice, unless your spider runs on multiple hosts at the same time. For a single application, sqlite is perfect.

Has QUIT--Anony-Mousse

The easy way is to open the file in 'a' (append) mode and write them one by one as they come in.
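
A minimal sketch of that append idea, reusing the names from the question (one JSON object per line, so a crash loses at most the record currently being written, and you read the file back line by line rather than as one big JSON list):

import simplejson
from apiWrapper import api
from entities import listEntities #list of the 200,000 entities

a = api()
with open("fullEntities.txt", "a") as f:
    for entity in listEntities:
        record = a.getFullEntity(entity)
        f.write(simplejson.dumps(record) + "\n")
        f.flush()  # push each record to disk as soon as it is written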

The better way is to use a job queue. This will allow you to spawn off a.getFullEntity calls into worker thread(s) and handle the results however you want when/if they come back, or schedule retries for failures, etc. See Queue.
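
Roughly, the queue version could look like this (untested; it assumes api() can be instantiated once per worker thread, and it simply collects failures into a list for a later retry pass instead of scheduling retries):

import threading
import Queue
import simplejson
from apiWrapper import api
from entities import listEntities #list of the 200,000 entities

NUM_WORKERS = 4
jobs = Queue.Queue()
results = Queue.Queue()
failures = []

def worker():
    a = api()  # assumes the API wrapper can be instantiated once per thread
    while True:
        entity = jobs.get()
        try:
            results.put(a.getFullEntity(entity))
        except Exception:
            failures.append(entity)  # collect failures for a later retry pass
        finally:
            jobs.task_done()

def writer():
    with open("fullEntities.txt", "a") as f:
        while True:
            record = results.get()
            f.write(simplejson.dumps(record) + "\n")
            f.flush()
            results.task_done()

for _ in range(NUM_WORKERS):
    t = threading.Thread(target=worker)
    t.daemon = True
    t.start()

w = threading.Thread(target=writer)
w.daemon = True
w.start()

for entity in listEntities:
    jobs.put(entity)

jobs.join()     # every entity has been fetched (or recorded as a failure)
results.join()  # everything fetched has also been written
print "%d entities failed, retry them in another pass" % len(failures)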

wim

I'd also use a separate thread that does the file-writing, and a Queue to keep track of all entities. When I started off, I thought this would be done in 5 minutes, but it turned out to be a little harder: simplejson and all other such libraries I'm aware of do not support partial writing, so you cannot write one element of a list first and add another one later. So I tried to solve this manually, by writing the "[", "," and "]" to the file separately and then dumping each entity on its own.

Without being able to check it (as I don't have your api), you could try:

import threading
import Queue
import simplejson
from apiWrapper import api
from entities import listEntities #list of the 200,000 entities

CHUNK_SIZE = 1000

class EntityWriter(threading.Thread):
    # class-level flag: the opening "[" must only be written once across all writers
    lines_written = False
    _filename = "fullEntities.txt"

    def __init__(self, queue):
        super(EntityWriter, self).__init__()
        self._q = queue
        self.running = False

    def run(self):
        self.running = True
        with open(self._filename, "a") as f:
            while True:
                try:
                    # drain the queue; stop as soon as it is momentarily empty
                    entity = self._q.get(block=False)
                    if not EntityWriter.lines_written:
                        EntityWriter.lines_written = True
                        f.write("[")
                        simplejson.dump(entity, f)
                    else:
                        f.write(",\n")
                        simplejson.dump(entity, f)
                except Queue.Empty:
                    break
        self.running = False

    def finish_file(self):
        with open(self._filename, "a") as f:
            f.write("]")


a = api()
fullEntityQueue = Queue.Queue(2 * CHUNK_SIZE)
n_entities = len(listEntities)
writer = None
for i, entity in enumerate(listEntities):
    fullEntityQueue.put(a.getFullEntity(entity))
    # start a (new) writer every CHUNK_SIZE entities, and at the end of the list
    if (i + 1) % CHUNK_SIZE == 0 or i == n_entities - 1:
        if writer is None or not writer.running:
            writer = EntityWriter(fullEntityQueue)
            writer.start()
writer.join()
writer.finish_file()

What this script does

The main loop still iterates over your list of entities, getting the full information for each. Each entity is then put into a Queue. Every 1000 entities (and at the end of the list) an EntityWriter thread is launched that runs in parallel to the main thread. This EntityWriter takes entities from the Queue and dumps them to the desired output file.

Some additional logic is required to make the output a valid JSON list; as mentioned above, I write the "[", "," and "]" manually. The resulting file should, in principle, be understood by simplejson when you reload it.
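
As a quick sanity check once finish_file() has run, something along these lines should load the file back into a single list (a sketch, assuming the filename used above):

import simplejson

with open("fullEntities.txt") as f:
    reloaded = simplejson.load(f)
print "reloaded %d entities" % len(reloaded)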

Thorsten Kranz