4

I have attempted to clean up and revise code in an answer here for my needs where I only want to delete from the Model Reservations for data records prior to the date expressed in the get as yy,mm,dd.

If I am correctly anticipating the action of cleanTable/2012/10/5 against the routing ('/cleanTable/([\d]+)/([\d]+)/([\d]+)', CleanTable) then my code would only delete at most 50 (10*nlimit) data records.

Btw, the author of the original code (who likely no longer subscribes to SO), claimed his main trick for accomplishing this code was "to include redirect in html instead of using self.redirect".

I am unfamiliar with raise Exception and the like, but my instinct would be to add a raise Exception or raise StopIteration to the for loop after it is made into a while loop. But it is not clear to me whether raising an StopIteration exception actually causes iteration to stop or if more is needed. Also, I don't know how to revise so the html ends smoothly upon early exit.

class CleanTable(BaseHandler):

    def get(self, yy,mm,dd):
        nlimit=5
        iyy=int(yy)
        imm=int(mm)
        idd=int(dd)
        param=date(iyy,imm,idd)
        q=Reservations.all(keys_only=True)
        q.filter("date < ", dt(iyy,imm,idd))
        results = q.fetch(nlimit)
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.out.write("""
          <html>
          <meta HTTP-EQUIV="REFRESH" content="url=http://yourapp.appspot.com/cleanTable">
            <body>""")

        try:
            for i in range(10):
                db.delete(results)
                results = q.fetch(nlimit, len(results))
                for r in results:
                    logging.info("r.name: %s" % r.name)
                self.response.out.write("<p> "+str(nlimit)+" removed</p>")
                self.response.out.write("""
                </body>
              </html>""")

        except Exception, inst:
            logging.info("inst: %s" % inst)
            self.response.out.write(str(inst))
Community
  • 1
  • 1
zerowords
  • 2,915
  • 4
  • 29
  • 44
  • This looks like a really ugly hack... From your code, I assume that the keys of all your entities fit into memory of your app. If that's true, a much more elegant and reliable way would be to use Task queues - for every key, start a task with that key to delete it. The worker would then load then perform the deletion. If this doesn't work, batch the process. If I have time later, I'll post sample code. – Sologoub Oct 07 '12 at 16:56
  • You can batch delete ( db.delete(keys) ) the resulting entities of a query in a task. A task has a timelimit of 10 minutes. If a task deletes max 10000 entities per run and you have more entities to delete, you repeat the task until your query results in less than 10000 entities. You can also use mapreduce: https://developers.google.com/appengine/docs/python/dataprocessing/overview – voscausa Oct 07 '12 at 21:58
  • My ulitimate use of this code would be in a cron job to throw away old records when I don't know how many records there are, so I thought some sort of while loop would be appropriate. – zerowords Oct 08 '12 at 17:25

2 Answers2

0

This is not the best approach to clean your models. A better approach would be to get all the keys of your entities and create Task Queues. Each queue will get a batch of keys for the entities that need to be modified.

Another approach would also be to create a cron job that will query for the x number of oldest modified entities, fix them and then store them back.

Finally, if your number of entities is so huge, you could also consider the use of Backends.

Hope this helps.

Thanos Makris
  • 3,115
  • 2
  • 17
  • 26
0

Here is my update routine and it has converted 500.000 entities. Be sure to run it on a backend instance (You can target a Queue to a backend instance). Notice that I am using a cursor, thats the only way you can consistently iterate through data (Never use offset!).

Queue queue = QueueFactory.getQueue("grinderQueue");
    queue.add(TaskOptions.Builder.withPayload(new DeferredTask() { //lets generate
        private static final long serialVersionUID = 1L;
        @Override
        public void run() { 
             String cursor = null;
             boolean done = false;
             Date now = new Date(1346763868L * 1000L); // 09/04/2012 

             while(!done) {
                 DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();

                 Query query = new Query("Venue");
                 query.setFilter(new FilterPredicate("timeOfLastUpdate", Query.FilterOperator.LESS_THAN,now));
                 PreparedQuery pq = datastore.prepare(query);
                 FetchOptions fetchOptions = FetchOptions.Builder.withLimit(1000);
                 if(cursor != null)
                     fetchOptions.startCursor(Cursor.fromWebSafeString(cursor));

                 QueryResultList<Entity> results = pq.asQueryResultList(fetchOptions);              


                 List<Entity> updates = new ArrayList<Entity>();
                 List<Entity> oldVenueUpdates = new ArrayList<Entity>();
                 int tuples = 0;
                 for(Entity en : results) {
                    tuples++;
                    try { 
                        if(en.getProperty(Venue.VENUE_KEY) == null)
                            continue;
                        Entity newVenue = new Entity("CPVenue",(String)en.getProperty(Venue.VENUE_KEY));                
                        newVenue.setPropertiesFrom(en);
                        newVenue.removeProperty("timeOfLastVenueScoreCalculation");
                        newVenue.removeProperty("actionsSinceLastVenueScoreCalculation");
                        newVenue.removeProperty("venueImageUrl");
                        newVenue.removeProperty("foursquareId");

                        newVenue.setProperty("geoCell", GeoCellCalculator.calcCellId(Double.valueOf((String)en.getProperty("lng")), Double.valueOf((String)en.getProperty("lat")),8)); 
                        newVenue.setProperty(Venue.TIME_SINCE_LAST_UPDATE, new Date());
                        updates.add(newVenue);

                        Venue v = new Venue(newVenue);

                        //Set timestamp on Venue
                        en.setProperty("timeOfLastUpdate", now);
                        oldVenueUpdates.add(en);

                    }catch(Exception e) {
                        logger.log(Level.WARNING,"",e);             
                    }
                 }
                done = tuples == 0;
                tuples = 0;
                if(results.getCursor() != null)
                    cursor = results.getCursor().toWebSafeString();
                else
                    done = true;

                System.out.println("Venue Conversion LOOP updates.. " + updates.size() + " cursor " + cursor);
                datastore.put(updates);
                datastore.put(oldVenueUpdates);
             }
             System.out.println("Venue Conversion DONE");
        }}));
Middy
  • 91
  • 1
  • 9
  • I don't know what it means to run it on a backend instance. I assume the language used is Java, but my app is python on gae. Can I use your code anyway? – zerowords Oct 30 '12 at 14:10
  • Well you need to convert it to python. But its all API call so it should be easy. You can define a task QUEUE with a target (Just a name) that name should be a backend instance, which you need to configure as well – Middy Nov 01 '12 at 07:42
  • I'll have to study up to do this and it may take some time. And I'll look in the docs for `backend instance`. I'll try to remember to get back to you. – zerowords Nov 02 '12 at 12:22