
The application syncs data records between devices via an online MongoDB collection. Multiple devices can send batches of new or modified records to the server's Mongo collection at any time. Each device fetches the record updates it doesn't already have by requesting records added or modified since its last get request.

Approach 1 was to add a Date field (called stored1) to each record before saving it to MongoDB. When a device requests records, MongoDB paging is used to skip entries up to the current page and then limit the result to 1000. Now that the data set is large, each page request takes a long time, and Mongo hit a memory error:
https://docs.mongodb.com/manual/reference/limits/#operations

Setting allowDiskUse(true), as shown in the posted code, isn't fixing the memory error in my current configuration for some reason. Even if that can be fixed, it wouldn't be a long-term solution, because the query times with paging are already too long.

Approach 2:

What is the best way for pagination on mongodb using java

https://arpitbhayani.me/blogs/benchmark-and-compare-pagination-approach-in-mongodb

The second approach considered is to change from Mongo paging (skipping already-returned records) to asking for stored time > the largest stored time last received, repeating until the number of records in a response is less than the limit. This requires the stored timestamp to be unique across all records matching the query; otherwise the query could miss records or return duplicates. In the example code, using the stored2 field, there's still a chance of duplicate timestamps, even if the probability is low.
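To make the failure mode concrete, here is a minimal in-memory sketch (not the actual Spring/Mongo code) of "stored > last received" paging when two records share a timestamp across a page boundary. The class name TimestampOnlyPaging, the Rec type, and the sample data are all hypothetical, and a plain list stands in for the collection.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class TimestampOnlyPaging {

    /** Hypothetical in-memory stand-in for a stored record. */
    static final class Rec {
        final String id;
        final long stored;
        Rec(String id, long stored) { this.id = id; this.stored = stored; }
    }

    /** Simulates find(stored > lastStored), sorted ascending by stored, with a limit. */
    static List<Rec> page(List<Rec> collection, long lastStored, int limit) {
        return collection.stream()
                .filter(r -> r.stored > lastStored)
                .sorted(Comparator.comparingLong((Rec r) -> r.stored))
                .limit(limit)
                .collect(Collectors.toList());
    }

    /** Keeps requesting pages until a page comes back shorter than the limit. */
    static List<String> fetchAll(List<Rec> collection, int limit) {
        List<String> ids = new ArrayList<>();
        long last = Long.MIN_VALUE;
        while (true) {
            List<Rec> batch = page(collection, last, limit);
            for (Rec r : batch) {
                ids.add(r.id);
                last = r.stored; // batch is sorted ascending, so this ends at the largest value
            }
            if (batch.size() < limit) {
                return ids;
            }
        }
    }

    public static void main(String[] args) {
        // "b" and "c" share timestamp 200 and straddle the page boundary when limit = 2
        List<Rec> collection = Arrays.asList(
                new Rec("a", 100), new Rec("b", 200), new Rec("c", 200), new Rec("d", 300));
        System.out.println(fetchAll(collection, 2)); // prints [a, b, d]
    }
}
```

Record "c" is never returned: the first page ends at stored = 200, and the second page asks for stored > 200. Switching the comparison to >= would return "b" a second time instead.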

Mongo has a BSON timestamp type whose values are guaranteed unique within a single mongod instance, but I don't see a way to use it with document save(), or to query on it, in Spring Boot. It would need to be set on each record that is newly inserted, replaced, or updated. https://docs.mongodb.com/manual/reference/bson-types/#timestamps

Any suggestions on how to do this?

@Getter
@Setter
public abstract class DataModel {

    private Map<String, Object> data;

    @Id // maps this field name to the database _id field, automatically indexed
    private String uid;

    /** Time this entry is written to the db (new or modified), to support querying for changes since last query */
    private Date stored1; //APPROACH 1
    private long stored2; //APPROACH 2
}

/** SpringBoot+MongoDb database interface implementation */
@Component
@Scope("prototype")
public class SpringDb implements DbInterface {

    @Autowired
    public MongoTemplate db; // the database

    @Override
    public boolean set(Collection<?> newRecords, Collection<?> updatedRecords) {
        // get current time for this set
        Date date = new Date();
        Instant now = Instant.now(); // capture once so seconds and nanos come from the same instant
        int randomOffset = ThreadLocalRandom.current().nextInt(0, 500000);
        long startingNs = now.getEpochSecond() * 1_000_000_000L + now.getNano() + randomOffset;
        int ns = 0;

        if (updatedRecords != null && updatedRecords.size() > 0) {
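            for (Object record : updatedRecords) {
                DataModel entry = (DataModel) record; // Collection<?> yields Object, so cast to reach the setters
                entry.setStored1(date); //APPROACH 1
                entry.setStored2(startingNs + ns++); //APPROACH 2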
                db.save(entry, repoName);
            }
        }

        // for new documents only
        if (newRecords != null && newRecords.size() > 0) {
            for (Object record : newRecords) {
                DataModel entry = (DataModel) record; // Collection<?> yields Object, so cast to reach the setters
                entry.setStored1(date); //APPROACH 1
                entry.setStored2(startingNs + ns++); // APPROACH 2
            }
    
            //multi record insert
            db.insert(newRecords, repoName);
        }

        return true;
    }

    @Override
    public List<DataModel> get(Map<String, String> params, int maxResults, int page, String sortParameter) {
        // generate query
        Query query = buildQuery(params);

        //APPROACH 1
        // do a paged query (skip + limit)
        Pageable pageable = PageRequest.of(page, maxResults, Direction.ASC, sortParameter);
        List<DataModel> queryResults = db.find(query.allowDiskUse(true).with(pageable), DataModel.class, repoName); //allowDiskUse(true) not working, still get memory error
        // count total results
        Page<DataModel> pageQuery = PageableExecutionUtils.getPage(queryResults, pageable,
                () -> db.count(Query.of(query).limit(-1).skip(-1), DataModel.class, repoName));
        // return the query results
        queryResults = pageQuery.getContent();

        //APPROACH 2 (replaces the Approach 1 block above; only one of the two is active at a time)
        // List<DataModel> queryResults = db.find(query.allowDiskUse(true), DataModel.class, repoName);

        return queryResults;
    }

    @Override
    public boolean update(Map<String, String> params, Map<String, Object> data) {
        // generate query
        Query query = buildQuery(params);

        //This applies the same changes to every entry
        Update update = new Update();
        for (Map.Entry<String, Object> entry : data.entrySet()) {
            update.set(entry.getKey(), entry.getValue());
        }
        db.updateMulti(query, update, DataModel.class, repoName);

        return true;
    }

    private Query buildQuery(Map<String, String> params) {
        //...
    }
}
Robb Peebles
  • Instead of any date, you can use approach 2 with the mongo generated _id field. It'll work same as the date. – Harshit Dec 14 '21 at 05:38
  • At millisecond resolution it's not really reasonable to expect no two events will ever have the same timestamp. Is there another, unique, field that you could also include in the index and sort on? – Joe Dec 14 '21 at 07:33
  • I see the _id field is roughly the created time, but not guaranteed monotonic - https://docs.mongodb.com/manual/reference/bson-types/#std-label-objectid. More importantly, the query needs to be on modified time, not created time, to get all records that were modified (which includes newly created) since the last get. – Robb Peebles Dec 15 '21 at 05:24
  • Can you elaborate on how adding a unique field to the index would be used to solve the issue? I need to get records with (Modified time > last time), which I think requires the first index field to be “Modified”. Any _id can be modified. Example index (Modified, _id): (1,2), (1,3), (1,4), (2,1), (2,5). Get query #1: Modified>=0, _id>0, limit 2: returns _id's 2,3 (the next query needs to check for more at Modified=1 but skip _id’s <=3). Get query #2: Modified>=1, _id>3, limit 2: returns _id's 4,5 and MISSES _id 1. – Robb Peebles Dec 15 '21 at 05:53

1 Answer


The solution I ended up using was to define, and index on, another field called storedId, which is a string concatenation of the modified record's storedTime and its _id. This guarantees that every storedId value is unique, because _id is unique.

Here's an example to show how indexing and querying on the concatenated storedTime+_id field works, while indexing and querying on the separate storedTime and _id fields fails:

public abstract class DataModel {

    private Map<String, Object> data;

    @Indexed
    private String _id; // Unique id
    @Indexed
    private String storedTime; // Time this entry is written to the db (new or modified)

    @Indexed
    String storedId;    // String concatenation of storedTime and _id field
}
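One detail worth spelling out: because storedId is compared as a string, the time part has to sort lexicographically in the same order as it sorts chronologically. A minimal sketch of one way to build such a key follows, assuming (this is not stated in the answer) that the time is available as non-negative epoch milliseconds; the class StoredIds and the helper storedId(...) are hypothetical names.

```java
import java.util.Locale;

class StoredIds {

    /**
     * Builds a sort key from a timestamp and a record id.
     * Assumes storedTimeMillis is a non-negative epoch-millisecond value.
     */
    static String storedId(long storedTimeMillis, String id) {
        // Zero-pad to 19 digits (the width of Long.MAX_VALUE) so that
        // string order matches numeric order; Locale.ROOT forces ASCII digits.
        return String.format(Locale.ROOT, "%019d-%s", storedTimeMillis, id);
    }

    public static void main(String[] args) {
        // Unpadded keys sort incorrectly: "9-..." compares greater than "10-..."
        System.out.println("9-id1".compareTo("10-id1") > 0);                     // prints true
        // Padded keys sort in timestamp order
        System.out.println(storedId(9, "id1").compareTo(storedId(10, "id1")) < 0); // prints true
    }
}
```

Any fixed-width, most-significant-first time format (for example a fixed-precision ISO-8601 UTC string) would serve the same purpose; the padding is just the simplest case to show.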

//Querying on separate fields and indexes:
{
//storedTime, _id
     "time1", "id2"
     "time1", "id3"
     "time1", "id4"
     "time2", "id1"
     "time2", "id5"
}

 get (storedTime>"time0", _id>"id0", limit=2) // returns _id's 2,3 (next query needs to check for more at storedTime="time1" but skip _id’s <="id3")
 get (storedTime>="time1", _id>"id3", limit=2) // returns _id's 4,5
//FAILS because this second query MISSES _id 1  (Note any existing _id record can be modified at any time, so the _id fields are not in storedTime order)

//Querying on the combined field and index:
     {
    //storedId
     "time1-id2"
     "time1-id3"
     "time1-id4"
     "time2-id1"
     "time2-id5"
     }

 get (storedId>"time0", limit=2) // returns _id's 2,3 (next query for values greater than the greatest last value returned)
 get (storedId>"time1-id3", limit=2) // returns _id's 4,1 (next query for values greater than the greatest last value returned)
 get (storedId>"time2-id1", limit=2) // returns _id 5
//WORKS, this doesn't miss or duplicate any records
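The walkthrough above can be reproduced with a small in-memory simulation. This is only a sketch: a sorted TreeSet stands in for the storedId index, and the names StoredIdPaging, page, and fetchAll are hypothetical rather than taken from the real Spring code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class StoredIdPaging {

    /** Simulates find(storedId > last), sorted ascending by storedId, with a limit. */
    static List<String> page(TreeSet<String> index, String last, int limit) {
        List<String> out = new ArrayList<>();
        for (String key : index.tailSet(last, false)) { // false = strictly greater than 'last'
            if (out.size() == limit) {
                break;
            }
            out.add(key);
        }
        return out;
    }

    /** Requests pages until one comes back shorter than the limit. */
    static List<String> fetchAll(TreeSet<String> index, String start, int limit) {
        List<String> all = new ArrayList<>();
        String last = start;
        while (true) {
            List<String> batch = page(index, last, limit);
            all.addAll(batch);
            if (batch.size() < limit) {
                return all;
            }
            last = batch.get(batch.size() - 1); // greatest storedId received so far
        }
    }

    public static void main(String[] args) {
        TreeSet<String> index = new TreeSet<>(Arrays.asList(
                "time1-id2", "time1-id3", "time1-id4", "time2-id1", "time2-id5"));
        // prints [time1-id2, time1-id3, time1-id4, time2-id1, time2-id5]
        System.out.println(fetchAll(index, "time0", 2));
    }
}
```

Each of the five records is returned exactly once, in three requests of sizes 2, 2, and 1, matching the queries listed above.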
Robb Peebles