I need to get the difference between two sets of integers (record ids). The first set is stored in a text file, the second set is stored in a MySQL database. I have two options:

1- Read all ids from the database, load them into Java objects, load all ids from the text file, and use:

Sets.difference(dbset, fileset);

2- Read only the text file ids and use an HQL query to get the difference:

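public List<?> getDiff(Set<Integer> ids) {
    // Returns the rows whose id is NOT in the given set (database minus file).
    Query query = getSession().createQuery("from myset s where s.id not in (:ids)");
    // Bind the whole collection; Hibernate expands it into the "not in (...)" list.
    query.setParameterList("ids", ids);
    return query.list();
}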

I posted this question because these sets can be quite big. With option 2, I don't know whether there is any limit on the size of a Hibernate / MySQL query (Hibernate translates that query into "not in (1,2,3,...)"). With option 1, I could easily reach the JVM memory limit.
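To make the memory concern for option 1 concrete, here is a minimal sketch that keeps only the database ids in memory and streams the file, removing each id it reads. It assumes one integer id per line in the text file; the class name `StreamingDiff` and the method `removeFileIds` are hypothetical names for illustration, and loading the database ids into the set is left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Set;

public class StreamingDiff {

    /**
     * Removes every id found in the file from dbIds, in place.
     * Afterwards dbIds holds the ids that are in the database but not in the file.
     * Assumption: the file contains one integer id per line.
     */
    public static void removeFileIds(Set<Integer> dbIds, Reader fileIds) {
        try (BufferedReader reader = new BufferedReader(fileIds)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (!line.isEmpty()) {
                    // Only one file id is held at a time; no second set is built.
                    dbIds.remove(Integer.valueOf(line));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

This only yields the ids that are in the database but missing from the file, the same direction as the query in option 2. The database ids still have to fit in memory.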

Tobia
  • Why don't you just try it? You could easily build a set programmatically in a loop that is at least as large as the biggest set you expect, and test the query against the database. Then just observe memory usage, run times, exceptions thrown due to exceeded limits, ... – FrankPl Nov 29 '13 at 08:13
  • It is difficult to try, since I would have to assume the memory settings of all users of this application. I prefer a best-effort approach. – Tobia Nov 29 '13 at 08:39
  • @Tobia But implicitly or explicitly, you will have to specify a minimum for the environment your application runs in. Or do you mean you want to write adaptive code that partitions the problem in an environment with little memory? That means you will need even more tests, in order to set the parameters for switching between the different approaches. – FrankPl Nov 29 '13 at 09:23
  • Yes, but I hope MySQL or Hibernate can manage memory better than I can. However, I will certainly run some tests in a minimum-requirements environment. – Tobia Nov 29 '13 at 09:27

1 Answer

Option #2 will not work correctly, because it returns only the IDs that are in the database but not in the file. If an ID is in the file but not in the DB, it will not show up in the difference. Only option #1 works correctly, and it looks fine to me. The only problem is memory, but it would not be easy to write an algorithm that does this work and also saves memory. If you have up to 100,000 IDs, I wouldn't worry about memory.
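The distinction between a one-directional difference and a difference in both directions can be sketched with plain `java.util` sets. The class and method names below (`IdDiff`, `minus`, `symmetric`) are hypothetical. Note that Guava's `Sets.difference(set1, set2)` is itself one-directional (elements of `set1` that are not in `set2`); if both directions are needed, Guava also provides `Sets.symmetricDifference`.

```java
import java.util.Set;
import java.util.TreeSet;

public class IdDiff {

    /** Ids that are in a but not in b (one direction only). */
    public static Set<Integer> minus(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new TreeSet<>(a); // copy, so the inputs stay untouched
        out.removeAll(b);
        return out;
    }

    /** Ids that are in exactly one of the two sets (both directions). */
    public static Set<Integer> symmetric(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = minus(a, b);
        out.addAll(minus(b, a));
        return out;
    }
}
```

For example, with database ids {1, 2, 3} and file ids {2, 3, 4}, `minus(db, file)` contains only 1, `minus(file, db)` contains only 4, and `symmetric(db, file)` contains 1 and 4.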

alexey28
  • You're right that the 2nd option is not the difference... in my case it is enough, but I agree it is not the difference. I will edit my question – Tobia Nov 29 '13 at 08:37