
I'm working on a J2EE project which involves storing postal codes, cities and countries together. We have developed a Java class which handles the import of one file per country (containing every postal code and city). The problem is that for some countries (Great Britain, the Netherlands...), the file is pretty big (400,000 to 800,000 lines).

I've got a while() loop which reads the next line, extracts the information and stores it in my database. The problem is that for the first 1,000 or 10,000 lines the process is fast, really fast, then it seems to slow down on every pass through the loop, and finally throws an OutOfMemoryError (Java heap space) after about 150,000 lines.

At first I thought that some objects weren't being garbage collected and were slowing down my algorithm, but I can't figure out which ones. Besides, when I run this algorithm on my PC, JConsole tells me that the heap is cleaned regularly (so garbage collection seems to be running), but the process still gets slower and slower.

Below is the code of the method:

FileReader fr = new FileReader(nomFichier);
BufferedReader br = new BufferedReader(fr);
    
int index = 0; String ligne; String codePostal; String nomVille; 
String codePays; PPays pays; String[] colonnes;
    
while ((ligne = br.readLine()) != null)
{
    System.out.println("line " + ++index);
        
    // Each line is a semicolon-separated record
    colonnes = ligne.split(Pattern.quote(";"));
        
    codePostal = colonnes[9];
    nomVille   = colonnes[8];
    codePays   = colonnes[0];
        
    // Resolve the country, then find or create the postal code / city pair
    pays = this.pc.getByCodePays(codePays);
        
    this.pc.getByCodePostalAndVilleAndINSEE(codePostal, nomVille, pays.getNomPays(), "");
}

The `this.pc` field is injected through the `@Inject` annotation.

Can someone help me figure out why this code gets slower and slower?

For completeness' sake, here is the code of the `getByCodePostalAndVilleAndINSEE()` method:

public Codepostalville getByCodePostalAndVilleAndINSEE(String codePostal, String ville, 
                                                       String pays, String codeINSEE) throws DatabaseException
{
    Codepostal cp = null; Ville v = null; PPays p = null; Codepostalville cpv = null;
    
    try
    {
        // First of all, we have to find the CodePostal object
        cp = (Codepostal) this.em
                        .createNamedQuery("Codepostal.findByCodePostal")
                        .setParameter("codePostal", codePostal)
                        .getSingleResult();
    }
    catch (NoResultException nre1)
    {
        // If it was not found, create it
        if (cp == null)
        {
            cp = new Codepostal();
            cp.setCodePostal(codePostal);
            cpc.getFacade().create(cp);
        } 
    }
    
    // Find the city...
    try
    {
        // The city name supplied by the caller must be cleaned up (removing
        // any hyphens, special characters...)
        // So we create a new Ville object, assign it the name to clean,
        // run the cleanup, and retrieve the cleaned name
        Ville purge = new Ville();
        purge.setNomVille(ville);
        purge.purgerNomVille();
        ville = purge.getNomVille();
        
        v = (Ville) this.em
                        .createNamedQuery("Ville.findByNomVille")
                        .setParameter("nomVille", ville)
                        .getSingleResult();
    }
    catch (NoResultException nre2)
    {
        // ... or create it if it does not exist
        if (v == null)
        {
            v = new Ville();
            v.setNomVille(ville);
            vc.getFacade().create(v);
        }
    }
    
    // Find the country...
    try
    {
        p = (PPays) this.em
                        .createNamedQuery("PPays.findByNomPays")
                        .setParameter("nomPays", pays)
                        .getSingleResult();
    }
    catch (NoResultException nre2)
    {
        // ... or create it if it does not exist
        if (p == null)
        {
            p = new PPays();
            p.setNomPays(pays);
            pc.getFacade().create(p);
        }
    }
        
    // And find the CodePostalVille object
    try
    {
        cpv = (Codepostalville) this.em
                .createNamedQuery("Codepostalville.findByIdVilleAndIdCodePostalAndIdPays")
                .setParameter("idVille", v)
                .setParameter("idCodePostal", cp)
                .setParameter("idPays", p)
                .getSingleResult();
        
        // If the CodePostalVille object was found, update its INSEE code
        cpv.setCodeINSEE(codeINSEE);
        this.getFacade().edit(cpv);
    }
    catch (NoResultException nre3)
    {         
        if (cpv == null)
        {
            cpv = new Codepostalville();
            cpv.setIdCodePostal(cp);
            cpv.setIdVille(v);
            cpv.setCodeINSEE(codeINSEE);
            cpv.setIdPays(p);
            this.getFacade().create(cpv);
        }
    }
    
    return cpv;
}

So, here is some more information. The `getByCodePostalAndVilleAndINSEE()` method takes around 15 ms to execute at the very beginning of the loop, and after 10,000 lines it takes more than 100 ms (almost 10 times as long!). In this new version I have disabled the commit/rollback code, so each query is committed on the fly.

I really can't figure out why it needs more and more time.

I've tried to search for some information about JPA's cache. My current configuration is this (in persistence.xml):

<property name="eclipselink.jdbc.bind-parameters" value="true"/>
<property name="eclipselink.jdbc.cache-statements" value="true"/>
<property name="eclipselink.cache.size.default" value="10000"/>
<property name="eclipselink.query-results-cache" value="true"/>

I don't know whether this is the most efficient configuration, and I would appreciate some help and explanation of JPA's caching.

  • What do `getByCodePays` and `getByCodePostalAndVilleAndINSEE` do? And have you used a profiler yet? – Chris K Jul 24 '14 at 12:23
  • Have you checked which part is the bottleneck of your implementation? Are you conducting any check that can be affected by the number of database elements? – Eypros Jul 24 '14 at 12:24
  • Is there a possibility that you can break the big file into smaller files, and then, through an executor, read and process each segment? – Nikhil Joshi Jul 24 '14 at 12:24
  • I can't see any reason that loop would slow down over time; it appears that you don't even create anything new within the loop body. The only thing I can think of is that it's in the body of one of your `getByCode*` methods. – Andrew Stubbs Jul 24 '14 at 12:25
  • this.pc.getByCodePostalAndVilleAndINSEE needs to be inspected, although, if this is just a Bean-style getter, it doesn't make sense without storing any results. What has been omitted? – laune Jul 24 '14 at 12:28
  • How exactly are you persisting the objects in your db? If you are using Hibernate, then you are facing the good ol' session-overflow problem – injecteer Jul 24 '14 at 12:28
  • @ChrisK `getByCodePostalAndVilleAndINSEE` checks whether the postal code and city already exist in the database (in order to create the object if it doesn't exist, or fetch it if it does). Both tables that I am querying hold close to 800,000 elements, so I guess that querying 800,000 elements or 850,000 elements (assuming that 50,000 lines of the file have already been parsed) takes almost the same time... I have just used JConsole, not any profiler yet. But good idea, I might first try to find out whether this method is the bottleneck of the algorithm – Adrien Dos Reis Jul 24 '14 at 12:29
  • @AdrienDosReis a quick hack to find out if it is the db would be to comment out those calls. Do you have an index on the table that is being used by the query? If not, then the db will get slower and slower. – Chris K Jul 24 '14 at 12:31
  • I have just read the updated code; that is a lot of db calls per line of input. Is it possible to redesign the approach to have a cache that can be checked for duplication quickly, or to use a stored procedure to reduce network round trips? Lastly, what is the transaction demarcation for this work? If every insert is in the same transaction then it will eventually blow up; if every sql call is in its own transaction then it will run even slower. Batching inserts will help you too. – Chris K Jul 24 '14 at 12:36
  • @ChrisK The three tables that I'm querying all have the same structure: IdTable (PRIMARY KEY, INDEXED) and Data (INDEXED). The last table is an association table regrouping the three primary keys, with an index on all three columns. The transaction strategy is handled by a `UserTransaction` object: a call to `userTransaction.begin()` at the beginning of the loop, and `userTransaction.commit()` or `rollback()` at the end. Do you have any ideas on how I could redesign (as you said) the approach in order to make the work quicker? – Adrien Dos Reis Jul 24 '14 at 12:40
  • @AdrienDosReis, I need a bit more domain knowledge before I can give a good answer to that. Specifically who else is writing to the table and just how many entries are you aiming to cater for? – Chris K Jul 24 '14 at 12:41
  • @ChrisK I'm the only one who is writing to the table. I mean, I can try to execute this code on my PC, and it's the only process that access to my local database, and it's still slow. I don't know if that answers your question though. – Adrien Dos Reis Jul 24 '14 at 12:43
  • 1
    @AdrienDosReis, cool. In that case you have several options. Assuming that the DB is not empty when you start. I would batch the inserts and forget about the checks. With unique constraints in the db then the insert will fail if there is duplication, and you can then catch that error and handle it. Batching makes that a little tricker, so you may prefer to read in the ids already in the table first so that the check can be done in JVM or to switch to a stored proc to do the inserts. Another option would be to allow duplicates, you could always delete them later in one go. – Chris K Jul 24 '14 at 12:47
  • @AdrienDosReis - I suggest you make bulk queries, i.e. read 5 (or more) lines and make a query for those 5 lines in one shot. Don't write for each line. Try optimizing other DB reads/writes in the same way. Also, if you need to write the same record multiple times, then check if it can be done with only the final write call. – TheLostMind Jul 24 '14 at 12:48
  • @AdrienDosReis if you cannot remove the checks entirely and have to go to the database, then I second TheLostMind's suggestion of batched queries. If I recall, Hibernate has built in support for them. – Chris K Jul 24 '14 at 12:52
  • @ChrisK - Your last *comment* should be an *answer* :P – TheLostMind Jul 24 '14 at 12:53
  • 1
    @TheLostMind, lol thank you. But it lacks detail, and I lack the time right now to flesh it out. Feel free to run with it. – Chris K Jul 24 '14 at 12:54

2 Answers


You might want to read up on JPA concepts. In brief, an EntityManager is associated with a persistence context, which keeps a reference to all persistent objects manipulated through it, so it can write any changes done to these objects back to the database.

Since you never close the persistence context, that's the likely cause of your memory leak. Moreover, a persistence provider must write changes to persistent objects to the database prior to issuing a query, if these changes might alter the result of the query. Detecting these changes requires iterating over all objects associated with the current persistence context. In your code, that's nearly a million objects for every query you issue.

Therefore, at the very least, you should clear the persistence context at regular intervals (say, every 1000 rows).
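
For illustration, a minimal sketch of such periodic clearing, assuming `em` is the same EntityManager the lookups go through (the batch size of 1000 is just a tuning choice):

// Inside the import loop, after handling each line:
if (index % 1000 == 0)
{
    em.flush(); // push pending inserts/updates to the database
    em.clear(); // detach all managed entities so the context stays small
}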

It's also worth noting that unless your database is on the same server, every query you issue must travel over the network to the database, and the result must travel back to the application server, before your program can continue. Depending on network latency, this can easily take a millisecond each time - and you are doing this several million times. If this needs to be truly efficient, loading the entire table into memory and performing the existence checks there might be substantially faster.
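
As a rough sketch of that idea (the JPQL attribute paths below are guesses based on the entity setters shown in the question):

// Preload the existing (postal code, city) pairs once
Set<String> existing = new HashSet<>();
for (Object[] row : em.createQuery(
        "SELECT cpv.idCodePostal.codePostal, cpv.idVille.nomVille "
        + "FROM Codepostalville cpv", Object[].class).getResultList())
{
    existing.add(row[0] + ";" + row[1]);
}

// Inside the import loop, test the set instead of querying the database;
// add() returns false if the pair was already known
if (existing.add(codePostal + ";" + nomVille))
{
    // new pair: create the Codepostal / Ville / Codepostalville entities
}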

meriton
  • Oh, good call. I thought that the queries waiting for a commit were stored in a temp space in the database. But if they are also stored in my application, that explains why it gets slower and why the heap space explodes. I'll try to comment out my `begin()` and `commit()`, and we'll see if it speeds up :) – Adrien Dos Reis Jul 24 '14 at 13:15
  • So, as I said in my edit, I tried, but it didn't work any better :( The solution that I'll try to implement relies on cutting every file of 5,000+ rows into pieces, and working on small pieces of the file. We'll see if it gets better – Adrien Dos Reis Jul 24 '14 at 15:58

Problem "solved" (almost)! I have configured my persistence.xml this way:

<property name="eclipselink.jdbc.batch-writing" value="JDBC"/>
<property name="eclipselink.jdbc.batch-writing.size" value="10000"/>

At first, it didn't change anything. But then I tried cutting my file into smaller pieces (when the file has more than 5,000 rows, I read 5,000 rows at a time, store them in a StringBuilder, then read the StringBuilder back to insert those 5,000 rows at once).

This way, my code doesn't get any slower after 20,000 rows (for now). It seems to work fine, but I still can't figure out why my code was getting slower when I worked with bigger pieces of the file.
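
For reference, here is a rough sketch of the chunking described above (`processChunk` is a hypothetical helper standing in for the parse-and-insert logic):

List<String> chunk = new ArrayList<>(5000);
String ligne;
while ((ligne = br.readLine()) != null)
{
    chunk.add(ligne);
    if (chunk.size() == 5000)
    {
        processChunk(chunk); // parse and insert these 5,000 rows in one batch
        chunk.clear();
    }
}
if (!chunk.isEmpty())
{
    processChunk(chunk); // handle the final partial chunk
}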

Thanks to everyone who tried to help me on this one.
