
I am returning an array list from an object and using it in another. The application is multi-threaded and each thread fills the array list one int at a time from a file, so each add requires a get on the array list first. There are 200 threads with a file of 1 million ints each. The application takes hours to run and I assume this is my bottleneck, since when I test with a local array list it takes 4 minutes. My problem is, this is used everywhere and I need to synchronize on the array list. Is there a fast solution to this problem, or do I have to make it so each thread has its own array list and doesn't return it?

Actually I was wrong: it's only faster when the array list is local to the method. Anywhere else, like declared at the top of the class, it takes hours to run. I'm stumped on this.

My return code looks like:

synchronized public ArrayList<Integer> getData() 
{
    return this.myData;
}

Here is what runs slow. I removed other things and am trying to benchmark just this part, and it takes hours:

    Scanner scanner = new Scanner(filePath);

    /*
     * While we have data keep reading
     * when out of data the simulation is complete.
     */
    while (scanner.hasNext()) 
    {
        /*
         * Get the data to simulate requests
         * and feed it to the algorithm being evaluated.
         */
        if (scanner.hasNextInt()) 
        {
            int temp = scanner.nextInt();
            //System.out.println( this.tClientName+" "+temp);


            /*
             * Add the temp value from incoming stream. 
             * 
             * todo:: UNLESS its NOT found on the client as a miss
             */
            tClientCache.getCache().add(temp); 

        } 
        else 
        {
            scanner.next();
        }
    }//END Of while (scanner.hasNext()) 
    /*
     * Close the scanner
     */
    scanner.close();
Dixon Steel
  • Replace that `ArrayList`, which doesn't support synchronization, with a `BlockingQueue` (see the sketch after this comment thread). Also, check if you have synchronized methods that may slow your application. – Luiggi Mendoza Nov 05 '15 at 16:52
  • Profile, don't guess. – Durandal Nov 05 '15 at 16:54
  • "one int at a time from a file" - why? can't they collect all the data from files, and only then add them to list using addAll(...)? Would be much faster. Give us more code and explain it better if you want to receive any help... – wfranczyk Nov 05 '15 at 16:55
  • your synchronisation pattern is all wrong: you're not synchronising on the array, only on the object with its reference. instead provide an object the threads can synchronise on during array access (i.e. don't use the `synchronized` method anti-pattern). – BeyelerStudios Nov 05 '15 at 16:56
  • I removed the sync to prove it wasn't that, and it wasn't. I need to add to the list one at a time to simulate a cache. I will post more code when I can. – Dixon Steel Nov 05 '15 at 16:57
  • @bezmax arrays are a perfectly reasonable idea for caches under most conditions, maps are arrays with a hash function on top – BeyelerStudios Nov 05 '15 at 17:01
  • I have to use one at a time to simulate data coming into the cache. BeyelerStudios I'm not sure I get what you are saying. – Dixon Steel Nov 05 '15 at 17:42
  • @BeyelerStudios if you are still around I think you are correct after talking to someone who knows java better than I, so does that mean I clone it and return an object and then iterate over that to get the synchronized data? – Dixon Steel Nov 06 '15 at 19:24
  • @DixonSteel in your code `tClientCache.getCache().add(temp);` is basically a synchronized access to `getCache()`, then an unsynchronized `add(temp)` on the array. The access to the array you probably don't need to synchronise (except if you change the array mid-computation), but the `add` is really important. Synchronise on a common object like so: `synchronized(obj) { mycache.add(temp); }` your client cache could provide this object, i.e.: `Object obj = tClientCache.getSynchronizationObject();` and it can be any `Object` that's a member of the cache. Best practice is to have a dedicated member. – BeyelerStudios Nov 06 '15 at 20:04
  • though this is not very efficient as others have pointed out: if you don't care about the order, simply fill a local array and `addAll` after you read the whole file (or blockwise, whatever). this requires only one synchronisation for multiple adds, instead of per add. – BeyelerStudios Nov 06 '15 at 20:08
  • @BeyelerStudios thanks, the data does change so this explains why I was having some out of bounds errors, since I was never really synced on the data array. As for the time, I thought about using a block, but I talked to the requirements and they said cut back the data and what I log and it may help. Reading from a file was required for actual real world traces, or I would not have done it that way. I thought of sharing a file but I heard java sucks at seeking. – Dixon Steel Nov 06 '15 at 20:44
  • @BeyelerStudios I tried to clone and returned the cloned object like this: synchronized public Object getCacheObject() { return this.cache.clone(); } but it did not work; the add method is part of the array list object in Java. At this point I am not sure how to properly sync to make this stop. – Dixon Steel Nov 08 '15 at 04:43
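
A minimal sketch of the `BlockingQueue` suggestion from the first comment, assuming the file readers act as producers and a single consumer drains the results in bulk; the class and variable names are illustrative and not from the original code:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class QueueSketch {
    // Thread-safe queue: producers add ints concurrently without any external locking.
    private static final BlockingQueue<Integer> queue = new LinkedBlockingQueue<>();

    public static void main(String[] args) throws InterruptedException {
        // Two illustrative producer threads standing in for the 200 file readers.
        Runnable producer = () -> {
            for (int i = 0; i < 1_000_000; i++) {
                queue.offer(i); // no synchronized block needed around the add
            }
        };
        Thread p1 = new Thread(producer);
        Thread p2 = new Thread(producer);
        p1.start();
        p2.start();
        p1.join();
        p2.join();

        // Drain everything that was produced into an ordinary list in one call.
        List<Integer> drained = new ArrayList<>();
        queue.drainTo(drained);
        System.out.println("drained " + drained.size() + " ints");
    }
}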

2 Answers


The issue is almost certainly not the act of returning the ArrayList, as that's just returning a reference.

The most likely case is the synchronization overhead, as every call to that method needs to acquire a lock, get the data, then release the lock (with some caveats, but that's basically true).

Additionally that synchronization almost certainly doesn't even do what you want it to do, as the actual access to the ArrayList needs to be synchronized and not just the act of getting a reference to it.
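
A minimal runnable demo of that point, with made-up class and field names (not the asker's actual code): the synchronized getter guards only the reference lookup, so concurrent add() calls still race on the plain ArrayList.

import java.util.ArrayList;
import java.util.List;

public class UnsafeGetterDemo {
    private final List<Integer> myData = new ArrayList<>();

    // The lock is released as soon as the reference is returned.
    public synchronized List<Integer> getData() {
        return myData;
    }

    public static void main(String[] args) throws InterruptedException {
        UnsafeGetterDemo holder = new UnsafeGetterDemo();
        Runnable writer = () -> {
            for (int i = 0; i < 100_000; i++) {
                holder.getData().add(i); // add() itself is NOT synchronized
            }
        };
        Thread t1 = new Thread(writer);
        Thread t2 = new Thread(writer);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Expected 200000; typically prints less (or a thread dies with an
        // ArrayIndexOutOfBoundsException) because the adds race each other.
        System.out.println("size = " + holder.getData().size());
    }
}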

Generally speaking you've got two options:

  • reduce the number of synchronization points (i.e. synchronize less often) or
  • choose a more efficient synchronization mechanism.

Can your threads maybe collect a number of results and add them in bulk (say a thousand at a time)? Or can you switch to a more multi-threading-capable data structure? (CopyOnWriteArrayList comes to mind, but that's optimized for frequent reading and very infrequent writing, so probably not for your use case.)
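
As a sketch of the first option, assuming the order of elements across threads doesn't matter: each thread buffers into a local list and hands the data over in batches, so the shared list is locked once per batch instead of once per int. The names below are illustrative, not from the original code.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class BatchingSketch {
    private static final int BATCH = 1_000;
    // One shared, synchronized list standing in for the cache.
    private static final List<Integer> shared =
            Collections.synchronizedList(new ArrayList<Integer>());

    static void readInts(Iterable<Integer> source) {
        List<Integer> buffer = new ArrayList<>(BATCH); // thread-local buffer
        for (int value : source) {
            buffer.add(value);            // cheap, no locking
            if (buffer.size() >= BATCH) {
                shared.addAll(buffer);    // one lock acquisition per batch
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            shared.addAll(buffer);        // flush the remainder
        }
    }
}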

Joachim Sauer

If your concurrent function looks like this:

Scanner scanner = new Scanner(filePath);

while(scanner.hasNext()) {
    if(scanner.hasNextInt()) {
        int temp = scanner.nextInt();
        tClientCache.getCache().add(temp);
    } else {
        scanner.next();
    }
}

scanner.close();

You can synchronise by using a common synchronisation object:

Scanner scanner = new Scanner(filePath);
Object syncObject = tClientCache.getSynchronizationObject();
ArrayList<Integer> list = tClientCache.getCache();

while(scanner.hasNext()) {
    if(scanner.hasNextInt()) {
        int temp = scanner.nextInt();
        // synchronise manipulation
        synchronized(syncObject) {
            list.add(temp);
        }
    } else {
        scanner.next();
    }
}

scanner.close();

and extend your CacheClient by the following:

class CacheClient {
     ...
     public Object getSynchronizationObject() { return m_syncObj; }
     ...
     private Object m_syncObj = new Object(); // For synchronised access to the cache.
}

Of course, like that you'll have to synchronise all other access to the cache too while you're adding to the list. Consider rewriting your programme in such a way that either the output of each file is processed independently, and thus each goes into its own (unsynchronised) list, or - in case you need to merge the data - you process the data in bulks:

Scanner scanner = new Scanner(filePath);
int threshold = ...
ArrayList<Integer> bulk = new ArrayList<>(); // local, unsynchronised buffer

while(scanner.hasNext()) {
    if(scanner.hasNextInt()) {
        int temp = scanner.nextInt();
        bulk.add(temp);
        // instead of an arbitrary threshold, why not merge the array of a whole file?
        if(bulk.size() >= threshold) {
            tClientCache.process(bulk);
            bulk.clear();
        }
    } else {
        scanner.next();
    }
}
if(!bulk.isEmpty()) {
    tClientCache.process(bulk);
}

scanner.close();

and perform the synchronisation in ClientCache.process:

class ClientCache {
    ...
    public void process(ArrayList<Integer> bulk) {
        // synchronise cache manipulation
        synchronized(getSynchronizationObject()) {
            // merge howsoever you like...
            getCache().addAll(bulk);
        }
    }
}

200 million int values is not a lot of data for current systems (<1 GB), but 200 million Integer objects is about 3 GB! Depending on what kind of processing you do on this data, the memory access might completely destroy your performance: again, perform bulk-data processing where possible, and if you need to do high-performance stuff like sorting, consider copying bulks of data into fixed-size int[] arrays, performing your sorting on the primitive-type array, then merging those bulks back into your arrays.
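
A small sketch of that last suggestion, assuming the bulk has already been collected: copy the boxed values into a primitive int[], sort the primitives, and only then merge the result back. The method name is made up for illustration.

import java.util.Arrays;
import java.util.List;

class BulkSort {
    // Copy a bulk of boxed Integers into a primitive int[] and sort it there;
    // the sort touches only primitives, avoiding per-element Integer overhead.
    static int[] sortBulk(List<Integer> bulk) {
        int[] values = new int[bulk.size()];
        for (int i = 0; i < values.length; i++) {
            values[i] = bulk.get(i); // unbox once, up front
        }
        Arrays.sort(values);
        return values;
    }
}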

BeyelerStudios
  • I have tried something like this, except I used a separate sync object on the file processing, but I see you return it from the cache? – Dixon Steel Nov 09 '15 at 22:15
  • I tried this, but no luck. I must have something somewhere I am not seeing for the race condition, not the speed issue. – Dixon Steel Nov 09 '15 at 22:40
  • @DixonSteel how do you process your array? do you process it while it's being filled? – BeyelerStudios Nov 10 '15 at 07:20
  • Actually this is showing promise, I did not return and sync on the object from all sides going at the data. So far it looks like it is working. What I don't get is why do I have to return the synchronized object and do this synchronized(getSynchronizationObject()){} as well as this in the cache public synchronized Object getCacheObject() { //return this.cache.clone(); synchronized (key ) { return new ArrayList(this.cache);// try } } – Dixon Steel Nov 10 '15 at 15:02
  • @DixonSteel that depends: do you ever assign another array list to `tClientCache.cache`? Otherwise no: you don't need the `synchronized` method and you don't need to return a clone... returning the clone especially means you will lose any inserted elements, as they are inserted into a copy and the copy is then lost... to avoid confusion, don't post commented out code here please... – BeyelerStudios Nov 10 '15 at 15:13
  • I don't ever assign another array list to the cache. Everybody now uses the returned copy, so I would think that is OK? Is there a better way? If I don't return the synced object like the above shows, things break; it seems to work otherwise with minimal testing. Sorry about the code, I'm burning out from this project. Can I just go back to returning the reference and syncing on the object around the ref return? – Dixon Steel Nov 10 '15 at 18:40
  • @DixonSteel you have changed too much logic since posting your code for me to make any useful suggestions, maybe better to pose a new question with your current code (and include the necessary parts of the ClientCache implementation) – BeyelerStudios Nov 11 '15 at 08:53
  • Thanks for all the help, I think I'm good with this sync code. As for the speed, that is related to me using the Java logger and it printing a message every time it enters a method. I don't know how to turn that off and just log the essentials, but I can always write my own. – Dixon Steel Nov 12 '15 at 13:00
  • @DixonSteel you should be able to set a higher logging level on your logger to avoid overly-verbose output – BeyelerStudios Nov 12 '15 at 13:09
  • I need to read about it more, but so far I haven't seen a way to make it stop printing method names as it enters them, that's what is killing my throughput. Most articles never mention how to stop that. – Dixon Steel Nov 12 '15 at 19:48