
I'm estimating last-mile delivery costs in a large urban network using by-route distances. I have over 8,000 customer agents and over 100 retail store agents plotted on a GIS map using lat/long coordinates. Each customer receives deliveries from its nearest store (by route). The goal is to get two distance measures in this network for each store:

d0_bar: the average distance from a store to all of its assigned customers
d1_bar: the average distance between all pairs of customers assigned to the same store

I've written a startup function with a simple foreach loop that assigns each customer to a store based on by-route distance (customers have a parameter, "customer.pStore", of Store type). This function also adds each customer, in turn, to the store agent's collection of customers ("store.colCusts", an ArrayList with Customer-type elements).

Next, I have a function that iterates through the store agent population, calculates the two average distance measures above (d0_bar & d1_bar), and writes the results to a txt file (see code below). The code works, fortunately. However, with such a massive dataset, iterating through all customers/stores and retrieving distances via the openstreetmap.org API takes forever. It's been initializing ("Please wait...") for about 12 hours. What can I do to make this code more efficient? Or is there a better way in AnyLogic to get these two distance measures for each store in my network?

Thanks in advance.

    //for each store, record all customers assigned to it
    for (Store store : stores)
    {
        distancesStore.print(store.storeCode + "," + store.colCusts.size() + "," + store.colCusts.size()*(store.colCusts.size()-1)/2 + ",");

        //calculate average distance from store j to customer nodes that belong to store j
        double sumFirstDistByStore = 0.0;
        int h = 0;
        while (h < store.colCusts.size())
        {
            sumFirstDistByStore += store.distanceByRoute(store.colCusts.get(h));
            h++;
        }
        distancesStore.print((sumFirstDistByStore/store.colCusts.size())/1609.34 + ",");

        //calculate average of distances between all customer nodes belonging to store j
        double custDistSumPerStore = 0.0;
        int loopLimit = store.colCusts.size();
        int i = 0;
        while (i < loopLimit - 1)
        {
            int j = 1;
            while (j < loopLimit)
            {
                custDistSumPerStore += store.colCusts.get(i).distanceByRoute(store.colCusts.get(j));
                j++;
            }
            i++;
        }
        distancesStore.print((custDistSumPerStore/(loopLimit*(loopLimit-1)/2))/1609.34);
        distancesStore.println();
    }
Vince

1 Answer

Firstly, a few simple comments:

  1. Have you tried timing a single distanceByRoute call? E.g., try running store.distanceByRoute(store.colCusts.get(0)); just to see how long a single call takes on your system. Routing is generally pretty slow, but it would be good to know what the speed limit is.
  2. The first simple change is to use Java parallelism. Instead of this:
    for (Store store : stores)
    { ...

use this:

    stores.parallelStream().forEach(store -> {
    ...
    });

This will process the store entries in parallel using the standard Java streams API.

  3. It also looks like the second loop, where the average distance between customers is calculated, doesn't take mirroring into account. That is to say, distance a->b is equal to b->a. Hence, for example, 4 customers will require 6 calculations: 1->2, 1->3, 1->4, 2->3, 2->4, 3->4. In the case of 4 customers, however, your second while loop will perform 9 calculations: i=0, j in {1,2,3}; i=1, j in {1,2,3}; i=2, j in {1,2,3}, which seems wrong unless I am misunderstanding your intention.
  4. Generally, for long-running operations it is a good idea to include some traceln calls to show progress with associated timing.
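The fix for point 3 can be sketched in plain Java. The distance matrix below stands in for the expensive distanceByRoute calls (a hypothetical stand-in, not AnyLogic API), since only the loop bounds matter here:

```java
// Sketch of the corrected pairwise average: j starts at i + 1, so each
// unordered pair {i, j} is visited exactly once (n * (n - 1) / 2 pairs).
public class PairwiseAverage {

    // dist stands in for customer-to-customer distanceByRoute results
    static double averagePairDistance(double[][] dist) {
        int n = dist.length;
        double sum = 0.0;
        int pairs = 0;
        for (int i = 0; i < n - 1; i++) {
            for (int j = i + 1; j < n; j++) {  // j = i + 1, not j = 1
                sum += dist[i][j];
                pairs++;
            }
        }
        return sum / pairs;  // pairs == n * (n - 1) / 2
    }

    public static void main(String[] args) {
        // 4 customers -> 6 pairs, matching the 1->2, 1->3, 1->4, 2->3, 2->4, 3->4 example
        double[][] dist = {
            {0, 1, 2, 3},
            {1, 0, 4, 5},
            {2, 4, 0, 6},
            {3, 5, 6, 0},
        };
        System.out.println(averagePairDistance(dist)); // (1+2+3+4+5+6)/6 = 3.5
    }
}
```

Besides being correct, this roughly halves the number of routing calls in the inner loop.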

Please have a look at the above and post your results. With more information, additional performance improvements may be possible.
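As a sketch of how point 2 might combine with the file output: multiple threads printing fragments to one writer can interleave output on a single line, so one option is to build each store's output line in parallel and write the lines sequentially afterwards. `Store` and `computeLine` below are hypothetical placeholders, not AnyLogic API:

```java
import java.util.List;
import java.util.stream.Collectors;

public class ParallelStores {

    // Placeholder for a store agent; in the model this would be the Store agent type.
    record Store(String storeCode) {}

    // Placeholder for the per-store distance work (the expensive routing calls).
    static String computeLine(Store store) {
        return store.storeCode() + ",...";
    }

    static List<String> computeAllLines(List<Store> stores) {
        // Heavy per-store work runs in parallel; collect() preserves input order,
        // so the lines can be written sequentially afterwards without interleaving.
        return stores.parallelStream()
                     .map(ParallelStores::computeLine)
                     .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Store> stores = List.of(new Store("S1"), new Store("S2"), new Store("S3"));
        computeAllLines(stores).forEach(System.out::println);
    }
}
```

The sequential write at the end is cheap compared to the routing calls, so this keeps the file intact without sacrificing the parallel speed-up.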

Artem P.
  • Thanks. 1) I wrote and tested the code with only a sliver of the data (2 stores, 16 customers). When I ran the model then, the time to initialize was almost unnoticeable (probably less than a second, but I didn't check the console for it). 2) This is interesting. I'll look into this more. 3) You're right, there's a mistake. Should be int j = i+1. I'm assuming an undirected network, so a->b should be the same as b->a. 4) Good advice. Thanks. – Vince Jun 29 '21 at 20:26
  • One more thing to add: a speed-up might be achieved by adding more memory to the JVM (this will reduce GC pauses). – Artem P. Jun 29 '21 at 20:31
  • Following up, how does parallelStream() differ from setting the number of processors to use under runtime preferences? Also, the store agents in my model aren't a collection, they're a population. Looks like parallelStream() has to be done on a collection; am I understanding that right? – Vince Jun 29 '21 at 21:53
  • parallelStream() is Java functionality; the number of processors to use refers to multi-run experiments only. Yes, `parallelStream()` is only applicable to collections, but there is nothing stopping you from putting Store agents into a collection; they can be referenced from more than one place (i.e. both the population and the collection). – Artem P. Jun 29 '21 at 22:11