
I'm generating around 400,000,000 (400 million) random numbers in parallel on an Intel i7 with 4 cores (8 threads hyperthreaded) on macOS with 8 GB RAM.

I'm also generating the same 400,000,000 random numbers on a DigitalOcean server with 20 cores, running Debian, with 64 GB RAM.

Here's the code:

import multiprocessing
import random

rangemin = 1
rangemax = 9

def randomGenPar_backend(backinput):
    # Each task ignores its argument and returns one random integer in [rangemin, rangemax].
    return random.randint(rangemin, rangemax)

def randomGenPar(num):
    # Distribute num tasks across a pool of worker processes (one per CPU by default)
    # and collect all the results into a single list.
    pool = multiprocessing.Pool()
    return pool.map(randomGenPar_backend, range(num))

randNum = 400000000

if __name__ == "__main__":  # guard needed on platforms where workers are spawned by importing the script
    random.seed(999)
    randomGenPar(randNum)

These are the results of the benchmark (all times are in seconds):

5,000,000 Random Numbers:
1 Core: 5.984
8 Core: 1.982

50,000,000 Random Numbers:
1 Core: 57.28
8 Core: 19.799
20 Core: 18.257
Times Benefit (20 core vs. 8 core) = 1.08

100,000,000 Random Numbers:
1 Core: 115
8 Core: 40.434
20 Core: 31.652
Times Benefit (20 core vs. 8 core) = 1.28

200,000,000 Random Numbers:
8 Core: 87
20 Core: 60
Times Benefit (20 core vs. 8 core) = 1.45

300,000,000 Random Numbers:
8 Core: 157
20 Core: 88
Times Benefit (20 core vs. 8 core) = 1.78

400,000,000 Random Numbers:
8 Core: 202
20 Core: 139
Times Benefit (20 core vs. 8 core) = 1.45 (DIP!)

500,000,000 Random Numbers:
8 Core: 280
20 Core: 171
Times Benefit (20 core vs. 8 core) = 1.64 (INCREASE!)

600,000,000 Random Numbers:
8 Core: 342
20 Core: 198
Times Benefit (20 core vs. 8 core) = 1.73

700,000,000 Random Numbers:
8 Core: 410
20 Core: 206
Times Benefit (20 core vs. 8 core) = 1.99

800,000,000 Random Numbers:
8 Core: 482
20 Core: 231
Times Benefit (20 core vs. 8 core) = 2.09

Usually, the more random numbers are generated, the better the parallelism of the 20-core CPU is exploited. The "times benefit" (the 8-core time divided by the 20-core time, e.g. 482 / 231 ≈ 2.09 for 800 million numbers) therefore increases with the size of the workload.

However, after 300 million random numbers the speedup dips, and then increases again up to 800 million (I haven't tested further).

Why is this? Is there a specific reason, or was it just random? (I've repeated the benchmark twice and got the same result both times.)

EDIT: If it makes any difference, I'm timing the execution of the whole script with the time function. Also, the OS isn't the same on the two machines (8-core: macOS, 20-core: Debian).
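
In outline, the timing wraps the whole generation step, roughly like the sketch below (it reuses randomGenPar() and randNum from the code above; time.perf_counter() stands in here for whatever "the time function" is, so the exact timer call is an assumption):

import time

start = time.perf_counter()
randomGenPar(randNum)          # the parallel generation from the code above
elapsed = time.perf_counter() - start
print("Generated", randNum, "random numbers in", elapsed, "seconds")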

  • Would you mind providing a graph of the two speeds depending on the number of random values generated? It's a bit difficult to make sense of the figures you're giving... – Right leg Jun 12 '17 at 09:06
  • Are you running the same `OS` locally as you are on DigitalOcean? Have you looked at the performance of `os.urandom` instead of `random.randint`? – Matti Lyra Jun 12 '17 at 09:07
  • How are you measuring performance? Is it possible that some other tasks running on the server are temporarily clogging the CPU? – Błotosmętek Jun 12 '17 at 09:20
  • @MattiLyra, I'm using Debian on DigitalOcean and macOS locally. I have no other tasks running during the benchmark. I'll take a look at the difference in performance and get back to you. – TajyMany Jun 12 '17 at 16:36
  • @Błotosmętek, I'm not running any other tasks during the benchmark. – TajyMany Jun 12 '17 at 16:37
  • @Rightleg, I've provided the benchmarks in the question now. The timing is the number of seconds taken to complete the Python script. – TajyMany Jun 12 '17 at 16:46
  • Can you add information about how much RAM you have? It may well be relevant: a Python list of `float` objects is going to take 32 bytes per float (including the pointer from the list to the `float` object) on a 64-bit machine, so with 400 million floats, that's ~12.8 GB for the final list. And that's ignoring the RAM taken up by any intermediate lists. – Mark Dickinson Jun 12 '17 at 16:52
  • @MarkDickinson, I've got 8 GB RAM on my 8-core Mac, and 64 GB on my 20-core Debian server. – TajyMany Jun 12 '17 at 16:55
  • Did you try repeating the tests multiple times or only once? Do you get a similar average? – noxdafox Jun 20 '17 at 18:33

1 Answer


Two possible explanations come to mind.

This could be an artifact of garbage collection kicking in. An easy experiment would be to shut off GC and see if the "dip" persists:

>>> import gc
>>> gc.disable()
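
In the benchmark script itself, that experiment could look roughly like the sketch below (it reuses randomGenPar() and randNum from the question, and gc.disable() is called in the parent process before the pool is created):

import gc
import time

gc.disable()                     # turn off cyclic garbage collection before building the huge result list
start = time.perf_counter()
randomGenPar(randNum)            # same parallel generation as in the question
print("elapsed:", time.perf_counter() - start, "seconds")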

Another possibility is that this is an artifact of list growth using realloc() under the hood. Lists are implemented as fixed-length arrays of pointers. When map() grows the list with append(), the C function realloc() is called periodically to resize the array of pointers. Often this call is very cheap because none of the data has to be moved. However, if even a single byte in memory "obstructs" the resize, then all of the data has to be relocated. This is very expensive and could cause the "dip" if, at that point in execution, multiprocessing has created such an obstructing byte.

To test this hypothesis, you could use imap() instead of map() and feed the results into a collections.deque() instead of a list(). The deque implementation does not use realloc(), so its performance is consistent in the face of fragmented memory (internally, it just makes repeated calls to malloc() to obtain fixed-length memory blocks).
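
A sketch of that experiment, based on the question's code (the randomGenPar_deque() name is just for illustration, and the chunksize value is an arbitrary choice to keep the per-task overhead of imap() low):

import collections
import multiprocessing
import random

rangemin = 1
rangemax = 9

def randomGenPar_backend(backinput):
    return random.randint(rangemin, rangemax)

def randomGenPar_deque(num):
    # imap() streams results back lazily instead of building one big list,
    # and deque stores them in fixed-size blocks, so the container never calls realloc().
    pool = multiprocessing.Pool()
    try:
        return collections.deque(pool.imap(randomGenPar_backend, range(num), chunksize=10000))
    finally:
        pool.close()
        pool.join()

if __name__ == "__main__":
    results = randomGenPar_deque(400000000)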

Raymond Hettinger