I have a large data set (~2 GB) to analyse and I'd like to multiprocess it to reduce the run time of the code. I've imported the dataset into a list which I will then want to run numerous passes over. On each pass I'll set up a pool for each available core and each pool will then only assess a certain block of the data set (note, the pool still needs access to the complete data set).

Each line of the input file takes the format "a,b,c,d,e,f,g,h" and all are numbers.

I'm struggling to separate out the parameters in the Calc1stPass pool; I'm getting a "tuple index out of range" error. Can anyone help me out with this error please?

def Calc1stPass(DataSet,Params):
    
    print("DataSet =", DataSet)
    print("Params =", Params)
    Pass, (PoolNumber, ArrayCount, CoreCount) = Params

    StartRow = int((ArrayCount / CoreCount) * PoolNumber)
    EndRow  = int(((ArrayCount / CoreCount) * (PoolNumber+1))-1)

    for Row in range(StartRow,EndRow):
        Rand = randrange(ArrayCount)
        Value1 = Decimal(DataSet[Row][0]) + Decimal(DataSet[Row][1])
        Value2 = Decimal(DataSet[Rand][0]) + Decimal(DataSet[Rand][1])
        Value3 = Value1 - Value2

        NewValue = Decimal(DataSet[Row][7]) + Value3
        DataSet[Row][7] = str(NewValue)

def main():
    
    #Importing the file
    print("Importing File ",  FileToImport)
    OriginalDataSet = []
    f = open(FileToImport)
    for line in f:
        StrippedLine = line.rstrip()
        OriginalDataSet.append(StrippedLine.split(",",))
    ArrayCount = len(OriginalDataSet)

    #Running passes on dataset
    for Pass in range(NumberofPasses):
        print("Running Pass : ", Pass + 1, " of ", NumberofPasses)
        CoreCount = mp.cpu_count()
        WorkPool=mp.Pool(CoreCount)

        for PoolNumber in range(CoreCount):
            Params = [Pass,PoolNumber,ArrayCount,CoreCount]
            RevisedDataSet = WorkPool.starmap(Calc1stPass, product(OriginalDataSet, zip(range(1),Params)))
        print(RevisedDataSet)
    
if __name__ == "__main__":
    freeze_support()
    main()
Paul
  • why this loop: `for PoolNumber in range(CoreCount):`? each pass only changes `Rand`, and overwrites the previous results... – Aaron Nov 01 '21 at 17:38
  • The intention for that line is that I can pass the PoolNumber to the pool and then use that variable to assign which sections of the dataset each pool will work on. – Paul Nov 02 '21 at 09:25

1 Answer


Okay, here we go with what I came up with after some discussion plus trial and error. I hope I've kept it somewhat comprehensible. However, it seems you are very new to a lot of this, so you probably have a lot of reading to do regarding how certain libraries and data types work.

Analyzing the algorithm

Let's start by taking a closer look at your computation:

for Pass in range(Passes):
    for Row in range(StartRow,EndRow):
        Rand = randrange(ArrayCount)
        Value1 = Decimal(DataSet[Row][0]) + Decimal(DataSet[Row][1])
        Value2 = Decimal(DataSet[Rand][0]) + Decimal(DataSet[Rand][1])
        Value3 = Value1 - Value2
        NewValue = Decimal(DataSet[Row][7]) + Value3
        DataSet[Row][7] = str(NewValue)

So basically, we update a single row through a computation involving another random row. Assumptions that I make:

  • the real algorithm does a bit more stuff, otherwise it is hard to see what you want to achieve
  • the access pattern of the real algorithm stays the same

Following our discussion, there are no functional reasons for the following aspects:

  • Computation in Decimal is unnecessary; float will do just fine
  • The values don't need to be stored as strings; we can use an array of floats

At this point it is clear that we can save tremendous amounts of runtime by using a numpy array instead of a list of strings.
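
As a toy illustration (a made-up three-row array, not the real file): once the values live in a float array, the per-row Decimal/str arithmetic collapses into a single vectorized statement.

import numpy as np

# Hypothetical tiny dataset: three rows, three numeric columns
DataSets = np.array([[214.1, -218.1, 0.0],
                     [100.0,   50.0, 0.0],
                     [ -3.5,    7.3, 0.0]])

# One statement updates the last column of every row at once,
# with no Python-level loop, no Decimal() and no str() conversions
DataSets[:, 2] += DataSets[:, 0] + DataSets[:, 1]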

There is an additional hazard here for parallelization: We use random numbers. When we use multiple processes, the random number generators need to be set up for parallel generation. We'll cross that bridge when we get there.

Notably, the output column is no input for the next pass. The inputs per pass stay constant.

Input / Output

The input file format seems to be a simple CSV mostly filled with floating point numbers (using only one decimal place) and one column not being a floating point number. The text-based format coupled with your information that there are gigabytes of data means that a significant amount of time will be spent just parsing the input file or formatting the output. I'll try to be efficient in both but keep things simple enough that extensions in both directions are possible.

Optimizing the sequential algorithm

It is always advisable to first optimize the sequential case before parallelizing. So we start here. We begin with parsing the input file into a numpy array.

import numpy as np

def ReadInputs(Filename):
    """Read a CSV file containing 10 columns

    The 7th column is skipped because it doesn't contain floating point values

    Return value:
    2D numpy array of floats
    """
    UsedColumns = (0, 1, 2, 3, 4, 5, 7, 8, 9)
    return np.loadtxt(Filename, delimiter=',', usecols=UsedColumns)

Since we are using numpy, we switch over to its random number generators. This is the setup routine. It allows us to force deterministic values for easier debugging.

def MakeRandomGenerator(Deterministic=False):
    """Initializes the random number generator avoiding birthday paradox

    Arguments:
    Deterministic -- if True, the same random numbers are used on every run

    Return value:
    numpy random number generator
    """
    SeedInt = 0 if Deterministic else None
    Seed = np.random.SeedSequence(SeedInt)
    return np.random.default_rng(Seed)

And now the main computation. Numpy makes this very straightforward.

def ComputePass(DataSets, RandomGenerator):
    """The main computation

    Arguments:
    DataSets -- 2D numpy array. Changed in place
    RandomGenerator -- numpy random number generator
    """
    Count = len(DataSets)
    RandomIndices = RandomGenerator.integers(
        low=0, high=Count, size=Count)
    RandomRows = DataSets[RandomIndices]
    # All rows: first column + second column
    Value1 = DataSets[:, 0] + DataSets[:, 1]
    Value2 = RandomRows[:, 0] + RandomRows[:, 1]
    Value3 = Value1 - Value2
    # This change is in-place of the whole DataSets array
    DataSets[:, 7] += Value3

I've kept the structure the same. That means there are a few optimizations that we can still do:

  1. We never use most columns. Columns that are unnecessary should be removed from the array (skipped in input parsing) to reduce memory consumption and improve locality of data. If necessary for output, it is better to merge in the output stage, maybe by re-reading the input file to gather the remaining columns

  2. Since Value1 and Value2 never change, we could pre-compute Value3 for all rows and just use that. Again, if we don't need the first two columns in memory, better to remove them

  3. If we transpose the array (or store in Fortran order), we improve vectorization. This will make the use of MPI harder, but not impossible

I've not done any of this because I do not want to stray too far from the original algorithm.
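
That said, here is a minimal sketch of what optimization 2 could look like, just for illustration; ComputePassFast and RowSums are names I made up for this sketch and they are not used in the rest of the answer.

import numpy as np

def ComputePassFast(DataSets, RowSums, RandomGenerator):
    """Like ComputePass, but reuses the precomputed column sums

    Arguments:
    DataSets -- 2D numpy array. Changed in place
    RowSums -- 1D array holding DataSets[:, 0] + DataSets[:, 1], computed once
    RandomGenerator -- numpy random number generator
    """
    Count = len(DataSets)
    RandomIndices = RandomGenerator.integers(low=0, high=Count, size=Count)
    # Value1 - Value2 without touching columns 0 and 1 again
    DataSets[:, 7] += RowSums - RowSums[RandomIndices]

# Usage: compute the sums once, outside the pass loop
# RowSums = DataSets[:, 0] + DataSets[:, 1]
# for _ in range(Passes):
#     ComputePassFast(DataSets, RowSums, RandomGenerator)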

The last step is the output. Here I go with a pure Python route to keep things simple and replicate the input file format:

def WriteOutputs(Filename, DataSets):
    LineFormat = "{:.1f}, " * 6 + "+" + ", {:.1f}" * 3 + "\n"
    with open(Filename, 'w') as OutFile:
        for Row in DataSets:
            OutFile.write(LineFormat.format(*Row))

Now the entire operation is rather simple:

def main():
    InFilename = "indata.csv"
    OutFilename = "outdata.csv"
    Passes = 20
    RandomGenerator = MakeRandomGenerator()
    DataSets = ReadInputs(InFilename)
    for _ in range(Passes):
        ComputePass(DataSets, RandomGenerator)
    WriteOutputs(OutFilename, DataSets)


if __name__ == '__main__':
    main()    

Parallelization framework

There are two main concerns for parallelization:

  1. For every row, we need access to the entire input data set to pick a random entry

  2. The amount of calculation per row is very low

So we need to find a way that keeps overhead per row small and shares the input data set efficiently.

The first choice is multiprocessing since, you know, standard library and all that. However, I think that the normal usage patterns have too much overhead. It's certainly possible but I would like to use MPI for this to give us as much performance as possible. Also, your first attempt at parallelization used a pattern that matches MPI's preferred pattern. So it is a good fit.

A word towards the concept of MPI: multiprocessing.Pool works with a main process that distributes work items among a set of worker processes. MPI starts N processes that all execute the same code. There is no main process. The only distinguishing feature is the process "rank", which is a number in [0, N). If you need a main process, the one with rank 0 is usually chosen. Other than that, the idea is that all processes execute the same code, only picking different indices or offsets based on their rank. If processes need to communicate, there are a couple of "collective" communication patterns such as broadcasting, scattering, and gathering.
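
If MPI is completely new to you, a minimal mpi4py sketch (a hypothetical hello_mpi.py, run with mpiexec) shows the "same code, different rank" idea and one collective operation:

from mpi4py import MPI

Comm = MPI.COMM_WORLD
Rank = Comm.Get_rank()       # this process's number in [0, N)
WorldSize = Comm.Get_size()  # N, the total number of processes

# Only rank 0 has the data initially; bcast gives every process a copy
Settings = {"Passes": 20} if Rank == 0 else None
Settings = Comm.bcast(Settings, root=0)

print("Process", Rank, "of", WorldSize, "sees", Settings)

# Run with e.g.: mpiexec -n 4 python3 hello_mpi.py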

Option 1: Pure MPI

Let's rewrite the code. The main idea is this: We distribute rows in the data set among all processes. Then each process calculates all passes for its own set of rows. Input and output take considerable time, so we try to do as much as possible parallelized, too.

We start by defining a helper function that defines how we distribute rows among all processes. This is very similar to what you had in your original version.

from mpi4py import MPI

def MakeDistribution(NumberOfRows):
    """Computes how the data set should be distributed across processes

    Arguments:
    NumberOfRows -- size of the whole dataset

    Return value:
    (Offsets, Counts) numpy integer arrays. One entry per process
    """
    Comm = MPI.COMM_WORLD
    WorldSize = Comm.Get_size()
    SameSize, Tail = divmod(NumberOfRows, WorldSize)
    Counts = np.full(WorldSize, SameSize, dtype=int)
    Counts[:Tail] += 1
    # Start offset per process
    Offsets = np.cumsum(Counts) - Counts
    return Offsets, Counts

A second helper function is used to distribute the data sets among all processes. MPI's allgather function is used to collect results of a computation from all processes into one array. The normal function gather collects the whole array on one process; allgather collects it on all processes. Since all processes need access to all data sets for their random access, we use allgather. Allgatherv is a generalized version that allows a different number of entries per process. We need this because we cannot guarantee that all processes have the same number of rows in their local data set.
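
For intuition, the lowercase, pickle-based allgather is much simpler to use (though slower); the real code below uses the faster buffer-based Allgatherv instead. A tiny sketch, with made-up per-rank data:

from mpi4py import MPI

Comm = MPI.COMM_WORLD
Rank = Comm.Get_rank()

# Each process contributes a chunk of a different size...
OwnRows = [[float(Rank), 1.0]] * (Rank + 1)
# ...and after allgather every process holds everyone's chunks
AllChunks = Comm.allgather(OwnRows)
AllRows = [Row for Chunk in AllChunks for Row in Chunk]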

This function uses the "buffer" interface of mpi4py. This is the more efficient version but also very error-prone. If we mess up an index or the size of a data type, we risk data corruption.

def DistributeDataSets(DataSets, Offsets, Counts):
    """Shares the datasets with all other processes

    Arguments:
    DataSets -- numpy array of floats. Changed in place
    Offsets, Counts -- See MakeDistribution
    
    Return value:
    DataSets. Most likely a reference to the original.
    Might be an updated copy
    """
    # Sanitize the input. Better safe than sorry and shouldn't cost anything
    DataSets = np.ascontiguousarray(DataSets, dtype='f8')
    assert len(DataSets) == np.sum(Counts)
    # MPI works best if we pretend to have 1-dimensional data
    InnerSize = np.prod(DataSets.shape[1:], dtype=int)
    # I really wish mpi4py had a helper for this
    BufferDescr = (DataSets,
                   Counts * InnerSize,
                   Offsets * InnerSize,
                   MPI.DOUBLE)
    MPI.COMM_WORLD.Allgatherv(MPI.IN_PLACE, BufferDescr)
    return DataSets

We split reading the input data into two parts. First we read all lines in a single process. This is relatively cheap and we need to know the total number of rows before we can distribute the datasets. Then we scatter the lines among all processes and let each process parse its own set of rows. After that, we use the DistributeDataSets function to let each process know all the results.

Scattering the lines uses mpi4py's pickle interface that can transfer arbitrary objects among processes. It's slower but more convenient. For stuff like lists of strings it's very good.

def ParseLines(TotalLines, Offset, OwnLines):
    """Allocates a data set and parses the own segment of it
    
    Arguments:
    TotalLines -- number of rows in the whole data set across all processes
    Offset -- starting offset of the set of rows parsed by this process
    OwnLines -- list of lines to be parsed by the local process
    
    Return value:
    a 2D numpy array. The rows [Offset:Offset+len(OwnLines)] are initialized
    with the parsed values
    """
    UsedColumns = (0, 1, 2, 3, 4, 5, 7, 8, 9)
    DataSet = np.empty((TotalLines, len(UsedColumns)), dtype='f8')
    OwnEnd = Offset + len(OwnLines)
    for Row, Line in zip(DataSet[Offset:OwnEnd], OwnLines):
        Columns = Line.split(',')
        # overwrite in-place with new values
        Row[:] = [float(Columns[Column]) for Column in UsedColumns]
    return DataSet


def DistributeInputs(Filename):
    """Read input from the file and distribute it among processes
    
    Arguments:
    Filename -- path to the CSV file to parse
    
    Return value:
    (DataSets, Offsets, Counts) with
    DataSets -- 2D array containing all values in the CSV file
    Offsets -- Row indices (one per rank) where each process starts its own
        processing
    Counts -- number of rows per process
    """
    Comm = MPI.COMM_WORLD
    Rank = Comm.Get_rank()
    Lines = None
    LineCount = None
    if not Rank:
        # Read the data. We do as little work as possible here so that other
        # processes can help with the parsing
        with open(Filename) as InFile:
            Lines = InFile.readlines()
        LineCount = len(Lines)
    # broadcast so that all processes know the number of datasets
    LineCount = Comm.bcast(LineCount, root=0)
    Offsets, Counts = MakeDistribution(LineCount)
    # reshape into one list per process
    if not Rank:
        Lines = [Lines[Offset:Offset+Count]
                 for Offset, Count
                 in zip(Offsets, Counts)]
    # distribute strings for parsing
    Lines = Comm.scatter(Lines, root=0)
    # parse into a float array
    DataSets = ParseLines(LineCount, Offsets[Rank], Lines)
    del Lines # release strings because this is a huge array
    # Share the parsed result
    DataSets = DistributeDataSets(DataSets, Offsets, Counts)
    return DataSets, Offsets, Counts    

Now we need to update the way the random number generator is initialized. What we need to prevent is that each process has the same state and generates the same random numbers. Thankfully, numpy gives us a convenient way of doing this.

def MakeRandomGenerator(Deterministic=False):
    """Initializes the random number generator avoiding birthday paradox

    Arguments:
    Deterministic -- if True, the same number of processes should always result
        in the same random numbers being used

    Return value:
    numpy random number generator
    """
    Comm = MPI.COMM_WORLD
    Rank = Comm.Get_rank()
    AllSeeds = None
    if not Rank:
        # the root process (rank=0) generates a seed sequence for everyone else
        WorldSize = Comm.Get_size()
        SeedInt = 0 if Deterministic else None
        OwnSeed = np.random.SeedSequence(SeedInt)
        AllSeeds = OwnSeed.spawn(WorldSize)
    # mpi4py can scatter Python objects. This is the simplest way
    OwnSeed = Comm.scatter(AllSeeds, root=0)
    return np.random.default_rng(OwnSeed)

The computation itself is almost unchanged. We just need to limit it to the rows for which the individual process is responsible.

def ComputePass(DataSets, Offset, Count, RandomGenerator):
    """The main computation

    Arguments:
    DataSets -- 2D numpy array. Changed in place
    Offset, Count -- rows that should be updated by this process
    RandomGenerator -- numpy random number generator
    """
    RandomIndices = RandomGenerator.integers(
        low=0, high=len(DataSets), size=Count)
    RandomRows = DataSets[RandomIndices]
    # Creates a "view" into the whole dataset for the given slice
    OwnDataSets = DataSets[Offset:Offset + Count]
    # All rows: first column + second column
    Value1 = OwnDataSets[:, 0] + OwnDataSets[:, 1]
    Value2 = RandomRows[:, 0] + RandomRows[:, 1]
    Value3 = Value1 - Value2
    # This change is in-place of the whole DataSets array
    OwnDataSets[:, 7] += Value3

Now we come to writing the output. The most expensive part is formatting the floating point numbers into strings. So we let each process format its own data. MPI has a file IO interface that allows all processes to write a single file together. Unfortunately, for text files, we need to calculate the offsets before writing the data. So we format all rows into one huge string per process, then write the file.

import io

def WriteOutputs(Filename, DataSets, Offset, Count):
    """Writes all DataSets to a CSV file

    We parse all rows to a string (one per process), then write it
    collectively using MPI

    Arguments:
    Filename -- output path
    DataSets -- all values among all processes
    Offset, Count -- the rows for which the local process is responsible
    """
    StringBuf = io.StringIO()
    LineFormat = "{:.6f}, " * 6 + "+" + ", {:.6f}" * 3 + "\n"
    for Row in DataSets[Offset:Offset+Count]:
        StringBuf.write(LineFormat.format(*Row))
    StringBuf = StringBuf.getvalue() # to string
    StringBuf = StringBuf.encode() # to bytes
    Comm = MPI.COMM_WORLD
    BytesPerProcess = Comm.allgather(len(StringBuf))
    Rank = Comm.Get_rank()
    OwnOffset = sum(BytesPerProcess[:Rank])
    FileLength = sum(BytesPerProcess)
    AccessMode = MPI.MODE_WRONLY | MPI.MODE_CREATE
    OutFile = MPI.File.Open(Comm, Filename, AccessMode)
    OutFile.Set_size(FileLength)
    OutFile.Write_ordered(StringBuf)
    OutFile.Close()

The main function is almost unchanged.

def main():
    InFilename = "indata.csv"
    OutFilename = "outdata.csv"
    Passes = 20
    RandomGenerator = MakeRandomGenerator()
    DataSets, Offsets, Counts = DistributeInputs(InFilename)
    Rank = MPI.COMM_WORLD.Get_rank()
    Offset = Offsets[Rank]
    Count = Counts[Rank]
    for _ in range(Passes):
        ComputePass(DataSets, Offset, Count, RandomGenerator)
    WriteOutputs(OutFilename, DataSets, Offset, Count)


if __name__ == '__main__':
    main()

You need to call this script with mpirun or mpiexec, e.g. `mpiexec python3 script_name.py`.

Using shared memory

The MPI pattern has one significant drawback: Each process needs its own copy of the whole data set. Given its size, this is very inconvenient. We might run out of memory before we run out of CPU cores to parallelize over. As a different idea, we can use shared memory. Shared memory allows multiple processes to access the same physical memory without any extra cost. This has some drawbacks:

  1. We need a fairly recent Python version: 3.8 or newer, which introduced multiprocessing.shared_memory

  2. Python's implementation may behave differently on various operating systems. I could only test it on Linux. There is a chance that it will not work on other systems

  3. IMHO Python's implementation is not great. You will notice that the final version will print some warnings which I think are harmless. Maybe I'm using it wrong but I don't see a more correct way of using it

  4. It limits you to a single PC. MPI itself is perfectly capable (and indeed designed to) operate across multiple systems on a network. Shared memory works only locally.

The major benefit is that the memory consumption does not increase with the number of processes.
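
To make the mechanics concrete, here is a minimal sketch of the SharedMemory pattern on its own (Python 3.8+, no MPI). For brevity both handles live in one process; in the real code below the attach-by-name happens in the other ranks.

import numpy as np
from multiprocessing import shared_memory

# Creating side: allocate a named segment and wrap it in a numpy array
Shape = (4, 9)
Buf = shared_memory.SharedMemory(create=True, size=int(np.prod(Shape)) * 8)
DataSets = np.ndarray(Shape, dtype='f8', buffer=Buf.buf)
DataSets[:] = 0.0

# Attaching side: any process can map the same physical memory by name
Other = shared_memory.SharedMemory(name=Buf.name)
View = np.ndarray(Shape, dtype='f8', buffer=Other.buf)
View[0, 0] = 42.0   # visible through DataSets as well

# Cleanup: drop the array views, close every handle, unlink exactly once
del View, DataSets
Other.close()
Buf.unlink()
Buf.close()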

We start by allocating such a data set.

From here on, we put in "barriers" at various points where processes may have to wait for one another. For example, because all processes need to access the same shared memory segment, they all have to open it before we can unlink it.

from multiprocessing import shared_memory


def AllocateSharedDataSets(NumberOfRows, NumberOfCols=9):
    """Creates a numpy array in shared memory

    Arguments:
    NumberOfRows, NumberOfCols -- basic shape
    
    Return value:
    (DataSets, Buf) with
    DataSets -- numpy array shaped (NumberOfRows, NumberOfCols).
        Datatype float
    Buf -- multiprocessing.shared_memory.SharedMemory that backs the array.
        Close it when no longer needed
    """
    length = NumberOfRows * NumberOfCols * np.float64().itemsize
    Comm = MPI.COMM_WORLD
    Rank = Comm.Get_rank()
    Buf = None
    BufName = None
    if not Rank:
        Buf = shared_memory.SharedMemory(create=True, size=length)
        BufName = Buf.name
    BufName = Comm.bcast(BufName)
    if Rank:
        Buf = shared_memory.SharedMemory(name=BufName, size=length)
    DataSets = np.ndarray((NumberOfRows, NumberOfCols), dtype='f8',
                          buffer=Buf.buf)
    Comm.barrier()
    if not Rank:
        Buf.unlink() # this may differ among operating systems
    return DataSets, Buf

The input parsing also changes a little because we have to put the data into the previously allocated array.

def ParseLines(DataSets, Offset, OwnLines):
    """Reads lines into a preallocated array
    
    Arguments:
    DataSets -- [Rows, Cols] numpy array. Will be changed in-place
    Offset -- starting offset of the set of rows parsed by this process
    OwnLines -- list of lines to be parsed by the local process
    """
    UsedColumns = (0, 1, 2, 3, 4, 5, 7, 8, 9)
    OwnEnd = Offset + len(OwnLines)
    OwnDataSets = DataSets[Offset:OwnEnd]
    for Row, Line in zip(OwnDataSets, OwnLines):
        Columns = Line.split(',')
        Row[:] = [float(Columns[Column]) for Column in UsedColumns]


def DistributeInputs(Filename):
    """Read input from the file and stores it in shared memory
    
    Arguments:
    Filename -- path to the CSV file to parse
    
    Return value:
    (DataSets, Offsets, Counts, Buf) with
    DataSets -- [Rows, 9] array containing all numeric values in the CSV file
    Offsets -- Row indices (one per rank) where each process starts its own
        processing
    Counts -- number of rows per process
    Buf -- multiprocessing.shared_memory.SharedMemory object backing the
        DataSets object
    """
    Comm = MPI.COMM_WORLD
    Rank = Comm.Get_rank()
    Lines = None
    LineCount = None
    if not Rank:
        # Read the data. We do as little work as possible here so that other
        # processes can help with the parsing
        with open(Filename) as InFile:
            Lines = InFile.readlines()
        LineCount = len(Lines)
    # broadcast so that all processes know the number of datasets
    LineCount = Comm.bcast(LineCount, root=0)
    Offsets, Counts = MakeDistribution(LineCount)
    # reshape into one list per process
    if not Rank:
        Lines = [Lines[Offset:Offset+Count]
                 for Offset, Count
                 in zip(Offsets, Counts)]
    # distribute strings for parsing
    Lines = Comm.scatter(Lines, root=0)
    # parse into a float array
    DataSets, Buf = AllocateSharedDataSets(LineCount)
    try:
        ParseLines(DataSets, Offsets[Rank], Lines)    
        Comm.barrier()
        return DataSets, Offsets, Counts, Buf
    except:
        Buf.close()
        raise

Output writing is exactly the same. The main function changes slightly because now we have to manage the lifetime of the shared memory.

import contextlib

def main():
    InFilename = "indata.csv"
    OutFilename = "outdata.csv"
    Passes = 20
    RandomGenerator = MakeRandomGenerator()
    Comm = MPI.COMM_WORLD
    Rank = Comm.Get_rank()
    DataSets, Offsets, Counts, Buf = DistributeInputs(InFilename)
    with contextlib.closing(Buf):
        Offset = Offsets[Rank]
        Count = Counts[Rank]
        for _ in range(Passes):
            ComputePass(DataSets, Offset, Count, RandomGenerator)
        WriteOutputs(OutFilename, DataSets, Offset, Count)

Results

I've not benchmarked the original version. The sequential version requires 2 GiB memory and 3:20 minutes for 12500000 lines and 20 passes. The pure MPI version requires 6 GiB and 42 seconds with 6 cores. The shared memory version requires a bit over 2 GiB of memory and 38 seconds with 6 cores.

Homer512
  • Thanks for the response. I've changed the unpacking statement as above but I'm now getting a "cannot unpack non-iterable int object" error. And yes, thanks for the tip on just starting the pool once. – Paul Nov 02 '21 at 09:23
  • @Paul I've rewritten the answer so that you have an easier time debugging it yourself because I surely can't guess what you actually want to achieve. Feel free to describe the pattern that you want to occur and I'll help you with it. – Homer512 Nov 02 '21 at 09:48
  • Thank you for your help on this. I've revised the original coding to give you a better idea of what I'm trying to achieve. In essence I need to run through every line of the 2Gb file and assess it against a randomly selected line of the file. I then change a value of the original line based on the result. Once I've assessed every line in the file I then repeat that process for a number of passes. Implementing the debugging options you mentioned I get `Params = (0,0)` when I'm expecting 4 integers that I interrogate and can use in the pool. – Paul Nov 03 '21 at 14:00
  • @Paul I've extended my answer now that I have a base understanding of what you want to do. – Homer512 Nov 03 '21 at 14:24
  • that's great; I've modified the code using `RevisedDataSet = WorkPool.map( functools.partial(Calc1stPass, Params=Params), OriginalDataSet)` and it's transferring the variables across as expected. I'm now getting this issue where the OriginalDataSet is a list of lists but when I print the DataSet from def Calc1stPass it's only printing a single list and not the entire list of lists. – Paul Nov 03 '21 at 15:50
  • Err, yeah, that's to be expected. map calls Calc1stPass once per item in OriginalDataSet. If OriginalDataSet is a list of lists, map will iterate over that outer list and pass one of the inner lists to Calc1stPass. – Homer512 Nov 03 '21 at 16:19
  • Is there a way to transfer the list of lists so that I can interrogate it within the pool then? Apologies if this seems basic, the limited scripting experience I have is in another language and that's only from what I managed to teach myself. This is my first attempt at using Python (I've been asked to use it), hence some of the presumably silly things I have in the code – Paul Nov 03 '21 at 21:32
  • @Paul Okay, that makes sense. You pulling complicated stuff like itertools.product and starmap made it seem like you know what you are doing ;-) Apologies for staying in "fixing mode" and not switching to "teaching mode" sooner. Anyway, I rewrote my answer. We should iterate it to match your requirements – Homer512 Nov 04 '21 at 10:39
  • apologies I should have explained that earlier! That's now working, but I had to change `RevisedDataSet = mp.map(WithFixedParams, RevisedDataSet)` to `RevisedDataSet = WorkPool.map(WithFixedParams, RevisedDataSet)` as it was saying mp has no attribute 'map'. In terms of computation required; I'm looking at running 2500 passes on a file that contains ~32million lines so anything that we can do to speed the script up would be appreciated. – Paul Nov 04 '21 at 11:08
  • @Paul: Quick question, are you in a position to use MPI? Can you use mpi4py? Because I think that workflow would suit you more and may also be faster. – Homer512 Nov 04 '21 at 14:10
  • Yes, I've managed to install the mpi4py package. – Paul Nov 04 '21 at 15:44
  • @Paul: Cool. I've got some ideas but I'm too busy today to write them down. You'll get them tomorrow. One more question: What's the reason for using Decimal() instead of float() and why keep the stuff stored as strings instead of Decimal (or float)? – Homer512 Nov 05 '21 at 08:15
  • No worries, whenever you can fit it in is good for me. I was using decimal as the values import as strings (the file I import has 9 numeric values and 1 value that's either "+" or "-"). I've also not used float before. Printing OriginalDataSet gives `[['214.1', '-218.1', '-795.9', '7.3', '3.5', '-0.7', '+', '0', '0', '0'], [...` The str was used to convert the values back to strings, without it the values in returned list were `['214.1', '-218.1', '-795.9', '7.3', '3.5', '-0.7', '+', Decimal('-208.1'), '0', '0']`. Hopefully that makes sense. – Paul Nov 05 '21 at 10:50
  • @Paul I hope I'm now done with this. The MPI version is the best I could come up with. I'm using numpy and MPI which you don't seem familiar with. However, I would recommend you open new questions to ask about specific parts because right now I don't think anyone but me is reading this thread. – Homer512 Nov 07 '21 at 00:29
  • that's brilliant, thank you so much for your time, effort and patience. I'll get my head around this and test it in the coming days and then let you know how I get on (I'll open a new question if there are any issues). – Paul Nov 08 '21 at 07:01