
I have a lot of data sets which I need to sort and potentially merge. The data consists of 2D (x,y)-data from different time points, and what I want to do is this:

Read the different time points and check whether I have several (x,y) data sets from the same time point. If so, I would like to compare their X-axes to see whether the data can be merged.

All this I can do fairly easily in Python. It might not be beautiful code, but it appears to work. The problem is that the merged data contains duplicates of the data that has been pooled, because of the way I iterate over the data sets. This is a bit difficult to explain, so let me illustrate with this example code:

import numpy as np

# define test data
times = ['10ms', '20ms','30ms','10ms', '10ms']
x_axis = np.atleast_2d(np.linspace(1,5,5)).T
data_sets = [np.concatenate((x_axis, np.random.rand(5,5)),axis=1) for num in range(5)]


def mergeData(times, data_sets):
    data = []
    pooled_times = []
    repetitions_strings = ['','_2nd', '_3rd'] +  ['_%ith' % num for num in range(3,30,1) ]

    for num, item in enumerate(times):
        if times.count(item) == 1:
            # only one data-set with the current time point exists
            pooled_times.append(item)
            data.append(data_sets[num])

        elif times.count(item) > 1:
            # more than one occurrence of this time point
            # extract all the occurrences and compare them
            idx_different = 0  # counter for the number of different X-axes found so far
            idx_repetitions = item == np.array(times)

            # lists do not accept boolean index arrays, so find the numeric indices
            num_repititions = np.linspace(0, len(idx_repetitions)-1, len(idx_repetitions))[idx_repetitions]

            # *** get the data ***
            temporary_data = [data_sets[int(num)] for num in num_repititions]
            # either round off X-axis or change both tolerances in np.allclose()
            X_axis_round = [np.round(temporary_data[int(num)][:,0],decimals=4) for num in range(len(temporary_data))]


            # *** THIS IS WHERE THINGS GO BAD :((
            # loop over X-axis and compare - note that last X-axis is NOT considered
            # Deal with last X-axis separately
            for idx1 in range(len(X_axis_round)-1):
                pool = temporary_data[idx1]
                removal_counter = int(0)
                for idx2 in range(idx1+1,len(X_axis_round),len(X_axis_round)):
                    if len(X_axis_round[idx1]) == len(X_axis_round[idx2]) and np.allclose(X_axis_round[idx1],X_axis_round[idx2]):
                        # pool the data because the X-axis and time point is the same
                        pool = np.concatenate((pool, temporary_data[idx2][:,1:]),axis=1)
                        removal_counter += 1

                        # remove the time points included in the pool so they are not duplicated
                        # !!! TIME POINTS SEEM TO BE REMOVED BUT DUPLICATES STILL OCCUR?!? !!!
                        index = int(num_repititions[idx2])
                        print('Removing index: %i, delay %s' % (index, item))
                        times = [times[int(num)] for num in range(len(times)) if num is not index]

                time_string = item + repetitions_strings[idx_different]
                pooled_times.append(time_string)
                data.append(pool)
                idx_different += 1

            # deal with last X-axis in case it is not pooled
            if removal_counter + 1 < len(times):  # True if last data-set could not be pooled
                time_string = item + repetitions_strings[idx_different]
                pooled_times.append(time_string)
                data.append(temporary_data[-1])
                index = num_repititions[-1]
                times = [times[int(num)] for num in range(len(times)) if num is not index]

    return pooled_times, data

As you can see, I am removing entries from the list that I am iterating over (times), which intuitively sounds like a really bad idea. From my tests it looks like the loop iterates over all entries in the original times list, so removing them during the loop does not work, but I can't think of a better way to do this. Input would be really appreciated!
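
The behaviour described above can be reproduced in isolation. The following is only a minimal sketch, not the original code: it assumes the cause is that enumerate(times) keeps a reference to the original list object, while an assignment such as times = [...] inside the loop merely rebinds the name to a new list. The filter used here is purely illustrative.

```python
times = ['10ms', '20ms', '30ms', '10ms', '10ms']
seen = []

for num, item in enumerate(times):
    seen.append(item)
    # Rebinding the name creates a new list; the iterator created by
    # enumerate() still refers to the original list object.
    times = [t for t in times if t != '10ms']

print(seen)   # every entry of the original list was visited
print(times)  # the name now points at the shortened list
```

If this assumption holds, the loop in mergeData would visit every repeated time point again, which would explain why pooled data shows up more than once.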

Is there a smarter way to do this kind of pooling/merging, or a way to make

for num, item in enumerate(times):

use the current 'times' list instead of the original one?
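
One possible alternative, sketched here under assumptions rather than offered as a definitive solution, is to avoid modifying times at all: walk over the (time, data) pairs once, and for each time point keep a list of pools, appending the y-columns to an existing pool when the rounded X-axes match. The function name merge_by_time, the decimals parameter and the numeric suffix scheme ('_2', '_3', ...) are hypothetical and simpler than the '_2nd', '_3rd' strings used above. Note that the output is ordered by first occurrence of each time point.

```python
from collections import OrderedDict

import numpy as np


def merge_by_time(times, data_sets, decimals=4):
    """Pool data sets that share a time point and a matching (rounded) X-axis."""
    groups = OrderedDict()  # time point -> list of pooled arrays
    for time, data in zip(times, data_sets):
        pools = groups.setdefault(time, [])
        x_new = np.round(data[:, 0], decimals=decimals)
        for i, pool in enumerate(pools):
            x_old = np.round(pool[:, 0], decimals=decimals)
            if x_old.shape == x_new.shape and np.allclose(x_old, x_new):
                # Same X-axis: append only the y-columns to the existing pool.
                pools[i] = np.concatenate((pool, data[:, 1:]), axis=1)
                break
        else:
            # No pool with a matching X-axis yet: start a new one.
            pools.append(data)

    pooled_times, pooled_data = [], []
    for time, pools in groups.items():
        for n, pool in enumerate(pools):
            # Illustrative suffix scheme: '', '_2', '_3', ...
            suffix = '' if n == 0 else '_%i' % (n + 1)
            pooled_times.append(time + suffix)
            pooled_data.append(pool)
    return pooled_times, pooled_data


# Usage with test data shaped like the example above.
times = ['10ms', '20ms', '30ms', '10ms', '10ms']
x_axis = np.atleast_2d(np.linspace(1, 5, 5)).T
data_sets = [np.concatenate((x_axis, np.random.rand(5, 5)), axis=1) for _ in range(5)]
pooled_times, pooled_data = merge_by_time(times, data_sets)
```

Because each data set is placed into exactly one pool, nothing has to be removed from the input lists, so duplicates cannot arise from the iteration order.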

Any help would be greatly appreciated :)

DonMP
    It would help if you could reduce your problem to the essentials, try looking at the following help links: [ask] and [mcve] – Inbar Rose Mar 07 '16 at 11:11
