I have a list of 3 million records and I want to get the difference between every two records. A simple nested loop would take forever. Could anyone suggest an algorithm capable of handling this problem?
-
What does your file look like? And what kind of differences do you expect to see? Are they numbers? – Mark Setchell Mar 11 '14 at 18:07
-
@MarkSetchell They are timestamps in C# DateTime format; I want to know how many hours lie between every two timestamps. The file is a txt, and the data is already loaded into a List – DingMartin Mar 11 '14 at 18:16
-
Do you mean every pair of consecutive records, or every pair of records (consecutive or not)? The first is 3 million operations. The second is 9000 billion operations. – Andrew Tomazos Mar 11 '14 at 18:18
-
@AndrewTomazos No, not consecutive; every pair of records. There should be n(n-1)/2 difference values, I think, where n is List.Count – DingMartin Mar 11 '14 at 18:21
-
@DingMartin: Sure, so for n = 3 million that is about 4500 billion results, right? No algorithm can run faster than the size of its output, so a nested loop is the best you can do (quadratic time). – Andrew Tomazos Mar 11 '14 at 18:24
-
What would you need 4500 billion difference values for? That would be good to know - the use case behind that might suggest a better algorithm, like sorting the timestamps and then looking at certain ranges. – M Oehm Mar 11 '14 at 18:29
-
@MOehm I want to get the mean value of those 4500 billion difference values. Would it be better if I sorted them first? – DingMartin Mar 11 '14 at 18:33
-
You mean the mean value of the _absolute_ values of the differences, right? (Otherwise, the task would be easy ...) – M Oehm Mar 11 '14 at 18:54
-
@MOehm Yeah, absolute values. – DingMartin Mar 11 '14 at 19:23
2 Answers
If you want to calculate the mean of all absolute differences and your timestamps are sorted, you just need one loop:
t[i] <= t[i + 1] --> abs(t[i] - t[j]) = t[j] - t[i] for i < j
That is, each of the N * (N - 1) / 2 timestamp differences contributes one summand with a positive sign and one summand with a negative sign. Let's look at an example with 4 timestamps:
sum = (t[3] - t[2]) + (t[3] - t[1]) + (t[3] - t[0])
+ (t[2] - t[1]) + (t[2] - t[0])
+ (t[1] - t[0])
Here, t[3] is always added; t[2] is added twice and subtracted once; t[1] is added once and subtracted twice; and finally the lowest value, t[0], is always subtracted.
Or, more generally: the first timestamp, i.e. the one with the lowest value, always carries a negative sign, N - 1 times. The second carries a negative sign N - 2 times and a positive sign once, namely when compared with the first timestamp. The third carries a negative sign N - 3 times and a positive sign twice, and so on.
So your loop goes like this:
sum = 0;
for i = 0 to N:
    sum = sum + (2*i - N + 1) * t[i]
where i is a zero-based index and N an exclusive upper bound, C-style. To get the average, divide by (N - 1) * N / 2.
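To make the derivation concrete: the question is about C#, but here is the same single loop as a short Python sketch; `data` is a made-up list of values, and the result is cross-checked against the brute-force pairwise definition:

```python
def mean_abs_difference(values):
    """Mean of |x - y| over all n*(n-1)/2 unordered pairs, in O(n log n)."""
    t = sorted(values)  # after sorting, every difference t[j] - t[i] with i < j is non-negative
    n = len(t)
    # Each t[i] is added i times and subtracted (n - 1 - i) times,
    # giving the signed coefficient (2*i - n + 1) from the derivation above.
    total = sum((2 * i - n + 1) * x for i, x in enumerate(t))
    return total / (n * (n - 1) // 2)

# Brute-force cross-check on a small input
data = [5.0, 1.0, 9.0, 3.0]
pairs = [(a, b) for i, a in enumerate(data) for b in data[i + 1:]]
brute = sum(abs(a - b) for a, b in pairs) / len(pairs)
print(mean_abs_difference(data), brute)  # both print 4.333...
```

For 3 million records this is one sort plus one linear pass instead of ~4500 billion subtractions.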
If your array isn't sorted, you must sort it first. Sorting takes O(N log N) time, so overall you are still far better off than with the quadratic nested loop.
One thing that might happen when summing up so many large values is that you hit the limits of your data type. You could try to fix that by splitting the loop in two and summing from both ends, in the hope that the terms roughly cancel out. Alternatively, you could divide by the total number of differences inside the loop, at the cost of possibly introducing some floating-point round-off error.
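The divide-inside-the-loop variant might look like the sketch below. (In Python the concern is round-off rather than integer overflow, since Python integers are arbitrary precision; in C# you would divide to keep a `long` or `double` accumulator small.)

```python
def mean_abs_difference_stable(t_sorted):
    """Same weighted-sum formula, but each term is divided by the pair count
    immediately, so the running total stays near the magnitude of the mean."""
    n = len(t_sorted)
    pairs = n * (n - 1) / 2
    mean = 0.0
    for i, x in enumerate(t_sorted):
        mean += (2 * i - n + 1) * x / pairs
    return mean
```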

You could parallelise the problem by splitting the file into, say, 8 chunks, processing them all at the same time, and making the most of those expensive Intel cores you paid for....
Use the split
command to generate the lists.
#!/bin/bash
# Split into sub-files of 375,000 lines each, named sublistaa, sublistab, ...
split -l 375000 yourfile sublist

for f in sublist*                # for each sublist* file
do
    # Start a background process to work on one chunk
    echo "start processing file $f in background" &
done
wait                             # until all background jobs have finished
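One caveat if the goal is still the pairwise mean: splitting the raw file only works after one global sort, because each element's coefficient in the sorted-sum formula depends on its global index. After sorting, each slice's partial sum is independent and the results just add up. A sequential Python sketch of that decomposition (each call to `partial_weighted_sum` could equally run in its own background process):

```python
def partial_weighted_sum(chunk, start, n):
    """Weighted sum for one slice of the globally sorted array.
    The coefficient (2*i - n + 1) uses the global index i = start + j,
    so slices can be processed independently and their sums added."""
    return sum((2 * (start + j) - n + 1) * x for j, x in enumerate(chunk))

def chunked_mean_abs_difference(values, chunks=8):
    t = sorted(values)  # sort globally first; splitting the raw file would break the indices
    n = len(t)
    size = (n + chunks - 1) // chunks
    total = sum(partial_weighted_sum(t[s:s + size], s, n)
                for s in range(0, n, size))
    return total / (n * (n - 1) // 2)
```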
