
There are a lot of discussions on the web about sorting huge files on Unix when the data will not fit into memory, generally using mergesort and its variants.

However, suppose there were enough memory to fit the entire data set: what would be the most efficient / fastest way of sorting it? The CSV files are ~50 GB (> 1 billion rows) and there is enough memory (5x the size of the data) to hold all of it.

I can use the Unix sort, but that still takes > 1 hr. I can use any language necessary, but what I am primarily looking for is speed. I understand we could load the data into, say, a columnar-type DB table and sort it there, but it's a one-time effort, so I am looking for something more nimble ...

Thanks in advance.

banarun
xbsd
  • What do the files contain? – Nawaz Jun 26 '13 at 11:48
  • RAM=5x50GB? Really? 250GB RAM. That's some serious hardware you've got to play with. Are they hiring :-) – Neil Jun 26 '13 at 11:49
  • :-) ... That's a standard size server for most investment banks, this has modest memory in comparison. It's mainly to support KDB+ (see kx.com). – xbsd Jun 26 '13 at 11:58
  • @Nawaz, these are csv files with combinations of strings, ints, dates in each row. – xbsd Jun 26 '13 at 11:59
  • "... csv files... >1 billion rows...": forget about the sorting. You have a far more serious problem to solve, a foundational/architectural one. You lost the war when your architecture put you in the position of doing random-access processing of a billion records from a variable-record-length file. You need to go back and redesign your whole process. – Euro Micelli Jun 26 '13 at 12:01
  • This sounds like something worth some experimenting. If `sort` is not fast enough, I would try `sqlite` next: (1) load the data into a table without an index, (2) add the index, (3) query the sorted table for all records. SQLite should be able to load the data from CSV. In case you can use a real RDBMS instead of SQLite, it might be worth it to split the CSV import into several processes. – wilx Jun 26 '13 at 12:02
  • We didn't create the files. This is the format in which they are delivered from outside. – xbsd Jun 26 '13 at 12:02
  • Have you tried using the coreutils `sort` with the `--batch-size` set to, say, the number of input files (or possibly about half)? – Hasturkun Jun 26 '13 at 12:03
  • @Euro, what I am looking for is: if you had the computational power to tackle this, what is the fastest method ... – xbsd Jun 26 '13 at 12:05
  • @Hasturkun, not yet ... – xbsd Jun 26 '13 at 12:05
  • If that kind of computational power is indeed available (250 GB, XD), then a variant of bucket sort for the csv files would be best, I believe (as it is theoretically O(n)) – Binayaka Chakraborty Jun 26 '13 at 12:07
  • Also, back to `sort`, the GNU `sort` has `--parallel=N` and `--batch-size=NMERGE` options. – wilx Jun 26 '13 at 12:07
  • @wilx, will try that. What I think could be most efficient is if you can read this directly into contiguous memory, sort in memory and write back to disk ... – xbsd Jun 26 '13 at 12:15
  • @xbsd: ("... delivered from outside..."): that's the source. Then what? What do you do with it? You have to be able to look at the "whole" problem. Does it really need to be sorted? Are you going to load the data to a database? Are you going to partition it? Do you need to sort only a projection of the data? If it's a one-time deal as you described it and you found it takes merely hours via Unix's sort, can you live with that? – Euro Micelli Jun 26 '13 at 12:16
  • @Euro, finding the distinct rows is the ultimate goal. That reduces the overall size of the data. It gets loaded into a system where a smaller dataset can be manipulated more efficiently than the original size ... – xbsd Jun 26 '13 at 12:22
  • If you really need to sort the whole file, the best sorting method is... "it depends", namely on how the data is distributed around the sorting criterion. Is the sorting criterion a unique column? Contiguous? Sparse? Do you know the range a priori? *How much* do you know? Example: let's say the sorting criterion is (or can be mapped to) a contiguous sequence of unique numbers with little or zero sparsity, and you know the range, max record length and count in advance. The fastest sort is to pre-allocate a file of MaxRecLen*RecCnt, read the input in sequence and dump each record into its proper slot (see the sketch after these comments). – Euro Micelli Jun 26 '13 at 12:35
  • Based on your last comment, I might try to discard/merge records on the fly as you load them into the database (with indexes online). When you find a duplicate, handle it. That might or might not work, because inserting a billion records with online indexes is hugely expensive. Or it might work, if the final dataset is only a few million records. I don't know. The point is that this is a very specific case with extreme conditions, and a generic "fastest way to do X" answer is not bound to solve your problem. The best answer to your problem is highly dependent on the distribution of your data. – Euro Micelli Jun 26 '13 at 12:48
  • [This](http://stackoverflow.com/questions/3969813/which-parallel-sorting-algorithm-has-the-best-average-case-performance) may be of some help. – n. m. could be an AI Jun 26 '13 at 12:57
  • @Euro, it's not sparse data, total 6 columns (integers and date fields) – xbsd Jun 26 '13 at 13:03
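
To make the direct-placement idea from the comments above concrete, here is a minimal C++ sketch under the assumptions Euro Micelli lists: the key is a unique integer in a dense, known range and the maximum record length is known in advance. Because the question says everything fits in RAM, the sketch pre-allocates an in-memory table of MaxRecLen*RecCnt bytes instead of a file; the key column, the range constants and the record-length limit are hypothetical placeholders.

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    int main(int argc, char** argv) {
        if (argc < 3) {
            std::cerr << "usage: slotsort <input.csv> <output.csv>\n";
            return 1;
        }

        // Hypothetical assumptions: the key is the first CSV column, a unique
        // integer in the dense range [kMinKey, kMaxKey] known in advance, and
        // no record (plus newline) is longer than kMaxRecLen bytes.
        constexpr std::int64_t kMinKey = 0;
        constexpr std::int64_t kMaxKey = 1000000;   // adjust to the real range
        constexpr std::size_t  kMaxRecLen = 256;

        const std::size_t slots = static_cast<std::size_t>(kMaxKey - kMinKey + 1);
        std::vector<char> table(slots * kMaxRecLen, '\0');  // the pre-allocated "file"

        std::ifstream in(argv[1]);
        std::string line;
        while (std::getline(in, line)) {
            // Key = first column, i.e. everything up to the first comma.
            const std::int64_t key = std::stoll(line.substr(0, line.find(',')));
            const std::size_t slot = static_cast<std::size_t>(key - kMinKey);
            line.push_back('\n');
            line.copy(&table[slot * kMaxRecLen], std::min(line.size(), kMaxRecLen));
        }

        // Slot order equals key order, so one sequential scan is sorted output.
        std::ofstream out(argv[2]);
        for (std::size_t s = 0; s < slots; ++s) {
            const char* rec = &table[s * kMaxRecLen];
            const char* end = std::find(rec, rec + kMaxRecLen, '\0');
            if (end != rec) out.write(rec, end - rec);   // skip empty slots
        }
    }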

3 Answers


Use parallel sorting algorithms for huge data.

Useful topic: Which parallel sorting algorithm has the best average case performance?
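
To give a rough idea of what that can look like in code (this sketch is not from the linked question), here is a C++17 program that loads all rows into memory and sorts them with the standard parallel execution policy. Sorting whole lines lexicographically is an assumption; substitute a comparator for the real key column as needed.

    #include <algorithm>
    #include <execution>
    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    int main(int argc, char** argv) {
        if (argc < 3) {
            std::cerr << "usage: parsort <input.csv> <output.csv>\n";
            return 1;
        }

        // Load the whole file into memory (the question says it fits).
        std::vector<std::string> rows;
        std::ifstream in(argv[1]);
        for (std::string line; std::getline(in, line); )
            rows.push_back(std::move(line));

        // C++17 parallel execution policy: sort across all available cores.
        // Whole-line lexicographic order is assumed here.
        std::sort(std::execution::par, rows.begin(), rows.end());

        std::ofstream out(argv[2]);
        for (const std::string& r : rows)
            out << r << '\n';
    }

On GCC/libstdc++ the parallel policy typically requires linking against TBB, e.g. `g++ -O2 -std=c++17 parsort.cpp -ltbb`.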

Community
Zenn

What about quicksort? Did you try it? std::sort is usually implemented with quicksort (more precisely introsort, which switches to heapsort if quicksort performance degrades), so you can try that. Quicksort is usually the fastest option: although its worst-case complexity is O(n^2), in typical cases it beats the other sorting algorithms.

The space complexity of quicksort should not be too bad: it requires O(log2(N)) stack space, which is around 30 stack frames for 1 billion items.

However, it is an unstable sorting algorithm (the order of "equal" items is not preserved), so it depends on whether you are OK with that.

Btw., the Unix sort seems to be implemented with merge sort, which usually isn't the fastest option for an in-RAM sort.
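
A minimal sketch of this std::sort route, assuming the rows fit in a std::vector<std::string> and that the sort key is, hypothetically, the second comma-separated column:

    #include <algorithm>
    #include <fstream>
    #include <string>
    #include <string_view>
    #include <vector>

    // Hypothetical key extractor: the second comma-separated field of a CSV line.
    static std::string_view second_field(std::string_view line) {
        const auto first = line.find(',');
        if (first == std::string_view::npos) return line;
        const auto second = line.find(',', first + 1);
        return line.substr(first + 1,
                           second == std::string_view::npos ? std::string_view::npos
                                                            : second - first - 1);
    }

    int main(int argc, char** argv) {
        if (argc < 3) return 1;

        // Load everything into RAM (the premise of the question).
        std::vector<std::string> rows;
        std::ifstream in(argv[1]);
        for (std::string line; std::getline(in, line); )
            rows.push_back(std::move(line));

        // std::sort is introsort: quicksort falling back to heapsort, with
        // insertion sort on small ranges. It is not stable, as noted above.
        std::sort(rows.begin(), rows.end(),
                  [](const std::string& a, const std::string& b) {
                      return second_field(a) < second_field(b);
                  });

        std::ofstream out(argv[2]);
        for (const std::string& r : rows) out << r << '\n';
    }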

EmDroid

I know this is old, but I figured I'd chime in with what I just figured out, in hopes that it may help someone else in the future.

GNU sort, as you may already know, is pretty fast. Couple that with many CPU cores and a lot of RAM, pass in some special flags to GNU's sort, and it becomes extremely fast.

* Pay close attention to the `--buffer-size` flag: the buffer size is the main reason for this speed-up. I've used the `--parallel` flag before, and it wasn't as fast by itself.

sort --parallel=32 --buffer-size=40G -u -t, -k2 -o $file.csv $file

I used a for loop to handle all the files in the folder, sorting huge CSV files by the second key, with a comma delimiter and keeping only unique values, with the following results:

for file in $(ls -p | grep -v  -E "[0-4/]"); 
do 
    time sort --parallel=32 --buffer-size=40G -u -t, -k2 -o $file.sorted.csv $file; 
done

real    0m36.041s
user    1m53.291s
sys     0m32.007s

real    0m52.449s
user    1m52.812s
sys     0m38.202s

real    0m50.133s
user    1m41.124s
sys     0m38.595s

real    0m41.948s
user    1m41.080s
sys     0m35.949s

real    0m47.387s
user    1m39.998s
sys     0m34.076s

The input files are 5.5 GB with ~75,000,000 rows each. The max memory usage I saw while a sort was taking place was a little less than 20 GB, so if it scales proportionally, a 50 GB file should take a little less than 200 GB of memory. It sorted 27.55 GB of data in under 9 minutes!

cigol on