
I've been looking online for a method to sort the kind of data I have (LDIF files), but I haven't found quite what I'm after. There are already programs out there to accomplish this sorting, but they fail with extremely large data sets. For me, extremely large is around 2 GB worth of these blocks, and that exhausts memory with the ldifsort.pl script even though I have 6 GB of RAM available and several more GB of swap. So I'm hoping to write a program that will store the blocks of data on the hard drive, sort the keys in memory, and then reassemble the blocks in sorted order (there's a rough sketch of what I mean after the example output below). I'd like to use Python 3, as I'm trying to learn that language, so if anyone has suggestions on basic strategy or on specific ways to do this with Python 3, I'd really appreciate the help.

I have large text files containing LDAP data, basically in the (much simplified) form of:

dn: Subscriber=UniqueName1@domain.com;RestOfTree=node1
groups: 1
permissions: 1
IsActive: FALSE
Barring: TRUE

dn: ProfileID=UniqueName1@domain.com;Subscriber=UniqueName1;RestOfTree=node1
groups: 1
permissions: 1
ServiceProfile: Lemur

dn: Subscriber=UniqueName2@domain.com;RestOfTree=node1
groups: 1
permissions: 1
IsActive: FALSE
Barring: TRUE

dn: ProfileID=UniqueName2@domain.com;Subscriber=UniqueName2;RestOfTree=node1
groups: 1
permissions: 1
ServiceProfile: Lemur

Each subscriber has three more blocks associated with it (my example data only shows one other block per subscriber), and I would like to keep all four blocks together after the sorting is complete.

So if I read the dn's in this order (the data associated with the dn's is hidden for brevity):

dn: Subscriber=UniqueName2@domain.com;RestOfTree=node
dn: ProfileID=UniqueName2@domain.com;Subscriber=UniqueName2;RestOfTree=node
dn: Subscriber=UniqueName4@domain.com;RestOfTree=node
dn: ProfileID=UniqueName4@domain.com;Subscriber=UniqueName4;RestOfTree=node
dn: Subscriber=UniqueName1@domain.com;RestOfTree=node
dn: Subscriber=UniqueName3@domain.com;RestOfTree=node
dn: ProfileID=UniqueName3@domain.com;Subscriber=UniqueName3;RestOfTree=node
dn: ProfileID=UniqueName1@domain.com;Subscriber=UniqueName1;RestOfTree=node

I would want the output to be:

dn: Subscriber=UniqueName1@domain.com;RestOfTree=node
dn: ProfileID=UniqueName1@domain.com;Subscriber=UniqueName1;RestOfTree=node
dn: Subscriber=UniqueName2@domain.com;RestOfTree=node
dn: ProfileID=UniqueName2@domain.com;Subscriber=UniqueName2;RestOfTree=node
dn: Subscriber=UniqueName3@domain.com;RestOfTree=node
dn: ProfileID=UniqueName3@domain.com;Subscriber=UniqueName3;RestOfTree=node
dn: Subscriber=UniqueName4@domain.com;RestOfTree=node
dn: ProfileID=UniqueName4@domain.com;Subscriber=UniqueName4;RestOfTree=node
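
Here's a rough sketch of that strategy, just to show what I mean (the key handling is oversimplified: it sorts on the whole dn line and doesn't yet keep the four related blocks per subscriber together):

def build_index(path):
    # One pass over the file: remember (key, offset, length) for each
    # blank-line-separated block instead of keeping the blocks in memory.
    index = []
    with open(path, "rb") as f:
        start = f.tell()
        key = None
        while True:
            pos = f.tell()
            line = f.readline()
            if not line or not line.strip():
                if key is not None:
                    index.append((key, start, pos - start))
                    key = None
                start = f.tell()
                if not line:          # end of file
                    break
            elif key is None:
                # first line of the block, e.g. "dn: Subscriber=...;RestOfTree=node1"
                key = line.decode("utf-8").rstrip("\n")
    return index

def write_sorted(src_path, dst_path):
    index = build_index(src_path)
    index.sort(key=lambda entry: entry[0])    # only the keys live in memory
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for _key, offset, length in index:
            src.seek(offset)
            dst.write(src.read(length))
            dst.write(b"\n")                  # blank line between blocks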

One thought I had was to use sqlite3 to store the data as Python reads it, sort the keys in Python, and then use queries to extract the data from sqlite and write it to a file. But I'm afraid the time spent searching for the keys in sqlite would be excessive. Then I thought I could have sqlite sort the data as I insert it, but it doesn't seem that sqlite supports that, and I don't know whether another database system does.
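
For reference, the sqlite3 version I was picturing looks roughly like this (the table and column names are placeholders, and it relies on ORDER BY at read time rather than sorting during the insert):

import sqlite3

conn = sqlite3.connect("ldif_blocks.db")          # placeholder file name
conn.execute("CREATE TABLE IF NOT EXISTS blocks (sort_key TEXT, block TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_sort_key ON blocks (sort_key)")

def store_block(sort_key, block_text):
    # called once per block while reading the LDIF file
    conn.execute("INSERT INTO blocks VALUES (?, ?)", (sort_key, block_text))

def dump_sorted(out_path):
    conn.commit()
    with open(out_path, "w", encoding="utf-8") as out:
        # sqlite does the sorting here; the index keeps the key lookups fast
        for (block,) in conn.execute("SELECT block FROM blocks ORDER BY sort_key"):
            out.write(block + "\n")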

Any help or direction is appreciated.

Update: Thanks to Zach for his suggestion of just using GNU sort instead of a database system. Here is the solution I developed with his help:

awk -f ldifformatter.awk LDAP-data-files*.ldif | sort -t \| -k1 | sed '1d;s/|/\n/g' > sorted.txt

where ldifformatter.awk replaces all newlines with "|" except those before the top-level dn's, so each record ends up on one line starting with the dn that gets used for sorting.

Thanks, Rusty

Rusty Lemur
  • SQLite will search keys quickly if the lookup column is indexed. – CL. Aug 28 '13 at 15:49
  • The order is determined by the number that is right before the `@` in the subscriber name? – Bakuriu Aug 28 '13 at 16:38
  • @Bakuriu The order is determined by everything between dn: and the first ;. The domain could be used for sorting. – Rusty Lemur Aug 28 '13 at 16:47
  • @RustyLemur Then your output is wrong. If you sort the strings `Subscriber...;` and `ProfileID...;` the latter should be first in the alphanumeric order. How exactly are these items sorted? My guess is sorting emails in alphabetical order and where you have the same email first place the `Subscriber` ones and then the `ProfileID`. – Bakuriu Aug 28 '13 at 16:57
  • @Bakuriu Right, you've got the right idea. Each "Subscriber" has "ProfileID" that is associated with it, so I want all the subscribers to be sorted and to display the Subscriber field before the ProfileID field. – Rusty Lemur Aug 28 '13 at 17:06

3 Answers


You should not try to sort all of the data in memory. You could use an external merge sort instead.

Guido van Rossum wrote an article about the same problem, "Sorting a million 32-bit integers in 2MB of RAM using Python", and it includes code examples.
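
Roughly, the idea looks like this in Python (a sketch only: heapq.merge over sorted chunks spilled to temporary files; the chunk size and key function here are placeholders, and each record is assumed to already be a single line of text):

import heapq
import tempfile

CHUNK_SIZE = 100_000  # records per chunk; tune to the memory you actually have

def _spill(sorted_records):
    # Write one sorted chunk to a temporary file, one record per line,
    # and rewind it so it can be read back during the merge.
    f = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
    f.writelines(record + "\n" for record in sorted_records)
    f.seek(0)
    return f

def external_sort(records, key=lambda r: r):
    # Phase 1: sort manageable chunks in memory and spill them to disk.
    chunk, chunk_files = [], []
    for record in records:
        chunk.append(record)
        if len(chunk) >= CHUNK_SIZE:
            chunk_files.append(_spill(sorted(chunk, key=key)))
            chunk = []
    if chunk:
        chunk_files.append(_spill(sorted(chunk, key=key)))

    # Phase 2: lazily merge the sorted chunks; only one record per chunk
    # is held in memory at a time.
    readers = ((line.rstrip("\n") for line in f) for f in chunk_files)
    return heapq.merge(*readers, key=key)

With your LDIF data, each record would be a whole block already joined onto a single line, and key would pull out the dn.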

Leonid Shvechikov

The command-line sort utility can sort extremely large text files without reading them entirely into memory (at least, the GNU version can). However, to use it you will have to reformat the data so that each record (everything that should be kept together) appears on one line. If the records looked something like this:

dn: Subscriber=UniqueName1@domain.com;RestOfTree=node1|groups: 1|permissions: 1|IsActive: FALSE|Barring: TRUE||dn: ProfileID=UniqueName1@domain.com;Subscriber=UniqueName1;RestOfTree=node1|groups: 1|permissions: 1|ServiceProfile: Lemur

then sort -t \| -k1 will do the job.

You can write a program in Python that streams the data into a temporary file in the appropriate format, invokes sort using subprocess.check_call, and then restores the original format. Use tempfile.NamedTemporaryFile to create the temporary file.
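
A rough sketch of that wrapper (the helper names are mine; it assumes "|" never occurs in the data, and it does not handle grouping several blocks per subscriber into one record):

import subprocess
import tempfile

def blocks(path):
    # Yield each blank-line-separated block as a list of lines.
    block = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                block.append(line.rstrip("\n"))
            elif block:
                yield block
                block = []
        if block:
            yield block

def sort_ldif(in_path, out_path):
    # 1. Reformat: one record per line, fields joined with "|".
    with tempfile.NamedTemporaryFile("w", encoding="utf-8",
                                     suffix=".flat", delete=False) as flat:
        for block in blocks(in_path):
            flat.write("|".join(block) + "\n")
        flat_name = flat.name

    # 2. Let GNU sort do the heavy lifting; it spills to disk as needed.
    with tempfile.NamedTemporaryFile("w", encoding="utf-8",
                                     suffix=".sorted", delete=False) as sorted_flat:
        subprocess.check_call(["sort", "-t", "|", "-k1", flat_name],
                              stdout=sorted_flat)
        sorted_name = sorted_flat.name

    # 3. Restore the original block format (one blank line between blocks).
    with open(sorted_name, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(line.rstrip("\n").replace("|", "\n") + "\n\n")

Cleaning up the temporary files afterwards is omitted for brevity.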

zwol
  • Thanks for the suggestion. I'm looking into this now. – Rusty Lemur Aug 28 '13 at 16:06
  • Wow. I had no idea GNU sort was that powerful. Where ldifsort.pl would have crashed due to memory exhaustion, this worked perfectly: awk -f ldifsorter.awk dataFiles*.ldif | sort -t \| -k1 | sed '1d;s/|/\n/g' > sorted.ldif where ldifsorter.awk just reformats the data like you suggested. Sadly there's no use of python, but this is pretty dang fast. – Rusty Lemur Aug 28 '13 at 16:42
  • To be honest, your shell pipeline with `awk` and `sed` is what I'd have used if I had this problem myself. I just suggested the Python wrapper since you were interested in learning how to do things in that language. – zwol Aug 28 '13 at 16:53

I wonder if SQLite is really not up to the task. But anyway, you could use an external sorting algorithm, e.g. Mergesort, to keep the memory usage low.

http://en.wikipedia.org/wiki/External_sorting

oefe