I know I already asked this kind of question before about VB6, where it was too slow, so I decided to use C# for this job; the same code now runs at about double the speed, but it's still way too slow.
The reason it's slow is that it starts the lexicographical sort from the last column and re-checks every row for each column.
What I believe would speed this up is starting the sort from the first column instead: scan all rows and find the row whose first byte is lowest. If several rows share that same lowest first byte, group them and move to the next column, comparing their second bytes to find the lowest there; if those are equal too, move on to the next column, and so on. As soon as a differing byte is found, that group is resolved and the search moves on to finding the second-lowest row. That's how I thought the process should work to get a good speed boost, but unfortunately I got very confused by this sorting technique and ended up using what somebody helped me with.
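To put the idea into concrete terms, here is a rough sketch of the comparison I have in mind: it walks the columns left to right and stops at the first differing byte, so most comparisons finish after only a byte or two. The `SortRowIndexes` name and the use of `Array.Sort` with a comparer are just my guesses at how this could look, not a finished implementation:

```csharp
using System;

class LexicoSketch {
    // Sort row indexes lexicographically. The comparison starts at the
    // first column and stops at the first byte that differs, which is
    // the early-exit behaviour described above.
    public static int[] SortRowIndexes(byte[] data, int rowLength) {
        int rowCount = data.Length / rowLength;
        int[] indexes = new int[rowCount];
        for (int i = 0; i < rowCount; i++) indexes[i] = i;
        Array.Sort(indexes, (a, b) => {
            for (int col = 0; col < rowLength; col++) {
                int diff = data[a * rowLength + col] - data[b * rowLength + col];
                if (diff != 0) return diff; // first differing column decides
            }
            return 0; // rows are identical
        });
        return indexes;
    }
}
```

This sorts an index array instead of moving the rows themselves, so no row data is copied at all during the sort.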
The current code works by brute force: it bubble-sorts all the rows on the last column, then moves one column to the left and re-sorts every row again, and keeps doing this until it has sorted on the first column. This is slow because it performs a huge number of redundant iterations.
Say there are 256 columns and 256 rows, a total of 65,536 array elements. With the current code, each row may have to be moved multiple times before it lands in its proper sorted position, so each column could take up to 65,536 iterations (256 rows times up to 256 passes). That's an estimated 256*65536 = 16,777,216 iterations every time I call the function, and that's the actual reason it's slow.
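Sanity-checking that worst-case estimate (my assumption is that every column can need up to `rows` full bubble passes):

```csharp
using System;

class IterationEstimate {
    static void Main() {
        int rows = 256, columns = 256;
        // Worst case per column: up to `rows` bubble passes,
        // each touching every row once.
        long perColumn = (long)rows * rows;   // 65,536
        long total = perColumn * columns;     // 16,777,216
        Console.WriteLine(total);             // prints 16777216
    }
}
```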
I know this is a lot to ask, but if anyone has some free time, and has maybe done this before, I'd appreciate the help.
Here is the code I have to work with so far.
byte[] sortArrayOfArraysLexicoGraphically(ref byte[] data) {
    long dataSize = data.Length;
    // The data is treated as a square matrix, so the row length
    // (and row count) is the square root of the element count.
    int squareRoot = (int)Math.Sqrt(dataSize);
    long squareRootMinusOne = squareRoot - 1;
    int row = 0;
    bool rowSwapped;
    byte[] tmpRow = new byte[squareRoot];
    byte[] lexicoGraphicalIndexes = new byte[squareRoot];
    for(short column = 0; column < lexicoGraphicalIndexes.Length; column++) {
        lexicoGraphicalIndexes[column] = (byte)column;
    }
    // Stable bubble sort on every column, from the last column to the first.
    for(long column = squareRootMinusOne; column >= 0; column--) {
        do {
            rowSwapped = false;
            do {
                if(data[(row * squareRoot) + column] > data[((row + 1) * squareRoot) + column]) {
                    // Swaps a full row in a few copies.
                    // Copy the first row to tmpRow.
                    Buffer.BlockCopy(data, row * squareRoot, tmpRow, 0, squareRoot);
                    // Replace the first row with the second row.
                    Buffer.BlockCopy(data, (row + 1) * squareRoot, data, row * squareRoot, squareRoot);
                    // Replace the second row with tmpRow.
                    Buffer.BlockCopy(tmpRow, 0, data, (row + 1) * squareRoot, squareRoot);
                    swapBytes(ref lexicoGraphicalIndexes, row, row + 1);
                    rowSwapped = true;
                }
                row++;
            } while(row < squareRootMinusOne);
            row = 0;
        } while(rowSwapped);
    }
    return lexicoGraphicalIndexes;
}

public void swapBytes(ref byte[] data, long firstIndex, long secondIndex) {
    byte tmpFirstByte = data[firstIndex];
    data[firstIndex] = data[secondIndex];
    data[secondIndex] = tmpFirstByte;
}
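For reference, this is the kind of quick check I run on a tiny 3x3 matrix (the test data is made up; the sort logic is a condensed copy of the function above, using int indexes instead of a ref parameter, so the snippet compiles on its own):

```csharp
using System;

class SortCheck {
    // Condensed copy of the column-by-column bubble sort above.
    public static byte[] SortRows(byte[] data) {
        int n = (int)Math.Sqrt(data.Length);
        byte[] indexes = new byte[n];
        byte[] tmpRow = new byte[n];
        for (int i = 0; i < n; i++) indexes[i] = (byte)i;
        for (int column = n - 1; column >= 0; column--) {
            bool rowSwapped;
            do {
                rowSwapped = false;
                for (int row = 0; row < n - 1; row++) {
                    if (data[row * n + column] > data[(row + 1) * n + column]) {
                        // Swap the two full rows and their index entries.
                        Buffer.BlockCopy(data, row * n, tmpRow, 0, n);
                        Buffer.BlockCopy(data, (row + 1) * n, data, row * n, n);
                        Buffer.BlockCopy(tmpRow, 0, data, (row + 1) * n, n);
                        byte t = indexes[row];
                        indexes[row] = indexes[row + 1];
                        indexes[row + 1] = t;
                        rowSwapped = true;
                    }
                }
            } while (rowSwapped);
        }
        return indexes;
    }

    static void Main() {
        byte[] data = {
            2, 0, 1,   // row 0
            0, 9, 9,   // row 1
            2, 0, 0    // row 2
        };
        byte[] order = SortRows(data);
        // Rows should now read {0,9,9}, {2,0,0}, {2,0,1},
        // and order holds the original positions of the sorted rows.
        Console.WriteLine(string.Join(",", order)); // prints 1,2,0
    }
}
```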