
I'm working on implementing a few different sorting methods, and for some reason my merge sort algorithm will not work on large data sets. The sort works for 115,000 words but stops working when it reaches 135,000 words; at that point I get a segmentation fault, and I do not understand where it is coming from. The sort runs successfully on text files containing 5K to 125K strings.

The readFile array is initialized with the number of words in the text file. When debugging, the last calls passed into mergeSort() and merge() look like the following:

#0  0x0000000000402a87 in merge (inputString=0x7fffffbde790, from=0, mid=67499, to=134999) at mergeSort.cpp:102
    n1 = 67500
    n2 = 67500
    i = 0
    j = 0
    k = 32767
    L = <error reading variable L (value requires 2160000 bytes, which is more than max-value-size)>
    R = <error reading variable R (value requires 2160000 bytes, which is more than max-value-size)>
#1  0x0000000000402921 in mergeSort (inputString=0x7fffffbde790, from=0, to=134999) at mergeSort.cpp:88
    mid = 67499
Here is my implementation:

void mergeSort(string readFile[], int from, int to) {
    if (from < to) {
        int mid = from + (to - from) / 2;
        mergeSort(readFile, from, mid);
        mergeSort(readFile, mid + 1, to);
        merge(readFile, from, mid, to);
    }
}
void merge(string readFile[], int from, int mid, int to) {
    int n1 = mid - from + 1;
    int n2 = to - mid;

    string L[n1];
    string R[n2];

    for (int i = 0; i < n1; i++) {
        L[i] = readFile[from + i];
    }
    for (int i = 0; i < n2; i++) {
        R[i] = readFile[mid + i + 1];
    }

    int i = 0;
    int j = 0;
    int k = from;

    while (i < n1 && j < n2) {
        if (L[i] <= R[j]) {
            readFile[k] = L[i];
            i++;
        } else {
            readFile[k] = R[j];
            j++;
        }
        k++;
    }
    while (i < n1) {
        readFile[k] = L[i];
        i++;
        k++;
    }
    while (j < n2) {
        readFile[k] = R[j];
        j++;
        k++;
    }
}
  • The recursive routine might be leading to a segmentation fault (due to stack overflow). Read https://stackoverflow.com/a/12146513/3656081 for possible suggestions. You might want to restructure your program to be iterative if this code is going to production. – tangy Feb 02 '20 at 08:04
  • Alternative is the *Bottom-Up Iterative Mergesort*, see [Merge sort - Wikipedia](https://en.wikipedia.org/wiki/Merge_sort) (a sketch follows these comments). With the typical recursive mergesort, you will hit the 4M default stack limit on Linux at about 100,000 `int`. – David C. Rankin Feb 02 '20 at 08:12
  • Your question lacks a [mcve] and the error output (preferably a full backtrace) when the error occurs. – Ulrich Eckhardt Feb 02 '20 at 08:19
  • @rcgldr Towards the end n1 and n2 both end up being 67,500. This is probably a stupid question but I assume having one array allocated with 135,000 elements and two arrays with 67,500 elements would be the same amount on the stack, correct? –  Feb 03 '20 at 02:37
  • @rcgldr Also I have updated my backtrace. –  Feb 03 '20 at 02:50
  • @user8865807 - I worded my last comment badly; the overflow probably occurs in the merge call during the first instance of mergeSort, but that doesn't occur until the recursion reaches its deepest level and returns back up to the first instance, which then calls merge, and that is probably where the stack overflow occurs. Changing merge to use new and delete[] for L and R (as pointers to string) should solve the issue. – rcgldr Feb 03 '20 at 08:34
  • @rcgldr Thank you so much. Is it pretty much standard to use new and delete[] instead of regular arrays when dealing with large data sets? –  Feb 04 '20 at 00:18
  • @user8865807 - new and delete for arrays (of objects) or use a single vector of objects instead. – rcgldr Feb 04 '20 at 01:12
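For reference, here is a minimal sketch of the bottom-up iterative mergesort mentioned in the comments above. It is not from the original post: the function name bottomUpMergeSort is made up for illustration, and it keeps both the input and the single scratch buffer in std::vector so no large array ever lands on the stack.

#include <algorithm>
#include <string>
#include <vector>

using std::string;

// Bottom-up (iterative) merge sort: merge runs of width 1, 2, 4, ... so there
// is no recursion, and the only large allocation is one heap-backed vector.
void bottomUpMergeSort(std::vector<string>& data) {
    size_t n = data.size();
    std::vector<string> aux(n);                  // scratch buffer lives on the heap

    for (size_t width = 1; width < n; width *= 2) {
        for (size_t lo = 0; lo < n; lo += 2 * width) {
            size_t mid = std::min(lo + width, n);
            size_t hi = std::min(lo + 2 * width, n);

            // merge data[lo, mid) and data[mid, hi) into aux[lo, hi)
            size_t i = lo, j = mid, k = lo;
            while (i < mid && j < hi) {
                aux[k++] = (data[i] <= data[j]) ? data[i++] : data[j++];
            }
            while (i < mid) aux[k++] = data[i++];
            while (j < hi) aux[k++] = data[j++];
        }
        data.swap(aux);                          // the merged runs become the next pass's input
    }
}

Called as bottomUpMergeSort(words) on a std::vector<string> holding the file's words, this sorts in place without deep recursion or variable-length stack arrays.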

1 Answer


You allocate the temporary arrays as automatic variables in the merge function. When these arrays become too large, there is not enough stack space for them and you get undefined behavior (e.g. a stack overflow).

To handle arbitrarily large arrays, you should allocate the temporary arrays with malloc or new and free them accordingly. To limit the number of allocations, you could allocate a single temporary array in a wrapper function and pass it down through the recursive mergeSort calls.

Here is a simple fix allocating temporary arrays in the merge function:

void merge(string readFile[], int from, int mid, int to) {
    int n1 = mid - from + 1;
    int n2 = to - mid;

    // allocate the temporaries on the heap instead of the stack
    string *L = new string[n1];
    string *R = new string[n2];

    for (int i = 0; i < n1; i++) {
        L[i] = readFile[from + i];
    }
    for (int i = 0; i < n2; i++) {
        R[i] = readFile[mid + i + 1];
    }

    int i = 0;
    int j = 0;
    int k = from;

    while (i < n1 && j < n2) {
        if (L[i] <= R[j]) {
            readFile[k] = L[i];
            i++;
        } else {
            readFile[k] = R[j];
            j++;
        }
        k++;
    }
    while (i < n1) {
        readFile[k] = L[i];
        i++;
        k++;
    }
    while (j < n2) {
        readFile[k] = R[j];
        j++;
        k++;
    }
    // release the heap temporaries before returning
    delete[] L;
    delete[] R;
}
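A small variation (not part of the original answer, but in line with rcgldr's comment above about using a vector of objects): the temporaries can be held in std::vector, which also allocates from the heap and releases its storage automatically when the function returns.

#include <string>
#include <vector>

using std::string;

// Same merge as above, with the temporaries managed by std::vector.
void merge(string readFile[], int from, int mid, int to) {
    std::vector<string> L(readFile + from, readFile + mid + 1);    // copies readFile[from..mid]
    std::vector<string> R(readFile + mid + 1, readFile + to + 1);  // copies readFile[mid+1..to]

    size_t i = 0;
    size_t j = 0;
    int k = from;

    while (i < L.size() && j < R.size()) {
        if (L[i] <= R[j]) {
            readFile[k++] = L[i++];
        } else {
            readFile[k++] = R[j++];
        }
    }
    while (i < L.size()) {
        readFile[k++] = L[i++];
    }
    while (j < R.size()) {
        readFile[k++] = R[j++];
    }
}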

Here is a more elaborate version, possibly more efficient, allocating a single temporary array:

void merge(string readFile[], size_t from, size_t mid, size_t to, string aux[]) {
    size_t i, j, k;

    for (i = from; i < to; i++) {
        aux[i] = readFile[i];
    }

    i = from;
    j = mid;
    k = from;

    while (i < mid && j < to) {
        if (aux[i] <= aux[j]) {
            readFile[k++] = aux[i++];
        } else {
            readFile[k++] = aux[j++];
        }
    }
    while (i < mid) {
        readFile[k++] = aux[i++];
    }
    while (j < to) {
        readFile[k++] = aux[j++];
    }
}

void mergeSort(string readFile[], size_t from, size_t to, string aux[]) {
    if (to - from > 1) {
        size_t mid = from + (to - from) / 2;
        mergeSort(readFile, from, mid, aux);
        mergeSort(readFile, mid, to, aux);
        merge(readFile, from, mid, to, aux);
    }
}

void mergeSort(string readFile[], size_t n) {
    string *aux = new string[n];
    mergeSort(readFile, 0, n, aux);
    delete[] aux;
}
chqrlie
  • ~135,000 words translates into ~18 levels of recursion (barely over 17). When merge is called, it allocates 2 arrays of std::string, which should only contain the "header" object info per entry, with the actual string memory being allocated from the heap, and those 2 arrays are automatically freed when merge() exits. It's not clear to me if that is the cause of the error. – rcgldr Feb 02 '20 at 21:57
  • The `string` object has a size of 24 bytes on my system in 64-bit mode. I agree that recursion should not be the issue, but allocating 3.24 megabytes on the stack seems quite likely to cause a **stack overflow** :) – chqrlie Feb 02 '20 at 22:20
  • I agree. However using new and delete in merge() should have solved the problem, but with needless overhead. I'm using Visual Studio, which doesn't support variable length arrays, and would have to use _alloca() (allocate from the stack) to implement the equivalent, and the failure should have occurred on the initial call, not at a later level of recursion with smaller arrays. – rcgldr Feb 02 '20 at 22:58
  • @rcgldr: allocating an array of constructed objects with `_alloca()` is not for the faint of heart :) – chqrlie Feb 02 '20 at 23:12
  • That wasn't really my point, but just explaining that VS doesn't support variable length arrays, which are allocated off the stack. Which, as you already pointed out, is most likely the problem: allocating 3.24 megabytes on the initial call, but the OP thinks it's coming from a deeply nested call, which seems unlikely. The simplest fix would be to use new[] and delete[] in the OP's merge function. If this is for a class, an optimized version may make it obvious that the student had too much help. – rcgldr Feb 03 '20 at 04:35
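For completeness, a hypothetical driver showing how the answer's mergeSort(string readFile[], size_t n) wrapper might be called. The file name words.txt and the word-reading loop are assumptions for illustration, not part of the original question, and it presumes the functions above sit in the same translation unit.

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

using std::string;

// Hypothetical driver: read whitespace-separated words from a file (assumed
// name), sort them with the wrapper above, and print the result.
int main() {
    std::ifstream in("words.txt");
    std::vector<string> words;
    string word;
    while (in >> word) {
        words.push_back(word);
    }

    mergeSort(words.data(), words.size());   // calls mergeSort(string[], size_t)

    for (const string& w : words) {
        std::cout << w << '\n';
    }
    return 0;
}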