
I want to merge sort the data from a text file, line by line.

For my merge sort to work, I have to read the data line by line and store each line's integers in an array in order to sort them. (We only know that there are at most 10000 integers per line.) I've done my research and tried these approaches:

1. Use fgets, and strtok/strtol.

Problem: I don't know the maximum length of a line, so I can't size the char array. Besides, declaring a huge array may cause a buffer overflow.

Source: How many chars can be in a char array?

2. Use fscanf to input into integer array.

Problem: The same. I don't know how many integers there are in a line, so I don't know how many "%d" conversions the format string should contain.

3. Use fscanf to input in the form of char array, and use strtok/strtol.

Problem: The same. Since I don't know the length, I can't do something like

char *data;
data = malloc(sizeof(char) * datacount);

since "datacount" is unknown.

Is there any way out?

UPDATE

Sample Input:

-16342 2084 -10049 10117 2786
3335 3512 -10936 5343 -1612 -4845 -14514

Sample Output:

-16342 -10049 2084 2786 10117
-14514 -10936 -4845 -1612 3335 3512 5343
李智修
  • 2
    Consider using [getline](https://linux.die.net/man/3/getline). – Paul R Apr 09 '18 at 08:23
  • 2
    It's not clear what you need to do. Please add an input/output example. If you only need to sort ints from several lines, you don't need to read line by line; just read int by int, using for example `strtol` and skipping spaces and line breaks. – Ôrel Apr 09 '18 at 08:25
  • Use `fscanf` **and a loop**. – iBug Apr 09 '18 at 08:33
  • @PaulR I just looked at the documentation you provided, but I saw that this method still needs the buffer length specified as a size_t. I may be wrong though. – 李智修 Apr 09 '18 at 08:37
  • 2
    @李智修 The POSIX `getline` function can automatically allocate just enough memory needed for the line, if passed the correct arguments. – Some programmer dude Apr 09 '18 at 08:40
  • @Ôrel Ok, I will update my post. What do you mean by using strtol int by int? But I need to at least input the char array so I can do further process. – 李智修 Apr 09 '18 at 08:42
  • 'For my merge sort to work, I have to read in the data line by line, and fit them into an array in order to sort them' - not with a merge sort, no. – Martin James Apr 09 '18 at 09:37
  • @MartinJames What do you mean? Is there another way to sort it without withdrawing data from the txt file? – 李智修 Apr 09 '18 at 09:59
  • Why are you reading the whole list of lines into an array if you are trying to demonstrate merge sort? Merge sort is useful precisely in cases where not all the data fits in core memory. I don't understand some of the requirements you consider necessary; they aren't. – Luis Colorado Apr 12 '18 at 09:43
  • @LuisColorado Sorry, but our professor asked us to also use pthreads: one thread for merging, two for merge sort. I didn't mention this since it is unrelated to the problem. – 李智修 Apr 15 '18 at 13:01

3 Answers

2

You said:

We only know that there are at most 10000 integer per line

So just go for it. The C standard (C99 and later) requires a hosted implementation to support an object of at least 65,535 bytes. With sizeof(int) == 4, an array of 10,000 int values takes about 40,000 bytes, which is well under that limit, so just define a static array:

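#include <stdio.h>

int a[10001], n;

int main(void) {
    /* Open file ("input.txt" is a placeholder name) */
    FILE *fp = fopen("input.txt", "r");
    if (fp == NULL)
        return 1;

    n = 0;
    /* Stop when the array is full or no more integers can be read */
    while (n < 10001 && fscanf(fp, " %d", &a[n]) == 1) {
        // Process a[n]
        n++;
    }
    fclose(fp);
    return 0;
}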

If you know your platform's specification (such as sizeof(int) == 4), you can bound the length of an integer's text form. For example, a 32-bit integer takes at most 11 characters (as in -2147483648). You can then define a char array with a calculated length.
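The calculated length could look something like the sketch below. It assumes at most 10000 integers per line, at most 11 characters per integer, and one separator character after each; MAX_INTS, MAX_INT_CHARS, LINE_BUF_SIZE, parse_line and read_row are names made up for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>

/* Assumptions (not from the answer): at most 10000 integers per line,
   each at most 11 characters ("-2147483648"), one separator after each. */
#define MAX_INTS      10000
#define MAX_INT_CHARS 11
#define LINE_BUF_SIZE (MAX_INTS * (MAX_INT_CHARS + 1) + 2)  /* +2: '\n' and '\0' */

/* Parses the integers in 'line' into 'out' (capacity 'cap').
   Returns the number of integers stored; parsing stops at the first
   token that is not a valid int, or when 'out' is full. */
size_t parse_line(const char *line, int *out, size_t cap)
{
    size_t n = 0;
    assert(line != NULL && out != NULL);

    while (n < cap) {
        char *end;
        errno = 0;
        long v = strtol(line, &end, 10);
        if (end == line)                 /* nothing left to convert */
            break;
        if (errno == ERANGE || v < INT_MIN || v > INT_MAX)
            break;                       /* does not fit in an int */
        out[n++] = (int)v;
        line = end;
    }
    return n;
}

/* Reads one line from 'fp' and parses it into 'out'.
   Returns 1 if a line was read (its count is stored in *n),
   or 0 at end of input. */
int read_row(FILE *fp, int *out, size_t cap, size_t *n)
{
    static char line[LINE_BUF_SIZE];     /* static: about 120 KB, kept off the stack */

    if (fgets(line, sizeof line, fp) == NULL)
        return 0;
    *n = parse_line(line, out, cap);
    return 1;
}
```

Calling read_row in a loop until it returns 0 gives you one array of integers per line, which also answers the "which row am I on" question, because each call consumes exactly one line as long as the line fits the assumed limits.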

iBug
  • But what about the rows part? How can we be sure whether the flow is moving onto the next line? – 李智修 Apr 09 '18 at 10:33
  • .... But why is it necessary to read all the integers before sorting... merge sort can begin as soon as the first integer is read... no need to read even one line of text. – Luis Colorado Apr 12 '18 at 09:44
  • @LuisColorado It's because we need to also implement the multi-threading by pthread into 3 thread(2 merge sort, 1 merge). It would be easier in this way IMO. – 李智修 Apr 12 '18 at 10:00
  • anyway, there's no need to read all the data into a huge array.... the thread chain must be configured as a pipeline that does one merge phase and creates a new thread when it detects the phase is not enough for a single merge of all data. You need an array only to store one row of data... not more. – Luis Colorado Apr 12 '18 at 11:14
1

You can indeed use fscanf to read the individual integers. What you need besides that is to know not only about pointers and malloc but also about realloc.

You simply do something like

int temporary_int;
int *array = NULL;
size_t array_size = 0;

while (fscanf(your_file, "%d", &temporary_int) == 1)
{
    int *temporary_array = realloc(array, (array_size + 1) * sizeof(int));
    if (temporary_array == NULL)
    {
        /* Out of memory: stop reading, keeping what was stored so far. */
        break;
    }

    array = temporary_array;
    array[array_size++] = temporary_int;
}

After that loop, if array is not a null pointer then it will contain all the integers from the file, no matter how many there were. The size (number of elements) is in the array_size variable.
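Reallocating for every single integer keeps the code short, but it is not the cheapest option. A common alternative is to double the capacity whenever the array is full. Here is a minimal sketch of that idea; struct int_vec, int_vec_push and read_all_ints are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical growable array of int; not part of the original answer. */
struct int_vec {
    int    *data;
    size_t  size;      /* elements in use */
    size_t  capacity;  /* elements allocated */
};

/* Appends 'value', doubling the capacity when the array is full.
   Returns 0 on success, or -1 if memory could not be allocated
   (the existing contents stay valid in that case). */
int int_vec_push(struct int_vec *v, int value)
{
    assert(v != NULL);

    if (v->size == v->capacity) {
        size_t new_cap = v->capacity ? v->capacity * 2 : 16;
        int *tmp = realloc(v->data, new_cap * sizeof *tmp);
        if (tmp == NULL)
            return -1;
        v->data = tmp;
        v->capacity = new_cap;
    }
    v->data[v->size++] = value;
    return 0;
}

/* Reads every integer from 'fp' into 'v'.
   Returns 0 on success, -1 on allocation failure. */
int read_all_ints(FILE *fp, struct int_vec *v)
{
    int value;

    while (fscanf(fp, "%d", &value) == 1) {
        if (int_vec_push(v, value) != 0)
            return -1;
    }
    return 0;
}
```

Start with a zero-initialized struct (`struct int_vec v = {0};`) and free `v.data` when you are done. With doubling, the number of realloc calls grows with the logarithm of the element count instead of linearly.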


After seeing the update it's much easier to understand what is wanted.

In pseudo-code it's easy:

while(getline(line))
{
    array_of_ints = create_array_of_ints();

    for_each(token in line)
    {
        number = convert_to_integer(token);
        add_number_to_array(array_of_ints, number);
    }

    sort_array(array_of_ints);
    display_array(array_of_ints);
}

Actually implementing this is much harder, and depends somewhat on your environment (like if you have access to the POSIX getline function).

If you have e.g. getline (or a similar function), then the outer loop in the pseudo-code is easy and will look much like it already does. Otherwise you basically have to read character by character into a buffer that you dynamically expand (using realloc) to fit the whole line.
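If getline is not available, the character-by-character fallback might look roughly like this. It is only a sketch in standard C; read_line is a made-up name, and the initial size of 128 and the doubling policy are arbitrary choices.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for getline(), using only standard C.
   Reads one line (without the trailing '\n') into a malloc'd,
   NUL-terminated string that the caller must free().
   Returns NULL at end of input or if memory runs out. */
char *read_line(FILE *fp)
{
    size_t len = 0, cap = 0;
    char *buf = NULL;
    int c;

    assert(fp != NULL);

    while ((c = fgetc(fp)) != EOF) {
        if (len + 1 >= cap) {            /* keep room for c and the '\0' */
            size_t new_cap = cap ? cap * 2 : 128;
            char *tmp = realloc(buf, new_cap);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap = new_cap;
        }
        if (c == '\n')                   /* end of this line */
            break;
        buf[len++] = (char)c;
    }

    if (c == EOF && len == 0) {          /* nothing was read at all */
        free(buf);
        return NULL;
    }

    buf[len] = '\0';
    return buf;
}
```

An empty line comes back as an empty string rather than NULL, so a loop such as `while ((line = read_line(fp)) != NULL)` still sees every row of the file.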

That brings us to the contents of the outer loop: splitting the input into a set of values. You already have the basic solution in the first code snippet in this answer, where I reallocate the array as needed in the loop. To split the line into tokens, strtok is probably the simplest function to use. Converting a token to an integer can be done with strtol (if you want validation) or atoi (if you don't care about validating your input).
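The strtok plus strtol combination could be sketched like this, assuming the line is already in a modifiable buffer and the caller provides the output array. tokens_to_ints is a name invented for this example.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Splits 'line' in place with strtok and converts each token with strtol.
   Stores up to 'cap' values in 'out' and returns how many were stored,
   or -1 if a token is not a valid int.
   Note: strtok modifies 'line' and keeps internal state, so this
   function is not safe to call from several threads at once. */
long tokens_to_ints(char *line, int *out, size_t cap)
{
    size_t n = 0;
    assert(line != NULL && out != NULL);

    for (char *tok = strtok(line, " \t\r\n");
         tok != NULL && n < cap;
         tok = strtok(NULL, " \t\r\n")) {
        char *end;
        errno = 0;
        long v = strtol(tok, &end, 10);

        /* Validation: the whole token must be consumed and fit in an int. */
        if (end == tok || *end != '\0' || errno == ERANGE ||
            v < INT_MIN || v > INT_MAX)
            return -1;

        out[n++] = (int)v;
    }
    return (long)n;
}
```

Since the assignment mentioned in the comments involves pthreads, it may be worth looking at strtok_r (POSIX) or at calling strtol directly on the line, because plain strtok shares hidden state between calls.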

Note that you don't really need to allocate the arrays dynamically. On current systems where sizeof(int) == 4, 10000 int values take "only" 40000 bytes. That's small enough to fit even on the stack of most non-embedded systems.
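As a rough end-to-end sketch of the pseudo-code above, assuming POSIX getline is available, that each line holds at most 10000 values, and that every value fits in an int: sort_lines, merge_sort and MAX_INTS are made-up names, and the merge sort is a plain top-down version, not necessarily the one you already have.

```c
#define _POSIX_C_SOURCE 200809L  /* for getline() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_INTS 10000  /* assumed upper bound on integers per line */

/* Top-down merge sort of a[lo..hi), using 'tmp' as scratch space. */
static void merge_sort(int *a, int *tmp, size_t lo, size_t hi)
{
    if (hi - lo < 2)
        return;

    size_t mid = lo + (hi - lo) / 2;
    merge_sort(a, tmp, lo, mid);
    merge_sort(a, tmp, mid, hi);

    /* Merge the two sorted halves into tmp, then copy back. */
    size_t i = lo, j = mid, k = lo;
    while (i < mid && j < hi)
        tmp[k++] = (a[j] < a[i]) ? a[j++] : a[i++];
    while (i < mid)
        tmp[k++] = a[i++];
    while (j < hi)
        tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof *a);
}

/* Reads 'in' line by line, sorts the integers of each line, and
   writes them to 'out' separated by single spaces.
   Returns the number of lines processed. */
size_t sort_lines(FILE *in, FILE *out)
{
    int row[MAX_INTS], tmp[MAX_INTS];  /* fixed-size, on the stack */
    char *line = NULL;                 /* getline allocates and grows this */
    size_t cap = 0, lines = 0;

    assert(in != NULL && out != NULL);

    while (getline(&line, &cap, in) != -1) {
        size_t n = 0;
        char *p = line, *end;

        /* Parsing stops at the first thing that is not a number. */
        while (n < MAX_INTS) {
            long v = strtol(p, &end, 10);
            if (end == p)
                break;
            row[n++] = (int)v;         /* assumes the value fits in an int */
            p = end;
        }

        merge_sort(row, tmp, 0, n);

        for (size_t i = 0; i < n; i++) {
            if (i)
                fputc(' ', out);
            fprintf(out, "%d", row[i]);
        }
        fputc('\n', out);
        lines++;
    }

    free(line);
    return lines;
}
```

Passing a NULL pointer and a zero size to getline is what lets it allocate the buffer itself, so you never have to guess the line length. Calling `sort_lines(stdin, stdout)` on the sample input from the question should give the sample output.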

Some programmer dude
  • 3
    You should mention that doing a `realloc` on each and every iteration is not the most efficient solution (correct me if I'm wrong). – Jabberwocky Apr 09 '18 at 08:28
  • @MichaelWalz True, it's not. But it's easier to understand and much less code than allocating chunks of multiple "elements". I tried to keep it as simple as possible. – Some programmer dude Apr 09 '18 at 08:29
  • Initializing the array with an arbitrary size and then doubling it each time there is no room left is easy to implement later, once you have the parsing algorithm done. – Ôrel Apr 09 '18 at 08:30
  • @Ôrel Thanks, I think I know what you mean now. – 李智修 Apr 09 '18 at 09:03
  • @Someprogrammerdude But how can I check that the whole process is going line by line? Since we read one integer at a time, we may lose track of the rows. I tried to use another while() outside of your loop to check for EOF, but my attempts always seem to behave weirdly. For example, while(fscanf(file, "%d", &anothertmp) != EOF) may produce redundant output. – 李智修 Apr 09 '18 at 10:31
  • @李智修 After seeing your update it's easier to understand what you want. Updated answer – Some programmer dude Apr 09 '18 at 11:11
  • Thanks, I know what to do now :). – 李智修 Apr 11 '18 at 00:45
1

The one approach you did not consider is probably the most portable one: write your own "read one token from a FILE stream and convert it to a long, but never cross a newline boundary" function:

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>

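/* Reads one integer (long) from 'input', saving it to 'to'.
   Note: strtol() is called with base 0 below, so a token with a
   leading zero is parsed as octal rather than decimal.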
   Returns: 0    if success,
            EOF  if end of input,
            '\n' if newline (end of line),
            '!'  if the number is too long, and
            '?'  if the input is not a number.
*/
int read_long(FILE *input, long *to)
{
    char    token[128], *end;
    size_t  n = 0;
    long    value;
    int     c;

    /* Consume leading whitespace, excluding newlines. */
    do {
        c = fgetc(input);
    } while (c == '\t' || c == '\v' || c == '\f' || c == ' ');

    /* End of input? */
    if (c == EOF)
        return EOF;

    /* Newline? */
    if (c == '\n' || c == '\r') {
        /* Do not consume the newline character! */
        ungetc(c, input);
        return '\n';
    }

    /* Accept a single '+' or '-'. */
    if (c == '+' || c == '-') {
        token[n++] = c;
        c = fgetc(input);
    }

    /* Accept a zero, followed by 'x' or 'X'. */
    if (c == '0') {
        token[n++] = c;
        c = fgetc(input);
        if (c == 'x' || c == 'X') {
            token[n++] = c;
            c = fgetc(input);
        }
    }

    /* Accept digits. */
    while (c >= '0' && c <= '9') {
        if (n < sizeof token - 1)
            token[n] = c;
        n++;
        c = fgetc(input);
    }

    /* Do not consume the separator. */
    if (c != EOF)
        ungetc(c, input);

    /* No token? */
    if (!n)
        return '?';

    /* Too long? */
    if (n >= sizeof token)
        return '!';

    /* Terminate token, making it a string. */
    token[n] = '\0';

    /* Parse token. */
    end = token;
    errno = 0;
    value = strtol(token, &end, 0);
    if (end != token + n || errno != 0)
        return '?';

    /* Save value. */
    if (to)
        *to = value;

    return 0;
}

To advance to the next line, you can use e.g.

/* Skips the rest of the current line,
   to the beginning of the next line.
   Returns: 0 if success (next line exists, although might be empty)
          EOF if end of input.
*/
int next_line(FILE *input)
{
    int  c;

    /* Skip the rest of the current line, if any. */
    do {
        c = fgetc(input);
    } while (c != EOF && c != '\n' && c != '\r');

    /* End of input? */
    if (c == EOF)
        return EOF;

    /* Universal newline support. */
    if (c == '\n') {
        c = fgetc(input);
        if (c == EOF)
            return EOF;
        else
        if (c == '\r') {
            c = fgetc(input);
            if (c == EOF)
                return EOF;
        }
    } else
    if (c == '\r') {
        c = fgetc(input);
        if (c == EOF)
            return EOF;
        else
        if (c == '\n') {
            c = fgetc(input);
            if (c == EOF)
                return EOF;
        }
    }

    ungetc(c, input);
    return 0;
}

To read the longs on each line, you can use a dynamically resized buffer, shared across lines:

int main(void)
{
    long   *field_val = NULL;
    size_t  field_num = 0;
    size_t  field_max = 0;

    int     result;

    do {
        /* Process the fields in one line. */
        field_num = 0;

        do {

            /* Make sure the array has enough room. */
            if (field_num >= field_max) {
                void *temp;

                /* Growth policy; this one is linear (not optimal). */
                field_max = field_num + 5000;

                temp = realloc(field_val, field_max * sizeof field_val[0]);
                if (!temp) {
                    fprintf(stderr, "Out of memory.\n");
                    return EXIT_FAILURE;
                }

                field_val = temp;
            }

            result = read_long(stdin, field_val + field_num);
            if (result == 0)
                field_num++;

        } while (result == 0);

        if (result == '!' || result == '?') {
            fprintf(stderr, "Invalid input!\n");
            return EXIT_FAILURE;
        }

        /*
         * You now have 'field_num' longs in 'field_val' array.
        */

        /* Proceed to the next line. */
    } while (!next_line(stdin));

    free(field_val);
    field_val = NULL;
    field_max = 0;

    return EXIT_SUCCESS;
}

While reading input character by character is not the most efficient way (it tends to be slightly slower than reading e.g. line by line), its versatility compensates for that.

For example, the above code works for any newline convention (CRLF or \r\n, LFCR or \n\r, CR \r, and LF \n) (but in Windows you'll want to specify the "b" flag for fopen() to stop it from making its own newline mangling).

The read-field-by-field approach is also easily extended to e.g. CSV format, including its peculiar quoting rules, and even embedded newlines.

Nominal Animal
  • Your approach is good, yet I prefer a simpler way, since I'm implementing this on a normal Linux machine. Thanks for your answer anyway :). – 李智修 Apr 11 '18 at 00:47
  • 1
    @李智修: No worries! I personally would use `getline()` + `strtol()`, myself. I just wanted to point out this approach, because you did not mention it in the question, and it works well for e.g. proper CSV parsing. – Nominal Animal Apr 11 '18 at 07:36