Code:

#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include <time.h>

int main()
{
    FILE *fp1, *fp2;
    int ch1, ch2;
    clock_t elapsed;
    char fname1[40], fname2[40];

    printf("Enter name of first file:");
    fgets(fname1, 40, stdin);
    while ( fname1[strlen(fname1) - 1] == '\n')
    {
        fname1[strlen(fname1) -1] = '\0';
    }

    printf("Enter name of second file:");
    fgets(fname2, 40, stdin);
    while ( fname2[strlen(fname2) - 1] == '\n')
    {
        fname2[strlen(fname2) -1] = '\0';
    }

    fp1 = fopen(fname1, "r");
    if ( fp1 == NULL )
    {
        printf("Cannot open %s for reading\n", fname1 );
        exit(1);
    }

    fp2 = fopen( fname2,  "r");
    if (fp2 == NULL)
    {
        printf("Cannot open %s for reading\n", fname2);
        exit(1);
    }

    elapsed = clock(); // get starting time

    ch1  =  getc(fp1); // read a value from each file
    ch2  =  getc(fp2);

    float counter = 0.0;
    float total = 0.0;

    while(1) // keep reading while values are equal or not equal; only end if it reaches the end of one of the files
    {
        ch1 = getc(fp1);
        ch2 = getc(fp2);

    //printf("%d, %d\n", ch1, ch2);// for debugging purposes

    if((ch1 ^ ch2) == 0)
    {
       counter++;
    }

    total++;

        if ( ( ch1 == EOF) || ( ch2 == EOF)) // if either file reaches the end, then its over!
        {
            break; // if either value is EOF
        }
    }

    fclose (fp1); // close files
    fclose (fp2);

    float percent = (counter / (total)) * 100.0;

    printf("Counter: %.2f Total: %.2f\n", counter, (total));
    printf("Percentage: %.2f%\n", percent);

    elapsed = clock() - elapsed; // elapsed time
    printf("That took %.4f seconds.\n", (float)elapsed/CLOCKS_PER_SEC);
    return 0;
}

Trying to compare two .nc files that are about 1.4 GBs and these are my results:

$ gcc check2.c -w
$ ./a.out
Enter name of first file:air.197901.nc
Enter name of second file:air.197902.nc
Counter: 16777216.00 Total: 16777216.00
Percentage: 100.00%
That took 15.6500 seconds.

No way they are 100% identical lol, any ideas on why it seems to stop at the 16777216th byte?

The counter should be 1,256,756,880 bytes

1.3 GB (1,256,756,880 bytes)

I downloaded this climate data set here:

ftp://ftp.cdc.noaa.gov/Datasets/NARR/pressure/

Thanks for your help in advance

humblebeast
  • In fact, `counter` and `total` are `float`, with about 7 digits of precision. I guess that it tries to compute 16777216+1 and gets 16777216. Could you try using `double`? Moreover, two files that are identical except for the first byte will be reported as identical by your program. That is probably what @McLovin meant... – francis Jul 19 '14 at 21:18
  • you don't test the first byte atm. Better use `fpos_t` (if it is an unsigned integer type) or `unsigned long long` for size... – Deduplicator Jul 19 '14 at 21:19
  • What @McLovin wants to say with this is that you are using a `float` for the counter. A `float` has a 24-bit significand, so 2^24 is the largest value it can count up to exactly; that is the limit of its precision. Why are you not using an integral type for such large files (or for counters in general), e.g. a `uint32_t`? In other words: `float` is too limited and the wrong type to use for such a counter. – Rudy Velthuis Jul 19 '14 at 21:22
  • Displaying `total` and `counter` to two decimal places hardly makes any sense either. – Clifford Jul 19 '14 at 21:36
  • You should probably perform the EOF check *before* attempting to compare the characters, though it will not have any effect on the results to two decimal places with such a large file. – Clifford Jul 19 '14 at 21:41
  • The title says the code crashes. Where does it crash? – Apriori Jul 20 '14 at 01:08
  • Any reason you wouldn't just break out as soon as one byte doesn't match? At that point you know the files are not equal and there is no need to keep checking them unless you are really interested in that percentage. If you are interested in the percentage, you might think about including the bytes that were left over from one file in the total. If you are not, you can just skip the equality check altogether if the files are not the same size, because you know they won't match. – Apriori Jul 20 '14 at 01:16
  • @Apriori Yes I am interested in the percentage, I would like to see the performance on this byte-by-byte comparison because it is very costly in terms of I/O – humblebeast Jul 20 '14 at 01:36
  • @Apriori Also, I am not interested in the total bytes leftover because I would just like to compare how many duplicate bytes are within the files. So, as soon as one file reaches the end, it will stop – humblebeast Jul 20 '14 at 01:39
  • And you are correct, it doesn't crash. I will change the title. I thought it did because initially I thought it didn't read through the whole file based on the percentage – humblebeast Jul 20 '14 at 01:40
  • @humblebeast: But you display the identical bytes as a percentage of the total bytes. I would think of bytes the larger file has that the smaller file does not as bytes that are not equal. It's also worth noting that with this logic you could have two files that are 100% equal, but the second file is much larger than the first. This seems wrong to me. Or two files that are 50% equal, the first file being two bytes total with one byte equal, and the second file being several MB. You will also get a division by zero if one of the files is zero size. – Apriori Jul 20 '14 at 01:53
  • @humblebeast: That's just my personal opinion of what makes sense and is intuitive though. Of course feel free to do what is right for you. – Apriori Jul 20 '14 at 01:53

2 Answers

The float data type is only precise to about 6 or 7 significant decimal digits (it has a 24-bit significand) and is inappropriate for counter and total. Any floating-point type would be inappropriate in any case. There are a number of issues with this, not least that once the value reaches 2^24 = 16777216, the result of ++ (16777217) is not representable as a float and rounds back to 16777216, so the counter stops advancing.
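The saturation described above can be reproduced in isolation. The following is a minimal sketch, assuming IEEE 754 single-precision `float` with the default round-to-nearest mode; `count_up` is a hypothetical helper written for this illustration and is not part of the question's program.

```c
#include <stdio.h>

/* Hypothetical helper: apply `counter++` to a float n times,
   starting from `start`, the same way the question's loop does. */
float count_up(float start, unsigned long n)
{
    float counter = start;
    for (unsigned long i = 0; i < n; i++) {
        counter++;   /* stops having any effect once counter == 2^24 */
    }
    return counter;
}
```

Under those assumptions, `count_up(16777215.0f, 1)` still reaches 16777216, but `count_up(16777216.0f, 10)` returns 16777216 again, which matches the Counter and Total values printed in the question.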

I assume you chose such a type because it has greater range than unsigned int perhaps? I suggest that you use unsigned long long for these variables.

unsigned long long counter = 0;
unsigned long long total = 0;

...

float percent = (float)counter / (float)total * 100.0f ;
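Putting the suggestion together with the points raised in the comments (the first byte is currently skipped, and the EOF check comes after the comparison), the loop might be restructured along these lines. This is only a sketch: `compare_streams` and `match_percent` are hypothetical names, and it assumes the caller has already opened both files, preferably in binary mode (`"rb"`), since the `.nc` files are not text.

```c
#include <stdio.h>

/* Hypothetical helper: compare two open streams byte by byte and stop
   as soon as either one ends, as the question intends.
   *total receives the number of positions compared,
   *matches the number of positions where the bytes were equal. */
void compare_streams(FILE *a, FILE *b,
                     unsigned long long *matches,
                     unsigned long long *total)
{
    int ch1, ch2;

    *matches = 0;
    *total = 0;

    for (;;) {
        ch1 = getc(a);              /* the first byte is read here, not before the loop */
        ch2 = getc(b);
        if (ch1 == EOF || ch2 == EOF)
            break;                  /* check for EOF before comparing or counting */
        if (ch1 == ch2)
            (*matches)++;
        (*total)++;
    }
}

/* Hypothetical helper: percentage of matching bytes.
   Returns 0.0 when nothing was compared, to avoid dividing by zero. */
double match_percent(unsigned long long matches, unsigned long long total)
{
    if (total == 0)
        return 0.0;
    return 100.0 * (double)matches / (double)total;
}
```

The counters can be printed with `%llu`. Using `double` rather than `float` for the final ratio is a choice made in this sketch; either is adequate for two decimal places.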
Clifford
  • Agreed, it's really bad to use float/double for a counter such as this. As far as the percentage computation, better yet: use a fixed point computation for the percentage. `unsigned long long percent = (100 * counter) / total;` or for rounding `unsigned long long percent = (1000 * counter) / total; percent = percent / 10 + (percent % 10 >= 5);` it can easily be adapted for more decimal places. This saves a float to int conversion. – Apriori Jul 20 '14 at 01:05
  • I considered advising a fixed point solution, but given that the output is displayed to two decimal places, using fp is the simple solution and on a target with floating-point is inexpensive. – Clifford Jul 20 '14 at 23:07
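For reference, the integer-only rounding mentioned in the comments above could be wrapped up as follows. This is a sketch under stated assumptions: `percent_rounded` is a hypothetical name, `total` must be nonzero, and `1000 * counter` must fit in an `unsigned long long` (easily true for files of a few GB).

```c
/* Hypothetical helper: percentage rounded to the nearest whole number,
   using integer arithmetic only.
   Assumes total != 0 and that 1000 * counter does not overflow. */
unsigned long long percent_rounded(unsigned long long counter,
                                   unsigned long long total)
{
    unsigned long long tenths = (1000 * counter) / total; /* percent, in tenths */
    return tenths / 10 + (tenths % 10 >= 5);              /* round half up */
}
```

For example, 2 matches out of 3 gives 666 tenths of a percent, which rounds to 67.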
Use int type for counter and total variables

Clifford
vlk