I'm trying to calculate the entropy of a .exe file by giving it as an input. However, I'm getting a zero value instead of an answer.

The entropy of a file can be understood as the negative of the summation of pi*log(pi) over every distinct character in the file, where pi is the relative frequency of that character. I'm trying to calculate the entropy of a .exe file. However, I end up getting '0'. The '.exe' file itself definitely produces output.

Below is my code.

#include <stdio.h>
#include <stdlib.h>
#include "stdbool.h"
#include <string.h>
#include <conio.h>
#include <math.h>

#define MAXLEN 100 

int makehist( char *S, int *hist, int len) {
   int wherechar[256];
   int i,histlen;
   histlen=0;
   for (i=0;i<256;i++)
       wherechar[i]=-1;
   for (i=0;i<len;i++) {
       if (wherechar[(int)S[i]]==-1) {
           wherechar[(int)S[i]]=histlen;
           histlen++;
       }
       hist[wherechar[(int)S[i]]]++;
   }
   return histlen;
}

double entropy(int *hist, int histlen, int len) {
    int i;
    double H;
    H=0;
    for (i=0;i<histlen;i++) {
        H-=(double)hist[i]/len*log((double)hist[i]/len);
    }
    return H;
}

void main() {
    char S[100];
    int len,*hist,histlen;
    int num;
    double H;
    int i=0;
    int count =0;
    FILE*file = fopen("freq.exe","r");
    while (fscanf(file,"%d",&num)>0)
    {
        S[i]=num;
        printf("%d",S[i]);

        i++;
    }

    hist=(int*)calloc(i,sizeof(int));

    histlen=makehist(S,hist,i);

    H=entropy(hist,histlen,i);
    printf("%lf\n",H);
    getch();
}
Michael Myers
sam

2 Answers

while (fscanf(file,"%d",&num)>0)

This reads numbers encoded as leading white space, optional sign, and a sequence of digits. As soon as some other character is encountered in your file (probably the first byte), your loop will stop. You need to read raw bytes, with getc or fread.
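As a minimal sketch of the getc approach, the helper below tallies every byte of an already opened stream. The name count_bytes and the 256-entry counts array are assumptions made for illustration; they are not part of the original code.

```c
#include <stdio.h>

/* Hypothetical helper: tally every byte of an already opened stream.
   counts must have room for 256 entries; returns the number of bytes read. */
unsigned long count_bytes(FILE *fp, unsigned long counts[256])
{
    unsigned long total = 0;
    int c;   /* int, not char, so EOF stays distinguishable from the byte 0xFF */
    int i;

    for (i = 0; i < 256; i++)
        counts[i] = 0;

    while ((c = getc(fp)) != EOF) {
        counts[c]++;   /* getc returns the byte as an unsigned char value, 0..255 */
        total++;
    }
    return total;
}
```

Unlike the fscanf loop, this does not stop at the first byte that is not part of a decimal number, so the returned total should match the file size when the stream was opened in binary mode.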

Also, please consider doing the most basic debugging before submitting a question to StackOverflow. Surely your printf in that loop never printed anything, yet you don't mention this in your question and apparently didn't investigate why.

Some other issues:

#define MAXLEN 100

This is never used.


void main() 

This is not a valid definition of main. Use

int main(void)

char S[100];

You have undefined behavior if the input contains more than 100 chars, and a .exe file surely will. You really should be feeding the bytes into your histogram calculation as you read them, rather than storing them in a buffer. Easiest is to make wherechar and histlen globals, but you could also put everything you need into a struct and pass a pointer to the struct, together with each byte, to makehist, and again pass a pointer to the struct to entropy.


FILE*file = fopen("freq.exe","r");

Binary files must be opened with "rb" (it doesn't matter on Linux, but it does on Windows, where text mode translates line endings and can treat a Ctrl-Z byte as end of file). Also, you should check whether fopen succeeds.
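A minimal sketch of both points, using a made-up wrapper name open_binary:

```c
#include <stdio.h>

/* Hypothetical wrapper: open a file for binary reading and report failures. */
FILE *open_binary(const char *path)
{
    FILE *fp = fopen(path, "rb");   /* "b" disables text-mode translation on Windows */

    if (fp == NULL)
        perror(path);               /* prints the path followed by the system's reason */
    return fp;
}
```

The caller still has to test the result, for example by returning EXIT_FAILURE from main when open_binary("freq.exe") yields NULL.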


hist=(int*)calloc(i,sizeof(int));

hist should have 256 elements. If you allocate this first, then you can process each byte as it is read per above.


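entropy divides by len, which is zero if the file is empty. The code as posted escapes this only because histlen is then zero as well, so the loop body never runs; you should still check for len == 0.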


wherechar[(int)S[i]] is undefined behavior if the file has chars with negative values, as it surely will. You should use unsigned char instead of char, and then the casts aren't necessary.
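To illustrate the indexing problem, here is a small sketch with a hypothetical helper byte_index. On platforms where plain char is signed, a byte such as 0xFF stored in a char becomes a negative value, and converting it to unsigned char first brings it back into the range 0..255.

```c
/* Hypothetical helper: map a char that may be negative onto a table index 0..255. */
int byte_index(char c)
{
    return (unsigned char)c;   /* conversion to unsigned char wraps negatives into 0..255 */
}
```

Declaring the buffer as unsigned char in the first place makes this conversion, and the (int) casts in makehist, unnecessary.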

Jim Balter

This line seems to be reading numbers:

fscanf(file,"%d",&num)

But I don't really expect to find many numbers in an EXE file. They'd be random byte-values of all different types.

As far as %d is concerned, a number is text made up only of the digits 0-9, optionally preceded by a - or + sign.
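A small sketch of why the loop in the question ends immediately: the hypothetical helper count_decimal_fields below counts how many "%d" conversions succeed before fscanf gives up.

```c
#include <stdio.h>

/* Hypothetical helper: count how many "%d" conversions succeed before fscanf stops. */
int count_decimal_fields(FILE *fp)
{
    int num;
    int n = 0;

    /* fscanf returns 1 for a successful conversion, 0 on a matching failure, EOF at end of input */
    while (fscanf(fp, "%d", &num) == 1)
        n++;
    return n;
}
```

On a stream that starts with a non-digit byte such as 'M', the first call already reports a matching failure, so the count stays at 0. In the question's program that leaves i at 0, and the entropy of zero bytes prints as 0.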

abelenky