First off, I am not a Unix expert by any stretch, so please forgive any naivety in my question.

I need to list the unencrypted files in a given directory that may contain both encrypted and unencrypted files.

I cannot reliably identify these files by file extension alone and was hoping someone in the SO community might be able to help me out.

I can run:

file * | egrep -w 'text|XML'

but that only identifies files that are either text or XML. I could fall back on this if I can't do much better, since at the moment the only other files in the directory are text or XML, but I would really like to identify all unencrypted files, whatever type they may be.

Is this possible in a single line command?

EDIT: The encrypted files are encrypted with OpenSSL.

The command I use to unencrypt the files is:

openssl enc -d -aes128 -in <encrypted_filename> -out <unencrypted_filename>
Ollie
  • How can you tell whether a file is encrypted? – kev Mar 02 '12 at 13:25
  • Try `file * | grep -v 'encrypted'` – kev Mar 02 '12 at 13:34
  • @kev Unfortunately that didn't work, it still lists the openSSL encrypted files in the results. – Ollie Mar 02 '12 at 13:56
  • When I run `file x.enc`, it prints `x.enc: data`. So you can try `file * | grep -vw 'data'` – kev Mar 02 '12 at 14:03
  • @kev, if I do that I get all the 'data' files in the directory (which includes both the encrypted and unencrypted ones). That is not what I want, hence the question. Using the `file` command, I can't seem to differentiate between the encrypted and unencrypted files. – Ollie Mar 02 '12 at 14:06
  • I think you should change the title of the question. You are not asking how to list the encrypted files, but merely how to identify whether or not a file is encrypted. – William Pursell Mar 02 '12 at 14:36
  • @WilliamPursell, the end result I want is a list of the unencrypted files in the directory. I'm assuming that to list those files some method of identifying which files are encrypted is needed. If there is a better way then please let me know. – Ollie Mar 02 '12 at 14:40
  • Exactly what command did you use to encrypt the files? – Keith Thompson Mar 02 '12 at 16:52
  • @KeithThompson, I didn't; the files are sent in from a partner organisation. I know they are encrypted using `openssl`, but I can find out the exact command on Monday and I'll post it then. – Ollie Mar 02 '12 at 19:14

2 Answers

Your problem is not a trivial one. The Solaris file command uses "magic" (/etc/magic), a set of rules that attempt to determine what kind of file it is looking at. It is not perfect.

If you read the /etc/magic file, note that the last column is verbiage that is in the output of the file command when it recognizes something, some structure in a file.

Basically, the file command looks at the first few bytes of a file, just as the exec() family of system calls does. So #!/bin/sh at the very start of the first line of a file tells exec() which "command interpreter" it needs to invoke to "run" the file. file uses the same idea and reports "command text", "awk text", etc.
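
As a rough illustration of that "look at the first few bytes" idea, here is a minimal sketch that checks a single magic value by hand. The function name has_shebang is made up for this example; it is not part of file or of /etc/magic, and it tests only one of the many rules file applies.

```shell
#!/bin/sh
# has_shebang FILE
# Hypothetical helper: succeeds if FILE starts with the two bytes "#!",
# which is the marker exec() looks for to find an interpreter.
has_shebang() {
    # dd reads exactly one 2-byte block from the start of the file.
    [ "$(dd if="$1" bs=2 count=1 2>/dev/null)" = "#!" ]
}

# Usage (not run here):
#   has_shebang myscript && echo "myscript names an interpreter"
```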

Your issue is that you have to work out what kinds of answers file will give for the files you are going to see. You need to spend time delving into the non-encrypted files to see what "answers" you can expect from file. Otherwise, you can run file over the whole directory tree and pick out the answers you think are correct.

find /path/to/files -type f -exec file {} \; | nawk -F':' '!arr[$2]++'  > outputfile

This gives you a list of distinct answers about what file thinks you have. Put the ones you like in a file, call it good.txt

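find /path/to/files -type f -exec file {} \; > bigfile
# good.txt is assumed to hold lines kept from outputfile, still in the "name: description" format
nawk -F':' 'FILENAME=="good.txt" {arr[$2]++}     # remember each accepted description
          FILENAME=="bigfile" {if($2 in arr) {print $1}} ' good.txt bigfile > nonencryptedfiles.txt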

THIS IS NOT 100% guaranteed. file can be fooled.

jim mcnamara
  • Thanks Jim for this answer. I'm going to go with something like this as it should cover what I need, though I do appreciate it is not 100% reliable, so I will have to put a few other restrictions in place within the parent system. – Ollie Mar 05 '12 at 11:22

The way to identify encrypted files is by the amount of randomness, or entropy, they contain. Files that are encrypted (or at least files that are encrypted well) should look random in the statistical sense. Files that contain unencrypted information—whether text, graphics, binary data, or machine code—are not statistically random.
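
One way to put a number on "looks random" is the Shannon entropy of the byte values. The sketch below is only an illustration: the function name entropy_bits and the 4096-byte default sample size are my own choices, and it assumes od and awk are available. Values close to 8 bits per byte suggest random-looking data, and plain text usually scores noticeably lower, but compressed formats (zip, JPEG and so on) also score high, so this cannot prove that a file is encrypted.

```shell
#!/bin/sh
# entropy_bits FILE [NBYTES]
# Hypothetical helper: prints the Shannon entropy, in bits per byte,
# of the first NBYTES bytes of FILE (default 4096).
entropy_bits() {
    dd if="$1" bs="${2:-4096}" count=1 2>/dev/null |
    od -An -v -tu1 |                      # one decimal number per byte
    awk '{ for (i = 1; i <= NF; i++) { count[$i]++; total++ } }
         END {
             if (total == 0) { print "0.00"; exit }   # empty file
             h = 0
             for (b in count) {
                 p = count[b] / total                 # frequency of byte value b
                 h -= p * log(p) / log(2)
             }
             printf "%.2f\n", h
         }'
}

# Usage (not run here): list files from least to most random-looking.
#   for f in *; do printf '%s\t%s\n' "$(entropy_bits "$f")" "$f"; done | sort -n
```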

A standard way to calculate randomness is with an autocorrelation function. You'd probably need to autocorrelate only the first few hundred bytes of each file, so the process can be fairly quick.
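
As a simplified sketch of that idea, the following computes only the lag-1 autocorrelation (the serial correlation between each byte and the next one, wrapping around at the end) rather than a full autocorrelation function. The name serial_correlation and the 512-byte default are assumptions made for this example. Results near 0 are what random data tends to produce; structured data tends to drift away from 0, though how far is something you would need to calibrate on your own files.

```shell
#!/bin/sh
# serial_correlation FILE [NBYTES]
# Hypothetical helper: prints the lag-1 serial correlation coefficient
# of the first NBYTES bytes of FILE (default 512), or "nan" when it is
# undefined (fewer than two bytes, or every byte identical).
serial_correlation() {
    dd if="$1" bs="${2:-512}" count=1 2>/dev/null |
    od -An -v -tu1 |
    awk '{ for (i = 1; i <= NF; i++) x[n++] = $i }
         END {
             if (n < 2) { print "nan"; exit }
             for (i = 0; i < n; i++) {
                 succ = x[(i + 1) % n]        # following byte, wrapping at the end
                 sum     += x[i]
                 sumsq   += x[i] * x[i]
                 sumprod += x[i] * succ
             }
             denom = n * sumsq - sum * sum
             if (denom == 0) { print "nan"; exit }
             printf "%.2f\n", (n * sumprod - sum * sum) / denom
         }'
}
```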

It's a hack, but you might be able to take advantage of one of the properties of compression algorithms: they work by removing redundancy from data, and data that looks random has almost none to remove. Encrypted files therefore cannot be compressed (or again, at least not much), so you might try compressing some portion of each file and comparing the compression ratios.
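
A minimal sketch of the compression test, assuming gzip is installed. The function name compress_ratio and the 4096-byte sample size are made up for this example. A ratio around 1.0 or above means the sample did not shrink, which is what you would expect from encrypted data, while ordinary text usually comes out well below 1.0. Files that are already compressed will also sit near 1.0, so treat the result as a hint rather than proof.

```shell
#!/bin/sh
# compress_ratio FILE [NBYTES]
# Hypothetical helper: prints compressed size / original size for the
# first NBYTES bytes of FILE (default 4096), or "nan" for an empty file.
compress_ratio() {
    nbytes=${2:-4096}
    # tr strips the padding some wc implementations add around the count.
    orig=$(dd if="$1" bs="$nbytes" count=1 2>/dev/null | wc -c | tr -d ' ')
    packed=$(dd if="$1" bs="$nbytes" count=1 2>/dev/null | gzip -c | wc -c | tr -d ' ')
    if [ "$orig" -eq 0 ]; then
        echo "nan"
        return
    fi
    awk -v o="$orig" -v p="$packed" 'BEGIN { printf "%.2f\n", p / o }'
}

# Usage (not run here): print the ratio next to each file name.
#   for f in *; do printf '%s\t%s\n' "$(compress_ratio "$f")" "$f"; done | sort -n
```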

SO has several other questions about finding randomness or entropy, and many of them have good suggestions, like this one: How can I determine the statistical randomness of a binary string?

Good luck!

Adam Liss