2

I have a script that I built following the blueprint in the book Learning Python for Forensics. The script will go through a directory specified by the user and collect the metadata for each file in the directory. The results are saved to a sqlite database and also written to a CSV or HTML file.

The script was originally written in Python 2.7.15. I am trying to update the code for Python 3.7. However, there is one line in the ingest directory function that is giving me issues.

The ingestDirectory function looks like this:

def ingestDirectory(cur, source, custodian_id):    
    count = 0
    for root, folders, files in os.walk(source):
        for file_name in files:
            meta_data = dict()
            try:
                meta_data['file_name'] = file_name
                meta_data['file_path'] = os.path.join(root, file_name)
                meta_data['extension'] = os.path.splitext(file_name)[-1]

                file_stats = os.stat(meta_data['file_path'])
                meta_data['mode'] = oct(file_stats.st_mode)
                meta_data['inode'] = int(file_stats.st_ino)
                meta_data['file_size'] = int(file_stats.st_size)
                meta_data['atime'] = formatTimestamp(file_stats.st_atime)
                meta_data['mtime'] = formatTimestamp(file_stats.st_mtime)
                meta_data['ctime'] = formatTimestamp(file_stats.st_ctime)
            except Exception as e:
                logging.error('Could not gather data for file: ' + meta_data['file_path'] + e.__str__())
            meta_data['custodian'] = custodian_id
            columns = '","'.join(meta_data.keys())
            values = '","'.join(str(x).encode('string_escape') for x in meta_data.values())
            sql = 'INSERT INTO Files ("' + columns + '") VALUES ("' + values + '")'
            cur.execute(sql)
            count += 1

The line that is giving me errors is this:

values = '","'.join(str(x).encode('string_escape') for x in meta_data.values())

This line is meant to handle any string escape characters found in metadata.values before writing the data to the database.

When I tried to run this code in Python 3 I got an error about an unrecognized codec. I did some research hear on Stack Overflow and found that string_escape has been replaced with unicode-escape in Python 3.

I am fairly new to Python 3 and Unicode. My question is this:

How do I update the line above so that it uses unicode-escape instead of string_escape and produces the same result as the Python 2.7 code?

Any help would be appreciated! I have been working on this for several days now and every solution I try just results in more error codes or corrupted output files.

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
  • 5
    Oh, that's beautiful. Plopping data from a hostile source right into an SQL string. This practically screams "SQL injection". – melpomene Sep 10 '18 at 19:37
  • @melpomene: if this is a console script then that's fine; the user could just write directly to the sqlite database file *anyway*. – Martijn Pieters Sep 10 '18 at 19:39
  • I'm wondering why you need to decode escapes *in the first place*. Using `string_escape` / `unicode_escape` is a work-around for problems elsewhere, not a tool you'd use when you have full control over your data. – Martijn Pieters Sep 10 '18 at 19:40
  • @MartijnPieters The book is called *Learning Python for Forensics*, so I'm assuming this script is meant to analyze files from dubious sources. – melpomene Sep 10 '18 at 19:47
  • @melpomene: that doesn't mean that the practice shown here was recommended by that book, or perhaps only used as a stepping stone. The script analyses a local directory and sets the values from OS-supplied data. I don't see much opportunity for injections (only the filename could be used for that purpose). – Martijn Pieters Sep 10 '18 at 20:08

1 Answers1

2

You are generating SQL in that piece of code, and escaping was there to try and produce valid SQL. It's a very poor-man's attempt at avoiding SQL injection. It's not very effective, and isn't needed as the database driver already knows how to take care of this, in a much safer fashion!

For SQL databases, the correct way to put the values into SQL parameters. SQL parameters consist of two components: placeholders, and values passed in separately to the .execute() method for the database to cleanly handle. The sqlite3 library is no exception, see the cursor.execute() method for details. For your case, you can use named placeholders:

columns = [f'''"{name.replace('"', '""')}"''' for name in meta_data]
placeholders = [f':{name}' for name in meta_data]
sql = f'INSERT INTO Files ({", ".join(columns)}) VALUES ({", ".join(placeholders)})'    
cur.execute(sql, meta_data)

Note that meta_data is passed as a second argument; the database takes each :name placeholder and takes the value for that placeholder from the meta_data dictionary.

I also properly formatted the column names, by putting double quotes around them and doubling any " characters in the name; see the SQLite keyword documentation:

'keyword'       A keyword in single quotes is a string literal.
"keyword"       A keyword in double-quotes is an identifier.

Your code has hardcoded those column names and none of them are reserved SQL keywords, so they don't really need this protection, but it is still good practice.

For your code, where meta_data has a fixed number of keys, the above builds this sql string:

>>> columns = [f'''"{name.replace('"', '""')}"''' for name in meta_data]
>>> placeholders = [f':{name}' for name in meta_data]
>>> sql = f'INSERT INTO Files ({", ".join(columns)}) VALUES ({", ".join(placeholders)})'
>>> from pprint import pprint
>>> pprint(sql)
('INSERT INTO Files ("file_name", "file_path", "extension", "mode", "inode", '
 '"file_size", "atime", "mtime", "ctime") VALUES (:file_name, :file_path, '
 ':extension, :mode, :inode, :file_size, :atime, :mtime, :ctime)')

I'd also change the way you log the error, instead of

logging.error('Could not gather data for file: ' + meta_data['file_path'] + e.__str__())

I'd use

logging.exception('Could not gather data for file: %s', meta_data['file_path'])

and leave the error gathering to the logging framework. Even if you do include the exception object, use str(e) or a %s placeholder.

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343