
Still getting my feet wet with Python, but my goal is to read a CSV file and hash a specific column using SHA-256, then output it in Base64.

Here is an example of the conversion that needs to take place, using the calculator at https://www.liavaag.org/English/SHA-Generator/

Here is the code I have currently

import hashlib
import csv
import base64

with open('File1.csv') as csvfile:

    with open('File2.csv', 'w') as newfile:

        reader = csv.DictReader(csvfile)

        for i, r in enumerate(reader):
            #  writing csv headers
            if i == 0:
                newfile.write(','.join(r) + '\n')

            # hashing the 'CardNumber' column
            r['consumer_id'] = base64.b64encode(hashlib.sha256(r['consumer_id']).encode('utf-8')).digest()
            
            # writing the new row to the file with hashed 'CardNumber'
            newfile.write(','.join(r.values()) + '\n')

The error I receive is

r['consumer_id'] = base64.b64encode(hashlib.sha256(r['consumer_id']).encode('utf-8')).digest()
TypeError: Strings must be encoded before hashing
Jeff Irwin
    Just move the close paren `)` after `r["consumer_id"]`: `base64.b64encode(hashlib.sha256(r['consumer_id'].encode('utf-8'))).digest()`. – MattDMo Feb 16 '23 at 14:53
  • After @MattDMo correction, you will also need to re-home `digest()` – JonSG Feb 16 '23 at 15:04
  • That returns a new error Traceback (most recent call last): File "c:\Elevate\HashCsv.py", line 64, in r['consumer_id'] = base64.b64encode(hashlib.sha256(r['consumer_id'].encode('utf-8'))).digest() File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.10_3.10.2800.0_x64__qbz5n2kfra8p0\lib\base64.py", line 58, in b64encode encoded = binascii.b2a_base64(s, newline=False) TypeError: a bytes-like object is required, not '_hashlib.HASH' – Jeff Irwin Feb 16 '23 at 15:06

2 Answers


You are on the right track; you just need to take it a step at a time before doing it all at once, to see how it pieces together:

import hashlib
import base64

text = "1234567890"
encoded = text.encode('utf-8')
encoded = hashlib.sha256(encoded).digest()
encoded = base64.b64encode(encoded)
print(text, str(encoded, encoding="utf-8"))

That should give you:

1234567890 x3Xnt1ft5jDNCqERO9ECZhqziCnKUqZCKreChi8mhkY=

As a "one-liner":

r['consumer_id'] = str(base64.b64encode(hashlib.sha256(r['consumer_id'].encode('utf-8')).digest()), encoding="utf-8")

As you can see, your current attempt is close; it just has a few misplaced parentheses to fix.
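To make that concrete, here is the call from your question next to the corrected one; the only difference is where each closing parenthesis lands:

# original: .encode() is called on the hash object, and .digest() on the base64 result
base64.b64encode(hashlib.sha256(r['consumer_id']).encode('utf-8')).digest()

# corrected: encode the string first, digest the hash, then base64-encode the digest
base64.b64encode(hashlib.sha256(r['consumer_id'].encode('utf-8')).digest())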

If you wanted to use this in a loop, say when iterating over a list of words or the rows of a csv you might do this:

import hashlib
import base64

def encode_text(text):
    encoded = text.encode('utf-8')
    encoded = hashlib.sha256(encoded).digest()
    encoded = base64.b64encode(encoded)
    return str(encoded, encoding="utf-8")

words = "1234567890 Hello World".split()
for word in words:
    print(word, encode_text(word))

Giving you:

1234567890 x3Xnt1ft5jDNCqERO9ECZhqziCnKUqZCKreChi8mhkY=
Hello GF+NsyJx/iX1Yab8k4suJkMG7DBO2lGAB9F2SCY4GWk=
World eK5kfcVUTSJxMKBoKlHjC8d3f7ttio8XAHRjo+zR1SQ=

Assuming the rest of your code works as you like, then:

import hashlib
import csv
import base64

def encode_text(text):
    encoded = text.encode('utf-8')
    encoded = hashlib.sha256(encoded).digest()
    encoded = base64.b64encode(encoded)
    return str(encoded, encoding="utf-8")

with open('File1.csv') as csvfile:

    with open('File2.csv', 'w') as newfile:

        reader = csv.DictReader(csvfile)

        for i, r in enumerate(reader):
            #  writing csv headers
            if i == 0:
                newfile.write(','.join(r) + '\n')

            # hashing the 'CardNumber' column
            r['consumer_id'] = encode_text(r['consumer_id'])
            
            # writing the new row to the file with hashed 'CardNumber'
            newfile.write(','.join(r.values()) + '\n')
JonSG
  • That's great for a single entry. I'm working with a CSV with thousands of entries – Jeff Irwin Feb 16 '23 at 15:09
  • I updated the answer with a function that you might use when iterating to convert text into your encoded result. – JonSG Feb 16 '23 at 15:16
  • You may be spot on, but to be honest I'm more confused now. I don't follow where I would call out the column – Jeff Irwin Feb 16 '23 at 15:21
  • `r['consumer_id'] = encode_text(r['consumer_id'])` according to what I see in your example code. – JonSG Feb 16 '23 at 15:24
  • alternatively, your one-liner is: `r['consumer_id'] = base64.b64encode(hashlib.sha256(r['consumer_id'].encode('utf-8')).digest())` – JonSG Feb 16 '23 at 15:30
  • You certainly are a pro at this. The error I'm getting with this is: "newfile.write(','.join(r.values()) + '\n') TypeError: sequence item 0: expected str instance, bytes found" – Jeff Irwin Feb 16 '23 at 15:36
  • If you want the string value rather than the bytes from base64 then you can add in a cast to `str()`. I'll update the answer to do that. The new one liner is in my answer – JonSG Feb 16 '23 at 16:21
  • I get the same result with the one-liner. I think your comment about str() is likely correct. – Jeff Irwin Feb 16 '23 at 16:29
  • That was the key. Well done sir, much appreciated! – Jeff Irwin Feb 16 '23 at 16:32
  • I updated my answer to cast the result of calling the one-liner or `encode_text()` to be a string rather than bytes if that helps – JonSG Feb 16 '23 at 16:33
  • The only issue with this output is the data writes as b'THEANSWER'. I'll need to remove the b'' – Jeff Irwin Feb 16 '23 at 16:51
  • @JeffIrwin Add an encoding ;-) `str(base64.b64encode(hashlib.sha256(r['consumer_id'].encode('utf-8')).digest()), encoding="utf-8")` – JonSG Feb 16 '23 at 17:01

In addition to JonSG's answer about getting the hashing/encoding correct, I'd like to comment on how you're reading and writing the CSV files.

It took me a minute to understand how you're dealing with the header vs the body of the CSV here:

with open("File1.csv") as csvfile:
    with open("File2.csv", "w") as newfile:
        reader = csv.DictReader(csvfile)
        for i, r in enumerate(reader):
            print(i, r)
            if i == 0:
                newfile.write(",".join(r) + "\n")  # writing csv headers
            newfile.write(",".join(r.values()) + "\n")

At first, I didn't realize that calling join() on a dict would just give back the keys; then you move on to join the values. That's clever!
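If that behavior is new to you, here is a quick sketch (the dict is just a stand-in for one of your rows):

row = {"ID": "1234", "Phone": "123-456-7890"}
print(",".join(row))           # ID,Phone  <- joining a dict yields its keys
print(",".join(row.values()))  # 1234,123-456-7890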

I think it'd be clearer, and easier, to use the complementary DictWriter.

For clarity, I'm going to separate the reading, processing, and writing:

with open("File1.csv", newline="") as f_in:
    reader = csv.DictReader(f_in, skipinitialspace=True)
    rows = list(reader)


for row in rows:
    row["ID"] = encode_text(row["ID"])
    print(row)


with open("File2.csv", "w", newline="") as f_out:
    writer = csv.DictWriter(f_out, fieldnames=rows[0])
    writer.writeheader()
    writer.writerows(rows)

In your case, when you create your writer you'll need to give it the fieldnames. I just passed in the first row, and the DictWriter() constructor used the keys from that dict to establish the header values. You need to explicitly call the writeheader() method; then you can write your (processed) rows.

I started with this File1.csv:

ID, Phone, Email
1234680000000000, 123-456-7890, johnsmith@test.com

and ended up with this File2.csv:

ID,Phone,Email
tO2Knao73NzQP/rnBR5t8Hsm/XIQVnsrPKQlsXmpkb8=,123-456-7890,johnsmith@test.com

That organization means all your rows are read into memory first. You mentioned having "thousands of entries", but for those 3 fields that's only a few dozen bytes per row, so even a few thousand rows amounts to a few hundred KB of RAM, maybe a MB.

If you do want to "stream" the data through, you'll want something like:

reader = csv.DictReader(f_in, skipinitialspace=True)
writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)

writer.writeheader()

for row in reader:
    row["ID"] = encode_text(row["ID"])
    writer.writerow(row)

In this example, I passed reader.fieldnames to the fieldnames= param of the DictWriter constructor.

For dealing with multiple files, I'll just open and close them myself, because the multiple with open(...) as x can look cluttered to me:

f_in = open("File1.csv", newline="")
f_out = open("File2.csv", "w", newline="")

...

f_in.close()
f_out.close()

I don't see any real benefit to the context managers for these simple utility scripts: if the program fails, the files will still be closed automatically when the process exits.

But the conventional wisdom is to use the with open(...) as x context managers, like you were. You can nest them, like you were, separate them with a comma, or, if you have Python 3.10+, use grouping parentheses for a cleaner look.
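For example, a minimal sketch of the 3.10+ parenthesized form:

with (
    open("File1.csv", newline="") as f_in,
    open("File2.csv", "w", newline="") as f_out,
):
    ...  # read, process, and write here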

Zach Young