Most efficient way to hash each line of a text file?

Question

I'm currently writing a Bash script which hashes each line of a text file and writes the result to a new file in the format hash:originalword. The script I have at the moment to do this is:

cat $originalfile | while read -r line; do
    hash="$(printf %s "$line" | $hashfunction | cut -f1 -d' ')"
    echo "$hash:$line" >> $outputlocation
done

I originally got the code for this from a very similar question linked here. The script works exactly as advertised; however, the problem is that even for extremely small text files (<15KB) it takes a very long time to process.

I would really appreciate it if someone could suggest a script which achieves exactly the same outcome but does so far more efficiently.

Thank you in advance for any help,

Kind regards, John

Do you have to do it in `bash`? You should use a language that has a built-in `md5()` function, rather than having to run a program for each line. — Barmar, Jul 19 '18 at 00:13
@Barmar It could of course be in another language; it's just that Bash is virtually the only language I am well versed in. The other slight complication is that the script's "$hashfunction" is a variable which can be set to any hash command such as SHA1, MD5, whirlpool etc. That works fine at the moment but, as I said, it takes a very long time to process, which is why I am looking to optimise the current script. Which language would you recommend if not Bash? — Tom, Jul 19 '18 at 00:25
When efficiency is a concern, `bash` is rarely the right solution. — Barmar, Jul 19 '18 at 00:27
@Barmar Do they both have multiple hash functions built into them? Either way, as I mentioned, I am not comfortable in many languages other than Bash. Is there anything you can suggest to optimise the above script? Thank you for the suggestion of scripting it in another language, I will look into that as well :) — Tom, Jul 19 '18 at 00:29
I can't think of any way to optimize the script. The problem you're running into is that you have to start a new invocation of the hash program for every line. There's no way around that in `bash`. — Barmar, Jul 19 '18 at 00:31
@Barmar What sort of code would you suggest for a Python script, for optimum efficiency? Maybe I could work out some sort of crude "bodge" between the rest of the bash script and a Python function that could be called. Thank you once again. — Tom, Jul 19 '18 at 00:35
The Python script would simply replace everything you've written in the question. — Barmar, Jul 19 '18 at 00:39
You'd write `python scriptname.py < "$originalfile" >> "$outputlocation"` — Barmar, Jul 19 '18 at 00:39
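A minimal sketch of what such a Python script might look like, assuming Python 3 and its standard `hashlib` module. The file name `hashlines.py`, the function names, and passing the algorithm name as the first argument are illustrative choices, not something from the comments above. It hashes each line without its trailing newline, as `printf %s` does in the question, but unlike the `read -r` loop it leaves leading and trailing spaces in place.

```python
import hashlib
import sys


def hash_line(line: bytes, algorithm: str) -> bytes:
    """Return b"<hexdigest>:<line>" for one line (newline already removed)."""
    digest = hashlib.new(algorithm, line).hexdigest()
    return digest.encode("ascii") + b":" + line


def hash_stream(src, dst, algorithm: str) -> None:
    """Hash every line of the binary stream src and write the results to dst."""
    for raw in src:
        line = raw.rstrip(b"\n")  # hash the word itself, not the newline
        dst.write(hash_line(line, algorithm) + b"\n")


if __name__ == "__main__":
    if len(sys.argv) != 2:
        # Hypothetical usage; the algorithm name plays the role of $hashfunction.
        print("usage: hashlines.py ALGORITHM < input > output", file=sys.stderr)
    else:
        hash_stream(sys.stdin.buffer, sys.stdout.buffer, sys.argv[1])
```

It would then be called as `python3 hashlines.py sha256 < "$originalfile" >> "$outputlocation"`. Which names `hashlib.new` accepts depends on the Python build and the OpenSSL it links against; `hashlib.algorithms_available` lists them.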

Jon · Accepted Answer · 2018-07-19T12:43:03.417

4

I'd be very wary of doing this in pure shell. The overhead of starting up the hashing function for every line is going to make it really slow on a large file.
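To make that overhead concrete, here is a rough sketch in Python that hashes the same lines twice: once by starting a child process per line, as the shell loop does, and once inside a single process. Using the Python interpreter itself as the stand-in hash program, and the sample words, are assumptions for illustration; absolute timings will vary by machine.

```python
import hashlib
import subprocess
import sys
import time

# Stand-in for an external hash command: reads stdin, prints the SHA-256 hex digest.
CHILD = "import hashlib, sys; print(hashlib.sha256(sys.stdin.buffer.read()).hexdigest())"


def hash_per_process(lines):
    """Start one child process per line, like the shell loop in the question."""
    digests = []
    for line in lines:
        result = subprocess.run(
            [sys.executable, "-c", CHILD],
            input=line, capture_output=True, check=True,
        )
        digests.append(result.stdout.decode().strip())
    return digests


def hash_in_process(lines):
    """Hash every line inside the current process."""
    return [hashlib.sha256(line).hexdigest() for line in lines]


def timed(func, lines):
    """Run func(lines) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(lines)
    return result, time.perf_counter() - start


words = [b"word%d" % i for i in range(20)]
slow, slow_seconds = timed(hash_per_process, words)
fast, fast_seconds = timed(hash_in_process, words)
print(f"one process per line: {slow_seconds:.3f}s")
print(f"single process:       {fast_seconds:.6f}s")
```

Both functions produce the same digests; only the number of process start-ups differs.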

How about a short bit of Perl?

perl -MDigest::MD5 -nle 'print Digest::MD5::md5_hex($_), ":", $_' <"$originalfile" >>"$outputlocation"

Perl has a variety of Digest modules, so it is easy to use something less broken than MD5.

perl -MDigest::SHA -nle 'print Digest::SHA::sha256_hex($_), ":", $_' <"$originalfile" >>"$outputlocation"

If you want to use Whirlpool, you can install it from CPAN with

cpan install Digest::Whirlpool

and use it with

perl -MDigest -nle '$ctx = Digest->new("Whirlpool"); $ctx->add($_); print $ctx->hexdigest(), ":", $_' <"$originalfile" >>"$outputlocation"

edited Jul 19 '18 at 12:43

answered Jul 19 '18 at 08:48

Jon

Thanks for your response, I'll give that a shot later. How exactly would you implement other hash functions other than MD5 with that solution? – Tom Jul 19 '18 at 11:41
Perl has a load of different `Digest` modules. So for example, there is a `Digest::SHA` module which provides sha256_hex, etc. The docs are available at https://metacpan.org/pod/Digest::SHA. – Jon Jul 19 '18 at 12:20
Just tried this method and it's very efficient. I suppose the only drawback is that you are unable to use many of the built-in openssl hashing algorithms that come with OSX, but I suppose that's the compromise you make for efficiency. Thank you very much! – Tom Jul 19 '18 at 13:11
Sorry to bother you; your method works absolutely perfectly for me. However, I would also like to use some of the other algorithms listed on the website you linked such as MD2, MD4 and SHA2 but can't seem to get them to work even after downloading them with cpan. Would you be able to provide the format to use these algorithms? Thank you so much! P.S. do you know if there are any other algorithms available which use the same method which are not listed on that website? – Tom Jul 19 '18 at 13:46
No problems :-) Your best bet is to use the bare Digest module and just change the name of the hash in the `new` method. So for MD2, run `cpan install Digest::MD2` and then use `perl -MDigest -nle '$ctx = Digest->new("MD2"); $ctx->add($_); print $ctx->hexdigest(), ":", $_'`. The SHA module includes SHA-1, SHA-384, etc, so just specify those in `new`. The other module for hashes in `openssl` is `Digest::MD4`. I can't see one for MDC2. There is a module for RIPEMD160, `Crypt::Digest::RIPEMD160`, but it is a bit weird and doesn't work with `Digest`. The CPAN docs have examples though. – Jon Jul 19 '18 at 13:58
Thank you so much! You're a life saver! I've been able to do SHA384 etc but I noticed a separate page for SHA2: metacpan.org/pod/Digest::SHA2 . Would the MD4 (metacpan.org/pod/Digest::MD4) be the same format as the MD5 or the MD2? Once again, thank you so much :) – Tom Jul 19 '18 at 14:32
Not to worry, got both MD2 and SHA2 working. Can't thank you enough! Last question, do you know if there's a possibility that other hashes such as Tiger, gost-mac or md_gost might work with this method? I'm new to this whole Digest thing :) – Tom Jul 19 '18 at 14:41
CPAN has a search which turns up a surprising number of modules with `digest::` in the name :-) Have a go at modifying the search for gost and tiger. They appear to be in there. https://metacpan.org/search?q=digest%3A%3A – Jon Jul 19 '18 at 14:46
Perfect, your response was exactly what I was looking for! Thank you :) – Tom Jul 19 '18 at 14:59

score 2 · Answer 2 · answered Jul 19 '18 at 03:22

You could split the file into one file per line and do it in a single call:

$ cat > words.txt << EOF
> foo
> bar
> baz
> EOF
$ split --lines=1 words.txt 
$ sha256sum x*
b5bb9d8014a0f9b1d61e21e796d78dccdf1352f23cd32812f4850b878ae4944c  xaa
7d865e959b2466918c9863afca942d0fb89d7c9ac0c99bafc3749504ded97730  xab
bf07a7fbb825fc0aae7bf4a1177b2b31fcf8a3feeaf7092761e18c859ee52a9c  xac
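One thing to watch with this approach: split keeps each line's trailing newline, so the digests above are of "foo\n" rather than "foo", and they are paired with file names rather than the original words. A small Python check, assuming only the standard `hashlib` module, shows how that differs from the `printf %s` loop in the question:

```python
import hashlib

# What sha256sum sees in the file xaa: the word plus its newline.
with_newline = hashlib.sha256(b"foo\n").hexdigest()

# What printf %s "foo" would feed to the hash command: the word alone.
without_newline = hashlib.sha256(b"foo").hexdigest()

print(with_newline)
print(without_newline)
```

The first value matches the xaa line above; the second is what the hash:originalword format from the question would contain.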

answered Jul 19 '18 at 03:22

l0b0

Thank you for your answer! I'll give that a go in a minute, sounds like it's far more efficient. – Tom Jul 19 '18 at 11:40
Just tried this out and can't seem to get it to work. – Tom Jul 19 '18 at 13:11
