For project documentation purposes, I am trying to compare the performance of Python's hashlib to similar implementations in Rust and C++. I do not know much about C++ or Rust, so there is a chance that my code is not optimized correctly. I have noticed that I get better MD5 hashing performance with Python's hashlib than with Crypto++ and RustCrypto. The purpose of the programs below is to hash the contents of a file line by line. The code was tested on rockyou.txt; rockyou.txt is a word list that contains 14344392 lines.
Python
In Python I get the best performance of the three: it is faster than both RustCrypto and Crypto++.
Code
import hashlib
import sys


def HASH_MD5():
    """
    Uses MD5 Algorithm to hash files
    """
    file = sys.argv[1]
    # path,encoding = "utf8", errors = "ignore"
    # Ignores any encoding errors in lines
    with open(file, encoding="utf8", errors="ignore") as file:
        for line in file:
            hashlib.md5(bytes(line.strip(), encoding = "utf8")).hexdigest()
            #print(hashlib.md5(bytes(line.strip(), encoding = "utf8")).hexdigest())


if __name__ == "__main__":
    HASH_MD5()
Using time on GNU/Linux and hashing rockyou.txt, I got the following output:
real 0m10.489s
user 0m10.473s
sys 0m0.016s
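Note that time reports the cost of the whole process, which includes interpreter startup, reading the file, and UTF-8 decoding in addition to the MD5 calls. As a rough sketch for isolating the hashing cost, the lines could be loaded first and only the loop timed with time.perf_counter. The helper name time_md5_lines and the sample words are hypothetical and not part of the script above.

```python
import hashlib
import time


def time_md5_lines(lines):
    """Hash every line with MD5 and return (count, elapsed_seconds).

    Hypothetical helper: only the hashing loop is timed, not file I/O.
    """
    start = time.perf_counter()
    count = 0
    for line in lines:
        # Same per-line work as the script above: strip, encode, hash, hex-encode.
        hashlib.md5(line.strip().encode("utf8")).hexdigest()
        count += 1
    elapsed = time.perf_counter() - start
    return count, elapsed


if __name__ == "__main__":
    # Small in-memory sample; a real run would pass the lines of the word list.
    sample = ["password\n", "123456\n", "letmein\n"]
    count, elapsed = time_md5_lines(sample)
    print(f"hashed {count} lines in {elapsed:.6f}s")
```

Comparing this number with the real time above gives a rough idea of how much of the run is spent outside the hash function.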
C++
Since I am not an expert on C++, I decided to borrow the code from the Crypto++ wiki. It is slower than Python, but still reasonable.
Code
#include <crypto++/cryptlib.h>
#define CRYPTOPP_ENABLE_NAMESPACE_WEAK 1
#include <crypto++/md5.h>
#include <crypto++/files.h>
#include <crypto++/hex.h>
#include <fstream>
#include <iostream>
#include <string>
//g++ hash.cpp -o b -lcryptopp
int main(int argc, char* argv[]) {
    std::ifstream file(argv[1]);
    std::string str;
    while (std::getline(file, str)) {
        byte digest[ CryptoPP::Weak::MD5::DIGESTSIZE ];
        CryptoPP::Weak::MD5 hash;
        hash.CalculateDigest( digest, (const byte*)str.c_str(), str.length() );
        CryptoPP::HexEncoder encoder;
        std::string output;
        encoder.Attach( new CryptoPP::StringSink( output ) );
        encoder.Put( digest, sizeof(digest) );
        encoder.MessageEnd();
        //std::cout << output << std::endl;
    }
}
Using time, I got the following output:
real 0m36.225s
user 0m36.203s
sys 0m0.021s
Rust
I barely know anything about Rust, so I used the RustCrypto docs and this (to read the file line by line). Rust is known for its performance, so I could not really understand why it took so much time.
Code
use std::env;
use std::fs::File;
use std::io::{BufRead, BufReader};
use std::str;
use md5::{Md5, Digest};

fn main() -> std::io::Result<()> {
    let args: Vec<String> = env::args().collect();
    let filename = &args[1];
    // Open the file in read-only mode (ignoring errors).
    let file = File::open(filename)?;
    let mut reader = BufReader::new(file);
    let mut buf = vec![];
    while let Ok(_) = reader.read_until(b'\n', &mut buf) {
        if buf.is_empty() {
            break;
        }
        let mut hasher = Md5::new();
        hasher.update(&buf);
        hasher.finalize();
        //println!("Result: {:x}", hash);
        buf.clear();
    }
    Ok(())
}
Running the code above and using time, I got the following output:
real 1m28.250s
user 1m28.087s
sys 0m0.108s
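One thing to keep in mind when comparing the three programs is that they do not hash exactly the same bytes: the Python script calls strip() on each line, std::getline drops the trailing newline, and the Rust loop passes the buffer filled by read_until, which still contains the b'\n' delimiter. The sketch below, in Python to match the first program, only illustrates that the trailing newline changes the digest; the function name md5_hex is hypothetical.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hex MD5 digest of the given bytes (hypothetical helper)."""
    return hashlib.md5(data).hexdigest()


if __name__ == "__main__":
    stripped = md5_hex(b"password")        # what the Python script hashes
    with_newline = md5_hex(b"password\n")  # what a buffer with the delimiter hashes
    print(stripped)                        # 5f4dcc3b5aa765d61d8327deb882cf99
    print(with_newline)
    print(stripped != with_newline)        # True
```

This does not affect the timings much, but it matters if the digests from the three programs are ever compared against each other.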
Conclusion
Why is the Python script above significantly quicker than the Rust and C++ programs? Is this because of the CPython MD5 implementation? Or are the Rust and C++ programs written poorly?
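For the question about the CPython implementation, one small check is to look at which constructor hashlib.md5 is bound to. On many CPython builds it comes from the OpenSSL-backed _hashlib module rather than the built-in fallback, but that depends on how the interpreter was built, so treat the output of this sketch as informational only. The helper name md5_backend is hypothetical.

```python
import hashlib


def md5_backend() -> str:
    """Describe the callable behind hashlib.md5 (hypothetical helper).

    The exact string depends on the interpreter build; nothing here
    assumes a particular backend.
    """
    ctor = hashlib.md5
    name = getattr(ctor, "__name__", repr(ctor))
    module = getattr(ctor, "__module__", None) or "unknown"
    return f"{module}.{name}"


if __name__ == "__main__":
    print(md5_backend())
```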