
For project documentation purposes, I am trying to compare the performance of Python's hashlib with similar implementations in Rust and C++. I do not know much about C++ or Rust, so there is a chance that my code is not optimized correctly. I have noticed that I get better MD5 hashing performance from Python's hashlib than from Crypto++ and RustCrypto. The purpose of the programs below is to hash a file's contents line by line. The code was tested on rockyou.txt, a word list that contains 14344392 lines.

Python

In Python I am getting good performance: it is the fastest of the three, ahead of both RustCrypto and Crypto++.

Code

import hashlib
import sys


def HASH_MD5():
    """Hash every line of the file given on the command line with MD5."""
    path = sys.argv[1]

    # errors="ignore" skips any bytes that cannot be decoded as UTF-8.
    with open(path, encoding="utf8", errors="ignore") as file:
        for line in file:
            # Strip surrounding whitespace (including the newline) before hashing.
            hashlib.md5(bytes(line.strip(), encoding="utf8")).hexdigest()
            # print(hashlib.md5(bytes(line.strip(), encoding="utf8")).hexdigest())


if __name__ == "__main__":
    HASH_MD5()

Using time on GNU/Linux to hash rockyou.txt, I got the following output:

real    0m10.489s
user    0m10.473s
sys     0m0.016s
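
To separate the cost of hashing from interpreter startup and file reading, the hashing loop can also be timed inside the script. The following is a minimal sketch, not the code that produced the numbers above: `load_lines` and `time_md5` are hypothetical helper names, and it assumes the file is read as raw bytes rather than decoded as UTF-8, so its input differs slightly from `line.strip()`.

```python
import hashlib
import sys
import time


def load_lines(path):
    """Read the file as raw bytes and split it into lines without line endings."""
    with open(path, "rb") as handle:
        return handle.read().splitlines()


def time_md5(lines):
    """Hash each bytes line with MD5; return (elapsed_seconds, line_count)."""
    start = time.perf_counter()
    count = 0
    for line in lines:
        hashlib.md5(line).hexdigest()
        count += 1
    return time.perf_counter() - start, count


if __name__ == "__main__":
    # Assumption: the word list path is passed as the first argument.
    lines = load_lines(sys.argv[1])
    elapsed, count = time_md5(lines)
    print(f"hashed {count} lines in {elapsed:.3f}s")
```

Loading the whole file first keeps disk I/O out of the timed region, at the cost of holding every line in memory.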
    

C++

Since I am not an expert in C++, I borrowed the code from the Crypto++ wiki. It is slower than Python, but still reasonable.

Code

#include <crypto++/cryptlib.h>
#define CRYPTOPP_ENABLE_NAMESPACE_WEAK 1
#include <crypto++/md5.h>
#include <crypto++/files.h>
#include <crypto++/hex.h>
#include <fstream>
#include <iostream>
#include <string>
//g++ hash.cpp -o b -lcryptopp

int main(int argc, char* argv[]) {
  if (argc < 2) {
    std::cerr << "expected a file path as the first argument" << std::endl;
    return 1;
  }

  std::ifstream file(argv[1]);
  std::string str;

  // std::getline drops the trailing '\n' from each line.
  while (std::getline(file, str)) {
    byte digest[ CryptoPP::Weak::MD5::DIGESTSIZE ];

    CryptoPP::Weak::MD5 hash;
    hash.CalculateDigest( digest, (const byte*)str.c_str(), str.length() );

    // Hex-encode the digest; the encoder takes ownership of the StringSink.
    CryptoPP::HexEncoder encoder;
    std::string output;

    encoder.Attach( new CryptoPP::StringSink( output ) );
    encoder.Put( digest, sizeof(digest) );
    encoder.MessageEnd();

    //std::cout << output << std::endl;
  }
}

Using time I got the following output:

real    0m36.225s
user    0m36.203s
sys     0m0.021s

Rust

I barely know anything about Rust, so I used the RustCrypto docs and this (to read the file line by line). Rust is known for its performance, so I could not understand why it took so much time.

Code

use std::env;
use std::fs::File;
use std::io::{BufRead, BufReader};
use md5::{Digest, Md5};

fn main() -> std::io::Result<()> {
    let args: Vec<String> = env::args().collect();
    let filename = &args[1];

    // Open the file in read-only mode; `?` propagates any error to the caller.
    let file = File::open(filename)?;
    let mut reader = BufReader::new(file);
    let mut buf = vec![];

    // read_until keeps the trailing b'\n' in buf, so the newline is hashed too.
    while let Ok(_) = reader.read_until(b'\n', &mut buf) {
        if buf.is_empty() {
            break;
        }
        let mut hasher = Md5::new();
        hasher.update(&buf);
        hasher.finalize();
        //println!("Result: {:x}", hash);

        buf.clear();
    }

    Ok(())
}

Running the code above and using time I got the following output:

real    1m28.250s
user    1m28.087s
sys     0m0.108s
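
One caveat when comparing the three programs: they do not hash exactly the same bytes. The Python script calls `line.strip()`, `std::getline` drops the trailing newline, and Rust's `read_until` leaves the `\n` in the buffer. This should not explain the timing gap, but the digests will differ. A minimal sketch of the effect, using a made-up sample line and a hypothetical helper `md5_hex`:

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the hexadecimal MD5 digest of the given bytes."""
    return hashlib.md5(data).hexdigest()


raw = b"password\n"             # what read_until(b'\n', ...) leaves in the buffer
no_newline = raw.rstrip(b"\n")  # what std::getline yields for this line
stripped = raw.strip()          # roughly what line.strip() yields in the Python script

print(md5_hex(raw))
print(md5_hex(no_newline))
print(md5_hex(stripped))
```

For this sample the last two digests match each other, while the first differs because the newline is part of the hashed input.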

Conclusion

Why is the Python script above significantly faster than the Rust and C++ programs? Is this because of CPython's MD5 implementation, or are the Rust and C++ programs poorly written?

Grabinuo
  • Standard question: did you compile with `--release`? – trent Apr 10 '21 at 17:27
  • The Python code for md5 extension is probably written in c++ or c but certainly not in Python. Source code likely available. See https://stackoverflow.com/questions/59955854/what-is-md5-md5-and-why-is-hashlib-md5-so-much-slower – doug Apr 10 '21 at 17:30
  • @trentcl I used `--release`; it now takes `real 0m2.422s user 0m2.809s sys 0m0.086s` – Grabinuo Apr 10 '21 at 17:32
  • @doug It is in C. I added the link for its source in the question. – Grabinuo Apr 10 '21 at 17:32
  • The C++ code should be compiled with optimizations, as well: try `-O3` (to match Cargo's default in release mode, which is `-C opt-level=3`) – trent Apr 10 '21 at 17:35
  • @trentcl It did not make much difference in C++. Kindly, could you post your comment as an answer so I can mark it? – Grabinuo Apr 10 '21 at 17:39
  • I'm pretty sure this is a question that has been asked before, which I'd like to link to instead of duplicating the answer. If I can't find it I will write a new answer. – trent Apr 10 '21 at 17:43
