
Does anyone know how the checksum field in active_storage_blobs is calculated when using ActiveStorage on Rails 5.2+?

For bonus points, does anyone know how I can get it to use an md5 checksum that would match the one from the md5 CLI command?

Sergio Tulentsev
mvh

4 Answers


Let's Break It Down

I know I'm a bit late to the party, but this is for those who come across this question while searching for answers. So here it is.

Background:

Rails introduced loads of new features in version 5.2, one of which was ActiveStorage. The official final release came out on April 9th, 2018.

Disclaimer:

To be perfectly clear, the following information pertains to vanilla, out-of-the-box Active Storage. It doesn't account for custom code-fu built around one-off scenarios.

With that said, the checksum is calculated differently depending on your Active Storage setup. With the vanilla out-of-the-box Rails Active Storage, there are 2 "types" (for lack of a better term) of configuration.

  1. Proxy Uploads
  2. Direct Uploads

Proxy Uploads

File Upload Flow: [Client] → [RoR App] → [Storage Service]

Comm. Flow: Can vary but in most cases it should be similar to File upload flow.

The approach shown in Spark.Bao's answer is a "Proxy Upload": you upload the file to your RoR application, which can perform some sort of processing before sending the file on to your configured storage service (AWS, Azure, Google, Backblaze, etc.). Even if you set your storage service to the local disk, the logic still technically applies, even though your RoR application is then the storage endpoint.

A "Proxy Upload" approach isn't ideal for RoR applications that are deployed in the cloud on services like Heroku. Heroku has a hard limit of 30 seconds to complete your transaction and send a response back to your client (end user). So if your file is fairly large, you need to consider the time it takes for the file to upload, plus the time needed to calculate the checksum. If you're caught in a scenario where you can't complete the request and respond within those 30 seconds, you will need to use the "Direct Upload" approach.

Proxy Uploads Answer:

The Ruby class Digest::MD5 is used in the method compute_checksum_in_chunks(io) as pointed out by Spark.Bao.
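To make the chunked approach concrete, here is a minimal plain-Ruby sketch (no Rails required). The helper name `chunked_md5_base64` and the sample data are made up for illustration; the 5 MB default mirrors the chunk size used in the Rails source. It shows that feeding the data in chunks gives the same base64 digest as hashing everything at once, so the chunk size does not affect the stored checksum.

```ruby
require "digest"
require "stringio"

# Hypothetical helper (not part of Rails): reads the IO in chunks,
# feeds each chunk to a single MD5 digest, and returns the base64 form.
def chunked_md5_base64(io, chunk_size = 5 * 1024 * 1024)
  digest = Digest::MD5.new
  # IO#read(n) returns nil at end of input, which ends the loop.
  while (chunk = io.read(chunk_size))
    digest << chunk
  end
  io.rewind # leave the IO ready for the next reader
  digest.base64digest
end

data = "hello world" * 1000
chunked = chunked_md5_base64(StringIO.new(data), 1024) # tiny chunks on purpose
whole   = Digest::MD5.base64digest(data)

puts chunked == whole
```

Any object that responds to `read` and `rewind` (a `File`, a `Tempfile`, a `StringIO`) should work with this sketch.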


Direct Uploads

File Upload Flow: [Client] → [Storage Service]

Comm. Flow: [Client] → [RoR App] → [Client] → [Storage Service] → [Client] → [RoR App] → [Client]

Our fine friends who maintain and develop Rails have already done all the heavy lifting for us. I won't go into details on how to set up a direct upload, but the Rails Edge Guide covers it: Rails EdgeGuide - Direct Uploads.

Direct Uploads Answer:

Now with all that said, with a vanilla out-of-the-box "Direct Uploads" setup, a file checksum is calculated by leveraging SparkMD5 (JavaScript).

Below is a snippet from the Rails Active Storage source code (activestorage.js):

  var fileSlice = File.prototype.slice || File.prototype.mozSlice || File.prototype.webkitSlice;
  var FileChecksum = function() {
    createClass(FileChecksum, null, [ {
      key: "create",
      value: function create(file, callback) {
        var instance = new FileChecksum(file);
        instance.create(callback);
      }
    } ]);
    function FileChecksum(file) {
      classCallCheck(this, FileChecksum);
      this.file = file;
      this.chunkSize = 2097152;
      this.chunkCount = Math.ceil(this.file.size / this.chunkSize);
      this.chunkIndex = 0;
    }
    createClass(FileChecksum, [ {
      key: "create",
      value: function create(callback) {
        var _this = this;
        this.callback = callback;
        this.md5Buffer = new sparkMd5.ArrayBuffer();
        this.fileReader = new FileReader();
        this.fileReader.addEventListener("load", function(event) {
          return _this.fileReaderDidLoad(event);
        });
        this.fileReader.addEventListener("error", function(event) {
          return _this.fileReaderDidError(event);
        });
        this.readNextChunk();
      }
    }, // ... (snippet ends here; the rest of the class is omitted)

Conclusion

If there is anything I missed I do apologize in advance. I tried to be as thorough as possible.

To sum things up, the following should suffice as an acceptable answer:

  • Proxy Upload Configuration: The Ruby class Digest::MD5

  • Direct Upload Configuration: The JavaScript hash library SparkMD5.

user953533
  • What blob data is used in compute_checksum_in_chunks? If I wanted to compute new checksums that would match the ActiveStorage computed checksum for a file (without using DirectUpload), what do I need to provide? – johndisk Feb 24 '21 at 17:16

The source code is here: https://github.com/rails/rails/blob/6aca4a9ce5f0ae8af826945b272842dbc14645b4/activestorage/app/models/active_storage/blob.rb#L369-L377

def compute_checksum_in_chunks(io)
  Digest::MD5.new.tap do |checksum|
    while chunk = io.read(5.megabytes)
      checksum << chunk
    end

    io.rewind
  end.base64digest
end

In my project, I need this checksum value to determine whether a user has uploaded a duplicate file. I use the following code to get the same value as the method above:

md5 = Digest::MD5.file(params[:file].tempfile.path).base64digest
puts "========= md5: #{md5}"

The output:

========= md5: F/9Inmc4zdQqpeSS2ZZGug==

The matching record in the database:

pry(main)> ActiveStorage::Blob.find_by(checksum: 'F/9Inmc4zdQqpeSS2ZZGug==')
  ActiveStorage::Blob Load (2.7ms)  SELECT  "active_storage_blobs".* FROM "active_storage_blobs" WHERE "active_storage_blobs"."checksum" = $1 LIMIT $2  [["checksum", "F/9Inmc4zdQqpeSS2ZZGug=="], ["LIMIT", 1]]
=> #<ActiveStorage::Blob:0x00007f9a16729a90
id: 1,
key: "gpN2NSgfimVP8VwzHwQXs1cB",
filename: "15 Celebrate.mp3",
content_type: "audio/mpeg",
metadata: {"identified"=>true, "analyzed"=>true},
byte_size: 9204528,
checksum: "F/9Inmc4zdQqpeSS2ZZGug==",
created_at: Thu, 29 Nov 2018 01:38:15 UTC +00:00>
Erowlin
Spark.Bao
  • How did you implement checking for duplicates? I'm thinking of detecting a duplicate by its checksum and reusing the blob_id of the already uploaded file. Maybe I need to start a new topic. – Greg Feb 06 '20 at 05:04

It’s a base64-encoded MD5 digest of the blob’s data. I’m afraid Active Storage doesn’t support hexadecimal checksums like those emitted by md5(1). Sorry!

George Claghorn

For your bonus question (and potentially also the main one):

You can convert the checksum from base64 to hex (like the md5(1) command supports) and back.

Converting a hexadecimal digest to base64 in Ruby:

require "base64"

# Pack the hex string into raw bytes, then base64-encode those bytes.
def hex_to_base64(hexdigest)
  Base64.strict_encode64([hexdigest].pack("H*"))
end

From base64 to hex:

# Decode to raw bytes, then render each byte as two lowercase hex digits.
def base64_to_hex(base64_string)
  Base64.decode64(base64_string).each_byte.map { |b| "%02x" % b }.join
end
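As a quick sanity check on the two conversions, here is a self-contained round-trip sketch. The helper names `md5_hex_to_base64` and `md5_base64_to_hex` are made up for this example, and it assumes a Ruby recent enough to have `String#unpack1`. It uses the MD5 of empty input, whose hex and base64 forms are well known, and lets `Digest::MD5` produce both forms as the reference.

```ruby
require "base64"
require "digest"

# Hypothetical helpers for illustration; both go through the raw 16 digest bytes.
def md5_hex_to_base64(hexdigest)
  Base64.strict_encode64([hexdigest].pack("H*"))
end

def md5_base64_to_hex(base64_digest)
  Base64.strict_decode64(base64_digest).unpack1("H*")
end

# Reference values computed by Ruby itself for empty input.
empty_hex = Digest::MD5.hexdigest("")    # the form md5(1) prints
empty_b64 = Digest::MD5.base64digest("") # the form Active Storage stores

puts md5_hex_to_base64(empty_hex) == empty_b64
puts md5_base64_to_hex(empty_b64) == empty_hex
```

Both forms encode the same 16 bytes, so converting a blob's stored checksum to hex should give a value you can compare directly against the output of the md5 command for the same file.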
Juice10