Why does Puppet (almost) always fail to write to my Gluster filesystem?

Question

I'm using Puppet to manage some files that are shared between servers, by way of the GlusterFS file system. (The specifics shouldn't matter, but in this case things like /etc/httpd/conf.d and /var/www/html are mounted over the network, via GlusterFS. This is on RHEL 6 servers, with Puppet 3.8 and Gluster 3.5.)

Puppet has no problems with files that are local to a given server, but when I try to create or update files on this shared filesystem, it almost never works. Puppet sees that a change needs to be made, but then the file fails the subsequent checksum check. Here's an example of Puppet trying (and failing) to create a file:

change from absent to file failed: File written to disk did not match checksum; discarding changes ({md5}990680e579211b74e3a8b58a3f4d9814 vs {md5}d41d8cd98f00b204e9800998ecf8427e)

Here's a similar example of a file edit:

change from {md5}216751de84e40fc247cb02da3944b415 to {md5}261e86c60ce62a99e4b1b91611c1af0e failed: File written to disk did not match checksum; discarding changes ({md5}261e86c60ce62a99e4b1b91611c1af0e vs {md5}d41d8cd98f00b204e9800998ecf8427e)

This doesn't always happen, but on my Gluster filesystems, I'd say it happens at least 90% of the time.

The latter checksum (d41d8...) is the checksum of an empty file. So I think this is what's happening: Puppet sees that the change needs to be made, and makes the change. But it checksums the file again before the write is committed, so it doesn't see that the change was successfully made, and so it rolls back.
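The empty-file claim can be confirmed without touching the filesystem at all. A minimal sketch using Ruby's standard digest library (the variable name is only illustrative):

```ruby
require 'digest'

# MD5 of zero bytes of input; an empty file hashes to the same value.
empty_md5 = Digest::MD5.hexdigest('')
puts empty_md5
```

The printed value should match the second checksum in both error messages above.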

Two questions, then. First: Does this seem plausible, and how do I test/confirm that this is the case? Second: Assuming this is what's happening, how do I prevent it? The first thing that comes to mind would be simply sleeping for a few hundred milliseconds after file change operations, but I don't immediately know if that's even possible, much less wise.

Are you deploying the same file on the Client and Server of GlusterFS? — 030, Dec 13 '15 at 12:12
Not in this case, no. On the clients, the files are going to /var/www/html and /etc/httpd/conf.d. On the server, the bricks are sourced from /data. Nothing in Puppet touches /data on the Gluster brick servers (or any part of the Gluster configuration, actually, I'm not yet smart enough to automate that). — David E. Smith, Dec 14 '15 at 23:30

score 1 · Answer 1 · edited Jun 11 '20 at 10:02

Concise

Puppet writes the new content to a temporary file, flushes it, and then checksums the file on disk. That checksum is compared with the checksum of the content Puppet intended to write. If the two differ, the write fails and the change is discarded.

Verbose

The error is raised by the following method, defined in file.rb:

  # Make sure the file we wrote out is what we think it is.
  def fail_if_checksum_is_wrong(path, content_checksum)
    newsum = parameter(:checksum).sum_file(path)
    return if [:absent, nil, content_checksum].include?(newsum)

    self.fail "File written to disk did not match checksum; discarding changes (#{content_checksum} vs #{newsum})"
  end

This method in turn calls sum_file, which resides in checksum.rb:

  def sum_file(path)
    type = digest_algorithm()
    method = type.to_s + "_file"
    "{#{type}}" + send(method, path).to_s
  end
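To make the dispatch in sum_file easier to follow, here is a simplified stand-in, not Puppet's actual class. ChecksumSketch and its two helper methods are assumptions written for illustration. Only the body of sum_file mirrors the snippet above.

```ruby
require 'digest'

# Hypothetical stand-in for Puppet's checksum parameter (illustration only).
class ChecksumSketch
  def initialize(type)
    @type = type # e.g. :md5 or :sha256
  end

  # Helpers named "<type>_file", so sum_file can find them by name.
  def md5_file(path)
    Digest::MD5.file(path).hexdigest
  end

  def sha256_file(path)
    Digest::SHA256.file(path).hexdigest
  end

  # Same shape as the method above: build the helper name from the
  # algorithm, call it with send, and prefix the result with "{type}".
  def sum_file(path)
    method = @type.to_s + '_file'
    "{#{@type}}" + send(method, path).to_s
  end
end
```

With type :md5, sum_file calls md5_file and returns a string in the same "{md5}..." form that appears in the error message.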

How is the checksum calculated?

The check is triggered from the write method, which also resides in file.rb:

  def write(property)
    remove_existing(:file)

    mode = self.should(:mode) # might be nil
    mode_int = mode ? symbolic_mode_to_int(mode, Puppet::Util::DEFAULT_POSIX_MODE) : nil

    if write_temporary_file?
      Puppet::Util.replace_file(self[:path], mode_int) do |file|
        file.binmode
        content_checksum = write_content(file)
        file.flush
        fail_if_checksum_is_wrong(file.path, content_checksum) if validate_checksum?
        if self[:validate_cmd]
          output = Puppet::Util::Execution.execute(self[:validate_cmd].gsub(self[:validate_replacement], file.path), :failonfail => true, :combine => true)
          output.split(/\n/).each { |line|
            self.debug(line)
          }
        end
      end
    else
      umask = mode ? 000 : 022
      Puppet::Util.withumask(umask) { ::File.open(self[:path], 'wb', mode_int ) { |f| write_content(f) } }
    end

    # make sure all of the modes are actually correct
    property_fix
  end

The expected checksum is the return value of write_content, captured by content_checksum = write_content(file):

  # write the current content. Note that if there is no content property
  # simply opening the file with 'w' as done in write is enough to truncate
  # or write an empty length file.
  def write_content(file)
    (content = property(:content)) && content.write(file)
  end

The following snippet:

content_checksum = write_content(file)
file.flush
fail_if_checksum_is_wrong(file.path, content_checksum) if validate_checksum?

shows what the error means: the content Puppet intended to write differs from what it read back from disk immediately after the flush.
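The same write, flush, and re-read sequence can be tried outside Puppet to see whether the filesystem alone reproduces the mismatch. The following is a rough sketch, not Puppet's code. The helper write_and_verify and the PROBE_DIR environment variable are hypothetical names, and Tempfile.create assumes Ruby 2.1 or later.

```ruby
require 'digest'
require 'tempfile'
require 'tmpdir'

# Hypothetical helper: write content to a temp file in dir, flush, then
# immediately checksum the file by path, as the snippet above does.
# Returns [expected_md5, md5_read_back]. The temp file is removed afterwards.
def write_and_verify(dir, content)
  expected = Digest::MD5.hexdigest(content)
  Tempfile.create('checksum-probe', dir) do |file|
    file.binmode
    file.write(content)
    file.flush # empties Ruby's buffer into the OS; this is not an fsync
    [expected, Digest::MD5.file(file.path).hexdigest]
  end
end

# Assumption: PROBE_DIR points at a directory on the Gluster mount.
# Without it, the probe runs against the local temp directory.
dir = ENV.fetch('PROBE_DIR', Dir.tmpdir)
mismatches = 20.times.count do
  expected, actual = write_and_verify(dir, "probe #{rand(1_000_000)}\n")
  expected != actual
end
puts "#{mismatches} of 20 writes read back with a different checksum in #{dir}"
```

If mismatches show up on the Gluster mount but not on local storage, that would point at the filesystem rather than at Puppet.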

Discussion

The latter checksum (d41d8...) is the checksum of an empty file.

How did you check this?

So I think this is what's happening: Puppet sees that the change needs to be made, and makes the change. But it checksums the file again before the write is committed, so it doesn't see that the change was successfully made, and so it rolls back.

The code above always behaves as described, and in my experience the checksum check works reliably.

Conclusion

It looks like the problem lies with GlusterFS, e.g. the file that was deployed using Puppet was changed by GlusterFS for some reason.

Suggestion

I suggest debugging the issue as follows:

1. Define file 1 with content X in Puppet
2. Deploy this file to GlusterFS using Puppet
3. Manually check the checksum of file 1 on the Puppet server
4. Manually check the checksum of file 1 on GlusterFS
5. Run Puppet against the GlusterFS mount again and check whether the issue occurs
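For the manual checksum steps, a small script can print checksums in the same "{md5}..." form Puppet uses in its messages, which makes them easy to compare by eye. This is only a sketch, and puppet_style_md5 is a hypothetical helper name.

```ruby
require 'digest'

# Hypothetical helper: format a file's MD5 the way Puppet prints it.
def puppet_style_md5(path)
  "{md5}#{Digest::MD5.file(path).hexdigest}"
end

# Print one line per path given on the command line,
# skipping anything that is not a regular file.
ARGV.select { |path| File.file?(path) }.each do |path|
  puts "#{puppet_style_md5(path)}  #{path}"
end
```

Run it once on the Puppet server copy and once on the copy under the Gluster mount. The two lines should show the same checksum.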

This seems to confirm my hypothesis. `write_content` opens the file for writing, which implicitly truncates it (yielding the d41d8... checksum), writes the file, then immediately rechecks with `fail_if_checksum_is_wrong`. I got that checksum from simply creating and testing an empty file: `[davsmi@wuis1130 ~]$ rm -f foo && touch foo && md5sum foo d41d8cd98f00b204e9800998ecf8427e foo` So, is there a way to introduce a pause between file.flush and the subsequent fail_if_checksum_is_wrong? (Doesn't look like it.) — David E. Smith, Dec 15 '15 at 15:45
I'm already pretty confident in Puppet itself -- I have instances where the same content is deployed to a test environment (where /var/www/html is local storage), and to stage/prod environments (where /var/www/html is a network-mounted Gluster drive). Works on test, not on stage/prod. And if I manually create the files on the Gluster drives, Puppet subsequently sees that the checksums are correct, and doesn't attempt to modify them in any way. — David E. Smith, Dec 15 '15 at 15:49