
I'm working with .mat files which are saved at the end of a program. The command is save foo.mat so everything is saved. I'm hoping to determine if the program changes by inspecting the .mat files. I see that from run to run, most of the .mat file is the same, but the field labeled __function_workspace__ changes somewhat.

(I am inspecting the .mat files via scipy.io.loadmat -- just loading the files and printing them out as plain text and then comparing the text. I found that save -ascii in Matlab doesn't put string labels on things, so going through Python is roundabout, but I get labels and that's useful.)

I am trying to determine from where these changes originate. Can anyone explain what __function_workspace__ contains? Why would it not be the same from one run of a given program to the next?

The variables I am really interested in are the same, but I worry that I might be overlooking some changes that might come back to bite me. Thanks in advance for any light you can shed on this problem.

EDIT: As I mentioned in a comment, the value of __function_workspace__ is an array of integers. I looked at the elements of the array and it appears that these numbers are ASCII or non-ASCII character codes. I see runs of characters which look like names of variables or functions, so that makes sense. But there are also some characters (non-ASCII) which don't seem to be part of a name, and there are a lot of null (zero) characters too. So aside from seeing names of things in __function_workspace__, I'm not sure what that stuff is exactly.
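One way to see which names are embedded in that array is to pull out the runs of printable ASCII, much like the Unix `strings` tool does. The following is only a sketch: the helper name `printable_runs` and the minimum run length of 4 are illustrative choices, not anything defined by SciPy or MATLAB.

```python
def printable_runs(codes, min_len=4):
    """Return runs of printable ASCII found in a sequence of byte values."""
    runs = []
    current = []
    for code in codes:
        code = int(code)
        if 32 <= code < 127:
            # Printable ASCII: extend the current run.
            current.append(chr(code))
            continue
        # Null or non-ASCII value: close the current run if it is long enough.
        if len(current) >= min_len:
            runs.append("".join(current))
        current = []
    if len(current) >= min_len:
        runs.append("".join(current))
    return runs
```

With a dictionary `mat` returned by scipy.io.loadmat, the array would need to be flattened first, for example `printable_runs(mat["__function_workspace__"].ravel())`. Comparing the resulting lists from two runs may show which names differ.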

SECOND EDIT: I found that after commenting out calls to plotting functions, the content of __function_workspace__ is the same from one run of the program to the next, so that's great. At this point the only difference from one run to the next is that there is a __header__ field which contains a timestamp for the time at which the .mat file was created, which changes from run to run.
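A comparison that skips the volatile fields could be sketched roughly as follows. The function name `changed_keys` and its parameters are hypothetical; the only assumption about the input is that each run has been loaded into a dictionary, as scipy.io.loadmat does.

```python
import operator

_MISSING = object()  # sentinel for a key that is absent from one of the runs


def changed_keys(run_a, run_b, ignore=("__header__",), same=operator.eq):
    """List the keys whose values differ between two loaded .mat dictionaries.

    Keys named in `ignore` are skipped. `same` decides whether two values
    are equal and can be swapped for an array-aware comparison.
    """
    keys = (set(run_a) | set(run_b)) - set(ignore)
    changed = []
    for key in sorted(keys):
        a = run_a.get(key, _MISSING)
        b = run_b.get(key, _MISSING)
        if a is _MISSING or b is _MISSING or not same(a, b):
            changed.append(key)
    return changed
```

For dictionaries holding NumPy arrays, the default `==` is not suitable, so something like `changed_keys(loadmat("run1.mat"), loadmat("run2.mat"), same=numpy.array_equal)` would be needed (the file names are placeholders). That should work for plain numeric arrays; nested struct or cell arrays would probably need a recursive comparison. Adding `"__function_workspace__"` to `ignore` is possible too, at the cost of not noticing changes there.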

THIRD EDIT: I found an article, http://nbviewer.jupyter.org/gist/mbauman/9121961 "Parsing MAT files with class objects in them", about reverse-engineering the __function_workspace__ field. Thanks to Matt Bauman for this very enlightening article and thanks to @mpaskov for the pointer. It appears that __function_workspace__ is an undocumented catch-all for various stuff, only one part of which is actually a "function workspace".

Robert Dodier
  • Is there a reason why you are not inspecting the file in MATLAB? Function workspace is the 'local variable space/stack' for each function. If the actual data under `__function_workspace__` is a pointer to the stack location, that might change from run to run. What are typical values of `__function_workspace__`? – mpaskov Sep 18 '18 at 19:42
  • Thanks for your interest. I'm trying to determine automatically if there are differences between successive .mat files, and the way I see to do that is to compute a file diff between ASCII representations. I don't know a way to achieve that in Matlab. Not sure whether one should expect function-local stuff to change from run to run -- after all, the arguments are supposed to be the same, and therefore so are any local variables (there are no calls to `rand`). Values of `__function_workspace__` are long lists of integers. – Robert Dodier Sep 18 '18 at 20:13
  • What data types are the variables you want to compare? If you're determined to use a text-based comparison then perhaps you could load your data with `s = load('foo.mat')` (giving you a struct `s`) then use something like [struct2xml](https://uk.mathworks.com/matlabcentral/fileexchange/28639-struct2xml) to turn it into XML text, as long as that supports your data types. But if what you really want to do is just *determine if the program changes*, why not write a test using the MATLAB unit testing tools and simply check that the mat file contents are equal to a saved reference copy? – nekomatic Sep 20 '18 at 10:57
  • This [question](https://stackoverflow.com/questions/15512560/access-mat-file-containing-matlab-classes-in-python) might be relevant. – mpaskov Sep 20 '18 at 13:10
  • @nekomatic I agree that comparing to a saved .mat reference copy is a good idea. The problem is that the .mat files vary, but in a way that I would like to prove is inconsequential (namely the field `__function_workspace__` is not the same from run to run). The good news is that I've found that if I omit plotting functions, there are no changes in any fields except for `__header__`, which contains a file-creation timestamp. I still don't understand exactly what `__function_workspace__` contains, but understanding that is no longer so important. – Robert Dodier Sep 20 '18 at 17:00
  • @mpaskov Thanks for the pointer, I've updated my question to mention it. – Robert Dodier Sep 20 '18 at 17:10

1 Answer


1) Diffing .mat files

You may want to take a look at DiffPlug. It can do diffs of MAT files and I believe there is a command line interface for it as well.

2) Contents of __function_workspace__

SciPy's __function_workspace__ refers to a special variable at the end of a MAT file that contains extra data needed for reference types (e.g. table, string, handle, etc.) and various other stuff that is not covered by the official documentation. The name is misleading as it really refers to the "Subsystem" (briefly mentioned in the official spec as an offset in the header).
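To see where the header says the Subsystem starts, the 128-byte Level 5 header can be read directly. The sketch below follows the layout in the official MAT-file format document (116 bytes of descriptive text, an 8-byte subsystem data offset, a 2-byte version and a 2-byte endian indicator); the function name `parse_mat5_header` and the returned dictionary keys are illustrative, not part of any library.

```python
import struct


def parse_mat5_header(header):
    """Split the 128-byte header of a Level 5 MAT-file into its fields."""
    if len(header) < 128:
        raise ValueError("a Level 5 MAT-file header is 128 bytes long")

    # Bytes 0..115: human-readable description, including the creation time.
    text = header[:116].decode("latin-1").rstrip(" \x00")

    # Bytes 126..127: endian indicator. The bytes "IM" mean little-endian.
    mark = header[126:128]
    if mark == b"IM":
        order = "<"
    elif mark == b"MI":
        order = ">"
    else:
        raise ValueError("unrecognised endian indicator: %r" % mark)

    # Bytes 124..125: format version.
    version = struct.unpack(order + "H", header[124:126])[0]

    # Bytes 116..123: subsystem data offset; all zeros or all spaces
    # means that the file has no subsystem data.
    raw_offset = header[116:124]
    if raw_offset in (b"\x00" * 8, b" " * 8):
        subsystem_offset = None
    else:
        subsystem_offset = struct.unpack(order + "Q", raw_offset)[0]

    return {
        "text": text,
        "byte_order": order,
        "version": version,
        "subsystem_offset": subsystem_offset,
    }
```

Reading the first 128 bytes of a file opened in binary mode and passing them to this function should be enough. The descriptive text is also where the creation timestamp lives, which is why the header differs from run to run.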

For example, if you save a reference type, e.g., emptyString = "", the resulting .mat will contain the following two entries:

(1) The variable itself. It looks sort of like a UInt32 matrix, but is actually an Opaque MCOS Reference (MATLAB Class Object System) to a string object at some location in the subsystem.

[0] Compressed (81 bytes, position = 128)
  [0] Matrix (144 bytes, position = 0)
    [0] UInt32[2] = [17, 0] // Opaque
    [1] Int8[11] = ['emptyString'] // Variable Name
    [2] Int8[4] = ['MCOS'] // Object Type
    [3] Int8[6] = ['string'] // Class Name
    [4] Matrix (72 bytes, position = 72)
      [0] UInt32[2] = [13, 0] // UInt32
      [1] Int32[2] = [6, 1] // Dimensions
      [2] Int8[0] = [''] // Variable Name (not needed)
      [3] UInt32[6] = [-587202560, 2, 1, 1, 1, 1] // Data (Reference Target)

(2) A UInt8 matrix without name (SciPy renamed this to __function_workspace__) at the end of the file. Aside from the missing name it looks like a standard matrix, but the data is actually another MAT file (with a reduced header) that contains the real data.

[1] Compressed (251 bytes, position = 217)
  [0] Matrix (968 bytes, position = 0)
    [0] UInt32[2] = [9, 0] // UInt8
    [1] Int32[2] = [1, 920] // Dimensions
    [2] Int8[0] = [''] // Variable Name
    [3] ... 920 bytes ... // Data (Nested MAT File)
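The top-level layout shown in the two listings above can be reproduced with a short walk over the file. This is a sketch under a few assumptions: the file is little-endian by default, every top-level element uses the regular 8-byte tag (type, then byte count), and the next element starts right after the previous one's data, which is consistent with the positions above (128 + 8 + 81 = 217). The type codes 14 (miMATRIX) and 15 (miCOMPRESSED) come from the official format document; the function name is made up for illustration.

```python
import struct
import zlib

MI_MATRIX = 14      # miMATRIX type code from the MAT-file format document
MI_COMPRESSED = 15  # miCOMPRESSED type code


def list_top_level_elements(data, order="<"):
    """List position, type and size of each top-level element in a MAT-file."""
    elements = []
    pos = 128  # the data elements start right after the 128-byte header
    while pos + 8 <= len(data):
        mtype, nbytes = struct.unpack_from(order + "II", data, pos)
        payload = data[pos + 8 : pos + 8 + nbytes]
        inner = None
        if mtype == MI_COMPRESSED:
            # A compressed element wraps one zlib-deflated element;
            # read the tag of the element inside it.
            raw = zlib.decompress(payload)
            inner = struct.unpack_from(order + "II", raw, 0)
        elements.append(
            {"position": pos, "type": mtype, "nbytes": nbytes, "inner": inner}
        )
        pos += 8 + nbytes
    return elements
```

Called on the full contents of the example file (read in binary mode), this would be expected to report two compressed elements, each wrapping a matrix, with the nameless Subsystem matrix last. Decoding the matrix contents themselves is a separate step that this sketch does not attempt.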

The format of the data is unfortunately completely undocumented and somewhat of a mess. I could post the contents of the Subsystem, but it gets overwhelming even for such a simple case. In essence, it is a MAT file containing a struct, which contains a special variable (MCOS FileWrapper__), which in turn contains a cell array of various values, including one that magically encodes the object properties.

Matt Bauman has done some great reverse-engineering work ("Parsing MAT files with class objects in them") that I believe all supporting implementations are based on. The MFL Java library contains a full (read-only) implementation of this (see McosFileWrapper.java).

Some updates on Matt Bauman's post that we found are:

  • The MCOS reference can refer to an array of handle objects and may have more than 6 values. It contains sizing information followed by an array of indices (see McosReference.java).
  • The Object Id field looks like a unique id, but the order seems random and sometimes doesn't match. I don't know what this value is, but completely ignoring it seems to work well :)
  • I've seen Segment 5 populated in .fig files, but I haven't been able to narrow down what's in there yet.

Edit: FYI, once the string object is correctly parsed and all properties are filled in, the actual string value is encoded in yet another undocumented format (see testDoubleQuoteString).

Florian Enner