
I have a BareOS installation with very little modification to the default config files. Full, Incremental, and Differential backups are being performed, and most clients appear to be backed up as expected.

However, one of my clients appears to be repeatedly backing up over 10% of the overall filesystem in every incremental cycle.

How do I find the largest files and folders that are being backed up repeatedly?

BAT does not appear to be very helpful here, since it only lists the size of each file node itself rather than whole folder sizes. I'm effectively looking for a du command that works within the BareOS framework for a specific backup job.

3 Answers


Please note that an important update has been added at the end of the first part.


Unfortunately, while it's very easy to check exactly what happened in a particular backup, it's not so easy to get the size of the backed-up files.

Let's dig deeper with some examples.

In my case, from bconsole I can get the list of last jobs with:

*list jobs
 Automatically selected Catalog: CatalogoBareos
 Using Catalog "CatalogoBareos"
 +-------+------------------------+---------------------+------+-------+------------+-------------------+-----------+
 | JobId | Name                   | StartTime           | Type | Level | JobFiles   | JobBytes          | JobStatus |
 +-------+------------------------+---------------------+------+-------+------------+-------------------+-----------+
 [...]
 | 7,060 | backup-XXXXXXXX        | 2016-01-02 16:00:50 | B    | I     |          2 |        74,225,116 | T         |
 | 7,062 | backup-YYY             | 2016-01-02 16:04:47 | B    | F     |    890,482 |   181,800,839,481 | T         |
 [...]
 +-------+------------------------+---------------------+------+-------+------------+-------------------+-----------+

From the above, you can see two jobs:

  • Job 7060: an "I"ncremental backup, involving 2 files for a total of 74 MB of data;
  • Job 7062: a "F"ull backup, involving 890,482 files, for a total of 181 GB of data.

Let's focus on Job 7060, as it's an incremental one, and check which files were backed up:

*list files jobid=7060 
 c:/allXXX_Backup/PLONE/Backup_Plone.bkf
 c:/allXXX_Backup/PLONE/
+-------+-----------------+---------------------+------+-------+----------+------------+-----------+
| JobId | Name            | StartTime           | Type | Level | JobFiles | JobBytes   | JobStatus |
+-------+-----------------+---------------------+------+-------+----------+------------+-----------+
| 7,060 | backup-XXXXXXXX | 2016-01-02 16:00:50 | B    | I     |        2 | 74,225,116 | T         |
+-------+-----------------+---------------------+------+-------+----------+------------+-----------+
*

So Job 7060 involved one file (Backup_Plone.bkf) and one directory (its containing folder).

Unfortunately, as you can see, the output of list files jobid=7060 does NOT include the file size you need, so it's useful but does not solve your problem.

Let's go one step further.

I've traveled all along the official BareOS documentation without finding a proper way to get file sizes from within bconsole. So I decided to take the heavy route: direct SQL access to the catalog.

Note: please be extremely careful when dealing with direct access to the catalog, as every improper action can lead to serious damage and related data loss!

Once connected to the DB engine (MySQL, in my case, but this is a detail: with PostgreSQL it's the same), I saw that backed-up file metadata are stored (among others) in:

  • the File table: it stores almost all the metadata, with the exception of...
  • the Filename table: it stores the filename of the backed-up file;
  • the Path table: it stores the full path of the backed-up file.

To my big surprise, I discovered that the File table does not include a size field, so it's not possible to get what we need with a simple query. However, I found an interesting LStat field (more on it later).

So I fired up the following SQL query:

select 
  f.JobId,f.LStat,f.MD5, fn.Name, p.Path
from
  Filename fn,
  File f,
  Path p
where
  f.JobId=7060 and
  fn.FilenameId = f.FilenameId and 
  p.PathId = f.PathId

and got back the following results:

+-------+------------------------------------------------------+------------------------+------------------+-------------------------+
| JobId | LStat                                                | MD5                    | Name             | Path                    |
+-------+------------------------------------------------------+------------------------+------------------+-------------------------+
|  7060 | A A IH/ B A A A EbJQA A A BWheFw BWheFw BTD/En A A L | 8ZuPGwdo9JYJileo+sVlfg | Backup_Plone.bkf | c:/all***_Backup/PLONE/ |
|  7060 | A A EH/ B A A A A A A BWhRjY BWgz4o BTD/En A A L     | 0                      |                  | c:/all***_Backup/PLONE/ |
+-------+------------------------------------------------------+------------------------+------------------+-------------------------+
2 rows in set (0.00 sec)

As for the LStat field, the official BareOS Developer Guide says:

> Column Name   Data Type   Remark
> [...]
> LStat         tinyblob    File attributes in base64 encoding

So, now, the problem is:

  • Does the LStat include the filesize?

and, as I would bet on a "YES! Definitely!":

  • How can the FileSize be retrieved from the LStat string?

A quick search for "BareOS LStat" led me to several results. In a few seconds I found this thread, with several comments about the LStat field, including a little Perl script to properly decode it. Here it is (*courtesy of Brian McDonald, 2005*), slightly modified to better suit your needs:

#!/usr/bin/perl

my $base64_digits =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
my $base64_values = { };
my $val = 0;
foreach my $letter ( split(//, $base64_digits ) )
    { $base64_values->{ $letter } = $val++; }

print "Please, enter your LSTAT string:";
my $lstat = <STDIN>;

my $vals = decode_stats($lstat);

print "Here are the results:\n";
foreach my $key (keys %{$vals}) {
   printf("%15s: %s\n",$key, $$vals{$key}) ;
}
exit;

sub decode_stats
{
  my $stats = shift;
  my $hash = { };

  # The stats data is in base64 format.  Yuck! - I mean Yay!
  # Each value is base64 encoded incorrectly, a deficiency we have
  # to correct here.  In particular, some values are encoded with a single
  # base64 character.  This results in a 6 bit value, and you have to
  # understand how bacula padded and manipulated those values before storing
  # them in the DB.

  # the fields are, in order:
  my @fields = qw(
    st_dev st_ino st_mode st_nlink st_uid st_gid st_rdev st_size
    st_blksize st_blocks st_atime st_mtime st_ctime LinkFI st_flags data
  );

  # Decoding this mess is based on reading src/lib/base64.c in bacula, in
  # particular the to_base64 function, which is how these got in the DB in
  # the first place.
  my $field_idx = 0;
  foreach my $element ( split( /\s+/, $stats ) )
  {
    my $result = 0;
    for ( my $i = 0; $i<length($element); $i++ )
    {
        if ($i) { $result <<= 6; }
        my $r = substr($element, $i, 1 );
        $result += $base64_values->{$r};
    }
    $hash->{ $fields[$field_idx] } = $result;
    $field_idx++;
  }

  return $hash;
}

When launched and given an LStat string, it reports lots of data, including the file size (the st_size field):

verzulli@iMac-Chiara:/tmp$ perl pp.pl 
Please, enter your LSTAT string:A A IH/ B A A A EbJQA A A BWheFw BWheFw BTD/En A A L
Here are the results:
     LinkFI: 0
   st_atime: 1451614576
     st_ino: 0
   st_flags: 0
   st_mtime: 1451614576
     st_dev: 0
   st_nlink: 1
 st_blksize: 0
  st_blocks: 0
       data: 11
     st_gid: 0
    st_mode: 33279
     st_uid: 0
    st_rdev: 0
   st_ctime: 1393553703
    st_size: 74224640

So now we have the file size but, unfortunately, it's not easily accessible in a single query for finding the biggest files of a single backup job.

Several solutions exist:

  • if you're running MySQL 5.6.1 or later, or a DBMS engine supporting Base64 encoding/decoding, you could query for a SUBSTR of the LStat and ask the DB engine to decode its value as Base64. For example, see here;

  • you could write a STORED PROCEDURE. Actually, one should already be present in PostgreSQL, as per this (which states: "...Added sample postgresql stored procedures for lstat field....");

  • you could write a little Perl script querying the catalog and doing the decoding itself;

  • ...
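As an illustration of the last scripted option, here is a minimal sketch, in Python rather than Perl. The catalog query mirrors the SQL shown above, but the connection parameters, the credentials, and the mysql-connector-python dependency are assumptions to adapt to your own setup:

```python
# Hypothetical sketch: list the largest files of one job by decoding
# the LStat column client-side. Credentials and driver are assumptions.
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def decode_field(token):
    """Decode one space-separated LStat field (big-endian base-64 digits)."""
    value = 0
    for ch in token:
        value = (value << 6) + B64.index(ch)
    return value

def largest_files(rows, top=10):
    """rows: iterable of (path, name, lstat) tuples.
    Returns the top-N (st_size, full_name) pairs, largest first."""
    sized = [(decode_field(lstat.split()[7]), path + name)  # field 8 = st_size
             for path, name, lstat in rows]
    return sorted(sized, reverse=True)[:top]

def main(jobid):
    # Requires a reachable MySQL catalog; not executed here.
    import mysql.connector  # assumption: pip install mysql-connector-python
    db = mysql.connector.connect(user="bareos", password="secret",
                                 database="bareos")
    cur = db.cursor()
    cur.execute("""SELECT p.Path, fn.Name, f.LStat
                   FROM File f
                   JOIN Filename fn ON fn.FilenameId = f.FilenameId
                   JOIN Path p ON p.PathId = f.PathId
                   WHERE f.JobId = %s""", (jobid,))
    for size, fullname in largest_files(cur.fetchall()):
        print(f"{size:>15,} {fullname}")
```

Feeding largest_files the two LStat strings from Job 7060 above yields 74224640 bytes for Backup_Plone.bkf, matching the st_size the Perl script printed.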

Hope this will be enough ;-)


Update 1

I've just discovered the existence of the BVFS API, explicitly "...intended mostly for developers who wish to develop a new GUI interface to Bareos...".

Those APIs provide a new set of commands (so-called "dot commands"), including an interesting .bvfs_lsfiles which shows some metadata on the console, including the LStat field. So:

  1. it's definitely possible to get the LStat field without direct access to the underlying DB/catalog.

Also, BareOS 15.2 introduced a new "API mode 2", adding support for JSON output. I've just tested that:

  2. with the v2 API enabled, the JSON objects returned by .bvfs_lsfiles contain the file-size field, properly decoded.

Here follows an example:

*.bvfs_update
Using Catalog "MyCatalog"
*.bvfs_lsfiles path=/var/log/2016/01/06/ jobid=79
107131  34080   3614785 79  P0A CCMR IGA B A A A H1V BAA BI BWjIkK BWjJAx BWjJAx A A C  shorewall.log
107131  34081   3614786 79  P0A CCMQ IGA B A A A BT1 BAA Q BWjIkK BWjI7p BWjI7p A A C   system.log
*.api 2
{
  "jsonrpc": "2.0",
  "id": null,
  "result": {
    "api": 2
  }
}*
*.bvfs_lsfiles path=/var/log/2016/01/06/ jobid=79
{
  "jsonrpc": "2.0",
  "id": null,
  "result": {
    "files": [
      {
        "type": "F",
        "stat": {
          "dev": 64768,
          "nlink": 1,
          "mode": 33152,
          "ino": 533265,
          "rdev": 0,
          "user": "root",
          "group": "root",
          "atime": 1452050698,
          "size": 32085,
          "mtime": 1452052529,
          "ctime": 1452052529
        },
        "pathid": 107131,
        "name": "shorewall.log",
        "fileid": 3614785,
        "filenameid": 34080,
        "jobid": 79,
        "lstat": "P0A CCMR IGA B A A A H1V BAA BI BWjIkK BWjJAx BWjJAx A A C",
        "linkfileindex": 0
      },
      {
        "type": "F",
        "stat": {
          "dev": 64768,
          "nlink": 1,
          "mode": 33152,
          "ino": 533264,
          "rdev": 0,
          "user": "root",
          "group": "root",
          "atime": 1452050698,
          "size": 5365,
          "mtime": 1452052201,
          "ctime": 1452052201
        },
        "pathid": 107131,
        "name": "system.log",
        "fileid": 3614786,
        "filenameid": 34081,
        "jobid": 79,
        "lstat": "P0A CCMQ IGA B A A A BT1 BAA Q BWjIkK BWjI7p BWjI7p A A C",
        "linkfileindex": 0
      }
    ]
  }
}*

So, in the end, with a recent version of BareOS, the original problem can be solved without direct access to the catalog.
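To round this off, the JSON output above is easy to post-process. Here is a small sketch that sorts a .bvfs_lsfiles reply by file size, du-style; the sample reply is trimmed down to just the fields used, from the jobid=79 transcript above:

```python
import json

def du_from_bvfs(json_text):
    """Return (size, name) pairs from a .bvfs_lsfiles JSON reply, largest first."""
    files = json.loads(json_text)["result"]["files"]
    return sorted(((f["stat"]["size"], f["name"]) for f in files), reverse=True)

# Trimmed-down reply for jobid=79, as in the transcript above:
sample = """{
  "jsonrpc": "2.0", "id": null,
  "result": { "files": [
    {"type": "F", "stat": {"size": 32085}, "name": "shorewall.log"},
    {"type": "F", "stat": {"size": 5365}, "name": "system.log"}
  ]}
}"""
for size, name in du_from_bvfs(sample):
    print(f"{size:>10} {name}")
```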

Damiano Verzulli

While I appreciate @damiano-verzulli's effort, a discussion in the BareOS IRC channel on Freenode elicited this response:

It turns out that Kjetil Torgrim Homme has already written a script to do this, called bacula-du. (Along with quite a few other useful scripts!)

They're all listed and obtainable from here:

http://heim.ifi.uio.no/kjetilho/hacks/

In particular, bacula-du is described as follows:

Usage: bacula-du [OPTIONS] -j JOBID 
       Summarize disk usage of directories included in the backup JOBID

Main options are:   
    -a, --all             write counts for all files, not just directories
    -S, --separate-dirs   do not include size of subdirectories   
    -t, --threshold=SIZE  skip output for files or directories with usage below SIZE.  default is 1 octet.   
    -L, --largest=NUM     only print NUM largest directories/files 

There is also an alternate mode which can be useful as a faster alternative to a verify job.

Usage: bacula-du --md5sum -j JOBID
       output list of all files in job in md5sum format

bacula-du (version 1.4)

There's one small note I have to add here: for this to work, the script (obviously) needs access to the database. In the default configuration, that uses a user-based security mechanism, so you have to run the command as the bareos user, e.g.:

$ sudo -u bareos ./bacula-du -j 1429
done reading database.
   807160 /log/
     6372 /var/openldap-data/
     6372 /var/
   813532 /admin/
...
119983392 /

Adding to Damiano Verzulli's useful answer, here is an SQL stored function that decodes the LStat field like his Perl code does.

It's not extremely efficient, because it re-defines its variables on each call. I expect it to be a full order of magnitude slower than native code for large numbers of files.


USE bacula;
-- Function to add (only once).

-- Field indices (1-based, space-separated) in the LStat string:
--  1: st_dev      (device)
--  2: st_ino      (inode)
--  3: st_mode     (Unix permissions)
--  4: st_nlink    (hard-link count)
--  5: st_uid      (UID)
--  6: st_gid      (GID)
--  7: st_rdev     (device type)
--  8: st_size     (size in bytes)
--  9: st_blksize  (block size)
-- 10: st_blocks   (allocated blocks)
-- 11: st_atime    (last access)
-- 12: st_mtime    (last modified)
-- 13: st_ctime    (last changed)
-- 14/15/16: LinkFI / st_flags / data



DELIMITER $$
-- Note: CREATE OR REPLACE FUNCTION is MariaDB syntax; on stock MySQL,
-- use DROP FUNCTION IF EXISTS followed by a plain CREATE FUNCTION.
CREATE OR REPLACE FUNCTION base64_decode_lstat (lsIndex INT, lstat VARCHAR(128)) RETURNS BIGINT
DETERMINISTIC
BEGIN
    DECLARE i,j,partlen INT;
    DECLARE alphabet BINARY(64);
    DECLARE part VARCHAR(64);
    DECLARE iOut BIGINT;

    SET part =  TRIM(SUBSTRING_INDEX(SUBSTRING_INDEX(lstat, ' ', lsIndex), ' ', -1));
    SET alphabet = BINARY 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/';
    SET i = 0;
    SET iOut = 0;   
    
    SET partlen = LENGTH(part);
    WHILE i < partlen DO
        SET j = partlen - i - 1;
        -- Note: SUBSTRING is not zero-indexed, so +1 as otherwise off-by-1. 
        -- Note 2: LOCATE is also not zero-indexed, so subtract 1 to fix off-by-1.
        -- Note 3: j is made zero-indexed so it runs from n-1 to 0.
        -- This does addition in base-64.
        SET iOut = iOut + ((LOCATE(SUBSTRING(part, i + 1, 1), alphabet) - 1) << (6 * j));
        SET i = i + 1;
    END WHILE;
    RETURN iOut;
END $$
DELIMITER ;

Example usage (this gets the Unix timestamp of the last-modified time, field 12, of the first file in the database):

SELECT base64_decode_lstat(12, (SELECT File.LStat FROM File LIMIT 1));
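For a quick sanity check of the arithmetic, the same decoding in a few lines of Python (field 12 is st_mtime in the LStat field order) reproduces the timestamp the Perl script in the first answer printed for the Job 7060 sample:

```python
B64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def base64_decode_lstat(index, lstat):
    """Python equivalent of the stored function: decode the index-th
    (1-based) space-separated field of an LStat string."""
    value = 0
    for ch in lstat.split()[index - 1]:
        value = (value << 6) + B64.index(ch)
    return value

# st_mtime (field 12) of the Job 7060 sample from the first answer:
print(base64_decode_lstat(12, "A A IH/ B A A A EbJQA A A BWheFw BWheFw BTD/En A A L"))
# -> 1451614576, matching the Perl decoder's output
```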
aphid