
We frequently run into the following problem on our backup server. I'll explain the setup and symptoms in the hope that somebody can clarify why this happens and how to fix it.

Setup details

We have a Dell R200 server connected to an EasyRAID Q16R-S3R3 RAID disk array via an LSI SAS2008 PCI card (linking the disk array to the head node). The EasyRAID holds eight disks, bound together into a single logical disk.

On the R200 we have the following disk configuration (we create the LVM volumes on the R200 server itself, not on the EasyRAID):

root@backupserver:/home/netsys# pvs
  PV         VG   Fmt  Attr PSize PFree
  /dev/sdc   vg0  lvm2 a-   5.46t 1.03t
root@backupserver:/home/netsys# vgs
  VG   #PV #LV #SN Attr   VSize VFree
  vg0    1   9   0 wz--n- 5.46t 1.03t
root@backupserver:/home/netsys# lvs
  LV                  VG   Attr   LSize    Origin Snap%  Move Log Copy%  Convert
  lv0vm               vg0  -wi-ao 1000.00g
  lv0vm2              vg0  -wi-a-  100.00g
  lv1data             vg0  -wi-ao 1000.00g
  lv1databackup       vg0  -wi-ao 1000.00g
  lv1dataold20120903  vg0  -wi-a- 1000.00g
  lv2ceres            vg0  -wi-ao  200.00g
  lv2ceresold20121022 vg0  -wi-a-  100.00g
  lv3iso              vg0  -wi-ao   34.00g
  lv4svn              vg0  -wi-ao  100.00g
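As a quick sanity check, the allocated LV sizes are consistent with what vgs reports (a plain-shell sketch; the GiB figures are taken from the listings above, and the small discrepancy is just vgs rounding to two decimal places):

```shell
# Sum of the LSize column from the lvs output above, in GiB.
used_gib=$((1000 + 100 + 1000 + 1000 + 1000 + 200 + 100 + 34 + 100))

# vgs reports 5.46t total and 1.03t free, i.e. roughly this many GiB used:
vgs_used_gib=$(awk 'BEGIN { printf "%d", (5.46 - 1.03) * 1024 }')

echo "lvs total: ${used_gib} GiB, vgs used: ~${vgs_used_gib} GiB"
```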

Every night around 22:00 we run rsnapshot from lv1databackup to lv1data (which holds the snapshots). Every time we run this setup we end up with the following errors in the logs:

May 20 22:15:20 backupserver kernel: [11777489.404269] EXT3-fs error (device dm-8): ext3_lookup: deleted inode referenced: 60891438
May 20 22:15:20 backupserver kernel: [11777489.406210] EXT3-fs error (device dm-8): ext3_lookup: deleted inode referenced: 60891429
May 20 22:15:20 backupserver kernel: [11777489.407835] EXT3-fs error (device dm-8): ext3_lookup: deleted inode referenced: 60891431
May 20 22:15:20 backupserver kernel: [11777489.409474] EXT3-fs error (device dm-8): ext3_lookup: deleted inode referenced: 60891430
May 20 22:15:21 backupserver kernel: [11777489.422835] EXT3-fs error (device dm-8): ext3_lookup: deleted inode referenced: 60891523
May 20 22:15:21 backupserver kernel: [11777489.424514] EXT3-fs error (device dm-8): ext3_lookup: deleted inode referenced: 60891533
May 20 22:15:21 backupserver kernel: [11777489.426153] EXT3-fs error (device dm-8): ext3_lookup: deleted inode referenced: 60891524

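To track how many distinct inodes get flagged per run, the affected inode numbers can be pulled out of the kernel log with a short pipeline (a sketch; the sample below reuses two lines from the log above, and on the live system you would read from /var/log/kern.log instead):

```shell
# Two sample lines copied from the kernel log above; replace with
# e.g. `grep 'EXT3-fs error' /var/log/kern.log` on the real system.
sample='May 20 22:15:20 backupserver kernel: [11777489.404269] EXT3-fs error (device dm-8): ext3_lookup: deleted inode referenced: 60891438
May 20 22:15:20 backupserver kernel: [11777489.406210] EXT3-fs error (device dm-8): ext3_lookup: deleted inode referenced: 60891429'

# Keep only the inode numbers, sorted and de-duplicated.
inodes=$(printf '%s\n' "$sample" \
  | grep -o 'deleted inode referenced: [0-9]*' \
  | awk '{ print $4 }' \
  | sort -nu)

printf '%s\n' "$inodes"
```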
Running e2fsck fixes those errors, but two or three days later they are back. We then simply recreate the logical volume and start all over again, which is hardly a stable backup system.
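The manual recovery cycle can be sketched roughly as below (the device and mount paths are assumed from the listings above, with dm-8 being lv1data; DRY_RUN=1 makes the script only print the commands so they can be reviewed before running against the real volume):

```shell
# Dry-run sketch of the e2fsck repair cycle described above.
# Set DRY_RUN=0 to actually execute against the unmounted volume.
DRY_RUN=1

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"          # just show what would be run
  else
    "$@"
  fi
}

LV=/dev/mapper/vg0-lv1data   # assumed: the dm-8 device from the logs
MNT=/mnt/lv1data

run umount "$MNT"
run e2fsck -f -y "$LV"       # -f forces a full check, -y answers yes to fixes
run mount "$LV" "$MNT"
```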

Why do we get these EXT3-fs errors and what is wrong with our setup?

Below is extra information that may help.

tune2fs -l on the rsnapshot source volume (lv1databackup)

root@backupserver:/home/netsys# tune2fs -l /dev/mapper/vg0-lv1databackup
tune2fs 1.42 (29-Nov-2011)
Filesystem volume name:   
Last mounted on:          
Filesystem UUID:          c150d0c9-cc31-41ab-85a5-3d63b79d0076
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              65536000
Block count:              262144000
Reserved block count:     0
Free blocks:              143705208
Free inodes:              64168616
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      961
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
RAID stride:              128
RAID stripe width:        128
Filesystem created:       Thu Sep  6 13:03:04 2012
Last mount time:          Fri Jan  4 17:49:01 2013
Last write time:          Fri Jan  4 17:49:01 2013
Mount count:              6
Maximum mount count:      27
Last checked:             Wed Dec 12 15:03:33 2012
Check interval:           15552000 (6 months)
Next check after:         Mon Jun 10 16:03:33 2013
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:              256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      911d0866-e924-4069-8ce5-c945fbb6ee27
Journal backup:           inode blocks

tune2fs -l on the rsnapshot destination volume (lv1data)

root@backupserver:/home/netsys# tune2fs -l /dev/mapper/vg0-lv1data
tune2fs 1.42 (29-Nov-2011)
Filesystem volume name:   
Last mounted on:          
Filesystem UUID:          c91740f4-17df-4518-9ef1-ba36b7820870
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery sparse_super large_file
Filesystem flags:         signed_directory_hash
Default mount options:    (none)
Filesystem state:         clean with errors
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              65536000
Block count:              262144000
Reserved block count:     0
Free blocks:              127616425
Free inodes:              63661979
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      961
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         8192
Inode blocks per group:   512
RAID stride:              128
RAID stripe width:        128
Filesystem created:       Tue Sep  4 14:20:00 2012
Last mount time:          Mon Apr 29 16:49:09 2013
Last write time:          Tue May 21 06:52:48 2013
Mount count:              1
Maximum mount count:      23
Last checked:             Mon Apr 29 10:18:08 2013
Check interval:           15552000 (6 months)
Next check after:         Sat Oct 26 10:18:08 2013
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:              256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      74faab9b-739f-47dd-ba48-059e5b06829a
Journal backup:           inode blocks

Inode usage on the rsnapshot volume

root@backupserver:/home/netsys# df -i /mnt/lv1data/
Filesystem                Inodes   IUsed    IFree IUse% Mounted on
/dev/mapper/vg0-lv1data 65536000 1874021 63661979    3% /mnt/lv1data

modinfo on the driver for the LSI SAS2008

root@backupserver:/home/netsys# modinfo mpt2sas
filename:       /lib/modules/3.2.0-23-generic/kernel/drivers/scsi/mpt2sas/mpt2sas.ko
version:        10.100.00.00
license:        GPL
description:    LSI MPT Fusion SAS 2.0 Device Driver
author:         LSI Corporation 
srcversion:     C1D4E89BF318C53971B5113
alias:          pci:v00001000d0000007Esv*sd*bc*sc*i*
alias:          pci:v00001000d0000006Esv*sd*bc*sc*i*
alias:          pci:v00001000d00000087sv*sd*bc*sc*i*
alias:          pci:v00001000d00000086sv*sd*bc*sc*i*
alias:          pci:v00001000d00000085sv*sd*bc*sc*i*
alias:          pci:v00001000d00000084sv*sd*bc*sc*i*
alias:          pci:v00001000d00000083sv*sd*bc*sc*i*
alias:          pci:v00001000d00000082sv*sd*bc*sc*i*
alias:          pci:v00001000d00000081sv*sd*bc*sc*i*
alias:          pci:v00001000d00000080sv*sd*bc*sc*i*
alias:          pci:v00001000d00000065sv*sd*bc*sc*i*
alias:          pci:v00001000d00000064sv*sd*bc*sc*i*
alias:          pci:v00001000d00000077sv*sd*bc*sc*i*
alias:          pci:v00001000d00000076sv*sd*bc*sc*i*
alias:          pci:v00001000d00000074sv*sd*bc*sc*i*
alias:          pci:v00001000d00000072sv*sd*bc*sc*i*
alias:          pci:v00001000d00000070sv*sd*bc*sc*i*
depends:        scsi_transport_sas,raid_class
intree:         Y
vermagic:       3.2.0-23-generic SMP mod_unload modversions
parm:           logging_level: bits for enabling additional logging info (default=0)
parm:           max_sectors:max sectors, range 64 to 8192  default=8192 (ushort)
parm:           max_lun: max lun, default=16895  (int)
parm:           max_queue_depth: max controller queue depth  (int)
parm:           max_sgl_entries: max sg entries  (int)
parm:           msix_disable: disable msix routed interrupts (default=0) (int)
parm:           missing_delay: device missing delay , io missing delay (array of int)
parm:           mpt2sas_fwfault_debug: enable detection of firmware fault and halt firmware - (default=0)
parm:           disable_discovery: disable discovery  (int)
parm:           diag_buffer_enable: post diag buffers (TRACE=1/SNAPSHOT=2/EXTENDED=4/default=0) (int)

Kernel version

root@backupserver:/home/netsys# uname -a
Linux backupserver 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Distribution version

root@backupserver:/home/netsys# cat /etc/issue
Ubuntu 12.04 LTS \n \l

We don't use multipath...

Wouter Debie
  • Why should one "recreate the LVM" (and what does that mean?) when file system problems occur? Maybe this is just a race condition. Can you create a snapshot volume and use that as source for your backup? Which of the two devices is dm-8? – Hauke Laging May 21 '13 at 18:22
  • We've noticed that when we do an e2fsck on the filesystem it fixes the errors, but we get new errors within 2 weeks. When we recreate the LVM (i.e. lvremove /dev/vg0/lv1data && lvcreate), the setup stays clean for at least 4 weeks, up to 2 months. But since the errors return, there must be something fundamentally wrong. dm-8 is our rsnapshot LVM, i.e. lv1data... – Wouter Debie May 21 '13 at 19:55
  • Maybe our setup is completely wrong. We have 1T of data stored on machine A and want a history backup (i.e. day-1, day-2, day-3 ... day-7), possibly with rsnapshot, on machine B (in another building). That is the general goal I want to achieve. – Wouter Debie May 21 '13 at 19:58
  • Doesn't make sense to me at all. If you delete an LV and recreate it, then usually it covers the same area of the volume as before, especially if you do that several times. It might be interesting to have an strace dump of rsync (just the last 100 lines) in order to see what it does when this error is triggered. Maybe undetected hardware errors? Is there any reason for using ext3 instead of ext4? It would be interesting to use a completely different file system (btrfs, xfs) and see whether the problems disappear. – Hauke Laging May 21 '13 at 20:22
  • @HaukeLaging Interesting remarks! Why ext3? Dunno, actually; we are used to working with ext3 on our server systems. But I will indeed propose trying ext4 and see. As for the strace dump, I will try that as well. Thanks for your thoughts on this issue. – Wouter Debie May 21 '13 at 21:00
  • Currently it is running again, but now on an XFS filesystem. We'll see what this gives and report back. – Wouter Debie May 27 '13 at 14:32
