We've got DB2 LUW running on a RHEL box. We had a DB2 crash, and IBM came back saying that a file DB2 was trying to access (through open64()) had unmounted or become invalid.

We have done nothing but restart the database and things seem to be running fine. Also, the file in question looks perfectly normal now:

$ cd /db/log/TEAMS/tmsinst/NODE0000/TEAMS/T0000000/
$ ls -l
total 557604
-rw------- 1 tmsinst tmsinst 570425344 Jan 14 10:24 C0000000.CAT
$ file C0000000.CAT 
C0000000.CAT: data
$ lsattr C0000000.CAT 
------------- C0000000.CAT
$ ls -l
total 557604
-rw------- 1 tmsinst tmsinst 570425344 Jan 14 10:24 C0000000.CAT

With those facts in hand (please correct me if I am misinterpreting the data at hand), what could cause a file system to 'spontaneously unmount or become invalid for a short time'?

What should my next step be?

This is on Dell hardware and we ran their diagnostic tools against the hardware and it came back clean.
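
For reference, querying the drives' own SMART status is a common complement to vendor diagnostics. A minimal sketch, assuming smartmontools is installed and /dev/sda is the device backing the database volume (the device name is a placeholder, adjust to the actual layout):

# Overall SMART health verdict for the drive (device name is an assumption)
$ sudo smartctl -H /dev/sda
# The drive's own error log, which records media and transport errors
$ sudo smartctl -l error /dev/sda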

Ichorus

1 Answer


My guess would be an underlying hardware issue, for example a drive disconnecting from and reconnecting to the bus. Examine /var/log/messages (and run dmesg) and look for unusual SCSI or SATA messages about disconnects, etc.
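
For example, something along these lines should surface most of the relevant kernel complaints (the grep patterns are illustrative, not exhaustive):

# Kernel ring buffer: device resets, link drops, I/O errors, forced remounts
$ dmesg | grep -iE 'scsi|sata|ata[0-9]|i/o error|remount|read-only'
# Persistent logs around the time of the crash (path is the RHEL default)
$ sudo grep -iE 'scsi|sata|i/o error|ext3' /var/log/messages*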

Phil Hollenback
  • That was also the first thing we thought... no relevant messages in dmesg or the logs. Also, no manual umounts in the history. – Ichorus Jan 14 '11 at 18:24
  • Because of this answer I took a second look at dmesg: There were a lot of similar messages about oom killer getting invoked (on a box with 96 GB RAM no less) so I missed this: – Ichorus Jan 14 '11 at 18:38
  • dsm_sa_datamgr3 invoked oom-killer: gfp_mask=0x201d2, order=0, oomkilladj=0 – Ichorus Jan 14 '11 at 18:39
  • Call Trace: [] out_of_memory+0x8e/0x2f3 [] __alloc_pages+0x245/0x2ce [] __do_page_cache_readahead+0x96/0x179 [] filemap_nopage+0x14c/0x360 [] __handle_mm_fault+0x1fa/0xf99 [] generic_file_read+0xac/0xc5 [] do_page_fault+0x4cb/0x830 [] mutex_lock+0xd/0x1d [] :sd_mod:scsi_disk_put+0x2e/0x3f [] iput+0x4b/0x84 [] dput+0x2c/0x114 [] error_exit+0x0/0x84 – Ichorus Jan 14 '11 at 18:40
  • I wonder if oom killer just happened to kill the file read process and the failure of that read is what showed up in the db2diag log? – Ichorus Jan 14 '11 at 18:41
  • I suggest editing your question and adding this info, instead of putting it in comments. Also, yeah, if you are getting oom-killer messages then that sounds like the probable cause. – Phil Hollenback Jan 14 '11 at 18:45
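
Following up on the oom-killer lead in the comments above, a minimal sketch of the log and memory checks being discussed (standard Linux commands; appropriate overcommit settings depend on the workload and on DB2's own memory tuning):

# Every oom-killer invocation recorded in the kernel ring buffer
$ dmesg | grep -i 'invoked oom-killer'
# The same events in the persistent logs, with timestamps
$ sudo grep -i 'out of memory' /var/log/messages*

# Current memory picture and the kernel's overcommit policy
$ free -m
$ sysctl vm.overcommit_memory vm.overcommit_ratio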