My XFS saga continues…

December 20th, 2007

My latest message to the XFS mailing list:

I’m still seeing problems. =(

Most recently I have copied all of the data off of the suspect XFS volume onto another fresh XFS volume. A few days later I saw the same messages show up in dmesg. I haven’t had a catastrophic failure that makes the kernel remount the FS RO, but I don’t want to wait for that to happen.
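In case anyone wants to suggest it: a read-only repair pass is on my list for the next maintenance window. Something like the following, run with the volume unmounted; the device and mount paths here are placeholders, not my real ones:

umount /mnt/xfsvol                 # xfs_repair wants the filesystem offline
xfs_repair -n /dev/vg0/xfsvol      # -n: check only, modify nothing, just report problems
mount /mnt/xfsvol                  # remount afterwards (assumes an fstab entry)

If the -n pass comes back clean on the fresh volume, that would point away from latent on-disk damage and toward something corrupting buffers in flight.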

Today I upgraded to the latest stable kernel in Gentoo (2.6.23-r3) and I’m still on xfsprogs 2.9.4, also the latest stable release. A few hours after rebooting to load the new kernel, I saw the following in dmesg:

attempt to access beyond end of device
dm-0: rw=0, want=68609558288793608, limit=8178892800
I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096
attempt to access beyond end of device
dm-0: rw=0, want=68609558288793608, limit=8178892800
I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096
attempt to access beyond end of device
dm-0: rw=0, want=68609558288793608, limit=8178892800
I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096
attempt to access beyond end of device
dm-0: rw=0, want=68609558288793608, limit=8178892800
I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096

These are the same types of messages (trying to access a block that is WAAAY outside of the range of my drives) that I was seeing before the last time my FS got remounted read-only by the colonel.
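To put numbers on ‘WAAAY outside’: if I’m reading the units right (both figures are in 512-byte sectors), the kernel wants a sector more than eight million times past the end of the device, and the ‘want’ value is exactly the bogus block address plus the 8 sectors of a single 4096-byte read. The arithmetic checks out from any shell:

printf '%d\n' 0xf3c0079e000000    # 68609558288793600 -- the bogus block address in decimal
echo $(( 68609558288793600 + 8 )) # 68609558288793608 -- matches 'want' exactly
echo $(( 8178892800 * 512 ))      # 4187593113600 bytes -- the ~3.9TB LV, matching 'limit'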

Any ideas? What other information can I gather that would help with troubleshooting? Here are some more specifics:

This is a Dell PowerEdge 1850 with a FusionMPT/LSI fibre channel card. The XFS volume is a 3.9TB logical volume in LVM. The volume group is spread across LUNs on Apple Xserve RAIDs, which are connected o’er FC to our fabric. I just swapped FC switches (to a different brand, even), and the problem appeared both before and after the switch switch, so that’s not it. I have also swapped FC cards, upgraded FC card firmware, updated BIOSes, etc. This server sees heavy NFS (v3) and Samba (currently 3.0.24, until the current regression bug is squashed and a fix goes stable) traffic. ‘Heavy traffic’ means it usually sees 200-300Mbps of throughput 24/7, sometimes more.
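For reference, here is roughly the snapshot I plan to capture the next time the messages appear (mountpoint and device names below are placeholders):

uname -r                                  # running kernel
xfs_info /mnt/xfsvol                      # filesystem geometry as XFS sees it
blockdev --getsize /dev/dm-0              # device size in 512-byte sectors; should equal 'limit' above
lvdisplay                                 # LVM's view of the logical volumes
dmesg | grep -B1 -A1 'I/O error'          # the errors with surrounding context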

Far-fetched: Is there any way that a particular file on my FS, when accessed, is causing the problem?
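If it is one bad file (or a bad chunk of metadata near one), a brute-force hunt might catch it: clear the ring buffer, walk the tree, and note which file is in flight when the message fires. A rough sketch (paths are placeholders), with the caveat that the errors above come from metadata reads, so a plain ls -lR over the tree may be just as likely to trip it as reading file data:

dmesg -c > /dev/null                      # clear the kernel ring buffer first (needs root)
find /mnt/xfsvol -type f | while read -r f; do
    cat "$f" > /dev/null 2>&1             # force a read of the file's data
    if dmesg | grep -q 'beyond end of device'; then
        echo "error appeared while reading: $f"
        break
    fi
done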

I have a very similar system (Dell PE 2650, same FC card, same type of RAID, same SFP cables, same GPT scheme, same kernel), but with an ext3 (full journal) FS on a 5.[something]TB LVM logical volume, and it has no problems. Oh, and it sees system load values in the mid-20s just about all day.

Grasping at straws. I need XFS to work because we’ll soon be requiring seriously large filesystems with non-sucky extended attribute and ACL support. Plus it’s fast and I like it.

Can the XFS community help? I don’t want to have to turn to that guy that allegedly killed his wife. =P
