I changed smbclient’s default timeout from 20 seconds to 200 seconds. The shorter timeout was causing problems with our BackupPC server, especially with large files on Windows machines running virus protection.

Check out line 6246 in libsmb/libsmbclient.c; at least, that’s where it is in 3.0.28.

context->timeout = 200000; /* 200 seconds */

Specify the samba config path when you run configure:

./configure --with-configdir=/etc/samba && make

I copied the new smbclient into place and all seems good so far…

UPDATE: The timeout isn’t in the smbclient binary, but rather in the libsmbclient.so library; in my case it’s in /usr/lib/samba.
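For reference, here’s the whole dance, roughly as I did it (paths assume the stock 3.0.28 tarball layout and my Gentoo box; yours may differ):

# edit the timeout (the value is in milliseconds) in libsmb/libsmbclient.c, then:
cd samba-3.0.28/source
./configure --with-configdir=/etc/samba && make
# keep the old library around, then swap in the new one
cp /usr/lib/samba/libsmbclient.so /usr/lib/samba/libsmbclient.so.orig
cp bin/libsmbclient.so /usr/lib/samba/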

*crosses fingers*

samba privs; unix extensions = no

February 22nd, 2008

Here’s the problem: if a folder on a samba (3.0.28, but any 3.0.2x should behave the same) share has mode 700 and a non-root owner, even samba admin users will not be able to access the folder from the Finder in Mac OS X 10.5.2. The solution? Turn off ‘unix extensions’ in the server’s smb.conf.

unix extensions = no
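In context, the directive belongs in the global section; a minimal sketch (everything else elided):

[global]
    unix extensions = no

Run testparm to make sure the config still parses, then restart samba (on Gentoo, /etc/init.d/samba restart) and the Finder can get into those 700 folders again.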

Leopard and samba, continued

February 19th, 2008

I was able to successfully view files/folders using smbclient, so it looks like the problem is the Finder. Basically, privs on folders on a share are set up such that a particular user will have rwx access, but for some reason the way Leopard reads the ACLs on certain folders is fooling the Finder into thinking that the user has no access. I tried recreating the situation with nested groups in AD and a test user account, but I couldn’t get the problem to show up. I’m thinking max-token-length or something else hard to troubleshoot. I tried turning unix extensions on and off; no luck. Most profs are members of at least 28 AD groups. That’s not too many, is it?
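For anyone who wants to repeat the smbclient test, it was roughly this (server, share, and folder names are made up):

smbclient //server/share -U 'DOMAIN\someuser' -c 'cd problem-folder; ls'

If that lists the contents while the Finder shows the same folder as forbidden, the server-side privs are fine and the client is the problem.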

Today we encountered a 10.5.2 Mac that couldn’t correctly resolve privs; a Tiger machine with the same credentials did just fine. Since the privs in question were ACL-based, the only thing I can think of is that Leopard is getting confused somewhere. It could be UNIX extensions. It could be only a Finder issue (I didn’t think of testing smbclient until later).

Next is figuring out how to recreate the problem… That’s for tomorrow. =)

BackupPC: tar error 256

January 20th, 2008

From the BackupPC v3 source code:

# Tar 1.16 uses exit status 1 (256) when some files
# changed during archive creation. We allow this
# as a benign error and consider the archive ok

We are using v2, but the Tar.pm file looked nearly identical to v3. Thus, I changed line 201 in /usr/lib/BackupPC/Xfer/Tar.pm to look like this:

if ( !close($t->{pipeTar}) && $? != 256 ) {

That should work, right? =)
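It should: Perl’s $? holds the raw wait() status, with the child’s exit code in the high byte, so tar’s exit status 1 comes through as 1 << 8 = 256. A quick one-liner to convince yourself:

perl -e 'system("/bin/false"); printf "raw=%d exit=%d\n", $?, $? >> 8;'
# prints: raw=256 exit=1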

XFS Progress?

January 15th, 2008

Over break my problematic XFS volume wasn’t quite as busy as usual, but I was still seeing FS errors in dmesg. The same messages showed up in my syslog, and I was able to correlate the time of the errors to a particular cron job. From there I systematically narrowed it down to a specific directory that, whenever it was touched, caused the errors in my logs! Sure enough, I tried cd baddirectory and got an FS error. I did it enough times in a row that the FS got angry and remounted RO. After an xfs_repair marathon I was able to delete the bad directory, and I haven’t seen an FS error since. I went through the trouble of touching every single file and directory on the volume in search of a similar rotten apple; no luck so far.
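For posterity, the hunt-and-destroy sequence looked roughly like this (device and mount point names changed):

dmesg | grep -i xfs                  # watch for the internal errors
cd /mnt/vol/baddirectory             # touching the bad directory triggers one
umount /mnt/vol
xfs_repair /dev/vg0/lvol0            # repeat until it exits clean
mount /mnt/vol
rm -rf /mnt/vol/baddirectory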

So it would appear that bad files/directories were the culprit. But how did they get there? I won’t know until it happens again, I guess.

More XFS errata

December 22nd, 2007


XFS internal error XFS_WANT_CORRUPTED_GOTO at line 4535 of file fs/xfs/xfs_bmap.c. Caller 0xc029537f
[] xfs_bmap_read_extents+0x3c7/0x4a2
[] xfs_iread_extents+0x74/0xe1
[] xfs_iext_realloc_direct+0xb0/0x10c
[] xfs_iext_add+0x138/0x272
[] xfs_iread_extents+0x74/0xe1
[] xfs_bmapi+0x1ca/0x17b3
[] __bio_clone+0x9e/0xaf
[] xfs_ilock+0x58/0xa0
[] xfs_iunlock+0x69/0x84
[] xfs_iomap+0x27b/0x4b7
[] xfs_iunlock+0x69/0x84
[] __xfs_get_blocks+0x6b/0x221
[] tcp_in_window+0x38d/0x46b
[] xfs_ilock+0x58/0xa0
[] xfs_iomap+0x216/0x4b7
[] __nf_conntrack_find+0x18/0xee
[] __xfs_get_blocks+0x6b/0x221
[] nf_ct_deliver_cached_events+0x7f/0x8c
[] ipv4_confirm+0x26/0x46
[] xfs_get_blocks+0x28/0x2d
[] block_read_full_page+0x19e/0x35d
[] xfs_get_blocks+0x0/0x2d
[] ip_rcv+0x259/0x59b
[] ip_rcv_finish+0x0/0x310
[] do_mpage_readpage+0x530/0x621
[] netif_receive_skb+0x1d5/0x1f1
[] process_backlog+0x81/0x107
[] e1000_xmit_frame+0x2b5/0x3f4
[] mpage_readpage+0x4b/0x5e
[] xfs_get_blocks+0x0/0x2d
[] radix_tree_gang_lookup+0x5c/0x9b
[] find_get_pages_contig+0x28/0x73
[] __generic_file_splice_read+0x1d3/0x461
[] ip_queue_xmit+0x3a3/0x40c
[] xfs_iunlock+0x43/0x84
[] xfs_vget+0xe1/0xf2
[] iput+0x31/0x5f
[] d_alloc_anon+0x22/0xf3
[] find_acceptable_alias+0x17/0xc1
[] nfsd_acceptable+0x0/0xc0
[] find_exported_dentry+0x63/0x1bd
[] bictcp_acked+0x55/0x7c
[] _spin_lock_bh+0x8/0x18
[] release_sock+0x1b/0x91
[] tcp_recvmsg+0x2e0/0x74d
[] sock_common_recvmsg+0x47/0x66
[] generic_file_splice_read+0x86/0xe8
[] xfs_splice_read+0x93/0x163
[] xfs_file_splice_read+0x4c/0x5c
[] do_splice_to+0x63/0x7e
[] splice_direct_to_actor+0x9e/0x182
[] nfsd_direct_splice_actor+0x0/0xa
[] nfsd_vfs_read+0x384/0x3a4
[] dentry_open+0x34/0x72
[] nfsd_read+0xda/0xfb
[] nfsd3_proc_read+0xc5/0x189
[] nfs3svc_decode_readargs+0x0/0xed
[] nfsd_dispatch+0x99/0x210
[] svcauth_unix_set_client+0x116/0x16e
[] svc_process+0x57d/0x701
[] default_wake_function+0x0/0xc
[] nfsd+0x163/0x273
[] nfsd+0x0/0x273
[] kernel_thread_helper+0x7/0x10
=======================
attempt to access beyond end of device
dm-0: rw=0, want=68609558288793608, limit=8178892800
I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096

My XFS saga continues…

December 20th, 2007

My latest post to the XFS mailing list:

I’m still seeing problems. =(

Most recently I have copied all of the data off of the suspect XFS volume onto another fresh XFS volume. A few days later I saw the same messages show up in dmesg. I haven’t had a catastrophic failure that makes the kernel remount the FS RO, but I don’t want to wait for that to happen.

Today I upgraded to the latest stable kernel in Gentoo (2.6.23-r3) and I’m still on xfsprogs 2.9.4, also the latest stable release. A few hours after rebooting to load the new kernel, I saw the following in dmesg:

attempt to access beyond end of device
dm-0: rw=0, want=68609558288793608, limit=8178892800
I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096
attempt to access beyond end of device
dm-0: rw=0, want=68609558288793608, limit=8178892800
I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096
attempt to access beyond end of device
dm-0: rw=0, want=68609558288793608, limit=8178892800
I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096
attempt to access beyond end of device
dm-0: rw=0, want=68609558288793608, limit=8178892800
I/O error in filesystem ("dm-0") meta-data dev dm-0 block 0xf3c0079e000000 ("xfs_trans_read_buf") error 5 buf count 4096

These are the same types of messages (trying to access a block that is WAAAY outside of the range of my drives) that I was seeing before the last time my FS got remounted read-only by the colonel.
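To put numbers on ‘WAAAY outside’ (values straight from the log above; want and limit are 512-byte sectors):

echo $((8178892800 * 512 / 1024 ** 3))      # 3900 GiB: the limit matches my 3.9TB LV
echo $((68609558288793608 / 8178892800))    # the requested sector is ~8.4 million times past the end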

Any ideas? What other information can I gather that would help with troubleshooting? Here are some more specifics:

This is a Dell PowerEdge 1850 with a FusionMPT/LSI fibre channel card. The XFS volume is a 3.9TB logical volume in LVM. The volume group is spread across LUNs on Apple XServe RAIDs, which are connected o’er FC to our fabric. I just swapped FC switches (to a different brand, even) and the problem showed up both before and after the switch switch, so that’s not it. I have also swapped FC cards, upgraded FC card firmware, updated BIOSes, etc. This server sees heavy NFS (v3) and samba (currently 3.0.24 until the current regression bug is squashed and stable) traffic. ‘Heavy traffic’ means it usually sees 200-300Mbps of throughput 24/7, although sometimes more.

Far-fetched: Is there any way that a particular file on my FS, when accessed, is causing the problem?

I have a very similar system (Dell PE 2650, same FC card, same type of RAID, same SFP cables, same GPT scheme, same kernel) but with an ext3 (full journal) FS on a 5.[something]TB logical volume (LVM), and it has no problems. Oh, and it sees system load values in the mid-20s just about all day.

Grasping at straws. I need XFS to work because we’ll soon be requiring seriously large filesystems with non-sucky extended attribute and ACL support. Plus it’s fast and I like it.

Can the XFS community help? I don’t want to have to turn to that guy that allegedly killed his wife. =P

I upgraded my samba test box to 3.0.28 (latest ~x86 in portage) and after preliminary testing I was pumped to upgrade our main file services to the same version. There was only one problem: Mac OS X 10.4.latest (Tiger) clients can’t connect.

With versions of samba greater than 3.0.24 there is a Finder ‘bug’ that shows itself in Leopard (10.5.1 and earlier): you can’t connect to a samba server that has no browsable shares. Now that I’ve figured out that you just need to make _one_ share browsable to keep the Finder happy, I’ve found this Tiger problem… I’m going to fiddle with the unix extensions directive and see what happens. Right now I’m not too worried about upgrading, since _ALL_ versions of samba in portage are masked due to a regression bug that needs some TLC.
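For reference, the one-browsable-share workaround is just this in smb.conf (share name and path are placeholders):

[public]
    path = /srv/samba/public
    browseable = yes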

Dell PERC RAID controller whoops

December 19th, 2007

This basically describes my situation to a T. I upgraded the kernel on a Gentoo system (Dell PE 2650) from 2.6.20.something to 2.6.23-r3 (gentoo-sources) and after a reboot the kernel was showing tons of SCSI errors like this:

aacraid: Host adapter abort request (0,2,13,0)
aacraid: Host adapter abort request (0,2,13,0)
aacraid: Host adapter reset request. SCSI hang ?
aacraid: Host adapter abort request (0,2,13,0)
scsi 0:2:13:0: scsi: Device offlined - not ready after error recovery
aacraid: Host adapter abort request (0,2,14,0)
aacraid: Host adapter abort request (0,2,14,0)
aacraid: Host adapter reset request. SCSI hang ?
aacraid: Host adapter abort request (0,2,14,0)
scsi 0:2:14:0: scsi: Device offlined - not ready after error recovery

It took my system 21+ minutes to boot, most of that spent waiting for the kernel to enumerate SCSI devices that weren’t there. Turns out (thanks to the gentoo forums) that the aacraid driver had been updated, but requires a newer PERC firmware, which in turn requires a newer BIOS (at least A15 according to Dell; A21 is out now).

I cooked up a BIOS update CD and I’ll be trying it out tomorrow. The sucky part is that the PERC3/4/69/whatever update is ONLY available as floppy images. Well, there is an RPM/BIN-thingie for RHEL, a Windows package, and floppy images. I tried the RHEL package on Gentoo and it didn’t work. Surprise.

So now I have to find not one but TWO floppies and trek over to our colo. The last time we needed 1.44MB disks, tracking them down in our office was a non-trivial task.
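If I can scare up a working drive, at least writing the images is mechanical (image names are whatever Dell’s extractor spits out):

dd if=perc_fw_disk1.img of=/dev/fd0 bs=1440k
# swap floppies, then:
dd if=perc_fw_disk2.img of=/dev/fd0 bs=1440k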

Tomorrow night I’ll be updating the kernel on our file server that has potential XFS/NFS problems. Only time will tell…