How do YOU detect a failed XFS filesystem?

Currently I monitor for a failed filesystem (as a result of a failed disk, controller, whatever) by checking syslog for messages like this:

2017-06-15T17:18:10.081665+00:00    2017-06-15T17:18:10+00:00   localhost  kernel:  [1381844.448488] blk_update_request: critical target error, dev sdj, sector 97672656

2017-06-15T17:18:10.724329+00:00    2017-06-15T17:18:10+00:00   localhost  kernel:  [1381845.047871] XFS (md0): metadata I/O error: block 0x2baa81400 ("xlog_iodone") error 121 numblks 512

2017-06-15T17:18:10.724329+00:00    2017-06-15T17:18:10+00:00   localhost  kernel:  [1381845.124418] XFS (md0): xfs_do_force_shutdown(0x2) called from line 1177 of file /build/linux-lts-wily-8ENwT0/linux-lts-wily-4.2.0/fs/xfs/xfs_log.c.  Return address = 0xffffffffc050e100

2017-06-15T17:18:10.724349+00:00    2017-06-15T17:18:10+00:00   localhost  kernel:  [1381845.124425] XFS (md0): Log I/O Error Detected.  Shutting down filesystem

2017-06-15T17:18:10.724349+00:00    2017-06-15T17:18:10+00:00   localhost  kernel:  [1381845.124452] XFS (md0): xfs_log_force: error -5 returned.

2017-06-15T17:18:10.724354+00:00    2017-06-15T17:18:10+00:00   localhost  kernel:  [1381845.163480] XFS (md0): Please umount the filesystem and rectify the problem(s)

2017-06-15T17:18:40.612572+00:00    2017-06-15T17:18:40+00:00   localhost  kernel:  [1381875.074647] XFS (md0): xfs_log_force: error -5 returned.

2017-06-15T17:19:10.612554+00:00    2017-06-15T17:19:10+00:00   localhost  kernel:  [1381905.101606] XFS (md0): xfs_log_force: error -5 returned.

2017-06-15T17:19:40.612558+00:00    2017-06-15T17:19:40+00:00   localhost  kernel:  [1381935.128546] XFS (md0): xfs_log_force: error -5 returned.
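For reference, the log check I describe boils down to something like the sketch below. The function name is made up for illustration, and which messages count as "the filesystem has shut down" is my own assumption, with the patterns taken from the excerpt above.

```python
import re

# Patterns taken from the kernel messages quoted above; which ones to treat
# as "the filesystem has shut down" is an assumption of this sketch.
SHUTDOWN_RE = re.compile(
    r"XFS \((?P<dev>[^)]+)\): "
    r"(?:xfs_do_force_shutdown\(|.*Shutting down filesystem)"
)


def find_shutdown_devices(log_lines):
    """Return the set of devices for which an XFS shutdown message was seen."""
    devices = set()
    for line in log_lines:
        match = SHUTDOWN_RE.search(line)
        if match:
            devices.add(match.group("dev"))
    return devices
```

It can be fed any iterable of lines, for example an open syslog file or the output of dmesg.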

This is ok but I'd really like a more canonical check. The only thing I can think of is to write a script that attempts to write a file to disk and fires off an alarm if it can't for any reason. However, that seems like it's prone to false positives - there are several reasons why a file might not be able to be written, not just a failed filesystem.
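To make the false-positive concern concrete, here is a minimal sketch of such a canary check. The "error -5" in the log above is EIO, so one way to cut down on false alarms might be to look at the errno of the failed write rather than treating every failure alike. The file name and the split into "benign" and "critical" errors are assumptions of this sketch, not an exhaustive classification.

```python
import errno
import os


def canary_write(directory, name=".fs-canary"):
    """Try to write and fsync a small file; return (ok, errno_name_or_None)."""
    path = os.path.join(directory, name)
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            os.write(fd, b"canary\n")
            os.fsync(fd)  # push the data to the device, not just the page cache
        finally:
            os.close(fd)
        os.unlink(path)
        return True, None
    except OSError as exc:
        return False, errno.errorcode.get(exc.errno, str(exc.errno))


# Errors that usually have a cause other than a broken filesystem
# (this grouping is an assumption of the sketch, not an exhaustive list).
BENIGN = {"ENOSPC", "EDQUOT", "EACCES", "EPERM"}


def classify(result):
    """Map a canary_write() result to "ok", "warning" or "critical"."""
    ok, err = result
    if ok:
        return "ok"
    return "warning" if err in BENIGN else "critical"
```

Even with this split, a full disk or a permissions change still raises a warning, which is why I'd prefer something more direct.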

Aside from grepping the logs or writing a canary file to disk, how can this be monitored?

asked Jun 15 '17 at 19:34

Evan


You could configure Nagios; they may have a module in place that can check the status of XFS...
– ryekayo
Jun 15 '17 at 19:48

@ryekayo A quick perusal of exchange.nagios.org/directory/Plugins/System-Metrics/… did not uncover anything for monitoring the health of XFS file systems directly, though there are some that could probably be pressed into service.
– a CVn
Jun 15 '17 at 20:57


filesystems monitoring xfs


3 Answers

Hmmm. How do I detect a failed XFS filesystem?

I've been using XFS for ages. But... I guess I don't detect it, at all. If it mounts, I trust it works. That's how most people do it... filesystem checks are automated, if it boots and it's up and running, that's that.

Now, don't get me wrong. I actually do a ton of monitoring, but none of it is filesystem specific. I run SMART selftests (using select,cont to do a disk segment per day, because long simply takes too long). I run RAID checks (also in segments) and also check that there are no mismatches in parity (mismatch_cnt = 0). I get instant mail notification if any of those fail and I actually replace HDDs once they start reallocating sectors (or at least, no longer trust them with important data).
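A rough sketch of the parity check, assuming Linux md RAID, where the counter is exposed as /sys/block/<array>/md/mismatch_cnt. The function names and the sysfs_root parameter (there only so the code can be tried against a fake directory tree) are made up for illustration.

```python
from pathlib import Path


def read_mismatch_cnt(array, sysfs_root="/sys/block"):
    """Return mismatch_cnt for an md array (e.g. "md0") as an int."""
    path = Path(sysfs_root) / array / "md" / "mismatch_cnt"
    return int(path.read_text().strip())


def arrays_with_mismatches(arrays, sysfs_root="/sys/block"):
    """Return {array: count} for every array whose mismatch_cnt is non-zero."""
    counts = {name: read_mismatch_cnt(name, sysfs_root) for name in arrays}
    return {name: cnt for name, cnt in counts.items() if cnt != 0}
```

The counter only means something after a check has run on the array, so this would be called once the scheduled RAID check has finished, with a mail sent whenever the returned dict is not empty.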

So I have monitoring to make sure the storage works as it should. This covers errors inside the drives themselves (SMART) and to some extent also errors on a higher level (RAID checks in a way also test controllers, cables, RAID logic, ...).

As long as that works fine, the filesystem better be fine, too. Outside of checksumming filesystems like ZFS/btrfs (maybe XFS in the future, too) it's not really part of the concept to run checks on a filesystem level while mounted, apart from whatever sanity checks the filesystem itself does internally.

Your output suggests you're running RAID too, and had a failed disk in that RAID. Even so, that really should not cause errors on md0, unless it was RAID without redundancy (RAID0, or an already degraded RAID1/5/6/10).

You should fix your problems below the filesystem layer first. You can hardly blame XFS for disk errors and that's not how you check for disk errors.

I guess if you really wanted to run a full read test on top of the filesystem, you could do an xfsdump to a backup disk... if you're doing a full read test of your filesystem anyhow, might as well do it in a way that's meaningful somehow.

It's the nature of xfsdump to walk the XFS filesystem in its entirety and store all the files. So that should come as close as possible to a full read test, not including free space.

Of course, if you're already running another backup system, that's really the same story in a filesystem-agnostic way (and if that backup system encounters read errors that aren't just lack of permissions, it had better send you a mail report, too). Bear in mind that an incremental backup without periodic full backups won't actually read a file more than once.

But in general, we trust filesystems to "just work" as long as the storage is known to work. While it would be nice to have each and every program, without exception, escalate any and all I/O errors it encounters, I'm not aware of a general-purpose solution that actually does so. Each program does its own error handling.

answered Jun 15 '17 at 20:04

frostschutz


Shouldn't something like making sure xfsdump -l 0 - /dev/md0 >/dev/null && echo sane terminates tell you whether the file system itself is sane? You could then do a read-only badblocks pass (or dd, for that matter) to make sure that all parts of the disk are at least readable.
– a CVn
Jun 15 '17 at 20:52

Yes, only it might be more meaningful to keep the result than just toss it away. badblocks is one such example: if I have reason to believe there actually are read errors, I never run badblocks; I run ddrescue. Of course badblocks can do write tests too, but you don't really do that when there is data you need.
– frostschutz
Jun 15 '17 at 21:11

True; it really depends on how thorough you want to be. Though it sounds like OP is mainly interested in reactive monitoring, as evidenced by the discussion on monitoring system logs for I/O-related errors. I use ZFS myself (not XFS), with redundancy and regular file system scrubs as well as SMART and system log monitoring, which ensures that developing problems are likely to be caught early, hopefully while still recoverable without more effort than a simple disk replacement.
– a CVn
Jun 15 '17 at 21:19


Hmm. I guess I was really hoping for something along the lines of /proc/fs/xfs/health that I could poll. Running badblocks or xfsdump would be very time and resource intensive. Clearly the kernel knows something is wrong since the errors appear in syslog. And while I appreciate the sentiment about monitoring the underlying devices and assuming the filesystem works, that's just not an acceptable solution for me.
– Evan
Jun 22 '17 at 12:45


Many critical storage servers enable "panic on error", so that an error wouldn't have a chance to become bigger, to cause further data damage, or to serve corrupted data to the users. With panic on error, you could just watch for panic events or system down to detect file system errors.

Of course, if you don't have a redundant system, one server panicking would mean actual downtime. However, mission-critical systems must have redundancy. In fact, any data on a file system that shows this kind of I/O error should no longer be used, and the system should be disconnected asap so that a backup system can kick in. Serving no data is actually better than serving corrupt data in most cases.

According to https://access.redhat.com/solutions/3645252, you could set the sysctl fs.xfs.panic_mask=127 to make any error detected on any XFS filesystem become a system panic.

To make this setting persist across reboots, do:

echo 'fs.xfs.panic_mask=127' > /etc/sysctl.d/01-xfs.conf
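A file under /etc/sysctl.d is normally only applied at boot or when the sysctl configuration is reloaded, so it may be worth confirming the running value. A small sketch, assuming the sysctl is exposed in the usual place under /proc/sys; the function name and the proc_root parameter are made up for illustration.

```python
from pathlib import Path


def xfs_panic_mask(proc_root="/proc"):
    """Return the current fs.xfs.panic_mask value as an int."""
    return int(Path(proc_root, "sys/fs/xfs/panic_mask").read_text().strip())


def panic_mask_is_set(expected=127, proc_root="/proc"):
    """True if the running kernel has the expected panic mask applied."""
    return xfs_panic_mask(proc_root) == expected
```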

answered yesterday

Yan Li


xfs_repair -n

Usage: xfs_repair [options] device

Options:
  -f           The device is a file
  -L           Force log zeroing. Do this as a last resort.
  -l logdev    Specifies the device where the external log resides.
  -m maxmem    Maximum amount of memory to be used in megabytes.
  -n           No modify mode, just checks the filesystem for damage.
  -P           Disables prefetching.
  -r rtdev     Specifies the device where the realtime section resides.
  -v           Verbose output.
  -c subopts   Change filesystem parameters - use xfs_admin.
  -o subopts   Override default behaviour, refer to man page.
  -t interval  Reporting interval in minutes.
  -d           Repair dangerously.
  -V           Reports version and exits.

man page:

-n     No modify mode.
       Specifies that xfs_repair should not modify the filesystem but
       should only scan the filesystem and indicate what repairs would have been made.

There is also xfs_check, but its man page says: check XFS filesystem consistency... Note that using xfs_check is NOT recommended. Please use xfs_repair -n instead, for better scalability and speed.

In /etc/fstab, a 1 or 2 in the sixth (last) column causes a file system check (fsck) at mount time, which would happen on every boot. Whether that specifically runs xfs_repair -n, I don't know.

You asked about detecting a failed file system. My interpretation is that a failed file system would not be mounted and would not be accessible at all. You would know without really having to check: it would be obvious once you notice it is not mounted and it then rudely fails to mount when you try manually.

The file system must be unmounted to run xfs_repair, but to monitor, here is what you would periodically do manually:

# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc2       550G  152G  371G  30% /            {ext3}
udev            253G  216K  253G   1% /dev
tmpfs           253G  5.5M  253G   1% /dev/shm
/dev/sdc1       195M   13M  183M   7% /boot/efi
/dev/sda1       5.0T  4.9T   99G  99% /data         {xfs}
/dev/sdb1       559G   67G  492G  12% /scratch
tmpfs           450G     0  450G   0% /ramdisk
/dev/sdd1       5.0T  4.9T  9.8G 100% /bkup         {xfs}



How do I find file system types?



# mount
/dev/sdc2 on / type ext3 (rw,acl,user_xattr)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
debugfs on /sys/kernel/debug type debugfs (rw)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,mode=1777)
devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
/dev/sdc1 on /boot/efi type vfat (rw,umask=0002,utf8=true)
/dev/sda1 on /data type xfs (rw)
/dev/sdb1 on /scratch type xfs (rw)
fusectl on /sys/fs/fuse/connections type fusectl (rw)
securityfs on /sys/kernel/security type securityfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
tmpfs on /ramdisk type tmpfs (rw,size=450G)
nfsd on /proc/fs/nfsd type nfsd (rw)
/dev/sdd1 on /bkup type xfs (rw)
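The "is it still mounted?" part of this can be automated by reading /proc/mounts instead of eyeballing the mount output. A minimal sketch; the function names are made up, and the list of expected mount points is something you would supply yourself.

```python
def xfs_mounts(mounts_text):
    """Return {mount_point: device} for every XFS entry in /proc/mounts-style text."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()  # device, mount point, type, options, dump, pass
        if len(fields) >= 3 and fields[2] == "xfs":
            result[fields[1]] = fields[0]
    return result


def missing_mounts(expected, mounts_text):
    """Return the expected mount points that are not currently mounted as XFS."""
    mounted = xfs_mounts(mounts_text)
    return sorted(set(expected) - set(mounted))
```

Note that this only catches a file system that is not mounted at all. The kernel message in the question asks the admin to umount the file system after the shutdown, which suggests a shut-down XFS stays in the mount table, so this check would not notice that case.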



# xfs_repair -n /dev/sdd1
xfs_repair: /dev/sdd1 contains a mounted and writable filesystem

fatal error -- couldn't initialize XFS library

# umount /bkup/
# xfs_repair -n /dev/sdd1

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 4
        - agno = 3
        - agno = 1
        - agno = 2
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.



This "xfs_repair -n" output is from a good XFS file system that has been problem-free for years.
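If you want to script that check rather than read the output by eye, the exit status is the thing to look at. As far as I can tell from the xfs_repair man page, with -n it returns 1 if corruption was detected and 0 if not. This sketch assumes that other failures (such as the "mounted and writable" refusal shown above) also give a non-zero status, so it treats anything other than 0 as "go and read the output". The function name and the injectable run parameter are made up for illustration.

```python
import subprocess


def check_xfs(device, run=subprocess.run):
    """Run `xfs_repair -n` on an unmounted device.

    Returns (clean, output): clean is True only when the exit status is 0.
    """
    proc = run(
        ["xfs_repair", "-n", device],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # keep messages and errors together
        text=True,
    )
    return proc.returncode == 0, proc.stdout
```

For example, check_xfs("/dev/sdd1") after the umount above should give True together with the phase listing.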

edited yesterday

answered yesterday

ron


3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

Hmmm. How do I detect a failed XFS filesystem?

You should fix your problems below the filesystem layer first. You can hardly blame XFS for disk errors and that's not how you check for disk errors.

It's the nature of xfsdump to walk the XFS filesystem in its entirety and store all the files. So that should come as close as possible to a full read test, not including free space.

answered Jun 15 '17 at 20:04

frostschutz

25.7k15280

Shouldn't something like making sure xfsdump -l 0 - /dev/md0 >/dev/null && echo sane terminates tell you whether the file system itself is sane? You could then do a read-only badblocks pass (or dd, for that matter) to make sure that all parts of the disk are at least readable.
– a CVn
Jun 15 '17 at 20:52

Yes, only it might be more meaningful to keep the result than just toss it away. badblocks is one such example, if I have reason to believe there actually are read errors, I never run badblocks. I run ddrescue. Of course badblocks can do write tests too, but you don't really do that when there is data you need.
– frostschutz
Jun 15 '17 at 21:11

True; it really depends on how thorough you want to be. Though it sounds like OP is mainly interested in reactive monitoring, as evidenced by the discussion on monitoring system logs for I/O-related errors. I use ZFS myself (not XFS), with redundancy and regular file system scrubs as well as SMART and system log monitoring, which ensures that developing problems are likely to be caught early, hopefully while still recoverable without more effort than a simple disk replacement.
– a CVn
Jun 15 '17 at 21:19

1

Hmm. I guess I was really hoping for something along the lines of /proc/fs/xfs/health that I could poll. Running badblocks or xfsdump would be very time and resource intensive. Clearly the kernel knows something is wrong since the errors appear in syslog. And while I appreciate the sentiment about monitoring the underlying devices and assuming the filesystem works, that's just not an acceptable solution for me.
– Evan
Jun 22 '17 at 12:45

add a comment |

Hmmm. How do I detect a failed XFS filesystem?

You should fix your problems below the filesystem layer first. You can hardly blame XFS for disk errors and that's not how you check for disk errors.

It's the nature of xfsdump to walk the XFS filesystem in its entirety and store all the files. So that should come as close as possible to a full read test, not including free space.

answered Jun 15 '17 at 20:04

frostschutz

25.7k15280

Shouldn't something like making sure xfsdump -l 0 - /dev/md0 >/dev/null && echo sane terminates tell you whether the file system itself is sane? You could then do a read-only badblocks pass (or dd, for that matter) to make sure that all parts of the disk are at least readable.
– a CVn
Jun 15 '17 at 20:52

Yes, only it might be more meaningful to keep the result than just toss it away. badblocks is one such example, if I have reason to believe there actually are read errors, I never run badblocks. I run ddrescue. Of course badblocks can do write tests too, but you don't really do that when there is data you need.
– frostschutz
Jun 15 '17 at 21:11

True; it really depends on how thorough you want to be. Though it sounds like OP is mainly interested in reactive monitoring, as evidenced by the discussion on monitoring system logs for I/O-related errors. I use ZFS myself (not XFS), with redundancy and regular file system scrubs as well as SMART and system log monitoring, which ensures that developing problems are likely to be caught early, hopefully while still recoverable without more effort than a simple disk replacement.
– a CVn
Jun 15 '17 at 21:19

1

Hmm. I guess I was really hoping for something along the lines of /proc/fs/xfs/health that I could poll. Running badblocks or xfsdump would be very time and resource intensive. Clearly the kernel knows something is wrong since the errors appear in syslog. And while I appreciate the sentiment about monitoring the underlying devices and assuming the filesystem works, that's just not an acceptable solution for me.
– Evan
Jun 22 '17 at 12:45

add a comment |

Hmmm. How do I detect a failed XFS filesystem?

You should fix your problems below the filesystem layer first. You can hardly blame XFS for disk errors and that's not how you check for disk errors.

It's the nature of xfsdump to walk the XFS filesystem in its entirety and store all the files. So that should come as close as possible to a full read test, not including free space.

answered Jun 15 '17 at 20:04

frostschutz

25.7k15280

Hmmm. How do I detect a failed XFS filesystem?

You should fix your problems below the filesystem layer first. You can hardly blame XFS for disk errors and that's not how you check for disk errors.

It's the nature of xfsdump to walk the XFS filesystem in its entirety and store all the files. So that should come as close as possible to a full read test, not including free space.

answered Jun 15 '17 at 20:04

frostschutz

25.7k15280

answered Jun 15 '17 at 20:04

frostschutz

25.7k15280

answered Jun 15 '17 at 20:04

frostschutz

25.7k15280

answered Jun 15 '17 at 20:04

frostschutz

25.7k15280

Shouldn't something like making sure xfsdump -l 0 - /dev/md0 >/dev/null && echo sane terminates tell you whether the file system itself is sane? You could then do a read-only badblocks pass (or dd, for that matter) to make sure that all parts of the disk are at least readable.
– a CVn
Jun 15 '17 at 20:52

Yes, only it might be more meaningful to keep the result than just toss it away. badblocks is one such example, if I have reason to believe there actually are read errors, I never run badblocks. I run ddrescue. Of course badblocks can do write tests too, but you don't really do that when there is data you need.
– frostschutz
Jun 15 '17 at 21:11

True; it really depends on how thorough you want to be. Though it sounds like OP is mainly interested in reactive monitoring, as evidenced by the discussion on monitoring system logs for I/O-related errors. I use ZFS myself (not XFS), with redundancy and regular file system scrubs as well as SMART and system log monitoring, which ensures that developing problems are likely to be caught early, hopefully while still recoverable without more effort than a simple disk replacement.
– a CVn
Jun 15 '17 at 21:19

1

Hmm. I guess I was really hoping for something along the lines of /proc/fs/xfs/health that I could poll. Running badblocks or xfsdump would be very time and resource intensive. Clearly the kernel knows something is wrong since the errors appear in syslog. And while I appreciate the sentiment about monitoring the underlying devices and assuming the filesystem works, that's just not an acceptable solution for me.
– Evan
Jun 22 '17 at 12:45

add a comment |

Shouldn't something like making sure xfsdump -l 0 - /dev/md0 >/dev/null && echo sane terminates tell you whether the file system itself is sane? You could then do a read-only badblocks pass (or dd, for that matter) to make sure that all parts of the disk are at least readable.
– a CVn
Jun 15 '17 at 20:52

Yes, only it might be more meaningful to keep the result than just toss it away. badblocks is one such example, if I have reason to believe there actually are read errors, I never run badblocks. I run ddrescue. Of course badblocks can do write tests too, but you don't really do that when there is data you need.
– frostschutz
Jun 15 '17 at 21:11

True; it really depends on how thorough you want to be. Though it sounds like OP is mainly interested in reactive monitoring, as evidenced by the discussion on monitoring system logs for I/O-related errors. I use ZFS myself (not XFS), with redundancy and regular file system scrubs as well as SMART and system log monitoring, which ensures that developing problems are likely to be caught early, hopefully while still recoverable without more effort than a simple disk replacement.
– a CVn
Jun 15 '17 at 21:19

1

Hmm. I guess I was really hoping for something along the lines of /proc/fs/xfs/health that I could poll. Running badblocks or xfsdump would be very time and resource intensive. Clearly the kernel knows something is wrong since the errors appear in syslog. And while I appreciate the sentiment about monitoring the underlying devices and assuming the filesystem works, that's just not an acceptable solution for me.
– Evan
Jun 22 '17 at 12:45

Shouldn't something like making sure xfsdump -l 0 - /dev/md0 >/dev/null && echo sane terminates tell you whether the file system itself is sane? You could then do a read-only badblocks pass (or dd, for that matter) to make sure that all parts of the disk are at least readable.
– a CVn
Jun 15 '17 at 20:52

Yes, only it might be more meaningful to keep the result than just toss it away. badblocks is one such example, if I have reason to believe there actually are read errors, I never run badblocks. I run ddrescue. Of course badblocks can do write tests too, but you don't really do that when there is data you need.
– frostschutz
Jun 15 '17 at 21:11

True; it really depends on how thorough you want to be. Though it sounds like OP is mainly interested in reactive monitoring, as evidenced by the discussion on monitoring system logs for I/O-related errors. I use ZFS myself (not XFS), with redundancy and regular file system scrubs as well as SMART and system log monitoring, which ensures that developing problems are likely to be caught early, hopefully while still recoverable without more effort than a simple disk replacement.
– a CVn
Jun 15 '17 at 21:19

Hmm. I guess I was really hoping for something along the lines of /proc/fs/xfs/health that I could poll. Running badblocks or xfsdump would be very time and resource intensive. Clearly the kernel knows something is wrong since the errors appear in syslog. And while I appreciate the sentiment about monitoring the underlying devices and assuming the filesystem works, that's just not an acceptable solution for me.
– Evan
Jun 22 '17 at 12:45

add a comment |

According to https://access.redhat.com/solutions/3645252, you could set the sysctl fs.xfs.panic_mask=127 to make any error detected on any XFS filesystem become a system panic.

In order to persist to the reboot of this configure systems, do:

echo 'fs.xfs.panic_mask=127' > /etc/sysctl.d/01-xfs.conf

answered yesterday

Yan Li


xfs_repair -n

Usage: xfs_repair [options] device

Options:
  -f           The device is a file
  -L           Force log zeroing. Do this as a last resort.
  -l logdev    Specifies the device where the external log resides.
  -m maxmem    Maximum amount of memory to be used in megabytes.
  -n           No modify mode, just checks the filesystem for damage.
  -P           Disables prefetching.
  -r rtdev     Specifies the device where the realtime section resides.
  -v           Verbose output.
  -c subopts   Change filesystem parameters - use xfs_admin.
  -o subopts   Override default behaviour, refer to man page.
  -t interval  Reporting interval in minutes.
  -d           Repair dangerously.
  -V           Reports version and exits.

man page:

-n     No modify mode.
       Specifies that xfs_repair should not modify the filesystem but
       should only scan the filesystem and indicate what repairs would have been made.

The filesystem must be unmounted to do this, but to monitor it, here is what you would periodically do manually:

# df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sdc2       550G  152G  371G  30% /            {ext3}
udev            253G  216K  253G   1% /dev
tmpfs           253G  5.5M  253G   1% /dev/shm
/dev/sdc1       195M   13M  183M   7% /boot/efi
/dev/sda1       5.0T  4.9T   99G  99% /data         {xfs}
/dev/sdb1       559G   67G  492G  12% /scratch
tmpfs           450G     0  450G   0% /ramdisk
/dev/sdd1       5.0T  4.9T  9.8G 100% /bkup         {xfs}



How do I find the filesystem types?



# mount
/dev/sdc2 on / type ext3 (rw,acl,user_xattr)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
debugfs on /sys/kernel/debug type debugfs (rw)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,mode=1777)
devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
/dev/sdc1 on /boot/efi type vfat (rw,umask=0002,utf8=true)
/dev/sda1 on /data type xfs (rw)
/dev/sdb1 on /scratch type xfs (rw)
fusectl on /sys/fs/fuse/connections type fusectl (rw)
securityfs on /sys/kernel/security type securityfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
tmpfs on /ramdisk type tmpfs (rw,size=450G)
nfsd on /proc/fs/nfsd type nfsd (rw)
/dev/sdd1 on /bkup type xfs (rw)
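
Picking the XFS entries out of that listing by eye does not scale to a monitoring job. A small sketch of one way to script it, assuming the usual /proc/mounts layout where the third field is the filesystem type (the function name list_xfs_mounts is made up for this example):

```shell
# list_xfs_mounts [FILE]
# Print "device mountpoint" for every XFS entry. FILE defaults to
# /proc/mounts, whose fields are: device, mountpoint, fstype, options, ...
list_xfs_mounts() {
    awk '$3 == "xfs" { print $1, $2 }' "${1:-/proc/mounts}"
}
```

On the machine shown above this would be expected to list /dev/sda1, /dev/sdb1 and /dev/sdd1. Where util-linux is installed, `findmnt -t xfs` gives a similar listing without any scripting.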



# xfs_repair -n /dev/sdd1
xfs_repair: /dev/sdd1 contains a mounted and writable filesystem

fatal error -- couldn't initialize XFS library

# umount /bkup/
# xfs_repair -n /dev/sdd1

Phase 1 - find and verify superblock...
Phase 2 - using internal log
        - scan filesystem freespace and inode maps...
        - found root inode chunk
Phase 3 - for each AG...
        - scan (but don't clear) agi unlinked lists...
        - process known inodes and perform inode discovery...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - process newly discovered inodes...
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 4
        - agno = 3
        - agno = 1
        - agno = 2
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.



This "xfs_repair -n" output is from a healthy XFS filesystem that has been problem-free for years.
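
For scripting, the exit status is easier to act on than the phase output: the xfs_repair man page states that with -n it returns 1 if corruption was detected and 0 if not. A rough sketch of a wrapper, assuming the device is already unmounted; the function name check_xfs and the XFS_REPAIR override variable are hypothetical and exist only for this example:

```shell
# check_xfs DEVICE
# Run a no-modify check on an unmounted XFS device and report the result.
# XFS_REPAIR may name an alternative binary (hypothetical hook for testing).
check_xfs() {
    dev=$1
    if "${XFS_REPAIR:-xfs_repair}" -n "$dev" >/dev/null 2>&1; then
        echo "OK: $dev"
    else
        # Non-zero also covers "could not run", e.g. the device is mounted.
        echo "PROBLEM: $dev"
        return 1
    fi
}
```

Calling `check_xfs /dev/sdd1` after `umount /bkup/` would mirror the manual steps above. Because a mounted filesystem also makes xfs_repair exit non-zero, a PROBLEM result means "corrupt or not checkable", not strictly "corrupt".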

edited yesterday

answered yesterday

ron
