How do YOU detect a failed XFS filesystem?
Currently I monitor for a failed filesystem (as a result of a failed disk, controller, whatever) by checking syslog for messages like this:
2017-06-15T17:18:10.081665+00:00 2017-06-15T17:18:10+00:00 localhost kernel: [1381844.448488] blk_update_request: critical target error, dev sdj, sector 97672656
2017-06-15T17:18:10.724329+00:00 2017-06-15T17:18:10+00:00 localhost kernel: [1381845.047871] XFS (md0): metadata I/O error: block 0x2baa81400 ("xlog_iodone") error 121 numblks 512
2017-06-15T17:18:10.724329+00:00 2017-06-15T17:18:10+00:00 localhost kernel: [1381845.124418] XFS (md0): xfs_do_force_shutdown(0x2) called from line 1177 of file /build/linux-lts-wily-8ENwT0/linux-lts-wily-4.2.0/fs/xfs/xfs_log.c. Return address = 0xffffffffc050e100
2017-06-15T17:18:10.724349+00:00 2017-06-15T17:18:10+00:00 localhost kernel: [1381845.124425] XFS (md0): Log I/O Error Detected. Shutting down filesystem
2017-06-15T17:18:10.724349+00:00 2017-06-15T17:18:10+00:00 localhost kernel: [1381845.124452] XFS (md0): xfs_log_force: error -5 returned.
2017-06-15T17:18:10.724354+00:00 2017-06-15T17:18:10+00:00 localhost kernel: [1381845.163480] XFS (md0): Please umount the filesystem and rectify the problem(s)
2017-06-15T17:18:40.612572+00:00 2017-06-15T17:18:40+00:00 localhost kernel: [1381875.074647] XFS (md0): xfs_log_force: error -5 returned.
2017-06-15T17:19:10.612554+00:00 2017-06-15T17:19:10+00:00 localhost kernel: [1381905.101606] XFS (md0): xfs_log_force: error -5 returned.
2017-06-15T17:19:40.612558+00:00 2017-06-15T17:19:40+00:00 localhost kernel: [1381935.128546] XFS (md0): xfs_log_force: error -5 returned.
This is ok but I'd really like a more canonical check. The only thing I can think of is to write a script that attempts to write a file to disk and fires off an alarm if it can't for any reason. However, that seems like it's prone to false positives - there are several reasons why a file might not be able to be written, not just a failed filesystem.
Aside from grepping the logs or writing a canary file to disk, how can this be monitored?
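For reference, the log check described above can be sketched as a small shell filter. This is only an illustration and not part of the original setup: the function name xfs_shutdown_devices is made up, and the patterns are taken from the messages quoted above, so they may need adjusting for other kernel versions.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log lines on stdin and print the name of
# every XFS device that reported a forced shutdown (one name per line).
# The patterns come from the log excerpt above; other kernels may word
# these messages differently.
xfs_shutdown_devices() {
    grep -E 'XFS \([^)]+\): (xfs_do_force_shutdown|.*Shutting down filesystem)' |
        sed -E 's/.*XFS \(([^)]+)\):.*/\1/' |
        sort -u
}

# Example use (assumes a systemd journal; adapt to your syslog setup):
#   journalctl -k --no-pager | xfs_shutdown_devices
```

An empty result means no shutdown message was seen in the input, which is not the same as the file system being healthy.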
filesystems monitoring xfs
You could configure Nagios; they may have a module in place that can check the status of XFS...
– ryekayo
Jun 15 '17 at 19:48
@ryekayo A quick perusal of exchange.nagios.org/directory/Plugins/System-Metrics/… did not uncover anything for monitoring the health of XFS file systems directly, though some that probably could be pressed into service.
– a CVn
Jun 15 '17 at 20:57
asked Jun 15 '17 at 19:34
Evan
3 Answers
Hmmm. How do I detect a failed XFS filesystem?
I've been using XFS for ages. But... I guess I don't detect it at all. If it mounts, I trust it works. That's how most people do it: filesystem checks are automated, and if the machine boots and is up and running, that's that.
Now, don't get me wrong. I actually do a ton of monitoring, but none of it is filesystem specific. I run SMART selftests (using select,cont to do a disk segment per day, because long simply takes too long). I run RAID checks (also in segments) and also check that there are no mismatches in parity (mismatch_cnt = 0). I get instant mail notification if any of those fail and I actually replace HDDs once they start reallocating sectors (or at least, no longer trust them with important data).
So I have monitoring to make sure the storage works as it should. This covers errors inside the drives themselves (SMART) and to some extent also errors on a higher level (RAID checks in a way also test controllers, cables, RAID logic, ...).
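The storage-level checks described above could be sketched roughly like this. It is a minimal illustration under assumptions, not my actual scripts: the helper names are made up, the device names are examples, and the optional sysfs-root argument exists only so the functions can be tried against test data.

```shell
#!/bin/sh
# Hypothetical helper: print the mismatch count of an md array.
# $1 = array name (e.g. md0), $2 = sysfs root (defaults to /sys; the
# second argument exists only so the function can be tried on test data).
md_mismatch_cnt() {
    cat "${2:-/sys}/block/$1/md/mismatch_cnt"
}

# Hypothetical helper: succeed only if the array reports zero mismatches.
md_is_clean() {
    [ "$(md_mismatch_cnt "$1" "$2")" = "0" ]
}

# Roughly what a daily job might run as root (device names are examples):
#   smartctl -t select,cont /dev/sda            # continue the segmented self-test
#   echo check > /sys/block/md0/md/sync_action  # start a RAID consistency check
#   md_is_clean md0 || echo "md0: mismatch_cnt is non-zero" | mail -s "RAID check" root
```

mismatch_cnt is only meaningful once a check has finished, so a real job would read it after the check completes rather than right after starting it.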
As long as that works fine, the filesystem better be fine, too. Outside of checksumming filesystems like ZFS/btrfs (maybe XFS in the future, too) it's not really part of the concept to run checks on a filesystem level while mounted, apart from whatever sanity checks the filesystem itself does internally.
Your output suggests you're running RAID too, and had a failed disk in that RAID; even so, that really should not cause errors on md0, unless it was RAID without redundancy (RAID0 or an already degraded RAID1/5/6/10).
You should fix your problems below the filesystem layer first. You can hardly blame XFS for disk errors and that's not how you check for disk errors.
I guess if you really wanted to run a full read test on top of the filesystem, you could do an xfsdump to a backup disk... if you're doing a full read test of your filesystem anyhow, might as well do it in a way that's meaningful somehow.
It's the nature of xfsdump to walk the XFS filesystem in its entirety and store all the files. So that should come as close as possible to a full read test, not including free space.
Of course, if you're already running another backup system, that's really the same story in a filesystem-agnostic way (and if that backup system encounters read errors that aren't just lack of permissions, it better send a mail report to you, too), although of course if it's an incremental backup, without periodic full backups it won't actually read a file more than once...
But in general, we trust filesystems to "just work" as long as the storage is known to work. While it would be nice to have each and every program, without exception, escalate any and all I/O errors it encounters, I'm not aware of a general-purpose solution that actually does so. Each program does its own error handling.
Shouldn't something like making sure xfsdump -l 0 - /dev/md0 >/dev/null && echo sane terminates tell you whether the file system itself is sane? You could then do a read-only badblocks pass (or dd, for that matter) to make sure that all parts of the disk are at least readable.
– a CVn
Jun 15 '17 at 20:52
Yes, only it might be more meaningful to keep the result than just toss it away. badblocks is one such example: if I have reason to believe there actually are read errors, I never run badblocks. I run ddrescue. Of course badblocks can do write tests too, but you don't really do that when there is data you need.
– frostschutz
Jun 15 '17 at 21:11
True; it really depends on how thorough you want to be. Though it sounds like OP is mainly interested in reactive monitoring, as evidenced by the discussion on monitoring system logs for I/O-related errors. I use ZFS myself (not XFS), with redundancy and regular file system scrubs as well as SMART and system log monitoring, which ensures that developing problems are likely to be caught early, hopefully while still recoverable without more effort than a simple disk replacement.
– a CVn
Jun 15 '17 at 21:19
Hmm. I guess I was really hoping for something along the lines of /proc/fs/xfs/health that I could poll. Running badblocks or xfsdump would be very time and resource intensive. Clearly the kernel knows something is wrong since the errors appear in syslog. And while I appreciate the sentiment about monitoring the underlying devices and assuming the filesystem works, that's just not an acceptable solution for me.
– Evan
Jun 22 '17 at 12:45
Many critical storage servers enable "panic on error", so that an error wouldn't have a chance to become bigger, to cause further data damage, or to serve corrupted data to the users. With panic on error, you could just watch for panic events or system down to detect file system errors.
Of course, if you don't have a redundant system, one server panicking would mean actual downtime. However, mission-critical systems must have redundancy. In fact, any data on a file system that shows this kind of I/O error should no longer be used, and the system should be disconnected as soon as possible so that a backup system can kick in. Serving no data is actually better than serving corrupt data in most cases.
According to https://access.redhat.com/solutions/3645252, you could set the sysctl fs.xfs.panic_mask=127 to make any error detected on any XFS filesystem become a system panic.
To make this setting persist across reboots, run:
echo 'fs.xfs.panic_mask=127' > /etc/sysctl.d/01-xfs.conf
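To confirm that the setting actually took effect, something like the following could be used. This is a sketch under assumptions: the helper names are invented, the optional procfs-root argument exists only for trying the functions on test data, and the value 127 is simply the one from the article linked above.

```shell
#!/bin/sh
# Hypothetical helper: print the current XFS panic mask.
# $1 = procfs root (defaults to /proc; overridable only for testing).
xfs_panic_mask() {
    cat "${1:-/proc}/sys/fs/xfs/panic_mask"
}

# Hypothetical helper: succeed if every bit of the wanted mask ($1) is set.
xfs_panic_mask_has() {
    current=$(xfs_panic_mask "$2") || return 1
    [ $(( current & $1 )) -eq "$1" ]
}

# Example (as root): apply the sysctl.d file without rebooting, then verify.
#   sysctl --system
#   xfs_panic_mask_has 127 || echo "fs.xfs.panic_mask is not set as expected"
```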
– Yan Li
xfs_repair -n
Usage: xfs_repair [options] device
Options:
-f The device is a file
-L Force log zeroing. Do this as a last resort.
-l logdev Specifies the device where the external log resides.
-m maxmem Maximum amount of memory to be used in megabytes.
-n No modify mode, just checks the filesystem for damage.
-P Disables prefetching.
-r rtdev Specifies the device where the realtime section resides.
-v Verbose output.
-c subopts Change filesystem parameters - use xfs_admin.
-o subopts Override default behaviour, refer to man page.
-t interval Reporting interval in minutes.
-d Repair dangerously.
-V Reports version and exits.
man page:
-n No modify mode.
Specifies that xfs_repair should not modify the filesystem but
should only scan the filesystem and indicate what repairs would have been made.
There is also xfs_check, but its man page says: check XFS filesystem consistency... Note that using xfs_check is NOT recommended. Please use xfs_repair -n instead, for better scalability and speed.
In /etc/fstab, a 1 or 2 in the sixth (last) column causes fsck to run on the file system at boot. Will that specifically be xfs_repair -n? No: for XFS the boot-time helper is fsck.xfs, which does nothing and simply exits successfully, because XFS replays its log at mount time instead.
You asked about detecting a failed file system. My interpretation is that a failed file system would not be mounted and would not be accessible at all. You would know without really having to check: you would notice that it is not mounted, and it would then fail to mount when you tried manually.
The file system must be unmounted to run xfs_repair -n, but to monitor it, here is what you would periodically do manually:
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sdc2 550G 152G 371G 30% / {ext3}
udev 253G 216K 253G 1% /dev
tmpfs 253G 5.5M 253G 1% /dev/shm
/dev/sdc1 195M 13M 183M 7% /boot/efi
/dev/sda1 5.0T 4.9T 99G 99% /data {xfs}
/dev/sdb1 559G 67G 492G 12% /scratch
tmpfs 450G 0 450G 0% /ramdisk
/dev/sdd1 5.0T 4.9T 9.8G 100% /bkup {xfs}
How do I find the file system types?
# mount
/dev/sdc2 on / type ext3 (rw,acl,user_xattr)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
debugfs on /sys/kernel/debug type debugfs (rw)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,mode=1777)
devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
/dev/sdc1 on /boot/efi type vfat (rw,umask=0002,utf8=true)
/dev/sda1 on /data type xfs (rw)
/dev/sdb1 on /scratch type xfs (rw)
fusectl on /sys/fs/fuse/connections type fusectl (rw)
securityfs on /sys/kernel/security type securityfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
tmpfs on /ramdisk type tmpfs (rw,size=450G)
nfsd on /proc/fs/nfsd type nfsd (rw)
/dev/sdd1 on /bkup type xfs (rw)
# xfs_repair -n /dev/sdd1
xfs_repair: /dev/sdd1 contains a mounted and writable filesystem
fatal error -- couldn't initialize XFS library
# umount /bkup/
# xfs_repair -n /dev/sdd1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan (but don't clear) agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 4
- agno = 3
- agno = 1
- agno = 2
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.
This "xfs_repair -n" output is from a good XFS file system that has been problem-free for years.
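The manual steps above could be wrapped in a couple of small shell functions. This is only a sketch: the function names are made up, and it relies on the documented behaviour that xfs_repair -n exits with status 1 when it detects corruption and 0 when it does not.

```shell
#!/bin/sh
# Hypothetical helper: list "device mountpoint" pairs for XFS file systems.
# $1 = mounts table (defaults to /proc/mounts; overridable for testing).
list_xfs_mounts() {
    awk '$3 == "xfs" { print $1, $2 }' "${1:-/proc/mounts}"
}

# Hypothetical helper: dry-run check of an unmounted XFS device.
# Relies on the documented behaviour that xfs_repair -n exits 1 when it
# finds corruption and 0 when it does not.
check_xfs_device() {
    if xfs_repair -n "$1" >/dev/null 2>&1; then
        echo "$1: no corruption reported"
    else
        echo "$1: xfs_repair -n reported problems (or could not run)"
        return 1
    fi
}

# Example (as root; the file system must be unmounted first, and the final
# mount assumes /bkup has an /etc/fstab entry):
#   umount /bkup && check_xfs_device /dev/sdd1; mount /bkup
```

Because the check needs the file system unmounted, this suits a maintenance window better than continuous monitoring.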
add a comment |
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "106"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f371381%2fhow-do-you-detect-a-failed-xfs-filesystem%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
Hmmm. How do I detect a failed XFS filesystem?
I've been using XFS for ages. But... I guess I don't detect it, at all. If it mounts, I trust it works. That's how most people do it... filesystem checks are automated, if it boots and it's up and running, that's that.
Now, don't get me wrong. I actually do a ton of monitoring, but none of it is filesystem specific. I run SMART selftests (using select,cont to do a disk segment per day, because long simply takes too long). I run RAID checks (also in segments) and also check that there are no mismatches in parity (mismatch_cnt = 0). I get instant mail notification if any of those fail and I actually replace HDDs once they start reallocating sectors (or at least, no longer trust them with important data).
So I have monitoring to make sure the storage works as it should. This covers errors inside the drives themselves (SMART) and to some extent also errors on a higher level (RAID checks in a way also test controllers, cables, RAID logic, ...).
As long as that works fine, the filesystem better be fine, too. Outside of checksumming filesystems like ZFS/btrfs (maybe XFS in the future, too) it's not really part of the concept to run checks on a filesystem level while mounted, apart from whatever sanity checks the filesystem itself does internally.
Your output suggests you're running RAID too, and had a failed disk in that RAID; even so that really should not cause errors happening on md0, unless it was RAID without redundancy (RAID0 or already degraded RAID1/5/6/10).
You should fix your problems below the filesystem layer first. You can hardly blame XFS for disk errors and that's not how you check for disk errors.
I guess if you really wanted to run a full read test on top of the filesystem, you could do an xfsdump to a backup disk... if you're doing a full read test of your filesystem anyhow, might as well do it in a way that's meaningful somehow.
It's the nature of xfsdump to walk the XFS filesystem in its entirety and store all the files. So that should come as close as possible to a full read test, not including free space.
Of course, if you're already running another backup system, that's really the same story in a filesystem-agnostic way (and if that backup system encounters read errors that aren't just lack of permissions, it better send a mail report to you, too), although of course if it's an incremental backup, without periodic full backups it won't actually read a file more than once...
But in general, we trust filesystems to "just work" as long as the storage is known to work. While it would be nice to have each and every program without exception to elevate any and all I/O errors it encounters, I'm not aware of a generic purpose solution to actually do so. Each program does its own error handling.
Shouldn't something like making surexfsdump -l 0 - /dev/md0 >/dev/null && echo saneterminates tell you whether the file system itself is sane? You could then do a read-onlybadblockspass (ordd, for that matter) to make sure that all parts of the disk are at least readable.
– a CVn
Jun 15 '17 at 20:52
Yes, only it might be more meaningful to keep the result than just toss it away.badblocksis one such example, if I have reason to believe there actually are read errors, I never runbadblocks. I runddrescue. Of coursebadblockscan do write tests too, but you don't really do that when there is data you need.
– frostschutz
Jun 15 '17 at 21:11
True; it really depends on how thorough you want to be. Though it sounds like OP is mainly interested in reactive monitoring, as evidenced by the discussion on monitoring system logs for I/O-related errors. I use ZFS myself (not XFS), with redundancy and regular file system scrubs as well as SMART and system log monitoring, which ensures that developing problems are likely to be caught early, hopefully while still recoverable without more effort than a simple disk replacement.
– a CVn
Jun 15 '17 at 21:19
1
Hmm. I guess I was really hoping for something along the lines of/proc/fs/xfs/healththat I could poll. Running badblocks or xfsdump would be very time and resource intensive. Clearly the kernel knows something is wrong since the errors appear in syslog. And while I appreciate the sentiment about monitoring the underlying devices and assuming the filesystem works, that's just not an acceptable solution for me.
– Evan
Jun 22 '17 at 12:45
add a comment |
Hmmm. How do I detect a failed XFS filesystem?
I've been using XFS for ages. But... I guess I don't detect it, at all. If it mounts, I trust it works. That's how most people do it... filesystem checks are automated, if it boots and it's up and running, that's that.
Now, don't get me wrong. I actually do a ton of monitoring, but none of it is filesystem specific. I run SMART selftests (using select,cont to do a disk segment per day, because long simply takes too long). I run RAID checks (also in segments) and also check that there are no mismatches in parity (mismatch_cnt = 0). I get instant mail notification if any of those fail and I actually replace HDDs once they start reallocating sectors (or at least, no longer trust them with important data).
So I have monitoring to make sure the storage works as it should. This covers errors inside the drives themselves (SMART) and to some extent also errors on a higher level (RAID checks in a way also test controllers, cables, RAID logic, ...).
As long as that works fine, the filesystem better be fine, too. Outside of checksumming filesystems like ZFS/btrfs (maybe XFS in the future, too) it's not really part of the concept to run checks on a filesystem level while mounted, apart from whatever sanity checks the filesystem itself does internally.
Your output suggests you're running RAID too, and had a failed disk in that RAID; even so that really should not cause errors happening on md0, unless it was RAID without redundancy (RAID0 or already degraded RAID1/5/6/10).
You should fix your problems below the filesystem layer first. You can hardly blame XFS for disk errors and that's not how you check for disk errors.
I guess if you really wanted to run a full read test on top of the filesystem, you could do an xfsdump to a backup disk... if you're doing a full read test of your filesystem anyhow, might as well do it in a way that's meaningful somehow.
It's the nature of xfsdump to walk the XFS filesystem in its entirety and store all the files. So that should come as close as possible to a full read test, not including free space.
Of course, if you're already running another backup system, that's really the same story in a filesystem-agnostic way (and if that backup system encounters read errors that aren't just lack of permissions, it better send a mail report to you, too), although of course if it's an incremental backup, without periodic full backups it won't actually read a file more than once...
But in general, we trust filesystems to "just work" as long as the storage is known to work. While it would be nice to have each and every program without exception to elevate any and all I/O errors it encounters, I'm not aware of a generic purpose solution to actually do so. Each program does its own error handling.
Shouldn't something like making surexfsdump -l 0 - /dev/md0 >/dev/null && echo saneterminates tell you whether the file system itself is sane? You could then do a read-onlybadblockspass (ordd, for that matter) to make sure that all parts of the disk are at least readable.
– a CVn
Jun 15 '17 at 20:52
Yes, only it might be more meaningful to keep the result than just toss it away.badblocksis one such example, if I have reason to believe there actually are read errors, I never runbadblocks. I runddrescue. Of coursebadblockscan do write tests too, but you don't really do that when there is data you need.
– frostschutz
Jun 15 '17 at 21:11
True; it really depends on how thorough you want to be. Though it sounds like OP is mainly interested in reactive monitoring, as evidenced by the discussion on monitoring system logs for I/O-related errors. I use ZFS myself (not XFS), with redundancy and regular file system scrubs as well as SMART and system log monitoring, which ensures that developing problems are likely to be caught early, hopefully while still recoverable without more effort than a simple disk replacement.
– a CVn
Jun 15 '17 at 21:19
1
Hmm. I guess I was really hoping for something along the lines of/proc/fs/xfs/healththat I could poll. Running badblocks or xfsdump would be very time and resource intensive. Clearly the kernel knows something is wrong since the errors appear in syslog. And while I appreciate the sentiment about monitoring the underlying devices and assuming the filesystem works, that's just not an acceptable solution for me.
– Evan
Jun 22 '17 at 12:45
add a comment |
Hmmm. How do I detect a failed XFS filesystem?
I've been using XFS for ages. But... I guess I don't detect it, at all. If it mounts, I trust it works. That's how most people do it... filesystem checks are automated, if it boots and it's up and running, that's that.
Now, don't get me wrong. I actually do a ton of monitoring, but none of it is filesystem specific. I run SMART selftests (using select,cont to do a disk segment per day, because long simply takes too long). I run RAID checks (also in segments) and also check that there are no mismatches in parity (mismatch_cnt = 0). I get instant mail notification if any of those fail and I actually replace HDDs once they start reallocating sectors (or at least, no longer trust them with important data).
So I have monitoring to make sure the storage works as it should. This covers errors inside the drives themselves (SMART) and to some extent also errors on a higher level (RAID checks in a way also test controllers, cables, RAID logic, ...).
As long as that works fine, the filesystem better be fine, too. Outside of checksumming filesystems like ZFS/btrfs (maybe XFS in the future, too) it's not really part of the concept to run checks on a filesystem level while mounted, apart from whatever sanity checks the filesystem itself does internally.
Your output suggests you're running RAID too, and had a failed disk in that RAID; even so that really should not cause errors happening on md0, unless it was RAID without redundancy (RAID0 or already degraded RAID1/5/6/10).
You should fix your problems below the filesystem layer first. You can hardly blame XFS for disk errors and that's not how you check for disk errors.
I guess if you really wanted to run a full read test on top of the filesystem, you could do an xfsdump to a backup disk... if you're doing a full read test of your filesystem anyhow, might as well do it in a way that's meaningful somehow.
It's the nature of xfsdump to walk the XFS filesystem in its entirety and store all the files. So that should come as close as possible to a full read test, not including free space.
Of course, if you're already running another backup system, that's really the same story in a filesystem-agnostic way (and if that backup system encounters read errors that aren't just lack of permissions, it better send a mail report to you, too), although of course if it's an incremental backup, without periodic full backups it won't actually read a file more than once...
But in general, we trust filesystems to "just work" as long as the storage is known to work. While it would be nice to have each and every program, without exception, escalate any and all I/O errors it encounters, I'm not aware of a general-purpose solution to actually do so. Each program does its own error handling.
answered Jun 15 '17 at 20:04
frostschutz
25.7k15280
Shouldn't something like making sure xfsdump -l 0 - /dev/md0 >/dev/null && echo sane terminates tell you whether the file system itself is sane? You could then do a read-only badblocks pass (or dd, for that matter) to make sure that all parts of the disk are at least readable.
– a CVn
Jun 15 '17 at 20:52
Yes, only it might be more meaningful to keep the result than just toss it away. badblocks is one such example, if I have reason to believe there actually are read errors, I never run badblocks. I run ddrescue. Of course badblocks can do write tests too, but you don't really do that when there is data you need.
– frostschutz
Jun 15 '17 at 21:11
True; it really depends on how thorough you want to be. Though it sounds like OP is mainly interested in reactive monitoring, as evidenced by the discussion on monitoring system logs for I/O-related errors. I use ZFS myself (not XFS), with redundancy and regular file system scrubs as well as SMART and system log monitoring, which ensures that developing problems are likely to be caught early, hopefully while still recoverable without more effort than a simple disk replacement.
– a CVn
Jun 15 '17 at 21:19
Hmm. I guess I was really hoping for something along the lines of /proc/fs/xfs/health that I could poll. Running badblocks or xfsdump would be very time and resource intensive. Clearly the kernel knows something is wrong since the errors appear in syslog. And while I appreciate the sentiment about monitoring the underlying devices and assuming the filesystem works, that's just not an acceptable solution for me.
– Evan
Jun 22 '17 at 12:45
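Since the kernel does log the shutdown, one pollable approximation is to scan the kernel log for the shutdown messages quoted in the question. The sketch below is only that: the two match strings come from the log excerpt in the question, other kernel versions may word these messages differently, and the function name xfs_shutdown_devices is invented for this example.

```shell
#!/bin/sh
# Sketch only: read kernel log lines on stdin and print each XFS device name
# that logged a forced shutdown. The patterns are taken from the question's
# log excerpt; treat them as assumptions for any other kernel version.
xfs_shutdown_devices() {
    sed -n \
        -e 's/.*XFS (\([^)]*\)): .*xfs_do_force_shutdown.*/\1/p' \
        -e 's/.*XFS (\([^)]*\)): .*Shutting down filesystem.*/\1/p' |
        sort -u
}

# Example use:
#   dmesg | xfs_shutdown_devices
#   journalctl -k | xfs_shutdown_devices
```

An empty result only means no matching message was found in the lines supplied, not that the filesystem is healthy.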
Many critical storage servers enable "panic on error", so that an error wouldn't have a chance to become bigger, to cause further data damage, or to serve corrupted data to the users. With panic on error, you could just watch for panic events or system down to detect file system errors.
Of course, if you don't have a redundant system, one server panicking would mean actual downtime. However, mission-critical systems must have redundancy. In fact, any data on a file system that shows this kind of I/O error should no longer be used, and the system should be disconnected ASAP so that a backup system can kick in. Serving no data is actually better than serving corrupt data in most cases.
According to https://access.redhat.com/solutions/3645252, you could set the sysctl fs.xfs.panic_mask=127 to make any error detected on any XFS filesystem become a system panic.
To make this setting persist across reboots, do:
echo 'fs.xfs.panic_mask=127' > /etc/sysctl.d/01-xfs.conf
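For what it's worth, panic_mask is a bitmask rather than a simple on/off switch. The sketch below lists which tags a given value turns on. The tag names and bit order are reproduced from memory of the kernel's XFS documentation, so check them against the documentation for your kernel version; the helper name xfs_panic_mask_tags is made up for this example.

```shell
#!/bin/sh
# Sketch only: print the XFS_PTAG_* names enabled by a panic_mask value.
# The tag list and its order are an assumption based on the kernel's XFS
# admin guide; the last tag is not accepted by older kernels.
xfs_panic_mask_tags() {
    mask=$1
    bit=1
    for tag in IFLUSH LOGRES AILDELETE ERROR_REPORT SHUTDOWN_CORRUPT \
               SHUTDOWN_IOERROR SHUTDOWN_LOGERROR FSBLOCK_ZERO VERIFIER_ERROR; do
        if [ $(( mask & bit )) -ne 0 ]; then
            echo "XFS_PTAG_$tag"
        fi
        bit=$(( bit * 2 ))
    done
}

# Example use:
#   xfs_panic_mask_tags 127
#   xfs_panic_mask_tags "$(sysctl -n fs.xfs.panic_mask)"
```

If that list is right, 127 covers the first seven tags, including the three shutdown tags, but not every condition the documentation lists. The same documentation describes the option as intended for debugging, which is worth weighing before enabling it in production.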
answered yesterday
Yan Li
1
xfs_repair -n
Usage: xfs_repair [options] device
Options:
-f The device is a file
-L Force log zeroing. Do this as a last resort.
-l logdev Specifies the device where the external log resides.
-m maxmem Maximum amount of memory to be used in megabytes.
-n No modify mode, just checks the filesystem for damage.
-P Disables prefetching.
-r rtdev Specifies the device where the realtime section resides.
-v Verbose output.
-c subopts Change filesystem parameters - use xfs_admin.
-o subopts Override default behaviour, refer to man page.
-t interval Reporting interval in minutes.
-d Repair dangerously.
-V Reports version and exits.
man page:
-n No modify mode.
Specifies that xfs_repair should not modify the filesystem but
should only scan the filesystem and indicate what repairs would have been made.
There is also xfs_check, but its man page says: check XFS filesystem consistency... Note that using xfs_check is NOT recommended. Please use xfs_repair -n instead, for better scalability and speed.
In /etc/fstab, a 1 or 2 in the sixth (last) column requests a file system check (fsck) at mount time, which would happen on every boot. Will it specifically be xfs_repair -n? I don't know.
You asked about detecting a failed file system. My interpretation is that if it has failed, it would not be mounted and would not be accessible at all. You would know without really having to check: it would be obvious once you notice it is not mounted and it then rudely fails to mount when you try manually.
The file system must be unmounted to run xfs_repair -n, but to monitor it, here is what you would periodically do by hand:
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sdc2 550G 152G 371G 30% / {ext3}
udev 253G 216K 253G 1% /dev
tmpfs 253G 5.5M 253G 1% /dev/shm
/dev/sdc1 195M 13M 183M 7% /boot/efi
/dev/sda1 5.0T 4.9T 99G 99% /data {xfs}
/dev/sdb1 559G 67G 492G 12% /scratch
tmpfs 450G 0 450G 0% /ramdisk
/dev/sdd1 5.0T 4.9T 9.8G 100% /bkup {xfs}
How do I find the file system types?
# mount
/dev/sdc2 on / type ext3 (rw,acl,user_xattr)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
debugfs on /sys/kernel/debug type debugfs (rw)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,mode=1777)
devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
/dev/sdc1 on /boot/efi type vfat (rw,umask=0002,utf8=true)
/dev/sda1 on /data type xfs (rw)
/dev/sdb1 on /scratch type xfs (rw)
fusectl on /sys/fs/fuse/connections type fusectl (rw)
securityfs on /sys/kernel/security type securityfs (rw)
none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
tmpfs on /ramdisk type tmpfs (rw,size=450G)
nfsd on /proc/fs/nfsd type nfsd (rw)
/dev/sdd1 on /bkup type xfs (rw)
# xfs_repair -n /dev/sdd1
xfs_repair: /dev/sdd1 contains a mounted and writable filesystem
fatal error -- couldn't initialize XFS library
# umount /bkup/
# xfs_repair -n /dev/sdd1
Phase 1 - find and verify superblock...
Phase 2 - using internal log
- scan filesystem freespace and inode maps...
- found root inode chunk
Phase 3 - for each AG...
- scan (but don't clear) agi unlinked lists...
- process known inodes and perform inode discovery...
- agno = 0
- agno = 1
- agno = 2
- agno = 3
- agno = 4
- process newly discovered inodes...
Phase 4 - check for duplicate blocks...
- setting up duplicate extent list...
- check for inodes claiming duplicate blocks...
- agno = 0
- agno = 4
- agno = 3
- agno = 1
- agno = 2
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity...
- traversing filesystem ...
- traversal finished ...
- moving disconnected inodes to lost+found ...
Phase 7 - verify link counts...
No modify flag set, skipping filesystem flush and exiting.
This "xfs_repair -n" output is from a good XFS file system that has been problem free for years.
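If you wanted to script the manual steps above, a minimal sketch could key off the exit status: per the xfs_repair man page, a run with -n returns 1 if corruption was detected and 0 if not. The function name check_xfs and the XFS_REPAIR override variable are hypothetical, added only so the wrapper can be tried without a real device.

```shell
#!/bin/sh
# Sketch only: run a no-modify check and turn the exit status into a message.
# XFS_REPAIR is a hypothetical override for the command name, used for testing.
check_xfs() {
    dev=$1
    if "${XFS_REPAIR:-xfs_repair}" -n "$dev" >/dev/null 2>&1; then
        echo "OK: $dev"
    else
        # Non-zero also covers "could not run", e.g. the device is still mounted.
        echo "PROBLEM or not checkable: $dev"
        return 1
    fi
}

# Example use, after unmounting (device name taken from the transcript above):
#   umount /bkup && check_xfs /dev/sdd1
```

As the transcript shows, xfs_repair refuses to run on a mounted, writable filesystem, so this only fits maintenance windows, not continuous monitoring.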
edited yesterday
answered yesterday
ron
8861714
You could configure nagios, they may have a module in place that can check the status of xfs...
– ryekayo
Jun 15 '17 at 19:48
@ryekayo A quick perusal of exchange.nagios.org/directory/Plugins/System-Metrics/… did not uncover anything for monitoring the health of XFS file systems directly, though there are some that could probably be pressed into service.
– a CVn
Jun 15 '17 at 20:57