How do YOU detect a failed XFS filesystem?













Currently I monitor for a failed filesystem (as a result of a failed disk, controller, whatever) by checking syslog for messages like this:



2017-06-15T17:18:10.081665+00:00    2017-06-15T17:18:10+00:00   localhost  kernel:  [1381844.448488] blk_update_request: critical target error, dev sdj, sector 97672656
2017-06-15T17:18:10.724329+00:00 2017-06-15T17:18:10+00:00 localhost kernel: [1381845.047871] XFS (md0): metadata I/O error: block 0x2baa81400 ("xlog_iodone") error 121 numblks 512
2017-06-15T17:18:10.724329+00:00 2017-06-15T17:18:10+00:00 localhost kernel: [1381845.124418] XFS (md0): xfs_do_force_shutdown(0x2) called from line 1177 of file /build/linux-lts-wily-8ENwT0/linux-lts-wily-4.2.0/fs/xfs/xfs_log.c. Return address = 0xffffffffc050e100
2017-06-15T17:18:10.724349+00:00 2017-06-15T17:18:10+00:00 localhost kernel: [1381845.124425] XFS (md0): Log I/O Error Detected. Shutting down filesystem
2017-06-15T17:18:10.724349+00:00 2017-06-15T17:18:10+00:00 localhost kernel: [1381845.124452] XFS (md0): xfs_log_force: error -5 returned.
2017-06-15T17:18:10.724354+00:00 2017-06-15T17:18:10+00:00 localhost kernel: [1381845.163480] XFS (md0): Please umount the filesystem and rectify the problem(s)
2017-06-15T17:18:40.612572+00:00 2017-06-15T17:18:40+00:00 localhost kernel: [1381875.074647] XFS (md0): xfs_log_force: error -5 returned.
2017-06-15T17:19:10.612554+00:00 2017-06-15T17:19:10+00:00 localhost kernel: [1381905.101606] XFS (md0): xfs_log_force: error -5 returned.
2017-06-15T17:19:40.612558+00:00 2017-06-15T17:19:40+00:00 localhost kernel: [1381935.128546] XFS (md0): xfs_log_force: error -5 returned.


This is OK, but I'd really like a more canonical check. The only thing I can think of is a script that attempts to write a file to disk and fires off an alarm if it can't for any reason. However, that seems prone to false positives: there are several reasons why a file might not be writable, not just a failed filesystem.
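The false-positive concern with a canary write can be narrowed somewhat by looking at the errno of the failure instead of treating every failure alike. Below is a minimal sketch in Python, assuming that a shut-down filesystem answers writes with EIO (the "error -5" in the log above is EIO). The canary file name and the category labels are made up for illustration.

```python
import errno
import os


def classify_write_error(exc):
    """Map an OSError from a canary write to a coarse category.

    Assumption: EIO is the interesting case; everything else (ENOSPC,
    EACCES, ENOENT, ...) is lumped together as some other cause.
    """
    if exc.errno == errno.EIO:
        return "io-error"
    return "other-error"


def canary_write(directory, name=".fs_canary"):
    """Write, fsync and remove a small file; return 'ok' or an error category."""
    path = os.path.join(directory, name)  # hypothetical canary file name
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
        try:
            os.write(fd, b"canary\n")
            os.fsync(fd)  # push the write through the page cache to the device
        finally:
            os.close(fd)
        os.unlink(path)
    except OSError as exc:
        return classify_write_error(exc)
    return "ok"
```

Only the "io-error" result would be worth paging on; a full disk or a permissions problem ends up as "other-error" and can be handled by a less urgent check.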



Aside from grepping the logs or writing a canary file to disk, how can this be monitored?











  • You could configure Nagios; it may have a module that can check the status of XFS...
    – ryekayo
    Jun 15 '17 at 19:48










  • @ryekayo A quick perusal of exchange.nagios.org/directory/Plugins/System-Metrics/… did not uncover anything for monitoring the health of XFS file systems directly, though some could probably be pressed into service.
    – a CVn
    Jun 15 '17 at 20:57

















filesystems monitoring xfs







asked Jun 15 '17 at 19:34









Evan


3 Answers


Hmmm. How do I detect a failed XFS filesystem?



I've been using XFS for ages, but I guess I don't detect it at all. If it mounts, I trust that it works. That's how most people do it: filesystem checks are automated, and if the system boots and is up and running, that's that.



Now, don't get me wrong. I actually do a ton of monitoring, but none of it is filesystem-specific. I run SMART self-tests (using select,cont to do a disk segment per day, because long simply takes too long). I run RAID checks (also in segments) and also check that there are no parity mismatches (mismatch_cnt = 0). I get instant mail notification if any of those fail, and I replace HDDs once they start reallocating sectors (or at least no longer trust them with important data).
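To illustrate the mismatch_cnt part: for Linux md arrays the counter is exposed in sysfs, so a check can be as small as reading one file. This is only a sketch, assuming the usual /sys/block/<array>/md/mismatch_cnt location and that a RAID check has been run beforehand so the counter is meaningful; the function names are made up for the example.

```python
from pathlib import Path


def read_mismatch_cnt(md_name, sysfs_root="/sys/block"):
    """Return the mismatch_cnt of an md array (e.g. "md0") as an int."""
    path = Path(sysfs_root) / md_name / "md" / "mismatch_cnt"
    return int(path.read_text().strip())


def raid_is_consistent(md_name, sysfs_root="/sys/block"):
    """True if the last check found no mismatches on the array."""
    return read_mismatch_cnt(md_name, sysfs_root) == 0
```

Something like raid_is_consistent("md0") could then be called from whatever sends the mail notification; the sysfs_root parameter exists only so the function can be pointed at a test directory.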



So I have monitoring to make sure the storage works as it should. This covers errors inside the drives themselves (SMART) and to some extent also errors on a higher level (RAID checks in a way also test controllers, cables, RAID logic, ...).



As long as that works fine, the filesystem better be fine, too. Outside of checksumming filesystems like ZFS/btrfs (maybe XFS in the future, too) it's not really part of the concept to run checks on a filesystem level while mounted, apart from whatever sanity checks the filesystem itself does internally.



Your output suggests you're running RAID too, and had a failed disk in that RAID. Even so, that really should not cause errors on md0, unless it was a RAID without redundancy (RAID0, or an already degraded RAID1/5/6/10).



You should fix your problems below the filesystem layer first. You can hardly blame XFS for disk errors and that's not how you check for disk errors.





I guess if you really wanted to run a full read test on top of the filesystem, you could do an xfsdump to a backup disk... if you're doing a full read test of your filesystem anyhow, might as well do it in a way that's meaningful somehow.



It's the nature of xfsdump to walk the XFS filesystem in its entirety and store all the files. So that should come as close as possible to a full read test, not including free space.



Of course, if you're already running another backup system, that's really the same story in a filesystem-agnostic way (and if that backup system encounters read errors that aren't just lack of permissions, it had better send you a mail report, too). Note, however, that an incremental backup without periodic full backups won't actually read a file more than once.





But in general, we trust filesystems to "just work" as long as the storage is known to work. While it would be nice if each and every program, without exception, escalated any and all I/O errors it encounters, I'm not aware of a general-purpose solution that actually does so. Each program does its own error handling.







  • Shouldn't something like making sure xfsdump -l 0 - /dev/md0 >/dev/null && echo sane terminates tell you whether the file system itself is sane? You could then do a read-only badblocks pass (or dd, for that matter) to make sure that all parts of the disk are at least readable.
    – a CVn
    Jun 15 '17 at 20:52










  • Yes, only it might be more meaningful to keep the result than to just toss it away. badblocks is one such example: if I have reason to believe there actually are read errors, I never run badblocks; I run ddrescue. Of course badblocks can do write tests too, but you don't really do that when there is data you need.
    – frostschutz
    Jun 15 '17 at 21:11












  • True; it really depends on how thorough you want to be. Though it sounds like OP is mainly interested in reactive monitoring, as evidenced by the discussion on monitoring system logs for I/O-related errors. I use ZFS myself (not XFS), with redundancy and regular file system scrubs as well as SMART and system log monitoring, which ensures that developing problems are likely to be caught early, hopefully while still recoverable without more effort than a simple disk replacement.
    – a CVn
    Jun 15 '17 at 21:19






  • Hmm. I guess I was really hoping for something along the lines of /proc/fs/xfs/health that I could poll. Running badblocks or xfsdump would be very time and resource intensive. Clearly the kernel knows something is wrong since the errors appear in syslog. And while I appreciate the sentiment about monitoring the underlying devices and assuming the filesystem works, that's just not an acceptable solution for me.
    – Evan
    Jun 22 '17 at 12:45






















Many critical storage servers enable "panic on error", so that an error has no chance to grow, cause further data damage, or lead to corrupted data being served to users. With panic on error, you can simply watch for panic events or the system going down to detect file system errors.



Of course, if you don't have a redundant system, one server panicking means actual downtime. However, mission-critical systems must have redundancy. In fact, any data on a file system that shows this kind of I/O error should no longer be used, and the system should be disconnected as soon as possible so that a backup system can kick in. Serving no data is actually better than serving corrupt data in most cases.



According to https://access.redhat.com/solutions/3645252, you could set the sysctl fs.xfs.panic_mask=127 to make any error detected on any XFS filesystem become a system panic.



To make this setting persist across reboots, do:



echo 'fs.xfs.panic_mask=127' > /etc/sysctl.d/01-xfs.conf
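Since a file under /etc/sysctl.d is normally only applied at boot (or when the sysctl settings are reloaded), it can be worth verifying what the running kernel actually has. A small sketch in Python, assuming the sysctl is visible at /proc/sys/fs/xfs/panic_mask as sysctls usually are; the helper names are invented for this example.

```python
from pathlib import Path


def read_panic_mask(path="/proc/sys/fs/xfs/panic_mask"):
    """Return the currently active fs.xfs.panic_mask value as an int."""
    return int(Path(path).read_text().split()[0])


def panic_mask_covers(current, wanted=127):
    """True if every bit set in `wanted` is also set in `current`."""
    return (current & wanted) == wanted
```

A monitoring check could alert when panic_mask_covers(read_panic_mask()) is false, which would mean the setting did not survive a reboot or was changed at runtime.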







Yan Li



























    xfs_repair -n



    Usage: xfs_repair [options] device

    Options:
    -f The device is a file
    -L Force log zeroing. Do this as a last resort.
    -l logdev Specifies the device where the external log resides.
    -m maxmem Maximum amount of memory to be used in megabytes.
    -n No modify mode, just checks the filesystem for damage.
    -P Disables prefetching.
    -r rtdev Specifies the device where the realtime section resides.
    -v Verbose output.
    -c subopts Change filesystem parameters - use xfs_admin.
    -o subopts Override default behaviour, refer to man page.
    -t interval Reporting interval in minutes.
    -d Repair dangerously.
    -V Reports version and exits.

    man page:
    -n No modify mode.
    Specifies that xfs_repair should not modify the filesystem but
    should only scan the filesystem and indicate what repairs would have been made.


    There is also xfs_check, but its man page says: check XFS filesystem consistency... Note that using xfs_check is NOT recommended. Please use xfs_repair -n instead, for better scalability and speed.



    In /etc/fstab, a 1 or 2 in the sixth (last) column causes fsck, the file system check, to be run on mount, which would happen on every boot. Will that specifically be xfs_repair -n? I don't know.



    You asked about detecting a failed file system. My interpretation is that if it has failed, it would not be mounted and would not be accessible at all. You would know without really having to check: it would be obvious when you notice that it is not mounted and it then rudely fails to mount when you try manually.



    The file system must be unmounted for this, but to monitor it, here is what you would periodically do manually:



    # df -h
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/sdc2       550G  152G  371G  30% /          {ext3}
    udev            253G  216K  253G   1% /dev
    tmpfs           253G  5.5M  253G   1% /dev/shm
    /dev/sdc1       195M   13M  183M   7% /boot/efi
    /dev/sda1       5.0T  4.9T   99G  99% /data      {xfs}
    /dev/sdb1       559G   67G  492G  12% /scratch
    tmpfs           450G     0  450G   0% /ramdisk
    /dev/sdd1       5.0T  4.9T  9.8G 100% /bkup      {xfs}

    How do I find the file system types?

    # mount
    /dev/sdc2 on / type ext3 (rw,acl,user_xattr)
    proc on /proc type proc (rw)
    sysfs on /sys type sysfs (rw)
    debugfs on /sys/kernel/debug type debugfs (rw)
    udev on /dev type tmpfs (rw,mode=0755)
    tmpfs on /dev/shm type tmpfs (rw,mode=1777)
    devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
    /dev/sdc1 on /boot/efi type vfat (rw,umask=0002,utf8=true)
    /dev/sda1 on /data type xfs (rw)
    /dev/sdb1 on /scratch type xfs (rw)
    fusectl on /sys/fs/fuse/connections type fusectl (rw)
    securityfs on /sys/kernel/security type securityfs (rw)
    none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
    tmpfs on /ramdisk type tmpfs (rw,size=450G)
    nfsd on /proc/fs/nfsd type nfsd (rw)
    /dev/sdd1 on /bkup type xfs (rw)

    # xfs_repair -n /dev/sdd1
    xfs_repair: /dev/sdd1 contains a mounted and writable filesystem

    fatal error -- couldn't initialize XFS library

    # umount /bkup/
    # xfs_repair -n /dev/sdd1

    Phase 1 - find and verify superblock...
    Phase 2 - using internal log
    - scan filesystem freespace and inode maps...
    - found root inode chunk
    Phase 3 - for each AG...
    - scan (but don't clear) agi unlinked lists...
    - process known inodes and perform inode discovery...
    - agno = 0
    - agno = 1
    - agno = 2
    - agno = 3
    - agno = 4
    - process newly discovered inodes...
    Phase 4 - check for duplicate blocks...
    - setting up duplicate extent list...
    - check for inodes claiming duplicate blocks...
    - agno = 0
    - agno = 4
    - agno = 3
    - agno = 1
    - agno = 2
    No modify flag set, skipping phase 5
    Phase 6 - check inode connectivity...
    - traversing filesystem ...
    - traversal finished ...
    - moving disconnected inodes to lost+found ...
    Phase 7 - verify link counts...
    No modify flag set, skipping filesystem flush and exiting.

    This "xfs_repair -n" output is from a good XFS file system that has been problem-free for years.
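The manual steps above could be wrapped in a small script. The following is only a sketch in Python: it relies on xfs_repair -n exiting with status 0 when no corruption is found, treats any other status as needing attention, and skips devices that are still mounted, since xfs_repair refuses those as shown above. The function names and result labels are made up, and matching the device by its exact name in /proc/mounts is an assumption that will not hold if it is mounted under another name.

```python
import subprocess


def is_mounted(device, mounts_text):
    """True if `device` is the source of any entry in /proc/mounts-style text."""
    return any(line.split()[:1] == [device] for line in mounts_text.splitlines())


def check_xfs(device, mounts_text, run=subprocess.run):
    """Run `xfs_repair -n` on an unmounted device and label the outcome."""
    if is_mounted(device, mounts_text):
        return "skipped-mounted"
    # -n: no-modify mode, only reports what would be repaired
    result = run(["xfs_repair", "-n", device], capture_output=True, text=True)
    return "clean" if result.returncode == 0 else "needs-attention"
```

In real use, mounts_text would come from reading /proc/mounts, and the check would be called as check_xfs("/dev/sdd1", mounts_text) after unmounting /bkup.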





    share|improve this answer























      Your Answer








      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "106"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: false,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });














      draft saved

      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f371381%2fhow-do-you-detect-a-failed-xfs-filesystem%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      3 Answers
      3






      active

      oldest

      votes








      3 Answers
      3






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      1














      Hmmm. How do I detect a failed XFS filesystem?



      I've been using XFS for ages. But... I guess I don't detect it, at all. If it mounts, I trust it works. That's how most people do it... filesystem checks are automated, if it boots and it's up and running, that's that.



      Now, don't get me wrong. I actually do a ton of monitoring, but none of it is filesystem specific. I run SMART selftests (using select,cont to do a disk segment per day, because long simply takes too long). I run RAID checks (also in segments) and also check that there are no mismatches in parity (mismatch_cnt = 0). I get instant mail notification if any of those fail and I actually replace HDDs once they start reallocating sectors (or at least, no longer trust them with important data).



      So I have monitoring to make sure the storage works as it should. This covers errors inside the drives themselves (SMART) and to some extent also errors on a higher level (RAID checks in a way also test controllers, cables, RAID logic, ...).



      As long as that works fine, the filesystem better be fine, too. Outside of checksumming filesystems like ZFS/btrfs (maybe XFS in the future, too) it's not really part of the concept to run checks on a filesystem level while mounted, apart from whatever sanity checks the filesystem itself does internally.



      Your output suggests you're running RAID too, and had a failed disk in that RAID; even so that really should not cause errors happening on md0, unless it was RAID without redundancy (RAID0 or already degraded RAID1/5/6/10).



      You should fix your problems below the filesystem layer first. You can hardly blame XFS for disk errors and that's not how you check for disk errors.





      I guess if you really wanted to run a full read test on top of the filesystem, you could do an xfsdump to a backup disk... if you're doing a full read test of your filesystem anyhow, might as well do it in a way that's meaningful somehow.



      It's the nature of xfsdump to walk the XFS filesystem in its entirety and store all the files. So that should come as close as possible to a full read test, not including free space.



      Of course, if you're already running another backup system, that's really the same story in a filesystem-agnostic way (and if that backup system encounters read errors that aren't just lack of permissions, it better send a mail report to you, too), although of course if it's an incremental backup, without periodic full backups it won't actually read a file more than once...





      But in general, we trust filesystems to "just work" as long as the storage is known to work. While it would be nice to have each and every program without exception to elevate any and all I/O errors it encounters, I'm not aware of a generic purpose solution to actually do so. Each program does its own error handling.






      share|improve this answer





















      • Shouldn't something like making sure xfsdump -l 0 - /dev/md0 >/dev/null && echo sane terminates tell you whether the file system itself is sane? You could then do a read-only badblocks pass (or dd, for that matter) to make sure that all parts of the disk are at least readable.
        – a CVn
        Jun 15 '17 at 20:52










      • Yes, only it might be more meaningful to keep the result than just toss it away. badblocks is one such example, if I have reason to believe there actually are read errors, I never run badblocks. I run ddrescue. Of course badblocks can do write tests too, but you don't really do that when there is data you need.
        – frostschutz
        Jun 15 '17 at 21:11












      • True; it really depends on how thorough you want to be. Though it sounds like OP is mainly interested in reactive monitoring, as evidenced by the discussion on monitoring system logs for I/O-related errors. I use ZFS myself (not XFS), with redundancy and regular file system scrubs as well as SMART and system log monitoring, which ensures that developing problems are likely to be caught early, hopefully while still recoverable without more effort than a simple disk replacement.
        – a CVn
        Jun 15 '17 at 21:19






      • 1




        Hmm. I guess I was really hoping for something along the lines of /proc/fs/xfs/health that I could poll. Running badblocks or xfsdump would be very time and resource intensive. Clearly the kernel knows something is wrong since the errors appear in syslog. And while I appreciate the sentiment about monitoring the underlying devices and assuming the filesystem works, that's just not an acceptable solution for me.
        – Evan
        Jun 22 '17 at 12:45


















      1














      Hmmm. How do I detect a failed XFS filesystem?



      I've been using XFS for ages. But... I guess I don't detect it, at all. If it mounts, I trust it works. That's how most people do it... filesystem checks are automated, if it boots and it's up and running, that's that.



      Now, don't get me wrong. I actually do a ton of monitoring, but none of it is filesystem specific. I run SMART selftests (using select,cont to do a disk segment per day, because long simply takes too long). I run RAID checks (also in segments) and also check that there are no mismatches in parity (mismatch_cnt = 0). I get instant mail notification if any of those fail and I actually replace HDDs once they start reallocating sectors (or at least, no longer trust them with important data).



      So I have monitoring to make sure the storage works as it should. This covers errors inside the drives themselves (SMART) and to some extent also errors on a higher level (RAID checks in a way also test controllers, cables, RAID logic, ...).



      As long as that works fine, the filesystem better be fine, too. Outside of checksumming filesystems like ZFS/btrfs (maybe XFS in the future, too) it's not really part of the concept to run checks on a filesystem level while mounted, apart from whatever sanity checks the filesystem itself does internally.



      Your output suggests you're running RAID too, and had a failed disk in that RAID; even so that really should not cause errors happening on md0, unless it was RAID without redundancy (RAID0 or already degraded RAID1/5/6/10).



      You should fix your problems below the filesystem layer first. You can hardly blame XFS for disk errors and that's not how you check for disk errors.





      I guess if you really wanted to run a full read test on top of the filesystem, you could do an xfsdump to a backup disk... if you're doing a full read test of your filesystem anyhow, might as well do it in a way that's meaningful somehow.



      It's the nature of xfsdump to walk the XFS filesystem in its entirety and store all the files. So that should come as close as possible to a full read test, not including free space.



      Of course, if you're already running another backup system, that's really the same story in a filesystem-agnostic way (and if that backup system encounters read errors that aren't just lack of permissions, it better send a mail report to you, too), although of course if it's an incremental backup, without periodic full backups it won't actually read a file more than once...





      But in general, we trust filesystems to "just work" as long as the storage is known to work. While it would be nice to have each and every program without exception to elevate any and all I/O errors it encounters, I'm not aware of a generic purpose solution to actually do so. Each program does its own error handling.






      share|improve this answer





















      • Shouldn't something like making sure xfsdump -l 0 - /dev/md0 >/dev/null && echo sane terminates tell you whether the file system itself is sane? You could then do a read-only badblocks pass (or dd, for that matter) to make sure that all parts of the disk are at least readable.
        – a CVn
        Jun 15 '17 at 20:52










      • Yes, only it might be more meaningful to keep the result than just toss it away. badblocks is one such example, if I have reason to believe there actually are read errors, I never run badblocks. I run ddrescue. Of course badblocks can do write tests too, but you don't really do that when there is data you need.
        – frostschutz
        Jun 15 '17 at 21:11












      • True; it really depends on how thorough you want to be. Though it sounds like OP is mainly interested in reactive monitoring, as evidenced by the discussion on monitoring system logs for I/O-related errors. I use ZFS myself (not XFS), with redundancy and regular file system scrubs as well as SMART and system log monitoring, which ensures that developing problems are likely to be caught early, hopefully while still recoverable without more effort than a simple disk replacement.
        – a CVn
        Jun 15 '17 at 21:19






      • Hmm. I guess I was really hoping for something along the lines of /proc/fs/xfs/health that I could poll. Running badblocks or xfsdump would be very time and resource intensive. Clearly the kernel knows something is wrong since the errors appear in syslog. And while I appreciate the sentiment about monitoring the underlying devices and assuming the filesystem works, that's just not an acceptable solution for me.
        – Evan
        Jun 22 '17 at 12:45
















      answered Jun 15 '17 at 20:04









      frostschutz










































      Many critical storage servers enable "panic on error" so that an error has no chance to grow, cause further data damage, or serve corrupted data to users. With panic on error enabled, you can detect filesystem errors simply by watching for panic events or for the system going down.

      Of course, without a redundant system, one server panicking means actual downtime. However, mission-critical systems must have redundancy anyway. In fact, data on a filesystem that shows this kind of I/O error should no longer be used, and the system should be disconnected as soon as possible so that a backup system can take over. In most cases, serving no data is better than serving corrupt data.

      According to https://access.redhat.com/solutions/3645252, you can set the sysctl fs.xfs.panic_mask=127 to turn any error detected on any XFS filesystem into a system panic.

      To make this setting persist across reboots, run:

      echo 'fs.xfs.panic_mask=127' > /etc/sysctl.d/01-xfs.conf
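The file in /etc/sysctl.d is only read when sysctl settings are loaded, typically at boot. A minimal sketch for checking which value is in effect right now follows. It assumes the value is exposed at /proc/sys/fs/xfs/panic_mask, which is where sysctl keys under fs.xfs map to. The function name xfs_panic_mask and its optional path argument are my own, not part of any tool:

```shell
# Print the XFS panic mask currently in effect, or fail if it is unavailable.
# The path argument exists only so the function can be pointed at a test file.
xfs_panic_mask() {
    local path="${1:-/proc/sys/fs/xfs/panic_mask}"
    if [ ! -r "$path" ]; then
        echo "cannot read $path (is the xfs module loaded?)" >&2
        return 1
    fi
    cat "$path"
}

# Example (as root): apply the setting immediately and confirm it.
#   sysctl -w fs.xfs.panic_mask=127
#   xfs_panic_mask
```

A monitoring job could call this after boot and raise an alarm if the printed value is not the one you configured.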






          answered yesterday









          Yan Li















              xfs_repair -n



              Usage: xfs_repair [options] device

              Options:
              -f The device is a file
              -L Force log zeroing. Do this as a last resort.
              -l logdev Specifies the device where the external log resides.
              -m maxmem Maximum amount of memory to be used in megabytes.
              -n No modify mode, just checks the filesystem for damage.
              -P Disables prefetching.
              -r rtdev Specifies the device where the realtime section resides.
              -v Verbose output.
              -c subopts Change filesystem parameters - use xfs_admin.
              -o subopts Override default behaviour, refer to man page.
              -t interval Reporting interval in minutes.
              -d Repair dangerously.
              -V Reports version and exits.

              man page:
              -n No modify mode.
              Specifies that xfs_repair should not modify the filesystem but
              should only scan the filesystem and indicate what repairs would have been made.


              There is also xfs_check, but its man page says: "check XFS filesystem consistency... Note that using xfs_check is NOT recommended. Please use xfs_repair -n instead, for better scalability and speed."



              In /etc/fstab, a 1 or 2 in the sixth (last) column requests a filesystem check via fsck at boot. Will that specifically run xfs_repair -n? As far as I can tell, no: the fsck.xfs man page describes it as "do nothing, successfully", so an XFS filesystem is not really checked this way.



              You asked about detecting a failed filesystem. My interpretation is that a failed filesystem would not be mounted and would not be accessible at all. You would know without really having to check: it would be obvious once you noticed it was not mounted and it then refused to mount when you tried manually.



              The filesystem must be unmounted to run xfs_repair -n, but to monitor it, here is what you would periodically do by hand:



              # df -h
              Filesystem Size Used Avail Use% Mounted on
              /dev/sdc2 550G 152G 371G 30% / {ext3}
              udev 253G 216K 253G 1% /dev
              tmpfs 253G 5.5M 253G 1% /dev/shm
              /dev/sdc1 195M 13M 183M 7% /boot/efi
              /dev/sda1 5.0T 4.9T 99G 99% /data {xfs}
              /dev/sdb1 559G 67G 492G 12% /scratch
              tmpfs 450G 0 450G 0% /ramdisk
              /dev/sdd1 5.0T 4.9T 9.8G 100% /bkup {xfs}

              How do I find the filesystem types?

              # mount
              /dev/sdc2 on / type ext3 (rw,acl,user_xattr)
              proc on /proc type proc (rw)
              sysfs on /sys type sysfs (rw)
              debugfs on /sys/kernel/debug type debugfs (rw)
              udev on /dev type tmpfs (rw,mode=0755)
              tmpfs on /dev/shm type tmpfs (rw,mode=1777)
              devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
              /dev/sdc1 on /boot/efi type vfat (rw,umask=0002,utf8=true)
              /dev/sda1 on /data type xfs (rw)
              /dev/sdb1 on /scratch type xfs (rw)
              fusectl on /sys/fs/fuse/connections type fusectl (rw)
              securityfs on /sys/kernel/security type securityfs (rw)
              none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
              tmpfs on /ramdisk type tmpfs (rw,size=450G)
              nfsd on /proc/fs/nfsd type nfsd (rw)
              /dev/sdd1 on /bkup type xfs (rw)

              # xfs_repair -n /dev/sdd1
              xfs_repair: /dev/sdd1 contains a mounted and writable filesystem

              fatal error -- couldn't initialize XFS library

              # umount /bkup/
              # xfs_repair -n /dev/sdd1

              Phase 1 - find and verify superblock...
              Phase 2 - using internal log
              - scan filesystem freespace and inode maps...
              - found root inode chunk
              Phase 3 - for each AG...
              - scan (but don't clear) agi unlinked lists...
              - process known inodes and perform inode discovery...
              - agno = 0
              - agno = 1
              - agno = 2
              - agno = 3
              - agno = 4
              - process newly discovered inodes...
              Phase 4 - check for duplicate blocks...
              - setting up duplicate extent list...
              - check for inodes claiming duplicate blocks...
              - agno = 0
              - agno = 4
              - agno = 3
              - agno = 1
              - agno = 2
              No modify flag set, skipping phase 5
              Phase 6 - check inode connectivity...
              - traversing filesystem ...
              - traversal finished ...
              - moving disconnected inodes to lost+found ...
              Phase 7 - verify link counts...
              No modify flag set, skipping filesystem flush and exiting.

              This "xfs_repair -n" output is from a healthy XFS filesystem that has been problem-free for years.
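If you want to script the manual check above, a minimal sketch could rely on the exit status instead of reading the output. According to the xfs_repair man page, running it with -n returns 1 if corruption was detected and 0 if not. The function name check_xfs and the XFS_REPAIR override variable are my own inventions for illustration. Note that a non-zero status also covers the "mounted and writable" fatal error shown earlier, so the device still has to be unmounted first:

```shell
# Run a read-only check on an unmounted XFS device and report the result.
# XFS_REPAIR is a hypothetical override so the command can be stubbed in tests.
check_xfs() {
    local dev="$1"
    if "${XFS_REPAIR:-xfs_repair}" -n "$dev" >/dev/null 2>&1; then
        echo "OK: $dev"
    else
        echo "PROBLEM: $dev (xfs_repair -n exited non-zero)"
        return 1
    fi
}

# Example (as root, after umount /bkup):
#   check_xfs /dev/sdd1
```

Discarding the output keeps the alert short; drop the redirection if you would rather keep the phase-by-phase report for diagnosis.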





              share|improve this answer




























                0














                xfs_repair -n



                Usage: xfs_repair [options] device

                Options:
                -f The device is a file
                -L Force log zeroing. Do this as a last resort.
                -l logdev Specifies the device where the external log resides.
                -m maxmem Maximum amount of memory to be used in megabytes.
                -n No modify mode, just checks the filesystem for damage.
                -P Disables prefetching.
                -r rtdev Specifies the device where the realtime section resides.
                -v Verbose output.
                -c subopts Change filesystem parameters - use xfs_admin.
                -o subopts Override default behaviour, refer to man page.
                -t interval Reporting interval in minutes.
                -d Repair dangerously.
                -V Reports version and exits.

                man page:
                -n No modify mode.
                Specifies that xfs_repair should not modify the filesystem but
                should only scan the filesystem and indicate what repairs would have been made.


                and there is xfs_check but if you do a man page on it you will see: check XFS filesystem consistency... Note that using xfs_check is NOT recommended. Please use xfs_repair -n instead, for better scalability and speed.



                And in /etc/fstab the 6th or last column if it is a 1 or 2 causes fsck or file system check on mount which would happen on every boot... will it specifically be xfs_repair -n? I don't know.



                you asked about detecting a failed file system: my interpretation of that is if it is failed then it would not be mounted and not accessible at all... you would know without really having to check it would not be obvious when you notice it is not mounted, then rudely fails to mount when tried manually.



                Must be unmounted to do this, but to monitor here's what you would periodically do manually:



                # df -h
                Filesystem Size Used Avail Use% Mounted on
                /dev/sdc2 550G 152G 371G 30% / {ext3}
                udev 253G 216K 253G 1% /dev
                tmpfs 253G 5.5M 253G 1% /dev/shm
                /dev/sdc1 195M 13M 183M 7% /boot/efi
                /dev/sda1 5.0T 4.9T 99G 99% /data {xfs}
                /dev/sdb1 559G 67G 492G 12% /scratch
                tmpfs 450G 0 450G 0% /ramdisk
                /dev/sdd1 5.0T 4.9T 9.8G 100% /bkup {xfs}

                how do i find file system types?

                # mount
                /dev/sdc2 on / type ext3 (rw,acl,user_xattr)
                proc on /proc type proc (rw)
                sysfs on /sys type sysfs (rw)
                debugfs on /sys/kernel/debug type debugfs (rw)
                udev on /dev type tmpfs (rw,mode=0755)
                tmpfs on /dev/shm type tmpfs (rw,mode=1777)
                devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
                /dev/sdc1 on /boot/efi type vfat (rw,umask=0002,utf8=true)
                /dev/sda1 on /data type xfs (rw)
                /dev/sdb1 on /scratch type xfs (rw)
                fusectl on /sys/fs/fuse/connections type fusectl (rw)
                securityfs on /sys/kernel/security type securityfs (rw)
                none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
                tmpfs on /ramdisk type tmpfs (rw,size=450G)
                nfsd on /proc/fs/nfsd type nfsd (rw)
                /dev/sdd1 on /bkup type xfs (rw)

                # xfs_repair -n /dev/sdd1
                xfs_repair: /dev/sdd1 contains a mounted and writable filesystem

                fatal error -- couldn't initialize XFS library

                # umount /bkup/
                # xfs_repair -n /dev/sdd1

                Phase 1 - find and verify superblock...
                Phase 2 - using internal log
                - scan filesystem freespace and inode maps...
                - found root inode chunk
                Phase 3 - for each AG...
                - scan (but don't clear) agi unlinked lists...
                - process known inodes and perform inode discovery...
                - agno = 0
                - agno = 1
                - agno = 2
                - agno = 3
                - agno = 4
                - process newly discovered inodes...
                Phase 4 - check for duplicate blocks...
                - setting up duplicate extent list...
                - check for inodes claiming duplicate blocks...
                - agno = 0
                - agno = 4
                - agno = 3
                - agno = 1
                - agno = 2
                No modify flag set, skipping phase 5
                Phase 6 - check inode connectivity...
                - traversing filesystem ...
                - traversal finished ...
                - moving disconnected inodes to lost+found ...
                Phase 7 - verify link counts...
                No modify flag set, skipping filesystem flush and exiting.

                this "xfs_repair -n" output is on a good XFS file system that has been problem free for years.





                share|improve this answer


























                  0












                  0








                  0






                  xfs_repair -n



                  Usage: xfs_repair [options] device

                  Options:
                  -f The device is a file
                  -L Force log zeroing. Do this as a last resort.
                  -l logdev Specifies the device where the external log resides.
                  -m maxmem Maximum amount of memory to be used in megabytes.
                  -n No modify mode, just checks the filesystem for damage.
                  -P Disables prefetching.
                  -r rtdev Specifies the device where the realtime section resides.
                  -v Verbose output.
                  -c subopts Change filesystem parameters - use xfs_admin.
                  -o subopts Override default behaviour, refer to man page.
                  -t interval Reporting interval in minutes.
                  -d Repair dangerously.
                  -V Reports version and exits.

                  man page:
                  -n No modify mode.
                  Specifies that xfs_repair should not modify the filesystem but
                  should only scan the filesystem and indicate what repairs would have been made.


                  There is also xfs_check, but its man page says: "check XFS filesystem consistency... Note that using xfs_check is NOT recommended. Please use xfs_repair -n instead, for better scalability and speed."



                  In /etc/fstab, a 1 or 2 in the sixth (last) column requests a file system check (fsck) at boot. Will that specifically run xfs_repair -n? As far as I can tell, no: the fsck.xfs man page describes it as doing nothing, successfully, so by default it exits without checking anything.



                  You asked about detecting a failed file system. My interpretation is that a failed file system would not be mounted and would not be accessible at all. You would know without really having to check: it would be obvious when you notice that it is not mounted and that it then fails to mount when you try manually.



                  The file system must be unmounted before xfs_repair -n will run, but to monitor it, here is what you would periodically do by hand:



                  # df -h
                  Filesystem Size Used Avail Use% Mounted on
                  /dev/sdc2 550G 152G 371G 30% / {ext3}
                  udev 253G 216K 253G 1% /dev
                  tmpfs 253G 5.5M 253G 1% /dev/shm
                  /dev/sdc1 195M 13M 183M 7% /boot/efi
                  /dev/sda1 5.0T 4.9T 99G 99% /data {xfs}
                  /dev/sdb1 559G 67G 492G 12% /scratch
                  tmpfs 450G 0 450G 0% /ramdisk
                  /dev/sdd1 5.0T 4.9T 9.8G 100% /bkup {xfs}

                  How do I find the file system types?

                  # mount
                  /dev/sdc2 on / type ext3 (rw,acl,user_xattr)
                  proc on /proc type proc (rw)
                  sysfs on /sys type sysfs (rw)
                  debugfs on /sys/kernel/debug type debugfs (rw)
                  udev on /dev type tmpfs (rw,mode=0755)
                  tmpfs on /dev/shm type tmpfs (rw,mode=1777)
                  devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
                  /dev/sdc1 on /boot/efi type vfat (rw,umask=0002,utf8=true)
                  /dev/sda1 on /data type xfs (rw)
                  /dev/sdb1 on /scratch type xfs (rw)
                  fusectl on /sys/fs/fuse/connections type fusectl (rw)
                  securityfs on /sys/kernel/security type securityfs (rw)
                  none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw)
                  tmpfs on /ramdisk type tmpfs (rw,size=450G)
                  nfsd on /proc/fs/nfsd type nfsd (rw)
                  /dev/sdd1 on /bkup type xfs (rw)
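
The same information is available in /proc/mounts, where the file system type is the third field of each line, so the XFS entries can be picked out by a script instead of by eye. A minimal sketch, assuming the Linux /proc/mounts layout; the function name list_xfs_mounts and its optional file argument are made up for illustration:

```shell
#!/bin/sh
# list_xfs_mounts [FILE]
# Print "device mountpoint" for every XFS entry in FILE, which must use the
# /proc/mounts layout: device mountpoint fstype options dump pass.
# FILE defaults to /proc/mounts; the argument is a hypothetical hook that
# only exists so the function can be tried against a sample file.
list_xfs_mounts() {
    awk '$3 == "xfs" { print $1, $2 }' "${1:-/proc/mounts}"
}
```

On the machine shown above, calling list_xfs_mounts with no argument should list /dev/sda1, /dev/sdb1 and /dev/sdd1 with their mount points.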

                  # xfs_repair -n /dev/sdd1
                  xfs_repair: /dev/sdd1 contains a mounted and writable filesystem

                  fatal error -- couldn't initialize XFS library

                  # umount /bkup/
                  # xfs_repair -n /dev/sdd1

                  Phase 1 - find and verify superblock...
                  Phase 2 - using internal log
                  - scan filesystem freespace and inode maps...
                  - found root inode chunk
                  Phase 3 - for each AG...
                  - scan (but don't clear) agi unlinked lists...
                  - process known inodes and perform inode discovery...
                  - agno = 0
                  - agno = 1
                  - agno = 2
                  - agno = 3
                  - agno = 4
                  - process newly discovered inodes...
                  Phase 4 - check for duplicate blocks...
                  - setting up duplicate extent list...
                  - check for inodes claiming duplicate blocks...
                  - agno = 0
                  - agno = 4
                  - agno = 3
                  - agno = 1
                  - agno = 2
                  No modify flag set, skipping phase 5
                  Phase 6 - check inode connectivity...
                  - traversing filesystem ...
                  - traversal finished ...
                  - moving disconnected inodes to lost+found ...
                  Phase 7 - verify link counts...
                  No modify flag set, skipping filesystem flush and exiting.

                  This "xfs_repair -n" output is from a healthy XFS file system that has been problem-free for years.
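
Rather than reading the phase output each time, a periodic check can rely on the exit status: the xfs_repair man page says that with -n it returns 1 if corruption was detected and 0 if not. Below is a rough sketch of such a wrapper, not a finished monitoring plugin. The function name check_xfs, the message wording and the XFS_REPAIR override variable are all my own assumptions, and the device still has to be unmounted first, as shown above.

```shell
#!/bin/sh
# check_xfs DEVICE
# Run a no-modify check on an unmounted XFS device and print a one-line result.
# XFS_REPAIR is a hypothetical override for the binary, used only for testing.
check_xfs() {
    dev=$1
    # -n: scan only, never write to the device. Output is discarded here;
    # drop the redirection to see the phase-by-phase report.
    "${XFS_REPAIR:-xfs_repair}" -n "$dev" >/dev/null 2>&1
    status=$?
    case $status in
        0) echo "OK: $dev: no corruption detected" ;;
        # A status of 1 is documented for corruption; a check that cannot
        # start (for example, device still mounted) may also end up here.
        1) echo "CRITICAL: $dev: corruption detected or check could not run" ;;
        *) echo "UNKNOWN: $dev: xfs_repair -n exited with status $status" ;;
    esac
    return "$status"
}
```

For example, after "umount /bkup/", running "check_xfs /dev/sdd1" prints a single line and returns the xfs_repair status, which a cron job or monitoring agent can act on.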













                  edited yesterday

























                  answered yesterday









                  ron
