Rocks cluster slave spontaneous reset?

I am 'administering' a small cluster (4 nodes) based on Rocks Cluster. After a recent restart it appears that the slave nodes have all decided to spontaneously reinstall their operating systems, wiping their whole configuration, InfiniBand support, installed software, etc.

I cannot fathom why the system might have done this, and it is quite unhelpful. Has anyone had this happen before? What could have caused it?

And for the kicker, since I'm probably resigned to rebuilding the nodes to the spec they should have had, how does one back up the slaves once they're in a working state?

Additional info:

The head node also seems to be largely incapable of reaching the internet, based on attempted pings. It also cannot seem to ping the local DNS address (192.168.0.1).










Tags: cluster

asked Apr 28 '15 at 17:12 by J Collins (last edited by Rui F Ribeiro)






















1 Answer

It turns out that, at least in some cases, the default is for Rocks to reinstall the operating system on the slave nodes at every boot-up (1). Presumably the intent is that clusters are always on, so a restart probably means changes were made that would benefit from a reinstall. For a casually used system this is not appropriate, as it is unlikely that all of the post-install scripts are configured to complete a full reinstallation. The way to avoid the reinstall is to execute:

    rocks run host compute "chkconfig rocks-grub off"

This executes the command on all slave nodes in the 'compute' group, disabling the reinstall functionality.
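
To check that the change has actually taken hold everywhere, the same mechanism can be reused to query the service state on each node. This is only a sketch, assuming the SysV-style chkconfig tooling of the CentOS base that Rocks used at the time; the 'rocks-grub' service name is the one from the command above.

    # Show the runlevel configuration of the rocks-grub service on every
    # node in the 'compute' group; each runlevel should now report "off".
    rocks run host compute "chkconfig --list rocks-grub"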



In my case the slave nodes were set to boot from the local drive first, which normally avoids the auto-reinstall. I believe what triggered it was a forced power-off corrupting the local disk, so that on the next boot the corrupted local disk would not boot and handed over to PXE boot, which got a reinstall instruction from the head node. The forced power-off was needed because something unknown was interrupting 'shutdown now' when it was run on the slaves; a physical power-off was all that would shut them down. I now use 'shutdown -h now', which seems to overcome whatever interrupted the vanilla shutdown.
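
For completeness, here is a rough sketch of a safer power-down sequence and of checking what the front end will serve over PXE. The shutdown commands are the ones mentioned above and reuse the same 'compute' group; the 'rocks list host boot' / 'rocks set host boot' invocations are from memory rather than from this answer, so verify them against the Rocks documentation for your release.

    # On the head node, inspect what the PXE server will do if a node
    # falls back to network boot ('os' boots the local disk, 'install'
    # triggers a reinstall), and pin it to 'os' if necessary.
    # (Syntax from memory; check the Rocks documentation.)
    rocks list host boot
    rocks set host boot compute action=os

    # When powering the cluster down, halt every node in the 'compute'
    # group cleanly first, then shut the head node down last, rather than
    # cutting power and risking another corrupted local disk.
    rocks run host compute "shutdown -h now"
    shutdown -h now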






answered May 4 '15 at 18:03 by J Collins