Rocks cluster slave spontaneous reset?
I am 'administering' a small cluster (4 nodes) based on Rocks Cluster. After a recent restart, it appears that the slave nodes have all spontaneously reinstalled their operating systems, wiping their whole configuration, InfiniBand support, installed software, etc.
I cannot fathom why the system might have done this, and it is quite unhelpful. Has anyone had this happen before? What has caused it?
And for the kicker, since I'm probably resigned to rebuilding the nodes to the spec they should have had: how does one back up the slaves once they're in a working state?
Additional info:
Also, the head node seems largely unable to reach the internet, based on attempted pings. It also cannot ping the local DNS address (192.168.0.1).
cluster
edited 2 hours ago
Rui F Ribeiro
asked Apr 28 '15 at 17:12
J Collins
1 Answer
It turns out that, at least in some cases, Rocks by default reinstalls the slave nodes at every boot (1). Presumably the intent is that clusters are always on, so a restart probably means some changes were made that would benefit from a reinstall. For a casually used system this is not appropriate, as it is unlikely that all of the post-install scripts are configured to complete a full reinstallation. The way to avoid this reinstall is to execute:
rocks run host compute "chkconfig rocks-grub off"
This runs the command on all slave nodes in the 'compute' group, disabling the reinstall functionality.
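As a rough sketch of how the change could be checked before and after, the wrapper below prints the commands by default and only runs them when DRY_RUN=0 is set on the head node. The `run` helper and the DRY_RUN and GROUP variables are illustrative names, not part of Rocks, and the group name 'compute' is assumed from the command above.

```shell
#!/bin/sh
# Sketch only: DRY_RUN=1 (the default) prints each command instead of running it.
# Note that the printed form drops the quotes around the remote command.
DRY_RUN="${DRY_RUN:-1}"
GROUP="${GROUP:-compute}"   # assumed host group, as used in the answer

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Show the runlevel settings of the rocks-grub service on every node in the group.
run rocks run host "$GROUP" "chkconfig --list rocks-grub"

# Disable the service, as in the answer.
run rocks run host "$GROUP" "chkconfig rocks-grub off"

# List the settings again to confirm the service now shows as off.
run rocks run host "$GROUP" "chkconfig --list rocks-grub"
```

Running it once in dry-run mode first makes it easy to see exactly which nodes and commands would be affected.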
In my case the slave nodes were set to boot from the local drive first, which avoids the auto-reinstall. I believe what triggered it was a forced power-off corrupting the local disk, so that on the next boot the corrupted disk would not boot and the node fell through to PXE boot, which got a reinstall instruction from the head node. The forced power-off was needed because something unknown interrupted `shutdown now` when it was run on the slaves; a physical power-off was all that would shut them down. I now use `shutdown -h now`, which seems to overcome whatever interrupted the vanilla shutdown.
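A hedged sketch of the related head-node commands follows. It assumes the `rocks list host boot` and `rocks set host boot` commands are available in the installed Rocks version and that 'compute' is accepted as a host group; I have not verified that setting the boot action alone would have prevented the reinstall described above. The `run` helper and the DRY_RUN and GROUP variables are illustrative names; with the default DRY_RUN=1 the script only prints the commands.

```shell
#!/bin/sh
# Sketch only: DRY_RUN=1 (the default) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"
GROUP="${GROUP:-compute}"   # assumed host group, as used in the answer

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Show which boot action (for example os or install) the head node
# would hand to each host on a PXE boot.
run rocks list host boot

# Ask for the nodes in the group to boot their installed OS rather than reinstall.
run rocks set host boot "$GROUP" action=os

# Halt the nodes cleanly instead of forcing a physical power-off.
run rocks run host "$GROUP" "shutdown -h now"
```

Checking the boot action before any planned restart gives some warning if a node is about to be reinstalled.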
answered May 4 '15 at 18:03
J Collins