What might cause a single “rcu_sched detected stall on CPU” warning in syslog?
Environment: Linux [hostname] 3.2.0-4-amd64 #1 SMP Debian 3.2.96-2 x86_64 GNU/Linux
Hardware: AMD Opteron(tm) Processor 6344, 6x2MiB L2, 2x8MiB L3, 6-core ht (12 logical cores)
Today I got a warning in syslog:
Feb 28 09:58:53 amalthea kernel: [4367033.060016] INFO: rcu_bh detected stall on CPU 10 (t=0 jiffies)
Feb 28 09:58:53 amalthea kernel: [4367033.060018] sending NMI to all CPUs:
Followed by a dump of CPU state. There seems to be nothing "bad" leading up to this situation in the log.
The server is still running with no (apparent) stalled processes, and the warning has not repeated itself in the hour or so since it happened.
I've skimmed through some information on RCU's stall detector (it's far too technical for me to really understand), and I can see that:
- My CPU was stalled for t=0 jiffies
- There is no "detected by" CPU
There is a note in that document that suggests to me this may be a false-positive:
["Stall ended before state dump start"] is rare, but does happen from time to time in real life. It is also possible for a zero-jiffy stall to be flagged in this case, depending on how the stall warning and the grace-period initialization happen to interact. Please note that it is not possible to entirely eliminate this sort of false positive without resorting to things like stop_machine(), which is overkill for this sort of problem.
(Emphasis mine)
I didn't get a "Stall ended before state dump start" message, but I also didn't seem to get much else in terms of diagnostics other than the slew of CPU-dumps that came after the two log lines shown above.
I can post more information from the CPU dump if it would be helpful. Nothing jumped out at me, though I'm no expert.
What could have caused this situation? Is it likely to be a false-positive based solely upon the t=0 jiffies data point, plus no other diagnostic information printed to the log?
(Please note that this question is distinct from rcu_sched detected stall on CPU, which seems to have indicated a "real problem".)
linux-kernel
asked Feb 28 '18 at 15:48
Christopher Schultz
1 Answer
Most likely, this is caused by one of three things:
- A very hard-to-trigger kernel bug. I don't think this is likely, though, as the RCU code is so core to the kernel that any bugs there are likely to be hit by just about everyone.
- Bad RAM. Memory corruption caused by a bad memory module can quite easily cause bizarre stuff like this to happen.
- A transient memory error caused by something other than the memory itself. Similar to the above, but not likely to show up again. This is the type of thing that ECC memory tries to protect against (but can't completely because it's fully possible for something to go wrong in the ECC logic).
Unless it happens again, you can probably assume it's case 3 and just move on. If it does happen again, look for similarities in the surrounding kernel messages and/or check your RAM.
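If the warning does come back, a small script can make it easier to compare occurrences. The following is only a minimal sketch: it assumes the warning keeps the exact wording shown in the question (other kernel versions may phrase it differently), and the names `find_stalls` and `summarize` are made up for illustration.

```python
import re
from collections import Counter

# Assumption: the stall warning uses the wording from the question, e.g.
# "INFO: rcu_bh detected stall on CPU 10 (t=0 jiffies)".
STALL_RE = re.compile(
    r"INFO: (?P<flavor>rcu_\w+) detected stall on CPU (?P<cpu>\d+) "
    r"\(t=(?P<jiffies>\d+) jiffies\)"
)


def find_stalls(lines):
    """Return (flavor, cpu, jiffies) for every stall warning found in lines."""
    stalls = []
    for line in lines:
        match = STALL_RE.search(line)
        if match:
            stalls.append(
                (match.group("flavor"), int(match.group("cpu")), int(match.group("jiffies")))
            )
    return stalls


def summarize(stalls):
    """Count warnings overall, zero-jiffy warnings, and warnings per CPU."""
    per_cpu = Counter(cpu for _, cpu, _ in stalls)
    zero_jiffy = sum(1 for _, _, jiffies in stalls if jiffies == 0)
    return {"total": len(stalls), "zero_jiffy": zero_jiffy, "per_cpu": dict(per_cpu)}


if __name__ == "__main__":
    # The two log lines quoted in the question, used here as sample input.
    sample = [
        "Feb 28 09:58:53 amalthea kernel: [4367033.060016] "
        "INFO: rcu_bh detected stall on CPU 10 (t=0 jiffies)",
        "Feb 28 09:58:53 amalthea kernel: [4367033.060018] sending NMI to all CPUs:",
    ]
    print(summarize(find_stalls(sample)))
```

To run it against a real log, pass an open file object to `find_stalls` instead of the sample list. Repeated warnings on the same CPU, or with a non-zero jiffy count, would be the kind of similarity worth a closer look.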
answered Feb 28 '18 at 20:39
Austin Hemmelgarn