What might cause a single “rcu_sched detected stall on CPU” warning in syslog?

Environment: Linux [hostname] 3.2.0-4-amd64 #1 SMP Debian 3.2.96-2 x86_64 GNU/Linux
Hardware: AMD Opteron(tm) Processor 6344, 6x2MiB L2, 2x8MiB L3, 6-core ht (12 logical cores)

Today I got a warning in syslog:

Feb 28 09:58:53 amalthea kernel: [4367033.060016] INFO: rcu_bh detected stall on CPU 10 (t=0 jiffies)

Feb 28 09:58:53 amalthea kernel: [4367033.060018] sending NMI to all CPUs:

This was followed by a dump of CPU state. Nothing in the log leading up to it looks "bad".

The server is still running with no apparent stalled processes, and the warning has not recurred in the hour or so since it happened.

I've skimmed through some information on RCU's stall detector (it's far too technical for me to really understand), and I can see that:

My CPU was stalled for t=0 jiffies

There is no "detected by" CPU

There is a note in that document that suggests to me this may be a false positive:

["Stall ended before state dump start"] is rare, but does happen from time to time in real life. It is also possible for a zero-jiffy stall to be flagged in this case, depending on how the stall warning and the grace-period initialization happen to interact. Please note that it is not possible to entirely eliminate this sort of false positive without resorting to things like stop_machine(), which is overkill for this sort of problem.

(Emphasis mine)

I didn't get a "Stall ended before state dump start" message, but I also didn't seem to get much else in terms of diagnostics other than the slew of CPU-dumps that came after the two log lines shown above.

I can post more information from the CPU dump if it would be helpful. Nothing jumped out at me, though I'm no expert.

What could have caused this situation? Is it likely to be a false positive, based solely on the t=0 jiffies data point plus the absence of any other diagnostic information in the log?

(Please note that this question is distinct from "rcu_sched detected stall on CPU", which seems to have indicated a "real problem".)

asked Feb 28 '18 at 15:48

Christopher Schultz

1657

linux-kernel


1 Answer

Most likely, this is caused by one of three things:

A very hard-to-trigger kernel bug. I don't think this is likely, though, as RCU is so core to the kernel that any bugs there are likely to be hit by just about everyone.

Bad RAM. Memory corruption caused by a bad memory module can quite easily cause bizarre behavior like this.

A transient memory error caused by something other than the memory itself. Similar to the above, but not likely to show up again. This is the type of thing that ECC memory tries to protect against (but can't completely, because it's entirely possible for something to go wrong in the ECC logic).

Unless it happens again, you can probably assume it's case 3 and just move on. If it does happen again, look for similarities in the surrounding kernel messages and/or check your RAM.
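If you want to keep an eye out for a repeat, something like the following could help. It is only a rough sketch: the regular expression is written against the exact message format quoted in the question, and other kernel versions word their stall warnings differently, so it would likely need adjusting. The names `find_stalls` and `summarize` are made up for this example.

```python
import re
from collections import Counter

# Pattern for the single-CPU stall line quoted in the question, e.g.
#   "... kernel: [4367033.060016] INFO: rcu_bh detected stall on CPU 10 (t=0 jiffies)"
# Assumption: other kernels use different wording and will not match.
STALL_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+INFO: (?P<flavor>rcu_\w+) detected stall "
    r"on CPU (?P<cpu>\d+) \(t=(?P<jiffies>\d+) jiffies\)"
)


def find_stalls(lines):
    """Return one dict per stall warning found in an iterable of log lines."""
    stalls = []
    for line in lines:
        match = STALL_RE.search(line)
        if match:
            stalls.append({
                "uptime": float(match.group("uptime")),
                "flavor": match.group("flavor"),
                "cpu": int(match.group("cpu")),
                "jiffies": int(match.group("jiffies")),
            })
    return stalls


def summarize(stalls):
    """Count warnings overall, zero-jiffy warnings, and warnings per CPU."""
    return {
        "total": len(stalls),
        "zero_jiffy": sum(1 for s in stalls if s["jiffies"] == 0),
        "per_cpu": dict(Counter(s["cpu"] for s in stalls)),
    }


if __name__ == "__main__":
    # The two lines from the question; in practice you would read your syslog.
    sample = [
        "Feb 28 09:58:53 amalthea kernel: [4367033.060016] INFO: rcu_bh "
        "detected stall on CPU 10 (t=0 jiffies)",
        "Feb 28 09:58:53 amalthea kernel: [4367033.060018] sending NMI to all CPUs:",
    ]
    print(summarize(find_stalls(sample)))
```

A single zero-jiffy entry that never repeats fits the "move on" case; several entries, especially on the same CPU or with non-zero t values, would be the cue to dig into the surrounding messages.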

answered Feb 28 '18 at 20:39

Austin Hemmelgarn

6,04611017


