“legal” ARP poisoning by machine aggregating 2 NICs crashes us

Strange things are afoot, threats are being made, and we need to sort this problem out.

The situation:

Our device (a network camera) streams video over a network to a recorder/server (using Live555 / WIS Streamer). The video is sent as UDP packets.

On one particular site using one particular server, every so often (roughly once per 24 hours) one thread of the Live555 streamer locks up whilst sending video. Other threads keep going, and we still have connectivity to the camera over IP: we can see web pages from it, ping it, etc.

We suspect the server: it has 2 network ports and aggregates them, so it has two MACs but one IP address. Looking at this in Wireshark, we see the camera streaming to one port (let's call it A). We then get an ARP from the other port (let's call it B), at which point our device stops squirting packets to MAC A, squirts one packet up the wire to MAC B, and then appears to stop in its tracks.

Further info: The server seems to corrupt ARP packets from the "wrong" port, possibly as a result of a misconfiguration or somesuch, but those packets still get read and acted upon by our device, possibly as a result of our driver or kernel networking being misconfigured or skipping checksums to save CPU cycles.

So this messy situation raises a few questions:

Where in the kernel networking code should I be looking to check the packet checksum or enable checking? Our hardware is fixed, being an embedded device, so a tweak made to the driver is not the worst idea ever.

Can anyone guess the failure mechanism that causes a process to lock up when it's constantly send()ing data on a port and the ARP tables shift underneath it?

Edited to add: We now suspect that the ARPs are not really corrupt, just that Wireshark is not correctly identifying the packet (it thinks the packet is long enough that there must be an FCS word, but we now think it's just zero-padding). That really just leaves part 2 of this question: what can we do to prevent this change in the ARP table knocking a transmitting process over?
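
For reference, the zero-padding theory is plausible on size alone. An ARP message for IPv4 over Ethernet is 28 bytes, which with the 14-byte Ethernet header gives a 42-byte frame, while Ethernet requires at least 60 bytes before the 4-byte FCS, so a sender has to add 18 bytes of padding (typically zeros). A minimal Python sketch of that arithmetic follows; the MAC and IP values are made up for illustration and are not taken from the capture, and the function names are hypothetical.

```python
import struct

ETH_HEADER_LEN = 14    # destination MAC + source MAC + EtherType
ARP_PAYLOAD_LEN = 28   # ARP message for IPv4 over Ethernet
MIN_FRAME_NO_FCS = 60  # minimum Ethernet frame length, excluding the 4-byte FCS


def build_arp_reply(sender_mac, sender_ip, target_mac, target_ip):
    """Return an unpadded Ethernet frame carrying an ARP reply."""
    eth = target_mac + sender_mac + struct.pack("!H", 0x0806)  # EtherType ARP
    # htype=1 (Ethernet), ptype=0x0800 (IPv4), hlen=6, plen=4, op=2 (reply)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 2)
    arp += sender_mac + sender_ip + target_mac + target_ip
    return eth + arp


def pad_to_minimum(frame):
    """Append zero bytes until the frame reaches the Ethernet minimum."""
    return frame + b"\x00" * max(0, MIN_FRAME_NO_FCS - len(frame))


def trailing_bytes(frame):
    """Bytes after the ARP payload: padding, or padding plus a captured FCS."""
    return frame[ETH_HEADER_LEN + ARP_PAYLOAD_LEN:]


# Illustrative (made-up) addresses: server NIC 'B' announcing itself to the camera.
frame = pad_to_minimum(build_arp_reply(
    bytes.fromhex("02000000000b"), bytes([192, 0, 2, 10]),
    bytes.fromhex("020000000001"), bytes([192, 0, 2, 20]),
))
print(len(frame), trailing_bytes(frame).hex())
```

If the trailer in a captured ARP is all zeros and the frame is exactly 60 bytes, padding is the simpler explanation; a longer frame, or non-zero bytes at the very end, would be worth a closer look.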

Edit to further add: I don't want people to think I'm ignoring questions about port states or process states. The issue happens very rarely (on average maybe once per 24h) and only on one (remote) installation that we can't easily get access to. We're trying hard to replicate it in the lab so we can do more detailed diagnostics, but the system watchdog resets within ~3 mins of the problem occurring, so by the time the news reaches us the device has already rebooted and started working OK.

Edit to add Wireshark info:
I'm not sure of the best way to summarise Wireshark captures here (very hard to upload ~1Tb of captured packets!) but I'll try. Cam:X and Cam:Y are two streams of RTSP video streamed by two identical instances of Live555 WIS Streamer from different ports. Server 'A' and 'B' are the MACs of the two NICs on the server.

The sequence of packets goes like this:

UDP Packet from Cam:X -> Server 'A'

UDP Packet from Cam:Y -> Server 'A'

UDP Packet from Cam:X -> Server 'A'

UDP Packet from Cam:Y -> Server 'A'

UDP Packet from Cam:X -> Server 'A'

UDP Packet from Cam:Y -> Server 'A'

ARP Packet to Cam from Server 'B' "<my IP> is now on 'B'"

Intel ANS Probe broadcast from Server 'B', Sender ID '1' team ID 'B'

Intel ANS Probe broadcast from Server 'A', Sender ID '2' team ID 'B'

<silence> from Cam:X

UDP Packet from Cam:Y -> Server 'B'

UDP Packet from Cam:Y -> Server 'B'

UDP Packet from Cam:Y -> Server 'B'

There are no other packets in the stream at or around this time. The Intel ANS packets do not always coincide with the ARPs from the NIC but I thought I'd include them for the sake of completeness.
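
With a capture this large, it can help to pull out only the moments where the server's IP moves from one MAC to the other, then look at the camera's traffic around each of those timestamps. The sketch below is one way to do that in Python, assuming the ARP announcements have already been exported as (timestamp, IP, MAC) tuples; the export step, the function name and the sample values are all hypothetical.

```python
def find_mac_changes(arp_events):
    """Return (timestamp, ip, old_mac, new_mac) for every IP-to-MAC flip.

    arp_events is an iterable of (timestamp, ip, mac) tuples in capture order.
    """
    last_seen = {}
    changes = []
    for timestamp, ip, mac in arp_events:
        previous = last_seen.get(ip)
        if previous is not None and previous != mac:
            changes.append((timestamp, ip, previous, mac))
        last_seen[ip] = mac
    return changes


# Made-up sample mirroring the sequence above: the server IP moves from 'A' to 'B'.
sample = [
    (0.0, "192.0.2.10", "A"),
    (1.0, "192.0.2.10", "A"),
    (2.0, "192.0.2.10", "B"),
]
print(find_mac_changes(sample))  # [(2.0, '192.0.2.10', 'A', 'B')]
```

Comparing the flips that coincide with a stalled stream against the many that do not may narrow down the timing window.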

The issue seems to be VERY sensitive to timing. We see these "team" ARPs regularly from the server, and only once in a blue moon do they cause us an issue, as if there's a particular point in the network stack code that's sensitive to the ARP table changing. It's not always the same stream instance that falls over, and notably the other instance (as well as all other net traffic, HTTP etc.) continues to work fine.

It sounds like teamed NICs "should not" ARP like this mid-session, but of course they won't be aware of any session when the traffic is all UDP.
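
One defensive measure on the sending side is to put a timeout on the socket, so that a send which cannot complete surfaces as an error the streaming loop can handle rather than a silent stall. The sketch below shows the idea in Python rather than in the camera's actual code. It rests on an assumption that nothing in the captures confirms, namely that the stuck thread is sitting inside a blocking send (which for UDP can happen when the socket's send buffer is full). The function names are hypothetical.

```python
import socket


def make_stream_socket(send_timeout_s=2.0):
    """Create a UDP socket whose blocking calls give up after a timeout."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(send_timeout_s)
    return sock


def send_or_report(sock, payload, addr):
    """Send one datagram; return False instead of hanging or raising."""
    try:
        sock.sendto(payload, addr)
        return True
    except socket.timeout:
        # The send did not complete within the timeout: caller can log and retry.
        return False
    except OSError:
        # Any other socket-level failure (e.g. network unreachable).
        return False
```

A streaming loop built on this could count consecutive failures and reopen the socket or restart the stream, which is at least gentler than waiting for the system watchdog.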

asked Feb 15 '16 at 18:53 by John U; edited 39 mins ago by Rui F Ribeiro

An IP is like the name in a phone book, and the MAC the actual file number... so if it indeed changes, the call is gone.

– Rui F Ribeiro
Feb 15 '16 at 19:11

Well, yes, but I'm curious about how a send() call can block/lock/crash when the ARP table changes rather than failing gracefully?

– John U
Feb 15 '16 at 19:19

The thread may stay in the sleep state waiting for data that never arrives, I guess. Hard to tell. Do you have a console? What does ps axms say about the thread when it happens?

– Rui F Ribeiro
Feb 16 '16 at 9:00

We don't have the -axms option on ps; it's an embedded system running busybox, so a relatively limited command set. Currently we can't reproduce the issue on demand, as it's a bit challenging to craft corrupted packets of the correct form to order.

– John U
Feb 16 '16 at 11:25

Link aggregation may be the cause, if configured incorrectly. Do you get ICMP answers shortly before the transmission stops, and shortly after the single packet to "B"?

– gerhard d.
Feb 17 '16 at 11:04

networking ip arp

1 Answer

Well, if only to give some closure to this: the customer reconfigured their dodgy network card and everything worked. Unfortunately for the curious, that means no-one is going to pay anyone to look too closely at what could've been done to fix that case.

answered Mar 16 '16 at 15:11 by John U
