Jumphost suddenly resetting first SSH MUX connection attempts












I have been using a Debian 9 SSH jumpbox host to run my scripts/Ansible playbooks for a while. The jumpbox talks mostly to Debian 9 and some Debian 8 servers. Most of the servers are VMs running under VMware Enterprise 5.5.



The SSH client on the jumpbox is configured for SSH connection multiplexing (MUX), and authentication is done with an RSA key file.



SSH has been working well for years; however, suddenly SSH connections started failing with the error ssh_exchange_identification: read: Connection reset by peer on the first try, several times a day, which obviously wreaks havoc with my scripts and those of our development team.



However, after the first try they are OK for a while. The misbehaving servers appear random at first, but there are some patterns/timeouts involved. For instance, if I run a command against all of the servers before the intended script/playbook, a few will fail, but the next run then works on all of them.
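Something along these lines (a minimal sketch; hosts.txt is a hypothetical stand-in for the actual inventory) is what I mean by such a pre-warm pass:

# warm up the SSH mux masters once; failures at this stage are tolerated
while read -r host; do
    ssh -o ConnectTimeout=5 -o BatchMode=yes "$host" true || echo "pre-warm failed for $host"
done < hosts.txt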



There have not been any recent significant changes on the servers, apart from security updates. The transition to Debian 9 happened a (significantly) long time ago.



I already found a stray MTU configuration that had once been applied to several servers during an incident and then forgotten, but that turned out not to be the cause. I also lowered ControlPersist on the client side and ClientAliveInterval on the server side, both to 1h, and that did not improve the situation.



So at the moment I am at a loss as to why this is happening. I am, however, more inclined to suspect a layer 7 issue than a network problem.



The client-side SSH configuration (/etc/ssh/ssh_config on Debian 9) is:



Host *
SendEnv LANG LC_*
HashKnownHosts yes
GSSAPIAuthentication yes
GSSAPIDelegateCredentials no
ControlMaster auto
ControlPath /tmp/ssh_mux_%h_%p_%r
ControlPersist 1h
Compression no
UseRoaming no


The server-side sshd configuration (/etc/ssh/sshd_config) on several of the Debian servers:



Protocol 2
HostKey /etc/ssh/ssh_host_rsa_key
HostKey /etc/ssh/ssh_host_dsa_key
UsePrivilegeSeparation yes

SyslogFacility AUTH
LogLevel INFO
LoginGraceTime 120
PermitRootLogin forced-commands-only
StrictModes yes
PubkeyAuthentication yes
IgnoreRhosts yes
HostbasedAuthentication no
PermitEmptyPasswords no
ChallengeResponseAuthentication no
PasswordAuthentication no

X11Forwarding no
X11DisplayOffset 10
PrintMotd no
PrintLastLog yes
TCPKeepAlive yes

AcceptEnv LANG LC_*
Subsystem sftp /usr/lib/openssh/sftp-server -l INFO
UsePAM yes
ClientAliveInterval 3600
ClientAliveCountMax 0
AddressFamily inet


SSH versions:



Client:



$ ssh -V
OpenSSH_7.4p1 Debian-10+deb9u1, OpenSSL 1.0.2l 25 May 2017


Server(s):



SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u1 (Debian 9)
SSH-2.0-OpenSSH_6.7p1 Debian-5+deb8u3 (Debian 8)


I have seen the error at least in situations where both servers involved were running the 4.9.0-0.bpo.1-amd64 kernel.



A tcpdump of a misbehaving server, with both machines on the same network and no firewalls in between. I also monitor MAC addresses, and there is no record of a new machine/duplicate MAC with the same addresses in the last few years.
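(For completeness, the MAC entry currently held for a given server can be double-checked from the jumpbox with something like the following; the IP is the fenix-storage address from the logs below:)

# show the ARP/neighbour entry the jumpbox currently holds for the server
ip neigh show 10.10.32.156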



# tcpdump port 22
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
19:42:25.462896 IP jumbox.40270 > server.ssh: Flags [S], seq 3882361678, win 23200, options [mss 1160,sackOK,TS val 354223428 ecr 0,nop,wscale 7], length 0
19:42:25.463289 IP server.ssh > jumbox.40270: Flags [S.], seq 405921081, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:25.463306 IP jumbox.40270 > server.ssh: Flags [.], ack 1, win 182, length 0
19:42:25.481470 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:25.481477 IP jumbox.40270 > server.ssh: Flags [.], ack 504902058, win 182, length 0
19:42:25.481490 IP server.ssh > jumbox.40270: Flags [R], seq 405921082, win 0, length 0
19:42:25.481494 IP server.ssh > jumbox.40270: Flags [P.], seq 504902058:504902097, ack 1, win 182, length 39
19:42:26.491536 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:26.491551 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0
19:42:28.507528 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:28.507552 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0
19:42:32.699540 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:32.699556 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0
19:42:40.891490 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:40.891514 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0
19:42:57.019511 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:57.019534 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0


The ssh -v output of a failed connection, showing the reset error:



OpenSSH_7.4p1 Debian-10+deb9u1, OpenSSL 1.0.2l  25 May 2017
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: /etc/ssh/ssh_config line 59: Deprecated option "useroaming"
debug1: auto-mux: Trying existing master
debug1: Control socket "/tmp/ssh_mux_fenix-storage_22_rui" does not exist
debug1: Connecting to fenix-storage [10.10.32.156] port 22.
debug1: Connection established.
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_rsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
write: Connection reset by peer


The ssh -v output of a successful connection:



OpenSSH_7.4p1 Debian-10+deb9u1, OpenSSL 1.0.2l  25 May 2017
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: /etc/ssh/ssh_config line 59: Deprecated option "useroaming"
debug1: auto-mux: Trying existing master
debug1: Control socket "/tmp/ssh_mux_sql01_22_rui" does not exist
debug1: Connecting to sql01 [10.20.10.88] port 22.
debug1: Connection established.
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_rsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u1
debug1: Remote protocol version 2.0, remote software version OpenSSH_7.4p1 Debian-10+deb9u1
debug1: match: OpenSSH_7.4p1 Debian-10+deb9u1 pat OpenSSH* compat 0x04000000
debug1: Authenticating to sql01:22 as 'rui'
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: rsa-sha2-512
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ssh-rsa SHA256:6aJ+ipXRZJfbei5YbYtvqKXB01t1YO34O2ChdT/vk/4
debug1: Host 'sql01' is known and matches the RSA host key.
debug1: Found key in /home/rui/.ssh/known_hosts:315
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<ssh-ed25519,ssh-rsa,ssh-dss,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521>
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey
debug1: Next authentication method: publickey
debug1: Offering RSA public key: /home/rui/.ssh/id_rsa
debug1: Server accepts key: pkalg ssh-rsa blen 277
debug1: Authentication succeeded (publickey).
Authenticated to sql01 ([10.20.10.88]:22).
debug1: setting up multiplex master socket
debug1: channel 0: new [/tmp/ssh_mux_sql01_22_rui]
debug1: control_persist_detach: backgrounding master process
debug1: forking to background
debug1: Entering interactive session.
debug1: pledge: id
debug1: multiplexing control connection
debug1: channel 1: new [mux-control]
debug1: channel 2: new [client-session]
debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
debug1: Sending environment.
debug1: Sending env LC_ALL = en_US.utf8
debug1: Sending env LANG = en_US.UTF-8
debug1: mux_client_request_session: master session id: 2


Interestingly enough, the behaviour can be reproduced with a telnet command:



$ telnet remote-server 22
Trying x.x.x.x...
Connected to remote-server
Escape character is '^]'.
Connection closed by foreign host.
$ telnet remote-server 22
Trying x.x.x.x...
Connected to remote-server
Escape character is '^]'.
SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u1

Protocol mismatch.
Connection closed by foreign host.


UPDATE 1:



Forced Protocol 2 in the client configuration (/etc/ssh/ssh_config) on the jumpbox. No change.



UPDATE 2:



Replaced the old key, encrypted with DES-EDE3-CBC, with a new key encrypted with AES-128-CBC. Again, no visible change.
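(For reference, an existing PEM-format private key can also be re-encrypted in place with AES-128-CBC via openssl, without generating a new key pair; a sketch, using the key path from the logs above:)

# re-encrypt the PEM private key with AES-128-CBC (prompts for the old and the new passphrase)
openssl rsa -aes128 -in ~/.ssh/id_rsa -out ~/.ssh/id_rsa.new
mv ~/.ssh/id_rsa.new ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa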



UPDATE 3:



Interestingly enough, while a mux master is active, the situation does not present itself.
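(A quick way to check whether a mux master is still alive for a given host is ssh -O; a sketch, using one of the hosts above:)

# query the control master for this host; a non-zero exit means no master is running
ssh -O check sql01
# tear down a lingering master explicitly, if needed
ssh -O exit sql01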



UPDATE 4:



I have also found a similar question at Server Fault, although without an accepted answer: https://serverfault.com/questions/445045/ssh-connection-error-ssh-exchange-identification-read-connection-reset-by-pe



Tried regenerating the SSH host keys, and the suggestion there of adding sshd: ALL to /etc/hosts.allow, without success.
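(On Debian, regenerating the host keys plus adding the TCP-wrappers entry amounts to something like the following sketch, not necessarily the exact commands I used:)

# regenerate missing host keys of the default types, then restart sshd
rm /etc/ssh/ssh_host_*key*
ssh-keygen -A
systemctl restart ssh

# /etc/hosts.allow
sshd: ALL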



UPDATE 5:



Opened a console on the destination VM and saw something 'strange'. In the tcpdump below, 1.1.1.1 is the jumpbox.



# tcpdump -n -vvv "host 1.1.1.1"
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:47:45.808273 IP (tos 0x0, ttl 64, id 38171, offset 0, flags [DF], proto TCP (6), length 60)
1.1.1.1.37924 > 1.1.1.2.22: Flags [S], cksum 0xfc1f (correct), seq 3260568985, win 29200, options [mss 1460,sackOK,TS val 407355522 ecr 0,nop,wscale 7], length 0
11:47:45.808318 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
1.1.1.2.22 > 1.1.1.1.37924: Flags [S.], cksum 0x5508 (incorrect -> 0x68a8), seq 2881609759, ack 3260568986, win 28960, options [mss 1460,sackOK,TS val 561702650 ecr 407355522,nop,wscale 7], length 0
11:47:45.808525 IP (tos 0x0, ttl 64, id 38172, offset 0, flags [DF], proto TCP (6), length 52)
1.1.1.1.37924 > 1.1.1.2.22: Flags [.], cksum 0x07b0 (correct), seq 1, ack 1, win 229, options [nop,nop,TS val 407355522 ecr 561702650], length 0
11:47:45.808917 IP (tos 0x0, ttl 64, id 38173, offset 0, flags [DF], proto TCP (6), length 92)
1.1.1.1.37924 > 1.1.1.2.22: Flags [P.], cksum 0x6de0 (correct), seq 1:41, ack 1, win 229, options [nop,nop,TS val 407355522 ecr 561702650], length 40
11:47:45.808930 IP (tos 0x0, ttl 64, id 1754, offset 0, flags [DF], proto TCP (6), length 52)
1.1.1.2.22 > 1.1.1.1.37924: Flags [.], cksum 0x5500 (incorrect -> 0x0789), seq 1, ack 41, win 227, options [nop,nop,TS val 561702651 ecr 407355522], length 0
11:47:45.822178 IP (tos 0x0, ttl 64, id 1755, offset 0, flags [DF], proto TCP (6), length 91)
1.1.1.2.22 > 1.1.1.1.37924: Flags [P.], cksum 0x5527 (incorrect -> 0x70c1), seq 1:40, ack 41, win 227, options [nop,nop,TS val 561702654 ecr 407355522], length 39
11:47:45.822645 IP (tos 0x0, ttl 64, id 21666, offset 0, flags [DF], proto TCP (6), length 40)
1.1.1.1.37924 > 1.1.1.2.22: Flags [R], cksum 0xaeb1 (correct), seq 3260569026, win 0, length 0
11:47:50.919752 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.1.1.2 tell 1.1.1.1, length 46
11:47:50.919773 ARP, Ethernet (len 6), IPv4 (len 4), Reply 1.1.1.2 is-at 00:50:56:b9:3d:2b, length 28
11:47:50.948732 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.1.1.1 tell 1.1.1.2, length 28
11:47:50.948916 ARP, Ethernet (len 6), IPv4 (len 4), Reply 1.1.1.1 is-at 00:50:56:80:57:1a, length 46
^C
11 packets captured
11 packets received by filter
0 packets dropped by kernel


UPDATE 6:



Due to the checksum errors, I disabled TCP/UDP checksum offloading on the NIC in the VM; however, it did not help.



$ sudo ethtool -K eth0 rx off
$ sudo ethtool -K eth0 tx off

iface eth0 inet static
address 1.1.1.2
netmask 255.255.255.0
network 1.1.1.0
broadcast 1.1.1.255
gateway 1.1.1.254
post-up /sbin/ethtool -K $IFACE rx off
post-up /sbin/ethtool -K $IFACE tx off


Understanding TCP Checksum Offloading (TCO) in a VMware Environment (2052904)



UPDATE 7:



Disabled GSSAPIAuthentication in the SSH client on the jumpbox, and also tested with Compression yes. No change either way.
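(i.e. roughly the following in the jumpbox's /etc/ssh/ssh_config:)

GSSAPIAuthentication no
Compression yes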



UPDATE 8:



Tested filling in the TCP checksums with iptables:



/sbin/iptables -A POSTROUTING -t mangle -p tcp -j CHECKSUM --checksum-fill


It did not improve the situation.



UPDATE 9:



Found an interesting test about limiting ciphers and will try it out. MTU problems do not seem to be the culprit, as in some cases I am having problems with server and client on the same network.



For now, tested on the client side with "ssh -c aes256-ctr", and the symptoms do not improve.



The mysterious case of broken SSH client (“connection reset by peer”)



UPDATE 10:



Added this to /etc/ssh/ssh_config. No changes.



Ciphers aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc


SSH issues: Read from socket failed: Connection reset by peer



UPDATE 11:



Configured the SSH service to listen on both port 22 and port 2222. It did not help.
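(i.e. something like the following in /etc/ssh/sshd_config, testing the alternate port from the client with ssh -p 2222:)

Port 22
Port 2222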



UPDATE 12:



I suspected this could be a regression present in OpenSSH 7.4 that was corrected in OpenSSH 7.5.



Release notes from OpenSSH 7.5




  • sshd(8): Fix regression in OpenSSH 7.4 support for the
    server-sig-algs extension, where SHA2 RSA signature methods were
    not being correctly advertised. bz#2680


To use OpenSSH 7.5 on Debian 9/Stretch, I installed openssh-client and openssh-server from Debian testing/Buster.
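(A sketch of one way of doing that with APT pinning; the exact pin priority is arbitrary:)

# /etc/apt/preferences.d/openssh-testing
Package: openssh-client openssh-server
Pin: release a=testing
Pin-Priority: 990

# with a testing entry added to sources.list:
apt-get update && apt-get -t testing install openssh-client openssh-server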



No improvement in the situation.



UPDATE 13:



Defined the following on both the client(s) and the server side:



Ciphers aes256-ctr
MACs hmac-sha1



No improvement.



UPDATE 14:



Set the following:



UseDNS no
GSSAPIAuthentication no
GSSAPIKeyExchange no


No change.



UPDATE 15:



In /etc/ssh/sshd_config, changed:



TCPKeepAlive no


From "How does tcp-keepalive work in ssh?":




TCPKeepAlive operates on the TCP layer. It sends an empty TCP ACK
packet [from the SSH server to the client - Rui]. Firewalls can be configured to ignore these packets, so if you
go through a firewall that drops idle connections, these may not keep
the connection alive.




My guess is that TCPKeepAlive was making the server send a packet that was being optimised away/ignored by some layer further down the stack, and so the remote SSH server believed it was still connected to the mux client while the session had in fact already been torn down; hence the TCP reset(s) on the first try.



So, whilst some say that if you are using ClientAliveInterval you can disable TCPKeepAlive, it seems to be more that if you are using ClientAliveInterval you ought to disable TCPKeepAlive.
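With the settings already shown above, the server-side combination under test therefore ends up roughly as:

# /etc/ssh/sshd_config -- rely on the SSH-level keepalive only
TCPKeepAlive no
ClientAliveInterval 3600
ClientAliveCountMax 0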




  • It is clearly this option; as for the explanation, these are mainly conjectures, and I will have to double-check them against the source when and if I have time.


TCPKeepAlive is also apparently susceptible to spoofing, so it is recommended to turn it off.



Nevertheless, the problem still persists.










Comments:

  • The RST packets are not normal; something between your machine and the server seems to be killing your TCP connection. It's hard to tell what that might be without a full packet dump. – Satō Katsura, Sep 8 '17 at 10:01

  • @SatōKatsura Though better. That server and the jumpbox in the tcpdump are both on the same network; I do have other servers that are routed through a firewall. – Rui F Ribeiro, Sep 8 '17 at 11:24

  • Well, you need to find out where those RSTs come from. There could be any number of reasons for that. shrug – Satō Katsura, Sep 8 '17 at 11:33

  • @SatōKatsura Sure, indeed. Will add another tcpdump when at work. The difficult part is that this is a bit random. – Rui F Ribeiro, Sep 8 '17 at 11:37


















5















I have been using a Debian 9 SSH jumpbox host to run my scripts/ansible playbooks for a while. The jumbox talks with Debian 9 and some Debian 8 servers, mostly. Most of the servers are VMs running under VMWare Enterprise 5.5.



The SSH client in the jumbox is configured for doing SSH MUX, and the authentication is done by an RSA certificate file.



The SSH has been working well for years now, however suddenly SSH connections started giving the error ssh_exchange_identification: read: Connection reset by peer at first try, several times a day, which obviously creates havoc with my scripts and scripts of our development team.



However, after the first try they are ok for a while. The servers misbehaving appear be random at first, but they have some patterns/timeouts. If I do send a command to all of the servers, for instance, running in a command before the intended script/playbook, a few will fail, but the next script will run in all of them.



There havent been recent significant changes on the servers, except for security updates. The transition for Debian 9 has already some (significant) time.



I already found a MTU configuration or other that was once applied to several servers in a malfunction and forgotten, however that was not the case. I also diminished both on the client and server side the ControlPersist and ClientAliveInterval both to 1h, and that did not improve the situation.



So at the moment, I am at loss of why this is happening. I am however more inclined to a layer 7 issue than a network problem.



The SSH configuration on the client side /etc/ssh_config, Debian 9 is:



Host *
SendEnv LANG LC_*
HashKnownHosts yes
GSSAPIAuthentication yes
GSSAPIDelegateCredentials no
ControlMaster auto
ControlPath /tmp/ssh_mux_%h_%p_%r
ControlPersist 1h
Compression no
UseRoaming no


On SSH on the server side of several Debian servers:



Protocol 2
HostKey /etc/ssh/ssh_host_rsa_key
HostKey /etc/ssh/ssh_host_dsa_key
UsePrivilegeSeparation yes

SyslogFacility AUTH
LogLevel INFO
LoginGraceTime 120
PermitRootLogin forced-commands-only
StrictModes yes
PubkeyAuthentication yes
IgnoreRhosts yes
HostbasedAuthentication no
PermitEmptyPasswords no
ChallengeResponseAuthentication no
PasswordAuthentication no

X11Forwarding no
X11DisplayOffset 10
PrintMotd no
PrintLastLog yes
TCPKeepAlive yes

AcceptEnv LANG LC_*
Subsystem sftp /usr/lib/openssh/sftp-server -l INFO
UsePAM yes
ClientAliveInterval 3600
ClientAliveCountMax 0
AddressFamily inet


SSH versions:



client -



$ssh -V 
OpenSSH_7.4p1 Debian-10+deb9u1, OpenSSL 1.0.2l 25 May 2017


server(s)



SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u1 (Debian 9)
SSH-2.0-OpenSSH_6.7p1 Debian-5+deb8u3 (Debian 8)


I have seen that error at least in situations with both servers with the 4.9.0-0.bpo.1-amd64 version.



The tcpdump of a server misbehaving, both machines being in the same network without any firewalls in the middle. I also monitor MAC addresses and there is not log of a new machine/MAC with the same MAC addresses in the last few years.



#tcpdump port 22
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
19:42:25.462896 IP jumbox.40270 > server.ssh: Flags [S], seq 3882361678, win 23200, options [mss 1160,sackOK,TS val 354223428 ecr 0,nop,wscale 7], length 0
19:42:25.463289 IP server.ssh > jumbox.40270: Flags [S.], seq 405921081, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:25.463306 IP jumbox.40270 > server.ssh: Flags [.], ack 1, win 182, length 0
19:42:25.481470 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:25.481477 IP jumbox.40270 > server.ssh: Flags [.], ack 504902058, win 182, length 0
19:42:25.481490 IP server.ssh > jumbox.40270: Flags [R], seq 405921082, win 0, length 0
19:42:25.481494 IP server.ssh > jumbox.40270: Flags [P.], seq 504902058:504902097, ack 1, win 182, length 39
19:42:26.491536 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:26.491551 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0
19:42:28.507528 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:28.507552 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0
19:42:32.699540 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:32.699556 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0
19:42:40.891490 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:40.891514 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0
19:42:57.019511 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:57.019534 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0


An ssh -v server log of a failed connection, with the reset error:



OpenSSH_7.4p1 Debian-10+deb9u1, OpenSSL 1.0.2l  25 May 2017
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: /etc/ssh/ssh_config line 59: Deprecated option "useroaming"
debug1: auto-mux: Trying existing master
debug1: Control socket "/tmp/ssh_mux_fenix-storage_22_rui" does not exist
debug1: Connecting to fenix-storage [10.10.32.156] port 22.
debug1: Connection established.
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_rsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
write: Connection reset by peer


An ssh -v server of a successful connection:



OpenSSH_7.4p1 Debian-10+deb9u1, OpenSSL 1.0.2l  25 May 2017
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: /etc/ssh/ssh_config line 59: Deprecated option "useroaming"
debug1: auto-mux: Trying existing master
debug1: Control socket "/tmp/ssh_mux_sql01_22_rui" does not exist
debug1: Connecting to sql01 [10.20.10.88] port 22.
debug1: Connection established.
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_rsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u1
debug1: Remote protocol version 2.0, remote software version OpenSSH_7.4p1 Debian-10+deb9u1
debug1: match: OpenSSH_7.4p1 Debian-10+deb9u1 pat OpenSSH* compat 0x04000000
debug1: Authenticating to sql01:22 as 'rui'
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: rsa-sha2-512
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ssh-rsa SHA256:6aJ+ipXRZJfbei5YbYtvqKXB01t1YO34O2ChdT/vk/4
debug1: Host 'sql01' is known and matches the RSA host key.
debug1: Found key in /home/rui/.ssh/known_hosts:315
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<ssh-ed25519,ssh-rsa,ssh-dss,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521>
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey
debug1: Next authentication method: publickey
debug1: Offering RSA public key: /home/rui/.ssh/id_rsa
debug1: Server accepts key: pkalg ssh-rsa blen 277
debug1: Authentication succeeded (publickey).
Authenticated to sql01 ([10.20.10.88]:22).
debug1: setting up multiplex master socket
debug1: channel 0: new [/tmp/ssh_mux_sql01_22_rui]
debug1: control_persist_detach: backgrounding master process
debug1: forking to background
debug1: Entering interactive session.
debug1: pledge: id
debug1: multiplexing control connection
debug1: channel 1: new [mux-control]
debug1: channel 2: new [client-session]
debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
debug1: Sending environment.
debug1: Sending env LC_ALL = en_US.utf8
debug1: Sending env LANG = en_US.UTF-8
debug1: mux_client_request_session: master session id: 2


Interestingly enough, the behaviour can be reproduced with a telnet command:



$ telnet remote-server 22
Trying x.x.x.x...
Connected to remote-server
Escape character is '^]'.
Connection closed by foreign host.
$ telnet remote-server 22
Trying x.x.x.x...
Connected to remote-server
Escape character is '^]'.
SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u1

Protocol mismatch.
Connection closed by foreign host.


UPDATE:



Forced Protocol 2 in the /etc/ssh_client client configuration in the jumpbox. No change.



UPDATE2:



Changed the old key encrypted with DES-EDE3-CBC for a new key encrypted with AES-128-CBC. Again no visible change.



UPDATE3:



Interestingly enough, while the mux is active, the situation does not presents itself.



UPDATE4:



I also have found a similar question at serverfault, however without a chosen answer: https://serverfault.com/questions/445045/ssh-connection-error-ssh-exchange-identification-read-connection-reset-by-pe



Tried regenerating the ssh host keys, and the suggestion of sshd: ALL without success.



UPDATE 5



Opened a console on the VM on the destination and saw something 'strange'.
tcpdump whereas 1.1.1.1 is the jumpbox.



# tcpdump -n -vvv "host 1.1.1.1"
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:47:45.808273 IP (tos 0x0, ttl 64, id 38171, offset 0, flags [DF], proto TCP (6), length 60)
1.1.1.1.37924 > 1.1.1.2.22: Flags [S], cksum 0xfc1f (correct), seq 3260568985, win 29200, options [mss 1460,sackOK,TS val 407355522 ecr 0,nop,wscale 7], length 0
11:47:45.808318 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
1.1.1.2.22 > 1.1.1.1.37924: Flags [S.], cksum 0x5508 (incorrect -> 0x68a8), seq 2881609759, ack 3260568986, win 28960, options [mss 1460,sackOK,TS val 561702650 ecr 407355522,nop,wscale 7], length 0
11:47:45.808525 IP (tos 0x0, ttl 64, id 38172, offset 0, flags [DF], proto TCP (6), length 52)
1.1.1.1.37924 > 1.1.1.2.22: Flags [.], cksum 0x07b0 (correct), seq 1, ack 1, win 229, options [nop,nop,TS val 407355522 ecr 561702650], length 0
11:47:45.808917 IP (tos 0x0, ttl 64, id 38173, offset 0, flags [DF], proto TCP (6), length 92)
1.1.1.1.37924 > 1.1.1.2.22: Flags [P.], cksum 0x6de0 (correct), seq 1:41, ack 1, win 229, options [nop,nop,TS val 407355522 ecr 561702650], length 40
11:47:45.808930 IP (tos 0x0, ttl 64, id 1754, offset 0, flags [DF], proto TCP (6), length 52)
1.1.1.2.22 > 1.1.1.1.37924: Flags [.], cksum 0x5500 (incorrect -> 0x0789), seq 1, ack 41, win 227, options [nop,nop,TS val 561702651 ecr 407355522], length 0
11:47:45.822178 IP (tos 0x0, ttl 64, id 1755, offset 0, flags [DF], proto TCP (6), length 91)
1.1.1.2.22 > 1.1.1.1.37924: Flags [P.], cksum 0x5527 (incorrect -> 0x70c1), seq 1:40, ack 41, win 227, options [nop,nop,TS val 561702654 ecr 407355522], length 39
11:47:45.822645 IP (tos 0x0, ttl 64, id 21666, offset 0, flags [DF], proto TCP (6), length 40)
1.1.1.1.37924 > 1.1.1.2.22: Flags [R], cksum 0xaeb1 (correct), seq 3260569026, win 0, length 0
11:47:50.919752 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.1.1.2 tell 1.1.1.1, length 46
11:47:50.919773 ARP, Ethernet (len 6), IPv4 (len 4), Reply 1.1.1.2 is-at 00:50:56:b9:3d:2b, length 28
11:47:50.948732 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.1.1.1 tell 1.1.1.2, length 28
11:47:50.948916 ARP, Ethernet (len 6), IPv4 (len 4), Reply 1.1.1.1 is-at 00:50:56:80:57:1a, length 46
^C
11 packets captured
11 packets received by filter
0 packets dropped by kernel


UPDATE 6



Due to the checkum error, I disabled the TCP/UDP checksum offloading to the NIC in the VM, however it did not help.



$sudo ethtool -K eth0 rx off
$sudo ethtool -K eth0 tx off

iface eth0 inet static
address 1.1.1.2
netmask 255.255.255.0
network 1.1.1.0
broadcast 1.11.1.255
gateway 1.1.1.254
post-up /sbin/ethtool -K $IFACE rx off
post-up /sbin/ethtool -K $IFACE tx off


Understanding TCP Checksum Offloading (TCO) in a VMware Environment (2052904)



UPDATE 7



Disabled GSSAPIAuthentication in the ssh client in the jumpbox. Tested Enable Compression yes No change.



UPDATE 8



Testing filling up the checksum with iptables.



/sbin/iptables -A POSTROUTING -t mangle -p tcp -j CHECKSUM --checksum-fill


It did not improve the situation.



UPDATE 9:



Found an interesting test about limiting cyphers, will try it out. MTU problems does not seem the culprit as I am having problems in some cases with server and client in the same network.



For now tested in the client side "ssh -c aes256-ctr", and the symptoms do not improve.



The mysterious case of broken SSH client (“connection reset by peer”)



UPDATE 10



Added this to /etc/ssh/ssh_config. No changes.



Ciphers aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc


SSH issues: Read from socket failed: Connection reset by peer



UPDATE 11



Defined the ssh service in port 22 and port 2222. It did not help.



UPDATE 12



I suspect it being a regression bug present in OpenSSH 7.4 that was corrected with OpenSSH 7.5



Release notes from OpenSSH 7.5




  • sshd(8): Fix regression in OpenSSH 7.4 support for the
    server-sig-algs extension, where SHA2 RSA signature methods were
    not being correctly advertised. bz#2680


For using openSSH 7.5 in Debian 9/Stretch, I installed openssh-client and openssh-server from Debian testing/Buster.



No improvements on the situation.



UPDATE 13



Defined



Ciphers aes256-ctr
MACs hmac-sha1



Both at the client(s) and server side. No improvements.



UPDATE 14



Setup



UseDNS no
GSSAPIAuthentication no
GSSAPIKeyExchange no


No change.



UPDATE 15



/etc/ssh/sshd_config



Changed it to /etc/ssh/sshd_config:



TCPKeepAlive no


From How does tcp-keepalive work in ssh?




TCPKeepAlive operates on the TCP layer. It sends an empty TCP ACK
packet [from the SSH server to the client - Rui]. Firewalls can be configured to ignore these packets, so if you
go through a firewall that drops idle connections, these may not keep
the connection alive.




My guess is that TCPKeepAlive was configuring the server sending a packet that is being optimised/ignored in some layer down the stack bellow, and somewhat the remote SSH server believed it was still connected to the TCP mux client, while in fact the session was already teared down; thus the TCP reset(s) at first try.



So whilst some say that if you're using ClientAliveInterval, you can disable TCPKeepAlive, it seems to be more it you are using ClientAliveInterval you ought to disable TCPKeepAlive.




  • It is clearly this option; as for the explanation, they are mainly conjectures and will have to double check them/the source when and if I have got time.


TCPKeepAlive apparently also has spoofing issues, so it is recommended that it should be turned off.



Nevertheless, still with the problem.










share|improve this question

























  • The RST packets are not normal, something between your machine and the server seems to be killing your TCP connection. It's hard to tell what that might be without a full packet dump.

    – Satō Katsura
    Sep 8 '17 at 10:01











  • @SatōKatsura Though better. That server and jumpbox in the tcpdump are both in the same network; I do have other servers that do routing via firewall

    – Rui F Ribeiro
    Sep 8 '17 at 11:24













  • Well, you need to find out where those RST come from. There could be any number of reasons for that. shrug

    – Satō Katsura
    Sep 8 '17 at 11:33











  • @SatōKatsura sure indeed. Will add another tcpdump when at work. The difficult part is that this is a bit random

    – Rui F Ribeiro
    Sep 8 '17 at 11:37
















5












5








5


1






I have been using a Debian 9 SSH jumpbox host to run my scripts/ansible playbooks for a while. The jumbox talks with Debian 9 and some Debian 8 servers, mostly. Most of the servers are VMs running under VMWare Enterprise 5.5.



The SSH client in the jumbox is configured for doing SSH MUX, and the authentication is done by an RSA certificate file.



The SSH has been working well for years now, however suddenly SSH connections started giving the error ssh_exchange_identification: read: Connection reset by peer at first try, several times a day, which obviously creates havoc with my scripts and scripts of our development team.



However, after the first try they are ok for a while. The servers misbehaving appear be random at first, but they have some patterns/timeouts. If I do send a command to all of the servers, for instance, running in a command before the intended script/playbook, a few will fail, but the next script will run in all of them.



There havent been recent significant changes on the servers, except for security updates. The transition for Debian 9 has already some (significant) time.



I already found a MTU configuration or other that was once applied to several servers in a malfunction and forgotten, however that was not the case. I also diminished both on the client and server side the ControlPersist and ClientAliveInterval both to 1h, and that did not improve the situation.



So at the moment, I am at loss of why this is happening. I am however more inclined to a layer 7 issue than a network problem.



The SSH configuration on the client side /etc/ssh_config, Debian 9 is:



Host *
SendEnv LANG LC_*
HashKnownHosts yes
GSSAPIAuthentication yes
GSSAPIDelegateCredentials no
ControlMaster auto
ControlPath /tmp/ssh_mux_%h_%p_%r
ControlPersist 1h
Compression no
UseRoaming no


On SSH on the server side of several Debian servers:



Protocol 2
HostKey /etc/ssh/ssh_host_rsa_key
HostKey /etc/ssh/ssh_host_dsa_key
UsePrivilegeSeparation yes

SyslogFacility AUTH
LogLevel INFO
LoginGraceTime 120
PermitRootLogin forced-commands-only
StrictModes yes
PubkeyAuthentication yes
IgnoreRhosts yes
HostbasedAuthentication no
PermitEmptyPasswords no
ChallengeResponseAuthentication no
PasswordAuthentication no

X11Forwarding no
X11DisplayOffset 10
PrintMotd no
PrintLastLog yes
TCPKeepAlive yes

AcceptEnv LANG LC_*
Subsystem sftp /usr/lib/openssh/sftp-server -l INFO
UsePAM yes
ClientAliveInterval 3600
ClientAliveCountMax 0
AddressFamily inet


SSH versions:



client -



$ssh -V 
OpenSSH_7.4p1 Debian-10+deb9u1, OpenSSL 1.0.2l 25 May 2017


server(s)



SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u1 (Debian 9)
SSH-2.0-OpenSSH_6.7p1 Debian-5+deb8u3 (Debian 8)


I have seen that error at least in situations with both servers with the 4.9.0-0.bpo.1-amd64 version.



The tcpdump of a server misbehaving, both machines being in the same network without any firewalls in the middle. I also monitor MAC addresses and there is not log of a new machine/MAC with the same MAC addresses in the last few years.



#tcpdump port 22
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
19:42:25.462896 IP jumbox.40270 > server.ssh: Flags [S], seq 3882361678, win 23200, options [mss 1160,sackOK,TS val 354223428 ecr 0,nop,wscale 7], length 0
19:42:25.463289 IP server.ssh > jumbox.40270: Flags [S.], seq 405921081, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:25.463306 IP jumbox.40270 > server.ssh: Flags [.], ack 1, win 182, length 0
19:42:25.481470 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:25.481477 IP jumbox.40270 > server.ssh: Flags [.], ack 504902058, win 182, length 0
19:42:25.481490 IP server.ssh > jumbox.40270: Flags [R], seq 405921082, win 0, length 0
19:42:25.481494 IP server.ssh > jumbox.40270: Flags [P.], seq 504902058:504902097, ack 1, win 182, length 39
19:42:26.491536 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:26.491551 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0
19:42:28.507528 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:28.507552 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0
19:42:32.699540 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:32.699556 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0
19:42:40.891490 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:40.891514 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0
19:42:57.019511 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:57.019534 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0


An ssh -v server log of a failed connection, with the reset error:



OpenSSH_7.4p1 Debian-10+deb9u1, OpenSSL 1.0.2l  25 May 2017
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: /etc/ssh/ssh_config line 59: Deprecated option "useroaming"
debug1: auto-mux: Trying existing master
debug1: Control socket "/tmp/ssh_mux_fenix-storage_22_rui" does not exist
debug1: Connecting to fenix-storage [10.10.32.156] port 22.
debug1: Connection established.
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_rsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
write: Connection reset by peer


An ssh -v server of a successful connection:



OpenSSH_7.4p1 Debian-10+deb9u1, OpenSSL 1.0.2l  25 May 2017
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: /etc/ssh/ssh_config line 59: Deprecated option "useroaming"
debug1: auto-mux: Trying existing master
debug1: Control socket "/tmp/ssh_mux_sql01_22_rui" does not exist
debug1: Connecting to sql01 [10.20.10.88] port 22.
debug1: Connection established.
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_rsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u1
debug1: Remote protocol version 2.0, remote software version OpenSSH_7.4p1 Debian-10+deb9u1
debug1: match: OpenSSH_7.4p1 Debian-10+deb9u1 pat OpenSSH* compat 0x04000000
debug1: Authenticating to sql01:22 as 'rui'
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: rsa-sha2-512
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ssh-rsa SHA256:6aJ+ipXRZJfbei5YbYtvqKXB01t1YO34O2ChdT/vk/4
debug1: Host 'sql01' is known and matches the RSA host key.
debug1: Found key in /home/rui/.ssh/known_hosts:315
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<ssh-ed25519,ssh-rsa,ssh-dss,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521>
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey
debug1: Next authentication method: publickey
debug1: Offering RSA public key: /home/rui/.ssh/id_rsa
debug1: Server accepts key: pkalg ssh-rsa blen 277
debug1: Authentication succeeded (publickey).
Authenticated to sql01 ([10.20.10.88]:22).
debug1: setting up multiplex master socket
debug1: channel 0: new [/tmp/ssh_mux_sql01_22_rui]
debug1: control_persist_detach: backgrounding master process
debug1: forking to background
debug1: Entering interactive session.
debug1: pledge: id
debug1: multiplexing control connection
debug1: channel 1: new [mux-control]
debug1: channel 2: new [client-session]
debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
debug1: Sending environment.
debug1: Sending env LC_ALL = en_US.utf8
debug1: Sending env LANG = en_US.UTF-8
debug1: mux_client_request_session: master session id: 2


Interestingly enough, the behaviour can be reproduced with a telnet command:



$ telnet remote-server 22
Trying x.x.x.x...
Connected to remote-server
Escape character is '^]'.
Connection closed by foreign host.
$ telnet remote-server 22
Trying x.x.x.x...
Connected to remote-server
Escape character is '^]'.
SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u1

Protocol mismatch.
Connection closed by foreign host.


UPDATE:



Forced Protocol 2 in the /etc/ssh_client client configuration in the jumpbox. No change.



UPDATE2:



Changed the old key encrypted with DES-EDE3-CBC for a new key encrypted with AES-128-CBC. Again no visible change.



UPDATE3:



Interestingly enough, while the mux is active, the situation does not presents itself.



UPDATE4:



I also have found a similar question at serverfault, however without a chosen answer: https://serverfault.com/questions/445045/ssh-connection-error-ssh-exchange-identification-read-connection-reset-by-pe



Tried regenerating the ssh host keys, and the suggestion of sshd: ALL without success.



UPDATE 5



Opened a console on the VM on the destination and saw something 'strange'.
tcpdump whereas 1.1.1.1 is the jumpbox.



# tcpdump -n -vvv "host 1.1.1.1"
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:47:45.808273 IP (tos 0x0, ttl 64, id 38171, offset 0, flags [DF], proto TCP (6), length 60)
1.1.1.1.37924 > 1.1.1.2.22: Flags [S], cksum 0xfc1f (correct), seq 3260568985, win 29200, options [mss 1460,sackOK,TS val 407355522 ecr 0,nop,wscale 7], length 0
11:47:45.808318 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
1.1.1.2.22 > 1.1.1.1.37924: Flags [S.], cksum 0x5508 (incorrect -> 0x68a8), seq 2881609759, ack 3260568986, win 28960, options [mss 1460,sackOK,TS val 561702650 ecr 407355522,nop,wscale 7], length 0
11:47:45.808525 IP (tos 0x0, ttl 64, id 38172, offset 0, flags [DF], proto TCP (6), length 52)
1.1.1.1.37924 > 1.1.1.2.22: Flags [.], cksum 0x07b0 (correct), seq 1, ack 1, win 229, options [nop,nop,TS val 407355522 ecr 561702650], length 0
11:47:45.808917 IP (tos 0x0, ttl 64, id 38173, offset 0, flags [DF], proto TCP (6), length 92)
1.1.1.1.37924 > 1.1.1.2.22: Flags [P.], cksum 0x6de0 (correct), seq 1:41, ack 1, win 229, options [nop,nop,TS val 407355522 ecr 561702650], length 40
11:47:45.808930 IP (tos 0x0, ttl 64, id 1754, offset 0, flags [DF], proto TCP (6), length 52)
1.1.1.2.22 > 1.1.1.1.37924: Flags [.], cksum 0x5500 (incorrect -> 0x0789), seq 1, ack 41, win 227, options [nop,nop,TS val 561702651 ecr 407355522], length 0
11:47:45.822178 IP (tos 0x0, ttl 64, id 1755, offset 0, flags [DF], proto TCP (6), length 91)
1.1.1.2.22 > 1.1.1.1.37924: Flags [P.], cksum 0x5527 (incorrect -> 0x70c1), seq 1:40, ack 41, win 227, options [nop,nop,TS val 561702654 ecr 407355522], length 39
11:47:45.822645 IP (tos 0x0, ttl 64, id 21666, offset 0, flags [DF], proto TCP (6), length 40)
1.1.1.1.37924 > 1.1.1.2.22: Flags [R], cksum 0xaeb1 (correct), seq 3260569026, win 0, length 0
11:47:50.919752 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.1.1.2 tell 1.1.1.1, length 46
11:47:50.919773 ARP, Ethernet (len 6), IPv4 (len 4), Reply 1.1.1.2 is-at 00:50:56:b9:3d:2b, length 28
11:47:50.948732 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.1.1.1 tell 1.1.1.2, length 28
11:47:50.948916 ARP, Ethernet (len 6), IPv4 (len 4), Reply 1.1.1.1 is-at 00:50:56:80:57:1a, length 46
^C
11 packets captured
11 packets received by filter
0 packets dropped by kernel


UPDATE 6



Due to the checkum error, I disabled the TCP/UDP checksum offloading to the NIC in the VM, however it did not help.



$sudo ethtool -K eth0 rx off
$sudo ethtool -K eth0 tx off

iface eth0 inet static
address 1.1.1.2
netmask 255.255.255.0
network 1.1.1.0
broadcast 1.11.1.255
gateway 1.1.1.254
post-up /sbin/ethtool -K $IFACE rx off
post-up /sbin/ethtool -K $IFACE tx off


Understanding TCP Checksum Offloading (TCO) in a VMware Environment (2052904)



UPDATE 7



Disabled GSSAPIAuthentication in the ssh client in the jumpbox. Tested Enable Compression yes No change.



UPDATE 8



Testing filling up the checksum with iptables.



/sbin/iptables -A POSTROUTING -t mangle -p tcp -j CHECKSUM --checksum-fill


It did not improve the situation.



UPDATE 9:



Found an interesting test about limiting cyphers, will try it out. MTU problems does not seem the culprit as I am having problems in some cases with server and client in the same network.



For now tested in the client side "ssh -c aes256-ctr", and the symptoms do not improve.



The mysterious case of broken SSH client (“connection reset by peer”)



UPDATE 10



Added this to /etc/ssh/ssh_config. No changes.



Ciphers aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc


SSH issues: Read from socket failed: Connection reset by peer



UPDATE 11



Defined the ssh service in port 22 and port 2222. It did not help.



UPDATE 12



I suspect it being a regression bug present in OpenSSH 7.4 that was corrected with OpenSSH 7.5



Release notes from OpenSSH 7.5




  • sshd(8): Fix regression in OpenSSH 7.4 support for the
    server-sig-algs extension, where SHA2 RSA signature methods were
    not being correctly advertised. bz#2680


For using openSSH 7.5 in Debian 9/Stretch, I installed openssh-client and openssh-server from Debian testing/Buster.



No improvements on the situation.



UPDATE 13



Defined



Ciphers aes256-ctr
MACs hmac-sha1



Both at the client(s) and server side. No improvements.



UPDATE 14



Setup



UseDNS no
GSSAPIAuthentication no
GSSAPIKeyExchange no


No change.



UPDATE 15



/etc/ssh/sshd_config



Changed it to /etc/ssh/sshd_config:



TCPKeepAlive no


From How does tcp-keepalive work in ssh?




TCPKeepAlive operates on the TCP layer. It sends an empty TCP ACK
packet [from the SSH server to the client - Rui]. Firewalls can be configured to ignore these packets, so if you
go through a firewall that drops idle connections, these may not keep
the connection alive.




My guess is that TCPKeepAlive was configuring the server sending a packet that is being optimised/ignored in some layer down the stack bellow, and somewhat the remote SSH server believed it was still connected to the TCP mux client, while in fact the session was already teared down; thus the TCP reset(s) at first try.



So whilst some say that if you're using ClientAliveInterval, you can disable TCPKeepAlive, it seems to be more it you are using ClientAliveInterval you ought to disable TCPKeepAlive.




  • It is clearly this option; as for the explanation, they are mainly conjectures and will have to double check them/the source when and if I have got time.


TCPKeepAlive apparently also has spoofing issues, so it is recommended that it should be turned off.



Nevertheless, still with the problem.










share|improve this question
















I have been using a Debian 9 SSH jumpbox host to run my scripts/ansible playbooks for a while. The jumbox talks with Debian 9 and some Debian 8 servers, mostly. Most of the servers are VMs running under VMWare Enterprise 5.5.



The SSH client in the jumbox is configured for doing SSH MUX, and the authentication is done by an RSA certificate file.



The SSH has been working well for years now, however suddenly SSH connections started giving the error ssh_exchange_identification: read: Connection reset by peer at first try, several times a day, which obviously creates havoc with my scripts and scripts of our development team.



However, after the first try they are ok for a while. The servers misbehaving appear be random at first, but they have some patterns/timeouts. If I do send a command to all of the servers, for instance, running in a command before the intended script/playbook, a few will fail, but the next script will run in all of them.



There havent been recent significant changes on the servers, except for security updates. The transition for Debian 9 has already some (significant) time.



I already found a MTU configuration or other that was once applied to several servers in a malfunction and forgotten, however that was not the case. I also diminished both on the client and server side the ControlPersist and ClientAliveInterval both to 1h, and that did not improve the situation.



So at the moment, I am at loss of why this is happening. I am however more inclined to a layer 7 issue than a network problem.



The SSH configuration on the client side /etc/ssh_config, Debian 9 is:



Host *
SendEnv LANG LC_*
HashKnownHosts yes
GSSAPIAuthentication yes
GSSAPIDelegateCredentials no
ControlMaster auto
ControlPath /tmp/ssh_mux_%h_%p_%r
ControlPersist 1h
Compression no
UseRoaming no


On SSH on the server side of several Debian servers:



Protocol 2
HostKey /etc/ssh/ssh_host_rsa_key
HostKey /etc/ssh/ssh_host_dsa_key
UsePrivilegeSeparation yes

SyslogFacility AUTH
LogLevel INFO
LoginGraceTime 120
PermitRootLogin forced-commands-only
StrictModes yes
PubkeyAuthentication yes
IgnoreRhosts yes
HostbasedAuthentication no
PermitEmptyPasswords no
ChallengeResponseAuthentication no
PasswordAuthentication no

X11Forwarding no
X11DisplayOffset 10
PrintMotd no
PrintLastLog yes
TCPKeepAlive yes

AcceptEnv LANG LC_*
Subsystem sftp /usr/lib/openssh/sftp-server -l INFO
UsePAM yes
ClientAliveInterval 3600
ClientAliveCountMax 0
AddressFamily inet


SSH versions:



client -



$ssh -V 
OpenSSH_7.4p1 Debian-10+deb9u1, OpenSSL 1.0.2l 25 May 2017


server(s)



SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u1 (Debian 9)
SSH-2.0-OpenSSH_6.7p1 Debian-5+deb8u3 (Debian 8)


I have seen that error at least in situations with both servers with the 4.9.0-0.bpo.1-amd64 version.



The tcpdump of a server misbehaving, both machines being in the same network without any firewalls in the middle. I also monitor MAC addresses and there is not log of a new machine/MAC with the same MAC addresses in the last few years.



#tcpdump port 22
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
19:42:25.462896 IP jumbox.40270 > server.ssh: Flags [S], seq 3882361678, win 23200, options [mss 1160,sackOK,TS val 354223428 ecr 0,nop,wscale 7], length 0
19:42:25.463289 IP server.ssh > jumbox.40270: Flags [S.], seq 405921081, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:25.463306 IP jumbox.40270 > server.ssh: Flags [.], ack 1, win 182, length 0
19:42:25.481470 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:25.481477 IP jumbox.40270 > server.ssh: Flags [.], ack 504902058, win 182, length 0
19:42:25.481490 IP server.ssh > jumbox.40270: Flags [R], seq 405921082, win 0, length 0
19:42:25.481494 IP server.ssh > jumbox.40270: Flags [P.], seq 504902058:504902097, ack 1, win 182, length 39
19:42:26.491536 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:26.491551 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0
19:42:28.507528 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:28.507552 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0
19:42:32.699540 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:32.699556 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0
19:42:40.891490 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:40.891514 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0
19:42:57.019511 IP server.ssh > jumbox.40270: Flags [S.], seq 4195986320, ack 3882361679, win 23200, options [mss 1160,nop,nop,sackOK,nop,wscale 7], length 0
19:42:57.019534 IP jumbox.40270 > server.ssh: Flags [R], seq 3882361679, win 0, length 0


An ssh -v log of a failed connection to a server, with the reset error:



OpenSSH_7.4p1 Debian-10+deb9u1, OpenSSL 1.0.2l  25 May 2017
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: /etc/ssh/ssh_config line 59: Deprecated option "useroaming"
debug1: auto-mux: Trying existing master
debug1: Control socket "/tmp/ssh_mux_fenix-storage_22_rui" does not exist
debug1: Connecting to fenix-storage [10.10.32.156] port 22.
debug1: Connection established.
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_rsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
write: Connection reset by peer


An ssh -v log of a successful connection:



OpenSSH_7.4p1 Debian-10+deb9u1, OpenSSL 1.0.2l  25 May 2017
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: Applying options for *
debug1: /etc/ssh/ssh_config line 59: Deprecated option "useroaming"
debug1: auto-mux: Trying existing master
debug1: Control socket "/tmp/ssh_mux_sql01_22_rui" does not exist
debug1: Connecting to sql01 [10.20.10.88] port 22.
debug1: Connection established.
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_rsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_rsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_dsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_dsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ecdsa type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ecdsa-cert type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ed25519 type -1
debug1: key_load_public: No such file or directory
debug1: identity file /home/rui/.ssh/id_ed25519-cert type -1
debug1: Enabling compatibility mode for protocol 2.0
debug1: Local version string SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u1
debug1: Remote protocol version 2.0, remote software version OpenSSH_7.4p1 Debian-10+deb9u1
debug1: match: OpenSSH_7.4p1 Debian-10+deb9u1 pat OpenSSH* compat 0x04000000
debug1: Authenticating to sql01:22 as 'rui'
debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: algorithm: curve25519-sha256
debug1: kex: host key algorithm: rsa-sha2-512
debug1: kex: server->client cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: kex: client->server cipher: chacha20-poly1305@openssh.com MAC: <implicit> compression: none
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY
debug1: Server host key: ssh-rsa SHA256:6aJ+ipXRZJfbei5YbYtvqKXB01t1YO34O2ChdT/vk/4
debug1: Host 'sql01' is known and matches the RSA host key.
debug1: Found key in /home/rui/.ssh/known_hosts:315
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_NEWKEYS sent
debug1: expecting SSH2_MSG_NEWKEYS
debug1: SSH2_MSG_NEWKEYS received
debug1: rekey after 134217728 blocks
debug1: SSH2_MSG_EXT_INFO received
debug1: kex_input_ext_info: server-sig-algs=<ssh-ed25519,ssh-rsa,ssh-dss,ecdsa-sha2-nistp256,ecdsa-sha2-nistp384,ecdsa-sha2-nistp521>
debug1: SSH2_MSG_SERVICE_ACCEPT received
debug1: Authentications that can continue: publickey
debug1: Next authentication method: publickey
debug1: Offering RSA public key: /home/rui/.ssh/id_rsa
debug1: Server accepts key: pkalg ssh-rsa blen 277
debug1: Authentication succeeded (publickey).
Authenticated to sql01 ([10.20.10.88]:22).
debug1: setting up multiplex master socket
debug1: channel 0: new [/tmp/ssh_mux_sql01_22_rui]
debug1: control_persist_detach: backgrounding master process
debug1: forking to background
debug1: Entering interactive session.
debug1: pledge: id
debug1: multiplexing control connection
debug1: channel 1: new [mux-control]
debug1: channel 2: new [client-session]
debug1: client_input_global_request: rtype hostkeys-00@openssh.com want_reply 0
debug1: Sending environment.
debug1: Sending env LC_ALL = en_US.utf8
debug1: Sending env LANG = en_US.UTF-8
debug1: mux_client_request_session: master session id: 2


Interestingly enough, the behaviour can be reproduced with a telnet command:



$ telnet remote-server 22
Trying x.x.x.x...
Connected to remote-server
Escape character is '^]'.
Connection closed by foreign host.
$ telnet remote-server 22
Trying x.x.x.x...
Connected to remote-server
Escape character is '^]'.
SSH-2.0-OpenSSH_7.4p1 Debian-10+deb9u1

Protocol mismatch.
Connection closed by foreign host.
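
A quick way to spot which hosts show this first-contact reset is to probe the SSH banner twice per host, something like the sketch below (hosts.txt is a hypothetical host list; bash's /dev/tcp feature and the timeout utility are assumed to be available):

# probe the SSH banner twice per host; report hosts whose first attempt
# is reset but whose second attempt returns a banner
while read -r h; do
    first=$(timeout 5 bash -c "exec 3<>/dev/tcp/$h/22 && head -n1 <&3" 2>/dev/null)
    sleep 1
    second=$(timeout 5 bash -c "exec 3<>/dev/tcp/$h/22 && head -n1 <&3" 2>/dev/null)
    if [ -z "$first" ] && [ -n "$second" ]; then
        echo "$h: banner reset on first try"
    fi
done < hosts.txt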


UPDATE:



Forced Protocol 2 in the /etc/ssh/ssh_config client configuration on the jumpbox. No change.



UPDATE2:



Replaced the old key, encrypted with DES-EDE3-CBC, with a new key encrypted with AES-128-CBC. Again, no visible change.



UPDATE3:



Interestingly enough, while the mux master connection is active, the problem does not present itself.
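
For reference, whether a mux master is currently active for a given host can be checked through the control socket (a minimal sketch; the pid in the output is illustrative):

$ ssh -O check server
Master running (pid=12345)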



UPDATE4:



I have also found a similar question at Server Fault, though without an accepted answer: https://serverfault.com/questions/445045/ssh-connection-error-ssh-exchange-identification-read-connection-reset-by-pe



Tried regenerating the SSH host keys, as well as the sshd: ALL suggestion (TCP wrappers), without success.



UPDATE 5



Opened a console on the destination VM and saw something 'strange' in a tcpdump, where 1.1.1.1 is the jumpbox.



# tcpdump -n -vvv "host 1.1.1.1"
tcpdump: listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
11:47:45.808273 IP (tos 0x0, ttl 64, id 38171, offset 0, flags [DF], proto TCP (6), length 60)
1.1.1.1.37924 > 1.1.1.2.22: Flags [S], cksum 0xfc1f (correct), seq 3260568985, win 29200, options [mss 1460,sackOK,TS val 407355522 ecr 0,nop,wscale 7], length 0
11:47:45.808318 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
1.1.1.2.22 > 1.1.1.1.37924: Flags [S.], cksum 0x5508 (incorrect -> 0x68a8), seq 2881609759, ack 3260568986, win 28960, options [mss 1460,sackOK,TS val 561702650 ecr 407355522,nop,wscale 7], length 0
11:47:45.808525 IP (tos 0x0, ttl 64, id 38172, offset 0, flags [DF], proto TCP (6), length 52)
1.1.1.1.37924 > 1.1.1.2.22: Flags [.], cksum 0x07b0 (correct), seq 1, ack 1, win 229, options [nop,nop,TS val 407355522 ecr 561702650], length 0
11:47:45.808917 IP (tos 0x0, ttl 64, id 38173, offset 0, flags [DF], proto TCP (6), length 92)
1.1.1.1.37924 > 1.1.1.2.22: Flags [P.], cksum 0x6de0 (correct), seq 1:41, ack 1, win 229, options [nop,nop,TS val 407355522 ecr 561702650], length 40
11:47:45.808930 IP (tos 0x0, ttl 64, id 1754, offset 0, flags [DF], proto TCP (6), length 52)
1.1.1.2.22 > 1.1.1.1.37924: Flags [.], cksum 0x5500 (incorrect -> 0x0789), seq 1, ack 41, win 227, options [nop,nop,TS val 561702651 ecr 407355522], length 0
11:47:45.822178 IP (tos 0x0, ttl 64, id 1755, offset 0, flags [DF], proto TCP (6), length 91)
1.1.1.2.22 > 1.1.1.1.37924: Flags [P.], cksum 0x5527 (incorrect -> 0x70c1), seq 1:40, ack 41, win 227, options [nop,nop,TS val 561702654 ecr 407355522], length 39
11:47:45.822645 IP (tos 0x0, ttl 64, id 21666, offset 0, flags [DF], proto TCP (6), length 40)
1.1.1.1.37924 > 1.1.1.2.22: Flags [R], cksum 0xaeb1 (correct), seq 3260569026, win 0, length 0
11:47:50.919752 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.1.1.2 tell 1.1.1.1, length 46
11:47:50.919773 ARP, Ethernet (len 6), IPv4 (len 4), Reply 1.1.1.2 is-at 00:50:56:b9:3d:2b, length 28
11:47:50.948732 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 1.1.1.1 tell 1.1.1.2, length 28
11:47:50.948916 ARP, Ethernet (len 6), IPv4 (len 4), Reply 1.1.1.1 is-at 00:50:56:80:57:1a, length 46
^C
11 packets captured
11 packets received by filter
0 packets dropped by kernel


UPDATE 6



Due to the checksum error, I disabled TCP/UDP checksum offloading on the NIC in the VM; however, it did not help.



$ sudo ethtool -K eth0 rx off
$ sudo ethtool -K eth0 tx off

iface eth0 inet static
address 1.1.1.2
netmask 255.255.255.0
network 1.1.1.0
broadcast 1.1.1.255
gateway 1.1.1.254
post-up /sbin/ethtool -K $IFACE rx off
post-up /sbin/ethtool -K $IFACE tx off
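
Whether offloading is really disabled can be verified with ethtool's lowercase -k option (a sketch; the interface name and exact feature names may differ per driver):

$ sudo ethtool -k eth0 | grep -i checksumming
rx-checksumming: off
tx-checksumming: off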


Understanding TCP Checksum Offloading (TCO) in a VMware Environment (2052904)



UPDATE 7



Disabled GSSAPIAuthentication in the SSH client on the jumpbox. Also tested Compression yes. No change.



UPDATE 8



Tested filling in the TCP checksum with iptables:



/sbin/iptables -A POSTROUTING -t mangle -p tcp -j CHECKSUM --checksum-fill


It did not improve the situation.
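
Whether that mangle rule was actually matching the SSH traffic can be checked via the per-rule packet counters (a sketch, not part of the original test):

# list the mangle POSTROUTING chain with packet/byte counters for each rule
$ sudo iptables -t mangle -L POSTROUTING -v -n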



UPDATE 9:



Found an interesting test about limiting ciphers and will try it out. MTU problems do not seem to be the culprit, as in some cases I am having problems with the server and client on the same network.



For now, tested on the client side with "ssh -c aes256-ctr", and the symptoms did not improve.



The mysterious case of broken SSH client (“connection reset by peer”)



UPDATE 10



Added this to /etc/ssh/ssh_config. No changes.



Ciphers aes128-ctr,aes192-ctr,aes256-ctr,arcfour256,arcfour128,aes128-cbc,3des-cbc


SSH issues: Read from socket failed: Connection reset by peer



UPDATE 11



Configured the SSH service on both port 22 and port 2222. It did not help.



UPDATE 12



I suspected a regression bug present in OpenSSH 7.4 that was corrected in OpenSSH 7.5.



Release notes from OpenSSH 7.5




  • sshd(8): Fix regression in OpenSSH 7.4 support for the
    server-sig-algs extension, where SHA2 RSA signature methods were
    not being correctly advertised. bz#2680


To use OpenSSH 7.5 on Debian 9/Stretch, I installed openssh-client and openssh-server from Debian testing/Buster.
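
A minimal sketch of how such a selective install can be done, assuming a 'buster' (testing) entry already exists in /etc/apt/sources.list:

$ sudo apt-get update
$ sudo apt-get install -t buster openssh-client openssh-server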



No improvement in the situation.



UPDATE 13



Defined



Ciphers aes256-ctr
MACs hmac-sha1



Both on the client(s) and the server side. No improvement.



UPDATE 14



Set up:



UseDNS no
GSSAPIAuthentication no
GSSAPIKeyExchange no


No change.



UPDATE 15



Changed /etc/ssh/sshd_config to:



TCPKeepAlive no


From How does tcp-keepalive work in ssh?




TCPKeepAlive operates on the TCP layer. It sends an empty TCP ACK
packet [from the SSH server to the client - Rui]. Firewalls can be configured to ignore these packets, so if you
go through a firewall that drops idle connections, these may not keep
the connection alive.




My guess is that TCPKeepAlive had the server sending a packet that was being optimised away or ignored somewhere further down the stack, and somehow the remote SSH server believed it was still connected to the TCP mux client while in fact the session had already been torn down; hence the TCP reset(s) on the first try.



So whilst some say that if you are using ClientAliveInterval you can disable TCPKeepAlive, it seems to be more that if you are using ClientAliveInterval you ought to disable TCPKeepAlive.
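
As a minimal sketch, that server-side combination would look like this in /etc/ssh/sshd_config (the interval and count values are illustrative, not the ones from this setup):

# keepalive handled inside the SSH protocol rather than at the TCP layer
TCPKeepAlive no
ClientAliveInterval 300
ClientAliveCountMax 3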




  • It is clearly this option; as for the explanation, these are mainly conjectures, and I will have to double-check them/the source when and if I have time.


TCPKeepAlive apparently also has spoofing issues, so it is recommended to turn it off.



Nevertheless, the problem was still there.







debian ssh vmware



asked Sep 8 '17 at 9:35, edited Oct 10 '17 at 13:02 by Rui F Ribeiro


  • The RST packets are not normal, something between your machine and the server seems to be killing your TCP connection. It's hard to tell what that might be without a full packet dump.

    – Satō Katsura
    Sep 8 '17 at 10:01











  • @SatōKatsura Though better. That server and jumpbox in the tcpdump are both in the same network; I do have other servers that do routing via firewall

    – Rui F Ribeiro
    Sep 8 '17 at 11:24













  • Well, you need to find out where those RST come from. There could be any number of reasons for that. shrug

    – Satō Katsura
    Sep 8 '17 at 11:33











  • @SatōKatsura sure indeed. Will add another tcpdump when at work. The difficult part is that this is a bit random

    – Rui F Ribeiro
    Sep 8 '17 at 11:37





















4 Answers


















1














Your symptoms sound consistent with having a machine on the network using the same IP address as the SSH server. Check the MAC address of the RST packets.
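
One way to do that (a sketch, not from the original answer; the interface name is an assumption) is to capture link-layer headers for RST packets only, so the source MAC of each reset can be compared with the server's known MAC:

# -e prints the ethernet header, making the MAC that emits the RST visible
$ sudo tcpdump -e -n -i eth0 'port 22 and tcp[tcpflags] & tcp-rst != 0'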






answered Sep 9 '17 at 6:57 by user1998586
























  • I actually monitor MAC addresses on that netblock, and it seems not to be the case.

    – Rui F Ribeiro
    Sep 9 '17 at 8:20



















1














Are you going through any firewall or device attempting TCP optimisation? I had the same experience over a network, and it turned out to be a device doing TCP optimisation.






answered Oct 4 '18 at 17:41 by Yusufk
























  • Most probably. Later on I solved a couple of bugs on that FW/router by changing the configuration at the Cisco level, but I never came back to this, and nowadays I am at another job.

    – Rui F Ribeiro
    Oct 25 '18 at 16:59





















0














I found some systems with



net.ipv4.tcp_timestamps = 0


in /etc/sysctl.conf; the servers having the problem all had that setting (TCP timestamps disabled).



I ended up removing that line from the affected systems and running the following on all systems:



sudo sysctl -w net.ipv4.tcp_timestamps=1


Waiting for further tests.
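
To make that setting survive a reboot, the corresponding line can be added to /etc/sysctl.conf and reloaded (a sketch of the usual Debian approach):

$ echo 'net.ipv4.tcp_timestamps = 1' | sudo tee -a /etc/sysctl.conf
$ sudo sysctl -p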






edited Oct 10 '17 at 13:26, answered Oct 6 '17 at 22:08 by Rui F Ribeiro

































    0














    In the end, I found out it was due to bugs in the Cisco 6059 core router and the ASA firewall being used.



    Linux kernels v3 and v4 do not play well with TCP sequence randomization, and give "random" problems when transferring big files, or other kinds of obscure problems on many connections, of which SSH was the most visible. Unfortunately, Windows, Mac and FreeBSD do play well with it, so it can arguably be called a Linux bug.




    Each TCP connection has two ISNs: one generated by the client and one
    generated by the server. The ASA randomizes the ISN of the TCP SYN
    passing in both the inbound and outbound directions.



    Randomizing the ISN of the protected host prevents an attacker from
    predicting the next ISN for a new connection and potentially hijacking
    the new session.



    You can disable TCP initial sequence number randomization if
    necessary, for example, because data is getting scrambled. For
    example:



    If another in-line firewall is also randomizing the initial sequence
    numbers, there is no need for both firewalls to be performing this
    action, even though this action does not affect the traffic.




    I initially disabled the sequence randomization only in the internal core router; that was not enough. After it was disabled both in the border firewalls and in the core Cisco router/switch, the problem stopped happening.



    Disabling it looks something like this:



    policy-map global_policy
    class preserve-sq-no
    set connection random-sequence-number disable


    See Cisco note Disable TCP Sequence Randomization
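
    For context, a fuller (hypothetical) version of that ASA policy, restricting the exemption to SSH traffic via a class-map, might look like the following sketch; the access-list and class names are illustrative, not taken from the actual configuration:

    ! match SSH traffic and exempt it from initial sequence number randomization
    access-list preserve-sq-no-acl extended permit tcp any any eq 22
    class-map preserve-sq-no
      match access-list preserve-sq-no-acl
    policy-map global_policy
      class preserve-sq-no
        set connection random-sequence-number disable
    service-policy global_policy global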





    answered by Rui F Ribeiro






















