Loading ...
Sorry, an error occurred while loading the content.
 

Re: Postfix lost connection issue

Expand Messages
  • Viktor Dukhovni
    ... Yuck, nss_wins openlog(3) breakage! :-( I would find a way of disabling this. Perhaps proxymap can help. Also report the bug to your O/S vendor, shared
    Message 1 of 11 , May 2, 2013
      On Thu, May 02, 2013 at 10:31:07PM -0700, mailtime wrote:

      > After many years of hassle free use from our email system we are now getting
      > a rather odd disconnect as shown below.
      >
      >
      > May 3 16:07:36 spambox nss_wins[14039]: connect from
      > sender.remote.com[1.2.3.4]

      Yuck, nss_wins openlog(3) breakage! :-( I would find a way of
      disabling this. Perhaps proxymap can help. Also report the
      bug to your O/S vendor, shared libraries must not create this
      mess.

      > May 3 16:07:39 spambox nss_wins[14039]: lost connection after DATA from
      > remote.remote.com[1.2.3.4]

      http://www.postfix.org/DEBUG_README.html#sniffer

      Then read the result with wireshark. Make a careful note of the
      window-scaling, and MTU negotiated at the start of the session,
      what happened shortly before the sender closed the session. Who
      sent the first FIN or RST? Were there any retransmissions.

      You could disable window scaling on your end, or set the default
      scale exponent to 0. Most problems are firewall or NAT device
      problems and window scaling can confuse the more crude implementations.

      Any other logging from smtpd (aka nss_wins) 14039 between the
      connection start and end?

      --
      Viktor.
    • mailtime
      Thanks for your thoughts i will follow that article and see what i get back. In regard to the 14039, no only the 4 entries (pertaining to that transaction) I
      Message 2 of 11 , May 2, 2013
        Thanks for your thoughts i will follow that article and see what i get back.
        In regard to the 14039, no only the 4 entries (pertaining to that
        transaction)

        I do finally have a log file from one of sending mail servers though.

        20:08:45.214 RX: <220 spambox.co.nz ESMTP Postfix>
        20:08:45.214 TX: <EHLO remote.remote.com>
        20:08:45.926 RX: <250-spambox.co.nz>
        20:08:45.926 RX: <250-PIPELINING>
        20:08:45.926 RX: <250-SIZE 20728640>
        20:08:45.926 RX: <250-VRFY>
        20:08:45.926 RX: <250-ETRN>
        20:08:45.926 RX: <250-STARTTLS>
        20:08:45.926 RX: <250-AUTH PLAIN LOGIN CRAM-MD5>
        20:08:45.926 RX: <250-AUTH=PLAIN LOGIN CRAM-MD5>
        20:08:45.927 RX: <250-ENHANCEDSTATUSCODES>
        20:08:45.927 RX: <250-8BITMIME>
        20:08:45.927 RX: <250 DSN>
        20:08:45.927 TX: <MAIL FROM:<sender@...> SIZE=6469087>
        20:08:46.088 RX: <250 2.1.0 Ok>
        20:08:46.088 TX: <RCPT TO:<user@...>>
        20:08:46.940 RX: <250 2.1.5 Ok>
        20:08:46.941 TX: <RCPT TO:<user2@...>>
        20:08:50.163 RX: <250 2.1.5 Ok>
        20:08:50.163 TX: <DATA>
        20:08:50.323 RX: <354 End data with <CR><LF>.<CR><LF>>
        20:09:12.407 SMTPResponse(Sending Body): Socket error - 10053 -
        WSAECONNABORTED
        20:09:12.407 SMTPSend(QUIT): Socket send failed to 202.162.181.245 error -
        10053 - WSAECONNABORTED
        20:09:12.407 Thread exiting for B5174efe90000.5174efe90000.0001.mml after
        28519 millisecs






        --
        View this message in context: http://postfix.1071664.n5.nabble.com/Postfix-lost-connection-issue-tp57670p57672.html
        Sent from the Postfix Users mailing list archive at Nabble.com.
      • Viktor Dukhovni
        ... The message size is ~6 MB. ... Message content transmission starts at 20:08:50.323. ... Transmission aborts 22.084 seconds later. So the bandwidth used is
        Message 3 of 11 , May 2, 2013
          On Thu, May 02, 2013 at 11:09:35PM -0700, mailtime wrote:

          > 20:08:45.927 TX: <MAIL FROM:<sender@...> SIZE=6469087>

          The message size is ~6 MB.

          > 20:08:50.163 TX: <DATA>
          > 20:08:50.323 RX: <354 End data with <CR><LF>.<CR><LF>>

          Message content transmission starts at 20:08:50.323.

          > 20:09:12.407 SMTPResponse(Sending Body): Socket error - 10053 -
          > WSAECONNABORTED

          Transmission aborts 22.084 seconds later. So the bandwidth used
          is at most 3Mb/s, but sufficiently recent Postfix releases would
          normally log the number of bytes received, and cleanup did see
          headers, so your Postfix is likely relatively old. Do capture the
          session and apply wireshark or similar.

          --
          Viktor.
        • mailtime
          Thanks for your advice Viktor I have done a capture and loaded into WireShark. 1, 0.000000 , 1.2.3.4 , 192.168.1.239 , TCP , 66 , 19524 smtp [SYN] Seq=0
          Message 4 of 11 , May 7, 2013
            Thanks for your advice Viktor
            I have done a capture and loaded into WireShark.


            1,"0.000000","1.2.3.4","192.168.1.239","TCP","66","19524 > smtp [SYN] Seq=0
            Win=8192 Len=0 MSS=1380 WS=256 SACK_PERM=1"
            2,"0.000022","192.168.1.239","1.2.3.4","TCP","66","smtp > 19524 [SYN, ACK]
            Seq=0 Ack=1 Win=5840 Len=0 MSS=1460 SACK_PERM=1 WS=128"
            3,"0.043797","1.2.3.4","192.168.1.239","TCP","60","19524 > smtp [ACK] Seq=1
            Ack=1 Win=131072 Len=0"
            4,"0.047331","192.168.1.239","1.2.3.4","SMTP","96","S: 220 spambox.co.nz
            ESMTP Postfix"
            5,"0.093672","1.2.3.4","192.168.1.239","SMTP","90","C: EHLO
            remote.sender.com"
            6,"0.093699","192.168.1.239","1.2.3.4","TCP","54","smtp > 19524 [ACK] Seq=43
            Ack=37 Win=5888 Len=0"
            7,"0.095070","192.168.1.239","1.2.3.4","SMTP","261","S: 250-spambox.co.nz |
            250-PIPELINING | 250-SIZE 20728640 | 250-VRFY | 250-ETRN | 250-STARTTLS |
            250-AUTH PLAIN LOGIN CRAM-MD5 | 250-AUTH=PLAIN LOGIN CRAM-MD5 |
            250-ENHANCEDSTATUSCODES | 250-8BITMIME | 250 DSN"
            8,"0.140698","1.2.3.4","192.168.1.239","SMTP","115","C: MAIL
            FROM:<jono.smith@...> SIZE=221190"
            9,"0.146588","192.168.1.239","1.2.3.4","SMTP","68","S: 250 2.1.0 Ok"
            10,"0.190450","1.2.3.4","192.168.1.239","SMTP","89","C: RCPT
            TO:<recipient@...>"
            11,"0.230190","192.168.1.239","1.2.3.4","TCP","54","smtp > 19524 [ACK]
            Seq=264 Ack=133 Win=5888 Len=0"
            12,"3.625029","192.168.1.239","1.2.3.4","SMTP","68","S: 250 2.1.5 Ok"
            13,"3.668700","1.2.3.4","192.168.1.239","SMTP","60","C: DATA"

            Which follows on for a while transmitting then this occurs

            66,"4.267143","192.168.1.239","1.2.3.4","TCP","54","smtp > 19524 [ACK]
            Seq=315 Ack=47059 Win=64128 Len=0"
            67,"4.272844","1.2.3.4","192.168.1.239","SMTP","1434","C: DATA fragment,
            1380 bytes"
            68,"4.278587","1.2.3.4","192.168.1.239","SMTP","1434","C: DATA fragment,
            1380 bytes"
            69,"4.278597","192.168.1.239","1.2.3.4","TCP","54","smtp > 19524 [ACK]
            Seq=315 Ack=49819 Win=64128 Len=0"
            70,"4.284366","1.2.3.4","192.168.1.239","SMTP","1434","C: DATA fragment,
            1380 bytes"
            71,"4.290093","1.2.3.4","192.168.1.239","SMTP","1434","[TCP Previous segment
            not captured] C: DATA fragment, 1380 bytes"
            72,"4.290104","192.168.1.239","1.2.3.4","TCP","66","smtp > 19524 [ACK]
            Seq=315 Ack=51199 Win=64128 Len=0 SLE=53959 SRE=55339"
            73,"4.295853","1.2.3.4","192.168.1.239","SMTP","1434","C: DATA fragment,
            1380 bytes"
            74,"4.295859","192.168.1.239","1.2.3.4","TCP","66","[TCP Dup ACK 72#1] smtp
            > 19524 [ACK] Seq=315 Ack=51199 Win=64128 Len=0 SLE=53959 SRE=56719"
            75,"4.301731","1.2.3.4","192.168.1.239","SMTP","1434","C: DATA fragment,
            1380 bytes"
            76,"4.301738","192.168.1.239","1.2.3.4","TCP","66","[TCP Dup ACK 72#2] smtp
            > 19524 [ACK] Seq=315 Ack=51199 Win=64128 Len=0 SLE=53959 SRE=58099"
            77,"4.307588","1.2.3.4","192.168.1.239","SMTP","1434","[TCP Previous segment
            not captured] C: DATA fragment, 1380 bytes"

















            --
            View this message in context: http://postfix.1071664.n5.nabble.com/Postfix-lost-connection-issue-tp57670p57788.html
            Sent from the Postfix Users mailing list archive at Nabble.com.
          • Viktor Dukhovni
            ... The wireshark output, which omits much detail, (but the GUI allows you to drill in various details) is for you to interpret. If you want help from me,
            Message 5 of 11 , May 7, 2013
              On Tue, May 07, 2013 at 04:46:35AM -0700, mailtime wrote:

              > Thanks for your advice Viktor
              > I have done a capture and loaded into WireShark.

              The wireshark output, which omits much detail, (but the GUI allows
              you to drill in various details) is for you to interpret. If you
              want help from me, post complete output from "tcpdump -nr capture.pcap".

              That said, clearly some packets from the sender are lost, and never
              retransmitted. The TCP connection negotiates selective ACK and
              window scaling on both sides. First thing I would do is disable
              window scaling on your Postfix server. This will reduce throughput
              for mail from far away senders (without window scaling Hong Kong
              to USA will be at most 64KB per ~0.25 sec or 4 sec per MB of data).

              The problem is almost certainly a firewall bug either at the sending
              or receiving side. Updating a more recent firewall software or
              firmware version may help.

              --
              Viktor.
            • Wietse Venema
              ... Postfix 2.6 and later have a workaround for TCP window scaling bugs. For detailed instructions, see: http://www.postfix.org/postconf.5.html#tcp_windowsize
              Message 6 of 11 , May 7, 2013
                Viktor Dukhovni:
                > That said, clearly some packets from the sender are lost, and never
                > retransmitted. The TCP connection negotiates selective ACK and
                > window scaling on both sides. First thing I would do is disable
                > window scaling on your Postfix server. This will reduce throughput
                > for mail from far away senders (without window scaling Hong Kong
                > to USA will be at most 64KB per ~0.25 sec or 4 sec per MB of data).

                Postfix 2.6 and later have a workaround for TCP window scaling bugs.
                For detailed instructions, see:
                http://www.postfix.org/postconf.5.html#tcp_windowsize

                Wietse
              • Viktor Dukhovni
                ... On which operating systems does this suppress window scaling during the TCP 3-way handshake? I tried MacOSX and NetBSD, neither seemed to suppress window
                Message 7 of 11 , May 7, 2013
                  On Tue, May 07, 2013 at 12:57:27PM -0400, Wietse Venema wrote:

                  > Viktor Dukhovni:
                  > > That said, clearly some packets from the sender are lost, and never
                  > > retransmitted. The TCP connection negotiates selective ACK and
                  > > window scaling on both sides. First thing I would do is disable
                  > > window scaling on your Postfix server. This will reduce throughput
                  > > for mail from far away senders (without window scaling Hong Kong
                  > > to USA will be at most 64KB per ~0.25 sec or 4 sec per MB of data).
                  >
                  > Postfix 2.6 and later have a workaround for TCP window scaling bugs.
                  > For detailed instructions, see:
                  > http://www.postfix.org/postconf.5.html#tcp_windowsize

                  On which operating systems does this suppress window scaling during
                  the TCP 3-way handshake?

                  I tried MacOSX and NetBSD, neither seemed to suppress window scaling
                  when the server window size was set to 32767:

                  C > S: Flags [S], seq 227891400, win 65535, options
                  [mss 1460,wscale 4,TS val 342104207 ecr 0,sackOK,eol], length 0
                  S > C: Flags [S.], seq 3628790798, ack 227891401, win 32767, options
                  [mss 1460,wscale 3,TS val 1 ecr 342104207,sackOK], length 0

                  Have not tried Linux yet, perhaps this works there...

                  --
                  Viktor.
                • Wietse Venema
                  ... It s five years since this I added the tcp_windowsize feature, and the systems that I was using at the time have been decomissioned. Even with minuscule
                  Message 8 of 11 , May 7, 2013
                    Viktor Dukhovni:
                    > On Tue, May 07, 2013 at 12:57:27PM -0400, Wietse Venema wrote:
                    >
                    > > Viktor Dukhovni:
                    > > > That said, clearly some packets from the sender are lost, and never
                    > > > retransmitted. The TCP connection negotiates selective ACK and
                    > > > window scaling on both sides. First thing I would do is disable
                    > > > window scaling on your Postfix server. This will reduce throughput
                    > > > for mail from far away senders (without window scaling Hong Kong
                    > > > to USA will be at most 64KB per ~0.25 sec or 4 sec per MB of data).
                    > >
                    > > Postfix 2.6 and later have a workaround for TCP window scaling bugs.
                    > > For detailed instructions, see:
                    > > http://www.postfix.org/postconf.5.html#tcp_windowsize
                    >
                    > On which operating systems does this suppress window scaling during
                    > the TCP 3-way handshake?

                    It's five years since this I added the tcp_windowsize feature, and
                    the systems that I was using at the time have been decomissioned.
                    Even with minuscule window sizes I see a non-zero wscale option in
                    the SYN+ACK response from FreeBSD 8. I'll update the documentation.

                    Wietse
                  • mailtime
                    Thanks both for the info. I have attached a log file showing the tcp information. It starts to go wrong around Frame 24660: I did disable window scaling on
                    Message 9 of 11 , May 7, 2013
                      Thanks both for the info. I have attached a log file showing the tcp
                      information.
                      It starts to go wrong around Frame 24660:
                      I did disable window scaling on Postfix and for a while i thought it had
                      solved the issue but then more of the same as you can see in the log.
                      (hopefully showing the correct information!)

                      TCPLog <http://postfix.1071664.n5.nabble.com/file/n57846/TCPLog>

                      Thanks in advance




                      --
                      View this message in context: http://postfix.1071664.n5.nabble.com/Postfix-lost-connection-issue-tp57670p57846.html
                      Sent from the Postfix Users mailing list archive at Nabble.com.
                    • Viktor Dukhovni
                      ... What do you mean by on Postfix ? This is a kernel setting, modified via sysctl or similar. In any case the session you posted has no window scaling,
                      Message 10 of 11 , May 8, 2013
                        On Tue, May 07, 2013 at 10:59:45PM -0700, mailtime wrote:

                        > It starts to go wrong around Frame 24660:
                        > I did disable window scaling on Postfix

                        What do you mean by "on Postfix"? This is a kernel setting, modified
                        via "sysctl" or similar. In any case the session you posted has
                        no window scaling, so that is moot. You could try disabling
                        selective ACK support if that's possible:

                        net.ipv4.tcp_sack = 0
                        net.ipv4.tcp_dsack = 0
                        net.ipv4.tcp_fack = 0

                        > and for a while i thought it had
                        > solved the issue but then more of the same as you can see in the log.
                        > (hopefully showing the correct information!)
                        >
                        > TCPLog <http://postfix.1071664.n5.nabble.com/file/n57846/TCPLog>

                        Somebody's firewall or NAT device is pining for the scrap heap.
                        The receiver's duplicate acks eventually trigger sender RSTs instead
                        of retransmission of the lost segment. What we don't know is whose.

                        --
                        Viktor.
                      Your message has been successfully submitted and would be delivered to recipients shortly.