Loading ...
Sorry, an error occurred while loading the content.
 

revision RFC 2001 (draft-ietf-tcpimpl-cong-control-00.txt)

Expand Messages
  • Vern Paxson
    Mark, Rich & I just submitted the following I-D. Because of how close it now is to the I-D cutoff deadline, the I-D publication turnaround time can be quite
    Message 1 of 10 , Aug 6, 1998
      Mark, Rich & I just submitted the following I-D. Because of how close
      it now is to the I-D cutoff deadline, the I-D publication turnaround time
      can be quite large, so I thought I would also send the text directly to the
      mailing list.

      This I-D is a cut at revising RFC 2001 to clarify some points, fix
      some bugs, and remove some overly strict specifications. We also
      added discussion of acking policies (clarifying an ambiguity in 1122),
      restart-after-idle, and security considerations.

      We would like to see if we can drive this to consensus fairly quickly.
      We intend it to be one of the Chicago agenda items, so please try to
      read it between now and then. (The text has enough mods from 2001, and
      is meant to be stand-alone, that it should be read it in its entirety
      rather than us summarizing diffs from 2001.)

      Vern

      -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      TCP Implementation Working Group W. Stevens
      INTERNET DRAFT Consultant
      File: draft-ietf-tcpimpl-cong-control-00.txt M. Allman
      NASA Lewis/Sterling Software
      V. Paxson
      LBNL
      August, 1998


      TCP Congestion Control

      Status of this Memo

      This document is an Internet-Draft. Internet-Drafts are working
      documents of the Internet Engineering Task Force (IETF), its areas,
      and its working groups. Note that other groups may also distribute
      working documents as Internet-Drafts.

      Internet-Drafts are draft documents valid for a maximum of six
      months and may be updated, replaced, or obsoleted by other documents
      at any time. It is inappropriate to use Internet-Drafts as
      reference material or to cite them other than as ``work in
      progress.''

      To view the entire list of current Internet-Drafts, please check the
      "1id-abstracts.txt" listing contained in the Internet-Drafts Shadow
      Directories on ftp.is.co.za (Africa), ftp.nordu.net (Northern
      Europe), ftp.nis.garr.it (Southern Europe), munnari.oz.au (Pacific
      Rim), ftp.ietf.org (US East Coast), or ftp.isi.edu (US West Coast).

      Abstract

      This document defines TCP's four intertwined congestion control
      algorithms: slow start, congestion avoidance, fast retransmit, and
      fast recovery. In addition, the document specifies how TCP should
      begin transmission after a relatively long idle period, as well as
      discussing various acknowledgment generation methods.

      1 Introduction

      This document specifies four TCP [Pos81] congestion control
      algorithms: slow start, congestion avoidance, fast retransmit and
      fast recovery. These algorithms were devised in [Jac88] and
      [Jac90]. Their use with TCP is required by [Bra89].

      This document is an update of [Ste97]. In addition to specifying
      the congestion control algorithms, this document specifies what TCP
      connections should do after a relatively long idle period, as well
      as specifying and clarifying some of the issues pertaining to TCP
      ACK generation.

      Note that [Ste94] provides examples of these algorithms in action
      and [WS95] provides an explanation of the source code for the BSD
      implementation of these algorithms.


      Expires: August, 1998 [Page 1]

      draft-ietf-tcpimpl-cong-control-00.txt August 1998

      This document is organized as follows. Section 2 provides various
      definitions which will be used throughout the paper. Section 3
      provides a specification of the congestion control algorithms.
      Section 4 outlines concerns related to the congestion control
      algorithms and finally, section 5 outlines security considerations.

      The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
      "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
      document are to be interpreted as described in [Bra97].

      2 Definitions

      This section provides the definition of several terms that will be
      used throughout the remainder of this document.

      SEGMENT:
      A segment is ANY TCP/IP data or acknowledgment packet (or both).

      MAXIMUM SEGMENT SIZE (MSS):
      The MSS is the largest segment size that can be used. The size
      does not include the TCP/IP headers and options.

      FULL-SIZED SEGMENT:
      A segment that contains the maximum number of data bytes
      permitted (i.e., a segment containing MSS bytes of data).

      RECEIVER WINDOW (rwnd)
      The most recently advertised receiver window.

      CONGESTION WINDOW (cwnd):
      A TCP state variable that limits the amount of data a TCP can
      send. At any given time, a TCP MUST NOT send data with a
      sequence number higher than the sum of the highest acknowledged
      sequence number and the minimum of cwnd and rwnd.

      INITIAL WINDOW (IW):
      The initial window is the size of the sender's congestion window
      when a connection is established.

      LOSS WINDOW (LW):
      The loss window is the size of the congestion window after a TCP
      sender detects loss using its retransmission timer.

      RESTART WINDOW (RW):
      The restart window is the size of the congestion window after a
      TCP restarts transmission after an idle period.

      3 Congestion Control Algorithms

      This section defines the four congestion control algorithms: slow
      start, congestion avoidance, fast retransmit and fast recovery,
      developed in [Jac88] and [Jac90]. In some situations it may be
      beneficial for a TCP sender to be more conservative than the
      algorithms allow, however a TCP MUST NOT be more aggressive than the

      Expires: August, 1998 [Page 2]

      draft-ietf-tcpimpl-cong-control-00.txt August 1998

      following algorithms allow (that is, MUST NOT send data when the
      value of cwnd computed by the following algorithms would not allow
      the data to be sent).

      3.1 Slow Start and Congestion Avoidance

      The slow start and congestion avoidance algorithms MUST be used by a
      TCP sender to control the amount of outstanding data being injected
      into the network. To implement these algorithms, two variables are
      added to the TCP per-connection state. The congestion window (cwnd)
      is a sender-side limit on the amount of data the sender can transmit
      into the network before receiving an acknowledgment (ACK), while the
      receiver's advertised window (rwnd) is a receiver-side limit on the
      amount of outstanding data. The minimum of cwnd and rwnd governs
      data transmission.

      Another state variable, the slow start threshold (ssthresh), is used
      to determine whether the slow start or congestion avoidance
      algorithm is used to control data transmission, as discussed below.

      Beginning transmission into a network with unknown conditions
      requires TCP to slowly probe the network to determine the available
      capacity, in order to avoid congesting the network with an
      inappropriately large burst of data. The slow start algorithm is
      used for this purpose at the beginning of a transfer, or after
      repairing loss detected by the retransmission timer.

      IW, the initial value of cwnd, MUST be less than or equal to MSS
      bytes.

      We note that a non-standard, experimental TCP extension allows that
      a TCP MAY use a larger initial window (IW), as defined in equation 1
      [AFP98]:

      IW = min (4*MSS, max (2*MSS, 4380 bytes)) (1)

      With this extension, a TCP sender MAY use a 2 segment initial
      window, regardless of the segment size, and 3 and 4 segment initial
      windows MAY be used, provided the combined size of the segments does
      not exceed 4380 bytes. We do NOT allow this change as part of the
      standard defined by this document. However, we include discussion
      of (1) in the remainder of this document as a guideline for those
      experimenting with the change, rather than conforming to the present
      standards for TCP congestion control.

      The initial value of ssthresh MAY be arbitrarily high (for example,
      some implementations use the size of the advertised window), but it
      may be reduced in response to congestion. The slow start algorithm
      is used when cwnd < ssthresh, while the congestion avoidance
      algorithm is used when cwnd > ssthresh. When cwnd and ssthresh are
      equal the sender may use either slow start or congestion avoidance.

      During slow start, a TCP increments cwnd by at most MSS bytes for
      each ACK received that acknowledges new data. Slow start ends when

      Expires: August, 1998 [Page 3]

      draft-ietf-tcpimpl-cong-control-00.txt August 1998

      cwnd exceeds ssthresh (or, optionally, when it reaches it, as noted
      above); or when cwnd reaches rwnd; or when congestion is observed.

      During congestion avoidance, cwnd is incremented by 1 full-sized
      segment per round-trip time (RTT). Congestion avoidance continues
      until cwnd reaches the receiver's advertised window or congestion is
      detected. One formula commonly used to update cwnd during
      congestion avoidance is given in equation 2:

      cwnd += MSS*MSS/cwnd (2)

      This provides an acceptable approximation to the underlying
      principle of increasing cwnd by 1 full-sized segment per RTT. (Note
      that for a connection in which the receiver acknowledges every data
      segment, (2) proves slightly more aggressive than 1 segment per RTT,
      and for a receiver acknowledging every-other packet, (2) is less
      aggressive.)

      Implementation Note: Since integer arithmetic is usually used in TCP
      implementations, the formula given in equation 2 can fail to
      increase cwnd when the congestion window is very large (larger than
      MSS*MSS). If the above formula yields 0, the result SHOULD be
      rounded up to 1 byte.

      Implementation Note: older implementations have an additional
      additive constant on the right-hand side of (2). This is incorrect
      and can actually lead to diminished performance [PAD+98].

      Another acceptable way to increase cwnd during congestion avoidance
      is to count the number of bytes that have been acknowledged by ACKs
      for new data. (A drawback of this implementation is that it
      requires maintaining an additional state variable.) When the number
      of bytes acknowledged reaches cwnd, then cwnd can be incremented by
      up to MSS bytes. Note that during congestion avoidance, cwnd MUST
      NOT be increased by more than the larger of either 1 full-sized
      segment per RTT, or the value computed using equation 2.

      Implementation Note: some implementations maintain cwnd in units of
      bytes, while others in units of full-sized segments. The latter
      will find equation (2) difficult to use, and may prefer to use the
      counting approach discussed in the previous paragraph.

      When a TCP sender detects segment loss using the retransmission
      timer, the value of ssthresh MUST be set to no more than the value
      given in equation 3:

      ssthresh = max (min (cwnd, rwnd) / 2, 2*MSS) (3)

      Implementation Note: an easy mistake to make is to forget the inner
      min() operation and simply use cwnd, which in some implementations
      may incidentally increase well beyond rwnd.

      Furthermore, upon a timeout cwnd MUST be set to no more than the
      loss window, LW, which equals 1 full-sized segment (regardless of

      Expires: August, 1998 [Page 4]

      draft-ietf-tcpimpl-cong-control-00.txt August 1998

      the value of IW). Therefore, after retransmitting the dropped
      segment the TCP sender uses the slow start algorithm to increase the
      window from 1 full-sized segment to the new value of ssthresh, at
      which point congestion avoidance again takes over in a fashion
      identical to that for a connection's initial slow start.

      3.3 Fast Retransmit/Fast Recovery

      A TCP receiver SHOULD send an immediate duplicate ACK when an
      out-of-order segment arrives. The purpose of this ACK is to inform
      the sender that a segment was received out-of-order and which
      sequence number is expected. From the sender's perspective,
      duplicate ACKs can be caused by a number of network problems.
      First, they can be caused by dropped segments. In this case, all
      segments after the dropped segment will trigger duplicate ACKs.
      Second, duplicate ACKs can be caused by the re-ordering of data
      segments by the network (not a rare event along some network paths).
      Finally, duplicate ACKs can be caused by replication of ACK or data
      segments by the network.

      The TCP sender SHOULD use the "fast retransmit" algorithm to detect
      and repair loss, based on incoming duplicate ACKs. The fast
      retransmit algorithm uses the arrival of 3 duplicate ACKs (i.e., 4
      identical ACKs) as an indication that a segment has been lost.
      After receiving 3 duplicate ACKs, TCP performs a retransmission of
      what appears to be the missing segment, without waiting for the
      retransmission timer to expire.

      After the fast retransmit sends what appears to be the missing
      segment, the "fast recovery" algorithm governs the transmission of
      new data until a non-duplicate ACK arrives. The reason for not
      performing slow start is that the receipt of the duplicate ACKs not
      only tells the TCP that a segment has been lost, but also that
      segments are leaving the network. In other words, since the
      receiver can only generate a duplicate ACK when a segment has
      arrived, that segment has left the network and is in the receiver's
      buffer, so we know it is no longer consuming network resources.
      Furthermore, since the ACK "clock" [Jac88] is preserved, the TCP
      sender can continue to transmit new segments (although transmission
      must continue using a reduced cwnd).

      The fast retransmit and fast recovery algorithms are usually
      implemented together as follows.

      1. When the third duplicate ACK is received, set ssthresh to no
      more than the value given in equation 3.

      2. Retransmit the lost segment and set cwnd to ssthresh plus 3*MSS.
      This artificially "inflates" the congestion window by the number
      of segments (three) that have left the network and which the
      receiver has buffered.

      3. For each additional duplicate ACK received, increment cwnd by
      MSS. This artificially inflates the congestion window in order

      Expires: August, 1998 [Page 5]

      draft-ietf-tcpimpl-cong-control-00.txt August 1998

      to reflect the additional segment that has left the network.

      4. Transmit a segment, if allowed by the new value of cwnd and the
      receiver's advertised window.

      5. When the next ACK arrives that acknowledges new data, set cwnd
      to ssthresh (the value set in step 1). This is termed
      "deflating" the window.

      This ACK should be the acknowledgment elicited by the
      retransmission from step 1, one RTT after the retransmission
      (though it may arrive sooner in the presence of significant
      out-of-order delivery of data segments at the receiver).
      Additionally, this ACK should acknowledge all the intermediate
      segments sent between the lost segment and the receipt of the
      first duplicate ACK, if none of these were lost.

      Implementing fast retransmit/fast recovery in this manner can lead
      to a phenomenon which allows the fast retransmit algorithm to repair
      multiple dropped segments from a single window of data [Flo94].
      However, in doing so, the size of cwnd is also reduced multiple
      times, which reduces performance. The following steps MAY be taken
      to reduce the impact of successive fast retransmits on performance.

      A. After the third duplicate ACK is received follow step 1 above,
      but also record the highest sequence number transmitted
      (send_high).

      B. Instead of reducing cwnd to ssthresh upon receipt of the first
      non-duplicate ACK received (step 5), the sender should wait
      until an ACK covering send_high is received. In addition, each
      duplicate ACK received should continue to artificially inflate
      cwnd by 1 MSS.

      C. A non-duplicate ACK that does not cover send_high, followed by 3
      duplicate ACKs should not reduce ssthresh or cwnd but SHOULD
      trigger a retransmission. A non-duplicate ACK that does not
      cover send_high SHOULD NOT cause any adjustment in cwnd.

      4 Additional Considerations

      4.1 Re-starting Idle Connections

      A known problem with the TCP congestion control algorithms described
      above is that they allow a potentially inappropriate burst of
      traffic to be transmitted after TCP has been idle for a relatively
      long period of time. After an idle period, TCP cannot use the ACK
      clock to strobe new segments into the network, as all the ACKs have
      drained from the network. Therefore, as specified above, TCP can
      potentially send a cwnd-size line-rate burst into the network after
      an idle period.

      [Jac88] recommends that a TCP use slow start to restart transmission
      after a relatively long idle period. Slow start serves to restart

      Expires: August, 1998 [Page 6]

      draft-ietf-tcpimpl-cong-control-00.txt August 1998

      the ACK clock, just as it does at the beginning of a transfer. This
      mechanism has been widely deployed in the following manner. When
      TCP has not received a segment for more than one retransmission
      timeout, cwnd is reduced to the value of the restart window (RW)
      before transmission begins.

      For the purposes of this standard, we define RW = IW = 1 full-sized
      segment.

      We note that the non-standard experimental extension to TCP defined
      in [AFP98] defines RW = min(IW, cwnd), with the definition of IW
      adjusted per equation (1) above.

      Using the last time a segment was received to determine whether or
      not to decrease cwnd fails to deflate cwnd in the common case of
      persistent HTTP connections [HTH98]. In this case, a WWW server
      receives a request before transmitting data to the WWW browser. The
      reception of the request makes the test for an idle connection fail,
      and allows the TCP to begin transmission with a possibly
      inappropriately large cwnd.

      Therefore, a TCP SHOULD reduce cwnd to no more than RW before
      beginning transmission if the TCP has not sent data in an interval
      exceeding the retransmission timeout.

      4.2 Acknowledgment Mechanisms

      The delayed ACK algorithm specified in [Bra89] SHOULD be used by a
      TCP receiver. When used, a TCP receiver MUST NOT excessively delay
      acknowledgments. Specifically, an ACK MUST be generated for every
      second full-sized segment. (This "MUST" is listed in [Bra89] in one
      place as a SHOULD and another as a MUST; here we unambiguously state
      it is a MUST.) Furthermore, an ACK SHOULD be generated for every
      second segment regardless of size. Finally, an ACK MUST NOT be
      delayed for more than 500 ms waiting on a second full-sized segment
      to arrive. Out-of-order data segments SHOULD be acknowledged
      immediately, in order to trigger the fast retransmit algorithm.

      A TCP receiver MUST NOT generate more than one ACK for every
      incoming segment.

      TCP implementations that implement the selective acknowledgment
      (SACK) option [MMFR96] are able to determine which segments have not
      arrived at the receiver. Consequently, they can retransmit the lost
      segments more quickly than TCPs without SACKs. This allows a TCP
      sender to repair multiple losses in roughly one RTT after detecting
      loss [FF96,MM96a,MM96b]. While no specific SACK-based recovery
      algorithm has yet been standardized, any SACK-based algorithm should
      follow the general principles embodied by the above algorithms.
      First, as soon as loss is detected, ssthresh should be decreased per
      equation (3). Second, in the RTT following loss detection, the
      number of segments sent should be no more than half the number
      transmitted in the previous RTT (i.e., before loss occurred).
      Third, after the recovery period is finished, cwnd should be set to

      Expires: August, 1998 [Page 7]

      draft-ietf-tcpimpl-cong-control-00.txt August 1998

      the reduced value of ssthresh. The SACK-based algorithms outlined
      in [FF96,MM96a,MM96b] adhere to these guidelines.

      5. Security Considerations

      This document requires a TCP to diminish its sending rate in the
      presence of retransmission timeouts and the arrival of duplicate
      acknowledgments. An attacker can therefore impair the performance
      of a TCP connection by either causing data packets or their
      acknowledgments to be lost, or by forging excessive duplicate
      acknowledgments. Causing two congestion control events back-to-back
      will often cut ssthresh to its minimum value of 2*MSS, causing the
      connection to immediately enter the slower-performing congestion
      avoidance phase.

      The Internet to a considerable degree relies on the correct
      implementation of these algorithms in order to preserve network
      stability and avoid congestion collapse. An attacker could cause
      TCP endpoints to respond more aggressively in the face of congestion
      by forging excessive duplicate acknowledgments or excessive
      acknowledgments for new data. Conceivably, such an attack could
      drive a portion of the network into congestion collapse.

      Acknowledgments

      The four algorithms that are described were developed by Van
      Jacobson.

      Some of the text from this document is taken from "TCP/IP
      Illustrated, Volume 1: The Protocols" by W. Richard Stevens
      (Addison-Wesley, 1994) and "TCP/IP Illustrated, Volume 2: The
      Implementation" by Gary R. Wright and W. Richard Stevens
      (Addison-Wesley, 1995). This material is used with the permission
      of Addison-Wesley.

      Sally Floyd devised the algorithm presented in section 3.3 for
      avoiding multiple cwnd cutbacks in the presence of multiple packets
      lost from the same flight. Craig Partridge and Joe Touch
      contributed a number of helpful suggestions.

      References

      [AFP98] M. Allman, S. Floyd, C. Partridge, Increasing TCP's Initial
      Window Size, Internet-Draft draft-floyd-incr-init-win-03.txt.
      May, 1998. (Work in progress).

      [Bra89] B. Braden, ed., "Requirements for Internet Hosts --
      Communication Layers," RFC 1122, Oct. 1989.

      [Bra97] S. Bradner, "Key words for use in RFCs to Indicate
      Requirement Levels", BCP 14, RFC 2119, March 1997.




      Expires: August, 1998 [Page 8]

      draft-ietf-tcpimpl-cong-control-00.txt August 1998

      [FF96] Kevin Fall and Sally Floyd. Simulation-based Comparisons of
      Tahoe, Reno and SACK TCP. Computer Communication Review, July
      1996. ftp://ftp.ee.lbl.gov/papers/sacks.ps.Z.

      [Flo94] S. Floyd, TCP and Successive Fast Retransmits. Technical
      report, October 1994.
      ftp://ftp.ee.lbl.gov/papers/fastretrans.ps.

      [HTH98] Amy Hughes, Joe Touch, John Heidemann. Internet-Draft
      draft-ietf-tcpimpl-restart-00.txt, March 1998. (Work in
      progress).

      [Jac88] V. Jacobson, "Congestion Avoidance and Control," Computer
      Communication Review, vol. 18, no. 4, pp. 314-329, Aug. 1988.
      ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z.

      [Jac90] V. Jacobson, "Modified TCP Congestion Avoidance Algorithm,"
      end2end-interest mailing list, April 30, 1990.
      ftp://ftp.isi.edu/end2end/end2end-interest-1990.mail.

      [MM96a] M. Mathis, J. Mahdavi, "Forward Acknowledgment: Refining TCP
      Congestion Control," Proceedings of SIGCOMM'96, August, 1996,
      Stanford, CA. Available from
      http://www.psc.edu/networking/papers/papers.html

      [MM96b] M. Mathis, J. Mahdavi, "TCP Rate-Halving with Bounding
      Parameters" Available from
      http://www.psc.edu/networking/papers/FACKnotes/current.

      [MMFR96] M. Mathis, J. Mahdavi, S. Floyd, A. Romanow, "TCP Selective
      Acknowledgement Options", RFC 2018, October 1996.

      [PAD+98] V. Paxson, M. Allman, S. Dawson, J. Griner, I. Heavens,
      K. Lahey, J. Semke, B. Volz. Internet-Draft
      draft-ietf-tcpimpl-prob-04.txt, August 1998. (Work in
      progress).

      [Pos81] J. Postel, Transmission Control Protocol, September 1981.
      RFC 793.

      [Ste94] W. R. Stevens, "TCP/IP Illustrated, Volume 1: The
      Protocols", Addison-Wesley, 1994.

      [Ste97] W. R. Stevens, "TCP Slow Start, Congestion Avoidance, Fast
      Retransmit, and Fast Recovery Algorithms", RFC 2001, January
      1997.

      [WS95] G. R. Wright, W. R. Stevens, "TCP/IP Illustrated, Volume 2:
      The Implementation", Addison-Wesley, 1995.






      Expires: August, 1998 [Page 9]

      draft-ietf-tcpimpl-cong-control-00.txt August 1998

      Author's Address:

      W. Richard Stevens
      1202 E. Paseo del Zorro
      Tucson, AZ 85718
      520-297-9416
      rstevens@...
      http://www.kohala.com/~rstevens

      Mark Allman
      NASA Lewis Research Center/Sterling Software
      21000 Brookpark Rd. MS 54-2
      Cleveland, OH 44135
      216-433-6586
      mallman@...
      http://gigahertz.lerc.nasa.gov/~mallman

      Vern Paxson
      Network Research Group
      Lawrence Berkeley National Laboratory
      Berkeley, CA 94720
      USA
      510-486-7504
      vern@...































      Expires: August, 1998 [Page 10]
    • Kacheong Poon
      In discussing multiple recovery in fast retransmit/recovery: B. Instead of reducing cwnd to ssthresh upon receipt of the first non-duplicate ACK received
      Message 2 of 10 , Aug 7, 1998
        In discussing multiple recovery in fast retransmit/recovery:

        B. Instead of reducing cwnd to ssthresh upon receipt of the first
        non-duplicate ACK received (step 5), the sender should wait
        until an ACK covering send_high is received. In addition, each
        duplicate ACK received should continue to artificially inflate
        cwnd by 1 MSS.

        C. A non-duplicate ACK that does not cover send_high, followed by 3
        duplicate ACKs should not reduce ssthresh or cwnd but SHOULD
        trigger a retransmission. A non-duplicate ACK that does not
        cover send_high SHOULD NOT cause any adjustment in cwnd.

        Consider this case. Both ssthresh and cwnd are 7*MSS. Suppose sender sends
        segments 1-6. 1 and 5 are dropped. After getting 3 dup ACKs for segments
        2-4, sender fast retransmits 1. Ssthresh is now 3*MSS and cwnd is set to
        6*MSS. Dup ACK for 6 is received and cwnd is inflated to 7*MSS. Thus one
        more segment, 7, can be sent. This is fast recovery. A non-dup ACK, for 1-4,
        is then received. By B & C, cwnd stays at 7*MSS. By waiting in B, do you mean
        that the sender should not send anything after getting this non-dup ACK? It
        should wait until 3 more dup ACKs are received before it can do anything. And
        does it means that fast recovery is stopped until ACK for send_high is
        received. Then why does cwnd need to be inflated for dup ACKs?

        If fast recovery is not stopped, then sender can now send segments 8-11. Is
        this burst considered harmful? If cwnd stays the same after getting a non-dup
        ACK for n segments that does not cover send_high, TCP can send out n segments
        in a burst. If we want to follow the same idea behind fast recovery, cwnd can
        be reduced by number of segments acked when a non-dup ACK is received.
        Ssthresh can stay the same. After send_high is ack'ed, cwnd is set to ssthresh
        and everything continues. This kind of burst is like the one described in
        [FF96] and the authors introduce a maxburst parameter for it.

        It is not clear to me what the suggestion in the draft is. Can you elaborate?

        K. Poon.
        kcpoon@...
      • David Borman
        Matt, ... Yes you can. In the receiver, you implement a local state variable that keeps track of the largest packet ever received from the peer, and use that
        Message 3 of 10 , Aug 12, 1998
          Matt,

          > MSS is better described as a sender state variable, which may or may
          > not agree with the other end of the connection.
          >
          > As a consequence, algorithms that require the receiver to know the
          > senders segment size can not, in general, be implemented correctly.
          >
          > In the draft, the sentence "Specifically, an ACK MUST be generated for
          > every second full-sized segment.", is not sufficient to assure correct
          > operation because the receiver can not identify segments which the
          > sender considers to be full sized.

          Yes you can. In the receiver, you implement a local state variable that
          keeps track of the largest packet ever received from the peer, and use
          that in determining how many full size packets you have received, instead
          of the sending MSS. I implemented this in UNICOS years ago, and it worked
          fine for most cases.

          -David Borman, dab@...
        • Andi Kleen
          In article , ... What happens when the routing changes and path mtu discovery on the sender lowers the MSS effectively?
          Message 4 of 10 , Aug 12, 1998
            In article <199808121840.NAA10344@...>,
            David Borman <dab@...> writes:
            > Matt,
            >> MSS is better described as a sender state variable, which may or may
            >> not agree with the other end of the connection.
            >>
            >> As a consequence, algorithms that require the receiver to know the
            >> senders segment size can not, in general, be implemented correctly.
            >>
            >> In the draft, the sentence "Specifically, an ACK MUST be generated for
            >> every second full-sized segment.", is not sufficient to assure correct
            >> operation because the receiver can not identify segments which the
            >> sender considers to be full sized.

            > Yes you can. In the receiver, you implement a local state variable that
            > keeps track of the largest packet ever received from the peer, and use
            > that in determining how many full size packets you have received, instead
            > of the sending MSS. I implemented this in UNICOS years ago, and it worked
            > fine for most cases.

            What happens when the routing changes and path mtu discovery on the sender
            lowers the MSS effectively? Then after the routing shift this variable will
            contain a too big value and the receiver will not send enough acks.

            In Linux 2.1 this problem came up and got fixed with the following
            algorithm:

            Initialize rcv_mss with the mss the other end announced.

            If a packet arrives with len > rcv_mss set rcv_mss to len.
            If two packets with SS >= 536 arrive in a row set rcv_mss to this length.

            Send an ack at least for every rcv_mss*2 data bytes.

            The reason of the 536 rule is to make sure that delayed acks still work
            with interactive traffic. For paths with MTUs < 576 [some packet radio links]
            it will break, but it is assumed that those fragment/defragment at a lower
            layer.

            So far this algorithm seems to work well to adapt to changing MSS. The
            original reason for its implementation was the fact that a well-known
            workstation OS announces 64k MSS on a HIPPI interface, but effectively
            only sends 48k packets - the result was that Linux already got into
            delayed ack mode and the data transfer was very slow.

            So far this algorithm seems to work well, but I would be interested in a
            discussion about any shortcommings it may have.

            -Andi
          • Andi Kleen
            In article , ... Exactly this problem got fixed in Linux 2.1 a few days ago. Funny that you bring it up now :) See my
            Message 5 of 10 , Aug 12, 1998
              In article <199808121612.MAA04021@...>,
              "Matt Mathis" <mathis@...> writes:

              > Yes, all(?) existing implementation are non-conformant, and yes all(?)
              > can be made to misbehave on real paths. (In the above FDDI/ethernet
              > examples most receivers send one ACK per six segments. We have one
              > trace where the receiver was ACKing every 2*4096 Bytes, but the sender
              > was using 512 Byte segments, causing 17 packet bursts during
              > slow-start).

              Exactly this problem got fixed in Linux 2.1 a few days ago. Funny
              that you bring it up now :) See my other mail on the algorithm used.

              -Andi
            • Kacheong Poon
              ... This issue was discussed about a week ago in the Linux networking list. I think they implement something similar, which they called adaptive receive
              Message 6 of 10 , Aug 12, 1998
                > Yes you can. In the receiver, you implement a local state variable that
                > keeps track of the largest packet ever received from the peer, and use
                > that in determining how many full size packets you have received, instead
                > of the sending MSS. I implemented this in UNICOS years ago, and it worked
                > fine for most cases.

                This issue was discussed about a week ago in the Linux networking list. I
                think they implement something similar, which they called "adaptive receive
                MSS." So I guess Linux, beside UNICOS, is also conformant to the RFC.
                Anyone from the Linux networking community can confirm this?

                K. Poon.
                kcpoon@...
              • Andi Kleen
                In article , ... All released versions (Linux 2.0, official 2.1) do not conform. The implementation of
                Message 7 of 10 , Aug 12, 1998
                  In article <Roam.SIMCSD.2.0.4.902950136.15924.kcpoon@jurassic>,
                  Kacheong Poon <Kacheong.Poon@...> writes:
                  >> Yes you can. In the receiver, you implement a local state variable that
                  >> keeps track of the largest packet ever received from the peer, and use
                  >> that in determining how many full size packets you have received, instead
                  >> of the sending MSS. I implemented this in UNICOS years ago, and it worked
                  >> fine for most cases.

                  > This issue was discussed about a week ago in the Linux networking list. I
                  > think they implement something similar, which they called "adaptive receive
                  > MSS." So I guess Linux, beside UNICOS, is also conformant to the RFC.
                  > Anyone from the Linux networking community can confirm this?


                  All released versions (Linux 2.0, official 2.1) do not conform.

                  The implementation of adaptive receive MSS will appear shortly in 2.1
                  snapshots, and will be included in the upcoming 2.2 release.

                  Most BSD implementation (at least Free/NetBSD) don't have the problem,
                  because they simply send an ACK every second packet, no matter how big
                  the packets were (this means BSD will send more acks than linux in some
                  situations)


                  -Andi
                • Kacheong.Poon@Eng.Sun.COM
                  ... I just got the latest FreeBSD source. The following part of code has not changed. /* * Compare available window to amount of window * known to peer (as
                  Message 8 of 10 , Aug 12, 1998
                    > Most BSD implementation (at least Free/NetBSD) don't have the problem,
                    > because they simply send an ACK every second packet, no matter how big
                    > the packets were (this means BSD will send more acks than linux in some
                    > situations)

                    I just got the latest FreeBSD source. The following part of code has not
                    changed.

                    /*
                    * Compare available window to amount of window
                    * known to peer (as advertised window less
                    * next expected input). If the difference is at least two
                    * max size segments, or at least 50% of the maximum possible
                    * window, then want to send a window update to peer.
                    */
                    if (win > 0) {
                    /*
                    * "adv" is the amount we can increase the window,
                    * taking into account that we are limited by
                    * TCP_MAXWIN << tp->rcv_scale.
                    */
                    long adv = min(win, (long)TCP_MAXWIN << tp->rcv_scale) -
                    (tp->rcv_adv - tp->rcv_nxt);

                    if (adv >= (long) (2 * tp->t_maxseg))
                    goto send;
                    if (2 * adv >= (long) so->so_rcv.sb_hiwat)
                    goto send;
                    }

                    Can you point me to where FreeBSD forces acking every second segment,
                    regardless of the size?

                    K. Poon.
                    kcpoon@...
                  • Kevin M. Lahey
                    In message Kacheong.Poon@Eng.S un.COM writes ... NetBSD does it. It seemed like a reasonable compromise to
                    Message 9 of 10 , Aug 12, 1998
                      In message <Roam.SIMC.2.0.6.902971866.24899.kcpoon@jurassic>Kacheong.Poon@Eng.S
                      un.COM writes
                      >> Most BSD implementation (at least Free/NetBSD) don't have the problem,
                      >> because they simply send an ACK every second packet, no matter how big
                      >> the packets were (this means BSD will send more acks than linux in some
                      >> situations)
                      >
                      >I just got the latest FreeBSD source. The following part of code has not
                      >changed.
                      >[...]
                      >Can you point me to where FreeBSD forces acking every second segment,
                      >regardless of the size?

                      NetBSD does it. It seemed like a reasonable compromise to me[*];
                      it avoids assuming that the MSS is the same in both directions,
                      and it avoids adding extra state and complexity to the system.
                      In the case of tiny X events or something similar, this results
                      in extra ACKs. Are these going to be a significant problem,
                      perhaps with a grossly assymetric link?

                      Are there any clocking or other advantages to having frequent ACKs?

                      Thanks,

                      Kevin
                      kml@...
                      -----
                      [*] I didn't put it in, so I can't take credit for it, but I certainly
                      thought it is a Good Thing[tm]...
                    • Andi Kleen
                      ... [...] You re right, FreeBSD does not do this. I just looked at the NetBSD source and somehow believed FreeBSD did the same. Seems I was wrong. In NetBSD
                      Message 10 of 10 , Aug 12, 1998
                        On Thu, Aug 13, 1998 at 03:31:06AM +0200, Kacheong.Poon@... wrote:
                        > > Most BSD implementation (at least Free/NetBSD) don't have the problem,
                        > > because they simply send an ACK every second packet, no matter how big
                        > > the packets were (this means BSD will send more acks than linux in some
                        > > situations)
                        >
                        > I just got the latest FreeBSD source. The following part of code has not
                        > changed.

                        [...]

                        You're right, FreeBSD does not do this. I just looked at the NetBSD source
                        and somehow believed FreeBSD did the same. Seems I was wrong.

                        In NetBSD the TCP_SETUP_ACK macro controls the ACK behaviour

                        #define TCP_SETUP_ACK(tp, ti) \
                        do { \
                        if ((tp)->t_flags & TF_DELACK || \
                        (tcp_ack_on_push && (ti)->ti_flags & TH_PUSH)) \
                        tp->t_flags |= TF_ACKNOW; \
                        else \
                        TCP_SET_DELACK(tp); \
                        } while (0)


                        And then in the fast packet receive path:

                        TCP_SETUP_ACK(tp, ti);
                        if (tp->t_flags & TF_ACKNOW)
                        (void) tcp_output(tp);

                        This means FreeBSD has a similar problem to Linux when the path changes
                        and the MTU shrinks during a connection then it'll go into delayed ACK
                        mode even for bulk data. NetBSD fixed it. The FreeBSD algorithm seems to
                        match the original 4.4BSD-Lite code as described in Stephens TCP/IP Ill. V.2

                        -Andi
                      Your message has been successfully submitted and would be delivered to recipients shortly.