Internet Draft                                                 R. Braden
Expires: December 1993                                               ISI
                                                           June 21, 1993


            TCP Extensions for High Performance: An Update


Status of This Memo

   This document is an Internet-Draft.  Internet-Drafts are working
   documents of the Internet Engineering Task Force (IETF), its Areas,
   and its Working Groups.  Note that other groups may also distribute
   working documents as Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months.  Internet-Drafts may be updated, replaced, or obsoleted by
   other documents at any time.  It is not appropriate to use
   Internet-Drafts as reference material or to cite them other than as
   a ``working draft'' or ``work in progress.''

Abstract

   This memo is a contribution to the TCP Large Windows (TCPLW) Working
   Group.  It presents some suggested modifications to RFC-1323, which
   defined TCP extensions to improve performance over large
   bandwidth*delay product paths and to provide reliable operation over
   very high-speed paths.

1. INTRODUCTION

   RFC-1323 [Jacobson92] defined a set of extensions to the TCP
   protocol [Postel81] to improve performance over large
   bandwidth*delay product paths and to provide reliable operation over
   very high-speed paths.  Specifically, RFC-1323 defined three new
   mechanisms.

   (1)  Window Scale Option

        A new TCP option, "Window Scale", allows windows larger than
        2**16 bytes.  This option defines an implicit scale factor,
        which is used to multiply the window size value found in a TCP
        header to obtain the true window size.

   (2)  RTTM: Round-Trip Time Measurement

        A new TCP option "Timestamps" is introduced, and a mechanism
        called "RTTM" (Round Trip Time Measurement) uses this option to
        obtain improved measurement of round trip times (RTTs).
Braden                  Expires: December 1993                  [Page 1]

Internet Draft     TCP Performance Extensions: Update         June 1993

   (3)  PAWS: Protect Against Wrapped Sequence Numbers

        The Timestamps option is used by the PAWS mechanism to extend
        TCP reliability to transfer rates well beyond the foreseeable
        upper limit of network bandwidths, with reasonably large values
        of the Maximum Segment Lifetime (MSL).

   The present document summarizes several minor issues and
   clarifications that have accumulated since RFC-1323 was published.

2. MODIFICATIONS TO RFC-1323

   2.1  RTTM: Clarify Relationship to Karn's Algorithm

      TCP requires that the RTO (retransmission timeout) values used
      for successive retransmissions of the same segment form an
      increasing sequence [Postel81]; this is known as "retransmission
      back-off".  TCP implementations have previously been required
      [RFC-1323] to use Phil Karn's algorithm [Karn87], which states
      that (1) retransmission back-off will persist until the next ACK
      is received for a data segment that has never been retransmitted,
      and that (2) no RTT measurements will be made from
      acknowledgments of retransmitted data segments.

      Karn's algorithm was designed to allow reliable RTT estimates
      despite an ambiguity when an ACK is received for a retransmitted
      data segment: the ACK may have been created from either the
      original or the retransmission [Zhang86].

      However, as RFC-1323 implied but did not clearly state, the RTTM
      mechanism replaces Karn's algorithm.  With the RTTM mechanism in
      operation, an ACK segment will echo the timestamp from whichever
      data segment triggered the ACK.  This removes the ambiguity in
      RTT measurement that required the Karn algorithm.  For
      compatibility, however, an implementation of RFC-1323 must still
      be prepared to use the Karn algorithm when talking with a host
      that does not implement RFC-1323.
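      As an illustration only, the choice between an RTTM measurement
      and the Karn fallback might be sketched in Python as follows.
      The names rtt_sample, TimedSegment, and TS_MOD are hypothetical
      and do not appear in RFC-1323.  The sketch assumes that the
      caller has already checked that the ACK acknowledges new data,
      and that tsecr is None when the peer did not negotiate the
      Timestamps option.

```python
from dataclasses import dataclass
from typing import Optional

TS_MOD = 1 << 32  # timestamp values are 32 bits wide and wrap around


@dataclass
class TimedSegment:
    """Hypothetical bookkeeping for the one segment being timed (pre-1323 style)."""
    start_time: int       # local timestamp clock value when the segment was sent
    retransmitted: bool   # True once the segment has been retransmitted


def rtt_sample(now: int, tsecr: Optional[int], timed: TimedSegment) -> Optional[int]:
    """Return an RTT sample in clock ticks, or None if no sample may be taken.

    Assumes the ACK being processed acknowledges new data.
    """
    if tsecr is not None:
        # RTTM: the echoed timestamp identifies the transmission that
        # triggered the ACK, so a retransmitted segment is not ambiguous.
        return (now - tsecr) % TS_MOD
    if timed.retransmitted:
        # Karn's rule (2): no measurement from an ACK of retransmitted data.
        return None
    return (now - timed.start_time) % TS_MOD
```

      With timestamps in use, the sketch produces a sample even for a
      retransmitted segment; without them, it falls back to Karn's rule
      and declines to measure.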
      Overriding the Karn algorithm was implied by the following
      statement on page 14 of RFC-1323, which was independent of
      whether or not the new data being acknowledged has been
      retransmitted:

         "A TSecr value received in a segment is used to update the
         averaged RTT measurement only if the segment acknowledges
         some new data, i.e., only if it advances the left edge of the
         send window."

   2.2  RTTM: Discuss When RTT Measurements are Made

      The RFC-1323 text quoted immediately above implies that duplicate
      acknowledgments will not contribute to measurement of the RTT,
      even with RTTM in use.

      Suppose that exactly one segment is lost from a window of N
      segments.  If there are no delayed ACKs or lost segments, this
      will result in a string of N-1 duplicate ACKs arriving at the
      sender.  RTTM can make no new RTT measurement for at least N
      packet times, so the first new measurement will come from the ACK
      triggered by retransmission of the lost packet.  Therefore, the
      discussion under bullet (B) on page 15 of RFC-1323 is gratuitous:
      no matter which timestamp is echoed in a duplicate ACK segment,
      it (the echoed timestamp) will be ignored.

      This issue deserves further discussion.  We see that with one
      dropped segment per window, RTTM may result in only one RTT
      measurement per window.  However, this is still a significant
      improvement over a standard TCP without RTTM, which will make
      even fewer measurements; it cannot measure the retransmitted
      packet, due to Karn's algorithm.

      However, we should ask whether it would be possible to do any
      better than one measurement per RTT.  The reason for making a new
      RTT measurement only when new data is acknowledged is to avoid
      artificial inflation of the RTT value, as illustrated by the
      diagram on the top of page 14 of RFC-1323.  We would need an
      alternative criterion for making a measurement that would also
      prevent such inflation of the RTT measurements.
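      To make the single-loss case above concrete, the following Python
      sketch counts how many ACKs would yield an RTT measurement under
      the "new data acknowledged" rule.  It is a toy model under stated
      assumptions: segments are numbered 0..n-1 and each occupies one
      unit of sequence space, the receiver sends one cumulative ACK per
      arriving segment, and the lost segment arrives last as a
      retransmission.  The function names are hypothetical.

```python
from typing import List, Optional


def acks_for_window(n: int, lost: Optional[int]) -> List[int]:
    """Cumulative ACK values sent by a receiver for a window of n segments.

    If `lost` is not None, that segment is dropped on first transmission
    and arrives last, as a retransmission.
    """
    arrivals = [s for s in range(n) if s != lost]
    if lost is not None:
        arrivals.append(lost)
    received, rcv_nxt, acks = set(), 0, []
    for seg in arrivals:
        received.add(seg)
        while rcv_nxt in received:   # advance over contiguous data
            rcv_nxt += 1
        acks.append(rcv_nxt)         # one ACK per arrival, no delayed ACKs
    return acks


def count_rtt_samples(acks: List[int], snd_una: int = 0) -> int:
    """Count the ACKs that advance the left edge of the send window."""
    samples = 0
    for ack in acks:
        if ack > snd_una:            # acknowledges new data: take a sample
            samples += 1
            snd_una = ack
        # otherwise a duplicate ACK: its echoed timestamp is ignored
    return samples
```

      For n = 8 with the first segment lost, the model produces seven
      duplicate ACKs followed by one ACK for the whole window, and
      hence a single measurement; with no loss it produces eight.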
      For example, suppose that the transmitter made a new RTT
      measurement only when it had outstanding data, i.e., only when
      SND.NXT > SND.UNA.  The following example, involving simultaneous
      data transmission from both sides, shows that this alternative
      criterion may still allow RTT inflation.  Here the TS.Recent
      values on each side are shown in parentheses, and TCP A sends
      data blocks a, b, ... and TCP B sends data blocks x, y, ...

             TCP A                                      TCP B
          (TS.Recent)                                (TS.Recent)

      1.                        ------>                  (1)
      2.     (127)              <-----                   (1)
      3.     (127)              ------>                  (5)
                                  .
                                  .
             ( Pause for 60 timestamp clock ticks )
                                  .
                                  .
      4.     (127)              ---> ...
      5.                        ... <---                 (5)
      4'.                       --->                     (65)
      5'.    (191)              <--
      6.                        ... <---                 (65)
      7.     (191)              ---> ...

      In this symmetrical data transfer example, both sides send data
      simultaneously (lines 4 and 5) after a pause of roughly 60 time
      units.  When these segments arrive (lines 4' and 5'), each side
      has outstanding data and by the proposed rule would use the TSecr
      to update its RTT estimate.  However, this would result in
      inflating each of these RTT estimates by the 60 time units.

      We believe that the only way to ensure that the measured RTT is
      accurate is to accept TSecr only when new data is acknowledged.
      Thus, the RFC-1323 rule quoted at the end of the preceding
      section is the best that can be done, and duplicate ACKs cannot
      update the RTT estimate.

   2.3  RTTM: Which Timestamp to Echo?

      RFC-1323 presented the following algorithm to control which
      timestamp is echoed:

      (1)  "The connection state is augmented with two 32-bit slots:
           TS.Recent holds a timestamp to be echoed in TSecr whenever a
           segment is sent, and Last.ACK.sent holds the ACK field from
           the last segment sent.  Last.ACK.sent will equal RCV.NXT
           except when ACKs have been delayed.
      (2)  If Last.ACK.sent falls within the range of sequence numbers
           of an incoming segment:

              SEG.SEQ <= Last.ACK.sent < SEG.SEQ + SEG.LEN

           then the TSval from the segment is copied to TS.Recent;
           otherwise, the TSval is ignored.

      (3)  When a TSopt is sent, its TSecr field is set to the current
           TS.Recent value."

      Step (2) of this algorithm is incorrect in two regards: (1) it
      will fail to update TS.Recent for a retransmitted segment that
      resulted from a lost ACK, and (2) it will fail if SEG.LEN = 0
      [Borman93,Skibo93].

      The correct step (2) is actually simpler.  It is as follows:

      (2)  If:

              SEG.TSval >= TS.Recent and SEG.SEQ <= Last.ACK.sent

           then SEG.TSval is copied to TS.Recent; otherwise, it is
           ignored.

      Observe that this algorithm explicitly constructs a monotonic
      sequence of TS.Recent values.  The case SEG.TSval = TS.Recent is
      included here for consistency with the PAWS test.

      Note also that RFC-1323 presented this algorithm *correctly* in
      Section 4.2.1 discussing PAWS, but *incorrectly* in the Event
      Processing rules on page 35.

   2.4  Implementation of TCP Options

      The major implementation chore in the RFC-1323 extensions is
      probably the modifications to allow TCP options in data segments.
      This code must obey the limits set by the MSS (maximum segment
      size) and by the connected network MTU (maximum transmission
      unit).  This issue has sometimes been misunderstood, perhaps
      partly due to a past imprecision in terminology (e.g., what is a
      "segment"?).  In addition, prior attempts to clarify these issues
      have been unfortunately obscure [RFC-1122].

      To send a segment, the general procedure for a TCP should be:

      (a)  Get a packet buffer and create a TCP header in it.

      (b)  Format any required TCP options into the buffer.

      (c)  Copy 'len' bytes of data into the buffer, where:

              len = min( data_to_send, maxseg, maxoptdata - optlen );

           Here:

           *    data_to_send = Amount of data to be sent.
           *    maxseg = "Normal" data length in a segment.

           *    maxoptdata = Largest area permitted.

           *    optlen = length of TCP options added in (b).

      Finally, we must define how to compute 'maxseg' and 'maxoptdata'.

              maxoptdata = min( Received_MSS, pathMTU - 40) -

              maxseg = maxoptdata -

      Here "Received_MSS" is the value received in an MSS option in a
      SYN segment, or 536 if none is received.  The MTU over the path,
      "pathMTU", may be found by MTU Discovery, or it may be determined
      by the following heuristic: use "interface_MTU" if the
      destination is on the connected network, else use 576.  In normal
      usage today, there are no IP options to be considered.

      An MSS option is intended to specify ONLY a property of the
      remote host, independent of the path: the largest IP datagram
      that can be received and reassembled (less 40).  For a host that
      has no limit on datagram size, it would not be incorrect to
      specify "infinity" (65535) in its MSS option.  However, a more
      sensible choice would be "interface_MTU".

      Note that 'maxseg' is also used by the SWS (silly-window
      syndrome) and congestion control algorithms of TCP [RFC-1122],
      and it may correspond to the "normal" data block size for a
      segment used in bulk transmission.

3. SUMMARY OF ALGORITHMS

   Appendix E of RFC-1323 defined the overall algorithm as
   modifications of the TCP Event Processing rules.  This section
   contains a more concise and algorithmic description.

   We define the following symbols:

   Options

      WSopt:            TCP Window Scale Option

      TSopt:            TCP Timestamps Option

   Option Fields

      shift.cnt:        Window scale byte in WSopt.

      TSval:            32-bit Timestamp Value field in TSopt.

      TSecr:            32-bit Timestamp Reply field in TSopt.

   Option Fields in Current Segment

      SEG.TSval:        TSval field from TSopt in current segment.

      SEG.TSecr:        TSecr field from TSopt in current segment.
      SEG.WSopt:        8-bit value in WSopt

   Clock Values

      my.TSclock:       Local source of 32-bit timestamp values

      my.TSclock.rate:  Period of my.TSclock (1 ms to 1 sec per tick).

   Per-Connection State Variables

      TS.Recent:        Latest received Timestamp

      Last.ACK.sent:    Last ACK field sent

      Snd.TS.OK:        1-bit flag

      Snd.WS.OK:        1-bit flag

      Rcv.wind.scale:   Receive window scale power

      Snd.wind.scale:   Send window scale power

      Start.Time:       my.TSclock value when segment being timed was
                        sent (used by pre-1323 code).

   Procedure

      Update_SRTT( m )  Procedure to update the smoothed RTT and RTT
                        variance estimates, using the rules of
                        [Jacobson88], given m, a new RTT measurement.

   PSEUDO-CODE SUMMARY:

   Create new TCB => {

       Rcv.wind.scale =
           MIN( 14, MAX( 0, floor(log2(receive buffer space)) - 15 ) );
       Snd.wind.scale = 0;
       Last.ACK.sent = 0;
       Snd.TS.OK = Snd.WS.OK = FALSE;
   }

   Send initial {SYN} segment => {

       SEG.WND = MIN( RCV.WND, 65535 );
       Include in segment: TSopt(TSval=my.TSclock, TSecr=0);
       Include in segment: WSopt = Rcv.wind.scale;
   }

   Send {SYN, ACK} segment => {

       SEG.ACK = Last.ACK.sent = RCV.NXT;
       SEG.WND = MIN( RCV.WND, 65535 );
       if (Snd.TS.OK) then
           Include in segment:
               TSopt(TSval=my.TSclock, TSecr=TS.Recent);
       if (Snd.WS.OK) then
           Include in segment: WSopt = Rcv.wind.scale;
   }

   Receive {SYN} or {SYN,ACK} segment => {

       if (Segment contains TSopt) then {
           TS.Recent = SEG.TSval;
           Snd.TS.OK = TRUE;
           if (is {SYN,ACK} segment) then
               Update_SRTT(
                   (my.TSclock - SEG.TSecr)*my.TSclock.rate );
       }
       if (Segment contains WSopt) then {
           Snd.wind.scale = SEG.WSopt;
           Snd.WS.OK = TRUE;
       }
       else
           Rcv.wind.scale = Snd.wind.scale = 0;
   }

   Send non-SYN segment => {

       SEG.ACK = Last.ACK.sent = RCV.NXT;
       SEG.WND = MIN( RCV.WND >> Rcv.wind.scale, 65535 );
       if (Snd.TS.OK) then
           Include in segment:
               TSopt(TSval=my.TSclock, TSecr=TS.Recent);
   }

   Receive non-SYN segment in (state
   >= ESTABLISHED) => {

       Window = (SEG.WND << Snd.wind.scale);
           /* Use 32-bit 'Window' instead of 16-bit 'SEG.WND'
            * in rest of processing.
            */

       if (Segment contains TSopt) then {
           if (SEG.TSval < TS.Recent && Idle less than 25 days) then {
               if (Snd.TS.OK AND (NOT RST) ) then {
                   /* Timestamp too old =>
                    * segment is unacceptable.
                    */
                   Send ACK segment;
                   Discard segment and return;
               }
           }
           else {
               if (SEG.SEQ <= Last.ACK.sent) then
                   TS.Recent = SEG.TSval;
           }
       }

       if (SEG.ACK > SND.UNA) then {
           /* (At least part of) first segment in
            * retransmission queue has been ACKd
            */
           if (Segment contains TSopt) then
               Update_SRTT(
                   (my.TSclock - SEG.TSecr)*my.TSclock.rate );
           else
               Update_SRTT(        /* for compatibility */
                   (my.TSclock - Start.Time)*my.TSclock.rate );
       }
   }

4. REFERENCES

   [Borman93]    Borman, D., Private communication, 1993.

   [Jacobson88]  Jacobson, V., "Congestion Avoidance and Control",
                 Proc. SIGCOMM '88, Stanford, CA, August 1988.

   [Jacobson92]  Jacobson, V., Braden, R., and D. Borman, "TCP
                 Extensions for High Performance", RFC-1323, May 1992.

   [Karn87]      Karn, P. and C. Partridge, "Estimating Round-Trip
                 Times in Reliable Transport Protocols", Proc. SIGCOMM
                 '87, Stowe, VT, August 1987.

   [Postel81]    Postel, J., "Transmission Control Protocol - DARPA
                 Internet Program Protocol Specification", RFC-793,
                 DARPA, September 1981.

   [RFC-1122]    Braden, R., Ed., "Requirements for Internet Hosts --
                 Communication Layers", RFC-1122, October 1989.

   [Skibo93]     Skibo, T., Private communication, 1993.

   [Zhang86]     Zhang, L., "Why TCP Timers Don't Work Well", Proc.
                 SIGCOMM '86, Stowe, VT, August 1986.

Security Considerations

   Security issues are not discussed in this memo.

Author's Address

   Bob Braden
   University of Southern California
   Information Sciences Institute
   4676 Admiralty Way
   Marina del Rey, CA 90292

   Phone: (213) 822-1511
   EMail: Braden@ISI.EDU
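
Illustrative Sketch: Receive-Side Timestamp Processing

   The timestamp handling in the "Receive non-SYN segment" rule of
   Section 3, which combines the PAWS test with the corrected step (2)
   of Section 2.3, might be sketched in Python as follows.  This is a
   minimal sketch and not part of the specification: the function names
   are hypothetical, the comparisons are assumed to use 32-bit modular
   arithmetic, and the 25-day idle check and the RST exception of the
   pseudo-code are omitted for brevity.

```python
from typing import Tuple

MOD = 1 << 32    # sequence numbers and timestamps are 32-bit values
HALF = 1 << 31


def ts_older(a: int, b: int) -> bool:
    """True if timestamp a precedes timestamp b in modular arithmetic."""
    return ((a - b) % MOD) >= HALF


def seq_leq(a: int, b: int) -> bool:
    """True if sequence number a is at or before b in modular arithmetic."""
    return ((b - a) % MOD) < HALF


def process_tsval(ts_recent: int, last_ack_sent: int,
                  seg_seq: int, seg_tsval: int) -> Tuple[bool, int]:
    """Return (segment acceptable, new TS.Recent) for an incoming TSopt."""
    if ts_older(seg_tsval, ts_recent):
        # PAWS: timestamp too old, so the segment is unacceptable.
        return False, ts_recent
    if seq_leq(seg_seq, last_ack_sent):
        # Corrected step (2): SEG.TSval >= TS.Recent and
        # SEG.SEQ <= Last.ACK.sent, so remember this timestamp.
        return True, seg_tsval
    # Segment starts beyond Last.ACK.sent: accept it, keep TS.Recent.
    return True, ts_recent
```

   Because the test no longer involves SEG.LEN, the sketch updates
   TS.Recent both for a zero-length segment at Last.ACK.sent and for a
   retransmitted segment that starts before Last.ACK.sent, the two
   cases in which the original step (2) failed.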