
May 12

Just The Facts

Don’t jump to cause. Doing so often leads to wasted resources and is a quick way to lose respect among your peers.

Scenario

You receive a trace via email and are told that it illustrates a “network problem”, which is causing slow application performance. In the words of the analyst, there are “tons of bad packets and retransmissions.”

Much can be said about ensuring that the methodology employed to collect the capture data actually provides relevant information. Much can also be said about ensuring that a problem is properly defined and gives the analyst some idea of potential areas of interest. Often the first thing I do when receiving a trace is ask for additional information regarding the symptom, e.g. relevant IP addresses, application ports, etc. Before you start trying to sort through thousands or even millions of packets, you want to be sure that you are looking in the right haystack and are fairly certain that you are looking for a needle. These sorts of topics will be addressed in future discussions.

The Companion Video walks through the observations and analysis described below.

Observations

Wireshark metadata provides a quick way of assessing what a capture contains and whether this data coincides with the problem being described. Let’s examine some metadata regarding our “network problem”.

The Capture File Properties dialog (below) indicates that the trace was conducted at 19:04:08 on 3/25/2016, has a duration of 180 ms, and contains a total of 14 frames. No frame slicing or capture filter was in place during the initial capture, though it is very likely that we are looking at a trace that was filtered and saved from a larger capture.

Capture File Properties

Expert information (below) indicates IP and TCP “Bad checksum” errors, “Previous segment not captured”, “Duplicate ACK”, “suspected retransmission”, and “fast retransmission” events. While bad checksums can result in retransmissions, we see that the number of checksum errors is significantly greater than the number of retransmissions. In other words, if we were seeing this many “real” bad TCP checksums, we would likely expect to see many more retransmission symptoms. However, it is prudent to validate. The “Previous segment not captured”, “Duplicate ACK”, and “suspected retransmission”/“fast retransmission” events logically correlate. For example, we experience a gap in TCP segments due to a dropped frame or segment reordering, which causes Wireshark to generate a “Previous segment not captured” event. The receiver, seeing a segment later in the stream than the one it was expecting, generates duplicate ACKs, which results in a retransmission (further categorized as a fast retransmission). However, we only see two duplicate ACKs, so whether this is an actual retransmission is questionable. Analysis should help provide more clarity.

Expert

Tip: The expert provides “hints” as to potential concerns detected by Wireshark. It can get very “busy” in terms of the number and types of events present, many of which are of little importance to the task at hand. I find that the expert window is much more valuable after I have created a filtered trace containing just the packets related to the issue that I am troubleshooting. In this case, as we are only looking at 14 frames, it isn’t overwhelming.

Protocol Hierarchy (below) indicates that this trace contains only SSL over TCP. There is no UDP, ICMP, or SNMP traffic, and no other application running over TCP.

jtf-protocol_hierarchy
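The same protocol breakdown can be pulled from the command line. Below is a rough LUA tap sketch that tallies the frame.protocols stack of every frame; it assumes the Listener and Field objects of Wireshark’s LUA API as bundled with 2.x builds, and the script and file names are purely illustrative. Think of it as a taste of the LUA taps discussed later, not a polished tool.

    -- proto_tally.lua (name is illustrative)
    -- Run with: tshark -q -X lua_script:proto_tally.lua -r your_trace.pcapng
    -- Tallies the protocol stack (frame.protocols) of every frame.
    local protocols = Field.new("frame.protocols")
    local tap = Listener.new()
    local counts = {}

    function tap.packet(pinfo, tvb)
        local fi = protocols()
        if fi then
            local stack = tostring(fi)
            counts[stack] = (counts[stack] or 0) + 1
        end
    end

    function tap.draw()
        for stack, n in pairs(counts) do
            print(string.format("%6d  %s", n, stack))
        end
    end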

In fact, Conversations (below) indicates a single SSL session from a client (User-1/192.168.1.10) on TCP port 57913 to what appears to be an Amazon EC2 instance (server ec2-52-22-153-18.compute-1.amazonaws.com/52.22.153.18), located in Wilmington, DE, on TCP port 443 (SSL). This connection has a duration of approximately 180 ms, which corresponds to the Capture File Properties dialog.

Conversations
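For a quick conversation summary without opening the GUI, a similar tap can key on the endpoint pairs. This is only a sketch under the same assumptions (Wireshark 2.x LUA API, TCP over IPv4, illustrative file names); Wireshark’s own Conversations window remains the authoritative view.

    -- conv_tally.lua (name is illustrative)
    -- Run with: tshark -q -X lua_script:conv_tally.lua -r your_trace.pcapng
    -- Summarizes TCP conversations: frames, bytes, and duration per endpoint pair.
    local tap = Listener.new(nil, "tcp")
    local convs = {}

    function tap.packet(pinfo, tvb)
        local a = string.format("%s:%d", tostring(pinfo.src), pinfo.src_port)
        local b = string.format("%s:%d", tostring(pinfo.dst), pinfo.dst_port)
        -- order the endpoints so both directions land in the same bucket
        local key = (a < b) and (a .. " <-> " .. b) or (b .. " <-> " .. a)
        local c = convs[key] or { frames = 0, bytes = 0, first = pinfo.rel_ts }
        c.frames = c.frames + 1
        c.bytes  = c.bytes + pinfo.len
        c.last   = pinfo.rel_ts
        convs[key] = c
    end

    function tap.draw()
        for key, c in pairs(convs) do
            print(string.format("%s  %d frames, %d bytes, %.3f s",
                key, c.frames, c.bytes, c.last - c.first))
        end
    end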

Analysis

In examining the checksum errors, we see that they are present on every packet/segment generated by 192.168.1.10. This is a pretty clear indicator that this machine was our capture machine and that the errors were due to checksum offloading. Checksums exist for a reason: detecting corruption. I was once engaged in an issue where sporadic bad TCP checksums led to retransmissions. Another analyst assumed that the issue was related to packet loss, but he was having difficulty determining where this loss was occurring. However, careful analysis of checksums indicated that these frames were not actually being lost in the network. They were being dropped by the receiver because of invalid TCP checksums generated by the sender; upgrading the driver resolved the issue. Just be mindful.

The following graphic illustrates a checksum analysis, conducted by creating a display filter for frames with bad IP and TCP checksums and then comparing filtered vs. unfiltered transmitted frames from host 192.168.1.10.

checksums
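The filter behind the graphic above can also drive a small tap that counts bad-checksum frames per source address. Treat this as a hedged sketch: the field names are version dependent (2.0 to 2.2 builds expose the booleans ip.checksum_bad and tcp.checksum_bad, while later builds replaced them with ip.checksum.status/tcp.checksum.status), and checksum validation must be enabled in the protocol preferences for either to be populated.

    -- bad_checksums.lua (name is illustrative)
    -- Run with: tshark -q -X lua_script:bad_checksums.lua -r your_trace.pcapng
    -- Counts frames with bad IP or TCP checksums, per source IP.
    -- Adjust the filter to ip.checksum.status/tcp.checksum.status on newer builds.
    local src = Field.new("ip.src")
    local tap = Listener.new(nil, "ip.checksum_bad == 1 or tcp.checksum_bad == 1")
    local bad_by_src = {}

    function tap.packet(pinfo, tvb)
        local host = tostring(src())
        bad_by_src[host] = (bad_by_src[host] or 0) + 1
    end

    function tap.draw()
        for host, n in pairs(bad_by_src) do
            print(string.format("%-15s  %d frames with bad checksums", host, n))
        end
    end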

The “Previous segment not captured”, “Duplicate ACK”, “suspected retransmission”, and “fast retransmission” events are due to packet reordering. We were able to determine this by examining the IP Identification fields, as shown below:

IP Identifier
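A small tap can flag the pattern shown above automatically: whenever a frame’s IP ID is lower than the previous ID seen from the same source, print it. IP IDs are not guaranteed to be sequential (Linux, for instance, sends 0x0000 on some segments), so this is only a rough hint meant to prompt manual inspection, again assuming the 2.x LUA API and an illustrative script name.

    -- ipid_order.lua (name is illustrative)
    -- Run with: tshark -q -X lua_script:ipid_order.lua -r your_trace.pcapng
    -- Flags frames whose IP ID is lower than the last ID seen from that source.
    local src = Field.new("ip.src")
    local id  = Field.new("ip.id")
    local tap = Listener.new(nil, "ip")
    local last = {}

    function tap.packet(pinfo, tvb)
        local host = tostring(src())
        local this = id().value
        if last[host] and this < last[host] then
            print(string.format("frame %d: ip.id 0x%04x from %s is lower than previous 0x%04x",
                pinfo.number, this, host, last[host]))
        end
        last[host] = this
    end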

The specifics of why and where packets were reordered are uncertain. Could this situation represent a potential performance concern? Maybe. However, in our example we only saw two duplicate ACKs and the conversation progresses to the next application-level message. There were no actual retransmissions. Thus, the reordering of segments in this particular trace did not create a concern, though it is something to be mindful of when examining further traces.

The “Session reuse” expert would indicate that the client and server had established a successful SSL session prior to this trace sample and are attempting to reuse the session identifier to reduce overhead. However, as pointed out by a couple of readers (Jin Qian and Sake), session reuse is not occurring in this trace; this expert is actually due to the out-of-order segments which occur in this sample. When using any analysis tool, we should verify the information provided, especially if that information is pivotal to our analysis. Regardless, in our SSL session we never see the client and server exchange application data. In SSL there is a handshake protocol phase in which the client and server negotiate ciphers, validate identity, and use public key cryptography to securely create a shared master key. After this step is complete, the client and server use this shared master key to symmetrically encrypt data. The last SSL message(s) that we see in our trace is the “Change Cipher Spec, Encrypted Handshake Message” from the server. While this indicates that we have completed our handshake and are entering a secure data phase, we never actually see any application data. SSL will be a topic that I will spend some time on in the future.
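To follow the handshake phase without opening each packet, the handshake message types can be listed per frame. The field name in builds of this era is ssl.handshake.type (later renamed to tls.handshake.type), and encrypted handshake messages will not show a type, so consider this a hedged sketch rather than a definitive SSL audit.

    -- ssl_handshake.lua (name is illustrative)
    -- Run with: tshark -q -X lua_script:ssl_handshake.lua -r your_trace.pcapng
    -- Prints the SSL handshake message types carried in each frame.
    local hs_type = Field.new("ssl.handshake.type")
    local tap = Listener.new(nil, "ssl.handshake.type")

    function tap.packet(pinfo, tvb)
        -- a single record can carry several handshake messages, so collect them all
        local infos = { hs_type() }
        local labels = {}
        for _, fi in ipairs(infos) do
            table.insert(labels, fi.display)
        end
        print(string.format("frame %d: %s", pinfo.number, table.concat(labels, ", ")))
    end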

Why did the initial SYN/ACK take so much longer than the other ACKs? This is something to keep an eye on in further traces.

slow syn-ack
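To keep an eye on this in further traces, the SYN-to-SYN/ACK delay can be measured per TCP stream. The sketch below assumes the 2.x LUA API, IPv4/TCP, and captures that contain the start of each connection; the script name is illustrative.

    -- synack_delay.lua (name is illustrative)
    -- Run with: tshark -q -X lua_script:synack_delay.lua -r your_trace.pcapng
    -- Reports the delay between the first SYN and the SYN/ACK for each TCP stream.
    local stream_f = Field.new("tcp.stream")
    local ack_f    = Field.new("tcp.flags.ack")
    local tap = Listener.new(nil, "tcp.flags.syn == 1")
    local syn_time = {}

    function tap.packet(pinfo, tvb)
        local stream = stream_f().value
        if not ack_f().value then
            -- plain SYN; keep the first one we see (ignore retransmitted SYNs)
            syn_time[stream] = syn_time[stream] or pinfo.rel_ts
        elseif syn_time[stream] then
            -- SYN/ACK for a SYN we have already seen
            print(string.format("stream %d: SYN/ACK arrived %.3f ms after the SYN",
                stream, (pinfo.rel_ts - syn_time[stream]) * 1000))
            syn_time[stream] = nil
        end
    end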

Discussion

Metadata is information about data. In this context, there are many tools which can analyze a trace, or even multiple traces, and produce metadata about the contents. Whether I examine a large or small trace, I will almost invariably start by getting a higher-level metadata perspective, as examination of this information allows me to make quite a few assessments regarding the contents of a packet capture without even looking at individual packets. Wireshark creates quite a bit of metadata, and much of this same information is available via the command line (tshark) and exposed to LUA, which can be used to create higher-level abstractions. Future discussions will address the LUA programming interface and, specifically, how to create additional metadata via LUA taps.
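As a preview of those future discussions, every sketch above shares the same minimal tap structure: packet() runs for each frame matching the filter, reset() runs when a capture (re)starts, and draw() runs when output is requested, e.g. at the end of a file read with tshark -q. Again, this assumes the 2.x LUA API, and the script name is illustrative.

    -- tap_skeleton.lua (name is illustrative)
    -- Run with: tshark -q -X lua_script:tap_skeleton.lua -r your_trace.pcapng
    local tap = Listener.new(nil, "tcp")   -- second argument is any display filter
    local frames, bytes = 0, 0

    function tap.packet(pinfo, tvb)
        frames = frames + 1
        bytes  = bytes + pinfo.len
    end

    function tap.reset()
        frames, bytes = 0, 0
    end

    function tap.draw()
        print(string.format("%d frames, %d bytes matched the filter", frames, bytes))
    end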

I am wrapping a technical concept inside of a larger idea regarding “Impactful Analysis”. It is OK if we don’t immediately have “the answer”. We cannot draw real conclusions when we don’t have enough data. In many cases it is difficult to determine much more than the need for further directed testing, and this testing is often iterative. An impactful analyst will define the objectives of a follow-up test plan to obtain the necessary information, or at the very least ensure that the necessary information is gathered as part of the larger test plan. Throughout the process, we need to recognize that the way we present our observations, e.g. specific content and format, will be influenced by the knowledge level and focus of our audience; we don’t want our observations misconstrued. Anyway, these are future discussion topics.

While we saw some potential concerns in this example, we didn’t find any sort of smoking gun. We don’t have enough information to make any concrete assessments and we want to communicate this clearly. We also want to ensure that we can get the data (and information) to drill deeper and may have to guide others through a test process. However, in the end impactful analysis relies on “Just the Facts.”

6 comments


  1. Vladimir

    Very interesting, thanks!

    Some other observations in the trace:
    1. IP ID of the SYN-ACK packet is not in the pattern. It has a value of 0x0000 while all other packets have incremental IP ID. (To keep in mind: possible different source, possible IP ID manipulating on the path)
    2. The use of DSCP in the packets received from server, TOS 0x08, throughput bit set.
    3. TTL of received packets is 43, which is not very usual. (To keep in mind: possible long path, possible TTL manipulation).

    1. Thomas P. Kager

      Thank you Vladimir. I appreciate the kind words 🙂

      I like your observations and the way you think!

      IMO the IP ID of 0 on the Syn/Ack is probably an indicator that we are communicating with Linux. There is some mention of this behavior on the Web http://rtoodtoo.net/ip-identification-why-zero/. I have also seen Linux stacks which set an IP ID of 0 when DF is set, which is obviously not the case in this particular example (as all packets have DF set). Regarding the TTL, I have seen intermediate devices change a TTL to 64 (or even 60), but I am not aware of any devices (or stacks) that use a value between 60 and 32, e.g. 48, etc. Perhaps there are. I am just not aware of any.

      1. Vladimir

        Hi Thomas,

        Yes, you’re right regarding IP ID. I didn’t know that. Just checked on CentOS 6, Ubuntu 16, Debian 7 – all have exactly the same behavior: the SYN-ACK has 0x0000, and all the following packets have incremental IP IDs. So this pattern in fact is specific to Linux systems.

        Windows 7 uses incremental IP IDs in SYN-ACKs.

        There’s always something new to learn, cool, thanks!

        Regards,
        Vladimir

  2. Thomas P. Kager

    Thank you for your interest, insight and additional research. As you indicate, there is always something new to learn!

  3. Sake

    Hi Thomas,

    Nice blogpost. I look forward to reading your LUA article next 🙂

    My observations on the clientside:

    – As there are packets which are less than 60 bytes in size (actually, less than 64 bytes, but as the FCS is already stripped before capturing, I look for packets less than 60 bytes), the capture must have been made on a system that is involved in the dataflow itself, rather than a system that was connected to a TAP or span-port.
    – The bad checksums of all packets from that IP address confirm the above.
    – As the TTL of the packets that are less than 60 bytes is 128, the capture system is either one of the endpoints or the first hop after one of the endpoints.
    – The 1.2 ms delay between the SYN/ACK and the ACK would indicate that we are not on the endpoint, however, as there would normally be a delay of tens of microseconds. This would suggest that we might be capturing on the first hop instead. It could also indicate that we are indeed capturing on the client, but that it is very busy.
    – The jumps in the IP ID numbers on 192.168.1.10 indicate that it is sending network traffic on at least one other connection.
    – The interface name in the capture summary indicates a Windows system as the capturing device.

    All in all, I would say that the capture was made on the client with IP 192.168.1.10, which is a Windows system that was very busy at the time of capture.

    As for the server-side, I would have to disagree on your conclusion that this SSL session is reusing the keying material of a previous SSL handshake:

    – As already stated, the IP id of packet 10 indicates that this is not a retransmission, but instead an out-of-order packet.
    – Wireshark has a problem re-assembling data when the first data-packet of a stream is out-of-order. So looking at packet 10, there is indeed extra data after the “ServerHello”. The bytes “16 03 03 0d c3” indicate a new SSL record of 3523 bytes. In it is a handshake message of type 0xb which is the Certificate handshake message.

    This means the SSL session in this tracefile is using a full handshake and is not a reused session. This can also be seen by the “ServerKeyExchange” and “ClientKeyExchange” handshake messages which only appear in full SSL handshakes or renegotiations.

    Cheers,
    Sake

  4. Thomas P. Kager

    You are correct. Busy client.

    Regarding SSL: when the packets are put in the correct order, Wireshark no longer triggers the expert. This is logical for the reasons which you mentioned, as well as the fact that the Client Hello contains a Session ID Length of 0. Another individual had pointed this out to me. I will update the blog and give you guys a shout.

    Thank You Sake!
