Metrics Computation

Here, you can find details on how some of the less obvious metrics are computed, and how they are affected by the sniffer configuration. You may safely skip this section unless you need a deeper understanding of how the sniffer works.

Conversations

Many generic metrics are computed on TCP streams. To be able to interpret these correctly, it may be useful to be aware of a few things.

Client or Server?

To find out which peer is the client, the sniffer tries several options:

  • if it understands the protocol at hand (and has successfully identified it), then client/server identification is usually trivial. Unfortunately, most traffic does not fall into this category.
  • in TCP, the client is the peer that actively opens the connection (i.e., sends the initial SYN). But we may miss the SYN or we may have forgotten it if we have not received traffic for that socket for more than 2 minutes (especially problematic for lengthy connections such as remote control protocols).
  • in either TCP or UDP, we may have indicative port numbers. A port number below 1024 on one side and greater than 1024 on the other is a strong indication of the server location.
  • in TCP, we may have seen past SYNs directed at one of the ports, which again gives an indication of that port being the server.
  • when all else fails, the server is chosen according to a complex heuristic that’s mostly equivalent to choosing at random.
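
As an illustration, here is a minimal Python sketch of this decision cascade; the Flow structure and its field names are purely illustrative, not the sniffer's actual internals:

    from dataclasses import dataclass, field
    import random

    WELL_KNOWN = 1024  # ports below this strongly suggest a server

    @dataclass
    class Flow:
        src_port: int
        dst_port: int
        protocol_server_port: int = None    # set when the protocol is identified
        syn_seen_towards_dst: bool = False  # initial SYN observed towards dst
        past_syn_ports: set = field(default_factory=set)  # ports SYNed in the past

    def dst_is_server(flow):
        """Return True when the destination is believed to be the server."""
        # 1. A successfully identified protocol makes the decision trivial.
        if flow.protocol_server_port is not None:
            return flow.dst_port == flow.protocol_server_port
        # 2. The peer that sent the initial SYN is the client.
        if flow.syn_seen_towards_dst:
            return True
        # 3. Indicative port numbers: a well-known port on one side only.
        if flow.dst_port < WELL_KNOWN <= flow.src_port:
            return True
        if flow.src_port < WELL_KNOWN <= flow.dst_port:
            return False
        # 4. Past SYNs directed at one of the ports.
        if flow.dst_port in flow.past_syn_ports:
            return True
        if flow.src_port in flow.past_syn_ports:
            return False
        # 5. Last resort: essentially a coin toss.
        return random.choice([True, False])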

Keep-Alive

Application-level keep-alives are small messages sent from either peer to the other when no traffic has used the socket for some time. They must not be taken into consideration when computing SRT, DTT, and so on. The ica_keepalive_max_size parameter is dedicated to the detection of ICA (Citrix) keep-alive messages.

The standard TCP keep-alive packet is normally detected using its size and sequence number, according to the RFC. In case the previous sequence number is unknown, though, the tcp_keepalive_timer may be used as an alternative; after this inactivity period, any TCP packet that looks like a keep-alive will be ignored.
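
To make this concrete, here is a small sketch of the two detection paths, assuming the RFC 1122 behavior where a keep-alive probe carries at most one garbage byte and repeats the sequence number just below the next expected one; everything except the tcp_keepalive_timer parameter is illustrative:

    def is_tcp_keepalive(seq, payload_len, expected_seq, idle_time,
                         tcp_keepalive_timer):
        """Heuristically decide whether a TCP segment is a keep-alive probe."""
        looks_like_probe = payload_len <= 1  # probes carry 0 or 1 garbage byte
        if expected_seq is not None:
            # Normal case: the probe repeats the byte just below the next
            # expected sequence number (modulo the 32-bit sequence space).
            return looks_like_probe and seq == (expected_seq - 1) % 2**32
        # Fallback: the previous sequence number is unknown, so rely on the
        # configured inactivity period instead.
        return looks_like_probe and idle_time >= tcp_keepalive_timer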

DTT timeouts

The objective of the TCP DTT metric is to measure the duration of a single write (or of a sequence of closely related writes). For protocols that do not follow the request/response pattern, it is very important to detect when two data transfers are separated in time (suggesting they are unrelated). The tcp_dtt_timeout parameter helps with that: if two packets are separated by more than this duration, they do not belong to the same DTT. By default, it is set to 1 s, so that neither lost packets nor a full reception buffer interrupt the DTT, while an actual pause from the sending application is detected as such.
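
The following sketch shows the effect of the timeout on a packet train; the timestamps are purely illustrative, and only the tcp_dtt_timeout parameter comes from the configuration:

    TCP_DTT_TIMEOUT = 1.0  # default: 1 s

    def split_transfers(packet_times):
        """Group packet timestamps into transfers: a gap larger than the
        timeout starts a new transfer, hence a new DTT."""
        transfers = []
        for t in packet_times:
            if transfers and t - transfers[-1][-1] <= TCP_DTT_TIMEOUT:
                transfers[-1].append(t)
            else:
                transfers.append([t])
        return transfers

    # A 0.4 s gap (e.g. a lost then retransmitted packet) stays within one
    # DTT, while a 2.5 s application pause starts a new one:
    print(split_transfers([0.0, 0.1, 0.5, 3.0, 3.1]))
    # -> [[0.0, 0.1, 0.5], [3.0, 3.1]]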

What is a retransmission?

According to the sniffer, any TCP packet with a payload (or with a SYN, FIN, or RST flag) whose sequence number was already covered is a retransmission (here, covered means that this sequence number was in a packet that has already been analyzed).

Fast retransmissions are thus counted as retransmissions.
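
A minimal sketch of the "already covered" rule follows; a real tracker also handles 32-bit sequence wrap-around and per-direction state, which this illustration omits:

    def make_retransmission_checker():
        covered = []  # [start, end) sequence ranges already analyzed

        def is_retransmission(seq, length):
            # SYN and FIN consume one sequence number, so a bare SYN/FIN/RST
            # can be modeled here as length 1.
            length = max(length, 1)
            hit = any(s < seq + length and seq < e for s, e in covered)
            covered.append((seq, seq + length))
            return hit

        return is_retransmission

    check = make_retransmission_checker()
    print(check(1000, 100))  # False: first time these bytes are seen
    print(check(1000, 100))  # True: same range again -> retransmission
    print(check(1050, 100))  # True: a partial overlap also counts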

HTTP

The HTTP metric offers a very synthetic notion of a page: a set of HTTP documents fetched by the same user and combined by their browser into a single object, a “page”. Reconstructing pages from the actual packets involves an unusually high number of operations and thus deserves quite a detailed description.

HTTP specific glossary

Although not required to use SkyLIGHT PVX, the following definitions are needed to understand the description below.

  • HTTP message: as defined by the RFC, an HTTP header optionally followed by a body. Sniffing gives us some of the headers, the relevant timestamps, sizes, and so on. We may not see everything, but the beginning of the header is mandatory in order to recognize an HTTP message.
  • HTTP query: an HTTP message with a command (GET, POST, HEAD, etc.) and a URL.
  • HTTP response: an HTTP message with a response code (sometimes called status code or status).
  • HTTP hit or transaction: an HTTP query, optionally with its associated HTTP response (note: a response with no associated query is ignored for this metric).
  • user: the HTTP client software (browser or otherwise) that sent the query under consideration. It is identified by its IP address and User-Agent field.
  • page: a set of transactions that are supposed to be perceived as a single query implying a single delay for the user. Notice how subjective this definition is. The intent is to include in a single page all the hits required for a typical browser to display enough content for the typical user to consider the query fulfilled. For websites or browsers that delay downloading content until it becomes visible, or for websites that display intermediary content, the only objective is to behave in a way that is understandable.
  • root (of a page): the transaction that triggers the other transactions of the same page, either directly or indirectly. We would like it to be chronologically first, but that is not necessarily the case due to mirroring.
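
These definitions can be summarized as data structures. The sketch below is purely illustrative and does not reflect the sniffer's actual schema:

    from dataclasses import dataclass, field

    @dataclass
    class User:
        ip: str          # a user is identified by its IP address...
        user_agent: str  # ...combined with the User-Agent field

    @dataclass
    class Transaction:      # an "HTTP hit": a query plus its optional response
        method: str         # GET, POST, HEAD, ...
        url: str
        referrer: str       # the Referer header, when present (else None)
        status: int = None  # response code, once the response is paired

    @dataclass
    class Page:
        root: Transaction   # the transaction that triggered the others
        hits: list = field(default_factory=list)  # all transactions of the page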

From packets to HTTP messages

The sniffer receives fragments of HTTP messages. It starts to reconstruct a new HTTP message as soon as it receives the start of a header. Some fragments of the message may be missing, though, in which case it may be incapable of:

  • associating a body fragment with the proper HTTP message, thus leading to erroneous payloads and dubious chronology,
  • saving part of the content in HTTP save files (without notice),
  • reporting the timestamp of the message end.

From individual messages to transactions

HTTP offers no better way to associate a response with its corresponding query than to rely on ordering: the first response on a socket pairs with the first query, and so on.

So, for every socket, the sniffer stores all queries not already paired with a response. Notice that on a socket, a proxy may mix queries of different users, and that two interconnected proxies may even mix queries to distinct servers.

Notice also how damaging a single dropped packet can be if it hides a query or a full response from the sniffer, since all pairing following this gap will be questionable.

Also, servers may not respond, leading to a timeout of the pending queries (which are then inserted into the database without any response).
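
A sketch of this ordering-based pairing is shown below; the FIFO discipline comes from the text, while the timeout value and the storage helper are hypothetical:

    from collections import deque

    PAIRING_TIMEOUT = 60.0  # hypothetical value, in seconds

    def store_without_response(query):
        pass  # placeholder: insert the unanswered query into the database

    class SocketPairing:
        def __init__(self):
            self.pending = deque()  # (timestamp, query), oldest first

        def on_query(self, ts, query):
            self.pending.append((ts, query))

        def on_response(self, ts, response):
            # Expire queries whose server never answered.
            while self.pending and ts - self.pending[0][0] > PAIRING_TIMEOUT:
                _, unanswered = self.pending.popleft()
                store_without_response(unanswered)
            if not self.pending:
                return None  # response without a query: ignored for this metric
            _, query = self.pending.popleft()
            return (query, response)  # one transaction (HTTP hit)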

From transactions to pages

Since all transactions of a page are necessarily emitted by the same user, all transactions are associated with this user, in chronological order (time and the “Referer” field are our two best tools from now on). Notice that a page routinely involves transactions on several sockets, and that different sockets are reassembled by different TCP parsers, which therefore deliver segments at different paces: it is thus possible for the HTTP metric to reconstruct a transaction A before a transaction B even if B happened and was received by the probe before A (for instance, if A’s socket reassembly was delayed by a missing frame). In such an occurrence, the referrer relation between A and B may not be honored.

We do not wait for the pairing with a response to attach a query to the page it belongs to. When we attach a new query to a client, we look for the referrer of this transaction among the ones already attached to this client (when the referrer field is absent, we use the same kind of referrer cache as found in KSniffer). If the referred page is itself attached to another page, two behaviors are possible:

  • we detach it, thus turning the referrer into the root of a new page, or
  • we follow the chain of attachment and attach the new transaction to the parent page.

Note that the first behavior is possible only when the content-type of the referred page does not prevent it (i.e., is not of a kind typically reserved for non-root transactions, such as images, CSS, and other typically embedded content).

You can choose between these two behaviors with the http-detach-referred parameter.

The second behavior (keeping referred transactions attached) is better when iframes are involved, but it is believed that the first (and default) one generally leads to better results. Other than iframes, the only observed case where a referenced transaction was obviously not a page root was an AJAX request continuously POSTing to the same URL as its referrer, thus repeatedly detaching its predecessor.

If and when we eventually receive the response of a transaction (and, hopefully, its content-type), we revise our judgment on the attachment. If the transaction does not seem to have been triggered by AJAX and its content-type is indicative of a standalone document (PDF, PS, or HTML with status 200), then we detach it (turning it into a root). Otherwise, if the content-type is not indicative of typically embedded content (image, CSS, etc.), then we check the delay between the page root and this transaction; if it is greater than a parameter (http-page-construction-max-delay), the transaction is detached as well.
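
As a rough illustration of this revision step (the content-type tests and transaction fields below are assumptions; only the http-page-construction-max-delay parameter comes from the configuration):

    from collections import namedtuple

    STANDALONE_TYPES = ("application/pdf", "application/postscript", "text/html")
    EMBEDDED_TYPES = ("image/", "text/css")  # typically non-root content

    def must_detach(txn, root, max_delay):
        """Return True when the transaction should become the root of a new
        page instead of staying attached to its current page."""
        if (not txn.is_ajax and txn.status == 200
                and txn.content_type.startswith(STANDALONE_TYPES)):
            return True  # standalone document: detach into a new root
        if not txn.content_type.startswith(EMBEDDED_TYPES):
            # Not typically embedded content: detach when it arrived too long
            # after the page root (http-page-construction-max-delay).
            return txn.timestamp - root.timestamp > max_delay
        return False  # embedded content stays attached

    Txn = namedtuple("Txn", "is_ajax status content_type timestamp")
    root = Txn(False, 200, "text/html", 0.0)
    late = Txn(False, 200, "application/json", 15.0)
    print(must_detach(late, root, max_delay=10.0))  # True: arrived too late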

To speed up information retrieval, some global per-page values are precomputed in the sniffer: every transaction attached to a page contributes to the page, provided it was received less than http-page-contribution-max-delay seconds after the root. All of these transactions contribute to the page load time.

To be able to dump a root transaction with all of these counters, we must, of course, delay the dump of roots as long as possible, thus raising memory requirements.

Protections

To limit memory and CPU usage, the sniffer implements these protections:

  • page reconstruction is only active for some IP addresses and TCP ports (client or server); see the HTTP flag in the zone and application definitions. Transactions that do not come from or go to one of these IP addresses will not be attached to a root transaction; they will be inserted into the database but excluded from the page list.
  • the total number of simultaneously tracked and remembered HTTP transactions is limited by http-max-tracked (unlimited by default). New transactions beyond this limit will be ignored (with catastrophic consequences for transaction pairing).
  • the total number of simultaneously tracked and remembered HTTP transactions for which we want page reconstruction is limited by http-max-tracked-for-reconstruction (unlimited by default).
  • the maximum size of an HTTP save file is limited by http-max-content-size (50k by default).
  • the memory dedicated to the referrer cache is limited by http-referrer-mem.
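
For instance, a cap such as http-max-tracked could be enforced as sketched below; this is illustrative code, only the parameter name comes from the text, and None stands for the default (unlimited):

    http_max_tracked = None  # default: unlimited
    tracked = set()          # currently tracked HTTP transactions

    def track(txn_id):
        """Start tracking a new transaction unless the cap is reached."""
        if http_max_tracked is not None and len(tracked) >= http_max_tracked:
            return False  # ignored, at the cost of questionable later pairing
        tracked.add(txn_id)
        return True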

Limitations

Page load time is the most interesting metric, yet we have seen that many conditions must be met to accurately reconstruct pages.

  • the process is very sensitive to missing TCP fragments (retransmitted fragments cause no problem, but fragments that are not mirrored to the probe do);
  • the bigger the proxies, the less reliable client isolation will be;
  • some heuristics regarding AJAX, content types and timing do not necessarily match your sites;
  • some clients may successfully hide the referrer (or worse, we may guess the wrong referrer);
  • HTTP analysis may consume more resources than what’s available (or configured);
  • any small inaccuracy in HTTP message reassembly or in transaction pairing will lead to highly inaccurate page load times.

SMB

The SMB module produces one flow for each pair made of a query and its answer. To link queries and responses together, the SMB protocol uses the following IDs:

  • Multiplex ID in SMB1
  • Message ID in SMB2

The sniffer uses these IDs jointly with the Tree ID, the command type, and the underlying connection (i.e., IP addresses, ports, VLAN, and so on) to properly link requests and responses together for each conversation.

However, this may induce a high number of flows for some simple and common operations such as reading or writing files: these operations are sent as multiple read or write commands, using buffers with a maximum size of 64 KiB or 1 MiB (for the more recent versions of the protocol).

For example, writing a 1 GiB file over 10 s (at a rate of roughly 100 MiB/s) would generate 1000 SMB2 WRITE commands with a 1 MiB buffer, resulting in 1000 flows stored in the database. The interval between two of these write commands would be roughly 10 ms. The number of flows would be an order of magnitude higher if the protocol used 64 KiB buffers.

This would give fine-grained precision, but it is not of much use most of the time, and the resulting number of flows may quickly inflate database usage or push the probe toward its license limit.

It is much more interesting to aggregate these statistics at a higher level: read and write commands can be aggregated together when they act on the same underlying file (based on its File ID).

As such, from PVX 5.0 onwards, the sniffer aggregates successions of the following commands over a small (configurable) period of time:

  • SMB1 READ_ANDX & WRITE_ANDX
  • SMB1 TRANS2 FIND_FIRST2, QUERY_PATH_INFORMATION, QUERY_FILE_INFORMATION & QUERY_FS_INFORMATION
  • SMB2 READ & WRITE
  • SMB2 QUERY_INFO
  • SMB2 QUERY_DIRECTORY

Some of these commands use the File ID as a discriminating factor; others require comparing paths or patterns, as in the sketch below.
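
A plausible aggregation key is sketched here; the grouping criteria come from the text, but the exact key layout is an assumption:

    def aggregation_key(connection, command, file_id=None, path=None):
        """Commands sharing this key within the (configurable) aggregation
        window are merged into a single flow."""
        # READ/WRITE and QUERY_INFO discriminate on the File ID, while
        # FIND_FIRST2 or QUERY_DIRECTORY compare paths or patterns instead.
        discriminator = file_id if file_id is not None else path
        return (connection, command, discriminator)

    # Two successive SMB2 WRITEs on the same file collapse into one flow:
    conn = ("10.0.0.1", "10.0.0.2", 445)
    assert aggregation_key(conn, "SMB2_WRITE", file_id=0x42) == \
           aggregation_key(conn, "SMB2_WRITE", file_id=0x42)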

You can expect fewer SMB flows after upgrading from PVX 4.2 and, more importantly, this should reduce SMB flow bursts. Unfortunately, some frequent commands, such as opens and closes, cannot be aggregated since they only appear once during file manipulations.

VXLAN

Starting from PVX 5.2, when the new “VXLAN stripping” option of the VXLAN parser is enabled (which is the case by default), the transport layer is simply discarded. The discarded layers are not accounted for in the traffic. Enabling this option makes sense when the VXLAN transport is considered a mirroring mechanism. You can switch back to the old behavior by disabling this option.
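
At the packet level, stripping amounts to skipping the outer encapsulation and restarting the analysis at the inner Ethernet frame, roughly as sketched below (assuming an untagged outer Ethernet header and an outer IPv4 header without options):

    OUTER_ETH = 14  # outer Ethernet header
    OUTER_IP = 20   # outer IPv4 header, no options
    OUTER_UDP = 8   # outer UDP header (destination port 4789 for VXLAN)
    VXLAN_HDR = 8   # VXLAN header (flags + VNI)

    def strip_vxlan(frame):
        """Return the inner Ethernet frame; the discarded outer layers are
        not accounted for in the traffic."""
        return frame[OUTER_ETH + OUTER_IP + OUTER_UDP + VXLAN_HDR:]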