.. _interpreting: Interpreting the results ======================== .. note:: With version 2.8, the in/out notion has been fully replaced by Server/Client. So in our graphs, any RTT or RR (in/out) metric should be read as RTT or RR (Server/Client), according to the following rules. - ``RTT in`` stands for ``RTT Server``. - ``RTT out`` stands for ``RTT Client``. - ``RR in`` stands for ``RR Server``. - ``RR out`` stands for ``RR Client``. .. index:: BCA, Business Critical Application, Dashboard Business Critical Application Dashboard --------------------------------------- To customize this view for your own needs, just go to the `Configuration` menu and choose the application you want to define as a 'business' one (see :ref:`bca_config`). The purpose of the Business Critical Application Dashboard (BCA) is to have, grouped into a single view, the most important elements that are critical for your business. Vital information is presented in a completely configurable and dynamic dashboard view to the people in charge in order to radically improve early diagnostics and impact analysis. What is monitored is the EURT (End User Response Time) metric. Thus, this dashboard reflects the quality of experience of the users for the selected critical applications. - In red: poor quality - In orange: medium quality - In green: good quality - In grey: not enough data gathered .. gofigure:: img/screenshots/spv/version_unk/bca_overview.png Business Critical Application Dashboard view. Business Critical Application Dashboard Capabilities ++++++++++++++++++++++++++++++++++++++++++++++++++++ - You can customize the business critical dashboard to view specific applications and metrics corresponding to your specific business. - From the BCA dashboard, you can drill down from the general view to detailed analysis and problem resolution views. .. gofigure:: img/screenshots/spv/version_unk/bca_shortcut_links.png Quick links in the Business Critical Application Dashboard view. Thus, from each Business Critical Application, with a single click on the appropriate icon, you can: - Directly access the corresponding Application Dashboard, - Add a filter on this specific Critical Application (in case you have defined a lot of Critical Applications and you only want to see one for a moment), - Edit the Application characteristics, - Directly access the flow details for this Application. .. note:: If you click on the icons that are next to the name of the application at the beginning of each line, the quick links will take into account the complete period of time currently displayed. If you click on the icons associated with a specific period of time, the quick links will use this period when redirecting you to a detailed screen. - You will always see up-to-date information with the auto-refresh feature of the BCA dashboard. The information will be automatically refreshed based on the data aggregation level (see :term:`aggregation period`). For example, if the “Aggregate level” is “1 minute”, the BCA will be updated every minute; if the “Aggregate level” is “1 hour”, it will be updated every hour. .. index:: BCN, Business Critical Network, Dashboard Business Critical Network Dashboard ----------------------------------- To customize this view for your own needs, just go to the `Configuration` menu and choose the entry labeled ``Business Critical Network`` (see :ref:`bcn_config`). The Business Critical Network Dashboard (BCN) is aimed at presenting in a single screen the status of your organization's most critical network “links”.
You can customize the business critical network dashboard to view the status of the most strategic links corresponding to your business. .. gofigure:: img/screenshots/spv/version_unk/bcn_overview.png Business Critical Network Dashboard. From the Business Critical Network Dashboard, you can drill down from the general view to more detailed information for analysis and problem resolution: .. gofigure:: img/screenshots/spv/version_unk/bcn_shortcut_links.png Detailed values for a point in time. By hovering with the mouse over a cell representing a point in time, you can view the threshold values for each direction (indicating status ``OK``, ``Warning`` or ``Alert``, as well as the value for each direction). You can also access the bandwidth graphs and the flow details table for each link. If you click on the icons that are next to the name of the link at the beginning of a line, the quick links will take into account the complete period of time currently displayed. If you click on the icons associated with a specific period of time, the quick links will use this period when redirecting you to a detailed screen. You will always see up-to-date information with the auto-refresh feature of the `BCN dashboard`. The information will automatically be refreshed based on the data aggregation level (see :term:`aggregation period`). For example, if the “Aggregate level” is “1 minute”, the BCN will be updated every minute; if the “Aggregate level” is “1 hour”, it will be updated every hour. .. index:: VoIP, SIP, RTCP, RTP VoIP Module ----------- Specific reporting for `Voice over IP` traffic is provided. The aim of this module is to show the volume and quality of service associated with VoIP flows. Supported protocols +++++++++++++++++++ These VoIP protocols are supported: - ``SIP`` + ``RTCP`` + ``RTP`` - ``MGCP`` + ``RTCP`` + ``RTP`` - ``SKINNY`` + ``RTCP`` + ``RTP`` For more information, please consult the corresponding `RFCs`: - ``SIP`` as defined in :rfc:`3261` - ``MGCP`` as defined in :rfc:`3435` - ``RTP`` as defined in :rfc:`3550` and :rfc:`3551` - ``RTCP`` as defined in :rfc:`3605` Basics of VoIP ++++++++++++++ `Voice over IP` relies on three protocols to operate over IP networks: - **Signalization protocol**: the role of this protocol is to establish and control the voice communications. It usually consists of communications between the IP phone and a call manager / IPBX. The two signalization protocols supported are ``SIP`` (Session Initiation Protocol) and ``MGCP`` (Media Gateway Control Protocol). Please note that ``SIP`` may or may not follow the same route as the ``RTP`` traffic, while ``MGCP`` follows the same route as ``RTP`` (Real-time Transport Protocol). - **Media protocol**: the role of this protocol is to carry the voice signal from one IP phone to another one (it may pass through the call manager / ``IPBX``). ``RTP`` is the only media protocol supported by |Product|. It usually runs over ``UDP``. - **Control protocol**: the role of this protocol is to carry quality and control information from one phone to the other. ``RTCP`` (Real Time Control Protocol) is the only control protocol supported. .. index:: MOS, Voice Quality Quality of service & MOS ++++++++++++++++++++++++ Mean Opinion Score (MOS) is a numeric indication of the perceived quality of service of VoIP. It ranges from ``1`` to ``5``, with ``1`` corresponding to the lowest quality and ``5`` to the highest (close to human voice).
+------------+-----------+ | MOS Rating | Meaning | +============+===========+ | 5 | Excellent | +------------+-----------+ | 4 | Good | +------------+-----------+ | 3 | Fair | +------------+-----------+ | 2 | Poor | +------------+-----------+ | 1 | Bad | +------------+-----------+ Please note that in a real network, a ``MOS`` of over ``4.4`` is unachievable. A low ``MOS`` will translate into echo and a degraded signal. ``MOS`` is, in principle, the result of a series of subjective tests; in the context of network analysis, ``MOS`` will be estimated using a formula that integrates ``3`` factors: - Network latency (``RTT`` recommended value: ``<100ms``) - :term:`Jitter` (recommended value: ``<30ms``) - Packet loss rate (recommended value: ``<5%``) Prerequisites +++++++++++++ To provide ``MOS`` values for `VoIP` traffic, it is necessary to capture the three flows: signalization (``SIP`` or ``MGCP``), media ``(RTP)`` and control protocol ``(RTCP)``. If one of these flows is absent from the traffic capture brought to the listening interface(s), the ``MOS`` value will not be calculated. Other quality-of-service metrics will remain available. +--------------+-----------------------------------------------------------------+ | Protocol | Metrics obtained by analysis of the protocol | +==============+=================================================================+ | ``SIP/MGCP`` | - ``Sign. RTT`` (network latency between each phone and the | | | signalization server – measured, in each direction, as the | | | interval between a request and the first response, definitive | | | or temporary) | | | - ``Sign. SRT`` (signalization server response time) | | | - ``Sign. RD`` (retransmission delay for the signalization | | | traffic) | | | - ``Sign. RR`` (retransmission rate for the signalization | | | traffic) | | | - ``Code`` (indicates how the VoIP call ended – e.g., error or | | | not; please note that the code depends on the protocol used) | +--------------+-----------------------------------------------------------------+ | ``RTP`` | - ``Jitter`` (standard deviation of latency for the media | | | traffic going from one IP phone to the other) | | | - ``Packet loss`` (percentage of packets lost in the | | | conversation at the point of capture of the probe - based on | | | RTP sequence numbers) | +--------------+-----------------------------------------------------------------+ | ``RTCP`` | - ``RTT`` (network latency between the two IP phones – based on | | | the timestamps provided by both IP phones) | +--------------+-----------------------------------------------------------------+ .. note:: ``RTT`` and ``MOS`` values depend to some extent on the quality of the measurement provided by ``RTCP``. Please note that ``MOS`` is not very sensitive to “normal” latency values. When referring to voice or media, we refer to the ``RTP`` traffic, which may correspond to different things (human voice, pre-recorded message, ringback tone, busy line tone, etc.). The `VoIP` module discards the :term:`jitter` and packet loss data present in the ``RTCP`` flow and replaces them with equivalent values computed internally. This is done for several reasons: - It was observed that many softphones do not place accurate (or even credible) values in these fields, - The ``RTCP`` stream is more often missing than present, probably because it is firewalled and of little use to the `VoIP` client software.
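Whatever the source, the jitter and loss estimates are ultimately derived from the ``RTP`` stream itself. Purely as an illustration (and not as a description of the product's internal implementation), a minimal sketch of the interarrival jitter estimator from :rfc:`3550` and of a sequence-number-based loss ratio could look like this:

.. code-block:: python

    # Illustrative sketch only: RFC 3550 interarrival jitter and a simple
    # sequence-number-based loss ratio for a single RTP stream. This is not
    # the product's internal implementation.

    def interarrival_jitter(packets, clock_rate=8000):
        """packets: (arrival_time_in_seconds, rtp_timestamp) tuples, in arrival order.
        Returns the RFC 3550 jitter estimate converted to milliseconds."""
        jitter, prev = 0.0, None
        for arrival, ts in packets:
            if prev is not None:
                # D = difference of relative transit times, in RTP timestamp units
                d = (arrival - prev[0]) * clock_rate - (ts - prev[1])
                jitter += (abs(d) - jitter) / 16.0      # RFC 3550, section 6.4.1
            prev = (arrival, ts)
        return jitter / clock_rate * 1000.0

    def loss_ratio(sequence_numbers):
        """Loss ratio based on RTP sequence numbers (wrap-around not handled)."""
        expected = max(sequence_numbers) - min(sequence_numbers) + 1
        return 1.0 - len(sequence_numbers) / expected

    # Hypothetical stream: one packet every 20 ms, one sequence number missing.
    stream = [(i * 0.020, i * 160) for i in range(50)]
    print(interarrival_jitter(stream), loss_ratio([s for s in range(50) if s != 25]))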
Indeed, for the `VoIP` module to remain passive, there is no option other than to compute these values for every ``RTP`` stream, generating jitter and packet loss values that are a good estimate of the real jitter and loss experienced by both users. This is why, even in the absence of an ``RTCP`` stream, we can display jitter and packet loss values (but no ``RTT`` and, thus, no ``MOS``). VoIP views ++++++++++ .. index:: MOS MOS Over Time ^^^^^^^^^^^^^ This view shows the evolution of the `Mean Opinion Score` through time. A second graph shows the evolution of the number of calls to help you evaluate how many were impacted by a ``MOS`` degradation. - By hovering over a specific point in time on the graph, you can display the exact value for each metric on the right side of the graph. - By clicking on a specific point in time, you are taken directly to the `VoIP` conversations for that time interval. .. gofigure:: img/screenshots/spv/version_unk/mos_chart.png .. index:: Jitter, Packet Loss Jitter / Packet Loss ^^^^^^^^^^^^^^^^^^^^ This view shows the evolution through time of the jitter and the packet loss. This view can help you understand ``MOS`` variations and see which metric is impacting the ``MOS``. - By hovering over a specific point in time on the graph, you can display the exact value for each metric on the right side of the graph. - By clicking on a specific point in time, you are directed to the `VoIP conversations` for that point in time. .. gofigure:: img/screenshots/spv/version_unk/voip_jitter.png VoIP Bandwidth & Call Volume ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ These views show charts for: - the bandwidth used for voice and signalization. .. gofigure:: img/screenshots/spv/version_unk/voip_bandwidth_chart.png VoIP Bandwidth Chart. - the evolution of the volume of calls through time. Calls are distributed between successful and unsuccessful calls. Successful calls are conversations where some voice was exchanged; unsuccessful calls are conversations without any voice exchanged. .. gofigure:: img/screenshots/spv/version_unk/voip_callvolume_chart.png VoIP Calls Volume. VoIP Conversations & Details ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ The last two views show each call individually: `VoIP Conversations` displays `usage` metrics, while `VoIP Flow Details` is the same table with the addition of `performance` metrics. .. note:: The “caller” value corresponds to the metric for the ``RTP/RTCP`` traffic from the caller to the callee, while the “callee” value corresponds to the metric for the ``RTP/RTCP`` traffic from the callee to the caller. .. gofigure:: img/screenshots/spv/version_unk/voip_conversation_details.png VoIP Calls. .. index:: Dashboard Application Dashboards ---------------------- A dashboard is a single-screen report that displays the relevant information to understand how an application is doing. Application dashboards are available in APS from version 1.7. .. note:: These dashboards are unavailable in |Product| `NPS`. A dashboard is extremely useful: - as a starting point for troubleshooting, - as a tool to communicate to management and business users on how the application is actually performing. Section :ref:`components` will discuss the three components that display this key information. .. gofigure:: img/screenshots/spv/version_unk/app_dashboard_overview.png Overall view of the application dashboard. How can it help?
++++++++++++++++ For reporting ^^^^^^^^^^^^^ In a single report, you have enough to explain to a business user or a manager how the application performed over time, which servers were performing worse and which zones were impacted. On top of the ``EURT``, all this is based on three synthetic metrics that are easy to explain to non-technical people: - ``RTT`` – network performance; - ``SRT`` – server performance; - ``DTT`` – delivery of the application response through the network. For troubleshooting ^^^^^^^^^^^^^^^^^^^ For network administrators, this report brings together all of the information about a business application required to: - validate whether or not there is a slowdown; - identify the origin of a slowdown (network, application, response delivery); - determine which users or servers were impacted. With one click, you can determine whether there was a slowdown, what the origin of the degradation was and which client zones were impacted. With an additional click, you can view whether all clients in a :term:`zone` were impacted or if the server response time degradation was caused by another :term:`application` hosted on the same server machine. .. index:: Components .. _components: Components ++++++++++ .. index:: EURT 1st element: the evolution of End User Response Time through time ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. gofigure:: img/screenshots/spv/version_unk/app_dashboard_ex1.png End User Response Time (EURT) graph. This ``EURT`` graph shows: - the evolution of the quality of experience for users of this application over the period of time, - the number of transactions, which helps you consider the evolution of EURT with rigor and common sense (you would not consider a degradation of `EU Response Time` for ``10`` applicative transactions in the same way as for ``10,000``). The breakdown of ``EURT`` into three intelligible components (``RTT`` for network latency, ``SRT`` for `Server Response Time` and ``DTT`` for `Data Transfer Time`) lets you know at first glance the possible origin of the performance degradation. For example, in the screenshot above, we can observe an increase in the ``SRT``; the network and the time required to send the response to the client have not increased. Either the server responded slower overall or some specific queries required a much longer processing time (you can determine this by drilling down to that specific point in time). 2nd element: EURT by Server ^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. gofigure:: img/screenshots/spv/version_unk/app_dashboard_ex2.png EURT by server. This chart compares the ``EURT`` on each server that runs this :term:`application`. In this case, it is obvious that `Atlantis` tends to respond much slower than `Brax`. By clicking on it and looking at a second dashboard called `Server/Application Dashboard`, we will be able to determine whether this is permanent or occasional and whether it is due to the load on this application or on another one hosted on the same server. 3rd element: EURT by Client Zone ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ .. gofigure:: img/screenshots/spv/version_unk/app_dashboard_ex3.png EURT by Client zone. What we can see here is a breakdown of the ``EURT`` for this :term:`application` between client zones; with one glance, you can determine which :term:`zone` was impacted by the degradation as well as the different levels of performance depending on the users' location. In the screenshot above, we can see that mainly one zone was impacted by the ``SRT`` degradation.
There are also significant differences in performance between zones due to differences in ``RTT`` values (network latency). .. index:: Dashboard Drill-down dashboards +++++++++++++++++++++ |Product| APS offers two additional dashboards: - Client zone / Application dashboard. - Server / Application dashboard. Client zone / Application dashboard ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ You can access this dashboard either through the menu or by clicking on a specific client zone in the `Application Dashboard`. This dashboard contains three pieces of information: - ``EURT`` graph through time for this client :term:`zone` and this :term:`application`. - `EURT breakdown by server` (so that you can compare the performance offered by different servers for that client zone). - `EURT per client` (so that you can identify whether all clients are impacted by a slowdown, or which individual client generates more volume or has worse application performance). .. gofigure:: img/screenshots/spv/version_unk/client_zone_dashboard.png Client zone / Application dashboard. The breakdown by client lets us know whether the whole zone was impacted or just some individual users, on which component of the ``EURT`` (network latency, server response time or data transfer time), and for which number of transactions and amount of traffic. .. gofigure:: img/screenshots/spv/version_unk/breackdown_by_client.png Breakdown by client. Server / Application dashboard ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ You can access this dashboard either through the menu or by clicking on a specific server in the `Application Dashboard`. This dashboard contains three pieces of information: - ``EURT`` graph through time for this server and this application - `EURT breakdown by client zone` (so that you can compare the performance offered to different client zones from that server) - Comparison with other applications provided by that server (so that you can identify whether a peak of transactions on another application is impacting the performance of that application, and see the volume of data, transactions and performance metrics for all applications provided by this server). .. gofigure:: img/screenshots/spv/version_unk/server_application_dashboard.png Server / Application Dashboard Interactions ++++++++++++ Dashboards have been developed so that a single click provides more detailed information on the object you are interested in: - If you click on the ``EURT`` graph in any of these three dashboards, you focus on a shorter period of time. For example, if you click on an ``SRT`` peak, depending on the aggregation level, you either reach a lower aggregation level for a shorter period or the corresponding performance conversations (see :ref:`data_aggregation`). At the same time, you will get the server and zone breakdown for that more specific period of time. - If you click on a server, you reach the Server / Application dashboard. - If you click on a client zone, you reach the Client zone / Application dashboard. TCP Errors / Events ------------------- Objectives ++++++++++ The ``TCP`` statistics can be displayed in tables by selecting the appropriate column theme. They can reveal dysfunctions or unusual events.
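As a purely illustrative sketch (not the product's implementation), counters similar to those listed in the next subsections can also be derived offline from an exported PCAP file, for instance with Scapy; the file name below is hypothetical:

.. code-block:: python

    # Illustrative sketch: derive a few per-conversation TCP flag counters from a
    # PCAP file with Scapy. 'capture.pcap' is a hypothetical exported file.
    from collections import Counter, defaultdict
    from scapy.all import rdpcap, IP, TCP

    counters = defaultdict(Counter)
    for pkt in rdpcap("capture.pcap"):
        if IP not in pkt or TCP not in pkt:
            continue
        ip, tcp = pkt[IP], pkt[TCP]
        # Direction-less conversation key: both (address, port) endpoints, sorted.
        key = tuple(sorted([(ip.src, tcp.sport), (ip.dst, tcp.dport)]))
        if tcp.flags & 0x02:
            counters[key]["SYN"] += 1
        if tcp.flags & 0x01:
            counters[key]["FIN"] += 1
        if tcp.flags & 0x04:
            counters[key]["RST"] += 1

    # Conversations with the most resets are often the ones worth investigating.
    for key, flags in sorted(counters.items(), key=lambda kv: -kv[1]["RST"])[:10]:
        print(key, dict(flags))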
TCP errors ++++++++++ For each ``TCP`` conversation, the following fields are available: - RD Server/Client - Duplicate ACKs - number of SYNs - number of handshakes - number of session ends - number of FINs from the client - number of FINs from the server - number of RSTs from the client - number of RSTs from the server - number of timeouts By sorting on the RD or duplicate ACK fields, one can quickly check the worst conversations in terms of ``TCP`` performance. Also, the number of reset packets is usually noteworthy. One can then jump to the IP summary page of either the client or the server (depending on who is to blame) to gather further data on this event. TCP events ++++++++++ For each ``TCP`` conversation, the following fields are available: - payload - number of packets - number of handshakes - number of timeouts - number of RSTs from the client - number of RSTs from the server - number of FINs from the client - number of FINs from the server .. index:: Packet Analysis, Tcpdump, PCAP .. _packet_analysis: Packet-Level Analysis --------------------- Objectives ++++++++++ Once you have identified the origin of an issue, you may want to analyze it further by looking at the packets themselves. You have three ways to do this: - :ref:`manual_packet_analysis` through Pulsar's ``tcpdump`` command - :ref:`autopcap` - :ref:`instacap` from the data of a result row. .. _manual_packet_analysis: Manual packet capture +++++++++++++++++++++ By connecting through Pulsar, you can start a manual capture of any traffic viewed on the interface of your device. To do so, you need to: 1. Connect to Pulsar (see :ref:`pulsar`). 2. Enter the command to launch the trace: for example, ``tcpdump_tofile -i host ``. 3. Enter ``Control+C`` to stop the trace. Use the ``tcpdump`` command instead of ``tcpdump_tofile`` to display the results of the packet capture in real time. .. note :: * You can access help by typing ``help tcpdump_tofile``. * You can refer to the ``tcpdump`` command's online `manual `_. * All parameters are available except ``-w``. Accessing the tracefile ^^^^^^^^^^^^^^^^^^^^^^^ To access the PCAP file generated by the ``tcpdump_tofile`` command, you should connect to the probe via FTP, using an FTP client and the Pulsar admin user (see :ref:`pulsar`). .. index:: Autopcap .. _autopcap: Automated Packet Capture (AutoPCAP) +++++++++++++++++++++++++++++++++++ Principles ^^^^^^^^^^ |Product| can capture packets automatically when abnormal values are observed on critical servers. These packets are presented for later analysis as `PCAP` files, which can be downloaded through the web graphical interface at the conversation level. Applications ^^^^^^^^^^^^ These files are presented in the following views: - Conversations - DNS messages - VoIP details In each of these views, a `PCAP` column at the right end of the table contains a small icon indicating whether packets have been captured for a given conversation or not. If the `PCAP` file is available, you can download it by clicking on the icon. Once the file has been downloaded, you can view the packets using any protocol decoder capable of reading `PCAP` files. .. gofigure:: img/screenshots/spv/version_unk/AutoPCAP_conversations2.png PCAP column in Performance conversations. .. gofigure:: img/screenshots/spv/version_unk/AutoPCAP_DNS_conversations2.png PCAP column in DNS messages. .. gofigure:: img/screenshots/spv/version_unk/AutoPCAP_VOIP_conversations2.png PCAP in VoIP details.
For instance, if you are using `Wireshark` to decode the packets, you can view them directly. .. gofigure:: img/Wireshark1.png Viewing packets in Wireshark. To view the query and the beginning of the response, you can use the `Follow TCP stream` feature (in the Analyze menu). .. gofigure:: img/Wireshark2.png Viewing query and response. Conditions ^^^^^^^^^^ Packets are saved by |Product| as soon as the conversation they belong to matches a certain number of conditions: * If `Capture HTTP` is checked in a `Zone`, and an IP address matches the zone subnet (either as client or server). * If `Capture HTTP` is checked in an `Application`, and a port or an IP address matches the application (either as client or server). * And one of the following metrics is considered out of the norm: * Server Response Time ``(SRT)`` for ``TCP`` flows * Retransmission Rate * `DNS` Response Time .. note:: Why doesn't PVX rely directly on the `Zone` or the `Application` of a flow to capture PCAP files? We want to capture the flow for troubleshooting from the very first packet, and with only one packet, PVX cannot yet determine the `Zone` or the `Application` of the flow. .. note:: `PCAP` files are a **sample** of the conversation. If you request on a one-hour interval and get a `PCAP` file, the `PCAP` will not contain one hour's worth of data but only the packets that match the above conditions. Limitations ^^^^^^^^^^^ The Automatic Packet Capture feature works under certain conditions to ensure the proper execution of other services provided by |Product|. These necessary limitations include: * The retention of `PCAP` files is limited by the disk space allocated for captures; in the current version, this space is limited to 10GB by default (for both manual and automatic captures). When all 10GB are used, no new `PCAP` file is saved. You can change this value on the **Sniffer Configuration** page. * The maximum retention time for automatic captures is set to 48 hours; after this delay, `PCAP` files will be deleted. This cannot be modified. * The sniffer component of |Product| is set to write a maximum of 5,000 `PCAP` files simultaneously; if more than 5,000 conversations are needed, change the parameter on the **Sniffer Configuration** page. Otherwise, some conversations will not be recorded at packet level. Please note that the threshold values and voluntary limitations will be reviewed in newer versions in light of our experience and the customer feedback that we receive. If you need an exhaustive trace of a given set of conversations, you can also use the manual capture feature available through Pulsar. .. index:: Triggered PCAP .. _instacap: Triggered Packet Capture ++++++++++++++++++++++++ Triggered PCAPs are generated from the user interface, either from the result rows or from a configuration page. In both cases, administrator rights are required. The setup is very easy because the capture filters are preset with the desired flow characteristics, and the main advantage of triggered PCAPs is that it is possible to set a date and time to start the capture. .. gofigure:: img/screenshots/spv/version_unk/instacap_button.png Load the form to trigger a new PCAP; the flow data will be used to preset the filters. .. gofigure:: img/screenshots/spv/version_unk/instacap_form.png Trigger a PCAP for midnight. By default, only the local capture is selected, but all known captures are available. If multiple captures are selected, one PCAP file will be created for each of them.
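Whichever way a `PCAP` file was produced (AutoPCAP or triggered capture), once downloaded it can be inspected with any PCAP-capable tool. For instance, a quick sanity check of what the sample actually contains could be scripted with Scapy (the file name is hypothetical):

.. code-block:: python

    # Quick sanity check of a downloaded PCAP sample: packet count and time span.
    # 'sample.pcap' is a hypothetical downloaded file; requires Scapy.
    from scapy.all import rdpcap

    packets = rdpcap("sample.pcap")
    if packets:
        start, end = float(packets[0].time), float(packets[-1].time)
        print(f"{len(packets)} packets covering {end - start:.1f} seconds")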
All added triggered PCAPs are referenced in the dedicated page in the configuration menu. It is possible to delete and download them, regardless of the capture on which they were taken. If a triggered PCAP was run on multiple captures at once, the resulting files will have the same name and the same filters. They will be grouped together in the management interface. .. gofigure:: img/screenshots/spv/version_unk/instacap_main_page.png The triggered PCAP management page with the first one created on two captures. Interpretation Guidelines ------------------------- The objective of this section is to help our customers make the best use of the performance reports provided by their appliance. You will find a brief overview of how application performance issues can be solved with PVX. This first section focuses on the synthetic metrics that produce a measure of the quality of user experience (QoS - End User Response Time) and gives you a simple explanatory framework to understand the cause of application slowdowns (Round Trip Time, Server Response Time and Data Transfer Time). .. note:: Some metrics and views described below are only available in |Product| `APS`. Objectives ++++++++++ Before you start analyzing performance reports, there are certain things to keep in mind: **Performance metrics should not be considered as absolute values, but in comparison with different time intervals, servers and user groups.** Performance metrics represent time intervals. Although most of them correspond to the measurement of a concrete phenomenon, it is almost impossible to provide a scale of what is a good or a bad response time without knowing the impact it has on users. For example, indicating that the Network Round Trip Time from a ``site A`` to a ``site B`` is ``200ms`` does not tell you whether that measure is acceptable or not. In the same way, a Server Response Time ``(SRT)`` of ``100ms`` for an ``application A`` may be very "bad" while the same value would be excellent for an ``application B``. As a consequence, it is important to consider performance metrics as relative values. One of the keys to a good interpretation of performance metrics is to systematically compare a performance metric's value with: - another time period, - another user group. **Mixing up performance metrics for several applications does not make sense.** When looking at application performance metrics, you should be very careful to isolate applications for analysis. As a consequence, the metrics which very much depend on an application's specific behaviour should not be mixed together. This is true for metrics such as ``EURT`` (End User Response Time), ``SRT`` (Server Response Time) and ``DTT`` (Data Transfer Time). **RTT measurements can marginally be impacted by the behaviour of the operating system.** Network Round Trip Times for ``TCP`` are based on the ``TCP`` acknowledgment mechanism. This means that, although ``RTT`` is generally a good measurement of round trip latency, if the operating system of one of the parties is so overloaded that the acknowledgment process becomes slower, ``RTT`` values will be impacted. ``RTT Server`` would be impacted on the server side and ``RTT Client`` on the client side. ``RTT`` should then be analyzed in parallel with ``CT`` (Connection Time) because the handling of a new session by the IP stack has a higher priority.
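Coming back to the first guideline above, here is a small hypothetical illustration of treating metrics as relative values (the figures, zone names and the 1.5x threshold are arbitrary assumptions, not product defaults):

.. code-block:: python

    # Hypothetical illustration of the "relative values" guideline: the same SRT
    # figure is judged against a baseline period for the same zone, not in absolute.
    baseline = {"VLAN_Sales": 80.0, "VLAN_R&D": 75.0}   # average SRT in ms, last week
    current = {"VLAN_Sales": 240.0, "VLAN_R&D": 78.0}   # average SRT in ms, today

    for zone, srt in current.items():
        ratio = srt / baseline[zone]
        status = "degraded" if ratio > 1.5 else "normal"   # arbitrary threshold
        print(f"{zone}: {srt:.0f} ms ({ratio:.1f}x baseline) -> {status}")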
**Some values are averaged measures.** For each conversation, two kinds of values are reported: - counters, for instance packet or byte counters, which are the sum over all connections aggregated for this conversation; - performance metrics, for instance ``RTT``, ``SRT``, ``DTT`` and the like, which are average values over all samples aggregated for this conversation. .. index:: EURT EURT ^^^^ ``EURT`` stands for End User Response Time. This metric is an aggregate of various other measures meant to give an idea of the perceived overall end user experience. It is taken as the sum of ``RTT``, ``SRT`` and ``DTT``. ``EURT`` has no meaningful physical counterpart. Only its evolution makes sense and allows the system administrator to check at a glance whether a network :term:`zone` is behaving as usual or not. Notice that expected correct values for both ``SRT`` and ``DTT`` depend on the protocol at hand. As a consequence, you should not try to compare two ``EURTs`` of different applications. .. index:: RTT RTT ^^^ ``RTT`` stands for Round Trip Time. ``RTT`` gives an approximation of the time required for a packet to reach its destination, and can be further decomposed into an ``RTT Server`` (delay between a data packet sent by the client and its ``ACK`` from the server) and an ``RTT Client`` (the other way around). As a typical IP implementation will delay the acknowledgment of incoming data, additional tricks are used in order to rule out these software biases: - make use of ``SYN/FIN`` acknowledgment and some exceptional conditions such as ``TCP`` resets, which suffer no such delays, to estimate a realistic upper bound. - exclude unusually high ``RTT`` values. - bound ``RTT Server/Client`` by ``SRT/CRT`` if the ``RTT`` sample set looks suspicious. ``RTT`` refers to the bare speed of the physical layer. It is unaffected by packet retransmissions, packet loss or similar occurrences. ``RTT`` may be affected by (from the most common to the rarest): - Slow network equipment between client and server (such as a router or a switch); - Link layer overloaded (Ethernet collisions, for instance); - Malfunction of one of the involved network adapters. These troubles should be further investigated by comparing with other client and/or server zones in order to locate the misbehaving equipment. Notice that a degradation of ``RTT`` will almost invariably impact other metrics as well. .. index:: SRT SRT ^^^ ``SRT`` stands for Server Response Time. ``SRT`` gives an estimation of the elapsed time between the last packet of an applicative request and the first packet of the server's response. ``SRT`` represents the processing time of the server, at the application layer, for a given request. ``SRT`` may be affected by (from the most common to the rarest): - Time-consuming application request (a complex ``SQL`` command can take the server longer to process); - Application layer overloaded (more requests than the server can handle in a short period of time); - ``SRT`` can be marginally affected by the increase of network latency between the point of capture and the server (parallel increase of the ``RTT Server`` value). To pinpoint the root cause of the slowdown, we want to compare the ``SRT`` for a given server/application with other applications on the very same server. If there is a blatant difference, the application is guilty. Otherwise, we want to compare it with other servers in the same zone, then different zones. .. index:: DTT DTT ^^^ ``DTT`` stands for Data Transfer Time.
``DTT server`` is defined as the time between the first data packet of the response (with the ``ACK`` flag and a non-empty payload) from the server and the last packet considered as part of the same response (if the packet has the same acknowledgement number); ``FIN``, ``RST`` packets from server or client will also be considered as closing the sequence. A timeout will cancel a ``DTT``. Note that if the response is small enough to be contained in a single packet, the ``DTT`` will be ``0``. ``DTT client`` is the same metric in the other direction. ``DTT`` (sum of both server and client ``DTT``) is the time the user is going to wait for the response to circulate on the network from the server to the client. It does not depend on the Server Response Time: a ``DTT`` might be short for a long ``SRT`` (the request might require a large calculation, but the result represents a small volume of data), or a ``DTT`` might be very large while the ``SRT`` is very short (the request is easy to handle yet the response is very large). ``DTT`` depends on (from the largest impact to the smallest): - the size of the response (the more data it contains, the longer it takes to transfer it), - the level of retransmission (the more packets are retransmitted, the longer it will take to transfer the whole response), - the network latency (the longer it takes to transfer packets through the network, the longer it will take to transfer the response - minor impact), - the actual throughput that can be reached to transfer the response from the server to the client. ``DTT`` may vary (from the most common to the rarest): - globally or on a per-transaction basis (if only some transactions are impacted, it may be linked to the size of some specific application responses), - for all or some client zones (if only some client zones are impacted, it may be linked to specific network conditions — retransmissions), - for all or some servers (if only a specific server is impacted, it may be due to a specific server issue in delivering the response). Scenario guidelines +++++++++++++++++++ Slow site connection ^^^^^^^^^^^^^^^^^^^^ **Hypothesis:** One or several end users complain about slow access to all applications (both inside and outside the LAN). **Diagnosis:** You will find in this section the typical information to gather in order to diagnose the issue: - Is the application really slower for this site? You can get this information from the Application Performance Dashboard: .. gofigure:: img/screenshots/spv/version_unk/100002010000021D00000149E23F3EF1.png Zone comparison in the Application Performance Dashboard. - Does the slowdown occur for a specific application? If so, check :ref:`slow_application`. - Does the slowdown occur for a specific server? If so, check :ref:`slow_server`. .. gofigure:: img/screenshots/spv/version_unk/100002010000021000000156F4235B24.png EURT comparison between servers in the Application Performance Dashboard .. gofigure:: img/screenshots/spv/version_unk/10000201000003700000018370907FCE.png Server Response Time comparison through Server Performance. - Did you upgrade the client workstations recently? If so, it's a specific system issue. You may ask the System Administrator for more details. - Did you upgrade your network equipment? If so, the router/switch configuration is probably involved. - We may do an in-depth inspection of the PVX dashboards. Check the **Monitoring -> Performance Over Time Chart**. .. gofigure:: img/screenshots/spv/version_unk/10000201000002F000000131A12627CA.png Network Round Trip Time analysis - Do the **Retransmission Rate** and **Retransmission Delay** vary? If so, we might face a congestion issue. Take a look at the router's load, etc. .. gofigure:: img/screenshots/spv/version_unk/100002010000035F0000024870DF3B1E.png Retransmission analysis .. gofigure:: img/screenshots/spv/version_unk/100002010000033E000001EBAA0ED6CC.png Retransmission analysis - The general slowdown for a client zone may also be the consequence of a crucial service: the ``DNS``. Check out :ref:`dns_response_time`. - Look at the **Monitoring -> Bandwidth Chart** to inspect the bandwidth variation, and the number of ``TCP/UDP`` flows as well. .. gofigure:: img/screenshots/spv/version_unk/100002010000033E000001BC739AFF22.png Bandwidth charts .. gofigure:: img/screenshots/spv/version_unk/1000020100000313000002D424847D7F.png Impact of congestion on retransmissions and network latency or connection time They might have exceeded a QoS threshold, such that all new application requests are blocked. A hint would be an increasing number of TCP RST packets. To be sure, you may dive into the **Analysis -> TCP Errors** menu. .. gofigure:: img/screenshots/spv/version_unk/1000020100000356000000E47D3E5560.png Number of RST packets sent from the TCP servers .. index:: Application .. _slow_application: Slow application ^^^^^^^^^^^^^^^^ Hypothesis ~~~~~~~~~~ One or several end users complain about slow access to a specific application -- a fileserver. Prerequisites ~~~~~~~~~~~~~ Zones have been configured to reflect the customer's network topology. The application ``Samba_CIFS`` has been identified. The traffic to the fileserver is mirrored to one of the listening interfaces of the probe. Where to start: a global view of the application performance! **1st example** .. gofigure:: img/screenshots/spv/version_unk/100002010000033E000003431F334305.png Peak in Server Response Time: application performance Display the Application Dashboard for a relevant period of time. We can easily observe a peak in ``SRT`` from ``6`` to ``18:15``. From the breakdown by zone, we can easily conclude that only one zone has been impacted. .. gofigure:: img/screenshots/spv/version_unk/100002010000040500000348BB8F0C75.png Peak in server response time: Application EURT By clicking on that zone, we can see this client zone's application dashboard: From this, you can conclude that only one client (= user) was impacted. This issue was definitely due to a slow response of the server; it may be due to an application issue or a request which is specifically hard to respond to. **2nd example** .. gofigure:: img/screenshots/spv/version_unk/10000201000003A2000003D5A8855291.png Peak in server response time: Application dashboard Inspect the Application Dashboard for a relevant period in the past (48 hours for example). This dashboard shows in the upper part the evolution of the End User Response Time ``(EURT)`` through time for this fileserver. - We can easily observe that the quality of experience of users accessing this application got much worse yesterday afternoon. - We can easily identify that this was due to a degradation of ``RTT`` (Round Trip Time - indicator of network latency) and not to the Server Response Time ``(SRT)`` or the Data Transfer Time ``(DTT)``. From this graph, we can conclude that the server and the application are likely not related to the slowdown.
By looking at the two bar charts which show the breakdown by server and by client zone, respectively, we can draw the following conclusions: - This application is distributed by one server only ``(192.168.20.9)``. - The ``EURT`` varies greatly between client zones, mainly because of ``RTT``. - `VLAN_Sales` has much worse access to the application than `VLAN_R&D`, mainly because of the network latency. To confirm our first conclusions, click on the peak of ``EURT`` in the upper graph. We can narrow our observation period to better understand what happened at that point in time. .. gofigure:: img/screenshots/spv/version_unk/10000201000003A2000004107E704D58.png Peak of RTT in Application Dashboard This confirms the following conclusions: ``RTT`` went up for `VLAN_Sales` only. Understanding the perimeter of the slowdown ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We now know that only `VLAN_Sales` was impacted by this slowdown, due to a longer network ``RTT``. We therefore need to understand whether this was general (i.e., impacted all clients in the zone) or isolated to certain clients. .. gofigure:: img/screenshots/spv/version_unk/100002010000040C000001A9549B12C9.png Peak in server response time: Conversations To achieve this, we can simply display the Performance conversations for the application `Samba_CIFS` for the zone `VLAN_Sales`. From this screen, we can draw the following conclusion: Only the clients ``192.168.20.205`` and ``192.168.20.212`` seem to be impacted. The other clients have very short ``RTT`` values. .. gofigure:: img/screenshots/spv/version_unk/10000201000003A200000208205A3038.png Peak in server response time: Conversations To confirm this, we need to check that these two hosts are the only ones impacted and check whether they were impacted only when accessing the fileserver. To do so, we look at the Performance conversations between the `VLAN_Sales` and the `Private` zone. From this, we can draw the following conclusions: - Not only ``192.168.20.212`` and ``192.168.20.205``, but also ``192.168.20.220`` and ``192.168.20.50`` were impacted. - The `Samba_CIFS` (access to the fileserver) was not the only impacted application; ``SMTP``, ``HTTP`` and the `Web Intranet SecurActive` were also impacted. Actions to be taken after that analysis: - Check the windowing configuration on the operating system of these hosts (if the value is high, this is normal). - Check the level of usage of the host (CPU, RAM usage). Alternative scenarios: - If we had seen some retransmissions, we would check whether they all occur on the same edge switch and inspect the interface configuration and media errors. .. _slow_server: **Slow server** ^^^^^^^^^^^^^^^ **Hypothesis**: Users complain about having to try several times to connect to a web-based application named `“Salesforce”`. The administrator suspects the application server hosting `“Salesforce”` is slow. **How to analyze the problem**: First, check to see if all applications on the application server hosting `“Salesforce”` are slow or if it is just that particular application. If all applications are slow, then the application server may, in fact, be a slow server. If just the web-based application `“Salesforce”` is slow, while the other applications ``(CRM)`` are responding quickly, the problem may be the application itself. To begin diagnosis, go to “**Monitoring**” -> “**Clt/Srv Table**”. Select the application server from the drop-down box labeled “**Server Zone**” and click “**Search**”.
- If we see that all applications on the server are responding slowly, i.e., the ``SRT`` values are high for both `“Salesforce”` and `“CRM”`, the issue is related to the server, not to the applications. - Second, check the Connection Time of the application server. If the connection times are high, then this may also indicate a slow server. - Third, check for retransmissions between the clients and the application server. If there are a lot of retransmissions, then either the application server or a network device in between is dropping packets. Go to “**Monitoring**” -> “**Performance Over Time chart**”. Select the application server “Salesforce” from the drop-down box labeled “**Server Zone**” and click “**Search**”. .. gofigure:: img/screenshots/spv/version_unk/10000201000003FB000002D51ABEA0D0.png Slow server: Performance Over Time chart Here, we see that there is a high Retransmission Rate ``(RR Server)`` going from the clients to the application server. However, none of the packets from the server to the clients needed to be retransmitted (``RR Client`` is around ``0``). This indicates that the application server is, in fact, dropping the packets and is therefore a slow server (assuming that the route taken from the client to the server is the same route taken from the server to the client, as is standard practice). Lastly, check the `TCP errors` of the clients and the `Application server`. If the server reset count or the number of timed-out sessions is high, this is a further indication of a slow server. Go to **Analysis -> TCP errors**. Select the application server `“Salesforce”` from the drop-down box labeled “**Server Zone**” and click **Search**. .. gofigure:: img/screenshots/spv/version_unk/1000020100000376000000F4E9025B12.png Slow server: TCP Errors Here, we see that there are a lot of server resets and timeouts. Given all the above information, we can conclude that the application server is operating slowly. At this point, the server administrator should perform direct diagnosis on the application server to verify CPU, RAM and HD usage. .. index:: Application **N-tier application performance issue** ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ **Hypothesis**: ~~~~~~~~~~~~~~~ Users are complaining about slow response time from an in-house web application. Since this application has an N-tier architecture, its performance as seen by a client is tied to several parameters: - DNS latency to resolve the web server name from the client host (see **DNS Response Time**) - Connection time to the server - Data Transfer Time between these hosts - DNS latency to resolve other server names accessed from the web server (database servers for instance, see **DNS Response Time**) - Connection and data transfer times between these hosts - Server response time of these servers Identifying the culprit: ~~~~~~~~~~~~~~~~~~~~~~~~ First, we need to find out if the experienced slowdown is due to the web front-end itself. To this end, check every component of the ``EURT``: - If ``SRT`` is fast but ``RTT`` and/or ``DTT`` are high (see also Connection Time), then we are facing a network slowdown. Refer to previous sections of this guide to further track down the problem. - If ``SRT`` is predominant compared to ``DTT`` and ``RTT``, then the application itself is to blame. Proceed to find out what is affecting performance. - Then check the ``EURT`` between the web server and each of the other servers involved (databases, etc.). If some of these ``EURTs`` appear to be degraded, then recursively check these other hosts.
If not, then check the web server's load average. Additional metrics ++++++++++++++++++ **TCP anomalies** ^^^^^^^^^^^^^^^^^ .. index:: RST, Reset **RST packets** ~~~~~~~~~~~~~~~ A ``TCP`` connection is reset by an ``RST`` packet. There is no need to acknowledge such a packet; the closure is immediate. An ``RST`` packet may have many meanings: - If a ``TCP`` client tries to reach a server on a closed port, the server sends an ``RST`` packet. The connection attempt could be a malicious one (port scanning -- NMAP, etc.), or the consequence of an unexpectedly down server, a client/server misconfiguration, a server restart, etc.; - A router might send an ``RST`` packet if the incoming ``TCP`` packet does not comply with the security policy (the source IP address range is banned, the number of connection attempts is too high in a short period of time, etc.); - A QoS (Quality of Service) device limits the bandwidth (or the number of connections) by sending an ``RST`` packet to any new connection attempt; - If an Intrusion Detection System (e.g., Snort) detects a malicious connection, it can send an ``RST`` packet to forcibly close it; - If a host between Client and Server wants to perform a Denial of Service, it can reset the connection by sending ``RST`` to both peers. Basically, it's the same mechanism as the previous one, but the motivation is quite different. .. index:: Retransmission **Retransmissions** ~~~~~~~~~~~~~~~~~~~ One of the ``TCP`` metrics that is interesting to analyze is retransmission. A ``TCP`` retransmission occurs when a ``TCP`` packet is resent after having been either lost or damaged. Such a retransmitted packet is identified thanks to its sequence number. In |Product|, we do not consider packets with no payload, since duplicate ACKs are much more frequent, and not really characteristic of a network anomaly. There are several common sources of ``TCP`` retransmission: - Network congestion. If a router can't cope with all the traffic, its queue will grow until it is full and the router starts dropping incoming packets. If you reach a predefined QoS limit, the packets in excess will be dropped as well. Such drops will result in ``TCP`` retransmissions. A common way to identify this kind of problem is to take a glance at the traffic statistics. If you see a flat line at the maximum traffic allowed, then you have found the root cause of the retransmissions. If the traffic graph looks OK, you can check the load of the routers/switches you own (e.g., with the ``SNMP`` data). If the load is too high, you have found the culprit. - An overloaded server. Check the **Slow server** section. - A hardware failure. Maybe a piece of network equipment is simply down. It will obviously result in ``TCP`` retransmissions until a new route is computed or the issue is fixed. This type of retransmission should occur over short periods of time and give some quite big peaks of retransmissions, on very broad types of traffic on a specific subnet. If this happens often, it becomes important to find the faulty hardware by tracking down which subnets are concerned. - A packet header corruption. Network equipment routinely rewrites portions of packets (Ethernet source/destination, IP checksum, maybe the TOS field). Buggy firmware can result in corruption while rewriting protocol headers. In this case, the packet will probably be dropped along the network path. Even if it reaches the destination, the ``TCP/IP`` stack won't consider it a valid packet for the current ``TCP`` session, and the stack will wait for the correct packet.
It will result in a ``TCP`` retransmission anyway. This problem will likely occur continuously and on the same type of traffic. .. index:: ICMP **ICMP** ^^^^^^^^ What is ICMP? ~~~~~~~~~~~~~ ``ICMP`` stands for `Internet Control Message Protocol` and is another common protocol carried over IP. Most people reduce ``ICMP`` to the ping command, which is a good way to test whether a host can be reached through a network and how long it takes for a packet to make a round trip through the network. Obviously, ping and traceroute-like tools are very useful for network administrators, but there is much more to say about ``ICMP`` and the help it can provide for network administration and diagnosis. ``ICMP`` can be used to send more than twenty types of control messages. Some are purely informational messages; others are a way for IP devices or routers to indicate the occurrence of an error. **Error messages** ~~~~~~~~~~~~~~~~~~ Let’s describe the most typical ``ICMP`` error messages you can find on networks. **ICMP Network Unreachable** Let’s take the simplest example: one machine sitting on a LAN ``(192.168.0.7)`` has one default gateway ``(192.168.0.254)``, which is the router. It is trying to reach a server which does not sit on the LAN ``(10.1.0.250)`` and which cannot be reached because ``192.168.0.254`` does not know how to route this traffic. .. gofigure:: img/icmp_error.png **ICMP Host Unreachable** Let’s take the simplest example: one machine sitting on a LAN ``(10.1.2.23)`` has one default gateway ``(10.1.2.254/24)``, which is the router. It is trying to reach a server which does not sit on the LAN ``(192.168.1.15)``. The traffic flows and reaches the last router before the server ``(192.168.1.254/24)``; this router cannot reach ``192.168.1.15`` (because it is unplugged, down or does not exist). .. gofigure:: img/icmp_host_unreachable.png **ICMP Port Unreachable** Let’s take a second example: one machine sitting on a LAN ``(192.168.0.7)`` is trying to reach a server ``(192.168.0.254)``, which sits on the LAN, on port ``UDP 4000``, a port on which the server does not respond. .. gofigure:: img/icmp_port_unreachable.png Where is the challenge with ICMP? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ You may be tempted to say: if it is that simple, why do we need |Product| on top of any sniffer? All the information sits in the payload. But in every network, you will find some ``ICMP`` errors. They may be due to a user trying to connect to a bad destination, or trying to reach a server on the wrong port. The key is in having a global view of how many errors you have, normally and currently, and from where to where. The key to leveraging ``ICMP`` information is in having a relevant view of it and understanding what it means. How can ICMP help in network diagnostic and security monitoring? ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ By analysing ``ICMP`` errors, we can identify machines that try to connect to networks or hosts that are not reachable from the LAN, or machines that try to connect to actual servers but on service ports that are not open.
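As an illustration of this kind of analysis performed outside the product, on an exported capture (the file name is hypothetical), a Scapy sketch can count `ICMP Destination Unreachable` messages by code and by the destination the original packet was trying to reach:

.. code-block:: python

    # Count ICMP Destination Unreachable errors by code and by the destination that
    # the original packet was trying to reach (read from the quoted IP header).
    from collections import Counter
    from scapy.all import rdpcap, ICMP, IPerror

    codes = {0: "network unreachable", 1: "host unreachable", 3: "port unreachable"}
    count = Counter()
    for pkt in rdpcap("capture.pcap"):                  # hypothetical exported capture
        if ICMP in pkt and pkt[ICMP].type == 3:         # type 3 = Destination Unreachable
            label = codes.get(pkt[ICMP].code, f"code {pkt[ICMP].code}")
            original_dst = pkt[IPerror].dst if IPerror in pkt else "?"
            count[(label, original_dst)] += 1

    for (label, dst), n in count.most_common(10):
        print(f"{n:5d}  {label:20s} target: {dst}")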
Here are some examples of phenomena that can be identified that way: Misconfigured workstation ^^^^^^^^^^^^^^^^^^^^^^^^^ A workstation repeats a large number of failed attempts to connect to a limited number of servers: it may be that this machine does not belong to the company’s workstations (an external consultant on the network whose laptop is trying to reach common resources on their home network -- DNS, printers, etc.), or it may be the machine of someone coming from a remote site with its own configuration, or a machine that has simply been misconfigured. How would we see it? ~~~~~~~~~~~~~~~~~~~~ A large number of `ICMP Host Unreachable` errors coming from one or several routers to this machine or this group of machines. The ``ICMP`` information contained in the payload of each of these errors would probably show they are trying to reach a certain number of hosts for some services or applications. Migration legacy ^^^^^^^^^^^^^^^^ A certain number of machines keep sending DNS resolution requests to a ``DNS`` server that has been migrated (this could be true for any application available on the network). Their users certainly experience worse performance when trying to use these services. How would we see it? ~~~~~~~~~~~~~~~~~~~~ A large number of `ICMP Host Unreachable` errors coming from one or several routers to a group of machines. The ``ICMP`` information contained in the payload of each of these errors would probably show they are all trying to reach the previous IP address of a given server. Network device misconfiguration ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ A router does not have a route configured; some machines are trying to reach some resources, unsuccessfully. How would we see it? ~~~~~~~~~~~~~~~~~~~~ A large number of `ICMP Network Unreachable` errors coming from one router to many machines. The ``ICMP`` information contained in the payload of each of these errors would probably show they are all trying to reach the same network through the same router. Port scanning ^^^^^^^^^^^^^ A machine is trying to perform a network discovery. It is trying to connect to all servers around to see which ports are open. How would we see it? ~~~~~~~~~~~~~~~~~~~~ A large number of `ICMP Port Unreachable` errors coming from one or several routers corresponding to a single machine (the one which is scanning). Spyware / Worms ^^^^^^^^^^^^^^^ An infected machine is trying to propagate its spyware, virus or worm throughout the network; obviously, it has no prior knowledge of the network architecture. How would we see it? ~~~~~~~~~~~~~~~~~~~~ A large number of `ICMP Host Unreachable` errors coming from one or several routers corresponding to a limited number of hosts, trying to reach a large number of non-existing machines on a limited set of ports. **Server disconnected/reboot** A service on ``UDP`` (``DNS``, Radius...) is interrupted because the server program is temporarily stopped or the host machine is temporarily shut down. Many requests are then discarded. **How would we see it?** Many `ICMP Port Unreachable` messages (preceded by some `ICMP Host Unreachable` messages if the host itself was shut down) are emitted during a short period of time for this service host/port. .. index:: DNS .. _dns_response_time: **DNS Response Time** ^^^^^^^^^^^^^^^^^^^^^ **Background:** ~~~~~~~~~~~~~~~ The ``DNS`` (Domain Name System), which is defined in detail in :rfc:`1034` and :rfc:`1035`, is key to the good performance of ``TCP/IP`` networks. It works in a hierarchical way.
This means that if one of the DNS servers is misconfigured or compromised, the entire network that relies on it is also impacted. Although the ``DNS`` protocol is quite simple, it generates a significant number of issues: configuration issues, which affect the performance of the network, as well as security issues, which jeopardize the network's integrity. The purpose of this section is to cover the main configuration issues you may encounter with DNS when it comes to network performance. **Hypothesis:** You noticed a general slowdown for a specific host, zone, or the entire LAN. You didn't find the issue with the previous methods. Maybe this problem has nothing to do with the business applications or your network equipment. **Diagnosis:** The ``DNS`` server(s) need to have a very high availability to resolve all the names into IP addresses that are necessary for the applications on the network to function. An overloaded ``DNS`` server will take some time to respond to a name request and will slow down all applications that have no ``DNS`` data in their cache. An analysis of the ``DNS`` flows on the network will reveal malfunctions such as: **Latency issues** If we observe that the mean time between a client request and its response is significantly higher than the average (on a LAN it should remain close to ``1 ms``), we may face three kinds of issues: - the client is not requesting the correct ``DNS`` server (``DHCP`` misconfiguration, for example). You can check this out in the interface by looking at the **Server IP** fields; - the ``DNS`` server has an issue with regard to the caching of ``DNS`` names. The cache system makes it possible to resolve a name without requesting, from the ``DNS`` server which has authority for the DNS zone, the IP address corresponding to the name. Hence, if the response time is high, first, the application will be slow from the user’s point of view and, secondly, an unnecessary amount of bandwidth will be consumed. This bandwidth will be wasted both on the LAN and on the Internet link (if we make the hypothesis that the authority server sits on the Internet). If we consider the case of a fairly large organization, the bandwidth used by the DNS traffic will not be negligible and will represent an additional load; - the ``DNS`` server may have system issues. If the server is overloaded, it cannot handle all the requests and will delay (or drop) some of them, which leads to a general slowdown of the network performance. You can easily cast a glance at these issues: go to the **Analysis -> DNS Messages** menu and fill in the form with appropriate values (especially the **Requester Zone**) to verify that the requests are correctly answered, within an acceptable time. .. gofigure:: img/screenshots/spv/version_unk/10000201000003A2000001F48D406F23.png DNS Response Time for a specific requester zone (here, VLAN_Sales) **Traffic issue** If we identify the top hosts making ``DNS`` requests, it will be possible to pinpoint misconfigured clients that are not keeping the DNS server responses in a local cache. This approach makes it possible to distinguish between an issue coming from the user’s workstation and one coming from the general function of the network. Please note that hosts making a very high volume of ``DNS`` requests may correspond to malicious behaviour. For example, some malware tries to establish connections to the Internet by resolving domain names and, sometimes, the DNS protocol is used in covert channels to exfiltrate information.
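To complement the interface views, DNS latency can also be cross-checked offline from an exported capture; a minimal Scapy sketch (the file name is hypothetical) matches responses to queries by client address and transaction id:

.. code-block:: python

    # Measure DNS response times from a capture by matching responses to queries
    # on (client address, transaction id). Hypothetical file name; requires Scapy.
    from scapy.all import rdpcap, IP, UDP, DNS

    pending, response_times = {}, []
    for pkt in rdpcap("dns.pcap"):
        if IP not in pkt or UDP not in pkt or DNS not in pkt:
            continue
        dns = pkt[DNS]
        if dns.qr == 0:                                     # query
            pending[(pkt[IP].src, dns.id)] = float(pkt.time)
        else:                                               # response
            sent = pending.pop((pkt[IP].dst, dns.id), None)
            if sent is not None:
                response_times.append((float(pkt.time) - sent) * 1000.0)

    if response_times:
        print(f"{len(response_times)} answered queries, "
              f"average {sum(response_times) / len(response_times):.1f} ms, "
              f"worst {max(response_times):.1f} ms")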
**DNS errors issue** We can also look for the top hosts receiving the most ``DNS`` error messages (non-existing hosts, etc.). This will also shine a light on misconfigured stations, generating unnecessary traffic and lowering the overall network performance. **DNS Internal misconfiguration** To detect this, we need to identify the ``AXFR`` and ``IXFR`` transactions with the authority server. If these updates occur too often (and therefore generate unnecessary traffic), we can conclude that there is an issue. If the bandwidth used is too large, it means that our DNS server requests a full zone transfer ``(AXFR)`` when an incremental transfer ``(IXFR)`` would have been more adequate. If this is the case, then the network administrator can take some easy steps to improve the network's performance.
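As a final hypothetical illustration (file name assumed, Scapy required), the hosts receiving the most DNS error responses can be extracted from a capture in a few lines:

.. code-block:: python

    # Identify the hosts receiving the most DNS error responses (rcode != 0,
    # e.g. 3 = NXDOMAIN) in a capture. UDP responses only; illustrative sketch.
    from collections import Counter
    from scapy.all import rdpcap, IP, UDP, DNS

    errors = Counter()
    for pkt in rdpcap("dns.pcap"):
        if IP in pkt and UDP in pkt and DNS in pkt and pkt[DNS].qr == 1 and pkt[DNS].rcode != 0:
            errors[pkt[IP].dst] += 1        # the client receiving the error response

    for host, n in errors.most_common(10):
        print(f"{host}: {n} DNS errors")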