User Guide 19.11 documentation
Note
With version 2.8, the in/out notion has been fully replaced by Server/Client. In our graphs, any RTT or RR labeled in/out should therefore be read as RTT or RR Server/Client, according to the following rules:
- RTT in stands for RTT Server.
- RTT out stands for RTT Client.
- RR in stands for RR Server.
- RR out stands for RR Client.
To customize this view for your own needs, go to the Configuration menu and choose the application you want to define as a ‘business’ one (see the Business Critical Applications).
The purpose of the Business Critical Application Dashboard (BCA) is to gather in a single view the elements that are most critical for your business. Vital information is presented to the people in charge in a completely configurable and dynamic dashboard, in order to radically improve early diagnostics and impact analysis. The monitored metric is the EURT (End User Response Time). Thus, this dashboard reflects the quality of experience of the users for the selected critical applications.
Thus, from each Business Critical Application, with a single click on the appropriate icon, you can:
Note
If you click on the icons that are next to the name of the application at the beginning of each line, the quick links will take into account the complete period of time currently displayed. If you click on the icons associated with a specific period of time, the quick links will use this period when redirecting you to a detailed screen.
To customize this view for your own needs, go to the Configuration menu and choose the entry labeled Business Critical Network (see the Business Critical Networks).
The Business Critical Network Dashboard (BCN) is aimed at presenting in a single screen the status of your organization’s most critical network “links”. You can customize the business critical network dashboard to view the status of the most strategic links corresponding to your business.
From the Business Critical Network Dashboard, you can drill down from the general view to more detailed information for analysis and problem resolution:
By hovering with the mouse over a cell representing a point of time, you can view the threshold values for each direction (indicating status OK, Warning or Alert as well as the value for each direction). You can also access the bandwidth graphs and the flow details table for each link. If you click on the icons that are next to the name of the link at the beginning of a line, the quick links will take into account the complete period of time currently displayed. If you click on the icons associated with a specific period of time, the quick links will use this period when redirecting you to a detailed screen. The auto-refresh feature of the BCN dashboard ensures that you always see up-to-date information: the dashboard is refreshed automatically according to the data aggregation level (see aggregation period). For example, if the “Aggregate level” is “1 minute”, the BCN will be updated every minute; if the “Aggregate level” is “1 hour”, it will be updated every hour.
A specific reporting for Voice over IP traffic is provided. The aim of this module is to show the volume and quality of service associated with VoIP flows.
These VoIP protocols are supported:
- SIP + RTCP + RTP
- MGCP + RTCP + RTP
- SKINNY + RTCP + RTP
For more information, please consult the corresponding RFCs:
Voice over IP relies on three protocols to operate over IP networks:
- Signaling: SIP (Session Initiation Protocol) and MGCP (Media Gateway Control Protocol). Please note that SIP may or may not follow the same route as the RTP traffic, while MGCP follows the same route as RTP (Real-time Transport Protocol).
- Media: RTP is the only media protocol supported by SkyLIGHT PVX. It usually runs over UDP.
- Control: RTCP (RTP Control Protocol) is the only control protocol supported.
Mean Opinion Score (MOS) is a numeric indication of the perceived quality of service of VoIP. It ranges from 1 to 5, with 1 corresponding to the lowest quality and 5 to the highest (close to human voice).
MOS Rating | Meaning |
---|---|
5 | Excellent |
4 | Good |
3 | Fair |
2 | Poor |
1 | Bad |
Please note that in a real network, a MOS of over 4.4 is unachievable. A low MOS will translate into an echo and degraded signal. MOS is, in principle, the result of a series of subjective tests; in the context of network analysis, MOS will be estimated using a formula that integrates 3 factors:
- Latency (RTT; recommended value: <100ms)
- Jitter (recommended value: <30ms)
- Packet loss (recommended value: <5%)
To provide MOS values for VoIP traffic, it is necessary to capture the three flows: signaling (SIP or MGCP), media (RTP) and control protocol (RTCP). If one of these flows is absent from the traffic capture brought to the listening interface(s), the MOS value will not be calculated. Other quality-of-service metrics will remain available.
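To illustrate how latency, jitter and packet loss can be combined into a MOS estimate, here is a minimal Python sketch. It is not the formula used by SkyLIGHT PVX, which this guide does not document. The R-factor step is a commonly used simplification (the weights are assumptions), and only the final R-to-MOS mapping comes from ITU-T G.107. The function names are hypothetical.

```python
def r_factor(rtt_ms: float, jitter_ms: float, loss_pct: float) -> float:
    """Simplified transmission rating (assumed heuristic, not the PVX formula)."""
    # Approximate one-way delay as half the RTT, weight jitter twice,
    # and add 10 ms as a rough allowance for codec processing.
    effective_latency = rtt_ms / 2.0 + 2.0 * jitter_ms + 10.0
    if effective_latency < 160.0:
        r = 93.2 - effective_latency / 40.0
    else:
        r = 93.2 - (effective_latency - 120.0) / 10.0
    # Each percent of packet loss costs 2.5 points (assumed weight).
    r -= 2.5 * loss_pct
    return max(0.0, min(100.0, r))


def mos_from_r(r: float) -> float:
    """ITU-T G.107 mapping from an R-factor to a MOS value."""
    if r <= 0.0:
        return 1.0
    if r >= 100.0:
        return 4.5
    return 1.0 + 0.035 * r + 7e-6 * r * (r - 60.0) * (100.0 - r)


def estimate_mos(rtt_ms: float, jitter_ms: float, loss_pct: float) -> float:
    return mos_from_r(r_factor(rtt_ms, jitter_ms, loss_pct))
```

With the default maximum rating of 93.2, the mapping yields a MOS of about 4.41, which is consistent with the statement that a MOS of over 4.4 is unachievable on a real network.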
Protocol | Metrics obtained by analysis of the protocol |
---|---|
SIP/MGCP |
|
RTP |
|
RTCP |
|
Note
RTT and MOS values depend to some extent on the quality of the measurement provided by RTCP. Please note that MOS is not very sensitive to “normal” latency values. When referring to voice or media, we refer to the RTP traffic, which may correspond to different things (human voice, pre-recorded message, ringback tone, busy line tone, etc.). The VoIP module discards the jitter and packet loss data present in the RTCP flow and replaces them with equivalent values computed internally. This is so for several reasons:
- The RTCP stream is more often missing than present, probably because it is firewalled and of little use to the VoIP client software.
For the VoIP module to remain passive, there is no other option than to compute these values for every RTP stream, generating jitter and packet loss values which will be a good estimate of the real jitter and loss experienced by both users. This is how, even in the absence of an RTCP stream, we can display a jitter and packet loss count (but no RTT and, thus, no MOS).
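As an illustration of how jitter and packet loss can be derived passively from an RTP stream alone, the following Python sketch applies the interarrival jitter estimator defined in RFC 3550 and counts gaps in the RTP sequence numbers. It is only a sketch: the guide does not describe the internal computation of the VoIP module, and the `RtpPacket` type, the function names and the 8000 Hz clock rate are assumptions.

```python
from dataclasses import dataclass


@dataclass
class RtpPacket:
    seq: int        # 16-bit RTP sequence number
    rtp_ts: int     # RTP timestamp, in clock units
    arrival: float  # capture time, in seconds


def interarrival_jitter(packets, clock_rate=8000):
    """RFC 3550 running jitter estimate, returned in seconds."""
    jitter = 0.0
    prev = None
    for pkt in packets:
        if prev is not None:
            # Difference between arrival spacing and sender spacing, in clock units.
            d = (pkt.arrival - prev.arrival) * clock_rate - (pkt.rtp_ts - prev.rtp_ts)
            jitter += (abs(d) - jitter) / 16.0
        prev = pkt
    return jitter / clock_rate


def packet_loss_ratio(packets):
    """Share of expected packets never seen (assumes at most one sequence wrap)."""
    if not packets:
        return 0.0
    seqs = [pkt.seq for pkt in packets]
    expected = ((seqs[-1] - seqs[0]) % 65536) + 1
    received = len(set(seqs))  # duplicates are counted once
    return max(0, expected - received) / expected
```

For example, a stream in which sequence number 3 never arrives out of packets 0 to 5 gives a loss ratio of 1/6, without any RTCP report being needed.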
This view shows the evolution of the Mean Opinion Score through time. A second graph shows the evolution of the number of calls to help you evaluate how many were impacted by a MOS degradation.
This view shows the evolution through time of the jitter and the packet loss. This view can help you understand MOS variations and see which metric is impacting the MOS.
These views show charts for:
The last two views show each call individually with some usage metrics for VoIP Conversations. The VoIP Flow Details view is the same table with the addition of performance metrics.
Note
The “caller” value corresponds to the metric for the RTP/RTCP traffic from the caller to the callee, while the “callee” value corresponds to the metric for the RTP/RTCP traffic from the callee to the caller.
A dashboard is a single-screen report that displays relevant information to understand how the application is doing. They are present in APS from version 1.7.
Note
These dashboards are unavailable in SkyLIGHT PVX NPS.
It is extremely useful:
The Components section discusses the three components that display this key information.
In a single report, you have enough to explain to a business user or a manager how the application performance evolved through time, which servers were doing worse and which zones were impacted. On top of the EURT, all this is based on three synthetic metrics that are easy to explain to non-technical people:
- RTT: network performance;
- SRT: server performance;
- DTT: delivery of the application response through the network.
For network administrators, this report brings together all of the information about a business application required to:
With one click, you can conclude if there was a slowdown or not, what was the origin of the degradation, which client zones were impacted. With an additional click, you can view whether all clients in a zone were impacted or if the server response time degradation was caused by another application hosted on the same server machine.
This EURT graph shows:
- […] 10 applicative transactions in the same way as for 10,000).
The breakdown of EURT into three intelligible components (RTT for network latency, SRT for Server Response Time and DTT for Data Transfer Time) lets you know at first glance the possible origin of the performance degradation. For example, in the screenshot above, we can observe an increase in the SRT; the network and the time required to send the response to the client have not increased. Either the server overall responded more slowly or some specific queries required a much longer processing time (you can determine this by drilling down to that specific point of time).
There is a comparison of the EURT on each server that runs this application. In this case, it is obvious that Atlantis tends to respond much more slowly than Brax. By clicking on it and looking at a second dashboard called Server/Application Dashboard, we shall be able to determine whether this is permanent or occasional and whether this is due to the load on this application or on another one hosted on the same server.
What we can see here is a breakdown of the EURT for this application between client zones; with one glance, you can determine which zone was impacted by the degradation as well as the different levels of performance depending on the users’ location. In the screenshot above, we can see that mainly one zone was impacted by the SRT degradation. There are also significant differences in performance between zones due to differences in RTT values (network latency).
SkyLIGHT PVX APS offers two additional dashboards:
You can access this dashboard either through the menu or by clicking on a specific client zone in the Application Dashboard. This dashboard contains three bits of information:
- EURT graph through time for this client zone and this application.
The breakdown by client lets us know whether the whole zone was impacted or just some individual users, and on which component of the EURT (network latency, server response time or data transfer time), for what number of transactions and what amount of traffic.
You can access this dashboard either through the menu or by clicking on a specific server in the Application Dashboard. This dashboard contains three bits of information:
- EURT graph through time for this server and this application
Dashboards have been developed so that a single click provides more detailed information on the object you are interested in:
- By clicking on the EURT graph in any of these three dashboards, you focus on a shorter period of time. For example, on an SRT peak, depending on the aggregation level, you either reach a lower aggregation level for a shorter period or the corresponding performance conversations (see Data Aggregation). At the same time, you will get the server and zone breakdown for that more specific period of time.
The TCP statistics can be displayed in tables by selecting the appropriate column theme. They can reveal dysfunctions or unusual events.
For each TCP conversation, the following fields are available:
By sorting on the RD or duplicate ACK fields, one can quickly check the worst conversations in terms of TCP performance. Also, the number of reset packets is usually noteworthy. One can then jump to the IP summary page of either the client or the server (depending on who is to blame) to gather further data on this event.
For each TCP conversation, the following fields are available:
Once you have identified the origin of an issue, you may want to analyze it further by looking at the packets themselves. You have three ways to do this:
- Manual packet capture through Pulsar’s tcpdump command
- Automated Packet Capture (AutoPCAP)
- Triggered Packet Capture from the data of a result row.
By connecting through Pulsar, you can start a manual capture of any traffic viewed on the interface of your device. To do so, you need to:
- Run the command tcpdump_tofile -i <interface> host <host_ip>.
- Press Control+C to stop the trace.
Use the tcpdump command instead of tcpdump_tofile to display the results of real time packet capture.
Note
- […] help tcpdump_tofile.
- […] tcpdump command’s online manual.
- […] -w.
.SkyLIGHT PVX can capture packets automatically, in case abnormal values are observed on critical servers. These packets are presented for later analysis as PCAP files, which can be downloaded through the web graphical interface at the conversation level.
These files are presented in the following views:
- Conversations
- DNS messages
- VoIP details
In each of these views, a column at the right end of the table indicates PCAP; a small icon indicates whether packets have been captured for a given conversation or not. If the PCAP file is available, you can download it by clicking on the icon. Once the file has been downloaded, you can view the packets using any protocol decoder capable of reading PCAP files.
For instance, if you are using Wireshark to decode the packets, you can directly view the packets.
To view the query and the beginning of the response, you can use the Follow TCP Stream feature (in the Analyze menu).
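If you prefer to inspect a downloaded file with a script rather than a graphical decoder, the classic PCAP layout is simple enough to read with the Python standard library. The sketch below is a minimal example under stated assumptions: it handles only the classic microsecond-resolution PCAP format (not pcapng), and the function name is hypothetical.

```python
import struct

GLOBAL_HEADER_LEN = 24  # magic, version, timezone, sigfigs, snaplen, link type
RECORD_HEADER_LEN = 16  # ts_sec, ts_usec, captured length, original length


def read_pcap_records(data: bytes):
    """Yield (timestamp_in_seconds, captured_bytes) for each packet record."""
    magic = struct.unpack("<I", data[:4])[0]
    if magic == 0xA1B2C3D4:
        endian = "<"  # file written in little-endian byte order
    elif magic == 0xD4C3B2A1:
        endian = ">"  # file written in big-endian byte order
    else:
        raise ValueError("not a classic microsecond-resolution PCAP file")
    offset = GLOBAL_HEADER_LEN
    while offset + RECORD_HEADER_LEN <= len(data):
        ts_sec, ts_usec, incl_len, _orig_len = struct.unpack(
            endian + "IIII", data[offset:offset + RECORD_HEADER_LEN]
        )
        offset += RECORD_HEADER_LEN
        yield ts_sec + ts_usec / 1e6, data[offset:offset + incl_len]
        offset += incl_len
```

Typical usage is to read the whole file with `open(path, "rb").read()` and iterate over the generator, for example to count packets or to measure the time span covered by the sample.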
Packets are saved by SkyLIGHT PVX as soon as the conversation they belong to matches a certain number of conditions:
- If Capture HTTP is checked in a Zone, and an IP address matches the zone subnet (either as client or server).
- If Capture HTTP is checked in an Application, and a port or an IP address matches the application (either as client or server).
- And one of the following metrics is considered as out of the norm:
  - Server Response Time (SRT) for TCP flows
  - Retransmission Rate
  - DNS Response Time
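The capture conditions above combine a scope test (zone or application match) with a metric test. The Python sketch below shows the zone variant of that logic. It is illustrative only: the guide does not say how “out of the norm” is determined, so the fixed thresholds, the `Conversation` type and the function names are all assumptions.

```python
import ipaddress
from dataclasses import dataclass

# Hypothetical thresholds; the real product decides what is "out of the norm".
SRT_LIMIT_MS = 500.0
RR_LIMIT = 0.05
DNS_RT_LIMIT_MS = 200.0


@dataclass
class Conversation:
    client_ip: str
    server_ip: str
    is_tcp: bool
    srt_ms: float = 0.0
    retransmission_rate: float = 0.0
    dns_rt_ms: float = 0.0


def matches_capture_zone(conv, zone_subnets):
    """True if the client or the server address belongs to one of the subnets."""
    networks = [ipaddress.ip_network(subnet) for subnet in zone_subnets]
    endpoints = [ipaddress.ip_address(conv.client_ip), ipaddress.ip_address(conv.server_ip)]
    return any(ip in net for ip in endpoints for net in networks)


def metric_out_of_norm(conv):
    """True if at least one of the three watched metrics exceeds its limit."""
    return (
        (conv.is_tcp and conv.srt_ms > SRT_LIMIT_MS)
        or conv.retransmission_rate > RR_LIMIT
        or conv.dns_rt_ms > DNS_RT_LIMIT_MS
    )


def should_capture(conv, zone_subnets):
    return matches_capture_zone(conv, zone_subnets) and metric_out_of_norm(conv)
```

Both tests must pass: a slow conversation outside every capture-enabled zone is not saved, and neither is a healthy conversation inside one.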
Note
Why does PVX not rely directly on the Zone or the Application of a flow to capture PCAP files?
We want to capture the flow for troubleshooting from the very first packet. With only one packet, PVX cannot yet determine which Zone or Application the flow belongs to.
Note
PCAP files are a sample of the conversation. If you request on a one-hour interval and get a PCAP file, the PCAP will not contain one hour’s worth of data but only those that match the above conditions.
The Automatic Packet Capture feature works under certain conditions to ensure the proper execution of other services provided by SkyLIGHT PVX. These necessary limitations include:
- The retention of PCAP files is limited by the disk space allocated for captures; in the current version, this space is limited to 10GB by default (for both manual and automatic captures). When all 10GB are used, no new PCAP file is saved. You can change this value on the Sniffer Configuration page.
- The maximum retention time for automatic captures is set to 48 hours; after this delay, PCAP files will be deleted. This cannot be modified.
- The sniffer component of SkyLIGHT PVX is set to write a maximum of 5,000 PCAP files simultaneously; if more than 5,000 conversations are needed, change the parameter on the Sniffer Configuration page. Otherwise, some conversations will not be recorded at packet level.
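The three limits above can be summarized as a small set of checks. The Python sketch below restates them under stated assumptions: the constant and function names are hypothetical, “10GB” is interpreted as 10 × 10^9 bytes, and the actual sniffer may enforce these limits differently.

```python
from datetime import datetime, timedelta

CAPTURE_QUOTA_BYTES = 10 * 10**9           # default disk quota (assumed decimal GB)
AUTO_CAPTURE_MAX_AGE = timedelta(hours=48)  # fixed retention for automatic captures
MAX_SIMULTANEOUS_PCAPS = 5000               # default number of files written at once


def can_open_new_pcap(used_bytes: int, open_pcaps: int) -> bool:
    """A new file is only started if both the quota and the file limit allow it."""
    return used_bytes < CAPTURE_QUOTA_BYTES and open_pcaps < MAX_SIMULTANEOUS_PCAPS


def is_expired(created_at: datetime, now: datetime) -> bool:
    """Automatic captures older than the retention time are deleted."""
    return now - created_at > AUTO_CAPTURE_MAX_AGE
```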
Please note that the threshold values and voluntary limitations in newer versions will be reviewed in light of our experience and the customer feedback that we receive. If you need an exhaustive trace of a given set of conversations, you can also use the manual capture feature available through Pulsar.
Triggered PCAPs are generated from the user interface, either by the result rows, or by a configuration page. In both cases, the administrator rights are required.
The setup is very easy because the capture filters are preset with the wanted flow characteristics, but the main advantage of triggered PCAP is that it is possible to set a date and time to start the capture.
By default, only the local capture is selected to trigger the capture, but all known captures are available. If multiple captures are selected for a capture, then one PCAP will be created for each one.
All added triggered PCAPs are referenced in the dedicated page in the configuration menu. It is possible to delete and download them, regardless of the capture where they were captured.
If a capture was done on multiple captures at a time, then they will have the same name and same filters. They will be grouped together in the management interface.
The objective of this section is to help our customers make the best use of the performance reports provided by their appliance. You will find a brief overview of how application performance issues can be solved with PVX. This first section focuses on synthetic metrics to produce a measure of the quality of user experience (QoS - End User Response Time) and give you a simple explanatory framework to understand the cause of application slowdowns (Round Trip Time, Server Response Time and Data Transfer Time).
Note
Some metrics and views described below are only available in SkyLIGHT PVX APS.
Before you start analyzing performance reports, there are certain things to keep in mind.
Performance metrics should not be considered as absolute values, but in comparison with different time intervals, servers and user groups. Performance metrics represent time intervals. Although most of them correspond to the measurement of a concrete phenomenon, it is almost impossible to provide a scale of what is a good or a bad response time without experience of the impact it has on users. For example, indicating that the Network Round Trip Time from a site A to a site B is 200ms does not tell you whether that value is acceptable or not. In the same way, a Server Response Time (SRT) of 100ms may be very “bad” for an application A when the same value would be excellent for an application B. As a consequence, it is important to consider performance metrics as relative values. One of the keys to a good interpretation of performance metrics is to systematically compare the performance metric value with:
Mixing up performance metrics for several applications does not make sense. When looking at application performance metrics, you should take care to isolate applications for analysis. As a consequence, the metrics which very much depend on the application’s specific behaviour should not be considered together. This is true for metrics such as EURT (End User Response Time), SRT (Server Response Time) and DTT (Data Transfer Time).
RTT measurements can marginally be impacted by the behaviour of the operating system. Network Round Trip Times for TCP are based on the TCP acknowledgment mechanism. This means that, although RTT is generally a good measurement of round trip latency, if the operating system of one of the parties is so overloaded that the acknowledgment process becomes slower, RTT values will be impacted. RTT Server would be impacted on the server side and RTT Client on the client side. RTT should then be analyzed in parallel with CT (Connection Time) because the treatment of a new session by the IP stack has a higher priority.
Some values are averaged measures. For each conversation, two kinds of values are reported:
- RTT, SRT, DTT and the like, which are average values over all samples aggregated for this conversation.
EURT stands for End User Response Time.
This metric is an aggregate of various other measures meant to give an idea of the perceived overall end user experience. It is taken as the sum of RTT, SRT and DTT.
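Since EURT is defined as the sum of RTT, SRT and DTT, the share of each component can be computed directly. The short Python sketch below does this and reports the dominant component; the function name and the returned structure are hypothetical and only serve to illustrate the arithmetic.

```python
def eurt_breakdown(rtt_ms: float, srt_ms: float, dtt_ms: float) -> dict:
    """Return the EURT, the share of each component and the dominant one."""
    components = {"RTT": rtt_ms, "SRT": srt_ms, "DTT": dtt_ms}
    eurt = sum(components.values())
    if eurt > 0:
        shares = {name: value / eurt for name, value in components.items()}
    else:
        shares = {name: 0.0 for name in components}
    dominant = max(components, key=components.get)
    return {"eurt_ms": eurt, "shares": shares, "dominant": dominant}
```

For instance, an RTT of 20 ms, an SRT of 150 ms and a DTT of 30 ms give an EURT of 200 ms, of which 75% is server response time.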
EURT has no meaningful physical counterpart. Only its evolution makes sense and allows the system administrator to check at a glance whether a network zone is behaving as usual or not. Notice that expected correct values for both SRT and DTT depend on the protocol at hand. As a consequence, you should not try to compare the EURTs of two different applications.
RTT stands for Round Trip Time.
RTT gives an approximation of the time required for a packet to reach its destination, and can be further decomposed into an RTT Server (delay between a data packet sent by the client and its ACK from the server) and an RTT Client (the other way around). As a typical IP implementation will delay the acknowledgment of incoming data, additional tricks are exploited in order to rule out these software biases:
- Use SYN/FIN acknowledgment and some exceptional conditions such as TCP resets, which suffer no such delays, to estimate a realistic upper bound.
- […] RTT values.
- […] RTT Server/Client by SRT/CRT if the RTT sample set looks suspicious.
RTT refers to the bare speed of the physical layer. It is unaffected by packet retransmissions, packet loss or similar occurrences. RTT may be affected by (from the most common to the rarest):
These troubles should be further investigated by comparing with other client and/or server zones in order to locate the misbehaving equipment. Notice that a degradation of RTT will almost invariably impact other metrics as well.
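The definition of RTT Server (delay between a client data packet and its ACK from the server) can be sketched in Python as follows. This is a simplified illustration, not the probe's algorithm: the `Segment` type and function name are hypothetical, sequence number wraparound and delayed or cumulative ACKs are ignored, and samples for retransmitted segments are discarded so that retransmissions do not distort the result.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    time: float       # capture timestamp, in seconds
    sender: str       # "client" or "server"
    seq: int          # sequence number of the first payload byte
    payload_len: int  # number of payload bytes
    ack: int          # acknowledgment number


def rtt_server_samples(segments):
    """Delays between client data segments and the matching server ACKs."""
    pending = {}           # expected ACK number -> time of first transmission
    retransmitted = set()  # expected ACK numbers seen more than once
    samples = []
    for seg in segments:
        if seg.sender == "client" and seg.payload_len > 0:
            expected_ack = seg.seq + seg.payload_len
            if expected_ack in pending:
                retransmitted.add(expected_ack)  # ambiguous sample, skip it later
            else:
                pending[expected_ack] = seg.time
        elif seg.sender == "server" and seg.ack in pending:
            sent_at = pending.pop(seg.ack)
            if seg.ack not in retransmitted:
                samples.append(seg.time - sent_at)
    return samples
```

Note that the values are measured at the point of capture, so they reflect the path between the probe and the server rather than the full client-to-server path.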
SRT stands for Server Response Time.
SRT gives an estimation of the elapsed time between the last packet of an applicative request and the first packet of the server’s response.
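The SRT definition above translates into a simple computation over the payload-carrying packets of one request/response exchange. The Python sketch below assumes a hypothetical input format, a time-ordered list of `(time, sender, payload_len)` tuples, and ignores pipelined requests.

```python
def server_response_time(packets):
    """Time between the last request packet and the first response packet.

    `packets` is a time-ordered list of (time, sender, payload_len) tuples,
    where sender is "client" or "server". Returns None if no response is seen.
    """
    last_request_time = None
    for time, sender, payload_len in packets:
        if payload_len == 0:
            continue  # pure ACKs carry no application data
        if sender == "client":
            last_request_time = time
        elif sender == "server" and last_request_time is not None:
            return time - last_request_time
    return None
```

Pure acknowledgments from the server are skipped, so the result reflects the time the application took to start answering rather than the TCP acknowledgment delay.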
SRT represents the processing time of the server, at the application layer, for a given request. SRT may be affected by (from the most common to the rarest):
- […] SQL command can let the server process longer);
- SRT can be marginally affected by the increase of network latency between the point of capture and the server (parallel increase of the RTT Server value).
To pinpoint the root cause of the slowdown, we want to compare the SRT for a given server/application with other applications on the very same server. If there is a blatant difference, the application is guilty. Otherwise, we want to compare it with other servers in the same zone, then with different zones.
DTT stands for Data Transfer Time.
DTT server is defined as the time between the first data packet of the response (with ACK flag and a non-nil payload) from the server and the last packet considered as part of the same response (if the packet has the same acknowledgement number); FIN, RST packets from server or client will also be considered as closing the sequence. A Timeout will cancel a DTT. Note that if the answer is small enough to be contained in only one packet, the DTT will be '0'.
DTT client is the same metric in the other direction.
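The DTT server definition can be illustrated with a short Python sketch that groups response packets by acknowledgment number. It is a simplification under stated assumptions: the input format (time-ordered `(time, ack_number, payload_len)` tuples for packets sent by the server) and the function name are hypothetical, and the FIN, RST and timeout rules mentioned above are left out.

```python
def dtt_server(server_packets):
    """Time between the first and last data packets of one server response.

    Packets belong to the same response while they carry the same
    acknowledgment number. Returns None if no data packet is found.
    """
    first_time = None
    last_time = None
    response_ack = None
    for time, ack_number, payload_len in server_packets:
        if payload_len == 0:
            continue  # only packets with a payload are part of the response
        if first_time is None:
            first_time = last_time = time
            response_ack = ack_number
        elif ack_number == response_ack:
            last_time = time
        else:
            break  # a new acknowledgment number starts another response
    if first_time is None:
        return None
    return last_time - first_time
```

A response that fits in a single packet yields 0, which matches the remark above.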
DTT (sum of both server and client DTT) is the time the user is going to wait for the response to circulate on the network from the server to the client. It does not depend on the Server Response Time: a DTT might be short for a long SRT, or a DTT might be very large while the SRT is very short, because the request is easy to handle yet the response is very large. DTT depends on (from the largest impact to the smallest):
DTT may vary (from most common to the rarest):
Hypothesis:
One or several end users complain about a slow access to all applications (both in and out the LAN).
Diagnosis:
This section lists the usual information to gather in order to diagnose the issue:
Is the application really slower for this site? You can get this information from the Application Performance Dashboard:
Does the slowdown occur for a specific application? If so, check Slow application.
Does the slowdown occur for a specific server? If so, check Slow server.
Did you upgrade the clients workstations recently? If so, it’s a specific system issue. You may ask the System Administrator for more details.
Did you upgrade your network equipment? If so, the router/switch configuration is probably involved.
We may do an in-depth inspection of the PV dashboards. Check the Monitoring -> Performance Over Time Chart
Do the Retransmission Rate and Retransmission Delay vary? If so, we might face a congestion issue. Take a look at the router’s load, etc.
The general slowdown for a client zone may also be the consequence of a problem with a crucial service: the DNS. Check out DNS Response Time.
Look at the Monitoring -> Bandwidth Chart to inspect the bandwidth variation, and the number of TCP/UDP flows as well.
They might have exceeded a QoS threshold, such that all new application requests are blocked. A hint would be an increasing number of TCP RST packets. To be sure, you may dive into the Analysis -> TCP Errors menu.
One or several end users complain about a slow access to a specific application – a fileserver.
Zones have been configured to reflect the customer’s network topology.
The application Samba_CIFS has been identified. The traffic to the fileserver is mirrored to one of the listening interfaces of the probe. Where to start: a global view of the application performance!
1st example
Display the Application Dashboard for a relevant period of time. We can easily observe a peak in SRT from 6 to 18:15. From the breakdown by zone, we can easily conclude that only one zone has been impacted.
By clicking on that zone, we can see this client zone’s application dashboard:
From this, you can conclude that only one client (= user) was impacted. This issue was definitely due to a slow response of the server; it may be due to an application issue or a request which is specifically hard to respond to.
2nd example
Inspect the Application Dashboard for a relevant period in the past (48 hours for example).
This dashboard shows in the upper part the evolution of the End User Response Time (EURT) through time for this fileserver.
- […] RTT (Round Trip Time, an indicator of network latency) and not to the Server Response Time (SRT) or the Data Transfer Time (DTT).
From this graph, we can conclude that the server and the application are likely not related to the slowdown. By looking at the two bar charts which show the breakdown by server and by client zone, respectively, we can draw the following conclusions:
- […] (192.168.20.9).
- EURT varies in large proportion between client zones, mainly because of RTT.
To confirm our first conclusions, click on the peak of EURT in the upper graph. We can narrow our observation period to understand better what happened at that point of time.
This confirms the following conclusions: RTT went up for the VLAN_Sales (only).
We now know that only VLAN_Sales was impacted by this slowdown, due to a longer network RTT. We, therefore, need to understand whether this was general (i.e., impacted all clients in the zone) or isolated to certain clients.
To achieve this, we can simply display the Performance conversations for the application Samba_CIFS for the zone VLAN_Sales.
From this screen, we can draw the following conclusion:
Only the clients 192.168.20.205 and 192.168.20.212 seem to be impacted. The other clients have very short RTT values.
To confirm this, we need to check that these two hosts are the only ones impacted and check whether they were impacted only when accessing the Fileserver. To do so, we look at the Performance conversations between the VLAN_Sales and the Private zone. From this, we can draw the following conclusions:
- Not only 192.168.20.212 and 192.168.20.205, but also 192.168.20.220 and 192.168.20.50 were impacted.
- SMTP, HTTP and the Web Intranet SecurActive were also impacted.
Actions to be taken after that analysis:
Alternative scenarios:
Hypothesis:
Users complain about having to try several times to connect to a web-based application named “Salesforce”. The administrator suspects the application server hosting “Salesforce” is slow.
How to analyze the problem:
First, check to see if all applications on the application server hosting “Salesforce” are slow or if it is just that particular application. If all applications are slow, then the application server may, in fact, be a slow server. If just the one web-based application “Salesforce” is slow, while the other applications (CRM) are responding quickly, the problem may be the application.
To begin diagnosis, go to “Monitoring” -> “Clt/Srv Table”. Select the application server from the drop-down box labeled “Server Zone” and click “Search”.
- If the SRT values are high for both “Salesforce” and “CRM”, the issue is related to the server, not to the applications.
Here, we see that there is a high Retransmission Rate (RR Server) going from the clients to the application server. However, none of the packets from the server to the clients needed to be retransmitted (RR Client is around 0). This indicates that the application server is, in fact, dropping the packets and is therefore a slow server (assuming that the route taken from the client to the server is the same as the route taken from the server to the client, as is industry standard practice).
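The reasoning above (high RR Server with an RR Client close to 0 points to the server) can be expressed as a small helper. This Python sketch is only an illustration: the 2% threshold, the function names and the interpretation of the cases where both rates are high are assumptions, not rules taken from the product.

```python
def retransmission_rate(retransmitted_packets: int, total_packets: int) -> float:
    """Share of packets that had to be retransmitted in one direction."""
    if total_packets == 0:
        return 0.0
    return retransmitted_packets / total_packets


def suspect_side(rr_server: float, rr_client: float, threshold: float = 0.02) -> str:
    """Guess where packets are being lost from the two directional rates."""
    server_high = rr_server > threshold  # client -> server packets are resent
    client_high = rr_client > threshold  # server -> client packets are resent
    if server_high and not client_high:
        return "server"
    if client_high and not server_high:
        return "client"
    if server_high and client_high:
        return "network path"  # assumed interpretation when both directions suffer
    return "none"
```

As in the text, this conclusion only holds if the traffic follows the same route in both directions.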
Lastly, check the TCP errors of the clients and the application server. If the server reset count or the number of timed-out sessions is high, this is a further indication of a slow server. Go to Analysis -> TCP errors. Select the application server “Salesforce” from the drop-down box labeled “Server Zone” and click Search.
Here, we see that there are a lot of server resets and timeouts. Given all the above information, we can conclude that the application server is operating slowly. At this point, the server administrator should perform direct diagnosis on the application server to verify CPU, RAM and HD usage.
Users are complaining about slow response time from an in-house web application. This application being an N-tier architecture, its performance as seen by a client is tied to several parameters:
First, we need to find out if the experienced slowdown is due to the web front-end itself. To this end, check every component of the EURT:
- If SRT is fast but RTT and/or DTT are high (see also Connection Time), then we are facing a network slowdown. Refer to previous sections of this guide to further track down the problem.
- If SRT is preponderant compared to DTT and RTT, then the application itself is to blame. Proceed to find out what is affecting performance.
- Check the EURT between the web server and each of the other servers involved (databases, etc.).
If some of these EURTs appear to be degraded, then check these other hosts recursively. If not, then check the web server’s load average.
A TCP connection is reset by an RST packet. There is no need to acknowledge such a packet; the closure is immediate. An RST packet may have many meanings:
- When a TCP client tries to reach a server on a closed port, the server sends an RST packet. The connection attempt could be a malicious one (port scanning, NMAP, etc.), or the consequence of an unexpectedly down server, client/server misconfiguration, server restart, etc.;
- […] RST packet if the incoming TCP packet does not fit with the security policy (the source IP address range is banned, the number of connection attempts is too high in a small period of time, etc.);
- […] RST packet to any new connection attempt;
- […] RST packet to abruptly close it;
- […] RST to both peers. Basically, it’s the same mechanism as the previous one, but the motivation is quite different.
One of the TCP metrics that is interesting to analyze is the retransmission. A TCP retransmission occurs when a TCP packet is resent after having been either lost or damaged. Such a retransmitted packet is identified thanks to its sequence number. In SkyLIGHT PVX, we do not consider packets with no payload, since duplicate ACKs are much more frequent, and not really characteristic of a network anomaly. There are several common sources for TCP retransmission:
- […] TCP retransmission. A common way to identify this kind of problem is to take a glance at the traffic statistics. If you see a flat line at the maximum traffic allowed, then you have found the root cause of the retransmission. If the traffic graph looks OK, you can check the load of the routers/switches you own (e.g., with the SNMP data). If the load is too high, you have found the culprit.
- […] TCP retransmission until a new route is computed, or the issue is fixed. This type of retransmission should occur over very short periods and produce fairly large peaks of retransmission, on very broad types of traffic on a specific subnet. If this happens often, it becomes important to find the faulty hardware by tracking down which subnets are concerned.
- […] TCP/IP stack won’t consider it as a valid packet for the current TCP sessions, and the stack will wait for the correct packet. It will end in a TCP retransmission, anyway. This problem will likely occur on the same type of traffic and continuously.
ICMP stands for Internet Control Message Protocol and is also a common IP transport protocol. The name is fairly explicit, although most people reduce ICMP to the ping command, which is a good way to test whether a host can be reached through a network and how long it takes for a packet to make a round trip through the network. Obviously, ping and traceroute-like tools are very useful for network administrators, but there is much more to say about ICMP and the help it can provide for network administration and diagnosis. ICMP can be used to send more than twenty types of control messages. Some are just messages; some others are a way for IP devices or routers to indicate the occurrence of an error.
Let’s describe the most typical ICMP
error messages you can find
on networks.
ICMP Network Unreachable
Let’s take the simplest example: one machine sitting on a LAN
(192.168.0.7) has one default gateway (192.168.0.254), which is the
router. It is trying to reach a server that does not sit on the LAN
(10.1.0.250) and that cannot be reached because 192.168.0.254 does
not know how to route this traffic. The router then returns an ICMP
Network Unreachable error to the machine.
ICMP Host Unreachable
Let’s take the simplest example: one machine sitting on a LAN
(10.1.2.23) has one default gateway (10.1.2.254/24), which is the
router. It is trying to reach a server that does not sit on the LAN
(192.168.1.15). The traffic flows and reaches the last router before
the server (192.168.1.254/24); this router cannot reach 192.168.1.15
(because it is unplugged, down, or does not exist) and returns an
ICMP Host Unreachable error to 10.1.2.23.
ICMP Port Unreachable
Let’s take a second example: one machine sitting on a LAN
(192.168.0.7) is trying to reach a server on the same LAN
(192.168.0.254) on UDP port 4000, on which the server does not
respond. The server returns an ICMP Port Unreachable error.
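The three errors above are all ICMP "Destination Unreachable" messages (type 3) that differ only by their code field, as defined in RFC 792. The following minimal sketch shows how a capture tool might label them; the function name and the example tuples are illustrative assumptions, not part of SkyLIGHT PVX.

```python
# ICMP "Destination Unreachable" is type 3; the code field gives the reason (RFC 792).
ICMP_DEST_UNREACHABLE = 3
UNREACHABLE_CODES = {
    0: "Network Unreachable",
    1: "Host Unreachable",
    3: "Port Unreachable",
}

def classify_icmp_error(icmp_type, icmp_code):
    """Return the name of one of the three errors described above, else None."""
    if icmp_type != ICMP_DEST_UNREACHABLE:
        return None
    return UNREACHABLE_CODES.get(icmp_code)

# Hypothetical observations matching the three scenarios above:
# (device that sent the ICMP error, ICMP type, ICMP code)
examples = [
    ("192.168.0.254", 3, 0),  # gateway has no route to 10.1.0.250
    ("192.168.1.254", 3, 1),  # last router cannot reach 192.168.1.15
    ("192.168.0.254", 3, 3),  # server has nothing listening on UDP 4000
]
for reporter, icmp_type, icmp_code in examples:
    print(reporter, classify_icmp_error(icmp_type, icmp_code))
```

Any other type or code returns `None` here, so an echo reply (type 0) is simply ignored by this sketch.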
You may be tempted to say: if it is that simple, why do we need
SkyLIGHT PVX on top of any sniffer? All the information sits in the
payload. But in every network, you will find some ICMP errors. They
may be due to a user trying to connect to a bad destination or trying
to reach a server on the wrong port. The key is having a global view
of how many errors you have normally and currently, and from where to
where. The key to leveraging ICMP information is having a relevant
view of it and understanding what it means.
By analysing ICMP errors, we can identify machines that try to
connect to networks or machines that are not routable from the LAN,
or ones that try to connect to actual servers but for services whose
ports are not open. Here are some examples of phenomena that can be
identified that way:
A workstation repeats a large volume of missed attempts to connect to a limited number of servers: it may be that this machine does not belong to the company’s workstations (external consultant on the network whose laptop is trying to reach common resources on his home network – DNS, printers, etc.), or it may be the machine of someone coming from a remote site with its own configuration or a machine that has been simply wrongly configured.
A large number of ICMP Host Unreachable errors coming from one or
several routers to this machine or this group of machines. The ICMP
information contained in the payload of each of these errors would
probably show they are trying to reach a certain number of hosts for
some services or applications.
A certain number of machines keep requesting DNS resolution to a DNS
server that has been migrated (this could be true for any application
available on the network). Their users certainly experience worse
performance when trying to use these services.
A large number of ICMP Host Unreachable errors coming from one or
several routers to a group of machines. The ICMP information
contained in the payload of each of these errors would probably show
they are all trying to reach the previous IP address of a given
server.
A router does not have a route configured; some machines are trying to reach some resources, unsuccessfully.
A large number of ICMP Network Unreachable errors coming from one router to many machines. The ICMP information contained in the payload of each of these errors would probably show they are all trying to reach the same network through the same router.
A machine is trying to complete a network discovery. It is trying to connect to all servers around to see which ports are open.
A large number of ICMP Port Unreachable errors coming from one or several routers corresponding to a single machine (the one which is scanning).
An infected machine is trying to propagate its spyware, virus or worm throughout the network; obviously, it has no previous knowledge of the network architecture.
A large number of ICMP Host Unreachable errors coming from one or several routers corresponding to a limited number of hosts, trying to reach a large volume of non-existing machines on a limited set of ports.
Server disconnected/reboot
A service on UDP (DNS, Radius...) is interrupted because the server
program is temporarily stopped or the host machine is temporarily
shut down. Many requests are then discarded.
How would we see it?
Many ICMP Port Unreachable messages (preceded by some Host Unreachable messages if the host itself was shut down) are emitted during a short period of time for this service host/port.
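The scenarios above all rely on the same idea: group ICMP errors by the host whose traffic triggered them, then look at how many errors, targets and ports are involved. Here is a minimal sketch of that grouping. The tuple layout, the function names and the `min_targets` threshold are assumptions made for illustration; they do not describe how SkyLIGHT PVX stores or evaluates its data.

```python
from collections import defaultdict

def summarize_icmp_errors(errors):
    """Group ICMP errors by the host whose traffic triggered them.

    Each error is a tuple (kind, reporter, src, dst, dst_port); src, dst and
    dst_port describe the original packet quoted in the ICMP payload.
    """
    summary = defaultdict(lambda: {"count": 0, "kinds": set(),
                                   "reporters": set(), "targets": set(),
                                   "ports": set()})
    for kind, reporter, src, dst, dst_port in errors:
        entry = summary[src]
        entry["count"] += 1
        entry["kinds"].add(kind)
        entry["reporters"].add(reporter)
        entry["targets"].add(dst)
        entry["ports"].add(dst_port)
    return dict(summary)

def looks_like_scan(entry, min_targets=3):
    """Crude heuristic (assumed threshold): many distinct unreachable targets."""
    return len(entry["targets"]) >= min_targets

# Hypothetical sample: one host probing UDP 4000 on three servers,
# and one host hitting a single unreachable server.
errors = [
    ("Port Unreachable", "192.168.0.10", "192.168.0.7", "192.168.0.10", 4000),
    ("Port Unreachable", "192.168.0.11", "192.168.0.7", "192.168.0.11", 4000),
    ("Port Unreachable", "192.168.0.12", "192.168.0.7", "192.168.0.12", 4000),
    ("Host Unreachable", "192.168.1.254", "10.1.2.23", "192.168.1.15", 53),
]
summary = summarize_icmp_errors(errors)
for host, entry in sorted(summary.items()):
    print(host, entry["count"], sorted(entry["kinds"]), looks_like_scan(entry))
```

A real deployment would tune the threshold against the normal error level of the network, since some ICMP errors are always present.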
The DNS (Domain Name System), which is defined in detail in RFC 1034
and RFC 1035, is key to the good performance of TCP/IP networks. It
works in a hierarchical way. This means that if one of the DNS
servers is misconfigured or compromised, the entire network that
relies on it is also impacted. Although the DNS protocol is quite
simple, it generates a significant number of issues: configuration
issues, which affect the performance of the network, as well as
security issues, which jeopardize the network integrity. The purpose
of this section is to cover the main configuration issues you may
encounter with DNS when it comes to network performance.
Hypothesis:
You noticed a general slowdown for a specific host, zone, or the entire LAN. You didn’t find the issue with the previous methods. Maybe this problem has nothing to do with the business applications or your network equipment.
Diagnosis:
The DNS server(s) need to have very high availability to resolve all
the names into IP addresses that are necessary for the applications
on the network to function. An overloaded DNS server will take some
time to respond to a name request and will slow down all applications
that have no DNS data in their cache. An analysis of the DNS flows on
the network will reveal malfunctions such as:
Latency issues
If we observe that the mean time between the client request and the
server response is significantly higher than the average (on a LAN it
should remain close to 1 ms), we may face three kinds of issue:
DNS server (DHCP misconfiguration, for example). You can check this in
the interface by looking at the Server IP fields;
DNS server has an issue with regard to the caching of DNS names. The
cache system makes it possible to resolve a name without asking the
DNS server that has authority for the DNS zone for the IP address
corresponding to the name. Hence, if the response time is high, first,
the application will be slow from the user’s point of view and,
secondly, it will involve an unnecessary consumption of bandwidth.
This bandwidth will be wasted both on the LAN and on the Internet link
(if we assume that the authority server sits on the Internet). For a
fairly large organization, the bandwidth used by the DNS traffic will
not be negligible and will represent an additional cost;
DNS server may have system issues. If the server is overloaded, it
cannot handle all the requests and delays (or drops) some of them,
which leads to a general slowdown of network performance.
You can easily take a look at these issues: go to the Analysis -> DNS Messages menu and fill in the form with appropriate values (especially the Requester Zone) to verify that the requests are answered correctly and within an acceptable time.
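As an illustration of the latency check, the sketch below averages DNS response times per server and reports the servers whose mean is well above the 1 ms LAN figure mentioned earlier. The record layout, the function name and the `factor` multiplier are assumptions chosen for the example; the addresses are documentation placeholders.

```python
from collections import defaultdict
from statistics import mean

LAN_BASELINE_MS = 1.0  # rule of thumb quoted above for a LAN

def slow_dns_servers(transactions, factor=5.0):
    """Return {server_ip: mean response time in ms} for slow servers.

    transactions: iterable of (client_ip, server_ip, response_ms).
    A server is reported when its mean exceeds factor * LAN_BASELINE_MS;
    the factor is an arbitrary assumption to be tuned per network.
    """
    per_server = defaultdict(list)
    for _client, server, response_ms in transactions:
        per_server[server].append(response_ms)
    threshold = factor * LAN_BASELINE_MS
    return {server: mean(times)
            for server, times in per_server.items()
            if mean(times) > threshold}

# Hypothetical sample: two clients use the LAN resolver, one client
# (perhaps misconfigured through DHCP) uses a distant resolver.
transactions = [
    ("192.0.2.10", "192.0.2.53", 0.8),
    ("192.0.2.11", "192.0.2.53", 1.2),
    ("192.0.2.12", "198.51.100.53", 40.0),
    ("192.0.2.12", "198.51.100.53", 60.0),
]
slow = slow_dns_servers(transactions)
print(slow)
```

Grouping by server rather than by client helps separate a slow or distant server from a single slow workstation.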
Traffic issue
If we establish the top hosts making DNS requests, it will be
possible to pinpoint misconfigured clients that are not keeping the
DNS server responses in a local cache. This approach makes it
possible to distinguish between an issue coming from the user’s
workstation and one coming from the general functioning of the
network.
Please note that hosts making a very high volume of DNS requests may
correspond to malicious behaviour. For example, some malware tries to
establish connections to the Internet by resolving domain names, and
sometimes the DNS protocol is used as a covert channel to exfiltrate
information.
DNS errors issue
We can also ask for the top hosts receiving the most DNS error
messages (non-existing hosts, etc.). This will also shine a light on
misconfigured stations that generate unnecessary traffic and lower
the overall network performance.
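Both rankings (top requesters and top receivers of DNS errors) can be sketched with simple counters. The example assumes each observed DNS response is available as a `(client_ip, rcode)` pair, which is a hypothetical layout; the response codes themselves are standard, with 0 meaning no error and 3 meaning a non-existent domain (NXDOMAIN).

```python
from collections import Counter

def top_dns_hosts(messages, n=3):
    """Rank clients by number of DNS requests and by number of DNS errors.

    messages: iterable of (client_ip, rcode), one per answered request.
    Returns (top_requesters, top_error_receivers) as lists of (ip, count).
    """
    requests = Counter()
    errors = Counter()
    for client, rcode in messages:
        requests[client] += 1
        if rcode != 0:  # any non-zero RCODE is an error response
            errors[client] += 1
    return requests.most_common(n), errors.most_common(n)

# Hypothetical sample: one station keeps resolving a name that does not exist.
messages = (
    [("192.0.2.7", 0)] * 2
    + [("192.0.2.9", 3)] * 5
    + [("192.0.2.8", 0)]
)
top_requesters, top_errors = top_dns_hosts(messages)
print(top_requesters)
print(top_errors)
```

A host that tops both lists is a good first candidate for a configuration check.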
DNS Internal misconfiguration
To detect an internal DNS misconfiguration, we need to identify the
authority server involved in AXFR and IXFR transactions. If these
updates occur too often (and therefore generate unnecessary traffic),
we can conclude that there is an issue. If the bandwidth used is too
large, it means that our DNS server requests a full zone transfer
(AXFR) when an incremental transfer (IXFR) would have been more
appropriate. If this is the case, then the network administrator can
take some easy steps to improve the network’s performance.
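The zone transfer check can be sketched as follows: count full and incremental transfers over an observation window and look at their frequency and volume. AXFR and IXFR are identified by the standard DNS query types 252 and 251; the record layout, the function name and the sample figures are assumptions made for this example.

```python
AXFR = 252  # full zone transfer query type
IXFR = 251  # incremental zone transfer query type

def zone_transfer_report(transfers, window_s):
    """Summarize zone transfers seen during an observation window.

    transfers: iterable of (timestamp_s, qtype, size_bytes).
    window_s: length of the observation window in seconds.
    """
    full = incremental = total_bytes = 0
    for _timestamp, qtype, size_bytes in transfers:
        if qtype == AXFR:
            full += 1
        elif qtype == IXFR:
            incremental += 1
        else:
            continue  # not a zone transfer, ignore it
        total_bytes += size_bytes
    count = full + incremental
    return {
        "full": full,
        "incremental": incremental,
        "full_ratio": full / count if count else 0.0,
        "transfers_per_hour": count * 3600 / window_s,
        "bytes": total_bytes,
    }

# Hypothetical sample: four transfers in one hour, three of them full.
transfers = [
    (0, AXFR, 500_000),
    (600, AXFR, 500_000),
    (1200, IXFR, 2_000),
    (1800, AXFR, 500_000),
]
report = zone_transfer_report(transfers, window_s=3600)
print(report)
```

A high `full_ratio` combined with a high transfer rate matches the situation described above, where full transfers are requested although incremental ones would do.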