Running out of bandwidth?
There are a variety of SNMP and NetFlow monitoring solutions for looking at network bandwidth, so what’s the difference between them?
The most common tools for looking at bandwidth issues are SNMP tools. These are probably the most widely used network tools in the world and range from “free” download versions to sophisticated Data Centre management tools that all claim to be the only tool you need.
- The good: they all get their information from the network hardware that is actually passing the packets, along with some idea of what those packets are. Many of the fundamental questions, such as how busy a particular port is and whether there are errors in the network, are answered, marked red/green and plotted over time.
- The bad: because the tool polls for its data, each result is typically several minutes of information rolled up into one or more polls, so it lacks detail. A modern network is also a mesh of interconnected ports and Layer 3 configuration, which SNMP struggles to represent in any meaningful form. Finally, SNMP does not actually understand the content of the packets; it effectively just counts them.
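To illustrate why polled counters lack detail, here is a minimal Python sketch of how a utilisation figure is typically derived from two readings of an interface octet counter. The function name, parameters and example figures are assumptions made for illustration; they are not taken from any particular SNMP product.

```python
def link_utilisation(octets_start, octets_end, interval_s, link_bps, counter_bits=64):
    """Average utilisation (0.0 to 1.0) between two polls of an octet counter.

    octets_start / octets_end: counter readings at the two polls (hypothetical inputs).
    interval_s: seconds between the polls.
    link_bps: link speed in bits per second.
    counter_bits: counter width, so a single wrap between polls is handled.
    """
    if interval_s <= 0 or link_bps <= 0:
        raise ValueError("interval_s and link_bps must be positive")
    modulus = 2 ** counter_bits
    # Modular subtraction copes with the counter wrapping once between polls.
    delta_octets = (octets_end - octets_start) % modulus
    return (delta_octets * 8) / (interval_s * link_bps)


# A 5-minute poll on a 100 Mbit/s link that moved 375,000,000 octets.
average = link_utilisation(0, 375_000_000, 300, 100_000_000)
print(f"{average:.0%}")
```

The result is an average over the whole polling interval. The same 375,000,000 octets could have been a 30-second burst at full line rate followed by silence, and the poll would report the same figure.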
In summary, SNMP tools such as SolarWinds are a good place to start the analysis: they are great for trends and patterns, and anything they flag up is worth a good look. However, SNMP tools are not so good for detail, short time periods, configuration or TCP-style issues. Modern solutions are getting better at polling the software layers as well, which increases the visibility they offer.
NetFlow and Probes
The next tools to consider involve NetFlow and Probes. These can record not only the utilisation of a link but also the IP address information and the protocol fields. NetFlow uses the Layer 3 switches and routers to export the routing cache to a database, and a reporting engine then organises it into a meaningful form. Some of the better SNMP tools (such as SolarWinds) now have NetFlow modules to add this extra detail to their statistics.
Other, more dedicated NetFlow tools (such as Plixer) have also emerged. These offer more detail on traffic patterns, carrier types and interface analysis, and even look for security risks based on the patterns of the TCP connections in the network.
In modern networking, one of the blind spots in traffic analysis is the use of port numbers to identify protocols. It is very easy to look at a WAN link and read that it is all Port 80 (HTTP traffic). This is not very helpful, as it can hide all sorts of illegal downloads, music files, social networking images and so on. The picture is further complicated by shared PCs, hot desking and remote/home workers, as these devices use shared IP address ranges, making it very difficult to identify the users behind the questionable traffic.
TCP/IP also contains its own correction and error notification system, and NetFlow is the first layer at which this becomes visible. In short, these errors tell you how easily TCP is getting data through to the far end and, when it is not working, whether to look at the client, the server or the network to see why. They give us clues as to whether the resources of the client or server machines are actually slowing down the data transfer (Zero Windows). This tends to get blamed on the network, but in reality it is often a new machine and operating system talking to an old one that needs data sent through in smaller chunks (known as windows).
The priority fields in the packet headers are also read here. If you use CoS or QoS, the traffic reports contain these fields, and reports can show how much traffic of each type is passing under each priority value.
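As a rough sketch of how such a priority report could be built, the Python below totals bytes per DSCP value. It assumes each flow record carries the IP ToS byte, whose upper six bits are the DSCP value. The record layout (`tos` and `bytes` keys) and the sample figures are hypothetical, not the export format of any specific tool.

```python
from collections import defaultdict


def dscp_from_tos(tos):
    """Return the DSCP value held in the upper six bits of the IP ToS byte."""
    return (tos & 0xFF) >> 2


def bytes_by_dscp(flows):
    """Total the bytes seen for each DSCP value.

    flows: iterable of dicts with hypothetical keys "tos" and "bytes".
    """
    totals = defaultdict(int)
    for flow in flows:
        totals[dscp_from_tos(flow["tos"])] += flow["bytes"]
    return dict(totals)


# Illustrative records: ToS 184 carries DSCP 46 (Expedited Forwarding),
# ToS 0 is unmarked best-effort traffic.
sample_flows = [
    {"tos": 184, "bytes": 1000},
    {"tos": 0, "bytes": 500},
    {"tos": 184, "bytes": 2000},
]
print(bytes_by_dscp(sample_flows))
```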
The issues with NetFlow (and all the variants: sFlow, J-Flow, IPFIX) are the same as with SNMP. Time resolution is limited to 1 minute, so detail is lacking, and the data is really the routing cache of the networking devices rather than the actual content of the packets. However, for many people it gets them to the level where they can see the bandwidth being used, plus who is using it and what they are doing.
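The "who it is and what they are doing" view is essentially an aggregation over exported flow records. The Python sketch below shows the idea under stated assumptions: the record fields (`src`, `dst_port`, `bytes`) and the addresses are invented for illustration and do not reflect a real collector's schema.

```python
from collections import Counter


def usage_by(flows, *fields):
    """Total bytes per combination of the named record fields, largest first.

    flows: iterable of dicts (hypothetical flow records with a "bytes" key).
    fields: the record keys to group by, e.g. "src" or "src", "dst_port".
    """
    totals = Counter()
    for flow in flows:
        key = tuple(flow[name] for name in fields)
        totals[key] += flow["bytes"]
    return totals.most_common()


sample_flows = [
    {"src": "10.0.0.5", "dst_port": 80, "bytes": 7000},
    {"src": "10.0.0.9", "dst_port": 443, "bytes": 2000},
    {"src": "10.0.0.5", "dst_port": 80, "bytes": 1000},
    {"src": "10.0.0.5", "dst_port": 25, "bytes": 500},
]
print(usage_by(sample_flows, "src"))              # who is using the link
print(usage_by(sample_flows, "src", "dst_port"))  # and on which port
```

Grouping by port only shows the port number, not the real application behind it, so a report like this inherits the Port 80 blind spot described earlier.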
With some of these solutions it is also popular to “sample flows”. The reason for doing this is to reduce the information stored (NetFlow databases tend to be much larger than SNMP databases) and sometimes to avoid licence costs. If you are using sFlow as a reporting tool for usage over a week or a month, then it probably makes little to no difference; however, if you plan to use it as a troubleshooting tool, sampling makes the data less and less reliable. Another, more modern trend is to reduce or remove NetFlow from the routing stack to improve the test bench performance of the kit. Many organisations purchase kit based purely on switching/routing times, so manufacturers are responding with stripped-down operating systems to improve the numbers.
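The effect of sampling on troubleshooting data can be sketched in a few lines of Python. This is a simplified model with assumptions that should be stated plainly: real samplers normally pick packets at random, whereas this sketch takes every Nth packet so the result is reproducible, and the traffic mix is invented.

```python
def sample_every_nth(packets, n):
    """Keep every nth packet (deterministic stand-in for 1-in-n sampling)."""
    return packets[n - 1::n]


def estimate_bytes(sampled_packets, n):
    """Scale each sampled packet by n to estimate the true bytes per flow."""
    totals = {}
    for flow_id, size in sampled_packets:
        totals[flow_id] = totals.get(flow_id, 0) + size * n
    return totals


# Hypothetical traffic: 96 large "bulk" packets and 4 small "voip" packets.
packets = [
    ("voip", 200) if i in (5, 30, 55, 80) else ("bulk", 1500)
    for i in range(100)
]
estimate = estimate_bytes(sample_every_nth(packets, 10), 10)
print(estimate)
```

In this run the bulk transfer is estimated at 150,000 bytes against a true 144,000, which is fine for a weekly usage report. The small flow never appears in the sample at all, which is exactly the kind of gap that hurts when troubleshooting.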
These tools are complementary rather than competitive. SNMP works with more manufacturers’ hardware and also covers basics like availability very well. NetFlow solutions add more detail regarding users and applications but can only use Layer 3 devices as data sources. The only area to be careful of here is how much data you decide to keep. Freeware products tend to measure retention in hours and days, while good paid-for solutions can measure it in months; the issue is that you need a proper database architecture to organise all this. NetFlow solutions tend to create more data than SNMP ones, so bear that in mind when sizing the solution.
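A back-of-the-envelope sizing calculation can help when deciding how much data to keep. The Python sketch below is only arithmetic; the flow rate and the bytes stored per record are placeholder assumptions that should be replaced with figures measured on your own collector.

```python
SECONDS_PER_DAY = 86_400


def flow_storage_bytes(flows_per_second, bytes_per_record, retention_days):
    """Rough raw storage needed to keep every flow record for the given period.

    All three inputs are assumptions to be measured locally; indexes,
    compression and roll-ups in a real database will change the total.
    """
    return flows_per_second * bytes_per_record * SECONDS_PER_DAY * retention_days


# Placeholder figures: 1,000 flows/s, 64 bytes stored per record, 30 days kept.
total = flow_storage_bytes(1_000, 64, 30)
print(f"{total / 1e9:.1f} GB")
```

Moving the same placeholder workload from days to months of retention scales the total linearly, which is why a proper database architecture matters for the longer-retention solutions.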