1

I help a small non-profit that runs a hub-and-spoke fiber network on managed Netgear switches. They have both T1 and DSL connectivity, separated by VLANs, with a simple VoIP system running. Odd things are happening: periodically the network slows down and then jams up. Cycling the power on the main equipment restores functionality until the next time (usually a few days later). The network is fairly simple (it serves around 15 users) and they don't have a dedicated IT person, although one of the more technically minded general staff handles most of the routine IT work.

The organisation is rurally located and has had trouble finding local support with a sufficient depth of knowledge to diagnose the problem (suggesting that they systematically replace all the equipment until the problem goes away is not a diagnosis IMHO).

All the switches are managed, and we could set up a packet-sniffing machine plugged directly into a port configured for monitoring. Is it realistic to think that a network guru, logging in remotely would likely be able to do the detective work to locate the source of the issue?

Assuming it is viable, any direction on sites to look for gurus would also be appreciated. Also, if any network geeks reading this are up for some moonlighting at reasonable rates, please comment.

Stuart

5 Answers

2

I would start with monitoring. If you're having intermittent problems that don't clear on their own but a reboot fixes them, check your resource levels. That's a sign that /something/ is using up a free resource of some type and never giving it back.
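To make "check your resource levels" a bit more concrete, here is a minimal sketch of how a steady drain might be spotted once you have periodic readings of some free resource (free memory, free session slots, and so on). The function names and the threshold are hypothetical, and it assumes you already collect `(timestamp_seconds, value)` samples by some means; it only fits a straight line through them.

```python
from statistics import mean


def slope_per_hour(samples):
    """Least-squares slope of (timestamp_seconds, value) samples, in units per hour."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    t_mean = mean(t for t, _ in samples)
    v_mean = mean(v for _, v in samples)
    numerator = sum((t - t_mean) * (v - v_mean) for t, v in samples)
    denominator = sum((t - t_mean) ** 2 for t, _ in samples)
    if denominator == 0:
        raise ValueError("samples must span more than one timestamp")
    return numerator / denominator * 3600


def looks_like_leak(samples, min_drop_per_hour):
    """True if the free resource is falling at least min_drop_per_hour (hypothetical threshold)."""
    return slope_per_hour(samples) <= -min_drop_per_hour


# Illustrative readings only: free memory in KB, sampled once an hour.
readings = [(0, 1000), (3600, 900), (7200, 800)]
print(looks_like_leak(readings, min_drop_per_hour=50))
```

A resource that drifts down for days and jumps back after a power cycle would fit the symptoms described in the question, but a flat line here does not rule out other causes.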

Matt Simmons
  • Matt - do you mean monitoring the network equipment via something like SNMP? – Stuart Aug 13 '09 at 15:33
  • Seems to be my day trailing Mr. Simmons. I agree with Matt. Recently we had an issue where hosts would show as unreachable despite uptimes in the range of a year. By using network monitoring (Nagios using ICMP) I was able to pin down to the second when all of my networking devices started having slowdowns. When I discussed the time of the issue with my team, it turned out we had a perfect storm of backup jobs, AV downloads, and remote workers logging in. – breadly Aug 13 '09 at 15:47
  • Stuart: As long as you're monitoring the right things via snmp, then sure. Check your error counters, your session counters, free memory, etc. Anything that could get used and not released – Matt Simmons Aug 13 '09 at 16:19
1

You can set up the managed switch to be monitored for alerts or odd behavior via SNMP (set up a dedicated Linux machine temporarily on their network with SSH access if need be), but to answer your question, it depends...

When they have network trouble, is it slow, or dead?

Is it too slow for remote access to work properly?

If the network still works, you can set up access from the outside in to the Linux machine mentioned above to try accessing the switch and see what the switch says. I don't know the full functionality of that switch so I don't know what it does or doesn't alert and log but this would give some access point for you to monitor network traffic as well as get into the switch (I'd set it up to access on a port from the outside other than 22 though).

If you can, you might swap the switch for a temporary unit (I know you said that replacing equipment is not a diagnosis). Since cycling power to the switch clears up the problem, a swap could narrow things down considerably, but only if you can get your hands on some temporary replacement equipment.

Otherwise something might be overwhelming the switch or router. Are they running the latest firmware?

Bart Silverstrim
  • Thanks Bart. The network generally stays up long enough for somebody to log in, or they could log in to the monitoring machine once it is back up. The machines are actually all Windows. I'll follow up your firmware suggestion. – Stuart Aug 13 '09 at 15:55
1

Many switches support a "management" network which may be completely isolated from your production network. This allows you to log into your systems via some out-of-band interface like a modem connected to a bastion host, then from there you can reach all your network devices via the management network and perform your diagnostics from there.

That said, this often isn't done because it doubles the number of networks you have to support and test, but when done properly it can make remote administration almost as effective as live-in-person troubleshooting.

chris
  • +1 for a terminal server with an oobm, with all console ports of network devices and servers on it. – petrus Jul 13 '10 at 11:42
0

Is it realistic to think that a network guru, logging in remotely would likely be able to do the detective work to locate the source of the issue?

Most have to do this as a matter of course. Few organisations have this expertise at every site, and even a visit does not easily address the issues, as problems are often intermittent or unpredictable.

For example, monitoring traffic on switch ports and hosts (e.g. bytes in/out, numbers of packets in/out; broadcast and multicasts in/out, errors in/out) can give a first overview of normal behaviour and any changes during fault conditions. Typical intervals would be every 5 minutes and aggregated over longer periods, ideally displayed on web pages. Data needs to be stored locally as well as remotely in case access is lost when a fault is in progress.
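As a rough illustration of turning two counter readings taken 5 minutes apart into a rate, here is a minimal sketch. It assumes the readings are 32-bit wrapping counters (as SNMP Counter32 values are) and that at most one wrap happened between polls; the function names are mine, not from any monitoring package, and fetching the readings from the switch is left out.

```python
def counter_delta(previous, current, bits=32):
    """Difference between two readings of a wrapping counter (assumes at most one wrap)."""
    modulus = 1 << bits
    return (current - previous) % modulus


def rate_per_second(previous, current, interval_seconds, bits=32):
    """Average rate over the polling interval, e.g. packets or bytes per second."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return counter_delta(previous, current, bits) / interval_seconds


def share(part_delta, total_delta):
    """Fraction of traffic in one class (e.g. broadcasts) over the interval."""
    return part_delta / total_delta if total_delta else 0.0


# Illustrative numbers only: a counter that wrapped between two 5-minute polls.
print(counter_delta(4294967290, 10))
print(rate_per_second(0, 300000, 300))
```

Note that a device reboot resets its counters, which this arithmetic would misread as a very large delta, so polls that straddle a power cycle should be discarded. A broadcast share that climbs sharply just before a lock-up is the kind of change from normal behaviour this data is meant to expose.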

SNMP alerts are useful to collect.

Beyond that, network traces taken to a machine (often BSD or GNU/Linux based), typically connected to one or more span ports on the local switch(es), are useful, though if not narrowly filtered they may be huge. Multiple sources may be needed (e.g. traffic to/from local servers; to/from WAN connection(s)). It is helpful if multiple traces can be taken concurrently.
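One way to keep a long-running trace from growing without bound is a filtered capture written to a fixed-size ring of files. The sketch below only builds a `tcpdump` command line; the interface name, output path, sizes, and the filter itself are placeholder assumptions to adapt, and `tcpdump` must be installed and run with sufficient privileges on the capture machine.

```python
import shlex


def build_capture_command(interface, out_path, bpf_filter,
                          file_mb=100, file_count=20, snaplen=128):
    """Build a tcpdump argument list for a filtered, size-bounded ring capture."""
    return [
        "tcpdump",
        "-i", interface,        # capture interface (the one cabled to the span port)
        "-n",                   # do not resolve addresses to names
        "-s", str(snaplen),     # keep only the first snaplen bytes of each packet
        "-C", str(file_mb),     # start a new file after about file_mb million bytes
        "-W", str(file_count),  # keep at most file_count files, overwriting the oldest
        "-w", out_path,         # write raw packets to file(s) based on this name
        bpf_filter,             # capture filter expression
    ]


# Hypothetical values: adjust the interface, path, and filter for the site.
command = build_capture_command("eth1", "/var/tmp/span.pcap",
                                "ether broadcast or ether multicast")
print(shlex.join(command))
```

With these defaults the ring tops out at roughly 2 GB on disk. Truncating packets with a small snap length also keeps most payload out of the trace, which reduces, though does not remove, the exposure of having traces leave the organisation.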

All these can be looked at and interpreted remotely, though doing so needs a reasonable understanding of the examined network, and some of the data volumes (especially raw traces or traces over time awaiting a fault) can be huge.

A risk assessment will be needed before either allowing a third party to access the networks or sending network traces out of the control of your organisation. A full network trace allows the reconstruction of any non-encrypted content. Even if the data is encrypted and the trace excludes most of the content a full record of volumes with sources and sinks is still available. It may also include web sites and pages accessed and by whom, for example. Encrypting disks of trace information sent by mail would be a minimal safeguard and you would want a corresponding level of trust in whoever these go to. An external party given access may need equipment passwords: make sure you know which so they can be changed and consideration given to auditing equipment that has had external access. Online external access should be over secured channels (e.g. using ssh) if at all possible.

mas
  • One problem with this is if you have a hard failure of a device, or an issue that wipes out the whole network, like a broadcast storm. At that point, your alerts and monitoring infrastructure may be on the wrong side of the washed-out river... – chris Aug 13 '09 at 20:40
  • @chris - that's why I would store the data locally as well as on another segment; there needs to be a local device to do the network traces and this would also hold stats information for the segment from devices on the segment, including its own interfaces (so you'd see the broadcast/multicast storm recorded there when access to the device is available again). With a hard fail remote systems would alert that contact with the remote devices had been lost and you either wait for service to be restored or have back-channel access to the remote monitoring device (e.g. dial-in modem). – mas Aug 14 '09 at 13:57
0

Set up local monitoring (SNMP of the switches, perhaps) that will continue to operate when the network is in bad shape. After the next reboot of the offending gear, remote in and review the logs from the time in question.
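A minimal sketch of the "log locally, review afterwards" idea: samples are appended to a CSV file on the monitoring machine's own disk, so nothing depends on the network being healthy, and a helper pulls out the rows leading up to a reboot. The file layout, device name, and metric name are all hypothetical, and collecting the values themselves (via SNMP or otherwise) is left out.

```python
import csv
import os
from datetime import datetime, timedelta

FIELDS = ["timestamp", "device", "metric", "value"]


def append_sample(path, when, device, metric, value):
    """Append one reading to a local CSV file, writing a header if the file is new."""
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as handle:
        writer = csv.writer(handle)
        if is_new:
            writer.writerow(FIELDS)
        writer.writerow([when.isoformat(), device, metric, value])


def window_before(reboot_time, minutes):
    """Start and end of the period leading up to a reboot."""
    return reboot_time - timedelta(minutes=minutes), reboot_time


def samples_between(path, start, end):
    """Return the logged rows whose timestamp falls within [start, end]."""
    with open(path, newline="") as handle:
        return [row for row in csv.DictReader(handle)
                if start <= datetime.fromisoformat(row["timestamp"]) <= end]
```

After the next power cycle, something like `samples_between(path, *window_before(reboot_time, 60))` would give a remote helper the last hour of readings before the reset.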

Yes, a good network guy should be able to figure something out this way eventually, although it may be slower than if he/she were local to the systems.

Michael Kohne
  • Thanks Michael, sounds like it's doable to have a remote expert assist. Any guidance on where to find a network expert to review the logs? Where do network gurus congregate online? – Stuart Aug 13 '09 at 16:10