
I can use Windows Pro, Windows Server Standard, or CentOS to do this monitoring. There seem to be some complex issues going on with our AWS Deadline jobs, which I don't expect anyone here to diagnose directly, but the jobs fail a lot.

The smoking gun (I think) is that the job Monitor software sometimes fails to launch because it can't access a particular share. The shares are ZFS, and the hardware is more than capable of keeping up with the I/O. The problem mostly occurs in dense clusters of failures, with sparse one-off occurrences at other times. A reboot fixes the one-offs, but not the dense clusters.

I am strongly motivated to monitor the reliability of these network shares intensively, because at other times they become suddenly and mysteriously unavailable to Windows 10 clients! Then they either show up again moments or minutes later... or a reboot fixes it.

All clients experiencing this are running Windows 10, but that doesn't necessarily mean it's a Windows 10 issue.

Network congestion is not too high.

Can I do this kind of monitoring with Event Viewer? Or is there a painless Python way to do it? I want to collect data points as frequently as possible, 24 hours a day, for a week... if that makes sense.

bluesquare
  • A reboot of the client or server fixes it? What are you using to share the ZFS shares (Solaris, Linux, OpenIndiana)? Can the server still access the storage under the shares when this is going on? Are you mapping directly to the \\server\share or are you using DFS? What version of SMB are you using? – RobbieCrash Oct 09 '20 at 22:20

1 Answer


Run a script as a scheduled task that reads and writes a small but changing value to a text file on the share at regular, frequent intervals, and logs the results to the monitoring server.

I'd suggest having it run every few seconds, if possible. If you're going to host it on a Windows machine, you may wish to use cmd rather than PowerShell, since PowerShell's startup overhead makes very short intervals expensive.

The value to write would be the time/date down to fractions of a second. On each iteration the script looks for the file; if it finds it, it reads the value, appends that value to the monitoring log, and writes a new value to the check file. If the file isn't found, it writes an appropriate error message to the log instead.

You'd be able to verify both read and write access to the share down to whatever granularity you desire (or your systems can handle), and you'd have a log of every successful or failed read and write to that share.
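Since the question mentions Python, here's a minimal sketch of that loop in Python. The share path, log path, and the file name `sharecheck.txt` are all placeholders to adjust for your environment:

```python
import datetime
import pathlib

# Placeholder paths -- adjust for your environment.
CHECK_FILE = pathlib.Path(r"\\server\share\sharecheck.txt")  # file on the monitored share
LOG_FILE = pathlib.Path(r"C:\sharemonitor\sharecheck.log")   # local log, kept off the share

def check_share():
    now = datetime.datetime.now().isoformat(timespec="milliseconds")
    try:
        # Read back the value written on the previous pass, if any.
        previous = CHECK_FILE.read_text() if CHECK_FILE.exists() else "FIRST-RUN"
        # Write a fresh timestamp for the next pass to find.
        CHECK_FILE.write_text(now)
        result = f"{now} OK previous={previous}"
    except OSError as exc:
        # Any failure to reach the share (timeout, access denied, ...) lands here.
        result = f"{now} FAIL {exc}"
    with LOG_FILE.open("a") as log:
        log.write(result + "\n")

if __name__ == "__main__":
    check_share()
```

For minute-level checks you can schedule this with Task Scheduler (e.g. `schtasks /Create /SC MINUTE`); for a few-second interval, wrap the call in a `while` loop with `time.sleep()` instead, since Task Scheduler won't repeat a task more often than once a minute.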

music2myear
  • Let's say I only write 16 chars... can I do this every 60 seconds? Every 30? I feel like absolutely, why not... but I'm no expert. The only thing I do know is that the more often I do this, the higher the chance a check will be delayed – bluesquare Oct 09 '20 at 20:25
  • That's going to depend on your network and systems. You can benchmark by timing how long it takes to read the old value and write the new one. Collect this over a period of direct observation to see the average, minimum, and maximum times it takes. On my own network I'd guess I could run a command like this every 5 or 10 seconds easily on most servers. Servers more network hops away may take longer, but if I placed the monitoring source closer to them on the network, it would still support pretty high-resolution checking. – music2myear Oct 09 '20 at 20:36
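One way to run that benchmark, as a sketch along the same lines as the check script above (again, `CHECK_FILE` is a placeholder path): time a number of read+write round trips to the share and summarize them.

```python
import pathlib
import statistics
import time

CHECK_FILE = pathlib.Path(r"\\server\share\sharecheck.txt")  # placeholder path

def benchmark(samples=50, pause=1.0):
    """Time read+write round trips to the share and report min/avg/max."""
    durations = []
    for i in range(samples):
        start = time.perf_counter()
        try:
            if CHECK_FILE.exists():
                CHECK_FILE.read_text()
            CHECK_FILE.write_text(str(time.time()))
            durations.append(time.perf_counter() - start)
        except OSError as exc:
            print(f"sample {i}: FAILED ({exc})")
        time.sleep(pause)
    if durations:
        print(f"min={min(durations) * 1000:.1f} ms  "
              f"avg={statistics.mean(durations) * 1000:.1f} ms  "
              f"max={max(durations) * 1000:.1f} ms")

if __name__ == "__main__":
    benchmark()
```

The maximum is the number that matters for choosing an interval: if the slowest round trip under normal conditions is well under your planned check interval, the checks won't pile up on each other.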