Are you responsible for monitoring IT within your organization? Do problems keep arising with your IT services that your monitoring systems stay silent about? Are you constantly swapping monitoring tools or writing custom scripts because “new” monitoring requirements keep cropping up that your monitoring systems can’t meet?
I’ve been in those situations both as a systems engineer and as a manager for the enterprise monitoring department of a large bank. Having worked with dozens of support teams to monitor hundreds of services running on thousands of servers, I can attest to how daunting monitoring an enterprise can be. What drove me and my team to successfully align our systems was seeking the answers to the five essential questions I ask below.
The Five Essential Questions are both strategic and tactical. The strategic questions expose potential weaknesses in your portfolio of monitoring systems that may require long-term planning to rectify. The tactical questions expose weaknesses in keeping your monitoring systems aligned with day-to-day operations. I label each question as strategic, tactical, or both: two are strategic, two are tactical, and one spans both categories.
Before I lay them out, I should mention that I present them in the order that will drive the most transformative change. I start with a big-picture question to show you where the major gaps are and then focus on each technology to determine what is still missing from comprehensive monitoring coverage. If you are the team lead or manager responsible for monitoring IT, you need the answers to all five questions. If you are only responsible for monitoring a specific technology or service, your focus will be primarily on the tactical questions.
Granville’s Five Essential Questions for Discovering Monitoring Gaps
- Are we monitoring all services and technologies in our environment? (Strategic)
- Are we monitoring all instances of a technology in our environment? (Tactical)
- Are we monitoring for all incidents support staff commonly encounter? (Tactical)
- Are we monitoring for failure and performance degradation scenarios that subject matter experts (SMEs) anticipate? (Strategic and Tactical)
- Do we have the capability of monitoring technology in the pipeline / on the roadmap? (Strategic)
1. Are we monitoring all services and technologies in our environment? (Strategic)
This is a big-picture question, so we are less concerned with how comprehensively we are monitoring each technology (depth) than with whether we have any coverage at all (breadth). The tactical questions that follow deal with depth.
Conceptually, the way to determine the answer is to create a list of all the technologies and technology-based services in your organization and put a check mark next to each that is monitored. Any that don’t have checks are the monitoring gaps.
This works best if you do a thorough job identifying all of the technologies to monitor. The more high-level and generic your list is (websites, desktops) rather than low-level and specific (Apache HTTP Servers, Red Hat JBoss Enterprise Application Platform, Dell OptiPlex workstations, Windows 10), the more it will appear that you are monitoring broadly when you may not be. Likewise, don’t intentionally omit technologies from your list unless you have *VERY CLEAR* policies about what you will not monitor. Ideally, the results of answering this question should help guide those policies.
If you want to greatly improve the usefulness of your survey, consider using qualified flags rather than flagging each technology as simply monitored or not. When I construct an answer to this question, I incorporate method and capability. For example:
- monitored through automated means
- monitored through manual means
- not monitored, but could be using existing tools
- not monitored, but could be using staff procedures
- not monitored
Incorporating manual procedures, such as data center walk-throughs and daily error reports, into the survey can greatly help to prioritize resources, because where a manual procedure exists you technically don’t have a monitoring gap but an opportunity to automate. Only include manual processes, however, if you are confident they are rigorously followed and result in remediation when problems are spotted.
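As a rough illustration of the qualified-flag survey, here is a minimal Python sketch. The flag names, the `survey` contents, and the helper functions are all hypothetical; the technologies are simply the examples mentioned above, and the flags assigned to them are made up for the demonstration.

```python
from collections import Counter
from enum import Enum


class Coverage(Enum):
    """Qualified flags for the coverage survey (names are illustrative)."""
    AUTOMATED = "automated"                # monitored through automated means
    MANUAL = "manual"                      # monitored through manual means
    FILLABLE_TOOLS = "fillable_tools"      # not monitored, existing tools could do it
    FILLABLE_PROCEDURES = "fillable_procedures"  # not monitored, staff procedures could do it
    NOT_MONITORED = "not_monitored"        # not monitored, no current option


# Hypothetical survey: one flag per technology in the environment.
survey = {
    "Apache HTTP Server": Coverage.AUTOMATED,
    "Red Hat JBoss Enterprise Application Platform": Coverage.FILLABLE_TOOLS,
    "Dell OptiPlex workstations": Coverage.MANUAL,
    "Windows 10": Coverage.NOT_MONITORED,
}


def summarize(survey):
    """Count how many technologies carry each flag."""
    return Counter(survey.values())


def gaps(survey):
    """Technologies with no monitoring of any kind, automated or manual."""
    monitored = {Coverage.AUTOMATED, Coverage.MANUAL}
    return sorted(tech for tech, flag in survey.items() if flag not in monitored)


def automation_candidates(survey):
    """Technologies covered only by manual procedures."""
    return sorted(tech for tech, flag in survey.items() if flag is Coverage.MANUAL)


if __name__ == "__main__":
    print("Gaps:", gaps(survey))
    print("Automation candidates:", automation_candidates(survey))
```

Keeping the manual flag separate from the gaps is what lets the survey distinguish “nothing is watching this” from “a person is watching this and a tool could take over.”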
2. Are we monitoring all instances of a technology in our environment? (Tactical)
You may have configured the most in-depth alert conditions for your servers, but if your monitoring system is not aware of those servers, it doesn’t matter. That’s why this is the first tactical question I present: the gaps uncovered by this answer need to be addressed as soon as possible.
In all but the smallest and most static environments, this question has to be answered in an automated fashion. When I worked for the bank, we received a daily report of servers entering and leaving production status, which we acted on manually. If you are in a more dynamic environment or make use of ephemeral servers, you will need this discovery and instrumentation process to be fully automated.
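At its core, the automated check is a comparison between two lists: what is in production and what the monitoring system knows about. A minimal sketch, assuming you can export both lists as hostnames (the function name and the sample hostnames are hypothetical):

```python
def find_instance_gaps(production_hosts, monitored_hosts):
    """Compare the production inventory against the hosts known to monitoring.

    Hostnames are normalized (trimmed, lowercased) so that 'DB01' and 'db01'
    are treated as the same server.
    """
    production = {h.strip().lower() for h in production_hosts}
    monitored = {h.strip().lower() for h in monitored_hosts}
    return {
        # In production but unknown to monitoring: the actual gap.
        "unmonitored": sorted(production - monitored),
        # Known to monitoring but no longer in production: cleanup candidates.
        "stale": sorted(monitored - production),
    }


if __name__ == "__main__":
    # Hypothetical exports from an inventory source and a monitoring system.
    inventory = ["web01", "web02", "DB01"]
    in_monitoring = ["web01", "db01", "old99"]
    print(find_instance_gaps(inventory, in_monitoring))
```

In a dynamic environment the same comparison would run on a schedule or on provisioning events rather than from a daily report.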
3. Are we monitoring for all incidents support staff commonly encounter? (Tactical)
The intent of this question is to discover all the types of incidents that a support team encounters and understand how they were detected and reported to the support team. The responsibility for detecting and reporting should be with your monitoring systems, so any incidents not coming through that channel are the gaps. Conceptually, you are creating a list of such incidents and cross-checking each one against three possibilities: your monitoring systems are configured to detect it today, they are capable of detecting it but not yet configured to (a fillable gap), or they cannot detect it with the tools in hand (a persistent gap).
In my experience, this is the toughest question to answer. First, finding incidents that were reported outside of your monitoring systems requires knowing all the other channels (e.g. incident tickets, NOC call logs, daily health checks, unofficial scripts that admins run to e-mail themselves alerts). Second, analyzing these records is a lot of manual work that requires a lot of input from the support staff and your monitoring system admins. Finally, you really need rapport with the support team you are working with, because many admins perceive monitoring systems as adding overhead to, and scrutiny of, their work.
Before moving on to question #4, I want to comment on the use of management pack tuning to answer this question. When I was a SCOM administrator, I spent a large chunk of my time working with technology stakeholders to determine which predefined metrics and alerts contained in a management pack should actually be enabled for monitoring. This process can uncover some of the day-to-day incidents your support teams encounter, but by itself it is not a sufficient way of answering this question. Relying on it assumes that the management pack covers every failure or performance degradation scenario. In my experience, management packs cover some of that ground but fall short, if for no other reason than that your organization’s deployment and use of the technology is unique in ways the management pack did not anticipate.
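The cross-check can be sketched as a simple classification. In this minimal Python example, the incident names and the `configured` and `capable` sets are hypothetical; in practice they would come from your incident records and from your monitoring system admins.

```python
def classify_incidents(incident_types, configured, capable):
    """Sort incident types into monitored, fillable-gap, and persistent-gap buckets.

    configured: incident types the monitoring systems detect today.
    capable:    incident types the current tools could detect if configured.
    """
    buckets = {"monitored": [], "fillable gap": [], "persistent gap": []}
    for incident in sorted(set(incident_types)):
        if incident in configured:
            buckets["monitored"].append(incident)
        elif incident in capable:
            buckets["fillable gap"].append(incident)
        else:
            buckets["persistent gap"].append(incident)
    return buckets


if __name__ == "__main__":
    # Hypothetical incident types gathered from tickets, call logs, and health checks.
    incidents = [
        "disk full",
        "service stopped",
        "certificate expired",
        "batch job ran long",
        "disk full",  # duplicates are collapsed
    ]
    configured = {"disk full", "service stopped"}
    capable = {"certificate expired"}
    for bucket, items in classify_incidents(incidents, configured, capable).items():
        print(f"{bucket}: {items}")
```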
4. Are we monitoring for failure and performance degradation scenarios that subject matter experts (SMEs) anticipate? (Strategic and Tactical)
Conceptually, you build a list of failure and performance degradation scenarios and cross-check this list against what you are monitoring for today. Anything not monitored for is a gap.
There are several methods you can use to generate the scenarios. I’m partial to borrowing a method from Lean Six Sigma called Failure Modes and Effects Analysis (FMEA), which not only generates a list of scenarios but also helps prioritize them. Another way is to take documented system functional requirements and ask the subject matter expert what could cause each function to not behave correctly. Yet another way is to sit with the SME while looking at a diagram of the system, point to different components, and ask questions like, “what could make this component not perform correctly?” and “what would happen to the system if it did?”
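FMEA commonly prioritizes scenarios with a Risk Priority Number (RPN), the product of severity, occurrence, and detection ratings, where a higher detection rating means the failure is harder to detect. The sketch below assumes the conventional 1 to 10 scale for each rating; the scenarios and their ratings are invented for illustration and would normally come from the SME.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    scenario: str
    severity: int    # 1 (negligible impact) to 10 (severe impact)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (almost certain to be detected) to 10 (likely to go undetected)

    def __post_init__(self):
        # Guard against ratings outside the assumed 1-10 scale.
        for name in ("severity", "occurrence", "detection"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be between 1 and 10, got {value}")

    @property
    def rpn(self):
        """Risk Priority Number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes):
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


if __name__ == "__main__":
    # Hypothetical scenarios and ratings for illustration only.
    modes = [
        FailureMode("Log volume fills up", severity=7, occurrence=6, detection=2),
        FailureMode("TLS certificate expires", severity=9, occurrence=3, detection=8),
        FailureMode("Connection pool exhausted", severity=6, occurrence=5, detection=7),
    ]
    for mode in prioritize(modes):
        print(f"{mode.rpn:4d}  {mode.scenario}")
```

Because poor detectability raises the score, scenarios that would currently go unnoticed tend to rise to the top, which is exactly where monitoring gaps are likely to be.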
Choose your subject matter expert wisely. They not only have to be an expert in the technology but have to be an expert in how it is actually deployed and used at your organization. You might consider getting your lead engineer, an admin and a consultant from the vendor together to help you answer this question for a given technology.
5. Do we have the capability of monitoring technology in the pipeline / on the roadmap? (Strategic)
To be proactive and prepare your monitoring system portfolio for the future, you need to know what technology changes are coming down the pipe. These changes can be the introduction of new technologies, major updates to existing ones, or their decommissioning. For your monitoring systems, these changes can trigger the need for more / different licenses, increased capacity, system upgrades, module purchases, custom scripting, or complete replacements of monitoring tools. Each change brings its own monitoring challenges and it is up to you to be prepared before these changes go live.
If you’ve answered the previous four essential questions, you have likely uncovered monitoring requirements your current systems can’t handle. My advice is to leverage changes in your environment to address these deficits. If you are proactive and routinely answer this final essential question, you will be in a better position to ask project teams for funding, because you can approach them at the beginning of their effort rather than just before they go live.
Good luck with your monitoring!