Granville’s Five Essential Questions for Discovering Monitoring Gaps

Are you responsible for monitoring IT within your organization? Do problems with your IT services keep arising that your monitoring systems are silent about? Are you constantly having to swap monitoring tools or write custom scripts because “new” monitoring requirements keep cropping up that your monitoring systems can’t meet?

I’ve been in those situations both as a systems engineer and as the manager of the enterprise monitoring department at a large bank. Having been responsible for working with dozens of support teams to monitor hundreds of services running on thousands of servers, I can attest to how daunting monitoring an enterprise can be. What drove my team and me to successfully align our monitoring systems with the business was seeking the answers to the five essential questions below.

The Five Essential Questions are both strategic and tactical. The strategic questions expose potential weaknesses in your portfolio of monitoring systems that may require long-term planning to rectify. The tactical questions expose weaknesses in keeping your monitoring systems aligned with day-to-day operations. I label each question accordingly: two are strategic, two are tactical, and one spans both categories.

Before I lay them out, I should mention that I present them in the order that will drive the most transformative change. I start with a big picture question to show you where the major gaps are and then focus on each technology to further determine what is missing from comprehensive monitoring coverage. If you are the team lead or manager responsible for monitoring IT, you need the answers to all five questions. If you are only responsible for monitoring a specific technology or service, your focus will be primarily on the tactical questions.

Granville’s Five Essential Questions for Discovering Monitoring Gaps

  1. Are we monitoring all services and technologies in our environment? (Strategic)
  2. Are we monitoring all instances of a technology in our environment? (Tactical)
  3. Are we monitoring for all incidents support staff commonly encounter? (Tactical)
  4. Are we monitoring for failure and performance degradation scenarios that subject matter experts (SMEs) anticipate? (Strategic and Tactical)
  5. Do we have the capability of monitoring technology in the pipeline / on the roadmap? (Strategic)

1. Are we monitoring all services and technologies in our environment? (Strategic)

This is a big picture question, and as such, we are not as concerned about how comprehensively we are monitoring each technology (depth) but rather whether we have any coverage at all (breadth). The tactical questions that follow will deal with the depth aspect.

Conceptually, the way to determine the answer is to create a list of all the technologies and technology-based services in your organization and put a check mark next to each that is monitored. Any that don’t have checks are the monitoring gaps.

This works best if you do a thorough job identifying all of the technologies to monitor. The more high level and generic your list is (websites, desktops), rather than low level and specific (Apache HTTP Server, Red Hat JBoss Enterprise Application Platform, Dell OptiPlex workstations, Windows 10), the more it will appear that you are monitoring quite broadly when you may not be. Likewise, don’t intentionally omit technologies from your list unless you have *VERY CLEAR* policies about what you will not monitor; ideally, you want the results of answering this question to help guide those policies.

If you want to greatly improve the usefulness of your survey, consider using qualified flags rather than simply flagging each technology as monitored or not. When I construct an answer to this question, I incorporate method and capability. For example:

  • monitored through automated means
  • monitored through manual means
  • not monitored, but able to using existing tools
  • not monitored, but able to using staff procedures
  • not monitored

Incorporating manual procedures, such as data center walk-throughs and daily error reports, into the survey can greatly help you prioritize resources, because technically you don’t have a monitoring gap, you have an opportunity to automate. But only include manual processes if you are confident they are rigorously followed and result in remediation when problems are spotted.
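To make this concrete, here is a toy Python sketch of what such a survey could look like as data. The technologies and the status assignments below are purely illustrative:

```python
# A toy sketch of a coverage survey using qualified flags.
# The technology list and statuses are illustrative, not a real inventory.
from collections import Counter

# Statuses mirror the qualified flags described above.
AUTOMATED = "monitored through automated means"
MANUAL = "monitored through manual means"
GAP_TOOLS = "not monitored, but able to using existing tools"
GAP_PROCS = "not monitored, but able to using staff procedures"
GAP = "not monitored"

survey = {
    "Apache HTTP Server": AUTOMATED,
    "Red Hat JBoss EAP": GAP_TOOLS,
    "Dell OptiPlex workstations": MANUAL,
    "Windows 10": GAP,
}

# Summarize coverage breadth, then list the gaps to prioritize.
print(Counter(survey.values()))
for tech, status in sorted(survey.items()):
    if status.startswith("not monitored"):
        print(f"GAP: {tech} -> {status}")
```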

2. Are we monitoring all instances of a technology in our environment? (Tactical)

You may have configured the most in-depth alert conditions for your servers, but if your monitoring system is not aware of those servers, it doesn’t matter. That’s why this is the first tactical question I present: the gaps this answer uncovers need to be closed as soon as possible.

In all but the smallest, most static environments, this question has to be answered in an automated fashion. When I worked for the bank, we received a daily report of servers entering and leaving production status, which we acted on manually. If you are in a more dynamic environment or make use of ephemeral servers, you will need this discovery and instrumentation process to be fully automated.
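As a rough illustration of that reconciliation, here is a minimal Python sketch. It assumes you can export two flat lists of hostnames; the file names production_servers.csv and monitored_servers.csv are hypothetical stand-ins for a CMDB query and a monitoring tool export:

```python
# A minimal sketch of answering question 2 in an automated fashion.
# Compare "servers in production" against "servers the monitoring
# system knows about"; both CSV exports are hypothetical stand-ins.
import csv

def read_hostnames(path: str) -> set[str]:
    """Read one hostname per row from a CSV export."""
    with open(path, newline="") as f:
        return {row[0].strip().lower() for row in csv.reader(f) if row}

production = read_hostnames("production_servers.csv")  # assumed export
monitored = read_hostnames("monitored_servers.csv")    # assumed export

for host in sorted(production - monitored):
    print(f"NOT MONITORED: {host}")     # gap: instrument these first
for host in sorted(monitored - production):
    print(f"STALE MONITORING: {host}")  # left production; retire alerts
```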

3. Are we monitoring for all incidents support staff commonly encounter? (Tactical)

The intent of this question is to discover all the types of incidents a support team encounters and to understand how they were detected and reported to that team. The responsibility for detecting and reporting should lie with your monitoring systems, so any incidents not coming through that channel are the gaps. Conceptually, you are creating a list of such incidents and cross-checking them against what your monitoring systems are configured to detect today, what they are capable of monitoring for (a fillable gap), and what they won’t be able to monitor with the tools in hand (a persistent gap).

In my experience, this is the toughest question to answer. First off, finding incidents that were reported outside of your monitoring systems requires knowing all the other channels (e.g. incident tickets, NOC call logs, daily health checks, secret admin monitoring scripts that e-mail them). Second, analyzing these records is manual work that requires substantial input from the support staff and your monitoring system admins. Finally, you need real rapport with the support team you are working with, because many admins perceive your monitoring systems as adding more overhead to, and scrutiny of, their work.
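For illustration, here is a minimal Python sketch of the cross-check, assuming your ticketing system can export incidents with some record of the detection channel. The detected_by field and the sample records are hypothetical:

```python
# A minimal sketch of cross-checking incident records against detection
# channels. The ticket fields and channel names are illustrative; in
# practice this data would come from your ticketing system's export or API.
incidents = [
    {"id": "INC001", "summary": "disk full on db03", "detected_by": "monitoring"},
    {"id": "INC002", "summary": "batch job hung", "detected_by": "daily health check"},
    {"id": "INC003", "summary": "cert expired", "detected_by": "user call"},
]

# Anything not detected by monitoring is a candidate gap to investigate:
# can existing tools catch it (fillable) or not (persistent)?
for inc in incidents:
    if inc["detected_by"] != "monitoring":
        print(f'{inc["id"]}: "{inc["summary"]}" came in via {inc["detected_by"]}')
```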

Before moving on to question #4, I want to comment on the use of management pack tuning as a way of answering this question. When I was a SCOM administrator, I spent a large chunk of my time working with technology stakeholders to determine which predefined metrics and alerts in a management pack should actually be enabled for monitoring. This process can uncover some of the day-to-day incidents your support teams encounter, but, by itself, it is not a sufficient answer to this question. Relying on it alone assumes that the management pack covers every failure or performance degradation scenario. In my experience, management packs cover some of that ground, but they fall short if for no other reason than that your organization’s deployment and use of the technology is unique in ways the management pack authors did not anticipate.

4. Are we monitoring for failure and performance degradation scenarios that subject matter experts (SMEs) anticipate? (Strategic and Tactical)

Conceptually, you build a list of failure and performance degradation scenarios and cross check this list with what you are monitoring for today. Anything not monitored for is the gap.

There are several methods you can use to generate the scenarios. I’m partial to borrowing a method from Lean Six Sigma called Failure Modes and Effects Analysis (FMEA), which not only generates a list of scenarios but also helps prioritize them (see the sketch below). Another way is to take documented system functional requirements and ask the subject matter expert what could cause each function to misbehave. Yet another is to sit with the SME while looking at a diagram of the system, point to different components, and ask questions like, “What could make this component not perform correctly?” and “What would happen to the system if it did?”
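If you go the FMEA route, the prioritization step is straightforward. Here is a minimal Python sketch using the standard risk priority number (RPN = severity × occurrence × detection); the scenarios and the 1-10 ratings are made up for illustration:

```python
# A minimal sketch of FMEA-style prioritization. Scenarios and 1-10
# ratings are illustrative. RPN = severity x occurrence x detection
# (a higher detection score means harder to detect), so high-RPN
# scenarios are the monitoring gaps to close first.
scenarios = [
    # (failure scenario, severity, occurrence, detection)
    ("connection pool exhaustion", 8, 6, 7),
    ("certificate expiry", 9, 3, 2),
    ("replication lag past SLA", 7, 5, 8),
]

ranked = sorted(scenarios, key=lambda s: s[1] * s[2] * s[3], reverse=True)
for name, sev, occ, det in ranked:
    print(f"RPN {sev * occ * det:4d}  {name}")
```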

Choose your subject matter expert wisely. They not only have to be an expert in the technology but also in how it is actually deployed and used at your organization. Consider getting your lead engineer, an admin, and a consultant from the vendor together to help you answer this question for a given technology.

5. Do we have the capability of monitoring technology in the pipeline / on the roadmap? (Strategic)

To be proactive and prepare your monitoring system portfolio for the future, you need to know what technology changes are coming down the pipe. These changes can be the introduction of new technologies, major updates to existing ones, or their decommissioning. For your monitoring systems, these changes can trigger the need for more / different licenses, increased capacity, system upgrades, module purchases, custom scripting, or complete replacements of monitoring tools. Each change brings its own monitoring challenges and it is up to you to be prepared before these changes go live.

If you’ve answered the previous four essential questions, you have likely uncovered monitoring requirements your current systems can’t handle. My advice is to leverage changes in your environment to address these deficits. If you are proactive about routinely answering this final essential question, you will be in a better position to ask projects for funding, approaching them at the beginning of their effort and not just before they go live.

Good luck with your monitoring!

4 Ways to Get More Value from Monitoring Systems

The primary reason to use a monitoring system is to inform your support teams when technology is not working the way it is expected to.  But monitoring systems can bring much more value to an organization if you can find ways to reuse the data. Below are four possibilities.

1. Operations Excellence Initiatives. I am a very big proponent of synthetic monitoring solutions. And so was one of my previous CTOs, who not only sponsored an entire team to create an effective synthetic web monitoring solution but also used the stats from that monitoring system to determine the bonuses for the entire IT division.

Every day, the stats from that tool were reported to IT management. Every hiccup of a website was investigated. And it seemed like every week at least one web development team asked me for help investigating (or for proof that my monitoring solution was accurate).

Lo and behold, the frequent outages that had plagued our web services for years disappeared. (And people got their bonuses.)

2. Inventory Management. When I took over Microsoft System Center Operations Manager (SCOM) at my last employer, the inventory of Windows servers maintained by the sys admins was completely out of date (and we were running thousands of servers). I worked with the sys admins to automate a report that showed the difference between what SCOM was monitoring and which servers in their inventory were flagged for monitoring (basically, only production servers). I then told them that monitoring would only be done if the inventory actually showed the server was flagged for monitoring. Thus alerts for non-production servers (which they loathed) and missed alerts (which they got dinged for by management) drove them every day to update the inventory. Within a year, their inventory was pretty much in sync with reality.
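For illustration, here is a minimal Python sketch of that difference report. The hostnames and the flagged-for-monitoring attribute are hypothetical; in practice both sides would come from your inventory system and your monitoring tool:

```python
# A minimal sketch of the monitoring-vs-inventory difference report
# described above. The data structures and hostnames are illustrative.
inventory = {
    # hostname: whether the inventory flags it for monitoring (production)
    "web01": True,
    "web02": True,
    "dev17": False,
}
monitored = {"web01", "dev17"}  # what the monitoring system actually watches

noise = sorted(h for h in monitored if not inventory.get(h, False))
missed = sorted(h for h, flagged in inventory.items()
                if flagged and h not in monitored)

print("Monitored but not flagged (unwanted alerts):", noise)
print("Flagged but not monitored (missed alerts):", missed)
```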

Of course, this process primarily kept production server inventory up to date. However, by the time production inventory was straightened out, the sys admins had become so accustomed to updating inventory that they built inventory update tasks into their provisioning and decommissioning processes for all servers.

3. Data Model of Your IT Systems. Many monitoring tools provide some degree of topology mapping, which is essentially a data model of your IT systems. Network monitoring tools will tell you what your network looks like right now. Application Performance Management (APM) tools will tell you how your services are composed. This is extremely valuable information that can be used to supplement your CMDB (or create your own if you like) or to help your event management systems effectively perform event correlation and root cause isolation.

Unfortunately, this topology data is typically isolated within the monitoring system that generates it. If you really want to build a data model of your IT systems, you are going to need to bring your data architecture skills and a lot of engineering ingenuity to extract or federate the data. But it may be worth it: fewer “all hands” calls, reduced mean time to repair (MTTR), and fewer incident tickets generated during event storms.
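As a small taste of the payoff, here is a minimal Python sketch of root cause isolation over a dependency map. The components and dependencies are illustrative stand-ins for whatever topology data you manage to federate out of your tools:

```python
# A minimal sketch of using federated topology data for root cause
# isolation during an event storm. The dependency map is illustrative;
# real data would be extracted from your network/APM tools.
depends_on = {
    "checkout-service": ["payment-api", "web-lb"],
    "payment-api": ["db01"],
    "web-lb": [],
    "db01": [],
}

alerting = {"checkout-service", "payment-api", "db01"}

# A component is a root-cause candidate if it is alerting but none of
# its own dependencies are; upstream alerts are likely just symptoms.
roots = [c for c in alerting
         if not any(d in alerting for d in depends_on.get(c, []))]
print("Root-cause candidates:", roots)  # -> ['db01']
```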

4. Capacity Planning. Monitoring systems collect a lot of data for use in troubleshooting. Much of that same data can be fed into analytics tools, such as Excel and R, to perform capacity studies. These studies often involve correlating service counters (e.g. number of users, number of hits / queries) with system counters (e.g. CPU utilization, memory utilization), which may necessitate extracting data from multiple monitoring systems. This extra elbow grease may be worth it when you consider the cost of hitting capacity during the next major marketing campaign, or if you can identify under-utilized servers that can be repurposed.
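Excel and R are well suited to this; purely for illustration, here is the same idea as a minimal Python sketch. The counter samples are made up, and statistics.correlation / linear_regression require Python 3.10+:

```python
# A minimal sketch of a capacity study correlating a service counter
# with a system counter. The samples are illustrative; real data would
# be extracted from your monitoring systems' databases.
from statistics import correlation, linear_regression  # Python 3.10+

hits_per_min = [120, 250, 400, 610, 800, 950]
cpu_percent = [11, 22, 35, 52, 68, 79]

r = correlation(hits_per_min, cpu_percent)
slope, intercept = linear_regression(hits_per_min, cpu_percent)

# If the relationship is strong, project CPU at the marketing
# campaign's forecast load to see whether you'll hit capacity.
forecast_hits = 1500
print(f"r={r:.3f}, projected CPU at {forecast_hits} hits/min: "
      f"{slope * forecast_hits + intercept:.0f}%")
```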

In conclusion, monitoring systems can provide more than just alerts and diagnostic data. That data can be reused for management performance scorecards, IT system data modeling, and capacity planning, and it can be leveraged to improve data elsewhere, such as in inventory systems and CMDBs.