This content was provided as a Professional Contribution through the FinOps Certified Professional program.
Summary: Learn how to identify and address unexpected technology billing spikes and strengthen Anomaly Management. Transition from manual checks to automated detection systems that integrate directly into existing engineering workflows. See how this practitioner builds a shared triage process and conducts blameless post-mortems, and how organizations can contextualize anomalies and collaboratively reduce their resolution time. This guidance serves as a foundation for building a routine, cross-functional approach to maintaining predictable technology usage without disrupting development speed.
Managing anomalies is not an easy task, and most teams will look to 3rd party tools or build in-house solutions. Regardless of tooling, the selected solution(s) should do the following:
We must also align on a responsibility matrix before we enact our process. If the process exists with no one responsible, or with users unable to take action due to convoluted approval flows, then anomalies will become long-term issues. Set SLAs and clearly outlined responsibilities will allow quick reaction and resolution.
NOTE: The Playbook refers to DACI (Driver, Approver, Contributor, Informed) throughout the content. This is similar to a RACI matrix, and practitioners should adapt this Playbook to the accountability model that fits their organization.
Meet the end-user where they are:
This section will guide you through the personas needed to deliver success for this capability. You may not have every persona, so adjust as needed within your organization and follow the guidance below as much as you can:
The Leadership Persona will be crucial to driving legitimacy and responsibility throughout the organization, clearing any initial roadblocks and empowering the rest of the process. They will usually start as an Approver at the strategic initiative level but quickly end up as Informed. They should be kept in the loop, but they will not be reacting or involved in the day-to-day process.
FinOps Practitioners are Contributors to this process. They are responsible for evaluating tool sets and ensuring that we engage with current company processes so end users can react quickly with as little complexity as possible. This includes understanding the thresholds to trigger on, building reporting and dashboards to expose trends over time, setting up “real-time” alerts, and ensuring the delivery and content of those alerts. We should always look to continuously improve our ability to empower this process through tooling, alerting, reporting, and process evaluation. Do not forget about the person either: people have emotions, motivations, and challenges. We are often relationship drivers and need to have empathy for everyone here. This is our responsibility to our coworkers.
Product/Application Owner Personas own the application design within a company. They are making architectural design choices and building out new products and applications. If, within your organization, they are also in charge of the agile process (features, bugs, etc.), then they are both Driver and Approver. If you are staffed with Project Managers (or similar roles) under these personas, then product/application owners will most likely be Approvers, as they own the design. They should not be blockers. In the most mature organizations, a good relationship between Product Owners and PMs will carry implicit approval for many of these activities. Drive that trust.
Project Manager/SCRUM Master personas (or similar) are Drivers, as they own what happens and when within an application team. They understand the processes: how to escalate tickets, set priority levels, and adjust team members' work, allowing end users to drop what they are doing and empowering them to react to anomalies. They work within current agile tools and with Engineering personas directly. Developing good relationships between these personas is key. Listen to feedback and try to drive positive change.
Engineering Personas are the “DO-ers”. They will often touch each and every part of DACI. They are hands on the keyboard, reacting to the tickets and doing the root cause analysis. They will provide information and advice, take action to adjust the resources, and provide the postmortem back up the chain of responsibility. Every persona is working to enable these folks to do their job the best that they can. Listen, have empathy, and use them as your experts. We do not want to be spam to these personas, so the work we do to ensure that, by the time an anomaly reaches an engineering resource, it has been confirmed to be real and should be acted upon immediately (or set to the right priority for them) will pay huge dividends. FinOps Practitioners can help massively here with relationship work between the teams.
Finance Personas should be Informed. Anomalies have impacts on budgets and forecasting, but Finance personas do not need to understand each and every detail of why an anomaly occurred and what took place to remedy it. They will be interested in the impact and whether the fix was successful or the problem is still expected to surface. They can help establish thresholds for the team and allow space for anomalous spend within their forecasting/budget process. Mistakes will happen; keep them informed at the post-mortem activity.
This section provides information that contributes to the success of this Playbook; the information here may include specific data sources, reports, or any other relevant inputs.
Cloud Data: We must have access to our cloud data across any and all CSPs we use. This data must be ingested, normalized and built into valuable reporting and alerting.
Thresholds: We must establish thresholds. These can be dollar amounts, percentages, or even based on other cloud policy restrictions. Wrong regions, incorrect instance types, and incorrect storage classes are all types of anomalies with cost impact.
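As a minimal sketch, the check below evaluates a cost record against an absolute dollar limit and a percentage increase over a baseline. The field names, limits, and numbers are illustrative assumptions, not a prescribed schema; policy-style rules (allowed regions, instance types) can be expressed in the same way.

```python
# Illustrative threshold check: the record fields and default limits are
# assumptions to adapt to your own data model and risk appetite.
def breaches_threshold(record: dict,
                       dollar_limit: float = 500.0,
                       pct_limit: float = 0.30) -> bool:
    """Flag a cost record that exceeds an absolute dollar increase or a
    percentage increase over its baseline (e.g. a trailing average)."""
    increase = record["daily_cost"] - record["baseline_cost"]
    pct_increase = increase / record["baseline_cost"] if record["baseline_cost"] else 1.0
    return increase >= dollar_limit or pct_increase >= pct_limit


# Example: a service jumping from $1,000/day to $1,600/day trips both rules.
print(breaches_threshold({"service": "storage",
                          "daily_cost": 1600.0,
                          "baseline_cost": 1000.0}))  # True
```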
Cost Allocation: This capability powers our ability to drive transparency in our cloud spend, which directly impacts our ability to identify a cost owner and engage with that owner. If we are not able to identify who to send an alert to, we have failed the first step of our process.
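The sketch below shows why allocation must come first: routing an alert is only a lookup once ownership metadata exists. The tag key, team names, and addresses are assumptions for illustration.

```python
# Illustrative tag-to-owner routing; tag keys, teams, and addresses are
# placeholders, not a real mapping.
OWNERS = {
    "team-payments": "payments-oncall@example.com",
    "team-search":   "search-oncall@example.com",
}

def route_alert(resource_tags: dict) -> str:
    """Return the alert recipient for a resource, or a fallback queue when
    the resource is untagged (the failure case called out above)."""
    team = resource_tags.get("team")
    return OWNERS.get(team, "finops-unallocated@example.com")

print(route_alert({"team": "team-payments"}))  # payments-oncall@example.com
print(route_alert({}))                         # finops-unallocated@example.com
```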
3rd Party Tools: As stated above, we cannot accomplish anomaly detection with raw cloud data alone. We will need 3rd party tools with feature sets that allow alerting and threshold building. Many tools use “x standard deviations from the mean” to automatically alert you, while others provide more levels of customization. Find what works for you.
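For intuition, here is a minimal sketch of the “standard deviations from the mean” approach many tools use under the hood. The daily cost figures and the 3-sigma cutoff are illustrative assumptions.

```python
# Minimal z-score style detector: flag today's spend if it sits more than
# `sigmas` standard deviations above the mean of recent history.
from statistics import mean, stdev

def is_anomalous(history: list[float], today: float, sigmas: float = 3.0) -> bool:
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return today > mu
    return (today - mu) / sd > sigmas

history = [210.0, 180.0, 240.0, 205.0, 190.0, 230.0, 200.0]  # daily spend, USD
print(is_anomalous(history, 260.0))  # False: within normal variation
print(is_anomalous(history, 480.0))  # True: well above the baseline
```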
If you decide to build an in-house solution, you must ensure feature parity with the needs of this playbook: alerting on thresholds, ingesting and normalizing data, the ability to break out by key metadata, and the ability to feed into agile management tools.
APIs & Integrations: CSPs now have their own APIs and tools you can use. Where applicable take advantage of these especially if it overcomes any feature set you may be lacking. APIs are great for automation. Many third party tools also have API and formal integrations you can take advantage of as well.
Triage Templates: A formal, documented triage process with the types of anomalies expected, priority and severity levels, and SLAs is needed. This will guide our personas through the process, allowing them to react quickly and think less about what they should do. They can focus on the solution, not the process.
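One way to make such a template executable is to encode the severity bands and SLAs as data. The severity names, dollar bands, and SLA hours below are illustrative assumptions to adapt to your organization.

```python
# Illustrative triage template: bands and SLAs are assumptions, not policy.
TRIAGE_TEMPLATE = [
    # (severity, minimum estimated impact in USD, response SLA in hours)
    ("SEV1", 10_000, 4),   # page the team, act immediately
    ("SEV2", 2_500, 24),   # ticket at the top of the current sprint
    ("SEV3", 500, 72),     # ticket in the normal backlog
]

def classify(estimated_impact: float):
    """Return (severity, sla_hours) for an anomaly, or None if it falls
    below the lowest band and is treated as business as usual."""
    for severity, floor, sla_hours in TRIAGE_TEMPLATE:
        if estimated_impact >= floor:
            return severity, sla_hours
    return None

print(classify(12_000))  # ('SEV1', 4)
print(classify(120))     # None: below every band
```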
Organizational Alignment: Similar to the above, we must publish the process and empower our organization to make decisions. Ensure all personas involved understand their role and who should be engaged when.
Notification is the basis of anomaly detection: we must begin with a notification to the correct persona. Without notification we fail immediately. Real-time cost alerts will power this action, but watching our trend reporting over time can also show issues building up, and we should treat that as a notification as well. Our established thresholds determine what we are notified of.
This must be quick and direct. This is where we log the anomaly in our agile management or ticketing tool. This will often start as a manual step but should be automated as quickly as we can. Integrations and APIs power that automation. The communication must include all relevant information so that the receiver does not need to do much more data gathering.
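Logging the anomaly can be automated against whatever agile or ticketing tool is already in place. As a hedged illustration, the sketch below files an issue through the Jira Cloud REST API; the base URL, project key, credentials, and field choices are placeholders to replace with your own tool and schema.

```python
# Hedged sketch: automatically logging an anomaly as a ticket via the
# Jira Cloud REST API (v2). URL, project key, and credentials are placeholders.
import requests

JIRA_URL = "https://your-domain.atlassian.net/rest/api/2/issue"
AUTH = ("finops-bot@example.com", "api-token-here")  # placeholder credentials

def log_anomaly(service: str, impact: float, details: str) -> str:
    payload = {
        "fields": {
            "project": {"key": "FIN"},      # assumed project key
            "issuetype": {"name": "Task"},
            "summary": f"Cost anomaly: {service} (+${impact:,.0f})",
            # Include all relevant information so the receiver needs no extra digging.
            "description": details,
        }
    }
    resp = requests.post(JIRA_URL, json=payload, auth=AUTH, timeout=30)
    resp.raise_for_status()
    return resp.json()["key"]  # e.g. "FIN-123"
```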
Using our triage templates and severity or prioritization rules, we will evaluate whether the threshold hit is severe, whether this is a true or false anomaly (business as usual), and whether we have seen it before. Much like a doctor in the ER, we have to prioritize based on severity and the need to react quickly. Automating our thresholds and documenting past learnings allows us to triage quickly here.
A decision must be made: is this a true anomaly, how severe is it, and what course of action is needed? The decision should be made quickly, and the decision maker should take immediate, delayed, or no action based on our triage findings. This takes place immediately within our established processes.
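A small sketch of that decision step follows. The categories and rules are assumptions for illustration; treating repeats of a known issue as top priority reflects the guidance later in this Playbook.

```python
# Illustrative decision step: map triage findings to one of three actions.
def decide(is_true_anomaly: bool, severity: str, seen_before: bool) -> str:
    """Return 'immediate', 'delayed', or 'none' based on triage findings."""
    if not is_true_anomaly:
        return "none"       # business as usual; record it and move on
    if severity == "SEV1" or seen_before:
        return "immediate"  # repeats of a known issue get top priority
    return "delayed"        # schedule within the agreed SLA

print(decide(True, "SEV2", seen_before=True))    # immediate
print(decide(False, "SEV3", seen_before=False))  # none
```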
Within our processes we will require postmortems. We will revisit our root cause analysis, the success or failure of our response, the impact of the anomaly, the effectiveness of our response in stopping it, and any learnings or best practices that need to be implemented immediately to prevent it in the future. These need to be published openly and shared across the organization. Executive personas are key here in driving overall strategy changes, as they have the widest authority for mass change.
The key is to learn from these and ensure those learnings are not siloed within the area where the anomaly occurred. If one application team is not learning from the others, you will see anomalies that were corrected in one area still occur in others. This is a waste of our teams' time that could be better spent on driving business value through features and new product lines.
As with everything in the FinOps framework, we want to evaluate our efforts. Metrics should be created and publicized; they should be visible and easily accessible. Iterating on metrics as you learn is important and encouraged: as you grow and learn, so should your KPIs. These will determine whether what we are doing is working and where we may need to improve.
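As a small, hedged example of such metrics, the sketch below computes mean time to detect and mean time to resolve from a handful of anomaly records; the record fields and timestamps are illustrative assumptions.

```python
# Illustrative KPI computation: mean time to detect and mean time to resolve.
from datetime import datetime

anomalies = [
    {"started": datetime(2024, 3, 1, 2), "detected": datetime(2024, 3, 1, 9),
     "resolved": datetime(2024, 3, 2, 9)},
    {"started": datetime(2024, 3, 10, 0), "detected": datetime(2024, 3, 10, 4),
     "resolved": datetime(2024, 3, 10, 20)},
]

def mean_hours(records, start_key, end_key):
    gaps = [(r[end_key] - r[start_key]).total_seconds() / 3600 for r in records]
    return sum(gaps) / len(gaps)

print(f"Mean time to detect:  {mean_hours(anomalies, 'started', 'detected'):.1f} h")
print(f"Mean time to resolve: {mean_hours(anomalies, 'detected', 'resolved'):.1f} h")
```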
Following our FinOps Phases – Inform, Optimize and Operate – we should continuously exercise this playbook. Anomaly detection is never “done”; we are constantly watching and reacting to anomalies. Over time we aim to reduce anomalies to near-zero levels through good policies, automation, and the right guidance, but mistakes will happen. Continuously watch and create business cases for improvements.
Postmortems are a required outcome and should feed into roadmaps and best practice documents. We must share and learn from each and every anomaly, no matter how large or small. Similar anomalies that repeat indicate a possible lack of understanding or an incorrect fix and should be treated with the highest priority. We do not want to make the same mistake twice.
The primary outcome expected from this Playbook is a thoroughly vetted, quick-reacting anomaly process. This process should be successful in finding and alerting on anomalies. It should enable processes that allow quick discovery, approval, and action. It will establish a feedback loop of learnings and best practices that will influence future architectural and engineering decisions. It should, over time, reduce the number of anomalies to a minimum, and the cost consequences should become minimal. It will not stop every anomaly: mistakes will happen, but they should have less and less impact as our process matures. Automation is the end goal for as much of the process as we can achieve.
Success can be seen in many ways, but metrics and KPIs will drive our efforts to evaluate our process and people here.
Some key metric examples are below, but there are many more that can fit your organization:
Indicators of a successful process will be seen in your metrics but also within your people.
Are we reacting seamlessly? Are we seeing requests for more information often? Are approvals causing frustration? Is there conflict in the process?
We must watch our people as much as our metrics to avoid becoming blind to emotions. We are working with people, and we need to remember that. Regularly meet and gather feedback on every single aspect: the process, tooling, reports, alerts, and the content of each.
You will want to limit the number of “exceptions” you have within this process, but there are always outliers. If you are going to ignore or grant exceptions to certain teams, be direct and justified in doing so.
Testing new features or CSP services, load testing, and staging/development/test environments will all present challenges here. Within some of these activities, creating anomalies is expected, so do your best to plan them with teams so the context is understood.
Example: If you are a tax service providing SaaS products, you will be doing a lot of load testing as you move closer to tax season. Ensure that this natural pattern is known and built into your processes. Load testing is key to supporting your customers, but it is not an excuse to run wild: build in these use cases and ensure teams return to “normal” within agreed timelines. As you learn together, find new normals and build thresholds around those learnings.