This content was provided as a Professional Contribution through the FinOps Certified Professional program.
Summary: Learn how to identify and address unexpected technology billing spikes and strengthen Anomaly Management. Transition from manual checks to automated detection systems that integrate directly into existing engineering workflows. See how this practitioner builds a shared triage process and conducts blameless post-mortems, and how organizations can contextualize anomalies and collaboratively reduce their resolution time. This guidance serves as a foundation for building a routine, cross-functional approach to maintaining predictable technology usage without disrupting development speed.
Managing anomalies is not an easy task, and most teams will look to 3rd party tools or build in-house solutions. Regardless of tooling, the selected solution(s) should do the following:
We must also align on a responsibility matrix before we enact our process. If the process exists with no one responsible, or with users unable to take action due to convoluted approval flows, then anomalies will become long-term issues. Set SLAs and clearly outlined responsibilities will allow quick reaction and resolution.
NOTE: The Playbook refers to DACI (Driver, Approver, Contributor, Informed) throughout the content. This is similar to a RACI matrix, and practitioners should adapt this Playbook to the accountability model that fits their organization.
Meet the end-user where they are:
This section will guide you through the personas needed to deliver success for this capability. You may not have every persona, so adjust as needed within your organization and follow the guidance below as much as you can:
The Leadership Persona will be crucial to driving legitimacy and responsibility throughout the organization, clearing any initial roadblocks and empowering the rest of the process. They will usually start as an Approver at the strategic initiative level but quickly end up as Informed. They should be kept in the loop, but they will not be reacting or involved in the day-to-day process.
FinOps Practitioners are Contributors to this process. They are responsible for evaluating tool sets and ensuring that we engage with current company processes so end users can react quickly with as little complexity as possible. This includes understanding the thresholds to trigger on, building reporting and dashboards to expose trends over time, setting up “real-time” alerts, and ensuring the delivery and content of those alerts. We should always look to continuously improve our ability to empower this process through tooling, alerting, reporting, and process evaluation. Do not forget about the person either: people have emotions, motivations, and challenges. We are often relationship drivers and need to have empathy for everyone here. This is our responsibility to our coworkers.
Product/Application Owner Personas own the application design within a company. They are making architectural design choices and building out new products and applications. If, within your organization, they are also in charge of the agile process (features, bugs, etc.), then they are both Driver and Approver. If you are staffed with Project Managers (or similar roles) under these personas, then product/application owners will most likely be Approvers, as they own the design. They should not be blockers. In the most mature organizations, a good relationship between Product Owners and PMs will carry implicit approval for many of these activities. Drive that trust.
Project Manager/SCRUM Master personas (or similar) are Drivers, as they own what happens and when within an application team. They understand the processes: how to escalate tickets, set priority levels, and adjust team members' work, allowing end users to drop what they are doing and empowering them to react to anomalies. They work within current agile tools and with Engineering personas directly. Developing good relationships between these personas is key. Listen to feedback and try to drive positive change.
Engineering Personas are the “DO-ers”. They will often touch each and every part of DACI. They are hands on the keyboard, reacting to the tickets and doing the root cause analysis. They will provide information and advice, take action to adjust the resources, and provide the postmortem back up the chain of responsibility. Every persona is working to enable these folks to do their job the best that they can. Listen, have empathy, and use them as your experts. We do not want to be spam to these personas, so the work we do to ensure that, by the time an anomaly reaches an engineering resource, it has been confirmed to be real and should be acted upon immediately (or set to the right priority for them) will pay huge dividends. FinOps Practitioners can help massively here with relationship work between the teams.
Finance Personas should be Informed. Anomalies have impacts on budgets and forecasting, but Finance personas do not need to understand each and every detail of why an anomaly occurred and what took place to remedy it. They will be interested in the impact and whether the fix was successful or the problem is still expected to surface. They can help establish thresholds for the team and allow space for anomalous spend within their forecasting/budget process. Mistakes will happen; keep them informed at the post-mortem activity.
This section provides information that contributes to the success of this Playbook; the information here may include specific data sources, reports, or any other relevant inputs.
Cloud Data: We must have access to our cloud data across any and all CSPs we use. This data must be ingested, normalized and built into valuable reporting and alerting.
Thresholds: We must establish thresholds. These can be dollar amounts, percentages, or even based on other cloud policy restrictions. Wrong regions, incorrect instance types, and incorrect storage classes are all types of anomalies with cost impact.
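As a minimal sketch, the check below evaluates a cost record against an absolute dollar limit and a percentage increase over a baseline. The field names, limits, and numbers are illustrative assumptions, not a prescribed schema; policy-style rules (allowed regions, instance types) can be expressed in the same way.

```python
# Illustrative threshold check: the record fields and default limits are
# assumptions to adapt to your own data model and risk appetite.
def breaches_threshold(record: dict,
                       dollar_limit: float = 500.0,
                       pct_limit: float = 0.30) -> bool:
    """Flag a cost record that exceeds an absolute dollar increase or a
    percentage increase over its baseline (e.g. a trailing average)."""
    increase = record["daily_cost"] - record["baseline_cost"]
    pct_increase = increase / record["baseline_cost"] if record["baseline_cost"] else 1.0
    return increase >= dollar_limit or pct_increase >= pct_limit


# Example: a service jumping from $1,000/day to $1,600/day trips both rules.
print(breaches_threshold({"service": "storage",
                          "daily_cost": 1600.0,
                          "baseline_cost": 1000.0}))  # True
```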
Cost Allocation: This capability powers our ability to drive transparency in our cloud spend, which directly impacts our ability to identify a cost owner and engage with that owner. If we are not able to identify who to send an alert to, we have failed the first step of our process.
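The sketch below shows why allocation must come first: routing an alert is only a lookup once ownership metadata exists. The tag key, team names, and addresses are assumptions for illustration.

```python
# Illustrative tag-to-owner routing; tag keys, teams, and addresses are
# placeholders, not a real mapping.
OWNERS = {
    "team-payments": "payments-oncall@example.com",
    "team-search":   "search-oncall@example.com",
}

def route_alert(resource_tags: dict) -> str:
    """Return the alert recipient for a resource, or a fallback queue when
    the resource is untagged (the failure case called out above)."""
    team = resource_tags.get("team")
    return OWNERS.get(team, "finops-unallocated@example.com")

print(route_alert({"team": "team-payments"}))  # payments-oncall@example.com
print(route_alert({}))                         # finops-unallocated@example.com
```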
3rd Party Tools: As stated above, we cannot accomplish anomaly detection with raw cloud data alone. We will need 3rd party tools with feature sets that allow alerting and threshold building. Many tools use “x standard deviations from the mean” to automatically alert you, while others provide more levels of customization. Find what works for you.
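For intuition, here is a minimal sketch of the “standard deviations from the mean” approach many tools use under the hood. The daily cost figures and the 3-sigma cutoff are illustrative assumptions.

```python
# Minimal z-score style detector: flag today's spend if it sits more than
# `sigmas` standard deviations above the mean of recent history.
from statistics import mean, stdev

def is_anomalous(history: list[float], today: float, sigmas: float = 3.0) -> bool:
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return today > mu
    return (today - mu) / sd > sigmas

history = [210.0, 180.0, 240.0, 205.0, 190.0, 230.0, 200.0]  # daily spend, USD
print(is_anomalous(history, 260.0))  # False: within normal variation
print(is_anomalous(history, 480.0))  # True: well above the baseline
```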
If you decide to build an in-house solution, you must ensure feature parity with the needs of this playbook: alerting on thresholds, ingesting and normalizing data, the ability to break out by key metadata, and the ability to feed into agile management tools.
APIs & Integrations: CSPs now have their own APIs and tools you can use. Where applicable take advantage of these especially if it overcomes any feature set you may be lacking. APIs are great for automation. Many third party tools also have API and formal integrations you can take advantage of as well.
Triage Templates: A formal, documented triage process with the types of anomalies expected, priority and severity levels, and SLAs is needed. This will guide our personas through the process, allowing them to react quickly and think less about what they should do. They can focus on the solution, not the process.
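One way to make such a template executable is to encode the severity bands and SLAs as data. The severity names, dollar bands, and SLA hours below are illustrative assumptions to adapt to your organization.

```python
# Illustrative triage template: bands and SLAs are assumptions, not policy.
TRIAGE_TEMPLATE = [
    # (severity, minimum estimated impact in USD, response SLA in hours)
    ("SEV1", 10_000, 4),   # page the team, act immediately
    ("SEV2", 2_500, 24),   # ticket at the top of the current sprint
    ("SEV3", 500, 72),     # ticket in the normal backlog
]

def classify(estimated_impact: float):
    """Return (severity, sla_hours) for an anomaly, or None if it falls
    below the lowest band and is treated as business as usual."""
    for severity, floor, sla_hours in TRIAGE_TEMPLATE:
        if estimated_impact >= floor:
            return severity, sla_hours
    return None

print(classify(12_000))  # ('SEV1', 4)
print(classify(120))     # None: below every band
```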
Organizational Alignment: Similar to the above, we must publish the process and empower our organization to make decisions. Ensure all personas involved understand their role and who should be engaged when.
Notification is the basis of anomaly detection: we must begin with a notification to the correct persona. Without notification we fail immediately. Real-time cost alerts will power this action, but watching our trend reporting over time can also show issues building up, and we should treat that as a notification as well. Our established thresholds determine what we are notified of.
This must be quick and direct. This is where we log the anomaly in our agile management or ticketing tool. This will often start as a manual step but should be automated as quickly as we can. Integrations and APIs power that automation. The communication must include all relevant information so that the receiver does not need to do much more data gathering.
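Logging the anomaly can be automated against whatever agile or ticketing tool is already in place. As a hedged illustration, the sketch below files an issue through the Jira Cloud REST API; the base URL, project key, credentials, and field choices are placeholders to replace with your own tool and schema.

```python
# Hedged sketch: automatically logging an anomaly as a ticket via the
# Jira Cloud REST API (v2). URL, project key, and credentials are placeholders.
import requests

JIRA_URL = "https://your-domain.atlassian.net/rest/api/2/issue"
AUTH = ("finops-bot@example.com", "api-token-here")  # placeholder credentials

def log_anomaly(service: str, impact: float, details: str) -> str:
    payload = {
        "fields": {
            "project": {"key": "FIN"},      # assumed project key
            "issuetype": {"name": "Task"},
            "summary": f"Cost anomaly: {service} (+${impact:,.0f})",
            # Include all relevant information so the receiver needs no extra digging.
            "description": details,
        }
    }
    resp = requests.post(JIRA_URL, json=payload, auth=AUTH, timeout=30)
    resp.raise_for_status()
    return resp.json()["key"]  # e.g. "FIN-123"
```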
Using our triage templates and severity or prioritization rules, we will evaluate whether the threshold hit is severe, whether this is a true or false anomaly (business as usual), and whether we have seen it before. Much like a doctor in the ER, we have to prioritize based on severity and the need to react quickly. Automating our thresholds and documenting past learnings allows us to triage quickly here.
A decision must be made: is this a true anomaly, how severe is it, and what course of action is needed? The decision should be made quickly, and the decision maker should take immediate, delayed, or no action based on our triage findings. This takes place immediately within our established processes.
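A small sketch of that decision step follows. The categories and rules are assumptions for illustration; treating repeats of a known issue as top priority reflects the guidance later in this Playbook.

```python
# Illustrative decision step: map triage findings to one of three actions.
def decide(is_true_anomaly: bool, severity: str, seen_before: bool) -> str:
    """Return 'immediate', 'delayed', or 'none' based on triage findings."""
    if not is_true_anomaly:
        return "none"       # business as usual; record it and move on
    if severity == "SEV1" or seen_before:
        return "immediate"  # repeats of a known issue get top priority
    return "delayed"        # schedule within the agreed SLA

print(decide(True, "SEV2", seen_before=True))    # immediate
print(decide(False, "SEV3", seen_before=False))  # none
```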
Within our processes we will require postmortems. We will revisit our root cause analysis, the success or failure of our response, the impact of the anomaly, the effectiveness of our response in stopping it, and any learnings or best practices that need to be implemented immediately to prevent it in the future. These need to be published openly and shared across the organization. Executive personas are key here in driving overall strategy changes, as they have the widest authority for mass change.
The key is to learn from these and ensure those learnings are not siloed within the area where the anomaly occurred. If one application team is not learning from the others, you will see anomalies that were corrected in one area still occur in others. This is a waste of our teams' time that could be better spent on driving business value through features and new product lines.
As with everything in the FinOps framework, we want to evaluate our efforts. Metrics should be created and publicized; they should be visible and easily accessible. Iterating on metrics as you learn is important and encouraged: as you grow and learn, so should your KPIs. These will determine whether what we are doing is working and where we may need to improve.
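As a small, hedged example of such metrics, the sketch below computes mean time to detect and mean time to resolve from a handful of anomaly records; the record fields and timestamps are illustrative assumptions.

```python
# Illustrative KPI computation: mean time to detect and mean time to resolve.
from datetime import datetime

anomalies = [
    {"started": datetime(2024, 3, 1, 2), "detected": datetime(2024, 3, 1, 9),
     "resolved": datetime(2024, 3, 2, 9)},
    {"started": datetime(2024, 3, 10, 0), "detected": datetime(2024, 3, 10, 4),
     "resolved": datetime(2024, 3, 10, 20)},
]

def mean_hours(records, start_key, end_key):
    gaps = [(r[end_key] - r[start_key]).total_seconds() / 3600 for r in records]
    return sum(gaps) / len(gaps)

print(f"Mean time to detect:  {mean_hours(anomalies, 'started', 'detected'):.1f} h")
print(f"Mean time to resolve: {mean_hours(anomalies, 'detected', 'resolved'):.1f} h")
```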
Following our FinOps Phases – Inform, Optimize and Operate – we should continuously exercise this playbook. Anomaly detection is never “done”; we are constantly watching and reacting to anomalies. Over time we aim to reduce anomalies to near-zero levels through good policies, automation, and the right guidance, but mistakes will happen. Continuously watch and create business cases for improvements.
Postmortems are a required outcome and should feed into roadmaps and best practice documents. We must share and learn from each and every anomaly, no matter how large or small. Similar anomalies that repeat indicate a possible lack of understanding or an incorrect fix and should be treated with the highest priority. We do not want to make the same mistake twice.
The primary outcome expected from this Playbook is a thoroughly vetted, quick-reacting anomaly process. This process should be successful in finding and alerting on anomalies. It should enable processes that allow quick discovery, approval, and action. It will establish a feedback loop of learnings and best practices that will influence future architectural and engineering decisions. It should, over time, reduce the number of anomalies to a minimum, and the cost consequences should become minimal. It will not stop every anomaly: mistakes will happen, but they should have less and less impact as our process matures. Automation is the end goal for as much of the process as we can achieve.
Success can be seen in many ways, but metrics and KPIs will drive our efforts to evaluate our process and people here.
Some key metric examples are below, but there are many more that can fit your organization:
Indicators of a successful process will be seen in your metrics but also within your people.
Are we reacting seamlessly? Are we seeing requests for more information often? Are approvals causing frustration? Is there conflict in the process?
We must watch our people as much as our metrics to avoid becoming blind to emotions. We are working with people, and we need to remember that. Regularly meet and gather feedback on every single aspect: the process, tooling, reports, alerts, and the content of each.
You will want to limit the number of “exceptions” you have within this process, but there are always outliers. If you are going to ignore or grant exceptions to certain teams, be direct and justified in doing so.
Testing new features or CSP services, load testing, and staging/development/test environments will all present challenges here. Within some of these activities, creating anomalies is expected, so do your best to plan them with teams so the context is understood.
Example: If you are a tax service providing SaaS products, you will be doing a lot of load testing as you move closer to tax season. Ensure that this natural pattern is known and built into your processes. Load testing is key to supporting your customers, but it is not an excuse to run wild: build in these use cases and ensure teams return to “normal” within agreed timelines. As you learn together, find new normals and build thresholds around those learnings.