Managing Anomalies is one of the capabilities of the FinOps Foundation Framework, which has the objective of detecting and correcting cost anomalies that might occur in cloud consumption. This document aims to inform you of what anomalies are, why it’s important to address them, and what the lifecycle of managing anomalies looks like.
Understanding how to manage cloud cost anomalies brings FinOps and organizational benefits such as:
All information contained in this document has been created by FinOps Foundation Community members and is based on first-hand experiences. The team expects this documentation and information to expand and deepen as it is presented to the wider community for feedback and improvements.
The FinOps Foundation defines a cloud cost anomaly as:
Anomalies in the context of FinOps are unpredicted variations (resulting in increases) in cloud spending that are larger than would be expected given historical spending patterns.
Let’s unpack each part of this definition to better understand the nuances at play with cloud cost anomalies. To expand on the definition above, we’ll also cover the “level” at which such anomalies are identified, the “timescale” over which they are measured, and, last but not least, what constitutes “cost”.
Identifying and defining all of these terms related to cloud cost anomalies allows us to better define the lifecycle of that entity. We’ll try to address each of the points before we proceed to the lifecycle, algorithm, and other interesting details of managing the anomalies.
Note: We’ve also updated the FinOps Terminology & Definitions asset on the finops.org website so practitioners can learn the terms within this documentation alongside other important concepts from other capabilities.
It’s important to note that we are not simply talking about “outliers” (one method of approaching anomalies). We are actually looking to find the “expected” or “predicted” costs for a period, and then measure whether the actual costs accumulated in that period deviate from them.
The level of variation which is considered an anomaly can greatly differ based on the size and type of company, the scope of how much they use cloud, and other variables of their particular operation.
Cost anomaly detection focuses on identifying deviations from an expected rate of spend. Organizations in the crawl or walk phase of anomaly detection typically focus on cost increases only. Companies just launching an anomaly detection system and processes will need to fine-tune the settings and prove out their alerting/notification process. It typically takes time to improve the signal-to-noise ratio until false positives reach an acceptable level.
While organizations primarily prioritize cost increases, mature (i.e. Run phase) FinOps organizations should investigate decreases as well. A cost anomaly can be an indicator of an underlying technology or business issue. For example, a misconfigured autoscaling system may cause a cost increase, or decrease if it fails to upscale. Typically, there are other systems in place to identify such issues so most organizations will focus strictly on cost increases.
Cost anomalies can be split into three types:
Anomalous drop in unit economics featuring revenue / cost metrics: When available, the drop in the ratio between revenues derived from the cloud compute environment to the cost of running that environment is a strong indication of loss of efficiency in the way the resources are used. It can be broken down by the unit of revenue for a company (e.g., a player for a gaming company).
Most anomaly detection systems utilize historical data as a basis for detecting anomalies. The systems may range in sophistication from a simple percent increase in spend to machine learning based models that understand (historical) spend patterns but are still based on learnings from historical data and lack future awareness. The downside to not being future aware is more false positives.
More sophisticated systems are future aware and include forecast (budget) and event data in their models. These systems rely on a combination of historical data and future data to determine anomalies with greater accuracy than historical data alone. Forecast data is often aggregated at a relatively high level that makes it unusable without understanding the historical patterns.
More specifically, forecasts are typically made by month, at the department level, and by resource type. Knowing that you expect compute resource costs to increase 25% for a given month over the prior year is helpful but not sufficient, as spend patterns often vary from day to day in ways that can be seen in the historical data but are lost in a monthly bucket.
Taking the discussion on variation further, once you have established a minimum threshold, you still need to distinguish a low-impact anomaly from a high-impact one. It usually works best if business users have some control over what counts as a low, medium, high, or critical anomaly, and can set alerts only on the high and critical ones while leaving the low and medium ones for offline analysis.
Also note that in a day, an anomaly may start off with low severity but as it accumulates more costs, it may escalate to a high or critical level.
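The severity bands and the escalation described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the dollar thresholds, the names `SEVERITY_FLOORS`, `classify`, and `should_alert`, and the sample figures are all assumptions made for the example. In practice the business users would set the bands.

```python
from typing import Optional

# Hypothetical severity bands: minimum excess cost (in $ above expected spend)
# for each level, ordered from most to least severe. The lowest floor acts as
# the minimum threshold below which no anomaly is recorded.
SEVERITY_FLOORS = [
    ("critical", 5000.0),
    ("high", 1000.0),
    ("medium", 250.0),
    ("low", 50.0),
]

# Only these levels raise an alert; the rest are kept for offline analysis.
ALERTING_LEVELS = {"high", "critical"}


def classify(excess_cost: float) -> Optional[str]:
    """Return the severity band for an excess cost, or None if below the minimum."""
    for level, floor in SEVERITY_FLOORS:
        if excess_cost >= floor:
            return level
    return None


def should_alert(severity: Optional[str]) -> bool:
    """Alert only on the levels the business has chosen to be notified about."""
    return severity in ALERTING_LEVELS


# Hypothetical cumulative excess cost of one anomaly at three points in a day.
for hour, excess in [(6, 120.0), (12, 900.0), (18, 2400.0)]:
    severity = classify(excess)
    print(f"{hour:02d}:00  excess=${excess:,.0f}  severity={severity}  alert={should_alert(severity)}")
```

Re-evaluating the same anomaly as its cost accumulates is what lets it move from a low band in the morning to an alerting band later in the day.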
This is where the Anomaly Detection process separates itself from budgets.
Budgets are usually created and monitored on a monthly, quarterly or annual basis. This does not leave any room to find inter-day variation. We have seen the Anomaly Detection working best on a timescale of a day or consecutive days when an anomaly persists.
Detecting and managing anomalies is critical to avoiding unwanted and surprising billing charges. An identified anomaly could be an indicator that there is an infrastructure issue, a software bug, a possible cyberattack, or any other problem that might cause a constant increase of unpredicted cloud costs.
Anomaly management is an important mechanism for keeping costs on target, or at least mitigating damages, when anomalous events occur. Without anomaly management, organizations rely on chance to discover unusual spend increases, which is not a good position to be in.
For example, a cost anomaly might indicate heavy utilization of a certain type of resource that might exhaust its capacity due to unwanted workload. Even though cloud resources are usually highly scalable and built in such a way that there are no issues with performance, there are situations where resources or service flows might not be so resilient when facing increased performance demands. In this case, not only will cloud costs increase, but service levels and reliability can drop, which can negatively impact revenue or reputation.
Now that we have established what a cost anomaly is, let’s look at its lifecycle:
It’s important to note that the lifecycle steps are implemented differently across organizations. Labor capacity, automation, and FinOps maturity are just some of the factors which affect the implementations of anomaly management and the efficiency of the execution.
Below are some sample implementations of the anomaly lifecycle.
The first step of the life cycle, record creation, involves detecting the anomaly (and whether it is actually one) to kick off the entire process. Here we address best practices and challenges involved with anomaly record creation.
The general process of any anomaly detection method is to take data, learn what is normal, and then apply a statistical test to determine whether any data point for the same time series in the future is normal or abnormal.
Let’s take a closer look at the data pattern in the figure below. The shaded area was produced because of such analysis. We could, therefore, apply statistical tests such that any data point outside of the shaded area is defined as abnormal and anything within it is normal.
Detecting anomalies in time series involves three basic steps:
The most complicated part of this scheme is the estimation phase. The normal pattern over time of the cost may include effects such as: trends, seasonal patterns (e.g., daily, weekly and/or monthly patterns), normal changes of the pattern due to change of usage and effects due to known events that may impact cloud usage (e.g., known tech changes in products, holidays, product release, marketing campaigns, etc).
A model that encodes and estimates all of the above will be more accurate than a model that does not. For example, a simple statistical model for anomaly detection would estimate a 7-day running average of the cost, with its corresponding standard deviation.
Such a model accounts for slow changes (trends), but would not capture the effect of a weekly seasonal pattern and might trigger a false positive alert every Monday, if the weekly pattern exhibits lower cost over the weekend. However, a model that can encode seasonality, would not trigger an alert on Monday because it knows that Mondays are higher than the weekend.
Therefore, the more data we can have, the better (i.e. historical cost data). For example, to capture annual seasonality, you would need at least a year of data if not multiple years. However, cost fluctuation on shorter cycles (e.g., weekly, monthly), can produce good estimates with less data. A good rule of thumb of the minimum data required is having at least two cycles of a seasonal pattern in order to estimate it (e.g., for weekly patterns, two weeks can be sufficient to learn it).
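To make the weekly-seasonality point concrete, here is a minimal Python sketch comparing a plain 7-day average with a forecast built only from the same weekday in earlier weeks. The cost figures and the function names `naive_forecast` and `seasonal_forecast` are assumptions for illustration, and a production model would typically encode trend and known events as well.

```python
from statistics import mean, stdev

# Hypothetical daily costs in dollars for three full weeks, Monday first.
# Weekdays sit near $100 and weekends near $60.
history = [
    100, 102, 98, 101, 99, 60, 58,
    103, 101, 99, 100, 102, 61, 59,
    101, 100, 102, 99, 101, 62, 60,
]


def naive_forecast(costs, window=7):
    """Mean and standard deviation of the most recent `window` days."""
    recent = costs[-window:]
    return mean(recent), stdev(recent)


def seasonal_forecast(costs, period=7):
    """Mean and standard deviation of past days at the same point in the cycle
    as the next (not yet observed) day."""
    same_phase = costs[len(costs) % period::period]
    if len(same_phase) < 2:
        raise ValueError("need at least two full cycles to estimate the pattern")
    return mean(same_phase), stdev(same_phase)


# The next day to predict is a Monday.
naive_mean, naive_std = naive_forecast(history)
seasonal_mean, seasonal_std = seasonal_forecast(history)
print(f"naive:    {naive_mean:.1f} +/- {naive_std:.1f}")
print(f"seasonal: {seasonal_mean:.1f} +/- {seasonal_std:.1f}")
```

With this data the naive average is pulled down by the weekend and its spread is inflated, while the same-weekday estimate stays close to a typical Monday. The `ValueError` mirrors the rule of thumb that at least two cycles are needed before a seasonal pattern can be estimated.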
Prediction: In the prediction phase, the model considers all the data up to the previous time step (e.g., up to yesterday) and uses it to predict the next cost (e.g., today’s cost). The predicted cost is then compared to the actual cost, and if the two are significantly different, the data point is flagged as an anomaly. Significance is typically determined using a prediction confidence interval: a measurement is considered anomalous if it falls outside that interval.
For example, with the simple statistical model based on the average and standard deviation of the last 7 days, the prediction is the 7-day average and the confidence interval can be average ± 3 × standard deviation, which is a reasonable interval if the underlying data follows a Gaussian (normal) distribution. Therefore, for anomaly detection, the model used should also have the ability to estimate an accurate confidence interval, and not just a point prediction.
Because there will be normal variation in the data day over day, we don’t want every change to become an anomaly. Therefore, we use the confidence interval to be able to predict within that range what costs are likely and normal to occur. For those familiar with Statistical Process Control, these confidence intervals can be plotted on “Control Charts” or XMR charts.
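The simple 7-day model described above can be sketched as follows. This is only an illustration under stated assumptions: the cost series is made up, the function name `detect_anomalies` is hypothetical, and the sample standard deviation is used as the estimate of spread.

```python
from statistics import mean, stdev


def detect_anomalies(costs, window=7, k=3.0):
    """Flag each point that falls outside mean +/- k * std of the preceding
    `window` points. Returns a list of (index, actual, lower, upper)."""
    anomalies = []
    for i in range(window, len(costs)):
        past = costs[i - window:i]          # data up to the previous time step
        mu, sigma = mean(past), stdev(past)
        lower, upper = mu - k * sigma, mu + k * sigma
        if not lower <= costs[i] <= upper:  # outside the confidence interval
            anomalies.append((i, costs[i], lower, upper))
    return anomalies


# Hypothetical daily costs in dollars; day 9 contains a spike.
daily_costs = [100, 104, 98, 101, 97, 103, 99, 102, 100, 180, 101]

for index, actual, lower, upper in detect_anomalies(daily_costs):
    print(f"day {index}: ${actual} is outside [{lower:.1f}, {upper:.1f}]")
```

Note that in this simple sketch the spike itself enters the following windows and widens the interval for the next few days, which is one reason more robust estimators are often preferred in practice.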
Detection: When an event has been identified as an anomalous event, it’s important to capture a sequence of observations related to the event to ensure the anomaly record created provides sufficient meaningful information to those who will review the record. Some of this methodology blends in with the next lifecycle step, Notification, where this documentation will walk through an example.
Much has been discussed regarding identifying and resolving anomalies, the bulk of which can be classified into the following categories.
Without a strategy to reduce false positives, individuals analyzing anomaly records may choose to disable or ignore alerts. False positives consume critical bandwidth, so it is essential that anomaly detection algorithms be carefully tuned to identify events that are likely to be validated as true anomalies without also flagging too many other cost increases as anomalous.
It’s important to utilize anomaly detection even if the data is not complete as the costs will just continue to grow as the data is updated. If you take the approach of not waiting for all the cost data to be complete, the anomaly detection should be reapplied on the same data to avoid missing spikes that were not detected when the data was not yet complete. Systems that look beyond cost data and include visibility data have an advantage with early anomaly detection.
It is a common practice to start identifying what impacts your business at large. (See below image for a comparison of cloud hierarchy structures for AWS, Azure, and GCP)
This will ensure you do not get false alarms if a service runs in another project for some reason: while that may be a flag for the project/solution owner, nothing is broken at the organization level. This view will be of most interest to centrally located FinOps teams. Once you have mastered the “recipe” of identifying the right anomalies, it may be time to go down to the next level of AWS Accounts / Projects, so that solution owners can also monitor their own costs and anomalies.
Source: Team, E. (2022, June 8). AWS, Azure and GCP: The Ultimate IAM Comparison. Security Boulevard.
Once we have created a record of the anomaly and validated that it is one, it’s time to notify the correct stakeholders to progress through the anomaly detection lifecycle and take proper action.
Notable roles who may need to be informed include:
In a later section of this documentation, you’ll find a table of Persona Responsibilities, where the Working Group conveys which FinOps or adjacent business personas are either responsible, accountable, consulted, or informed across a number of anomaly management lifecycle phases.
In future sprints, we’ll better define when and how to inform these various roles and personas whenever a cloud cost anomaly occurs. This includes user stories to help define different scenarios of various organizations and enterprises that use cloud at scale.
Reporting on these anomalies can help bring to light gaps in communication. If infrastructure changes can be communicated to finance in advance, finance can make more accurate forecasts. Anomaly detection therefore isn’t just about addressing cost increases; it is about building lines of communication about what is happening in the infrastructure.
Let’s consider the multi-day anomaly shown in the example figure above. Having separate records, one for each day of the anomaly event, would yield the following results:
In this part of the lifecycle, it’s time to take a closer look at the cloud cost anomaly. Learn more about what goes on during this phase, which personas are involved, and challenges to look out for.
Once cloud cost anomalies are identified, they require triage and investigation. The initial triage should look at the severity of the anomaly and the probability of the anomaly being caused by normal business activities in order to determine the response.
Details of the anomaly, such as the trigger and run rate above baseline, should be provided by the anomaly detection system. A mature organization will have predefined thresholds dictating the response. For example, a 5% increase in cost for a department might be addressed during a regularly scheduled meeting, whereas a 100% increase would trigger an all-hands-on-deck response.
One tried-and-true workflow to follow during anomaly analysis after initiating investigation includes tracking cost changes, re-forecasting, and identifying ownership for remediation.
Tips for analyzing potential anomalies:
Once the cost impact ($) and expected duration of the anomalous usage have been identified, note as many anomalies as you can (preferably all the alerts above your defined threshold) in the change space of the budget inventory. This allows the FinOps practitioner to estimate the impact of each anomaly and re-forecast.
For example, suppose the budget is maintained in a CSV file. Tracking anomaly changes may then look as follows. With this approach, you should be able to view the cost impacts at any given point in time.
|Jan 2023||Feb 2023||March 2023||…||Dec 2023|
|Fixed Budget (Account X)||$1000||$1100||$1100||$1100|
|Changes in Budget (anomalies)|
|Anomaly 1 cost impact||$100||$100||$100|
|Anomaly 2 cost impact||$50||$50||…|
|Anomaly <n> cost impact||$75||…||$75|
Re-forecasting means reflecting the cost impacts in the initial budget. Providing a re-forecast and the budget impact will, for example, raise the budget owner’s sense of urgency to fix the anomalies. A heads-up on the effects will also benefit C-level executives and the finance department.
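As a rough sketch of the re-forecasting arithmetic, the snippet below adds the tracked anomaly impacts to a fixed monthly budget. The month keys, the dollar amounts, the assignment of each impact to a month, and the function name `reforecast` are all assumptions chosen for illustration, not values taken from a real budget file.

```python
# Hypothetical fixed budget per month for one account, in dollars.
fixed_budget = {"2023-01": 1000.0, "2023-02": 1100.0, "2023-03": 1100.0}

# Hypothetical cost impact of each tracked anomaly, per month it affects.
anomaly_impacts = {
    "Anomaly 1": {"2023-01": 100.0, "2023-02": 100.0, "2023-03": 100.0},
    "Anomaly 2": {"2023-02": 50.0, "2023-03": 50.0},
}


def reforecast(budget, impacts):
    """Return the budget with every anomaly's monthly impact added on top."""
    result = {}
    for month, amount in budget.items():
        extra = sum(per_month.get(month, 0.0) for per_month in impacts.values())
        result[month] = amount + extra
    return result


for month, amount in reforecast(fixed_budget, anomaly_impacts).items():
    overage = amount - fixed_budget[month]
    print(f"{month}: re-forecast ${amount:,.0f} (budget impact ${overage:,.0f})")
```

Keeping the fixed budget and the anomaly impacts as separate rows, as in the table above, makes it possible to see the budget impact of each anomaly at any point in time.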
The last step of this phase is to identify the workflows required and roles responsible for resolving anomalies. We’ll dive deeper into what’s required for FinOps teams generally in the next section.
After a complete analysis of an anomaly, a decision needs to be made as to what additional actions, if any, should be taken as a result of the anomaly and by whom. The resolution outcome depends on the findings from analysis. Below is a list of common resolution outcomes & reasoning.
In any case, a decision needs to be made regarding what should happen next. You may conclude no further action is necessary, or that you need to engage with an engineer to shut down an out of control resource or perhaps you need to notify multiple individuals in your organization regarding a likely security breach.
In scenarios where the anomaly event threatens the ability to stay on budget, we suggest you notify the budget owner and engage with relevant product and engineering teams on cost optimization opportunities to mitigate budget overages.
During the retrospective phase, organizations should reflect on what went well, what didn’t go well, and what can be improved in the future. The goal is to identify areas of improvement and to implement changes that will improve anomaly management. Retrospectives should be scheduled on a regular cadence, such as monthly, quarterly, or annually.
The retrospective should include the key players that receive, analyze and remediate anomalies. Whoever is leading the retrospective should distribute a list of anomalies that have happened since the last retrospective to allow participants to select and come prepared to drill into specific items. The team should consider if all anomalies will be reviewed or if there is a certain threshold that warrants deeper analysis. It’s important to make sure that everyone on the team understands the purpose of the retrospective and is prepared to provide constructive feedback.
Retrospectives are most effective when all stakeholders share open and honest feedback. An effective method for evaluating existing processes is the “Start, Stop, Continue” discussion exercise during which participants identify actions that should be newly implemented, stopped or continued. Tooling and methods should be considered during retrospective as well.
Once the team has identified the anomaly management changes to implement they should develop an action plan for completing the implementation of the changes. Conducting regular retrospectives helps drive continuous improvement into your anomaly detection and management processes.
Many measurements can be taken in the context of anomaly detection, and they vary from case to case and with the maturity of the organization. Here are some Key Performance Indicators (KPIs) as examples:
Most of the metrics can be categorized and measured at the application, department, product, and/or organization level for more granularity and to identify any major outliers (for example, a specific product that is more prone to generating anomalies, or a department that needs further education).
Let’s dig into one of the above KPIs as an example: the impact of anomaly detection.
In general, evaluating the quality of an anomaly detection model (or system) requires measuring the rate of true positives vs false positives, i.e., how many true anomalies were detected vs how many normal events were wrongly flagged as anomalies. For a FinOps organization, an anomaly detection model or system should be evaluated not just on these two measures of accuracy but also on the impact of those anomalies and the cost of detecting them in the first place. The impact of anomaly detection can be calculated as follows:
Impact of anomaly detection ($) = Cost value of true positives ($) - (Cost of false positives ($) + Cost of false negatives ($) + Cost of anomaly detection ($))
The impact of false positives is measured as the cost of time wasted until the anomaly is determined to be a false positive. It is harder to compute exactly, therefore it is recommended to measure it for a few cases and use the average time to translate it to a cost estimate based on a standard hourly rate.
The impact of true positives is measured by the total cost these anomalies represent. For example, an anomalous increase in the cost of a DB service was deemed to have been caused by a badly formed SQL query. The impact of the anomaly is the excess cost, above the expected range, of the DB cost metric. The figure below illustrates the impact of an anomaly as the area between the expected range and the actual cost measured.
The third component of the formula measures the cost of false negatives: anomalies that the anomaly detection system or methodology missed and that were only detected later by chance, resulting in increased cloud costs.
The fourth and final term in the equation is the cost of detecting and responding to the anomalies themselves. This portion of the equation should consider the cost of involved persons’ time and if applicable the cost of any tooling either built or bought for anomaly detection and management.
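The formula and its four terms can be worked through with a small Python sketch. All figures, the hourly rate, and the function names (`excess_cost`, `false_positive_cost`, `anomaly_detection_impact`) are hypothetical and serve only to show how the terms combine.

```python
def excess_cost(actual_costs, expected_upper_bounds):
    """Cost above the expected range, summed over the days of an anomaly
    (the area between the expected range and the actual cost)."""
    return sum(
        max(0.0, actual - upper)
        for actual, upper in zip(actual_costs, expected_upper_bounds)
    )


def false_positive_cost(count, avg_hours_per_case, hourly_rate):
    """Time wasted on false positives, converted to dollars with an average
    investigation time and a standard hourly rate."""
    return count * avg_hours_per_case * hourly_rate


def anomaly_detection_impact(true_positive_cost, fp_cost, fn_cost, detection_cost):
    """Impact ($) = true positives - (false positives + false negatives + detection)."""
    return true_positive_cost - (fp_cost + fn_cost + detection_cost)


# Hypothetical three-day anomaly: actual daily cost vs. top of the expected range.
print(excess_cost([120.0, 150.0, 130.0], [110.0, 110.0, 110.0]))

# Hypothetical figures for one quarter.
fp = false_positive_cost(count=20, avg_hours_per_case=1.5, hourly_rate=80.0)
impact = anomaly_detection_impact(
    true_positive_cost=12000.0,  # excess cost of the anomalies that were caught
    fp_cost=fp,                  # time spent on false alarms
    fn_cost=3000.0,              # excess cost of anomalies found late by chance
    detection_cost=1500.0,       # people time and tooling
)
print(fp, impact)
```

A positive result suggests the detection process is recovering more than it costs, while a negative one points to tuning (fewer false positives or false negatives) or cheaper tooling.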
The following is a table that presents cloud cost anomaly management lifecycle steps, relevant tasks, and how various FinOps personas are either Responsible (R), Accountable (A), Consulted (C), or Informed (I).
|Record Creation||Anomaly Definition||R||C||C||C||A|
|Record Creation||Tool Setup/Maintenance||A||C||R||I||I|
|Retrospective||Measure, reflect & improve||A||R||R||C||C|
In future sprints, we’ll better describe who is responsible for what and when and how these individuals carry out their responsibilities.
The complexity and depth of cloud cost anomaly management varies as cloud usage differs from company to company. The severity of an anomaly and the methods by which it’s identified and resolved will vary based on the industry, organization, maturity, and many other factors for practitioners and their teams.
Our Working Group hopes that this documentation can at least help FinOps practitioners build a starting point to better understand cloud cost anomalies, how to identify, analyze, and resolve them, who should be informed, and what measurements to utilize to help determine the right outcomes for success. We hope this guidance is foundational and evergreen, helping any FinOps practitioner to begin building an anomaly management practice for their FinOps team.
This effort so far has just been our initial sprint and we hope that, with the efforts and feedback from the community, we can improve this documentation even further.
Our community welcomes feedback and recommendations on improving the content in this documentation. You can get in touch in our Slack channel, #chat-anomalies, to discuss any of the content here or to pitch new content if you’re interested in being a part of a future sprint.
Cloud cost anomaly management is a part of a larger FinOps curriculum which we encourage practitioners to learn more about. If this education and training is something you or your teammates require, consider our FinOps Certified Professional course.
If you are reading this and aren’t a FinOps Foundation member yet, we welcome you to sign up and join.
The FinOps Foundation extends a huge thank you to the members of this Working Group that broke ground on this documentation:
We’d also like to thank any community members who have helped us kick-start this documentation.
Lastly, a big thank you to the FinOps Foundation support team for helping us bring our work to life: Samantha White (Program Management), Tom Sharpe (Design), and Andrew Nhem (Staff Sponsor and Content).