
Managing Cloud Cost Anomalies

Introduction

Managing Anomalies is one of the capabilities of the FinOps Foundation Framework, which has the objective of detecting and correcting cost anomalies that might occur in cloud consumption. This document aims to inform you of what anomalies are, why it’s important to address them, and what the lifecycle of managing anomalies looks like.

Benefits of this guidance

Understanding how to manage cloud cost anomalies provides FinOps and organizational benefits such as:

  • Learning what a cloud cost anomaly is and how it might affect your product or organization’s cost and usage efficiency
  • Understanding the lifecycle of a cloud cost anomaly, from reporting one through to resolution
  • Reviewing key performance indicators involved with anomaly management
  • Identifying business roles and personas, whether FinOps related or adjacent, who need to be involved or informed about cloud cost anomalies

All information contained in this document has been created by FinOps Foundation Community members and is based on first-hand experiences. The team expects this documentation and information to expand and deepen as it is presented to the wider community for feedback and improvements.

Cloud cost anomaly definition

The FinOps Foundation defines a cloud cost anomaly as:

Anomalies in the context of FinOps are unpredicted variations (resulting in increases) in cloud spending that are larger than would be expected given historical spending patterns.

Let’s unpack each part of this definition to better understand all of the nuances at play with cloud cost anomalies. To expand on the definition above, we’ll also cover the “level” at which such anomalies are identified, at what “timescale,” and, last but not least, what constitutes “cost”.

Identifying and defining all of these terms related to cloud cost anomalies allows us to better define the anomaly lifecycle. We’ll address each of these points before we proceed to the lifecycle, algorithm, and other details of managing anomalies.

Note: We’ve also updated the FinOps Terminology & Definitions asset on the finops.org website so practitioners can learn the terms within this documentation alongside other important concepts from other capabilities.

Unpredicted variation

It’s important to note that we are not simply talking about “outliers” (one method of approaching anomalies); rather, we are looking to find the “expected” or “predicted” costs for a period, and then measure whether the actual costs accumulated in that period deviate from that expectation.

The level of variation which is considered an anomaly can greatly differ based on the size and type of company, the scope of how much they use cloud, and other variables of their particular operation.

Cost-driven anomalies

Cost anomaly detection focuses on identifying deviations from an expected rate of spend. Organizations in the crawl or walk phase of anomaly detection typically focus on cost increases only. Companies just launching an anomaly detection system and processes will need to fine-tune the settings and prove out their alerting/notification process. It typically takes time to improve the signal-to-noise ratio to an acceptable level of false positives.

While organizations primarily prioritize cost increases, mature (i.e. Run phase) FinOps organizations should investigate decreases as well. A cost anomaly can be an indicator of an underlying technology or business issue. For example, a misconfigured autoscaling system may cause a cost increase, or a decrease if it fails to scale up. Typically, there are other systems in place to identify such issues, so most organizations will focus strictly on cost increases.

Cost anomalies can be split into three types:

  1. Anomalous spikes in total costs: finding that the total cost of a service spiked over recent days compared to normal.
  2. Anomalous spikes in Cost per Usage: finding that the amount paid per unit of usage spiked – e.g., if there is a spike in the cost per hour of compute, it may indicate an increase in on-demand vs. discounted plan costs, or a switch to more expensive resources relative to the increase in usage.
  3. Anomalous drop in unit economics featuring revenue / cost metrics: when available, a drop in the ratio of revenues derived from the cloud compute environment to the cost of running that environment is a strong indication of a loss of efficiency in how the resources are used. It can be broken down by the unit of revenue for a company (e.g., a player for a gaming company).
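
To make these three signal types concrete, below is a minimal, illustrative Python sketch. The field names, sample values, and the 3-standard-deviation threshold are assumptions for the example, not part of any standard billing schema; it simply flags a spike in total cost, a spike in cost per unit of usage, and a drop in the revenue-to-cost ratio.

```python
# Illustrative sketch only: computes the three anomaly signals described above
# from hypothetical daily records for a single service.
from statistics import mean, stdev

def flag_spike(series, threshold=3.0):
    """Flag the latest value if it exceeds mean + threshold * stdev of the history."""
    history, latest = series[:-1], series[-1]
    if len(history) < 2:
        return False
    return latest > mean(history) + threshold * stdev(history)

daily = [
    # (total cost, usage hours, revenue) per day for one service
    {"cost": 100.0, "usage_hours": 50, "revenue": 400.0},
    {"cost": 105.0, "usage_hours": 52, "revenue": 410.0},
    {"cost": 98.0,  "usage_hours": 49, "revenue": 395.0},
    {"cost": 240.0, "usage_hours": 55, "revenue": 405.0},  # suspicious day
]

total_cost       = [d["cost"] for d in daily]                      # type 1: total cost
cost_per_usage   = [d["cost"] / d["usage_hours"] for d in daily]   # type 2: cost per unit of usage
revenue_per_cost = [d["revenue"] / d["cost"] for d in daily]       # type 3: unit economics (revenue / cost)

print("Total cost spike:    ", flag_spike(total_cost))
print("Cost-per-usage spike:", flag_spike(cost_per_usage))
# For unit economics we care about drops, so flag when the ratio falls below the band.
history, latest = revenue_per_cost[:-1], revenue_per_cost[-1]
print("Unit economics drop: ", latest < mean(history) - 3.0 * stdev(history))
```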

Historical patterns

Most anomaly detection systems utilize historical data as a basis for detecting anomalies. The systems may range in sophistication from a simple percent increase in spend to machine learning based models that understand (historical) spend patterns but are still based on learnings from historical data and lack future awareness. The downside to not being future aware is more false positives.

More sophisticated systems are future aware and include forecast (budget) and event data in their models. These systems rely on a combination of historical data and future data to determine anomalies with greater accuracy than historical data alone. Forecast data is often aggregated at a relatively high level that makes it unusable without understanding the historical patterns.

More specifically, forecasts are typically made by month, at the department level, by resource type. Knowing that you expect compute resource costs to increase 25% for a given month over the prior year is helpful, but not sufficient, as spend patterns often have day-to-day variability that can be seen in the historical data but is lost in a monthly bucket.
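
As a hedged illustration of why the daily pattern matters, the sketch below spreads a monthly forecast across the days of the month using day-of-week weights learned from historical daily costs, so that weekday/weekend variability is not lost in the monthly bucket. The data shapes and the daily_baseline helper are hypothetical, not the API of any particular tool.

```python
# Minimal sketch (assumed data shapes): spread a monthly forecast across days using
# day-of-week weights learned from historical daily costs.
from collections import defaultdict
from datetime import date, timedelta

def daily_baseline(historical, month_forecast, month_days):
    # Learn the average cost per weekday from history.
    by_weekday = defaultdict(list)
    for day, cost in historical.items():
        by_weekday[day.weekday()].append(cost)
    weekday_avg = {wd: sum(v) / len(v) for wd, v in by_weekday.items()}
    fallback = sum(weekday_avg.values()) / len(weekday_avg)

    # Weight each day of the target month by its weekday's historical share.
    raw = [weekday_avg.get(d.weekday(), fallback) for d in month_days]
    total = sum(raw)
    return {d: month_forecast * r / total for d, r in zip(month_days, raw)}

# Toy usage: two weeks of history with cheaper weekends, then a $31,000 forecast for March.
history = {}
for i in range(14):
    d = date(2023, 2, 1) + timedelta(days=i)
    history[d] = 600.0 if d.weekday() >= 5 else 1100.0

march = [date(2023, 3, 1) + timedelta(days=i) for i in range(31)]
baseline = daily_baseline(history, 31_000.0, march)
print(round(baseline[date(2023, 3, 4)], 2), round(baseline[date(2023, 3, 6)], 2))  # Saturday vs. Monday
```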

Other terms and definitions to know

Severity

Taking the discussion on variation further, once you have established a minimum threshold, you still need to distinguish a low-impact anomaly from a high-impact one. Usually it’s best if the business users have some control over what is called a low, medium, high, or critical anomaly, and can set alerts only on the high/critical ones, leaving the low/medium ones for offline analysis.

Also note that, within a day, an anomaly may start off with low severity, but as it accumulates more cost it may escalate to a high or critical level.
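
One simple way to implement this is a set of severity bands keyed on the anomaly’s accumulated excess cost, re-evaluated as the day’s costs accumulate. The dollar thresholds below are purely illustrative; each organization should tune its own.

```python
# Hedged example: severity bands are illustrative thresholds, not FinOps-prescribed values.
SEVERITY_BANDS = [            # (minimum excess cost in $, label)
    (10_000, "critical"),
    (2_500,  "high"),
    (500,    "medium"),
    (0,      "low"),
]

def classify(excess_cost: float) -> str:
    """Return the severity label for a given accumulated excess cost."""
    for floor, label in SEVERITY_BANDS:
        if excess_cost >= floor:
            return label
    return "low"

# The same anomaly escalates as its cost accumulates over the day.
for accumulated_excess in (300, 1_200, 4_000, 12_000):
    print(accumulated_excess, "->", classify(accumulated_excess))
# 300 -> low, 1200 -> medium, 4000 -> high, 12000 -> critical
```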

Timescale

This is where the Anomaly Detection process separates itself from budgets. 

Budgets are usually created and monitored on a monthly, quarterly, or annual basis. This does not leave any room to find inter-day variation. We have seen anomaly detection work best on a timescale of a day, or consecutive days when an anomaly persists.

Importance of anomaly management

Detecting and managing anomalies is critical to avoiding unwanted and surprising billing charges. An identified anomaly could be an indicator of an infrastructure issue, a software bug, a possible cyber attack, or any other problem that might cause a constant increase in unpredicted cloud costs.

Anomaly management is an important mechanism for keeping costs on target, or at least mitigating damages, when anomalous events occur. Without anomaly management, organizations rely on chance to discover unusual spend increases, which is not a good position to be in.

For example, a cost anomaly might indicate heavy utilization of a certain type of resource that is being exhausted by an unwanted workload. Even though cloud resources are usually highly scalable and built so that performance issues are rare, there are situations where resources or service flows might not be so resilient when facing increased demand. In this case, not only will cloud costs increase, but service levels and reliability can drop, which can negatively impact revenue or reputation.

Lifecycle of a cloud cost anomaly

Now that we have established what a cost anomaly is, let’s look at its lifecycle:

  • Record Creation: This is the first step and, regardless of your algorithm, you need to systematically create a record for each identified anomaly with the characteristics (e.g. impact, service, scope) for future analysis.
  • Notification: Depending on the severity of the anomaly, the mode of communication may be different (or absent). You may choose to integrate mobile / team chat alerts for “critical anomalies” and set email alerts for “high impact” ones and choose not to send notifications for medium and low ones to be analyzed weekly or monthly (this will vary from organization to organization).
  • Analysis: Once the anomaly has been identified and stored, an investigation should be performed to understand the reason behind the spike in cost. During this lifecycle phase, individuals look to determine whether the increased cost is in fact an unexpected increase, whether it is the result of an intended or unintended change, etc., and to uncover the “why” behind what happened.
  • Resolution: Deciding upon an action to take as a result of the analysis, even if the decision is to take no further action. See common resolution outcomes below.
  • Retrospective: Once the anomaly has been resolved, it’s important to perform an analysis to understand how future anomalies can be prevented, capture data to feed into KPIs (e.g. $ avoided), or even adjust the monitoring system to capture it earlier.

It’s important to note that the lifecycle steps are implemented differently across organizations. Labor capacity, automation, and FinOps maturity are just some of the factors which affect the implementations of anomaly management and the efficiency of the execution.

Below are some sample implementations of the anomaly lifecycle.

  • Company A – Notifications of anomaly records are sent directly to engineering teams, who perform the analysis and all subsequent lifecycle steps.
  • Company B – Notifications of anomaly records are sent directly to FinOps practitioners, who conduct an initial analysis and forward only important, actionable anomaly records to engineering teams for further analysis and resolution.

Record Creation

The first step of the lifecycle, record creation, involves detecting the anomaly (and verifying that it actually is one) to kick off the entire process. Here we address best practices and challenges involved with anomaly record creation.

How to detect an anomaly and understand deviations from average

The general process of any anomaly detection method is to take data, learn what is normal, and then apply a statistical test to determine whether any data point for the same time series in the future is normal or abnormal.

Let’s take a closer look at the data pattern in the figure below. The shaded area was produced by such an analysis. We could, therefore, apply statistical tests such that any data point outside of the shaded area is defined as abnormal and anything within it is normal.

Example figure: Database cost over time chart

Detecting anomalies in time series involves three basic steps:

  1. Estimation: Estimating a mathematical model that describes the normal pattern and distribution of the pattern of cost based on historical data and known future events.
  2. Prediction: Predicting the expected cost and its prediction confidence interval – aka baseline sleeve for the next measurement interval (hour/day/week/month…)
  3. Detection: If the next measurement of cost does not fall within the predicted confidence interval, flag it as an anomaly.

Estimation

The most complicated part of this scheme is the estimation phase. The normal pattern of cost over time may include effects such as: trends, seasonal patterns (e.g., daily, weekly and/or monthly patterns), normal changes of the pattern due to changes in usage, and effects due to known events that may impact cloud usage (e.g., known tech changes in products, holidays, product releases, marketing campaigns, etc.).

A model that encodes and estimates all of the above will be more accurate than a model that does not. For example, a simple statistical model for anomaly detection would estimate a 7-day running average of the cost, with its corresponding standard deviation.

Such a model accounts for slow changes (trends), but it would not capture the effect of a weekly seasonal pattern and might trigger a false positive alert every Monday if the weekly pattern exhibits lower cost over the weekend. However, a model that can encode seasonality would not trigger an alert on Monday, because it knows that Mondays are higher than the weekend.

Therefore, the more data we have, the better (i.e. historical cost data). For example, to capture annual seasonality, you would need at least a year of data, if not multiple years. However, cost fluctuations on shorter cycles (e.g., weekly, monthly) can produce good estimates with less data. A good rule of thumb for the minimum data required is to have at least two cycles of a seasonal pattern in order to estimate it (e.g., for weekly patterns, two weeks can be sufficient).

Prediction

In the prediction phase, the model considers all the data up to the previous time step (e.g., up to yesterday) and uses it to predict future cost (e.g., today’s cost). The predicted cost is then compared to the actual cost; if the actual cost is significantly different, it is flagged as an anomaly. Significance is typically determined using a prediction confidence interval: a measurement is considered anomalous if it falls outside that confidence interval.

For example, using the simple statistical model of the last 7 days’ average and standard deviation, the prediction is the average of the last 7 days, and the confidence interval can be the average ± 3 × the standard deviation – a reasonable interval if the underlying data follows the Gaussian (aka Normal) distribution. Therefore, for anomaly detection, the model used should also have the ability to estimate an accurate confidence interval, and not just a point prediction.

Because there will be normal variation in the data day over day, we don’t want every change to become an anomaly. Therefore, we use the confidence interval to be able to predict within that range what costs are likely and normal to occur. For those familiar with Statistical Process Control, these confidence intervals can be plotted on “Control Charts” or XMR charts.
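
The following is a minimal sketch of the simple model described above: a 7-day rolling average with a confidence interval of ±3 standard deviations. It only illustrates the estimation/prediction/detection loop; a production system would also model seasonality and incorporate future-aware inputs such as forecasts and known events.

```python
# Minimal sketch of the simple statistical model described above.
from statistics import mean, stdev

def detect(daily_costs, window=7, k=3.0):
    anomalies = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]          # estimation: learn the recent "normal"
        predicted = mean(history)                    # prediction: expected cost for day i
        sleeve = k * stdev(history)                  # prediction confidence interval (baseline sleeve)
        actual = daily_costs[i]
        if not (predicted - sleeve <= actual <= predicted + sleeve):
            anomalies.append((i, actual, predicted, sleeve))   # detection: outside the sleeve
    return anomalies

costs = [100, 102, 98, 101, 99, 103, 100, 240, 104, 101]  # day 7 contains a spike
for day, actual, predicted, sleeve in detect(costs):
    print(f"day {day}: actual={actual}, expected={predicted:.1f} +/- {sleeve:.1f}")
```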

Detection

When an event has been identified as anomalous, it’s important to capture a sequence of observations related to the event to ensure the anomaly record created provides sufficient meaningful information to those who will review it. Some of this methodology blends in with the next lifecycle step, Notification, where this documentation will walk through an example.

Challenges involved with identifying and resolving anomalies

Much has been discussed regarding identifying and resolving anomalies, the bulk of which can be classified into the following categories.

  1. Signal-to-noise ratio: Most large IT environments are dynamic with ongoing development, testing, and deployment initiatives constantly starting and stopping new cloud resources. Successfully identifying anomalies is akin to finding the proverbial needle in a haystack as true anomalies often share patterns similar to normal development activities. Improving the ability to identify true anomalies is essential to any system.

Without a strategy to reduce false positives, individuals analyzing anomaly records may choose to disable or ignore alerts. False positives consume critical bandwidth, so it is essential for anomaly detection algorithms to be carefully tuned to identify anomalous events that are likely to be validated as true anomalies, without also falsely flagging too many other cost increase events.

  2. Latency: Identification of anomalies using cloud billing details can be delayed as much as 36 hours from the start of the event, and it can take upwards of 24 hours before the details are processed and made available for analysis. Short-duration anomalies may be over by the time the alerting is triggered, and costs can grow significantly during this timeframe for long-duration anomalies.

It’s important to utilize anomaly detection even if the data is not complete, as the costs will just continue to grow as the data is updated. If you take the approach of not waiting for all the cost data to be complete, the anomaly detection should be reapplied to the same data as it is updated, to avoid missing spikes that were not detected while the data was incomplete. Systems that look beyond cost data and include visibility data have an advantage with early anomaly detection.

  3. Scope/aggregation level: Large and/or multi-cloud organizations face the additional challenge of mapping anomalies to their organizational structure. Forecasts and budgets are often set at the business or department level, creating the need to align your cloud costs with your budget for cost-based anomaly detection. You can detect anomalies at an organization level, cloud level (in the case of multi-cloud), account level, and various other increasingly granular ways.

It is a common practice to start by identifying what impacts your business at large. (See the image below for a comparison of cloud hierarchy structures for AWS, Azure, and GCP.)

This will ensure you do not get false alarms if a service runs in another project for some reason; while that may be a flag for the project/solution owner, nothing is broken at the organization level. This will be of most interest to centrally located FinOps teams. Once you have mastered the “recipe” of identifying the right anomalies, it may be time to go down to the next level of AWS accounts / projects. This will ensure the solution owners can also monitor their own costs and anomalies.

Source: Team, E. (2022, June 8). AWS, Azure and GCP: The Ultimate IAM Comparison. Security Boulevard

Notification

Once we have created a record of the anomaly and validated that it is one, it’s time to notify the correct stakeholders to progress through the anomaly detection lifecycle and take proper action.

Notable roles who may need to be informed include:

  • FinOps practitioners
  • Finance
  • Engineering
  • Product and Business Owners
  • Executives and C-Level

In a later section of this documentation, you’ll find a table of Persona Responsibilities, where the Working Group conveys which FinOps or adjacent business personas are either responsible, accountable, consulted, or informed across a number of anomaly management lifecycle phases.

In future sprints, we’ll better define when and how to inform these various roles and personas whenever a cloud cost anomaly occurs. This includes user stories to help define different scenarios of various organizations and enterprises that use cloud at scale.

Include operational context alongside anomaly notifications

Reporting on these anomalies can help bring to light gaps in communication. If infrastructure changes can be communicated to finance in advance, it will help them make more accurate forecasts. Therefore, anomaly detection isn’t just about addressing cost increases; it is about building lines of communication about what is happening in the infrastructure.

Let’s consider the multi-day anomaly shown in the example figure above. Having separate records, one for each day of the anomaly event, would yield the following results: 

  1. Multiple alerts for the same/related events – Treating each measurement (e.g., cost per day) as independent anomalies may lead to alerts being sent every day, confusing a user that might already be handling an issue.
  2. Inaccuracies with KPI reporting – The impact of an anomalous event may be underestimated, which could lead to users ignoring them – e.g., suppose an R&D issue causes an anomalous $500 per day increase in spend for a specific service. A user may ignore it because of the low impact compared to the overall spend, but after 30 days, the impact of the issue is already $15,000. Without combining all the anomalous days into a single anomaly for the user to review, the total impact may not be visible – leading to more and more waste.
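
One way to avoid both issues is to merge consecutive anomalous days into a single anomaly record whose impact accumulates over time. The sketch below uses a hypothetical record structure, not a prescribed schema; it shows the $500-per-day example becoming one record with a $15,000 impact after 30 days.

```python
# Illustrative sketch: merge consecutive anomalous days into a single anomaly record so
# notifications are not repeated daily and the cumulative impact stays visible.
from dataclasses import dataclass, field
from datetime import date, timedelta

@dataclass
class AnomalyRecord:
    service: str
    start: date
    end: date
    daily_excess: list = field(default_factory=list)

    @property
    def total_impact(self) -> float:
        return sum(self.daily_excess)

def group_consecutive(anomalous_days, service):
    """anomalous_days: (day, excess cost above baseline) tuples, sorted by day."""
    records = []
    for day, excess in anomalous_days:
        if records and day == records[-1].end + timedelta(days=1):
            records[-1].end = day
            records[-1].daily_excess.append(excess)
        else:
            records.append(AnomalyRecord(service, day, day, [excess]))
    return records

# An R&D issue adding ~$500/day for 30 days becomes one record with a $15,000 impact,
# rather than 30 low-impact records that are easy to ignore.
days = [(date(2023, 3, 1) + timedelta(days=i), 500.0) for i in range(30)]
for rec in group_consecutive(days, "analytics-service"):
    print(rec.service, rec.start, "->", rec.end, f"${rec.total_impact:,.0f}")
```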

Analysis

In this part of the lifecycle, it’s time to take a closer look at the cloud cost anomaly. Learn more about what goes on during this phase, which personas are involved, and challenges to look out for.

Investigation

Once cloud cost anomalies are identified, they require triage and investigation. The initial triage should look at the severity of the anomaly and the probability of the anomaly being caused by normal business activities in order to determine the response.

Details of the anomaly, such as the trigger and run rate above baseline, should be provided by the anomaly detection system. A mature organization will have predefined thresholds dictating the response. For example, a 5% increase in cost for a department might be addressed during a regularly scheduled meeting, whereas a 100% increase would trigger an all-hands-on-deck response.

One tried-and-true workflow to follow during anomaly analysis after initiating investigation includes tracking cost changes, re-forecasting, and identifying ownership for remediation.

Tips for analyzing potential anomalies:

  1. Examine contextual data provided with the anomaly record. Is this a tagging error that resulted in a false positive anomaly? Is this a one-day spike on the first of the month perhaps related to delivery time of billing charges? Is this anomaly associated with a certain environment, perhaps it is scheduled performance testing? Identifying the resource owner is often critical to assessing if an anomaly is justified or not. If your resources aren’t tagged with an owner, you may encounter delays in finding the individual responsible for the resource. While you are looking for the owner, costs continue to grow. 
  2. Consider spending patterns. Having a baseline understanding of your normal cloud spending patterns will help you better understand seasonality and usage variability. This means gathering historical data on your cloud usage and costs to identify trends and patterns. Consider whether or not the spike in spend correlates to increased demand for resources, such as Cyber Monday for retailers.
  3. Consider predicted anomalies. Some organizations keep a running list of anomaly events they are anticipating. You may want to consider keeping track of such anticipated events or comparing against such a list if one exists in your organization. 
  4. Compare the anomalous spend or anomalous spend trend to the forecast and/or budget to determine severity. Is this an application or account which is tracking under budget, on target, or already trending above? How does this anomaly impact the scenario?

Tracking Cost Changes

At this point, the cost impact ($) and the expected duration of the anomalous usage have been identified. The budget inventory’s change section is where you should attempt to note as many anomalies as you can (preferably all the alerts above your defined threshold). This will allow the FinOps practitioner to estimate the impact of the anomaly and do the re-forecasting.

For example, if the budget is maintained in a CSV file, then tracking anomaly changes may look as follows. With this approach, at any given point in time you should be able to view the cost impacts.

                               Jan 2023   Feb 2023   March 2023   Dec 2023
Fixed Budget (Account X)       $1000      $1100      $1100        $1100
Changes in Budget (anomalies)
Anomaly 1 cost impact          $100       $100       $100
Anomaly 2 cost impact          $50        $50
Anomaly <n> cost impact                               $75          $75
Reforecasted Budget            $1150      $1250      $1275        $1175
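
The arithmetic behind the table above can be expressed in a few lines of code. The sketch below reproduces the example values; the placement of each anomaly’s impact by month is inferred from the reforecasted totals, and the structure is illustrative rather than a prescribed format.

```python
# Sketch of the reforecasting arithmetic shown in the table above (illustrative structure).
fixed_budget = {"Jan 2023": 1000, "Feb 2023": 1100, "March 2023": 1100, "Dec 2023": 1100}

anomaly_impacts = {
    "Anomaly 1":   {"Jan 2023": 100, "Feb 2023": 100, "March 2023": 100},
    "Anomaly 2":   {"Jan 2023": 50,  "Feb 2023": 50},
    "Anomaly <n>": {"March 2023": 75, "Dec 2023": 75},
}

# Reforecasted budget = fixed budget + sum of anomaly impacts for each month.
reforecast = {
    month: base + sum(impacts.get(month, 0) for impacts in anomaly_impacts.values())
    for month, base in fixed_budget.items()
}
print(reforecast)  # {'Jan 2023': 1150, 'Feb 2023': 1250, 'March 2023': 1275, 'Dec 2023': 1175}
```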

Reforecasting

Reforecasting means reflecting the cost impacts in the initial budget. Providing a re-forecast and budget impact, for example, will raise the budget owner’s sense of urgency to fix the anomalies based on their cost impacts. Having a heads-up on the effects will also benefit C-level executives and the finance department.

Ownership for Remediation

The last step of this phase is to identify the workflows required and roles responsible for resolving anomalies. We’ll dive deeper into what’s required for FinOps teams generally in the next section.

Resolution

After a complete analysis of an anomaly, a decision needs to be made as to what additional actions, if any, should be taken as a result of the anomaly and by whom. The resolution outcome depends on the findings from analysis. Below is a list of common resolution outcomes & reasoning.

  1. Reject anomaly – Some anomalies should be rejected for further action. There are several scenarios in which one would want to reject an anomaly.
    • Faulty anomaly detection model: In this scenario, the anomalies do not look like anomalies at all.
    • Incorrect data: In this scenario the cost data collected about the anomalous event were incomplete or incorrect leading to the creation of an anomaly record which is not valid. This sometimes occurs when the billing data referenced by the detection algorithm is incorrect. 
    • Expected cost spikes: Investigation concluded that the increase in spend, although anomalous in nature/quantity, was in fact an anticipated event caught by anomaly detection algorithms.
    • Low impact anomalies or low priority anomalies: Cost increases may be real anomalies, but their impact is too low to require any action or the organization does not have the resources (time, people, etc.) to follow up on the anomaly event.
  2. Accept anomaly – A sign of a good anomaly management practice is the ratio of accepted anomaly records to rejected anomaly records. While no one wants to come across serious anomalies, it is preferable that the anomaly records generated are truly anomalous events, so that people are not spending their time analyzing records which are eventually rejected as events requiring action. An anomaly record might be validated as an event needing further action for a variety of reasons. Below are some common scenarios.
    • Security Breach: If the reason for the unusually high cloud cost was increased consumption due to malicious activity and/or a security breach, immediate action must be taken to deal with cost and the security issue. The FinOps team must work closely with the cloud security team to address the security breach in addition to dealing with increased cloud costs.
    • Misconfiguration: If the reason for the unusually high cost was increased consumption due to misconfiguration of services, action must be taken to reconfigure the services. In addition to reconfiguration based on actual needs, a team may also benefit from some training on how to properly configure their services in the future to prevent future anomalies.
    • Unintentional Provisioning of Services – Shut down provisioned services. Can controls or processes be put in place to minimize unintentional provisioning?
    • Intentional changes in architecture/products leading to unintended cost increases – In cases where a known change was made, usually in products, but it led to unintentional cost spikes, a review of the changes should be made by the technical team that made them to evaluate the cost-benefit of the change and look for a different architectural solution if the cost is deemed higher than the benefit.

In any case, a decision needs to be made regarding what should happen next. You may conclude no further action is necessary, that you need to engage with an engineer to shut down an out-of-control resource, or perhaps that you need to notify multiple individuals in your organization regarding a likely security breach.

In scenarios where an anomaly event threatens the ability to stay on budget, we suggest you notify the budget owner and engage with relevant product and engineering teams on cost optimization opportunities to mitigate budget overages.

Retrospective

During the retrospective phase, organizations should reflect on what went well, what didn’t go well, and what can be improved in the future. The goal is to identify areas of improvement and to implement changes that will improve anomaly management. Retrospectives should be scheduled on a regular cadence, such as monthly, quarterly, or annually.

The retrospective should include the key players that receive, analyze and remediate anomalies. Whoever is leading the retrospective should distribute a list of anomalies that have happened since the last retrospective to allow participants to select and come prepared to drill into specific items. The team should consider if all anomalies will be reviewed or if there is a certain threshold that warrants deeper analysis. It’s important to make sure that everyone on the team understands the purpose of the retrospective and is prepared to provide constructive feedback.

Retrospectives are most effective when all stakeholders share open and honest feedback. An effective method for evaluating existing processes is the “Start, Stop, Continue” discussion exercise, during which participants identify actions that should be newly implemented, stopped, or continued. Tooling and methods should be considered during retrospectives as well.

Once the team has identified the anomaly management changes to implement, they should develop an action plan for completing the implementation of those changes. Conducting regular retrospectives helps drive continuous improvement into your anomaly detection and management processes.

Key Performance Indicators (KPIs)

There are many measurements that can take place in the context of anomaly detection, which vary from case to case and with the maturity of the organization. Here are some key performance indicators (KPIs) as examples:

  • No. of Anomalies – The count of anomalies within a period of time
  • Anomalous Cost – The increased cost associated with the anomaly record for the given period of time
  • Detection Time – Time between when the anomaly occurred and when the anomaly was detected by systems/tools/etc.
  • Notification Time – Time between when the anomaly occurred and when someone received the notification.
  • Resolution Time – Time between when an anomaly is detected and when the anomaly record has been addressed (either rejected or all response actions complete).
  • Analysis Time – Time between when an anomaly notification is received and when the analysis is complete. This might include the investigation done by the several teams involved in the analysis before a final resolution decision is made.
  • Total Cost Avoidance – Cost avoided by fixing the true positive anomaly, measured until the next billing cycle. This could be by anomaly and/or consolidated (total, per product, per department).
  • Accepted records – The count of anomaly records which are determined as needing additional action
  • False records (count or percentage) – The number of alerts identified as false positives
  • Impact of anomaly detection ($) = Cost value of true positives (in $) – (Cost of False positives + Cost of False Negatives ($) + Cost of anomaly detection)
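
As a small illustration, the timing KPIs above can be derived directly from the timestamps captured on an anomaly record. The field names and timestamps below are assumptions for the example, not a required schema.

```python
# Illustrative timing KPIs computed from an anomaly record's timestamps.
from datetime import datetime

record = {
    "occurred_at":      datetime(2023, 3, 1, 2, 0),
    "detected_at":      datetime(2023, 3, 1, 14, 0),
    "notified_at":      datetime(2023, 3, 1, 15, 30),
    "analysis_done_at": datetime(2023, 3, 2, 11, 0),
    "resolved_at":      datetime(2023, 3, 3, 9, 0),
}

detection_time    = record["detected_at"] - record["occurred_at"]        # occurrence -> detection
notification_time = record["notified_at"] - record["occurred_at"]        # occurrence -> notification
analysis_time     = record["analysis_done_at"] - record["notified_at"]   # notification -> analysis complete
resolution_time   = record["resolved_at"] - record["detected_at"]        # detection -> fully addressed

print(f"detection={detection_time}, notification={notification_time}, "
      f"analysis={analysis_time}, resolution={resolution_time}")
```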

Most of these metrics can be categorized and measured at an application, department, product, and/or organization level for more granularity, and to identify whether there are any major outliers (for example, a specific product that is more prone to generating anomalies, or a department that needs further education).

Let’s dig into one of the above KPIs as an example: the impact of anomaly detection.

In general, evaluating the quality of an anomaly detection model (or system) requires measuring the rate of true positives vs. false positives, e.g., how many true anomalies were detected vs. how many events were wrongly flagged as anomalies. For a FinOps organization, an anomaly detection model or system should be evaluated not just on these two measures of accuracy, but also on the impact of those anomalies and the cost of detecting anomalies in the first place. The impact of anomaly detection can be calculated as follows:

Impact of anomaly detection ($) = Cost value of true positives (in $) – (Cost of False positives + Cost of False Negatives ($) + Cost of anomaly detection)

The impact of false positives is measured as the cost of time wasted until the anomaly is determined to be a false positive. It is harder to compute exactly; therefore, it is recommended to measure it for a few cases and use the average time to translate it to a cost estimate based on a standard hourly rate.

The impact of true positives is measured by the total cost these anomalies represent. For example, an anomalous increase in the cost of a DB service was deemed to have been caused by a badly formed SQL query. The impact of the anomaly is the excess cost, above the expected range, of the DB cost metric. The figure below illustrates the impact of an anomaly as the area between the expected range and the actual cost measured.

The third component of the impact of anomaly detection formula measures the cost of anomalies that were not detected by the anomaly detection system/methodology that were later detected by chance resulting in increased cloud costs.

The fourth and final term in the equation is the cost of detecting and responding to the anomalies themselves. This portion of the equation should consider the cost of involved persons’ time and if applicable the cost of any tooling either built or bought for anomaly detection and management. 
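
Putting the four terms together, a hedged example of the calculation might look like the following; all dollar figures are invented for illustration.

```python
# Hedged example of the KPI formula above; every dollar figure is made up for illustration.
def anomaly_detection_impact(true_positive_value, false_positive_cost,
                             false_negative_cost, detection_cost):
    """Impact ($) = value of true positives - (false positive cost + false negative cost + detection cost)."""
    return true_positive_value - (false_positive_cost + false_negative_cost + detection_cost)

impact = anomaly_detection_impact(
    true_positive_value=40_000,   # excess cost caught and stopped via validated anomalies
    false_positive_cost=3_000,    # analyst hours spent dismissing false alerts, at a standard hourly rate
    false_negative_cost=8_000,    # excess spend from anomalies missed by the system and found later by chance
    detection_cost=5_000,         # people time plus tooling for detection and response
)
print(f"Impact of anomaly detection: ${impact:,.0f}")  # $24,000
```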

Responsibilities by persona and lifecycle phases

The following is a table that presents cloud cost anomaly management lifecycle steps, relevant tasks, and how various FinOps personas are either Responsible (R), Accountable (A), Consulted (C), or Informed (I).

Lifecycle step    Task                          FinOps Practitioner   Business/Product Owner   Engineering/Operations   Finance   Executive
Record Creation   Anomaly Definition            R                     C                        C                        C         A
Record Creation   Tool Setup/Maintenance        A                     C                        R                        I         I
Notification      Notification Definition       A                     C                        C                        C/I       I
Analysis          Investigate Origin            A                     R                        R                        I         I
Analysis          Reforecast                    R                     C                        C                        A         I
Resolution        Remediate                     I                     A                        R                        I         I
Resolution        Document Anomaly              A                     R/C                      R/C                      C         I
Retrospective     Measure, reflect & improve    A                     R                        R                        C         C

In future sprints, we’ll better describe who is responsible for what and when and how these individuals carry out their responsibilities.

Conclusion

The complexity and depth of cloud cost anomaly management varies as cloud usage differs from company to company. The severity of an anomaly and the methods by which it is identified and resolved will vary based on the industry, organization, maturity, and many other factors for practitioners and their teams.

Our Working Group hopes that this documentation can at least help FinOps practitioners build a starting point to better understand cloud cost anomalies, how to identify, analyze, and resolve them, who should be informed, and what measurements to utilize to help determine the right outcomes for success. We hope this guidance is foundational and evergreen, helping any FinOps practitioner to begin building an anomaly management practice for their FinOps team.

This effort so far has just been our initial sprint and we hope that, with the efforts and feedback from the community, we can improve this documentation even further.

Get involved

Our community welcomes feedback and recommendations on improving the content in this documentation. You can get in touch in our Slack channel, #chat-anomalies, to discuss any of the content here or to pitch new content if you’re interested in being a part of a future sprint.

Cloud cost anomaly management is a part of a larger FinOps curriculum which we encourage practitioners to learn more about. If this education and training is something you or your teammates require, consider our FinOps Certified Professional course.

If you are reading this and aren’t a FinOps Foundation member yet, we welcome you to sign up and join.

Acknowledgments

The FinOps Foundation extends a huge thank you to the members of this Working Group that broke ground on this documentation:

We’d also like to thank any community members who have helped us kick-start this documentation.

Lastly, a big thank you to the FinOps Foundation support team for helping us bring our work to life: Samantha White (Program Management), Tom Sharpe (Design), and Andrew Nhem (Staff Sponsor and Content).