This work is licensed under CC BY 4.0 - Read how use or adaptation requires attribution

Anomaly Management

Detect, identify, alert and manage unexpected or unforecasted cloud cost and usage irregularities in a timely manner to lower risk in cost-effective cloud operations.

Detect anomalies

  • Define the tools we use to detect anomalous spending
  • Identify and document how alerts are created and logged
  • Identify and document how responsible parties are identified
  • Set mechanisms to alert the appropriate parties promptly through appropriate channels

Enable anomaly detection

  • Define the information required for effective detection and feed those requirements back to Data Ingestion
  • Create policies with respect to anomaly management

Manage anomalies

  • Analyze reported anomalies
  • Categorize anomalies, manage false positives, and investigate causes
  • Document anomalies and their resolutions

Definition

Anomaly Management gives a FinOps team the ability to detect, identify, clarify, alert on, and manage unexpected cloud cost events in a timely manner, in order to minimize impact to the business.

Managing anomalies involves using tools or reports to identify unexpected spending, distributing anomaly alerts, and providing the information needed to investigate and resolve anomalous usage or cost.

In the context of Cloud FinOps, anomalies are levels of spending that are different (usually higher) than normal historical or expected spend.

Anomaly detection identifies data points, events, and/or observations that deviate from a dataset’s normal behavior. Detection tools should examine not only aggregate usage but also usage within subcategories. Effective Allocation metadata are critical both to detecting anomalies and to determining who can best evaluate and resolve them.

Having Anomaly Detection tools that provide this granularity of cost by service, by account/project, by cost allocation tag, etc. is critical to be able to detect the specific causes of the anomalous spending.
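The value of this granularity can be illustrated with a minimal sketch. The record layout, the account and service names, and the 50% threshold below are all hypothetical, chosen only to show how an anomaly that is invisible at the account level can surface when the same data is grouped by service.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical daily cost records: (day, account, service, team_tag, cost_usd).
RECORDS = [
    (1, "acct-a", "compute", "web", 100.0),
    (1, "acct-a", "storage", "web", 900.0),
    (2, "acct-a", "compute", "web", 105.0),
    (2, "acct-a", "storage", "web", 895.0),
    (3, "acct-a", "compute", "web", 300.0),
    (3, "acct-a", "storage", "web", 700.0),
]


def flag_by_dimension(records, key_fn, latest_day, pct_threshold=0.5):
    """Return groups whose latest-day cost exceeds the mean of their
    earlier daily totals by more than pct_threshold (0.5 = 50%)."""
    totals = defaultdict(lambda: defaultdict(float))
    for day, account, service, tag, cost in records:
        totals[key_fn(account, service, tag)][day] += cost

    flagged = []
    for key, by_day in totals.items():
        past = [c for d, c in by_day.items() if d < latest_day]
        if not past:
            continue  # no baseline yet for this group
        baseline = mean(past)
        current = by_day.get(latest_day, 0.0)
        if baseline > 0 and (current - baseline) / baseline > pct_threshold:
            flagged.append(key)
    return sorted(flagged)


# Same data, two levels of granularity.
by_account = flag_by_dimension(RECORDS, lambda a, s, t: a, latest_day=3)
by_service = flag_by_dimension(RECORDS, lambda a, s, t: (a, s), latest_day=3)
print(by_account)  # account total is flat, so nothing is flagged
print(by_service)  # the compute increase is visible at service level
```

In this made-up data the account total stays at 1000 per day because a drop in storage cost masks a tripling of compute cost. Only the per-service grouping flags it.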

Standard procedures for anomaly detection and analysis are critical to allow the FinOps team to react quickly when anomalies occur. Use of automated, machine learning–based anomaly detection is typical. These tools are generally offered by cloud providers and third-party platforms.

As with many FinOps capabilities, anomaly detection is performed by comparing current usage to past usage. As such, sudden increases in new usage or spending can trigger anomaly alerts even when they are anticipated. For example, standing up a new training environment shortly before an application launches may create a sudden spike in usage in an account that has never seen significant cost before. In these cases, teams should expect to see anomaly alerts, and should manage and document them when they occur so that other personas are not drawn into unnecessary investigation.
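The comparison of current usage to past usage can be sketched as a simple z-score check. This is an illustrative assumption, not the method any particular provider or platform uses. The function names, the threshold of 3 standard deviations, and the `planned_scopes` set (a hypothetical register of pre-announced changes) are all invented for the example.

```python
from statistics import mean, stdev


def is_anomalous(history, current, z_threshold=3.0):
    """Return True when `current` deviates from `history` by more than
    z_threshold sample standard deviations."""
    if len(history) < 2:
        # Assumption: with no usable baseline, any spend at all is flagged.
        return current > 0
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold


def classify(scope, history, current, planned_scopes):
    """Label a scope 'normal', 'expected' (anomalous but pre-announced),
    or 'unexpected' (anomalous with no prior notice)."""
    if not is_anomalous(history, current):
        return "normal"
    return "expected" if scope in planned_scopes else "unexpected"


quiet_history = [2, 3, 2, 3, 2, 3, 2]  # an account with almost no prior cost
planned = {"training-env"}             # hypothetical pre-announced launch

print(classify("training-env", quiet_history, 500, planned))
print(classify("prod", quiet_history, 500, planned))
```

The detector itself cannot tell the two spikes apart. Only the documented expectation distinguishes the planned launch from a genuine surprise, which is why recording anticipated changes matters.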

Managing and resolving anomalies typically involves some level of investigation, followed by either a change to the environment or a change to the expected cost of the affected scope. Another resolution may be to simply acknowledge the anomaly and document the reasons it was detected.

Maturity Assessment

Crawl

  • Understanding among the FinOps team and the organization generally that anomalous spending might occur
  • Manually checking for anomalous spending using reporting
  • Reacting to anomaly activity more than a week after it occurs (post receipt of billing data)
  • Using budget alerts rather than an anomaly detection service
  • Limited detection granularity (e.g. Account/Project level only, without tagging data or logical groupings), which limits the context available for any anomalous spending identified
  • Anomaly alerts are sent only to a central team, or someone must manually check for them
  • Unexpected spending is manually investigated by central team and routed when necessary to a suspected owner for resolution

Walk

  • Some form of automated detection or reporting or tooling (usually provided by the Cloud Service Provider, third party, or custom tooling)
  • Knowledge and use of anomaly detection tooling in most or all departments and teams
  • Breaches of context-relevant thresholds are detected (percentage of spend change, single item spend amount ceilings, forecast breach alerts, etc.)
  • Cost allocation metadata provides context to segment anomalies, allowing for easier analysis
  • Unexpected spending automatically routed to responsible teams
  • KPIs associated with Anomaly Management set and in use by key teams in the organization
  • Ability to document the outcome, and capture some details, of each anomaly that triggers an alert
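The threshold types and automatic routing listed above can be sketched together. Everything in this example is hypothetical: the `Thresholds` fields, the numeric values, the `OWNERS` mapping from allocation tag to contact, and the fallback address. It is a sketch of the idea, not a reference to any vendor's alerting API.

```python
from dataclasses import dataclass


@dataclass
class Thresholds:
    pct_change: float    # allowed increase over the previous period (0.30 = 30%)
    item_ceiling: float  # maximum spend allowed for any single line item
    forecast: float      # forecasted spend for the current period


def breaches(previous, current, largest_item, t):
    """Return the names of every threshold the current period breaches."""
    reasons = []
    if previous > 0 and (current - previous) / previous > t.pct_change:
        reasons.append("pct_change")
    if largest_item > t.item_ceiling:
        reasons.append("item_ceiling")
    if current > t.forecast:
        reasons.append("forecast_breach")
    return reasons


# Hypothetical allocation tag -> responsible contact.
OWNERS = {"web": "web-team@example.com", "data": "data-team@example.com"}


def route(team_tag, reasons, fallback="finops@example.com"):
    """Pick the recipient for an alert; untagged spend goes to the central team."""
    if not reasons:
        return None
    return OWNERS.get(team_tag, fallback)


limits = Thresholds(pct_change=0.30, item_ceiling=500.0, forecast=1500.0)
reasons = breaches(previous=1000.0, current=1400.0, largest_item=200.0, t=limits)
print(reasons, route("web", reasons))
```

Falling back to the central team when a tag has no known owner mirrors the Crawl behavior, so gaps in allocation metadata show up as alerts that still need manual routing.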

Run

  • Mature Anomaly detection tooling in use and embedded in cloud cost tooling across the organization
  • Automation created to detect, suggest resolution, or resolve anomalous spending alerts, with appropriate severity in appropriate environments
  • Anomaly alerts of an appropriate scale or urgency integrate with event management or ticketing systems and processes
  • Granular context-related anomaly alert thresholds linked directly to service components
  • Alert thresholds iteratively updated in line with service lifecycles
  • Alerting and thresholds set differently for different personas and levels of responsibility (e.g. leadership, FinOps team, engineering, finance may see alerts from anomalous spending differently based on their own thresholds for action or information)
  • Results and resolution of each anomaly alert are captured
  • Analysis results in full root cause analysis post-mortem where appropriate
  • Analysis of resolution of past anomalies improves alerting on future anomalies
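Setting thresholds differently per persona, as described in the list above, might look like the following minimal sketch. The persona names come from the text, but the dollar amounts are invented placeholders that each organization would set for itself.

```python
# Hypothetical minimum anomaly impact (USD) at which each persona is alerted.
PERSONA_THRESHOLDS_USD = {
    "engineering": 100,
    "finops": 500,
    "finance": 5_000,
    "leadership": 20_000,
}


def personas_to_alert(impact_usd, thresholds=PERSONA_THRESHOLDS_USD):
    """Return, in alphabetical order, the personas whose threshold is met."""
    return sorted(p for p, minimum in thresholds.items() if impact_usd >= minimum)


print(personas_to_alert(750))     # small anomaly: only the teams closest to it
print(personas_to_alert(25_000))  # large anomaly: every persona
```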

Functional Activities

FinOps Practitioner

As someone in the FinOps team role, I will…

  • Establish requirements for selecting an anomaly detection tool that is appropriate for cost monitoring and capable of defining, refining, detecting and alerting on unexpected cloud spending events (cost anomalies)
  • Establish requirements for anomaly automation and documentation tooling or processes, and for their integration with the ticketing or process management systems already in place
  • Document and communicate anomaly detection mechanism and thresholds to all stakeholders
  • Work with stakeholder teams to establish anomaly detection thresholds and reporting/notifications frequency
  • Ensure that anomaly detection is tied appropriately to cost allocation metadata, providing feedback to the Allocation capability when additional metadata are required
  • Ensure anomaly detection tooling has access to raw spending data at appropriate real-time granularity and frequency
  • Generate reports that surface all anomalous spending and/or the anomalous spending that triggered alerts

Engineering

As someone in an Engineering role, I will…

  • Ensure my team checks for or receives anomalous spending alerts
  • Ensure that my team is aware of the correct processes and actions to respond to and address a cloud cost anomaly
  • Ensure the appropriate metadata are applied to resources within my control to allow anomaly detection to occur
  • Provide feedback to FinOps on the correctness of established thresholds and detection coverage
  • Investigate and determine the causes and scope of detected anomalies and document false positives or plans of action
  • Resolve issues causing anomalous spending and document the resolution of anomalies
  • Proactively notify other Personas of large spending increases that are planned and likely to trigger anomaly alerts that would otherwise cause concern (e.g. launch of a new environment)

Product

As someone in a Product role, I will…

  • Monitor and investigate detected and reported anomalies for systems within my areas of responsibility
  • Assist in investigating and resolving reported anomalies, and document resolutions in concert with engineering teams in my areas of responsibility

Finance

As someone in a Finance role, I will…

  • Help establish anomalous percentage thresholds that trigger financial review of budget to actual spend
  • Provide contingency funding sources within forecasts for cloud costs to accommodate anomalous spending trends that align with established thresholds

Leadership

As someone in a Leadership role, I will…

  • Establish anomalous percentage thresholds that trigger action or attention
  • Provide guidelines and support policies related to OKRs and KPIs and mandate anomaly management activities by engineering and product teams

Measures of Success & KPIs

  • The count of anomalies within a period of time (week, month) in aggregate or for a meaningful subset of usage
  • Consistency of detection: anomalous spending correctly identified vs. anomalies missed vs. false positives
  • Amount of cost associated with anomaly alerts within a period of time (week, month); represents total anomaly detection scope
  • Mean time to detect anomalies over a period of time (week, month); documents efficiency and effectiveness of tools used
  • Mean time to notify owner of anomalies over a period of time (week, month); documents the time it takes from the anomaly detection to the appropriate owner acknowledging it
  • Duration of unresolved anomalies over a period of time (week, month); the velocity of anomaly resolution
  • Time to investigate and address an identified anomaly; for a true anomaly, every hour of investigation is, in many cases, an hour of continued wasted cost
  • % of teams educated on how variable cloud spending can lead to anomalous spending, definition of what is anomalous, who is accountable, how to respond
  • The count of actioned anomalies and spending amount avoided (to nearest following billing period); the amount of cost saved through resolution of anomalies that would have gone unresolved until the bill was received
  • The count of unactionable anomalies, categorized by the justification for ignoring them (e.g. new service, performance testing, customer peak, false alert)
  • Tracking number of alerts suspended (ignored) to identify teams who might not be adhering to protocol or policy
  • Percentage of anomalies managed using automation of various categories; this documents the effectiveness of automation put in place
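Several of these measures can be computed from a simple log of anomaly records. The sketch below assumes a hypothetical `AnomalyRecord` structure with invented field and category names. Real tooling will expose different fields, so treat this as an illustration of the arithmetic only. Following the definition above, time to notify is measured from detection to the owner's acknowledgement.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class AnomalyRecord:
    started: datetime                 # when the anomalous spend began
    detected: datetime                # when tooling flagged it
    acknowledged: Optional[datetime]  # when the owner acknowledged the alert
    cost_usd: float                   # cost associated with the alert
    category: str                     # hypothetical: "actioned", "false_positive", ...


def mean_hours(deltas):
    """Average a sequence of timedeltas in hours; None when it is empty."""
    deltas = list(deltas)
    if not deltas:
        return None
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 3600


def kpis(records):
    """Compute a few anomaly KPIs for one reporting period."""
    false_pos = sum(1 for r in records if r.category == "false_positive")
    return {
        "count": len(records),
        "total_cost_usd": sum(r.cost_usd for r in records),
        "mean_time_to_detect_h": mean_hours(r.detected - r.started for r in records),
        "mean_time_to_notify_h": mean_hours(
            r.acknowledged - r.detected for r in records if r.acknowledged
        ),
        "false_positive_rate": false_pos / len(records) if records else None,
    }


t0 = datetime(2024, 1, 1)
log = [
    AnomalyRecord(t0, t0 + timedelta(hours=6), t0 + timedelta(hours=8), 1200.0, "actioned"),
    AnomalyRecord(t0, t0 + timedelta(hours=2), None, 300.0, "false_positive"),
]
report = kpis(log)
print(report)
```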

Inputs & Outputs

  • Cloud cost and usage data provided via Data Ingestion capability
  • Anomaly detection tooling (cloud provider, tooling vendor, homegrown tool)
  • Cost allocation metadata established and aligned to the organization’s reporting needs
  • Allocation strategy assigning usage to specific teams responsible for its oversight
  • Anomalous spend notification to stakeholder teams
  • Stakeholder real-time visibility into cost and usage reporting data
  • Cloud Policy & Governance associated with expectations for managing anomalies
  • Documentation of detection, analysis, and resolution processes and expectation of personas
  • Reporting & Analysis will be required to investigate and analyze anomalous spending
  • Workload optimization may be required to remediate usage-based causes of anomalous spending or to turn off resources which are not being used