Anomaly Management is the ability to detect, identify, clarify, alert and manage unexpected or unforecasted cloud cost events in a timely manner, in order to minimize detrimental impact to the business, cost or otherwise.
Anomalies in the context of FinOps are unpredicted variations (usually increases) in cloud spending that are larger than would be expected given historical spending patterns.
Managing anomalies typically involves the use of tools or reports to identify unexpected spending, the distribution of anomaly alerts, and the investigation and resolution of anomalous usage and cost.
Anomaly detection allows the FinOps team to react quickly in order to maintain the spend levels that an organization expects. Automated, machine learning-based anomaly detection is key to quickly finding those needles in your cloud haystack. These tools are generally offered by cloud providers and third-party platforms.
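To make the idea of "larger than would be expected given historical spending patterns" concrete, the following is a minimal sketch of a statistical baseline check. It is not the machine learning approach that provider and third-party tools use; it simply compares each day's cost to the mean and standard deviation of a trailing window. The function name, the 7-day window, and the threshold of 3 standard deviations are all illustrative assumptions.

```python
from statistics import mean, stdev


def find_cost_anomalies(daily_costs, window=7, threshold=3.0):
    """Return (day_index, cost, z_score) for days far above the trailing baseline.

    Hypothetical helper: window and threshold are assumed values, not
    recommendations from any provider or tool.
    """
    anomalies = []
    for i in range(window, len(daily_costs)):
        history = daily_costs[i - window:i]
        baseline = mean(history)
        spread = stdev(history)
        if spread == 0:
            # A perfectly flat history gives no scale to compare against; skip.
            continue
        z_score = (daily_costs[i] - baseline) / spread
        # Only increases are flagged, since those are the usual concern.
        if z_score > threshold:
            anomalies.append((i, daily_costs[i], round(z_score, 2)))
    return anomalies


# Made-up daily costs with one obvious spike on day index 8.
daily_costs = [100, 102, 98, 101, 99, 103, 100, 101, 180, 102]
anomalies = find_cost_anomalies(daily_costs)
for day, cost, z in anomalies:
    print(f"day {day}: cost {cost} is {z} standard deviations above baseline")
```

A trailing window keeps the baseline current as spending patterns drift, at the cost of letting a large spike inflate the baseline for the days that follow it.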
An anomaly usually occurs when teams deploy resources with the expectation that they will stay within budget, and then find they are trending over budget. This may be because the resources are priced higher than planned, or because launching some resources has created unexpected cost in another service. For example, launching a new set of virtual machines might also trigger an unexpectedly large increase in logging data.
Detected anomalies can be effectively dealt with only when cost allocation metadata or other operational metadata exists to determine who can best evaluate the anomaly for resolution.
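As a rough illustration of why allocation metadata matters, the sketch below routes an anomaly to an owner based on a tag and falls back to a central team when the tag is missing. The `team` tag key, the owner names, and the shape of the anomaly record are hypothetical and would differ by organization and tool.

```python
def route_anomaly(anomaly, owners_by_team, fallback_owner="finops-team"):
    """Pick who should investigate an anomaly from its cost allocation tags.

    Hypothetical helper: assumes each anomaly carries a "tags" dict with an
    optional "team" key.
    """
    team = anomaly.get("tags", {}).get("team")
    # Untagged or unknown teams go to the fallback owner for manual triage.
    return owners_by_team.get(team, fallback_owner)


# Illustrative data only.
owners_by_team = {"payments": "payments-oncall", "search": "search-oncall"}
tagged = {"service": "compute", "cost_increase": 420.0, "tags": {"team": "payments"}}
untagged = {"service": "logging", "cost_increase": 130.0, "tags": {}}

print(route_anomaly(tagged, owners_by_team))
print(route_anomaly(untagged, owners_by_team))
```

Anomalies that land on the fallback owner are a useful signal in their own right, since they point at gaps in cost allocation coverage.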
Resolving an anomaly typically involves some level of investigation and then either a change to the environment or an adjustment to the expected cost of the affected scope. Another resolution may be to simply acknowledge the anomaly. For example, a new testing infrastructure may be created to accommodate a testing period for a new application. If this environment has not existed before, it may be flagged as anomalous because it varies from historical spending patterns. So while automated tools will see this as anomalous, it is expected from the perspective of the humans launching the new environment, and the anomaly can be dismissed after confirming that it falls within the expected cost of the new environment.
You must also be able to track anomalies that might not directly result in a change in cloud spend. If a team starts using a new cloud service offering, replacing the usual one, you can learn of this through anomaly reports that show your cost by cloud service offering. Anomalies in this report can be very significant for companies that require sign-off—for security or compliance reasons—before using new services. Having Anomaly Detection tools that provide this granularity of cost by service, by account/project, by cost allocation tag, etc. is critical.
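One way to surface newly adopted services, even when total spend barely moves, is to compare the set of services billed per account across two periods. The sketch below assumes cost rows with `account`, `service`, and `cost` fields; those field names and the sample data are illustrative and not taken from any particular billing export.

```python
def new_services_by_account(baseline_rows, current_rows):
    """Return {account: sorted list of services} billed now but not in the baseline.

    Hypothetical helper: rows are dicts with "account", "service", and "cost".
    """
    seen = {(row["account"], row["service"]) for row in baseline_rows}
    found = {}
    for row in current_rows:
        key = (row["account"], row["service"])
        # Ignore zero-cost rows so free-tier noise does not trigger a report.
        if key not in seen and row["cost"] > 0:
            found.setdefault(row["account"], set()).add(row["service"])
    return {account: sorted(services) for account, services in found.items()}


# Illustrative data: total spend is unchanged, but one service replaced another.
last_month = [
    {"account": "prod", "service": "managed-sql", "cost": 900.0},
    {"account": "prod", "service": "compute", "cost": 2000.0},
]
this_month = [
    {"account": "prod", "service": "nosql-db", "cost": 900.0},
    {"account": "prod", "service": "compute", "cost": 2000.0},
]

new_services = new_services_by_account(last_month, this_month)
print(new_services)
```

The same comparison can be keyed by project or cost allocation tag instead of account, depending on the granularity that sign-off processes require.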
Managing anomalies will also be an important touchpoint between the FinOps function and the Security function. Security anomaly detection tools may detect problems which do not dramatically affect cost, and vice versa.
Measures of success are represented in the context of cloud costs and may include one or more key performance indicators (KPI), describe objectives with key results (OKR), and declare thresholds defining outliers or acceptable variance from forecasted trends.
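A threshold for acceptable variance from forecast can be expressed as a simple percentage check. The sketch below is one possible formulation; the 10% tolerance and the function names are assumptions chosen for illustration, not a recommended standard.

```python
def variance_pct(actual, forecast):
    """Percentage by which actual cost differs from forecast (positive = over)."""
    if forecast == 0:
        raise ValueError("forecast must be non-zero to compute a percentage")
    return (actual - forecast) * 100 / forecast


def is_outlier(actual, forecast, tolerance_pct=10.0):
    """True when actual cost falls outside the accepted band around forecast.

    tolerance_pct is an assumed example value; each organization sets its own.
    """
    return abs(variance_pct(actual, forecast)) > tolerance_pct


# Illustrative figures: 1150 actual against a 1000 forecast is 15% over.
print(variance_pct(1150, 1000))
print(is_outlier(1150, 1000))
print(is_outlier(1050, 1000))
```

Tracking how often spend falls outside the band, and how quickly each outlier is resolved, is one way such a threshold can feed a KPI.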
A collection of real-world examples, stories and “how to” for this Capability, based on FinOps community member experiences. Information here may:
- apply to one or more cloud providers
- include specific types of cloud services used (compute, storage, database, etc.)
- describe a combination of tooling, platform or vendor
- describe the industry the organization belongs to
- describe the complexity of the organization (global, enterprise, etc.)
- include the FinOps personas involved and any other organizational roles
Reference of cloud cost management platforms, tooling and service providers related to this Capability coming soon.
Reference of courses and training partners related to this Capability coming soon.
Get involved and contribute to the community by sharing your real-world experiences related to this Capability, in the form of a story or a playbook describing how you have implemented best practices in your organization.
Join the conversation about this Capability in Slack. You can submit stories and how-tos, and suggest improvements, using one of the options for contributing here.