This work is licensed under CC BY 4.0 - Read how use or adaptation requires attribution

Workload Optimization


Analyze and optimize cloud resources to match specific usage patterns while ensuring that workloads operate efficiently, sustainably and generate sufficient business value for their cost.

Creating a workload optimization strategy

  • Evaluate workloads for optimization options
  • Determine value vs risk thresholds for each workload
  • Define target optimization thresholds
  • Establish guidelines for balancing workload optimization against other optimization options
  • Make conscious trade-off decisions among cost, quality, performance and environmental impact of cloud operations

Managing workload optimizations

  • Generate optimization recommendations by workload
  • Prioritize production and pre-production optimization activities
  • Share recommendations with Engineering and agree on timelines for action
  • Align with Core Personas (e.g. Sustainability, ITAM) for optimization targets

Understanding where opportunities have value

  • Use modern and managed services
  • Compare optimization benefit in the context of Engineering time and priorities

Definition

Workload Optimization is a set of practices that ensure that cloud resources are properly selected, correctly sized, only run when needed, appropriately configured, and highly utilized in order to meet all functional and non-functional requirements at the lowest cost and environmental impact. This work is primarily done by Engineering, using guidelines and strategies formed collaboratively with the FinOps, Product, and other personas.

Engineers should seek to ensure there is sufficient business value for the cloud costs associated with each type of resource being consumed. Because cloud systems are built iteratively, it is typical to observe resource utilization over time to ensure performance, availability or other quality metrics are met, and to adjust or modify resources which are over- or under-sized, or make other optimizations even for systems which are well-architected for the cloud.

There is a strong relationship between all of the Capabilities in the Optimize Cloud Usage & Cost Domain. Each of the Capabilities in this Domain works in a different way to optimize cloud value – by using commitment-based discounts, rearchitecting, using or stopping the use of licenses or SaaS, providing guidance on cloud sustainability improvements, and optimizing the utilization and efficiency of the workloads that make up systems. Among all of these, Workload Optimization will likely be the most widely practiced, and have the most options.

Early in the FinOps practice, the FinOps Team will likely play a large role in identifying opportunities to optimize workloads, but over time Engineering will take on the primary responsibility for their cloud usage by seeking out ways to optimize or, better yet, by building in optimization as much as possible as systems are being built. However, no matter how well built and efficient a system is at launch, services in the cloud are constantly being added and modernized, and organizations must be prepared to work continuously to keep pace and maintain optimal performance and utilization. Engineering leadership is critical to establishing the cadence and highlighting the need to maintain optimization of workloads at the appropriate level.

A key way the FinOps team can support this is by developing a workload optimization strategy. This strategy can direct optimization work by highlighting which types of resources should be prioritized, setting thresholds for taking action so that time is not wasted on trivial improvements, defining target KPIs the organization wants to achieve, and creating guidelines for making the tradeoffs that come with optimization. Other Capabilities in this Domain may provide important inputs to this strategy by highlighting for Engineering where the organization supports (or plans to stop supporting) the use of licensed software, when rearchitecting is preferred over resource optimization, how to prioritize resource optimization against rate optimization, or how to incorporate sustainability and carbon impact decisions into usage optimization decision-making. As noted, the strategy may also set Leadership’s expectations of how frequently and diligently Engineering should pursue optimization relative to new feature development work.

Engineering teams, in collaboration with FinOps, Product, and Leadership, will use the Capabilities in the Understand Cloud Usage & Cost Domain to review workloads in their areas of responsibility. Determining utilization and identifying scaling or workload management opportunities may require access to utilization, performance, or observability data in addition to cloud usage and cost and carbon impact data. Engineering teams may focus their efforts on finding opportunities to optimize in different ways depending on factors like the system’s importance, time available to optimize, maturity of the application, or whether the workloads are production or non-production.

A wide range of options exist to optimize workloads in the cloud including:

  • Waste reduction – removing resources which were created but are no longer being used. These may include stranded storage volumes, excessive backups or snapshots, unused sandbox resources, etc. If these types of items are generated consistently, look to automate resource creation to avoid stranding (or creating) resources in the future, or automate cleanup processes to save having to search for them continuously
  • Workload Management – resources should ideally be running in the cloud only when the workload is required. Scheduling the time when an environment or resources run reduces both cost and environmental impact, since shared cloud capacity becomes available to others when it is not needed. Pre-production resources are the biggest targets, and should require a scheduled start/stop at launch whenever feasible.
  • Scaling – some resources have the ability to scale up or down depending on variable workload needs throughout the day/month. This can be accomplished in a variety of ways. Looking for high cost or impact workloads with cyclical usage patterns can identify places where scaling could be introduced.
  • Rightsizing – resources that can’t be scaled but which have consistently low utilization may be candidates for rightsizing: reducing the size, scale, or service tier of the resource to match its workload needs.
  • Temporal shifting – processes not bound to run at a specific time can be run at times that optimize cost and/or carbon when lower-price compute (e.g. interruptible/spot instances) or lower carbon intensity electricity is available.
  • Geo Regional shifting – processes not bound to run in a specific region can be run in a region that optimizes cost and/or carbon where there are price advantages or lower carbon intensity electricity so long as compliance and performance requirements are met.
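
As a rough illustration of the Workload Management option above, the following sketch estimates how much always-on runtime a start/stop schedule avoids. It is a minimal example under stated assumptions: the function name and the 12-hour, five-day schedule are hypothetical, and it only applies to resources billed by running time (attached storage, for example, typically continues to accrue cost while an instance is stopped).

```python
HOURS_PER_WEEK = 24 * 7  # 168 hours in an always-on week


def scheduled_savings_fraction(hours_per_day: float, days_per_week: int) -> float:
    """Return the fraction of always-on runtime avoided by a start/stop schedule."""
    if not 0 <= hours_per_day <= 24:
        raise ValueError("hours_per_day must be between 0 and 24")
    if not 0 <= days_per_week <= 7:
        raise ValueError("days_per_week must be between 0 and 7")
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


# Hypothetical pre-production environment: 12 hours a day, Monday to Friday.
fraction = scheduled_savings_fraction(hours_per_day=12, days_per_week=5)
print(f"Runtime avoided: {fraction:.0%}")
```

Under these assumptions the environment runs 60 of 168 hours, so roughly 64% of the always-on runtime is avoided.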

Examine workloads carefully for longer-cycle periods of high utilization (e.g. higher utilization at month-end, or quarterly busy periods) and be cautious of workloads that have resource requirements for warranty or software performance reasons. Rightsizing typically requires recreating resources so this can involve system outages that should be carefully coordinated within the Engineering team.
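
The caution about longer-cycle peaks can be sketched as a simple check that refuses to flag a resource until enough history has been observed. This is an illustrative sketch only: the function name, the 90-day window, and the 40% peak threshold are assumptions chosen for the example, not values prescribed by the Framework, and real thresholds should come from the workload optimization strategy.

```python
from typing import Sequence


def is_rightsizing_candidate(
    daily_peak_util_pct: Sequence[float],
    min_days: int = 90,                # assumed window long enough to include a quarter-end
    peak_threshold_pct: float = 40.0,  # assumed threshold; set per strategy
) -> bool:
    """Flag a resource only if its peak utilization stayed low over the whole window."""
    if len(daily_peak_util_pct) < min_days:
        # Too little history to rule out month-end or quarterly busy periods.
        return False
    return max(daily_peak_util_pct) < peak_threshold_pct


steady = [20.0] * 90                    # consistently low utilization
quarter_end = [20.0] * 89 + [85.0]      # low most days, one busy day

print(is_rightsizing_candidate(steady))       # True
print(is_rightsizing_candidate(quarter_end))  # False
```

Using the maximum rather than the average is a deliberately conservative choice here: a single busy period is enough to keep the resource off the candidate list until someone reviews it.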

There may be times when utilization needs to decrease and the extra expense incurred is worth the value the resources create. Or the opposite may be true, and carbon and/or performance expectations can be lowered to improve cost.

For some resources, like storage, it may be necessary to estimate latent inefficiency in the stored data, and by extension the potential gross savings that can be realized by removing, or rightsizing, that inefficiency. Different data sets require tailored approaches. For example, highly compressible (yet uncompressed) data has relatively high latent inefficiency, whereas encrypted data has relatively low (or no) latent inefficiency. Data that is infrequently accessed but stored in a high cost, high performance storage class (or tier) also has relatively high latent inefficiency. Similarly, storage data housekeeping, such as optimizing data placement, implementing data compression techniques, and adopting tiered storage solutions, can reduce this inefficiency. By reducing unnecessary data duplication and implementing energy-efficient storage infrastructure, organizations can minimize their carbon footprint.
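
One way to make the storage estimate concrete is the small calculation below. It is a sketch under assumptions: the formula (current monthly cost minus the cost after compression and/or a tier change) is one possible reading of "potential gross savings", and the function name, compression ratio, and $/GB/month rates are hypothetical placeholders rather than real provider prices.

```python
from typing import Optional


def potential_gross_savings(
    size_gb: float,
    current_rate: float,                  # $/GB/month in the current storage class
    compression_ratio: float = 1.0,       # 1.0 means incompressible (e.g. encrypted data)
    target_rate: Optional[float] = None,  # $/GB/month after a tier change, if any
) -> float:
    """Estimate monthly gross savings from compressing and/or re-tiering a data set."""
    if compression_ratio < 1.0:
        raise ValueError("compression_ratio must be at least 1.0")
    if target_rate is None:
        target_rate = current_rate  # no tier change assumed
    current_cost = size_gb * current_rate
    optimized_cost = (size_gb / compression_ratio) * target_rate
    return current_cost - optimized_cost


# Hypothetical data sets, each 1,000 GB at an assumed $0.02/GB/month.
compressible = potential_gross_savings(1000, 0.02, compression_ratio=4.0)
encrypted = potential_gross_savings(1000, 0.02, compression_ratio=1.0)
cold = potential_gross_savings(1000, 0.02, target_rate=0.005)

print(f"compressible: ${compressible:.2f}/mo, encrypted: ${encrypted:.2f}/mo, cold: ${cold:.2f}/mo")
```

This is a gross figure only: it ignores retrieval fees, compute spent on compression, and any tooling cost, all of which would need to be netted out before acting.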

  • Modernization – short of rearchitecting an application (which is considered in the Architecting for Cloud capability) there will be many cases where cloud service providers modernize resources – releasing new generations of compute families, serverless versions of existing services, or new tiers of service (more or less performant, more or less costly) – which should initiate a look at the use of older generations for modernization. Newer resource types typically are more cost-performant per unit. Not every upgrade requires immediate action, but Engineering and FinOps teams should stay on top of new services.

For any of these decisions to be made, resource utilization, efficiency, cloud sustainability, and cost must be looked at together. Determining when workload optimization can be done effectively involves estimating not only the savings that can accrue from the change, but also the cost (in labor hours, outages, etc.) of making the change, and potentially transforming the use of the resource in the process.
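
The comparison of savings against the cost of change can be sketched as a payback calculation. The function names, the 50 labor hours, the hourly rate, and the monthly savings figure are all hypothetical values for illustration; an organization would substitute its own estimates and may also want to account for outage risk, which this sketch folds into a single `other_costs` figure.

```python
import math


def change_cost(labor_hours: float, hourly_rate: float, other_costs: float = 0.0) -> float:
    """One-time cost of making the change (labor plus outage, tooling, etc.)."""
    return labor_hours * hourly_rate + other_costs


def payback_months(monthly_savings: float, cost: float) -> float:
    """Months until cumulative savings cover the one-time cost."""
    if monthly_savings <= 0:
        return math.inf  # the change never pays for itself
    return cost / monthly_savings


def net_benefit(monthly_savings: float, cost: float, horizon_months: int) -> float:
    """Savings minus cost over a chosen evaluation horizon."""
    return monthly_savings * horizon_months - cost


# Hypothetical optimization: 50 hours of work at $100/hour, saving $1,000/month.
cost = change_cost(labor_hours=50, hourly_rate=100)
print(payback_months(1000, cost))   # 5.0
print(net_benefit(1000, cost, 12))  # 7000
```

A threshold on payback period or net benefit is one way to express the "thresholds for taking action" that a workload optimization strategy might define.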

The key aspect of Workload Optimization to focus on is moving from identifying which optimizations are technically possible, and aligning with Engineering or the other personas involved to make those changes, to identifying when real opportunities exist to improve value.

Maturity Assessment

Crawl

  • Establishing a basic workload optimization strategy identifying top resources to target, basic prioritization, basic optimization KPI goals for cost and carbon
  • Developing visibility into resource utilization and efficiency using one or more sources such as cloud billing data, infrastructure monitoring tools, data efficiency tools, cloud provider insights/tools
  • Defining a basic efficiency metric – i.e. a metric that speaks to your business that can be used to measure how efficient a resource is
  • Likely the primary focus is on compute optimization and related service costs

Walk

  • Establishing a more comprehensive optimization strategy differentiating optimization approaches for different resource types, with cadence and priority guidance for Engineering
  • Understanding the financial, operational, or cloud sustainability value expected from specific optimization activities
  • Able to estimate the costs and effort required to optimize the service, and the operational impact (e.g. “it will cost 50 hours of work to make this change at an hourly rate of X”, or “it will cost $0.01/GB for a data efficiency platform to surface the savings potential of the data”)
  • Able to measure the actual cost and effort incurred in performing the action, in terms of labor, cloud sustainability, or operational impact (e.g. “it cost 50 person-hours to make this change at an hourly rate of X”)
  • Recommendations are documented simply and tracked, to allow personas to see impact of optimization
  • Basic automation of simple optimization processes

Run

  • Requires a comprehensive optimization strategy that provides guidance on many services, various approaches to addressing waste, specific guidance for Engineering, Sustainability and Product Personas on optimization expectations, and specific KPI targets
  • Access to detailed cost and utilization data to drive automated processes
  • Automated alerting or cleanup of idle resources, rightsizing, and updates to the architecture/sizing of deployed resources
  • Automated triage of notifications for resources that will not be valuable to pursue
  • Recommendations and opportunities for optimization tracked when identified, and analysis of impact performed to steer future strategy

Functional Activities

FinOps Practitioner

  • Create and manage a Workload Optimization Strategy for the business
  • Promote and support collaboration with Engineering, Sustainability and other Personas as needed to identify opportunities for workload optimization
  • Support the reporting, data, and analysis needs of Engineering to identify opportunities
  • Provide oversight of the identification of optimizations that will provide the most value to the organization; triage and make recommendations based on comparison to other optimization categories, particularly Rate Optimization, which is the responsibility of the central FinOps team

Engineering

  • Architect and/or purchase services with the strategy, KPIs, and forecasts guiding decisions
  • Use elasticity, rightsizing, utilization metrics, and workload management best practices to match resources with the workload demands
  • Build and/or purchase automation to output the measures and metrics needed to assess utilization and efficiency
  • Regularly review utilization and efficiency of resources, and identify opportunities to improve

Finance

  • Highlight any opportunities to increase utilization and efficiency and work with the teams to review feasibility of alternative options
  • Help create the reporting to track and report on the impact on value of underutilization and inefficiencies
  • Partner with the Engineering organization to establish budgetary & efficiency targets

Procurement

  • Seek to understand the future impact of planned workload optimizations on cloud spend when negotiating with cloud service providers

Product

  • Clearly define service KPIs so that Engineering is able to design and/or purchase efficient services within the defined boundaries
  • Provide demand forecasts and information on the demand pattern profiles (daily/weekly/monthly/cyclic)
  • Establish the business goals for the objective (e.g. release to customers as quickly as possible, reduce the effective storage rate by >20%, release to customers w/ an availability of 99.99%)
  • Work with Engineering, FinOps and Finance personas to meet the needs of the optimization strategy for areas under Product's control

Leadership

  • Deliver the business value creation vision and strategy to inform the optimization strategy
  • Provide executive-level support for the defined KPIs, establishing credibility in the FinOps efficiency program
  • Drive prioritization and decision making for workload optimization work, alongside alternative types of optimization such as cost and sustainability in the context of expected business activities

Measures of Success & KPIs

  • Data efficiency is applied to at least 50% of stored data (i.e. net savings coverage is >50%)
  • Effective $/GB/mo storage rates are reduced by at least 30% relative to the S3 Standard baseline
  • KPI Library
  • Waste Management library
  • Use Unit Economics to create KPIs to measure workload performance metrics per some unit of work. You might consider a compute or throughput metric (e.g. vCPU-Hours), monetary cost, or carbon emissions (CO2e) estimate per customer, transaction, or other similar unit of work.
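
The measures above can be expressed as simple ratios. The sketch below is illustrative only: the function names and all input figures (spend, transaction counts, $/GB/month rates, stored volumes) are hypothetical, and the baseline rate is a placeholder rather than an actual provider price.

```python
def unit_cost(total: float, units: float) -> float:
    """Cost (or vCPU-hours, or CO2e) per customer, transaction, or other unit of work."""
    if units <= 0:
        raise ValueError("units must be positive")
    return total / units


def rate_reduction(baseline_rate: float, effective_rate: float) -> float:
    """Fractional reduction of the effective $/GB/mo rate relative to a baseline rate."""
    if baseline_rate <= 0:
        raise ValueError("baseline_rate must be positive")
    return 1 - effective_rate / baseline_rate


def coverage(optimized_gb: float, total_gb: float) -> float:
    """Share of stored data to which data efficiency has been applied."""
    if total_gb <= 0:
        raise ValueError("total_gb must be positive")
    return optimized_gb / total_gb


# Hypothetical month: $1,200 of spend serving 400,000 transactions.
print(f"cost per transaction: ${unit_cost(1200.0, 400_000):.4f}")

# Hypothetical storage figures checked against the 30% and 50% targets above.
print(rate_reduction(baseline_rate=0.020, effective_rate=0.013) >= 0.30)  # True
print(coverage(optimized_gb=600, total_gb=1000) >= 0.50)                  # True
```

The same `unit_cost` helper works for any numerator, so monetary cost, vCPU-hours, and estimated emissions can be tracked per unit of work side by side.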

Inputs & Outputs

Inputs

  • Reporting & Analytics capability to understand where workloads are underutilized, underperforming, idle, or require adjustment
  • Data Ingestion capability may be required to bring in Performance, Utilization or other Observability data to effectively measure individual resource performance
  • Organizational objectives from Finance and Product personas related to required rate targets and thresholds
  • Finance approvals for purchases, purchase cadence, purchase amounts, prepayment parameters
  • Third-party carbon emissions factors, regional electrical grid carbon intensity data, and regional water intensity data are examples of additional data that might be necessary to inform cloud sustainability decisions.

Outputs

  • Guidance to the Procurement persona on future planned usage, past usage of resources, and rate-optimizable usage
  • Expectations of the impact of optimization improvements, provided to Planning & Estimating and Forecasting
  • Documentation (mini business cases) justifying optimization opportunities to adjust usage by turning workloads off, rightsizing, replacing with other resources, selecting/changing region or using more carbon or cost effective options
  • Guidance to Engineering and Product teams if discounts are shown to those teams, to give information about anticipated rates for covered resources
  • Unit Economics evaluation of effective rate and other metrics related to rates