This playbook provides an in-depth guide for optimizing costs in Amazon EC2 Autoscaling. It covers various strategies, from setting up auto-scaling groups to advanced cost-saving techniques like spot and reserved instances.
The specific instructions for these activities may change over time as AWS rolls out new features, pricing models, and user interfaces. We have linked to relevant AWS-provided documentation to help this document stay current.
The insights, instructions, and resources provided herein, in combination with those available directly from AWS, should help individuals have a more complete understanding of this action as it pertains to FinOps and usage optimization.
Who should use this Playbook
This playbook is intended for cloud architects, DevOps engineers, FinOps practitioners, and IT managers who have a deep understanding of AWS services and are looking to optimize EC2 costs in a scalable manner.
Prerequisites & recommended understanding
AWS Account and Permissions
- Active AWS Account with proper billing
- When investigating EC2 Autoscaling with the AWS Cost and Usage Report, include these attributes: Resource IDs, hourly time granularity, UsageType, Operation, Cost Allocation Tags, Availability Zone, Linked Account, Service, and Pricing Options. These attributes provide the necessary granularity and context for analyzing autoscaling behavior and associated costs, identifying cost-effective zones, and segregating costs per account. Review the data dictionary included in the appendix for more details.
- AWS Support activation for assistance and issue resolution
IAM roles and policies
- Implement least-privilege IAM roles and policies
- Rotate access keys and minimize root account usage
Infrastructure setup
- VPC Configuration – Secure VPC with public and private subnets
- Key Pair – Utilize and manage secure key pairs
EC2 configuration
- AMI Selection – Use up-to-date AMIs and customized AMIs for specific needs
- Instance Type – Choose types aligning with the workload and explore cost-effective options like Spot or Reserved Instances
- Storage – Optimize EBS volumes, schedule snapshots for backups
Monitoring and metrics
- CloudWatch Metrics & Alarms – Set up for EC2 monitoring, log management, and critical alerts
Load balancing
- Elastic Load Balancer – Implement ALB or NLB as needed, with SSL/TLS termination for security
Autoscaling configuration
- Launch Templates – Use for advanced setups
- Auto Scaling Group – Set health checks, mixed instance types, and purchase options
- Scaling Policies – Implement target-tracking policies and consider scheduled scaling
Cost management
- Budgets and Alerts – Monitor costs with AWS Budgets and set notifications for overspend
Testing
- Conduct load and auto-scaling responsiveness tests under various conditions
Please refer to the Appendix/External Resources section for detailed guidelines and additional resources.
Who needs to be involved
Involvement from the following FinOps Personas is vital to the success of EC2 Autoscaling. Below is a summary of the functional activities required from each persona, outlined using the DACI responsibility model.
- The FinOps Practitioner will analyze cost data and validate the effectiveness of the optimization strategies as a Driver.
- The Cloud Architect will design and implement the Autoscaling configurations as a Driver.
- The DevOps Engineer will execute the configurations and monitor the system as a Contributor.
- The IT Manager will oversee the entire process and make final decisions as a Decider.
- The Finance Team will validate cost savings and ROI as Informed.
Information and resources required
This section provides information that contributes to the success of this playbook.
Information
- AWS Management Console Access: For configuring EC2 Autoscaling
- Cost and Usage Reports: For pre- and post-optimization analysis
- AWS CLI and SDKs: For script-based configurations and automation
- Instance Metrics Data: For determining the optimal instance types and sizes
Tools, utilities & templates
Links to tools, utilities, and templates to use during the playbook.
Step 1 – Initial assessment
Estimated time: 2 Hours
Who: FinOps Practitioner
What: Conduct an initial assessment using AWS Cost Explorer to identify the current usage patterns related to EC2 instances.
How: Log into AWS Management Console and access AWS Cost Explorer. Set the time range for analysis to understand recent cost trends.
Comprehensive View: Sometimes, it’s beneficial to start with a broader time range (e.g., one year) for a high-level overview and then narrow down to specific periods (e.g., three months) to investigate anomalies or spikes in costs.
Choose a 3-Month Time Range If:
- Evaluating Recent Changes: Your organization has recently implemented cost-saving measures or architectural changes, or launched new projects, and you wish to assess their immediate impact on costs.
- Seasonal or Project-Based Analysis: You are analyzing costs for a specific project or during a seasonal peak to understand the cost impact of temporary scaling or usage spikes.
- Short-Term Budgeting: Your focus is on short-term budget adjustments and forecasts, where understanding recent trends is more critical for immediate financial planning.
Opt for a 1-Year Time Range In Scenarios Where:
- Annual Budgeting and Trend Analysis: You’re conducting long-term planning and need to understand cost trends throughout the year to inform budgeting and forecasting decisions.
- Identifying Seasonal Patterns: Your organization experiences predictable seasonal fluctuations in demand, and analyzing a full year allows you to capture these patterns and plan for future cycles.
- Year-over-Year Comparisons: You wish to compare current costs against the previous year to evaluate the effectiveness of long-term cost optimization strategies or to measure growth in usage and associated costs.
Other Considerations:
- Custom Time Range: Beyond the standard 3-month or 1-year recommendations, consider a custom time range that aligns with specific business events, fiscal quarters, or before/after analysis of significant infrastructure changes.
- Granularity: While selecting the time range, consider the granularity (daily, monthly) that will provide the most insight. For short-term analysis, daily granularity can highlight the immediate effects of changes. For long-term trends, monthly granularity may suffice.
Filter the cost data to focus on EC2 instances
When focusing on EC2 instances, a deeper dive into monitoring and managing their usage can provide significant cost savings.
- Instance Usage or Running Hours Guidance: Understanding and managing the running hours of your EC2 instances is crucial for cost optimization. For organizations uncertain about how to proceed, AWS offers tools like AWS Cost Explorer and the AWS Management Console, which can help track usage patterns.
Strategies for Optimization
- Schedule Start/Stop Times: For non-essential instances that do not need to run 24/7, use AWS Instance Scheduler to automatically start and stop these instances based on business hours or usage patterns.
- Rightsize Instances: Regularly review your instances’ performance metrics via Amazon CloudWatch to identify opportunities to change instance types or sizes, thus ensuring you’re only paying for the capacity you actually need.
- Autoscaling: Implement autoscaling to adjust the number and size of instances in real-time based on demand. This not only helps in managing application performance but also in reducing costs by ensuring instances are only running when needed.
- Monitoring and Adjustments: Utilize AWS Cost Explorer for detailed insights into your EC2 usage and expenditures. It allows you to visualize usage patterns, identify cost trends, and make informed decisions about which instances to keep running, resize, or terminate.
- Utilization Reports: AWS provides utilization reports that can help you understand if you’re fully utilizing your Reserved Instances or if there are savings plans that are being underutilized. These reports can guide adjustments to your reservations or savings plans, aligning them more closely with your actual usage.
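To make the scheduling strategy above concrete, the sketch below estimates what stopping a non-essential instance outside business hours saves versus running it 24/7. The hourly rate and schedule are illustrative assumptions, not AWS pricing data:

```python
# Hypothetical illustration: estimate savings from stopping non-production
# instances outside business hours. Rates and schedules are assumptions.

HOURS_PER_WEEK = 24 * 7  # 168

def weekly_schedule_savings(hourly_rate: float,
                            business_hours_per_day: float = 12,
                            business_days_per_week: int = 5) -> dict:
    """Compare 24/7 cost with a business-hours-only schedule."""
    always_on = hourly_rate * HOURS_PER_WEEK
    scheduled = hourly_rate * business_hours_per_day * business_days_per_week
    return {
        "always_on_cost": round(always_on, 2),
        "scheduled_cost": round(scheduled, 2),
        "weekly_savings": round(always_on - scheduled, 2),
        "savings_pct": round(100 * (1 - scheduled / always_on), 1),
    }

# Example: an instance at an assumed $0.096/hour, running 12 hours a day,
# 5 days a week instead of around the clock.
print(weekly_schedule_savings(0.096))
```

Even this simple schedule removes roughly two-thirds of the instance's running hours, which is why AWS Instance Scheduler is often the first optimization applied to development and test environments.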
Implementation Consideration
When integrating these strategies, it’s important not to delve too deeply into complexities that may not apply universally. Instead, focus on providing clear, actionable steps that organizations of any size can take to monitor and optimize their EC2 instance usage. Regularly revisiting and refining these strategies as your organization’s needs and AWS offerings evolve will ensure continuous optimization of cloud costs.
This approach not only offers direct pathways to cost savings but also ensures that teams are empowered with the knowledge and tools necessary to manage their cloud resources effectively. Whether it’s through strategic scheduling, rightsizing, or leveraging autoscaling and AWS Cost Explorer, these practices play a critical role in aligning operational efficiency with cost optimization objectives.
Analyze the spending patterns, identifying high-cost areas and underutilized resources
When conducting an analysis using AWS Cost Explorer to understand EC2 instance costs, there are several key data points and FinOps insights that organizations typically seek to uncover. This analysis aims to identify opportunities for cost optimization, ensure efficient resource utilization, and align spending with business value.
Here’s what to look for and how to interpret the findings:
High Spend Areas
- Identify Top Spend Areas: Look for the services, accounts, or regions with the highest costs. High-spend areas could indicate where the most significant opportunities for cost savings lie.
- Anomalies in Spend: Sudden spikes or unusual patterns in spending could signal inefficient use, such as over-provisioned resources, forgotten instances, or suboptimal deployment choices.
Utilization Metrics
- CPU and Memory Utilization: Low utilization metrics often indicate over-provisioning. EC2 instances running consistently low CPU or memory utilization (e.g., below 20-30%) may be downsized to a smaller instance type for cost savings.
- Network and Disk I/O: High network or disk input/output could signal the need for optimized data transfer practices or more efficient storage solutions.
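The utilization thresholds above can be expressed as a simple triage rule. The sketch below is a minimal heuristic using the 20-30% low-utilization band discussed earlier; the instance names and metric values are invented for illustration:

```python
# Minimal rightsizing heuristic based on the 20-30% threshold discussed
# above. Instance names and metrics are illustrative only.

def rightsizing_recommendation(avg_cpu_pct: float,
                               avg_mem_pct: float,
                               low_threshold: float = 30.0) -> str:
    """Flag instances whose CPU and memory both sit below the threshold."""
    if avg_cpu_pct < low_threshold and avg_mem_pct < low_threshold:
        return "downsize-candidate"
    if avg_cpu_pct > 80.0 or avg_mem_pct > 80.0:
        return "upsize-or-scale-out"
    return "keep"

fleet = {
    "web-1": (12.0, 25.0),    # consistently idle -> downsize
    "web-2": (55.0, 48.0),    # healthy utilization
    "batch-1": (92.0, 40.0),  # CPU-bound -> scale out
}
for name, (cpu, mem) in fleet.items():
    print(name, rightsizing_recommendation(cpu, mem))
```

In practice the inputs would come from CloudWatch averages over a representative window (e.g., two weeks), not a single data point.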
Cost Optimization Opportunities and Spend Commitments
When exploring cost optimization opportunities within AWS, particularly through Reserved Instances (RI) or Savings Plans, and Spot Instance usage, it’s crucial to consider how these strategies intersect with autoscaling capabilities and existing spend commitments.
Reserved Instances (RI) or Savings Plans Utilization
- Spend Commitments vs. Autoscaling: Adopting autoscaling can impact how effectively you utilize RIs or Savings Plans. Autoscaling dynamically adjusts the number of instances in response to demand, which could lead to underutilization of already committed RIs or Savings Plans if not carefully managed. To maximize these commitments, match them with a baseline load that autoscaling won’t dip below.
- Cost Optimization: Analyze your organization’s utilization of RIs or Savings Plans to ensure that autoscaling policies are aligned with these commitments. Underutilization may indicate that your autoscaling policies are too aggressive or that your RI/Savings Plans purchasing doesn’t match your actual usage patterns, while overutilization could suggest areas where additional RIs or Savings Plans could capture savings.
Spot Instance Usage
- Autoscaling Flexibility: Spot Instances, offering significant cost savings for flexible workloads, complement autoscaling by providing a way to scale capacity at a lower cost. However, the use of Spot Instances should be balanced with the reliability requirements of your applications.
- Cost Savings vs. Commitments: Integrating Spot Instances with autoscaling strategies requires careful consideration of your existing spend commitments. While Spot Instances can reduce costs for variable workloads, relying heavily on them should not come at the expense of fulfilling your RI or Savings Plans commitments.
Incorporating Spend Commitments into Autoscaling
Savings Plans commitments
- Strategic Autoscaling: Develop autoscaling policies that take into account existing RI or Savings Plans commitments to ensure that these financial investments are optimized rather than wasted. This might involve setting a minimum level of usage that aligns with RI or Savings Plans coverage to ensure that baseline demand is always met through these committed resources.
- Monitoring and Adjustment: Regularly review and adjust autoscaling policies and RI or Savings Plans commitments to align with changing demand patterns and to avoid financial penalties or underutilized resources.
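The "baseline that autoscaling won't dip below" idea above can be sketched numerically: commit RIs or Savings Plans to the minimum observed concurrency and let autoscaling cover everything above it. The hourly counts below are made-up example data:

```python
# Sketch: size RI/Savings Plans commitments to the baseline load that
# autoscaling never dips below. Hourly instance counts are invented.

def baseline_commitment(hourly_instance_counts: list[int]) -> int:
    """Commit to the minimum observed concurrency; autoscaling covers the rest."""
    return min(hourly_instance_counts)

# One day of observed instance counts for an Auto Scaling group.
observed = [4, 4, 4, 5, 6, 8, 10, 12, 12, 11, 9, 7,
            6, 6, 7, 9, 11, 12, 10, 8, 6, 5, 4, 4]

baseline = baseline_commitment(observed)
peak = max(observed)
print(f"Cover {baseline} instances with RIs/Savings Plans;")
print(f"let autoscaling handle the remaining {peak - baseline} at peak.")
```

A more conservative variant uses a low percentile rather than the strict minimum, trading a small risk of idle commitment for higher coverage.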
Thresholds and Benchmarks
- Utilization Thresholds: Establish benchmarks for resource utilization to identify under- or over-utilized resources. For instance, set a 20-30% utilization threshold for considering downsizing or terminating instances.
- Cost Growth Rate: Monitor the rate of cost growth compared to usage or business growth. A faster rate of cost growth could indicate inefficiencies or scaling issues.
- Cost Per Unit of Work: Calculate the cost per unit of work (e.g., transactions processed, users served) to assess the efficiency of your deployment. A high or increasing cost per unit indicates areas for optimization.
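The cost-per-unit benchmark above is a simple division, but tracking it over time is what makes it useful. A minimal sketch, with invented figures:

```python
# Illustrative unit-economics check: cost per unit of work over two
# periods. All figures are invented for the example.

def cost_per_unit(total_cost: float, units_of_work: int) -> float:
    """Spend divided by work performed, e.g., dollars per transaction."""
    return round(total_cost / units_of_work, 4)

jan = cost_per_unit(12_000.00, 3_000_000)  # $ per transaction in January
feb = cost_per_unit(13_500.00, 3_100_000)  # $ per transaction in February

print(f"Jan: ${jan}/txn, Feb: ${feb}/txn")
if feb > jan:
    print("Cost per unit is rising faster than workload -> investigate.")
```

Here spend grew 12.5% while workload grew only about 3%, so the unit cost rose: a signal that scaling is not tracking demand efficiently.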
Interpreting the Numbers
- Cost vs. Utilization Mismatch: A high cost with low utilization suggests over-provisioning and an opportunity for downsizing. Conversely, high utilization with unexpected cost spikes might indicate the need for reserved instances or more efficient instance types.
- Spot vs. On-Demand Spend: A high proportion of On-Demand spend in flexible or interruptible workloads suggests potential savings through increased Spot Instance usage.
- RI or Savings Plan Coverage: Low coverage indicates underutilization of commitment discounts, while high coverage with significant On-Demand spend suggests opportunities for further commitments.
FinOps Insights
- Align Spending with Value: Ensure high-spend areas correlate with high-value business functions. Discrepancies here indicate misaligned resources.
- Optimize Before You Spend: Before increasing spend in response to growth, ensure that existing resources are fully optimized to avoid unnecessary costs.
- Cost Visibility and Accountability: Foster a culture of cost awareness and accountability, encouraging teams to review and optimize their cloud usage and costs regularly.
Compile a report detailing these findings, emphasizing potential cost optimization areas.
Outcome: A report highlighting the areas where cost optimization can be most effective.
Step 2 – Design autoscaling strategy
Estimated time: 4 Hours
Who: Cloud Architect
What: Design the Autoscaling strategy, considering factors like instance types, scaling policies, and load balancing.
How:
- Review the initial assessment completed in Step 1 to inform the autoscaling strategy.
- Assess the application’s performance requirements and forecasted workload.
- Choose suitable EC2 instance types and sizes based on performance and cost efficiency.
- Develop scaling policies that align with workload variations.
- Determine the appropriate load balancing strategy (ALB or NLB) based on application needs.
- Calculate and compare costs for different Autoscaling configurations.
- Document the strategy, including all technical decisions and justifications.
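The cost-comparison step above can be sketched as a blended-rate calculation. The On-Demand rate and Spot discount below are placeholder assumptions, not current AWS prices:

```python
# Hedged sketch comparing monthly cost of candidate Autoscaling
# configurations. Prices and discounts are placeholder assumptions.

HOURS_PER_MONTH = 730

def monthly_cost(avg_instances: float, on_demand_rate: float,
                 spot_fraction: float = 0.0, spot_discount: float = 0.7) -> float:
    """Blend On-Demand and Spot capacity; spot_discount is the assumed
    saving relative to On-Demand (e.g., 0.7 = 70% cheaper)."""
    od = avg_instances * (1 - spot_fraction) * on_demand_rate
    spot = avg_instances * spot_fraction * on_demand_rate * (1 - spot_discount)
    return round((od + spot) * HOURS_PER_MONTH, 2)

rate = 0.096  # assumed On-Demand $/hour
all_on_demand = monthly_cost(8, rate)
mixed = monthly_cost(8, rate, spot_fraction=0.5)
print(f"All On-Demand: ${all_on_demand}, 50% Spot: ${mixed}")
```

Running the same average capacity with half the fleet on Spot cuts the monthly bill by roughly a third under these assumptions; the strategy document should record the assumptions alongside the numbers.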
Outcome: A comprehensive Autoscaling strategy document.
Step 3 – Implement Autoscaling
Estimated time: 3 Hours
Who: DevOps Engineer
What: Implement the Autoscaling configurations using AWS Management Console or AWS CLI.
How:
- Create Launch Templates with the selected configurations in AWS Console or via AWS CLI.
- Set up an Auto Scaling Group, define capacity thresholds, and attach the launch template.
- Configure CloudWatch alarms to trigger scaling actions according to the defined policies.
- Integrate and set up the chosen load balancer with the Auto Scaling Group.
- Test the setup under various load conditions to ensure it functions as designed.
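As a concrete reference point for the scaling policies above, the dict below is the shape of a target-tracking configuration as accepted by boto3's Auto Scaling `put_scaling_policy(TargetTrackingConfiguration=...)`; the 50% CPU target is an example value, not a recommendation:

```python
# Illustrative target-tracking configuration for an Auto Scaling group
# scaling policy. The 50% target is an example value only.

target_tracking_config = {
    "TargetValue": 50.0,  # keep average CPU near 50%
    "PredefinedMetricSpecification": {
        "PredefinedMetricType": "ASGAverageCPUUtilization"
    },
    "DisableScaleIn": False,  # allow scale-in so idle capacity is removed
}

print(target_tracking_config["PredefinedMetricSpecification"])
```

Target tracking creates and manages the underlying CloudWatch alarms itself, which is why it is usually preferred over hand-built step scaling alarms.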
Outcome: Functional Autoscaling Groups that adjust according to the defined policies.
Step 4 – Monitor and adjust (Ongoing)
Who: DevOps Engineer
What: Continuously monitor the performance and costs using AWS CloudWatch and make necessary adjustments.
How:
- Regularly review CloudWatch for insights on EC2 and Autoscaling performance.
- Adjust CloudWatch alarms and scaling policies based on the observed data.
- Utilize AWS Cost Explorer to track the financial impact of Autoscaling.
- Continuously refine Autoscaling settings to align with changing workloads and cost optimization goals.
- Keep a record of changes for future analysis and ongoing optimization.
Outcome: Optimized resource utilization and reduced costs.
Outcomes and Indicators of Success
Primary outcomes of running this Playbook
Achieving cost reduction
- Aim to significantly reduce monthly EC2 costs. Base the target percentage on a comprehensive analysis of historical data.
- Analyze past EC2 usage and spending trends using AWS Cost Explorer and CloudWatch.
- Identify patterns of underutilization or over-provisioning of resources
- Assess costs during peak and non-peak hours to understand demand fluctuations
- Implement Autoscaling strategies that closely align with these observed patterns
Utilization metrics analysis
- Continuously monitor network, disk, memory, and compute throughput using AWS CloudWatch.
- Compare these metrics against your established benchmarks for optimal utilization.
- Adjust Autoscaling parameters to maintain a balance between performance and cost.
Projecting future needs
- Use historical resource usage trends to predict future requirements
- Justify scaling decisions based on these projections, ensuring that resources are scaled up or down in anticipation of changing demands
Maintaining high utilization efficiency
- Establish specific resource utilization targets adaptable to varying usage patterns and demand forecasts.
- These targets could include CPU utilization thresholds, network bandwidth usage, and disk I/O operations.
Strategies for optimization
- Implement a mix of instance types and purchasing options (On-Demand, Reserved, Spot) to optimize costs while ensuring resource availability.
- Use Load Balancers and Auto Scaling Groups to distribute traffic and workload efficiently.
- Regularly review and update autoscaling policies to adapt to new application requirements or changes in usage patterns.
Continuous monitoring and adjustment
- Set up detailed tracking for all key metrics in CloudWatch
- Regularly review the performance data and adjust the autoscaling settings to ensure the resources are used efficiently without incurring unnecessary costs
Indicators of success
High resource utilization
- Aim for an average CPU utilization between 40% and 70% for most workloads, which indicates effective use without over-provisioning
Measurement resources
- Utilize AWS CloudWatch to monitor CPU utilization, network I/O, and disk I/O metrics. Review these metrics to ensure they align with your demand forecasting and adjust as necessary.
KPI change
- An increase in these metrics within the optimal range indicates successful autoscaling implementation.
Exceptions and considerations
Handling Malfunctioning Autoscaling Groups
Scenario: The autoscaling group is not behaving as expected, potentially leading to inefficiencies in resource utilization or unexpected costs.
Follow the steps below to resolve the issue.
Turn off the malfunctioning autoscaling group to prevent further resource allocation issues or cost escalation. This should be done in consultation with key stakeholders:
- IT Manager (“Decider”): Approves the temporary suspension of the autoscaling group and any necessary resource adjustments.
- DevOps Engineer (“Contributor”): Assists in the technical process of turning off the autoscaling group and ensures that alternative resource allocation methods are in place.
- Cloud Architect (“Driver”): Oversees the overall strategy and ensures that turning off the autoscaling group aligns with broader architectural goals and cloud usage optimization strategies.
Analysis and Diagnosis
Conduct a thorough investigation to identify the root cause of the issue. This may involve reviewing configuration settings, recent changes, and utilization patterns. Utilize cloud service providers’ diagnostic tools and logs to gather detailed insights.
Developing a Remediation Plan
Once the cause is identified, collaborate with the Cloud Architect and DevOps Engineer to develop a remediation plan. This plan should address the identified issues and outline steps to prevent similar occurrences.
Re-enabling the Autoscaling Group
After the IT Manager approves the remediation plan, re-enable the autoscaling group. Monitor the group closely to ensure it functions as intended.
Post-Implementation Review
After resolving the issue, conduct a post-implementation review to assess the response’s effectiveness and identify lessons learned. This can contribute to improving practices for cloud usage optimization.
Documenting and Communicating the Process
Ensure that the entire process, from identification to resolution, is well-documented. Share this documentation with relevant teams to improve organizational knowledge and preparedness for similar scenarios.
Reserved Instances
When opting for Reserved Instances, consider the commitment length (1-year or 3-year terms) and evaluate if your workload will consistently use these instances over this period. Analyze the trade-off between upfront payment and the discounted hourly rate against your predicted usage patterns. Remember, Reserved Instances are ideal for steady-state workloads but less flexible for fluctuating demands.
Periodically review instance usage to ensure alignment with Reserved Instance types and sizes. Consider selling unused Reserved Instances in the AWS Marketplace if usage patterns change significantly.
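The upfront-versus-hourly trade-off described above can be framed as a break-even calculation: how many hours of usage before the RI's total cost drops below On-Demand. The rates below are placeholders, not published AWS pricing:

```python
# Break-even sketch for a 1-year Reserved Instance vs On-Demand.
# All rates are placeholder assumptions, not AWS pricing data.

def ri_break_even_hours(upfront: float, ri_hourly: float,
                        on_demand_hourly: float) -> float:
    """Hours of usage at which the RI's total cost matches On-Demand."""
    hourly_saving = on_demand_hourly - ri_hourly
    if hourly_saving <= 0:
        raise ValueError("RI rate must be below the On-Demand rate")
    return upfront / hourly_saving

hours = ri_break_even_hours(upfront=280.0, ri_hourly=0.032,
                            on_demand_hourly=0.096)
utilization_needed = hours / (365 * 24)
print(f"Break-even after {hours:.0f} hours "
      f"(~{utilization_needed:.0%} of the year)")
```

Under these assumed rates, the RI pays for itself only if the instance runs about half the year, which is why steady-state workloads are the right fit for reservations.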
Spot Instances
While Spot Instances offer significant cost savings, they can be terminated by AWS with a two-minute notice when AWS needs the capacity back. Therefore, they’re best used for stateless, fault-tolerant, or flexible workloads like batch processing, development environments, or large computations.
Implement fallback strategies using On-Demand or Reserved Instances. Use AWS Auto Scaling to combine Spot Instances with other types, ensuring high availability and cost optimization.
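One way to combine Spot capacity with an On-Demand fallback in a single Auto Scaling group is a mixed instances policy. The dict below sketches the shape accepted by boto3's Auto Scaling `create_auto_scaling_group(MixedInstancesPolicy=...)`; the launch template name and instance types are placeholders:

```python
# Sketch of a mixed instances policy combining Spot with an On-Demand
# baseline. Template name and instance types are placeholders.

mixed_instances_policy = {
    "LaunchTemplate": {
        "LaunchTemplateSpecification": {
            "LaunchTemplateName": "my-app-template",  # hypothetical name
            "Version": "$Latest",
        },
        # Several interchangeable types improve Spot availability.
        "Overrides": [
            {"InstanceType": "m5.large"},
            {"InstanceType": "m5a.large"},
            {"InstanceType": "m4.large"},
        ],
    },
    "InstancesDistribution": {
        "OnDemandBaseCapacity": 2,                  # always-on On-Demand floor
        "OnDemandPercentageAboveBaseCapacity": 25,  # 75% Spot above the floor
        "SpotAllocationStrategy": "capacity-optimized",
    },
}

print(len(mixed_instances_policy["LaunchTemplate"]["Overrides"]), "instance types")
```

The On-Demand base capacity acts as the fallback floor, so Spot interruptions degrade capacity gracefully rather than taking the application down.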
AWS Cost and Usage Report Data Dictionary
Provides a detailed list of the fields used in the AWS Cost and Usage Report. AWS CUR – Data Dictionary
AWS Well-Architected Framework
Provides comprehensive best practices for designing and running reliable, efficient, and cost-effective systems in the cloud. AWS Well-Architected Framework
FinOps Foundation
Offers resources and community support for adopting FinOps best practices, focusing on cost management and optimization in the cloud. FinOps Foundation
Acknowledgments
We’d like to thank the following people for their work on this Playbook:
We’d also like to thank Brian Robbins for their support on this asset.