Anderson Oliveira
Axa Group Operations
This paper aims to be an evolving set of materials to provide practical guidance, education, and enablement that helps organizations of all sizes and cloud maturity adopt FinOps as the standard for Cloud Financial Management. See also the first part of this series, Adopting FinOps: Getting Started, which includes the Adopting FinOps Pitch Deck, an open slideshow that practitioners can use to start building their case for FinOps.
At any point in your FinOps journey, once you have begun and had your share of wins, losses, and lessons, there are many uncertainties and unknowns that add to the feeling of not knowing how to proceed. When you get to this point, sometimes all you are looking for is a piece of advice: where to invest and what to avoid, together with some stories that can help inspire the journey you are about to make.
No matter where you are on this important journey, this paper will help by providing key points to take into consideration. It is not intended to be exhaustive, but it conveys valuable advice from those who have been doing this work for a while.
Making the distinction between cost management and optimization is crucial, as it highlights how you benefit from the cloud. Cost management focuses on waste reduction (removing unused resources, pruning old data, etc.) and sensibly applied commitments, while optimization ensures you buy only the minimal resources that provide the best value for money and answer the business requirement, not the developers' assumptions about the requirements.
Thinking about it in medical terms, cost management treats the symptom, while optimization studies the underlying cause and deals with it.
Optimization means working with Product, Architects, Developers and Business to define the new software while tailoring the cloud resources to match the needs and efficiency of the product (not the resources as they are used in the development environment).
A gaming company wanted to launch a new in-game feature that lets users communicate with other players. To do that, they planned to use AWS SQS to deliver ~50,000 messages per second.
When submitting the HLD (high-level design), the FinOps practitioner and the architect used the AWS calculator to forecast the monthly cost of SQS; the estimate was around $49,419.
The FinOps practitioner suggested using the Bulk feature of SQS to reduce the number of messages in the queue, which required changes to the code and to how messages are consumed. While working on the code to use bulk consumption, the developers realized that the message payload was small enough to pack 15 game data messages into 1 JSON document and then use the Bulk feature to submit 300 messages in 1 call. This reduced the number of calls to SQS from 50,000 to about 166 per second, and the expected cost went down accordingly, to $174.08 per month.
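The arithmetic behind this story can be checked with a few lines of Python. This is a minimal sketch that uses only the figures quoted in the case study; it does not model actual SQS pricing, and the function name is illustrative.

```python
# Figures quoted in the case study.
MESSAGES_PER_SECOND = 50_000
MESSAGES_PER_CALL = 300          # game messages carried by one bulk call
COST_BEFORE_USD = 49_419.00      # monthly estimate, one message per call
COST_AFTER_USD = 174.08          # monthly estimate after bulking


def calls_per_second(messages_per_second: int, messages_per_call: int) -> float:
    """Number of queue calls needed per second for a given packing factor."""
    return messages_per_second / messages_per_call


calls_before = calls_per_second(MESSAGES_PER_SECOND, 1)
calls_after = calls_per_second(MESSAGES_PER_SECOND, MESSAGES_PER_CALL)
cost_ratio = COST_BEFORE_USD / COST_AFTER_USD

print(f"calls per second: {calls_before:.0f} -> {calls_after:.1f}")
print(f"estimated monthly cost is about {cost_ratio:.0f}x lower")
```

The call count drops by a factor of 300 while the quoted cost drops by a slightly smaller factor, so the two figures in the story are broadly consistent with a price that scales with the number of calls.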
Removing waste and applying commitments to existing workloads, such as buying RIs or shutting down resources during off hours to reduce the cost of the same resources, manages expenses but does not optimize the usage of cloud resources.
A gaming company wanted a buffer of instances to absorb a failure in the game. They ran 100 m5.4xlarge On-Demand instances inside an elastic scaling group (min 100).
When we first approached management about reducing the group, they rejected the notion, as this was the “safety buffer” in case the game had an outage.
Studying the metrics of the group, we found that its average CPU usage over a month was 18%. We presented that to management and, after evaluating the usage against the risk of spikes, reduced the minimum to 90 and waited 2 weeks to ensure the group was stable. We repeated these iterations until the group was at 60 instances (the lowest limit management was willing to accept), with 25% running on Spot instances and the rest covered by Reserved Instances. Overall monthly CPU usage of the group rose to 33%.
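A rough way to reason about each reduction step is to estimate the average utilization the smaller group would see. The sketch below assumes the total load stays constant and is spread evenly across identical instances, which is a simplification; the intermediate group sizes of 80 and 70 are illustrative, as the story only names 100, 90 and 60.

```python
def expected_utilization(current_pct: float, current_instances: int,
                         target_instances: int) -> float:
    """Estimate average CPU % after resizing, assuming constant, evenly spread load."""
    return current_pct * current_instances / target_instances


# 18% average CPU on 100 instances, as measured in the case study.
for size in (100, 90, 80, 70, 60):
    estimate = expected_utilization(18.0, 100, size)
    print(f"{size} instances -> about {estimate:.1f}% average CPU")
```

For 60 instances this naive estimate gives 30%, close to the 33% the team observed, which suggests the group still had substantial headroom at the agreed minimum.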
Healthy competition is very beneficial to FinOps because it can help increase motivation to launch successful stakeholder capabilities, which can improve performance in gathering requirements, seeking feedback, demoing, and adding a pipeline for feature requests. Healthy competition can speed up adoption by uniting individuals around goals and missions that are mutually beneficial.
It is important for FinOps teams to focus not solely on their own performance but to recognize purpose and success as part of the greater whole.
Business operating models evolve over time. At a major education enterprise, the operating model shifted toward individual business units deciding which core services to build themselves and which to leverage centrally. The enterprise already had an established central FinOps function. Most individual business units recognized the economies of scale of working with the centralized team and had an established, trusted relationship with it. However, for a while a single business unit advocated for a decentralized approach.
The centralized FinOps team was able to educate the decentralized team and provide resources (dashboards, tooling) to support them. To avoid friction, the two teams began meeting at a regular cadence and sharing feedback. This allowed the decentralized team to build on what the centralized team had and also provided a customer feedback loop for improvements. Over time, the two teams gained trust in one another. This relationship was not without some turbulence. However, an attitude that embraces internal competition, with the understanding that healthy competition is focused on the greater good, can be very valuable for FinOps teams.
FinOps must engage and collaborate with domain experts before sharing recommendations with wider teams. Many of the ‘quick wins’ or ‘low-hanging fruit’ of cost optimization are in the realm of engineering teams. What looks on the surface like a good action to take may not be as appealing once you consider the context.
The FinOps team for a company spending tens of millions of dollars a year was able to rank Virtual Machine recommendations by potential savings and prioritize which engineering teams they should start talking to.
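Ranking of this kind can be sketched in a few lines. The record layout, team names and savings figures below are hypothetical and only illustrate the idea of aggregating recommendations per team and sorting by potential savings; real data would come from the cloud provider's recommendation export.

```python
from collections import defaultdict

# Hypothetical rightsizing recommendations (all values are made up).
recommendations = [
    {"team": "search", "resource": "vm-a", "monthly_savings_usd": 1200.0},
    {"team": "checkout", "resource": "vm-b", "monthly_savings_usd": 300.0},
    {"team": "search", "resource": "vm-c", "monthly_savings_usd": 800.0},
    {"team": "reporting", "resource": "vm-d", "monthly_savings_usd": 950.0},
]


def rank_teams_by_savings(recs):
    """Sum potential savings per team and return (team, total) pairs, largest first."""
    totals = defaultdict(float)
    for rec in recs:
        totals[rec["team"]] += rec["monthly_savings_usd"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for team, total in rank_teams_by_savings(recommendations):
    print(f"{team}: ${total:,.2f} potential monthly savings")
```

The ranking only tells you who to talk to first. As the rest of this story shows, whether a recommendation is the right change still depends on the team's context.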
When talking to the team with the largest potential rightsizing savings, it transpired that the team was just starting to use the cloud and had used another team’s Infrastructure as Code repo as a template for their service. The template repo belonged to a team that needed VMs optimized for CPU-bound workloads. The team with the large rightsizing opportunities had workloads that were memory-bound. To meet the memory requirements of their application nodes, they had scaled up the size of the instances in the template.
The discussion with the engineering team highlighted that there are other instance types, including instances that have a higher ratio of memory to vCPU. After some quick tests, the changes were implemented, giving a 50% reduction in the cost of running the instances.
Knowing what you are aiming for is important, and that is why you need to set goals, be they cost reduction (easy wins) or a way to forecast budget and unit economics. All of this needs to be outlined up front so you can engage with stakeholders and understand what the desired outcome is.
It is also very important to set realistic expectations (e.g., “8% cost reduction in Q1” rather than “cut cloud cost by 50%”) so you can communicate them to the teams and corral them into the effort.
(1,2) A customer wanted to understand why his cloud spend had increased so much in the last 3 months when he was selling an on-premise appliance and not a cloud service.
When we started to investigate, we noticed that the cost increase came from a spike in the CI account. It turned out he had hired 15 new developers, and each was bringing up the equivalent of a client appliance in the cloud and leaving it on, even on weekends. We implemented a short script that stops all VMs on Thursday evening and starts them on Sunday morning, so that the developers have the systems up when they get in without needing to wait. The next step was to shut the VMs down every night as well as on weekends.
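The scheduling decision in such a script can be sketched as a small pure function. This is not the script from the story: the exact stop and start hours are assumptions, and the actual stopping and starting of VMs through the cloud provider's SDK is left out.

```python
from datetime import datetime

# Monday == 0 in datetime.weekday(). The hours below are assumptions.
STOP_WEEKDAY, STOP_HOUR = 3, 19    # Thursday 19:00
START_WEEKDAY, START_HOUR = 6, 6   # Sunday 06:00


def should_be_running(now: datetime) -> bool:
    """Return False inside the Thursday-evening-to-Sunday-morning window."""
    minutes_into_week = (now.weekday() * 24 + now.hour) * 60 + now.minute
    stop_at = (STOP_WEEKDAY * 24 + STOP_HOUR) * 60
    start_at = (START_WEEKDAY * 24 + START_HOUR) * 60
    return not (stop_at <= minutes_into_week < start_at)


# A scheduled job could call this periodically and stop or start VMs accordingly.
print(should_be_running(datetime(2024, 1, 6, 12, 0)))  # a Saturday at noon
```

Keeping the decision separate from the cloud API calls makes it easy to test and to extend later, for example to the nightly shutdown mentioned as the next step.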
There was a reduction, but not as large as the customer expected. Looking again, we noticed that the CI was running on On-Demand instances. We brought in Spot instances and that helped. Then some of the developers complained that their test machines sometimes crashed, as can happen with Spot. When we explained that to the VP R&D, he simply told the development team, “If that happens, go get a coffee; it will resume by the time you are done.” That in fact sped up production, as the developers stopped waiting for CI to finish on one branch and started working on other tasks in the meantime.
(3) A customer wanted to create a budget prediction for the next fiscal year.
When asked for his growth prediction, he provided only the expected increase in end users, neglecting the growth of R&D and other departments. When we explained that a forecast needs to account for those costs as well, he tried to dismiss the R&D cost in the cloud as trivial. When we presented the figures, he was startled to learn that the bulk of his cloud spend was development and QA, while production was 35% of the cloud cost.
(4) See the section about optimization above.
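To illustrate why the non-production share matters in a forecast, here is a minimal sketch that applies a separate growth rate to each spend category. Only the 35% production share comes from the story; the total spend, the category names and the growth rates are hypothetical.

```python
def forecast(current_spend: dict, growth: dict) -> dict:
    """Apply a per-category growth rate; categories without a rate stay flat."""
    return {
        category: amount * (1 + growth.get(category, 0.0))
        for category, amount in current_spend.items()
    }


# Hypothetical monthly spend of 100,000 split 35% / 65% as in the story.
current = {"production": 35_000.0, "non_production": 65_000.0}

# Hypothetical growth assumptions for the next fiscal year.
users_only = forecast(current, {"production": 0.20})
all_departments = forecast(current, {"production": 0.20, "non_production": 0.15})

print(f"users-only forecast:      {sum(users_only.values()):,.0f}")
print(f"all-departments forecast: {sum(all_departments.values()):,.0f}")
```

With these illustrative numbers, the forecast that ignores non-production growth understates the budget by the full growth of the larger category, which is the gap the customer had initially dismissed as trivial.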
The quantity of data that FinOps teams need to understand and analyze is vast. The raw billing data is unfit for human consumption and quickly exhausts the capabilities of general office suite applications. Tools can help surface the key information from all the data as well as automate activities allowing you to reduce cycle times in the FinOps lifecycle loop. Implementing tools is not a silver-bullet solution. There are many other considerations that need to be evaluated and processes to set up to get the best value from tools.
A few years ago, back in 2020, a major cloud-native marketplace company decided to invest in a solution, mainly for showback. After researching the market for a SaaS solution, the evaluation was that the available options were not robust enough to cover the company’s complexity, especially around multiple cloud providers and splitting the cost of shared resources. Based on this, the company decided to build its own solution.
Since the first MVP (minimum viable product) was deployed, the FinOps team has assessed the market every year to check how the available solutions have evolved. Even with all the development invested, the “buy” option is still on the table. Adopting an existing solution gives you a chance to speed up the company’s FinOps roadmap, while an “in-house” solution gives more flexibility and control over the tool’s evolution. There is no right or wrong, just a matter of balancing the benefits and costs of each option.
When adopting FinOps, you do not want to jump right in without making some important decisions.
The initial step can be overwhelming if you try to take it all in at once. You want to carefully assess the current state of your organization so you can make the right decisions for it, knowing what to push back on and what to embrace as your first steps.
Do’s of adopting and decision making
Don’ts of adopting and decision making
The best way to build trust and gain traction in the company is to show immediate results, and those come from the low-hanging fruit, which often takes the form of easy tasks. At a gaming company, that first task was going over all the storage (disks) and cleaning up the unused (unattached) ones. That initial sweep removed $1,400 from the monthly consumption. Then they enabled a lifecycle policy on the long-term storage (S3/Storage accounts), which reduced the bill from $64,000 a month to $56,000.
At the monthly cost evaluation meeting, after the BI team presented the growth and company intake, the FinOps team presented the cloud expense and showed that even though cloud usage grew that month by 3% (roughly 150,000 players), cloud costs shrank by 1.5% thanks to the actions taken by the FinOps team. This gave management the confidence that the FinOps team is doing a responsible job of keeping cloud cost under supervision while ensuring that none of the business goals and activities are hindered.
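The size of these quick wins is easy to put into numbers. The sketch below uses only the figures from the story and assumes the disk cleanup and the storage lifecycle savings are separate line items that can simply be added.

```python
def reduction_pct(before: float, after: float) -> float:
    """Percentage reduction going from `before` to `after`."""
    return (before - after) / before * 100


disk_cleanup_saving = 1_400.0            # monthly, unattached disks removed
lifecycle_saving = 64_000.0 - 56_000.0   # monthly, long-term storage lifecycle
total_monthly_saving = disk_cleanup_saving + lifecycle_saving

print(f"long-term storage bill reduced by {reduction_pct(64_000.0, 56_000.0):.1f}%")
print(f"combined monthly saving: ${total_monthly_saving:,.0f}")
```

Expressing early results as both a percentage and a monthly amount gives the FinOps team concrete numbers to bring to management.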
As a new FinOps driver, we were very set on getting the raw recommendations from tooling into the hands of engineers so they could take action. We created an automated workflow that pulled recommendations from a cloud provider’s recommendation engine and created ServiceNow tickets that were ready for approval via the change management process. We took a waterfall approach to building this automation over 6 weeks. The entire thing worked perfectly, or so we thought. Once we started piloting it, we found that, according to the engineering teams, most of the recommendations were not actionable. This flooded the change process with noisy tickets that could not be acted on, yet still required engineers to clean them up. We missed the most important element of designing a new FinOps capability: engaging with the engineering teams to build a tool for which they would be the end users.
The intent of this article is not to be exhaustive, but to give some points to think about when implementing FinOps in your organization. It is not a “silver bullet” or a “one-size-fits-all” solution. The key success factor is to identify the principles behind each of the topics mentioned above. The real case stories give you another opportunity to see how the principles are applied in a specific environment that may not be like yours. We hope that with this article you will, at least, drop the idea of starting from scratch and have some talking points to discuss within your organization in order to achieve a successful FinOps implementation.
You can also count on the help of the FinOps community. Visit finops.org to learn how to be part of this community of thousands of practitioners. You will be surprised to see that you are not alone on this journey.
The FinOps Foundation extends its gratitude to the hard-working members of the Working Group: