This work is licensed under CC BY 4.0 - Read how use or adaptation requires attribution

How to Optimize AWS Athena Costs

by Marcus Irani, Site Reliability Engineer, Electronic Arts

AWS Athena is a popular serverless query service that provides on-demand querying of data in Amazon S3, with no need for infrastructure management. However, costs can grow quickly without careful management. Under Athena's on-demand pricing, each query is billed by the amount of data it scans; storing the data in S3 is billed separately by S3, not by Athena. Reducing the amount of data each query scans is therefore the most direct way to reduce costs.
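To make the link between data scanned and cost concrete, here is a minimal Python sketch of a cost estimate. The price of 5 USD per TB, the 10 MB per-query minimum, and the binary definition of a terabyte are assumptions for illustration; check the current Athena pricing page for your region. The function name `estimate_query_cost` is hypothetical.

```python
# Assumed on-demand pricing; verify against the current AWS pricing page.
PRICE_PER_TB_USD = 5.0
MIN_BYTES_PER_QUERY = 10 * 1024**2  # assumed 10 MB minimum charge per query
BYTES_PER_TB = 1024**4              # assumption: 1 TB treated as 2**40 bytes


def estimate_query_cost(bytes_scanned: int,
                        price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Estimate the cost in USD of one query from the bytes it scanned."""
    # Queries that scan less than the minimum are billed at the minimum.
    billable_bytes = max(bytes_scanned, MIN_BYTES_PER_QUERY)
    return billable_bytes / BYTES_PER_TB * price_per_tb


if __name__ == "__main__":
    for scanned in (1 * 1024**2, 50 * 1024**3, 2 * 1024**4):
        print(f"{scanned:>16,d} bytes -> ${estimate_query_cost(scanned):.6f}")
```

Because the estimate is linear in bytes scanned, halving the data a query reads roughly halves its cost, which is why the techniques that follow all focus on scanning less.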

One common optimization technique is to partition tables using relevant attributes, such as date or location, to reduce the amount of data scanned. For example, if a table contains daily sales data, partitioning the table by date allows queries to scan only the relevant partitions for a specific date range, rather than scanning the entire table. The savings apply only when a query filters on the partition column; a query without such a filter still scans every partition.
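The effect of partitioning on data scanned can be sketched with a small simulation. The partition sizes, the `dt` partition column, and the `sales` table named in the comment are hypothetical, and the function only models which partitions a date filter would select; it does not call Athena.

```python
from datetime import date

# Hypothetical sizes in bytes of daily partitions (dt=YYYY-MM-DD) of a sales table.
partition_sizes = {
    date(2024, 1, 1): 400_000_000,
    date(2024, 1, 2): 350_000_000,
    date(2024, 1, 3): 500_000_000,
    date(2024, 1, 4): 450_000_000,
}


def bytes_scanned_by_date(sizes, start=None, end=None):
    """Sum the sizes of partitions whose date lies in [start, end].

    With no bounds, every partition is read, which models a query
    that has no filter on the partition column.
    """
    return sum(
        size
        for day, size in sizes.items()
        if (start is None or day >= start) and (end is None or day <= end)
    )


# Models: SELECT ... FROM sales  (no partition filter)
full_scan = bytes_scanned_by_date(partition_sizes)
# Models: SELECT ... FROM sales WHERE dt BETWEEN '2024-01-02' AND '2024-01-03'
pruned_scan = bytes_scanned_by_date(
    partition_sizes, date(2024, 1, 2), date(2024, 1, 3)
)

if __name__ == "__main__":
    print(f"full scan:   {full_scan:,d} bytes")
    print(f"pruned scan: {pruned_scan:,d} bytes")
```

In this made-up data set, the two-day filter reads half the bytes of the full scan; on a table with years of daily partitions the ratio would be far larger.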

Another technique is to take advantage of predicate pushdown, in which a query's filter conditions are applied as close to the data as possible, so that data that cannot match is skipped rather than read and then discarded. With columnar formats such as Parquet and ORC, Athena can use the minimum and maximum values stored in file metadata to skip blocks of data whose values fall outside the filter. This reduces the amount of data scanned and improves performance. It is especially effective on large tables, particularly when the data is sorted by the filtered column.
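One way a query engine can skip data with a pushed-down filter is by consulting per-block minimum and maximum statistics, as columnar formats such as Parquet store for each row group. The sketch below models that idea in plain Python; the `RowGroup` class, its fields, and the sample values are hypothetical and do not reflect Athena's internal implementation.

```python
from dataclasses import dataclass


@dataclass
class RowGroup:
    """Hypothetical summary of one block of rows in a columnar file."""
    min_amount: float
    max_amount: float
    size_bytes: int


def row_groups_to_read(groups, threshold):
    """Keep only row groups that could contain rows with amount > threshold.

    A group whose maximum is at or below the threshold cannot match,
    so it is skipped without being read.
    """
    return [g for g in groups if g.max_amount > threshold]


groups = [
    RowGroup(min_amount=0, max_amount=50, size_bytes=100_000_000),
    RowGroup(min_amount=40, max_amount=120, size_bytes=100_000_000),
    RowGroup(min_amount=200, max_amount=900, size_bytes=100_000_000),
]

# Models the filter: WHERE amount > 100
kept = row_groups_to_read(groups, threshold=100)
bytes_read = sum(g.size_bytes for g in kept)

if __name__ == "__main__":
    print(f"read {len(kept)} of {len(groups)} row groups, {bytes_read:,d} bytes")
```

The second group is still read even though only some of its rows match, which illustrates why sorting data by a frequently filtered column tends to make the statistics more selective.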

Additionally, writing efficient SQL queries can greatly reduce costs. For example, select only the columns you need instead of using “SELECT *”, and store data in compressed, columnar formats such as Parquet or ORC, with appropriate data types, to minimize data size. With a columnar format, Athena reads only the columns a query references. Simple optimizations like these can substantially reduce the amount of data scanned, resulting in lower costs. To avoid cost overruns, it’s important to design table structures and craft queries in a way that minimizes the amount of data scanned.
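The difference between “SELECT *” and selecting specific columns can be sketched the same way. The column names and per-column sizes below are hypothetical, and the model assumes a columnar format in which each column is stored, and can be read, separately.

```python
# Hypothetical on-disk size in bytes of each column in a columnar table.
column_sizes = {
    "order_id": 800_000_000,
    "customer_id": 800_000_000,
    "amount": 400_000_000,
    "notes": 6_000_000_000,  # wide free-text column
}


def bytes_scanned_for_columns(sizes, columns=None):
    """Sum the sizes of the requested columns; None models SELECT *."""
    if columns is None:
        columns = list(sizes)
    return sum(sizes[name] for name in columns)


# Models: SELECT * FROM orders
select_star = bytes_scanned_for_columns(column_sizes)
# Models: SELECT order_id, amount FROM orders
select_two = bytes_scanned_for_columns(column_sizes, ["order_id", "amount"])

if __name__ == "__main__":
    print(f"SELECT *:                {select_star:,d} bytes")
    print(f"SELECT order_id, amount: {select_two:,d} bytes")
```

With a row-oriented format such as CSV, the same query would have to read every byte of each row, so the saving shown here depends on the storage format as much as on the query text.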