This work is licensed under CC BY 4.0; use or adaptation requires attribution.

Building a FOCUS Data Export Pipeline for Alibaba Cloud

This paper describes how to implement an Alibaba Cloud billing export aligned with the FOCUS specification, with the objective of producing a reliable, query-ready dataset for analytics, reporting, and cost governance use cases.

NOTE: The latest FOCUS version as of the publication date of this Paper is 1.3, with Alibaba Cloud supporting 1.0 in invitational preview. Access to the FOCUS export may need to be requested. Results may vary depending on specification versioning and upstream data completeness (see Alibaba Cloud FOCUS 1.0 field differences).

Removing Manual Work Using FOCUS Data Export

Prior to using the FOCUS export, billing data was available only as raw daily CSV files stored in Amazon S3. These files were organized in a deeply nested time-based folder structure and included complex fields such as JSON-encoded tags. While the source data was technically complete, it was not easily consumable at scale. Schemas were weakly typed, tags were embedded as strings, and queries required significant preprocessing.

The goal of this effort was to transform these raw exports into an incremental, normalized dataset that could be efficiently queried using Amazon Athena and integrated into downstream cost analytics workflows.

Understanding the Challenges

Several structural challenges needed to be addressed in order to operationalize the data.

Data Structure

Although the exported billing files were delivered as CSV, some fields (particularly tags) contained JSON objects with key-value pairs such as:

  • Brand
  • Project
  • Owner
  • CostCenter
  • Environment
  • Application

These fields are frequently critical for FinOps analysis but are not directly usable in SQL without transformation.
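As a minimal illustration of the transformation required, the sketch below parses one such JSON-encoded tag string into a dictionary. The column value shown is illustrative, not taken from a real export:

```python
import json

def parse_tags(raw: str) -> dict:
    """Parse a JSON-encoded tag string from a billing CSV row into a dict.
    Returns an empty dict for missing or malformed values."""
    if not raw:
        return {}
    try:
        parsed = json.loads(raw)
        return parsed if isinstance(parsed, dict) else {}
    except json.JSONDecodeError:
        return {}

# Example tag value as it might appear in the export (illustrative only):
row_tags = '{"Project": "checkout", "CostCenter": "cc-1042", "Environment": "prod"}'
tags = parse_tags(row_tags)
print(tags["CostCenter"])  # -> cc-1042
```

In the actual pipeline this parsing happens inside the Glue job (see "Parse and Normalize Tags" below); the defensive fallback to an empty dict avoids failing an entire batch on one malformed row.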

Incremental Data Processing

New billing files were written daily to S3, requiring a solution that could:

  • Process only newly generated data
  • Avoid reprocessing historical files
  • Support partitioned storage to optimize performance and cost
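Conceptually, incremental processing reduces to selecting only keys that have not been handled before. The sketch below assumes a manifest of processed keys; in the actual pipeline, AWS Glue job bookmarks provide this behavior natively:

```python
def select_new_files(all_keys: list[str], processed: set[str]) -> list[str]:
    """Return only keys not yet processed, sorted so daily batches are
    handled deterministically."""
    return sorted(k for k in all_keys if k not in processed)

# Example: two historical files already handled, one new daily drop.
processed = {"raw/2026/01/05/bill.csv.gz", "raw/2026/01/06/bill.csv.gz"}
listing = ["raw/2026/01/05/bill.csv.gz",
           "raw/2026/01/06/bill.csv.gz",
           "raw/2026/01/07/bill.csv.gz"]
print(select_new_files(listing, processed))  # -> ['raw/2026/01/07/bill.csv.gz']
```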

FOCUS Alignment

The dataset also needed to remain:

  • FOCUS-aligned
  • Strongly typed
  • Stable over time

This ensured that the resulting dataset could be used consistently across analytics environments and reporting tools, while acknowledging that full FOCUS conformance depends on upstream data completeness.

Initial Alibaba Cloud Export Configuration

The first step involved configuring the Alibaba Cloud billing export based on the official procedure for Billing FOCUS Export.

This process automatically generates an OSS bucket containing daily exports of FOCUS billing data.

Each export includes compressed CSV files representing billing line items.

To automate ingestion of this data into a FinOps data lake, a Function Compute service was implemented within Alibaba Cloud (similar to AWS Lambda).

Function Compute Configuration

  1. Create a Function Compute service using Python 3.12.
  2. Configure an OSS trigger for new file creation events:
  • oss:ObjectCreated:PutObject
  • oss:ObjectCreated:PostObject
  • oss:ObjectCreated:CompleteMultipartUpload
  • oss:ObjectCreated:PutSymlink
  3. Configure the following environment variables:
  • AWS_ACCESS_KEY_ID
  • AWS_SECRET_ACCESS_KEY
  • S3_BUCKET_NAME
  • S3_REGION

These credentials must allow the function to write data into the destination S3 bucket.

Security note: In a production deployment, consider using a dedicated IAM user with least-privilege permissions (write-only to the target S3 prefix), enabling key rotation, and evaluating cross-cloud identity federation as an alternative to long-lived access keys.

The function is configured to execute automatically whenever new export files are created.

Exported files are stored in a raw S3 bucket, which serves as the base dataset for downstream processing.
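The steps above can be sketched as a handler like the following. The event field names follow Function Compute's documented OSS trigger format but should be verified against your runtime, and the prefix and key layout are assumptions; the OSS download and S3 upload are indicated in comments rather than executed here:

```python
import json
import os

def target_s3_key(oss_key: str, prefix: str = "alibaba/focus/raw") -> str:
    """Map an OSS export object key to its destination key in the raw S3
    bucket, preserving the export's time-based folder structure."""
    return f"{prefix}/{oss_key.lstrip('/')}"

def handler(event, context):
    """Function Compute entry point, invoked by the oss:ObjectCreated:*
    triggers configured above."""
    for record in json.loads(event).get("events", []):
        oss_key = record["oss"]["object"]["key"]
        dest_key = target_s3_key(oss_key)
        bucket = os.environ["S3_BUCKET_NAME"]
        # In the real function: download the object with the oss2 SDK, then
        # upload it with boto3 using AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY.
        print(f"copy oss://{oss_key} -> s3://{bucket}/{dest_key}")
```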

Designing the Data Pipeline

The transformation pipeline was implemented using:

  • AWS Glue for ETL processing
  • Amazon Athena as the query engine

The design focused on four core principles:

  • Strong schema enforcement
  • Explicit JSON parsing
  • Partition-aware storage
  • Incremental processing

Data Transformation with AWS Glue

Using a Glue Visual ETL job, the pipeline performs several transformation steps.

1. Ingest Raw CSV Files

The ETL job recursively reads billing files from the raw S3 bucket.

CSV parsing is configured with:

  • Header detection
  • Proper quote and escape characters

This prevents column parsing errors caused by embedded commas within fields.
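Python's csv module shows why quote handling matters here; the column names are illustrative, and Glue's CSV reader exposes equivalent quote/escape options:

```python
import csv
import io

# A row whose Tags field contains embedded commas and escaped quotes.
raw = 'ServiceName,BilledCost,Tags\nECS,12.50,"{""Project"": ""checkout"", ""Owner"": ""alice""}"\n'

reader = csv.DictReader(io.StringIO(raw), quotechar='"')
row = next(reader)
print(row["Tags"])  # -> {"Project": "checkout", "Owner": "alice"}
```

Without the quote handling, the commas inside the JSON tag string would split the row into extra columns.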

2. Enforce a Strong Schema

An ApplyMapping step converts relevant columns into strongly typed fields such as:

  • timestamps
  • doubles
  • bigints

Enforcing types early prevents silent schema drift and ensures consistency across downstream analytics systems.
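The mapping tuples below are in the (source, source type, target, target type) shape that Glue's ApplyMapping transform consumes; the pure-Python applier and the column names are illustrative, not the actual job code:

```python
from datetime import datetime, timezone

# Column mappings in ApplyMapping's shape; names are illustrative.
MAPPINGS = [
    ("chargeperiodstart", "string", "ChargePeriodStart", "timestamp"),
    ("billedcost",        "string", "BilledCost",        "double"),
    ("pricingquantity",   "string", "PricingQuantity",   "double"),
]

CASTS = {
    "timestamp": lambda v: datetime.fromisoformat(v).replace(tzinfo=timezone.utc),
    "double": float,
    "bigint": int,
}

def apply_mapping(row: dict) -> dict:
    """Rename and cast a raw CSV row the way an ApplyMapping step would,
    turning empty strings into explicit nulls."""
    out = {}
    for src, _, dst, dst_type in MAPPINGS:
        value = row.get(src)
        out[dst] = CASTS[dst_type](value) if value not in (None, "") else None
    return out

raw = {"chargeperiodstart": "2026-01-07T00:00:00", "billedcost": "12.5", "pricingquantity": ""}
print(apply_mapping(raw))
```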

3. Extract Partition Metadata

The S3 folder structure already encodes temporal metadata. The pipeline extracts partition columns directly from the file path, including:

  • yearmonth
  • batch_ts

This allows natural partitioning without modifying the source files.
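A sketch of that extraction, assuming a `<prefix>/<yyyy>/<mm>/<dd>/<batch>/<file>` key layout (adjust the pattern to the actual export structure):

```python
import re

# Assumed raw-key layout: <prefix>/<yyyy>/<mm>/<dd>/<batch>/<file>.
KEY_PATTERN = re.compile(r"/(?P<year>\d{4})/(?P<month>\d{2})/\d{2}/(?P<batch_ts>[^/]+)/")

def partition_columns(s3_key: str) -> dict:
    """Derive yearmonth and batch_ts partition values from the file path."""
    m = KEY_PATTERN.search(s3_key)
    if m is None:
        raise ValueError(f"unrecognized key layout: {s3_key}")
    return {"yearmonth": m["year"] + m["month"], "batch_ts": m["batch_ts"]}

print(partition_columns("raw/2026/01/07/20260107T0300/bill.csv.gz"))
```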

4. Parse and Normalize Tags

The tags column is parsed from a JSON string into a map<string,string> structure.

From this map, common tag attributes are extracted into normalized fields such as:

  • Brand
  • Project
  • Owner
  • CostCenter
  • Environment
  • Application

These fields become directly queryable in SQL while preserving the full tag map for flexibility.

5. Write Optimized Output

The transformed dataset is written in Parquet format with Snappy compression and partitioned by:

  • yearmonth
  • batch_ts

This significantly reduces query scan costs and improves performance in Athena.
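The resulting layout follows the Hive-style `key=value` partition convention that Athena uses to prune scans. The bucket name and part-file naming below are illustrative (Glue's actual part-file names differ):

```python
def output_path(base: str, yearmonth: str, batch_ts: str, part: int) -> str:
    """Build a Hive-style partition path like those the Parquet sink writes,
    which Athena uses to prune scans to the requested partitions."""
    return (f"{base}/yearmonth={yearmonth}/batch_ts={batch_ts}/"
            f"part-{part:05d}.snappy.parquet")

print(output_path("s3://finops-lake/focus_alibaba_normalized", "202601", "20260107", 0))
```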

6. Data Catalog Integration

The AWS Glue job automatically updates the AWS Glue Data Catalog, ensuring schema definitions remain synchronized with the transformed dataset.

Creating the Athena Table

Once the normalized Parquet dataset is written to S3, it is made queryable through Amazon Athena.

Instead of relying on schema-on-read approaches or ad hoc definitions, an explicit Athena table is created and registered in the AWS Glue Data Catalog.

The table definition reflects:

  • Strong data types
  • Normalized tag columns
  • Partitioned storage structure

Explicit table definition ensures that:

  • Queries behave consistently
  • Schema evolution is controlled
  • Consumers do not need to understand the underlying file layout

The table references the normalized S3 location, and new data is appended incrementally.

Partitions are discovered dynamically rather than hard-coded in the table definition.

To synchronize metadata with the underlying storage layout, a scheduled maintenance job runs the command:

MSCK REPAIR TABLE db.focus_alibaba_normalized;

This command updates Athena metadata to include newly created partitions.
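MSCK REPAIR TABLE rescans the entire table prefix, which grows costly as partitions accumulate. When the writing job already knows which partitions it produced, registering them explicitly with ALTER TABLE ADD PARTITION is a cheaper alternative; the helper below sketches generating that DDL (table and location names are illustrative):

```python
def add_partition_ddl(table: str, location: str, yearmonth: str, batch_ts: str) -> str:
    """Emit DDL registering a single newly written partition, an alternative
    to rescanning the whole prefix with MSCK REPAIR TABLE."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (yearmonth='{yearmonth}', batch_ts='{batch_ts}') "
        f"LOCATION '{location}/yearmonth={yearmonth}/batch_ts={batch_ts}/'"
    )

print(add_partition_ddl(
    "db.focus_alibaba_normalized",
    "s3://finops-lake/focus_alibaba_normalized",
    "202601", "20260107"))
```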

By separating data ingestion, transformation, and metadata management, the pipeline maintains a clear and robust architecture.

Glue workflows and triggers coordinate the execution of these stages.

Building Analytics-Ready SQL Views

Although the normalized dataset is queryable directly, exposing the base table to analysts would still introduce unnecessary provider-specific complexity.

To address this, an additional abstraction layer was implemented using curated SQL views in Athena.

Provider-Specific Views

The first layer includes views tailored specifically for Alibaba Cloud billing data. These views:

  • Standardize naming conventions
  • Normalize cost and usage metrics
  • Apply business logic such as cost allocation and time aggregation
  • Expose normalized tag fields

This layer shields consumers from the complexity of the underlying schema.
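A view of this kind might look like the following sketch; the view name, schema, and column choices are hypothetical, not the author's actual definitions:

```sql
-- Illustrative only: names are hypothetical, not the actual definitions.
CREATE OR REPLACE VIEW finops.v_alibaba_costs AS
SELECT
  ChargePeriodStart AS charge_period_start,
  ServiceName       AS service_name,
  BilledCost        AS billed_cost,
  EffectiveCost     AS effective_cost,
  CostCenter        AS cost_center,
  Environment       AS environment,
  Tags              AS tags,
  yearmonth
FROM db.focus_alibaba_normalized;
```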

Cross-Provider Views

A second layer of views aligns datasets across multiple providers, including:

  • Alibaba Cloud
  • AWS
  • Google Cloud
  • Datacenter environments

These views expose a consistent, FOCUS-structured schema, allowing datasets from multiple providers to be combined without additional transformations. Note that the unified view inherits any upstream data gaps; columns such as ContractedCost and PricingQuantity will contain null values for Alibaba rows until the upstream export populates them.

This abstraction layer serves as the contract between data engineering and analytics, providing a stable and documented interface.

Business intelligence tools can connect directly to Athena and query these views without needing to understand ingestion pipelines, file formats, or provider-specific semantics.

Results & Testing Observations

The resulting pipeline produces a FOCUS-aligned billing dataset that:

  • Enforces strong data types
  • Is easy to query using Athena
  • Scales efficiently with incremental ingestion
  • Supports cost analytics and governance use cases

What began as a collection of raw CSV billing exports has evolved into an automated, production-ready billing export pipeline designed to support FinOps analysis and decision-making. Full FOCUS conformance will depend on continued improvements to the upstream Alibaba Cloud export.

Observations from Initial Testing

During early validation of the dataset, several fields were identified that may require review or adjustment in the upstream export. These observations were shared with the Alibaba Cloud team during the initial testing phase. Until these gaps are resolved upstream, the pipeline’s output will contain null or unexpected values in the affected columns, limiting certain analytics use cases.

For a complete list of known field differences, see Alibaba Cloud FOCUS 1.0 Preview Field Differences.

Field Behavior

The following fields currently appear to contain numeric references rather than descriptive values:

  • InvoiceIssuerName
    Expected: Name of the entity responsible for invoicing (for example, Alibaba Europe Ltd.)
  • PublisherName
    Expected: Entity that produces the purchased resources or services (for example, Alibaba Inc.)
  • ProviderName
    Expected: Entity making the services available for purchase (for example, Alibaba)

Note: ProviderName and PublisherName were deprecated in FOCUS 1.3, replaced by ServiceProviderName. Pipelines targeting future FOCUS versions should plan for this change.

Missing Fields

The following fields were not present in the exported dataset and may require verification:

  • PricingQuantity
  • ConsumedQuantity
  • ConsumedUnit
  • ContractedCost
  • ServiceCategory
  • ChargeClass
  • ChargeDescription

Null-Valued Fields

The following fields were present in the exported schema but contained only null values. This list reflects the author’s export as of January 2026; consult the field differences page for the latest status.

Mandatory, nulls not allowed (hard conformance violations):

  • ContractedCost
  • ContractedUnitPrice

Mandatory, nulls conditionally allowed:

  • PricingQuantity
  • ChargeDescription
  • SkuId
  • SkuPriceId

Conditional:

  • ConsumedQuantity
  • ConsumedUnit

Observed by the author but not listed in Alibaba’s published conformance gaps (may require further verification):

  • ServiceCategory
  • ChargeClass
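Fields like these can be surfaced by profiling null rates per column; in practice this is a SQL aggregation in Athena, but the idea reduces to the sketch below (sample data is invented for illustration):

```python
def null_rates(rows: list[dict]) -> dict:
    """Fraction of null values per column; columns at 1.0 correspond to the
    null-only fields flagged above."""
    if not rows:
        return {}
    cols = rows[0].keys()
    return {c: sum(1 for r in rows if r.get(c) is None) / len(rows) for c in cols}

sample = [
    {"BilledCost": 1.2, "ContractedCost": None, "SkuId": None},
    {"BilledCost": 0.4, "ContractedCost": None, "SkuId": "sku-1"},
]
print(null_rates(sample))  # -> {'BilledCost': 0.0, 'ContractedCost': 1.0, 'SkuId': 0.5}
```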

Additional Known Gaps

The following issues are documented by Alibaba Cloud but not directly observed in the pipeline testing described above:

  • Timestamp format: BillingPeriodStart, BillingPeriodEnd, ChargePeriodStart, and ChargePeriodEnd use Beijing time (UTC+8) instead of the required UTC format.
  • Commitment discount columns: CommitmentDiscountCategory, CommitmentDiscountId, CommitmentDiscountName, CommitmentDiscountStatus, and CommitmentDiscountType do not display usage or unused data when commitment discounts apply.
  • Conditionally null fields: ListUnitPrice, PricingUnit, ResourceId, and ResourceType may be null in scenarios where the spec expects values.

NOTE: The author reached out to the Alibaba Cloud team with this feedback as of January 2026.

Please get in touch with additional feedback or observations from other implementations. Feedback helps the FOCUS Maintainers and Steering Committee improve the underlying schema and export functionality.


Acknowledgments

We’d like to thank the following for their work on this Paper:
