Creeping Bytes, Costing Dollars — Creeping Cloud Costs Explained

Boris Cherkasky · Published in ITNEXT · 5 min read · Apr 17, 2024


artwork by @equeen on unsplash.com

Managing cost efficiency in a modern production cloud environment is a challenging task, especially when you're only starting your development effort. Everything is cheap at a small scale; the $3/day in storage costs seems meaningless. It's only 3 dollars!

This was our case too, but as our operation grew, I started noticing a pattern in those costs that were "just $3" only 5 months ago. We nicknamed this "creeping cloud costs": costs that rise slowly but surely, with no seasonality and little correlation to additional business.

This article will cover which cloud services are prone to those creeping costs, how to identify them, and lastly how to mitigate them.

The charts used in this article are from the Finout platform (full disclosure: I am one of the engineers building it).

What Are “Creepers”?

"Creepers" are usually monotonically increasing costs with no seasonality (seasonality may affect the rate of increase, but the increase will be there regardless), very few decreases, and generally no relation to business growth.

Good examples are container repositories (such as Amazon ECR or Google Container Registry): regardless of application traffic or seasonality, each GB of stored images is billed monthly.

The rate at which developers create images is usually unrelated to business growth (and probably correlates more with the size of the team).

The above snapshot is our ECR cost, which triggered a low-hanging cost-optimization effort and, eventually, this article.

Identifying “Creepers”

It’s relatively easy to identify Creepers:

  • They have no seasonality
  • They are monotonically increasing over the course of a long period of time

Just look for resources whose cost line trends steadily upwards.
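The two bullet points above can be turned into a simple heuristic. Here is a minimal sketch (not Finout's actual detection logic) that flags a daily cost series as a likely "creeper" when it ends higher than it started and almost never dips:

```python
import math
from typing import List

def looks_like_creeper(daily_costs: List[float], tolerance: float = 0.02) -> bool:
    """Flag a cost series as a 'creeper': trending upward, almost no dips.

    `tolerance` is the relative day-to-day decrease we ignore as billing noise.
    """
    if len(daily_costs) < 2:
        return False
    dips = sum(
        1 for prev, cur in zip(daily_costs, daily_costs[1:])
        if cur < prev * (1 - tolerance)
    )
    trending_up = daily_costs[-1] > daily_costs[0]
    # A creeper ends higher than it started and almost never decreases.
    return trending_up and dips <= len(daily_costs) // 20

# An ECR-like series: storage cost grows a little every day.
ecr_like = [100 + 0.8 * d for d in range(90)]
# A traffic-driven series: oscillates with weekly seasonality.
seasonal = [100 + 20 * math.sin(d * 2 * math.pi / 7) for d in range(90)]

print(looks_like_creeper(ecr_like))   # → True
print(looks_like_creeper(seasonal))   # → False
```

In practice you would run this over per-resource cost data from your billing exports; the threshold and tolerance are arbitrary knobs you would tune to your own noise levels.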

Cloud Services That Are Known To Have “Creeping” Potential

Pay-per-use, storage-based services are usually the immediate offenders in creeping-cost cases: each byte we store today will be billed monthly for as long as it exists.

In AWS, for example, the immediate suspect is S3, but many other services have a pay-per-use storage component, such as DynamoDB, ECR, and CloudWatch Logs.

This is obviously not limited to AWS and happens in all other cloud providers.

How’s That Different From Any Other Cloud Resource?

Unlike other services, the storage component of pay-per-use services is cumulative.

Your RDS has a ~fixed price (according to the size of the instance):

Your Lambda / ALB transfer cost is correlated with traffic / input:

S3 storage, on the other hand, is billed monthly, recurringly, for each byte stored for as long as it exists! This cost accumulates over time, isn't seasonal, and won't trend downwards unless objects are deleted explicitly:
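A quick back-of-the-envelope calculation shows how this accumulation plays out. The ingest rate below is made up, and the price is an approximation of S3 Standard pricing in us-east-1; without an expiration policy, month N bills for everything written since day one:

```python
GB_PER_DAY = 10              # hypothetical daily ingest
PRICE_PER_GB_MONTH = 0.023   # approx. S3 Standard, us-east-1

# Each month's bill covers every byte still stored, not just new writes.
for month in (1, 6, 12):
    stored_gb = GB_PER_DAY * 30 * month
    monthly_bill = stored_gb * PRICE_PER_GB_MONTH
    print(f"month {month:2}: ~{stored_gb} GB stored, ~${monthly_bill:.2f}/month")
```

The same 10 GB/day that costs about $7/month at first grows into an $80+/month line item within a year, and keeps climbing; that is the creeper shape.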

Handling Creeping Costs

Obviously, the mitigation for these creepers is to set a TTL / expiration / lifecycle policy (LCP), depending on the cloud service.

It's important to note that most cloud services have no TTL / expiration / lifecycle policy by default; therefore, each newly created resource is a potential "creeper" in the making unless proactively treated!
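For S3, the mitigation is a bucket lifecycle configuration. Here is a minimal sketch of one, built as the dictionary that AWS's API expects; the 30-day window and rule ID are arbitrary examples:

```python
def expire_after(days: int, rule_id: str, prefix: str = "") -> dict:
    """Build an S3 lifecycle rule that deletes objects `days` after creation."""
    return {
        "ID": rule_id,
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},  # empty prefix = whole bucket
        "Expiration": {"Days": days},
        # Incomplete multipart uploads also accumulate storage cost.
        "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
    }

lifecycle_configuration = {"Rules": [expire_after(30, "expire-after-30-days")]}

# To apply it (requires boto3 and AWS credentials; bucket name is hypothetical):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-app-debug-logs",
#       LifecycleConfiguration=lifecycle_configuration,
#   )
```

The same rule is usually easier to keep in Terraform or CloudFormation alongside the bucket definition, so the policy can't be forgotten when the bucket is recreated.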

Handling Creepers and Cost to Business Correlation

Generally, after applying a lifecycle policy on the relevant service, we should reach a "steady state" where the overall storage cost is relatively constant. Moreover, after applying an LCP, the cost becomes correlated to the business: an increase in business (traffic / volume) will cause the "steady state" storage to trend upwards proportionally until it reaches a new steady state.

This is every FinOps practitioner's dream: costs correlated to business growth!

As a real-life example, after applying an LCP on our ECR, we achieved a cost reduction plus a steady state:
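For ECR specifically, the lifecycle policy is a JSON document attached to the repository. A minimal sketch of a "keep only the N most recent images" rule follows the ECR lifecycle-policy JSON format; the repository name and the count of 50 are made-up examples:

```python
import json

ecr_lifecycle_policy = {
    "rules": [
        {
            "rulePriority": 1,
            "description": "Expire all but the 50 most recent images",
            "selection": {
                "tagStatus": "any",
                "countType": "imageCountMoreThan",
                "countNumber": 50,
            },
            "action": {"type": "expire"},
        }
    ]
}

policy_text = json.dumps(ecr_lifecycle_policy)

# To apply it (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("ecr").put_lifecycle_policy(
#       repositoryName="my-service", lifecyclePolicyText=policy_text)
```

Count-based rules like this one are a good fit for CI-built images, where what matters is being able to roll back a few revisions, not keeping every image ever pushed.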

Stop The Rat Race — Use Automation and Safeguards

I used to bucket those resources into three groups and tag them accordingly:

  • Lifecycled resources — resources with an LCP / TTL applied to them.
    They are bounded, and you can probably forget about them. You are relatively safe here.
  • Unbounded resources — those we don't have, and can't have, an LCP for, such as your main S3 storage or your main operational DynamoDB table.
    These are expected to be very expensive and to correlate strongly with the business. You just need to keep an eye on them so they don't get too out of hand.
  • Cheap unbounded resources — unbounded resources that you don't expect to be expensive; tagging them and adding an alert so they don't cross some boundary seems enough here.
    An example of such a resource: an ECR repository where you need the full history of revisions and can't trim them.

By tagging resources into the above buckets, alerts can be set to ensure things stay within expected cost ranges.
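The alerting logic this enables is straightforward. A minimal sketch, assuming each resource carries a `lifecycle-bucket` tag and the budget numbers are invented for illustration:

```python
from typing import Dict, List

# Per-bucket monthly budgets (USD); lifecycled resources are bounded
# by their LCP, so they need no budget at all.
BUDGETS = {"unbounded": 5000.0, "cheap-unbounded": 100.0}

def cost_alerts(resources: Dict[str, dict]) -> List[str]:
    """Return names of resources whose cost exceeds their bucket's budget."""
    alerts = []
    for name, info in resources.items():
        bucket = info["lifecycle-bucket"]
        if bucket == "lifecycled":
            continue  # bounded by its LCP, safe to ignore
        if info["monthly_cost"] > BUDGETS[bucket]:
            alerts.append(name)
    return alerts

resources = {
    "debug-logs-s3": {"lifecycle-bucket": "lifecycled", "monthly_cost": 40.0},
    "main-dynamodb": {"lifecycle-bucket": "unbounded", "monthly_cost": 4200.0},
    "audit-ecr": {"lifecycle-bucket": "cheap-unbounded", "monthly_cost": 350.0},
}
print(cost_alerts(resources))  # → ['audit-ecr']
```

Only the cheap-unbounded repository crossed its budget; the expensive-but-expected DynamoDB table stays quiet, which is exactly the point of the bucketing.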

As you can probably imagine, monitoring, or even knowing about, each and every cloud resource that is created is infeasible without proper automation and tooling. Therefore, we can leverage policy agents such as OPA or Kyverno to deny the introduction of resources without an LCP / TTL (or, alternatively, without an explicit tag marking them as unbounded).
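In OPA or Kyverno you would express this as a Rego rule or a YAML policy; the admission logic itself boils down to a few lines. Here is the decision sketched in Python, with the field names (`lifecycle_policy`, the `lifecycle-bucket` tag) being assumptions from the tagging scheme above rather than any real policy-agent schema:

```python
from typing import Tuple

def admit(resource: dict) -> Tuple[bool, str]:
    """Deny any storage resource that has neither a lifecycle/TTL
    configuration nor an explicit tag acknowledging it is unbounded."""
    has_lcp = bool(resource.get("lifecycle_policy"))
    tagged_unbounded = resource.get("tags", {}).get("lifecycle-bucket") in (
        "unbounded",
        "cheap-unbounded",
    )
    if has_lcp or tagged_unbounded:
        return (True, "admitted")
    return (False, "denied: missing lifecycle policy or 'unbounded' tag")

print(admit({"name": "scratch-bucket"}))
print(admit({"name": "logs", "lifecycle_policy": {"expire_days": 30}}))
print(admit({"name": "main-db", "tags": {"lifecycle-bucket": "unbounded"}}))
```

The key design choice is that "no policy" is never silently allowed: every new resource must either be bounded or carry a deliberate, reviewable declaration that it isn't.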

Closing thoughts

Each resource created should have some kind of lifecycle policy attached to it, and not only because of its implicit costs: each stored byte requires governance, carries the risk of including PII, and may be subject to regulations such as GDPR and CCPA.

We should aim for a limited lifespan for each byte we process, and if that's not possible, that byte should be of great importance to the business.

As always, thoughts and comments are welcome on Twitter at @cherkaskyb.


Software engineer, clean coder, scuba diver, and a big fan of a good laugh. @cherkaskyb on Twitter