Managed Services

TL;DR: Friends don’t let friends roll their own

Alon Nisser
ITNEXT

By Unknown author — This file is from the collections of The National Archives (United Kingdom), catalogued under document record FO850/234. For high quality reproductions of any item from The National Archives collection please contact the image library., Public Domain, https://commons.wikimedia.org/w/index.php?curid=501979

I’ve heard it a thousand times: Datadog (placeholder name for your metrics/logs provider) is too expensive, why don’t we roll our own? We’re developers, it’s not so complicated. Just install Elasticsearch + Kibana, InfluxDB and Grafana (sing along to the music of “My Humps”), throw some Prometheus into the cauldron and we’re done.

Oh dear, we’re just getting started.

We now need to manage Elasticsearch/InfluxDB (probably both) in production, and it’s probably more than we bargained for (I can tell you the one about when I found out we were out of space in Elasticsearch). We now need to know much more than the parts of the public API we played with while testing it in a quickstart container on our machine.

We also need to manage the VMs they are running on (or the Kubernetes infrastructure beneath them). Who is now responsible for monitoring that we are actually getting all the metrics we bargained for? We are. Implementing more and more specific tools to gather that extra metric? On us.

Oh, you also need APM. Good luck with OpenTracing and Jaeger, and with integrating all of this so we can look at traces together with logs and metrics. When your team discovers there are no logs in production, whose job is it to fix it? Is it a high-severity incident, like when production is down? The answer is probably yes.

Is your team performing better now that they need to manage this operations platform on top of writing the code your product actually needs? The answer is probably NO. You might have saved a few $ (and “might” is the correct word: you might also be spending more than a managed service would cost, see later), but you almost certainly spent much more in developer productivity.

While monitoring and logs seem to be the most susceptible to this kind of logic, they aren’t alone. There are other fields where rolling your own is usually not a good idea.

Persistence

Persistence is usually the next candidate. I see this less often in the SQL space lately: AWS RDS and similar offerings from other cloud vendors have made managed SQL (of whatever dialect you choose) the go-to solution, although I still bump into the occasional “It’s secure because MySQL runs in the same VM as my app so it’s not public” WAT moment.

But when you go to the NoSQL space? Everyone seems to suggest other options. Running big databases in production is c o m p l i c a t e d. Running them on your robust Kubernetes infra? Even harder. Patching, scaling, sharding, resharding, monitoring and more. This is probably a critical part of your service, so you can’t just patch something and hope it works. And there you are, falling down the rabbit hole of doing DBA stuff instead of writing the code powering your business.

Continuous Integration (CI)

Almost everyone seems to think they should run their own Jenkins and everything will be fine. Yes, with Kubernetes pods as workers or autoscaling VMs it’s easier than it used to be to scale this kind of service, but this is hardly the complicated part. You’ll still need to handle cleanup (especially storage for all those extra Docker layers) and, even more important, atomicity: CI builds MUST be isolated. You can’t just “trust” the people writing a test suite to do this correctly. And hey: what about security? And what happens when you can’t deploy to production because your patched-together CI service can’t take the heat? Does your “roll your own” CI service support high availability? Yes, you can tune the s**t out of it, but having too many options instead of a well-vetted subset isn’t always in your best interest.

Now for Kubernetes, analytics, etc. Rinse and repeat; it’s always the same song. Running your own might be way more complicated than you imagined, and it always comes with a price tag.

Elasticity and some extra costs

Besides developer attention and time, rolling your own, especially when you’re not big, can turn out to be quite costly. Managed services can usually optimize their infra purchases. For databases, low-tier usage allows them to reuse the same VMs in a multi-tenant model. Other services might have shared DBs for metrics or logs with multi-tenant deployments. They can optimize their use of spot VMs and volume purchases.

You can’t (or at least, it’s harder).

You also probably need to provision above your regular/expected needs to protect against an unexpected usage spike and to accommodate growth. Managed services might “normalize the curve” for spikes, with the extra provisioning costs spread across their different users. This means you might be paying more for this extra capacity than you would pay with a managed service. Yes, in the age of cloud vendors this is way easier than having to actually buy a machine and connect it to the rack, but there are still costs attached.
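To make the over-provisioning argument concrete, here is a rough back-of-the-envelope sketch in Python. Everything in it is an assumption made up for illustration: the function names, the 25% headroom, the unit price and the 100% managed markup are not taken from any real vendor’s pricing. It only shows the shape of the comparison: self-hosting pays for peak capacity all month, while a managed service roughly bills for average usage at a higher unit price.

```python
def self_hosted_monthly_cost(peak_units: float, headroom: float, unit_price: float) -> float:
    """Self-hosting: you pay for peak capacity plus headroom, all month long."""
    provisioned = peak_units * (1 + headroom)
    return provisioned * unit_price


def managed_monthly_cost(average_units: float, unit_price: float, markup: float) -> float:
    """Managed: you pay roughly for what you use, at a marked-up unit price."""
    return average_units * unit_price * (1 + markup)


# Hypothetical workload: averages 10 capacity units, spikes to 40.
average, peak = 10.0, 40.0
self_hosted = self_hosted_monthly_cost(peak, headroom=0.25, unit_price=100.0)
managed = managed_monthly_cost(average, unit_price=100.0, markup=1.0)
print(f"self-hosted: ${self_hosted:,.0f}/month, managed: ${managed:,.0f}/month")
# prints "self-hosted: $5,000/month, managed: $2,000/month"
```

With a flatter usage curve or a steeper markup the comparison can flip, so plug in your own numbers before drawing conclusions.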

So how much does it cost?

Here’s the thing with managed services: you actually know how much they cost. You can reason about it and run some calculations on your usage versus your growth. When you are running your own, even setting aside the developer team’s cost and time, it’s way harder to figure out how much it actually costs you. This is especially true if you’re running one big Kubernetes cluster. How much of your cost is attached to the pods running the DB? How much of the disk IOPS is driven by your own CI service? It may be pushing you over the tipping point and forcing you to buy the extra-expensive ultra-something disks. How do you separate data ingress costs for those? You may not care about this right now, but at some point being able to reason about your infra budget will be important.
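One crude way to start answering those questions for a shared cluster is to split the bill by each workload’s share of requested resources. The sketch below is only an illustration under stated assumptions: the workload names, the $12,000 bill and the CPU figures are hypothetical, and using CPU requests alone ignores memory, disk IOPS and data ingress, which are exactly the parts that are hard to separate.

```python
def attribute_costs(cluster_monthly_cost: float, cpu_requests: dict[str, float]) -> dict[str, float]:
    """Split a shared cluster bill across workloads by their share of requested CPU."""
    total = sum(cpu_requests.values())
    if total <= 0:
        raise ValueError("no CPU requests to attribute costs to")
    return {name: cluster_monthly_cost * cpu / total for name, cpu in cpu_requests.items()}


# Hypothetical cluster: a $12,000/month bill and CPU requests (in cores) per workload.
requests = {"product-api": 24.0, "self-hosted-db": 16.0, "ci-runners": 8.0}
for workload, cost in attribute_costs(12_000.0, requests).items():
    print(f"{workload}: ${cost:,.0f}/month")
```

In this made-up example half of the bill goes to infrastructure that is not the product itself. A managed service would show up as a single line on the invoice instead.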

A general directive:

Buy/build dilemmas are common in our business, and that’s understandable. We’re in the business of building software, and with the team we already have we can probably build the services we might otherwise buy. But like I tell my kids:

Being able to do something doesn’t mean it’s a good idea.

I suggest using the general directive of software development as a rule of thumb:

“Focus on your actual product.”

Is running a CI service part of your actual product? Is managing a monitoring platform? If the answer is no, then you should reconsider doing that.

Exceptions

There are exceptions to the general directive:

  • When you’re very big, the buy/build equation changes: on the build side you now have vast building power, and on the buy side small-change savings become real money. You can afford to build your own monitoring platform or orchestration engine, build your own data centers, etc. At the point where you’re spending millions on monitoring, it’s completely logical to roll your own, since you can get both the $ savings and the control you probably need by now. I’ve never worked on that side of the industry, but there is certainly a size where this is a sensible decision.
  • When you’re very, very small. Not because it’s then OK to manage your own Elasticsearch cluster, but because you don’t need one. You might hack a “good enough” solution with a bash script on that VM that periodically sends a log file to S3 and some basic VM metrics to CloudWatch. And APM might mean connecting the IDE debugger to production, or running strace on that machine. Some things are meant to be a hobby/side project and aren’t intended to grow, and that’s OK.
  • When you have a very specific use case where the hooks and controls available from managed services aren’t enough to customize the managed service for your specific needs.
  • When your actual product is building this service, or something in that ecosystem which needs intimate knowledge of the tool. If you’re building a monitoring service, you probably need to run your own time-series database because you’ll need to know it inside and out. If you are in the CI business, then yes, you should be running your own builds (duh).

But this is rarely the case. While almost everyone thinks they have a super specific use case, it’s usually because they didn’t read the manual or contact support. The odds are it’s been done before. And you’re probably not that small. Nor that big.

A recap: your development team is your main asset, and you don’t want to spend its time in the wrong places. Just use a goddamn managed service for that (but evaluate carefully WHICH managed service; that’s an issue for another post).
