Microservices and the myth of loose coupling

Andras Gerlits · Published in ITNEXT · 6 min read · Oct 27, 2021

In the previous article, we looked at what microservices are, why they seem like a good idea, and the trap of intuitive designs.

In this article, we’ll look at what loose coupling is, and make an attempt at designing a very simple solution following these principles.

Wikipedia defines ‘loose coupling’ the same way I’ve heard people use the term:

in which components are weakly associated (have breakable relationship) with each other, and so, changes in one component least affect existence or performance of another component.

Loose coupling is weak association. The thinking is that changes in one subsystem should have as little effect on another one as possible, be it implied knowledge about how another subsystem works, code changes or service disruptions.

Let’s do some blue-sky thinking.

Define something simple

Let’s take the first example of a microservice architecture that Google throws up for a search for “microservice example” (for me, at least): the one from microservices.io. Like all examples, it’s maybe the simplest one imaginable, and anyone with practice will question its applicability, but as we’ll later find out, it’s already a lot more complex than what we (well, I) can mentally manage.

We have 4 subsystems, all loosely coupled with each other, where a portal or an API is used to query and update the state of the whole. The 4 services:

  • account
  • inventory
  • shipping
  • web UI

Each of these is responsible for maintaining some element of the system, but they don’t talk to each other directly. These services imply some operations. We’ll stick to the bare minimum.

First, we’ll need a way to register new clients, who have some shipping addresses. To support this process, we must add a form to our web UI, along with some validations. There are only two things that can go wrong (from a business perspective) when a new client registers:

  • the client’s email might already exist
  • their shipping address might not be on the list of allowed cities

This implies that the account service has a database with the existing client-accounts it can check and that the shipping service has another one with a list of allowed cities.

There are two ways to do this, the “validate first” and the “cleanup afterward” approaches. Let’s look at both.

Validate first approach

To validate the form, we have to ask both services if they can support the values provided on the form, so we send a request to each service and wait for their answers. If both say yes, we can proceed and save the user.
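Here’s what this looks like as a minimal sketch. The two services are modelled as plain in-memory structures to keep the example runnable; in a real system, the checks and writes would be network calls, and every name here is made up for illustration:

```python
# Hypothetical in-memory stand-ins for the two services' databases.
accounts = {}                          # account service: email -> record
allowed_cities = {"London", "Leeds"}   # shipping service: allowed list
addresses = {}                         # shipping service: email -> city

def register_client(email: str, city: str) -> str:
    # Time of check: ask both services if the form values are acceptable.
    if email in accounts:
        return "error: email already registered"
    if city not in allowed_cities:
        return "error: city not on the allowed list"
    # Time of use: nothing stops either answer from going stale
    # before these two writes happen.
    accounts[email] = {"email": email}
    addresses[email] = city
    return "ok"

print(register_client("a@example.com", "London"))  # ok
print(register_client("a@example.com", "Paris"))   # error: email already registered
```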

Can’t we?

Well, not necessarily. We expect this to be a very popular service, which is why we chose the distributed design. Consider that if a list of allowed cities exists, there must be a way to add and remove cities on this list (i.e., the list is mutable). We also know that we’re not disallowing multiple clients from registering in parallel. What could go wrong?

Time. In distributed systems, there’s a gap between the time I gathered some information and the time I relied on it to store the data (the classic time-of-check to time-of-use problem). We can presume that on most occasions the data we just used is still reliable, but not all the time. We need to identify all the information we looked at to reach the decision that the new client’s record can be stored:

  • whether an existing client has the same email
  • whether an address is in the list of allowed cities

Either of these can change between the information’s retrieval and its storage. What makes this even worse is that we’re implying that we understand the internals of both the account and shipping services. This clearly breaks the loose coupling principle: we presumed that we know how these services reach their decisions, and so can reason about what information can change while an update is being done.
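To see the failure concretely, here’s the same in-memory model with the race played out deterministically. The interleaving is forced by hand here, but a real system can hit it whenever a “maintain cities” update lands between the check and the write:

```python
# Same hypothetical in-memory model as before.
allowed_cities = {"London"}
addresses = {}

city_ok = "London" in allowed_cities   # time of check: validation says yes

allowed_cities.discard("London")       # meanwhile, an operator removes the city

if city_ok:                            # time of use: the decision is now stale
    addresses["a@example.com"] = "London"

print(addresses)  # {'a@example.com': 'London'} -- shipping to a disallowed city
```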

As far as each service is concerned, it’s working correctly. Each gave correct answers to the simple requests it served (it validated and it stored), yet together they reached the wrong conclusion in the end.

Let’s look at the cleanup approach next.

Cleanup afterward approach

To get around the limitations we identified in our earlier example, we’ll instruct each service to apply the change we ask for, and to return an error-code if it couldn’t. We ask the affected services to consume the “new client” event and, if they have a problem with it, report back to the originator with an error-code. This shifts the problem into the particular service, which makes everything a lot more self-contained, so we no longer have to worry about each service’s internal logic.

What we have instead is two problems.

First, since either of these two operations can veto the original “create client” operation, we need a way to undo our changes everywhere, even if only one of them had a problem. It doesn’t make sense to have a client in accounting without a shipping address, and it most definitely doesn’t make sense to have an address in shipping without a client.

Second, since we have to now report these potential issues back to the client-form on the web UI, we must “know” there that the client has a half-finished state somewhere, should an operation fail. For example, if the address was invalid, we need to prompt the user about that and we also need to undo the saving of the client-record in accounting. We’ll call this algorithm the orchestrator. In the originally cited article’s example, the proposed solution to this problem is the Saga pattern. This tells us that if this happens, we go back to the accounting service and ask it to delete the record it stored. Simple, clean, self-contained.
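A minimal sketch of such an orchestrator might look like this. The Saga pattern itself is what the cited article proposes, but the function names and in-memory services below are stand-ins of my own:

```python
# Hypothetical in-memory stand-ins for the two services.
accounts, addresses = {}, {}
allowed_cities = {"London"}

def account_create(email):
    if email in accounts:
        return "ERR_EMAIL_EXISTS"
    accounts[email] = {"email": email}
    return "OK"

def account_delete(email):             # the compensating action
    accounts.pop(email, None)

def shipping_create(email, city):
    if city not in allowed_cities:
        return "ERR_CITY_NOT_ALLOWED"
    addresses[email] = city
    return "OK"

def create_client_saga(email, city):
    if account_create(email) != "OK":
        return "rejected: email already exists"
    if shipping_create(email, city) != "OK":
        account_delete(email)          # undo the half-finished state
        return "rejected: city not allowed"
    return "ok"

print(create_client_saga("a@example.com", "Paris"))  # rejected: city not allowed
print(accounts)                                      # {} -- the record was undone
```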

Remember how our system was designed for failure-tolerance? Since we introduced the orchestrator, we now have another bit of information to worry about, and it’s not as simple as having a record in a database table anymore. Since the orchestrator is a piece of executing code, and since it must survive the loss of the computer hosting it, we now need to introduce some kind of persistent workflow-solution which can store the state of an executing finite state-machine, so that we can resume the operation as expected, should it fail. Anything less and our database will either have some shipping addresses without clients or vice versa; in other words, we can leave our system in an unpredictable state.
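As a rough illustration of what “persistent” means here, the sketch below uses a JSON file as a stand-in for whatever durable store a real workflow engine would use (all names are hypothetical). The point is only that the orchestrator records which step it’s on before taking it:

```python
import json, os

STATE_FILE = "saga_state.json"  # stand-in for a durable store

def record(saga_id: str, step: str) -> None:
    # Called before each step, so a crash leaves a trail behind.
    with open(STATE_FILE, "w") as f:
        json.dump({"saga_id": saga_id, "step": step}, f)

def recover():
    # Called on startup: did a previous run die mid-saga?
    if not os.path.exists(STATE_FILE):
        return None
    with open(STATE_FILE) as f:
        return json.load(f)

pending = recover()
if pending and pending["step"] == "account_done":
    # The host died between the account and shipping steps; we must
    # either retry the shipping step or compensate the account record.
    print("resuming saga", pending["saga_id"])
```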

Okay, so let’s say that we thought of all of this in time, we introduced a persistent, stateful workflow solution, we have a working setup with multiple redundancies everywhere. Are we done yet? Well, not entirely.

Parallel operations

We’ve already touched upon the fact that we have a list of allowed cities within the shipping service, which is maintained by someone. This presumes another operation: “maintain cities”. Let’s say that this list is updated using a single operation, which can never go wrong, and which returns a list of addresses the operator has to notify about shipping changes. How is this a problem?

Well, it’s not a problem in itself; the issue comes from the interaction between two processes happening in parallel. Remember that the point of this (increasingly complex) exercise is to decrease coupling, i.e., to reduce the amount of implied information between subsystems.

When looked at in isolation, the shipping service has three operations so far:

  • new address
  • delete address
  • maintain cities

The service can guarantee that when the cities are updated, the list of offending addresses it returns is correct, and that when a new address is added, it’s checked against the latest list of allowed cities. If we look at the service alone, everything is fine.

The problem comes from the fact that an address is not a “real” address until it has passed validation in another service, one we can’t know about due to the loose coupling principle. This can result in addresses being returned to the operator that have no client-records. We already established that this is wrong behaviour, so now that knowledge has to be cascaded into the software the operator uses when looking up the clients to notify based on the list they received from the “maintain cities” operation.
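Played out with the same hypothetical in-memory stand-ins, the interaction looks like this:

```python
# "maintain cities" runs while a create-client saga is mid-flight, and the
# list of addresses it returns includes one that is about to be rolled back.
allowed_cities = {"London", "Leeds"}
addresses = {"a@example.com": "Leeds"}   # written by step 1 of a saga...
accounts = {}                            # ...whose account step hasn't run yet

def maintain_cities(removed_city):
    allowed_cities.discard(removed_city)
    # Return the addresses the operator must notify about shipping changes.
    return [email for email, city in addresses.items() if city == removed_city]

to_notify = maintain_cities("Leeds")

# The saga now fails (say, the email already existed) and compensates:
addresses.pop("a@example.com", None)

print(to_notify)  # ['a@example.com'] -- the operator will try to notify a
                  # client-record that no longer exists anywhere
```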

There are ways to kick this can down the road, but it will eventually come back to haunt us. The operator’s UI software (a 5th service) now has to special-case addresses with no client-records, because we can potentially store the state of half-finished “create client” operations, which in turn follows from design decisions made earlier. This is exactly the sort of problem we’re trying to avoid with loose coupling in the first place. The truth is that the “coupling” has to go somewhere, unless the subsystems participate in costly 2-phase commits or some other, usually impractical, monolith-like setup.

In the next article, we’ll look at what we need to do to map out these kinds of problems in a slightly more complex example.

https://andrasgerlits.medium.com/microservices-clearing-up-the-definitions-f679ebb794cb


Writing about distributed consistency. Also founded a company called omniledger.io that helps others with distributed consistency.