Logging Like a Pro

Theories and best practices on effective application logging

Published in

ITNEXT

9 min readMay 19, 2020

Logs are like evidence in crime scenes, and developers are like CSIs. Logs play a crucial role in developers’ investigation of a bug or an outage. Same as how lack of evidence leads to cold cases, lack of meaningful logs makes troubleshooting time-consuming or even impossible. I have seen people wrestle with tools such as strace and tcpdump, or deploy new code to production during an outage just to get more logs to diagnose a problem.

As the old saying goes, “A beard well lathered is half shaved,” all professional developers should learn to log effectively. This article aims to not only provide a list of useful application logging practices but also explain the theories, such as the what, the when and the who in logging.

How do we debug

It’s hard to talk about logging without first discussing what a program is and how people debug because logs are supposed to help with debugging.

Program as state transitions

A program is a series of transitions among states.

States are what a program stores in its memory at a particular time point, and the code of a program defines how the program transit from one state to another. Developers using imperative programming languages such as Java often focus more on the procedure (code) than states. However, considering a program as a series of states is a crucial mindset to adopt, because states are keener to what the program should do, rather than how the program does it.

For instance, if a robot is getting my car’s gas tank filled, when described as state transitions, it needs to go from the state (tank=empty, money=$50) to the state (tank=full, money=$15). When described as a process, the robot needs to find a gas station, get the car there, and pay. The procedure matters a lot, but states are more direct in measuring the correctness of the program.

Debug

Debugging is a mental reconstruction of state transitions. In their minds, developers replay how a program accepts inputs, goes through a series of state changes and produces outputs, then figure out what goes wrong. In development, developers use a debugger to assist with this mental process, while in production, it’s much harder to use debuggers therefore logs are used more often.

What to log

With the definition of debugging, what to log becomes more straightforward:

Logs should contain sufficient information to help with the reconstruction of state transitions.

It’s impossible and unnecessary to capture all states at all time points. The polices don’t need hologram videos but a few good sketches to catch criminals. It’s the same for logs, developers only need to log when critical state transition happens, and logs should contain the key characteristics of the current state and the reason for transition.

Critical state transition

Not all state transitions are worth logging. The key is to think about the execution of a program as a series of state changes, divide them into phases, then focus on the time when the execution goes from one phase to another.

For example, assume there are 3 phases in the startup of an app

loads configurations
connects to dependencies
starts a server

The startup of an app described as state transitions

It would be very reasonable to log at the beginnings and the ends of all these phases. Should there be any error connecting to dependencies and the app hangs, logs can clearly show that the app has loaded its configuration, entered the dependency connection phase but didn’t complete. With that, developers can quickly pinpoint the problem.

Key characteristics

Logging states is like sketching your program, only key characteristics of the core business logic need to be capture. If there is any sensitive information such as PII, they should still be logged but masked or described indirectly.

For instance, when an HTTP server changes from the state “await request” into the state “request received,” it should log HTTP method and URL because they describe the basics of an HTTP request. Other elements of an HTTP request such as headers or part of the body should be logged if their values affect business logic. E.g., if the server behavior differs significantly between Content-Type:application/json and Content-Type:multipart/form-data, the header should be recorded.

Reason for state transition

Logging the reason of state transition is extremely helpful to debugging. Logs should briefly cover the meanings of the previous and the next states, and explain why. They help developers to connect dots and navigate the reconstruction(debugging) of program execution.

Example

A simple example to put all these together: assuming a server receives a malformed SSN number, and a developer wants to log about that event.

A few anti-pattern logs that lack key state characteristics and reasons:

[2020–04–20T03:36:57+00:00] server.go: Error processing request
[2020–04–20T03:36:57+00:00] server.go: SSN rejected
[2020–04–20T03:36:57+00:00] server.go: SSN rejected for user UUID “123e4567-e89b-12d3-a456–426655440000”

They all contain some information but not enough to answer questions developers might have during troubleshooting such as: What kind of request did the server fail? Why was the SSN rejected? Which user was impacted? A good log that helps with debugging is:

[2020–04–20T03:36:57+00:00] server.go: Received a SSN request(track id: “e4a49a27–1063–4ab3–9075-cf5faec22a16”) from user uuid “123e4567-e89b-12d3-a456–426655440000”(previous state), rejecting it(next state) because the server is expecting SSN format AAA-GG-SSSS but got **-***(why)

Who should log

A pitfall that many overlook is “the who” in writing logs. Writing logs from the wrong functions cause logs to contain either duplicated or insufficient information.

Program as layers of abstraction

Most well-constructed modern program is like a pyramid with layers of abstractions. Upper-layer classes/functions break a complex task into subtasks, while lower-layer classes/functions abstract subtask implementations like black boxes and provide interfaces for upper-layer to invoke. This paradigm makes programming easier because each layer only needs to focus on its logic without worrying about all the gory details.

For example, a website can consist of the business logic layer, the HTTP layer, and the TCP/IP layer. When responding to a request to some URL, the business logic layer focuses on deciding which webpage to show, and passes webpage content to the HTTP layer. The HTTP layer bakes the content into an HTTP response, then has the TCP/IP layer turn the HTTP response into TCP packets and send them out.

Don’t log at the wrong layer

As a consequence of abstraction, different layers have different levels of knowledge of an ongoing task. In the previous example, the HTTP layer does not have many ideas about how many TCP packets are sent, nor users’ intention when they are requesting this URL. When trying to log something, developers should pick the right layer that has a full view of state transitions and reasons.

In our SSN validation example, assuming SSN validation logic is wrapped into a Validator class such as

public class Validator {
    // other validation functions ...
 
  public static void validateSSN(String ssn) throws ValidationException {
    // do the validation
    String regex = "^(?!000|666)[0-8][0-9]{2}-(?!00)[0-9]{2}-(?!0000)[0-9]{4}$";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(ssn);
 
    if (!matcher.matches()) {
      // --> Log Location A <--
      logger.info("Bad SSN blah, blah, blah...");
      throw new ValidationException(String.format("expecting SSN format AAA-GG-SSSS but got %s", ssn.replaceAll("\\d", "*")));
    }
  }
}

And there is another function that validates a request that updates user info, and it invokes SSN validation:

public class Validator {
    // other validation functions ...
 
  public static void validateUserUpdateRequest(UserUpdateRequest req) {
    // validate other attribute of req ...
 
    try {
      validateSSN(req.ssn);
    } catch (ValidationException e) {
      // --> Log Location B <--
      logger.info(String.format("Received a user update request(track id %s) from user uuid %s, rejecting it because %s", req.trackID, req.uid, e.getMessage()));
      // other error handling logic ...
    }
 
    // other logic ...
  }
}

There are two places, locations A and B, to write logs about an SSN validation failure, but only location B has sufficient information to log about. At location A, the program does not know what kind of request it’s handling nor which user is the request from. Logging does not do much but adding verbosity. It’s more appropriate for validateUserUpdateRequest to throw the error out to the caller (validateRequest) who has more context to log.

Although that is not to say that logging at lower layers of a program is all unnecessary, especially when lower layers don’t expose errors to high layers. For example, the network layer can have retry logic embedded, which means high layers don’t notice some of the intermittent network issues. In general, lower layers can write logs more at DEBUG level rather than at INFO level to reduce verbosity. If needed, developers can tune the log level to get more details.

How many logs?

There is an obvious trade-off between log volume and usefulness. The more meaningful logs, the easier to reconstruct the state transitions. There are two things you can do to control your logging bill:

Estimate the relationship between your logs and workload

To control log volume, it’s important first to give it a fair estimate. Most programs have two types of workload:

Receive work items(requests) then respond
Poll work items from somewhere else then perform actions

Most logging is triggered by workload, the more workload the program has, the more logs it writes. There can be other non-workload associated logs but they become trivial when a program starts taking workload. Developers should keep the relationship between # of logs and # of work items linear, or in another word:

# of logs = X * # of work items + constants

where X can be determined by examining the code. Developers should have a good idea of the X factors for their programs and match it with the logging capacity and budget. A few common cases of X are:

0 < X < 1: This means that logs are sampled and not every work items have logs. E.g. only log on error, or use some other log sampling algorithms. It helps reducing log volumes but it can also limit troubleshooting.
X ~ 1: This means that, on average, every work item produces roughly one record. It’s usually reasonable as long as that one log contains sufficient information(see “what to log” section).
X >> 1: People should have a very good reason to have an X that is way larger than 1. Because when the workload pikes, for example, the server receives a sudden storm of HTTP requests, X amplifies that and puts a huge load on the logging infrastructure, which is usually a call for trouble.

Make use of log levels

What if X is still too large, even after optimization? Log levels can help. If X is way larger than 1, maybe some logs can be put at the DEBUG level to lower the X of the INFO level. When troubleshooting, the program can run at the DEBUG level temporarily to provide more information.

Summary

To write logs like a professional, one should consider a program as a series of state transitions with layers of abstractions. With those theories in mind, these key questions in application logging can be answered:

When to log: Log when critical state transition happens.
What to log: Key characteristics of the current state and the reason for state transition.
Who to log: Log at the right layer of abstraction that has sufficient context.
How many logs: Estimate the X factor(X as in, # of logs = X * # of work items + constants), and tune it to a budget but helpful value.

If you enjoyed this article, follow me on Medium! I write about distributed systems and software architecture, such as:

Automation-friendly Software Systems and How to Build Them. Tips on designing your software systems toward better automations
Minimalist Software Architecture. Lessons learned from building large-scale multi-region distributed systems
Before you use Kustomize. The pros and cons of the popular Kubernetes config management tool Kustomize