OpenTelemetry — Understanding SLI and SLO with OpenTelemetry Demo

Diego Amaral
Published in ITNEXT
15 min read · Apr 22, 2023


I promise that I will try not to be annoying.

Introduction

This topic is probably one of the most "boring" ones for some Software, Site Reliability, and DevOps engineers I've worked with over the years. So perhaps I prefer to use another word: tricky.

Even if you assume that you do not need, or do not use, Service Level Objectives (SLOs), you use them somehow. For example, working as a Software Engineer, you try to build reliable code for a feature or component; working as a Site Reliability Engineer, you also try to deliver a reliable application and platform.

This can only be achieved with reliable evidence, where the output can be quantified. Are you used to joining calls to check the current objectives of the product and plan new goals for the software roadmap, always collecting reliable pieces of evidence? Congratulations, you are making use of SLOs. You may call them by another name; however, at heart, you are using SLOs.

So, as we are discussing modern observability topics here, OpenTelemetry has to be part of the scene. It is one of the best frameworks for simplifying the export, collection, and analysis of telemetry data from cloud-native applications. It is vendor-neutral and focused on interoperability for distributed systems. We will use the e-Commerce application from the OpenTelemetry Demo to analyze and visualize a real scenario using SLIs and SLOs.

Part 1: Before we touch the sky

What are an SLI, an SLO, and an SLA?

Service Level Indicator (SLI):

Think of an SLI as a measure of how well something is doing what it's supposed to do. From a technical point of view, this is something you can "feel" when using products built with this approach. For instance, from a user's perspective, response time, error rate, and availability are indicators.

As mentioned above, we can go deeper and look at some classic examples of SLIs:

  • Request latency should be under 330 milliseconds
  • The availability of the server should be 99.9% for a given period.
  • Throughput for an e-Commerce endpoint, for instance, using the number of successful purchases per minute
  • The error rate for the service should be below 1%

All the points above help us measure the level of service delivered by a system. When thinking about SLIs, remember the association with Product Managers, Product Owners, and SREs, where technical and clear objectives are designed.
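As a minimal sketch of how such indicators reduce to numbers (all counts below are made up for illustration, not taken from the demo), in Python:

# Hypothetical raw counts for a single day of traffic.
total_requests = 10_000
failed_requests = 42
uptime_seconds = 86_164   # seconds the service answered health checks
window_seconds = 86_400   # one day

error_rate = failed_requests / total_requests    # SLI: error rate
availability = uptime_seconds / window_seconds   # SLI: availability

print(f"error rate:   {error_rate:.2%}")    # 0.42%   -> meets "below 1%"
print(f"availability: {availability:.3%}")  # 99.727% -> misses "99.9%"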

Service Level Objectives (SLO):

On the other side, SLO works with the word "promise." This is because you must perform a certain way most of the time and quantify the reliability of the product. After all, it is directly related to the customer experience.

Some cases that can be associated with SLOs:

  • Response time of 100 milliseconds for all requests
  • System uptime of 99.99%
  • An error rate of less than 0.8%
  • Error budget

Generally, SLO targets tend to be aggressive. However, aiming for perfection may not be worth it. In the end, the customers need to be happy. If 99.99% keeps customers happy and at ease, there is no need to push for a higher value.
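To make the error budget bullet above concrete, here is a small sketch of the arithmetic; the 99.99% target comes from the example in this section, and the 30-day window is an assumption for illustration:

# Error budget left by a 99.99% SLO over a 30-day window.
slo_target = 0.9999
window_days = 30

window_minutes = window_days * 24 * 60
error_budget_minutes = window_minutes * (1 - slo_target)

print(f"{error_budget_minutes:.1f} minutes of unreliability allowed")  # ~4.3 minutes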

In the preface of the book Implementing Service Level Objectives: A Practical Guide to SLIs, SLOs, and Error Budgets [7], the author gives a great example, "You Don't Have to Be Perfect," that could help you on your journey with SLOs.

Service Level Agreement (SLA):

If any of the "agreements" mentioned above is broken, a value, a price, or something tangible must be on the table. In other words, a contract. Almost all of the consequences are financial, but they can vary, as said before, for instance:

  • Uptime falls below 99.9% in a Black Friday week. As a result, the provider will issue a discount of 40% to the customer.
  • Support requests will be responded to within 1 hour.
  • Maintenance will be scheduled outside of business hours.

Metrics Signal in OpenTelemetry

The Basics

Metrics bring information about the state of a running system. This results in data that can be aggregated over time and helps us analyze trends and patterns that can be visualized in specialized tools. As we will use an e-commerce example, it could be the number of black binoculars sold today.

Metrics also monitor the critical health of applications and help decide whether an on-call engineer should be paged, based on an alert designed using SLIs and SLOs.

OpenTelemetry draws on open-source formats such as OpenMetrics, StatsD, and Prometheus and combines them into a unified data model.

High-Level Definition

A metric in OpenTelemetry comprises three primary components: the name, the value (data point value), and the dimensions.

The name of a metric identifies the measurement being captured, while the value represents the current state of that measurement. Metrics can also include labels that provide additional metadata about the measurement and give it context, allowing better analysis and comparison.

These labels are carried as dimension information. How dimensions are represented differs depending on the back end in use. For example, Prometheus uses the concept of labels, StatsD adds a prefix to the metric’s name, and in OpenTelemetry, dimensions are added to metrics via attributes.
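As a hedged sketch of those three components in the OpenTelemetry Python API (the meter name, metric name, and attribute keys below are invented for illustration):

from opentelemetry import metrics

# Name: identifies the measurement being captured.
meter = metrics.get_meter("astronomy.shop")
orders_counter = meter.create_counter(
    "orders_placed",
    unit="1",
    description="Number of orders placed in the webstore",
)

# Value + dimensions: each data point carries a value and attributes,
# which a back end such as Prometheus exposes as labels.
orders_counter.add(1, {"product.category": "binoculars", "currency": "USD"})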

During the Implementation section, we will look at basic performance data from the OpenTelemetry Demo in Prometheus with Grafana.

However, before we get into the details, let’s look at how these details are expressed in the Prometheus UI.

Prometheus Table View

  1. The name of the metric is otelcol_exporter_send_failed_requests
  2. The dimensions recorded here are in curly braces using key-value pairs, and the service_name label contains the otel-contrib service
  3. The value is an integer that represents either the last received value or a calculated current value, depending on the metric type.
Prometheus UI Table View

The value shown above is the current value, aggregated cumulatively.

Prometheus Graph View

Prometheus UI Graph View

The graph above shows that the received data is stored over time, i.e. a live graph is generated.

We can see a simple example with a start time or, to use a better word, a trend. It is possible to observe anomalies and be prepared for patterns caused by the behavior of the telemetry.

How does OpenTelemetry help?

In a real-world case, you will likely use the OpenTelemetry SDK to instrument your code and collect telemetry data. It is also possible to use the OpenTelemetry Operator to help with the instrumentation. There are many ways to achieve this.
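A minimal sketch of what that SDK instrumentation can look like in Python, assuming the opentelemetry-sdk and OTLP gRPC exporter packages are installed and a Collector is listening on localhost:4317 (the service name and the metric recorded below are hypothetical):

from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

# Export metrics to a local OpenTelemetry Collector via OTLP/gRPC.
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://localhost:4317", insecure=True)
)
provider = MeterProvider(
    resource=Resource.create({"service.name": "checkoutservice"}),
    metric_readers=[reader],
)
metrics.set_meter_provider(provider)

# Record an SLI-friendly measurement: order acknowledgement latency.
meter = metrics.get_meter("checkout")
ack_latency_ms = meter.create_histogram(
    "order_ack_latency", unit="ms", description="Order acknowledgement latency"
)
ack_latency_ms.record(230, {"status": "acknowledged"})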

Defining SLIs

Identify the specific metrics or telemetry data that will be used to measure the performance and availability of your service. For example, you may use response time, error rate, or throughput as your SLI.

Defining SLOs

Using the SLIs, define a set of clearly defined goals for your service's performance and availability. For example, you may set a target response time of 200ms or a 99.9% uptime goal.
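As a small sketch (the 99.9% goal is the one from this paragraph; the probe counts are invented), an SLO is just a target that the measured SLI gets compared against:

from dataclasses import dataclass

@dataclass
class Slo:
    name: str
    target: float  # e.g. 0.999 for a 99.9% uptime goal

def slo_met(good_events: int, total_events: int, slo: Slo) -> bool:
    """Compare the measured SLI (good / total) against the SLO target."""
    return total_events == 0 or good_events / total_events >= slo.target

uptime = Slo(name="webstore-availability", target=0.999)

# Hypothetical: 86,200 of 86,400 one-second probes succeeded today.
print(slo_met(86_200, 86_400, uptime))  # False: 99.77% < 99.9%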

Monitoring SLIs

In this step, we need a monitoring tool to track the SLIs and ensure they meet our targets, e.g. New Relic, but you can use Prometheus, Grafana, or another tool that can perform this step.

Needed Actions

Use the data gathered from monitoring your SLIs and SLOs to make informed decisions about how to improve your service from here on.

Overall, integrating SLI and SLO with OpenTelemetry will help you make more informed decisions about the performance and availability of your service and give you a clear set of goals to aim for going forward.

OpenTelemetry Demo Overview

What is the OpenTelemetry Demo?

The OpenTelemetry Demo is an Astronomy Shop e-Commerce application, a microservice-based distributed system intended to illustrate the implementation of OpenTelemetry in a near real-world environment.

These are the main points present in the project:

  • A Showcase of what OpenTelemetry has to offer
  • Provides a realistic example of how distributed systems work
  • It is possible to use different back-end vendors (NewRelic, DataDog, Dynatrace)
  • Around 10 to 13 microservices, written in roughly ten different languages
  • It is open to forks
  • Works with Docker and Kubernetes
  • Anyone can use it

Understanding SLO Application for the Webstore

The abstraction above leads us to a real SLO experience, which can be illustrated with the points below.

Operation

This is when an order is generated, i.e., when we select a product from the catalog. This is where we have the first customer-facing interaction, and where we can decide what is important enough to deserve a Service Level Objective.

Service Level Indicator

The latency in seconds from “order submitted” to the final “order acknowledged.” That is, for each order submitted, we need to be able to capture that request in the telemetry data.

Trends

Using trends, we will see patterns for specific times of the day while the app is in use. For instance, since metrics are essential, we will benefit from the aggregation timeline in our e-Commerce system. So we set an aggregation window, such as 12 hours.

With this, we can see, for example, that orders become unreliable every Friday after NASA’s announcement of the solstice. We can also point out hotspots in the system.
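A tiny sketch of that aggregation window idea, with invented timestamps bucketed into 12-hour windows:

from collections import Counter
from datetime import datetime, timedelta

window = timedelta(hours=12)
start = datetime(2023, 4, 21, 0, 0)

# Hypothetical "order submitted" timestamps.
orders = [start + timedelta(hours=h) for h in (1, 3, 5, 13, 14, 15, 16, 26)]

# Count how many orders fall into each 12-hour aggregation window.
buckets = Counter((ts - start) // window for ts in orders)
for bucket, count in sorted(buckets.items()):
    print(f"window starting {start + bucket * window}: {count} orders")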

Target Reliability

Remember we mentioned before that the happiness of our clients is more important than perfection? So yes, here is where we set this representative value. In real production systems, a value such as 90% is commonly used by many companies.

Objective

The final measurement can be stated as: new orders are acknowledged within 1 second, and this level of service is achieved 90% of the time.
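A hedged sketch of checking that objective against a batch of order latencies (the numbers are invented; in practice they would come from the telemetry described above):

# Latency, in seconds, from "order submitted" to "order acknowledged".
order_latencies_s = [0.42, 0.73, 0.95, 1.8, 0.61, 0.88, 2.4, 0.55, 0.97, 0.66]

threshold_s = 1.0  # orders should be acknowledged within 1 second
target = 0.90      # ...and that should hold 90% of the time

good = sum(1 for latency in order_latencies_s if latency <= threshold_s)
sli = good / len(order_latencies_s)

print(f"SLI: {sli:.0%} of orders acknowledged within {threshold_s}s")  # 80%
print("SLO met" if sli >= target else "SLO missed")                    # SLO missed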

Part 2: Implementation of OpenTelemetry Demo Setup

Two architectures will be used. One is local, with Grafana and Prometheus, to create a basic SLI and SLO for the OpenTelemetry Collector infrastructure. The other uses New Relic as a back-end vendor to get information about the e-Commerce Webstore.

Prometheus and Grafana

Requirements

To start the project on your machine, you will need Docker and Docker Compose installed. Unfortunately, I will not cover the installation process here.

After you have Docker installed, make sure that you also have git installed, and clone the project:

> ~$ git clone https://github.com/newrelic/opentelemetry-demo
> ~$ cd opentelemetry-demo
> ~$ code .

Check the otelcol-config.yml

In this case, as we are using the New Relic fork of the OpenTelemetry Demo, you need to go to the following path:

> src/otelcollector/otelcol-config.yml

Check the receivers section, and configure another endpoint for Prometheus.

receivers:
  otlp:
    protocols:
      grpc:
      http:
        cors:
          allowed_origins:
            - "http://*"
            - "https://*"
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 10s
          static_configs:
            - targets: ['localhost:8889']

After that, check that prometheus exists under the exporters and set these configs:

exporters:
  prometheus:
    endpoint: 'localhost:9090/metrics'

Under processors, if you prefer, you can set the batch timeout:

processors:
  batch:
    timeout: 1s

Finally, check that the service pipelines contain all the configs we mentioned.

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [spanmetrics, batch]
      exporters: [logging, otlp]
    metrics:
      receivers: [otlp]
      processors: [transform, batch]
      exporters: [prometheus, logging]

Run the project

> $ docker compose up --no-build OR
> $ sudo docker compose up --no-build

Wait a few minutes, and you will be able to access the Webstore and the telemetry tools through their local URLs. For now, let’s make use of those URLs.

Accessing Grafana UI

When you access the Grafana URL and open the OpenTelemetry Collector dashboard, you will see the panels generated by the engineers who created the Demo.

You can click, create, edit, and view each one of them. For now, though, let’s create another panel. You could create it in a separate dashboard but, for the sake of understanding, I will create it in the same dashboard shown above.

Creating the Panel

Go to the bottom right of the page and click the icon to add a new panel. Select Prometheus as the data source.

After you add the panel, add these two PromQL queries to help understand how we can analyze the time series.

Defining the SLI Query.

sum(rate(otelcol_exporter_send_failed_requests{job="otel-collector"}[1m]))

This PromQL query calculates the sum of the failed requests sent by the OpenTelemetry Collector exporter within the last 1 minute.

The “otelcol_exporter_send_failed_requests” metric tracks the number of requests that failed to be sent by the OpenTelemetry Collector exporter.

The “rate” function calculates the per-second rate of increase of the metric, given the time range specified in the square brackets (1m in this case).

The “sum” function then adds up the per-second rates calculated by the “rate” function across all matching series, giving us the overall rate of failed requests over the last minute.

The query is filtered by the “job” label, which is set to “otel-collector”, to specify the specific instance of the monitored OpenTelemetry Collector.

Overall, this query helps to track the health and performance of the OpenTelemetry Collector exporter, alerting administrators or developers when there is a significant increase in the number of failed requests.

Defining the SLO Query.

ceil(sum(rate(otelcol_exporter_send_failed_requests{job="otel-collector"}[1m])) * 0.8)

This PromQL query monitors the failure rate of outgoing requests from the otel-collector job. Let’s break down the different parts of the query:

1. otelcol_exporter_send_failed_requests: This metric measures the number of failed requests sent from the otel-collector exporter.

2. {job="otel-collector"}: This is a label selector that filters the metric to only include data from the otel-collector job.

3. rate(otelcol_exporter_send_failed_requests{job="otel-collector"}[1m]): This calculates the per-second rate of failed requests over the last 1 minute.

4. sum(rate(otelcol_exporter_send_failed_requests{job="otel-collector"}[1m])): This computes the sum of the per-second rate of failed requests over the last 1 minute.

5. * 0.8: This multiplies the above sum by 0.8 to calculate 80% of the threshold limit.

6. ceil(): This rounds up the final result to the nearest integer.

Therefore, this query returns the ceiling value of 80% of the failures per second rate in the last 1 minute. If the number of failed requests exceeds this threshold, then it can be used to trigger an alert or take some other action.

Showing the cumulative aggregation values for the sum counter

The graph shows the sum of requests over a period of time.

Time window: 22:00, 22:15, 22:20, 22:25, and so on.

Cumulative values: 0.65, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95, 1.00

Important Note

In OpenTelemetry, as we saw earlier, a Sum measures incremental changes to a given value. It can be monotonic or non-monotonic, and it must have an associated aggregation temporality, which can be represented in two ways: Delta and Cumulative. Delta contains the change in a specific value since its previous recording. Cumulative, on the other hand, includes the previously reported sum plus the delta being reported.
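A tiny illustration of the difference, with made-up report values for a monotonic Sum:

from itertools import accumulate

# Change measured since the previous report (Delta temporality).
deltas = [3, 1, 0, 4, 2]

# Previously reported sum plus the new delta (Cumulative temporality).
cumulative = list(accumulate(deltas))

print("delta:     ", deltas)       # [3, 1, 0, 4, 2]
print("cumulative:", cumulative)   # [3, 4, 4, 8, 10]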

If you want to understand more about cumulative monotonic and delta monotonic Sums, I recommend the talk from James Guthrie at PromCon EU, at the 19:04 mark. [9]

NewRelic Back-End Vendor

We will run the Demo App with Docker and, for this setup, use New Relic as the back-end vendor. It offers many great features, but we will focus on the basics.

Vendor Back-end: NewRelic

Create your account

You need to create your account via this link: https://newrelic.com/signup

Find your LICENSE KEY

  • STEP 1
    — Go to the bottom left of the page after you create your account
  • STEP 2
    — Click on the API KEYS section
  • STEP 3
    — Copy the INGEST - License key

Once you have saved these configurations, let's implement the project.

Part 3: Validating

Export the environment variables

> $/ export NEWRELIC_OTLP_ENDPOINT_US=https://otlp.nr-data.net:4317 
> $/ export NEWRELIC_OTLP_ENDPOINT=${NEWRELIC_OTLP_ENDPOINT_US}
> $/ export NEWRELIC_LICENSE_KEY=111111111111111112222222

Note that the OTLP endpoint variable is suffixed with _US; this depends on the region shown when you select your account region, and it could differ if you choose the EU region, for example.

Check the .env file present at the root of the project

.env file

Go to the end of the file and check that the following block matches the one below:

# *********
# New Relic
# *********
# Select corresponding OTLP endpoint depending where your account is.
NEWRELIC_OTLP_ENDPOINT_US=https://otlp.nr-data.net:4317
NEWRELIC_OTLP_ENDPOINT=${NEWRELIC_OTLP_ENDPOINT_US}

# Define license key as environment variable
NEWRELIC_LICENSE_KEY=${NEWRELIC_LICENSE_KEY}

Run the project

> $ docker rm -vf $(docker ps -aq)
> $ docker rmi -f $(docker images -aq)

> $ docker compose up --no-build OR
> $ sudo docker compose up --no-build

Wait a few minutes, and then you can access the New Relic URL.

Visualizing the Telemetry

Considering that you completed all the steps above, let's access the New Relic page and visualize some SLO examples in a real environment.

Action

Access localhost:8080 and simulate some orders by buying anything. For instance, I purchased 6 Roof Binoculars:

localhost:8080

Go to the New Relic One page and select Traces in the side menu. After that, select a checkbox; in my example, I selected the checkout entity.

NewRelic Trace Page

I clicked on the first HTTP POST trace group, which then showed the group of HTTP POSTs, and I selected the first one.

This will show us the full trace of the request, containing the spans from each call across the applications.

I will not focus here on the tracing and span path, only on the measurement metrics that the New Relic page shows us so conveniently.

It will be possible to check all the checkout processes we mentioned at a high level in the diagram above.

New Relic Tracing Group

I will click on the span whose path includes send_order_confirmation.

NewRelic Individual Span

I disabled the span events flag and just focused on the data generated by the request.

You can also go to Query Your Data in the side menu, select some metrics, and check the behavior of requests, I/O, networking, memory, and latency for each service. For example, I chose the system.memory.utilization metric for the past 60 minutes.

Query Data with System Memory Utilization on the NewRelic side Menu

Also, check the average latency of the Kafka consumer:

Query Data with Kafka on the NewRelic side Menu

You can also select the dimensions of the metric and create SLIs/SLOs based on this telemetry.

Query Data with Kafka Dimensions on the NewRelic side Menu

Remember, this is just a small slice of a large subject that can be explored in many ways.

Conclusion

SLIs and SLOs are present in production systems today, whether you agree with that or not, even if you call them by another name.

Perfection sometimes doesn't represent the happiness of the customer.

It is modular and will depend on the product you use and on the experience of the team managing the load.

It is possible to have a reliable product once everything is planned and executed correctly by the engineering and product teams.

Everything is telemetry, and how well prepared we are to extract and polish the important information from this data will determine the benefits for those participating in the process.

Takeaways

There is an entire set of tools that can help you achieve a better experience with SLO implementations. I recommend you look at some examples with the Sloth Prometheus SLO generator [3].

  • Sloth generates understandable, uniform, and reliable Prometheus SLOs for any kind of service, using a simple SLO spec that results in multiple metrics and multi-window, multi-burn alerts.

It is possible to conduct a few analyses with Grafana on simple applications, collecting service health and basic info, before tackling a large and complex application.
