Scaling IoT Monitoring & Observability Solutions

Published on Fri Aug 02 2024 David Nepozitek Software Engineer at Spotflow

Observability solutions are notoriously expensive when managing large-scale operations, particularly in IoT environments where telemetry data comes from thousands of devices. In this article, we will explore strategies for scaling an observability stack to maintain high performance and detailed insights while minimizing costs.

In the previous article, we discussed the essentials of monitoring and observability in IoT. Mainly, we presented how to leverage logs, metrics, traces, and structured events to enhance the observability of your IoT systems.

Operating tens of thousands of IoT devices is no longer unusual. At that scale, an observability solution can quickly run into insufficient performance and unbearable infrastructure costs. This article therefore focuses on handling large scale.

Choosing a Performant Database

Okay, we know what to collect. Now we just dump all the data into MySQL and we’re ready to observe, right? Well, not so fast (pun intended). This might not be the best idea for several reasons. We’ll look at our requirements for the database and then suggest storage that will serve our needs better.

First, let’s review a few characteristics of storing IoT observability data:

The querying speed is important. When dealing with a production outage, the last thing you want is to wait several minutes until your debugging queries finish.
We will deal with many dimensions and high cardinality. The high number of dimensions comes from the idea of capturing many attributes of your operation to prepare for unknown conditions. Also, there will be important columns with high cardinality (the number of unique values of the column) such as the device IDs.
We need to query across all dimensions efficiently. We don’t know which attributes will be important when debugging a specific issue.
We will usually be interested in data coming from a limited time range. The time range will often correspond to the periods when you observe degraded service of your system.

There’s definitely more to it, but this small set of characteristics will be enough to make our point.

General-purpose SQL Databases Might Be Insufficient

We’re probably all familiar with SQL databases, so it’s natural to consider them as a place to store our observability data. However, several technical aspects make general-purpose SQL databases a poor fit for storing large-scale observability data.

Traditional row-oriented databases, like MySQL or PostgreSQL, struggle to efficiently handle queries on tables with many dimensions when only a subset of columns is required. Another issue of high dimensionality is the difficulty of implementing efficient indexing. We can’t create database indices for a subset of columns beforehand, because we don’t know which dimensions will be important during troubleshooting. So, we would either need to index all columns (which would be quite expensive), or the queries would be slow when filtering based on the unindexed columns.

Also, without explicit time-based data partitioning, there is usually no efficient way of discarding old data. Time-partitioning allows efficiently deleting large chunks of data when they get stale.

If you have good reasons to use a traditional SQL database for observability data, you might want to consider Timescale. It is a PostgreSQL extension that addresses some of the challenges mentioned above with time partitioning and better compression while still using the row-based SQL model. Spotflow provides seamless integration with PostgreSQL and Timescale via the SQL egress sink. Our platform transforms the JSON messages from your devices into database rows according to the mapping that you specify.

Signal-Specific Storages Scale Better

The categorization of observability signals into metrics, logs, and traces has led to the development of specialized storages tailored to each signal type. For example, there is Mimir for metrics, Loki for logs, and Tempo/Jaeger for traces. Each of these storages is made with the specific signal type in mind, which makes them effective for monitoring use cases within the specific signal. However, it might be cumbersome to query data across these storages.

Figure: The Grafana LGTM stack (Loki, Grafana, Tempo, Mimir) provides efficient storage for logs, traces, and metrics.

Additionally, certain storages have specific limitations. For instance, traditional time series databases (TSDBs, such as Mimir) struggle with high-cardinality data. TSDBs store a separate time series for each unique set of attributes. This approach can be very efficient with a limited number of dimensions and low cardinality, as writing and querying within a single time series is very performant. However, with high cardinality, the database keeps encountering new unique combinations of attributes and has to create a new series for each of them. As a result, when retrieving aggregate values, the database needs to read through a large number of time series, making the operation inefficient. This issue is particularly problematic within the IoT sector, where using high-cardinality labels such as device ID and sensor ID would seem appropriate.

Figure: Traditional time-series databases store a separate time series for each unique set of attributes.
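To make the series growth described above more tangible, here is a minimal Python sketch. The `TinySeriesStore` class is a hypothetical toy model, not how Mimir or any real TSDB is implemented. It only assumes the behavior stated above: one series per unique set of attributes.

```python
from collections import defaultdict


class TinySeriesStore:
    """Toy model of a TSDB that keeps one series per unique label set."""

    def __init__(self):
        # key: sorted tuple of (label, value) pairs -> list of (timestamp, value)
        self.series = defaultdict(list)

    def write(self, labels, timestamp, value):
        key = tuple(sorted(labels.items()))
        self.series[key].append((timestamp, value))

    def total(self):
        # An aggregate over everything has to visit every series.
        return sum(value for points in self.series.values() for _, value in points)


low = TinySeriesStore()   # only a low-cardinality label
high = TinySeriesStore()  # same samples, plus a device_id label

for i in range(1000):
    region = "eu" if i % 2 == 0 else "us"
    low.write({"region": region}, timestamp=i, value=1.0)
    high.write({"region": region, "device_id": f"dev-{i}"}, timestamp=i, value=1.0)

print(len(low.series), len(high.series))  # 2 1000
print(low.total() == high.total())        # True
```

Both stores hold the same 1,000 samples, but adding the device ID label turns 2 series into 1,000, and the aggregate has to walk all of them.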

With Spotflow, you can leverage the strength of the Grafana stack and other storages that support the OpenTelemetry protocol (OTLP) using the OpenTelemetry egress sink. It allows you to send messages in the OTLP format into the Spotflow IoT platform, which will route these messages to your preferred observability backend.

Use Column-Oriented, Time-Partitioned Storage for the Best Scalability

With the increasing demand for analytical workloads similar to ours (as described above), a new wave of databases emerged. They employ columnar storage, which makes read operations more efficient because they only touch the columns required for the particular query. Thanks to time-partitioning, the database can restrict read operations to the relevant time range, making the queries even more efficient. The combination of these design choices benefits compression as well, because the algorithm operates on single columns bounded by a time range. Notable examples of such storages include InfluxDB, QuestDB, and ClickHouse.

Figure: A diagram of a column-oriented, time-partitioned database.
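The two ideas above, reading only the needed columns and skipping partitions outside the queried time range, can be sketched in a few lines of Python. This is an illustrative toy under simplifying assumptions (a fixed schema, daily partitions, in-memory lists), and the `ColumnStore` class and its methods are hypothetical names, not the design of InfluxDB, QuestDB, or ClickHouse.

```python
from datetime import datetime, timezone


class ColumnStore:
    """Toy column-oriented store with one partition per UTC day."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.partitions = {}  # date -> {column name -> list of values}

    def insert(self, ts, row):
        empty = {name: [] for name in ["ts", *self.columns]}
        part = self.partitions.setdefault(ts.date(), empty)
        part["ts"].append(ts)
        for name in self.columns:
            part[name].append(row.get(name))

    def select(self, column, start, end, where=None):
        """Values of `column` for start <= ts < end; `where` is (column, value)."""
        result = []
        for day, part in self.partitions.items():
            if day < start.date() or day > end.date():
                continue  # partition pruning: skip days outside the range
            # Only the ts column, the filter column, and the result column are read.
            filter_values = part[where[0]] if where else None
            for i, ts in enumerate(part["ts"]):
                if not (start <= ts < end):
                    continue
                if filter_values is not None and filter_values[i] != where[1]:
                    continue
                result.append(part[column][i])
        return result


utc = timezone.utc
store = ColumnStore(["device_id", "temperature", "firmware"])
store.insert(datetime(2024, 8, 1, 12, tzinfo=utc),
             {"device_id": "dev-1", "temperature": 20.5, "firmware": "1.0"})
store.insert(datetime(2024, 8, 2, 12, tzinfo=utc),
             {"device_id": "dev-1", "temperature": 21.0, "firmware": "1.1"})
store.insert(datetime(2024, 8, 2, 13, tzinfo=utc),
             {"device_id": "dev-2", "temperature": 35.0, "firmware": "1.1"})

temps = store.select(
    "temperature",
    start=datetime(2024, 8, 2, tzinfo=utc),
    end=datetime(2024, 8, 3, tzinfo=utc),
    where=("device_id", "dev-1"),
)
print(temps)  # [21.0]
```

The query never touches the `firmware` column or the partition for August 1, which is the effect that real columnar, time-partitioned databases achieve at a much larger scale.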

Sampling the Data

At a certain scale, it becomes unbearable to collect and store every observability signal that your devices produce. Thankfully, this is usually unnecessary as you can successfully debug issues with only a fraction of the observability data. For example, the events describing successful scenarios are often not as important as the ones describing failures. This is why we can discard most of these events and store only a few examples that are representative enough to reconstruct the particular historical situation.

Various sampling strategies exist to ensure that only a limited number of events are collected while still preserving sufficient detail. It's essential to choose a sampling approach that aligns with your specific needs. Instrumentation libraries, such as OpenTelemetry SDKs, often provide implementations of such sampling strategies. This makes sampling a relatively easy way to reduce storage and processing costs.

In the context of tracing, we distinguish two kinds of sampling based on the point where the sampling decisions are made: head and tail sampling. Head sampling decides whether a span/trace will be sampled right at the device, while tail sampling makes this decision later once all the spans of the particular trace are collected. The main advantages of head sampling are simplicity and cost efficiency. It reduces network traffic, which can be constrained in IoT environments, and avoids storing and processing unsampled data in observability backends. However, tail sampling becomes necessary if you prefer to make sampling decisions based on the entire trace. This approach is useful if you want to sample traces with errors differently than the successful ones.

Motivation for Data Sampling — The motivation for data sampling. (Source: OpenTelemetry)
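The difference between head and tail sampling can be illustrated with a short Python sketch. The function names, the span dictionaries, and the `status` field are assumptions made for this example. Real instrumentation libraries such as the OpenTelemetry SDKs ship their own samplers, and this code does not reproduce their APIs.

```python
import hashlib


def head_sample(trace_id: str, ratio: float) -> bool:
    """Head sampling: decide on the device, using only the trace ID.

    Hashing the ID makes the decision deterministic, so every span of the
    same trace gets the same answer without any coordination.
    """
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < ratio * 2**64


def tail_sample(spans: list[dict], ratio: float) -> bool:
    """Tail sampling: decide once all spans of a trace are collected.

    Assumed policy: keep every trace containing an error and only a
    fraction of the successful ones.
    """
    if any(span.get("status") == "error" for span in spans):
        return True
    return head_sample(spans[0]["trace_id"], ratio)


ok_trace = [{"trace_id": "t-1", "status": "ok"}, {"trace_id": "t-1", "status": "ok"}]
failed_trace = [{"trace_id": "t-2", "status": "ok"}, {"trace_id": "t-2", "status": "error"}]

print(tail_sample(failed_trace, ratio=0.0))  # True: errors are always kept
print(tail_sample(ok_trace, ratio=0.0))      # False: successes follow the ratio
```

Note the trade-off from the text: `head_sample` needs nothing but the trace ID and can run on the device, while `tail_sample` needs the complete trace, so all spans must first be sent to a place where they can be collected.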

Setting Up Retention Policies

Observability data tends to lose its value quickly over time. The telemetry received today is usually much more valuable than data from last year. This gives us another way to significantly trim storage costs. Retention policies allow the automatic removal of data beyond a specified timeframe. Time-based partitioning simplifies the implementation of retention policies, which is why many modern databases support them out of the box.
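To see why time-based partitioning makes retention cheap, consider the following Python sketch. It assumes data is grouped into daily partitions and uses a hypothetical `apply_retention` helper. Actual databases expose retention through their own configuration or commands.

```python
from datetime import date, timedelta


def apply_retention(partitions: dict, today: date, keep_days: int) -> list:
    """Drop whole daily partitions older than the retention window.

    Returns the partition keys that were dropped. No individual rows are
    scanned: an expired partition is removed in one step.
    """
    cutoff = today - timedelta(days=keep_days)
    expired = sorted(day for day in partitions if day < cutoff)
    for day in expired:
        del partitions[day]
    return expired


partitions = {
    date(2024, 7, 1): ["...rows..."],
    date(2024, 7, 20): ["...rows..."],
    date(2024, 8, 1): ["...rows..."],
}

dropped = apply_retention(partitions, today=date(2024, 8, 2), keep_days=30)
print(dropped)             # [datetime.date(2024, 7, 1)]
print(sorted(partitions))  # the partitions for July 20 and August 1 remain
```

Without partitions, the same policy would require finding and deleting individual expired rows, which is far more expensive on large tables.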

Another strategy is to use tiered storage, that is, to store older data in low-cost object storage such as Amazon S3 or Azure Blob Storage. Although querying this storage might have higher latency than querying local disks, it allows you to retain the data longer while still reducing storage costs.

Lastly, it is possible to reduce the resolution of historical data further. One approach is to perform a secondary round of downsampling on older data. An alternative approach is to explicitly create aggregates of historical data while discarding the original raw records.
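The second approach, replacing raw records with aggregates, might look like the following Python sketch. The record layout `(timestamp, device_id, value)`, the hourly bucket size, and the chosen statistics are assumptions for illustration only.

```python
from collections import defaultdict
from datetime import datetime, timezone
from statistics import mean


def downsample_hourly(records):
    """Collapse raw (timestamp, device_id, value) records into hourly aggregates."""
    buckets = defaultdict(list)
    for ts, device_id, value in records:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[(hour, device_id)].append(value)
    return [
        {
            "hour": hour,
            "device_id": device_id,
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": mean(values),
        }
        for (hour, device_id), values in sorted(buckets.items())
    ]


utc = timezone.utc
raw = [
    (datetime(2024, 8, 2, 10, 5, tzinfo=utc), "dev-1", 20.0),
    (datetime(2024, 8, 2, 10, 35, tzinfo=utc), "dev-1", 22.0),
    (datetime(2024, 8, 2, 11, 10, tzinfo=utc), "dev-1", 30.0),
]

aggregates = downsample_hourly(raw)
print(len(aggregates))       # 2
print(aggregates[0]["avg"])  # 21.0
```

Keeping count, minimum, maximum, and average per bucket preserves the overall shape of the historical data while the raw records can be discarded.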

Wrap Up: Choose Efficient Storage and Keep Only Essential Data

When setting up an IoT observability stack, you must decide where to store the data and select an appropriate observability backend. In this article, we have described various aspects to consider when making this decision to optimize cost-efficiency and scalability. The main points to remember are the following:

Optimize Storage Selection: Evaluate the access patterns to your observability storage and go with a database tailored to your needs. Choose a general-purpose database only when you’re really sure it will suffice. Otherwise, go with battle-tested observability databases for better scalability.
Set Up Data Sampling: Employ data sampling techniques to save on storage costs without compromising critical insights.
Fine-Tune Retention Policies: Configure retention policies to discard obsolete data, keeping your storage lean and reducing storage costs even further.

David Nepozitek

Software Engineer at Spotflow

With past years dedicated to IoT platform development, David has gained a solid understanding of industrial IoT use cases. Specializing in cloud application development, his proficiency extends to distributed systems and modern front-end technologies. David isn't afraid to tackle hard technological challenges and enjoys sharing the discoveries he makes on the way. In his blog posts, he shares his insights covering a range of topics including IIoT, cloud computing, and beyond.
