Managing IoT devices at scale

TOMAS PAJUREK, CTO & Head of Engineering at Spotflow

Published on Thu Oct 31 2024

Learn about the challenges that surface when cloud-to-device communication is applied at scale. Devices are typically added to IoT solutions continuously. Achieving proper and accurate device communication in such a scenario is non-trivial. We show how IoT platforms can help and what their must-haves are.

What does it take to remotely manage large fleets of IoT devices?

This is the second part of an article on remote management of IoT devices. The first part is available here: Remote management of IoT devices.

In the first part, we introduced several kinds of remote management along with examples of each. We argued that each kind has its specific purpose and place within an IoT solution.

However, selecting a suitable kind of cloud-to-device communication is only the first step toward a reliable production-grade IoT solution. We also need to make sure that:

The solution will be reliable even with several orders of magnitude more connected devices than in the development phase
The solution is usable by everybody involved
The solution is extensible enough to support all required business use cases

Reliable communication with hundreds or more devices

Successful IoT solutions typically face the following two challenges:

The number of devices that need to be maintained increases
The rate of adding new devices (and removing old ones) increases

From a certain point, it is no longer feasible to deal with these challenges using cloud-to-device communication on a per-device basis (e.g., updating the device twin for each individual device). Although some scripting skills of the solution operators might help, inherently tricky technical problems need to be approached in a more scalable way.

Scaling device twins

The scalability challenges will be demonstrated in a simple case of applying the same device twin update to many devices. Let’s assume a fleet of devices that have device twins that look similar to this:

{
  "desired": {
    "maxSpeed": 30
  },
  "reported": {
    "maxSpeed": 20
  }
}

There is a maxSpeed property for which an operator can set the value via the desired section, and the device can report the actual maxSpeed via the reported section. More on this example can be found in the previous part of the article: Remote management of IoT devices.
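To make the desired/reported split concrete, here is a minimal Python sketch that finds the properties for which a device has not yet caught up with the operator's intent. The function name pending_properties and the plain-dictionary representation of the twin are assumptions made for illustration; they are not part of any particular platform's API.

```python
from typing import Any

_MISSING = object()  # sentinel so that a missing reported value never equals a desired one


def pending_properties(twin: dict[str, dict[str, Any]]) -> dict[str, Any]:
    """Return desired properties whose reported value is missing or different."""
    desired = twin.get("desired", {})
    reported = twin.get("reported", {})
    return {
        name: value
        for name, value in desired.items()
        if reported.get(name, _MISSING) != value
    }


twin = {"desired": {"maxSpeed": 30}, "reported": {"maxSpeed": 20}}
print(pending_properties(twin))  # {'maxSpeed': 30}
```

An empty result would mean that the device has confirmed every value the operator asked for.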

In this example, let’s also assume that there are multiple versions of the device, and the specific version can be inferred from the device ID (the device ID is structured like this: device-xyz-v3.1).

Now the device operator is tasked with setting the maxSpeed property for all devices of version v3.1.

Naive approach - updating each device individually

A naive approach to this task could be to first list all devices, select the relevant ones based on version and then update the device twin for each device individually.

Using this naive per-device approach, the operator must send a request to the relevant cloud API for each item in the resulting list of devices.
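The naive approach could be sketched roughly as follows. FakeIoTClient is a hypothetical in-memory stand-in for a cloud API (a real client would issue one network request per call), and the method names list_device_ids and update_desired are assumptions made for this example only.

```python
from dataclasses import dataclass, field


@dataclass
class FakeIoTClient:
    """Hypothetical in-memory stand-in for a cloud API client."""
    twins: dict[str, dict] = field(default_factory=dict)
    requests_sent: int = 0

    def list_device_ids(self) -> list[str]:
        self.requests_sent += 1
        return list(self.twins)

    def update_desired(self, device_id: str, properties: dict) -> None:
        self.requests_sent += 1
        self.twins[device_id].setdefault("desired", {}).update(properties)


def update_version_naively(client: FakeIoTClient, version: str, properties: dict) -> list[str]:
    """List all devices, keep those whose ID ends with the version, update them one by one."""
    targets = [d for d in client.list_device_ids() if d.endswith(f"-{version}")]
    for device_id in targets:
        client.update_desired(device_id, properties)  # one request per device
    return targets


client = FakeIoTClient(twins={
    "device-abc-v3.1": {},
    "device-def-v3.0": {},
    "device-ghi-v3.1": {},
})
updated = update_version_naively(client, "v3.1", {"maxSpeed": 20})
print(updated, client.requests_sent)
```

In this run, the script sends three requests: one to list the devices and one for each of the two matching devices. The request count therefore grows with the size of the fleet.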

With some basic scripting skills, these tasks might be simple for smaller IoT solutions. However, with the growing number of devices, several issues arise:

The execution time of such operation/script grows together with the number of devices (linearly at best).
The reliability of such an operation/script decreases exponentially. Assuming a very nice SLO of the cloud API of 99.99% for each request, we should expect that with an increasing number of devices, our script has the following chance of problem-free execution:
• For 100 devices: 0.9999^100 ~= 0.9900 → 99 %
• For 1000 devices: 0.9999^1000 ~= 0.9048 → 90 %
• For 10000 devices: 0.9999^10000 ~= 0.3679 → 37 %
The accuracy/correctness of such an operation/script is dubious. What if a new device was registered right in the middle of listing the devices and executing the updates?

This reliability calculation is purely theoretical, and the real numbers can differ. However, the exponential relationship between reliability and the number of devices is the hard truth we must deal with.
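The figures above can be reproduced with a few lines of Python. The sketch assumes that requests fail independently of each other, which is a simplification, and the function name chance_of_clean_run is made up for this example.

```python
def chance_of_clean_run(per_request_success: float, device_count: int) -> float:
    """Probability that every one of `device_count` independent requests succeeds."""
    return per_request_success ** device_count


for n in (100, 1_000, 10_000):
    print(f"{n:>6} devices: {chance_of_clean_run(0.9999, n):.4f}")
```

Changing the first argument shows how sensitive the result is to the per-request success rate.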

Both the execution time and the reliability of the scripts can be improved by using parallelization techniques (sending multiple requests at once) and implementing retry policies (detecting and retrying failed requests with reasonable delays). However, with these improvements, the complexity of the scripting tools and the demands on developers skyrocket.
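As a rough illustration of what parallelization and retries add to a script, here is a minimal Python sketch using a thread pool and exponential backoff. The helper names with_retries, update_many, and flaky_update are hypothetical, and flaky_update only simulates a cloud API whose first call per device fails.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def with_retries(action: Callable[[], None], attempts: int = 3, base_delay: float = 0.01) -> bool:
    """Run `action`, retrying on any exception with exponentially growing delays."""
    for attempt in range(attempts):
        try:
            action()
            return True
        except Exception:
            if attempt < attempts - 1:
                time.sleep(base_delay * 2 ** attempt)
    return False


def update_many(device_ids: list[str], update_one: Callable[[str], None],
                max_workers: int = 8) -> dict[str, bool]:
    """Update devices in parallel and return a per-device success flag."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda d: with_retries(lambda: update_one(d)), device_ids)
        return dict(zip(device_ids, results))


calls: dict[str, int] = {}
lock = threading.Lock()


def flaky_update(device_id: str) -> None:
    """Simulated API call: the first attempt for each device fails."""
    with lock:
        calls[device_id] = calls.get(device_id, 0) + 1
        first_try = calls[device_id] == 1
    if first_try:
        raise ConnectionError("simulated transient failure")


outcome = update_many(["device-a-v3.1", "device-b-v3.1"], flaky_update)
print(outcome)
```

Even this small sketch already has to deal with thread safety, backoff delays, and partial failures, and it still does not address devices that are registered while the script is running.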

Scalable approach - updating all current and future devices at once

To deal with the challenge in a scalable, reliable, and correct way, an existing solution in the form of an IoT platform can be used. Such a platform should provide a simple way for operators to define not just a single target device but multiple devices at once. Also, the way of defining multiple target devices should be more intelligent than providing just a list (otherwise, we would still face the issue with accuracy/correctness).

The way to go here is for the IoT platform to accept configuration along the following lines:

{
  "targetCondition": "STARTS_WITH(device.name, 'model-v3.1')",
  "desiredProperties": {
    "speed": 20
  }
}

where the operator defines that the desired device twin property speed should be set on all devices whose names start with the string 'model-v3.1'. This configuration is persistently stored in the IoT platform and continuously evaluated.
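What "persistently stored and continuously evaluated" could mean in practice is sketched below in Python. FakePlatform and Configuration are hypothetical classes written for this illustration, and the name_prefix field is a simplified stand-in for the STARTS_WITH target condition; a real platform would parse and evaluate a richer condition language.

```python
from dataclasses import dataclass, field


@dataclass
class Configuration:
    name_prefix: str          # simplified stand-in for STARTS_WITH(device.name, ...)
    desired_properties: dict


@dataclass
class FakePlatform:
    configurations: list[Configuration] = field(default_factory=list)
    desired: dict[str, dict] = field(default_factory=dict)  # device name -> desired properties

    def _apply(self, config: Configuration, device_name: str) -> None:
        if device_name.startswith(config.name_prefix):
            self.desired[device_name].update(config.desired_properties)

    def register_device(self, device_name: str) -> None:
        self.desired[device_name] = {}
        for config in self.configurations:   # stored configurations also reach new devices
            self._apply(config, device_name)

    def add_configuration(self, config: Configuration) -> None:
        self.configurations.append(config)   # kept for devices registered later
        for device_name in self.desired:     # applied to all existing devices
            self._apply(config, device_name)


platform = FakePlatform()
platform.register_device("model-v3.1-001")
platform.register_device("model-v3.0-001")
platform.add_configuration(Configuration("model-v3.1", {"speed": 20}))
platform.register_device("model-v3.1-002")   # registered after the configuration was added
print(platform.desired)
```

The device registered last receives the desired property without any further action by the operator, which is exactly the case the naive script cannot handle.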

This approach has the following benefits:

The operator needs to send only a single request for all existing and future devices
The operator does not need to list all devices beforehand
The burden of optimizing execution speed and reliability is now on the IoT platform, which is in a much better position to deal with it

For more mature IoT solutions, there is typically also a need for multiple such configurations (e.g., for multiple device models and/or features). In such cases, it is essential that the IoT platform provides a deterministic and transparent way of applying multiple configurations to a single device and dealing with conflicts.
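One possible deterministic rule, shown here purely as an assumption and not as the behavior of any specific platform, is to give every configuration a numeric priority and let higher priorities overwrite lower ones, with the configuration ID as a tie-breaker. The field names id, priority, and namePrefix, as well as the logLevel property, are hypothetical.

```python
def resolve_desired(device_name: str, configurations: list[dict]) -> dict:
    """Merge all matching configurations; higher priority wins, ties are broken by id."""
    matching = [c for c in configurations if device_name.startswith(c["namePrefix"])]
    matching.sort(key=lambda c: (c["priority"], c["id"]))
    merged: dict = {}
    for config in matching:          # later (higher-priority) entries overwrite earlier ones
        merged.update(config["desiredProperties"])
    return merged


configurations = [
    {"id": "v3.1-limit", "priority": 10, "namePrefix": "model-v3.1",
     "desiredProperties": {"speed": 20}},
    {"id": "fleet-default", "priority": 0, "namePrefix": "model-",
     "desiredProperties": {"speed": 10, "logLevel": "info"}},
]

print(resolve_desired("model-v3.1-001", configurations))  # {'speed': 20, 'logLevel': 'info'}
print(resolve_desired("model-v2.0-001", configurations))  # {'speed': 10, 'logLevel': 'info'}
```

Because the result depends only on the stored configurations and not on the order in which they were created, an operator can predict the outcome for any device.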

Scaling messaging and remote procedure calls

The challenges and solutions to scaling cloud-to-device messaging and remote procedure calls (RPC) are similar to those for device twins:

The naive approach is not reliable
The set of target devices is not static

However, in practice, the need to send the same message and/or RPC to many devices is much smaller than for device twins. For RPCs in particular, it is very rare to have a use case that requires calling multiple devices simultaneously. Also, given the synchronous nature of RPCs and the exponential relationship between reliability and the number of devices, it becomes infeasible to call many devices at once and expect all calls to succeed, especially considering that some of the devices might be disconnected or have a poor connection at the time of the call.

IoT platforms and extensibility

IoT platforms are here to facilitate all kinds of cloud-to-device communication reliably. This applies to small solutions that communicate with each device individually and to large solutions that face scaling challenges.

An IoT platform should provide a rich set of interfaces for cloud-to-device communication that can support all typical roles/users of an IoT solution:

Graphical interface/UI - For device operators and/or end users who are not interested in a high degree of control and just need to complete some basic tasks.
Command-line interface (CLI) - For more tech-savvy device operators that manage a large and variable fleet of devices or for solution administrators responsible for platform configuration and its automation (e.g., via CI/CD and IaC principles). In general, any role that needs a higher degree of control and the possibility to automate non-trivial tasks will appreciate a good CLI.
API and SDK - For software engineers/developers who create custom applications on top of the IoT platform.

Suitable graphical and command-line interfaces should significantly save time and be safe (designed in a robust, non-error-prone way so that users are less likely to execute destructive actions unintentionally).

APIs (Application Programming Interfaces) and SDKs (Software Development Kits) are crucial for extensibility. Especially in the case of cloud-to-device communication, IoT platforms that do not provide a high-quality API cannot be easily extended with custom business use cases, which significantly decreases the value they can bring.

Example

Let’s assume a fictional company X that manufactures and sells sophisticated industrial devices/robots of arbitrary kind (it could be, for example, an electron microscope, milking robot, or harvester), and the following requirements apply:

The device contains non-trivial embedded software/firmware that needs to be updated.
The device has optional features that can be enabled remotely in some situations.
End customers can interact with their devices remotely via a custom web application.

Company X needs to build a software system to support this scenario. How do we approach such a situation? Here is where an IoT platform comes into play.

Let’s put aside topics related to connecting devices to the platform and focus on the matter at hand - cloud-to-device communication:

1. Assuming that the device is technically able to execute its software/firmware update, the IoT platform can provide tools for administrators to trigger or otherwise control the update procedure (e.g., via Device Twins).

Administrators can use the CLI to control and automate the procedure. No cloud/server-side development is required.
Depending on the number of devices, administrators might also use special features for updating Device Twins at scale (as discussed earlier).

2. Enabling optional features could also be implemented via Device Twins. However, in this case, we do not want to update all devices simultaneously. Instead, we want the support staff/device operators to update only specific devices when an end customer purchases the extra feature.

It would not be efficient to force support staff into using the CLI due to its higher complexity. Instead, we could give them access to the platform UI, where they can perform the necessary updates simply from a web browser.
Same as in the previous case, no custom software needs to be developed.

3. The custom web application will be implemented by company X’s software department. With a proper IoT platform, company X’s software engineers can simply use the platform’s API or SDK, do not have to deal with all the challenges discussed in this article, and focus on the added value of the end-customer application instead.

Key takeaways

In this second part of the article, we discussed challenges that surface when cloud-to-device communication is applied at scale:

Communicating with many devices individually is unreliable, slow, and complex.
Devices are typically added to IoT solutions continuously. Achieving proper and accurate device communication in such a scenario is non-trivial.
IoT platforms can hide a significant part of the complexity.
To harvest the full power of cloud-to-device communications, IoT platforms must provide various interfaces (UI, CLI, SDK, API).

Tomas Pajurek

CTO at Spotflow

Tomas is a software engineer at heart with a proven track record in architecting data-intensive systems, mainly for agritech, manufacturing, or biotech sectors, and leading engineering teams building those. He is deeply interested in the design of IIoT, stateful stream processing & distributed systems, software & platform architecture, resilience, cloud, and security. With his team of talented engineers, they apply this knowledge daily to ensure Spotflow is a product that our customers can fully rely on and enjoy using.
