The state of logging in IoT
This post covers some of what the Overlock team have learned while building our own production IoT systems, along with insights from conversations with other developers and from the process of building Overlock itself. It outlines the three most common approaches to logging, along with the pros and cons of each.
We’ll talk about:
- No logging
- Logging on the IoT device to its own file system
- Sending logs off to a cloud service
Context: what do we mean by logging?
We mean the ability to get one or more of:
- Log messages added by a developer which indicate the state of a program as it runs
- Information about the state that the device was in when the problem was observed (either from logs or other means)
- Any traceback/stacktrace/coredump which indicates what the problem was
The common thread across all of these sources is information that helps you answer: "was it a software bug or a hardware issue?", "what was the root cause?" and "how do I fix it?".
Without further ado, here are the three main ways of logging we’ve come across. If you do something different altogether, we’d love to hear about it!
No logging
We’ll keep this one brief. Were it not for the Overlock team having uncovered multiple instances of IoT deployments made with no logging whatsoever, we probably wouldn’t have included this section. On the other hand, sometimes it really does seem like the only option.
This describes a situation where the only way to debug a problem is through user feedback when something goes wrong, by attempting to reproduce the problem in a lab, or by physically visiting the devices in the field so you can see the problem yourself. This typically happens either when your end nodes are really small microcontrollers, or when you have no way to get information back from them - e.g. they only transmit data as iBeacon base station names or through NFC - or sometimes there’s just no space in a payload for debug information.
This can be the only option with technologies such as Sigfox or other LPWA networks which only permit sending of very small amounts of data. For example, a Sigfox radio can only broadcast 12 bytes of user data per packet and only 140 messages per day (i.e. 1.6KB/day peak throughput). In these cases it feels like you’re on your own.
Pros:
- No logging means minimal complexity, flash wear or data usage
- No worries about leaking user data

Cons:
- You have no oversight of what’s going wrong, nor how common it is
- There is no information to help you reproduce or debug a problem when a user complains
If you’re using an LPWA network, iBeacons or another extremely constrained network without a gateway node, this may be your only option.
However, even for a machine that simple, we think a very small amount of data is enough to indicate the state of the device. We’ll follow up with an article on how even a single byte can be used to indicate device state, adding barely any data usage overhead.
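As a sketch of that single-byte idea (the bit layout and names here are our own invention for illustration, not a standard), a handful of boolean flags and a small error code can be packed into one status byte:

```python
# Hypothetical single-byte status word: three flag bits plus a 5-bit error code.
# Bit layout (an assumption for illustration):
#   bit 7: battery low
#   bit 6: sensor fault
#   bit 5: rebooted since last report
#   bits 0-4: last error code (0 = no error, up to 31)

def pack_status(battery_low, sensor_fault, rebooted, error_code):
    if not 0 <= error_code <= 31:
        raise ValueError("error_code must fit in 5 bits")
    return (bool(battery_low) << 7) | (bool(sensor_fault) << 6) \
        | (bool(rebooted) << 5) | error_code

def unpack_status(byte):
    return {
        "battery_low": bool(byte & 0x80),
        "sensor_fault": bool(byte & 0x40),
        "rebooted": bool(byte & 0x20),
        "error_code": byte & 0x1F,
    }
```

One such byte per uplink message fits comfortably even inside a 12-byte Sigfox payload.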
Logging to the Device
For devices which have a reasonable amount of flash memory, such as a more powerful microcontroller or an embedded Linux system, many developers opt for storing logs directly on the device. Logs are then accessed by logging in to the device remotely.
This method provides a means to store a relatively large amount of data without having to move that data off the device, which saves on bandwidth costs and means you don’t have to store all that data on the cloud as your product scales. An example of a setup like this from the Resin blog shows how to set up local logging on a Resin device which essentially uses the same setup as server admins have been using for years to track log messages from their applications.
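A minimal version of that kind of on-device logging in Python might use a rotating file handler to keep total storage bounded (the file name and size limits here are illustrative; on a real gateway the path would typically live under `/var/log`):

```python
import logging
from logging.handlers import RotatingFileHandler

# Cap on-device logs at roughly 5 files x 512 KB so storage use stays
# bounded no matter how long the device runs (sizes are illustrative).
handler = RotatingFileHandler("device.log", maxBytes=512 * 1024, backupCount=4)
handler.setFormatter(
    logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
)

log = logging.getLogger("device")
log.setLevel(logging.INFO)
log.addHandler(handler)

log.info("sensor read ok, temp=%.1fC", 21.5)
```

This is essentially the same rotation idea that syslog/logrotate give you on a server, just done in-process.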
As a variation on this method, we’ve spoken to companies who opt to store only a minimal amount of data in the logs - normally just exception-level logs - at which point they are notified remotely that something went wrong. This is normally a trigger for them to enable debug logging on the device so that they can capture all the logging information the next time it goes wrong.
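The "quiet by default, escalate after a failure" pattern can be sketched like this (the function names and the alerting call are placeholders we invented for illustration):

```python
import logging

log = logging.getLogger("app")
log.addHandler(logging.StreamHandler())
log.setLevel(logging.WARNING)  # quiet by default to limit flash writes

def notify_backend(message):
    # Placeholder for a real alerting call (e.g. an HTTPS POST to your backend).
    print("ALERT:", message)

def risky_operation():
    # Stand-in for real device work; fails here to demonstrate escalation.
    raise RuntimeError("sensor timeout")

def run_task():
    try:
        risky_operation()
    except Exception as exc:
        log.exception("task failed")   # ERROR level, logged even when quiet
        notify_backend(str(exc))
        log.setLevel(logging.DEBUG)    # capture full detail on the next failure

run_task()
```

In a real deployment the escalation would more likely be commanded remotely after the alert fires, rather than happening automatically on-device.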
Pros:
- This is a very scalable way to store the data, because the available storage grows with the number of devices
- You can get clever with enabling more verbose logging once an issue occurs, given the elevated chance of it happening again
- Minimal bandwidth consumption, because logs mostly stay on the device

Cons:
- Many of the developers we’ve spoken to suffer from ‘flash wear anxiety’ (a term we’ve coined): the fear that logging too much to an SD/MMC flash will wear out the blocks more quickly and shorten the life of the card
- The ability to log in to a device remotely opens a potential attack vector. This can be mitigated with a VPN (as Resin does, for example), but that can be an expensive solution
- Logs are only available if the device is online to log in to. You get no data if the device is off, has crashed or is out of range
If you are logging directly to the device and you are able to log in remotely, it’s important that you run this over a VPN and restrict logins to connections coming through the VPN. Of course, unique logins and other security hygiene measures are really important too.
Flash wear concerns are sometimes legitimate, but developers often seem overly worried about the risk. Cheap knock-offs aside, all reputable SD card manufacturers include built-in, transparent wear leveling, which greatly extends the lifetime of the flash by remapping storage on the fly and preventing too many writes to the same sections.
Sending alerts when something goes wrong is a good idea; this can be done using a service like PagerDuty or similar.
Logging to a cloud service
Disclosure: this is generally the method that the Overlock team has preferred to use on our IoT projects, especially for gateways.
This method of logging describes a situation where logs are sent directly from the device to a cloud service which stores, aggregates and optionally alerts to changes in the data.
Generally this means sending a stream of logs, normally over a socket connection to a remote service. This may be a self-hosted solution such as Logstash, or a cloud hosted solution such as Papertrail or LogDNA.
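A minimal sketch of that log-shipping step, using Python’s standard syslog handler (the gateway name is invented, and `127.0.0.1:514` is a placeholder - in production you would point this at your Logstash/Papertrail/LogDNA endpoint, most of which also accept TCP/TLS rather than plain UDP):

```python
import logging
from logging.handlers import SysLogHandler

# UDP syslog to a placeholder collector address; swap in your real endpoint.
handler = SysLogHandler(address=("127.0.0.1", 514))
handler.setFormatter(
    logging.Formatter("gateway-01 app: %(levelname)s %(message)s")
)

log = logging.getLogger("uplink")
log.setLevel(logging.INFO)
log.addHandler(handler)

log.info("boot complete, fw=1.4.2")  # one UDP datagram per log line
```

Note that UDP gives no delivery guarantee, which is exactly why the buffering problem discussed later matters.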
For the purposes of IoT logging and exception tracking, we also classify APM (Application Performance Monitoring) tools, such as New Relic and AppDynamics, as falling into this category. APM can be thought of as a more advanced logging system: tracing data tells the developer how long each function call takes and allows logging against each of these call ‘spans’. This naturally generates a huge amount of trace data, so most developers sample only a small fraction of running instances to limit the overall cost and volume of data. These tools also typically have per-host pricing aimed at servers, which is generally prohibitively expensive for IoT systems where there are many hosts.
APM aside, the volumes of data generated are still rather large. A real-world example from a project the Overlock team worked on had approximately 5-10x more logging data at the INFO level than actual IoT data. Had debug logs been included as well, logging data would have been approximately 50x larger than the payload.
Once the data from these devices is uploaded, we can be alerted to problems as they occur and be more proactive in hunting down bugs in the IoT deployment. Since all the logs are in one place, we can also find common problems and more quickly identify why some devices are breaking whilst others are not. This is the most powerful aspect of getting logs into an indexed log store.
We’re also able to search logs from devices which are not currently connected. On occasion the logs do contain useful information which helps us identify why a device is no longer connected, e.g. the signal was getting weaker or the user triggered a factory reset.
One of the biggest downsides to this method is ensuring that logs can be cached on the device when there’s no Internet connection. Most logging agents provide some support for this, but they’re generally designed for use on servers and expect /var/log to be writable. On IoT gateways this is often a RAM disk, which causes further problems.
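The caching problem can be handled with a simple store-and-forward spool. Here is a minimal sketch (the class and file names are our own; a production agent would also need a size cap, an fsync policy, and partial-drain handling so a mid-flush disconnect doesn’t lose or duplicate lines):

```python
import os

class LogSpool:
    """Append log lines to a spool file while offline; drain it when the
    uplink returns. Illustrative sketch only - no size cap or fsync policy."""

    def __init__(self, path="log.spool"):
        self.path = path

    def append(self, line):
        # Buffer one log line on local storage.
        with open(self.path, "a") as f:
            f.write(line.rstrip("\n") + "\n")

    def drain(self, send):
        # send() should raise on failure so undelivered lines are kept.
        if not os.path.exists(self.path):
            return 0
        with open(self.path) as f:
            lines = f.read().splitlines()
        for line in lines:
            send(line)
        os.remove(self.path)
        return len(lines)
```

On a read-only-root gateway the spool path itself has to live on a writable volume, which loops back to the /var/log problem above.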
Pros:
- Logs are indexed and searchable
- Logs are available even if the device is offline or you are unable to log in to it remotely
- It’s far easier to set up alerts on problems, including integrations with many other tools, rather than rolling your own alerting

Cons:
- High cost! This depends on the log volume per host, of course, but it’s a drawback of using a system designed for servers on an IoT deployment
- Higher bandwidth usage - easily 5-10x more logs than business data
- Dealing with unreliable internet connections can be tricky
Logging to the cloud has the most utility if alerting and aggregation across devices is possible, so do be sure to make the best use of that.
Cloud logging is best used in scenarios where there’s likely to be a Wi-Fi or stable cellular connection. This method involves transferring high volumes of data and suffers when the internet connection is inconsistent or known to be unavailable for long periods of time.
In IoT, a common best practice is to make the root file system on the device read-only, to make the device more resilient. This likely means mounting a small RAM volume on /var/log, to minimize flash wear and avoid corruption of files if there’s an unexpected power loss during a write.
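On an embedded Linux system, that RAM volume is typically a tmpfs entry in /etc/fstab along these lines (the 16 MB size cap is illustrative; pick one that fits your RAM budget):

```
tmpfs  /var/log  tmpfs  defaults,noatime,size=16m  0  0
```

Anything written there is lost on reboot, which is another reason to ship important logs off the device rather than rely on local copies.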
We’ve looked at not logging at all, logging to the device with remote access, and logging to the cloud. Logging allows developers to get information from IoT devices, allowing them to monitor and track down the root cause of problems.
We’ve described methods that we’ve seen and used ourselves, and would love to hear of any different setups which others may have used. Over our conversations with a dozen or so developers, most have opted for logging to the device. The most common reason is that logging was a last-minute addition when the device was about to go into the field, rather than something considered important from the get-go.
At Overlock we’ve run into exactly this issue, and we believe that there’s currently no genuinely good logging and debugging solution which covers the unique challenges of IoT: a large, often complex network of low-power devices, often with unreliable or slow data connections, limited storage space and a wider variety of potential problems. That’s why we’re working on a set of tools to address this. There’s more information about that in our introduction to Overlock post.