TL;DR: Overlock builds on concepts from tools like Sentry and Rollbar, but takes a new approach to logging that suits distributed systems with persisted state. Overlock sits alongside your existing IoT infrastructure to capture problems as they happen.
Today I’m really proud to announce Overlock - the first exception tracking system designed specifically for distributed and heterogeneous systems.
What is Overlock?
We love Sentry and Rollbar, but neither has proven particularly well suited to developing IoT products, as we have been doing for several years at Zoetrope. The key enhancements we’ve added for distributed systems are:
- Gathering of state and breadcrumbs from multiple parts of a system - i.e. if an error happens on an end device, we also gather breadcrumbs from the gateway and web services.
- Only collecting information when events occur. Overlock only gathers information when there’s a problem, which saves a lot of data on constrained devices that may only have a cellular connection. (This is mainly an advantage over continuously shipping text logs.)
- Device-centric view. Overlock does not view exceptions as isolated events, because embedded devices tend to persist state over much longer periods of time (weeks or years) than the requests handled by regular tools designed for HTTP, which are very short-lived (milliseconds). Overlock gives you a complete device view with lifecycle events and exceptions.
Overlock lets developers quickly and easily add logging and exception tracking code to embedded Linux devices, cloud platforms and anything else running Linux, so that state and messages from all parts of a distributed system are gathered together automatically. Overlock builds an internal “association graph”, which allows exception data from multiple parts of a system to be pulled together when a problem occurs in one part of it.
Technically, how does it work?
Imagine a device in the field is throwing an exception caused by an edge case, let’s say a divide-by-zero error. When that error is thrown, it is logged to the local Overlock daemon and consequently sent back to the Overlock service, which then requests logging information from all other associated devices.
Overlock is platform agnostic, so you can use it to track exceptions from any of the major IoT platforms such as ThingWorx, Predix, etc.
This provides a means to gather exception data from all over your system without the need to continuously send gigabytes of data back to the cloud.
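To make the idea concrete, here is a minimal sketch of how an association graph could be used to decide whose logs to request when one device reports an error. It is not Overlock's implementation or API: the class, function and node names (`AssociationGraph`, `collect_report`, `sensor-17`, and so on) and the `max_hops` limit are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class AssociationGraph:
    """Undirected graph linking parts of a system that talk to each other.

    Hypothetical structure for illustration only.
    """

    def __init__(self):
        self._edges = defaultdict(set)

    def associate(self, a, b):
        # Record that two nodes (device, gateway, web service) are related.
        self._edges[a].add(b)
        self._edges[b].add(a)

    def associated_with(self, node, max_hops=2):
        """Return every node reachable from `node` within `max_hops` hops."""
        seen = {node}
        queue = deque([(node, 0)])
        while queue:
            current, hops = queue.popleft()
            if hops == max_hops:
                continue  # do not expand beyond the hop limit
            for neighbour in self._edges.get(current, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, hops + 1))
        seen.discard(node)
        return seen


def collect_report(graph, origin, error, recent_logs):
    """Bundle an error with recent logs from the origin and its associates."""
    nodes = [origin] + sorted(graph.associated_with(origin))
    return {
        "origin": origin,
        "error": error,
        "breadcrumbs": {n: list(recent_logs.get(n, [])) for n in nodes},
    }


# Example: an end device behind a gateway that talks to a web service.
graph = AssociationGraph()
graph.associate("sensor-17", "gateway-3")
graph.associate("gateway-3", "web-api")

recent_logs = {
    "sensor-17": ["sample count was 0"],
    "gateway-3": ["forwarded empty batch from sensor-17"],
    "web-api": ["received empty batch"],
}

report = collect_report(
    graph, "sensor-17", "ZeroDivisionError: division by zero", recent_logs
)
```

In this sketch nothing is sent from the gateway or the web service until the sensor reports a problem; only then are the associated nodes looked up and their recent logs pulled into a single report.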
From speaking with developers we’ve found that the most common solutions to error tracking at the moment are to either:
- Send back all logs, including info/debug, to a central platform, e.g. Papertrail, LogDNA or an ELK stack.
- Store logs on the devices and allow logging in over SSH (normally via a VPN).
We see both of these options as compromises: to have enough information to solve problems, a huge amount of logging needs to be done. Overlock addresses the issues with both by keeping a small in-memory cache of logs on the device and sending that cache only if a suitably severe log event occurs.
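The caching approach can be sketched with Python's standard `logging` module. This is an illustrative handler, not Overlock's client library: the name `BufferedFlushHandler`, its `send` callback, and the default capacity and severity threshold are assumptions.

```python
import logging
from collections import deque


class BufferedFlushHandler(logging.Handler):
    """Keep the last `capacity` log lines in memory; send them on a severe event.

    Hypothetical handler for illustration only.
    """

    def __init__(self, send, capacity=200, flush_level=logging.ERROR):
        super().__init__()
        self._send = send                        # callable that ships a batch
        self._buffer = deque(maxlen=capacity)    # oldest entries fall off
        self._flush_level = flush_level

    def emit(self, record):
        self._buffer.append(self.format(record))
        if record.levelno >= self._flush_level:
            # A severe event occurred: ship the cached context, then reset.
            batch = list(self._buffer)
            self._buffer.clear()
            self._send(batch)


# Example: `sent` stands in for the network upload.
sent = []
log = logging.getLogger("device")
log.setLevel(logging.DEBUG)
log.propagate = False
log.addHandler(BufferedFlushHandler(sent.append, capacity=3))

for i in range(5):
    log.debug("reading %d", i)   # cached locally, nothing is sent

try:
    1 / 0
except ZeroDivisionError:
    log.error("failed to compute average")   # triggers the upload
```

With a capacity of three, only the last two debug lines and the error itself are sent; the earlier readings are dropped on the device and never use any bandwidth.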
In addition, Overlock captures what we’re calling “lifecycle events” of devices, such as a factory reset. These are logged to a device entity along with all the log messages, giving an even richer picture of what has happened to that device. This can be crucial when trying to pin down problems to a particular set of circumstances.
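As a rough illustration of a device entity that holds both lifecycle events and log messages, consider the sketch below. The `Device` and `Event` types, their fields, and the "powered on" event are hypothetical; only the factory reset example comes from the description above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    timestamp: float   # seconds since the epoch
    kind: str          # "lifecycle" or "log"
    message: str


@dataclass
class Device:
    """Hypothetical device entity holding one combined event history."""

    device_id: str
    events: list = field(default_factory=list)

    def record(self, timestamp, kind, message):
        self.events.append(Event(timestamp, kind, message))

    def timeline(self):
        # Events may arrive out of order, so sort by timestamp on read.
        return sorted(self.events, key=lambda e: e.timestamp)

    def last_lifecycle_before(self, timestamp):
        """Most recent lifecycle event at or before `timestamp`, or None."""
        candidates = [
            e for e in self.timeline()
            if e.kind == "lifecycle" and e.timestamp <= timestamp
        ]
        return candidates[-1] if candidates else None


device = Device("sensor-17")
device.record(160.0, "log", "ZeroDivisionError: division by zero")
device.record(50.0, "lifecycle", "powered on")       # illustrative event
device.record(100.0, "lifecycle", "factory reset")

context = device.last_lifecycle_before(160.0)
```

Here `context` is the factory reset, which suggests the error may be tied to state that was cleared shortly before it occurred.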
Overlock has been around three years in the making, starting with our experiences of the first connected products we took to market. Over the intervening years we’ve continued to add more connected products, generate ever more logs and, I’m not ashamed to say, fix our fair share of bugs.
This is where the idea of a better way to track exceptions and logs through the whole stack came from, and we started building Overlock as an internal tool.
We’re now opening Overlock up to the IoT community in a private beta. We’ll be opening up a full SaaS product during Q1 2018.
We’re looking to work with companies that are making IoT products and services and want to reduce the amount of time they spend debugging complex systems. Get in touch!