Autonomous Data Bots
Today we are at a place where it's practical to deploy autonomous data bots on edge computing to collect data and make intelligent decisions. We have been incrementally moving toward this goal over the past decade. This post delves into the stages in the evolution toward autonomous data bots:
- Remote Agents, HTTP Tunneling & Long Polling
- Self-Upgrading Programs
- WebSockets, SSE, and HTTP/2 Server Push
- Self-Sustaining Bots
The data bots are autonomous: they manage their own lifecycle while running on edge networks, and communicate with central cloud servers for data and knowledge exchange. Since the data bots have secure access to corporate data, they are responsible for real-time data ingestion to the cloud, into a central data lake. The central data lake is used by machine learning models to build knowledge, and the knowledge service is accessible to bots and other clients. In addition, a relay service running in the cloud serves as the communication mechanism for data bots that are spread out geographically.
Variants of data collectors, commonly referred to as agents, have been around for almost a decade. These agents were conceived when the first generation of SaaS applications gained traction and needed a mechanism to securely access data from behind corporate firewalls. The basic requirement is to allow SaaS apps running in the public cloud to access data behind the corporate VPN without punching holes in the corporate firewall. These agents were based on a master-slave architecture, with the webapp in the cloud being the master and the agent the slave. There was no intelligence or autonomy in the agent.
A simple design for the agent architecture is based on HTTP tunneling. It uses the POST method instead of CONNECT, as intermediaries and security inspectors do not play nice with the CONNECT method. The remote agent (slave) is the HTTP client and always initiates the request, so there is no need to open firewall ports; the webapp in the cloud is the HTTP server that services the requests. The agent continuously polls the server for the next “command”, and the server may respond with a command or a no-op. The server (master) assigns commands to the agent (slave); the agent executes these commands and sends results and monitoring information back to the server. The agent protocol describes all the commands and uses a plugin framework to extend them.

Since the agent is continuously polling the server, this can get quite expensive and wasteful; a simple optimization is to use long polling. The agent has a forever loop issuing HTTP requests to fetch the next command from the server. If the server has a command for the agent, it responds immediately with the details; otherwise it waits until the next command becomes available in the queue or a timeout is reached. Stretching out the response duration this way avoids sending a flood of no-ops. The HTTP handshake is expensive, and reusing the TCP connection saves valuable resources. The server uses HTTP persistent connections with the keep-alive header, and always responds back in finite time so that network timeouts are not triggered by the client or intermediaries. An interesting observation here: if you try to keep the same HTTP connection alive for many hours, you'll start seeing unexpected network breakages, because intermediaries set a max-life on each TCP connection and will recycle it.
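To make the loop concrete, here is a minimal sketch of the long-polling agent in Go. The endpoint URL, the Command shape, and the timeout values are illustrative assumptions, not part of any specific agent protocol:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// Command is a hypothetical shape for what the master returns;
// a real agent protocol would define this per plugin.
type Command struct {
	ID   string            `json:"id"`
	Name string            `json:"name"` // "no-op" when the long-poll window expired with nothing to do
	Args map[string]string `json:"args"`
}

func main() {
	// Keep-alive is on by default in net/http, so the TCP connection is
	// reused across polls. The client timeout must be longer than the
	// server's long-poll window, or we will cancel healthy requests.
	client := &http.Client{Timeout: 90 * time.Second}

	for { // forever loop: fetch next command, execute, repeat
		resp, err := client.Get("https://master.example.com/agent/next-command")
		if err != nil {
			// Intermediaries recycle long-lived TCP connections;
			// treat breakage as routine and retry after a pause.
			log.Printf("poll failed: %v", err)
			time.Sleep(5 * time.Second)
			continue
		}

		var cmd Command
		err = json.NewDecoder(resp.Body).Decode(&cmd)
		resp.Body.Close() // must close the body so the connection can be reused
		if err != nil || cmd.Name == "no-op" {
			continue // server had nothing for us within its window
		}

		// Execute the command and report results/monitoring back to the
		// master. (Execution and the results endpoint are elided here.)
		log.Printf("executing command %s: %s", cmd.ID, cmd.Name)
	}
}
```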
The SaaS application runs in the cloud but the agent runs remotely on the edge, which poses the challenge of how to manage the agent lifecycle. Anti-virus software has dealt with this problem for quite some time. We can solve it with a very small and stable shell program that spawns and manages the agent. The shell process has a forever loop that monitors the status of the agent and restarts it if the agent goes down. This provides basic resiliency to ensure the agent is “always on”. The shell program is set up as a Windows service, macOS LaunchDaemon, or Linux service.

The shell program periodically checks for the availability of upgrades; if it finds one, it downloads it securely and initiates a shutdown of the agent process. If the agent has advanced state management, it can go into quiesce mode to ensure zero message/action loss at the cost of delayed execution. After the agent has completed a graceful shutdown, the shell program initiates the upgrade, which takes care of upgrading the persistent state of the agent (if any) and the binary bits. If the upgrade is successful, the shell program restarts the agent. If the upgrade fails for some reason, the shell program does a rollback and restarts the agent.

It's important to keep the lifecycle steps performed by the shell program as simple and clean as possible; the robustness and resiliency of the agent depend on them working flawlessly all the time. The goal is for the agent to be a remote process that requires zero human intervention or supervision and is robust enough to upgrade itself. The KISS (Keep It Simple, Stupid) principle serves us well in the design of the agent. State management can get complex very quickly; ideally the agent should be stateless and able to initialize from scratch. If the agent has persistent state, ensure the data models work with additive attributes and NoSQL-ish storage for interoperability. The agent needs to be portable and small in size, so that it can be downloaded efficiently even on slow networks, and have a pluggable architecture to add capabilities progressively.
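A sketch of the shell program's supervision loop in Go, under the assumption that upgrades arrive through some secure channel. The install path and the upgradeAvailable/applyUpgrade/rollback helpers are hypothetical placeholders:

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// Placeholders for the secure upgrade channel; a real shell program
// would verify signatures before applying anything it downloads.
func upgradeAvailable() bool { return false }
func applyUpgrade() error    { return nil }
func rollback()              {}

// Real implementation: signal the agent to quiesce, then wait for a
// graceful exit; kill is just the sketch's stand-in.
func requestGracefulShutdown(agent *exec.Cmd) { _ = agent.Process.Kill() }

func main() {
	for { // forever loop: the agent must be "always on"
		agent := exec.Command("/opt/agent/bin/agent") // hypothetical install path
		if err := agent.Start(); err != nil {
			log.Printf("start failed: %v", err)
			time.Sleep(10 * time.Second)
			continue
		}

		done := make(chan error, 1)
		go func() { done <- agent.Wait() }()

		ticker := time.NewTicker(1 * time.Hour) // periodic upgrade check
	supervise:
		for {
			select {
			case err := <-done:
				log.Printf("agent exited: %v; restarting", err)
				break supervise // crash: fall through to restart
			case <-ticker.C:
				if upgradeAvailable() {
					requestGracefulShutdown(agent)
					<-done // wait for graceful shutdown
					if err := applyUpgrade(); err != nil {
						rollback() // restore the previous binary and state
					}
					break supervise // restart the (new or rolled-back) agent
				}
			}
		}
		ticker.Stop()
	}
}
```

Keeping this loop tiny is what makes the KISS goal achievable: the shell binary itself should almost never need to change.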
As discussed in the previous section, one of the key communication paths is the cloud master server sending information, actions, and updates to the bots running on edge networks. The earlier implementations used HTTP/1.1 with long polling, which worked but was non-ideal. Now that it has been 7+ years since the WebSocket protocol was standardized as RFC 6455 in 2011, there is widespread support for it in corporate networks. Server-Sent Events (SSE) from HTML5 also has widespread support on the server side and in all browsers except IE. Given that server-to-bot communication uses an embedded HTTP client, SSE is a viable implementation for the bot protocol. The comprehensive HTTP/2 specification that includes server push was published in 2015, and three years later there is decent support for implementing an HTTP/2-based server in Node.js, Go, Jetty, Nginx, etc. All the modern browsers have HTTP/2-compliant clients as well. There is a lot to like in HTTP/2, including multiplexed streams, header compression, binary framing, and server push.

Given the overall performance benefits of HTTP/2, it makes sense to go with it, which rules out WebSockets: WebSockets uses the upgrade header on top of HTTP/1.1 and was left out of HTTP/2, superseded by server push. So the first inclination is to pick HTTP/2 server push for implementing the bot protocol. But HTTP/2 server push is primarily designed to push resource files (CSS, JS, images, and other assets) asynchronously, to allow for progressive loading of HTML pages in the browser. That is where it gets its maximum performance, proactively supplying the dependencies of an HTML page. HTTP/2 server push gets harder to use if you want to implement a generic, long-lived, two-way communication stream with remote clients: you have to do quite a lot of context and state management on the server side to keep track of bot streams, and also figure out the flush intervals. It just seems like the wrong choice for implementing the bot protocol.

So the fallback choice is SSE over HTTP/2; both Node.js and Golang have decent support for implementing SSE with HTTP/2. The lifecycle of the HTTP/2 stream is handled nicely, and we are left to handle timeouts. We end up with an elegant and resilient solution that does efficient resource management on the cloud server. In the code snippets below the server is implemented in Node.js and the bot client is in golang.
Client configuration in golang for SSE (a sketch follows this list):
- Set the value of the HTTP header “Accept” to “text/event-stream”.
- The server will send events in text format, delimited by newlines.
- It is common practice to use JSON-style name-value pairs for id and data.
- The client has a forever loop that is always waiting on messages/events from the server.
- The client reads an event from the SSE stream, which usually corresponds to an action, and fires a goroutine to service that action.
- Goroutines that implement actions do their work and then send the status back to the server through a callback REST endpoint.
- The forever loop handles timeout errors and network interruptions. In case of failure, it re-issues the HTTP request and the server sets up a new SSE stream on it.
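Putting the list above together, a minimal sketch of the bot-side loop in golang. The relay URL, the bot ID, the “data:” field parsing, and the callback endpoint are illustrative assumptions:

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"strings"
	"time"
)

// Event mirrors the JSON-style id/data pairs sent by the server.
type Event struct {
	ID   string `json:"id"`
	Data string `json:"data"`
}

// postStatus reports action status back on the hypothetical callback endpoint.
func postStatus(eventID, status string) {
	body, _ := json.Marshal(map[string]string{"eventId": eventID, "status": status})
	http.Post("https://relay.example.com/callback", "application/json", bytes.NewReader(body))
}

func main() {
	for { // forever loop: re-connect after timeouts or network breaks
		req, _ := http.NewRequest("GET", "https://relay.example.com/stream?bot=bot-42", nil)
		req.Header.Set("Accept", "text/event-stream") // ask the server for an SSE stream
		// DefaultClient has no timeout, which suits a long-lived stream.
		resp, err := http.DefaultClient.Do(req)
		if err != nil {
			log.Printf("stream failed: %v", err)
			time.Sleep(5 * time.Second)
			continue
		}

		scanner := bufio.NewScanner(resp.Body)
		for scanner.Scan() { // events arrive as newline-delimited text
			line := scanner.Text()
			if !strings.HasPrefix(line, "data: ") {
				continue // skip SSE comment and blank separator lines
			}
			var ev Event
			if err := json.Unmarshal([]byte(strings.TrimPrefix(line, "data: ")), &ev); err != nil {
				continue
			}
			go func(ev Event) { // fire a goroutine to service the action
				// ... perform the action described by ev.Data ...
				postStatus(ev.ID, "done")
			}(ev)
		}
		resp.Body.Close() // scanner stops on timeout/interruption; loop re-issues the request
	}
}
```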
Server configuration in Node.js for SSE (a sketch follows this list):
- Middleware frameworks like Express.js and Koa.js want to control the lifecycle of the request and don't play nice with long-lived SSE sockets.
- You need to write a custom HTTP request handler in Node.js to service SSE requests.
- Check the incoming request headers; if “Accept” equals “text/event-stream”, turn off compression and don't close the response output stream.
- The Node server functions as a relay server: it accepts actions for a given bot via a REST API and relays each action to the bot over the SSE stream, which is always open.
- When the server receives an initiation request from a bot, it responds with an ack and keeps the response stream open. It stores the bot ID and a reference to the response output stream in a cache. This response object is used to send events to the bot as and when they arrive in the future.
- The server has a separate “relay” REST endpoint that can be used to send events to a given bot. External clients or micro-services within the app invoke this endpoint to send events to the bot.
- When the server receives a request on the “relay” endpoint, it identifies the bot for that request, looks up the response output stream in the cache, and writes the event to the stream. Every event received by “relay” is given a unique eventID; the request is suspended and the response handle is stored in the cache.
- The bot executes the action and sends the status back to the “callback” endpoint with the eventID. The server retrieves the event data structure from the cache, revives the suspended “relay” response, and writes the status back to it.
- Node.js serves as an efficient relay server that can handle relay requests from all sorts of clients, get them processed by a distributed set of agents spread over different edge networks, and respond back with the status. From the caller's perspective, all the request processing is synchronous.
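A condensed sketch of such a relay server in Node.js (shown over plain HTTP for brevity; a production server would run HTTP/2 with TLS). The endpoint paths, the in-memory caches, and the event format are illustrative assumptions:

```javascript
const http = require('http');
const crypto = require('crypto');

const botStreams = new Map();    // botId -> open SSE response stream
const pendingRelays = new Map(); // eventId -> suspended "relay" response

const server = http.createServer((req, res) => {
  const url = new URL(req.url, 'http://localhost');

  // Bot initiation: ack, keep the response stream open, and cache it.
  if (req.headers.accept === 'text/event-stream') {
    res.writeHead(200, {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      'Connection': 'keep-alive',
    }); // core http applies no compression, so events flush as written
    botStreams.set(url.searchParams.get('bot'), res);
    res.write(': ack\n\n'); // SSE comment line as the ack; stream stays open

  // Relay endpoint: forward an action to a bot and suspend the response.
  } else if (url.pathname === '/relay' && req.method === 'POST') {
    const stream = botStreams.get(url.searchParams.get('bot'));
    if (!stream) { res.writeHead(404); return res.end('unknown bot'); }
    const eventId = crypto.randomUUID();
    pendingRelays.set(eventId, res); // suspend until the bot calls back
    let body = '';
    req.on('data', (c) => (body += c));
    req.on('end', () =>
      stream.write(`data: ${JSON.stringify({ id: eventId, data: body })}\n\n`));

  // Callback endpoint: revive the suspended relay response with the status.
  } else if (url.pathname === '/callback' && req.method === 'POST') {
    let body = '';
    req.on('data', (c) => (body += c));
    req.on('end', () => {
      const { eventId, status } = JSON.parse(body);
      const relayRes = pendingRelays.get(eventId);
      pendingRelays.delete(eventId);
      if (relayRes) { relayRes.writeHead(200); relayRes.end(status); }
      res.writeHead(200); res.end('ok');
    });
  } else {
    res.writeHead(404); res.end();
  }
});

server.listen(8080);
```

In a real deployment the caches would need eviction for bots that disconnect, and the suspended relay responses would need timeouts so callers are never held open forever.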
Using the design laid out in self-upgrading programs as the building block, it's possible to build autonomous bots that reside in edge networks and function independently. The bots use the relay server to communicate with each other, which sets up a distributed system of bots that can interact and exchange information. These bots have data collection, harvesting, and processing capabilities. They also serve as data collectors, loading this data into the central data lake in the cloud that is used for building knowledge. Machine learning models can be applied to this central data lake, and the learnings can be pushed down to the bots asynchronously. The bots are self-sufficient and manage their own lifecycle; they perform data processing and actions locally within the edge network and keep getting smarter over time as they continuously absorb learnings from the central cloud servers.