Opkit, a New DevOps Bot for Slack: Optimize Communications for Performance and Reliability

18 Oct

Introduction

Automating time-consuming and error-prone processes has led to many of the gains we’ve seen in developer efficiency. For example, continuous integration and deployment tools, like Jenkins and Travis, have greatly reduced production downtime by automating the building, testing, and deployment of apps; by eliminating points of failure, or reducing the risk of failure at those points, we can develop better products. Similar gains can be had by bringing automation into our communication channels; by automating the most error-prone parts of communication, we can improve efficiency.

Another key paradigm shift, DevOps, or the deep integration of development and operations staff through shared collaboration and communication channels, has also improved the quality and reliability of many apps. Key to DevOps is the constant maintenance of a communications channel between the development team and the operations team; it’s this communications channel that we’ve (partially) automated, by developing a DevOps Slack bot called Catbot, running on a custom, DevOps-optimized framework called Opkit. We’ve open-sourced the framework and many extensions here (https://www.npmjs.com/package/opkit-example), all pre-packaged and deployable with one click. Read on to learn about our motivations and to get a brief introduction to hacking on Opkit.

Motivation

We have a twenty-four-hour Network Operations Center, or NOC, that works in tandem with development and customer support teams across all of our products. Our core communications channel is Slack, and we maintain a number of channels within our Slack team. All this communication leads to a better product being delivered in a better way, but it can take a significant amount of time to locate and share information. In many cases, such as performance optimization, information has to be gathered from another page or tool and only then relayed to the Slack channel. This adds another point of failure; anyone who played the elementary-school game “telephone” knows just how much a message can change when it passes through even one other person.

Having to wait for someone to fetch a CloudWatch alarm state or trace a call on our internal tools is bad enough, let alone running the risk that the information comes back with a typo or misunderstanding. That’s why we built Catbot: to automate common DevOps functions, in much the same way that a CI tool automates builds or Ansible automates deployments. Catbot can report on CloudWatch alarms, restart EC2 instances, and check SQS queues. It can also be extended easily, as it’s written in Node, and we wrote a couple dozen commands to interface with all of our internal tools.

Catbot allows our DevOps collaboration to occur more efficiently, anywhere. Many internal tools that required arcane command-line invocations can now be called with a simple phrase off one’s phone. Even quibbles as minor as sites loading poorly on mobile are resolved. It even automates some things we didn’t expect to need automation, like environment reservation.

We had occasional conflicts on the Application Platform team over the use of our limited testing environments on AWS. Thanks to Catbot, we were able to add queues that one could join through Slack, eliminating endless “who’s on perf?” questions in the office. This was accomplished with just a few dozen lines of JavaScript.

How does it work? What neat stuff does it do?

The opkit package proper has a Bot class that parses input and figures out when a command should be run, as most bots do. It does a few other neat things, too.

It maintains state for its commands by handling persistence to the filesystem or to one of several data stores. Each command can belong to a script, and each command within the same script shares this state, which is automatically persisted to and recovered from the data store. This persistence is handled by creatively-named Persisters, which can be swapped out by changing an argument, so your scripts that persist can be tested locally, writing to the filesystem, and then deployed to write to a database.
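
To make the swappable-persister idea concrete, here is a minimal sketch. It is not Opkit's actual Persister API: the class names MemoryPersister, FilePersister, and ScriptState, the save/recover method names, and the temp-file path are all assumptions chosen for illustration. It only shows how state keyed by script name can sit on top of interchangeable storage.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical persister interface: save(state) and recover(), both
// returning Promises. Illustrative only, not Opkit's real classes.
class MemoryPersister {
  constructor() { this.data = {}; }
  save(state) {
    this.data = JSON.parse(JSON.stringify(state)); // keep a private copy
    return Promise.resolve();
  }
  recover() {
    return Promise.resolve(JSON.parse(JSON.stringify(this.data)));
  }
}

class FilePersister {
  constructor(file) { this.file = file; }
  save(state) {
    return fs.promises.writeFile(this.file, JSON.stringify(state));
  }
  recover() {
    return fs.promises.readFile(this.file, 'utf8')
      .then((text) => JSON.parse(text))
      .catch((err) => {
        if (err.code === 'ENOENT') return {}; // nothing saved yet
        throw err;
      });
  }
}

// State shared by every command in one script, keyed by script name.
class ScriptState {
  constructor(persister) { this.persister = persister; }
  async update(script, change) {
    const all = await this.persister.recover();
    all[script] = Object.assign({}, all[script], change);
    await this.persister.save(all);
    return all[script];
  }
  async read(script) {
    const all = await this.persister.recover();
    return all[script] || {};
  }
}

// Swapping storage is a one-argument change.
const stateFile = path.join(os.tmpdir(), 'opkit-sketch-state.json');
const localState = new ScriptState(new FilePersister(stateFile));
const testState = new ScriptState(new MemoryPersister());
```

A database-backed persister would implement the same two methods, so the commands that call update and read would not need to change.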

Another issue with some other bots is logging. Figuring out who turned on an instance, for example, is easy… if everyone has their own AWS creds, which can be tricky. Have your commands return a Promise, and the return message will be logged by Winston. If you deploy to Heroku, like we do, integration with Logentries or another logging service makes it easy to keep track of everything your developers fix (or break).

Speaking of credentials, Opkit supports access control. Role-based auth is baked right in. Tag your commands with a role array, and you can define the access control for that command. You can require a particular role, any one of several roles, or a particular combination of roles.

You can specify any arbitrary authorizationFunction, or use the one we use internally at Bandwidth, which queries MongoDB. We also include commands to grant and revoke roles within the bot; roles can even be granted for a limited time, say, for the duration of a maintenance window. This access control even carries over to the help command. Query it, and you only get the commands you have access to. Altogether, this makes for a framework well suited to tasks as important and potentially destructive as DevOps.
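
As a rough illustration of how such role checks could work, here is a minimal sketch. The encoding of the roles array (a string for a single role, a nested array for a required combination), the function names, and the grant objects with an expires timestamp are all assumptions for this example; Opkit's own format and its authorizationFunction contract may differ.

```javascript
// Hypothetical encoding (not necessarily Opkit's): each entry in `roles`
// is one acceptable way to qualify. A string is a single role; a nested
// array means the user must hold every role in it.
function isAuthorized(command, userRoles) {
  if (!command.roles || command.roles.length === 0) return true; // unrestricted
  const held = new Set(userRoles);
  return command.roles.some((entry) =>
    [].concat(entry).every((role) => held.has(role)));
}

// Help output lists only the commands the caller may run.
function visibleCommands(commands, userRoles) {
  return commands
    .filter((command) => isAuthorized(command, userRoles))
    .map((command) => command.name);
}

// Time-limited grants: a role counts only until its expiry timestamp.
function activeRoles(grants, now) {
  return grants
    .filter((grant) => grant.expires === undefined || grant.expires > now)
    .map((grant) => grant.role);
}

const commands = [
  { name: 'help' },                                // open to everyone
  { name: 'alarms', roles: ['noc', 'dev'] },       // either role is enough
  { name: 'restart', roles: [['noc', 'oncall']] }, // both roles required
  { name: 'grant', roles: ['admin'] },             // one particular role
];
```

With these definitions, a user holding only the dev role would see help and alarms, while restart stays hidden until both noc and oncall are held at the same time.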

How do I extend it?

Hacking on it is pretty straightforward; adding another API integration is not difficult at all. Suppose we want to know what the current weather is in some particular town. There’s a convenient API provided by OpenWeatherMap for this; once we’ve registered for an account, we can set up our command like so:
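
A minimal sketch of such a command follows, under stated assumptions. Only the name and script fields and the args array come from this post; the syntax and command field names, message.args, message.channel, and bot.sendMessage are guesses at the shape of the framework's interface and should be checked against the opkit package. The HTTP call is injected as fetchWeather so the sketch runs without an API key; the response fields used (name, main.temp, weather[0].description) follow OpenWeatherMap's current-weather JSON.

```javascript
// Hypothetical command shape: field names other than `name` and `script`
// are assumptions, as are bot.sendMessage and message.channel.
function makeWeatherCommand(fetchWeather) {
  return {
    name: 'weather',
    script: 'weather',
    // Accept both "weather $TOWN" and "weather in $TOWN".
    syntax: [['weather'], ['weather', 'in']],
    command: function (message, bot) {
      // Words after the matched syntax arrive as an array, e.g. ['Long', 'Beach'].
      const town = message.args.join(' ');
      return fetchWeather(town).then(function (body) {
        const text = 'It is ' + body.main.temp + ' degrees in ' + body.name +
          ' (' + body.weather[0].description + ').';
        return bot.sendMessage(text, message.channel);
      });
    },
  };
}

// Stand-in for the HTTP call, so the sketch runs offline.
function fakeFetchWeather(town) {
  return Promise.resolve({
    name: town,
    main: { temp: 72 },
    weather: [{ description: 'clear sky' }],
  });
}

const weather = makeWeatherCommand(fakeFetchWeather);
```

In a deployed bot, fetchWeather would presumably wrap a promise-based HTTP client such as request-promise, querying the OpenWeatherMap current-weather endpoint with the town name and an API key.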

Each command is packaged in an object with a few pieces of metadata. There is a command name, as well as a script field, which is used to ensure that all stateful commands in the same script can share state. (As we are just accessing a RESTful API, we don’t need to use it here.) The command also specifies the acceptable ways to call it: either as $BOTNAME weather $TOWN or $BOTNAME weather in $TOWN. All subsequent words entered are passed in as an args array, which we use to generate the city name to query.

Many commands follow a similar general flow: an API query is made, and then the bot sends a message indicating the results. All commands must return a Promise, whose resolution or rejection indicates success or failure. To this end, using promise-based libraries, like request-promise, is recommended.
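
To show why the Promise contract matters, here is a minimal sketch of a dispatcher that treats resolution as success and rejection as failure, logging either way. The runCommand function and the message.user field are hypothetical, not part of Opkit; logger is assumed to be any object with info() and error() methods, which a Winston logger provides.

```javascript
// Hypothetical dispatcher: runs a command and logs how its Promise settles.
// `logger` is any object with info() and error(), such as a Winston logger.
function runCommand(command, message, bot, logger) {
  return Promise.resolve()
    // Calling inside then() turns a synchronous throw into a rejection.
    .then(() => command.command(message, bot))
    .then((result) => {
      logger.info(message.user + ' ran ' + command.name + ': ' + result);
      return true;
    })
    .catch((err) => {
      logger.error(message.user + ' failed ' + command.name + ': ' + err.message);
      return false;
    });
}
```

Because every command returns a Promise, one wrapper like this can record who ran what and how it ended, without each command doing its own logging.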

After the command is written, add it to server.js:

[Image: the command added to server.js]

Now, when we deploy to Heroku, we can use the command to ask about the weather in our town, or on our favorite beachside getaway, which actually does have a rather long beach:

[Image: the bot replying to a weather query in Slack]

Of course, considerably more complex interactions could be orchestrated. For example, this particular API recommends limiting queries to once every ten minutes, so we can cache our queries in MongoDB like so:
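
A minimal sketch of that caching flow follows, under stated assumptions. Only the processWeatherResponse name and the ten-minute interval come from this post; cachedWeather, its parameters, and the plain object used for state are illustrative. Here state stands in for the script's shared, persisted state (backed by MongoDB in the setup described), and fetchWeather and now are injected so the logic can run and be tested offline.

```javascript
const TEN_MINUTES = 10 * 60 * 1000;

// Turns an API response body into the message the bot sends.
function processWeatherResponse(body) {
  return 'It is ' + body.main.temp + ' degrees in ' + body.name + '.';
}

// `state` stands in for the script's shared, persisted state;
// `fetchWeather` and `now` are injected assumptions for this sketch.
function cachedWeather(state, town, fetchWeather, now) {
  const key = town.toLowerCase();
  const hit = state[key];
  if (hit && now - hit.fetchedAt < TEN_MINUTES) {
    // Fresh enough: answer from the cache without calling the API.
    return Promise.resolve(processWeatherResponse(hit.body));
  }
  return fetchWeather(town).then((body) => {
    state[key] = { body: body, fetchedAt: now }; // cache once it resolves
    return processWeatherResponse(body);
  });
}
```

A command would call cachedWeather with Date.now() as the last argument and send the resulting text to the channel.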

We factor out the processing into a processWeatherResponse function, and then check our cache before making the request, which we then proceed to cache once it resolves. Best of all, any other commands in the same script will be able to access this cached response.

Altogether, this makes for a readily extensible bot framework that is both immediately useful and readily adaptable as new integrations need to be added and circumstances change. You can find the example bot, an excellent starting point, here (https://www.npmjs.com/package/opkit-example).

Illirik Smirnov
ismirnov@bandwith.com

Illirik is an undergraduate Computer Science and Philosophy student at UNC-Chapel Hill and an intern on the data and communications platform teams. His hobbies include distance running and working on his car.
