Monitoring Spinnaker, Part 1

Rob Zienert
The Spinnaker Community Blog
10 min read · Mar 21, 2018


In a recent post to my dev log, I mentioned I wanted to write about scaling strategies for Redis, Spinnaker's primary storage engine. But before we can jump into that, a far more important topic is necessary: monitoring and alerting. Without measuring your applications, how can you be sure they're behaving correctly, let alone know which part of the system needs your attention to continue growing?

So, we’ll learn about monitoring Spinnaker first, service by service, while taking a look at graphs of (as far as I know) the largest Spinnaker installation: Netflix’s production deployment. Irrespective of your Spinnaker deployment's footprint, the metrics I’ll detail in this series will be valuable to you. At the end of the series, I’ll detail the alerts we have set up, along with the remediation steps we have for each.

Monitoring Orca

What can I say? I wanted to start off with my favorite service. For me, Orca is the most interesting of them all. Moreover, making sure it’s happy is critical to making sure your users are happy.

From Orca’s README:

Orca is the orchestration engine for Spinnaker. It is responsible for taking a pipeline or task definition and managing the stages and tasks, coordinating the other Spinnaker services.

In other words, Spinnaker is known and often loved for enabling accessible, highly configurable and composable workflows through its pipelines and tasks. This is largely thanks to the code that makes up Orca.

It wasn’t random. It was orca-strated. rimshot

A super-fast internals crash course

The primary domain model is an Execution, of which there are two types: PIPELINE and ORCHESTRATION. The PIPELINE type is, you guessed it, for pipelines, while ORCHESTRATION is what you see in the “Tasks” tab for an application. At this point in Spinnaker’s life, these two types are basically the same, but it wasn’t always that way. For our purposes, that’s all you need to know.

Orca uses a purpose-built distributed queue library, Keiko, to manage its work. A Message is a granular unit of work that is re-deliverable. For example, StartStage verifies that all upstream branches are complete and then plans the next stage; it won’t actually do anything as far as a user is concerned. RunTask is really the only message that performs work a user would be familiar with. Some message types are redelivered and duplicated often, whereas others are not.
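
To make that taxonomy a bit more concrete, here's a minimal, illustrative sketch in Kotlin of what these messages look like conceptually. The class names mirror Keiko's message types, but the fields and the Envelope wrapper are simplifications I've made up for this post, not the library's actual schema.

```kotlin
import java.time.Instant

// Simplified, illustrative model of Keiko-style queue messages.
// The fields here are assumptions for the sketch, not Keiko's real schema.
sealed class Message {
    abstract val executionId: String
}

// Control-flow message: checks that upstream branches are complete and plans
// the next stage. Cheap to handle, and safe to redeliver.
data class StartStage(
    override val executionId: String,
    val stageId: String
) : Message()

// The one message type that performs user-visible work (cloud operations, etc.).
// Handling one of these can take far longer than any other message.
data class RunTask(
    override val executionId: String,
    val stageId: String,
    val taskId: String
) : Message()

// Every message is pushed with a delivery time: "now" means deliver immediately,
// while retries, wait tasks, and execution windows schedule delivery in the future.
data class Envelope(
    val message: Message,
    val deliverAt: Instant = Instant.now()
)

fun main() {
    val envelope = Envelope(RunTask(executionId = "01ABC", stageId = "stage-1", taskId = "task-1"))
    println(envelope)
}
```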

If the queue is unhappy, everyone is unhappy.

The queue can be broken into two parts: the QueueProcessor and its Handlers. Orca uses a single thread (in a thread pool shared with all other Spring-scheduled threads) to run the QueueProcessor, which polls the oldest messages off the queue. A separate worker thread pool is used for the Handlers. Orca’s ability to scale its processing depends on the availability of threads in the worker pool, which can be tuned either by adjusting the worker thread pool size or by increasing the number of Orca instances: ignoring a persistence or downstream service bottleneck, Orca can be scaled out horizontally to meet the work demands it is given.
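
Here's a rough sketch of that division of labor, written against a made-up queue interface (Keiko's real API is different, but the shape is the same): a single scheduled poller hands messages to a bounded worker pool, and skips a cycle when that pool has no free threads.

```kotlin
import java.util.concurrent.Executors
import java.util.concurrent.ThreadPoolExecutor

// Hypothetical stand-in for the queue; not Keiko's actual interface.
interface SimpleQueue {
    fun poll(): Runnable? // the oldest ready message as a handler, or null if none
}

class QueueProcessor(
    private val queue: SimpleQueue,
    private val workers: ThreadPoolExecutor
) {
    // Invoked by a single Spring-scheduled thread on every poll interval.
    fun pollOnce() {
        // No free worker threads? Skip the cycle entirely. (Conceptually, this
        // is the situation that orca.nu.worker.pollSkippedNoCapacity counts.)
        if (workers.activeCount >= workers.maximumPoolSize) return

        queue.poll()?.let { handler ->
            // Hand the message to the worker pool; the polling thread itself
            // never runs handlers.
            workers.execute(handler)
        }
    }
}

fun main() {
    val workers = Executors.newFixedThreadPool(20) as ThreadPoolExecutor
    val queue = object : SimpleQueue {
        override fun poll(): Runnable? = null // empty queue for the demo
    }
    QueueProcessor(queue, workers).pollOnce()
    workers.shutdown()
}
```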

If the queue starts to back up, Pipelines and Tasks will begin to take longer to start or complete. Sometimes the queue backs up because of downstream service pressure (Clouddriver), but oftentimes it's because Orca doesn't have enough threads. You'll know very quickly whether Orca or Clouddriver is the problem based on Clouddriver's metrics (ha, I'm such a tease, come back for Part 2!).

For all of these graphs, unless otherwise mentioned, I’ve chosen a time window starting at 1pm and ending at 3pm on a normal activity day.

Active Executions

executions.active (gauge, grouped by executionType)

This is really just good insight into how your workload is distributed. Since adding this metric, we’ve never seen it crater, but if that were to happen it’d be bad. For Netflix, most ORCHESTRATION executions come from API clients. Disregarding what the execution is doing, there’s no baseline cost difference between a running orchestration and a pipeline.
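
If it helps to picture what the gauge is reporting, here's a trivial sketch of the semantics: count currently-running executions, grouped by type. The two types are real; everything else here is illustration.

```kotlin
// Illustrative only: what "active executions, grouped by executionType" means.
enum class ExecutionType { PIPELINE, ORCHESTRATION }

data class Execution(val id: String, val type: ExecutionType, val running: Boolean)

fun activeByType(executions: List<Execution>): Map<ExecutionType, Int> =
    executions.filter { it.running }
        .groupingBy { it.type }
        .eachCount()

fun main() {
    val sample = listOf(
        Execution("1", ExecutionType.PIPELINE, running = true),
        Execution("2", ExecutionType.ORCHESTRATION, running = true),
        Execution("3", ExecutionType.ORCHESTRATION, running = false)
    )
    println(activeByType(sample)) // {PIPELINE=1, ORCHESTRATION=1}
}
```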

Controller Invocation Time

controller.invocations (timer, grouped by statusCode, controller, method)

Run-of-the-mill graph here, to be honest. If you’ve got a lot of 500s, check your logs. When we see a spike in either invocation times or 5xx errors, it’s usually one of two things: 1) Clouddriver is having a bad day, or 2) Orca doesn’t have enough capacity in some respect to service people polling for pipeline status updates. You’ll need to dig elsewhere to find the cause.

Similarly, we have another graph that is the same metric, but filters on status=5xx, which is useful to see if things are really going sideways.

Task Invocations

task.invocations (gauge, grouped by executionType)

This will look similar to the first metric we looked at, but it's looking directly at our queue: this is the number of Execution-related Messages we’re invoking every second. If this drops, it’s a sign that your QueueProcessor may be starting to freeze up. At that point, check that the thread pool it’s on isn’t starved for threads.

Running Task Invocations by Application

task.invocations (gauge, filter status=RUNNING, grouped by application)

This is handy for seeing who your biggest customers are, from a pure orchestration volume perspective. Oftentimes, if we start to experience pain and see a large uptick in queue usage, it’ll be due to a large submission from one or two customers. In the graph above, it’s all normal business for us, but if we were having pain, we could bump our capacity or look to adjust some rate limits.

Here’s an example of things being a little more dicey the day before. You can see our teal green friends causing a bit of a stir at ~16:10. In this case, Orca was still perfectly healthy, but had we not been aware ahead of time, we could’ve easily sourced where our newfound load was coming from.

Message Handler Executor Usage

threadpool.* (gauge), orca.nu.worker.pollSkippedNoCapacity

This is our thread pool for the Message Handlers, plus another metric from the QueueProcessor. Green is good, blue is actual active usage, and red is when a thread is blocked. Red (blockingQueueSize) is bad, especially because the yellow guy (pollSkippedNoCapacity) should always keep blockingQueueSize from ever moving off 0.

If you’ve got green, it means you have capacity to take on more orchestration work. If you’re seeing yellow, it means your thread pool is saturated and the QueueProcessor has skipped trying to take on more work.

A note about the QueueProcessor: there are some default-off config options we use at Netflix (dubbed “turbo orca”) that cause the QueueProcessor to be more aggressive. Enabling them may cause you to see more yellow, but that’s what we want: we want to be processing as fast as possible without going overboard on capacity.

An occasional blip of yellow isn’t bad; the QueueProcessor on each Orca instance polls (by default) every 10ms. Missing a poll cycle isn’t the end of the world by any means, but sustaining a value over 0 is bad: scale out or make your thread pool bigger.
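
To make “sustaining a value over 0” concrete, here's a toy check you could run against a window of per-interval datapoints for orca.nu.worker.pollSkippedNoCapacity pulled from your metrics store (the fetch itself, and the three-interval threshold, are assumptions for the sketch):

```kotlin
// True if the skip metric stayed above zero for `sustainedIntervals` consecutive
// datapoints. An occasional blip is fine; a sustained run means the worker pool
// is saturated: scale out or make the pool bigger.
fun workerPoolSaturated(
    pollSkippedNoCapacity: List<Double>,
    sustainedIntervals: Int = 3
): Boolean {
    var run = 0
    for (value in pollSkippedNoCapacity) {
        run = if (value > 0.0) run + 1 else 0
        if (run >= sustainedIntervals) return true
    }
    return false
}

fun main() {
    println(workerPoolSaturated(listOf(0.0, 1.0, 0.0, 0.0)))  // false: just a blip
    println(workerPoolSaturated(listOf(0.0, 2.0, 3.0, 1.0)))  // true: sustained
}
```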

Queue Messages Pushed / Acked

queue.pushed.messages (gauge, grouped by instance)
queue.acknowledged.messages (gauge, grouped by instance)

These two graphs come as a couple. If messages pushed is outpacing messages acknowledged, you’re presently having a bad time. Most messages will complete in the blink of an eye; only RunTask will really take much time. If you see an uptick in messages pushed but not a corresponding uptick in acks, it’s a good indicator that you’ve got a downstream service issue preventing message handlers from completing: take a look at Clouddriver, it probably wants your love and attention.
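
One way to read the two graphs together, sketched below: compare the push rate to the ack rate over the same window. The function name, the tolerance, and the idea of pre-fetched per-interval series are all mine, not something Orca ships.

```kotlin
// If pushes consistently outpace acks, handlers aren't completing -- usually a
// downstream (Clouddriver) problem rather than an Orca capacity problem.
fun handlersFallingBehind(
    pushedPerInterval: List<Double>,
    ackedPerInterval: List<Double>,
    tolerance: Double = 1.1 // allow ~10% slack for messages that are simply in flight
): Boolean {
    val pushed = pushedPerInterval.sum()
    val acked = ackedPerInterval.sum()
    return pushed > acked * tolerance
}

fun main() {
    println(handlersFallingBehind(listOf(100.0, 120.0), listOf(98.0, 119.0)))  // false: keeping up
    println(handlersFallingBehind(listOf(100.0, 300.0), listOf(90.0, 110.0)))  // true: falling behind
}
```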

Queue Depth

queue.depth (gauge, gray), queue.ready.depth (gauge, yellow), queue.unacked.depth (gauge, blue)

I like this graph, especially when it looks like this. Keiko supports setting a delivery time for messages, so you’ll always see queued messages outpacing in-process messages if your Spinnaker install is active. Things like wait tasks, execution windows, retries, and so on all schedule message delivery in the future, and in-process messages are usually in-process for a handful of milliseconds.

When operating Orca, one of your life missions is to keep ready messages at 0. A ready message is a message that has a delivery time of now or in the past, but hasn’t been picked up and transitioned into processing yet: this is a key contributor to the complaint that “Spinnaker is slow.” As I’ve mentioned before, Orca is horizontally scalable. Give Orca an adrenaline shot of instances if you see ready messages over 0 for more than two intervals so you can clear the queue out.
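
In code terms, “ready” looks roughly like this: the delivery time has arrived but nothing has picked the message up yet. This is a conceptual sketch, not Keiko's internals.

```kotlin
import java.time.Instant

// A queued message is "ready" when its delivery time is now or in the past and
// no handler has picked it up (it isn't in the unacked set). Sustained ready
// depth over zero is the "Spinnaker is slow" signal: add Orca instances.
data class QueuedMessage(val id: String, val deliverAt: Instant)

fun readyDepth(
    queued: Collection<QueuedMessage>,
    unackedIds: Set<String>,
    now: Instant = Instant.now()
): Int = queued.count { !it.deliverAt.isAfter(now) && it.id !in unackedIds }

fun main() {
    val now = Instant.now()
    val queued = listOf(
        QueuedMessage("a", now.minusSeconds(5)),  // ready
        QueuedMessage("b", now.plusSeconds(30))   // scheduled for the future
    )
    println(readyDepth(queued, unackedIds = emptySet(), now = now)) // 1
}
```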

Here’s an example from just yesterday where we got hit with a bunch of orchestration work. You can see that we hit a couple bumps of ready messages, but cleared them out quickly. Once the ready messages reached a steady state, we scaled up, the ready messages dropped away, and the world went back to a happy place.

If you want to see what it looks like when it’s really bad, take a look at this post from my personal blog.

Queue Errors

queue.retried.messages (counter, yellow), queue.dead.messages (counter, orange), queue.orphaned.messages (gauge, red)

It sounds worse than it is. Retried is a normal error condition by itself. Dead-lettered occurs when a message has been retried a bunch of times and has never been successfully delivered. A dead message gets dropped into its own sorted set in Redis which, if you want, can be redriven with some Redis-fu, but there’s no current tooling around that. We know we need better tooling around dead messages — but we typically do not redrive messages.

Orphaned messages are bad. They’re messages whose contents are in the queue, but which don’t have a pointer in either the queue set or the unacked set. This is a sign of an internal error, likely a troubling issue with Redis. It “should never happen” if your system is healthy, and likewise “should never happen” even if your system is really, really overloaded. It’s worth a bug report.
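
In terms of that layout, an orphan is roughly “message contents exist, but neither the queue set nor the unacked set points at them.” A conceptual sketch of the idea (not the actual detection code):

```kotlin
// Conceptual orphan check: message bodies are stored by id, with pointers kept in
// separate queue and unacked sets. Contents with no pointer anywhere are orphans --
// an internal error worth a bug report, not a normal operational state.
fun findOrphanedMessageIds(
    messageContents: Map<String, String>, // id -> serialized message body
    queuedIds: Set<String>,
    unackedIds: Set<String>
): Set<String> = messageContents.keys - queuedIds - unackedIds

fun main() {
    val contents = mapOf("a" to "{}", "b" to "{}", "c" to "{}")
    println(findOrphanedMessageIds(contents, queuedIds = setOf("a"), unackedIds = setOf("b"))) // [c]
}
```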

Message Lag

queue.message.lag (timer)

This is a measurement of the difference between a message’s desired delivery time and its actual delivery time: smaller and tighter is better. It’s a timer measurement of every message’s (usually very short) life in a ready state. When your queue gets backed up, this number will grow. We consider this one of Orca’s key performance indicators.

A mean message lag of anything under a few hundred milliseconds is fine. Don’t panic until you’re getting around a second. Scale up and everything should be fine.
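
Lag itself is just the gap between when a message should have been delivered and when a handler actually received it. A quick sketch of the measurement, using the one-second rule of thumb above as the alert threshold (the threshold is guidance, not a hard limit):

```kotlin
import java.time.Duration
import java.time.Instant

// Lag = actual delivery time minus desired delivery time.
fun messageLag(desiredDelivery: Instant, actualDelivery: Instant): Duration =
    Duration.between(desiredDelivery, actualDelivery)

// Mean lag under a few hundred milliseconds is fine; around a second, scale up.
fun lagNeedsAttention(meanLag: Duration): Boolean = meanLag >= Duration.ofSeconds(1)

fun main() {
    val desired = Instant.parse("2018-03-21T13:00:00Z")
    val actual = desired.plusMillis(250)
    val lag = messageLag(desired, actual)
    println("lag=${lag.toMillis()}ms needsAttention=${lagNeedsAttention(lag)}") // 250ms, false
}
```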

Read Message Lock Failed

Look at that boring graph. Good thing, too. I’m not sure what this graph would look like if you have “turbo orca” disabled (enable it with keiko.queue.fillExecutorEachCycle=true), but this graph is an indication that a poll cycle found messages in a ready state but couldn’t acquire a lock on them. Scaling out should get you out of this situation, but if it doesn’t, then open a Pull Request, because the setting that dictates this is currently hard-coded (sorry).

Other Things

The other thing to monitor is your Redis instance. Running out of Redis memory on Orca is generally considered a really bad thing because running executions will fail and you’ll lose your queue state. If you’re using echo to publish Spinnaker events to a log (like Kafka), you can rebuild the state, but it’s generally good to just… avoid that happening in the first place.

At Netflix, we have a replica that we’ll promote in the event of a failure, or can recover from S3, or alternatively, re-hydrate from our big data store. We’re currently working on moving to Dynomite so instance failures are less of an ordeal on this front.

To manage your Redis memory size, especially at a scale like ours, you need to have some background jobs running to keep the memory size down. Orca ships with just those things! Here’s what we have enabled:

  • pollers.oldPipelineCleanup.enabled=true: Clean up old pipeline executions over a certain age threshold. It has some additional logic to keep past executions over the threshold for low-frequency pipelines (see the sketch after this list). Source & configs are here.
  • pollers.topApplicationExecutionCleanup.enabled=true: Clean up orchestrations from especially busy applications. We have some apps that run tens of thousands of orchestrations a day: a nice recipe for Redis OOM all by themselves. Source and configs are here.
  • tasks.daysOfExecutionHistory=180: We clean out any execution that’s over 180 days old. Period.
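
To illustrate the shape of the retention decision those pollers make, here's a hedged sketch. The real logic lives in Orca's cleanup agents and has more nuance; the names, the five-execution floor, and the way I've combined the age cutoff with the low-frequency rule are assumptions for illustration only.

```kotlin
import java.time.Duration
import java.time.Instant

// Illustrative retention decision, loosely modeled on the cleanup pollers above.
// The thresholds and the "keep some history for quiet pipelines" rule are assumptions.
data class StoredExecution(val pipelineConfigId: String, val completedAt: Instant)

fun shouldDelete(
    execution: StoredExecution,
    executionsStoredForSamePipeline: Int,
    now: Instant = Instant.now(),
    maxAge: Duration = Duration.ofDays(180),   // mirrors tasks.daysOfExecutionHistory
    minimumKeptForQuietPipelines: Int = 5      // made-up floor for low-frequency pipelines
): Boolean {
    val tooOld = Duration.between(execution.completedAt, now) > maxAge
    // Keep a little history around for pipelines that rarely run, even past the cutoff.
    val protectedAsLowFrequency = executionsStoredForSamePipeline <= minimumKeptForQuietPipelines
    return tooOld && !protectedAsLowFrequency
}

fun main() {
    val old = StoredExecution("pipeline-1", Instant.now().minus(Duration.ofDays(365)))
    println(shouldDelete(old, executionsStoredForSamePipeline = 200)) // true: old, with plenty of history
    println(shouldDelete(old, executionsStoredForSamePipeline = 2))   // false: kept as low-frequency history
}
```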

If you’re in a company that must meet requirements like HIPAA or PCI compliance, or… any of those things that need long-term auditing, just make sure you’re shipping events to a long-term store via echo’s event firehose.

Wrap up

I hope this was useful for you. Getting started with Spinnaker is daunting, and keeping it running well is hard, especially as demand grows. From an orchestration perspective, this should give you enough to at least know when things are going sideways before your customers do, and hopefully to stay ahead of the curve before issues even become apparent.

Next up: Clouddriver!
