Scaling Spinnaker at Netflix — Metrics and more!
Part 2 of a series on how we scale and operate Spinnaker at Netflix.
This post outlines our approach to using metrics, dashboards and alerts to operationally support Spinnaker.
Netflix-specific Tooling
Although Spinnaker is an open-source project, there are a handful of Netflix-specific tools that support our efforts.
- Atlas: The telemetry system used to manage, graph, and alert on dimensional time series data (aka metrics). Portions of Atlas are open source.
- Chronos: An event tracking service that answers questions like "What happened in the last hour?" or "What happened between time A and time B?"
- Spectator: A simple open source library for instrumenting code to record dimensional time series data.
- Big Data Platform: A reliable data analytics platform shared across all of Netflix.
Metrics vs Alerts
Alerts tell you something is wrong and require that an action be taken.
Dashboards provide visibility and context into what might be wrong.
It’s worth noting that only critical alerts will result in a Spinnaker engineer being paged.
Lower-priority alerts will send an email and/or trigger auto-remediation (i.e., terminate the offending instance and re-launch it). The latter is a capability of Atlas.
The Anatomy of a Metric
A Spectator metric consists of a name and one or more key/value tags. These tags can be used to group and/or narrow down graphs on a dashboard.
There are two types of metrics commonly used in Spinnaker:
- Timer: A timer measures both the rate at which a particular piece of code is called and its duration. Ex) How many times has this controller method been invoked, and what was the average duration?
- Gauge: A gauge is an instantaneous measurement of a value. Ex) How many Hystrix circuits are currently open?
Creating a Metric
// Build a dimensional metric id for this controller invocation.
Id id = registry.createId(metricName)
    .withTag("controller", controller)
    .withTag("method", handlerMethod.getMethod().getName())
    .withTag("status", status.toString().charAt(0) + "xx")
    .withTag("statusCode", status.toString());

// Record how long the request took (the start time was stashed on the request earlier).
registry.timer(id).record(
    getNanoTime() - ((Long) request.getAttribute(TIMER_ATTRIBUTE)),
    TimeUnit.NANOSECONDS
);
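For the gauge type described above, the pattern is similar. The following is a minimal sketch only; the metric name, tag, and openCircuitCount variable are illustrative assumptions rather than Spinnaker's actual code:

// Illustrative sketch: report an instantaneous value as a gauge.
// "hystrix.circuitsOpen", the "application" tag, and openCircuitCount are hypothetical.
Gauge openCircuits = registry.gauge(
    registry.createId("hystrix.circuitsOpen").withTag("application", "orca")
);
openCircuits.set(openCircuitCount);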
Flow of Metrics and Events
Key Points
- All Spinnaker services emit metrics to Atlas, where they are assembled into dashboards, alerts, or a combination of both.
- Orca publishes pipeline and orchestration events to Echo, where they are immediately forwarded to the events bridge (Netflix-only). The events bridge is responsible for transforming the Spinnaker event data structure into what Chronos expects.
- Chronos is used by SREs and developers alike to determine what environment changes occurred between two given points in time.
- It's worth noting that the Echo-to-events-bridge forwarding is generic and natively supported by Echo via its rest configuration:
rest:
  enabled: true
  endpoints:
    - wrap: false
      url: https://spinnaker-events-bridge
Important Dashboards
We aim to have a consistent set of graphed metrics across all Spinnaker services.
Dashboards for each Spinnaker service look remarkably similar as a result.
General metrics we’ve found to be of value (every service emits these metrics automatically):
- API invocations (per endpoint)
- API latency (per endpoint)
- 5xx responses
- Hystrix fallback activity
- Health check (is the service passing its health check? see the sketch below)
We have alerts that page on-call for API latency and Hystrix fallback activity.
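As an illustration of the health-check metric mentioned above, a health status can be published as a polled gauge. This is a minimal sketch and not Spinnaker's actual implementation; the metric name and the healthIndicator object are assumptions standing in for whatever health source a service exposes:

// Minimal sketch: publish 1.0 when healthy, 0.0 when not, as a polled gauge.
// "health.status" and the healthIndicator/isHealthy() names are illustrative assumptions.
PolledMeter.using(registry)
    .withName("health.status")
    .monitorValue(healthIndicator, h -> h.isHealthy() ? 1.0 : 0.0);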
Diagnosing via Metrics
We had a recent issue where an instance of Orca would periodically fail its health check.
That’s strange.
Let’s look at the historical load average for this particular instance.
I wonder what was running on that instance?
That's weird. We opted to take the instance out of service (Sep 14, 3am) and monitor further.
For what it's worth, removing an Orca instance from service only prevents new work from launching; all existing pipelines will run to completion.
We were able to deduce from the above graph that there was a significant increase in a particular operation running on the affected instance.
The fact that the issue persisted after the instance was removed from service (the small spikes after Sep 14, 3am) confirmed our suspicion.
As a result of these graphs, we were able to identify and adjust the polling cycles for a number of operations. In many cases, these operations were polling every second for many hours and introducing a noticeable amount of overhead on the overall system.
Although this particular issue has not been fully remediated (turns out polling intervals were only part of the problem), we’ll continue to use metrics as the mechanism by which we gauge our success.
That’s it for now!
If you’d like to know more about Spinnaker:
- Join us on Slack
- Follow us on Twitter
- Visit our project page
- Browse our code