How Netflix Has Extended Spinnaker

Rob Zienert
The Spinnaker Community Blog
Oct 30, 2019 · 6 min read


Airbnb recently came out with a cool article about their adoption story and how they’ve been extending Spinnaker in the process. If you haven’t already, I’d recommend checking it out before continuing!

If you lurk on Spinnaker’s OSS development, you’ll know there’s an active effort between Netflix and Armory to introduce a true, complete plugin model for the project. While it’s still very early days, I thought it would be a fun exercise to enumerate many (though not all) of the ways Netflix has extended open source Spinnaker.

What Extensions Has Netflix Built?

I’ve mentioned before that we have ~30 engineers on the Netflix Delivery Engineering team (about two-thirds of whom work on Spinnaker). That’s a big team. In addition to the OSS services, we maintain another four or five internal services, plus a ton of custom code layered on top of OSS Spinnaker.

As my team has mentioned before, we consume the OSS Spinnaker JARs as libraries and layer custom code on top. This allows us to run the same code everyone else does while retaining full ability to add and modify functionality where we need it.

First, a crash course on Spinnaker services: they’re all written for the JVM using Spring Boot. There’s a lot of power in both of those tools: if you want to do something, you can probably do it, and do it relatively easily once you’re set up. Spinnaker doesn’t use a ton of Spring Boot’s sugary add-ons, but it heavily utilizes its dependency injection, which affords developers great latitude in customizing or replacing standard functionality. For example…

netflixplatform

We have a shared library, similar to kork, for integrating Spinnaker into the Netflix runtime “paved road”: auto-wiring our services to send metrics to Atlas, register with Eureka, read dynamic configuration from Fast Properties, and perform secrets decryption and RPC auth with Metatron.

Most of this is accomplished by simply adding more code, then adding @Configuration classes that wire things into Spring’s Environment.

package com.netflix.spinnaker.platform.atlas;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import com.netflix.spectator.api.Registry;

// AtlasRegistry, SpectatorContext, and AtlasPluginManager are internal
// Netflix classes; their imports are omitted here.

@Configuration
public class AtlasConfiguration {

  @Bean
  public Registry registry() {
    AtlasRegistry reg = new AtlasRegistry();
    SpectatorContext.setRegistry(reg);
    return reg;
  }

  @Bean
  public AtlasPluginManager atlasPluginManager(Registry registry) {
    return new AtlasPluginManager(registry);
  }
}

Nothing exciting to see here: we’re just wiring up some internal libraries, but now any metrics produced by Spinnaker services will be correctly collected into our internal metric store, Atlas.

What’s especially interesting is that all of the customizations and extensions I’m about to outline are enabled and wired up in similar ways: implementing interfaces, creating factory configuration classes, and dropping the JAR onto the classpath. For Java developers who have used Spring, this process should be almost boringly accessible.
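To make that pattern concrete, here’s a minimal sketch of what one of these extension JARs looks like. GreetingProvider is a hypothetical stand-in for an extension interface exposed by an OSS Spinnaker service; the wiring, though, is exactly the boring Spring you’d expect:

package com.netflix.spinnaker.platform.example;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Hypothetical stand-in for an extension interface exposed by an OSS
// Spinnaker service.
interface GreetingProvider {
  String greet(String user);
}

@Configuration
public class InternalExtensionConfiguration {

  // Because this JAR is on the classpath and the package is scanned, Spring
  // injects this bean anywhere the service asks for a GreetingProvider.
  @Bean
  public GreetingProvider greetingProvider() {
    return user -> "Hello from the internal extension, " + user;
  }
}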

Adam Jordens wrote, long ago, about how we extend Spinnaker. The patterns are the same today, even if the example repos haven’t been updated.

clouddriver-nflx

For a while, the Titus integration was internal-only: an entire cloud provider built as an extension. It’s now open source, and migrating it into OSS was just a lift-and-shift task.

We did all of the Clouddriver SQL backend development as an extension as well. As with Titus, open sourcing this work was a lift-and-shift operation once we felt it was stable enough, after running it in production for a few weeks.

We also have an Elasticsearch integration for Docker. It’s fairly specific to Netflix’s use cases, but it lets us more efficiently index and search Docker tags within our registries. It’s implemented as a new CacheProvider, similar to ProjectClustersCachingAgent.
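I won’t show our actual agent, but the shape is roughly this. The sketch below uses stand-in types rather than the real cats caching interfaces (ProjectClustersCachingAgent in OSS is the real reference):

import java.util.List;
import java.util.Map;

interface TagSource {                        // stand-in for a Docker registry client
  List<String> listTags(String repository);
}

interface SearchIndex {                      // stand-in for an Elasticsearch client
  void index(String docId, Map<String, Object> document);
}

class DockerTagIndexingAgent {
  private final TagSource registry;
  private final SearchIndex elasticsearch;

  DockerTagIndexingAgent(TagSource registry, SearchIndex elasticsearch) {
    this.registry = registry;
    this.elasticsearch = elasticsearch;
  }

  // Runs on the agent's schedule: pull tags from the registry and push them
  // into the index, so tag searches hit Elasticsearch instead of walking
  // the registry APIs.
  void loadData(String repository) {
    for (String tag : registry.listTags(repository)) {
      elasticsearch.index(repository + ":" + tag,
          Map.of("repository", repository, "tag", tag));
    }
  }
}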

Clouddriver has the concept of preprocessors: hooks for mutating operations before they’re executed. We’ve implemented custom checks that enforce AWS Security Group rules, specifically around some of our internal team security requirements. Clouddriver also supports validators, which we’ve used to restrict security groups from allowing 0.0.0.0/0 ingress (adding those rules has to go through our Cloud Network team’s tooling) and to enforce our preferred server group name lengths.
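The validator side of that boils down to a very small check. Here’s an illustrative sketch; the description and rejection types are stand-ins for Clouddriver’s actual validation classes:

import java.util.List;
import java.util.function.Consumer;

class SecurityGroupIngressRule {
  String cidr;                               // e.g. "10.0.0.0/8" or "0.0.0.0/0"
}

class UpsertSecurityGroupDescription {       // stand-in for an operation description
  List<SecurityGroupIngressRule> ingressRules;
}

class OpenIngressValidator {
  // Reject any rule that opens the group to the whole internet; those rules
  // must go through the Cloud Network team's tooling instead.
  void validate(UpsertSecurityGroupDescription description, Consumer<String> rejectWith) {
    for (SecurityGroupIngressRule rule : description.ingressRules) {
      if ("0.0.0.0/0".equals(rule.cidr)) {
        rejectWith.accept("0.0.0.0/0 ingress is not allowed via Spinnaker");
      }
    }
  }
}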

One cool customization we’ve created is a Lambda integration with our Security team that enforces that each application gets its own AWS Instance Profile. If the Instance Profile doesn’t exist, the Lambda creates it from a blessed company default. Applications cannot use an instance profile created for another application.
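The gist of that Lambda, sketched with the real AWS SDK for Java but a hypothetical naming convention (this is not our actual code):

import com.amazonaws.services.identitymanagement.AmazonIdentityManagement;
import com.amazonaws.services.identitymanagement.AmazonIdentityManagementClientBuilder;
import com.amazonaws.services.identitymanagement.model.CreateInstanceProfileRequest;
import com.amazonaws.services.identitymanagement.model.GetInstanceProfileRequest;
import com.amazonaws.services.identitymanagement.model.NoSuchEntityException;

public class InstanceProfileEnforcer {
  private final AmazonIdentityManagement iam =
      AmazonIdentityManagementClientBuilder.defaultClient();

  public String ensureProfile(String application) {
    String profileName = application + "InstanceProfile"; // hypothetical convention
    try {
      iam.getInstanceProfile(
          new GetInstanceProfileRequest().withInstanceProfileName(profileName));
    } catch (NoSuchEntityException e) {
      // Missing: create it. (The real Lambda seeds it from the blessed
      // company default; that part is elided here.)
      iam.createInstanceProfile(
          new CreateInstanceProfileRequest().withInstanceProfileName(profileName));
    }
    return profileName;
  }
}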

Finally, we have an extension for ALBs & NLBs that auto-attaches some custom security rules.

deck-nflx

Admittedly, I don’t have a lot of insight into the UI. We’ve built a HUGE number of custom views within Deck, but you’ll have to ask some frontend folks about those. 😬 Sorry!

echo-nflx

Echo handles all events within Spinnaker and is the source of all execution triggering. Auditing is very important for Spinnaker, so we have an integration point that sprays every event to Chronos, our central SRE auditing system, as well as to our big data portals.
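A forwarding listener like that is a tiny amount of code. Here’s roughly the shape, with simplified stand-ins for Echo’s event and listener types and for our Chronos client:

import java.util.Map;
import org.springframework.stereotype.Component;

interface AuditSink {                        // stand-in for our Chronos client
  void record(Map<String, Object> event);
}

@Component
class AuditEventListener {
  private final AuditSink chronos;

  AuditEventListener(AuditSink chronos) {
    this.chronos = chronos;
  }

  // Echo invokes every registered listener for every event it sees, so one
  // small bean is enough to spray the full event stream to an audit system.
  public void processEvent(Map<String, Object> event) {
    chronos.record(event);
  }
}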

The biggest integration we’ve added is a new trigger type for Rocket, our internal CI system.

fiat-nflx

We source roles for authorization from an internal source-of-truth service. It’s a pretty simple integration: we just provide our own UserRolesProvider.
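Sketched out (simplified: Fiat’s real UserRolesProvider deals in Role objects, and MembershipClient stands in for the internal service), it looks something like:

import java.util.List;
import java.util.stream.Collectors;

interface MembershipClient {                 // stand-in for the internal source of truth
  List<String> groupsFor(String userId);
}

class InternalUserRolesProvider {
  private final MembershipClient membership;

  InternalUserRolesProvider(MembershipClient membership) {
    this.membership = membership;
  }

  // Fiat calls this to resolve a user's roles; we just delegate to the
  // internal membership service and normalize group names to role names.
  public List<String> loadRoles(String userId) {
    return membership.groupsFor(userId).stream()
        .map(String::toLowerCase)
        .collect(Collectors.toList());
  }
}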

front50-nflx

Netflix requires some additional validation for applications, so we’ve added validators that run whenever someone saves an application.
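An application validator of that sort is small. This sketch uses stand-in types for Front50’s actual validator and errors classes, and the owner-email rule is just an invented example:

import java.util.function.BiConsumer;

class Application {                          // stand-in for Front50's Application
  String name;
  String email;
}

class RequiredOwnerValidator {
  // Front50 runs each registered validator when an application is saved;
  // rejecting here blocks the save with a message the user sees.
  void validate(Application application, BiConsumer<String, String> reject) {
    if (application.email == null || !application.email.endsWith("@netflix.com")) {
      reject.accept("email", "Applications must have a Netflix owner email");
    }
  }
}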

We’ve also had to perform migrations on applications, pipelines, and so on. Rather than cat-wrangle all of the teams, we’ve written custom Migrations that are scheduled to incrementally roll out new features or configurations without our users having to do anything.
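The migration pattern looks roughly like this; the interface shape is simplified from Front50’s, and the example migration is hypothetical:

import java.util.Map;

interface PipelineStore {                    // stand-in for Front50's pipeline DAO
  Iterable<Map<String, Object>> all();
  void update(Map<String, Object> pipeline);
}

class AddDefaultNotificationsMigration {     // hypothetical example migration
  private final PipelineStore store;

  AddDefaultNotificationsMigration(PipelineStore store) {
    this.store = store;
  }

  // Whether this migration still has work to do.
  boolean isValid() {
    return true;
  }

  // Runs on a schedule; idempotent, so partial runs are safe.
  void run() {
    for (Map<String, Object> pipeline : store.all()) {
      if (!pipeline.containsKey("notifications")) {
        pipeline.put("notifications", java.util.List.of());
        store.update(pipeline);
      }
    }
  }
}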

gate-nflx

Gate has seen a lot of integrations. To support the custom work done in Deck, we have many internal-only web controllers, associated services, and configuration. We also have X509 auth extensions that extract additional user data from our internal certificate manager, giving us finer-grained permission control over inbound traffic.
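The X509 piece builds on Spring Security, which has a real extension point for this: a custom X509PrincipalExtractor. A simplified sketch (our real extension also consults the internal certificate manager, which I’ve elided):

import java.security.cert.X509Certificate;
import org.springframework.security.web.authentication.preauth.x509.X509PrincipalExtractor;

public class InternalX509PrincipalExtractor implements X509PrincipalExtractor {

  // Derive the Spinnaker user from the certificate subject; anything richer
  // (team, scopes) would come from looking this identity up internally.
  @Override
  public Object extractPrincipal(X509Certificate cert) {
    String dn = cert.getSubjectX500Principal().getName();
    // e.g. "CN=jsmith,OU=Delivery Engineering,O=Netflix" -> "jsmith"
    for (String part : dn.split(",")) {
      if (part.startsWith("CN=")) {
        return part.substring(3);
      }
    }
    return dn;
  }
}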

igor-nflx

Nothing too crazy here: our Jenkins servers use internal security services for client auth, so we wire our own keystores into the Jenkins clients.
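That wiring is mostly vanilla JSSE: build an SSLContext from internal key material and hand it to the HTTP client the Jenkins integration uses. Roughly (paths and passwords are placeholders):

import java.io.FileInputStream;
import java.security.KeyStore;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

public class JenkinsClientTls {
  public static SSLContext sslContext(String keystorePath, char[] password)
      throws Exception {
    KeyStore keyStore = KeyStore.getInstance("PKCS12");
    try (FileInputStream in = new FileInputStream(keystorePath)) {
      keyStore.load(in, password);
    }

    // Client certs for mutual TLS with the Jenkins masters.
    KeyManagerFactory kmf =
        KeyManagerFactory.getInstance(KeyManagerFactory.getDefaultAlgorithm());
    kmf.init(keyStore, password);

    // Trust anchors from the same internal key material.
    TrustManagerFactory tmf =
        TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
    tmf.init(keyStore);

    SSLContext context = SSLContext.getInstance("TLS");
    context.init(kmf.getKeyManagers(), tmf.getTrustManagers(), null);
    return context;
  }
}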

orca-nflx

I’m not going to count, but we have something like 15 custom stages that integrate with various internal services. For example, our Resilience team has integrated ChAP as a first-class stage. Adoption of Spinnaker at Netflix isn’t prescriptive by any means, but simple, tight integrations like this make using Spinnaker ever more compelling.
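A custom stage decomposes into a builder that declares a task graph plus tasks that call your service. The sketch below only approximates Orca’s StageDefinitionBuilder/Task contracts, with a stand-in ChAP client:

import java.util.LinkedHashMap;
import java.util.Map;

interface ChapClient {                       // stand-in for the ChAP service client
  String startExperiment(Map<String, Object> config);
  boolean isComplete(String experimentId);
}

class ChapStageTasks {
  private final ChapClient chap;

  ChapStageTasks(ChapClient chap) {
    this.chap = chap;
  }

  // First task: kick off the experiment and record its id in the stage
  // outputs so later tasks (and stages) can see it.
  Map<String, Object> start(Map<String, Object> stageContext) {
    Map<String, Object> outputs = new LinkedHashMap<>();
    outputs.put("experimentId", chap.startExperiment(stageContext));
    return outputs;
  }

  // Second task: Orca re-invokes this on an interval until it returns true,
  // then the pipeline moves on.
  boolean monitor(String experimentId) {
    return chap.isComplete(experimentId);
  }
}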

One interesting integration is Open Connect, our worldwide CDN. That team delivers firmware to datacenters around the world through Spinnaker, orchestrated by a custom integration within Orca.

We also have an integration that automates creating JIRA tickets for releases when necessary, so users don’t need custom Deploy Strategies to automate JIRA creation or resolution. In our implementation it’s entirely invisible to users. You could also use preprocessors to automatically add stages, or to build entirely arbitrary pipelines: this is actually how Pipeline Templates (v1 and v2) are built… they’re just preprocessors.
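Preprocessors are powerful precisely because pipelines are just maps at that point. A sketch of the idea, with the method shape approximating Orca’s PipelinePreprocessor and a hypothetical createJiraIssue stage type:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class JiraStagePreprocessor {

  @SuppressWarnings("unchecked")
  public Map<String, Object> process(Map<String, Object> pipeline) {
    List<Map<String, Object>> stages =
        new ArrayList<>((List<Map<String, Object>>) pipeline.get("stages"));

    // Hypothetical stage that opens a release ticket; users never see or
    // configure this. (Real code would also rewire refIds so downstream
    // stages depend on it.)
    Map<String, Object> jiraStage = Map.of(
        "type", "createJiraIssue",
        "name", "Open release ticket",
        "refId", "jira0");
    stages.add(0, jiraStage);

    pipeline.put("stages", stages);
    return pipeline;
  }
}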

Platform extensions

Extending Spinnaker services directly is one thing, but it doesn’t tell the whole story. Netflix uses Spinnaker as the control plane for our clouds, so much of our traffic is API-driven, too. These integrations can be summarized as domain-specific orchestrations.

Similar to Airbnb with Deployboard, a few organizations within Netflix have written services that offer a specialized view and, in some cases, deep extension features atop our orchestration primitives.

Many organizations hit the Spinnaker APIs from their applications to read their own operational footprint at runtime. One of my favorite integrations is a team that uses our API to create and orchestrate on top of an internal spot market of Instance Type Reservations. The most widely known integration, however, is likely Chaos Monkey.
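“Hitting the Spinnaker APIs” usually just means plain REST calls to Gate. For example (the endpoint path here matches Gate’s public API as I remember it; treat it as illustrative):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FootprintReader {
  public static void main(String[] args) throws Exception {
    HttpClient client = HttpClient.newHttpClient();

    // Read an application's server groups from Gate.
    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("https://gate.example.com/applications/myapp/serverGroups"))
        .header("Accept", "application/json")
        .build();

    HttpResponse<String> response =
        client.send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.body()); // JSON list of server groups
  }
}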

Lean Core, Fat Ecosystem

Being able to extend Spinnaker is super powerful, but it requires people to reason about a lot of Spinnaker internals: it’s a high bar, and we need to level up. The new plugin model will allow for a more federated development approach and will eventually play a crucial role in lowering the bar to contributing to Spinnaker, both for open source and for your internal use cases. Plugins will initially be in-process JVM, but we plan to expand the plugin contracts to remote plugins (RPC, containers) in the future.

More on this stuff later as it continues to shake out, but you can get started with my epic-level proposal, Spinnaker as a Platform. And of course, come join #sig-platform on Slack if you’re interested in helping with early development and testing.

Spinnaker Summit 19

Are you interested in this kind of stuff? Come to Spinnaker Summit in San Diego on Nov 15–17! It’s just before KubeCon, so since you’re probably headed that direction anyway, what’s another couple of days? 😄

There will be a talk from the Armory folks on plugins, and Adam Jordens and I will be giving a talk on the evolution of operations and internals of Spinnaker at Netflix.

Eager to see you there and to say hello to both new and old faces!
