Evolving How Netflix Builds, Maintains, and Operates Their Spinnaker Distribution

Adam Jordens
The Spinnaker Community Blog
May 4, 2021 · 8 min read

--

It has been nearly five years since we first shared the story of how we were delivering our custom packaged variant of Spinnaker to hundreds of internal users.

Much has changed.

We now orchestrate more than 20k deployments daily. That number climbs significantly higher once you count more discrete changes, such as dynamic configuration adjustments, security group refinements, and load balancer creation, all handled safely and efficiently by the platform we call Spinnaker.

Our team has grown, and specialization has emerged. We have dedicated groups focused on Managed Delivery and Automated Canary Analysis, emphasizing a need to raise the level of abstraction across delivery and resource management. Daily active users have grown from hundreds to thousands. Internal customers of the Spinnaker API now number in the hundreds. Plugins took off in 2020, with more than a dozen features contributed and now maintained by other Netflix teams.

We now run a multi-region, largely active-active Spinnaker, having moved on operationally from what was a single-region deployment. Exercises are planned wherein traffic is shifted away from a region (either entirely or for a specific service). We have built tight integrations with the Netflix paved road, ranging from authentication and authorization to logging and telemetry. Where possible, implementations have been done in a way that an equally motivated member of the community could follow suit.

All of this is handled by a team of approximately 15, a mix of front-end and back-end engineers supported by a designer. This team has day-to-day responsibility for enhancing, operating, and supporting Spinnaker at Netflix. Because of this, the mechanisms through which we can efficiently consume and contribute to Spinnaker OSS are crucial considerations for this team.

Before jumping to the present day, let’s recap the approaches we have taken to build and maintain what effectively amounts to the Netflix distribution of Spinnaker.

2015–2018 (Artifact and Configuration Layering)

While preparing to open source Spinnaker in 2015, we recognized a need to separate Netflix-specific integrations from the larger open-source project.

At the time we opted to maintain separate Netflix-specific source code repositories and built Debian packages that consisted of open-source and Netflix artifacts with our configuration layered over the top.

See Scaling Spinnaker at Netflix — Custom Features and Packaging (circa 2016) for more information on this. We also spoke about this approach at multiple Spinnaker Summits, and the pattern applies to both our UI and backend services.
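
To make the layering a little more concrete, a Netflix-specific service build under this model could consume the published open-source artifacts as ordinary dependencies and add internal modules on top. A minimal sketch; coordinates, versions, and module names here are illustrative rather than our actual build:

// build.gradle for a hypothetical Netflix-packaged clouddriver
dependencies {
  // artifacts built and published by the open-source CI process
  implementation 'com.netflix.spinnaker.clouddriver:clouddriver-aws:1.0.0'
  implementation 'com.netflix.spinnaker.clouddriver:clouddriver-core:1.0.0'

  // Netflix-specific integrations maintained internally and layered on top
  implementation project(':clouddriver-netflix')
}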

Benefits

  • Directly consumed artifacts built and published by the open-source continuous integration (CI) process. Any member of the community could follow this approach.
  • Highly flexible. We could override components, exclude integrations and ultimately tailor Spinnaker to what we needed at Netflix.

Drawbacks

  • A split mental model between open-source (GitHub) and closed-source (Netflix).
  • Slow cycle time when changes originated as open-source pull requests and needed to be built and published before being incorporated at Netflix.
  • Explosive growth in per-repository “releases” proved confusing to many in the community who were unaware of our approach.

This approach to building our distribution was serving our needs well, at least until a motivated member of our team thought we could do better. Investments were made towards automating many of the manual steps inherent in keeping explicitly specified artifact versions current.

What follows is a description of these efforts. Thanks, Mark!

2018–2021 (Composite Builds and Custom Tooling)

Our follow-up to Artifact and Configuration Layering has not been written or talked about extensively.

While not a significant technical shift, we did make investments in optimizing how quickly we could introduce changes from the open-source repositories. Given the majority of our day-to-day work was happening there, even the smallest of improvements had significant multiplicative benefits.

The crux of this was a move away from directly consuming open-source artifacts. Rather than waiting for CI (either Travis CI or GitHub Actions) to build and publish artifacts, we adapted our internal build process such that we could point directly at open-source git hashes.

This bought us 10–30 minutes of cycle time improvement per change and shielded us from any flakiness in the CI process. We coupled this with a migration to Gradle composite builds wherein a Spinnaker engineer could simultaneously make changes to multiple projects (Netflix and open-source). This further improved cycle time by allowing local Netflix-specific development to happen concurrently with the open-source change needed to support it.
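
For those unfamiliar with composite builds, the idea is that Gradle substitutes a locally checked-out build for its published artifacts, so a Netflix-specific change and the open-source change it depends on can be developed together. A minimal sketch of a settings.gradle, assuming a hypothetical repository layout and names:

// settings.gradle in a hypothetical Netflix-specific clouddriver repository
rootProject.name = 'clouddriver-netflix'

// Pull in a local checkout of the open-source project; Gradle substitutes its
// published artifacts with the locally built projects, so changes on both
// sides can be built and tested in a single pass.
includeBuild '../clouddriver'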

Lastly, we tied this all together with a command-line interface called bumpy. This interface made it trivial to pull in open-source git hashes, track commit drift from open-source, and review changesets before deployment.

Benefits

  • No longer need to cut per-repository “releases” and wait for publishing (saving ~20–30 minutes per change).
  • Composite builds improved the local development story across open-source and Netflix repositories.
  • Tooling improved the developer ergonomics of pulling in open-source changes.

Drawbacks

  • Deck (the Spinnaker UI) was unable to take advantage of this and continued using its review, merge and release process.
  • Awkward split between work happening in GitHub (open-source) and Bitbucket (Netflix). The majority of PR activity took place in GitHub and still required relatively slow CI processes to pass before merging.
  • Challenges in introducing changes that would simplify the development, operator, and user experiences at Netflix but be potentially backward-incompatible for the broader community.
  • Growth in the Spinnaker community and feature set meant Netflix was primarily concerned with an increasingly smaller surface area of Spinnaker (AWS + Titus).

On that last point, we continue to deploy the majority of our infrastructure on EC2 VMs. Containers are becoming an increasingly viable alternative, with Titus providing an abstraction layer around container orchestration and management that was purpose-built for Netflix.

We’re happy to see continued emphasis on the Kubernetes experience within Spinnaker, but it’s not one we have plans to adopt internally. Kubernetes will certainly play a role, but it is not yet known whether it will be exposed directly or serve as platform building blocks under Titus.

The ecosystem around Spinnaker at Netflix is also shifting. New intent-based systems are being built in front of much of our infrastructure. We’re taking a holistic view of the SDLC and looking to improve the end-to-end developer experience. Think of a more singular developer portal versus many tools stitched together (Stash, GitHub, Jenkins, Spinnaker, AWS, Titus, Atlas, etc.).

Late last year, we began to think about potential changes to Spinnaker that would keep it in alignment with our other tools and platforms. Orchestration remains critically important, so much so that we are making new investments in more robust and capable primitives.

These conversations naturally led us to think about the potential impact on, and relevance to, the open-source project, and to debate possible improvements to the composite build and custom tooling approach we had been taking.

What we are looking for is continued flexibility around the types of changes we can make to Spinnaker. Ideally, we would land in a spot that allows us to take advantage of other Netflix tooling and processes, notably around CI, but without sacrificing any ability to contribute our work upstream.

The following section describes an intentional forking effort, giving us more flexibility around change while preserving our ability to cherry-pick relevant features back to the open-source project.

2021+ (Subtree Forks)

Let’s roll up our sleeves and walk through the approach. Thus far, we have targeted backend services but expect to make progress against Deck in the coming months. Tactics will likely vary given the difference in technology stacks.

We have long maintained a Netflix-specific repository for each Spinnaker service. To start, we added a git subtree corresponding to each upstream repository, which has allowed us to move away from composite builds.

$ git subtree add -P oss git@github.com:spinnaker/<serviceName>.git main

Gradle makes it relatively straightforward to build against source code located in the oss/ directory.

// spinnaker oss includes (in settings.gradle)
[
  ...
  'clouddriver-aws',
  'clouddriver-core',
  'clouddriver-web'
  ...
].each {
  include it
  project(':' + it).projectDir = file('oss/' + it)
}
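
Netflix-specific modules in the same repository can then depend on those oss/ projects like any other Gradle subproject. A small sketch, using a hypothetical module name:

// build.gradle of a hypothetical clouddriver-netflix module living alongside oss/
dependencies {
  // Netflix integrations build directly against the subtree's source
  implementation project(':clouddriver-aws')
  implementation project(':clouddriver-core')
}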

We point at internally published kork and fiat artifacts without changing any of the build configurations under oss/.

// applied within each configuration's resolutionStrategy: swap the open-source
// kork and fiat groups for their internally published equivalents
dependencySubstitution.all { DependencySubstitution dependency ->
  if (dependency.requested instanceof ModuleComponentSelector) {
    if (dependency.requested.group == "io.spinnaker.kork") {
      dependency.useTarget("a.b.kork:${dependency.requested.getModule()}:${korkVersion}")
    } else if (dependency.requested.group == "io.spinnaker.fiat") {
      dependency.useTarget("a.b.fiat:${dependency.requested.getModule()}:${fiatVersion}")
    }
  }
}

Our goal is to keep the oss/ subtree independently buildable and avoid introducing Netflix-specific dependencies, making it easier to cherry-pick changes to/from the open-source repositories.

# OSS -> Netflix
$ git fetch <serviceName>-oss main
$ git cherry-pick -x --strategy=subtree -Xsubtree=oss/ <gitHash>

# Netflix -> OSS
$ git fetch <serviceName>-netflix main
$ git cherry-pick -x --strategy=subtree -Xsubtree=oss/ <gitHash>

Benefits

  • We have significantly improved developer cycle time. Our engineers can now take advantage of Netflix CI processes, tooling, and build hardware.
  • Provides an opportunity to make broader cross-cutting and potentially backward-incompatible changes in areas highly relevant to Netflix.

Drawbacks

  • The UI (Deck) has not yet been tackled but is critically important.
  • There are additional steps required to contribute or consume from open-source repositories.

Right now, we are approximately six weeks into this journey and have validated our ability to consume and contribute changes across many of the open-source repositories. Eliminating composite builds has simplified the local development story, notably addressing a few nagging issues with IntelliJ and Gradle.

More than a set of services or repositories, Spinnaker is a community. One that many of us have been proud members of for many years now. We’ve developed friendships and built relationships that will outlive this project. To close out what has turned into a rather lengthy blog post, let’s recap how our engagement model may change in the future.

Engagement Model Moving Forward

Evolving to subtree forks gives us more flexibility. We can continue to use and contribute to Spinnaker while simplifying the development of Netflix-specific features. Spinnaker supports significant portions of the Developer Productivity vision at Netflix, which means we remain quite committed to the platform.

While we will continue to contribute to Spinnaker, the cadence of contribution and involvement will slow. Rather than nearly 100% of our work originating in open-source repositories, we will look for opportunities to cherry-pick changes up and down. These changes will often represent more complete features already deployed to our production environment.

With luck, this post reinforces decisions already made or inspires improvement for the community members maintaining their own Spinnaker distribution. If you’re using one of the pre-packaged variants of Spinnaker, let the Netflix experience offer a glimpse of what’s possible when you invest in the platform.

As always, we’re more than happy to answer questions!

Consider joining us at the upcoming cdCon, where we’ll be helping with a Birds of a Feather discussion around Spinnaker at Scale.
