Go Clouddriver: Scaling Spinnaker to 1000 Kubernetes Clusters at The Home Depot

Billy Bolton
The Spinnaker Community Blog
8 min read · Nov 1, 2022


INTRO

In August of 2019, The Home Depot chose Spinnaker as its enterprise Continuous Deployment (CD) solution for Google Cloud Platform (GCP) targets. We planned to use Spinnaker for Kubernetes and App Engine deployments. At the time we had a few hundred GCP projects and a couple hundred Kubernetes clusters, which were slowly onboarded into Spinnaker over the following months. One big problem stood in the way: the more Kubernetes providers we onboarded, the more compute Spinnaker’s Clouddriver microservice demanded. When we reached around 500 Kubernetes providers, Clouddriver was using around 100 CPUs and 250GB of memory. With the prospect of more than doubling the number of Kubernetes providers, we knew we needed a solution, since this resource growth was unsustainable.

Spinnaker is built on a microservice architecture, and Clouddriver is its workhorse: it handles all mutating calls to the cloud providers and indexes/caches all deployed resources. This article discusses several challenges we experienced with Clouddriver’s approach to managing Kubernetes clusters, the solutions we found, and how The Home Depot ultimately slashed Clouddriver’s compute needs by more than 95%.

PROBLEM STATEMENT

Spinnaker is a powerful and useful open source CD tool. We adopted it for its customizability and its microservice architecture, which promised scalability beyond the other CD solutions on the market. We subsequently encountered issues that made implementation at a large enterprise challenging: poor scalability for large Kubernetes environments, bugs in the App Engine implementation, unresponsive infrastructure pages, and poorly optimized API endpoints.

For the first year of Spinnaker adoption, Armory and The Home Depot worked together to identify and fix a few stability bugs related to Kubernetes and dynamic accounts, resolve several App Engine bugs, and implement many configuration optimizations. Additionally, the open source project made a major change in Spinnaker release 1.23 to use live calls for Kubernetes deployments, along with several other Kubernetes provider optimizations between releases 1.19 and 1.23. Even after all this work, in December of 2021, Clouddriver alone was using around 100 CPUs and 250GB of memory, and our team remained concerned about the scalability of our Kubernetes accounts given our plan to double our Kubernetes footprint within the next 1–2 years.

Kubernetes scalability remained an issue after all of the improvements above because of Clouddriver’s architectural design, known as “Cache All The Stuff” (CATS). CATS may work well for a small number of Kubernetes clusters and for other deployment targets like Compute Engine and App Engine, but at scale in Kubernetes it results in an unscalable and unmanageable service.

CATS performs poorly in the Kubernetes world because of the nature of Kubernetes API discovery and the way Clouddriver OSS caches Kubernetes cluster resources. Clouddriver’s CATS implementation does not currently interface with the Kubernetes API directly over HTTP; instead, it shells out and runs kubectl commands to grab cluster resources. It runs these commands frequently on a polling basis, often with many threads per Kubernetes provider, which means Clouddriver is constantly running hundreds, if not thousands, of commands against Kubernetes clusters, wasting CPU and memory on work and information that is not needed at the time.
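
For a concrete picture of the pattern, here is a minimal Go sketch of a CATS-style polling agent. It is illustrative only, not Clouddriver’s actual code: the 30-second interval, kubeconfig path, and cache helper are all hypothetical.

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// pollAccount mimics a caching agent: on a fixed interval it shells out
// to kubectl to list one resource kind across all namespaces. Every
// invocation spawns a new process and may repeat API discovery.
func pollAccount(kubeconfig, kind string) {
	for range time.Tick(30 * time.Second) {
		out, err := exec.Command(
			"kubectl", "--kubeconfig", kubeconfig,
			"get", kind, "--all-namespaces", "-o", "json",
		).Output()
		if err != nil {
			log.Printf("polling %s failed: %v", kind, err)
			continue
		}
		cache(out) // index the results, as CATS would
	}
}

func cache(raw []byte) { /* write to the cache store */ }

func main() {
	// Hundreds of providers times many resource kinds means hundreds or
	// thousands of kubectl processes being spawned continuously.
	go pollAccount("/path/to/kubeconfig", "deployments")
	go pollAccount("/path/to/kubeconfig", "replicasets")
	select {} // block forever while the agents poll
}
```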

Exacerbating the CATS performance problem, kubectl has its own method of interacting with the Kubernetes API, known as a dynamic client. A dynamic client is necessary because the Kubernetes API is not concrete: it allows the creation of Custom Resource Definitions (CRDs), and resource groups and versions vary with the cluster’s version. To interact with a cluster dynamically, kubectl performs a set of API discovery calls to find which resources are available on the cluster. The discovery results are cached on disk for their time-to-live (TTL, 10 minutes by default), and any subsequent kubectl command performs API discovery again if that cache is stale (an entry is older than its TTL). API discovery for a cluster can return dozens of resources.
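
The sketch below shows roughly what one round of API discovery looks like using client-go, the same library family kubectl is built on. The kubeconfig path is a placeholder.

```go
package main

import (
	"fmt"
	"log"

	"k8s.io/client-go/discovery"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig")
	if err != nil {
		log.Fatal(err)
	}
	dc, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// One round of API discovery: every API group the cluster serves and
	// the resources within each group. kubectl stores this response on
	// disk and repeats these calls whenever that cache exceeds its TTL.
	groups, resources, err := dc.ServerGroupsAndResources()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("discovered %d API groups and %d resource lists\n",
		len(groups), len(resources))
}
```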

What does this all mean for a Continuous Deployment environment that has hundreds of target Kubernetes clusters?

It means constant shelling out, constant CPU and memory use, and tens of thousands of disk writes per second. With our environment caching hundreds of clusters, we were hitting bottlenecks in CPU, memory, and disk IOPS (input/output operations per second). Adding more infrastructure to scale vertically had diminishing returns, often just moving the bottleneck to some other piece of software or hardware.

[Figure: Clouddriver’s CPU usage before migrating accounts to Go Clouddriver]
[Figure: Clouddriver’s memory usage before migrating accounts to Go Clouddriver]
[Figure: Disk throughput and IOPS on a single machine running OSS Clouddriver (we had 15 instances, each on its own machine)]

SOLVING THE SCALING PROBLEM

With plans to double our Kubernetes footprint in the coming years, and with Clouddriver alone already using 100 CPUs and 250GB of memory, solving the Kubernetes scalability problem became a top priority for our team. Spinnaker’s microservice architecture allowed us to replace Clouddriver with a new implementation, and a basic prototype showed very promising results. Delivering our solution from concept to production took about a year: several months of development work, six months of beta testing, and one month of conversion.

We named our solution Go Clouddriver. Go Clouddriver is a rewrite of the Kubernetes portion of Clouddriver in Go. We chose Go for a couple of reasons: our team is more familiar with the language, and, more importantly, it allowed us to solve some key issues with CATS.

[Figure: Go Clouddriver logo]

To solve the unmanageable scaling problem with Go Clouddriver, we needed to address:

  • shelling out to perform any operation on a Kubernetes cluster
  • disk caching of API discovery
  • performing live calls for all operations (no CATS!), including grabbing information for the Clusters page in each Spinnaker application
  • avoiding slamming a Kubernetes cluster with requests when multiple people are viewing the same Clusters page

One of the major driving factors in choosing Go was getting away from shelling out and running commands. Kubectl itself is written in Go, so we were able to combine Clouddriver’s logic for handling Kubernetes manifests with the ability to deploy them using the kubectl source code. This approach completely avoids shelling out for all requests to a Kubernetes cluster.
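
As a hedged sketch of what deploying through libraries looks like, assuming the standard client-go/apimachinery pattern rather than Go Clouddriver’s exact code: decode a YAML manifest, resolve its kind through API discovery, and create it with the dynamic client, the in-process equivalent of running kubectl create -f.

```go
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	yamlserializer "k8s.io/apimachinery/pkg/runtime/serializer/yaml"
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/discovery/cached/memory"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/restmapper"
	"k8s.io/client-go/tools/clientcmd"
)

// deploy creates the resource described by a YAML manifest without ever
// spawning a kubectl process.
func deploy(kubeconfig string, manifest []byte) error {
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return err
	}

	// Decode the manifest into an unstructured (schema-less) object.
	obj := &unstructured.Unstructured{}
	dec := yamlserializer.NewDecodingSerializer(unstructured.UnstructuredJSONScheme)
	_, gvk, err := dec.Decode(manifest, nil, obj)
	if err != nil {
		return err
	}

	// Resolve the object's kind to a server-side resource via discovery.
	dc, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		return err
	}
	mapper := restmapper.NewDeferredDiscoveryRESTMapper(memory.NewMemCacheClient(dc))
	mapping, err := mapper.RESTMapping(gvk.GroupKind(), gvk.Version)
	if err != nil {
		return err
	}

	// Create the resource with the dynamic client: the in-process
	// equivalent of kubectl create -f.
	dyn, err := dynamic.NewForConfig(config)
	if err != nil {
		return err
	}
	_, err = dyn.Resource(mapping.Resource).
		Namespace(obj.GetNamespace()).
		Create(context.TODO(), obj, metav1.CreateOptions{})
	return err
}
```

Most of what kubectl does is reachable as library code this way, which let us keep Clouddriver’s manifest-handling semantics while dropping the process-per-command overhead.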

Disk caching of resources was addressed by writing our own in-memory cache to hold API discovery results. As a result, we are no longer hitting disk IOPS limits in GCP.
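
A minimal sketch of the idea, again using client-go building blocks rather than our exact implementation: wrap the discovery client in an in-memory cache so repeated lookups never touch the filesystem.

```go
package main

import (
	"k8s.io/client-go/discovery"
	"k8s.io/client-go/discovery/cached/memory"
	"k8s.io/client-go/tools/clientcmd"
)

// cachedDiscovery returns a discovery client whose results are held in
// process memory rather than on disk as kubectl does by default.
func cachedDiscovery(kubeconfig string) (discovery.CachedDiscoveryInterface, error) {
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		return nil, err
	}
	dc, err := discovery.NewDiscoveryClientForConfig(config)
	if err != nil {
		return nil, err
	}
	// Call Invalidate() on the returned client to force fresh discovery,
	// for example after a CRD is installed on a cluster.
	return memory.NewMemCacheClient(dc), nil
}
```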

Performing live calls for all operations was quite a challenge. Each application in Spinnaker has an associated set of providers, and for Kubernetes providers we needed to supply the information that backs the Clusters page in Spinnaker. To gather it, we concurrently make calls to every cluster associated with the Spinnaker application and build the server group and load balancer responses from the results, as sketched below.
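
Here is the shape of that fan-out in Go. The listServerGroups helper and ServerGroup type are hypothetical stand-ins for the real per-cluster call and response; errgroup is one straightforward way to run the calls concurrently.

```go
package main

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// ServerGroup is a stand-in for the response shape Spinnaker expects.
type ServerGroup struct{ Account, Name string }

// listServerGroups is a hypothetical helper that performs the live call
// against a single account's cluster.
func listServerGroups(ctx context.Context, account string) ([]ServerGroup, error) {
	// ...query the cluster for deployments, replica sets, and so on.
	return nil, nil
}

// serverGroupsForApp calls every cluster behind an application
// concurrently and merges the results.
func serverGroupsForApp(ctx context.Context, accounts []string) ([]ServerGroup, error) {
	g, ctx := errgroup.WithContext(ctx)
	results := make([][]ServerGroup, len(accounts))

	for i, account := range accounts {
		i, account := i, account // capture loop variables per goroutine
		g.Go(func() error {
			sgs, err := listServerGroups(ctx, account)
			if err != nil {
				return err
			}
			results[i] = sgs
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}

	var merged []ServerGroup
	for _, sgs := range results {
		merged = append(merged, sgs...)
	}
	return merged, nil
}
```

A production version would likely tolerate partial failures, since one unreachable cluster should not blank the whole page, but the concurrency shape is the same.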

When many people are viewing the same application’s Clusters page in Spinnaker, several unnecessary calls go out to the application’s Kubernetes clusters. To address this problem, we put a small caching layer in front of the application API endpoints, which effectively solves it with a few lines of configuration.
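
A rough sketch of such a layer as plain Go HTTP middleware. Our actual layer is configuration-driven; the TTL parameter and cache key choice here are illustrative.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

type entry struct {
	body    []byte
	expires time.Time
}

// withTTLCache serves identical GET requests from a short-lived in-memory
// cache, so many viewers of the same Clusters page trigger only one round
// of live calls per TTL window.
func withTTLCache(next http.Handler, ttl time.Duration) http.Handler {
	var mu sync.Mutex
	cache := map[string]entry{}

	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			next.ServeHTTP(w, r) // only cache reads
			return
		}
		key := r.URL.String()

		mu.Lock()
		e, ok := cache[key]
		mu.Unlock()
		if ok && time.Now().Before(e.expires) {
			w.Write(e.body) // cache hit: no calls reach the clusters
			return
		}

		// Cache miss: capture the upstream response, store it, serve it.
		rec := httptest.NewRecorder()
		next.ServeHTTP(rec, r)
		body := rec.Body.Bytes()

		mu.Lock()
		cache[key] = entry{body: body, expires: time.Now().Add(ttl)}
		mu.Unlock()
		w.Write(body)
	})
}
```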

RESULTS

During December 2021, we migrated all accounts from Clouddriver OSS to Go Clouddriver. The undertaking was ambitious and a lot of work, but the results were nothing short of phenomenal. Replacing shelling out with kubectl’s source code left only a small overhead for grabbing resources and performing deployments. The in-memory cache for API discovery only made things faster. And, to our surprise, switching to live calls actually improved response times for the Clusters page in most applications.

Using Go Clouddriver we saw a 95%+ decrease in CPU and memory usage, and disk usage was eliminated entirely. Deployment times for complicated Helm charts consisting of hundreds of resources were cut in half, and the timeout issues we hit when deploying those charts through Clouddriver OSS disappeared. We now run 2 instances of Go Clouddriver comfortably instead of 15 instances of OSS (very uncomfortably). User complaints about deployments to target Kubernetes clusters have decreased dramatically. Clouddriver OSS, which we still use for App Engine, now restarts in about 2 minutes instead of about 45. Go Clouddriver starts in about 10 seconds, a great benefit from an operations perspective that allows for quick hotfixes and revisions.

Another addition in Go Clouddriver is an extended API for onboarding Kubernetes providers. We now store providers in a database instead of using Spring Cloud Config for dynamic accounts, which has made account onboarding and offboarding much simpler.

Overall, our efforts have made Spinnaker a viable solution for The Home Depot’s Continuous Deployment platform for the near future.

[Figure: Go Clouddriver’s CPU usage over a two-week period]
[Figure: Go Clouddriver’s memory usage over a two-week period]

MIGRATION TO GO CLOUDDRIVER

Go Clouddriver only supports Kubernetes, but The Home Depot still needed to support App Engine. To run Clouddriver OSS and Go Clouddriver simultaneously, the team implemented a very light proxy that routes Clouddriver requests based on account type and name. This allows us to migrate accounts between the two services as needed while maintaining Clouddriver OSS functionality for our customers.
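
The sketch below shows the idea using Go’s standard reverse proxy. The service URLs, the X-Spinnaker-Account header as a routing key, and the account-name prefix are all assumptions for illustration, not our exact routing rules.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

func main() {
	// Hypothetical in-cluster service addresses.
	ossURL, _ := url.Parse("http://clouddriver-oss:7002") // App Engine accounts
	goURL, _ := url.Parse("http://go-clouddriver:7002")   // Kubernetes accounts

	oss := httputil.NewSingleHostReverseProxy(ossURL)
	gocd := httputil.NewSingleHostReverseProxy(goURL)

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Route by account: Kubernetes accounts go to Go Clouddriver,
		// everything else (App Engine) stays on Clouddriver OSS.
		account := r.Header.Get("X-Spinnaker-Account") // assumed routing key
		if strings.HasPrefix(account, "k8s-") {        // hypothetical naming convention
			gocd.ServeHTTP(w, r)
			return
		}
		oss.ServeHTTP(w, r)
	})

	log.Fatal(http.ListenAndServe(":7002", handler))
}
```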

[Figure: Diagram of our proxy chain and Clouddriver setup]

ADDITIONAL THOUGHTS

We really like Spinnaker’s Spring Expression Language (SpEL) integration in pipelines, its deployment strategies, its multi-target deployment abilities, and its ability to integrate with so many external services. Right now, though, it lacks several things that would improve Spinnaker for admins: a built-in administrative page for users and providers, predefined ways to manage providers, and a better-designed, more efficient API. It is our hope that this article will spark discussions around what the community wants from Spinnaker as a Continuous Deployment tool.

If you are looking for a commercial solution, you may want to look at Scale Agent for Kubernetes from Armory, which was developed with this Kubernetes scale problem in mind.

If you are experiencing scaling issues with Clouddriver OSS, inspect the problem points mentioned in this article. To increase cache hits on API discovery, run a few large instances of Clouddriver OSS rather than many smaller ones. Check all bottlenecks and give Clouddriver OSS enough resources to run as responsively as possible.

At Home Depot we believe that this is an excellent opportunity for the Spinnaker community to band together and make Spinnaker the CD tool of choice for the future.

If you want to see if Go Clouddriver will work for you, join in on the fun at https://github.com/homedepot/go-clouddriver.
