Hardening Kubernetes on the DCOS with etcd-mesos
As part of our work to run Kubernetes on the Mesosphere DCOS, we built etcd-mesos, a system for maintaining an etcd cluster that can withstand devastating failures.
Kubernetes is a powerful tool for managing application containers. It provides opinionated solutions for deployment and service discovery, allowing you to spend more of your time building great products instead of figuring out how to get them into production. But before you can begin experimenting with its capabilities, you have to get Kubernetes itself into production.
If you want to use Kubernetes, you first have to learn:
- How to configure networking.
- How to set up etcd.
- How to replace a failed Kubernetes API server, scheduler or controller manager.
With the Mesosphere DCOS, we are building a technology that allows someone with zero operational experience to deploy battle-hardened distributed systems, such as Cassandra, Kafka or Kubernetes, with the push of a button. etcd-mesos simplifies the etcd setup process and reduces the recovery time for many types of failures.
etcd-mesos can even recover from failures that take out the majority of the cluster. Without etcd-mesos, these failures require an engineer to manually perform an emergency backup-and-restore of the cluster (see this etcd recovery document). etcd-mesos acts as a cautious-yet-efficient operator, taking preventative action where possible and recovering from unhealthy states when it encounters them.
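To make the distinction between ordinary failures and majority failures concrete, here is a minimal sketch in Go of the decision an automated operator faces. It is not the etcd-mesos implementation: the `Action` type, its values, and the `Decide` function are hypothetical names chosen for illustration. The only fact it relies on is that an etcd cluster of n members needs floor(n/2) + 1 healthy members to keep accepting writes.

```go
package main

import "fmt"

// Action is a hypothetical label for the next step an operator would take.
// These names are illustrative and do not come from etcd-mesos.
type Action string

const (
	NoOp                  Action = "no-op"
	ReplaceMember         Action = "replace-failed-member"
	RecoverFromQuorumLoss Action = "recover-from-quorum-loss"
)

// Quorum returns how many members must be healthy for an etcd cluster
// of the given size to accept writes: floor(size/2) + 1.
func Quorum(size int) int {
	return size/2 + 1
}

// Decide picks the next action from the desired cluster size and the
// number of members currently healthy.
func Decide(desired, healthy int) Action {
	switch {
	case healthy >= desired:
		// Every member is healthy; nothing to do.
		return NoOp
	case healthy >= Quorum(desired):
		// A majority survives, so the cluster can still be reconfigured
		// and a replacement member can be added.
		return ReplaceMember
	default:
		// The majority is gone; the cluster cannot reconfigure itself.
		// This is the case that otherwise needs a manual backup-and-restore.
		return RecoverFromQuorumLoss
	}
}

func main() {
	const desired = 5
	for _, healthy := range []int{5, 3, 2} {
		fmt.Printf("desired=%d healthy=%d quorum=%d -> %s\n",
			desired, healthy, Quorum(desired), Decide(desired, healthy))
	}
}
```

With a desired size of 5 the quorum is 3, so losing two members still allows an in-place replacement, while losing three falls into the quorum-loss branch. A real operator would also have to decide which surviving member's data to trust, which this sketch leaves out.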
Our testing regime
As part of our hardening process, we have sacrificed thousands of clusters to fault injection under close monitoring. Our engineers, who have accumulated invaluable experience running some of the largest database installations on Earth, drive this process. Along the way we found several bugs in etcd itself, which were rapidly fixed by Xiang Li from CoreOS.
We know that our testing assumptions are insufficient to bet your business on, so we are excited to announce etcd-mesos as alpha-level open source software. You can try to break it and benefit from it! We look forward to your feedback and we are hopeful that you will invalidate some of our assumptions in the coming weeks, as the system continues to be hardened.
This is just one part of our work to make Kubernetes an effortless experience for users of the DCOS. With a rock-solid source of truth, we can get a higher return on work to harden and optimize other parts of the system.
Check it out, try to break it, have fun and let us know how it goes!