Jun 19, 2016

Michael Hausenblas


I assume that you've heard of Netflix's Chaos Monkey? It seems that chaos-based resilience testing is becoming increasingly popular these days, especially in the context of containers, and folks have been asking for a Mesos-specific tool as well.
Now, the other day I got a mail from Flo, our CEO, asking me what the effort would be and if I could take care of it. Given that both the community and my CEO seem to think that's a good thing to have, I thought: well, let's give it a try.
The result is now available, and I decided to name it after a Guardians of the Galaxy character (because I like Marvel) who exhibits certain (in this context desirable) properties, namely his destructive manners. Meet DRAX, a DC/OS-specific resilience testing tool that works mainly on the task/container level.
In a nutshell, DRAX runs as a Marathon app and kills off random tasks, either of any (non-framework) application running in Marathon or of one specific (non-framework) application. Why this limitation to Marathon? Simple: this sort of testing is mainly interesting for long-running apps, which is exactly what Marathon is used to run.
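To make the selection idea concrete, here is a minimal Python sketch of picking random victims. It is an illustration only, not DRAX's actual code: the function name pick_goners, the dict-of-lists data shape, the excluded_app_ids parameter, and the count parameter are all assumptions made for this example.

```python
import random


def pick_goners(tasks_by_app, count, excluded_app_ids, target_app=None, rng=None):
    """Pick up to `count` random task IDs to kill (hypothetical helper).

    tasks_by_app: dict mapping a Marathon app ID to its list of task IDs.
    excluded_app_ids: apps that must never be hit, e.g. frameworks and
        the chaos tool's own app.
    target_app: if given, only tasks of this one app are candidates.
    """
    if rng is None:
        rng = random.Random()

    candidates = []
    for app_id, task_ids in tasks_by_app.items():
        if app_id in excluded_app_ids:
            continue  # never target protected apps
        if target_app is not None and app_id != target_app:
            continue  # specific-app mode: skip everything else
        candidates.extend(task_ids)

    # Never ask for more tasks than are actually available.
    return rng.sample(candidates, min(count, len(candidates)))
```

Calling pick_goners(tasks, 2, excluded_app_ids={"/drax"}) corresponds to the "any app" mode, while passing target_app narrows the rampage to a single application.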
Enough talk, let's see DRAX in action.
In action
All you have to do to get started is either clone the GitHub repo dcos-labs/drax or simply download DRAX's app spec. You can then launch DRAX like so (assuming you've got a DC/OS 1.7 cluster and the CLI installed):
dcos marathon app add marathon-drax.json
Once that is done, you should see DRAX running in Marathon and ready to rampage:
DRAX in action.
A concrete rampage, targeting any app, looks as follows:
$ http POST $PUBLIC_AGENT:7777/rampage
HTTP/1.1 200 OK
Content-Length: 121
Content-Type: application/javascript
Date: Mon, 13 Jun 2016 12:15:19 GMT

{"success":true,"goners":["webserver.0fde0035-315f-11e6-aad0-1e9bbbc1653f","dummy.11a7c3bb-315f-11e6-aad0-1e9bbbc1653f"]}
BTW, no worries, DRAX is smart enough not to target itself.
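If you'd rather trigger a rampage from a script than from the command line, a small Python sketch could look like the following. It uses only the standard library and assumes nothing beyond what the example shows: a POST to /rampage on port 7777 and a JSON reply with "success" and "goners" fields. The helper names are made up for this example, and deriving the app name from the part of the task ID before the last dot is an assumption based on the IDs in the sample output.

```python
import json
import urllib.request


def parse_rampage_response(body):
    """Return the list of killed task IDs from a rampage response body."""
    doc = json.loads(body)
    if not doc.get("success"):
        raise RuntimeError("rampage did not report success")
    return doc.get("goners", [])


def app_names(goners):
    """Guess app names from task IDs shaped like '<app>.<uuid>' (assumption)."""
    return [task_id.rsplit(".", 1)[0] for task_id in goners]


def trigger_rampage(base_url, timeout=10.0):
    """POST to <base_url>/rampage and return the killed task IDs."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/rampage", data=b"", method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_rampage_response(response.read().decode("utf-8"))
```

With a running DRAX instance you would call trigger_rampage with the public agent's address, for example "http://" followed by the value of $PUBLIC_AGENT and ":7777".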
Next steps
This is only the beginning. I'd love to hear feedback, either via the DC/OS Community Slack or directly by raising an issue on the repo. What would you like to see next? Node-level rampages à la Chaos Monkey? Or maybe taking down an entire cluster to simulate a DR scenario? Network partitions? I'd appreciate it if you tried DRAX and let me know what sucks and what rocks. Thanks for your time, and I hope to hear from you!

