I assume that you've heard of Netflix's Chaos Monkey
? It seems that chaos-based resilience testing is becoming increasingly popular these days, especially in the context of containers and folks have been asking for a Mesos-specific
tool as well.
Now, the other day I get a mail from Flo, our CEO, asking me what the effort would be and if I could take care of it. Given that both the community and my CEO seems to think that's a good thing to have I thought: well, let's give it a try.
The result is now available and I decided to name it after a Guardians of the Galaxy character (because I like Marvel) who exposes certain (in this context desirable) properties, namely his destructive manners. Meet DRAX, a DC/OS-specific resilience testing tool that works mainly on the task/ container level.
In a nutshell, DRAX runs as a Marathon app, killing off random tasks of any (non-framework) or specific (non-framework) application running in Marathon. Why this limitation to Marathon? Simple: this sort of testing is mainly interesting for long-running apps, so exactly what Marathon is used for to run.
Enough talk, let's see DRAX in action.
All you have to do to get started is either clone the GitHub repo dcos-labs/drax
or simply download DRAX's app spec
and you can launch DRAX like so (assuming you've got a DC/OS 1.7 cluster and the CLI installed):
dcos marathon app add marathon-drax.json
Once that is done you should see DRAX running in Marathon an ready to rampage
DRAX in action.
A concrete rampage, targeting any app, looks as follows:
BTW, no worries, DRAX is smart enough not to target itself.
This is only the beginning. I'd love to hear feedback, either via the DC/OS Community Slack
or directly by raising an issue
on the repo. What would you like to see next? Node-level rampages ala Chaos Monkey? Or maybe taking down an entire cluster to simulate a DR scenario? Network partitions? I'd appreciate if you use DRAX and let me know what sucks and what rocks. Thanks for your time and hope to hear from you!