This post was written by Jeff Malnick and Nicholas Parker.
This is the second part of a series on "Day 2" operations using DC/OS, specifically new capabilities set to ship with version 1.10. The first part explains what we mean by Day 2 operations and introduces readers to the DC/OS logging API. This installment is on metrics: gathering them, shipping them, and integrating with popular analytics solutions.
DC/OS has three main focus points for metrics:
- Host metrics: metrics about the host systems themselves
- Container metrics: metrics about the resource utilization of each container
- Task metrics: metrics emitted by the deployed applications within each container
Across these areas, there are endless possibilities for gathering metrics. We'll discuss how we approach each of these cases. For more information, see our recent presentation at MesosCon EU.
Host metrics such as system-wide CPU usage, memory usage, and system load are automatically collected for each system in the cluster. This information is automatically tagged with the system it came from via the agent_id tag. Because we're already running an agent on every host in the cluster, we can parse the cgroups hierarchy and publish these host metrics for free.
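As an illustrative sketch (this is not the actual DC/OS collector), here is roughly how a per-host load metric could be read on a Linux system; the path parameter exists only to make the function easy to exercise against sample data:

```python
def read_load_average(path="/proc/loadavg"):
    """Read system load averages from a Linux /proc/loadavg-style file.

    Illustrative sketch only: the real DC/OS agent collects many more
    host metrics and tags them with agent_id before publishing.
    """
    with open(path) as f:
        fields = f.read().split()
    # The first three fields are the 1-, 5-, and 15-minute load averages.
    return {
        "load.1min": float(fields[0]),
        "load.5min": float(fields[1]),
        "load.15min": float(fields[2]),
    }
```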
In addition to the metrics collected for the system as a whole, metrics are also automatically collected for each container on that system, and those data are automatically tagged to identify the container. These metrics mainly focus on resource utilization, as containers are, after all, effectively a collection of allocated resources.
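To make the resource-utilization point concrete, here is a minimal sketch of computing a container's memory utilization from cgroup v1-style files; the file layout here is an assumption (it varies by cgroup version and container runtime), and the real collector handles far more:

```python
def memory_utilization(usage_path, limit_path):
    """Compute a container's memory utilization as a fraction of its limit.

    Sketch only, assuming cgroup v1-style files such as
    memory.usage_in_bytes and memory.limit_in_bytes.
    """
    with open(usage_path) as f:
        usage = int(f.read().strip())
    with open(limit_path) as f:
        limit = int(f.read().strip())
    return usage / float(limit)
```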
While it's easy to think about collecting system-level metrics, we'd also like to allow tasks themselves to emit their own custom metrics. This opens a world of possibilities, as it effectively allows every task in the cluster to emit any measurable value at any time. These values can then be decorated automatically with metadata, such as the host the task is running on and the framework that started it, to assist in debugging.
For this case, we explicitly sought a solution that was easy to integrate with any application, written in any programming language. To this end, we expose two environment variables to each container: STATSD_UDP_HOST and STATSD_UDP_PORT. Any statsd-formatted metrics sent by the application to this environment-advertised endpoint will automatically be tagged with the originating container and forwarded to the rest of the metrics infrastructure. The Datadog StatsD tag format is also supported by this endpoint, so any such tags produced by the application will automatically be parsed out and included in the forwarded data. From the application's perspective, it only needs to check for those two environment variables and configure a local emitter to send the metrics; the rest is handled automatically.
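As a sketch of what this looks like from inside a task (the metric name and tags here are hypothetical), an application could emit a statsd counter to the advertised endpoint like this:

```python
import os
import socket

def emit_counter(name, value=1, tags=None):
    """Send a statsd counter to the endpoint DC/OS advertises, if present."""
    host = os.environ.get("STATSD_UDP_HOST")
    port = os.environ.get("STATSD_UDP_PORT")
    if not host or not port:
        return  # Not running under the DC/OS metrics stack; silently skip.
    # statsd counter format, with optional Datadog-style tags: name:value|c|#tag:val
    payload = "%s:%d|c" % (name, value)
    if tags:
        payload += "|#" + ",".join("%s:%s" % kv for kv in tags.items())
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload.encode("utf-8"), (host, int(port)))
    sock.close()

# Hypothetical metric; a no-op outside a DC/OS container.
emit_counter("requests.handled", 1, {"endpoint": "health"})
```

The same pattern works in any language with a UDP socket API, which is exactly why we chose statsd for this interface.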
To see some examples of this integration in practice, see how it's done by the DC/OS Cassandra and DC/OS Kafka services. In each case, it was straightforward to configure even third-party code to support DC/OS metrics, without needing to touch the code itself for either service (except for adding the .jars necessary to support emitting StatsD).
In order to support easy drill-down, filtering and grouping of metrics data, all data sent through the DC/OS metrics stack is automatically tagged with its origin. The tags include the following, but this list is expected to grow over time:
- Container identification (for all container and application metrics): container_id, executor_id, framework_id, framework_name
- Application identification (e.g., for Marathon apps): application_name
- System identification: agent_id
For example, grouping metrics by agent_id would allow an administrator to detect a situation that's system-specific (e.g., a faulty disk). Meanwhile, grouping data by framework_id would allow them to detect which Mesos frameworks are using the most resources in the system. Grouping that data by container_id would show the same information, except with per-container granularity.
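The grouping described above can be sketched in a few lines; the datapoint structure here is illustrative, not the DC/OS wire format:

```python
from collections import defaultdict

# Hypothetical datapoints, already decorated with tags by the metrics stack.
datapoints = [
    {"name": "mem.used", "value": 512, "tags": {"agent_id": "agent-1", "framework_id": "fw-a"}},
    {"name": "mem.used", "value": 256, "tags": {"agent_id": "agent-2", "framework_id": "fw-a"}},
    {"name": "mem.used", "value": 1024, "tags": {"agent_id": "agent-1", "framework_id": "fw-b"}},
]

def total_by_tag(points, tag):
    """Sum metric values grouped by a tag, e.g. per-agent or per-framework."""
    totals = defaultdict(float)
    for point in points:
        totals[point["tags"][tag]] += point["value"]
    return dict(totals)

per_host = total_by_tag(datapoints, "agent_id")        # system-specific view
per_framework = total_by_tag(datapoints, "framework_id")  # per-framework view
```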
Now that metrics have been collected from the applications and from the host itself, they need to be forwarded to a customer-managed location so that the customer can consume them. As with all our Day 2 operations APIs, our end goal is ease of integration with popular stacks and solutions. To further that goal, we currently support two widely used methods for outputting the metrics data from the cluster:
- Kafka service (either in DC/OS itself, or external to the cluster). This method provides widely understood performance and maintenance characteristics, as well as good throughput for larger clusters.
- StatsD service (e.g., dogstatsd or the original Etsy statsd). This is a lighter-weight solution for smaller clusters, where running a Kafka instance may be overkill. This method optionally supports outputting Datadog-format tags.
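As a hint at how simple the StatsD side is to consume, here is a minimal sketch of parsing a statsd line carrying Datadog-format tags; a real consumer such as dogstatsd does much more:

```python
def parse_statsd_line(line):
    """Parse a statsd metric line, including optional Datadog-style '#tag:val' tags.

    Illustrative only; the metric name and tag values in the example
    below are hypothetical.
    """
    fields = line.split("|")
    name, _, raw_value = fields[0].partition(":")
    metric_type = fields[1]
    tags = {}
    for extra in fields[2:]:
        if extra.startswith("#"):
            for tag in extra[1:].split(","):
                key, _, val = tag.partition(":")
                tags[key] = val
    return {"name": name, "value": float(raw_value), "type": metric_type, "tags": tags}

parse_statsd_line("requests.handled:1|c|#framework_name:cassandra,agent_id:agent-1")
```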
In either case, it's straightforward to consume the output metrics data and forward it to any customer-managed or third-party monitoring infrastructure.
We hope this is a helpful and informative look at some of the new features coming to DC/OS. Keep an eye on our blog for Part 3 of this series, which will dive deeper into DC/OS debugging. For more information about what's possible in DC/OS today, check out the project documentation.