on Sep 12, 2022
So you’ve got an app and you want to monitor how many streaming connections you have. You’ve got Datadog as a metrics collector, so it feels like you should be a line or two of code away from a solution.
But here we are.
Micrometer has various types of metrics: Counters, Timers, and so on. But if you want a basic “what is the level of X over time”, a Gauge is your answer.
Here’s a basic example of using a Gauge. This is a Micronaut example, but it’s pretty generalizable.
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Tags;
import io.micronaut.scheduling.annotation.Scheduled;
import jakarta.inject.Inject;
import jakarta.inject.Singleton;
import java.util.concurrent.atomic.AtomicInteger;

@Singleton
public class ConfigStreamMetrics {

  private final AtomicInteger projectConnections;

  @Inject
  public ConfigStreamMetrics(MeterRegistry meterRegistry) {
    // gauge() returns the object it watches; hold on to it and
    // update it whenever the value changes.
    projectConnections =
        meterRegistry.gauge(
            "config.broadcast.project-connections",
            Tags.empty(),
            new AtomicInteger());
  }

  @Scheduled(fixedDelay = "1m")
  public void recordConnections() {
    // calculateConnections() is our own method that counts the open streams
    projectConnections.set(calculateConnections());
  }
}
Ok, with that code in place, and feeling pretty sure that calculateConnections()
was returning a consistent value, you can imagine how I felt looking at this:
What is happening here? The gauge is all over the place. It made sense to me that taking the avg was going to be wrong: if I have 2 servers, I don’t want the average of the gauge on each of them, I want the sum. But that doesn’t explain what’s happening here.
The key is remembering how StatsD with tagging works, and discovering some surprising behavior in a default Datadog setup.
Metrics from Micrometer come out looking like:
config.broadcast.project-connections.connections:0|g|#statistic:value,type:grpc
As an aside, while you’re trying to get this all working, I’d highly recommend running a quick local clone of git@github.com:etsy/statsd.git configured to just output to stdout.
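If you don’t feel like cloning statsd, a throwaway UDP listener gets you the same visibility. This little class is my own sketch (it isn’t part of statsd or Micrometer); it just prints whatever shows up on the standard StatsD port:
import java.net.DatagramPacket;
import java.net.DatagramSocket;
import java.nio.charset.StandardCharsets;

public class StatsdSniffer {
  public static void main(String[] args) throws Exception {
    // Listen on StatsD's default UDP port and dump every datagram to stdout
    // so you can see the exact lines Micrometer is emitting.
    try (DatagramSocket socket = new DatagramSocket(8125)) {
      byte[] buf = new byte[8192];
      while (true) {
        DatagramPacket packet = new DatagramPacket(buf, buf.length);
        socket.receive(packet);
        System.out.println(new String(
            packet.getData(), packet.getOffset(), packet.getLength(),
            StandardCharsets.UTF_8));
      }
    }
  }
}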
The “aha” is that all of these metrics get aggregated based on just that string. So if you have
Server 1:
config.broadcast.project-connections.connections:99|g|#statistic:value,type:grpc
Server 2:
config.broadcast.project-connections.connections:0|g|#statistic:value,type:grpc
A gauge is expecting a single value at any given point in time, so what we end up with here is a heisengauge that could be either 0 or 99. Our sum doesn’t work, because we don’t have two data points to sum across. We just have one value that is flapping back and forth.
Now we know what’s up, but it’s a sad state of affairs. This is not what we want: each host should be reporting its own value.
It turns out that the Micronaut Datadog reporter (https://micronaut-projects.github.io/micronaut-micrometer/latest/guide/#metricsAndReportersDatadog) hits Datadog directly, not my local Datadog agent. Since it goes straight there and we aren’t explicitly sending a host tag, these metrics clobber each other. One fix would be to route the metrics through the local agent instead, since the agent attaches the host information for you; there’s a rough sketch of that below.
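For completeness, here is roughly what that agent route looks like with Micrometer’s plain StatsD registry pointed at the agent’s DogStatsD port. Treat this as a minimal sketch of the idea, not what we shipped; in Micronaut you’d normally wire this up with the statsd registry module and configuration rather than by hand.
import io.micrometer.core.instrument.Clock;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.statsd.StatsdConfig;
import io.micrometer.statsd.StatsdFlavor;
import io.micrometer.statsd.StatsdMeterRegistry;

public class LocalAgentRegistryFactory {

  public static MeterRegistry create() {
    StatsdConfig config = new StatsdConfig() {
      @Override
      public String get(String key) {
        return null; // use the defaults for everything else
      }

      @Override
      public StatsdFlavor flavor() {
        return StatsdFlavor.DATADOG; // emit DogStatsD-style lines
      }

      @Override
      public String host() {
        return "localhost"; // the local Datadog agent
      }

      @Override
      public int port() {
        return 8125; // DogStatsD's default UDP port
      }
    };
    // The agent attaches its own hostname when it forwards metrics,
    // so each host's gauge stays a separate series.
    return new StatsdMeterRegistry(config, Clock.SYSTEM);
  }
}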
The other solution (the one we went with) is to calculate the same Datadog hostname that the Datadog agent uses and manually add it as a commonTag to our MeterRegistry.
import io.micrometer.core.instrument.Tag;
import io.micrometer.datadog.DatadogMeterRegistry;
import io.micronaut.configuration.metrics.aggregator.MeterRegistryConfigurer;
import io.micronaut.configuration.metrics.annotation.RequiresMetrics;
import io.micronaut.context.annotation.Property;
import io.micronaut.core.annotation.Order;
import io.micronaut.core.order.Ordered;
import jakarta.inject.Singleton;
import java.util.ArrayList;
import java.util.List;

@Order(Integer.MAX_VALUE)
@Singleton
@RequiresMetrics
public class MetricFactory
    implements MeterRegistryConfigurer<DatadogMeterRegistry>, Ordered {

  @Property(name = "gcp.project-id")
  protected String projectId;

  @Override
  public void configure(DatadogMeterRegistry meterRegistry) {
    List<Tag> tags = new ArrayList<>();
    addIfNotNull(tags, "env", "MICRONAUT_ENVIRONMENTS");
    addIfNotNull(tags, "service", "DD_SERVICE");
    addIfNotNull(tags, "version", "DD_VERSION");

    // Build the same hostname the Datadog agent reports: <node name>.<gcp project id>
    if (System.getenv("SPEC_NODENAME") != null) {
      final String hostName =
          "%s.%s".formatted(System.getenv("SPEC_NODENAME"), projectId);
      tags.add(Tag.of("host", hostName));
    }
    meterRegistry.config().commonTags(tags);
  }

  private void addIfNotNull(List<Tag> tags, String tagName, String envVar) {
    if (System.getenv(envVar) != null) {
      tags.add(Tag.of(tagName, System.getenv(envVar)));
    }
  }

  @Override
  public Class<DatadogMeterRegistry> getType() {
    return DatadogMeterRegistry.class;
  }
}
Passing the node name in required a bit of Kubernetes YAML work:
spec:
  containers:
    - image: gcr.io/-----
      name: -----------
      env:
        - name: SPEC_NODENAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName