Puppet Infrastructure Performance (3 Part Series)
1 Taming Puppetserver 6: a Grafana story
2 Taming Puppetserver 6 Pt II: Garbage Collection
3 Getting Puppet Report Metrics from PuppetDB
Now that our internal Puppet Infrastructure is migrated to Puppet 6 and tuned, it was time to switch a second infra to it.
Yesterday, I migrated our second infrastructure, and started seeing more issues. The rules-of-thumb from last post were useful, but I still needed to upgrade available memory to make up for a lack of computing power (probably to be imputed to the underlying IaaS throttling virtual CPUs).
And then, a Puppetserver crashed with a GC overhead limit exceeded
error. This error happens when the CPU spends more than 98% performing garbage collection.
Analyzing Garbage Collection Data
Looking at our Grafana dashboard, I realized we had no metrics about garbage collection, so I added a graph with two metrics:
- mean time per GC: the average time taken by each garbage collection request to complete, calculated as a rate over 1 minute (since
jvm_gc_collection_seconds_sum
is a cumulative counter) - GC time: the percentage of time spent by the CPUs doing GC, over 1 minute (since
jvm_gc_collection_seconds_count
is also a cumulative counter)
The formulas are as follow:
-
time per request {{gc}}
:rate(jvm_gc_collection_seconds_sum{job="puppetserver"}[1m])/rate(jvm_gc_collection_seconds_count{job="puppetserver"}[1m])
-
rate {{gc}}
:rate(jvm_gc_collection_seconds_sum{job="puppetserver"}[1m])
GC time
I then looked at the graphs around the time when the GC overhead limit exceeded
error happened:
Yes, I had a problem indeed. I restarted the Puppetservers and this hasn’t happened since. However, the rates for PS MarkSweep have kept pretty high still. Here’s the last 15 minutes as I’m writing:
In comparison, the infrastructure I upgraded last week is faring much better, with GC rates well under 10%:
Mean time per GC
In addition to a high rate spent performing garbage collection on the PS MarkSweep GC threads, I also noticed that the mean GC time for PS MarkSweep is pretty high, too, at around 2 to 3s. The values are slightly lower (a bit under 2s) on my first infrastructure.
Getting rid of PS MarkSweep
All in all, it seems PS MarkSweep garbage collection is to blame. It tends to take lots of CPU, for long periods of time.
The good news is: PS MarkSweep is a legacy garbage collector, and it’s not too hard to get rid of it, since OpenJDK 11 replaces it with the G1 Young Generation garbage collector by default.
The official puppetserver Docker image installs the puppetserver
package, which pulls openjdk-8-jre-headless
as a dependency. OpenJDK 11 is also officially supported starting with Puppet 6.6, but the package doesn’t allow to install it instead of OpenJDK 8. So for now, I’ll just derive an image and install openjdk-11-jre-headless
in addition to OpenJDK8 and let Ubuntu update the alternative for Java automatically.
The following graph shows the difference in GC time between PS MarkSweep and G1, following the upgrade to OpenJDK 11:
And here’s what GC looks like after a good warm up:
From 2s to 80ms, that’s a great improvement if you ask me!
Puppet Infrastructure Performance (3 Part Series)
1 Taming Puppetserver 6: a Grafana story
2 Taming Puppetserver 6 Pt II: Garbage Collection
3 Getting Puppet Report Metrics from PuppetDB
暂无评论内容