on Feb 6, 2022
So, as an engineer:
So how do I do that? Well, here’s what I learned!
The first version of Prefab started in 2018. A key requirement of Prefab is good support for gRPC, and I had learned previously that this can be a bit of a challenge for some proxies. Essentially, gRPC pushes into some of the edges of the HTTP/2 spec and that gives some ingress controllers trouble. When I delved into these parts in 2018, Traefik had the best out-of-the-box support.
After a bit of thinking I decided to continue with Traefik as my ingress controller.
Traefik has moved on in the intervening 4 years, however, and I was now looking at a migration to v2. What could’ve changed!?
Well… I was sad to see that the Traefik project had a change of heart about their role in SSL certificates, but to understand that we need to start at the beginning…
Once upon a time, if you wanted SSL, it went like this. First, you paid somebody $99 or so for an SSL certificate. They gave you an encrypted key as a file and then you went and placed it in a special place on all your servers and everything worked great.
Of course, the certificate only lasted 1 year. So it was pretty common to put something in your calendar for 11 months in the future called “UPGRADE THE SSL CERT YOU BOZO!”. If you got a new job in between or were on vacation… well, it wasn’t great.
The huge change to all of this came with Let’s Encrypt. Certs were now free! It was awesome! Unfortunately / fortunately, this came at the terrible price of not being able to be quite as lazy and haphazard. The Let’s Encrypt folks decided to address the cert renewal problem at the same time by being pretty clever. Telling people they should do something is one thing, but making their life really annoying until they do it is much more effective. So they decided to only issue certs good for 90 days! This worked like a charm and essentially forced us all to invest in a real system for updating the certs.
One thing I really appreciated in Traefik 1 was that it totally took care of this for you. If Traefik didn’t have a cert, it would go fetch one from Let’s Encrypt, and there was a really seamless way to get it to talk to your DNS provider so that it could prove it was authorized to do this. Literally this was ~3 lines of configuration and it was great.
You can imagine I was pretty sad to hear that Traefik 2 didn’t support this anymore. Well, technically it still does. Traefik will still go and fetch your cert from Let’s Encrypt, but with a huge, deal-breaker caveat. Traefik no longer supports sharing the certificate with other Traefik pods. Each Traefik pod is on its own.
That might not sound so bad, but I would strongly, strongly urge you not to do this. Even in staging or development or anything. The downside is terrible, because, well, let me introduce you to LetsEncrypt. The most curmudgeonly API you’ve ever hit.
You’re probably familiar with API rate limits. These are things like a cap of 10 requests/minute, and if you go over it your requests start failing for 10 minutes while the limit resets.
But you ain’t never seen a limit like LetsEncrypt’s. The limit that bites here is “5 duplicate certificates (same set of hostnames) per week”. Yikes!
So how can this bite you? Well, say you spin up 3 Traefik pods. They will all see they don’t have a certificate. Each one fetches its own (that’s 3 requests) and things are fine. Now say you change a setting or your pods die for any old reason. They’ll spin back up and try to fetch new certificates (requests 4, 5 and 6) and blam! They’ll hit the rate limit. Woe be to you if this happens, because you, my friend, are going to be on plain HTTP for the rest of the week. (Like, really, A WEEK.)
Also, just to be clear, this makes a ton of sense for them and is no doubt necessary to prevent DDoS and such. I salute their curmudgeonly ways.
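To make the arithmetic concrete, here is a small Python sketch of that scenario. It assumes a limit of 5 issuances in a sliding 7-day window and that every pod fetches its own certificate; the `IssuanceLimiter` class and the dates are made up for illustration and are not how Let’s Encrypt actually implements anything.

```python
from datetime import datetime, timedelta


class IssuanceLimiter:
    """Toy model of a 'N certificates per sliding window' limit (assumed numbers)."""

    def __init__(self, limit=5, window=timedelta(days=7)):
        self.limit = limit
        self.window = window
        self.issued = []  # timestamps of successful issuances

    def request(self, now):
        # Forget issuances that have aged out of the window.
        self.issued = [t for t in self.issued if now - t < self.window]
        if len(self.issued) >= self.limit:
            return False  # rate limited: no certificate for this pod
        self.issued.append(now)
        return True


def simulate(pods=3):
    limiter = IssuanceLimiter()
    boot = datetime(2022, 2, 1, 9, 0)  # arbitrary example date
    # First boot: every pod asks for its own certificate.
    first_boot = [limiter.request(boot) for _ in range(pods)]
    # An hour later the pods restart and ask again.
    restart = [limiter.request(boot + timedelta(hours=1)) for _ in range(pods)]
    return first_boot, restart


first_boot, restart = simulate()
print("first boot:", first_boot)
print("restart:   ", restart)
```

With 3 pods, the first boot succeeds for everyone, and the third pod of the restart is the one that gets turned away.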
Well, the key feature that we need to avoid this terrible rate limit fate is a way for our Traefik pods to share the cert file. You’d think that sharing a single < 1k file amongst the pods in your cluster would be pretty easy. It sure sounds easy. But this is where the devil in the details means software is tough: when you actually get into it, you really need a system where 1 pod gets elected the leader to go look for the cert and the others wait for that one to finish. And now that sounds a lot like Consul or some other more complex system with a Raft protocol, yadda yadda yadda. Setting Consul up for this single file was what I’d done in Traefik 1, but TBH it was a weird and annoying piece of the puzzle, and there were a handful of odd errors when my pods couldn’t really decide who was doing what.
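Here is the “one pod fetches, the others wait” idea in miniature. This is only a single-process Python sketch: a plain lock stands in for the election, threads stand in for pods, and `SharedCertStore` is a hypothetical name. It is nothing like a real Consul or Raft setup, which has to survive network partitions and crashed leaders.

```python
import threading


class SharedCertStore:
    """Fetch the certificate once; every later caller reuses the stored copy."""

    def __init__(self, fetch):
        self._fetch = fetch            # function that actually requests a cert
        self._lock = threading.Lock()  # stand-in for leader election
        self._cert = None
        self.fetch_count = 0

    def get(self):
        with self._lock:
            # Whoever gets the lock first is the "leader" and does the fetch.
            if self._cert is None:
                self._cert = self._fetch()
                self.fetch_count += 1
            # Everyone else waited on the lock and now reads the shared cert.
            return self._cert


def run_pods(n=3):
    store = SharedCertStore(fetch=lambda: "FAKE-CERT-BYTES")
    results = []

    def pod():
        results.append(store.get())

    threads = [threading.Thread(target=pod) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store.fetch_count, results


fetches, certs = run_pods(3)
print(fetches, certs)
```

Three pods, one fetch. The hard part in real life is doing this across machines, which is exactly the part I wanted somebody else’s software to own.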
I know, you just wanted HTTPS and now you’ve had to read about pods achieving consensus and electing a leader. Yeesh.
Well, turns out that there’s just 3 more pieces of software you need to install and then it will all work.
Yeah, I know. I didn’t love that either, but I’m here to tell you that I survived and that it was actually much less painful than I expected (and kinda cool).
Ok, so here are the things we need to solve:
Turns out we’re 4 helm charts away from victory: traefik, cert-manager, cert-manager-webhook-dnsimple, and kubed. We’ll walk through what they do next.
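Ok, let’s walk through the steps. First off, we’ve got to declare the certificate we need somehow. There are two ways to do this: annotate your Ingress and let cert-manager create the certificate for you, or declare a Certificate resource yourself.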
The first option here is very cool, but I’d be a bit wary of it. The main rub is that if you have your ingress being the instigator, it kinda assumes that you want the certificate to end up as a secret in that namespace. That’s pretty reasonable, but for me I have multiple namespaces that wanted to share the certificate. And I didn’t like the loss of control / determinism that was happening.
Declaring a certificate is pretty straightforward.
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-thing
  namespace: cert-manager
spec:
  dnsNames:
    - '*.my.thing'
    - 'my.thing'
  issuerRef:
    name: cert-manager-webhook-dnsimple-staging
    kind: ClusterIssuer
  secretName: my-thing-tls
But a little certificate is lost on its own. It’s not valid. Luckily we have a ClusterIssuer listening.
How did we get this issuer, you ask? We helmed it in: cert-manager itself first, and then the DNSimple webhook chart, which we told to create a staging and a production issuer. This is CRUCIAL! Do NOT change the issuerRef in your certificate to production until you feel really confident that things are set up correctly.
helm repo add jetstack https://charts.jetstack.io
helm repo update
helm install \
  cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --version v1.2.0 \
  --create-namespace \
  --set installCRDs=true
Ok, the issuer has listened to our call and has made a request, but LetsEncrypt isn’t going to give the keys to the castle to just anyone. This is where the webhook comes into its own.
A bit more helm:
helm repo add neoskop https://charts.neoskop.dev
helm install cert-manager-webhook-dnsimple \
  --namespace cert-manager \
  --set dnsimple.token='REPLACE ME' \
  --set clusterIssuer.production.enabled=true \
  --set clusterIssuer.staging.enabled=true \
  --set clusterIssuer.email='REPLACE ME' \
  neoskop/cert-manager-webhook-dnsimple
We webhook out to DNSimple and validate that we’re legit. This is represented by cert-manager’s ACME Order and Challenge objects. You don’t necessarily need to worry about these, but I highly recommend Lens. Truly, I don’t think I would’ve had a chance of understanding what was happening at the command line, but getting a realtime clickable view into all of these objects was amazing.
Ok! Our challenge has been accepted.
This starts a chain reaction: validating the order, approving the certificate request, and getting the actual certificate bytes.
And at the end of the chain these bytes end up in a Kubernetes secret.
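If you’re curious what those bytes look like once they land, here’s a minimal Python sketch. It assumes the secret is of type kubernetes.io/tls with base64-encoded tls.crt and tls.key entries, which is the standard shape for TLS secrets. The `example_secret` dict and its PEM contents are fake placeholders, standing in for what you’d get from something like `kubectl get secret my-thing-tls -o json`.

```python
import base64


def decode_tls_secret(secret):
    """Return the PEM text stored under tls.crt and tls.key."""
    if secret.get("type") != "kubernetes.io/tls":
        raise ValueError("not a TLS secret")
    data = secret["data"]
    # Secret values are base64-encoded in the API representation.
    return {
        key: base64.b64decode(data[key]).decode("ascii")
        for key in ("tls.crt", "tls.key")
    }


def _b64(text):
    return base64.b64encode(text.encode("ascii")).decode("ascii")


# Fake placeholder secret, shaped like the one cert-manager writes.
example_secret = {
    "kind": "Secret",
    "type": "kubernetes.io/tls",
    "metadata": {"name": "my-thing-tls", "namespace": "cert-manager"},
    "data": {
        "tls.crt": _b64("-----BEGIN CERTIFICATE-----\nFAKE\n-----END CERTIFICATE-----\n"),
        "tls.key": _b64("-----BEGIN PRIVATE KEY-----\nFAKE\n-----END PRIVATE KEY-----\n"),
    },
}

decoded = decode_tls_secret(example_secret)
print(decoded["tls.crt"].splitlines()[0])
```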
Now, we ain’t done yet. This secret lives in a namespace. And I bet that namespace isn’t where you are running your deployments. Now, the namespace is configurable, but I feel like trying to put it into production or staging namespaces is going to end in tears. It’s too easy to add one more namespace and now you’ve got a weird visibility problem. Better to have a standardized solution.
But what is that standard solution? Just one more helm chart I promise. Kubed!
helm repo add appscode https://charts.appscode.com/stable/
helm repo update
helm search repo appscode/kubed --version v0.12.0
helm install kubed appscode/kubed \
  --version v0.12.0 \
  --namespace kube-system
kubectl create clusterrolebinding "cluster-admin-$(whoami)" \
  --clusterrole=cluster-admin \
  --user="$(gcloud config get-value core/account)"
Kubed is a copying machine. It is always watching. Looking at your secrets. Dreaming of the day when it can copy them someplace else.
To do this, we add the annotation kubed.appscode.com/sync: "cert-manager-tls=shared" to all of the secrets that cert-manager creates, by editing the secretTemplate on the Certificate. This tells kubed: I’m a secret that wants to go to namespaces that have the label cert-manager-tls=shared.
Then edit your namespaces and give em the label and kubed magically makes it happen.
---
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: my-thing
  namespace: cert-manager
spec:
  dnsNames:
    - '*.my.thing'
    - 'my.thing'
  issuerRef:
    name: cert-manager-webhook-dnsimple-staging
    kind: ClusterIssuer
  secretName: my-thing-tls
  secretTemplate:
    annotations:
      kubed.appscode.com/sync: "cert-manager-tls=shared" # Sync certificate to matching namespaces
---
apiVersion: v1
kind: Namespace
metadata:
  name: staging
  labels:
    cert-manager-tls: shared # Define namespace label for kubed
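The matching rule is simple enough to sketch in a few lines of Python. This is my mental model of the behavior, not kubed’s actual code: the sync annotation is treated as a `key=value` label selector, and every namespace whose labels satisfy it gets a copy of the secret. The function names and the namespace data are made up for illustration.

```python
def parse_selector(selector):
    """Turn 'k1=v1,k2=v2' into {'k1': 'v1', 'k2': 'v2'}."""
    pairs = (item.split("=", 1) for item in selector.split(",") if item)
    return {key.strip(): value.strip() for key, value in pairs}


def sync_targets(sync_annotation, namespaces):
    """Names of the namespaces whose labels match the sync annotation."""
    wanted = parse_selector(sync_annotation)
    return sorted(
        name
        for name, labels in namespaces.items()
        if all(labels.get(key) == value for key, value in wanted.items())
    )


# Hypothetical cluster: two namespaces opted in, one did not.
namespaces = {
    "staging": {"cert-manager-tls": "shared"},
    "production": {"cert-manager-tls": "shared", "team": "web"},
    "kube-system": {},
}

targets = sync_targets("cert-manager-tls=shared", namespaces)
print(targets)
```

Adding one more namespace later is then just a matter of giving it the label, which is exactly the property I wanted.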
Last step is to make sure your Ingresses are asking for the correct secret name for TLS. Be aware that the tls: hosts need to be covered by the dnsNames on the certificate.
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ingress-my-thing-web-web
  namespace: staging
  labels:
    application: my-thing
    deployed-name: my-thing-web-web
  annotations:
    kubernetes.io/ingress.class: traefik
    traefik.frontend.passHostHeader: 'false'
    traefik.frontend.priority: '1'
    traefik.frontend.entryPoints: https
    traefik.protocol: http
    traefik.frontend.headers.SSLRedirect: 'true'
    traefik.docker.network: traefik
    traefik.ingress.kubernetes.io/router.entrypoints: websecure
    traefik.ingress.kubernetes.io/router.tls: 'true'
spec:
  rules:
    - host: www.my.thing
      http:
        paths:
          - path: "/"
            pathType: Prefix
            backend:
              service:
                name: my-thing-web-web
                port:
                  name: http
  tls:
    - hosts:
        - www.my.thing
      secretName: my-thing-tls # same name as secretName on the Certificate; kubed copies the secret under that name
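The host matching is worth a quick sanity check, because wildcards only cover a single label: `*.my.thing` covers `www.my.thing` but neither `my.thing` itself nor `a.b.my.thing`. A rough Python sketch of that rule, with hypothetical helper names:

```python
def covers(dns_name, host):
    """True if a certificate dnsName covers the given host."""
    dns_name, host = dns_name.lower(), host.lower()
    if dns_name.startswith("*."):
        suffix = dns_name[1:]  # e.g. ".my.thing"
        if not host.endswith(suffix):
            return False
        label = host[: -len(suffix)]
        # The wildcard stands for exactly one non-empty label.
        return bool(label) and "." not in label
    return dns_name == host


def uncovered_hosts(dns_names, tls_hosts):
    """Hosts from an Ingress tls: section that no dnsName covers."""
    return [
        host
        for host in tls_hosts
        if not any(covers(name, host) for name in dns_names)
    ]


dns_names = ["*.my.thing", "my.thing"]
missing = uncovered_hosts(dns_names, ["www.my.thing", "my.thing", "a.b.my.thing"])
print(missing)
```

This is why the certificate above lists both `*.my.thing` and `my.thing`.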
Or, well, I did it. But maybe this will help you do it.
In either case, that’s it! SSL with Traefik V2 in 89 short steps.
In truth, I’m actually darn pleased with the outcome here. It’s a lot of mental overhead getting it set up, but the pieces themselves are very boring and there is very little actual glue code on my end. Truly, I’m just putting a certificate object in, specifying the name of the secret, and then referencing that secret in the Ingress file. Those feel like the correct inputs to have.
Now, how you actually get those ingress files deployed… well. That’s a whole other rant about the missing Kubernetes deployer. You can see our open-source take in pfab.