Enabling IRSA for ServiceAccounts in Kops
When I first started looking into adding IRSA (IAM Roles for Service Accounts) support to my Kops-managed Kubernetes clusters, I found the documentation to be extremely confusing. I wanted to share my journey and the steps I took to successfully enable IRSA support, configure my cluster, and test the setup.
I decided to do that because we faced the same issue as described in the Kiam repo: random 502 errors from the /latest/api/token endpoint, which broke our pipeline.
Enable OIDC and face the dots dilemma
The first step I took was to configure the serviceAccountIssuerDiscovery section in the Kops cluster configuration.
Dots in S3 bucket names are widely used in our Terraform modules to add Java-style prefixes like ee.hdo.beta. I wanted to use the same approach for the OIDC public bucket, which turned out to be a bit puzzling at first: dotted bucket names are not allowed here. The reason is that IRSA relies on the OpenID Connect (OIDC) discovery endpoint, and dots in the hostname can break SSL certificate validation.
serviceAccountIssuerDiscovery:
discoveryStore: s3://ee.hdo.beta.irsa.providers
enableAWSOIDCProvider: true
When I tried that, an error was displayed in the YAML file header:
# error populating cluster spec: spec.serviceAccountIssuerDiscovery.serviceAccountIssuerDiscovery.discoveryStore: Invalid value: "s3://ee.hdo.beta.irsa.providers": Bucket name cannot contain dots
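The restriction makes sense once you look at how TLS wildcard matching works. Below is a simplified shell illustration (glob patterns standing in for RFC 6125 single-label wildcard matching; this is an approximation, not real TLS code): a dashed bucket name stays within the single label covered by S3's *.s3.eu-west-1.amazonaws.com certificate, while a dotted name introduces extra labels and fails.

```shell
# Simplified sketch: a TLS wildcard covers exactly one DNS label, so the
# S3 wildcard certificate *.s3.eu-west-1.amazonaws.com matches a dashed
# bucket name but not a dotted one. The pattern ordering below approximates
# that rule with shell globs.
matches_s3_wildcard() {
  case "$1" in
    *.*.s3.eu-west-1.amazonaws.com) echo "no match" ;; # dots add extra labels
    *.s3.eu-west-1.amazonaws.com)   echo "match" ;;    # single label before .s3
    *)                              echo "no match" ;;
  esac
}
matches_s3_wildcard "hdo-beta-irsa-providers.s3.eu-west-1.amazonaws.com"    # prints "match"
matches_s3_wildcard "ee.hdo.beta.irsa.providers.s3.eu-west-1.amazonaws.com" # prints "no match"
```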
I ended up with the next configuration:
serviceAccountIssuerDiscovery:
discoveryStore: s3://hdo-beta-irsa-providers
enableAWSOIDCProvider: true
It is important to note that the bucket must not block public access.
The serviceAccountIssuerDiscovery configuration establishes a trust relation between AWS IAM and the certificates issued by Kops. For that, Kops requires a publicly available store. The keys previously issued by Kops were replaced with a new pair, and the public part was stored in S3. This way, tokens stored in ServiceAccount secrets can be validated by an external agent (AWS, in our case) using the asymmetric JWT key pair.
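To see what AWS actually validates, you can decode the payload of such a token. The sketch below builds a fake, hand-assembled sample token (in a real pod the token is mounted by kubelet); the decoding steps are the same for a real one.

```shell
# Fake sample payload with the two claims AWS checks against the trust
# policy: the issuer (iss) and the ServiceAccount subject (sub).
payload='{"iss":"https://hdo-beta-irsa-providers.s3.eu-west-1.amazonaws.com","sub":"system:serviceaccount:default:my-cool-service-account"}'
# A JWT is three base64url segments joined by dots: header.payload.signature
token="header.$(printf '%s' "$payload" | base64 | tr -d '=\n' | tr '+/' '-_').signature"

# Extract the middle segment, convert base64url back to base64, restore padding
seg=$(printf '%s' "$token" | cut -d. -f2 | tr '_-' '/+')
case $(( ${#seg} % 4 )) in 2) seg="$seg==" ;; 3) seg="$seg=" ;; esac
printf '%s' "$seg" | base64 -d   # prints the JSON claims
```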
Keeping in mind that enabling serviceAccountIssuerDiscovery at runtime causes all ServiceAccounts to regenerate their secrets, I handled it after a rolling cluster update.
Warning: Enabling the following configuration on an existing cluster can be disruptive, because the control plane starts provisioning tokens with a different issuer. The symptom is that Pods are unable to authenticate to the Kubernetes API. To resolve this, delete the ServiceAccount token Secrets that exist in the cluster and kill all Pods that are unable to authenticate.
- serviceAccountIssuer: https://api.internal.k8s.beta.hdo.ee
- serviceAccountJWKSURI: https://api.internal.k8s.beta.hdo.ee/openid/v1/jwks
+ serviceAccountIssuer: https://hdo-beta-irsa-providers.s3.eu-west-1.amazonaws.com
+ serviceAccountJWKSURI: https://hdo-beta-irsa-providers.s3.eu-west-1.amazonaws.com/openid/v1/jwks
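For reference, the discovery document served from the new issuer at /.well-known/openid-configuration looks roughly like this (field set abbreviated; the URLs follow from the config above):

```json
{
  "issuer": "https://hdo-beta-irsa-providers.s3.eu-west-1.amazonaws.com",
  "jwks_uri": "https://hdo-beta-irsa-providers.s3.eu-west-1.amazonaws.com/openid/v1/jwks",
  "response_types_supported": ["id_token"],
  "subject_types_supported": ["public"],
  "id_token_signing_alg_values_supported": ["RS256"]
}
```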
Dealing with cert-manager
Next, I needed to enable certManager in my Kops cluster configuration. However, I was already managing cert-manager using Helm, so I set the managed field to false.
serviceAccountIssuerDiscovery:
discoveryStore: s3://hdo-beta-irsa-providers
enableAWSOIDCProvider: true
certManager:
enabled: true
managed: false
The podIdentityWebhook magic
I then enabled the podIdentityWebhook in my Kops cluster configuration. That service registers a mutating webhook that patches pods, adding the environment variables necessary for assuming IAM roles.
serviceAccountIssuerDiscovery:
discoveryStore: s3://hdo-beta-irsa-providers
enableAWSOIDCProvider: true
certManager:
enabled: true
managed: false
podIdentityWebhook:
enabled: true
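To make this concrete, the webhook mutates pods whose ServiceAccount carries a role annotation by injecting roughly the following into the pod spec (the role ARN and account ID here are placeholders):

```yaml
env:
  - name: AWS_ROLE_ARN
    value: arn:aws:iam::123456789012:role/my-cool-role
  - name: AWS_WEB_IDENTITY_TOKEN_FILE
    value: /var/run/secrets/eks.amazonaws.com/serviceaccount/token
volumes:
  - name: aws-iam-token
    projected:
      sources:
        - serviceAccountToken:
            audience: sts.amazonaws.com
            expirationSeconds: 86400
            path: token
```

The AWS SDKs pick these two environment variables up automatically and call sts:AssumeRoleWithWebIdentity with the projected token.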
Enabling IRSA for managed addons
By default, Kops adds the additional permissions required by managed addons to the node roles; with this setting it instead uses dedicated roles attached to ServiceAccounts.
iam:
useServiceAccountExternalPermissions: true
serviceAccountIssuerDiscovery:
discoveryStore: s3://hdo-beta-irsa-providers
enableAWSOIDCProvider: true
certManager:
enabled: true
managed: false
podIdentityWebhook:
enabled: true
Updating the cluster and applying changes
With my Kops configuration complete, I updated the Terraform code:
kops update cluster --target=terraform --out=update --yes
I carefully checked the diff between the update folder and my Terraform content before applying the changes using terraform apply.
Rotating control plane nodes
After applying the Terraform state to the target cluster, I rotated the control plane nodes by draining them. This was crucial because enabling serviceAccountIssuerDiscovery caused all ServiceAccounts to regenerate their tokens.
Since the old ServiceAccount tokens are invalid, aws-iam-authenticator stops working: tokens generated by the new service account issuer aren't compatible with the previous aws-iam-authenticator service account token, so its pod can't issue new auth tokens to the K8s API.
I needed to issue static credentials to access the cluster:
kops export kubeconfig --auth-plugin --admin=5h
After that, I needed to delete all Secrets with the type kubernetes.io/service-account-token and then delete the pods.
kubectl get secrets -A -o json \
| jq '.items[] | select(.type=="kubernetes.io/service-account-token") | "kubectl delete secret \(.metadata.name) -n \(.metadata.namespace)"' \
| xargs -n 1 bash -c
Next, I deleted all the pods to force their recreation with the new service account tokens:
kubectl get pods --all-namespaces \
| grep -v 'NAMESPACE' \
| awk '{print $1" "$2}' \
| xargs -n 2 bash -c 'kubectl delete pod -n $0 $1'
You may delete only selected pods if deleting all of them is too disruptive for your cluster.
After running these commands, my cluster began recreating the pods with the new service account tokens, and they were able to authenticate using IRSA.
Configuring IAM trust
With all the cluster activities done, I started adapting IAM roles to trust the new OIDC provider. To make sure the roles trust the newly created OIDC provider, I added an additional statement to their trust policies:
statement {
actions = ["sts:AssumeRoleWithWebIdentity"]
principals {
type = "Federated"
identifiers = [var.oidc_provider_arn]
}
condition {
variable = "${replace(var.oidc_provider_arn, "/^(.*provider/)/", "")}:sub"
values = [for sa in module.configuration.trusted_sa : "system:serviceaccount:${sa}"]
test = "StringEquals"
}
# :aud is not added because it works only with EKS
}
trusted_sa is an array of strings with concatenated namespace:sa_name values.
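For illustration (the values here are hypothetical), such a list might look like:

```hcl
# Each entry is "<namespace>:<serviceaccount name>"; the trust policy
# condition above prefixes it with "system:serviceaccount:".
trusted_sa = [
  "default:my-cool-service-account",
  "monitoring:prometheus-server",
]
```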
Full code of the Terraform role creation
resource "aws_iam_role" "this" {
name = var.name
assume_role_policy = data.aws_iam_policy_document.assume_role_policy.json
tags = {
Terraform = true
}
}
data "aws_iam_policy_document" "assume_role_policy" {
# Preserve legacy trust policy for the migration period
statement {
actions = ["sts:AssumeRole"]
dynamic "principals" {
for_each = module.configuration.identifiers
content {
type = principals.key
identifiers = principals.value
}
}
}
# Trust to the SA if they are configured
statement {
actions = ["sts:AssumeRoleWithWebIdentity"]
principals {
type = "Federated"
identifiers = [var.oidc_provider_arn]
}
condition {
variable = "${replace(var.oidc_provider_arn, "/^(.*provider/)/", "")}:sub"
values = [for sa in module.configuration.trusted_sa : "system:serviceaccount:${sa}"]
test = "StringEquals"
}
# :aud is not added because it works only with EKS
# # https://aws.amazon.com/premiumsupport/knowledge-center/eks-troubleshoot-oidc-and-irsa/?nc1=h_ls
# condition {
# variable = "${replace(var.oidc_provider_arn, "/^(.*provider/)/", "")}:aud"
# values = ["sts.amazonaws.com"]
# test = "StringEquals"
# }
}
}
Testing the setup
To test that my SA with the role assigned via the eks.amazonaws.com/role-arn annotation worked as expected, I ran a debug pod with the mounted SA, found in gist.
kubectl run --rm -it debug --image amazon/aws-cli --command bash --overrides='{ "spec": { "serviceAccount": "my-cool-service-account" } }'
Then I called the standard STS method to verify the assumed role:
aws sts get-caller-identity
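For reference, the ServiceAccount used in this test is annotated roughly like this (account ID and role name are placeholders):

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-cool-service-account
  namespace: default
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/my-cool-role
```

If everything is wired correctly, get-caller-identity reports an assumed-role ARN for this role instead of the node's instance role.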
Additional information
To get a better understanding of how IRSA works in general, you may check this incredible article. It explains, in an easy graphical form, how the trust relations work and why this way of granting permissions is better than Kiam or Kube2iam.
Post migration
I planned to migrate my workloads to use IRSA via mounted ServiceAccounts. After that, I would enforce IMDSv2 with the hop limit set to 1. That would help me protect instance-attached roles from being used by pods deployed on the same machine.
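In Kops, that enforcement can be expressed per InstanceGroup; a minimal sketch, assuming the instanceMetadata fields of the current InstanceGroup spec:

```yaml
spec:
  instanceMetadata:
    # IMDSv2 only; a hop limit of 1 keeps the node role's metadata
    # endpoint unreachable from containers with their own network namespace
    httpPutResponseHopLimit: 1
    httpTokens: required
```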
Pods with hostNetwork will be denied by Gatekeeper to enforce the security policy.