Kubernetes operators best practices: understanding conflict errors

One of the most prevalent errors one encounters when developing an operator is ‘the object has been modified; please apply your changes to the latest version and try again’. Please raise your hand if you’ve seen this in your logs… Yeah, that’s what I thought :-)

To understand why this happens, it’s worth noting that the API server uses optimistic concurrency control to deal with possible conflicts between two users updating the same resource at the same time. This makes it possible to avoid the locking that would otherwise be required. It relies on a field called `resourceVersion`. As per the API conventions, this field is opaque: clients should fetch it from the server and pass it back unchanged when updating. Every Kubernetes resource has this field as part of ObjectMeta.

When you provide `resourceVersion` to the API server on update, it compares the value you sent with the one currently stored in etcd. If the stored value has changed since you read the resource, the update is rejected with a conflict error (HTTP 409), and you’ll need to re-fetch the resource and try again.
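
To make the check concrete, here is a minimal in-memory sketch of optimistic concurrency. It is not how the API server is written: `Store`, `Object`, and `ErrConflict` are hypothetical names invented for this illustration, and the mutex only makes the toy server's compare-and-write atomic (clients never hold a lock between reading and writing).

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"sync"
)

// ErrConflict stands in for the API server's conflict response (hypothetical name).
var ErrConflict = errors.New("the object has been modified; please apply your changes to the latest version and try again")

// Object is a hypothetical, stripped-down resource.
type Object struct {
	ResourceVersion string // opaque to clients: read it, send it back, never interpret it
	Replicas        int
}

// Store is a toy stand-in for the API server plus etcd, holding a single object.
type Store struct {
	mu       sync.Mutex
	revision int64
	current  Object
}

// NewStore returns a store whose object starts at version "1".
func NewStore() *Store {
	return &Store{revision: 1, current: Object{ResourceVersion: "1"}}
}

// Get returns a copy of the stored object, including its current version.
func (s *Store) Get() Object {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.current
}

// Update writes obj only if its ResourceVersion matches the stored one.
func (s *Store) Update(obj Object) (Object, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if obj.ResourceVersion != s.current.ResourceVersion {
		return Object{}, ErrConflict // the caller's copy is stale
	}
	s.revision++
	obj.ResourceVersion = strconv.FormatInt(s.revision, 10)
	s.current = obj
	return obj, nil
}

func main() {
	store := NewStore()

	// Two clients read the same version of the object.
	a := store.Get()
	b := store.Get()

	a.Replicas = 3
	_, err := store.Update(a)
	fmt.Println("first update error:", err)

	b.Replicas = 5 // b still carries the old version
	_, err = store.Update(b)
	fmt.Println("second update error:", err)
}
```

In this sketch the first update succeeds and the second is rejected, because `b` was read before `a` was written. Re-reading with `Get` and applying the change to the fresh copy would succeed.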

If we go one level deeper, this maps nicely onto how etcd is implemented. Looking at the etcd API, the store keeps a 64-bit integer called the revision, which is incremented on every modification, and each key records the revision at which it was last modified. The `resourceVersion` provided by the API server maps to that value, and this is also what the implementation uses underneath. You can see this for yourself in the apiserver implementation.
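
The relationship between a store-wide revision and a per-key modification revision can be sketched with a toy key-value store. This is an illustration only, not etcd's API: `KV`, `KeyValue`, `Put`, and `PutIfUnmodified` are hypothetical names, and the key paths are made up for the example.

```go
package main

import "fmt"

// KeyValue models a small subset of what a revisioned store records per key.
type KeyValue struct {
	Value       string
	ModRevision int64 // store revision at which this key was last written
}

// KV is a toy, single-threaded model of a revisioned key-value store.
type KV struct {
	revision int64 // store-wide counter, bumped on every write
	data     map[string]KeyValue
}

// NewKV returns an empty store at revision 0.
func NewKV() *KV { return &KV{data: map[string]KeyValue{}} }

// Put writes unconditionally and returns the new store revision.
func (kv *KV) Put(key, value string) int64 {
	kv.revision++
	kv.data[key] = KeyValue{Value: value, ModRevision: kv.revision}
	return kv.revision
}

// PutIfUnmodified writes only if the key was last modified at the expected
// revision. On a mismatch it returns the key's actual ModRevision and false.
func (kv *KV) PutIfUnmodified(key, value string, expected int64) (int64, bool) {
	if actual := kv.data[key].ModRevision; actual != expected {
		return actual, false
	}
	return kv.Put(key, value), true
}

func main() {
	kv := NewKV()
	revA := kv.Put("/registry/pods/default/a", "v1")
	kv.Put("/registry/pods/default/b", "v1") // unrelated write, still bumps the store revision

	newRev, ok := kv.PutIfUnmodified("/registry/pods/default/a", "v2", revA)
	fmt.Println(revA, newRev, ok) // prints "1 3 true"
}
```

Note how the version of key `a` jumps from 1 to 3 even though it was written only twice. This is one reason to treat `resourceVersion` as opaque rather than as a per-object counter.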

If you’re writing an operator, your reconcile loop typically reads the `Spec` of your CustomResource, updates the state of the world to reflect that spec, and then updates the `Status` field of the CustomResource with the current state. Most API conflicts happen while updating the `Status` (a controller should typically not update the spec of its CustomResource). They can also happen when you update other resources in the Kubernetes cluster besides your CustomResource. When you get a conflict, it typically means one of the following:

  • a user submitted a new spec for the resource after you last fetched it from the server
  • some other controller updated the status or some other field on your resource

So how do you solve this? One option is to retry the whole reconciliation when you hit this error, by simply returning the error. A better approach is to use a handy helper provided by client-go called `RetryOnConflict`, which retries only the failed update instead of the whole reconciliation. The implementation can look like the following snippet.

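    // Assumes these imports: "fmt", "k8s.io/apimachinery/pkg/types",
    // "k8s.io/client-go/util/retry", plus your own API package as apiv1.
    err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
        // Re-fetch on every attempt so the update carries the latest resourceVersion.
        var res apiv1.MyResource
        err := r.Get(ctx, types.NamespacedName{
            Name:      resourceName,
            Namespace: resourceNamespace,
        }, &res)
        if err != nil {
            return err
        }

        res.Status.Replicas = readyReplicas
        return r.Status().Update(ctx, &res)
    })
    if err != nil {
        return fmt.Errorf("failed to update resource status: %w", err)
    }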
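
To see why the fetch belongs inside the retried function, here is a dependency-free sketch of the retry pattern. It is a simplified stand-in, not client-go's implementation: `retryOnConflict`, `server`, and `errConflict` are hypothetical names, and the sketch has no backoff between attempts.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the API server's conflict error (hypothetical name).
var errConflict = errors.New("conflict")

// retryOnConflict calls fn up to attempts times, retrying only on conflicts.
// Any other result, including success, is returned immediately.
func retryOnConflict(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		err = fn()
		if !errors.Is(err, errConflict) {
			return err
		}
	}
	return err // still conflicting after the last attempt
}

// server is a toy versioned store holding a single replicas value.
type server struct {
	version  int
	replicas int
}

// get returns the current version and value.
func (s *server) get() (version, replicas int) { return s.version, s.replicas }

// update succeeds only if the caller's version matches the stored one.
func (s *server) update(version, replicas int) error {
	if version != s.version {
		return errConflict
	}
	s.version++
	s.replicas = replicas
	return nil
}

func main() {
	srv := &server{version: 1}
	calls := 0

	err := retryOnConflict(5, func() error {
		calls++
		version, _ := srv.get() // re-read on every attempt
		if calls == 1 {
			// Simulate another writer sneaking in between our read and our write.
			_ = srv.update(version, 1)
		}
		return srv.update(version, 3)
	})

	fmt.Println(err, calls, srv.replicas) // prints "<nil> 2 3"
}
```

The first attempt fails because another writer changed the object after it was read. The second attempt re-reads, picks up the new version, and succeeds. If the read were hoisted out of the closure, every retry would resend the same stale version and fail in the same way.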