Chaos Mesh

Chaos engineering is a discipline that studies how failures can occur in a system and provides methodologies to help avoid them. By understanding the root cause of failures, chaos engineers can develop plans to prevent or mitigate them.

In this article I will look into Chaos Mesh, an open source, cloud-native chaos engineering platform. It offers various types of fault simulation and a strong capability for orchestrating fault scenarios. I am highlighting Chaos Mesh as one tool among many; there are others out there, such as Litmus. I have implemented it at some of the companies I have worked with to test the resilience of their Kubernetes clusters.

So let's try out Chaos Mesh in your cluster.

Let's get a sample app. If you have your own, you can skip this step.

 git clone https://github.com/dockersamples/example-voting-app.git

Deploy the application to your Kubernetes cluster

 kubectl create -f example-voting-app/k8s-specifications/
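Before injecting any faults, it helps to confirm that the app is healthy, otherwise it is hard to tell a chaos-induced failure from a broken deployment. The following is a minimal sketch, assuming the manifests were created in the default namespace. The helper name wait_for_deployments is my own and is not part of kubectl.

```shell
# Block until every Deployment in a namespace has finished rolling out.
# Argument: namespace (assumed to be "default" when omitted).
wait_for_deployments() {
  ns="${1:-default}"
  # "-o name" prints one resource per line, e.g. deployment.apps/vote
  for deploy in $(kubectl get deployments -n "$ns" -o name); do
    # Stop at the first Deployment that does not become ready in time.
    kubectl rollout status "$deploy" -n "$ns" --timeout=120s || return 1
  done
}
```

Run wait_for_deployments default after the kubectl create step. A non-zero exit status means at least one Deployment did not become ready within the timeout.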

Install Chaos Mesh

 helm repo add chaos-mesh https://charts.chaos-mesh.org
 helm repo update
 helm install chaos-mesh chaos-mesh/chaos-mesh
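It is worth checking that the installation succeeded before writing any experiments. The sketch below assumes the Helm release is named chaos-mesh, as in the command above, and searches all namespaces because that command does not pin one. The helper name check_chaos_mesh is hypothetical.

```shell
# Succeeds only if the Chaos Mesh pods can be listed and the PodChaos
# custom resource definition is registered in the cluster.
check_chaos_mesh() {
  # Helm labels the chart's pods with the release name (assumed: chaos-mesh).
  kubectl get pods --all-namespaces -l app.kubernetes.io/instance=chaos-mesh &&
    kubectl get crd podchaos.chaos-mesh.org
}
```

Run check_chaos_mesh and look for the Chaos Mesh pods in the Running state. If the CRD lookup fails, the PodChaos manifests that follow will be rejected by the API server.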

Let's test one of the pod-kill experiments. Save the manifest below as pod-kill.yaml and apply it: one vote pod will be killed, and Kubernetes will create a new pod to replace it.

 apiVersion: chaos-mesh.org/v1alpha1
 kind: PodChaos
 metadata:
   name: pod-kill-example
   namespace: default
 spec:
   action: pod-kill
   mode: one
   selector:
     namespaces:
       - default
     labelSelectors:
       app: vote

 kubectl apply -f pod-kill.yaml
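To see the replacement happen, you can list the targeted pods before and after the experiment. This is a rough sketch under a few assumptions: the helper name run_experiment and the WAIT_SECONDS variable are my own, and the 10 second default is a guess at how long the replacement takes in a small cluster.

```shell
# Apply a chaos manifest, list the targeted pods before and after,
# then delete the experiment object.
# Arguments: manifest path, label selector, namespace (default: "default").
run_experiment() {
  manifest="$1"; selector="$2"; ns="${3:-default}"
  echo "Pods before:"
  kubectl get pods -n "$ns" -l "$selector" -o name
  kubectl apply -f "$manifest" || return 1
  # Give Kubernetes some time to start the replacement pod.
  sleep "${WAIT_SECONDS:-10}"
  echo "Pods after:"
  kubectl get pods -n "$ns" -l "$selector" -o name
  # Re-applying an unchanged manifest does nothing, so remove the
  # experiment object to make the manifest runnable again.
  kubectl delete -f "$manifest"
}
```

For the manifest above, call run_experiment pod-kill.yaml app=vote. One of the pod names in the second list should differ from the first.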

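There are other experiments you can try as well, such as this pod-failure example. Its selector targets pods labelled app.kubernetes.io/component: tikv, so change the label to match your own workload before applying it: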

 apiVersion: chaos-mesh.org/v1alpha1
 kind: PodChaos
 metadata:
   name: pod-failure-example
 spec:
   action: pod-failure
   mode: one
   duration: "30s"
   selector:
     labelSelectors:
       "app.kubernetes.io/component": "tikv"
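The example above selects tikv pods, which the voting app does not have. As a sketch, the same experiment can be pointed at the vote pods by reusing the selector from the pod-kill manifest. The file name pod-failure-vote.yaml and the resource name pod-failure-vote are my own choices, and the default namespace is assumed.

```shell
# Write a pod-failure experiment that targets the vote pods.
# The quoted 'EOF' keeps the shell from expanding anything in the manifest.
cat > pod-failure-vote.yaml <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-failure-vote
  namespace: default
spec:
  action: pod-failure
  mode: one
  duration: "30s"
  selector:
    namespaces:
      - default
    labelSelectors:
      app: vote
EOF
```

Apply it with kubectl apply -f pod-failure-vote.yaml. One vote pod should become unavailable for about 30 seconds and then recover.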

You can find more examples here: https://github.com/chaos-mesh/chaos-mesh/tree/master/examples

Conclusion

If you plan to roll out chaos testing in your organization, always prepare a set of plans and experiments that you want to run. Larger organizations usually have an internal disaster recovery team that performs these tests, with DevOps assisting, on a yearly basis to check the resilience of their infrastructure.