# Chaos Tests
## Goal
Chaos tests are designed to check the reliability of Milvus.
For instance, if one pod is killed:
- verify that it restarts automatically
- verify that the related operation fails, while the other operations keep working successfully during the absence of the pod
- verify that all the operations work successfully after the pod returns to the running state
- verify that no data is lost
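
The checks above amount to comparing expected and observed operation results in each phase of a test. The following is a minimal sketch of that idea; the operation names, the `succ`/`fail` labels, and the `unexpected_results` helper are illustrative assumptions, not part of the Milvus test code.

```python
# Hypothetical sketch: compare observed operation results with expectations.
# Operation names, result labels, and this helper are illustrative assumptions.

def unexpected_results(expected, observed):
    """Return the operations whose observed outcome differs from the expectation."""
    return sorted(op for op, want in expected.items() if observed.get(op) != want)


# During the chaos: only the operation tied to the killed pod is allowed to fail.
expected_during = {"create": "succ", "insert": "succ", "search": "fail"}
# After the pod is running again: every operation must succeed.
expected_after = {"create": "succ", "insert": "succ", "search": "succ"}

# Example observations (made up for illustration).
observed_during = {"create": "succ", "insert": "succ", "search": "fail"}
observed_after = {"create": "succ", "insert": "succ", "search": "fail"}

print(unexpected_results(expected_during, observed_during))
print(unexpected_results(expected_after, observed_after))
```

An empty list means the phase behaved as expected; a non-empty list names the operations that need investigation.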
## Prerequisite
Chaos tests run on the pytest framework, the same as the E2E tests.
Please refer to [Run E2E Tests](https://github.com/milvus-io/milvus/blob/master/tests/README.md).
## Flow Chart
<img src="../graphs/chaos_test_flow_chart.jpg" alt="Chaos Test Flow Chart" width="350" />
## Test Scenarios
### Milvus in cluster mode
#### pod kill
Kill the pod every 5s.
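
For illustration, a pod kill scenario could be described by a Chaos Mesh `PodChaos` object. The sketch below builds such a manifest as a plain Python dict. The namespace, release label, and object name are placeholder assumptions, and the `scheduler.cron` field follows the Chaos Mesh 1.x layout as an assumption; other Chaos Mesh versions may express recurrence differently, so check the yaml files in `chaos_objects` for the form actually used.

```python
import json


def pod_kill_manifest(component, namespace="chaos-testing", release="milvus-chaos"):
    """Build a PodChaos manifest as a dict. Default values are placeholders (assumptions)."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": f"test-{component}-podkill", "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",  # pick one matching pod per run
            "selector": {
                "namespaces": [namespace],
                # Label keys and values are assumptions about the deployment.
                "labelSelectors": {
                    "app.kubernetes.io/instance": release,
                    "component": component,
                },
            },
            # Assumed Chaos Mesh 1.x style recurrence: repeat every 5 seconds.
            "scheduler": {"cron": "@every 5s"},
        },
    }


print(json.dumps(pod_kill_manifest("querynode"), indent=2))
```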
#### pod network partition
Isolate the network in both directions (to and from) between a pod and the rest of the pods.
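
As a rough sketch, a two-direction partition could be described by a Chaos Mesh `NetworkChaos` object with `action: partition` and `direction: both`. The dict below is an assumption-laden illustration: the namespace and label values are placeholders, and the target selector simply matches every pod of the release, which only approximates "the rest of the pods".

```python
import json


def network_partition_manifest(component, namespace="chaos-testing", release="milvus-chaos"):
    """Build a NetworkChaos manifest as a dict. Default values are placeholders (assumptions)."""
    release_labels = {"app.kubernetes.io/instance": release}
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": f"test-{component}-network-partition", "namespace": namespace},
        "spec": {
            "action": "partition",
            "mode": "all",
            # Source side: the pods of the component under test.
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {**release_labels, "component": component},
            },
            # Block traffic both to and from the selected pods.
            "direction": "both",
            # Target side: all pods of the release (an approximation, see lead-in).
            "target": {
                "mode": "all",
                "selector": {"namespaces": [namespace], "labelSelectors": release_labels},
            },
        },
    }


print(json.dumps(network_partition_manifest("datanode"), indent=2))
```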
#### pod failure
Deploy the pod (querynode, indexnode, or datanode) with multiple replicas, make one of the replicas fail, and test Milvus's functionality.
#### pod memory stress
### Milvus in standalone mode
1. standalone pod is killed
2. minio pod is killed
## How it works
- Test scenarios are defined by different chaos objects
- Every chaos object is defined in one YAML file located in the `chaos_objects` folder
- Every chaos YAML file matched by `ALL_CHAOS_YAMLS` in `constants.py` is parsed as a parameter and passed into `test_chaos.py`
- All expectations of every scenario are defined in `testcases.yaml`, located in the `chaos_objects` folder
- [Chaos Mesh](https://chaos-mesh.org/) is used to inject chaos into Milvus in `test_chaos.py`
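
To make the role of `ALL_CHAOS_YAMLS` concrete, the sketch below shows how a wildcard pattern could select chaos files by name. It is only an illustration of the matching idea: `select_chaos_yamls` is a hypothetical helper, the file list is made up apart from `chaos_querynode_podkill.yaml` and `testcases.yaml`, and the real `test_chaos.py` may resolve the pattern differently.

```python
from fnmatch import fnmatch

# Same style of value as described for constants.py.
ALL_CHAOS_YAMLS = "chaos_*_network_partition.yaml"


def select_chaos_yamls(file_names, pattern=ALL_CHAOS_YAMLS):
    """Return the file names that match the wildcard pattern, sorted for stable ordering."""
    return sorted(name for name in file_names if fnmatch(name, pattern))


# Hypothetical contents of the chaos_objects folder.
available = [
    "chaos_querynode_podkill.yaml",
    "chaos_querynode_network_partition.yaml",
    "chaos_datanode_network_partition.yaml",
    "testcases.yaml",
]

print(select_chaos_yamls(available))
```

Each selected file name could then serve as one test parameter, so a single pattern expands into one test run per chaos object.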
## Run
### Manually
Run a single test scenario manually (taking the query node pod kill as an example):
1. update `ALL_CHAOS_YAMLS = 'chaos_querynode_podkill.yaml'` in `constants.py`
2. run the commands below:
```bash
cd /milvus/tests/python_client/chaos
pytest test_chaos.py --host ${Milvus_IP} -v
```
Run multiple test scenarios in a category manually (taking network partition chaos for all pods as an example):
1. update `ALL_CHAOS_YAMLS = 'chaos_*_network_partition.yaml'` in `constants.py`
2. run the commands below:
```bash
cd /milvus/tests/python_client/chaos
pytest test_chaos.py --host ${Milvus_IP} -v
```
### Automation Scripts
Run a test scenario automatically:
1. update chaos type and pod in `chaos_test.sh`
2. run the commands below:
```bash
cd /milvus/tests/python_client/chaos
# in this step, the script installs Milvus and runs the test case
bash chaos_test.sh ${pod} ${chaos_type} ${chaos_task}
# example: bash chaos_test.sh querynode pod_kill chaos-test
```
### GitHub Action
* [Pod Kill Chaos Test](https://github.com/milvus-io/milvus/actions/workflows/pod-kill-chaos-test.yaml)
* [Network Partition Chaos Test](https://github.com/milvus-io/milvus/actions/workflows/network-partition-chaos-test.yaml)
### Nightly
Still in planning.
### Todo
- [ ] network attack
- [ ] clock skew
- [ ] IO injection
## How to contribute
* Get familiar with chaos engineering and [Chaos Mesh](https://chaos-mesh.org)
* Design chaos scenarios, preferably picking from the todo list
* Generate a YAML file for your chaos scenario. You can create a chaos experiment in chaos-dashboard, then download its YAML file.
* Add the YAML file to the `chaos_objects` directory and rename it to `chaos_${component_name}_${chaos_type}.yaml`. Make sure `kubectl apply -f ${your_chaos_yaml_file}` takes effect
* Add a test case in `testcases.yaml`. You should figure out the expected behavior of Milvus during the chaos
* Run your added test case following the `Manually` section above and check whether the result matches your expectation
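
Before adding a new file, it can help to check that its name follows the `chaos_${component_name}_${chaos_type}.yaml` convention. The helper below is a hypothetical sketch, not part of the repository; it assumes component names contain no underscore, while chaos types may (for example `network_partition`).

```python
import re

# Assumption: the component name has no underscore; the chaos type may contain underscores.
CHAOS_FILE_RE = re.compile(r"^chaos_(?P<component>[a-z0-9]+)_(?P<chaos_type>[a-z0-9_]+)\.yaml$")


def parse_chaos_file_name(file_name):
    """Return (component, chaos_type) if the name follows the convention, else None."""
    match = CHAOS_FILE_RE.match(file_name)
    if match is None:
        return None
    return match.group("component"), match.group("chaos_type")


print(parse_chaos_file_name("chaos_querynode_podkill.yaml"))
print(parse_chaos_file_name("chaos_datanode_network_partition.yaml"))
print(parse_chaos_file_name("querynode.yaml"))
```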