# Chaos Tests
## Goal
Chaos tests are designed to check the reliability of Milvus.

For instance, if one pod is killed:
   - verify that it restarts automatically 
   - verify that the operations depending on that pod fail, while the other operations keep working during the pod's absence
   - verify that all operations work successfully after the pod is back in the running state
   - verify that no data is lost
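
The checks above can be thought of as a small table of expected outcomes per phase. The following is only a sketch under stated assumptions: the operation names, the choice of `search` as the operation that depends on the killed pod, and the helper `unexpected_results` are illustrative and are not taken from the test suite.

```python
# Expected outcome of each operation: True = succeeds, False = fails.
# Assumption for illustration: a query node pod is killed, so "search"
# is the related operation that is expected to fail during the chaos.
EXPECTED_DURING_CHAOS = {"create": True, "insert": True, "search": False}
EXPECTED_AFTER_RECOVERY = {"create": True, "insert": True, "search": True}


def unexpected_results(observed, expected):
    """Return the operations whose observed outcome differs from the expectation."""
    return sorted(op for op, want in expected.items() if observed.get(op) != want)


# Hypothetical observations collected by a test run.
observed_during = {"create": True, "insert": True, "search": False}
observed_after = {"create": True, "insert": False, "search": True}

print(unexpected_results(observed_during, EXPECTED_DURING_CHAOS))   # []
print(unexpected_results(observed_after, EXPECTED_AFTER_RECOVERY))  # ['insert']
```

An empty list means the phase behaved as expected; any listed operation points at a reliability problem worth investigating.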

## Prerequisite
Chaos tests run in the pytest framework, the same as the e2e tests.

Please refer to [Run E2E Tests](https://github.com/milvus-io/milvus/blob/master/tests/README.md)

## Flow Chart

<img src="../graphs/chaos_test_flow_chart.jpg" alt="Chaos Test Flow Chart" width="350"/>

## Test Scenarios
### Milvus in cluster mode
#### pod kill
1. root coordinator pod is killed
   
2. proxy pod is killed

3. data coordinator pod is killed

4. data node pod is killed

5. index coordinator pod is killed

6. index node pod is killed

7. query coordinator pod is killed

8. query node pod is killed

9. minio pod is killed
#### pod network partition

Two-direction (to and from) network isolation between a pod and the rest of the pods.
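
A two-direction partition of this kind might be described by a Chaos Mesh `NetworkChaos` object along the lines of the sketch below. The manifest is built as a Python dictionary and printed as JSON, which `kubectl apply -f` also accepts. The helper name, the object name, the namespace, and all labels are placeholders rather than values from this repository, and the field layout should be checked against the Chaos Mesh version in use.

```python
import json


def network_partition_manifest(name, namespace, pod_labels, peer_labels):
    """Build a NetworkChaos manifest that isolates the selected pods in both directions."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "partition",
            "mode": "all",
            "selector": {"namespaces": [namespace], "labelSelectors": pod_labels},
            # "both" blocks traffic to and from the selected pods.
            "direction": "both",
            "target": {
                "mode": "all",
                "selector": {"namespaces": [namespace], "labelSelectors": peer_labels},
            },
        },
    }


# All names and labels below are placeholders, not values from the repository.
manifest = network_partition_manifest(
    name="example-querynode-network-partition",
    namespace="chaos-testing",
    pod_labels={"component": "querynode"},
    peer_labels={"app.kubernetes.io/instance": "example-milvus"},
)
print(json.dumps(manifest, indent=2))
```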

#### pod failure

Deploy a pod type (querynode, indexnode, or datanode) with multiple replicas, make one of the replicas fail, and test Milvus's functionality.
### Milvus in standalone mode
1. standalone pod is killed

2. minio pod is killed

## How it works
- Test scenarios are defined by different chaos objects
- Every chaos object is defined in one yaml file located in the `chaos_objects` folder
- Every chaos yaml file specified by `ALL_CHAOS_YAMLS` in `constants.py` is parsed as a parameter and passed into `test_chaos.py`
- All expectations of every scenario are defined in `testcases.yaml`, located in the `chaos_objects` folder
- [Chaos Mesh](https://chaos-mesh.org/) is used to inject chaos into Milvus in `test_chaos.py`
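
The file selection described above can be sketched as follows. This is an illustration rather than the code in `test_chaos.py`: the helper `collect_chaos_yamls` and the sample file list are hypothetical, and the sketch assumes that `ALL_CHAOS_YAMLS` is interpreted as a shell-style glob pattern.

```python
import fnmatch

# Assumed value; in the repository this is set in constants.py.
ALL_CHAOS_YAMLS = "chaos_*_network_partition.yaml"


def collect_chaos_yamls(filenames, pattern=ALL_CHAOS_YAMLS):
    """Return the sorted file names that match the glob pattern (hypothetical helper)."""
    return sorted(name for name in filenames if fnmatch.fnmatch(name, pattern))


# Hypothetical contents of the chaos_objects folder.
sample_files = [
    "chaos_querynode_podkill.yaml",
    "chaos_querynode_network_partition.yaml",
    "chaos_proxy_network_partition.yaml",
    "testcases.yaml",
]

selected = collect_chaos_yamls(sample_files)
print(selected)
```

A list like `selected` could then be handed to `pytest.mark.parametrize`, so that each matching yaml file becomes one test case.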

## Run
### Manually
Run a single test scenario manually (taking the query node pod kill as an example):
1. update `ALL_CHAOS_YAMLS = 'chaos_querynode_podkill.yaml'` in `constants.py`

2. run the commands below:
   ```bash
   cd /milvus/tests/python_client/chaos

   pytest test_chaos.py --host ${Milvus_IP} -v
   ```
Run multiple test scenarios in a category manually (taking network partition chaos for all pods as an example):

1. update `ALL_CHAOS_YAMLS = 'chaos_*_network_partition.yaml'` in `constants.py`

2. run the commands below:
   ```bash
   cd /milvus/tests/python_client/chaos

   pytest test_chaos.py --host ${Milvus_IP} -v
   ```
### Automation Scripts
Run a test scenario automatically:
1. update the chaos type and pod in `chaos_test.sh`
2. run the commands below:
   ```bash
   cd /milvus/tests/python_client/chaos
   # this script installs Milvus and runs the test case
   bash chaos_test.sh
   ```
### GitHub Actions
* [Pod Kill Chaos Test](https://github.com/milvus-io/milvus/actions/workflows/pod-kill-chaos-test.yaml)
* [Network Partition Chaos Test](https://github.com/milvus-io/milvus/actions/workflows/network-partition-chaos-test.yaml)

### Nightly 
Still in planning.

### Todo
- [x] pod_failure
- [ ] container_kill
- [x] network attack
- [x] memory stress

## How to contribute
* Get familiar with chaos engineering and [Chaos Mesh](https://chaos-mesh.org)
* Design chaos scenarios, preferably picking from the todo list
* Generate a yaml file for your chaos scenario. You can create a chaos experiment in the Chaos Dashboard, then download its yaml file.
* Add the yaml file to the `chaos_objects` directory and rename it to `chaos_${component_name}_${chaos_type}.yaml`. Make sure `kubectl apply -f ${your_chaos_yaml_file}` takes effect
* Add a test case to `testcases.yaml`. You need to work out the expected behavior of Milvus during the chaos
* Run your added test case following the `Manually` section above and check whether it behaves as you expect
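
Before submitting a new chaos object, the file name can be checked against the naming convention above. The snippet below is a minimal sketch: the helper `parse_chaos_filename` is hypothetical, and the pattern assumes that component names contain no underscores while chaos types may (for example `network_partition`).

```python
import re

# Assumption: component names contain no underscores, while chaos types may
# (for example "network_partition").
CHAOS_FILE_RE = re.compile(
    r"^chaos_(?P<component>[a-z0-9]+)_(?P<chaos_type>[a-z0-9_]+)\.yaml$"
)


def parse_chaos_filename(filename):
    """Return (component, chaos_type) if the name follows the convention, else None."""
    match = CHAOS_FILE_RE.match(filename)
    if match is None:
        return None
    return match.group("component"), match.group("chaos_type")


print(parse_chaos_filename("chaos_querynode_podkill.yaml"))
print(parse_chaos_filename("chaos_proxy_network_partition.yaml"))
print(parse_chaos_filename("querynode_podkill.yaml"))
```

A `None` result signals that the file would need to be renamed before it can follow the `chaos_${component_name}_${chaos_type}.yaml` convention.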