
# Chaos Tests
## Goal
Chaos tests are designed to check the reliability of Milvus.

For instance, if one pod is killed:
   - verify that it restarts automatically
   - verify that the related operation fails, while the other operations keep working successfully during the absence of the pod
   - verify that all the operations work successfully after the pod is back in the running state
   - verify that no data is lost
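
As a rough illustration of the checks above, the expected outcome of each operation can be compared with what was observed in each phase. This is only a sketch: the operation names, the `check_expectations` helper, and the sample results are hypothetical and are not taken from the Milvus test suite.

```python
# Hypothetical helper: compare observed operation outcomes with expectations.
def check_expectations(observed, expected):
    """Return the sorted names of operations whose outcome is not the expected one."""
    return sorted(op for op, want in expected.items() if observed.get(op) != want)


# Assumed expectations while a query node pod is absent: search is treated as
# the related operation and may fail, the other operations keep working.
expected_during_chaos = {"create": "succ", "insert": "succ", "search": "fail"}
# After the pod is back in the running state, everything should succeed.
expected_after_chaos = {"create": "succ", "insert": "succ", "search": "succ"}

# Made-up observations, for illustration only.
observed_during_chaos = {"create": "succ", "insert": "succ", "search": "fail"}
observed_after_chaos = {"create": "succ", "insert": "succ", "search": "fail"}

mismatches_during = check_expectations(observed_during_chaos, expected_during_chaos)
mismatches_after = check_expectations(observed_after_chaos, expected_after_chaos)
```

An empty list means the phase behaved as expected. With the sample data, the second call flags `search`, because it did not recover after the pod restarted.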

## Prerequisite
Chaos tests run on the pytest framework, the same as the E2E tests.

Please refer to [Run E2E Tests](https://github.com/milvus-io/milvus/blob/master/tests/README.md).

## Flow Chart

<img src="../graphs/chaos_test_flow_chart.jpg" alt="Chaos Test Flow Chart" width="350"/>

## Test Scenarios
### Milvus in cluster mode
#### pod kill

Kill a pod every 5 seconds.
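
A pod kill scenario of this kind might be described by a Chaos Mesh `PodChaos` object. The sketch below builds such an object as a plain Python dict. The field layout is an assumption based on the Chaos Mesh 1.x `PodChaos` schema (newer Chaos Mesh releases schedule experiments differently), and the object name, namespace, and label values are hypothetical, so check it against the YAML files in `chaos_objects` and the installed Chaos Mesh version.

```python
import json


def pod_kill_manifest(component, namespace, release):
    """Build a PodChaos-style manifest as a dict (all argument values are hypothetical)."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": f"test-{component}-podkill", "namespace": namespace},
        "spec": {
            "action": "pod-kill",  # kill the selected pod
            "mode": "one",         # pick one matching pod at a time
            "selector": {
                "namespaces": [namespace],
                # Assumed labels; the real labels depend on how Milvus was deployed.
                "labelSelectors": {
                    "app.kubernetes.io/instance": release,
                    "component": component,
                },
            },
            # Assumed Chaos Mesh 1.x style schedule: repeat every 5 seconds.
            "scheduler": {"cron": "@every 5s"},
        },
    }


manifest = pod_kill_manifest("querynode", "chaos-testing", "milvus-chaos")
print(json.dumps(manifest, indent=2))
```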

#### pod network partition

Two-way (to and from) network isolation between a pod and the rest of the pods.

#### pod failure

Deploy the pods (querynode, indexnode, and datanode) with multiple replicas, make one replica fail, and test Milvus's functionality.

#### pod memory stress

### Milvus in standalone mode
1. standalone pod is killed

2. minio pod is killed

## How it works
- Test scenarios are built from different chaos objects
- Every chaos object is defined in a YAML file located in the `chaos_objects` folder
- Every chaos YAML file matched by `ALL_CHAOS_YAMLS` in `constants.py` is parsed as a parameter and passed into `test_chaos.py`
- All expectations of every scenario are defined in `testcases.yaml`, located in the `chaos_objects` folder
- [Chaos Mesh](https://chaos-mesh.org/) is used to inject chaos into Milvus in `test_chaos.py`
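
The file selection step can be pictured with a small sketch. It assumes that `ALL_CHAOS_YAMLS` is treated as a shell-style wildcard pattern; the `select_chaos_yamls` helper and the file list are hypothetical stand-ins, not the code in `test_chaos.py`.

```python
import fnmatch

# Hypothetical file names standing in for the contents of `chaos_objects`.
CHAOS_OBJECT_FILES = [
    "chaos_querynode_podkill.yaml",
    "chaos_datanode_podkill.yaml",
    "chaos_querynode_network_partition.yaml",
    "chaos_datanode_network_partition.yaml",
    "testcases.yaml",
]


def select_chaos_yamls(pattern, file_names):
    """Return the sorted file names that match a shell-style wildcard pattern."""
    return sorted(fnmatch.filter(file_names, pattern))


single = select_chaos_yamls("chaos_querynode_podkill.yaml", CHAOS_OBJECT_FILES)
category = select_chaos_yamls("chaos_*_network_partition.yaml", CHAOS_OBJECT_FILES)
```

Each selected file name could then become one test parameter, for example through `pytest.mark.parametrize`, so that one test function runs once per chaos object.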

## Run
### Manually
Run a single test scenario manually (taking the query node pod kill as an example):
1. update `ALL_CHAOS_YAMLS = 'chaos_querynode_podkill.yaml'` in `constants.py`

2. run the commands below:
   ```bash
   cd /milvus/tests/python_client/chaos

   pytest test_chaos.py --host ${Milvus_IP} -v
   ```
Run multiple test scenarios in a category manually (taking network partition chaos for all pods as an example):

1. update `ALL_CHAOS_YAMLS = 'chaos_*_network_partition.yaml'` in `constants.py`

2. run the commands below:
   ```bash
   cd /milvus/tests/python_client/chaos

   pytest test_chaos.py --host ${Milvus_IP} -v
   ```
### Automation Scripts
Run a test scenario automatically:
1. update chaos type and pod in `chaos_test.sh`
2. run the commands below:
   ```bash
   cd /milvus/tests/python_client/chaos
   # in this step, the script installs Milvus and runs the test case
   bash chaos_test.sh ${pod} ${chaos_type} ${chaos_task}
   # example: bash chaos_test.sh querynode pod_kill chaos-test
   ```
### GitHub Actions
* [Pod Kill Chaos Test](https://github.com/milvus-io/milvus/actions/workflows/pod-kill-chaos-test.yaml)
* [Network Partition Chaos Test](https://github.com/milvus-io/milvus/actions/workflows/network-partition-chaos-test.yaml)

### Nightly
Still in planning.

### Todo
- [ ] network attack
- [ ] clock skew
- [ ] IO injection

## How to contribute
* Get familiar with chaos engineering and [Chaos Mesh](https://chaos-mesh.org)
* Design chaos scenarios, preferably picking from the Todo list
* Generate a YAML file for your chaos scenario. You can create a chaos experiment in the Chaos Dashboard, then download its YAML file
* Add the YAML file to the `chaos_objects` directory and rename it to `chaos_${component_name}_${chaos_type}.yaml`. Make sure `kubectl apply -f ${your_chaos_yaml_file}` takes effect
* Add a test case to `testcases.yaml`. You need to work out how Milvus is expected to behave during the chaos
* Run your added test case as described in `Manually` above and check whether it behaves as you expect
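
The naming convention above can be checked with a short script before a new file is added. This is a minimal sketch: both helpers are hypothetical, and the allowed character set (lowercase letters, digits, and underscores in the chaos type) is an assumption rather than a rule stated in this README.

```python
import re

# Pattern for `chaos_${component_name}_${chaos_type}.yaml`.
# Assumption: the component name has no underscores, the chaos type may have them.
CHAOS_FILE_RE = re.compile(
    r"^chaos_(?P<component>[a-z0-9]+)_(?P<chaos_type>[a-z0-9_]+)\.yaml$"
)


def chaos_file_name(component_name, chaos_type):
    """Build a file name that follows the naming convention."""
    return f"chaos_{component_name}_{chaos_type}.yaml"


def parse_chaos_file_name(file_name):
    """Return (component_name, chaos_type), or None if the name does not fit."""
    match = CHAOS_FILE_RE.match(file_name)
    if match is None:
        return None
    return match.group("component"), match.group("chaos_type")


new_name = chaos_file_name("querynode", "podkill")
parsed = parse_chaos_file_name("chaos_datanode_network_partition.yaml")
```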