/kind improvement
add test cases and fix a related issue
issue: #47928
---------
Signed-off-by: zhuwenxing <wenxing.zhu@zilliz.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- Set `consistency_level="Strong"` at collection creation time in chaos
`checker.py` (9 places) to ensure data correctness during chaos testing
- Add explicit `consistency_level="Strong"` to all search/query calls in
`test_all_collections_after_chaos.py` (8 places) since it operates on
pre-existing collections
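The effect of pinning `consistency_level="Strong"` can be illustrated with a toy model (a deterministic sketch, not Milvus internals — the actual change just passes the keyword to collection creation and to search/query calls in pymilvus): with an eventually consistent read view, a query issued right after an insert may miss the new rows, while a Strong read syncs the view first.

```python
import time

class ToyCollection:
    """Toy stand-in for a collection: the read view lags behind writes,
    so an eventually consistent query can miss the newest rows."""
    def __init__(self):
        self.rows = []            # committed rows: (value, commit_ts)
        self.applied_ts = 0.0     # how far the read view has caught up

    def insert(self, value):
        self.rows.append((value, time.monotonic()))

    def query(self, consistency_level="Bounded"):
        if consistency_level == "Strong":
            # Strong: advance the read view past the latest write before reading.
            self.applied_ts = time.monotonic()
        # Only rows covered by the read view are visible.
        return [v for v, ts in self.rows if ts <= self.applied_ts]

coll = ToyCollection()
coll.insert("a")
coll.insert("b")
print(coll.query())                             # [] — stale read view misses both rows
print(coll.query(consistency_level="Strong"))   # ['a', 'b'] — read-your-writes
```

This is why the chaos checkers, which assert exact row counts immediately after DML, need Strong consistency baked into the collection rather than relying on per-call defaults.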
## Test plan
- [x] Pod logs confirm `[consistency_level=Strong]` at CreateCollection
- [x] `describe_collection` returns `consistency_level: 0` (Strong)
- [x] Insert-then-query returns all data immediately without explicit
consistency_level param
Signed-off-by: zhuwenxing <wenxing.zhu@zilliz.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- Add chaos test for upsert-during-node-kill: verifies TTL extension via
upsert persists after querynode kill and WAL replay
- Add chaos test for insert-during-node-kill: verifies new data with TTL
inserted during chaos is correctly handled after recovery
- Fix `_verify_correctness` to distinguish "service never recovered"
from "wrong counts"
- Strengthen `_verify_search_consistency` to require non-zero valid
results from non-expired groups
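What the two new chaos tests assert can be sketched as a toy TTL store with an injected clock (a deterministic illustration of the semantics only — no Milvus, no Chaos Mesh, and the names here are hypothetical): an upsert replaces the row and therefore extends its expiry, while a short-TTL row inserted during chaos expires on schedule after recovery.

```python
class ToyTTLStore:
    """Toy TTL semantics: each row carries an expiry timestamp; upsert
    replaces the row, which extends its TTL. The clock is passed in
    explicitly so the sketch is deterministic."""
    def __init__(self):
        self.rows = {}  # pk -> (value, expires_at)

    def upsert(self, pk, value, ttl, now):
        self.rows[pk] = (value, now + ttl)

    def query_alive(self, now):
        return {pk for pk, (_, exp) in self.rows.items() if exp > now}

store = ToyTTLStore()
store.upsert(1, "v1", ttl=10, now=0)     # would expire at t=10
store.upsert(1, "v2", ttl=100, now=5)    # upsert during "chaos" extends TTL to t=105
store.upsert(2, "short", ttl=1, now=5)   # short-TTL row inserted during chaos

# After "recovery" at t=50: row 1 survives thanks to the upsert,
# row 2 has expired as expected.
print(store.query_alive(now=50))  # {1}
```

The real tests verify the same two outcomes against a live cluster, across a querynode kill and WAL replay.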
## Test plan
- [x] All 5 chaos tests pass against a Milvus cluster with Chaos Mesh
- [x] Upsert test confirms TTL extension survives querynode kill
- [x] Insert test confirms both long-TTL (alive) and short-TTL (expired)
data inserted during chaos are correctly handled after recovery
issue: https://github.com/milvus-io/milvus/issues/47482
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Signed-off-by: Eric Hou <eric.hou@zilliz.com>
Co-authored-by: Eric Hou <eric.hou@zilliz.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
Wire three null-vector checkers into the chaos test concurrent operation
pipeline (`test_concurrent_operation.py`):
- **NullVectorSearchChecker** — detects NaN distances from null vector
leaks in search index
- **NullVectorQueryChecker** — validates null/non-null ratio consistency
in queries
- **AddVectorFieldChecker** — dynamically adds nullable FLOAT_VECTOR
fields, creates index, inserts data, and verifies
These checkers are already implemented in `checker.py` but were not
registered in `init_health_checkers()`. The default collection schema
already has nullable vector fields (`float_vector` and `image_emb` both
`nullable=True`), so the checkers use the shared collection name.
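The symptom the NullVectorSearchChecker hunts for can be shown in a few lines of plain Python (a hedged sketch of the detection idea, not the checker's actual implementation): a null vector leaking into the index poisons its distance computation, so any NaN distance in search results is a red flag.

```python
import math

def l2_distance(a, b):
    # A null (None) vector leaked into the index poisons the distance.
    if a is None or b is None:
        return float("nan")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def null_vector_search_check(query, indexed_vectors):
    """Return indices whose distance is NaN — the leak symptom."""
    distances = [l2_distance(query, v) for v in indexed_vectors]
    return [i for i, d in enumerate(distances) if math.isnan(d)]

index = [[1.0, 2.0], None, [3.0, 4.0]]   # one null vector leaked into the index
print(null_vector_search_check([0.0, 0.0], index))  # [1]
```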
issue: #47867
### Changes
**Modified:**
- `tests/python_client/chaos/testcases/test_concurrent_operation.py`:
Added imports and registered 3 checkers
Signed-off-by: yanliang567 <82361606+yanliang567@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- Split snapshot testing into two checkers to fix row count mismatch
failures in chaos tests
- **SnapshotChecker** (`Op.snapshot`): lightweight, shares collection
with other checkers, only verifies snapshot create/restore operations
succeed
- **SnapshotRestoreChecker** (`Op.restore_snapshot`): uses independent
collection with internal DML operations, verifies data correctness (row
count + PK) after restore
- Removed `_snapshot_lock` from `Checker` class and all DML checkers
(Insert/Upsert/Delete) to eliminate coupling
## Root Cause
The `SnapshotRestoreChecker` shared a collection with other concurrent
checkers and relied on a global `_snapshot_lock` to prevent data
modifications during snapshot creation. However, some code paths (e.g.,
`Checker.insert_data()` base method,
`DeleteChecker.update_delete_expr()` refill logic) bypassed the lock,
causing row count mismatches after restore. The delta was exactly 3000
(`DELTA_PER_INS`).
Failure log: `expected=158188, actual=155188` (delta=3000)
ref:
https://qa-jenkins.milvus.io/blue/organizations/jenkins/chaos-test-cron/detail/chaos-test-cron/23943/pipeline
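The race can be replayed deterministically (a toy model with the numbers taken from the failure log above; the function name is hypothetical): the snapshot captures the data, an unlocked DML checker then lands one more batch, and the verifier computes its expectation from the live row count — off by exactly one batch.

```python
DELTA_PER_INS = 3000  # rows per insert batch in the chaos checkers

def snapshot_restore_race(base_rows):
    """Deterministic replay of the root cause: snapshot data is frozen first,
    a concurrent insert bypasses the snapshot lock, and the verifier trusts
    the live count — so expected exceeds actual by one insert batch."""
    snapshot_rows = base_rows                 # data frozen in the snapshot
    live_rows = base_rows + DELTA_PER_INS     # insert that bypassed the lock
    expected = live_rows                      # verifier trusts the live count
    actual = snapshot_rows                    # restore brings back the snapshot
    return expected, actual

expected, actual = snapshot_restore_race(155188)
print(f"expected={expected}, actual={actual}")  # expected=158188, actual=155188
```

Giving `SnapshotRestoreChecker` its own collection removes the race entirely instead of trying to lock every DML path.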
## Test plan
- [ ] Run chaos-test-cron pipeline and verify `Op.snapshot` (shared
collection) passes
- [ ] Verify `Op.restore_snapshot` (independent collection) passes with
no row count mismatch
- [ ] Confirm other checkers (insert/upsert/delete) are not impacted by
lock removal
Signed-off-by: zhuwenxing <wenxing.zhu@zilliz.com>
## Summary
- Add `SnapshotRestoreChecker` to test snapshot restore functionality in
chaos testing
- Support concurrent execution with other checkers (shared collection)
- Use `self.get_schema()` to get latest schema for DML operations
- Simplify data verification for concurrent scenarios
## Test plan
- [x] Single run test passed
- [x] Concurrent operation test passed (100% success rate)
- [x] Added to test_concurrent_operation.py
- [x] Added to test_single_request_operation.py
---------
Signed-off-by: zhuwenxing <wenxing.zhu@zilliz.com>
/kind improvement
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
- Core invariant: tests' persistence of EventRecords and RequestRecords
must be append-safe under concurrent writers; this PR replaces Parquet
with JSONL and uses per-file locks and explicit buffer flushes to
guarantee atomic, append-safe writes (EventRecords uses event_lock +
append per line; RequestRecords buffers under request_lock and flushes
to file when threshold or on sink()).
- Logic removed/simplified and rationale: DataFrame-based parquet
append/read logic (pyarrow/fastparquet) and implicit parquet buffering
were removed in favor of simple line-oriented JSON writes and explicit
buffer management. The complex Parquet append/merge paths were redundant
because parquet append under concurrent test-writer patterns caused
corruption; JSONL removes the append-mode complexity and the
parquet-specific buffering/serialization code.
- Why no data loss or behavior regression (concrete code paths):
EventRecords.insert writes a complete JSON object per event under
event_lock to /tmp/ci_logs/event_records_*.jsonl and get_records_df
reads every JSON line under the same lock (or returns an empty DataFrame
with the same schema on FileNotFound/Error), preserving all fields
event_name/event_status/event_ts. RequestRecords.insert appends to an
in-memory buffer under request_lock and triggers _flush_buffer() when
len(buffer) >= 100; _flush_buffer() writes each buffered JSON line to
/tmp/ci_logs/request_records_*.jsonl and clears the buffer; sink() calls
_flush_buffer() under request_lock before get_records_df() reads the
file — ensuring all buffered records are persisted before reads. Both
read paths handle FileNotFoundError and exceptions by returning empty
DataFrames with identical column schemas, so external callers see the
same API and no silent record loss.
- Enhancement summary (concrete): Replaces flaky Parquet append/read
with JSONL + explicit locking and deterministic flush semantics,
removing the root cause of parquet append corruption in tests while
keeping the original DataFrame-based analysis consumers unchanged
(get_records_df returns equivalent schemas).
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
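A stdlib-only sketch of the JSONL scheme described above (a simplified stand-in — the real `RequestRecords` writes to `/tmp/ci_logs/request_records_*.jsonl` and returns pandas DataFrames, while this sketch returns plain dicts): one JSON object per line, buffer and flush under a single lock, so concurrent writers can never corrupt a columnar file the way parquet appends did.

```python
import json
import os
import tempfile
import threading

class RequestRecords:
    """Append-safe JSONL records: buffered inserts under a lock, flushed
    to one-JSON-object-per-line when the buffer fills or on sink()."""
    FLUSH_THRESHOLD = 100

    def __init__(self, path):
        self.path = path
        self.lock = threading.Lock()
        self.buffer = []

    def insert(self, record):
        with self.lock:
            self.buffer.append(record)
            if len(self.buffer) >= self.FLUSH_THRESHOLD:
                self._flush_buffer()

    def _flush_buffer(self):
        # Caller must hold self.lock; append mode on a line-oriented file
        # is safe because each record is a complete, self-delimiting line.
        with open(self.path, "a") as f:
            for rec in self.buffer:
                f.write(json.dumps(rec) + "\n")
        self.buffer.clear()

    def sink(self):
        with self.lock:
            self._flush_buffer()

    def get_records(self):
        self.sink()  # persist buffered records before reading
        try:
            with open(self.path) as f:
                return [json.loads(line) for line in f if line.strip()]
        except FileNotFoundError:
            return []  # same shape on missing file — no silent loss for callers

path = os.path.join(tempfile.mkdtemp(), "request_records.jsonl")
records = RequestRecords(path)
for i in range(150):                  # 100 flushed by threshold, 50 left buffered
    records.insert({"op": "search", "seq": i})
print(len(records.get_records()))     # 150 — sink() flushed the buffered tail
```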
Signed-off-by: zhuwenxing <wenxing.zhu@zilliz.com>
- Refactor connection logic to prioritize uri and token parameters over
host/port/user/password for a more modern connection approach
- Add explicit limit parameter (limit=5) to search and query operations
in chaos checkers to avoid returning unlimited results
- Migrate test_all_collections_after_chaos.py from Collection wrapper to
MilvusClient API style
- Update pytest fixtures in chaos test files to support uri/token params
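The uri/token precedence described in the first bullet can be sketched as a small helper (the helper itself is hypothetical; the parameter names mirror pymilvus `MilvusClient`, which accepts `uri` and `token` keyword arguments): prefer the modern uri/token pair, and fall back to host/port plus user/password only when no uri is supplied.

```python
def build_connect_kwargs(uri=None, token=None, host="localhost", port="19530",
                         user="", password=""):
    """Prefer uri/token; fall back to host/port + user/password otherwise.
    Hypothetical fixture helper, not the actual code in this PR."""
    if uri:
        kwargs = {"uri": uri}
        if token:
            kwargs["token"] = token
        elif user or password:
            kwargs["token"] = f"{user}:{password}"  # MilvusClient-style token
        return kwargs
    return {"uri": f"http://{host}:{port}",
            "token": f"{user}:{password}" if user else ""}

print(build_connect_kwargs(uri="http://milvus:19530", token="root:Milvus"))
print(build_connect_kwargs(host="10.0.0.5", port="19530",
                           user="root", password="pw"))
```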
Signed-off-by: zhuwenxing <wenxing.zhu@zilliz.com>
# PR Summary
This PR resolves deprecation warnings from the `logger` library:
```python
DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
```
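For context, this warning text is what the standard-library `logging` module emits from its deprecated `Logger.warn` alias. The fix is a mechanical rename; a minimal sketch that verifies the corrected spelling emits no deprecation warning:

```python
import logging
import warnings

log = logging.getLogger("chaos")
log.addHandler(logging.NullHandler())  # keep the sketch quiet on stderr

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Before: log.warn("...") — emits the DeprecationWarning quoted above
    # (on Python versions where the alias still exists).
    log.warning("disk pressure detected")  # After: correct spelling, no warning

deprecations = [w for w in caught if issubclass(w.category, DeprecationWarning)]
print(len(deprecations))  # 0 — .warning() is the supported method
```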
Signed-off-by: Emmanuel Ferdman <emmanuelferdman@gmail.com>
Fix query result verification:
change the query expression and adopt a more lenient validation method to address the issue that retrieval of specific IDs cannot be guaranteed under frequent delete operations
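The lenient style can be sketched as follows (a hedged illustration of the validation idea, not the checker's exact expression; the function name is hypothetical): with concurrent deletes you cannot demand that every inserted ID comes back, but everything returned must still be a known, non-deleted ID.

```python
def verify_query_lenient(returned_ids, inserted_ids, deleted_ids):
    """Lenient verification: no fabricated IDs, no resurrected deletes —
    but missing IDs are tolerated, since a concurrent delete may have
    removed them between insert and query."""
    returned = set(returned_ids)
    assert returned <= set(inserted_ids), "query returned unknown IDs"
    assert not (returned & set(deleted_ids)), "deleted IDs reappeared"
    return True

# ID 2 and 4 were deleted concurrently; their absence is fine.
print(verify_query_lenient(returned_ids=[1, 3],
                           inserted_ids=[1, 2, 3, 4],
                           deleted_ids=[2, 4]))  # True
```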
Signed-off-by: zhuwenxing <wenxing.zhu@zilliz.com>
Add freshness checker:
- insert/upsert --> query: measure the time until the data can be queried
- delete --> query: measure the time until the data can no longer be queried
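The measurement loop can be sketched against a toy eventually-visible store with an injected clock (deterministic, and purely illustrative — the real checker polls a live Milvus instance with wall-clock time):

```python
class EventuallyVisibleStore:
    """Toy store where a write or delete takes effect only after `lag`
    ticks, standing in for consume/apply delay. Clock is injected."""
    def __init__(self, lag):
        self.lag = lag
        self.writes = {}   # pk -> tick at which the row becomes visible
        self.deletes = {}  # pk -> tick at which the row becomes invisible

    def insert(self, pk, now):
        self.writes[pk] = now + self.lag

    def delete(self, pk, now):
        self.deletes[pk] = now + self.lag

    def query(self, pk, now):
        visible = pk in self.writes and self.writes[pk] <= now
        deleted = pk in self.deletes and self.deletes[pk] <= now
        return visible and not deleted

def measure_freshness(store, pk, start, op):
    """Poll tick by tick until the condition flips; return elapsed ticks."""
    target = (lambda t: store.query(pk, t)) if op == "insert" else \
             (lambda t: not store.query(pk, t))
    t = start
    while not target(t):
        t += 1
    return t - start

store = EventuallyVisibleStore(lag=3)
store.insert(42, now=0)
print(measure_freshness(store, 42, start=0, op="insert"))   # 3 ticks to become queryable
store.delete(42, now=10)
print(measure_freshness(store, 42, start=10, op="delete"))  # 3 ticks to disappear
```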
Signed-off-by: zhuwenxing <wenxing.zhu@zilliz.com>