milvus

Commit Graph

Author	SHA1	Message	Date
tinswzy	aed7c8bcfb	enhance: update WP version v0.1.25 (#45011 ) #43638 update wp to latest Introduce the beta version of the wp service mode. --------- Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>	2026-03-23 05:57:30 +08:00
Zhen Ye	446f06eb02	enhance: Implement rate limiting in WAL append operations (#47179 ) issue: #47178 This commit introduces a rate limiting mechanism for Write-Ahead Logging (WAL) operations to prevent overload during high traffic. Key changes include: - Added `RateLimitObserver` to monitor and control the rate of DML operations. - Add Adaptive RateLimitController to apply the strategy of rate limit. - WAL will slow down if the recovery-storage works on catchup mode or node memory is high. - Updated `WAL` and related components to handle rate limit states, including rejection and slowdown. - Introduced new error codes for rate limit rejection in the streaming error handling. - Enhanced tests to cover the new rate limiting functionality. These changes aim to improve the stability and performance of the streaming service under load. --------- Signed-off-by: chyezh <chyezh@outlook.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-18 16:19:26 +08:00
Li Liu	72918439ef	enhance: remove unused Go dependencies (ansi, fastjson, grpc/examples, sizedwaitgroup) (#47852) Related to #46199 ## Summary Remove 5 unused or misused Go dependencies to reduce module bloat and consolidate overlapping libraries: - `mgutz/ansi` → replaced with inline ANSI escape codes (only used for 3 color constants in migration console) - `valyala/fastjson` → replaced with `tidwall/gjson` (only 1 file used fastjson; gjson is already used in 22+ files) - `google.golang.org/grpc/examples` → replaced with existing `rootcoordpb` (test file pulled in entire grpc examples repo for a mock server) - `remeh/sizedwaitgroup` → replaced with `chan` semaphore + `sync.WaitGroup` (only 2 files, trivial pattern) - `pkg/errors` → replaced with `cockroachdb/errors` (the project standard; `pkg/errors` was used in 1 file) ## Behavior change: DeleteLog.Parse() fail-fast on missing fields The `fastjson` → `gjson` migration adds explicit `Exists()` validation for `ts`, `pk`, and `pkType` fields in the JSON parsing branch. Previously, both fastjson and gjson would silently return zero values for missing fields, causing `dl.Pk` to remain nil and panicking downstream. The new code fails fast with a descriptive error at parse time. This is a defensive improvement (the original code had identical silent-failure behavior). ## Performance impact (change: path type; perf delta; matters?) - `pkg/errors` → `cockroachdb/errors`: cold (offline CLI tool `config-docs-generator`); negligible; no - `mgutz/ansi` → inline ANSI codes: cold (offline CLI tool `migration/console`); marginally faster (eliminates map lookup); no - `fastjson` → `gjson` (`DeleteLog.Parse`): warm, old-format deltalog deserialization only; ~2.5x slower per JSON parse (143ns→361ns); no, see below - `grpc/examples` → `rootcoordpb`: test only (`client_test.go`); none; no - `sizedwaitgroup` → chan+WaitGroup: test only (`wal_test.go`, `test_framework.go`); none; no ### fastjson → gjson regression detail `DeleteLog.Parse()` is called per-row during deltalog deserialization, but only for the legacy single-field format. The new multi-field parquet format (`newDeltalogMultiFieldReader`) reads pk/ts as separate Arrow columns and bypasses `Parse()` entirely. Legacy deltalogs are rewritten to parquet format during compaction, so this is a dying code path. Additionally, deltalog loading is I/O-bound: the JSON parse cost (~361ns/row) is negligible compared to disk read and Arrow deserialization overhead. Benchmark (Go 1.24, arm64): ``` BenchmarkFastjsonSmall-4 8,315,624 143.1 ns/op 0 B/op 0 allocs/op BenchmarkGjsonOptimized-4 3,321,613 361.4 ns/op 96 B/op 1 allocs/op ``` ## Test plan - [x] CI build passes - [x] CI code-check passes - [ ] CI ut-go passes - [ ] CI e2e passes - [x] Boundary test cases added (bare number, missing pkType/ts/pk) 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Signed-off-by: Li Liu <li.liu@zilliz.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 00:49:27 +08:00
Zhen Ye	b6db3c34ec	enhance: refactor WithClusterLevelBroadcast to use external channel list and add FlushAll integration test (#47656 ) issue: #47647 Refactor the cluster-level broadcast mechanism to decouple the message package from the channel registration lifecycle: - Replace internal provider pattern with opaque ClusterChannels type passed externally to WithClusterLevelBroadcast() - Add channel package singleton (syncutil.Future) exposing GetClusterChannels() and GetPChannelNames() blocking accessors - Add PChannel() interface to MutableMessage/ImmutableMessage for deriving physical channel from virtual channel - Validate non-control-channel entries are physical channels using funcutil.IsPhysicalChannel and use funcutil.IsOnPhysicalChannel for control channel matching - Move control channel substitution logic into WithClusterLevelBroadcast to simplify callers (datacoord, coordinator, assignment service) - Add lock interceptor unit tests and cluster broadcast test coverage - Add integration test for FlushAll with streaming node restart to verify data integrity across node lifecycle --------- Signed-off-by: chyezh <chyezh@outlook.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 11:36:47 +08:00
wei liu	6b4171e7ac	feat: [ExternalTable Part3] Support manual refresh for external collections (#47492 ) design doc: https://github.com/milvus-io/milvus-design-docs/blob/main/design_docs/20260105-external_table.md issue: #45881 This change introduces manual refresh capability for external collections, allowing users to trigger on-demand data synchronization from external sources. It replaces the legacy update mechanism with a more robust job-task hierarchy and persistent state management. Key changes: - Add RefreshExternalCollection, GetRefreshExternalCollectionProgress, and ListRefreshExternalCollectionJobs APIs across Client, Proxy, and DataCoord - Implement ExternalCollectionRefreshManager to manage refresh jobs with a 1:N Job-Task hierarchy - Add ExternalCollectionRefreshMeta for persistent storage of jobs and tasks in the metastore - Add ExternalCollectionRefreshChecker for task state management and worker assignment - Implement ExternalCollectionRefreshInspector for periodic job cleanup - Use WAL Broadcast mechanism for distributed consistency and idempotency - Replace legacy external_collection_inspector and update tasks with the new refresh-based implementation - Add comprehensive unit tests for refresh job lifecycle and state transitions design doc: https://github.com/milvus-io/milvus-design-docs/blob/main/design_docs/20260105-external_table.md --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2026-02-26 11:20:46 +08:00
congqixia	2d14975d18	enhance: implement BatchUpdateManifest RPC for batch segment manifestversion updates (#47773 ) Related to #46358 Add a new BatchUpdateManifest API that allows updating manifest versions for multiple segments in a single request. The update is broadcast via the streaming WAL to ensure consistency across the cluster. This includes the proto definitions, proxy task, datacoord service handler, meta operator, streaming message type registration, and associated tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2026-02-24 11:32:46 +08:00
Chun Han	25a155efcb	feat: part1 for add field backfill(#44444 ) (#46808 ) related: #44444 design doc: https://github.com/milvus-io/milvus-design-docs/blob/main/design_docs/20260129-add-function-field-design.md Signed-off-by: MrPresent-Han <chun.han@gmail.com> Co-authored-by: MrPresent-Han <chun.han@gmail.com>	2026-02-05 19:19:52 +08:00
XuanYang-cn	92baeabfd3	fix: Check for error msg because the error type is missing (#47366 ) See also: #45117 Signed-off-by: yangxuan <xuan.yang@zilliz.com>	2026-01-28 17:19:32 +08:00
aoiasd	664f181f5f	enhance: Improve the consistency of file resource sync (#47113 ) relate: https://github.com/milvus-io/milvus/issues/41424 --------- Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>	2026-01-26 11:45:32 +08:00
XuanYang-cn	d09b8fad16	enhance: encrypt all dbs when defaultKey is given (#47049) When a default key is provided, every new database is encrypted by default, using the default key as its key. See also: #40013 --------- Signed-off-by: yangxuan <xuan.yang@zilliz.com>	2026-01-20 12:59:30 +08:00
tinswzy	cd2d8c7f39	enhance: support switching of WAL implementation (#45286 ) issue: #44726 Introduce an immutable option to prevent accidental modification of critical configurations. Support switching of WAL implementation. Note: This PR depends on [milvus-proto PR #503](https://github.com/milvus-io/milvus-proto/pull/503) being merged first. Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>	2026-01-18 20:13:29 +08:00
yihao.dai	30d8f9804a	fix: Fix shard interceptor incorrectly skip flushallmsg (#47003 ) issue: https://github.com/milvus-io/milvus/issues/46799 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2026-01-13 10:07:26 +08:00
Zhen Ye	4c6e33f326	fix: lost tenant/namespace support for pulsar since 2.6 (#46752 ) issue: #46748 Signed-off-by: chyezh <chyezh@outlook.com>	2026-01-06 14:33:24 +08:00
wei liu	975c91df16	feat: Add comprehensive snapshot functionality for collections (#44361) issue: #44358 Implement complete snapshot management system including creation, deletion, listing, description, and restoration capabilities across all system components. Key features: - Create snapshots for entire collections - Drop snapshots by name with proper cleanup - List snapshots with collection filtering - Describe snapshot details and metadata Components added/modified: - Client SDK with full snapshot API support and options - DataCoord snapshot service with metadata management - Proxy layer with task-based snapshot operations - Protocol buffer definitions for snapshot RPCs - Comprehensive unit tests with mockey framework - Integration tests for end-to-end validation Technical implementation: - Snapshot metadata storage in etcd with proper indexing - File-based snapshot data persistence in object storage - Garbage collection integration for snapshot cleanup - Error handling and validation across all operations - Thread-safe operations with proper locking mechanisms - Core invariant/assumption: snapshots are immutable point-in-time captures identified by (collection, snapshot name/ID); etcd snapshot metadata is authoritative for lifecycle (PENDING → COMMITTED → DELETING) and per-segment manifests live in object storage (Avro / StorageV2). GC and restore logic must see snapshotRefIndex loaded (snapshotMeta.IsRefIndexLoaded) before reclaiming or relying on segment/index files. - New capability added: full end-to-end snapshot subsystem: client SDK APIs (Create/Drop/List/Describe/Restore + restore job queries), DataCoord SnapshotWriter/Reader (Avro + StorageV2 manifests), snapshotMeta in meta, SnapshotManager orchestration (create/drop/describe/list/restore), copy-segment restore tasks/inspector/checker, proxy & RPC surface, GC integration, and docs/tests, enabling point-in-time collection snapshots persisted to object storage and restorations orchestrated across components. - Logic removed/simplified and why: duplicated recursive compaction/delta-log traversal and ad-hoc lookup code were consolidated behind two focused APIs/owners (Handler.GetDeltaLogFromCompactTo for delta traversal and SnapshotManager/SnapshotReader for snapshot I/O). MixCoord/coordinator broker paths were converted to thin RPC proxies. This eliminates multiple implementations of the same traversal/lookup, reducing divergence and simplifying responsibility boundaries. - Why this does NOT introduce data loss or regressions: snapshot create/drop use explicit two-phase semantics (PENDING → COMMIT/DELETING) with SnapshotWriter writing manifests and metadata before commit; GC uses snapshotRefIndex guards and IsRefIndexLoaded/GetSnapshotBySegment/GetSnapshotByIndex checks to avoid removing referenced files; restore flow pre-allocates job IDs, validates resources (partitions/indexes), performs rollback on failure (rollbackRestoreSnapshot), and converts/updates segment/index metadata only after successful copy tasks. Extensive unit and integration tests exercise pending/deleting/GC/restore/error paths to ensure idempotence and protection against premature deletion. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2026-01-06 10:15:24 +08:00
Zhen Ye	bb913dd837	fix: simplify go ut (#46606) issue: #46500 - simplify run_go_codecov.sh so that `set -e` catches any sub-command failure. - remove all embedded etcd from tests so that the full test suite can be run locally. ## PR Summary: Simplify Go Unit Tests by Removing Embedded etcd and Async Startup Scaffolding Core Invariant: This PR assumes that unit tests can be simplified by running without embedded etcd servers (delegating to environment-based or external etcd instances via `kvfactory.GetEtcdAndPath()` or `ETCD_ENDPOINTS`) and by removing goroutine-based async startup scaffolding in favor of synchronous component initialization. Tests remain functionally equivalent while becoming simpler to run and debug locally. What is Removed or Simplified: 1. Embedded etcd test infrastructure deleted: Removes `EmbedEtcdUtil` type and its public methods (SetupEtcd, TearDownEmbedEtcd) from `pkg/util/testutils/embed_etcd.go`, removes the `StartTestEmbedEtcdServer()` helper from `pkg/util/etcd/etcd_util.go`, and removes etcd embedding from test suites (e.g., `TaskSuite`, `EtcdSourceSuite`, `mixcoord/client_test.go`). Tests now either skip etcd-dependent tests (via `MILVUS_UT_WITHOUT_KAFKA=1` environment flag in `kafka_test.go`) or source etcd from external configuration (via `kvfactory.GetEtcdAndPath()` in `task_test.go`, or `ETCD_ENDPOINTS` environment variable in `etcd_source_test.go`). This eliminates the overhead of spinning up temporary etcd servers for unit tests. 2. Async startup scaffolding replaced with synchronous initialization: In `internal/proxy/proxy_test.go` and `proxy_rpc_test.go`, the `startGrpc()` method signature removes the `sync.WaitGroup` parameter; components are now created, prepared, and run synchronously in-place rather than in goroutines (e.g., `go testServer.startGrpc(ctx, &p)` becomes `testServer.startGrpc(ctx, &p)` running synchronously). Readiness checks (e.g., `waitForGrpcReady()`) remain in place to ensure startup safety without concurrency constructs. This simplifies control flow and reduces debugging complexity. 3. Shell script orchestration unified with proper error handling: In `scripts/run_go_codecov.sh` and `scripts/run_intergration_test.sh`, per-package inline test invocations are consolidated into a single `test_cmd()` function with unified `TEST_CMD_WITH_ARGS` array containing race, coverage, verbose, and other flags. The problematic `set -ex` is replaced with `set -e` alone (removing debug output noise while preserving strict error semantics), ensuring the scripts fail fast on any command failure. Why No Regression: - Test assertions and code paths remain unchanged; only deployment source of etcd (embedded → external) and startup orchestration (async → sync) change. - Readiness verification (e.g., `waitForGrpcReady()`) is retained, ensuring components are initialized before test execution. - Test flags (race detection, coverage, verbosity) are uniformly applied across all packages via unified `TEST_CMD_WITH_ARGS`, preserving test coverage and quality. - `set -e` alone is sufficient for strict failure detection without the `-x` flag's verbose output. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-12-31 16:07:22 +08:00
Zhen Ye	ca8740c7c0	fix: remove redundant log (#46695) issue: #45841 - The C++ log packs a multi-line message into a single debug entry; remove the "\n\t". - remove some logs that carry no useful information. - reduce the frequency of some logs, such as those of ChannelDistManager. - Core invariant: logging is purely observational. This PR only reduces, consolidates, or reformats diagnostic output (removing per-item/noise logs, consolidating batched logs, and converting multi-line log strings) while preserving all control flow, return values, and state mutations across affected code paths. - Removed / simplified logic: deleted low-value per-operation debug/info logs (e.g., ListIndexes, GetRecoveryInfo, GcConfirm, push-to-reorder-buffer, several streaming/wal/debug traces), replaced per-item inline logs with single batched deferred logs in querynodev2/delegator (logExcludeInfo) and CleanInvalid, changed C++ PlanNode ToString() multi-line output to compact single-line bracketed format (removed "\n\t"), and added thresholded interceptor logging (InterceptorMetrics.ShouldBeLogged) and message-type-driven log levels to avoid verbose entries. - Why this does NOT cause data loss or behavioral regression: no function signatures, branching, state updates, persistence calls, or return values were changed. Examples: ListIndexes still returns the same Status/IndexInfos; GcConfirm still constructs and returns resp.GetGcFinished(); Insert and CleanInvalid still perform the same insert/removal operations (only their per-item logging was aggregated); PlanNode ToString changes only affect emitted debug strings. All error handling and control flow paths remain intact. - Enhancement intent: reduce log volume and improve signal-to-noise for debugging by removing redundant, noisy logs and emitting concise, rate-/threshold-limited summaries while preserving necessary diagnostics and original program behavior. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-12-31 15:35:21 +08:00
yihao.dai	5b97cb70a0	enhance: Support delaying scanner startup (#46369) Introduce a ScannerStartupDelay configuration to enable WAL write-only recovery, allowing fence messages to be persisted during primary–secondary switchover when the StreamingNode is trapped in crash loops. issue: https://github.com/milvus-io/milvus/issues/46368 ## Summary by CodeRabbit * New Features * Added a configurable WAL scanner pause/resume and a consumer request flag to optionally ignore pause signals. * Metrics * Added a scanner pause gauge and pause-duration tracking for WAL scanning. * Tests * Added coverage for pause-consumption behavior and cleanup in stream client tests. * Chores * Consolidated flush-all logging into a single field and added a helper for bulk message conversion. --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-12-24 11:53:19 +08:00
Zhen Ye	7d6d279e9c	fix: set enable.auto.commit false to prevent from creating kafka consumer group (#46508) ### User description issue: #46507 We use the assign/unassign API to manage the consumer manually, and the commit operation generates a new consumer group, which is not what we want. So we disable auto commit to avoid it. See also: https://github.com/confluentinc/confluent-kafka-python/issues/250#issuecomment-331377925 ### PR Type Bug fix ### Description - Disable auto-commit in Kafka consumer configuration - Prevents unwanted consumer group creation from manual offset management - Clarifies offset reset behavior with explanatory comments ### Diagram Walkthrough ```mermaid flowchart LR A["Kafka Consumer Config"] --> B["Set enable.auto.commit to false"] B --> C["Prevent auto consumer group creation"] A --> D["Set auto.offset.reset to earliest"] D --> E["Handle deleted offsets gracefully"] ``` ### File Walkthrough Bug fix: builder.go (pkg/streaming/walimpls/impls/kafka/builder.go, +7/-0, https://github.com/milvus-io/milvus/pull/46508/files#diff-4b5635821fdc8b585d16c02d8a3b59079d8e667b2be43a073265112d72701add): Disable auto-commit and add configuration comments - Added `enable.auto.commit` configuration set to `false` to prevent automatic consumer group creation - Added explanatory comments for both `auto.offset.reset` and `enable.auto.commit` settings - Clarifies that manual assign/unassign API is used for consumer management ## Summary by CodeRabbit ## Bug Fixes * Kafka consumer now reads from the earliest available messages and auto-commit has been disabled to support manual offset management. Signed-off-by: chyezh <chyezh@outlook.com>	2025-12-22 21:07:18 +08:00
Zhen Ye	7c575a18b0	enhance: support AckSyncUp for broadcaster, and enable it in truncate api (#46313) issue: #43897 also for issue: #46166 Add an ack_sync_up flag to the broadcast message header, indicating whether the broadcast operation needs to be synced up between the streaming node and the coordinator. If ack_sync_up is false, the broadcast operation is acked once the recovery storage sees the message at the current vchannel, so the fast ack operation can be applied to speed up the broadcast operation. If ack_sync_up is true, the broadcast operation is acked only after the checkpoint of the current vchannel reaches the current message. The fast ack operation cannot be applied in that case, because the ack operation needs to be synced up with the streaming node. For example, if the truncate collection operation wants its ack-once callback to be called after all segments are flushed at the current vchannel, it should set ack_sync_up to true. TODO: the current implementation does not guarantee the ack sync-up semantics; it only guarantees that the FastAck operation will not be applied. The full ack sync-up semantics will wait for 3.0. Only used by the truncate api for now. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-12-17 16:55:17 +08:00
sijie-ni-0214	f51de1a8ab	feat: support TruncateCollection api to clear collection data (#46167 ) issue: https://github.com/milvus-io/milvus/issues/46166 --------- Signed-off-by: sijie-ni-0214 <sijie.ni@zilliz.com>	2025-12-12 10:31:14 +08:00
yihao.dai	f32f2694bc	enhance: Implement new FlushAllMessage and refactor flush all (#45920 ) This PR: 1. Define and implement the new FlushAllMessage. 2. Refactor FlushAll to flush the entire cluster. issue: https://github.com/milvus-io/milvus/issues/45919 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-12-10 19:27:13 +08:00
tinswzy	1917bb720f	enhance: add fallback mechanism for WP when accessing object storage without Condition Write support (#45735 ) related issue: #45733 related [wp issue: #60](https://github.com/zilliztech/woodpecker/issues/60) Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>	2025-12-07 21:59:11 +08:00
Zhen Ye	adbdf916e1	enhance: support proxy DML forward (#45921) issue: #45812 - The 2.6 proxy will try to forward DML to the 2.5 proxy if the streaming service is not ready. Signed-off-by: chyezh <chyezh@outlook.com>	2025-12-01 19:37:10 +08:00
Zhen Ye	8e0ae6433d	fix: LastConfirmedMessageID may be wrong if high concurrent writing (#45873 ) issue: #45872 Signed-off-by: chyezh <chyezh@outlook.com>	2025-11-27 12:01:07 +08:00
tinswzy	1427825133	enhance: improve WAL retention strategy (#45350) issue: #44369 woodpecker related [issue: #59](https://github.com/zilliztech/woodpecker/issues/59) Refactor the WAL retention logic in Milvus StreamingNode: - Remove the simple sampling-based truncation mechanism. - After flush, WAL data is directly truncated. - The retention control is now delegated to the underlying message queue (MQ) implementation. Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>	2025-11-23 21:41:05 +08:00
Zhen Ye	40e2042728	enhance: add more metrics for DDL framework (#45558 ) issue: #43897 --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-11-14 15:19:37 +08:00
junjiejiangjjj	102481e53f	feat: Support add_function/alter_function/drop_function (#44895 ) https://github.com/milvus-io/milvus/issues/44053 Signed-off-by: junjie.jiang <junjie.jiang@zilliz.com>	2025-11-13 20:53:39 +08:00
Xiaofan	a9895bb904	enhance: add robust handle etcd server crash (#45304) related to #45303 Fix: the milvus pod may restart when the etcd pod starts. Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>	2025-11-13 10:23:36 +08:00
Zhen Ye	b7fb8ed38c	fix: use the right resource key lock for ddl and use new ddl in transfer replica (#45506) issue: #45452 - alias/rename related DDL should use a database-level exclusive lock - an alias cannot be used as the resource key of a lock; use the collection name instead - transfer replica should use the WAL-based framework Signed-off-by: chyezh <chyezh@outlook.com>	2025-11-12 19:01:38 +08:00
Zhen Ye	4797bb6ab2	fix: wrong update timetick of collection meta info (#45461 ) issue: #45403, #45463 - fix the Nightly E2E failures. - fix the wrong update timetick of altering collection to fix the related load failure. Signed-off-by: chyezh <chyezh@outlook.com>	2025-11-11 16:01:36 +08:00
Zhen Ye	31a609c21d	fix: kafka should auto reset the offset from earliest to read (#45237) issue: #44172, #45210, #44851 Kafka auto-resets the offset to "latest" if the offset is out of range, and the recovery of the milvus wal then cannot read any messages. So once the offset is out of range, kafka should read from "earliest" to pick up the latest uncleared data. https://kafka.apache.org/documentation/#consumerconfigs_auto.offset.reset Signed-off-by: chyezh <chyezh@outlook.com>	2025-11-03 21:07:33 +08:00
Zhen Ye	00d8d2c33d	enhance: support load/release collection/partition with WAL-based DDL framework (#45154 ) issue: #43897 - Load/Release collection/partition is implemented by WAL-based DDL framework now. - Support AlterLoadConfig/DropLoadConfig in wal now. - Load/Release operation can be synced by new CDC now. - Refactor some UT for load/release DDL. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-11-02 18:39:32 +08:00
Zhen Ye	309d564796	enhance: support collection and index with WAL-based DDL framework (#45033 ) issue: #43897 - Part of collection/index related DDL is implemented by WAL-based DDL framework now. - Support following message type in wal, CreateCollection, DropCollection, CreatePartition, DropPartition, CreateIndex, AlterIndex, DropIndex. - Part of collection/index related DDL can be synced by new CDC now. - Refactor some UT for collection/index DDL. - Add Tombstone scheduler to manage the tombstone GC for collection or partition meta. - Move the vchannel allocation into streaming pchannel manager. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-10-30 14:24:08 +08:00
Zhen Ye	ce164db1f3	fix: wal state may be inconsistent after recovering from crash (#45092) issue: #45088, #45086 - Message on control channel should trigger the checkpoint update. - LastConfirmedMessageID should be recovered from the minimum of the checkpoint and the LastConfirmedMessageID of uncommitted txn. - Add more log info for wal debugging. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-10-29 16:26:10 +08:00
Zhen Ye	2aa48bf4ca	fix: wrong execution order of DDL/DCL on secondary (#44886) issue: #44697, #44696 - The DDL execution order on the secondary now follows the order of the control channel timetick. - Filter control channel operations in the shard manager of streamingnode to avoid creating segments on the wrong vchannel. - Fix the immutable txn message losing its replicate header. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-10-21 22:38:05 +08:00
Zhen Ye	8bf7d6ae72	enhance: refactor update replicate config operation using wal-broadcast-based DDL/DCL framework (#44560 ) issue: #43897 - UpdateReplicateConfig operation will broadcast AlterReplicateConfig message into all pchannels with cluster-exclusive-lock. - Begin txn message will use commit message timetick now (to avoid timetick rollback when CDC with txn message). - If current cluster is secondary, the UpdateReplicateConfig will wait until the replicate configuration is consistent with the config replicated from primary. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-10-15 15:26:01 +08:00
tinswzy	f342f49b32	enhance: add support for Azure Blob Storage in wp (#44592 ) #44485 add support for blob in woodpecker #43638 upgrade wp v0.1.6 related wp [issue#11](https://github.com/zilliztech/woodpecker/issues/11 ) Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>	2025-09-29 09:51:44 +08:00
Zhen Ye	19e5e9f910	enhance: broadcaster will lock resource until message acked (#44508) issue: #43897 - Return LastConfirmedMessageID from the wal append operation. - Add resource-key-based locker for broadcast-ack operation to protect the coord state when executing ddl. - Resource-key-based locker is held until the broadcast operation is acked. - ResourceKey supports shared and exclusive locks. - Add FastAck, which executes the ack right after the broadcast is done, to speed up ddl. - Ack callback now supports the broadcast message result. - Add tombstone for broadcaster to avoid repeatedly committing DDL and the ABA issue. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-09-24 20:58:05 +08:00
Zhen Ye	c171280f63	enhance: support replicate message in wal. (#44456 ) issue: #44123 - support replicate message in wal of milvus. - support CDC-replicate recovery from wal. - fix some CDC replicator bugs Signed-off-by: chyezh <chyezh@outlook.com>	2025-09-22 17:06:11 +08:00
tinswzy	c7f21d5a06	enhance: purge small files right after wp segment compaction (#44473 ) #43638 improve wp log output [wp#43](https://github.com/zilliztech/woodpecker/issues/43) intro purge small files right after segment compaction [wp#47](https://github.com/zilliztech/woodpecker/issues/47) The rootpath configured by milvus is uniformly used as the base for wp local fs storage. update to v0.1.5 Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>	2025-09-21 16:32:01 +08:00
Zhen Ye	ba289891c0	enhance: add all ddl message into messages (#44407) issue: #43897 - add ddl messages proto and add some message utilities. - support shared/exclusive resource-key-lock. - add all ddl callbacks future into broadcast registry. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-09-18 10:08:00 +08:00
yihao.dai	51f69f32d0	feat: Add CDC support (#44124 ) This PR implements a new CDC service for Milvus 2.6, providing log-based cross-cluster replication. issue: https://github.com/milvus-io/milvus/issues/44123 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com> Signed-off-by: chyezh <chyezh@outlook.com> Co-authored-by: chyezh <chyezh@outlook.com>	2025-09-16 16:32:01 +08:00
Zhen Ye	cbe4c3d231	enhance: get cchannel before build message (#44229 ) issue: #43897 - support never expire txn message. Signed-off-by: chyezh <chyezh@outlook.com>	2025-09-10 11:09:57 +08:00
Zhen Ye	9e2d1963d4	enhance: support cchannel for streaming service (#44143 ) issue: #43897 - add cchannel as a special vchannel to hold some ddl and dcl. Signed-off-by: chyezh <chyezh@outlook.com>	2025-09-02 10:05:52 +08:00
Zhen Ye	3327df72e4	enhance: make immutable message as the param of ack operation for cdc (#43900) issue: #43897 - The original broadcast ack operation needs to recover the message from etcd, which cannot support cdc. - The immutable message is now passed as the ack parameter to fix it. Signed-off-by: chyezh <chyezh@outlook.com>	2025-09-01 10:21:52 +08:00
XuanYang-cn	37a447d166	feat: Add CMEK cipher plugin (#43722 ) 1. Enable Milvus to read cipher configs 2. Enable cipher plugin in binlog reader and writer 3. Add a testCipher for unittests 4. Support pooling for datanode 5. Add encryption in storagev2 See also: #40321 Signed-off-by: yangxuan <xuan.yang@zilliz.com> --------- Signed-off-by: yangxuan <xuan.yang@zilliz.com>	2025-08-27 11:15:52 +08:00
Zhen Ye	5bdc593b8a	enhance: use v0.15.1 official pulsar client and add logging for pulsar client (#43913) issue: #43785 - pulsar client now prints its logs into the milvus logger. - pulsar client enables metrics by default. - upgrade the pulsar client to v0.15.1 and use the official repo. - the fixes in milvus-io/pulsar-client-go are already covered by official v0.15.1. Signed-off-by: chyezh <chyezh@outlook.com>	2025-08-26 16:45:53 +08:00
Zhen Ye	d0e3a33c37	enhance: add IsRebalanceSuspended interface for wal balancer (#44026 ) issue: #43968 Signed-off-by: chyezh <chyezh@outlook.com>	2025-08-24 09:19:47 +08:00
Zhen Ye	082ca62ec1	enhance: support balancer interface for streaming client to fetch streaming node information (#43969 ) issue: #43968 - Add ListStreamingNode/GetWALDistribution to fetch streaming node info - Add SuspendRebalance/ResumeRebalance to enable or stop balance - Add FreezeNodeIDs/DefreezeNodeIDs to freeze target node Signed-off-by: chyezh <chyezh@outlook.com>	2025-08-21 15:55:47 +08:00
Zhen Ye	f5cee0012a	fix: remove panic for message type in recovery storage and marshal log (#43976 ) issue: #43897 Signed-off-by: chyezh <chyezh@outlook.com>	2025-08-21 14:23:47 +08:00

135 Commits (17532517c611cafe5ec7a79bda47c9f296e82682)