## Summary
- Replace the hardcoded thread pool max-threads limit (16) with a
configurable parameter `common.threadCoreCoefficient.maxThreadsSize`
(default 16), effective only when greater than 0
- Add missing Resize watchers for middle and low priority thread pools
- When `maxThreadsSize` changes dynamically, update the limit first then
resize all pools to ensure correct ordering
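The "effective only when greater than 0" rule above can be sketched as follows. This is a minimal illustration with hypothetical names (`effectiveMaxThreads`, `defaultMaxThreads`), not the actual paramtable plumbing:

```go
package main

import "fmt"

// defaultMaxThreads mirrors the previously hardcoded limit of 16.
const defaultMaxThreads = 16

// effectiveMaxThreads applies the configured maxThreadsSize only when it
// is greater than 0; otherwise the built-in default stays in effect.
func effectiveMaxThreads(configured int) int {
	if configured > 0 {
		return configured
	}
	return defaultMaxThreads
}

func main() {
	fmt.Println(effectiveMaxThreads(0))  // 16: non-positive value is ignored
	fmt.Println(effectiveMaxThreads(32)) // 32: override takes effect
}
```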
issue: #48378
Signed-off-by: sijie-ni-0214 <sijie.ni@zilliz.com>
## Summary
- HashTable now dynamically rehashes (doubles capacity) when the load
factor exceeds 7/8, fixing a crash when GROUP BY cardinality exceeds
~1792
- Added configurable `queryNode.segcore.maxGroupByGroups` (default 100K)
to cap total groups and prevent OOM on both C++ (per-segment HashTable)
and Go (cross-segment agg reducer) layers
- Added 4 C++ unit tests covering rehash basic/correctness, max groups
limit, and multiple rehash rounds
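The load-factor rule above can be sketched as follows; `needsRehash` and `nextCapacity` are illustrative names, not the actual segcore HashTable code:

```go
package main

import "fmt"

// needsRehash reports whether size/capacity exceeds 7/8, using the
// integer-safe form 8*size > 7*capacity.
func needsRehash(size, capacity int) bool {
	return 8*size > 7*capacity
}

// nextCapacity doubles the table capacity, matching the rehash strategy
// described above.
func nextCapacity(capacity int) int {
	return capacity * 2
}

func main() {
	// With a fixed capacity of 2048, 7/8 * 2048 = 1792 entries is the
	// threshold that matches the ~1792-group crash described above.
	fmt.Println(needsRehash(1792, 2048)) // false: exactly 7/8
	fmt.Println(needsRehash(1793, 2048)) // true: exceeds 7/8
	fmt.Println(nextCapacity(2048))      // 4096
}
```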
issue: #47569
## Test plan
- [ ] C++ unit tests: `--gtest_filter="*HashTableRehash*:*MaxGroups*"`
- [ ] E2E: GROUP BY aggregation with >2K unique values should succeed
- [ ] E2E: Set `queryNode.segcore.maxGroupByGroups` to small value,
verify clear error message
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Signed-off-by: MrPresent-Han <chun.han@gmail.com>
Co-authored-by: MrPresent-Han <chun.han@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
### Summary
Follow-up to #48152, which applied denylist retry to parquet/json/csv
imports but missed two other paths.
- **fix(High)**: `pack_writer.go` `writeLog` now skips retry only for
non-retryable errors (permission denied, bucket not found, invalid
credentials, etc.), matching the denylist strategy in
`retryable_reader.go`.
- **fix(Medium)**: Binlog import's `WithDownloader` callbacks now use
`multiReadWithRetry`, skipping retry only for non-retryable errors.
Previously, none of these transient failures were retried.
- **fix(Low)**: `IsMilvusError` in `merr/utils.go` switched from
`errors.Cause` (root only) to `errors.As` (full chain traversal).
### Out of Scope
- `pack_writer_v2.go` / `pack_writer_v3.go` — same retry pattern but
different code path (multi-part upload); separate fix.
- `writeDelta` — no retry wrapper; separate concern.
issue: #48153
---------
Signed-off-by: Yihao Dai <yihao.dai@zilliz.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Add force_promote flag to UpdateReplicateConfiguration API for disaster
recovery.
Changes:
- Add ForcePromote field to UpdateReplicateConfigurationRequest
- Refactor UpdateReplicateConfiguration to accept request object instead
of separate params
- Add WithForcePromote() method to ReplicateConfigurationBuilder
- Implement force promote validation and handling in assignment service
- Add integration tests for force promote scenarios
Force promote allows a secondary cluster to immediately become a
standalone primary when the original primary is unavailable, enabling
active-passive failover.
issue: https://github.com/milvus-io/milvus/issues/47351
design doc:
https://github.com/milvus-io/milvus-design-docs/blob/main/design_docs/20260202-force_promote_failover.md
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Signed-off-by: Yihao Dai <yihao.dai@zilliz.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
issue: #48389
Previously, all old version (v0) WAL messages shared the same
lastConfirmedMessageID pointing to the very first v0 message. When a
tailing scanner fell back to catchup mode (e.g., due to WAL ownership
change), it would restart from this extremely old position, causing
catchup times of 14+ minutes during which tsafe could not advance and
all search requests would timeout.
This change replaces the fixed first-message ID with a configurable
sliding window (default size 30). The lastConfirmedMessageID now points
to the message N positions back, bounding the WAL replay distance on
fallback to at most N messages.
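The sliding window above can be sketched as a ring buffer of the last N appended message IDs. Names and the `int64` ID type are illustrative, not the actual WAL code:

```go
package main

import "fmt"

// confirmedWindow keeps the IDs of the last N appended messages so that
// lastConfirmed can point N positions back instead of to the very first
// message.
type confirmedWindow struct {
	ids  []int64 // ring buffer of the last N message IDs
	size int
	next int
	full bool
}

func newConfirmedWindow(n int) *confirmedWindow {
	return &confirmedWindow{ids: make([]int64, n), size: n}
}

// Append records a message ID and returns the ID to use as
// lastConfirmed: the message N positions back once the window is full,
// otherwise the oldest message seen so far.
func (w *confirmedWindow) Append(id int64) int64 {
	if w.full {
		// The slot about to be overwritten is exactly N messages back.
		old := w.ids[w.next]
		w.ids[w.next] = id
		w.next = (w.next + 1) % w.size
		return old
	}
	w.ids[w.next] = id
	w.next = (w.next + 1) % w.size
	if w.next == 0 {
		w.full = true
	}
	return w.ids[0]
}

func main() {
	w := newConfirmedWindow(3)
	for id := int64(1); id <= 5; id++ {
		fmt.Println(id, "->", w.Append(id))
	}
	// Replay distance on fallback is now bounded by the window size (3),
	// not by the age of the first message.
}
```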
Signed-off-by: chyezh <chyezh@outlook.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ent deadlock
related: #48308
When hundreds of channels need rebalancing (e.g. during upgrade),
channel tasks could fill the entire per-node executor pool, blocking
segment/leader tasks and causing a deadlock. This fix splits the single
pool into two independent pools:
- Channel task pool: controlled by channelTaskCapFraction (default 0.1)
- Non-channel task pool: remainder of total capacity
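The capacity split can be sketched as below; `splitPoolCapacity` and its flooring behavior are illustrative assumptions, not the actual executor code:

```go
package main

import "fmt"

// splitPoolCapacity divides the total per-node executor capacity into a
// channel-task pool (a configurable fraction, default 0.1) and a
// non-channel pool taking the remainder, so channel tasks can never fill
// the whole pool and starve segment/leader tasks.
func splitPoolCapacity(total int, channelFraction float64) (channelCap, otherCap int) {
	channelCap = int(float64(total) * channelFraction)
	if channelCap < 1 {
		channelCap = 1 // always leave room for at least one channel task
	}
	return channelCap, total - channelCap
}

func main() {
	ch, other := splitPoolCapacity(256, 0.1)
	fmt.Println(ch, other) // 25 231
}
```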
Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
issue: #48239
Problem:
During replica scale-up/down, RG name changes cause cross-RG node
migration. The async resource_observer uses non-deterministic map
iteration to select which nodes go to which RG, breaking the original
node-to-replica mapping. This can leave a newly created replica without
QN nodes, causing query unavailability when the old replicas are
released.
Root cause:
1. ReassignReplicaToRG iterates maps non-deterministically, so the
existing replica may be transferred to any RG rather than the
lexicographically smallest one.
2. When UpdateResourceGroups zeroes an RG and creates new ones, nodes
transition through __recycle_resource_group with non-deterministic
selection at each hop, scrambling the original node assignment.
Fix:
1. Sort RG iteration in ReassignReplicaToRG and resource_observer to
ensure deterministic replica-to-RG mapping: existing replica goes to
lex-smallest RG during scale-up, lex-smallest RG's replica is preserved
during scale-down.
2. Add transferNodesOnRGSwap in updateResourceGroups: when a batch
update zeroes an RG (old request=N,limit=N -> new 0,0) and another RG
takes the same config (new request=N,limit=N), directly swap their node
lists before persisting, bypassing the async observer entirely.
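Go map iteration order is deliberately randomized, which is the non-determinism described above. A minimal sketch of fix 1: iterate resource groups in sorted key order so the lexicographically smallest RG is always chosen first. The map shape is illustrative:

```go
package main

import (
	"fmt"
	"sort"
)

// sortedRGNames returns resource group names in lexicographic order,
// replacing non-deterministic map iteration with a stable ordering.
func sortedRGNames(rgs map[string][]int64) []string {
	names := make([]string, 0, len(rgs))
	for name := range rgs {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	rgs := map[string][]int64{"rg_b": {3, 4}, "rg_a": {1, 2}, "rg_c": {5}}
	names := sortedRGNames(rgs)
	fmt.Println(names)    // [rg_a rg_b rg_c]
	fmt.Println(names[0]) // rg_a: deterministic target for the existing replica
}
```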
Extra fix:
- Skip DQL forwarding to the legacy proxy in cluster mode to avoid
overloading the legacy proxy. The GetShardLeader API proto is only lost
in standalone mode.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
issue: #47178
This commit introduces a rate limiting mechanism for Write-Ahead Logging
(WAL) operations to prevent overload during high traffic. Key changes
include:
- Added `RateLimitObserver` to monitor and control the rate of DML
operations.
- Added an adaptive `RateLimitController` to apply the rate-limiting
strategy.
- The WAL slows down when the recovery storage is in catchup mode or
node memory usage is high.
- Updated `WAL` and related components to handle rate limit states,
including rejection and slowdown.
- Introduced new error codes for rate limit rejection in the streaming
error handling.
- Enhanced tests to cover the new rate limiting functionality.
These changes aim to improve the stability and performance of the
streaming service under load.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
When a ParamItem's primary key equals DefaultValue and a FallbackKey has
a different value, getWithRaw() overwrote `raw` with the fallback value.
CASCachedValue then re-reads the primary key and compares it against
`raw` — mismatch causes CAS to permanently fail. The cache is never
populated, forcing every call through the write-lock path and causing
goroutine contention on proxy search hot paths.
Introduce `effectiveRaw` to separate the CAS comparison value (always
the primary key's raw value) from the computation value (may come from
fallback), so CAS succeeds and the cache is properly populated.
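A simplified model of the bug and fix: the cache-comparison value must stay the primary key's raw value, while the fallback value is carried separately. This is a sketch with hypothetical types, not the actual paramtable code:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// entry separates the comparison value from the computed value, mirroring
// the effectiveRaw split described above.
type entry struct {
	raw   string // comparison value: always the primary key's raw value
	value string // computed value: may come from the fallback key
}

type cachedParam struct {
	cached atomic.Value // holds the latest entry
}

// get returns the cached value when the primary key's raw value still
// matches; otherwise it recomputes and repopulates the cache. Before the
// fix, `raw` was overwritten with the fallback value, so this comparison
// never matched and every call took the slow path.
func (p *cachedParam) get(primaryRaw, effectiveRaw string) string {
	if v, ok := p.cached.Load().(entry); ok && v.raw == primaryRaw {
		return v.value // fast path: no write-lock contention
	}
	e := entry{raw: primaryRaw, value: effectiveRaw}
	p.cached.Store(e)
	return e.value
}

func main() {
	var p cachedParam
	// Primary key holds the default; the fallback key supplies the value.
	fmt.Println(p.get("default", "fallback-val")) // populates the cache
	fmt.Println(p.get("default", "fallback-val")) // cache hit
}
```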
Related to #48312
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
## Summary
- **TestCheckCtxValid**: replace `time.Sleep(25ms)` with
`assert.Eventually` to eliminate timing dependency — 5ms margin between
sleep and context deadline was too tight on busy CI nodes
(build-ut-ciloop #91)
- **TestProxy/TestProxyRpcLimit**: bind to `localhost:0` instead of
hardcoded port 19530 to avoid TCP TIME_WAIT conflicts when ciloop reruns
tests (build-ut-ciloop #123)
## Test plan
- [x] TestCheckCtxValid: 500/500 passed post-fix
- [ ] TestProxy: syntax verified (gofmt OK), CI will validate full
compilation
issue: https://github.com/milvus-io/milvus/issues/48118
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Signed-off-by: Li Liu <li.liu@zilliz.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
## Summary
Fixes #47540
- **Bug 1 (MixCoord panic):** `compactionInspector.analyzeScheduler` was
never initialized in production code. When FloatVector is used as
ClusteringKey, the `doAnalyze()` path calls `analyzeScheduler.Enqueue()`
on a nil pointer → panic. Fixed by passing `analyzeScheduler` to
`newCompactionInspector`.
- **Bug 2 (DataNode round-robin panic):** `validateClusteringKey()` had
no field type whitelist, so JSON/Bool/Array passed schema validation.
During clustering compaction, `NewScalarFieldValue()` panics on
unsupported types. Fixed by adding `IsClusteringKeyType()` check to
reject unsupported types at collection creation time.
## Test plan
- [x] `TestIsClusteringKeyType` — verifies supported/unsupported type
classification
- [x] `TestClusteringKey` — new sub-tests for JSON, Bool, Array as
ClusteringKey (all rejected)
- [ ] Existing `TestClusteringKey` sub-tests (normal, multiple keys,
vector key) still pass
- [ ] `TestCompactionPlanHandler*` tests pass with updated
`newCompactionInspector` signature
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
issue: #48011
## Summary
- Add a new collection-level property `bigtopk_optimization.enabled`
that allows collections to use significantly higher TopK limits (up to
1M by default vs 16384) for search and query operations
- When enabled, auto-index creation selects an IVF-based index type
(configurable via `autoIndex.params.bigTopK.build`, default IVF_SQ8)
instead of the default HNSW, which is better suited for large TopK
retrieval scenarios
- Introduce dedicated quota parameters (`quotaAndLimits.limits.bigTopK`
and `quotaAndLimits.limits.bigMaxQueryResultWindow`) to control the
relaxed limits independently
- The property can be set at collection creation or via alter
collection, but changing it requires dropping any existing vector index
first
- This PR also fixes a partition-key isolation alter bug where unrelated
properties could affect the isolation's vector index check
---------
Signed-off-by: chasingegg <chao.gao@zilliz.com>
issue: #47902
Integrate milvus-common commit 7b54b6e which adds
CacheWarmupPolicy_Async and a prefetch thread pool for background cache
warmup. This enables segments to be marked as loaded immediately while
cache warming happens asynchronously, reducing load latency.
---------
Signed-off-by: Shawn Wang <shawn.wang@zilliz.com>
related: #48061
related: #48062
related: #48137
- Reconstruct logical etcd keys to avoid double rootPath prefix in
delete/migrate operations
- Use typeutil.After for privilege name extraction instead of broken
length-based substring
- Match wildcard dbName and use DefaultTenant consistently
- Move grant cleanup to DropCollection ack callback
- Enable resolveAliasForPrivilege by default and fix alias cache
- passed rbac alias e2e tests
---------
Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
Signed-off-by: Li Liu <li.liu@zilliz.com>
Co-authored-by: Li Liu <li.liu@zilliz.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Related to #44956
Previously, V3 deltalogs were written to
{rootPath}/delta_log/{collID}/{partID}/{segID}/{logID}, separate from
the segment's basePath. The manifest used complex "../" relative paths
to bridge the two locations. This change writes V3 deltalogs directly to
{basePath}/_delta/{logID}, aligning with the C loon library's native
_delta/ convention and simplifying the manifest relative path to just
the logID filename. Legacy V1 segments and existing manifests remain
backward compatible.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
### **User description**
issue: https://github.com/milvus-io/milvus/issues/44011
Updated the DescribeCollection functionality to exclude the namespace
field from the response schema. This change ensures that this field is
not returned in the collection description.
___
### **PR Type**
Bug fix, Enhancement
___
### **Description**
- Filter namespace field from DescribeCollection response schema
- Add support for PartitionKeySortCompaction in compaction task handling
- Fix lambda capture issues in C++ expression evaluation code
- Add comprehensive test coverage for namespace field filtering
___
### Diagram Walkthrough
```mermaid
flowchart LR
A["DescribeCollection Request"] --> B["Filter Fields"]
B --> C["Exclude Dynamic Fields"]
B --> D["Exclude Namespace Field"]
C --> E["Response Schema"]
D --> E
F["Compaction Task"] --> G["Create/Complete Task"]
G --> H["Support PartitionKeySortCompaction"]
H --> I["Task Execution"]
```
### File Walkthrough

**Bug fix**
- `internal/proxy/service_provider.go` (+2/-1) — Filter namespace field
  from DescribeCollection response
  - Added an import for the `common` package to access the
    `NamespaceFieldName` constant
  - Modified the field filtering logic in `DescribeCollection` to
    exclude the namespace field alongside dynamic fields
  - Updated the filter condition to check both the `IsDynamic` flag and
    field name equality
- `internal/proxy/task.go` (+1/-1) — Exclude namespace field from task
  execution
  - Updated `describeCollectionTask.Execute` to filter out the namespace
    field from the response schema
  - Modified the condition to skip fields that are either dynamic or
    have the namespace field name, ensuring the namespace field is not
    included in collection description results
- `internal/core/src/exec/expression/Expr.cpp` (+16/-16) — Fix lambda
  capture and remove duplicate includes
  - Removed a duplicate include of `expr/ITypeExpr.h` and an unused
    include of `fmt/format.h`
  - Fixed lambda capture issues in `SetNamespaceSkipIndex` by changing
    from reference capture `[&]` to explicit pointer and value captures,
    avoiding dangling references
  - Extracted the namespace field ID and value as const variables for
    safer lambda capture

**Tests**
- `internal/proxy/service_provider_test.go` (+65/-0) — Add test for
  namespace field filtering
  - New test
    `TestCachedProxyServiceProvider_DescribeCollection_FilterNamespaceField`
    verifies namespace and dynamic fields are filtered while user fields
    are preserved, using a mock cache to simulate collection metadata
    with a namespace field
- `internal/proxy/task_test.go` (+90/-0) — Add namespace field filtering
  test case
  - New test `TestDescribeCollectionTask_FilterNamespaceField` creates a
    schema with a namespace field as partition key and a dynamic
    metadata field, verifies both are excluded from the describe result,
    and confirms user fields like `id` and `fvec` are included

**Enhancement**
- `internal/datacoord/compaction_inspector.go` (+1/-1) — Support
  PartitionKeySortCompaction in task creation
  - Added `datapb.CompactionType_PartitionKeySortCompaction` to the
    switch case in `createCompactTask`, routing it to the
    `newMixCompactionTask` handler alongside MixCompaction and
    SortCompaction
- `internal/datacoord/meta.go` (+1/-1) — Handle
  PartitionKeySortCompaction in mutation completion
  - Added `datapb.CompactionType_PartitionKeySortCompaction` to the
    switch case in `CompleteCompactionMutation`, routing it to the
    `completeSortCompactionMutation` handler
___
- Core invariant: schemapb.CollectionSchema.EnableNamespace is the
authoritative per-collection flag for namespace behavior — it is
persisted on Collection models, copied into DescribeCollection
responses, and used at runtime to derive
datapb.SegmentInfo.IsNamespaceSorted and
datapb.CompactionSegmentBinlogs/CompactionSegment.IsNamespaceSorted
(i.e., per-collection EnableNamespace → per-segment IsNamespaceSorted).
- Removed/simplified logic: eliminated property-based namespace toggles
and partition-key-sort special-cases (TriggerTypePartitionKeySort /
TriggerTypeClusteringPartitionKeySort and
IsPartitionKeySortCompactionEnabled) and removed the dual namespace-skip
API in expression evaluation (SetNamespaceSkipFunc /
SetNamespaceSkipIndex); routing now reuses existing Mix/Sort/Clustering
handlers and uniformly treats IsNamespaceSorted alongside IsSorted in
compaction, index, and inspection checks.
- Why no data loss or behavior regression: wire compatibility preserved
(SegmentInfo field number unchanged when renaming
is_partition_key_sorted → is_namespace_sorted), binlog/segment IDs,
manifest paths and storage versions are unchanged, and compaction
requests only add/pass an extra boolean flag — no binlog content,
identifiers, or storage layout are modified; DescribeCollection now only
changes user-facing filtering of schema fields (internal stored schema
unchanged) and unit tests were added for filtering to prevent
regressions.
- Bug fix (milvus-io/milvus#44011): fixes leakage of internal
namespace/meta fields in DescribeCollection by filtering out fields with
IsDynamic==true and the Namespace field name (common.NamespaceFieldName)
and by propagating EnableNamespace through
DescribeCollectionTask.Execute; tests added in
internal/proxy/service_provider_test.go and internal/proxy/task_test.go
validate the fix.
---------
Signed-off-by: sunby <sunbingyi1992@gmail.com>
Related to #44956
Enable useLoonFFI config by default and extend DiskFileManagerImpl and
MemFileManagerImpl to handle STORAGE_V3 using the same code paths as
STORAGE_V2 for caching raw data, optional fields to disk and memory.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
## Summary
- Add `BuildTLSConfig` helper and `TLSConfig` field to SDK
`ClientConfig` for mTLS support
- Add `GetClusterTLSConfig(clusterID)` for dynamic per-cluster
paramtable lookup via
`tls.clusters.<clusterID>.{caPemPath,clientPemPath,clientKeyPath}`
- CDC `NewMilvusClient` reads per-cluster TLS config by target cluster
ID, enabling different certs per target cluster
All target clusters' certs are pre-configured on every node, so CDC
topology switchover (e.g., A→B,C to B→A,C) works without process
restart.
## Test plan
- [x] Unit tests for `BuildTLSConfig` (valid certs, missing CA, invalid
cert pair)
- [x] Unit tests for `GetClusterTLSConfig` (per-cluster lookup, missing
config)
- [x] Unit tests for `buildCDCTLSConfig` (no config, partial config,
invalid CA, per-cluster isolation)
issue: #47843
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
related: https://github.com/milvus-io/milvus/issues/47850
1. Privilege interceptor: resolve collection alias to real collection
name
before RBAC permission check, so that operating via alias checks
permission against the real collection, not the alias name.
2. MetaCache alias cache: add aliasInfo cache with positive/negative
entries to avoid repeated DescribeAlias RPC calls. Cache is
invalidated on alias removal, collection removal, and database removal.
3. Catalog grant cleanup: add DeleteGrantByCollectionName and
MigrateGrantCollectionName to RootCoordCatalog interface and kv
implementation. On collection drop, delete all associated grants;
on collection rename, migrate grants to the new name.
4. Feature flag: add proxy.resolveAliasForPrivilege config to
enable/disable alias resolution in the privilege interceptor.
---------
Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
Related to https://github.com/milvus-io/milvus/issues/44999
Currently Milvus doesn't allow users to control the TLS version used
when connecting to object storage (MinIO/S3/Azure/GCP). Some
environments require enforcing TLS 1.3 for compliance, but there's no
way to set that today.
This adds a new config option `minio.ssl.tlsMinVersion` that lets users
specify the minimum TLS version ("1.0", "1.1", "1.2", "1.3", or
"default"). It works across all supported storage backends including
MinIO/S3, Azure Blob, and GCP native. The setting is plumbed through
paramtable, proto StorageConfig, and all the places that create storage
clients (compaction, datacoord, datanode, storagev2, etc.).
For the GCP native backend, this also adds proper UseIAM/ADC support
that was previously missing, since the TLS transport injection needed to
handle both credential modes correctly.
Also fixed the GCP MinIO-compatible path to reuse any custom transport
(e.g. with TLS config) as the backend for the OAuth2 token wrapping,
instead of always creating a new default transport.
Unit tests cover the TLS version parsing, HTTP client construction, and
version enforcement (proving a TLS 1.3 client correctly rejects a TLS
1.2-only server). Integration tests are included but gated behind
environment variables.
Signed-off-by: jiaqizho <jiaqi.zhou@zilliz.com>
design doc:
https://github.com/milvus-io/milvus-design-docs/blob/main/design_docs/20260105-external_table.md
issue: https://github.com/milvus-io/milvus/issues/45881
## Summary
- Pre-allocate segment IDs in DataCoord, pass to DataNode for direct
final-path manifest writes (eliminating two-phase ID workflow)
- Add FFI bridges for file exploration (`ExploreFiles`, `GetFileInfo`)
and manifest creation (`CreateManifestForSegment`,
`ReadFragmentsFromManifest`)
- Implement fragment-to-segment balancing with configurable target rows
per segment
- Add `ExternalSpec` parser for external data format configuration
- Extend `UpdateExternalCollectionRequest` proto with schema, storage
config, and pre-allocated segment ID fields
- Add E2E test for external collection refresh with data verification
> **Note**: This PR includes Part3 changes (PR #47303). After Part3 is
merged, this PR will be rebased to only contain Part4-specific changes.
## Test plan
- [x] Unit tests for `task_refresh_external_collection.go` (28 tests)
- [x] Unit tests for `task_update.go` and fragment utilities (40 tests)
- [x] Unit tests for FFI bridges (`exttable_test.go`, 9 tests)
- [x] Unit tests for `ExternalSpec` parser
- [x] Unit tests for paramtable config
- [x] Integration test with real Parquet files
- [x] `make lint-fix` passes
- [ ] E2E test with MinIO backend
---------
Signed-off-by: Jiquan Long <jiquan.long@zilliz.com>
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #47938
- Updated milvus.yaml to include configuration for Yandex Cloud model
service.
- Implemented CreateYCEmbeddingServer function to mock Yandex Cloud
embedding service.
- Added support for YC provider in text embedding function logic.
- Enhanced error handling for unsupported providers.
- Added unit tests for Yandex Cloud embedding functionality and its
disabled state.
Design documents location:
https://github.com/milvus-io/milvus-design-docs/blob/main/design_docs/20260227-yc-text-embedding-provider.md
---------
Signed-off-by: edddoubled <vrs_2.1@yandex.ru>
Co-authored-by: edddoubled <vrs_2.1@yandex.ru>
For node connections (isNode=true), ServerIDMismatch now returns
needRetry=false immediately instead of retrying 10 times with
exponential backoff (~52.6s). Retrying is futile because the NodeID
injected via the interceptor at connection time never changes during
retry. Coord connections keep existing retry behavior.
Also increase streaming.walBalancer.operationTimeout default from 30s to
30m.
issue: #46182
Signed-off-by: chyezh <chyezh@outlook.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- Allow pchannel count increase (append-only) in
`validateClusterConsistency` for CU scaling
- Existing pchannels must be preserved at the same positions; decrease
and reorder are still rejected
- Equal pchannel count across clusters is still enforced in
`validateClusterBasic`
## Test plan
- [x] Unit tests pass for `pkg/util/replicateutil/...`
- [x] Verified pchannel increase (append) is accepted
- [x] Verified pchannel decrease is rejected
- [x] Verified pchannel reorder/replace is rejected
issue: https://github.com/milvus-io/milvus/issues/47791
🤖 Generated with [Claude Code](https://claude.com/claude-code)
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
- Add `streaming.replication.skipMessageTypes` config parameter
(comma-separated message type names, default
`AlterResourceGroup,DropResourceGroup`)
- On the secondary side, `overwriteReplicateMessage()` checks incoming
message type against skip set and returns `IgnoreOperation`, which
`ReplicateStreamServer` already handles gracefully
- Configurable at runtime (refreshable)
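Parsing the comma-separated config value into a skip set can be sketched as below; the helper name is illustrative:

```go
package main

import (
	"fmt"
	"strings"
)

// parseSkipSet turns a comma-separated list of message type names into a
// set for O(1) lookup in the replication path.
func parseSkipSet(cfg string) map[string]struct{} {
	set := make(map[string]struct{})
	for _, t := range strings.Split(cfg, ",") {
		if t = strings.TrimSpace(t); t != "" {
			set[t] = struct{}{}
		}
	}
	return set
}

func main() {
	set := parseSkipSet("AlterResourceGroup,DropResourceGroup")
	_, skip := set["DropResourceGroup"]
	fmt.Println(skip) // true: replication returns IgnoreOperation
	_, skip = set["Insert"]
	fmt.Println(skip) // false: message is replicated normally
}
```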
issue: https://github.com/milvus-io/milvus/issues/47776
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## Summary
- Add `use_local_replica_config` flag to `AlterLoadConfigMessageHeader`
so the secondary CDC cluster can use its own
`ClusterLevelLoadReplicaNumber` / `ClusterLevelLoadResourceGroups`
instead of blindly applying the primary's replica config
- Secondary's `ReplicateService` sets the flag on every replicated
`AlterLoadConfigMessage`; `LoadCollectionJob` reads local config when
flag is set
- Use `generateReplicas` for idempotent replica generation, ensuring WAL
replay does not create duplicate replicas
- Default to 1 replica in `__default_resource_group` when local config
is not explicitly set (instead of falling back to primary's config)
## Test plan
- [x] Unit test: replicated AlterLoadConfig gets
`UseLocalReplicaConfig=true`
- [x] Unit test: local config overrides primary config when set
- [x] Unit test: defaults to 1 replica in `__default_resource_group`
when local config not set
- [x] Unit test: flag=false uses primary config directly
- [x] Unit test: `getLocalReplicaConfig` first load, idempotent replay,
not set, no RGs, alloc error
issue: #47779
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
## Summary
- Increase the default `maxVectorFieldNum` from **4** to **10** to
accommodate the growing variety of vector types (dense, sparse,
function-based) supported by Milvus
- Update related test constants and hardcoded values in both Go and
Python test suites
## Changes
- `configs/milvus.yaml`: default config value 4 → 10
- `pkg/util/paramtable/component_param.go`: Go param default "4" → "10"
- `tests/go_client/common/consts.go`: Go test constant 4 → 10
- `tests/python_client/common/common_type.py`: Python test constant 4 →
10
- `tests/python_client/milvus_client/test_milvus_client_collection.py`:
replace hardcoded "4" in error message with constant reference
Closes #47402
## Test plan
- [x] Verify collection creation with up to 10 vector fields succeeds
- [x] Verify collection creation with 11+ vector fields fails with
proper error message
- [x] Run existing Go integration tests (`tests/go_client`)
- [x] Run existing Python client tests (`tests/python_client`)
Signed-off-by: zhuwenxing <wenxing.zhu@zilliz.com>
issue: #47647
Refactor the cluster-level broadcast mechanism to decouple the message
package from the channel registration lifecycle:
- Replace internal provider pattern with opaque ClusterChannels type
passed externally to WithClusterLevelBroadcast()
- Add channel package singleton (syncutil.Future) exposing
GetClusterChannels() and GetPChannelNames() blocking accessors
- Add PChannel() interface to MutableMessage/ImmutableMessage for
deriving physical channel from virtual channel
- Validate non-control-channel entries are physical channels using
funcutil.IsPhysicalChannel and use funcutil.IsOnPhysicalChannel for
control channel matching
- Move control channel substitution logic into WithClusterLevelBroadcast
to simplify callers (datacoord, coordinator, assignment service)
- Add lock interceptor unit tests and cluster broadcast test coverage
- Add integration test for FlushAll with streaming node restart to
verify data integrity across node lifecycle
---------
Signed-off-by: chyezh <chyezh@outlook.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
design doc:
https://github.com/milvus-io/milvus-design-docs/blob/main/design_docs/20260105-external_table.md
issue: #45881
This change introduces manual refresh capability for external
collections, allowing users to trigger on-demand data synchronization
from external sources. It replaces the legacy update mechanism with a
more robust job-task hierarchy and persistent state management.
Key changes:
- Add RefreshExternalCollection, GetRefreshExternalCollectionProgress,
and ListRefreshExternalCollectionJobs APIs across Client, Proxy,
and DataCoord
- Implement ExternalCollectionRefreshManager to manage refresh jobs
with a 1:N Job-Task hierarchy
- Add ExternalCollectionRefreshMeta for persistent storage of jobs and
tasks in the metastore
- Add ExternalCollectionRefreshChecker for task state management and
worker assignment
- Implement ExternalCollectionRefreshInspector for periodic job
cleanup
- Use WAL Broadcast mechanism for distributed consistency and
idempotency
- Replace legacy external_collection_inspector and update tasks with
the new refresh-based implementation
- Add comprehensive unit tests for refresh job lifecycle and state
transitions
design doc:
https://github.com/milvus-io/milvus-design-docs/blob/main/design_docs/20260105-external_table.md
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #47827
Key optimizations:
- O(1) task lookup in scheduler via unissuedTasksIndex map
- Channel→indexed slice in searchSegments, eliminating channel
alloc/sync
- Single segment fast path skipping errgroup/goroutine overhead
- Single channel fast path in LB policy
- Shallow copy SearchRequest instead of proto.Clone deep copy
- Pipeline.String()→p.name in debug log to avoid eager evaluation
- TimeRecorder: fix double time.Now(), pre-compute logLabel, IsRecording
guard
- Batch 2D array allocation for resultOffsets in search reduce
- Pre-allocate Scores/Topks slices with nq*topk capacity
- Replace fmt.Sprint(nodeID) with paramtable.GetStringNodeID() on hot
paths
- Eliminate fmt.Sprintf in CtxElapse calls, use static strings
Signed-off-by: Li Liu <li.liu@zilliz.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
issue: #46953
- GenNullableFieldData only supported scalar types, causing an error when
performing a partial upsert after adding nullable vector fields
issue: #47160
- pickFieldData used the row index directly to access the Contents array
without converting it to a data index, causing an index-out-of-range panic
relate: #45993
Row-based processing had two problems:
- Nullable vector handling was tightly coupled with other field types,
requiring complex special-case logic
- The update path was inefficient for nullable vectors: data had to be
read first and then updated in place. For example, changing a valid
vector to null required shifting all subsequent data, since null values
are not stored.
Column-based processing solves both:
- Each field type is processed independently, nullable vector logic is
cleanly separated from other types
- Nullable vectors are appended directly without read-then-update,
avoiding expensive data shifting operations
- Fields are iterated instead of rows when merging upsert and existing
data
key changes:
- Remove nullable vector special handling: nullableVectorMergeContext,
buildNullableVectorIdxMap, rebuildNullableVectorFieldData, etc.
- Use generic column utilities: AppendFieldDataByColumn,
UpdateFieldDataByColumn
- Add vector type support to GenNullableFieldData
- Add unit test for nullable vector upsert scenarios
---------
Signed-off-by: marcelo-cjl <marcelo.chen@zilliz.com>
Related to #44452
Remove the deprecated lazy load feature which has been superseded by
warmup-related parameters. This cleanup includes:
- Remove AddFieldDataInfoForSealed from C++ segcore layer
- Remove IsLazyLoad() method and isLazyLoad field from segment
- Remove lazy load checks in proxy alterCollectionTask
- Remove DiskCache lazy load handling in search/retrieve paths
- Remove LazyLoadEnableKey constant and related helper functions
- Update mock files to reflect interface changes
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #47083
Changes:
- Add configurable low/high cardinality index types for hybrid index
- Default high cardinality: STL_SORT
- Add float/double support for hybrid index
Compatibility:
- Version <= 2: Uses legacy behavior (STL_SORT for int, INVERTED for
string/float)
- Version >= 3: Uses configurable index types from config
This PR does not bump the scalar version, so indexes will continue to use
version 2 until the version is bumped.
---------
Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
## Summary
Add a new public API `GetReplicateConfiguration` that allows cluster
administrators to view the current cross-cluster replication topology
with sensitive connection parameters (tokens) redacted.
## Changes
- Add privilege constant `PrivilegeGetReplicateConfiguration` to
`ClusterReadOnlyPrivileges`
- Add `SanitizeReplicateConfiguration` helper to strip sensitive tokens
before returning
- Add `GetReplicateConfiguration` method to `ReplicateService` interface
- Implement `GetReplicateConfiguration` handler in Proxy
- Add integration tests
## API
```protobuf
rpc GetReplicateConfiguration(GetReplicateConfigurationRequest) returns (GetReplicateConfigurationResponse) {}
```
**Security:**
- Requires ClusterAdmin privilege
- Tokens are redacted from the response
## Dependencies
- Proto changes: milvus-io/milvus-proto#566
## Related Issue
Closes #47392
## Design Doc
https://github.com/milvus-io/milvus-design-docs/blob/main/design_docs/20260128-get_replicate_configuration.md
## Test Plan
- [x] Unit tests for sanitization helper
- [x] Unit tests for ReplicateService method
- [x] Integration tests for the API
---
🤖 Generated with [Claude Code](https://claude.ai/code)
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Happy <yesreply@happy.engineering>
Related to #46988
Add a minimum session version check before triggering storage version
upgrade compaction. This ensures all query nodes in the cluster have
been upgraded to a version that can handle the new storage format before
any compaction is triggered.
Changes:
- Add StorageVersionCompactionMinSessionVersion config parameter
(default: 2.6.9) to specify minimum required session version
- Add GetMinimalSessionVer() method to IndexEngineVersionManager to
track the minimum version across all sessions
- Check version requirement in storageVersionUpgradePolicy.Trigger() and
skip compaction if any node is below the required version
- Add comprehensive unit tests for version requirement scenarios
This prevents potential data loading issues during rolling upgrades
where older nodes may not be able to read segments compacted with the
new storage format.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>