679 Commits (2.5)

Author SHA1 Message Date
wei liu 80d1ef74ce
fix: apply load config changes failed after restart (#43555)
issue: #43107
pr: #43554

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-08-01 20:17:37 +08:00
wei liu 75463725b3
fix: skip loading non-existent L0 segments to prevent load blocking (#43576)
issue: #43557
In 2.5 branch, L0 segments must be loaded before other segments. If an
L0 segment has been garbage collected but is still in the target list,
the load operation would keep failing, preventing other segments from
being loaded.

This patch adds a segment existence check for L0 segments in
getSealedSegmentDiff. Only L0 segments that actually exist will be
included in the load list.

Changes:
- Add checkSegmentExist function parameter to SegmentChecker constructor
- Filter L0 segments by existence check in getSealedSegmentDiff
- Add unit tests using mockey to verify the fix behavior
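
The existence filter described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the `Segment` type, the `filterLoadable` function, and the `exists` callback are hypothetical stand-ins for the checkSegmentExist parameter and the filtering in getSealedSegmentDiff.

```go
package main

import "fmt"

// Segment is a hypothetical, simplified stand-in for a target segment.
type Segment struct {
	ID    int64
	Level string // "L0" or "L1"
}

// filterLoadable keeps every non-L0 segment and only those L0 segments
// for which exists reports true.
func filterLoadable(targets []Segment, exists func(id int64) bool) []Segment {
	out := make([]Segment, 0, len(targets))
	for _, s := range targets {
		if s.Level == "L0" && !exists(s.ID) {
			continue // garbage collected: skip it so it cannot block the load
		}
		out = append(out, s)
	}
	return out
}

func main() {
	targets := []Segment{{1, "L0"}, {2, "L0"}, {3, "L1"}}
	alive := map[int64]bool{1: true, 3: true}
	got := filterLoadable(targets, func(id int64) bool { return alive[id] })
	fmt.Println(len(got)) // 2: segment 2 is a missing L0 segment
}
```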

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-07-31 14:33:38 +08:00
wei liu 4631657304
fix: Unstable integration case TestBalanceOnSingleReplica (#43552)
issue: #42930

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-07-25 10:52:55 +08:00
wei liu ad0bf9cad8
enhance: Optimize channel node balancing for uneven QN distribution (#42786) (#43423)
issue: #42860
pr: #42786
Fix channel node allocation when QueryNode count is not a multiple of
channel count. The previous algorithm used simple division which caused
uneven distribution with remainders.

Key improvements:
- Implement smart remainder distribution algorithm
- Refactor large function into focused helper functions
- Support two-phase rebalancing (release then allocate)
- Handle edge cases like insufficient nodes gracefully
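
One common way to spread a remainder evenly is shown below. It is only a sketch of the general idea, assuming the goal is that no two channels differ by more than one node; `nodesPerChannel` is a hypothetical name and the actual algorithm in the PR may differ.

```go
package main

import "fmt"

// nodesPerChannel splits nodeCount nodes across channelCount channels so
// that no two channels differ by more than one node.
func nodesPerChannel(nodeCount, channelCount int) []int {
	if channelCount <= 0 || nodeCount < 0 {
		return nil
	}
	base, rem := nodeCount/channelCount, nodeCount%channelCount
	out := make([]int, channelCount)
	for i := range out {
		out[i] = base
		if i < rem {
			out[i]++ // the first rem channels absorb the remainder
		}
	}
	return out
}

func main() {
	fmt.Println(nodesPerChannel(7, 3)) // [3 2 2]
}
```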

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-07-21 17:04:54 +08:00
wei liu b08d9efe69
fix: Prevent delegator unserviceable due to shard leader change (#42689) (#43309)
issue: #42098 #42404
pr: #42689
Fix critical issue where concurrent balance segment and balance channel
operations cause delegator view inconsistency. When shard leader
switches between load and release phases of segment balance, it results
in loading segments on old delegator but releasing on new delegator,
making the new delegator unserviceable.

The root cause is that balance segment modifies delegator views, and if
these modifications happen on different delegators due to leader change,
it corrupts the delegator state and affects query availability.

Changes include:
- Add shardLeaderID field to SegmentTask to track delegator for load
- Record shard leader ID during segment loading in move operations
- Skip release if shard leader changed from the one used for loading
- Add comprehensive unit tests for leader change scenarios

This ensures balance segment operations are atomic on single delegator,
preventing view corruption and maintaining delegator serviceability.
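
The leader check can be pictured with the sketch below. Only the shardLeaderID field name comes from the commit message; the `segmentTask` type and its methods are hypothetical simplifications, not the real SegmentTask.

```go
package main

import "fmt"

// segmentTask is a hypothetical, trimmed-down task; only shardLeaderID
// mirrors a field named in the commit message.
type segmentTask struct {
	segmentID     int64
	shardLeaderID int64 // delegator that served the load step
}

// recordLoadLeader remembers which delegator handled the load.
func (t *segmentTask) recordLoadLeader(leaderID int64) { t.shardLeaderID = leaderID }

// shouldRelease reports whether the release step may run: only when the
// current shard leader is the one that served the load.
func (t *segmentTask) shouldRelease(currentLeader int64) bool {
	return t.shardLeaderID == currentLeader
}

func main() {
	task := &segmentTask{segmentID: 100}
	task.recordLoadLeader(1)
	fmt.Println(task.shouldRelease(1)) // true: same delegator
	fmt.Println(task.shouldRelease(2)) // false: leader changed, skip release
}
```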

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-07-15 17:46:51 +08:00
wei liu 4952b8c416
enhance: apply load config changes after QueryCoord restart (#43108) (#43236)
issue: #43107
pr: #43108
- Add checkLoadConfigChanges() to apply load config during startup
- Call config check in startQueryCoord() after restart
- Skip auto-updates for collections with user-specified replica numbers
- Add is_user_specified_replica_mode field to preserve user settings
- Add comprehensive unit tests with mockey

Ensures existing collections use latest cluster-level config after
restart.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-07-14 10:22:50 +08:00
congqixia 2531ebda27
fix: [2.5] Check field mmap property before apply collection level one (#43091)
Cherry-pick from master
pr: #43090
Related to #43089

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-07-03 14:32:45 +08:00
congqixia 3d58b2ecee
fix: [2.5] Make controller wait checker worker quit (#42704) (#42726)
Cherry-pick from master
pr: #42704
Related to #42702

This patch adds wait logic for `CheckerController`.
Nil check already exists due to code branching
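
A typical way to make a controller wait for its workers is a sync.WaitGroup, as in the sketch below. The `controller` type here is hypothetical and much simpler than `CheckerController`; it only illustrates the pattern of blocking in Stop until every worker goroutine has returned.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// controller is a hypothetical stand-in that owns a set of worker goroutines.
type controller struct {
	stop   chan struct{}
	wg     sync.WaitGroup
	exited int32 // number of workers that have quit
}

func newController() *controller { return &controller{stop: make(chan struct{})} }

// Start launches the workers; each one runs until stop is closed.
func (c *controller) Start(workers int) {
	for i := 0; i < workers; i++ {
		c.wg.Add(1)
		go func() {
			defer c.wg.Done()
			<-c.stop // a real worker would also run periodic checks here
			atomic.AddInt32(&c.exited, 1)
		}()
	}
}

// Stop signals the workers and blocks until all of them have quit.
// It must be called at most once in this sketch.
func (c *controller) Stop() {
	close(c.stop)
	c.wg.Wait()
}

func main() {
	c := newController()
	c.Start(3)
	c.Stop()
	fmt.Println(atomic.LoadInt32(&c.exited)) // 3
}
```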

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-06-16 15:14:38 +08:00
Zhen Ye edca441eae
fix: filter the streaming query node from resource group when upgrading (#42594)
issue: #42492
pr: #38677

- Filter out the streaming query nodes of 2.6.0 to avoid loading sealed
segments on them.

Signed-off-by: chyezh <chyezh@outlook.com>
2025-06-09 22:10:35 +08:00
wei liu f06de7eca6
fix: Fix delegator selection logic in releaseSegment (#42572)
issue: #42568
Fix incorrect delegator selection during the segment release process,
which was introduced by pr #42410

- Add serviceable filter to prioritize available shard leaders
- Fix fallback logic with channel-specific lookup
- Add early return when no leader found
- Add comprehensive unit tests for all scenarios

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-06-06 19:24:33 +08:00
Xianhui Lin a1927e22a5
fix: add ShowLoadCollections and ShowLoadPartitions for compatible mixcoord (#42514)
fix: add ShowLoadCollections and ShowLoadPartitions for compatible
mixcoord
issue: https://github.com/milvus-io/milvus/issues/42492

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
2025-06-05 15:46:33 +08:00
wei liu b298218a29
enhance: [2.5] Remove balance constraints between channel and segment tasks (#42410)
issue: #42176
pr: #42177

Remove the mutual exclusion constraints between channel and segment
balance tasks to allow them to run concurrently.

Changes include:
- Remove permitBalanceChannel() and permitBalanceSegment() methods from
RoundRobinBalancer
- Update ChannelLevelScoreBalancer, MultiTargetBalancer,
RowCountBasedBalancer, and ScoreBasedBalancer to remove constraint
checks
- Allow segment balance tasks to proceed even when channel balance tasks
are running
- Update test cases to reflect new behavior where balance tasks no
longer block each other
- Improve error handling in task executor by preferring serviceable
shard leaders for segment release operations
- Add fallback logic to find latest shard leader when serviceable leader
is not available

This change improves the efficiency of load balancing by removing
unnecessary coordination overhead between different types of balance
operations.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-06-03 10:16:32 +08:00
wei liu d2ff390a52
fix: Segment may be released prematurely during balance channel (#42043)
issue: #41143
pr: #42090

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-05-29 18:36:35 +08:00
aoiasd 198ff1f150
enhance: [2.5] support run analyzer by loaded collection field (#42119)
relate: https://github.com/milvus-io/milvus/issues/42094
pr: https://github.com/milvus-io/milvus/pull/42113

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2025-05-29 10:26:30 +08:00
wei liu 4a05180f88
enhance: [2.5] support balancing multiple collections in single trigger (#41875) (#42134)
issue: #41874
pr: #41875
- Optimize balance_checker to support balancing multiple collections
simultaneously
- Add new parameters for segment and channel balancing batch sizes
- Add enableBalanceOnMultipleCollections parameter
- Update tests for balance checker

This change improves resource utilization by allowing the system to
balance multiple collections in a single trigger with configurable batch
sizes.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-05-28 23:18:30 +08:00
yihao.dai 7c8370ccd2
fix: [2.5] Fix ants.Pool goroutine leak (#41893)
1. Release the pool after it is no longer in use.
2. Upgrade ants.Pool to fix the goroutine leak issue (see
https://github.com/panjf2000/ants/pull/287).

issue: https://github.com/milvus-io/milvus/issues/41838

pr: https://github.com/milvus-io/milvus/pull/41892

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-05-16 19:12:22 +08:00
SimFG 6e18ededab
fix: [2.5] mockery tool unavailable after upgrade golang version (#41522)
- issue: #41291
- pr: #41481

Signed-off-by: SimFG <bang.fu@zilliz.com>
2025-04-25 14:40:40 +08:00
SimFG 18eb627533
fix: [2.5] Update logging context and upgrade dependencies (#41319)
- issue: #41291
- pr: #41318

---------

Signed-off-by: SimFG <bang.fu@zilliz.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
2025-04-24 23:50:40 +08:00
wei liu 2e8445c2ef
fix: balance checker may enter infinite normal balance loop after balance suspension (#41196)
issue: #41194 
pr: #41195
- Refactor hasUnbalancedCollection flag handling to function scope
- Ensure tracking sets clearance when no balance needed
- Add deferred cleanup for both normal/stopping balance paths
- Add unit tests for collection tracking scenarios

The changes ensure tracking sets (normalBalanceCollectionsCurrentRound
and stoppingBalanceCollectionsCurrentRound) are properly cleared when:
- All collections in current round are balanced
- Balance checks return early due to unready targets
- Balance feature flags are disabled

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-04-10 15:18:28 +08:00
liliu-z cb0f984155
enhance: Revert "separate for index completed (#40873)" (#41152)
This reverts commit 23e579e324. #40873

issue: #39519

Signed-off-by: Li Liu <li.liu@zilliz.com>
2025-04-08 17:36:30 +08:00
Chun Han 23e579e324
separate for index completed (#40873)
related: https://github.com/milvus-io/milvus/issues/40781

Signed-off-by: MrPresent-Han <chun.han@gmail.com>
Co-authored-by: MrPresent-Han <chun.han@gmail.com>
2025-04-05 10:20:24 +08:00
wei liu 37a533fe6d
fix: [2.5] Address manual balance and balance check issues (#41038)
issue: #37651
pr: #41037
- Fix context propagation for manual balance segment task creation from
PR #38080.
- Optimize stopping balance by preventing redundant checks per round,
addressing performance regression from PR #40297.
- Decrease default `checkBalanceInterval` from 3000ms to 300ms.
- Correct minor log messages in `BalanceChecker`.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-04-03 01:26:23 +08:00
Xianhui Lin 249d5b9b41
fix: jsonstats check if cache schema is nil lazy describecollection (#41068)
fix: jsonstats check if cache schema is nil lazy describecollection
pr: https://github.com/milvus-io/milvus/pull/38039
issue: https://github.com/milvus-io/milvus/issues/36995

---------

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
2025-04-03 00:32:21 +08:00
wei liu d185a8f941
enhance: Balance the collection with the largest row count first (#40958)
issue: #37651
pr: #40297
This PR balances the collection with the largest row count first. This
avoids temporarily migrating small table data to new nodes during their
onboarding, only for it to be moved out again after the large table is
balanced, which would cause unnecessary load.
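
Ordering collections by row count can be sketched as below. This is an illustration only: `orderByRowCount` and the map of collection ID to row count are hypothetical, and the tie-break on collection ID is an assumption added to keep the order deterministic.

```go
package main

import (
	"fmt"
	"sort"
)

// orderByRowCount returns collection IDs sorted by row count, largest first.
// Ties are broken by the smaller collection ID.
func orderByRowCount(rowCounts map[int64]int64) []int64 {
	ids := make([]int64, 0, len(rowCounts))
	for id := range rowCounts {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool {
		ri, rj := rowCounts[ids[i]], rowCounts[ids[j]]
		if ri != rj {
			return ri > rj
		}
		return ids[i] < ids[j]
	})
	return ids
}

func main() {
	rows := map[int64]int64{1: 10, 2: 1000, 3: 10}
	fmt.Println(orderByRowCount(rows)) // [2 1 3]
}
```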

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-03-31 16:14:21 +08:00
wei liu b64bb63e77
enhance: [2.5] Add trigger interval config for auto balance (#39154) (#39918)
issue: #39156
pr: #39154

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-03-27 16:40:23 +08:00
Xianhui Lin 8bdff401a3
fix: fix indexchecker schema released (#40809)
pr: https://github.com/milvus-io/milvus/pull/38039
issue: https://github.com/milvus-io/milvus/issues/36995

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
2025-03-20 18:05:22 +08:00
Xianhui Lin 705b3c90a5
fix: Failed to rolling upgrade from v2.5.6 to new 2.5 version when enable JsonKeyStats (#40661)
fix: Rolling upgrade from v2.5.6 to a newer 2.5 version fails when
JsonKeyStats is enabled. The reason is that the file path of the
jsonkeyindex has changed.
issue: https://github.com/milvus-io/milvus/issues/40649
https://github.com/milvus-io/milvus/issues/40669
https://github.com/milvus-io/milvus/issues/40707
master-pr: https://github.com/milvus-io/milvus/pull/38039

---------

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
2025-03-18 17:32:16 +08:00
Xianhui Lin f5e9dea2aa
fix: [2.5]fix the garbage cleanup logic of jsonkey stats && improve json key stats filter (#40039)
fix: fix the garbage collection cleanup logic of jsonkey stats &&
improve json key stats filter
issue: https://github.com/milvus-io/milvus/issues/36995
https://github.com/milvus-io/milvus/issues/40034
https://github.com/milvus-io/milvus/issues/40041
https://github.com/milvus-io/milvus/issues/40106
https://github.com/milvus-io/milvus/issues/40138
pr: https://github.com/milvus-io/milvus/pull/38039

---------

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
2025-03-13 20:18:10 +08:00
Bingyi Sun 683b26ffb7
feat: cherry pick json path index (#40313)
issue: #35528 
pr: #36750 
this pr includes json path index pr and some related prs:
1. update tantivy version #39253 
2. json path index #36750 
3. fall back to brute force #40076 
4. term filter #40140 
5. bug fix #40336

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-03-10 22:14:05 +08:00
yihao.dai 893caee467
fix: [2.5] Fix task delta cache data race (#40262)
issue: https://github.com/milvus-io/milvus/issues/40258

pr: https://github.com/milvus-io/milvus/pull/40259

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-03-02 16:52:10 +08:00
wei liu 82c000a4b2
fix: task delta cache leak due to duplicate task id (#40184)
issue: #40052
pr: #40183

The task delta cache relies on each taskID being unique: it calls
incDeltaCache at AddTask and decDeltaCache at RemoveTask. But the taskID
allocator is not atomic, which can produce two tasks with the same
taskID. In that case incDeltaCache is called twice but decDeltaCache only
once, which leaks the delta cache.
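
A minimal sketch of an atomic ID allocator follows, assuming a plain in-process counter is enough. The `idAllocator` type is hypothetical and is not taken from the patch; it only shows why an atomic increment cannot hand out the same ID twice.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// idAllocator is a hypothetical task ID source backed by an atomic counter.
type idAllocator struct{ next int64 }

// Alloc returns a fresh ID; concurrent callers never observe the same value.
func (a *idAllocator) Alloc() int64 { return atomic.AddInt64(&a.next, 1) }

func main() {
	var (
		alloc idAllocator
		mu    sync.Mutex
		wg    sync.WaitGroup
	)
	seen := make(map[int64]bool)
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			id := alloc.Alloc()
			mu.Lock()
			seen[id] = true
			mu.Unlock()
		}()
	}
	wg.Wait()
	fmt.Println(len(seen)) // 100 distinct IDs
}
```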

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-28 10:22:08 +08:00
wei liu 14f05650e3
enhance: clean shard location cache after collection released (#40228)
issue: #40077
pr: #40088

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-27 19:42:05 +08:00
Xianhui Lin a4eb2ce224
fix: [2.5]Revert qc statschecker for json key stats (#40125)
Revert qc statschecker for json key stats
issue: https://github.com/milvus-io/milvus/issues/36995
pr: https://github.com/milvus-io/milvus/pull/39876

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
2025-02-24 13:31:55 +08:00
congqixia 709594f158
enhance: [2.5] Use v2 package name for pkg module (#40117)
Cherry-pick from master
pr: #39990
Related to #39095

https://go.dev/doc/modules/version-numbers

Update pkg version according to golang dep version convention

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-23 00:46:01 +08:00
Xianhui Lin c1de61ff7c
fix: [2.5]Replace the position of EnabledJSONKeyStats (#40108)
Replace the position of EnabledJSONKeyStats
issue: https://github.com/milvus-io/milvus/issues/36995
pr: https://github.com/milvus-io/milvus/pull/38039

---------

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
2025-02-22 14:35:54 +08:00
yihao.dai b8a758b6c4
enhance: [2.5] Add get vector latency metric and refine request limit error message (#40085)
issue: https://github.com/milvus-io/milvus/issues/40078

pr: https://github.com/milvus-io/milvus/pull/40083

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-02-21 20:19:55 +08:00
wei liu 82fb0bf9c1
fix: [2.5] task delta cache leak on reduce task (#40056)
issue: #40052
pr: #40055

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-21 16:49:54 +08:00
wei liu e42c944e04
fix: [2.5] querycoord panic in corner case (#40058)
issue: #40050 
pr: #40057

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-21 11:19:58 +08:00
wei liu 3c2d8c1419
enhance: [2.5] Add management api to check querycoord balance status (#37784) (#39909)
issue: #37783
pr: #37784

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-19 10:56:49 +08:00
wei liu bf54f47c34
enhance: [2.5] use rated logger for high frequency log in dist handler (#39452) (#39928)
pr: #39452

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-18 14:32:52 +08:00
Xianhui Lin f0964f769d
enhance: [2.5]Add json key inverted index in stats for optimization (#39876)
Add json key inverted index in stats for optimization
issue: https://github.com/milvus-io/milvus/issues/36995
pr: https://github.com/milvus-io/milvus/pull/38039

---------

Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2025-02-16 20:12:15 +08:00
congqixia 9407a3c9b1
fix: [2.5] Check collection released before target checks (#39843)
Cherry-pick from master
pr: #39841 
Related to #39840

The target could be updated asynchronously in the previous code. This PR
makes removing a collection from the target observer block until all
related tasks in the dispatchers are removed, preventing the metrics from
being updated after the collection is released.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-13 20:00:15 +08:00
wei liu 82dc57ace0
fix: [skip e2e][2.5] pr conflict cause ut failed (#39810)
Related to https://github.com/milvus-io/milvus/pull/39701 &
https://github.com/milvus-io/milvus/issues/39681

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-12 11:44:51 +08:00
congqixia 4322a0d49a
fix: [2.5] Resolve conflict on qc task test (#39797)
Cherry-pick from master
pr: #39796
Related to #39701 & #39681

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-11 18:52:45 +08:00
wei liu 11cba57dc7
fix: [2.5] load collection gets stuck if compaction/gc happens (#39761)
issue: #39680
pr: #39701
If compaction/gc happens, load collection may get stuck due to
SegmentNotFound. We should trigger UpdateNextTarget to get a new data
view for the loading operation.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-11 15:48:50 +08:00
wei liu 969e34d540
fix: [2.5]uneven distribution caused by executing task delta cache leak (#39759)
issue: #39681
pr: #39702
This PR maintains the workload effect in the action instead of computing
it from the target, which may cause a leak if the target changes.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-11 14:32:46 +08:00
jaime ddc5b299ad
enhance: expose more metrics data (#39466)
issue: #36621 #39417
pr: #39456
1. Adjust the server-side cache size.
2. Add source information for configurations.
3. Add node ID for compaction and indexing tasks.
4. Resolve localhost access issues to fix health check failures for
etcd.

Signed-off-by: jaime <yun.zhang@zilliz.com>
2025-02-07 11:48:45 +08:00
yihao.dai 4464966462
enhance: [2.5] Remove frequent observe log (#39414)
/kind improvement

pr: https://github.com/milvus-io/milvus/pull/39413

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-20 11:01:10 +08:00
yihao.dai 89a183c7c2
enhance: [2.5] enable task delta cache (#39349)
When there are many segment tasks in the querycoord scheduler, the
traversal in GetSegmentTaskDelta checks becomes time-consuming. This PR
adds caching for segment deltas.
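
The idea of caching deltas instead of traversing all tasks can be sketched as follows. Everything here is hypothetical (`deltaCache`, `deltaKey`, and the choice to key by collection and node); it only shows how increment on add and decrement on remove turn a full scan into a map lookup.

```go
package main

import (
	"fmt"
	"sync"
)

// deltaKey identifies a cached delta; keying by collection and node is an
// assumption made for this sketch.
type deltaKey struct{ collectionID, nodeID int64 }

// deltaCache keeps a running sum of task deltas so readers avoid a scan.
type deltaCache struct {
	mu     sync.Mutex
	deltas map[deltaKey]int
}

func newDeltaCache() *deltaCache { return &deltaCache{deltas: make(map[deltaKey]int)} }

// Inc is called when a task is added.
func (c *deltaCache) Inc(collectionID, nodeID int64, delta int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.deltas[deltaKey{collectionID, nodeID}] += delta
}

// Dec is called when a task is removed; empty entries are dropped.
func (c *deltaCache) Dec(collectionID, nodeID int64, delta int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	k := deltaKey{collectionID, nodeID}
	c.deltas[k] -= delta
	if c.deltas[k] == 0 {
		delete(c.deltas, k)
	}
}

// Get returns the cached delta in constant time.
func (c *deltaCache) Get(collectionID, nodeID int64) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.deltas[deltaKey{collectionID, nodeID}]
}

func main() {
	c := newDeltaCache()
	c.Inc(1, 7, 2)
	c.Inc(1, 7, 1)
	c.Dec(1, 7, 2)
	fmt.Println(c.Get(1, 7)) // 1
}
```

Note that such a cache is only correct if every Inc is matched by exactly one Dec.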

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/39307

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: Wei Liu <wei.liu@zilliz.com>
2025-01-17 12:01:03 +08:00
yihao.dai 6773fb10a8
enhance: [2.5] Read metadata concurrently to accelerate recovery (#38900)
Read metadata such as segments, binlogs, and partitions concurrently at
the collection level.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38403

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-16 17:53:01 +08:00