55 Commits (2.5)

Author SHA1 Message Date
wei liu b298218a29
enhance: [2.5] Remove balance constraints between channel and segment tasks (#42410)
issue: #42176
pr: #42177

Remove the mutual exclusion constraints between channel and segment
balance tasks to allow them to run concurrently.

Changes include:
- Remove permitBalanceChannel() and permitBalanceSegment() methods from
RoundRobinBalancer
- Update ChannelLevelScoreBalancer, MultiTargetBalancer,
RowCountBasedBalancer, and ScoreBasedBalancer to remove constraint
checks
- Allow segment balance tasks to proceed even when channel balance tasks
are running
- Update test cases to reflect new behavior where balance tasks no
longer block each other
- Improve error handling in task executor by preferring serviceable
shard leaders for segment release operations
- Add fallback logic to find latest shard leader when serviceable leader
is not available

This change improves the efficiency of load balancing by removing
unnecessary coordination overhead between different types of balance
operations.
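
The leader selection described above (prefer a serviceable shard leader, fall back to the latest one) might look roughly like the sketch below. The `LeaderView` type, its fields, and the use of a version number to decide which leader is "latest" are assumptions made for illustration, not the actual task executor code.

```go
package main

import "fmt"

// LeaderView is a hypothetical, simplified stand-in for a shard leader record.
type LeaderView struct {
	NodeID      int64
	Version     int64 // assumption: a larger version means a newer leader
	Serviceable bool
}

// pickLeader prefers the newest serviceable leader and falls back to the
// newest leader of any kind when none is serviceable.
func pickLeader(views []LeaderView) (LeaderView, bool) {
	var best, latest LeaderView
	foundBest, foundLatest := false, false
	for _, v := range views {
		if !foundLatest || v.Version > latest.Version {
			latest, foundLatest = v, true
		}
		if v.Serviceable && (!foundBest || v.Version > best.Version) {
			best, foundBest = v, true
		}
	}
	if foundBest {
		return best, true
	}
	return latest, foundLatest
}

func main() {
	views := []LeaderView{
		{NodeID: 1, Version: 3, Serviceable: false},
		{NodeID: 2, Version: 2, Serviceable: true},
	}
	leader, ok := pickLeader(views)
	fmt.Println(leader.NodeID, ok)
}
```

Here node 2 is chosen even though node 1 is newer, because node 1 is not serviceable.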

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-06-03 10:16:32 +08:00
wei liu 4a05180f88
enhance: [2.5] support balancing multiple collections in single trigger (#41875) (#42134)
issue: #41874
pr: #41875
- Optimize balance_checker to support balancing multiple collections
simultaneously
- Add new parameters for segment and channel balancing batch sizes
- Add enableBalanceOnMultipleCollections parameter
- Update tests for balance checker

This change improves resource utilization by allowing the system to
balance multiple collections in a single trigger with configurable batch
sizes.
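
As a rough illustration of batching across collections, the sketch below gathers balance plans from several collections until a batch size is reached. `CollectionPlans`, `collectPlans`, and the `multiCollection` flag are hypothetical names; only the idea of a batch size and of a switch for multi-collection balancing comes from the commit message.

```go
package main

import "fmt"

// CollectionPlans is a hypothetical summary of how many balance plans a
// collection could generate in this round.
type CollectionPlans struct {
	CollectionID int64
	Plans        int
}

// collectPlans walks candidate collections in order and gathers plans until
// batchSize is reached. When multiCollection is false it stops after the
// first collection that yields any plan.
func collectPlans(candidates []CollectionPlans, batchSize int, multiCollection bool) ([]int64, int) {
	var picked []int64
	total := 0
	for _, c := range candidates {
		if total >= batchSize {
			break
		}
		if c.Plans == 0 {
			continue
		}
		take := c.Plans
		if remaining := batchSize - total; take > remaining {
			take = remaining
		}
		picked = append(picked, c.CollectionID)
		total += take
		if !multiCollection {
			break
		}
	}
	return picked, total
}

func main() {
	candidates := []CollectionPlans{
		{CollectionID: 1, Plans: 0},
		{CollectionID: 2, Plans: 3},
		{CollectionID: 3, Plans: 4},
	}
	fmt.Println(collectPlans(candidates, 5, true))
	fmt.Println(collectPlans(candidates, 5, false))
}
```

With multi-collection balancing on, one trigger fills the batch from collections 2 and 3; with it off, the trigger stops after collection 2.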

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-05-28 23:18:30 +08:00
congqixia 709594f158
enhance: [2.5] Use v2 package name for pkg module (#40117)
Cherry-pick from master
pr: #39990
Related to #39095

https://go.dev/doc/modules/version-numbers

Update pkg version according to golang dep version convention

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-23 00:46:01 +08:00
wei liu 51994158d9
fix: channel unbalance during stopping balance progress (#38971) (#39200)
issue: #38970
pr: #38971
The stopping balance for channels still uses the row_count_based policy,
which may cause channel imbalance in the multi-collection case.

This PR implements a score-based stopping balance channel policy.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-01-14 18:25:00 +08:00
wei liu 659847c11f
enhance: Remove load task limit in one round (#38436)
The task limit in assignSegment/assignChannel applies to both load
tasks and balance tasks.

This PR removes the load task limit and only limits the number of
balance tasks in one round.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-16 19:30:43 +08:00
wei liu e279ccf109
enhance: Enable score based balance channel policy (#38143)
issue: #38142
The current balance channel policy only considers the current
collection's distribution, so if every collection has 1 channel and all
channels have been loaded on the same querynode, balance channel won't
be triggered after the querynode count increases.

This PR enables the score-based balance channel policy, to achieve:
1. distribute all channels evenly across multiple querynodes
2. distribute each collection's channel evenly across multiple
querynodes.
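
One way to picture the two goals is a per-node score that counts every channel on the node and counts the current collection's channels a second time. This scoring formula, and the names `NodeChannels`, `channelScore`, and `pickChannelTarget`, are assumptions for illustration and not the formula used by the balancer.

```go
package main

import "fmt"

// NodeChannels is a hypothetical view of the channels held by one querynode.
type NodeChannels struct {
	NodeID       int64
	ByCollection map[int64]int // collection ID -> channels held on this node
}

// channelScore counts all channels on the node (global evenness) and adds the
// given collection's channels once more (per-collection evenness).
func channelScore(n NodeChannels, collectionID int64) int {
	total := 0
	for _, c := range n.ByCollection {
		total += c
	}
	return total + n.ByCollection[collectionID]
}

// pickChannelTarget returns the node with the lowest score, or -1 if there
// are no nodes. Ties go to the earliest node in the slice.
func pickChannelTarget(nodes []NodeChannels, collectionID int64) int64 {
	target, best := int64(-1), 0
	for i, n := range nodes {
		s := channelScore(n, collectionID)
		if i == 0 || s < best {
			target, best = n.NodeID, s
		}
	}
	return target
}

func main() {
	nodes := []NodeChannels{
		{NodeID: 1, ByCollection: map[int64]int{100: 1, 200: 1, 300: 1}},
		{NodeID: 2}, // newly added, empty querynode
	}
	fmt.Println(pickChannelTarget(nodes, 100))
}
```

In the scenario from the commit message (one channel per collection, all on one node), a purely per-collection view sees nothing to move, while the global term makes the empty node the preferred target.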

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-11 17:20:43 +08:00
tinswzy e76802f910
enhance: refine querycoord meta/catalog related interfaces to ensure that each method includes a ctx parameter (#37916)
issue: #35917 
This PR refines the querycoord meta-related interfaces to ensure that
each method includes a ctx parameter.

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
2024-11-25 11:14:34 +08:00
wei liu 0a440e0d38
fix: Prevent simultaneous balance of segments and channels (#37850)
issue: #33550
Balance segment and balance channel execute at the same time, which
causes a bunch of corner cases.

This PR disables simultaneous balance of segments and channels.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-21 17:56:55 +08:00
congqixia 3fe0f82923
enhance: Add balance report log for qc balancer (#36747)
Related to #36746

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-10-11 10:25:24 +08:00
wei liu 470bb0cc3f
enhance: Enable balance on querynode with different mem capacity (#36466)
issue: #36464
This PR enables balance on querynodes with different memory capacities:
a query node with more memory capacity will be assigned more records,
and the query node with the largest difference between assignedScore and
currentScore will have a higher priority to carry the new segment.
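
A minimal sketch of that priority rule, assuming assignedScore is each node's capacity-proportional share of the total score. The `Node` type, its field names, and `pickByCapacity` are hypothetical; the real scoring may include further terms.

```go
package main

import "fmt"

// Node is a hypothetical querynode summary.
type Node struct {
	ID           int64
	MemCapacity  float64 // assumption: any consistent unit, e.g. GiB
	CurrentScore float64 // assumption: e.g. rows currently held
}

// pickByCapacity gives each node an assignedScore proportional to its memory
// capacity and returns the node with the largest assignedScore - currentScore.
// It returns -1 when there are no nodes or no capacity.
func pickByCapacity(nodes []Node) int64 {
	var totalCap, totalScore float64
	for _, n := range nodes {
		totalCap += n.MemCapacity
		totalScore += n.CurrentScore
	}
	if totalCap == 0 {
		return -1
	}
	target, bestGap := int64(-1), 0.0
	for i, n := range nodes {
		assigned := totalScore * n.MemCapacity / totalCap
		gap := assigned - n.CurrentScore
		if i == 0 || gap > bestGap {
			target, bestGap = n.ID, gap
		}
	}
	return target
}

func main() {
	nodes := []Node{
		{ID: 1, MemCapacity: 16, CurrentScore: 100},
		{ID: 2, MemCapacity: 32, CurrentScore: 100},
	}
	fmt.Println(pickByCapacity(nodes))
}
```

Both nodes hold the same amount of data, but node 2 has twice the memory, so it is furthest below its share and receives the new segment.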

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-30 16:15:17 +08:00
jaime fcec4c21b9
fix: check collection health(queryable) fail for releasing collection (#34947)
issue: #34946

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-08-02 17:20:15 +08:00
wei liu 27b6d58981
fix: Set legacy level to l0 segment after qc restart (#35197)
issue: #35087
After qc restarts, if the target is not ready yet and dist_handler tries
to update the segment dist, it will set the legacy level to the l0
segment, which may cause the l0 segment to be moved to another node and
make search/query fail.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-08-02 10:18:13 +08:00
wei liu 03912a8788
enhance: Avoid balance stuck after segment list become stable (#34728)
issue: #34715
If a collection's segment list no longer changes, the next target will
be empty most of the time. Balance segment checks whether a segment
exists in both the current and next target, so the balance could be
blocked because the next target is empty.

This PR permits a segment to be moved if the next target is empty, to
avoid balance getting stuck.
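
The relaxed check can be sketched as follows. `canBalance` and the set-based target representation are hypothetical simplifications of the target manager.

```go
package main

import "fmt"

// canBalance reports whether a segment may be moved. The segment must be in
// the current target; it must also be in the next target unless the next
// target is empty, in which case the move is permitted.
func canBalance(segmentID int64, current, next map[int64]struct{}) bool {
	if _, ok := current[segmentID]; !ok {
		return false
	}
	if len(next) == 0 {
		return true
	}
	_, ok := next[segmentID]
	return ok
}

func main() {
	current := map[int64]struct{}{1: {}, 2: {}}
	fmt.Println(canBalance(1, current, nil))
}
```

With a stable segment list the next target stays empty, and the segment can still be moved.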

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-07-31 18:09:48 +08:00
wei liu 166fc902b0
enhance: Limit collection's normal balance speed (#34810)
issue: #34798

After we removed the task priority on query coord, to avoid load/release
segment being blocked by too many balance tasks, we limit the balance
task size in each round. At the same time, we reduce the balance
interval to trigger balance more frequently.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-07-24 19:11:44 +08:00
wei liu 8123bea1ae
enhance: Avoid assign too much segment/channels to new querynode (#34096)
issue: #34095

When a new query node comes online, the segment_checker,
channel_checker, and balance_checker simultaneously attempt to allocate
segments to it. If this occurs during the execution of a load task and
the distribution of the new query node hasn't been updated, the query
coordinator may mistakenly view the new query node as empty. As a
result, it assigns segments or channels to it, potentially overloading
the new query node with more segments or channels than expected.

This PR measures the workload of the executing tasks on the target query
node to prevent assigning an excessive number of segments to it.
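
The idea of counting in-flight work can be sketched like this, assuming a node's effective load is what it already holds plus what executing tasks are about to place on it. `leastLoaded` and its map-based inputs are hypothetical.

```go
package main

import "fmt"

// leastLoaded returns the node with the smallest effective load, where the
// effective load is the number of segments already distributed to the node
// plus the number of segments that executing tasks will add to it.
// It returns -1 when nodes is empty; ties go to the earliest node.
func leastLoaded(nodes []int64, distributed, executing map[int64]int) int64 {
	target, best := int64(-1), 0
	for i, id := range nodes {
		load := distributed[id] + executing[id]
		if i == 0 || load < best {
			target, best = id, load
		}
	}
	return target
}

func main() {
	nodes := []int64{1, 2}
	distributed := map[int64]int{1: 10} // node 2 is new and looks empty
	executing := map[int64]int{2: 12}   // but 12 loads are already in flight
	fmt.Println(leastLoaded(nodes, distributed, executing))
}
```

Ignoring the executing tasks would pick node 2; counting them picks node 1.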

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-06-27 19:06:05 +08:00
wei liu a7f6193bfc
fix: query node may stuck at stopping progress (#33104)
issue: #33103 
When doing stopping balance for a stopping query node, the balancer
gets the node list from replica.GetNodes and checks whether each node is
stopping; if so, stopping balance is triggered for this replica.

After the replica refactor, replica.GetNodes only returns rwNodes, and
the stopping node is maintained in roNodes, so the balancer can't find
the replica that contains the stopping node. Stopping balance for that
replica won't be triggered, and the query node will be stuck forever
because its segments/channels are never moved out.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-05-20 10:21:38 +08:00
Xiaofan 02ace25c68
enhance: reduce the cpu usage when collection number is high (#32245)
related to #32165
1. support collection level index for all the managers
2. remove collection level filter to avoid extra cpu usage when the
collection number increases

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
2024-04-26 11:49:25 +08:00
wei liu 4822b109bd
fix: Skip to load l0 segment on old version query node (#32124)
issue: #32107

During the rolling upgrade process, skip loading l0 segments on
old-version query nodes.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-15 11:23:23 +08:00
wei liu c4806b69c4
enhance: Refactor leader view manager interface (#31133)
issue: #31091
This PR adds a GetByFilter interface to the leader view manager,
replacing the various get funcs.
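
A filter-based getter of this kind is commonly built from composable predicates. The sketch below shows the general pattern only; `LeaderViewFilter`, `WithNodeID`, `WithCollectionID`, and the struct fields are hypothetical names, not the signatures in the querycoord code.

```go
package main

import "fmt"

// LeaderView is a hypothetical, simplified leader view record.
type LeaderView struct {
	NodeID       int64
	CollectionID int64
	Channel      string
}

// LeaderViewFilter is a predicate over leader views.
type LeaderViewFilter func(LeaderView) bool

// WithNodeID matches views hosted on the given node.
func WithNodeID(id int64) LeaderViewFilter {
	return func(v LeaderView) bool { return v.NodeID == id }
}

// WithCollectionID matches views that belong to the given collection.
func WithCollectionID(id int64) LeaderViewFilter {
	return func(v LeaderView) bool { return v.CollectionID == id }
}

// LeaderViewManager holds all known leader views.
type LeaderViewManager struct {
	views []LeaderView
}

// GetByFilter returns the views that satisfy every filter. With no filters
// it returns all views.
func (m *LeaderViewManager) GetByFilter(filters ...LeaderViewFilter) []LeaderView {
	var out []LeaderView
	for _, v := range m.views {
		matched := true
		for _, f := range filters {
			if !f(v) {
				matched = false
				break
			}
		}
		if matched {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	m := &LeaderViewManager{views: []LeaderView{
		{NodeID: 1, CollectionID: 100, Channel: "ch-a"},
		{NodeID: 1, CollectionID: 200, Channel: "ch-b"},
		{NodeID: 2, CollectionID: 100, Channel: "ch-c"},
	}}
	fmt.Println(len(m.GetByFilter(WithNodeID(1), WithCollectionID(100))))
}
```

One method plus small filters replaces a family of getters such as "by node", "by collection", and "by node and collection".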

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-10 15:13:36 +08:00
chyezh a2502bde75
enhance: replica manager enhancement (#31496)
issue: #30647 

- ReplicaManager now manages read-only nodes, and always persists the
node distribution of a replica.

- All segment/channel checkers use ReplicaManager, not ResourceManager,
to get read-only or read-write nodes.

- ReplicaManager now guarantees that a querynode is assigned to only one
replica within the same collection (replicas in the same collection
never hold the same querynode at the same time).

- ReplicaManager guarantees a fair node count assignment policy if
multiple replicas of a collection are assigned to one resource group.

- Move some parameter checks into ReplicaManager to avoid data races.

- Allow transferring a replica to a resource group that has already
loaded a replica of the same collection

- Allow transferring nodes between resource groups that load replicas of
the same collection

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-04-05 04:57:16 +08:00
wei liu 0944a1f790
enhance: Refactor channel dist manager interface (#31119)
issue: #31091
This PR adds a GetByFilter interface to the channel dist manager,
replacing the various get funcs.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-02 10:23:14 +08:00
wei liu 92971707de
enhance: Add restful api for devops to execute rolling upgrade (#29998)
issue: #29261
This PR adds a RESTful API for devops to execute rolling upgrades,
including suspend/resume balance and manual transfer of
segments/channels.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-27 16:15:19 +08:00
chyezh 9f9ef8ac32
enhance: transfer resource group and dbname to querynode when load (#30936)
issue: #30931

Signed-off-by: chyezh <chyezh@outlook.com>
2024-03-21 11:59:12 +08:00
wei liu 06b191b164
fix: Balance channel stuck forever due to logic dead lock (#31202)
issue: #30816

Balance channel blocks until the leader view catches up with the current
target, and only then starts to unsub the old delegator. This makes sure
that the new delegator can provide search before the old delegator is
released. But another piece of logic in segment_checker skips loading
segments during balance channel. So during balance channel, if a query
node crashes, the new delegator can never catch up with the target, and
the balance is stuck forever.

This PR removes the rule that skips loading segments during balance
channel, to avoid the logical deadlock here.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-13 15:05:04 +08:00
wei liu 06df9b8462
fix: Balance segment/channel won't be trigger on multi replicas (#31107)
issue: #30983 #30982

The balancer calls the wrong interface to get the segment/channel list
in a replica and gets a wrong average segment/channel number, which
makes each node have fewer segments/channels than average, so balance
won't be triggered in the multi-replica case.

This PR fixes the issue that balance segment/channel won't be triggered
on multiple replicas.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-11 20:35:04 +08:00
wei liu efe8cecc88
enhance: refactor segment dist manager interface (#31073)
issue: #31091
This PR adds a `GetByFilter` interface to the segment dist manager,
replacing the various get funcs.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-08 16:29:01 +08:00
congqixia 4c93912135
enhance: Shuffle candidates before channel assignment (#30066)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-01-17 19:34:53 +08:00
wei liu 336fce0582
enhance: Rewrite gen segment plan based on assign segment (#29574)
issue: #29582
This PR rewrites the gen segment plan logic based on assign segment in
`score_based_balancer`.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-01-04 11:10:44 +08:00
wei liu 820ee692fc
enhance: Add config for querycoord auto balance channel (#29231)
issue: #23726
This PR adds a control config for querycoord's background auto balance
channel operation.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-18 10:00:40 +08:00
wei liu 008bae675d
enhance: Skip balance segment when channel need be balanced (#29116)
issue: #28622
After we supported balancing segments with growing segment count
(#28623), if we balance segments and channels at the same time, some
segments need to be rebalanced after balance channel finishes.

This PR skips balance segment when channels need to be balanced.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-14 16:44:43 +08:00
yah01 2f0c7a6544
fix: forbid balancing level zero segments (#29130)
we can't balance the L0 segments
related #29128

---------

Signed-off-by: yah01 <yah2er0ne@outlook.com>
Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-12-12 20:38:38 +08:00
wei liu 42e538b683
enhance: enable balance channel in querycoord (#28469)
issue: #23726

/kind improvement

1. enable auto balance channel between nodes in querycoord
2. make `genSegmentPlan` reuse the `AssignSegment` logic
3. make `genChannelPlan` reuse the `AssignChannel` logic

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-11 14:18:37 +08:00
wei liu 911a915798
feat: enable balance based on growing segment row count (#28623)
issue: #28622 

A query node with a delegator will have more rows than other query
nodes because the delegator loads all growing rows.
This PR enables balancing segments based on the number of growing
rows in the leader view.
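
A small sketch of the accounting this implies: a node's row count is its sealed rows plus the growing rows reported by the leader views it hosts. The `LeaderView` fields and `nodeRowCount` are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// LeaderView is a hypothetical record of a delegator and the growing rows it
// has loaded.
type LeaderView struct {
	NodeID           int64
	NumOfGrowingRows int64
}

// nodeRowCount returns the sealed rows on a node plus the growing rows of
// every leader view hosted on that node.
func nodeRowCount(nodeID int64, sealedRows map[int64]int64, views []LeaderView) int64 {
	total := sealedRows[nodeID]
	for _, v := range views {
		if v.NodeID == nodeID {
			total += v.NumOfGrowingRows
		}
	}
	return total
}

func main() {
	sealed := map[int64]int64{1: 1000, 2: 1000}
	views := []LeaderView{{NodeID: 1, NumOfGrowingRows: 500}}
	fmt.Println(nodeRowCount(1, sealed, views), nodeRowCount(2, sealed, views))
}
```

Both nodes hold the same sealed rows, but the delegator on node 1 makes it the heavier node once growing rows are counted.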

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-11-27 14:58:26 +08:00
wei liu e0222b2ce3
refine target manager code style (#27883)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-10-25 00:44:12 +08:00
SimFG 26f06dd732
Format the code (#27275)
Signed-off-by: SimFG <bang.fu@zilliz.com>
2023-09-21 09:45:27 +08:00
MrPresent-Han b517bc9e6a
refine balance mechanism including:(#23454) (#23763) (#23791)
1. set balance granularity to replica to avoid influencing unrelated replicas
2. avoid balance back and forth

Signed-off-by: MrPresent-Han <jamesharden11122@gmail.com>
2023-05-04 12:22:40 +08:00
wei liu 5244020336
ban auto balance channel (#23725)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-04-26 19:26:39 +08:00
wei liu 6653e2c3b0
fix balance channel (#23631)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-04-25 10:22:37 +08:00
wei liu 3933080511
skip to balance redundant segment (#23490)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-04-18 18:32:32 +08:00
wei liu dbbd703667
fix balance generate unexpected task (#23299)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-04-11 14:38:30 +08:00
wei liu 9f127dae47
enable balance channel (#23227)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-04-07 19:06:28 +08:00
jaime c9d0c157ec
Move some modules from internal to public package (#22572)
Signed-off-by: jaime <yun.zhang@zilliz.com>
2023-04-06 19:14:32 +08:00
MrPresent-Han afd874b736
enhance segment balance by considering global rowCount(#22914) (#23056)
Signed-off-by: MrPresent-Han <jamesharden11122@gmail.com>
Co-authored-by: xiaofan-luan <xiaofan.luan@zilliz.com>
2023-04-03 14:16:25 +08:00
congqixia 127867b873
Add ratedgroup for some info/warning log (#23095)
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-03-31 15:22:23 +08:00
wei liu 74da53c027
fix update load percentage (#23054)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-03-30 10:48:23 +08:00
yihao.dai 1f718118e9
Dynamic load/release partitions (#22655)
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2023-03-20 14:55:57 +08:00
wei liu c3e8ad3629
fix balance generate reduce task (#22236)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-02-21 19:06:27 +08:00
wei liu 73c44d4b29
resource group impl (#21609)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-01-30 10:19:48 +08:00
Enwei Jiao fb42466c65
Use opentelemetry (#21509)
Signed-off-by: Enwei Jiao <enwei.jiao@zilliz.com>
2023-01-12 16:09:39 +08:00
SimFG 6a29a964df
Fix queryCoord panic during query node down (#21400)
Signed-off-by: SimFG <bang.fu@zilliz.com>
2022-12-28 10:17:30 +08:00