Commit Graph

633 Commits (hotfix-2.5.4)

Author SHA1 Message Date
congqixia 01f8faacae
fix: [hotfix] Add sub task pool for multi-stage tasks (#40080)
Cherry-pick from master
pr: #40079 
Related to #40078

Add a subTaskPool to execute sub tasks, avoiding the logic deadlock
described in the issue.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: Cai Zhang <cai.zhang@zilliz.com>
Co-authored-by: bigsheeper <yihao.dai@zilliz.com>
2025-02-21 16:06:12 +08:00
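
The deadlock that 01f8faacae guards against can be illustrated with a small sketch. It assumes a bounded pool whose parent tasks block on their sub tasks; `pool`, `newPool`, `submit`, and `runStages` are hypothetical names for illustration, not the Milvus implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// pool is a hypothetical fixed-size worker pool used only for this sketch.
type pool struct {
	sem chan struct{}
}

func newPool(size int) *pool { return &pool{sem: make(chan struct{}, size)} }

// submit runs fn once a worker slot is free and returns a channel
// that is closed when fn has finished.
func (p *pool) submit(fn func()) <-chan struct{} {
	done := make(chan struct{})
	go func() {
		p.sem <- struct{}{}        // acquire a worker slot
		defer func() { <-p.sem }() // release it
		defer close(done)
		fn()
	}()
	return done
}

// runStages runs n multi-stage tasks on taskPool. Each task blocks until its
// sub task, submitted to subPool, has finished. It returns how many completed.
func runStages(taskPool, subPool *pool, n int) int {
	var mu sync.Mutex
	completed := 0
	var waits []<-chan struct{}
	for i := 0; i < n; i++ {
		waits = append(waits, taskPool.submit(func() {
			<-subPool.submit(func() {}) // the parent holds its slot while waiting
			mu.Lock()
			completed++
			mu.Unlock()
		}))
	}
	for _, w := range waits {
		<-w
	}
	return completed
}

func main() {
	// Separate pools: parents can saturate taskPool and sub tasks still run.
	fmt.Println(runStages(newPool(2), newPool(2), 8))
}
```

Passing the same pool for both arguments could hang: once every slot is held by a parent waiting on its sub task, no slot is left to run any sub task.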
yihao.dai 4464966462
enhance: [2.5] Remove frequent observe log (#39414)
/kind improvement

pr: https://github.com/milvus-io/milvus/pull/39413

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-20 11:01:10 +08:00
yihao.dai 89a183c7c2
enhance: [2.5] enable task delta cache (#39349)
When there are many segment tasks in the querycoord scheduler, the
traversal in GetSegmentTaskDelta checks becomes time-consuming. This PR
adds caching for segment deltas.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/39307

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: Wei Liu <wei.liu@zilliz.com>
2025-01-17 12:01:03 +08:00
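
A minimal sketch of the caching idea in 89a183c7c2, assuming deltas are tracked per (node, collection) pair. `deltaKey` and `deltaCache` are hypothetical names; the real scheduler's data structures may differ.

```go
package main

import "fmt"

// deltaKey and deltaCache are hypothetical names used only for this sketch.
type deltaKey struct {
	nodeID       int64
	collectionID int64
}

// deltaCache keeps a running sum of segment deltas per (node, collection),
// so a lookup does not have to traverse every task in the scheduler.
type deltaCache struct {
	deltas map[deltaKey]int
}

func newDeltaCache() *deltaCache {
	return &deltaCache{deltas: make(map[deltaKey]int)}
}

// add records a task's delta; calling it with the negated delta
// undoes the entry when the task is removed.
func (c *deltaCache) add(nodeID, collectionID int64, delta int) {
	c.deltas[deltaKey{nodeID, collectionID}] += delta
}

// get is a single map lookup instead of a scan over all tasks.
func (c *deltaCache) get(nodeID, collectionID int64) int {
	return c.deltas[deltaKey{nodeID, collectionID}]
}

func main() {
	c := newDeltaCache()
	c.add(1, 100, 3)  // a task that adds load to node 1
	c.add(1, 100, -1) // a task that removes load from node 1
	fmt.Println(c.get(1, 100))
}
```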
yihao.dai 6773fb10a8
enhance: [2.5] Read metadata concurrently to accelerate recovery (#38900)
Read metadata such as segments, binlogs, and partitions concurrently at
the collection level.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38403

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-16 17:53:01 +08:00
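
Reading metadata concurrently per collection, as 6773fb10a8 describes, could look roughly like the following. `loadCollectionMeta` is a hypothetical placeholder for the real meta store reads, and the sketch uses only the standard library.

```go
package main

import (
	"fmt"
	"sync"
)

// loadCollectionMeta stands in for reading one collection's segments,
// binlogs, and partitions; it is a hypothetical placeholder.
func loadCollectionMeta(collectionID int64) (string, error) {
	return fmt.Sprintf("meta-%d", collectionID), nil
}

// recoverAll reads each collection's metadata in its own goroutine and
// returns the results keyed by collection ID, plus the first error seen.
func recoverAll(collectionIDs []int64) (map[int64]string, error) {
	var (
		mu       sync.Mutex
		wg       sync.WaitGroup
		firstErr error
		result   = make(map[int64]string, len(collectionIDs))
	)
	for _, id := range collectionIDs {
		wg.Add(1)
		go func(id int64) {
			defer wg.Done()
			meta, err := loadCollectionMeta(id)
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				if firstErr == nil {
					firstErr = err
				}
				return
			}
			result[id] = meta
		}(id)
	}
	wg.Wait()
	return result, firstErr
}

func main() {
	metas, err := recoverAll([]int64{1, 2, 3})
	fmt.Println(len(metas), err)
}
```

A production version would likely bound the number of in-flight reads; this sketch starts one goroutine per collection for brevity.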
yihao.dai 9d2a0e775c
fix: [2.5] Fix slow dist handle and slow observe (#38905)
1. Provide partition&channel level indexing in the collection target.
2. Make SegmentAction not wait for distribution.
3. Remove scheduler and target manager mutex
4. Optimize logging to reduce CPU overhead.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38566

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-16 17:07:02 +08:00
yihao.dai c741b8be2b
fix: [2.5] Remove frequently updating metric to avoid mutex contention (#38778)
issue: https://github.com/milvus-io/milvus/issues/37630

Reduce the frequency of `updateIndexTasksMetrics` to avoid holding the
mutex for long periods.

pr: https://github.com/milvus-io/milvus/pull/38775

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-16 11:51:02 +08:00
wei liu 76ed552b00
enhance: Add logs for check health failed (#39208) (#39302)
pr: #39208

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-01-16 10:31:04 +08:00
wei liu 51994158d9
fix: channel unbalance during stopping balance progress (#38971) (#39200)
issue: #38970
pr: #38971
The stopping balance channel logic still uses the row_count_based policy,
which may cause channel imbalance in the multi-collection case.

This PR implements a score-based stopping balance channel policy.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-01-14 18:25:00 +08:00
wei liu 4fd56e4773
fix: Prevent leader checker from generating excessive duplicate leader tasks (#39000) (#39160)
issue: #39001
pr: #39000
Background:
Segment Load Version: Each segment load request assigns a timestamp as
its version. When multiple copies of a segment are loaded on different
QueryNodes, the leader checker uses this version to identify the latest
copy and updates the routing table in the leader view to point to it.
Delegator Router Version: When a delegator builds a route to a QueryNode
that has loaded a segment, it also records the segment's version.

Router Table Update Logic: If the leader checker detects that the
version of a segment in the routing table does not match the version in
the worker, it updates the routing table to point to the QueryNode with
the latest version. Additionally, it updates the segment's load version
in the QueryNode during this process.

Issue:
When a channel is undergoing load balancing, the leader checker may sync
the routing table to a new delegator. This sync operation modifies the
segment's load version, which invalidates the routing in the old
delegator. Subsequently, the leader checker updates the routing table in
the old delegator, breaking the routing in the new delegator. This cycle
continues, causing repeated updates and inconsistencies.

Fix:
This PR introduces two changes to address the issue:
1. Use NodeID to verify whether the delegator's routing table needs an
update, avoiding unnecessary modifications.
2. Ensure compatibility by using the latest segment's load version as
the version recorded in the routing table.

These changes resolve the cyclic updates and prevent the leader checker
from generating excessive duplicate tasks, ensuring routing stability
across delegators during load balancing.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-01-14 18:11:06 +08:00
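
The first change in 4fd56e4773 (compare by NodeID rather than by version) can be sketched as below. `routeEntry` and both predicate functions are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// routeEntry is what a delegator records for one segment in this sketch.
type routeEntry struct {
	nodeID  int64
	version int64
}

// needsUpdateByVersion mirrors the old behaviour described in the commit:
// any version mismatch triggers a routing table update.
func needsUpdateByVersion(route, latest routeEntry) bool {
	return route.version != latest.version
}

// needsUpdateByNode mirrors the fix: only a different target node
// triggers an update.
func needsUpdateByNode(route, latest routeEntry) bool {
	return route.nodeID != latest.nodeID
}

func main() {
	route := routeEntry{nodeID: 7, version: 100}
	latest := routeEntry{nodeID: 7, version: 101} // same node, version bumped by a sync
	fmt.Println(needsUpdateByVersion(route, latest), needsUpdateByNode(route, latest))
}
```

With the version check, a sync to one delegator changes the load version and makes the other delegator look stale, which restarts the cycle; the node check stays stable as long as the segment has not moved.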
Zhen Ye adfc3f945e
enhance: record memory size (uncompressed) item for index (#38844)
issue: #38715 
pr: #38770

- Currently Milvus uses the serialized (compressed) index size to estimate
the resources needed for loading.
- Add a new field MemSize (size before compression) for the index to estimate
resources.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-01-14 10:33:06 +08:00
jaime b0afe32c98
fix: unstable ut in leader_vew_manager.go file (#39162)
issue: #38672
pr: #39161

Signed-off-by: jaime <yun.zhang@zilliz.com>
2025-01-10 19:54:57 +08:00
Zhen Ye 95809ca767
enhance: make new go package to manage proto (#39128)
issue: #39095
pr: #39114

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-01-10 10:53:01 +08:00
jaime 0693634f62
enhance: add db name in replica description (#38673)
issue: #36621
pr: #38672

Signed-off-by: jaime <yun.zhang@zilliz.com>
2025-01-09 19:43:04 +08:00
wei liu 35cef0567c
enhance: Add log for case which target not update as expected (#38944) (#39046)
pr: #38944

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-01-08 19:32:57 +08:00
Xiaofan a2c4cd59ce
fix: drop partition can not be successful if load failed[2.5] (#38874)
fix https://github.com/milvus-io/milvus/issues/38649
pr: #38793
When a partition load fails, dropping the partition also fails due to the
wrong error message.

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
2025-01-02 09:56:53 +08:00
wei liu f441ccdbe9
fix: [2.5] Prevent balancer from overloading the same QueryNode (#38724)
issue: #38718
pr: #38719
The balancer calculates the workload of executing tasks as an ongoing
score for target nodes. However, a logic issue arises when
GetSegmentTaskDelta or GetChannelTaskDelta is called with
collectionID=-1, which incorrectly returns zero.

Due to the incorrect global score, the executing task's workload is not
properly reflected for each collection. Consequently, each collection
submits its own balance task, leading to the balancer assigning
excessive tasks to the same QueryNode.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-25 16:16:49 +08:00
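
The intended behaviour for `collectionID=-1` in f441ccdbe9 can be sketched as a wildcard lookup. This assumes -1 means "all collections"; `taskDelta` and `segmentTaskDelta` are hypothetical names, not the Milvus functions themselves.

```go
package main

import "fmt"

// allCollections is an assumption for this sketch: -1 selects every collection.
const allCollections int64 = -1

// taskDelta is a hypothetical record of one executing task's workload change.
type taskDelta struct {
	nodeID, collectionID int64
	delta                int
}

// segmentTaskDelta sums the deltas of executing tasks on nodeID. When
// collectionID is allCollections, tasks of every collection are counted
// instead of returning zero.
func segmentTaskDelta(tasks []taskDelta, nodeID, collectionID int64) int {
	sum := 0
	for _, t := range tasks {
		if t.nodeID != nodeID {
			continue
		}
		if collectionID != allCollections && t.collectionID != collectionID {
			continue
		}
		sum += t.delta
	}
	return sum
}

func main() {
	tasks := []taskDelta{
		{nodeID: 1, collectionID: 100, delta: 5},
		{nodeID: 1, collectionID: 200, delta: 7},
		{nodeID: 2, collectionID: 100, delta: 3},
	}
	fmt.Println(segmentTaskDelta(tasks, 1, allCollections))
}
```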
wei liu cb0618b2d4
fix: [2.5] Querycoord will trigger unexpected balance task after restart (#38725)
issue: https://github.com/milvus-io/milvus/issues/38606
pr: https://github.com/milvus-io/milvus/pull/38630

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-25 16:14:49 +08:00
wei liu b16d04d7cc
fix: Fix update loading collection's load config doesn't work (#38737)
issue: #38594 
pr: #38595

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-25 15:02:50 +08:00
jaime 11bedf5e76
fix: Revert "Expose metrics of stanby coordinators (#27698)" (#38621)
issue: #38608
pr: #38620

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-12-20 18:04:47 +08:00
jaime 78438ef41e
fix: revert optimize CPU usage for CheckHealth requests (#35589) (#38555)
issue: #35563

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-12-19 00:38:45 +08:00
yihao.dai d3c174b0f1
enhance: Accelerate observe collection (#38028)
1. A collection should observe the channel only once.
2. A collection should check the CollectionLoadPercent for updates only
once.
3. Skip saving coll/partition meta if there are no changes, primarily to
accelerate collection observation after recovery.

issue: https://github.com/milvus-io/milvus/issues/37630

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-17 14:14:45 +08:00
jaime 28fdbc4e30
enhance: optimize CPU usage for CheckHealth requests (#35589)
issue: #35563
1. Use an internal health checker to monitor the cluster's health state,
storing the latest state on the coordinator node. The CheckHealth
request retrieves the cluster's health from this latest state on the
proxy sides, which enhances cluster stability.
2. Each health check will assess all collections and channels, with
detailed failure messages temporarily saved in the latest state.
3. Use CheckHealth request instead of the heavy GetMetrics request on
the querynode and datanode

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-12-17 11:02:45 +08:00
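
The first point of 28fdbc4e30 (serve CheckHealth from a cached latest state) can be sketched as follows. `healthChecker`, `refresh`, and `checkHealth` are hypothetical names; the probe function stands in for the real check over collections and channels.

```go
package main

import (
	"fmt"
	"sync"
)

// healthState is a hypothetical summary of the cluster's health.
type healthState struct {
	healthy bool
	reasons []string
}

// healthChecker caches the latest result of an expensive health probe.
type healthChecker struct {
	mu     sync.RWMutex
	latest healthState
	probe  func() healthState // the expensive check
}

// refresh runs the probe and stores its result; a real service would call
// it periodically from a background goroutine.
func (h *healthChecker) refresh() {
	s := h.probe()
	h.mu.Lock()
	h.latest = s
	h.mu.Unlock()
}

// checkHealth answers from the cached state without running the probe.
func (h *healthChecker) checkHealth() healthState {
	h.mu.RLock()
	defer h.mu.RUnlock()
	return h.latest
}

func main() {
	probes := 0
	h := &healthChecker{probe: func() healthState {
		probes++
		return healthState{healthy: true}
	}}
	h.refresh()
	for i := 0; i < 3; i++ {
		h.checkHealth() // served from the cache
	}
	fmt.Println(probes, h.checkHealth().healthy)
}
```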
SimFG 2afe2eaf3e
feat: support to replicate collection when the services contains the system tt msg (#37559)
- issue: #37105

---------

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-12-17 09:08:46 +08:00
wei liu 659847c11f
enhance: Remove load task limit in one round (#38436)
The task limit in assignSegment/assignChannel applies to both load
tasks and balance tasks.

This PR removes the load task limit and only limits the number of balance tasks in one
round.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-16 19:30:43 +08:00
wei liu 40f9db491e
fix: Fix SyncDistribution may cost too much time on retry (#38454)
issue: #38428

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-16 11:38:44 +08:00
tinswzy 27229f7907
enhance: refine exists log print with ctx (#38080)
issue: #35917 
Refines exists log print with ctx

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
2024-12-14 22:36:44 +08:00
Zhen Ye 833c74aa66
enhance: add detail, replica count for resource group (#38314)
issue: #30647

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-12-13 14:14:50 +08:00
wei liu e279ccf109
enhance: Enable score based balance channel policy (#38143)
issue: #38142
The current balance channel policy only considers the current collection's
distribution, so if every collection has 1 channel and all channels have
been loaded on the same querynode, balance channel won't be triggered
after the number of querynodes increases.

This PR enables a score-based balance channel policy, to achieve:
1. distribute all channels evenly across multiple querynodes
2. distribute each collection's channels evenly across multiple
querynodes.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-11 17:20:43 +08:00
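
A rough sketch of what a score-based channel policy such as the one in e279ccf109 might look like. The scoring formula and the weight of 10 are assumptions made for illustration, as are the names `nodeLoad`, `score`, and `pickNode`; the actual Milvus scoring is not shown in the commit message.

```go
package main

import (
	"fmt"
	"sort"
)

// nodeLoad is a hypothetical view of one querynode's channel distribution.
type nodeLoad struct {
	totalChannels      int // channels of all collections on the node
	collectionChannels int // channels of the collection being assigned
}

// score is an assumed formula, not the Milvus one: the collection's own
// channels weigh more so that they spread out first.
func score(l nodeLoad) int {
	return l.totalChannels + 10*l.collectionChannels
}

// pickNode returns the node with the lowest score; ties go to the smaller
// node ID. loads must be non-empty.
func pickNode(loads map[int64]nodeLoad) int64 {
	ids := make([]int64, 0, len(loads))
	for id := range loads {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	best := ids[0]
	for _, id := range ids[1:] {
		if score(loads[id]) < score(loads[best]) {
			best = id
		}
	}
	return best
}

func main() {
	loads := map[int64]nodeLoad{
		1: {totalChannels: 3, collectionChannels: 0},
		2: {totalChannels: 0, collectionChannels: 0}, // newly added, empty node
	}
	fmt.Println(pickNode(loads))
}
```

Because the score includes the node's total channel count, the empty node is preferred even though the collection being assigned has no channel on either node, which is the case a per-collection policy would miss.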
Zhen Ye d3ae8e9232
fix: delay the wait other coord logic in query coord after query coord change into standby state (#38259)
issue: https://github.com/milvus-io/milvus/issues/37764

- After the rpc layer was removed from mixcoord, a querycoord in standby mode
is blocked forever during a rolling deployment.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-12-11 15:48:42 +08:00
wei liu 950203aba0
enhance: Optimize save collection target latency (#38345)
issue: #38237
This PR uses a better compression level only for proto messages larger
than 1MB, and a lighter compression level for smaller proto messages,
which gives better latency in most cases.

This PR reduces the latency from 22.7s to 4.7s with 10000
collections, each with 1000 segments.

Before this PR:
BenchmarkTargetManager-8 1 22781536357 ns/op 566407275088 B/op 11188282 allocs/op
After this PR:
BenchmarkTargetManager-8 1 4729566944 ns/op 36713248864 B/op 10963615 allocs/op

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-11 10:12:43 +08:00
congqixia 7ea9c983d2
enhance: Add mockery package config for QC&QN (#38340)
Related to #38339

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-10 19:18:42 +08:00
wei liu 856e2aad7d
fix: Leader task stuck and retry again and again (#38202)
issue: #38201
A leader task is required to update the delegator's distribution, and only succeeds
after the distribution change has been applied to the delegator. But the
delegator will reject the distribution change if its version is older
than the current version in the delegator, which causes the leader task to get stuck and
retry forever.

This PR removes the leader task finish check.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-10 19:16:42 +08:00
wei liu f04986fceb
enhance: Remove constraint on release segment task (#38297)
issue: #38305
Now that balance segment and balance channel can no longer happen at the same
time, the constraint that a release segment task must happen on a
serviceable shard leader is unnecessary.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-10 11:18:49 +08:00
jaime 8ed019735c
enhance: add disk stats within system metrics (#38033)
issue: #36621

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-12-06 16:32:41 +08:00
congqixia 36946cc9ce
enhance: Set loaded collection/partition number to metrics (#38271)
Related to #36456
Previous PR: #38471 #38233

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-06 16:18:40 +08:00
congqixia 6ff19481f0
enhance: Resolve compilation error due to PR conflict (#38252)
Related pr: #38233 #38059

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-05 19:26:40 +08:00
congqixia 051bc280dd
enhance: Make dynamic load/release partition follow targets (#38059)
Related to #37849

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-05 16:24:40 +08:00
congqixia 32645fc28a
enhance: Unify querycoord meta metrics (#38233)
Related to #36456

Unify collection/partition number metrics in the collection manager to
avoid accidentally missed updates.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-05 15:48:39 +08:00
tinswzy 7944538ade
enhance: Add ctx param to KV operation interfaces (#38154)
issue: #35917 
Refine KV operation interfaces by adding a ctx param

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
2024-12-05 15:16:41 +08:00
tinswzy e76802f910
enhance: refine querycoord meta/catalog related interfaces to ensure that each method includes a ctx parameter (#37916)
issue: #35917 
This PR refines the querycoord meta-related interfaces to ensure that
each method includes a ctx parameter.

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
2024-11-25 11:14:34 +08:00
jaime 7bbfe86bcd
enhance: add list index and segment index retrieval API for WebUI (#37861)
issue: #36621

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-11-22 16:58:34 +08:00
congqixia b34bfb98a0
enhance: Refine Replica manager colle2Replicas secondary index (#37906)
Related to #37630

This PR adds a new coll2Replicas secondary index to reduce map
access and iteration when getting replicas by collection.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-22 11:54:32 +08:00
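
A minimal sketch of a secondary index like the `coll2Replicas` map mentioned in b34bfb98a0. The `replica` and `replicaManager` types and their methods are hypothetical simplifications, not the Milvus types.

```go
package main

import "fmt"

// replica is a hypothetical, simplified replica record.
type replica struct {
	id           int64
	collectionID int64
}

// replicaManager keeps a primary map plus a secondary index by collection.
type replicaManager struct {
	replicas      map[int64]*replica           // replica ID -> replica
	coll2Replicas map[int64]map[int64]*replica // collection ID -> its replicas
}

func newReplicaManager() *replicaManager {
	return &replicaManager{
		replicas:      make(map[int64]*replica),
		coll2Replicas: make(map[int64]map[int64]*replica),
	}
}

// put stores the replica and keeps the secondary index in sync.
func (m *replicaManager) put(r *replica) {
	m.replicas[r.id] = r
	if m.coll2Replicas[r.collectionID] == nil {
		m.coll2Replicas[r.collectionID] = make(map[int64]*replica)
	}
	m.coll2Replicas[r.collectionID][r.id] = r
}

// remove deletes the replica from both maps and drops empty index buckets.
func (m *replicaManager) remove(id int64) {
	r, ok := m.replicas[id]
	if !ok {
		return
	}
	delete(m.replicas, id)
	delete(m.coll2Replicas[r.collectionID], id)
	if len(m.coll2Replicas[r.collectionID]) == 0 {
		delete(m.coll2Replicas, r.collectionID)
	}
}

// getByCollection touches only one collection's replicas instead of
// iterating over every replica in the manager.
func (m *replicaManager) getByCollection(collectionID int64) []*replica {
	out := make([]*replica, 0, len(m.coll2Replicas[collectionID]))
	for _, r := range m.coll2Replicas[collectionID] {
		out = append(out, r)
	}
	return out
}

func main() {
	m := newReplicaManager()
	m.put(&replica{id: 1, collectionID: 100})
	m.put(&replica{id: 2, collectionID: 100})
	m.put(&replica{id: 3, collectionID: 200})
	m.remove(2)
	fmt.Println(len(m.getByCollection(100)), len(m.getByCollection(200)))
}
```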
wei liu 965bda6e60
enhance: Add channel name to shard leader log in meta cache (#37856)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-21 19:24:31 +08:00
wei liu 0a440e0d38
fix: Prevent simultaneous balance of segments and channels (#37850)
issue: #33550
Balance segment and balance channel executing at the same time
causes a bunch of corner cases.

This PR disables simultaneous balancing of segments and channels.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-21 17:56:55 +08:00
wei liu b983ef9fca
fix: Channel may be released after balance (#37862)
issue: #37830
Because the dist handler doesn't set the channel's version, if the channel checker
tries to dedup channels, it may release the new delegator after the balance
has finished.

This PR sets the proper version for the channel.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-21 10:40:31 +08:00
congqixia b8d31ebed8
enhance: Remove unnecessary segment clone updating dist (#37797)
Related to #37630

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-20 11:26:31 +08:00
yihao.dai b6612e02b4
enhance: Reduce GetIndexInfos calls (#37695)
Batch `GetIndexInfos` calls for segments to reduce RPC calls.

issue: https://github.com/milvus-io/milvus/issues/37634

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-19 14:24:31 +08:00
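
The batching in b6612e02b4 can be illustrated with a sketch that counts round trips. `indexClient`, `getIndexInfos`, and `fetchBatched` are hypothetical stand-ins, and the batch size is an assumed parameter; the real RPC signature is not shown in the commit message.

```go
package main

import "fmt"

// indexClient is a hypothetical stand-in for the RPC client; calls counts round trips.
type indexClient struct{ calls int }

// getIndexInfos pretends to fetch index info for all given segments in one RPC.
func (c *indexClient) getIndexInfos(segmentIDs []int64) map[int64]string {
	c.calls++
	out := make(map[int64]string, len(segmentIDs))
	for _, id := range segmentIDs {
		out[id] = fmt.Sprintf("index-of-%d", id)
	}
	return out
}

// fetchBatched issues one call per batch of at most batchSize segments
// instead of one call per segment. batchSize must be positive.
func fetchBatched(c *indexClient, segmentIDs []int64, batchSize int) map[int64]string {
	out := make(map[int64]string, len(segmentIDs))
	for start := 0; start < len(segmentIDs); start += batchSize {
		end := start + batchSize
		if end > len(segmentIDs) {
			end = len(segmentIDs)
		}
		for id, info := range c.getIndexInfos(segmentIDs[start:end]) {
			out[id] = info
		}
	}
	return out
}

func main() {
	c := &indexClient{}
	infos := fetchBatched(c, []int64{1, 2, 3, 4, 5}, 2)
	fmt.Println(len(infos), c.calls)
}
```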
congqixia 6d86b9022e
enhance: Provide secondary index criteria when filter leaderview (#37777)
Related to #37630

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-19 10:12:30 +08:00
jaime 257ecab84b
enhance: remove collection queryable check from health check (#37712)
Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-11-18 10:50:38 +08:00
congqixia b0bd290a6e
enhance: Use internal json(sonic) to replace std json lib (#37708)
Related to #35020

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-18 10:46:31 +08:00