milvus

Commit Graph

Author	SHA1	Message	Date
congqixia	01f8faacae	fix: [hotfix] Add sub task pool for multi-stage tasks (#40080 ) Cherry-pick from master pr: #40079 Related to #40078 Add a subTaskPool to execute sub task in case of logic deadlock described in issue. --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com> Signed-off-by: Cai Zhang <cai.zhang@zilliz.com> Signed-off-by: bigsheeper <yihao.dai@zilliz.com> Co-authored-by: Cai Zhang <cai.zhang@zilliz.com> Co-authored-by: bigsheeper <yihao.dai@zilliz.com>	2025-02-21 16:06:12 +08:00
yihao.dai	4464966462	enhance: [2.5] Remove frequent observe log (#39414 ) /kind improvement pr: https://github.com/milvus-io/milvus/pull/39413 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-01-20 11:01:10 +08:00
yihao.dai	89a183c7c2	enhance: [2.5] enable task delta cache (#39349 ) When there are many segment tasks in the querycoord scheduler, the traversal in GetSegmentTaskDelta checks becomes time-consuming. This PR adds caching for segment deltas. issue: https://github.com/milvus-io/milvus/issues/37630 pr: https://github.com/milvus-io/milvus/pull/39307 Signed-off-by: bigsheeper <yihao.dai@zilliz.com> Co-authored-by: Wei Liu <wei.liu@zilliz.com>	2025-01-17 12:01:03 +08:00
yihao.dai	6773fb10a8	enhance: [2.5] Read metadata concurrently to accelerate recovery (#38900 ) Read metadata such as segments, binlogs, and partitions concurrently at the collection level. issue: https://github.com/milvus-io/milvus/issues/37630 pr: https://github.com/milvus-io/milvus/pull/38403 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-01-16 17:53:01 +08:00
yihao.dai	9d2a0e775c	fix: [2.5] Fix slow dist handle and slow observe (#38905 ) 1. Provide partition&channel level indexing in the collection target. 2. Make SegmentAction not wait for distribution. 3. Remove scheduler and target manager mutex 4. Optimize logging to reduce CPU overhead. issue: https://github.com/milvus-io/milvus/issues/37630 pr: https://github.com/milvus-io/milvus/pull/38566 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-01-16 17:07:02 +08:00
yihao.dai	c741b8be2b	fix: [2.5] Remove frequently updating metric to avoid mutex contention (#38778 ) issue: https://github.com/milvus-io/milvus/issues/37630 Reduce the frequency of `updateIndexTasksMetrics` to avoid holding the mutex for long periods. pr: https://github.com/milvus-io/milvus/pull/38775 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-01-16 11:51:02 +08:00
wei liu	76ed552b00	enhance: Add logs for check health failed (#39208 ) (#39302 ) pr: #39208 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-01-16 10:31:04 +08:00
wei liu	51994158d9	fix: channel unbalance during stopping balance progress (#38971 ) (#39200 ) issue: #38970 pr: #38971 The stopping-balance channel logic still uses the row_count_based policy, which may cause channel imbalance in the multi-collection case. This PR implements a score-based stopping-balance channel policy. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-01-14 18:25:00 +08:00
wei liu	4fd56e4773	fix: Prevent leader checker from generating excessive duplicate leader tasks (#39000 ) (#39160 ) issue: #39001 pr: #39000 Background: Segment Load Version: Each segment load request assigns a timestamp as its version. When multiple copies of a segment are loaded on different QueryNodes, the leader checker uses this version to identify the latest copy and updates the routing table in the leader view to point to it. Delegator Router Version: When a delegator builds a route to a QueryNode that has loaded a segment, it also records the segment's version. Router Table Update Logic: If the leader checker detects that the version of a segment in the routing table does not match the version in the worker, it updates the routing table to point to the QueryNode with the latest version. Additionally, it updates the segment's load version in the QueryNode during this process. Issue: When a channel is undergoing load balancing, the leader checker may sync the routing table to a new delegator. This sync operation modifies the segment's load version, which invalidates the routing in the old delegator. Subsequently, the leader checker updates the routing table in the old delegator, breaking the routing in the new delegator. This cycle continues, causing repeated updates and inconsistencies. Fix: This PR introduces two changes to address the issue: 1. Use NodeID to verify whether the delegator's routing table needs an update, avoiding unnecessary modifications. 2. Ensure compatibility by using the latest segment's load version as the version recorded in the routing table. These changes resolve the cyclic updates and prevent the leader checker from generating excessive duplicate tasks, ensuring routing stability across delegators during load balancing. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-01-14 18:11:06 +08:00
Zhen Ye	adfc3f945e	enhance: record memory size (uncompressed) item for index (#38844 ) issue: #38715 pr: #38770 - Currently, Milvus uses the serialized (compressed) index size to estimate the resources needed for loading. - Add a new field MemSize (the size before compression) for the index to estimate resources. --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-01-14 10:33:06 +08:00
jaime	b0afe32c98	fix: unstable ut in leader_vew_manager.go file (#39162 ) issue: #38672 pr: #39161 Signed-off-by: jaime <yun.zhang@zilliz.com>	2025-01-10 19:54:57 +08:00
Zhen Ye	95809ca767	enhance: make new go package to manage proto (#39128 ) issue: #39095 pr: #39114 --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-01-10 10:53:01 +08:00
jaime	0693634f62	enhance: add db name in replica description (#38673 ) issue: #36621 pr: #38672 Signed-off-by: jaime <yun.zhang@zilliz.com>	2025-01-09 19:43:04 +08:00
wei liu	35cef0567c	enhance: Add log for case which target not update as expected (#38944 ) (#39046 ) pr: #38944 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-01-08 19:32:57 +08:00
Xiaofan	a2c4cd59ce	fix: drop partition can not be successful if load failed[2.5] (#38874 ) fix https://github.com/milvus-io/milvus/issues/38649 pr: #38793 When a partition load fails, dropping the partition also fails due to the wrong error message. Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>	2025-01-02 09:56:53 +08:00
wei liu	f441ccdbe9	fix: [2.5] Prevent balancer from overloading the same QueryNode (#38724 ) issue: #38718 pr: #38719 The balancer calculates the workload of executing tasks as an ongoing score for target nodes. However, a logic issue arises when GetSegmentTaskDelta or GetChannelTaskDelta is called with collectionID=-1, which incorrectly returns zero. Due to the incorrect global score, the executing task's workload is not properly reflected for each collection. Consequently, each collection submits its own balance task, leading to the balancer assigning excessive tasks to the same QueryNode. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-12-25 16:16:49 +08:00
wei liu	cb0618b2d4	fix: [2.5] Querycoord will trigger unexpected balance task after restart (#38725 ) issue: https://github.com/milvus-io/milvus/issues/38606 pr: https://github.com/milvus-io/milvus/pull/38630 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-12-25 16:14:49 +08:00
wei liu	b16d04d7cc	fix: Fix update loading collection's load config doesn't work (#38737 ) issue: #38594 pr: #38595 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-12-25 15:02:50 +08:00
jaime	11bedf5e76	fix: Revert "Expose metrics of stanby coordinators (#27698 )" (#38621 ) issue: #38608 pr: #38620 Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-12-20 18:04:47 +08:00
jaime	78438ef41e	fix: revert optimize CPU usage for CheckHealth requests (#35589 ) (#38555 ) issue: #35563 Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-12-19 00:38:45 +08:00
yihao.dai	d3c174b0f1	enhance: Accelerate observe collection (#38028 ) 1. A collection should observe the channel only once. 2. A collection should check the CollectionLoadPercent for updates only once. 3. Skip saving coll/partition meta if there are no changes, primarily to accelerate collection observation after recovery. issue: https://github.com/milvus-io/milvus/issues/37630 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-12-17 14:14:45 +08:00
jaime	28fdbc4e30	enhance: optimize CPU usage for CheckHealth requests (#35589 ) issue: #35563 1. Use an internal health checker to monitor the cluster's health state, storing the latest state on the coordinator node. The CheckHealth request retrieves the cluster's health from this latest state on the proxy sides, which enhances cluster stability. 2. Each health check will assess all collections and channels, with detailed failure messages temporarily saved in the latest state. 3. Use CheckHealth request instead of the heavy GetMetrics request on the querynode and datanode Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-12-17 11:02:45 +08:00
SimFG	2afe2eaf3e	feat: support to replicate collection when the services contains the system tt msg (#37559 ) - issue: #37105 --------- Signed-off-by: SimFG <bang.fu@zilliz.com>	2024-12-17 09:08:46 +08:00
wei liu	659847c11f	enhance: Remove load task limit in one round (#38436 ) The task limit in assignSegment/assignChannel applies to both load tasks and balance tasks. This PR removes the load task limit and only limits the number of balance tasks in one round. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-12-16 19:30:43 +08:00
wei liu	40f9db491e	fix: Fix SyncDistribution may cost too much time on retry (#38454 ) issue: #38428 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-12-16 11:38:44 +08:00
tinswzy	27229f7907	enhance: refine exists log print with ctx (#38080 ) issue: #35917 Refines exists log print with ctx Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>	2024-12-14 22:36:44 +08:00
Zhen Ye	833c74aa66	enhance: add detail, replica count for resource group (#38314 ) issue: #30647 --------- Signed-off-by: chyezh <chyezh@outlook.com>	2024-12-13 14:14:50 +08:00
wei liu	e279ccf109	enhance: Enable score based balance channel policy (#38143 ) issue: #38142 The current balance channel policy only considers the current collection's distribution, so if every collection has 1 channel and all channels have been loaded on the same querynode, balance channel won't be triggered after the number of querynodes increases. This PR enables a score-based balance channel policy to: 1. distribute all channels evenly across multiple querynodes 2. distribute each collection's channels evenly across multiple querynodes. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-12-11 17:20:43 +08:00
Zhen Ye	d3ae8e9232	fix: delay the wait other coord logic in query coord after query coord change into standby state (#38259 ) issue: https://github.com/milvus-io/milvus/issues/37764 - After the rpc layer was removed from mixcoord, the querycoord in standby mode is blocked forever during a rolling deployment --------- Signed-off-by: chyezh <chyezh@outlook.com>	2024-12-11 15:48:42 +08:00
wei liu	950203aba0	enhance: Optimize save colelction target latency (#38345 ) issue: #38237 This PR uses a better compression level only for proto messages larger than 1MB and a lighter compression level for smaller proto messages, which gives better latency in most cases. It reduces the latency from 22.7s to 4.7s with 10000 collections, each with 1000 segments. Before this PR: BenchmarkTargetManager-8 1 22781536357 ns/op 566407275088 B/op 11188282 allocs/op After this PR: BenchmarkTargetManager-8 1 4729566944 ns/op 36713248864 B/op 10963615 allocs/op Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-12-11 10:12:43 +08:00
congqixia	7ea9c983d2	enhance: Add mockery package config for QC&QN (#38340 ) Related to #38339 Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-12-10 19:18:42 +08:00
wei liu	856e2aad7d	fix: Leader task stuck and retry again and again (#38202 ) issue: #38201 A leader task needs to update the delegator's distribution and only succeeds after the distribution change has been applied to the delegator. However, the delegator rejects the distribution change if its version is older than the current version in the delegator, which causes the leader task to get stuck and retry forever. This PR removes the leader task finish check. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-12-10 19:16:42 +08:00
wei liu	f04986fceb	enhance: Remove constraint on release segment task (#38297 ) issue: #38305 Since balance segment and balance channel can no longer happen at the same time, the constraint that release segment must happen on a serviceable shard leader is unnecessary. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-12-10 11:18:49 +08:00
jaime	8ed019735c	enhance: add disk stats within system metrics (#38033 ) issue: #36621 Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-12-06 16:32:41 +08:00
congqixia	36946cc9ce	enhance: Set loaded collection/partition number to metrics (#38271 ) Related to #36456 Previous PR: #38471 #38233 Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-12-06 16:18:40 +08:00
congqixia	6ff19481f0	enhance: Resolve compilation error due to PR conflict (#38252 ) Related pr: #38233 #38059 Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-12-05 19:26:40 +08:00
congqixia	051bc280dd	enhance: Make dynamic load/release partition follow targets (#38059 ) Related to #37849 --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-12-05 16:24:40 +08:00
congqixia	32645fc28a	enhance: Unify querycoord meta metrics (#38233 ) Related to #36456 Unify the collection/partition number metrics in the collection manager to avoid unintentionally missed modifications Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-12-05 15:48:39 +08:00
tinswzy	7944538ade	enhance: Add ctx param to KV operation interfaces (#38154 ) issue: #35917 Refine KV operation interfaces by adding a ctx param Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>	2024-12-05 15:16:41 +08:00
tinswzy	e76802f910	enhance: refine querycoord meta/catalog related interfaces to ensure that each method includes a ctx parameter (#37916 ) issue: #35917 This PR refine the querycoord meta related interfaces to ensure that each method includes a ctx parameter. Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>	2024-11-25 11:14:34 +08:00
jaime	7bbfe86bcd	enhance: add list index and segment index retrieval API for WebUI (#37861 ) issue: #36621 Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-11-22 16:58:34 +08:00
congqixia	b34bfb98a0	enhance: Refine Replica manager colle2Replicas secondary index (#37906 ) Related to #37630 This PR adds a new util coll2Replicas secondary index to reduce map access & iteration when getting replicas by collection --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-11-22 11:54:32 +08:00
wei liu	965bda6e60	enhance: Add channel name to shard leader log in meta cache (#37856 ) Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-11-21 19:24:31 +08:00
wei liu	0a440e0d38	fix: Prevent simultaneous balance of segments and channels (#37850 ) issue: #33550 Balance segment and balance channel executing at the same time causes a bunch of corner cases. This PR disables simultaneous balancing of segments and channels. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-11-21 17:56:55 +08:00
wei liu	b983ef9fca	fix: Channel may be released after balance (#37862 ) issue: #37830 Because the dist handler doesn't set the channel's version, the channel checker may release the new delegator after balance finishes when it tries to dedup channels. This PR sets the proper version for the channel. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-11-21 10:40:31 +08:00
congqixia	b8d31ebed8	enhance: Remove unnecessary segment clone updating dist (#37797 ) Related to #37630 Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-11-20 11:26:31 +08:00
yihao.dai	b6612e02b4	enhance: Reduce GetIndexInfos calls (#37695 ) Batch `GetIndexInfos` calls for segments to reduce RPC calls. issue: https://github.com/milvus-io/milvus/issues/37634 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-11-19 14:24:31 +08:00
congqixia	6d86b9022e	enhance: Provide secondary index critria when filter leaderview (#37777 ) Related to #37630 --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-11-19 10:12:30 +08:00
jaime	257ecab84b	enhance: remove collection queryable check from health check (#37712 ) Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-11-18 10:50:38 +08:00
congqixia	b0bd290a6e	enhance: Use internal json(sonic) to replace std json lib (#37708 ) Related to #35020 Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-11-18 10:46:31 +08:00

633 Commits (hotfix-2.5.4)
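
Commits 89a183c7c2 and f441ccdbe9 in the table both concern segment task deltas in the querycoord scheduler: the first caches deltas to avoid traversing every task, and the second fixes lookups with collectionID=-1 that incorrectly returned zero. The sketch below illustrates one way such a cache could behave, assuming that -1 means "all collections". It is not the Milvus code; `deltaCache`, `add`, `get`, and `allCollections` are hypothetical names.

```go
package main

import "fmt"

// allCollections is a hypothetical wildcard that follows the
// collectionID=-1 convention mentioned in the commit message.
const allCollections int64 = -1

// deltaCache keeps a running per-node, per-collection delta so lookups do
// not need to traverse every pending task.
type deltaCache struct {
	deltas map[int64]map[int64]int // nodeID -> collectionID -> delta
}

func newDeltaCache() *deltaCache {
	return &deltaCache{deltas: map[int64]map[int64]int{}}
}

// add applies a delta when a task is added (positive) or finished (negative).
func (c *deltaCache) add(nodeID, collectionID int64, delta int) {
	perNode, ok := c.deltas[nodeID]
	if !ok {
		perNode = map[int64]int{}
		c.deltas[nodeID] = perNode
	}
	perNode[collectionID] += delta
	if perNode[collectionID] == 0 {
		delete(perNode, collectionID) // drop empty entries
	}
}

// get returns the delta for one collection on a node, or the sum over all
// collections on that node when collectionID is allCollections.
func (c *deltaCache) get(nodeID, collectionID int64) int {
	perNode := c.deltas[nodeID] // reading a missing key yields a nil map, which is safe to read
	if collectionID != allCollections {
		return perNode[collectionID]
	}
	sum := 0
	for _, d := range perNode {
		sum += d
	}
	return sum
}

func main() {
	c := newDeltaCache()
	c.add(1, 100, 500) // task for collection 100 targets node 1
	c.add(1, 200, 300) // task for collection 200 targets node 1

	fmt.Println(c.get(1, 100))            // 500
	fmt.Println(c.get(1, allCollections)) // 800

	c.add(1, 100, -500)                   // the first task finished
	fmt.Println(c.get(1, allCollections)) // 300
}
```

With the wildcard lookup summing across collections, a balancer working on one collection can see the load that other collections' in-flight tasks already place on a node.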