milvus

Commit Graph

Author	SHA1	Message	Date
wei liu	4fd56e4773	fix: Prevent leader checker from generating excessive duplicate leader tasks (#39000 ) (#39160 ) issue: #39001 pr: #39000 Background: Segment Load Version: Each segment load request assigns a timestamp as its version. When multiple copies of a segment are loaded on different QueryNodes, the leader checker uses this version to identify the latest copy and updates the routing table in the leader view to point to it. Delegator Router Version: When a delegator builds a route to a QueryNode that has loaded a segment, it also records the segment's version. Router Table Update Logic: If the leader checker detects that the version of a segment in the routing table does not match the version in the worker, it updates the routing table to point to the QueryNode with the latest version. Additionally, it updates the segment's load version in the QueryNode during this process. Issue: When a channel is undergoing load balancing, the leader checker may sync the routing table to a new delegator. This sync operation modifies the segment's load version, which invalidates the routing in the old delegator. Subsequently, the leader checker updates the routing table in the old delegator, breaking the routing in the new delegator. This cycle continues, causing repeated updates and inconsistencies. Fix: This PR introduces two changes to address the issue: 1. Use NodeID to verify whether the delegator's routing table needs an update, avoiding unnecessary modifications. 2. Ensure compatibility by using the latest segment's load version as the version recorded in the routing table. These changes resolve the cyclic updates and prevent the leader checker from generating excessive duplicate tasks, ensuring routing stability across delegators during load balancing. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2025-01-14 18:11:06 +08:00
Zhen Ye	95809ca767	enhance: make new go package to manage proto (#39128 ) issue: #39095 pr: #39114 --------- Signed-off-by: chyezh <chyezh@outlook.com>	2025-01-10 10:53:01 +08:00
wei liu	cb0618b2d4	fix: [2.5] Querycoord will trigger unexpected balance task after restart (#38725 ) issue: https://github.com/milvus-io/milvus/issues/38606 pr: https://github.com/milvus-io/milvus/pull/38630 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-12-25 16:14:49 +08:00
wei liu	659847c11f	enhance: Remove load task limit in one round (#38436 ) The task limit in assignSegment/assignChannel applies to both load tasks and balance tasks. This PR removes the load task limit and only limits the number of balance tasks in one round. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-12-16 19:30:43 +08:00
tinswzy	27229f7907	enhance: refine exists log print with ctx (#38080 ) issue: #35917 Refines exists log print with ctx Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>	2024-12-14 22:36:44 +08:00
wei liu	e279ccf109	enhance: Enable score based balance channel policy (#38143 ) issue: #38142 The current balance channel policy only considers the current collection's distribution, so if every collection has 1 channel and all channels have been loaded on the same querynode, balance channel won't be triggered after the querynode number increases. This PR enables a score based balance channel policy to achieve: 1. distribute all channels evenly across multiple querynodes 2. distribute each collection's channels evenly across multiple querynodes. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-12-11 17:20:43 +08:00
tinswzy	e76802f910	enhance: refine querycoord meta/catalog related interfaces to ensure that each method includes a ctx parameter (#37916 ) issue: #35917 This PR refine the querycoord meta related interfaces to ensure that each method includes a ctx parameter. Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>	2024-11-25 11:14:34 +08:00
yihao.dai	b6612e02b4	enhance: Reduce GetIndexInfos calls (#37695 ) Batch `GetIndexInfos` calls for segments to reduce RPC calls. issue: https://github.com/milvus-io/milvus/issues/37634 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-11-19 14:24:31 +08:00
congqixia	9539739781	enhance: Release compacted growing segment if in dropped list (#37245 ) See also #37205 Previously releasing growing segments could be triggered by two conditions: - Sealed Segment with same id is loaded - Segment start position is before target checkpoint ts Which has a worst case that the corresponding sealed segment is compacted and the checkpoint is pinned by a growing l0 segment. This PR introduces a new rule that: a growing segment could be released if the segment id appeared in current target dropped segment id list. Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-10-29 18:04:21 +08:00
jaime	9d16b972ea	feat: add tasks page into management WebUI (#37002 ) issue: #36621 1. Add API to access task runtime metrics, including: - build index task - compaction task - import task - balance (including load/release of segments/channels and some leader tasks on querycoord) - sync task 2. Add a debug model to the webpage by using debug=true or debug=false in the URL query parameters to enable or disable debug mode. Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-10-28 10:13:29 +08:00
wei liu	39a91eb100	fix: Delegator may becomes unserviceable after querycoord restart (#37055 ) issue: #37054 After querycoord restarts, segment_checker may release segments by mistake because the next target isn't ready yet. This PR requires that releasing a segment happens only after the next target is ready. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-10-24 12:21:28 +08:00
wei liu	3cd0b26285	enhance: Enable dynamic update loaded collection's replica (#35822 ) issue: #35821 After a collection is loaded, if we need to increase/decrease the collection's replica number, we need to release and load it again. Milvus offers 4 solutions to update a loaded collection's replica; this PR aims to dynamically change the replica number without release. After the replica number changes, Milvus executes load replica or release replica asynchronously, and the replica load status can be checked via the getReplicas API. Note that if more replicas are set than the querynodes can afford, the new replica won't be loaded successfully until enough querynodes join. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-09-25 10:13:18 +08:00
wei liu	fb2a41a94c	fix: Clean dirty segment/channel on querynode (#36202 ) issue: #36201 after querynode has been remove from replica, all dirty segment/channel on it should be released. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-09-13 18:15:08 +08:00
wei liu	30a99b66c1	fix: Fix logic dead lock when delegator has high memory usage (#36065 ) issue: #36064 When the delegator has high memory usage, loading an l0 segment will fail, and the balance segment task will be blocked by the load segment task, so the delegator can't free memory by moving out some segments, which causes a logical deadlock. This PR removes the limit for balance and permits load segment and balance tasks to execute in parallel, which won't cause side effects because: 1. one segment can only have one task in qc's scheduler, and a load/release task will replace the balance task if necessary 2. balance speed has been limited, so it won't block the load segment task. 3. if a collection has a load task and a balance task at the same time, the load task will be scheduled first due to its higher priority. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-09-09 10:21:06 +08:00
wei liu	c84ea5465c	fix: Fix some replicas don't participate in the query after the failure recovery (#35850 ) issue: #35846 querycoord will notify proxy to update shard leader cache after delegator location changes, but during querynode's failure recovery, some delegator may become unserviceable due to lacking of segments, and back to serviceable after segment loaded, so we also need to notify proxy to invalidate shard leader cache when delegator serviceable state changes. This PR will maintain querynode's serviceable state during heartbeat, and notify proxy to invalidate shard leader cache if serviceable state changes. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-09-03 15:39:03 +08:00
congqixia	86691656f3	enhance: Change frequent balancer debug log to rated one (#35749 ) The "skip balance" log is too frequent at debug level. This PR changes it into a rated one. Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-08-29 10:07:00 +08:00
Chun Han	3faef63a25	enhance: add log for partition stats( #30376 ) (#35219 ) related: #30376 Signed-off-by: MrPresent-Han <chun.han@gmail.com> Co-authored-by: MrPresent-Han <chun.han@gmail.com>	2024-08-02 19:34:22 +08:00
wei liu	166fc902b0	enhance: Limit collection's normal balance speed (#34810 ) issue: #34798 after we remove the task priority on query coord, to avoid load/release segment blocked by too much balance task, we limit the balance task size in each round. at same time, we reduce the balance interval to trigger balance more frequently. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-07-24 19:11:44 +08:00
wei liu	40e39ef7c9	fix: Avoid segment lack caused by deduplicate segment task (#34782 ) issue: #34781 When a balance segment task hasn't finished yet, query coord may find 2 loaded copies of the segment and generate a deduplicate task, which may cancel the balance task. The old copy is then released while the new copy is canceled before it is ready, so search fails due to segment lack. This PR sets the deduplicate segment task's priority to low, to avoid the balance segment task being canceled by the deduplicate task. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-07-22 16:35:43 +08:00
congqixia	b284b81a47	fix: Check partition in current target when observing partition load status (#34282 ) See also #34234 `LoadPartitions` does not guarantee the current target has loading partitions if there are some partitions already loaded before. This PR check current target contains the partition to load when advancing loading percentage to 100. Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-07-01 17:40:07 +08:00
wei liu	f7ecafe77d	enhance: Skip update index for L0 segment (#34099 ) Trying to update the index for an l0 segment fails with `index not found`. This PR skips updating the index for l0 segments. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-07-01 10:26:06 +08:00
jaime	9630974fbb	enhance: move rocksmq from internal to pkg module (#33881 ) issue: #33956 Signed-off-by: jaime <yun.zhang@zilliz.com>	2024-06-25 21:18:15 +08:00
Chun Han	f7af323d1e	fix: sync partitiion stats blocking balance task(#33741 ) (#33742 ) related: #33741 Signed-off-by: MrPresent-Han <chun.han@zilliz.com>	2024-06-11 14:21:56 +08:00
wayblink	a1232fafda	feat: Major compaction (#33620 ) #30633 Signed-off-by: wayblink <anyang.wang@zilliz.com> Co-authored-by: MrPresent-Han <chun.han@zilliz.com>	2024-06-10 21:34:08 +08:00
wei liu	2013d97243	enhance: Enable to dynamic update balancer policy in querycoord (#33037 ) issue: #33036 This PR enable to dynamic update balancer policy without restart querycoord. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-05-21 14:29:39 +08:00
wei liu	a7f6193bfc	fix: query node may stuck at stopping progress (#33104 ) issue: #33103 When doing stopping balance for a stopping query node, the balancer gets the node list from replica.GetNodes and checks whether each node is stopping; if so, stopping balance is triggered for that replica. After the replica refactor, replica.GetNodes only returns rwNodes, and the stopping node is kept in roNodes, so the balancer can't find the replica that contains the stopping node, stopping balance for that replica isn't triggered, and the query node is stuck forever because its segments/channels don't move out. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-05-20 10:21:38 +08:00
wei liu	e2332bdc17	enhance: Enable channel exclusive balance policy (#32911 ) issue: #32910 * split replica's node list to channels when create replicas * balance nodes among channels when node change happens * implement channel level balance, let balance happens in channel level Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-05-10 17:27:31 +08:00
wei liu	fad8f0afa5	enhance: enable stopping balance after balance has been suspended (#32812 ) issue: #32811 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-05-08 10:15:29 +08:00
wei liu	ba02d54a30	enhance: update shard leader cache when leader location changed (#32470 ) issue: #32466 this PR enhance that when shard location changed, update proxy's shard leader cache. in case of query node failover case, proxy can find replica recover --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-05-08 10:05:29 +08:00
Xiaofan	02ace25c68	enhance: reduce the cpu usage when collection number is high (#32245 ) related to #32165 1. for all the manager, support collection level index 2. remove collection level filter to avoid extra cpu usage when collection number increases Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>	2024-04-26 11:49:25 +08:00
wei liu	4822b109bd	fix: Skip to load l0 segment on old version query node (#32124 ) issue: #32107 during rolling upgrade progress, skip to load l0 segment on old version query node --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-04-15 11:23:23 +08:00
chyezh	48fe977a9d	enhance: declarative resource group api (#31930 ) issue: #30647 - Add declarative resource group api - Add config for resource group management - Resource group recovery enhancement --------- Signed-off-by: chyezh <chyezh@outlook.com>	2024-04-15 08:13:19 +08:00
wei liu	c4806b69c4	enhance: Refactor leader view manager interface (#31133 ) issue: #31091 This PR add GetByFilter interface in leader view manager, instead of all kind of get func --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-04-10 15:13:36 +08:00
wei liu	177ddda47f	fix: Check stale should check leader task's leader id (#31962 ) issue: #30816 check stale rules for leader task: 1. for reduce leader task, it should keep executing until leader's node become offline. 2. for grow leader task,it should keep executing until leader's node become stopping. This PR check leader node's stopping state for grow leader task Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-04-09 15:33:25 +08:00
chyezh	a2502bde75	enhance: replica manager enhancement (#31496 ) issue: #30647 - ReplicaManager manage read only node now, and always do persistent of node distribution of replica. - All segment/channel checker using ReplicaManager to get read-only node or read-write node, but not ResourceManager. - ReplicaManager promise that only apply unique querynode to one replica in same collection now (replicas in same collection never hold same querynode at same time). - ReplicaManager promise that fairly node count assignment policy if multi replicas of collection is assigned to one resource group. - Move some parameters check into ReplicaManager to avoid data race. - Allow transfer replica to resource group that already load replica of same collection - Allow transfer node between resource groups that load replica of same collection --------- Signed-off-by: chyezh <chyezh@outlook.com>	2024-04-05 04:57:16 +08:00
Bingyi Sun	91cb529ba6	fix: get latest collection info when checking index (#31744 ) issue: https://github.com/milvus-io/milvus/issues/31727 --------- Signed-off-by: sunby <sunbingyi1992@gmail.com>	2024-04-02 14:43:13 +08:00
wei liu	0944a1f790	enhance: Refactor channel dist manager interface (#31119 ) issue: #31091 This PR add GetByFilter interface in channel dist manager, instead of all kind of get func --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-04-02 10:23:14 +08:00
wei liu	bb500d66c7	fix: Remove segment from leader view can't be executed (#31663 ) issue: #31664 Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-04-01 10:39:12 +08:00
wei liu	c311932d5f	fix: Update segment's version in leader task (#31643 ) issue: #31468 1. when segment's version in leader view doesn't match segment's version in dist, should update leader view 2. after call loadDeltalog, should update segment's load version with latest ts 3. change leader task's priority from high to low, to avoid leader task replace segment task and balance task --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-04-01 10:37:21 +08:00
wei liu	92971707de	enhance: Add restful api for devops to execute rolling upgrade (#29998 ) issue: #29261 This PR Add restful api for devops to execute rolling upgrade, including suspend/resume balance and manual transfer segments/channels. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-03-27 16:15:19 +08:00
wei liu	5d752498e7	fix: Skip release duplicate l0 segment (#31540 ) issue: #31480 #31481 release duplicate l0 segment task, which execute on old delegator may cause segment lack, and execute on new delegator may break new delegator's leader view. This PR skip release duplicate l0 segment by segment_checker, cause l0 segment will be released with unsub channel --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-03-27 12:53:10 +08:00
congqixia	4d2142d041	fix: Check latest leader exists before using it (#31500 ) See also #31495 --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-03-22 18:25:07 +08:00
chyezh	9f9ef8ac32	enhance: transfer resource group and dbname to querynode when load (#30936 ) issue: #30931 Signed-off-by: chyezh <chyezh@outlook.com>	2024-03-21 11:59:12 +08:00
wei liu	c26c1b33c2	fix: Transfer l0 segment to new delegator after balance (#31319 ) issue: #30186 during channel balance, after new delegator loaded, instead of syncing l0 segment's location to new delegator, we should load l0 segment on new delegator, and release the old l0 segment, then start to release old delegator. --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-03-19 09:59:05 +08:00
chyezh	ff4237bb90	enhance: add hostname into node info (#30673 ) issue: https://github.com/milvus-io/milvus/issues/30647 - Address may be reused in k8s environment. Using hostname can be better. Signed-off-by: chyezh <chyezh@outlook.com>	2024-03-15 10:45:06 +08:00
wei liu	06b191b164	fix: Balance channel stuck forever due to logic dead lock (#31202 ) issue: #30816 Balance channel waits until the leader view catches up with the current target before it starts to unsub the old delegator, which makes sure that the new delegator can serve search before the old delegator is released. But another rule in segment_checker skips loading segments during balance channel. So if a query node crashes during balance channel, the new delegator can never catch up with the target, and balance channel is stuck forever. This PR removes the rule that skips loading segments during balance channel to avoid the logical deadlock. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-03-13 15:05:04 +08:00
wei liu	ddd918ba04	enhance: change frequency log to rated level (#31084 ) This PR change frequency log of check shard leader to rated level --------- Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-03-08 16:39:02 +08:00
wei liu	efe8cecc88	enhance: refactor segment dist manager interface (#31073 ) issue: #31091 This PR add `GetByFilter` interface in segment dist manager, instead of all kind of get func Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-03-08 16:29:01 +08:00
wei liu	22df5061c1	fix: Leader checker can't update segment's load version (#31040 ) issue: #30890 When the leader checker finds that the leader view has an older load version of a segment, it tries to correct the leader view, but the sync action doesn't specify the latest load version, so the update operation fails. This PR fixes the leader checker being unable to update the segment's load version and repeatedly generating the same task for the scheduler. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-03-08 11:57:01 +08:00
wei liu	2a047103d6	fix: Dirty sealed segment won't release after channel balance (#31095 ) issue: #31074 This PR fix dirty sealed segment doesn't release after channel balance, dirty sealed segment means segment doesn't exist in targets. Signed-off-by: Wei Liu <wei.liu@zilliz.com>	2024-03-07 16:23:01 +08:00

126 Commits (hotfix-2.5.4)