Commit Graph

450 Commits (2.4-hotfix)

Author SHA1 Message Date
yiwangdr 018a784989
enhance: speed up GetByCollection/AndNode (2.4 patch) (#32234)
Related to https://github.com/milvus-io/milvus/issues/32165

Avoid iterating through all replicas/collections where possible.
Iteration is expensive when there is a large number of
replicas/collections.

Signed-off-by: yiwangdr <yiwangdr@gmail.com>
2024-04-15 19:53:21 +08:00
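
The idea behind #32234 above, avoiding a full scan when looking up replicas by collection, can be sketched with a secondary index. This is a minimal illustration only: `Replica`, `ReplicaIndex` and their methods are hypothetical names, not the actual querycoord types.

```go
package main

import (
	"fmt"
	"sort"
)

// Replica is a hypothetical, trimmed-down stand-in for a querycoord replica.
type Replica struct {
	ID           int64
	CollectionID int64
}

// ReplicaIndex stores replicas by ID and keeps a secondary index from
// collection ID to replica IDs, so a per-collection lookup touches only
// that collection's replicas instead of scanning all of them.
type ReplicaIndex struct {
	byID         map[int64]*Replica
	byCollection map[int64]map[int64]struct{}
}

func NewReplicaIndex() *ReplicaIndex {
	return &ReplicaIndex{
		byID:         make(map[int64]*Replica),
		byCollection: make(map[int64]map[int64]struct{}),
	}
}

// Put inserts or replaces a replica and keeps the secondary index in sync.
func (x *ReplicaIndex) Put(r *Replica) {
	x.Remove(r.ID) // drop any stale index entry first
	x.byID[r.ID] = r
	ids, ok := x.byCollection[r.CollectionID]
	if !ok {
		ids = make(map[int64]struct{})
		x.byCollection[r.CollectionID] = ids
	}
	ids[r.ID] = struct{}{}
}

// Remove deletes a replica from both maps; unknown IDs are ignored.
func (x *ReplicaIndex) Remove(id int64) {
	r, ok := x.byID[id]
	if !ok {
		return
	}
	delete(x.byID, id)
	ids := x.byCollection[r.CollectionID]
	delete(ids, id)
	if len(ids) == 0 {
		delete(x.byCollection, r.CollectionID)
	}
}

// GetByCollection returns the collection's replicas sorted by ID.
func (x *ReplicaIndex) GetByCollection(collectionID int64) []*Replica {
	ids := x.byCollection[collectionID]
	out := make([]*Replica, 0, len(ids))
	for id := range ids {
		out = append(out, x.byID[id])
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}

func main() {
	x := NewReplicaIndex()
	x.Put(&Replica{ID: 1, CollectionID: 100})
	x.Put(&Replica{ID: 2, CollectionID: 100})
	x.Put(&Replica{ID: 3, CollectionID: 200})
	fmt.Println(len(x.GetByCollection(100)), len(x.GetByCollection(200)))
}
```

The trade-off is extra bookkeeping on every write in exchange for lookups whose cost depends on one collection's replica count rather than the total.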
congqixia 9e3099e20e
enhance: [2.4] Maintain collection-partitions mapping in qc meta (#32227) (#32249)
Cherry-pick from master
pr: #32227
Related to #32165

Add a collection-to-partitionIDs mapping to avoid iterating over all
loaded partitions when getting all partitions for a collection ID.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-15 18:15:19 +08:00
wei liu e50599ba10
fix: Skip to load l0 segment on old version query node (#32131)
issue: #32107
pr: #32124

During a rolling upgrade, skip loading L0 segments on old-version query
nodes.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-15 11:23:23 +08:00
wei liu e495073e4b
fix: Use correct ts to avoid exclude segment list leak (#32191)
issue: #31990
pr: #31991

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-12 15:11:19 +08:00
congqixia 5457829660
fix: [Cherry-pick] Make `ResourceGroup.nodes` concurrent safe (#32159) (#32200)
Cherry-pick from master
pr: #32159
See also #32158

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-12 15:05:20 +08:00
congqixia 7f683000e9
fix: [2.4] Make coordinator `Register` not blocked on ProcessActiveStandby(#32069) (#32132)
Cherry-pick from master
pr: #32069
See also #32066

This PR makes coordinator registration succeed immediately and lets
`ProcessActiveStandBy` run asynchronously. Roles can then receive a stop
signal and notify the servers.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-11 17:31:19 +08:00
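
The non-blocking registration described in #32132 above can be sketched as follows. This is an illustration under assumptions: `registerAsync`, `waitForActive` and `stopWhileStandby` are hypothetical helpers, not the coordinator's real API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// registerAsync returns right away and runs the potentially long-blocking
// activation step in the background. The returned channel delivers the
// activation result exactly once.
func registerAsync(ctx context.Context, activate func(context.Context) error) <-chan error {
	result := make(chan error, 1) // buffered, so the goroutine never blocks on send
	go func() {
		result <- activate(ctx)
	}()
	return result
}

// waitForActive is a hypothetical stand-in for a standby coordinator that
// blocks until it becomes active or the context is canceled.
func waitForActive(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// stopWhileStandby simulates a stop signal arriving while the coordinator
// is still standby and reports whether the activation step observed it.
func stopWhileStandby() bool {
	ctx, cancel := context.WithCancel(context.Background())
	result := registerAsync(ctx, waitForActive) // does not block
	cancel()                                    // the stop signal
	return errors.Is(<-result, context.Canceled)
}

func main() {
	fmt.Println(stopWhileStandby())
}
```

Because the caller is not blocked inside the activation step, it stays free to react to a stop signal, which is the behavior the commit message describes.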
wei liu 94e35793c6
enhance: Refactor leader view manager interface (#31133) (#32127)
issue: #31091
pr: #31133
This PR adds a `GetByFilter` interface to the leader view manager,
replacing the various specialized get functions.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-11 10:01:23 +08:00
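
A filter-based getter of the kind #32127 above describes might look like the sketch below. The type, the filter constructors and the field names are all hypothetical and only illustrate the pattern of composing predicates instead of adding one getter per query shape.

```go
package main

import "fmt"

// LeaderView is a hypothetical, trimmed-down leader view.
type LeaderView struct {
	NodeID       int64
	CollectionID int64
	Channel      string
}

// LeaderViewFilter reports whether a view should be kept.
type LeaderViewFilter func(*LeaderView) bool

// WithCollectionID keeps views that belong to the given collection.
func WithCollectionID(id int64) LeaderViewFilter {
	return func(v *LeaderView) bool { return v.CollectionID == id }
}

// WithChannelName keeps views that serve the given channel.
func WithChannelName(name string) LeaderViewFilter {
	return func(v *LeaderView) bool { return v.Channel == name }
}

// LeaderViewManager holds views in a plain slice for this sketch.
type LeaderViewManager struct {
	views []*LeaderView
}

// GetByFilter returns the views accepted by every filter.
// With no filters it returns all views.
func (m *LeaderViewManager) GetByFilter(filters ...LeaderViewFilter) []*LeaderView {
	var out []*LeaderView
	for _, v := range m.views {
		keep := true
		for _, f := range filters {
			if !f(v) {
				keep = false
				break
			}
		}
		if keep {
			out = append(out, v)
		}
	}
	return out
}

func main() {
	m := &LeaderViewManager{views: []*LeaderView{
		{NodeID: 1, CollectionID: 1, Channel: "ch-a"},
		{NodeID: 2, CollectionID: 1, Channel: "ch-b"},
		{NodeID: 3, CollectionID: 2, Channel: "ch-a"},
	}}
	for _, v := range m.GetByFilter(WithCollectionID(1), WithChannelName("ch-b")) {
		fmt.Println(v.NodeID)
	}
}
```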
wei liu 8475a63d72
fix: Check stale should check leader task's leader (#31995)
issue: #30816
pr: #31962

Stale-check rules for leader tasks:

- A reduce leader task should keep executing until the leader's node
goes offline.
- A grow leader task should keep executing until the leader's node
becomes stopping.

This PR checks the leader node's stopping state for grow leader tasks.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-09 15:33:25 +08:00
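
The stale-check rules listed in #31995 above can be written as a small decision function. The enum and function names are hypothetical, and treating an offline leader as stale for grow tasks is an assumption of this sketch (offline is taken to be "past" stopping).

```go
package main

import "fmt"

// NodeState is a hypothetical lifecycle state of the leader's node.
type NodeState int

const (
	NodeOnline NodeState = iota
	NodeStopping
	NodeOffline
)

// ActionType is a hypothetical leader task action.
type ActionType int

const (
	ActionGrow ActionType = iota
	ActionReduce
)

// leaderTaskStale reports whether a leader task should stop executing,
// given the state of the leader's node.
func leaderTaskStale(action ActionType, leader NodeState) bool {
	switch action {
	case ActionReduce:
		// Reduce keeps running until the leader's node is offline.
		return leader == NodeOffline
	case ActionGrow:
		// Grow keeps running until the leader's node is stopping.
		// Assumption: an offline node is also treated as stale.
		return leader == NodeStopping || leader == NodeOffline
	default:
		return true // unknown actions are treated as stale in this sketch
	}
}

func main() {
	fmt.Println(leaderTaskStale(ActionGrow, NodeStopping))   // true
	fmt.Println(leaderTaskStale(ActionReduce, NodeStopping)) // false
}
```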
zhenshan.cao 4c07304790
enhance: Refactor hybrid search (#31742)
issue: https://github.com/milvus-io/milvus/issues/25639
https://github.com/milvus-io/milvus/issues/31368
pr: https://github.com/milvus-io/milvus/pull/32020

Signed-off-by: zhenshan.cao <zhenshan.cao@zilliz.com>
2024-04-09 10:15:18 +08:00
congqixia 958f933810
fix: [Cherry-pick] Check collection nil before check load status (#31850) (#31897)
Cherry-pick from master
pr: #31850
See also #31849

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-08 10:01:16 +08:00
congqixia 732f0ace11
enhance: [Cherry-pick] Add back unit test for compactor and fix some TODOs (#31829) (#31876)
Cherry-pick from master
pr: #31829
This PR adds back the compactor "Unhandled" data type unit test and
fixes some TODO behavior

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-03 17:19:15 +08:00
wei liu baa794a07d
fix: querycoord panic after node down (#31831) (#31860)
issue: #30519
pr: #31831

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-03 15:31:13 +08:00
congqixia 0b3f087896
enhance: [2.4] Add EmbedEtcd testutil and remove etcd dep of task pkg (#31802) (#31826)
Cherry-pick from master
pr: #31802 
See also #20478

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-03 10:21:14 +08:00
wei liu cc3bc556dd
enhance: Refactor channel dist manager interface (#31119) (#31814)
issue: #31091
pr: #31119
This PR adds a `GetByFilter` interface to the channel dist manager,
replacing the various specialized get functions.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-02 14:31:14 +08:00
wei liu 12d2f4b39b
fix: Update segment's version in leader task (#31643) (#31774)
issue: #31468
pr: #31643

1. When a segment's version in the leader view doesn't match its
version in dist, update the leader view.
2. After calling loadDeltalog, update the segment's load version with
the latest ts.
3. Change the leader task's priority from high to low, so that leader
tasks don't replace segment tasks and balance tasks.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-01 21:09:18 +08:00
wei liu 609674c0ea
fix: Remove segment from leader view can't be executed (#31663) (#31775)
issue: #31664
pr: #31663

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-01 16:55:19 +08:00
wei liu 3e3a92fc89
enhance: Add restful api for devops to execute rolling upgrade (#29998) (#31645)
issue: #29261
pr: #29998
This PR adds a RESTful API for devops to execute rolling upgrades,
including suspending/resuming balance and manually transferring
segments/channels.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-29 16:15:11 +08:00
wei liu ad07289819
fix: Skip release duplicate l0 segment (#31540) (#31644)
issue: #31480 #31481
pr: #31540

A task that releases a duplicate L0 segment may cause a missing segment
if it executes on the old delegator, and may break the new delegator's
leader view if it executes on the new one.

This PR skips releasing duplicate L0 segments in segment_checker,
because L0 segments are released when the channel is unsubscribed.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-27 20:09:09 +08:00
congqixia 55bc7207ed
fix: [2.4] Make target observer auto/manual task mutual exclusive (#31584) (#31602)
Cherry-pick from master
pr: #31584
See also #30867

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-27 19:51:15 +08:00
congqixia 34f21794df
enhance: [2.4] Save collection targets by batches (#31616) (#31632)
Cherry-pick from master
pr: #31616
See also #28491 #31240

When the number of collections is large, querycoord saves collection
targets one by one, which is slow and may block querycoord from exiting.

In a local run, a 500-collection scenario may take about 40 seconds to
save collection targets.

This PR changes the `SaveCollectionTarget` interface into a batch one
and groups collections into batches of 16 to accelerate this procedure.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-27 17:53:09 +08:00
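
The batching described in #31632 above amounts to splitting the collection list into fixed-size groups and issuing one save per group. A minimal sketch, where `chunk` and `saveInBatches` are hypothetical helpers and the `save` callback stands in for the real batch interface:

```go
package main

import "fmt"

// chunk splits items into consecutive batches of at most size elements.
// It returns nil for a non-positive size.
func chunk[T any](items []T, size int) [][]T {
	if size <= 0 {
		return nil
	}
	var out [][]T
	for start := 0; start < len(items); start += size {
		end := start + size
		if end > len(items) {
			end = len(items)
		}
		out = append(out, items[start:end])
	}
	return out
}

// saveInBatches calls save once per batch and returns the number of
// successful calls. It stops at the first error.
func saveInBatches(ids []int64, size int, save func(batch []int64) error) (int, error) {
	calls := 0
	for _, batch := range chunk(ids, size) {
		if err := save(batch); err != nil {
			return calls, err
		}
		calls++
	}
	return calls, nil
}

func main() {
	ids := make([]int64, 500)
	for i := range ids {
		ids[i] = int64(i)
	}
	calls, err := saveInBatches(ids, 16, func(batch []int64) error {
		return nil // a real implementation would write the batch to the meta store
	})
	fmt.Println(calls, err) // 32 <nil>
}
```

With 500 collections and a batch size of 16, this issues 32 writes (31 full batches plus one of 4) instead of 500.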
wei liu 7c576a2340
fix: Grow task stuck at stopping node (#31487) (#31613)
issue: #30816
pr: #31487
This PR fixes grow tasks getting stuck at a stopping node.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-27 16:17:15 +08:00
chyezh 96cec7871d
enhance: transfer resource group and dbname to querynode when load (#31322)
issue: #30931
pr: #30936

Signed-off-by: chyezh <chyezh@outlook.com>
2024-03-27 10:37:09 +08:00
congqixia 5d3aa2a496
fix: [Cherry-pick] Check latest leader exists before using it (#31500) (#31546)
Cherry-pick from master
pr: #31500
See also #31495

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-23 11:51:06 +08:00
congqixia 99774548f2
enhance: [Cherry-pick] Add AllPartitionsID const to replace InvalidPartitionID (#31438) (#31515)
Cherry-pick from master
pr: #31438

"-1" as `InvalidPartitionID` was previously also used as the
all-partitions placeholder in delete cases. A const with more than one
meaning is confusing and hard to maintain.

This PR adds `AllPartitionsID` to replace these usages in delete
scenarios.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-22 16:37:08 +08:00
congqixia c6019c4f9d
enhance: [Cherry-pick] Add metrics for querycoord current target cp lag (#31391) (#31420)
Cherry-pick from master
pr: #31391 #31399
See also #31390

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-20 20:47:10 +08:00
wei liu 7abebf81a3
fix: Load segment task promote failed (#31431)
issue: #30816
pr: #31430

PR #31319 introduced logic that requires the segment checker to load
level-zero segments that exist only in the current target.

This PR fixes the load segment task promotion failure when a segment
belongs only to the current target.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-20 15:11:05 +08:00
wei liu f4449d4ef4
fix: Wrong behavior of CurrentTargetFirst/NextTargetFirst in target manager (#31378)
issue: #31162
pr: #31379

When the scope is CurrentTargetFirst/NextTargetFirst, both the current
and next targets are expected to be scanned.

This PR fixes the wrong behavior of CurrentTargetFirst/NextTargetFirst
in the target manager, which could generate unexpected tasks and leave
load collection stuck forever due to a dirty leader view.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-19 11:41:05 +08:00
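
The expected scope semantics from #31378 above (scan both targets, in the preferred order) can be sketched like this. `Target` as a plain map and the `lookup` function are assumptions made for illustration; only the scope names come from the commit message.

```go
package main

import "fmt"

// TargetScope selects which targets a lookup consults and in what order.
type TargetScope int

const (
	CurrentTarget TargetScope = iota
	NextTarget
	CurrentTargetFirst
	NextTargetFirst
)

// Target is a hypothetical target: segment ID -> channel name.
type Target map[int64]string

// lookup consults the targets named by scope in order and returns the
// first hit. The *First scopes fall back to the other target on a miss.
func lookup(scope TargetScope, current, next Target, segmentID int64) (string, bool) {
	var order []Target
	switch scope {
	case CurrentTarget:
		order = []Target{current}
	case NextTarget:
		order = []Target{next}
	case CurrentTargetFirst:
		order = []Target{current, next}
	case NextTargetFirst:
		order = []Target{next, current}
	}
	for _, t := range order {
		if channel, ok := t[segmentID]; ok {
			return channel, true
		}
	}
	return "", false
}

func main() {
	current := Target{1: "from-current"}
	next := Target{1: "from-next", 2: "from-next"}

	// Segment 2 exists only in the next target, so CurrentTargetFirst
	// must fall back to it rather than report a miss.
	fmt.Println(lookup(CurrentTargetFirst, current, next, 2))
	fmt.Println(lookup(CurrentTarget, current, next, 2))
}
```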
wei liu 7ee7d484cc
fix: Transfer l0 segment to new delegator after balance (#31332)
issue: #30186
pr: #31319

During channel balance, after the new delegator is loaded, instead of
syncing the L0 segment's location to the new delegator, we should load
the L0 segment on the new delegator, release the old L0 segment, and
then start releasing the old delegator.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-18 20:49:04 +08:00
wei liu 3987cd69d7
fix: save current target after target observer stop (#31333)
issue: #28491
pr: #31315

The target should be saved to the meta store after the target observer
stops, in case the target has changed.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-18 13:49:11 +08:00
wei liu d79aa58b37
enhance: Speed up target recovery after query coord restart (#31240)
issue: #28491

After querycoord restarts, it pulls a new target, which includes the
channel and segment lists. Once the segments loaded on querynodes reach
the target, the collection can serve search/query. But if the segment
list has changed over time, then after querycoord pulls a new target it
takes a few minutes to catch up with the target's segment distribution,
and until then query/search fails due to missing segments.

This PR saves the currently loaded target to the meta store while
querycoord stops, and recovers it when querycoord starts, to speed up
target recovery.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-15 14:19:03 +08:00
chyezh ff4237bb90
enhance: add hostname into node info (#30673)
issue: https://github.com/milvus-io/milvus/issues/30647

- Addresses may be reused in a k8s environment, so the hostname can be
a more reliable identifier.

Signed-off-by: chyezh <chyezh@outlook.com>
2024-03-15 10:45:06 +08:00
jaime db79be3ae0
fix: ctx cancel should be the last step while stopping server (#31220)
issue: #31219

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-03-15 10:33:05 +08:00
congqixia 773c64ecbb
fix: Set nodeID when remove distribution (#31259)
See also #30930

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-14 15:09:03 +08:00
wei liu 06b191b164
fix: Balance channel stuck forever due to logic dead lock (#31202)
issue: #30816

Balance channel waits until the leader view catches up with the current
target before unsubscribing the old delegator, which ensures that the
new delegator can serve search before the old delegator is released.
But another rule in segment_checker skips loading segments during
balance channel. So if a query node crashes during balance channel, the
new delegator can never catch up with the target, and the balance is
stuck forever.

This PR removes the rule that skips loading segments during balance
channel, to avoid this logical deadlock.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-13 15:05:04 +08:00
congqixia 5b51c20293
fix: Use `Remove` sync type for distribution removal (#31215)
See also #31214

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-13 06:11:04 +08:00
wei liu 06df9b8462
fix: Balance segment/channel won't be trigger on multi replicas (#31107)
issue: #30983 #30982

The balancer called the wrong interface to get the segment/channel list
in a replica and so computed a wrong average segment/channel number. As
a result, every node had fewer segments/channels than the average, and
balance was never triggered in the multi-replica case.

This PR fixes balance segment/channel not being triggered with multiple
replicas.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-11 20:35:04 +08:00
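
The arithmetic behind #31107 above can be illustrated with a toy balancer check. The commit does not say which interface was wrong; this sketch assumes the segment total was taken over the whole collection instead of a single replica, and `overloadedNodes` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// overloadedNodes returns, sorted by ID, the nodes holding more segments
// than the average implied by totalSegments over the given nodes.
func overloadedNodes(segmentsPerNode map[int64]int, totalSegments int) []int64 {
	if len(segmentsPerNode) == 0 {
		return nil
	}
	avg := totalSegments / len(segmentsPerNode)
	var out []int64
	for node, count := range segmentsPerNode {
		if count > avg {
			out = append(out, node)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func main() {
	// One replica with two nodes holding 6 and 2 segments.
	replicaA := map[int64]int{1: 6, 2: 2}

	// Correct total for this replica: 8, average 4, so node 1 is overloaded.
	fmt.Println(overloadedNodes(replicaA, 8))

	// Inflated total (for example, counting a second replica's 8 segments
	// too): average 8, so no node looks overloaded and nothing is balanced.
	fmt.Println(overloadedNodes(replicaA, 16))
}
```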
wei liu ddd918ba04
enhance: change frequency log to rated level (#31084)
This PR changes the frequent check-shard-leader log to rated level

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-08 16:39:02 +08:00
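
Rate-limited ("rated") logging as mentioned in #31084 above emits a message at most once per interval and drops the rest. The sketch below is a generic illustration with a hypothetical `RatedLogger`; it is not the logging API Milvus uses, and it is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"time"
)

// RatedLogger emits at most one message per interval.
type RatedLogger struct {
	interval time.Duration
	last     time.Time
	started  bool
	emit     func(string)
}

// Log emits msg if at least interval has passed since the last emitted
// message and reports whether it was emitted. The caller passes the
// current time so the behavior is easy to test.
func (l *RatedLogger) Log(now time.Time, msg string) bool {
	if l.started && now.Sub(l.last) < l.interval {
		return false
	}
	l.started = true
	l.last = now
	l.emit(msg)
	return true
}

// simulate issues calls log attempts spaced stepMs apart against a logger
// limited to one message per intervalMs, and returns how many were emitted.
func simulate(calls, stepMs, intervalMs int) int {
	emitted := 0
	l := &RatedLogger{
		interval: time.Duration(intervalMs) * time.Millisecond,
		emit:     func(string) { emitted++ },
	}
	now := time.Unix(0, 0)
	for i := 0; i < calls; i++ {
		l.Log(now, "check shard leader")
		now = now.Add(time.Duration(stepMs) * time.Millisecond)
	}
	return emitted
}

func main() {
	// Ten attempts one second apart, limited to one message per five seconds.
	fmt.Println(simulate(10, 1000, 5000)) // 2
}
```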
wei liu efe8cecc88
enhance: refactor segment dist manager interface (#31073)
issue: #31091
This PR adds a `GetByFilter` interface to the segment dist manager,
replacing the various specialized get functions.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-08 16:29:01 +08:00
wei liu 22df5061c1
fix: Leader checker can't update segment's load version (#31040)
issue: #30890

When the leader checker finds that the leader view has an older load
version of a segment, it tries to correct the leader view. But the sync
action doesn't specify the latest load version, so the update operation
fails.

This PR fixes the leader checker being unable to update a segment's
load version and repeatedly submitting the same task to the scheduler.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-08 11:57:01 +08:00
congqixia c886aa29ff
enhance: Use `ListIndexes` instead of `DescribeIndex` for qc broker (#31122)
See also #31103

Since querycoord needs only index meta information from datacoord, the
broker should use `ListIndexes` to skip the segment index building check
logic in datacoord.

This PR is also related to #30538, in which DescribeIndex caused heavy
memory usage and eventually led to OOM.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-07 21:43:03 +08:00
wei liu 2a047103d6
fix: Dirty sealed segment won't release after channel balance (#31095)
issue: #31074
This PR fixes dirty sealed segments not being released after channel
balance. A dirty sealed segment is one that doesn't exist in the targets.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-07 16:23:01 +08:00
Bingyi Sun e3cce11dd9
fix: data race in querynode task test (#31019)
issue: https://github.com/milvus-io/milvus/issues/31022

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-03-05 16:26:59 +08:00
Bingyi Sun 7783098ddd
feat: support lazy load on querycoord (#30372)
https://github.com/milvus-io/milvus/issues/30361

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-03-01 18:15:29 +08:00
SimFG ee8d6f236c
enhance: make the watch dm channel request better compatibility (#30952)
issue: #30938

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-03-01 16:07:37 +08:00
chyezh 0c7474d7e8
enhance: add graceful stop timeout to avoid node stop hang under extreme cases (#30317)
1. set the coordinator graceful stop timeout to 5s
2. change the stop order of datacoord components
3. change the querynode graceful stop timeout to 900s; we should
potentially change this to 600s once graceful stop is smooth

issue: #30310
also see pr: #30306

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-02-29 17:01:50 +08:00
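
A graceful stop bounded by a timeout, as in #30317 above, can be sketched by racing the stop routine against a timer. `stopWithTimeout` and `demo` are hypothetical helpers, and the durations used here are far shorter than the 5s/900s values in the commit so the example runs quickly.

```go
package main

import (
	"fmt"
	"time"
)

// stopWithTimeout runs stop in the background and waits for it to finish.
// It returns true if stop completed within timeout and false if the
// caller should give up waiting (stop keeps running in its goroutine).
func stopWithTimeout(stop func(), timeout time.Duration) bool {
	done := make(chan struct{})
	go func() {
		defer close(done)
		stop()
	}()

	timer := time.NewTimer(timeout)
	defer timer.Stop()

	select {
	case <-done:
		return true
	case <-timer.C:
		return false
	}
}

// demo exercises both outcomes: a stop that returns at once, and a stop
// that hangs until it is released after the timeout has fired.
func demo() (fast, hung bool) {
	fast = stopWithTimeout(func() {}, time.Second)

	release := make(chan struct{})
	hung = stopWithTimeout(func() { <-release }, 20*time.Millisecond)
	close(release) // let the hanging goroutine exit
	return fast, hung
}

func main() {
	fast, hung := demo()
	fmt.Println(fast, hung)
}
```

Note that a timed-out stop routine is abandoned, not killed; the process can only reclaim it by exiting, which is usually what follows a failed graceful stop.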
wei liu 545e8de401
fix: promote leader task failed when segment only exist on current target (#30794)
issue: #30150

`checkLeaderTaskStale` checks whether the segment exists in the next
target for a leaderTask's grow action, which causes leader task
promotion to fail when the segment exists only in the current target.

This PR checks the segment against both the current and next targets.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-02-28 13:14:59 +08:00
Bingyi Sun ece9d273a7
enhance: some patches for #30636 (#30664)
Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-02-26 11:42:55 +08:00
wei liu befe0e21fd
fix: Set indexInfo when try to set segment to leader view (#30758)
issue: #30150
see also: #30258

`SyncDataDistribution` tries to load delta for the segment. If indexInfo
is missing from the request, the sync action fails due to the lack of
index info.

This PR sets indexInfo when setting a segment to the leader view.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-02-26 11:02:55 +08:00
wei liu 6dd7297178
fix: Skip generate balance task when target not ready (#30724)
issue: #30723

This PR skips generating balance tasks when the collection's target
isn't ready. It also refines the stale-check logic in query coord's
scheduler: if the channel exists in the current or next target, the
task won't be canceled.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-02-23 10:32:53 +08:00
congqixia 7b91fa3db8
fix: Make leader checker generate leader task instead of segment task (#30258)
See also #30150

For leader view distribution with offline nodes, a release task can
never be sent to querynode due to the targetNode online check logic.
Even if the request is dispatched, a normal release task does not have
the "force" flag when calling `delegator.ReleaseSegment`.

This PR adds a new type of querycoord task, LeaderTask, whose
responsibility is to rectify the leader view distribution.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-02-21 11:08:51 +08:00