
156 Commits (3948bd4e79443b8dc67cf9ca90ba057ad012568b)

Author SHA1 Message Date
wei liu c0200eec39
enhance: limit getSegmentInfo batch size to avoid exceeding the grpc message limit (#35394)
issue: #35395

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-08-15 19:17:00 +08:00
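The batching idea behind #35394 can be illustrated with a minimal Go sketch. It assumes the fix amounts to cutting a long list of segment IDs into fixed-size chunks and issuing one request per chunk; `splitIntoBatches` and the batch size of 3 are hypothetical names and values, not taken from the Milvus code.

```go
package main

import "fmt"

// splitIntoBatches is a hypothetical helper (not from the Milvus code base).
// It cuts ids into consecutive slices of at most batchSize elements so that
// each request carries a bounded payload.
func splitIntoBatches(ids []int64, batchSize int) [][]int64 {
	if batchSize <= 0 {
		return nil
	}
	batches := make([][]int64, 0, (len(ids)+batchSize-1)/batchSize)
	for start := 0; start < len(ids); start += batchSize {
		end := start + batchSize
		if end > len(ids) {
			end = len(ids)
		}
		batches = append(batches, ids[start:end])
	}
	return batches
}

func main() {
	ids := []int64{1, 2, 3, 4, 5, 6, 7}
	// One (imaginary) getSegmentInfo request would be sent per batch.
	for i, batch := range splitIntoBatches(ids, 3) {
		fmt.Printf("request %d: %v\n", i, batch)
	}
}
```

A real batch size would be chosen from the configured message limit and the expected size of one segment info entry.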
wei liu f6aaf3fef2
fix: force update next target if target can't be loaded (#35365)
issue: #35361

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-08-15 19:15:00 +08:00
jaime fcec4c21b9
fix: collection health (queryable) check fails for a releasing collection (#34947)
issue: #34946

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-08-02 17:20:15 +08:00
wei liu 8123bea1ae
enhance: Avoid assigning too many segments/channels to new querynode (#34096)
issue: #34095

When a new query node comes online, the segment_checker,
channel_checker, and balance_checker simultaneously attempt to allocate
segments to it. If this occurs during the execution of a load task and
the distribution of the new query node hasn't been updated, the query
coordinator may mistakenly view the new query node as empty. As a
result, it assigns segments or channels to it, potentially overloading
the new query node with more segments or channels than expected.

This PR measures the workload of the executing tasks on the target query
node to prevent assigning an excessive number of segments to it.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-06-27 19:06:05 +08:00
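The workload accounting described in #34096 can be sketched as follows. This is an illustration under stated assumptions, not the Milvus implementation: `nodeLoad`, `effectiveLoad`, and `pickLeastLoaded` are hypothetical names, and load is simplified to a plain segment count.

```go
package main

import "fmt"

// nodeLoad is a hypothetical view of one query node.
type nodeLoad struct {
	nodeID          int64
	loadedSegments  int // segments already reported in the node's distribution
	pendingSegments int // segments carried by tasks still executing on the node
}

// effectiveLoad counts in-flight work as well, so a freshly started node
// that already has tasks queued is not mistaken for an empty one.
func (n nodeLoad) effectiveLoad() int {
	return n.loadedSegments + n.pendingSegments
}

// pickLeastLoaded returns the node with the smallest effective load.
// The boolean is false when no node is available.
func pickLeastLoaded(nodes []nodeLoad) (nodeLoad, bool) {
	if len(nodes) == 0 {
		return nodeLoad{}, false
	}
	best := nodes[0]
	for _, n := range nodes[1:] {
		if n.effectiveLoad() < best.effectiveLoad() {
			best = n
		}
	}
	return best, true
}

func main() {
	nodes := []nodeLoad{
		{nodeID: 1, loadedSegments: 10},
		// New node: distribution not updated yet, but 12 segments in flight.
		{nodeID: 2, loadedSegments: 0, pendingSegments: 12},
	}
	best, _ := pickLeastLoaded(nodes)
	fmt.Println("assign next segment to node", best.nodeID)
}
```

Looking only at `loadedSegments` would pick node 2 here; including the executing tasks picks node 1 instead.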
jaime 9630974fbb
enhance: move rocksmq from internal to pkg module (#33881)
issue: #33956

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-06-25 21:18:15 +08:00
Chun Han ca7ef26e4b
fix: sync part stats task cannot be finished(#30376) (#34027)
related: #30376
also: refine log output for query_coord task by rephrasing action string

Signed-off-by: MrPresent-Han <chun.han@gmail.com>
Co-authored-by: MrPresent-Han <chun.han@gmail.com>
2024-06-24 10:16:02 +08:00
wei liu 02945959d9
enhance: Avoid iterating the whole segment list for each task's process (#33943)
When querycoord processes a segment task, it iterates the whole
segment list to check whether the segment is loaded, which costs too much
CPU when there are thousands of segments.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-06-19 10:19:58 +08:00
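One common way to avoid a full scan per task, as #33943 aims to, is to build a lookup index once and query it per task. The sketch below assumes a simple set keyed by segment ID; `loadedSet` and its methods are hypothetical and do not mirror the actual querycoord data structures.

```go
package main

import "fmt"

// loadedSet is a hypothetical index of loaded segment IDs.
type loadedSet map[int64]struct{}

// newLoadedSet builds the index once, in time linear in the segment count.
func newLoadedSet(segmentIDs []int64) loadedSet {
	set := make(loadedSet, len(segmentIDs))
	for _, id := range segmentIDs {
		set[id] = struct{}{}
	}
	return set
}

// contains answers "is this segment loaded?" with a single map lookup
// instead of a scan over the whole segment list.
func (s loadedSet) contains(id int64) bool {
	_, ok := s[id]
	return ok
}

func main() {
	loaded := newLoadedSet([]int64{101, 102, 103})
	for _, taskSegment := range []int64{102, 999} {
		fmt.Printf("segment %d loaded: %v\n", taskSegment, loaded.contains(taskSegment))
	}
}
```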
Chun Han f7af323d1e
fix: sync partition stats blocking balance task(#33741) (#33742)
related: #33741

Signed-off-by: MrPresent-Han <chun.han@zilliz.com>
2024-06-11 14:21:56 +08:00
wayblink a1232fafda
feat: Major compaction (#33620)
#30633

Signed-off-by: wayblink <anyang.wang@zilliz.com>
Co-authored-by: MrPresent-Han <chun.han@zilliz.com>
2024-06-10 21:34:08 +08:00
SimFG ecee7d90d4
enhance: try to speed up the loading of small collections (#33570)
- issue: #33569

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-06-07 08:25:53 +08:00
jaime 0d99db23b8
fix: metrics leak on the coord nodes (#33075)
issue: #32980

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-05-20 22:03:39 +08:00
congqixia 72c172a7d7
enhance: Remove duplicated collectionID label for task latency (#32308)
`CollectionID` already exists in channel name, so remove it to save
metrics traffic.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-16 18:55:19 +08:00
wei liu 68dec7dcd4
fix: Use correct ts to avoid exclude segment list leak (#31991)
issue: #31990

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-12 10:39:19 +08:00
wei liu c4806b69c4
enhance: Refactor leader view manager interface (#31133)
issue: #31091
This PR adds a GetByFilter interface to the leader view manager,
replacing the various get functions

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-10 15:13:36 +08:00
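The filter-based lookup introduced by #31133 follows a familiar Go pattern: one query function that accepts composable predicates. The sketch below is an assumption about the general shape only; `leaderView`, `leaderViewFilter`, `withNodeID`, `withCollectionID`, and `getByFilter` are hypothetical names, and the real manager's types and signatures may differ.

```go
package main

import "fmt"

// leaderView is a simplified, hypothetical stand-in for a leader view entry.
type leaderView struct {
	nodeID       int64
	collectionID int64
	channel      string
}

// leaderViewFilter is a predicate over a single view.
type leaderViewFilter func(view leaderView) bool

// withNodeID keeps views hosted on the given node.
func withNodeID(nodeID int64) leaderViewFilter {
	return func(view leaderView) bool { return view.nodeID == nodeID }
}

// withCollectionID keeps views that belong to the given collection.
func withCollectionID(collectionID int64) leaderViewFilter {
	return func(view leaderView) bool { return view.collectionID == collectionID }
}

// getByFilter returns the views that satisfy every filter, replacing a
// family of one-off getters such as "get by node" or "get by collection".
func getByFilter(views []leaderView, filters ...leaderViewFilter) []leaderView {
	var matched []leaderView
	for _, view := range views {
		keep := true
		for _, filter := range filters {
			if !filter(view) {
				keep = false
				break
			}
		}
		if keep {
			matched = append(matched, view)
		}
	}
	return matched
}

func main() {
	views := []leaderView{
		{nodeID: 1, collectionID: 100, channel: "ch-a"},
		{nodeID: 1, collectionID: 200, channel: "ch-b"},
		{nodeID: 2, collectionID: 100, channel: "ch-c"},
	}
	for _, view := range getByFilter(views, withNodeID(1), withCollectionID(100)) {
		fmt.Println(view.channel)
	}
}
```

New query dimensions then need only a new filter constructor rather than another getter on the manager.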
wei liu 177ddda47f
fix: Check stale should check leader task's leader id (#31962)
issue: #30816

Stale-check rules for leader tasks:
1. A reduce leader task should keep executing until the leader's node
goes offline.
2. A grow leader task should keep executing until the leader's node
becomes stopping.

This PR checks the leader node's stopping state for grow leader tasks

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-09 15:33:25 +08:00
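The two stale-check rules from #31962 can be written down as a small decision function. This is a sketch with hypothetical types (`nodeState`, `leaderAction`, `isLeaderTaskStale`); it additionally assumes that an offline leader node also makes a grow task stale, which the commit message does not state explicitly.

```go
package main

import "fmt"

// nodeState is a hypothetical lifecycle state of the leader's node.
type nodeState int

const (
	nodeOnline nodeState = iota
	nodeStopping
	nodeOffline
)

// leaderAction is a hypothetical kind of leader task action.
type leaderAction int

const (
	actionGrow leaderAction = iota
	actionReduce
)

// isLeaderTaskStale reports whether a leader task should stop executing.
func isLeaderTaskStale(action leaderAction, leader nodeState) bool {
	switch action {
	case actionReduce:
		// Rule 1: keep executing until the leader's node goes offline.
		return leader == nodeOffline
	case actionGrow:
		// Rule 2: keep executing until the leader's node becomes stopping.
		// Treating offline as stale too is an assumption of this sketch.
		return leader == nodeStopping || leader == nodeOffline
	default:
		return true
	}
}

func main() {
	fmt.Println("grow on stopping leader stale:", isLeaderTaskStale(actionGrow, nodeStopping))
	fmt.Println("reduce on stopping leader stale:", isLeaderTaskStale(actionReduce, nodeStopping))
}
```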
wei liu 7471a8005f
fix: querycoord panic after node down (#31831)
issue: #30519

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-03 10:03:22 +08:00
congqixia 16d869c57e
enhance: Add EmbedEtcd testutil and remove etcd dep of task pkg (#31802)
See also #20478

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-02 09:59:14 +08:00
wei liu 5d752498e7
fix: Skip release duplicate l0 segment (#31540)
issue: #31480 #31481

A task that releases a duplicate l0 segment may cause a missing segment
if it executes on the old delegator, and may break the new delegator's
leader view if it executes on the new delegator.

This PR makes segment_checker skip releasing duplicate l0 segments,
because the l0 segment will be released when the channel is unsubscribed

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-27 12:53:10 +08:00
wei liu 6438d65459
fix: Grow task stuck at stopping node (#31487)
issue: #30816
This PR fixes grow tasks getting stuck at a stopping node

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-25 18:57:07 +08:00
wei liu 03eaa5d478
fix: Load segment task promote failed (#31430)
issue: #30816

PR #31319 introduced logic that requires the segment checker to load
level zero segments which exist only in the current target.

This PR fixes the load segment task promotion failure when a segment
belongs only to the current target

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-21 18:09:07 +08:00
chyezh 9f9ef8ac32
enhance: transfer resource group and dbname to querynode when load (#30936)
issue: #30931

Signed-off-by: chyezh <chyezh@outlook.com>
2024-03-21 11:59:12 +08:00
wei liu 7c7375031d
enhance: Add metrics for task latency in querycoord scheduler (#31405)
This PR adds metrics for task latency in the querycoord scheduler, so a
stuck task of any kind is easy to spot from the metrics

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-20 19:29:06 +08:00
wei liu c26c1b33c2
fix: Transfer l0 segment to new delegator after balance (#31319)
issue: #30186

During channel balance, after the new delegator is loaded, instead of
syncing the l0 segment's location to the new delegator, we should load
the l0 segment on the new delegator, release the old l0 segment, and
then start to release the old delegator.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-19 09:59:05 +08:00
chyezh ff4237bb90
enhance: add hostname into node info (#30673)
issue: https://github.com/milvus-io/milvus/issues/30647

- Addresses may be reused in a k8s environment, so the hostname
identifies a node more reliably.

Signed-off-by: chyezh <chyezh@outlook.com>
2024-03-15 10:45:06 +08:00
congqixia 773c64ecbb
fix: Set nodeID when remove distribution (#31259)
See also #30930

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-14 15:09:03 +08:00
congqixia 5b51c20293
fix: Use `Remove` sync type for distribution removal (#31215)
See also #31214

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-13 06:11:04 +08:00
wei liu efe8cecc88
enhance: refactor segment dist manager interface (#31073)
issue: #31091
This PR adds a `GetByFilter` interface to the segment dist manager,
replacing the various get functions

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-08 16:29:01 +08:00
wei liu 22df5061c1
fix: Leader checker can't update segment's load version (#31040)
issue: #30890

When the leader checker finds that the leader view has an older load
version of a segment, it tries to correct the leader view. But the sync
action doesn't specify the latest load version, so the update operation
fails.

This PR fixes the leader checker failing to update a segment's load
version and repeatedly generating the same task for the scheduler.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-03-08 11:57:01 +08:00
congqixia c886aa29ff
enhance: Use `ListIndexes` instead of `DescribeIndex` for qc broker (#31122)
See also #31103

Since querycoord needs only index meta information from datacoord, the
broker shall use `ListIndexes` to skip the segment index building check
logic in datacoord

This PR is also related to #30538, in which DescribeIndex caused heavy
memory usage and eventually led to OOM

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-03-07 21:43:03 +08:00
Bingyi Sun e3cce11dd9
fix: data race in querynode task test (#31019)
issue: https://github.com/milvus-io/milvus/issues/31022

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-03-05 16:26:59 +08:00
Bingyi Sun 7783098ddd
feat: support lazy load on querycoord (#30372)
https://github.com/milvus-io/milvus/issues/30361

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-03-01 18:15:29 +08:00
SimFG ee8d6f236c
enhance: improve compatibility of the watch dm channel request (#30952)
issue: #30938

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-03-01 16:07:37 +08:00
wei liu 545e8de401
fix: promote leader task failed when segment only exist on current target (#30794)
issue: #30150

`checkLeaderTaskStale` checks whether the segment exists in the next
target for a leaderTask's growing action, which causes promoting the
leader task to fail when the segment exists only in the current target

This PR checks the segment against both the current and the next target.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-02-28 13:14:59 +08:00
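The check described in #30794 boils down to accepting a segment that appears in either target. A minimal sketch, assuming each target can be modeled as a set of segment IDs; `target` and `existsInEitherTarget` are hypothetical names, not the querycoord API.

```go
package main

import "fmt"

// target is a hypothetical set of segment IDs belonging to one target.
type target map[int64]struct{}

// existsInEitherTarget reports whether the segment is part of the current
// target or the next target. Checking only one of the two would wrongly
// mark tasks stale for segments that exist only in the other.
func existsInEitherTarget(segmentID int64, current, next target) bool {
	if _, ok := current[segmentID]; ok {
		return true
	}
	_, ok := next[segmentID]
	return ok
}

func main() {
	current := target{1: {}, 2: {}}
	next := target{2: {}, 3: {}}
	for _, id := range []int64{1, 3, 4} {
		fmt.Printf("segment %d in a target: %v\n", id, existsInEitherTarget(id, current, next))
	}
}
```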
wei liu befe0e21fd
fix: Set indexInfo when try to set segment to leader view (#30758)
issue: #30150
see also: #30258

Because `SyncDataDistribution` will try to load delta for the segment, if
indexInfo is missing from the request, the sync action fails due to the
lack of index info.

This PR sets indexInfo when setting a segment to the leader view

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-02-26 11:02:55 +08:00
wei liu 6dd7297178
fix: Skip generate balance task when target not ready (#30724)
issue: #30723

This PR skips generating balance tasks when the collection's target isn't
ready. It also refines the stale-check logic in query coord's scheduler:
if the channel exists in the current or next target, the task won't be
canceled.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-02-23 10:32:53 +08:00
congqixia 7b91fa3db8
fix: Make leader checker generate leader task instead of segment task (#30258)
See also #30150

For leader view distribution with offline nodes, a release task can
never be sent to the querynode due to the targetNode online check logic.
Even if the request is dispatched, a normal release task does not have
the "force" flag when calling `delegator.ReleaseSegment`.

This PR adds a new type of querycoord task, LeaderTask, whose
responsibility is to rectify the leader view distribution.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-02-21 11:08:51 +08:00
wei liu f69f65ff68
fix: Leader checker can't remove segment from leader view (#30151)
issue: #30150

This PR fixes the following problems:
1. The leader checker uses the wrong node id when generating a release
task, which causes the release task to finish immediately.
2. The release request generated by leader_checker doesn't set the
`force` flag, so the operation to clean the leader view on the delegator
fails.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-01-20 18:58:58 +08:00
wei liu 57bd3e2181
fix: Leader checker cannot submit load task (#30067)
issue: #29841
If a segment is already loaded, submitting a load segment task for it
isn't permitted, to avoid loading the segment twice. But this logic
blocks the leader checker from correcting the leader view via
`LoadSegment`

This PR removes the segment loaded check so that the leader checker can
submit load tasks

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-01-17 19:12:54 +08:00
yah01 c68c128e47
fix: level 0 segments not loaded (#29908)
Recent changes moved the level 0 segment list to a new proto field, so
QueryCoord could no longer see the level 0 segments. This change handles
the new field.
fix #29907

Signed-off-by: yah01 <yang.cen@zilliz.com>
2024-01-16 14:40:53 +08:00
congqixia c4ddfff2a7
enhance: make Load process traceable in querycoord (#29806)
See also #29803

This PR:
- Add trace span for collection/partition load
- Use TraceSpan to generate Segment/ChannelTasks when loading
- Refine BaseTask trace tag usage

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-01-10 09:58:49 +08:00
xige-16 9702cef2b5
feat: Support multiple vector search (#29433)
issue #25639 

Signed-off-by: xige-16 <xi.ge@zilliz.com>
2024-01-08 15:34:48 +08:00
wei liu e98c62abbb
enhance: refactor leader_observer to leader_checker (#29454)
issue: #29453 

Syncing distribution by rpc also calls loadSegment/releaseSegment,
which may cause all kinds of concurrency cases on the same segment, such
as a concurrent load and release on one segment.
This PR adds leader_checker, which generates load/release tasks to
correct the leader view, instead of calling sync distribution by rpc

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-01-05 15:54:55 +08:00
congqixia aa967de0a8
enhance: Explicitly pass LevelZero segment ids in vchan info (#29612)
See also #27675

For `GetRecoveryInfo` & `GetRecoveryInfoV2`, Level zero segment ids
shall be specified in vchan info so that querycoord could re-fetch
current segment info during watch procedure without having all segment
info

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-01-04 16:46:45 +08:00
congqixia a3cb8e2625
fix: Add atomic method to get collection target (#29577)
Related to #29575

Add `getCollectionTarget` method which is atomic when scope is
`CurrentTargetFirst` or `NextTargetFirst`
Also return error when executor finds no channel in target manager

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-12-29 09:04:46 +08:00
congqixia aa279db44c
enhance: remove flushed segmentInfo in WatchChannelRequest (#29526)
`WatchDmChannel` only needs growing segment info, so this PR stops
fetching segmentInfos when filling the watch dml channel request.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2023-12-28 00:40:47 +08:00
yah01 1d6bcd1ded
enhance: speed up loading with many deletions (#29455)
The executor always fetches the latest segment info, so we can consume
from the latest checkpoint, which saves much time when many entities
have been deleted

Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-12-25 20:58:45 +08:00
yah01 a0e1a1eb31
feat: support enable/disable mmap for index (#29005)
Support enabling/disabling mmap for an index; the user can alter the
index's mode with the `AlterIndex` method
related: https://github.com/milvus-io/milvus/issues/21866

---------

Signed-off-by: yah01 <yah2er0ne@outlook.com>
Signed-off-by: yah01 <yang.cen@zilliz.com>
2023-12-21 18:07:24 +08:00
yah01 0a87724f18
enhance: remove merger for load segments (#29062)
Remove the merger, as QueryNode can now load segments concurrently
fix #29063

Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-12-12 10:48:37 +08:00
yah01 fab52d167b
fix: may miss stream delta while loading (#28871)
While loading a segment, we consume the delta data from the latest
channel checkpoint. This works well without level 0 segments, but it may
now miss some delta data, so we have to consume from the current
target's channel checkpoint instead.

related: #27349

---------

Signed-off-by: yah01 <yah2er0ne@outlook.com>
2023-12-05 17:34:45 +08:00
wei liu 043ac87be0
fix: Balance channel may cause channel not availble error (#28829)
issue: #28831
Releasing the old delegator before the new delegator updates its
distribution may cause a `channel not availble` error

This PR blocks releasing the old delegator until the new delegator
finishes `syncDistribution`

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2023-12-01 10:08:34 +08:00