Commit Graph

193 Commits (eb046863485fdf3e130fc60484485c901b81276b)

Author SHA1 Message Date
wei liu 69b8b89369
enhance: Remove QueryCoord's scheduling of L0 segments (#39552)
issue: #39551
This PR remove querycoord's scheduling of l0 segments:
  - only load l0 segment when watch channel
- only release l0 segment when release channel or sync data distribution

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-26 21:38:00 +08:00
congqixia cb7f2fa6fd
enhance: Use v2 package name for pkg module (#39990)
Related to #39095

https://go.dev/doc/modules/version-numbers

Update pkg version according to golang dep version convention

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-22 23:15:58 +08:00
yihao.dai 2a037a97f1
enhance: Add get vector latency metric and refine request limit error message (#40083)
issue: https://github.com/milvus-io/milvus/issues/40078

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-02-21 19:41:55 +08:00
wei liu 7d2c948c69
fix: task delta cache leak on reduce task (#40055)
issue: #40052

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-21 16:47:54 +08:00
wei liu c12c4b4fff
fix: [skip e2e] pr conflict cause ut failed (#39811)
Related to https://github.com/milvus-io/milvus/pull/39701 &
https://github.com/milvus-io/milvus/issues/39681

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-12 11:44:51 +08:00
congqixia 7b51e4839f
fix: Resolve conflict on qc task test (#39796)
Related to #39701 & #39681

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-11 18:40:45 +08:00
wei liu ff5c680c99
fix: load collection stucks if compaction/gc happens (#39701)
issue: #39680
if compaction/gc happens, load collection may stuck due to
SegmentNotFound, we should trigger UpdateNextTarget to get a new data
view to execute loading operation.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-11 15:48:50 +08:00
wei liu 85c9f92ff4
fix: uneven distribution caused by executing task delta cache leak (#39702)
issue: #39681 

this PR maintain workload effect in action instead of computing workload
effect from target, which may cause leak if target changes.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-11 14:30:46 +08:00
jaime 8a4ac8cccd
enhance: expose more metrics data (#39456)
issue: #36621 #39417
1. Adjust the server-side cache size.
2. Add source information for configurations.
3. Add node ID for compaction and indexing tasks.
4. Resolve localhost access issues to fix health check failures for
etcd.

Signed-off-by: jaime <yun.zhang@zilliz.com>
2025-02-07 11:50:50 +08:00
yihao.dai 5fb597b37b
fix: Remove frequently updating metric to avoid mutex contention (#38775)
issue: https://github.com/milvus-io/milvus/issues/37630

Reduce the frequency updating metrics to avoid holding the mutex for
long periods.

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-24 10:31:07 +08:00
yihao.dai e0b26260f2
enhance: enable task delta cache (#39307)
When there are many segment tasks in the querycoord scheduler, the
traversal in `GetSegmentTaskDelta` checks becomes time-consuming. This
PR adds caching for segment deltas.

issue: https://github.com/milvus-io/milvus/issues/37630

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Co-authored-by: Wei Liu <wei.liu@zilliz.com>
2025-01-23 14:31:16 +08:00
yihao.dai 657550cf06
fix: Fix slow dist handle and slow observe (#38566)
1. Provide partition&channel level indexing in the collection target.
2. Make `SegmentAction` not wait for distribution.
3. Remove scheduler and target manager mutex.
4. Optimize logging to reduce CPU overhead.

issue: https://github.com/milvus-io/milvus/issues/37630

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-15 20:17:00 +08:00
Zhen Ye bb8d1ab3bf
enhance: make new go package to manage proto (#39114)
issue: #39095

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-01-10 10:49:01 +08:00
wei liu 9c3f59dbbe
fix: Prevent balancer from overloading the same QueryNode (#38719)
issue: #38718
The balancer calculates the workload of executing tasks as an ongoing
score for target nodes. However, a logic issue arises when
GetSegmentTaskDelta or GetChannelTaskDelta is called with
collectionID=-1, which incorrectly returns zero.

Due to the incorrect global score, the executing task's workload is not
properly reflected for each collection. Consequently, each collection
submits its own balance task, leading to the balancer assigning
excessive tasks to the same QueryNode.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-25 16:36:49 +08:00
SimFG 2afe2eaf3e
feat: support to replicate collection when the services contains the system tt msg (#37559)
- issue: #37105

---------

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-12-17 09:08:46 +08:00
tinswzy 27229f7907
enhance: refine exists log print with ctx (#38080)
issue: #35917 
Refines exists log print with ctx

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
2024-12-14 22:36:44 +08:00
wei liu 856e2aad7d
fix: Leader task stuck and retry again and again (#38202)
issue: #38201
leader task require to update delegator's distribution, and only success
after the distribution change has been applyed to delegator. but the
delegator will reject the distribution change if it's version is older
than current version in delegator. which cause the leader task stuck and
retry forever.

this PR remove the leader task finish check.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-10 19:16:42 +08:00
wei liu f04986fceb
enhance: Remove constraint on release segment task (#38297)
issue: #38305
after we disable balance segment and balance channel happens at same
time, the constriant which require release segment must happens on
serviceable shard leader is unnessary.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-10 11:18:49 +08:00
congqixia 051bc280dd
enhance: Make dynamic load/release partition follow targets (#38059)
Related to #37849

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-05 16:24:40 +08:00
tinswzy 7944538ade
enhance: Add ctx param to KV operation interfaces (#38154)
issue: #35917 
Refine KV operation interfaces by adding a ctx param

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
2024-12-05 15:16:41 +08:00
tinswzy e76802f910
enhance: refine querycoord meta/catalog related interfaces to ensure that each method includes a ctx parameter (#37916)
issue: #35917 
This PR refine the querycoord meta related interfaces to ensure that
each method includes a ctx parameter.

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
2024-11-25 11:14:34 +08:00
wei liu 0a440e0d38
fix: Prevent simultaneous balance of segments and channels (#37850)
issue: #33550
balance segment and balance segment execute at same time, which will
cause bounch of corner case.

This PR disable simultaneous balance of segments and channels

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-21 17:56:55 +08:00
yihao.dai b6612e02b4
enhance: Reduce GetIndexInfos calls (#37695)
Batch `GetIndexInfos` calls for segments to reduce RPC calls.

issue: https://github.com/milvus-io/milvus/issues/37634

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-19 14:24:31 +08:00
congqixia 6d86b9022e
enhance: Provide secondary index critria when filter leaderview (#37777)
Related to #37630

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-19 10:12:30 +08:00
congqixia b0bd290a6e
enhance: Use internal json(sonic) to replace std json lib (#37708)
Related to #35020

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-18 10:46:31 +08:00
jaime 1d06d4324b
fix: Int64 overflow in JSON encoding (#37657)
issue: ##36621

- For simple types in a struct, add "string" to the JSON tag for
automatic string conversion during JSON encoding.
- For complex types in a struct, replace "int64" with "string."

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-11-14 22:52:30 +08:00
jaime 1e8ea4a7e7
feat: add segment/channel/task/slow query render (#37561)
issue: #36621

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-11-12 17:44:29 +08:00
sthuang 70605cf5b3
enhance: Support custom privilege group for RBAC (#37087)
issue: #37031

---------

Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
2024-11-09 08:44:28 +08:00
wei liu 8714774305
fix: search/query failed due to segment not loaded (#37403)
issue: #36970
cause release segment and balance channel may happen at same time, and
before new delegator become serviceable, if release segment exeuctes on
new delegator, and search/query comes on old delegator, then release
segment and query segment happens in parallel, if release segment
execute first in worker, then search/query will got a SegmentNodeLoaded
error.

This PR add serviceable filter on delegator, then all load/release
segment operation will happens on serviceable delegator.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-06 15:10:25 +08:00
jaime 9d16b972ea
feat: add tasks page into management WebUI (#37002)
issue: #36621

1. Add API to access task runtime metrics, including:
  - build index task
  - compaction task
  - import task
- balance (including load/release of segments/channels and some leader
tasks on querycoord)
  - sync task
2. Add a debug model to the webpage by using debug=true or debug=false
in the URL query parameters to enable or disable debug mode.

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-10-28 10:13:29 +08:00
Bingyi Sun 6851738fd1
fix: fix `make generate-mockery` panic with go1.22 (#36830)
https://github.com/milvus-io/milvus/issues/36831
Fix `make generate-mockery` panic.

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-10-17 12:11:31 +08:00
wei liu 5dfa1c3397
fix: Segment unbalance after many times load/release (#36537)
issue: #36536
query coord use `segmentTaskDeleta/channelTaskDelta` to measure the
executing workload for querynode in scheduler, and we maintains the
`segmentTaskDeleta/channelTaskDelta` by `scheulder.Add(task)` and
`scheduler.remove(task)`, but `scheduler.remove(task)` has been called
in unexpected way, which cause a wrong
`segmentTaskDeleta/channelTaskDelta` value and affect the segment assign
logic, causes segment unbalance.

This PR moves to compute the `segmentTaskDeleta/channelTaskDelta` when
access, to avoid the wrong value affect.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-26 15:13:15 +08:00
wei liu 3cd0b26285
enhance: Enable dynamic update loaded collection's replica (#35822)
issue: #35821
After collection loaded, if we need to increase/decrease collection's
replica, we need to release and load it again.

milvus offers 4 solution to update loaded collection's replica, this PR
aims to dynamic change the replica number without release, and after
replica number changed, milvus will execute load replica or release
replica in async, and the replica loaded status can be checked by
getReplicas API.

Notice that if set too much replicas than querynode can afford,the new
replica won't be loaded successfully until enough querynode joins.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-25 10:13:18 +08:00
wei liu 3bd7ec8751
fix: Fix cornor case that segment can't be move out from stopping node (#36431)
issue: #36426
the old constriant requires only segment on current target can be
balanced, which is wrong, and caused that segment can't be move out from
stopping node, if it's only exist in next target.

by design, stopping balance need to move out all segment on it by
balance task, thus the unfair old constriant should be removed.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-24 17:01:14 +08:00
wei liu c84ea5465c
fix: Fix some replicas don't participate in the query after the failure recovery (#35850)
issue: #35846
querycoord will notify proxy to update shard leader cache after
delegator location changes, but during querynode's failure recovery,
some delegator may become unserviceable due to lacking of segments, and
back to serviceable after segment loaded, so we also need to notify
proxy to invalidate shard leader cache when delegator serviceable state
changes.

This PR will maintain querynode's serviceable state during heartbeat,
and notify proxy to invalidate shard leader cache if serviceable state
changes.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-03 15:39:03 +08:00
SimFG 731d45abbe
enhance: provide more general configuration to control mmap behavior (#35359)
- issue: #35273

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-08-21 00:22:54 +08:00
congqixia 2fbc628994
feat: Support field partial load collection (#35416)
Related to #35415

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-08-20 16:49:02 +08:00
wei liu c0200eec39
enhance: limit getSegmentInfo batch size to avoid excced grpc message limit (#35394)
issue: #35395

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-08-15 19:17:00 +08:00
wei liu f6aaf3fef2
fix: force update next target if target can't be loaded (#35365)
issue: #35361

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-08-15 19:15:00 +08:00
jaime fcec4c21b9
fix: check collection health(queryable) fail for releasing collection (#34947)
issue: #34946

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-08-02 17:20:15 +08:00
wei liu 8123bea1ae
enhance: Avoid assign too much segment/channels to new querynode (#34096)
issue: #34095

When a new query node comes online, the segment_checker,
channel_checker, and balance_checker simultaneously attempt to allocate
segments to it. If this occurs during the execution of a load task and
the distribution of the new query node hasn't been updated, the query
coordinator may mistakenly view the new query node as empty. As a
result, it assigns segments or channels to it, potentially overloading
the new query node with more segments or channels than expected.

This PR measures the workload of the executing tasks on the target query
node to prevent assigning an excessive number of segments to it.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-06-27 19:06:05 +08:00
jaime 9630974fbb
enhance: move rocksmq from internal to pkg module (#33881)
issue: #33956

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-06-25 21:18:15 +08:00
Chun Han ca7ef26e4b
fix: sync part stats task cannot be finished(#30376) (#34027)
related: #30376
also: refine log output for query_coord task by rephrasing action string

Signed-off-by: MrPresent-Han <chun.han@gmail.com>
Co-authored-by: MrPresent-Han <chun.han@gmail.com>
2024-06-24 10:16:02 +08:00
wei liu 02945959d9
enhance: Avoid to iterate whole segment list for each task's process (#33943)
when querycoord process segment task, it will try to iterate whole
segment list to checke whether segment is loaded, which cost too much
cpu if there has thousands of segments.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-06-19 10:19:58 +08:00
Chun Han f7af323d1e
fix: sync partitiion stats blocking balance task(#33741) (#33742)
related: #33741

Signed-off-by: MrPresent-Han <chun.han@zilliz.com>
2024-06-11 14:21:56 +08:00
wayblink a1232fafda
feat: Major compaction (#33620)
#30633

Signed-off-by: wayblink <anyang.wang@zilliz.com>
Co-authored-by: MrPresent-Han <chun.han@zilliz.com>
2024-06-10 21:34:08 +08:00
SimFG ecee7d90d4
enhance: try to speed up the loading of small collections (#33570)
- issue: #33569

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-06-07 08:25:53 +08:00
jaime 0d99db23b8
fix: metrics leak on the coord nodes (#33075)
issue: #32980

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-05-20 22:03:39 +08:00
congqixia 72c172a7d7
enhance: Remove duplicated collectionID label for task latency (#32308)
`CollectionID` already exists in channel name, so remove it to save
metrics traffic.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-04-16 18:55:19 +08:00
wei liu 68dec7dcd4
fix: Use correct ts to avoid exclude segment list leak (#31991)
issue: #31990

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-04-12 10:39:19 +08:00