Commit Graph

699 Commits (master)

Author SHA1 Message Date
yihao.dai 2a037a97f1
enhance: Add get vector latency metric and refine request limit error message (#40083)
issue: https://github.com/milvus-io/milvus/issues/40078

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-02-21 19:41:55 +08:00
wei liu 7d2c948c69
fix: task delta cache leak on reduce task (#40055)
issue: #40052

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-21 16:47:54 +08:00
wei liu 07578041ba
fix: querycoord panic in cornor case (#40057)
issue: #40050

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-21 11:19:58 +08:00
Zhen Ye 64dad60dc2
fix: delegator doesn't follow with wal if streaming enabled (#39890)
issue: #38399

Signed-off-by: chyezh <chyezh@outlook.com>
2025-02-17 14:10:15 +08:00
Bingyi Sun b59555057d
feat: support json index (#36750)
https://github.com/milvus-io/milvus/issues/35528

This PR adds json index support for json and dynamic fields. Now you can
only do unary query like 'a["b"] > 1' using this index. We will support
more filter type later.

basic usage:
```
collection.create_index("json_field", {"index_type": "INVERTED",
    "params": {"json_cast_type": DataType.STRING, "json_path":
'json_field["a"]["b"]'}})
```

There are some limits to use this index:
1. If a record does not have the json path you specify, it will be
ignored and there will not be an error.
2. If a value of the json path fails to be cast to the type you specify,
it will be ignored and there will not be an error.
3. A specific json path can have only one json index.
4. If you try to create more than one json indexes for one json field,
sdk(pymilvus<=2.4.7) may return immediately because of internal
implementation. This will be fixed in a later version.

---------

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2025-02-15 14:06:15 +08:00
wei liu bfc802297e
enhance: Add management api to check querycoord balance status (#37784)
issue: #37783

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-14 18:00:14 +08:00
wei liu b9e3ec7175
enhance: Add trigger interval config for auto balance (#39154)
issue: #39156

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-14 16:12:15 +08:00
congqixia 58045a3396
fix: Check collection released before target checks (#39841)
Related to #39840

The target could be updated async in previous code. This PR make remove
collection from target observer block until all tasks related in
dispatchers are removed preventing the metrics being updated after
collection released.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-14 11:38:14 +08:00
Zhen Ye 0988807160
enhance: enable write ahead buffer for streaming service (#39771)
issue: #38399

- Make a timetick-commit-based write ahead buffer at write side.
- Add a switchable scanner at read side to transfer the state between
catchup and tailing read

Signed-off-by: chyezh <chyezh@outlook.com>
2025-02-12 20:38:46 +08:00
wei liu c12c4b4fff
fix: [skip e2e] pr conflict cause ut failed (#39811)
Related to https://github.com/milvus-io/milvus/pull/39701 &
https://github.com/milvus-io/milvus/issues/39681

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-12 11:44:51 +08:00
congqixia 7b51e4839f
fix: Resolve conflict on qc task test (#39796)
Related to #39701 & #39681

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-02-11 18:40:45 +08:00
wei liu ff5c680c99
fix: load collection stucks if compaction/gc happens (#39701)
issue: #39680
if compaction/gc happens, load collection may stuck due to
SegmentNotFound, we should trigger UpdateNextTarget to get a new data
view to execute loading operation.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-11 15:48:50 +08:00
wei liu 85c9f92ff4
fix: uneven distribution caused by executing task delta cache leak (#39702)
issue: #39681 

this PR maintain workload effect in action instead of computing workload
effect from target, which may cause leak if target changes.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-11 14:30:46 +08:00
Zhen Ye d3e32bb599
enhance: make pchannel level flusher (#39275)
issue: #38399

- Add a pchannel level checkpoint for flush processing
- Refactor the recovery of flushers of wal
- make a shared wal scanner first, then make multi datasyncservice on it

Signed-off-by: chyezh <chyezh@outlook.com>
2025-02-10 16:32:45 +08:00
jaime 8a4ac8cccd
enhance: expose more metrics data (#39456)
issue: #36621 #39417
1. Adjust the server-side cache size.
2. Add source information for configurations.
3. Add node ID for compaction and indexing tasks.
4. Resolve localhost access issues to fix health check failures for
etcd.

Signed-off-by: jaime <yun.zhang@zilliz.com>
2025-02-07 11:50:50 +08:00
wei liu 05ac4041aa
enhance: use rated logger for high frequency log in dist handler (#39452)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-02-05 15:31:10 +08:00
Zhen Ye c84a0748c4
enhance: add rw/ro streaming query node replica management (#38677)
issue: #38399

- Embed the query node into streaming node to make delegator available
at streaming node.
- The embedded query node has a special server label
`QUERYNODE_STREAMING-EMBEDDED`.
- Change the balance strategy to make the channel assigned to streaming
node as much as possible.

Signed-off-by: chyezh <chyezh@outlook.com>
2025-01-24 16:55:07 +08:00
yihao.dai 5fb597b37b
fix: Remove frequently updating metric to avoid mutex contention (#38775)
issue: https://github.com/milvus-io/milvus/issues/37630

Reduce the frequency updating metrics to avoid holding the mutex for
long periods.

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-24 10:31:07 +08:00
yihao.dai e0b26260f2
enhance: enable task delta cache (#39307)
When there are many segment tasks in the querycoord scheduler, the
traversal in `GetSegmentTaskDelta` checks becomes time-consuming. This
PR adds caching for segment deltas.

issue: https://github.com/milvus-io/milvus/issues/37630

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Co-authored-by: Wei Liu <wei.liu@zilliz.com>
2025-01-23 14:31:16 +08:00
yihao.dai 38f813bed3
enhance: Read metadata concurrently to accelerate recovery (#38403)
Read metadata such as segments, binlogs, and partitions concurrently at
the collection level.

issue: https://github.com/milvus-io/milvus/issues/37630

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-23 14:27:27 +08:00
yihao.dai e55d6506e3
enhance: Remove frequent observe log (#39413)
/kind improvement

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-20 11:01:10 +08:00
yihao.dai 657550cf06
fix: Fix slow dist handle and slow observe (#38566)
1. Provide partition&channel level indexing in the collection target.
2. Make `SegmentAction` not wait for distribution.
3. Remove scheduler and target manager mutex.
4. Optimize logging to reduce CPU overhead.

issue: https://github.com/milvus-io/milvus/issues/37630

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-15 20:17:00 +08:00
wei liu d2834a1812
enhance: Add logs for check health failed (#39208)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-01-15 17:31:00 +08:00
jaime e8f76cd2d9
fix: unstable ut in leader_vew_manager.go file (#39161)
issue: #38672

Signed-off-by: jaime <yun.zhang@zilliz.com>
2025-01-15 12:26:59 +08:00
Zhen Ye 3e788f0fbd
enhance: record memory size (uncompressed) item for index (#38770)
issue: #38715

- Current milvus use a serialized index size(compressed) for estimate
resource for loading.
- Add a new field `MemSize` (before compressing) for index to estimate
resource.

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-01-14 10:33:06 +08:00
wei liu cc5d59392a
fix: channel unbalance during stopping balance progress (#38971)
issue: #38970
cause the stopping balance channel still use the row_count_based policy,
which may causes channel unbalance in multi-collection case.

This PR impl a score based stopping balance channel policy.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-01-13 11:21:06 +08:00
wei liu 826b726c86
fix: Prevent leader checker from generating excessive duplicate leader tasks (#39000)
issue: #39001
Background:
Segment Load Version: Each segment load request assigns a timestamp as
its version. When multiple copies of a segment are loaded on different
QueryNodes, the leader checker uses this version to identify the latest
copy and updates the routing table in the leader view to point to it.
Delegator Router Version: When a delegator builds a route to a QueryNode
that has loaded a segment, it also records the segment's version.

Router Table Update Logic: If the leader checker detects that the
version of a segment in the routing table does not match the version in
the worker, it updates the routing table to point to the QueryNode with
the latest version. Additionally, it updates the segment's load version
in the QueryNode during this process.

Issue:
When a channel is undergoing load balancing, the leader checker may sync
the routing table to a new delegator. This sync operation modifies the
segment's load version, which invalidates the routing in the old
delegator. Subsequently, the leader checker updates the routing table in
the old delegator, breaking the routing in the new delegator. This cycle
continues, causing repeated updates and inconsistencies.

Fix:
This PR introduces two changes to address the issue:
1. Use NodeID to verify whether the delegator's routing table needs an
update, avoiding unnecessary modifications.
2. Ensure compatibility by using the latest segment's load version as
the version recorded in the routing table.

These changes resolve the cyclic updates and prevent the leader checker
from generating excessive duplicate tasks, ensuring routing stability
across delegators during load balancing.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-01-10 14:12:57 +08:00
Zhen Ye bb8d1ab3bf
enhance: make new go package to manage proto (#39114)
issue: #39095

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2025-01-10 10:49:01 +08:00
jaime f03a85725a
enhance: add db name in replica (#38672)
issue: #36621

Signed-off-by: jaime <yun.zhang@zilliz.com>
2025-01-09 19:40:59 +08:00
wei liu 47e7ea241e
enhance: Add log for case which target not update as expected (#38944)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2025-01-07 17:45:03 +08:00
Xiaofan cb6eca8e91
fix: drop partition can not be successful if load failed (#38793)
fix #38649
when partition load failed, the partition drop will also fail due to the
wrong error message

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
2024-12-30 19:42:52 +08:00
wei liu f49d618382
fix: Querycoord will trigger unexpected balance task after restart (#38630)
issue: #38606

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-25 19:30:48 +08:00
wei liu 25f0c82ceb
fix: Fix update loading collection's load config doesn't work (#38595)
issue: #38594

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-25 18:02:51 +08:00
wei liu 9c3f59dbbe
fix: Prevent balancer from overloading the same QueryNode (#38719)
issue: #38718
The balancer calculates the workload of executing tasks as an ongoing
score for target nodes. However, a logic issue arises when
GetSegmentTaskDelta or GetChannelTaskDelta is called with
collectionID=-1, which incorrectly returns zero.

Due to the incorrect global score, the executing task's workload is not
properly reflected for each collection. Consequently, each collection
submits its own balance task, leading to the balancer assigning
excessive tasks to the same QueryNode.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-25 16:36:49 +08:00
jaime 5afd0c0a2b
fix: Revert "Expose metrics of stanby coordinators (#27698)" (#38620)
issue: #38608

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-12-23 11:46:57 +08:00
jaime 78438ef41e
fix: revert optimize CPU usage for CheckHealth requests (#35589) (#38555)
issue: #35563

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-12-19 00:38:45 +08:00
yihao.dai d3c174b0f1
enhance: Accelerate observe collection (#38028)
1. A collection should observe the channel only once.
2. A collection should check the CollectionLoadPercent for updates only
once.
3. Skip saving coll/partition meta if there are no changes, primarily to
accelerate collection observation after recovery.

issue: https://github.com/milvus-io/milvus/issues/37630

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-17 14:14:45 +08:00
jaime 28fdbc4e30
enhance: optimize CPU usage for CheckHealth requests (#35589)
issue: #35563
1. Use an internal health checker to monitor the cluster's health state,
storing the latest state on the coordinator node. The CheckHealth
request retrieves the cluster's health from this latest state on the
proxy sides, which enhances cluster stability.
2. Each health check will assess all collections and channels, with
detailed failure messages temporarily saved in the latest state.
3. Use CheckHealth request instead of the heavy GetMetrics request on
the querynode and datanode

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-12-17 11:02:45 +08:00
SimFG 2afe2eaf3e
feat: support to replicate collection when the services contains the system tt msg (#37559)
- issue: #37105

---------

Signed-off-by: SimFG <bang.fu@zilliz.com>
2024-12-17 09:08:46 +08:00
wei liu 659847c11f
enhance: Remove load task limit in one round (#38436)
the task limit in assignSegment/assignChannel will works for both load
task and balance task.

this PR remove the load task limit, only limit balance task num in one
round.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-16 19:30:43 +08:00
wei liu 40f9db491e
fix: Fix SyncDistribution may cost too much time on retry (#38454)
issue: #38428

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-16 11:38:44 +08:00
tinswzy 27229f7907
enhance: refine exists log print with ctx (#38080)
issue: #35917 
Refines exists log print with ctx

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
2024-12-14 22:36:44 +08:00
Zhen Ye 833c74aa66
enhance: add detail, replica count for resource group (#38314)
issue: #30647

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-12-13 14:14:50 +08:00
wei liu e279ccf109
enhance: Enable score based balance channel policy (#38143)
issue: #38142
current balance channel policy only consider current collection's
distribution, so if all collections has 1 channel, and all channels has
been loaded on same querynode, after querynode num increase, balance
channel won't be triggered.

This PR enable score based balance channel policy, to achieve:
1. distribute all channels evenly across multiple querynodes
2. distribute each collection's channel evenly across multiple
querynodes.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-11 17:20:43 +08:00
Zhen Ye d3ae8e9232
fix: delay the wait other coord logic in query coord after query coord change into standby state (#38259)
issue: https://github.com/milvus-io/milvus/issues/37764

- After removing rpc layer from mixcoord, the querycoord at standby mode
will be blocked forever of deployment rolling

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-12-11 15:48:42 +08:00
wei liu 950203aba0
enhance: Optimize save colelction target latency (#38345)
issue: #38237
this PR only use better compression level for proto msg which is larger
than 1MB, and use a lighter compression level for smaller proto msg,
which could get a better latency in most case.

this PR could reduce the latency from 22.7s to 4.7s with 10000
collctions and each collections has 1000 segments.

before this PR:
BenchmarkTargetManager-8 1 22781536357 ns/op 566407275088 B/op 11188282
allocs/op
after this PR:
BenchmarkTargetManager-8 1 4729566944 ns/op 36713248864 B/op 10963615
allocs/op

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-11 10:12:43 +08:00
congqixia 7ea9c983d2
enhance: Add mockery package config for QC&QN (#38340)
Related to #38339

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-10 19:18:42 +08:00
wei liu 856e2aad7d
fix: Leader task stuck and retry again and again (#38202)
issue: #38201
leader task require to update delegator's distribution, and only success
after the distribution change has been applyed to delegator. but the
delegator will reject the distribution change if it's version is older
than current version in delegator. which cause the leader task stuck and
retry forever.

this PR remove the leader task finish check.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-10 19:16:42 +08:00
wei liu f04986fceb
enhance: Remove constraint on release segment task (#38297)
issue: #38305
after we disable balance segment and balance channel happens at same
time, the constriant which require release segment must happens on
serviceable shard leader is unnessary.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-12-10 11:18:49 +08:00
jaime 8ed019735c
enhance: add disk stats within system metrics (#38033)
issue: ##36621

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-12-06 16:32:41 +08:00
congqixia 36946cc9ce
enhance: Set loaded collection/partition number to metrics (#38271)
Related to #36456
Previous PR: #38471 #38233

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-06 16:18:40 +08:00
congqixia 6ff19481f0
enhance: Resolve compilation error due to PR conflict (#38252)
Related pr: #38233 #38059

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-05 19:26:40 +08:00
congqixia 051bc280dd
enhance: Make dynamic load/release partition follow targets (#38059)
Related to #37849

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-05 16:24:40 +08:00
congqixia 32645fc28a
enhance: Unify querycoord meta metrics (#38233)
Related to #36456

Unify collection/partition number metrics to collection manager in case
of unwant missing modification

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-05 15:48:39 +08:00
tinswzy 7944538ade
enhance: Add ctx param to KV operation interfaces (#38154)
issue: #35917 
Refine KV operation interfaces by adding a ctx param

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
2024-12-05 15:16:41 +08:00
tinswzy e76802f910
enhance: refine querycoord meta/catalog related interfaces to ensure that each method includes a ctx parameter (#37916)
issue: #35917 
This PR refine the querycoord meta related interfaces to ensure that
each method includes a ctx parameter.

Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
2024-11-25 11:14:34 +08:00
jaime 7bbfe86bcd
enhance: add list index and segment index retrieval API for WebUI (#37861)
issue: #36621

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-11-22 16:58:34 +08:00
congqixia b34bfb98a0
enhance: Refine Replica manager colle2Replicas secondary index (#37906)
Related to #37630

This PR add a new util coll2Replicas secondary index to reduce map
access & iteration while get replicas by collection

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-22 11:54:32 +08:00
wei liu 965bda6e60
enhance: Add channel name to shard leader log in meta cache (#37856)
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-21 19:24:31 +08:00
wei liu 0a440e0d38
fix: Prevent simultaneous balance of segments and channels (#37850)
issue: #33550
balance segment and balance segment execute at same time, which will
cause bounch of corner case.

This PR disable simultaneous balance of segments and channels

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-21 17:56:55 +08:00
wei liu b983ef9fca
fix: Channel may be released after balance (#37862)
issue: #37830
casue dist handler doesn't set channel's version, so if channel checker
try to dedup channel, it may release the new delegator after balance
finished.

this PR fix the way to set proper version for channel.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-21 10:40:31 +08:00
congqixia b8d31ebed8
enhance: Remove unnecessary segment clone updating dist (#37797)
Related to #37630

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-20 11:26:31 +08:00
yihao.dai b6612e02b4
enhance: Reduce GetIndexInfos calls (#37695)
Batch `GetIndexInfos` calls for segments to reduce RPC calls.

issue: https://github.com/milvus-io/milvus/issues/37634

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-19 14:24:31 +08:00
congqixia 6d86b9022e
enhance: Provide secondary index critria when filter leaderview (#37777)
Related to #37630

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-19 10:12:30 +08:00
jaime 257ecab84b
enhance: remove collection queryable check from health check (#37712)
Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-11-18 10:50:38 +08:00
congqixia b0bd290a6e
enhance: Use internal json(sonic) to replace std json lib (#37708)
Related to #35020

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-18 10:46:31 +08:00
wei liu a1b6be1253
fix: Delegator stuck at unserviceable status (#37694)
issue: #37679

pr #36549 introduce the logic error which update current target when
only parts of channel is ready.

This PR fix the logic error and let dist handler keep pull distribution
on querynode until all delegator becomes serviceable.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-15 10:20:31 +08:00
jaime 1d06d4324b
fix: Int64 overflow in JSON encoding (#37657)
issue: ##36621

- For simple types in a struct, add "string" to the JSON tag for
automatic string conversion during JSON encoding.
- For complex types in a struct, replace "int64" with "string."

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-11-14 22:52:30 +08:00
wei liu 1304b40552
fix: Balance channel may stuck at increasing replica number case (#37641)
issue: #37640
fix the pr #36549
cause balance channel will wait until new delegator becomes serviceable,
but new delegator need to sync target version then becomes serviceable,
and sync target version need to be wait all replica load done. so if
increasing replica number and balance channel happens at same time,
logic dead lock occurs.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-14 10:08:31 +08:00
jaime 1e8ea4a7e7
feat: add segment/channel/task/slow query render (#37561)
issue: #36621

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-11-12 17:44:29 +08:00
wei liu 266f8ef1f5
fix: Search may return less result after qn recover (#36549)
issue: #36293 #36242
after qn recover, delegator may be loaded in new node, after all segment
has been loaded, delegator becomes serviceable. but delegator's target
version hasn't been synced, and if search/query comes, delegator will
use wrong target version to filter out a empty segment list, which
caused empty search result.

This pr will block delegator's serviceable status until target version
is synced

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-12 16:34:28 +08:00
congqixia f5b06a3c9f
enhance: Invalidate collection cache when release collection (#37577)
Related to #37395

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-12 10:16:29 +08:00
wei liu 61a5b15ada
fix: Lost loading collection's updateTs after qc restart (#37538)
issue: #37537

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-11 14:34:28 +08:00
congqixia 5e90f348fc
enhance: Handle legacy proxy load fields request (#37565)
Related to #35415

In rolling upgrade, legacy proxy may dispatch load request wit empty
load field list. The upgraded querycoord may report error by mistake
that load field list is changed.

This PR:

- Auto field empty load field list with all user field ids
- Refine the error messag when load field list updates
- Refine load job unit test with service cases

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-11 10:14:26 +08:00
sthuang 70605cf5b3
enhance: Support custom privilege group for RBAC (#37087)
issue: #37031

---------

Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
2024-11-09 08:44:28 +08:00
yihao.dai ff9bdf7029
fix: Fix load slowly (#37454)
When there're a lot of loaded collections, they would occupy the target
observer scheduler’s pool. This prevents loading collections from
updating the current target in time, slowing down the load process.
This PR adds a separate target dispatcher for loading collections.

issue: https://github.com/milvus-io/milvus/issues/37166

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-09 07:48:26 +08:00
congqixia dcc1b506dc
enhance: Add context trace for querycoord queryable check (#37524)
When check health logic failed to collection not-queryable, the related
reason is hard to find in log.

This PR add context for log with trace id and print unqueryable
collection info log.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-11-08 14:00:26 +08:00
wei liu a03157838b
enhance: Enable node assign policy on resource group (#36968)
issue: #36977
with node_label_filter on resource group, user can add label on
querynode with env `MILVUS_COMPONENT_LABEL`, then resource group will
prefer to accept node which match it's node_label_filter.

then querynode's can't be group by labels, and put querynodes with same
label to same resource groups.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-08 11:18:27 +08:00
Xiaofan e073906a19
enhance: optimize describe collection and index (#37490)
fix #37489
combine multiple describe collection and list index into one call

Signed-off-by: xiaofanluan <xiaofan.luan@zilliz.com>
2024-11-08 10:18:34 +08:00
jaime f348bd9441
feat: add segment,pipeline, replica and resourcegroup api for WebUI (#37344)
issue: #36621

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-11-07 11:52:25 +08:00
wei liu 8714774305
fix: search/query failed due to segment not loaded (#37403)
issue: #36970
cause release segment and balance channel may happen at same time, and
before new delegator become serviceable, if release segment exeuctes on
new delegator, and search/query comes on old delegator, then release
segment and query segment happens in parallel, if release segment
execute first in worker, then search/query will got a SegmentNodeLoaded
error.

This PR add serviceable filter on delegator, then all load/release
segment operation will happens on serviceable delegator.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-11-06 15:10:25 +08:00
congqixia 9539739781
enhance: Release compacted growing segment if in dropped list (#37245)
See also #37205

Previously releasing growing segments could be triggered by two
conditions:

- Sealed Segment with same id is loaded
- Segment start position is before target checkpoint ts

Which has a worst case that the corresponding sealed segment is
compacted and the checkpoint is pinned by a growing l0 segment.

This PR introduces a new rule that: a growing segment could be released
if the segment id appeared in current target dropped segment id list.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-10-29 18:04:21 +08:00
jaime 9d16b972ea
feat: add tasks page into management WebUI (#37002)
issue: #36621

1. Add API to access task runtime metrics, including:
  - build index task
  - compaction task
  - import task
- balance (including load/release of segments/channels and some leader
tasks on querycoord)
  - sync task
2. Add a debug model to the webpage by using debug=true or debug=false
in the URL query parameters to enable or disable debug mode.

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-10-28 10:13:29 +08:00
wei liu 39a91eb100
fix: Delegator may becomes unserviceable after querycoord restart (#37055)
issue: #37054
after querycoord restart, segment_checker may release segment by mistake
due to next target isn't ready yet.

This PR requires release segment must happens after next target is
ready.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-10-24 12:21:28 +08:00
congqixia f43527ef6f
enhance: Batch forward delete when using DirectForward (#37076)
Relatedt #36887

DirectFoward streaming delete will cause memory usage explode if the
segments number was large. This PR add batching delete API and using it
for direct forward implementation.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-10-24 10:39:28 +08:00
wei liu f029314e20
fix: Dynamic release parition may fail search/query. (#37049)
issue: #33550
cause wrong impl of UpdateCollectionNextTarget, if ReleaseCollection and
UpdateCollectionNextTarget happens at same time, the the released
partition's segment list may be add to target again, and delegator will
be marked as unserviceable due to lack of segment.

This PR fix the impl of UpdateCollectionNextTarget

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-10-24 01:03:28 +08:00
jaime 4746f47282
feat: management WebUI homepage (#36822)
issue: #36784
1. Implement an embedded web server for WebUI access.  
2. Complete the homepage development.

Home page demo:
<img width="2177" alt="iShot_2024-10-10_17 57 34"
src="https://github.com/user-attachments/assets/38539917-ce09-4e54-a5b5-7f4f7eaac353">

Signed-off-by: jaime <yun.zhang@zilliz.com>
2024-10-23 11:29:28 +08:00
Bingyi Sun 6851738fd1
fix: fix `make generate-mockery` panic with go1.22 (#36830)
https://github.com/milvus-io/milvus/issues/36831
Fix `make generate-mockery` panic.

Signed-off-by: sunby <sunbingyi1992@gmail.com>
2024-10-17 12:11:31 +08:00
congqixia 3fe0f82923
enhance: Add balance report log for qc balancer (#36747)
Related to #36746

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-10-11 10:25:24 +08:00
aoiasd db34572c56
feat: support load and query with bm25 metric (#36071)
relate: https://github.com/milvus-io/milvus/issues/35853

---------

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2024-10-11 10:23:20 +08:00
wei liu 470bb0cc3f
enhance: Enable balance on querynode with different mem capacity (#36466)
issue: #36464
This PR enable balance on querynode with different mem capacity, for
query node which has more mem capactity will be assigned more records,
and query node with the largest difference between assignedScore and
currentScore will have a higher priority to carry the new segment.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-30 16:15:17 +08:00
cai.zhang ecb2b242e2
enhance: Add sorted for segment info (#36469)
issue: #33744

Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
2024-09-30 10:01:16 +08:00
wei liu 55be814a58
enhance: make TransferChannel/TransferSegment idempotent (#36489)
issue: #36488
when call TransferChannel/TransferSegment, querycoord will generate and
submit balance task to scheduler, if segment/channel's task already
exist in scheduler, submit task will failed.

to make TransferChannel/TransferSegment idempotent, we should skip to
submit if task already exist in scheduler.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-26 18:11:23 +08:00
wei liu 5dfa1c3397
fix: Segment unbalance after many times load/release (#36537)
issue: #36536
query coord use `segmentTaskDeleta/channelTaskDelta` to measure the
executing workload for querynode in scheduler, and we maintains the
`segmentTaskDeleta/channelTaskDelta` by `scheulder.Add(task)` and
`scheduler.remove(task)`, but `scheduler.remove(task)` has been called
in unexpected way, which cause a wrong
`segmentTaskDeleta/channelTaskDelta` value and affect the segment assign
logic, causes segment unbalance.

This PR moves to compute the `segmentTaskDeleta/channelTaskDelta` when
access, to avoid the wrong value affect.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-26 15:13:15 +08:00
sthuang 4493aa2142
fix: querycoord collection num metric (#36471)
related to: #36456

Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
2024-09-26 14:23:13 +08:00
wei liu 3cd0b26285
enhance: Enable dynamic update loaded collection's replica (#35822)
issue: #35821
After collection loaded, if we need to increase/decrease collection's
replica, we need to release and load it again.

milvus offers 4 solution to update loaded collection's replica, this PR
aims to dynamic change the replica number without release, and after
replica number changed, milvus will execute load replica or release
replica in async, and the replica loaded status can be checked by
getReplicas API.

Notice that if set too much replicas than querynode can afford,the new
replica won't be loaded successfully until enough querynode joins.

---------

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-25 10:13:18 +08:00
wei liu 3bd7ec8751
fix: Fix cornor case that segment can't be move out from stopping node (#36431)
issue: #36426
the old constriant requires only segment on current target can be
balanced, which is wrong, and caused that segment can't be move out from
stopping node, if it's only exist in next target.

by design, stopping balance need to move out all segment on it by
balance task, thus the unfair old constriant should be removed.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-24 17:01:14 +08:00
wei liu f7d950d465
fix: [skip e2e] Fix unstable ut TestCollectionObserver (#36231)
issue: #36237

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-13 19:01:09 +08:00
wei liu fb2a41a94c
fix: Clean dirty segment/channel on querynode (#36202)
issue: #36201
after querynode has been remove from replica, all dirty segment/channel
on it should be released.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
2024-09-13 18:15:08 +08:00
Jiquan Long 89bf226f0b
feat: support keyword text match (#35923)
fix: #35922

---------

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2024-09-10 15:11:08 +08:00