Commit Graph

20660 Commits (10kcp)

Author SHA1 Message Date
yihao.dai 584b054981
fix: [10kcp] Fix concurrent schedule (#38974)
supplement to pr: https://github.com/milvus-io/milvus/pull/38566

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-03 12:59:12 +08:00
yihao.dai f5f4fed889
fix: [10kcp] channel unbalance during stopping balance progress (#38972)
issue: https://github.com/milvus-io/milvus/issues/38970,
https://github.com/milvus-io/milvus/issues/37630

Cause: the stopping balance channel still uses the row_count_based policy,
which may cause channel imbalance in the multi-collection case.

This PR implements a score-based stopping balance channel policy.

pr: https://github.com/milvus-io/milvus/pull/38971

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: Wei Liu <wei.liu@zilliz.com>
2025-01-03 12:05:03 +08:00
yihao.dai 9b2b2a2689
enhance: [10kcp] Remove scheduler and target manager mutex (#38968)
supplement to PR https://github.com/milvus-io/milvus/pull/38566

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-03 11:18:52 +08:00
yihao.dai 663ec6f822
enhance: [10kcp] Reducing the granularity of locks in the target manager (#38956)
pr: https://github.com/milvus-io/milvus/pull/38566

Just for testing; I'll remove the global mutex later.

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-02 20:54:47 +08:00
yihao.dai 7934c211c6
fix: [10kcp] Querycoord will trigger unexpected balance task after restart (#38951)
pr: https://github.com/milvus-io/milvus/pull/38944

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: wei liu <wei.liu@zilliz.com>
2025-01-02 17:54:21 +08:00
yihao.dai b6c18f756f
enhance: [10kcp] Add log for the case where the target is not updated as expected (#38952)
pr: https://github.com/milvus-io/milvus/pull/38944

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: Wei Liu <wei.liu@zilliz.com>
2025-01-02 17:53:29 +08:00
yihao.dai 734a20ac01
enhance: [10kcp] Add logs for querycoord checker (#38953)
issue: https://github.com/milvus-io/milvus/issues/37630

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2025-01-02 17:52:54 +08:00
yihao.dai c25fdf8080
enhance: [10kcp] Prevent frequently updating metric (#38828)
issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38827

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-28 00:04:04 +08:00
yihao.dai fee1f77d4e
fix: [10kcp] Fix incorrect memory estimation for small segments (#38814)
Skip the index memory estimation logic for segments without an index file.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38813

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-27 16:07:29 +08:00
yihao.dai 15b9f51728
enhance: [10kcp] Add channel index in target, optimize logs (#38804)
supplement to pr: https://github.com/milvus-io/milvus/pull/38566

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-27 10:44:03 +08:00
yihao.dai 4d0594ba04
fix: [10kcp] Fix rootcoord meta mutex contention (#38803)
RootCoord meta uses copy-on-write, allowing the removal of unnecessary
copies.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38799

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-27 10:17:33 +08:00
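The copy-on-write note in #38803 above can be illustrated with a minimal sketch. It assumes a hypothetical `metaTable` type (not RootCoord's actual meta structure): writers copy what they change and publish a new snapshot, so readers can return published objects directly without a defensive clone.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// collection is treated as immutable once published in a snapshot.
type collection struct {
	Name       string
	Partitions []string
}

// metaTable is a hypothetical copy-on-write table, not RootCoord's real type.
type metaTable struct {
	writeMu sync.Mutex   // serializes writers only
	snap    atomic.Value // holds map[int64]*collection
}

func newMetaTable() *metaTable {
	m := &metaTable{}
	m.snap.Store(map[int64]*collection{})
	return m
}

// Get returns the published object without cloning; callers must not mutate it.
func (m *metaTable) Get(id int64) *collection {
	return m.snap.Load().(map[int64]*collection)[id]
}

// AddPartition copies what it changes, then publishes a new snapshot.
func (m *metaTable) AddPartition(id int64, name, partition string) {
	m.writeMu.Lock()
	defer m.writeMu.Unlock()
	old := m.snap.Load().(map[int64]*collection)
	next := make(map[int64]*collection, len(old)+1)
	for k, v := range old {
		next[k] = v
	}
	updated := &collection{Name: name}
	if prev, ok := old[id]; ok {
		updated.Partitions = append(updated.Partitions, prev.Partitions...)
	}
	updated.Partitions = append(updated.Partitions, partition)
	next[id] = updated
	m.snap.Store(next)
}

func main() {
	m := newMetaTable()
	m.AddPartition(1, "docs", "_default")
	before := m.Get(1) // reader keeps an old snapshot entry
	m.AddPartition(1, "docs", "p1")
	fmt.Println(len(before.Partitions), len(m.Get(1).Partitions))
}
```

Because a published `collection` is never modified in place, a reader that holds an older pointer keeps seeing a consistent value, which is what makes the extra copies on the read path unnecessary.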
yihao.dai e13b8a2f58
enhance: [10kcp] Optimize save collection target latency (#38345) (#38370) (#38795)
issue: #38237
pr: #38345
This PR uses a stronger compression level only for proto messages larger
than 1MB and a lighter compression level for smaller ones, which gives
better latency in most cases.

This PR reduces the latency from 22.7s to 4.7s with 10000 collections,
each with 1000 segments.

Before this PR:
BenchmarkTargetManager-8 1 22781536357 ns/op 566407275088 B/op 11188282 allocs/op
After this PR:
BenchmarkTargetManager-8 1 4729566944 ns/op 36713248864 B/op 10963615 allocs/op

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Co-authored-by: wei liu <wei.liu@zilliz.com>
2024-12-26 21:49:07 +08:00
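The size-dependent compression described in #38795 above can be sketched as follows. This is only an illustration: it uses the standard library's `compress/gzip` as a stand-in codec (the commit does not say which compressor Milvus uses), and `levelFor`, `compress`, and `decompress` are hypothetical helper names.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// largeMsgThreshold mirrors the 1MB cut-off named in the commit message.
const largeMsgThreshold = 1 << 20

// levelFor picks a compression level from the payload size.
func levelFor(size int) int {
	if size > largeMsgThreshold {
		return gzip.BestCompression // large payloads: spend CPU to save bytes
	}
	return gzip.BestSpeed // small payloads: favor latency
}

// compress gzips payload at the level chosen by levelFor.
func compress(payload []byte) ([]byte, error) {
	var buf bytes.Buffer
	w, err := gzip.NewWriterLevel(&buf, levelFor(len(payload)))
	if err != nil {
		return nil, err
	}
	if _, err := w.Write(payload); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompress reverses compress.
func decompress(data []byte) ([]byte, error) {
	r, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	small := bytes.Repeat([]byte("seg"), 100)
	out, err := compress(small)
	if err != nil {
		panic(err)
	}
	back, err := decompress(out)
	if err != nil {
		panic(err)
	}
	fmt.Println(bytes.Equal(small, back))
}
```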
yihao.dai 501d1b58cf
Revert "fix: [10kcp] Query coord stop progress is too slow (#38300)" (#38794)
This reverts commit ae4e2b8063.
2024-12-26 21:48:41 +08:00
yihao.dai 05f50b11ff
fix: [10kcp] Fix slow preprocess in qc scheduler (#38784)
supplement to pr: https://github.com/milvus-io/milvus/pull/38566

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-26 17:05:44 +08:00
yihao.dai 7f5467577e
fix: [10kcp] Fix index meta mutex contention (#38777)
issue: https://github.com/milvus-io/milvus/issues/37630

Reduce the frequency of updateIndexTasksMetrics to avoid holding the
mutex for long periods.

pr: https://github.com/milvus-io/milvus/pull/38775

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-26 17:04:14 +08:00
yihao.dai 1969ab3da7
enhance: Optimize GetLocalDiskSize and segment loader mutex (#38683)
fix of: https://github.com/milvus-io/milvus/pull/38599

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-24 11:19:58 +08:00
yihao.dai bf27f70c32
enhance: [10kcp] Optimize GetLocalDiskSize and segment loader mutex (#38601)
fix of pr: https://github.com/milvus-io/milvus/pull/38599

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-19 21:29:13 +08:00
yihao.dai ecd55596cf
enhance: [10kcp] Optimize GetLocalDiskSize and segment loader mutex (#38600)
1. Make the segment loader lock protect only the resource.
2. Optimize GetDiskUsage to avoid excessive overhead.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38599

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-19 21:14:26 +08:00
congqixia f5ae24f955
fix: [10kcp] SyncSegments rpc always failed (#38032) (#38579)
Cherry-pick from 2.4
pr: #38032
issue: #38031
Cause: the call to `cli.SyncSegments` uses a ctx that has already been
overridden and canceled, so the SyncSegments rpc always fails.

Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Co-authored-by: wei liu <wei.liu@zilliz.com>
2024-12-19 14:01:46 +08:00
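The failure mode described in #38579 above, an RPC issued with a context that was overridden and already canceled, can be reproduced in a few lines. This is a guess at the general shape of the bug, not the actual Milvus code; `rpc`, `buggy`, `fixed`, and `demo` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// rpc stands in for a call such as cli.SyncSegments: it fails if ctx is done.
func rpc(ctx context.Context) error {
	return ctx.Err()
}

// buggy overrides ctx with a short-lived derived context, cancels it,
// and then passes the canceled context to the RPC.
func buggy(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, time.Second)
	// ... an earlier step would use ctx here ...
	cancel()
	return rpc(ctx)
}

// fixed keeps the derived context under its own name, so the RPC still
// receives the caller's live context.
func fixed(ctx context.Context) error {
	stepCtx, cancel := context.WithTimeout(ctx, time.Second)
	_ = stepCtx // an earlier step would use stepCtx here
	cancel()
	return rpc(ctx)
}

// demo runs both variants against a fresh background context.
func demo() (buggyErr, fixedErr error) {
	return buggy(context.Background()), fixed(context.Background())
}

func main() {
	b, f := demo()
	fmt.Println("buggy:", b)
	fmt.Println("fixed:", f)
}
```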
yihao.dai c3d4469259
enhance: Print observe time (#38575)
Print observe, dist handling and schedule time.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38566

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-19 11:46:13 +08:00
yihao.dai ca234e7847
fix: [10kcp] Fix slow dist handle and slow observe (#38567)
1. Provide partition-level indexing in the collection target.
2. Make SegmentAction not wait for distribution.
3. Optimize logging to reduce CPU overhead.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38566

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-18 21:00:39 +08:00
congqixia 999437e76e
enhance: [10kcp] Trim data distribution resp index info (#38521)
Related to #37630

Data distribution became too large when the segment number was huge. This PR
trims the index info struct and returns only the needed info.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-17 15:20:26 +08:00
congqixia 28841ebdf9
enhance: [10kcp] Simplify querynode tsafe & reduce goroutine number (#38416) (#38433)
Related to #37630

TSafe manager is too complex for the current implementation, and each
delegator needs one goroutine waiting for tsafe update events.

Tsafe updating could be executed in the pipeline. This PR removes the tsafe
manager and simplifies the entire tsafe updating logic.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-13 21:20:57 +08:00
yihao.dai de78de7689
fix: [10kcp] Fix consume blocked due to too many consumers (#38456)
This PR limits the maximum number of consumers per pchannel to 10 for
each QueryNode and DataNode.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38455

---------

Signed-off-by: SimFG <bang.fu@zilliz.com>
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: SimFG <bang.fu@zilliz.com>
2024-12-13 21:20:47 +08:00
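The per-pchannel consumer cap from #38456 above could be enforced by a small counter keyed by pchannel. The commit does not describe the mechanism actually used, so the following is only a sketch under that assumption; `consumerLimiter` and its methods are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// consumerLimiter caps the number of live consumers per pchannel.
type consumerLimiter struct {
	mu     sync.Mutex
	max    int
	counts map[string]int
}

func newConsumerLimiter(max int) *consumerLimiter {
	return &consumerLimiter{max: max, counts: make(map[string]int)}
}

// TryAcquire reserves a consumer slot, or reports false when the cap is hit.
func (l *consumerLimiter) TryAcquire(pchannel string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.counts[pchannel] >= l.max {
		return false
	}
	l.counts[pchannel]++
	return true
}

// Release frees a slot previously reserved with TryAcquire.
func (l *consumerLimiter) Release(pchannel string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.counts[pchannel] > 0 {
		l.counts[pchannel]--
	}
}

func main() {
	l := newConsumerLimiter(10) // 10 is the limit named in the commit
	granted := 0
	for i := 0; i < 12; i++ {
		if l.TryAcquire("pchannel-0") {
			granted++
		}
	}
	fmt.Println("granted:", granted)
}
```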
yihao.dai df4d5e1096
enhance: [10kcp] Read metadata concurrently to accelerate recovery (#38404)
Read metadata such as segments, binlogs, and partitions concurrently at
the collection level.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38403

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-12 16:39:06 +08:00
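Reading metadata concurrently at the collection level, as in #38404 above, amounts to a bounded fan-out over collection IDs. A minimal sketch, assuming a hypothetical `loadCollection` that stands in for the real metastore reads:

```go
package main

import (
	"fmt"
	"sync"
)

// collectionMeta groups the per-collection metadata named in the commit.
type collectionMeta struct {
	Segments   []string
	Binlogs    []string
	Partitions []string
}

// loadCollection is a hypothetical stand-in for the metastore reads.
func loadCollection(id int64) collectionMeta {
	return collectionMeta{
		Segments:   []string{fmt.Sprintf("seg-%d", id)},
		Binlogs:    []string{fmt.Sprintf("binlog-%d", id)},
		Partitions: []string{"_default"},
	}
}

// loadAll reads every collection with at most parallelism reads in flight.
func loadAll(ids []int64, parallelism int) map[int64]collectionMeta {
	if parallelism < 1 {
		parallelism = 1
	}
	results := make(map[int64]collectionMeta, len(ids))
	var mu sync.Mutex
	var wg sync.WaitGroup
	sem := make(chan struct{}, parallelism)
	for _, id := range ids {
		wg.Add(1)
		sem <- struct{}{} // wait for a free slot
		go func(id int64) {
			defer wg.Done()
			defer func() { <-sem }()
			meta := loadCollection(id)
			mu.Lock()
			results[id] = meta
			mu.Unlock()
		}(id)
	}
	wg.Wait()
	return results
}

func main() {
	metas := loadAll([]int64{1, 2, 3, 4}, 2)
	fmt.Println(len(metas))
}
```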
yihao.dai 11118db7d6
enhance: [10kcp] remove unnecessary clone in meta cache (#38398)
issue: https://github.com/milvus-io/milvus/issues/36627,
https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/36628

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: Ted Xu <ted.xu@zilliz.com>
2024-12-12 16:33:38 +08:00
congqixia 5521091dcd
enhance: [10kcp] Refine querynode collection number metrics (#38352)
Related to #37630

Previously the loaded collection metrics were calculated by scanning all
loaded segments in the segment manager, which is a slow and buggy
implementation.

This PR:

- Move collection num metrics to collection manager
- Remove deprecated loaded partition metrics update logic

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-10 21:06:42 +08:00
yihao.dai 4a2a5f0183
fix: [10kcp] Fix standby mixcoord start failed (#38327)
fix of https://github.com/milvus-io/milvus/pull/38324

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-10 11:47:45 +08:00
yihao.dai 15b01daec5
fix: [10kcp] Fix standby mixcoord start failed (#38324)
When standby transitions to active, the component state changes to
Initialize. If the initialization takes too long (exceeding the liveness
probe's maximum retries), the standby pod is stopped and fails to start.
This PR removes the Initialize state during standby transitions in
rolling upgrades. The state now switches directly from standby to
healthy, preventing health check failures.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38308

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-10 10:53:50 +08:00
congqixia 24a055996b
enhance: [10kcp] Add secondary index for querynode segment manager (#38312)
Cherry-pick from pr: #38311
Related to #37630

Add secondary index with vchannel to reduce `GetBy` rlock holding time
when segment number is large.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-09 19:56:16 +08:00
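The vchannel secondary index from #38312 above trades a little bookkeeping on writes for lookups that no longer scan every segment under the read lock. A minimal sketch with hypothetical types (the real segment manager is more involved):

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

type segment struct {
	ID       int64
	VChannel string
}

// segmentIndex keeps a primary map plus a vchannel -> segment IDs index.
type segmentIndex struct {
	mu        sync.RWMutex
	byID      map[int64]segment
	byChannel map[string]map[int64]struct{}
}

func newSegmentIndex() *segmentIndex {
	return &segmentIndex{
		byID:      make(map[int64]segment),
		byChannel: make(map[string]map[int64]struct{}),
	}
}

// unlink removes seg from the secondary index; the caller holds mu.
func (s *segmentIndex) unlink(seg segment) {
	set := s.byChannel[seg.VChannel]
	delete(set, seg.ID)
	if len(set) == 0 {
		delete(s.byChannel, seg.VChannel)
	}
}

// Put inserts or replaces a segment and keeps both maps in sync.
func (s *segmentIndex) Put(seg segment) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if old, ok := s.byID[seg.ID]; ok {
		s.unlink(old)
	}
	s.byID[seg.ID] = seg
	set := s.byChannel[seg.VChannel]
	if set == nil {
		set = make(map[int64]struct{})
		s.byChannel[seg.VChannel] = set
	}
	set[seg.ID] = struct{}{}
}

// Remove deletes a segment from both maps.
func (s *segmentIndex) Remove(id int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if old, ok := s.byID[id]; ok {
		s.unlink(old)
		delete(s.byID, id)
	}
}

// GetByChannel touches only the segments of one vchannel.
func (s *segmentIndex) GetByChannel(vchannel string) []int64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	ids := make([]int64, 0, len(s.byChannel[vchannel]))
	for id := range s.byChannel[vchannel] {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

func main() {
	idx := newSegmentIndex()
	idx.Put(segment{ID: 1, VChannel: "v0"})
	idx.Put(segment{ID: 2, VChannel: "v1"})
	idx.Put(segment{ID: 3, VChannel: "v0"})
	fmt.Println(idx.GetByChannel("v0"))
}
```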
yihao.dai 3e65cc5850
enhance: [10kcp] Enable score based balance channel policy (#38301)
issue: https://github.com/milvus-io/milvus/issues/38142
The current balance channel policy only considers the current collection's
distribution, so if every collection has 1 channel and all channels have
been loaded on the same querynode, balance channel won't be triggered
after the querynode number increases.

This PR enables the score-based balance channel policy, to:

1. distribute all channels evenly across multiple querynodes
2. distribute each collection's channels evenly across multiple
querynodes.
pr: https://github.com/milvus-io/milvus/pull/38143

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: Wei Liu <wei.liu@zilliz.com>
2024-12-09 19:50:05 +08:00
yihao.dai ae4e2b8063
fix: [10kcp] Query coord stop progress is too slow (#38300)
issue: https://github.com/milvus-io/milvus/issues/38237

Query coord saves each collection's target during the stop process, so
that a new querycoord can recover quickly. But if the milvus cluster has
thousands of collections, the stop process becomes much slower than
expected.

This PR refines the implementation to save a collection's target to etcd
when the target updates, and to clean it up when the collection is released.

pr: https://github.com/milvus-io/milvus/pull/38238

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
Co-authored-by: Wei Liu <wei.liu@zilliz.com>
2024-12-09 19:49:49 +08:00
yihao.dai 2fe6423552
enhance: [10kcp] Speed up meta recovery (#38298)
Increase the batchSize in WalkWithPrefix operations to 10000.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38285

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-09 19:49:35 +08:00
yihao.dai 3d490aa158
fix: [10kcp] Replace outer lock with concurrent map (#38286)
See also: #37493
pr: #37817

Signed-off-by: yangxuan <xuan.yang@zilliz.com>
Co-authored-by: XuanYang-cn <xuan.yang@zilliz.com>
2024-12-09 19:49:20 +08:00
yihao.dai df100e5bbe
fix: [10kcp] Fix init rootcoord meta timeout (#38249)
issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38248

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-05 17:45:31 +08:00
Zhen Ye 99279e0bef
enhance: remove the rpc layer of coordinator when enabling standalone or mixcoord (#38246)
issue: #33285
pr: #37815

- remove the rpc layer of coordinator when enabling standalone or
mixcoord
- move health check into init

---------

Signed-off-by: chyezh <chyezh@outlook.com>
2024-12-05 17:27:53 +08:00
congqixia c4df6b5910
enhance: [10kcp] Refine Replica manager coll2Replicas secondary index (#37907)
Related to #37630

This PR adds a new coll2Replicas secondary index to reduce map
access and iteration when getting replicas by collection.

---------

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2024-12-05 11:57:29 +08:00
yihao.dai d75fb5b3f8
enhance: [10kcp] Reduce mutex contention in datacoord meta (#38229)
1. Use a secondary index to avoid retrieving all segments in
GetSegmentsChanPart.
2. Perform batch SetAllocations to reduce the number of times the meta
lock is acquired.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38219

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-05 11:57:07 +08:00
yihao.dai 3219b869a3
fix: [10kcp] Fix timeout when listing meta (#38152)
When there are too many key-value pairs, the etcd list operation may
time out. This PR replaces LoadWithPrefix in list operations, which
could involve many keys, with WalkWithPrefix.

issue: https://github.com/milvus-io/milvus/issues/37917

pr: https://github.com/milvus-io/milvus/pull/38151

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-03 14:15:49 +08:00
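The difference between loading a whole prefix at once and walking it in pages, as in #38152 above, can be shown against an in-memory store. Everything here is a hypothetical stand-in: `kvStore`, `listPage`, and `walkWithPrefix` only mimic the idea of a paginated walk and are not the Milvus or etcd client APIs.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// kvStore is an in-memory stand-in for a remote key-value store.
type kvStore struct {
	keys      []string // sorted
	data      map[string]string
	pageCalls int // number of bounded list requests issued
}

func newKVStore(data map[string]string) *kvStore {
	keys := make([]string, 0, len(data))
	for k := range data {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return &kvStore{keys: keys, data: data}
}

// listPage returns up to limit keys with prefix, strictly after startAfter.
func (s *kvStore) listPage(prefix, startAfter string, limit int) []string {
	s.pageCalls++
	i := sort.SearchStrings(s.keys, prefix)
	if startAfter != "" {
		i = sort.Search(len(s.keys), func(j int) bool { return s.keys[j] > startAfter })
	}
	var page []string
	for ; i < len(s.keys) && len(page) < limit; i++ {
		if !strings.HasPrefix(s.keys[i], prefix) {
			break
		}
		page = append(page, s.keys[i])
	}
	return page
}

// walkWithPrefix visits every key under prefix, one bounded page at a time.
func (s *kvStore) walkWithPrefix(prefix string, batchSize int, fn func(k, v string) error) error {
	if batchSize < 1 {
		batchSize = 1
	}
	last := ""
	for {
		page := s.listPage(prefix, last, batchSize)
		for _, k := range page {
			if err := fn(k, s.data[k]); err != nil {
				return err
			}
		}
		if len(page) < batchSize {
			return nil // short page: nothing left under the prefix
		}
		last = page[len(page)-1]
	}
}

func main() {
	store := newKVStore(map[string]string{
		"seg/1": "a", "seg/2": "b", "seg/3": "c", "idx/1": "x",
	})
	visited := 0
	_ = store.walkWithPrefix("seg/", 2, func(k, v string) error {
		visited++
		return nil
	})
	fmt.Println(visited, store.pageCalls)
}
```

Each request stays small regardless of how many keys share the prefix, which is what keeps a single list call from timing out.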
yihao.dai 0c29d8ff64
enhance: [10kcp] Update segment manager (#38153)
Use a channel level key lock for segments in segmentManager.

issue: https://github.com/milvus-io/milvus/issues/37633,
https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/37836

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-03 14:15:35 +08:00
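A channel-level key lock, as mentioned in #38153 above, replaces one global mutex with one mutex per channel name, so work on different channels no longer serializes. A minimal sketch with a hypothetical `keyLock` type; for brevity it never frees idle entries, which a production version would need to handle.

```go
package main

import (
	"fmt"
	"sync"
)

// keyLock hands out one mutex per key (here: per channel name).
type keyLock struct {
	mu    sync.Mutex
	locks map[string]*sync.Mutex
}

func newKeyLock() *keyLock {
	return &keyLock{locks: make(map[string]*sync.Mutex)}
}

// lockFor returns the mutex for key, creating it on first use.
func (k *keyLock) lockFor(key string) *sync.Mutex {
	k.mu.Lock()
	defer k.mu.Unlock()
	l, ok := k.locks[key]
	if !ok {
		l = &sync.Mutex{}
		k.locks[key] = l
	}
	return l
}

// WithLock runs fn while holding only the lock that belongs to key.
func (k *keyLock) WithLock(key string, fn func()) {
	l := k.lockFor(key)
	l.Lock()
	defer l.Unlock()
	fn()
}

// countConcurrently increments one counter per channel from many goroutines.
func countConcurrently(channels []string, perChannel int) map[string]int {
	kl := newKeyLock()
	counts := make(map[string]*int, len(channels))
	for _, ch := range channels {
		counts[ch] = new(int)
	}
	var wg sync.WaitGroup
	for _, ch := range channels {
		for i := 0; i < perChannel; i++ {
			wg.Add(1)
			go func(ch string) {
				defer wg.Done()
				kl.WithLock(ch, func() { *counts[ch]++ })
			}(ch)
		}
	}
	wg.Wait()
	out := make(map[string]int, len(channels))
	for ch, c := range counts {
		out[ch] = *c
	}
	return out
}

func main() {
	fmt.Println(countConcurrently([]string{"ch-0", "ch-1"}, 100))
}
```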
yihao.dai 338ccc9ff9
enhance: [10kcp] Reduce memory usage of BF in DataNode and QueryNode (#38133)
1. DataNode: Skip generating BF during the insert phase (BF will be
regenerated during the sync phase).
2. QueryNode: Skip generating or maintaining BF for growing segments;
deletion checks will be handled in the segcore.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/38129

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-12-02 14:41:19 +08:00
yihao.dai 0930430a68
enhance: [10kcp] Skip creating partition rate limiters when not enabled (#38062)
issue: https://github.com/milvus-io/milvus/issues/37630

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-28 10:45:46 +08:00
yihao.dai 635d161109
enhance: [10kcp] Accelerate observe collection (#38058)
issue: https://github.com/milvus-io/milvus/issues/37630

---------

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-28 10:05:24 +08:00
yihao.dai 312475d1f1
enhance: [10kcp] remove the rpc level of coordinator (#37984)
issue: https://github.com/milvus-io/milvus/issues/37764

- add a local client to call local server directly for
querycoord/rootcoord/datacoord.
- enable local client if milvus is running in mixcoord or standalone mode.

Signed-off-by: chyezh <chyezh@outlook.com>

Co-authored-by: Zhen Ye <chyezh@outlook.com>
2024-11-25 14:50:42 +08:00
yihao.dai e5c16e0676
fix: [10kcp] Fix slow checkGeneralCapacity (#37981)
Cache the general count to speed up checkGeneralCapacity.

issue: https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/37976

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-25 14:50:24 +08:00
yihao.dai fd30034c77
fix: [10kcp] Fix data view and add more ut (#37915)
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-21 21:35:42 +08:00
yihao.dai 4845e4d679
enhance: [10kcp] Revert "enhance: remove the rpc level of coordinator" (#37914)
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-21 21:35:29 +08:00
yihao.dai bf90e55319
enhance: [10kcp] Reduce GetRecoveryInfo calls (#37891)
1. Introduce a data view mechanism for DataCoord, attempting to update
each collection's data view periodically.
2. QueryCoord maintains a cache of data view versions. Before
batch-fetching recovery info, it retrieves all versions and only fetches
recovery info for collections with updated versions.
3. Return DataCoord's current data view when fetching RecoverInfo.

issue: https://github.com/milvus-io/milvus/issues/37743,
https://github.com/milvus-io/milvus/issues/37630

pr: https://github.com/milvus-io/milvus/pull/37863

Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
2024-11-21 15:43:13 +08:00
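Step 2 of #37891 above, fetching recovery info only for collections whose data view version changed, can be sketched with a small version cache. The type and method names below are hypothetical, and the actual RPCs between QueryCoord and DataCoord are left out.

```go
package main

import (
	"fmt"
	"sort"
)

// versionCache remembers the last data view version seen per collection.
type versionCache struct {
	seen map[int64]int64
}

func newVersionCache() *versionCache {
	return &versionCache{seen: make(map[int64]int64)}
}

// Changed returns the collections whose version is new or differs from the cache.
func (c *versionCache) Changed(current map[int64]int64) []int64 {
	var ids []int64
	for id, v := range current {
		if old, ok := c.seen[id]; !ok || old != v {
			ids = append(ids, id)
		}
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })
	return ids
}

// Commit records a version once its recovery info has been fetched.
func (c *versionCache) Commit(id, version int64) {
	c.seen[id] = version
}

func main() {
	cache := newVersionCache()
	versions := map[int64]int64{100: 1, 200: 1}

	// First round: nothing is cached, so every collection needs a fetch.
	for _, id := range cache.Changed(versions) {
		cache.Commit(id, versions[id])
	}

	// Second round: only collection 200 moved to a new data view.
	versions[200] = 2
	fmt.Println(cache.Changed(versions))
}
```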
Zhen Ye ce8069c0fd
enhance: remove the rpc layer of coordinator when enabling standalone or mixcoord (#37892)
issue: #37764

- add a local client to call local server directly for
querycoord/rootcoord/datacoord.
- enable local client if milvus is running in mixcoord or standalone mode.

Signed-off-by: chyezh <chyezh@outlook.com>
2024-11-21 15:42:18 +08:00
Zhen Ye 1a6b98be77
enhance: remove the rpc level of coordinator (#37876)
issue: #33285
pr: #37722

- move most cgo operations related to search/query into the segcore package
so that streamingnode can reuse them.
- add go unit tests for segcore operations.

Signed-off-by: chyezh <chyezh@outlook.com>
2024-11-21 15:21:11 +08:00