milvus

Commit Graph

Author	SHA1	Message	Date
yihao.dai	584b054981	fix: [10kcp] Fix concurrent schedule (#38974 ) supplement to pr: https://github.com/milvus-io/milvus/pull/38566 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-01-03 12:59:12 +08:00
yihao.dai	f5f4fed889	fix: [10kcp] channel unbalance during stopping balance progress (#38972 ) issue: https://github.com/milvus-io/milvus/issues/38970, https://github.com/milvus-io/milvus/issues/37630 The stopping balance channel still uses the row_count_based policy, which may cause channel unbalance in the multi-collection case. This PR implements a score-based stopping balance channel policy. pr: https://github.com/milvus-io/milvus/pull/38971 Signed-off-by: bigsheeper <yihao.dai@zilliz.com> Co-authored-by: Wei Liu <wei.liu@zilliz.com>	2025-01-03 12:05:03 +08:00
yihao.dai	9b2b2a2689	enhance: [10kcp] Remove scheduler and target manager mutex (#38968 ) supplement to PR https://github.com/milvus-io/milvus/pull/38566 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-01-03 11:18:52 +08:00
yihao.dai	663ec6f822	enhance: [10kcp] Reducing the granularity of locks in the target manager (#38956 ) pr: https://github.com/milvus-io/milvus/pull/38566 Just for test, I'll remove the global mutex later. Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-01-02 20:54:47 +08:00
yihao.dai	7934c211c6	fix: [10kcp] Querycoord will trigger unexpected balance task after restart (#38951 ) pr: https://github.com/milvus-io/milvus/pull/38944 Signed-off-by: bigsheeper <yihao.dai@zilliz.com> Co-authored-by: wei liu <wei.liu@zilliz.com>	2025-01-02 17:54:21 +08:00
yihao.dai	b6c18f756f	enhance: [10kcp] Add log for cases where the target is not updated as expected (#38952 ) pr: https://github.com/milvus-io/milvus/pull/38944 Signed-off-by: bigsheeper <yihao.dai@zilliz.com> Co-authored-by: Wei Liu <wei.liu@zilliz.com>	2025-01-02 17:53:29 +08:00
yihao.dai	734a20ac01	enhance: [10kcp] Add logs for querycoord checker (#38953 ) issue: https://github.com/milvus-io/milvus/issues/37630 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2025-01-02 17:52:54 +08:00
yihao.dai	c25fdf8080	enhance: [10kcp] Prevent frequently updating metric (#38828 ) issue: https://github.com/milvus-io/milvus/issues/37630 pr: https://github.com/milvus-io/milvus/pull/38827 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-12-28 00:04:04 +08:00
yihao.dai	fee1f77d4e	fix: [10kcp] Fix incorrect memory estimation for small segments (#38814 ) Skip estimation index memory logic for segments without index file. issue: https://github.com/milvus-io/milvus/issues/37630 pr: https://github.com/milvus-io/milvus/pull/38813 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-12-27 16:07:29 +08:00
yihao.dai	15b9f51728	enhance: [10kcp] Add channel index in target, optimize logs (#38804 ) supplement to pr: https://github.com/milvus-io/milvus/pull/38566 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-12-27 10:44:03 +08:00
yihao.dai	4d0594ba04	fix: [10kcp] Fix rootcoord meta mutex contention (#38803 ) RootCoord meta uses copy-on-write, allowing the removal of unnecessary copies. issue: https://github.com/milvus-io/milvus/issues/37630 pr: https://github.com/milvus-io/milvus/pull/38799 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-12-27 10:17:33 +08:00
yihao.dai	e13b8a2f58	enhance: [10kcp] Optimize save collection target latency (#38345 ) (#38370 ) (#38795 ) issue: #38237 pr: #38345 This PR uses a better compression level only for proto messages larger than 1MB and a lighter compression level for smaller proto messages, which gives better latency in most cases. It reduces the latency from 22.7s to 4.7s with 10000 collections, each with 1000 segments. Before this PR: BenchmarkTargetManager-8 1 22781536357 ns/op 566407275088 B/op 11188282 allocs/op After this PR: BenchmarkTargetManager-8 1 4729566944 ns/op 36713248864 B/op 10963615 allocs/op Signed-off-by: Wei Liu <wei.liu@zilliz.com> Co-authored-by: wei liu <wei.liu@zilliz.com>	2024-12-26 21:49:07 +08:00
yihao.dai	501d1b58cf	Revert "fix: [10kcp] Query coord stop progress is too slow (#38300 )" (#38794 ) This reverts commit `ae4e2b8063`.	2024-12-26 21:48:41 +08:00
yihao.dai	05f50b11ff	fix: [10kcp] Fix slow preprocess in qc scheduler (#38784 ) supplement to pr: https://github.com/milvus-io/milvus/pull/38566 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-12-26 17:05:44 +08:00
yihao.dai	7f5467577e	fix: [10kcp] Fix index meta mutex contention (#38777 ) issue: https://github.com/milvus-io/milvus/issues/37630 Reduce the frequency of updateIndexTasksMetrics to avoid holding the mutex for long periods. pr: https://github.com/milvus-io/milvus/pull/38775 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-12-26 17:04:14 +08:00
yihao.dai	1969ab3da7	enhance: Optimize GetLocalDiskSize and segment loader mutex (#38683 ) fix of: https://github.com/milvus-io/milvus/pull/38599 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-12-24 11:19:58 +08:00
yihao.dai	bf27f70c32	enhance: [10kcp] Optimize GetLocalDiskSize and segment loader mutex (#38601 ) fix of pr: https://github.com/milvus-io/milvus/pull/38599 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-12-19 21:29:13 +08:00
yihao.dai	ecd55596cf	enhance: [10kcp] Optimize GetLocalDiskSize and segment loader mutex (#38600 ) 1. Make the segment loader lock protect only the resource. 2. Optimize GetDiskUsage to avoid excessive overhead. issue: https://github.com/milvus-io/milvus/issues/37630 pr: https://github.com/milvus-io/milvus/pull/38599 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-12-19 21:14:26 +08:00
congqixia	f5ae24f955	fix: [10kcp] SyncSegments rpc always failed (#38032 ) (#38579 ) Cherry-pick from 2.4 pr: #38032 issue: #38031 The call to `cli.SyncSegments` uses a ctx that has already been overridden and canceled, so the SyncSegments rpc always fails. Signed-off-by: Wei Liu <wei.liu@zilliz.com> Signed-off-by: Congqi Xia <congqi.xia@zilliz.com> Co-authored-by: wei liu <wei.liu@zilliz.com>	2024-12-19 14:01:46 +08:00
yihao.dai	c3d4469259	enhance: Print observe time (#38575 ) Print observe, dist handing and schedule time. issue: https://github.com/milvus-io/milvus/issues/37630 pr: https://github.com/milvus-io/milvus/pull/38566 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-12-19 11:46:13 +08:00
yihao.dai	ca234e7847	fix: [10kcp] Fix slow dist handle and slow observe (#38567 ) 1. Provide partition-level indexing in the collection target. 2. Make SegmentAction not wait for distribution. 3. Optimize logging to reduce CPU overhead. issue: https://github.com/milvus-io/milvus/issues/37630 pr: https://github.com/milvus-io/milvus/pull/38566 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-12-18 21:00:39 +08:00
congqixia	999437e76e	enhance: [10kcp] Trim data distribution resp index info (#38521 ) Related to #37630 Data distribution became too large when segment number was huge. This PR trims the index info struct and returns needed info only. Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-12-17 15:20:26 +08:00
congqixia	28841ebdf9	enhance: [10kcp] Simplify querynode tsafe & reduce goroutine number (#38416 ) (#38433 ) Related to #37630 The TSafe manager is too complex for the current implementation, and each delegator needs one goroutine waiting for the tsafe update event. Tsafe updating could be executed in the pipeline. This PR removes the tsafe manager and simplifies the entire logic of tsafe updating. Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-12-13 21:20:57 +08:00
yihao.dai	de78de7689	fix: [10kcp] Fix consume blocked due to too many consumers (#38456 ) This PR limits the maximum number of consumers per pchannel to 10 for each QueryNode and DataNode. issue: https://github.com/milvus-io/milvus/issues/37630 pr: https://github.com/milvus-io/milvus/pull/38455 --------- Signed-off-by: SimFG <bang.fu@zilliz.com> Signed-off-by: bigsheeper <yihao.dai@zilliz.com> Co-authored-by: SimFG <bang.fu@zilliz.com>	2024-12-13 21:20:47 +08:00
yihao.dai	df4d5e1096	enhance: [10kcp] Read metadata concurrently to accelerate recovery (#38404 ) Read metadata such as segments, binlogs, and partitions concurrently at the collection level. issue: https://github.com/milvus-io/milvus/issues/37630 pr: https://github.com/milvus-io/milvus/pull/38403 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-12-12 16:39:06 +08:00
yihao.dai	11118db7d6	enhance: [10kcp] remove unnecessary clone in meta cache (#38398 ) issue: https://github.com/milvus-io/milvus/issues/36627, https://github.com/milvus-io/milvus/issues/37630 pr: https://github.com/milvus-io/milvus/pull/36628 Signed-off-by: bigsheeper <yihao.dai@zilliz.com> Co-authored-by: Ted Xu <ted.xu@zilliz.com>	2024-12-12 16:33:38 +08:00
congqixia	5521091dcd	enhance: [10kcp] Refine querynode collection number metrics (#38352 ) Related to #37630 Previously the loaded collection metrics were calculated by scanning all loaded segments in the segment manager, which is a slow and buggy implementation. This PR: - Move collection num metrics to collection manager - Remove deprecated loaded partition metrics update logic Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-12-10 21:06:42 +08:00
yihao.dai	4a2a5f0183	fix: [10kcp] Fix standby mixcoord start failed (#38327 ) fix of https://github.com/milvus-io/milvus/pull/38324 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-12-10 11:47:45 +08:00
yihao.dai	15b01daec5	fix: [10kcp] Fix standby mixcoord start failed (#38324 ) When standby transitions to active, the component state changes to Initialize. If the initialization takes too long (exceeding the liveness probe's maximum retries), the standby pod is stopped and fails to start. This PR removes the Initialize state during standby transitions in rolling upgrades. The state now switches directly from standby to healthy, preventing health check failures. issue: https://github.com/milvus-io/milvus/issues/37630 pr: https://github.com/milvus-io/milvus/pull/38308 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-12-10 10:53:50 +08:00
congqixia	24a055996b	enhance: [10kcp] Add secondary index for querynode segment manager (#38312 ) Cherry pick from pr #38311 Related to #37630 Add secondary index with vchannel to reduce `GetBy` rlock holding time when segment number is large. Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-12-09 19:56:16 +08:00
yihao.dai	3e65cc5850	enhance: [10kcp] Enable score based balance channel policy (#38301 ) issue: https://github.com/milvus-io/milvus/issues/38142 The current balance channel policy only considers the current collection's distribution, so if every collection has 1 channel and all channels have been loaded on the same querynode, balance channel won't be triggered after the querynode number increases. This PR enables a score-based balance channel policy to: 1. distribute all channels evenly across multiple querynodes 2. distribute each collection's channels evenly across multiple querynodes. pr: https://github.com/milvus-io/milvus/pull/38143 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com> Co-authored-by: Wei Liu <wei.liu@zilliz.com>	2024-12-09 19:50:05 +08:00
yihao.dai	ae4e2b8063	fix: [10kcp] Query coord stop progress is too slow (#38300 ) issue: https://github.com/milvus-io/milvus/issues/38237 Query coord saves each collection's target during stop progress, which is used for the new querycoord's fast recovery. But if the milvus cluster has thousands of collections, query coord's stop progress becomes much slower than expected. This PR refines the implementation to save a collection's target to etcd when the target updates, and to clean it when the collection is released. pr: https://github.com/milvus-io/milvus/pull/38238 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com> Co-authored-by: Wei Liu <wei.liu@zilliz.com>	2024-12-09 19:49:49 +08:00
yihao.dai	2fe6423552	enhance: [10kcp] Speed up meta recovery (#38298 ) Increase the batchSize in WalkWithPrefix operations to 10000. issue: https://github.com/milvus-io/milvus/issues/37630 pr: https://github.com/milvus-io/milvus/pull/38285 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-12-09 19:49:35 +08:00
yihao.dai	3d490aa158	fix: [10kcp] Replace outer lock with concurrent map (#38286 ) See also: #37493 pr: #37817 Signed-off-by: yangxuan <xuan.yang@zilliz.com> Co-authored-by: XuanYang-cn <xuan.yang@zilliz.com>	2024-12-09 19:49:20 +08:00
yihao.dai	df100e5bbe	fix: [10kcp] Fix init rootcoord meta timeout (#38249 ) issue: https://github.com/milvus-io/milvus/issues/37630 pr: https://github.com/milvus-io/milvus/pull/38248 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-12-05 17:45:31 +08:00
Zhen Ye	99279e0bef	enhance: remove the rpc layer of coordinator when enabling standalone or mixcoord (#38246 ) issue: #33285 pr: #37815 - remove the rpc layer of coordinator when enabling standalone or mixcoord - move health check into init --------- Signed-off-by: chyezh <chyezh@outlook.com>	2024-12-05 17:27:53 +08:00
congqixia	c4df6b5910	enhance: [10kcp] Refine Replica manager colle2Replicas secondary index (#37907 ) Related to #37630 This PR adds a new util coll2Replicas secondary index to reduce map access & iteration when getting replicas by collection --------- Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>	2024-12-05 11:57:29 +08:00
yihao.dai	d75fb5b3f8	enhance: [10kcp] Reduce mutex contention in datacoord meta (#38229 ) 1. Using secondary index to avoid retrieving all segments at GetSegmentsChanPart. 2. Perform batch SetAllocations to reduce the number of times the meta lock is acquired. issue: https://github.com/milvus-io/milvus/issues/37630 pr: https://github.com/milvus-io/milvus/pull/38219 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-12-05 11:57:07 +08:00
yihao.dai	3219b869a3	fix: [10kcp] Fix timeout when listing meta (#38152 ) When there are too many key-value pairs, the etcd list operation may time out. This PR replaces LoadWithPrefix in list operations, which could involve many keys, with WalkWithPrefix. issue: https://github.com/milvus-io/milvus/issues/37917 pr: https://github.com/milvus-io/milvus/pull/38151 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-12-03 14:15:49 +08:00
yihao.dai	0c29d8ff64	enhance: [10kcp] Update segment manager (#38153 ) Use a channel level key lock for segments in segmentManager. issue: https://github.com/milvus-io/milvus/issues/37633, https://github.com/milvus-io/milvus/issues/37630 pr: https://github.com/milvus-io/milvus/pull/37836 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-12-03 14:15:35 +08:00
yihao.dai	338ccc9ff9	enhance: [10kcp] Reduce memory usage of BF in DataNode and QueryNode (#38133 ) 1. DataNode: Skip generating BF during the insert phase (BF will be regenerated during the sync phase). 2. QueryNode: Skip generating or maintaining BF for growing segments; deletion checks will be handled in the segcore. issue: https://github.com/milvus-io/milvus/issues/37630 pr: https://github.com/milvus-io/milvus/pull/38129 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-12-02 14:41:19 +08:00
yihao.dai	0930430a68	enhance: [10kcp] Skip creating partition rate limiters when not enable (#38062 ) issue: https://github.com/milvus-io/milvus/issues/37630 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-11-28 10:45:46 +08:00
yihao.dai	635d161109	enhance: [10kcp] Accelerate observe collection (#38058 ) issue: https://github.com/milvus-io/milvus/issues/37630 --------- Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-11-28 10:05:24 +08:00
yihao.dai	312475d1f1	enhance: [10kcp] remove the rpc level of coordinator (#37984 ) issue: https://github.com/milvus-io/milvus/issues/37764 - add a local client to call local server directly for querycoord/rootcoord/datacoord. - enable local client if milvus is running mixcoord or standalone mode. Signed-off-by: chyezh <chyezh@outlook.com> --------- Signed-off-by: chyezh <chyezh@outlook.com> Co-authored-by: Zhen Ye <chyezh@outlook.com>	2024-11-25 14:50:42 +08:00
yihao.dai	e5c16e0676	fix: [10kcp] Fix checkGeneralCapacity slowly (#37981 ) Cache the general count to speed up checkGeneralCapacity. issue: https://github.com/milvus-io/milvus/issues/37630 pr: https://github.com/milvus-io/milvus/pull/37976 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-11-25 14:50:24 +08:00
yihao.dai	fd30034c77	fix: [10kcp] Fix data view and add more ut (#37915 ) Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-11-21 21:35:42 +08:00
yihao.dai	4845e4d679	enhance: [10kcp] Revert "enhance: remove the rpc level of coordinator" (#37914 ) Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-11-21 21:35:29 +08:00
yihao.dai	bf90e55319	enhance: [10kcp] Reduce GetRecoveryInfo calls (#37891 ) 1. Introduce a data view mechanism for DataCoord, attempting to update each collection's data view periodically. 2. QueryCoord maintains a cache of data view versions. Before batch-fetching recovery info, it retrieves all versions and only fetches recovery info for collections with updated versions. 3. Return DataCoord's current data view when fetching RecoverInfo. issue: https://github.com/milvus-io/milvus/issues/37743, https://github.com/milvus-io/milvus/issues/37630 pr: https://github.com/milvus-io/milvus/pull/37863 Signed-off-by: bigsheeper <yihao.dai@zilliz.com>	2024-11-21 15:43:13 +08:00
Zhen Ye	ce8069c0fd	enhance: remove the rpc layer of coordinator when enabling standalone or mixcoord (#37892 ) issue: #37764 - add a local client to call local server directly for querycoord/rootcoord/datacoord. - enable local client if milvus is running mixcoord or standalone mode. Signed-off-by: chyezh <chyezh@outlook.com>	2024-11-21 15:42:18 +08:00
Zhen Ye	1a6b98be77	enhance: remove the rpc level of coordinator (#37876 ) issue: #33285 pr: #37722 - move most cgo operations related to search/query into segcore package for reuse by streamingnode. - add go unittest for segcore operations. Signed-off-by: chyezh <chyezh@outlook.com>	2024-11-21 15:21:11 +08:00

20660 Commits (10kcp)