issue: #38718
The balancer calculates the workload of executing tasks as an ongoing
score for target nodes. However, a logic issue arises when
GetSegmentTaskDelta or GetChannelTaskDelta is called with
collectionID=-1, which incorrectly returns zero.
Due to the incorrect global score, the executing task's workload is not
properly reflected for each collection. Consequently, each collection
submits its own balance task, leading to the balancer assigning
excessive tasks to the same QueryNode.
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
1. A collection should observe the channel only once.
2. A collection should check the CollectionLoadPercent for updates only
once.
3. Skip saving coll/partition meta if there are no changes, primarily to
accelerate collection observation after recovery.
issue: https://github.com/milvus-io/milvus/issues/37630
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
issue: #35563
1. Use an internal health checker to monitor the cluster's health state,
storing the latest state on the coordinator node. The CheckHealth
request retrieves the cluster's health from this latest state on the
proxy sides, which enhances cluster stability.
2. Each health check will assess all collections and channels, with
detailed failure messages temporarily saved in the latest state.
3. Use CheckHealth request instead of the heavy GetMetrics request on
the querynode and datanode
Signed-off-by: jaime <yun.zhang@zilliz.com>
the task limit in assignSegment/assignChannel will works for both load
task and balance task.
this PR remove the load task limit, only limit balance task num in one
round.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #38142
current balance channel policy only consider current collection's
distribution, so if all collections has 1 channel, and all channels has
been loaded on same querynode, after querynode num increase, balance
channel won't be triggered.
This PR enable score based balance channel policy, to achieve:
1. distribute all channels evenly across multiple querynodes
2. distribute each collection's channel evenly across multiple
querynodes.
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/37764
- After removing rpc layer from mixcoord, the querycoord at standby mode
will be blocked forever of deployment rolling
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #38237
this PR only use better compression level for proto msg which is larger
than 1MB, and use a lighter compression level for smaller proto msg,
which could get a better latency in most case.
this PR could reduce the latency from 22.7s to 4.7s with 10000
collctions and each collections has 1000 segments.
before this PR:
BenchmarkTargetManager-8 1 22781536357 ns/op 566407275088 B/op 11188282
allocs/op
after this PR:
BenchmarkTargetManager-8 1 4729566944 ns/op 36713248864 B/op 10963615
allocs/op
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #38201
leader task require to update delegator's distribution, and only success
after the distribution change has been applyed to delegator. but the
delegator will reject the distribution change if it's version is older
than current version in delegator. which cause the leader task stuck and
retry forever.
this PR remove the leader task finish check.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #38305
after we disable balance segment and balance channel happens at same
time, the constriant which require release segment must happens on
serviceable shard leader is unnessary.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Related to #36456
Unify collection/partition number metrics to collection manager in case
of unwant missing modification
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #35917
This PR refine the querycoord meta related interfaces to ensure that
each method includes a ctx parameter.
Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
Related to #37630
This PR add a new util coll2Replicas secondary index to reduce map
access & iteration while get replicas by collection
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #33550
balance segment and balance segment execute at same time, which will
cause bounch of corner case.
This PR disable simultaneous balance of segments and channels
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #37830
casue dist handler doesn't set channel's version, so if channel checker
try to dedup channel, it may release the new delegator after balance
finished.
this PR fix the way to set proper version for channel.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #37679
pr #36549 introduce the logic error which update current target when
only parts of channel is ready.
This PR fix the logic error and let dist handler keep pull distribution
on querynode until all delegator becomes serviceable.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: ##36621
- For simple types in a struct, add "string" to the JSON tag for
automatic string conversion during JSON encoding.
- For complex types in a struct, replace "int64" with "string."
Signed-off-by: jaime <yun.zhang@zilliz.com>
issue: #37640
fix the pr #36549
cause balance channel will wait until new delegator becomes serviceable,
but new delegator need to sync target version then becomes serviceable,
and sync target version need to be wait all replica load done. so if
increasing replica number and balance channel happens at same time,
logic dead lock occurs.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #36293#36242
after qn recover, delegator may be loaded in new node, after all segment
has been loaded, delegator becomes serviceable. but delegator's target
version hasn't been synced, and if search/query comes, delegator will
use wrong target version to filter out a empty segment list, which
caused empty search result.
This pr will block delegator's serviceable status until target version
is synced
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Related to #35415
In rolling upgrade, legacy proxy may dispatch load request wit empty
load field list. The upgraded querycoord may report error by mistake
that load field list is changed.
This PR:
- Auto field empty load field list with all user field ids
- Refine the error messag when load field list updates
- Refine load job unit test with service cases
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
When there're a lot of loaded collections, they would occupy the target
observer scheduler’s pool. This prevents loading collections from
updating the current target in time, slowing down the load process.
This PR adds a separate target dispatcher for loading collections.
issue: https://github.com/milvus-io/milvus/issues/37166
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
When check health logic failed to collection not-queryable, the related
reason is hard to find in log.
This PR add context for log with trace id and print unqueryable
collection info log.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #36977
with node_label_filter on resource group, user can add label on
querynode with env `MILVUS_COMPONENT_LABEL`, then resource group will
prefer to accept node which match it's node_label_filter.
then querynode's can't be group by labels, and put querynodes with same
label to same resource groups.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #36970
cause release segment and balance channel may happen at same time, and
before new delegator become serviceable, if release segment exeuctes on
new delegator, and search/query comes on old delegator, then release
segment and query segment happens in parallel, if release segment
execute first in worker, then search/query will got a SegmentNodeLoaded
error.
This PR add serviceable filter on delegator, then all load/release
segment operation will happens on serviceable delegator.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>