issue: #38718
The balancer calculates the workload of executing tasks as an ongoing
score for target nodes. However, a logic issue arises when
GetSegmentTaskDelta or GetChannelTaskDelta is called with
collectionID=-1, which incorrectly returns zero.
Due to the incorrect global score, the executing task's workload is not
properly reflected for each collection. Consequently, each collection
submits its own balance task, leading to the balancer assigning
excessive tasks to the same QueryNode.
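The intended semantics can be sketched as follows. This is a minimal illustration, not the querycoord code: `taskDelta`, `sumDelta` and the `allCollections` sentinel are hypothetical names, and the only assumption taken from the description above is that collectionID=-1 should mean "all collections" rather than match nothing.

```go
package main

import "fmt"

// allCollections is a hypothetical sentinel meaning "do not filter by collection".
const allCollections int64 = -1

// taskDelta records the pending workload change of executing tasks
// for one (node, collection) pair.
type taskDelta struct {
	nodeID       int64
	collectionID int64
	delta        int
}

// sumDelta returns the pending workload on nodeID. When collectionID is
// allCollections, the deltas of every collection on that node are summed
// instead of returning zero.
func sumDelta(deltas []taskDelta, nodeID, collectionID int64) int {
	total := 0
	for _, d := range deltas {
		if d.nodeID != nodeID {
			continue
		}
		if collectionID != allCollections && d.collectionID != collectionID {
			continue
		}
		total += d.delta
	}
	return total
}

func main() {
	deltas := []taskDelta{
		{nodeID: 1, collectionID: 100, delta: 3},
		{nodeID: 1, collectionID: 200, delta: 2},
		{nodeID: 2, collectionID: 100, delta: 5},
	}
	fmt.Println("node 1, collection 100:", sumDelta(deltas, 1, 100))
	fmt.Println("node 1, all collections:", sumDelta(deltas, 1, allCollections))
}
```

With a global sum available, a second collection can see the tasks already queued for a node by the first one before submitting its own.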
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
When deleting by partition_key, Milvus generates L0 segments
globally. During L0 compaction, those L0 segments touch all
partitions collection-wide. Due to the false-positive rate of segment
bloom filters, L0 compactions append false deltalogs to completely
irrelevant partitions, which causes partition deletion amplification.
This PR uses partition_key to set the targeted partitionID when producing
deleteMsgs into MsgStreams. This narrows the scope of L0 segments to
partition level and removes the collection-wide false-positive
influence.
However, due to the DeleteMsg structure, each deleteMsg can be labeled
with only one partition, so this enhancement does not apply if a user
deletes by two or more partition_keys in one deletion.
See also: #34665
Signed-off-by: yangxuan <xuan.yang@zilliz.com>
The patch was missed due to code branching.
previous pr: #38032
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Co-authored-by: Wei Liu <wei.liu@zilliz.com>
1. A collection should observe the channel only once.
2. A collection should check the CollectionLoadPercent for updates only
once.
3. Skip saving coll/partition meta if there are no changes, primarily to
accelerate collection observation after recovery.
issue: https://github.com/milvus-io/milvus/issues/37630
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
issue: #38399
- move the lifetime implementation of common code out of the server
level lifetime implementation
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #35563
1. Use an internal health checker to monitor the cluster's health state,
storing the latest state on the coordinator node. The CheckHealth
request retrieves the cluster's health from this latest state on the
proxy sides, which enhances cluster stability.
2. Each health check will assess all collections and channels, with
detailed failure messages temporarily saved in the latest state.
3. Use CheckHealth request instead of the heavy GetMetrics request on
the querynode and datanode
Signed-off-by: jaime <yun.zhang@zilliz.com>
The task limit in assignSegment/assignChannel applies to both load
tasks and balance tasks.
This PR removes the load task limit and only limits the number of
balance tasks in one round.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
The level zero mutex can be removed since all operations are guarded by
the segment manager mutex.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #37630
The TSafe manager is too complex for the current implementation, and each
delegator needs one goroutine waiting for tsafe update events.
Tsafe updating can be executed in the pipeline. This PR removes the tsafe
manager and simplifies the entire tsafe updating logic.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Sparse vectors may have an arbitrary number of non-zeros (nnz), and it is
hard to optimize without knowing the actual distribution of nnz. This PR
adds a metric for analyzing that.
issue: https://github.com/milvus-io/milvus/issues/35853
Compared with https://github.com/milvus-io/milvus/pull/38328, this
also includes a metric for FTS in the query node delegator.
It also fixes a sparse vector bug when searching by pk.
Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
Related to #38372
This PR makes drop partition check only the target partition's load state,
in case other partitions in the same collection are being released
concurrently.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #38142
The current balance channel policy only considers the current collection's
distribution, so if every collection has 1 channel and all channels have
been loaded on the same querynode, balance channel won't be triggered
after the number of querynodes increases.
This PR enables a score-based balance channel policy to achieve:
1. distribute all channels evenly across multiple querynodes
2. distribute each collection's channels evenly across multiple
querynodes.
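One way to read the two goals is as a two-level comparison when choosing a target node. The sketch below is an assumption for illustration only: `nodeLoad`, `less` and `pickNode` are hypothetical names, and the real score function in the balancer may weigh the two factors differently.

```go
package main

import (
	"fmt"
	"sort"
)

// nodeLoad is a hypothetical per-querynode view of channel placement.
type nodeLoad struct {
	total        int           // channels of all collections on this node
	byCollection map[int64]int // channels per collection on this node
}

// less reports whether a is a better target than b for a channel of
// collectionID: fewer channels of that collection first, then fewer
// channels overall.
func less(a, b nodeLoad, collectionID int64) bool {
	ac, bc := a.byCollection[collectionID], b.byCollection[collectionID]
	if ac != bc {
		return ac < bc
	}
	return a.total < b.total
}

// pickNode returns the best target node ID, or -1 when there are no nodes.
// Node IDs are visited in ascending order so that ties resolve deterministically.
func pickNode(loads map[int64]nodeLoad, collectionID int64) int64 {
	ids := make([]int64, 0, len(loads))
	for id := range loads {
		ids = append(ids, id)
	}
	sort.Slice(ids, func(i, j int) bool { return ids[i] < ids[j] })

	best := int64(-1)
	for _, id := range ids {
		if best == -1 || less(loads[id], loads[best], collectionID) {
			best = id
		}
	}
	return best
}

func main() {
	// Two single-channel collections are both loaded on node 1; node 2 is new.
	loads := map[int64]nodeLoad{
		1: {total: 2, byCollection: map[int64]int{100: 1, 200: 1}},
		2: {total: 0, byCollection: map[int64]int{}},
	}
	fmt.Println("target for a channel of collection 300:", pickNode(loads, 300))
}
```

A per-collection view alone sees nothing to fix in this scenario, because each collection has only one channel; the global count is what makes node 2 the preferred target.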
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/37764
- After removing the rpc layer from mixcoord, the querycoord in standby mode
will be blocked forever during rolling deployment
---------
Signed-off-by: chyezh <chyezh@outlook.com>
Related to #37630
Previously the loaded collection metrics were calculated by scanning all
loaded segments in the segment manager, which is a slow and buggy
implementation.
This PR:
- Move collection num metrics to collection manager
- Remove deprecated loaded partition metrics update logic
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #38325
The old implementation only checked grants in the default db before
dropping a role, which may cause a role to be dropped while grants still
exist.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
enhance :
1. alterindex delete properties
We have introduced a new parameter deleteKeys to the alterindex
functionality, which allows for the deletion of properties within an
index. This enhancement provides users with the flexibility to manage
index properties more effectively by removing specific keys as needed.
2. altercollection delete properties
We have introduced a new parameter deleteKeys to the altercollection
functionality, which allows for the deletion of properties within a
collection. This enhancement provides users with the flexibility to
manage collection properties more effectively by removing specific keys
as needed.
3. support altercollectionfield
We currently support modifying the fieldparams of a field in a
collection using altercollectionfield, which only allows changes to the
max-length attribute.
Key Points:
- New Parameter - deleteKeys: This new parameter enables the deletion of
specified properties from an index. By passing a list of keys to
deleteKeys, users can remove the corresponding properties from the
index.
- Mutual Exclusivity: The deleteKeys parameter cannot be used in
conjunction with the extraParams parameter. Users must choose one
parameter to pass based on their requirement. If deleteKeys is provided,
it indicates an intent to delete properties; if extraParams is provided,
it signifies the addition or update of properties.
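The mutual-exclusivity rule can be sketched as below. This is not the Milvus handler: `alterProperties` is a hypothetical helper and properties are modeled as a plain string map, which is an assumption made for brevity.

```go
package main

import (
	"errors"
	"fmt"
)

// alterProperties returns a copy of props with either extraParams applied
// (add or update) or deleteKeys removed. Passing both is rejected, which
// mirrors the mutual-exclusivity rule described above.
func alterProperties(props, extraParams map[string]string, deleteKeys []string) (map[string]string, error) {
	if len(extraParams) > 0 && len(deleteKeys) > 0 {
		return nil, errors.New("deleteKeys and extraParams cannot be used together")
	}
	out := make(map[string]string, len(props)+len(extraParams))
	for k, v := range props {
		out[k] = v
	}
	for k, v := range extraParams {
		out[k] = v
	}
	for _, k := range deleteKeys {
		delete(out, k) // deleting a missing key is a no-op
	}
	return out, nil
}

func main() {
	props := map[string]string{"mmap.enabled": "true", "ttl": "3600"}

	deleted, err := alterProperties(props, nil, []string{"ttl"})
	fmt.Println(deleted, err)

	_, err = alterProperties(props, map[string]string{"ttl": "60"}, []string{"ttl"})
	fmt.Println(err)
}
```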
issue: https://github.com/milvus-io/milvus/issues/37436
---------
Signed-off-by: Xianhui.Lin <xianhui.lin@zilliz.com>
issue: #38237
This PR uses a better compression level only for proto msgs larger
than 1MB, and uses a lighter compression level for smaller proto msgs,
which gives better latency in most cases.
This PR reduces the latency from 22.7s to 4.7s with 10000
collections, each with 1000 segments.
before this PR:
BenchmarkTargetManager-8 1 22781536357 ns/op 566407275088 B/op 11188282 allocs/op
after this PR:
BenchmarkTargetManager-8 1 4729566944 ns/op 36713248864 B/op 10963615 allocs/op
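The size-based choice of level can be sketched as follows. The commit message does not name the codec or the concrete levels, so this sketch uses the standard library's gzip as a stand-in; `levelFor`, `compress`, `decompress` and `largeThreshold` are hypothetical names, and only the 1MB threshold comes from the description above.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
)

// largeThreshold is the size above which the stronger level is used (1 MiB here).
const largeThreshold = 1 << 20

// levelFor picks a cheap level for small payloads and a stronger one for
// large payloads. The gzip levels are an assumption for illustration.
func levelFor(size int) int {
	if size > largeThreshold {
		return gzip.BestCompression
	}
	return gzip.BestSpeed
}

// compress encodes payload with the level chosen from its size.
func compress(payload []byte) ([]byte, error) {
	var buf bytes.Buffer
	w, err := gzip.NewWriterLevel(&buf, levelFor(len(payload)))
	if err != nil {
		return nil, err
	}
	if _, err := w.Write(payload); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decompress reverses compress regardless of the level used.
func decompress(data []byte) ([]byte, error) {
	r, err := gzip.NewReader(bytes.NewReader(data))
	if err != nil {
		return nil, err
	}
	defer r.Close()
	return io.ReadAll(r)
}

func main() {
	small := []byte("segment meta")
	large := bytes.Repeat([]byte("milvus"), 400000) // about 2.4 MB

	for _, p := range [][]byte{small, large} {
		c, err := compress(p)
		if err != nil {
			panic(err)
		}
		fmt.Printf("in=%d bytes, level=%d, out=%d bytes\n", len(p), levelFor(len(p)), len(c))
	}
}
```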
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #38201
A leader task needs to update the delegator's distribution, and only
succeeds after the distribution change has been applied to the delegator.
But the delegator rejects the distribution change if its version is older
than the current version in the delegator, which causes the leader task to
get stuck and retry forever.
This PR removes the leader task finish check.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Related to #37630
Add secondary index with vchannel to reduce `GetBy` rlock holding time
when segment number is large.
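The idea of a secondary index keyed by vchannel can be sketched as below. The type and method names (`segmentIndex`, `Put`, `Remove`, `IDsByChannel`) are hypothetical, not the segment manager's API; the point is that a per-channel lookup touches only that channel's entries while the read lock is held, instead of scanning every segment.

```go
package main

import (
	"fmt"
	"sync"
)

type segment struct {
	ID      int64
	Channel string
}

// segmentIndex keeps a primary map by segment ID and a secondary index
// from vchannel name to the IDs of the segments on that channel.
type segmentIndex struct {
	mu        sync.RWMutex
	byID      map[int64]segment
	byChannel map[string]map[int64]struct{}
}

func newSegmentIndex() *segmentIndex {
	return &segmentIndex{
		byID:      make(map[int64]segment),
		byChannel: make(map[string]map[int64]struct{}),
	}
}

// Put inserts or replaces a segment and keeps both maps consistent.
func (s *segmentIndex) Put(seg segment) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.removeLocked(seg.ID)
	s.byID[seg.ID] = seg
	ids := s.byChannel[seg.Channel]
	if ids == nil {
		ids = make(map[int64]struct{})
		s.byChannel[seg.Channel] = ids
	}
	ids[seg.ID] = struct{}{}
}

// Remove deletes a segment from both maps.
func (s *segmentIndex) Remove(id int64) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.removeLocked(id)
}

// removeLocked must be called with the write lock held.
func (s *segmentIndex) removeLocked(id int64) {
	old, ok := s.byID[id]
	if !ok {
		return
	}
	delete(s.byID, id)
	ids := s.byChannel[old.Channel]
	delete(ids, id)
	if len(ids) == 0 {
		delete(s.byChannel, old.Channel)
	}
}

// IDsByChannel returns the segment IDs on one channel (in no particular
// order) without iterating over segments of other channels.
func (s *segmentIndex) IDsByChannel(channel string) []int64 {
	s.mu.RLock()
	defer s.mu.RUnlock()
	out := make([]int64, 0, len(s.byChannel[channel]))
	for id := range s.byChannel[channel] {
		out = append(out, id)
	}
	return out
}

func main() {
	idx := newSegmentIndex()
	idx.Put(segment{ID: 1, Channel: "dml_0"})
	idx.Put(segment{ID: 2, Channel: "dml_0"})
	idx.Put(segment{ID: 3, Channel: "dml_1"})
	fmt.Println("segments on dml_0:", len(idx.IDsByChannel("dml_0")))
}
```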
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #38275
This PR moves the sync created partition step to proxy to avoid a potential
logic deadlock when create partition happens together with a target segment
change.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #38305
Since we disabled balance segment and balance channel happening at the same
time, the constraint that release segment must happen on a
serviceable shard leader is unnecessary.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Related to #36456
Unify collection/partition number metrics in the collection manager to
avoid unwanted missed modifications.
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #36621
- add filters to most of lists
- add more options to the existing list filter
- fix the error list column in channels and resource group list
Signed-off-by: mimo.oyn <ning.ouyang@zilliz.com>
The partition number has already been incremented in
`ChangePartitionState`, so there is no need to increment it again in
`AddPartition`.
issue: https://github.com/milvus-io/milvus/issues/37630
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
When there are too many key-value pairs, the etcd list operation may
time out. This PR replaces `LoadWithPrefix` in list operations, which
could involve many keys, with `WalkWithPrefix`.
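The difference between loading a whole prefix at once and walking it page by page can be sketched with an in-memory store. Everything here is hypothetical: `memKV`, `listPage` and the `walkWithPrefix` signature are assumptions for illustration and do not reproduce the Milvus kv interface or the etcd client.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
	"strings"
)

// memKV is an in-memory stand-in for a key-value store.
type memKV struct {
	data map[string]string
}

// listPage returns up to limit keys that have the prefix and sort strictly
// after the given key, in ascending order. It models one bounded list request.
func (kv *memKV) listPage(prefix, after string, limit int) []string {
	keys := make([]string, 0)
	for k := range kv.data {
		if strings.HasPrefix(k, prefix) && k > after {
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	if len(keys) > limit {
		keys = keys[:limit]
	}
	return keys
}

// walkWithPrefix visits every key under prefix in pages of pageSize,
// so no single request has to return the full key set.
func (kv *memKV) walkWithPrefix(prefix string, pageSize int, fn func(key, value string) error) error {
	if pageSize <= 0 {
		return errors.New("pageSize must be positive")
	}
	after := ""
	for {
		page := kv.listPage(prefix, after, pageSize)
		for _, k := range page {
			if err := fn(k, kv.data[k]); err != nil {
				return err
			}
		}
		if len(page) < pageSize {
			return nil // short page: nothing left under this prefix
		}
		after = page[len(page)-1]
	}
}

func main() {
	kv := &memKV{data: map[string]string{
		"seg/1": "a", "seg/2": "b", "seg/3": "c", "chan/1": "z",
	}}
	err := kv.walkWithPrefix("seg/", 2, func(k, v string) error {
		fmt.Println(k, v)
		return nil
	})
	if err != nil {
		fmt.Println("walk failed:", err)
	}
}
```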
issue: https://github.com/milvus-io/milvus/issues/37917
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
`File.Write` and `File.WriteInt` use `write`, which may be a direct
syscall on some systems. When mapping field data and writing line by
line, this can cost a lot of CPU time when the row number is large.
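The cost being described is one call into the underlying writer per row. The following Go sketch illustrates the general remedy of buffering small writes; it is not the code touched by this change. `countingWriter` and `countWrites` are hypothetical helpers, and each call that reaches the sink stands in for one `write` syscall on a real file.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
)

// countingWriter counts how many Write calls reach the underlying sink.
type countingWriter struct {
	w     io.Writer
	calls int
}

func (c *countingWriter) Write(p []byte) (int, error) {
	c.calls++
	return c.w.Write(p)
}

// countWrites writes the given number of rows and reports how many Write
// calls reached the sink. bufSize <= 0 means unbuffered, row-by-row writes.
func countWrites(rows, bufSize int) (int, error) {
	sink := &countingWriter{w: io.Discard}
	var w io.Writer = sink
	var bw *bufio.Writer
	if bufSize > 0 {
		bw = bufio.NewWriterSize(sink, bufSize)
		w = bw
	}
	for i := 0; i < rows; i++ {
		if _, err := fmt.Fprintf(w, "row-%d\n", i); err != nil {
			return 0, err
		}
	}
	if bw != nil {
		// Flush pushes whatever is still buffered to the sink.
		if err := bw.Flush(); err != nil {
			return 0, err
		}
	}
	return sink.calls, nil
}

func main() {
	direct, err := countWrites(100000, 0)
	if err != nil {
		panic(err)
	}
	buffered, err := countWrites(100000, 64*1024)
	if err != nil {
		panic(err)
	}
	fmt.Println("unbuffered sink writes:", direct)
	fmt.Println("buffered sink writes:  ", buffered)
}
```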
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #38086
Because the ManualCompact api passes the collection id in the request, but
RBAC needs to check the collection name, granting the ManualCompact api
doesn't work.
This PR refines the ManualCompact api to accept the collection name in the
request.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
related issue: https://github.com/milvus-io/milvus/issues/37031
fixed issues #38042: The interface "grant_v2" does not support empty
collectionName while the error says it does
Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
#38077
remove the check for two reasons:
1. the server will do the same check to make sure the correct database is used;
2. each req has an additional overhead of calling the proxy to check the
database.
Signed-off-by: lixinguo <xinguo.li@zilliz.com>
Co-authored-by: lixinguo <xinguo.li@zilliz.com>
related issue: https://github.com/milvus-io/milvus/issues/37031
fixed issues:
#37974: better error messages for grant v2 interface
#37903: fix meta built-in privilege group object name
#37843: better error messages for custom privilege group interface
#38002: fix built-in privilege group meta to pass proxy interceptor
check
#38008: fix revoke v2 to support revoking v1 granted privileges
Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
issue: #33285
- move most cgo operations related to search/query into the segcore package
so they can be reused by streamingnode.
- add go unittest for segcore operations.
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #35917
This PR refines the meta-related APIs in datacoord to allow the ctx to
be passed down to the catalog operation interfaces
Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
Related to #37999
This PR adds a `SetThreadName` API for marking cgo threads and uses it
when initializing the cgo worker.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: #35917
This PR refines the querycoord meta-related interfaces to ensure that
each method includes a ctx parameter.
Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
RESTful API. The affected APIs are as follows:
- Handler. insert
- HandlerV1. insert/upsert
- HandlerV2. insert/upsert/search
We do not modify search API in Handler/HandlerV1 because they do not
support fp16/bf16 vectors.
module github.com/milvus-io/milvus/pkg:
Add `Float32ArrayToBFloat16Bytes()`, `Float32ArrayToFloat16Bytes()` and
`Float32ArrayToBytes()`. These methods will be used in GoSDK in the
future.
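For readers unfamiliar with bf16, the conversion can be sketched as below. This is not the code of the functions named above: the helper names are hypothetical, and two assumptions are made for illustration, namely that rounding is done by simple truncation (keeping the upper 16 bits of the float32) and that bytes are laid out little-endian.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// float32SliceToBFloat16Bytes converts each float32 to bfloat16 by keeping
// the upper 16 bits (sign, 8 exponent bits, 7 mantissa bits) and writes
// the result as 2 little-endian bytes per element.
func float32SliceToBFloat16Bytes(vec []float32) []byte {
	out := make([]byte, 2*len(vec))
	for i, f := range vec {
		bits := uint16(math.Float32bits(f) >> 16)
		binary.LittleEndian.PutUint16(out[2*i:], bits)
	}
	return out
}

// bfloat16BytesToFloat32Slice is the inverse: each 16-bit value becomes the
// upper half of a float32 whose lower 16 bits are zero.
func bfloat16BytesToFloat32Slice(b []byte) []float32 {
	out := make([]float32, len(b)/2)
	for i := range out {
		bits := uint32(binary.LittleEndian.Uint16(b[2*i:])) << 16
		out[i] = math.Float32frombits(bits)
	}
	return out
}

func main() {
	vec := []float32{1.0, -2.0, 3.14159}
	encoded := float32SliceToBFloat16Bytes(vec)
	fmt.Printf("% x\n", encoded)
	fmt.Println(bfloat16BytesToFloat32Slice(encoded))
}
```

Values such as 1.0 and -2.0 survive the round trip exactly, while 3.14159 loses precision because only 7 mantissa bits are kept.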
issue: #37448
Signed-off-by: Yinzuo Jiang <yinzuo.jiang@zilliz.com>
Signed-off-by: Yinzuo Jiang <jiangyinzuo@foxmail.com>
issue: #37764
- add a local client to call local server directly for
querycoord/rootcoord/datacoord.
- enable local client if milvus is running mixcoord or standalone mode.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
Related to #37630
This PR adds a new coll2Replicas secondary index util to reduce map
access & iteration when getting replicas by collection
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
/kind improvement
Here you only need to filter out the system fields; there is no need
to recreate a response, because recreating the response makes it easy to
miss this part when fields are added later.
Signed-off-by: SimFG <bang.fu@zilliz.com>
issue: #37908
Because paramtable is a global singleton,
paramtable.GetNodeID may return the wrong server id in integration tests.
This PR uses node.GetNodeID to replace paramtable.GetNodeID.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #36621
- fix the empty segment source
- remove http request timeout limit
- support more filters in seg and channel table
Signed-off-by: mimo.oyn <ning.ouyang@zilliz.com>
issue: #35917
Before enhancing log trace information, it's necessary to pass the
context to the method entry point.
This PR first refines the rootcoord/metatable interfaces to ensure that
each method includes a ctx parameter.
Signed-off-by: tinswzy <zhenyuan.wei@zilliz.com>
1. taskQueueCapacity 256 is too small for production when we want to
re-write the entire collection
2. tasks should be cleaned when unable to recover, or the meta will
remain in etcd forever.
Signed-off-by: yangxuan <xuan.yang@zilliz.com>
issue: #33550
Balance segment and balance channel executing at the same time
causes a bunch of corner cases.
This PR disables simultaneous balancing of segments and channels.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #37830
Because the dist handler doesn't set the channel's version, when the
channel checker tries to dedup channels, it may release the new delegator
after balance finishes.
This PR fixes this by setting the proper version for the channel.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #37851
- make a global thread pool at tantivy temporarily.
- set 1 instead of 4 threads for inverted text index.
Signed-off-by: chyezh <chyezh@outlook.com>
issue: https://github.com/milvus-io/milvus/issues/36864
I have a few questions regarding my approach. I will consolidate them
here for feedback and review. Thanks.
---------
Signed-off-by: Nischay Yadav <nischay.yadav@ibm.com>
Signed-off-by: Nischay <Nischay.Yadav@ibm.com>
Custom `jsonRender` encodes JSON data directly into the response
stream. It uses less memory since it does not buffer the entire JSON
structure before sending it, unlike `c.JSON` in `HTTPReturn`, which
serializes the JSON fully in memory before writing it to the response.
Benchmark testing shows that using the custom render incurs no
performance loss and reduces memory consumption by nearly 50%.
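The general shape of such a render can be sketched with the standard library alone. This is not the PR's `jsonRender`: `writeJSON` and `renderToString` are hypothetical helpers, the gin render interface is left out, and the sketch only shows encoding through an encoder bound to the response writer rather than first producing a separate byte slice and then writing it. How much memory this saves depends on the JSON library in use.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// writeJSON sets the headers and encodes v through an encoder that writes
// to the response writer, so the caller never holds a marshaled []byte.
func writeJSON(w http.ResponseWriter, status int, v interface{}) error {
	w.Header().Set("Content-Type", "application/json; charset=utf-8")
	w.WriteHeader(status)
	// Encode writes the JSON encoding of v followed by a newline.
	return json.NewEncoder(w).Encode(v)
}

// renderToString runs writeJSON against an in-memory recorder so the
// output can be inspected without starting a server.
func renderToString(v interface{}) (string, error) {
	rec := httptest.NewRecorder()
	if err := writeJSON(rec, http.StatusOK, v); err != nil {
		return "", err
	}
	return rec.Body.String(), nil
}

func main() {
	body, err := renderToString(map[string]interface{}{"code": 0, "data": []int{1, 2, 3}})
	if err != nil {
		panic(err)
	}
	fmt.Print(body)
}
```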
issue: https://github.com/milvus-io/milvus/issues/37671
---------
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
issue: #37718
This PR refines the shard client ref counter: decrementing the ref counter
no longer releases the client, and only the shard client manager is
permitted to remove a client.
---------
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
Limit the maximum concurrency of channel tasks for each DataNode to
prevent excessive subscriptions from causing DataNode OOM.
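A per-node cap of this kind can be sketched as a simple assignment loop. The names (`assignChannels`, `maxPerNode`) and the least-loaded tie-breaking are assumptions for illustration; the actual scheduling policy and configuration key are not described in this message.

```go
package main

import "fmt"

// assignChannels assigns pending channels to nodes without letting any node
// exceed maxPerNode running channel tasks. It returns the assignments and
// the channels that must wait for a later round. The running map is updated
// in place as channels are assigned.
func assignChannels(pending []string, running map[int64]int, nodes []int64, maxPerNode int) (map[string]int64, []string) {
	assigned := make(map[string]int64)
	var waiting []string
	for _, ch := range pending {
		var target int64
		found := false
		for _, n := range nodes {
			if running[n] >= maxPerNode {
				continue // node is already at its cap
			}
			// Prefer the node with the fewest running tasks;
			// earlier nodes win ties.
			if !found || running[n] < running[target] {
				target, found = n, true
			}
		}
		if !found {
			waiting = append(waiting, ch)
			continue
		}
		running[target]++
		assigned[ch] = target
	}
	return assigned, waiting
}

func main() {
	pending := []string{"ch1", "ch2", "ch3", "ch4", "ch5"}
	running := map[int64]int{}
	assigned, waiting := assignChannels(pending, running, []int64{1, 2}, 2)
	fmt.Println("assigned:", assigned)
	fmt.Println("waiting: ", waiting)
}
```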
issue: https://github.com/milvus-io/milvus/issues/37665
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
When there are a large number of segments, the metrics consume a lot of
memory. This PR removes the segment-level tag from monitoring metrics.
issue: https://github.com/milvus-io/milvus/issues/37636
Signed-off-by: bigsheeper <yihao.dai@zilliz.com>
issue: #36621
* The front-end code of the management interface has been refactored
using React.
* The interaction and display of the front-end interface have been
optimized.
Signed-off-by: mimo.oyn <ning.ouyang@zilliz.com>
Co-authored-by: zilliz <zilliz@zillizdeMacBook-Pro.local>
issue: #37679
PR #36549 introduced a logic error which updates the current target when
only part of the channels are ready.
This PR fixes the logic error and lets the dist handler keep pulling
distribution on querynode until all delegators become serviceable.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>
issue: #37579
If the schema includes large varchar fields, a few thousand rows can
reach hundreds of MB in size. Therefore, if the batch size of the
segment writer is large, it will produce relatively large `binlogs`,
which can cause datanode to run out of memory (OOM) during compaction.
Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
issue: #37665 #37631 #37620 #37587 #36906
knowhere has added a default nlist value, so some invalid-param unit tests
with no nlist param become valid.
Signed-off-by: xianliang.li <xianliang.li@zilliz.com>
issue: #36621
- For simple types in a struct, add "string" to the JSON tag for
automatic string conversion during JSON encoding.
- For complex types in a struct, replace "int64" with "string."
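The first bullet relies on the `string` option of `encoding/json` struct tags, which applies to scalar fields such as integers. A minimal sketch follows; the struct and field names are hypothetical and not taken from the Milvus metrics types.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// segmentInfo is a hypothetical response type. The ",string" option makes
// encoding/json emit the int64 fields as quoted JSON strings.
type segmentInfo struct {
	SegmentID int64 `json:"segment_id,string"`
	NumRows   int64 `json:"num_rows,string"`
}

// encode marshals s and returns the JSON text.
func encode(s segmentInfo) (string, error) {
	b, err := json.Marshal(s)
	return string(b), err
}

func main() {
	out, err := encode(segmentInfo{SegmentID: 9007199254740993, NumRows: 10})
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // {"segment_id":"9007199254740993","num_rows":"10"}
}
```

The tag option does not reach into slices or maps, which is consistent with the second bullet changing the element type itself for complex fields.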
Signed-off-by: jaime <yun.zhang@zilliz.com>
issue: #37640
Fix for PR #36549.
Balance channel waits until the new delegator becomes serviceable,
but the new delegator needs to sync the target version before it becomes
serviceable, and syncing the target version needs to wait until all
replicas finish loading. So if increasing the replica number and balance
channel happen at the same time, a logic deadlock occurs.
Signed-off-by: Wei Liu <wei.liu@zilliz.com>