Related to #39173#39718
In storage v2, the `lack_bin_rows` cannot be used since field id is not
column group id, which will not be matched forever.
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
Related to #39173#41534
This pr fixes an issue that building mem index may report datatype not
match error when collection split fields into multiple groups
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
issue: https://github.com/milvus-io/milvus/issues/41435
this PR also makes HasRawData of ChunkedSegmentSealedImpl to return
based on metadata, without needing to load the cache just to answer this
simple question.
---------
Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
issue: https://github.com/milvus-io/milvus/issues/41435
this PR also:
1. fixed the skip index for VARCHAR. before this PR, skip index of
VARCHAR uses the minmax of the entire column as the minmax of chunk 0,
and provides no minmax for other chunks.
2. refactored some skip index loading related code
3. partly fixed a bug in test_expr.cpp
---------
Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
issue: #41348
related and optimized by #41347
---------
Signed-off-by: Cai Zhang <cai.zhang@zilliz.com>
Co-authored-by: Sangho Park <hoyaspark@gmail.com>
issue: https://github.com/milvus-io/milvus/issues/41435
this PR is based on https://github.com/milvus-io/milvus/pull/41436.
Improvements include:
- Lazy Load support for Storage v1
- Use Low/High watermark to control eviction
- Caching Layer related config changes
- Removed ChunkCache related configs and code in golang
- Add `PinAllCells` helper method to CacheSlot class
- Modified ValueAt, RawAt, PrimitiveRawAt to Bulk version, to reduce
caching layer overhead
- Removed some unclear templated bulk_subscript methods
- CachedSearchIterator to store PinWrapper when searching on
ChunkedColumn, and removed unused contrustor.
---------
Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
storage v2 chunked seal segment loading is based on caching layer. A
cell unit in storage v2 is a parquet row group in remote object storage,
containing all fields. Therefore, each field needs a proxy to do related
one field operations.
<img width="965" alt="Screenshot 2025-04-28 at 10 59 30"
src="https://github.com/user-attachments/assets/83e93a10-3b1d-4066-ac17-b996d5650416"
/>
related: #39173
---------
Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
Related to #39718
This PR:
- Add reopen logic for growing & sealed segments
- Lazy reopen when schema version increases
- Add FinishLoad api for loading progress
---------
Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
fix: #39755
The following shows a simple benchmark where insert 1M docs where all
rows are "hello", the latency is segcore level, CPU is 9900K:
master: 2.62ms
this PR: 2.11ms
bench mark code:
```
TEST(TextMatch, TestPerf) {
auto schema = GenTestSchema({}, true);
auto seg = CreateSealedSegment(schema, empty_index_meta);
int64_t N = 1000000;
uint64_t seed = 19190504;
auto raw_data = DataGen(schema, N, seed);
auto str_col = raw_data.raw_->mutable_fields_data()
->at(1)
.mutable_scalars()
->mutable_string_data()
->mutable_data();
for (int64_t i = 0; i < N - 1; i++) {
str_col->at(i) = "hello";
}
SealedLoadFieldData(raw_data, *seg);
seg->CreateTextIndex(FieldId(101));
auto now = std::chrono::high_resolution_clock::now();
auto expr = GetMatchExpr(schema, "hello", OpType::TextMatch);
auto final = ExecuteQueryExpr(expr, seg.get(), N, MAX_TIMESTAMP);
auto end = std::chrono::high_resolution_clock::now();
auto duration =
std::chrono::duration_cast<std::chrono::microseconds>(end - now);
std::cout << "TextMatch query time: " << duration.count() << "ms"
<< std::endl;
}
```
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
support parallel loading sealed and growing segments with storage v2
format by async reading row groups.
related: #39173
---------
Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
Ref: https://github.com/milvus-io/milvus/issues/40823
It does not make any sense to create single segment tantivy index for
old version such as 2.4 by using tantivy V7.
So, clean the relevant code.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
issue: https://github.com/milvus-io/milvus/issues/40006
This PR make tantivy document add by batch. Add document by batch can
greately reduce the latency of scheduling the document add operation
(call tantivy `add_document` only schdules the add operation and it
returns immediately after scheduled) , because each call involes a tokio
block_on which is relatively heavy.
Reduce scheduling part not necessarily reduces the overall latency if
the index writer threads does not process indexing quickly enough.
But if scheduling itself is pretty slow, even the index writer threads
process indexing very fast (by increasing thread number), the overall
performance can still be limited.
The following codes bench the PR (Note, the duration only counts for
scheduling without commit)
```
fn test_performance() {
let field_name = "text";
let dir = TempDir::new().unwrap();
let mut index_wrapper = IndexWriterWrapper::create_text_writer(
field_name,
dir.path().to_str().unwrap(),
"default",
"",
1,
50_000_000,
false,
TantivyIndexVersion::V7,
)
.unwrap();
let mut batch = vec![];
for i in 0..1_000_000 {
batch.push(format!("hello{:04}", i));
}
let batch_ref = batch.iter().map(|s| s.as_str()).collect::<Vec<_>>();
let now = std::time::Instant::now();
index_wrapper
.add_data_by_batch(&batch_ref, Some(0))
.unwrap();
let elapsed = now.elapsed();
println!("add_data_by_batch elapsed: {:?}", elapsed);
}
```
Latency roughly reduces from 1.4s to 558ms.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
fix: https://github.com/milvus-io/milvus/issues/40823
To solve the problem in the issue, we have to support building tantivy
index with low version
for those query nodes with low tantivy version.
This PR does two things:
1. refactor codes for IndexWriterWrapper to make it concise
2. enable IndexWriterWrapper to build tantivy index by different tantivy
crate
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
after the pr merged, we can support to insert, upsert, build index,
query, search in the added field.
can only do the above operates in added field after add field request
complete, which is a sync operate.
compact will be supported in the next pr.
#39718
---------
Signed-off-by: lixinguo <xinguo.li@zilliz.com>
Co-authored-by: lixinguo <xinguo.li@zilliz.com>
two point:
(1) reoder conjucts expr's subexpr, postpone heavy operations
sequence: int(column) -> index(column) -> string(column) -> light
conjuct
...... -> json(column) -> heavy conjuct -> two_column_compare
(2) support pre filter for expr execute, skip scan raw data that had
been skipped
because of preceding expr result.
#39869
Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
issue: #40308
This issue fixes these two concurrent issues:
1. element in null_offset is used to set bitset where the size of bitset
is initialized by tantivy document count. However, there may still be
some documents that are not committed in tantivy but are null in
null_offset. So array out of range occurs.
2. null_offset can be read and write concurrently but there's no
synchronization protection.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
issue: #40198
Fix random sample does not consider empty input, that is no data is hit
by filter expression.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
issue: https://github.com/milvus-io/milvus/issues/35528
If the query data type does not match the index type, fall back to a
brute-force search
---------
Signed-off-by: sunby <sunbingyi1992@gmail.com>
* use the new packed reader and writer api to be compatible with current
etcd meta
* For the new packed writer API: column groups and paths are explicitly
defined by users and won't split column groups by memory in storage v2.
Packed writer follows the user-defined column groups to split arrow
record and write into the corresponding file path.
* For the new packed reader API: read paths are explicitly defined by
users.
related: #39173
Signed-off-by: shaoting-huang <shaoting.huang@zilliz.com>
fix: #39711
Unlike English sentence where each words are parsed exactly once and one
after one with position length 1, one Chinese word may be parsed to
multiple words with position length larger than 1.
For example, "badminton and skiing" will be parsed to Token{ start: 0,
length: 1, text: "badminton" }, Token{ start: 1, length: 1, text: "and"
}, and Token{ start: 2, length: 1, text: "tennis" }.
While for exmaple for Chinsese: "羽毛球和滑雪" may be parsed to Token{ start:
0, length: 2, text: "羽毛" }, Token{ start: 0, length: 3, text: "羽毛球" },
Token{ start: 3, length: 1, text: "和" }, and Token{ start: 4, length: 2,
text: "滑雪" }.
This PR fix that the code not recognizes this situation.
---------
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
issue: #39541
This PR implements random sample, the syntax is:
```
filter="random_sample(factor)"
or
filter="boolean_expression && random_sample(factor)"
where
factor is a float between (0, 1) and
boolean_expression is like
"1 <= number < 10", "color in ["read, "blue"]" or others
```
---------
Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
https://github.com/milvus-io/milvus/issues/35528
This PR adds json index support for json and dynamic fields. Now you can
only do unary query like 'a["b"] > 1' using this index. We will support
more filter type later.
basic usage:
```
collection.create_index("json_field", {"index_type": "INVERTED",
"params": {"json_cast_type": DataType.STRING, "json_path":
'json_field["a"]["b"]'}})
```
There are some limits to use this index:
1. If a record does not have the json path you specify, it will be
ignored and there will not be an error.
2. If a value of the json path fails to be cast to the type you specify,
it will be ignored and there will not be an error.
3. A specific json path can have only one json index.
4. If you try to create more than one json indexes for one json field,
sdk(pymilvus<=2.4.7) may return immediately because of internal
implementation. This will be fixed in a later version.
---------
Signed-off-by: sunby <sunbingyi1992@gmail.com>
issue: #38715
- Current milvus use a serialized index size(compressed) for estimate
resource for loading.
- Add a new field `MemSize` (before compressing) for index to estimate
resource.
---------
Signed-off-by: chyezh <chyezh@outlook.com>
issue: #39124
`bitset::find_first()` and `bitset::find_next()` now accept one more
parameter, which allows to search for `0` bit instead of `1` bit
Signed-off-by: Alexandr Guzhva <alexanderguzhva@gmail.com>