Commit Graph

21 Commits (a848c4a8c56014488081a061c49f8bd7de11510c)

Author SHA1 Message Date
Spade A fce0bbe2ae
fix: remove redundant locks for null_offset (#43103)
Ref: https://github.com/milvus-io/milvus/issues/40308
https://github.com/milvus-io/milvus/pull/40363 add lock for protecting
concurrent read/write for null offset. But we don't need this for sealed
segment.

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-07-04 10:10:45 +08:00
Spade A 50f7579d8f
fix: fix some bugs discovered by chaos tests (#42906)
fix: https://github.com/milvus-io/milvus/issues/42870

This PR fixes:
1. SetBitset fn shuold consider growing segments with concurrent write
2. avoid using from_raw_parts directly

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-06-24 16:32:42 +08:00
congqixia c9bc70f272
fix: [AddField] Use shared_ptr of schema in plan fixing dangling ref (#42693)
Related to #42640

The search/query plan holded a reference to schema, which could be
destructed after schema change. This PR make plan hold a shared ptr to
it fixing dangling reference problem under concurrent read & schema
change.

This PR also remove field binlog check for loading index for old segment
with old schema may have binlog lack.

Signed-off-by: Congqi Xia <congqi.xia@zilliz.com>
2025-06-12 20:46:36 +08:00
Buqian Zheng 2e3539319d
feat: vector field raw data to mmap by default (#41975)
issue: https://github.com/milvus-io/milvus/issues/41435

should address https://github.com/milvus-io/milvus/issues/41774

this PR also: 
* added caching layer memory overhead metric
* re-enable TextMatch.GrowingLoadData test

Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
2025-05-22 11:56:25 +08:00
Buqian Zheng 3de904c7ea
feat: add cachinglayer to sealed segment (#41436)
issue: https://github.com/milvus-io/milvus/issues/41435

---------

Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
2025-04-28 10:52:40 +08:00
Spade A 5b1430f27e
enhance: tantivy collector set bitset directly (#39748)
fix: #39755

The following shows a simple benchmark where insert 1M docs where all
rows are "hello", the latency is segcore level, CPU is 9900K:
master: 2.62ms
this PR: 2.11ms

bench mark code:

```
TEST(TextMatch, TestPerf) {
    auto schema = GenTestSchema({}, true);
    auto seg = CreateSealedSegment(schema, empty_index_meta);
    int64_t N = 1000000;
    uint64_t seed = 19190504;
    auto raw_data = DataGen(schema, N, seed);
    auto str_col = raw_data.raw_->mutable_fields_data()
                       ->at(1)
                       .mutable_scalars()
                       ->mutable_string_data()
                       ->mutable_data();
    for (int64_t i = 0; i < N - 1; i++) {
        str_col->at(i) = "hello";
    }
    SealedLoadFieldData(raw_data, *seg);
    seg->CreateTextIndex(FieldId(101));

    auto now = std::chrono::high_resolution_clock::now();
    auto expr = GetMatchExpr(schema, "hello", OpType::TextMatch);
    auto final = ExecuteQueryExpr(expr, seg.get(), N, MAX_TIMESTAMP);
    auto end = std::chrono::high_resolution_clock::now();
    auto duration =
        std::chrono::duration_cast<std::chrono::microseconds>(end - now);
    std::cout << "TextMatch query time: " << duration.count() << "ms"
              << std::endl;
}
```

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-20 23:02:41 +08:00
Spade A 62293cb582
fix: revert batch add (#41374)
issue: #41375

todo: to fix the problems fixed in the issue.

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-17 22:32:38 +08:00
Spade A c6a0c2ab64
enhance: process tantivy document add by batch (#40124)
issue: https://github.com/milvus-io/milvus/issues/40006

This PR make tantivy document add by batch. Add document by batch can
greately reduce the latency of scheduling the document add operation
(call tantivy `add_document` only schdules the add operation and it
returns immediately after scheduled) , because each call involes a tokio
block_on which is relatively heavy.

Reduce scheduling part not necessarily reduces the overall latency if
the index writer threads does not process indexing quickly enough.
But if scheduling itself is pretty slow, even the index writer threads
process indexing very fast (by increasing thread number), the overall
performance can still be limited.

The following codes bench the PR (Note, the duration only counts for
scheduling without commit)
```
fn test_performance() {
    let field_name = "text";
    let dir = TempDir::new().unwrap();
    let mut index_wrapper = IndexWriterWrapper::create_text_writer(
        field_name,
        dir.path().to_str().unwrap(),
        "default",
        "",
        1,
        50_000_000,
        false,
        TantivyIndexVersion::V7,
    )
    .unwrap();

    let mut batch = vec![];
    for i in 0..1_000_000 {
        batch.push(format!("hello{:04}", i));
    }
    let batch_ref = batch.iter().map(|s| s.as_str()).collect::<Vec<_>>();

    let now = std::time::Instant::now();
    index_wrapper
        .add_data_by_batch(&batch_ref, Some(0))
        .unwrap();
    let elapsed = now.elapsed();
    println!("add_data_by_batch elapsed: {:?}", elapsed);
}
```
Latency roughly reduces from 1.4s to 558ms.

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-04-08 19:50:24 +08:00
smellthemoon cb1e86e17c
enhance: support add field (#39800)
after the pr merged, we can support to insert, upsert, build index,
query, search in the added field.
can only do the above operates in added field after add field request
complete, which is a sync operate.

compact will be supported in the next pr.
#39718

---------

Signed-off-by: lixinguo <xinguo.li@zilliz.com>
Co-authored-by: lixinguo <xinguo.li@zilliz.com>
2025-04-02 14:24:31 +08:00
Spade A 3db56560fb
fix: fix concurrent issues in null offset (#40363)
issue: #40308
This issue fixes these two concurrent issues:
1. element in null_offset is used to set bitset where the size of bitset
is initialized by tantivy document count. However, there may still be
some documents that are not committed in tantivy but are null in
null_offset. So array out of range occurs.
2. null_offset can be read and write concurrently but there's no
synchronization protection.

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-03-05 17:48:00 +08:00
Spade A 52c7d7dd80
fix: offset combined with term should be based on Token positions in phrase match (#39931)
fix: #39711

Unlike English sentence where each words are parsed exactly once and one
after one with position length 1, one Chinese word may be parsed to
multiple words with position length larger than 1.

For example, "badminton and skiing" will be parsed to Token{ start: 0,
length: 1, text: "badminton" }, Token{ start: 1, length: 1, text: "and"
}, and Token{ start: 2, length: 1, text: "tennis" }.

While for exmaple for Chinsese: "羽毛球和滑雪" may be parsed to Token{ start:
0, length: 2, text: "羽毛" }, Token{ start: 0, length: 3, text: "羽毛球" },
Token{ start: 3, length: 1, text: "和" }, and Token{ start: 4, length: 2,
text: "滑雪" }.

This PR fix that the code not recognizes this situation.

---------

Signed-off-by: SpadeA <tangchenjie1210@gmail.com>
2025-02-18 20:38:51 +08:00
Spade A 032292a432
feat: support phrase match query (#38869)
The relevant issue: https://github.com/milvus-io/milvus/issues/38930

---------

Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
2025-01-12 20:24:58 +08:00
Spade A 8abf6c9149
fix: build text index when loading field data (#39070)
fix: https://github.com/milvus-io/milvus/issues/39053
may fix https://github.com/milvus-io/milvus/issues/38644 which could be
caused by https://github.com/milvus-io/milvus/issues/39053

---------

Signed-off-by: SpadeA-Tang <tangchenjie1210@gmail.com>
2025-01-09 15:24:56 +08:00
Zhen Ye b537a72309
fix: interted index out of range (#38577)
issue: #38546, #38486

Signed-off-by: chyezh <chyezh@outlook.com>
2024-12-19 15:20:47 +08:00
zhagnlu e4b6773d0a
fix: fix create text index dir conflict bug (#37693)
#37623

Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2024-11-15 18:26:30 +08:00
smellthemoon 3389a6b500
enhance: support null in text match index (#37517)
#37508

Signed-off-by: lixinguo <xinguo.li@zilliz.com>
Co-authored-by: lixinguo <xinguo.li@zilliz.com>
2024-11-13 11:08:29 +08:00
aoiasd 12951f0abb
enhance: rename tokenizer to analyzer and check analyzer params (#37478)
relate: https://github.com/milvus-io/milvus/issues/35853

---------

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2024-11-10 16:12:26 +08:00
aoiasd d67853fa89
feat: Tokenizer support build with params and clone for concurrency (#37048)
relate: https://github.com/milvus-io/milvus/issues/35853
https://github.com/milvus-io/milvus/issues/36751

---------

Signed-off-by: aoiasd <zhicheng.yue@zilliz.com>
2024-11-06 17:48:24 +08:00
Buqian Zheng f7b811450d
feat: add enable_tokenizer params to VarChar field (#36480)
issue: #35922

add an enable_tokenizer param to varchar field: must be set to true so
that a varchar field can enable_match or used as input of BM25 function

---------

Signed-off-by: Buqian Zheng <zhengbuqian@gmail.com>
2024-10-10 20:33:21 +08:00
zhagnlu 489087d18b
enhance: refactor executor framework V2 (#35251)
#32636

Signed-off-by: luzhang <luzhang@zilliz.com>
Co-authored-by: luzhang <luzhang@zilliz.com>
2024-09-13 20:57:09 +08:00
Jiquan Long 89bf226f0b
feat: support keyword text match (#35923)
fix: #35922

---------

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2024-09-10 15:11:08 +08:00