milvus/internal/storage
Jiquan Long 3f46c6d459
feat: support inverted index (#28783)
issue: https://github.com/milvus-io/milvus/issues/27704

Add an inverted index for some scalar data types in Milvus. Compared to
loading all data into RAM, this index type can save a lot of memory, and
it speeds up term queries and range queries.

Supported: `INT8`, `INT16`, `INT32`, `INT64`, `FLOAT`, `DOUBLE`, `BOOL`
and `VARCHAR`.

Not supported: `ARRAY` and `JSON`.

Note:
- The inverted index for `VARCHAR` is not yet designed to serve full-text
search. Every row is treated as a single keyword rather than being
tokenized into multiple terms.
- The inverted index doesn't support retrieval well, so if you create an
inverted index on a field, operations that depend on the raw data fall
back to chunk storage, which costs some performance. Examples are
comparisons between two columns and retrieval of output fields.
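
To make the first note concrete, here is a toy sketch in plain Python of what "every row is a whole keyword" means for lookups. It is not the Milvus implementation; `build_keyword_index` and the sample rows are hypothetical names and data chosen for illustration only.

```python
from collections import defaultdict


def build_keyword_index(rows):
    """Map each full string value to the row offsets that hold it.

    The whole value is the only term; nothing is tokenized.
    (Hypothetical helper, not part of Milvus.)
    """
    index = defaultdict(list)
    for offset, value in enumerate(rows):
        index[value].append(offset)
    return dict(index)


# Made-up sample data for the illustration.
rows = ["red apple", "green apple", "red apple", "apple"]
keyword_index = build_keyword_index(rows)

# An exact match on the full value finds the rows.
exact_hits = keyword_index.get("red apple", [])

# A single word from inside a value is not a term, so it finds nothing.
token_hits = keyword_index.get("red", [])
```

In this model a lookup for `"red"` comes back empty even though two rows contain that word, which is the behavior the note describes for `VARCHAR` fields.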

The inverted index is easy to use.

Take the collection below as an example:

```python
from pymilvus import Collection, CollectionSchema, DataType, FieldSchema

# Assumes a connection to a running Milvus server is already open.
dim = 8  # assumed value for illustration; the original leaves dim undefined

fields = [
    FieldSchema(name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=100),
    FieldSchema(name="int8", dtype=DataType.INT8),
    FieldSchema(name="int16", dtype=DataType.INT16),
    FieldSchema(name="int32", dtype=DataType.INT32),
    FieldSchema(name="int64", dtype=DataType.INT64),
    FieldSchema(name="float", dtype=DataType.FLOAT),
    FieldSchema(name="double", dtype=DataType.DOUBLE),
    FieldSchema(name="bool", dtype=DataType.BOOL),
    FieldSchema(name="varchar", dtype=DataType.VARCHAR, max_length=1000),
    FieldSchema(name="random", dtype=DataType.DOUBLE),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=dim),
]
schema = CollectionSchema(fields)
collection = Collection("demo", schema)
```

Then create an inverted index on each field:

```python
index_type = "INVERTED"
collection.create_index("int8", {"index_type": index_type})
collection.create_index("int16", {"index_type": index_type})
collection.create_index("int32", {"index_type": index_type})
collection.create_index("int64", {"index_type": index_type})
collection.create_index("float", {"index_type": index_type})
collection.create_index("double", {"index_type": index_type})
collection.create_index("bool", {"index_type": index_type})
collection.create_index("varchar", {"index_type": index_type})
```

Term queries and range queries on these fields are then sped up
automatically by the inverted index:

```python
result = collection.query(expr='int64 in [1, 2, 3]', output_fields=["pk"])
result = collection.query(expr='int64 < 5', output_fields=["pk"])
result = collection.query(expr='int64 > 2997', output_fields=["pk"])
result = collection.query(expr='1 < int64 < 5', output_fields=["pk"])
```
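
As a rough mental model of why these four expressions benefit from the index, the sketch below answers the same kinds of queries from a value-to-rows mapping with sorted terms. It is a simplified illustration under stated assumptions, not the data structure Milvus uses; `ToyInvertedIndex` and the sample values are hypothetical.

```python
import bisect
from collections import defaultdict


class ToyInvertedIndex:
    """Toy model: each distinct value maps to the row offsets holding it."""

    def __init__(self, values):
        postings = defaultdict(list)
        for offset, value in enumerate(values):
            postings[value].append(offset)
        self.postings = dict(postings)
        self.terms = sorted(self.postings)  # sorted terms enable range scans

    def term_query(self, wanted):
        """Rows whose value is one of `wanted` (like `field in [...]`)."""
        hits = []
        for term in set(wanted):
            hits.extend(self.postings.get(term, []))
        return sorted(hits)

    def range_query(self, lower=None, upper=None):
        """Rows with lower < value < upper; None means unbounded."""
        lo = 0 if lower is None else bisect.bisect_right(self.terms, lower)
        hi = len(self.terms) if upper is None else bisect.bisect_left(self.terms, upper)
        hits = []
        for term in self.terms[lo:hi]:
            hits.extend(self.postings[term])
        return sorted(hits)


# Made-up column values; the list position plays the role of the row offset.
values = [4, 1, 3000, 2, 1, 7]
index = ToyInvertedIndex(values)

in_hits = index.term_query([1, 2, 3])     # like 'int64 in [1, 2, 3]'
lt_hits = index.range_query(upper=5)      # like 'int64 < 5'
gt_hits = index.range_query(lower=2997)   # like 'int64 > 2997'
between_hits = index.range_query(1, 5)    # like '1 < int64 < 5'
```

Each query touches only the matching terms instead of scanning every row. The results are row offsets only, so returning other fields still needs the raw data, which matches the retrieval caveat in the note above.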

---------

Signed-off-by: longjiquan <jiquan.long@zilliz.com>
2023-12-31 19:50:47 +08:00
aliyun Identify service providers based on addresses (#27907) 2023-10-25 17:28:10 +08:00
gcp Format the code (#27275) 2023-09-21 09:45:27 +08:00
OWNERS [skip ci]Update OWNERS files (#11898) 2021-11-16 15:41:11 +08:00
azure_object_storage.go enhance: Support importing data with parquet file (#28608) 2023-11-29 20:52:27 +08:00
azure_object_storage_test.go fix azure ListObjects (#27931) 2023-11-01 11:34:14 +08:00
binlog_iterator.go Format the code (#27275) 2023-09-21 09:45:27 +08:00
binlog_iterator_test.go Format the code (#27275) 2023-09-21 09:45:27 +08:00
binlog_reader.go Move some modules from internal to public package (#22572) 2023-04-06 19:14:32 +08:00
binlog_test.go Format the code (#27275) 2023-09-21 09:45:27 +08:00
binlog_util.go Move some modules from internal to public package (#22572) 2023-04-06 19:14:32 +08:00
binlog_util_test.go Format the code (#27275) 2023-09-21 09:45:27 +08:00
binlog_writer.go Format the code (#27275) 2023-09-21 09:45:27 +08:00
binlog_writer_test.go Format the code (#27275) 2023-09-21 09:45:27 +08:00
data_codec.go feat: support inverted index (#28783) 2023-12-31 19:50:47 +08:00
data_codec_test.go feat: support inverted index (#28783) 2023-12-31 19:50:47 +08:00
data_sorter.go Add float16 vector (#25852) 2023-09-08 10:03:16 +08:00
data_sorter_test.go Format the code (#27275) 2023-09-21 09:45:27 +08:00
event_data.go Format the code (#27275) 2023-09-21 09:45:27 +08:00
event_header.go Move some modules from internal to public package (#22572) 2023-04-06 19:14:32 +08:00
event_reader.go Use go-api/v2 for milvus-proto (#24770) 2023-06-09 01:28:37 +08:00
event_test.go Format the code (#27275) 2023-09-21 09:45:27 +08:00
event_writer.go Add go payload writer (#24656) (#24762) 2023-06-09 13:52:39 +08:00
event_writer_test.go Format the code (#27275) 2023-09-21 09:45:27 +08:00
factory.go Use OpenDAL to access object store (#25642) 2023-11-01 09:00:14 +08:00
file.go enhance: Support importing data with parquet file (#28608) 2023-11-29 20:52:27 +08:00
file_test.go enhance: Support importing data with parquet file (#28608) 2023-11-29 20:52:27 +08:00
index_data_codec.go Format the code (#27275) 2023-09-21 09:45:27 +08:00
index_data_codec_test.go Check error by Error() and NoError() for better report message (#24736) 2023-06-08 15:36:36 +08:00
insert_data.go feat: support inverted index (#28783) 2023-12-31 19:50:47 +08:00
insert_data_test.go feat: support inverted index (#28783) 2023-12-31 19:50:47 +08:00
local_chunk_manager.go Refine chunk manager errors (#27590) 2023-10-31 12:18:15 +08:00
local_chunk_manager_test.go enhance: Remove vector chunk manager (#28569) 2023-11-30 18:00:33 +08:00
minio_chunk_manager.go fix: Fix minio latency monitoring for get operation (#28510) 2023-11-28 10:00:27 +08:00
minio_chunk_manager_test.go Refine chunk manager errors (#27590) 2023-10-31 12:18:15 +08:00
minio_object_storage.go fix azure ListObjects (#27931) 2023-11-01 11:34:14 +08:00
minio_object_storage_test.go fix: Align minio object storage ut to new minio server behavior (#29014) 2023-12-06 15:42:43 +08:00
options.go Add chunk manager request timeout (#27692) 2023-10-23 20:08:08 +08:00
payload.go Add float16 vector (#25852) 2023-09-08 10:03:16 +08:00
payload_reader.go Update arrow version to v12 (#28425) 2023-11-15 10:36:19 +08:00
payload_reader_test.go Update arrow version to v12 (#28425) 2023-11-15 10:36:19 +08:00
payload_test.go Format the code (#27275) 2023-09-21 09:45:27 +08:00
payload_writer.go Update arrow version to v12 (#28425) 2023-11-15 10:36:19 +08:00
pk_statistics.go Format the code (#27275) 2023-09-21 09:45:27 +08:00
primary_key.go Format the code (#27275) 2023-09-21 09:45:27 +08:00
primary_key_test.go Use go-api/v2 for milvus-proto (#24770) 2023-06-09 01:28:37 +08:00
print_binlog.go Format the code (#27275) 2023-09-21 09:45:27 +08:00
print_binlog_test.go Remove deprecated io/ioutil usage (#27747) 2023-10-17 20:32:09 +08:00
remote_chunk_manager.go fix: Fix minio latency monitoring for get operation (#28510) 2023-11-28 10:00:27 +08:00
remote_chunk_manager_test.go Refine chunk manager errors (#27590) 2023-10-31 12:18:15 +08:00
stats.go enhance: add param for bloomfilter(#29388) (#29490) 2023-12-28 18:10:46 +08:00
stats_test.go Add retry time when lazy load BF (#25096) 2023-06-25 11:32:43 +08:00
storage_test.go enhance: Remove vector chunk manager (#28569) 2023-11-30 18:00:33 +08:00
types.go enhance: Support importing data with parquet file (#28608) 2023-11-29 20:52:27 +08:00
unsafe.go [skip e2e]Update license for storage unsafe (#14452) 2021-12-28 20:03:56 +08:00
unsafe_test.go [skip e2e]Update license for storage unsafe (#14452) 2021-12-28 20:03:56 +08:00
utils.go Fix buffer FieldData has no `ElementType` and array logsize always zero (#28295) 2023-11-09 14:16:20 +08:00
utils_test.go Add float16 vector (#25852) 2023-09-08 10:03:16 +08:00