Created Searching with ANNOY index (markdown)

master
jingkl 2021-10-18 17:26:54 +08:00
parent 84ed1e2f36
commit 4a8ed18c29
1 changed files with 53 additions and 0 deletions

@ -0,0 +1,53 @@
### ANNOY
<a name="ANNOY"></a>
ANNOY (Approximate Nearest Neighbors Oh Yeah) is an index that uses a hyperplane to divide a high-dimensional space into multiple subspaces, and then stores them in a tree structure.
When searching for vectors, ANNOY follows the tree structure to find subspaces closer to the target vector, and then compares all the vectors in these subspaces (The number of vectors being compared should not be less than `search_k`) to obtain the final result. Obviously, when the target vector is close to the edge of a certain subspace, sometimes it is necessary to greatly increase the number of searched subspaces to obtain a high recall rate. Therefore, ANNOY uses `n_trees` different methods to divide the whole space, and searches all the dividing methods simultaneously to reduce the probability that the target vector is always at the edge of the subspace.
- Index building parameters
| Parameter | Description | Range |
| --------- | ---------------------------------------- | --------- |
| `n_trees` | The number of methods of space division. | [1, 1024] |
- Search parameters
| Parameter | Description | Range |
| ---------- | ------------------------------------------------------------ | ------------------------------- |
| `search_k` | The number of nodes to search. -1 means 5% of the whole data. | {-1} [`top_k`, n × `n_trees`] |
```python
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection
# create a collection
collection_name = "milvus_ANNOY"
default_fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=d)
]default_schema = CollectionSchema(fields=default_fields, description="test collection")print(f"\nCreate collection...")collection = Collection(name= collection_name, schema=default_schema)
# insert data
mr = collection.insert([xb])print(collection.num_entities)
# create index
collection.create_index(field_name=field_name,
index_params={'index_type': 'ANNOY',
'metric_type': 'L2',
'params': {
"n_trees": 8 # int. 1~1024
}})
collection.load()
# search
top_k = 10
results = collection.search(data=xq, anns_field="vector",params={
"search_k": -1 # int. {-1} U [top_k, n*n_trees], n represents vectors count.
}, limit=top_k) # show results
for result in results:
print(result.ids)
print(result.distance)
```