Binary embeddings
| Distance Metrics | Index Types |
|---|---|
| Jaccard, Tanimoto, Hamming | BIN_FLAT, BIN_IVF_FLAT |
| Superstructure, Substructure | BIN_FLAT |
Jaccard distance
The Jaccard similarity coefficient measures the similarity between two sample sets. It is defined as the cardinality of their intersection divided by the cardinality of their union, and it can only be applied to finite sample sets.
Jaccard distance measures the dissimilarity between data sets and is obtained by subtracting the Jaccard similarity coefficient from 1. For binary variables, the Jaccard similarity coefficient is equivalent to the Tanimoto coefficient.
![Jaccard distance](https://github.com/milvus-io/milvus-docs/blob/v2.0.0/assets/jaccard_dist.png)
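The definition above can be sketched in a few lines of Python. This is an illustration only, not Milvus's implementation: the function name `jaccard_distance` is hypothetical, bit sets are stored as Python integers, and returning 0 for two empty sets is a convention chosen for this sketch.

```python
def jaccard_distance(a: int, b: int) -> float:
    """Jaccard distance between two bit sets stored as Python ints."""
    intersection = bin(a & b).count("1")  # bits set in both
    union = bin(a | b).count("1")         # bits set in either
    if union == 0:
        return 0.0  # assumption: two empty sets are treated as identical
    return 1.0 - intersection / union


# 4 shared bits out of 6 bits in the union, so the distance is 1 - 4/6
print(jaccard_distance(0b11011001, 0b10011101))
```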
Tanimoto distance
For binary variables, the Tanimoto coefficient is equivalent to the Jaccard similarity coefficient:
$$T(A, B) = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$
In Milvus, the Tanimoto coefficient applies only to binary variables, where it ranges from 0 to +1 (+1 is the highest similarity).
For binary variables, the formula of Tanimoto distance is:
$$d_T(A, B) = -\log_2 T(A, B)$$
The value ranges from 0 to +infinity.
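As a rough sketch, the Tanimoto distance can be computed as below. It assumes the distance is the negative base-2 logarithm of the Tanimoto coefficient, which is consistent with the stated ranges (a coefficient in 0 to 1 maps to a distance in 0 to +infinity). The function names are hypothetical, and bit sets are stored as Python integers.

```python
import math


def tanimoto_coefficient(a: int, b: int) -> float:
    """Shared bits divided by (bits in a + bits in b - shared bits)."""
    shared = bin(a & b).count("1")
    total = bin(a).count("1") + bin(b).count("1") - shared
    return shared / total if total else 0.0  # assumption: 0 for two empty sets


def tanimoto_distance(a: int, b: int) -> float:
    """Assumed form: -log2(coefficient), written as log2(1 / coefficient)."""
    coeff = tanimoto_coefficient(a, b)
    if coeff == 0.0:
        return math.inf  # no shared bits: the distance is unbounded
    return math.log2(1.0 / coeff)


print(tanimoto_distance(0b11011001, 0b10011101))
```

Identical vectors give a distance of 0, and vectors with no shared bits give an infinite distance, matching the range stated above.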
Hamming distance
Hamming distance measures binary data strings. The distance between two strings of equal length is the number of bit positions at which the bits are different.
For example, suppose there are two strings, 1101 1001 and 1001 1101.
11011001 ⊕ 10011101 = 01000100. Since this contains two 1s, the Hamming distance d(11011001, 10011101) = 2.
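The same calculation can be written as a short Python sketch: XOR the two strings and count the 1 bits in the result. The function name `hamming_distance` is hypothetical and is not part of the Milvus API.

```python
def hamming_distance(a: str, b: str) -> int:
    """Number of positions at which two equal-length bit strings differ."""
    if len(a) != len(b):
        raise ValueError("bit strings must have equal length")
    # XOR leaves a 1 exactly where the two inputs differ
    return bin(int(a, 2) ^ int(b, 2)).count("1")


print(hamming_distance("11011001", "10011101"))  # prints 2
```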
Superstructure
The Superstructure metric measures the similarity between a chemical structure and its superstructure. The smaller the value, the more similar the structure is to its superstructure. Currently, only vectors whose distance equals 0 can be found.
Superstructure similarity can be measured by:
$$1 - \frac{N_{AB}}{N_A}$$
Where
- B is the superstructure of A.
- $N_A$ specifies the number of bits in the fingerprint of molecule A.
- $N_B$ specifies the number of bits in the fingerprint of molecule B.
- $N_{AB}$ specifies the number of shared bits in the fingerprints of molecules A and B.
Substructure
The Substructure metric measures the similarity between a chemical structure and its substructure. The smaller the value, the more similar the structure is to its substructure. Currently, only vectors whose distance equals 0 can be found.
Substructure similarity can be measured by:
$$1 - \frac{N_{AB}}{N_B}$$
Where
- B is the substructure of A.
- $N_A$ specifies the number of bits in the fingerprint of molecule A.
- $N_B$ specifies the number of bits in the fingerprint of molecule B.
- $N_{AB}$ specifies the number of shared bits in the fingerprints of molecules A and B.
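The two structure metrics can be illustrated with a small Python sketch. It assumes the superstructure distance is 1 minus the shared bits divided by the bits of A, and the substructure distance is 1 minus the shared bits divided by the bits of B. Both forms are assumptions chosen so that the distance is 0 exactly when the stated relation holds. Fingerprints are stored as Python integers and all function names are hypothetical.

```python
def popcount(x: int) -> int:
    """Number of set bits in a fingerprint stored as a Python int."""
    return bin(x).count("1")


def superstructure_distance(a: int, b: int) -> float:
    """Assumed form 1 - N_AB / N_A: zero when every bit of a is also set in b."""
    n_a = popcount(a)
    if n_a == 0:
        raise ValueError("fingerprint A has no set bits")
    return 1.0 - popcount(a & b) / n_a


def substructure_distance(a: int, b: int) -> float:
    """Assumed form 1 - N_AB / N_B: zero when every bit of b is also set in a."""
    n_b = popcount(b)
    if n_b == 0:
        raise ValueError("fingerprint B has no set bits")
    return 1.0 - popcount(a & b) / n_b


# 0b1101 contains every bit of 0b0101, so it is a superstructure of it
print(superstructure_distance(0b0101, 0b1101))
print(substructure_distance(0b1101, 0b0101))
```

Because only results with distance 0 are returned, a search with these metrics behaves like a containment filter rather than a ranked similarity search.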
The following Python example creates a collection of binary vectors, builds an index with the Jaccard metric, and runs a search:
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection
import random

# connect to a running Milvus instance (host and port are assumptions; adjust to your deployment)
connections.connect(alias="default", host="localhost", port="19530")

# create a collection; binary metrics such as JACCARD require a BINARY_VECTOR field
collection_name = "milvus_test"
dim = 128  # number of bits per vector; must be a multiple of 8
default_fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="vector", dtype=DataType.BINARY_VECTOR, dim=dim),
]
default_schema = CollectionSchema(fields=default_fields, description="test collection")
print("\nCreate collection...")
collection = Collection(name=collection_name, schema=default_schema)

# insert data: each binary vector is packed into dim // 8 bytes
vectors = [bytes(random.getrandbits(8) for _ in range(dim // 8)) for _ in range(10)]
entities = [vectors]
mr = collection.insert(entities)
print(collection.num_entities)

# create index: BIN_IVF_FLAT takes nlist; M and efConstruction belong to HNSW
collection.create_index(
    field_name="vector",
    index_params={
        "index_type": "BIN_IVF_FLAT",
        "metric_type": "JACCARD",
        "params": {"nlist": 128},
    },
)
collection.load()

# search
top_k = 10
search_params = {"metric_type": "JACCARD", "params": {"nprobe": 10}}
results = collection.search(vectors[:5], anns_field="vector", param=search_params, limit=top_k)

# show results
for result in results:
    print(result.ids)
    print(result.distances)