jingkl edited this page 2021-10-22 16:16:12 +08:00

Floating-point embeddings

Similarity Metrics
  • Euclidean distance (L2)
  • Inner product (IP)

Index Types
  • FLAT
  • IVF_FLAT
  • IVF_SQ8
  • IVF_PQ
  • HNSW
  • ANNOY

Euclidean distance (L2)

Essentially, Euclidean distance measures the length of the line segment that connects two points.

The formula for Euclidean distance is as follows:

$$d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$

where a = (a_1, a_2, ..., a_n) and b = (b_1, b_2, ..., b_n) are two points in n-dimensional Euclidean space.

It's the most commonly used distance metric and is very useful when the data is continuous.
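
As a minimal sketch of the definition above, the following plain-Python function computes the L2 distance between two vectors of equal length. The name `l2_distance` is illustrative only and is not part of pymilvus.

```python
import math

def l2_distance(a, b):
    """Return the Euclidean (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    # Sum the squared per-dimension differences, then take the square root.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(l2_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

A smaller value means the two points are closer, so nearest-neighbor results under L2 are ranked in ascending order of distance.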

Inner product (IP)

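The IP distance between two embeddings is defined as follows:

$$p(A, B) = A \cdot B = \sum_{i=1}^{n} a_i b_i = \|A\| \, \|B\| \cos\theta$$

where A = (a_1, a_2, ..., a_n) and B = (b_1, b_2, ..., b_n) are embeddings, ||A|| and ||B|| are the norms of A and B, and θ is the angle between them.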
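
A small sketch of the inner product in plain Python may help here. The helper name `inner_product` is hypothetical and not a pymilvus function. The example also shows that scaling one vector scales the raw inner product by the same factor, so the unnormalized value depends on vector magnitude.

```python
def inner_product(a, b):
    """Return the inner (dot) product of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    # Multiply matching components and add the products.
    return sum(x * y for x, y in zip(a, b))

a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
print(inner_product(a, b))  # 32.0

# Doubling every component of a doubles the inner product.
a_scaled = [2.0 * x for x in a]
print(inner_product(a_scaled, b))  # 64.0
```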

IP is more useful when you care about the orientation of the vectors rather than their magnitude.

If you use IP to calculate embedding similarities, you must normalize your embeddings. After normalization, the inner product equals cosine similarity.

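Suppose X' is normalized from embedding X:

$$X' = (x'_1, x'_2, \ldots, x'_n), \quad X' \in \mathbb{R}^n$$

The relationship between the two embeddings is as follows:

$$x'_i = \frac{x_i}{\|X\|} = \frac{x_i}{\sqrt{\sum_{j=1}^{n} x_j^2}}$$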
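
The normalization step can be sketched in plain Python as follows. The helper names (`normalize`, `inner_product`, `cosine_similarity`) are illustrative and not part of pymilvus. The example checks that the inner product of two normalized vectors matches the cosine similarity of the original vectors.

```python
import math

def inner_product(a, b):
    """Return the inner (dot) product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def normalize(x):
    """Scale x to unit length by dividing each component by the L2 norm."""
    norm = math.sqrt(inner_product(x, x))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in x]

def cosine_similarity(a, b):
    """Return the cosine of the angle between a and b."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)

a = [3.0, 4.0]
b = [4.0, 3.0]

# Inner product after normalization versus cosine similarity of the originals.
ip_normalized = inner_product(normalize(a), normalize(b))
print(math.isclose(ip_normalized, cosine_similarity(a, b)))  # True
```

In practice this means vectors would be normalized once, before insertion, and queries normalized the same way before searching with the IP metric.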

For example, in Python:

from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection

# connect to Milvus (host and port are assumptions; adjust them for your deployment)
connections.connect(alias="default", host="localhost", port="19530")

# create a collection
collection_name = "milvus_test"
dim = 8  # must match the dimension of the vectors inserted below
default_fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="vector", dtype=DataType.FLOAT_VECTOR, dim=dim)
]
default_schema = CollectionSchema(fields=default_fields, description="test collection")
print("\nCreate collection...")
collection = Collection(name=collection_name, schema=default_schema)

# insert data
import random
vectors = [[random.random() for _ in range(8)] for _ in range(10)]
entities = [vectors]
mr = collection.insert(entities)
print(collection.num_entities) 

# create index
# IVF_FLAT takes nlist; M and efConstruction are HNSW parameters
collection.create_index(field_name="vector",
                        index_params={'index_type': 'IVF_FLAT',
                                      'metric_type': 'L2',
                                      'params': {
                                        "nlist": 128  # int. 1~65536
                                      }})
collection.load()
                                     
# search
top_k = 10
search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
results = collection.search(vectors[:5], anns_field="vector", param=search_params, limit=top_k)

# show results: one result set per query vector
for result in results:
    print(result.ids)
    print(result.distances)
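
To get a feel for what a top-k search under L2 returns, here is a brute-force sketch in plain Python that compares a query against every stored vector. It is not how Milvus implements search, and `brute_force_search` is a hypothetical helper; it only illustrates ranking stored vectors by ascending distance to a query.

```python
import heapq
import math

def brute_force_search(query, vectors, top_k):
    """Return (index, distance) pairs for the top_k vectors nearest to query."""
    scored = []
    for i, v in enumerate(vectors):
        # Euclidean distance between the query and the i-th stored vector.
        dist = math.sqrt(sum((q - x) ** 2 for q, x in zip(query, v)))
        scored.append((i, dist))
    # Keep the top_k smallest distances, nearest first.
    return heapq.nsmallest(top_k, scored, key=lambda pair: pair[1])

stored = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
hits = brute_force_search([0.9, 0.0], stored, top_k=2)
print([i for i, _ in hits])  # [1, 0]
```

An approximate index such as IVF_FLAT trades some of this exhaustive accuracy for speed, which is why search parameters like nprobe exist.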