Create Collection

Milvus 2.0 uses a Collection to represent a set of data, similar to a table in a traditional database. Users can create or drop a Collection. This article walks through the execution path of CreateCollection; by the end of it, you should know which components are involved in CreateCollection.

The execution flow of CreateCollection is shown in the following figure:

[Figure: create_collection]

  1. First, the SDK sends a CreateCollection request to Proxy via gRPC. The proto is defined as follows:
service MilvusService {
    ...

    rpc CreateCollection(CreateCollectionRequest) returns (common.Status) {}

    ...
}

message CreateCollectionRequest {
  // Not useful for now
  common.MsgBase base = 1;
  // Not useful for now
  string db_name = 2;
  // The unique collection name in milvus.(Required)
  string collection_name = 3;
  // The serialized `schema.CollectionSchema`(Required)
  bytes schema = 4;
  // Once set, no modification is allowed (Optional)
  // https://github.com/milvus-io/milvus/issues/6690
  int32 shards_num = 5;
}

message CollectionSchema {
  string name = 1;
  string description = 2;
  bool autoID = 3; // deprecated later, keep compatible with c++ part now
  repeated FieldSchema fields = 4;
}

  1. On receiving the CreateCollection request, Proxy wraps it into a CreateCollectionTask and pushes the task into DdTaskQueue. Proxy then calls the WaitToFinish method, which blocks until the task is finished.
type task interface {
	TraceCtx() context.Context
	ID() UniqueID // return ReqID
	SetID(uid UniqueID) // set ReqID
	Name() string
	Type() commonpb.MsgType
	BeginTs() Timestamp
	EndTs() Timestamp
	SetTs(ts Timestamp)
	OnEnqueue() error
	PreExecute(ctx context.Context) error
	Execute(ctx context.Context) error
	PostExecute(ctx context.Context) error
	WaitToFinish() error
	Notify(err error)
}

type createCollectionTask struct {
	Condition
	*milvuspb.CreateCollectionRequest
	ctx       context.Context
	rootCoord types.RootCoord
	result    *commonpb.Status
	schema    *schemapb.CollectionSchema
}
  1. A background service in Proxy takes the CreateCollectionTask from DdTaskQueue and executes it in three phases.

    • PreExecute: performs static checks, such as whether the Collection name and Field names are legal and whether there are duplicate columns.
    • Execute: Proxy sends a CreateCollection request to RootCoord via gRPC and waits for the response. The proto is defined as follows:
    service RootCoord {
        ...
    
        rpc CreateCollection(milvus.CreateCollectionRequest) returns (common.Status){}
    
        ...
    }
    
    • PostExecute: CreateCollectionTask does nothing in this phase and returns directly.
  2. RootCoord wraps the CreateCollection request into a CreateCollectionReqTask and then calls the function executeTask. executeTask does not return until the context is done or CreateCollectionReqTask.Execute has returned.

type reqTask interface {
	Ctx() context.Context
	Type() commonpb.MsgType
	Execute(ctx context.Context) error
	Core() *Core
}

type CreateCollectionReqTask struct {
	baseReqTask
	Req *milvuspb.CreateCollectionRequest
}
  1. CreateCollectionReqTask.Execute allocates the CollectionID and the default PartitionID, sets the Virtual Channels and Physical Channels used by MsgStream, and then writes the Collection's meta into metaTable.

  2. Once the Collection's meta has been written into metaTable, Milvus considers the collection to have been created successfully.

  3. Before writing the Collection's meta into metaTable, RootCoord allocates a timestamp from TSO; this timestamp is regarded as the point in time at which the collection was created.

  4. Finally, RootCoord sends a CreateCollectionRequest message into MsgStream, and the other components that have subscribed to the MsgStream are notified. The proto of CreateCollectionRequest is defined as follows:

message CreateCollectionRequest {
  common.MsgBase base = 1;
  string db_name = 2;
  string collectionName = 3;
  string partitionName = 4;
  int64 dbID = 5;
  int64 collectionID = 6;
  int64 partitionID = 7;
  // `schema` is the serialized `schema.CollectionSchema`
  bytes schema = 8;
  repeated string virtualChannelNames = 9;
  repeated string physicalChannelNames = 10;
}

  1. After the above operations, RootCoord updates its internal timestamp and returns, and Proxy then receives the response.

Notes:

  1. In Proxy, every DDL request is wrapped into a task, and the task is pushed into DdTaskQueue. A background service reads a new task from DdTaskQueue only after the previous one has finished, so all DDL requests are executed serially on Proxy.

  2. In RootCoord, every DDL request is wrapped into a reqTask, but there is no task queue, so DDL requests are executed in parallel on RootCoord.