milvus/docs/design_docs/20210521-datanode_recovery_...

# DataNode Recovery Design

update: 5.21.2021, by [Goose](https://github.com/XuanYang-cn)
update: 6.03.2021, by [Goose](https://github.com/XuanYang-cn)
update: 6.21.2021, by [Goose](https://github.com/XuanYang-cn)

## What's DataNode?

DataNode processes insert data and persists insert data into storage.

DataNode is based on flowgraph; each flowgraph cares about only one vchannel. There are data definition language (DDL) messages, data manipulation language (DML)
messages, and timetick messages inside one vchannel, FIFO log stream.

One vchannel only contains DML messages of one collection. A collection consists of many segments, hence
a vchannel contains DML messages of many segments. **Most importantly, the DML messages of the same segment can appear anywhere in vchannel.**

## What is the real meaning of DataNode recovery?

DataNode is stateless, but vchannel has states. DataNode's statelessness is guaranteed by DataCoord, which
means the vchannel's state is maintained by DataCoord. So DataNode recovery is no different from starting.

So what's DataNode's starting procedure?

## Objectives

### 1. Service Registration

DataNode registers itself to etcd after grpc server started, in *INITIALIZING* state.

### 2. Service Discovery

DataNode discovers DataCoord and RootCoord, in *HEALTHY* and *IDLE* state.

### 3. Flowgraph Recovery

The detailed design can be found at [datanode flowgraph recovery design](datanode_flowgraph_recovery_design_0604_2021.md).

After DataNode subscribes to a stateful vchannel, DataNode starts to work, or more specifically, flowgraph starts to work.

Vchannel is stateful because we don't want to process twice what's already processed, as a "processed" message means its
already persistent. In DataNode's terminology, a message is processed if it's been flushed.

DataCoord tells DataNode stateful vchannel info through RPC `WatchDmChannels` so that DataNode won't process
the same messages over and over again. So flowgraph needs the ability to consume messages in the middle of a vchannel.

DataNode tells DataCoord vchannel states after each flush through RPC `SaveBinlogPaths`, so that DataCoord
keeps the vchannel states updated.


## Some interface/proto designs below are outdated, will be updated soon

### 1. DataNode no longer interacts with etcd except service registering

#### DataCoord rather than DataNode saves binlog paths into etcd

   ![datanode_design](graphs/datanode_design_01.jpg)


##### DataCoord RPC Design

```proto
rpc SaveBinlogPaths(SaveBinlogPathsRequest) returns (common.Status){}
message ID2PathList {
    int64 ID = 1;
    repeated string Paths = 2;
}

message CheckPoint {
    int64 segmentID = 1;
    msgpb.MsgPosition position = 2;
    int64 num_of_rows = 3;
}

message SaveBinlogPathsRequest {
    common.MsgBase base = 1;
    int64 segmentID = 2;
    int64 collectionID = 3;
    repeated ID2PathList field2BinlogPaths = 4;
    repeated CheckPoint checkPoints = 7;
    repeated SegmentStartPosition start_positions = 6;
    bool flushed = 7;
 }
```

### 4. DataNode with collection with flowgraph with vchannel designs

#### The winner
  ![datanode_design](graphs/collection_flowgraph_relation.png)

  ![datanode_design](graphs/collection_flowgraph_1_n.png)

**O4-1.** DataNode scales flowgraph 2 Day

Change `WatchDmChannelsRequest` proto.

``` proto
message VchannelInfo {
  int64 collectionID = 1;
  string channelName = 2;
  msgpb.MsgPosition seek_position = 3;
  repeated SegmentInfo unflushedSegments = 4;
  repeated int64 flushedSegments = 5;
}

message WatchDmChannelsRequest {
  common.MsgBase base = 1;
  repeated VchannelInfo vchannels = 2;
}
```

DataNode consists of multiple DataSyncService, each service controls one flowgraph.

```go
// DataNode
type DataNode struct {
    ...
    vchan2Sync map[string]*dataSyncService
    vchan2FlushCh map[string]chan<- *flushMsg

    clearSignal chan UniqueID
    ...
}

// DataSyncService
type dataSyncService struct {
	ctx          context.Context
	fg           *flowgraph.TimeTickedFlowGraph
	flushChan    <-chan *flushMsg
	replica      Replica
	idAllocator  allocatorInterface
	msFactory    msgstream.Factory
	collectionID UniqueID
}
```

DataNode Init -> Register to etcd -> Discovery data service -> Discover master service -> IDLE

WatchDmChannels -> new dataSyncService -> HEALTH

`WatchDmChannels:`

1. If `DataNode.vchan2Sync` is empty, DataNode is in IDLE, `WatchDmChannels` will create new dataSyncService for every unique vchannel, then DataNode is in HEALTH.
2. If vchannel name of `VchannelPair` is not in `DataNode.vchan2Sync`, create a new dataSyncService.
3. If vchannel name of `VchannelPair` is in `DataNode.vchan2Sync`, ignore.

```

#### The boring design

• If collection:flowgraph = 1 : 1, datanode must have the ability to scale flowgraph.

![datanode_design](graphs/collection_flowgraph_1_1.jpg)

•** [Winner]** If collection:flowgraph = 1 : n, flowgraph:vchannel = 1:1

![datanode_design](graphs/collection_flowgraph_1_n.png)

• If collection:flowgraph = n : 1, in the blue cases, datanode must have the ability to scale flowgraph. In the brown cases, flowgraph must be able to scale channels.

![datanode_design](graphs/collection_flowgraph_n_1.jpg)

• If collection:flowgraph = n : n  , load balancing on vchannels.

![datanode_design](graphs/collection_flowgraph_n_n.jpg)