There was a race where a remote write could arrive before the corresponding
meta client cache update. When this happened, the receiving node
would drop the write because it could not determine which database
and retention policy the shard it was supposed to create belonged
to.
This change sends the db and rp along with the write so that the
receiving node does not need to consult the meta store. It also
allows us to skip sending writes for shards that no longer exist,
instead of always sending them and having the receiving node's logs
fill up with dropped write requests. This second situation can occur when
shards are deleted and some nodes still have writes queued in hinted
handoff for those shards.
Fixes #5610
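A minimal sketch of the idea (type and method names here are hypothetical, not the actual cluster API): the shard write request carries the database and retention policy, so the receiver can create a missing shard without a meta store lookup.

```go
package cluster

// WriteShardRequest is a hypothetical illustration: the coordinating node
// includes the database and retention policy with the shard write, so the
// receiver never has to ask the meta store where a missing shard belongs.
type WriteShardRequest struct {
	ShardID         uint64
	Database        string
	RetentionPolicy string
	Points          [][]byte // points already marshaled by the sender
}

// TSDBStore is the minimal surface the receiver needs for this sketch.
type TSDBStore interface {
	CreateShard(database, retentionPolicy string, shardID uint64) error
	WriteToShard(shardID uint64, points [][]byte) error
}

// processWriteShard creates the shard (if needed) straight from the
// self-describing request, then applies the write.
func processWriteShard(store TSDBStore, req *WriteShardRequest) error {
	if err := store.CreateShard(req.Database, req.RetentionPolicy, req.ShardID); err != nil {
		return err
	}
	return store.WriteToShard(req.ShardID, req.Points)
}
```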
The RPC handler for remote queries would attempt to reuse a closed
connection for certain commands that didn't use pooling. The RPC
commands that close the connection have been fixed so they no longer
attempt to reuse it.
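The pattern the fix implies can be sketched like this (a wrapper type invented for illustration, not the real connection type): commands that terminate the stream mark the connection unusable so it is discarded rather than handed back for reuse.

```go
package cluster

import "net"

// pooledConn is an illustrative wrapper around a pooled network connection.
// Commands that close the underlying stream call MarkUnusable so Close
// discards the connection instead of recycling it.
type pooledConn struct {
	net.Conn
	pool     chan net.Conn
	unusable bool
}

// MarkUnusable flags the connection so it will not be reused.
func (c *pooledConn) MarkUnusable() { c.unusable = true }

// Close either returns a healthy connection to the pool or really closes it.
func (c *pooledConn) Close() error {
	if c.unusable {
		return c.Conn.Close()
	}
	select {
	case c.pool <- c.Conn: // recycle
		return nil
	default: // pool full; drop the connection
		return c.Conn.Close()
	}
}
```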
When creating an iterator, if there are no points to return, the points
decoder would hit an EOF that it did not catch and would return that
error to the client that made the request. It now properly returns
no points by using a `nilFloatIterator` when there are no points to
return.
This fixes remote execution when a cluster has nothing to return.
Fixes #5612, #5573 and #5518.
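A rough sketch of the idea with simplified stand-in types (the real influxql iterator interfaces differ): an iterator that always reports "no more points" is handed back instead of surfacing the EOF to the client.

```go
package cluster

// FloatPoint and FloatIterator are simplified stand-ins for the real
// iterator types; only the shape matters for this sketch.
type FloatPoint struct {
	Time  int64
	Value float64
}

type FloatIterator interface {
	Next() *FloatPoint
	Close() error
}

// nilFloatIterator is returned when a remote node has nothing to send
// back: every call to Next reports "no more points" instead of letting
// an uncaught EOF leak to the caller.
type nilFloatIterator struct{}

func (*nilFloatIterator) Next() *FloatPoint { return nil }
func (*nilFloatIterator) Close() error      { return nil }
```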
Using the MetaExecuter, queries that need to run on both data nodes
and optionally the meta store will be executed across all data nodes
in the cluster.
Fixes #5653 and #5394.
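The distribution step could look roughly like this (an illustrative helper, not the actual MetaExecuter code): fan the statement out to every data node concurrently and surface the first error.

```go
package cluster

import "sync"

// executeOnAllDataNodes runs the given function against every data node
// concurrently and returns the first error encountered. The real code
// plumbs results back through the query engine; this only shows the
// fan-out step.
func executeOnAllDataNodes(nodeIDs []uint64, exec func(nodeID uint64) error) error {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	for _, id := range nodeIDs {
		wg.Add(1)
		go func(id uint64) {
			defer wg.Done()
			if err := exec(id); err != nil {
				mu.Lock()
				if firstErr == nil {
					firstErr = err
				}
				mu.Unlock()
			}
		}(id)
	}
	wg.Wait()
	return firstErr
}
```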
Previously, dropping retention policies did not propagate to local TSDB
shards. Instead, the retention policies would just be removed from the
Meta Store.
This PR ensures that data associated with retention policies is
removed when the retention policy is dropped.
Also, it cleans up a couple of other methods in `tsdb`, including
removing the requirement to provide (redundant) shardIDs when deleting databases.
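A simplified sketch of the propagation, using invented types rather than the real `tsdb` API: dropping a retention policy removes the matching local shards and their on-disk data, not just the meta store entry.

```go
package tsdb

import (
	"os"
	"sync"
)

// localShard is a simplified stand-in for a shard: just enough state to
// show which database/retention policy it belongs to and where its data
// lives on disk.
type localShard struct {
	database        string
	retentionPolicy string
	path            string
}

// shardStore is an illustrative (not the real) store keyed by shard ID.
type shardStore struct {
	mu     sync.Mutex
	shards map[uint64]*localShard
}

// DeleteRetentionPolicy removes every local shard belonging to the dropped
// database/retention policy, including its on-disk data, so the drop is
// not just a meta store change.
func (s *shardStore) DeleteRetentionPolicy(database, name string) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	for id, sh := range s.shards {
		if sh.database != database || sh.retentionPolicy != name {
			continue
		}
		if err := os.RemoveAll(sh.path); err != nil {
			return err
		}
		delete(s.shards, id)
	}
	return nil
}
```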
Windows does not time out when the timeout is set to 1 nanosecond. This causes TestShardWriter_Write_ErrDialTimeout to fail on Windows.
```
=== RUN TestShardWriter_Write_ErrDialTimeout
[cluster] 2016/02/10 09:32:30 Starting cluster service
[cluster] 2016/02/10 09:32:30 accept remote connection from 127.0.0.1:57034
[cluster] 2016/02/10 09:32:30 unable to read type-length-value read message type: WSARecv tcp 127.0.0.1:57033: use of closed network connection
[cluster] 2016/02/10 09:32:30 close remote connection from 127.0.0.1:57034
[cluster] 2016/02/10 09:32:30 cluster service accept error: network connection closed
--- FAIL: TestShardWriter_Write_ErrDialTimeout (0.00s)
shard_writer_test.go:162: expected error <nil>, to contain i/o timeout
```
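For reference, one portable way to provoke a dial timeout in a test, regardless of how the OS handles very small deadlines, is to dial a non-routable address with a small but non-zero timeout. This is only a sketch of that approach, not the actual fix:

```go
package cluster_test

import (
	"net"
	"testing"
	"time"
)

// TestDialTimeout dials a non-routable address (which typically hangs
// until the deadline) with a small but non-zero timeout, instead of
// relying on a 1ns timeout that Windows does not enforce.
func TestDialTimeout(t *testing.T) {
	_, err := net.DialTimeout("tcp", "10.255.255.1:12345", 50*time.Millisecond)
	nerr, ok := err.(net.Error)
	if !ok || !nerr.Timeout() {
		t.Fatalf("expected timeout error, got %v", err)
	}
}
```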
Parsing the line protocol again on the receiving side of the remote
write consumes a lot of CPU. This change uses a different marshaling
format that is much faster to parse, since the point has already been
parsed on the write side.
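The idea can be illustrated with a generic encoder (the actual code uses its own compact binary format; `gob` here is just a stand-in, and the types are invented): the point is parsed once on the write side, and the receiver decodes a structure instead of re-parsing line protocol.

```go
package cluster

import (
	"bytes"
	"encoding/gob"
	"time"
)

// marshaledPoint is an illustrative already-parsed representation of a
// point. Real field values can be several types; floats are used here
// to keep the sketch simple.
type marshaledPoint struct {
	Name   string
	Tags   map[string]string
	Fields map[string]float64
	Time   time.Time
}

// encodePoints serializes parsed points on the write side.
func encodePoints(pts []marshaledPoint) ([]byte, error) {
	var buf bytes.Buffer
	if err := gob.NewEncoder(&buf).Encode(pts); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// decodePoints recovers the structures directly, with no line-protocol
// parsing on the receiving side.
func decodePoints(b []byte) ([]marshaledPoint, error) {
	var pts []marshaledPoint
	err := gob.NewDecoder(bytes.NewReader(b)).Decode(&pts)
	return pts, err
}
```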
Under highly concurrent write load, the coordinating node would
create a connection to any other node that is part of the replica
group. Since each connection can be expensive, OOM situations could
occur because there was no bound on the number of new connections
that would be created. If writes on a remote node were slow, connections
could pile up and exacerbate the problem.
This switches the pool to be bounded and adds a checkout that blocks
with a timeout. If a connection is available, it's returned immediately.
If the pool still has room for more connections, it will create one if needed.
Otherwise, the call will block until a connection becomes available or
the timeout expires. In the case of a timeout, it is propagated back up
to the PointsWriter, which determines what to return to the client.
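A condensed sketch of such a pool (names and details are illustrative, not the actual implementation):

```go
package cluster

import (
	"errors"
	"net"
	"time"
)

// boundedPool is a simplified bounded connection pool with a blocking,
// timed checkout. idle holds reusable connections; slots caps the total
// number of connections that may exist at once.
type boundedPool struct {
	idle  chan net.Conn
	slots chan struct{} // one token per connection the pool may create
	dial  func() (net.Conn, error)
}

func newBoundedPool(capacity int, dial func() (net.Conn, error)) *boundedPool {
	return &boundedPool{
		idle:  make(chan net.Conn, capacity),
		slots: make(chan struct{}, capacity),
		dial:  dial,
	}
}

// Get returns an idle connection immediately if one is available, dials a
// new one if the pool is under capacity, and otherwise blocks until a
// connection is released or the timeout expires. A timeout error is
// propagated back to the caller.
func (p *boundedPool) Get(timeout time.Duration) (net.Conn, error) {
	select {
	case c := <-p.idle:
		return c, nil
	default:
	}

	select {
	case c := <-p.idle:
		return c, nil
	case p.slots <- struct{}{}: // room to grow: create a new connection
		c, err := p.dial()
		if err != nil {
			<-p.slots // give the slot back on failure
			return nil, err
		}
		return c, nil
	case <-time.After(timeout):
		return nil, errors.New("connection pool: checkout timed out")
	}
}

// Put returns a connection for reuse.
func (p *boundedPool) Put(c net.Conn) {
	select {
	case p.idle <- c:
	default:
		<-p.slots // pool full; drop the connection and free its slot
		c.Close()
	}
}
```

Using one buffered channel as the idle list and a second as a capacity semaphore keeps both the bound and the blocking checkout without explicit locking.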
Go style -- and existing runtime stats -- do not use underscores, but
instead use camel case. This change makes the internal stats adhere to
that convention.
NaN float values are not supported by the existing engine or the tsm1
engine. This changes NewPoint to return an error if a field value
contains NaN. It also allows us to validate fields to prevent
other unsupported types from sneaking in through other input plugins.
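The validation step might look roughly like this (function name and error text are illustrative, not the exact models code):

```go
package models

import (
	"fmt"
	"math"
)

// validateFields sketches the check described above: reject NaN float
// field values, and provide a single place to reject other unsupported
// types later.
func validateFields(fields map[string]interface{}) error {
	for name, value := range fields {
		if v, ok := value.(float64); ok && math.IsNaN(v) {
			return fmt.Errorf("NaN is an unsupported value for field %s", name)
		}
	}
	return nil
}
```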