2015/04/08 22:27:01 no broker or server configured to handle messaging endpoints
2015/04/08 22:27:02 join: failed to connect data node: http://box296:9012: unable to join
2015/04/08 22:27:02 join: failed to connect data node to any specified server
When joining a data-only node to a broker and another data-only node, there is a race
between the data node heartbeater and the join operation. If the heartbeater
fires before the join attempt, it's possible for the booting data node
to be selected as the first data node for redirection by the broker.
The join attempt would request a data node endpoint on the broker "/data_nodes"
but since the broker cannot handle it, it would redirect to a valid broker.
During this race, the broker would redirect the request back to the same server. If
this happens, the data node would get stuck and not be able to join because it's
still booting.
To work around this, the redirect is randomized and the join call will not attempt
to call itself; it re-requests the original URL instead. A better fix might be to
not start the heartbeater until after the data node has joined or initialized.
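
A minimal sketch of the workaround described above, not the actual influxd code.
The function names (pickRedirect, nextJoinURL) and the URLs are hypothetical and
only illustrate the two halves: a random choice on the broker side, and a
fallback to the original URL on the joining side.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pickRedirect models the broker side: it picks one data node URL at random.
// intn must return a value in [0, n), like rand.Intn.
func pickRedirect(dataNodes []string, intn func(n int) int) (string, bool) {
	if len(dataNodes) == 0 {
		return "", false
	}
	return dataNodes[intn(len(dataNodes))], true
}

// nextJoinURL models the joining side: when the redirect points back at the
// joining node itself, it falls back to the original URL.
func nextJoinURL(original, redirect, self string) string {
	if redirect == self {
		return original
	}
	return redirect
}

func main() {
	// Hypothetical addresses, for illustration only.
	nodes := []string{"http://node1:8086", "http://node2:8086"}
	self := "http://node2:8086"
	original := "http://broker:8086/data_nodes"

	redirect, ok := pickRedirect(nodes, rand.Intn)
	if !ok {
		fmt.Println("no data nodes registered")
		return
	}
	fmt.Println("next request:", nextJoinURL(original, redirect, self))
}
```

Because the redirect is random, re-requesting the original URL will eventually
land on a node other than the one that is still booting.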
If the node is running a broker and a data node, always have the
data node client connect to the local broker since it will already
be initialized or joined.
The previous limit of 3 was fairly arbitrary and would cause errors such as:
2015/04/08 14:01:12 join: failed to connect data node: {http <nil> influxdb.local:8191 }: unable to join
2015/04/08 14:01:12 join: failed to connect data node to any specified server
in the tests. This can happen when the nodes are slow to start up. The limit is now
set arbitrarily higher to avoid this error, but the join still gives up if it can't
connect after a minute.
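
For illustration, a bounded retry loop of this kind could look like the sketch
below. The name joinWithRetry, the attempt count, and the interval are
assumptions and do not reflect the actual values used in influxd.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady stands in for the "unable to join" failure from a slow node.
var errNotReady = errors.New("unable to join")

// joinWithRetry calls try up to maxAttempts times and sleeps for interval
// between failed attempts. It returns nil on the first success, or the last
// error wrapped once the attempts are used up.
func joinWithRetry(try func() error, maxAttempts int, interval time.Duration) error {
	err := errNotReady
	for i := 0; i < maxAttempts; i++ {
		if err = try(); err == nil {
			return nil
		}
		if i < maxAttempts-1 {
			time.Sleep(interval)
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	// Simulate a node that only becomes reachable on the third attempt.
	calls := 0
	try := func() error {
		calls++
		if calls < 3 {
			return errNotReady
		}
		return nil
	}
	// Illustrative numbers: attempts * interval bounds the total wait.
	err := joinWithRetry(try, 60, 10*time.Millisecond)
	fmt.Println("attempts:", calls, "err:", err)
}
```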
Removing this option causes issues when deploying influxd
via configuration management. We can now define the same
set of join URLs in the config file across nodes.
This also ensures that the `-flag` option overrides the
config file setting if passed.
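
A small sketch of the precedence rule, assuming a hypothetical flag named
"join" and a hypothetical resolveJoinURLs helper; the real option and config
key names are not shown here.

```go
package main

import (
	"flag"
	"fmt"
)

// resolveJoinURLs returns the command-line value when one was passed and
// falls back to the config file value otherwise.
func resolveJoinURLs(flagValue, configValue string) string {
	if flagValue != "" {
		return flagValue
	}
	return configValue
}

func main() {
	// "join" is a hypothetical flag name used only for this example.
	fs := flag.NewFlagSet("example", flag.ContinueOnError)
	joinFlag := fs.String("join", "", "comma-separated join URLs")
	if err := fs.Parse([]string{"-join", "http://node3:8086"}); err != nil {
		fmt.Println("parse error:", err)
		return
	}

	// Value that would normally come from the shared config file.
	configValue := "http://node1:8086,http://node2:8086"
	fmt.Println(resolveJoinURLs(*joinFlag, configValue))
}
```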
The timeout goroutine would continue to run (until the timeout)
even after queryAndWait returned. This caused thousands of extra
goroutines to linger around and made the test stack traces very
difficult to read.
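
One common way to avoid a lingering timeout goroutine is to wait on a timer
that is stopped when the function returns. The sketch below shows that general
pattern under assumed names (waitWithTimeout, errTimeout); it is not
necessarily how queryAndWait was changed.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTimeout = errors.New("timed out")

// waitWithTimeout waits for a value on result or gives up after d.
// No extra goroutine is started, and the timer is stopped on return,
// so nothing is left running after an early result.
func waitWithTimeout(result <-chan string, d time.Duration) (string, error) {
	timer := time.NewTimer(d)
	defer timer.Stop()

	select {
	case v := <-result:
		return v, nil
	case <-timer.C:
		return "", errTimeout
	}
}

func main() {
	result := make(chan string, 1)
	result <- "done"

	v, err := waitWithTimeout(result, time.Second)
	fmt.Println(v, err)
}
```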
This commit changes the binary format of messaging.Message to encode
a 4-byte checksum at the beginning of it. This is used when reading
data back out to verify that it is not corrupt.
Corrupted messages are truncated on recovery so the broker can
restart from the previous message.
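
To illustrate the idea, the sketch below prefixes a payload with a 4-byte
checksum and verifies it on read. The choice of CRC-32 (IEEE), big-endian
byte order, and the encode/decode helpers are assumptions; the actual
messaging.Message layout has more fields and may differ.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
)

var errChecksum = errors.New("checksum mismatch")

// encode returns the payload prefixed with a 4-byte checksum.
func encode(payload []byte) []byte {
	buf := make([]byte, 4+len(payload))
	binary.BigEndian.PutUint32(buf[:4], crc32.ChecksumIEEE(payload))
	copy(buf[4:], payload)
	return buf
}

// decode verifies the leading checksum and returns the payload.
// A short buffer or a mismatch is reported as errChecksum.
func decode(buf []byte) ([]byte, error) {
	if len(buf) < 4 {
		return nil, errChecksum
	}
	want := binary.BigEndian.Uint32(buf[:4])
	payload := buf[4:]
	if crc32.ChecksumIEEE(payload) != want {
		return nil, errChecksum
	}
	return payload, nil
}

func main() {
	msg := encode([]byte("hello"))

	payload, err := decode(msg)
	fmt.Println(string(payload), err)

	// Corrupt one payload byte; decode should now reject the message.
	msg[len(msg)-1] ^= 0xFF
	_, err = decode(msg)
	fmt.Println(err)
}
```

A reader that hits errChecksum during recovery can truncate the log at the
start of that message, which matches the recovery behavior described above.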