Go style -- and existing runtime stats -- do not use underscores, but
instead use camel case. This change makes the internal stats adhere to
that convention.
Float values are not supported in the existing engine and the tsm1
engines. This changes NewPoint to return an error if a field value
contains a NaN field. It also allows us to validate fields to prevent
other unsupported types from sneaking in through other input plugins.
Everytime the purge check was running, a new segment was being added.
This meant the list of almost-empty files in the HH directories would
grow continually.
I believe this change address the issues with hinted-handoff not fully replicating all data to nodes that come back online after an outage.. A detailed explanation follows.
During testing of of hinted-handoff (HH) under various scenarios, HH stats showed that the HH Processor was occasionally encountering errors while unmarshalling hinted data. This error was not handled completely correctly, and in clusters with more than 3 nodes, this could cause the HH service to stall until the node was restarted. This was the high-level reason why HH data was not being replicated.
Furthermore by watching, at the byte-level, the hinted-handoff data it could be seen that HH segment block lengths were getting randomly set to 0, but the block data itself was fine (Block data contains hinted writes). This was the root cause of the unmarshalling errors outlined above. This, in turn, was tracked down to the HH system opening each segment file multiple times concurrently, which was not file-level thread-safe, so these mutiple open calls were corrupting the file.
Finally, the reason a segment file was being opened multiple times in parallel was because WriteShard on the HH Processor was checking for node queues in an unsafe manner. Since WriteShard can be called concurrently this was adding queues for the same node more than once, and each queue-addition results in opening segment files.
This change fixes the locking in WriteShard such the check for an existing HH queue for a given node is performed in a synchronized manner.
Without this change if hinted-handoff was disabled the service would
correctly reject writes, but it would process any data sitting in
hinted-handoff queues. With this change the service is completely
disabled.
If the hinted handoff segment is corrupt, the size read could be
invalid and attempting to create a slice using that size causes
a panic. Ideally, we'd have a checksum on the seqment record but
for now just return an error when the size is larger than the
segment file.
Fixes#3687
* Capitalize first letter of message
* Log all services staring consistently
* Remove some extraneous log statements in meta.Store
* Log data dirs for meta, data and hinted handoff
A short write has occurred and we do not have enough bytes to determine
the size of the payload. This is corrupted record that we should drop.
Instead of panicing, log the error and advance the queue since the error
at this location is unreoverable currently.
Fixes#3436
Basic throughput limiting to dynamically maintain a bytes/sec
limit if configured for hinted handoff retry queues. If a batch is
larger than the limit, the limit will slow the processing down to one
write per second.
By default, the limit is unbounded. It can be enabled to help prevent
retstarting nodes that have queued writes for them from being overloaded
at startup.
The advance function was not writing the updated position in the
queue until after the next advance call was called. Resulted in
the very last message would get replayed on restart each time.