influxdb/services
Philip O'Toole 44d52ac138 Fully lock HH node queue creation
I believe this change addresses the issues with hinted-handoff not fully replicating all data to nodes that come back online after an outage. A detailed explanation follows.

During testing of hinted-handoff (HH) under various scenarios, HH stats showed that the HH Processor was occasionally encountering errors while unmarshalling hinted data. This error was not handled correctly, and in clusters with more than 3 nodes, this could cause the HH service to stall until the node was restarted. This was the high-level reason why HH data was not being replicated.

Furthermore, by watching the hinted-handoff data at the byte level, it could be seen that HH segment block lengths were getting randomly set to 0, even though the block data itself (which contains the hinted writes) was fine. This was the root cause of the unmarshalling errors outlined above. This, in turn, was tracked down to the HH system opening each segment file multiple times concurrently. Opening a segment file is not thread-safe at the file level, so these multiple open calls were corrupting the file.

Finally, the reason a segment file was being opened multiple times in parallel was that WriteShard on the HH Processor was checking for node queues in an unsafe manner. Since WriteShard can be called concurrently, this was adding queues for the same node more than once, and each queue addition results in opening segment files.

This change fixes the locking in WriteShard such that the check for an existing HH queue for a given node is performed in a synchronized manner.
2015-10-07 02:33:43 -07:00
..
admin Fix typos/spacing 2015-08-13 10:02:05 -06:00
collectd refactor Points and Rows to dedicated packages 2015-09-16 15:33:08 -05:00
continuous_querier Fix go vet warnings 2015-09-21 15:28:54 +02:00
copier Disable copier test 2015-10-05 20:09:56 -04:00
graphite Add public function to graphite parser to apply template 2015-10-06 17:42:36 -06:00
hh Fully lock HH node queue creation 2015-10-07 02:33:43 -07:00
httpd Updates based on @otoolp's PR comments 2015-10-05 20:09:56 -04:00
opentsdb refactor Points and Rows to dedicated packages 2015-09-16 15:33:08 -05:00
precreator Don't precreate shard groups entirely in past 2015-09-04 08:31:50 -07:00
retention Set default retention check interval to 30 minutes 2015-08-27 16:08:03 -07:00
snapshotter silence snapshotter logger for testing 2015-08-13 20:53:40 -05:00
udp Allow configuration of UDP retention policy 2015-09-28 15:17:56 -07:00