The monitor goroutine enables or disables compactions every 10s to spin down
(or start up) goroutines for cold shards. This frequent Lock may be
causing lock contention for writes and queries, which get blocked trying
to acquire an RLock.
Go's sync.RWMutex documentation states that new RLock calls will block
while a Lock call is pending. Switching the common path to use an RLock
avoids taking the write Lock and reduces lock contention for writes
and queries.
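
As an illustration of the read-path change, here is a minimal sketch, with hypothetical names (`engine`, `compactionsEnabled`), of checking state under an RLock and only falling back to the write Lock when something actually has to change:

```go
package example

import "sync"

type engine struct {
	mu                 sync.RWMutex
	compactionsEnabled bool
}

// enableCompactions is called periodically by a monitor goroutine.
// In the common case the desired state already matches, so only the
// read lock is taken and no pending Lock blocks other readers.
func (e *engine) enableCompactions(enabled bool) {
	e.mu.RLock()
	if e.compactionsEnabled == enabled {
		e.mu.RUnlock()
		return // nothing to do; write Lock never taken
	}
	e.mu.RUnlock()

	// Slow path: take the write lock and re-check, since the state may
	// have changed between releasing the RLock and acquiring the Lock.
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.compactionsEnabled != enabled {
		e.compactionsEnabled = enabled
		// ... start or stop compaction goroutines here ...
	}
}
```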
Currently, two write locks in `inmem` are obtained and then
manually unlocked at the function exit points. However, we have
reports of the `inmem` index hanging on a write lock, and we cannot
track the issue down to anything other than a lock that was left
held because a panic skipped the manual unlock.

This commit changes the two locks to always defer their unlocks
to prevent these hangs.
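
A sketch of the pattern change, using placeholder names rather than the actual `inmem` code:

```go
package example

import "sync"

// index stands in for the inmem index; the field names are illustrative.
type index struct {
	mu     sync.RWMutex
	series map[string]struct{}
}

func newIndex() *index {
	return &index{series: make(map[string]struct{})}
}

// Before: the lock was released manually at each return point, so a
// panic anywhere in between left it held forever and every later
// Lock/RLock hung. Deferring the unlock guarantees it runs even on panic.
func (i *index) createSeriesIfNotExists(key string) {
	i.mu.Lock()
	defer i.mu.Unlock()

	if _, ok := i.series[key]; ok {
		return
	}
	i.series[key] = struct{}{}
}
```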
This fixes the case where log files are compacted out of order
and cause non-contiguous sets of index files to be compacted.
Previously, the compaction planner would fetch a list of index files
for each level and compact them in order, starting with the oldest
ones. This can be a problem for level 1 because level 0 (log) files
are compacted individually, and in some cases a newer log file can
finish compacting before older log files do. This leaves a gap in the
list of level 1 files that is ignored when fetching the list of index
files.
Now, the planner reads the list of index files starting from the
oldest but stops once it hits a log file. This prevents that gap
from being ignored.
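
A rough sketch of that planning rule, with illustrative types rather than the planner's real API:

```go
package example

// fileInfo is a stand-in for an entry in the index file set,
// ordered oldest to newest.
type fileInfo struct {
	name  string
	isLog bool // true for level 0 (log) files
}

// planContiguous collects index files for compaction starting from the
// oldest, but stops at the first log file so that a log file which is
// still compacting never leaves an ignored gap in the set.
func planContiguous(files []fileInfo) []string {
	var plan []string
	for _, f := range files {
		if f.isLog {
			break // anything newer may still be compacting
		}
		plan = append(plan, f.name)
	}
	return plan
}
```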
This check previously lived in a different section of code and was
lost during the refactor to the new compaction strategy. Compaction
planning now checks that at least two files are available for
compaction in a level.
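
A minimal sketch of the restored guard, with hypothetical names:

```go
package example

// planLevel returns the files to compact for a level, or nil when there
// are fewer than two files, since a single file has nothing to merge with.
func planLevel(files []string) []string {
	if len(files) < 2 {
		return nil
	}
	return files
}
```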
WriteBlock was missing the check for the maximum series key length, which
allowed series keys to be written that were longer than the 2 bytes
allocated to store their length can represent. When this occurred, the
TSM file could fail to load.
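
A hedged sketch of the kind of guard that was missing, assuming the key length is stored in a 2-byte (uint16) field; the function and error names are illustrative:

```go
package example

import (
	"errors"
	"math"
)

var errKeyTooLong = errors.New("series key exceeds maximum allowed length")

// checkKeyLength rejects keys whose length cannot be represented in the
// 2-byte length field; writing such a key produces a block that cannot
// be read back correctly.
func checkKeyLength(key []byte) error {
	if len(key) > math.MaxUint16 {
		return errKeyTooLong
	}
	return nil
}
```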
The defer was never executed because the planning happens in a
long-running goroutine that loops. The plans need to be released
immediately after applying them.
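
A simplified sketch of why the defer never fired and how the release is done instead; the types and method names here are placeholders:

```go
package example

type planner struct{}

func (p *planner) Plan() []string   { return nil }
func (p *planner) Release([]string) {}

type engine struct {
	planner *planner
	done    chan struct{}
}

func (e *engine) apply(groups []string) { /* run the compactions */ }

// compact runs in a long-lived goroutine. A `defer e.planner.Release(...)`
// here would only fire when compact returns, which in practice is never,
// so plans are released explicitly at the end of each iteration.
func (e *engine) compact() {
	for {
		select {
		case <-e.done:
			return
		default:
		}
		groups := e.planner.Plan()
		e.apply(groups)
		e.planner.Release(groups) // release immediately after applying
	}
}
```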
TMP files could leak when compactions failed for various reasons. They
were also being deleted inadvertently when compactions were disabled,
causing other errors to be reported in the logs.
This changes full compactions within a shard to run sequentially
instead of running all the compaction groups in parallel. Normally,
there is only one full compaction group to run. At times there can
be several, which causes instability if they all run concurrently,
as each ties up a CPU for long periods of time.

Level compactions are also capped at a maximum of 4 running concurrently
for each level in a shard. This prevents sudden spikes in CPU and disk
usage due to a large backlog of TSM files at a given level.
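
One way to express both rules is with a buffered-channel semaphore; this is an illustrative sketch under that assumption, not the engine's actual code:

```go
package example

import "sync"

const maxConcurrentPerLevel = 4

// runLevelCompactions runs the level's compaction groups with at most
// four in flight at a time.
func runLevelCompactions(groups []func()) {
	sem := make(chan struct{}, maxConcurrentPerLevel)
	var wg sync.WaitGroup
	for _, compact := range groups {
		wg.Add(1)
		sem <- struct{}{} // blocks once 4 compactions are running
		go func(run func()) {
			defer wg.Done()
			defer func() { <-sem }()
			run()
		}(compact)
	}
	wg.Wait()
}

// runFullCompactions runs full compaction groups one at a time.
func runFullCompactions(groups []func()) {
	for _, compact := range groups {
		compact()
	}
}
```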
Measurement names and fields were converted between []byte and string
repeatedly, generating a lot of garbage. This switches the write path
to use []byte.
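
To illustrate why the conversions matter, a small sketch of the allocation behavior; the function names are made up:

```go
package example

// Each string(b) conversion copies the bytes and typically allocates.
// Doing this per point for measurement names and fields on the write
// path generates a steady stream of garbage.
func measurementBefore(name []byte) string {
	return string(name) // copies and allocates on every call
}

// Keeping the name as []byte end to end avoids those copies; conversion
// happens only at boundaries that genuinely need a string.
func measurementAfter(name []byte) []byte {
	return name // no allocation
}
```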
This pool was previously a pool.Bytes to avoid repetitive allocations.
It was recently switched to a sync.Pool because pool.Bytes at times held
onto very large buffers that were never released. sync.Pool is showing
up in allocation profiles quite frequently.

This switches to a new pool that limits both how many buffers are
retained and the maximum size of each buffer. This provides better
bounds on allocations.
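
A rough sketch of such a bounded pool built on a buffered channel; the type name and API are illustrative of the idea, not necessarily the exact implementation:

```go
package example

// LimitedBytes keeps at most cap(pool) buffers and never retains a
// buffer larger than maxSize, so memory held by the pool stays bounded.
type LimitedBytes struct {
	maxSize int
	pool    chan []byte
}

func NewLimitedBytes(capacity int, maxSize int) *LimitedBytes {
	return &LimitedBytes{
		pool:    make(chan []byte, capacity),
		maxSize: maxSize,
	}
}

// Get returns a buffer of length sz, reusing a pooled buffer when one
// with enough capacity is available.
func (p *LimitedBytes) Get(sz int) []byte {
	select {
	case buf := <-p.pool:
		if cap(buf) >= sz {
			return buf[:sz]
		}
	default:
	}
	return make([]byte, sz)
}

// Put returns a buffer to the pool unless it is too large or the pool
// is full, in which case it is dropped and left to the GC.
func (p *LimitedBytes) Put(buf []byte) {
	if cap(buf) > p.maxSize {
		return
	}
	select {
	case p.pool <- buf:
	default:
	}
}
```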