If a large compaction was running and was aborted. It could would leave
some tmp files around for files that it had fully written. The current
active file was cleaned up, but already completed ones would not. This
would occur when a TSM file needed to rollover due to size.
* <type>FinalizerIterator sets a runtime finalizer and calls Close
when garbage collected. This will ensure any associated cursors
are closed and the associated TSM files released
* `query.Iterators#Merge` call could return an error and the inputs
would not be closed, causing a cursor leak
The OnReplace func ends up trying to acquire locks on MeasurementFields. When
its called via snapshotting, this can deadlock because the snapshotting goroutine
also holds an RLock on the engine. If a delete measurement calls is run at the
right time, it will lock the MeasurementFields and try to acquire a lock on the engine
to disable compactions. This creates a deadlock.
To fix this, the OnReplace callback is moved to a function param to allow only Replace
calls as part of a compaction to invoke it as opposed to both snapshotting and compactions.
Fixes#8713
This change provides a clear separation between the query engine
mechanics and the query language so that the language can be parsed and
dealt with separate from the query engine itself.
Previously pseudo iterators could be created for meta data such
as series, measurement, and tag data. These iterators were created
at a higher level and lacked a lot of the power of the query engine.
This commit moves system iterators down to the series level and
supports the following:
- _name
- _seriesKey
- _tagKey
- _tagValue
- _fieldKey
These can be used as normal fields such as:
SELECT _seriesKey FROM cpu
This will return all the series keys for `cpu`.
If there were multiple shards, drop measurement could update the index
and remove the measurement before the other shards ran their deletes.
This causes the later shards to not see any series to delete.
The fix is to all deleteSeries to handle the index delete which already
accounts for removing the measurement when it is fully removed from the
index.
The partiallyRead func didn't account for the initial values and would
return true for blocks that had not been read at all. This causes a
slower path during compactions that forces a block to be decoded when
it could just be merged as is without decoded. This causes compactions
to consume more CPU and run slower at times.
This switches all the interfaces that take string series key to
take a []byte. This eliminates many small allocations where we
convert between to two repeatedly. Eventually, this change should
propogate futher up the stack.
The refs map was to increment the file references one time each.
It doesn't hurt to increment them multiple times though.
We also do not need to copy the files slice as we are accessing it
under a read lock so it can't be changed.
When snapshots and compactions are disabled, the check to see if
the compaction should be aborted occurs in between writing to the
next TSM file. If a large compaction is running, it might take
a while for the file to be finished writing causing long delays.
This now interrupts compactions while iterating over the blocks to
write which allows them to abort immediately.
* introduced UnsignedValue type
* leveraged existing int64 compression algorithms (RLE, Simple 8B)
* tsm and WAL can read and write UnsignedValue
* compaction is aware of UnsignedValue
* unsigned support to model, cursors and write points
NOTE: there is no support to create unsigned points, as the line
protocol has not been modified.
There was a race in the WAL writeToLog and scheduleSync which could
lead to a writing goroutine blocking indefinitely on its syncErr channel.
The issue was that the clearing of the syncCount happenend after the
wal was unlock. If a goroutine was able to lock, write and call scheduleSync
before the existing scheduleSync goroutine returns and ran the defer to
clear the syncCount, then a new scheduleSync goroutine would not get started.
This left the writing goroutine block with nothing to signal it.
While in this state, a RLock on the engine was held. If a Lock was requested
on the engine during this time, all future writes and queries would block waiting
on the blocked wal writer.
The fix is to move the atomic clearing of syncCount before the Lock is released.
The min key was not used in OverlapsKeyRange which caused it to return
false when it should be true. This causes a bug where deletes would not
write tombstones for files that actually contained the data it was supposed
to delete.
The in-memory index can get out of sync when deletes and writes
to the same measurement are running concurrently. The index is
updated independently from data on disk and it's possible for the
index to unassign a shard when data still exists on disk. What happens
is that there are TSM files on disk, but the index does not know that
the series that exist in those files still are in the shard. Restarting
the server reloads the index and the data is visible again. From and
end user perspective, this can look like more data is deleted than should
have been or that deleted data re-appears after a restart or writes to the
shard occur again.
There isn't an easy way to resolve this since the index and storage
are not transactional resources and we cannot atomically commit or
rollback changes to both at once.
As a workaround, after new TSM files are installed, we refresh the
index with series keys that exist in the new tsm files as well as
any lingering data still in the cache. There is a small window of time
when the index may be missing series, but it will re-appear after the refresh
completes.
This adds a v3 format that is a gzip compressed version of the v2
format. It reduces the size of tombstone files substantially without
having to support a more feature rich file format for tombstones.
The monitor goroutine calls enable compactions every 10s to spin down
(or start up) goroutines for cold shards. This frequent Lock may be
causing lock contention for writes and queries which get blocked trying
to acquire an RLock.
The go RWMutex says that new RLock calls will block if there is a
pending Lock call that is blocked. Switching the common path to use
an RLock should avoid the Lock and reduce lock contention for writes
and queries.
WriteBlock was missing the check for the max series keys which allowed
series keys to be written that were larger than the 2 bytes allocated
to store their length. When this occurred, the TSM can fail to load.
The defer was never executed because the planning happens in a
long running goroutine that loops. The plans need to be released
immediately after applying them.
TMP files could leak when compactions failed for various reasons. They
were also being deleted inadvertently when compactions were disabled causing
other errors to be reported in the logs.