If multiple tombstones exists for a series that ended up causing the
full data to be deleted, the blocks were not removed from the offsets
in the index. This causes the TSMReader to report that a key exist
but does not have any data.
During a compaction, every key should have at least one value. Since
this invariant was broken, the compaction aborted early and ends up
dropping all series keys that are lexigraphically greater than where
the breakage occured. This would cause data to be dropped during the
compaction.
This fixes a potential bug where the BlockIterator would skip blocks
if the underlying TSMReader had deletes on it concurrently. This
could possibly occur due to changes in 91eb9de3 that now use the
existing TSMReaders from the FileStore instead of creating new ones
during compaction.
This instructs the kernel that it can release memory used by mmap'd
TSM files when they are not actively being used. It the mappings are
use, the kernel will fault the pages back in. On linux, this causes
RES memory to drop immediately when run.
This switches all the interfaces that take string series key to
take a []byte. This eliminates many small allocations where we
convert between to two repeatedly. Eventually, this change should
propogate futher up the stack.
* introduced UnsignedValue type
* leveraged existing int64 compression algorithms (RLE, Simple 8B)
* tsm and WAL can read and write UnsignedValue
* compaction is aware of UnsignedValue
* unsigned support to model, cursors and write points
NOTE: there is no support to create unsigned points, as the line
protocol has not been modified.
The min key was not used in OverlapsKeyRange which caused it to return
false when it should be true. This causes a bug where deletes would not
write tombstones for files that actually contained the data it was supposed
to delete.
Tombstone files would be written to all TSM files even if the deleted
keys or timerange did not exist in the TSM file. This had the side
effect of causing shards to get recompacted back to the same state. If
any shards or large numbers of TSM files existed, disk usage and CPU
utilization would spike causing issues.
This prevents tombstones being written for TSM files that could not
possiby contain the series keys being deleted or if the delted time
range is outside the range of the file.
Each shard has a number of goroutines for compacting different levels
of TSM files. When a shard goes cold and is fully compacted, these
goroutines are still running.
This change will stop background shard goroutines when the shard goes
cold and start them back up if new writes arrive.
This reworks drop measurement to use a sorted list of series keys
instead of creating an intermediate map. It remove allocations
and some extra garbage that is created during drop measurement.
This code was added to address some slow startup issues. It is believed
to be the cause of some segfault panic's that occur at query time when
the underlying MMAP array has been unmapped. The current structure of
code makes this change unnecessary now.
The decoders were held onto each iterator to avoid creating them all
the time. Some of them have use quite a bit of memory so they can
be expensive to create when querying across many series.
Intead, more them to a re-usable pool where we create the minimum that
could active be in use. This reduces garbage as well as makes the iterators
less expensive to create.
When the planner runs, it needs to determine if any files have tombstones.
The code to determine if a tombstone existed involved stating the .tombstone
file. Since the planner runs very frequently when there are many shards, this
causea a lot of system calls that are unnecessary.
Instead, cache the results of the stats calls and only refresh them when we
haven't checked at least once or we write new tombstone data.
This also caches the results of the TSMReader.Stats call to avoid creating
garbage.
Negative timestamps are now supported. We also now refuse two
nanoseconds that are at the edge of the minimum time window. One of the
nanoseconds we do not accept is because we need MinInt64 to be used for
some internal comparisons in the TSM engine and it was causing an
underflow when we subtracted one from the minimum time. The second is so
we can have one minimum time that signifies the default minimum that
nobody can write to (so we can implicitly rewrite the timestamp on
aggregate queries) but still use the explicit timestamp if it is given
to us by the user. We aren't able to tell the difference between if the
user provided it or if it was implicit without those values being
different.
If the default minimum time is used with an aggregate query, we rewrite
the time to be the epoch for backwards compatibility since we believe
that's more important than supporting that extra nanosecond.
This keeps some memory bounds when reloading a TSM files tombstones
so that the heap does not grow exceedintly fast and stay there
after the deletes are applied.
Tombstone were read fully into memory at startup which could consume
a lot of RAM and OOM the process if there were a lot of deleted
series and many TSM files.
This now walks the tombstone file and iteratively applies the tombstone
which uses significantly less RAM. This may be slightly slower in the
generate cause, but should scale better.
If a query was running against a file being compacted, we close the file
and the query would end wherever it had read up to. This could result
in queries that randomly lost data, but running them again showed the
full results.
We now use a reference counting approach and move the in-use files out
of the way in the filestore and allow the queries to complete against
the old tsm files. The new files are installed and new queries will
use them.
Fixes#5501
In some query scenarios, if there are a lot of points on disk spread
across many blocks in TSM files and a point is overwritten near the
begginning of the shard's timerange, the full series could be loaded
into RAM triggering OOMs and huge allocations.
The issue was that the KeyCursor code that handles overwriting points
had a simple implementation that just deduped the whole series in this
case. This falls over when the series is quite large.
Instead, the KeyCursor has been changed to only decode blocks with
updated points. It then keeps track of what section of the blocks
have been read so they are not re-read when the later points are
decoded.
Since the points in a block are always sorted, the code was also changed
to remove the Deduplicate calls since they end up
reallocating the slice. Instead, we do a sorted merge and re-use
the slice as much as we can.
If multiple tombstone entries happen to exist for the same key in a
tombstone file, it was possible to panic. The first application
would remove all index entries and the second time around the code
still assumed entries would exist and would index into the nil slice.
Also fixes a case where the range of time would fully delete all index
entries, but it did not align with math.MinInt64 and math.MaxInt64. This
would cause the index locations to still exist in the offset slice. This
is inefficient because the BlockIterator would still scan and decode the block
only to discover that all the values are deleted. We now just remove it from
the offsets slice in this case since the range of values are deleted.
When a large tombstone file existed on disk, this code was slow since
it would apply each tombstone to the index one at a time causing the
index to be scanned for each key.
Instead, we group all the tombstones together by timestamp and apply
in bulk so that the index in scan once for each set of tombstones.
If we change to immuntable tombstone files, it might be better to just
write a file where all the keys have the same tombstone so we can re-apply
them efficiently.
This was the wrong fix. The real issue was the tombstones were
being read incorrectly and also applied incorrectly at times. This
code is slower and not necessary so reverting it.
This remove the dropMeta param from the tsdb.Store.DeleteSeries and
lets the shard determine when to remove the meta data from the index
based on what series still have data in the shard.
This uncovered a nasty bug in compactions where a fully deleted series would
prematurely end the compactions and not carry forward the rest of the data
in the TSM file. This is now fixed as well.