Since we are not locking but relying on atomic arithmetic,
use Add rather than Set. This will also result in slightly less garbage
being created.
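A minimal sketch of the pattern, assuming expvar-backed statistics (the key names are illustrative): Map.Set must allocate a fresh expvar.Var on every call, while Map.Add atomically increments the existing value in place.

    package cachestats

    import "expvar"

    var stats = expvar.NewMap("tsm1_cache")

    // recordWrite bumps counters with Map.Add, which performs an atomic
    // add on the existing expvar.Int. Using Map.Set instead would mean
    // constructing a new expvar.Var on every call, creating garbage.
    func recordWrite(n int64) {
        stats.Add("writeCount", 1)
        stats.Add("cachedBytes", n)
    }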
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
The intent of this change is to ensure that all statistic fields of the
resulting tsm1_cache measurement are initialized on initialization of
the cache. That way, any consumer of those measurements doesn't
have to deal with the null case.
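A sketch of what that eager initialization could look like, again assuming expvar-backed stats; the key list and function name are illustrative:

    package cachestats

    import "expvar"

    // newCacheStats registers every statistic key with a zero value up
    // front, so consumers of the tsm1_cache measurement never observe a
    // missing (null) field before the first write arrives.
    func newCacheStats() *expvar.Map {
        m := expvar.NewMap("tsm1_cache")
        for _, k := range []string{
            "memBytes", "diskBytes", "snapshotCount", "cacheAgeMs",
        } {
            m.Set(k, new(expvar.Int))
        }
        return m
    }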
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
Complementing and extending the changes in #5758.
Add 2 level statistics:
* snapshotCount
* cacheAgeMs
Add 2 counter statistics:
* cachedBytes
* WALCompactionTimeMs
snapshotCount can be used to measure transient write errors that are causing snapshots to accumulate
cacheAgeMs can be used to gauge the level of write activity into the cache
The differences between cachedBytes stats sampled at different times can be used to calculate cache throughput rates
The ratio (cachedBytes-diskBytes)/WALCompactionTimeMs can be used to calculate WAL compaction throughput.
The ratio of the difference between the first and last WAL compaction times to the
interval length is an estimate of the percentage of cache throughput consumed.
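For example, the derived rates above could be computed from sampled counters like this (a sketch; the function names are illustrative, not an existing API):

    // cacheThroughput returns bytes/sec written into the cache, from two
    // samples of cachedBytes taken intervalSec seconds apart.
    func cacheThroughput(cachedBytes0, cachedBytes1 int64, intervalSec float64) float64 {
        return float64(cachedBytes1-cachedBytes0) / intervalSec
    }

    // walCompactionThroughput returns (cachedBytes-diskBytes)/WALCompactionTimeMs,
    // i.e. bytes compacted out of the WAL per millisecond of compaction time.
    func walCompactionThroughput(cachedBytes, diskBytes, walCompactionTimeMs int64) float64 {
        return float64(cachedBytes-diskBytes) / float64(walCompactionTimeMs)
    }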
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
There was a fix in 5b1791, but it is not present in the current branch, likely due to a rebase issue.
The current code panics with a query like:
    select value from cpu group by host order by time desc limit 1
This fixes the panic as well as prevents #5193 from re-occurring. The issue is that aggressively
closing the cursors clears out the seeks slice, so re-seeking will fail.
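A sketch of the failure mode, with hypothetical types: if closing the cursor nils out the seeks slice, any later re-seek finds nothing to search.

    type location struct{ minTime, maxTime int64 }

    type cursor struct {
        seeks []*location // block locations used to service seeks
    }

    func (c *cursor) close() {
        // Bug: setting c.seeks = nil here breaks any later SeekTo.
        // Fix: leave seeks intact until the query is truly finished.
    }

    func (c *cursor) SeekTo(t int64) {
        if len(c.seeks) == 0 {
            return // with the bug, every re-seek ends up here
        }
        // ... search c.seeks for the block containing t ...
    }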
Aux iterators now ask the iterator creator what series will be returned
and determine which aux fields to create based on the results.
The `tsdb.Shards` struct also creates a call iterator around the
iterators returned from each shard.
Fill requires an additional function for IteratorCreator to retrieve the
series that will be returned from the iterator. When fill is required
for an aggregate, the IteratorCreator will be asked what series will be
returned by the created iterator.
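A hedged sketch of the new capability; the real interface differs, and SeriesKeys is an illustrative name:

    // Iterator and IteratorOptions stand in for the engine's real types.
    type Iterator interface{ Close() error }
    type IteratorOptions struct{ /* expression, time range, dimensions */ }

    type IteratorCreator interface {
        CreateIterator(opt IteratorOptions) (Iterator, error)

        // SeriesKeys reports which series an iterator created with opt
        // would return, so callers can pre-create aux fields and emit
        // fill() values for every series, even in empty intervals.
        SeriesKeys(opt IteratorOptions) ([]string, error)
    }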
This reduces some of the lock contention when writing to the cache.
When a new entry is created, it avoids an allocation. It also skips
the check for whether the values need to be sorted when we already know they do.
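A sketch of the write path with hypothetical types: a brand-new entry adopts the incoming values directly, and the ordering check is skipped once the entry is already flagged for sorting.

    type Value interface{ UnixNano() int64 }

    type entry struct {
        values   []Value
        needSort bool
    }

    func (e *entry) add(vs []Value) {
        if len(e.values) == 0 {
            e.values = vs // new entry: adopt the slice, no copy allocation
            return
        }
        // Only inspect ordering while we still believe values are sorted;
        // once needSort is set there is nothing left to learn.
        if !e.needSort && len(vs) > 0 &&
            vs[0].UnixNano() < e.values[len(e.values)-1].UnixNano() {
            e.needSort = true
        }
        e.values = append(e.values, vs...)
    }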
Writing the snapshot would deduplicate the snapshot points
while still holding the engine write-lock. This can be expensive
under high load and cause writes to back up and OOM the server.
Instead, grab the snapshot under the lock and dedup it after releasing
the lock.
Possible fix for #5442
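A sketch of the reordering, with hypothetical names: the snapshot swap stays inside the lock, while the expensive deduplication moves outside it.

    package engine

    import "sync"

    type Cache struct{ /* ... */ }

    func (c *Cache) Snapshot() *Cache { /* swap out current data */ return &Cache{} }
    func (c *Cache) Deduplicate()     { /* sort and dedup points */ }

    type Engine struct {
        mu    sync.RWMutex
        cache *Cache
    }

    func (e *Engine) writeSnapshot() *Cache {
        e.mu.Lock()
        snapshot := e.cache.Snapshot() // cheap under the lock
        e.mu.Unlock()

        snapshot.Deduplicate() // expensive, now runs without blocking writes
        return snapshot
    }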
We were closing the cursor when we read the last block, which caused
the internal state to be cleared. In a group by query, we seeked multiple
times, so depending on the group by interval and how the data was laid out
in the blocks, we would close the cursor and the last block would get skipped.
Fixes #5193
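In sketch form (hypothetical fields): reading the final block must not tear down the cursor, because a group by query will seek it again.

    type block struct{ /* compressed points */ }

    type cursor struct {
        blocks []block
        pos    int
    }

    func (c *cursor) nextBlock() bool {
        if c.pos >= len(c.blocks) {
            // Bug: calling close() here cleared internal state, so the
            // seek for the next group by interval skipped the last block.
            // Fix: just report exhaustion; close only when the query
            // releases the cursor.
            return false
        }
        c.pos++
        return true
    }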
This may be causing slow restart times for systems with many large TSM files.
What I believe is happening at startup in these cases is that multiple goroutines
are started to load each TSM file concurrently. The kernel appears to serialize
mmap calls from the same process so all of the goroutines end up getting blocked
on the actual mmap system call. MAP_POPULATE instructs the kernel to pre-fault the
page table for the files and triggers read-ahead of the pages. For larger, 2GB files,
this makes the mmap call more expensive and slower. When there are many of these files
and calls, it is possible to fill all available memory with page cache. In this case,
the OS will end up pre-faulting pages from one file and having to evict pages that it just
loaded from other files, causing slowness. MAP_POPULATE may also cause much more data
to be pre-faulted than necessary. To load a file, we just need to scan the index at the end
of the file. MAP_POPULATE is likely causing the whole file to be loaded when it won't actually
be accessed for a while (or at all).
Might fix issue #5311.
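A sketch of the one-line change, assuming Linux and the stdlib syscall wrappers:

    package mmapfile

    import (
        "os"
        "syscall"
    )

    // mmap maps the file for reading. Dropping syscall.MAP_POPULATE means
    // pages fault in lazily as the index at the end of the file is scanned,
    // instead of the kernel pre-faulting (and read-ahead loading) the
    // entire file at open time.
    func mmap(f *os.File, size int) ([]byte, error) {
        // Before: syscall.MAP_SHARED|syscall.MAP_POPULATE
        return syscall.Mmap(int(f.Fd()), 0, size, syscall.PROT_READ, syscall.MAP_SHARED)
    }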
Some data shapes would cause files to grow larger than the max size more
quickly, which resulted in them getting skipped by the full compaction planner
at times. Some datasets that could make this happen are very large keys or
very large numbers of keys (10M). When this happened, multiple max sized
files would accumulate but the blocks would not be full. When the shard went
cold for writes, these files would get recompacted down to the optimal size, but
a lot of space would be wasted in the meantime.
This is contributing to some of the high memory usage on queries and possibly
some OOMs. This is slightly slower, but removing it allows some fairly large
count queries over 5M series to complete instead of crashing the process when using
the tsm1 engine.
Key() returned the key and the entries. We did not always need the
entries, so they would be allocated and ignored. Added a KeyAt func
that just returns the key to avoid the unnecessary entries allocation.
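A sketch of the pair, with hypothetical receiver and fields:

    type IndexEntry struct{ /* block offset, size, time range */ }

    type index struct {
        keys    []string
        entries [][]IndexEntry
    }

    // Key returns the key and its entries; callers that only want the
    // key still pay for materializing the entries slice.
    func (ix *index) Key(i int) (string, []IndexEntry) {
        return ix.keys[i], append([]IndexEntry(nil), ix.entries[i]...)
    }

    // KeyAt returns just the key, avoiding the entries allocation.
    func (ix *index) KeyAt(i int) string {
        return ix.keys[i]
    }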
Use sync.Pool for some temporary buffers used while encoding instead of
allocating new ones each time. Also increased the default buffer size, which
might have been too small. Probably need to make this a config var.
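A sketch of the reuse pattern with sync.Pool; the buffer size and encode body are illustrative:

    package encoding

    import "sync"

    var bufPool = sync.Pool{
        New: func() interface{} {
            // Larger default than before; probably wants to be a config var.
            return make([]byte, 0, 4096)
        },
    }

    func encode(values []byte) []byte {
        buf := bufPool.Get().([]byte)[:0]
        buf = append(buf, values...) // stand-in for the real encoding work

        out := make([]byte, len(buf))
        copy(out, buf)

        bufPool.Put(buf[:0]) // recycle the (possibly grown) buffer
        return out
    }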