This instructs the kernel that it can release memory used by mmap'd
TSM files when they are not actively being used. It the mappings are
use, the kernel will fault the pages back in. On linux, this causes
RES memory to drop immediately when run.
A cold shard that suddenly receives a lot of writes could get a very
big cache that takes a long time to snapshot or causes the cache
max memory limit to be hit more quickly. This re-enables the compactions
if necessary during writes so we don't have to wait for the shard monitor
goroutine to re-enable them.
Compactions would create their own TSMReaders for simplicity. With
very high cardinality compactions, creating the reader and indirectIndex
can start to use a significant amount of memory.
This changes the compactions to use a reader that is already allocated
and managed by the FileStore.
These are already sorted during compaction, so switch to sorting lazily
to avoid the CPU and allocations. This would only occur when using if
using the writer directly.
The directIndex used by the TSMWriter maintained a map of series keys
to index entries. When the index is written to the TSM file, the keys
are sorted and then written out in order.
The reason for this is because directIndex used to be the only index
and it was optimized more for reading. The reading has been replaced
by the indirectIndex so the map of keys ends up wasting space.
During compactions, the series keys (and index entries) are already sorted
so this change uses the sorting to avoid the map and sort when writing the
index. This reduces allocations and CPU usage quite a bit for larger cardinality
TSM files.
This leaves the slower compactions that create full blocks to only
the full compaction. This helps reduce CPU usage and memory while shards
are hot, but increases disk usage (reduced compression) slightly.
Deleting high cardinality series could take a very long time, cause
write timeouts as well as dead lock the process. This fixes these
issue to by changing the approach for cleaning up the indexes and
reducing lock contention.
The prior approach delete each series and updated every index (inmem)
during the delete. This was very slow and cause the index to be locked
while it items in a slice were removed one by one. This has been changed
to mark series as deleted and then rebuild the index asynchronously which
speeds up the process.
There was also a dead lock that could occur when deleing the field set.
Deleting the field set held a write lock and the function it invoked under
the lock could try to take a read lock on the field set. This would then
deadlock. This approach was also very slow and caused time out for writes.
It now uses faster approach that checks for the existing of the measurment
in the cache and filestore which does not take write locks.
It prints the statistics of each iterator that will access the storage
engine. For each access of the storage engine, it will print the number
of shards that will potentially be accessed, the number of files that
may be accessed, the number of series that will be created, the number
of blocks, and the size of those blocks.
This is used quite a bit to determine which fields are needed in a
condition. When the condition gets large, the memory usage begins to
slow it down considerably and it doesn't take care of duplicates.
There are several places in the code where comma-ok map retrieval was
being used poorly. Some were benign, like checking existence before
issuing an unconditional delete with no cleanup. Others were potentially
far more serious: assuming that if 'ok' was true, then the resulting
pointer retrieved from the map would be non-nil. `nil` is a perfectly
valid value to store in a map of pointers, and the comma-ok syntax is
meant for when membership is distinct from having a non-zero value.
There was only one or two cases that I saw that being used correctly for
maps of pointers.
If a large compaction was running and was aborted. It could would leave
some tmp files around for files that it had fully written. The current
active file was cleaned up, but already completed ones would not. This
would occur when a TSM file needed to rollover due to size.
* <type>FinalizerIterator sets a runtime finalizer and calls Close
when garbage collected. This will ensure any associated cursors
are closed and the associated TSM files released
* `query.Iterators#Merge` call could return an error and the inputs
would not be closed, causing a cursor leak
The OnReplace func ends up trying to acquire locks on MeasurementFields. When
its called via snapshotting, this can deadlock because the snapshotting goroutine
also holds an RLock on the engine. If a delete measurement calls is run at the
right time, it will lock the MeasurementFields and try to acquire a lock on the engine
to disable compactions. This creates a deadlock.
To fix this, the OnReplace callback is moved to a function param to allow only Replace
calls as part of a compaction to invoke it as opposed to both snapshotting and compactions.
Fixes#8713
The ConditionExpr function is more accurate because it parses the
condition and ensures that time conditions are actually used correctly.
That means that attempting to combine conditions with OR will not result
in the query silently pretending it's an AND and nested conditions work
correctly so there is only one way to read the query.
It also extracts the non-time conditions into a separate condition so we
can stop attempting to parse around the time conditions in lower layers
of the storage engine. This change does not remove those hacks, but a
following commit should be able to sanitize the condition and remove
them.
This change provides a clear separation between the query engine
mechanics and the query language so that the language can be parsed and
dealt with separate from the query engine itself.
Previously pseudo iterators could be created for meta data such
as series, measurement, and tag data. These iterators were created
at a higher level and lacked a lot of the power of the query engine.
This commit moves system iterators down to the series level and
supports the following:
- _name
- _seriesKey
- _tagKey
- _tagValue
- _fieldKey
These can be used as normal fields such as:
SELECT _seriesKey FROM cpu
This will return all the series keys for `cpu`.
The tag cardinality checks were run for all inmem shards. Since inmem
shards share the same index, a lot of the work is redundant. Inmem shards
also need to sort their measurmenet and tag keys which can be CPU intensive
with many shards or higher cardinality.
This changes the monitoring to just check one shard in each database which
should lower CPU usage due to excessive sorting. The longer term solution
is to use TSI which would not have this check or required sorting.
If there were multiple shards, drop measurement could update the index
and remove the measurement before the other shards ran their deletes.
This causes the later shards to not see any series to delete.
The fix is to all deleteSeries to handle the index delete which already
accounts for removing the measurement when it is fully removed from the
index.
The partiallyRead func didn't account for the initial values and would
return true for blocks that had not been read at all. This causes a
slower path during compactions that forces a block to be decoded when
it could just be merged as is without decoded. This causes compactions
to consume more CPU and run slower at times.
This switches all the interfaces that take string series key to
take a []byte. This eliminates many small allocations where we
convert between to two repeatedly. Eventually, this change should
propogate futher up the stack.
The refs map was to increment the file references one time each.
It doesn't hurt to increment them multiple times though.
We also do not need to copy the files slice as we are accessing it
under a read lock so it can't be changed.