The scheduling logic ended up favoring more backlogged shards
too much and would starve active, less backed up shards. This
occurred because the scheduling kicks in once a second. When it
runs, it schedules as many compactions as it can. A backed up shard
would end up having more compactions to run during the loop and would
generally get to schedule them more frequently.
This now allows each shard to try and schedule one compaction at a time
which provides a more balanced approach. At some point, we'll probably
want to more directly balance each shard's backlog vs letting it happen
somewhat randomly.
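
The "one compaction at a time per shard" idea can be sketched roughly as
follows. This is only an illustration: the shard type, its pending counter
and scheduleRound are hypothetical names, not the engine's actual scheduler.

```go
package main

import "fmt"

// shard is a hypothetical stand-in for a shard with queued compactions.
type shard struct {
	name    string
	pending int
}

// scheduleRound hands out free compaction slots by letting every shard
// schedule at most one compaction per pass, so a backlogged shard cannot
// consume all slots before a less backed up shard gets a turn.
func scheduleRound(shards []*shard, slots int) []string {
	var order []string
	for slots > 0 {
		progressed := false
		for _, s := range shards {
			if slots == 0 {
				break
			}
			if s.pending == 0 {
				continue
			}
			s.pending--
			slots--
			progressed = true
			order = append(order, s.name)
		}
		if !progressed {
			break // nothing left to schedule
		}
	}
	return order
}

func main() {
	shards := []*shard{{"backlogged", 10}, {"active", 2}}
	fmt.Println(scheduleRound(shards, 4))
}
```

With four free slots the two shards alternate instead of the backlogged
shard taking all four.
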
Some files seem to get orphaned behind higher levels. This causes
the compactions to get blocked as the lower level files will not
get picked up by their lower level planners. This allows the full
plan to identify them and pull them into their plans.
This check doesn't make sense for high cardinality data as the files
typically get big and sparse very quickly. This causes a lot of extra
disk space to be used which is taken up by large indexes and sparse
data.
One shard might be able to run a compaction, but could fail due to
limits being hit. This loop would continue indefinitely as the
same task would continue to be rescheduled.
With higher cardinality or larger series keys, the files can roll
over early which causes them to take longer to be compacted by higher
levels. This causes larger disk usage and higher numbers of tsm files
at times.
This changes the compaction scheduling to better utilize the available
cores that are free. Previously, a level was planned in its own goroutine
and would kick off a number of compaction groups. The problem with this
model was that if there were 4 groups, and 3 completed quickly, the planning
would be blocked for that level until the last group finished. If the compactions
at the prior level are running more quickly, a large backlog could accumulate.
This now moves the planning to a single goroutine that plans each level in
succession and starts as many groups as it can. When one group finishes,
the planning will start the next group for the level.
The fsyncs due to large writes when writing to TSM files and the
WAL can eventually cause large pauses. Since we already buffer
writes, using synchronous IO reduces fsync latency by ensuring
the individual writes hit disk. This spreads out the latency
across multiple writes better.
This commit adds a basic TSI versioning scheme, by adding a Version field
to an index's MANIFEST file.
Existing TSI indexes will not have this field present in their MANIFEST
files, and thus will be deemed incompatible with the current version.
Users with existing TSI indexes will be able to remove them, and convert the
resulting inmem indexes to the current version of a TSI index using the
influx_inspect tooling.
With higher cardinalities, the encoder pools were becoming a bottleneck.
This changes the snapshot compactions to check out one encoder of each
type and re-use it while writing the snapshots as opposed to repeatedly
checking it out and in.
This periodically re-allocates the cache store to avoid memory
fragmentation and gradual slow down of the store after repeated
deletes and inserts into the map.
This instructs the kernel that it can release memory used by mmap'd
TSM files when they are not actively being used. If the mappings are
used, the kernel will fault the pages back in. On Linux, this causes
RES memory to drop immediately when run.
A cold shard that suddenly receives a lot of writes could get a very
big cache that takes a long time to snapshot or causes the cache
max memory limit to be hit more quickly. This re-enables the compactions
if necessary during writes so we don't have to wait for the shard monitor
goroutine to re-enable them.
Compactions would create their own TSMReaders for simplicity. With
very high cardinality compactions, creating the reader and indirectIndex
can start to use a significant amount of memory.
This changes the compactions to use a reader that is already allocated
and managed by the FileStore.
These are already sorted during compaction, so switch to sorting lazily
to avoid the CPU and allocations. This would only occur when
using the writer directly.
The directIndex used by the TSMWriter maintained a map of series keys
to index entries. When the index is written to the TSM file, the keys
are sorted and then written out in order.
The reason for this is that directIndex used to be the only index
and it was optimized more for reading. The reading has been replaced
by the indirectIndex so the map of keys ends up wasting space.
During compactions, the series keys (and index entries) are already sorted
so this change uses the sorting to avoid the map and sort when writing the
index. This reduces allocations and CPU usage quite a bit for larger cardinality
TSM files.
This leaves the slower compactions that create full blocks to only
the full compaction. This helps reduce CPU usage and memory while shards
are hot, but increases disk usage (reduced compression) slightly.
Deleting high cardinality series could take a very long time, cause
write timeouts, and deadlock the process. This fixes these
issues by changing the approach for cleaning up the indexes and
reducing lock contention.
The prior approach deleted each series and updated every index (inmem)
during the delete. This was very slow and caused the index to be locked
while items in a slice were removed one by one. This has been changed
to mark series as deleted and then rebuild the index asynchronously which
speeds up the process.
There was also a deadlock that could occur when deleting the field set.
Deleting the field set held a write lock and the function it invoked under
the lock could try to take a read lock on the field set. This would then
deadlock. This approach was also very slow and caused timeouts for writes.
It now uses a faster approach that checks for the existence of the measurement
in the cache and filestore which does not take write locks.
It prints the statistics of each iterator that will access the storage
engine. For each access of the storage engine, it will print the number
of shards that will potentially be accessed, the number of files that
may be accessed, the number of series that will be created, the number
of blocks, and the size of those blocks.
This is used quite a bit to determine which fields are needed in a
condition. When the condition gets large, the memory usage begins to
slow it down considerably and it doesn't take care of duplicates.
There are several places in the code where comma-ok map retrieval was
being used poorly. Some were benign, like checking existence before
issuing an unconditional delete with no cleanup. Others were potentially
far more serious: assuming that if 'ok' was true, then the resulting
pointer retrieved from the map would be non-nil. `nil` is a perfectly
valid value to store in a map of pointers, and the comma-ok syntax is
meant for when membership is distinct from having a non-zero value.
There were only one or two cases that I saw being used correctly for
maps of pointers.
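
A small illustration of the pitfall described above. The series type and
lookupKey helper are made up for this example and do not come from the
code base.

```go
package main

import "fmt"

// series is a hypothetical value type stored by pointer in a map.
type series struct{ key string }

// lookupKey returns the key for id. It guards against both a missing
// entry and a nil pointer stored under an existing entry.
func lookupKey(m map[uint64]*series, id uint64) (string, bool) {
	s, ok := m[id]
	if !ok || s == nil {
		return "", false
	}
	return s.key, true
}

func main() {
	m := map[uint64]*series{1: {key: "cpu"}, 2: nil}

	// ok reports membership only: it is true for id 2 although the
	// stored pointer is nil, so dereferencing it would panic.
	s, ok := m[2]
	fmt.Println(ok, s == nil)

	fmt.Println(lookupKey(m, 1))
	fmt.Println(lookupKey(m, 2))

	// delete is a no-op for a missing key, so no existence check is needed.
	delete(m, 99)
}
```

When nil is never a meaningful entry, a plain `if s := m[id]; s != nil`
covers both cases without the comma-ok form.
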
If a large compaction was running and was aborted, it could leave
some tmp files around for files that it had fully written. The current
active file was cleaned up, but already completed ones would not. This
would occur when a TSM file needed to rollover due to size.
* <type>FinalizerIterator sets a runtime finalizer and calls Close
when garbage collected. This will ensure any associated cursors
are closed and the associated TSM files released
* `query.Iterators#Merge` call could return an error and the inputs
would not be closed, causing a cursor leak
The OnReplace func ends up trying to acquire locks on MeasurementFields. When
it's called via snapshotting, this can deadlock because the snapshotting goroutine
also holds an RLock on the engine. If a delete measurement call is run at the
right time, it will lock the MeasurementFields and try to acquire a lock on the engine
to disable compactions. This creates a deadlock.
To fix this, the OnReplace callback is moved to a function param to allow only Replace
calls as part of a compaction to invoke it as opposed to both snapshotting and compactions.
Fixes #8713
The ConditionExpr function is more accurate because it parses the
condition and ensures that time conditions are actually used correctly.
That means that attempting to combine conditions with OR will not result
in the query silently pretending it's an AND and nested conditions work
correctly so there is only one way to read the query.
It also extracts the non-time conditions into a separate condition so we
can stop attempting to parse around the time conditions in lower layers
of the storage engine. This change does not remove those hacks, but a
following commit should be able to sanitize the condition and remove
them.
This change provides a clear separation between the query engine
mechanics and the query language so that the language can be parsed and
dealt with separate from the query engine itself.
Previously pseudo iterators could be created for meta data such
as series, measurement, and tag data. These iterators were created
at a higher level and lacked a lot of the power of the query engine.
This commit moves system iterators down to the series level and
supports the following:
- _name
- _seriesKey
- _tagKey
- _tagValue
- _fieldKey
These can be used as normal fields such as:
SELECT _seriesKey FROM cpu
This will return all the series keys for `cpu`.
The tag cardinality checks were run for all inmem shards. Since inmem
shards share the same index, a lot of the work is redundant. Inmem shards
also need to sort their measurement and tag keys which can be CPU intensive
with many shards or higher cardinality.
This changes the monitoring to just check one shard in each database which
should lower CPU usage due to excessive sorting. The longer term solution
is to use TSI which would not have this check or required sorting.
If there were multiple shards, drop measurement could update the index
and remove the measurement before the other shards ran their deletes.
This causes the later shards to not see any series to delete.
The fix is to allow deleteSeries to handle the index delete which already
accounts for removing the measurement when it is fully removed from the
index.
The partiallyRead func didn't account for the initial values and would
return true for blocks that had not been read at all. This causes a
slower path during compactions that forces a block to be decoded when
it could just be merged as is without being decoded. This causes compactions
to consume more CPU and run slower at times.
This switches all the interfaces that take string series key to
take a []byte. This eliminates many small allocations where we
convert between the two repeatedly. Eventually, this change should
propagate further up the stack.
The refs map was to increment the file references one time each.
It doesn't hurt to increment them multiple times though.
We also do not need to copy the files slice as we are accessing it
under a read lock so it can't be changed.
tsdb.Shard and tsdb.Shards both implement tsdb.ShardGroup and neither
was tested within the tsdb package itself. This adds tests for those
were tested within the tsdb package itself. This adds tests for those
methods which are used by the query engine.
There was a change to speed up deleting and dropping measurements
that executed the deletes in parallel for all shards at once. #7015
When TSI was merged in #7618, the series keys passed into Shard.DeleteMeasurement
were removed and were expanded lower down. This causes memory to blow up
when a delete across many shards occurs as we now expand the set of series
keys N times instead of just once as before.
While running the deletes in parallel would be ideal, there have been a number
of optimizations in the delete path that make running deletes serially pretty
good. This change just limits the concurrency of the deletes which keeps memory
more stable.
When snapshots and compactions are disabled, the check to see if
the compaction should be aborted occurs in between writing to the
next TSM file. If a large compaction is running, it might take
a while for the file to be finished writing causing long delays.
This now interrupts compactions while iterating over the blocks to
write which allows them to abort immediately.
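
A rough sketch of checking for an interrupt on every block rather than once
per file. The names (writeBlocks, errCompactionAborted, the interrupt
channel) are assumptions for illustration, not the compactor's real API.

```go
package main

import (
	"errors"
	"fmt"
)

// errCompactionAborted is a hypothetical sentinel error.
var errCompactionAborted = errors.New("compaction aborted")

// writeBlocks writes each block via write, but checks the interrupt
// channel before every block so an abort takes effect without waiting
// for the whole file to be finished.
func writeBlocks(blocks []string, interrupt <-chan struct{}, write func(string)) error {
	for _, b := range blocks {
		select {
		case <-interrupt:
			return errCompactionAborted
		default:
		}
		write(b)
	}
	return nil
}

func main() {
	interrupt := make(chan struct{})
	written := 0
	err := writeBlocks([]string{"b1", "b2", "b3"}, interrupt, func(string) {
		written++
		if written == 2 {
			close(interrupt) // simulate compactions being disabled mid-file
		}
	})
	fmt.Println(written, err)
}
```
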
* introduced UnsignedValue type
* leveraged existing int64 compression algorithms (RLE, Simple 8B)
* tsm and WAL can read and write UnsignedValue
* compaction is aware of UnsignedValue
* unsigned support to model, cursors and write points
NOTE: there is no support to create unsigned points, as the line
protocol has not been modified.
The sortedSeriesIds slice was not getting reset to 0 which caused
the same series ids to exist in the slice more than once. Since
the size of the slice never matched the size of the seriesID map,
it kept appending to the slice and sorting it which caused multiple
cursors to get created for the same series.
Fixes #8531
There was a race in the WAL writeToLog and scheduleSync which could
lead to a writing goroutine blocking indefinitely on its syncErr channel.
The issue was that the clearing of the syncCount happened after the
WAL was unlocked. If a goroutine was able to lock, write and call scheduleSync
before the existing scheduleSync goroutine returned and ran the defer to
clear the syncCount, then a new scheduleSync goroutine would not get started.
This left the writing goroutine blocked with nothing to signal it.
While in this state, a RLock on the engine was held. If a Lock was requested
on the engine during this time, all future writes and queries would block waiting
on the blocked wal writer.
The fix is to move the atomic clearing of syncCount before the Lock is released.
The min key was not used in OverlapsKeyRange which caused it to return
false when it should be true. This causes a bug where deletes would not
write tombstones for files that actually contained the data it was supposed
to delete.
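
For reference, a range overlap test needs both bounds. The function below
is a simplified sketch with hypothetical parameters, not the actual
OverlapsKeyRange signature.

```go
package main

import (
	"bytes"
	"fmt"
)

// overlapsKeyRange reports whether the key range [lo, hi] intersects a
// file whose keys span [fileMin, fileMax]. Both comparisons are required:
// dropping either one produces wrong answers.
func overlapsKeyRange(fileMin, fileMax, lo, hi []byte) bool {
	return bytes.Compare(fileMin, hi) <= 0 && bytes.Compare(fileMax, lo) >= 0
}

func main() {
	fileMin, fileMax := []byte("cpu"), []byte("mem")
	fmt.Println(overlapsKeyRange(fileMin, fileMax, []byte("a"), []byte("disk")))
	fmt.Println(overlapsKeyRange(fileMin, fileMax, []byte("net"), []byte("swap")))
}
```
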
The in-memory index can get out of sync when deletes and writes
to the same measurement are running concurrently. The index is
updated independently from data on disk and it's possible for the
index to unassign a shard when data still exists on disk. What happens
is that there are TSM files on disk, but the index does not know that
the series that exist in those files still are in the shard. Restarting
the server reloads the index and the data is visible again. From an
end user perspective, this can look like more data is deleted than should
have been or that deleted data re-appears after a restart or writes to the
shard occur again.
There isn't an easy way to resolve this since the index and storage
are not transactional resources and we cannot atomically commit or
rollback changes to both at once.
As a workaround, after new TSM files are installed, we refresh the
index with series keys that exist in the new tsm files as well as
any lingering data still in the cache. There is a small window of time
when the index may be missing series, but it will re-appear after the refresh
completes.
This adds a v3 format that is a gzip compressed version of the v2
format. It reduces the size of tombstone files substantially without
having to support a more feature rich file format for tombstones.
The monitor goroutine calls enable compactions every 10s to spin down
(or start up) goroutines for cold shards. This frequent Lock may be
causing lock contention for writes and queries which get blocked trying
to acquire an RLock.
The Go RWMutex documentation says that new RLock calls will block if there is a
pending Lock call that is blocked. Switching the common path to use
an RLock should avoid the Lock and reduce lock contention for writes
and queries.
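
One way to keep the common path on an RLock is to check the state first
and only take the write lock when something has to change. The engine type
below is a hypothetical stand-in, not the real engine.

```go
package main

import (
	"fmt"
	"sync"
)

// engine is a hypothetical stand-in holding a compactions-enabled flag.
type engine struct {
	mu      sync.RWMutex
	enabled bool
}

// setEnabled returns true if it changed the state. The frequent no-op
// case only takes an RLock, so it never queues a pending Lock that would
// block new readers.
func (e *engine) setEnabled(v bool) bool {
	e.mu.RLock()
	same := e.enabled == v
	e.mu.RUnlock()
	if same {
		return false
	}

	e.mu.Lock()
	defer e.mu.Unlock()
	if e.enabled == v { // re-check: another goroutine may have won the race
		return false
	}
	e.enabled = v
	return true
}

func main() {
	e := &engine{}
	fmt.Println(e.setEnabled(true))
	fmt.Println(e.setEnabled(true))
}
```
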
Currently two write locks in `inmem` are obtained and then
manually unlocked at function exit points. However, we have
reports that the `inmem` index is hanging on a write lock and
cannot track the issue down to anything else besides a lock
that could have been left unlocked because of a panic.
This commit changes the two locks to always defer their unlocks
to prevent these hangs.
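
The difference matters when the code under the lock panics. A minimal
sketch, using a made-up index type rather than the real inmem index:

```go
package main

import (
	"fmt"
	"sync"
)

// index is a hypothetical stand-in for a lock-protected index.
type index struct {
	mu     sync.RWMutex
	series map[string]int
}

// update runs fn under the write lock. Because the Unlock is deferred,
// the lock is released even if fn panics.
func (i *index) update(fn func(map[string]int)) {
	i.mu.Lock()
	defer i.mu.Unlock()
	fn(i.series)
}

// safeUpdate recovers from a panic in fn and reports whether one occurred.
func safeUpdate(i *index, fn func(map[string]int)) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	i.update(fn)
	return false
}

func main() {
	idx := &index{series: map[string]int{}}
	fmt.Println(safeUpdate(idx, func(map[string]int) { panic("boom") }))

	// With a manual Unlock at the exit points this call would hang,
	// because the panic would have skipped the Unlock.
	idx.update(func(m map[string]int) { m["cpu"] = 1 })
	fmt.Println(len(idx.series))
}
```
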
This fixes the case where log files are compacted out of order
and cause non-contiguous sets of index files to be compacted.
Previously, the compaction planner would fetch a list of index files
for each level and compact them in order starting with the oldest
ones. This can be a problem for level 1 because level 0 (log files)
are compacted individually and in some cases a log file can finish
compacting before older log files are finished compacting. This
causes there to be a gap in the list of level 1 files that is
ignored when fetching a list of index files.
Now, the planner reads the list of index files starting from the
oldest but stops once it hits a log file. This prevents that gap
from being ignored.
This check was previously in a different section of code which
was lost during a refactor to the new compaction strategy. The
compaction planning now makes a check to ensure at least two
files are available for compaction in a level.
WriteBlock was missing the check for the max series keys which allowed
series keys to be written that were larger than the 2 bytes allocated
to store their length. When this occurred, the TSM can fail to load.
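
The check itself is small. In this sketch the constant follows from the
2-byte length prefix mentioned above; the function name, error value and
big-endian byte order are assumptions for illustration.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// maxKeyLength is the largest length representable in 2 bytes.
const maxKeyLength = 1<<16 - 1

// errMaxKeyLengthExceeded is a hypothetical sentinel error.
var errMaxKeyLengthExceeded = errors.New("max key length exceeded")

// appendKey appends a 2-byte length prefix followed by key to dst. It
// rejects keys whose length would overflow the prefix instead of writing
// a truncated length that makes the file unreadable.
func appendKey(dst, key []byte) ([]byte, error) {
	if len(key) > maxKeyLength {
		return dst, errMaxKeyLengthExceeded
	}
	var n [2]byte
	binary.BigEndian.PutUint16(n[:], uint16(len(key)))
	dst = append(dst, n[:]...)
	return append(dst, key...), nil
}

func main() {
	buf, err := appendKey(nil, []byte("cpu,host=a"))
	fmt.Println(len(buf), err)

	_, err = appendKey(nil, make([]byte, maxKeyLength+1))
	fmt.Println(err)
}
```
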
The defer was never executed because the planning happens in a
long running goroutine that loops. The plans need to be released
immediately after applying them.
TMP files could leak when compactions failed for various reasons. They
were also being deleted inadvertently when compactions were disabled causing
other errors to be reported in the logs.
This changes full compactions within a shard to run sequentially
instead of running all the compaction groups in parallel. Normally,
there is only 1 full compaction group to run. At times, there could
be several which causes instability if they are all running concurrently
as they tie up a cpu for long periods of time.
Level compactions are also capped to a max of 4 concurrently running for each level
in a shard. This prevents sudden spikes in CPU and disk usage due to a large backlog
of tsm files at a given level.
Measurement name and field were converted between []byte and string
repeatedly, causing lots of garbage. This switches the code to use
[]byte in the write path.
This pool was previously a pool.Bytes to avoid repetitive allocations.
It was recently switched to a sync.Pool because pool.Bytes held onto
very large buffers at times which were never released. sync.Pool is
showing up in allocation profiles quite frequently.
This switches the pool to a new pool that limits how many buffers are
in the pool as well as the max size of each buffer in the pool. This
provides better bounds on allocations.
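
A bounded pool along these lines can be built on a buffered channel. This
is a sketch of the idea with hypothetical names, not necessarily how the
new pool is implemented.

```go
package main

import "fmt"

// limitedBytes is a hypothetical bounded buffer pool: it keeps at most
// cap(pool) buffers and never retains a buffer larger than maxSize.
type limitedBytes struct {
	pool    chan []byte
	maxSize int
}

func newLimitedBytes(capacity, maxSize int) *limitedBytes {
	return &limitedBytes{pool: make(chan []byte, capacity), maxSize: maxSize}
}

// Get returns a buffer of length sz, reusing a pooled buffer when one is
// available and large enough, and allocating otherwise.
func (p *limitedBytes) Get(sz int) []byte {
	select {
	case b := <-p.pool:
		if cap(b) >= sz {
			return b[:sz]
		}
	default:
	}
	return make([]byte, sz)
}

// Put hands b back to the pool. Oversized buffers, and buffers arriving
// while the pool is full, are dropped so the GC can reclaim them.
func (p *limitedBytes) Put(b []byte) {
	if cap(b) > p.maxSize {
		return
	}
	select {
	case p.pool <- b:
	default:
	}
}

func main() {
	p := newLimitedBytes(2, 1024)
	p.Put(make([]byte, 4096)) // too large, not retained
	p.Put(make([]byte, 512))
	fmt.Println(len(p.pool), cap(p.Get(100)))
}
```
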
This speeds up time encoding and decoding by skipping the divisor
scaling if scaling by 1. Since division and multiplication are expensive
CPU operations and scaling by 1 has no effect, the scaling just slows encoding
and decoding down.
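
In code the optimization is just an early return. The helper names here
are illustrative and do not match the encoder's internals.

```go
package main

import "fmt"

// scaleDown divides every timestamp by div, as done before encoding.
// Dividing by 1 changes nothing, so that case skips the loop entirely.
func scaleDown(ts []int64, div int64) {
	if div == 1 {
		return
	}
	for i := range ts {
		ts[i] /= div
	}
}

// scaleUp is the inverse used when decoding.
func scaleUp(ts []int64, div int64) {
	if div == 1 {
		return
	}
	for i := range ts {
		ts[i] *= div
	}
}

func main() {
	ts := []int64{1000, 2000, 3000}
	scaleDown(ts, 1000)
	fmt.Println(ts)
	scaleUp(ts, 1000)
	fmt.Println(ts)
}
```
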
Tombstone files would be written to all TSM files even if the deleted
keys or timerange did not exist in the TSM file. This had the side
effect of causing shards to get recompacted back to the same state. If
many shards or large numbers of TSM files existed, disk usage and CPU
utilization would spike causing issues.
This prevents tombstones being written for TSM files that could not
possibly contain the series keys being deleted or if the deleted time
range is outside the range of the file.
When monitoring shards, a slice of measurements is allocated for
each shard. With many shards and measurements, these allocations
can be large. Since inmem shards share the same index, we only
need to do this once since the resulting slices are all the same.
This reduces memory usage when monitoring shard cardinality.
Since this is called more frequently now, the cleanup func was invoked
quite a bit which makes several syscalls per shard. This should only
be called the first time compactions are disabled.
Index.ForEachMeasurementTagKey held an RLock while calling the fn,
if the fn made another call into the index which acquired an RLock
and after another goroutine tried to acquire a Lock, it would deadlock.
This was causing a shard to appear idle when in fact a snapshot compaction
was running. If the timing was right, the compactions would be disabled and
the snapshot compaction would be aborted.
The monitor goroutine ran for each shard and updated disk stats
as well as logged cardinality warnings. This goroutine has been
removed by making the disk stats more lightweight and callable
directly from Statistics and moving the logging to the tsdb.Store. The
latter allows one goroutine to handle all shards.
Each shard has a number of goroutines for compacting different levels
of TSM files. When a shard goes cold and is fully compacted, these
goroutines are still running.
This change will stop background shard goroutines when the shard goes
cold and start them back up if new writes arrive.
The compactor prevents the same file from being compacted by different
compaction runs, but it can result in warning errors in the logs that
are confusing.
This adds compaction plan tracking to the planner so that files are
only part of one plan at a given time.
This limit allows the number of concurrent level and full compactions
to be throttled. Snapshot compactions are not affected by this limit
as they need to run continuously.
This limit can be used to control how much CPU is consumed by compactions.
The default is to limit to the number of CPUs available.
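
A throttle like this is commonly built as a counting semaphore on a
buffered channel. The sketch below uses hypothetical names, and using
runtime.GOMAXPROCS(0) for the default is an assumption.

```go
package main

import (
	"fmt"
	"runtime"
)

// limiter is a hypothetical counting semaphore for compaction slots.
type limiter chan struct{}

// newLimiter creates a limiter with n slots. A non-positive n falls back
// to the number of CPUs the Go runtime may use.
func newLimiter(n int) limiter {
	if n <= 0 {
		n = runtime.GOMAXPROCS(0)
	}
	return make(limiter, n)
}

// tryTake claims a slot without blocking and reports whether it succeeded.
func (l limiter) tryTake() bool {
	select {
	case l <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot previously claimed with tryTake.
func (l limiter) release() { <-l }

func main() {
	l := newLimiter(2)
	fmt.Println(l.tryTake(), l.tryTake(), l.tryTake())
	l.release()
	fmt.Println(l.tryTake())
}
```

A level or full compaction would call tryTake before starting and release
when done; snapshot compactions would simply bypass the limiter.
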
Compactions are enabled as soon as the shard is opened. This can
slow down startup or cause the system to spike in CPU usage at startup
if many shards need to be compacted.
This now delays compactions until after the shards are loaded.
The lazy sorting of series caused a deadlock since it can not take
a Lock when a caller may have already acquired an RLock.
filters should be called w/o any locks as the function already acquires
locks as needed.
Under high write load, the check for each series was done sequentially
which caused a lot of CPU time to acquire/release the RLock on LogFile.
This switches the code to check multiple series at once under an RLock
similar to the change for inmem.
The current bytes.Pool will hold onto byte slices indefinitely. Large
writes can cause the pool to hold onto very large buffers over time.
Testing with sync.Pool seems to perform similarly now, so using a sync.Pool
will allow these buffers to be GC'd when necessary.
The inmem index would call CreateSeriesIfNotExist for each series
which takes and releases an RLock to see if a series exists. Under
high write load, the lock shows up in profiles quite a bit. This
adds a filtering step that obtains a single RLock and checks all the
series and returns the non-existent series to continue through the slow
path.
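
The filtering step might look roughly like this. The index type and
filterMissing are illustrative names, not the inmem index API.

```go
package main

import (
	"fmt"
	"sync"
)

// index is a hypothetical stand-in for a lock-protected series set.
type index struct {
	mu     sync.RWMutex
	series map[string]struct{}
}

// filterMissing checks a whole batch under one RLock and returns only the
// keys that do not exist yet. Only those need the slower create path,
// instead of one lock round trip per series.
func (i *index) filterMissing(keys []string) []string {
	i.mu.RLock()
	defer i.mu.RUnlock()

	var missing []string
	for _, k := range keys {
		if _, ok := i.series[k]; !ok {
			missing = append(missing, k)
		}
	}
	return missing
}

func main() {
	idx := &index{series: map[string]struct{}{"cpu,host=a": {}}}
	fmt.Println(idx.filterMissing([]string{"cpu,host=a", "cpu,host=b"}))
}
```
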
Under high write load, the sync goroutine would start up and end
very frequently. Starting a new goroutine so frequently adds a small
amount of latency which causes writes to take longer and sometimes time out.
This changes the goroutine to loop until there are no more waiters which
reduces the churn and latency.
If the sync waiters channel was full, it would block sending to the
channel while holding the WAL write lock. The sync goroutine would
then be stuck acquiring the write lock and could not drain the channel.
This increases the buffer to 1024, which would require a very high write
load to fill, and returns an error if the channel is full to prevent
the blocking.
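
Returning an error instead of blocking is done with a select that has a
default case. The names below are hypothetical and only the pattern is the
point.

```go
package main

import (
	"errors"
	"fmt"
)

// errSyncQueueFull is a hypothetical sentinel error.
var errSyncQueueFull = errors.New("sync waiters queue is full")

// enqueueWaiter registers w without ever blocking. If the buffered channel
// is full it returns an error, so the caller can release its lock rather
// than stall while holding it.
func enqueueWaiter(waiters chan chan error, w chan error) error {
	select {
	case waiters <- w:
		return nil
	default:
		return errSyncQueueFull
	}
}

func main() {
	waiters := make(chan chan error, 1) // the commit uses 1024
	fmt.Println(enqueueWaiter(waiters, make(chan error, 1)))
	fmt.Println(enqueueWaiter(waiters, make(chan error, 1)))
}
```
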
The Point is intended to be immutable after being parsed since it
is shared by several goroutines. When dropping a field (e.g. time),
corrupted data can result if one goroutine is deleting the field
while another is marshaling the underlying byte slices.
To avoid this, the shard will just skip invalid fields and series
instead of trying to mutate them by deleting them.
Removing series while trying to maintain the sorted series list
does not perform well when removing many series. This causes
drop DB, RP, series, to be very slow in some cases.
Instead, lazily create a sorted series list when first requested and
invalidate it when dropping series.
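
A sketch of the lazy approach, assuming a hypothetical measurement type
guarded by a single mutex. Callers must not hold the mutex when asking for
the sorted list.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// measurement is a hypothetical stand-in tracking series ids.
type measurement struct {
	mu     sync.Mutex
	ids    map[uint64]struct{}
	sorted []uint64
	dirty  bool // true when sorted no longer matches ids
}

func newMeasurement() *measurement {
	return &measurement{ids: map[uint64]struct{}{}}
}

func (m *measurement) add(id uint64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.ids[id] = struct{}{}
	m.dirty = true
}

// drop removes id and only invalidates the sorted list; it does not shift
// slice elements, so dropping many series stays cheap.
func (m *measurement) drop(id uint64) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.ids, id)
	m.dirty = true
}

// sortedIDs rebuilds the list from scratch on first use after a change.
// Starting from an empty slice avoids duplicating ids across rebuilds.
func (m *measurement) sortedIDs() []uint64 {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.dirty {
		m.sorted = make([]uint64, 0, len(m.ids))
		for id := range m.ids {
			m.sorted = append(m.sorted, id)
		}
		sort.Slice(m.sorted, func(i, j int) bool { return m.sorted[i] < m.sorted[j] })
		m.dirty = false
	}
	return m.sorted
}

func main() {
	m := newMeasurement()
	for _, id := range []uint64{3, 1, 2} {
		m.add(id)
	}
	m.drop(2)
	fmt.Println(m.sortedIDs())
}
```
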
This reworks drop measurement to use a sorted list of series keys
instead of creating an intermediate map. It removes allocations
and some extra garbage that is created during drop measurement.
WalkKeys serially walked each TSM file and invoked fn for each key.
Caller needed to handle duplicate calls to fn with the same key
because the same key could exist in multiple TSM files. The serial
execution was also slower.
Since the series keys are already sorted, we can iterate over all
files in parallel and skip duplicates using a sorted merge. This
fixes the duplicate invocation issue as well as speeds up walking
all keys.
This can significantly improve startup performance when many TSM files
exist that may not have been fully compacted. This also has benefits
for deletes (measurements/series) since duplicates are removed saving
extra allocations and work. This may also allow for the optimize
compaction to be removed provided startup times are fast enough.
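
The duplicate-skipping sorted merge can be sketched as below. Plain sorted
string slices stand in for the per-file key iterators, and this version
walks them in lockstep on one goroutine, so it shows the merge only and not
the concurrency.

```go
package main

import "fmt"

// walkKeys calls fn once per distinct key across all files. Each inner
// slice is assumed to be sorted, like the keys of a single TSM file.
func walkKeys(files [][]string, fn func(key string)) {
	pos := make([]int, len(files))
	for {
		// Find the smallest key at the current position of any file.
		smallest, found := "", false
		for i, f := range files {
			if pos[i] < len(f) && (!found || f[pos[i]] < smallest) {
				smallest, found = f[pos[i]], true
			}
		}
		if !found {
			return // every file is exhausted
		}
		fn(smallest)

		// Advance every file past that key so it is reported only once.
		for i, f := range files {
			for pos[i] < len(f) && f[pos[i]] == smallest {
				pos[i]++
			}
		}
	}
}

func main() {
	files := [][]string{{"cpu", "mem"}, {"cpu", "disk"}, {"mem"}}
	walkKeys(files, func(k string) { fmt.Println(k) })
}
```
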
When using the inmem index, if one drops a database, and then creates it
again, the previous index object will be reused. This includes the
previous cardinality estimation sketches, leading to inaccurate
cardinality estimations.