A deadlock occurs under write load if a backup is run in between the
time when a snapshot compactions has snapshotted the cache and successfully
written it to disk. The issus is that the second snapshot call will block
on the commit lock while it is holding the engine write lock. This causes
all writes to block as well as prevents the currently runnign snapshot
compaction from completing because it needs to acquire a read-lock.
This PR removes the commit lock and just returns an error if a snapshot is
in progress to all any locks being held to be released. The caller can determine
whether to retry or giveup.
Currently two compactors can execute Engine.WriteSnapshot at once.
This isn't thread safe since both threads want to make modifications to
Cache.snapshot at the same time.
This commit introduces a lock which is acquired during Snapshot() and
released during ClearSnapshot(), ensuring that at most one thread
executes within Engine.WriteSnapshot() at once.
To ensure that we always release this lock, but only release the
snapshot resources on a successful commit, we modify ClearSnapshot() to
accept a boolean which indicates whether the write was successful or not
and guarantee to call this function if Snapshot() has been called.
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
The Cache had support for taking multiple snapshots to support writing
multiple snapshots to TSM files concurrently if that happened to be
a bottleneck. In practice, this is never a bottleneck and we only
run one snappshoting goroutine continously per shard which has worked
well for all workloads.
The multiple snapshot support introduces some unhandled failure scenarios
where wal segments could be removed without writing them to TSM files. If
a snapshot compaction fails to write due to transient disk errors, subsequent
snapshots will continue, but the failed one will not be retried. When the
subsequent ones succeeded, all closed wal segments are removed causing data
loss.
This change simplifies the snapshotting capability to ensure that there is only
ever one snapshot. If one fails, the next snapshot will update the existing
snapshot and retry all of old and new data.
Fixes#5686
Complementing and extending the changes in #5758.
Add 2 level statistics:
* snapshotCount
* cacheAgeMs
Add 2 counter statistics
* cachedBytes
* WALCompactionTimeMs
snapshotCount can be used to measure transient write errors that are causing snapshots to accumulate
cacheAgeMs can be used to guage the level of write activity into the cache
The differences between cachedBytes stats sampled at different times can be used to calculate cache throughput rates
The ratio (cachedBytes-diskBytes)/WALCompactionTimeMs can be used calculate WAL compaction throughput.
The ratio of difference between first and last WAL compaction time over the interval
length is an estimate of percentage of cache throughput consumed.
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
Aux iterators now ask the iterator creator what series will be returned
and determine which aux fields to create based on the results.
The `tsdb.Shards` struct also creates a call iterator around the
iterators returned from each shard.
Fill requires an additional function for IteratorCreator to retrieve the
series that will be returned from the iterator. When fill is required
for an aggregate, the IteratorCreator will be asked what series will be
returned by the created iterator.
Writing the snapshot would deduplicate the snapshot points
while still holding the engine write-lock. This can be expensive
under high load and cause writes to back up and OOM the server.
Instead, grab the snapshot under the lock and dedup it after releasing
the lock.
Possible fix for #5442
We were closing the cursor when we read the last block which caused
the internal state to be cleared. In a group by query, we seeked multiple
times so depending on the group by interval and how the data was laid out
in the blocks, we woudl close the cursor and the last block would get skipped.
Fixes#5193
This changes backup and restore to work for TSM. It breaks it for b1 and bz1, but since those are getting removed it's ok.
The backup runs against any host that is specified and can backup either the metasstore, a database, specific retention policy, or a specific shard. It can also take incremental backups with the `since` flag, which will only backup TSM files that have been created since that timestamp.
The backup is safe to run online. However, for shards that are still hot for writes, they won't be able to create new TSM files while the backup for that single shard runs. If the backup isn't too large and the write throughput isn't too high this shouldn't be a problem since the writes will just go into the WAL cache.
This has a few changes in it (unfortuantely). The main change is to run compactions
concurrently. While implementing this, a few query and performance bugs showed up that
are also fixed by this commit.
If the engine is closed while a compaction is going on, the close call
blocks until the goroutine exits. This could be several minutes because
the control does not return back up to the channel selector while there is
still data to write.
* Update cache loader to delete entries from cache
* Add cache.Delete()
* Update delete to look at keys in the Cache in addition to the FileStore
* Update cache compaction to never happen if the cache is empty
* Update Plan to do a full compaction if cold for writes
* Remove MaxFileSize as a config variable from Compactor. Should be a set constant
* Update Plan to keep track of if the last check was fully compacted so we can skip future planning calls
* Update compact min file count to 3 so that compactions run more frequently
* remove rolloverTSMFileSize constant that is no longer used
* remove the maxGenerationFileCount since it is no longer a limitation that's necessary with the new compaction scheme. We no longer read WAL segments as part of the compaction so memory is only used as we read in each individual key
* remove minFileCount and switch to a user configurable variable
* remove the mutex from WALSegmentWriter. There's never more than one open in the WAL at one time and it's not exported through any function so the lock on the WAL should be used. This simplified keeping track of the last write time and removed a bunch of unnecessary locks.
* update WALSegmentWriter.Write to take the compressed bytes so that encoding and compression can occur before the call to write (while we don't hold the WAL lock)
* remove a bunch of unnecessary locking in WAL.writeToLog
* Add check for TSM file magic number and vesion
* Remove old tsm, log, and unused cursor code
* Remove references to tsm1dev everywhere except in the inspector
* Clean up config options for compaction and snapshotting
* Remove old TSM configuration options
* Update the config.sample.toml with TSM options
* Update WAL compact to force if it has been cold for writes for a configurable period of time (1h by default)
This was causing races in the code, when the cache was being reloaded,
because back-to-back open-and-closing of the engine during testing left
goroutines running. With this change the engine is completely shutdown
when Close() is called on it.
First pass at TSM cursor iteration ended up searching the file indexes
too frequently and hurt performance. This changes that to search it once
and then have the cursor hold onto the block locations to seek
to. Doubles the query performance from the first iteration, but still a lot
of room for improvement.
* Add InfluxQLType to Values to map the TSM type to InfluxQL
* Fix bug in WAL where close wouldn't nil out the currentSegment after closing it
* Export writeSnapshot to be used in tests, add argument to run it async or not
* Update reloadCache to load temporary metadata information in the engine
* Update LoadMetadataIndex to use the temp WAL metadata information