It looks like the real import path to the project is go.uber.org/zap
instead of github.com/uber-go/zap since the example in the project
references that path.
The logging library has been switched to use uber-go/zap. While the
logging has been changed to use structured logging, this commit does not
change any of the logging statements to take advantage of the new
structured log or new log levels. Those changes will come in future
commits.
NO-OP on platforms with unix path separator.
On Windows paths get converted to slashes before adding to archive and back to backslashes during restore.
This returns the LastModified time of the shard. The LastModified
time is the wall time when a change to the shards state occurred.
It uses the WAL or FileStore to determine the max mod time.
The `first()` and `last()` functions response rate would increase linear
to the number of points even though it seems like it shouldn't. This
optimization greatly reduces the amount of time to return a response
when no `GROUP BY time(...)` clause is present in a query.
Previously, we would return a full tag set for every shard and the tag
set would include all series that existed in the database index
including series that didn't physically exist within that shard. This
led to the tag sets returned being incredibly huge when we had high
cardinality but sparse data. Since the data was sparse, it was
unexpected that it would cause such a large strain on the system by most
people.
Now we filter out the series ids that are not assigned to the current
shard when computing a tag set for that shard. This lowers the memory
usage for high cardinality sparse data drastically and allows queries on
those to complete successfully.
This does not resolve issues for high cardinality data in every shard
that is also spread out over a long series of time. That situation isn't
nearly as common as the above situation though.
Unify logic around compaction execution to a single place.
Also report on the error stats that we track. Previously they were not
emitted in the stats output.
If a delete takes a long time to process while writes to the
shard are occuring, it was possible for the cache to fill up
and writes to be rejected. This occurred because we disabled
all compactions while writing tombstone file to prevent deleted
data from re-appearing after a compaction completed.
Instead, we only disable the level compactions and allow snapshot
compactions to continue. Snapshots already handle deleted data
with the cache and wal.
Fixes#7161
The FieldIterator is used to scan over the fields of a point, providing
information, and delaying parsing/decoding the value until it is needed.
This change uses this new type to avoid the allocation of a map for the
fields which is then thrown away as soon as the points get converted
into columns within the datastore.
The logic for determining whether a series key was already in the
the set of TSM series was too restrictive. It allowed only the first
field of a series to be added leaving all the remaing fields.
If they were left around, re-enabling them again could cause
future compactions to continuously fail. A restart of the
server would clean them up correctly though.
If there were multiple TSM files and a delete/drop was run,
we would write the delete series to the tombstone file N
times for each file. This occurred because FileStore.WalkKeys walks
every key in every TSM file which can return duplicate keys.
This issue caused TSM files to be much larger than they should be
and also cause large memory usage during the delete.
Normally, compactions do not conflict on the files they are compacting.
If the full cold threshold is set very low, it can cause conflicts where
two compactions compact the same files. The full compaction was the
only place this could happen as it's planning is greedy.
To make this safer for concurrent execution, the compaction tracks which
files are current being compacted and prevents any new compactions from
starting if the file set overlaps.
Fixes#6595
If a delete is issued while a compaction is running, the a newly
deleted series could re-appear after the compaction completed. This
could occur the compaction had already written the blocks for series
that were just deleted. When the compaction completes, the newly
written tombstone files would be deleted, essentially undeleting the
series.
For larger datasets, it's possible for shards to get into a state where
many large, dense TSM files exist. While the shard is still hot for
writes, full compactions will skip these files since they are already
fairly optimized and full compactions are expensive. If the write volume
is large enough, the shard can accumulate lots of these files. When
a file is in this state, it's index can contain every series which
causes startup times to increase since each file must parse the full
set of series keys for every file. If the number of series is high,
the index can be quite large causing large amount of disk IO at startup.
To fix this, a optmize compaction is run when a full compaction planning
step decides there is nothing to do. The optimize compaction combines
and spreads the data and series keys across all files resulting in each
file containing the full series data for that shard and a subset of the
total set of keys in the shard.
This allows a shard to only store a series key once in the shard reducing
storage size as well allows a shard to only load each key once at startup.
Truncate the time interval output of the monitor service to be on even
time intervals rather than on every minute based on the start time. This
normalizes the output from the monitor service.
The tsdb package had a substantial amount of dead code related to the
old query engine still in there. It is no longer used, so it was removed
since it was left unmaintained. There is likely still more code that is
the same, but wasn't found as part of this code cleanup.
influxql has dead code show up because of the code generation so it is
not included in this pruning.
A copy/paste error had nil cursors destined for a condition cursor get
set to the auxiliary cursor instead. When the number of conditions
exceeded the number of auxiliary fields, this would result in a stack
trace in some situations. When the number of conditions was less than or
equal to the number of auxiliary fields, it means that an auxiliary
cursor may have been overwritten with a nil cursor accidentally and a
leak might have happened since it was never closed.
Fixes#6859.
Restore would try to open the shard if there was an error. If there
was an error, the files written are very likely to be partially written
and they can cause the server to panic.
To prevent a shard from trying to open broken files, we now write to
a temp file and rename it to the actual name only after fully writing
and fsyncing the file.
For restoring a shard, we need to be able to have the shard open,
but disabled. It was racy to open it and then disable it separately
since writes/queries could occur in between that time.
This switch the backup shard call to use the shard Snapshot that
internally creates a snapshot by hardlinking all of the TSM and
tombstone files instead. This reduces the time that the FileStore
is locked and will allow for larger shards to be backup more easily.
This fixes a pathalogical query condition cause by and problematic
structuring of TSM files based on how points were written. The
condition can occur when there are multiple TSM files and a large
number of points are written into the past. The earlier existing
TSM files must also have points in the past and close to the present
causing their time range to eclipse the later files.
When this condition occurs, some queries can spend an excessive amount
of time merge all the overlapping blocks.
The fix was to constrain the window of overlapping blocks based on
the first one we ran into. There was also a simple case in the Merge
where we could skip the binary search path and just append the two
inputs.
The limit optimization was put into the wrong place and caused only part
of the shard to be read when a limit was used. The optimization is
possible, but requires a bit of refactoring to the code here so the call
iterator is created per series before handed to the limit iterator.
Fixes#6661.
The optimization to speed up shard loading had the side effect of
skipping adding series to the index that already exist. The skipping
was in the wrong location and also skipped the shards measurementFields
index which is required in order to query that series in the shard.
Casting syntax is done with the PostgreSQL syntax `field1::float` to
specify which type should be used when selecting a field. You can also
do `field1::field` or `tag1::tag` to specify that a field or tag should
be selected.
This makes it possible to select a tag when a field key and a tag key
conflict with each other in a measurement. It also means it's possible
to choose a field with a specific type if multiple shards disagree. If
no types are given, the same ordering for how a type is chosen is used
to determine which type to return.
The FieldDimensions method has been updated to return the data type for
the fields that get returned. The SeriesKeys function has also been
removed since it is no longer needed. SeriesKeys was originally used for
the fill iterator, but then expanded to be used by auxiliary iterators
for determining the channel iterator types. The fill iterator doesn't
need it anymore and the auxiliary types are better served by
FieldDimensions implementing that functionality, so SeriesKeys is no
longer needed.
Fixes#6519.
On data sets with many series and potentially large series keys,
the cost of parsing the key and re-indexing can be high.
Loading the TSM keys into the index was being done repeatedly for
series that were already index by an earlier TSM file. This was
wasted worked and slows down shard loading.
Parsing the key was also innefficient and allocated a new string
slice. This was simplified to remove that allocation.
This commit changes the `tsm1.Engine` to create individual series
iterators in batches so that it can be parallelized. Iterators
are combined at the end so they can be redistributed to the
parallelized merge iterator.
This commit moves the `CallIterator` to wrap the individual series
instead of wrapping a shard. This allows individual points to be
aggregated before being merged.
This will cause a small increase in memory usuage per series but
it shows a 20% decrease in query time when there are a moderate
number of points per series.
If a shard is empty for a specific field and the field type is something
other than a float, a nil iterator would get returned from one of the
empty shards and cause the combined iterators to be cast to the float
type and all other iterator types to be discarded (or for integers, to
be cast).
This is rare since most aggregates don't accept strings or booleans, but
for queries like:
SELECT distinct(string) FROM mydata
It would result in nothing getting returned if one of the shards didn't
have a value for `string`.
This change modifies the query engine to return nil for the shards
instead of a fake iterator and then to only use the fake iterator if the
final aggregate iterator is nil (meaning that no iterators could be
constructed for the field from any shard).
Fixes#6495.
The code for parsing a key our of the WAL or TSM files in the engine
was naive and didn't account for measurements with escape chars. This
uses the correct parsing code to parse and load them correctly.
Fixes#6496
This remove the dropMeta param from the tsdb.Store.DeleteSeries and
lets the shard determine when to remove the meta data from the index
based on what series still have data in the shard.
This uncovered a nasty bug in compactions where a fully deleted series would
prematurely end the compactions and not carry forward the rest of the data
in the TSM file. This is now fixed as well.
This has various benefits:
- Users embedding InfluxDB within other Go programs can specify a different logger / prefix easily.
- More consistent with code used elsewhere in InfluxDB (e.g. services, other `run.Server.*` fields, etc).
- This is also more efficient, because it means `executeQuery` no longer allocates a single `*log.Logger` each time it is called.
The cache max memory size is an approximate size and can prevent a
shard from loading at startup. This change disable the max size
at startup to prevent this problem and sets the limt back after
reloading.
Fixes#6109
When a GROUP BY or multiple sources are used, the top level limit
iterator requires reading the entire iterator stream so it can find all
of the tag groups it needs to return. For large data series, this ends
up with the limit iterator discarding a lot of output.
This change adds a new lower level limit iterator on each series itself
so that there are fewer data points that have to be thrown away by the
top level iterator.
Fixes#5553.
Now it is possible to compare tags and fields and it is also now
possible to compare tags and tags. Previously, it was only possible to
compare fields with fields and tags with a string or a regex.
Fixes#3371.
Send nil values from the tsm1 cursor at the end of the cursor. After the
cursor reached tsm1, the `nextAt()` call would always return the default
value rather than a nil value.
Descending also didn't work correctly because the seeking functionality
for tsm1 iterators would always act like they were ascending instead of
descending when choosing which value to select. This resulted in very
strange output from the emitter since it couldn't figure out if it was
ascending or descending.
Fixes#6206.
Both Shard and Engine had the same reference to the measurementField map,
but they each protected it with their own locks. This causes a race when
write and queries are occurring because writes can add new fields to the
map while queries are reading from it.
The fix moves the ownership to the Engine and provides protected accessors
to that Shard now users. For the most parts, the access on shard were old
dead code.
Fixing the measurementFields map race created a new race on the internal
fields map. This is now unexported and protected via MeasurementFields
exported funcs.
Fixes#6188
If an OR was used, merging filters between different expressions would
not work correctly. If one of the sides had a set of series ids with a
condition and the other side had no series ids associated with the
expression, all of the series from the side with a condition would have
the condition ignored. Instead of defaulting a non-existant series
filter to true, it should just be false and the evaluation of the one
side that does exist should take care of determining if the series id
should be included or not. The AND condition used false correctly so did
not have to be changed.
If a tag did not exist and `!=` or `!~` were used, it would return false
even though the neither a field or a tag equaled those values. This has
now been modified to correctly return the correct series ids and the
correct condition.
Also fixed a panic that would occur when a tag caused a field access to
become unnecessary. The filter using the field access still got created
and used even though it was unnecessary, resulting in an attempted
access to a non-initialized map.
Fixes#5152 and a bunch of other miscellaneous issues.
Use of the iterator is spread out into both `IteratorCreators` and
inside of the iterators themselves. Part of the interrupt must be
handled inside of the engine so it stops trying to emit points when an
interrupt is found and another part of the interrupt has to happen when
combining the iterators so it doesn't just start reading the next shard.
A deadlock occurs under write load if a backup is run in between the
time when a snapshot compactions has snapshotted the cache and successfully
written it to disk. The issus is that the second snapshot call will block
on the commit lock while it is holding the engine write lock. This causes
all writes to block as well as prevents the currently runnign snapshot
compaction from completing because it needs to acquire a read-lock.
This PR removes the commit lock and just returns an error if a snapshot is
in progress to all any locks being held to be released. The caller can determine
whether to retry or giveup.
Currently two compactors can execute Engine.WriteSnapshot at once.
This isn't thread safe since both threads want to make modifications to
Cache.snapshot at the same time.
This commit introduces a lock which is acquired during Snapshot() and
released during ClearSnapshot(), ensuring that at most one thread
executes within Engine.WriteSnapshot() at once.
To ensure that we always release this lock, but only release the
snapshot resources on a successful commit, we modify ClearSnapshot() to
accept a boolean which indicates whether the write was successful or not
and guarantee to call this function if Snapshot() has been called.
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
The Cache had support for taking multiple snapshots to support writing
multiple snapshots to TSM files concurrently if that happened to be
a bottleneck. In practice, this is never a bottleneck and we only
run one snappshoting goroutine continously per shard which has worked
well for all workloads.
The multiple snapshot support introduces some unhandled failure scenarios
where wal segments could be removed without writing them to TSM files. If
a snapshot compaction fails to write due to transient disk errors, subsequent
snapshots will continue, but the failed one will not be retried. When the
subsequent ones succeeded, all closed wal segments are removed causing data
loss.
This change simplifies the snapshotting capability to ensure that there is only
ever one snapshot. If one fails, the next snapshot will update the existing
snapshot and retry all of old and new data.
Fixes#5686
Complementing and extending the changes in #5758.
Add 2 level statistics:
* snapshotCount
* cacheAgeMs
Add 2 counter statistics
* cachedBytes
* WALCompactionTimeMs
snapshotCount can be used to measure transient write errors that are causing snapshots to accumulate
cacheAgeMs can be used to guage the level of write activity into the cache
The differences between cachedBytes stats sampled at different times can be used to calculate cache throughput rates
The ratio (cachedBytes-diskBytes)/WALCompactionTimeMs can be used calculate WAL compaction throughput.
The ratio of difference between first and last WAL compaction time over the interval
length is an estimate of percentage of cache throughput consumed.
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
Aux iterators now ask the iterator creator what series will be returned
and determine which aux fields to create based on the results.
The `tsdb.Shards` struct also creates a call iterator around the
iterators returned from each shard.
Fill requires an additional function for IteratorCreator to retrieve the
series that will be returned from the iterator. When fill is required
for an aggregate, the IteratorCreator will be asked what series will be
returned by the created iterator.
Writing the snapshot would deduplicate the snapshot points
while still holding the engine write-lock. This can be expensive
under high load and cause writes to back up and OOM the server.
Instead, grab the snapshot under the lock and dedup it after releasing
the lock.
Possible fix for #5442
We were closing the cursor when we read the last block which caused
the internal state to be cleared. In a group by query, we seeked multiple
times so depending on the group by interval and how the data was laid out
in the blocks, we woudl close the cursor and the last block would get skipped.
Fixes#5193
This changes backup and restore to work for TSM. It breaks it for b1 and bz1, but since those are getting removed it's ok.
The backup runs against any host that is specified and can backup either the metasstore, a database, specific retention policy, or a specific shard. It can also take incremental backups with the `since` flag, which will only backup TSM files that have been created since that timestamp.
The backup is safe to run online. However, for shards that are still hot for writes, they won't be able to create new TSM files while the backup for that single shard runs. If the backup isn't too large and the write throughput isn't too high this shouldn't be a problem since the writes will just go into the WAL cache.
This has a few changes in it (unfortuantely). The main change is to run compactions
concurrently. While implementing this, a few query and performance bugs showed up that
are also fixed by this commit.
If the engine is closed while a compaction is going on, the close call
blocks until the goroutine exits. This could be several minutes because
the control does not return back up to the channel selector while there is
still data to write.
* Update cache loader to delete entries from cache
* Add cache.Delete()
* Update delete to look at keys in the Cache in addition to the FileStore
* Update cache compaction to never happen if the cache is empty
* Update Plan to do a full compaction if cold for writes
* Remove MaxFileSize as a config variable from Compactor. Should be a set constant
* Update Plan to keep track of if the last check was fully compacted so we can skip future planning calls
* Update compact min file count to 3 so that compactions run more frequently
* remove rolloverTSMFileSize constant that is no longer used
* remove the maxGenerationFileCount since it is no longer a limitation that's necessary with the new compaction scheme. We no longer read WAL segments as part of the compaction so memory is only used as we read in each individual key
* remove minFileCount and switch to a user configurable variable
* remove the mutex from WALSegmentWriter. There's never more than one open in the WAL at one time and it's not exported through any function so the lock on the WAL should be used. This simplified keeping track of the last write time and removed a bunch of unnecessary locks.
* update WALSegmentWriter.Write to take the compressed bytes so that encoding and compression can occur before the call to write (while we don't hold the WAL lock)
* remove a bunch of unnecessary locking in WAL.writeToLog
* Add check for TSM file magic number and vesion
* Remove old tsm, log, and unused cursor code
* Remove references to tsm1dev everywhere except in the inspector
* Clean up config options for compaction and snapshotting
* Remove old TSM configuration options
* Update the config.sample.toml with TSM options
* Update WAL compact to force if it has been cold for writes for a configurable period of time (1h by default)
This was causing races in the code, when the cache was being reloaded,
because back-to-back open-and-closing of the engine during testing left
goroutines running. With this change the engine is completely shutdown
when Close() is called on it.
First pass at TSM cursor iteration ended up searching the file indexes
too frequently and hurt performance. This changes that to search it once
and then have the cursor hold onto the block locations to seek
to. Doubles the query performance from the first iteration, but still a lot
of room for improvement.
* Add InfluxQLType to Values to map the TSM type to InfluxQL
* Fix bug in WAL where close wouldn't nil out the currentSegment after closing it
* Export writeSnapshot to be used in tests, add argument to run it async or not
* Update reloadCache to load temporary metadata information in the engine
* Update LoadMetadataIndex to use the temp WAL metadata information
Something broke with writing to the WAL now that compactions are running
concurrently. There was also a performance problem with Next/Prev doing
twice as many searches as necessary.
* Update cache to have a single slice of values for a key (removed checkpoints)
* Changed compact.Plan to only worry about TSM files.
* Updated Plan to not return an error since there was no case in which it would.
* Update WAL to not keep stats since they're no longer needed.
* Update engine to flush the Cache/WAL to a new TSM file when the min threshold is hit.
* Split compact logic between TSM compacts and WAL/Cache writes.
* Remove unnecessary merge iterator, wal segment iterator, and other no longer necessary stuff.
* Remove the asending bool from the Dedupe method. Values should always be in ascending order. It's up to the cursor to iterate through values based on the direction. Giving the cursor responsibility makes it so we don't need to sort, dedupe or reallocate anything for different query orders.
* Updated engine to use its locks to ensure writes and cache flushes don't cause a race.
* Update all tests with new signatures. Removed a bunch of tests around TSM rewrites and WAL segment iteration that are no longer necessary.
100mb is easy it hit even with basic stress test config. Don't set
a limit by default so that an operator can size it appropriately based
on their hardware.
Starting to integrate some of the components into a engine that is
usable for development purposes. This allows the code to evolve while
keeping the existing TSM engine in tact for reference.
Currently, just the WAL is wired up so writes can be tested. Other engine
functions will panic the server if called.