This was the wrong fix. The real issue was the tombstones were
being read incorrectly and also applied incorrectly at times. This
code is slower and not necessary so reverting it.
Each iteration of the loop was incrementing the position by 4 incorrectly.
The position should start at four since the header is 4 bytes. This
caused tombstones at the end of the file to not be read because the counter
was out of sync with the actual file position which cause the loop to exit early.
Probably better to refactor this to check for io.EOF instead of using the counter.
The code for parsing a key our of the WAL or TSM files in the engine
was naive and didn't account for measurements with escape chars. This
uses the correct parsing code to parse and load them correctly.
Fixes#6496
This remove the dropMeta param from the tsdb.Store.DeleteSeries and
lets the shard determine when to remove the meta data from the index
based on what series still have data in the shard.
This uncovered a nasty bug in compactions where a fully deleted series would
prematurely end the compactions and not carry forward the rest of the data
in the TSM file. This is now fixed as well.
When a shard is closed and removed due to retention policy enforcement,
the series contained in the shard would still exists in the index causing
a memory leak. Restarting the server would cause them not to be loaded.
Fixes#6457
There are two TSMIndex implementations, the directIndex and the
indirectIndex. Originally, we only had the directIndex and later
added the indirectIndex and NewTSMReaderWithOptions in order to
allow both indexes to be used in tests and code. This has created
a problem since we really only use the directIndex for writing and
always use the indirectIndex for reading.
This changes removes the NewTSMReaderWithOptions func so that it is
no longer possible to create a TSMReader with a directIndex. This
will allow a lot of the block reading code used by the directIndex
to be removed and simplify maintainence. It also gives better test
coverage of the code that is actually used by the TSM engine now.
This adds support for a time range to tombstone files to allow a subset
of points to be deleted instead of the whole series. It changes the
tombstone file format to a binary format and maintains backwards compatibility
with the old text format tombstone files.
Binary math inside of a where condition was previously disallowed. Now,
these types of queries are just passed verbatim down to the underlying
query engine which can handle it.
We may want to revisit this when it comes to tags at some point as it
prevents the more efficient filtering of tags that a simple expression
allows, but it allows a query like this to be done:
SELECT * FROM cpu WHERE value + 2 < 5
So while it can be better, this is a good initial implementation to
provide this functionality. There are very rare situations where a tag
may be used appropriately in one of these circumstances.
Fixes#3558.
This commit changes the `FloatDecoder.val` from a `float64` type
to a `uint64` to avoid an additional type conversion during read.
Now the type gets converted to a `float64` only on call to `Values()`.
This has various benefits:
- Users embedding InfluxDB within other Go programs can specify a different logger / prefix easily.
- More consistent with code used elsewhere in InfluxDB (e.g. services, other `run.Server.*` fields, etc).
- This is also more efficient, because it means `executeQuery` no longer allocates a single `*log.Logger` each time it is called.
The cache max memory size is an approximate size and can prevent a
shard from loading at startup. This change disable the max size
at startup to prevent this problem and sets the limt back after
reloading.
Fixes#6109
The series keys within a tag set were previously not sorted which would
cause the output to be non-deterministic. This sorts the output series
by their keys so it has a consistent output especially when using
limits.
Fixes#3166.
This also switches the remaining iterators to be lazy so they can return
errors properly. They needed to be converted to lazy initialization
anyway, which has the side effect of making it much easier for us to
propagate the underlying error during initialization.
Updated the Emitter to return errors when it cannot read properly from
the iterators.
When a GROUP BY or multiple sources are used, the top level limit
iterator requires reading the entire iterator stream so it can find all
of the tag groups it needs to return. For large data series, this ends
up with the limit iterator discarding a lot of output.
This change adds a new lower level limit iterator on each series itself
so that there are fewer data points that have to be thrown away by the
top level iterator.
Fixes#5553.
Now it is possible to compare tags and fields and it is also now
possible to compare tags and tags. Previously, it was only possible to
compare fields with fields and tags with a string or a regex.
Fixes#3371.
This commit makes a number of performance improvements to
reduce allocations during query execution. Several objects
and buffers are now reused across the components to avoid
allocations.
Previously a simple `count(value)` query across 1M points
would require 26,000+ allocations. After the changes in
this commit that number has been reduced to 88.
A missing tag on a point was sometimes treated as `""` and sometimes
treated as a separate `null` entity. This change modifies the equality
operations to always treat a missing tag as an empty string.
Empty tags are *not* indexed and do not have the same performance as a
tag that exists.
Fixes#3773.
Send nil values from the tsm1 cursor at the end of the cursor. After the
cursor reached tsm1, the `nextAt()` call would always return the default
value rather than a nil value.
Descending also didn't work correctly because the seeking functionality
for tsm1 iterators would always act like they were ascending instead of
descending when choosing which value to select. This resulted in very
strange output from the emitter since it couldn't figure out if it was
ascending or descending.
Fixes#6206.
The QueryExecutor had a lot of dead code made obsolete by the query
engine refactor that has now been removed. The TSDBStore interface has
also been cleaned up so we can have multiple implementations of this
(such as a local and remote version).
A StatementExecutor interface has been created for adding custom
functionality to the QueryExecutor that may not be available in the open
source version. The QueryExecutor delegate all statement execution to
the StatementExecutor and the QueryExecutor will only keep track of
housekeeping. Implementing additional queries is as simple as wrapping
the cluster.StatementExecutor struct or replacing it with something
completely different.
The PointsWriter in the QueryExecutor has been changed to a simple
interface that implements the one method needed by the query executor.
This is to allow different PointsWriter implementations to be used by
the QueryExecutor. It has also been moved into the StatementExecutor
instead.
The TSDBStore interface has now been modified to contain the code for
creating an IteratorCreator. This is so the underlying TSDBStore can
implement different ways of accessing the underlying shards rather than
always having to access each shard individually (such as batch
requests).
Remove the show servers handling. This isn't a valid command in the open
source version of InfluxDB anymore.
The QueryManager interface is now built into QueryExecutor and is no
longer necessary. The StatementExecutor and QueryExecutor split allows
task management to much more easily be built into QueryExecutor rather
than as a separate struct.
Both Shard and Engine had the same reference to the measurementField map,
but they each protected it with their own locks. This causes a race when
write and queries are occurring because writes can add new fields to the
map while queries are reading from it.
The fix moves the ownership to the Engine and provides protected accessors
to that Shard now users. For the most parts, the access on shard were old
dead code.
Fixing the measurementFields map race created a new race on the internal
fields map. This is now unexported and protected via MeasurementFields
exported funcs.
Fixes#6188
The stats setup ends up creating a lot of lock contention which signifcantly
impacts write throughput when a large number of measurements are used.
Fixes#6131
Writing a key that exceeds the max key length could cause a panic
when reading a tsm file because the 2 bytes used for the key length
would not be enough to represent the actual key length.
The writer will now return an error if when trying to write a key
that is too large.
After adding type-switches to the tsm1 packages, the custom
implementation found in the conversion tool broke. This change uses
tsm1.NewValue() instead of a custom implementation.
This change also ensures that the tsm1.Value interface can only be
implemented internally to allow for the optimized type-switch based
encoding
Since loading a shard can allocate a lot of memory, running them all
at once could OOM the process. This limits the number of shards
loaded to 4. This will be changed to a config option provided the
approach helps.
When loading many shards concurrently they block trying to
acquire a write lock in the sync pool adding a new source of
contention. Since this code flow always needs to allocate a
buffer it's not really buying us much.
This commit adds a buffer for stats to be updated without
requiring a mutex lock/unlock on every point. The tradeoff
is that stats are not exactly precise. This works for our
use case because stats are only periodically checked.
If an OR was used, merging filters between different expressions would
not work correctly. If one of the sides had a set of series ids with a
condition and the other side had no series ids associated with the
expression, all of the series from the side with a condition would have
the condition ignored. Instead of defaulting a non-existant series
filter to true, it should just be false and the evaluation of the one
side that does exist should take care of determining if the series id
should be included or not. The AND condition used false correctly so did
not have to be changed.
If a tag did not exist and `!=` or `!~` were used, it would return false
even though the neither a field or a tag equaled those values. This has
now been modified to correctly return the correct series ids and the
correct condition.
Also fixed a panic that would occur when a tag caused a field access to
become unnecessary. The filter using the field access still got created
and used even though it was unnecessary, resulting in an attempted
access to a non-initialized map.
Fixes#5152 and a bunch of other miscellaneous issues.
After reading the initial buffer, ORDER BY desc would read the next
block into the buffer and only read the first element. It's because the
code that was copied from the ascending cursor wasn't modified correctly
to set the position to the last element in the buffer.
The buffer size has also been lowered from 1000 to 10 to match with the
ascending cursor for performance with limit queries.
Fixes#6055.
This commit adds an `IteratorStats` that holds aggregate
iterator processing information. A method is also added to
`Iterator` to return the stats:
Stats() influxql.IteratorStats
The remote iterators will also emit their stats in the point
stream upon first connection, on a given interval, and then
finally once the last point has been sent.
The TSM writer uses a bufio.Writer that needs to be flushed before
it's closed. If the flush fails for some reason, the error is not
handled by the defer and the compactor continues on as if all is good.
This can create files with truncated indexes or zero-length TSM files.
Fixes#5889
Use of the iterator is spread out into both `IteratorCreators` and
inside of the iterators themselves. Part of the interrupt must be
handled inside of the engine so it stops trying to emit points when an
interrupt is found and another part of the interrupt has to happen when
combining the iterators so it doesn't just start reading the next shard.
These were all b1/bz1 settings that no longer have any effect:
- {Default,}MaxWALSize
- {Default,}WALFlushInterval
- {Default,}WALPartitionFlushDelay
- {Default,WAL}ReadySeriesSize
- {Default,WAL}CompactionThreshold
- {Default,WAL}MaxSeriesSize
- {Default,WAL}FlushColdInterval
- {Default,WAL}PartitionSizeThreshold
Internal system series start with an underscore prefix but
restricting this prevents users who already use an underscore
prefix in their series names.
Fixes#5870
A deadlock occurs under write load if a backup is run in between the
time when a snapshot compactions has snapshotted the cache and successfully
written it to disk. The issus is that the second snapshot call will block
on the commit lock while it is holding the engine write lock. This causes
all writes to block as well as prevents the currently runnign snapshot
compaction from completing because it needs to acquire a read-lock.
This PR removes the commit lock and just returns an error if a snapshot is
in progress to all any locks being held to be released. The caller can determine
whether to retry or giveup.
Slices of tsm1.Value interfaces are only ever used with all the same
types, and the previous code would switch on the type returned from a
call to Value(), which allocated and returned an interface{} object for
the underlying value.
This change instead type-switches on the tsm1.Value object itself,
allowing it direct access to the underlying value field, eliminating the
unecessary allocations.
This commit moves the `tsdb.Store.ExpandSources()` function onto
the `influxql.IteratorCreator` and provides support for issuing
source expansion across a cluster.
`SHOW TAG VALUES` output has been modified to print the measurement name
for every measurement and to return the output in two columns: key and
value. An example output might be:
> SHOW TAG VALUES WITH KEY IN (host, region)
name: cpu
---------
key value
host server01
region useast
name: mem
---------
key value
host server02
region useast
`measurementsByExpr` has been taught how to handle reserved keys (ones
with an underscore at the beginning) to allow reusing that function and
skipping over expressions that don't matter to the call.
Fixes#5593.
... by extracting the db/rp from the given path.
Now that the code has "standardized" on extracting db/rp this way, the
ShardLocation struct is no longer necessary and thus has been removed.
We're back on the previous style of passing the path and walPath to
NewShard.
The current go compiler at the tip of the go master (1d5001af) has a modified implementation of
testing.quick.Check that now generates nil slices as test data. (See: https://gophers.slack.com/archives/general/p14567053570110). The existing tests expect round tripping to work in this case
but it does not. So, in these cases we change the expectation to reflect actual behaviour.
This needs to be checked for reasonableness.
This commit updates tsdb.Shard to contain a ShardConfig and updates
tsdb.Store to directly reference a map of tsdb.Shard rather than the
previous tsdb.shardLocation abstraction.
Previously, the for loop at the end of the method assumed that all entries
had been deduplicated, including the entry discovered in the snapshot.
However, this wasn't actually true. With this change, we make it true.
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
Consider the write sequence: 6,1,snapshot,7,2.
The hot cache gets deduplicated, so is 2,7.
Now consider the test if 1 >= 2, this is false, so needSort is not set to true.
The problem is the implicit assumption that the snapshot is always sorted
by the time that merged() runs, but this may not be true.
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
Previously, we needed a write lock on the cache because it was the
only lock we had available to guard updates to entry.values and
entry.needSort.
However, now we have a entry-scoped lock for this purpose, we don't
need the cache write lock for this purpose. Since merged() doesn't
modify the .store or the c.snapshot.sort, there is no need for
a write lock on the cache to protect the cache.
So, we don't need to escalate here - we simply rely on the entry lock
to protect the entries we are iterating over.
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
Based on @jwilder's alternative to the 'dirty' slice that featured
in previous iterations of this fix.
Suggested-by: Jason Wilder <jason@influxdb.com>
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
Currently two compactors can execute Engine.WriteSnapshot at once.
This isn't thread safe since both threads want to make modifications to
Cache.snapshot at the same time.
This commit introduces a lock which is acquired during Snapshot() and
released during ClearSnapshot(), ensuring that at most one thread
executes within Engine.WriteSnapshot() at once.
To ensure that we always release this lock, but only release the
snapshot resources on a successful commit, we modify ClearSnapshot() to
accept a boolean which indicates whether the write was successful or not
and guarantee to call this function if Snapshot() has been called.
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
There are two tests that show two different one vulnerability.
One test shows that Cache.Deduplicate modifies entries in a snapshot's
store without a lock while cache readers are deduplicating those same
entries while correctly locked.
A second test shows that two threads trying to execute the methods
that Engine.WriteSnapshot calls will cause concurrent, unsynchronized
mutating access to the snapshot's store and entries.
The tests fail at this commit and are fixed by subsequent commits.
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
Fix for #5804.
The commit for #5789 rendered the semantics of snapshotCount statistic
useless. This commit restores semantics that have diagnostic value to
this statistic.
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
The Cache had support for taking multiple snapshots to support writing
multiple snapshots to TSM files concurrently if that happened to be
a bottleneck. In practice, this is never a bottleneck and we only
run one snappshoting goroutine continously per shard which has worked
well for all workloads.
The multiple snapshot support introduces some unhandled failure scenarios
where wal segments could be removed without writing them to TSM files. If
a snapshot compaction fails to write due to transient disk errors, subsequent
snapshots will continue, but the failed one will not be retried. When the
subsequent ones succeeded, all closed wal segments are removed causing data
loss.
This change simplifies the snapshotting capability to ensure that there is only
ever one snapshot. If one fails, the next snapshot will update the existing
snapshot and retry all of old and new data.
Fixes#5686
The cache had some incorrect logic for determine when a series needed
to be deduplicated. The logic was checking for unsorted points and
not considering duplicate points. This would manifest itself as many
points (duplicate) points being returned from the cache and after a
snapshot compaction run, the points would disappear because snapshot
compaction always deduplicates and sorts the points.
Added a test that reproduces the issue.
Fixes#5719
The intent of this change is to avoid writing caches created for
snapshot cache instances into the tsm1_cache measurement. We can do
this by avoiding use of the NewCache constructor. All other methods
are only intended to be called from on the engine cache - never
on a snapshot.
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
Since we are not locking but relying on atomic arithmetic,
use Add rather than Set. Will also result in slightly less garbage
being created.
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
The intent of this change is to ensure that all statistic fields of the
resulting tsm1_cache measurement are initialized on initialization of
the cache. That way, any consumer of those measurements doesn't
have to deal with the null case.
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
Complementing and extending the changes in #5758.
Add 2 level statistics:
* snapshotCount
* cacheAgeMs
Add 2 counter statistics
* cachedBytes
* WALCompactionTimeMs
snapshotCount can be used to measure transient write errors that are causing snapshots to accumulate
cacheAgeMs can be used to guage the level of write activity into the cache
The differences between cachedBytes stats sampled at different times can be used to calculate cache throughput rates
The ratio (cachedBytes-diskBytes)/WALCompactionTimeMs can be used calculate WAL compaction throughput.
The ratio of difference between first and last WAL compaction time over the interval
length is an estimate of percentage of cache throughput consumed.
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
Fixes#5653 and #5394.
Previously dropping retention policies did not propogate to local TSDB
shards. Instead, the retention policiess would just be removed from the
Meta Store.
This PR adds ensures that data associated with retention policies is
removed, when the retention policy is dropped.
Also, it cleans up a couple of other methods in `tsdb`, including the
requirement to provide (redundant) shardIDs when deleting databases.
The description of the cache design was out of date - reflecting an older
design based on checkpoints and evictions. This revision updates the
design to describe snapshots and also clarify that if compaction performance
falls behind the inbound write rate then writes will fail.
Updates based in part of clarifications provided by Jason Wilder. See https://goo.gl/L7AzVu
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
The select call and the query executor would both calculate the time
range, but in separate ways. The query executor needed some way to pass
in the implicit end time that is placed there by the query executor.
Fixes#5636.
There was a fix in 5b1791, but is not present in the current branch likely due to a rebase issue.
The current code panics with a query like:
select value from cpu group by host order by time desc limit 1
This fixes the panic as well as prevents #5193 from re-occurring. The issue is that agressively
closing the cursors clears out the seeks slice so re-seeking will fail.
go 1.5 was being used to develop the query engine branch, but we aren't
using 1.5 for master at the moment. This fixes issues that go vet brings
up in 1.4 that don't exist in 1.5.
Aux iterators now ask the iterator creator what series will be returned
and determine which aux fields to create based on the results.
The `tsdb.Shards` struct also creates a call iterator around the
iterators returned from each shard.
This commit removes `nil` shards returned from `tsdb.Store.Shards()`
which caused panics in some SELECTs. This can occur if the meta
store has created shards before the store or if the shards are
distributed throughout a cluster.
Fixes#5555
Fill requires an additional function for IteratorCreator to retrieve the
series that will be returned from the iterator. When fill is required
for an aggregate, the IteratorCreator will be asked what series will be
returned by the created iterator.
This reduces some of the lock contention when writing to the cache.
When a new entry is created, it avoids an allocation. It also skips
a check to see if we need to sorted if we already know it needs to sorted.
Writing the snapshot would deduplicate the snapshot points
while still holding the engine write-lock. This can be expensive
under high load and cause writes to back up and OOM the server.
Instead, grab the snapshot under the lock and dedup it after releasing
the lock.
Possible fix for #5442
This commit enforces a limit on `RawMapper` so that it will not
produce more values than are specified by the LIMIT clause.
Previously the mapper would read up to the chunk size and the
values would be limited afterward.
We were closing the cursor when we read the last block which caused
the internal state to be cleared. In a group by query, we seeked multiple
times so depending on the group by interval and how the data was laid out
in the blocks, we woudl close the cursor and the last block would get skipped.
Fixes#5193
This may be causing slow restart times for systems with many large TSM files.
What I believe is happening at startup in these cases is that multiple goroutines
are started to load each TSM file concurrently. The kernel appears to serialize
mmap calls from the same process so all of the goroutines end up getting blocked
on the actual mmap system call. MAP_POPULATE instruct the kernel to pre-fault the
page table for the files and triggers read-ahead of the pages. For larger, 2GB files,
this makes the mmap call more expensive and slower. When there are many of these files
and calls it is possible to fill all available memory with pagecache. In this case,
the OS will end up pre-faulting pages from one file and have to remove pages that it just
loaded from another files causing slowness. MAP_POPULATE may also be cause much more data
to be pre-faulted than necessary. To load a file, we just need to scan the index at the end
of the file. MAP_POPULATE is likely causing the whole file to be loaded when it won't actually
be accessed for a while (or at all).
Might fix issue #5311.
Some data shapes would cause files to grow larger than the max size more
quickly which resulted in them getting skipped by the full compaction planner
at times. Some datasets that could make this happen are very large keys or
very large numbers of keys (10M). When this happened, multiple max sized
files would accumulate but the blocks would not be full. When the shard went
cold for writes, these files would get recompacted down to the optimal size, but
a lot of space would be wasted in the mean time.
This is contributing to some of the high memory usage on queries and possibly
some OOMs. This is slightly slower, but removing it allows some fairly large
count queries over 5M series to complete instead of crashing the process using
tsm1 engine.
Key() returned the key and the entries. We did not always need the
entries so they would be allocated and ignored. Added a KeyAt func
that just returns the key to avoid the unnecesary entries allocation.
Use sync.Pool for some temporary buffers used while encoding instead of
allocatin new ones each time. Also increased the default buffer size which
might be too small. Probably need to make this a config var.
We were buffering up the data to write into byte slices to reduce
IO calls but at larger sizes, this causes memory to spike. The TSMWriter
was switched to use a bufio.Writer internally so this byte slice buffering
is unnecessary and costly now.
The block count was an uint16 when incrementing the index location
which was an int32. This caused the value the uint16 value to overflow
before the index location was incremented causing the wrong location
to be read on the next iteration of the loop. This triggers the slice
out of range errors.
Added a test that recreates the panic seen in #5257 and possibly #5202 which
is older code.
Fixes#5257
This changes backup and restore to work for TSM. It breaks it for b1 and bz1, but since those are getting removed it's ok.
The backup runs against any host that is specified and can backup either the metasstore, a database, specific retention policy, or a specific shard. It can also take incremental backups with the `since` flag, which will only backup TSM files that have been created since that timestamp.
The backup is safe to run online. However, for shards that are still hot for writes, they won't be able to create new TSM files while the backup for that single shard runs. If the backup isn't too large and the write throughput isn't too high this shouldn't be a problem since the writes will just go into the WAL cache.
This has a few changes in it (unfortuantely). The main change is to run compactions
concurrently. While implementing this, a few query and performance bugs showed up that
are also fixed by this commit.