This change delays Tag cloning until a new series is found, and will
only clone Tags acquired from `ParsePoints...` and not those referencing
the mmap-ed files (TSM) that are created on startup.
This leak seems to have been introduced in 8aa224b22d,
present in 1.1.0 and 1.1.1.
When points were parsed from HTTP payloads, their tags and fields
referred to subslices of the request body; if any tag set introduced a
new series, then those tags then were stored in the in-memory series
index objects, preventing the HTTP body from being garbage collected. If
there were no new series in the payload, then the request body would be
garbage collected as usual.
Now, we clone the tags before we store them in the index. This is an
imperfect fix because the Point still holds references to the original
tags, and the Point's field iterator also refers to the payload buffer.
However, the current write code path does not retain references to the
Point or its fields; and this change will likely be obsoleted when TSI
is introduced.
This change likely fixes#7827, #7810, #7778, and perhaps others.
This change adds some very basic name validation with the following
plain-english description: names must be non-zero sequence of printable
characters that do not contain slashes ('/' or '\') and are not equal to
either "." or "..".
The intent is that, since we currently just use database and retention
policy names directly as path elements, these rules will hopefully leave
us with names that should be at least close to valid directory names.
Ideally, we would restrict names even further or not use them as path
elements directly, but this should be a step towards the former without
restricting names "too much"
Fixes#7822
This change first ensures that databases and retention policies exist
before attempting to remove them from the Store. It also adds some
checks in the `DeleteDatabase` and `DeleteRetentionPolicy` to ensure
that maliciously named entries won't remove anything outside of the
configured data directory.
I ran into an issue where the cache snapshotting seemed to stop
completely causing the cache to fill up and never recover. I believe
this is due to the the Timer being reused incorrectly. Instead,
use a Ticker that will fire more regularly and not require the resetting
logic (which was wrong).
The memory stats as well as the size of the cache were not accurate.
There was also a problem where the cache size would be increased
optimisitically, but if the cache size limit was hit, it would not
be decreased. This would cause the cache size to grow without
bounds with every failed write.
The CacheKeyIterator (used for snapshot compactions), iterated over
each key and serially encoded the values for that key as the TSM
file is written. With many series, this can be slow and will only
use 1 CPU core even if more are available.
This changes it so that the key space is split amongst a number of
goroutines that start encoding all keys in parallel to improve
throughput.
This simplifies the cache.Snapshot func to swap the hot cache to
the snapshot cache instead of copy and appending entries. This
reduces the amount of time the cache is write locked which should
reduce cache contention for the read only code paths.
Also, fix the `Iterators.Merge(IteratorOptions)` function so it consults
the `Ordered` attribute to determine which iterator it should use to
merge the input iterators.
The backup command can fail if a snapshot is running which silently
closes the connection. This causes the backup shard command to continue
on as if nothing failed.
This adds query syntax support for subqueries and adds support to the
query engine to execute queries on subqueries.
Subqueries act as a source for another query. It is the equivalent of
writing the results of a query to a temporary database, executing
a query on that temporary database, and then deleting the database
(except this is all performed in-memory).
The syntax is like this:
SELECT sum(derivative) FROM (SELECT derivative(mean(value)) FROM cpu GROUP BY *)
This will execute derivative and then sum the result of those derivatives.
Another example:
SELECT max(min) FROM (SELECT min(value) FROM cpu GROUP BY host)
This would let you find the maximum minimum value of each host.
There is complete freedom to mix subqueries with auxiliary fields. The only
caveat is that the following two queries:
SELECT mean(value) FROM cpu
SELECT mean(value) FROM (SELECT value FROM cpu)
Have different performance characteristics. The first will calculate
`mean(value)` at the shard level and will be faster, especially when it comes to
clustered setups. The second will process the mean at the top level and will not
include that optimization.
The previous implementation was susceptible to a race condition (of
correctness) since c.decreaseSize is called without a lock in
(*Cache).WriteMulti.
There were already tests which asserted the correctness of the result of
decreaseSize, so no tests were added or modified.
It looks like the real import path to the project is go.uber.org/zap
instead of github.com/uber-go/zap since the example in the project
references that path.
Currently, whenever a snapshot occurs the Cache is reset and so many
allocations are repeated, as the same type of data is re-added to
the Cache.
This commit allows the stores to keep track of the number of values
within an entry, and use that size as a hint when the same entry needs
to be recreated after a snapshot.
To avoid hints persisting over a long period of time they are deleting
after every snapshot, and rebuilt using the most recent entries only.
The logging library has been switched to use uber-go/zap. While the
logging has been changed to use structured logging, this commit does not
change any of the logging statements to take advantage of the new
structured log or new log levels. Those changes will come in future
commits.
Deduplicate is called from various places in the engine and can cause
a lot of garbage to get created. It first creates a map and then
adds each value to the map in order (1st alloc). It then creates a
new slice (2nd alloc) and appends everything from the map to the slice.
Finally, it sorted the new slice (3rd alloc).
This switches the algorithm to use stable sorting and resuing the existing
slice to avoid allocations.
NO-OP on platforms with unix path separator.
On Windows paths get converted to slashes before adding to archive and back to backslashes during restore.
This returns the LastModified time of the shard. The LastModified
time is the wall time when a change to the shards state occurred.
It uses the WAL or FileStore to determine the max mod time.
A new sorted slice was called by the monitor func every 10s. The
tag keys don't need to be sorted so this avoid the allocation of the
slice and one during sorting.
This allocates quite a bit and it's called multiple times per
second per shard. The generations don't change until a compaction
has occurred so most of the time is re-calculating the same thing
and creating garbage.
When a limit is exceeded, we return errors and sometimes log (if appropriate)
that a limit was exceeded. The messages don't always provide an indication
as to where or how they are configured.
Instead, return the config option (easily searchable for) as well as the limit
currently set and the value that exceeded it when possible.
If concurrent writes to the same shard occur, it's possible for different types to
be added to the cache for the same series. The way the measurementFields map on the
shard is updated is racy in this scenario which would normally prevent this from occurring.
When this occurs, the snapshot compaction panics because it can't encode different types
in the same series.
To prevent this, we have the cache return an error a different type is added to existing
values in the cache.
Fixes#7498
The file store stats slice is re-used which causes the race below:
WARNING: DATA RACE
Write at 0x00c42007e140 by goroutine 43:
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*FileStore).Stats()
/Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/file_store.go:511 +0x22e
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*DefaultPlanner).findGenerations()
/Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:461 +0x6f
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*DefaultPlanner).PlanLevel()
Previous read at 0x00c42007e140 by goroutine 40:
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*DefaultPlanner).findGenerations()
/Users/jason/go/src/github.com/influxdata/influxdb/tsdb/engine/tsm1/compact.go:463 +0x13d
github.com/influxdata/influxdb/tsdb/engine/tsm1.(*DefaultPlanner).PlanOptimize()
Reduce the cache lock contention by widening the cache lock scope in WriteMulti, while this sounds counter intuitive it was:
* 1 x Read Lock to read the size
* 1 x Read Lock per values
* 1 x Write Lock per values on race
* 1 x Write Lock to update the size
We now have:
* 1 x Write Lock
This also reduces contention on the entries Values lock too as we have the global cache lock.
Move the calculation of the added size before taking the lock as it takes time and doesn't need the lock.
This also fixes a race in WriteMulti due to the lock not being held across the entire operation, which could cause the cache size to have an invalid value if Snapshot has been run in the between the addition of the values and the size update.
Fix the cache benchmark which where benchmarking the creation of the cache not its operation and add a parallel test for more real world scenario, however this could still be improved.
Add a fast path newEntryValues values for the new case which avoids taking the values lock and all the other calculations.
Drop the lock before performing the sort in Cache.Keys().
The `first()` and `last()` functions response rate would increase linear
to the number of points even though it seems like it shouldn't. This
optimization greatly reduces the amount of time to return a response
when no `GROUP BY time(...)` clause is present in a query.
Previously, we would return a full tag set for every shard and the tag
set would include all series that existed in the database index
including series that didn't physically exist within that shard. This
led to the tag sets returned being incredibly huge when we had high
cardinality but sparse data. Since the data was sparse, it was
unexpected that it would cause such a large strain on the system by most
people.
Now we filter out the series ids that are not assigned to the current
shard when computing a tag set for that shard. This lowers the memory
usage for high cardinality sparse data drastically and allows queries on
those to complete successfully.
This does not resolve issues for high cardinality data in every shard
that is also spread out over a long series of time. That situation isn't
nearly as common as the above situation though.
Unify logic around compaction execution to a single place.
Also report on the error stats that we track. Previously they were not
emitted in the stats output.
If a delete takes a long time to process while writes to the
shard are occuring, it was possible for the cache to fill up
and writes to be rejected. This occurred because we disabled
all compactions while writing tombstone file to prevent deleted
data from re-appearing after a compaction completed.
Instead, we only disable the level compactions and allow snapshot
compactions to continue. Snapshots already handle deleted data
with the cache and wal.
Fixes#7161
Instead of assigning a boolean value of true to the filter expressions
when there was no meaningful expression, this drops a boolean expression
of true from the filter expressions so we don't have to perform a map
assignment. This allows us to reduce allocations and assignments when a
`WHERE` clause only contains tag comparisons and no field comparisons.
This changes the behavior of the max-series-per-database and
max-values-per-tag limits to drop points that would exceed the limits
and allow the remaining points to be written. Previously, the whole
batch would fail and return and 500 error to the client.
This now will write the allow points and return a `partial write`
error indicating some of the points were dropped, how many were
dropped and one of the problem measureent and tags.
On my machine with about 20 shards, it would take 10+ seconds to shut
down InfluxDB with SIGINT. After this change, it shuts down in nearly
instantly.
(*tsdb.Store).Close was shutting down each of its shards sequentially.
Each shard's engine would signal to its compaction goroutines to quit,
and because each compaction goroutine has a hardcoded 1-second sleep in
between checks, waiting for the goroutines would often block for up to a
second.
This change closes all of the TSDB store's shards in parallel. This
means it's possible that multiple close values could error at once, but
we're still only returning the first error, consistent with previous
behavior. That being said, the return value of (*tsdb.Store).Close is
ignored in (*cmd/influxd/run.Server).Close anyway.
The FieldIterator is used to scan over the fields of a point, providing
information, and delaying parsing/decoding the value until it is needed.
This change uses this new type to avoid the allocation of a map for the
fields which is then thrown away as soon as the points get converted
into columns within the datastore.
The decoders were held onto each iterator to avoid creating them all
the time. Some of them have use quite a bit of memory so they can
be expensive to create when querying across many series.
Intead, more them to a re-usable pool where we create the minimum that
could active be in use. This reduces garbage as well as makes the iterators
less expensive to create.
Integer blocks that were run length encoded could produce the wrong
value when read back out because the deltas were not zig zag decoded
before scaling the final value. If the deltas were negative, as would
be seen in a counter that decrements by a constant value, the results
would be random with som negative and positive values.
Fixes#7391
The TagSets function was creating a lot of intermediate maps and
slices to calculate the sorted tag sets. It first creates a map
to group tag sets with their series, it then created an equally
sized slice of the tag keys and sorted then. Finally, it created
a new slice and added the tag sets in the original map by the ordering
of the sorted keys. It was also recreating the tags map multiple time
creating extra garbage in the loop.
This simplifies the code to create one map for grouping and than adding
the distinct sets to a slice which is then sorted. It also fixes the
multple tag maps getting created.