Under high write load, the check for each series was done sequentially,
which caused a lot of CPU time to be spent acquiring and releasing the
RLock on LogFile. This switches the code to check multiple series at
once under a single RLock, similar to the change for inmem.
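
In rough terms the batched check looks like the sketch below; `index`
and `containsBatch` are hypothetical stand-ins for illustration, not
the real LogFile API.

```go
package sketch

import "sync"

// index is a hypothetical stand-in for a structure such as LogFile
// whose series set is guarded by an RWMutex.
type index struct {
	mu     sync.RWMutex
	series map[string]struct{}
}

// containsBatch answers the existence check for many series keys under
// a single RLock instead of acquiring and releasing the lock per key.
func (ix *index) containsBatch(keys []string) []bool {
	exists := make([]bool, len(keys))
	ix.mu.RLock()
	for i, k := range keys {
		_, exists[i] = ix.series[k]
	}
	ix.mu.RUnlock()
	return exists
}
```
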
The current bytes.Pool will hold onto byte slices indefinitely, so
large writes can cause the pool to retain very large buffers over time.
Testing with sync.Pool now shows similar performance, and a sync.Pool
allows these buffers to be GC'd when necessary.
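
A minimal sketch of the swap, assuming the buffers are plain byte
slices (the real pool size and call sites will differ):

```go
package sketch

import "sync"

// bufPool replaces a bytes.Pool that pinned buffers forever: slices put
// back here remain eligible for garbage collection between uses.
var bufPool = sync.Pool{
	New: func() interface{} { return make([]byte, 0, 4096) },
}

// encodeEntry borrows a buffer, uses it, and returns it to the pool.
func encodeEntry(payload []byte) int {
	buf := bufPool.Get().([]byte)[:0]
	buf = append(buf, payload...) // stand-in for the real encoding work
	n := len(buf)
	bufPool.Put(buf) // the GC may still reclaim this later if it is large
	return n
}
```
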
The inmem index would call CreateSeriesIfNotExist for each series,
which takes and releases an RLock to see if a series exists. Under
high write load, this lock shows up prominently in profiles. This
adds a filtering step that obtains a single RLock, checks all the
series, and returns the non-existent ones to continue through the slow
path.
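
The shape of the change is roughly the following; `seriesIndex`,
`filterExisting`, and `createSeries` are hypothetical names for
illustration, not the inmem index's real methods.

```go
package sketch

import "sync"

// seriesIndex is a hypothetical in-memory index guarded by an RWMutex.
type seriesIndex struct {
	mu     sync.RWMutex
	series map[string]struct{}
}

// filterExisting returns only the keys not yet in the index, using one
// RLock for the whole batch. Callers then run the slow creation path
// for just those keys.
func (ix *seriesIndex) filterExisting(keys []string) []string {
	var missing []string
	ix.mu.RLock()
	for _, k := range keys {
		if _, ok := ix.series[k]; !ok {
			missing = append(missing, k)
		}
	}
	ix.mu.RUnlock()
	return missing
}

// createSeries is the slow path: it re-checks under the write lock
// because another writer may have created a series in the meantime.
func (ix *seriesIndex) createSeries(keys []string) {
	ix.mu.Lock()
	for _, k := range keys {
		if _, ok := ix.series[k]; !ok {
			ix.series[k] = struct{}{}
		}
	}
	ix.mu.Unlock()
}
```
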
Under high write load, the sync goroutine would start up and exit
very frequently. Starting a new goroutine so often adds a small
amount of latency, which causes writes to take longer and sometimes
time out. This changes the goroutine to loop until there are no more
waiters, which reduces the churn and latency.
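
Sketched below, with hypothetical field names, is the looping form of
the goroutine; the real WAL code differs, but the control flow is the
same idea.

```go
package sketch

import "sync"

// wal is a hypothetical WAL; syncWaiters holds one channel per pending
// write that must be notified when an fsync completes.
type wal struct {
	mu          sync.Mutex
	syncing     bool
	syncWaiters chan chan error
}

// scheduleSync starts the sync goroutine only if one is not already
// running. The goroutine loops until the waiter queue is drained,
// rather than starting and exiting for every burst of writes.
func (w *wal) scheduleSync() {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.syncing {
		return
	}
	w.syncing = true

	go func() {
		for {
			select {
			case waiter := <-w.syncWaiters:
				waiter <- w.fsync()
			default:
				// Re-check under the lock so a writer enqueueing right
				// now will see syncing=false and start a new goroutine.
				w.mu.Lock()
				if len(w.syncWaiters) > 0 {
					w.mu.Unlock()
					continue
				}
				w.syncing = false
				w.mu.Unlock()
				return
			}
		}
	}()
}

func (w *wal) fsync() error { return nil } // placeholder for the real fsync
```
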
If the sync waiters channel was full, the write path would block
sending to the channel while holding the WAL write lock. The sync
goroutine would then be stuck acquiring the write lock and could not
drain the channel. This increases the buffer to 1024, which would
require a very high write load to fill, and returns an error if the
channel is full to prevent the blocking.
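
The non-blocking send might look roughly like this; the error name and
channel size are illustrative:

```go
package sketch

import "errors"

// errSyncQueueFull is a hypothetical error returned instead of blocking
// while the WAL write lock is held.
var errSyncQueueFull = errors.New("wal sync queue full, write rejected")

// enqueueWaiter registers a sync waiter without ever blocking. The
// channel is buffered (e.g. 1024) so this only fails under extreme load.
func enqueueWaiter(syncWaiters chan chan error, waiter chan error) error {
	select {
	case syncWaiters <- waiter:
		return nil
	default:
		return errSyncQueueFull
	}
}
```
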
The Point is intended to be immutable after being parsed since it
is shared by several goroutines. When dropping a field (e.g. time),
corrupted data can result if one goroutine is deleting the field
while another is marshaling the underlying byte slices.
To avoid this, the shard will just skip invalid fields and series
instead of trying to mutate them by deleting them.
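
A minimal illustration of skipping rather than mutating, using a
hypothetical helper name:

```go
package sketch

// writableFields returns the field names that are safe to persist,
// skipping reserved names (e.g. "time") instead of deleting them from
// the shared, immutable point.
func writableFields(fields map[string]interface{}) []string {
	valid := make([]string, 0, len(fields))
	for name := range fields {
		if name == "time" {
			continue // skip; other goroutines may be marshaling the point
		}
		valid = append(valid, name)
	}
	return valid
}
```
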
Removing series while trying to maintain the sorted series list
does not perform well when removing many series. This makes
dropping a DB, RP, or series very slow in some cases.
Instead, lazily create a sorted series list when first requested and
invalidate it when dropping series.
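
A sketch of the lazy list (locking omitted; the names are
illustrative, not the real index types):

```go
package sketch

import "sort"

// measurement tracks series keys; the sorted view is built lazily and
// discarded whenever series are dropped.
type measurement struct {
	seriesByKey map[string]struct{}
	sortedKeys  []string // nil means the sorted view must be rebuilt
}

// SortedSeriesKeys rebuilds the sorted slice only when it has been
// invalidated, so repeated reads stay cheap.
func (m *measurement) SortedSeriesKeys() []string {
	if m.sortedKeys == nil {
		m.sortedKeys = make([]string, 0, len(m.seriesByKey))
		for k := range m.seriesByKey {
			m.sortedKeys = append(m.sortedKeys, k)
		}
		sort.Strings(m.sortedKeys)
	}
	return m.sortedKeys
}

// DropSeries deletes keys from the map and simply invalidates the
// sorted view instead of splicing each key out of a sorted slice.
func (m *measurement) DropSeries(keys ...string) {
	for _, k := range keys {
		delete(m.seriesByKey, k)
	}
	m.sortedKeys = nil
}
```
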
This reworks drop measurement to use a sorted list of series keys
instead of creating an intermediate map. It removes allocations
and some extra garbage that is created during drop measurement.
WalkKeys serially walked each TSM file and invoked fn for each key.
Callers needed to handle duplicate calls to fn with the same key
because the same key could exist in multiple TSM files. The serial
execution was also slower.
Since the series keys are already sorted, we can iterate over all
files in parallel and skip duplicates using a sorted merge. This
fixes the duplicate invocation issue as well as speeds up walking
all keys.
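
The de-duplicating merge reduces to something like the sketch below,
simplified to in-memory slices rather than per-file iterators and
without the parallel reads:

```go
package sketch

// mergeKeys merges already-sorted key lists (one per TSM file) into a
// single sorted stream, invoking fn exactly once per distinct key even
// when the same series appears in several files.
func mergeKeys(files [][]string, fn func(key string) error) error {
	pos := make([]int, len(files))
	var last string
	seen := false
	for {
		// Find the smallest current key across all files.
		min := -1
		for i, keys := range files {
			if pos[i] >= len(keys) {
				continue
			}
			if min == -1 || keys[pos[i]] < files[min][pos[min]] {
				min = i
			}
		}
		if min == -1 {
			return nil // every file is exhausted
		}
		key := files[min][pos[min]]
		pos[min]++
		if seen && key == last {
			continue // duplicate of the previous key; skip it
		}
		last, seen = key, true
		if err := fn(key); err != nil {
			return err
		}
	}
}
```
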
This can significantly improve startup performance when many TSM files
exist that may not have been fully compacted. This also has benefits
for deletes (measurements/series) since duplicates are removed, saving
extra allocations and work. This may also allow for the optimize
compaction to be removed provided startup times are fast enough.
When using the inmem index, if one drops a database, and then creates it
again, the previous index object will be reused. This includes the
previous cardinality estimation sketches, leading to inaccurate
cardinality estimations.
The previous version was very inefficient due to the benchmarks used
to optimize it having a bug. This version always allocates a new
slice, but is O(n).
This switches compactions to use typed values (FloatValues) instead of
the generic Values type. It avoids a bunch of allocations where each
value must be converted from a specific type to an interface{}.
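
The allocation difference is essentially the boxing shown below; these
struct definitions are simplified stand-ins, not the real Value and
FloatValue types:

```go
package sketch

// value is the generic, interface-based shape: every float has to be
// boxed into an interface{}, which allocates.
type value struct {
	unixNano int64
	val      interface{}
}

// floatValue is the typed shape used by the reworked compactions; no
// boxing is needed when encoding a block of floats.
type floatValue struct {
	unixNano int64
	val      float64
}

// sumGeneric pays for an interface conversion and assertion per value.
func sumGeneric(vals []value) float64 {
	var s float64
	for _, v := range vals {
		s += v.val.(float64)
	}
	return s
}

// sumTyped works directly on the concrete float64 field.
func sumTyped(vals []floatValue) float64 {
	var s float64
	for _, v := range vals {
		s += v.val
	}
	return s
}
```
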
This code was added to address some slow startup issues. It is believed
to be the cause of some segfault panics that occur at query time when
the underlying MMAP array has been unmapped. The current structure of
code makes this change unnecessary now.
If a bad query is run, kill query and the limits would not kick in
until after the query started executing. Some bad queries that involve
high cardinality can cause the server to OOM just from planning, which
defeats the purpose of the max-select-series limit.
This change primarily fixes the max-select-series limit so that the
query is killed earlier, and it has the side effect that kill query
can now kill a query while it's being planned.
The limit waited until all the iterators had been created, which still
allowed problem queries to be planned. This change allows those
queries to be aborted much earlier in some cases.
Fsyncs to the WAL can cause higher IO with lots of small writes or
slower disks. This reworks the previous wal fsyncing to remove the
extra goroutine and remove the hard-coded 100ms delay. Writes to
the wal still maintain the invariant that they do not return to the
caller until the write is fsync'd.
This also adds a new config option, wal-fsync-delay (default 0s),
which can be increased if a delay is desired. This is somewhat useful
for systems with slower disks, but the current default works well as
is.
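
A rough sketch of the batching that a non-zero wal-fsync-delay
enables; this is not the actual WAL code, and the channel-based shape
is only for illustration:

```go
package sketch

import (
	"os"
	"time"
)

// syncLoop fsyncs the WAL segment. With a zero delay every write is
// synced immediately; with a positive wal-fsync-delay, writes arriving
// within the window share a single fsync.
func syncLoop(f *os.File, requests <-chan chan error, delay time.Duration) {
	for req := range requests {
		waiters := []chan error{req}
		if delay > 0 {
			timer := time.NewTimer(delay)
		collect:
			for {
				select {
				case r := <-requests:
					waiters = append(waiters, r)
				case <-timer.C:
					break collect
				}
			}
		}
		err := f.Sync()
		for _, w := range waiters {
			w <- err // no writer returns before its data is fsync'd
		}
	}
}
```
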
Calling DiskSize can be expensive with many shards. Since the stats
collection runs this every 10s by default, it is expensive and
wasteful to recalculate the stats when nothing has changed. This
avoids re-calculating the shard size unless something has changed.
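
One way to cache the value, sketched with hypothetical fields and an
atomic staleness flag:

```go
package sketch

import "sync/atomic"

// shard caches its on-disk size; writes, deletes, and compactions call
// markSizeStale so the next DiskSize call rescans the directory.
type shard struct {
	sizeStale int64 // accessed atomically; non-zero means "recompute"
	size      int64
}

func (s *shard) markSizeStale() { atomic.StoreInt64(&s.sizeStale, 1) }

// DiskSize returns the cached size unless something has changed since
// the last calculation.
func (s *shard) DiskSize() int64 {
	if atomic.CompareAndSwapInt64(&s.sizeStale, 1, 0) {
		atomic.StoreInt64(&s.size, s.walkDataDir())
	}
	return atomic.LoadInt64(&s.size)
}

// walkDataDir stands in for the expensive directory walk.
func (s *shard) walkDataDir() int64 { return 0 }
```
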
Still seeing the panic that switching this logic around was supposed
to fix. We now delete the bulk of data outside of the fields lock
and then again, under the write lock, to ensure that the field mapping
is accurate. We don't do the full delete under the lock because it
can block writes and queries that require a read lock.
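
In outline, the two-phase delete looks like the sketch below; the
engine fields and helper are hypothetical, not the real TSM engine API:

```go
package sketch

import "sync"

// engine is a hypothetical stand-in holding the fields lock and the
// measurement-to-fields mapping.
type engine struct {
	fieldsMu          sync.RWMutex
	measurementFields map[string]struct{}
}

// deleteMeasurement removes the bulk of the data without holding the
// fields write lock, then repeats the (now much smaller) delete under
// the lock so the field mapping cannot drift out of sync with writes
// that landed in between.
func (e *engine) deleteMeasurement(name string) error {
	// Phase 1: slow bulk delete; writes and queries keep running.
	if err := e.deleteSeriesData(name); err != nil {
		return err
	}

	// Phase 2: short re-delete plus field cleanup while writers wait.
	e.fieldsMu.Lock()
	defer e.fieldsMu.Unlock()
	if err := e.deleteSeriesData(name); err != nil {
		return err
	}
	delete(e.measurementFields, name)
	return nil
}

// deleteSeriesData stands in for the real series/data removal.
func (e *engine) deleteSeriesData(name string) error { return nil }
```
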
If blocks containing overlapping ranges of time were partially
recombined, it was possible for some points to get dropped
during compactions. This occurred because the window of time of
the points we need to merge did not account for the partial blocks
created from a prior merge.
Fixes #8084
There is a race where the field type can be deleted while a new type
is written and during a query. When this happens, an iterator for
the new type is created, but old data may still exist in the cache
or TSM files, causing a panic.
Under high query load, a race exists in the cache and the WAL. Since
writes currently hit the cache first, they are available for query before
they hit the WAL. If the WAL is writing and accessing the Value slice
at the same time that a query is run that needs to dedup the same slice,
a race occurs.
To fix this, the cache now just copies the values instead of storing the
slice passed in. Another way to fix this might be to have the writes go
to the wal before the cache. I think the latter would be better, but it
introduces some larger write path issues that we'd also need to address,
e.g. if the cache was full, writes to the WAL would need to be rejected
to avoid filling the disk.
Copying the slice in the cache is simpler for now and does not appear to
dramatically affect performance.
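
The copy itself is straightforward; a sketch with simplified types
(the real cache entry is more involved):

```go
package sketch

// value is a simplified stand-in for a decoded point value in the cache.
type value struct {
	unixNano int64
	val      float64
}

// entry stores values for a single series key.
type entry struct {
	values []value
}

// newEntry copies the incoming values instead of retaining the caller's
// slice, so the WAL writer and a query deduplicating the cache never
// operate on the same backing array.
func newEntry(vals []value) *entry {
	e := &entry{values: make([]value, len(vals))}
	copy(e.values, vals)
	return e
}
```
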
A measurement name that begins with an underscore and does not conflict
with one of the reserved measurement names will now be passed untouched
to the underlying shards rather than being intercepted as an empty
measurement.
A user still shouldn't rely on measurements that begin with underscores
to always be accessible, but this will prevent the most common use case
from causing unexpected behavior since we will very rarely, if ever, add
additional system sources.