If the engine is closed while a compaction is in progress, the close call
blocks until the goroutine exits. This could take several minutes because
control does not return to the channel selector while there is still data
to write.
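One way to address this, sketched with hypothetical names: re-check a done channel between block writes, rather than only in the outer select, so the goroutine can exit promptly when the engine closes.

    package tsm1

    type block struct{}

    type compactor struct {
        jobs chan []block
    }

    func (c *compactor) write(b block) {
        // write one block to the new TSM file
    }

    func (c *compactor) run(done <-chan struct{}) {
        for {
            select {
            case <-done:
                return // engine closing
            case blocks := <-c.jobs:
                for _, b := range blocks {
                    // Re-check done between writes so a long compaction
                    // doesn't block Close() for minutes.
                    select {
                    case <-done:
                        return
                    default:
                    }
                    c.write(b)
                }
            }
        }
    }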
* Update compaction to look at newest files of the smallest step first
* Update compaction to look at older files in larger steps if newer files don't have enough small steps to compact
* Changed the TestDefaultCompactionPlanner_CombineSequence test to reflect what's possible now. We'd only have multiple files in the same generation if all files but one were over the max allowable size.
* Clean up the logic on when full compactions are run and when planning can be skipped
Move the index locations planning to be lazily created after the first
seek, when we know what time and direction we're searching for. This
allows files and blocks to be skipped before having to scan each file's
index. This improves query times with time filters when there are many
TSM files on disk.
* Update cache loader to delete entries from cache
* Add cache.Delete()
* Update delete to look at keys in the Cache in addition to the FileStore (see the sketch after this list)
* Update cache compaction to never happen if the cache is empty
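A shape sketch of how these pieces might fit together; the type layout and method signatures here are assumptions, not the actual tsm1 API:

    package tsm1

    type Cache struct{ /* ... */ }

    func (c *Cache) Delete(keys []string) {
        // remove any hot values held for the given keys
    }

    type FileStore struct{ /* ... */ }

    func (f *FileStore) Delete(keys []string) error {
        // write tombstones so the keys are dropped from TSM indexes
        return nil
    }

    type Engine struct {
        Cache     *Cache
        FileStore *FileStore
    }

    // Delete removes keys from both the hot cache and the files on disk.
    func (e *Engine) Delete(keys []string) error {
        e.Cache.Delete(keys)
        return e.FileStore.Delete(keys)
    }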
MinTime is now in the index for each block, so storing it in the block
header is redundant. The encodings also store it in their headers, so
we are actually storing it three times.
Removing this is an incompatible change with the current tsm1 file format.
Added mmap implementation for Windows. It uses MapViewOfFile, similar to Bolt's implementation. MapViewOfFile returns a pointer and not a byte array, so Bolt changed their data structure to support it.
Instead of changing the implementation of the tsm data structure, I used a trick shown in https://groups.google.com/forum/#!topic/golang-nuts/g0nLwQI9www to use a SliceHeader to convert the pointer into a slice.
Bolt's implementation also closes the file handle in mmap itself. That was resulting in a timeout, so I implemented the logic from https://github.com/edsrzf/mmap-go/blob/master/mmap_windows.go to keep the file handle open until munmap.
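A rough sketch of that approach, assuming Go's Windows syscall package; the real implementation may differ in signatures and error handling:

    //go:build windows

    package tsm1

    import (
        "os"
        "reflect"
        "syscall"
        "unsafe"
    )

    func mmap(f *os.File, length int) ([]byte, error) {
        h, err := syscall.CreateFileMapping(syscall.Handle(f.Fd()), nil,
            syscall.PAGE_READONLY, 0, uint32(length), nil)
        if err != nil {
            return nil, err
        }

        addr, err := syscall.MapViewOfFile(h, syscall.FILE_MAP_READ, 0, 0, uintptr(length))
        if err != nil {
            syscall.CloseHandle(h)
            return nil, err
        }

        // MapViewOfFile returns a raw pointer, not a byte array. Use a
        // SliceHeader to view it as a []byte without copying (the
        // golang-nuts trick referenced above). The file handle stays
        // open until munmap, per the mmap-go approach.
        var b []byte
        hdr := (*reflect.SliceHeader)(unsafe.Pointer(&b))
        hdr.Data = addr
        hdr.Len = length
        hdr.Cap = length
        return b, nil
    }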
* Update Plan to do a full compaction if cold for writes
* Remove MaxFileSize as a config variable from Compactor. Should be a set constant
* Update Plan to keep track of if the last check was fully compacted so we can skip future planning calls
* Update compact min file count to 3 so that compactions run more frequently
* remove rolloverTSMFileSize constant that is no longer used
* remove the maxGenerationFileCount since it is no longer a limitation that's necessary with the new compaction scheme. We no longer read WAL segments as part of the compaction so memory is only used as we read in each individual key
* remove minFileCount and switch to a user configurable variable
* remove the mutex from WALSegmentWriter. There's never more than one open in the WAL at one time and it's not exported through any function so the lock on the WAL should be used. This simplified keeping track of the last write time and removed a bunch of unnecessary locks.
* update WALSegmentWriter.Write to take the compressed bytes so that encoding and compression can occur before the call to write (while we don't hold the WAL lock); see the sketch after this list
* remove a bunch of unnecessary locking in WAL.writeToLog
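A sketch of the reworked write path from the last two items, assuming snappy as the compressor; the WALEntry and segmentWriter shapes are simplifications:

    package tsm1

    import (
        "sync"

        "github.com/golang/snappy"
    )

    type WALEntry interface {
        MarshalBinary() ([]byte, error)
    }

    type segmentWriter interface {
        Write(compressed []byte) error
    }

    type WAL struct {
        mu                   sync.Mutex
        currentSegmentWriter segmentWriter
    }

    func (l *WAL) writeToLog(entry WALEntry) error {
        // Encode and compress before taking the lock so other writers
        // aren't blocked behind CPU-bound work.
        b, err := entry.MarshalBinary()
        if err != nil {
            return err
        }
        compressed := snappy.Encode(nil, b)

        l.mu.Lock()
        defer l.mu.Unlock()
        return l.currentSegmentWriter.Write(compressed)
    }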
* Add check for TSM file magic number and version
* Remove old tsm, log, and unused cursor code
* Remove references to tsm1dev everywhere except in the inspector
* Clean up config options for compaction and snapshotting
* Remove old TSM configuration options
* Update the config.sample.toml with TSM options
* Update WAL compact to force if it has been cold for writes for a configurable period of time (1h by default)
This was causing races in the code when the cache was being reloaded,
because back-to-back opening and closing of the engine during testing
left goroutines running. With this change the engine is completely shut
down when Close() is called on it.
This change starts by building the sequence of entries, which also
allows the required size of the destination buffer to be calculated. Then
the buffer is allocated up-front in one call.
Each snapshot and hot value-set is appended to the buffer. If ordering
is violated at any time, the 'needSort' flag is set. Sorting, if necessary,
is performed just before returning the data.
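Under an assumed Value type exposing its timestamp, the approach looks roughly like this:

    package tsm1

    import (
        "math"
        "sort"
    )

    type Value interface {
        UnixNano() int64
    }

    func merge(snapshots [][]Value, hot []Value) []Value {
        // First pass: total size is known up front, so allocate once.
        sz := len(hot)
        for _, s := range snapshots {
            sz += len(s)
        }
        merged := make([]Value, 0, sz)

        needSort := false
        last := int64(math.MinInt64)
        appendAll := func(vs []Value) {
            for _, v := range vs {
                if v.UnixNano() < last {
                    needSort = true // ordering violated; sort once at the end
                }
                last = v.UnixNano()
                merged = append(merged, v)
            }
        }
        for _, s := range snapshots {
            appendAll(s)
        }
        appendAll(hot)

        // Sorting, if necessary, happens just before returning.
        if needSort {
            sort.Slice(merged, func(i, j int) bool {
                return merged[i].UnixNano() < merged[j].UnixNano()
            })
        }
        return merged
    }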
FileStats was called frequently during compaction planning and was too
expensive because the stats were cleared out every time a file was replaced,
causing them all to be reloaded. Instead, we grab the stats that are already
maintained by the files themselves when needed.
The first pass at TSM cursor iteration ended up searching the file indexes
too frequently, which hurt performance. This changes that to search the
index once and then have the cursor hold onto the block locations to seek
to. This doubles the query performance from the first iteration, but there
is still a lot of room for improvement.
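A hypothetical shape for such a cursor: the index is searched once at seek time, after which next just walks the saved locations.

    package tsm1

    import "sort"

    type blockLocation struct {
        minTime int64
        offset  int64
        size    uint32
    }

    type cursor struct {
        locations []blockLocation
        i         int
    }

    // seek performs the single index search and keeps the locations.
    func (c *cursor) seek(index []blockLocation, t int64) {
        c.locations = index
        c.i = sort.Search(len(index), func(i int) bool {
            return index[i].minTime >= t
        })
    }

    // next returns the following block location without touching the index.
    func (c *cursor) next() (blockLocation, bool) {
        if c.i >= len(c.locations) {
            return blockLocation{}, false
        }
        loc := c.locations[c.i]
        c.i++
        return loc, true
    }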
* Add InfluxQLType to Values to map the TSM type to InfluxQL
* Fix bug in WAL where close wouldn't nil out the currentSegment after closing it
* Export writeSnapshot to be used in tests, add argument to run it async or not
* Update reloadCache to load temporary metadata information in the engine
* Update LoadMetadataIndex to use the temp WAL metadata information
Something broke with writing to the WAL now that compactions are running
concurrently. There was also a performance problem with Next/Prev doing
twice as many searches as necessary.
Added mmap implementation for Windows based on MapViewOfFile. Used the SliceHeader trick to change the pointer returned by MapViewOfFile into a byte slice. This requires no changes to the rest of tsm.
However, I am not sure where this mmap function is called, as go build still complains about:
tsdb\engine\tsm1\tsm1.go:1974: undefined: syscall.Mmap
tsdb\engine\tsm1\tsm1.go:1974: undefined: syscall.PROT_READ
tsdb\engine\tsm1\tsm1.go:1974: undefined: syscall.MAP_SHARED
tsdb\engine\tsm1\tsm1.go:2033: undefined: syscall.Munmap
* Update cache to have a single slice of values for a key (removed checkpoints)
* Changed compact.Plan to only worry about TSM files.
* Updated Plan to not return an error since there was no case in which it would.
* Update WAL to not keep stats since they're no longer needed.
* Update engine to flush the Cache/WAL to a new TSM file when the min threshold is hit.
* Split compact logic between TSM compacts and WAL/Cache writes.
* Remove unnecessary merge iterator, wal segment iterator, and other no longer necessary stuff.
* Remove the ascending bool from the Dedupe method. Values should always be in ascending order. It's up to the cursor to iterate through values based on the direction. Giving the cursor responsibility makes it so we don't need to sort, dedupe or reallocate anything for different query orders.
* Updated engine to use its locks to ensure writes and cache flushes don't cause a race.
* Update all tests with new signatures. Removed a bunch of tests around TSM rewrites and WAL segment iteration that are no longer necessary.
100MB is easy to hit even with a basic stress test config. Don't set
a limit by default so that an operator can size it appropriately based
on their hardware.
This commit fixes a deadlock that occurs during b1 flushes. It's
caused by taking locks in a different order. In the flush, b1
locks the engine and then bolt. However, in the query cursor, a
lock is obtained on bolt first (via `DB.Begin()`) and then the
engine is locked while reading from the engine's cache.
Getting an intermittent test failure with this, so removing it for now since compactions
are still able to keep up without it. Will need to look into this further because the
allocations are still very high and will affect compactions over longer periods of time.
MergeIterator will be used to merge multiple TSM KeyIterators and the
WAL KeyIterator using a stream based iteration approach. Each iteration
cycle returns a key and values ordered in a way to write a new TSM file
optimally.
This provides an interface and type to combine multiple WAL segments
in order and then allow the values to be read in an order suitable for
writing to a TSM file.
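The iterator shape implied by these two changes might look like the following; the names and signatures are assumptions rather than the actual tsm1 API:

    package tsm1

    type Value interface {
        UnixNano() int64
    }

    // KeyIterator yields keys and their values in an order suitable
    // for writing a new TSM file.
    type KeyIterator interface {
        // Next advances the iterator, returning false when exhausted.
        Next() bool
        // Read returns the current key and its time-ordered values.
        Read() (key string, values []Value, err error)
    }

    // A MergeIterator would wrap several KeyIterators (TSM and WAL),
    // always emitting the smallest current key across its inputs.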
Starting to integrate some of the components into an engine that is
usable for development purposes. This allows the code to evolve while
keeping the existing TSM engine intact for reference.
Currently, just the WAL is wired up so writes can be tested. Other engine
functions will panic the server if called.
This is the existing WAL + cache implementation. Moving it to a separate file
so that it can remain intact while a refactoring to an independent WAL occurs.
The WAL was also named Log in the code, so this names the file more closely to
the concept in the code.
This will facilitate loading a block into a type specific result without
first loading the block. This will also allow us to populate the database
index solely from the index.
There are a lot of allocations performed when decoding blocks. These
types can be re-used to reduce allocations in many cases. This change
allows a type specific slice to be passed in to decode funcs to be re-used
if it is large enough.
The existing decode is left for backwards compatibility but is not
very efficient right now. It may be removed.
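As an illustration of the reuse pattern, with a simplified raw block layout standing in for the real compressed encoding:

    package tsm1

    import (
        "encoding/binary"
        "math"
    )

    type FloatValue struct {
        T int64
        V float64
    }

    // DecodeFloatBlock reuses the caller's slice when it has capacity.
    // The block layout here (raw big-endian pairs) is a stand-in for
    // the real encoding.
    func DecodeFloatBlock(block []byte, a []FloatValue) []FloatValue {
        a = a[:0] // keep the backing array, drop the old contents
        for len(block) >= 16 {
            t := int64(binary.BigEndian.Uint64(block[0:8]))
            v := math.Float64frombits(binary.BigEndian.Uint64(block[8:16]))
            a = append(a, FloatValue{T: t, V: v})
            block = block[16:]
        }
        return a
    }

A caller decoding many blocks can pass the same slice back in on each iteration, allocating only when a block needs more capacity than the slice already has.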
This writes a tombstone file containing a line per deleted key. This
file is read when a TSMReader is created and any keys listed in the file
are removed from the index.
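A minimal sketch of the line-per-key format described above; the naming and API are assumptions:

    package tsm1

    import (
        "bytes"
        "os"
        "strings"
    )

    func writeTombstone(path string, keys []string) error {
        var buf bytes.Buffer
        for _, k := range keys {
            buf.WriteString(k) // one deleted key per line
            buf.WriteByte('\n')
        }
        return os.WriteFile(path, buf.Bytes(), 0666)
    }

    // readTombstone returns the set of keys to drop from the index
    // when the TSMReader is created.
    func readTombstone(path string) (map[string]struct{}, error) {
        b, err := os.ReadFile(path)
        if os.IsNotExist(err) {
            return nil, nil // no deletions recorded
        } else if err != nil {
            return nil, err
        }
        deleted := make(map[string]struct{})
        for _, k := range strings.Split(string(b), "\n") {
            if k != "" {
                deleted[k] = struct{}{}
            }
        }
        return deleted, nil
    }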
This prevented the encoders from using other implementations of the Value
interface because it would always cast one of the types to our specific
implementations.
This adds some basic file readers/writers for creating the updated TSM file format. It uses a simple
in-memory index without MMAP for now, but will be extended to use an indirect indexing approach as well as MMAPed file access as described in the design doc.
This code is not integrated into the TSM engine yet
Go style -- and existing runtime stats -- do not use underscores, but
instead use camel case. This change makes the internal stats adhere to
that convention.
When a dataFile is deleted, the f file pointer is set to nil. Since deleting
a file happens asynchronously, code that had a reference when it was valid may
run when it's gone.
dataFile was not protected by a mutex, which causes a data race in live
code and tests. filesAndLock used reflect.DeepEqual on a copy of dataFile
slices. reflect.DeepEqual appears to access unexported dataFile fields
which can't be protected. This was changed to use an equals func that will
require a mutex to be acquired.
The other issue was that many of the dataFile funcs access the mmap without
acquiring a lock. When a dataFile is deleted (possibly during rewriting),
reads from the mmap could return invalid data because references to the dataFile
are still in use by other goroutines.
Fixes #4534
Mainly for debugging, since this should not happen going forward. Since
there may be points with NaN already stored in the WAL, this is helpful for
troubleshooting panics.
NaN float values are not supported in the existing engine or the tsm1
engine. This changes NewPoint to return an error if a field value
contains a NaN field. It also allows us to validate fields to prevent
other unsupported types from sneaking in through other input plugins.
When a database is dropped, removing old segments returns an error
because the files are already gone. Using RemoveAll handles this
case more gracefully.
If a drop database is executed while writes are in flight, a panic
could occur because the WAL would fail to write to the DB dirs that
had been removed.
Partial fix for #4538
The Stat+Remove calls are unnecessary because Rename will replace
the destination file whether it exists or not. There is no need to remove
the destination file before calling Rename.
Several places use os.Remove and check for os.ErrNotExist. os.Remove
does not return os.ErrNotExist; it returns a *PathError, so these remove
calls will panic if the file does not exist.
Instead use os.RemoveAll that will not return an error if the file does
not exist.
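The fix in sketch form:

    package tsm1

    import "os"

    func removeSegment(path string) error {
        // os.Remove wraps a missing file in *os.PathError, so comparing
        // the error against os.ErrNotExist never matches (os.IsNotExist
        // would unwrap it). os.RemoveAll sidesteps the problem: it
        // returns nil when the path is already gone.
        return os.RemoveAll(path)
    }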
Fixes #4545
* refactor compaction
* rework compaction cleanup logic to work with multiple resulting files
* ensure the uint64 number for a series key doesn't use 0 or MaxInt64 for sentinel values
Close acquired the cacheLock and writeLock in a different order than flush. If addToCache was also
running in a goroutine (acquiring cacheLock), a deadlock could happen.
panic: error opening new segment file for wal: open /var/folders/lj/vlbynqp52pxdxxlxx64j6bk80000gn/T/tsm1-test709000715/_00002.wal: no such file or directory
goroutine 8 [running]:
github.com/influxdb/influxdb/tsdb/engine/tsm1.(*Log).writeToLog(0xc820098500, 0x1, 0xc8201584b0, 0x1c, 0x45, 0x0, 0x0)
/Users/jason/go/src/github.com/influxdb/influxdb/tsdb/engine/tsm1/wal.go:427 +0xc19
When rewriting a tsm file, a panic on the Values slice could happen
if there were no values in the slice and the conditions of the rewrite
caused DecodeAndCombine to be called with the empty slice. This could
happen if the size of the point's new values was equal to
the MaxPointsInBlock config option and there were no future blocks after
the current one being written.
When this happens, DecodeAndCombine returns a zero length remaining values
slice which is passed back into DecodeAndCombine one last time. In this case,
we now just return the original block since there is nothing new to combine.
Fixes #4444, #4365
This will help with large integer counter-type fields that increment by
small amounts over time. Instead of storing the larger raw value
in a compressed format, we store the difference from the prior value
in compressed format which allows the value to be stored using
fewer bits.
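For example, in a sketch (not the actual encoder), a counter near one million that grows by single digits produces deltas that fit in a few bits each:

    package tsm1

    // deltaEncode stores the first value raw as the reference and each
    // subsequent value as the difference from its predecessor.
    func deltaEncode(raw []int64) []int64 {
        if len(raw) == 0 {
            return nil
        }
        deltas := make([]int64, len(raw))
        deltas[0] = raw[0]
        for i := 1; i < len(raw); i++ {
            deltas[i] = raw[i] - raw[i-1]
        }
        // e.g. raw 1000000, 1000005, 1000012 becomes 1000000, 5, 7:
        // the small deltas compress far better than the raw values.
        return deltas
    }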
influx_inspect uncovered some scenarios where timestamps could be stored using
run-length encoding but were being stored using simple8b, which uses more space.
If DecodeSameTypeBlock is called on an empty Values slice, it would
panic with an index out of bounds error. This func can actually be removed
because DecodeBlock can determine what type of values are encoded already.
This will still panic if the block cannot be decoded due to other reasons.
Fixes#4365
If similar float values were encoded, the number of leading bits would
overflow the 5 available bits to store them (e.g. store 33 in 5 bits). When
decoding, the values after the overflowed value would spike to very large and
small values.
To prevent the overflow, we clamp the value to 31, which is the maximum
number of leading zero bits we can encode.
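Sketched with math/bits for clarity (the actual encoder may count leading zeros differently):

    package tsm1

    import "math/bits"

    // clampLeading bounds the leading-zero count so it fits the 5-bit
    // field used by the float encoding.
    func clampLeading(xor uint64) uint64 {
        l := uint64(bits.LeadingZeros64(xor))
        if l > 31 {
            // Storing e.g. 33 in 5 bits would wrap and corrupt every
            // value decoded after it; 31 is safe because it only
            // loosens the bound on how many bits are zero.
            l = 31
        }
        return l
    }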
Fixes #4357
This will make it less error-prone to add new encodings in the future
since each encoder has its own set of constants. There are some placeholder
constants for uncompressed encodings which are not in all encoders currently.
* Fix bug with locking when the interval completely covers or is totally inside another one.
* Fix bug with full compactions running when the index is actively being written to.
If reading into a fixed-size buffer using io.ReadFull, the func can
return io.ErrUnexpectedEOF if the read was short. This was slipping
through the error handling causing the shard to fail to load.
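The handling now needs to treat a short read explicitly, roughly like this sketch:

    package tsm1

    import (
        "fmt"
        "io"
    )

    func readBlockHeader(r io.Reader) ([]byte, error) {
        buf := make([]byte, 8)
        if _, err := io.ReadFull(r, buf); err == io.EOF || err == io.ErrUnexpectedEOF {
            // A short read is reported as io.ErrUnexpectedEOF, not io.EOF;
            // both mean the data is truncated and must be handled here
            // rather than failing the shard load.
            return nil, fmt.Errorf("truncated block header: %v", err)
        } else if err != nil {
            return nil, err
        }
        return buf, nil
    }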
The defer tx.Rollback() tries to free the queryLock, but the defer e.Cleanup() runs
before it and tries to take a write lock on the query lock (which blocks), preventing
tx.Rollback() from acquiring the read lock.
Previously we were using a frame-of-reference approach where we would
transform the (possibly negative) deltas into positive values from
the minimum. That required an extra pass over the values as well
as a large slice allocation so we could encode the originals in uncompressed
form if they were too large.
This switches the encoding to use zigzag encoding for the deltas which
removes the extra slice allocation as well as the extra loops.
Improves encoding performance by ~4x.
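The zigzag transform itself is standard: it interleaves negative and positive values so small magnitudes map to small unsigned values (0, -1, 1, -2, 2 become 0, 1, 2, 3, 4):

    package tsm1

    // ZigZagEncode converts an int64 to a uint64 where small absolute
    // values map to small unsigned values.
    func ZigZagEncode(x int64) uint64 {
        return uint64(x<<1) ^ uint64(x>>63)
    }

    // ZigZagDecode reverses the transform.
    func ZigZagDecode(v uint64) int64 {
        return int64(v>>1) ^ -int64(v&1)
    }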
This is using zig zag encoding to convert int64 to uint64s and then using simple8b
to compress them, falling back to uncompressed if the value exceeds 1 << 60. A
patched encoding scheme would likely be better in general but this provides decent
compression for integers that are not at the ends of the int64 range.
Time compression uses an adaptive approach using delta-encoding,
frame-of-reference, run length encoding as well as compressed integer
encoding.
Float compression uses an implementation of the Gorilla paper's XOR-delta
encoding with leading and trailing zero suppression.
Don't declare distinct stat map for partitions. It's more useful to see
the stats collated together per-WAL. This may need further change in the
future.
If the memory gets 5x above the partition size threshold, the WAL will start returning write failures to the clients. This will allow them to back off their write volume.
Also updated the stress script to track failed requests and output messages on failure and when it returns to success.
* Only fire a goroutine to flush and compact if one isn't already running
* Have a sleep backoff time that scales up as the percentage of memory used goes up (see the sketch after this list)
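A sketch of such a backoff; the thresholds and durations here are illustrative, not the shipped values:

    package tsm1

    import "time"

    // backoff sleeps longer as memory use approaches the flush threshold,
    // slowing writers down instead of failing them outright.
    func backoff(memUsed, memLimit int64) {
        pct := float64(memUsed) / float64(memLimit)
        if pct < 0.8 {
            return // plenty of headroom; don't slow writers down
        }
        // Scale the sleep with how close to (or past) the limit we are.
        time.Sleep(time.Duration(pct * float64(100*time.Millisecond)))
    }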
Start of a lower-level file inspection tool. This currently dumps
summary statistics for the shards, index and WAL that can be used to
understand the shape of the data in the local shards. This util
operates on the shards itself and not through the server and is intended
more for debugging/troubleshooting.
A write lock was being taken to read the memory size to determine if writes
should be paused. What happens is that writers get blocked indefinitely when
trying to acquire a write lock which makes writes pause (or stop) for long periods
of time.
The log was deferring the release of the read lock on the WAL. This had
the effect that a read lock was held until after the partition finished writing
(which maintains its own locks). The read lock is only needed around the call
to pointsToPartitions so it can get a consistent copy of the points to write. After
that call returns, the lock is not needed, so release it immediately.
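In sketch form, with simplified types (pointsToPartitions is the function named above; everything else is assumed):

    package tsm1

    import "sync"

    type Point struct{}

    type Partition struct{}

    func (p *Partition) Write(points []Point) error {
        // partitions maintain their own locks
        return nil
    }

    type Log struct {
        mu sync.RWMutex
    }

    func (l *Log) pointsToPartitions(points []Point) map[*Partition][]Point {
        // split the incoming points by partition
        return nil
    }

    func (l *Log) write(points []Point) error {
        // Hold the read lock only long enough for a consistent split of
        // the points, not for the duration of the partition writes.
        l.mu.RLock()
        parts := l.pointsToPartitions(points)
        l.mu.RUnlock()

        for p, pts := range parts {
            if err := p.Write(pts); err != nil {
                return err
            }
        }
        return nil
    }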
addToCache is called in a goroutine and can panic if the server is closed while opening. If
part of the open func errors, it returns an error and immediately calls close. close sets
p.cache to nil, which causes the goroutine trying to initialize the cache to panic as well. The
goroutine should run under a write lock to avoid this race/panic.
If LoadMetadataIndex() tries to log an error, it causes a panic because the
logger is not set until Open() is called, which is after LoadMetadataIndex() returns.
Instead, just set the logger up when the WAL is created.
This commit changes the default block size from 64KB to 4KB for
bz1. This was lowered because small blocks were being uncompressed,
merged, recompressed, and inserted for a large portion of updates.
This became slower and slower over time until it reached the 64KB
threshold. We moved to the 4KB threshold in order to lower the
impact of this recompression.
The buffer allocation in bz1 was unused and I'm fairly certain that it
was harmful to performance if used. For queries that run through a bz1
block, needing to hold on to a 64kb block is expensive. Better to churn
on the allocator and have the blocks be released when they are unused
than to have 64kb hanging around for each series regardless of size.
Thanks to @jwilder for brainstorming this issue with me.
By using preallocated buffers for marshaling WAL entries, we can
reduce the amount of memory we allocate.
On a run of `influx_stress -series 10000 -points 1000` this cuts
total allocations from 18684.15MB to 15200.73MB
* Update the store to remove the WAL directories associated with a shard or database when they are deleted.
* Fix the Store so that it creates separate WAL directories for databases and retention policies.
This commit changes the bz1 append to check for a small
ending block first. If the block is below the threshold
for block size then it is rewritten with the new data
points instead of having a new block written.
If a flush is happening and you bring up a cursor for a series, and that series didn't have any data in the cache (after the flush started), then it would return no data. What it should have done instead is return the data that is in the flush cache, which is held in a separate area of memory until it is committed to the index.
* All metadata for each shard is now stored in a single key with compressed value
* Creation of new metadata no longer requires a synchronous write to Bolt. It is passed to the WAL and written to Bolt periodically outside the write path
* Added DeleteSeries to WAL and updated bz1 to remove series there when DeleteSeries or DropMeasurement are called
This commit fixes the b1 cursor so that reads from either the cache
or bolt buffer will check against the previously read key to ensure
that two of the same keys are not returned.
Fixes #3571.
This commit fixes issues found from using a more complex `testing/quick`
implementation of the `WriteIndex()` test. The newer test inserts
multiple sets of random data that's confined to a smaller random space
so there's more chance of overlapping data.
The fixes were primarily around inserting old data or inserting the same
timestamp multiple times for a single write. The block splitting was not
working correctly before, and the sorting and deduping were not handled
correctly.