Commit Graph

2814 Commits (db/6263/compaction-debug-logging)

Author SHA1 Message Date
devanbenz 703f16a602 chore: Only do debug logs if there are compaction groups 2025-08-01 13:20:35 -05:00
devanbenz a86441180f chore: adds logging to limiters prior to apply 2025-08-01 13:06:53 -05:00
devanbenz fc54ef3272 chore: additional logging of groups 2025-08-01 13:04:05 -05:00
devanbenz 3fbd6f1f52 chore: Adding groups as keys for logs 2025-08-01 13:00:12 -05:00
devanbenz 5f07da3798 chore: Add additional logging around scheduler loop 2025-08-01 12:42:11 -05:00
devanbenz aa4698e61c chore: Adding additional trace logging 2025-08-01 12:23:19 -05:00
devanbenz 80902585a8 chore: Adjust logging and add groups 2025-08-01 12:12:45 -05:00
devanbenz 17a8c15c1d chore: Adding debug level logging for engine.go compaction 2025-08-01 12:09:43 -05:00
WeblWabl 0f57087944
feat: Adds LastModifiedOrErr to expose error for LastModified (#26623) 2025-07-24 20:54:41 -05:00
Phil Bracikowski 4e8a3b389b
feat: file store merge metrics (#26615)
* feat(1.x,file_store): port metrics for merge work

This commit ports metrics around merging tsm blocks when executing a
query. These will appear in EXPLAIN ANALYZE results. The new information
records the time spent merging blocks, the number of blocks merged,
roughly the number of values merged into the first block of each
ReadBlock call, and the number of times that single calls to ReadBlock
have more than 4 block merges. The multiblock merge is sequential and
might benefit from a tree merge algorithm. The latter stat helps
identify if the engineering effort would be fruitful.

* closes #26614

* chore: switch to a timer for duration printing of times

* chore: rename method

* fix: avoid race and use new atomic primitive
2025-07-18 12:18:37 -07:00
davidby-influx ea36c5ff47
chore: improve logging on compaction failures (#26545)
Streamline compaction logging, while
providing more information to debug
remnant temporary files.
2025-06-25 13:54:52 -07:00
WeblWabl 149fb47597
feat: Defer cleanup for log/index compactions, add debug log (#26511)
I believe that there is something happening which causes CurrentCompactionN() to always be greater than 0, thus making Partition.Wait() hang forever.

Taking a look at some profiles where this issue occurs, I'm seeing a consistent one where we're stuck on Partition.Wait():
```
-----------+-------------------------------------------------------
         1   runtime.gopark
             runtime.chanrecv
             runtime.chanrecv1
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).Wait
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).Close
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Index).close
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Index).Close
             github.com/influxdata/influxdb/tsdb.(*Shard).closeNoLock
             github.com/influxdata/influxdb/tsdb.(*Shard).Close
             github.com/influxdata/influxdb/tsdb.(*Store).DeleteShard
             github.com/influxdata/influxdb/services/retention.(*Service).DeletionCheck.func3
             github.com/influxdata/influxdb/services/retention.(*Service).DeletionCheck
             github.com/influxdata/influxdb/services/retention.(*Service).run
             github.com/influxdata/influxdb/services/retention.(*Service).Open.func1
-----------+-------------------------------------------------------
```

Deferring compaction count cleanup inside goroutines should help with any hanging current compaction counts.

Modify currentCompactionN to be a sync/atomic counter.

Adding a debug level log within Compaction.Wait() should aid in debugging.
2025-06-20 13:18:47 -05:00
WeblWabl 7437f275ff
feat: Add new logging for compaction level 5 and remove bug with opt holdoff time (#26488)
Previously:

```go
// StartOptHoldOff will create a hold off timer for OptimizedCompaction
func (e *Engine) StartOptHoldOff(holdOffDurationCheck time.Duration, optHoldoffStart time.Time, optHoldoffDuration time.Duration) {
	startOptHoldoff := func(dur time.Duration) {
		optHoldoffStart = time.Now()
		optHoldoffDuration = dur
		e.logger.Info("optimize compaction holdoff timer started", logger.Shard(e.id), zap.Duration("duration", optHoldoffDuration), zap.Time("endTime", optHoldoffStart.Add(optHoldoffDuration)))
	}
	startOptHoldoff(holdOffDurationCheck)
}
```
was not passing the data by reference, which meant we never actually modified the `optHoldoffDuration` and `optHoldoffStart` vars. 

This PR also adds additional logging to Optimized level 5 compactions to clear up a little bit of confusion around log messages.
2025-06-02 17:51:59 -05:00
Geoffrey Wossum 1fbe319080
fix: reduce excessive CPU usage during compaction planning (#26432)
Co-authored-by: devanbenz <devandbenz@gmail.com>
2025-05-27 16:55:20 -05:00
davidby-influx eab8a8a6e8
fix: add locking in ClearBadShardList (#26423) 2025-05-19 09:14:07 -07:00
Geoffrey Wossum 66f4dbeaad
fix: limit number of concurrent optimized compactions (#26319)
Limit number of concurrent optimized compactions so that level compactions do not get starved. Starved level compactions result in a sudden increase in disk usage.

Add [data] max-concurrent-optimized-compactions for configuring the maximum number of concurrent optimized compactions. The default value is 1.
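The limiting behavior described here can be sketched with a buffered-channel semaphore (a minimal illustration with hypothetical names, not the actual planner code): an optimized compaction only runs if it can acquire a slot without blocking, otherwise it yields to level compactions.

```go
package main

import "fmt"

// optimizedLimiter caps concurrent optimized compactions so level
// compactions are not starved.
type optimizedLimiter chan struct{}

func newOptimizedLimiter(max int) optimizedLimiter {
	return make(optimizedLimiter, max)
}

// TryTake reports whether a slot was acquired without blocking.
func (l optimizedLimiter) TryTake() bool {
	select {
	case l <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees a previously acquired slot.
func (l optimizedLimiter) Release() { <-l }

func main() {
	lim := newOptimizedLimiter(1) // default max-concurrent-optimized-compactions
	fmt.Println(lim.TryTake())    // true: first optimized compaction proceeds
	fmt.Println(lim.TryTake())    // false: second is skipped, leaving room for level work
	lim.Release()
}
```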

Co-authored-by: davidby-influx <dbyrne@influxdata.com>
Co-authored-by: devanbenz <devandbenz@gmail.com>
Closes: #26315
2025-05-06 15:42:39 -05:00
WeblWabl 96e44cac73
fix: PlanOptimize is running too frequently (#26211)
PlanOptimize is being checked far too frequently. This PR is the simplest change that can be made to ensure that PlanOptimize is not run too often. To alleviate the frequency I've added a lastWrite parameter to PlanOptimize and added an additional test that mocks the edge case out in the wild that led to this PR.

Previously, the test cases for PlanOptimize did not check whether certain cases would be picked up by Plan. I've adjusted a few of the existing test cases after modifying Plan and PlanOptimize to have the same lastWrite time.
2025-04-08 12:22:29 -05:00
WeblWabl d8bcbd894c
feat: Add CompactPointsPerBlock config opt (#26100)
* feat: Add CompactPointsPerBlock config opt
This PR adds an additional parameter for influxd,
CompactPointsPerBlock, and adjusts the DefaultAggressiveMaxPointsPerBlock
to 10,000. We had discovered that with the points per block set to
100,000, compacted TSM files were increasing in size. After modifying the
points per block to 10,000 we noticed that the file sizes decreased.
The value has been exposed as a parameter that administrators can adjust,
allowing some tuning if compression problems are encountered.
2025-03-05 14:59:06 -06:00
davidby-influx 2ab5aad52e
chore: add logging to Filestore.purger (#26089)
Also fixes error type checks in
TestCompactor_CompactFull_InProgress
2025-03-05 11:46:07 -08:00
davidby-influx 1efb8dad43
fix: remove temp files on error in Compactor.writeNewFiles (#26074)
Compactor.writeNewFiles should delete
temporary files created on iterations
before an error halts the compaction.

closes https://github.com/influxdata/influxdb/issues/26073
2025-02-27 08:17:48 -08:00
davidby-influx ba95c9b0f0
fix: ensure temp files removed on failed compaction (#26070)
Add more robust temporary file removal
on a failed compaction. Don't halt on
a failed removal, and don't assume a
failed compaction won't generate
temporary files.

closes https://github.com/influxdata/influxdb/issues/26068
2025-02-26 13:17:17 -08:00
davidby-influx 083b679b56
fix: ensure fields in memory match on disk
A field could be created in memory but not
saved to disk if a later field in that
point was invalid (type conflict, too big).
Ensure that if a field is created, it is
saved.
2025-02-24 13:53:40 -08:00
WeblWabl 03b6ed2bed
feat: Upgrade flux to v0.196.1 (#26041)
* feat: update flux to 0.196.1

* feat: Update proto files
This updates from protoc-gen-go v1.33.0 -> v1.34.1
and protoc from v5.26.1 -> v5.29.2
2025-02-20 13:46:06 -06:00
davidby-influx 5f576331d3
chore: refactor field creation for maintainability
Address review comments on the ported
field-creation work. Also fixes one bug that
returned the wrong error.
2025-02-18 14:00:11 -08:00
davidby-influx b617eb24a7
fix: switch MeasurementFields from atomic.Value to sync.Map (#26022)
Simplify and speed up synchronization for
MeasurementFields structures by switching
from a mutex and atomic.Value to a sync.Map
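The sync.Map approach this commit describes can be sketched as follows (hypothetical names and value type, not the actual tsdb structs): `LoadOrStore` atomically returns the existing entry or installs a new one, so concurrent readers and writers need no external mutex or atomic.Value copy-on-write dance.

```go
package main

import (
	"fmt"
	"sync"
)

// fieldsByMeasurement sketches the sync.Map approach for per-measurement
// field lookups under concurrency.
var fieldsByMeasurement sync.Map

// fields returns the field set for a measurement, creating it atomically
// if it does not exist yet.
func fields(measurement string) map[string]string {
	v, _ := fieldsByMeasurement.LoadOrStore(measurement, map[string]string{})
	return v.(map[string]string)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = fields("cpu") // safe concurrent lookup/creation, no mutex needed
		}()
	}
	wg.Wait()
	_, ok := fieldsByMeasurement.Load("cpu")
	fmt.Println(ok) // true
}
```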
2025-02-13 16:53:25 -08:00
davidby-influx 5a20a835a5
fix: lock MeasurementFields while validating (#25998)
There was a window where a race between writes with
differing types for the same field were being validated.
Lock the MeasurementFields struct during field
validation to avoid this.

closes https://github.com/influxdata/influxdb/issues/23756
2025-02-13 11:33:34 -08:00
WeblWabl 4ad5e2aba7
feat: Add error join for file writing in snapshots (#26004)
This PR adds an error join to help with handling multiple errors
from snapshot file writers.
2025-02-12 15:06:43 -06:00
WeblWabl 306a184a8d
feat: Add error joins/returns (#25996)
This PR adds error handling for a branch that previously did not surface
OS file-removal errors. This is part of EAR #5819.
2025-02-11 12:15:25 -06:00
davidby-influx 800970490a
fix: move aside TSM file on errBlockRead (#25839)
The error type check for errBlockRead was incorrect,
and bad TSM files were not being moved aside when
that error was encountered. Use errors.Join,
errors.Is, and errors.As to correctly unwrap multiple
errors.

Closes https://github.com/influxdata/influxdb/issues/25838
2025-01-22 10:46:31 -08:00
WeblWabl f04105bede
feat: Modify optimized compaction to cover edge cases (#25594)
* feat: Modify optimized compaction to cover edge cases
This PR changes the algorithm for compaction to account for the following
cases that were not previously accounted for:

- Many generations with a group size over 2 GB
- A single generation with many files and a group size under 2 GB
  (where group size is the total size of the TSM files in said shard directory)
- Shards that may have over a 2 GB group size but
  many fragmented files (under 2 GB and under the aggressive
  points-per-block count)

closes https://github.com/influxdata/influxdb/issues/25666
2025-01-14 14:51:09 -06:00
davidby-influx e974165d25
fix: do not leak file handles from Compactor.write (#25725)
There are a number of code paths in Compactor.write which
on error can lead to leaked file handles to temporary files.
This, in turn, prevents the removal of the temporary files until
InfluxDB is rebooted, releasing the file handles.

closes https://github.com/influxdata/influxdb/issues/25724
2025-01-03 14:43:41 -08:00
WeblWabl 45a8227ad6
fix(influxd): update xxhash, avoid stringtoslicebyte in cache (#578) (#25622) (#25624)
* fix(influxd): update xxhash, avoid stringtoslicebyte in cache (#578)

* fix(influxd): update xxhash, avoid stringtoslicebyte in cache

This commit does 3 things:

* it updates xxhash from v1 to v2; v2 includes an assembly ARM version of
  Sum64
* it changes the cache storer to write with a string key instead of a
  byte slice. The cache only reads the key which WriteMulti already has
as a string so we can avoid a host of allocations when converting back
and forth from immutable strings to mutable byte slices. This includes
updating the cache ring and ring partition to write with a string key
* it updates the xxhash for finding the cache ring partition to use
Sum64String which uses unsafe pointers to directly use a string as a
byte slice since it only reads the string. Note: this now uses an
assembly version because of the v2 xxhash update. Go 1.22 included a new
compiler ability to recognize calls of Method([]byte(myString)) and not
make a copy, but from looking at the call sites, I'm not sure the
compiler would recognize it, as the conversion to a byte slice was
happening several calls earlier.

That's what this change set does. If we are uncomfortable with any of
these, we can do fewer of them (for example, not upgrade xxhash; and/or
not use the specialized Sum64String, etc).

For the performance issue in maz-rr, I see converting string keys to
byte slices taking between 3-5% of cpu usage on both the primary and
secondary. So while this pr doesn't address directly the increased cpu
usage on the secondary, it makes cpu usage less on both which still
feels like a win. I believe these changes are easier to review than
switching to a byte slice pool that is likely needed in other places, as
the compiler provides nearly all of the correctness checks we need (we
are relying also on xxhash v2 being correct).

* helps #550

* chore: fix tests/lint

* chore: don't use assembly version; should inline

This 2 line change causes xxhash to use a purego Sum64 implementation,
which allows the compiler to see that Sum64 only reads the byte slice
input, which then means it can skip the string to byte slice allocation;
and since it can skip that, it should inline all the calls to
getPartitionStringKey and Sum64, avoiding 1 call to Sum64String which
isn't inlined.

* chore: update ci build file

the ci build doesn't use the make file!!!

* chore: revert "chore: update ci build file"

This reverts commit 94be66fde03e0bbe18004aab25c0e19051406de2.

* chore: revert "chore: don't use assembly version; should inline"

This reverts commit 67d8d06c02e17e91ba643a2991e30a49308a5283.

(cherry picked from commit 1d334c679ca025645ed93518b7832ae676499cd2)

* feat: need to update go sum

---------

Co-authored-by: Phil Bracikowski <13472206+philjb@users.noreply.github.com>
(cherry picked from commit 06ab224516)
2024-12-06 16:05:03 -06:00
davidby-influx 07c261a21a
feat: allow the specification of a write window for retention policies (#25517)
Add FutureWriteLimit and PastWriteLimit to retention
policies. Points which are outside of
now() + FutureWriteLimit
or
now() - PastWriteLimit
will be rejected on write with a PartialWriteError.

closes https://github.com/influxdata/influxdb/issues/25424
2024-11-15 13:30:14 -08:00
Geoffrey Wossum 8497fbf0af
chore: remove unnecessary fmt.Sprintf calls (#25536)
Remove unnecessary fmt.Sprintf calls for static code checks in main-2.x.
2024-11-12 11:06:39 -06:00
Geoffrey Wossum 65683bf166
chore: fix logging issues in Store.loadShards (#25529)
Fix reporting shards not opening correctly when they actually did.
Fix race condition with logging in loadShards.
2024-11-12 09:34:05 -06:00
Geoffrey Wossum 0bc167bbd7
chore: loadShards changes to more cleanly support 2.x feature (#25513)
* chore: move shardID parsing and shard filtering into walkShardsAndProcess

* chore: make it impossible to miss sending shardResponse or marking shard as complete

* chore: always count number of shards (preparation for 2.x related feature)

* chore: explicitly load series files and create indices serially

Explicitly load series files and create indices serially. Also
avoid passing them to work functions that don't need them.

* chore: rework loadShards for changes necessary to cancel loading process

* chore: comment improvements

* fix: fix race conditions in TestStore_StartupShardProgress and TestStore_BadShardLoading

* chore: avoid logging nil error

* chore: refactor shard loading and shard walking

Refactor loadShards and CreateShard to use a common shardLoader class that
makes thread-safety easier. Refactor walkShardsAndProcess into findShards.

* chore: improve comment

* chore: rename OpenShard to ReopenShard and implement with shardLoader

Rename Store.OpenShard to Store.ReopenShard and implement using a
shardLoader object. Changes to tests as necessary.

* chore: avoid resetting shard options and locking on Reopen

Avoid resetting shard options when reopening a shard.
Proper mutex locker in Shard.ReopenShard.

* chore: fix formatting issue

* chore: warn on mixed index types in Store.CreateShard

* chore: change from info to warn when invalid shard IDs found in path

* chore: use coarser locking in Store.ReopenShard

* chore: fix typo in comment

* chore: code simplification
2024-11-08 15:49:48 -06:00
WeblWabl 2cab9a2a1f
feat: Adds functionality to clear out bad shard list (#25398)
* feat(tsdb): Adds functionality to clear bad shards list

This PR adds a test and a new method to clear out the bad shards list.
The method returns the values of the shards that it cleared out
along with the errors. This is the first part of the feature
for adding a load-shards command to influxd-ctl.

Closes influxdata/feature-requests#591
2024-10-18 13:22:32 -05:00
WeblWabl 3c87f524ed
feat(logging): Add startup logging for shard counts (#25378)
* feat(tsdb): Adds shard opening progress checks to startup
This PR adds a check to see how many shards are remaining
vs how many have been opened. This change also displays the
percentage completed.

closes influxdata/feature-requests#476
2024-10-16 10:09:15 -05:00
Shiwen Cheng 1bc0eb4795
fix(tsm1): Fix data race of seriesKeys in deleteSeriesRange (#25268)
Add an RWMutex to allow safe concurrent 
access in deleteSeriesRange
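The RWMutex pattern can be sketched like this (hypothetical struct, simplified from the tsm1 fix): writers take the exclusive lock while readers proceed in parallel under the read lock, eliminating the data race.

```go
package main

import (
	"fmt"
	"sync"
)

// seriesSet sketches guarding a shared seriesKeys slice with an RWMutex so
// concurrent mutation and reads don't race.
type seriesSet struct {
	mu   sync.RWMutex
	keys []string
}

func (s *seriesSet) add(k string) {
	s.mu.Lock() // writers take the exclusive lock
	defer s.mu.Unlock()
	s.keys = append(s.keys, k)
}

func (s *seriesSet) count() int {
	s.mu.RLock() // readers can proceed in parallel
	defer s.mu.RUnlock()
	return len(s.keys)
}

func main() {
	var s seriesSet
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() { defer wg.Done(); s.add("series") }()
	}
	wg.Wait()
	fmt.Println(s.count()) // 10
}
```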
2024-09-27 16:36:27 -07:00
WeblWabl 8eaa24d813
feat(tsm): Allow for deletion of series outside default rp (#25312)
* feat(tsm): Allow for deletion of series outside default RP
9d116f6
This PR adds the ability for deletion of series that are outside
of the default retention policy. This updates InfluxQL to include changes
from: influxdata/influxql#71

closes: influxdata/feature-requests#175

2024-09-17 16:34:14 -05:00
WeblWabl 5c9e45f033
fix(tsi1/partition/test): fix data races in test code (#57) (#25338)
* fix(tsi1/partition/test): fix data races in test code (#57)

* fix(tsi1/partition/test): fix data races in test code

This PR is like influxdata/influxdb#24613 but solves it with a setter
method for MaxLogFileSize which allows unexporting that value and
MaxLogFileAge. There are actually two places locks were needed in test
code. The behavior of production code is unchanged.

(cherry picked from commit f0235c4daf4b97769db932f7346c1d3aecf57f8f)

* feat: modify error handling to be more idiomatic

closes https://github.com/influxdata/influxdb/issues/24042

* fix: errors.Join() filters nil errors

---------

Co-authored-by: Phil Bracikowski <13472206+philjb@users.noreply.github.com>
2024-09-16 20:26:14 -05:00
Geoffrey Wossum 23008e5286
chore: improve error messages and logging during shard opening (#25314)
* chore: improve error messages and logging during shard opening
2024-09-12 15:11:56 -05:00
davidby-influx 5d8d1120e1
fix: add additional logging on loading fields.idxl files (#25309)
Log the path of the file being loaded, and when level=debug
log progress for each set of field changes

closes https://github.com/influxdata/influxdb/issues/25289
2024-09-12 08:25:02 -07:00
WeblWabl 7dc8b1d648
fix(tsi1/partition/test): fix data race in test code (#25288) 2024-09-11 19:48:41 -05:00
Geoffrey Wossum 2cf2103cc4
feat: add hook for optimizing series reads based on authorizer (#25207) 2024-08-02 15:03:44 -05:00
Shiwen Cheng 7333da9592
fix(tsi1): fix data race between appendEntry and FlushAndSync tsi1.(*LogFile) (#25182)
Extend lock lifespan to encompass the 
flushAndSync() call to avoid a race

closes https://github.com/influxdata/influxdb/issues/25181
2024-07-23 14:40:10 -07:00
davidby-influx 176fca2138
fix: prevent an infinite loop in measurementFieldSetChangeMgr (#25155)
The measurementFieldSetChangeMgr has a possibly infinite loop
if the writeRequests channel is closed while in the inner
loop to consolidate write requests. We need to check for ok
on channel receive and exit the loop when ok is false.

closes https://github.com/influxdata/influxdb/issues/25151
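The fix described here is the standard two-value receive pattern (a minimal sketch with hypothetical names, not the actual measurementFieldSetChangeMgr): check `ok` on each receive and exit when the channel is closed, instead of spinning on zero values forever.

```go
package main

import "fmt"

// consolidate drains write requests until the channel is closed; the
// two-value receive form distinguishes "closed" from a real value.
func consolidate(reqs <-chan int) int {
	total := 0
	for {
		req, ok := <-reqs
		if !ok { // channel closed: leave the loop instead of looping forever
			return total
		}
		total += req
	}
}

func main() {
	reqs := make(chan int, 3)
	reqs <- 1
	reqs <- 2
	close(reqs)
	fmt.Println(consolidate(reqs)) // 3
}
```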
2024-07-12 16:52:28 -07:00
Geoffrey Wossum b4bd607eef
fix: prevent retention service from hanging (#25055)
* fix: prevent retention service from hanging

Fix issue that can cause the retention service to hang waiting on a
`Shard.Close` call. When this occurs, no other shards will be deleted
by the retention service. This is usually noticed as an increase in
disk usage because old shards are not cleaned up.

The fix adds two new methods to `Store`, `SetShardNewReadersBlocked`
and `InUse`. `InUse` can be used to poll if a shard has active readers,
which the retention service uses to skip over in-use shards to prevent
the service from hanging. `SetShardNewReadersBlocked` determines if
new read access may be granted to a shard. This is required to prevent
race conditions around the use of `InUse` and the deletion of shards.

If the retention service skips over a shard because it is in-use, the
shard will be checked again the next time the retention service is run.
It can be deleted on subsequent checks if it is no longer in-use. If
the shard is stuck in-use, the retention service will not be able to
delete it, which can be observed in the logs for manual
intervention. Other shards can still be deleted by the retention service
even if a shard is stuck with readers.

closes: #25054
2024-06-13 11:07:17 -05:00
davidby-influx 82cbdb5478
fix: ensure TSMBatchKeyIterator and FileStore close all TSMReaders (#24957)
Do not let errors on closing 
a TSMReader prevent other 
closes.
2024-05-06 09:59:30 -07:00
Brandon Pfeifer d4b16dcd98
chore: upgrade protocol buffers to v5.26.1 (#24949) 2024-05-01 11:00:26 -07:00