influxdb

Commit Graph

Author	SHA1	Message	Date
WeblWabl	7437f275ff	feat: Add new logging for compaction level 5 and remove bug with opt holdoff time (#26488 ) Previously ```go // StartOptHoldOff will create a hold off timer for OptimizedCompaction func (e *Engine) StartOptHoldOff(holdOffDurationCheck time.Duration, optHoldoffStart time.Time, optHoldoffDuration time.Duration) { startOptHoldoff := func(dur time.Duration) { optHoldoffStart = time.Now() optHoldoffDuration = dur e.logger.Info("optimize compaction holdoff timer started", logger.Shard(e.id), zap.Duration("duration", optHoldoffDuration), zap.Time("endTime", optHoldoffStart.Add(optHoldoffDuration))) } startOptHoldoff(holdOffDurationCheck) } ``` was not passing the data by reference which meant we were never modifying the `optHoldoffDuration` and `optHoldoffStart` vars. This PR also adds additional logging to Optimized level 5 compactions to clear up a little bit of confusion around log messages.	2025-06-02 17:51:59 -05:00
Geoffrey Wossum	1fbe319080	fix: reduce excessive CPU usage during compaction planning (#26432 ) Co-authored-by: devanbenz <devandbenz@gmail.com>	2025-05-27 16:55:20 -05:00
Geoffrey Wossum	66f4dbeaad	fix: limit number of concurrent optimized compactions (#26319 ) Limit number of concurrent optimized compactions so that level compactions do not get starved. Starved level compactions result in a sudden increase in disk usage. Add [data] max-concurrent-optimized-compactions for configuring maximum number of concurrent optimized compactions. Default value is 1. Co-authored-by: davidby-influx <dbyrne@influxdata.com> Co-authored-by: devanbenz <devandbenz@gmail.com> Closes: #26315	2025-05-06 15:42:39 -05:00
WeblWabl	96e44cac73	fix: PlanOptimize is running too frequently (#26211 ) PlanOptimize is being checked far too frequently. This PR is the simplest change that can be made to ensure that PlanOptimize is not run too often. To reduce the frequency I've added a lastWrite parameter to PlanOptimize and added a test that mocks the edge case seen in the wild that led to this PR. Previously, the test cases for PlanOptimize did not check whether certain cases would be picked up by Plan. I've adjusted a few of the existing test cases after modifying Plan and PlanOptimize to have the same lastWrite time.	2025-04-08 12:22:29 -05:00
WeblWabl	d8bcbd894c	feat: Add CompactPointsPerBlock config opt (#26100 ) This PR adds an additional parameter for influxd, CompactPointsPerBlock, and adjusts DefaultAggressiveMaxPointsPerBlock to 10,000. We had discovered that with the points per block set to 100,000, compacted TSM files were increasing in size. After changing the points per block to 10,000, we noticed that the file sizes decreased. The value is exposed as a parameter that administrators can adjust, which allows some tuning if compression problems are encountered.	2025-03-05 14:59:06 -06:00
davidby-influx	2ab5aad52e	chore: add logging to Filestore.purger (#26089 ) Also fixes error type checks in TestCompactor_CompactFull_InProgress	2025-03-05 11:46:07 -08:00
davidby-influx	ba95c9b0f0	fix: ensure temp files removed on failed compaction (#26070 ) Add more robust temporary file removal on a failed compaction. Don't halt on a failed removal, and don't assume a failed compaction won't generate temporary files. closes https://github.com/influxdata/influxdb/issues/26068	2025-02-26 13:17:17 -08:00
davidby-influx	083b679b56	fix: ensure fields in memory match on disk A field could be created in memory but not saved to disk if a later field in that point was invalid (type conflict, too big). Ensure that if a field is created, it is saved.	2025-02-24 13:53:40 -08:00
davidby-influx	5f576331d3	chore: refactor field creation for maintainability Address review comments in the port work of the field creation. Also fixes one bug in returning the wrong error.	2025-02-18 14:00:11 -08:00
davidby-influx	5a20a835a5	fix: lock MeasurementFields while validating (#25998 ) There was a window where a race between writes with differing types for the same field were being validated. Lock the MeasurementFields struct during field validation to avoid this. closes https://github.com/influxdata/influxdb/issues/23756	2025-02-13 11:33:34 -08:00
davidby-influx	800970490a	fix: move aside TSM file on errBlockRead (#25839 ) The error type check for errBlockRead was incorrect, and bad TSM files were not being moved aside when that error was encountered. Use errors.Join, errors.Is, and errors.As to correctly unwrap multiple errors. Closes https://github.com/influxdata/influxdb/issues/25838	2025-01-22 10:46:31 -08:00
WeblWabl	f04105bede	feat: Modify optimized compaction to cover edge cases (#25594 ) This PR changes the compaction algorithm to account for the following cases that were not previously handled: - Many generations with a group size over 2 GB - A single generation with many files and a group size under 2 GB - Shards that may have a group size over 2 GB but many fragmented files (under 2 GB and under the aggressive points-per-block count) Here, group size is the total size of the TSM files in the shard directory. closes https://github.com/influxdata/influxdb/issues/25666	2025-01-14 14:51:09 -06:00
Shiwen Cheng	1bc0eb4795	fix(tsm1): Fix data race of seriesKeys in deleteSeriesRange (#25268 ) Add an RWMutex to allow safe concurrent access in deleteSeriesRange	2024-09-27 16:36:27 -07:00
Geoffrey Wossum	23008e5286	chore: improve error messages and logging during shard opening (#25314 )	2024-09-12 15:11:56 -05:00
Geoffrey Wossum	b4bd607eef	fix: prevent retention service from hanging (#25055 ) Fix issue that can cause the retention service to hang waiting on a `Shard.Close` call. When this occurs, no other shards will be deleted by the retention service. This is usually noticed as an increase in disk usage because old shards are not cleaned up. The fix adds two new methods to `Store`, `SetShardNewReadersBlocked` and `InUse`. `InUse` can be used to poll if a shard has active readers, which the retention service uses to skip over in-use shards to prevent the service from hanging. `SetShardNewReadersBlocked` determines if new read access may be granted to a shard. This is required to prevent race conditions around the use of `InUse` and the deletion of shards. If the retention service skips over a shard because it is in-use, the shard will be checked again the next time the retention service is run. It can be deleted on subsequent checks if it is no longer in-use. If a shard is stuck in-use, the retention service will not be able to delete it, which can be observed in the logs for manual intervention. Other shards can still be deleted by the retention service even if a shard is stuck with readers. closes: #25054	2024-06-13 11:07:17 -05:00
Sam Arnold	9e9f1be574	fix: remove dead iterator (#23888 )	2022-11-09 16:24:01 -05:00
davidby-influx	80c10c8c04	feat: optimize saving changes to fields.idx (#23701 ) Instead of writing out the complete fields.idx file when it changes, write out incremental changes that will be applied to the file on close and startup. closes https://github.com/influxdata/influxdb/issues/23653	2022-09-14 13:14:09 -07:00
davidby-influx	ec412f793b	fix: do not rename files on mmap failure (#23396 ) If NewTSMReader() fails because mmap fails, do not rename the file, because the error is probably caused by vm.max_map_count being too low closes https://github.com/influxdata/influxdb/issues/23172	2022-06-07 08:37:00 -07:00
Dane Strandboge	0574163566	build: upgrade to go1.18 (#23250 )	2022-03-31 16:17:57 -05:00
davidby-influx	7d3efe1e9e	fix: avoid compaction queue stats flutter. (#22195 ) When the compaction planner runs, if it cannot acquire a lock on the files it plans to compact, it returns a nil list of compaction groups. This, in turn, sets the engine statistics for compactions queues to zero, which is incorrect. Instead, use the length of pending files which would have been returned. closes https://github.com/influxdata/influxdb/issues/22138	2021-08-16 09:21:07 -07:00
davidby-influx	73bdb2860e	chore: add logging to compaction (#21707 ) Compaction logging will generate intermediate information on volume of data written and output files created, as well as improve some of the anti-entropy messages related to compaction. This will also apply to `influx_tools compact` Closes https://github.com/influxdata/influxdb/issues/21704	2021-06-16 15:28:44 -07:00
davidby-influx	f64be286be	fix: avoid rewriting fields.idx unnecessarily (#21592 ) Under heavy write load creating new fields and measurements the rewrite of the fields.idx file is a bottleneck. This enhancement combines multiple writes into a single one and shares any error return value with all of the combined invocations. MeasurementFieldSet and the new MeasurementFieldSetWriter must both now be explicitly closed. Closes #21577	2021-06-04 09:21:33 -07:00
davidby-influx	c8da9bafbf	chore(ae): add more logging (#21381 ) (#21452 ) tsdb.Engine.IsIdle and tsdb.Engine.Digest now return a reason string for why the engine & shard are not idle. Callers can then use this string for logging, if desired. The returned reason does not allocate memory, so the caller may want to add the shard ID and path for more information in the log. This is intended to be used in calls from the anti-entropy service in Enterprise. (cherry picked from commit `bf45841359`) fixes https://github.com/influxdata/influxdb/issues/21448	2021-05-11 09:46:45 -07:00
Daniel Moran	333cff1b15	fix(tsdb): exclude the stop time from the array cursor (#21139 ) This is a backport of #14262 to the 1.x storage engine. This also ports the table tests that existed with the pre-beta version of the storage engine to the one that is now used in the production version. A few of the tests are skipped. These are portions of the storage engine that have not been ported over. They should be unskipped when that functionality is ported over. Co-authored-by: Jonathan A. Sternberg <jonathan@influxdata.com>	2021-04-06 14:50:07 -04:00
Daniel Moran	3eb4fdaf33	fix(tsm1): fix data race when accessing tombstone stats (#20903 )	2021-03-09 15:20:40 -05:00
Sam Arnold	de1a0eb2a9	feat: use count_hll for 'show series cardinality' queries (#20745 ) Closes: https://github.com/influxdata/influxdb/issues/20614 Also fix nil pointer for seriesKey iterator Fix for bug in: https://github.com/influxdata/influxdb/issues/20543 Also add a test for ingress metrics	2021-02-10 16:00:16 -05:00
Sam Arnold	903b8cd0ea	feat(query): Hyper log log operators in influxql (#20603 ) * feat(query): hyper log log counting in query engine In addition to helping with normal queries, this can improve the 'SHOW CARDINALITY' meta-queries: time influx -database mydb -execute 'select count_hll(sum_hll(_seriesKey)) from big' name: big time count_hll ---- --------- 0 200767781 influx -database mydb -execute 0.06s user 0.12s system 0% cpu 8:49.99 total	2021-02-08 08:38:14 -05:00
Sam Arnold	21823db00b	feat: series creation ingress metrics (#20700 ) After turning this on and testing locally, note the 'seriesCreated' metric "localStore": {"name":"localStore","tags":null,"values":{"pointsWritten":2987,"seriesCreated":58,"valuesWritten":23754}}, "ingress": {"name":"ingress","tags":{"db":"_internal","login":"_systemuser_monitor","measurement":"cq","rp":"monitor"},"values":{"pointsWritten":2,"seriesCreated":1,"valuesWritten":4}}, "ingress:1": {"name":"ingress","tags":{"db":"_internal","login":"_systemuser_monitor","measurement":"database","rp":"monitor"},"values":{"pointsWritten":2,"seriesCreated":2,"valuesWritten":4}}, "ingress:2": {"name":"ingress","tags":{"db":"_internal","login":"_systemuser_monitor","measurement":"httpd","rp":"monitor"},"values":{"pointsWritten":2,"seriesCreated":1,"valuesWritten":46}}, "ingress:3": {"name":"ingress","tags":{"db":"_internal","login":"_systemuser_monitor","measurement":"ingress","rp":"monitor"},"values":{"pointsWritten":14,"seriesCreated":14,"valuesWritten":42}}, "ingress:4": {"name":"ingress","tags":{"db":"_internal","login":"_systemuser_monitor","measurement":"localStore","rp":"monitor"},"values":{"pointsWritten":2,"seriesCreated":1,"valuesWritten":6}}, "ingress:5": {"name":"ingress","tags":{"db":"_internal","login":"_systemuser_monitor","measurement":"queryExecutor","rp":"monitor"},"values":{"pointsWritten":2,"seriesCreated":1,"valuesWritten":10}}, "ingress:6": {"name":"ingress","tags":{"db":"_internal","login":"_systemuser_monitor","measurement":"runtime","rp":"monitor"},"values":{"pointsWritten":2,"seriesCreated":1,"valuesWritten":30}}, "ingress:7": {"name":"ingress","tags":{"db":"_internal","login":"_systemuser_monitor","measurement":"shard","rp":"monitor"},"values":{"pointsWritten":2,"seriesCreated":2,"valuesWritten":22}}, "ingress:8": 
{"name":"ingress","tags":{"db":"_internal","login":"_systemuser_monitor","measurement":"subscriber","rp":"monitor"},"values":{"pointsWritten":2,"seriesCreated":1,"valuesWritten":6}}, "ingress:9": {"name":"ingress","tags":{"db":"_internal","login":"_systemuser_monitor","measurement":"tsm1_cache","rp":"monitor"},"values":{"pointsWritten":2,"seriesCreated":2,"valuesWritten":18}}, "ingress:10": {"name":"ingress","tags":{"db":"_internal","login":"_systemuser_monitor","measurement":"tsm1_engine","rp":"monitor"},"values":{"pointsWritten":2,"seriesCreated":2,"valuesWritten":58}}, "ingress:11": {"name":"ingress","tags":{"db":"_internal","login":"_systemuser_monitor","measurement":"tsm1_filestore","rp":"monitor"},"values":{"pointsWritten":2,"seriesCreated":2,"valuesWritten":4}}, "ingress:12": {"name":"ingress","tags":{"db":"_internal","login":"_systemuser_monitor","measurement":"tsm1_wal","rp":"monitor"},"values":{"pointsWritten":2,"seriesCreated":2,"valuesWritten":8}}, "ingress:13": {"name":"ingress","tags":{"db":"_internal","login":"_systemuser_monitor","measurement":"write","rp":"monitor"},"values":{"pointsWritten":2,"seriesCreated":1,"valuesWritten":18}}, "ingress:14": {"name":"ingress","tags":{"db":"telegraf","login":"_systemuser_unknown","measurement":"cpu","rp":"autogen"},"values":{"pointsWritten":1342,"seriesCreated":13,"valuesWritten":13420}}, "ingress:15": {"name":"ingress","tags":{"db":"telegraf","login":"_systemuser_unknown","measurement":"disk","rp":"autogen"},"values":{"pointsWritten":642,"seriesCreated":6,"valuesWritten":4494}}, "ingress:16": {"name":"ingress","tags":{"db":"telegraf","login":"_systemuser_unknown","measurement":"diskio","rp":"autogen"},"values":{"pointsWritten":214,"seriesCreated":2,"valuesWritten":2354}}, "ingress:17": {"name":"ingress","tags":{"db":"telegraf","login":"_systemuser_unknown","measurement":"mem","rp":"autogen"},"values":{"pointsWritten":107,"seriesCreated":1,"valuesWritten":963}}, "ingress:18": 
{"name":"ingress","tags":{"db":"telegraf","login":"_systemuser_unknown","measurement":"processes","rp":"autogen"},"values":{"pointsWritten":107,"seriesCreated":1,"valuesWritten":856}}, "ingress:19": {"name":"ingress","tags":{"db":"telegraf","login":"_systemuser_unknown","measurement":"swap","rp":"autogen"},"values":{"pointsWritten":214,"seriesCreated":1,"valuesWritten":642}}, "ingress:20": {"name":"ingress","tags":{"db":"telegraf","login":"_systemuser_unknown","measurement":"system","rp":"autogen"},"values":{"pointsWritten":321,"seriesCreated":1,"valuesWritten":749}}, Closes: https://github.com/influxdata/influxdb/issues/20613	2021-02-05 14:52:43 -04:00
Sam Arnold	eb92c997cd	feat: Ingress metrics by measurement Partial implementation of https://github.com/influxdata/influxdb/issues/20612 Implements per-measurement points written metric. Next step: Also support per-login.	2021-02-02 15:58:28 -05:00
Sam Arnold	6795ec6c01	refactor: do not use context value anti-pattern Extending the context instead of fixing the API breaks type safety. For tracking the number of points / values written, it is much clearer to pass an explicit tracker.	2021-02-01 14:34:11 -05:00
Sam Arnold	98a76a11a0	feat(tsi): optimize series iteration When using queries like `select count(_seriesKey) from bigmeasurement`, we should iterate over the tsi structures to serve the query instead of loading all the series into memory up front. Closes #20543	2021-01-25 14:27:31 -05:00
Sam Arnold	d1a1e4b667	chore: restore ImportShard This reverts commit `d14acea44d`.	2020-12-07 11:01:00 -04:00
davidby-influx	0faac1a478	chore(tsm1): fix formatting Failed to format code before commit.	2020-11-16 21:25:26 -08:00
davidby-influx	b3724581bc	fix(tsm1): "snapshot in progress" error during backup Loop with backoff in (Engine).CreateSnapshot() to retry (Engine).WriteSnapshot() up to 3 times if ErrSnapshotInProgress is returned. Then continue on no error or on SnapshotInProgress if skipCacheOk is true. https://github.com/influxdata/plutonium/issues/3227 (cherry picked from commit `dfa6aa8cea`)	2020-11-16 21:23:00 -08:00
davidby-influx	6ec446f422	fix(tsm1): "snapshot in progress" error during backup This fix adds a skipCacheOk flag to tsdb.Store.CreateShardSnapshot() and tsdb.Shard.CreateSnapshot() to pass to tsdb.Engine.CreateSnapshot(). A value of true allows the backup to proceed even if a cache snapshot cannot be taken. This flag is set to true in tsm1.Engine.Backup(), the OSS backup code path. This flag is set to false in tsm1.Engine.Export(). https://github.com/influxdata/plutonium/issues/3227	2020-11-05 11:08:08 -08:00
davidby-influx	23be20bf1b	fix(tsm1): "snapshot in progress" error during backup When an InfluxDB database is very busy writing new points, the backup process can fail because it cannot write a new snapshot. The error is: operation timed out with error: create snapshot: snapshot in progress. This happens because InfluxDB takes snapshots of the cache almost continuously due to the high number of points ingested. The fix for this was https://github.com/influxdata/influxdb/pull/16627 but it was for OSS only, and was not in the code path for backups in clusters. This fix adds a skipCacheOk flag to tsdb.Engine.CreateSnapshot(). A value of true allows the backup to proceed even if a cache snapshot cannot be taken. This flag is set to true in tsm1.Engine.Backup(), the OSS backup code path, and in tsdb.Shard.CreateSnapshot(), the cluster backup code path. This flag is set to false in tsm1.Engine.Export(). https://github.com/influxdata/plutonium/issues/3227	2020-10-30 10:37:36 -07:00
David Norton	3d92eef720	feat: allow disable compaction per shard This feature allows compaction to be disabled on a per-shard basis by creating a file named do_not_compact in a shard's directory. When disabled, a message is logged every 15 minutes with the reason for compaction being disabled (existence of the file). This makes it easy to know if compaction has been disabled for any shards by searching the log for "compaction disabled" or running "find path/to/data -type f -name do_not_compact".	2020-10-06 10:58:07 -04:00
Ayan George	6ce0e11738	feat: Collect values written stats (#19187 ) * feat(engine/tsm1): Add WritePointsWithContext() Add WritePointsWithContext() and make WritePoints() a thin wrapper for it. The purpose is to add statistics context values that we'll use to propagate the number of fields and points written to callers up the call chain. * feat(tsdb): Add WriteToShardWithContext() When applied, this patch adds WriteToShardWithContext() and wraps it with WriteToShard() to preserve the API. The purpose of this addition is to propagate a context.Context value to Shard.WritePointsWithContext(). * feat(tsdb/shard): Add WritePointsWithContext() The purpose of adding WritePointsWithContext() is to propagate context values down to engine code and propagate statistics via the context.Value up to callers. This patch also adds values written statistics to the shard. * feat(http): Gather values written stats WritePointsWithContext() was added to propagate context values down to the engine and communicate stats to the caller. * refactor: Change MetricKey to ContextKey This patch gives the type we're using for context keys a better name.	2020-08-12 11:26:12 -04:00
Ben Johnson	4a1a8c0041	Merge pull request #18689 from influxdata/batch-write-tombstones-when-deleting perf(tsi1): batch write tombstone entries when dropping/deleting	2020-06-25 08:15:12 -06:00
Ayan George	a9d02e7ab7	fix: Handle snapshot related errors (#18710 ) When applied this patch will: * log snapshot directory removal errors Prior to this patch, errors when removing temporary snapshot directories happened silently. This patch ensures that errors are logged when os.RemoveAll() fails. * refactor tsm1: Declare error value in condition Saves a line of code and limits the scope of an error value. * refactor tsm1: Add MakeSnapshotLinks() This commit adds (*FileStore).MakeSnapshotLinks(). The code in this function was originally part of CreateSnapshot(). That code was hoisted out into MakeSnapshotLinks() because there are two points of failure that require cleanup: we have to delete a temporary directory on failure. Placing the code in one function allows us to check its returned error value and perform cleanup in only one place. In short, we hoisted code out of CreateSnapshot() to simplify error handling. On error, we remove any directories we created.	2020-06-25 10:05:04 -04:00
dengzhi.ldz	331569bc11	perf(tsi1): batch write tombstone entries when dropping/deleting	2020-06-24 09:26:09 -06:00
Tristan Su	d14acea44d	chore: clean up unused functions	2020-05-08 13:45:34 +08:00
Gianluca Arbezzano	30621ca9ec	chore(tsm1): skip WriteSnapshot during backup if snapshotter is busy When an InfluxDB database is very busy writing new points, the backup process can fail because it cannot write a new snapshot. The error is: `operation timed out with error: create snapshot: snapshot in progress`. This happens because InfluxDB takes snapshots of the cache almost continuously due to the high number of points ingested. This PR skips snapshots if the `snapshotter` does not become available after three attempts when a backup is requested. The backup won't contain the data in the cache or WAL. Signed-off-by: Gianluca Arbezzano <gianarb92@gmail.com>	2020-02-04 20:09:50 +01:00
Sean Brickley	fe55d728f0	fix(tsm1): Compaction log error	2020-01-13 20:05:05 -05:00
tmgordeeva	f1d26652e9	fix(storage): skip TSM files with block read errors (#15885 ) When we find a bad TSM file during compaction, propagate the error up and move the bad file aside. The engine will disregard the file so the next compaction will not hit the same error.	2019-12-13 15:05:39 -08:00
David Norton	102fcd671b	fix(tsm1): make Digest() safe for concurrent use This change adds a lock around digest creation so that it is safe for concurrent calls. Prior to this change, calls from multiple goroutines resulted in "Digest aborted, problem renaming tmp digest" errors.	2019-11-12 18:02:41 -05:00
Max U	c6c0a5d3b1	change log level from info to error	2019-07-08 14:52:53 -04:00
Max U	9091d72ba7	initial commit for 1.8	2019-07-02 13:18:20 -04:00
Ben Johnson	aa3dfc0662	Merge pull request #11791 from influxdata/bj-revert-limit-full-compaction-1.8 Revert "Limit force-full and cold compaction size."	2019-02-11 12:50:55 -07:00
Ben Johnson	198f6fde38	Fix deleteSeriesRange() race condition.	2019-02-11 11:29:09 -07:00


496 Commits (db/shards_persisting_chaos_testing)