Commit Graph

14918 Commits (db/6263/compaction-debug-logging)

Author SHA1 Message Date
devanbenz 703f16a602 chore: Only do debug logs if there are compaction groups 2025-08-01 13:20:35 -05:00
devanbenz a86441180f chore: adds logging to limiters prior to apply 2025-08-01 13:06:53 -05:00
devanbenz fc54ef3272 chore: additional logging of groups 2025-08-01 13:04:05 -05:00
devanbenz 3fbd6f1f52 chore: Adding groups as keys for logs 2025-08-01 13:00:12 -05:00
devanbenz 5f07da3798 chore: Add additional logging around scheduler loop 2025-08-01 12:42:11 -05:00
devanbenz aa4698e61c chore: Adding additional trace logging 2025-08-01 12:23:19 -05:00
devanbenz 80902585a8 chore: Adjust logging and add groups 2025-08-01 12:12:45 -05:00
devanbenz 17a8c15c1d chore: Adding debug level logging for engine.go compaction 2025-08-01 12:09:43 -05:00
Jamie Strandboge 40ec5b01a1
chore(deps): bump golang.org/x/oauth2 from v0.21.0 to 0.27.0 (#26625) 2025-07-25 11:21:49 -05:00
WeblWabl 0f57087944
feat: Adds LastModifiedOrErr to expose error for LastModified (#26623) 2025-07-24 20:54:41 -05:00
Phil Bracikowski 4e8a3b389b
feat: file store merge metrics (#26615)
* feat(1.x,file_store): port metrics for merge work

This commit ports metrics around merging tsm blocks when executing a
query. These will appear in EXPLAIN ANALYZE results. The new information
records the time spent merging blocks, the number of blocks merged,
roughly the number of values merged into the first block of each
ReadBlock call, and the number of times that single calls to ReadBlock
have more than 4 block merges. The multiblock merge is sequential and
might benefit from a tree merge algorithm. The latter stat helps
identify if the engineering effort would be fruitful.

* closes #26614

* chore: switch to a timer for duration printing of times

* chore: rename method

* fix: avoid race and use new atomic primitive
2025-07-18 12:18:37 -07:00
Phil Bracikowski 1c082def6c
feat(influx_tools): report more than one error type (#26600)
Without this PR, the export-parquet tool would report on type conflict
errors and not name conflict errors in the schema if type conflicts were
encountered first. It stopped checking for validation issues once type
conflicts were found.

This PR changes it so that both type and name schema issues are
identified and reported in the command's output. Either still fails an
export to parquet, but in --dry-run mode the validation is a useful
tool to check for schemas that will be an issue in parquet of
influxdbv3.

* follows #25297
2025-07-11 15:43:28 -07:00
WeblWabl 57da7aa4e7
feat: Adds time_format param for httpd (#26596)
* feat: Adds time_format param for httpd

* This PR adds a time_format parameter which takes the value "epoch" or "rfc3339". The default is "epoch". Depending on the value, output timestamps are formatted as epoch or rfc3339.

Closes FR#615
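
A minimal sketch of how such a parameter could select the output format. The function name `formatTimestamp`, the nanosecond input, and the use of `time.RFC3339Nano` are assumptions for illustration, not the handler's actual code.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// formatTimestamp renders a Unix timestamp in nanoseconds according to the
// time_format value. The name and signature are hypothetical.
func formatTimestamp(ns int64, format string) (string, error) {
	switch format {
	case "", "epoch": // an empty value falls back to the default, "epoch"
		return strconv.FormatInt(ns, 10), nil
	case "rfc3339":
		return time.Unix(0, ns).UTC().Format(time.RFC3339Nano), nil
	default:
		// Reject anything else instead of silently guessing.
		return "", fmt.Errorf("invalid time_format %q: want \"epoch\" or \"rfc3339\"", format)
	}
}

func main() {
	for _, f := range []string{"epoch", "rfc3339", "bogus"} {
		s, err := formatTimestamp(0, f)
		fmt.Println(f, s, err)
	}
}
```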

* feat: Adding some changes
* error if incorrect param
* update naming for converter function
* combine tests

* chore: fmt'ing

* feat: A few modifications
* Rename convertToRfc3339Nano to convertToTimeFormat
* allow time formatting to be passed as parameter
* adjust error handling to use already defined timeFormats
* merge test data for test cases to reduce boilerplate
2025-07-10 16:48:59 -07:00
davidby-influx ea36c5ff47
chore: improve logging on compaction failures (#26545)
Streamline compaction logging, while
providing more information to debug
remnant temporary files.
2025-06-25 13:54:52 -07:00
WeblWabl 149fb47597
feat: Defer cleanup for log/index compactions, add debug log (#26511)
I believe that something is causing CurrentCompactionN() to always be greater than 0, which makes Partition.Wait() hang forever.

In profiles where this issue occurs, I consistently see us stuck in Partition.Wait():
```
-----------+-------------------------------------------------------
         1   runtime.gopark
             runtime.chanrecv
             runtime.chanrecv1
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).Wait
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Partition).Close
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Index).close
             github.com/influxdata/influxdb/tsdb/index/tsi1.(*Index).Close
             github.com/influxdata/influxdb/tsdb.(*Shard).closeNoLock
             github.com/influxdata/influxdb/tsdb.(*Shard).Close
             github.com/influxdata/influxdb/tsdb.(*Store).DeleteShard
             github.com/influxdata/influxdb/services/retention.(*Service).DeletionCheck.func3
             github.com/influxdata/influxdb/services/retention.(*Service).DeletionCheck
             github.com/influxdata/influxdb/services/retention.(*Service).run
             github.com/influxdata/influxdb/services/retention.(*Service).Open.func1
-----------+-------------------------------------------------------
```

Defer'ing compaction count cleanup inside goroutines should help with any hanging current compaction counts.

Modify currentCompactionN to be a sync atomic.

Adding a debug level log within Compaction.Wait() should aid in debugging.
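
The idea of deferring the count cleanup can be sketched as follows. The `partition` type, its fields, and the use of a `sync.WaitGroup` are assumptions made for a self-contained example and do not mirror the tsi1 code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// partition is a hypothetical stand-in that tracks in-flight compactions.
type partition struct {
	currentCompactionN atomic.Int32
	wg                 sync.WaitGroup
}

// compact runs work in a goroutine. The deferred calls run even when work
// panics or returns early, so the counter cannot stay above zero.
func (p *partition) compact(work func()) {
	p.currentCompactionN.Add(1)
	p.wg.Add(1)
	go func() {
		defer p.wg.Done()                  // runs last
		defer p.currentCompactionN.Add(-1) // runs before Done
		defer func() { _ = recover() }()   // keeps this sketch alive if work panics
		work()
	}()
}

// wait blocks until all compactions finish and returns the remaining count.
func (p *partition) wait() int32 {
	p.wg.Wait()
	return p.currentCompactionN.Load()
}

func main() {
	var p partition
	p.compact(func() {})
	p.compact(func() { panic("compaction failed") })
	fmt.Println("in-flight after wait:", p.wait())
}
```

Because deferred calls run in last-in, first-out order, the counter is decremented before the waiter is released.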
2025-06-20 13:18:47 -05:00
Geoffrey Wossum 4378e85744
chore: stop publishing nightly changelog (#26539)
Stop publishing nightly changelog since we do not publish nightly
build artifacts. This addresses issues with dependent projects
that check status of CI for influxdb.

Closes: #26538
2025-06-18 14:15:00 -05:00
Geoffrey Wossum 8ef2aca1ca
fix: stop noisy logging about phantom shards that do not belong to node (#26527)
Stop noisy logging about phantom shards that do not belong to the
current node by checking the shard ownership before logging about the
phantom shard. Note that only the logging was inaccurate. This did not
accidentally remove shards from the metadata that weren't really phantom
shards due to checks in `DropShardMetaRef` implementations.

closes: #26525
2025-06-17 09:40:33 -05:00
WeblWabl 7437f275ff
feat: Add new logging for compaction level 5 and remove bug with opt holdoff time (#26488)
Previously 

```go
// StartOptHoldOff will create a hold off timer for OptimizedCompaction
func (e *Engine) StartOptHoldOff(holdOffDurationCheck time.Duration, optHoldoffStart time.Time, optHoldoffDuration time.Duration) {
	startOptHoldoff := func(dur time.Duration) {
		optHoldoffStart = time.Now()
		optHoldoffDuration = dur
		e.logger.Info("optimize compaction holdoff timer started", logger.Shard(e.id), zap.Duration("duration", optHoldoffDuration), zap.Time("endTime", optHoldoffStart.Add(optHoldoffDuration)))
	}
	startOptHoldoff(holdOffDurationCheck)
}
```
was not passing the data by reference, which meant we were only modifying local copies and never the caller's `optHoldoffDuration` and `optHoldoffStart` vars.
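
The difference can be reproduced in isolation. The sketch below uses hypothetical names (`holdoff`, `startByValue`, `startByPointer`) and is not the engine's code; it only illustrates that Go passes arguments by value.

```go
package main

import (
	"fmt"
	"time"
)

// holdoff groups the two values the timer needs to update (hypothetical type).
type holdoff struct {
	start    time.Time
	duration time.Duration
}

// startByValue mirrors the buggy shape: it assigns to its own copies of the
// arguments, so the caller's variables never change.
func startByValue(start time.Time, duration, d time.Duration) {
	start = time.Now()
	duration = d
}

// startByPointer writes through the pointer, so the caller sees the update.
func startByPointer(h *holdoff, d time.Duration) {
	h.start = time.Now()
	h.duration = d
}

func main() {
	var byValue holdoff
	startByValue(byValue.start, byValue.duration, time.Hour)
	fmt.Println("by value changed caller:", !byValue.start.IsZero())

	var byPointer holdoff
	startByPointer(&byPointer, time.Hour)
	fmt.Println("by pointer changed caller:", !byPointer.start.IsZero())
}
```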

This PR also adds additional logging to Optimized level 5 compactions to clear up a little bit of confusion around log messages.
2025-06-02 17:51:59 -05:00
Sven Rebhan c07e237142
feat(influx_tools): Add export to parquet files (#25297)
Adds a command to export data into per-shard 
parquet files. To do so, the command iterates 
over the shards, creates a cumulative schema 
over the series of a measurement (i.e. a super-set 
of tags and fields) and exports the data to a 
parquet file per measurement and shard.
2025-06-02 10:59:54 -07:00
Geoffrey Wossum 1fbe319080
fix: reduce excessive CPU usage during compaction planning (#26432)
Co-authored-by: devanbenz <devandbenz@gmail.com>
2025-05-27 16:55:20 -05:00
davidby-influx eab8a8a6e8
fix: add locking in ClearBadShardList (#26423) 2025-05-19 09:14:07 -07:00
Geoffrey Wossum 66f4dbeaad
fix: limit number of concurrent optimized compactions (#26319)
Limit number of concurrent optimized compactions so that level compactions do not get starved. Starved level compactions result in a sudden increase in disk usage.

Add [data] max-concurrent-optimized-compactions for configuring maximum number of concurrent optimized compactions. Default value is 1.
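
One common way to cap concurrency is a buffered channel used as a semaphore. The sketch below assumes that approach and uses hypothetical names (`limiter`, `tryTake`, `release`); it is not InfluxDB's limiter.

```go
package main

import "fmt"

// limiter is a fixed-capacity semaphore built on a buffered channel.
type limiter chan struct{}

func newLimiter(n int) limiter { return make(limiter, n) }

// tryTake claims a slot without blocking and reports whether it succeeded.
func (l limiter) tryTake() bool {
	select {
	case l <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees a slot previously claimed by tryTake.
func (l limiter) release() { <-l }

func main() {
	optimized := newLimiter(1) // mirrors the stated default of 1

	fmt.Println(optimized.tryTake()) // the first optimized compaction is admitted
	fmt.Println(optimized.tryTake()) // the second is refused while the first runs
	optimized.release()
	fmt.Println(optimized.tryTake()) // a slot is available again
}
```

A refused optimized compaction leaves scheduler capacity for level compactions instead of queueing behind long-running work.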

Co-authored-by: davidby-influx <dbyrne@influxdata.com>
Co-authored-by: devanbenz <devandbenz@gmail.com>
Closes: #26315
2025-05-06 15:42:39 -05:00
davidby-influx 62e803e673
feat: improve dropped point logging (#26257)
Log the reason for a point being dropped,
the type of boundary violated, and the
time that was the boundary. Prints the
maximum and minimum points (by time)
that were dropped

closes https://github.com/influxdata/influxdb/issues/26252

* fix: better time formatting and additional testing

* fix: differentiate point time boundary violations

* chore: clean up switch statement

* fix: improve error messages
2025-04-18 15:18:19 -07:00
Jamie Strandboge f61a082618
chore: update to go 1.23.8 (#26293) 2025-04-18 13:53:04 -05:00
Jamie Strandboge 58475a1b36
chore: use github.com/golang-jwt/jwt/v4 and update golang.org/x/net to v0.38.0 (1.x) (#26292)
* chore: update to supported github.com/golang-jwt/jwt/v4

* chore(dep): update golang.org/x/net to v0.38.0
2025-04-18 13:52:55 -05:00
davidby-influx 53329a3ad3
feat: use zap.AtomicLevel for dynamic logging levels (#26182)
Use the zap.AtomicLevel struct for log levels
which allows the level to be changed dynamically.
Enterprise will use this feature.
2025-04-17 10:07:33 -07:00
WeblWabl 8358f1beb9
fix: Modify package publishing to fix slack msg & publish_packages (#26279) 2025-04-16 15:55:57 -05:00
WeblWabl 96e44cac73
fix: PlanOptimize is running too frequently (#26211)
PlanOptimize is being checked far too frequently. This PR is the simplest change that can be made to ensure that PlanOptimize is not run too often. To reduce the frequency, I've added a lastWrite parameter to PlanOptimize and added an additional test that mocks the edge case seen in the wild that led to this PR.

Previously, the test cases for PlanOptimize did not check whether certain cases would be picked up by Plan. I've adjusted a few of the existing test cases after modifying Plan and PlanOptimize to have the same lastWrite time.
2025-04-08 12:22:29 -05:00
Geoffrey Wossum 61f21c5adb
chore(ci): push artifacts to public bucket (#26190)
* chore(ci): push artifacts to public bucket (#25435)

Clean cherry-pick of #25435 to master-1.x.

(cherry picked from commit ca80b243ed)

* chore: port #24491 to master-1.x

Port a portion of #24491 that was not included in previous cherry-picks to master-1.x
2025-03-25 12:31:31 -05:00
WeblWabl 77d6f20894
feat: Upgrade influxql to v1.4.1 (#26181) 2025-03-21 12:24:38 -05:00
WeblWabl 6cda9c903e
fix: Remove nil dereference (#26154) 2025-03-18 08:11:22 -05:00
davidby-influx 9e00f0de98
fix: do not panic on invalid multiple subqueries (#26143)
Multiple subqueries in a FROM clause are
syntactically invalid, but they caused a panic
instead of returning an error. This corrects
that problem.

closes https://github.com/influxdata/influxdb/issues/26139
2025-03-14 13:38:57 -07:00
WeblWabl d8bcbd894c
feat: Add CompactPointsPerBlock config opt (#26100)
* feat: Add CompactPointsPerBlock config opt
This PR adds an additional parameter for influxd,
CompactPointsPerBlock. It adjusts the DefaultAggressiveMaxPointsPerBlock
to 10,000. We had discovered that with the points per block set to
100,000, compacted TSM files were increasing in size. After modifying the
points per block to 10,000, we noticed that the file sizes decreased.
The value is exposed as a parameter that administrators can adjust,
which allows some tuning if compression problems are encountered.
2025-03-05 14:59:06 -06:00
davidby-influx 2ab5aad52e
chore: add logging to Filestore.purger (#26089)
Also fixes error type checks in
TestCompactor_CompactFull_InProgress
2025-03-05 11:46:07 -08:00
davidby-influx 1efb8dad43
fix: remove temp files on error in Compactor.writeNewFiles (#26074)
Compactor.writeNewFiles should delete
temporary files created on iterations
before an error halts the compaction.

closes https://github.com/influxdata/influxdb/issues/26073
2025-02-27 08:17:48 -08:00
davidby-influx ba95c9b0f0
fix: ensure temp files removed on failed compaction (#26070)
Add more robust temporary file removal
on a failed compaction. Don't halt on
a failed removal, and don't assume a
failed compaction won't generate
temporary files.
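
A rough sketch of cleanup that keeps going after a failed removal, assuming the temporary paths are already known. `removeTmpFiles` and `demo` are hypothetical helpers, not functions from the compactor.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// removeTmpFiles tries to remove every path, continues past failures, and
// returns all removal errors joined together (nil if there were none).
func removeTmpFiles(paths []string) error {
	var errs []error
	for _, p := range paths {
		// A file that is already gone is not treated as a failure.
		if err := os.Remove(p); err != nil && !errors.Is(err, os.ErrNotExist) {
			errs = append(errs, err)
		}
	}
	return errors.Join(errs...)
}

// demo creates a few temporary files, cleans them up, and reports how many
// remain afterwards.
func demo() (int, error) {
	dir, err := os.MkdirTemp("", "compaction-sketch")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)

	var paths []string
	for i := 0; i < 3; i++ {
		p := filepath.Join(dir, fmt.Sprintf("%09d.tsm.tmp", i))
		if err := os.WriteFile(p, []byte("partial"), 0o600); err != nil {
			return 0, err
		}
		paths = append(paths, p)
	}
	// A path that was never created must not stop the cleanup.
	paths = append(paths, filepath.Join(dir, "missing.tsm.tmp"))

	if err := removeTmpFiles(paths); err != nil {
		return 0, err
	}
	left, err := filepath.Glob(filepath.Join(dir, "*.tmp"))
	return len(left), err
}

func main() {
	left, err := demo()
	fmt.Println("temporary files left:", left, "error:", err)
}
```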

closes https://github.com/influxdata/influxdb/issues/26068
2025-02-26 13:17:17 -08:00
davidby-influx 083b679b56
fix: ensure fields in memory match on disk
A field could be created in memory but not
saved to disk if a later field in that
point was invalid (type conflict, too big).
Ensure that if a field is created, it is
saved.
2025-02-24 13:53:40 -08:00
WeblWabl 03b6ed2bed
feat: Upgrade flux to v0.196.1 (#26041)
* feat: update flux to 0.196.1

* feat: Update proto files
This updates from protoc-gen-go v1.33.0 -> v1.34.1
and protoc from v5.26.1 -> v5.29.2
2025-02-20 13:46:06 -06:00
davidby-influx 5f576331d3
chore: refactor field creation for maintainability
Address review comments in the port work of the
field creation. Also fixes one bug in returning the wrong
error.
2025-02-18 14:00:11 -08:00
davidby-influx b617eb24a7
fix: switch MeasurementFields from atomic.Value to sync.Map (#26022)
Simplify and speed up synchronization for
MeasurementFields structures by switching
from a mutex and atomic.Value to a sync.Map
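
As an illustration of why `sync.Map` fits this pattern, the sketch below registers a field type with `LoadOrStore`, which atomically stores the value or returns the one already present. The `fieldSet` type and its method are hypothetical and far simpler than MeasurementFields.

```go
package main

import (
	"fmt"
	"sync"
)

// fieldSet is a hypothetical registry mapping a field name to its type name.
type fieldSet struct{ fields sync.Map }

// createIfMissing stores typ for name unless the field already exists. It
// returns the type registered after the call and whether this call created it.
func (s *fieldSet) createIfMissing(name, typ string) (string, bool) {
	actual, loaded := s.fields.LoadOrStore(name, typ)
	return actual.(string), !loaded
}

func main() {
	var s fieldSet
	var wg sync.WaitGroup
	// Two writers race to create the same field with different types;
	// exactly one of them wins and both observe the same registered type.
	for _, typ := range []string{"float", "integer"} {
		wg.Add(1)
		go func(typ string) {
			defer wg.Done()
			actual, created := s.createIfMissing("value", typ)
			fmt.Printf("wanted %s, registered %s, created=%t\n", typ, actual, created)
		}(typ)
	}
	wg.Wait()
}
```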
2025-02-13 16:53:25 -08:00
davidby-influx 5a20a835a5
fix: lock MeasurementFields while validating (#25998)
There was a window where writes with differing
types for the same field could race during validation.
Lock the MeasurementFields struct during field
validation to avoid this.

closes https://github.com/influxdata/influxdb/issues/23756
2025-02-13 11:33:34 -08:00
WeblWabl 4ad5e2aba7
feat: Add error join for file writing in snapshots (#26004)
This PR adds an error join to help with handling multiple errors
from snapshot file writers.
2025-02-12 15:06:43 -06:00
WeblWabl 306a184a8d
feat: Add error joins/returns (#25996)
This PR adds error handling for a branch that previously did not return OS file removal errors.
This is part of EAR #5819.
2025-02-11 12:15:25 -06:00
davidby-influx f54a34ae33
fix: actually call the deferred function (#25952) 2025-01-31 15:42:38 -08:00
WeblWabl edf5ff20f6
feat: updates go to 1.23.5 (#25926)
* feat: updates go to 1.23.5 and gosnowflake to 1.9.0
2025-01-28 13:31:31 -06:00
davidby-influx 800970490a
fix: move aside TSM file on errBlockRead (#25839)
The error type check for errBlockRead was incorrect,
and bad TSM files were not being moved aside when
that error was encountered. Use errors.Join,
errors.Is, and errors.As to correctly unwrap multiple
errors.
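
The following sketch shows how `errors.Is` and `errors.As` find a specific error inside a joined and wrapped error tree. The `errBlockRead` type here is a hypothetical stand-in with the same name, and the other identifiers are invented for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// errBlockRead is a hypothetical error type that carries the bad file name.
type errBlockRead struct{ file string }

func (e errBlockRead) Error() string { return "block read error on " + e.file }

var errCompactionAborted = errors.New("compaction aborted")

// compactionErr builds an error tree: a joined error whose second branch
// wraps an errBlockRead.
func compactionErr(file string) error {
	return errors.Join(
		errCompactionAborted,
		fmt.Errorf("reading %s: %w", file, errBlockRead{file: file}),
	)
}

// badFile returns the file carried by an errBlockRead anywhere in err's tree.
// A plain type assertion on err would miss it once the error is wrapped.
func badFile(err error) (string, bool) {
	var e errBlockRead
	if errors.As(err, &e) {
		return e.file, true
	}
	return "", false
}

func main() {
	err := compactionErr("example.tsm")
	fmt.Println("aborted:", errors.Is(err, errCompactionAborted))
	if f, ok := badFile(err); ok {
		fmt.Println("move aside:", f)
	}
}
```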

Closes https://github.com/influxdata/influxdb/issues/25838
2025-01-22 10:46:31 -08:00
WeblWabl f04105bede
feat: Modify optimized compaction to cover edge cases (#25594)
* feat: Modify optimized compaction to cover edge cases
This PR changes the algorithm for compaction to account for the following
cases that were not previously accounted for:

- Many generations with a groupsize over 2 GB
- Single generation with many files and a groupsize under 2 GB
- Shards that may have over a 2 GB group size but
many fragmented files (under 2 GB and under aggressive
point per block count)

Here, groupsize is the total size of the TSM files in said shard directory.

closes https://github.com/influxdata/influxdb/issues/25666
2025-01-14 14:51:09 -06:00
WeblWabl e2d76edb40
feat: expose NewEncoder from logging package (#25710)
* feat: This PR exposes NewEncoder from our internal logger package
2025-01-14 12:15:17 -06:00
mwdmwd 7999835ac3
feat: influx_inspect export from a single tsm file (#25530)
* feat: This PR adds a -tsmfile flag to export

Adds the ability to use influx_inspect export to export data from a single tsm file, for example: `influx_inspect export -out - -tsmfile 000000006-000000002.tsm.bad -database thermo -retention autogen`.
2025-01-13 13:48:35 -06:00
davidby-influx e974165d25
fix: do not leak file handles from Compactor.write (#25725)
There are a number of code paths in Compactor.write which
on error can lead to leaked file handles to temporary files.
This, in turn, prevents the removal of the temporary files until
InfluxDB is rebooted, releasing the file handles.
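
A minimal sketch of the general remedy, assuming a writer that owns the temporary file: close the handle on every path, then remove the partial file if anything failed. `writeTmp` and `demo` are hypothetical names, not the compactor's functions.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// writeTmp creates path and runs fill. On any failure it closes the handle
// before removing the partial file, so no descriptor is leaked.
func writeTmp(path string, fill func(*os.File) error) (err error) {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer func() {
		// Close on every path, success or failure.
		if cerr := f.Close(); cerr != nil {
			err = errors.Join(err, cerr)
		}
		if err != nil {
			// With the handle closed, the partial file can be removed.
			if rerr := os.Remove(path); rerr != nil {
				err = errors.Join(err, rerr)
			}
		}
	}()
	return fill(f)
}

// demo forces a write failure and reports whether the temporary file is gone.
func demo() (bool, error) {
	dir, err := os.MkdirTemp("", "write-sketch")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "000000001.tsm.tmp")
	werr := writeTmp(path, func(*os.File) error { return errors.New("disk full") })
	_, statErr := os.Stat(path)
	return errors.Is(statErr, os.ErrNotExist), werr
}

func main() {
	removed, err := demo()
	fmt.Println("temporary file removed:", removed, "write error:", err)
}
```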

closes https://github.com/influxdata/influxdb/issues/25724
2025-01-03 14:43:41 -08:00