Commit Graph

2804 Commits (db/4217/shards-persisting)

Author SHA1 Message Date
Devan e854c93574 feat: working on it
Seeing an erroneous index out of bounds error sometimes when
running tests
2025-05-30 20:52:25 -05:00
devanbenz bdb71ffbfc feat: need to figure out how to update the partition 2025-05-30 15:37:54 -05:00
devanbenz ebf1dc858f feat: Pushing up changes for testing deadlock 2025-05-30 15:28:20 -05:00
Geoffrey Wossum 1fbe319080
fix: reduce excessive CPU usage during compaction planning (#26432)
Co-authored-by: devanbenz <devandbenz@gmail.com>
2025-05-27 16:55:20 -05:00
davidby-influx eab8a8a6e8
fix: add locking in ClearBadShardList (#26423) 2025-05-19 09:14:07 -07:00
Geoffrey Wossum 66f4dbeaad
fix: limit number of concurrent optimized compactions (#26319)
Limit number of concurrent optimized compactions so that level compactions do not get starved. Starved level compactions result in a sudden increase in disk usage.

Add [data] max-concurrent-optimized-compactions for configuring maximum number of concurrent optimized compactions. Default value is 1.

Co-authored-by: davidby-influx <dbyrne@influxdata.com>
Co-authored-by: devanbenz <devandbenz@gmail.com>
Closes: #26315
2025-05-06 15:42:39 -05:00
WeblWabl 96e44cac73
fix: PlanOptimize is running too frequently (#26211)
PlanOptimize is being checked far too frequently. This PR is the simplest change that can be made in order to ensure that PlanOptimize is not being ran too much. To alleviate the frequency I've added a lastWrite parameter to PlanOptimize and added an additional test that mocks the edge cause out in the wild that led to this PR.

Previously in test cases for PlanOptimize I was not checked to see if certain cases would be picked up by Plan I've adjusted a few of the existing test cases after modifying Plan and PlanOptimize to have the same lastWrite time.
2025-04-08 12:22:29 -05:00
WeblWabl d8bcbd894c
feat: Add CompactPointsPerBlock config opt (#26100)
* feat: Add CompactPointsPerBlock config opt
This PR adds an additional parameter for influxd
CompactPointsPerBlock. It adjusts the DefaultAggressiveMaxPointsPerBlock
to 10,000. We had discovered that with the points per block set to
100,000 compacted TSM files were increasing. After modifying the
points per block to 10,000 we noticed that the file sizes decreased.
The value has been set as a parameter that can be adjusted by administrators
this allows there to be some tuning if compression problems are encountered.
2025-03-05 14:59:06 -06:00
davidby-influx 2ab5aad52e
chore: add logging to Filestore.purger (#26089)
Also fixes error type checks in
TestCompactor_CompactFull_InProgress
2025-03-05 11:46:07 -08:00
davidby-influx 1efb8dad43
fix: remove temp files on error in Compactor.writeNewFiles (#26074)
Compactor.writeNewFiles should delete
temporary files created on iterations
before an error halts the compaction.

closes https://github.com/influxdata/influxdb/issues/26073
2025-02-27 08:17:48 -08:00
davidby-influx ba95c9b0f0
fix: ensure temp files removed on failed compaction (#26070)
Add more robust temporary file removal
on a failed compaction. Don't halt on
a failed removal, and don't assume a
failed compaction won't generate
temporary files.

closes https://github.com/influxdata/influxdb/issues/26068
2025-02-26 13:17:17 -08:00
davidby-influx 083b679b56
fix: ensure fields in memory match on disk
A field could be created in  memory but not
saved to disk if a later field in that
point was invalid (type conflict, too big)
Ensure that if a field is created, it is
saved.
2025-02-24 13:53:40 -08:00
WeblWabl 03b6ed2bed
feat: Upgrade flux to v0.196.1 (#26041)
* feat: update flux to 0.196.1

* feat: Update proto files
This updates from protoc-gen-go v1.33.0 -> v1.34.1
and protoc from v5.26.1 -> v5.29.2
2025-02-20 13:46:06 -06:00
davidby-influx 5f576331d3
chore: refactor field creation for maintainability
Address review comments in the port work of the
field creation. Also fixes one bug in returning the wrong
error.
2025-02-18 14:00:11 -08:00
davidby-influx b617eb24a7
fix: switch MeasurementFields from atomic.Value to sync.Map (#26022)
Simplify and speed up synchronization for
MeasurementFields structures by switching
from a mutex and atomic.Value to a sync.Map
2025-02-13 16:53:25 -08:00
davidby-influx 5a20a835a5
fix: lock MeasurementFields while validating (#25998)
There was a window where a race between writes with
differing types for the same field were being validated.
Lock the  MeasurementFields struct during field
validation to avoid this.

closes https://github.com/influxdata/influxdb/issues/23756
2025-02-13 11:33:34 -08:00
WeblWabl 4ad5e2aba7
feat: Add error join for file writing in snapshots (#26004)
This PR adds an error join to help with handling multiple errors
from snapshot file writers.
2025-02-12 15:06:43 -06:00
WeblWabl 306a184a8d
feat: Add error joins/returns (#25996)
This pr adds err handling for branch that did not specify os file removal errors
previously. This is part of EAR #5819.
2025-02-11 12:15:25 -06:00
davidby-influx 800970490a
fix: move aside TSM file on errBlockRead (#25839)
The error type check for errBlockRead was incorrect,
and bad TSM files were not being moved aside when
that error was encountered. Use errors.Join,
errors.Is, and errors.As to correctly unwrap multiple
errors.

Closes https://github.com/influxdata/influxdb/issues/25838
2025-01-22 10:46:31 -08:00
WeblWabl f04105bede
feat: Modify optimized compaction to cover edge cases (#25594)
* feat: Modify optimized compaction to cover edge cases
This PR changes the algorithm for compaction to account for the following
cases that were not previously accounted for:

- Many generations with a groupsize over 2 GB
- Single generation with many files and a groupsize under 2 GB
- Where groupsize is the total size of the TSM files in said shard directory.
- shards that may have over a 2 GB group size but
many fragmented files (under 2 GB and under aggressive
point per block count)

closes https://github.com/influxdata/influxdb/issues/25666
2025-01-14 14:51:09 -06:00
davidby-influx e974165d25
fix: do not leak file handles from Compactor.write (#25725)
There are a number of code paths in Compactor.write which
on error can lead to leaked file handles to temporary files.
This, in turn, prevents the removal of the temporary files until
InfluxDB is rebooted, releasing the file handles.

closes https://github.com/influxdata/influxdb/issues/25724
2025-01-03 14:43:41 -08:00
WeblWabl 45a8227ad6
fix(influxd): update xxhash, avoid stringtoslicebyte in cache (#578) (#25622) (#25624)
* fix(influxd): update xxhash, avoid stringtoslicebyte in cache (#578)

* fix(influxd): update xxhash, avoid stringtoslicebyte in cache

This commit does 3 things:

* it updates xxhash from v1 to v2; v2 includes a assembly arm version of
  Sum64
* it changes the cache storer to write with a string key instead of a
  byte slice. The cache only reads the key which WriteMulti already has
as a string so we can avoid a host of allocations when converting back
and forth from immutable strings to mutable byte slices. This includes
updating the cache ring and ring partition to write with a string key
* it updates the xxhash for finding the cache ring partition to use
Sum64String which uses unsafe pointers to directly use a string as a
byte slice since it only reads the string. Note: this now uses an
assembly version because of the v2 xxhash update. Go 1.22 included new
compiler ability to recognize calls of Method([]byte(myString)) and not
make a copy but from looking at the call sites, I'm not sure the
compiler would recognize it as the conversion to a byte slice was
happening several calls earlier.

That's what this change set does. If we are uncomfortable with any of
these, we can do fewer of them (for example, not upgrade xxhash; and/or
not use the specialized Sum64String, etc).

For the performance issue in maz-rr, I see converting string keys to
byte slices taking between 3-5% of cpu usage on both the primary and
secondary. So while this pr doesn't address directly the increased cpu
usage on the secondary, it makes cpu usage less on both which still
feels like a win. I believe these changes are easier to review that
switching to a byte slice pool that is likely needed in other places as
the compiler provides nearly all of the correctness checks we need (we
are relying also on xxhash v2 being correct).

* helps #550

* chore: fix tests/lint

* chore: don't use assembly version; should inline

This 2 line change causes xxhash to use a purego Sum64 implementation
which allows the compiler to see that Sum64 only read the byte slice
input which them means is can skip the string to byte slice allocation
and since it can skip that, it should inline all the calls to
getPartitionStringKey and Sum64 avoiding 1 call to Sum64String which
isn't inlined.

* chore: update ci build file

the ci build doesn't use the make file!!!

* chore: revert "chore: update ci build file"

This reverts commit 94be66fde03e0bbe18004aab25c0e19051406de2.

* chore: revert "chore: don't use assembly version; should inline"

This reverts commit 67d8d06c02e17e91ba643a2991e30a49308a5283.

(cherry picked from commit 1d334c679ca025645ed93518b7832ae676499cd2)

* feat: need to update go sum

---------

Co-authored-by: Phil Bracikowski <13472206+philjb@users.noreply.github.com>
(cherry picked from commit 06ab224516)
2024-12-06 16:05:03 -06:00
davidby-influx 07c261a21a
feat: allow the specification of a write window for retention policies (#25517)
Add FutureWriteLimit and PastWriteLimit to retention
policies. Points which are outside of
now() + FutureWriteLimit
or
now() - PastWriteLimit
will be rejected on write with a PartialWriteError.

closes https://github.com/influxdata/influxdb/issues/25424
2024-11-15 13:30:14 -08:00
Geoffrey Wossum 8497fbf0af
chore: remove unnecessary fmt.Sprintf calls (#25536)
Remove unnecessary fmt.Sprintf calls for static code checks in main-2.x.
2024-11-12 11:06:39 -06:00
Geoffrey Wossum 65683bf166
chore: fix logging issues in Store.loadShards (#25529)
Fix reporting shards not opening correctly when they actually did.
Fix race condition with logging in loadShards.
2024-11-12 09:34:05 -06:00
Geoffrey Wossum 0bc167bbd7
chore: loadShards changes to more cleanly support 2.x feature (#25513)
* chore: move shardID parsing and shard filtering into walkShardsAndProcess

* chore: make it impossible to miss sending shardResponse or marking shard as complete

* chore: always count number of shards (preparation for 2.x related feature)

* chore: explicitly load series files and create indices serially

Explicitly load series files and create indices serially. Also
avoid passing them to work functions that don't need them.

* chore: rework loadShards for changes necessary to cancel loading process

* chore: comment improvements

* fix: fix race conditions in TestStore_StartupShardProgress and TestStore_BadShardLoading

* chore: avoid logging nil error

* chore: refactor shard loading and shard walking

Refactor loadShards and CreateShard to use a common shardLoader class that
makes thread-safety easier. Refactor walkShardsAndProcess into findShards.

* chore: improve comment

* chore: rename OpenShard to ReopenShard and implement with shardLoader

Rename Store.OpenShard to Store.ReopenShard and implement using a
shardLoader object. Changes to tests as necessary.

* chore: avoid resetting shard options and locking on Reopen

Avoid resetting shard options when reopening a shard.
Proper mutex locker in Shard.ReopenShard.

* chore: fix formatting issue

* chore: warn on mixed index types in Store.CreateShard

* chore: change from info to warn when invalid shard IDs found in path

* chore: use coarser locking in Store.ReopenShard

* chore: fix typo in comment

* chore: code simplification
2024-11-08 15:49:48 -06:00
WeblWabl 2cab9a2a1f
feat: Adds functionality to clear out bad shard list (#25398)
* feat(tsdb): Adds functionality to clear bad shards list

This PR adds test and new method to clear out the bad shards list
the method will return the values of the shards that it cleared out
along with the errors. This is the first part in the feature
for adding a load-shards command to influxd-ctl.

Closes influxdata/feature-requests#591
2024-10-18 13:22:32 -05:00
WeblWabl 3c87f524ed
feat(logging): Add startup logging for shard counts (#25378)
* feat(tsdb): Adds shard opening progress checks to startup
This PR adds a check to see how many shards are remaining
vs how many shards are opened. This change displays the percent
completed too.

closes influxdata/feature-requests#476
2024-10-16 10:09:15 -05:00
Shiwen Cheng 1bc0eb4795
fix(tsm1): Fix data race of seriesKeys in deleteSeriesRange (#25268)
Add an RWMutex to allow safe concurrent 
access in deleteSeriesRange
2024-09-27 16:36:27 -07:00
WeblWabl 8eaa24d813
feat(tsm): Allow for deletion of series outside default rp (#25312)
* feat(tsm): Allow for deletion of series outside default RP
9d116f6
This PR adds the ability for deletion of series that are outside
of the default retention policy. This updates InfluxQL to include changes
from: influxdata/influxql#71

closes: influxdata/feature-requests#175

* feat(tsm): Allow for deletion of series outside default RP
9d116f6
This PR adds the ability for deletion of series that are outside
of the default retention policy. This updates InfluxQL to include changes
from: influxdata/influxql#71

closes: influxdata/feature-requests#175
2024-09-17 16:34:14 -05:00
WeblWabl 5c9e45f033
fix(tsi1/partition/test): fix data races in test code (#57) (#25338)
* fix(tsi1/partition/test): fix data races in test code (#57)

* fix(tsi1/partition/test): fix data races in test code

This PR is like influxdata/influxdb#24613 but solves it with a setter
method for MaxLogFileSize which allows unexporting that value and
MaxLogFileAge. There are actually two places locks were needed in test
code. The behavior of production code is unchanged.

(cherry picked from commit f0235c4daf4b97769db932f7346c1d3aecf57f8f)

* feat: modify error handling to be more idiomatic

closes https://github.com/influxdata/influxdb/issues/24042

* fix: errors.Join() filters nil errors

---------

Co-authored-by: Phil Bracikowski <13472206+philjb@users.noreply.github.com>
2024-09-16 20:26:14 -05:00
Geoffrey Wossum 23008e5286
chore: improve error messages and logging during shard opening (#25314)
* chore: improve error messages and logging during shard opening
2024-09-12 15:11:56 -05:00
davidby-influx 5d8d1120e1
fix: add additional logging on loading fields.idxl files (#25309)
Log the path of the file being loaded, and when level=debug
log progress fpr each set of field changes

closes https://github.com/influxdata/influxdb/issues/25289
2024-09-12 08:25:02 -07:00
WeblWabl 7dc8b1d648
fix(tsi1/partition/test): fix data race in test code (#25288) 2024-09-11 19:48:41 -05:00
Geoffrey Wossum 2cf2103cc4
feat: add hook for optimizing series reads based on authorizer (#25207) 2024-08-02 15:03:44 -05:00
Shiwen Cheng 7333da9592
fix(tsi1): fix data race between appendEntry and FlushAndSync tsi1.(*LogFile) (#25182)
Extend lock lifespan to encompass the 
flushAndSync() call to avoid a race

closes https://github.com/influxdata/influxdb/issues/25181
2024-07-23 14:40:10 -07:00
davidby-influx 176fca2138
fix: prevent an infinite loop in measurementFieldSetChangeMgr (#25155)
The measurementFieldSetChangeMgr has a possibly infinite loop
if the writeRequests channel is closed while in the inner
loop to consolidate write requests. We need to check for ok
on channel receive and exit the loop when ok is false.

closes https://github.com/influxdata/influxdb/issues/25151
2024-07-12 16:52:28 -07:00
Geoffrey Wossum b4bd607eef
fix: prevent retention service from hanging (#25055)
* fix: prevent retention service from hanging

Fix issue that can cause the retention service to hang waiting on a
`Shard.Close` call. When this occurs, no other shards will be deleted
by the retention service. This is usually noticed as an increase in
disk usage because old shards are not cleaned up.

The fix adds to new methods to `Store`, `SetShardNewReadersBlocked`
and `InUse`. `InUse` can be used to poll if a shard has active readers,
which the retention service uses to skip over in-use shards to prevent
the service from hanging. `SetShardNewReadersBlocked` determines if
new read access may be granted to a shard. This is required to prevent
race conditions around the use of `InUse` and the deletion of shards.

If the retention service skips over a shard because it is in-use, the
shard will be checked again the next time the retention service is run.
It can be deleted on subsequent checks if it is no longer in-use. If
the shards is stuck in-use, the retention service will not be able to
delete the shards, which can be observed in the logs for manual
intervention. Other shards can still be deleted by the retention service
even if a shard is stuck with readers.

closes: #25054
2024-06-13 11:07:17 -05:00
davidby-influx 82cbdb5478
fix: ensure TSMBatchKeyIterator and FileStore close all TSMReaders (#24957)
Do not let errors on closing 
a TSMReader prevent other 
closes.
2024-05-06 09:59:30 -07:00
Brandon Pfeifer d4b16dcd98
chore: upgrade protocol buffers to v5.26.1 (#24949) 2024-05-01 11:00:26 -07:00
Jakub Bednář dbbe4611c0
build(deps): upgrade google.golang.org/protobuf to v1.33.0 (master-1.x) (#24818) 2024-03-26 14:07:28 +01:00
davidby-influx fe6c64b21e
fix: return and respect cursor errors (#24791)
ArrayCursors were ignoring errors, which led to panics when nil
cursors were operated on. This fix passes errors back up the stack
and uses them to enforce healthy cursor creation.

Closes https://github.com/influxdata/influxdb/issues/24789
---------
Co-authored-by: Stuart Carnie <stuart.carnie@gmail.com>
2024-03-25 17:22:33 -07:00
davidby-influx 8ff06d5a92
fix: improved shard deletion (#24602)
Avoid unnecessarily deleting series from the series file
Try harder to delete series from InMem indices
Log all errors on shard deletion

Closes https://github.com/influxdata/influxdb/issues/24834
2024-03-25 17:15:31 -07:00
davidby-influx bc80e881fa
fix: do not panic when empty tags are queried (#24784)
Do not panic if a cursor array is nil and the number
of timestamps is retrieved.

closes https://github.com/influxdata/influxdb/issues/24536
2024-03-18 15:28:29 -07:00
Jack 6af0be9234
fix: panic index out of range for invalid series keys (#24565)
* chore: add scaffolding for naive solution

* feat: test case scaffolding

* fix: implement check for series key before proceeding

* fix: add validation for ReadSeriesKeyMeasurement usage

* refactor: explicit use of series key len

* feat: add remaining check to index

* feat: add check to remaining files

As the Len function is used as part of the parseSeriesKey, this also needs to be accounted for on the nil return from this function as it is used in different contexts

* feat: expand test cases

* chore: go fmt

* chore: update test failure message

* chore: impl feedback on unnecessary sz checks

* feat: expand test cases

* fix: nil series key check

In both sections for index.go there is a pre-existing length check against the series key which should catch invalid values, perhaps this explains why it hasn't cropped up in the reported panics. For even more safety, we can also skip a nil key because we know that subsequent calls will cause a panic where this key is attempted to be used

* fix: remove nil tags check

A key with no tags is valid, so we should not check for BOTH nil key and tags as a key could be nil, which is invalid, yet still have tags and therefore cause the check to pass which we do not want

* feat: extend test cases from feedback

* fix: extend checks for CompareSeriesKeys

* feat: add nilKeyHandler for shared key checking logic

* fix: logical error in nilKeyHandler

Prior to this, the else was always defaulted to at the end of the conditional branch, which causes unexpected behaviour and a failure of a bunch of tests.

* fix: return tags keep nil data

In a recent change to this, we agreed on a simple name == nil check for the actual data. As a follow on to this, I just realised that we don't actually want to nil back the tags, even if they're not checked, because having no tags is a valid input so we can simply return whatever we were passed unchanged.

* fix: use len == 0 for extra safety

* feat: extra test for blank series key
2024-01-23 09:44:29 +00:00
davidby-influx c05b340b72
chore: upgrade flux (#24504)
* chore: upgrade flux

* chore: execute "go generate" inside cross-builder (#24582)

---------

Co-authored-by: Brandon Pfeifer <bpfeifer@influxdata.com>
2024-01-19 17:40:48 -05:00
davidby-influx 969abf3da2
fix: avoid SIGBUS when reading non-std series segment files (#24509)
Some series files which are smaller than the standard
sizes cause SIGBUS in influx_inspect and influxd, because
entry iteration walks onto mapped memory not backed by the
the file.  Avoid walking off the end of the file while
iterating series entries in oddly sized files.

closes https://github.com/influxdata/influxdb/issues/24508

Co-authored-by: Geoffrey Wossum <gwossum@influxdata.com>
2023-12-08 15:46:11 -08:00
davidby-influx 2dc3dcb3d1
fix: do not escape CSV output (#24311)
CSV output is incorrectly escaped.
Add a boolean flag to tag output
functions to prevent this.

closes https://github.com/influxdata/influxdb/issues/24309
2023-06-29 12:00:41 -07:00
davidby-influx 53856cdaae
fix: series file index compaction (#23916)
Series file indices monotonically grew even
when series were deleted.  Also stop 
ignoring error in series index recovery

Partially closes https://github.com/influxdata/EAR/issues/3643
2023-06-01 10:49:23 -07:00
davidby-influx aad79e471f
fix: prevent world-writable MANIFEST files (#24235)
When a new MANIFEST file is created, set
its permissions to 644, not 666

closes https://github.com/influxdata/influxdb/issues/24233
2023-05-18 12:07:34 -07:00