This adds some computational overhead during the merging of new
namespace schema with what's in the router's local cache, but will allow
gossiping of changes.
PR #8327 introduced a bunch of metrics for the sqlx connection pool. One
of the metrics was the "used" metric, which was supposed to count
"currently in use" connections. In prod, however, this metric underflows
to a very large integer. It seems that the "acquire" callback is only
used by sqlx for re-used connections (i.e. for the transition from "idle"
to "used"). Now we could try to work around it, but since there is no
"close connection" callback, I doubt it is possible to do that accurately.
Luckily though we don't really need that counter. sqlx already offers
"active" (defined as idle + used) and "idle", so getting "used" is just
the difference. I removed the "used" metric nevertheless because
"active" and "idle" are read independently from each other (based on atomic
integers) and are NOT guaranteed to be in-sync. Calculating the
difference within IOx however would give the illusion that they are. So
I leave this to the dashboard / alert / whatever, because there it is
usually understood that metrics are samples and may be out of sync for a
very short time.
A nice side effect of this change is that it simplifies the code quite a
bit.
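
A minimal sketch of the failure mode described above, assuming the removed
gauge was backed by an unsigned atomic counter. `UsedGauge` and
`used_from_samples` are hypothetical names for illustration, not the IOx or
sqlx API. A decrement without a matching increment wraps around, and a
consumer that derives "used" from two independently read samples should
saturate rather than subtract blindly:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical gauge standing in for the removed "used" metric.
pub struct UsedGauge(AtomicU64);

impl UsedGauge {
    pub fn new() -> Self {
        Self(AtomicU64::new(0))
    }

    /// Called when a connection is handed out. If this callback only fires
    /// for re-used connections, fresh connections are never counted.
    pub fn inc(&self) {
        self.0.fetch_add(1, Ordering::Relaxed);
    }

    /// Called when a connection is returned. `fetch_sub` wraps on overflow,
    /// so an unmatched decrement at zero yields `u64::MAX`.
    pub fn dec(&self) {
        self.0.fetch_sub(1, Ordering::Relaxed);
    }

    pub fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Consumer-side derivation of "used" from two samples that may be
/// slightly out of sync; saturates at zero instead of wrapping.
pub fn used_from_samples(active: u64, idle: u64) -> u64 {
    active.saturating_sub(idle)
}
```

The saturating subtraction belongs on the consumer side (dashboard or alert
query), where it is understood that both inputs are samples.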
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
In very rare cases a panic mid-write can result in a partially completed
write to the WAL which contains no table data. Such a write is no longer
replayed (as there is nothing to replay) and does not cause a panic when
encountered; instead the occurrence is tracked in the WAL replayed ops
metric and a warning is logged.
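
The skip-and-count behaviour could look roughly like the sketch below. The
types `WalWrite` and `ReplayStats` and the function `replay` are
hypothetical simplifications, not the ingester's actual WAL types; the
warning goes to stderr here instead of a real logger:

```rust
/// Hypothetical stand-in for a decoded WAL write entry.
pub struct WalWrite {
    pub sequence_number: u64,
    /// (table name, row count) pairs; empty for a partially completed write.
    pub tables: Vec<(String, usize)>,
}

/// Hypothetical replay counters, standing in for the replayed ops metric.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct ReplayStats {
    pub applied: u64,
    pub skipped_empty: u64,
}

/// Replay all ops, skipping (but counting) writes without table data.
pub fn replay(ops: &[WalWrite], mut apply: impl FnMut(&WalWrite)) -> ReplayStats {
    let mut stats = ReplayStats::default();
    for op in ops {
        if op.tables.is_empty() {
            // Nothing to replay: warn and record the occurrence, do not panic.
            eprintln!(
                "warning: skipping WAL write {} with no table data",
                op.sequence_number
            );
            stats.skipped_empty += 1;
            continue;
        }
        apply(op);
        stats.applied += 1;
    }
    stats
}
```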
Exposes the `ERROR_WINDOW` parameter that controls the router's
downstream error-gate health check behaviour as an environment
variable/command line flag. This allows tuning, per-environment, the
period over which the error rate of 80% must be exceeded to cause an
ingester to appear unhealthy.
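
To illustrate what the window controls, here is a simplified sketch of an
error-rate gate. It is an assumption about the general shape of such a
check, not the router's implementation: `ErrorGate` and its methods are
hypothetical, and the real health check may sample or reset differently.
Only the 80% threshold and the idea of a configurable window come from
this change:

```rust
use std::time::{Duration, Instant};

/// Hypothetical error gate: unhealthy once a full window exceeds the
/// error-rate threshold.
pub struct ErrorGate {
    window: Duration,
    threshold: f64,
    window_start: Instant,
    ok: u64,
    err: u64,
    healthy: bool,
}

impl ErrorGate {
    pub fn new(window: Duration, now: Instant) -> Self {
        Self { window, threshold: 0.8, window_start: now, ok: 0, err: 0, healthy: true }
    }

    /// Record the outcome of one downstream request.
    pub fn observe(&mut self, success: bool, now: Instant) {
        self.roll(now);
        if success { self.ok += 1 } else { self.err += 1 }
    }

    /// Close the current window if it has elapsed and re-evaluate health.
    fn roll(&mut self, now: Instant) {
        if now.saturating_duration_since(self.window_start) >= self.window {
            let total = self.ok + self.err;
            // A window without traffic leaves the previous state unchanged.
            if total > 0 {
                self.healthy = (self.err as f64 / total as f64) <= self.threshold;
            }
            self.ok = 0;
            self.err = 0;
            self.window_start = now;
        }
    }

    pub fn is_healthy(&mut self, now: Instant) -> bool {
        self.roll(now);
        self.healthy
    }
}
```

A shorter window reacts faster to a failing ingester but is more sensitive
to brief error bursts, which is why per-environment tuning is useful.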
Cache the merged Schema of all the RecordBatch within a buffer at
snapshot generation time.
To be useful, this cached schema is made available to the PartitionData
for re-use, allowing the schema of "hot" data within a partition's
mutable buffer to be read without generating a RecordBatch first.
Provide row count & timestamp min/max statistics on a per-partition
basis.
This commit builds on the FSM summary statistics, merging all FSM
statistics across all data within the PartitionData (in various states)
and making them available to the caller.
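
A minimal sketch of the merge, assuming each buffer state exposes a row
count and a timestamp min/max. `BufferStats` and `merge_stats` are
hypothetical names, not the actual PartitionData API:

```rust
/// Hypothetical per-buffer summary statistics.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BufferStats {
    pub rows: usize,
    pub ts_min: i64,
    pub ts_max: i64,
}

/// Merge the statistics of all buffers within a partition (hot, snapshot,
/// persisting). Returns `None` if the partition holds no data.
pub fn merge_stats(parts: impl IntoIterator<Item = BufferStats>) -> Option<BufferStats> {
    parts.into_iter().reduce(|a, b| BufferStats {
        rows: a.rows + b.rows,
        ts_min: a.ts_min.min(b.ts_min),
        ts_max: a.ts_max.max(b.ts_max),
    })
}
```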
Cache the row count & timestamp min/max values within the partition FSM
/ buffer, and make them available through the Queryable trait.
This allows the PartitionData to read the row count of a buffer (either
"hot" for writes, a "snapshot" of immutable RecordBatch, or "persisting"
for in-flight persisting data).
These values will enable early partition pruning.
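
As a rough illustration of how such cached values enable pruning, the
sketch below keeps only partitions whose timestamp range overlaps the
query range. `PartitionStats` and `prune` are hypothetical and assume
inclusive bounds; the real pruning logic is not part of this commit:

```rust
/// Hypothetical cached statistics for one partition.
pub struct PartitionStats {
    pub rows: usize,
    pub ts_min: i64,
    pub ts_max: i64,
}

/// Return the indexes of partitions that may contain rows within the
/// inclusive query range `[query_min, query_max]`.
pub fn prune(partitions: &[PartitionStats], query_min: i64, query_max: i64) -> Vec<usize> {
    partitions
        .iter()
        .enumerate()
        .filter(|(_, p)| {
            // Empty partitions and non-overlapping ranges can be skipped
            // without reading any row data.
            p.rows > 0 && p.ts_min <= query_max && p.ts_max >= query_min
        })
        .map(|(i, _)| i)
        .collect()
}
```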
To better gauge how many connections we use and especially if we hit the
max connection limit, it would be helpful to actually have some metrics
available for the pool usage. This change adds a few basic metrics.
* feat(idpe-17789): scheduler job_status() (#8121)
This block of work moves some of the specific downstream actions affiliated with compaction outcomes into the scheduler. Which responsibilities stay in the compactor, versus move to the scheduler, roughly follows these heuristics: (a) whether the action has an impact on global catalog state (a.k.a. commits and partition skipping), (b) whether it is logging affiliated with compactor health (e.g. PartitionDoneSink logging outcomes) versus system health (e.g. logging commits), and (c) whether it reports to the scheduler any errors encountered during compaction. This boundary is subject to change as we move forward.
Also, a noted caveat (TODO) on this commit. We have a CompactionJob which is used to track work handed off to each compactor. Currently it still uses the partition_id for tracking, but the followup PR will start moving the compactor to have more CompactionJob uuid awareness.
* fix(idpe-17789): need to remove partition from uniqueness tracking, so it becomes available again
* refactor(idpe-17789): split up the single-use end_job() from the multi-use update_job_status()
* feat(idpe-17789): Commit is now a scheduler trait, only used externally in the compactor_test_utils
* feat(idpe-17789): Propagate errors pertaining to commit, in both the scheduler and the compactor.
* feat(idpe-17789): PartitionDoneSink should have different crate-private traits for scheduler versus compactor.
* feat(idpe-17789): PartitionDoneSink should propagate errors
* test(idpe-17789): integration tests suite
* test(idpe-17789): test documenting what skip request does (as outcome)
* refactor(idpe-17789): make the validation of the upgrade commit, versus replacement commit, more explicit.
* feat(idpe-17789): switch to using parking_lot Mutex within the scheduler
Adds benchmarks that exercise partition pruning during query execution
within the ingester, for varying partition counts within a table, and
varying row counts within each partition.
* refactor: isolate docker build to script
* chore: add labels to docker image
* chore: export image as OCI
* chore: print image digest
* fix: convert to OCI BEFORE calculating digest
* fix: use digest of uploaded image, not of the local archive
---------
Co-authored-by: kodiakhq[bot] <49736102+kodiakhq[bot]@users.noreply.github.com>
I've seen at least one case in prod where the UTC clock goes backwards.
The `TimeProvider` and `Time` interface even warns about that. However
there was a `Sub` impl that would panic if that happens and even though
this was documented, I think we can do better and just not offer a
panicky interface at all.
So this removes the `Sub` impl and replaces all uses with
`checked_duration_since`.
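
A simplified sketch of the non-panicking interface. The `Time` type below
is a hypothetical stand-in backed by nanoseconds since the epoch, not the
actual `iox_time::Time` implementation; only the method name
`checked_duration_since` is taken from this change:

```rust
use std::time::Duration;

/// Hypothetical wall-clock timestamp in nanoseconds since the Unix epoch.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Time {
    nanos_since_epoch: i64,
}

impl Time {
    pub fn from_timestamp_nanos(nanos: i64) -> Self {
        Self { nanos_since_epoch: nanos }
    }

    /// Returns the duration since `earlier`, or `None` if `earlier` is
    /// actually later than `self` (e.g. the clock went backwards).
    pub fn checked_duration_since(&self, earlier: Time) -> Option<Duration> {
        let delta = self.nanos_since_epoch.checked_sub(earlier.nanos_since_epoch)?;
        if delta < 0 {
            None
        } else {
            Some(Duration::from_nanos(delta as u64))
        }
    }
}
```

Callers then have to decide explicitly what a backwards clock means for
them, for example by falling back to `Duration::ZERO` via `unwrap_or`.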
Time has a special meaning and can be partitioned on by the strftime
formatter. It should not be used as a tag value part in a custom
partitioning template.
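
A validation step enforcing this rule might look like the sketch below.
`TemplatePart`, `TemplateError` and `validate` are hypothetical names used
for illustration and do not mirror the actual partition template types:

```rust
/// Hypothetical parts of a custom partition template.
#[derive(Debug, PartialEq, Eq)]
pub enum TemplatePart {
    /// Partition on the value of the named tag.
    TagValue(String),
    /// Partition on the timestamp using a strftime format string.
    TimeFormat(String),
}

#[derive(Debug, PartialEq, Eq)]
pub enum TemplateError {
    /// "time" was used as a tag value part.
    TimeAsTagValue,
}

/// Reject templates that use the special "time" column as a tag value.
pub fn validate(parts: &[TemplatePart]) -> Result<(), TemplateError> {
    for part in parts {
        if let TemplatePart::TagValue(name) = part {
            if name == "time" {
                return Err(TemplateError::TimeAsTagValue);
            }
        }
    }
    Ok(())
}
```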