TMP files could leak when compactions failed for various reasons. They
were also being deleted inadvertently when compactions were disabled causing
other errors to be reported in the logs.
The previous version of `top()` and `bottom()` would gather all of the
points to use in a slice, filter them (if necessary), then use a
slightly modified heap sort to retrieve the top or bottom values.
This performed horrendously from the standpoint of memory. Since it
consumed so much memory and spent so much time in allocations (along
with sorting a potentially very large slice), this affected speed too.
These calls have now been modified so they keep the top or bottom points
in a min or max heap. For `top()`, a new point will read the minimum
value from the heap. If the new point is greater than the minimum point,
it will replace the minimum point and fix the heap with the new value.
If the new point is smaller, it discards that point. For `bottom()`, the
process is the opposite.
It will then sort the final result to ensure the correct ordering of the
selected points.
When `top()` or `bottom()` contain a tag to select, they have now been
modified so this query:
SELECT top(value, host, 2) FROM cpu
Essentially becomes this query:
SELECT top(value, 2), host FROM (
SELECT max(value) FROM cpu GROUP BY host
)
This should drastically increase the performance of all `top()` and
`bottom()` queries.
Writing points outside of a retention policy range were silently dropped. They
are dropped to prevent creating a shard that will be immediately deleted. These
dropped points were silent and did not return an error respone to the caller.
Fixes#8392
Currently, when debugging issues with InfluxDB we often ask for the
following profiles:
curl -o block.txt "http://localhost:8086/debug/pprof/block?debug=1"
curl -o goroutine.txt
"http://localhost:8086/debug/pprof/goroutine?debug=1"
curl -o heap.txt "http://localhost:8086/debug/pprof/heap?debug=1"
curl -o cpu.txt "http://localhost:8086/debug/pprof/profile
This can be bothersome for users, or even difficult if they're
unfamiliar with cURL (or it's not on their system).
This commit adds a new endpoint: /debug/pprof/all which will return a
single compressed archive of all of the above profiles. The CPU profile
is optional, and not returned by default. To include a CPU profile the
URL to request should be: /debug/pprof/all?cpu=true. It's also possible
to vary the length of the CPU profile by adding a `seconds=x` parameter,
where x defaults to 30, if absent.
The new command for gathering profiles from users should now be:
curl -o profiles.tar.gz "http://localhost:8086/debug/pprof/all"
Or, if we need to see a CPU profile:
curl -o profiles.tar.gz
"http://localhost:8086/debug/pprof/all?cpu=true"
It's important to remember that a CPU profile is a blocking operation
and by default it will take 30 seconds for the response to be returned
to the user.
Finally, if the user is unfamiliar with cURL, they will now be able to
visit http://localhost:8086/debug/pprof/all in a web browser, and the
archive will be downloaded to their machine.
This changes full compactions within a shard to run sequentially
instead of running all the compaction groups in parallel. Normally,
there is only 1 full compaction group to run. At times, there could
be several which causes instability if they are all running concurrently
as they tie up a cpu for long periods of time.
Level compactions are also capped to a max of 4 concurrently running for each level
in a shard. This prevents sudden spikes in CPU and disk usage due to a large backlog
of tsm files at a given level.
Measurement name and field were converted between []byte and string
repetively causing lots of garbage. This switches the code to use
[]byte in the write path.
This pool was previously a pool.Bytes to avoid repetitive allocations.
It was recently switchted to a sync.Pool because pool.Bytes held onto
very larger buffers at times which were never released. sync.Pool is
showing up in allocation profiles quite frequently.
This switches the pool to a new pool that limits how many buffers are
in the pool as well as the max size of each buffer in the pool. This
provides better bounds on allocations.
This speeds up time encoding and decoding by skipping the divisor
scaling if scaling by 1. Since division and multiplication are expensive
cpu and scaling by 1 has no effect, this just slows encoding and decoding
down.
After using `/debug/requests`, the client will wait for 30 seconds
(configurable by specifying `seconds=` in the query parameters) and the
HTTP handler will track every incoming query and write to the system.
After that time period has passed, it will output a JSON blob that looks
very similar to `/debug/vars` that shows every IP address and user
account (if authentication is used) that connected to the host during
that time.
In the future, we can add more metrics to track. This is an initial
start to aid with debugging machines that connect too often by looking
at a sample of time (like `/debug/pprof`).