This commit converts meta.ShardInfo.OwnerIDs from a slice of IDs
to a slice of objects. This is to support adding statuses for a
shard for a given node. For example, a node may have a shard
assigned to it but still be copying the shard, and so is not yet
ready to serve data for it.
The old `OwnerIDs` is marked as deprecated, however, the code
still supports loading from older protobuf-encoded data.
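A minimal sketch of the new shape (the exact field names here are assumptions):

```go
package meta

// ShardOwner replaces a bare node ID so that per-node state can be
// attached later, e.g. a status marking a node that is still copying
// the shard and not yet ready to serve reads for it.
type ShardOwner struct {
	NodeID uint64
}

type ShardInfo struct {
	ID       uint64
	Owners   []ShardOwner // new: one object per owning node
	OwnerIDs []uint64     // Deprecated: kept so older protobuf-encoded data still loads
}
```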
Running SHOW MEASUREMENTS in a partially replicated cluster produces inconsistent
results due to the connection pooling. When running remote meta-data queries,
the cluster service ends up keeping the map shard request open but still checks the connection
back into the pool. This causes inconsistent results because data from the last request
interferes with the new request.
This removes the connection pool, which fixes the issue. It also has the side effect of fixing
a node's pooled connections that have gone bad when a node restarts. For example, in a 3-node cluster
that has been responding to queries correctly, restarting 1 node would cause all the others to fail
to query that node indefinitely. This is now fixed as well.
Some nodes may receive requests for shards that are assigned to them
but that they do not have locally yet. Just return no results in this
case instead of panicking.
The TSDBStore was returning a nil mapper if the shard did not exist. The caller always
assumed the mapper would not be nil, causing a panic. Instead, have the mapper skip the mapping
phase if its shard reference is nil. This fixes queries against data-only nodes and against
shards that are not fully replicated in the cluster.
Fixes #3574
A write lock was being taken to read the memory size to determine if writes
should be paused. Writers would get blocked indefinitely when trying to
acquire the write lock, which makes writes pause (or stop) for long periods
of time.
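A minimal sketch of the fix, with hypothetical names: the size check takes a read lock, so it no longer serializes against writers.

```go
package wal

import "sync"

type Log struct {
	mu            sync.RWMutex
	memorySize    int64
	maxMemorySize int64
}

// shouldPause inspects the in-memory size under a read lock; taking a
// write lock here would block every concurrent writer while it waits.
func (l *Log) shouldPause() bool {
	l.mu.RLock()
	defer l.mu.RUnlock()
	return l.memorySize > l.maxMemorySize
}
```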
The log was deferring the release of the read lock on the WAL. This had
the effect that the read lock was held until after the partition finished writing
(the partition maintains its own locks). The read lock is only needed around the call
to pointsToPartitions so it can get a consistent copy of the points to write. After
that call returns, the lock is not needed, so release it immediately.
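A sketch of the narrower locking, using hypothetical types: the read lock covers only the snapshot step, not the partition writes.

```go
package wal

import "sync"

type Point struct{ Key, Data []byte }

type Partition struct{ mu sync.Mutex }

func (p *Partition) Write(points []Point) error {
	p.mu.Lock() // partitions maintain their own locks
	defer p.mu.Unlock()
	// ... append the points to this partition's cache ...
	return nil
}

type Log struct {
	mu         sync.RWMutex
	partitions []*Partition
}

func (l *Log) pointsToPartitions(points []Point) map[*Partition][]Point {
	m := make(map[*Partition][]Point, len(l.partitions))
	for i, pt := range points {
		p := l.partitions[i%len(l.partitions)]
		m[p] = append(m[p], pt)
	}
	return m
}

// WritePoints holds the read lock only around pointsToPartitions; a
// deferred unlock would keep it held through the partition writes below.
func (l *Log) WritePoints(points []Point) error {
	l.mu.RLock()
	m := l.pointsToPartitions(points)
	l.mu.RUnlock()

	for p, pts := range m {
		if err := p.Write(pts); err != nil {
			return err
		}
	}
	return nil
}
```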
addToCache is called in a goroutine and can panic if the server is closed while opening. If
part of the open func errors, it returns an error and immediately calls close. close sets
p.cache to nil, which causes the goroutine trying to initialize the cache to panic as well. The
goroutine should run under a write lock to avoid this race/panic.
If LoadMetadataIndex() tries to log an error, it causes a panic because the
logger is not set until Open() is called, which is after LoadMetadataIndex() returns.
Instead, just set the logger up when the WAL is created.
Writes could time out when adding new measurement names to the
index if the sort took a long time. The names slice was never
actually used (except by a test), so keeping it in the index wastes memory
and sorting it wastes CPU and increases lock contention. The sorting
was happening while the shard held a write lock.
Fixes #3869
If we've seeked a cursor, then we can be sure that there will be no
data between it and the point that was seeked to. Take advantage of
this fact to only seek when it would yield a value different from
the last.
In addition, only init the pointsHeap when doing a raw query. For
aggregate queries, it is reinitialized on every time bucket, so there is
no need to seek through all the cursors.
For a synthetic database where there were only entries for a tiny
slice of time, it cut queries from 112 seconds to 30 seconds doing
`select mean(value) from cpu where time > now() - 2h group by time(1h)`
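A sketch of the seek-skipping idea against a simplified cursor interface (the real tsdb.Cursor API differs):

```go
package cursors

// Cursor is a simplified stand-in: Seek returns the first key at or
// after t, with ok=false once the cursor is exhausted.
type Cursor interface {
	Seek(t int64) (key int64, value []byte, ok bool)
}

// cachedCursor remembers the last seek. Since no data can exist between
// a seek target and the key it returned, any target in that window must
// yield the same answer, so the underlying seek can be skipped.
type cachedCursor struct {
	c        Cursor
	seeked   bool
	lastSeek int64
	lastKey  int64
	lastVal  []byte
	lastOK   bool
}

func (cc *cachedCursor) Seek(t int64) (int64, []byte, bool) {
	if cc.seeked && t >= cc.lastSeek && (!cc.lastOK || t <= cc.lastKey) {
		return cc.lastKey, cc.lastVal, cc.lastOK // same answer as last time
	}
	cc.lastSeek = t
	cc.lastKey, cc.lastVal, cc.lastOK = cc.c.Seek(t)
	cc.seeked = true
	return cc.lastKey, cc.lastVal, cc.lastOK
}
```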
This commit changes the default block size from 64KB to 4KB for
bz1. This was lowered because small blocks were being uncompressed,
merged, recompressed, and inserted for a large portion of updates.
This became slower and slower over time until it reached the 64KB
threshold. We moved to the 4KB threshold in order to lower the
impact of this recompression.
Since we already got a pointsHeapItem, let's just reuse it instead
of allocating a new one. This cuts allocated memory of a 1-million-point
aggregate query from 4881.97MB to 4139.86MB.
The buffer allocation in bz1 was unused and I'm fairly certain that it
was harmful to performance if used. For queries that run through a bz1
block, needing to hold on to a 64KB block is expensive. Better to churn
on the allocator and have the blocks be released when they are unused
than to have 64KB hanging around for each series regardless of size.
Thanks to @jwilder for brainstorming this issue with me.
By using preallocated buffers for marshaling WAL entries, we can
reduce the amount of memory we allocate.
On a run of `influx_stress -series 10000 -points 1000` this cuts
total allocations from 18684.15MB to 15200.73MB
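A sketch of the pattern (the entry layout and names are assumptions): marshal with append into a caller-owned buffer so repeated writes reuse one allocation.

```go
package wal

import "encoding/binary"

type entry struct {
	key  []byte
	data []byte
}

// appendEntry encodes e onto buf and returns the grown buffer, in the
// style of append. Callers reuse the buffer across writes:
//
//	buf = appendEntry(buf[:0], e)
func appendEntry(buf []byte, e entry) []byte {
	var n [8]byte
	binary.BigEndian.PutUint64(n[:], uint64(len(e.key)))
	buf = append(buf, n[:]...)
	buf = append(buf, e.key...)
	binary.BigEndian.PutUint64(n[:], uint64(len(e.data)))
	buf = append(buf, n[:]...)
	buf = append(buf, e.data...)
	return buf
}
```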
* Update the store to remove the WAL directories associated with a shard or database when they are deleted.
* Fix the Store so that it creates separate WAL directories for databases and retention policies.
This func shows up in profiling. It's called frequently from multiple places and
can be made more efficient. The previous implementation looped over the input
slice 4 times, updating and returning a new slice each time. This changes it to loop
once and create one result slice; a sketch follows the benchmark numbers below.
With influx_stress
Before:
Wrote 10000000 points at average rate of 241750
Average response time: 187.78968ms
After:
Wrote 10000000 points at average rate of 254618
Average response time: 172.235028ms
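A sketch of the single-pass shape (the exact escape set here is an assumption); the old code made one full pass, and one new slice, per escape code.

```go
package escape

// Unescape resolves all four assumed escape codes in one pass, appending
// into a single result slice instead of rebuilding the slice four times.
func Unescape(in []byte) []byte {
	out := make([]byte, 0, len(in))
	for i := 0; i < len(in); i++ {
		if in[i] == '\\' && i+1 < len(in) {
			switch in[i+1] {
			case ' ', ',', '=', '"':
				out = append(out, in[i+1])
				i++
				continue
			}
		}
		out = append(out, in[i])
	}
	return out
}
```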
Profiling of writes showed that point.Fields() and point.Name() were called
repeatedly in PointsWriter and the Shard. These calls are somewhat expensive
when writing large batches, so we can cache them to avoid wasting CPU cycles;
a sketch follows the numbers below.
Using influx_stress with default settings
Before:
Wrote 10000000 points at average rate of 202570
Average response time: 235.450355ms
After:
Wrote 10000000 points at average rate of 246120
Average response time: 182.881008ms
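A sketch of the caching, with a hypothetical parser standing in for the expensive calls:

```go
package points

import "bytes"

type point struct {
	raw    []byte // raw line-protocol text
	name   string
	fields map[string]interface{}
}

// Name parses the measurement name on first use and reuses it afterwards.
func (p *point) Name() string {
	if p.name == "" {
		if i := bytes.IndexAny(p.raw, ", "); i >= 0 {
			p.name = string(p.raw[:i])
		} else {
			p.name = string(p.raw)
		}
	}
	return p.name
}

// Fields parses the field set once and caches the result.
func (p *point) Fields() map[string]interface{} {
	if p.fields == nil {
		p.fields = parseFields(p.raw)
	}
	return p.fields
}

// parseFields stands in for the real (expensive) field parser.
func parseFields(raw []byte) map[string]interface{} {
	return map[string]interface{}{}
}
```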
This commit changes the bz1 append to check for a small
ending block first. If the block is below the threshold
for block size then it is rewritten with the new data
points instead of having a new block written.
If a flush is happening and you bring up a cursor for a series, and that series didn't have any data in the cache (after the flush started), then it would return no data. What it should have done instead is return the data that is in the flush cache, which is held in a separate area of memory until it is committed to the index.
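A sketch with hypothetical names: a cursor opened mid-flush reads the staged flush cache as well as the live cache.

```go
package wal

type Partition struct {
	cache      map[string][][]byte // live write cache
	flushCache map[string][][]byte // entries staged by an in-progress flush
}

// cursorData gathers everything a new cursor should iterate: the values
// parked in the flush cache plus anything written after the flush began.
func (p *Partition) cursorData(key string) [][]byte {
	vals := append([][]byte{}, p.flushCache[key]...)
	return append(vals, p.cache[key]...)
}
```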
If the measurement started with a quote, a panic would happen. This
is a regression due to cb7f0b8.
This also uncovered that measurement names were being escaped incorrectly.
The escape codes for tags and fields also include `=` and `"`, which should
not be escaped for measurement names.
Fixes #3681
* All metadata for each shard is now stored in a single key with a compressed value
* Creation of new metadata no longer requires a synchronous write to Bolt. It is passed to the WAL and written to Bolt periodically outside the write path
* Added DeleteSeries to WAL and updated bz1 to remove series there when DeleteSeries or DropMeasurement are called
CPU profiling shows that computing the tagset-based key of each point is a significant CPU cost. These keys actually don't change per cursor, so precompute them once at mapper-open time, and then use those values as points are drained from the cursor.
Before this change the cursor tag was getting computed on every point, which involved marshaling tags.
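A sketch of the precomputation, with hypothetical names:

```go
package mapper

import (
	"fmt"
	"sort"
	"strings"
)

// tagsetKey marshals sorted tag pairs into a string. Doing this per
// point showed up in CPU profiles, yet the result never changes for a
// given cursor.
func tagsetKey(tags map[string]string) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var sb strings.Builder
	for _, k := range keys {
		fmt.Fprintf(&sb, ",%s=%s", k, tags[k])
	}
	return sb.String()
}

// tagSetCursor carries the key computed once at mapper-open time, so
// draining points from the cursor does no tag marshaling at all.
type tagSetCursor struct {
	key string
	// ... underlying series cursor elided ...
}

func newTagSetCursor(tags map[string]string) *tagSetCursor {
	return &tagSetCursor{key: tagsetKey(tags)}
}
```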
The series map on Measurement was updated and deleted from but never
actually used. Series keys can be very big since they are the
string representation of the measurement plus sorted tags.
Locally I see a 20%-30% reduction in memory usage with 1M series.
This commit changes FieldCodec to always be non-nil. Normally it should
always be non-nil, however, if metadata is not persisted correctly or
consistently then it could be missing. A nil FieldCodec causes queries
to panic.
Fixes #3535
* Capitalize first letter of message
* Log all services starting consistently
* Remove some extraneous log statements in meta.Store
* Log data dirs for meta, data and hinted handoff
This commit fixes the b1 cursor so that reads from either the cache
or bolt buffer will check against the previously read key to ensure
that two of the same keys are not returned.
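A simplified sketch of the check:

```go
package b1

import "bytes"

// dedupeCursor wraps the merged cache/Bolt read and drops a key equal
// to the one returned on the previous call.
type dedupeCursor struct {
	next func() (key, value []byte) // merged read; returns nil key at the end
	prev []byte
}

func (c *dedupeCursor) Next() (key, value []byte) {
	for {
		k, v := c.next()
		if k == nil {
			return nil, nil
		}
		if c.prev != nil && bytes.Equal(k, c.prev) {
			continue // same key arrived from the other buffer; skip it
		}
		c.prev = k
		return k, v
	}
}
```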
Fixes #3571.
ValidateGroupBy was returning an error if a tag does not exist,
but it appears that function was supposed to be validating that
a field name was not used as a group by field.
Fixes #3326
This commit adds a cursor that wraps multiple `tsdb.Cursor` objects
and streams them out as one cursor. The multi-cursor automatically
dedupes keys by using the first cursor specified in the argument
list.
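A sketch of the behavior (a linear scan here, where the real implementation may differ):

```go
package multi

import "bytes"

// Cursor is a simplified stand-in for tsdb.Cursor.
type Cursor interface {
	Next() (key, value []byte) // nil key once exhausted
}

// MultiCursor streams several cursors as one, ordered by key. On equal
// keys the cursor listed first wins, which is how duplicates are deduped.
type MultiCursor struct {
	cursors []Cursor
	keys    [][]byte
	values  [][]byte
}

func NewMultiCursor(cursors ...Cursor) *MultiCursor {
	m := &MultiCursor{
		cursors: cursors,
		keys:    make([][]byte, len(cursors)),
		values:  make([][]byte, len(cursors)),
	}
	for i, c := range cursors {
		m.keys[i], m.values[i] = c.Next()
	}
	return m
}

func (m *MultiCursor) Next() (key, value []byte) {
	// Pick the smallest current key; ties go to the earliest cursor.
	min := -1
	for i, k := range m.keys {
		if k == nil {
			continue
		}
		if min == -1 || bytes.Compare(k, m.keys[min]) < 0 {
			min = i
		}
	}
	if min == -1 {
		return nil, nil
	}
	key, value = m.keys[min], m.values[min]
	// Advance every cursor sitting on this key, discarding duplicates.
	for i := range m.keys {
		for m.keys[i] != nil && bytes.Equal(m.keys[i], key) {
			m.keys[i], m.values[i] = m.cursors[i].Next()
		}
	}
	return key, value
}
```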
This commit fixes issues found from using a more complex `testing/quick`
implementation of the `WriteIndex()` test. The newer test inserts
multiple sets of random data that's confined to a smaller random space
so there's more chance of overlapping data.
The fixes were primarily around inserting old data or inserting the same
timestamp multiple times for a single write. The block splitting was not
working correctly before, and the sorting and deduping were not handled
correctly.
Newlines in a string field would cause the parser to return
the line prematurely, causing "unbalanced quotes" errors. This
makes the line scanning aware of quoted fields so that the whole
line is returned.
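A simplified sketch of quote-aware scanning:

```go
package parser

// scanLine returns the index just past the first newline that is not
// inside a double-quoted string field, or len(buf) if there is none.
func scanLine(buf []byte) int {
	inQuotes := false
	for i := 0; i < len(buf); i++ {
		switch buf[i] {
		case '\\':
			i++ // skip an escaped character such as \" inside a string
		case '"':
			inQuotes = !inQuotes
		case '\n':
			if !inQuotes {
				return i + 1
			}
		}
	}
	return len(buf)
}
```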
Fixes #3545
This change moves tracking of next timestamp and values to simple
slices, as performance measurement showed that Peek() on TagSet cursors
was a huge performance drain. There is much more that can be done here,
but with this in place query performance has been restored to 0.9.1
levels.
This change also uses -1 to indicate that no value is available for a
given timestamp.
With this change remote mapping no longer uses HTTP, as the HTTP ports
exposed by nodes on the cluster are not known cluster wide. The TCP
ports exposed by the cluster service are, so this change uses that
functionality. Each RemoteMapper has its own dedicated connection pool
for each node, and remote mapping TCP connections are in no way coupled
with query TCP connections.
This change significantly simplifies query executor code. Before this
change there were two types of executors -- RawExecutor and
AggregateExecutor. These two types only differed in one function,
Execute(). Otherwise all methods on the Executors were common and
duplicated between executors.
This change merges the two executors into a single type called, wait for
it, Executor, and simply switches execute functions depending on the
statement type.
The multiple checks for Mapper and Executor type -- the lack of DRYness
in this code -- meant the same checks would need to be copied. Therefore
this change, as well as fixing the bug, improves the situation a little
bit by *asking* the Mappers what type of Executor is required. This code
is still not ideal.
Fixes #3355.
With this change, the query engine code gathers information about
shards and tagsets by working with individual shards, collating the
information, and returning that to the client. It does not assume that any
particular shard is local, and accesses all shards through abstracted
Mappers, of which there are two types -- a Mapper type for Raw queries
and a second type for Aggregate queries. There are corresponding
Executors for each type of Mapper, but both types of Executors share the
same interface.
Writing points that were not sorted by time could cause very high
CPU usage and increased latencies because each point inserted would
cause the in-memory cache to be re-sorted. The worst case would be
writing a large batch of N points in reverse time order, which would
invoke N sorts of the slice.
This patch keeps track of which slices need to be sorted and sorts
them once at the end. In the previous example, the N sorts become
one. There is still a pathological case that would require N/2 sorts.
For example, 10000 points split across 5000 series, where each series
has two points in reverse time order, would still incur 5000 sorts.
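A sketch of the bookkeeping, with hypothetical names:

```go
package cache

import "sort"

type series struct {
	times     []int64
	needsSort bool
}

// add appends a timestamp and only records that the series became
// unsorted; it no longer re-sorts the slice on every insert.
func (s *series) add(t int64) {
	if n := len(s.times); n > 0 && t < s.times[n-1] {
		s.needsSort = true
	}
	s.times = append(s.times, t)
}

// flushSorts sorts each dirty series exactly once at the end of a batch.
func flushSorts(all map[string]*series) {
	for _, s := range all {
		if s.needsSort {
			sort.Slice(s.times, func(i, j int) bool { return s.times[i] < s.times[j] })
			s.needsSort = false
		}
	}
}
```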
Fixes #3159
These are not supported types, but previously they would cause the
point.Fields() func to panic. This prevents it from panicking
so the values can be ignored if needed.
When creating a point manually, the field values are interface{},
which allows unsupported types to be passed in. Previously, the
code would panic. It will now default to the string representation of
the value if it's not a known type.
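A sketch of the fallback:

```go
package points

import "fmt"

// coerceField maps a manually supplied field value to a supported type,
// defaulting to its string form instead of panicking on unknown types.
func coerceField(v interface{}) interface{} {
	switch v := v.(type) {
	case int64, float64, bool, string:
		return v
	default:
		return fmt.Sprintf("%v", v) // unsupported type: store the string representation
	}
}
```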
This commit adds a write ahead log to the shard. Entries are cached
in memory and periodically flushed back into the index. The WAL and
the cache are both partitioned into buckets so that flushing doesn't
stop the world for as long.
Field values that were out of range for the type would panic the database
when being inserted because the parser would allow them as valid points.
This change prevents those invalid values from being parsed and instead
returns an error.
An alternative fix considered was to handle the error and clamp the value
to the min/max value for the type. This would treat numeric range errors
slightly differently than other type errors, which might lead to confusion.
The simplest fix with the current parser would be to just convert each field
to the type at parse time. Unfortunately, this adds extra memory allocations
and lowers throughput significantly. Since out of range values are less common
than in-range values, some heuristics are used to determine when the more
expensive type parsing and range checking is performed. Essentially, we only
do the slow path when we cannot determine that the value is in an acceptable
type range.
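A sketch of one such heuristic (the digit-count cutoff is an assumption): only pay for the precise check when the literal is long enough to overflow.

```go
package parser

import "strconv"

// maxInt64Digits is the length of "9223372036854775807" (math.MaxInt64).
const maxInt64Digits = 19

// checkIntRange assumes digits holds an unsigned run of decimal digits.
// Short literals can never overflow an int64, so the common case skips
// the expensive parse; only long literals take the slow path.
func checkIntRange(digits []byte) error {
	if len(digits) < maxInt64Digits {
		return nil // fast path: in range by construction
	}
	_, err := strconv.ParseInt(string(digits), 10, 64)
	return err // slow path: ParseInt performs the precise range check
}
```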
Fixes #3127
A field value consisting of just a numeric value would be accepted by the
line protocol parser, but the number would be set as the field name and
the value would be nil. Instead, return an error because all field
values need a field name.
Statements were only being normalized if a default database was included
in the query (usually via the query param 'db'). However if no default
database was included, and none was an explicit part of the measurement
name, no database-existence check was run. This resulted in a later panic
with wildcard expansion.
Fixes #2960
Integers were written back to line protocol using strconv.FormatFloat,
which is incorrect. Large integers are written in scientific notation,
which causes their type to change to a float when parsed back.
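A quick illustration of the difference:

```go
package main

import (
	"fmt"
	"strconv"
)

func main() {
	v := int64(1234567890123456789)

	// Old path: round-tripping through a float loses precision and
	// switches to scientific notation for large values.
	fmt.Println(strconv.FormatFloat(float64(v), 'g', -1, 64)) // 1.2345678901234568e+18

	// Fix: format integers as integers so they parse back as integers.
	fmt.Println(strconv.FormatInt(v, 10)) // 1234567890123456789
}
```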
Supported boolean values are now t, T, true, TRUE, f, F, false, and
FALSE. This is what the strconv.ParseBool function supports with
the exception of 1 and 0. 1 and 0 would be parsed as ints in the
line protocol.
Previously, any non-true value would be parsed as false. e.g.
value=blah would parse to false. This will now return an error at
parse time.
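A sketch of the stricter parse:

```go
package parser

import "fmt"

// parseBool accepts exactly the supported literals and rejects anything
// else instead of silently mapping unknown input to false. 1 and 0 are
// excluded on purpose: the line protocol parses those as ints.
func parseBool(s string) (bool, error) {
	switch s {
	case "t", "T", "true", "TRUE":
		return true, nil
	case "f", "F", "false", "FALSE":
		return false, nil
	}
	return false, fmt.Errorf("invalid boolean value %q", s)
}
```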
Adds more tests for invalid numbers such as 0.1a and -2.-4, as well as
tests for supported formats for negative and positive integers/floats
and for scientific notation.
Fixes #2869
When adding a new field to an existing measurement, Shard.validateSeriesAndFields
would also encode the fields as a side effect. In the case of a new field
that needed to be created, the encoding would fail because the field type
had not been created for the measurement yet. The fields are re-encoded
after validateSeriesAndFields returns and after the field encoding has been
set up properly, so this additional encoding during validation isn't necessary.
Fixes a panic on writes because the field value was not parsed correctly.
panic: unsupported value type during encode fields: <nil>
goroutine 117 [running]:
github.com/influxdb/influxdb/tsdb.(*FieldCodec).EncodeFields(0xc2081c4020, 0xc2081dc180, 0x0, 0x0, 0x0, 0x0, 0x0)
/Users/jason/go/src/github.com/influxdb/influxdb/tsdb/shard.go:573 +0x8e3
* Add deleteMeasurement to store and shard
* Add DropMeasurement to DatabaseIndex
* Update ErrMeasurementNotFound and ErrDatabaseNotFound to not include the first line of the stack trace.