9526 Commits (4d98a1cf28c5eb64caa194ea284f190ee380fce6)

Author SHA1 Message Date
Jon Seymour 4d98a1cf28 tsm: cache: remove unnecessary lock escalation.
Previously, we needed a write lock on the cache because it was the
only lock we had available to guard updates to entry.values and
entry.needSort.

However, now that we have an entry-scoped lock for this purpose, we don't
need the cache write lock. Since merged() doesn't
modify the .store or the c.snapshot.sort, there is no need for
a write lock on the cache to protect the cache.

So, we don't need to escalate here - we simply rely on the entry lock
to protect the entries we are iterating over.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
2016-02-26 01:31:54 +11:00
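A minimal sketch of the locking pattern this commit describes, assuming hypothetical `entry` and `cache` types (these are illustrative, not the actual tsm1 definitions). The reader takes only the cache read lock to look up an entry and relies on the entry's own mutex while sorting its values:

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// entry is a hypothetical stand-in for a cache entry. It carries its own
// lock so that values and needSort can change without the cache write lock.
type entry struct {
	mu       sync.Mutex
	values   []int64
	needSort bool
}

// sorted sorts the values in place if required, guarded by the entry lock
// only, and returns a copy of them.
func (e *entry) sorted() []int64 {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.needSort {
		sort.Slice(e.values, func(i, j int) bool { return e.values[i] < e.values[j] })
		e.needSort = false
	}
	out := make([]int64, len(e.values))
	copy(out, e.values)
	return out
}

// cache is a hypothetical stand-in for the cache; mu guards the store map.
type cache struct {
	mu    sync.RWMutex
	store map[string]*entry
}

// values needs only a read lock on the cache because the map itself is
// not modified; the entry lock protects the entry being sorted.
func (c *cache) values(key string) []int64 {
	c.mu.RLock()
	e := c.store[key]
	c.mu.RUnlock()
	if e == nil {
		return nil
	}
	return e.sorted()
}

func main() {
	c := &cache{store: map[string]*entry{
		"cpu": {values: []int64{3, 1, 2}, needSort: true},
	}}
	fmt.Println(c.values("cpu"))
}
```

The design choice illustrated here is that lock scope follows the data being mutated: the map is read-only on this path, so only the per-entry state needs exclusive access.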
Jason Wilder 452d77cbaf tsm: cache: introduce entry locks.
Based on @jwilder's alternative to the 'dirty' slice that featured
in previous iterations of this fix.

Suggested-by: Jason Wilder <jason@influxdb.com>
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
2016-02-26 00:05:38 +11:00
Jon Seymour eb7eec078d tsm: cache: introduce commit lock to Cache
Currently two compactors can execute Engine.WriteSnapshot at once.

This isn't thread safe since both threads want to make modifications to
Cache.snapshot at the same time.

This commit introduces a lock which is acquired during Snapshot() and
released during ClearSnapshot(), ensuring that at most one thread
executes within Engine.WriteSnapshot() at once.

To ensure that we always release this lock, but only release the
snapshot resources on a successful commit, we modify ClearSnapshot() to
accept a boolean which indicates whether the write was successful or not
and guarantee to call this function if Snapshot() has been called.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
2016-02-25 12:10:37 +11:00
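The Snapshot()/ClearSnapshot() pairing described in this commit could be sketched roughly as follows. The `cache` type, its fields, and the `writeSnapshot` helper are assumptions made for illustration; only the method names and the boolean parameter come from the commit message:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errDisk is a hypothetical error used to simulate a failed write.
var errDisk = errors.New("disk full")

// cache is a hypothetical, simplified stand-in for the tsm1 Cache.
type cache struct {
	commit   sync.Mutex // held from Snapshot() until ClearSnapshot()
	mu       sync.Mutex // guards store and snapshot
	store    []string
	snapshot []string
}

// Snapshot acquires the commit lock and moves the live store into the
// snapshot. A second caller blocks here until ClearSnapshot is called.
func (c *cache) Snapshot() []string {
	c.commit.Lock()
	c.mu.Lock()
	defer c.mu.Unlock()
	c.snapshot = append(c.snapshot, c.store...)
	c.store = nil
	return c.snapshot
}

// ClearSnapshot always releases the commit lock, but discards the
// snapshot data only if the write succeeded.
func (c *cache) ClearSnapshot(success bool) {
	c.mu.Lock()
	if success {
		c.snapshot = nil
	}
	c.mu.Unlock()
	c.commit.Unlock()
}

// writeSnapshot shows the calling pattern: once Snapshot() has been
// called, ClearSnapshot() is guaranteed to run via defer.
func writeSnapshot(c *cache, write func([]string) error) (err error) {
	snap := c.Snapshot()
	defer func() { c.ClearSnapshot(err == nil) }()
	return write(snap)
}

func main() {
	c := &cache{store: []string{"a", "b"}}
	err := writeSnapshot(c, func([]string) error { return errDisk })
	fmt.Println(err, len(c.snapshot)) // the snapshot is retained after a failure
	err = writeSnapshot(c, func([]string) error { return nil })
	fmt.Println(err, len(c.snapshot)) // the snapshot is released after success
}
```

Using a named return value with `defer` is one way to make sure the lock is released on every path while still passing the outcome of the write to ClearSnapshot().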
Jon Seymour 45d025db99 tsm: cache: add tests to demonstrate thread safety vulnerabilities
There are two tests, each of which shows a different vulnerability.

One test shows that Cache.Deduplicate modifies entries in a snapshot's
store without a lock while cache readers are deduplicating those same
entries while correctly locked.

A second test shows that two threads trying to execute the methods
that Engine.WriteSnapshot calls will cause concurrent, unsynchronized
mutating access to the snapshot's store and entries.

The tests fail at this commit and are fixed by subsequent commits.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
2016-02-25 12:10:31 +11:00
Jon Seymour d7d81f79da tsm: cache: add a test that demonstrates concurrent reads are safe
Signed-off-by: Jon Seymour <jon@wildducktheories.com>
2016-02-25 12:06:10 +11:00
Jason Wilder e32e5ff481 Merge pull request #5807 from jonseymour/jss-5804+5805
tsm: cache: undo statistics regressions #5804, #5805.
2016-02-23 13:46:27 -07:00
Jon Seymour 530b86ba7d tsm: cache: restore the semantics of cachedBytes and memSize stats
Fixes #5805.

This commit undoes a regression introduced by #5789.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
2016-02-24 06:16:46 +11:00
Jon Seymour 3475356dc9 tsm: cache: fix semantics of snapshotCount statistic to make it useful.
Fix for #5804.

The commit for #5789 rendered the semantics of snapshotCount statistic
useless. This commit restores semantics that have diagnostic value to
this statistic.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
2016-02-24 06:13:54 +11:00
Jason Wilder e0b23fd5b0 Merge pull request #5765 from oiooj/master
No need to check Meta.Dir twice
2016-02-23 11:20:24 -07:00
Jason Wilder 92ae2a0e2d Merge pull request #5787 from chris-ramon/handler-query-authorizer
Improvement on `run.NewServer` related to `meta.QueryAuthorizer`.
2016-02-23 11:11:56 -07:00
Gunnar 54957a4838 Merge pull request #5785 from influxdata/ga-build
Implement --generate option in build script
2016-02-23 09:54:39 -08:00
Jason Wilder e7c29d5a37 Merge pull request #5789 from influxdata/jw-5686
Simplify cache snapshotting
2016-02-23 10:46:54 -07:00
gunnaraasen e2d83e53cc Implement -generate option in build script 2016-02-23 09:08:25 -08:00
Jason Wilder 017c24c98e Simplify cache snapshotting
The Cache had support for taking multiple snapshots to support writing
multiple snapshots to TSM files concurrently if that happened to be
a bottleneck.  In practice, this is never a bottleneck and we only
run one snapshotting goroutine continuously per shard which has worked
well for all workloads.

The multiple snapshot support introduces some unhandled failure scenarios
where wal segments could be removed without writing them to TSM files.  If
a snapshot compaction fails to write due to transient disk errors, subsequent
snapshots will continue, but the failed one will not be retried.  When the
subsequent ones succeeded, all closed wal segments are removed causing data
loss.

This change simplifies the snapshotting capability to ensure that there is only
ever one snapshot.  If one fails, the next snapshot will update the existing
snapshot and retry all of the old and new data.

Fixes #5686
2016-02-23 09:38:51 -07:00
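As a rough illustration of the single-snapshot behaviour described in this commit, the sketch below folds new writes into a snapshot that was retained after a failed write, so the retry covers both old and new data. The `cache` type and its methods are hypothetical, and locking is omitted to keep the example short:

```go
package main

import "fmt"

// cache is hypothetical: one live store and at most one snapshot.
type cache struct {
	store    map[string][]float64
	snapshot map[string][]float64
}

// write appends a value to the live store.
func (c *cache) write(key string, v float64) {
	if c.store == nil {
		c.store = map[string][]float64{}
	}
	c.store[key] = append(c.store[key], v)
}

// takeSnapshot folds the live store into the existing snapshot. The
// existing snapshot is non-empty only if the previous write failed.
func (c *cache) takeSnapshot() map[string][]float64 {
	if c.snapshot == nil {
		c.snapshot = map[string][]float64{}
	}
	for k, vs := range c.store {
		c.snapshot[k] = append(c.snapshot[k], vs...)
	}
	c.store = nil
	return c.snapshot
}

// clearSnapshot is called only after the snapshot was written successfully.
func (c *cache) clearSnapshot() { c.snapshot = nil }

func main() {
	c := &cache{}
	c.write("cpu", 1)
	c.takeSnapshot() // assume the write of this snapshot failed: no clearSnapshot
	c.write("cpu", 2)
	retry := c.takeSnapshot() // contains both the old and the new point
	fmt.Println(len(retry["cpu"]))
	c.clearSnapshot()
}
```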
Jason Wilder 0df6d558c2 Merge pull request #5800 from influxdata/jw-5757-regression
Fix data nodes not getting created
2016-02-23 09:22:03 -07:00
Jason Wilder 9ead458399 Fix data nodes not getting created
This fixes a regression introduced in #5757 due to the node.ID getting
assigned by both the meta and data services.  When both roles are active,
the data CreateDataNode path was not getting called because a node ID was
already assigned.

This fixes the issue by seeing if a DataNode already exists for our node
ID, and if it does not, we create one.
2016-02-23 09:01:02 -07:00
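The check described in this commit amounts to "create the data node only if none exists for our node ID". A minimal sketch, assuming a hypothetical in-memory `metaStore` (not the real meta service API):

```go
package main

import "fmt"

// dataNode and metaStore are hypothetical stand-ins used for illustration.
type dataNode struct{ ID uint64 }

type metaStore struct{ dataNodes map[uint64]dataNode }

// ensureDataNode creates a data node for id only if one does not already
// exist, and reports whether it created one. The decision is based on the
// presence of the data node, not on whether a node ID has been assigned.
func (m *metaStore) ensureDataNode(id uint64) bool {
	if _, ok := m.dataNodes[id]; ok {
		return false
	}
	if m.dataNodes == nil {
		m.dataNodes = map[uint64]dataNode{}
	}
	m.dataNodes[id] = dataNode{ID: id}
	return true
}

func main() {
	m := &metaStore{}
	// The node ID was already assigned by the meta service, but no data
	// node exists yet, so the first call creates one.
	fmt.Println(m.ensureDataNode(1))
	fmt.Println(m.ensureDataNode(1))
}
```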
Chris Ramón f235852c0b updates changelog 2016-02-23 00:03:32 -05:00
Chris Ramón e52accaf90 adds missing srv.Handler.QueryAuthorizer 2016-02-23 00:02:48 -05:00
Jason Wilder 2894234b1e Merge pull request #5757 from influxdata/jw-cluster
Meta node only fixes
2016-02-22 15:44:07 -07:00
Jonathan A. Sternberg 50753de032 Merge pull request #5782 from influxdata/js-5777-audit-panics-in-influxql
Remove the non-unreachable panics in the new query engine
2016-02-22 17:18:57 -05:00
Jason Wilder 6f39b355bc Code cleanups 2016-02-22 15:06:05 -07:00
Jason Wilder a2d3d44505 Fix creating meta only nodes
This fixes a couple of issues with starting meta-only nodes.

1. We were always calling CreateDataNode regardless of whether the
node is running data services.  We only call that now when the node is
data enabled.
2. The node.json was created alongside creating the data node. Since
we are not creating a data node, this didn't happen anymore.  There
wasn't a simple way to do this in one place so it's actually handled
when creating a meta or a data node now.  Since the ID assigned
to the node is the same regardless of role this works in all combinations
of roles.
3. The JoinMetaServer didn't return the ID of the joining node which
created some races when multiple nodes were joining.  The join call now
returns that information to the caller.

Fixes #5754
2016-02-22 15:06:05 -07:00
Jason Wilder 194d8d4693 Ensure monitor store is disabled for meta only nodes
We can't store points locally so ensure it's disabled for now.
2016-02-22 15:05:47 -07:00
Jason Wilder a437002969 Fix join option in config file
The join option was incorrectly exposed on the meta config.  It should
be at the top-level as a string and propagate down to the meta config
as a slice.
2016-02-22 15:05:46 -07:00
Mark Rushakoff 7f457b8852 Merge pull request #5786 from influxdata/mr-fix-tsm1-test-compilation
Fix non-compiling test
2016-02-22 14:04:05 -08:00
Mark Rushakoff 191de2670c Fix non-compiling test 2016-02-22 13:49:11 -08:00
Mark Rushakoff fc5c8597ab Merge pull request #5758 from influxdata/mr-disk-stats
Track cache, WAL, filestore stats within tsm1 engine
2016-02-22 13:01:55 -08:00
Mark Rushakoff 688863cec5 Update changelog 2016-02-22 12:51:52 -08:00
Jason Wilder e25b5abf61 Merge pull request #5751 from influxdata/jw-5719
Fix cache not deduplicating points in some cases
2016-02-22 13:41:17 -07:00
Jason Wilder aa2e878019 Fix cache not deduplicating points in some cases
The cache had some incorrect logic for determining when a series needed
to be deduplicated.  The logic was checking for unsorted points and
not considering duplicate points.  This would manifest itself as many
duplicate points being returned from the cache and after a
snapshot compaction run, the points would disappear because snapshot
compaction always deduplicates and sorts the points.

Added a test that reproduces the issue.

Fixes #5719
2016-02-22 13:24:42 -07:00
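The difference between the two checks can be sketched as below. Both functions are hypothetical and operate on bare timestamps; they are meant to illustrate the kind of logic error described, not to reproduce the actual cache code:

```go
package main

import "fmt"

// unsortedOnly mimics the kind of check described as incorrect: it
// detects out-of-order timestamps but misses equal neighbours.
func unsortedOnly(ts []int64) bool {
	for i := 1; i < len(ts); i++ {
		if ts[i] < ts[i-1] {
			return true
		}
	}
	return false
}

// needsDedup also treats equal neighbouring timestamps as a reason to
// deduplicate: the sequence must be strictly increasing to be skipped.
func needsDedup(ts []int64) bool {
	for i := 1; i < len(ts); i++ {
		if ts[i] <= ts[i-1] {
			return true
		}
	}
	return false
}

func main() {
	ts := []int64{1, 2, 2} // sorted, but contains a duplicate timestamp
	fmt.Println(unsortedOnly(ts), needsDedup(ts))
}
```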
Jonathan A. Sternberg 7a03df2af1 Remove the non-unreachable panics in the new query engine
The only panics left are ones that should be unreachable unless there is
a bug.

Fixes #5777.
2016-02-22 12:52:43 -05:00
Mark Rushakoff c7223157a6 Merge commit 'c93da21' into mr-disk-stats 2016-02-22 09:32:56 -08:00
Jonathan A. Sternberg b6a0b6a65a Merge pull request #5742 from influxdata/js-ensure-non-empty-column-names
Ensure column names get implicitly renamed with conflicts
2016-02-22 08:55:38 -05:00
Jonathan A. Sternberg 87e04b1a46 Merge pull request #5776 from influxdata/js-5773-unsupported-call-panic
Replace a panic with returning an error when an unsupported call is used
2016-02-22 08:17:59 -05:00
Edd Robinson 10b9befd82 Merge pull request #5716 from jonseymour/js-tolerate-empty-field-names
models: tolerate empty field names when unpacking binary points
2016-02-22 12:44:45 +00:00
Jon Seymour c93da21a61 tsm: cache: only use NewCache for the engine cache; snapshots use a simpler constructor
The intent of this change is to avoid writing caches created for
snapshot cache instances into the tsm1_cache measurement. We can do
this by avoiding use of the NewCache constructor. All other methods
are only intended to be called on the engine cache - never
on a snapshot.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
2016-02-22 15:17:43 +11:00
Jonathan A. Sternberg 6982d5310e Replace a panic with returning an error when an unsupported call is used
Fixes #5773.
2016-02-21 19:39:14 -05:00
Mark Rushakoff 2ab79e75eb Merge pull request #5775 from jonseymour/jss-5499-extend-tsm-cache-stats
tsm: cache: during writes, update the memSize statistic outside the lock
2016-02-21 14:36:56 -08:00
Jon Seymour 510ee2c790 tsm: cache: during writes, update the memSize statistic outside the lock
Since we are not locking but relying on atomic arithmetic,
use Add rather than Set. Will also result in slightly less garbage
being created.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
2016-02-22 08:26:35 +11:00
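A small sketch of why an atomic add suits a counter that is updated outside a lock. The `stats` type below uses `sync/atomic` directly and is an assumption for illustration; the commit does not show which statistics type the cache actually uses:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// stats is a hypothetical holder for the memSize statistic.
type stats struct{ memSize int64 }

// addMemSize applies a delta atomically. A read-modify-Set sequence done
// without a lock could lose updates when two writers interleave.
func (s *stats) addMemSize(delta int64) { atomic.AddInt64(&s.memSize, delta) }

// concurrentAdds runs n goroutines that each add delta and returns the total.
func concurrentAdds(n int, delta int64) int64 {
	var s stats
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.addMemSize(delta)
		}()
	}
	wg.Wait()
	return atomic.LoadInt64(&s.memSize)
}

func main() {
	fmt.Println(concurrentAdds(100, 8))
}
```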
Mark Rushakoff feceb4dae1 Merge pull request #5769 from jonseymour/jss-5499-extend-tsm-cache-stats
tsm: cache: ensure all statistics are initialised on cache creation.
2016-02-21 07:25:49 -08:00
Jon Seymour 9c6efe99f1 tsm: cache: ensure all statistics are initialised on cache creation.
The intent of this change is to ensure that all statistic fields of the
resulting tsm1_cache measurement are initialized on initialization of
the cache. That way, any consumer of those measurements doesn't
have to deal with the null case.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
2016-02-21 15:33:50 +11:00
Mark Rushakoff 04645188fa Merge pull request #5762 from jonseymour/jss-5499-extend-tsm-cache-stats
tsm: cache: add cache throughput related statistics.
2016-02-20 10:15:17 -08:00
Jon Seymour a8877badcd Update CHANGELOG for #5716, #5664
Note that this series also includes cherry-pick of #5697.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
2016-02-21 03:56:37 +11:00
oiooj f1c027543c No need to check Meta.Dir twice 2016-02-20 23:54:24 +08:00
Jon Seymour d46e0407a0 Merge #5716
RHS merges cleanly with 0.10.0 maintenance branch.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
2016-02-20 22:24:03 +11:00
Jon Seymour 9491846047 models: improve handling of points with empty field names or with no fields
Influx does not support fields with empty names or points
with no fields.

NewPoint is changed to validate that all field names are non-empty.

AddField is removed because we now require that all fields are
specified on construction.

NewPointFromByte is changed to return an error if an unmarshaled
binary point does not have any fields.

newFieldsFromBinary is changed to prevent an infinite loop that
can arise while attempting to parse corrupt binary point data.

TestNewPointsWithBytesWithCorruptData is changed to reflect the
change in the behaviour of NewPointFromByte.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
2016-02-20 22:22:26 +11:00
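The validation rules stated in this commit (no empty field names, no points without fields) could be expressed along these lines. `validateFields` and the two error values are hypothetical names, not part of the models package:

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical error values for the two rejected cases.
var (
	errNoFields       = errors.New("point has no fields")
	errEmptyFieldName = errors.New("field name is empty")
)

// validateFields is a hypothetical helper that rejects a field set with
// no fields or with any empty field name.
func validateFields(fields map[string]interface{}) error {
	if len(fields) == 0 {
		return errNoFields
	}
	for name := range fields {
		if name == "" {
			return errEmptyFieldName
		}
	}
	return nil
}

func main() {
	fmt.Println(validateFields(map[string]interface{}{"value": 1.0}))
	fmt.Println(validateFields(map[string]interface{}{"": 1.0}))
	fmt.Println(validateFields(nil))
}
```

Validating at construction time is what makes it possible to drop a later AddField-style mutation: a point that has passed these checks cannot subsequently become invalid.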
Jon Seymour 6697c721fb tsm: cache: add cache throughput related statistics.
Complementing and extending the changes in #5758.

Add 2 level statistics:

  * snapshotCount
  * cacheAgeMs

Add 2 counter statistics

  * cachedBytes
  * WALCompactionTimeMs

snapshotCount can be used to measure transient write errors that are causing snapshots to accumulate.

cacheAgeMs can be used to gauge the level of write activity into the cache.

The differences between cachedBytes stats sampled at different times can be used to calculate cache throughput rates.

The ratio (cachedBytes-diskBytes)/WALCompactionTimeMs can be used to calculate WAL compaction throughput.

The ratio of the difference between the first and last WAL compaction times to the interval
length is an estimate of the percentage of cache throughput consumed.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>
2016-02-20 22:18:57 +11:00
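Two of the derived measures mentioned in this commit can be computed from a pair of samples, as sketched below. The `sample` struct and the `atMs` sampling timestamp are assumptions; only the statistic names cachedBytes and WALCompactionTimeMs come from the commit message:

```go
package main

import "fmt"

// sample is a hypothetical snapshot of two counter statistics taken at
// time atMs (milliseconds).
type sample struct {
	atMs                int64
	cachedBytes         int64
	walCompactionTimeMs int64
}

// cacheThroughput returns bytes per second written into the cache
// between two samples, from the difference in cachedBytes.
func cacheThroughput(a, b sample) float64 {
	seconds := float64(b.atMs-a.atMs) / 1000
	return float64(b.cachedBytes-a.cachedBytes) / seconds
}

// compactionFraction returns the share of the interval that was spent in
// WAL compaction, from the difference in WALCompactionTimeMs.
func compactionFraction(a, b sample) float64 {
	return float64(b.walCompactionTimeMs-a.walCompactionTimeMs) / float64(b.atMs-a.atMs)
}

func main() {
	first := sample{atMs: 0, cachedBytes: 0, walCompactionTimeMs: 0}
	last := sample{atMs: 10000, cachedBytes: 5000000, walCompactionTimeMs: 2500}
	fmt.Println(cacheThroughput(first, last), compactionFraction(first, last))
}
```

Because both statistics are counters, only differences between samples are meaningful; the absolute values depend on how long the process has been running.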
Mark Rushakoff 602043e11b Add disk stats for FileStore 2016-02-19 16:37:34 -08:00
Mark Rushakoff d99c09cedd Add stats for current and old WAL segment sizes 2016-02-19 16:37:34 -08:00
Mark Rushakoff e76967efb6 Add stats to tsm1.Cache 2016-02-19 16:37:34 -08:00