Previously, pseudo iterators could be created for metadata such as
series, measurement, and tag data. These iterators were created at a
higher level and lacked much of the power of the query engine.
This commit moves system iterators down to the series level and
supports the following:
- _name
- _seriesKey
- _tagKey
- _tagValue
- _fieldKey
These can be used as normal fields such as:
SELECT _seriesKey FROM cpu
This will return all the series keys for `cpu`.
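Following the same pattern, the other system fields can be queried the
same way (illustrative examples, not an exhaustive list):
    SELECT _tagKey FROM cpu
    SELECT _fieldKey FROM cpu
These would return the tag keys and field keys for `cpu`, respectively.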
* introduced UnsignedValue type
* leveraged existing int64 compression algorithms (RLE, Simple 8B)
* tsm and WAL can read and write UnsignedValue
* compaction is aware of UnsignedValue
* unsigned support added to models, cursors, and write points
NOTE: there is no support for creating unsigned points, as the line
protocol has not been modified.
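A minimal sketch of the reuse idea, assuming unsigned values are stored by
reinterpreting their bit pattern as int64 so the existing integer encoders
can be applied unchanged (illustrative, not the actual tsm1 encoder API):

    package sketch

    // Reinterpret a uint64 as an int64 so it can flow through the existing
    // int64 compression path (RLE / Simple 8B), and reverse the cast on
    // decode. The conversion preserves the bit pattern in both directions.
    func unsignedToInt64(u uint64) int64 { return int64(u) }

    func int64ToUnsigned(i int64) uint64 { return uint64(i) }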
The series key stored in TSM files includes the field. We validated
the series key length using only the measurement and tag set, which
allowed very large field names to overflow. The check now computes the
series key length as measurement + tag set + field + the TSM field key
separator size.
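A sketch of the stricter check (names are illustrative, not the actual
tsdb code):

    package sketch

    // seriesKeyTooLong reports whether the key as stored in a TSM file,
    // i.e. measurement + tag set + separator + field key, exceeds the
    // maximum key length, rather than checking measurement + tag set alone.
    func seriesKeyTooLong(measurementAndTags, fieldKey []byte, sepLen, maxKeyLen int) bool {
        return len(measurementAndTags)+sepLen+len(fieldKey) > maxKeyLen
    }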
The measurement name and field were converted between []byte and string
repeatedly, causing lots of garbage. This switches the code to use
[]byte in the write path.
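A small sketch of the pattern (hypothetical names): keep []byte through
the write path and only form strings where Go can do so without copying,
such as map lookups.

    package sketch

    // fieldRef keeps the measurement name and field key as []byte slices of
    // the parsed point instead of converting them to strings along the way.
    type fieldRef struct {
        Measurement []byte
        FieldKey    []byte
    }

    // lookup indexes a map with string(b) directly; the compiler recognizes
    // this pattern and performs the lookup without allocating a new string.
    func lookup(m map[string]int, b []byte) (int, bool) {
        v, ok := m[string(b)]
        return v, ok
    }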
The Point is intended to be immutable after being parsed since it
is shared by several goroutines. When dropping a field (e.g. time),
corrupted data can result if one goroutine is deleting the field
while another is marshaling the underlying byte slices.
To avoid this, the shard will just skip invalid fields and series
instead of trying to mutate them by deleting them.
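A sketch of the skip-instead-of-mutate approach (the iterator shape is an
assumption, not the exact models API):

    package sketch

    // pointFields is a stand-in for the point's field iterator.
    type pointFields interface {
        Next() bool
        FieldKey() []byte
    }

    // writeFields treats the shared Point as read-only: fields named "time"
    // are skipped during iteration instead of being deleted from the Point,
    // so concurrent marshaling never observes a half-mutated byte slice.
    func writeFields(it pointFields, store func(key []byte)) {
        for it.Next() {
            if string(it.FieldKey()) == "time" {
                continue
            }
            store(it.FieldKey())
        }
    }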
There was a check to ensure that fields exist when unmarshalBinary
is called. This created a map and other garbage just to see if any
fields exist.
This changes it to use a FieldIterator that does not allocate as
much as the previous approach.
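The idea in miniature (assumed iterator shape): advancing the iterator a
single step answers "does this point have any fields?" without building a
throwaway map.

    package sketch

    // hasFields reports whether at least one field is present by advancing
    // the field iterator once.
    func hasFields(it interface{ Next() bool }) bool {
        return it.Next()
    }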
If a field named time was written and subsequently dropped, it could
leave a trailing comma in the series key, causing it to be unparseable
in other parts of the code.
Previously, tags had a `shouldCopy` flag to indicate if those tags
referenced an underlying buffer and should be copied to allow GC.
Unfortunately, this prevented tags that referenced the mmap from being
copied, which caused segfaults.
This change removes the `shouldCopy` flag and replaces it with a
`forceCopy` argument in `CreateSeriesIfNotExists()`. This allows
the write path to indicate that tags must be cloned on insert.
This change delays Tag cloning until a new series is found, and will
only clone Tags acquired from `ParsePoints...` and not those referencing
the mmap-ed files (TSM) that are created on startup.
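A sketch of the forceCopy flow (simplified types and signature, not the
actual tsdb index API):

    package sketch

    type tags map[string]string

    func (t tags) clone() tags {
        c := make(tags, len(t))
        for k, v := range t {
            c[k] = v
        }
        return c
    }

    type index struct{ series map[string]tags }

    // createSeriesIfNotExists clones the tags only when the caller asks for
    // it (the write path passes forceCopy=true for tags parsed from request
    // bodies) and only when a new series is actually being created. Tags
    // that reference mmap-ed TSM data loaded at startup pass false and are
    // stored as-is.
    func (i *index) createSeriesIfNotExists(key string, t tags, forceCopy bool) {
        if _, ok := i.series[key]; ok {
            return
        }
        if forceCopy {
            t = t.clone()
        }
        i.series[key] = t
    }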
This leak appears to have been introduced in 8aa224b22d and is
present in 1.1.0 and 1.1.1.
When points were parsed from HTTP payloads, their tags and fields
referred to subslices of the request body; if any tag set introduced a
new series, those tags were then stored in the in-memory series
index objects, preventing the HTTP body from being garbage collected. If
there were no new series in the payload, then the request body would be
garbage collected as usual.
Now, we clone the tags before we store them in the index. This is an
imperfect fix because the Point still holds references to the original
tags, and the Point's field iterator also refers to the payload buffer.
However, the current write code path does not retain references to the
Point or its fields; and this change will likely be obsoleted when TSI
is introduced.
This change likely fixes #7827, #7810, #7778, and perhaps others.
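A sketch of why the subslices pinned the payload and how cloning detaches
them (simplified tag type, not the models API):

    package sketch

    // tag's Key and Value are slices into the HTTP request body's backing
    // array, so storing a tag in the series index keeps the entire body
    // reachable. Cloning copies only the bytes that are actually needed,
    // letting the body be garbage collected.
    type tag struct{ Key, Value []byte }

    func cloneTag(t tag) tag {
        return tag{
            Key:   append([]byte(nil), t.Key...),
            Value: append([]byte(nil), t.Value...),
        }
    }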
I haven't been able to reproduce creating a point without any fields,
but we've seen points in the wild that have been marshalled with no
fields - that is, the length header for fields is uint32(0) and a
well-formed encoded time follows.
Attempting to unmarshal points via NewPointFromBytes returns
ErrPointMustHaveAField, so it seems better to fail earlier with the same
error, rather than allowing those points to be serialized in the first
place.
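A sketch of failing earlier (the error variable here is illustrative; the
real code returns ErrPointMustHaveAField):

    package sketch

    import "errors"

    // errPointMustHaveAField stands in for the error mentioned above.
    var errPointMustHaveAField = errors.New("point must have at least one field")

    // checkFieldCount rejects a zero field count on the write/serialize side
    // instead of letting NewPointFromBytes discover it on the read side.
    func checkFieldCount(n uint32) error {
        if n == 0 {
            return errPointMustHaveAField
        }
        return nil
    }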
A string field with a trailing backslash before the closing quote would
parse incorrectly because the quote was seen as escaped. We have to treat
\\ as an escape sequence within strings in order to handle this.
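A sketch of the scanning rule (not the actual parser): both \" and \\ must
be consumed as escape sequences so a trailing backslash cannot swallow the
closing quote.

    package sketch

    // scanQuoted returns the index just past the closing quote of a quoted
    // string starting at buf[start] (which must be '"'). Because '\\' is
    // treated as an escape, a value such as "dir\\" closes correctly rather
    // than the final backslash appearing to escape the closing quote.
    func scanQuoted(buf []byte, start int) int {
        for i := start + 1; i < len(buf); i++ {
            switch buf[i] {
            case '\\':
                i++ // skip the escaped character, whatever it is
            case '"':
                return i + 1
            }
        }
        return len(buf) // unterminated string
    }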
The FieldIterator is used to scan over the fields of a point, exposing
information about each field while delaying parsing/decoding of the value
until it is needed.
This change uses the new type to avoid allocating a map of the fields
that is thrown away as soon as the points are converted into columns
within the datastore.
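A sketch of the map-free conversion (the iterator shape and field types
are assumptions, not the exact models.FieldIterator API):

    package sketch

    type fieldType int

    const (
        integerType fieldType = iota
        floatType
    )

    type fieldIterator interface {
        Next() bool
        FieldKey() []byte
        Type() fieldType
        IntegerValue() int64
        FloatValue() float64
    }

    // appendColumns reads each field lazily and appends it to the matching
    // column, instead of first building a map[string]interface{} that is
    // discarded as soon as the columns are built.
    func appendColumns(it fieldIterator, ints map[string][]int64, floats map[string][]float64) {
        for it.Next() {
            key := string(it.FieldKey())
            switch it.Type() {
            case integerType:
                ints[key] = append(ints[key], it.IntegerValue())
            case floatType:
                floats[key] = append(floats[key], it.FloatValue())
            }
        }
    }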
+ Remove a heap alloc in (Point).HashID() and (Row).tagsHash()
(According to `-gcflags -m`).
+ Direct port from the stdlib.
+ Fuzz test for equivalence to stdlib version.
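Presumably the hash in question is 64-bit FNV-1a (what `hash/fnv`
provides); a minimal allocation-free port looks roughly like this, using a
plain uint64 accumulator instead of a heap-allocated hash.Hash64:

    package sketch

    const (
        fnvOffset64 = 14695981039346656037
        fnvPrime64  = 1099511628211
    )

    // fnv64a hashes data with 64-bit FNV-1a without allocating.
    func fnv64a(data []byte) uint64 {
        h := uint64(fnvOffset64)
        for _, b := range data {
            h ^= uint64(b)
            h *= fnvPrime64
        }
        return h
    }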
+ Save one alloc per line when writing with the bulk protocol.
Over a longer period of writes, this allocation shows up quite
a bit in profiles since the slice needs to be resized frequently.
This change scans the input to count how many lines will be parsed in
order to pre-allocate the slice's capacity. It's slightly slower, but
creates less garbage in the long run.
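A sketch of the pre-allocation (simplified: the real code parses points,
not raw lines):

    package sketch

    import "bytes"

    // parseLines sizes the result slice up front by counting newlines, so
    // appending while parsing a large bulk-write body never has to grow and
    // re-copy the slice.
    func parseLines(buf []byte) [][]byte {
        lines := make([][]byte, 0, bytes.Count(buf, []byte{'\n'})+1)
        for len(buf) > 0 {
            nl := bytes.IndexByte(buf, '\n')
            var line []byte
            if nl < 0 {
                line, buf = buf, nil
            } else {
                line, buf = buf[:nl], buf[nl+1:]
            }
            if len(line) > 0 {
                lines = append(lines, line)
            }
        }
        return lines
    }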
The v2 UDP client will attempt to split points that exceed the
configured payload size. It will only do this for points that have a
timestamp specified.
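A sketch of the splitting rule (simplified, not the v2 client's actual
code): each split line keeps the measurement, tags, and the point's
explicit timestamp and carries a subset of the fields; without an explicit
timestamp the split halves could be stamped with different server times,
which is presumably why such points are not split.

    package sketch

    // splitFields breaks an oversized point into several line-protocol
    // lines, each under maxPayload bytes, all sharing the same prefix
    // (measurement and tags) and timestamp.
    func splitFields(prefix string, fields []string, timestamp string, maxPayload int) []string {
        var lines []string
        cur := ""
        for _, f := range fields {
            candidate := cur + "," + f
            if cur == "" {
                candidate = prefix + " " + f
            }
            if cur != "" && len(candidate)+1+len(timestamp) > maxPayload {
                lines = append(lines, cur+" "+timestamp)
                cur = prefix + " " + f
            } else {
                cur = candidate
            }
        }
        if cur != "" {
            lines = append(lines, cur+" "+timestamp)
        }
        return lines
    }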