fix(cloud-iox): schema design corrections (closes #4851): (#4888)

* fix(cloud-iox): typo

* fix(cloud-iox): fix typos and cleanup SQL descriptions

* fix(cloud-iox): schema design corrections (closes #4851):

- Update IOx schema design best practice with feedback from @pauldix:
  - Timestamp
  - Primary key and complexity
  - Wide schema correction
  - Sparse schema clarification
  - Remove explicit bucket schemas
  - "nearly infinite" was replaced in an earlier commit.
- Add timestamp references in Glossary.

* Update content/influxdb/cloud-iox/write-data/best-practices/schema-design.md

Co-authored-by: Scott Anderson <sanderson@users.noreply.github.com>

---------

Co-authored-by: Scott Anderson <sanderson@users.noreply.github.com>
pull/4889/head^2
Jason Stirnaman 2023-04-24 12:10:20 -05:00 committed by GitHub
parent 114ef5085a
commit 6d13675739
4 changed files with 65 additions and 38 deletions


@ -66,10 +66,10 @@ to your InfluxDB Cloud bucket before running the example queries.
### Query data within time boundaries
- Use the `SELECT` clause to specify what tags and fields to return.
- Use the `SELECT` clause to specify what columns (tags and fields) to return.
To return all tags and fields, use the wildcard alias (`*`).
- Specify the measurement to query in the `FROM` clause.
- Specify time boundaries in the `WHERE` clause.
- In the `FROM` clause, specify the table (measurement) to query.
- In the `WHERE` clause, specify time boundaries and other conditions for filtering.
Include time-based predicates that compare the value of the `time` column to a timestamp.
Use the `AND` logical operator to chain multiple predicates together.
@ -110,8 +110,8 @@ WHERE
{{% expand "Query with absolute time boundaries" %}}
To query data from absolute time boundaries, compare the value of the `time column
to a timestamp literals.
To query data from absolute time boundaries, compare the value of the `time` column
to a timestamp literal.
Use the `AND` logical operator to chain together multiple predicates and define
both start and stop boundaries for the query.
@ -132,11 +132,11 @@ WHERE
### Query data without time boundaries
To query data without time boundaries, do not include any time-based predicates
To query data without time boundaries, don't include any time-based predicates
in your `WHERE` clause.
{{% warn %}}
Querying data _without time bounds_ can return an unexpected amount of data.
Querying data _without time bounds_ can return a large number of rows.
The query may take a long time to complete and results may be truncated.
{{% /warn %}}
@ -146,8 +146,8 @@ SELECT * FROM home
### Query specific fields and tags
To query specific fields, include them in the `SELECT` clause.
If querying multiple fields or tags, comma-delimit each.
To specify columns (fields, tags, or calculations) you want to retrieve, list them in the `SELECT` clause.
Use a comma to separate column names.
If the field or tag keys include special characters or spaces or are case-sensitive,
wrap the key in _double-quotes_.


@ -2,7 +2,7 @@
title: Explore your schema with SQL
description: >
When working with InfluxDB's implementation of SQL, a **bucket** is equivalent
to a databases, a **measurement** is structured as a table, and **time**,
to a database, a **measurement** is structured as a table, and **time**,
**fields**, and **tags** are structured as columns.
menu:
influxdb_cloud_iox:


@ -13,20 +13,23 @@ menu:
Use the following guidelines to design your [schema](/influxdb/cloud-iox/reference/glossary/#schema)
for simpler and more performant queries.
<!-- TOC -->
- [InfluxDB data structure](#influxdb-data-structure)
- [Primary keys](#primary-keys)
- [Tags versus fields](#tags-versus-fields)
- [Schema restrictions](#schema-restrictions)
- [Do not use duplicate names for tags and fields](#do-not-use-duplicate-names-for-tags-and-fields)
- [Measurements can contain up to 200 columns](#measurements-can-contain-up-to-200-columns)
- [Design for performance](#design-for-performance)
- [Avoid wide schemas](#avoid-wide-schemas)
- [Avoid too many tags](#avoid-too-many-tags)
- [Avoid sparse schemas](#avoid-sparse-schemas)
- [Writing individual fields with different timestamps](#writing-individual-fields-with-different-timestamps)
- [Measurement schemas should be homogenous](#measurement-schemas-should-be-homogenous)
- [Design for query simplicity](#design-for-query-simplicity)
- [Keep measurement names, tag keys, and field keys simple](#keep-measurement-names-tag-keys-and-field-keys-simple)
- [Avoid keywords and special characters](#avoid-keywords-and-special-characters)
- [Use explicit bucket schemas to enforce schema](#use-explicit-bucket-schemas-to-enforce-schema)
---
## InfluxDB data structure
@ -35,17 +38,25 @@ A bucket can contain multiple measurements. Measurements contain multiple
tags and fields.
- **Bucket**: Named location where time series data is stored.
In the InfluxDB SQL implementation, a bucket is synonymous with a _database_.
A bucket can contain multiple _measurements_.
- **Measurement**: Logical grouping for time series data.
In the InfluxDB SQL implementation, a measurement is synonymous with a _table_.
All _points_ in a given measurement should have the same _tags_.
A measurement contains multiple _tags_ and _fields_.
- **Tags**: Key-value pairs that provide metadata for each point--for example,
something to identify the source or context of the data like host,
- **Tags**: Key-value pairs that store metadata string values for each point--for example,
a value that identifies or differentiates the data source or context, such as host,
location, station, etc.
- **Fields**: Key-value pairs with values that change over time--for example,
- **Fields**: Key-value pairs that store data for each point--for example,
temperature, pressure, stock price, etc.
- **Timestamp**: Timestamp associated with the data.
When stored on disk and queried, all data is ordered by time.
In InfluxDB, a timestamp is a nanosecond-scale [unix timestamp](/influxdb/cloud-iox/reference/glossary/#unix-timestamp) in UTC.
### Primary keys
In time series data, the primary key for a row of data is typically a combination of timestamp and other attributes that uniquely identify each data point.
In InfluxDB, the primary key for a row is the combination of the point's timestamp and _tag set_ - the collection of [tag keys](/influxdb/cloud-iox/reference/glossary/#tag-key) and [tag values](/influxdb/cloud-iox/reference/glossary/#tag-value) on the point.
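As a rough illustration of the primary key described above, the following Python sketch builds a key from a point's timestamp and tag set. The `primary_key` function and the sample tag names are hypothetical and only model the concept; they are not part of InfluxDB or how IOx stores keys internally.

```python
from typing import Dict, Tuple

def primary_key(timestamp_ns: int, tags: Dict[str, str]) -> Tuple:
    """Model a row's primary key as (timestamp, sorted tag set).

    Sorting the tag pairs makes the key independent of the order
    in which tags were supplied.
    """
    return (timestamp_ns, tuple(sorted(tags.items())))

# Same timestamp and same tag set (in a different order): same key.
a = primary_key(1682356220000000000, {"host": "h1", "location": "us-east"})
b = primary_key(1682356220000000000, {"location": "us-east", "host": "h1"})

# Same timestamp but a different tag value: a different key, so a different row.
c = primary_key(1682356220000000000, {"host": "h2", "location": "us-east"})
```

In this model, each additional tag adds another element to the key, which is why a point with more tags has a more complex primary key.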
### Tags versus fields
@ -54,7 +65,7 @@ tag and what should be a field?" The following guidelines should help answer tha
question as you design your schema.
- Use tags to store identifying information about the source or context of the data.
- Use fields to store values that change over time.
- Use fields to store measured values.
- Tag values can only be strings.
- Field values can be any of the following data types:
- Integer
@ -64,9 +75,9 @@ question as you design your schema.
- Boolean
{{% note %}}
If coming from a version of InfluxDB backed by the TSM storage engine, **tag value**
cardinality no longer affects the overall performance of your database.
The InfluxDB IOx engine supports infinite tag value and series cardinality.
Unlike InfluxDB backed by the TSM storage engine, **tag value**
cardinality doesn't affect the overall performance of your database.
{{% /note %}}
---
@ -81,11 +92,6 @@ measurement on disk.
If you attempt to write a measurement that contains tags or fields with the same name,
the write fails due to a column conflict.
{{% note %}}
Use [explicit bucket schemas](/influxdb/cloud-iox/admin/buckets/manage-explicit-bucket-schemas/) to enforce unique tag and
field keys within a schema.
{{% /note %}}
### Measurements can contain up to 200 columns
A measurement can contain **up to 200 columns**. Each row requires a time column,
@ -106,30 +112,55 @@ The following guidelines help to optimize query performance:
- [Avoid sparse schemas](#avoid-sparse-schemas)
- [Measurement schemas should be homogenous](#measurement-schemas-should-be-homogenous)
### Avoid wide schemas
A wide schema is one with many tags and fields and corresponding columns for each.
At query time, InfluxDB evaluates each row in the queried measurement to
determine what rows to return. The "wider" the measurement (more columns), the
less performant queries are against that measurement.
To ensure queries stay performant, the InfluxDB IOx storage engine has a
With the InfluxDB IOx storage engine, wide schemas don't impact query execution performance.
Because IOx is a columnar database, it executes queries only against columns selected in the query.
Although a wide schema won't affect query performance, it can lead to the following:
- More resources required for persisting and compacting data during ingestion.
- Decreased sorting performance due to complex primary keys with [too many tags](#avoid-too-many-tags).
The InfluxDB IOx storage engine has a
[limit of 200 columns per measurement](#measurements-can-contain-up-to-200-columns).
To avoid a wide schema, limit the number of tags and fields stored in a measurement.
If you need to store more than 199 total tags and fields, consider segmenting
your fields into a separate measurement.
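The column limit above can be checked with simple arithmetic before writing data. The following Python sketch assumes that every distinct tag key and field key maps to one column and that the time column counts as one more; the helper names are hypothetical and not part of any InfluxDB client.

```python
from typing import Iterable

MAX_COLUMNS = 200  # per-measurement limit stated in this guide

def column_count(tag_keys: Iterable[str], field_keys: Iterable[str]) -> int:
    """Count the columns a measurement would need: time + tags + fields."""
    return 1 + len(set(tag_keys)) + len(set(field_keys))

def fits_in_measurement(tag_keys: Iterable[str], field_keys: Iterable[str]) -> bool:
    """Return True if the schema stays within the column limit."""
    return column_count(tag_keys, field_keys) <= MAX_COLUMNS

tags = ["host"]
ok_fields = [f"f{i}" for i in range(198)]        # 1 + 1 + 198 = 200 columns
too_many_fields = [f"f{i}" for i in range(199)]  # 1 + 1 + 199 = 201 columns
```

If `fits_in_measurement` returns `False`, one option is to move a group of related fields into a separate measurement, as suggested above.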
#### Avoid too many tags
In InfluxDB, the primary key for a row is the combination of the point's timestamp and _tag set_ - the collection of [tag keys](/influxdb/cloud-iox/reference/glossary/#tag-key) and [tag values](/influxdb/cloud-iox/reference/glossary/#tag-value) on the point.
A point that contains more tags has a more complex primary key, which could impact sorting performance if you sort using all parts of the key.
### Avoid sparse schemas
A sparse schema is one where, for many rows, columns contain null values.
These generally stem from [non-homogenous measurement schemas](#measurement-schemas-should-be-homogenous)
or individual fields for a tag set being reported at separate times.
These generally stem from the following:
- [non-homogenous measurement schemas](#measurement-schemas-should-be-homogenous)
- [writing individual fields with different timestamps](#writing-individual-fields-with-different-timestamps)
Sparse schemas require the InfluxDB query engine to evaluate many
null columns, adding unnecessary overhead to storing and querying data.
_For an example of a sparse schema,
[view the non-homogenous schema example below](#view-example-of-a-sparse-non-homogenous-schema)._
#### Writing individual fields with different timestamps
Reporting fields at different times with different timestamps creates distinct rows that contain null values--for example:
You report `fieldA` with `tagset`, and then report `fieldB` with the same `tagset`, but with a different timestamp.
The result is two rows: one row has a _null_ value for `fieldA` and the other has a _null_ value for `fieldB`.
In contrast, if you report fields at different times while using the same tagset and timestamp, the existing row is updated.
This requires slightly more resources at ingestion time, but then gets resolved at persistence time or compaction time
and avoids a sparse schema.
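The difference can be modeled with a small Python sketch. It is a conceptual model only, assuming rows are keyed by timestamp and tag set and that a write to an existing key updates that row; `write_point`, `null_count`, and the sample values are hypothetical and do not reflect how IOx is implemented.

```python
from typing import Dict, Optional, Tuple

Row = Dict[str, Optional[float]]
Table = Dict[Tuple, Row]

def write_point(table: Table, timestamp_ns: int, tags: Dict[str, str],
                fields: Dict[str, float], columns: Tuple[str, ...]) -> None:
    """Insert or update the row identified by (timestamp, tag set)."""
    key = (timestamp_ns, tuple(sorted(tags.items())))
    # A new row starts with every field column set to None (null).
    row = table.setdefault(key, {name: None for name in columns})
    row.update(fields)

def null_count(table: Table) -> int:
    """Count null field values across all rows."""
    return sum(value is None for row in table.values() for value in row.values())

columns = ("fieldA", "fieldB")
tags = {"host": "h1"}

# Different timestamps: two rows, each with one null field.
sparse: Table = {}
write_point(sparse, 1000, tags, {"fieldA": 1.0}, columns)
write_point(sparse, 2000, tags, {"fieldB": 2.0}, columns)

# Same timestamp and tag set: the second write updates the existing row.
dense: Table = {}
write_point(dense, 1000, tags, {"fieldA": 1.0}, columns)
write_point(dense, 1000, tags, {"fieldB": 2.0}, columns)
```

In this model, `sparse` ends up with two rows and two null values, while `dense` holds a single fully populated row.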
### Measurement schemas should be homogenous
Data stored within a measurement should be "homogenous," meaning each row should
@ -368,9 +399,3 @@ iox.from(bucket: "example-bucket")
{{% /code-tab-content %}}
{{< /code-tabs-wrapper >}}
## Use explicit bucket schemas to enforce schema
By default, buckets have an `implicit` **schema-type** and a schema that conforms to your data.
To require measurements to have specific columns and data types and prevent non-conforming write requests,
use [`explicit` buckets and explicit bucket schemas](/influxdb/cloud-iox/admin/buckets/manage-explicit-bucket-schemas/).


@ -673,6 +673,8 @@ Related entries: [check](#check), [notification endpoint](#notification-endpoint
The local server's nanosecond timestamp.
Related entries: [timestamp](#timestamp)
### null
A data type that represents a missing or unknown value.
@ -776,7 +778,7 @@ For example, if the precision is set to `ms`, the nanosecond epoch timestamp `14
Telegraf output plugins do not alter the timestamp further.
The precision setting is ignored for service input plugins.
Related entries: [aggregator plugin](#aggregator-plugin), [input plugin](#input-plugin), [output plugin](#output-plugin), [processor plugin](#processor-plugin), [service input plugin](#service-input-plugin)
Related entries: [aggregator plugin](#aggregator-plugin), [input plugin](#input-plugin), [output plugin](#output-plugin), [processor plugin](#processor-plugin), [service input plugin](#service-input-plugin), [timestamp](#timestamp)
### predicate expression
@ -1139,12 +1141,12 @@ Irregular time series data changes at non-constant intervals.
### timestamp
The date and time associated with a point.
Time in InfluxDB is in UTC.
In InfluxDB, a timestamp is a nanosecond-scale [unix timestamp](#unix-timestamp) in UTC.
To specify time when writing data, see [Elements of line protocol](/influxdb/v2.7/reference/syntax/line-protocol/#elements-of-line-protocol).
To specify time when querying data, see [Query InfluxDB with Flux](/influxdb/v2.7/query-data/get-started/query-influxdb/#2-specify-a-time-range).
Related entries: [point](#point), [unix timestamp](#unix-timestamp), [RFC3339 timestamp](#rfc3339-timestamp)
Related entries: [point](#point), [precision](#precision), [RFC3339 timestamp](#rfc3339-timestamp), [unix timestamp](#unix-timestamp)
### token