fix(cloud-iox): schema design corrections (closes #4851): (#4888)

* fix(cloud-iox): typo

* fix(cloud-iox): fix typos and cleanup SQL descriptions

* fix(cloud-iox): schema design corrections (closes #4851):

  - Update IOx schema design best practice with feedback from @pauldix:
    - Timestamp
    - Primary key and complexity
    - Wide schema correction
    - Sparse schema clarification
  - Remove explicit bucket schemas
  - "nearly infinite" was replaced in an earlier commit.
  - Add timestamp references in Glossary.

* Update content/influxdb/cloud-iox/write-data/best-practices/schema-design.md

Co-authored-by: Scott Anderson <sanderson@users.noreply.github.com>

---------

Co-authored-by: Scott Anderson <sanderson@users.noreply.github.com>
2023-04-24 12:10:20 -05:00 · 2023-04-24 12:10:20 -05:00 · 6d13675739
parent 114ef5085a
commit 6d13675739
4 changed files with 65 additions and 38 deletions
--- a/content/influxdb/cloud-iox/query-data/sql/basic-query.md
+++ b/content/influxdb/cloud-iox/query-data/sql/basic-query.md
@ -66,10 +66,10 @@ to your InfluxDB Cloud bucket before running the example queries.

 ### Query data within time boundaries

- Use the `SELECT` clause to specify what tags and fields to return.
+- Use the `SELECT` clause to specify what columns (tags and fields) to return.
  To return all tags and fields, use the wildcard alias (`*`).
- Specify the measurement to query in the `FROM` clause.
- Specify time boundaries in the `WHERE` clause.
+- In the `FROM` clause, specify the table (measurement) to query.
+- In the `WHERE` clause, specify time boundaries and other conditions for filtering.
  Include time-based predicates that compare the value of the `time` column to a timestamp.
  Use the `AND` logical operator to chain multiple predicates together.

@ -110,8 +110,8 @@ WHERE

 {{% expand "Query with absolute time boundaries" %}}

-To query data from absolute time boundaries, compare the value of the `time column
-to a timestamp literals.
+To query data from absolute time boundaries, compare the value of the `time` column
+to a timestamp literal.
 Use the `AND` logical operator to chain together multiple predicates and define
 both start and stop boundaries for the query.

@ -132,11 +132,11 @@ WHERE

 ### Query data without time boundaries

-To query data without time boundaries, do not include any time-based predicates
+To query data without time boundaries, don't include any time-based predicates
 in your `WHERE` clause.

 {{% warn %}}
-Querying data _without time bounds_ can return an unexpected amount of data.
+Querying data _without time bounds_ can return a large number of rows.
 The query may take a long time to complete and results may be truncated.
 {{% /warn %}}

@ -146,8 +146,8 @@ SELECT * FROM home

 ### Query specific fields and tags

-To query specific fields, include them in the `SELECT` clause.
-If querying multiple fields or tags, comma-delimit each.
+To specify columns (fields, tags, or calculations) you want to retrieve, list them in the `SELECT` clause.
+Use a comma to separate column names.
 If the field or tag keys include special characters or spaces or are case-sensitive,
 wrap the key in _double-quotes_.

--- a/content/influxdb/cloud-iox/query-data/sql/explore-schema.md
+++ b/content/influxdb/cloud-iox/query-data/sql/explore-schema.md
@ -2,7 +2,7 @@
 title: Explore your schema with SQL
 description: >
  When working with InfluxDB's implementation of SQL, a **bucket** is equivalent
-  to a databases, a **measurement** is structured as a table, and **time**,
+  to a database, a **measurement** is structured as a table, and **time**,
  **fields**, and **tags** are structured as columns.
 menu:
  influxdb_cloud_iox:
--- a/content/influxdb/cloud-iox/write-data/best-practices/schema-design.md
+++ b/content/influxdb/cloud-iox/write-data/best-practices/schema-design.md
@ -13,20 +13,23 @@ menu:
 Use the following guidelines to design your [schema](/influxdb/cloud-iox/reference/glossary/#schema)
 for simpler and more performant queries.

+<!-- TOC -->
+
 - [InfluxDB data structure](#influxdb-data-structure)
+  - [Primary keys](#primary-keys)
  - [Tags versus fields](#tags-versus-fields)
 - [Schema restrictions](#schema-restrictions)
  - [Do not use duplicate names for tags and fields](#do-not-use-duplicate-names-for-tags-and-fields)
  - [Measurements can contain up to 200 columns](#measurements-can-contain-up-to-200-columns)
 - [Design for performance](#design-for-performance)
  - [Avoid wide schemas](#avoid-wide-schemas)
+    - [Avoid too many tags](#avoid-too-many-tags)
  - [Avoid sparse schemas](#avoid-sparse-schemas)
+    - [Writing individual fields with different timestamps](#writing-individual-fields-with-different-timestamps)
  - [Measurement schemas should be homogenous](#measurement-schemas-should-be-homogenous)
 - [Design for query simplicity](#design-for-query-simplicity)
  - [Keep measurement names, tag keys, and field keys simple](#keep-measurement-names-tag-keys-and-field-keys-simple)
  - [Avoid keywords and special characters](#avoid-keywords-and-special-characters)
- [Use explicit bucket schemas to enforce schema](#use-explicit-bucket-schemas-to-enforce-schema)
---

 ## InfluxDB data structure

@ -35,17 +38,25 @@ A bucket can contain multiple measurements. Measurements contain multiple
 tags and fields.

 - **Bucket**: Named location where time series data is stored.
+  In the InfluxDB SQL implementation, a bucket is synonymous with a _database_.
  A bucket can contain multiple _measurements_.
  - **Measurement**: Logical grouping for time series data.
+    In the InfluxDB SQL implementation, a measurement is synonymous with a _table_.
    All _points_ in a given measurement should have the same _tags_.
    A measurement contains multiple _tags_ and _fields_.
-      - **Tags**: Key-value pairs that provide metadata for each point--for example,
-        something to identify the source or context of the data like host,
+      - **Tags**: Key-value pairs that store metadata string values for each point.
+        A tag value identifies or differentiates the data source or context--for example, host,
        location, station, etc.
-      - **Fields**: Key-value pairs with values that change over time--for example,
+      - **Fields**: Key-value pairs that store data for each point--for example,
        temperature, pressure, stock price, etc.
      - **Timestamp**: Timestamp associated with the data.
        When stored on disk and queried, all data is ordered by time.
+        In InfluxDB, a timestamp is a nanosecond-scale [unix timestamp](#unix-timestamp) in UTC.
+
+### Primary keys
+
+In time series data, the primary key for a row of data is typically a combination of timestamp and other attributes that uniquely identify each data point.
+In InfluxDB, the primary key for a row is the combination of the point's timestamp and _tag set_ - the collection of [tag keys](/influxdb/cloud-iox/reference/glossary/#tag-key) and [tag values](/influxdb/cloud-iox/reference/glossary/#tag-value) on the point.
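As a rough illustration of the primary key described above, the following Python sketch builds a hashable key from a timestamp and a tag set. The function name `primary_key` and the sample tag keys (`host`, `room`) are hypothetical and chosen only for this example; this is a toy model, not how the IOx storage engine represents keys internally.

```python
from typing import Mapping, Tuple

TagSet = Tuple[Tuple[str, str], ...]


def primary_key(timestamp_ns: int, tags: Mapping[str, str]) -> Tuple[int, TagSet]:
    """Build a hashable key from a timestamp and a tag set (toy model)."""
    # Sort the tag pairs so the order in which tags were written doesn't matter.
    return (timestamp_ns, tuple(sorted(tags.items())))


# Hypothetical points: same timestamp, same tags in a different order.
key_a = primary_key(1682356220000000000, {"room": "Kitchen", "host": "h1"})
key_b = primary_key(1682356220000000000, {"host": "h1", "room": "Kitchen"})
# Same timestamp, but one tag value differs.
key_c = primary_key(1682356220000000000, {"host": "h2", "room": "Kitchen"})

print(key_a == key_b)  # same timestamp and tag set: identical key
print(key_a == key_c)  # different tag value: different key
```

In this model, each additional tag adds another pair to the key, which is one way to picture why a point with more tags has a more complex primary key.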

 ### Tags versus fields

@ -54,7 +65,7 @@ tag and what should be a field?" The following guidelines should help answer tha
 question as you design your schema.

 - Use tags to store identifying information about the source or context of the data.
- Use fields to store values that change over time.
+- Use fields to store measured values.
 - Tag values can only be strings.
 - Field values can be any of the following data types:
  - Integer
@ -64,9 +75,9 @@ question as you design your schema.
  - Boolean

 {{% note %}}
-If coming from a version of InfluxDB backed by the TSM storage engine, **tag value**
-cardinality no longer affects the overall performance of your database.
 The InfluxDB IOx engine supports infinite tag value and series cardinality.
+Unlike InfluxDB backed by the TSM storage engine, **tag value**
+cardinality doesn't affect the overall performance of your database.
 {{% /note %}}

 ---
@ -81,11 +92,6 @@ measurement on disk.
 If you attempt to write a measurement that contains tags or fields with the same name,
 the write fails due to a column conflict.

-{{% note %}}
-Use [explicit bucket schemas](/influxdb/cloud-iox/admin/buckets/manage-explicit-bucket-schemas/) to enforce unique tag and
-field keys within a schema.
-{{% /note %}}
-
 ### Measurements can contain up to 200 columns

 A measurement can contain **up to 200 columns**. Each row requires a time column,
@ -106,30 +112,55 @@ The following guidelines help to optimize query performance:
 - [Avoid sparse schemas](#avoid-sparse-schemas)
 - [Measurement schemas should be homogenous](#measurement-schemas-should-be-homogenous)

+
 ### Avoid wide schemas

 A wide schema is one with many tags and fields and corresponding columns for each.
-At query time, InfluxDB evaluates each row in the queried measurement to
-determine what rows to return. The "wider" the measurement (more columns), the
-less performant queries are against that measurement.
-To ensure queries stay performant, the InfluxDB IOx storage engine has a
+With the InfluxDB IOx storage engine, wide schemas don't impact query execution performance.
+Because IOx is a columnar database, it executes queries only against columns selected in the query.
+
+Although a wide schema won't affect query performance, it can lead to the following:
+
+- More resources required for persisting and compacting data during ingestion.
+- Decreased sorting performance due to complex primary keys with [too many tags](#avoid-too-many-tags).
+
+The InfluxDB IOx storage engine has a
 [limit of 200 columns per measurement](#measurements-can-contain-up-to-200-columns).

 To avoid a wide schema, limit the number of tags and fields stored in a measurement.
 If you need to store more than 199 total tags and fields, consider segmenting
 your fields into a separate measurement.
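The column arithmetic above can be sketched in Python as follows. The 200-column limit and the required time column come from the text; the helper names `column_count` and `fits_in_measurement` are hypothetical, and the sketch assumes tag keys and field keys don't share names.

```python
MAX_COLUMNS = 200  # per-measurement column limit stated above


def column_count(tag_keys, field_keys):
    """Count columns: one time column plus one column per tag key and field key."""
    return 1 + len(set(tag_keys)) + len(set(field_keys))


def fits_in_measurement(tag_keys, field_keys):
    """Return True if the schema stays within the column limit."""
    return column_count(tag_keys, field_keys) <= MAX_COLUMNS


# Hypothetical schema: 1 tag and 199 fields, plus time, is 201 columns.
too_wide = fits_in_measurement(["room"], [f"f{i}" for i in range(199)])
# 1 tag and 198 fields, plus time, is exactly 200 columns.
at_limit = fits_in_measurement(["room"], [f"f{i}" for i in range(198)])
print(too_wide, at_limit)
```

A check like this could run before writing to decide whether to split fields into a separate measurement.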

+#### Avoid too many tags
+
+In InfluxDB, the primary key for a row is the combination of the point's timestamp and _tag set_ - the collection of [tag keys](/influxdb/cloud-iox/reference/glossary/#tag-key) and [tag values](/influxdb/cloud-iox/reference/glossary/#tag-value) on the point.
+A point that contains more tags has a more complex primary key, which could impact sorting performance if you sort using all parts of the key.
+
 ### Avoid sparse schemas

 A sparse schema is one where, for many rows, columns contain null values.
-These generally stem from [non-homogenous measurement schemas](#measurement-schemas-should-be-homogenous)
-or individual fields for a tag set being reported at separate times.
+
+Sparse schemas generally stem from the following:
+- [Non-homogenous measurement schemas](#measurement-schemas-should-be-homogenous)
+- [Writing individual fields with different timestamps](#writing-individual-fields-with-different-timestamps)
+
 Sparse schemas require the InfluxDB query engine to evaluate many
 null columns, adding unnecessary overhead to storing and querying data.

 _For an example of a sparse schema,
 [view the non-homogenous schema example below](#view-example-of-a-sparse-non-homogenous-schema)._

+#### Writing individual fields with different timestamps
+
+Reporting fields at different times with different timestamps creates distinct rows that contain null values.
+
+For example, you report `fieldA` with a tag set, and then report `fieldB` with the same tag set but a different timestamp.
+The result is two rows: one row has a _null_ value for `fieldA` and the other has a _null_ value for `fieldB`.
+
+In contrast, if you report fields at different times but use the same tag set and timestamp, the existing row is updated.
+This requires slightly more resources at ingestion time, but the update is resolved at persistence time or compaction time,
+which avoids a sparse schema.
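The difference between the two write patterns can be pictured with a small in-memory model in Python. This is only a sketch under simplifying assumptions: a table is a dictionary keyed by timestamp and tag set, and the names `write` and `null_count` are hypothetical helpers, not part of any InfluxDB client library.

```python
def write(table, timestamp_ns, tags, fields):
    """Upsert fields into the row identified by timestamp and tag set (toy model)."""
    key = (timestamp_ns, tuple(sorted(tags.items())))
    table.setdefault(key, {}).update(fields)


def null_count(table, field_keys):
    """Count the field columns that are missing (null) across all rows."""
    return sum(1 for row in table.values() for k in field_keys if k not in row)


tags = {"room": "Kitchen"}  # hypothetical tag set
columns = ["fieldA", "fieldB"]

# Different timestamps: two rows, each with one null field.
sparse = {}
write(sparse, 1000, tags, {"fieldA": 1.0})
write(sparse, 2000, tags, {"fieldB": 2.0})

# Same timestamp and tag set: the second write updates the existing row.
dense = {}
write(dense, 1000, tags, {"fieldA": 1.0})
write(dense, 1000, tags, {"fieldB": 2.0})

print(len(sparse), null_count(sparse, columns))
print(len(dense), null_count(dense, columns))
```

The first table ends up with two rows and two null values, while the second holds a single fully populated row.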
+
 ### Measurement schemas should be homogenous

 Data stored within a measurement should be "homogenous," meaning each row should
@ -368,9 +399,3 @@ iox.from(bucket: "example-bucket")

 {{% /code-tab-content %}}
 {{< /code-tabs-wrapper >}}
-
-## Use explicit bucket schemas to enforce schema
-
-By default, buckets have an `implicit` **schema-type** and a schema that conforms to your data.
-To require measurements to have specific columns and data types and prevent non-conforming write requests,
-use [`explicit` buckets and explicit bucket schemas](/influxdb/cloud-iox/admin/buckets/manage-explicit-bucket-schemas/).
--- a/content/influxdb/v2.7/reference/glossary.md
+++ b/content/influxdb/v2.7/reference/glossary.md
@ -673,6 +673,8 @@ Related entries: [check](#check), [notification endpoint](#notification-endpoint

 The local server's nanosecond timestamp.

+Related entries: [timestamp](#timestamp)
+
 ### null

 A data type that represents a missing or unknown value.
@ -776,7 +778,7 @@ For example, if the precision is set to `ms`, the nanosecond epoch timestamp `14
 Telegraf output plugins do not alter the timestamp further.
 The precision setting is ignored for service input plugins.

-Related entries:  [aggregator plugin](#aggregator-plugin), [input plugin](#input-plugin), [output plugin](#output-plugin), [processor plugin](#processor-plugin), [service input plugin](#service-input-plugin)
+Related entries:  [aggregator plugin](#aggregator-plugin), [input plugin](#input-plugin), [output plugin](#output-plugin), [processor plugin](#processor-plugin), [service input plugin](#service-input-plugin), [timestamp](#timestamp)

 ### predicate expression

@ -1139,12 +1141,12 @@ Irregular time series data changes at non-constant intervals.
 ### timestamp

 The date and time associated with a point.
-Time in InfluxDB is in UTC.
+In InfluxDB, a timestamp is a nanosecond-scale [unix timestamp](#unix-timestamp) in UTC.

 To specify time when writing data, see [Elements of line protocol](/influxdb/v2.7/reference/syntax/line-protocol/#elements-of-line-protocol).
 To specify time when querying data, see [Query InfluxDB with Flux](/influxdb/v2.7/query-data/get-started/query-influxdb/#2-specify-a-time-range).

-Related entries: [point](#point), [unix timestamp](#unix-timestamp), [RFC3339 timestamp](#rfc3339-timestamp)
+Related entries: [point](#point), [precision](#precision), [RFC3339 timestamp](#rfc3339-timestamp), [unix timestamp](#unix-timestamp)

 ### token