docs-v2/content/shared/influxdb3-write-guides/best-practices/schema-design.md

394 lines
15 KiB
Markdown

Use the following guidelines to design your [schema](/influxdb3/version/reference/glossary/#schema)
for simpler and more performant queries.
- [InfluxDB data structure](#influxdb-data-structure)
- [Primary keys](#primary-keys)
- [Tags versus fields](#tags-versus-fields)
- [Schema restrictions](#schema-restrictions)
- [Do not use duplicate names for tags and fields](#do-not-use-duplicate-names-for-tags-and-fields)
- [Maximum number of columns per table](#maximum-number-of-columns-per-table)
- [Design for performance](#design-for-performance)
- [Avoid wide schemas](#avoid-wide-schemas)
- [Avoid sparse schemas](#avoid-sparse-schemas)
- [Table schemas should be homogenous](#table-schemas-should-be-homogenous)
- [Use the best data type for your data](#use-the-best-data-type-for-your-data)
- [Design for query simplicity](#design-for-query-simplicity)
- [Keep table names, tags, and fields simple](#keep-table-names-tags-and-fields-simple)
- [Avoid keywords and special characters](#avoid-keywords-and-special-characters)
## InfluxDB data structure
The {{% product-name %}} data model organizes time series data into databases and tables.
A database can contain multiple tables.
Tables contain multiple tags and fields.
<!-- vale InfluxDataDocs.v3Schema = NO -->
- **Database**: A named location where time series data is stored.
In {{% product-name %}}, _database_ is synonymous with _bucket_ in InfluxDB
Cloud Serverless and InfluxDB TSM implementations.
A database can contain multiple _tables_.
- **Table**: A logical grouping for time series data.
In {{% product-name %}}, _table_ is synonymous with _measurement_ in
InfluxDB Cloud Serverless and InfluxDB TSM implementations.
All _points_ in a given table should have the same _tags_.
A table contains multiple _tags_ and _fields_.
- **Tags**: Key-value pairs that store metadata string values for each point--for example,
a value that identifies or differentiates the data source or context--for example, host,
location, station, etc.
Tag values may be null.
- **Fields**: Key-value pairs that store data for each point--for example,
temperature, pressure, stock price, etc.
Field values may be null, but at least one field value is not null on any given row.
- **Timestamp**: Timestamp associated with the data.
When stored on disk and queried, all data is ordered by time.
In InfluxDB, a timestamp is a nanosecond-scale
[Unix timestamp](/influxdb3/version/reference/glossary/#unix-timestamp)
in UTC.
A timestamp is never null.
> [!Note]
>
> #### What happened to buckets and measurements?
>
> If coming from earlier versions of InfluxDB, InfluxDB Cloud (TSM), or
> InfluxDB Cloud Serverless, you're likely familiar with the concepts _bucket_
> and _measurement_:
>
> - _**Bucket**_ in InfluxDB v2 or InfluxDB Cloud Serverless is synonymous with
> _**database**_ in {{% product-name %}}.
> - _**Measurement**_ in InfluxDB v1, v2, or InfluxDB Cloud Serverless is synonymous
> with _**table**_ in {{% product-name %}}.
<!-- vale InfluxDataDocs.v3Schema = YES -->
### Primary keys
In time series data, the primary key for a row of data is typically a combination
of timestamp and other attributes that uniquely identify each data point.
In {{% product-name %}}, the primary key for a row is the combination of the
point's timestamp and _tag set_—the collection of
[tag keys](/influxdb3/version/reference/glossary/#tag-key) and
[tag values](/influxdb3/version/reference/glossary/#tag-value) on the point.
A row's primary key tag set does not include tags with null values.
### Tags versus fields
When designing your schema for InfluxDB, a common question is, "what should be a
tag and what should be a field?" The following guidelines should help answer that
question as you design your schema.
- Use tags to store metadata, or identifying information, about the source or context of the data.
- Use fields to store measured values.
- Tag values can only be strings.
- Field values can be any of the following data types:
- Integer
- Unsigned integer
- Float
- String
- Boolean
> [!Note]
> The InfluxDB 3 storage engine supports infinite tag value and series cardinality.
> Unlike previous versions of InfluxDB, **tag value** cardinality doesn't affect
> the overall performance of your database.
---
## Schema restrictions
### Do not use duplicate names for tags and fields
Use unique names for tags and fields within the same table.
{{% product-name %}} stores tags and fields as unique columns in a table.
If you attempt to write a table that contains tags or fields with the same name,
the write fails due to a column conflict.
### Maximum number of columns per table
A table has a [maximum number of columns](/influxdb3/version/admin/databases/#column-limit).
Each row must include a time column.
As a result, a table can have the following:
- a time column
- field and tag columns up to the configured maximum
If you attempt to write to a table and exceed the column limit, then the write
request fails and InfluxDB returns an error.
InfluxData identified the
[column limit](/influxdb3/version/admin/databases/#column-limit)
as the safe limit for maintaining system performance and stability.
Exceeding this threshold can result in
[wide schemas](#avoid-wide-schemas), which can negatively impact performance
and resource use, [depending on your queries](#avoid-non-specific-queries),
the shape of your schema, and data types in the schema.
---
## Design for performance
How you structure your schema within a table can affect resource use and
the performance of queries against that table.
The following guidelines help to optimize query performance:
- [Avoid wide schemas](#avoid-wide-schemas)
- [Avoid sparse schemas](#avoid-sparse-schemas)
- [Table schemas should be homogenous](#table-schemas-should-be-homogenous)
- [Use the best data type for your data](#use-the-best-data-type-for-your-data)
### Avoid wide schemas
A wide schema refers to a schema with a large number of columns (tags and fields).
Wide schemas can lead to the following issues:
- Increased resource usage for persisting data during ingestion.
- Reduced sorting performance due to complex primary keys with [too many tags](#avoid-too-many-tags).
- Reduced query performance when selecting too many columns
To prevent wide schema issues, limit the number of tags and fields stored in a table.
If you need to store more than the [maximum number of columns](/influxdb3/version/admin/databases/#column-limit),
consider segmenting your fields into separate tables.
#### Avoid too many tags
In {{% product-name %}}, the primary key for a row is the combination of the
point's timestamp and _tag set_ - the collection of
[tag keys](/influxdb3/version/reference/glossary/#tag-key)
and [tag values](/influxdb3/version/reference/glossary/#tag-value) on the point.
A point that contains more tags has a more complex primary key, which could
impact sorting performance if you sort using all parts of the key.
### Avoid sparse schemas
A sparse schema is one where, for many rows, columns contain null values.
These generally stem from the following:
- [non-homogenous table schemas](#table-schemas-should-be-homogenous)
- [writing individual fields with different timestamps](#writing-individual-fields-with-different-timestamps)
Sparse schemas require the InfluxDB query engine to evaluate many
null columns, adding unnecessary overhead to storing and querying data.
_For an example of a sparse schema,
[view the non-homogenous schema example below](#view-example-of-a-sparse-non-homogenous-schema)._
#### Writing individual fields with different timestamps
Reporting fields at different times with different timestamps creates distinct
rows that contain null values--for example:
You report `fieldA` with `tagset`, and then report `field B` with the same
`tagset`, but with a different timestamp.
The result is two rows: one row has a _null_ value for **field A** and the other
has a _null_ value for **field B**.
In contrast, if you report fields at different times while using the same tagset
and timestamp, the existing row is updated.
This requires slightly more resources at ingestion time, but then gets resolved
at persistence time or compaction time and avoids a sparse schema.
### Table schemas should be homogenous
Data stored in a table should be "homogenous," meaning each row should have the
same tag and field keys.
All rows stored in a table share the same columns, but if a point doesn't
include a value for a column, the column value is _null_.
A table full of _null_ values has a ["sparse" schema](#avoid-sparse-schemas).
{{< expand-wrapper >}}
{{% expand "View example of a sparse, non-homogenous schema" %}}
Non-homogenous schemas are often caused by writing points to a table with
inconsistent tag or field sets.
In the following example, data is collected from two
different sources and each source returns data with different tag and field sets.
{{< flex >}}
{{% flex-content %}}
##### Source 1 tags and fields:
- tags:
- source
- code
- crypto
- fields:
- price
{{% /flex-content %}}
{{% flex-content %}}
##### Source 2 tags and fields:
- tags:
- src
- currency
- crypto
- fields:
- cost
- volume
{{% /flex-content %}}
{{< /flex >}}
These sets of data written to the same table result in a table
full of null values (also known as a _sparse schema_):
| time | source | src | code | currency | crypto | price | cost | volume |
| :------------------- | :----- | --: | :--- | :------- | :------ | ----------: | ---------: | ----------: |
| 2025-01-01T12:00:00Z | src1 | | USD | | bitcoin | 16588.45865 | | |
| 2025-01-01T12:00:00Z | | 2 | | EUR | bitcoin | | 16159.5806 | 16749450200 |
| 2025-01-01T13:00:00Z | src1 | | USD | | bitcoin | 16559.49871 | | |
| 2025-01-01T13:00:00Z | | 2 | | EUR | bitcoin | | 16131.3694 | 16829683245 |
| 2025-01-01T14:00:00Z | src1 | | USD | | bitcoin | 16577.46667 | | |
| 2025-01-01T14:00:00Z | | 2 | | EUR | bitcoin | | 16148.8727 | 17151722208 |
| 2025-01-01T15:00:00Z | src1 | | USD | | bitcoin | 16591.36998 | | |
| 2025-01-01T15:00:00Z | | 2 | | EUR | bitcoin | | 16162.4167 | 17311854919 |
{{% /expand %}}
{{< /expand-wrapper >}}
### Use the best data type for your data
When writing data to a field, use the most appropriate
[data type](/influxdb3/version/reference/glossary/#data-type) for your data--write
integers as integers, decimals as floats, and booleans as booleans.
A query against a field that stores integers outperforms a query against string data;
querying over many long string values can negatively affect performance.
## Design for query simplicity
Naming conventions for tables, tag keys, and field keys can simplify or
complicate the process of writing queries for your data.
The following guidelines help to ensure writing queries for your data is as
simple as possible.
- [Keep table names, tags, and fields simple](#keep-table-names-tags-and-fields-simple)
- [Avoid keywords and special characters](#avoid-keywords-and-special-characters)
### Keep table names, tags, and fields simple
Use one tag or one field for each data attribute.
If your source data contains multiple data attributes in a single parameter,
split each attribute into its own tag or field.
Table names, tag keys, and field keys should be simple and accurately
describe what each contains.
Keep names free of data.
The most common cause of a complex naming convention is when you try to "embed"
data attributes into a table name, tag key, or field key.
When each key and value represents one attribute (not multiple concatenated attributes)
of your data, you'll reduce the need for regular expressions in your queries.
Without regular expressions, your queries will be easier to write and more performant.
#### Not recommended {.orange}
For example, consider the following [line protocol](/influxdb3/version/reference/syntax/line-protocol/)
that embeds multiple attributes (location, model, and ID) into a `sensor` tag value:
```text
home,sensor=loc-kitchen.model-A612.id-1726ZA temp=72.1
home,sensor=loc-bath.model-A612.id-2635YB temp=71.8
```
{{< expand-wrapper >}}
{{% expand "View written data" %}}
{{% influxql/table-meta %}}
**table**: home
{{% /influxql/table-meta %}}
| time | sensor | temp |
| :------------------- | :------------------------------- | ---: |
| 2025-01-01T00:00:00Z | loc-kitchen.model-A612.id-1726ZA | 72.1 |
| 2025-01-01T00:00:00Z | loc-bath.model-A612.id-2635YB | 71.8 |
{{% /expand %}}
{{< /expand-wrapper >}}
To query data from the sensor with ID `1726ZA`, you have to use either SQL
pattern matching or regular expressions to evaluate the `sensor` tag:
{{< code-tabs-wrapper >}}
{{% code-tabs %}}
[SQL](#)
[InfluxQL](#)
{{% /code-tabs %}}
{{% code-tab-content %}}
```sql
SELECT * FROM home WHERE sensor LIKE '%id-1726ZA%'
```
{{% /code-tab-content %}}
{{% code-tab-content %}}
```sql
SELECT * FROM home WHERE sensor =~ /id-1726ZA/
```
{{% /code-tab-content %}}
{{< /code-tabs-wrapper >}}
SQL pattern matching and regular expressions both complicate the query and
are less performant than simple equality expressions.
#### Recommended {.green}
The better approach would be to write each sensor attribute as a separate tag:
```text
home,location=kitchen,sensor_model=A612,sensor_id=1726ZA temp=72.1
home,location=bath,sensor_model=A612,sensor_id=2635YB temp=71.8
```
{{< expand-wrapper >}}
{{% expand "View written data" %}}
{{% influxql/table-meta %}}
**table**: home
{{% /influxql/table-meta %}}
| time | location | sensor_model | sensor_id | temp |
| :------------------- | :------- | :----------- | :-------- | ---: |
| 2023-01-01T00:00:00Z | kitchen | A612 | 1726ZA | 72.1 |
| 2023-01-01T00:00:00Z | bath | A612 | 2635YB | 71.8 |
{{% /expand %}}
{{< /expand-wrapper >}}
To query data from the sensor with ID `1726ZA` using this schema, you can use a
simple equality expression:
```sql
SELECT * FROM home WHERE sensor_id = '1726ZA'
```
This query is easier to write and is more performant than using pattern matching
or regular expressions.
### Avoid keywords and special characters
To simplify query writing, avoid using reserved keywords or special characters
in table names, tag keys, and field keys.
- [SQL keywords](/influxdb3/version/reference/sql/#keywords)
- [InfluxQL keywords](/influxdb3/version/reference/influxql/#keywords)
When using SQL or InfluxQL to query tables, tags, and fields with special
characters or keywords, you have to wrap these identifiers in **double quotes**.
```sql
SELECT
"example-field", "tag@1-23"
FROM
"example-table"
WHERE
"tag@1-23" = 'ABC'
```