
InfluxDB IOx -- Query Processing

This document illustrates query processing for SQL and InfluxQL.

Note

There is another query interface called InfluxRPC (implemented in iox_query_influxrpc) which mostly reflects the old TSM storage API. Planning there works significantly differently and is NOT covered by this document.

Basic Flow

1. Query arrives from the user (e.g. SQL, InfluxQL)
2. The query engine creates a LogicalPlan by consulting the Catalog to find:
   - Tables referenced in the query, and their schema and column details
3. The query engine creates an ExecutionPlan by determining the Chunks that contain data:
   1. Contacts the ingester for any unpersisted data
   2. Consults the catalog for the name/location of parquet files
   3. Prunes (discards at this step) any parquet files that cannot contain data relevant to the query
4. Starts the ExecutionPlan and streams the results back to the client

Some objects are cached, especially the schema information, information about parquet file existence, and parquet file content.

A graphical representation may look like this:

flowchart LR
    classDef intermediate color:#020A47,fill:#D6F622,stroke-width:0
    classDef processor color:#FFFFFF,fill:#D30971,stroke-width:0
    classDef systemIO color:#020A47,fill:#5EE4E4,stroke-width:0

    Query[Query Text]:::systemIO
    LogicalPlanner[Logical Planner]:::processor
    LogicalPlan[Logical Plan]:::intermediate
    PhysicalPlanner[Physical Planner]:::processor
    ExecutionPlan[Execution Plan]:::intermediate
    QueryExec[QueryExecution]:::processor
    Result[Result]:::systemIO

    Query --> LogicalPlanner --> LogicalPlan --> PhysicalPlanner --> ExecutionPlan --> QueryExec --> Result
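
The stages in the flowchart can be sketched as three plain functions. This is only a toy model under stated assumptions: `LogicalPlan`, `ExecutionPlan`, `Catalog`, `plan_logical`, `plan_physical` and `execute` are hypothetical stand-ins defined here, not the IOx or DataFusion types, and the "parser" merely looks for the word after `FROM`.

```rust
use std::collections::HashMap;

/// Toy catalog (assumption): table name -> column names.
type Catalog = HashMap<String, Vec<String>>;

/// Hypothetical stand-in for a logical plan: which table and columns are read.
#[derive(Debug, Clone, PartialEq)]
struct LogicalPlan {
    table: String,
    columns: Vec<String>,
}

/// Hypothetical stand-in for an execution plan: the logical plan plus chunks.
#[derive(Debug, Clone, PartialEq)]
struct ExecutionPlan {
    table: String,
    columns: Vec<String>,
    chunks: Vec<String>,
}

/// Extremely simplified "parser": returns the word that follows FROM.
fn table_name(query: &str) -> Option<&str> {
    query
        .split_whitespace()
        .skip_while(|w| !w.eq_ignore_ascii_case("from"))
        .nth(1)
}

/// Logical planning: resolve the referenced table against the catalog.
fn plan_logical(query: &str, catalog: &Catalog) -> Result<LogicalPlan, String> {
    let table = table_name(query).ok_or_else(|| "query has no FROM clause".to_string())?;
    let columns = catalog
        .get(table)
        .ok_or_else(|| format!("unknown table: {}", table))?;
    Ok(LogicalPlan {
        table: table.to_string(),
        columns: columns.clone(),
    })
}

/// Physical planning: attach the chunks that may contain data for the table.
fn plan_physical(plan: &LogicalPlan, chunks: &[&str]) -> ExecutionPlan {
    ExecutionPlan {
        table: plan.table.clone(),
        columns: plan.columns.clone(),
        chunks: chunks.iter().map(|c| c.to_string()).collect(),
    }
}

/// Execution: lazily yields one result item per chunk, so the caller can
/// forward each item before the next one is produced.
fn execute(plan: ExecutionPlan) -> impl Iterator<Item = String> {
    let table = plan.table;
    plan.chunks
        .into_iter()
        .map(move |chunk| format!("{}:{}", table, chunk))
}
```

Calling `plan_logical`, then `plan_physical`, then `execute` mirrors the left-to-right order of the diagram; each stage only sees the output of the previous one.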

Code Organization

The IOx query layer is responsible for translating query requests from different query languages and planning and executing them against chunks stored across various IOx storage systems.

Query Frontends:

SQL
InfluxQL
Others (possibly in the future)

Sources of chunk data:

Ingester Data
Parquet Files
Others (possibly in the future)

The goal is to use a shared query / plan representation in order to avoid N*M combinations of language and chunk source. While each frontend has its own plan construction and each chunk may be lowered to a different ExecutionPlan, the frontends and the chunk sources should not interact directly. This is achieved by first creating a LogicalPlan from the frontend without knowing the chunk sources; only during physical planning, i.e. when the ExecutionPlan is constructed, are the chunks transformed into appropriate DataFusion nodes.

So we should end up with roughly this picture:

flowchart TB
    classDef out color:#020A47,fill:#9394FF,stroke-width:0
    classDef intermediate color:#020A47,fill:#D6F622,stroke-width:0
    classDef in color:#020A47,fill:#5EE4E4,stroke-width:0

    SQL[SQL]:::in
    InfluxQL[InfluxQL]:::in
    OtherIn["Other (possibly in the future)"]:::in

    LogicalPlan[Logical Plan]:::intermediate

    IngesterData[Ingester Data]:::out
    ParquetFile[Parquet File]:::out
    OtherOut["Other (possibly in the future)"]:::out

    SQL --> LogicalPlan
    InfluxQL --> LogicalPlan
    OtherIn --> LogicalPlan

    LogicalPlan --> IngesterData
    LogicalPlan --> ParquetFile
    LogicalPlan --> OtherOut

We are trying to avoid ending up with something like this:

flowchart TB
    classDef out color:#020A47,fill:#9394FF,stroke-width:0
    classDef intermediate color:#020A47,fill:#D6F622,stroke-width:0
    classDef in color:#020A47,fill:#5EE4E4,stroke-width:0

    SQL[SQL]:::in
    InfluxQL[InfluxQL]:::in
    OtherIn["Other (possibly in the future)"]:::in

    IngesterData[Ingester Data]:::out
    ParquetFile[Parquet File]:::out
    OtherOut["Other (possibly in the future)"]:::out

    SQL --> IngesterData
    SQL --> ParquetFile
    SQL --> OtherOut

    InfluxQL --> IngesterData
    InfluxQL --> ParquetFile
    InfluxQL --> OtherOut

    OtherIn --> IngesterData
    OtherIn --> ParquetFile
    OtherIn --> OtherOut
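
The difference between the two pictures can be illustrated with two small traits. This is a sketch under assumptions, not the IOx code: `Frontend`, `ChunkSource`, `physical_nodes` and the concrete structs are hypothetical names, and the "physical node" is just a string. The point is that adding a frontend or a chunk source requires one new `impl`, not one per combination.

```rust
/// Hypothetical shared representation between frontends and chunk sources.
#[derive(Debug, Clone, PartialEq)]
struct LogicalPlan {
    table: String,
}

/// A frontend only knows how to produce a `LogicalPlan`.
trait Frontend {
    fn plan(&self, query: &str) -> LogicalPlan;
}

/// A chunk source only knows how to lower itself for a given `LogicalPlan`.
trait ChunkSource {
    fn lower(&self, plan: &LogicalPlan) -> String;
}

/// Toy helper: returns the word that follows FROM (empty if there is none).
fn word_after_from(query: &str) -> &str {
    query
        .split_whitespace()
        .skip_while(|w| !w.eq_ignore_ascii_case("from"))
        .nth(1)
        .unwrap_or("")
}

struct Sql;
struct InfluxQl;

impl Frontend for Sql {
    fn plan(&self, query: &str) -> LogicalPlan {
        LogicalPlan { table: word_after_from(query).to_string() }
    }
}

impl Frontend for InfluxQl {
    fn plan(&self, query: &str) -> LogicalPlan {
        // Toy difference between the languages: strip double quotes
        // around the measurement name.
        let table = word_after_from(query).trim_matches('"');
        LogicalPlan { table: table.to_string() }
    }
}

struct IngesterData;
struct ParquetFile;

impl ChunkSource for IngesterData {
    fn lower(&self, plan: &LogicalPlan) -> String {
        format!("IngesterScan({})", plan.table)
    }
}

impl ChunkSource for ParquetFile {
    fn lower(&self, plan: &LogicalPlan) -> String {
        format!("ParquetScan({})", plan.table)
    }
}

/// Frontends and sources meet only through `LogicalPlan`:
/// N frontends + M sources need N + M implementations.
fn physical_nodes(
    frontend: &dyn Frontend,
    sources: &[&dyn ChunkSource],
    query: &str,
) -> Vec<String> {
    let plan = frontend.plan(query);
    sources.iter().map(|source| source.lower(&plan)).collect()
}
```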

Frontend

We accept queries via an Apache Arrow Flight based native protocol (see service_grpc_flight::FlightService), or via the standard Apache Arrow Flight SQL.

Note that we stream data back to the client while DataFusion is still executing the query. This way we can emit rather large results without buffering them in full.
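
The effect of streaming can be shown with a bounded channel from the standard library. This is only an analogy under assumptions: the real system streams Arrow record batches over Flight, whereas this sketch uses `std::sync::mpsc` and `Vec<u64>` batches, and `stream_batches` and `consume` are hypothetical names.

```rust
use std::sync::mpsc;
use std::thread;

/// Runs a toy "query" on a worker thread and streams its batches through a
/// bounded channel instead of collecting the full result first.
fn stream_batches(num_batches: usize, rows_per_batch: usize) -> mpsc::Receiver<Vec<u64>> {
    // Capacity 1: the producer blocks once one batch is waiting, so only a
    // small, fixed number of batches exists at any time.
    let (tx, rx) = mpsc::sync_channel(1);
    thread::spawn(move || {
        for b in 0..num_batches {
            let start = (b * rows_per_batch) as u64;
            let batch: Vec<u64> = (start..start + rows_per_batch as u64).collect();
            if tx.send(batch).is_err() {
                // The receiver is gone (client disconnected): stop producing.
                break;
            }
        }
    });
    rx
}

/// Consumes batches as they arrive; returns (number of batches, sum of rows).
fn consume(rx: mpsc::Receiver<Vec<u64>>) -> (usize, u64) {
    let mut batches = 0;
    let mut sum = 0u64;
    for batch in rx {
        batches += 1;
        sum += batch.iter().sum::<u64>();
    }
    (batches, sum)
}
```

Memory use is bounded by the channel capacity and the batch size, not by the total result size.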

Also see:

"Flight SQL"

Logical Planning

Logical planning transforms the query text into a LogicalPlan.

The steps are the following:

Parse the text representation into some intermediate representation
Lower the intermediate representation into a LogicalPlan
Apply logical optimizer passes to the LogicalPlan

SQL

For SQL queries, we just use datafusion-sql to generate the LogicalPlan from the query text.

InfluxQL

For InfluxQL queries, we use iox_query_influxql to generate the LogicalPlan from the query text.

Logical Optimizer

We have a few logical optimizer passes that are specific to IOx. These can be split into two categories: optimizing and functional.

The optimizing passes only change the plan to make it run faster. They do not implement any functionality. These passes are:

influx_regex_to_datafusion_regex: Replaces InfluxDB-specific regex operator with DataFusion regex operator.

The functional passes implement features that are NOT offered by DataFusion by transforming the LogicalPlan accordingly. These passes are:

handle_gapfill: enables gap-filling semantics for SQL queries that contain calls to DATE_BIN_GAPFILL() and related functions like LOCF().
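
To give an intuition for what gap-filling means, here is a simplified sketch. It is not the `handle_gapfill` implementation: `gapfill_locf` is a hypothetical function, it assumes the input points are sorted by time, and it uses "last value in the bin" as the per-bin aggregate. It emits one row per bin, even for bins without data, and carries the last observed value forward into empty bins.

```rust
/// Emits one `(bin_start, value)` row per bin in `[start, end)`.
/// Bins without data get the last observed value carried forward
/// (`None` until a first value has been seen).
/// Assumption: `points` is sorted by timestamp.
fn gapfill_locf(
    points: &[(i64, f64)],
    start: i64,
    end: i64,
    stride: i64,
) -> Vec<(i64, Option<f64>)> {
    assert!(stride > 0, "stride must be positive");
    let mut out = Vec::new();
    let mut last = None;
    let mut idx = 0;
    let mut bin = start;
    while bin < end {
        // Consume all points before the end of this bin; points that fall
        // inside the bin update the carried value.
        while idx < points.len() && points[idx].0 < bin + stride {
            if points[idx].0 >= bin {
                last = Some(points[idx].1);
            }
            idx += 1;
        }
        out.push((bin, last));
        bin += stride;
    }
    out
}
```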

The IOx-specific passes are executed AFTER the DataFusion builtin passes.
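
The ordering of passes can be sketched as a chain of plan rewrites. The types below are assumptions for illustration only: `OptimizerPass`, `Op`, `LogicalPlan` and `DedupFilters` are hypothetical, and the regex pass is a toy version that only mirrors the name of the real one.

```rust
/// Toy filter operators (assumption, not DataFusion expressions).
#[derive(Debug, Clone, PartialEq)]
enum Op {
    Eq,
    InfluxRegexMatch,
    RegexMatch,
}

/// Toy logical plan: a list of filters plus a log of applied passes.
#[derive(Debug, Clone, PartialEq)]
struct LogicalPlan {
    filters: Vec<Op>,
    applied: Vec<&'static str>,
}

/// Hypothetical pass interface: a pass takes a plan and returns a rewritten one.
trait OptimizerPass {
    fn name(&self) -> &'static str;
    fn rewrite(&self, plan: LogicalPlan) -> LogicalPlan;
}

/// Stand-in for a builtin pass: removes consecutive duplicate filters.
struct DedupFilters;

impl OptimizerPass for DedupFilters {
    fn name(&self) -> &'static str {
        "dedup_filters"
    }
    fn rewrite(&self, mut plan: LogicalPlan) -> LogicalPlan {
        plan.filters.dedup();
        plan
    }
}

/// Toy version of the IOx-specific regex pass.
struct InfluxRegexToDataFusionRegex;

impl OptimizerPass for InfluxRegexToDataFusionRegex {
    fn name(&self) -> &'static str {
        "influx_regex_to_datafusion_regex"
    }
    fn rewrite(&self, mut plan: LogicalPlan) -> LogicalPlan {
        for op in plan.filters.iter_mut() {
            if *op == Op::InfluxRegexMatch {
                *op = Op::RegexMatch;
            }
        }
        plan
    }
}

/// Applies all builtin passes first, then the IOx-specific ones.
fn optimize(
    mut plan: LogicalPlan,
    builtin: &[&dyn OptimizerPass],
    iox_specific: &[&dyn OptimizerPass],
) -> LogicalPlan {
    for pass in builtin.iter().chain(iox_specific.iter()) {
        plan = pass.rewrite(plan);
        plan.applied.push(pass.name());
    }
    plan
}
```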

Physical Planning

Physical planning transforms the LogicalPlan into an ExecutionPlan.

These are the steps:

DataFusion lowers LogicalPlan to ExecutionPlan
- While doing so it calls IOx code to transform table scans into concrete physical operators
Apply physical optimizer passes to the ExecutionPlan
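
The two steps above can be sketched as a recursive lowering with a callback for table scans, followed by a rewrite of the physical tree. All names here are hypothetical (`LogicalNode`, `PhysicalNode`, `ScanProvider`, `FixedChunks`, `drop_trivial_filters`); the sketch only assumes that the generic lowering code delegates table scans to system-specific code.

```rust
#[derive(Debug, Clone, PartialEq)]
enum LogicalNode {
    TableScan { table: String },
    Filter { predicate: String, input: Box<LogicalNode> },
}

#[derive(Debug, Clone, PartialEq)]
enum PhysicalNode {
    ChunkScan { chunks: Vec<String> },
    Filter { predicate: String, input: Box<PhysicalNode> },
}

/// Callback that the generic lowering uses for table scans (assumption).
trait ScanProvider {
    fn scan(&self, table: &str) -> PhysicalNode;
}

/// Toy provider that always reports the same chunks for any table.
struct FixedChunks {
    chunks: Vec<String>,
}

impl ScanProvider for FixedChunks {
    fn scan(&self, table: &str) -> PhysicalNode {
        PhysicalNode::ChunkScan {
            chunks: self.chunks.iter().map(|c| format!("{}/{}", table, c)).collect(),
        }
    }
}

/// Step 1: lower the logical tree; table scans are delegated to the provider.
fn lower(node: &LogicalNode, provider: &dyn ScanProvider) -> PhysicalNode {
    match node {
        LogicalNode::TableScan { table } => provider.scan(table),
        LogicalNode::Filter { predicate, input } => PhysicalNode::Filter {
            predicate: predicate.clone(),
            input: Box::new(lower(input, provider)),
        },
    }
}

/// Step 2: a toy physical optimizer pass that removes `true` filters.
fn drop_trivial_filters(node: PhysicalNode) -> PhysicalNode {
    match node {
        PhysicalNode::Filter { predicate, input } if predicate == "true" => {
            drop_trivial_filters(*input)
        }
        PhysicalNode::Filter { predicate, input } => PhysicalNode::Filter {
            predicate,
            input: Box::new(drop_trivial_filters(*input)),
        },
        scan => scan,
    }
}
```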

For more details, see:

Data Flow

This is a detailed data flow from the querier's point of view:

flowchart TB
    classDef cache color:#020A47,fill:#9394FF,stroke-width:0
    classDef external color:#FFFFFF,fill:#9B2AFF,stroke-width:0
    classDef intermediate color:#020A47,fill:#D6F622,stroke-width:0
    classDef processor color:#FFFFFF,fill:#D30971,stroke-width:0
    classDef systemIO color:#020A47,fill:#5EE4E4,stroke-width:0

    NamespaceName[Namespace Name]:::systemIO
    SqlQuery[SQL Query]:::systemIO
    Result[Result]:::systemIO

    Catalog[/Catalog/]:::external
    Ingester[/Ingester/]:::external
    ObjectStore[/Object Store/]:::external

    NamespaceCache[Namespace Cache]:::cache
    OSCache[Object Store Cache]:::cache
    ParquetCache[Parquet File Cache]:::cache
    PartitionCache[Partition Cache]:::cache
    ProjectedSchemaCache[Projected Schema Cache]:::cache

    CachedNamespace[Cached Namespace]:::intermediate
    LogicalPlan[Logical Plan]:::intermediate
    ExecutionPlan[Execution Plan]:::intermediate
    ParquetBytes[Parquet Bytes]:::intermediate

    LogicalPlanner[LogicalPlanner]:::processor
    PhysicalPlanner[PhysicalPlanner]:::processor
    QueryExec[Query Execution]:::processor

    %% help layout engine a bit
    ProjectedSchemaCache --- PartitionCache
    linkStyle 0 stroke-width:0px

    Catalog --> NamespaceCache
    Catalog --> ParquetCache
    Catalog --> PartitionCache

    ObjectStore --> OSCache
    OSCache --> ParquetBytes

    NamespaceName --> NamespaceCache
    NamespaceCache --> CachedNamespace
    SqlQuery --> LogicalPlanner
    LogicalPlanner --> LogicalPlan

    CachedNamespace --> CachedTable
    LogicalPlan --> IngesterRequest
    IngesterRequest --> Ingester
    Ingester --> IngesterResponse
    ParquetCache --> ParquetFileMD1
    PartitionCache --> ColumnRanges
    PartitionCache --> SortKey
    ProjectedSchemaCache --> ProjectedSchema

    subgraph table [Querier Table]
        ArrowSchema[ArrowSchema]:::intermediate
        CachedTable[Cached Table]:::intermediate
        ColumnRanges[Column Ranges]:::intermediate
        IngesterChunks[Ingester Chunks]:::intermediate
        IngesterRequest[Ingester Request]:::intermediate
        IngesterResponse[Ingester Partitions]:::intermediate
        IngesterWatermark[Ingester Watermark]:::intermediate
        ParquetChunks[Parquet Chunks]:::intermediate
        ParquetFileMD1[Parquet File MD]:::intermediate
        ParquetFileMD2[Parquet File MD]:::intermediate
        ProjectedSchema[ProjectedSchema]:::intermediate
        SortKey[SortKey]:::intermediate
        QueryChunks1[Query Chunks]:::intermediate
        QueryChunks2[Query Chunks]:::intermediate

        ChunkAdapter[ChunkAdapter]:::processor
        IngesterDecoder[Ingester Decoder]:::processor
        PreFilter[Pre-filter]:::processor
        Pruning[Pruning]:::processor

        CachedTable --> ArrowSchema

        ColumnRanges --> IngesterDecoder
        IngesterResponse --> IngesterDecoder
        IngesterDecoder --> IngesterChunks
        IngesterDecoder --> IngesterWatermark

        ParquetFileMD1 --> PreFilter
        PreFilter --> ParquetFileMD2
        ParquetFileMD2 --> ChunkAdapter
        ColumnRanges --> ChunkAdapter
        SortKey --> ChunkAdapter
        ProjectedSchema --> ChunkAdapter
        ChunkAdapter --> ParquetChunks

        IngesterChunks --> QueryChunks1
        ParquetChunks --> QueryChunks1
        QueryChunks1 --> Pruning
        Pruning --> QueryChunks2
    end

    style table color:#020A47,fill:#00000000,stroke:#020A47,stroke-dasharray:20

    ArrowSchema --> LogicalPlanner
    CachedTable --> PartitionCache
    CachedTable --> ProjectedSchemaCache

    IngesterChunks -.-> PartitionCache
    ParquetFileMD2 -.-> PartitionCache
    IngesterWatermark -.-> ParquetCache
    LogicalPlan -.-> NamespaceCache
    ParquetFileMD1 -.-> NamespaceCache

    QueryChunks2 --> PhysicalPlanner
    LogicalPlan --> PhysicalPlanner
    PhysicalPlanner --> ExecutionPlan
    ExecutionPlan --> QueryExec
    ParquetBytes --> QueryExec
    QueryExec --> Result

Legend:

flowchart TB
    classDef cache color:#020A47,fill:#9394FF,stroke-width:0
    classDef external color:#FFFFFF,fill:#9B2AFF,stroke-width:0
    classDef intermediate color:#020A47,fill:#D6F622,stroke-width:0
    classDef processor color:#FFFFFF,fill:#D30971,stroke-width:0
    classDef systemIO color:#020A47,fill:#5EE4E4,stroke-width:0
    classDef helper color:#020A47,fill:#020A47,stroke-width:0

    n_c[Cache]:::cache
    n_e[/External System/]:::external
    n_i[Intermediate Result]:::intermediate
    n_p[Processor]:::processor
    n_s[System Input and Output]:::systemIO

    a((xxx)):::helper -->|data flow| b((xxx)):::helper
    c((xxx)):::helper -.->|cache invalidation| d((xxx)):::helper
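
The Pruning step in the data flow can be illustrated with a range check. This is a minimal sketch under the assumption that each chunk carries a min/max range for the time column and that the query restricts time to a half-open interval; `Chunk` and `prune` are hypothetical names, and real pruning may consider other columns as well.

```rust
/// Toy chunk with the min/max of its time column (both inclusive).
#[derive(Debug, Clone, PartialEq)]
struct Chunk {
    name: String,
    time_min: i64,
    time_max: i64,
}

/// Keeps only the chunks whose time range overlaps the query range
/// `[start, end)`. Chunks that cannot contain matching rows are discarded
/// before physical planning, so they are never read.
fn prune(chunks: Vec<Chunk>, start: i64, end: i64) -> Vec<Chunk> {
    chunks
        .into_iter()
        .filter(|c| c.time_max >= start && c.time_min < end)
        .collect()
}
```

Pruning only ever removes chunks that are guaranteed not to match; a chunk that overlaps the range is kept even if it later turns out to contain no matching rows.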

Caches

Each querier process has a set of in-memory caches. These are:

| Name | Pool | Backing System | Key | Value | Invalidation / TTL / Refreshes | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| Namespace | Metadata | Catalog | Namespace Name | `CachedNamespace` | refresh policy, TTL, invalidation by unknown table/columns | Unknown entries NOT cached (assumes upstream DDoS protection) |
| Object Store | Data | Object Store | Path | Raw object store bytes for the entire object | -- | |
| Parquet File | Metadata | Catalog | Table ID | Parquet files (all the data that the catalog has, i.e. the entire row) for all files that are NOT marked for deletion. | TTL, but no refresh yet (see #5718), can be invalidated by ingester watermark. | |
| Partition | Metadata | Catalog | Partition ID | `CachedPartition` | Invalidated if ingester data or any parquet file has columns that are NOT covered by the sort key. | Needs `CachedTable` for access |
| Projected Schema | Metadata | Querier | Table ID, Column IDs | `ProjectedSchema` | -- | Needs `CachedTable` for access |

Note that ALL caches have an LRU eviction policy bound to the specified pool.
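
As an illustration of LRU eviction bound to a pool, the following sketch evicts the least recently used entries once the pool's byte budget is exceeded. It is a simplified model, not the querier's cache system: `LruPool` is a hypothetical type, sizes are measured as the byte length of the value only, and the eviction scan is linear for brevity.

```rust
use std::collections::HashMap;

/// Toy memory pool with a byte budget and LRU eviction.
struct LruPool {
    capacity_bytes: usize,
    used_bytes: usize,
    /// Logical clock; incremented on every access.
    tick: u64,
    /// key -> (value, tick of last use)
    entries: HashMap<String, (Vec<u8>, u64)>,
}

impl LruPool {
    fn new(capacity_bytes: usize) -> Self {
        LruPool {
            capacity_bytes,
            used_bytes: 0,
            tick: 0,
            entries: HashMap::new(),
        }
    }

    /// Returns the value and marks the entry as most recently used.
    fn get(&mut self, key: &str) -> Option<&[u8]> {
        self.tick += 1;
        let tick = self.tick;
        let entry = self.entries.get_mut(key)?;
        entry.1 = tick;
        Some(entry.0.as_slice())
    }

    /// Inserts a value, then evicts least recently used entries until the
    /// pool is within its budget again.
    fn insert(&mut self, key: &str, value: Vec<u8>) {
        self.tick += 1;
        if let Some((old, _)) = self.entries.remove(key) {
            self.used_bytes -= old.len();
        }
        self.used_bytes += value.len();
        self.entries.insert(key.to_string(), (value, self.tick));

        while self.used_bytes > self.capacity_bytes {
            // Linear scan for the entry with the oldest last-use tick.
            let victim = self
                .entries
                .iter()
                .min_by_key(|(_, (_, last_used))| *last_used)
                .map(|(k, _)| k.clone());
            match victim {
                Some(k) => {
                    if let Some((v, _)) = self.entries.remove(&k) {
                        self.used_bytes -= v.len();
                    }
                }
                None => break,
            }
        }
    }
}
```

Because the budget belongs to the pool, several caches sharing one pool would compete for the same bytes; this sketch models only a single cache.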

Cached Objects

The following objects are stored within the aforementioned caches.

`CachedNamespace`

namespace ID
retention policy
map from table name to CachedTable (both wrapped in `Arc`)

`CachedPartition`

sort key
column ranges (decoded from partition key using the partition template)

`CachedTable`

table ID
schema
column ID => column name map
column name => column ID map (i.e. the reverse of the above)
column IDs of primary key columns
partition template

`ProjectedSchema`

Arrow schema projected from the table schema for a specific subset of columns (since some chunks do not contain all the columns). Mostly done to optimize memory usage, i.e. some form of interning.
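
The interning idea can be sketched as a map from (table ID, column IDs) to a shared, reference-counted schema. This is an illustration under assumptions: `ProjectedSchemaCache` and its `get` method are hypothetical, the schema is reduced to a list of column names, and the column IDs are sorted to normalize the key.

```rust
use std::collections::HashMap;
use std::sync::Arc;

type TableId = i64;
type ColumnId = i64;

/// Toy projected schema: just the names of the selected columns.
#[derive(Debug, PartialEq)]
struct ProjectedSchema {
    columns: Vec<String>,
}

/// Interns projected schemas so that chunks with the same column subset
/// share a single allocation.
struct ProjectedSchemaCache {
    entries: HashMap<(TableId, Vec<ColumnId>), Arc<ProjectedSchema>>,
}

impl ProjectedSchemaCache {
    fn new() -> Self {
        ProjectedSchemaCache { entries: HashMap::new() }
    }

    /// Returns the shared schema for this projection, building it on first use.
    fn get(
        &mut self,
        table_id: TableId,
        table_columns: &HashMap<ColumnId, String>,
        mut projection: Vec<ColumnId>,
    ) -> Arc<ProjectedSchema> {
        // Normalize the key so that [3, 1] and [1, 3] hit the same entry.
        projection.sort_unstable();
        projection.dedup();
        let entry = self
            .entries
            .entry((table_id, projection.clone()))
            .or_insert_with(|| {
                let columns = projection
                    .iter()
                    .filter_map(|id| table_columns.get(id).cloned())
                    .collect();
                Arc::new(ProjectedSchema { columns })
            });
        Arc::clone(entry)
    }
}
```

Two lookups with the same column subset return pointers to the same allocation, which is where the memory saving comes from.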
