178 lines
5.9 KiB
Markdown
178 lines
5.9 KiB
Markdown
---
|
|
title: Use the PyArrow library to analyze data
|
|
list_title: PyArrow
|
|
description: >
|
|
Use [PyArrow](https://arrow.apache.org/docs/python/) to read and analyze InfluxDB query results.
|
|
weight: 201
|
|
menu:
|
|
influxdb3_clustered:
|
|
parent: Use data analysis tools
|
|
name: Use PyArrow
|
|
identifier: analyze_with_pyarrow
|
|
influxdb3/clustered/tags: [analysis, arrow, pyarrow, python]
|
|
related:
|
|
- /influxdb3/clustered/process-data/tools/pandas/
|
|
- /influxdb3/clustered/query-data/sql/
|
|
- /influxdb3/clustered/query-data/execute-queries/client-libraries/python/
|
|
aliases:
|
|
- /influxdb3/clustered/visualize-data/pyarrow/
|
|
list_code_example: |
|
|
```py
|
|
...
|
|
table = client.query(
|
|
'''SELECT *
|
|
FROM home
|
|
WHERE time >= now() - INTERVAL '90 days'
|
|
ORDER BY time'''
|
|
)
|
|
table.group_by('room').aggregate([('temp', 'mean')])
|
|
```
|
|
---
|
|
|
|
Use [PyArrow](https://arrow.apache.org/docs/python/) to read and analyze query
|
|
results from {{% product-name %}}.
|
|
The PyArrow library provides efficient computation, aggregation, serialization,
|
|
and conversion of Arrow format data.
|
|
|
|
> Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable
|
|
> big data systems to store, process and move data fast.
|
|
>
|
|
> The Arrow Python bindings (also named “PyArrow”) have first-class integration with NumPy, pandas, and built-in Python objects. They are based on the C++ implementation of Arrow.
|
|
> {{% caption %}}[PyArrow documentation](https://arrow.apache.org/docs/python/index.html){{% /caption %}}
|
|
|
|
<!-- TOC -->
|
|
|
|
- [Install prerequisites](#install-prerequisites)
|
|
- [Use PyArrow to read query results](#use-pyarrow-to-read-query-results)
|
|
- [Use PyArrow to analyze data](#use-pyarrow-to-analyze-data)
|
|
- [Group and aggregate data](#group-and-aggregate-data)
|
|
|
|
<!-- /TOC -->
|
|
|
|
## Install prerequisites
|
|
|
|
The examples in this guide assume using a Python virtual environment and the InfluxDB 3 [`influxdb3-python` Python client library](/influxdb3/clustered/reference/client-libraries/v3/python/).
|
|
For more information, see how to [get started using Python to query InfluxDB](/influxdb3/clustered/query-data/execute-queries/client-libraries/python/).
|
|
|
|
Installing `influxdb3-python` also installs the [`pyarrow`](https://arrow.apache.org/docs/python/index.html) library that provides Python bindings for Apache Arrow.
|
|
|
|
## Use PyArrow to read query results
|
|
|
|
The following example shows how to use `influxdb3-python` and `pyarrow` to query InfluxDB and view Arrow data as a PyArrow `Table`.
|
|
|
|
1. In your editor, copy and paste the following sample code to a new file--for example, `pyarrow-example.py`:
|
|
|
|
{{% tabs-wrapper %}}
|
|
{{% code-placeholders "DATABASE_NAME|DATABASE_TOKEN" %}}
|
|
```py
|
|
# pyarrow-example.py
|
|
|
|
from influxdb_client_3 import InfluxDBClient3
|
|
import pandas
|
|
|
|
def querySQL():
|
|
|
|
# Instantiate an InfluxDB client configured for a database
|
|
client = InfluxDBClient3(
|
|
"https://{{< influxdb/host >}}",
|
|
database="DATABASE_NAME",
|
|
token="DATABASE_TOKEN")
|
|
|
|
# Execute the query to retrieve all record batches in the stream formatted as a PyArrow Table.
|
|
table = client.query(
|
|
'''SELECT *
|
|
FROM home
|
|
WHERE time >= now() - INTERVAL '90 days'
|
|
ORDER BY time'''
|
|
)
|
|
|
|
client.close()
|
|
|
|
print(querySQL())
|
|
```
|
|
{{% /code-placeholders %}}
|
|
{{% /tabs-wrapper %}}
|
|
|
|
2. Replace the following configuration values:
|
|
|
|
- {{% code-placeholder-key %}}`DATABASE_TOKEN`{{% /code-placeholder-key %}}:
|
|
a [database token](/influxdb3/clustered/admin/tokens/#database-tokens)
|
|
with read permissions on the databases you want to query
|
|
- {{% code-placeholder-key %}}`DATABASE_NAME`{{% /code-placeholder-key %}}: the name of the [database](/influxdb3/clustered/admin/databases/) to query
|
|
|
|
3. In your terminal, use the Python interpreter to run the file:
|
|
|
|
```sh
|
|
python pyarrow-example.py
|
|
```
|
|
|
|
The `InfluxDBClient3.query()` method sends the query request, and then returns a [`pyarrow.Table`](https://arrow.apache.org/docs/python/generated/pyarrow.Table.html) that contains all the Arrow record batches from the response stream.
|
|
|
|
Next, [use PyArrow to analyze data](#use-pyarrow-to-analyze-data).
|
|
|
|
## Use PyArrow to analyze data
|
|
|
|
### Group and aggregate data
|
|
|
|
With a `pyarrow.Table`, you can use values in a column as _keys_ for grouping.
|
|
|
|
The following example shows how to query InfluxDB, and then use PyArrow to group the table data and calculate an aggregate value for each group:
|
|
|
|
{{% code-placeholders "DATABASE_NAME|DATABASE_TOKEN" %}}
|
|
```py
|
|
# pyarrow-example.py
|
|
|
|
from influxdb_client_3 import InfluxDBClient3
|
|
import pandas
|
|
|
|
def querySQL():
|
|
|
|
# Instantiate an InfluxDB client configured for a database
|
|
client = InfluxDBClient3(
|
|
"https://{{< influxdb/host >}}",
|
|
database="DATABASE_NAME",
|
|
token="DATABASE_TOKEN")
|
|
|
|
# Execute the query to retrieve data
|
|
# formatted as a PyArrow Table
|
|
table = client.query(
|
|
'''SELECT *
|
|
FROM home
|
|
WHERE time >= now() - INTERVAL '90 days'
|
|
ORDER BY time'''
|
|
)
|
|
|
|
client.close()
|
|
|
|
return table
|
|
|
|
table = querySQL()
|
|
|
|
# Use PyArrow to aggregate data
|
|
print(table.group_by('room').aggregate([('temp', 'mean')]))
|
|
```
|
|
{{% /code-placeholders %}}
|
|
|
|
Replace the following:
|
|
|
|
- {{% code-placeholder-key %}}`DATABASE_TOKEN`{{% /code-placeholder-key %}}:
|
|
a [database token](/influxdb3/clustered/admin/tokens/#database-tokens)
|
|
with read permissions on the databases you want to query
|
|
- {{% code-placeholder-key %}}`DATABASE_NAME`{{% /code-placeholder-key %}}:
|
|
the name of the [database](/influxdb3/clustered/admin/databases/) to query
|
|
|
|
{{< expand-wrapper >}}
|
|
{{% expand "View example results" %}}
|
|
```arrow
|
|
pyarrow.Table
|
|
temp_mean: double
|
|
room: string
|
|
----
|
|
temp_mean: [[22.581987577639747,22.10807453416151]]
|
|
room: [["Kitchen","Living Room"]]
|
|
```
|
|
{{% /expand %}}
|
|
{{< /expand-wrapper >}}
|
|
|
|
For more detail and examples, see the [PyArrow documentation](https://arrow.apache.org/docs/python/getstarted.html) and the [Apache Arrow Python Cookbook](https://arrow.apache.org/cookbook/py/data.html).
|