5.9 KiB
title | list_title | description | weight | menu | influxdb3/clustered/tags | related | aliases | list_code_example | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Use the PyArrow library to analyze data | PyArrow | Use [PyArrow](https://arrow.apache.org/docs/python/) to read and analyze InfluxDB query results. | 201 |
|
|
|
|
```py ... table = client.query( '''SELECT * FROM home WHERE time >= now() - INTERVAL '90 days' ORDER BY time''' ) table.group_by('room').aggregate([('temp', 'mean')]) ``` |
Use PyArrow to read and analyze query results from {{% product-name %}}. The PyArrow library provides efficient computation, aggregation, serialization, and conversion of Arrow format data.
Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to store, process and move data fast.
The Arrow Python bindings (also named “PyArrow”) have first-class integration with NumPy, pandas, and built-in Python objects. They are based on the C++ implementation of Arrow. {{% caption %}}PyArrow documentation{{% /caption %}}
Install prerequisites
The examples in this guide assume using a Python virtual environment and the InfluxDB 3 influxdb3-python
Python client library.
For more information, see how to get started using Python to query InfluxDB.
Installing influxdb3-python
also installs the pyarrow
library that provides Python bindings for Apache Arrow.
Use PyArrow to read query results
The following example shows how to use influxdb3-python
and pyarrow
to query InfluxDB and view Arrow data as a PyArrow Table
.
-
In your editor, copy and paste the following sample code to a new file--for example,
pyarrow-example.py
:{{% tabs-wrapper %}} {{% code-placeholders "DATABASE_NAME|DATABASE_TOKEN" %}}
# pyarrow-example.py
from influxdb_client_3 import InfluxDBClient3
import pandas
def querySQL():
# Instantiate an InfluxDB client configured for a database
client = InfluxDBClient3(
"https://{{< influxdb/host >}}",
database="DATABASE_NAME",
token="DATABASE_TOKEN")
# Execute the query to retrieve all record batches in the stream formatted as a PyArrow Table.
table = client.query(
'''SELECT *
FROM home
WHERE time >= now() - INTERVAL '90 days'
ORDER BY time'''
)
client.close()
print(querySQL())
{{% /code-placeholders %}} {{% /tabs-wrapper %}}
-
Replace the following configuration values:
- {{% code-placeholder-key %}}
DATABASE_TOKEN
{{% /code-placeholder-key %}}: a database token with read permissions on the databases you want to query - {{% code-placeholder-key %}}
DATABASE_NAME
{{% /code-placeholder-key %}}: the name of the database to query
- {{% code-placeholder-key %}}
-
In your terminal, use the Python interpreter to run the file:
python pyarrow-example.py
The InfluxDBClient3.query()
method sends the query request, and then returns a pyarrow.Table
that contains all the Arrow record batches from the response stream.
Next, use PyArrow to analyze data.
Use PyArrow to analyze data
Group and aggregate data
With a pyarrow.Table
, you can use values in a column as keys for grouping.
The following example shows how to query InfluxDB, and then use PyArrow to group the table data and calculate an aggregate value for each group:
{{% code-placeholders "DATABASE_NAME|DATABASE_TOKEN" %}}
# pyarrow-example.py
from influxdb_client_3 import InfluxDBClient3
import pandas
def querySQL():
# Instantiate an InfluxDB client configured for a database
client = InfluxDBClient3(
"https://{{< influxdb/host >}}",
database="DATABASE_NAME",
token="DATABASE_TOKEN")
# Execute the query to retrieve data
# formatted as a PyArrow Table
table = client.query(
'''SELECT *
FROM home
WHERE time >= now() - INTERVAL '90 days'
ORDER BY time'''
)
client.close()
return table
table = querySQL()
# Use PyArrow to aggregate data
print(table.group_by('room').aggregate([('temp', 'mean')]))
{{% /code-placeholders %}}
Replace the following:
- {{% code-placeholder-key %}}
DATABASE_TOKEN
{{% /code-placeholder-key %}}: a database token with read permissions on the databases you want to query - {{% code-placeholder-key %}}
DATABASE_NAME
{{% /code-placeholder-key %}}: the name of the database to query
{{< expand-wrapper >}} {{% expand "View example results" %}}
pyarrow.Table
temp_mean: double
room: string
----
temp_mean: [[22.581987577639747,22.10807453416151]]
room: [["Kitchen","Living Room"]]
{{% /expand %}} {{< /expand-wrapper >}}
For more detail and examples, see the PyArrow documentation and the Apache Arrow Python Cookbook.