docs-v2/content/influxdb3/cloud-serverless/process-data/tools/pyarrow.md

6.0 KiB

title list_title description weight menu influxdb3/cloud-serverless/tags related aliases list_code_example
Use the PyArrow library to analyze data PyArrow Use [PyArrow](https://arrow.apache.org/docs/python/) to read and analyze InfluxDB query results from InfluxDB Cloud Serverless. 101
influxdb3_cloud_serverless
parent name identifier
Use data analysis tools Use PyArrow analyze-with-pyarrow
analysis
arrow
pyarrow
python
/influxdb3/cloud-serverless/process-data/tools/pandas/
/influxdb3/cloud-serverless/query-data/sql/
/influxdb3/cloud-serverless/query-data/execute-queries/client-libraries/python/
/influxdb3/cloud-serverless/visualize-data/pyarrow/
```py ... table = client.query( '''SELECT * FROM home WHERE time >= now() - INTERVAL '90 days' ORDER BY time''' ) table.group_by('room').aggregate([('temp', 'mean')]) ```

Use PyArrow to read and analyze query results from {{% product-name %}}. The PyArrow library provides efficient computation, aggregation, serialization, and conversion of Arrow format data.

Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to store, process and move data fast.

The Arrow Python bindings (also named “PyArrow”) have first-class integration with NumPy, pandas, and built-in Python objects. They are based on the C++ implementation of Arrow. {{% caption %}}PyArrow documentation{{% /caption %}}

Install prerequisites

The examples in this guide assume using a Python virtual environment and the InfluxDB 3 influxdb3-python Python client library. For more information, see how to get started using Python to query InfluxDB.

Installing influxdb3-python also installs the pyarrow library that provides Python bindings for Apache Arrow.

Use PyArrow to read query results

The following example shows how to use influxdb3-python and pyarrow to query InfluxDB and view Arrow data as a PyArrow Table.

  1. In your editor, copy and paste the following sample code to a new file--for example, pyarrow-example.py:

    {{% tabs-wrapper %}} {{% code-placeholders "BUCKET_NAME|API_TOKEN" %}}

# pyarrow-example.py

from influxdb_client_3 import InfluxDBClient3
import pandas

def querySQL():
  
  # Instantiate an InfluxDB client configured for a bucket
  client = InfluxDBClient3(
    "https://{{< influxdb/host >}}",
    database="BUCKET_NAME",
    token="API_TOKEN")

  # Execute the query to retrieve all record batches in the stream formatted as a PyArrow Table.
  table = client.query(
    '''SELECT *
      FROM home
      WHERE time >= now() - INTERVAL '90 days'
      ORDER BY time'''
  )

  client.close()

print(querySQL())

{{% /code-placeholders %}} {{% /tabs-wrapper %}}

  1. Replace the following configuration values:

    • {{% code-placeholder-key %}}API_TOKEN{{% /code-placeholder-key %}}: An InfluxDB token with read permissions on the buckets you want to query.
    • {{% code-placeholder-key %}}BUCKET_NAME{{% /code-placeholder-key %}}: The name of the InfluxDB bucket to query.
  2. In your terminal, use the Python interpreter to run the file:

    python pyarrow-example.py
    

The InfluxDBClient3.query() method sends the query request, and then returns a pyarrow.Table that contains all the Arrow record batches from the response stream.

Next, use PyArrow to analyze data.

Use PyArrow to analyze data

Group and aggregate data

With a pyarrow.Table, you can use values in a column as keys for grouping.

The following example shows how to query InfluxDB, and then use PyArrow to group the table data and calculate an aggregate value for each group:

{{% code-placeholders "BUCKET_NAME|API_TOKEN" %}}

# pyarrow-example.py

from influxdb_client_3 import InfluxDBClient3
import pandas

def querySQL():
  
  # Instantiate an InfluxDB client configured for a bucket
  client = InfluxDBClient3(
    "https://{{< influxdb/host >}}",
    database="BUCKET_NAME",
    token="API_TOKEN")

  # Execute the query to retrieve data 
  # formatted as a PyArrow Table
  table = client.query(
    '''SELECT *
      FROM home
      WHERE time >= now() - INTERVAL '90 days'
      ORDER BY time'''
  )

  client.close()

  return table

table = querySQL()

# Use PyArrow to aggregate data
print(table.group_by('room').aggregate([('temp', 'mean')]))

{{% /code-placeholders %}}

Replace the following:

  • {{% code-placeholder-key %}}API_TOKEN{{% /code-placeholder-key %}}: An InfluxDB token with read permissions on the buckets you want to query.
  • {{% code-placeholder-key %}}BUCKET_NAME{{% /code-placeholder-key %}}: The name of the InfluxDB bucket to query.

{{< expand-wrapper >}} {{% expand "View example results" %}}

pyarrow.Table
temp_mean: double
room: string
----
temp_mean: [[22.581987577639747,22.10807453416151]]
room: [["Kitchen","Living Room"]]

{{% /expand %}} {{< /expand-wrapper >}}

For more detail and examples, see the PyArrow documentation and the Apache Arrow Python Cookbook.