Database Access for Data Analysis#

The MAPLE Platform provides secure database access for data scientists and analysts to query time-series data using Python, pandas, and other analysis tools.

Overview#

The MAPLE Platform uses dedicated database users for each service, following the principle of least privilege:

| User | Purpose | Permissions | Used By |
|---|---|---|---|
| postgres | Superuser | Full control | Database administration only |
| telegraf_user | Metrics ingestion | CREATE tables, INSERT/UPDATE data | Telegraf service |
| nodered_user | Data processing | SELECT/INSERT/UPDATE/DELETE | Node-RED service |
| readonly_user | Data analysis | SELECT only (read-only) | Python scripts, data analysts |
These users are created automatically by the db-init service during first startup.

Quick Start#

Connection Information#

The read-only user credentials are provided in your .env file after running the password generation step.

Connection String:

postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf

To view your password:

cat .env | grep READONLY_PASSWORD

Note: When connecting from outside the Docker network (e.g., from your local machine), use localhost as the host. When connecting from within the Docker network (e.g., from another container), use timescaledb as the host.
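
If the same script runs both on your host machine and inside a container, reading the host and password from environment variables avoids hardcoding either. A minimal sketch (the DB_HOST variable name is an assumption; READONLY_PASSWORD matches the .env entry above):

import os
from sqlalchemy import create_engine

# DB_HOST is a hypothetical variable: set it to "timescaledb" inside Docker; it defaults to "localhost" outside
host = os.environ.get('DB_HOST', 'localhost')
password = os.environ['READONLY_PASSWORD']

engine = create_engine(f'postgresql://readonly_user:{password}@{host}:5432/leaf')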

Basic Python Example#

import pandas as pd
from sqlalchemy import create_engine

# Create connection
engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Query data into DataFrame
df = pd.read_sql("""
    SELECT * FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '24 hours'
    LIMIT 1000
""", engine)

# Analyze data
print(df.describe())
print(df.head())
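
The examples in this guide inline values in the SQL for readability. For reusable scripts you may prefer bound parameters over string formatting; a sketch using SQLAlchemy's text() construct (the sensor ID is a placeholder value):

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Bind the sensor ID instead of interpolating it into the SQL string
query = text("""
    SELECT * FROM sensor_data
    WHERE sensor_id = :sensor_id
      AND timestamp > NOW() - INTERVAL '24 hours'
    LIMIT 1000
""")

df = pd.read_sql(query, engine, params={"sensor_id": "sensor_001"})
print(df.head())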

Python Setup#

Installing Required Packages#

pip install pandas sqlalchemy psycopg2-binary

For Jupyter notebooks:

pip install pandas sqlalchemy psycopg2-binary jupyter matplotlib seaborn

Usage Examples#

Example 1: Time-Series Query with TimescaleDB#

TimescaleDB provides powerful time-series functions like time_bucket for aggregation:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Hourly aggregation
df = pd.read_sql("""
    SELECT
        time_bucket('1 hour', timestamp) AS hour,
        sensor_id,
        AVG(temperature) as avg_temp,
        MAX(temperature) as max_temp,
        MIN(temperature) as min_temp,
        COUNT(*) as sample_count
    FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '7 days'
    GROUP BY hour, sensor_id
    ORDER BY hour DESC
""", engine)

print(df)

Example 2: Data Visualization#

import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Load 24 hours of temperature data
df = pd.read_sql("""
    SELECT
        time_bucket('5 minutes', timestamp) AS time,
        AVG(temperature) as temperature
    FROM sensor_data
    WHERE sensor_id = 'sensor_001'
      AND timestamp > NOW() - INTERVAL '24 hours'
    GROUP BY time
    ORDER BY time
""", engine)

# Plot
df.set_index('time')['temperature'].plot(figsize=(12, 6))
plt.title('Temperature Over Last 24 Hours')
plt.ylabel('Temperature (°C)')
plt.xlabel('Time')
plt.grid(True)
plt.show()

Example 3: Multiple Sensors Comparison#

import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Compare multiple sensors
df = pd.read_sql("""
    SELECT
        time_bucket('10 minutes', timestamp) AS time,
        sensor_id,
        AVG(temperature) as avg_temperature
    FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '24 hours'
      AND sensor_id IN ('sensor_001', 'sensor_002', 'sensor_003')
    GROUP BY time, sensor_id
    ORDER BY time
""", engine)

# Pivot for plotting
df_pivot = df.pivot(index='time', columns='sensor_id', values='avg_temperature')

# Plot
df_pivot.plot(figsize=(14, 6))
plt.title('Temperature Comparison: Multiple Sensors')
plt.ylabel('Temperature (°C)')
plt.xlabel('Time')
plt.legend(title='Sensor')
plt.grid(True)
plt.show()

Example 4: Statistical Analysis#

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Calculate statistics per sensor
stats = pd.read_sql("""
    SELECT
        sensor_id,
        COUNT(*) as total_readings,
        AVG(temperature) as mean_temp,
        STDDEV(temperature) as std_temp,
        MIN(temperature) as min_temp,
        MAX(temperature) as max_temp,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY temperature) as median_temp
    FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '30 days'
    GROUP BY sensor_id
    ORDER BY sensor_id
""", engine)

print(stats)

Example 5: Jupyter Notebook Workflow#

# Cell 1: Setup
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine

%matplotlib inline
sns.set_style("whitegrid")

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Cell 2: Load Data
df = pd.read_sql("""
    SELECT timestamp, sensor_id, temperature, humidity
    FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '7 days'
""", engine)

print(f"Loaded {len(df)} rows")
df.head()

# Cell 3: Data Cleaning
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.dropna()
df = df.set_index('timestamp').sort_index()  # time index so the plots in Cell 4 use the time axis
df.info()

# Cell 4: Visualization
fig, axes = plt.subplots(2, 1, figsize=(14, 8))

df.groupby('sensor_id')['temperature'].plot(ax=axes[0], legend=True)
axes[0].set_title('Temperature by Sensor')
axes[0].set_ylabel('Temperature (°C)')

df.groupby('sensor_id')['humidity'].plot(ax=axes[1], legend=True)
axes[1].set_title('Humidity by Sensor')
axes[1].set_ylabel('Humidity (%)')

plt.tight_layout()
plt.show()

# Cell 5: Export Results
summary = df.groupby('sensor_id').agg({
    'temperature': ['mean', 'std', 'min', 'max'],
    'humidity': ['mean', 'std', 'min', 'max']
})

summary.to_csv('sensor_summary.csv')
print("Summary exported to sensor_summary.csv")

TimescaleDB Functions#

Time Bucketing#

Aggregate data into time intervals:

SELECT time_bucket('5 minutes', timestamp) AS bucket,
       AVG(value) AS avg_value
FROM sensor_data
GROUP BY bucket
ORDER BY bucket;

time_bucket accepts arbitrary fixed intervals, for example: 1 second, 5 minutes, 1 hour, 1 day, 1 week
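
If you prefer to bucket on the client side instead, pandas resample gives a comparable result once the timestamp column is the index; a small sketch using the same sensor_data columns as above:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Pull raw rows, then aggregate into 5-minute buckets in pandas
df = pd.read_sql("""
    SELECT timestamp, temperature FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '24 hours'
""", engine)

df['timestamp'] = pd.to_datetime(df['timestamp'])
buckets = df.set_index('timestamp')['temperature'].resample('5min').mean()
print(buckets.head())

For large tables, bucketing in SQL with time_bucket is usually preferable because far less data crosses the network.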

First/Last Values#

Get first and last values in a time range:

SELECT first(value, timestamp) AS first_value,
       last(value, timestamp) AS last_value
FROM sensor_data
WHERE timestamp > NOW() - INTERVAL '1 hour';
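
first and last combine naturally with time_bucket and GROUP BY, for example to get the earliest and latest reading per sensor per day; a sketch using the same sensor_data columns as the examples above:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Earliest and latest temperature reading per sensor for each day of the past week
df = pd.read_sql("""
    SELECT
        time_bucket('1 day', timestamp) AS day,
        sensor_id,
        first(temperature, timestamp) AS first_temp,
        last(temperature, timestamp) AS last_temp
    FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '7 days'
    GROUP BY day, sensor_id
    ORDER BY day, sensor_id
""", engine)

print(df)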

Continuous Aggregates#

Query pre-computed aggregations for faster analysis:

SELECT * FROM sensor_data_hourly
WHERE bucket > NOW() - INTERVAL '30 days';
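
Continuous aggregates are queried like ordinary tables, so they drop straight into pandas; a sketch assuming the sensor_data_hourly view shown above exists and exposes a bucket column:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Pre-aggregated hourly rows are much cheaper to scan than the raw hypertable
df = pd.read_sql("""
    SELECT * FROM sensor_data_hourly
    WHERE bucket > NOW() - INTERVAL '30 days'
    ORDER BY bucket
""", engine)

print(df.columns.tolist())
print(df.head())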

Connection Methods#

Using psycopg2 (Direct PostgreSQL)#

import psycopg2
import pandas as pd

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    database="leaf",
    user="readonly_user",
    password="YOUR_PASSWORD"
)

df = pd.read_sql_query("SELECT * FROM sensor_data LIMIT 100", conn)
conn.close()

Using Context Managers#

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:PASSWORD@localhost:5432/leaf')

with engine.connect() as conn:
    df = pd.read_sql("SELECT * FROM sensor_data", conn)

print(df.head())

Best Practices#

1. Always Use WHERE Clauses#

Limit data scanned by filtering by time:

WHERE timestamp > NOW() - INTERVAL '7 days'

2. Use LIMIT for Exploration#

When exploring data, limit results:

SELECT * FROM large_table LIMIT 100;

3. Connection Pooling#

For production scripts with multiple queries:

engine = create_engine(
    'postgresql://readonly_user:PASSWORD@localhost:5432/leaf',
    pool_size=5,
    max_overflow=10
)

4. Close Connections#

Always close connections when done:

conn.close()

Or use context managers (automatically closes):

with engine.connect() as conn:
    df = pd.read_sql("SELECT ...", conn)

5. Store Credentials Securely#

Don’t hardcode passwords. Use environment variables:

import os
from sqlalchemy import create_engine

password = os.environ.get('DB_PASSWORD')
engine = create_engine(f'postgresql://readonly_user:{password}@localhost:5432/leaf')

Or use .env files with python-dotenv:

from dotenv import load_dotenv
from sqlalchemy import create_engine
import os

load_dotenv()
password = os.getenv('READONLY_PASSWORD')
engine = create_engine(f'postgresql://readonly_user:{password}@localhost:5432/leaf')

Permissions#

The readonly_user has the following permissions:

Allowed#

  • SELECT - Read all data from all tables

  • Connect to the leaf database

  • View table structures and schemas

  • Execute TimescaleDB functions

  • View sequences and metadata

Not Allowed#

  • INSERT, UPDATE, DELETE - Cannot modify data

  • CREATE, DROP - Cannot create or drop tables

  • TRUNCATE - Cannot truncate tables

  • Cannot create databases or schemas

This ensures safe data analysis without risk of accidentally modifying production data.
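
You can see this enforcement in action by attempting a write over the read-only connection; a minimal sketch (the exact error text may differ):

from sqlalchemy import create_engine, text
from sqlalchemy.exc import ProgrammingError

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

with engine.connect() as conn:
    try:
        # Any write statement is rejected for readonly_user
        conn.execute(text("DELETE FROM sensor_data"))
    except ProgrammingError as exc:
        print(f"Write rejected as expected: {exc.orig}")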

Tools and Clients#

You can access the database using various tools:

| Tool | Type | Use Case |
|---|---|---|
| pandas + Jupyter | Python notebook | Interactive analysis |
| pgAdmin | Web UI | Visual query builder |
| DBeaver | Desktop app | Universal database tool |
| DataGrip | IDE | JetBrains database IDE |
| psql | Command line | Quick queries |
| VS Code | IDE | With PostgreSQL extension |
Troubleshooting#

Connection Refused#

Symptoms: psycopg2.OperationalError: could not connect to server

Solutions:

  1. Verify database is running: docker compose ps timescaledb

  2. Check port 5432 is not blocked by firewall

  3. Use localhost (not timescaledb, which is the internal Docker hostname)
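
A quick way to confirm reachability from Python before digging deeper; a minimal sketch:

import psycopg2

try:
    conn = psycopg2.connect(
        host="localhost",
        port=5432,
        dbname="leaf",
        user="readonly_user",
        password="YOUR_PASSWORD",
        connect_timeout=5,
    )
    print("Connection OK, server version:", conn.server_version)
    conn.close()
except psycopg2.OperationalError as exc:
    print("Connection failed:", exc)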

Authentication Failed#

Symptoms: password authentication failed for user "readonly_user"

Solutions:

  1. Double-check the password from your .credentials file

  2. Ensure you’re using username readonly_user (not postgres)

  3. Verify .env file has the correct READONLY_PASSWORD

Permission Denied#

Symptoms: ERROR: permission denied for table

Solutions:

  1. This is expected for write operations - the user is read-only

  2. For read queries, verify the table exists:

    SELECT * FROM pg_tables WHERE schemaname = 'public';
    

Slow Queries#

Solutions:

  1. Ensure indexes exist on frequently queried columns (adding indexes requires administrator privileges)

  2. Use time_bucket for time-based aggregations rather than grouping on raw timestamps

  3. Filter by time range to reduce data scanned

  4. Use continuous aggregates for pre-computed results
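
To see why a particular query is slow, readonly_user can run EXPLAIN ANALYZE on it (for SELECTs this executes the query but modifies nothing); a sketch with a placeholder query:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# EXPLAIN ANALYZE returns the query plan with actual row counts and timings
plan = pd.read_sql("""
    EXPLAIN ANALYZE
    SELECT time_bucket('1 hour', timestamp) AS hour, AVG(temperature)
    FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '7 days'
    GROUP BY hour
""", engine)

print(plan.to_string(index=False))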

Example Notebooks#

Check the examples/ folder in the repository for complete Jupyter notebook examples:

  • sensor_analysis.ipynb - Time-series analysis workflow

  • data_quality.ipynb - Data quality checks

  • anomaly_detection.ipynb - Detecting anomalies in sensor data

Resources#