Database Access for Data Analysis#

The MAPLE Platform provides secure database access for data scientists and analysts to query time-series data using Python, pandas, and other analysis tools.

Overview#

The MAPLE Platform uses dedicated database users for each service, following the principle of least privilege:

| User | Purpose | Permissions | Used By |
|---|---|---|---|
| postgres | Superuser | Full control | Database administration only |
| telegraf_user | Metrics ingestion | CREATE tables, INSERT/UPDATE data | Telegraf service |
| nodered_user | Data processing | SELECT/INSERT/UPDATE/DELETE | Node-RED service |
| readonly_user | Data analysis | SELECT only (read-only) | Python scripts, data analysts |
These users are created automatically by the db-init service during first startup.

Quick Start#

Connection Information#

The read-only user credentials are provided in your .env file after running the password generation step.

Connection String:

postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf

To view your password:

cat .env | grep READONLY_PASSWORD

Note: When connecting from outside the Docker network (e.g., from your local machine), use localhost as the host. When connecting from within the Docker network (e.g., from another container), use timescaledb as the host.
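
If the same script runs both on your host machine and inside a container, reading the host and password from environment variables avoids hardcoding either. A minimal sketch (the DB_HOST variable name is an assumption; READONLY_PASSWORD matches the .env entry above):

import os
from sqlalchemy import create_engine

# DB_HOST is a hypothetical variable: set it to "timescaledb" inside Docker; it defaults to "localhost" outside
host = os.environ.get('DB_HOST', 'localhost')
password = os.environ['READONLY_PASSWORD']

engine = create_engine(f'postgresql://readonly_user:{password}@{host}:5432/leaf')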

Basic Python Example#

import pandas as pd
from sqlalchemy import create_engine

# Create connection
engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Query data into DataFrame
df = pd.read_sql("""
    SELECT * FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '24 hours'
    LIMIT 1000
""", engine)

# Analyze data
print(df.describe())
print(df.head())
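
The examples in this guide inline values in the SQL for readability. For reusable scripts you may prefer bound parameters over string formatting; a sketch using SQLAlchemy's text() construct (the sensor ID is a placeholder value):

import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Bind the sensor ID instead of interpolating it into the SQL string
query = text("""
    SELECT * FROM sensor_data
    WHERE sensor_id = :sensor_id
      AND timestamp > NOW() - INTERVAL '24 hours'
    LIMIT 1000
""")

df = pd.read_sql(query, engine, params={"sensor_id": "sensor_001"})
print(df.head())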

Python Setup#

Installing Required Packages#

pip install pandas sqlalchemy psycopg2-binary

For Jupyter notebooks:

pip install pandas sqlalchemy psycopg2-binary jupyter matplotlib seaborn

Usage Examples#

Example 1: Time-Series Query with TimescaleDB#

TimescaleDB provides powerful time-series functions like time_bucket for aggregation:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Hourly aggregation
df = pd.read_sql("""
    SELECT
        time_bucket('1 hour', timestamp) AS hour,
        sensor_id,
        AVG(temperature) as avg_temp,
        MAX(temperature) as max_temp,
        MIN(temperature) as min_temp,
        COUNT(*) as sample_count
    FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '7 days'
    GROUP BY hour, sensor_id
    ORDER BY hour DESC
""", engine)

print(df)

Example 2: Data Visualization#

import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Load 24 hours of temperature data
df = pd.read_sql("""
    SELECT
        time_bucket('5 minutes', timestamp) AS time,
        AVG(temperature) as temperature
    FROM sensor_data
    WHERE sensor_id = 'sensor_001'
      AND timestamp > NOW() - INTERVAL '24 hours'
    GROUP BY time
    ORDER BY time
""", engine)

# Plot
df.set_index('time')['temperature'].plot(figsize=(12, 6))
plt.title('Temperature Over Last 24 Hours')
plt.ylabel('Temperature (°C)')
plt.xlabel('Time')
plt.grid(True)
plt.show()

Example 3: Multiple Sensors Comparison#

import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Compare multiple sensors
df = pd.read_sql("""
    SELECT
        time_bucket('10 minutes', timestamp) AS time,
        sensor_id,
        AVG(temperature) as avg_temperature
    FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '24 hours'
      AND sensor_id IN ('sensor_001', 'sensor_002', 'sensor_003')
    GROUP BY time, sensor_id
    ORDER BY time
""", engine)

# Pivot for plotting
df_pivot = df.pivot(index='time', columns='sensor_id', values='avg_temperature')

# Plot
df_pivot.plot(figsize=(14, 6))
plt.title('Temperature Comparison: Multiple Sensors')
plt.ylabel('Temperature (°C)')
plt.xlabel('Time')
plt.legend(title='Sensor')
plt.grid(True)
plt.show()

Example 4: Statistical Analysis#

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Calculate statistics per sensor
stats = pd.read_sql("""
    SELECT
        sensor_id,
        COUNT(*) as total_readings,
        AVG(temperature) as mean_temp,
        STDDEV(temperature) as std_temp,
        MIN(temperature) as min_temp,
        MAX(temperature) as max_temp,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY temperature) as median_temp
    FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '30 days'
    GROUP BY sensor_id
    ORDER BY sensor_id
""", engine)

print(stats)

Example 5: Jupyter Notebook Workflow#

# Cell 1: Setup
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine

%matplotlib inline
sns.set_style("whitegrid")

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Cell 2: Load Data
df = pd.read_sql("""
    SELECT timestamp, sensor_id, temperature, humidity
    FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '7 days'
""", engine)

print(f"Loaded {len(df)} rows")
df.head()

# Cell 3: Data Cleaning
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.dropna()
df = df.set_index('timestamp').sort_index()  # time index so the plots in Cell 4 use the time axis
df.info()

# Cell 4: Visualization
fig, axes = plt.subplots(2, 1, figsize=(14, 8))

df.groupby('sensor_id')['temperature'].plot(ax=axes[0], legend=True)
axes[0].set_title('Temperature by Sensor')
axes[0].set_ylabel('Temperature (°C)')

df.groupby('sensor_id')['humidity'].plot(ax=axes[1], legend=True)
axes[1].set_title('Humidity by Sensor')
axes[1].set_ylabel('Humidity (%)')

plt.tight_layout()
plt.show()

# Cell 5: Export Results
summary = df.groupby('sensor_id').agg({
    'temperature': ['mean', 'std', 'min', 'max'],
    'humidity': ['mean', 'std', 'min', 'max']
})

summary.to_csv('sensor_summary.csv')
print("Summary exported to sensor_summary.csv")

TimescaleDB Functions#

Time Bucketing#

Aggregate data into time intervals:

SELECT time_bucket('5 minutes', timestamp) AS bucket,
       AVG(value) AS avg_value
FROM sensor_data
GROUP BY bucket
ORDER BY bucket;

time_bucket accepts arbitrary fixed intervals, for example: 1 second, 5 minutes, 1 hour, 1 day, 1 week
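
If you prefer to bucket on the client side instead, pandas resample gives a comparable result once the timestamp column is the index; a small sketch using the same sensor_data columns as above:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Pull raw rows, then aggregate into 5-minute buckets in pandas
df = pd.read_sql("""
    SELECT timestamp, temperature FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '24 hours'
""", engine)

df['timestamp'] = pd.to_datetime(df['timestamp'])
buckets = df.set_index('timestamp')['temperature'].resample('5min').mean()
print(buckets.head())

For large tables, bucketing in SQL with time_bucket is usually preferable because far less data crosses the network.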

First/Last Values#

Get first and last values in a time range:

SELECT first(value, timestamp) AS first_value,
       last(value, timestamp) AS last_value
FROM sensor_data
WHERE timestamp > NOW() - INTERVAL '1 hour';
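
first and last combine naturally with time_bucket and GROUP BY, for example to get the earliest and latest reading per sensor per day; a sketch using the same sensor_data columns as the examples above:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Earliest and latest temperature reading per sensor for each day of the past week
df = pd.read_sql("""
    SELECT
        time_bucket('1 day', timestamp) AS day,
        sensor_id,
        first(temperature, timestamp) AS first_temp,
        last(temperature, timestamp) AS last_temp
    FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '7 days'
    GROUP BY day, sensor_id
    ORDER BY day, sensor_id
""", engine)

print(df)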

Continuous Aggregates#

Query pre-computed aggregations for faster analysis:

SELECT * FROM sensor_data_hourly
WHERE bucket > NOW() - INTERVAL '30 days';
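
Continuous aggregates are queried like ordinary tables, so they drop straight into pandas; a sketch assuming the sensor_data_hourly view shown above exists and exposes a bucket column:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Pre-aggregated hourly rows are much cheaper to scan than the raw hypertable
df = pd.read_sql("""
    SELECT * FROM sensor_data_hourly
    WHERE bucket > NOW() - INTERVAL '30 days'
    ORDER BY bucket
""", engine)

print(df.columns.tolist())
print(df.head())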

Connection Methods#

Using psycopg2 (Direct PostgreSQL)#

import psycopg2
import pandas as pd

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    database="leaf",
    user="readonly_user",
    password="YOUR_PASSWORD"
)

df = pd.read_sql_query("SELECT * FROM sensor_data LIMIT 100", conn)
conn.close()

Using Context Managers#

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:PASSWORD@localhost:5432/leaf')

with engine.connect() as conn:
    df = pd.read_sql("SELECT * FROM sensor_data", conn)

print(df.head())

Best Practices#

1. Always Use WHERE Clauses#

Limit data scanned by filtering by time:

WHERE timestamp > NOW() - INTERVAL '7 days'

2. Use LIMIT for Exploration#

When exploring data, limit results:

SELECT * FROM large_table LIMIT 100;

3. Connection Pooling#

For production scripts with multiple queries:

engine = create_engine(
    'postgresql://readonly_user:PASSWORD@localhost:5432/leaf',
    pool_size=5,
    max_overflow=10
)

4. Close Connections#

Always close connections when done:

conn.close()

Or use context managers (automatically closes):

with engine.connect() as conn:
    df = pd.read_sql("SELECT ...", conn)

5. Store Credentials Securely#

Don’t hardcode passwords. Use environment variables:

import os
from sqlalchemy import create_engine

password = os.environ.get('DB_PASSWORD')
engine = create_engine(f'postgresql://readonly_user:{password}@localhost:5432/leaf')

Or use .env files with python-dotenv:

from dotenv import load_dotenv
from sqlalchemy import create_engine
import os

load_dotenv()
password = os.getenv('READONLY_PASSWORD')
engine = create_engine(f'postgresql://readonly_user:{password}@localhost:5432/leaf')

Permissions#

The readonly_user has the following permissions:

Allowed#

  • SELECT - Read all data from all tables

  • Connect to the leaf database

  • View table structures and schemas

  • Execute TimescaleDB functions

  • View sequences and metadata

Not Allowed#

  • INSERT, UPDATE, DELETE - Cannot modify data

  • CREATE, DROP - Cannot create or drop tables

  • TRUNCATE - Cannot truncate tables

  • Cannot create databases or schemas

This ensures safe data analysis without risk of accidentally modifying production data.
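
You can see this enforcement in action by attempting a write over the read-only connection; a minimal sketch (the exact error text may differ):

from sqlalchemy import create_engine, text
from sqlalchemy.exc import ProgrammingError

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

with engine.connect() as conn:
    try:
        # Any write statement is rejected for readonly_user
        conn.execute(text("DELETE FROM sensor_data"))
    except ProgrammingError as exc:
        print(f"Write rejected as expected: {exc.orig}")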

Tools and Clients#

You can access the database using various tools:

| Tool | Type | Use Case |
|---|---|---|
| pandas + Jupyter | Python notebook | Interactive analysis |
| pgAdmin | Web UI | Visual query builder |
| DBeaver | Desktop app | Universal database tool |
| DataGrip | IDE | JetBrains database IDE |
| psql | Command line | Quick queries |
| VS Code | IDE | With PostgreSQL extension |
Troubleshooting#

Connection Refused#

Symptoms: psycopg2.OperationalError: could not connect to server

Solutions:

  1. Verify database is running: docker compose ps timescaledb

  2. Check port 5432 is not blocked by firewall

  3. Use localhost (not timescaledb, which is the internal Docker hostname)
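
A quick way to confirm reachability from Python before digging deeper; a minimal sketch:

import psycopg2

try:
    conn = psycopg2.connect(
        host="localhost",
        port=5432,
        dbname="leaf",
        user="readonly_user",
        password="YOUR_PASSWORD",
        connect_timeout=5,
    )
    print("Connection OK, server version:", conn.server_version)
    conn.close()
except psycopg2.OperationalError as exc:
    print("Connection failed:", exc)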

Authentication Failed#

Symptoms: password authentication failed for user "readonly_user"

Solutions:

  1. Double-check the password from your .credentials file

  2. Ensure you’re using username readonly_user (not postgres)

  3. Verify .env file has the correct READONLY_PASSWORD

Permission Denied#

Symptoms: ERROR: permission denied for table

Solutions:

  1. This is expected for write operations - the user is read-only

  2. For read queries, verify the table exists:

    SELECT * FROM pg_tables WHERE schemaname = 'public';
    

Slow Queries#

Solutions:

  1. Ensure indexes exist on frequently queried columns (adding indexes requires administrator privileges)

  2. Use time_bucket for time-based aggregations rather than grouping on raw timestamps

  3. Filter by time range to reduce data scanned

  4. Use continuous aggregates for pre-computed results
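
To see why a particular query is slow, readonly_user can run EXPLAIN ANALYZE on it (for SELECTs this executes the query but modifies nothing); a sketch with a placeholder query:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# EXPLAIN ANALYZE returns the query plan with actual row counts and timings
plan = pd.read_sql("""
    EXPLAIN ANALYZE
    SELECT time_bucket('1 hour', timestamp) AS hour, AVG(temperature)
    FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '7 days'
    GROUP BY hour
""", engine)

print(plan.to_string(index=False))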

Example Notebooks#

Check the examples/ folder in the repository for complete Jupyter notebook examples:

  • sensor_analysis.ipynb - Time-series analysis workflow

  • data_quality.ipynb - Data quality checks

  • anomaly_detection.ipynb - Detecting anomalies in sensor data

Resources#