# Database Access for Data Analysis
The MAPLE Platform provides secure database access for data scientists and analysts to query time-series data using Python, pandas, and other analysis tools.
## Overview
The MAPLE Platform uses dedicated database users for each service, following the principle of least privilege:
| User | Purpose | Permissions | Used By |
|---|---|---|---|
| `postgres` | Superuser | Full control | Database administration only |
| | Metrics ingestion | CREATE tables, INSERT/UPDATE data | Telegraf service |
| | Data processing | SELECT/INSERT/UPDATE/DELETE | Node-RED service |
| `readonly_user` | Data analysis | SELECT only (read-only) | Python scripts, data analysts |
These users are created automatically by the `db-init` service during the first startup.
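If you want to confirm that initialization succeeded, you can list the login-capable roles from any connection. This is just a sanity check (not part of the platform tooling), using the read-only credentials described below:

```python
import pandas as pd
from sqlalchemy import create_engine

# List login-capable roles; replace YOUR_PASSWORD with the value from .env.
engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')
roles = pd.read_sql("SELECT rolname FROM pg_roles WHERE rolcanlogin", engine)
print(roles)
```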
## Quick Start

### Connection Information

The read-only user credentials are provided in your `.env` file after running the password generation step.
**Connection String:**

```text
postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf
```

To view your password:

```bash
cat .env | grep READONLY_PASSWORD
```
> **Note:** When connecting from outside the Docker network (e.g., from your local machine), use `localhost` as the host. When connecting from within the Docker network (e.g., from another container), use `timescaledb` as the host.
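If a script needs to run both inside and outside Docker, one option is to make the host configurable. A minimal sketch, assuming a `DB_HOST` convenience variable that the platform itself does not define:

```python
import os
from sqlalchemy import create_engine

# "localhost" outside Docker, "timescaledb" from another container
# on the same Docker network.
host = os.environ.get('DB_HOST', 'localhost')
password = os.environ.get('READONLY_PASSWORD', 'YOUR_PASSWORD')
engine = create_engine(f'postgresql://readonly_user:{password}@{host}:5432/leaf')
```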
### Basic Python Example

```python
import pandas as pd
from sqlalchemy import create_engine

# Create a connection engine
engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Query data into a DataFrame
df = pd.read_sql("""
    SELECT * FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '24 hours'
    LIMIT 1000
""", engine)

# Analyze the data
print(df.describe())
print(df.head())
```
## Python Setup

### Installing Required Packages

```bash
pip install pandas sqlalchemy psycopg2-binary
```

For Jupyter notebooks:

```bash
pip install pandas sqlalchemy psycopg2-binary jupyter matplotlib seaborn
```
## Usage Examples

### Example 1: Time-Series Query with TimescaleDB

TimescaleDB provides powerful time-series functions such as `time_bucket` for aggregation:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Hourly aggregation
df = pd.read_sql("""
    SELECT
        time_bucket('1 hour', timestamp) AS hour,
        sensor_id,
        AVG(temperature) AS avg_temp,
        MAX(temperature) AS max_temp,
        MIN(temperature) AS min_temp,
        COUNT(*) AS sample_count
    FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '7 days'
    GROUP BY hour, sensor_id
    ORDER BY hour DESC
""", engine)

print(df)
```
### Example 2: Data Visualization

```python
import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Load 24 hours of temperature data
df = pd.read_sql("""
    SELECT
        time_bucket('5 minutes', timestamp) AS time,
        AVG(temperature) AS temperature
    FROM sensor_data
    WHERE sensor_id = 'sensor_001'
      AND timestamp > NOW() - INTERVAL '24 hours'
    GROUP BY time
    ORDER BY time
""", engine)

# Plot
df.set_index('time')['temperature'].plot(figsize=(12, 6))
plt.title('Temperature Over Last 24 Hours')
plt.ylabel('Temperature (°C)')
plt.xlabel('Time')
plt.grid(True)
plt.show()
```
### Example 3: Comparing Multiple Sensors

```python
import pandas as pd
import matplotlib.pyplot as plt
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Compare multiple sensors
df = pd.read_sql("""
    SELECT
        time_bucket('10 minutes', timestamp) AS time,
        sensor_id,
        AVG(temperature) AS avg_temperature
    FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '24 hours'
      AND sensor_id IN ('sensor_001', 'sensor_002', 'sensor_003')
    GROUP BY time, sensor_id
    ORDER BY time
""", engine)

# Pivot so each sensor becomes its own column for plotting
df_pivot = df.pivot(index='time', columns='sensor_id', values='avg_temperature')

# Plot
df_pivot.plot(figsize=(14, 6))
plt.title('Temperature Comparison: Multiple Sensors')
plt.ylabel('Temperature (°C)')
plt.xlabel('Time')
plt.legend(title='Sensor')
plt.grid(True)
plt.show()
```
### Example 4: Statistical Analysis

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Calculate statistics per sensor
stats = pd.read_sql("""
    SELECT
        sensor_id,
        COUNT(*) AS total_readings,
        AVG(temperature) AS mean_temp,
        STDDEV(temperature) AS std_temp,
        MIN(temperature) AS min_temp,
        MAX(temperature) AS max_temp,
        PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY temperature) AS median_temp
    FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '30 days'
    GROUP BY sensor_id
    ORDER BY sensor_id
""", engine)

print(stats)
```
### Example 5: Jupyter Notebook Workflow

```python
# Cell 1: Setup
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sqlalchemy import create_engine

%matplotlib inline
sns.set_style("whitegrid")

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')
```

```python
# Cell 2: Load Data
df = pd.read_sql("""
    SELECT timestamp, sensor_id, temperature, humidity
    FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '7 days'
""", engine)
print(f"Loaded {len(df)} rows")
df.head()
```

```python
# Cell 3: Data Cleaning
df['timestamp'] = pd.to_datetime(df['timestamp'])
df = df.dropna()
df.info()
```

```python
# Cell 4: Visualization
fig, axes = plt.subplots(2, 1, figsize=(14, 8))

# Index by timestamp so each line plots against time, not row number
ts = df.set_index('timestamp')
ts.groupby('sensor_id')['temperature'].plot(ax=axes[0], legend=True)
axes[0].set_title('Temperature by Sensor')
axes[0].set_ylabel('Temperature (°C)')

ts.groupby('sensor_id')['humidity'].plot(ax=axes[1], legend=True)
axes[1].set_title('Humidity by Sensor')
axes[1].set_ylabel('Humidity (%)')

plt.tight_layout()
plt.show()
```

```python
# Cell 5: Export Results
summary = df.groupby('sensor_id').agg({
    'temperature': ['mean', 'std', 'min', 'max'],
    'humidity': ['mean', 'std', 'min', 'max']
})
summary.to_csv('sensor_summary.csv')
print("Summary exported to sensor_summary.csv")
```
## TimescaleDB Functions

### Time Bucketing

Aggregate data into time intervals:

```sql
SELECT time_bucket('5 minutes', timestamp) AS bucket,
       AVG(value) AS avg_value
FROM sensor_data
GROUP BY bucket
ORDER BY bucket;
```
The bucket width can be any fixed PostgreSQL interval, for example `1 second`, `5 minutes`, `1 hour`, `1 day`, or `1 week`.
### First/Last Values

Get the first and last values in a time range:

```sql
SELECT first(value, timestamp) AS first_value,
       last(value, timestamp) AS last_value
FROM sensor_data
WHERE timestamp > NOW() - INTERVAL '1 hour';
```
### Continuous Aggregates

Query pre-computed aggregations for faster analysis:

```sql
SELECT * FROM sensor_data_hourly
WHERE bucket > NOW() - INTERVAL '30 days';
```
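Continuous aggregates can be queried from pandas like any other table. A sketch, assuming the `sensor_data_hourly` aggregate and its `bucket` time column shown above:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# The hourly rows are pre-computed, so this scans far less data
# than aggregating the raw sensor_data table would.
df = pd.read_sql("""
    SELECT * FROM sensor_data_hourly
    WHERE bucket > NOW() - INTERVAL '30 days'
    ORDER BY bucket
""", engine)
```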
## Connection Methods

### Using pandas + SQLAlchemy (Recommended)

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:PASSWORD@localhost:5432/leaf')
df = pd.read_sql("SELECT * FROM sensor_data LIMIT 100", engine)
```
### Using psycopg2 (Direct PostgreSQL)

```python
import psycopg2
import pandas as pd

conn = psycopg2.connect(
    host="localhost",
    port=5432,
    database="leaf",
    user="readonly_user",
    password="YOUR_PASSWORD"
)
df = pd.read_sql_query("SELECT * FROM sensor_data LIMIT 100", conn)
conn.close()
```
### Using Context Managers

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:PASSWORD@localhost:5432/leaf')

# The connection is returned to the pool automatically on exit
with engine.connect() as conn:
    df = pd.read_sql("SELECT * FROM sensor_data", conn)

print(df.head())
```
## Best Practices

### 1. Always Use WHERE Clauses

Limit the data scanned by filtering on time:

```sql
WHERE timestamp > NOW() - INTERVAL '7 days'
```
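When filter values come from variables, pass them as bound parameters rather than formatting them into the SQL string. A minimal sketch using SQLAlchemy's `text()`:

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# Bound parameters avoid quoting mistakes and SQL injection
query = text("""
    SELECT * FROM sensor_data
    WHERE sensor_id = :sensor_id
      AND timestamp > NOW() - INTERVAL '7 days'
""")
df = pd.read_sql(query, engine, params={"sensor_id": "sensor_001"})
```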
### 2. Use LIMIT for Exploration

When exploring data, limit the results:

```sql
SELECT * FROM large_table LIMIT 100;
```
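When you eventually need a large result set in full, `chunksize` lets pandas stream it in pieces instead of loading everything into memory at once:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# With chunksize, read_sql returns an iterator of DataFrames,
# keeping memory bounded for large tables.
total = 0
for chunk in pd.read_sql("SELECT * FROM sensor_data", engine, chunksize=50_000):
    total += len(chunk)
print(f"Processed {total} rows")
```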
### 3. Connection Pooling

For production scripts that run multiple queries:

```python
from sqlalchemy import create_engine

engine = create_engine(
    'postgresql://readonly_user:PASSWORD@localhost:5432/leaf',
    pool_size=5,       # connections kept open in the pool
    max_overflow=10    # additional connections allowed under load
)
```
### 4. Close Connections

Always close connections when you are done:

```python
conn.close()
```

Or use a context manager, which closes the connection automatically:

```python
with engine.connect() as conn:
    df = pd.read_sql("SELECT ...", conn)
```
### 5. Store Credentials Securely

Don't hardcode passwords. Use environment variables:

```python
import os
from sqlalchemy import create_engine

password = os.environ.get('DB_PASSWORD')
engine = create_engine(f'postgresql://readonly_user:{password}@localhost:5432/leaf')
```
Or use a `.env` file with python-dotenv:

```python
import os
from dotenv import load_dotenv
from sqlalchemy import create_engine

load_dotenv()  # reads variables from .env into the environment
password = os.getenv('READONLY_PASSWORD')
engine = create_engine(f'postgresql://readonly_user:{password}@localhost:5432/leaf')
```
## Permissions

The `readonly_user` has the following permissions:

### Allowed

- `SELECT` - read all data from all tables
- Connect to the `leaf` database
- View table structures and schemas
- Execute TimescaleDB functions
- View sequences and metadata

### Not Allowed

- `INSERT`, `UPDATE`, `DELETE` - cannot modify data
- `CREATE`, `DROP` - cannot create or drop tables
- `TRUNCATE` - cannot truncate tables
- Cannot create databases or schemas
This ensures safe data analysis without risk of accidentally modifying production data.
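If you want to see this enforcement first-hand, an attempted write is rejected by the server. A sketch (the column list is illustrative, not the actual schema):

```python
from sqlalchemy import create_engine, text
from sqlalchemy.exc import ProgrammingError

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# The INSERT never runs; PostgreSQL raises "permission denied"
try:
    with engine.connect() as conn:
        conn.execute(text("INSERT INTO sensor_data (sensor_id) VALUES ('test')"))
        conn.commit()
except ProgrammingError as exc:
    print(f"Write rejected as expected: {exc.orig}")
```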
## Tools and Clients

You can access the database using various tools:

| Tool | Type | Use Case |
|---|---|---|
| pandas + Jupyter | Python notebook | Interactive analysis |
| pgAdmin | Web UI | Visual query builder |
| DBeaver | Desktop app | Universal database tool |
| DataGrip | IDE | JetBrains database IDE |
| psql | Command line | Quick queries |
| VS Code | IDE | With PostgreSQL extension |
## Troubleshooting

### Connection Refused

**Symptoms:** `psycopg2.OperationalError: could not connect to server`

**Solutions:**

1. Verify the database is running: `docker compose ps timescaledb`
2. Check that port 5432 is not blocked by a firewall
3. Use `localhost` (not `timescaledb`, which is the internal Docker hostname)
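A plain TCP check is a quick way to separate network problems from credential problems:

```python
import socket

# If this fails, the container is down or the port mapping is wrong;
# credentials never come into play at this layer.
try:
    with socket.create_connection(("localhost", 5432), timeout=3):
        print("Port 5432 is reachable")
except OSError as exc:
    print(f"Cannot reach database: {exc}")
```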
### Authentication Failed

**Symptoms:** `password authentication failed for user "readonly_user"`

**Solutions:**

1. Double-check the password from your `.credentials` file
2. Ensure you are using the username `readonly_user` (not `postgres`)
3. Verify the `.env` file has the correct `READONLY_PASSWORD`
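To test credentials in isolation, run a trivial query and catch the login failure:

```python
from sqlalchemy import create_engine, text
from sqlalchemy.exc import OperationalError

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# SELECT 1 touches no tables, so any failure here is authentication
# or connectivity, not permissions.
try:
    with engine.connect() as conn:
        conn.execute(text("SELECT 1"))
    print("Credentials OK")
except OperationalError as exc:
    print(f"Login failed: {exc.orig}")
```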
### Permission Denied

**Symptoms:** `ERROR: permission denied for table`

**Solutions:**

1. This is expected for write operations; the user is read-only
2. For read queries, verify the table exists:

```sql
SELECT * FROM pg_tables WHERE schemaname = 'public';
```
### Slow Queries

**Solutions:**

1. Add indexes on frequently queried columns
2. Use `time_bucket` for time-based aggregations rather than grouping on raw timestamps
3. Filter by time range to reduce the amount of data scanned
4. Use continuous aggregates for pre-computed results
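To see where a slow query spends its time, you can run it under `EXPLAIN ANALYZE` and print the plan from Python:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('postgresql://readonly_user:YOUR_PASSWORD@localhost:5432/leaf')

# EXPLAIN ANALYZE executes the query and returns the actual plan
# with per-node timings, one text row per plan line.
plan = pd.read_sql("""
    EXPLAIN ANALYZE
    SELECT time_bucket('1 hour', timestamp) AS hour, AVG(temperature)
    FROM sensor_data
    WHERE timestamp > NOW() - INTERVAL '7 days'
    GROUP BY hour
""", engine)
print(plan.to_string(index=False))
```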
## Example Notebooks

Check the `examples/` folder in the repository for complete Jupyter notebook examples:

- `sensor_analysis.ipynb` - time-series analysis workflow
- `data_quality.ipynb` - data quality checks
- `anomaly_detection.ipynb` - detecting anomalies in sensor data