Troubleshooting

Common issues and solutions for DB Provision Operator.

Diagnostic Commands

Quick Health Check

# Operator status
kubectl get pods -n db-provision-operator-system

# Resource status
kubectl get databaseinstances,databases,databaseusers -A

# Recent events
kubectl get events -n db-provision-operator-system --sort-by='.lastTimestamp' | tail -20
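
To spot unhealthy resources at a glance, the resource listing can be narrowed to anything whose phase is not Ready. This is a minimal sketch, assuming every listed kind reports its state in `.status.phase` (as the examples on this page do); the function name `list_not_ready` is hypothetical.

```shell
# list_not_ready: print "KIND NAMESPACE NAME PHASE" for every resource
# whose .status.phase is anything other than "Ready".
# Assumption: all three kinds expose .status.phase.
list_not_ready() {
  kubectl get databaseinstances,databases,databaseusers -A --no-headers \
    -o custom-columns='KIND:.kind,NAMESPACE:.metadata.namespace,NAME:.metadata.name,PHASE:.status.phase' \
    | awk '$4 != "Ready"'
}
```

Calling `list_not_ready` with no arguments prints nothing when everything is Ready.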

Detailed Debugging

# Operator logs
kubectl logs -n db-provision-operator-system deployment/db-provision-operator -f

# Specific resource details
kubectl describe databaseinstance postgres-primary

# Resource YAML
kubectl get databaseinstance postgres-primary -o yaml

Common Issues

DatabaseInstance Issues

Instance Stuck in Pending

Symptoms:

NAME              ENGINE    PHASE
postgres-primary  postgres  Pending

Causes and Solutions:

  1. Cannot connect to database:

    # Check connectivity
    kubectl run psql-test --rm -it --image=postgres:15 -- \
      psql "postgresql://user:pass@host:5432/postgres?sslmode=disable" -c "SELECT 1"
    

  2. Invalid credentials:

    # Verify secret exists and has correct keys
    kubectl get secret postgres-admin-credentials -o yaml
    

  3. DNS resolution:

    # Test DNS from within cluster
    kubectl run dns-test --rm -it --image=busybox -- nslookup postgres.database.svc.cluster.local
    

Instance Shows Failed

Check events:

kubectl describe databaseinstance postgres-primary | grep -A10 "Events:"

Common errors:

Error                  Solution
connection refused     Check database is running, correct port
authentication failed  Verify credentials in Secret
SSL required           Add sslMode: require to spec
unknown host           Check hostname and DNS
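
The lookup in the table above can be scripted when triaging many instances. This is only a sketch: the helper name `suggest_fix` is hypothetical, and it assumes the error text reported by the operator contains the phrases listed in the table.

```shell
# suggest_fix MESSAGE: print the suggestion from the table above that
# matches MESSAGE, or a fallback when nothing matches.
suggest_fix() {
  case "$1" in
    *"connection refused"*)    echo "Check database is running, correct port" ;;
    *"authentication failed"*) echo "Verify credentials in Secret" ;;
    *"SSL required"*)          echo "Add sslMode: require to spec" ;;
    *"unknown host"*)          echo "Check hostname and DNS" ;;
    *)                         echo "No known fix; check operator logs" ;;
  esac
}
```

For example, `suggest_fix "dial tcp: connect: connection refused"` prints the first suggestion.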

Database Issues

Database Not Created

Check dependencies:

# Instance must be Ready
kubectl get databaseinstance postgres-primary -o jsonpath='{.status.phase}'
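
When the instance was created only moments ago, a single check may simply be too early. The polling loop below is a minimal sketch of waiting for the Ready phase; the function name `wait_for_ready`, the 2-second poll interval, and the 120-second default timeout are all assumptions.

```shell
# wait_for_ready NAME [TIMEOUT_SECONDS]: poll until the DatabaseInstance
# reports phase "Ready"; return 1 if the timeout expires first.
wait_for_ready() {
  local name="$1" timeout="${2:-120}" waited=0 phase
  while [ "$waited" -lt "$timeout" ]; do
    phase=$(kubectl get databaseinstance "$name" -o jsonpath='{.status.phase}' 2>/dev/null)
    if [ "$phase" = "Ready" ]; then
      echo "$name is Ready"
      return 0
    fi
    sleep 2
    waited=$((waited + 2))
  done
  echo "timed out waiting for $name (last phase: ${phase:-unknown})" >&2
  return 1
}
```

On kubectl 1.23 or later, `kubectl wait --for=jsonpath='{.status.phase}'=Ready databaseinstance/postgres-primary --timeout=120s` should achieve the same without a custom loop.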

Check logs:

kubectl logs -n db-provision-operator-system deployment/db-provision-operator | grep "myapp-database"

Extension Installation Failed

Common causes:

  1. Extension not available:

    -- Check available extensions in database
    SELECT * FROM pg_available_extensions;
    

  2. Insufficient permissions:

    -- Admin user needs CREATE permission
    GRANT CREATE ON DATABASE myapp TO admin_user;
    

User Issues

User Not Created

Check instance status:

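# Name the instance; without a name, kubectl returns a list and .status.phase is empty
kubectl get databaseinstance postgres-primary -o jsonpath='{.status.phase}'
# Must be "Ready"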

Verify username is valid:

  - PostgreSQL: alphanumeric and underscores
  - MySQL: max 32 characters
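
The two username rules can be checked locally before applying a DatabaseUser. This sketch encodes only the rules stated here; the function name `valid_username` is hypothetical, the engine value `mysql` is an assumption (only `postgres` appears in the sample output on this page), and the real engines enforce further constraints that are not covered.

```shell
# valid_username ENGINE NAME: return 0 if NAME passes the rule listed
# above for ENGINE, 1 otherwise.
valid_username() {
  local engine="$1" name="$2"
  [ -n "$name" ] || return 1
  case "$engine" in
    postgres)
      # Alphanumeric characters and underscores only
      case "$name" in
        *[!A-Za-z0-9_]*) return 1 ;;
      esac
      ;;
    mysql)
      # At most 32 characters
      [ "${#name}" -le 32 ] || return 1
      ;;
    *)
      # Unknown engines are rejected rather than silently accepted
      echo "unknown engine: $engine" >&2
      return 1
      ;;
  esac
  return 0
}
```

For example, `valid_username postgres myapp_user` succeeds while `valid_username postgres my-app` fails because of the hyphen.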

Password Not Generated

Check secret:

kubectl get secret myapp-user-credentials

If missing:

  1. Verify passwordSecret.generate: true in spec
  2. Check passwordSecret.secretName is specified
  3. Look for errors in operator logs

Cannot Connect with Generated Password

Verify credentials:

# Get password
kubectl get secret myapp-user-credentials -o jsonpath='{.data.password}' | base64 -d

# Test connection
kubectl run psql-test --rm -it --image=postgres:15 -- \
  psql "postgresql://myapp_user:PASSWORD@postgres:5432/myapp"
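
Characters such as `@`, `/`, or `:` have special meaning in a connection URI, so a password containing them can fail even though it is correct. Percent-encoding the password before placing it in the URI avoids this. The sketch below handles ASCII input only; the helper name `urlencode` is hypothetical, and whether the operator generates such characters is an assumption.

```shell
# urlencode STRING: percent-encode every character except the RFC 3986
# unreserved set (letters, digits, ".", "_", "~", "-"). ASCII input only.
urlencode() {
  local s="$1" out="" c
  while [ -n "$s" ]; do
    c="${s%"${s#?}"}"   # first character of s
    s="${s#?}"          # remainder of s
    case "$c" in
      [A-Za-z0-9._~-]) out="$out$c" ;;
      *) out="$out$(printf '%%%02X' "'$c")" ;;
    esac
  done
  printf '%s\n' "$out"
}

# Usage (inside the test pod, with PASSWORD holding the decoded secret value):
#   psql "postgresql://myapp_user:$(urlencode "$PASSWORD")@postgres:5432/myapp"
```

For instance, `urlencode 'p@ss/w:rd'` prints `p%40ss%2Fw%3Ard`.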

Grant Issues

Grants Not Applied

Prerequisites:

  1. User/Role must exist and be Ready
  2. Database must exist
  3. Tables/schemas must exist

Check user status:

kubectl get databaseuser myapp-user -o jsonpath='{.status.phase}'

Verify grants in database:

-- PostgreSQL
\dp tablename
-- or
SELECT * FROM information_schema.table_privileges WHERE grantee = 'myapp_user';

Backup Issues

Backup Stuck in Running

Check backup job:

kubectl get jobs -l backup-name=myapp-backup
kubectl logs job/myapp-backup-job

Common causes:

  - Large database taking a long time
  - Network issues reaching storage
  - Insufficient resources

Backup Failed

Check job logs:

kubectl logs job/myapp-backup-job

Storage issues:

# Verify storage credentials
kubectl get secret s3-credentials -o yaml

# Test S3 connectivity
kubectl run aws-cli --rm -it --image=amazon/aws-cli -- s3 ls s3://my-bucket/

Schedule Not Triggering

Verify schedule:

kubectl get databasebackupschedule myapp-backup -o yaml | grep schedule

Check timezone:

# Schedule uses UTC by default
kubectl get databasebackupschedule myapp-backup -o jsonpath='{.spec.timezone}'

Restore Issues

Restore Failed

Check restore job:

kubectl logs job/myapp-restore-job

Common errors:

Error              Solution
database exists    Set dropExisting: true or delete database
backup not found   Verify backup exists and path is correct
permission denied  Check storage credentials

Drift Detection Issues

Drift Not Detected

Check drift mode:

kubectl get database myapp -o jsonpath='{.spec.driftPolicy.mode}'
# Must be "detect" or "correct", not "ignore"

Check last drift check time:

kubectl get database myapp -o jsonpath='{.status.drift.lastChecked}'

Verify interval has passed:

# Default is 5m, check configured interval
kubectl get database myapp -o jsonpath='{.spec.driftPolicy.interval}'

Check operator logs:

kubectl logs -n db-provision-operator-system deployment/db-provision-operator | grep -i drift

Drift Detected But Not Corrected

Check drift mode:

# Mode must be "correct", not "detect"
kubectl get database myapp -o jsonpath='{.spec.driftPolicy.mode}'

Check if field is immutable:

kubectl get database myapp -o jsonpath='{.status.drift.diffs}' | jq '.[] | select(.immutable==true)'

Check if correction is destructive:

kubectl get database myapp -o jsonpath='{.status.drift.diffs}' | jq '.[] | select(.destructive==true)'

Allow destructive corrections (use with caution):

kubectl annotate database myapp dbops.dbprovision.io/allow-destructive-drift="true"

Drift Correction Failed

Check events:

kubectl get events --field-selector reason=DriftCorrectionFailed,involvedObject.name=myapp

Common causes:

Cause                    Solution
Insufficient privileges  Check admin account permissions
Field is immutable       Cannot auto-correct, update spec to match
Database object locked   Check for active transactions or locks

Drift Keeps Reappearing

Possible causes:

  1. External processes making changes:
     - Check for migration scripts, other operators, or manual changes

  2. Application creating/modifying objects:
     - Review application database connection and permissions

  3. Spec doesn't match intended state:
     - Update CR spec to match desired configuration

Deletion Protection Issues

Cannot Delete Protected Resource

Check if deletion protection is enabled:

# Spec-based resources (Database, Instance, Grant, BackupSchedule)
kubectl get database myapp -o jsonpath='{.spec.deletionProtection}'

# Annotation-based resources (DatabaseUser, DatabaseRole)
kubectl get databaseuser myapp-user -o jsonpath='{.metadata.annotations.dbops\.dbprovision\.io/deletion-protection}'

Method 1: Disable protection:

# Spec-based (Database, Instance, Grant)
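# Custom resources do not support the default strategic merge patch, so use --type=merge
kubectl patch database myapp --type=merge -p '{"spec":{"deletionProtection":false}}'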
kubectl delete database myapp

# Annotation-based (User, Role)
kubectl annotate databaseuser myapp-user dbops.dbprovision.io/deletion-protection-
kubectl delete databaseuser myapp-user

Method 2: Force delete annotation:

kubectl annotate database myapp dbops.dbprovision.io/force-delete="true"
# Resource will be deleted on next reconciliation
# Note: if the resource has children, you must also confirm the cascade — see deletion-protection docs

Method 3: One-liner (spec-based only):

kubectl patch database myapp --type=merge -p '{"spec":{"deletionProtection":false}}' && kubectl delete database myapp

Resource Stuck in Terminating

Check for DeletionBlocked events:

kubectl describe database myapp | grep -A5 "Events:"

Check finalizers:

kubectl get database myapp -o jsonpath='{.metadata.finalizers}'

If operator is running, disable protection:

kubectl patch database myapp --type=merge -p '{"spec":{"deletionProtection":false}}'

If operator is down (emergency only):

# WARNING: Bypasses cleanup logic, may leave orphaned database objects
kubectl patch database myapp -p '{"metadata":{"finalizers":null}}' --type=merge

Namespace Stuck Deleting

Find protected resources blocking deletion:

# Spec-protected resources (Database, Instance, Grant, BackupSchedule)
kubectl get databases,databaseinstances,databasegrants,databasebackupschedules -n my-namespace \
  -o jsonpath='{range .items[?(@.spec.deletionProtection==true)]}{.kind}/{.metadata.name}{"\n"}{end}'

# Annotation-protected resources (User, Role)
kubectl get databaseusers,databaseroles -n my-namespace -o json | \
  jq -r '.items[] | select(.metadata.annotations["dbops.dbprovision.io/deletion-protection"]=="true") | "\(.kind)/\(.metadata.name)"'

Also check for dependency blocks (resources with children that block deletion even without protection):

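# kubectl's jsonpath does not handle a filter nested inside another filter, so filter with jq instead
kubectl get databases,databaseusers,databaseroles,databaseinstances -n my-namespace -o json | \
  jq -r '.items[] | select(any(.status.conditions[]?; .reason=="DependenciesExist")) | "\(.kind)/\(.metadata.name)"'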

Force delete all protected resources:

for kind in database databaseuser databaserole databasegrant databasebackupschedule; do
  for name in $(kubectl get $kind -n my-namespace -o name 2>/dev/null); do
    kubectl annotate $name -n my-namespace dbops.dbprovision.io/force-delete="true" --overwrite
  done
done

# Wait for cascade confirmation hashes to appear, then confirm them
sleep 5
for kind in databaseinstance database databaseuser databaserole; do
  for name in $(kubectl get $kind -n my-namespace -o name 2>/dev/null); do
    HASH=$(kubectl get $name -n my-namespace -o jsonpath='{.status.deletionConfirmation.hash}' 2>/dev/null)
    if [ -n "$HASH" ]; then
      kubectl annotate $name -n my-namespace dbops.dbprovision.io/confirm-force-delete="$HASH" --overwrite
    fi
  done
done

CockroachDB Issues

Connection Failed in Insecure Mode

Error: password authentication failed

Cause: CockroachDB insecure mode doesn't support passwords.

Solution: Use empty password in secret:

apiVersion: v1
kind: Secret
metadata:
  name: cockroach-admin-credentials
type: Opaque
stringData:
  username: dbprovision_admin
  password: ""  # Must be empty

Verify insecure mode:

kubectl exec -it cockroachdb-0 -- cockroach sql --insecure -e "SHOW CLUSTER SETTING server.host_based_authentication.enabled;"

Cannot Create User with Password

Error: password is not supported in insecure mode

Cause: Attempting to create user with password on insecure cluster.

Solution: The operator automatically handles this. Users are created without passwords in insecure mode.

Permission Denied for Grants

Error: must be admin or owner

Cause: Admin user doesn't have admin role membership.

Solution:

-- Connect to CockroachDB as root
GRANT admin TO dbprovision_admin;

Cannot Create Database

Error: user does not have CREATEDB privilege

Solution:

ALTER USER dbprovision_admin WITH CREATEDB;

TLS Certificate Errors

Error: certificate signed by unknown authority

Check TLS configuration:

kubectl get databaseinstance cockroach-cluster -o jsonpath='{.spec.tls}'

Verify certificates:

# Check secret contains required keys
kubectl get secret cockroach-tls -o jsonpath='{.data}' | jq -r 'keys'
# Should include: ca.crt, tls.crt, tls.key
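
The key check can also be done without jq by querying each expected key in turn. This is a minimal sketch; the function name `check_tls_keys` is hypothetical, and the key names are the three listed above.

```shell
# check_tls_keys SECRET: report any of ca.crt, tls.crt, tls.key that is
# missing or empty in SECRET; return 1 if at least one is missing.
check_tls_keys() {
  local secret="$1" key escaped value missing=0
  for key in ca.crt tls.crt tls.key; do
    # kubectl jsonpath needs the dot inside a key name escaped as "\."
    escaped=$(printf '%s' "$key" | sed 's/\./\\./g')
    value=$(kubectl get secret "$secret" -o jsonpath="{.data.$escaped}" 2>/dev/null)
    if [ -z "$value" ]; then
      echo "missing key: $key"
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_tls_keys cockroach-tls` prints nothing and returns 0 when all three keys are present.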

Test connection manually:

kubectl run cockroach-test --rm -it --image=cockroachdb/cockroach:v24.1.0 -- \
  sql --url="postgresql://user:pass@cockroachdb:26257/defaultdb?sslmode=verify-full&sslrootcert=/certs/ca.crt"

Backup Failed

Error: BACKUP requires admin role

Solution:

GRANT admin TO dbprovision_admin;

Check backup job in CockroachDB:

SELECT * FROM [SHOW JOBS] WHERE job_type = 'BACKUP' ORDER BY created DESC LIMIT 5;

Operator Issues

Operator Not Starting

Check pod status:

kubectl describe pod -n db-provision-operator-system -l app=db-provision-operator

Common issues:

  1. Image pull error:

    kubectl get events -n db-provision-operator-system | grep "Failed"
    

  2. RBAC issues:

    kubectl auth can-i --as=system:serviceaccount:db-provision-operator-system:db-provision-operator \
      create secrets -n default
    

High Memory Usage

Check resource usage:

kubectl top pod -n db-provision-operator-system

Solutions:

  1. Increase memory limits
  2. Reduce concurrent reconciles
  3. Check for resource leaks in logs

Reconciliation Loops

Symptoms: Constant reconciliation, high CPU

Debug:

# Watch reconciliation
kubectl logs -n db-provision-operator-system deployment/db-provision-operator -f | grep "Reconciling"

Causes:

  - Status updates triggering reconciles
  - External changes to managed resources
  - Conflicting controllers
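
Putting a number on "constant reconciliation" makes it easier to tell a loop from normal activity. The sketch below counts matching log lines over a recent window; the function name `reconcile_count` is hypothetical, and it assumes the operator logs one line containing "Reconciling" per reconcile, as the grep above implies.

```shell
# reconcile_count [WINDOW]: count "Reconciling" log lines emitted by the
# operator during the last WINDOW (default 60s).
reconcile_count() {
  local window="${1:-60s}"
  kubectl logs -n db-provision-operator-system deployment/db-provision-operator \
    --since="$window" | grep -c "Reconciling"
}
```

Comparing `reconcile_count 60s` against the number of managed resources gives a rough sense of whether each resource is being reconciled far more often than expected.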

Recovery Procedures

Force Delete Stuck Resource

# Remove finalizers (use with caution!)
kubectl patch database myapp-database -p '{"metadata":{"finalizers":null}}' --type=merge
kubectl delete database myapp-database

Reset Resource State

# Delete and recreate
kubectl delete database myapp-database
kubectl apply -f database.yaml

Operator Recovery

# Restart operator
kubectl rollout restart deployment/db-provision-operator -n db-provision-operator-system

# Watch rollout
kubectl rollout status deployment/db-provision-operator -n db-provision-operator-system

Debug Mode

Enable Debug Logging

# Helm values
logging:
  level: debug

Or patch deployment:

kubectl set env deployment/db-provision-operator LOG_LEVEL=debug -n db-provision-operator-system

Trace Specific Resource

kubectl logs -n db-provision-operator-system deployment/db-provision-operator | \
  grep "myapp-database"

Getting Help

Collect Debug Information

#!/bin/bash
# debug-bundle.sh

echo "=== Operator Pods ===" > debug.txt
kubectl get pods -n db-provision-operator-system -o wide >> debug.txt

echo "=== Operator Logs ===" >> debug.txt
kubectl logs -n db-provision-operator-system deployment/db-provision-operator --tail=500 >> debug.txt

echo "=== All Resources ===" >> debug.txt
kubectl get databaseinstances,databases,databaseusers,databaseroles,databasegrants,databasebackups -A >> debug.txt

echo "=== Events ===" >> debug.txt
kubectl get events -n db-provision-operator-system >> debug.txt

echo "Debug bundle saved to debug.txt"

Report Issues

Open an issue at: https://github.com/panteparak/db-provision-operator/issues

Include:

  1. Operator version
  2. Kubernetes version
  3. Database engine and version
  4. Resource YAML (redact secrets)
  5. Operator logs
  6. Steps to reproduce
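
Redacting secrets from resource YAML can be partly automated with a small filter. This is a rough sketch, not a guarantee: the function name `redact_secrets` and the list of key names are assumptions, and multi-line values or credentials stored under other keys will pass through untouched, so review the output before posting it.

```shell
# redact_secrets: read YAML on stdin and replace the values of keys that
# commonly hold credentials. The key list is an assumption; extend as needed.
redact_secrets() {
  sed -E 's/^([[:space:]]*(password|token|secretKey|accessKey)[[:space:]]*:[[:space:]]*).*/\1REDACTED/'
}
```

For example, `kubectl get database myapp -o yaml | redact_secrets > database-redacted.yaml` writes a copy that is safer to attach to an issue.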