What's Actually Happening

A CockroachDB node fails to start or fails to join the cluster. The node process either exits with an error or hangs during startup.

The Error You'll See

```bash
$ cockroach start --join=node1:26257,node2:26257

Error: problem with RPC handshake: x509: certificate signed by unknown authority
```

Storage error:

```bash
Error: could not load cluster ID: file does not exist
```

Join error:

```bash
Error: could not reach any of the nodes specified in --join
```

Port error:

```bash
Error: unable to listen on port 26257: address already in use
```

Why This Happens

  1. Certificate issues - Invalid or missing TLS certificates
  2. Storage corruption - Data directory corrupted
  3. Network issues - Cannot reach join hosts
  4. Port conflicts - Port already in use
  5. Resource limits - Insufficient memory or file descriptors
  6. Version mismatch - Incompatible CockroachDB version
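
The mapping from error text to cause can be sketched as a small shell function. This is only an illustration: `classify_startup_error` is a hypothetical helper (not part of the `cockroach` CLI), the first four patterns come from the error messages shown above, and the "resources" patterns are assumed from common operating system messages rather than from CockroachDB output.

```bash
#!/usr/bin/env bash
# classify_startup_error is a hypothetical helper, not a cockroach command.

# Print the likely cause category for a startup error message.
classify_startup_error() {
  local msg="$1"
  case "$msg" in
    *x509*|*certificate*)                  echo "certificate" ;;
    *"could not load cluster ID"*)         echo "storage" ;;
    *"could not reach any of the nodes"*)  echo "network" ;;
    *"address already in use"*)            echo "port" ;;
    # Assumed OS-level messages, not taken from the errors above:
    *"too many open files"*|*"out of memory"*) echo "resources" ;;
    *)                                     echo "unknown" ;;
  esac
}

classify_startup_error "Error: unable to listen on port 26257: address already in use"
# prints: port
```

The category tells you which step below to start with; anything reported as "unknown" is best handled by reading the logs in Step 1.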

Step 1: Check Node Process

```bash
# Check if CockroachDB is running:
ps aux | grep cockroach

# Check systemd status:
systemctl status cockroachdb

# Check logs:
journalctl -u cockroachdb -f

# Check output logs:
tail -f /var/lib/cockroach/cockroach-data/logs/cockroach.log

# Manual start with debug:
cockroach start --join=node1:26257 \
  --store=/var/lib/cockroach/cockroach-data \
  --logtostderr --v=5

# Check for error in logs:
grep -i error /var/lib/cockroach/cockroach-data/logs/cockroach.log | tail -20

# Check startup log:
head -50 /var/lib/cockroach/cockroach-data/logs/cockroach.log
```

Step 2: Check Certificates

```bash
# CockroachDB requires TLS by default

# Check certificate files:
ls -la /var/lib/cockroach/certs/

# Required files:
#   ca.crt          - CA certificate
#   node.crt        - Node certificate
#   node.key        - Node private key
#   client.root.crt - Client certificate
#   client.root.key - Client private key

# Check certificate validity:
openssl x509 -in /var/lib/cockroach/certs/node.crt -text -noout | head -20

# Check certificate expiration:
openssl x509 -in /var/lib/cockroach/certs/node.crt -noout -dates

# Check certificate chain:
openssl verify -CAfile /var/lib/cockroach/certs/ca.crt /var/lib/cockroach/certs/node.crt

# Create certificates:
cockroach cert create-ca \
  --certs-dir=/var/lib/cockroach/certs \
  --ca-key=/var/lib/cockroach/ca.key

cockroach cert create-node \
  localhost \
  "$(hostname)" \
  node1 \
  --certs-dir=/var/lib/cockroach/certs \
  --ca-key=/var/lib/cockroach/ca.key

cockroach cert create-client root \
  --certs-dir=/var/lib/cockroach/certs \
  --ca-key=/var/lib/cockroach/ca.key

# Fix permissions:
chmod 400 /var/lib/cockroach/certs/*.key
chown cockroach:cockroach /var/lib/cockroach/certs/*
```
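
Before running the permission fix, it can help to see which key files are actually too open. The following is a minimal sketch: `list_loose_keys` is a hypothetical helper name, and the `-perm /077` test assumes GNU `find`. The demo runs against a throwaway directory so that no real certificates are touched.

```bash
#!/usr/bin/env bash
# list_loose_keys is a hypothetical helper; -perm /077 assumes GNU find.

# Print every *.key file directly inside the given directory that grants
# any permission bit to group or others.
list_loose_keys() {
  local certs_dir="$1"
  find "$certs_dir" -maxdepth 1 -type f -name '*.key' -perm /077
}

# Demo on a temporary directory with one tight and one loose key file.
demo_dir="$(mktemp -d)"
touch "$demo_dir/node.key" "$demo_dir/client.root.key"
chmod 400 "$demo_dir/node.key"
chmod 644 "$demo_dir/client.root.key"
list_loose_keys "$demo_dir"   # prints only the path of client.root.key
rm -rf "$demo_dir"
```

On a real node the call would be `list_loose_keys /var/lib/cockroach/certs`; an empty result means no key file is readable by group or others.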

Step 3: Check Network Connectivity

```bash
# Test connectivity to other nodes:
ping node1
ping node2

# Test gRPC port (26257):
nc -zv node1 26257
telnet node1 26257

# Test HTTP port (8080):
curl -k https://node1:8080/health

# Check DNS resolution:
nslookup node1
dig node1

# Check firewall:
iptables -L -n | grep 26257
ufw status | grep 26257

# Allow ports:
ufw allow 26257/tcp
ufw allow 8080/tcp

# The cockroach commands below use --insecure; on a cluster secured with
# certificates, use --certs-dir=/var/lib/cockroach/certs instead.

# Check if node can reach itself:
cockroach sql --host=localhost:26257 --insecure

# Test from another node:
cockroach node status --host=node1:26257 --insecure

# Check network interface:
ip addr show
```
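
When the `--join` list is long, checking each entry by hand gets tedious. The sketch below splits a `--join` value into host and port pairs and probes each one with `nc`. Both function names are hypothetical, entries without a port are assumed to use 26257, and bracketed IPv6 addresses are not handled.

```bash
#!/usr/bin/env bash
# split_join_list and check_join_hosts are hypothetical helper names.

# Print one "host port" pair per line for a comma-separated --join value.
# Entries without an explicit port fall back to 26257 (an assumption).
split_join_list() {
  local entry host port
  local IFS=','
  for entry in $1; do
    host="${entry%%:*}"
    if [[ "$entry" == *:* ]]; then
      port="${entry##*:}"
    else
      port=26257
    fi
    echo "$host $port"
  done
}

# Probe every join target with a 2-second TCP connect test.
# Returns non-zero if at least one target is unreachable.
check_join_hosts() {
  local host port failed=0
  while read -r host port; do
    if nc -z -w 2 "$host" "$port" </dev/null 2>/dev/null; then
      echo "ok   $host:$port"
    else
      echo "FAIL $host:$port"
      failed=1
    fi
  done < <(split_join_list "$1")
  return "$failed"
}

split_join_list "node1:26257,node2:26257"
# To probe the targets: check_join_hosts "node1:26257,node2:26257"
```

Keeping the parsing separate from the probing makes it easy to swap `nc` for another tool if it is not installed on the node.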

Step 4: Check Storage

```bash
# Check data directory:
ls -la /var/lib/cockroach/cockroach-data/

# Check disk space:
df -h /var/lib/cockroach/

# Check disk permissions:
ls -la /var/lib/cockroach/
chown -R cockroach:cockroach /var/lib/cockroach/

# Check for existing data:
ls -la /var/lib/cockroach/cockroach-data/auxiliary/

# Check temporary directory:
ls -la /tmp/

# If data directory corrupted:
# Option 1: Remove and rejoin as new node
# WARNING: this permanently deletes the node's local data.
rm -rf /var/lib/cockroach/cockroach-data/*

# Option 2: Restore from backup
cockroach sql --execute="RESTORE DATABASE db FROM '/backup'"

# Set the store path and size explicitly:
cockroach start --store=path=/var/lib/cockroach/cockroach-data,size=100GB

# Verify no other process using data:
lsof /var/lib/cockroach/cockroach-data/
```
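
The existence, permission, and disk space checks above can be bundled into one function. This is a rough sketch under stated assumptions: `check_store_dir` is a hypothetical helper name, and the usage figure is read from the fifth column of `df -P`, which assumes the filesystem name contains no spaces.

```bash
#!/usr/bin/env bash
# check_store_dir is a hypothetical helper name.

# Report whether a store directory exists, is writable by the current
# user, and how full its filesystem is. Returns non-zero on a problem.
check_store_dir() {
  local dir="$1" used
  if [[ ! -d "$dir" ]]; then
    echo "missing: $dir"
    return 1
  fi
  if [[ ! -w "$dir" ]]; then
    echo "not writable by $(id -un): $dir"
    return 1
  fi
  # df -P prints one line per filesystem; field 5 is the usage percentage.
  used="$(df -P "$dir" | awk 'NR == 2 { print $5 }')"
  echo "ok: $dir (filesystem ${used} used)"
}

check_store_dir /var/lib/cockroach/cockroach-data || true
```

Run it as the same user that starts CockroachDB, since the writability test reflects the current user and is always true for root.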

Step 5: Check Port Availability

```bash
# Check if ports are in use:
netstat -tlnp | grep -E "26257|8080"
ss -tlnp | grep -E "26257|8080"

# Check what's using the port:
lsof -i :26257
lsof -i :8080

# Kill process using port:
kill -9 <pid>

# Use different port:
cockroach start --port=26258 --http-port=8081 --join=node1:26257

# Check all CockroachDB processes:
ps aux | grep cockroach

# Kill all CockroachDB processes:
pkill -9 cockroach

# Check for zombie processes:
ps aux | grep defunct
```
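
A plain `grep 26257` also matches unrelated fields that happen to contain those digits. The sketch below matches only the port at the end of the local address column of `ss -tln` output. `port_is_listening` is a hypothetical helper name, and the sample text is an illustrative stand-in shaped like `ss -tln` output, not captured from a real host.

```bash
#!/usr/bin/env bash
# port_is_listening is a hypothetical helper name.

# Read `ss -tln` output on stdin; succeed if any socket listens on the
# given port. Column 4 is assumed to be "Local Address:Port".
port_is_listening() {
  local port="$1"
  awk -v p="$port" 'NR > 1 && $4 ~ (":" p "$") { found = 1 } END { exit !found }'
}

# Illustrative sample input (not real output from any host).
sample='State  Recv-Q Send-Q Local Address:Port Peer Address:Port
LISTEN 0      128    0.0.0.0:26257      0.0.0.0:*
LISTEN 0      128    [::]:22            [::]:*'

if port_is_listening 26257 <<<"$sample"; then echo "26257 in use"; fi
if ! port_is_listening 8080 <<<"$sample"; then echo "8080 free"; fi
```

On a live node, pipe the real listing in: `ss -tln | port_is_listening 26257 && echo "26257 is taken"`.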

Step 6: Check Resource Limits

```bash
# Check file descriptor limits (should be high, 65535 or more):
ulimit -n

# Check memory:
free -m

# Check current limits of the running process
# (-o picks the oldest match if several cockroach processes exist):
cat /proc/$(pgrep -o cockroach)/limits

# Increase file descriptors:
ulimit -n 65535

# In systemd service, add to the unit file:
#   [Service]
#   LimitNOFILE=65535
#   LimitMEMLOCK=infinity

# Or in limits.conf:
echo "* soft nofile 65535" >> /etc/security/limits.conf
echo "* hard nofile 65535" >> /etc/security/limits.conf

# Check available memory:
grep -E "MemTotal|MemFree" /proc/meminfo

# Check for OOM:
dmesg | grep -i "out of memory"
```
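
The file descriptor check is easy to automate before each start. In this sketch, `fd_limit_ok` is a hypothetical helper name and the default threshold of 65535 is the value suggested above. The function takes the limit as an argument so that it can be tested with known values.

```bash
#!/usr/bin/env bash
# fd_limit_ok is a hypothetical helper name.

# Succeed when the given limit (as printed by `ulimit -n`) meets the
# minimum. The value "unlimited" always passes.
fd_limit_ok() {
  local limit="$1" min="${2:-65535}"
  [[ "$limit" == "unlimited" ]] && return 0
  (( limit >= min ))
}

current="$(ulimit -n)"
if fd_limit_ok "$current"; then
  echo "file descriptor limit OK: $current"
else
  echo "file descriptor limit too low: $current (want >= 65535)"
fi
```

Note that `ulimit -n` reports the limit of the current shell; a node started by systemd gets its limit from the unit file instead.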

Step 7: Verify Cluster Join

```bash
# Check cluster status:
cockroach node status --host=node1:26257 --insecure

# List all nodes:
cockroach node ls --host=node1:26257 --insecure

# Check node is in cluster:
cockroach node status --host=node1:26257 --insecure | grep "$(hostname)"

# If cluster not initialized (run once; returns an error if the cluster
# has already been initialized):
cockroach init --host=node1:26257 --insecure

# Check initialization:
cockroach sql --host=node1:26257 --insecure -e "SELECT 1"

# View cluster settings:
cockroach sql --host=node1:26257 --insecure -e "SHOW ALL CLUSTER SETTINGS"

# Decommission node if needed:
cockroach node decommission <node-id> --host=node1:26257 --insecure
```

Step 8: Check Version Compatibility

```bash
# Check CockroachDB version:
cockroach version

# Check version on all nodes:
cockroach sql --host=node1:26257 --insecure -e "SELECT version()"

# All nodes must run compatible versions
# Major version should match

# Check release notes for breaking changes:
# https://www.cockroachlabs.com/docs/releases/

# Upgrade process:
# 1. One node at a time
# 2. Wait for node to rejoin
# 3. Verify cluster healthy
# 4. Proceed to next node

# Downgrade not supported in most cases

# Check binary location:
which cockroach
ls -la /usr/local/bin/cockroach
```
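
A rough version comparison can be scripted once you have the two version strings. This sketch assumes tags of the form `vMAJOR.MINOR.PATCH` and treats the first two components as the release series that should match; both function names and the version strings in the demo are hypothetical. It only approximates the "major version should match" rule above, so check the release notes for the actual compatibility guarantees.

```bash
#!/usr/bin/env bash
# major_minor and same_release are hypothetical helper names.

# Print the first two components of a version tag, without a leading "v".
major_minor() {
  local v="${1#v}"
  local major="${v%%.*}"
  local rest="${v#*.}"
  echo "${major}.${rest%%.*}"
}

# Succeed when two version tags share the same first two components.
same_release() {
  [[ "$(major_minor "$1")" == "$(major_minor "$2")" ]]
}

# Hypothetical version strings for illustration only.
if same_release "v23.1.4" "v23.1.11"; then
  echo "same release series"
fi
if ! same_release "v23.1.4" "v23.2.0"; then
  echo "different release series"
fi
```

The inputs would typically come from `cockroach version` on the new node and `SELECT version()` on an existing one.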

Step 9: Debug Startup Issues

```bash
# Enable debug logging:
cockroach start \
  --join=node1:26257 \
  --store=/var/lib/cockroach/cockroach-data \
  --logtostderr \
  --v=5 \
  --log-file-verbosity=5

# Check startup sequence in log (-E enables the | alternation):
grep -iE "starting|listening|connected" /var/lib/cockroach/cockroach-data/logs/cockroach.log

# Check for bootstrap issues:
grep -i bootstrap /var/lib/cockroach/cockroach-data/logs/cockroach.log

# Check for replication issues:
cockroach sql --host=localhost:26257 --insecure -e "SHOW RANGES FROM DATABASE system"

# Check node liveness:
cockroach sql --host=localhost:26257 --insecure -e "SELECT * FROM system.liveness"

# Check for stuck operations:
cockroach sql --host=localhost:26257 --insecure -e "SHOW JOBS"

# Run diagnostics:
cockroach debug zip /tmp/cockroach-debug.zip
```

Step 10: CockroachDB Node Verification Script

```bash
# Create verification script:
cat << 'EOF' > /usr/local/bin/check-cockroach-node.sh
#!/bin/bash

echo "=== CockroachDB Process ==="
ps aux | grep cockroach | grep -v grep

echo ""
echo "=== Service Status ==="
systemctl status cockroachdb 2>/dev/null || echo "Not running via systemd"

echo ""
echo "=== Certificate Files ==="
ls -la /var/lib/cockroach/certs/ 2>/dev/null || echo "No certs directory"

echo ""
echo "=== Certificate Expiration ==="
openssl x509 -in /var/lib/cockroach/certs/node.crt -noout -dates 2>/dev/null || echo "No node certificate"

echo ""
echo "=== Data Directory ==="
ls -la /var/lib/cockroach/cockroach-data/ 2>/dev/null || echo "Data directory not found"

echo ""
echo "=== Disk Space ==="
df -h /var/lib/cockroach 2>/dev/null

echo ""
echo "=== Port Status ==="
netstat -tlnp 2>/dev/null | grep -E "26257|8080" || ss -tlnp | grep -E "26257|8080"

echo ""
echo "=== Node Connectivity ==="
nc -zv localhost 26257 2>&1
nc -zv localhost 8080 2>&1

echo ""
echo "=== File Descriptors ==="
ulimit -n

echo ""
echo "=== Recent Logs ==="
tail -20 /var/lib/cockroach/cockroach-data/logs/cockroach.log 2>/dev/null || echo "No log file"

echo ""
echo "=== Recommendations ==="
echo "1. Verify certificates exist and are valid"
echo "2. Check network connectivity to join hosts"
echo "3. Ensure ports 26257 and 8080 are available"
echo "4. Verify data directory permissions"
echo "5. Increase file descriptor limits"
echo "6. Check version compatibility with cluster"
echo "7. Review logs for specific errors"
EOF

chmod +x /usr/local/bin/check-cockroach-node.sh

# Usage:
/usr/local/bin/check-cockroach-node.sh
```

CockroachDB Node Startup Checklist

| Check | Expected |
| --- | --- |
| Certificates | Present and valid |
| Data directory | Exists and writable |
| Ports | 26257 and 8080 available |
| Join hosts | Reachable |
| File descriptors | >= 65535 |
| Memory | Adequate |
| Version | Compatible with cluster |

Verify the Fix

```bash
# After fixing CockroachDB node startup

# 1. Check process running
ps aux | grep cockroach
# Expected: process running

# 2. Check node status
cockroach node status --host=localhost:26257 --insecure
# Expected: node appears in list

# 3. Check health
curl -k https://localhost:8080/health
# Expected: returns "OK"

# 4. Test SQL
cockroach sql --host=localhost:26257 --insecure -e "SELECT 1"
# Expected: returns 1

# 5. Check logs
tail /var/lib/cockroach/cockroach-data/logs/cockroach.log
# Expected: no errors

# 6. Verify replication
cockroach sql --host=localhost:26257 --insecure -e "SHOW RANGES"
# Expected: replicas distributed
```
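
A node can take some time to become healthy after a restart, so a single check right after startup may fail even when the fix worked. The following is a minimal sketch of a retry wrapper; `retry` is a hypothetical helper name, and the attempt count and delay in the example are arbitrary choices.

```bash
#!/usr/bin/env bash
# retry is a hypothetical helper name.

# Run a command up to <attempts> times, sleeping <delay> seconds between
# failed attempts. Returns 0 on the first success, 1 if every attempt fails.
retry() {
  local attempts="$1" delay="$2" i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}

# Example: poll the health endpoint up to 10 times, 3 seconds apart.
# retry 10 3 curl -fsk https://localhost:8080/health
```

The same wrapper works for the SQL check, for example `retry 10 3 cockroach sql --host=localhost:26257 --insecure -e "SELECT 1"`.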

  • [Fix PostgreSQL Start Failed](/articles/fix-postgresql-start-failed)
  • [Fix MySQL Start Failed](/articles/fix-mysql-start-failed)
  • [Fix Cassandra Nodes Down](/articles/fix-cassandra-nodes-down)