FailureGrid™ telemetry backbone

BatMON BLE → MQTT bridge

Live FailureBat™ pack telemetry without reboot drama.

BatMON is the Raspberry Pi-side bridge that reads JK-BMS packs over BLE and publishes pack state to MQTT and Home Assistant. FailureGrid™ depends on this feed for pack voltage, current, SOC, MOS temperature, cell spread and diagnostics.

  • Four FailureBat™ packs, four dedicated Bluetooth adapters.
  • BLE telemetry is forwarded to MQTT and Home Assistant discovery.
  • Watchdog recovery resets BatMON and hci adapters instead of rebooting the whole Pi.

FailureCamMon / batmon.service
FailureBat-02: BmsSampl(66.4%,U=52.5V,I=15.44A,P=810W,Q=208/314Ah,mos=34°C)
FailureBat-03: BmsSampl(45.6%,U=52.0V,I=0.00A,P=0W,Q=143/314Ah,mos=32°C)
FailureBat-01: BmsSampl(72.5%,U=52.4V,I=13.19A,P=691W,Q=228/314Ah,mos=34°C)
FailureBat-00: BmsSampl(66.3%,U=52.4V,I=12.79A,P=670W,Q=208/314Ah,mos=33°C)
🧠

Service

batmon.service runs the Python bridge from /home/failurep/batmon-ha/.

📡

Transport

JK-BMS BLE packets become MQTT telemetry and Home Assistant sensor discovery.

Adapters

Dedicated hci0–hci3 adapters reduce BLE contention between packs.

Stable cadence

sample_period: 30, publish_period: 10, keep_alive: false.

Watchdog

Detects recurring BLE errors and resets BatMON + adapters automatically.

🔎

Inspectable

Live journal filters show samples, failures, watchdog actions and reconnects.

Known BLE quirks

Watch for "InProgress", "timeout waiting for" and stale discovery states.

🔐

Redaction

Do not publish raw logs with JK-BMS PSK values or full debug byte buffers.

Architecture

BatMON is the live pack telemetry layer under FailureGrid™. It should fail small, recover locally and never require a full Raspberry Pi reboot.

🔋

FailureBat-03

Dedicated BLE path via hci0. Publishes SOC, voltage, current, power, capacity and cells.

🔋

FailureBat-02

Dedicated BLE path via hci1. Watch for occasional timeout waiting for 2 events.

🔋

FailureBat-01

Dedicated BLE path via hci2. Previously hit the BlueZ error "Operation already in progress"; an hci reset restored it.

🔋

FailureBat-00

Dedicated BLE path via hci3. Feeds the same MQTT/HA discovery structure.

📨

MQTT broker

BatMON connects to the local MQTT broker and Home Assistant consumes the discovered sensors.

🏠

Home Assistant

FailureGrid™ dashboards rely on these entities to show pack state and control confidence.

Live status commands

Use these first. They show whether BatMON, BLE, MQTT and Home Assistant discovery are alive.

1. Service health

systemctl status batmon.service --no-pager -l

Good: active (running) and fresh sample lines. Bad: repeated Tracebacks, restart loops, or no fresh samples.

2. Clean live BatMON view

journalctl -u batmon.service -f -o short-iso \
| grep -aEi 'FailureBat-|BmsSampl|ERROR|WARNING|InProgress|not found|Service Discovery|timeout waiting|disconnected after'

This is the main operator view: samples, reconnects and useful failures without the full byte-buffer noise.

3. One-shot recent status

journalctl -u batmon.service -n 160 --no-pager -o short-iso \
| grep -aEi 'FailureBat-|BmsSampl|ERROR|WARNING|InProgress|not found|Service Discovery|timeout waiting|disconnected after'

Useful after opening an SSH session when you want the last few minutes without following the log.

4. Bluetooth adapter map

bluetoothctl list
for i in 0 1 2 3; do echo "===== hci$i ====="; btmgmt --index "$i" info 2>/dev/null || true; done

Confirms that the four adapters still exist and have not disappeared from USB/BlueZ.
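The same check can be scripted, as a minimal sketch. It assumes the kernel exposes each controller as /sys/class/bluetooth/hciN, which is the usual Linux layout; the function name missing_adapters and its directory argument are illustrative and not part of BatMON.

```shell
# Print the hci indexes (from the list given) that have no sysfs entry.
# $1 is the sysfs directory, normally /sys/class/bluetooth. It is a
# parameter only so the function can be tried against a scratch directory.
missing_adapters() (
  dir="$1"
  shift
  for i in "$@"; do
    [ -e "$dir/hci$i" ] || printf 'hci%s\n' "$i"
  done
)

# Usage: prints nothing when all four adapters are present.
# missing_adapters /sys/class/bluetooth 0 1 2 3
```

Any index it prints is worth checking against USB and dmesg before touching BatMON itself.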

5. MQTT discovery check

timeout 20 mosquitto_sub -h 192.168.42.99 -t 'homeassistant/sensor/#' -v -R \
| grep -aEi 'FailureBat|state_topic|json_attributes_topic|unique_id|name'

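Shows Home Assistant discovery messages. Use this when HA entities look wrong even though BatMON samples are fresh. Note that -R suppresses retained messages, so if nothing appears and the discovery configs are published as retained, rerun the command without -R.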

6. Avoid noisy full MQTT grep

timeout 60 mosquitto_sub -h 192.168.42.99 -t '#' -v -R \
| grep -aEiv '^(zigbee2mqtt|homeassistant)/' \
| grep -aiE 'fail|bat|bms|jk|cell|soc|volt|curr'

-R skips retained messages and grep -a treats binary payloads as text, so retained Zigbee2MQTT payloads do not turn the output into binary-looking noise.

How to read the log

The point is not zero warnings forever. The point is fresh samples, automatic recovery and no long stale periods.

Good

BmsSampl(...) appears for all four packs within a normal cycle.

Recovered

A single "timeout waiting for" error followed by a successful reconnect is annoying but not fatal.

🟡

Watch

"disconnected after 300s" means the BLE connection aged out or stalled.

🔴

Bad

Repeated org.bluez.Error.InProgress usually means a stuck BlueZ scan/connect operation.

Normal recovery example
WARNING [bt] BMS FailureBat-02 disconnected after 300.9s!
ERROR [sampling] FailureBat-02 error (#1): timeout waiting for 2
INFO [sampling] connecting bms FailureBat-02
INFO [sampling] connected bms FailureBat-02!
INFO [sampling] FailureBat-02: BmsSampl(...)
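To turn "fresh samples for all four packs" into a quick check, a small filter can list the packs that have no BmsSampl line in a chunk of journal text. This is a sketch only: the function name stale_packs is made up for illustration, and it assumes sample lines look like the ones shown in this document ("FailureBat-NN: BmsSampl(").

```shell
# Read journal text on stdin and print every pack alias that has no
# BmsSampl line in it. The four aliases are the ones used in this document.
stale_packs() (
  log="$(cat)"
  for pack in FailureBat-00 FailureBat-01 FailureBat-02 FailureBat-03; do
    printf '%s\n' "$log" | grep -aq "$pack: BmsSampl(" || printf '%s\n' "$pack"
  done
)

# Usage (the two-minute window is an arbitrary choice, not a BatMON setting):
# journalctl -u batmon.service --since "2 minutes ago" --no-pager -o cat | stale_packs
```

Empty output means every pack produced at least one sample in the window; any alias printed is a candidate for the targeted recovery steps.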

Manual recovery without rebooting the Raspberry Pi

Use the smallest reset that matches the failure. Rebooting the whole machine should be the last resort.

Restart only BatMON

sudo systemctl restart batmon.service

Use when BatMON itself looks stale but BlueZ adapters still look normal.

Reset one stuck adapter

sudo systemctl stop batmon.service
sudo btmgmt --index 2 power off
sleep 3
sudo btmgmt --index 2 power on
sleep 5
sudo systemctl start batmon.service

Example uses hci2. Change the index based on the adapter-to-pack map.
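Picking the right index is the error-prone part, so the adapter-to-pack map can be written down once as a lookup. The sketch below only prints the reset sequence instead of running it; pack_to_hci and print_reset_plan are hypothetical helper names, and the mapping is the one listed in this document.

```shell
# Map a pack alias to its hci index, following the adapter map in this
# document. Returns non-zero for an unknown alias.
pack_to_hci() {
  case "$1" in
    FailureBat-03) echo 0 ;;
    FailureBat-02) echo 1 ;;
    FailureBat-01) echo 2 ;;
    FailureBat-00) echo 3 ;;
    *) echo "unknown pack: $1" >&2; return 1 ;;
  esac
}

# Print (not run) the single-adapter reset sequence for one pack.
print_reset_plan() {
  _idx="$(pack_to_hci "$1")" || return 1
  printf '%s\n' \
    "sudo systemctl stop batmon.service" \
    "sudo btmgmt --index $_idx power off" \
    "sleep 3" \
    "sudo btmgmt --index $_idx power on" \
    "sleep 5" \
    "sudo systemctl start batmon.service"
}

# Usage: review the plan before running the commands by hand.
# print_reset_plan FailureBat-01
```

Printing rather than executing keeps the helper safe to run from any shell and makes the chosen index visible before anything is powered off.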

Reset all BatMON adapters

sudo systemctl stop batmon.service
for i in 0 1 2 3; do sudo btmgmt --index "$i" power off || true; done
sleep 3
for i in 0 1 2 3; do sudo btmgmt --index "$i" power on || true; done
sleep 5
sudo systemctl start batmon.service

This is the reboot replacement when several packs are stale or BlueZ is clearly wedged.

Full Bluetooth service reset

sudo systemctl stop batmon.service
sudo systemctl restart bluetooth.service
sleep 5
sudo systemctl start batmon.service

Use if adapter power cycling is not enough.

Watchdog setup

Keep the logic in /home/failurep/batmon-ha/. Systemd only points to it.

1. Create the watchdog script

mkdir -p /home/failurep/batmon-ha/scripts
sudo tee /home/failurep/batmon-ha/scripts/batmon-ble-watchdog.sh >/dev/null <<'EOF'
#!/usr/bin/env bash
set -Eeuo pipefail

SERVICE="batmon.service"
COOLDOWN_FILE="/run/batmon-ble-watchdog.last"
COOLDOWN_SECONDS=600
ERROR_WINDOW="5 minutes ago"
ERROR_PATTERN='org\.bluez\.Error\.InProgress|Service Discovery has not been performed yet|device .* not found|timeout waiting for|disconnected after|BleakDBusError'

now="$(date +%s)"

if [[ -f "$COOLDOWN_FILE" ]]; then
  last="$(cat "$COOLDOWN_FILE" 2>/dev/null || echo 0)"
  if (( now - last < COOLDOWN_SECONDS )); then
    exit 0
  fi
fi

errors="$(journalctl -u "$SERVICE" --since "$ERROR_WINDOW" --no-pager -o cat | grep -Eci "$ERROR_PATTERN" || true)"

if (( errors < 3 )); then
  exit 0
fi

echo "$now" > "$COOLDOWN_FILE"
logger -t batmon-ble-watchdog "Detected $errors BatMON/BLE errors; resetting BatMON and hci0-hci3"

systemctl stop "$SERVICE" || true
sleep 2
for i in 0 1 2 3; do btmgmt --index "$i" power off || true; done
sleep 3
for i in 0 1 2 3; do btmgmt --index "$i" power on || true; done
sleep 5
systemctl start "$SERVICE" || true
logger -t batmon-ble-watchdog "BatMON BLE reset completed"
EOF
sudo chmod +x /home/failurep/batmon-ha/scripts/batmon-ble-watchdog.sh
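The script's decision logic can be tried in isolation before the timer is enabled. The sketch below restates it as two small functions: count_ble_errors applies the same pattern to text on stdin, and should_reset applies the same 600-second cooldown and three-error threshold. Both function names are illustrative and do not exist in the script itself.

```shell
# Count log lines on stdin that match the watchdog's error pattern.
count_ble_errors() {
  grep -Eci 'org\.bluez\.Error\.InProgress|Service Discovery has not been performed yet|device .* not found|timeout waiting for|disconnected after|BleakDBusError' || true
}

# Decide whether a reset is due. Arguments: now and last reset time (both
# in epoch seconds), then the error count. Exit status 0 means "reset".
should_reset() {
  [ $(( $1 - $2 )) -ge 600 ] && [ "$3" -ge 3 ]
}

# Usage: dry-run against the real journal without touching any adapter.
# n="$(journalctl -u batmon.service --since "5 minutes ago" --no-pager -o cat | count_ble_errors)"
# should_reset "$(date +%s)" 0 "$n" && echo "watchdog would reset now"
```

One thing this makes visible: a single disconnect followed by a timeout, as in the normal recovery example, already counts as two matches, so one more matching line inside the five-minute window reaches the threshold.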


2. Create systemd service

sudo tee /etc/systemd/system/batmon-ble-watchdog.service >/dev/null <<'EOF'
[Unit]
Description=BatMON BLE watchdog resetter

[Service]
Type=oneshot
ExecStart=/home/failurep/batmon-ha/scripts/batmon-ble-watchdog.sh
EOF

3. Create systemd timer

sudo tee /etc/systemd/system/batmon-ble-watchdog.timer >/dev/null <<'EOF'
[Unit]
Description=Run BatMON BLE watchdog every minute

[Timer]
OnBootSec=2min
OnUnitActiveSec=1min
AccuracySec=10s
Persistent=true

[Install]
WantedBy=timers.target
EOF

4. Enable watchdog

sudo systemctl daemon-reload
sudo systemctl enable --now batmon-ble-watchdog.timer
systemctl status batmon-ble-watchdog.timer --no-pager

5. Watch watchdog actions

journalctl -t batmon-ble-watchdog -n 80 --no-pager
systemctl list-timers | grep batmon

Known stable config

This is the current stability-oriented BatMON setup for the FailureGrid™ pack bridge.

Battery blocks

# Each pack has a dedicated adapter.
# Example structure only — keep private addresses in batmon.yaml.
- alias: "FailureBat-03"
  adapter: "hci0"
  keep_alive: false
- alias: "FailureBat-02"
  adapter: "hci1"
  keep_alive: false
- alias: "FailureBat-01"
  adapter: "hci2"
  keep_alive: false
- alias: "FailureBat-00"
  adapter: "hci3"
  keep_alive: false

Sampling and publishing

sample_period: 30          # seconds between reads
publish_period: 10         # seconds between MQTT publishes

sample_period: 2 was too aggressive for long-term BlueZ/Bleak stability. The slower cadence is still fast enough for pack telemetry.
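A rough worked example of the difference in BLE load, under the assumption that every pack is read once per sample_period (that is an assumption about BatMON's scheduling, and reads_per_minute is an illustrative helper name):

```shell
# Approximate BLE reads per minute across all packs.
# $1 = sample_period in seconds, $2 = number of packs.
reads_per_minute() {
  echo $(( 60 * $2 / $1 ))
}

reads_per_minute 30 4   # current setting: prints 8
reads_per_minute 2 4    # earlier setting: prints 120
```

Under that assumption the slower cadence cuts BLE traffic by a factor of fifteen while still refreshing each pack twice a minute.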

Verify current config

grep -nEi 'keep_alive|sample_period|publish_period|FailureBat|adapter|hci' /home/failurep/batmon-ha/batmon.yaml

Backup before edits

cp /home/failurep/batmon-ha/batmon.yaml \
/home/failurep/batmon-ha/batmon.yaml.bak-$(date +%F-%H%M%S)

Systemd reliability

BatMON should restart itself if it exits, and the watchdog should handle BLE wedging.

Auto-restart BatMON if the process exits

sudo mkdir -p /etc/systemd/system/batmon.service.d
sudo tee /etc/systemd/system/batmon.service.d/restart.conf >/dev/null <<'EOF'
[Service]
Restart=always
RestartSec=10
EOF
sudo systemctl daemon-reload
sudo systemctl restart batmon.service
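To confirm the drop-in took effect, the unit's effective properties can be checked. This is a minimal sketch: restart_policy_ok is a hypothetical helper that reads "systemctl show" output on stdin, and it only tests the Restart= value.

```shell
# Read "systemctl show" output on stdin and succeed only if the unit is
# configured to restart unconditionally.
restart_policy_ok() {
  grep -qx 'Restart=always'
}

# Usage (systemd reports the delay as the RestartUSec property):
# systemctl show batmon.service -p Restart -p RestartUSec | restart_policy_ok \
#   && echo "restart drop-in active"
```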

Confirm unit command

systemctl cat batmon.service

Expected runtime path: /home/failurep/batmon-ha/main.py -c /home/failurep/batmon-ha/batmon.yaml.

Confirm process

ps aux | grep -Ei 'batmon|main.py|python' | grep -v grep

Operator rules

What to do when something looks wrong

FailureGrid™ should keep pack telemetry boring. When it is not boring, follow this order.

1. Check systemctl status batmon.service.
2. Run the clean live log filter and see which pack is stale.
3. If one adapter is bad, reset only that hci index.
4. If several are stale, reset all hci0–hci3 and restart BatMON.
5. If errors repeat, check watchdog logs and USB/BT adapter health.
6. Reboot the Pi only after targeted BLE recovery fails.