Service
batmon.service runs the Python bridge from /home/failurep/batmon-ha/.
BatMON is the Raspberry Pi-side bridge that reads JK-BMS packs over BLE and publishes pack state to MQTT and Home Assistant. FailureGrid™ depends on this feed for pack voltage, current, SOC, MOS temperature, cell spread and diagnostics.
FailureBat-02: BmsSampl(66.4%,U=52.5V,I=15.44A,P=810W,Q=208/314Ah,mos=34°C)
FailureBat-03: BmsSampl(45.6%,U=52.0V,I=0.00A,P=0W,Q=143/314Ah,mos=32°C)
FailureBat-01: BmsSampl(72.5%,U=52.4V,I=13.19A,P=691W,Q=228/314Ah,mos=34°C)
FailureBat-00: BmsSampl(66.3%,U=52.4V,I=12.79A,P=670W,Q=208/314Ah,mos=33°C)
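The sample lines above have a fixed shape, so they can be turned into plain columns for quick comparisons. This is a minimal sketch that assumes the field order shown in those samples; parse_bms_samples is a hypothetical helper name, and allowing a leading minus sign on current and power is an assumption, not something the samples show.

```shell
# Hypothetical helper: read log lines on stdin and print one row per sample as
#   pack soc volts amps watts charge_ah capacity_ah mos_temp
# Assumes the BmsSampl(...) field order seen in the sample lines above.
parse_bms_samples() {
  # LC_ALL=C makes sed match byte-wise, so the degree sign cannot trip the regex.
  LC_ALL=C sed -nE 's|.*(FailureBat-[0-9]+): BmsSampl\(([0-9.]+)%,U=([0-9.]+)V,I=(-?[0-9.]+)A,P=(-?[0-9.]+)W,Q=([0-9.]+)/([0-9.]+)Ah,mos=(-?[0-9.]+).*|\1 \2 \3 \4 \5 \6 \7 \8|p'
}
```

Possible use: journalctl -u batmon.service -n 160 --no-pager -o cat | parse_bms_samples. Lines that are not samples are dropped silently.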
JK-BMS BLE packets become MQTT telemetry and Home Assistant sensor discovery.
Dedicated hci0–hci3 adapters reduce BLE contention between packs.
Current cadence: sample_period: 30, publish_period: 10, keep_alive: false.
The BLE watchdog detects recurring BLE errors and resets BatMON and the adapters automatically.
Live journal filters show samples, failures, watchdog actions and reconnects.
Watch for "InProgress", "timeout waiting for" and stale discovery states.
Do not publish raw logs with JK-BMS PSK values or full debug byte buffers.
BatMON is the live pack telemetry layer under FailureGrid™. It should fail small, recover locally and never require a full Raspberry Pi reboot.
Dedicated BLE path via hci0. Publishes SOC, voltage, current, power, capacity and cells.
Dedicated BLE path via hci1. Watch for occasional "timeout waiting for 2" events.
Dedicated BLE path via hci2. Previously hit the BlueZ "Operation already in progress" error; an hci reset restored it.
Dedicated BLE path via hci3. Feeds the same MQTT/HA discovery structure.
BatMON connects to the local MQTT broker and Home Assistant consumes the discovered sensors.
FailureGrid™ dashboards rely on these entities to show pack state and control confidence.
Use these first. They show whether BatMON, BLE, MQTT and Home Assistant discovery are alive.
systemctl status batmon.service --no-pager -l
Good: active (running) and fresh sample lines. Bad: repeated Tracebacks, restart loops, or no fresh samples.
journalctl -u batmon.service -f -o short-iso \
| grep -aEi 'FailureBat-|BmsSampl|ERROR|WARNING|InProgress|not found|Service Discovery|timeout waiting|disconnected after'
This is the main operator view: samples, reconnects and useful failures without the full byte-buffer noise.
journalctl -u batmon.service -n 160 --no-pager -o short-iso \
| grep -aEi 'FailureBat-|BmsSampl|ERROR|WARNING|InProgress|not found|Service Discovery|timeout waiting|disconnected after'
Useful after opening an SSH session when you want the last few minutes without following the log.
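When the filtered view is still too long to eyeball, a rough tally helps. The sketch below counts a few of the strings this page already greps for; summarize_batmon_log is a hypothetical helper, and the counts are only as precise as simple substring matching allows.

```shell
# Hypothetical helper: tally sample, error and warning lines from stdin.
# Matching is case-sensitive and one line can count in several buckets.
summarize_batmon_log() {
  awk '
    /BmsSampl/        { samples++ }
    /ERROR/           { errors++ }
    /WARNING/         { warnings++ }
    /InProgress/      { inprogress++ }
    /timeout waiting/ { timeouts++ }
    END {
      printf "samples=%d errors=%d warnings=%d inprogress=%d timeouts=%d\n",
             samples, errors, warnings, inprogress, timeouts
    }
  '
}
```

Possible use: journalctl -u batmon.service -n 160 --no-pager -o short-iso | summarize_batmon_log. Many samples with a handful of errors is the normal picture; zero samples is the one to act on.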
bluetoothctl list
for i in 0 1 2 3; do echo "===== hci$i ====="; btmgmt --index "$i" info 2>/dev/null || true; done
Confirms that the four adapters still exist and have not disappeared from USB/BlueZ.
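A quieter variant of the same check is to look for the adapters in sysfs and print only the ones that are gone. This is a sketch, assuming the kernel exposes each controller as /sys/class/bluetooth/hciN; missing_adapters is a hypothetical helper and the optional directory argument exists only so it can be tried against a test directory.

```shell
# Hypothetical helper: print every expected adapter (hci0-hci3) that is absent.
# $1 is an optional base directory; it defaults to the sysfs Bluetooth class.
missing_adapters() {
  base="${1:-/sys/class/bluetooth}"
  for i in 0 1 2 3; do
    [ -e "$base/hci$i" ] || echo "hci$i"
  done
}
```

Empty output means all four adapters are present. Any name printed points at a USB or BlueZ problem rather than a BatMON problem.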
timeout 20 mosquitto_sub -h 192.168.42.99 -t 'homeassistant/sensor/#' -v -R \
| grep -aEi 'FailureBat|state_topic|json_attributes_topic|unique_id|name'
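Shows Home Assistant discovery messages. Use this when HA entities look wrong even though BatMON samples are fresh. Because -R hides retained messages, drop it if the discovery configs are published as retained and nothing appears within the timeout.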
timeout 60 mosquitto_sub -h 192.168.42.99 -t '#' -v -R \
| grep -aEiv '^(zigbee2mqtt|homeassistant)/' \
| grep -aiE 'fail|bat|bms|jk|cell|soc|volt|curr'
-R skips retained messages and grep -a forces binary-looking input to be treated as text, so retained Zigbee2MQTT payloads do not turn the output into noise.
The point is not zero warnings forever. The point is fresh samples, automatic recovery and no long stale periods.
BmsSampl(...) appears for all four packs within a normal cycle.
A single "timeout waiting for" followed by a successful reconnect is annoying but not fatal.
disconnected after 300s means the BLE connection aged out or stalled.
Repeated org.bluez.Error.InProgress usually means a stuck BlueZ scan/connect operation.
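The "fresh samples for all four packs" rule can be checked mechanically. The sketch below reads journal lines in journalctl's short-unix output mode, where the first field is an epoch timestamp, and lists packs whose newest sample is too old or missing. stale_packs is a hypothetical helper, and the 90 second limit in the usage note is an assumption (three sample periods), not a value from the BatMON config.

```shell
# Hypothetical helper: list packs without a recent BmsSampl line.
# stdin: journal lines from `journalctl -o short-unix` (epoch is field 1)
# $1: current time in epoch seconds, $2: maximum allowed sample age in seconds
stale_packs() {
  awk -v now="$1" -v max_age="$2" '
    BEGIN { split("FailureBat-00 FailureBat-01 FailureBat-02 FailureBat-03", packs, " ") }
    /BmsSampl/ {
      # Remember the timestamp of the newest sample seen for each pack.
      for (i = 1; i <= 4; i++)
        if (index($0, packs[i] ":")) last[packs[i]] = $1
    }
    END {
      for (i = 1; i <= 4; i++) {
        p = packs[i]
        if (!(p in last)) {
          print p, "no-sample"
        } else if (now - last[p] > max_age) {
          print p, int(now - last[p]) "s"
        }
      }
    }
  '
}
```

Possible use: journalctl -u batmon.service --since "10 minutes ago" --no-pager -o short-unix | stale_packs "$(date +%s)" 90. Empty output means every pack reported within the limit.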
WARNING [bt] BMS FailureBat-02 disconnected after 300.9s!
ERROR [sampling] FailureBat-02 error (#1): timeout waiting for 2
INFO [sampling] connecting bms FailureBat-02
INFO [sampling] connected bms FailureBat-02!
INFO [sampling] FailureBat-02: BmsSampl(...)
Use the smallest reset that matches the failure. Rebooting the whole machine should be the last resort.
sudo systemctl restart batmon.service
Use when BatMON itself looks stale but BlueZ adapters still look normal.
sudo systemctl stop batmon.service
sudo btmgmt --index 2 power off
sleep 3
sudo btmgmt --index 2 power on
sleep 5
sudo systemctl start batmon.service
Example uses hci2. Change the index based on the adapter-to-pack map.
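To avoid hand-editing the index each time, the same sequence can be wrapped in a function that takes the adapter number. This is a sketch: reset_adapter and the BATMON_RUN variable are hypothetical names. By default it only prints the commands; setting BATMON_RUN=sudo runs them.

```shell
# Hypothetical helper: power-cycle one adapter around a BatMON stop/start.
# $1: adapter index (0-3). Dry run by default; BATMON_RUN=sudo executes.
reset_adapter() {
  idx="$1"
  case "$idx" in
    0|1|2|3) ;;
    *) echo "usage: reset_adapter <0-3>" >&2; return 2 ;;
  esac
  run="${BATMON_RUN:-echo}"   # "echo" prints each command instead of running it
  $run systemctl stop batmon.service
  $run btmgmt --index "$idx" power off
  $run sleep 3
  $run btmgmt --index "$idx" power on
  $run sleep 5
  $run systemctl start batmon.service
}
```

Running reset_adapter 1 prints the six commands for hci1 so they can be reviewed first; BATMON_RUN=sudo reset_adapter 1 performs the reset.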
sudo systemctl stop batmon.service
for i in 0 1 2 3; do sudo btmgmt --index "$i" power off || true; done
sleep 3
for i in 0 1 2 3; do sudo btmgmt --index "$i" power on || true; done
sleep 5
sudo systemctl start batmon.service
This is the reboot replacement when several packs are stale or BlueZ is clearly wedged.
sudo systemctl stop batmon.service
sudo systemctl restart bluetooth.service
sleep 5
sudo systemctl start batmon.service
Use if adapter power cycling is not enough.
Keep the logic in /home/failurep/batmon-ha/. Systemd only points to it.
mkdir -p /home/failurep/batmon-ha/scripts
sudo tee /home/failurep/batmon-ha/scripts/batmon-ble-watchdog.sh >/dev/null <<'EOF'
#!/usr/bin/env bash
set -Eeuo pipefail
SERVICE="batmon.service"
COOLDOWN_FILE="/run/batmon-ble-watchdog.last"
COOLDOWN_SECONDS=600
ERROR_WINDOW="5 minutes ago"
ERROR_PATTERN='org\.bluez\.Error\.InProgress|Service Discovery has not been performed yet|device .* not found|timeout waiting for|disconnected after|BleakDBusError'
now="$(date +%s)"
if [[ -f "$COOLDOWN_FILE" ]]; then
last="$(cat "$COOLDOWN_FILE" 2>/dev/null || echo 0)"
if (( now - last < COOLDOWN_SECONDS )); then
exit 0
fi
fi
errors="$(journalctl -u "$SERVICE" --since "$ERROR_WINDOW" --no-pager -o cat | grep -Eci "$ERROR_PATTERN" || true)"
if (( errors < 3 )); then
exit 0
fi
echo "$now" > "$COOLDOWN_FILE"
logger -t batmon-ble-watchdog "Detected $errors BatMON/BLE errors; resetting BatMON and hci0-hci3"
systemctl stop "$SERVICE" || true
sleep 2
for i in 0 1 2 3; do btmgmt --index "$i" power off || true; done
sleep 3
for i in 0 1 2 3; do btmgmt --index "$i" power on || true; done
sleep 5
systemctl start "$SERVICE" || true
logger -t batmon-ble-watchdog "BatMON BLE reset completed"
EOF
sudo chmod +x /home/failurep/batmon-ha/scripts/batmon-ble-watchdog.sh
sudo tee /etc/systemd/system/batmon-ble-watchdog.service >/dev/null <<'EOF'
[Unit]
Description=BatMON BLE watchdog resetter
[Service]
Type=oneshot
ExecStart=/home/failurep/batmon-ha/scripts/batmon-ble-watchdog.sh
EOF
sudo tee /etc/systemd/system/batmon-ble-watchdog.timer >/dev/null <<'EOF'
[Unit]
Description=Run BatMON BLE watchdog every minute
[Timer]
OnBootSec=2min
OnUnitActiveSec=1min
AccuracySec=10s
[Install]
WantedBy=timers.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now batmon-ble-watchdog.timer
systemctl status batmon-ble-watchdog.timer --no-pager
journalctl -t batmon-ble-watchdog -n 80 --no-pager
systemctl list-timers | grep batmon
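Before trusting the watchdog, it is worth checking what its pattern would count on a real log excerpt. The sketch below reuses the ERROR_PATTERN and the threshold of 3 from the watchdog script; would_trigger_reset is a hypothetical helper and it ignores the cooldown file, so it only answers "would the error count be high enough".

```shell
# Same pattern as the watchdog script.
ERROR_PATTERN='org\.bluez\.Error\.InProgress|Service Discovery has not been performed yet|device .* not found|timeout waiting for|disconnected after|BleakDBusError'

# Hypothetical helper: count matching lines on stdin and report whether the
# watchdog threshold (3 or more) would be reached. Cooldown is not considered.
would_trigger_reset() {
  count="$(grep -Eci "$ERROR_PATTERN" || true)"
  if [ "$count" -ge 3 ]; then
    echo "reset ($count errors)"
  else
    echo "ok ($count errors)"
  fi
}
```

Possible use: journalctl -u batmon.service --since "5 minutes ago" --no-pager -o cat | would_trigger_reset. Note that a normal disconnect plus one timeout already counts as two matches, so a single extra error inside the window reaches the threshold.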
This is the current stability-oriented BatMON setup for the FailureGrid™ pack bridge.
# Each pack has a dedicated adapter.
# Example structure only; keep private addresses in batmon.yaml.
- alias: "FailureBat-03"
  adapter: "hci0"
  keep_alive: false
- alias: "FailureBat-02"
  adapter: "hci1"
  keep_alive: false
- alias: "FailureBat-01"
  adapter: "hci2"
  keep_alive: false
- alias: "FailureBat-00"
  adapter: "hci3"
  keep_alive: false
sample_period: 30    # seconds between reads
publish_period: 10   # seconds between MQTT publishes
sample_period: 2 was too aggressive for long-term BlueZ/Bleak stability. The slower cadence is still fast enough for pack telemetry.
grep -nEi 'keep_alive|sample_period|publish_period|FailureBat|adapter|hci' /home/failurep/batmon-ha/batmon.yaml
cp /home/failurep/batmon-ha/batmon.yaml \
/home/failurep/batmon-ha/batmon.yaml.bak-$(date +%F-%H%M%S)
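The dedicated-adapter layout only helps if no adapter is assigned twice, which is easy to break when copying a device entry. This is a sketch of a quick check, assuming each device entry carries a line of the form adapter: "hciN" as in the example structure; duplicate_adapters is a hypothetical helper and it does not parse YAML properly.

```shell
# Hypothetical helper: print adapter values that appear more than once.
# stdin: the BatMON YAML config. Line-based matching only, not a YAML parser.
duplicate_adapters() {
  awk '
    /^[ \t-]*adapter:/ {
      value = $0
      sub(/^[^:]*:[ \t]*/, "", value)   # drop the key and the colon
      sub(/[ \t]*#.*$/, "", value)      # drop a trailing comment
      sub(/[ \t]+$/, "", value)         # drop trailing whitespace
      gsub(/"/, "", value)              # drop double quotes
      seen[value]++
    }
    END { for (a in seen) if (seen[a] > 1) print a }
  ' | sort
}
```

Possible use: duplicate_adapters < /home/failurep/batmon-ha/batmon.yaml. Empty output means every pack has its own adapter.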
BatMON should restart itself if it exits, and the watchdog should handle BLE wedging.
sudo mkdir -p /etc/systemd/system/batmon.service.d
sudo tee /etc/systemd/system/batmon.service.d/restart.conf >/dev/null <<'EOF'
[Service]
Restart=always
RestartSec=10
EOF
sudo systemctl daemon-reload
sudo systemctl restart batmon.service
systemctl cat batmon.service
Expected runtime path: /home/failurep/batmon-ha/main.py -c /home/failurep/batmon-ha/batmon.yaml.
ps aux | grep -Ei 'batmon|main.py|python' | grep -v grep
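systemctl cat shows the drop-in file, but not whether systemd has loaded it. The sketch below checks the effective values instead, reading the Restart and RestartUSec properties that systemctl show prints; check_restart_policy is a hypothetical helper.

```shell
# Hypothetical helper: verify the effective restart policy.
# stdin: output of `systemctl show batmon.service -p Restart -p RestartUSec`
# Exit status 0 when Restart=always, 1 otherwise.
check_restart_policy() {
  awk -F= '
    $1 == "Restart"     { restart = $2 }
    $1 == "RestartUSec" { delay = $2 }
    END {
      if (restart == "always") {
        print "ok: Restart=always RestartUSec=" delay
        exit 0
      }
      print "drop-in not active: Restart=" restart
      exit 1
    }
  '
}
```

Possible use: systemctl show batmon.service -p Restart -p RestartUSec | check_restart_policy. If it reports the drop-in as not active, rerun sudo systemctl daemon-reload and check again.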
FailureGrid™ should keep pack telemetry boring. When it is not boring, follow this order.
systemctl status batmon.service.