Urgent Help MSR stopped running

toggledbits

@cw-kid it's the folder called storage in your Reactor folder.

cw-kid

@toggledbits

Is that the folder with the rulesets in it ?
I have a backup of all the folders and files, but it's not very recent.

Hope I am not going to lose the more recent changes to the rules.

cw-kid

Looks like it is

So they are all corrupted ?

There are hundreds of rules on those folders

cw-kid

Looking in the current storage sub folders the "states" folder has a lot of the files in there are 0kb. But in the other folders like "rules" "rulesets" and "reactions" they still have data it seems and have kb values.

cw-kid

I just deleted all the files in the states folder on the Linux box there was 170 odd files. I then uploaded the files from the backup I had, there were 130 odd files. MSR has now started and loaded in the browser. But I am unsure what I have lost.

toggledbits

That's good sleuthing and good work. Yes, the states folder gets most of the I/O. If it looks like everything else is intact, that's fine. Rules and rule sets would only be damaged if you happened to edit them during the time the disk was full.

On the Reactor master device, there are flags for warning you when disk space is becoming limited; rules with notifications may be a good idea, to avoid future unpleasant surprises.

cw-kid

I am assuming I have had a lucky escape here then, all my rules look to be present still, I am assuming "States" is just that as the name suggests the current state of the rules and that they will rebuild and repopulate themselves.

I didn't have any auto backup in place as not really sure how to do that in Linux command line, so I was just now and again manually copying the entire contents of the /reactor folder down on to my PC.

toggledbits

All correct.

cw-kid

In Entities under Reactor System I have all this information

reactor_system.alert_count=0
reactor_system.alert_last=1647657325518
reactor_system.alert_severity=null
reactor_system.alerts=[]
reactor_system.arch="x64"
reactor_system.hostname="HP-Thin01"
reactor_system.internet_ok=null
reactor_system.platform="linux"
reactor_system.reactor_memory_used=138702848
reactor_system.reactor_uptime=1082
reactor_system.system_load=[0.13,0.15,0.11]
reactor_system.system_memory_free=2020356096
reactor_system.system_memory_size=3537993728
reactor_system.system_uptime=17002.98
reactor_system.volume_critical_reactor_base=false
reactor_system.volume_critical_reactor_data=false
reactor_system.volume_critical_reactor_logs=false
reactor_system.volume_reactor_base=[5390756,13825496,0.389,"/home/stuart/reactor"]
reactor_system.volume_reactor_data=[5390756,13825496,0.389,"/home/stuart/reactor/storage"]
reactor_system.volume_reactor_logs=[5390756,13825496,0.389,"/home/stuart/reactor/logs"]
sys_system.state=true
Capabilities: reactor_system, sys_system
Actions: reactor_system.clear_alert, reactor_system.clear_alerts, sys_system.restart

Memory free is that disk or RAM ?

toggledbits

SystemController - Reactor - Multi-Hub Automation

cw-kid

Ok thanks I will read that tomorrow

Also have to figure out why my Raspberry Pi has gone tits up, that's still totally down, but only InfluxDB and Grafana on there so not the end of the world.

cw-kid

Patrick would it be possible for the system to detect if any state files have 0kb values and just delete them automatically ?

toggledbits

Losing state will cause rules to possibly run unexpectedly. For example, say you had a rule that triggers between sunrise and sunset. If state was lost due to disk full corruption such as this in mid-day, discarding the state at restart will cause the rule to have no state, therefore think it has a transition into the sunrise (triggered) period and fire. This may or may not be a desirable side-effect, depending entirely on how the user determines the rules to work, and I cannot predetermine whether re-running a rule would have side-effects in the user environment or not. Since this can occur unattended (such as while on vacation or at a remote vacation home), it's potentially very troublesome and should at least be corrected with the full awareness of the user, as you have in this case.

My number one recommendation is that you manage filesystem space well on the system. This is always an issue on all systems, and in Linux/Unix systems in particular, processes can die (due to errors) and files can become (logically) corrupt due to disk full conditions. It is something that needs to be managed. In your case, I would remember what log files were deleted, and investigate why those were so large and not being rotated and expired or archived. In your case, it appears you have a single root volume with everything on it. That means any subsystem running on the box can potentially fill the filesystem and cause other applications to misbehave. Best practices for system management often call for segmenting files/directories out. It is not uncommon, for example, to have /var, /usr and /home all on separate filesystems, that any one of them filling (/var often has this issue because it holds system logs and other fast-changing files) would not result in other files in the other directories to become corrupt or truncated. Taking this to an extreme, but if hardening Reactor is mission-critical to you, it is possible to create a small volume and mount it as /var/reactor at boot time, and use that as the home for the storage directory (and config, but not Reactor's logs), thereby isolating and protecting Reactor's storage from everything else. This is all Linux system management stuff, so not really appropriate to deep dive into, but if you're going to use and maintain these systems as part of your infrastructure, well worth spending the time to learn. A must really, because when it goes wrong, and it will go wrong, your fluency in system management will directly determine the time it takes to recover, and how well (or not) you recover.

I can take more mitigating steps to harden startup, and I will definitely do that. I will also see about adding Status page alerts for disk space problems. The way Reactor works, if the disk space problem is mitigated before a restart of Reactor (i.e. while Reactor is running), Reactor will (eventually) rewrite the files with state (which is cached in RAM during operation) -- at shutdown, Reactor writes the RAM state back to disk as a final assurance that they are in sync, just for this reason. But I cannot protect those files from every eventuality, and every other subsystem on the machine works pretty much the same way (rebooting after fixing a no-space condition is always recommended, because many daemons will just die when they can't write a file).

And as I've said again and again and again, please read the documentation and look at your log files. If you see something in the log files that points to obvious issue, handle it. If you don't know how to handle it, post the log file snippet and remember that context is vital, so posting 2-3 lines of a log file may provide little or no useful information (even when the error contains module names and line numbers); it often takes a dozen or more lines of context prior to really interpret how the system got to the crash point, so you must post a least a dozen or more lines prior to any error message you are inquiring about. Fortunately in this case, it was pretty obvious.

Pardon errors. Tapping this out on a bluetooth keyboard and over my home VPN on sketchy Internet made this a bit of a chore. Onward.

Unofficial thread for compatibility

[Reactor] Variables not updating correctly in latest-25201-2aa18550

Gradually turn on lights.

Reactor (Multi-System/Multi-Hub) Announcements

Can't customize home page

Stop the MSR by an external switch on Hubitat.

Error After Upgrade

Reset attribute value of entity in event handler

Need help figuring out how to delay a reset on reaction

Way to search for rules (rule state) in other rules

Links to MSR from HA

Set Reaction > Script Action

Errors after updating to MQTTController build 25139

🎉 My very first MSR controller: OpenSprinkler

Advice reqeusted to migrate MSR from Bare Metal to Container

Can´t restart or upgrade/deploy MSR

[Solved] Limit HA Entity in MSR

Organizing/ structuring rule sets and rules

Moving MSR from a QNAP container to RP 5 - some issues

Widget deletion does not work and landing page (status) is empy

Urgent Help MSR stopped running

Recent Topics