Reply To: Controller lockups / crashes with wired Ethernet module
Wendell
John,
I think you are correct that rebooting isn’t a perfect solution (although it’s better than a zone getting stuck “on” forever). In theory you could restore the unit to its last known state by periodically writing the current status to NVRAM and reading it back on each reboot, but in my experience this can lead to other problems. And if the lock-ups are caused by a buffer overrun corrupting the program or data space, you really wouldn’t want to restart from the last known state anyway.
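To make the NVRAM idea a bit more concrete, here is a rough sketch of what a checkpoint record could look like. It is written in Python purely to show the logic; it is not OpenSprinkler firmware. The record layout, the `MAGIC` value, and the function names are all my own assumptions. The record carries a CRC, so a damaged record is rejected on reboot and the unit falls back to "all zones off".

```python
import struct
import zlib

# Everything below is hypothetical: layout, names, and constants are
# illustration only, not taken from the OpenSprinkler firmware.
MAGIC = 0x4F53                    # assumed marker identifying a checkpoint record
_RECORD = struct.Struct("<HBI")   # magic, zone bitmask (8 zones), seconds remaining
_CRC = struct.Struct("<I")        # CRC-32 of the packed record
SAFE_STATE = (0, 0)               # all zones off, no time remaining


def encode_checkpoint(zone_mask, seconds_left):
    """Pack the current status into bytes suitable for writing to NVRAM."""
    payload = _RECORD.pack(MAGIC, zone_mask, seconds_left)
    return payload + _CRC.pack(zlib.crc32(payload))


def decode_checkpoint(blob):
    """Return (zone_mask, seconds_left), or SAFE_STATE if the record is invalid."""
    if len(blob) != _RECORD.size + _CRC.size:
        return SAFE_STATE
    payload = blob[:_RECORD.size]
    (stored_crc,) = _CRC.unpack(blob[_RECORD.size:])
    if zlib.crc32(payload) != stored_crc:
        return SAFE_STATE         # record was damaged after it was written
    magic, zone_mask, seconds_left = _RECORD.unpack(payload)
    if magic != MAGIC:
        return SAFE_STATE
    return (zone_mask, seconds_left)
```

Note the limitation: the CRC only catches damage to the stored record. If the status in RAM was already corrupted when the checkpoint was written, the record will validate and the bad state will be restored, which is exactly the case I would worry about.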
The variety of symptoms I’ve seen really makes me think that the crashes aren’t just one portion of the system (e.g. the Ethernet module) locking up, and if I’m correct, it would be nearly impossible to detect the crash condition and reliably reboot from it. Years ago I experienced an issue where an embedded controller was doing really strange things at random times. It occurred on only a small percentage of the units we made, but when the crashes happened, there was no pattern to them at all. In our case it was a hardware bug in an Atmel CPU chip that was causing it to literally execute code at random locations… pretty much the worst-case nightmare a programmer can run into. I suspect that the problem we’re seeing with the OS controller isn’t nearly this insidious, but until someone can figure out the actual cause of the crashes (versus simply detecting when they have occurred and trying to reboot), I don’t think there will be a good solution to the problem.
If Ray can pin down the actual cause of the problem then I agree that a software patch will likely solve all of the problems, but simply trying to detect when a problem has already occurred isn’t likely to be a viable solution. I’ll freely admit that I don’t know the architecture of the OS system (i.e. is everything running on the main CPU, or is there an additional microcontroller in the Ethernet module?), so my hypothesis could be off because I don’t understand how the various system components relate to one another. But if the entirety of the code is running on one CPU, the varied symptoms I’m seeing suggest that there is some type of widespread corruption of data at play.
Regardless of whether the root cause can be found and corrected, I would rather be running a system that handles the TCP/IP stack in hardware, since it should result in more reliable network operations and free up the CPU to run only the application itself (potentially making it more responsive to user inputs). Implementing the W5500 chip sounds like a really good idea to me!