Tagged: Controller lock up hang crash
July 8, 2020 at 2:00 pm #67213
The script I wrote is here:
Thanks for sending that. If I have the controller set to a different port, I would need to enter ip address:port# as the address, correct? The script won’t let me add a port. Is that easy to change in the script? Or is it better to set my controller to port 80 for the time being?
Thanks!
July 8, 2020 at 3:26 pm #67215
I just updated the script to support a custom port. Refresh and you should see it.
July 9, 2020 at 8:38 am #67225
*Update*: check this post for the up-to-date information: https://opensprinkler.com/forums/topic/instructions-for-testing-os-3-2-with-w5500-ethernet-module
The adapter PCB for W5500 has arrived. If you want to give it a try, please submit a support ticket at:
and let me know your shipping address so we can send you one. Please note that:
1) This is just an adapter. It does NOT include the W5500 module itself, which you can buy on Amazon (for example here: https://www.amazon.com/ARCELI-Ethernet-Network-Hardware-Microcontroller/dp/B07JLFN3T1), on eBay, or at your favorite online store. The module has a 2×5 pin header.
2) There is no 3D printed enclosure for it yet (though I am in the process of designing it).
3) This would only work on OS 3.2 with wired Ethernet module cable. It won’t work on OS 2.3 because OS 2.3 has ENC28J60 built-in so it’s not a replaceable module.
4) Using W5500 requires uploading a different firmware (which I will make available shortly). While the firmware source code is almost identical to the current firmware, it uses a different library (Ethernet2 instead of UIPEthernet). At the moment there is no easy way to link both libraries into the compiled code and switch between them dynamically. As a result, using W5500 requires uploading firmware that’s compiled specifically for it.
Attached is a picture of what it looks like with the adapter.
July 9, 2020 at 9:34 am #67228
Ray, that’s great! I just sent a new support ticket requesting one of the adaptor boards.
When you’re done designing the enclosure, will you post the file online for those of us who want to 3D print our own enclosure?
Thanks!
July 9, 2020 at 9:51 am #67229
I’ve sent a support ticket requesting one as well.
Thanks
July 10, 2020 at 2:54 am #67240
just to let you know, I experience the same problem on my OS 3.2.
I have to reboot it every 1-2 days. Otherwise connection is impossible.
I’m running firmware 2.19(2).
Rik
July 10, 2020 at 7:32 am #67243
My understanding, based on the discussions and online searches, is that:
1) The hanging problem is dependent on the network or more specifically, other devices co-existing on the network. On some networks, this does not happen, on others it does. It’s not easy to reproduce and it may take many hours or even days for the symptoms to occur. I haven’t found a reliable way to trigger the symptom to happen quickly, so debugging is hard.
2) The issue does not seem to be due to the Ethernet module itself. Instead, it seems to be fundamentally due to the UIPEthernet library (https://github.com/UIPEthernet/UIPEthernet). Therefore I am pretty sure it’s a software problem. We haven’t been using this library for very long, so we are still trying to understand its issues. Prior to UIPEthernet, we used the EtherCard library (https://github.com/njh/EtherCard) for many years, and it worked quite reliably with no hanging issues. However, the biggest drawback of the EtherCard library is that it’s incompatible with Arduino’s Ethernet library, which made the firmware code a lot messier. While we could switch back to EtherCard, that should only be a backup plan if everything else fails.
3) The current wired Ethernet module we use is ENC28J60, which requires software TCP/IP stack (implemented by UIPEthernet and EtherCard libraries). As discussed above, we’ve started testing an alternative Ethernet module: W5500, which has hardware-integrated TCP/IP stack therefore should be much more reliable than ENC28J60. If you want to give it a try, you can send a support ticket (as described above) to request an adapter, which can convert W5500 header layout to ENC28J60 layout, therefore can be directly plugged into OS 3.2’s Ethernet connector. Since we do not have W5500 modules in stock, you do need to buy one yourself, and also update your firmware to a version specifically compiled for W5500 (will be posted shortly).
4) Given that the issue is network dependent, another possible work-around is to use a separate router (call it the secondary router), such as a spare router you may have, or buy a cheap router like this one (https://www.amazon.com/TP-Link-N300-Wireless-Wi-Fi-Router-TL-WR841N/dp/B001FWYGJS). Have your OS as the only device plugged into the secondary router, and connect its WAN port to one of your primary router’s network ports. You can configure the secondary router to disable its WiFi (i.e. Ethernet only), and set up a port forwarding record so that you can access your OS from the primary network. The idea is to isolate OS from your primary router so it’s not affected by other devices, but you can still access it through port forwarding on the secondary router. I know this is cumbersome and not meant as a permanent solution, but it’s a useful experiment to try. In fact, if your router supports VLAN (virtual lan), you can make use of that to avoid the hassle of getting a secondary router.
5) Keep in mind that OS 3.2 has built-in WiFi — if wired Ethernet is not essential to you, I suggest that you keep the controller in WiFi mode until we figure out the issue with the wired Ethernet.
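Since the lockup in 1) can take hours or days to show up, it may help to log exactly when the controller stops answering web requests. Below is a minimal Python sketch of such a watcher. It is not the test script mentioned earlier in the thread, and the address `192.168.1.50` in the usage comment is a made-up placeholder; the function names are only for illustration.

```python
import http.client
import time
import urllib.error
import urllib.request


def is_responding(url, timeout=5.0):
    """Return True if an HTTP GET to `url` returns status 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, reset, malformed reply: treat all as "not responding".
        return False


def watch(probe, attempts, interval=60.0, sleep=time.sleep, clock=time.time):
    """Call `probe()` up to `attempts` times, waiting `interval` seconds between calls.

    Returns the timestamp of the first failed probe, or None if every probe succeeded.
    """
    for i in range(attempts):
        if not probe():
            return clock()
        if i < attempts - 1:
            sleep(interval)
    return None


# Example usage (hypothetical address; use your controller's IP and port):
#   url = "http://192.168.1.50:80/"
#   failed_at = watch(lambda: is_responding(url), attempts=24 * 60)
#   print("no failures" if failed_at is None else time.ctime(failed_at))
```

Running this from a computer on the primary network and another copy against a controller behind the secondary router would give a rough comparison of how long each setup stays reachable.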
In any case, the current situation is that I am still debugging UIPEthernet library to see if we can fix the issue in software. At the same time we are planning to transition to W5500 modules once it has been tested out. In the meantime, you can try out the work-around in bullet 4) above to see if it addresses the problem.
July 10, 2020 at 8:28 am #67246
Rik, in addition to what Ray posted, if you are able to use WiFi instead of the ethernet module, you may be able to avoid the crashes. My system has been running via WiFi for several days now without crashes. I *do* have a lot of issues with not being able to connect to the controller, but the controller isn’t actually crashing, and whatever is causing it to timeout while trying to connect eventually resolves itself without my intervention. It’s possible that my timeout issues are related to WiFi signal strength in my garage, so you may not even have to deal with this inconvenience in your setup.
Also, if your controller is connected to a switch that supports VLANs, you may be able to solve the problem without the hassle of setting up a secondary router as Ray outlined in #4 above. Another user reported that putting the controller on its own private VLAN solved the problem for him. I haven’t tried this yet myself because my controller is connected to a non-VLAN capable switch that’s downstream from my main switch that does support VLANs.
July 11, 2020 at 8:12 am #67254
So I made some small progress in debugging, or at least found something that is likely correlated with the hanging issue. Inspired by a work-around here: https://github.com/ntruchsess/arduino_uip/issues/167 I started checking the values of two registers on ENC28J60, specifically the ESTAT and EIR registers. These two contain certain bits (specifically buffer overflow flag ESTAT.BUFFER, and receive error flag EIR.RXERIF) that the work-around uses to flag a hanging state and reboot the microcontroller. But here are my findings: I put two OS 3.2 units with Ethernet modules on my network, both running exactly the same code:
– on the one that’s connected to my main router, these two bits are flagged shortly after the controller starts. No hanging yet, but assume that eventually that may happen.
– on the one that’s connected to my secondary router (as I described above, in order to isolate the OS from the rest of my main network), these two bits remain 0 since I started the experiment two days ago, and the OS has been running fine with no hanging since then.
I suspect that some device or maybe the main router itself is constantly sending broadcast messages of some sort which quickly led to an erroneous state. Of course this doesn’t mean that the controller will hang immediately, but if these bits are not cleared, they may lead to a hanging state eventually.
This should be fixable in software by modifying the UIPEthernet library. The reason is that I also tested the EtherCard library, which we had been using for a long time prior to UIPEthernet. When using EtherCard, these two bits remain 0 on both of the test controllers. This seems to be strong evidence that the two bits are related to the hanging issue.
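For anyone who wants to interpret raw register dumps the same way, here is a small Python sketch that decodes the two bits discussed above from ESTAT and EIR byte values. The bit positions (0x40 for the buffer error bit in ESTAT, 0x01 for RXERIF in EIR) are what I understand from the ENC28J60 datasheet and the library headers, so treat them as an assumption and verify them before relying on this. How you get the raw bytes out of the firmware (for example, a debug print) is left open.

```python
# Assumed bit masks (check against the ENC28J60 datasheet / your library's headers).
ESTAT_BUFFER = 0x40  # ESTAT bit 6: buffer error status
EIR_RXERIF = 0x01    # EIR bit 0: receive error interrupt flag


def decode_flags(estat, eir):
    """Return which of the two error bits are set in the given raw register bytes."""
    return {
        "buffer_error": bool(estat & ESTAT_BUFFER),
        "rx_error": bool(eir & EIR_RXERIF),
    }


# Example: decode_flags(0x41, 0x01) reports both bits set,
# while decode_flags(0x01, 0x40) reports both bits clear.
```

Logging these two booleans once a minute alongside a timestamp would show how soon after boot the bits get set on a given network.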
So we will keep digging and maybe reach out to the author of UIPEthernet to see if he can help. The bottom line is that I don’t think there is anything fundamentally wrong with ENC28J60 at the hardware level; it’s a chip that has been around for a very long time. So I am pretty confident that the issue can be resolved by a firmware update.
July 12, 2020 at 10:05 am #67268
I recently purchased OS V3 and have spent two days figuring out how to configure it. I have a 16-station expander. Also, about 8 stations are configured as remote (to an OSPi with 24 stations). (Order 67055). I use the Ethernet dongle, not WiFi.
Question: I note at the bottom of the root display, there is a red bar saying ‘configured as extender’. What did I do to cause this message? What does it mean?
More importantly, in two days I have had two crashes that required me to cycle power. When the controller is unresponsive to http, ping does not work either.
I’m not much help in providing diagnostic info, but this hardware/software combination seems to have a problem.
This seems to be the same problem as described above.
July 12, 2020 at 2:36 pm #67276
“Question: I note at the bottom of the root display, there is a red bar saying ‘configured as extender’. What did I do to cause this message? What does it mean?”
That message means the controller has been configured in ‘remote extender’ mode. This happens when you set a zone on the master controller to point to a remote controller: the UI will automatically configure the remote controller in ‘remote extender’ mode. If you want to remove that mode, just click on that bar and it will prompt you to disable extender mode.
July 12, 2020 at 3:20 pm #67281
I’ve started a new thread with W5500 instructions:
including where to download the experimental firmware.
July 15, 2020 at 12:18 am #67333
So I found something today which I think is really interesting: if you took a look at my post above (https://opensprinkler.com/forums/topic/controller-lockups-crashes/page/2/#post-67254), I suspect that the ENC28J60 register values: buffer overflow error flag ESTAT.BUFFER and receive error flag EIR.RXERIF, are indicators that can tell if the controller is in an erroneous state which after some time of running will eventually lead to lockup. I will call the state when these two bits are 0 as ‘clean state’, and the state when these two are 1 as ‘corrupted state’.
My experiment was to find out what causes the corrupted state to happen in the first place. I know that when I use a secondary router to isolate OS from the rest of my primary WiFi network, it’s always in clean state (or at least during the testing started a few days ago, it has always been in the clean state). On the other hand, if OS is connected to my primary router, it goes into corrupted state shortly after booting.
So I started by unplugging / turning off all WiFi devices leaving only OS on the primary router. Sure enough it stays in clean state. Then I turned each WiFi device back on, one after another, restarting OS each time to observe if and when it goes into the corrupted state. Interestingly, most devices don’t cause any trouble, except my two MacBooks, and a Linux computer — as soon as I turn on WiFi on these computers, the test OS goes into corrupted state.
What’s interesting is that I have two other Linux computers that don’t cause this symptom. Comparing what is installed on these computers led me to quickly discover that it’s Dropbox that makes the difference. This can be reliably reproduced: if I quit Dropbox, then reboot OS, it doesn’t go into corrupted state; if I leave Dropbox on, OS goes into corrupted state shortly after booting.
Using Wireshark, I saw that Dropbox sends a lot of Dropbox LAN sync discovery protocol (DB-LSP-DISC) packets. I have a strong feeling that this is the root cause of the problem. Although I haven’t found a way to address the issue yet, this at least gave me a way to reliably reproduce the corrupted state, which I highly suspect will eventually lead to the controller locking up. Apparently there is a way to turn off the sync protocol in Dropbox so that’s what I am going to try next.
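If you don’t have Wireshark handy, a rough way to check whether your own network carries this traffic is to listen for the discovery datagrams directly. The Python sketch below counts UDP datagrams per sender on a given port. It assumes the Dropbox LAN sync discovery broadcasts use UDP port 17500, which is my understanding but is not stated in this thread, and it should be run on a machine where Dropbox itself is not running so the port is free. The function names are just for illustration.

```python
import socket
import time


def open_listener(port, host=""):
    """Open a UDP socket bound to `port` (all interfaces by default)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    return sock


def count_datagrams(sock, duration):
    """Count datagrams received on `sock` per sender IP for `duration` seconds."""
    counts = {}
    deadline = time.monotonic() + duration
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        sock.settimeout(remaining)
        try:
            _, (sender, _port) = sock.recvfrom(65535)
        except socket.timeout:
            break
        counts[sender] = counts.get(sender, 0) + 1
    return counts


# Example usage (port 17500 is an assumption, see above):
#   sock = open_listener(17500)
#   print(count_datagrams(sock, 60.0))   # e.g. senders and how many datagrams each sent
```

A sender that shows up with a steady stream of datagrams is a candidate for the kind of broadcast source described above.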
Anyways, I’m still digging into the issue and trying to figure out a solution, but at least feeling a bit closer.
July 15, 2020 at 3:26 pm #67342
Ray, this is definitely interesting, but ultimately we need to figure out why the buffer overflow / Receive Error Flag causes the system to become unstable. Based on the seemingly random results I see when things go south, I am strongly suspecting that there’s a memory corruption problem. I’m wondering if the buffer overflow condition is allowing a memory overwrite to occur? Have you looked at the input buffer to see if it’s correctly declared? Does the code that writes bytes into the buffer check to see if the buffer has enough room for the current batch of bytes? If so, could there be a boundary condition error that’s causing a memory overwrite? I haven’t even attempted to look at the code in question, so I have no idea what it looks like, but based on my prior experience, I’m suspecting that a memory overwrite may be corrupting adjacent variables / code.
July 15, 2020 at 3:44 pm #67346
Note that ‘buffer overflow flag’ refers to the buffer on the ENC28J60 module; it has nothing to do with the microcontroller’s memory. I doubt there is any memory corruption issue on the microcontroller, otherwise it would have behaved strangely in WiFi mode as well, or would show up even if I isolate OS from the primary network. I suspect it has something to do with the UIPEthernet library not handling certain conditions correctly, like not clearing register bits or not handling conditions that arise when there are too many broadcast messages and so on.
What you mentioned about “look at the input buffer to see if it’s declared; check if buffer has enough room”: yes, of course, these are the basic steps anyone has to do when writing a C++ program. As I said, ‘buffer overflow’ is NOT referring to the microcontroller’s buffer; it refers to ENC28J60’s receive buffer (hence the receive error flag is always raised together with the buffer overflow flag). This is a hardware buffer, not allocated by the program, that exists at a fixed size on the Ethernet chip. It’s not something that we can declare the size for.
July 15, 2020 at 3:54 pm #67350
Also, as I said I have never been able to reproduce the symptoms you reported: when my test controller locks up, it still responds to button clicks, runs programs fine, displays time correctly, it just locks up w.r.t. web requests. I’ve now gone through three OS 3.0s and one OS 2.3, run my test script repeatedly on them, with IFTTT notifications enabled. I’ve never seen the random zone running problem you reported. So I highly doubt it’s a common problem with the firmware (otherwise I would have heard more reports from other users). If it’s a firmware problem, I unfortunately cannot reproduce it, and without seeing it happening I can’t debug it and find out what’s going on.
July 17, 2020 at 3:43 am #67379
I changed some Ethernet parameters in the ethernet interface implementation.
Can you please test it?
https://opensprinklershop.de/wp-content/uploads/2020/07/osefw2194_20200717.bin
July 17, 2020 at 8:47 am #67381
Thanks, Stefan. I will give it a try shortly.
I went through all computers on my network that have Dropbox installed, and turned off the ‘Lan sync’ flag. Since then the ENC28J60 register values on my test OSE have been in clean state (i.e. the two bits are 0) and I no longer observe the corrupted state. So it seems that, at least for me, the Dropbox ‘Lan sync’ is the culprit. In fact, I have further verified it by turning Lan sync back on, and almost instantly I observe the register bits get set to 1.
I think it’s because the current UIPEthernet library has trouble dealing with a large number of broadcast requests, so it is not able to clear the register bits promptly, which eventually leads to a lockup state. While we are still trying to modify the library to address this issue, other users who experience it can check whether Dropbox is installed on any computer on the network. If so, try turning off the ‘Lan sync’ flag (in Preferences -> Network), then reboot OS so it starts in a clean state. The Lan sync feature is meant to allow computers on the same local network to sync files between each other faster, even if there is no Internet connection, so it’s ok to turn it off.
July 19, 2020 at 10:06 am #67413
I too am having my OS drop off my hardwired network. I do not have Dropbox installed on my computer. I do have a number of
security cameras on my network. I leave a window in my browser (Firefox) showing my OS all of the time. At the moment I
have my OS in “disabled” state while I work on my sprinkler system. After a few days the OS window will display “network error”
at the bottom. If I try to refresh the screen the connection times out. A power cycle returns everything to normal.
Everything on my network is connected through a Dell PowerConnect 7048P switch. It is my understanding that with an Ethernet
switch devices that are not the destination of a packet do not see the packet. Any device attached to a switch will only
see traffic that is addressed to it. This is not like an Ethernet hub.
Thus my OS should not see the traffic generated by my security cameras. It should not matter if Dropbox is installed unless
it is scanning your local network for some reason.
An interesting feature of my switch is a feature that allows monitoring any port. It can direct a copy of all traffic on
a specific port to another designated port. Thus you could watch all traffic from any source that goes to the OS port
using something like WireShark. Running WireShark on a host computer with the intent to watch the OS traffic will only
capture the host computer traffic to the OS. You would miss any rogue communication. I have never used this feature
so I can’t say how well it works.
Ray: this is the OS that you just sent me (thanks). I got the Ethernet board that you recommended and wired it up with
no problems.
July 19, 2020 at 12:48 pm #67415
I have a managed switch that has the capability (port mirroring) described by @Water_my_lawn. I recently used this capability to find a problem where a device (not OS) would drop off the network occasionally.
I mirrored the port with the ‘bad’ device to another port. I plugged a PC into the mirror port and ran WireShark. As expected the mirrored port receives more than just packets with the device IP as the destination, most significantly broadcast packets and lots of them. After some experimentation I made a capture filter to remove most of the broadcast packet noise that was not likely to be creating the problem. In my case, the problem was caused by SNMP discovery broadcast packets that the ‘bad’ device did not handle correctly.
So, I think this method could be a good way to characterize the traffic on the OS port. E.g. Is the issue caused by a packet volume or a specific type of packet or something else?
Dave
July 22, 2020 at 3:24 pm #67469
Has anyone tested the firmware? And did it work?
I made some more changes to the timing and the ethernet buffer, here is the new version:
https://opensprinklershop.de/wp-content/uploads/2020/07/osefw2194_20200722.bin
July 22, 2020 at 3:34 pm #67470
I will try to test it tomorrow.
Today I had a lockup while a station was active and it kept on spraying…
Rik
July 23, 2020 at 7:46 am #67477
I am willing to try the new firmware but cannot get to the update page.
I enter http://<my IP>/update but I only get a JSON output: “Result: 2”.
In the past I would get a page that let me select the firmware I wanted to install.
I am running firmware: 2.1.9 (4). This was installed by Ray a few days ago.
My Ethernet link hangs after a day or so if I leave a browser pointing at it.
When the link is hung the buttons on the OS still work properly.
I am using the ENC28J60 Ethernet module.
How can I install the new firmware?
July 23, 2020 at 11:02 am #67483
Since you didn’t specify which version of OS you have, I assume you have OS 3.0. If you have OS 2.3, the update is done in a different way (through the USB port). Next, assuming you have OS 3.0, please note that a firmware update on OS 3.0 can only be performed in WiFi mode; it cannot be performed when the wired Ethernet module is plugged in. This is explained in the update instructions: