Tagged: Controller lock up hang crash
July 24, 2020 at 8:19 pm #67505
I have OS V3, OS V3 with one extension board and OSPi all running on my general wired network which spans three buildings.
V3 with one extension is my Master controller and it is running firmware: 2.1.9 (4).
I too experience the loss of connection to my Master over wired Ethernet, although the displayed time is correct (ticking) and, according to the log, the watering program executes correctly.
Initially, I thought V3 with just 8 channels did not have this problem. But I think it does. But it ran for three or four days before losing connection.
My Master will not run 24 hours before losing Ethernet. Does having more channels make this problem more acute?
It does not happen on OSPi.
I don’t have DropBox, but my W10 has all kinds of things running, looking for their friends, etc. So I expect my network has all kinds of broadcast things going on.
I have put in a support request for the W5500 adapter module. I will order one or two W5500 modules.
Sorry, not much help in finding a solution!
July 25, 2020 at 12:38 pm #67521
I have updated the firmware (I think but am not sure) to osefw2194_20200722.bin.
But the About page still shows firmware version 2.1.9(4), which is the old firmware version.
What should I see on the “About” page?
July 27, 2020 at 6:59 am #67555
I have been running for 3 days without a network error. In the past it would hang every day or so.
From this it seems that my software update was successful and that the fix is working.
I will keep running and report back.
July 28, 2020 at 5:50 am #67573
I have also applied Stefan’s firmware update today. Let’s see…
July 28, 2020 at 7:34 am #67578
I just checked and I have a “network error”. I checked the OS and it responds
to button pushes normally. So perhaps this experimental firmware is not a fix.
It ran 4 days before the hang which is longer than before.
July 28, 2020 at 9:24 am #67581
After 2 days of non-stop debugging, I think I am finally getting closer to the bottom of the issue. There are two main discoveries:
1) The UIPEthernet library for ENC28J60 does seem to have trouble when there is a constant influx of UDP broadcasts. On some networks, there isn’t much broadcast traffic, so it works fine; but on other networks, there is a lot of broadcast traffic, so eventually it goes into a corrupted state, which is the source of the hanging issue.
Stefan’s firmware (osefw2194_20200722.bin) uses a tweak of the UIPEthernet library that disables incoming broadcasts and that’s why it’s not prone to the issue. This probably explains why it has lasted much longer on your network. However, completely disabling UDP broadcasts has a downside, which is the second discovery below.
2) DHCP relies on UDP broadcasts, so if UDP broadcast is disabled, then DHCP renewal will fail, and that can lead to a stall. The reason is that the Ethernet.maintain() function, whose main job is to handle DHCP renewal requests, is being called at every loop iteration. When DHCP renewal fails, each call will stall for 60 seconds, but then when it comes back to the loop, it will call it again, which is another 60 seconds of non-responsiveness. The UIPEthernet library documentation never says how often Ethernet.maintain() should be called; it just says to call it at a regular interval. So the consequence of calling it at every loop iteration wasn’t immediately clear to me.
With these two discoveries, I’ve now modified the firmware and made firmware 2.1.9 revision(5). It has the following main changes:
A) Disable handling of UDP broadcasts most of the time and enable it only temporarily during DHCP events. This way, most of the time the firmware is not affected by an influx of DHCP requests.
B) Change the code to call Ethernet.maintain() only once per hour to process DHCP renewal requests. This way, even in the case of renewal failure, it won’t go into an infinite loop of stalling.
C) There are also a few other improvements, such as improving DNS functionality, cleaning up the send_http_request function, and, for OS 3.x, supporting LCD dimming or turning off the LCD when the controller is idle (to help preserve the lifespan of OLED displays).
p.s. after I did A) above, I accidentally discovered that this is also what the EtherCard library does: it only enables broadcast when processing DHCP. I am surprised that UIPEthernet doesn’t do this, but now I’ve added this feature.
D) If you are currently using Stefan’s firmware (osefw2194_20200722.bin), since it disables UDP broadcasts completely, you can either set a static IP on OpenSprinkler, or set a DHCP reservation on your router, and then reboot your OpenSprinkler. This way it won’t incur DHCP renewal requests, so it doesn’t need handling of broadcasts.
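Change B) above can be illustrated with a small rate limiter. This is only a sketch in Python, not the firmware's actual C++ code: the class name `MaintainThrottle`, the `poll` method, and the injected `clock` are hypothetical names made up for illustration, and the callable passed in stands in for Ethernet.maintain(). It assumes the address was already obtained at boot, so the first maintenance call waits a full interval.

```python
import time

MAINTAIN_INTERVAL_S = 3600  # once per hour, as described in change B)


class MaintainThrottle:
    """Run a possibly slow maintenance call at most once per interval."""

    def __init__(self, maintain, interval_s=MAINTAIN_INTERVAL_S, clock=time.monotonic):
        self._maintain = maintain      # stand-in for Ethernet.maintain()
        self._interval_s = interval_s
        self._clock = clock            # injectable so the logic can be tested
        self._last_run = clock()       # assumption: lease was obtained at boot

    def poll(self):
        """Call on every main-loop iteration; returns True if maintain ran."""
        now = self._clock()
        if now - self._last_run < self._interval_s:
            return False               # too soon, keep the loop responsive
        # Record the time before calling, so that a call which stalls
        # (e.g. a failed renewal) is still not retried until the next interval.
        self._last_run = now
        self._maintain()
        return True
```

With this shape, a failed renewal costs one stall per interval instead of one stall per loop iteration.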
In any case, I’ve uploaded the current version of firmware 2.1.9(5) to the experimental firmware folder:
– for OS 3.2, it’s at: http://raysfiles.com/os_compiled_firmware/v3.0/experimental/ (note: the rev5_enc28j60 file, not the w5500 file!)
– for OS 2.3, it’s at: http://raysfiles.com/os_compiled_firmware/v2.3/experimental/
Feel free to give it a try and see if it addresses the hanging/locking issue. I’ve tested it myself for about 2 days now, but obviously it needs longer-term testing.
Notes: this firmware is largely meant for controllers with ENC28J60 ethernet module, which includes OS 2.3, and OS 3.2 with ENC28J60 wired Ethernet module. You do NOT need to try this firmware if you use WiFi only, or if you are using the experimental W5500 Ethernet module (so far only a very small number of users are trying out W5500 as far as I am aware).
July 28, 2020 at 11:27 am #67585
I will try the new firmware and report back.
A humble request: please add 2 more digits of resolution to the flow meter count per gallon setting.
Now the limit is .01, it would be great if it were .0001.
Pete.
July 28, 2020 at 11:40 am #67586
About flow meter pulse rate: yes, we will change the resolution in the next firmware. For now, honestly, this is just a scaling factor. You can leave it at 1 for now. The result is simply the pulse count multiplied by this number. If you can’t set it to 0.0001, just set it to 1, and remember to scale the number you read by 0.0001.
July 28, 2020 at 12:54 pm #67587
Ray, I have installed your latest firmware from above. I will report back with the results.
July 28, 2020 at 4:55 pm #67592
I got a network error with the new firmware, so the problem is still there.
July 28, 2020 at 6:27 pm #67594
Can you provide some more details, like:
– does the controller respond to button presses?
– is the time displayed on the LCD correct?
– does it respond to ping?
Just seeing ‘network error’ does not necessarily mean it has lost response: the UI has a default timeout (3 seconds or something like that), and if it doesn’t get a response within 3 seconds, it will display network error. But it could be that it took 5 seconds for the response to come back. For that reason, it may be better for you to use the following API test script:
which does not rely on the UI, and it just pulls data directly from the controller. If you get a response, that means the controller is working fine.
Also, just to double check that the firmware is flashed correctly, go to the About page and see if it shows 2.1.9(5). Or, if you use the TestOSAPi script, check the /ja result and look for a variable named “fwm” and see if its value is 5.
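For readers without the test script at hand, the “fwm” check can be sketched in a few lines of Python. This is not the TestOSAPi script, just a rough stand-in: the function names `fetch_ja` and `find_key` are hypothetical, the `pw` query parameter (an MD5 hash of the device password) is an assumption about the HTTP API, and since the exact nesting of the /ja result is not given in this thread, the code simply searches the whole JSON tree for the key. The timeout is deliberately longer than the UI's few seconds.

```python
import json
import urllib.request


def find_key(node, key):
    """Depth-first search of nested dicts/lists; first value for `key`, or None."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def fetch_ja(host, pw_hash, timeout_s=10):
    """Fetch /ja directly from the controller, bypassing the web UI.

    Assumption: the API takes the password hash as the `pw` query parameter.
    """
    url = f"http://{host}/ja?pw={pw_hash}"
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.load(resp)


# Example use (host and hash are placeholders):
#   data = fetch_ja("192.168.1.50", "<md5 of your password>")
#   print("fwm =", find_key(data, "fwm"))
```

If `fetch_ja` raises a timeout or connection error while the LCD clock keeps ticking, that points at the network path rather than at the UI's short timeout.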
One last thing you can try: if you haven’t done so, try setting a static IP on OpenSprinkler: go to Edit Options -> Advanced -> turn off Use DHCP. It will auto-populate the fields with your currently assigned IP, DNS, gateway, etc., so you can just submit the changes (or make edits if anything looks incorrect). Then reboot the controller. This way it will disable DHCP and won’t send DHCP requests (which also keeps UDP broadcast disabled).
July 28, 2020 at 7:05 pm #67597
It did respond to button presses.
I did not check if the time was correct.
I did not attempt to ping it.
I can do these things next time.
July 28, 2020 at 7:18 pm #67598
When I get the network error message I refresh the browser, Firefox, screen.
This results in a timeout.
I will try your test script next time.
I confirmed the version (5) after I did the update.
I always run my OS with a fixed IP address. I disable DHCP in the OS.
July 28, 2020 at 10:38 pm #67600
The OS has displayed a network error message again. However after a minute or so it
will return to a “system idle” message. It cycles between the two messages.
The clock is correct. I have configured my own NTP server which is working correctly.
The displayed time advances as you would expect.
I cannot ping the OS. I press the B1 button and the expected IP address is displayed.
I ping it from two different computers, one Win10 and the other Linux, both fail.
Your test script displays:
Response: ERROR! IP/port not reacheable or timeout happened!
A power cycle restores normal operation.
July 29, 2020 at 1:28 pm #67619
OK, well at this point I probably will have to send you a custom firmware that prints debugging information on the LCD screen, to help understand the issue. I am mostly puzzled that you said Stefan’s firmware worked for you at least for several days. This firmware (2.1.9(5)) is very similar to Stefan’s firmware, except it turns on broadcast when processing DHCP. But if you are already using static IP, then it will never turn on broadcast. So I am puzzled why this works much less effectively than Stefan’s firmware. In any case, I can generate a custom firmware with additional debugging information on the screen to help understand the issue.
It’s certainly also possible that your ENC28J60 module has a defect, in which case you can send a support ticket so we can send you a new one (or you can buy one from amazon: https://www.amazon.com/ENC28J60-Ethernet-Network-Module-Arduino/dp/B01FDD3YYW).
Some other options to consider are: 1) use a secondary router (or VLAN if your router supports VLAN), as described in a post above; 2) try the W5500 Ethernet module as described in this post: https://opensprinkler.com/forums/topic/instructions-for-testing-os-3-2-with-w5500-ethernet-module/
July 29, 2020 at 4:31 pm #67623
It seems that it is easy to not notice the crash. The sprinkler program continues
to run as expected. The web page cycles between “network error” and normal.
The web page otherwise looks OK. If you ping it you will get an error message.
The firmware from Stefan may have worked the same. Now, I cannot be sure.
Would it be possible to enable the WiFi at the same time at a different IP address.
If you directed debug info to the WiFi connection I could record it for you.
Could I get a console on the WiFi connection? I could even give you a remote
I did buy 2 ENC28J60 modules, I can try the other one.
I may try using the monitoring feature of my Dell switch with WireShark.
That will require some study on my part. I have not used it before.
July 29, 2020 at 7:11 pm #67626
I have two OS V3’s with this problem.
It is too bad that firmware update is not available over wired Internet. Once OS is installed in a steel box in the next building basement it is a pain doing a software update. Is it difficult to use wired?
Disconnecting all the zones to remove the OS and take it to a more convenient location would be even more work.
I have been following this thread.
To me this seems like a problem in the UIPEthernet library for ENC28J60. It cannot handle broadcast events correctly if there are too many in a short time. I am no expert in the Ethernet stack. But it seems to me that all devices on the network have to process all broadcast events (in case the recipient needs to respond) and then quickly look to see if the broadcast is needed by the client (OS). If not, it should quickly discard the packet. If the software is slow or the buffer is not big enough, then a sudden sequence of events may just overwhelm the ignore-broadcast implementation, either by buffer overrun or just slow software response. This would be true for all broadcast events, not just DHCP.
Could it be that this library has not been rigorously tested on larger networks where there is a lot going on and many other programs issuing broadcasts. The collective experience here is that when testing on a larger network, you may have to wait 24 hours or more to know the software really works. I wonder if this was the case when testing UIPEthernet?
To me, turning on DHCP only when needed is not a real fix. It is just reducing the window (significantly), but the same crash could occur when the window is open. That might mean a crash once a month instead of every few hours.
I have been reading Issues on the UIPEthernet github site. @jandrassy (a committer) says
UIPEthernet library is made for small MCUs. To store the TCP packets before the application reads them, the library uses the 8kB memory of the ENC28J60. To have more for stored packets the part of memory reserved for receive ring buffer is small.
But the enc28j80 puts in the RX ring buffer almost every packet seen on network and the library must analyse the packet. Some packets are ignored, some are processed for support protocols and the packets for the host are moved to other part of the enc28j60 internal memory to have place for the next packet from the network. The processing of the RX ring buffer is done in maintain() and in every UIPEthernet library function called by host application.
In a network with large traffic sometimes the RX ring buffer doesn’t have place for a packet.
Is this where the problem lies?
July 29, 2020 at 7:31 pm #67628
“Is it difficult to use wired for over the air firmware update” –> yes, no one has ever implemented this. The WiFi OTA update is a built-in functionality of the ESP8266 library, so that works out of the box. The wired Ethernet was a fairly recent add-on that we only started selling last year. The firmware update feature for it does not exist and it’s going to take quite a bit of effort to figure out how to do so. Honestly, back when we had OS 2.x, which only had wired Ethernet, everyone wanted WiFi; now that we have built-in WiFi, it seems people want wired Ethernet again. Almost all smart sprinkler controllers on the market use WiFi, and OpenSprinkler is probably among the very few that provide a wired Ethernet option. If you really want reliable wired Ethernet, I would say OpenSprinkler Pi is probably the best, since it’s based on RPi, which runs a full Linux system, and it’s also lower cost than the microcontroller-based OpenSprinkler. You can perform over the air firmware update on RPi whether it’s connected through WiFi or wired Ethernet.
“Could it be that this library has not been rigorously tested on larger networks where there is a lot going on and many other programs issuing broadcasts” — sure, it’s possible. As you know, all our products are open-source and we rely on open-source libraries. But the downside is that some of these libraries are not used on commercial products so are only tested for hobby projects and not tested rigorously over long term.
“To me, turning on DHCP only when needed is not a real fix.” –> I don’t see why this isn’t a fix. The DHCP request usually finishes well within one second, and it only does so every few hours. The chance of the controller getting tons of broadcast over less than 1 second is extremely small. Also, if you use static IP, or set DHCP reservation on your router, then DHCP is completely off (or the lease time is infinite so after booting it won’t ever request DHCP again). Of course without knowing your specific network I can’t say this for sure, but again, if that’s an issue then why not choose OSPi.
“But the enc28j80 puts in the RX ring buffer almost every packet seen on network and the library must analyse the packet.” –> this is NOT true: the ENC28J60 chip has hardware filters. When I say ‘disable broadcast’, it is setting a register bit on the chip so that broadcast messages are completely dropped and are not put in the buffer. This is hardware level filtering, not software level.
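To make the hardware-filter point concrete, here is a minimal Python sketch that decodes an ERXFCON (receive filter control) value using the bit layout from the ENC28J60 datasheet. It is an illustration, not firmware code: the helper names are hypothetical, and it is an assumption on my part that the ‘disable broadcast’ bit referred to here is BCEN in ERXFCON.

```python
# ERXFCON bit masks as listed in the ENC28J60 datasheet.
ERXFCON_BITS = {
    0x80: "UCEN",   # unicast filter enable
    0x40: "ANDOR",  # AND/OR filter select
    0x20: "CRCEN",  # post-filter CRC check enable
    0x10: "PMEN",   # pattern match filter enable
    0x08: "MPEN",   # magic packet filter enable
    0x04: "HTEN",   # hash table filter enable
    0x02: "MCEN",   # multicast filter enable
    0x01: "BCEN",   # broadcast filter enable
}


def decode_erxfcon(value):
    """Return the names of the filter bits set in an ERXFCON value."""
    return [name for mask, name in ERXFCON_BITS.items() if value & mask]


def broadcast_enabled(value):
    """True if the broadcast filter bit (BCEN) is set."""
    return bool(value & 0x01)


if __name__ == "__main__":
    # 0xB0 is the ERXFCON value that appears in the debug strings
    # posted later in this thread.
    print(decode_erxfcon(0xB0), broadcast_enabled(0xB0))
```

Under that assumption, a reading of B0 has BCEN clear, which is consistent with broadcasts being filtered on the chip before they reach the receive buffer.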
If you haven’t tried firmware 2.1.9(5), I suggest you give it a try. If you are still on firmware 2.1.9(4) or earlier version, then the broadcast issue is likely what’s causing the hanging on your controller, which is what (5) is meant to address.
Finally, UIPEthernet is really the only library we can use for ENC28J60 now. The previous EtherCard library we’ve used for OS 2.3 is not available for ESP8266 (there is a forked branch that attempts to make it available for ESP8266 but last time I tried it failed). So there is no other library we can use for now. You can always try the W5500 Ethernet module as I described above.
July 30, 2020 at 12:29 am #67634
@Water_my_lawn: I’ve generated a version of 2.1.9(5) which prints debugging information on the LCD. it’s at:
and the firmware name is “os_219_5_enc28j60_debug.bin”. It prints a sequence of numbers separated by | at the top line of the LCD, basically containing various register values for ENC28J60. You can let me know that entire string 1) upon fresh reboot of the controller; and 2) after network error appeared, then by comparing how the values changed I may get some information about what’s going on.
July 30, 2020 at 8:36 am #67640
I did load your new firmware but I see nothing different. There is nothing different
on the display after booting. The “About” screen still has the 2.1.9(5) version.
The firmware file that I downloaded is slightly larger so I think that I have the right thing.
I will run a ping script to detect the actual time when it fails since it is otherwise
hard to notice.
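A ping script of this kind could look roughly like the sketch below. It is an assumption-laden example, not the script actually used here: the function names `ping_once` and `watch` are hypothetical, it shells out to the system `ping` tool, and it defaults to a slow probe rate of once per minute. It logs a timestamp only when reachability changes, so the moment of failure is easy to find.

```python
import platform
import subprocess
import time
from datetime import datetime


def ping_once(host, timeout_s=5):
    """Send one ICMP echo with the system ping tool; True if it exits with 0.

    Note: on Windows the exit code can be 0 for some error replies, so
    treat a Windows result as a hint rather than proof.
    """
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return result.returncode == 0


def watch(host, probe=ping_once, interval_s=60, rounds=None,
          sleep=time.sleep, log=print):
    """Probe `host` every `interval_s` seconds; log only state changes.

    `rounds=None` runs forever; `probe`, `sleep` and `log` are injectable
    so the loop can be exercised without touching the network.
    """
    last = None
    done = 0
    while rounds is None or done < rounds:
        ok = probe(host)
        if ok != last:
            stamp = datetime.now().isoformat(timespec="seconds")
            log(f"{stamp} {host} {'UP' if ok else 'DOWN'}")
            last = ok
        done += 1
        if rounds is None or done < rounds:
            sleep(interval_s)


# Example use (the address is a placeholder): watch("192.168.1.50")
```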
I will watch for a hang and report back with anything that I see then.
July 30, 2020 at 10:39 am #67643
@Water_my_lawn: As I said, the information is displayed on the LCD, meaning the LCD on the controller, not in the ‘About’ page (if network error happens, you can’t load the webpage so it won’t be useful to show debugging information there). Therefore the debugging information needs to be displayed on the LCD so that even if the webpage doesn’t load it still shows.
July 30, 2020 at 3:01 pm #67645
Yes, I understand. My concern with the About page was that the version did not indicate that the debug
firmware was loaded. It is a confidence builder if there is some indication that the new firmware
is different than the previous firmware.
I forgot to plug the ENC28J60 back in after doing the firmware update. I now see the debug code
on the LCD. Here is what I get: 28|1|4|B0|0|31. The net connection is working OK at the moment.
I will watch for the hang.
I am running a ping script which should indicate when the hang occurs.
July 30, 2020 at 3:14 pm #67646
It’s the same firmware, just turning on debugging, so I didn’t create a new minor revision because nothing else has changed. As I said, if the LCD shows the sequence of numbers at the top then that means your firmware is loaded correctly. That will allow you to tell if it’s the debug version or not.
If you use the API script:
and pull the JSON debug (/db), it will show the firmware’s build time. That will tell the two apart (the previous version has a different build time).
Also, I don’t know how frequently you are pinging, but I would suggest a slower ping rate (maybe once per minute or something) to begin with.
July 31, 2020 at 9:17 am #67653
I have been running the os_219_5_enc28j60_debug firmware and had my first unresponsive event.
The symptoms I observe:
The OpenSprinkler is not responsive when opening the webpage and the web browser times out.
The TestOSAPI utility also times out. “ERROR! IP/port not reacheable or timeout happened!”
However, the OpenSprinkler continues to respond to pings normally.
The OpenSprinkler’s clock continues to be accurate.
The OpenSprinkler continues to run programs and water normally.
DHCP is turned off in the OpenSprinkler.
The debug information shown after a reboot when responding normally:
The debug information shown after the OpenSprinkler becomes unresponsive:
28|13|4|B0|0|31
July 31, 2020 at 9:59 am #67656
@bena, the sequence of numbers is as follows: the first four are the ENC28J60 registers EIR, ESTAT, ECON1 and ERXFCON (all in hex format), followed by a count that detects main loop timeout, and the last one is the available RAM (in KB).
Most of the numbers you have are the same as mine, but one is different after it becomes unresponsive: 13 (ESTAT). According to the ENC28J60 datasheet, 0x13 means the following bits are set/flagged: Late collision and Transmit abort. This probably explains why it’s not transmitting data out. A late collision is defined as a collision that occurred after 64 bytes have been transmitted. I have never seen this flagged on my test controllers. I did some googling and found a few pieces of information here and there:
I don’t know how to reproduce this situation since I’ve never seen it on my network.
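The comparison of debug strings described above can be sketched in Python as follows. This is an illustrative helper, not part of the firmware or the test script: the function names are hypothetical, the ESTAT bit masks follow the ENC28J60 datasheet, and because the thread only states that the first four fields are hex, the last two fields are kept as raw text.

```python
# ESTAT bit masks as listed in the ENC28J60 datasheet.
ESTAT_BITS = {
    0x80: "INT",      # interrupt pending
    0x40: "BUFER",    # buffer error
    0x10: "LATECOL",  # late collision
    0x04: "RXBUSY",   # receive busy
    0x02: "TXABRT",   # transmit abort
    0x01: "CLKRDY",   # clock ready
}

FIELDS = ("EIR", "ESTAT", "ECON1", "ERXFCON", "loop_timeouts", "free_ram_kb")


def parse_debug_line(line):
    """Split the LCD debug string into named fields.

    The four register fields are parsed as hex; the last two are kept as
    strings because their number base is not stated.
    """
    parts = line.strip().split("|")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    parsed = dict(zip(FIELDS, parts))
    for reg in FIELDS[:4]:
        parsed[reg] = int(parsed[reg], 16)
    return parsed


def decode_estat(value):
    """Return the names of the status bits set in an ESTAT value."""
    return [name for mask, name in ESTAT_BITS.items() if value & mask]


def changed_fields(before, after):
    """Names of the fields that differ between two debug strings."""
    a, b = parse_debug_line(before), parse_debug_line(after)
    return [name for name in FIELDS if a[name] != b[name]]


if __name__ == "__main__":
    working = "28|1|4|B0|0|31"    # string reported earlier while the link worked
    hung = "28|13|4|B0|0|31"      # string reported after the hang
    print(changed_fields(working, hung))
    print(decode_estat(parse_debug_line(hung)["ESTAT"]))
```

Run on the two strings posted in this thread, only ESTAT differs, and 0x13 decodes to late collision and transmit abort on top of the clock-ready bit that is also set in the working state.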