OpenSprinkler › Forums › Hardware Questions › OpenSprinkler › Controller lockups / crashes with wired Ethernet module
Tagged: Controller lock up hang crash
- This topic has 168 replies, 18 voices, and was last updated 4 months, 2 weeks ago by Ray.
July 31, 2020 at 10:05 am #67659
bena (Participant)
Ok, with this information, I’ll start removing other devices from my network and try different cabling to see if I can find something causing this error.
Also, for reference, the cable run to my garage is about 100 feet, but I’m sure I failed to follow installation standards correctly.
I’m wondering if this condition could be recreated with a long spool of Ethernet cabling. I’ll continue to troubleshoot and test and report back.
July 31, 2020 at 10:48 am #67661
Water_my_lawn (Participant)
Ray:
I have a hang using the debug firmware.
One data point to consider: I was running my ping script at 1 ping per second. I did not get a chance
to slow it down as you recommended. While the ping was running there was no problem. This morning
I stopped the ping. A few hours later I have a hang.
The data on the LCD:
Running OK:
28|1|4|B0|0|31
After hang:
29|41|4|B0|0|31
Hope this helps.
July 31, 2020 at 10:51 am #67662
Water_my_lawn (Participant)
I forgot to add: after the hang, ping fails.
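For anyone who wants to repeat the ping test described above, here is a minimal sketch of a once-per-second ping logger in Python. It is not the script used in this thread. The function names, the example address, and the Linux-style `ping -c 1 -W 1` flags are all assumptions made for illustration. The probe is passed in as a function so the loop can be exercised without a network.

```python
import subprocess
import time
from typing import Callable, List, Optional, Tuple


def ping_once(host: str) -> bool:
    """Send one ICMP echo request with the system ping (Linux-style flags assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def monitor(
    host: str,
    count: int,
    interval_s: float = 1.0,
    probe: Callable[[str], bool] = ping_once,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Tuple[int, bool]]:
    """Probe `host` `count` times and return (attempt index, reachable) pairs."""
    log = []
    for attempt in range(count):
        log.append((attempt, probe(host)))
        if attempt + 1 < count:
            sleep(interval_s)  # pause between probes, not after the last one
    return log


def first_failure(log: List[Tuple[int, bool]]) -> Optional[int]:
    """Return the index of the first failed probe, or None if every probe succeeded."""
    for attempt, reachable in log:
        if not reachable:
            return attempt
    return None
```

A call such as `monitor("192.168.1.50", count=3600)` (hypothetical address) would probe for roughly an hour, and `first_failure` then gives a rough idea of when the controller stopped answering.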
July 31, 2020 at 1:43 pm #67664
Ray (Keymaster)
@Water_my_lawn: 29|41 matches the problem I was describing earlier: the ‘receiver error’ flag and the ‘receiver buffer error’ flag have both been set. Once these are set, it seems the Ethernet controller will eventually hang, though not immediately.
So based on the debugging information I collected, I’ve compiled a new firmware 2.1.9(6) with the following work-around: the firmware will periodically check the ENC28J60 register values, and if it detects any indicator of problems (so far the indicators are: EIR.RXERIF and ESTAT.BUFER are both set; or ESTAT.LATECOL and ESTAT.TXABRT are both set), it will issue a soft reset to the ENC28J60 to re-initialize its state. This is completely seamless to the user: you will NOT observe any reboot, and programs will continue to run; all it does is trigger a soft reset of the Ethernet chip to restore the registers to their initial states.
The firmware is at the same place as before:
http://raysfiles.com/os_compiled_firmware/v3.0/experimental/
The name is “os_219_6_enc28j60_debug.bin”. I’ve also changed the debugging information displayed on the LCD a bit, to remove numbers which are now irrelevant. Once flashed, you should see four numbers on the top line; for example, it might show:
8|1|4|0
The first three are the EIR, ESTAT, and ECON1 register values (all in hex format); the last one is a count of how many re-initializations it has done so far.
It’s important to use the debug version, as only the debug version has the re-initialization logic I described. This firmware also fixes another issue we discovered just today, where an invalid NTP server/IP can cause the controller to get stuck in the NTP syncing state (a rather rare situation that only happens if you’ve put in a wrong NTP server/IP).
At this point I am pretty much doing blind debugging — as I cannot reproduce the situations that Water_my_lawn and bena encountered, I am coming up with theories to address issues without actually being able to observe the issues. So if 2.1.9(6) still doesn’t solve the issue, I’m gonna admit it’s beyond my knowledge then and I don’t know what else to try 🙂
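To make the LCD readings easier to interpret, here is a rough Python sketch that parses a debug line and applies the two indicators described in this post. It is a host-side illustration, not the firmware code. The function names are hypothetical, and the bit masks are taken from the ENC28J60 datasheet register definitions rather than from the OpenSprinkler source. Only the first three fields are treated as hex register values; the remaining fields are kept as raw text because their encoding is not stated here.

```python
from typing import List, Tuple

# Bit masks per the ENC28J60 datasheet (assumed, not copied from the firmware).
EIR_RXERIF = 0x01     # EIR bit 0: receive error interrupt flag
ESTAT_BUFER = 0x40    # ESTAT bit 6: buffer error status
ESTAT_LATECOL = 0x10  # ESTAT bit 4: late collision error
ESTAT_TXABRT = 0x02   # ESTAT bit 1: transmit abort error


def parse_lcd(line: str) -> Tuple[int, int, int, List[str]]:
    """Split 'EIR|ESTAT|ECON1|...' into three hex values plus the raw remaining fields."""
    fields = line.strip().split("|")
    eir, estat, econ1 = (int(f, 16) for f in fields[:3])
    return eir, estat, econ1, fields[3:]


def needs_reset(eir: int, estat: int) -> bool:
    """True if either indicator pair described in the post is present."""
    rx_fault = bool(eir & EIR_RXERIF) and bool(estat & ESTAT_BUFER)
    tx_fault = bool(estat & ESTAT_LATECOL) and bool(estat & ESTAT_TXABRT)
    return rx_fault or tx_fault


if __name__ == "__main__":
    for reading in ("28|1|4|B0|0|31", "29|41|4|B0|0|31"):
        eir, estat, econ1, rest = parse_lcd(reading)
        print(reading, "-> reset indicated:", needs_reset(eir, estat))
```

Under these masks, 29|41 has both RXERIF and BUFER set, while a reading such as 8|1|4|1 matches neither indicator pair.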
July 31, 2020 at 7:21 pm #67667
Water_my_lawn (Participant)
I got a hang after a few hours of running.
Here are the debug codes:
Running OK:
0|1|4|0
After hang:
8|1|4|1
Ping fails.
July 31, 2020 at 10:10 pm #67669
Water_my_lawn (Participant)
Ray,
I had a quick look at your code. I think that you may be writing past the end of ether_buffer
at lines 435, 527, and 556 in main.cpp if client.read returns a full buffer.
August 1, 2020 at 6:07 am #67671
Water_my_lawn (Participant)
I got another hang with different results:
Debug codes after hang:
0|1|80|1
This time I was running ping at once per second. I thought the ping might act
like a keep-alive but I guess not.
August 1, 2020 at 7:58 am #67672
Water_my_lawn (Participant)
In the last hang the third debug field is 80. The bit that is set is TXRST in ECON1, which resets the
Ethernet transmitter. This bit must be cleared after it is set to release the transmitter.
In the ECON1 register, bit 2, which is RXEN, must be set to 1 to be able to receive packets.
Got another hang with 8|1|84|1 on the LCD.
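The ECON1 values seen in this thread (4, 80, 84) can be decoded bit by bit. The following is a small Python sketch of that decoding; the bit names and positions are assumed from the ENC28J60 datasheet, and the function names are hypothetical.

```python
from typing import List

# ECON1 bit layout per the ENC28J60 datasheet (assumed), most significant bit first.
ECON1_BITS = [
    (0x80, "TXRST"),   # transmit logic held in reset
    (0x40, "RXRST"),   # receive logic held in reset
    (0x20, "DMAST"),   # DMA operation in progress
    (0x10, "CSUMEN"),  # DMA checksum enable
    (0x08, "TXRTS"),   # transmit request to send
    (0x04, "RXEN"),    # receive enable
    (0x02, "BSEL1"),   # bank select bit 1
    (0x01, "BSEL0"),   # bank select bit 0
]


def decode_econ1(value: int) -> List[str]:
    """Return the names of the ECON1 bits that are set in `value`."""
    return [name for mask, name in ECON1_BITS if value & mask]


def transmitter_in_reset(value: int) -> bool:
    """True if TXRST is set, meaning the transmitter is held in reset."""
    return bool(value & 0x80)


def receiver_enabled(value: int) -> bool:
    """True if RXEN is set, meaning received packets are accepted."""
    return bool(value & 0x04)


if __name__ == "__main__":
    for text in ("4", "80", "84"):
        value = int(text, 16)
        print(text, decode_econ1(value))
```

Read this way, 4 is the normal state with only RXEN set, 80 has the transmitter in reset with the receiver disabled, and 84 has the transmitter in reset with the receiver still enabled.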
August 1, 2020 at 9:23 am #67676
Ray (Keymaster)
@Water_my_lawn: no, the buffer is not the issue. The actual ether_buffer is 2x the size of ETHER_BUFFER_SIZE:
https://github.com/OpenSprinkler/OpenSprinkler-Firmware/blob/dev/219-5/main.cpp#L67
and every client.read call is constrained to read at most ETHER_BUFFER_SIZE characters. Even in previous versions of the code, the buffer’s actual size is ETHER_BUFFER_SIZE+TMP_BUFFER_SIZE (which is 256). So I don’t think there is a buffer overflow.
August 2, 2020 at 11:59 am #67689
Dennis (Participant)
I have owned an OpenSprinkler 2.3 DC for several years and love it. My controller is connected to the network using its Ethernet port, and the device started becoming inaccessible on the network every 2 to 3 days after upgrading to firmware 2.1.9 back in 2019. I was tempted to downgrade to 2.1.7; however, I enjoy using the ET algorithm introduced in 2.1.9.
For the last 8 months my workaround has been to plug the controller into a timer that powers down the device daily for 15 minutes. I have not been locked out of the device ever since. Anyhow, I want to give props to Ray for taking the time to research the ethernet lockups. I look forward to the day when I can remove the timer. Thank you!
August 3, 2020 at 12:18 pm #67707
Ray (Keymaster)
@Dennis: thanks for posting the work-around.
At the moment I am pretty lost as to what else to try. I have 3 test OS units on my own network: one OS 2.3 DC and two OS 3.2 AC+Ethernet. Two are directly connected to my router and one is connected through a powerline Ethernet adapter. All three are running 2.1.9(5) and have been alive for more than 5 days since I flashed 2.1.9(5) onto them. I run the Test script (http://raysfiles.com/os/TestOSManual.html) on all three of them, with a browser tab open on the side to show homepage status, and have IFTTT set up to receive notifications on station runs. I have encountered a couple of cases where an IFTTT notification was missing (debugging information shows it wasn’t able to connect to the IFTTT server at that moment). But otherwise the 3 controllers have been running fine: no hanging, no issue accessing them.
So the issue some of the other users are experiencing is pretty much beyond my knowledge; the only way to find out would be to go to their homes in person to debug the issue. The feature implemented in the 2.1.9(6) debug version (performing an Ethernet reset when the microcontroller detects that the ENC28J60 registers are in one of the erroneous states) is akin to performing a reboot, though it’s a softer reboot than a power-on reboot. But even this doesn’t seem to solve the problem for Water_my_lawn, and some of the register values reported, like ECON1=80 or 84, I have never seen on my controllers. In any case, I am pretty lost, and I have to move on to other priorities for now and come back to this when anyone has more insight into what’s happening.
Regarding work-around, in addition to what Dennis mentioned (using a timer to trigger a power-on reboot once a day), I still think using a secondary router is an effective approach — there are inexpensive routers less than $20, it won’t affect access to the controller from the primary network as long as you set up port forwarding on the secondary router.
Another work-around, for OS 3.2 users, is to try W5500 Ethernet module (https://opensprinkler.com/forums/topic/instructions-for-testing-os-3-2-with-w5500-ethernet-module/). Though, this only works for OS 3.2 and doesn’t work for OS 2.3.
August 4, 2020 at 9:05 am #67722
Water_my_lawn (Participant)
Hello Ray,
Could you send me the OS files that you modified with the debug code? I would like to take
a look at them and see if anything catches my eye. I know that I can get the standard source
from GitHub.
Oddly, I have not had a hang since Aug 1.
August 4, 2020 at 10:29 am #67724
Ray (Keymaster)
The most recent code is in this branch:
https://github.com/OpenSprinkler/OpenSprinkler-Firmware/tree/dev/219-7
Apparently while working on revision (6) I turned off the MQTT loop by mistake, so I had to turn it back on, re-compile, and name it revision (7). You probably didn’t notice this unless you were using MQTT.
August 5, 2020 at 10:58 pm #67745
Water_my_lawn (Participant)
I am trying to compile the source. Here is the procedure that I followed, which
is as close as possible to the procedure that you described. However, I get
compile errors.
-------------------------------------------------------------
#Get the code.
git clone https://github.com/OpenSprinkler/OpenSprinkler-Firmware.git
#Puts it in ~/OpenSprinkler-Firmware/

#Get the ESP8266 for Arduino stuff.
git clone https://github.com/esp8266/Arduino.git
#Puts it in ~/Arduino

git clone https://github.com/esp8266/Arduino.git esp8266_2.5.2
#Puts it in ~/esp8266_2.5.2

#Go into esp8266_2.5.2
cd esp8266_2.5.2
git checkout tags/2.5.2

cd tools
python get.py

#Install necessary libraries, including SSD1306, RCSwitch, and UIPEthernet.
#Download and unzip or git clone these into ~/Arduino/libraries folder.

cd ~/Arduino/libraries
git clone https://github.com/ThingPulse/esp8266-oled-ssd1306.git
git clone https://github.com/sui77/rc-switch.git
git clone https://github.com/UIPEthernet/UIPEthernet.git

#And this one which is new.
git clone https://github.com/knolleary/pubsubclient.git

cd ~/OpenSprinkler-Firmware
#There is an error in make.lin32:
#Replace this line:
~/Arduino/libraries/SSD1306 \
#with this line:
~/Arduino/libraries/esp8266-oled-ssd1306 \

make -f make.lin32
-----------------------------------------------
I get a series of errors like:
/home/peter/Arduino/libraries/ESP8266WiFi/src/BearSSLHelpers.h:149:34: error: ‘virtual const unsigned char* BearSSL::HashSHA256::oid()’ marked override, but does not override
virtual const unsigned char *oid() override;

/home/peter/Arduino/libraries/ESP8266WebServer/src/Parsing-impl.h:139:15: error: ‘class String’ has no member named ‘isEmpty’
if (req.isEmpty()) break; //no more headers

I suspect that there is some version mismatch somewhere.
August 5, 2020 at 11:17 pm #67750
Ray (Keymaster)
One possibility is that some of the libraries have a ‘test’ folder, which needs to be deleted; otherwise it causes compilation errors. I think you didn’t post all the errors so I can’t be sure, but if you see any error pointing to some /test folder, then you should delete those test folders.
Also, the original UIPEthernet library does not contain some of the improvements that Stefan and I made. You should use this modified UIPEthernet library (my own fork, under the fixes/dhcp branch):
https://github.com/OpenSprinkler/UIPEthernet/tree/fixes/dhcp
I’ve tried to submit a pull request, but the author did not seem to be taking pull requests yet, so I had to keep it in my own fork.
August 6, 2020 at 8:25 am #67760
Water_my_lawn (Participant)
I updated the UIPEthernet library from your source but I get the same errors.
I issue these commands from ~/OpenSprinkler-Firmware:
make -f make.lin32 clean
make -f make.lin32
I have attached the full compiler output showing all the errors that I get.
I don’t have any errors that refer to “test”.
Thanks.
August 6, 2020 at 11:17 am #67765
Ray (Keymaster)
I don’t recall seeing this error before. But a couple of things may be related:
1. As I said, you should delete ‘test’ or ‘tests’ folders in all relevant libraries, such as pubsubclient. I find they sometimes lead to compilation errors.
2. This step you described:
git clone https://github.com/esp8266/Arduino.git
#Puts it in ~/Arduino
I am not sure where this is coming from: you clone the ESP core again into ~/esp8266_2.5.2 in the next step, so why duplicate the core into ~/Arduino? In the error messages you received, there is this line:
“Arduino/libraries/ESP8266WiFi/src/WiFiClientSecureBearSSL.h”
It implies that the compiler is looking for ESP8266 core files in the ~/Arduino/libraries folder. This is bizarre; it should be looking for them in ~/esp8266_2.5.2, since that’s what you set as ESP_ROOT in the Makefile.
August 6, 2020 at 11:31 am #67767
rboer01 (Participant)
Hi Ray,
Been using your latest firmware for a few days now. No more issues from my side.
I’ll keep you posted.
Brgds,
Rik
August 6, 2020 at 3:35 pm #67774
Water_my_lawn (Participant)
Perhaps I was reading your instructions too literally.
Here are my updated instructions that seem to work and
produce the mainArduino.bin file. I have not tried it
yet.
PS: I have not had a hang since Aug 1. No change to the
firmware and no change on my network!
-----------------------------------------------------
#Get the code.
git clone https://github.com/OpenSprinkler/OpenSprinkler-Firmware.git
#Puts it in ~/OpenSprinkler-Firmware/

#Get the Arduino code.
git clone https://github.com/esp8266/Arduino.git esp8266_2.5.2
#Puts it in ~/esp8266_2.5.2

#Go into esp8266_2.5.2 and get the correct tag.
cd esp8266_2.5.2
git checkout tags/2.5.2

cd tools
python get.py

#Install necessary libraries, including SSD1306, RCSwitch, and UIPEthernet.
#Download and unzip or git clone these into ~/Arduino/libraries folder.

mkdir -p ~/Arduino/libraries
cd ~/Arduino/libraries
git clone https://github.com/ThingPulse/esp8266-oled-ssd1306.git

# The latest version of the OLED code is not compatible, fall back to 4.1.0
cd esp8266-oled-ssd1306
git checkout tags/4.1.0
# Return to ~/Arduino/libraries before cloning the remaining libraries.
cd ..

git clone https://github.com/sui77/rc-switch.git
git clone https://github.com/UIPEthernet/UIPEthernet.git

#And this one which is new.
git clone https://github.com/knolleary/pubsubclient.git

cd ~/OpenSprinkler-Firmware
#There is an error in make.lin32:
#Replace this line:
~/Arduino/libraries/SSD1306 \
#with this line:
~/Arduino/libraries/esp8266-oled-ssd1306 \

# Remove tests directory, will not compile.
rm -rf ~/Arduino/libraries/pubsubclient/tests

make -f make.lin32
August 7, 2020 at 3:01 pm #67803
rboer01 (Participant)
Dear,
Today, after 4 days, I got a lockup.
See attached picture of the display.
Brgds,
Rik
August 9, 2020 at 10:54 pm #67831
Water_my_lawn (Participant)
I got the code and can compile it with debugging and load it successfully.
Now I am ready to try some debugging.
Here is my take on the situation:
The ENC28J60 is not interrupt driven. There is an interrupt pin #2 on the
connector but it is not connected to anything in the OS. It runs in polled
mode.
The OS continues to run normally; only the network interface is down. The
polling loop in main.cpp runs OK because the sprinkler programs continue
to run normally.
The interface does not respond to a ping. ICMP packets are handled in the
UIPEthernet driver; they never get into the OS code. There is no hardware
support for ICMP packets.
I suspect that the receive buffer fills and is not being cleared for some
reason. One possible reason is that incoming packets arrive faster than
the OS can digest them. Another possible reason could be
some non-thread-safe code.
I am going to put some debug messages into a new version and try to catch
the problem.
I went 7 days without a hang then had 2 in succession.
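The first hypothesis above (packets arriving faster than a polled loop drains them) can be illustrated with a toy model. This Python sketch is not a model of the ENC28J60 or of UIPEthernet; the function name, the tick-based timing, and the buffer capacity counted in whole packets are all simplifying assumptions.

```python
from typing import Tuple


def simulate_polled_rx(
    arrivals_per_tick: int,
    polls_per_tick: int,
    capacity: int,
    ticks: int,
) -> Tuple[int, int]:
    """Return (packets still queued, packets dropped) after `ticks` steps.

    Each tick, `arrivals_per_tick` packets arrive first; any that do not fit
    in the `capacity`-packet buffer are dropped. The polling loop then removes
    up to `polls_per_tick` packets.
    """
    queued = 0
    dropped = 0
    for _ in range(ticks):
        for _ in range(arrivals_per_tick):
            if queued < capacity:
                queued += 1
            else:
                dropped += 1  # buffer full: this packet is lost
        queued -= min(queued, polls_per_tick)
    return queued, dropped


if __name__ == "__main__":
    print("balanced load:", simulate_polled_rx(1, 1, 4, 10))
    print("overloaded:   ", simulate_polled_rx(2, 1, 4, 10))
```

When arrivals match the polling rate the buffer stays empty, but once arrivals exceed it the buffer sits near full and every tick loses packets, which is the kind of sustained overrun the hypothesis describes.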
I have looked at the OLED debug messages for a number of these hang events and cannot
identify a root cause.
I would like to produce a new debug version and I will run it. I would like
to have some volunteers that have had these problems. The code will otherwise
be identical to Ray’s latest release.
August 10, 2020 at 7:54 am #67836
bena (Participant)
Although the symptoms I see are slightly different from yours and may be linked to the length or quality of my Ethernet wire, I’d be happy to try your debug code.
August 10, 2020 at 9:25 pm #67847
Water_my_lawn (Participant)
I just had another hang. This time it was unusual: the web page was hung as with other
hangs, but this time the OS responded to pings. The display showed 28|1|4|10.
The only other time I saw a 28 was when the OS was running OK.
August 11, 2020 at 6:48 am #67851
rboer01 (Participant)
Dear,
Display now shows 8|1|4|13 after hang.
Brgds,
Rik
August 19, 2020 at 12:25 pm #67952
Water_my_lawn (Participant)
I have produced a debug version of the code. It should operate no
differently than the official release. I have added some debug
information that will appear on a line above the standard display
and a line that will appear below the standard display.
The line above will contain 4 hex numbers. The first is the
flag field of the current packet being handled. This will
normally be zero.
The second, third, and fourth are the packet count,
the ICMP packet count, and the TCP packet count. These are
only one-byte counters so they roll over often. The ICMP
count will be 0 until you ping the OS.
Before you communicate with the OS the top line will
display “client”. That means that no client has established
communications. Just point a web browser at the OS and
the debug counters will appear.
The bottom line will contain 4 numbers. These are state
indicators for the 4 levels of code involved in the network
communication using the ENC28J60 interface.
I will run this firmware on my system and watch for a hang.
If other people with the hang problem would like to help,
that would be great.
If you get a hang I would like to get all the numbers.
Take a photo to save writing them down.
Generally a hang is indicated with a “Network error”
message at the bottom of the OS web page.
When you see that happen, send me the numbers. Sometimes
I can ping the OS when it is hung but mostly ping will fail.
If you ping it the numbers may change. Please also send
the changed numbers.
Try to refresh the web page. The numbers may change; if
so then please send the changed numbers.
I have attached the debug version of the firmware.
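Because the counters described above are one byte wide, comparing two readings needs modulo-256 arithmetic. Here is a short Python sketch of that calculation; the function names are hypothetical and the code is not part of the attached firmware.

```python
def bump(counter: int) -> int:
    """Increment a one-byte counter, wrapping from 0xFF back to 0x00."""
    return (counter + 1) & 0xFF


def delta(before: int, after: int) -> int:
    """Packets counted between two readings, assuming fewer than 256 elapsed."""
    return (after - before) & 0xFF


if __name__ == "__main__":
    # Two readings taken from the display, written as hex as they appear on screen.
    first, second = int("FA", 16), int("03", 16)
    print("packets between readings:", delta(first, second))
```

The result is only meaningful if fewer than 256 packets were counted between the two readings; a counter that shows the same value twice may be stalled or may have wrapped exactly once.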