July 31, 2020 at 10:05 am #67659
Ok, with this information, I’ll start removing other devices from my network and try different cabling to see if I can find something causing this error.
Also, for reference, the cable run to my garage is about 100 feet, but I’m sure I failed to follow installation standards correctly.
I’m wondering if this condition could be recreated with a long spool of Ethernet cabling. I’ll continue to troubleshoot and test and report back.
July 31, 2020 at 10:48 am #67661
- This reply was modified 5 days, 18 hours ago by bena. Reason: Added cable run length
I have a hang using the debug firmware.
One data point to consider: I was running my ping script at 1 ping per second. I did not get a chance
to slow it down as you recommended. While the ping was running there was no problem. This morning
I stopped the ping. A few hours later I have a hang.
The data on the LCD:
Hope this helps.
July 31, 2020 at 10:51 am #67662
I forgot to add: after the hang, ping fails.
July 31, 2020 at 1:43 pm #67664
@Water_my_lawn: 29|41 matches the problem I was describing earlier: the ‘receiver error’ flag and the ‘receiver buffer error’ flag have both been set. Once these are set, it seems the Ethernet controller will eventually hang, though not immediately.
So based on the debugging information I collected, I’ve compiled a new firmware 2.1.9(6) with the following work-around: the firmware periodically checks the enc28j60 register values, and if it detects any indicator of problems (so far the indicators are: EIR.RXERIF and ESTAT.BUFER are both set; or ESTAT.LATECOL and ESTAT.TXABRT are both set), it issues a soft reset to the enc28j60 to re-initialize its state. This is completely seamless to the user: you will NOT observe any reboot, and programs will continue to run; all it does is trigger a soft reset of the Ethernet chip to restore its registers to their initial states.
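For anyone following along, here is a minimal sketch of what that check could look like. This is not the actual firmware code: the function name is hypothetical, and the bit masks are taken from the ENC28J60 datasheet register maps (which spell the two transmit-side bits LATECOL and TXABRT). How the register values are read from the chip is left out.

```cpp
#include <cstdint>

// Bit masks from the ENC28J60 datasheet register maps.
constexpr uint8_t EIR_RXERIF    = 0x01; // EIR bit 0: receive error interrupt flag
constexpr uint8_t ESTAT_TXABRT  = 0x02; // ESTAT bit 1: transmit abort error
constexpr uint8_t ESTAT_LATECOL = 0x10; // ESTAT bit 4: late collision error
constexpr uint8_t ESTAT_BUFER   = 0x40; // ESTAT bit 6: Ethernet buffer error

// Hypothetical helper (not the firmware's own function): returns true when
// the register values match one of the two fault patterns described above.
bool enc28j60_needs_reset(uint8_t eir, uint8_t estat) {
    // Receive-side pattern: receive error flag and buffer error both set.
    bool rx_fault = (eir & EIR_RXERIF) && (estat & ESTAT_BUFER);
    // Transmit-side pattern: late collision and transmit abort both set.
    bool tx_fault = (estat & ESTAT_LATECOL) && (estat & ESTAT_TXABRT);
    return rx_fault || tx_fault;
}
```

With the reported values EIR=0x29 and ESTAT=0x41, the receive-side pattern matches, so a periodic caller would trigger the soft reset.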
The firmware is at the same place as before:
The name is “os_219_6_enc28j60_debug.bin”. I’ve also changed the debugging information displayed on the LCD a bit, to remove numbers which are now irrelevant. Once flashed, you should see four numbers at the top, for example it might show:
The first three are the EIR, ESTAT, and ECON1 register values (all in HEX format); the last one is a count of how many re-initializations it has done so far.
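If you want to log these readings rather than copy them by hand, a small parser along these lines might help. It is only a sketch under a few assumptions: the struct and function names are made up, the `|` separator is taken from the way readings are quoted in this thread (e.g. 8|1|84|1), and the reset counter is assumed to be decimal since only the first three fields are stated to be hex.

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical container for the four numbers shown on the LCD.
struct LcdDebug {
    uint8_t eir;      // EIR register value
    uint8_t estat;    // ESTAT register value
    uint8_t econ1;    // ECON1 register value
    unsigned resets;  // number of re-initializations so far
};

// Parses text such as "8|1|84|1". Returns true only when all four fields
// were read and the three register values fit in one byte.
// Assumptions: '|' separates the fields and the counter is decimal.
bool parse_lcd_debug(const char* text, LcdDebug& out) {
    unsigned eir = 0, estat = 0, econ1 = 0, resets = 0;
    if (std::sscanf(text, "%x|%x|%x|%u", &eir, &estat, &econ1, &resets) != 4) {
        return false;
    }
    if (eir > 0xFF || estat > 0xFF || econ1 > 0xFF) {
        return false;
    }
    out.eir = static_cast<uint8_t>(eir);
    out.estat = static_cast<uint8_t>(estat);
    out.econ1 = static_cast<uint8_t>(econ1);
    out.resets = resets;
    return true;
}
```

A reading with fewer than four fields is rejected rather than partially filled in.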
It’s important to use the debug version, as only the debug version has the re-initialization logic I described. This firmware also fixes another issue we discovered just today, where an invalid NTP server/IP can cause the controller to get stuck in NTP syncing state (a rather rare situation that only happens if you’ve put in a wrong NTP server/IP).
At this point I am pretty much doing blind debugging: as I cannot reproduce the situations that Water_my_lawn and bena encountered, I am coming up with theories to address issues without actually being able to observe the issues. So if 2.1.9(6) still doesn’t solve the issue, I’m gonna admit it’s beyond my knowledge then and I don’t know what else to try 🙂
July 31, 2020 at 7:21 pm #67667
I got a hang after a few hours of running.
Here are the debug codes:
Ping fails.
July 31, 2020 at 10:10 pm #67669
I had a quick look at your code. I think that you may be writing past the end of ether_buffer
at lines 435, 527, and 556 in main.cpp if client.read returns a full buffer.
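To illustrate the kind of off-by-one I mean, here is a generic sketch, not the code from main.cpp. If a read fills all `cap` bytes and the code then writes a terminator at `buf[n]`, that write lands one byte past the end. Reserving one byte for the terminator avoids it. Both function names are hypothetical, and `fake_read` is only a stand-in for a client read.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a client read: copies at most max_len bytes
// from src into dst and returns how many bytes were copied.
size_t fake_read(const char* src, size_t src_len, char* dst, size_t max_len) {
    size_t n = src_len < max_len ? src_len : max_len;
    std::memcpy(dst, src, n);
    return n;
}

// Reads into buf (capacity cap) and null-terminates it.
// Asking for at most cap - 1 bytes keeps buf[n] inside the buffer even
// when the read returns the maximum it was allowed.
size_t read_terminated(const char* src, size_t src_len, char* buf, size_t cap) {
    if (cap == 0) {
        return 0; // no room even for the terminator
    }
    size_t n = fake_read(src, src_len, buf, cap - 1);
    buf[n] = '\0';
    return n;
}
```

Whether the firmware is actually exposed to this depends on how large the real buffer is compared with the read limit.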
August 1, 2020 at 6:07 am #67671
- This reply was modified 5 days, 6 hours ago by Water_my_lawn.
I got another hang with different results:
Debug codes after hang:
This time I was running ping at once per second. I thought the ping might act
like a keep-alive but I guess not.
August 1, 2020 at 7:58 am #67672
In the last hang the third debug field is 80. The bit that is set is TXRST in ECON1, which resets the
Ethernet transmitter. This bit must be cleared after it is set to release the transmitter.
In the ECON1 register, bit 2 (RXEN) must be set to 1 to be able to receive packets.
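To make the bit arithmetic concrete, here is a small sketch that tests those two ECON1 bits. The masks come from the ENC28J60 datasheet (TXRST is bit 7, RXEN is bit 2); the helper names are hypothetical and only meant for interpreting the hex value shown in the third LCD field.

```cpp
#include <cstdint>

// ECON1 bit masks from the ENC28J60 datasheet.
constexpr uint8_t ECON1_RXEN  = 0x04; // bit 2: receive enable
constexpr uint8_t ECON1_TXRST = 0x80; // bit 7: transmit logic held in reset

// Hypothetical helper: true when the transmitter is being held in reset.
bool tx_held_in_reset(uint8_t econ1) {
    return (econ1 & ECON1_TXRST) != 0;
}

// Hypothetical helper: true when packet reception is enabled.
bool rx_enabled(uint8_t econ1) {
    return (econ1 & ECON1_RXEN) != 0;
}
```

For a reading of 80 (hex), the transmitter is held in reset and reception is disabled; for 84, the transmitter is still held in reset but reception is enabled.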
Got another hang with 8|1|84|1 on the LCD.
August 1, 2020 at 9:23 am #67676
@Water_my_lawn: no, the buffer is not the issue. The actual ether_buffer is 2x the size of ETHER_BUFFER_SIZE:
and every client.read call is constrained to read at most ETHER_BUFFER_SIZE characters. Even in previous versions of the code, the buffer’s actual size was ETHER_BUFFER_SIZE+TMP_BUFFER_SIZE (which is 256). So I don’t think there is a buffer overflow.
August 2, 2020 at 11:59 am #67689
I have owned an OpenSprinkler 2.3 DC for several years and love it. My controller is connected to the network through its Ethernet port, and the device started becoming inaccessible on the network every 2 to 3 days after I upgraded to firmware 2.1.9 back in 2019. I was tempted to downgrade to 2.1.7; however, I enjoy using the ET algorithm introduced in 2.1.9.
For the last 8 months my workaround has been to plug the controller into a timer that powers down the device daily for 15 minutes. I have not been locked out of the device ever since. Anyhow, I want to give props to Ray for taking the time to research the ethernet lockups. I look forward to the day when I can remove the timer. Thank you!
August 3, 2020 at 12:18 pm #67707
@Dennis: thanks for posting the work-around.
At the moment I am pretty lost as to what else to try. I have 3 test OS units on my own network: one OS 2.3 DC and two OS 3.2 AC+Ethernet. Two are directly connected to my router and one is connected through a powerline Ethernet adapter. All three are running 2.1.9(5) and have been alive for more than 5 days since I flashed 2.1.9(5) onto them. I run the Test script (http://raysfiles.com/os/TestOSManual.html) on all three of them, with a browser tab open on the side to show homepage status, and have IFTTT set up to receive notifications on station runs. I have encountered a couple of cases where an IFTTT notification was missing (debugging information shows it wasn’t able to connect to the IFTTT server at that moment). But otherwise the 3 controllers have been running fine: no hanging, no issue accessing them.
So the issue some of the other users are experiencing is pretty much beyond my knowledge; the only way to find out would be to go to their home in person to debug the issue… The feature implemented in the 2.1.9(6) debug version (performing an Ethernet reset when the microcontroller detects that the ENC28J60 registers are in one of the erroneous states) is akin to performing a reboot, though it’s a softer reboot than a power-on reboot. But it seems that even this doesn’t solve the problem for Water_my_lawn, and some of the register values reported, like ECON1=80 or 84, I have never seen on my controllers. In any case, I am pretty lost and I have to move on for now to other priorities and come back to it when anyone has more insight into what’s happening.
Regarding work-arounds, in addition to what Dennis mentioned (using a timer to trigger a power-on reboot once a day), I still think using a secondary router is an effective approach: there are inexpensive routers for less than $20, and it won’t affect access to the controller from the primary network as long as you set up port forwarding on the secondary router.
Another work-around, for OS 3.2 users, is to try the W5500 Ethernet module (https://opensprinkler.com/forums/topic/instructions-for-testing-os-3-2-with-w5500-ethernet-module/). However, this only works for OS 3.2 and doesn’t work for OS 2.3.
August 4, 2020 at 9:05 am #67722
Could you send me the OS files that you modified with the debug code? I would like to take
a look at them and see if anything catches my eye. I know that I can get the standard source
Oddly, I have not had a hang since Aug 1.
August 4, 2020 at 10:29 am #67724
The most recent code is in this branch:
Apparently while working on revision (6) I turned off the MQTT loop by mistake, so I had to turn it back on, re-compile, and name it revision (7). You probably wouldn’t have noticed this unless you were using MQTT.
August 5, 2020 at 10:58 pm #67745
I am trying to compile the source. Here is the procedure that I followed which
is as close as possible to the procedure that you described. However I get
#Get the code.
git clone https://github.com/OpenSprinkler/OpenSprinkler-Firmware.git
#Puts it in ~/OpenSprinkler-Firmware/
#Get the ESP8266 for Arduino stuff.
git clone https://github.com/esp8266/Arduino.git
#Puts it in ~/Arduino
git clone https://github.com/esp8266/Arduino.git esp8266_2.5.2
#Puts it in ~/esp8266_2.5.2
#Go into esp8266_2.5.2
git checkout tags/2.5.2
#Install necessary libraries, including SSD1306, RCSwitch, and UIPEthernet.
#Download and unzip or git clone these into ~/Arduino/libraries folder.
git clone https://github.com/ThingPulse/esp8266-oled-ssd1306.git
git clone https://github.com/sui77/rc-switch.git
git clone https://github.com/UIPEthernet/UIPEthernet.git
#And this one which is new.
git clone https://github.com/knolleary/pubsubclient.git
#There is an error in make.lin32:
#Replace this line:
#with this line:
make -f make.lin32
I get a series of errors like:
/home/peter/Arduino/libraries/ESP8266WiFi/src/BearSSLHelpers.h:149:34: error: ‘virtual const unsigned char* BearSSL::HashSHA256::oid()’ marked override, but does not override
virtual const unsigned char *oid() override;
/home/peter/Arduino/libraries/ESP8266WebServer/src/Parsing-impl.h:139:15: error: ‘class String’ has no member named ‘isEmpty’
if (req.isEmpty()) break; //no more headers
I suspect that there is a version mismatch somewhere.
August 5, 2020 at 11:17 pm #67750
One possibility is that some of the libraries have a ‘test’ folder, which needs to be deleted otherwise they cause compilation errors. I think you didn’t post all the errors so I can’t be sure, but if you see any error pointing to some /test folder, then you should delete those test folders.
Also, the original UIPEthernet library does not contain some of the improvements that Stefan and I made. You should use this modified UIPEthernet library (my own fork under /fixes/dhcp branch):
I’ve tried to submit a pull request but the author does not seem to be accepting pull requests yet, so I had to keep it in my own fork.