OpenSprinkler Forums Hardware Questions OpenSprinkler Controller lockups / crashes with wired Ethernet module

    #67659

    bena
    Participant

    Ok, with this information, I’ll start removing other devices from my network and try different cabling to see if I can find something causing this error.

    Also, for reference, the cable run to my garage is about 100 feet, but I’m sure I failed to follow installation standards correctly.

    I’m wondering if this condition could be recreated with a long spool of Ethernet cabling. I’ll continue to troubleshoot and test and report back.

    #67661

    Water_my_lawn
    Participant

    Ray:

    I have a hang using the debug firmware.

    One data point to consider: I was running my ping script at 1 ping per second. I did not get a chance
    to slow it down as you recommended. While the ping was running there was no problem. This morning
    I stopped the ping. A few hours later I have a hang.
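
    A once-per-second ping loop of the kind described above might look like the following minimal sketch. The actual script was not posted, so everything here is an assumption: the function names are made up for illustration, and the `ping -c 1 -W 1` flags are the Linux iputils form.

```python
import subprocess
import time


def ping_once(host):
    """Send one ICMP echo request; True if a reply arrived (Linux iputils flags)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def monitor(probe, count, interval=1.0, sleep=time.sleep):
    """Call probe() `count` times and return the (index, ok) state changes.

    Only transitions are recorded, so a long healthy run followed by a hang
    yields [(0, True), (n, False)].
    """
    transitions = []
    last = None
    for i in range(count):
        ok = probe()
        if ok != last:
            transitions.append((i, ok))
            last = ok
        sleep(interval)
    return transitions
```

    For example, `monitor(lambda: ping_once("192.168.1.50"), count=3600)` would probe a controller at that placeholder address for about an hour and report when replies stopped or resumed.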

    The data on the LCD:

    Running OK:
    28|1|4|B0|0|31

    After hang:
    29|41|4|B0|0|31

    Hope this helps.

    #67662

    Water_my_lawn
    Participant

    I forgot to add: after the hang, ping fails.

    #67664

    Ray
    Keymaster

    @Water_my_lawn: 29|41 matches the problem I was describing earlier: the ‘receiver error’ flag and the ‘receiver buffer error’ flag have both been set. Once these are set, the Ethernet controller eventually seems to hang, though not immediately.

    So based on the debugging information I collected, I’ve compiled a new firmware 2.1.9(6) with the following work-around: the firmware periodically checks the enc28j60 register values, and if it detects any indicator of problems (so far the indicators are: EIR.RXERIF and ESTAT.BUFER are both set; or ESTAT.LATECOL and ESTAT.TXABRT are both set), it issues a soft reset to the enc28j60 to re-initialize its state. This is completely seamless to the user: you will NOT observe any reboot, and programs will continue to run; all it does is trigger a soft reset of the Ethernet chip to restore the registers to their initial states.
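
    The trigger condition described above can be sketched as a small host-side check. This is not the firmware code: the bit masks come from my reading of the ENC28J60 datasheet (where the late-collision and transmit-abort bits are named LATECOL and TXABRT), and the function name is hypothetical.

```python
# Bit masks as I read them in the ENC28J60 datasheet (an assumption here,
# not copied from the firmware source).
EIR_RXERIF = 0x01     # EIR bit 0: receive error interrupt flag
ESTAT_BUFER = 0x40    # ESTAT bit 6: buffer error status
ESTAT_LATECOL = 0x10  # ESTAT bit 4: late collision error
ESTAT_TXABRT = 0x02   # ESTAT bit 1: transmit abort error


def needs_soft_reset(eir, estat):
    """Return True when either pair of error indicators is fully set."""
    rx_fault = bool(eir & EIR_RXERIF) and bool(estat & ESTAT_BUFER)
    tx_fault = bool(estat & ESTAT_LATECOL) and bool(estat & ESTAT_TXABRT)
    return rx_fault or tx_fault
```

    With the values reported earlier in the thread, `needs_soft_reset(0x29, 0x41)` is true (the hang reading) and `needs_soft_reset(0x28, 0x01)` is false (the healthy reading).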

    The firmware is at the same place as before:
    http://raysfiles.com/os_compiled_firmware/v3.0/experimental/
    The name is “os_219_6_enc28j60_debug.bin”. I’ve also changed the debugging information displayed on the LCD a bit, to remove numbers which are now irrelevant. Once flashed, you should see four numbers on the top line, for example:
    8|1|4|0
    The first three are the EIR, ESTAT, and ECON1 register values (all in hex); the last one is a count of how many re-initializations it has done so far.
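
    For anyone logging these readings, a rough parser for the four-field line could look like this. The names are hypothetical, and the base of the final counter is an assumption: the post only says the three register values are hex, so the count is read as decimal here.

```python
from typing import NamedTuple


class DebugLine(NamedTuple):
    eir: int
    estat: int
    econ1: int
    reinit_count: int


def parse_debug_line(text):
    """Parse an LCD reading such as '8|1|4|0' into integer fields."""
    fields = text.strip().split("|")
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {text!r}")
    # The three register values are shown in hex.
    eir, estat, econ1 = (int(field, 16) for field in fields[:3])
    # Assumption: the re-initialization count is shown in decimal.
    return DebugLine(eir, estat, econ1, int(fields[3]))
```

    The older six-field display (for example 28|1|4|B0|0|31) is deliberately rejected, since the meaning of its extra fields is not given in this thread.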

    It’s important to use the debug version, as only the debug version has the re-initialization logic I described. This firmware also fixes another issue we discovered just today, where an invalid NTP server/IP can cause the controller to get stuck in NTP syncing state (a rather rare situation that only happens if you’ve put in a wrong NTP server/IP).

    At this point I am pretty much doing blind debugging — as I cannot reproduce the situations that Water_my_lawn and bena encountered, I am coming up with theories to address issues without actually being able to observe the issues. So if 2.1.9(6) still doesn’t solve the issue, I’m gonna admit it’s beyond my knowledge then and I don’t know what else to try 🙂

    #67667

    Water_my_lawn
    Participant

    I got a hang after a few hours of running.
    Here are the debug codes:
    Running OK:
    0|1|4|0

    After hang:
    8|1|4|1

    Ping fails.

    #67669

    Water_my_lawn
    Participant

    Ray;

    I had a quick look at your code. I think that you may be writing past the end of ether_buffer
    at lines 435, 527, and 556 in main.cpp if client.read returns a full buffer.

    #67671

    Water_my_lawn
    Participant

    I got another hang with different results:

    Debug codes after hang:
    0|1|80|1

    This time I was running ping at once per second. I thought the ping might act
    like a keep-alive, but I guess not.

    #67672

    Water_my_lawn
    Participant

    In the last hang the third debug field is 80. The bit that is set is TXRST in ECON1 which resets the
    Ethernet transmitter. This bit must be cleared after it is set to release the transmitter.

    In the ECON1 register bit 2 which is RXEN must be set to 1 to be able to receive packets.

    Got another hang with 8|1|84|1 on the LCD.
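
    To make the third debug field easier to read, here is a small decoder sketch. The bit names and positions follow my reading of the ENC28J60 datasheet's ECON1 register and should be treated as an assumption; the function name is made up.

```python
# ECON1 bit positions as I read them in the ENC28J60 datasheet (assumption).
ECON1_BITS = {
    7: "TXRST",   # transmit logic held in reset
    6: "RXRST",   # receive logic held in reset
    5: "DMAST",   # DMA operation in progress
    4: "CSUMEN",  # DMA checksum enable
    3: "TXRTS",   # transmit request to send
    2: "RXEN",    # receive enable
    1: "BSEL1",   # bank select bit 1
    0: "BSEL0",   # bank select bit 0
}


def decode_econ1(value):
    """Return the names of the bits set in an ECON1 reading, highest bit first."""
    return [name for bit, name in ECON1_BITS.items() if value & (1 << bit)]
```

    Applied to the readings in this thread, 4 decodes to RXEN only, 80 to TXRST only (receiver disabled, transmitter in reset), and 84 to TXRST plus RXEN.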

    #67676

    Ray
    Keymaster

    @Water_my_lawn: no, the buffer is not the issue. The actual ether_buffer is 2x the size of ETHER_BUFFER_SIZE:
    https://github.com/OpenSprinkler/OpenSprinkler-Firmware/blob/dev/219-5/main.cpp#L67
    and every client.read call is constrained to read at most ETHER_BUFFER_SIZE characters. Even in previous versions of the code, the buffer’s actual size was ETHER_BUFFER_SIZE+TMP_BUFFER_SIZE (which is 256). So I don’t think there is a buffer overflow.
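
    The argument above can be illustrated with a toy model. This is not the firmware's code: the constant's value is a placeholder chosen for the example, and `bounded_read` is a hypothetical stand-in for a read call that is capped at ETHER_BUFFER_SIZE.

```python
ETHER_BUFFER_SIZE = 1024  # placeholder value, not the firmware's constant


def bounded_read(buffer, incoming, limit=ETHER_BUFFER_SIZE):
    """Copy at most `limit` bytes of `incoming` into `buffer`; return the count."""
    n = min(len(incoming), limit, len(buffer))
    buffer[:n] = incoming[:n]  # same-length slice assignment, buffer never grows
    return n


# The backing store is twice the read limit, so a full read leaves headroom.
ether_buffer = bytearray(2 * ETHER_BUFFER_SIZE)
```

    Even when far more data is available than the limit, the copy stops at ETHER_BUFFER_SIZE bytes and the second half of the buffer is untouched.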

    #67689

    Dennis
    Participant

    I have owned an OpenSprinkler 2.3 DC for several years and love it. My controller is connected to the network using its Ethernet port; and the device started becoming inaccessible on the network every 2 to 3 days after upgrading to firmware 2.1.9 back in 2019. I was tempted to downgrade to 2.1.7; however I enjoy using the ET algorithm introduced in 2.1.9.

    For the last 8 months my workaround has been to plug the controller into a timer that powers down the device daily for 15 minutes. I have not been locked out of the device ever since. Anyhow, I want to give props to Ray for taking the time to research the ethernet lockups. I look forward to the day when I can remove the timer. Thank you!

    #67707

    Ray
    Keymaster

    @Dennis: thanks for posting the work-around.

    At the moment I am pretty lost as to what else to try. I have 3 test OS on my own network: one OS 2.3 DC and two OS 3.2 AC+Ethernet. Two are directly connected to my router and one is connected through a powerline Ethernet adapter. All three are running 2.1.9(5) and have been alive for more than 5 days since I flashed 2.1.9(5) onto them. I run the Test script (http://raysfiles.com/os/TestOSManual.html) on all three of them, with a browser tab open on the side to show homepage status, and have IFTTT set up to receive notifications on station runs. I have encountered a couple of cases where an IFTTT notification was missing (debugging information shows it wasn’t able to connect to the IFTTT server at that moment). But otherwise the 3 controllers have been running fine: no hanging, no issue accessing them.

    So the issue some of the other users are experiencing is pretty much beyond my knowledge; the only way to find out would be to go to their homes in person to debug it… The feature implemented in the 2.1.9(6) debug version, performing an Ethernet reset when the microcontroller detects that the ENC28J60 registers are in one of the erroneous states, is akin to performing a reboot, though it’s a softer reboot than a power-on reboot. But even this doesn’t seem to solve the problem for Water_my_lawn, and I have never seen some of the reported register values, like ECON1=80 or 84, on my controllers. In any case, I am pretty lost, and I have to move on to other priorities for now and come back to this when anyone has more insight into what’s happening.

    Regarding work-arounds, in addition to what Dennis mentioned (using a timer to trigger a power-on reboot once a day), I still think using a secondary router is an effective approach: there are inexpensive routers for less than $20, and it won’t affect access to the controller from the primary network as long as you set up port forwarding on the secondary router.

    Another work-around, for OS 3.2 users, is to try W5500 Ethernet module (https://opensprinkler.com/forums/topic/instructions-for-testing-os-3-2-with-w5500-ethernet-module/). Though, this only works for OS 3.2 and doesn’t work for OS 2.3.

    #67722

    Water_my_lawn
    Participant

    Hello Ray;

    Could you send me the OS files that you modified with the debug code? I would like to take
    a look at them and see if anything catches my eye. I know that I can get the standard source
    from github.

    Oddly, I have not had a hang since Aug 1.

    #67724

    Ray
    Keymaster

    The most recent code is in this branch:
    https://github.com/OpenSprinkler/OpenSprinkler-Firmware/tree/dev/219-7

    Apparently while working on revision (6) I turned off the MQTT loop by mistake, so I had to turn it back on, re-compile, and name it revision (7). You probably didn’t notice this unless you were using MQTT.

    #67745

    Water_my_lawn
    Participant

    I am trying to compile the source. Here is the procedure that I followed which
    is as close as possible to the procedure that you described. However I get
    compile errors.

    ————————————————————-

    #Get the code.
    git clone https://github.com/OpenSprinkler/OpenSprinkler-Firmware.git
    #Puts it in ~/OpenSprinkler-Firmware/

    #Get the ESP8266 for Arduino stuff.
    git clone https://github.com/esp8266/Arduino.git
    #Puts it in ~/Arduino

    git clone https://github.com/esp8266/Arduino.git esp8266_2.5.2
    #Puts it in ~/esp8266_2.5.2

    #Go into esp8266_2.5.2
    cd esp8266_2.5.2
    git checkout tags/2.5.2

    cd tools
    python get.py

    #Install necessary libraries, including SSD1306, RCSwitch, and UIPEthernet.
    #Download and unzip or git clone these into ~/Arduino/libraries folder.

    cd ~/Arduino/libraries
    git clone https://github.com/ThingPulse/esp8266-oled-ssd1306.git
    git clone https://github.com/sui77/rc-switch.git
    git clone https://github.com/UIPEthernet/UIPEthernet.git

    #And this one which is new.
    git clone https://github.com/knolleary/pubsubclient.git

    cd ~/OpenSprinkler-Firmware

    #There is an error in make.lin32:
    #Replace this line:
    ~/Arduino/libraries/SSD1306 \

    #with this line:
    ~/Arduino/libraries/esp8266-oled-ssd1306 \

    make -f make.lin32

    ———————————————–

    I get a series of errors like:

    home/peter/Arduino/libraries/ESP8266WiFi/src/BearSSLHelpers.h:149:34: error: ‘virtual const unsigned char* BearSSL::HashSHA256::oid()’ marked override, but does not override
    virtual const unsigned char *oid() override;

    /home/peter/Arduino/libraries/ESP8266WebServer/src/Parsing-impl.h:139:15: error: ‘class String’ has no member named ‘isEmpty’
    if (req.isEmpty()) break; //no more headers

    I suspect that there is a version mismatch somewhere.

    #67750

    Ray
    Keymaster

    One possibility is that some of the libraries have a ‘test’ folder, which needs to be deleted; otherwise it causes compilation errors. I think you didn’t post all the errors so I can’t be sure, but if you see any error pointing to some /test folder, then you should delete those test folders.
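
    A quick way to list such folders before deleting anything is sketched below. It assumes the libraries live side by side under one directory (such as the ~/Arduino/libraries folder used in this thread) and only looks one level down; the function name is made up.

```python
from pathlib import Path


def find_test_dirs(libraries_root):
    """Return 'test' or 'tests' folders found directly inside each library."""
    root = Path(libraries_root).expanduser()
    hits = []
    for library in sorted(p for p in root.iterdir() if p.is_dir()):
        for name in ("test", "tests"):
            candidate = library / name
            if candidate.is_dir():
                hits.append(candidate)
    return hits
```

    Calling `find_test_dirs("~/Arduino/libraries")` only reports candidates; removing them is left as a manual step so nothing is deleted by accident.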

    Also, the original UIPEthernet library does not contain some of the improvements that Stefan and I made. You should use this modified UIPEthernet library (my own fork, under the /fixes/dhcp branch):
    https://github.com/OpenSprinkler/UIPEthernet/tree/fixes/dhcp
    I’ve tried to submit a pull request, but the author did not seem to be accepting pull requests yet, so I had to keep it in my own fork.

    #67760

    Water_my_lawn
    Participant

    I updated the UIPEthernet library from your source but I get the same errors.
    I issue these commands from ~/OpenSprinkler-Firmware.

    make -f make.lin32 clean
    make -f make.lin32

    I have attached the full compiler output showing all the errors that I get.
    I don’t have any errors that refer to “test”.

    Thanks.

    #67765

    Ray
    Keymaster

    I don’t recall seeing this error before. But a couple of things may be related:

    1. As I said, you should delete ‘test’ or ‘tests’ folders in all relevant libraries, such as pubsubclient. I find them sometimes lead to compilation errors.
    2. This step you described:

    git clone https://github.com/esp8266/Arduino.git
    #Puts it in ~/Arduino

    I am not sure where this is coming from — you clone ESP core again into ~/esp8266_2.5.2 in the next step, so why duplicate the core into ~/Arduino? In the error messages you received, there is this line:
    “Arduino/libraries/ESP8266WiFi/src/WiFiClientSecureBearSSL.h”
    it implies that the compiler is looking for ESP8266 core files in ~/Arduino/libraries folder. This is bizarre, it should be looking for that in ~/esp8266_2.5.2 since that’s what you set as ESP_ROOT in the Makefile.
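
    One way to spot this kind of mix-up is to scan the compiler output for error locations that fall outside the directory set as ESP_ROOT. The sketch below assumes GCC-style `path:line:column: error:` lines with absolute paths; the function name and regular expression are my own, not part of the build system.

```python
import re
from pathlib import PurePosixPath

# Matches GCC-style diagnostics such as "/abs/path/file.h:139:15: error: ...".
ERROR_LINE = re.compile(r"^\s*(?P<path>/\S+?):\d+:\d+: (?:fatal )?error:")


def files_outside_root(compiler_output, expected_root):
    """List files named in error lines that are not located under expected_root."""
    root = PurePosixPath(expected_root)
    outside = []
    for line in compiler_output.splitlines():
        match = ERROR_LINE.match(line)
        if not match:
            continue
        path = PurePosixPath(match.group("path"))
        if root not in path.parents and str(path) not in outside:
            outside.append(str(path))
    return outside
```

    Run against the errors quoted earlier with `/home/peter/esp8266_2.5.2` as the expected root, the ESP8266WebServer header under ~/Arduino/libraries would be flagged, which is the symptom described above.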

    #67767

    rboer01
    Participant

    Hi Ray,

    Been using your latest firmware for a few days now. No more issues from my side.

    I’ll keep you posted.

    Brgds,

    Rik

    #67774

    Water_my_lawn
    Participant

    Perhaps I was reading your instructions too literally.
    Here are my updated instructions, which seem to work and
    produce the mainArduino.bin file. I have not tried it
    yet.

    Ps: I have not had a hang since Aug 1. No change to the
    firmware and no change on my network!
    —————————————————–

    #Get the code.
    git clone https://github.com/OpenSprinkler/OpenSprinkler-Firmware.git
    #Puts it in ~/OpenSprinkler-Firmware/

    #Get the Arduino code.
    git clone https://github.com/esp8266/Arduino.git esp8266_2.5.2
    #Puts it in ~/esp8266_2.5.2

    #Go into esp8266_2.5.2 and get the correct tag.
    cd esp8266_2.5.2
    git checkout tags/2.5.2

    cd tools
    python get.py

    #Install necessary libraries, including SSD1306, RCSwitch, and UIPEthernet.
    #Download and unzip or git clone these into ~/Arduino/libraries folder.

    mkdir -p ~/Arduino/libraries
    cd ~/Arduino/libraries
    git clone https://github.com/ThingPulse/esp8266-oled-ssd1306.git

    # The latest version of the OLED code is not compatible; fall back to 4.1.0.
    cd esp8266-oled-ssd1306
    git checkout tags/4.1.0
    # Return to ~/Arduino/libraries so the remaining clones land beside it.
    cd ..

    git clone https://github.com/sui77/rc-switch.git
    git clone https://github.com/UIPEthernet/UIPEthernet.git

    #And this one which is new.
    git clone https://github.com/knolleary/pubsubclient.git

    cd ~/OpenSprinkler-Firmware

    #There is an error in make.lin32:
    #Replace this line:
    ~/Arduino/libraries/SSD1306 \

    #with this line:
    ~/Arduino/libraries/esp8266-oled-ssd1306 \

    # Remove tests directory, will not compile.
    rm -rf ~/Arduino/libraries/pubsubclient/tests

    make -f make.lin32

    #67803

    rboer01
    Participant

    Dear,

    Today, after 4 days, I got a lockup.

    See attached picture of the display.

    Brgds,

    Rik

    #67831

    Water_my_lawn
    Participant

    I got the code and can compile it with debugging and load it successfully.
    Now I am ready to try some debugging.

    Here is my take on the situation:

    The ENC28J60 is not interrupt driven. There is an interrupt pin #2 on the
    connector but it is not connected to anything in the OS. It runs in polled
    mode.

    The OS continues to run normally, only the network interface is down. The
    polling loop in main.cpp runs OK because the sprinkler programs continue
    to run normally.

    The interface does not respond to a ping. ICMP packets are handled in the
    UIPEthernet driver, they never get into the OS code. There is no hardware
    support for ICMP packets.

    I suspect that the receive buffer fills and is not being cleared for some
    reason. One possible reason is that incoming packets arrive faster than
    the OS can digest them. Another possible reason could be some
    non-thread-safe code.
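
    The first hypothesis can be illustrated with a toy simulation. It models only a packet-count queue drained by a polling loop, not the ENC28J60's real byte-oriented buffer, and all names and numbers are made up for illustration.

```python
def simulate_rx_buffer(capacity, arrivals, drained_per_poll):
    """Simulate a polled receive queue.

    `arrivals` lists how many packets arrive between successive polls.
    Returns (packets still queued, packets dropped because the queue was full).
    """
    queued = 0
    dropped = 0
    for arriving in arrivals:
        free = capacity - queued
        accepted = min(arriving, free)
        dropped += arriving - accepted
        queued += accepted
        # One pass of the polling loop removes a bounded number of packets.
        queued -= min(queued, drained_per_poll)
    return queued, dropped
```

    With a capacity of 4 and one packet drained per poll, one arrival per poll never drops anything, while three arrivals per poll start dropping packets on the second poll and leave the queue nearly full from then on.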

    I am going to put some debug messages into a new version and try to catch
    the problem.

    I went 7 days without a hang then had 2 in succession.

    I have looked at the OLED debug messages for a number of these hang events and cannot
    identify a root cause.

    I would like to produce a new debug version and I will run it. I would like
    to have some volunteers that have had these problems. The code will otherwise
    be identical to Ray’s latest release.

    #67836

    bena
    Participant

    Although the symptoms I see are slightly different from yours and may be linked to the length or quality of my Ethernet wire, I’d be happy to try your debug code.

    #67847

    Water_my_lawn
    Participant

    I just had another hang. This time it was unusual: the web page was hung as with other
    hangs, but this time the OS responded to pings. The display showed 28|1|4|10.
    The only other time I saw a 28 was when the OS was running OK.

    #67851

    rboer01
    Participant

    Dear,

    Display now shows 8|1|4|13 after hang.

    Brgds,

    Rik

    #67952

    Water_my_lawn
    Participant

    I have produced a debug version of the code. It should operate no
    differently than the official release. I have added some debug
    information that will appear on a line above the standard display
    and a line that will appear below the standard display.

    The line above will contain 4 hex numbers. The first is the
    flag field of the current packet being handled. This will
    normally be zero.

    The second, third, and fourth are the packet count,
    the ICMP packet count, and the TCP packet count. These are
    only one-byte counters, so they roll over often. The ICMP
    count will be 0 until you ping the OS.
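
    Because the counters are a single byte, readings have to be compared modulo 256. The sketch below shows the wrap-around and a possible rendering of the four numbers; the helper names and the space-separated two-digit format are assumptions, since the exact display layout is not specified here.

```python
def bump(counter):
    """Increment a one-byte counter, wrapping from 255 back to 0."""
    return (counter + 1) & 0xFF


def counter_delta(before, after):
    """Packets counted between two readings, assuming fewer than 256 elapsed."""
    return (after - before) & 0xFF


def format_top_line(flags, packets, icmp, tcp):
    """Render the four values as two-digit hex numbers (layout is assumed)."""
    return " ".join(f"{value & 0xFF:02X}" for value in (flags, packets, icmp, tcp))
```

    For instance, a counter read as 250 and later as 3 has advanced by 9, provided it wrapped only once between the two readings.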

    Before you communicate with the OS, the top line will
    display “client”. That means that no client has established
    communications. Just point a web browser at the OS and
    the debug counters will appear.

    The bottom line will contain 4 numbers. These are state
    indicators for the 4 levels of code involved in the network
    communication using the ENC28J60 interface.

    I will run this firmware on my system and watch for a hang.
    If other people with the hang problem would like to help
    that would be great.

    If you get a hang I would like to get all the numbers.
    Take a photo to save writing them down.
    Generally a hang is indicated with a “Network error”
    message at the bottom of the OS web page.

    When you see that happen, send me the numbers. Sometimes
    I can ping the OS when it is hung but mostly ping will fail.
    If you ping it the numbers may change. Please also send
    the changed numbers.

    Try to refresh the web page. The numbers may change, if
    so then please send the changed numbers.

    I have attached the debug version of the firmware.
