Forum Replies Created

Viewing 25 posts - 1 through 25 (of 72 total)

  • Water_my_lawn
    Participant

    It is not uncommon for small changes in the code to result in masking a problem. This is not a fix and can make finding the actual
    problem a real trick. However, I have only been running for 2 months and in the past I have run for longer before the hang occurred.

    If I could connect a debugger or get a crash dump, I could fix this problem easily. As it is, I can only make small changes and
    hope the little bit of information extracted points to the problem.


    Water_my_lawn
    Participant

    I have run with the latest code, version 2.1.9(9), and had some hang conditions. I have been communicating with the author of EthernetENC and
    he made some suggestions on where to add some debug code. I have created a version of 2.1.9(9) with the debug code but have not hit a hang yet
    after 2 months of running. I am now debugging with EthernetENC rather than the older UIPEthernet. Both have the same author.

    I wish you well on your health issues.


    Water_my_lawn
    Participant

    I have run my fixed version without problems so far. Since Ray changed to EthernetENC from the previous UIPEthernet Ethernet
    stack, my fix will not help the current release. I have run the current release and continue to have the hang problem, so I suspect
    that there is a bug in the EthernetENC code as well.

    It seems that most people never run into this hang problem. I have it quite frequently. However, I don’t know how
    many people are actually running a wired Ethernet connection.

    I have opened an issue with the Ethernet stack here:
    https://github.com/JAndrassy/EthernetENC/issues/35


    Water_my_lawn
    Participant

    I have hit a few more hangs and have loaded my special fixed version based on 2.1.9 (7).
    I will see how that does.


    Water_my_lawn
    Participant

    Would you be interested in trying my fixed version? This is using the older network stack: UIPEthernet.
    There was an obvious (in hindsight) bug that the original author fixed. I tested it and it worked perfectly.
    However the firmware rev is 2.1.9 (7) which is a few months behind the current development.

    I can make it available if you would like to try it.
    Let me know.

    in reply to: Controller lockups / crashes with wired Ethernet module #72723

    Water_my_lawn
    Participant

    Are you running the latest firmware 2.1.9 (9)?


    Water_my_lawn
    Participant

    I am hitting an OS network error every few days using the latest firmware. This time it was a network hang but the OS worked OK using the buttons. It was not the boot loop hang that I had previously.

    Is anyone else using their OS with a hard-wired Ethernet connection?


    Water_my_lawn
    Participant

    I have caught another hang. The OS is locked in a boot loop. You can see it in this video:
    https://youtu.be/47hsE2BB1gw
    What is happening?

    in reply to: Controller lockups / crashes with wired Ethernet module #72634

    Water_my_lawn
    Participant

    I just had a network hang with firmware 2.1.9 (9) after 3 weeks running.
    This is using the ENC28J60 Ethernet adapter board.
    Since this has the new network stack perhaps there is a problem just
    like there was with the old (unpatched) stack.

    Has anyone else seen this?

    in reply to: Controller lockups / crashes with wired Ethernet module #72466

    Water_my_lawn
    Participant

    I have been running my firmware for over 3 months on the wired Ethernet connection without a crash.
    The only change made was a fix by Juraj Andrássy to UIPClient.cpp, which is part of the UIPEthernet
    Ethernet driver. The fix replaces (*this) with (data) in 4 places in the source file.
    This is one of those bugs that is obvious in hindsight.

    This bug would result in dereferencing a bad pointer when there was a zero length packet. Such
    a thing produces unpredictable behavior or a crash.

    However this is all moot since Ray moved to EthernetENC from the previous UIPEthernet Ethernet driver.

    in reply to: Captcha problem when trying to post. #71376

    Water_my_lawn
    Participant

    When I try to post I get that error in Linux but not Windows.
    Am I doing something wrong? Did something change? It used
    to work.

    in reply to: Captcha problem when trying to post. #71333

    Water_my_lawn
    Participant

    In case you were wondering how I could post here: this post is done using Firefox on Windows 10.
    I see the little Captcha box on the lower right but I did not have to solve any Captcha puzzle.

    in reply to: Controller lockups / crashes with wired Ethernet module #71165

    Water_my_lawn
    Participant

    Even though I have not posted for a while I am still working on the problem of the wired Ethernet
    connection hang. In discussions with the UIPEthernet developers I am convinced that their
    driver is OK. So I have changed my debug strategy.

    The essential problem seems to be that the main loop is not seeing Ethernet packets when
    they arrive. The loop queries if any data has arrived and the Ethernet driver always
    reports “no data”. This happens even when packets are arriving. Since the packet
    data is not read from the Ethernet chip buffer, the buffer fills and flags an overrun
    condition. Originally I thought that this overrun flag was the cause of the problem
    but now I see that it is just a result of the problem.

    It seems to me now that there must be data corruption in the RAM. UIPEthernet returns
    an incorrect result even though its code appears to be error free. Perhaps a buffer
    is overrunning its bounds. Perhaps there is an error in variable type casting.

    My new strategy is to dump the entire system RAM for the normal running state and dump
    it for the error state. I have a bunch of captures of both conditions.

    I have written a program that takes the map file produced by objdump and fills in each
    variable with the actual data from the RAM dump. This gives me the value of every variable
    that exists in the system. I can compare these results across the many captures from the
    good running systems. Any variables that differ between those captures are just the normal
    running state changing, so I edit them out. What is left is a list of the variables that
    don’t change in my configuration in my running system when it is working properly.

    I next process the bad state RAM dumps and compare them to the good state variables.
    This gives me a map of what is different between a good running system and a system
    with the Ethernet port hung. After this processing I find about 40 variables that
    are different in the bad state. This is out of 586 total variables in the map of
    all variables.
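The comparison described above can be sketched roughly as follows. This is a minimal illustration, not the actual program: it assumes each RAM capture has already been reduced to a map from variable name to raw bytes, and the names Snapshot, stable_variables and suspects are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// One RAM capture reduced to "variable name -> raw bytes" (hypothetical layout).
using Snapshot = std::map<std::string, std::vector<std::uint8_t>>;

// Keep only the variables whose bytes are identical in every good capture.
Snapshot stable_variables(const std::vector<Snapshot>& good) {
    Snapshot stable;
    if (good.empty()) return stable;
    stable = good.front();
    for (std::size_t i = 1; i < good.size(); ++i) {
        for (auto it = stable.begin(); it != stable.end();) {
            auto found = good[i].find(it->first);
            if (found == good[i].end() || found->second != it->second)
                it = stable.erase(it);  // value changed during normal running
            else
                ++it;
        }
    }
    return stable;
}

// Names of stable variables that differ in, or are missing from, a bad capture.
std::vector<std::string> suspects(const Snapshot& stable, const Snapshot& bad) {
    std::vector<std::string> out;
    for (const auto& kv : stable) {
        auto found = bad.find(kv.first);
        if (found == bad.end() || found->second != kv.second)
            out.push_back(kv.first);
    }
    return out;
}
```

Running stable_variables over all good captures and then suspects against each hung capture gives the short list of variables worth inspecting by hand.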

    At the moment I am pondering this result but have not come to a conclusion.

    It is interesting that the WiFi also seems to have a problem. Perhaps they are
    related.

    in reply to: Controller lockups / crashes with wired Ethernet module #70410

    Water_my_lawn
    Participant

    I have about 1 week of runtime with my debug code. I have caught a few errors.
    The first run showed 6 recoveries, then I power cycled the OS and now it shows
    3 recoveries. These mostly resulted in no errors showing on a web browser
    pointed at the OS. All of the errors were receive buffer overflows.

    I normally keep 2 browsers showing my OS, both Firefox; one running on Windows
    and one running on Linux. One time the browser on Windows showed “Network error”
    and would not refresh. At the same time the browser on Linux refreshed properly
    indicating that the OS Ethernet interface was not at fault. When I closed the
    browser tab and then reopened it, the OS web page came up OK.

    I suspect that there is some problem in the protocol between the browser and the OS.
    Perhaps the browser protocol is not robust enough to withstand the lost packets that
    will occur when the OS resets the Ethernet interface. This will necessarily
    result in lost packets and likely connection timeouts.

    I recommend that the recovery code for the ENC28J60 Ethernet interface be included
    in the standard release. The driver code for the ENC28J60 does not detect buffer
    overflow and does not have any error recovery code. A buffer overflow stops the
    processing of received packets and must be recovered by the system before it can
    resume normal processing.

    I suspect that there is a problem in the higher level protocol that communicates
    with the browser. It is possible that the protocol sometimes cannot recover
    in a situation where the channel is momentarily broken and a number of packets
    are lost. However that protocol is outside of my area of experience.

    Hope this helps,
    Pete.

    in reply to: Controller lockups / crashes with wired Ethernet module #70356

    Water_my_lawn
    Participant

    Well my hang condition went away for no known reason. I was able
    to capture one hang with my debug code and have the data from the hang.

    When hung this is the state of the registers that I am logging:
    EIR 0x09 TXIF (transmit done), RXERIF (receive aborted, buffer overrun)
    ESTAT 0x41 BUFER (read or write buffer error), CLKRDY (clock is OK)
    ECON1 0x04 RXEN (receive enable)
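For anyone reading these values off the display, a small decoder may help. This is a sketch only and is not part of the firmware: the bit names follow the ENC28J60 datasheet, while Flag, decode and the table names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// One named bit within an 8-bit register.
struct Flag {
    std::uint8_t mask;
    const char* name;
};

// Bit names per the ENC28J60 datasheet, highest bit first.
const Flag EIR_FLAGS[] = {
    {0x40, "PKTIF"}, {0x20, "DMAIF"},  {0x10, "LINKIF"},
    {0x08, "TXIF"},  {0x02, "TXERIF"}, {0x01, "RXERIF"},
};
const Flag ESTAT_FLAGS[] = {
    {0x80, "INT"},    {0x40, "BUFER"},  {0x10, "LATECOL"},
    {0x04, "RXBUSY"}, {0x02, "TXABRT"}, {0x01, "CLKRDY"},
};
const Flag ECON1_FLAGS[] = {
    {0x80, "TXRST"}, {0x40, "RXRST"}, {0x20, "DMAST"}, {0x10, "CSUMEN"},
    {0x08, "TXRTS"}, {0x04, "RXEN"},  {0x02, "BSEL1"}, {0x01, "BSEL0"},
};

// Return the names of all set bits, separated by single spaces.
template <std::size_t N>
std::string decode(std::uint8_t value, const Flag (&flags)[N]) {
    std::string out;
    for (const Flag& f : flags) {
        if (value & f.mask) {
            if (!out.empty()) out += ' ';
            out += f.name;
        }
    }
    return out;
}
```

With these tables, decode(0x09, EIR_FLAGS) gives "TXIF RXERIF" and decode(0x41, ESTAT_FLAGS) gives "BUFER CLKRDY", matching the values listed above.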

    At this stage the recovery counter (n_reinits) is at 3. This means that
    a hang condition has been detected and the recovery code has executed
    3 times but the Ethernet interface is still hung. This recovery code
    is not in the standard release code. Ray has it turned off.

    I turned the debug flag on which enabled the recovery code. I have
    added some additional logging code to further try to understand why
    the recovery process does not work. Otherwise this debug version
    is identical to the latest release of Ray’s firmware: 2.1.9 (7).

    I have attached a firmware binary with the additional debug logging.
    If anyone is experiencing the same hang with a hardwired Ethernet
    connection using the ENC28J60 module I ask that you would give my
    firmware a try and report back what it says.

    The debug code prints two lines on the OLED display. One line
    appears above the standard messages and the other line appears
    below the standard messages.

    The top line is formatted as such:
    XX|XX|XX|XX
    The XX is the value in the EIR register, the ESTAT register,
    the ECON1 register, and the recovery counter.

    The bottom line is formatted as such:
    XX|XX|XX XX|XX|XX
    The EIR register, ESTAT register, ECON1 register, the EIR register,
    ESTAT register, ECON1 register.
    The apparent duplication is because the registers are read two
    times at different places in the code.
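To make the two layouts concrete, here is a minimal sketch of how such lines could be built. It is an illustration under assumptions, not the code in the attached binary: two-digit uppercase hex is assumed for each XX field, and Regs, top_line and bottom_line are hypothetical names.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// One reading of the three ENC28J60 registers being logged.
struct Regs {
    std::uint8_t eir;
    std::uint8_t estat;
    std::uint8_t econ1;
};

// Top line: EIR|ESTAT|ECON1|recovery counter.
std::string top_line(const Regs& r, unsigned n_reinits) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "%02X|%02X|%02X|%02X",
                  static_cast<unsigned>(r.eir), static_cast<unsigned>(r.estat),
                  static_cast<unsigned>(r.econ1), n_reinits & 0xFFu);
    return buf;
}

// Bottom line: the same three registers sampled at two places in the code.
std::string bottom_line(const Regs& first, const Regs& second) {
    char buf[24];
    std::snprintf(buf, sizeof buf, "%02X|%02X|%02X %02X|%02X|%02X",
                  static_cast<unsigned>(first.eir), static_cast<unsigned>(first.estat),
                  static_cast<unsigned>(first.econ1), static_cast<unsigned>(second.eir),
                  static_cast<unsigned>(second.estat), static_cast<unsigned>(second.econ1));
    return buf;
}
```

For the hung state reported above, top_line({0x09, 0x41, 0x04}, 3) would read 09|41|04|03 under these assumptions.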

    If I could get all of this information after an Ethernet hang
    it would help me figure out this very elusive bug.

    Thanks.


    Water_my_lawn
    Participant

    I have repaired my OS with help from Ray (thanks Ray). I tried many more times
    to unbrick the ESP-12N but failed. Now I am back up.

    With the latest firmware 2.1.8 (7) and the connection using the ENC28J60 Ethernet
    adapter, it hangs frequently. I cannot make it through a single watering cycle
    without it hanging. When it is hung it will not respond to pings.

    This is actually a much better situation for debugging. Previously it might
    take more than a month to hang.

    My current working theory is that the ESP processor does not respond fast enough
    to prevent an Ethernet buffer overflow. I will add some code to detect this
    situation.

    in reply to: Error building firmware from source #70261

    Water_my_lawn
    Participant

    I have taken your fixes and updated my script.
    This script downloads all of the source and applies the necessary
    fixes to build the latest binary. This is currently: Firmware 2.1.9 (7).
    If this script is run in a new subdirectory it creates its own environment
    to build the binary. This allows you to have multiple build trees with
    their own build root.

    I have attached my script. I use Ubuntu 20.04 for development.

    in reply to: Feature request: Prevent OLED burn-in #70025

    Water_my_lawn
    Participant

    These OLED displays (SSD1306) are quite reliable and long lasting. This Russian guy ran a burn-in test on
    a bunch of the displays for over a year:
    https://www.youtube.com/watch?v=GWOFF5tMv_A&t=493s

    It seems that their life depends on brightness and time, not how frequently the pixels are changed.

    If you do replace the OLED display be careful to note positions of the power pins on the 4 pin header connector.
    The power pins are reversed on a lot of the displays.

    in reply to: Dead Coils #69909

    Water_my_lawn
    Participant

    Why are you applying DC voltage to your AC solenoids? This is likely to put too much current
    through the coils.

    in reply to: Controller lockups / crashes with wired Ethernet module #69512

    Water_my_lawn
    Participant

    Well I have done it this time, really bricked my system!

    I loaded an image with a bug that causes the system to crash and reboot. No big
    thing, I have done this a number of times. However my normal recovery scheme is
    not working this time. I have covered this previously and described the procedure.
    This time, no-go.

    I think I have identified the situation that causes the ENC28J60 Ethernet port to
    stop working. If the packets are not unloaded from the Ethernet chip fast enough
    the FIFO will fill and result in a receive error. This error must be deliberately
    cleared before the chip will return to normal operation.

    I am trying to figure where to go from here.

    in reply to: Error building firmware from source #69365

    Water_my_lawn
    Participant

    I see that my script is out of date!
    The lines with esp8266_2.5.2 need to be edited to pick up the 2.7.4 version.
    The issue with tick() must also be dealt with.

    in reply to: Error building firmware from source #69362

    Water_my_lawn
    Participant

    I wrote a script that does the download and necessary fixups.
    Make a build directory and run this script from there.
    Here it is:

    —————————————————–

    #!/bin/bash

    # This script downloads all of the source code necessary to build
    # the OpenSprinkler binary for the 3.0 hardware.
    # The script fixes a few places in the source downloads that are
    # needed before it will compile cleanly.
    # Finally it runs the make command.
    # This can be run in any directory and will set a local $HOME
    # This runs in Linux.

    # Remove any previous installation. You may or may not want to do this!
    rm -rf Arduino esp8266_2.5.2 OpenSprinkler-Firmware

    # Create a local $HOME for installation and build.
    export HOME=$(pwd)

    # Get the OpenSprinkler code.
    # Puts it in ~/OpenSprinkler-Firmware
    git clone https://github.com/OpenSprinkler/OpenSprinkler-Firmware.git

    # Get the Arduino code.
    # Puts it in ~/esp8266_2.5.2
    git clone https://github.com/esp8266/Arduino.git esp8266_2.5.2

    # Go into esp8266_2.5.2 and checkout the correct version.
    cd ~/esp8266_2.5.2
    git checkout tags/2.5.2

    # If necessary, install Python.
    # sudo apt install python

    # This Python script installs the xtensa compiler and tools.
    cd ~/esp8266_2.5.2/tools
    python get.py

    # Go back up the base level.
    #cd

    # Install necessary libraries, including SSD1306, RCSwitch, and UIPEthernet.
    # Download and unzip or git clone these into Arduino/libraries folder.
    mkdir -p ~/Arduino/libraries
    cd ~/Arduino/libraries

    # Get the library for the OLED display.
    git clone https://github.com/ThingPulse/esp8266-oled-ssd1306.git

    # The latest version of the OLED code is not compatible, so back up to 4.1.0.
    cd ~/Arduino/libraries/esp8266-oled-ssd1306
    git checkout tags/4.1.0

    cd ~/Arduino/libraries

    # Get some of the necessary pieces.
    git clone https://github.com/sui77/rc-switch.git
    git clone https://github.com/UIPEthernet/UIPEthernet.git
    git clone https://github.com/knolleary/pubsubclient.git

    # Remove tests directory as it will not compile but is unnecessary.
    rm -rf ~/Arduino/libraries/pubsubclient/tests

    # Go into the actual build location
    cd ~/OpenSprinkler-Firmware

    # There is a fixup needed in make.lin32:
    # This changes the OLED library to the correct one.
    sed -i s+Arduino/libraries/SSD1306+Arduino/libraries/esp8266-oled-ssd1306+ make.lin32

    # And finally build the final OpenSprinkler binary.
    make -f make.lin32

    echo "**********************************************************"
    echo 'To build in a directory other than your normal $HOME directory'
    echo 'run this command to set a local $HOME directory to your current directory:'
    echo '    export HOME=$(pwd)'
    echo "Next go into the build directory."
    echo "    cd ~/OpenSprinkler-Firmware"
    echo "Then run this command to build the binary:"
    echo "    make -f make.lin32"
    echo "**********************************************************"

    in reply to: Controller lockups / crashes with wired Ethernet module #69346

    Water_my_lawn
    Participant

    More updates: I have caught a few hang conditions. I have also had some bad luck!
    One time I accidentally kicked the power connector out when I was extracting state
    data during a hang condition thereby losing the state. The other day the power
    went out for a few hours when I had another hang condition that I was examining.

    Some hang conditions I have waited 2 months for, others I have caught in 1 or
    2 days. I have also bricked my OS a few times.

    Anyway, it seems that there are two conditions that produce network problems.
    The first is a receive buffer overflow. This is indicated by the ESTAT register
    bit 6 set to 1 and the EIR register bit 0 set to one. This condition does
    not clear itself.

    I suspect the cause is that the OS code does not poll the network layer
    fast enough to prevent a buffer overflow in all conditions. Since the ENC28J60
    chip does not have any internal packet processing it must be polled often enough
    to handle all packets that appear on the wire. This includes ARP and ICMP
    packets and packets that are not addressed to the OS. This condition only
    appears rarely so it need not be fixed in the OS code but there must
    be some recovery mechanism.

    The second fault is indicated by the ESTAT register being set to 0x13. This
    is a transmit “late collision error”. This condition also does not seem to
    clear itself.

    I will add this code in the do_loop() routine.

    if ((estat & ESTAT_ERROR) || (eir & EIR_ERROR)) {
        OpenSprinkler::start_ether();
    }

    This resets the entire Ethernet layer including the ENC28J60 chip.
    It should clear all these error conditions.
    I will run this for my next test and log the number of occurrences.
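A self-contained sketch of that check with a recovery counter is shown here. The mask values are assumptions: the post does not define ESTAT_ERROR or EIR_ERROR, so they are built from the bits named above (ESTAT bit 6 BUFER, ESTAT bit 4 LATECOL, EIR bit 0 RXERIF). The restart hook stands in for OpenSprinkler::start_ether(), and needs_reinit and check_and_recover are hypothetical names.

```cpp
#include <cstdint>

// Assumed masks, built from the error bits described in the post.
constexpr std::uint8_t ESTAT_BUFER   = 0x40;  // buffer error (bit 6)
constexpr std::uint8_t ESTAT_LATECOL = 0x10;  // late collision (bit 4)
constexpr std::uint8_t EIR_RXERIF    = 0x01;  // receive error (bit 0)

constexpr std::uint8_t ESTAT_ERROR = ESTAT_BUFER | ESTAT_LATECOL;
constexpr std::uint8_t EIR_ERROR   = EIR_RXERIF;

// True when either register shows one of the sticky error conditions.
bool needs_reinit(std::uint8_t estat, std::uint8_t eir) {
    return (estat & ESTAT_ERROR) != 0 || (eir & EIR_ERROR) != 0;
}

// Call from the main loop with freshly read register values.
// On error, bumps the counter and invokes the caller-supplied restart hook.
template <typename Restart>
bool check_and_recover(std::uint8_t estat, std::uint8_t eir,
                       unsigned& n_reinits, Restart restart) {
    if (!needs_reinit(estat, eir)) return false;
    ++n_reinits;
    restart();
    return true;
}
```

Keeping the predicate separate from the restart call makes it possible to log the number of occurrences without touching the Ethernet layer.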

    In looking through the code I find no memory leaks or data corruption.
    That is what I was expecting to find. I think that the problem
    results from a mismatch between the relatively slow ESP based OS
    code and the 1 Gigabit Ethernet. Packets can arrive too fast.
    The solution, I think, lies in a robust recovery of a rare error
    condition rather than to try to handle the packets faster.

    in reply to: Controller lockups / crashes with wired Ethernet module #69275

    Water_my_lawn
    Participant

    The reason for using 74880 is that the data initially sent out during booting
    is at that speed for ESP’s running at 26 MHz. If you are using an ESP at 40 MHz
    then the initial BAUD rate is 115200, which is a standard rate.
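The two rates are consistent with a fixed divisor scaled by crystal frequency: 115200 x 26 / 40 = 74880. A minimal sketch of that arithmetic, assuming the boot ROM programs its UART divisor as if a 40 MHz crystal were fitted (an assumption not stated above); boot_baud is a hypothetical helper name.

```cpp
// Effective boot-message baud rate for a given crystal frequency in MHz,
// assuming the ROM targets 115200 baud for a 40 MHz crystal.
constexpr long boot_baud(long crystal_mhz) {
    return 115200L * crystal_mhz / 40;
}

// Compile-time checks against the two rates mentioned in the post.
static_assert(boot_baud(26) == 74880, "26 MHz crystal");
static_assert(boot_baud(40) == 115200, "40 MHz crystal");
```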

    The ESP does go into an auto-baud mode after booting but auto-baud is tricky and not
    always reliable.

    I think that my problem with bricking my OS this time is that the ESP booted
    OK and then went into OS code which immediately panicked. This disrupted
    the auto-baud and prevented the flash utility from grabbing the ESP and taking
    control. By using 74880 the flash utility did not depend on the auto-baud
    being completed successfully.

    Not all USB to ASYNC adapters support arbitrary BAUD rates but the ones using
    the CH340 chip do. However you must also have a device driver that supports
    this mode; the standard Linux driver does not. The kernel module called ch341
    does support the CH340 chip and arbitrary BAUD rates.

    in reply to: Controller lockups / crashes with wired Ethernet module #69227

    Water_my_lawn
    Participant

    After waiting for 2 months for a network interface hang I added debug code to
    narrow the focus of my investigation. I started the next run and caught a
    hang after 2 days. This time the network interface, while hung, would respond
    to pings. The pings had a valid response ratio of about 10%. The bad
    response packets seemed to be corrupted. I watched the traffic with Wireshark.

    I further narrowed my debug code to focus more closely on the receive and
    transmit packet code. When I loaded my new code I totally bricked
    the device. The recovery method that I previously used with esptool.py
    did not work. The OS was transmitting data continuously from the ASYNC
    port but it would not autobaud so the data was just garbage.

    The default BAUD rate of the ESP8266 with a 26MHz oscillator is 74880 BAUD.
    This is non-standard and PuTTY does not support it even though my USB to ASYNC
    adapter does support that odd BAUD rate. I found a terminal emulator called
    miniterm.py which does support any BAUD rate.

    Using this I successfully received the data from the OS. This is what
    I got:
    ———————————
    ets Jan 8 2013,rst cause:2, boot mode:(3,6)

    load 0x4010f000, len 1384, room 16
    tail 8
    chksum 0x2d
    csum 0x2d
    v00000000
    ~ld
    Fatal exception 9(LoadStoreAlignmentCause):
    epc1=0x401014d7, epc2=0x00000000, epc3=0x00000000, excvaddr=0x0000000a, depc=0x00000000

    Exception (9):
    epc1=0x401014d7 epc2=0x00000000 epc3=0x00000000 excvaddr=0x0000000a depc=0x00000000

    >>>stack>>>

    ctx: sys
    sp: 3ffff8f0 end: 3fffffb0 offset: 01a0
    3ffffa90: 4024c3fa 3ffee5aa 3ffee5aa 3ffeed9c
    3ffffaa0: 4024c409 4024c3b6 40105450 c1781c9b
    3ffffab0: 00000000 400042db 40105712 000003fd
    3ffffac0: 000000ed 00000020 3fffff10 00000001
    3ffffad0: 4010570c 40105583 00000003 8667a4e3
    3ffffae0: ffffffff ffffffff ffff0002 00000000
    3ffffaf0: 00000000 00000000 00000000 00000000
    3ffffb00: 00000000 00000000 00000000 00000000
    3ffffb10: ffffffff 00ffffff 00000000 00000000
    3ffffb20: 00000000 00000000 00000000 00000000
    3ffffb30: 00000000 00000000 00000000 00000000
    etc…
    ——————————————-

    This was sent repeatedly at the maximum rate.
    The important message is:

    Fatal exception 9

    This means that a pointer expecting to read a 32 bit value
    is not word aligned. The compiler should not do this so
    perhaps the process of flashing my code had an error.
    The OS was initializing and taking an exception in
    a very tight loop.
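As a quick check of that reading, excvaddr in the dump is 0x0000000a, which is not a multiple of 4. The helper below is a hypothetical illustration, not part of the firmware; it only tests whether an address is aligned for an access of a given power-of-two size. An address that small may also indicate a null or uninitialized pointer plus a small offset, though the dump alone cannot confirm that.

```cpp
#include <cstdint>

// True if addr is a multiple of size, where size is a power of two (1, 2, 4, ...).
constexpr bool is_aligned(std::uint32_t addr, std::uint32_t size) {
    return (addr & (size - 1u)) == 0u;
}

// excvaddr from the exception dump is not aligned for a 32-bit load.
static_assert(!is_aligned(0x0000000aU, 4), "0x0a is not word aligned");
```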

    Since the OS was in this loop, the regular tools would
    not write new firmware. Even the esptool.py loader would
    not work. On the Espressif web site I found their tool,
    flash_download_tool_3.8.5.exe, for programming the device.
    That tool is really klugey but I did manage to over-write
    the flash with the OpenGarage binary. Then the OS did
    respond to IP address 192.168.4.1/update. Now I could
    fully recover.

    Now that the OS is back I will further zoom into the
    suspected area and hopefully fix this problem. This was a
    struggle, I thought that I had permanently bricked my OS!
