OpenSprinkler Forums Hardware Questions OpenSprinkler Controller lockups / crashes with wired Ethernet module

Viewing 11 posts - 126 through 136 (of 136 total)
  • #69165

    Water_my_lawn
    Participant

    An update:

    I have run since my last post and just now detected a network subsystem hang.
    With the debug code that I added I can see that the uip_process in the uip.c
    file is receiving packets but always drops them. The main OpenSprinkler
    code never gets the packets. There is clearly something wrong with packet
    handling since even ping does not work and ping does not involve the OS code.

    I have been communicating with jandrassy, one of the maintainers of the
    UIPEthernet code. That thread is here:
    https://github.com/UIPEthernet/UIPEthernet/issues/129

    He has been chasing a memory leak in this code.
    There is a memory heap manager for packet buffers called mempool.c.
    He suspects that the problem may lie there. Since I am seeing receive
    buffer overflow errors in the ENC28J60 chip, the problems could be related.
    I have added code to check for this and will start another run.
    It took 2 months to catch this error, so it may take a while to catch
    another one.

    #69171

    Ray
    Keymaster

    Thanks for chasing this down. I will be keeping an eye on the issue update. Thanks!

    #69227

    Water_my_lawn
    Participant

    After waiting for 2 months for a network interface hang I added debug code to
    narrow the focus of my investigation. I started the next run and caught a
    hang after 2 days. This time the network interface, while hung, would respond
    to pings. The pings had a valid response ratio of about 10%. The bad
    response packets seemed to be corrupted. I watched the traffic with Wireshark.

    I further narrowed my debug code to focus more closely on the received and
    transmitted packet code. When I loaded my new code I totally bricked
    the device. The recovery method that I previously used with esptool.py
    did not work. The OS was transmitting data continuously from the ASYNC
    port but it would not autobaud so the data was just garbage.

    The default BAUD rate of the ESP8266 with a 26MHz oscillator is 74880 BAUD.
    This is non-standard and PuTTY does not support it even though my USB to ASYNC
    adapter does support that odd BAUD rate. I found a terminal emulator called
    miniterm.py which does support any BAUD rate.

    Using this I successfully received the data from the OS. This is what
    I got:
    ———————————
    ets Jan 8 2013,rst cause:2, boot mode:(3,6)

    load 0x4010f000, len 1384, room 16
    tail 8
    chksum 0x2d
    csum 0x2d
    v00000000
    ~ld
    Fatal exception 9(LoadStoreAlignmentCause):
    epc1=0x401014d7, epc2=0x00000000, epc3=0x00000000, excvaddr=0x0000000a, depc=0x00000000

    Exception (9):
    epc1=0x401014d7 epc2=0x00000000 epc3=0x00000000 excvaddr=0x0000000a depc=0x00000000

    >>>stack>>>

    ctx: sys
    sp: 3ffff8f0 end: 3fffffb0 offset: 01a0
    3ffffa90: 4024c3fa 3ffee5aa 3ffee5aa 3ffeed9c
    3ffffaa0: 4024c409 4024c3b6 40105450 c1781c9b
    3ffffab0: 00000000 400042db 40105712 000003fd
    3ffffac0: 000000ed 00000020 3fffff10 00000001
    3ffffad0: 4010570c 40105583 00000003 8667a4e3
    3ffffae0: ffffffff ffffffff ffff0002 00000000
    3ffffaf0: 00000000 00000000 00000000 00000000
    3ffffb00: 00000000 00000000 00000000 00000000
    3ffffb10: ffffffff 00ffffff 00000000 00000000
    3ffffb20: 00000000 00000000 00000000 00000000
    3ffffb30: 00000000 00000000 00000000 00000000
    etc…
    ——————————————-

    This was sent repeatedly and at the maximum rate.
    The important message is:

    Fatal exception 9

    This means that a 32-bit load or store used an address that
    is not word aligned: the reported excvaddr, 0x0000000a, is not
    a multiple of 4. The compiler should not do this, so
    perhaps the process of flashing my code had an error.
    The OS was initializing and taking an exception in
    a very tight loop.
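
The alignment claim can be checked directly against the excvaddr value in the dump. The following is only a minimal sketch: the helper name `is_word_aligned` is made up for illustration and is not part of the firmware or of any ESP8266 SDK.

```cpp
#include <cstdint>

// Hypothetical helper: true when `addr` is a multiple of 4, which is what
// a 32-bit load or store requires on this CPU.
constexpr bool is_word_aligned(std::uint32_t addr) {
    return (addr & 0x3u) == 0;
}

// excvaddr as printed in the crash dump above.
constexpr std::uint32_t kExcVaddr = 0x0000000a;

// 0x0a = 10, and 10 % 4 == 2, so the access is misaligned.
static_assert(!is_word_aligned(kExcVaddr), "excvaddr is not word aligned");
```

The check only confirms that the faulting address is misaligned. It says nothing about why the code ended up using that address.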

    Since the OS was in this loop, the regular tools would
    not write new firmware. Even the esptool.py loader would
    not work. On the Espressif web site I found their tool,
    flash_download_tool_3.8.5.exe, for programming the device.
    That tool is really klugey but I did manage to over-write
    the flash with the OpenGarage binary. Then the OS did
    respond to IP address 192.168.4.1/update. Now I could
    fully recover.

    Now that the OS is back I will further zoom into the
    suspected area and hopefully fix this problem. This was a
    struggle, I thought that I had permanently bricked my OS!

    #69260

    Ray
    Keymaster

    I’ve never used the 74880 baud rate. Common baud rates for ESP8266 are: 115200, 230400, 460800, and 921600. Generally 230400 is pretty safe regardless of what auto-reset circuit there is; and 921600 is occasionally too fast for boards depending on the auto-reset circuit design.

    #69275

    Water_my_lawn
    Participant

    The reason for using 74880 is that the data initially sent out during booting
    is at that speed for ESPs running from a 26 MHz oscillator. If you are using an
    ESP with a 40 MHz oscillator then the initial BAUD rate is 115200, which is a
    standard rate.
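
The two rates are related by the ratio of the oscillator frequencies. The sketch below assumes the commonly given explanation, that the boot ROM programs the UART for 115200 baud as if a 40 MHz oscillator were fitted. The function name `boot_baud` is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: the baud rate that appears on the wire when the UART
// divisor was chosen for `assumed_mhz` but the oscillator runs at `actual_mhz`.
constexpr std::uint32_t boot_baud(std::uint32_t nominal_baud,
                                  std::uint32_t assumed_mhz,
                                  std::uint32_t actual_mhz) {
    return static_cast<std::uint32_t>(
        static_cast<std::uint64_t>(nominal_baud) * actual_mhz / assumed_mhz);
}

// 115200 * 26 / 40 = 74880
static_assert(boot_baud(115200, 40, 26) == 74880, "26 MHz boot rate");
static_assert(boot_baud(115200, 40, 40) == 115200, "40 MHz boot rate");
```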

    The ESP does go into an auto-baud mode after booting but auto-baud is tricky and not
    always reliable.

    I think that my problem with bricking my OS this time is that the ESP booted
    OK and then went into OS code which immediately panicked. This disrupted
    the auto-baud and prevented the flash utility from grabbing the ESP and taking
    control. By using 74880 the flash utility did not depend on the auto-baud
    being completed successfully.

    Not all USB to ASYNC adapters support arbitrary BAUD rates but the ones using
    the CH340 chip do. However, you must also have a device driver that supports
    this mode; the standard Linux driver does not. The kernel module called ch341
    does support the CH340 chip and arbitrary BAUD rates.

    #69346

    Water_my_lawn
    Participant

    More updates: I have caught a few hang conditions. I have also had some bad luck!
    One time I accidentally kicked the power connector out when I was extracting state
    data during a hang condition thereby losing the state. The other day the power
    went out for a few hours when I had another hang condition that I was examining.

    Some hang conditions I have waited 2 months for, others I have caught in 1 or
    2 days. I have also bricked my OS a few times.

    Anyway, it seems that there are two conditions that produce network problems.
    The first is a receive buffer overflow. This is indicated by the ESTAT register
    bit 6 (BUFER) set to 1 and the EIR register bit 0 (RXERIF) set to 1. This
    condition does not clear itself.

    I suspect the cause is that the OS code does not poll the network layer
    fast enough to prevent a buffer overflow in all conditions. Since the ENC28J60
    chip does not have any internal packet processing it must be polled often enough
    to handle all packets that appear on the wire. This includes ICMP and ARP
    packets and packets that are not addressed to the OS. This condition only
    appears rarely so it need not be fixed in the OS code but there must
    be some recovery mechanism.

    The second fault is indicated by the ESTAT register being set to 0x13
    (LATECOL, TXABRT and CLKRDY set). This is a transmit “late collision error”.
    This condition also does not seem to clear itself.

    I will add this code in the do_loop() routine.

    if ((estat & ESTAT_ERROR) || (eir & EIR_ERROR)) {
        OpenSprinkler::start_ether();
    }

    This resets the entire Ethernet layer including the ENC28J60 chip.
    It should clear all these error conditions.
    I will run this for my next test and log the number of occurrences.
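
The check above could be fleshed out roughly as follows. This is a sketch under stated assumptions, not the code that was actually run. The individual bit masks follow the ENC28J60 register map, but which bits go into `ESTAT_ERROR` and `EIR_ERROR` is a guess, because the post does not define them. `check_and_recover` and its `recoveries` counter are hypothetical names, and the reset action is passed in as a callback so that the sketch does not depend on `OpenSprinkler::start_ether()`.

```cpp
#include <cstdint>

// Individual flag bits from the ENC28J60 register map.
constexpr std::uint8_t EIR_RXERIF    = 0x01;  // receive error (buffer overrun)
constexpr std::uint8_t EIR_TXERIF    = 0x02;  // transmit error
constexpr std::uint8_t ESTAT_LATECOL = 0x10;  // late collision
constexpr std::uint8_t ESTAT_BUFER   = 0x40;  // buffer error

// ASSUMPTION: the post does not say which bits its masks contain.
constexpr std::uint8_t EIR_ERROR   = EIR_RXERIF | EIR_TXERIF;
constexpr std::uint8_t ESTAT_ERROR = ESTAT_BUFER | ESTAT_LATECOL;

// True when either register shows one of the error bits selected above.
constexpr bool ether_needs_reset(std::uint8_t estat, std::uint8_t eir) {
    return (estat & ESTAT_ERROR) != 0 || (eir & EIR_ERROR) != 0;
}

// Hypothetical wrapper: runs `reset` and counts the occurrence when an
// error is flagged. Returns true if a reset was performed.
template <typename ResetFn>
bool check_and_recover(std::uint8_t estat, std::uint8_t eir,
                       unsigned& recoveries, ResetFn reset) {
    if (!ether_needs_reset(estat, eir)) {
        return false;
    }
    reset();       // in the firmware this would be the Ethernet re-init
    ++recoveries;  // logged so the number of occurrences can be reported
    return true;
}
```

Keeping the decision (`ether_needs_reset`) separate from the action makes it possible to test the masks on a desktop machine against register values captured during a hang.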

    Looking through the code, I find no memory leaks or data corruption,
    although that is what I was expecting to find. I think that the problem
    results from a mismatch between the relatively slow ESP based OS
    code and the 1 Gigabit Ethernet. Packets can arrive too fast.
    The solution, I think, lies in a robust recovery of a rare error
    condition rather than to try to handle the packets faster.

    #69512

    Water_my_lawn
    Participant

    Well I have done it this time, really bricked my system!

    I loaded an image with a bug that causes the system to crash and reboot. No big
    thing, I have done this a number of times. However my normal recovery scheme is
    not working this time. I have covered this previously and described the procedure.
    This time, no-go.

    I think I have identified the situation that causes the ENC28J60 Ethernet port to
    stop working. If the packets are not unloaded from the Ethernet chip fast enough
    the FIFO will fill and result in the receive error. This error must be deliberately
    cleared before the chip will return to normal operation.

    I am trying to figure where to go from here.

    #70263

    Water_my_lawn
    Participant

    I have repaired my OS with help from Ray (thanks Ray). I tried many more times
    to unbrick the ESP-12N but failed. Now I am back up.

    With the latest firmware 2.1.8.(7) and the connection using the Ethernet adapter
    ENC28J60 it hangs frequently. I cannot make it through a single watering cycle
    without it hanging. When it is hung it will not respond to pings.

    This is actually a much better situation for debugging. Previously it might
    take more than a month to hang.

    My current working theory is that the ESP processor does not respond fast enough
    to prevent an Ethernet buffer overflow. I will add some code to detect this
    situation.

    #70356

    Water_my_lawn
    Participant

    Well my hang condition went away for no known reason. I was able
    to capture one hang with my debug code and have the data from the hang.

    When hung this is the state of the registers that I am logging:
    EIR 0x09 TXIF (transmit done), RXERIF (receive aborted, buffer overrun)
    ESTAT 0x41 BUFER (read or write buffer error), CLKRDY (clock is OK)
    ECON1 0x04 RXEN (receive enable)
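
The flag names above can be derived mechanically from the raw register values. The sketch below uses bit names and positions from the ENC28J60 register map. The `Flag` struct and the `decode` function are illustrative names, not part of UIPEthernet or of the debug firmware.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Illustrative type: one named bit in a register.
struct Flag {
    std::uint8_t mask;
    const char* name;
};

// Implemented bits of each register, most significant first.
constexpr Flag kEirFlags[] = {
    {0x40, "PKTIF"}, {0x20, "DMAIF"}, {0x10, "LINKIF"},
    {0x08, "TXIF"},  {0x02, "TXERIF"}, {0x01, "RXERIF"}};
constexpr Flag kEstatFlags[] = {
    {0x80, "INT"},    {0x40, "BUFER"},  {0x10, "LATECOL"},
    {0x04, "RXBUSY"}, {0x02, "TXABRT"}, {0x01, "CLKRDY"}};
constexpr Flag kEcon1Flags[] = {
    {0x80, "TXRST"}, {0x40, "RXRST"}, {0x20, "DMAST"}, {0x10, "CSUMEN"},
    {0x08, "TXRTS"}, {0x04, "RXEN"},  {0x02, "BSEL1"}, {0x01, "BSEL0"}};

// Returns the names of all set bits, joined with '|'.
template <std::size_t N>
std::string decode(std::uint8_t value, const Flag (&flags)[N]) {
    std::string out;
    for (const Flag& f : flags) {
        if (value & f.mask) {
            if (!out.empty()) {
                out += '|';
            }
            out += f.name;
        }
    }
    return out;
}
```

For example, `decode(0x09, kEirFlags)` yields the two EIR flags listed above, and `decode(0x41, kEstatFlags)` yields BUFER and CLKRDY.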

    At this stage the recovery counter (n_reinits) is at 3. This means that
    a hang condition has been detected and the recovery code has executed
    3 times but the Ethernet interface is still hung. This recovery code
    is not in the standard release code. Ray has it turned off.

    I turned the debug flag on which enabled the recovery code. I have
    added some additional logging code to further try to understand why
    the recovery process does not work. Otherwise this debug version
    is identical to the latest release of Ray’s firmware: 2.1.9 (7).

    I have attached a firmware binary with the additional debug logging.
    If anyone is experiencing the same hang with a hardwired Ethernet
    connection using the ENC28J60 module I ask that you would give my
    firmware a try and report back what it says.

    The debug code prints two lines on the OLED display. One line
    appears above the standard messages and the other line appears
    below the standard messages.

    The top line is formatted as such:
    XX|XX|XX|XX
    The XX is the value in the EIR register, the ESTAT register,
    the ECON1 register, and the recovery counter.

    The bottom line is formatted as such:
    XX|XX|XX XX|XX|XX
    The EIR register, ESTAT register, ECON1 register, the EIR register,
    ESTAT register, ECON1 register.
    The apparent duplication is because the registers are read two
    times at different places in the code.
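
Anyone transcribing the top line from the display could turn it back into numbers with something like the sketch below. It assumes that every field, including the recovery counter, is printed as two hexadecimal digits, which the post does not state explicitly. `DebugTopLine` and `parse_top_line` are hypothetical names.

```cpp
#include <cstdio>
#include <string>

// Hypothetical container for the four fields of the top OLED line.
struct DebugTopLine {
    unsigned eir;
    unsigned estat;
    unsigned econ1;
    unsigned recoveries;
};

// Parses "XX|XX|XX|XX". ASSUMPTION: each XX is two hex digits.
// Returns false if fewer than four fields are present or if anything
// follows the fourth field.
bool parse_top_line(const std::string& line, DebugTopLine& out) {
    char trailing = 0;
    const int n = std::sscanf(line.c_str(), "%2x|%2x|%2x|%2x%c",
                              &out.eir, &out.estat, &out.econ1,
                              &out.recoveries, &trailing);
    return n == 4;
}
```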

    If I could get all of this information after an Ethernet hang
    it would help me figure out this very elusive bug.

    Thanks.

    #70410

    Water_my_lawn
    Participant

    I have about 1 week of runtime with my debug code. I have caught a few errors.
    The first run showed 6 recoveries, then I power cycled the OS and now it shows
    3 recoveries. These mostly resulted in no errors showing on a web browser
    pointed at the OS. All of the errors were receive buffer overflows.

    I normally keep 2 browsers showing my OS, both Firefox; one running on Windows
    and one running on Linux. One time the browser on Windows showed “Network error”
    and would not refresh. At the same time the browser on Linux refreshed properly
    indicating that the OS Ethernet interface was not at fault. When I closed the
    browser tab and then reopened it, the OS web page came up OK.

    I suspect that there is some problem in the protocol between the browser and the OS.
    Perhaps the browser protocol is not robust enough to withstand the lost packets that
    will occur when the OS resets the Ethernet interface. This will necessarily
    result in lost packets and likely connection timeouts.

    I recommend that the recovery code for the ENC28J60 Ethernet interface be included
    in the standard release. The driver code for the ENC28J60 does not detect buffer
    overflow and does not have any error recovery code. A buffer overflow stops the
    processing of received packets and must be recovered by the system before it can
    resume normal processing.

    I suspect that there is a problem in the higher level protocol that communicates
    with the browser. It is possible that the protocol sometimes cannot recover
    in a situation where the channel is momentarily broken and a number of packets
    are lost. However that protocol is outside of my area of experience.

    Hope this helps,
    Pete.

    #71165

    Water_my_lawn
    Participant

    Even though I have not posted for a while I am still working on the problem of the wired Ethernet
    connection hang. In discussions with the UIPEthernet developers I am convinced that their
    driver is OK. So I have changed my debug strategy.

    The essential problem seems to be that the main loop is not seeing Ethernet packets when
    they arrive. The loop queries if any data has arrived and the Ethernet driver always
    reports “no data”. This happens even when packets are arriving. Since the packet
    data is not read from the Ethernet chip buffer, the buffer fills and flags an overrun
    condition. Originally I thought that this overrun flag was the cause of the problem
    but now I see that it is just a result of the problem.

    It seems to me now that there must be data corruption in the RAM. The UIPEthernet returns
    an incorrect result but the UIPEthernet appears to be error free. Perhaps a buffer
    is over-running its bounds. Perhaps there is an error in variable type casting.

    My new strategy is to dump the entire system RAM for the normal running state and dump
    it for the error state. I have a bunch of captures of both conditions.

    I have written a program that takes the map file produced by objdump and fills out each
    variable with the actual data from the RAM dump. This gives me the value of all variables
    that exist in the system. I can compare these results from the many captures from the
    good running systems. Any differences in the variables will be just the normal running
    state changing. I edit these variables out. What is left is a list of the variables that
    don’t change in my configuration in my running system when it is working properly.

    I next process the bad state RAM dumps and compare them to the good state variables.
    This gives me a map of what is different between a good running system and a system
    with the Ethernet port hung. After this processing I find about 40 variables that
    are different in the bad state. This is out of 586 total variables in the map of
    all variables.
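
The two-stage comparison described above can be sketched as follows. This is not the program mentioned in the post. It assumes each RAM dump has already been reduced to a map from variable name to a single 32-bit value, which is a simplification, and `Snapshot`, `stable_variables` and `suspects` are made-up names.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// ASSUMPTION: one 32-bit value per variable; real variables vary in size.
using Snapshot = std::map<std::string, std::uint32_t>;

// Stage 1: keep only variables with the same value in every good dump.
Snapshot stable_variables(const std::vector<Snapshot>& good) {
    Snapshot stable;
    if (good.empty()) {
        return stable;
    }
    stable = good.front();
    for (std::size_t i = 1; i < good.size(); ++i) {
        for (auto it = stable.begin(); it != stable.end();) {
            const auto found = good[i].find(it->first);
            if (found == good[i].end() || found->second != it->second) {
                it = stable.erase(it);  // changes during normal running
            } else {
                ++it;
            }
        }
    }
    return stable;
}

// Stage 2: names of stable variables whose value differs in a hung dump.
std::vector<std::string> suspects(const Snapshot& stable, const Snapshot& bad) {
    std::vector<std::string> names;
    for (const auto& kv : stable) {
        const auto found = bad.find(kv.first);
        if (found != bad.end() && found->second != kv.second) {
            names.push_back(kv.first);
        }
    }
    return names;
}
```

The more good dumps go into stage 1, the fewer normally-changing variables survive to show up as false suspects in stage 2.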

    At the moment I am pondering this result but have not come to a conclusion.

    It is interesting that the WiFi also seems to have a problem. Perhaps they are
    related.
