OpenSprinkler Forums Hardware Questions OpenSprinkler Controller lockups / crashes with wired Ethernet module



    An update:

    I have kept the test running since my last post and just now detected a network
    subsystem hang. With the debug code that I added I can see that uip_process in
    uip.c is receiving packets but always drops them, so the main OpenSprinkler
    code never gets them. Something is clearly wrong in packet handling below the
    OS code, since even ping does not work and ping does not involve the OS code.

    I have been communicating with jandrassy, one of the maintainers of the
    UIPEthernet code. That thread is here:

    He has been chasing a memory leak in this code.
    There is a memory heap manager for packet buffers called mempool.c.
    He suspects that the problem may lie there. Since I am seeing receive
    buffer overflow errors in the ENC28J60 chip, the problems could be related.
    I have added code to check for this and will start another run.
    It took 2 months to catch this error, so it may take a while to catch another.



    Thanks for chasing this down. I will be keeping an eye on the issue update. Thanks!



    After waiting for 2 months for a network interface hang I added debug code to
    narrow the focus of my investigation. I started the next run and caught a
    hang after 2 days. This time the network interface, while hung, would respond
    to pings. The pings had a valid response ratio of about 10%. The bad
    response packets seemed to be corrupted. I watched the traffic with Wireshark.

    I further narrowed my debug code to focus more closely on the receive and
    transmit packet code. When I loaded my new code I totally bricked
    the device. The recovery method that I previously used with
    did not work. The OS was transmitting data continuously from the ASYNC
    port but it would not autobaud so the data was just garbage.

    The default BAUD rate of the ESP8266 with a 26MHz oscillator is 74880 BAUD.
    This is non-standard and Putty does not support it even though my USB to ASYNC
    adapter does support that odd BAUD rate. I found a terminal emulator called which does support any BAUD rate.

    Using this I successfully received the data from the OS. This is what
    I got:
    ets Jan 8 2013,rst cause:2, boot mode:(3,6)

    load 0x4010f000, len 1384, room 16
    tail 8
    chksum 0x2d
    csum 0x2d
    Fatal exception 9(LoadStoreAlignmentCause):
    epc1=0x401014d7, epc2=0x00000000, epc3=0x00000000, excvaddr=0x0000000a, depc=0x00000000

    Exception (9):
    epc1=0x401014d7 epc2=0x00000000 epc3=0x00000000 excvaddr=0x0000000a depc=0x00000000


    ctx: sys
    sp: 3ffff8f0 end: 3fffffb0 offset: 01a0
    3ffffa90: 4024c3fa 3ffee5aa 3ffee5aa 3ffeed9c
    3ffffaa0: 4024c409 4024c3b6 40105450 c1781c9b
    3ffffab0: 00000000 400042db 40105712 000003fd
    3ffffac0: 000000ed 00000020 3fffff10 00000001
    3ffffad0: 4010570c 40105583 00000003 8667a4e3
    3ffffae0: ffffffff ffffffff ffff0002 00000000
    3ffffaf0: 00000000 00000000 00000000 00000000
    3ffffb00: 00000000 00000000 00000000 00000000
    3ffffb10: ffffffff 00ffffff 00000000 00000000
    3ffffb20: 00000000 00000000 00000000 00000000
    3ffffb30: 00000000 00000000 00000000 00000000

    This was sent repeatedly at the maximum rate.
    The important message is:

    Fatal exception 9

    This means that a pointer used to read a 32-bit value
    is not word aligned. The compiler should not do this, so
    perhaps the process of flashing my code had an error.
    The OS was initializing and taking an exception in
    a very tight loop.
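
The alignment claim can be checked against the dump directly. This is only a sketch: is_aligned is a made-up helper name, and it assumes that excvaddr in the dump is the address the failing access used.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not part of any ESP8266 SDK.
 * Returns true when addr is a multiple of size.
 * size must be a power of two (1, 2, 4, ...). */
bool is_aligned(uint32_t addr, uint32_t size)
{
    return (addr & (size - 1u)) == 0u;
}
```

With excvaddr=0x0000000a from the dump, is_aligned(0x0000000a, 4) is false because 10 is not a multiple of 4, which matches a 32-bit access raising exception 9. An address this small might also be an offset from a null pointer, though the dump alone does not establish that.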

    Since the OS was in this loop, the regular tools would
    not write new firmware. Even the loader would
    not work. On the Espressif web site I found their tool,
    flash_download_tool_3.8.5.exe, for programming the device.
    That tool is really kludgy but I did manage to overwrite
    the flash with the OpenGarage binary. Then the OS did
    respond to IP address Now I could
    fully recover.

    Now that the OS is back I will zoom further into the
    suspected area and hopefully fix this problem. This was a
    struggle. I thought that I had permanently bricked my OS!



    I’ve never used 74880 baud rate. Common baud rates for ESP8266 are: 115200, 230400, 460800, and 921600. Generally 230400 is pretty safe regardless of what auto-reset circuit there is; and 921600 is occasionally too fast for boards depending on the auto-reset circuit design.



    The reason for using 74880 is that the data initially sent out during booting
    is at that speed for ESP’s running at 26 MHz. If you are using an ESP at 40 MHz
    then the initial BAUD rate is 115200, which is a standard rate.
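
The two figures above are consistent with simple proportional scaling: 115200 × 26 / 40 = 74880. A minimal sketch of that arithmetic follows. The helper name boot_baud and the two constants are made up for illustration, and the assumption that the boot code targets 115200 for a 40 MHz clock is inferred from the numbers in this post rather than taken from Espressif documentation.

```c
#include <stdint.h>

/* Assumed reference point: 115200 baud when the crystal is 40 MHz. */
#define NOMINAL_XTAL_HZ 40000000u
#define NOMINAL_BAUD    115200u

/* Hypothetical helper: boot-message baud rate for a given crystal,
 * scaled in proportion to the crystal frequency. */
uint32_t boot_baud(uint32_t xtal_hz)
{
    /* 64-bit intermediate so NOMINAL_BAUD * xtal_hz cannot overflow. */
    return (uint32_t)(((uint64_t)NOMINAL_BAUD * xtal_hz) / NOMINAL_XTAL_HZ);
}
```

Under these assumptions boot_baud(26000000) gives 74880 and boot_baud(40000000) gives 115200.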

    The ESP does go into an auto-baud mode after booting but auto-baud is tricky and not
    always reliable.

    I think that my problem with bricking my OS this time is that the ESP booted
    OK and then went into OS code which immediately panicked. This disrupted
    the auto-baud and prevented the flash utility from grabbing the ESP and taking
    control. By using 74880 the flash utility did not depend on the auto-baud
    being completed successfully.

    Not all USB to ASYNC adapters support arbitrary BAUD rates, but the ones using
    the CH340 chip do. However, you must also have a device driver that supports
    this mode; the standard Linux driver does not. The kernel module called ch341
    does support the CH340 chip and arbitrary BAUD rates.



    More updates: I have caught a few hang conditions. I have also had some bad luck!
    One time I accidentally kicked the power connector out when I was extracting state
    data during a hang condition thereby losing the state. The other day the power
    went out for a few hours when I had another hang condition that I was examining.

    Some hang conditions I have waited 2 months for, others I have caught in 1 or
    2 days. I have also bricked my OS a few times.

    Anyway, it seems that there are two conditions that produce network problems.
    The first is a receive buffer overflow. This is indicated by bit 6 of the ESTAT
    register (BUFER) and bit 0 of the EIR register (RXERIF) both being set to 1.
    This condition does not clear itself.

    I suspect the cause is that the OS code does not poll the network layer
    fast enough to prevent a buffer overflow in all conditions. Since the ENC28J60
    chip does not have any internal packet processing it must be polled often enough
    to handle all packets that appear on the wire. This includes ARP and ICMP
    packets and packets that are not addressed to the OS. This condition
    appears only rarely, so the polling rate need not be fixed in the OS code, but
    there must be some recovery mechanism.

    The second fault is indicated by the ESTAT register being set to 0x13. This
    is a transmit “late collision error”. This condition also does not seem to
    clear itself.
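
Both fault signatures can be written as bit tests. The sketch below uses the ESTAT and EIR bit names from the ENC28J60 data sheet. The two function names are made up for illustration and are not taken from UIPEthernet or the OS code.

```c
#include <stdbool.h>
#include <stdint.h>

/* ESTAT bits (ENC28J60 data sheet names). */
#define ESTAT_BUFER   0x40u  /* bit 6: buffer error */
#define ESTAT_LATECOL 0x10u  /* bit 4: late collision */
#define ESTAT_TXABRT  0x02u  /* bit 1: transmit aborted */
#define ESTAT_CLKRDY  0x01u  /* bit 0: oscillator ready */

/* EIR bits. */
#define EIR_RXERIF    0x01u  /* bit 0: receive error interrupt flag */

/* First fault: ESTAT bit 6 and EIR bit 0 both set. */
bool rx_overflow(uint8_t estat, uint8_t eir)
{
    return (estat & ESTAT_BUFER) != 0u && (eir & EIR_RXERIF) != 0u;
}

/* Second fault: late collision flag set in ESTAT. */
bool tx_late_collision(uint8_t estat)
{
    return (estat & ESTAT_LATECOL) != 0u;
}
```

The observed value 0x13 breaks down as LATECOL (0x10) + TXABRT (0x02) + CLKRDY (0x01), so tx_late_collision(0x13) is true while the buffer error bit is clear.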

    I will add this code in the do_loop() routine.

    if ((estat & ESTAT_ERROR) || (eir & EIR_ERROR)) {

    This resets the entire Ethernet layer including the ENC28J60 chip.
    It should clear all these error conditions.
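
As a rough sketch of how the check and the occurrence count could fit together: the mask values below are assumptions, since the post does not show how ESTAT_ERROR and EIR_ERROR are defined, and struct eth_monitor and eth_needs_reset are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed masks covering the two faults described in this post. */
#define ESTAT_ERROR (0x40u | 0x10u)  /* BUFER | LATECOL (assumption) */
#define EIR_ERROR   0x01u            /* RXERIF (assumption) */

/* Hypothetical bookkeeping for logging how often recovery fires. */
struct eth_monitor {
    unsigned long resets;
};

/* Returns true when the caller should reset the Ethernet layer,
 * and counts the occurrence. Register reads are left to the caller. */
bool eth_needs_reset(struct eth_monitor *m, uint8_t estat, uint8_t eir)
{
    if ((estat & ESTAT_ERROR) || (eir & EIR_ERROR)) {
        m->resets++;
        return true;
    }
    return false;
}
```

In do_loop() the caller would read ESTAT and EIR from the chip, pass them in, and perform the actual reset when the function returns true. Keeping the register reads and the reset outside the function lets the decision logic be exercised without hardware.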
    I will run this for my next test and log the number of occurrences.

    In looking through the code I find no memory leaks or data corruption,
    although that is what I was expecting to find. I think that the problem
    results from a mismatch between the relatively slow ESP-based OS
    code and the 1 Gigabit Ethernet. Packets can arrive too fast.
    The solution, I think, lies in robust recovery from a rare error
    condition rather than in trying to handle the packets faster.
