Controller lockups / crashes with wired Ethernet module

Tagged: Controller lock up hang crash

This topic has 177 replies, 21 voices, and was last updated 2 weeks, 4 days ago by StephenOz.

Viewing 25 posts - 126 through 150 (of 176 total)

Author

Posts
February 1, 2021 at 9:37 am #69165

Water_my_lawn
Participant

An update:

I have run since my last post and just now detected a network subsystem hang.
With the debug code that I added I can see that the uip_process in the uip.c
file is receiving packets but always drops them. The main OpenSprinkler
code never gets the packets. There is clearly something wrong with packet
handling since even ping does not work and ping does not involve the OS code.

I have been communicating with jandrassy, one of the maintainers of the
UIPEthernet code. That thread is here:
https://github.com/UIPEthernet/UIPEthernet/issues/129

He has been chasing a memory leak in this code.
There is a memory heap manager for packet buffers called mempool.c.
He suspects that the problem may lie there. Since I am seeing receive
buffer overflow errors in the ENC28J60 chip, the problems could be related.
I have added code to check for this and will start another run.
It took 2 months to catch this error, so it may take a while to catch
another one.

February 2, 2021 at 11:26 am #69171

Ray
Keymaster

Thanks for chasing this down. I will be keeping an eye on the issue update. Thanks!

February 12, 2021 at 11:21 pm #69227

Water_my_lawn
Participant

After waiting for 2 months for a network interface hang I added debug code to
narrow the focus of my investigation. I started the next run and caught a
hang after 2 days. This time the network interface, while hung, would respond
to pings. The pings had a valid response ratio of about 10%. The bad
response packets seemed to be corrupted. I watched the traffic with Wireshark.

I further narrowed my debug code to focus more closely on the received and
transmitted packets code. When I loaded my new code I totally bricked
the device. The recovery method that I previously used with esptool.py
did not work. The OS was transmitting data continuously from the ASYNC
port but it would not autobaud so the data was just garbage.

The default BAUD rate of the ESP8266 with a 26MHz oscillator is 74880 BAUD.
This is non-standard and Putty does not support it even though my USB to ASYNC
adapter does support that odd BAUD rate. I found a terminal emulator called
miniterm.py which does support any BAUD rate.

Using this I successfully received the data from the OS. This is what
I got:
———————————
ets Jan 8 2013,rst cause:2, boot mode:(3,6)

load 0x4010f000, len 1384, room 16
tail 8
chksum 0x2d
csum 0x2d
v00000000
~ld
Fatal exception 9(LoadStoreAlignmentCause):
epc1=0x401014d7, epc2=0x00000000, epc3=0x00000000, excvaddr=0x0000000a, depc=0x00000000

Exception (9):
epc1=0x401014d7 epc2=0x00000000 epc3=0x00000000 excvaddr=0x0000000a depc=0x00000000

>>>stack>>>

ctx: sys
sp: 3ffff8f0 end: 3fffffb0 offset: 01a0
3ffffa90: 4024c3fa 3ffee5aa 3ffee5aa 3ffeed9c
3ffffaa0: 4024c409 4024c3b6 40105450 c1781c9b
3ffffab0: 00000000 400042db 40105712 000003fd
3ffffac0: 000000ed 00000020 3fffff10 00000001
3ffffad0: 4010570c 40105583 00000003 8667a4e3
3ffffae0: ffffffff ffffffff ffff0002 00000000
3ffffaf0: 00000000 00000000 00000000 00000000
3ffffb00: 00000000 00000000 00000000 00000000
3ffffb10: ffffffff 00ffffff 00000000 00000000
3ffffb20: 00000000 00000000 00000000 00000000
3ffffb30: 00000000 00000000 00000000 00000000
etc…
——————————————-

This was sent repeatedly and at the maximum rate.
The important message is:

Fatal exception 9

This means that a pointer expecting to read a 32 bit value
is not word aligned. The compiler should not do this so
perhaps the process of flashing my code had an error.
The OS was initializing and taking an exception in
a very tight loop.
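
To illustrate the alignment rule behind exception 9: the address reported in excvaddr above, 0x0000000a, is not a multiple of 4, so a direct 32-bit load from it would fault. The sketch below shows an alignment check and a read that is safe at any address. The helper names are hypothetical and are not taken from the OpenSprinkler firmware.

```cpp
#include <cstdint>
#include <cstring>

// True when an address can be dereferenced directly as a 32-bit word.
inline bool is_word_aligned(uintptr_t addr) {
    return (addr % 4u) == 0;
}

// Read a 32-bit value from a position that may not be 4-byte aligned.
// Copying through memcpy avoids dereferencing a misaligned uint32_t pointer,
// which is the kind of access that raises LoadStoreAlignmentCause.
inline uint32_t read_u32_unaligned(const uint8_t* p) {
    uint32_t value;
    std::memcpy(&value, p, sizeof value);
    return value;
}
```

With these helpers, is_word_aligned(0x0000000a) is false, while a stack address from the dump such as 0x3ffffa90 is aligned.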

Since the OS was in this loop, the regular tools would
not write new firmware. Even the esptool.py loader would
not work. On the Espressif web site I found their tool,
flash_download_tool_3.8.5.exe, for programming the device.
That tool is really klugey but I did manage to over-write
the flash with the OpenGarage binary. Then the OS did
respond to IP address 192.168.4.1/update. Now I could
fully recover.

Now that the OS is back I will further zoom into the
suspected area and hopefully fix this problem. This was a
struggle, I thought that I had permanently bricked my OS!

February 16, 2021 at 9:56 am #69260

Ray
Keymaster

I’ve never used 74880 baud rate. Common baud rates for ESP8266 are: 115200, 230400, 460800, and 921600. Generally 230400 is pretty safe regardless of what auto-reset circuit there is; and 921600 is occasionally too fast for boards depending on the auto-reset circuit design.

February 18, 2021 at 7:34 am #69275

Water_my_lawn
Participant

The reason for using 74880 is that the data initially sent out during booting
is at that speed for ESP’s running at 26 MHz. If you are using an ESP at 40 MHz
then the initial BAUD rate is 115200, which is a standard rate.
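
The two rates are related by simple arithmetic. The sketch below assumes that the boot code uses a fixed UART divider chosen for 40 MHz, so the resulting rate scales linearly with the clock. That scaling is an assumption used for illustration, and the function name is hypothetical.

```cpp
#include <cstdint>

// Rate produced at 40 MHz, as stated in the post above.
constexpr uint32_t kNominalBaud = 115200;
constexpr uint32_t kNominalClockMHz = 40;

// Boot-message baud rate for a given clock, assuming the divider stays
// fixed and the rate therefore scales linearly with the clock frequency.
constexpr uint32_t boot_baud(uint32_t clock_mhz) {
    return kNominalBaud * clock_mhz / kNominalClockMHz;
}

static_assert(boot_baud(26) == 74880, "26 MHz gives the non-standard rate");
static_assert(boot_baud(40) == 115200, "40 MHz gives the standard rate");
```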

The ESP does go into an auto-baud mode after booting but auto-baud is tricky and not
always reliable.

I think that my problem with bricking my OS this time is that the ESP booted
OK and then went into OS code which immediately panicked. This disrupted
the auto-baud and prevented the flash utility from grabbing the ESP and taking
control. By using 74880 the flash utility did not depend on the auto-baud
being completed successfully.

Not all USB to ASYNC adapters support arbitrary BAUD rates, but the ones using
the CH340 chip do. However, you must also have a device driver that supports
this mode; the standard Linux driver does not. The kernel module called ch341
does support the CH340 chip and arbitrary BAUD rates.

March 3, 2021 at 12:49 pm #69346

Water_my_lawn
Participant

More updates: I have caught a few hang conditions. I have also had some bad luck!
One time I accidentally kicked the power connector out when I was extracting state
data during a hang condition thereby losing the state. The other day the power
went out for a few hours when I had another hang condition that I was examining.

Some hang conditions I have waited 2 months for, others I have caught in 1 or
2 days. I have also bricked my OS a few times.

Anyway, it seems that there are two conditions that produce network problems.
The first is a receive buffer overflow. This is indicated by the ESTAT register
bit 6 set to 1 and the EIR register bit 0 set to one. This condition does
not clear itself.

I suspect the cause is that the OS code does not poll the network layer
fast enough to prevent a buffer overflow in all conditions. Since the ENC28J60
chip does not have any internal packet processing it must be polled often enough
to handle all packets that appear on the wire. This includes ARP and ICMP
packets and packets that are not addressed to the OS. This condition only
appears rarely so it need not be fixed in the OS code but there must
be some recovery mechanism.

The second fault is indicated by the ESTAT register being set to 0x13. This
is a transmit “late collision error”. This condition also does not seem to
clear itself.

I will add this code in the do_loop() routine.

    if ((estat & ESTAT_ERROR) || (eir & EIR_ERROR)) {
        OpenSprinkler::start_ether();
    }

This resets the entire Ethernet layer including the ENC28J60 chip.
It should clear all these error conditions.
I will run this for my next test and log the number of occurrences.
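
A self-contained sketch of that check with an occurrence counter follows. The bit names and positions are those of the ENC28J60 datasheet, but the thread does not say which bits ESTAT_ERROR and EIR_ERROR contain, so the two masks here are assumptions built from the two faults described above. EtherWatchdog is a hypothetical helper.

```cpp
#include <cstdint>

// ENC28J60 flag bits (datasheet names).
constexpr uint8_t ESTAT_BUFER   = 0x40;  // bit 6: buffer error
constexpr uint8_t ESTAT_LATECOL = 0x10;  // bit 4: late collision
constexpr uint8_t ESTAT_TXABRT  = 0x02;  // bit 1: transmit aborted
constexpr uint8_t EIR_TXERIF    = 0x02;  // bit 1: transmit error
constexpr uint8_t EIR_RXERIF    = 0x01;  // bit 0: receive error

// Assumed contents of the masks; the real definitions are not in the thread.
constexpr uint8_t ESTAT_ERROR = ESTAT_BUFER | ESTAT_LATECOL | ESTAT_TXABRT;
constexpr uint8_t EIR_ERROR   = EIR_TXERIF | EIR_RXERIF;

// Hypothetical helper that decides when to reinitialise and counts how often.
struct EtherWatchdog {
    uint32_t n_reinits = 0;

    // Returns true when the caller should reset the Ethernet layer,
    // for example by calling OpenSprinkler::start_ether().
    bool check(uint8_t estat, uint8_t eir) {
        if ((estat & ESTAT_ERROR) || (eir & EIR_ERROR)) {
            ++n_reinits;
            return true;
        }
        return false;
    }
};
```

With these masks, an ESTAT value of 0x13 triggers a reset because it contains LATECOL and TXABRT, while 0x01 (only CLKRDY set) does not.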

In looking through the code I find no memory leaks or data corruption,
even though that is what I was expecting to find. I think that the problem
results from a mismatch between the relatively slow ESP based OS
code and the 1 Gigabit Ethernet. Packets can arrive too fast.
The solution, I think, lies in a robust recovery of a rare error
condition rather than to try to handle the packets faster.

March 24, 2021 at 9:47 pm #69512

Water_my_lawn
Participant

Well I have done it this time, really bricked my system!

I loaded an image with a bug that causes the system to crash and reboot. No big
thing, I have done this a number of times. However my normal recovery scheme is
not working this time. I have covered this previously and described the procedure.
This time, no-go.

I think I have identified the situation that causes the ENC28J60 Ethernet port to
stop working. If the packets are not unloaded from the Ethernet chip fast enough
the fifo will fill and result in the receive error. This error must be deliberately
cleared before the chip will return to normal operation.

I am trying to figure where to go from here.

May 30, 2021 at 9:34 am #70263

Water_my_lawn
Participant

I have repaired my OS with help from Ray (thanks Ray). I tried many more times
to unbrick the ESP-12N but failed. Now I am back up.

With the latest firmware 2.1.8.(7) and the connection using the Ethernet adapter
ENC28J60 it hangs frequently. I cannot make it through a single watering cycle
without it hanging. When it is hung it will not respond to pings.

This is actually a much better situation for debugging. Previously it might
take more than a month to hang.

My current working theory is that the ESP processor does not respond fast enough
to prevent an Ethernet buffer overflow. I will add some code to detect this
situation.

June 7, 2021 at 12:09 pm #70356

Water_my_lawn
Participant

Well my hang condition went away for no known reason. I was able
to capture one hang with my debug code and have the data from the hang.

When hung this is the state of the registers that I am logging:
EIR 0x09 TXIF (transmit done), RXERIF (receive aborted, buffer overrun)
ESTAT 0x41 BUFER (read or write buffer error), CLKRDY (clock is OK)
ECON1 0x04 RXEN (receive enable)
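
The flag names listed above can be recovered from the raw register values with a small decoder. This is a minimal sketch: the bit tables follow the ENC28J60 datasheet with reserved bits left out, and the type and function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

struct BitName {
    uint8_t mask;
    const char* name;
};

// Flag names from the ENC28J60 datasheet; reserved bits are omitted.
constexpr BitName kEirBits[] = {
    {0x40, "PKTIF"}, {0x20, "DMAIF"},  {0x10, "LINKIF"},
    {0x08, "TXIF"},  {0x02, "TXERIF"}, {0x01, "RXERIF"},
};
constexpr BitName kEstatBits[] = {
    {0x80, "INT"},    {0x40, "BUFER"},  {0x10, "LATECOL"},
    {0x04, "RXBUSY"}, {0x02, "TXABRT"}, {0x01, "CLKRDY"},
};

// Returns the names of all flags set in value, separated by '|',
// in the order they appear in the table (highest bit first).
template <std::size_t N>
std::string decode(uint8_t value, const BitName (&bits)[N]) {
    std::string out;
    for (const BitName& b : bits) {
        if (value & b.mask) {
            if (!out.empty()) out += '|';
            out += b.name;
        }
    }
    return out;
}
```

For the values logged during this hang, decode(0x09, kEirBits) gives "TXIF|RXERIF" and decode(0x41, kEstatBits) gives "BUFER|CLKRDY".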

At this stage the recovery counter (n_reinits) is at 3. This means that
a hang condition has been detected and the recovery code has executed
3 times but the Ethernet interface is still hung. This recovery code
is not in the standard release code. Ray has it turned off.

I turned the debug flag on which enabled the recovery code. I have
added some additional logging code to further try to understand why
the recovery process does not work. Otherwise this debug version
is identical to the latest release of Ray’s firmware: 2.1.9 (7).

I have attached a firmware binary with the additional debug logging.
If anyone is experiencing the same hang with a hardwired Ethernet
connection using the ENC28J60 module I ask that you would give my
firmware a try and report back what it says.

The debug code prints two lines on the OLED display. One line
appears above the standard messages and the other line appears
below the standard messages.

The top line is formatted as such:
XX|XX|XX|XX
The XX is the value in the EIR register, the ESTAT register,
the ECON1 register, and the recovery counter.

The bottom line is formatted as such:
XX|XX|XX XX|XX|XX
The EIR register, ESTAT register, ECON1 register, the EIR register,
ESTAT register, ECON1 register.
The apparent duplication is because the registers are read two
times at different places in the code.
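
For anyone transcribing the display, the top line can be split back into its four fields as sketched below. This assumes each XX is a two-digit hexadecimal value, which the description above does not state explicitly. The struct and function names are hypothetical.

```cpp
#include <cstdio>

// The four fields of the top debug line, in display order.
struct DebugTopLine {
    unsigned eir;
    unsigned estat;
    unsigned econ1;
    unsigned n_reinits;
};

// Parses text of the form "XX|XX|XX|XX" (assumed hexadecimal).
// Returns false if fewer than four fields could be read.
inline bool parse_top_line(const char* text, DebugTopLine& out) {
    return std::sscanf(text, "%2x|%2x|%2x|%2x",
                       &out.eir, &out.estat, &out.econ1, &out.n_reinits) == 4;
}
```

Under that assumption, the hang state reported earlier in this post (EIR 0x09, ESTAT 0x41, ECON1 0x04, three recoveries) would appear as 09|41|04|03.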

If I could get all of this information after an Ethernet hang
it would help me figure out this very elusive bug.

Thanks.
Attachments:
1. OS_fw_dbg.bin_.zip

June 13, 2021 at 5:09 am #70410

Water_my_lawn
Participant

I have about 1 week of runtime with my debug code. I have caught a few errors.
The first run showed 6 recoveries, then I power cycled the OS and now it shows
3 recoveries. These mostly resulted in no errors showing on a web browser
pointed at the OS. All of the errors were receive buffer overflows.

I normally keep 2 browsers showing my OS, both Firefox; one running on Windows
and one running on Linux. One time the browser on Windows showed “Network error”
and would not refresh. At the same time the browser on Linux refreshed properly
indicating that the OS Ethernet interface was not at fault. When I closed the
browser tab and then reopened it, the OS web page came up OK.

I suspect that there is some problem in the protocol between the browser and the OS.
Perhaps the browser protocol is not robust enough to withstand the lost packets that
will occur when the OS resets the Ethernet interface. This will necessarily
result in lost packets and likely connection timeouts.

I recommend that the recovery code for the ENC28J60 Ethernet interface be included
in the standard release. The driver code for the ENC28J60 does not detect buffer
overflow and does not have any error recovery code. A buffer overflow stops the
processing of received packets and must be recovered by the system before it can
resume normal processing.

I suspect that there is a problem in the higher level protocol that communicates
with the browser. It is possible that the protocol sometimes cannot recover
in a situation where the channel is momentarily broken and a number of packets
are lost. However that protocol is outside of my area of experience.

Hope this helps,
Pete.

September 14, 2021 at 9:34 am #71165

Water_my_lawn
Participant

Even though I have not posted for a while I am still working on the problem of the wired Ethernet
connection hang. In discussions with the UIPEthernet developers I am convinced that their
driver is OK. So I have changed my debug strategy.

The essential problem seems to be that the main loop is not seeing Ethernet packets when
they arrive. The loop queries if any data has arrived and the Ethernet driver always
reports “no data”. This happens even when packets are arriving. Since the packet
data is not read from the Ethernet chip buffer, the buffer fills and flags an overrun
condition. Originally I thought that this overrun flag was the cause of the problem
but now I see that it is just a result of the problem.

It seems to me now that there must be data corruption in the RAM. The UIPEthernet returns
an incorrect result but the UIPEthernet appears to be error free. Perhaps a buffer
is over-running its bounds. Perhaps there is an error in variable type casting.

My new strategy is to dump the entire system RAM for the normal running state and dump
it for the error state. I have a bunch of captures of both conditions.

I have written a program that takes the map file produced by objdump and fills in each
variable with the actual data from the RAM dump. This gives me the value of all variables
that exist in the system. I can compare these results from the many captures from the
good running systems. Any differences in the variables will be just the normal running
state changing. I edit these variables out. What is left is a list of the variables that
don’t change in my configuration in my running system when it is working properly.

I next process the bad state RAM dumps and compare them to the good state variables.
This gives me a map of what is different between a good running system and a system
with the Ethernet port hung. After this processing I find about 40 variables that
are different in the bad state. This is out of 586 total variables in the map of
all variables.
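
The comparison described above can be sketched as two steps: keep only the variables that hold the same value in every good-state capture, then list those that differ in a bad-state capture. This is not the actual program. It is a simplified illustration that treats every variable as a single 32-bit value, and all names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Snapshot = std::map<std::string, uint32_t>;  // variable name -> value

// Variables whose value is identical in every good-state snapshot.
Snapshot stable_variables(const std::vector<Snapshot>& good) {
    Snapshot stable;
    if (good.empty()) return stable;
    stable = good.front();
    for (std::size_t i = 1; i < good.size(); ++i) {
        for (auto it = stable.begin(); it != stable.end();) {
            auto found = good[i].find(it->first);
            if (found == good[i].end() || found->second != it->second)
                it = stable.erase(it);  // changes during normal running
            else
                ++it;
        }
    }
    return stable;
}

// Names of stable variables that hold a different value in the bad snapshot.
std::vector<std::string> suspects(const Snapshot& stable, const Snapshot& bad) {
    std::vector<std::string> names;
    for (const auto& kv : stable) {
        auto found = bad.find(kv.first);
        if (found != bad.end() && found->second != kv.second)
            names.push_back(kv.first);
    }
    return names;
}
```

Filtering out the normally changing variables first keeps the final list short, which matches the reduction from 586 variables to about 40 reported above.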

At the moment I am pondering this result but have not come to a conclusion.

It is interesting that the WiFi also seems to have a problem. Perhaps they are
related.

September 22, 2021 at 12:43 pm #71230

jaycan
Spectator

Hi,

I have been having the same problems with wired Ethernet (OS 3.0) as everyone else on this thread and found a solution that works for me. It appears that port 80 is a bit noisy; since I changed to an obscure, unused port like 8012, the controller lockups have ceased. It has been about 2 weeks and so far so good. Just make sure you change the port in the controller settings and forward that port in your router.

Cheers

October 5, 2021 at 3:41 pm #71357

jaycan
Spectator

Now a month on and not a single lock up/crash event. Seems like there is merit to changing port number after all.

Curious what others have experienced with this and whether this solution is not just an isolated example.

April 3, 2022 at 8:59 am #72466

Water_my_lawn
Participant

I have been running my firmware for over 3 months on the wired Ethernet connection without a crash.
The only change made was a fix by Juraj Andrássy to the UIPClient.cpp which is part of the UIPEthernet
Ethernet driver. The only change made is to replace (*this) with (data) in 4 places in the source file.
This is one of those bugs that is obvious in hindsight.

This bug would result in dereferencing a bad pointer when there was a zero length packet. Such
a thing produces unpredictable behavior or a crash.

However, this is all moot since Ray moved to EthernetENC from the previous UIPEthernet Ethernet driver.

April 28, 2022 at 5:13 pm #72634

Water_my_lawn
Participant

I just had a network hang with firmware 2.1.9 (9) after 3 weeks running.
This is using the ENC28J60 Ethernet adapter board.
Since this has the new network stack perhaps there is a problem just
like there was with the old (unpatched) stack.

Has anyone else seen this?

May 2, 2022 at 7:29 am #72676

Water_my_lawn
Participant

I have caught another hang. The OS is locked in a boot loop. You can see it in this video:
https://youtu.be/47hsE2BB1gw
What is happening?

May 12, 2022 at 5:38 am #72721

Water_my_lawn
Participant

I am hitting an OS network error every few days using the latest firmware. This time it was a network hang but the OS worked OK using the buttons. It was not the boot loop hang that I had previously.

Is anyone else using their OS with a hard-wired Ethernet connection?

May 12, 2022 at 5:44 am #72722

rboer01
Participant

Dear,

I also have hangs with the wired module.

The OS still responds to the buttons.

Brgds,

Rik

May 12, 2022 at 10:08 am #72723

Water_my_lawn
Participant

Are you running the latest firmware 2.1.9 (9)?

May 12, 2022 at 10:18 am #72724

rboer01
Participant

Yes I am.

Brgds,

Rik

May 12, 2022 at 9:40 pm #72730

Water_my_lawn
Participant

Would you be interested in trying my fixed version? This is using the older network stack: UIPEthernet.
There was an obvious (in hindsight) bug that the original author fixed. I tested it and it worked perfectly.
However the firmware rev is 2.1.9 (7) which is a few months behind the current development.

I can make it available if you would like to try it.
Let me know.

May 13, 2022 at 12:31 am #72732

rboer01
Participant

Hi. For the moment I have tried a full reset.

If I have a hang, I’ll report and test your version with pleasure.

Brgds,

Rik

May 19, 2022 at 8:02 pm #72811

Water_my_lawn
Participant

I have taken a few more hangs and have loaded my special fixed version based on 2.1.9 (7).
I will see how that does.

May 21, 2022 at 9:29 pm #72821

colinl
Participant

Hello (from Australia)
I’ve just purchased and installed OpenSprinkler (as a replacement for an old Hunter controller) with an Ethernet module (as WiFi would be at the edge of range).
In the middle of initial setup I found the controller looping through a boot sequence. I removed the Ethernet connection and was able to gain access via WiFi.
Based on guidance in this thread, I have changed the port from 80 to 8080.
I have had another 2 lockups (buttons still work) while using Ethernet, so I am reverting to WiFi while awaiting guidance on what to try next.
Hardware version is 3.2 – AC. Firmware is 2.1.9 (9).
Regards
Colin

May 23, 2022 at 7:20 am #72833

Water_my_lawn
Participant

I have run my fixed version without problems so far. Since Ray changed to EthernetENC from the previous UIPEthernet Ethernet
stack, my fix will not help the current release. I have run the current release and continue to have the hang problem, so I suspect
that there is a bug in the EthernetENC code as well.

It seems that most people never run into this hang problem. I have it quite frequently. However, I don’t know how
many people are actually running a wired Ethernet connection.

I have opened an issue with the Ethernet stack here:
https://github.com/JAndrassy/EthernetENC/issues/35