Reply To: Controller lockups / crashes with wired Ethernet module

March 3, 2021 at 12:49 pm #69346

Participant

More updates: I have caught a few hang conditions. I have also had some bad luck!
One time I accidentally kicked the power connector out while extracting state
data during a hang condition, thereby losing the state. The other day the power
went out for a few hours while I was examining another hang condition.

Some hang conditions I have waited 2 months for, others I have caught in 1 or
2 days. I have also bricked my OS a few times.

Anyway, it seems that there are two conditions that produce network problems.
The first is a receive buffer overflow. This is indicated by ESTAT register
bit 6 (BUFER) set to 1 and EIR register bit 0 (RXERIF) set to 1. This
condition does not clear itself.
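A minimal sketch of that check, assuming the bit positions from the ENC28J60 datasheet. The constant and function names are illustrative and are not taken from the OS source:

```cpp
#include <cstdint>

// Bit positions per the ENC28J60 datasheet; the names are illustrative.
constexpr uint8_t ESTAT_BUFER = 1u << 6;  // ESTAT bit 6: buffer error status
constexpr uint8_t EIR_RXERIF  = 1u << 0;  // EIR bit 0: receive error interrupt flag

// True when both flags of the receive overflow condition are set.
constexpr bool rx_overflow(uint8_t estat, uint8_t eir) {
    return (estat & ESTAT_BUFER) != 0 && (eir & EIR_RXERIF) != 0;
}
```

The function only inspects values that have already been read from the chip, so it can be tried on captured register dumps without the hardware.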

I suspect the cause is that the OS code does not poll the network layer
fast enough to prevent a buffer overflow in all conditions. Since the ENC28J60
chip does not have any internal packet processing it must be polled often enough
to handle all packets that appear on the wire. This includes ARP broadcasts,
ICMP packets, and packets that are not addressed to the OS. This condition
appears only rarely, so the polling itself need not be changed in the OS code,
but there must be some recovery mechanism.
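To get a feel for "often enough", here is a rough worst-case estimate of how long a receive buffer takes to fill at line rate. The 6 KB receive allocation is an assumption (the ENC28J60 has 8 KB of buffer RAM shared between transmit and receive, and I have not checked how the OS splits it), and the function name is made up for illustration:

```cpp
#include <cstdint>

// Worst-case time in milliseconds to fill rx_bytes of receive buffer at
// link_bps, ignoring inter-frame gaps and per-packet status overhead.
constexpr double fill_time_ms(uint32_t rx_bytes, double link_bps) {
    return rx_bytes * 8.0 / link_bps * 1000.0;
}
```

With the assumed 6144 bytes and the chip's 10 Mbit/s link, fill_time_ms(6144, 10e6) comes out just under 5 ms. Real traffic is far below line rate, which fits with the overflow being rare.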

The second fault is indicated by the ESTAT register being set to 0x13, which
is the LATECOL (bit 4), TXABRT (bit 1) and CLKRDY (bit 0) bits. This is a
transmit “late collision error”. This condition also does not seem to
clear itself.
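For reading raw ESTAT dumps like this one, a small decoder helps. This is only a sketch: the bit names come from the ENC28J60 datasheet, and the function itself is hypothetical, not part of the OS code:

```cpp
#include <cstdint>
#include <string>

// Return the names of the ESTAT flags set in the given register value.
// Bit names follow the ENC28J60 datasheet; unimplemented bits are skipped.
std::string decode_estat(uint8_t estat) {
    struct Flag { uint8_t mask; const char* name; };
    static const Flag flags[] = {
        {0x80, "INT"},    {0x40, "BUFER"},  {0x10, "LATECOL"},
        {0x04, "RXBUSY"}, {0x02, "TXABRT"}, {0x01, "CLKRDY"},
    };
    std::string out;
    for (const Flag& f : flags) {
        if (estat & f.mask) {
            if (!out.empty()) out += ' ';
            out += f.name;
        }
    }
    return out;
}
```

decode_estat(0x13) returns "LATECOL TXABRT CLKRDY", which matches the late collision reading of this fault.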

I will add this code in the do_loop() routine.

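// estat and eir hold the current ESTAT and EIR register values
if ((estat & ESTAT_ERROR) || (eir & EIR_ERROR)) {
    OpenSprinkler::start_ether();  // re-initialize the Ethernet layer
}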

This resets the entire Ethernet layer including the ENC28J60 chip.
It should clear all these error conditions.
I will run this for my next test and log the number of occurrences.
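The check plus the occurrence counter could look roughly like the sketch below. The mask values are assumptions (the definitions of ESTAT_ERROR and EIR_ERROR are not shown above), EtherWatchdog is a made-up name, and the restart callback stands in for OpenSprinkler::start_ether() so the logic can be tried off the device:

```cpp
#include <cstdint>
#include <functional>

// Assumed masks; the real ESTAT_ERROR / EIR_ERROR definitions may differ.
constexpr uint8_t ESTAT_ERROR = 0x40 | 0x10;  // BUFER | LATECOL
constexpr uint8_t EIR_ERROR   = 0x01;         // RXERIF

struct EtherWatchdog {
    unsigned long resets = 0;       // number of recoveries so far
    std::function<void()> restart;  // stand-in for OpenSprinkler::start_ether()

    // Call once per loop pass with freshly read register values.
    // Returns true if a restart was triggered.
    bool check(uint8_t estat, uint8_t eir) {
        if ((estat & ESTAT_ERROR) || (eir & EIR_ERROR)) {
            ++resets;
            if (restart) restart();
            return true;
        }
        return false;
    }
};
```

Keeping the count in one place makes it easy to log at the end of a test run, and a healthy ESTAT value such as 0x01 (CLKRDY only) does not trigger a restart.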

In looking through the code I found no memory leaks or data corruption,
although that is what I had expected to find. I think that the problem
results from a mismatch between the relatively slow ESP-based OS
code and the 1 Gigabit Ethernet network it is attached to (the ENC28J60
itself links at 10 Mbit/s). Packets can arrive faster than the OS reads
them out. The solution, I think, lies in a robust recovery from a rare
error condition rather than in trying to handle the packets faster.