Forum Replies Created
Water_my_lawn (Participant)
It is not uncommon for small changes in the code to result in masking a problem. This is not a fix and can make finding the actual
problem a real trick. However, I have only been running for 2 months and in the past I have run for longer before the hang occurred.

If I could connect a debugger or get a crash dump, I could fix this problem easily. As it is, I can only make small changes and
hope the little bit of information extracted points to the problem.

Water_my_lawn (Participant)
I have run with the latest code, version 2.1.9 (9), and had some hang conditions. I have been communicating with the author of EthernetENC and
he made some suggestions on where to add some debug code. I have created a version of 2.1.9 (9) with the debug code but have not hit a hang yet
after 2 months of running. I am debugging with EthernetENC now, and no longer with the older UIPEthernet. Both have the same author.

I wish you well on your health issues.

Water_my_lawn (Participant)
I have run my fixed version without problems so far. Since Ray changed to EthernetENC from the previous UIPEthernet Ethernet
stack, my fix will not help the current release. I have run the current release and continue to have the hang problem, so I suspect
that there is a bug in the EthernetENC code as well.

It seems that most people never run into this hang problem. I have it quite frequently. However, I don’t know how
many people are actually running a wired Ethernet connection.

I have opened an issue with the Ethernet stack here:
https://github.com/JAndrassy/EthernetENC/issues/35

Water_my_lawn (Participant)
I have hit a few more hangs and have loaded my special fixed version based on 2.1.9 (7).
I will see how that does.

Water_my_lawn (Participant)
Would you be interested in trying my fixed version? This is using the older network stack: UIPEthernet.
There was an obvious (in hindsight) bug that the original author fixed. I tested it and it worked perfectly.
However, the firmware revision is 2.1.9 (7), which is a few months behind the current development.

I can make it available if you would like to try it.
Let me know.

May 12, 2022 at 10:08 am in reply to: Controller lockups / crashes with wired Ethernet module #72723
Water_my_lawn (Participant)
Are you running the latest firmware 2.1.9 (9)?

Water_my_lawn (Participant)
I am hitting an OS network error every few days using the latest firmware. This time it was a network hang but the OS worked OK using the buttons. It was not the boot loop hang that I had previously.
Is anyone else using their OS with a hard-wired Ethernet connection?

Water_my_lawn (Participant)
I have caught another hang. The OS is locked in a boot loop. You can see it in this video:
https://youtu.be/47hsE2BB1gw
What is happening?

April 28, 2022 at 5:13 pm in reply to: Controller lockups / crashes with wired Ethernet module #72634
Water_my_lawn (Participant)
I just had a network hang with firmware 2.1.9 (9) after 3 weeks running.
This is using the ENC28J60 Ethernet adapter board.
Since this has the new network stack perhaps there is a problem just
like there was with the old (unpatched) stack.

Has anyone else seen this?

April 3, 2022 at 8:59 am in reply to: Controller lockups / crashes with wired Ethernet module #72466
Water_my_lawn (Participant)
I have been running my firmware for over 3 months on the wired Ethernet connection without a crash.
The only change made was a fix by Juraj Andrássy to UIPClient.cpp, which is part of the UIPEthernet
Ethernet driver. The fix replaces (*this) with (data) in 4 places in the source file.
This is one of those bugs that is obvious in hindsight.

This bug would result in dereferencing a bad pointer when there was a zero-length packet. Such
a thing produces unpredictable behavior or a crash.

However, this is all moot since Ray moved to EthernetENC from the previous UIPEthernet Ethernet driver.

Water_my_lawn (Participant)
When I try to post I get that error in Linux but not Windows.
Am I doing something wrong? Did something change? It used
to work.

Water_my_lawn (Participant)
In case you were wondering how I could post here: this post is done using Firefox on Windows 10.
I see the little Captcha box on the lower right but I did not have to solve any Captcha puzzle.

September 14, 2021 at 9:34 am in reply to: Controller lockups / crashes with wired Ethernet module #71165
Water_my_lawn (Participant)
Even though I have not posted for a while I am still working on the problem of the wired Ethernet
connection hang. In discussions with the UIPEthernet developers I am convinced that their
driver is OK. So I have changed my debug strategy.

The essential problem seems to be that the main loop is not seeing Ethernet packets when
they arrive. The loop queries whether any data has arrived and the Ethernet driver always
reports “no data”. This happens even when packets are arriving. Since the packet
data is not read from the Ethernet chip buffer, the buffer fills and flags an overrun
condition. Originally I thought that this overrun flag was the cause of the problem
but now I see that it is just a result of the problem.

It seems to me now that there must be data corruption in the RAM. UIPEthernet returns
an incorrect result even though its code appears to be error free. Perhaps a buffer
is overrunning its bounds. Perhaps there is an error in variable type casting.

My new strategy is to dump the entire system RAM for the normal running state and dump
it for the error state. I have a bunch of captures of both conditions.

I have written a program that takes the map file produced by objdump and fills in each
variable with the actual data from the RAM dump. This gives me the value of every variable
that exists in the system. I can compare these results across the many captures from the
good running systems. Any differences in the variables will be just the normal running
state changing. I edit these variables out. What is left is a list of the variables that
don’t change in my configuration in my running system when it is working properly.

I next process the bad state RAM dumps and compare them to the good state variables.
This gives me a map of what is different between a good running system and a system
with the Ethernet port hung. After this processing I find about 40 variables that
are different in the bad state. This is out of 586 total variables in the map of
all variables.

At the moment I am pondering this result but have not come to a conclusion.

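The two-stage comparison described above can be sketched roughly as follows. This is only an illustration, not the actual program: the `Symbol` layout, the function names, and the idea that each dump is a flat byte array indexed by offset are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One variable from the map file: name, offset into the RAM dump, size in
// bytes. (Hypothetical layout, chosen for illustration only.)
struct Symbol {
    std::string name;
    std::size_t offset;
    std::size_t size;
};

using RamDump = std::vector<std::uint8_t>;

// True if the bytes backing `sym` are identical in both dumps.
bool same_value(const Symbol& sym, const RamDump& a, const RamDump& b) {
    assert(sym.offset + sym.size <= a.size());
    assert(sym.offset + sym.size <= b.size());
    for (std::size_t i = 0; i < sym.size; ++i) {
        if (a[sym.offset + i] != b[sym.offset + i]) return false;
    }
    return true;
}

// Stage 1: keep only the variables whose value never changes across the
// dumps taken from the healthy system.
std::vector<Symbol> stable_symbols(const std::vector<Symbol>& symbols,
                                   const std::vector<RamDump>& good) {
    std::vector<Symbol> stable;
    for (const Symbol& sym : symbols) {
        bool is_stable = true;
        for (std::size_t i = 1; i < good.size() && is_stable; ++i) {
            is_stable = same_value(sym, good[0], good[i]);
        }
        if (is_stable) stable.push_back(sym);
    }
    return stable;
}

// Stage 2: of those stable variables, list the ones that differ between a
// healthy reference dump and a dump taken while the port was hung.
std::vector<std::string> suspects(const std::vector<Symbol>& stable,
                                  const RamDump& good_ref,
                                  const RamDump& bad) {
    std::vector<std::string> names;
    for (const Symbol& sym : stable) {
        if (!same_value(sym, good_ref, bad)) names.push_back(sym.name);
    }
    return names;
}
```

Filtering on stability first is what keeps ordinary counters and timers out of the final list, so only variables that are normally constant show up as suspects.
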
It is interesting that the WiFi also seems to have a problem. Perhaps they are
related.

June 13, 2021 at 5:09 am in reply to: Controller lockups / crashes with wired Ethernet module #70410
Water_my_lawn (Participant)
I have about 1 week of runtime with my debug code. I have caught a few errors.
The first run showed 6 recoveries, then I power cycled the OS and now it shows
3 recoveries. These mostly resulted in no errors showing on a web browser
pointed at the OS. All of the errors were receive buffer overflows.

I normally keep 2 browsers showing my OS, both Firefox; one running on Windows
and one running on Linux. One time the browser on Windows showed “Network error”
and would not refresh. At the same time the browser on Linux refreshed properly,
indicating that the OS Ethernet interface was not at fault. When I closed the
browser tab and then reopened it, the OS web page came up OK.

I suspect that there is some problem in the protocol between the browser and the OS.
Perhaps the browser protocol is not robust enough to withstand the lost packets that
will occur when the OS resets the Ethernet interface. This will necessarily
result in lost packets and likely connection timeouts.

I recommend that the recovery code for the ENC28J60 Ethernet interface be included
in the standard release. The driver code for the ENC28J60 does not detect buffer
overflow and does not have any error recovery code. A buffer overflow stops the
processing of received packets and must be recovered by the system before it can
resume normal processing.

I suspect that there is a problem in the higher level protocol that communicates
with the browser. It is possible that the protocol sometimes cannot recover
in a situation where the channel is momentarily broken and a number of packets
are lost. However, that protocol is outside of my area of experience.

Hope this helps,
Pete.

June 7, 2021 at 12:09 pm in reply to: Controller lockups / crashes with wired Ethernet module #70356
Water_my_lawn (Participant)
Well, my hang condition went away for no known reason. I was able
to capture one hang with my debug code and have the data from the hang.

When hung, this is the state of the registers that I am logging:
EIR   0x09  TXIF (transmit done), RXERIF (receive aborted, buffer overrun)
ESTAT 0x41  BUFER (read or write buffer error), CLKRDY (clock is OK)
ECON1 0x04  RXEN (receive enable)

At this stage the recovery counter (n_reinits) is at 3. This means that
a hang condition has been detected and the recovery code has executed
3 times but the Ethernet interface is still hung. This recovery code
is not in the standard release code. Ray has it turned off.

I turned the debug flag on, which enabled the recovery code. I have
added some additional logging code to further try to understand why
the recovery process does not work. Otherwise this debug version
is identical to the latest release of Ray’s firmware: 2.1.9 (7).

I have attached a firmware binary with the additional debug logging.
If anyone is experiencing the same hang with a hardwired Ethernet
connection using the ENC28J60 module I ask that you would give my
firmware a try and report back what it says.

The debug code prints two lines on the OLED display. One line
appears above the standard messages and the other line appears
below the standard messages.

The top line is formatted as such:
XX|XX|XX|XX
The XX fields are, in order, the values of the EIR register, the ESTAT register,
the ECON1 register, and the recovery counter.

The bottom line is formatted as such:
XX|XX|XX XX|XX|XX
The EIR register, ESTAT register, ECON1 register, then again the EIR register,
ESTAT register, ECON1 register.
The apparent duplication is because the registers are read two
times at different places in the code.

If I could get all of this information after an Ethernet hang
it would help me figure out this very elusive bug.

Thanks.
Attachments:
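For anyone reading values off the OLED, the register bytes in the post above can be turned back into flag names with a small lookup. This is a sketch, not part of the debug firmware: TXIF, RXERIF, BUFER, CLKRDY and RXEN are named in the post, while the other flag names and bit positions are my reading of the ENC28J60 datasheet and should be checked against it. The `Flag` struct and `decode` helper are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// One named flag within an 8-bit register.
struct Flag {
    std::uint8_t mask;
    const char* name;
};

// Flag tables, most significant bit first. Reserved and unimplemented bits
// are left out. Names not quoted in the post are assumptions taken from the
// ENC28J60 datasheet.
const Flag kEirFlags[] = {
    {0x40, "PKTIF"}, {0x20, "DMAIF"},  {0x10, "LINKIF"},
    {0x08, "TXIF"},  {0x02, "TXERIF"}, {0x01, "RXERIF"},
};
const Flag kEstatFlags[] = {
    {0x80, "INT"},    {0x40, "BUFER"},  {0x10, "LATECOL"},
    {0x04, "RXBUSY"}, {0x02, "TXABRT"}, {0x01, "CLKRDY"},
};
const Flag kEcon1Flags[] = {
    {0x80, "TXRST"}, {0x40, "RXRST"}, {0x20, "DMAST"}, {0x10, "CSUMEN"},
    {0x08, "TXRTS"}, {0x04, "RXEN"},  {0x02, "BSEL1"}, {0x01, "BSEL0"},
};

// Return the names of all flags set in `value`, separated by spaces.
template <std::size_t N>
std::string decode(std::uint8_t value, const Flag (&flags)[N]) {
    std::string out;
    for (const Flag& f : flags) {
        if (value & f.mask) {
            if (!out.empty()) out += ' ';
            out += f.name;
        }
    }
    return out;
}

// The values logged during the hang decode to the flags listed in the post.
void check_logged_hang() {
    assert(decode(0x09, kEirFlags) == "TXIF RXERIF");
    assert(decode(0x41, kEstatFlags) == "BUFER CLKRDY");
    assert(decode(0x04, kEcon1Flags) == "RXEN");
}
```

With the same tables, an ESTAT value of 0x13 decodes to LATECOL, TXABRT and CLKRDY.
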
Water_my_lawn (Participant)
I have repaired my OS with help from Ray (thanks Ray). I tried many more times
to unbrick the ESP-12N but failed. Now I am back up.

With the latest firmware 2.1.8.(7) and the connection using the Ethernet adapter
ENC28J60 it hangs frequently. I cannot make it through a single watering cycle
without it hanging. When it is hung it will not respond to pings.

This is actually a much better situation for debugging. Previously it might
take more than a month to hang.

My current working theory is that the ESP processor does not respond fast enough
to prevent an Ethernet buffer overflow. I will add some code to detect this
situation.

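To put a rough number on "fast enough", here is a back-of-the-envelope sketch. It assumes the worst case: the ENC28J60's whole 8 KB packet memory is used for receive and the 10 Mbit/s line is saturated. How much of that memory the driver actually gives to receive is not stated in these posts, and real traffic is far lighter, so treat the result as a lower bound on the time available. The function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Microseconds needed for back-to-back traffic at `link_bits_per_sec` to fill
// `buffer_bytes` of receive memory if nothing is read out in the meantime.
// Framing overhead (preamble, inter-frame gap) is ignored.
double fill_time_us(std::uint32_t buffer_bytes,
                    std::uint32_t link_bits_per_sec) {
    assert(link_bits_per_sec > 0);
    return buffer_bytes * 8.0 * 1e6 / link_bits_per_sec;
}
```

Under those assumptions `fill_time_us(8192, 10000000)` comes to about 6.5 ms, so a main loop that goes longer than that without polling the chip could in principle see an overflow.
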
Water_my_lawn (Participant)
I have taken your fixes and updated my script.
This script downloads all of the source and applies the necessary
fixes to build the latest binary. This is currently: Firmware 2.1.9 (7)
If this script is run in a new subdirectory it creates its own environment
to build the binary. This allows you to have multiple build trees with
their own build root.

I have attached my script. I use Ubuntu 20.04 for development.
Attachments:
Water_my_lawn (Participant)
These OLED displays (SSD1306) are quite reliable and long lasting. This Russian guy did a burn-in test for over a year on
a bunch of the displays:
https://www.youtube.com/watch?v=GWOFF5tMv_A&t=493s

It seems that their life depends on brightness and time, not how frequently the pixels are changed.
If you do replace the OLED display be careful to note the positions of the power pins on the 4 pin header connector.
The power pins are reversed on a lot of the displays.

Water_my_lawn (Participant)
Why are you applying DC voltage to your AC solenoids? This is likely to put too much current
through the coils.

March 24, 2021 at 9:47 pm in reply to: Controller lockups / crashes with wired Ethernet module #69512
Water_my_lawn (Participant)
Well, I have done it this time, really bricked my system!
I loaded an image with a bug that causes the system to crash and reboot. No big
thing, I have done this a number of times. However, my normal recovery scheme is
not working this time. I have covered this previously and described the procedure.
This time, no-go.

I think I have identified the situation that causes the ENC28J60 Ethernet port to
stop working. If the packets are not unloaded from the Ethernet chip fast enough
the FIFO will fill and result in the receive error. This error must be deliberately
cleared before the chip will return to normal operation.

I am trying to figure out where to go from here.

Water_my_lawn (Participant)
I see that my script is out of date!
The lines with esp8266_2.5.2 need to be edited to pick up the 2.7.4 version.
The issue with tick() must also be dealt with.

Water_my_lawn (Participant)
I wrote a script that does the download and necessary fixups.
Make a build directory and run this script from there.
Here it is:
-----------------------------------------------------
#!/bin/bash
# This script downloads all of the source code necessary to build
# the OpenSprinkler binary for the 3.0 hardware.
# The script fixes a few places in the source downloads that are
# needed before it will compile cleanly.
# Finally it runs the make command.
# This can be run in any directory and will set a local $HOME.
# This runs in Linux.

# Remove any previous installation. You may or may not want to do this!
rm -rf Arduino esp8266_2.5.2 OpenSprinkler-Firmware

# Create a local $HOME for installation and build.
export HOME=$(pwd)

# Get the OpenSprinkler code.
# Puts it in ~/OpenSprinkler-Firmware
git clone https://github.com/OpenSprinkler/OpenSprinkler-Firmware.git

# Get the Arduino code.
# Puts it in ~/esp8266_2.5.2
git clone https://github.com/esp8266/Arduino.git esp8266_2.5.2

# Go into esp8266_2.5.2 and check out the correct version.
cd ~/esp8266_2.5.2
git checkout tags/2.5.2

# If necessary, install Python.
# sudo apt install python

# This Python script installs the xtensa compiler and tools.
cd ~/esp8266_2.5.2/tools
python get.py

# Go back up to the base level.
#cd

# Install necessary libraries, including SSD1306, RCSwitch, and UIPEthernet.
# Download and unzip or git clone these into Arduino/libraries folder.
mkdir -p ~/Arduino/libraries
cd ~/Arduino/libraries

# Get the library for the OLED display.
git clone https://github.com/ThingPulse/esp8266-oled-ssd1306.git

# The latest version of the OLED code is not compatible, back up to 4.1.0
cd ~/Arduino/libraries/esp8266-oled-ssd1306
git checkout tags/4.1.0

cd ~/Arduino/libraries
# Get some of the necessary pieces.
git clone https://github.com/sui77/rc-switch.git
git clone https://github.com/UIPEthernet/UIPEthernet.git
git clone https://github.com/knolleary/pubsubclient.git

# Remove tests directory as it will not compile but is unnecessary.
rm -rf ~/Arduino/libraries/pubsubclient/tests

# Go into the actual build location
cd ~/OpenSprinkler-Firmware

# There is a fixup needed in make.lin32:
# This changes the OLED library to the correct one.
sed -i 's+Arduino/libraries/SSD1306+Arduino/libraries/esp8266-oled-ssd1306+' make.lin32

# And finally build the final OpenSprinkler binary.
make -f make.lin32

echo "**********************************************************"
echo 'To build in a directory other than your normal $HOME directory'
echo 'run this command to set a local $HOME directory to your current directory:'
echo '    export HOME=$(pwd)'
echo "Next go into the build directory."
echo "    cd ~/OpenSprinkler-Firmware"
echo "Then run this command to build the binary:"
echo "    make -f make.lin32"
echo "**********************************************************"
-----------------------------------------------------

March 3, 2021 at 12:49 pm in reply to: Controller lockups / crashes with wired Ethernet module #69346
Water_my_lawn (Participant)
More updates: I have caught a few hang conditions. I have also had some bad luck!
One time I accidentally kicked the power connector out when I was extracting state
data during a hang condition, thereby losing the state. The other day the power
went out for a few hours when I had another hang condition that I was examining.

Some hang conditions I have waited 2 months for, others I have caught in 1 or
2 days. I have also bricked my OS a few times.

Anyway, it seems that there are two conditions that produce network problems.
The first is a receive buffer overflow. This is indicated by the ESTAT register
bit 6 set to 1 and the EIR register bit 0 set to 1. This condition does
not clear itself.

I suspect the cause is that the OS code does not poll the network layer
fast enough to prevent a buffer overflow in all conditions. Since the ENC28J60
chip does not have any internal packet processing it must be polled often enough
to handle all packets that appear on the wire. This includes ARP and ICMP
packets and packets that are not addressed to the OS. This condition only
appears rarely so it need not be fixed in the OS code but there must
be some recovery mechanism.

The second fault is indicated by the ESTAT register being set to 0x13. This
is a transmit “late collision error”. This condition also does not seem to
clear itself.

I will add this code in the do_loop() routine.
if ((estat & ESTAT_ERROR) || (eir & EIR_ERROR)) {
    OpenSprinkler::start_ether();
}

This resets the entire Ethernet layer including the ENC28J60 chip.
It should clear all these error conditions.
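
The check above can be fleshed out as a standalone sketch. The post does not define ESTAT_ERROR or EIR_ERROR, so the mask contents here are assumptions built from the bits it mentions (ESTAT bit 6, EIR bit 0, and the late-collision bit contained in 0x13). `RecoveryMonitor`, `poll` and the `restart` callback are hypothetical stand-ins; in the firmware the restart would be OpenSprinkler::start_ether().

```cpp
#include <cassert>
#include <cstdint>

// Individual error bits mentioned in the post.
constexpr std::uint8_t ESTAT_BUFER   = 0x40;  // ESTAT bit 6: buffer error
constexpr std::uint8_t ESTAT_LATECOL = 0x10;  // ESTAT bit 4: set within 0x13
constexpr std::uint8_t EIR_RXERIF    = 0x01;  // EIR bit 0: receive error

// Assumed contents of the masks used in the snippet above.
constexpr std::uint8_t ESTAT_ERROR = ESTAT_BUFER | ESTAT_LATECOL;
constexpr std::uint8_t EIR_ERROR   = EIR_RXERIF;

// True if either register shows one of the sticky error conditions.
constexpr bool needs_reinit(std::uint8_t eir, std::uint8_t estat) {
    return (estat & ESTAT_ERROR) != 0 || (eir & EIR_ERROR) != 0;
}

// Counts how often recovery ran, so the occurrences can be logged.
struct RecoveryMonitor {
    unsigned n_reinits = 0;

    // Calls `restart` when an error flag is set and returns true if it did.
    template <typename Restart>
    bool poll(std::uint8_t eir, std::uint8_t estat, Restart restart) {
        if (!needs_reinit(eir, estat)) return false;
        restart();
        ++n_reinits;
        return true;
    }
};

// Both fault signatures described above should trigger a restart.
void check_fault_signatures() {
    assert(needs_reinit(0x01, 0x40));   // receive overflow
    assert(needs_reinit(0x00, 0x13));   // late collision
    assert(!needs_reinit(0x08, 0x01));  // only TXIF and CLKRDY set: healthy
}
```

Keeping the decision in a pure function makes it easy to test the masks against captured register values without the hardware attached.
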
I will run this for my next test and log the number of occurrences.

In looking through the code I find no memory leaks or data corruption.
That is what I was expecting to find. I think that the problem
results from a mismatch between the relatively slow ESP based OS
code and the 1 Gigabit Ethernet. Packets can arrive too fast.
The solution, I think, lies in a robust recovery of a rare error
condition rather than trying to handle the packets faster.

February 18, 2021 at 7:34 am in reply to: Controller lockups / crashes with wired Ethernet module #69275
Water_my_lawn (Participant)
The reason for using 74880 is that the data initially sent out during booting
is at that speed for ESPs running at 26 MHz. If you are using an ESP at 40 MHz
then the initial BAUD rate is 115200, which is a standard rate.

The ESP does go into an auto-baud mode after booting but auto-baud is tricky and not
always reliable.

I think that my problem with bricking my OS this time is that the ESP booted
OK and then went into OS code which immediately panicked. This disrupted
the auto-baud and prevented the flash utility from grabbing the ESP and taking
control. By using 74880 the flash utility did not depend on the auto-baud
being completed successfully.

Not all USB to ASYNC adapters support arbitrary BAUD rates but the ones using
the CH340 chip do. However, you must also have a device driver that supports
this mode; the standard Linux driver does not. The kernel module called ch341
does support the CH340 chip and arbitrary BAUD rates.

February 12, 2021 at 11:21 pm in reply to: Controller lockups / crashes with wired Ethernet module #69227
Water_my_lawn (Participant)
After waiting for 2 months for a network interface hang I added debug code to
narrow the focus of my investigation. I started the next run and caught a
hang after 2 days. This time the network interface, while hung, would respond
to pings. The pings had a valid response ratio of about 10%. The bad
response packets seemed to be corrupted. I watched the traffic with Wireshark.

I further narrowed my debug code to focus more closely on the received and
transmitted packets code. When I loaded my new code I totally bricked
the device. The recovery method that I previously used with esptool.py
did not work. The OS was transmitting data continuously from the ASYNC
port but it would not autobaud so the data was just garbage.

The default BAUD rate of the ESP8266 with a 26 MHz oscillator is 74880 BAUD.
This is non-standard and PuTTY does not support it even though my USB to ASYNC
adapter does support that odd BAUD rate. I found a terminal emulator called
miniterm.py which does support any BAUD rate.

Using this I successfully received the data from the OS. This is what
I got:
---------------------------------
ets Jan 8 2013,rst cause:2, boot mode:(3,6)

load 0x4010f000, len 1384, room 16
tail 8
chksum 0x2d
csum 0x2d
v00000000
~ld
Fatal exception 9(LoadStoreAlignmentCause):
epc1=0x401014d7, epc2=0x00000000, epc3=0x00000000, excvaddr=0x0000000a, depc=0x00000000

Exception (9):
epc1=0x401014d7 epc2=0x00000000 epc3=0x00000000 excvaddr=0x0000000a depc=0x00000000

>>>stack>>>
ctx: sys
sp: 3ffff8f0 end: 3fffffb0 offset: 01a0
3ffffa90: 4024c3fa 3ffee5aa 3ffee5aa 3ffeed9c
3ffffaa0: 4024c409 4024c3b6 40105450 c1781c9b
3ffffab0: 00000000 400042db 40105712 000003fd
3ffffac0: 000000ed 00000020 3fffff10 00000001
3ffffad0: 4010570c 40105583 00000003 8667a4e3
3ffffae0: ffffffff ffffffff ffff0002 00000000
3ffffaf0: 00000000 00000000 00000000 00000000
3ffffb00: 00000000 00000000 00000000 00000000
3ffffb10: ffffffff 00ffffff 00000000 00000000
3ffffb20: 00000000 00000000 00000000 00000000
3ffffb30: 00000000 00000000 00000000 00000000
etc…
-------------------------------------------

This was sent repeatedly and at the maximum rate.
The important message is:

Fatal exception 9

This means that a pointer expecting to read a 32 bit value
is not word aligned. The compiler should not do this so
perhaps the process of flashing my code had an error.
The OS was initializing and taking an exception in
a very tight loop.

Since the OS was in this loop, the regular tools would
not write new firmware. Even the esptool.py loader would
not work. On the Espressif web site I found their tool,
flash_download_tool_3.8.5.exe, for programming the device.
That tool is really klugey but I did manage to overwrite
the flash with the OpenGarage binary. Then the OS did
respond to IP address 192.168.4.1/update. Now I could
fully recover.

Now that the OS is back I will further zoom into the
suspected area and hopefully fix this problem. This was a
struggle, I thought that I had permanently bricked my OS!
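
The odd 74880 figure fits a simple ratio: a UART divisor meant to give 115200 baud from a 40 MHz crystal yields 115200 × 26 / 40 = 74880 baud when the crystal is actually 26 MHz. The crystal frequencies and baud rates come from these posts; reading them as "same divisor, different clock" is an assumption, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Baud rate actually produced when a UART divisor computed for a crystal of
// `assumed_mhz` is clocked from a crystal running at `actual_mhz`.
std::uint32_t effective_baud(std::uint32_t nominal_baud,
                             std::uint32_t assumed_mhz,
                             std::uint32_t actual_mhz) {
    assert(assumed_mhz > 0);
    // Widen before multiplying so large baud rates cannot overflow.
    const std::uint64_t scaled =
        static_cast<std::uint64_t>(nominal_baud) * actual_mhz;
    return static_cast<std::uint32_t>(scaled / assumed_mhz);
}
```

Calling `effective_baud(115200, 40, 26)` gives 74880, which is the rate the terminal had to be set to here.
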