Reply To: Controller lockups / crashes with wired Ethernet module

July 15, 2020 at 12:18 am #67333

Keymaster

I found something today that I think is really interesting. As I mentioned in my post above (https://opensprinkler.com/forums/topic/controller-lockups-crashes/page/2/#post-67254), I suspect that two ENC28J60 register flags, the buffer overflow error flag ESTAT.BUFFER and the receive error flag EIR.RXERIF, indicate whether the controller is in an erroneous state that, after running for some time, eventually leads to a lockup. I will call the state where both bits are 0 the ‘clean state’, and the state where both bits are 1 the ‘corrupted state’.
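To make the two states concrete, here is a minimal C sketch that classifies a pair of raw register readings. It assumes the ESTAT and EIR bytes have already been read over SPI by whatever driver is in use (that read routine is not shown), and the function and enum names are hypothetical. The bit positions follow the Microchip ENC28J60 datasheet, which spells the ESTAT flag BUFER (bit 6); RXERIF is bit 0 of EIR. A reading with only one of the two flags set falls outside the two states defined above, so the sketch reports it separately.

```c
#include <stdint.h>
#include <stdbool.h>

/* Bit masks per the ENC28J60 datasheet: ESTAT bit 6 is BUFER,
   EIR bit 0 is RXERIF. */
#define ESTAT_BUFER 0x40u
#define EIR_RXERIF  0x01u

typedef enum {
    ENC_STATE_CLEAN,     /* both error flags are 0 */
    ENC_STATE_CORRUPTED, /* both error flags are 1 */
    ENC_STATE_MIXED      /* only one flag set: neither state as defined */
} enc_state_t;

/* Hypothetical helper: classify the controller from raw ESTAT and EIR
   values. Other bits (e.g. CLKRDY, PKTIF) are masked off and ignored. */
enc_state_t enc_classify(uint8_t estat, uint8_t eir) {
    bool buffer_err = (estat & ESTAT_BUFER) != 0;
    bool rx_err = (eir & EIR_RXERIF) != 0;

    if (!buffer_err && !rx_err) return ENC_STATE_CLEAN;
    if (buffer_err && rx_err) return ENC_STATE_CORRUPTED;
    return ENC_STATE_MIXED;
}
```

For example, enc_classify(0x01, 0x40) (only CLKRDY and PKTIF set) returns ENC_STATE_CLEAN, while enc_classify(0x40, 0x01) returns ENC_STATE_CORRUPTED.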

My experiment was to find out what causes the corrupted state in the first place. I know that when I use a secondary router to isolate OS from the rest of my primary WiFi network, it always stays in the clean state (at least throughout the testing I started a few days ago). On the other hand, if OS is connected to my primary router, it goes into the corrupted state shortly after booting.

I started by unplugging or turning off all WiFi devices, leaving only OS on the primary router. Sure enough, it stayed in the clean state. Then I turned each WiFi device back on, one after another, restarting OS each time to see whether and when it went into the corrupted state. Interestingly, most devices caused no trouble, except my two MacBooks and a Linux computer: as soon as I turned on WiFi on any of these, the test OS went into the corrupted state.

What’s interesting is that I have two other Linux computers that don’t cause this symptom. Comparing the software installed on these computers quickly led me to discover that Dropbox is what makes the difference. This is reliably reproducible: if I quit Dropbox and then reboot OS, it does not go into the corrupted state; if I leave Dropbox running, OS goes into the corrupted state shortly after booting.

Using Wireshark, I saw that Dropbox sends a large number of Dropbox LAN sync discovery protocol (DB-LSP-DISC) packets. I have a strong feeling that this is the root cause of the problem. Although I haven’t found a way to address the issue yet, this at least gives me a reliable way to reproduce the corrupted state, which I strongly suspect eventually leads to the controller locking up. Apparently there is a way to turn off LAN sync in Dropbox, so that’s what I am going to try next.
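DB-LSP-DISC packets are UDP broadcasts on port 17500, so the ENC28J60 receives them along with every other host on the subnet (assuming its receive filter accepts broadcasts). As a rough sketch, firmware could recognize these frames from the raw Ethernet bytes, for example to count them and compare the count against the ESTAT/EIR readings. The function name is hypothetical, and the check relies only on the broadcast MAC, IPv4/UDP headers and the destination port, not on the packet payload, so any other traffic to port 17500 would also match.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

#define DB_LSP_DISC_PORT 17500u /* UDP port used by Dropbox LAN sync discovery */

/* Read a big-endian (network order) 16-bit value. */
static uint16_t be16(const uint8_t *p) {
    return (uint16_t)((p[0] << 8) | p[1]);
}

/* Hypothetical helper: returns true if `frame` is an Ethernet II frame
   carrying an IPv4/UDP broadcast to port 17500. */
bool is_db_lsp_disc(const uint8_t *frame, size_t len) {
    if (len < 14 + 20 + 8) return false;        /* Ethernet + min IPv4 + UDP */
    for (int i = 0; i < 6; i++)
        if (frame[i] != 0xFF) return false;     /* destination MAC must be broadcast */
    if (be16(frame + 12) != 0x0800) return false; /* EtherType: IPv4 */

    const uint8_t *ip = frame + 14;
    if ((ip[0] >> 4) != 4) return false;        /* IP version 4 */
    size_t ihl = (size_t)(ip[0] & 0x0F) * 4;    /* IPv4 header length in bytes */
    if (ihl < 20 || len < 14 + ihl + 8) return false;
    if (ip[9] != 17) return false;              /* protocol: UDP */

    const uint8_t *udp = ip + ihl;
    return be16(udp + 2) == DB_LSP_DISC_PORT;   /* UDP destination port */
}
```

Matching on the port alone keeps the check cheap enough to run on every received frame; confirming the JSON payload would need more RAM than a small controller may want to spend.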

Anyway, I’m still digging into the issue and trying to figure out a solution, but at least I feel a bit closer.