Controller lockups / crashes with wired Ethernet module

Tagged: Controller lock up hang crash

This topic has 163 replies, 16 voices, and was last updated 5 months, 3 weeks ago by Darian.

June 30, 2020 at 10:41 pm #67101

Wendell
Participant

I’ve been using OS for several years with no problems. Somewhat recently I started having problems with my 2.3AC controller crashing, and I would have to switch it off and back on to get it working again. After swapping power supplies, upgrading to the latest firmware, and switching my network connection to it, I finally decided the controller was fried, so I purchased a 3.2AC unit (running 2.1.9 (3) firmware) along with 2 replacement expansion units and the Ethernet module. Even with the new 3.2AC hardware I continue to have lockups (although the symptoms are a little different from the lockups I had with the 2.3AC hardware).

When the 3.2AC controller will no longer connect to my iOS mobile app, I go to the controller and I typically see:

1) the clock on the controller’s display is frozen at a time earlier in the day
2) none of the 3 buttons on the controller do anything
3) the Ethernet module’s internal red LED is still on, and the 2 LEDs on the RJ45 jack seem to reflect normal network activity
4) the controller still responds to pings
5) if the controller was in the middle of a zone watering, the zone gets stuck “on” when it crashes

Power cycling the controller gets it running again. The crashes are somewhat infrequent (although it has crashed twice today) and seemingly random (I’ve seen it crash both during a manual watering cycle and when the unit is just sitting idle). I’m not currently running any automatic watering cycles because I’m worried about it locking up and leaving a zone running, so I’ve stuck to manually started zones.

While trying to diagnose these problems I started running continuous pings to the controller (one every 2 seconds) so I could see if it stopped responding. I’ve found that even when the controller seems to be running fine, once in a while it will not respond to a ping, so when I look at my continuous ping logs I see a few ping timeouts each hour of the day. I’ve run continuous pings to other devices on the same switch as my OS controller, and they never experience timeouts, so I’m pretty sure it really is the OS controller that’s missing the pings a few times an hour. I don’t know if this is “normal” for the OS 3.2AC controller. Ray has been working with me to diagnose the problem, but so far we’re coming up blank.

Is anyone else seeing that continuous pings to their controller result in a handful of timeouts every hour when the ping is running?
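For anyone who wants to run the same kind of test, here is a minimal sketch of a continuous ping logger that tallies timeouts per hour. It is not the tool used above: the function names are made up for illustration, it shells out to the system `ping` command, and it relies on a subprocess timeout rather than platform-specific ping flags.

```python
import platform
import subprocess
import time
from collections import Counter
from datetime import datetime


def ping_once(host, timeout_s=2.0):
    """Send one echo request with the system ping command; True if it succeeded."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # give up if no reply arrives in time
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def monitor(host, interval_s=2.0, count=1800, ping=ping_once):
    """Ping `host` `count` times, `interval_s` apart; return (timestamp, ok) samples."""
    samples = []
    for _ in range(count):
        samples.append((datetime.now(), ping(host)))
        time.sleep(interval_s)
    return samples


def tally_timeouts(samples):
    """Count failed pings per clock hour from (timestamp, ok) samples."""
    misses = Counter()
    for stamp, ok in samples:
        if not ok:
            misses[stamp.strftime("%Y-%m-%d %H:00")] += 1
    return dict(misses)
```

Calling `tally_timeouts(monitor("192.168.1.50"))` (a placeholder address) would run for roughly an hour at the default settings and return the number of missed pings per hour. Running it against another device on the same switch at the same time gives a baseline for comparison.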

July 5, 2020 at 10:31 pm #67151

Ray
Keymaster

Hi Wendell, I am still trying to debug this issue for you. While I haven’t been able to reproduce the symptoms you reported, I have reason to believe that this may have to do with a DNS timeout issue. To check if this is the case, I would suggest you do the following:

1. turn off NTP sync (or set it to a valid NTP server IP, instead of leaving it as 0.0.0.0. When it’s set to 0.0.0.0 it will use pool.ntp.org by default).
2. temporarily disable IFTTT, MQTT if you are using either of them.
3. go to http://x.x.x.x/su and change weather.opensprinkler.com to 192.241.180.46 (this is the IP address of the weather server).
4. then reboot your controller.

Basically, the above steps will eliminate all DNS requests. The reason to try this out is that while looking at the UIPEthernet library, I discovered that occasionally the DNS requests may fail (this highly depends on your particular network setup) and this failure could lead to the ethernet controller’s TCP/IP stack getting messed up. I am not completely sure of this theory but it’s worth a try.
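As a rough illustration of what "eliminate all DNS requests" means, the sketch below checks which configured server values are hostnames (and so need a lookup) and which are literal IP addresses. The setting names and the helper functions are hypothetical, not part of the firmware. The only behaviour taken from this thread is that an NTP value of 0.0.0.0 falls back to pool.ntp.org.

```python
import ipaddress


def needs_dns(value, unset_means_hostname=False):
    """Return True if `value` must be resolved by DNS, False if it is a literal IP.

    With unset_means_hostname=True, 0.0.0.0 counts as needing DNS, modelling
    a setting that falls back to a default hostname when left unset.
    """
    try:
        addr = ipaddress.ip_address(value)
    except ValueError:
        return True  # not an IP literal, so it is a hostname
    return unset_means_hostname and addr.is_unspecified


def dns_dependent(settings, fallback_keys=("ntp_server",)):
    """List the (hypothetical) setting names that would still trigger DNS lookups."""
    return sorted(
        key
        for key, value in settings.items()
        if needs_dns(value, unset_means_hostname=key in fallback_keys)
    )


# Hypothetical snapshot of the settings before applying the steps above.
before = {"ntp_server": "0.0.0.0", "weather_server": "weather.opensprinkler.com"}
# The same settings after switching to literal addresses (NTP IP is a placeholder).
after = {"ntp_server": "192.168.1.1", "weather_server": "192.241.180.46"}
```

With these example values, `dns_dependent(before)` lists both settings and `dns_dependent(after)` is empty, which is the state the steps aim for.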

July 6, 2020 at 2:30 pm #67159

John K
Participant

Hi Wendell,

Are you able to temporarily use WiFi instead? I’ve been having similar issues over the last few months with the 3.0 AC controllers (multiple units). What I have deduced, and you can check whether you see the same, is that while running the controller with auto programs switched on, the controllers work flawlessly. However, when I manually trigger valves in the mobile app, I start to have problems with the units freezing and valves getting stuck open. I’ve only experienced this with the 3.0 AC controllers and never had this issue with the previous models (of which I’ve had many).

All of this happened while using the Ethernet connection. Since I switched to WiFi mode, I have yet to witness the problem. I can’t say this with 100% certainty because the problem is so random and I can’t reproduce it on demand. I have to manually run stations and watch for it to happen. Some days I’ve had it recur many times. Some days I can’t get it even once. So I am watching out for it now that it is in WiFi mode to see if this is the case.

Please let me know if you see the problem go away while using WiFi instead, so I know I am not crazy. Since it is hard to reproduce, it’s hard to diagnose, I believe. Ray has been a great help with all of this, and we have sorted out some other issues along the way as well, so I’m sure he will get it sorted out.

July 6, 2020 at 4:48 pm #67162

Wendell
Participant

John,

After doing quite a bit of troubleshooting I decided that the wired Ethernet controller was one of the suspects, so I unplugged the ribbon cable and switched to WiFi. Long term I don’t like this solution because it relies on a powerline extender in my garage, but I have been on WiFi for the last 2 days, and sure enough, it has yet to lock up like it had been quite frequently with the hard wired ethernet connection. I’m not 100% confident that the problem is completely gone (I’ll need more time for that), but it sure is looking like the issue may be caused by the ethernet module. My understanding is that the ethernet module uses a chipset that relies on a software based TCP/IP stack. I’m wondering if the built-in WiFi module has hardware based TCP/IP handling?

Yesterday while I was browsing some of the old forum posts I saw that some other people had been reporting lockups similar to what I was seeing, so I don’t think this is a new problem. At least one of the other reported lockup problems seemed to have started after a couple of years of reliable hardwired operation. That mirrors my situation. That person reported that setting up a VLAN with only the OS controller on it solved his problem, so that suggests that some other newly deployed equipment on the network may be sending some type of broadcast message that the software-based TCP/IP code chokes on. My equipment will support a VLAN, but I will have to rewire some things before I can make that change because I have a separate switch downstream from the switch that has the VLAN support.

Due to the random lockups I’ve been experiencing, I have been reluctant to enable my automatic programs. It’s interesting that you don’t seem to see the lockups unless you’re using manual watering. I think I’m going to continue with manual watering for now until I gain a better confidence that the WiFi interface isn’t susceptible to the lockup problem. I have noticed that sometimes it takes *much* longer than it should to connect to the controller from my iOS app, but I don’t know if this is related to WiFi connectivity issues or something else. At least it hasn’t locked up the controller yet!

I haven’t tried Ray’s above suggestions yet, but the only item in my setup he mentions that relies on DNS is the weather server. I would need to switch back to the hardwired connection to validate his suggestion, so for now I’m going to continue with WiFi to ensure that it isn’t susceptible to the bug in the first place.

July 6, 2020 at 5:40 pm #67164

John K
Participant

Hey Wendell,

My non-technical theory is that the Ethernet module causes the freeze when you access the app to manually trigger the valve. Either because of a voltage spike that the controller doesn’t like when the Ethernet is actively used, or something in the TCP/IP software side of things that does not happen while using the WiFi chip.

To give you a little more confidence in trusting the WiFi (and I’m with you on not wanting to use WiFi long term), I have 2 OS 3.0 controllers running over 80 stations. All of these run daily during my peak growing season (greenhouse production). I cannot confirm a single instance where the controllers froze in WiFi mode while the auto programs were cycling through. The only times I caught the issue were when manually accessing the controller mobile app to turn on stations…while using the Ethernet. For the record, I also cannot confirm a time when I have had an issue while using Ethernet and auto programs. This is what makes me think that it has something to do with manually accessing the controller through the mobile app while using Ethernet.

Sometimes I catch a station that is stuck on. When I reopen the app to see if I’m locked out, it will suddenly update the screen, the green highlighted station will update back to the default off appearance and the station will turn off. I can then check the Logs to see that the station ran for say 34 minutes instead of the manually set 10 minutes. I don’t always get a freeze where I HAVE to restart, but often I do.

I believe many people are able to essentially “set it and forget it” with their controllers and so they will not run into this issue, in Ethernet or WiFi modes. If this was not the case, there is no way only a few people would have noticed this issue; everyone would. I use mine extensively…taking advantage of the manual features daily, and so it hits me like no other. If that is not the case, then it’s just a problem on my end. If you are experiencing the same, then that validates that this isn’t just an issue with my setup…

Further, I still use the previous generation controllers in some of my greenhouses…all hardwired with Ethernet. I have one on the same network as the OS 3.0 systems that have this issue. None of the previous generation controllers have ever frozen on me in this manner. I use them manually all the time. I have NTP Sync unchecked on my controllers, FYI.

July 6, 2020 at 7:26 pm #67168

Wendell
Participant

John,

At least in my case, I don’t think the lockups are related to the solenoids switching on. At first I thought that was the case, but then I started noticing that the controller was locking up when nothing was going on. When it was connected via ethernet I’ve had several instances of finding the controller unresponsive to its buttons even when it’s just been sitting idle. Then I thought perhaps the act of connecting to it was causing the lockups, and while I’m not as sure about that now, I think it’s still a possibility. If Ray is correct that a DNS timeout causes problems, and if connecting to it causes it to refresh the Weather data (which is the only thing in my setup that uses DNS), that could explain how making a connection to it could cause a lockup.

Interestingly, shortly after I posted my last reply I tried to access the controller from my iPad and it timed out trying to connect. Then I tried accessing it from my iPhone and it also timed out, so I went to the controller and the display indicated it was trying to connect to the WiFi and all of the buttons were responding as they should (so it wasn’t locked up). With my iPhone right next to the controller I could see that I had a strong signal. The controller continued trying to connect, then acted like it had connected only to revert back to showing it was trying to connect a second or two later. I reset the powerline WiFi extender that I think it’s connecting to (I can’t be sure because the extender uses the same SSID as my other access points), but the controller still wasn’t connecting, so I left it sitting there while I ran an errand. About an hour and a half later when I returned I tried connecting from my iPhone again and it worked, so somehow it got itself sorted out while I was gone.

This isn’t the first time since I’ve had it using WiFi that I’ve thought it was locked up because it wouldn’t connect, only to find out that the controller buttons were still working fine and then it would eventually allow me to connect again without me having to reset the controller. When it was connected via Ethernet, I think every time it failed to connect it was due to the controller being locked up. Obviously I would rather have it fail to connect without being due to a lockup that can leave a station running, but either way it’s frustrating that sometimes you can’t access it to see what’s still running or to fire off another zone. I think the ethernet connection is a more reliable connection in general, so if we can get past the lockups I would switch back to ethernet right away.

I agree that this must not be a really widespread problem, and that tends to give some credibility to the theory that the ethernet module is susceptible to specific unusual network traffic. I have quite a number of different devices in my network, and I tend to add new things periodically, so it’s entirely possible that I added a problematic device within the last 6 to 12 months (when I started seeing lockups on my 2.3AC controller). I’m thinking that a hardware based TCP/IP stack ethernet module would likely cure the problem. Ray seems to think that it may not be terribly difficult to implement this because apparently it uses the same API, so code changes should be very minimal. I don’t know what it would do to the cost of the ethernet module… that might be a big negative.

July 6, 2020 at 10:09 pm #67170

John K
Participant

I have experienced the controller locking me out from the mobile app while no stations are in use, only to start working again a short time later without resetting the controller. I can’t recall if this was only while in ethernet mode, in wifi or both. I’ve been troubleshooting this since February between everything else going on, so I’ve lost track of some details. What got me to suspect the ethernet connection was the consistency with which I would have a stuck valve while using ethernet. I haven’t been able to see this problem in wifi, so until I do I can’t rule out the ethernet.

What would a “hardware based TCP/IP stack ethernet module” entail? I’m wondering if that is something I could do? Though I don’t have the skill to code…I would gladly buy a module if that solves the problem for me.

July 6, 2020 at 10:40 pm #67172

Ray
Keymaster

There are two Ethernet modules which are very popular in the open-source / maker community: Microchip’s ENC28J60, which requires a software TCP/IP stack, and Wiznet’s W5500, which has a hardware TCP/IP stack. There is no doubt that the W5500 is superior since it frees the microcontroller from having to handle the TCP/IP stack. It used to be that the W5500 was significantly more expensive. Also, since OpenSprinkler started as a DIY kit and only the ENC28J60 has a through-hole version, I naturally chose the ENC28J60. Since all OS legacy versions also use ENC28J60, it has been a pretty well tested platform. So even though its software TCP/IP stack requirement is a downside, I don’t think this chip itself has any intrinsic problem.

As we moved on to OS 3.0, which has built-in WiFi, it seems customers still want a wired Ethernet option, so I again chose ENC28J60 as the module to go with. Prior to firmware 2.1.8, we had been using the EtherCard library to handle ENC28J60. It works pretty well but it’s incompatible with Arduino’s Ethernet library, so that’s a big bummer. From firmware 2.1.8 we’ve started using the UIPEthernet library, which is implemented for ENC28J60 but is fully compatible with Arduino’s Ethernet library. This has the advantage of dramatically simplifying the code, since the same code is cross-compilable for all of OS 2.3, OS 3.0 and OSPi. We haven’t been using the UIPEthernet library for very long, so I am not entirely sure about its technical issues. It seems locking up is one potential recurring issue, and I’ve spent the weekend trying to debug and figure out the root cause of it. As John K said, it’s not so easy to debug as there is no fixed access pattern that will trigger this issue. Also, when the problem happens, my controller’s symptoms are very different from what Wendell observes: everything still seems to be running just fine, the controller responds to button clicks, time is correct, and programs still run, but the controller does not respond to ping tests or HTTP requests.

While digging into the issue, I’ve also invested some time looking at W5500 modules. The good thing is that since UIPEthernet is fully compatible with Arduino’s Ethernet library, which is also what the W5500 library is compatible with, changing the source code to use W5500 is almost just a matter of switching the header file. The only tricky thing is that these off-the-shelf W5500 modules have a different pin layout than ENC28J60, so I designed a small adapter that can convert the pin layout between the two. I am still waiting for the adapter PCBs to arrive. I have high hopes that W5500 should completely eliminate the lockup issue, and with the pin adapter it can easily replace your existing ENC28J60 module.

So in short summary, I am debugging UIPEthernet library for ENC28J60 but at the same time also getting prepared to transition to W5500.

July 6, 2020 at 11:20 pm #67174

Wendell
Participant

John,

EDIT – **NOTE** – I didn’t see Ray’s above post until after I sent the info below.

I don’t know if there’s already a commercial module out there that uses the same pinout as the one Ray provides (but based on the more capable chip), so Ray might have to first design a new Ethernet module. From what Ray told me, the W5500 chip (https://www.wiznet.io/product-item/w5500/) is one that looks like it would be relatively easy to adapt the code to use.

I haven’t taken the cover off of the ethernet module that Ray provides, but I’m wondering if it’s this board inside:

https://www.ebay.com/itm/NEW-MiNi-ENC28J60-Ethernet-LAN-Network-Module-For-Arduino-SPI-AVR-PIC-LPC-STM32-

If it is, then perhaps something like this could satisfy the hardware part of the equation:

https://www.newegg.com/p/2S7-01M5-00MD0?item=9SIAEC99HS2498

July 6, 2020 at 11:21 pm #67176

Wendell
Participant

Ray, regarding your latest post… I’m not sure if you were describing the symptoms you’ve seen in your recent testing or what I reported from my testing, so I want to clarify what I’ve observed:

1) when I do long term continuous Ping tests (one every 2 seconds), I see a handful of timeouts every hour. On every other device I’ve done long ping tests to it is *extremely* rare to see a ping timeout. I suspect this behavior from OS is due to the software based TCP/IP stack, so it may have absolutely nothing to do with the crashes I’m seeing.

2) when my controller locks up, it usually freezes the on-screen clock at the lockup time, and the buttons are non-functional until a power cycle reboot. Many (or most) of the lockups have actually left the controller responding to pings still, so something is still alive in the controller even though I can’t log into it or start/stop watering cycles.

July 6, 2020 at 11:25 pm #67178

Ray
Keymaster

Yes, this is the W5500 module that I was referring to. It’s also 2×5 pins just like the ENC28J60 module, but it’s a bummer that the pin ordering is not the same, otherwise it would have been directly replaceable. You would think that whoever designed these modules would use the same pin ordering, but they didn’t. In any case, as I said, I’ve already designed a small adapter PCB that plugs into W5500 module and rewires the 10 pins to the same 10 pins as the ENC28J60 module, so that solves the problem. Also, I’ve already modified the firmware, basically changing wherever UIPEthernet appears to Ethernet2 (the library that’s for W5500), and a few minor changes to remove functions that are not available in Ethernet2. I’ve verified that the firmware compiles and runs just fine on OpenSprinkler. Of course I have not yet done long-term testing, but this is a good starting point to show that it’s possible and relatively easy to replace ENC28J60 with W5500.

July 6, 2020 at 11:48 pm #67180

Wendell
Participant

Ray,

Sounds encouraging! I’d be happy to do some testing for you once you get to the point where it’s ready for that. Given how many crashes I’ve been seeing with the current Ethernet module, it shouldn’t take more than a couple of days to verify that the new W5500 module solves the problem. Do you want me to order one of the W5500 modules from NewEgg? It looks like it will take a while to get one since they’re shipping from China.

July 7, 2020 at 12:04 am #67181

John K
Participant

Ray,

Same here…anything I can do to help with this.

Would it be this one on amazon?

https://www.amazon.com/ARCELI-Ethernet-Network-Hardware-Microcontroller/dp/B07JLFN3T1/ref=sr_1_3?dchild=1&keywords=Wiznet+W5500&qid=1594098165&sr=8-3

Of course, won’t try to use it without the adapter.

July 7, 2020 at 12:27 am #67182

John K
Participant

Just to be sure I’m not misunderstanding, being that we all seem to be experiencing this issue with somewhat different symptoms, with regard to my symptom of the valves sticking open while the controller freezes, are we thinking this is all related to the same issue that hopefully changing the ethernet set up will resolve?

Also, I just realized my older OS 2.3 controller on the same network as the 3.2 controllers is using firmware 2.1.7…so the EtherCard library. So, were I to upgrade my firmware to 2.1.9…in theory I would start to have this problem on that controller too?

July 7, 2020 at 12:57 am #67184

Wendell
Participant

John,

That part you found on Amazon appears to be the same one I found on NewEgg, only Amazon seems to have it in stock in the US, so shipping time is a fraction of what NewEgg is quoting. Good find!

As for whether our varying symptoms all have the same root cause, I’m guessing they probably do. I suspect that the TCP/IP traffic that kills the software based stack is somehow causing code or data corruption (i.e. a buffer overrun) which in turn leads to unpredictable execution of the main controller code. I’ve seen a somewhat wide variety of symptoms myself, from simple lockups that don’t seem to have other consequences, to 2 zones running simultaneously even though I have all of my zones set to Sequential mode. Based on these differing symptoms, I will be surprised if the root cause doesn’t turn out to be a buffer overflow (or similar coding error). It’s probably buried in the library that Ray is using.

July 7, 2020 at 8:46 am #67189

Ray
Keymaster

Regarding modules: if you want to get these modules fast, you pretty much have to buy from Amazon with Prime shipping. These modules are also available from Aliexpress.com for a much cheaper price but those ship from China and can take weeks. I am only aware of one type of W5500 module:
https://www.amazon.com/ARCELI-Ethernet-Network-Hardware-Microcontroller/dp/B07JLFN3T1

On the other hand, ENC28J60 has several variants, but only the following two have 2×5 pins that match OS 3.0 design:
a wider module: https://www.amazon.com/ENC28J60-Ethernet-Network-Module-Arduino/dp/B01FDD3YYW
a thinner module: https://www.amazon.com/ENC28J60-Network-Module-Schematic-Arduino/dp/B07C2QNGCC

I still think the issue with ENC28J60 can be fixed in software. For one, we know that all OS 2.x used ENC28J60, albeit with the EtherCard library, and as far as I am aware lockup is not a common issue with OS 2.x. So it probably has to do with the UIPEthernet library. Also, at the minimum, if I can find a reliable condition to check when lockup has happened, then I can have the firmware trigger a software reboot, and this can be done in a program-safe way (i.e. it only reboots when there is no program running). As long as this doesn’t happen frequently, it should be a reasonable solution. At the moment, though, such a ‘condition’ is very elusive, because as I said, when the lockup happens on my test unit, the microcontroller still runs, time is correct, programs run, link status is fine, buttons still work. So the condition to flag lockup would have to come from reading ENC28J60’s register values to figure out a consistent pattern. Another way which I will try is to periodically issue a ping from the controller to the router, and see if the ping times out.
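The "program-safe reboot" idea can be sketched as a small piece of decision logic. This is only an illustration in Python, not firmware code: the class name, the threshold of consecutive failures, and the way a ping result is fed in are all assumptions.

```python
class LinkWatchdog:
    """Decide when a reboot is warranted from periodic ping results.

    Hypothetical sketch: a reboot is suggested only after `max_failures`
    consecutive failed pings to the router AND when no program is running.
    """

    def __init__(self, max_failures=5):
        self.max_failures = max_failures
        self.failures = 0  # consecutive failed pings seen so far

    def record(self, ping_ok):
        """Feed in the result of one controller-to-router ping."""
        self.failures = 0 if ping_ok else self.failures + 1

    def should_reboot(self, program_running):
        """True only if the link looks dead and rebooting would not cut off a program."""
        return self.failures >= self.max_failures and not program_running
```

Requiring several consecutive failures is one way to avoid rebooting on a single dropped ping. A check like this only helps while the microcontroller itself is still running.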

July 7, 2020 at 10:28 am #67192

John K
Participant

Ray,

Triggering a reboot can work for all the occasions when the controller is not being manually accessed and used, but it wouldn’t help when the controller freezes while a manual station is running. If it did reboot then, it would shut down the station at the same time, wouldn’t it? That would disrupt its use. I very much want to try the W5500 when you have it ready.

Also, I’ve sent you a few emails following up with our conversations the last few months, not sure if you have seen them?

Wendell,

I’ve never seen the controller trigger random stations in error, so to speak, like the 2 being on at once that you mention. That is strange. I should really update the 2.3 controller and see if I get the issues…I just don’t want to mess with it as it’s working fine.

July 7, 2020 at 11:17 am #67196

Wendell
Participant

John,

I think you are correct that rebooting isn’t a perfect solution (although it’s better than a zone getting stuck “on” forever). In theory you could restore the unit to the last known state by periodically writing the current status to NVRAM and then reading that back out on each reboot, but from my experience this can lead to other problems. And if the lock-ups are occurring due to a buffer overrun corrupting the program or data space, you really wouldn’t want to try to restart from the last known state anyway.
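To make the save-and-restore idea concrete, here is a minimal sketch of writing a status snapshot with a checksum and refusing to restore it if the checksum does not match. The CRC guard is an addition for illustration and is not something described above. The field names are hypothetical, and a plain `bytes` value stands in for NVRAM.

```python
import json
import zlib


def pack_state(state):
    """Serialize `state` and prefix it with a 4-byte CRC32 of the payload."""
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    return zlib.crc32(payload).to_bytes(4, "big") + payload


def unpack_state(blob):
    """Return the saved state, or None if the blob is too short or fails its CRC."""
    if len(blob) < 4:
        return None
    stored_crc = int.from_bytes(blob[:4], "big")
    payload = blob[4:]
    if zlib.crc32(payload) != stored_crc:
        return None  # snapshot was damaged in storage; do not restore it
    return json.loads(payload.decode("utf-8"))


# Hypothetical snapshot: which station is on and how long it has left.
snapshot = pack_state({"station": 3, "remaining_s": 240})
```

A checksum like this only catches a snapshot that was damaged after it was written. If memory was already corrupted when the snapshot was taken, the bad state is saved faithfully and passes the check, which is the objection raised above.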

The variety of symptoms I’ve seen really makes me think that the crashes aren’t just one portion of the system (e.g. the ethernet module) locking up, and if I’m correct, it would be nearly impossible to detect the crash condition and reliably reboot from it. Years ago I experienced an issue where an embedded controller was doing really strange things at random times. It occurred on only a small percentage of the units we made, but when the crashes happened, they were completely random. In our case it was a hardware bug in an Atmel CPU chip that was causing it to literally execute code at random locations… pretty much the worst case nightmare a programmer can run into. I suspect that the problem we’re seeing with the OS controller isn’t nearly this insidious, but until someone can figure out the actual cause of the crashes (versus simply detecting when they have occurred and trying to reboot), I don’t think there will be a good solution to the problem.

If Ray can pin down the actual cause of the problem then I agree that a software patch will likely solve all of the problems, but simply trying to detect when a problem has already occurred isn’t likely to be a viable solution. I’ll freely admit that I don’t know the architecture of the OS system (i.e. is everything running on the main CPU, or is there an additional microcontroller in the ethernet module?), so my hypothesis could be off due to not understanding how the various system components relate to one another, but if the entirety of the code is running on one CPU, the varied symptoms I’m seeing suggest that there is some type of widespread corruption of data at play.

Regardless of whether the root cause can be found and corrected, I would rather be running a system that handles the TCP/IP stack in hardware, since it should result in more reliable network operations and free up the CPU to run only the application itself (potentially making it more responsive to user inputs). Implementing the W5500 chip sounds like a really good idea to me!

July 8, 2020 at 8:22 am #67206

Ray
Keymaster

So yesterday my WiFi router had a problem and I rebooted it. Strangely enough, now I cannot observe any lost-connection issue on any of the test units. I have three test units, two OS 3.2 (running 2.1.9(3) and 2.1.9(4) respectively), one OS 2.3 (running 2.1.9(4)). On each unit I set a program that runs a zone for 1 minute and repeats every 10 minutes throughout the day. I also use IFTTT to send notifications to my email upon program start. All three units have been running fine so far (more than a day) and all notifications were successfully received. So, the issue has become more elusive than ever since now I cannot reproduce it. Given that I rebooted my router, I suspect the router may have had a DNS problem causing timeouts, which then led to issues on OpenSprinkler. In any case, I don’t have a means to debug the issue right now since I cannot reproduce it. But I will continue to explore the W5500 route as I am expecting the adapter PCB to arrive in a day or two.

July 8, 2020 at 9:08 am #67207

Wendell
Participant

Ray, I don’t think my issues are directly related to anything in my router being out of whack, because I’ve rebooted my router before when I’ve seen issues and it hasn’t helped. Over the last few days that I’ve been running via WiFi my system has yet to lock up like it was routinely when on the ethernet module, but I have had several instances where it is *very* slow to connect to the iOS app. It will repeatedly time out trying to connect, but if I leave it alone for awhile it will eventually come back online. I’m not going to spend a bunch of time trying to track down this issue because it sounds like we’ll have a good ethernet solution once your pin converter boards arrive. I ordered one of the W5500 boards yesterday, so I should have it this week.

Out of curiosity, does the internal WiFi functionality in the controller have hardware TCP/IP handling or does it rely on the same code that your ethernet module does?

July 8, 2020 at 9:10 am #67208

John K
Participant

Hi Ray,

I run my controllers with assigned programs daily, never do I have a valve get stuck open.

I ONLY have an issue if I open the mobile app and manually run a zone for X amount of time (3 to 15 mins), then minimize the app into the background (iPhone). It doesn’t happen every time but it does happen regularly enough that I will find the valves stuck open for longer than they should be. When I reopen the mobile app, either it will have timed out and I need to reset the controller, or the condition un-freezes and the valve finally turns off.

What is terribly challenging about this, as you said, is that it’s elusive. If I were to do as you did (“set a program that runs a zone for 1 minute and repeats every 10 minutes throughout the day”), I would expect it to work perfectly, no issues…because you are setting it and leaving it alone. That’s consistently been my experience, but that doesn’t address what Wendell or you have seen…

So for me, the best way to cause the issue for troubleshooting purposes is to be manually triggering the stations through the app over and over again for say 3 to 5 min durations and waiting to see if the zone gets stuck. If you have a way for me to set up logging of the connections, I could do that and replicate the problem on my end.

I think working with the W5500 will likely be the better thing to try first. I look forward to hearing about that.

Wendell,

I use Ubiquiti network equipment and have VLANs set up. You mentioned earlier that someone used them to isolate the controllers. I have a VLAN with only the controllers and an IP camera recorder on it…so if I were to have the controllers alone, might that make a difference for DNS reasons?

July 8, 2020 at 9:41 am #67209

Wendell
Participant

John,

The forum posts I saw were related to the controller crashing, and I’m not sure if that’s what you’re seeing happen when a valve gets stuck open on yours. When it happens to me, my controller is truly crashed and will not recover without a power cycling. The buttons on my controller become unresponsive, and the clock stops updating. I believe what I’m seeing is what the other forum member was seeing when they posted about the VLAN solution a couple of years ago. In their case, isolating the OS controller onto its own VLAN completely solved the crashing problem.

Have you ever seen a situation where the log files in the controller show incorrect runtimes for zones? I’ve actually seen “impossible” run times in my log files (i.e., a zone is reported to have run longer than it possibly could have based on the surrounding zone start and run times). This is one of those symptoms that leads me to believe there’s data corruption responsible for the controller crashing. Since you’ve noted that you have zones that run longer than they should but then finally shut down, I’m curious if your log files reflect the actual run time.
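One way to spot such “impossible” entries is to check whether a run's implied start time falls before the previous run ended. The sketch below is only an illustration: the `LogEntry` fields are hypothetical names rather than the controller's actual log format, and it assumes zones run one at a time, so overlapping runs should never appear.

```python
from typing import List, NamedTuple


class LogEntry(NamedTuple):
    # Hypothetical record layout; adapt to however you export the logs.
    station: int    # zone index
    end_time: int   # epoch seconds when the run ended
    duration: int   # logged run time in seconds


def find_impossible_runs(entries: List[LogEntry]) -> List[LogEntry]:
    """Return entries whose implied start precedes the previous run's end.

    Assumes zones run sequentially, so two runs should never overlap.
    """
    suspicious = []
    ordered = sorted(entries, key=lambda e: e.end_time)
    for prev, cur in zip(ordered, ordered[1:]):
        implied_start = cur.end_time - cur.duration
        if implied_start < prev.end_time:
            suspicious.append(cur)
    return suspicious
```

Any entry this returns claims to have been running while the previous zone was still on, which would be consistent with either a corrupted log record or a zone that really did stay open.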

You noted that you leave the app running in the background on your iPhone. What happens if you actually close the app after you start a manual zone? In my case, I’ve seen the controller crash even when there’s no iOS app running (and I don’t necessarily even have to be running a zone for the controller to crash). Based on my symptoms matching what the other forum member posted quite some time ago, I suspect a VLAN would solve my problem, but your issue sounds a little different. I’m currently running an ASUS router with several smart switches. I plan to switch to a UniFi system, but I’m waiting for them to release more of their WiFi 6 hardware before I take the plunge. I really like the UniFi system and the configurability / troubleshooting that you get with it.

July 8, 2020 at 9:48 am #67210

Wendell
Participant

John,

I forgot to mention one thing… I’m not convinced that the crashes I’m seeing are related to DNS timeouts. The forum member who posted about using a VLAN to solve the problem hypothesized that there’s some device on his network that sends a particular type of traffic that the software-based TCP/IP stack on the Ethernet module chokes on, thus causing the crash. By removing all other network traffic from the link to the OS controller, it never sees this sequence that sends it into a tailspin and it stops crashing. If taking your IP camera off the VLAN solves your problem, that would give us a huge clue about the problem, and make it much more likely we could isolate the cause of the crash.

July 8, 2020 at 10:08 am #67211

Ray
Keymaster

“the best way to cause the issue for troubleshooting purposes is to be manually triggering the stations through the app over and over again for say 3 to 5 min durations”: there is an easier way to do so. I can easily write a script to trigger this repeatedly and see what I find. The app uses the HTTP GET API, which can be called from a script.
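A minimal sketch of what such a script might look like, not the script Ray has in mind. The `/cm` endpoint and the `pw`, `sid`, `en`, and `t` parameters are assumptions based on the firmware's manual station run command and should be checked against the API document for your firmware version. The host, password hash, and the 5-second gap between runs are placeholders.

```python
import time
import urllib.parse
import urllib.request
from typing import Callable, List


def manual_run_url(host: str, pw_hash: str, station: int, seconds: int) -> str:
    """Build a manual-run URL. Endpoint and parameter names are assumptions."""
    query = urllib.parse.urlencode(
        {"pw": pw_hash, "sid": station, "en": 1, "t": seconds}
    )
    return f"http://{host}/cm?{query}"


def http_get(url: str) -> str:
    """Fetch a URL and return the response body as text."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def trigger_repeatedly(
    host: str,
    pw_hash: str,
    station: int,
    seconds: int,
    repeats: int,
    fetch: Callable[[str], str] = http_get,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Start the same station `repeats` times, waiting for each run to end."""
    results = []
    url = manual_run_url(host, pw_hash, station, seconds)
    for _ in range(repeats):
        try:
            results.append(fetch(url))
        except OSError as exc:  # covers URLError and socket timeouts
            results.append(f"ERROR: {exc}")
        # Wait for the run to finish plus a short gap (placeholder value).
        sleep(seconds + 5)
    return results
```

Something like `trigger_repeatedly("192.168.1.50", "<password hash>", 0, 180, 20)` would then run station 0 for 3 minutes, 20 times in a row, and a run of `ERROR:` entries in the result would mark the point where the controller stopped answering.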

July 8, 2020 at 1:53 pm #67212

John K
Participant

Wendell,

If you haven’t seen it already, this post https://opensprinkler.com/forums/topic/station-run-time-significantly-longer-than-manually-set-bug/ was my first attempt to figure things out, before I understood more about what was happening and why. The portion about the current reading proved to be irrelevant, and Ray addressed the hot swapping issue as well. The attached images of my logs are what I see if the controller did not crash but rather “unfroze” when I reopened the app. All those zones were manually turned on for 6 minutes….but some actually ran for much longer.

I have had plenty of instances where the controller would freeze, a valve would be stuck on, the 3 buttons on the controller would be unresponsive, and I had to unplug to reset. In those times the Logs would not show the specific zones that I had just been running. I never checked to see if the time read out on the controller was frozen during those instances. Also, if I tried reopening the app I would get the “timed-out” message until I unplugged/reset.
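To pin down when a freeze like this begins, a small poller can record each time the controller's web interface changes between responding and not responding. This is a rough sketch: the URL you probe, the number of checks, and the interval are all placeholders, and any failure (including an HTTP error status) is counted as "not responding".

```python
import time
import urllib.request
from typing import Callable, List, Tuple


def http_alive(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers successfully within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except OSError:  # connection errors, timeouts, and HTTP error statuses
        return False


def watch(
    probe: Callable[[], bool],
    checks: int,
    interval: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.time,
) -> List[Tuple[float, bool]]:
    """Poll `probe` and record (timestamp, state) whenever the state changes."""
    transitions: List[Tuple[float, bool]] = []
    last = None
    for _ in range(checks):
        state = probe()
        if state != last:
            transitions.append((clock(), state))
            last = state
        sleep(interval)
    return transitions
```

For example, `watch(lambda: http_alive("http://192.168.1.50/"), checks=1000, interval=10)` would return a short list of timestamps marking when the web interface went down and came back, which can be lined up against the controller's own logs.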

I have manually turned on stations in the app and then shut down the app. I have experienced the crashes on those occasions as well.

I will look into isolating to a VLAN further if need be, but at the moment I am putting my money on the W5500. I’m set on WiFi now and it works reliably so I don’t want to mess with it until I can try the W5500 idea.

Ray,

If you can send me the script and how to run it, I’d love to test my setup with it, and then we can know for sure that the script can trigger the issue. Also, then I can test out these ideas much more quickly too.