2015-3-27 14:56
The "Things" in our world are becoming increasingly complicated, and the failure modes a whole lot more interesting ("interesting" as in the old curse, "may you live in interesting times"). In the world of IoT, what is fixing broken things going to be like? Let's look at where we are, and extrapolate. My "check engine soon" light came on when I was driving home from the (first) visit to the auto shop (an oxygen sensor problem, supposedly fixed). I took it back so they could look at it again and reset it -- it came on again on the way home from the second trip, as I'm writing this, so it goes back for a third visit now. This reminds me of my high-tech, high-efficiency, computerized heating/air-conditioning unit, which has been down for almost a month, with six tech visits so far, interleaved with ordering parts from the manufacturer. It's still not working. On the sixth visit they brought in an expert on this unit, and the manager is coming out with him for visit seven after ordering enough parts to rebuild the unit if necessary, and the manager swears he won't leave until we're happy. OK, we have an extra bedroom, I'm ready. But as makers of complex stuff, how can we all do our part to keep this sort of thing from happening with our own products? Well, education is important, but it has to be the right education. I didn't think any of the techs who worked on my A/C were incompetent, but they were under time pressure to get in and get out, didn't want to install new parts when old ones were fine, and made a couple of basic debug process errors. I'm sure that all these guys know lots more than I ever will about A/C, but some somewhat-understandable (and a few "huh?") mistakes were made that led to extra trips. The first was that the first six visits were by different techs, a scheduling process failure. This was a hard failure, the best kind when identifying a problem. What if the problem was intermittent, or had only shown up when the unit was controlled over the Internet or by the power company's mesh radio network, and the problem involved interaction among networking, security, radios, and software? And maybe even malware? How many trips would that take? The length of time this is taking brings attention to the required practical debugging skills needed to minimize fix-time for low-to-medium-tech electronics-controlled devices. Those skills aren't ubiquitous. And worse, even high-efficiency residential heat pumps are relatively simple, compared to networked devices with multiple processors, multiple radios, a few million (or tens of million) lines of code, and chips the size of your fingernail that are so dense that if you de-cap them they don't even look interesting any more like chips used to. (The Museum of Modern Art in NYC once had an exhibit of blow-ups of complex chips -- they were exquisite modern art, many of them signed by their artists right there on the silicon). So now, let's talk about IoT -- WHO'S GOING TO FIX THIS SMART STUFF WHEN IT BREAKS? Think about what fixing broken IoT devices will be like. The more responsibility you give to electronics, the more annoying a failure can be -- with great power comes great responsibility, as Spiderman says. Downtime can affect lives in annoying and even profound ways. Defect opportunities increase exponentially with product complexity, and while increasing reliability of individual components and good modularity helps offset that, exponentials are hard to beat in the fullness of time. 
And many of these babies are complex, and connected to other things that are complex. What new skills will be needed to diagnose and fix them? We need to be thinking about *which* fundamentals and *which* debugging skills are needed in a world of IoT, and how to train people on them. And the designers of IoT products have to design them with service in mind. If the field-replaceable unit is cheap, fine, put in a new one -- but what if it's built into your house? Or your factory? Or it's your refrigerator? Or the problem is somewhere else in the network?

It's a simple fact that some field techs won't know Ohm's Law or the difference between WPA-PSK and WEP, that malware may be part of a problem, that most techs can't pull out a protocol analyzer and diagnose a failed handshake on an encrypted link, and so on. So they might have little choice but to start replacing components until the problem goes away, despite the cost and repair time -- when in fact the root cause might be a mis-typed port number or password, or that the router vendor once shipped software revs with DLNA turned off and THIS customer happens to be running one of them (or the owner, aware of DLNA's security problems, turned it off manually). Sometimes the problem gets fixed totally by accident, without anyone ever learning what it was so it can be kept from happening again.

I don't know about you, but I don't like that future much. Come to think of it, I'm living in it now, and it's getting pretty hot in the house.

The spread of IoT devices into society will go much slower than the people who sell them would like unless we think ahead to the support structure it's going to take to fix these things when they break. Otherwise, the backlash from field problems will slow adoption of new tech for everyone (fear is a great demotivator). How do you quickly fix your radio-controlled, Internet-connected door lock product that has trapped someone in their garage? Or help someone when your Internet-connected smart refrigerator sometimes forgets to order milk? Think ahead to how your product support and a tech will handle such problems. Remote diagnosis, self-diagnosis, mail-in programs, and online self-diagnosis charts to reduce truck rolls are certainly requirements, when they work, but they won't always work (bad power supply, loose connection, interfering radios next door, ...). And never forget security -- if you make bypassing security easy for techs ("For the super-user password, see p. 2 of the service manual"), you might make it easy for the bad guys. (More venting on that subject another day!)

The more complicated things are, the more skilled the diagnostic/repair people have to be, and/or the easier it has to be to service a failure. A big part of the fix is to teach fundamental skills well, so techs have the basics of how things actually work to fall back on when they're figuring out something they've never seen before -- not just "how to fix device x", but "basic things that anything that does what x does MUST do right". And you have to build escalation into the service process -- you can't send Superman to open every hard-to-open jar of pickles, because there's only one Superman, there are a lot of pickle jars, and maybe he has other, more pressing jobs. You need to know how and when to escalate, and how to do it fast.

So what does the IoT equivalent of teaching basic physics like refrigeration cycles and Ohm's Law to mid-level techs look like?
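Part of the answer is surely the ability to run, and read, a basic network triage. As a purely illustrative sketch (the addresses, host name, and port below are hypothetical, not tied to any real product), something like this, run from a tech's laptop or by the device itself, turns "it won't connect" into a specific, reportable fact:

import socket
import subprocess
import sys

GATEWAY = "192.168.1.1"           # hypothetical home-router address
CLOUD_HOST = "iot.example.com"    # hypothetical vendor service host
CLOUD_PORT = 8883                 # hypothetical service port (MQTT over TLS is a common choice)

def ping_ok(host):
    # One ICMP echo via the system 'ping' (flags shown are the Linux form).
    return subprocess.call(["ping", "-c", "1", host],
                           stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL) == 0

def dns_ok(name):
    # Does the name resolve at all?
    try:
        socket.getaddrinfo(name, None)
        return True
    except socket.gaierror:
        return False

def port_ok(host, port, timeout=3.0):
    # Does a plain TCP connection to host:port succeed?
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    checks = [
        ("gateway reachable", lambda: ping_ok(GATEWAY)),
        ("vendor host resolves", lambda: dns_ok(CLOUD_HOST)),
        ("vendor port connects", lambda: port_ok(CLOUD_HOST, CLOUD_PORT)),
    ]
    all_ok = True
    for label, check in checks:
        ok = check()
        print("%-22s %s" % (label, "OK" if ok else "FAIL"))
        all_ok = all_ok and ok
    sys.exit(0 if all_ok else 1)

A script like this won't find a bad power supply, but it will catch the mis-typed port number before anyone starts swapping boards, and "DNS resolves but the service port refuses connections" is the kind of fact a second-level engineer can act on over the phone.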
Whatever the full answer looks like, it certainly involves a more-than-cursory understanding of software, security, and networking -- far beyond just knowing about specific implementations (Linux, IP, ...), though that too. What else?

And what does a serviceable IoT device look like? Perhaps it exposes test points that can quickly explain why it's not talking (LEDs and test points are cheap; phone calls and truck rolls are expensive). When it can talk, it self-diagnoses and tells the tech where it hurts, in language that a tech with basic skills can understand even without knowing that particular device. It calls in sick, if it can -- maybe you can even fix it before the failure affects the customer. It provides the key information that makes it easy to decide when to escalate to the next level up in the service pyramid. It lets authorized parties diagnose it, ideally remotely, without becoming insecure. What else?

Postscript

Though we're still offline, the A/C problem has been tentatively traced to a power supply that failed in a way that over-voltaged multiple FRUs. Although each part had been replaced with a known-good unit, with no resulting change in behavior, the entire set of failed units was never replaced with known-good units *at the same time*. The techs were following common practice (modulo a couple of goofs), trying to isolate and correct a single failure at the lowest cost. If the current diagnosis is correct, then instead of seven visits the repair would optimally have taken four if the parts weren't stocked locally, or two if they were and the expert had made the second visit with parts in hand. From my selfish viewpoint, 1-2 weeks instead of 3-4.

Steve Bunch is a CS/EE software architect with experience in operating systems, system architecture, networking, radio, and software security. He was responsible for the creation of the embedded software platform used in several hundred million cellphones, and is currently involved in a startup operating in the networking space.