Transients' effects on the reliability of programmable electronics

H.K. Tang and Brian Lee
School of Electrical and Electronic Engineering, Nanyang Technological
University, Singapore

Introduction
It has been said that we are living in the age of microelectronics and computers.
They are present in almost every electronic product and system and are also
used heavily in products which are not normally classified as electronics. These
items range from washing machines to automobiles. All these products and
systems have one thing in common. Their electronics are mostly based on
microelectronics hardware and their operations are programmed by software.
In other words, they are programmable electronics. Programmable electronics’
reliability depends not only on the reliability of the constituent hardware and
software but also on the ambient physical environment.
Electronic hardware is inherently more reliable than most mechanical
equipment due to the lack of wear and tear. From the advent of the transistor in
the late 1940s to the latest million-transistor microprocessor chip, the reliability
of microelectronics has improved steadily. The device failure rate model follows
the Weibull distribution in its early life and is followed by a very long, useful life
of constant failure rate. Typically the constant failure rate ranges from a few
ppb to a few hundred ppb[1]. Thus, electronic hardware is rarely responsible for
failures, even for very complex computer systems[2].
As the processing power of microprocessors (measured in million instructions
per second or MIPS) increases, software complexity also increases to harness
their power for better performance and to produce more functions. While the
control program for a washing machine may be just a few thousand lines of
instruction, it is not unusual nowadays to find software with a million lines of
instruction, even in personal computers. Software of such complexity also
controls the modern telephone exchanges, aeroplanes and non-stop computers for
banking and finance. The proliferation of programmable electronics gives rise to
concern over the risks of software. To contain the risks of software, structured
programming, software quality assurance and fault-tolerance techniques are
increasingly being used[1,3]. In a survey of well-debugged programs, MTTFs ranging from 1.6 years to 5,000 years were reported[4].
It is well known that temperature and humidity affect the reliability of electronics. The methods to reduce their detrimental effects are also well known. One aspect of the physical environment, however, is not widely known, although it is gaining recognition as one of the most serious elements that affects electronics in general and programmable electronics in particular. This is the susceptibility of electronics to electromagnetic interference (EMI), which is also known as radio frequency interference (RFI). In short, EMI affects programmable electronics' reliability through interaction with the hardware and software.
A transient is one particular form of EMI and a major cause of failures in programmable electronics. It is a short burst of electromagnetic energy that enters the victim equipment via conduction on cables and other conductors, or via electromagnetic radiation. Strong transients can
cause permanent physical damage while weak transients cause only transient
faults that involve no physical damage. Nevertheless, transient faults can still wreak havoc with the operation of programmable electronics. Since there is no evidence of physical damage, failures due to transient faults are often mistaken for software faults, leading failure analysis in the wrong direction. It is for
these reasons that transients and their effects need to be understood better by
those responsible for product quality and reliability.

Susceptibility of programmable electronics


EMI is part of the physical environment. It is either natural or man-made. There
are many sources of EMI. They include lightning, radio transmitters, motors, electrical circuit breakers, electrostatic discharge and personal computers[5]. EMI can be transient in duration, as produced by lightning, or continuous, as produced by broadcast radio. EMI is either conducted by cables, such as power and data interface cables, or radiated through the atmosphere. If suitable countermeasures against EMI are not taken, sensitive electronic equipment will suffer interference. The result of the interference could be temporary or
permanent loss of performance. Due to the proliferation of electrical and
electronic apparatus, particularly computing devices, man-made EMI has been
on the increase. The situation became so serious that in the early 1980s
regulations were imposed internationally to limit the amount of EMI that can be
emitted from computing devices or information technology equipment.
Limiting EMI emission of computing devices, however, does not eliminate
EMI completely. There are still the natural and other man-made EMI sources,
such as lightning and electrical circuit-breakers. These produce disturbances that last only a short time, ranging from nanoseconds to milliseconds. They enter electronic equipment via power and interface cables or couple into the equipment as transient electromagnetic radiation. The results are transient voltages and currents, transients for short, in the electronic hardware.
While programmable electronics are not the only potential victims of
transients, they are especially susceptible to this form of EMI and the possible
failures could be more serious. Take, for example, a radio receiver with no
programmable electronics; the effect of a lightning strike nearby may be just a
clicking noise added to the received signal. On the other hand, transients caused
by lightning may lead a traffic light controller, controlled by programmable
electronics, into an unsafe state, such as turning all the green lights on.
Similarly, a financial transaction computer may enter a piece of wrong data, with serious financial consequences.
In view of the ever-increasing use of electronics, especially programmable electronics, and the concern about their safety and reliability, international standards on immunity against EMI have been adopted and will be imposed, from 1996 onward, on products sold in the European Community. The European Norm EN 50082-1 will cover residential, commercial and light industrial environments, while EN 50082-2 will cover industrial environments. In other words, mass-produced apparatus as well as industrial, scientific and medical equipment will be affected[6].

Transients and failures


Transients can cause faults, which in turn can cause errors and errors can cause
failures. Following the computer community, the definitions of these terms are
given below[2,7]:
• A fault is an incorrect state of hardware or software resulting from
failures of components, physical interference from the environment,
operator error, or incorrect design.
• An error is the manifestation of a fault within a program or data
structure. It is a deviation from accuracy or correctness.
• A failure is the non-performance of some action that is due or expected.
Faults can be classified into three types: permanent, intermittent and transient
(often intermittent and transient are not differentiated in usage). A permanent
fault exists indefinitely until it is corrected by repair to the hardware. An
intermittent fault appears, disappears and reappears repeatedly. It is due to
impaired physical conditions of the hardware and can be repaired by part
replacement or correction. A transient fault appears and disappears within a
very short period of time and involves no damage to the hardware.
Transients caused by EMI are a major, but not the only, cause of transient faults. Electrostatic discharge (ESD) can also cause transient faults through the
concomitant electromagnetic radiation. Defective software has also been
identified as another major source of transient faults in software intensive
systems. According to some case studies of mature and well-debugged systems,
transient faults account for more than 80 per cent of all failures observed[2,8,9].
The mechanisms by which transients produce failures are very complicated.
Simply stated, they depend first on the physical interaction between the
hardware and the sources of transients and, second, on the states of the
software at the times that the transient faults occur. Three more factors
complicate the situation further and make transient faults and their associated
failures so hard to deal with. These factors are now discussed.

Probability of failure
A transient does not always cause a failure. Transients are inherently random
in nature. Their frequency of occurrence, waveforms and strength are random
variables. Thus, a transient may or may not produce a fault. When it does produce a fault, it could be a transient, an intermittent or a permanent fault, depending on its strength and waveform (see Figure 1).

Figure 1. Transients and their possible effects on programmable electronics: a transient acting on the hardware may cause damage (permanent faults), partial damage (intermittent faults), transient faults or no fault; at the software level, faults may or may not become errors, and errors may or may not become failures.
Even when a transient fault occurs, an error does not always result. For instance, if a fault forces a data bit to 1 when it should be 0, an error occurs; if the bit should be 1 anyway, the same fault produces no error. Similarly, an error does not always end in a failure. For instance, a transient may corrupt data, but if the data are not read and used, or are overwritten by correct data, the error cannot cause a failure. Failure due to a transient is therefore a highly complicated and random process.
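By way of illustration, the short Python sketch below (a toy model; the bit values and the assumed 0.3 probability of the corrupted data being read before they are overwritten are illustrative assumptions, not figures from this study) classifies each simulated transient hit as producing no error, a latent error or a failure:

    import random

    def transient_outcome(rng: random.Random) -> str:
        """Classify one simulated transient hit (toy model, illustrative only)."""
        correct_bit = rng.randrange(2)      # value the memory bit should hold
        forced_bit = rng.randrange(2)       # value the transient forces onto it
        if forced_bit == correct_bit:
            return "no error"               # a fault occurred, but no error results
        # An error now exists; it becomes a failure only if the corrupted value
        # is read and used before correct data overwrite it.
        if rng.random() < 0.3:              # assumed read-before-overwrite probability
            return "failure"
        return "latent error"

    rng = random.Random(1)
    counts = {"no error": 0, "latent error": 0, "failure": 0}
    for _ in range(100_000):
        counts[transient_outcome(rng)] += 1
    print(counts)   # only a fraction of the simulated faults end up as failures

With these assumed figures, roughly half of the simulated faults cause no error at all and only about 15 per cent become failures, mirroring the argument above.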
In the case of a very strong transient with very fast rise and fall times the
induced transient voltages and currents in the hardware will also be high and
distributed extensively throughout the hardware. This leads to a very high
probability for transient-induced failure. This could also happen with less severe
transients if the hardware design is very poor and, therefore, highly susceptible to
transients. Otherwise, the probability of failure due to transients will be low and
could be modelled by a Poisson random process of rare events as below.
The software which is being executed by the hardware is characterized by the
presence of time intervals during which the software is susceptible to transient
faults. Such intervals could be called the susceptible windows, for example, the
intervals during which crucial data are being transferred between the processing
unit and some memory or input/output device. Typically, these susceptible windows represent a small fraction of the total observation time, hence random transients hitting the susceptible windows can be considered as rare events[10].
The probability of developing a transient-induced failure can be calculated easily. Let the observation time be T, within which there are m identical susceptible windows, each of duration t. The probability, f, of developing at least one failure, with the equipment in question subject to n random transient faults, occurring one at a time and uniformly distributed over T, is given by:

f ≈ 1 – exp(–nmt/T)

provided that the following condition is met:

mt ≪ T.
The implications of the above expression are obvious. The more frequent are the
transient faults or the occurrences of susceptible windows within a fixed period of
time, the higher the probability of developing a failure. The fewer in number are the
susceptible windows, the longer it will take to develop a transient-induced failure.
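As a numerical illustration, the following sketch (hypothetical Python code assuming identical, evenly spaced windows and uniformly distributed transients; the function names and example figures are not from this study) evaluates the closed-form expression and checks it against a direct Monte Carlo simulation:

    import math
    import random

    def failure_prob(n: int, m: int, t: float, T: float) -> float:
        """Closed-form approximation f = 1 - exp(-n*m*t/T), valid when m*t << T."""
        return 1.0 - math.exp(-n * m * t / T)

    def failure_prob_mc(n: int, m: int, t: float, T: float,
                        trials: int = 50_000, seed: int = 0) -> float:
        """Monte Carlo check: m windows of length t start every T/m seconds;
        a trial fails if any of the n uniform transient hits lands in a window."""
        rng = random.Random(seed)
        spacing = T / m
        failures = 0
        for _ in range(trials):
            if any(rng.uniform(0.0, T) % spacing < t for _ in range(n)):
                failures += 1
        return failures / trials

    # Hypothetical figures: 100 windows of 1 ms in a 100 s observation, 20 transients.
    n, m, t, T = 20, 100, 1e-3, 100.0
    print(round(failure_prob(n, m, t, T), 4))      # 0.0198
    print(round(failure_prob_mc(n, m, t, T), 4))   # close to the analytic value

With the example figures, nmt/T = 0.02, so both estimates come out near 0.02, i.e. roughly a 2 per cent chance of at least one transient-induced failure over the observation period.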

Error latency
An error does not always cause a failure immediately. In some cases it may take a
long time to do so. The time period between the occurrence of an error and its
associated failure is called error latency. Take for example, a piece of data which is
corrupted while it is written into a memory device – a failure will not occur until it
is retrieved and used by the processor. Thus, the error is dormant and undetected.
Such an error is called a latent error and can be likened to a computer virus.
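A minimal sketch of error latency (hypothetical Python, assuming that reads of the corrupted memory location arrive as a Poisson process with a five-second mean interval) is given below:

    import random

    def error_latency(rng: random.Random, mean_read_interval: float = 5.0) -> float:
        """Data are corrupted at time zero; the error stays dormant until the
        corrupted location is next read, at which point a failure occurs.
        Reads are assumed to arrive as a Poisson process (exponential gaps)."""
        return rng.expovariate(1.0 / mean_read_interval)   # seconds until the next read

    rng = random.Random(7)
    latencies = [error_latency(rng) for _ in range(10_000)]
    print(sum(latencies) / len(latencies))   # mean latency, roughly five seconds
    print(max(latencies))                    # a few errors stay latent far longer

The long tail of the simulated latencies illustrates why a latent error can remain dormant well beyond any test period.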

Elusiveness
As pointed out earlier, transients are random in nature. When a failure occurs
and is detected, the source of the transients could have disappeared or become
quiescent for a long time. This makes troubleshooting and tracing the origin of
the failure extremely difficult. Transient faults could produce many different
failures and some of them seldom repeat. During the product-development
stage, transient faults could often be masked by the more dominant hardware
and software faults. All these factors could lead the engineers to wrong
conclusions when diagnosing failures.

Design against transients


For the above reasons, a defensive design strategy is needed to combat transient faults and achieve reliability. Such a strategy can be implemented at several levels. The first and most fundamental level is the hardware. Shielding, proper grounding of cables, filtering, good circuit-board layout and installing transient absorbers are essential techniques for fault avoidance[5].
The next level is the software and data structures. At this level, the design objective is fault tolerance. The purpose of fault tolerance is to prevent faults leading to errors and errors leading to failures. There are many techniques used to achieve fault tolerance, e.g. error-correction coding, redundancy and performance monitoring[2,7]. Some techniques employ only software, while others use software and additional hardware.
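As one concrete, hedged illustration of the redundancy idea mentioned above (a sketch, not a technique prescribed here), a value can be stored in triplicate and read back through a bitwise majority vote, so that a transient fault corrupting any single copy is masked before it can become an error:

    def tmr_write(value: int) -> list[int]:
        """Store three redundant copies of an 8-bit value (triple modular redundancy)."""
        return [value & 0xFF] * 3

    def tmr_read(copies: list[int]) -> int:
        """Bitwise majority vote over the three copies; a single corrupted copy
        is out-voted, so the transient fault never becomes an error."""
        a, b, c = copies
        return (a & b) | (a & c) | (b & c)

    stored = tmr_write(0b1010_0110)
    stored[1] ^= 0b0001_0000                  # a transient flips one bit in one copy
    assert tmr_read(stored) == 0b1010_0110    # the corruption is masked; no failure

The vote masks any single-copy corruption; faults that hit two copies, or the voting code itself, are not covered, which illustrates the limitation discussed next.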
However, it is important to realize that no fault tolerance technique gives 100
per cent fault coverage. Some errors may not even be detectable, so failure could
still occur despite fault tolerance techniques. The additional hardware and
software to implement fault tolerance could also fail due to transient faults.
Moreover, fault tolerance techniques often mean an additional workload which slows down system performance.
To test the adequacy of design, a transient simulator should be used in
prototype testing. The objective is to force the equipment under test into failure
so that its weak points can be discovered and ameliorated. The IEC standard 801-4 on electrical fast transient/burst requirements[11] defines tests of this kind.
Figure 2. A typical design-manufacture-operate process (initial specifications, design, prototyping/testing, final specifications/manufacture documents/test plans, manufacture, field test, acceptance) with possible undesirable loops (a)-(d) due to discovery of transient problems.
•	The transient problems remain undiscovered until the manufacture stage, by which time the cost would have escalated significantly. Besides extra engineering time, there would be material scrap and extensive revisions of documents. At this stage, redesign will require some or all of the following measures: the re-layout of printed circuit boards, re-routeing or changing the types of cables and wires used, addition of components for EMI suppression and, sometimes, modification of the software for fault tolerance.
•	The transient problems remain undiscovered until field testing. At this stage the customers would be involved. Facing failures that are extremely hard to diagnose, for the reasons given earlier, the supplier-customer relationship would be strained. The elusiveness of the sources of transients means many trips to the field by the engineers. Material costs and man-hour overruns would escalate further when compared with earlier discovery (above). The flexibility for a redesign is considerably reduced because the time and finance involved have not been budgeted.
• It is possible that even field testing does not expose the inherent
susceptibility to transients. One possible reason is insufficient test
duration. As explained above, a transient failure depends on the rare
concurrence of the transients and the susceptible windows, so over a
short period of time failures associated with transients may not develop
(another possible reason is long error latency). Thus, the error remains
dormant during the entire field test. Subsequent to the field test and
acceptance, maybe after a long time, the error becomes active and failure
develops. For a safety-critical system or a system handling large financial transactions, the consequence could be serious and result in societal loss.

Management’s responsibility
Given the possible serious consequences that transients have on programmable
electronics, management must be alert to the potential problem. It must take the
necessary steps to ensure the confinement of transients’ effects on reliability.
Following the ISO 9001 standard on quality systems[12], management’s
responsibility should include at least the following:
• Define all the personnel at various levels and functions who will be
responsible for ensuring that the specifications, design, testing and
installation do take transients into account.
• Review contracts or product specifications to ensure that the intended
operational electromagnetic environment is well defined. If the latter is
not defined by the customers then relevant standards should be followed.
•	Help to set, as a design objective, the immunity levels of the equipment in question towards defined transients.
•	Ensure that all test plans include transient susceptibility tests with defined procedures. Susceptibility tests must be performed in the development stage as well as in final and field testing.
• Review the design and test records and check if the immunity design
objective is achieved.
•	Ensure that service records reflect any incidence of failures due to transient faults.
• Establish a document that records the objective, plans and results
pertaining to the above points.
Although implementing the above points and the work they entail represents an additional cost to the supplier, it should be compared with the potential loss due to negligence. Indeed, as explained earlier, the loss to the supplier and possibly to society could be extremely high.

Conclusion
The nature of transients and the associated failure mechanisms in programmable electronics have been discussed. The importance of design and of management's role with regard to transients has been stressed. In view of the pending European regulation on immunity against EMI and the possibly serious consequences of ignoring the issue, management must not neglect transients' effects on product reliability. It must take the lead and ensure the reliability of products in their intended operational environment.
References
1. Irland, E.A., “Assuring quality and reliability of complex electronic systems: hardware and
software”, Proceedings of the IEEE, Vol. 76 No. 1, January 1988, pp. 5-18.
2. Siewiorek, D.P. and Swarz, R.S., Reliable Computer Systems: Design and Evaluation, 2nd ed., Digital Press, Geneva, 1992.
3. Avizienis, A. and Laprie, J., “Dependable computing: from concepts to design diversity”,
Proceedings of the IEEE, Vol. 74 No. 5, May 1986, pp. 629-38.
4. Littlewood, B. and Strigini, L., “The risks of software”, Scientific American, November
1992, pp. 38-43.
5. Ott, H.W., Noise Reduction Techniques in Electronic Systems, Wiley, New York, NY, 1988.
6. Davies, J., “The European (CENELEC) generic immunity standards”, EMC Test and
Design, November-December 1992, pp. 49-50.
7. Johnson, B., Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley,
Reading, MA, 1988.
8. Iyer, R.K. and Rossetti, D.J., “A measurement-based model for workload dependence of CPU
errors”, IEEE Transactions on Computers, Vol. C-35 No. 6, June 1986, pp. 511-19.
9. Duba, P. and Iyer, R.K., “Transient fault behavior in a microprocessor, a case study”,
Proceedings of IEEE International Conference on Computer Design, 1988, pp. 272-6.
10. Papoulis, A., Probability and Statistics, Prentice-Hall, Englewood Cliffs, NJ, 1990, Chapter 3.
11. International Electrotechnical Commission (IEC), IEC 801-4 Electromagnetic Compatibility
for Industrial-process Measurement and Control Equipment, Part 4: Electrical Fast
Transient/Burst Requirements, IEC, 1988.
12. International Organization for Standardization (ISO), ISO 9001, Quality Systems — Model for Quality Assurance in Design/Development, Production, Installation and Servicing, ISO, Geneva, 1987.

Further reading
Tang, H.K. and Er, M.H., “EMI-induced failure in microprocessor-based counting”,
Microprocessors and Microsystems, Vol. 17 No. 4, 1993, pp. 248-52.
