Transients’ effects
Introduction
It has been said that we are living in the age of microelectronics and computers.
They are present in almost every electronic product and system and are also
used heavily in products which are not normally classified as electronics. These
items range from washing machines to automobiles. All these products and
systems have one thing in common. Their electronics are mostly based on
microelectronics hardware and their operations are programmed by software.
In other words, they are programmable electronics. Programmable electronics’
reliability depends not only on the reliability of the constituent hardware and
software but also on the ambient physical environment.
Electronic hardware is inherently more reliable than most mechanical
equipment due to the lack of wear and tear. From the advent of the transistor in
the late 1940s to the latest million-transistor microprocessor chip, the reliability
of microelectronics has improved steadily. The device failure rate follows the
Weibull distribution in early life, followed by a very long useful life at a
constant failure rate. Typically, the constant failure rate ranges from a few
ppb to a few hundred ppb[1]. Thus, electronic hardware is rarely responsible for
failures, even for very complex computer systems[2].
As the processing power of microprocessors (measured in million instructions
per second or MIPS) increases, software complexity also increases to harness
their power for better performance and to produce more functions. While the
control program for a washing machine may be just a few thousand lines of
instruction, it is not unusual nowadays to find software with a million lines of
instruction, even in personal computers. Software of such complexity also
controls the modern telephone exchanges, aeroplanes and non-stop computers for
banking and finance. The proliferation of programmable electronics gives rise to
concern over the risks of software. To contain the risks of software, structured
programming, software quality assurance and fault-tolerance techniques are
increasingly being used[1,3]. In a survey of well-debugged programs, MTTF
ranging from 1.6 years to 5,000 years was reported[4].
It is well known that temperature and humidity affect the reliability of
electronics. The methods to reduce their detrimental effects are also well
known. One aspect of the physical environment, however, is not widely known,
although it is gaining recognition as one of the most serious elements that
affects electronics in general and programmable electronics in particular. This
is the susceptibility of electronics to electromagnetic interference (EMI),
which is also known as radio frequency interference (RFI). In short, EMI affects
programmable electronics’ reliability through interaction with the hardware
and software.

International Journal of Quality & Reliability Management, Vol. 13 No. 2, 1996,
pp. 66-74, © MCB University Press, 0265-671X
A transient is one particular form of EMI that is a major cause of failures for
programmable electronics. It takes the form of a short burst of electromagnetic
energy that enters victim equipment via conduction on cables and other forms of
conductor, or via electromagnetic radiation. Strong transients can cause
permanent physical damage, while weak transients cause only transient faults
that involve no physical damage. Nevertheless, transient faults can still cause
havoc to programmable electronics’ operations. Since there is no evidence of
physical damage, failures due to transient faults are often confused with
software faults, misleading failure analysis. It is for these reasons that
transients and their effects need to be understood better by those responsible
for product quality and reliability.
Probability of failure
A transient does not always cause a failure. Transients are inherently random
in nature: their frequency of occurrence, waveforms and strength are random
variables. Thus, a transient may or may not produce a fault. When it does
produce a fault, it could be a transient, an intermittent or a permanent fault,
depending on its strength and waveform (see Figure 1).
Figure 1. Transients and their possible effects on programmable electronics:
transients act through hardware and software, which may or may not produce
errors, and errors may or may not lead to failures.
Even when a transient fault occurs, an error does not always result. For
instance, if a fault causes data to be 1 when it should be 0, an error will
occur. On the other hand, if the data are already a 1 then the transient fault
does not result in an error. Similarly, an error does not always end up in a
failure. For instance, a transient may cause a data error, but if the data are
not read and used, or are overwritten by correct data, then it cannot cause a
failure. So failure due to a transient is a highly complicated and random
process.
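The fault/error/failure distinction can be sketched in a few lines of Python. This is a hypothetical single-bit stuck-at-1 fault model chosen to match the example in the text; the helper names are illustrative, not from the article:

```python
def inject_fault(word: int, bit: int) -> int:
    """Model a transient fault that forces one bit of a word to 1."""
    return word | (1 << bit)

# Fault on a bit that should be 0: the stored value changes -> an error.
faulty = inject_fault(0b0100, 0)
assert faulty != 0b0100              # error occurred (word is now 0b0101)

# Fault on a bit that is already 1: the value is unchanged -> no error.
assert inject_fault(0b0101, 0) == 0b0101

# An error need not become a failure: if the corrupted word is
# overwritten with correct data before any read, nothing goes wrong.
memory = {"x": faulty}               # latent error sitting in memory
memory["x"] = 0b0100                 # overwritten before use -> no failure
assert memory["x"] == 0b0100
```

The chain only completes (fault → error → failure) when the fault changes stored state *and* that state is consumed before being corrected.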
In the case of a very strong transient with very fast rise and fall times the
induced transient voltages and currents in the hardware will also be high and
distributed extensively throughout the hardware. This leads to a very high
probability for transient-induced failure. This could also happen with less severe
transients if the hardware design is very poor and, therefore, highly susceptible to
transients. Otherwise, the probability of failure due to transients will be low and
could be modelled by a Poisson random process of rare events as below.
The software being executed by the hardware is characterized by the presence of
time intervals during which the software is susceptible to transient faults.
Such intervals could be called susceptible windows: for example, the intervals
during which crucial data are being transferred between the processing unit and
some memory or input/output device. Typically, these susceptible windows
represent a small fraction of the total observation time, hence random
transients hitting the susceptible windows can be considered rare events[10].
The probability of developing a transient failure can be calculated easily. Let
the observation time be T, within which there are m identical susceptible
windows, each of duration t. The probability, f, of developing at least one
failure, with the equipment in question subject to n random transient faults,
occurring one at a time and with a uniform probability distribution function
throughout T, is given by:

f ≈ 1 – e^(–nmt/T)

provided that the following condition is met:

mt « T.
The implications of the above expression are obvious. The more frequent the
transient faults, or the occurrences of susceptible windows, within a fixed
period of time, the higher the probability of developing a failure. The fewer
the susceptible windows, the longer it will take to develop a transient-induced
failure.
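The expression above can be checked against a small Monte Carlo sketch. Evenly spaced, non-overlapping windows are an assumption of this sketch (the formula itself only requires mt « T), and the parameter values are illustrative:

```python
import math
import random

def simulated_failure_prob(n, m, t, T, trials=50_000, seed=1):
    """Estimate the probability that at least one of n uniformly
    distributed transient faults lands inside one of m susceptible
    windows of duration t within observation time T."""
    rng = random.Random(seed)
    spacing = T / m  # windows assumed evenly spaced at 0, spacing, 2*spacing, ...
    hits = 0
    for _ in range(trials):
        # A fault at time x falls in a window iff its offset from the
        # nearest window start (x mod spacing) is less than t.
        if any(rng.uniform(0, T) % spacing < t for _ in range(n)):
            hits += 1
    return hits / trials

n, m, t, T = 50, 10, 0.001, 10.0           # mt/T = 0.001, so mt << T holds
poisson = 1 - math.exp(-n * m * t / T)     # f from the formula, about 0.049
estimate = simulated_failure_prob(n, m, t, T)
assert abs(estimate - poisson) < 0.01      # agreement within sampling noise
```

With mt « T the per-fault hit probability mt/T is small, so the exact binomial probability 1 – (1 – mt/T)^n and the Poisson approximation 1 – e^(–nmt/T) are nearly indistinguishable, as the simulation confirms.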
Error latency
An error does not always cause a failure immediately. In some cases it may take
a long time to do so. The time period between the occurrence of an error and
its associated failure is called error latency. Take, for example, a piece of
data which is corrupted while it is written into a memory device: a failure
will not occur until it is retrieved and used by the processor. Until then, the
error is dormant and undetected. Such an error is called a latent error and can
be likened to a computer virus.
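As a sketch, error latency can be expressed as the gap between the corruption instant and the first subsequent read of the corrupted data. The helper below is hypothetical, introduced only to make the definition concrete:

```python
def error_latency(corrupt_time, read_times):
    """Return the delay between the moment data is corrupted and the
    first later read that uses it, or None if the data is never read
    again (a latent error that never activates)."""
    later = [r for r in read_times if r >= corrupt_time]
    return min(later) - corrupt_time if later else None

# Data corrupted at time 2.0; reads occur at times 1.0, 5.0 and 9.0,
# so the failure surfaces 3.0 time units after the error.
assert error_latency(2.0, [1.0, 5.0, 9.0]) == 3.0
# If the only read happened before the corruption, the error stays dormant.
assert error_latency(2.0, [1.0]) is None
```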
Elusiveness
As pointed out earlier, transients are random in nature. When a failure occurs
and is detected, the source of the transients could have disappeared or become
quiescent for a long time. This makes troubleshooting and tracing the origin of
the failure extremely difficult. Transient faults can produce many different
failures, some of which seldom repeat. During the product-development
stage, transient faults could often be masked by the more dominant hardware
and software faults. All these factors could lead the engineers to wrong
conclusions when diagnosing failures.
Figure 2. A typical design-manufacture-operate process (design → final
specifications/manufacture documents/test plans → manufacture → field test →
acceptance) with possible undesirable loops due to the discovery of transient
problems.
•	The transient problems remain undiscovered until the manufacture stage.
By then, cost would have escalated significantly. Besides extra engineering
time, there would be material scrap and extensive revisions of documents. At
this stage, redesign will require some or all of the following measures: the
re-layout of printed circuit boards, re-routeing or changing the types of
cables and wires used, addition of components for EMI suppression and
sometimes modifications of the software for fault tolerance.
•	The transient problems remain undiscovered until field testing. At this
stage the customers would be involved. Faced with failures that are extremely
hard to diagnose, for the reasons given earlier, the supplier-customer
relationship would be strained. The elusiveness of the sources of transients
means many trips to the field by the engineers. Material costs and man-hour
overruns would escalate further compared with earlier discovery (above). The
flexibility for a redesign is considerably reduced because the time and finance
involved are not budgeted.
• It is possible that even field testing does not expose the inherent
susceptibility to transients. One possible reason is insufficient test
duration. As explained above, a transient failure depends on the rare
concurrence of the transients and the susceptible windows, so over a
short period of time failures associated with transients may not develop
(another possible reason is long error latency). Thus, the error remains
dormant during the entire field test. Subsequent to the field test and
acceptance, maybe after a long time, the error becomes active and failure
develops. For a safety critical system or a system involved with high
finance, the consequence could be serious and result in societal loss.
Management’s responsibility
Given the possible serious consequences that transients have on programmable
electronics, management must be alert to the potential problem. It must take the
necessary steps to ensure the confinement of transients’ effects on reliability.
Following the ISO 9001 standard on quality systems[12], management’s
responsibility should include at least the following:
• Define all the personnel at various levels and functions who will be
responsible for ensuring that the specifications, design, testing and
installation do take transients into account.
• Review contracts or product specifications to ensure that the intended
operational electromagnetic environment is well defined. If the latter is
not defined by the customers then relevant standards should be followed.
• Help to set as a design objective, the immunity levels of the equipment in
question towards defined transients.
• Ensure that all test plans include transient susceptibility tests with
defined procedures. Susceptibility tests must be performed in the
development stage as well as final and field testing.
• Review the design and test records and check if the immunity design
objective is achieved.
•	Ensure that service records reflect any incidence of failures due to
transient faults.
• Establish a document that records the objective, plans and results
pertaining to the above points.
Although implementing the above points and the work they entail represents an
additional cost to the supplier, it should be compared with the potential loss
due to negligence. Indeed, as explained earlier, the loss to the supplier and
possibly to society could be extremely high.
Conclusion
The nature of transients and the associated failure mechanism in programmable
electronics have been discussed. The importance of design and management’s
role with regard to transients have been stressed. In view of the pending
European regulation on immunity against EMI and the possible serious
consequences of ignoring the issue, management must not neglect transients’
effects on product reliability. It must take the lead and ensure the
reliability of products in their intended operational environment.
References
1. Irland, E.A., “Assuring quality and reliability of complex electronic systems: hardware and
software”, Proceedings of the IEEE, Vol. 76 No. 1, January 1988, pp. 5-18.
2. Siewiorek, D.P. and Swarz, R.S., Reliable Computer Systems Design and Evaluation, 2nd ed.,
Digital Press, Geneva, 1992.
3. Avizienis, A. and Laprie, J., “Dependable computing: from concepts to design diversity”,
Proceedings of the IEEE, Vol. 74 No. 5, May 1986, pp. 629-38.
4. Littlewood, B. and Strigini, L., “The risks of software”, Scientific American, November
1992, pp. 38-43.
5. Ott, H.W., Noise Reduction Techniques in Electronic Systems, Wiley, New York, NY, 1988.
6. Davies, J., “The European (CENELEC) generic immunity standards”, EMC Test and
Design, November-December 1992, pp. 49-50.
7. Johnson, B., Design and Analysis of Fault Tolerant Digital Systems, Addison-Wesley,
Reading, MA, 1988.
8. Iyer, R.K. and Rossetti, D.J., “A measurement-based model for workload dependence of CPU
errors”, IEEE Transactions on Computers, Vol. C-35 No. 6, June 1986, pp. 511-19.
9. Duba, P. and Iyer, R.K., “Transient fault behavior in a microprocessor, a case study”,
Proceedings of IEEE International Conference on Computer Design, 1988, pp. 272-6.
10. Papoulis, A., Probability and Statistics, Prentice-Hall, Englewood Cliffs, NJ, 1990, Chapter 3.
11. International Electrotechnical Commission (IEC), IEC 801-4 Electromagnetic Compatibility
for Industrial-process Measurement and Control Equipment, Part 4: Electrical Fast
Transient/Burst Requirements, IEC, 1988.
12. International Organization for Standardization (ISO), ISO 9001 Quality Systems — Model for
Quality Assurance in Design/Development, Production, Installation and Servicing, ISO,
Geneva, 1987.
Further reading
Tang, H.K. and Er, M.H., “EMI-induced failure in microprocessor-based counting”,
Microprocessors and Microsystems, Vol. 17 No. 4, 1993, pp. 248-52.