WELCOME TO

THE INDUSTRIAL WIKI


SEVEN STEP TROUBLESHOOTING


Troubleshooting is a method of finding the cause of a problem and correcting it. The ultimate goal of troubleshooting is to get the equipment back into operation. This is a very important job because the entire production operation may depend on the troubleshooter's ability to solve the problem quickly and economically, thus returning the equipment to service. Although the actual steps the troubleshooter uses to achieve the ultimate goal may vary, there are a few general guidelines that should be followed. Often, a familiar piece of equipment or system breaks down. In those cases, an abbreviated five-step troubleshooting process can be used to find the fault and get the system up and running. It is important to note that, although it is a five-step approach, the same basic guidelines of the seven-step troubleshooting method are followed. The steps are simply combined to be specific to the problem at hand. This article will briefly cover the five-step troubleshooting process, followed by a more in-depth look at the seven-step troubleshooting process.

General Troubleshooting Guidelines

The general guidelines for a good troubleshooter to follow are:

  • Use a clear and logical approach
  • Work quickly
  • Work efficiently
  • Work economically
  • Work safely and exercise safety precautions

Troubleshooting Steps

The five-step troubleshooting process consists of the following:

  1. Verify that a problem actually exists.
  2. Isolate the cause of the problem.
  3. Correct the cause of the problem.
  4. Verify that the problem has been corrected.
  5. Follow up to prevent future problems.

Action Items

Within the general guidelines previously mentioned, there are several action items that are important to the successful achievement of the goal of troubleshooting:

1. Verify that something is actually wrong.

A problem usually is indicated by a change in equipment performance or product quality. Verifying the problem will either provide indications of the cause, if a problem actually exists, or prevent the troubleshooter from wasting time and effort on "ghost" problems caused by the operator's lack of equipment understanding. Do not simply accept a report that something is wrong without personally verifying the failure. A few minutes invested up front can save a lot of time down the road.

2. Identify and locate the cause of the trouble.

Trouble is often caused by a change in the system. The more thoroughly the troubleshooter understands the system, its modes of operation, and how those modes are supposed to work, the easier it will be to find the cause of the trouble. This knowledge allows the troubleshooter to compare normal conditions to actual conditions.

3. Correct the problem.

It is very important to correct the cause of the problem, not just the effect or the symptom. This often involves replacing or repairing a part or making adjustments. Never adjust a process or piece of equipment to compensate for a problem and consider the job finished; correct the problem!

4. Verify that the problem has been corrected.

Repeating the same check that originally indicated the problem can often do this. If the fault has been corrected, the system should operate properly.

5. Follow up to prevent further trouble.

Determine the underlying cause of the trouble. Suggest a plan to a supervisor that will prevent a future recurrence of this problem.

This basic troubleshooting philosophy is the basis for the seven-step troubleshooting method discussed later. It reflects the basic strategy for troubleshooting, though each individual facility may require a different application of the strategy specific for the equipment and policies at that facility. An important point to remember as we discuss the seven-step methodology is that we are discussing a philosophy - not a procedure. Using the seven-step philosophy, a procedure could be developed that would provide the most cost-effective and efficient means for troubleshooting a particular piece of equipment in a given facility. However, this procedure would not necessarily be effective when used with different equipment or even the same equipment installed in a different facility.

Troubleshooting Documentation

"There is no substitute for experience" is a catchy and, more often than not, true phrase. If only there were a way to capture even a small part of that experience for future use, either by those who have not been fortunate (or unfortunate, as the case may be) enough to see something for themselves or by those who have seen too many years pass between experiences. This is the point of an equipment history, or troubleshooting log, which can tell quite a tale over the life of a piece of equipment. The troubleshooting log provides a valuable source of information from which the troubleshooter can draw on the experience of past troubleshooting efforts to quickly restore the equipment to service. Problems, symptoms, corrective actions, modifications, and preventive maintenance actions all should have entries that can be referenced at a later date.

Many companies require their maintenance personnel or engineering staff to maintain historical data on equipment used within their facilities. These requirements are not intended to be a burden on the maintenance or engineering departments, nor are they meant to destroy every tree on the planet with unnecessary paperwork. The equipment history can help prevent the troubleshooter from "recreating the wheel." It can lead the troubleshooter to the solution to a problem that has not occurred in years and that would otherwise cause troubleshooting efforts to move slowly as the troubleshooter checks every possibility. Additionally, documentation of recurring problems can provide the horsepower needed to get the right part or the engineering solution necessary to not only fix the immediate problem but also correct its cause. Without this historical data and documentation of a recurring problem and its associated costs, the arguments will often be met with the statement "if it is not written down, it did not happen."

The equipment history/troubleshooting log is an ideal place to keep the records necessary to establish and maintain a common problems list. The purpose of the common problems list is to provide the troubleshooter with a ready reference of past problems and their corrective actions. It is from this list that quick fixes can be taken. If a problem occurs on a regular or routine basis, it should be put on the common problems list. The list can be consulted at the start of a troubleshooting effort so the quick fixes can be tried first, which can save the troubleshooter valuable time.

Troubleshooters or technicians need to be careful about what is placed on the common problems list. If something occurs once, it is not necessarily a common problem. The problem should be listed in the history section and should not be put on the common problems list until it occurs again. This is because the tools used for troubleshooting are only as good as their application: if the common problems list is too long and cumbersome, it cannot be used effectively. Figure 1 shows an example of a troubleshooting log that could be used as a common problems list. Completing the required information on a troubleshooting log may seem tedious, but the information on the log can be very beneficial to a technician looking for the solution to a problem several months or even years later.


Figure 1: Troubleshooting Log/Common Problems List
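
The rule described above (every problem goes into the history, but only a problem that recurs is promoted to the common problems list) can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed log format: the class names, the field names, and the threshold of two occurrences are assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    """One record in the equipment history (field names are illustrative)."""
    equipment: str
    symptom: str
    cause: str
    corrective_action: str


@dataclass
class TroubleshootingLog:
    """Equipment history from which a common problems list is derived."""
    entries: list = field(default_factory=list)

    def record(self, entry: LogEntry) -> None:
        # Every problem is written to the history, even if it occurs only once.
        self.entries.append(entry)

    def common_problems(self, min_occurrences: int = 2) -> list:
        """Return (equipment, cause) pairs seen at least min_occurrences times."""
        counts = Counter((e.equipment, e.cause) for e in self.entries)
        return [key for key, n in counts.items() if n >= min_occurrences]

    def quick_fixes(self, equipment: str) -> list:
        """Corrective actions previously used for common problems on one machine."""
        common = set(self.common_problems())
        fixes = []
        for e in self.entries:
            if e.equipment == equipment and (e.equipment, e.cause) in common:
                if e.corrective_action not in fixes:
                    fixes.append(e.corrective_action)
        return fixes
```

With a log like this, a single occurrence stays in the history only; the second occurrence of the same cause on the same equipment is what places it on the common problems list and makes its corrective action available as a quick fix.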

Seven-Step Troubleshooting Philosophy

At this point in our discussion, we are ready to examine a method for effective, logical troubleshooting: the seven-step troubleshooting method. The seven-step troubleshooting method consists of the following seven steps:

  1. Symptom recognition
  2. Symptom elaboration
  3. Listing of probable faulty functions
  4. Localizing the faulty function
  5. Localizing the trouble to a faulty component
  6. Failure analysis
  7. Retest requirements

When necessary, each of these steps should be used in the proper order. Deciding when each is necessary is a very important part of troubleshooting. This is where a strategy is developed into a procedure.

Many of the more modern designs of equipment in use today offer extensive diagnostics programs and tools as an integral part of the equipment. Some have internal troubleshooting programs that allow the equipment to "troubleshoot" itself to a large degree. These programs and tools usually check inputs and outputs against pre-programmed normal parameters. If a discrepancy is noted, that function is flagged as a potential problem. Some programs are more sophisticated and will actually check functions to a component level, but they usually are only found on very expensive and high-tech equipment. The strategy that the program uses is a simple logical input-output comparison.

Systems or equipment that are designed for some form of self-troubleshooting obviously do not require implementation of every one of the seven steps. The equipment itself may perform any one or all of the steps, with the exception of failure analysis and retest requirements. All that is required of the troubleshooter is an understanding of what the equipment diagnostics are indicating and of the quickest and most effective way to clear the fault.

When any troubleshooting effort is necessary, writing down or referring to the seven steps will ensure that a conscious decision is made as to what steps apply and what steps do not apply. Approaching the problem in this fashion will ensure that valuable time is not wasted back-tracking to an action or thought process that was skipped initially. Next, we will take a look at each of the seven steps individually to see what should be accomplished for each step.
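
The input-output comparison used by built-in diagnostics can be illustrated with a short Python sketch. It is only a simplified model of the idea: the function name, the signal names, and the numeric limits are hypothetical and do not come from any particular piece of equipment.

```python
def run_diagnostics(readings, normal_ranges):
    """Flag every function whose reading is missing or outside its normal range.

    readings: dict mapping a function name to its measured value
    normal_ranges: dict mapping a function name to its (low, high) limits
    """
    flagged = []
    for name, (low, high) in normal_ranges.items():
        value = readings.get(name)
        # A missing reading is treated as a discrepancy, just like an
        # out-of-range value (an assumption made for this sketch).
        if value is None or not (low <= value <= high):
            flagged.append(name)
    return flagged


# Hypothetical pre-programmed normal parameters and one set of readings.
NORMAL_RANGES = {
    "supply_voltage": (22.0, 26.0),
    "hydraulic_pressure": (900.0, 1100.0),
    "motor_speed": (1700.0, 1800.0),
}
READINGS = {"supply_voltage": 24.1, "hydraulic_pressure": 640.0}

FLAGGED = run_diagnostics(READINGS, NORMAL_RANGES)
```

A flagged function is only a potential problem. Interpreting the flag, analyzing the failure, and retesting remain the troubleshooter's job.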

Step 1: Symptom Recognition

This is the most fundamental step in troubleshooting. Each and every person who has ever fixed anything has accomplished it. This step asks the question "Does a failure exist?" The first step in identifying a failure is recognizing that a failure exists. This sounds ridiculously simple, and usually it is, but it is also very important. For example, a common failure can be as simple as a power supply that is not connected to power. Electric motors and electrical circuits will not operate without electricity! This is very simple troubleshooting, but it can save a lot of time and potential embarrassment. The symptom recognition step is very straightforward. It requires an entry in the troubleshooting log that states what the indications of a problem are. For example, the indication might be that pump #3 does not start. Always check for additional symptoms of common problems. Unusual symptoms of common troubles occur more often than common symptoms of unusual troubles. The following list provides some guidelines for entries made during the symptom recognition step:

  • Try to be as specific and defining as possible in stating the problem that is occurring.
  • Always check to ensure equipment is lined up for normal operation, i.e., On/Off switch, test switch, mode selection switch, etc.
  • Analyze the performance of the equipment to make sure it actually has a failure and is not simply reacting to an external condition.
  • Try to determine if the failure is total or if the equipment is operating with degraded performance.
  • Know the equipment; realize when it is showing the symptoms of impending failure.

Step 2: Symptom Elaboration

The symptom elaboration step is the beginning of "actual" troubleshooting. The objective of this step is to obtain as much information about the problem as possible. Symptom elaboration is where the question "What is the problem?" is asked. As its name implies, this step elaborates on the symptom written in step one. For example, perhaps the cylinder extension stroke is too slow but the retraction stroke timing is satisfactory. This step provides all of the information necessary to narrow the problem down in a logical fashion. The following points should be considered in the symptom elaboration step:

  • Be aware that a large number of equipment faults can produce similar symptoms. During this step, try to differentiate as much as possible between the characteristics of the symptoms.
  • Start the troubleshooting log with as much background information as possible and document each adjustment and its results.
  • Note how readings are affected by all modes of operation and switch lineups.
  • Be sure to observe all gages, meters, and other indicators as to how they are responding due to the problem.
  • Always note if an adjustment has no effect on the symptom; this will help eliminate possible causes later on.
  • Determine if the trouble has slowly developed (i.e., drift) or if it is a sudden failure.
  • Perform control manipulation with care since detrimental effects can occur to associated equipment or components within the failed equipment. There may be a possibility of improper pressures, flows, or voltages exceeding maximum design specifications.
  • Do not go for the answer in one step. Troubleshooting should be a series of small logical steps, each one chosen to show a result leading to discovery of the problem or problems. Remember, troubleshooting can last two hours or two weeks. Be sure to record all troubleshooting actions taken in the log accordingly. Do not leave anything to memory.

Step 3: Listing of Probable Faulty Functions

This step is intended to narrow down the possible faulty functions based on the information obtained in steps one and two. A functional block diagram of the equipment and the troubleshooting log (steps one and two) are needed for this step. The question asked by this step is "Would failure of this function cause the symptoms I am seeing?" Again, the purpose of this step is to narrow the possibilities down to a list of probable faulty functions. Key points for this step include:

  • Always use the functional block diagram to ensure all the possible functions are checked.
  • Write down all probable faulty functions, even if it seems obvious that some of them are working correctly. Then, for each of those, write down why it is thought to be functioning correctly.
  • Be sure to include functions such as detectors, switches, cables, meters, wiring, connectors, piping, filters, and regulators. Wiring is always a probable cause!
  • Do not get locked in on what a technician "knows" the trouble has to be. Past troubleshooting experience and hunches certainly play a part in figuring out which is the faulty function. However, do not ignore hard evidence just because one assumes trouble is known prior to proper troubleshooting steps.
  • Always ask: "Would a failure of this function cause these symptoms?"
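
As a rough illustration of this step, the sketch below treats the functional block diagram as a table of functions and the symptoms each one could cause if it failed. The table contents and the names used are hypothetical; they borrow the slow-extension cylinder example from step two. In line with the key points above, a function that seems to be working is still listed, together with the evidence for why it is thought to be good.

```python
def list_probable_functions(block_diagram, observed, confirmed_normal=()):
    """Answer "would failure of this function cause these symptoms?" for each function.

    block_diagram: dict mapping a function name to the symptoms its failure could cause
    observed: symptoms recorded in steps one and two
    confirmed_normal: behaviors that were checked and found to be normal
    Returns a list of (function, evidence_it_may_be_good) tuples.
    """
    observed = set(observed)
    confirmed_normal = set(confirmed_normal)
    candidates = []
    for name, effects in block_diagram.items():
        effects = set(effects)
        if observed <= effects:
            # Keep the function on the list, but note any symptom it should
            # also have produced that was instead observed to be normal.
            evidence = sorted(effects & confirmed_normal)
            candidates.append((name, evidence))
    return candidates


# Hypothetical block diagram for a hydraulic cylinder circuit.
BLOCK_DIAGRAM = {
    "pump": ["slow extension", "slow retraction"],
    "extension flow control": ["slow extension"],
    "pressure gauge": ["wrong pressure reading"],
}

CANDIDATES = list_probable_functions(
    BLOCK_DIAGRAM,
    observed=["slow extension"],
    confirmed_normal=["slow retraction"],
)
```

Here the pump stays on the list even though the satisfactory retraction stroke suggests it is working; the note records why, so the reasoning can be revisited if hard evidence later points back to it.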

Step 4: Localizing the Faulty Function

This step requires careful evaluation of each of the probable faulty functions listed in the previous step. The goal is to determine exactly which area of the system is causing or generating the problem. This is the first step that requires taking a measurement. The measurement taken may be a system pressure, operating speed, sequence, time delay, temperature, or any variable parameter that is related to the equipment operation. The purpose of this step is not to find the faulty component; it is just to isolate the problem to a circuit or function. More than one of the previously listed probable faulty functions may be contributing to the overall problem. This step is not complete until each and every listed possibility is properly checked. The following key points should be noted:

  • Check all pressures, flows, inputs, and outputs associated with the areas of probable faulty functions.
  • If an abnormal reading is obtained, the equipment setup used to obtain the reading and the reading itself should be rechecked.
  • Do not be discouraged if several hours of troubleshooting reveal that a function is good. Proving a function is operating properly is important to the troubleshooting effort because it narrows down the possibilities of where the problem is located. The first function you choose to check out often will not be the faulty one.
  • Check the troubleshooting log periodically to ensure that troubleshooting efforts are still working in the right direction and have not lost sight of the original troubleshooting goal.

Step 5: Localizing the Fault to a Component

This step continues isolating the fault once the faulty function or functions have been determined. A thorough knowledge of the equipment operation, as well as individual component characteristics, is required for successful completion of this step. Schematic diagrams should be used at this point to ensure that no details go unnoticed. When localizing the trouble to a faulty component, keep in mind the following points:

  • Evaluate each component within the faulty function to determine which components are probable sources of the symptoms noted.
  • Careful consideration must be given to how each component could affect overall function of the system under both normal and failed conditions.
  • Removal of components from the system and use of a test stand may be helpful or even necessary to ascertain the function of more complex components.

Step 6: Failure Analysis

This step requires the failed component(s) to be repaired or replaced and, most importantly, the cause of the failure corrected. The following key points should be noted.

  • Knowledge of component failure modes and rates is very important. Always make a complete check of the associated components of the failed unit.
  • A considerable amount of information can be rapidly gained through a careful visual inspection.
  • Avoid replacing a component until the exact cause of the problem is found and repaired. Keep in mind, though, that the main purpose of troubleshooting is to get the equipment operational. Additional failure analysis can be done after the equipment is running.
  • Documentation is imperative at this point, both to aid in troubleshooting the problem should it return and to point out recurring design deficiencies.

Step 7: Retest Requirements

Now that the equipment is operational, check all the functions that have been affected by the failure. Although the equipment has been repaired and is now functioning, all operations must be checked and verified. The information obtained in this step can also aid in troubleshooting next time by providing some baseline information. One key point to remember is:

  • Fail Safe: do all checks that will ensure the equipment is operating correctly.

The seven-step method and its associated important points are provided as a general guide to assist the maintenance person. Circumstances vary from task to task and may require a slightly different troubleshooting approach. Experience, combined with the basic path outlined here, will help the troubleshooter choose an appropriate approach and solve problems more efficiently.

Troubleshooting With Flowcharts

The experienced troubleshooter usually is an "old hand" at reading a variety of block and schematic diagrams, as well as troubleshooting trees. The focus now shifts to drawing a model flowchart depicting an ideal troubleshooting strategy. Assuming familiarity with the equipment in a facility, the troubleshooter can make changes in this model and turn it into a flowchart showing the ideal procedure for troubleshooting equipment. The model strategy to be depicted represents a composite drawn from research findings and from procedures used by highly competent and experienced troubleshooters of a wide variety of equipment. First, we will describe it step-by-step, and then we will review the flowchart depicting each of those steps.

Typical Troubleshooting Process

Because of the variety of items they are expected to maintain, troubleshooters do and use different things. Some use screwdrivers, while others use oscilloscopes, stethoscopes, voltmeters, or wiring diagrams. Some spend a great deal of time disassembling in order to gain access to test points or adjustments, and others spend none. Some require access to large amounts of documentation, while others need only a page. Whatever the nature of the equipment to be fixed, and the equipment used to accomplish that purpose, competent troubleshooters do not differ so much by WHAT they do, but rather by HOW they go about doing it. The strategy (the approach) is much the same for all equipment; only the tactics (the steps for implementing the strategy) differ. Here is what competent troubleshooters do, in the approximate order in which they do it.

Step 1: Talk with the Operator

Operators are the richest potential source of information about what is wrong and where the trouble is. Competent troubleshooters always talk to the operators when they are available. Operators are with the equipment when the trouble occurs, and they generally know what they and the equipment were doing when it happened. The operator can provide indications of the problem by describing what happened that was different from normal operation. Many times, the operator helps point the troubleshooter in the general direction:

  • There was a little puff of smoke right over there.
  • I pressed this button and it went clankety-clank.
  • I tried to run the program, but it only printed gibberish.
  • It won't start no matter what I do.
  • The picture is all crinkly.

This information may tell the competent troubleshooter a great deal about what is wrong and where. Sometimes though, the operator is not that helpful:

  • The thing's busted again.
  • I said something to it and now it won't work right.
  • I think I hurt its feelings.

Under these circumstances, the troubleshooter's patience and ability to ask the right questions may result in more helpful information. Attempting to make the operator feel inferior by using highly technical terms or sarcasm in questioning will not increase the level of communication or cooperation and only serves to waste valuable time. In many cases, the operator can tell the troubleshooter exactly what and where the trouble is. When that happens, other troubleshooting steps can be avoided; the troubleshooter merely verifies the symptoms and clears the trouble.

  • This rod is bent.
  • This cam has worn down again.
  • This belt broke.
  • The connector is loose on this cable.

While sometimes wrong or not too helpful, operators are still the most potent source of information available, and competent troubleshooters head for the operator as a first step.

Step 2: Verify Symptoms

Immediately after their interview with the operator, competent troubleshooters verify symptoms. They know that hearing or seeing a symptom is not automatic proof of a malfunction. Just because equipment does not work properly does not mean something is wrong with it. Suppose an operator forgot to turn it on or plug it in? Suppose a switch or a valve was left in the wrong position? Human error is notorious for being a source of troubles, and competent troubleshooters know it is inefficient and potentially embarrassing to break out test equipment and sophisticated analytical procedures before verifying the symptoms.

Before television, there was only radio. When a radio malfunctioned, its owner took it to the radio repair shop, plopped it on the counter, and described its symptoms. Often, a customer would complain about a "hissing radio." "It doesn't work anymore," the owner would complain. "It just hisses and makes crackling sounds, but it doesn't get any stations." The minute the customer said "hisses," the troubleshooter would casually turn the radio around, look at the back, and verify the trouble. Sure enough, there was nothing wrong with the radio. The operator had accidentally snapped the AM/short-wave switch to the short-wave position, making reception of AM stations impossible. From the troubleshooter's point of view, the problem was then "How do I switch the switch without making the customer feel foolish?" Generally, the solution was to take the radio to the back room and make the "fix" there or to ask the customer to return the following day. This is but one classic example of an operator-induced problem. There are many, many others, and experienced troubleshooters can describe the amount of time spent troubleshooting failures that were not there.

Competent troubleshooters verify symptoms before proceeding with more involved efforts. They determine whether the trouble is real or not to ensure they do not spend time troubleshooting when they should be instructing an operator on how to avoid the trouble in the future. When the trouble is real, symptom verification will often provide benefits in addition to actual confirmation of a problem's existence. By operating the equipment, the troubleshooter will often collect more clues about the trouble's location than were provided by the initially reported symptom. When something goes wrong, it can show up in more ways than one. Troubleshooters who can tell the difference between normal and abnormal operation will spot these additional clues.

"I do a lot of troubleshooting by telephone," a highly competent troubleshooter of video equipment explained. "When a customer tells me what's wrong, I'll have them operate the system for me and tell me what happens. Lots of times I'm able to tell them what the problem is right on the phone. I don't even have to see the equipment." (Let alone rig test equipment.) No question about it. Competent troubleshooters verify symptoms before digging into the equipment itself.

Step 3: Attempt Quick Fixes

Even before they have located a trouble, competent troubleshooters attempt quick fixes; that is, they attempt solutions that are fast to try, even though they may be illogical in terms of the symptoms presented. They check fuses, adjust controls, push circuit boards firmly into their sockets, clean contacts, clean filters, replace gaskets, vacuum, dust, and bang or kick interlocked doors or cabinets to make sure they are properly seated. They tighten this or reset that, adjust here or align there. Troubleshooters know that these actions will clear the trouble some of the time, and since they are rapidly accomplished, they are worth doing. If quick fixes work, time and effort have been saved. If they do not work, only a moment has been lost and the information has been gained that certain parts of the equipment, at least, are not the cause of the trouble.

Often, troubleshooters engage in these rapid clearing actions while verifying symptoms and looking for other visible signs of malfunction. Auto mechanics, for example, are likely to twist or jiggle spark plug wires while looking around the engine compartment, regardless of the nature of the trouble. They know it is worth doing this since rough-running engines are sometimes caused by loose, oily, or wet contacts.

In a way, quick fixes are a form of preventive maintenance. In common usage, preventive maintenance means periodic general servicing of equipment, whether it needs it or not. These actions are carried out because they will either lengthen the life of the equipment or increase the amount of time the equipment is operational; that is, they will minimize downtime. Competent troubleshooters know that many troubles are caused by inadequate preventive maintenance or a total absence of such routine care. Thus, some quick-fix actions can be thought of as belated preventive maintenance.

There is another reason for attempting a quick fix or, if you prefer, for attempting solutions without first doing detailed troubleshooting. Equipment troubles do not occur with equal probability; some are much more likely to occur than others. Competent troubleshooters know this. They also know which troubles are likely to occur most often and the symptoms associated with those troubles. Moreover, everybody knows that some troubles are more common than others. When the table lamp does not light, nobody begins with formal troubleshooting; the bulb is commonly changed first. When the car does not turn over, the battery is commonly checked. It is not always the battery, and it is not always the light bulb, but the probability is high that these are the sources of trouble. When they are, it would be inefficient to pretend that these probabilities do not exist, especially when clearing actions are quick and easy to take. It would be very costly to pretend that trouble probabilities do not exist and demand that troubleshooters always follow the same procedure for the sake of uniformity or because the prescribed procedure will eventually lead to the trouble.

Efficient troubleshooting, then, requires that troubleshooters be armed with all available trouble-probability information. Unarmed, they are deprived of a potent tool for rapid trouble isolation. Regardless of what their information is called, competent troubleshooters attempt solutions that are rapid and efficient because they pay off generously either in a trouble cleared or in information gained.
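
One way to make use of trouble-probability information is to order the candidate quick fixes so that likely, fast checks come first. The Python sketch below ranks checks by probability per minute of effort. This ratio is just one possible heuristic chosen for illustration, and the probabilities and times for the table-lamp example are invented numbers, not measured data.

```python
def rank_quick_fixes(candidates):
    """Sort candidate checks by estimated probability of success per minute.

    candidates: list of dicts with keys "action", "probability" (0 to 1),
    and "minutes" (time needed to try the action, greater than zero).
    """
    return sorted(
        candidates,
        key=lambda c: c["probability"] / c["minutes"],
        reverse=True,
    )


# Hypothetical estimates for a table lamp that does not light.
LAMP_CHECKS = [
    {"action": "replace wall switch", "probability": 0.2, "minutes": 20},
    {"action": "change the bulb", "probability": 0.7, "minutes": 1},
    {"action": "inspect the cord and plug", "probability": 0.1, "minutes": 5},
]

ORDERED_ACTIONS = [c["action"] for c in rank_quick_fixes(LAMP_CHECKS)]
```

In practice the estimates would come from the equipment history or common problems list rather than from guesses, and a check that fails still yields information by eliminating part of the equipment from suspicion.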

Step 4: Review Troubleshooting Aid

When troubleshooters have talked with the operator, verified symptoms, and tried quick fixes, but still have not located the fault, additional information must be collected. The troubleshooting aid is the next most efficient source of information to check out. Why Because such aids offer some prepackaged information that troubleshooters would have to seek elsewhere if the aid were absent. Of the several types of troubleshooting aids, some are brief and not too helpful, while others are highly sophisticated or even automated. For example, consider the easy-to interpret "idiot lights" in automobile that indicate when the oil pressure is too low or when the alternator ceases to provide a suitable charge for the battery. The cockpit of a modern aircraft is loaded with bells, buzzers, and sirens to indicate various malfunctions and even impending malfunctions. In one fighter aircraft, for example, there is a repeating sound that changes frequency and tempo as gravity forces are built up during a turn. The higher the G-forces, the higher and more rapid the sound, telling the pilot of an approaching malfunction. The sound monitors a stress condition of the aircraft, and from listening the pilot knows whether or not a correction is needed. In this case, there is no need to talk with an operator, collect additional symptoms, or try quick fixes other than the one suggested by this troubleshooting aid. More and more modern equipment is being designed to provide direct information about troubles. Sensors detect troubles that are then reported by lights, sounds, and other forms of information display. These aids are a response to the growing complexity of some types of equipment, but reflect what is still a growing technology. Though immensely useful, it is still possible for the telltales themselves to fail, making the troubleshooting task even harder than before. Therefore, the importance of providing troubleshooters with other well-designed aids is still as strong as ever. 
A sometimes-overlooked troubleshooting aid is the "Caution" information attached to the equipment itself. "Caution: Remove all red tags before operating" is one example. "Caution: High voltage in this cabinet" is another. True, these aids do not help in locating troubles, but they do help to save the equipment and the troubleshooter from early death. Still another type of aid is the "If/Then" page, which typically describes symptoms on the left and suggested actions on the right. Troubleshooters may find aids like this in the owners manual that came with the automobile and the instructions accompanying your appliances. They indicate what some common troubles are and what to do about them. Similar troubleshooting aids may be provided with more sophisticated equipment, and some maintenance people construct their own. Related to these is the troubleshooting tree, a type of flowchart that walks the troubleshooter through a series of actions and decision points, and hopefully, to the trouble. Often called fully procedural troubleshooting aids, these aids are a form of thinking prompt, or a form of prepackaged analysis intended to relieve the troubleshooter of the need to memorize all the steps to follow. Well-constructed aids of this type do indeed improve the speed and accuracy with which faults are located, even by the inexperienced troubleshooter. At least one study comparing the usefulness of procedural troubleshooting aids with more traditionally constructed maintenance manuals, showed these aids to be better than the manuals. More troubles were located, and inexperienced troubleshooters made as few errors as experienced people. This should be expected, as the fully procedural troubleshooting aid is a carefully constructed and tested way of guiding the troubleshooter to the source of the problem. Troubleshooters have to know more about the system in order to make good use of traditional maintenance manuals. 
In addition to knowing the geography of the system, they need more specific troubleshooting knowledge to make up for the incomplete or inaccurate information in the manual. Then there are the sophisticated diagnostic aids used in locating malfunctions in computers and similar equipment. They do not require the troubleshooter's assistance at all, except to initiate the diagnostic operation. Diagnostics are programs designed to exercise a system, note discrepancies between normal and abnormal operation, and report the nature, and often the exact source, of the trouble either on a video display or printer (for example, "Bad RAM at CY"). When troubleshooting aids exist, experienced troubleshooters use the ones that remind them of the efficient paths to follow for information collection or those that report specific troubles. They do not use aids containing information they have already memorized through practice and experience, and they do not use aids that are poorly designed.
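
The "If/Then" page described above can be pictured as a simple lookup table. The following is a minimal sketch only: the symptoms, the suggested actions, and the names IF_THEN_AID and suggest_action are hypothetical and are not taken from any real manual.

```python
# Hypothetical If/Then aid: symptom on the left, suggested action on the right.
# The entries are illustrative only and do not come from a real manual.
IF_THEN_AID = {
    "will not start": "Check the power plug, main disconnect, and interlocks.",
    "stops mid-cycle": "Check for a jam, then inspect the limit switches.",
    "runs hot": "Check the cooling fan and clean the air filter.",
}


def suggest_action(symptom: str) -> str:
    """Return the suggested action for a reported symptom, if the aid lists one."""
    key = symptom.strip().lower()  # tolerate stray spaces and capitalization
    return IF_THEN_AID.get(
        key, "Not covered by this aid; use another information source."
    )


print(suggest_action("Will not start"))
```

An aid like this only helps with the troubles its author anticipated, which is why the fallback message sends the troubleshooter on to other information sources.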

Step 5: Step-by-Step Search

When other sources of information fail to reveal the trouble's source, troubleshooters turn to a step-by-step search through the equipment itself. This is the last resort of competent troubleshooters, however, as it is the least time-efficient system of information gathering when compared to other information sources. This is not to say that the step-by-step search is unimportant; it is only to say that this procedure (oddly referred to as "systematic," "analytical," or "logical" troubleshooting) is used by competent people only after all other information sources fail. Several step-by-step search procedures might be used. A random search could be a way to test and replace components, and troubles would eventually be cleared. Unfortunately, since as many troubles would be located later as would be sooner, this approach is used only by the uninformed. A sequential search involves systematic testing, starting from one end of the equipment and working item by item to the other end. Although this procedure will also lead eventually to the trouble, it too is inefficient because troubles at the far end of the equipment take a long time to get to. The preferred search procedure is one that yields the most information for the least effort; that is, the most information per action, such as per test check or per trial replacement. Ideally, this search procedure is one that successively eliminates half the system as a possible trouble source. Called the split-half or half-split search, the procedure involves successively testing the system at or near its midpoint. When a test shows normal operation, then the portion of the system preceding that point is considered OK and is eliminated from suspicion. By successively eliminating approximately half of the remaining system with each test, the trouble is located more efficiently than with a random or sequential search. Four points must be made:

  1. It is seldom possible to test a system exactly at the midpoint of the next section to be checked. No matter. The object is to test at a point each time that will eliminate a large chunk of the system from suspicion that the trouble may be lurking there.
  2. Some systems lend themselves to rapid replacement of large segments containing a large number of components, such as chassis or circuit boards. Such board swapping can quickly isolate the trouble to the replaced unit or eliminate it from suspicion. Even though the swapping might have been done at some distance from a mid-point, the speed with which it is done makes the procedure useful.
  3. The split-half search is used only when a troubleshooter must adopt the equal probability hypothesis: "As far as I know right now, the trouble could be anywhere." Competent troubleshooters stop using this search procedure as soon as (a) they develop an idea worth testing, or (b) the trouble is located. Once they know or strongly suspect the trouble's location, they are likely to test or replace the suspected component or assembly. If they find the trouble, they fix it. If they do not find it, and all other information sources have proven inadequate, they resume the split-half search until they can attempt a fix.
  4. Finally, those who expect to be skilled in the step-by-step search (called signal tracing for electrical and electronic equipment, flow tracing for hydraulic or pneumatic equipment, and linkage tracing for mechanical equipment) need to have considerable knowledge. They need to be able to read diagrams, use test equipment, interpret waveforms, and locate components and test points. This explains why often it is considerably more economical to have two types of troubleshooters: those who can isolate a trouble to a unit, such as a gearbox, transmission, circuit board, or card; and those who can trace the trouble to the defective component within the unit. The former generally can clear more than 80% of the troubles they encounter after very little training. It is very expensive to insist that every troubleshooter be as knowledgeable as those who can clear most or all of the troubles ever encountered.
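
The difference between the sequential and half-split searches can be made concrete with a short sketch. It is a simplified model resting on assumptions that are not in the text above: the system is a linear chain of stages, there is exactly one fault, and every test point before the fault reads normal while the faulty stage and everything after it read abnormal. The function names and the output_ok callback are hypothetical.

```python
from typing import Callable, Tuple


def half_split_search(n_stages: int, output_ok: Callable[[int], bool]) -> Tuple[int, int]:
    """Locate the first stage whose output is abnormal.

    Assumes a single fault in a linear chain of stages numbered 0..n_stages-1.
    Returns (index of the faulty stage, number of tests made).
    """
    low, high = 0, n_stages - 1  # the fault lies somewhere in low..high
    tests = 0
    while low < high:
        mid = (low + high) // 2  # test at or near the midpoint
        tests += 1
        if output_ok(mid):
            low = mid + 1        # everything up to mid is cleared of suspicion
        else:
            high = mid           # the fault is at mid or earlier
    return low, tests


def sequential_search(n_stages: int, output_ok: Callable[[int], bool]) -> Tuple[int, int]:
    """Test stage by stage from the input end until an abnormal output is found."""
    tests = 0
    for stage in range(n_stages - 1):
        tests += 1
        if not output_ok(stage):
            return stage, tests
    return n_stages - 1, tests   # only the last stage remains under suspicion


# Hypothetical 16-stage system with the fault in stage 11.
def stage_output_ok(stage: int) -> bool:
    return stage < 11


print(half_split_search(16, stage_output_ok))  # (11, 4)
print(sequential_search(16, stage_output_ok))  # (11, 12)
```

In this model the half-split search needs about log2(n) tests regardless of where the fault sits, while the sequential search needs more tests the farther the fault is from the starting end. Real equipment rarely offers a test point at the exact midpoint, as point 1 notes, so the saving is approximate.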

Step 6: Clear the Trouble

Once a trouble is located, someone is expected to eliminate it. Trouble clearing is often done by the troubleshooter, but sometimes it is assigned to someone else. The master auto mechanic, for example, does the diagnosis, but then may assign the actual repair work (trouble clearing) to someone else. The chief engineer at a radio or TV station may be called in to troubleshoot, and then turn the trouble-clearing activity over to the on-duty engineer. Manufacturers' hotshot troubleshooters, who travel to clients' locations to solve difficult problems, often leave the actual trouble clearing to the local staff. Trouble clearing is different from trouble locating, and locating requires a different set of skills than clearing. This article concentrates only on locating the source of the trouble.

Step 7: Perform Preventive Maintenance

Preventive maintenance (PM) is the process of clearing troubles before they happen, a process that good troubleshooters perform as regularly and carefully as time and policy permit. Performing PM is more than just a ritual or just another company policy; preventive maintenance saves a great deal of time and money and reduces equipment downtime. It is appropriate to do PM on some machines even before starting to hunt for the trouble. PM usually is fast and may clear the trouble. However, for most machines, PM is carried out after the trouble has been cleared. One troubleshooter explained it this way: "Look, when the customer's machine is down and the plant has come to a grinding halt, they don't want to see my troubleshooters oiling and greasing. They want that equipment up and running! The oiling and greasing is done after the equipment is operational."

Step 8: Make Final Checks

Competent troubleshooters always check to make sure the trouble is actually cleared and the system is functioning normally. They know too well how easy it is to cause a new trouble while clearing an old one. They also know how easy it is to leave something like a setscrew loose, or something unplugged or out of adjustment. Therefore, a final check of normal operation is a necessary part of the troubleshooting sequence.

Step 9: Complete Paperwork

Troubleshooters are not immune to the bureaucratic plea to "fill out those forms!" Even though paperwork is not troubleshooting, it is part of the troubleshooter's job. Often, the history of a machine is recorded in an equipment log. Dates of PMs, information about retrofits, and parts that have been changed are recorded at the time of service or repair. Referring to and keeping up a log are two paperwork activities that are part of the maintenance job. Sometimes troubles can be quickly located by simply reading the history in the log, often because the same trouble occurs regularly in that equipment. For this reason, the equipment log is a useful source of information, and good troubleshooters take the time to update these logs as well as to refer to them.
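
Where the equipment log is kept electronically, spotting a trouble that "occurs regularly" can be as simple as counting repeated entries. The sketch below assumes a hypothetical log format of (date, reported trouble, part changed); the entries and the function name recurring_troubles are invented for illustration.

```python
from collections import Counter

# Hypothetical equipment log entries: (date, reported trouble, part changed).
equipment_log = [
    ("2023-01-12", "jammed feed", "feed roller"),
    ("2023-03-02", "will not start", "door interlock"),
    ("2023-04-18", "jammed feed", "feed roller"),
    ("2023-07-09", "jammed feed", "feed roller"),
]


def recurring_troubles(log, min_count=2):
    """List (trouble, part, count) for troubles logged at least min_count times."""
    counts = Counter((trouble, part) for _date, trouble, part in log)
    return [
        (trouble, part, n)
        for (trouble, part), n in counts.most_common()
        if n >= min_count
    ]


print(recurring_troubles(equipment_log))  # [('jammed feed', 'feed roller', 3)]
```

A repeated entry like this suggests where to look first the next time the same symptom is reported, and may also point to a preventive maintenance item.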

Step 10: Inform Area Supervision/Instruct Operators

Once the equipment is returned to service, the user is informed of this fact. Often, operators are instructed in the proper use or care of the equipment or cautioned about peculiarities of the system. Although this activity is not strictly part of the troubleshooting procedure, it is important to the continued proper functioning of the equipment.

The Flowchart Model

The next step is to review a flowchart depiction of the action and decision steps in the strategy just described. A flowchart is a graphical tool used to represent the steps of a process. The flowchart uses standard symbols to represent process steps, decisions, and other events. Figure 2 shows typical standard flowchart symbols.


Figure 2: Typical Flowchart Symbols
A flowchart depicting the typical troubleshooting process just described is shown in Figure 3. This flowchart represents the troubleshooting procedures followed by an individual at the location where the equipment trouble is noted.


Figure 3: Flowchart Model
Troubleshooters usually receive a report of trouble in the form of a symptom:

  • "It's jammed again."
  • "It won't start."
  • "I can't get it to complete the cycle."

After locating the correct machine (and good troubleshooters always make sure they have the right machine), they try to interview the operator. Unless the machine is jammed or otherwise inoperable, they operate the machine and verify the symptoms collected from the operator. If the problem is operator-induced, they clear it and then instruct the operator in ways to prevent the problem from occurring again. If the problem is real, they try quick fixes (check interlocks, plugs, and cables; replace units). If any of these work, preventive maintenance may be called for and carried out. Then, final checks are made, documentation (paperwork) is completed, the area is cleaned and checked, and the area supervisor is informed. If quick fixes do not solve the problem, troubleshooters follow troubleshooting aids if they are available. If aids are not available, a half-split search procedure is used as a last resort. When troubleshooters develop a good idea about where or what the trouble is, they test their hypothesis by attempting a solution. If a solution does not work, the search is continued. If the solution does work, troubleshooters complete any preventive maintenance that is indicated and then follow the end steps already described (final check, documentation, area check, and communication).
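
The decision path just described can also be written out as a short routine. This is a rough sketch under simplifying assumptions: the three yes/no inputs stand in for judgments the troubleshooter makes on the floor, and the first hypothesis tested is assumed to clear the trouble, so the "continue the search" loop is not modeled. All names are hypothetical.

```python
# End steps shared by every path that ends in a repair.
END_STEPS = [
    "complete any preventive maintenance that is indicated",
    "make final checks",
    "complete documentation",
    "clean and check the area",
    "inform the area supervisor",
]


def troubleshooting_path(operator_induced: bool,
                         quick_fix_works: bool,
                         aids_available: bool) -> list:
    """Return the ordered actions for one pass through the flowchart."""
    steps = [
        "locate the correct machine",
        "interview the operator",
        "verify the symptoms",
    ]
    if operator_induced:
        # Operator-induced problems are cleared and followed by instruction.
        return steps + ["clear the problem", "instruct the operator"]
    steps.append("try quick fixes")
    if not quick_fix_works:
        if aids_available:
            steps.append("follow troubleshooting aids")
        else:
            steps.append("use half-split search")  # last resort
        steps.append("test the hypothesis by attempting a solution")
    return steps + END_STEPS


for action in troubleshooting_path(False, False, True):
    print(action)
```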

Five Action Steps for Systematic Troubleshooting

The seven-step troubleshooting method described previously assumes that little may be known of the process or system with a problem. Many times that is the case. The technician, electrician, or mechanic must systematically try to resolve the problem by using his or her skills and intuition. There are often cases, however, where a familiar piece of equipment or system breaks down. In those cases, an abbreviated five-step process can be used to find the fault and get the system up and running. It is important to note that, although this is a five-step approach, the same basic guidelines of the seven-step troubleshooting method are followed. The steps are simply combined to be specific to the problem at hand.

Five-Step Troubleshooting Process

The five-step troubleshooting process consists of the following:

  1. Verify that a problem actually exists.
  2. Isolate the cause of the problem.
  3. Correct the cause of the problem.
  4. Verify that the problem has been corrected.
  5. Follow up to prevent future problems.

Each of these steps is described next using the flowchart approach.

Step 1: Verify That a Problem Actually Exists

The troubleshooting process begins with symptom recognition. To troubleshoot, there must first be a problem. In this alternate approach to troubleshooting, the troubleshooter must first verify that there actually is a problem. This is actually a combination of the first two steps of the seven-step method, symptom recognition and symptom elaboration. To verify that there actually is a problem, the troubleshooter must use all available means of information. This includes the equipment operator, equipment indications and controls, and technical documentation about the equipment or system. Contacting the equipment operator should be the first action taken. The operator usually can supply many of the details concerning the failure incident. To get the most information, the troubleshooter should ask probing questions. Some examples are:

  • What are the operator's indications of the trouble?
  • How did the operator discover the trouble?
  • What were the conditions at the time the trouble first occurred?
  • Is the trouble constant or intermittent?

Next, the troubleshooter should observe the equipment or system to get a first-hand impression of the trouble. During this observation, the troubleshooter should note all abnormal symptoms. To evaluate the equipment thoroughly and elaborate on the symptoms observed, the troubleshooter will probably need to examine the equipment documentation. This includes prints, operating characteristics, and procedures. Since the equipment operator is probably most knowledgeable about the equipment, it is important to discuss the documentation with the operator. This will help to determine if any changes exist. Some examples of useful graphic documentation are:

  • Panel graphics
  • Loop diagrams
  • Piping and instrumentation diagrams
  • Block diagrams
  • Wiring diagrams
  • Schematic diagrams

Each of these examples is described briefly next.

Panel Graphic

A panel graphic is a graphic representation of the system that is mounted on an equipment or system control panel. Although the panel is intended to provide the operator with a big picture of the operations, it can be useful to the troubleshooter during this step. Figure 4 is an example of a panel graphic.


Figure 4: Panel Graphic

The above example does not provide extensive information to the troubleshooter. However, it can be used to identify sources of power, valve line-ups, or instrumentation connections.

Loop Diagram

A loop diagram is used to provide detailed mechanical information about a process. This diagram does not give significant electrical or instrumentation information. Figure 5 is an example of a loop diagram.


Figure 5: Loop Diagram
A more useful diagram for electricians and technicians is the piping and instrumentation diagram, described next.

Piping and Instrumentation Diagram

A piping and instrumentation diagram (P&ID) shows the functional layout of a fluid system and its piping, valves, and instrumentation as clearly and accurately as possible. It is accurate to the extent that all components are connected to each other as shown in relation to flow path orientation. A P&ID does not attempt to represent the actual physical layout of equipment; for example, a valve that may appear to be right at the discharge of a pump can physically be located quite some distance from the pump and on another elevation (floor). Many times, however, a P&ID will use a broken line encircling a group of equipment to indicate that it is all in one building. Another name commonly used for P&IDs is bubble diagram, due to the use of a circle for locators and symbols. A piping and instrumentation diagram depicts all components of a particular system, including pipe sizes, flange sizes, valve sizes, flow direction, and references to other related diagrams. Rather than try to pictorially include all the valves, piping, instruments, and equipment in a fluid system, a P&ID uses standardized symbols to represent these items. A section of a simple P&ID is shown in Figure 6.


Figure 6: Simple P&ID Section

The P&ID is useful when troubleshooting entire systems or processes to find a faulty component. A P&ID shows the relationship between mechanical, electrical, and control components of the system. It does not give any details on the electrical or control circuitry. For circuit troubleshooting, the block diagram, wiring diagram, and schematic diagram may be more useful.

Block Diagram

Block diagrams are the simplest of all electrical diagrams. A block diagram illustrates the major components and electric or mechanical interrelations in block (square, rectangular, or other geometric figure) form. The lines between the blocks represent the connections between the systems or components. Each line may represent one wire or several wires. The purpose of a block diagram is to introduce the system as a whole, convey the general operation and arrangements of the major components, and show the normal order of progression of a signal or current flow. Figure 7 is an example of a block diagram.


Figure 7: Block Diagram

Block diagrams are used to show the parts included in the system and the electrical order the parts are in. Knowing this, the system can be analyzed to determine where a fault might lie. Block diagrams are useful but have some disadvantages. They do not show the accurate physical location of the components in the system. Also, a single line represents all electrical connections. There usually is no indication whether the single line represents a cable or several cables.

Schematic Diagram

Schematic diagrams (often just called schematics) are drawings that show all the components in their proper electrical positions, but not necessarily in their proper physical locations. Schematic diagrams are very useful to the technician troubleshooting an electrical or electronic circuit. Schematic diagrams usually are designed to be read from left to right and from top to bottom. There typically are standard electrical diagram symbols and device function numbers on these diagrams. The positions of the contacts and switches usually are shown as they would be in the relaxed or de-energized state. A schematic diagram is shown in Figure 8.


Figure 8: Schematic Diagram

Once a possible faulty component has been identified using a schematic diagram, in-circuit tests must be performed to verify the suspected failure. To perform the tests, the technician must know where to connect test equipment in the circuit. A wiring diagram is used to supply this information.

Wiring Diagram

Wiring diagrams are mostly used when troubleshooting systems. They may be used in conjunction with schematic diagrams for component and wiring locations. Wiring diagrams show the relative position of various components of the equipment and how each conductor is connected in the circuit. These diagrams are classified in two ways:

  • Internal diagrams, which show the wiring inside a device.
  • External diagrams, which show the wiring from the components to the rest of the system.

A wiring diagram is structured such that it represents all the wires that were presented in the schematic diagram in their actual locations. It shows all electrical connections in an enclosure. Each wire is labeled to indicate where each end of the wire is terminated, such as a terminal board location. Using the documents described so far, a technician can accomplish a great deal toward finding the cause of the problem. During this step, the technician identifies possible faults that could result in the problem. These faults should be listed so that they can be checked and eliminated if possible. The flowchart in Figure 9 shows a block-by-block representation of this step. The next step covers isolating the real cause of the problem.


Figure 9: Step One: Verify That a Problem Actually Exists

Step 2: Isolate the Cause of the Problem

The second step of the five-step troubleshooting process relies heavily on the troubleshooter's technical skills and intuition. During this step, the troubleshooter is actively involved in isolating the cause of the problem. This involves physical activity, such as reading instrumentation, connecting test equipment, adjusting parameters, and possibly disassembly. It also involves mental activity, such as logic, evaluation, and reasoning. The specialized knowledge of the troubleshooter plays a key part in this step. To safely and effectively isolate the cause of the problem, keep the following in mind:

  • Begin investigating the easiest items to check. Eliminate all convenient possibilities first to save time.
  • Be aware of special operating modes (self-tests, built-in diagnostics, etc.) that may aid in the troubleshooting process.
  • Use appropriate safety precautions and equipment when troubleshooting in the field.
  • When taking instruments in a piping system off-line and when placing them on-line, follow appropriate operating procedures.
  • Recognize the obvious, but do not ignore what may be concealed.

Using the appropriate documents and test equipment, the troubleshooter continues to eliminate possible causes. As each check is completed, the trouble becomes more isolated. Using techniques previously discussed, such as half-splitting and signal tracing, helps to narrow the problem down quickly. A flowchart illustrating the process used to isolate the cause of the problem is shown in Figure 10. Once the problem has been isolated to a specific component, it can be repaired. Correcting the problem is discussed next.

Figure 10: Step Two: Isolate the Cause of the Problem

Step 3: Correct the Cause of the Problem

The third step of the five-step troubleshooting process is correcting the cause of the problem. This step involves performance of the repair or other activity that eliminates the problem. This can be as simple as turning a switch or adjusting a valve, or it could be as complex as re-winding a motor or overhauling a pump. To correct the cause of the problem, the troubleshooter performs both failure analysis and a retest of the equipment. This is shown in the flowchart pictured in Figure 11.


Figure 11: Step Three: Correct the Cause of the Problem

Step 4: Verify That the Problem Has Been Corrected

Once the corrective action is taken, the troubleshooter should verify that the trouble has been corrected. This usually involves rechecking the same indications that proved there was a problem. This time though, the checks should prove that a problem does not exist. This step should be thorough. If there are both an abbreviated procedure and an expanded procedure for checking the equipment, use the expanded procedure. This helps ensure that the problem no longer exists and did not mask another problem. During this verification, the following should be observed:

  • Check all indications that relate to the repaired area.
  • Perform a valve/switch line-up check to validate the integrity of the system.
  • Using approved procedures, establish normal operating conditions and check equipment performance.
  • Check for abnormal operation of all inputs and outputs to the repaired equipment.

By thoroughly verifying the proper operation of the repaired equipment, the troubleshooter can be relatively sure the problem has been resolved correctly. To help ensure the problem does not reoccur, the next step in the process is performed.

Step 5: Follow Up to Prevent Future Problems

The final step in the five-step troubleshooting process is to follow up to prevent future problems. This step involves taking preventive measures and recommending actions that could help keep the equipment from failing. This may include the following:

  • Change the preventive maintenance schedule to help prevent failures.
  • Recommend a different supplier if a replacement component is unsatisfactory.
  • Recommend procedure modifications that may prevent future failures.
  • Conduct operator/maintenance training to raise awareness of the potential for problems.
  • Complete proper documentation and troubleshooting log entries to aid in future troubleshooting of similar problems.

Although the system retest and preventative measures taken may not seem as vital as fixing the problem and getting the equipment back on-line, these steps are vital to long-term productive performance. The flowchart shown in Figure 12 depicts the actions taken in these steps.


Figure 12: Step Four: Verify That the Problem Has Been Corrected; Step Five: Follow Up to Prevent Future Problems

Deriving Logical Troubleshooting Flowcharts and Strategies

Observing competent troubleshooters in action does not always reveal the most efficient, ideal procedure. Sometimes less-than-ideal tactics are used as a means of dealing with various constraints. For example, telephone maintenance people are often faced with a trouble referred to as CCIO ("Can't Call In or Out"). Once they have ruled out trouble in the central office as the cause, they are supposed to check the telephone instrument itself to verify that the trouble exists as reported. Then, they are supposed to check their way from the instrument toward the telephone exchange until they pinpoint the trouble. However, they do not always troubleshoot in this manner. At one company, they always examine the first checkpoint they come to as they are driving toward the customer's telephone, regardless of where that point is in the logical chain of test points. Why? Because the cost of operating repair trucks is so high that company policy has been set to follow a more efficient vehicle-use procedure. Gas is saved, but the procedure takes longer. Policy says that it is more important to minimize "windshield time" than it is to maximize troubleshooting efficiency. At another plant, an observer would note that troubleshooters generally fail to verify their diagnoses with test equipment. Instead, they keep trying different solutions until the trouble appears to be cleared, so that it looks as if they are changing parts at random. Why? Not because they are unaware of more efficient troubleshooting procedures, but because the test equipment is awkwardly located some distance away. It is easier to test guesses by changing parts than to take the time and effort needed to verify a diagnosis. Why is the test equipment kept in the tool crib instead of at a location closer to those who need it? It has always been done that way!
At a third company, troubleshooters follow a similar procedure, not because the test equipment is relatively inaccessible, but because the schematic diagrams are classified and are kept locked up. It is easier, though less efficient, to try a string of solutions than to bother signing the schematics in and out. The variations just described illustrate two types of reasons for deviating from the ideal troubleshooting strategy:

  1. There is sound reason for deviation. Although the deviation results in a troubleshooting strategy that is somewhat less than ideal, it fits a larger plan. There is a good reason for the deviation, and it is not inefficient in the context of the larger plan.
  2. There is no sound reason for deviation. The less-than-ideal strategy is followed because it is thought to be easier or because it has always been done that way. Little or no thought has been given to how it should be done, and how certain constraints could be better dealt with.

If a troubleshooting strategy deviates from the models shown on the previous pages, the deviation should be for good reasons rather than because it has always been done that way.

Deriving Your Own Troubleshooting Strategy

Now that two types of ideal troubleshooting strategies have been identified, seven-step and flowcharting, it is time to develop a specific troubleshooting procedure that fits the specific equipment and related situation. The troubleshooter will translate an ideal strategy into specific tactics appropriate to the equipment at hand and create a troubleshooting tool.

Here is What To Do

  1. Select the model strategy that best matches the procedure the troubleshooter is comfortable using on the floor.
  2. On a piece of paper, write the name of the equipment.
  3. Under the name of the equipment, write down the level to which troubles are expected to be isolated:

  • block (chassis, unit, card)
  • faulty component

  4. Indicate whether or not other experienced people are available to call on for help.
  5. Follow the model to build a flowchart. Make the flowchart specific to the equipment and circumstances. Write down the appropriate phone numbers to call, people to talk to, names of references to use, names of tools or test equipment to use, and names and/or numbers of required documentation forms.
  6. When a draft is complete, test it by answering these questions:

  • Is the equipment properly identified? It is not very helpful to do this in the abstract.
  • Does the flowchart follow the model in each of the key steps?
  • Is the flowchart consistent with the information recorded at the top of the page?
  • Are the specific items named in the flowchart?
  • If the strategy deviates from the model, can the troubleshooter justify that deviation with a sound reason, such as company policy or other legitimate constraints?

When the flowchart meets the test criteria, the troubleshooter has derived the ideal troubleshooting strategy for the equipment.

Types of Failures

Most problems a troubleshooter faces are relatively simple to analyze and repair. The equipment, or an associated component, fails and the failure is obvious. There are times, however, when the problem is not so apparent. In fact, some problems only occur sometimes. When a failure is sporadic, or it is not always present, it is called an intermittent failure.

Steps for Troubleshooting Intermittent Failures

An intermittent failure can create much aggravation and frustration for the troubleshooter. It also can create havoc within a process or system operation. Diagnosing the fault, as difficult as it is, can be accomplished using these general guidelines:

  1. Attempt to recreate the problem.
  2. Isolate the fault once the problem re-occurs.
  3. Monitor the operation if the problem does not re-occur.

A brief description of each of these guidelines follows.

Attempt to Recreate the Problem

If the problem is no longer apparent and operator error has been ruled out, the system or equipment must be examined to find the fault. One of the first things a troubleshooter should try to do is recreate the problem. Using information obtained from the operator and from any equipment history or logs, make an attempt to establish operating conditions that are similar to those that existed at the time of failure. This may require placing the equipment in a state that is contrary to other equipment operation. For this reason, troubleshooting of an intermittent failure is performed off-line, usually in a maintenance shop. Three basic types of intermittent problems will be described. Most intermittent problems fall into one of the following categories:

  1. Thermally induced failure
  2. Mechanically induced failure
  3. Erratic failure

Although other classifications could be used, an intermittent problem usually occurs only under certain circumstances. Contrary to common belief, most equipment does not have a mind of its own. Two of the most likely things to change in a system during operation are temperature and mechanical functions. For this reason, the first two categories exist. The third category, erratic failure, is a catchall for other intermittent problems. It is also the most difficult type of problem to troubleshoot.

Thermally Induced Failure

The thermally induced failure is a problem that only becomes apparent when equipment is warmed up. This problem may only occur on very hot days or when air conditioning is not operating. It may also occur each time the equipment has operated for an extended period of time at normal operating temperatures. To isolate the thermally induced failure, the equipment must first be cooled down. After the equipment is cool, it can be re-energized and allowed to warm up to normal operating temperature. Once the equipment has operated for some time, the thermally induced failure should re-appear. Once the problem re-appears, it can be verified by cycling the equipment through a cool-to-warm state several times. To help isolate the thermally induced failure, the equipment may need to be cooled down in sections as it operates. This can be done using a directional forced air source or a special product developed for this specific purpose.

Mechanically Induced Failure

A mechanically induced failure is relatively easy to recognize. This type of failure occurs when the equipment or circuit experiences vibration, mechanical shock, or motion. Repeatedly tapping on the troubled area should make the fault condition come and go. Tapping or applying pressure to different areas of the equipment can isolate the faulty component.

Erratic Failure

The most difficult trouble to diagnose is the erratic failure. An erratic failure is a failure that is virtually impossible to predict. It occurs randomly and under different operating conditions. Many times, these types of failures are related to voltage transients or irregularities. Static discharge voltages and damage associated with static discharge can lead to erratic failures. Digital equipment, such as computers and peripherals, is a good example of equipment that is subject to these problems. Finding a solution to an erratic failure is not easy. Most of the time, this type of problem cannot be recreated. It usually requires substitution of components on a sub-system basis. For example, assume that a computer system has erratic failures resulting in the system "locking up" at various times. The system locks up in various modes, programs, and operating conditions. There is no apparent trend to the failure. The system may even pass a computer-based diagnostic software program. In a case like this, each system component can be replaced individually with a known "good" component. The system can then be run for an extended period of time after each substitution is made. If the fault does not reappear, the component that was replaced can be assumed to be bad. Although this is not very practical, it may be the only way to isolate the fault.
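The substitution procedure described above can be outlined as a simple loop. The following Python sketch is illustrative only: the component names and the fault_reappears callback are hypothetical stand-ins for a technician swapping in a known "good" part and observing the system over an extended run. It also assumes the original part is put back whenever a substitution does not clear the fault.

```python
def isolate_by_substitution(components, fault_reappears):
    """Swap one component at a time for a known-good unit.

    components      -- names of replaceable sub-system components (hypothetical)
    fault_reappears -- callable(name) -> bool; stands in for the extended
                       test run made after substituting that component
    Returns the first component whose substitution clears the fault,
    or None if the fault persists after every substitution.
    """
    for name in components:
        if not fault_reappears(name):
            return name  # fault is gone, so the replaced part is the suspect
    return None


# Hypothetical example: the fault clears only when the power supply is swapped.
parts = ["memory module", "power supply", "disk controller"]
suspect = isolate_by_substitution(parts, lambda name: name != "power supply")
print(suspect)
```

Because each test run must be long enough for a random fault to show itself, the total time grows with the number of components, which is why the method is described as impractical but sometimes unavoidable.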

Alternatives to Recreating Failures

For each type of intermittent failure, it is important to recreate the fault condition so that the fault can be observed. Once this has been accomplished, normal troubleshooting techniques can be used to find the cause and repair it. Finding the solution to an intermittent failure can be as rewarding as solving a "whodunit" mystery. The very name intermittent failure guarantees that the problem is not always going to occur. In the case of erratic failures especially, it may be virtually impossible to see the fault recur. If this is the case, alternate monitoring methods can be used to track the equipment's operation over an extended period of time. Some of these methods include:

  • Memory oscilloscope
  • Chart recorder
  • Noise monitor

This is just a partial list of devices available for long-term monitoring of the suspect system. Other means can be used to help diagnose the erratic failure. Once the monitoring has been performed, the results must be analyzed. Each factor that may contribute to the failure must be assessed to determine the real cause of the problem. A monitoring device provides useful information about the symptoms of the problem, but it does not identify the cause.
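As a rough illustration of analyzing monitoring results, the Python sketch below scans logged voltage samples for readings outside a tolerance band. The sample values, the 120 V nominal level, and the 10 percent tolerance are all made-up assumptions for the example, not figures from any particular recorder or system.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    t_s: float    # seconds since monitoring started
    volts: float  # recorded supply voltage


def find_transients(samples, nominal, tolerance_pct):
    """Return the samples that deviate from nominal by more than tolerance_pct."""
    limit = nominal * tolerance_pct / 100.0
    return [s for s in samples if abs(s.volts - nominal) > limit]


# Hypothetical recorder log (illustrative values only).
log = [
    Sample(0.0, 120.1),
    Sample(1.0, 119.8),
    Sample(2.0, 135.4),
    Sample(3.0, 120.0),
    Sample(4.0, 101.2),
]

events = find_transients(log, nominal=120.0, tolerance_pct=10.0)
for e in events:
    print(f"t={e.t_s:.1f} s  {e.volts:.1f} V")
```

The flagged events are symptoms only. They show when the supply misbehaved; what caused the disturbance still has to be determined by comparing those times with other records, such as equipment starts or lockup times.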

Identifying All Possible Causes of Trouble

Simply fixing a trouble does not necessarily solve a problem. Many times, a repair results in only temporary restoration of system performance. This is because the emphasis is often on getting the equipment up and running, not on fixing the problem. Consider the following example:

A maintenance mechanic was called upon to repair a circulation pump that had failed during normal operation. Upon investigating the failure, the mechanic determined that fuses had blown in the pump controller. An electrician replaced the fuses and re-energized the pump and controller. Together, the electrician and the mechanic observed the restart of the pump. Upon hearing an abnormal grinding noise and observing a noticeable deficiency in pump discharge flow rate, they de-energized the pump. The mechanic inspected the pump and found severely worn and damaged bearings. The bearings were replaced, and the pump was placed into operation. The pump discharge flow rate was normal, and no abnormal sounds were heard during the trial run of the pump, so it was returned to service. Although the normal life expectancy of shaft bearings on the pump was in excess of 3,500 hours of continuous use, the bearings once again failed after only 560 hours of continuous use. The bearing replacement was performed again, with similar results. After the third bearing failure, a thorough examination of the pump revealed a severely misaligned impeller shaft. By the time this problem was detected, the shaft was badly scored and had suffered heat damage. If the time had been taken to diagnose the root cause of the initial problem, the shaft probably could have been saved. At the very least, costly downtime and replacement bearing costs could have been avoided.

The above example may seem extreme, but there have probably been worse instances. By taking the time and using a problem-solving tool or technique, the root cause of a trouble can be determined.
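The warning sign in the pump example can be put in numbers. The Python sketch below compares the achieved bearing life with the expected life stated in the example. The 50 percent review threshold is an assumption chosen for illustration, not a figure from the source or from any standard.

```python
def life_fraction(actual_hours, expected_hours):
    """Fraction of the expected service life that was actually achieved."""
    return actual_hours / expected_hours


EXPECTED_BEARING_LIFE_H = 3500  # example states "in excess of 3,500 hours"
ACTUAL_BEARING_LIFE_H = 560     # hours to the repeat failure in the example
REVIEW_THRESHOLD = 0.5          # assumed: below this, look for a root cause

fraction = life_fraction(ACTUAL_BEARING_LIFE_H, EXPECTED_BEARING_LIFE_H)
needs_root_cause_review = fraction < REVIEW_THRESHOLD

print(f"Bearings reached {fraction:.0%} of expected life")
print("Root cause review needed:", needs_root_cause_review)
```

Since the expected life is given as a minimum, 16 percent is an upper bound on the fraction achieved. A replacement part that lasts that small a share of its expected life suggests that the replaced part was a symptom rather than the cause.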

Cause and Effect Diagrams

Determining the root cause of a problem involves considering the possible causes of the effect. In this case, the trouble's symptom is the effect, and the system components and operating conditions are potential causes. A cause and effect diagram is used to consider the possible causes associated with a particular problem. The diagram is developed as necessary to help isolate the primary or root cause of the problem. This involves gathering data, considering all factors (causes) that contribute to the trouble (effect), and using a process of elimination to determine the root cause. Although the cause and effect diagram is considered a performance-improvement tool, it is worthwhile to consider it as a troubleshooting aid. Using the cause and effect diagram technique can help prevent "hunt-and-peck" troubleshooting and reduce the aggravation associated with undisciplined problem solving.

To resolve a problem in the quickest manner possible and help prevent its recurrence, the causes of the problem must be identified. The cause and effect diagram is used to graphically show the relationship of each of these causes to the problem. The graphical method used to display this relationship is a fishbone diagram. The term "fishbone" refers to the appearance of the completed diagram, whose shape resembles the skeleton of a fish. An example of a fishbone diagram is shown in Figure 13.


Figure 13: Fishbone Diagram

There are many techniques that can be used to determine the possible causes of a problem. One of the best is brainstorming, a group-oriented problem-solving technique. Effective brainstorming requires a group of people who are willing to work together. It is especially useful when troubleshooting a process with many components or a piece of equipment that has mechanical, electrical, and control devices. The more diverse the equipment, system, or process, the more useful the brainstorming technique becomes.

Brainstorming involves systematically listing all possible causes for a problem. Each member of the group takes a turn suggesting a possible cause. Each suggestion is treated as plausible and is written down for consideration. Members suggest one cause at a time until there are no more suggestions. Before any discussion, all group members review the compiled list. The group then discusses the suggested causes, and similar causes are grouped together. For instance, "faulty circuit breaker" and "loose wiring" may be grouped under the heading "electrical" or, even more generically, "materials." Some general group headings that are useful in any troubleshooting analysis are:

  • Materials
  • Operator
  • Methods
  • Equipment
  • Environment

Each of these areas can be broken down into smaller subgroups as required. Once the major headings are determined, the initial fishbone diagram should look like the one shown in Figure 14.


Figure 14: Major Group Headings on a Fishbone Diagram
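The grouping of brainstormed causes under the major headings can be sketched in a few lines of Python. The first two suggestions and their "Materials" heading come from the text above. The other suggestions and the headings assigned to them are hypothetical examples.

```python
HEADINGS = ["Materials", "Operator", "Methods", "Equipment", "Environment"]

# (suggested cause, heading agreed on by the group)
suggestions = [
    ("faulty circuit breaker", "Materials"),
    ("loose wiring", "Materials"),
    ("incorrect startup sequence", "Operator"),   # hypothetical
    ("high ambient temperature", "Environment"),  # hypothetical
    ("loose wiring", "Materials"),                # duplicate suggestion
]


def group_causes(suggestions, headings=HEADINGS):
    """Collect suggested causes under their headings, skipping duplicates."""
    grouped = {heading: [] for heading in headings}
    for cause, heading in suggestions:
        if heading not in grouped:
            raise ValueError(f"unknown heading: {heading}")
        if cause not in grouped[heading]:
            grouped[heading].append(cause)
    return grouped


grouped = group_causes(suggestions)
for heading, causes in grouped.items():
    print(f"{heading}: {', '.join(causes) if causes else '(none yet)'}")
```

Headings with no causes are kept in the result on purpose. An empty heading is a prompt to ask whether that area has really been considered.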

Next, the subgroups can be determined, if necessary, and individual causes added. An example of subgroups could be "mechanical equipment" and "electrical equipment" under the "equipment" group. As the subgroups and individual causes are added, the cause and effect diagram takes on its fishbone appearance. This is shown in Figure 15.


Figure 15: Expanded Cause and Effect Diagram

The brainstorming approach is the ideal way to construct the cause and effect diagram. However, a technician or mechanic may find it difficult to get a group together to come up with the possible causes. In any event, the concept of cause analysis using the cause and effect diagram is useful, even on an individual basis. The process of constructing a cause and effect diagram in itself promotes a broader examination to find the actual root cause of a trouble.

Constructing a Cause and Effect Diagram

The basic steps to constructing the cause and effect diagram, whether in a group or individually, are described next.

Step 1: Identify the Trouble or Problem

This step should be easy. The problem started with a broken or damaged piece of equipment, a malfunctioning circuit, a disturbed process, or some similar specific occurrence. The problem is given a name, which is written on the right-most side of a piece of paper. Placement of the problem on a piece of paper is illustrated in Figure 16.


Figure 16: Placing the Problem on the Cause and Effect Diagram

Step 2: Draw a Main Line Pointing to the Problem

To illustrate a direct path to the problem, a single straight line is drawn from the left side of the piece of paper to the designated problem. The line ends in an arrow pointing to the problem. The problem and the line pointing to it are shown in Figure 17.


Figure 17: Line Pointing to the Problem

Step 3: Identify the Possible Major Causes of the Problem

Identify possible major causes of the problem and designate them on the drawing. This is accomplished by writing each potential major cause in its own box, with the boxes arranged around the centerline (the line that points to the problem). Once the possible major causes have been listed, draw connecting lines between the boxes and the centerline. These lines are drawn with an arrow on the end pointing to the centerline, as shown in Figure 18.


Figure 18: Designating Major Causes

Step 4: Identify Each Possible Minor Cause Associated With the Major Causes

Minor causes are things that potentially contribute to the major cause. Evaluate each major cause designated in step three and identify everything that can contribute to it. These items are listed individually on lines that point into the major cause line. The minor causes on a cause and effect diagram are shown in Figure 19.


Figure 19: Designating Minor Causes on a Cause and Effect Diagram

Step 5: Identify Each Contributing Factor to the Minor Causes

Each minor cause on the cause and effect diagram has factors that contribute to it. Once the minor causes have been designated, the factors that contribute to these causes must be identified. These factors normally are very specific. Depending on the type of trouble initially identified in step one, the factors could be major process components, such as a tank level detector; a specific control device, such as a valve actuator solenoid; or a discrete component, such as a 220-ohm, -watt resistor. These contributing factors are designated on the cause and effect diagram by writing each one on a line that points into the minor cause it is associated with. The resulting diagram is shown in Figure 20.


Figure 20: Designating Factors That Contribute to Minor Causes

Step 6: Review the Cause and Effect Diagram

Once the cause and effect diagram is complete, it should be reviewed carefully to ensure that all possible causes and contributing factors have been identified. Once the diagram has been constructed and reviewed, the troubleshooter can begin to systematically check the potential causes to isolate the root cause of the problem. By using the cause and effect diagram to sort out and relate all of the potential causes of a problem, the troubleshooter can get to the root of the problem and avoid mistaking a symptom for a problem.
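The six steps map naturally onto a small tree structure: the problem is the root, major causes are its branches, and minor causes and contributing factors hang beneath them. The Python sketch below is one possible representation, not a standard tool. The Cause class, the checklist function, and the example entries (loosely based on the pump example earlier in this article) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cause:
    name: str
    children: List["Cause"] = field(default_factory=list)

    def add(self, name):
        """Attach a contributing cause and return it so it can be extended."""
        child = Cause(name)
        self.children.append(child)
        return child


def checklist(node, path=()):
    """Yield one 'problem > major > minor > factor' line per end point."""
    here = path + (node.name,)
    if not node.children:
        yield " > ".join(here)
    for child in node.children:
        yield from checklist(child, here)


# Steps 1 and 2: name the problem (the root that everything points to).
problem = Cause("Pump bearings fail early")
# Step 3: major causes.
equipment = problem.add("Equipment")
problem.add("Environment")
# Step 4: minor causes under a major cause.
mechanical = equipment.add("Mechanical equipment")
electrical = equipment.add("Electrical equipment")
# Step 5: specific contributing factors.
mechanical.add("Misaligned impeller shaft")
electrical.add("Blown controller fuses")

# Step 6: review the diagram as a list of items to check one by one.
items = list(checklist(problem))
for item in items:
    print(item)
```

Printing every end point with its full path gives the troubleshooter a checklist to work through and cross off. A major cause with nothing under it, such as "Environment" here, shows up as its own line and signals a branch that still needs to be developed or ruled out.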