Summary: Problem determination is not an exact science, but it's also not rocket science. A methodical approach will help your problem-solving techniques become more organized, systematic, and, ultimately, more effective.

How do you approach a new problem? When you encounter a new problem, how do you decide what to do? Where do you start? What do you look for? How can you become more effective at troubleshooting? What you need is a methodology for problem determination.

By its very nature, problem determination is about dealing with the unknown and the unexpected. If you knew in advance everything about all the problems you could encounter and exactly how they manifest themselves, you would take measures to prevent them and would not need to investigate them. You cannot expect problem determination to be a perfectly predictable process, but there are a number of common approaches that can make the process go more smoothly and be more effective.

This blog is based on the experience and observations of members of the IBM Support, Serviceability, and SWAT organizations from years of helping our clients, and from seeing both best and worst practices in action. It is an evolving work, as we continue to look for new ways to further enhance the investigative process.

Common challenges

Looking back, we can identify several common challenges that can make problem determination exercises difficult:
Where did it happen?
When did it happen?
Did the problem happen only once or does it occur regularly? Is the problem repeatable at will?
Has the problem been reproduced before, for example during load or stress testing?
Why did it happen?
What do you know? Finally, it helps to clearly summarize all the information available. Make a list of all the symptoms and anomalies, whether or not they seem related to the problem.
Not every problem is difficult

Before jumping into more complex techniques, it is often useful to approach each new problem in two phases:
Over the course of an investigation, you will end up collecting a large number of diagnostic artifacts and files, such as log files, traces, dumps, and so on. Some of these artifacts could be the result of multiple spontaneous occurrences of the problem over time, and others could be the result of specific experiments conducted to try to solve the problem or to gather additional information. Just as the timeline of events is important for keeping track of what happened over time, it is also very important to manage the diagnostic artifacts collected during these events so that you can consult them when you need additional information to pursue a line of investigation. For each artifact, you should be able to answer several questions (listed later in this article):
As with the other organizational devices described here, there are several equally good formats for maintaining this information. One straightforward approach favored by many experienced troubleshooters is to organize all artifacts into subdirectories, with one subdirectory for each major event in the timeline, and to give each artifact within a directory a meaningful file name that reflects its type and source (machine, process, and so on). When necessary, you can also create a small README file for each artifact or directory that records additional details about the circumstances under which the artifact was generated (for example, configuration, options, detailed timestamps, and so on). While it is perfectly acceptable to organize these artifacts in any manually created directory structure you wish, the Case Manager facility provided in IBM Support Assistant includes several features that help organize artifacts along precisely these principles.
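As a purely illustrative sketch of the directory-per-event approach (the directory layout, file-naming scheme, and README fields below are assumptions for illustration, not a prescribed format and not how IBM Support Assistant implements its Case Manager), a small helper along these lines can keep artifacts organized as they arrive:

```python
# Sketch: file each diagnostic artifact under a subdirectory for its timeline
# event, rename it to reflect its source, and note the circumstances in a
# per-event README. Directory names, file names, and fields are hypothetical.
import shutil
from datetime import datetime, timezone
from pathlib import Path

def file_artifact(case_root: str, event: str, source: str, artifact: str, notes: str = "") -> Path:
    event_dir = Path(case_root) / event                      # e.g. case-1234/event2-hang
    event_dir.mkdir(parents=True, exist_ok=True)

    # Meaningful name: <machine-or-process>_<original file name>
    target = event_dir / f"{source}_{Path(artifact).name}"
    shutil.copy2(artifact, target)

    # Record when and under what circumstances the artifact was collected.
    with (event_dir / "README.txt").open("a", encoding="utf-8") as readme:
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        readme.write(f"{stamp}  {target.name}  {notes}\n")
    return target

# Example call (hypothetical machine and file names):
# file_artifact("case-1234", "event2-hang", "appnode1",
#               "/tmp/javacore.20120301.txt", notes="taken 5 min into the hang")
```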
Encountering one problem might cause the system to enter a particular error recovery path, and during the execution of that path another problem might manifest itself.
In other cases, one problem might simply mask another; the first problem does not let the system proceed past a certain point in its processing of requests, but once that first problem is resolved, you get further into the processing and encounter a second independent problem.
In all these cases, you have no choice but to address one problem at a time, in the order in which they are encountered, while observing the operation of the system. Each problem could itself require a lengthy sequence of investigative steps before it is understood. Once one problem is understood, you can proceed to the next problem in the sequence, and so on, peeling away each imaginary onion skin until you finally reach the core. This method can be very frustrating, especially to those unfamiliar with the troubleshooting process, but effective communication can help keep morale high and build trust in the process by showing concrete progress and minimizing confusion. Maintaining and publishing a clear executive summary sets the context of the overall situation, highlights each specific problem as it is resolved so that progress is evident, and identifies new (major) problems as they are discovered. The table of problems, symptoms, and actions helps keep track of the various layers and clarifies the relationships between similar problems and symptoms.

Using IBM Guided Activity Assistant

As a complement to the various techniques outlined in this article, you might want to consider using the IBM Guided Activity Assistant to help you conduct your investigation. Its most visible function is to provide information and step-by-step guidance for the specific tasks that should be performed to diagnose a variety of problems. It also embodies many of the general principles presented in this article: it helps you characterize your problem, keeps track of the state of the investigation and the various diagnostic artifacts through its Case Manager, and guides you through initial steps that are similar to Phase 1, which also support the information gathering necessary to launch Phase 2.

Summary

Problem determination is about dealing with the unknown and the unexpected, and so it will probably never be an exact science -- but it's also not rocket science. By following the recommendations and techniques outlined in this blog, you can make your problem determination work more organized, systematic, and, in the end, more effective and rewarding.
- Need for direction, or "What do I do next?" Sometimes the people involved in resolving a problem simply don't know where to start or what to do at each step. Problems can be complex, and it is not always obvious how to approach them. This blog provides some general guidance to help you get started and decide what to do at each step of the process, for a broad range of problems.
- Need for information, or "What does it mean?" Sometimes what’s missing is simply information: you see some sort of diagnostic message or diagnostic file, but you don’t know how to interpret it or can’t understand how it relates to the problem at hand. You need good sources of information and tools to help you interpret all the clues discovered in the course of an investigation.
- Miscommunication and lack of organization, or "What was I doing? What were you doing?" Sometimes time is wasted or important clues are lost because of miscommunication, or because the investigation has dragged on for so long that the collected information becomes difficult to manage. Events, timelines, and artifacts that are often invaluable for determining the next steps of an investigation and communicating progress can easily get lost or forgotten.
- Dealing with multiple unrelated problems or symptoms, or "What are we looking for?" A particular challenge in complex situations is not knowing whether you are dealing with a single problem or with multiple independent problems that happen to occur at the same time. You might see a variety of symptoms, some of which relate to one problem, some to another, and others that are simply incidental and benign. Being able to distinguish between the "noise" and the real problems can go a long way toward a timely resolution.
- What happened?
- What are the main symptoms that led you to determine that there is a problem, as opposed to all other ancillary symptoms?
- Exactly how would the system have to look or act for you to consider that the problem is no longer present?
- How would you recognize that the same problem happened again?
- Be cautious about using vague terms like "hang," "crash," and "fail," which are often inaccurate generalizations and can distract attention from important symptoms.
- Be aware that in real-world situations, there can be several independent problems rather than just one. You need to recognize them and prioritize them.
- Be conscious of tangential problems that are consistent and well-known (for example, an application error that is always written to the same log in which the original problem occurs). Sometimes these problems are incidental and not worth investigating, and sometimes they can be related to the problem.
- Be precise about which machine, which application, which processes, and so on, the problem was observed on.
- Which logs and which screens should you look at to see the problems?
- Know the overall environment surrounding the problem (for example, find out the system topology, network topology, application overview, software versions, and so on).
- Track time stamps so you know where to look in the logs. Note time zone offsets and whether or not multiple systems have synchronized clocks (see the sketch after this list).
- Are there any special timing circumstances? For example, does the problem occur:
- Every day at a particular time of day?
- Every time you try to perform a particular operation, or every time a particular system process executes?
- Every time a particular user or batch process starts processing?
- Why did the problem happen now, and not earlier? What has changed?
- Why does the problem happen here, on this system, and not on other similar systems? What is different?
- Has it ever worked correctly in the past?
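Regarding the time stamp and time zone point in the list above, here is a small illustrative sketch (the time zone names, timestamp format, and values are assumptions) showing how timestamps from systems configured in different time zones can be normalized to UTC so that their log entries line up on a single timeline:

```python
# Sketch: convert naive log timestamps, recorded in each system's local time
# zone, to UTC so entries from different machines can be correlated.
# Zone names, format string, and example values are illustrative assumptions.
from datetime import datetime
from zoneinfo import ZoneInfo

def to_utc(local_timestamp: str, zone_name: str, fmt: str = "%Y-%m-%d %H:%M:%S") -> datetime:
    naive = datetime.strptime(local_timestamp, fmt)
    return naive.replace(tzinfo=ZoneInfo(zone_name)).astimezone(ZoneInfo("UTC"))

# The same instant logged on two systems in different zones:
print(to_utc("2012-03-01 14:05:00", "America/New_York"))   # 2012-03-01 19:05:00+00:00
print(to_utc("2012-03-01 20:05:00", "Europe/Berlin"))      # 2012-03-01 19:05:00+00:00
```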
- In Phase 1, you perform a broad scan of the entire system after a problem has occurred to find any major error messages or dump files that have been generated in the recent past. Each of these errors and dumps constitutes an initial symptom for the investigation. Search for these symptoms in one or more knowledge bases of known problems.
- In Phase 2, you do everything else: select one or more initial symptoms for additional investigation beyond a simple search for known problems, perform diagnostic actions (using the analysis or isolation approach described below) that are typically specific to the problem under investigation to generate additional symptoms or information, and then repeat the process as needed until a solution is found.
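As one purely illustrative way to script the Phase 1 broad scan by hand (the directories, file-name patterns, and age threshold below are assumptions, not part of any product), you might sweep the system for recently generated logs and dumps and treat each hit as a candidate initial symptom:

```python
# Sketch of a hand-rolled Phase 1 broad scan: find logs and dump files
# modified in the recent past; each hit is a candidate initial symptom to
# research in knowledge bases. Roots, patterns, and threshold are assumptions.
import time
from pathlib import Path

def broad_scan(roots, patterns=("*.log", "javacore*", "heapdump*", "core*"), hours=24):
    cutoff = time.time() - hours * 3600
    for root in roots:
        for pattern in patterns:
            for path in Path(root).rglob(pattern):
                if path.is_file() and path.stat().st_mtime >= cutoff:
                    yield path

# Example (hypothetical install path):
for artifact in broad_scan(["/opt/IBM/WebSphere/AppServer/profiles"]):
    print(artifact)
```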
- Resolution: Finding the root cause and ensuring that it will not happen again.
- Relief: Enabling your users to resume productive work.
- In the analysis approach, you pick one or more specific symptoms or anomalies observed in the system, and then drill down to gain more detailed information about these items and their causes. These, in turn, might lead to new, more specific symptoms or anomalies, which you further analyze until you reach a point where they are so simple or fundamental that it is clear what caused them. Typical techniques used in this approach include searching knowledge bases to help interpret symptoms, and using traces or system dumps to find more specific symptoms that you can then research further with knowledge bases. For example, consider a problem in which an application server appears to be crashing. By analyzing the diagnostic material from the crash, you can determine that the crash originated in a library called mylib.so. By looking at the source code for the library and taking the native stack trace information from the gathered diagnostic material, you can see that a bit of code creates a pointer to a memory location, but does not handle it correctly. This results in an illegal operation and subsequent crash.
- In the isolation approach, rather than focusing on one particular symptom and analyzing it in ever greater detail, you look at the context in which each symptom occurs within the overall system and its relation to other symptoms, and then attempt to simplify and eliminate factors until you are left with a set of factors so small and simple that, again, it is clear what caused the problem. Typical techniques used in this approach include performing specific experiments to observe how the system's behavior changes in response to specific inputs or stimuli, or breaking down the system or some of its operations into smaller parts to see which of these parts is contributing to the problem. For example, consider a large WebSphere Application Server environment, consisting of many nodes across several physical machines, in which an accounting application deployed into two clusters is experiencing long response times. Using the isolation approach, you might opt to trace the application servers involved along with the network links between them. This method would enable you to determine whether the slowdown lies in the network or in the application servers, making it possible to perform a more in-depth investigation on the affected component.
- Executive summary This is a short paragraph that summarizes the status of the highest priority issues, what has been done in the last interval, and what the top action items are, with owners and approximate dates. This enables both stakeholders and those who are only marginally involved to understand the current status. Not exclusively for “executives,” this information helps focus the team, explains progress and next steps, and should highlight any important discoveries, dependencies, and constraints.
- Table of problems, symptoms, and actions Phase 2 investigations can sometimes suffer when the number of problems to be resolved grows. It is important to keep a written list of these additional problems and not rely on collective memory. This table is a crucial piece of record-keeping and should be kept for all situations, no matter how simple they seem at first. When the situation is simple, the table is simple as well and easy to maintain. The effort to track this information will pay off on those (hopefully rare) occasions when things turn out to be much more complicated than originally believed; by the time that complexity is realized, it is almost impossible to recreate all of the information that you will wish you had kept. The actual format of this table, its level of detail, and how rigorously it is used will vary from one troubleshooter and situation to another. Regardless of the exact format, this table should contain:
- Problems: One entry for each problem that you are attempting to resolve (or each thing that needs to be changed).
- Symptoms: External manifestations of the underlying problem and anomalies that might provide a clue about the underlying problem. A symptom might be an error message observed in a log, or a particular pattern noticed in a system dump; the problem is the error condition or crash itself. Problems can be fixed; symptoms go away (or not) as changes are made. Sometimes new symptoms appear during an investigation.
- Actions: Tasks to be performed that may or may not be directly related to a particular symptom or problem, such as upgrading the software or preparing a new test environment.
- Fixes: Alternatives to be tried to achieve a resolution or workaround. (Some troubleshooters list these as actions.)
- Theories: It is useful to track ideas about why the problem is occurring or how it might be fixed, along with actions that could be taken to test them. Noting which symptoms the theories are derived from can help rule out theories or draft new ones for the investigation.
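There is no single required format for this table; as a purely illustrative sketch (the field names and entries below are invented), it could be kept as something as simple as:

```python
# Sketch: one possible minimal representation of the problems / symptoms /
# actions / fixes / theories table. Field names and entries are illustrative.
tracking_table = [
    {
        "id": "P1",
        "problem": "Intermittent OutOfMemoryError in application server server1",
        "symptoms": ["OOM written to SystemOut.log at 02:14", "heap dump generated"],
        "actions": ["Enable verbose GC", "Collect a heap dump on the next occurrence"],
        "fixes": ["Increase maximum heap size (relief only)"],
        "theories": ["Session cache grows without bound (derived from the heap dump)"],
        "status": "under investigation",
    },
]

for entry in tracking_table:
    print(f"{entry['id']}: {entry['problem']} [{entry['status']}]")
```

A spreadsheet or a plain text file works just as well; what matters is that the information is actually written down rather than carried in memory.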
- In any investigation that lasts more than a few days, or that involves more than a few individuals, there will invariably be questions about the results of some earlier experiment or trace, where some file was saved, and so on. Keep a written document or log book in which you record a timeline of all major events that occurred during the investigation. The exact format of the timeline and the level of detail might vary between individuals and between different situations, but a timeline will typically contain:
- One entry for each occurrence of any problem being investigated.
- One entry for each significant change made to the system (such as software upgrades, reinstalled applications, and so on).
- One entry for each major diagnostic step taken (such as a test to reproduce the problem or experiment with a solution, a trace, and so on).
- A precise date and time stamp.
- A note of the systems (machines, servers, and so on) that were involved.
- A note of where any diagnostic artifacts (logs, traces, dumps, and so on) were saved.
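As a minimal sketch of such a log book (the file name and entry layout are assumptions, not a required format), the timeline can be kept as a simple append-only file:

```python
# Sketch: append-only timeline ("log book") with one line per major event.
# File name, separator, and field layout are illustrative assumptions.
from datetime import datetime, timezone

def log_event(description, systems, artifacts=(), path="timeline.txt"):
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    line = f"{stamp} | {description} | systems={','.join(systems)} | artifacts={','.join(artifacts)}\n"
    with open(path, "a", encoding="utf-8") as logbook:
        logbook.write(line)

# Example entries (hypothetical host and file names):
log_event("Hang reproduced during load test", ["appnode1"], ["event2-hang/appnode1_javacore.txt"])
log_event("Applied configuration change: increased thread pool size", ["appnode1", "appnode2"])
```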
- Which event in the timeline does it correspond with?
- Which specific machine or process (of all the machines and processes involved in a given event) did the artifact come from?
- What system or process configuration was in effect at the time the artifact was generated?
- What options were used specifically to generate that artifact, if appropriate (for example, which trace settings)?
- If the artifact is a log or trace file that could contain entries covering a long period of time, exactly at which time stamp(s) did something noteworthy happen in the overall system that you might want to correlate with entries in this log or trace file?
Avoiding tunnel vision

Regardless of the approach you take, it's important to always keep an open mind for out-of-the-box thinking and theories. Avoid "tunnel vision": focusing on one possible root cause and relentlessly attempting to solve only that one cause. Failing to see the whole picture and to methodically evaluate all potential solutions can result in prolonged relief and recovery times. Here are some ideas to help you avoid this condition:
- Utilize checkpoints: During your investigation, take time to create checkpoints at which all team members share all relevant findings since the last checkpoint. This helps the team maintain creative synergy and ensures that no details are overlooked.
- Work in parallel: When a team member has an alternative theory about the root cause, it can benefit the investigation to have one or more team members work on proving that theory in parallel with the main investigative effort.
- Regularly (re)ask the "big picture" questions: Keep asking "what is the problem?" and "are we solving the right problem?" One classic example of tunnel vision is thinking that you have reproduced a problem in a test environment, only to discover later that it was actually a slight variation of the production problem. Asking big-picture questions can help put problems in context and avoid chasing those that are of lesser priority.
- Multiple problems or symptoms can be linked by a cause-and-effect relationship. For example:
- A total lack of response from a Web site might be due to overload in a Web server...
- ...which might itself be due to the slow response of an application server that happens to serve only one type of request on the entire Web site but ties up excessive resources on the Web server...
- ...which might itself be due to a database access slowdown caused by another application server, which ties up resources needed by this application server...
- ...which might itself be due to problems on the underlying network that connects the application server to the database...
- ...and so on, until you get to the so-called root cause of this sequence of problems.