@@ran_quest_help Part of Instruct consists of a series of sixty questions. The questions pertain to the system event files (ERROR.SYS and ERRLOG.SYS), the Spear Library dialogs, and the Spear Library reports. The Random Question feature is primarily a Course Administrator's tool. It allows the Course Administrator to randomly select a few questions that will help determine a student's progress. If the student is able to answer 8 out of 10 random questions correctly, then chances are he (or she) understands how to use the Spear Library. If not, then perhaps a little more study time is needed. Students can also use the Random Question feature as a self-evaluation tool. To do so, enter a random number in the range of 1 to 50. Instruct will dispatch to a corresponding random question. Answer the question to the best of your knowledge. Instruct will evaluate your answer and print an appropriate message. At that point you can type: RANDOM and select another random question. @@quest_help You are participating in a teaching dialog informally referred to as the "Rhetorical Approach to Learning". The approach involves a statement about a subject, in this case the Spear Library. You are to determine whether the statement is True or False. If your answer is correct you will receive a short message and then go on to the next statement. If your answer is incorrect, then the correct answer will be explained and the statement will be repeated. If you are not sure whether the statement is True or False, you can press the RETURN key and the correct answer will be explained. In addition to the True, False, and RETURN key response, you can type NEXT if you want to skip to the next statement. You can also press the BACKSPACE key if you want to return to the menu. @@ans_help You have just answered a question either correctly or incorrectly. You now have three choices. You can: 1. Press the RETURN key.
If you answered the question correctly you will continue on to the next sequential question. If, however, you answered the question incorrectly, then the question will be repeated. 2. Type NEXT to continue on to the next sequential question regardless of whether you answered the last question correctly or not. 3. Press the BACKSPACE key to repeat the last question regardless of whether your answer was correct or not. Your response please: @@no_help There really isn't any way that we can help you at this point. Press the BACKSPACE key and try reading the text again. If it still doesn't make sense, then contact: The Spear Team MRO1-1 / M2 Sorry @@text_help Instruct is frame oriented. That is, it displays one frame or block of information at a time. After you have read the frame you can: 1. Press the RETURN key to proceed to the next frame of information. 2. Press the BACKSPACE key to review the previous frame of information. 3. Type MENU if you want to go back to the subject menu. @@menu_help Instruct is organized around a hierarchy of subject menus. The menus allow you to use Instruct as a reference tool. The top item on the menu (item 0) introduces the subjects and explains their relationship. The remaining items are subjects. You can select any item on the menu by typing the number that corresponds to the item. You can also press the RETURN key to automatically proceed to the first subject on the menu. If you want to go back to the previous menu in the hierarchy you can type MENU. @@fwd_trans_help Instruct is organized around a hierarchy of subject menus. You can use the RETURN key feature to sequence through the subjects listed on the menu. Each time you move from one subject to another you will be notified. At this point you can choose to go on by pressing the RETURN key, or you can choose to go back to the menu and select a different subject by typing MENU.
@@rev_trans_help Instruct is designed in such a way that you can go forward and backward through the subject matter. Each time you move from one subject to another you will be notified. In this case you were notified that you were about to back into the previous subject on the menu. At this point you can: 1. Type MENU to go back to the subject menu. 2. Press the RETURN key to go back to where you came from. 3. Press the BACKSPACE key, or type /REVERSE to continue backing up. However, if the subject that you are backing into required multiple frames of text to explain, then you will back into the last frame. 4. Type BEGIN to back up to the first frame of the subject that you are backing into. @@ran_quest_res_error_msg This is the Random Question response error message. The number that you entered is not within the range of 1 to 50. @@text_res_error_msg This is the text response error message. Instruct displays one page of text at a time. After you have read the text you can: 1. Press the RETURN key to go on to the next page. 2. Press the BACKSPACE key or type /R to go back to the previous page. 3. Type MENU to go back to the menu and select another subject. 4. Type /B to return to the Spear prompt. If you are using a student ID, and if you specify that ID at the Instruct prompt, you will return to the page that you were at when you typed /B. 5. Type anything else and you will get this message. @@menu_res_error_msg This is the menu response error message. Instruct uses a hierarchy of menus. The menus allow you to use Instruct as a quick reference tool. At a menu you can: 1. Type the number on the menu that corresponds to the subject that you are interested in. 2. Type MENU to go back to the previous (higher level) menu. 3. Type /B to return to the Spear prompt. If you are using a student ID, and if you specify that ID at the Instruct prompt, you will return to the page that you were at when you typed /B. 4. Press the BACKSPACE key or type /R.
You will get a message stating that you are about to back into the Introduction to the menu. 5. Type anything else and you will get this message. @@fwd_trans_res_error_msg This is the forward response error message. You can sequence through Instruct by pressing the RETURN key. If you do so, you will sequence through an Introduction, followed by a menu, followed by the first subject, followed by the second subject, etc. You will be notified each time you move from one subject to another. At that point you can: 1. Press the RETURN key to continue sequencing through Instruct. 2. Press the BACKSPACE key or type /R to repeat the last page of text. 3. Type MENU to go back to the menu and select another subject. 4. Type /B to return to the Spear prompt. If you are using a student ID, and if you specify that ID at the Instruct prompt, you will return to the point that you were at when you typed /B. 5. Type anything else and you will get this message. @@rev_trans_res_error_msg This is the reverse-transition prompt/response error message. You are sequencing through Instruct in a reverse direction. You were notified that you are about to move in a reverse direction from one subject to another. You can: 1. Press the RETURN key to begin sequencing in a forward direction. 2. Press the BACKSPACE key to continue going in a reverse direction. 3. Type BEGIN to go to the beginning of the subject. 4. Type MENU to go back to the menu and select another subject. 5. Type /B to return to the Spear prompt. 6. Type anything else and you will get this message. @@ans_res_error_msg This is a response error message. Your response does not match the list of acceptable responses. For further information press the RETURN key, then type: ? or HELP. @@farewell Instruct bids you farewell. Type /Break to return to Spear. @@course_admin Spear Course Administrator and Student Guide Course Description The Instruct course consists of four main modules: 1.
Fault Isolation Techniques - This module describes the nature of intermittent faults and discusses some of the most common methods used to isolate intermittent system and subsystem failures. 2. System Event File Organization and Content - This module describes the overall organization and content of TOPS-10, TOPS-20, and VAX/VMS system event files. 3. Spear Library Functions - This module explains how to use each of the Spear maintenance functions: Retrieve, Summarize, and Compute. 4. Guaranteed Uptime Program/NOTIFY - This module describes the GUP service which ensures the highest level of reliability for your system. This module also explains how to use NOTIFY to calculate statistics and to log information related to system uptime. @@course_admin_a Each module consists of an introduction and a menu of subordinate subjects. When appropriate, the subordinate subjects are further broken down into introductions and menus. Thus, Instruct can be used as both a tutorial and a reference tool. If you want to use Instruct as a tutorial (i.e., sequence through the course much as you would read a book) you can do so using the RETURN key. You will proceed to the module introduction, then the menu, then the first subject on the menu, followed by the next subject, etc. If you want to use Instruct as a reference tool, then instead of pressing the RETURN key at the menu, select the subject number that interests you. You will proceed directly to that subject. If, after investigating the subject you want to return to the menu, type MENU. 
@@course_map Course Map
______________________________________
|  Guaranteed Uptime Program/NOTIFY  |
______________________________________
                ^                        |--Applications
                |                        |--Summarize
_________________________________        |--Compute
|    Using the Spear Library    |--------|--Retrieve
_________________________________        |--Klerr
                ^
                |
_________________________________
|      System Event Files       |
_________________________________
                ^
                |
______________________________________   ___________________________
| Course Administrator/Student Guide |-->|     Troubleshooting     |
______________________________________   ___________________________
@@course_map_a The course map suggests a sequence to follow to learn about Spear. This sequence reflects the following factors: Spear processes the system event file and generates a number of reports which are useful in supporting the system. Spear allows the user to produce the following reports: Summary of the system faults by device and time. System reliability and uptime reports. Dump of event log entries in multiple formats. Spear also allows the user to maintain the event file, and includes its own instruction package for its use. @@feedback Feedback is an important part of any system design. Technically, feedback is defined as a representative sample of the output used to control or correct the process. The process, in this case, is The Spear Library. The output is the ability of the Spear Library to help you evaluate system performance and solve service related system problems. If you have any ideas or suggestions for improving the usefulness of The Spear Library, please contact: Digital Equipment Corporation The SPEAR Team MRO1-1 / M2 200 Forest Street Marlboro, Mass. 01752 Thank you; The Spear Team @@random_question The Random Question feature allows you to enter a random number in the range of 1 to 50. Instruct will respond by presenting you with a random question based on the course content.
This feature can be used by anyone who has a few minutes, and who would like to pick up a few tidbits about the use of The Spear Library. The feature can also be used by The Course Administrator as a tool to spot check student progress. After being informed that you have correctly answered a question, you may select another random question by typing "RANDOM". Type if you wish to enter the random question mode. @@spear_man Using The Spear Manual You can use The Spear Manual as a learning aid, a user's guide, or a reference tool. As a Learning Aid: Chapters 1, 2, and 3 provide an overview of the Spear Library. They also provide background information required to understand and use the Spear Library. As a User's Guide: Chapters 4 and 5 provide step-by-step procedures for using the Spear functions: Retrieve, Summarize, and Compute. The chapters explain, in detail, the command syntax and the response parameters associated with each function. As a Reference Tool: Chapter 6 and the appendices provide reference material such as system event file formats, event record descriptions, and examples of the report formats. This chapter and the appendices are for reference only. They are not meant to be read from beginning to end. @@R.T.cou_ovr_a STOP - You are moving in a reverse direction through the menu. You are about to back into the Course Administrator/Student Guide. @@1.M. Troubleshooting Topic menu: 1. Attitude vs. Approach 2. The Formal Approach 3. The Systematic Approach 4. The Variable Approach @@1.1. Attitude vs. Approach First and foremost, your success as a problem solver depends more on your attitude than it does on your approach. Quite simply, if you believe that you can (solve a particular problem), then you probably will; if you believe that you can't, then you probably won't. The only thing that a problem has going for it is your attitude. Therefore, with the right attitude, you can solve almost any problem. It's just a matter of time.
Never give up and you'll never lose. @@1.1.A. Approach The way you approach the solution to a problem will also, to a large extent, determine your success as a problem solver. The more logical and systematic your approach, the more successful you're likely to be. Next on the menu are a couple of systematic problem solving approaches that I think you will find to be both interesting and quite effective. @@R.T.1.1.A. STOP - You are moving in a reverse direction through the menu. You are about to back into the Attitude vs. Approach section of the course. Your response please: @@1.2. The Formal Approach The Formal Approach consists of seven steps: 1. RESEARCH and DEFINE the problem (what is, or is not, happening) 2. VENTURE a testable educated guess (as to the cause of the problem) 3. SETUP a practical experiment (to test the educated guess) 4. PREDICT the result (before you conduct the experiment) 5. CONDUCT the experiment (keep an accurate set of notes) 6. EVALUATE the result (compare the actual and predicted results) 7. REFINE the definition and REPEAT the process (beginning with step 2) @@1.2.A. Step 1 - RESEARCH and DEFINE the problem - If you're not familiar with the system, begin your research at the Branch office. Look over the records for the last couple of weeks. Try to get an idea of the size and the application of the system. Also, find out when the system was last serviced, by whom, and why. When you first arrive on site take five or ten minutes to talk with the customer, the operator, or anyone else who may be able to explain the problem. Here's a partial list of the type of questions that you should ask: How serious is the problem? How long has it been going on? Has the system ever had a problem like this before? How has the system been performing lately? Have there been any recent hardware or software changes? @@1.2.B. You can define the problem at the same time that you are doing the research. Ask yourself three questions: 1.
What is happening that shouldn't? 2. What is not happening that should? 3. What are the surrounding conditions? The first two questions will help you identify the main error symptom. The third question will help you identify the context or circumstances that surround the symptom. That's important, because it's practically impossible to solve a problem out of context. @@1.2.C. Once again, the questions to ask yourself when defining a problem are: What is happening that shouldn't? What is not happening that should? What are the surrounding conditions? The definition should be as complete as possible. It should also state, in clear and concise terms, the major symptom and the conditions or circumstances that surround that symptom. One more thing, and this is important, you should write the definition down, at least in note form. For example: Def - 4 days/2020/256K/cache/TOPS-20(4.1)/UBANXM/freq:12-14 hrs. @@1.2.D. Or more formally: During the last four days, the system, a 2020 with 256K and cache, running TOPS-20 (4.1) has crashed about every 12 or 14 hours with a UBANXM Bug Halt. Note that the definition states only one main error symptom, UBANXM Bug Halt. The rest of the information describes the conditions that surround the error symptom (i.e., the context of the problem). @@1.2.E. Sometimes, however, a system will exhibit multiple error symptoms. In such a case, each error symptom (including the surrounding conditions) should be stated separately. This is important because, when you first start working, you have no way of knowing, for sure, whether or not the system actually has multiple problems. Therefore, assume the worst case. If a system exhibits multiple error symptoms treat each symptom separately. That way you will eliminate the possibility of multiple errors compounding the problem solving process. Also, if you separate multiple error symptoms, then you can investigate the most obvious symptom first, which is sound troubleshooting practice. @@1.2.F. 
Review - The key points discussed so far are: 1. Talk to anyone who may know something about the problem. 2. DEFINE the problem. Find out exactly: What is happening that shouldn't? What is not happening that should? What are the surrounding conditions? 3. Remember to get all the conditions and circumstances. It's next to impossible to solve a problem out of context. 4. Write down the definition, at least in note form. Be clear, concise, and as complete as possible. 5. Treat each error symptom as if it were a separate problem. 6. Attempt to solve the most obvious problem first. @@1.2.G. Step 2 - VENTURE a testable educated guess (TEG) as to the cause of the problem. The truth of the matter is, when you first start out to solve a problem, you can't know (for sure) what the cause is. Therefore, you really don't have much of a choice; you have to begin with a guess. Fortunately, if the guess is testable, it does not have to be accurate. In fact, your first few guesses probably won't be accurate. But, if you use this approach and your guesses are testable, then they will quickly become accurate. In other words, they will either: a) lead you directly to the cause of the problem, or b) lead you to the realization that you could use some help. Either way, you win. @@1.2.H. Here are a couple of testable educated guesses (TEGs) to go along with the problem that was identified and defined earlier: Def - 4 days/2020/256K/cache/TOPS-20(4.1)/UBANXM/freq:12-14 hrs. TEG #1. A low voltage condition exists at one of the UBAs. TEG #2. One of the Unibus cables is improperly seated. @@1.2.I. REMEMBER TEGs don't have to be earth shattering. But they do have to be testable. @@1.2.J. Step 3 - SETUP an experiment that will prove, or disprove, your TEG. The experiment should be carefully thought out. You should make every effort to ensure that it is a true and accurate test of your guess. Take your time.
Make sure that your experiment is not inadvertently testing something other than your TEG. Here's why. If your experiment turns out to test something other than your TEG, and you don't realize it, then you are liable to misinterpret the result. Consequently, you may find yourself tripping down the Old Garden Path. @@1.2.K. The Old Garden Path, by the way, is an expression that refers to a troubleshooting tangent, a lesson in pure frustration. The path or tangent leads you away from the real cause of the problem, contributes very little useful information, and consumes lots of valuable time and effort. So, give yourself a break. Don't take a chance on a trip down the garden path. Instead, use the time to carefully think out your experiment. @@1.2.L. The experiment doesn't have to be complex or elaborate. Let's go back to the problem definition and TEGs that we used earlier, and see if we can devise a couple of simple experiments that will prove, or disprove, the TEGs. Def - 4 days/2020/256K/cache/TOPS-20(4.1)/UBANXM/freq:12-14 hrs. TEG #1. A low voltage condition exists at one of the UBAs. Exp #1. Use a DVM to test the voltage at each UBA. TEG #2. One of the Unibus cables is improperly seated. Exp #2. Clean and reseat each cable in the Unibus. @@1.2.M. Review - The key points discussed so far are: 1. Research and Define the problem (in writing). Find out exactly: a) What is happening that shouldn't? b) What is not happening that should? c) What are the surrounding conditions? 2. Treat each error symptom as if it were a separate problem. Then, select the most obvious problem and work on it. 3. Venture a testable educated guess (TEG) as to what might be causing the problem. 4. Setup an experiment that will either prove, or disprove your guess. Take your time. Make sure the experiment is a valid test. If it's not, you may waste a lot of time chasing a tangent.
If you've been following this course, right around now you should be getting some idea of how effective a problem solving approach such as this can be. Essentially, it is a systematic process of elimination. Properly used it will isolate and ultimately eliminate virtually any problem a system can develop. It's just a matter of time. @@1.2.N. Step 4 - PREDICT the result of the experiment before you conduct it. The purpose of this step is to double check the validity of your experiment. The prediction should be based on the assumptions that: 1. Your TEG or guess is absolutely correct. 2. Your experiment is a true and valid test of your TEG. Both of these assumptions will be verified later in Step 6. @@1.2.O. As trivial as this step may seem, it should never be skipped. Nor should you ever leave it up to "maybe" type thinking: Maybe...this will happen (or) Maybe...that will happen When it comes to your experiment and the predicted result, "maybe" type thinking leads to: "Gee.. that's interesting; wonder what it means" type curiosity. And that, my friend, will lead you right down the Old Garden Path. Therefore, if you decide to use this problem solving approach, keep in mind that your prediction should be explicitly stated and well thought out. Don't get tricked into going off on a wild turkey chase. @@1.2.P. Getting back to our example, let's add a couple of predictions: Def - 4 days/2020/256K/cache/TOPS-20(4.1)/UBANXM/freq:12-14 hrs. TEG #1. A low voltage condition exists at one of the UBAs. Exp #1. Use a DVM to test the voltage at each UBA. Pre #1. The voltage at one of the UBAs will be out of tolerance. TEG #2. One of the Unibus cables is improperly seated. Exp #2. Clean and reseat each cable in the Unibus. Pre #2. One of the cables will be loose or dirty. @@1.2.Q. Well, that's it for the hard part. The last three steps are relatively simple and straight-forward. But before we go on, let's quickly review the main points: 1. Research and Define the problem.
Find out: a) What is happening that shouldn't? b) What is not happening that should? c) What are the surrounding conditions? 2. Treat each error symptom as if it were a separate problem. 3. Venture a testable educated guess (TEG) as to the cause of the problem. 4. Setup an experiment that will either prove, or disprove your TEG. 5. Predict the result of the experiment in advance. Assume that your TEG is correct and your experiment is valid. Be explicit. State (in writing) exactly what you expect to happen. 6. Avoid "maybe" type thinking. It's liable to get you into trouble. @@1.2.R. Step 5 - CONDUCT the experiment - This is the most exciting step in the formal problem solving process. Here's all you have to do. Either: 1. Check the voltage at each UBA. 2. Clean and reseat each UBA cable. Unfortunately, this is where a lot of people fall down. They're overwhelmed by the task. So they tend to put it off. After all, checking the voltage at each UBA, or cleaning and reseating each UBA cable is not a five minute job. @@1.2.S. But, if you properly set up the experiment, then half the job is done. Now if it's going to take a while to conduct the experiment, set a time limit. Don't rush, but try to estimate how long it will take. You might be surprised to find that, once you are set up, it only takes a few minutes to check the voltage at each UBA. So if you've five UBAs to check, you could easily be done in ten minutes. That's not so bad. @@1.2.T. When it comes to cleaning and reseating cables, however, that can be a large undertaking. At two or three minutes per connector, that could require forty-five minutes or an hour to complete. At this point you might want to revise your experiment: you might decide to clean and reseat a third of the UBA cables, and see if that corrects the problem. There is a trade-off involved here.
You must consider: the seriousness of the problem, the frequency of recurrence, and the amount of time and effort necessary to prove (or disprove) your TEG. The decision is subjective, and entirely up to you. The rule of thumb here is: Do what you think is right. @@1.2.U. Step 6 - EVALUATE the result - After conducting the experiment, compare the predicted result with the actual result. If they match, then you have accomplished one of two things. 1. You have either identified the cause of the problem, or 2. You have gathered some new, fairly reliable, information that you can use to refine the problem definition. If the predicted result and the actual result do not match, however, then there is a conflict. Either the experiment tested something other than your TEG, or your understanding of the experiment (the prediction) was incorrect. In either case you should STOP IMMEDIATELY. @@1.2.V. You must figure out which was in error: the experiment, or the predicted result. If, after some thought, you decide that the predicted result was in error, that's ok. It means that the experiment was, in fact, a valid test of your TEG. And, therefore, the result can be used with confidence to refine the definition of the problem. If, however, you discover that the experiment was in error; that is, the experiment was not a valid test of your TEG, then be very careful. You should reconsider the entire situation and either revise the experiment in such a way that it is a valid test of your hypothesis, or scrap the whole thing and start over again. @@1.2.W. Now you can see the importance of predicting the result of an experiment before you conduct it. If you are unable to determine whether or not the experiment was, in fact, a valid test of your TEG, then you're liable to "assume" that it was. And that kind of an assumption may lead you right down the Old Garden Path.
The point here is: if you know what you expect to happen, then you are much more likely to recognize cases where the experiment is not testing what you think it is. @@1.2.X. CHANGES - If an experiment requires that you change the system in any way (swap a cable, perform an adjustment, exchange a module, etc.) be sure that you can restore the system to its original state should you need to. One foolproof way of doing that is to keep notes. Notes don't forget. Surely, in the past at least, some very successful technicians didn't keep any notes at all. But that doesn't mean that they shouldn't have; it only means that they didn't. And that's too bad, because that means that they were using part of their brain muscle to recall facts; therefore, less of their brain muscle was available to think about solving the problem. Besides, once you get used to it, thinking is much more fun than recalling facts. Don't you think? @@1.2.Y. Either you disagree, or you're not thinking. @@1.2.Z. So do I. @@1.2.A1. Back to CHANGES. If you change the system and the change doesn't correct the problem, then you should restore the system to its original state as soon as possible. If you don't, then you should realize that you are running the risk of introducing new problems into the system and thus, compounding the situation. @@1.2.B1. NEW SYMPTOMS - Finally, if you restore the system to its original state and find that the symptoms have changed, STOP. Don't go on until you are satisfied that you know the reason WHY the symptoms changed. Remember Symptoms Change For A Reason @@1.2.C1. Step 7 - REFINE the definition and REPEAT the process beginning with Step 2 - Venture a TEG. This is the last step. Append the TEG, the experiment, and the result of the experiment to the problem definition. Even if the experiment disproved the TEG, append the information to the definition. At least you know one thing that is not causing the problem. Then, once again, ask yourself: 1.
What is happening that shouldn't? 2. What is not happening that should? 3. What are the surrounding conditions? @@1.2.D1. Take your time appending the new information to the problem definition. Follow the same guidelines that you followed when you first constructed the definition; be clear, be concise, and be as accurate and complete as possible. That's one of the keys to using this problem solving approach successfully. Finally, close the loop. In other words, venture a new TEG, setup a new experiment to test the TEG, predict the result, conduct the experiment, evaluate the result, refine the definition, and continue to close the loop. Eventually, if you use this approach, one of two things will happen. Either: 1. you will identify and ultimately eliminate the problem, or 2. you will flat run out of TEGs, time, or both and end up calling for support. But even a call for support is a TEG of sorts, because you won't know until the end whether or not you needed support from the beginning. @@1.2.E1. One last word before we go on to the final summary. Earlier, we talked about Attitude vs. Approach. During the discussion the statement was made "never give up and you will never lose". The statement does not mean never call support. Support is a tool. It's there to help you do your job more efficiently. Don't be afraid to use it. But, please be prepared to describe the exact problem, what you've done, why, and what the results were. It will save a lot of time, and you will get much better service. @@1.2.F1. Never give up really means, never let a problem go without finding out what the cause was, and how the cause was finally isolated. Even if you have to leave a problem (i.e., let someone else take over) always follow up. Get back to the individual who solved the problem and find out what the cause was and how he or she arrived at that conclusion. That way, in your mind, no problem will go unsolved. And that's where the solution takes place, in the mind.
So never give up, never let a problem go unsolved, and you'll never lose. It's as simple as that. @@1.2.G1. Final Summary: 1. RESEARCH and DEFINE the problem. Find out exactly: a) What is happening that shouldn't? b) What is not happening that should? c) What are the surrounding conditions? 2. VENTURE a testable educated guess (TEG) as to the cause. 3. SETUP a practical experiment that will prove, or disprove, your TEG. 4. PREDICT the result before conducting the experiment. Know what you expect to happen. Don't leave it up to "maybe" type thinking. 5. CONDUCT the experiment (keep an accurate set of notes) If you change the system, restore it to its original state before you go on. 6. EVALUATE the result (predicted vs. actual). If the symptoms changed, they changed for a reason. Find out why before you go on. 7. REFINE the definition and REPEAT the process (beginning with step 2) Tighten the loop. It's simply a matter of time and TEGs. @@F.T.1.3. That concludes the explanation of the Formal Troubleshooting Approach. Next on the menu is the Systematic Approach. @@R.T.1.2.G1 STOP - You are moving in a reverse direction through the menu. You are about to back into the Formal Troubleshooting Approach. Your response please: @@1.3. Systematic Substitution Some old-school hard-line purist technicians may not agree, but under certain circumstances systematic substitution (of spare parts) is a perfectly valid troubleshooting approach. For example, let's assume that you are at home, working in your cellar. Furthermore, let's assume that you are using a circular saw to cut a 2 x 8 piece of oak planking. @@1.3.A. Suddenly, the saw binds, the lights dim and then they go out. From the symptoms (the lights are out) and the conditions (at the time of failure the saw was operating under a heavy load) you might logically conclude that a fuse had blown. Now, let's assume that you light a match and find your way to the fuse box.
Upon opening the box you discover a package of spare fuses and a wiring diagram of the house. You decide to light another match. This time you discover that the fuse box contains six 15 amp fuses - two rows of three fuses. At the same time, however, you also realize that the match does not provide enough light for you to determine which fuse is blown. @@1.3.B. At this point you have, roughly, six options: 1. You can stall for time hoping that the problem will disappear. This, however, is not a very practical solution because: problems don't just happen, they are caused; and, although some problems may go away temporarily, very rarely do they just disappear. Therefore, the best approach is to identify and eliminate the cause of the problem. So much for the wishful thinking approach. @@1.3.C. 2. You can call an electrician, but that could be very expensive. @@1.3.D. 3. You can go get a flashlight (if you can find one that works) and use it to identify the blown fuse. @@1.3.E. 4. You can light a couple more matches, study the wiring diagram and attempt to figure out which fuse is blown. But let's say that your ability to read an electrical wiring diagram is a bit rusty. So this could be quite time consuming, and the results are not certain. @@1.3.F. 5. You can use the spares and randomly substitute fuses until the lights come on (i.e., the symptoms go away). This is a very risky approach, however, because you could lose track of which fuses you did and did not substitute. Thus, you could accidentally overlook the blown fuse and conclude that something else was causing the problem. @@1.3.G. 6. You can use the spares and systematically substitute each fuse until the lights come on. You might choose to begin with the upper leftmost fuse and substitute left-to-right, top-to-bottom. If, in fact, the problem is being caused by a blown fuse, then sooner or later the lights will come back on. @@1.3.H. 
Now let's say that, after careful consideration of all six options, you reject wishful thinking and random substitution because they are both risky and impractical. Next, you dismiss the possibility of calling an electrician because it seems unnecessary and it could be very expensive. Finally, you eliminate using the wiring diagram to figure out exactly which fuse is blown. The idea is feasible and even tempting, but under the circumstances it's just too time consuming. Remember, you want to get the saw back on line so you can finish cutting that piece of wood. @@1.3.I. That leaves you with two options: either go get a working flashlight, or try the systematic substitution approach. If you opt for the flashlight, that's a trip upstairs, a few minutes to locate a flashlight, a trip back down cellar and 30 seconds to replace the blown fuse. Total time expended: approximately five minutes. (That does not include the time required to return the flashlight so that you can find it the next time you need it.) But suppose that instead of opting for the flashlight, you opted for the systematic substitution approach and, on the fourth try, you locate the blown fuse. Total time expended (at 30 seconds per fuse): 2 minutes with no trips involved. Not bad. @@1.3.J. One last supposition: suppose that instead of working in your cellar, you had just arrived on site. Instead of a dead power line you're faced with a failing subsystem. Instead of six fuses in a box, the subsystem consists of four modules, a cable and a power supply. Finally, instead of a box of spare fuses and a wiring diagram you have a spares kit, a scope, and a set of prints. But the diagnostics that you need are not on site. The same six options apply: 1. You can stall for time wishing the problem will go away. 2. You can call for support. 3. You can go back to the office and get the diagnostics that you need. 4. You can study the print set and try to figure out what's wrong. 5. 
You can randomly substitute the spares and hope to solve the problem. 6. You can systematically substitute the spares and quickly identify the cause. There are, however, some things that you should be aware of: @@1.3.K. A. You must approach the substitution process systematically. If you don't, you'll become confused and end up resorting to the random method of substitution. The random method is so prone to error that it's just not worth it. B. If there are more than a few modules involved, keep notes. You may not always need them, but when you do you'll find that they're worth their weight in gold. C. If you substitute a module and the problem doesn't go away, replace the original module immediately. If you don't, you'll run the risk of introducing new problems into the system. Spares tend to have a higher failure rate than modules that have been in use for a while. D. If you substitute a module and the symptoms change, STOP. Replace the original module. If the original symptoms return, then chances are you have come upon a bad spare. Try it one more time. If the results are the same, tag the spare right away. If you don't, you're likely to forget, and reliable spares are a must. @@1.3.L. E. If you substitute a module and the symptoms change, and they remain changed even after you replace the original, STOP. Chances are you inadvertently changed something and didn't realize it. Retrace every step. Symptoms change for a reason. Find the reason. Don't run the risk of compounding the problem. F. If you substitute a module and it seems to solve the problem, don't stop. Confirm the fix. Return the original module. The symptoms should reappear. If they don't, then you can't be sure that you found the problem. If they do, then you can be pretty sure that you got it. But, don't stop yet. Run the diagnostics one more time. Make sure that no new problems have crept into the system. Finally, hang around a few minutes and make sure that the equipment comes back on line ok. G. 
Back to the case where the spare seemed to correct the problem, but when you replaced the original module to confirm the fix, everything seemed to work fine. In this case you may, or may not have identified the cause of the problem. You don't know. So, leave the spare in the system, tag the suspect module as potentially intermittent, and save your notes. Such situations call for a different type of confirmation technique. @@1.3.M. The technique is called the subjective time window. To use it, you must establish a period of time during which you will monitor the problem. Usually a week is adequate, if the problem was solid. If the problem was intermittent, however, then you must determine the rate, or frequency of the failure, triple it (at least), and use that as the period of time during which you will monitor the problem. If the problem does not recur during the time window that you set up, then you can assume that you solved it. Tag the suspect module as intermittent, return it for repair, file your notes, and close out the paper work. If, however, the problem does recur, then you're all set, replace the original module, update your notes, and pick up where you left off. That's all there is to it. As some of the old school hard-liners would say; Hey, at least you know what it's not, and that's worth something. @@F.T.1.4. That concludes the explanation of the Systematic Substitution Approach. Next on the menu is the Variable Approach. @@R.T.1.3.M. STOP - You are moving in a reverse direction through the menu. You are about to back into the Systematic Substitution Approach. Your response please: @@1.4. The Variable Approach This short story was once told by a senior field service engineer to illustrate a VERY important point about using the variable approach to isolate the cause of an intermittent failure. The story is about a telephone conversation he had while working for another company (not to be mentioned). @@1.4.A. 
"At first, the diagnostic only failed every hour or two. So I performed all the standard checks and adjustments. The problem got a little worse, but it still wasn't solid. So then, I decided to vary the voltage and clock margins awhile. That helped some. I pulled out a marginal module and the symptoms changed so I knew I was getting closer. Then I thought, maybe the problem had something to do with temperature, so I blocked the fans for a few minutes. I just wanted to see if varying the temperature would have any effect. Finally, I tapped around with the back of my screw driver awhile. That really helped. I found a couple of vibrational modules. But now I seem to have a new problem - I can't even load the diagnostics. What do you think is wrong?" @@1.4.B. At that point, the senior engineer would bellow; "Now that's what I call a dumb question - Obviously the guy beat the poor thing to death." The story served its purpose. Clearly, it illustrates the problem with indiscriminately using the variable approach. That is, if you're not really careful, you're likely to cause more problems than you solve. @@1.4.C. The reason is: systems frequently operate in a controlled environment for long periods of time. As a result the environmental operating range of the system narrows. Normally, this is not a problem. As long as the environment remains relatively stable the system will run indefinitely. Keep in mind that if an intermittent problem is, in fact, being caused by an environmentally sensitive component then, just a slight variation in voltage, temperature or clock speed should be enough to aggravate it. The rule of thumb is; BE CAREFUL. @@1.4.D. After all, if you had an intermittent problem, would you want a doctor to double your heart rate in an effort to determine whether or not the problem had something to do with your circulatory system? Probably not, because a lot of working parts could get damaged in the process. 
Keep that in mind next time you use the variable approach to isolate an elusive intermittent system problem. It will, in the long run, save you a lot of unnecessary grief and irritation. And don't forget the rule of thumb: BE CAREFUL @@ts_end Well, that concludes the Troubleshooting section of the course. We hope that you found it useful. Also, if you have any comments or know of any other troubleshooting approaches that you think should be added to this section please get in touch with us. We're listed under FEEDBACK on the main course menu. Thank You @@2.0. System Event Files (Overview) Most operating systems maintain a system event file. The event file is used to record information about certain events that happen within the system (e.g., system reloads, configuration changes, hardware and software detected errors, etc.). The classification and type of information that is recorded in a system event file is unique to the operating system maintaining the event file. For example: TOPS-10 supports approximately 55 event categories. TOPS-20 supports approximately 25 event categories. VAX/VMS supports approximately 20 event categories. @@2.0.A. The event categories are listed on the back of the Spear Reference card. File Structures - There is nothing special about the file structure associated with a system event file. a. If the event file is maintained by a TOPS-10 operating system, then it conforms to the standard TOPS-10 file structure. For further information about the TOPS-10 file structure refer to The TOPS-10 Software NoteBook 17 (Monitor Table Descriptions). b. If the event file is maintained by a TOPS-20 operating system, then it conforms to the standard TOPS-20 file structure. For further information about the TOPS-20 file structure refer to The TOPS-20 Software NoteBook 16 (Monitor Table Descriptions). c. If the event file is maintained by a VAX/VMS operating system, then it conforms to the standard VAX/VMS file structure. 
For further information about the VAX/VMS file structure refer to The VAX/VMS Software Support Notebook. @@R.T.2.0.A. STOP - You are moving in a reverse direction through the menu. You are about to back into the System Event File Overview. @@2.M. System Event Files Topic Menu: 1. Overview 2. TOPS-10 System Event Files 3. TOPS-20 System Event Files 4. VAX/VMS System Event Files 5. DEFINE.LIS @@define_lst DEFINE.LIS is a text file that describes the hardware and/or software status that is saved for each entry type in both the TOPS-10 and the TOPS-20 system event file. DEFINE.LIS is normally stored in the system documentation area. To obtain a copy of the file type: PRINT DEFINE.LIS If DEFINE.LIS is not in the system documentation area you can get a copy from the Spear distribution tape. There are two procedures; one for TOPS-10, the other for TOPS-20. @@pri_define_tops_10 TOPS-10 procedure to copy DEFINE.LIS from the Spear tape to your area. Assign a magtape (xxx), mount the Spear tape, run BACKUP, and type: /TAPE MTxxx: /REWIND /INTERCHANGE /FILES /SUPERSEDE ALWAYS /SKIP 1 Note: BACKUP will print "DONE" and reprompt. Type: /RESTORE DEFINE.LIS = DEFINE.LIS Note: BACKUP will print the following message and reprompt. Type: ! "DEFINE LST" "DONE" /UNLOAD /EXIT Note: Remove and return the Spear distribution tape. Then type: PRINT DEFINE.LIS @@pri_define_tops_20 TOPS-20 procedure to copy DEFINE.LIS from the Spear tape to your area. Assign a magtape (xxx), mount the Spear tape, run DUMPER, and type: DUMPER> TAPE MTxxx: DUMPER> REWIND DUMPER> INTERCHANGE DUMPER> FILES DUMPER> SUPERSEDE ALWAYS DUMPER> SKIP 1 Note: DUMPER will print two information messages and reprompt. Type: DUMPER> RESTORE PS:<*>DEFINE.LIS PS: Note: DUMPER will print the following message and reprompt. Type: % RESTORING FILES TO PS: PS:<*>DEFINE.LIS => DEFINE.LIS [OK] DUMPER> UNLOAD Note: Remove and return the Spear distribution tape. 
Then type: PRINT DEFINE.LIS @@tops_10_ef TOPS-10 System Event Files This section of Instruct consists of a series of questions that pertain to the TOPS-10 System Event File (ERROR.SYS). Before you attempt to answer the questions you should review Chapter 2 of the Spear Manual. Don't forget, you can use the /BREAK feature and return via your ID. @@tops_10_ef_a Press the RETURN key when you are ready. @@tops_10_ef_q1 TOPS-10 System Event Files - Q1 of 10 True or False - Many of the questions that pertain to the TOPS-10 system event file also pertain to the TOPS-20 system event file. @@tops_10_ef_q1_at That's correct. In fact, the questions are practically identical. In many cases so are the answers. Therefore, if you have already answered the questions as they pertain to the TOPS-20 system event file, then you can probably afford to skip this section of Instruct. Of course, on the other hand, you may want to answer the questions anyway. If that's the case, then don't be confused by the redundancy. @@tops_10_ef_q1_af The statement is TRUE. The TOPS-10 system event file (ERROR.SYS) and the TOPS-20 system event file (ERROR.SYS) are very similar. Therefore, it stands to reason that many of the questions that pertain to one event file will also pertain to the other event file. @@tops_10_ef_q2 TOPS-10 System Event Files - Q2 of 10 True or False - The TOPS-10 System Event File is called ERROR.SYS. @@tops_10_ef_q2_at That's correct. Both the TOPS-10 and the TOPS-20 system event file are called ERROR.SYS. The VAX/VMS system event file is called ERRLOG.SYS. @@tops_10_ef_q2_af The statement is TRUE. The idea of a system event file (ERROR.SYS) was first implemented in the early 1970's for TOPS-10. Initially, the file was used only to record main memory, channel, and disk errors. The idea proved to be a good one and new entries were added to the file until now ERROR.SYS is the main source of information for solving intermittent system failures. 
In the mid 1970's the idea of a system event file along with the file name ERROR.SYS was carried over to TOPS-20. Thus, both the TOPS-10 and the TOPS-20 system event file are called ERROR.SYS. @@tops_10_ef_q3 TOPS-10 System Event Files - Q3 of 10 True or False - Prior to the Spear library, TOPS-10 used a program called SYSERR to record entries in the system event files. @@tops_10_ef_q3_at The statement is FALSE. Neither SYSERR nor the Spear library has anything to do with the recording of entries in the system event file. That is strictly a function of the operating system. Both SYSERR and the Spear library are designed to process the contents of the system event file. SYSERR was a report generator. Basically, it allowed the user to select and translate specific entries in the event file. The SPEAR library (SYSERR's replacement) is more sophisticated. In addition to translating event file entries it also attempts to localize the cause of intermittent disk and tape subsystem failures. Note, however, that neither SYSERR nor Spear has anything to do with recording the system event file. @@tops_10_ef_q3_af That is correct. Both SYSERR and its replacement, the SPEAR library, are designed to process the contents of the system event file. They have nothing to do with recording the entries. That is strictly a function of the operating system. @@tops_10_ef_q4 TOPS-10 System Event Files - Q4 of 10 True or False - All hardware detected failures are recorded in the system event file. @@tops_10_ef_q4_at The statement is FALSE. Only failures that require operating system intervention are recorded in the system event file. Failures that do not require operating system intervention are not recorded in the event file. For example, some subsystems attempt error recovery locally. In most cases, if the recovery is successful then the operating system is not notified. Thus, those kinds of errors are normally not recorded in the system event file. @@tops_10_ef_q4_af That's correct. 
Only errors that require operating system intervention are recorded in the system event file. @@tops_10_ef_q5 TOPS-10 System Event Files - Q5 of 10 True or False - Every record in a TOPS-10 system event file consists of a header section and a body section. @@tops_10_ef_q5_at That's correct. Furthermore, the header and body section of each entry type is described in a file called DEFINE.LIS. To obtain a copy of DEFINE.LIS, refer to Appendix A on the Event File Menu. @@tops_10_ef_q5_af The statement is TRUE. Each entry in the TOPS-10 system event file consists of a header section and a body section. The header identifies the entry type (i.e., event code), the date and time that the entry was recorded, the processor serial number, the length of the header section and the length of the body section. Currently, the header section is set at four words; the body section varies in size depending on the type of entry. @@tops_10_ef_q6 TOPS-10 System Event Files - Q6 of 10 True or False - Each record in a TOPS-10 system event file represents one complete system event. @@tops_10_ef_q6_at The statement is true with one exception, KLERR. KLERR entries are built by the console front-end whenever the KL10 crashes. When the system is restarted the entry is transferred via the DTE to KL main memory and then recorded in the system event file. Because the buffer area set aside for communications between the console and KL main memory is significantly smaller than a typical KLERR entry, the entry is divided into segments. Each segment is given a unique sequence number and recorded as a separate record in the event file. Technically, therefore, the statement is FALSE. @@tops_10_ef_q6_af That's correct. The KLERR entry consists of multiple records. Each record has a separate sequence number. When a KLERR entry is translated, however, only the first sequence number is used to identify the entry. The other sequence numbers are masked-out to avoid confusion. 
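The way a segmented KLERR entry is put back together can be pictured with a small sketch. It is illustrative only and is not how Spear is actually written: the record layout (a dictionary with "type", "seq" and "data" keys) and the function name group_klerr are hypothetical, and the sketch assumes that the segments of one KLERR entry are recorded back to back in the event file.

```python
def group_klerr(records):
    """Merge consecutive KLERR segment records into single entries.

    Assumption (not from the Spear Manual): all segments of one KLERR
    entry are adjacent in the file. Each merged entry keeps only the
    sequence number of its first segment; the rest are dropped.
    """
    entries = []
    current = None  # the KLERR entry being assembled, if any
    for rec in records:
        if rec["type"] == "KLERR":
            if current is None:
                # First segment: its sequence number identifies the entry.
                current = {"type": "KLERR", "seq": rec["seq"],
                           "data": list(rec["data"])}
                entries.append(current)
            else:
                # Later segment: append the data, ignore its sequence number.
                current["data"].extend(rec["data"])
        else:
            current = None  # any other record type ends the KLERR run
            entries.append({"type": rec["type"], "seq": rec["seq"],
                            "data": list(rec["data"])})
    return entries


# Hypothetical event file: one disk error, then a KLERR entry in 3 segments.
sample = [
    {"type": "DISK", "seq": 100, "data": [1]},
    {"type": "KLERR", "seq": 101, "data": [10, 11]},
    {"type": "KLERR", "seq": 102, "data": [12, 13]},
    {"type": "KLERR", "seq": 103, "data": [14]},
    {"type": "RELOAD", "seq": 104, "data": []},
]
merged = group_klerr(sample)
```

In this sketch the three KLERR records come back as one entry identified by sequence number 101. Two KLERR entries recorded back to back would be merged into one, so a real translator would need some other way to tell where one entry ends.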
@@tops_10_ef_q7 TOPS-10 System Event Files - Q7 of 10 True or False - The synchronization word is used to recover from hard read errors that occur while reading the system event file. @@tops_10_ef_q7_at That's correct. Whenever Spear uses the synchronization word to recover from a hard read error it will print the message "Bad header found - RESYNCing". @@tops_10_ef_q7_af The statement is TRUE. The first word in each system event file data block is a synchronization pointer. The pointer points to the starting location of the next record in the file. Thus, if a hard read error occurs while reading a record, Spear skips to the next data block, reads the sync word, finds the starting location of the next record, and continues reading the file. The idea of adding a synchronization word to each data block in a system event file was incorporated in the mid 1970's. Prior to that time, if a hard read error occurred while reading the event file, the remaining records in the file were lost. Now only the records affected by the read error are lost. @@tops_10_ef_q8 TOPS-10 System Event Files - Q8 of 10 True or False - When the TOPS-10 operating system detects a device error the following occurs: 1. Normal operation is suspended and applicable hardware and/or software status is captured (at error) and saved in the Unit Data Block (UDB). 2. If applicable, an error recovery algorithm is applied. 3. Regardless of whether the recovery algorithm is successful or not, the applicable hardware and/or software status is captured again (at end) and appended to the UDB. 4. The error status stored in the UDB is formatted, assigned a sequence number, and appended to the system event file. 5. If the system was able to recover from the error, normal operation continues. If, however, the system was unable to recover from the error, then the job affected by the error is notified and it handles the error. @@tops_10_ef_q8_at That's correct. 
The action outlined in the question is typical of the way TOPS-10 handles most device errors. Non-device errors (e.g., CPU errors) and errors that affect the operating system itself are also handled in a similar manner. If, however, there is no recovery algorithm or if the recovery algorithm is unsuccessful, then those errors may result in a user job or system crash. @@tops_10_ef_q8_af The statement is TRUE. Most TOPS-10 device errors are handled this way. @@tops_10_ef_q9 TOPS-10 System Event Files - Q9 of 10 True or False - The exact content and format of each TOPS-10 event record is described in the Spear Manual. @@tops_10_ef_q9_at The statement is FALSE. The Spear Manual does describe the report formats generated by Retrieve, but it does not describe the content and format of the actual event records. @@tops_10_ef_q9_af That's correct. The event records are described in a file called DEFINE.LIS. @@tops_10_ef_q10 TOPS-10 System Event Files - Q10 of 10 True or False - The fifth word in a 011 type record is used to save the results of the DATAI performed at the time of the failure. Note: Refer to DEFINE.LIS. If you do not have a copy of DEFINE.LIS and you want one, refer to Appendix A on the Event File Menu. @@tops_10_ef_q10_at The statement is FALSE. Open DEFINE.LIS to the 011 entry. It starts some place around line number 00450. The line numbers are listed at the left of the page. Now skipping over the word, byte, and bit definitions, go down the center, or word number column, until you get to word number 5. To the left you will see that word number 5 is defined as CONI_INITIAL. To the right you will see that CONI_INITIAL is described as "controller status at error". Now find word 16. You will see that it is defined as "RH_DATA_BAR_ERR", and described as: DATAI from RH10 block address register at error time. @@tops_10_ef_q10_af That's correct. Word 5 is used to save the CONI status word. The DATAI status is saved in word 16. 
Anytime you want to know exactly what hardware and software status is saved in an entry type you can consult DEFINE.LIS. Now, if you haven't already done so, take a few minutes to look over the contents of the file. The introduction explains the overall organization and format of an event file record. Following the introduction, each of the event types is described in detail. When you are finished, take a few more minutes and compare the reports listed in the Spear Manual with the corresponding record descriptions listed in DEFINE.LIS. As a result, you will have a better understanding of the system event file and the reports that are generated from it. @@tops_10_ef_lq That's it. There are only 10 questions about TOPS-10 System Event Files. Press the RETURN key to return to the System Event File Menu. @@tops_20_ef TOPS-20 System Event Files This section of Instruct consists of a series of questions that pertain to the TOPS-20 System Event File (ERROR.SYS). Before you attempt to answer the questions you should review Chapter 2 of the Spear Manual. Don't forget, you can use the /BREAK feature and return via your ID. @@tops_20_ef_a Press the RETURN key when you are ready. @@tops_20_ef_q1 TOPS-20 System Event Files - Q1 of 10 True or False - Many of the questions that pertain to the TOPS-10 system event file also pertain to the TOPS-20 system event file. @@tops_20_ef_q1_at That's correct. In fact, the questions are practically identical. In many cases so are the answers. Therefore, if you have already answered the questions as they pertain to the TOPS-10 system event file, then you can probably afford to skip this section of Instruct. Of course, on the other hand, you may want to answer the questions anyway. If that's the case, then don't be confused by the redundancy. @@tops_20_ef_q1_af The statement is TRUE. The TOPS-10 system event file (ERROR.SYS) and the TOPS-20 system event file (ERROR.SYS) are very similar. 
Therefore, it stands to reason that many of the questions that pertain to one event file will also pertain to the other event file. @@tops_20_ef_q2 TOPS-20 System Event Files - Q2 of 10 True or False - The TOPS-20 System Event File is called ERROR.SYS. @@tops_20_ef_q2_at That's correct. Both the TOPS-10 and the TOPS-20 system event file are called ERROR.SYS. The VAX/VMS system event file is called ERRLOG.SYS. @@tops_20_ef_q2_af The statement is TRUE. The idea of a system event file (ERROR.SYS) was first implemented in the early 1970's for TOPS-10. Initially, the file was used only to record main memory, channel, and disk errors. The idea proved to be a good one and new entries were added to the file until now ERROR.SYS is the main source of information for solving intermittent system failures. In the mid 1970's the idea of a system event file along with the file name ERROR.SYS was carried over to TOPS-20. Thus, both the TOPS-10 and the TOPS-20 system event file are called ERROR.SYS. @@tops_20_ef_q3 TOPS-20 System Event Files - Q3 of 10 True or False - Prior to the Spear library, TOPS-20 used a program called SYSERR to record entries in the system event files. @@tops_20_ef_q3_at The statement is FALSE. Neither SYSERR nor the Spear library has anything to do with the recording of entries in the system event file. That is strictly a function of the operating system. Both SYSERR and the Spear library are designed to process the contents of the system event file. SYSERR was a report generator. Basically, it allowed the user to select and translate specific entries in the event file. The SPEAR library (SYSERR's replacement) is more sophisticated. In addition to translating event file entries it also attempts to localize the cause of intermittent disk and tape subsystem failures. Note, however, that neither SYSERR nor Spear has anything to do with recording the system event file. @@tops_20_ef_q3_af That is correct. 
Both SYSERR and its replacement, the SPEAR library, are designed to process the contents of the system event file. They have nothing to do with recording the entries. That is strictly a function of the operating system. @@tops_20_ef_q4 TOPS-20 System Event Files - Q4 of 10 True or False - All hardware detected failures are recorded in the system event file. @@tops_20_ef_q4_at The statement is FALSE. Only failures that require operating system intervention are recorded in the system event file. Failures that do not require operating system intervention are not recorded in the event file. For example, some subsystems attempt error recovery locally. In most cases, if the recovery is successful then the operating system is not notified. Thus, those kinds of errors are normally not recorded in the system event file. @@tops_20_ef_q4_af That's correct. Only errors that require operating system intervention are recorded in the system event file. @@tops_20_ef_q5 TOPS-20 System Event Files - Q5 of 10 True or False - Every record in a TOPS-20 system event file consists of a header section and a body section. @@tops_20_ef_q5_at That's correct. Furthermore, the header and body section of each entry type is described in a file called DEFINE.LIS. To obtain a copy of DEFINE.LIS, refer to Appendix A on the Event File Menu. @@tops_20_ef_q5_af The statement is TRUE. Each entry in the TOPS-20 system event file consists of a header section and a body section. The header identifies the entry type (i.e., event code), the date and time that the entry was recorded, the processor serial number, the length of the header section and the length of the body section. Currently, the header section is set at four words; the body section varies in size depending on the type of entry. @@tops_20_ef_q6 TOPS-20 System Event Files - Q6 of 10 True or False - Each record in a TOPS-20 system event file represents one complete system event. @@tops_20_ef_q6_at The statement is true with one exception, KLERR. 
KLERR entries are built by the console front-end whenever the KL10 crashes. When the system is restarted the entry is transferred via the DTE to KL main memory and then recorded in the system event file. Because the buffer area set aside for communications between the console and KL main memory is significantly smaller than a typical KLERR entry, the entry is divided into segments. Each segment is given a unique sequence number and recorded as a separate record in the event file. Technically, therefore, the statement is FALSE. @@tops_20_ef_q6_af That's correct. The KLERR entry consists of multiple records. Each record has a separate sequence number. When a KLERR entry is translated, however, only the first sequence number is used to identify the entry. The other sequence numbers are masked-out to avoid confusion. @@tops_20_ef_q7 TOPS-20 System Event Files - Q7 of 10 True or False - The synchronization word is used to recover from hard read errors that occur while reading the system event file. @@tops_20_ef_q7_at That's correct. Whenever Spear uses the synchronization word to recover from a hard read error it will print the message "Bad header found - RESYNCing". @@tops_20_ef_q7_af The statement is TRUE. The first word in each system event file data block is a synchronization pointer. The pointer points to the starting location of the next record in the file. Thus, if a hard read error occurs while reading a record, Spear skips to the next data block, reads the sync word, finds the starting location of the next record, and continues reading the file. The idea of adding a synchronization word to each data block in a system event file was incorporated in the mid 1970's. Prior to that time, if a hard read error occurred while reading the event file, the remaining records in the file were lost. Now only the records affected by the read error are lost. 
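The resynchronization idea can be sketched in a few lines of code. This is a toy model, not the ERROR.SYS format: the block size of six words, the one-word length prefix on each record, the use of None to stand for a block lost to a hard read error, and every name in the sketch are assumptions made for illustration. The sync word is taken to be the offset, within its own block, of the first record that starts in that block (0 if none does).

```python
BLOCK = 6  # words per data block in this toy layout (hypothetical)


class HardReadError(Exception):
    """Raised when a needed word lies in an unreadable block."""


def next_word(blocks, blk, off):
    """Return (word, blk, off) for the next data word, skipping sync words."""
    if off >= BLOCK:                  # ran off the block: step past next sync word
        blk, off = blk + 1, 1
    if blk >= len(blocks):
        raise EOFError
    if blocks[blk] is None:           # block lost to a hard read error
        raise HardReadError(blk)
    return blocks[blk][off], blk, off + 1


def resync(blocks, blk):
    """Find the first readable block at or after blk in which a record starts."""
    while blk < len(blocks):
        if blocks[blk] is not None and blocks[blk][0] != 0:
            return blk, blocks[blk][0]  # sync word = offset of that record
        blk += 1
    return None


def read_event_file(blocks):
    """Return (records, resync_count), skipping records hit by read errors."""
    records, resyncs = [], 0
    pos = resync(blocks, 0)
    while pos is not None:
        blk, off = pos
        try:
            length, blk, off = next_word(blocks, blk, off)
            if length == 0:           # zero length marks padding: end of data
                break
            body = []
            for _ in range(length):
                word, blk, off = next_word(blocks, blk, off)
                body.append(word)
            records.append(body)
            pos = (blk, off)
        except HardReadError as err:
            resyncs += 1              # Spear would report "RESYNCing" here
            pos = resync(blocks, err.args[0] + 1)
        except EOFError:
            break
    return records, resyncs


# Four records A, B, C, D packed into three blocks; B spans blocks 0 and 1.
good = [
    [1, 2, "a1", "a2", 4, "b1"],
    [4, "b2", "b3", "b4", 2, "c1"],
    [2, "c2", 1, "d1", 0, 0],
]
damaged = [good[0], None, good[2]]    # block 1 is unreadable
```

With the undamaged file all four records are read. With block 1 unreadable, record B (which runs into the bad block) and record C (which starts in it) are lost, and reading picks up again at record D by way of the sync word in block 2.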
@@tops_20_ef_q8 TOPS-20 System Event Files - Q8 of 10 True or False - When the TOPS-20 operating system detects a device error the following occurs: 1. Normal operation is suspended and applicable hardware and/or software status is captured (at error) and saved in a buffer. 2. If applicable, an error recovery algorithm is applied. 3. Regardless of whether the recovery algorithm is successful or not, the applicable hardware and/or software status is captured again (at end) and appended to the buffer. 4. The contents of the buffer are formatted, assigned a sequence number, and appended to the system event file. 5. If the system was able to recover from the error, normal operation continues. If, however, the system was unable to recover from the error, then the job affected by the error is notified and it handles the error. @@tops_20_ef_q8_at That's correct. The action outlined in the question is typical of the way TOPS-20 handles most device errors. Non-device errors (e.g., CPU errors) and errors that affect the operating system itself are also handled in a similar manner. If, however, there is no recovery algorithm or if the recovery algorithm is unsuccessful, then those errors may result in a user job or system crash. @@tops_20_ef_q8_af The statement is TRUE. Most TOPS-20 device errors are handled this way. @@tops_20_ef_q9 TOPS-20 System Event Files - Q9 of 10 True or False - The exact content and format of each TOPS-20 event record is described in the Spear Manual. @@tops_20_ef_q9_at The statement is FALSE. The Spear Manual does describe the report formats generated by Retrieve, but it does not describe the content and format of the actual event records. @@tops_20_ef_q9_af That's correct. The event records are described in a file called DEFINE.LIS. To obtain a copy of DEFINE.LIS refer to Appendix A on the Event File Menu. P.S. You will need a copy of DEFINE.LIS to answer the next question. 
@@tops_20_ef_q10 TOPS-20 System Event Files - Q10 of 10 True or False - The thirty-second word in a 111 type record is used to save the first channel control word. Note: Refer to DEFINE.LIS. If you do not have a copy of DEFINE.LIS and you want one, refer to Appendix A on the Event File Menu. @@tops_20_ef_q10_at The statement is FALSE. Open the DEFINE.LIS to the 111 entry. It starts some place around line number 01320. The line numbers are listed at the left of the page. Now skipping over the word, byte, and bit definitions, go down the center, or word number column, until you get to word number 32. To the left you will see that word number 32 is defined as RETRY_CNT. To the right you will see that the RETRY_CNT is saved in bits 18 through 35 of the word and it is described as "final retry error count". Now find word number 28. You will see that it is defined as CCW1, it consists of 36 bits, and it is described as "first chan control word". @@tops_20_ef_q10_af That's correct. Word 32 is used to save the error retry count. The first channel control word is saved in word 28. Anytime you want to know exactly what hardware and software status is saved in an entry type you can consult DEFINE.LIS. Now, if you haven't already done so, take a few minutes to look over the contents of the file. The introduction explains the overall organization and format of an event file record. Following the introduction, each of the event types is described in detail. When you are finished, take a few more minutes and compare the reports listed in the Spear Manual with the corresponding record descriptions listed in DEFINE.LIS. As a result, you will have a better understanding of the system event file and the reports that are generated from it. @@tops_20_ef_lq That's it. There are only 10 questions about TOPS-20 System Event Files. Press the RETURN key to return to the System Event File Menu.
@@vax_vms_ef You do not need to understand file structures to use System Event Files to isolate system failures. However, in order to be effective you should understand something about their format and content. Chapter 5 of the VAX11 Spear Manual describes the overall format and content. Appendix B of the VAX11 Spear Manual describes, in detail, the content of each record type that you will find in the system event file. This section of Instruct consists of a series of general and specific questions about the VAX/VMS System Event File (ERRLOG.SYS). Before you attempt to answer the questions you should review Chapter 5 and Appendix B in the Spear Manual. (Don't forget, you can use the /BREAK feature and return via your student ID.) @@ @@vax_vms_ef_a Press the RETURN key when you are ready. @@ @@vax_vms_ef_q1 Q1 of 10 (VAX/VMS System Event Files) True or False - Several of the questions that pertain to the VAX/VMS system event file also pertain to the TOPS-20 system event file. @@ @@vax_vms_ef_q1_at That's correct. @@ @@vax_vms_ef_q1_af The statement is TRUE. The VAX/VMS system event file (ERRLOG.SYS) and the TOPS-20 system event file (ERROR.SYS) are very similar in concept. Therefore, it stands to reason that many of the questions that pertain to one event file will also pertain to the other event file. @@ @@vax_vms_ef_q2 Q2 of 10 (VAX/VMS System Event Files) True or False - In addition to the Spear library, VAX/VMS uses a program called SYE to record entries in the system event files. @@ @@vax_vms_ef_q2_at VAX/VMS Q2 The statement is FALSE. Neither SYE nor the Spear library have anything to do with the recording of entries in the system event file. That is strictly a function of the operating system. Both SYE and the Spear library are designed to process the contents of the system event file. SYE is a report generator. Basically, it allows the user to select and translate specific entries in the event file. The SPEAR library is more sophisticated.
In addition to translating event file entries it also attempts to localize the cause of intermittent disk and tape subsystem failures. Note however that neither SYE nor Spear have anything to do with recording the system event file. @@ @@vax_vms_ef_q2_af VAX/VMS Q2 That is correct. Both SYE and the SPEAR library are designed to process the contents of the system event file. They have nothing to do with recording the entries. That is strictly a function of the operating system. @@ @@vax_vms_ef_q3 Q3 of 10 (VAX/VMS System Event Files) True or False - The VAX/VMS System Event File is called ERRLOG.SYS. @@ @@vax_vms_ef_q3_at VAX/VMS Q3 That's correct. The VAX/VMS system event file is called ERRLOG.SYS. Both the TOPS-10 and the TOPS-20 system event file are called ERROR.SYS. @@ @@vax_vms_ef_q3_af VAX/VMS Q3 The statement is TRUE. The idea of a system event file (ERROR.SYS) was first implemented in the early 1970's for TOPS-10. Initially, the file was used only to record main memory, channel, and disk errors. The idea proved to be a good one and new entries were added to the file until now ERROR.SYS is the main source of information for solving intermittent system failures. In the mid 1970's the idea of a system event file along with the file name ERROR.SYS was carried over to TOPS-20. Thus, both the TOPS-10 and the TOPS-20 system event file are called ERROR.SYS. @@ @@vax_vms_ef_q4 Q4 of 10 (VAX/VMS System Event Files) True or False - More than one process may do read access on the error file at the same time. @@ @@vax_vms_ef_q4_at VAX/VMS Q4 That's correct. More than one process may read the file at the same time. @@ @@vax_vms_ef_q4_af VAX/VMS Q4 The statement is TRUE. The problem arises when the operating system tries to write to the file and finds some other process reading the file. In this case, the operating system creates a new file.
@@ @@vax_vms_ef_q5 Q5 of 10 (VAX/VMS System Event Files) True or False - All I/O device errors are logged under the device error record format regardless of the type of device. @@ @@vax_vms_ef_q5_at VAX/VMS Q5 That's correct. CPU and memory errors are recorded in different record formats, but I/O device errors are not. @@ @@vax_vms_ef_q5_af VAX/VMS Q5 The statement is TRUE. Only TOPS-10 and TOPS-20 use different record formats for different types of I/O devices. @@ @@vax_vms_ef_q6 Q6 of 10 (VAX/VMS System Event Files) True or False - Only device errors and other hardware detected errors are recorded in the VMS system error file. @@ @@vax_vms_ef_q6_at VAX/VMS Q6 The statement is FALSE. Many other types of information are also recorded in the error file such as volume mounts and dismounts. Software detected errors are also recorded in this file as well as text messages from the operator. @@ @@vax_vms_ef_q6_af VAX/VMS Q6 That's correct. There are many other sources of the information found in the error file. @@ @@vax_vms_ef_q7 Q7 of 10 (VAX/VMS System Event Files) True or False - The format of the device error entry is the same regardless of the type of VAX CPU used in the system. @@ @@vax_vms_ef_q7_at VAX/VMS Q7 That's correct. Only the CPU specific entries are different. @@ @@vax_vms_ef_q7_af VAX/VMS Q7 The statement is TRUE. Only the CPU specific entries are different. @@ @@vax_vms_ef_q8 Q8 of 10 (VAX/VMS System Event Files) True or False - If the operating system must create a new version of the error file, ERRLOG.SYS, it renames the current version to ERRLOG.OLD and then creates the new file. @@ @@vax_vms_ef_q8_at VAX/VMS Q8 The statement is FALSE. The operating system will create a new file using the same name and the next higher version number. @@ @@vax_vms_ef_q8_af VAX/VMS Q8 That's correct. The convention of renaming the error file to ERRLOG.OLD has nothing to do with the operating system.
@@ @@vax_vms_ef_q9 Q9 of 10 (VAX/VMS System Event Files) True or False - The media identification is not included as part of the information recorded in a device error. @@ @@vax_vms_ef_q9_at VAX/VMS Q9 That's correct. The media information is recorded in the system event file when the media is mounted or dismounted. @@ @@vax_vms_ef_q9_af VAX/VMS Q9 The statement is TRUE. The media information is recorded in the system event file when the media is mounted or dismounted. @@ @@vax_vms_ef_q10 Q10 of 10 (VAX/VMS System Event Files) True or False - Some device error records in the event file may have no apparent indication of any error occurring. @@ @@vax_vms_ef_q10_at VAX/VMS Q10 That's correct. Media off line is a good example. In this case the "on-line" bit would be off indicating the error. @@ @@vax_vms_ef_q10_af VAX/VMS Q10 The statement is TRUE. Media off line is a good example. In this case the "on-line" bit would be off indicating the error. @@ @@vax_vms_ef_lq That's it. There are only 10 questions about VAX/VMS System Event Files. Press the RETURN key to get back to the System Event File Menu. @@3.0. Spear Library Introduction Spear is an on-line maintenance software library that runs under three operating systems: TOPS-10, TOPS-20, and VAX/VMS. Currently, the library contains three functions: Summarize, Retrieve, and Compute. The first two functions, Summarize and Retrieve, are designed to help you sort and evaluate 32- and 36-bit system event files. The third function, Compute, calculates system availability. Its purpose is to help you prepare crash and up time reports and determine overall system performance. @@3.0.A. Each Spear Library function supports a dialog style user interface. The dialog prompts for information and waits for a response. If the prompt accepts a default, the default will be (parenthetically) included as part of the prompt. @@R.T.3.0. STOP - You are moving in a reverse direction through the menu.
You are about to back into the Spear Library Introduction. Your response please: @@R.T.3.1.0.B. STOP - You are moving in a reverse direction through the menu. You are about to back into the Introduction @@R.T.3.1.1.F. STOP - You are moving in a reverse direction through the menu. You are about to back into @@M. Spear Course Menu 1. Course Administrator/Student Guide 2. Troubleshooting 3. System Event Files 4. Using The Spear Library 5. Guaranteed Uptime Program/NOTIFY 6. Feedback 7. Random Questions 8. Dialog Changes @@R.T.3.1.2.M. STOP - You are moving in a reverse direction thru the menu. You are about to back into the Menu. Your response please: @@3.2.0. The Big Picture Input File : Retrieve accepts event ..........:......... files and packet files. : : Event File Packet File The Selected information .....:..... : in an event file can be: : : : Included in, or Excluded Include Exclude Packet from, the output file. :.........: Numbers : : One or more Packets can Selection and : be selected from a Packet Time Criteria : file. :..................: : Output Mode .....:..... Retrieve can translate the : : selected entries or it can ASCII Binary save the selected entries :.........: in a binary history file. : Output File @@R.T.3.2.0. STOP - You are moving in a reverse direction thru the menu. You are about to back into the Retrieve Overview. @@3.2.M. Spear Library - Retrieve Topic menu: 1. Overview 2. Retrieve Dialog 3. Retrieve Questions & Answers @@3.2.1. The basic Retrieve dialog consists of eight selection prompts and one confirmation prompt. RETRIEVE mode ------------- Event or packet file (default): Selection to be (INCLUDED): Selection type (ALL): Time from (EARLIEST): Time to (LATEST): Output mode (ASCII): Report format (SHORT): Output to ([DSK]:RETRIE.RPT): Type [cr] to confirm (/GO): @@3.2.1.A. The first selection prompt: Event or packet file (default): allows you to specify the name of the input file.
The default response (SYS:ERROR.SYS for TOPS-10, SERR:ERROR.SYS for TOPS-20, and SYS$ERRORLOG:ERRLOG.SYS for VAX/VMS) is enclosed in parentheses and can be selected by pressing the RETURN key. Retrieve accepts two types of files: standard system event files (such as those generated by TOPS-10, TOPS-20, or VAX/VMS systems), and Packet files. If you specify a system event file, Retrieve will continue with the basic dialog. If you specify a Packet file, however, Retrieve will switch to the Packet selection dialog. Since the Packet dialog is short (1 prompt) it will be explained next. Then we will continue with the basic dialog. This prompt also supports standard Help and question mark (?) responses. @@RETRIEVE INPUT Selection Criteria ___________. .___ Short Report .________!_________. !___ Full Report Event File ___. ! Event Retrieval ! !___ Raw Data Report !___! Translation !___! Packet File ___! ! and/or Storage ! ! !__________________! !___ Device History Merge File (binary) ___________! Files (binary) Retrieve can be used to generate reports, or it can be used to establish and maintain device history files. If you choose to generate a report, you can select one of three formats: Short, Full, or Octal (Hexadecimal on VAX/VMS systems). If you choose to generate a device history file you will be asked if you want to merge it with an existing (history) file. @@3.2.1.B. If you specify a Packet file at the input file prompt, Retrieve will prompt you for the packet numbers that you want to select. Event or packet file (SERR:ERROR.SYS): DSK:A1225.PAK Packet numbers: Each numbered packet contains a list of sequence numbers. The sequence numbers identify the individual records that were used by Analyze as evidence to support the theories listed in the corresponding Analyze Report file. There is one packet for each theory listed in the report. You can use Retrieve to translate (or save in a separate binary file) the records listed in the packet files.
Typically, you would translate a packet if you wanted to examine the records that were used as evidence to support a particular theory. You would save the records if you were building or maintaining a history file for a particular device or a specific type of error. This prompt also supports standard Help and Question mark (?) responses. @@3.2.1.C. If you specify multiple packet numbers, each number should be separated by a comma. You should realize, however, that if you specify more than one packet number the records listed in the packets will be grouped and translated (or saved) according to sequence numbers. In other words, the records will not be grouped according to packet number. After prompting for packet numbers, Retrieve will skip the "Time from" and "Time to" prompts and pick up the basic dialog at the "Output mode" prompt. From that point on, there is no difference between the Event File dialog and the Packet File dialog. Event or packet file (SERR:ERROR.SYS): DSK:A1225.PAK Packet numbers: 3,7,14 Output mode (ASCII): Report format (SHORT): Output to ([DSK]:RETRIE.RPT): Type [cr] to confirm (/GO): @@3.2.1.D. Back to the basic Retrieve dialog. The second selection prompt: Event or packet file (default): Selection to be (INCLUDED):? INCLUDED EXCLUDED allows you to specify whether the selected entries will be included in, or excluded from, the output file. Included is the normal response. If, however, you specify Excluded, then ALL the entries in the input file (except those that you select later in the dialog) will be extracted and translated or saved in the output file. The Exclude feature is used to purge entries from a system event file before the file is translated or saved. For example, suppose a communications node developed a problem that caused the event file to fill up with Network entries; since you know what caused the problem you might want to remove the entries before you process or save the file.
Note: the original (or input) file will not be altered in any way. This prompt also supports standard Help and Question mark (?) responses. @@RETRIEVE TYPE The following example illustrates the difference between Include and Exclude. Include(event type C) Exclude(event type C) Time: From To From To : : : : Input file: CABBACBCCAABCABBCAACCBCA CABBACBCCAABCABBCAACCBCA Output file: CC C C CABBACB AAB ABB AACCBCA @@3.2.1.E. The third selection prompt asks you to choose from two separate lists. Selection type (ALL): Type one or more of the following from the first group: ERROR STATISTICS DIAGNOSTICS CONFIGURATION OTHER If you choose more than one of these types, separate each with a comma. Or, type one of the following from the second group: the RETURN key, or ALL SEQUENCE CODE @@3.2.1.EA. ERROR - indicates that you want to select entries that contain actual failure data. If you select ERROR you can also specify the particular error types for which you are looking in relation to the specific device. STATISTICS - indicates that you want to select statistic entries. DIAGNOSTICS - indicates that you want to select entries created by a diagnostic. CONFIGURATION - indicates that you want to select configuration entries. OTHER - indicates that you want to select entries that do not fit into the other types. These responses will be explained later, after the frames relating to SEQUENCE and CODE. @@3.2.1.EB. ALL (or the RETURN key) - indicates that you want to select all the entries in the file. (This is the default). You can further qualify the selection at the Time prompts. SEQUENCE - indicates that you want to select entries according to sequence numbers. This response will be explained next. CODE - indicates that you want to select entries based on the event codes assigned each type of entry by the operating system. This response will be explained after the SEQUENCE response. @@3.2.1.EC. 
When you specify SEQUENCE in response to the "Selection type" prompt, Retrieve will prompt you for the sequence numbers that you want to select. Selection type (ALL): SEQUENCE Sequence numbers: 22,24,35-67,12 You can select as many sequence numbers as you want. Individual sequence numbers must be separated by commas; groups of sequence numbers must be specified by entering the first and last sequence numbers in the group. The first and last sequence numbers must be separated by a dash (-). For example, 35-67 indicates that you want to select sequence numbers 35 through 67. @@3.2.1.ED. If you specify CODE in response to the "Selection type" prompt, Retrieve will prompt you for the event codes that you want to select. Selection type (ALL): CODE Event codes: 133,161-163 You can select as many event codes as you want. Each event code must be separated by a comma. You can also select groups of event codes. The first and last event codes in the group must be separated by a dash (-). For example, 161-163 indicates that you want to select event codes 161 through 163. This prompt also supports standard Help and Question mark (?) responses. @@3.2.1.EE. If you specify ERROR, STATISTICS, DIAGNOSTICS, or OTHER, or a combination of these responses to the "Selection" prompt, Retrieve will enter the "Error class" dialog. Selection type (ALL): ERROR Category(ALL): ALL MAINFRAME DISK TAPE CI NI UNITRECORD NETWORK OPERATING-SYSTEM COMM PACKID REELID HELP @@3.2.1.EF. ALL (or the RETURN key) - indicates that you want to select all errors. (This is the default). MAINFRAME - indicates that you want to select errors occurring in specific mainframe components. DISK - indicates that you want to select errors occurring on disk units. After selecting DISK, you can specify ALL, specific disks by name (DPA3, RPB7), or a disk type (RP06, RM05). TAPE - indicates that you want to select errors occurring on tape units. After selecting TAPE, you can specify ALL, or specify the tape names or types in question.
CI - indicates that you want to select CI-related errors. After selecting CI, you can specify ALL, or the specific component of interest. NI - indicates that you want to select NI-related errors. @@3.2.1.EG. UDA - indicates that you want to select UDA-related errors. After selecting UDA, you can specify ALL, or the specific component of interest. UNITRECORD - indicates that you want to select errors occurring on unit-record devices such as card readers and line printers. After selecting UNITRECORD, you can specify ALL, or type the specific device names or types in question. OPERATING-SYSTEM - indicates that you want to select operating system codes. After selecting OPERATING-SYSTEM, you can specify ALL, or type the name of a specific STOPCODE or BUG type. COMM - indicates that you want to select errors occurring on communication devices. @@3.2.1.EH. PACKID - indicates that you want to select specific disk packs. After typing PACKID, you can type ALL, or type the specific pack identifiers. REELID - indicates that you want to select specific tape reels. After typing REELID, you can type ALL, or the specific tape identifiers. HELP - indicates that you want to get detailed information on the above categories. All categories except for COMM and NI prompt further for specific device types. Type ? at the subprompt level to get a list of acceptable responses. If you choose the DISK drive, TAPE drive, or CI controller subprompt, Retrieve then prompts you further for an error type. Type ? at the subprompt level to get a list of acceptable responses. @@3.2.1.EI. RETRIEVE keeps prompting you for categories until you either type FINISHED, or press the RETURN key. Next Category (FINISHED): Type one of the following: The RETURN key, or FINISHED to take the default, or, another category. @@3.2.1.V. Back to the basic dialog. The fourth selection prompt: Time from (EARLIEST): allows you to specify the time at which you want the selection process to begin. 
The default response (EARLIEST) is enclosed in parentheses and can be selected by pressing the RETURN key. You can also specify real and relative time. The prompt also supports standard Help and question mark (?) responses. @@3.2.1.W. The fifth selection prompt: Time to (LATEST): allows you to specify the time at which you want the selection process to end. The default response (LATEST) is enclosed in parentheses and can be selected by pressing the RETURN key. Again, you can also specify real and relative time. The prompt also supports standard Help and question mark (?) responses. @@3.2.1.X. The sixth selection prompt allows you to specify the type of file that you want Retrieve to generate. Output mode (ASCII): ? ASCII - indicates that you want the selected entries extracted and translated in a report. BINARY - indicates that you want the selected entries extracted and saved in a binary file. We will discuss the ASCII response first. This prompt also supports standard Help and question mark (?) responses. @@3.2.1.Y. If you specify ASCII in response to the "Output mode" prompt, Retrieve will prompt you for the type of format that you want. Output mode (ASCII): Report format (SHORT): ? SHORT - indicates that you want a brief translation of each selected entry. FULL - indicates that you want a detailed translation of each selected entry. OCTAL - indicates that you want an octal translation of each selected entry. Normally, octal translations are used to debug errors in Spear or the software routines that record the entries. The prompt also supports standard Help and question mark (?) responses. @@3.2.1.Z. If you specify BINARY in response to the "Output mode" prompt, Retrieve will ask you if you want to merge the selected entries with an existing binary file. Output mode (ASCII): BINARY Merge with (NONE): Normally, merging is done only if you are maintaining a device history file. For example, suppose the processor was experiencing a highly intermittent failure.
Let's say that on the average, the failure occurred once a week. Given that situation, you might need several weeks or even a month's worth of error information to isolate the cause of the problem. Since an event file can get quite large over a period of several weeks or a month, you might consider establishing a history file to keep track of the failure. The merge feature is designed to help you do this. It allows you to combine the currently selected entries with previously selected entries and merge them in the output file. @@3.2.1.A1. The Merge prompt also supports standard Help and question mark (?) responses. @@3.2.1.B1. The eighth and last selection prompt: Output to ([DSK]:RETRIE.RPT): allows you to specify the name of the output file. The default file name is DSK:RETRIE.RPT for TOPS-10/TOPS-20, and RETRIE.RPT for VAX/VMS (if you are generating a report). The default becomes DSK:RETRIE.SYS for TOPS-10/TOPS-20, and RETRIE.SYS for VAX/VMS (if you are building or maintaining a binary history file). You can override the entire default by specifying a new file name. You can also override any field in the default response by specifying only the field that you want to override. For example, if you were to type: Report to (DSK:SUMMAR.SYS): CPU the output file specification would become DSK:CPU.SYS. The prompt also supports standard Help and question mark (?) responses. @@3.2.1.C1. Finally, the confirmation prompt: Type [cr] to confirm (/GO): provides an opportunity for you to review and change any responses entered up to that point. If you want to review the response list type /SHOW. If you are satisfied with the response list press the RETURN key or type /GO. If you want to change a response, press the BACKSPACE key until you arrive at the corresponding prompt, make the change, and then type /GO. @@3.2.1.D1. That concludes the explanation of the Retrieve dialog. Next on the menu is a set of questions about the Retrieve dialog.
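As a recap of the "Sequence numbers" response format explained earlier in the dialog (single numbers separated by commas, groups given as first and last number separated by a dash), here is a minimal sketch of how such a response could be expanded. It is an illustration only and not Retrieve's own code: the function name parse_number_list is hypothetical, and sorting the result and dropping duplicates are assumptions made for the example.

```python
def parse_number_list(text):
    """Expand a response such as '22,24,35-67,12' into a sorted list.

    Assumptions for this sketch: numbers are read as plain decimal
    integers, duplicates are dropped, and the result is sorted.
    """
    selected = set()
    for item in text.split(","):
        item = item.strip()
        if not item:
            continue
        first, dash, last = item.partition("-")
        low = int(first)
        high = int(last) if dash else low
        if high < low:
            raise ValueError(f"group {item!r} ends before it starts")
        selected.update(range(low, high + 1))
    return sorted(selected)


numbers = parse_number_list("22,24,35-67,12")
print(len(numbers), numbers[:4])
```

The "Event codes" response has the same shape (for example 133,161-163), but the text does not state in which number base Retrieve reads the codes, so the sketch should not be taken as a guide to that.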
@@RETRIEVE CODES Generally speaking, the TOPS-10, TOPS-20 and VAX/VMS operating systems handle errors in a similar manner. That is, when an error occurs they snapshot pertinent hardware and software status (at error). Then, if applicable, an error retry algorithm is applied. Next, regardless of whether or not the retry algorithm was successful, a second snapshot is taken (at end). Finally, the captured status is put into a record, assigned a code, and appended to the system event file. The operating systems differ, however, in the way that they snapshot the status, implement the retry algorithms, and assign codes to the error or event record. @@3.M. Spear Library Topic Menu: 1. Introduction 2. Retrieve 3. Compute 4. Summarize 5. Applications 6. Klerr @@R.T.3.2.1.D1. STOP - You are moving in a reverse direction thru the menu. You are about to back into the Retrieve Dialog explanation. @@ret_dia_q1 Retrieve Dialog - Q1 of 10 True or False - Retrieve can be used to translate and/or save the records listed in the packets that are generated by Analyze? @@ret_dia_q1_at That's Correct. This feature allows you to translate and/or save the individual records that were used as evidence to support specific theories. @@ret_dia_q1_af The statement is TRUE. Retrieve can translate the Packets generated by Analyze. Remember, there is a packet associated with each theory listed in the Analyze report. The Packet contains pointers that identify the records that were used as evidence to support the theory. Thus, anytime you question the validity of a theory and want to examine the evidence yourself, you can do so by specifying the Packet file as input to Retrieve. When Retrieve prompts for the packet number, enter the number that corresponds to the theory that you are investigating and then, specify the desired output mode (Short, Full, or Octal). @@ret_dia_q2 Retrieve Dialog - Q2 of 10 True or False - Retrieve can be used to generate and maintain device history files? 
@@ret_dia_q2_at That's correct. You can use Retrieve to build and maintain history files for: a) entire subsystems (disks, tapes, networks, etc.), b) logical devices (DP220, MT300, CPU0, etc.), c) physical option types (RP06s, TU45s etc.) or, d) disk and tape storage media (Pack or Reel IDs). @@ret_dia_q2_af The statement is TRUE. Retrieve can be used to build and maintain device history files. The procedure is relatively simple. Here's what to do: First, select the device via the "Error class" prompt. Next, specify the time frame. Then, when Retrieve prompts for Output mode, specify BINARY. Retrieve will ask you if you want to merge the selected entries with an existing binary history file. If you are building a new history file press the RETURN key or type: NONE. If, however, a history file already exists for the selected device and you just want to combine the entries, then specify the name of the history file in response to the "Merge" prompt. Finally, Retrieve will prompt for the output file name. Again, if you are building a new history file, then specify a unique file name. If, however, you are updating an existing history file, then specify the name of the history file you are updating. In most cases it would be the same file that you specified in response to the "Merge with" prompt. @@ret_dia_q3 Retrieve Dialog - Q3 of 10 True or False - If, in response to the "Type [cr] to confirm (/GO):" prompt, you type "/DISPLAY" - Retrieve will display the current list of responses? @@ret_dia_q3_at The statement is FALSE. The switch is called "/SHOW". If after making your selections, you type "/SHOW", Retrieve will display each prompt and the corresponding response as illustrated in the following example.
Type [cr] to confirm (/GO): /SHOW RETRIEVE mode ------------- Event or packet file: SYSTEM:ERROR.SYS Output to: DSK:RETRIE.TXT Merge with: NONE Time from: EARLIEST Time to: LATEST Selection to be: INCLUDED Output mode: ASCII Report format: SHORT Selection type: ERROR Error class: DISK, TAPE, Disk drives: DP120, DP230, Tape drives: MT300, Type [cr] to confirm (/GO): @@ret_dia_q3_af That's correct. It's the "/SHOW" switch that will cause the current list of responses to be displayed. If, after reviewing the list, you decide that you want to change a response, you can press the BACKSPACE key (or type /REVERSE) until you get back to the response that you want to change. At that point you can add to the response, or you can type /CLEAR and enter a new response. @@ret_dia_q4 Retrieve Dialog - Q4 of 10 True or False - Retrieve can be used to select entries that pertain to specific Disk Packs or Magtape Reel ID's? @@ret_dia_q4_at That's correct. Pack and Reel ID's were added to the selection criteria so that you could use the EXCLUDE mode to remove entries from the event file that pertain to known bad media. Thus, you can clean up the file a bit, resubmit it to Analyze, and see if media problems were covering up other more subtle hardware problems. @@ret_dia_q4_af The statement is TRUE. If you type "?" at the "Error class" prompt you will see that PACKID and REELID are among the selection criteria available. @@ret_dia_q5 Retrieve Dialog - Q5 of 10 True or False - If you specify a file name in response to the "Merge with (NONE):" prompt, Retrieve will automatically append the selected entries to that file? @@ret_dia_q5_at The statement is FALSE. Retrieve will NOT automatically append the selected entries to the file that you specify in response to the "Merge with" prompt. Instead what happens is: the selected entries and the entries in the "merge file" are combined and written out to the file that you specify in response to the "Output to" prompt.
Selected Entries Merge with "file name" | | |_____________________| | Output "file name" If at the "Output to" prompt, however, you specify the same file name that you specified at the "Merge with" prompt, then the entries will be combined and written in that file. Incidentally, that is the recommended method for maintaining device history files. @@ret_dia_q5_af That's correct. Retrieve will NOT change the "merge" file in any way unless you direct it to do so by specifying the same file name at the "Output to" prompt. @@ret_dia_q6 Retrieve Dialog - Q6 of 10 True or False - Sequence numbers are used to identify the relative position of the records in a system event file? @@ret_dia_q6_at That is correct. Record sequence numbers are included as part of the header in all Short, Full, and Octal reports translated by Retrieve. The sequence number is the simplest way to refer to a specific record. As long as the order of the records in the file is not disturbed, the sequence numbers will remain valid. Thus, if you request a Short ASCII translation of several records and then decide that you want a Full translation of one or two of those records, you can do so by specifying the sequence numbers to Retrieve. @@ret_dia_q6_af The statement is TRUE. Sequence numbers reflect each record's relative position in a file. Remember, sequence numbers are dynamically assigned to each record as a file is read. For example, if a file contains 623 records, then the first record in the file will be assigned sequence number 1, the second record will be assigned sequence number 2, etc. Finally, the last record in the file will be assigned sequence number 623. @@ret_dia_q7 Retrieve Dialog - Q7 of 10 True or False - Retrieve can be used to select entries based on the event codes assigned to the entries by the operating system. @@ret_dia_q7_at That's correct.
If, for example, you wanted to select all KS10 Halt Status Block entries you could: reference the Spear Manual or look on the back of the Spear Reference Card to get the code number, specify "CODE" at the "Selection type" prompt, and, when Retrieve prompted for "Event code:" enter 033 for TOPS-10 or 133 for TOPS-20. @@ret_dia_q7_af The statement is TRUE. If you type "?" in response to the "Selection type" prompt, you will see "CODE" listed as one of the acceptable responses. If you select "CODE", Retrieve will prompt you for the "Event codes". The event types and the corresponding event codes are listed on the back panel of the Spear Reference Card. In addition, the detailed information contained in each entry type is described in the Spear Manual. @@ret_dia_q8 Retrieve Dialog - Q8 of 10 True or False - Typing /C in response to the "Next Category (FINISHED):" prompt will clear all entries selected up to that point. @@ret_dia_q8_at That's correct. Keep in mind, however, that in addition to clearing selected entries, the /CLEAR switch will also reset the prompt response to the default. In other words, suppose you type /SHOW before starting Retrieve. Then, let's say that you decide that you don't want the selected magtape entries after all, so you press the BACKSPACE key until you get back to the "Error class" prompt. At that point you specify "Tape", Retrieve prompts for tape drives and you type /CLEAR. You might think that you are no longer selecting any Magtape entries. But that is not the case. Instead, what you did was clear the selected list and thus reinstate the default (ALL). @@ret_dia_q8_af The statement is TRUE. The /CLEAR switch provides a mechanism for changing selected entry types. For example, suppose you had just selected some event codes for translation and you're about to press the RETURN key to start Retrieve but, before doing so, you typed "/SHOW" just to double check yourself.
Now, suppose you discover that, for some reason, you entered the wrong list of event codes. Here's what to do: 1. Press the BACKSPACE key until you get back to the "Selection type" prompt. 2. Then, in response to the "Selection type" prompt specify "CODE". 3. When Retrieve prompts for the Event codes, type "/CLEAR" to clear the existing list of event codes and then enter the correct list. 4. Finally, type "/SHOW" as a last check and then, if everything is OK type "/GO" to start Retrieve. @@ret_dia_q9 Retrieve Dialog - Q9 of 10 True or False - Entries can be retrieved by logical names (e.g., CPU0) as well as by physical names (e.g., RP06)? @@ret_dia_q9_at Technically, the statement is FALSE. Retrieve can recognize some, but not all, logical and physical names. @@ret_dia_q9_af That is correct. Retrieve recognizes some, but not all, physical and logical names. Just as a double check before running, Retrieve will list all selected names that it considers to be logical. Thus, if you made a typing error or entered a physical name that it does NOT recognize, you'll know because Retrieve will list it as a logical name. @@ret_dia_q10 Retrieve Dialog - Q10 of 10 True or False - Retrieve can be used to extract entries based on STOPCODES or BUGxxx code names? @@ret_dia_q10_at That's correct. The "Mainframe Error and Crash Summary" section of the Analyze report breaks down STOPCODES (for TOPS-10) and BUGxxx (for TOPS-20 and VAX/VMS) by: type, name, and number of occurrences. Thus, given the Analyze report, you can then use Retrieve to translate or save the STOPCODE or BUGxxx entries for further investigation. This feature is particularly helpful when it comes to saving and investigating very intermittent system crashes. @@ret_dia_q10_af The statement is TRUE. Retrieve can be used to extract entries based on STOPCODES and BUGxxx code names. If you specify "CODE", Retrieve will prompt for "Event codes".
At that point you can enter the names of one or more STOPCODES or BUGxxx that you want retrieved. For example, if you typed: Selection type (ALL): CODE Event codes: DX2FUS,P2RAE Retrieve will translate (or save) all entries that are related to either of the Event codes (DX2FUS and P2RAE). @@3.2.1.1. That's it. There are only ten questions about the Retrieve dialog. If you have gotten this far, then chances are you have a pretty good idea of how to use Retrieve. Therefore, it is with great honor that Instruct pronounces you a "Retrieve-Dialog Subject Matter Expert". @@3.3.0. Compute calculates the following system performance factors: System Availability (SA) - System Availability is the percentage of time that the system was available for use. (It includes Standalone time.) User Availability (UA) - User Availability is the percentage of time that the system was available for use by the user community. System Effectiveness (SE) - System Effectiveness (SE) is the percentage of probability that the system remained available for a given period of time (t). The remainder of this introduction briefly explains the formulas used by Compute to calculate these factors. For a more detailed explanation of the formulas refer to the Spear Manual. @@3.3.0.A. The following formula is used to calculate System Availability (SA): SA = (1.0) - CDT/(TDT + TRT) where: CDT = Chargeable Down Time TDT = Total Down Time TRT = Total Run Time Remember - System Availability is the percentage of time that the system was available for use. (It includes Standalone time.) @@3.3.0.B. The following formula is used to calculate User Availability (UA): UA = (1.0) - CDT/(CDT + TRT) where: CDT = Chargeable Down Time TRT = Total Run Time Remember - User Availability is the percentage of time that the system was available for use by the user community. @@3.3.0.C.
The following formula is used to calculate System Effectiveness (SE): SE = (SA) * (e** (-t/MTBF)) where: SA = System Availability e = the Napierian or natural base of logarithms (2.71828+) t = an arbitrary period of time for which the SE factor is calculated. Typically, Compute calculates the SE factor for four time periods: 6 minutes, 30 minutes, 1 hour, and 4 hours. MTBF= The mean, or average time between failures (chargeable Downtimes). e** means "e" raised to the power of (-t/MTBF). Remember - System Effectiveness (SE) is the percentage of probability that the system remained available for a given period of time (t). @@R.T.3.3.0. STOP - You are moving in a reverse direction through the menu. You are about to back into the Introduction to Compute. @@3.3.M. Spear Library - Compute Topic menu: 1. Overview 2. Compute Dialog 3. Questions & Answers @@3.3.1. Compute Dialog - The Compute dialog consists of seven selection prompts and one confirmation prompt. COMPUTE mode ------------ Event file (default): Report period (LAST-WEEK): Time from (EARLIEST): Time to (LATEST): Report type (SINGLE-REPORT): Availability Report to ([DSK]:COMPUT.RPT): Reload report to ([DSK]:RELOAD.RPT): Type to confirm (/GO): @@3.3.1.A. The first selection prompt: Event file (default): allows you to specify the name of the file that contains the system performance entries that you want Compute to use in its calculations. The default response (SYS:AVAIL.SYS for TOPS-10, SERR:ERROR.SYS for TOPS-20, and SYS$SYSDISK:[SYSERR]:ERRLOG.SYS for VAX/VMS) is enclosed in parentheses, and can be selected by pressing the RETURN key. You can override the entire default response by specifying a new file name, or you can override any field in the default response by specifying only the field that you want to override. For example, if you were to type: Event file (SERR:ERROR.SYS): .LWK the input file specification would become SERR:ERROR.LWK The prompt also supports standard Help and question mark (?) responses. 
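The field-override rule described above can be sketched in a few lines of Python. This is only an illustration, not Spear's own code: the function names are hypothetical, and it assumes a simple dev:filename.filetype specification (no version field, no VAX/VMS directory) and a default in which all three fields are present. Any field that you leave out of the typed response is taken from the default.

```python
def split_spec(spec):
    """Split 'dev:name.ext' into (dev, name, ext); a missing field is ''.

    Hypothetical helper: assumes no version field and no directory part.
    """
    dev, _, rest = spec.rpartition(":")   # no ':' -> dev is ''
    name, _, ext = rest.partition(".")    # no '.' -> ext is ''
    return dev, name, ext


def apply_override(default, typed):
    """Return the default spec with every typed field substituted in."""
    old_fields = split_spec(default)
    new_fields = split_spec(typed)
    # Keep the typed field when present, otherwise fall back to the default.
    dev, name, ext = (new or old for old, new in zip(old_fields, new_fields))
    return f"{dev}:{name}.{ext}"


if __name__ == "__main__":
    print(apply_override("SERR:ERROR.SYS", ".LWK"))
    print(apply_override("DSK:COMPUT.RPT", "FS:"))
```

Pressing the RETURN key corresponds to an empty typed response, in which case every field falls back to the default.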
@@COMPUTE INPUT .--------------. .___ Summary Report | Calculate | | System Event File ___| System |___|___ Availability Report (or AVAIL.Ann) | Availability | | |______________| |___ Reload Report TOPS-10, TOPS-20, and VAX/VMS record entries that are used by Compute to calculate overall system performance. Under TOPS-10 the entries are recorded in a file called AVAIL.SYS. Under TOPS-20 and VAX/VMS the entries are recorded in ERROR.SYS and ERRLOG.SYS respectively. @@3.3.1.B. The second selection prompt Report period (LAST-WEEK): allows you to specify the time period for which you want system performance calculated. Compute is designed to calculate system performance for the previous week. That is, from a week ago Sunday at 00:00:01 to last Sunday at 00:00:01. Thus, by running Compute weekly you can monitor overall system performance and note any trends in availability or effectiveness. You can also direct Compute to calculate system performance for this week or any other period of time. If you specify THIS-WEEK, then Compute calculates system performance from last Sunday at 00:00:01 to the present. If you specify OTHER Compute will prompt for the specific time period. The prompt also supports standard Help and question mark (?) responses. @@3.3.1.C. The third and fourth selection prompts Time from (EARLIEST): Time to (LATEST): are displayed only if you specify OTHER in response to the Report Period Prompt. The time prompts allow you to specify the specific time period for which you want system performance calculated. You can specify the default times (Earliest and Latest respectively), or you can specify either real or relative time. Both of these prompts also support standard Help and question mark (?) responses. @@3.3.1.D. The fifth selection prompt Report type (SINGLE-REPORT): is also displayed only if you specify OTHER in response to the Report Period Prompt. The Report Type prompt allows you to specify the type of report that you want. 
You can specify the default, SINGLE-REPORT, in which case Compute will generate a single report that reflects system performance for the selected time period. You can also specify MULTIPLE-REPORTS, in which case Compute will generate (in addition to the single report) a set of weekly reports that reflect system performance for the selected time period. The prompt also supports standard Help and question mark (?) responses. @@3.3.1.E. The sixth selection prompt Availability Report to ([DSK]:COMPUT.RPT): allows you to specify the destination of the 132 column Availability Report. The default destination (DSK:COMPUT.RPT for TOPS-10/TOPS-20, and COMPUT.RPT for VAX/VMS) is enclosed in parentheses and can be selected by pressing the RETURN key. Compute automatically outputs a 72 column Summary Report to your terminal. You can replace the entire default destination by specifying a new file name, or you can replace any field in the default by specifying only the field that you want to override. For example, if you were to type: Availability Report to (DSK:COMPUT.RPT): FS: the output file specification would become FS:COMPUT.RPT The prompt also supports standard Help and question mark (?) responses. @@compute output Compute generates two reports; a 72 column Summary Report, and a 132 column Availability Report. The Summary Report is automatically output to your terminal. At this prompt Compute is waiting for you to specify a destination for the Availability Report. You can: 1. Press the RETURN key to select the default file specification: DSK:COMPUT.RPT. 2. Enter a unique file specification (e.g., DSK:WK21.RPT). The file specification format is: dev:filename.filetype.version. If you specified multiple reports, then Compute will generate a set of weekly reports in addition to COMPUT.RPT. The reports will be named Cmmdd.RPT. Where mmdd corresponds to the month and day of each week. @@3.3.1.F. 
The last selection Prompt Reload report to ([DSK]:RELOAD.RPT): allows you to specify the destination of the Reload Log Report. The Reload Report uses 132 columns and lists the system name, the operating system version, the number of times the system was reloaded, and the operator's response to the question "Why Reload?" You can select the default response (DSK:RELOAD.RPT for TOPS-10/TOPS-20, and RELOAD.RPT for VAX/VMS) by pressing the RETURN key. You can replace the entire default destination by specifying a new file name, or you can replace any field in the default response by specifying only the field that you want to replace. For example, if you typed: Reload report to (DSK:RELOAD.RPT): .LWK the output file specification would become DSK:RELOAD.LWK The prompt also supports standard Help and question mark (?) responses. @@3.3.1.G. The confirmation prompt: Type to confirm (/GO): provides an opportunity for you to review and change any responses entered up to that point. If you want to review the response list type /SHOW. If you are satisfied with the response list press the RETURN key or type /GO. If you want to change a response, press the BACKSPACE key until you arrive at the corresponding prompt, make the change, and then type /GO. @@com_dia_q1 Compute Dialog Q1 of 5 True or False - The formulas used by Compute to calculate: System Availability (SA), User Availability (UA), and System Effectiveness (SE) are described in the Spear Manual? @@com_dia_q1_at That's correct. The formulas: SA = (1.0) - CDT/(TDT + TRT) UA = (1.0) - CDT/(CDT + TRT) SE = (SA) * (e** (-t/MTBF)) are also briefly explained in the Introduction section of this module. @@com_dia_q1_af The statement is TRUE. The formulas used by Compute to calculate system availability, user availability and system effectiveness are described in the Spear Manual. You should become familiar with those formulas before you attempt to interpret the reports generated by Compute. 
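As a study aid, the three formulas can be written out as a short Python sketch. This is an illustration only, not Compute's own code: the function names and the sample figures are hypothetical, all times are assumed to be in hours, and MTBF is passed in directly because this module does not explain how Compute derives it from the event file.

```python
import math


def system_availability(cdt, tdt, trt):
    """SA = 1.0 - CDT/(TDT + TRT), returned as a fraction (0.0 to 1.0)."""
    return 1.0 - cdt / (tdt + trt)


def user_availability(cdt, trt):
    """UA = 1.0 - CDT/(CDT + TRT), returned as a fraction (0.0 to 1.0)."""
    return 1.0 - cdt / (cdt + trt)


def system_effectiveness(sa, t, mtbf):
    """SE = SA * e**(-t/MTBF), with t and MTBF in the same time unit."""
    return sa * math.exp(-t / mtbf)


if __name__ == "__main__":
    # Hypothetical week: 160 hours of run time, 8 hours of total down time,
    # of which 2 hours are chargeable. The MTBF of 80 hours is made up.
    cdt, tdt, trt, mtbf = 2.0, 8.0, 160.0, 80.0
    sa = system_availability(cdt, tdt, trt)
    print(f"SA % : {100 * sa:.3f}")
    print(f"UA % : {100 * user_availability(cdt, trt):.3f}")
    # The four periods reported by Compute: 6 min, 30 min, 1 hour, 4 hours.
    for hours in (0.1, 0.5, 1.0, 4.0):
        print(f"SE % for t = {hours} h : "
              f"{100 * system_effectiveness(sa, hours, mtbf):.3f}")
```

Note that when there is no chargeable down time (CDT = 0), both SA and UA come out to 100 percent, and SE then depends only on the ratio t/MTBF.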
@@com_dia_q2 Compute Dialog Q2 of 5 True or False - The entries used by Compute to calculate system performance are recorded in the system event file: ERROR.SYS for TOPS-10 and TOPS-20, and ERRLOG.SYS for VAX/VMS? @@com_dia_q2_at The statement is FALSE. Under TOPS-20 and VAX/VMS the entries are recorded in the system event files. However, under TOPS-10 the entries are recorded in a file called AVAIL.SYS. @@com_dia_q2_af That's correct. TOPS-10 records the entries in a file called AVAIL.SYS. As a general rule, most TOPS-10 sites rename the AVAIL.SYS file to AVAIL.Ann weekly. (Where nn is a number in the range of 01 to 99.) Typically, the first AVAIL.SYS file becomes AVAIL.A01, the second AVAIL.A02, etc. Thus, if the latest AVAIL.Ann file was AVAIL.A25, and you wanted Compute to calculate system performance for the last four weeks, then you would specify AVAIL.A22 as the input file. @@com_dia_q3 Compute Dialog Q3 of 5 True or False - Compute generates two types of reports: a 72 column Summary Report that highlights overall system performance, and a 132 column Full Report that provides more detail? @@com_dia_q3_at That's correct. The Summary report is automatically displayed on your terminal. It will provide a picture of overall performance. The Full report backs up the Summary report with specific details. Note: The Full report requires 132 columns and is generally not suited for display on most terminals. @@com_dia_q3_af The statement is TRUE. Compute generates two types of reports: a Summary Report that highlights overall system performance, and a Full Report that details system availability and effectiveness. The Summary report is automatically output to your terminal when you run Compute.
The following example illustrates a typical Summary report: Compute Summary Report From: 7-Jun-81 01:00 To: 14-Jun-81 01:00 period length (HRS): 168.000 SYSTEM Availability % : 100.000 USER Availability % : 100.000 Effectiveness Six minutes Thirty minutes One Hour Four Hours factor 99.584 97.938 95.918 94.648 Report file name: DSK:COMPUT.RPT Note: The Effectiveness Factor is the probability that a six minute, a thirty minute, a one hour and a four hour job will run to completion. @@com_dia_q4 Compute Dialog Q4 of 5 True or False - Compute uses the operator's response to the question: "Why Reload" to determine User Availability? Downtime? @@com_dia_q4_at The statement is FALSE. The operator's response to the question: "Why Reload" is used by Compute to distinguish between Chargeable Downtime and Non-chargeable Downtime. @@com_dia_q4_af That's correct. The operator's response to the question: "Why Reload" is used to distinguish between Chargeable Downtime and Non-chargeable Downtime. The following operator responses constitute: Chargeable Downtime - STOPCD, BUGHLT, HALT, PARITY, HARDWARE, NXM, HUNG, LOOP, and CM (Corrective Maintenance). Non-chargeable Downtime - PM (Preventive Maintenance), OPERATOR, POWER, STATIC, NEW, SCHEDULED, STANDALONE, and OTHER. @@com_dia_q5 Compute Dialog Q5 of 5 True or False - In addition to the Summary Report and the Full Report, Compute also generates a Reload Report called COMPUT.RLD? @@com_dia_q5_at The statement is true, only in that Compute generates a Reload Report. The report is actually called RELOAD.RPT not COMPUT.RLD. @@com_dia_q5_af That's correct. The name of the report is: RELOAD.RPT. The following example illustrates the type of information it contains.
SYSTEM 2116 THE BIG ORANGE, TOPS-20 MONITOR 4(3530) Built on: 28-May-81 11:41:11 Version: 400,,3530 Loaded on: 10-Jun-81 20:20:45 Crashed on: 14-Jun-81 07:00:16 Reloaded on: 14-Jun-81 07:25:08 Why reload: OTHER Run time: 6.004 Down time: 0.414 SYSTEM 2116 THE BIG ORANGE, TOPS-20 MONITOR 4(3530) Built on: 28-May-81 11:41:11 Version: 400,,3530 Loaded on: 14-Jun-81 07:25:10 Crashed on: 15-Jun-81 08:38:20 Reloaded on: 15-Jun-81 08:38:20 Why reload: OTHER Run time: 25.219 Down time: 0.000 The Reload Report and the Full Report, are intended to help you complete system Crash and Uptime reports. @@3.3.1.1. That's the last question about the Compute dialog. Press the RETURN key to return to the menu. @@3.3.2. The Compute Report questions were not ready in time for this Field Test Version of Spear. Press the BACKSPACE key or type MENU to return to the Compute menu. @@com_rpt_q1 Compute Report - Q1 of 5 True or False - @@com_rpt_q1_at That's correct. @@com_rpt_q1_af The statement is TRUE. @@com_rpt_q2 Compute Report - Q2 of 5 True or False - @@com_rpt_q2_at The statement is FALSE. @@com_rpt_q2_af That's correct. @@com_rpt_q3 Compute Report - Q3 of 5 True or False - @@com_rpt_q3_at The statement is FALSE. @@com_rpt_q3_af That's correct. @@com_rpt_q4 Compute Report - Q4 of 5 True or False - @@com_rpt_q4_at That's correct. @@com_rpt_q4_af The statement is TRUE. @@com_rpt_q5 Compute Report - Q5 of 5 True or False - @@com_rpt_q5_at The statement is FALSE. @@com_rpt_q5_af That's correct. @@3.3.2.1. That's it. There are only five questions about the Compute Report. Press the RETURN key to return to the Compute menu. @@3.4.0. Summarize Overview - Summarize is designed to read and summarize the contents of system event files. The purpose of this Instruct module is to ensure that you understand the dialog and the report associated with the Summarize function. The module consists of two parts. 
Part one briefly explains the Summarize dialog and then asks some questions to ensure that there are no misunderstandings. Part two of this module briefly explains the format and organization of the Summarize Report. (You will be asked to generate or obtain a typical Summarize Report.) The remainder of the module consists of a series of questions about the report. Again, the purpose of the questions is to ensure that there are no misunderstandings about the general format and content of the report. Objective - Upon completion of this module you should have no difficulty using the Summarize dialog or understanding the format, organization and content of a typical Summarize report. @@R.T.3.4.0. STOP - You are moving in a reverse direction through the menu. You are about to back into the Summarize Overview. @@3.4.M. Spear Library - Summarize Topic menu: 1. Overview 2. Summarize Dialog 3. Summarize Dialog Questions & Answers 4. Summarize Report 5. Summarize Report Questions & Answers @@3.4.1. Summarize Dialog - The Summarize dialog consists of six selection prompts and one confirmation prompt. SUMMARIZE mode -------------- Event file (default): Category (ALL): Time from (EARLIEST): Time to (LATEST): Show Error Distribution(YES): Report to ([DSK]:SUMMAR.RPT): Type to confirm (/GO): @@3.4.1.A. The first selection prompt: Event file (default): allows you to specify the name of the system event file that you want summarized. The default response (SYS:ERROR.SYS for TOPS-10, SERR:ERROR.SYS for TOPS-20, and SYS$ERRORLOG:ERRLOG.SYS for VAX/VMS) is enclosed in parentheses and can be selected by pressing the RETURN key. You can override the entire default response by specifying a new file name. You can also override any field in the default response by specifying only the field that you want to override.
For example, if you were to type: Event file (SERR:ERROR.SYS): .LWK the input file specification would become SERR:ERROR.LWK The prompt also supports standard Help and question mark (?) responses. @@3.4.1.AA. After you have specified the source of input, SUMMARIZE prompts you for the category. Category(ALL): ALL MAINFRAME DISK TAPE CI NI UNITRECORD NETWORK OPERATING-SYSTEM COMM PACKID REELID HELP @@3.4.1.AB. ALL (or the RETURN key) - indicates that you want to select all errors. (This is the default). MAINFRAME - indicates that you want to select errors occurring in specific mainframe components. DISK - indicates that you want to select errors occurring on disk units. After selecting DISK, you can specify ALL, the specific disks by name (DPA3, RPB7), or the disk type (RP06, RM05). TAPE - indicates that you want to select errors occurring on tape units. After selecting TAPE, you can specify ALL, or specify the tape names or types in question. CI - indicates that you want to select CI-related errors. After selecting CI, you can specify ALL, or the specific component of interest. NI - indicates that you want to select NI-related errors. @@3.4.1.AC. UDA - indicates that you want to select UDA-related errors. After selecting UDA, you can specify ALL, or the specific component of interest. UNITRECORD - indicates that you want to select errors occurring on unit-record devices such as card readers and line printers. After selecting UNITRECORD, you can specify ALL, or type the specific device names or types in question. OPERATING-SYSTEM - indicates that you want to select operating system codes. After selecting OPERATING-SYSTEM, you can specify ALL, or type the name of a specific STOPCODE or BUG type. COMM - indicates that you want to select errors occurring on communication devices. @@3.4.1.AD. PACKID - indicates that you want to select specific disk packs. After typing PACKID, you can type ALL, or type the specific pack identifiers.
REELID - indicates that you want to select specific tape reels. After typing REELID, you can type ALL, or the specific tape identifiers. HELP - indicates that you want to get detailed information on the above categories. All categories except for COMM and NI prompt further for specific device types. Type ? at the subprompt level to get a list of acceptable responses. @@3.4.1.AE. SUMMARIZE keeps prompting you for categories until you either type FINISHED, or press the RETURN key. Next Category (FINISHED): Type one of the following: The RETURN key, or FINISHED to take the default, or, another category. @@3.4.1.B. The third selection prompt: Time from (EARLIEST): allows you to specify the time at which to begin summarizing the system event file. The default response (EARLIEST) is enclosed in parentheses and can be selected by pressing the RETURN key. You can also specify real and relative time. The prompt also supports standard Help and question mark (?) responses. @@3.4.1.C. The fourth selection prompt: Time to (LATEST): allows you to specify the time at which to end summarizing the system event file. The default response (LATEST) is enclosed in parentheses and can be selected by pressing the RETURN key. Again, you can also specify real and relative time. The prompt also supports standard Help and question mark (?) responses. @@3.4.1.DA. The fifth selection prompt: Show Error Distribution (YES): allows you to specify whether or not you want to receive error distribution tables. The default response (YES) is enclosed in parentheses and can be selected by pressing the RETURN key. If you type NO, you will suppress the error distribution tables from the report. @@3.4.1.D. The sixth selection prompt: Report to ([DSK]:SUMMAR.RPT): allows you to specify the name of the output or Report file. The default response (DSK:SUMMAR.RPT for TOPS-10/TOPS-20, and SUMMAR.RPT for VAX/VMS) is enclosed in parentheses and can be selected by pressing the RETURN key. 
You can override the entire default response by specifying a new file name. You can also override any field in the default response by specifying only the field that you want to override. For example, if you were to type: Report to (DSK:SUMMAR.RPT): FS: the output file specification would become FS:SUMMAR.RPT The prompt also supports standard Help and question mark (?) responses. @@3.4.1.E. Finally, the confirmation prompt: Type to confirm (/GO): provides an opportunity for you to review and change any responses entered up to that point. If you want to review the response list type /SHOW. If you are satisfied with the response list press the RETURN key or type /GO. If you want to change a response, press the BACKSPACE key until you arrive at the corresponding prompt, make the change, and then type /GO. @@3.4.1.F. That concludes the explanation of the Summarize dialog. Next on the menu is a set of questions about the Summarize dialog. @@sum_dia_q1 Summarize Dialog - Q1 of 7 True or False - If you do NOT want to change any of the Summarize default responses, you can type /GO at the Event file prompt? @@sum_dia_q1_at That is correct. All Spear Library functions begin by setting the response list to the default values. You can change the responses or type /GO at any time. The function will use the responses that you have specified up to that point and default the rest. If you make no changes the default response list is used. @@sum_dia_q1_af The statement is TRUE. When you first enter a Spear library dialog, the response list is set to the default values. Thus, if you type /GO at the Event file prompt Summarize will begin execution using the defaults. The result will be a report that summarizes the contents of the entire event file. @@sum_dia_q2 Summarize Dialog - Q2 of 7 True or False - If you type HELP in response to any Summarize prompt, a ONE page message explaining the prompt and the acceptable response to that prompt will be displayed?
@@sum_dia_q2_at That is correct. All Spear Library prompts support the HELP and (?) command. The Help messages are limited to one page, and the prompt is repeated immediately following the message. Typing (?) will result in a list of acceptable responses without explanation. @@sum_dia_q2_af The statement is TRUE. You can type HELP any time you are not sure how you should respond to a particular prompt. You will receive a one page HELP message that explains the prompt and the acceptable responses to that prompt. @@sum_dia_q3 Summarize Dialog - Q3 of 7 True or False - Summarize will accept and summarize the contents of any binary event file, including a binary event file generated by Retrieve? @@sum_dia_q3_at That is correct. Summarize will accept (as input) any file that conforms to the standard binary event file format. Currently, that includes event files generated by: TOPS-10, TOPS-20, VAX/VMS, or Retrieve. There is one restriction, however, the event file must have been generated by the same type of system that you are using to summarize the file. In other words, the TOPS-10 version of Spear can NOT be used to process event files generated by TOPS-20 etc. @@sum_dia_q3_af The statement is TRUE. Retrieve does not change the file format when it generates a binary (or History) file. Therefore, since Summarize is designed to handle standard binary event files, it will accept binary event files generated by Retrieve. @@sum_dia_q4 Summarize Dialog - Q4 of 7 True or False - In order to take the default response at a Summarize prompt you must press the ESCAPE key before pressing the RETURN key? @@sum_dia_q4_at The statement is FALSE. You don't have to press ESCAPE/RETURN to take the default response. You need only press the RETURN key. Originally, the purpose of the ESCAPE key was to display the default response. However, as a result of feedback during product Field Test, the prompts were changed. They now display the default responses in parentheses. 
Thus, the original purpose of the ESCAPE was nullified. @@sum_dia_q4_af That is correct. Since the default response is enclosed in parentheses, there is no need to use the ESCAPE key. @@sum_dia_q5 Summarize Dialog - Q5 of 7 True or False - Summarize will accept and summarize Packet Files generated by Analyze? @@sum_dia_q5_at The statement is FALSE. A Packet file is not a standard binary event file. It is a special file produced by Analyze that contains pointers that identify the records that were used as evidence to support the theories listed in the corresponding Analyze Report file. @@sum_dia_q5_af That is correct. Summarize only accepts standard binary event files. Since a Packet file is not a standard binary event file, Summarize will not accept it. @@sum_dia_q6 Summarize Dialog - Q6 of 7 True or False - If you want to change the name of the report file from DSK:SUMMAR.RPT to DSK:TEST.RPT, you need only type TEST at the Report prompt? @@sum_dia_q6_at That is correct. You can substitute fields at any Spear file specification prompt. For example, if you wanted the report to go to FS: and you wanted to call it SUMMAR.LWK, you could type: Report to(DSK:SUMMAR.RPT): FS:.LWK @@sum_dia_q6_af The statement is TRUE. All Spear Library file-name prompts accept field substitution. You can substitute the output device, the file name, the file extension, or any combination thereof. @@sum_dia_q7 Summarize Dialog - Q7 of 7 True or False - Both the "Time from" and the "Time to" prompt accept real and relative time? @@sum_dia_q7_at That is correct. All Spear Library "Time" prompts accept both real and relative time specifications. @@sum_dia_q7_af The statement is TRUE. All Spear Library "Time" prompts accept both real and relative time specifications. The format for real time is: dd-mmm-yy hh:mm:ss where dd is the numerical day, mmm is the first three letters of the month, yy is the last two digits of the year, and hh:mm:ss represent the hour, minute, and second respectively. 
The format for relative time is: -dd where dd represents some number of past days. The time defaults to 00:00:01. @@3.4.1.1. That's it. If you have gotten this far, then chances are you have a good handle on the Summarize dialog. Next on the menu is a brief explanation of the Summarize Report format. @@3.4.2. Summarize Report - The Summarize Report consists of four major sections: 1. A File Environment and Entry Occurrence Count section. 2. A Monitor Detected Error and Reload section. 3. A Front-end, Channel and Device Summary section. 4. A Channel and Device Breakdown section. This part of Instruct involves a series of questions. The questions are designed to ensure that you understand the format and general content of a typical Summarize Report. @@3.4.2.A. Before proceeding further, you should have a copy of a Summarize Report. You can type /BREAK and generate one using the Spear Library or, you can use the one in the Spear Manual. When you are ready to proceed press the RETURN key. @@sum_rpt_q1 Summarize Report - Q1 of 8 True or False - If you are running on a TOPS-20 System, the "Monitor Detected Errors and Reloads" section of the Summarize Report identifies the number of BUGHLT, BUGCHK, and BUGINF that occurred during the summary period? @@sum_rpt_q1_at That is correct. The BUGHLTs, BUGCHKs, and BUGINFs are described in the TOPS-20 Software Notebooks (Volume 16). @@sum_rpt_q1_af The statement is TRUE. You would have no way of knowing this, however, if, during the summary period that you selected, there were no BUGHLTs, BUGCHKs, or BUGINFs recorded. Summarize does not print this section of the report unless there were BUGxxx events recorded during the summary period. @@sum_rpt_q2 Summarize Report - Q2 of 8 True or False - The "File Environment" section of the Summarize Report always lists the total number and type of entries recorded in the system event file that was submitted as input? @@sum_rpt_q2_at The statement is FALSE.
The Summarize Report only lists the entries that were recorded during the period of time being summarized. Although that period of time could, it does not always reflect the entire event file. @@sum_rpt_q2_af That is correct. Only the events that occurred on or between the time the user specified, at the "Time from" and the "Time to" prompts, are summarized. @@sum_rpt_q3 Summarize Report - Q3 of 8 True or False - Under the "File Environment" section of the Summarize Report, the term "inconsistencies" refers to the number of unknown event types that were found in the summarized period of the event file? @@sum_rpt_q3_at The statement is FALSE. The term "inconsistencies" means that Spear encountered a nonrecoverable read error while reading the event file. In such cases it loses sync and must use the resynchronization word in the next data block to recover. For further information about the resync process refer to the DEFINE.LIS file and the Spear Manual. @@sum_rpt_q3_af That is correct. The term "inconsistencies" refers to the number of times Summarize lost sync reading the event file and had to use the resynchronization word in the next data block to recover. @@sum_rpt_q4 Summarize Report - Q4 of 8 True or False - The "Entry Occurrence Counts" section of the Summarize Report lists the event code and the number of times each event type appeared in the summarized period of the system event file? @@sum_rpt_q4_at That is correct. The entry types are catalogued by entry code and described in Appendix B of the Spear Manual. Sometime, when you get a chance, you should take a look at Appendix B. It lists, in detail, the information recorded for each entry type in the system event file. @@sum_rpt_q4_af The statement is TRUE. If you take a look at the report you'll see a decimal number, followed by name, followed by a number in parentheses. 
The decimal number indicates the number of times a particular entry type appeared in the file; the name refers to the entry type; and the number in parentheses refers to the code assigned to the entry type by the system software developers. @@sum_rpt_q5 Summarize Report - Q5 of 8 True or False - Under the "RP04/RP05/RP06 Breakdown" section of the Summarize Report only the contents of Error Register 1 are listed? @@sum_rpt_q5_at The statement is FALSE. If there are any error bits set in Error Register 2 they will be listed also. However, if none of the disk error summarized had a bit set in Error Register 2 then, of course, the contents of Error Register 2 would not be listed. If that's the case, then you're correct. @@sum_rpt_q5_af That is correct. Summarize does not try to hide information. However, because the report was designed so that it could be displayed on a terminal (i.e., 72 columns), the contents of Error Register 2 are listed below the contents of Error Register 1. The purpose of the question was to point that out because, at a glance, you might think that Error Register 2 was part of a different summary. @@sum_rpt_q6 Summarize Report - Q6 of 8 True or False - For the most part the Summarize Report is easy to read and understand? @@sum_rpt_q6_at We're glad that you're satisfied. However, if have any suggestion or ideas that will improve the format or content of the report please use the FEEDBACK feature on the Main Course Menu to let us know. @@sum_rpt_q6_af OK. Changing the report format is a relatively easy task. If you would take the time to let us know how the report could be improved we'll do our best to make the changes in the next release. You will find our address listed under FEEDBACK on the Main Course Menu. @@sum_rpt_q7 Summarize Report - Q7 of 8 True or False - In a Summarize Report, asterisks (***) will be printed if a number exceeds the maximum digits for a field? @@sum_rpt_q7_at That is correct. 
Each asterisk represents one of the character positions set aside for a numeric value (including the decimal point, if the number is decimal). In other words, if three spaces were set aside for a value (say 99.), then three asterisks (***) will be printed should the value exceed 99.

@@sum_rpt_q7_af
The statement is TRUE. The number of digits that can be printed is limited to the space available in the report (i.e., 72 columns). Thus, there is always a possibility that the number of digits necessary to report a count will exceed the available space. When such a case occurs, a string of asterisks (***) will be printed.

@@sum_rpt_q8
Summarize Report - Q8 of 8

True or False - The following Summarize report indicates that DP160 experienced 5 errors: 2 Hard Errors and 3 Soft Errors?

RP04/RP05/RP06 Breakdown:
Error Register 1
                                D U O D W I A H H E W F P R I I
                                C N P T L A O C C C C E A M L L
                                K S I E E E E R E H F R R R R F
                                              C
S/N 1957 DP160 H 1. 1.
S 3.

@@sum_rpt_q8_at
The statement is FALSE. You cannot determine how many Hard and Soft errors a device experienced by looking at the Breakdown section because (and this is important to remember) the Breakdown section indicates the number of times the error bit was set when Hard errors occurred, and the number of times the error bit was set when Soft errors occurred. The following RP04/RP05/RP06 Summary, taken from the same Summarize report as the Breakdown, bears this out. It indicates that DP160 experienced a total of 4 errors: 1 Hard and 3 Soft.

RP04/5/6 Summary:
                     Hard   Soft
S/N 1957  DP160       1.     3.

The point is: don't be tricked into thinking that the system had more errors than it actually had. When you want to know the total number of errors experienced by a Channel or a Device, go by the Summary, NOT the Breakdown.

@@sum_rpt_q8_af
That is correct. The Breakdown reflects the number of times each bit was set during Hard and Soft errors.
If you want to know the total number of Hard and Soft errors for a given Channel or Device, refer to the Summaries.

@@3.4.2.1.
Well, that's it. You have just completed the Summarize Report section of Instruct. Assuming that you have also completed the Dialog section, you should feel that you are a qualified Summarize user. If for some reason you do not agree, or again, if you have any ideas or suggestions that will make either Instruct or Summarize a better product, please let us know. You will find our mailing address listed under FEEDBACK on the Course Menu.

Press the RETURN key to return to the Spear Library Menu.

@@SUMMARIZE INPUT System Event File ___. .----------. ! ! Event ! !___! File !____ Summary Report Retrieve ! ! Summary ! (binary) File ___! !__________! INPUT PROCESS OUTPUT

Summarize reads the specified event file, summarizes its contents and produces a report file. The contents are summarized by: event code, STOPCODE or BUGxxx code types, front-end reloads, channel errors, disk errors and magtape errors.

@@sum_dia_qx
Summarize Dialog - Qx of x

True or False -

@@sum_dia_qx_at
That is correct. The statement is FALSE

@@sum_dia_qx_af
That is correct. The statement is TRUE

@@sum_rpt_qx
Summarize Report - Qx of x

True or False -

@@sum_rpt_qx_at
That is correct. The statement is FALSE

@@sum_rpt_qx_af
That is correct. The statement is TRUE

@@3.5.1.
Spear Library Applications

The Spear Library can be used in conjunction with either the Systematic Substitution Troubleshooting Approach, or the Formal Troubleshooting Approach to isolate the cause of intermittent failures.

@@3.5.1.A.
The first thing you want to do is ensure that Summarize is run on a daily basis. The best way to do this is to run it via a daily Batch job. If you are not sure how to do that, you can ask an experienced operator to give you a hand, or, if you're on a TOPS-20 system, you can try using this Batch Control File:

    @SUBMIT SPEAR /TIME:30 /AFTER:TODAY  ! Resubmit SPEAR again tomorrow.
    @RENAME *.RPT *.RPO                  ! Rename yesterday's report file.
    @SPEAR                               ! Run SPEAR.
    *SUMMARIZE /GO                       ! Summarize yesterday's errors.
    *EXIT                                ! Then leave.
    @IF (ERROR)                          ! Continue even if there's an error.
    @PRINT *.RPT /NOTE:"SPEAR - F-S"     ! Print two copies of the report:
    @PRINT *.RPT /NOTE:"SPEAR - OPER"    ! one for FS and one for Operations.

@@3.5.1.B.
Or, if you're on a TOPS-10 system, you can try using this Control File:

    .SUBMIT SPEAR /TIME:30 /AFTER:23:59  ! Resubmit Spear again tomorrow.
    .R SPEAR                             ! Run Spear.
    *SUMMARIZE /GO                       ! SUMMARIZE yesterday's errors.
    *EXIT                                ! Then leave.
    .IF (ERROR)                          ! Continue even if there's an error.
    .PRINT *.RPT /NOTE:"SPEAR - SITE"    ! Print two copies of the report:
    .PRINT *.RPT /NOTE:"SPEAR - F-S"     ! one for FS and one for the Site.
    .RENAME *.RPD = *.RPT                ! Rename today's report so that it
                                         ! won't be printed again tomorrow.

@@3.5.1.C.
Once you have the Batch File running you can use the daily reports to monitor the overall performance of the system. If the error rate for a particular device or subsystem starts to go up, you will see it reflected in the various summaries and histograms.

@@3.5.1.D.
Next, a few hours before you get the system for routine maintenance, submit the last seven days or so of the event file for summarization. Allow yourself about an hour to look over the report and decide on a fault isolation strategy.

For example, suppose the report indicates that, among other things:

    DP140 reported 5 recoverable Index Errors while PS1: was mounted.

Intermittent Index Errors are generally caused by either a faulty Servo Track or a faulty Index Module. So, during the maintenance period, you could swap the Index module in DP140 with the Index module in another drive (let's say DP220). Then, when you return the system to operations, you could ask that they move PS1: to a different drive (perhaps DP110).

The rest is a matter of "wait and watch". You do the waiting and you use Summarize (on a daily or weekly basis) to do the watching.

@@3.5.1.E.
1. If the problem moves to DP220, then you know that the Index module was the cause of the failure.

2. If the problem moves to DP110, then you know that the medium (PS1:) was the cause of the failure.

3. If the problem does not move, then you know that the cause was neither the Index module nor the medium. So, the next chance you get, put everything back the way it was and try something else. Sooner or later the report is bound to reflect the fact that you have identified and either moved or eliminated the cause of the problem.

@@3.5.1.F.
When used in this manner SPEAR becomes a very powerful troubleshooting tool. The principle is simple. If you move a faulty component from one piece of equipment to another, the error symptoms will move with it. If they don't, then at least you know what the problem is not.

This particular isolation technique was developed, during product load test, by the South Massachusetts Field Service Office. If you can come up with any neat ways of using the Spear Library to simplify system maintenance, please let us know. We'd be glad to try to include them in this Application Section. You'll find our address listed under Feedback on the main course menu.

@@3.6.0
The KLERR function provides expanded reporting of the KL10 function reads supplied by the Front-End on a monitor crash. SPEAR can be used to generate detailed reports and/or summaries of KLERR data blocks. You can always get a summary, but you must select one of three formats if you want a detailed report of each event.

@@3.6.1
The following summary options will be available:

    o ALL -- This will result in a complete listing containing the number of times each signal was true and false.

    o ERRORS-ONLY -- This will result in a single-page list containing the number of times an error signal was true and the number of times it was false.

    o NONE -- This will result in no summary at all.
@@3.6.2 The following report format options will be available: o SUMMARY-ONLY -- This will result in no entry-by-entry output. Only the final summary of signals will be printed. o FULL -- The result will be a set of detailed reports that list all of the registers and signals (true or false) as well as fields. o TRUE-SIGNALS -- The result will be a set of detailed reports that list all of the registers but only the "true" signals and not the fields. o CRAM-BAD-WORD -- The result will be a set of reports, consisting of one line for each record which included a CRAM parity error. This line will report the CRAM location and contents. @@3.6.3 The following output formats will be available for the CRAM word: o MICROCODE -- This format is used to compare the bad cram word with the microcode listing. o OCTAL -- This format matches the one shown in the KL10 Maintenance Handbook and can help isolate the failing cram module. o TRACON -- Used to compare with "TRACON" snapshots. @@KLERR END This concludes the KLERR section of the course. We hope you found it useful. Also, if you have any comments about this section please get in touch with us. Our address is found under FEEDBACK on the main course menu. @@4.0. The Guaranteed Uptime Program is a service that allows you and DIGITAL to work together to select and maintain the highest level of reliability for your system. Together you and DIGITAL determine the percentage of Uptime your site requires, from 96% to 99%. Uptime is defined as any time the system is NOT down - with downtime defined as: (1) that time within the hours of contract coverage when the system is turned over to DIGITAL for corrective maintenance due to operating system malfunction resulting in a system crash and failure to restart. (2) Failure of DIGITAL-supplied hardware which in your opinion makes the system unavailable for use. @@4.0.A. 
The NOTIFY program and the SPEAR function COMPUTE are the two programs that provide the tools to monitor the operation of the system and calculate the statistics needed to measure uptime. NOTIFY is the program that allows you to keep the current contract coverage in a file known as the contract file. The NOTIFY program also allows you to keep an outage log that contains the date and time you report the system inoperable and the date and time you accept the system back from DIGITAL as being fixed. When you run NOTIFY, you input two types of information: (1) The date and time you notified DIGITAL that the system was down and the date and time DIGITAL returned the repaired system to you (2) The number of hours a day that you have DIGITAL maintenance coverage. The NOTIFY program then creates a binary file in your area called NOTIFY.SYS. This is the file COMPUTE uses to produce the system uptime statistics. @@4.0.B. The NOTIFY program contains three modes: DISPLAY PURGE UPDATE The DISPLAY mode allows you to translate NOTIFY.SYS into ASCII so you can display all or part of the outage log or contract file. The PURGE mode allows you to delete a portion of the data base in NOTIFY.SYS, either from the contract file or from the outage log. The UPDATE mode allows you to write log entries or to insert or modify contract coverage into NOTIFY.SYS. @@4.0.C. To collect the data needed to measure uptime, do the following: 1. Run NOTIFY to establish a contract file containing the number of hours you have DIGITAL coverage for corrective maintenance. 2. When you determine that the system is inoperable, call DIGITAL to report the system-down condition and turn your system over to DIGITAL for service. 3. When the system is returned to you, run NOTIFY from the same directory containing the contract file to log: a) reported time (the date and time you notified DIGITAL). b) accepted time (the date and time DIGITAL returned the system to you). 4. 
After collecting 13 weeks of data run COMPUTE from the same directory that you have been running NOTIFY. @@4.0.D. To run the NOTIFY program, type one of the following: $ RUN SYS$SYSTEM:NOTIFY on VAX/VMS, @NOTIFY on TOPS-20, .R NOTIFY on TOPS-10. NOTIFY responds with the following prompt: NOTIFY> At this point, as well as after any other prompt, you can type ? or HELP to get detailed information on both the prompt and on acceptable responses. Type DISPLAY if you want to check the outage log or if you want to check the contract. Type UPDATE if you want to enter or revise contract coverage, or if you want to report an outage. Type PURGE if you want to delete entries from either the contract file or from the outage log. @@4.0.E. The NOTIFY program and the SPEAR function COMPUTE look for the NOTIFY.SYS file in your default directory. If more than one person will be using NOTIFY and COMPUTE, you may want to agree on where the NOTIFY.SYS file will reside. Or you may want to change the location of NOTIFY.SYS. To change the location of NOTIFY.SYS, use a text editor to modify the file called NOTIFY.SPE. You can modify the file specification for NOTIFY.SYS to specify a specific device and directory, or you can even change the name of the file itself. Both NOTIFY and COMPUTE will use this file specification. For a more detailed explanation of the NOTIFY program refer to the GUIDE TO MEASURING UPTIME document. @@GUP END This concludes the Guaranteed Uptime Program/NOTIFY section of the course. We hope you found it useful. Also, if you have any comments about this section please get in touch with us. Our address is found under FEEDBACK on the main course menu. @@ @@rec_alg Recovery Algorithms - Most operating systems have some sort of algorithm or procedure for error recovery. This section of Instruct explains the algorithms used by TOPS-10 and TOPS-20 to recover from disk read errors. @@R.T.rec_alg STOP - You are moving in a reverse direction through the course. 
You are about to back into the Introduction to the Recovery Algorithms.

@@rec_menu
Disk Read Error Recovery Algorithms

Topic Menu

    0. Introduction
    1. TOPS-10 Disk Recovery Algorithm
    2. TOPS-20 Disk Recovery Algorithm

@@t10_dsk_rec_alg
TOPS-10 RP04/05/06 Disk Read Error Recovery Algorithm

TOPS-10 and TOPS-20 use a similar algorithm to recover from disk read data errors. The algorithm involves 31 retry attempts.

Under TOPS-10, if an ECC correctable error is detected during a read header or data operation the following occurs:

    1. The transfer is terminated.
    2. The software reconstructs the data using the calculated ECC value.
    3. The transfer is restarted beginning at the next sector (i.e., the sector following the sector in error).

If the read data error is not ECC correctable, however, the following recovery algorithm is invoked.

@@t10_dsk_rec_alg_a
     1. Non (ECC) recoverable read error
     2. Repeat read operation (attempt ECC correction)
     3. Repeat read operation (attempt ECC correction)
     4. Repeat read operation (attempt ECC correction)
     5. Repeat read operation (attempt ECC correction)
     6. Repeat read operation (attempt ECC correction)
     7. Repeat read operation (attempt ECC correction)
     8. Repeat read operation (attempt ECC correction)
     9. Repeat read operation (attempt ECC correction)
    10. Repeat read operation (attempt ECC correction)
    11. Repeat read operation (attempt ECC correction)
    12. Repeat read operation (attempt ECC correction)
    13. Repeat read operation (attempt ECC correction)
    14. Repeat read operation (attempt ECC correction)
    15. Repeat read operation (attempt ECC correction)
    16. Repeat read operation (attempt ECC correction)
    17. Repeat read operation (attempt ECC correction)

Next, offset is tried.

@@t10_dsk_rec_alg_b
    Offset heads (+400 microinches if RP04/05, +200 if RP06).
    18. Repeat read operation (attempt ECC correction).
    19. Repeat read operation (attempt ECC correction).

    Offset heads (-400 microinches if RP04/05, -200 if RP06).
    20. Repeat read operation (attempt ECC correction).
    21. Repeat read operation (attempt ECC correction).

    Offset heads (+800 microinches if RP04/05, +400 if RP06).
    22. Repeat read operation (attempt ECC correction).
    23. Repeat read operation (attempt ECC correction).

    Offset heads (-800 microinches if RP04/05, -400 if RP06).
    24. Repeat read operation (attempt ECC correction).
    25. Repeat read operation (attempt ECC correction).

@@t10_dsk_rec_alg_c
    Offset heads (+1200 microinches if RP04/05, +600 if RP06).
    26. Repeat read operation (attempt ECC correction).
    27. Repeat read operation (attempt ECC correction).

    Offset heads (-1200 microinches if RP04/05, -600 if RP06).
    28. Repeat read operation (attempt ECC correction).
    29. Repeat read operation (attempt ECC correction).

    Return to center line.
    Set Error Correction Inhibit (ECC INHIBIT = 1)
    30. Repeat read operation.

    Reset Error Correction Inhibit (ECC INHIBIT = 0)
    31. Repeat read operation (attempt ECC correction).

If all 31 retries are unsuccessful, then the read error is defined as non-recoverable (Hard) and an entry is made in the structure's BAT block.

@@t20_dsk_rec_alg
TOPS-20 RP04/5/6 Disk Read Error Retry Algorithm

TOPS-10 and TOPS-20 use a similar algorithm to recover from disk read data errors. The algorithm involves 31 retry attempts. If any of the retry attempts are successful, then the error is defined as Soft (recoverable) and the system continues in a normal manner. If, however, all 31 retries are unsuccessful, then the error is defined as Hard (non-recoverable) and the system takes the appropriate action.

@@t20_dsk_rec_alg_a
The following details each of the 31 steps in the disk read error retry algorithm. Assume that a read operation was initiated and a read data error (DCK) was detected.

The first three retries do not attempt ECC correction.

    1. Repeat read operation. Do not attempt ECC correction.
    2. Repeat read operation. Do not attempt ECC correction.
    3. Repeat read operation. Do not attempt ECC correction.
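The 31-step TOPS-20 schedule that this part of the course walks through can be summarized in a short Python sketch. It is only an illustration built from the step descriptions in these frames, not monitor code; the function name and the tuple layout are hypothetical, and the "ECC Hard = zero" condition on retries 4 through 16 is not modeled.

```python
# Head offsets in microinches, as (RP04/RP05 value, RP06 value) pairs,
# in the order the frames list them.
OFFSETS = [(400, 200), (-400, -200), (800, 400),
           (-800, -400), (1200, 600), (-1200, -600)]


def tops20_retry_schedule(drive="RP06"):
    """Return (step, offset, ecc) tuples for the 31 retries.

    offset is the head offset in microinches (0 means centerline);
    ecc is True when ECC correction is attempted on that retry.
    """
    col = 1 if drive == "RP06" else 0  # RP04 and RP05 share one column
    steps = []
    # Retries 1-3: plain re-read, no ECC correction.
    steps.extend([(0, False)] * 3)
    # Retries 4-16: re-read with ECC correction.
    steps.extend([(0, True)] * 13)
    # Retries 17-28: two reads at each of the six offsets, with ECC.
    for pair in OFFSETS:
        steps.extend([(pair[col], True)] * 2)
    # Retry 29: back at centerline, with ECC.
    steps.append((0, True))
    # Retries 30-31: Error Correction Inhibit set, per the step list.
    steps.extend([(0, False)] * 2)
    return [(n, off, ecc) for n, (off, ecc) in enumerate(steps, start=1)]


for step in tops20_retry_schedule("RP06"):
    print(step)
```

Writing the schedule as data makes it easy to check that the groups of 3, 13, 12, 1, and 2 retries add up to the 31 attempts the text describes.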
@@t20_dsk_rec_alg_b
The next 13 retries will attempt ECC correction if ECC Hard = zero (0).

     4. Repeat read operation. Attempt ECC correction.
     5. Repeat read operation. Attempt ECC correction.
     6. Repeat read operation. Attempt ECC correction.
     7. Repeat read operation. Attempt ECC correction.
     8. Repeat read operation. Attempt ECC correction.
     9. Repeat read operation. Attempt ECC correction.
    10. Repeat read operation. Attempt ECC correction.
    11. Repeat read operation. Attempt ECC correction.
    12. Repeat read operation. Attempt ECC correction.
    13. Repeat read operation. Attempt ECC correction.
    14. Repeat read operation. Attempt ECC correction.
    15. Repeat read operation. Attempt ECC correction.
    16. Repeat read operation. Attempt ECC correction.

@@t20_dsk_rec_alg_c
The next 12 retries attempt offset and ECC correction. The first offset value listed is used for RP04s and RP05s. The second offset value listed is used for RP06s.

    17. Offset (+400/+200). Repeat read operation. Attempt ECC correction.
    18. Repeat read operation at this offset. Attempt ECC correction.
    19. Offset (-400/-200). Repeat read operation. Attempt ECC correction.
    20. Repeat read operation at this offset. Attempt ECC correction.
    21. Offset (+800/+400). Repeat read operation. Attempt ECC correction.
    22. Repeat read operation at this offset. Attempt ECC correction.
    23. Offset (-800/-400). Repeat read operation. Attempt ECC correction.
    24. Repeat read operation at this offset. Attempt ECC correction.
    25. Offset (+1200/+600). Repeat read operation. Attempt ECC correction.
    26. Repeat read operation at this offset. Attempt ECC correction.
    27. Offset (-1200/-600). Repeat read operation. Attempt ECC correction.
    28. Repeat read operation at this offset. Attempt ECC correction.

@@t20_dsk_rec_alg_d
The final three retries are a last-ditch effort to get the data.

    29. Return to centerline. Repeat read operation. Attempt ECC correction.
    30. Set Error Correction Inhibit. Repeat read operation.
    31. Set Error Correction Inhibit. Repeat read operation.

If all 31 retries are unsuccessful, then the read error is defined as non-recoverable (Hard) and an entry is made in the structure's BAT block.

@@dialog_change
The following dialog changes must be made to all SPEAR version 1.x command and control files in order for them to operate under SPEAR version 2.0. Although the examples use TOPS-20 style commands, the changes apply to the TOPS-10 and VMS versions of SPEAR version 2.0 as well.

Retrieve: The only changes in the Retrieve dialog from version 1.x to version 2.0 are in the "Selection type" "Error" and "NonError" areas. However, there are no changes for a "Selection type" of "Error" "All".

@@dialog_change_a
The following example illustrates changes in Retrieve "Selection type" "Error".

    SPEAR v1.x      SPEAR v2.0      Comments
    __________      __________      ________
    *Error          *Error          Selection type
    *Disk           *Disk           Device category
    *RP06           *RP06           Specific device(s)
                    *All            Device error type
    *Finished       *Finished       End device selection

To retrieve the events for a specific device error type, replace "*All" in the version 2.0 dialog above with one or more device error types. For example, *Software, Bus, Channel-controller.

@@dialog_change_b
The following example illustrates changes in Retrieve "Selection type" "NonError".

    SPEAR v1.x      SPEAR v2.0                    Comments
    __________      __________                    ________
    *NonError       *Stat, Diag, Config, Other    Device Category
                    *All                          Device Selection

To retrieve the events for a specific device or class of device, replace the "*All" in the version 2.0 dialog above with one of the following command sequences:

    *Disk           *Disk                 Device category
    *All            *RA60, RA80, RA81     Specific device(s)
    *Finished       *Finished             End device selection

@@dialog_change_c
The same functionality in Summarize may be maintained by changing the version 1.x dialog to the version 2.0 dialog below.

    SPEAR v1.x          SPEAR v2.0          Comments
    __________          __________          ________
    @SPEAR              @SPEAR              Run SPEAR
    *Summarize          *Summarize          Invoke Summarize
    *SERR:ERROR.SYS     *SERR:ERROR.SYS     Event file
                        *All                Device category
    *Earliest           *Earliest           Time from
    *Latest             *Latest             Time to
                        *Yes                Error distribution
    *DSK:SUMMAR.RPT     *DSK:SUMMAR.RPT     Report to
    */Go                */Go                Start processing

@@dialog_change_d
To get summaries for a specific device or class of device, replace the "*All" in the version 2.0 dialog above with one of the following command sequences:

    *Disk           *Disk                 Device category
    *All            *RA60, RA80, RA81     Specific device(s)
    *Finished       *Finished             End device selection

To suppress the Error Distributions, change the "*Yes" to "*No" in the version 2.0 dialog above.

@@dialog_change_e
There are no dialog changes in Compute.

@@