Carnegie Mellon
Computer Science Department
15-410 Triple-Fault Advice Page
Introduction
What should you do when you first experience
the dreaded "triple fault"? Here are some suggestions.
First, take a moment to smile. If you had been
running your kernel on a real PC, it would
have suddenly cleared the screen, beeped, and rebooted.
You would have no way to figure out what
had happened aside from comparing your source code
to a previous version and reasoning forward from there.
In 15-410, you have thousands of dollars of the world's
best PC simulation software at your beck and call.
So maybe as you smile you should remind yourself
how grateful you are to the Simics developers.
Now, take a second moment to smile.
A triple fault means you are about to learn something!
What's more, it's the kind of learning you can't
get from books. Maybe you will learn something about
the x86, something about OS structure, or something
about debugging strategy.
Q: OK, Mr. Miyagi, enough telling me how to polish the car.
Can you give me some actual advice about my triple fault already?
A: Ah, grasshopper, you are in such a hurry...
The Problem
Students are sometimes unsure about the definitions
of double and triple faults.
A regular fault (or exception) occurs when the processor
is unable to successfully execute an instruction to
completion. For example,
a page-fault exception occurs when one of the memory references
needed to execute an instruction can't be completed because
paging is enabled and the paging system decides that the
reference is to a page that does not exist,
or that the reference is not a legitimate type for the
page in question.
A double fault occurs when there is a fault,
but the processor cannot successfully execute to
completion the first instruction of the handler
for the primary fault;
this causes the processor to switch to running the
first instruction of the double-fault handler.
A triple fault occurs when the processor cannot
successfully execute to completion
the first instruction of the double-fault handler.
It is possible that a single underlying problem makes
it impossible to execute the three instructions in
question,
or it may be the case that each instruction cannot
be executed for its own personal reason.
Regardless, at this point the processor gives up on
the entire endeavor of executing instructions
and resets.
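The escalation described above can be written down as a toy model. This is only a sketch of the rule as stated on this page, not of everything real hardware checks; the names `escalate`, `primary_ok`, and `double_ok` are made up for illustration.

```c
#include <stdbool.h>

/* Toy model of the escalation rule described above.
 * All names here are hypothetical, chosen for illustration. */
enum outcome {
    HANDLED_BY_PRIMARY,  /* the handler for the original fault started running */
    HANDLED_BY_DOUBLE,   /* the double-fault handler started running */
    TRIPLE_FAULT_RESET   /* neither handler could start: the processor resets */
};

/* primary_ok: can the first instruction of the primary fault's handler
 *             be executed to completion?
 * double_ok:  can the first instruction of the double-fault handler
 *             be executed to completion? */
enum outcome escalate(bool primary_ok, bool double_ok)
{
    if (primary_ok)
        return HANDLED_BY_PRIMARY;
    if (double_ok)
        return HANDLED_BY_DOUBLE;
    return TRIPLE_FAULT_RESET;
}
```

Note that `escalate(false, false)` is the only input that yields `TRIPLE_FAULT_RESET`: two separate things had to be broken, even if one underlying bug broke both.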
Advice
Generally, a triple fault means that you
(the OS kernel author) told the machine to do
something it couldn't do. For example, you may
have asked it to execute a trap handler that
doesn't exist, asked it to run code which you
didn't give it permission to run, told it to
access memory you didn't give it permission to
access, etc. This condition
is very unlikely to result from a bug
in Simics or in the course-provided run-time
environment. So sending us mail of the form
"I have a triple fault, now what?"
isn't going to get you very far. We genuinely don't know
what's going wrong!
Sending us register dumps won't help much
either. As you will see below, it is unlikely
we will be able to point out one "incorrect"
bit. Things are right or wrong in context,
and you (the kernel author) are responsible for
defining the context and therefore what is right
and wrong.
Since you have access to all processor
state and all of memory, you should be able to
figure out first what was happening when the
processor ran into trouble and then what the
problem was.
When you encounter a fault or exception,
you must determine three key
pieces of information:
1. You must determine which instruction
   (not "line of code") can't be executed.
   Processors don't execute "lines of code";
   they execute instructions.
2. Based on the surrounding code,
   determine what that instruction was intended to accomplish.
   Generally speaking,
   the instruction was selected by a compiler,
   based on preconditions expected to be true before the
   instruction executes and on conditions desired to be
   true after it's done.
   It is possible you will need to look up a description
   of exactly what the instruction does.
3. You will need to determine exactly why the instruction
   could not be executed.
   Generally speaking, some precondition isn't true,
   or some input value is wrong.
   Depending on the exception,
   the processor may write down some information about
   this particular execution failure;
   you will need to consult appropriate documentation
   to find what information is available and how to
   decode it.
   It is unwise to guess at which precondition/value is the
   source of the problem.
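As one example of information the processor writes down: for a page fault, the x86 pushes an error code onto the handler's stack and stores the faulting linear address in %cr2. The sketch below decodes the low bits of that error code. The struct and function names are invented for illustration, and you should check the bit meanings against the Intel documentation for your configuration (bit 4, for instance, is only meaningful when the relevant execute-protection features are enabled).

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of a page-fault (exception 14) error code.
 * Names are hypothetical; the bit positions follow the x86 definition. */
struct pf_info {
    bool protection;    /* bit 0: 1 = protection violation, 0 = page not present */
    bool write;         /* bit 1: 1 = the access was a write, 0 = a read */
    bool user;          /* bit 2: 1 = the access was made in user mode */
    bool reserved_bit;  /* bit 3: 1 = a reserved bit was set in a paging entry */
    bool ifetch;        /* bit 4: 1 = the access was an instruction fetch */
};

struct pf_info decode_pf_error(uint32_t err)
{
    struct pf_info info;
    info.protection   = (err & 0x01u) != 0;
    info.write        = (err & 0x02u) != 0;
    info.user         = (err & 0x04u) != 0;
    info.reserved_bit = (err & 0x08u) != 0;
    info.ifetch       = (err & 0x10u) != 0;
    return info;
}
```

For instance, an error code of 0x6 decodes as a user-mode write to a page that was not present. Whether that is "wrong" depends on what your kernel intended, which is exactly the context you are responsible for defining.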
In more detail...
Use the 15-410
Simics Command Guide to collect as much
information as you can about the processor state.
Collectively the registers and stack trace should
suggest what was supposed to happen.
Note: though there is a command called preg,
it does not necessarily print all the registers...
Then try to figure out why you thought that
thing should have worked. Your answer will probably
consist of multiple stages, probably involving two
or three kinds of memory and maybe a privilege level.
Then try to figure out which part didn't work
and why not.
Don't be too eager to skip past information
which is "odd". If some lines on the stack trace
are incomplete, look at the parts which are
there and see if they tell you anything about why
the missing parts are missing.
For example, if the debugger
complains at you, think about why. If it says
"not in TLB", apply the steps of this list; that is,
ask yourself:
- what things should be in the TLB (translation lookaside buffer),
- how those things are supposed to get there,
- why you think the thing should have been there,
- how you could check to see why the thing didn't get there.
Actually solving the problem will probably be an iterative process.
You may need to change your test code, add trace
information to your kernel, come up with an innovative
breakpoint strategy, etc.
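One possible form of "trace information" is a small in-memory ring buffer that you examine from the debugger after things go wrong. The sketch below is only one way to do it, under stated assumptions: every name in it is invented, it is not part of any course-provided code, and it makes no attempt to be safe against interrupts or concurrent callers.

```c
#include <stdint.h>

#define TRACE_SLOTS 64u  /* a power of two, so the index can wrap with a mask */

/* One trace record. Both fields are caller-defined (hypothetical layout). */
struct trace_entry {
    uint32_t tag;    /* which trace point fired */
    uint32_t value;  /* one word of context, e.g. an address */
};

static struct trace_entry trace_buf[TRACE_SLOTS];
static uint32_t trace_next;  /* total number of records written so far */

/* Store a record in the next slot, overwriting the oldest once full. */
void trace_record(uint32_t tag, uint32_t value)
{
    struct trace_entry *e = &trace_buf[trace_next & (TRACE_SLOTS - 1u)];
    e->tag = tag;
    e->value = value;
    trace_next++;
}
```

Because the buffer lives in ordinary kernel memory, it survives up to the moment the simulator stops, and the most recent entry is the one just before slot `trace_next & (TRACE_SLOTS - 1)`.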
Lots of faults are somehow related to memory.
Make sure you've gone over what we've given you related
to memory. For example, the textbook devotes several
pages to x86 virtual memory, and we have written a handout
devoted to that topic from a different angle. The
textbook also covers TLBs.
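To make the memory discussion concrete, here is a sketch of the two-level lookup the x86 performs for 32-bit paging with 4 KB pages, which is also how translations get into the TLB. It is a simulation under assumptions, not kernel code: "physical memory" is modeled as an array of 32-bit words (`phys_words[i]` holds the word at physical address `4*i`), all names are invented, and it ignores 4 MB pages and permission bits entirely.

```c
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT 0x1u         /* bit 0 of a directory or table entry */
#define FRAME_MASK  0xFFFFF000u  /* upper 20 bits: physical frame address */

struct va_parts {
    uint32_t dir_index;    /* bits 31..22: index into the page directory */
    uint32_t table_index;  /* bits 21..12: index into the page table */
    uint32_t offset;       /* bits 11..0:  byte offset within the page */
};

struct va_parts split_linear(uint32_t va)
{
    struct va_parts p;
    p.dir_index   = va >> 22;
    p.table_index = (va >> 12) & 0x3FFu;
    p.offset      = va & 0xFFFu;
    return p;
}

/* Walk the simulated tables. Returns true and stores the physical address
 * in *pa_out only if both the directory entry and the table entry are
 * present; otherwise returns false and leaves *pa_out untouched. */
bool translate(const uint32_t *phys_words, uint32_t cr3,
               uint32_t va, uint32_t *pa_out)
{
    struct va_parts p = split_linear(va);

    uint32_t pde = phys_words[((cr3 & FRAME_MASK) >> 2) + p.dir_index];
    if (!(pde & PTE_PRESENT))
        return false;

    uint32_t pte = phys_words[((pde & FRAME_MASK) >> 2) + p.table_index];
    if (!(pte & PTE_PRESENT))
        return false;

    *pa_out = (pte & FRAME_MASK) | p.offset;
    return true;
}
```

Doing this walk by hand in the debugger, starting from the value in %cr3, is one way to check why a translation you expected is missing.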