Carnegie Mellon
Computer Science Department
15-410 Triple-Fault Advice Page
Introduction
What should you do when you first experience
the dreaded "triple fault"? Here are some suggestions.
First, take a moment to smile. If you had been
running your kernel on a real PC, it would
have suddenly cleared the screen, beeped, and rebooted.
You would have no way to figure out what
had happened aside from comparing your source code
to a previous version and reasoning forward from there.
In 15-410, you have thousands of dollars of the world's
best PC simulation software at your beck and call.
So maybe as you smile you should remind yourself
how grateful you are to the guys in Sweden.
Now, take a second moment to smile.
A triple fault means you are about to learn something!
What's more, it's the kind of learning you can't
get from books. Maybe you will learn something about
the x86, something about OS structure, or something
about debugging strategy.
Q: Ok, Mr. Miyagi, enough telling me how to polish
the car, can you give me some actual
advice about my triple fault already?
A: Ah, grasshopper, you are in such a hurry...
Advice
Generally, a triple fault means that you
(the OS kernel author) told the machine to do
something it couldn't do. For example, you may
have asked it to execute a trap handler that
doesn't exist, asked it to run code which you
didn't give it permission to run, told it to
access memory you didn't give it permission to
access, etc. This condition
is very unlikely to result from a bug
in Simics or in the course-provided run-time
environment. So sending us mail of the form
"I have a triple fault, now what?"
isn't going to get you very far. We genuinely don't know
what's going wrong!
Sending us register dumps won't help much
either. As you will see below, it is unlikely
we will be able to point out one "incorrect"
bit. Things are right or wrong in context,
and you (the kernel author) are responsible for
defining the context and therefore what is right
and wrong.
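To make one of the examples above concrete: a "trap handler that doesn't exist" often shows up as an IDT entry that is not marked present, or that has a type or segment selector you didn't intend. If the processor can't deliver an exception, and then can't deliver the resulting double fault either, it gives up, and that is your triple fault. The following is a minimal sketch in C, assuming 32-bit protected mode and the standard 8-byte gate descriptor layout, of decoding the two 32-bit words of one IDT entry as you might read them out of memory. The names gate_info_t, decode_gate, and gate_looks_usable are made up for illustration and are not part of any course-provided code, and the "usable" test is only a heuristic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Decoded fields of one 32-bit protected-mode IDT gate descriptor.
 * (Hypothetical helper type, for illustration only.) */
typedef struct {
    uint32_t offset;    /* handler entry point */
    uint16_t selector;  /* code segment selector used to run the handler */
    uint8_t  type;      /* 0xE = 32-bit interrupt gate, 0xF = 32-bit trap gate */
    uint8_t  dpl;       /* privilege level required to invoke via "int n" */
    bool     present;   /* P bit */
} gate_info_t;

/* lo = first 4 bytes of the 8-byte entry, hi = second 4 bytes
 * (each read as a little-endian 32-bit word). */
gate_info_t decode_gate(uint32_t lo, uint32_t hi)
{
    gate_info_t g;
    g.offset   = (lo & 0x0000FFFFu) | (hi & 0xFFFF0000u);
    g.selector = (uint16_t)(lo >> 16);
    g.type     = (uint8_t)((hi >> 8) & 0xFu);
    g.dpl      = (uint8_t)((hi >> 13) & 0x3u);
    g.present  = ((hi >> 15) & 0x1u) != 0;
    return g;
}

/* Heuristic (an assumption, not a rule from the manual): the entry is
 * present, is a 32-bit interrupt or trap gate, and does not name the null
 * selector. Task gates (type 0x5) are legal too, but this sketch does not
 * expect to see them. */
bool gate_looks_usable(uint32_t lo, uint32_t hi)
{
    gate_info_t g = decode_gate(lo, hi);
    bool null_selector = (g.selector & 0xFFFCu) == 0;
    return g.present && (g.type == 0xEu || g.type == 0xFu) && !null_selector;
}

/* Usage example with made-up values: a present trap gate, DPL 0,
 * selector 0x0008, handler at 0x00104ABC. */
void gate_example(void)
{
    uint32_t lo = 0x00084ABCu;  /* selector 0x0008, offset bits 15..0 */
    uint32_t hi = 0x00108F00u;  /* offset bits 31..16, P=1, DPL=0, type=0xF */
    assert(gate_looks_usable(lo, hi));
    assert(!gate_looks_usable(0u, 0u));  /* an all-zero entry is not present */
}
```

A "usable" answer only says the entry is well formed. Whether the offset, selector, and privilege level are the ones you meant is still up to you, since you define the context.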
Since you have access to all processor
state and all of memory, you should be able to
figure out first what was happening when the
processor ran into trouble and then what the
problem was.
- Use the 15-410
Simics Command Guide to collect as much
information as you can about the processor state.
Collectively the registers and stack trace should
suggest what was supposed to happen.
Note: though there is a command called preg,
it does not print all the registers...
- Then try to figure out why you thought that
thing should have worked. Your answer will probably
consist of multiple stages, probably involving two
or three kinds of memory and maybe a privilege level.
- Then try to figure out which part didn't work
and why not.
- Don't be too eager to skip past information
which is "odd". If some lines on the stack trace
are incomplete, look at the parts which are
there and see if they tell you anything about why
the missing parts are missing.
For example, if the debugger
complains at you, think about why. If it says
"not in TLB", apply the steps of this list, i.e.,
ask yourself:
- what things should be in the TLB (translation lookaside buffer),
- how those things are supposed to get there,
- why you think the thing should have been there,
- how you could check to see why the thing didn't get there.
- This will probably be an iterative process.
You may need to change your test code, add trace
information to your kernel, come up with an innovative
breakpoint strategy, etc.
- Lots of faults are somehow related to memory.
Make sure you've gone over what we've given you related
to memory. For example, the textbook devotes several
pages to x86 virtual memory, and we have written a handout
devoted to that topic from a different angle. The
textbook also covers TLBs.
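When you ask yourself "what should be in the TLB, and how was it supposed to get there?", it can help to remember that on the x86 the hardware fills the TLB by walking your page tables. So one way to check is to do the same walk by hand for the address that caused trouble. Here is a minimal sketch in C, assuming classic 32-bit two-level paging with 4 KB pages (no PAE, and 4 MB pages reported rather than followed). The names walk_page_tables, walk_result_t, and read_phys32_fn are hypothetical, and the tiny fake "physical memory" exists only so the sketch can run on its own.

```c
#include <assert.h>
#include <stdint.h>

#define PG_PRESENT 0x001u       /* P bit in a PDE or PTE */
#define PG_4MB     0x080u       /* PS bit in a PDE (4 MB page) */
#define PG_FRAME   0xFFFFF000u  /* frame address bits of a PDE/PTE or CR3 */

typedef enum {
    WALK_OK,
    WALK_PDE_NOT_PRESENT,
    WALK_PTE_NOT_PRESENT,
    WALK_LARGE_PAGE             /* 4 MB mapping: not handled by this sketch */
} walk_status_t;

typedef struct {
    walk_status_t status;
    uint32_t pde;    /* raw page directory entry that was read */
    uint32_t pte;    /* raw page table entry (0 if never read) */
    uint32_t paddr;  /* translated address, valid only for WALK_OK */
} walk_result_t;

/* Hypothetical accessor: returns the 32-bit word at a physical address. */
typedef uint32_t (*read_phys32_fn)(uint32_t paddr);

walk_result_t walk_page_tables(uint32_t cr3, uint32_t vaddr,
                               read_phys32_fn read_phys32)
{
    walk_result_t r = { WALK_PDE_NOT_PRESENT, 0, 0, 0 };
    uint32_t pd_index = vaddr >> 22;             /* top 10 bits */
    uint32_t pt_index = (vaddr >> 12) & 0x3FFu;  /* middle 10 bits */

    r.pde = read_phys32((cr3 & PG_FRAME) + 4u * pd_index);
    if (!(r.pde & PG_PRESENT))
        return r;
    if (r.pde & PG_4MB) {
        r.status = WALK_LARGE_PAGE;
        return r;
    }

    r.pte = read_phys32((r.pde & PG_FRAME) + 4u * pt_index);
    if (!(r.pte & PG_PRESENT)) {
        r.status = WALK_PTE_NOT_PRESENT;
        return r;
    }

    r.paddr  = (r.pte & PG_FRAME) | (vaddr & 0xFFFu);
    r.status = WALK_OK;
    return r;
}

/* Toy physical memory covering 0x0000-0x2FFF, for trying the walk out. */
static uint32_t fake_mem[3 * 1024];

static uint32_t fake_read_phys32(uint32_t paddr)
{
    assert(paddr % 4 == 0);
    assert(paddr / 4 < sizeof fake_mem / sizeof fake_mem[0]);
    return fake_mem[paddr / 4];
}
```

To try it, place a page directory at fake physical address 0x1000 and a page table at 0x2000, set one PDE and one PTE with PG_PRESENT, and call walk_page_tables(0x1000, vaddr, fake_read_phys32). The returned status tells you which level of the walk stopped, and the raw pde and pte values show the permission bits you actually wrote, which may not be the ones you thought you wrote.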