The DESCHALL
clients with version numbers ending in "dk001
",
"dk002
", and "dk003
" use a
bit-parallel key-search method some refer to as "bitslicing."
I'll try to explain the general technique briefly, and talk a bit
about my particular implementation of it.
The basic idea is that if you have a CPU that operates on N-bit integers, you can treat it like N processors, each operating on single bits. So for a 32-bit CPU, you get 32 "virtual processors", and for a 64-bit CPU, you get 64. The parallel machine you're simulating is of the "SIMD" (single instruction, multiple data) model; that is, all the "processors" execute the same instruction at each step, but they operate on different data.
Now, this certainly doesn't mean you can do everything 32 or 64 times as fast, because your virtual processors only work on a single bit at a time, so it takes more steps for them to accomplish most tasks. However, the calculations necessary to compute DES are well-suited to this kind of single-bit operation. In a bit-parallel (or "bitslice") DES keysearch approach, you test 32 or 64 different keys at once--one key per virtual processor. It turns out that you can check a block of 64 keys this way in substantially fewer steps than it would take to test the 64 keys one at a time, using a conventional approach.
The main reason that DES is faster to run using the bit-parallel technique is that computing the DES function requires doing lots of different bit-level operations, such as permuting the individual bits of a 64-bit chunk of data, or applying an arbitrary function to six bits of one word and inserting the four-bit result into another word. These operations are difficult to do efficiently on general-purpose CPUs, but when you're operating on one bit at a time, they become simpler.
Eli Biham was the first person (to my knowledge) to describe a bit-parallel DES implementation. You can read his paper, "A Fast New DES Implementation in Software", in the proceedings of Fast Software Encryption 4 (1997). It's also available in postscript from Biham's publications page.
Here's a little information about the bitslice DESCHALL clients I've written.
My current bitslice DESCHALL clients get roughly the following keysearch rates:
CPU | word size | speed (Mkeys/sec) |
---|---|---|
DEC Alpha 21164/333MHz | 64 bits | 5.32 |
UltraSPARC I/167MHz | 64 bits | 2.43 (*) |
MIPS R10000/194MHz | 64 bits | 2.11 |
DEC Alpha 21064A/233MHz | 64 bits | 1.81 |
PowerPC 601/80MHz | 32 bits | 0.26 |
For comparison, a 200MHz Pentium running Rocke's carefully tuned conventional client checks about 0.94 Mkeys/sec.
Aside from the bitslice technique itself, several important factors contribute to this performance:
Instead, we design for each S-box a digital circuit that computes its output using simple gates like AND, OR, and XOR. The program will simulate the operation of this circuit, using the corresponding CPU instructions. The time it takes to simulate the circuit will be roughly proportional to the number of gates.
In his paper, Biham describes a general method for constructing
these circuits, which yields an average of approximately 100 gates
per S-box after standard compiler optimizations are applied.
This is essentially what I used in the "dk001
"
DESCHALL clients.
Matthew Kwan has developed smaller S-box circuits--about 51 gates per S-box. He's made source code available; since he developed it in Australia, anyone can look at it. I intend to incorporate his improved S-box circuits when I release my code.
I've produced S-box circuits with an average of 66.1
gates per S-box; these are used in the "dk002
"
DESCHALL clients. The methods I used to construct them were
somewhat interesting in their own right; I may write about that
later.
Rocke has built even smaller S-box circuits. These
are used in the "dk003
" clients (UltraSPARC only,
so far).
Given these advantages, compilers for RISC machines can typically produce very good code from a bitslice program in C. With fewer registers available, or with conventional DES code, compilers have a much harder time competing with hand-generated assembly. I have tried building x86 bitslice code using a compiler, but it's at least 30% slower than Rocke's conventional client.
The MMX instruction set extensions offer some hope. They provide 64-bit operations, an extra ANDNOT instruction, and eight registers to work with, so an MMX bitslice client might work reasonably well. It will very likely require hand-tuned assembly, though, and we haven't tried it yet.