CS 495 Project Proposal

Group members

Margaret DeLap
mid@andrew.cmu.edu
James Shuma
shuma@andrew.cmu.edu

Project web page

http://www.cs.cmu.edu/~mid/495

Project description

We plan to write multiple parallel implementations of the Quine-McCluskey algorithm for exact boolean logic minimization. The goal of the project is to examine the tradeoffs among different parallel programming schemes.

The algorithm

The QM algorithm takes as input a boolean function of n variables and returns a minimized list of implicants which will implement the function. For example, if we are given the following function:

f = ¬a¬b¬c¬d + ¬ab¬c¬d + ¬a¬b¬cd + ¬ab¬cd + ab¬cd + ¬abcd + abcd + a¬bcd + abc¬d + a¬bc¬d

with the following Karnaugh map:

     cd  00 01 11 10
         -- -- -- --
  ab 00| 1  1  0  0
     01| 1  1  1  0
     11| 0  1  1  1
     10| 0  0  1  1

Quine-McCluskey will tell us that the function can be implemented as: f = ¬a¬c + bd + ac
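
This result is easy to check mechanically. The short C program below is illustrative only (it is not part of the planned implementation); it compares the original sum-of-products form against the minimized form for all sixteen assignments of a, b, c, and d:

#include <stdio.h>

/* Compare the original ten-minterm form of f with the minimized form
 * reported by Quine-McCluskey, over every assignment of (a, b, c, d). */
int main(void)
{
    int a, b, c, d, mismatch = 0;

    for (a = 0; a <= 1; a++)
    for (b = 0; b <= 1; b++)
    for (c = 0; c <= 1; c++)
    for (d = 0; d <= 1; d++) {
        int f_orig =
            (!a && !b && !c && !d) || (!a &&  b && !c && !d) ||
            (!a && !b && !c &&  d) || (!a &&  b && !c &&  d) ||
            ( a &&  b && !c &&  d) || (!a &&  b &&  c &&  d) ||
            ( a &&  b &&  c &&  d) || ( a && !b &&  c &&  d) ||
            ( a &&  b &&  c && !d) || ( a && !b &&  c && !d);
        int f_min = (!a && !c) || (b && d) || (a && c);

        if (f_orig != f_min) {
            printf("mismatch at a=%d b=%d c=%d d=%d\n", a, b, c, d);
            mismatch = 1;
        }
    }
    printf(mismatch ? "forms differ\n" : "forms agree on all 16 inputs\n");
    return mismatch;
}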

The algorithm is deceptively simple to run by hand, but implementing it efficiently is quite another matter, especially when considering the different ways of implementing it on a parallel machine. The basic operation of the algorithm is this: construct a list of implicants; for a list of k implicants, perform Theta(k²) pairwise operations, producing a new list of implicants; then repeat the operation on that new list. For a function of n variables, at most n further lists can be generated beyond the initial one, since each pass eliminates one more variable.

In more detail, the algorithm works as follows: First, generate a list of implicants. In our example from above, this is:

0000
0100
0001
0101
1101
0111
1111
1011
1110
1010
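
One natural way to store implicants, which we expect to pin down during the implicant-code phase of the plan below, is as a pair of bit masks: one holding the fixed 0/1 values and one marking which positions have become don't-cares. The names and layout in this sketch are placeholders, not design decisions; the initial list above would be encoded roughly as:

/* A possible implicant representation (field names are placeholders):
 * `bits`   holds the 0/1 value of each fixed position,
 * `dashes` has a 1 wherever a position has been reduced to '-'.
 * Bit 3 = a, bit 2 = b, bit 1 = c, bit 0 = d. */
struct implicant {
    unsigned bits;
    unsigned dashes;
};

/* The initial list from the example; no position is a don't-care yet,
 * so every dash mask is 0. */
static const struct implicant initial_list[] = {
    { 0x0, 0 },  /* 0000 */
    { 0x4, 0 },  /* 0100 */
    { 0x1, 0 },  /* 0001 */
    { 0x5, 0 },  /* 0101 */
    { 0xD, 0 },  /* 1101 */
    { 0x7, 0 },  /* 0111 */
    { 0xF, 0 },  /* 1111 */
    { 0xB, 0 },  /* 1011 */
    { 0xE, 0 },  /* 1110 */
    { 0xA, 0 },  /* 1010 */
};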

Next, find all the pairs of implicants which differ in exactly one position. For each such pair, add one implicant to a separate list, with the position where they differ replaced by a -, and mark the implicants that were used:

0000+ 0-00
0100+ 000-
0001+ 010-
0101+ 0-01
1101+ -101
0111+ 01-1
1111+ 11-1
1011+ -111
1110+ 1-11
1010+ 111-
      101-
      1-10

Note that implicants which have already been used can still be used in other pairs; this is essential to the correctness of the algorithm, and corresponds to overlapping implicants in a Karnaugh map. The next step is to generate another implicant list in the same way:

0000+ 0-00+ 0-0-
0100+ 000-+ 0-0-
0001+ 010-+ -1-1
0101+ 0-01+ -1-1
1101+ -101+ 1-1-
0111+ 01-1+ 1-1-
1111+ 11-1+
1011+ -111+
1110+ 1-11+
1010+ 111-+
      101-+
      1-10+

Note that the pairs (0-00, 0-01) and (000-, 010-) both reduce to 0-0-; the same happens for several other pairs of implicants. Two choices for handling collisions like these are: add an implicant to a list only if it is not already present, or allow duplicates and discard them in the final output.
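
Using the placeholder representation from the sketch above, extended with a flag that plays the role of the + marks, a single combining pass might look roughly like the following; it uses the first collision-handling choice (only add an implicant if it is not already present):

#include <stddef.h>

/* Placeholder representation; bits at '-' positions are kept at 0. */
struct implicant {
    unsigned bits;    /* fixed 0/1 values              */
    unsigned dashes;  /* 1 wherever there is a '-'     */
    int used;         /* has been combined into a pair */
};

/* Two implicants combine if they have the same '-' positions and their
 * fixed bits differ in exactly one position. */
int try_combine(const struct implicant *x, const struct implicant *y,
                struct implicant *out)
{
    unsigned diff = x->bits ^ y->bits;

    if (x->dashes != y->dashes || diff == 0 || (diff & (diff - 1)) != 0)
        return 0;                     /* different dashes, or not exactly one bit */
    out->bits   = x->bits & y->bits;  /* the differing position becomes 0...      */
    out->dashes = x->dashes | diff;   /* ...and is recorded as a '-'              */
    out->used   = 0;
    return 1;
}

/* One pass: compare all pairs in `in`, mark the implicants that combine,
 * and append each new implicant to `out` unless it is already present
 * (collision-handling choice 1).  `out` must be able to hold up to
 * n*(n-1)/2 entries in the worst case. */
size_t combine_pass(struct implicant *in, size_t n, struct implicant *out)
{
    size_t i, j, k, count = 0;

    for (i = 0; i < n; i++) {
        for (j = i + 1; j < n; j++) {
            struct implicant merged;

            if (!try_combine(&in[i], &in[j], &merged))
                continue;
            in[i].used = in[j].used = 1;
            for (k = 0; k < count; k++)            /* skip duplicates */
                if (out[k].bits == merged.bits && out[k].dashes == merged.dashes)
                    break;
            if (k == count)
                out[count++] = merged;
        }
    }
    return count;
}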

The next step in the algorithm is to look once more for implicants which differ in only one position. However, there are none, so it is time to move on to the next phase of the algorithm: finding all the prime implicants, that is, the implicants which were never combined into a pair. These are the implicants which do not have a + beside them. In this example the set of prime implicants happens to be exactly the set of elements in the rightmost list, but in general there may be prime implicants in any of the lists, and all of them are used for the output. The list of prime implicants in this example is (0-0-, -1-1, 1-1-), meaning that the function can be expressed as ¬a¬c + bd + ac, as can easily be verified against the Karnaugh map above.
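
Tying the passes together, the uniprocessor version (plan item 3 below) could be organized along these lines; this is only a sketch on top of the placeholder combine_pass routine above, with storage management glossed over:

#include <stddef.h>

/* Placeholder representation and pass routine from the previous sketch. */
struct implicant { unsigned bits; unsigned dashes; int used; };
size_t combine_pass(struct implicant *in, size_t n, struct implicant *out);

/* Run combining passes until a pass produces no new implicants, then
 * collect everything that never acquired a '+' mark.  `lists[0]` holds
 * the n0 initial minterms; `lists` and `primes` are assumed to be large
 * enough for this sketch. */
size_t quine_mccluskey(struct implicant **lists, size_t n0, size_t max_lists,
                       struct implicant *primes)
{
    size_t counts[32];        /* assumes max_lists <= 32 in this sketch */
    size_t level, i, j, nprimes = 0;

    counts[0] = n0;
    for (level = 0; level + 1 < max_lists && counts[level] > 0; level++)
        counts[level + 1] = combine_pass(lists[level], counts[level],
                                         lists[level + 1]);

    /* Prime implicants may sit in any of the lists, not just the last. */
    for (i = 0; i <= level; i++)
        for (j = 0; j < counts[i]; j++)
            if (!lists[i][j].used)
                primes[nprimes++] = lists[i][j];

    return nprimes;
}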

Parallelization

The major opportunity for parallelism that we have seen so far is that the lists can be generated dynamically: if we assign a process to each list, each process can take a stream of implicants from the previous list and generate a stream of implicants for the next list. However, this is not sufficient to ensure good parallelism, for two main reasons. First, the number of implicants on the initial lists is usually much larger than the number on later lists, so if one process is assigned to each list, the later ones will be starved for work. Second, if there are more than n processors, they will not all be used if we simply assign a processor to each list, since there are at most n lists. The solution to these problems will be interesting to investigate.
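
To make the pipeline idea concrete, the rough POSIX-threads sketch below shows what one per-list stage might look like. The small thread-safe queue is only assumed here (it is exactly plan items 6 and 7; a possible implementation is sketched under the plan of attack below), duplicates are simply allowed to flow through to be discarded at output (the second collision-handling choice above), and none of the load-balancing problems just described are addressed:

#include <pthread.h>
#include <stddef.h>

/* Placeholder types and routines from the earlier sketches. */
struct implicant { unsigned bits; unsigned dashes; int used; };
int try_combine(const struct implicant *x, const struct implicant *y,
                struct implicant *out);

/* Assumed thread-safe queue interface (plan items 6 and 7). */
struct impl_queue;
void impl_queue_push(struct impl_queue *q, const struct implicant *im);
int  impl_queue_pop(struct impl_queue *q, struct implicant *im); /* 0 = closed and empty */
void impl_queue_close(struct impl_queue *q);

/* Per-stage state: stage k consumes list k-1 and produces list k. */
struct stage_ctx {
    struct impl_queue *in, *out;
    struct implicant  *seen;   /* implicants accumulated at this level    */
    size_t             count;  /* assumed to fit in the seen[] array      */
};

/* Pipeline stage (intended as a pthread_create entry point): stream
 * implicants in, compare each arrival against everything seen so far at
 * this level, and stream combinations out. */
static void *stage(void *arg)
{
    struct stage_ctx *ctx = arg;
    struct implicant  cur, merged;
    size_t i;

    while (impl_queue_pop(ctx->in, &cur)) {
        for (i = 0; i < ctx->count; i++) {
            if (try_combine(&ctx->seen[i], &cur, &merged)) {
                ctx->seen[i].used = 1;
                cur.used = 1;
                impl_queue_push(ctx->out, &merged);
            }
        }
        ctx->seen[ctx->count++] = cur;   /* remember it for later pairs */
    }
    impl_queue_close(ctx->out);          /* no more input for the next stage */
    return NULL;
}

Once every input queue has been closed and drained, the prime implicants are the unmarked entries left in each stage's seen array.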

Logistics

Plan of attack and schedule

1. Implicant code
Interface, data structure(s), and basic operations for/on implicants
2. Table code
Interface, data structure(s), and basic operations for/on tables and the main table of tables
3. Uniprocessor code
Correct implementation of the algorithm for 1 processor
4. Uniprocessor timing
5. Uniprocessor writeup
6. Parallel data structures
Such as queues, etc.
7. Locking-related code
8. Statically scheduled code
Correct implementation of the algorithm for more than 1 processor, with work split up statically
9. Statically scheduled code timing
10. Static writeup
11. Dynamically scheduled code*
Correct implementation of the algorithm for more than 1 processor, with work split up dynamically*
12. Dynamically scheduled code timing*
13. Dynamic writeup*
14. Other improvements
Optimizations to the code, to be determined after testing static and dynamic versions
15. Other timing
This will most likely occur more than once to try different optimizations
16. Writeup about other optimizations
17. Final writeup
18. Poster

*At present we do not know of a good way to do dynamic scheduling for this problem. In any case, items 11, 12, and 13 should achieve better load balancing than the original statically scheduled code, even if they do not do strictly dynamic scheduling.
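
For plan items 6 and 7, a minimal version of the kind of thread-safe queue assumed by the pipeline sketch above might look like the following (fixed capacity, POSIX threads; the names are placeholders, not design decisions):

#include <pthread.h>
#include <stddef.h>

struct implicant { unsigned bits; unsigned dashes; int used; };

#define IMPL_QUEUE_CAP 1024   /* fixed capacity keeps the sketch simple */

/* A bounded, mutex- and condition-variable-protected queue of implicants. */
struct impl_queue {
    struct implicant buf[IMPL_QUEUE_CAP];
    size_t head, tail, count;
    int closed;                       /* producer has no more implicants */
    pthread_mutex_t lock;
    pthread_cond_t  not_empty, not_full;
};

void impl_queue_init(struct impl_queue *q)
{
    q->head = q->tail = q->count = 0;
    q->closed = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

void impl_queue_push(struct impl_queue *q, const struct implicant *im)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == IMPL_QUEUE_CAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->buf[q->tail] = *im;
    q->tail = (q->tail + 1) % IMPL_QUEUE_CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Returns 1 and fills *im if an implicant was dequeued; returns 0 once
 * the queue has been closed and drained. */
int impl_queue_pop(struct impl_queue *q, struct implicant *im)
{
    int got = 0;
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->closed)
        pthread_cond_wait(&q->not_empty, &q->lock);
    if (q->count > 0) {
        *im = q->buf[q->head];
        q->head = (q->head + 1) % IMPL_QUEUE_CAP;
        q->count--;
        pthread_cond_signal(&q->not_full);
        got = 1;
    }
    pthread_mutex_unlock(&q->lock);
    return got;
}

void impl_queue_close(struct impl_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->closed = 1;
    pthread_cond_broadcast(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}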

Schedule (numbers in parentheses are the plan items above):

  Mar 19          Proposal due
  Mar 20-21       Implicant code (1)
  Mar 22*-24*     Table code (2)
  Mar 25-27       Uniprocessor code (3)
  Mar 28          Milestone #1; Uniprocessor timing (4)
  Mar 29*-30*     Uniprocessor writeup (5)
  Mar 31*-Apr 2*  Parallel data structures (6)
  Apr 3*-4*       Locking code (7)
  Apr 7*-9        Static code (8)
  Apr 10          Static timing (9)
  Apr 11          Milestone #2; Static writeup (10)
  Apr 14-15       Dynamic code (11)
  Apr 16          Dynamic timing (12)
  Apr 17          Dynamic writeup (13)
  Apr 21-25       Final code (14)
  Apr 26          Final timing (15)
  Apr 27          Final documentation (16)
  Apr 28-30       Final writeup (17)
  May 1-2         Poster session (18)

* = mid out of town

Milestones

For the first milestone, we expect to have the algorithm implemented for a uniprocessor and to have done some performance testing. This should include total time and computation time on all of the machines we will be using; even on multiprocessor machines, the code will use only one processor. The purpose is to have a baseline for comparison and to make sure the implementation is correct before it forks into different and more complex multiprocessor versions.
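
For the timing part of this milestone we anticipate nothing more elaborate than wrapping the computation in wall-clock measurements, roughly as in the sketch below (the routine name is a placeholder):

#include <stdio.h>
#include <sys/time.h>

/* Placeholder for the single-processor minimization routine. */
extern void run_quine_mccluskey(void);

int main(void)
{
    struct timeval start, end;
    double elapsed;

    gettimeofday(&start, NULL);
    run_quine_mccluskey();          /* computation to be timed */
    gettimeofday(&end, NULL);

    elapsed = (end.tv_sec - start.tv_sec)
            + (end.tv_usec - start.tv_usec) / 1e6;
    printf("computation time: %.3f s\n", elapsed);
    return 0;
}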

For the second milestone, we will have the parallel data structures, the locking code, and a correct statically scheduled parallel implementation, together with its timing results and writeup (plan items 6 through 10).

Literature search

Resources needed

Development will be done primarily on and for the NCSA SGI machines. We will also evaluate performance on machines with higher communication latencies (including the Intel cluster). Additionally, we will evaluate the performance on large numbers of processors on PSC machines, although it will be challenging to get good parallelism in this environment.

Individual runs will be kept to a few minutes at most. Hence, we should not require more than 40 hours of total compute time for the entire project (excluding development time in interactive mode).

No special commercial software will be required. Berkeley's espresso heuristic logic minimizer may be used for results verification and/or timing comparisons. We will write some Perl scripts and/or other tools to generate input files and test output.

No special resources are required for this project beyond those already available.

Getting started

To date we have accomplished the following:

There are no current obstacles to getting started immediately, and the only foreseeable obstacle to completion is the long average wait time for job submission on the available machines.