CS 495 Project Proposal

Group members

Margaret DeLap
mid@andrew.cmu.edu
James Shuma
shuma@andrew.cmu.edu

Project web page

http://www.cs.cmu.edu/~mid/495

Project description

We plan to write multiple parallel implementations of the Quine-McCluskey algorithm for exact boolean logic minimization. The goal of the project is to examine the tradeoffs among different parallel programming schemes.

The algorithm

The QM algorithm takes as input a boolean function of n variables and returns a minimized list of implicants which will implement the function. For example, if we are given the following function:

f = ¬a¬b¬c¬d + ¬ab¬c¬d + ¬a¬b¬cd + ¬ab¬cd + ab¬cd + ¬abcd + abcd + a¬bcd + abc¬d + a¬bc¬d

with the following Karnaugh map:

     cd  00 01 11 10
         -- -- -- --
  ab 00| 1  1  0  0
     01| 1  1  1  0
     11| 0  1  1  1
     10| 0  0  1  1

Quine-McCluskey will tell us that the function can be implemented as: f = ¬a¬c + bd + ac
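
This result is easy to check mechanically. The short C program below is illustrative only (it is not part of the planned implementation); it compares the original sum-of-products form against the minimized form for all sixteen assignments of a, b, c, and d:

#include <stdio.h>

/* Compare the original ten-minterm form of f with the minimized form
 * reported by Quine-McCluskey, over every assignment of (a, b, c, d). */
int main(void)
{
    int a, b, c, d, mismatch = 0;

    for (a = 0; a <= 1; a++)
    for (b = 0; b <= 1; b++)
    for (c = 0; c <= 1; c++)
    for (d = 0; d <= 1; d++) {
        int f_orig =
            (!a && !b && !c && !d) || (!a &&  b && !c && !d) ||
            (!a && !b && !c &&  d) || (!a &&  b && !c &&  d) ||
            ( a &&  b && !c &&  d) || (!a &&  b &&  c &&  d) ||
            ( a &&  b &&  c &&  d) || ( a && !b &&  c &&  d) ||
            ( a &&  b &&  c && !d) || ( a && !b &&  c && !d);
        int f_min = (!a && !c) || (b && d) || (a && c);

        if (f_orig != f_min) {
            printf("mismatch at a=%d b=%d c=%d d=%d\n", a, b, c, d);
            mismatch = 1;
        }
    }
    printf(mismatch ? "forms differ\n" : "forms agree on all 16 inputs\n");
    return mismatch;
}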

The algorithm is deceptively simple to run by hand, but implementing it efficiently is quite another matter, especially when considering the different ways of implementing it on a parallel machine. The basic operation of the algorithm is this: construct a list of implicants; for a list of k implicants, perform Theta(k²) pairwise operations, producing a new list of implicants; then repeat the operation on that new list. For a function of n variables, at most n further lists can be generated beyond the initial one, since each pass eliminates one more variable.

In more detail, the algorithm works as follows: First, generate a list of implicants. In our example from above, this is:

0000
0100
0001
0101
1101
0111
1111
1011
1110
1010
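
One natural way to store implicants, which we expect to pin down during the implicant-code phase of the plan below, is as a pair of bit masks: one holding the fixed 0/1 values and one marking which positions have become don't-cares. The names and layout in this sketch are placeholders, not design decisions; the initial list above would be encoded roughly as:

/* A possible implicant representation (field names are placeholders):
 * `bits`   holds the 0/1 value of each fixed position,
 * `dashes` has a 1 wherever a position has been reduced to '-'.
 * Bit 3 = a, bit 2 = b, bit 1 = c, bit 0 = d. */
struct implicant {
    unsigned bits;
    unsigned dashes;
};

/* The initial list from the example; no position is a don't-care yet,
 * so every dash mask is 0. */
static const struct implicant initial_list[] = {
    { 0x0, 0 },  /* 0000 */
    { 0x4, 0 },  /* 0100 */
    { 0x1, 0 },  /* 0001 */
    { 0x5, 0 },  /* 0101 */
    { 0xD, 0 },  /* 1101 */
    { 0x7, 0 },  /* 0111 */
    { 0xF, 0 },  /* 1111 */
    { 0xB, 0 },  /* 1011 */
    { 0xE, 0 },  /* 1110 */
    { 0xA, 0 },  /* 1010 */
};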

Next, find all the pairs of implicants which differ in exactly one position. For each such pair, add one implicant to a separate list, with the position where they differ replaced by a -, and mark the implicants that were used:

0000+ 0-00
0100+ 000-
0001+ 010-
0101+ 0-01
1101+ -101
0111+ 01-1
1111+ 11-1
1011+ -111
1110+ 1-11
1010+ 111-
      101-
      1-10

Note that implicants which have already been used can still be used in other pairs; this is essential to the correctness of the algorithm, and corresponds to overlapping implicants in a Karnaugh map. The next step is to generate another implicant list in the same way:

0000+ 0-00+ 0-0-
0100+ 000-+ 0-0-
0001+ 010-+ -1-1
0101+ 0-01+ -1-1
1101+ -101+ 1-1-
0111+ 01-1+ 1-1-
1111+ 11-1+
1011+ -111+
1110+ 1-11+
1010+ 111-+
      101-+
      1-10+

Note that the pairs (0-00, 0-01) and (000-, 010-) both reduce to 0-0-; the same happens for several other pairs of implicants. Two choices for handling collisions like these are: add an implicant to a list only if it is not already present, or allow duplicates and discard them in the final output.
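
Using the placeholder representation from the sketch above, extended with a flag that plays the role of the + marks, a single combining pass might look roughly like the following; it uses the first collision-handling choice (only add an implicant if it is not already present):

#include <stddef.h>

/* Placeholder representation; bits at '-' positions are kept at 0. */
struct implicant {
    unsigned bits;    /* fixed 0/1 values              */
    unsigned dashes;  /* 1 wherever there is a '-'     */
    int used;         /* has been combined into a pair */
};

/* Two implicants combine if they have the same '-' positions and their
 * fixed bits differ in exactly one position. */
int try_combine(const struct implicant *x, const struct implicant *y,
                struct implicant *out)
{
    unsigned diff = x->bits ^ y->bits;

    if (x->dashes != y->dashes || diff == 0 || (diff & (diff - 1)) != 0)
        return 0;                     /* different dashes, or not exactly one bit */
    out->bits   = x->bits & y->bits;  /* the differing position becomes 0...      */
    out->dashes = x->dashes | diff;   /* ...and is recorded as a '-'              */
    out->used   = 0;
    return 1;
}

/* One pass: compare all pairs in `in`, mark the implicants that combine,
 * and append each new implicant to `out` unless it is already present
 * (collision-handling choice 1).  `out` must be able to hold up to
 * n*(n-1)/2 entries in the worst case. */
size_t combine_pass(struct implicant *in, size_t n, struct implicant *out)
{
    size_t i, j, k, count = 0;

    for (i = 0; i < n; i++) {
        for (j = i + 1; j < n; j++) {
            struct implicant merged;

            if (!try_combine(&in[i], &in[j], &merged))
                continue;
            in[i].used = in[j].used = 1;
            for (k = 0; k < count; k++)            /* skip duplicates */
                if (out[k].bits == merged.bits && out[k].dashes == merged.dashes)
                    break;
            if (k == count)
                out[count++] = merged;
        }
    }
    return count;
}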

The next step in the algorithm is to look once more for implicants which differ in only one position. However, there are none, so it is time to move on to the next phase of the algorithm: finding all the prime implicants, that is, the implicants which were never combined into a pair. These are the implicants which do not have a + beside them. In this example the set of prime implicants happens to be exactly the set of elements in the rightmost list, but in general there may be prime implicants in any of the lists, and all of them are used for the output. The list of prime implicants in this example is (0-0-, -1-1, 1-1-), meaning that the function can be expressed as ¬a¬c + bd + ac, as can easily be verified against the Karnaugh map above.
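
Tying the passes together, the uniprocessor version (plan item 3 below) could be organized along these lines; this is only a sketch on top of the placeholder combine_pass routine above, with storage management glossed over:

#include <stddef.h>

/* Placeholder representation and pass routine from the previous sketch. */
struct implicant { unsigned bits; unsigned dashes; int used; };
size_t combine_pass(struct implicant *in, size_t n, struct implicant *out);

/* Run combining passes until a pass produces no new implicants, then
 * collect everything that never acquired a '+' mark.  `lists[0]` holds
 * the n0 initial minterms; `lists` and `primes` are assumed to be large
 * enough for this sketch. */
size_t quine_mccluskey(struct implicant **lists, size_t n0, size_t max_lists,
                       struct implicant *primes)
{
    size_t counts[32];        /* assumes max_lists <= 32 in this sketch */
    size_t level, i, j, nprimes = 0;

    counts[0] = n0;
    for (level = 0; level + 1 < max_lists && counts[level] > 0; level++)
        counts[level + 1] = combine_pass(lists[level], counts[level],
                                         lists[level + 1]);

    /* Prime implicants may sit in any of the lists, not just the last. */
    for (i = 0; i <= level; i++)
        for (j = 0; j < counts[i]; j++)
            if (!lists[i][j].used)
                primes[nprimes++] = lists[i][j];

    return nprimes;
}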

Parallelization

The major opportunity for parallelism that we have seen so far is that the lists can be generated dynamically: if we assign a process to each list, each process can take a stream of implicants from the previous list and generate a stream of implicants for the next list. However, this is not sufficient to ensure good parallelism, for two main reasons. First, the number of implicants on the initial lists is usually much larger than the number on later lists, so if one process is assigned to each list, the later ones will be starved for work. Second, if there are more than n processors, they will not all be used if we simply assign a processor to each list, since there are at most n lists. The solution to these problems will be interesting to investigate.
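
To make the pipeline idea concrete, the rough POSIX-threads sketch below shows what one per-list stage might look like. The small thread-safe queue is only assumed here (it is exactly plan items 6 and 7; a possible implementation is sketched under the plan of attack below), duplicates are simply allowed to flow through to be discarded at output (the second collision-handling choice above), and none of the load-balancing problems just described are addressed:

#include <pthread.h>
#include <stddef.h>

/* Placeholder types and routines from the earlier sketches. */
struct implicant { unsigned bits; unsigned dashes; int used; };
int try_combine(const struct implicant *x, const struct implicant *y,
                struct implicant *out);

/* Assumed thread-safe queue interface (plan items 6 and 7). */
struct impl_queue;
void impl_queue_push(struct impl_queue *q, const struct implicant *im);
int  impl_queue_pop(struct impl_queue *q, struct implicant *im); /* 0 = closed and empty */
void impl_queue_close(struct impl_queue *q);

/* Per-stage state: stage k consumes list k-1 and produces list k. */
struct stage_ctx {
    struct impl_queue *in, *out;
    struct implicant  *seen;   /* implicants accumulated at this level    */
    size_t             count;  /* assumed to fit in the seen[] array      */
};

/* Pipeline stage (intended as a pthread_create entry point): stream
 * implicants in, compare each arrival against everything seen so far at
 * this level, and stream combinations out. */
static void *stage(void *arg)
{
    struct stage_ctx *ctx = arg;
    struct implicant  cur, merged;
    size_t i;

    while (impl_queue_pop(ctx->in, &cur)) {
        for (i = 0; i < ctx->count; i++) {
            if (try_combine(&ctx->seen[i], &cur, &merged)) {
                ctx->seen[i].used = 1;
                cur.used = 1;
                impl_queue_push(ctx->out, &merged);
            }
        }
        ctx->seen[ctx->count++] = cur;   /* remember it for later pairs */
    }
    impl_queue_close(ctx->out);          /* no more input for the next stage */
    return NULL;
}

Once every input queue has been closed and drained, the prime implicants are the unmarked entries left in each stage's seen array.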

Logistics

Plan of attack and schedule

1. Implicant code
Interface, data structure(s), and basic operations for/on implicants
2. Table code
Interface, data structure(s), and basic operations for/on tables and the main table of tables
3. Uniprocessor code
Correct implementation of the algorithm for 1 processor
4. Uniprocessor timing
5. Uniprocessor writeup
6. Parallel data structures
Such as queues, etc.
7. Locking-related code
8. Statically scheduled code
Correct implementation of the algorithm for more than 1 processor, with work split up statically
9. Statically scheduled code timing
10. Static writeup
11. Dynamically scheduled code*
Correct implementation of the algorithm for more than 1 processor, with work split up dynamically*
12. Dynamically scheduled code timing*
13. Dynamic writeup*
14. Other improvements
Optimizations to the code, to be determined after testing static and dynamic versions
15. Other timing
This will most likely occur more than once to try different optimizations
16. Writeup about other optimizations
17. Final writeup
18. Poster

*At present we do not know of a good way to do dynamic scheduling for this problem. In any case, items 11, 12, and 13 should achieve better load balancing than the original statically scheduled code, even if they do not do strictly dynamic scheduling.
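
For plan items 6 and 7, a minimal version of the kind of thread-safe queue assumed by the pipeline sketch above might look like the following (fixed capacity, POSIX threads; the names are placeholders, not design decisions):

#include <pthread.h>
#include <stddef.h>

struct implicant { unsigned bits; unsigned dashes; int used; };

#define IMPL_QUEUE_CAP 1024   /* fixed capacity keeps the sketch simple */

/* A bounded, mutex- and condition-variable-protected queue of implicants. */
struct impl_queue {
    struct implicant buf[IMPL_QUEUE_CAP];
    size_t head, tail, count;
    int closed;                       /* producer has no more implicants */
    pthread_mutex_t lock;
    pthread_cond_t  not_empty, not_full;
};

void impl_queue_init(struct impl_queue *q)
{
    q->head = q->tail = q->count = 0;
    q->closed = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_empty, NULL);
    pthread_cond_init(&q->not_full, NULL);
}

void impl_queue_push(struct impl_queue *q, const struct implicant *im)
{
    pthread_mutex_lock(&q->lock);
    while (q->count == IMPL_QUEUE_CAP)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->buf[q->tail] = *im;
    q->tail = (q->tail + 1) % IMPL_QUEUE_CAP;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Returns 1 and fills *im if an implicant was dequeued; returns 0 once
 * the queue has been closed and drained. */
int impl_queue_pop(struct impl_queue *q, struct implicant *im)
{
    int got = 0;
    pthread_mutex_lock(&q->lock);
    while (q->count == 0 && !q->closed)
        pthread_cond_wait(&q->not_empty, &q->lock);
    if (q->count > 0) {
        *im = q->buf[q->head];
        q->head = (q->head + 1) % IMPL_QUEUE_CAP;
        q->count--;
        pthread_cond_signal(&q->not_full);
        got = 1;
    }
    pthread_mutex_unlock(&q->lock);
    return got;
}

void impl_queue_close(struct impl_queue *q)
{
    pthread_mutex_lock(&q->lock);
    q->closed = 1;
    pthread_cond_broadcast(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}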

Schedule (numbers in parentheses are the plan items above):

  Mar 19          Proposal due
  Mar 20-21       Implicant code (1)
  Mar 22*-24*     Table code (2)
  Mar 25-27       Uniprocessor code (3)
  Mar 28          Milestone #1; Uniprocessor timing (4)
  Mar 29*-30*     Uniprocessor writeup (5)
  Mar 31*-Apr 2*  Parallel data structures (6)
  Apr 3*-4*       Locking code (7)
  Apr 7*-9        Static code (8)
  Apr 10          Static timing (9)
  Apr 11          Milestone #2; Static writeup (10)
  Apr 14-15       Dynamic code (11)
  Apr 16          Dynamic timing (12)
  Apr 17          Dynamic writeup (13)
  Apr 21-25       Final code (14)
  Apr 26          Final timing (15)
  Apr 27          Final documentation (16)
  Apr 28-30       Final writeup (17)
  May 1-2         Poster session (18)

* = mid out of town

Milestones

For the first milestone, we expect to have the algorithm implemented for a uniprocessor and to have done some performance testing. This should include total time and computation time on all of the machines we will be using; even on multiprocessor machines, the code will use only one processor. The purpose is to have a baseline for comparison and to make sure the implementation is correct before it forks into different and more complex multiprocessor versions.
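
For the timing part of this milestone we anticipate nothing more elaborate than wrapping the computation in wall-clock measurements, roughly as in the sketch below (the routine name is a placeholder):

#include <stdio.h>
#include <sys/time.h>

/* Placeholder for the single-processor minimization routine. */
extern void run_quine_mccluskey(void);

int main(void)
{
    struct timeval start, end;
    double elapsed;

    gettimeofday(&start, NULL);
    run_quine_mccluskey();          /* computation to be timed */
    gettimeofday(&end, NULL);

    elapsed = (end.tv_sec - start.tv_sec)
            + (end.tv_usec - start.tv_usec) / 1e6;
    printf("computation time: %.3f s\n", elapsed);
    return 0;
}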

For the second milestone, we will have the parallel data structures, the locking code, and a correct statically scheduled parallel implementation, together with its timing results and writeup (plan items 6 through 10).

Literature search

Resources needed

Development will be done primarily on and for the NCSA SGI machines. We will also evaluate performance on machines with higher communication latencies (including the Intel cluster). Additionally, we will evaluate the performance on large numbers of processors on PSC machines, although it will be challenging to get good parallelism in this environment.

Individual runs will be kept to a few minutes at most. Hence, we should not require more than 40 hours of total compute time for the entire project (excluding development time in interactive mode).

No special commercial software will be required. Berkeley's espresso heuristic logic minimizer may be used for results verification and/or timing comparisons. We will write some Perl scripts and/or other tools to generate input files and test output.

No special resources are required for this project beyond those already available.

Getting started

To date we have accomplished the following:

There are no current obstacles to getting started immediately, and the only foreseeable obstacle to completion is the long average wait time for job submission on the available machines.