Automatically Identifying Opportunities for Using Special Purpose Instructions
Edward Hogan
Glenn Judd
Shafeeq Sinnamohideen
Over the last few years, there has been a trend in the microprocessor industry of CPU vendors adding new specialized instructions to their processors. Many of these instructions have been added with the stated purpose of providing better performance for specialized applications. Examples include Single-Instruction-Multiple-Data (SIMD) instructions for use in multimedia applications and fused instructions (e.g. multiply-add) for use in numerical applications. Reconfigurable processors [6], which can be configured for arbitrary operations, take this concept to the extreme: a program can configure the processor for the instructions it expects will provide the best performance.
The benefits of specialized instructions bring with them a set of challenges. First, it can be difficult for a compiler to determine effective use of these instructions. Furthermore, it is potentially hard for a developer coding in assembly language to determine when a new specialized instruction may be useful, either because the developer is not familiar with the available specialized instruction set or because the developer does not know which places in the code most heavily affect performance.
We have developed a tool that alleviates this difficulty by identifying places where a specialized instruction could potentially be used. In addition to identifying these places, our approach helps determine how beneficial the use of the specialized instruction might be; this is done by profiling the program and identifying sections of code that are executed frequently (so-called “hot spots”).
In short, we provide developers with suggestions of places in their code where they may be able to take advantage of specialized instructions. These suggestions target the program’s hot spots so that the optimizations can greatly improve performance.
A tool that already provides similar functionality is the VTune Performance Analyzer [4] from Intel. VTune makes use of standard profiling techniques and hardware counters in Intel's Pentium and later processors to trace the execution of an x86 program. It also has "coaches" that can suggest locations where using Intel's Streaming SIMD Extensions [5] would increase program performance. It is, however, more restricted than we would like. First, it is confined to a single ISA, the x86, and to a fixed set of new instructions. This works well for its intended purpose, helping programmers optimize their code for the Pentium III, but falls short of supporting the arbitrary new instructions that a reconfigurable processor could execute. It is likely that VTune's analysis algorithm is capable of doing this, considering that it must be flexible enough to deal with the complexity of the x86 ISA, but that functionality is not available to the user.
Compilers also perform similar optimization processes on the intermediate representation of a program during compilation. Semantics-preserving transformations are used to transform one sequence of operations into another, possibly shorter, sequence. Strength reduction, which replaces complex operations with a sequence of simpler ones, is the opposite of what we want to do. That strategy makes sense because complex operations frequently take longer to execute than simpler ones; in our case, however, we expect that the new instruction is faster than the sequence of instructions it replaces, otherwise the instruction would not have been added to the ISA. The major difference between our post-compilation analyzer and an optimizer within the compiler is that while the compiler is constrained to generating code that is correct under all circumstances, we merely try to detect that a transformation is possible. The task of determining the safety of the transformation is left to the programmer or an automatic optimizer. Also, we may have an execution trace available that allows us to identify the common execution path through loops and optimize for it, whereas the compiler cannot know which path through a loop is the common case.
Our system consists of two self-contained components: an optimization analysis tool and a hot spot trace generator. The optimization analysis tool analyzes a sequence of assembly code and finds places where specialized instructions can possibly be applied to optimize the code. Specialized instructions are defined using a language that describes patterns of assembly language instructions that can be replaced with a specialized instruction. We call this language the Instruction Optimization Description Language (IODL), and the patterns described by it we call IODL Patterns. Using IODL Patterns, arbitrary instructions can easily be defined which can then be tested against sequences of assembly code.
Our optimization analysis tool is capable of analyzing arbitrary sequences of assembly code; however, it is desirable to optimize the sections of code that most affect a program’s performance. The hot spot trace generator facilitates this by analyzing a running application, and identifying sections of the program that are executed frequently. It then generates a trace of instructions of the most frequently executed sections of the program.
In order for our system to find potential places for multimedia instruction replacement, the user must be able to describe to the system the patterns of instructions to look for. This is done by creating Instruction Optimization Description Language (IODL) patterns. These patterns are human readable and have an HTML-like syntax. Currently, three tag pairings are used in IODL:
<INSTRUCTION> - Used to mark the beginning and end of the instruction patterns.
<TARGET> - Used to provide a user-friendly description of the pattern.
<IODL> - Used to mark the beginning and end of a single pattern.
The individual assembly instructions to search for are described by the name of the instruction, followed by alphanumeric strings that serve as virtual register names for the specific registers used. A register name can also be the string “*”, a wildcard symbol that instructs the pattern-matching facility to match any register. The IODL descriptions use the Alpha instruction format.
Below is the IODL pattern describing a sample parallel add instruction. There are two descriptions of this instruction: a vector + vector version and a vector + scalar version.
<INSTRUCTION>
<TARGET> A Parallel Add Instruction </TARGET>
<IODL>
addq a b c
addq d e f
addq g h i
addq j k l
</IODL>
<IODL>
addq a b c
addq d b e
addq f b g
addq h b i
</IODL>
</INSTRUCTION>
When an IODL Pattern is read in, a dependency graph is generated for the pattern. For each statement in the IODL Pattern, a node is created in the graph. A node’s parents are the nodes whose instructions it directly depends on. Since Alpha instructions have at most two input registers, each Alpha instruction can have a direct data dependency on at most two other instructions; as a result, each node in the graph can have at most two parents. In contrast, because the number of instructions that can depend on the output of an instruction is unbounded, the number of children a node may have is likewise unbounded. In practice, however, the number of children a node may have is limited by the fact that registers are reused.
IODL Dependency Graphs are not required to be contiguous. SIMD instructions are a common example of instructions having disjoint IODL Dependency Graphs.
Trace dependency graphs are dependency graphs generated from sequences of assembly code. They are created in exactly the same fashion as IODL Dependency Graphs. Note that trace dependency graphs can be generated from any sequence of assembly code, whether it comes from an actual trace or from assembly source code. Also note that our system does not treat control instructions in any special fashion.
IODL Dependency Graphs and Trace Dependency Graphs are built in exactly the same manner using a fairly straightforward algorithm (a code sketch of this construction follows the list):
· For each graph, a linear list of the instructions that define the graph is maintained. This list is ordered according to the temporal ordering of the instructions. A linear list of “roots” is also maintained; roots in a dependency graph are nodes that do not depend on any other nodes.
· When an instruction is added to the graph, a new (empty) dependency graph node is created. Next, the list is traversed from the first instruction to the last in temporal order. At each node in the list, if the new node depends on that node, the node in the list is set to be a parent of the new node. After the entire list has been traversed, the new node’s parents have been determined. For each parent, the new node is added as a child. If the new node has no parents, it is added to the list of roots.
· After all instructions in a pattern have been added to the graph, the graph is examined in order to determine which parts of the graph are disjoint. For each disjoint segment of the graph, a “primary root” is determined; the primary root is the first root (in temporal order) of that graph segment. Determining primary roots begins by traversing the list of roots. For each root, all nodes reachable from the root are “marked”. Next, the list of roots is traversed until an unmarked root is found. If such a root is found, it is added to the list of primary roots. This new primary root’s graph segment is then marked, and the process continues until all roots have been marked and all primary roots have been determined.
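To make the construction concrete, the following Python sketch (our illustration only; the class and function names are hypothetical, not the tool's actual code) builds a dependency graph from instructions given as (op, src1, src2, dest) tuples and then determines the primary roots of its disjoint segments.

class Node:
    def __init__(self, op, srcs, dest):
        self.op = op              # instruction mnemonic, e.g. "addq"
        self.srcs = srcs          # at most two input registers
        self.dest = dest          # single output register
        self.parents = []         # nodes this node directly depends on
        self.children = []        # nodes that directly depend on this node

def build_dependency_graph(instructions):
    # instructions: (op, src1, src2, dest) tuples in temporal order.
    nodes, roots = [], []
    for op, src1, src2, dest in instructions:
        node = Node(op, [src1, src2], dest)
        for src in node.srcs:
            # The parent for a source register is its most recent writer, if any
            # (the paper walks the whole list; taking only the latest writer is a
            # simplification that also respects register reuse).
            writer = next((n for n in reversed(nodes)
                           if src != "*" and n.dest == src), None)
            if writer is not None and writer not in node.parents:
                node.parents.append(writer)
                writer.children.append(node)
        if not node.parents:
            roots.append(node)
        nodes.append(node)
    return nodes, roots

def primary_roots(roots):
    # One primary root (the earliest root, in temporal order) per disjoint segment.
    marked, primaries = set(), []
    def mark_segment(node):
        if id(node) in marked:
            return
        marked.add(id(node))
        for other in node.children + node.parents:
            mark_segment(other)
    for root in roots:            # roots are kept in temporal order
        if id(root) not in marked:
            primaries.append(root)
            mark_segment(root)
    return primaries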
As an example consider the following specialized instruction that computes the sum of two multiply-add instruction sequences:
<instruction>
<target>
Sum of two multiply-accumulations
</target>
<iodl>
mult a, b, c
add c, d, e
mult f, g, h
add h, i, j
add e, j, k
</iodl>
</instruction>
The following figure shows the IODL dependency graph generated for this instruction. The numbers on the upper left-hand corner of each graph node show the order in which nodes are added to the graph.
Figure 1: Sample IODL Dependency Graph
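Using the sketch above, the structure shown in Figure 1 can be reproduced from the pattern's five statements (register names are the virtual names from the IODL description):

pattern = [
    ("mult", "a", "b", "c"),
    ("add",  "c", "d", "e"),
    ("mult", "f", "g", "h"),
    ("add",  "h", "i", "j"),
    ("add",  "e", "j", "k"),
]
nodes, roots = build_dependency_graph(pattern)
assert [n.op for n in roots] == ["mult", "mult"]   # the two mult statements are roots
assert len(primary_roots(roots)) == 1              # the final add joins both subtrees,
                                                   # so the graph is a single segment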
The IODL Pattern Matcher component is responsible for matching the specialized instructions defined in the IODL Patterns against the actual hot spot trace. There are primarily two varieties of instructions that the pattern matcher recognizes. First, it can match patterns in which the dependency tree of the operation has multiple independent root nodes. This is the typical pattern found in Single-Instruction-Multiple-Data (SIMD) operations and is depicted in Figure 2. Second, it can also match patterns in which the dependency tree of the operation contains multiple paths that eventually merge into a single node. This pattern is typical of most fused instructions and is depicted in Figure 3.
Although these are the two predominant patterns found in the applications we have considered, the pattern-matching algorithm is capable of matching any arbitrary dependency graph in which each node may have multiple children and parents. The algorithm is described below.
The algorithm that matches the IODL descriptions, which are provided by the user, to the hot spot traces, which are retrieved by the Atom tool, works as follows.
In general, the algorithm operates in a depth-first manner, matching the entire descendant tree of a root node in the IODL description. Once the entire tree has been matched successfully for the root, the next root in the IODL description is chosen and the matching process begins again. A complete match can only be declared when the descendant trees of each root node are matched completely with the dependency tree found in the trace.
A match between a trace node and an IODL node is declared when the following conditions hold. First, the nodes’ operations must be identical. Second, the registers referred to by the nodes must also correspond. Since the IODL description uses user-chosen names for the registers, a register matches when its chosen name can be bound to the actual machine register; this register binding is described later. The last requirement is that each child node of the IODL node must match a child node in the trace.
Although the pattern-matching algorithm matches the IODL graph to the trace graph in a depth-first manner, the algorithm must be substantially more complex in order to ensure that all possible matches are found. The order in which the IODL graph roots, or a node’s children, are matched against the trace graph must be chosen by trying every possible permutation of the roots and children. This is necessary because the order in which a match is attempted may determine whether the match succeeds, since more than one IODL node may match the same node in the trace graph. In short, this increases the complexity of the algorithm to n! for n children of a node. Although this number may appear intimidating on paper, in practice there are several shortcuts that ensure that not all of the n! permutations are tried.
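The core of this recursion can be sketched as follows; this is an illustrative simplification under our own naming, not the tool's code, and it omits the outer loop that selects IODL roots and searches the trace breadth-first for candidate starting nodes.

from itertools import permutations

def registers_match(iodl_reg, trace_reg, bindings):
    # Bind an IODL virtual register name to a machine register, or check an
    # existing binding.  "*" is the wildcard and matches any register.
    if iodl_reg == "*":
        return True
    if iodl_reg in bindings:
        return bindings[iodl_reg] == trace_reg
    bindings[iodl_reg] = trace_reg
    return True

def match_node(iodl_node, trace_node, bindings):
    # Depth-first match of an IODL node and its descendants at a trace node.
    if iodl_node.op != trace_node.op:
        return False
    saved = dict(bindings)                       # so bindings can be undone on failure
    regs_ok = all(registers_match(i, t, bindings)
                  for i, t in zip(iodl_node.srcs + [iodl_node.dest],
                                  trace_node.srcs + [trace_node.dest]))
    if regs_ok:
        # Each IODL child must match a distinct trace child.  Because more than one
        # assignment may be possible, orderings are tried exhaustively; this is the
        # n! factor mentioned above.
        for chosen in permutations(trace_node.children, len(iodl_node.children)):
            trial = dict(bindings)
            if all(match_node(ic, tc, trial)
                   for ic, tc in zip(iodl_node.children, chosen)):
                bindings.clear(); bindings.update(trial)
                return True
    bindings.clear(); bindings.update(saved)
    return False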
Figure 4: Example IODL Graph (left) and Trace Graph (right)
Figure 4 shows an example of how one match can occur. The matching begins by selecting the first root node in the IODL graph (labeled root A) and matching it to a corresponding node in the trace. This search, matching the root node to the trace nodes, is done using a breadth-first search. Once the root node has been matched and the bindings updated, a depth-first search and match begins at this node. The matching process matches the children marked 2, 3, and 4 in Figure 4 above. At this point, the algorithm has successfully matched the entire subtree of one of the IODL graph’s two roots. The algorithm then selects the next root node (labeled root B) and searches through the trace in a breadth-first manner until a match with this node is found. Again, when this root node is matched with the trace node marked with the number 5, its children are then matched in a depth-first manner, marking the node labeled 6. Finally, when we examine the deepest child in the root B subtree, we find that this node is actually a descendant of both the root A and root B subtrees. Extra checks must be made at this node to guarantee that the parent nodes, marked 2 and 6, can both be found in this node’s list of parents. In this manner, the entire IODL graph can be pattern-matched against the entire trace.
Register binding is the process by which register arguments to an instruction are determined to match. Consider the example shown in Table 1 below.
User-defined IODL description | Actual trace instructions
addq a b c                    | addq t3, t6, t7
addq d * e                    | addq t1, a1, t2
mulq c e *                    | mulq t7, t2, t7
Table 1: Sample Register Bindings
Initially, the register-binding table is empty. Register binding occurs when the pattern matcher notices that it can create a mapping from the actual machine registers to the user-created virtual register names. In the first instruction, the pattern matcher maps the string name ‘a’ to the register t3, the string name ‘b’ to the register t6, and the string name ‘c’ to the register t7. In the second instruction, the pattern matcher maps the string name ‘d’ to the register t1 and the string name ‘e’ to the register t2. Since this instruction uses a wildcard for its second source, the match with the a1 register is not stored in the register-binding table. Finally, in the third instruction, the string name ‘c’ is looked up in the binding table and successfully matches the machine register t7, and the string name ‘e’ likewise successfully matches the machine register t2. Because the destination register of the third IODL instruction is a wildcard, it does not matter that this machine register is also bound to the string name ‘c’.
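Using the registers_match sketch above, the Table 1 example plays out as follows (operands are listed source, source, destination):

bindings = {}
for iodl_regs, trace_regs in [(("a", "b", "c"), ("t3", "t6", "t7")),   # addq
                              (("d", "*", "e"), ("t1", "a1", "t2")),   # addq
                              (("c", "e", "*"), ("t7", "t2", "t7"))]:  # mulq
    assert all(registers_match(i, t, bindings)
               for i, t in zip(iodl_regs, trace_regs))
print(bindings)   # {'a': 't3', 'b': 't6', 'c': 't7', 'd': 't1', 'e': 't2'}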
Focusing on the hot spots in the execution of a program allows the pattern matcher to concentrate on the sections that will offer the greatest improvement. While the pattern matcher can handle an entire program, analyzing only a relatively small portion at a time is useful because it keeps the number of possible combinations of instructions, which directly affects the search time, manageable. A profiling tool can be used to determine which sections of the code are executed most frequently, and thus would be most beneficial to examine. Additionally, some of the complex instructions we want to optimize for may replace multiple iterations of a loop with a single instruction. In order to determine whether a loop is executed enough times to be able to use such an instruction, we need a trace of the order in which basic blocks are executed. Finally, knowing the execution frequency of basic blocks and the commonly taken path allows the programmer, or another optimizer, to choose the best substitution from the set of possible optimizations.
The Hot Spot Trace Generator collects all the data necessary to do this and provides it to the pattern matcher. It consists of two halves: trace collection and hot spot detection. The trace collection half uses Atom [3] to instrument an Alpha binary so that it is functionally identical to the original binary except that it also writes a file indicating which basic blocks are executed. Once the trace file has been generated, it is given as input to the hot spot extractor. The hot spot extractor is a Perl script that examines the trace and the assembly source file and produces a trace for each hot spot found. The user can select the number of hot spots desired, the length of each trace, and the minimum number of iterations through the hot spot that the trace must contain. An example of its use is:
hotspot gzip.dis gzip.trace 20 1000 10
The script counts the execution frequencies of each basic block and finds the most common one. It then steps through the trace beginning at the first occurrence of that block, printing out instructions until either there are enough instructions, or the block has been executed enough times. The process is then repeated for the next most frequent block until enough hotspots have been output. If two (or more) blocks repeat in series, as would happen for blocks that are part of the same loop, the other block's instructions will be output in trace order as well. The tool remembers this, however, and will skip the hotspot starting at the other block, since that hotspot will be the same as the first one, with a shifted starting point. The optimization of logging only the starting addresses of basic blocks in the trace reduces the size of the trace file significantly, but requires that the script read the other instructions in the block from the assembly file, which is a small cost.
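The extraction logic can be summarized with the following simplified Python sketch (the actual tool is a Perl script; the names, data structures, and exact stopping rule here are our own):

from collections import Counter

def extract_hotspots(trace_blocks, block_instructions,
                     num_hotspots, max_len, min_iters):
    # trace_blocks: ordered list of basic-block start addresses from the trace.
    # block_instructions: maps a start address to that block's instructions,
    # read from the disassembled source.
    counts = Counter(trace_blocks)
    covered = set()          # blocks already emitted as part of an earlier hotspot
    hotspots = []
    for block, _ in counts.most_common():
        if len(hotspots) == num_hotspots:
            break
        if block in covered:
            continue         # would be a shifted copy of an earlier hotspot
        start = trace_blocks.index(block)     # first occurrence of the hot block
        trace, iterations = [], 0
        for addr in trace_blocks[start:]:
            # Stop once the trace is long enough and contains the minimum number
            # of iterations of the hot block (a simplification of the tool's rule).
            if len(trace) >= max_len and iterations >= min_iters:
                break
            trace.extend(block_instructions[addr])
            covered.add(addr)
            if addr == block:
                iterations += 1
        hotspots.append(trace)
    return hotspots

This roughly corresponds to the command-line example above, with 20 hot spots of up to 1000 instructions and at least 10 iterations each.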
In order to test the ability of the IODL tool to find places in code that could be replaced with multimedia or complex instructions, several existing Alpha binaries were analyzed with the Atom tool to find hot spots in the code that could be optimized. Due to disk space and processing time limitations, the traces were limited to 25 MB in length, which corresponds to 2.25 million basic blocks. Each hot spot was limited to 1000 instructions in length, and the top 20 hot spots from each application were produced. These hot spots were then analyzed by the IODL Pattern-Matcher to determine whether any complex instructions could be used to improve the performance of the application. Because the hot spots are the most frequently run blocks of code in the program, any improvements to this code can dramatically improve the execution of the program.
Four programs were analyzed using Atom and the IODL Pattern-Matcher: applu, compress, mpg123, and mpeg_play. Briefly, these programs are:
· applu - A floating point partial differential equation program.
· compress - A Spec95 benchmark that uses the Lempel-Ziv compression algorithm.
· mpg123 - A streaming MPEG audio player.
· mpeg_play - A streaming MPEG video player.
Each application’s hot spot trace was analyzed for several classes of complex instructions. Primarily, the traces were searched for possible Single-Instruction-Multiple-Data (SIMD) instructions and fused instructions. SIMD instructions allow multiple operations of the same class to be performed simultaneously on separate data registers. Fused instructions allow several different operations to take place on the same set of data registers simultaneously. The third and final set of instructions considered in the pattern matching was application-specific instructions. For this set, we examined places in each application’s assembly code that we believed were part of the application’s core functionality. We then wrote IODL patterns to recognize these potential sources of improvement and searched for those patterns in the code. This technique is useful in reconfigurable computing environments, where the optimization information could be fed back into the system to create a new instruction for a specific optimization.
The classes of instructions searched are described in Table 2, and are enumerated in detail in Appendix 1. The data types that these instructions operate on are described in Table 3.
Longword Fused       | Quadword Fused
Longword SIMD Scalar | Quadword SIMD Scalar
Longword SIMD Vector | Quadword SIMD Vector
T_Float Fused        | S_Float Fused
T_Float SIMD Scalar  | S_Float SIMD Scalar
T_Float SIMD Vector  | S_Float SIMD Vector
Logical SIMD Scalar  | Logical SIMD Vector
Application-Specific Instructions
Table 2: Categories of Searched Instructions
C declaration | Alpha Data Type | Size (Bytes)
char          | Byte            | 1
short         | Word            | 2
int           | Longword        | 4
unsigned      | Longword        | 4
long int      | Quadword        | 8
long unsigned | Quadword        | 8
char *        | Quadword        | 8
float         | S_Float         | 4
double        | T_Float         | 8
Table 3: Alpha Data Types
The first results of interest are the instruction match counts, which measure the number of matches of a particular complex instruction found in the hot spots of a program. These numbers can be quite high because the basic blocks that contain these possible optimizations are repeated when they occur inside a loop. The pattern matcher produces an instruction match count for each of the many hot spots it analyzes for each application. Figures 5, 6, 7, and 8 summarize the instruction match counts for all of the hot spot traces of the applu, compress, mpg123, and mpeg_play applications.
Figures 5, 6, 7, and 8: Instruction Match Counts for applu, compress, mpeg_play, and mpg123.
The second set of results collected for each hot spot by the pattern matcher was a histogram depicting how often each instruction line in the program trace could be replaced with an optimized instruction. In some cases, it is possible for multiple complex instructions to match the same instruction; this can occur, for example, when an instruction could be part of either a fused instruction or a SIMD instruction. In these cases, both suggestions are made to the user, who must choose the final course of action. In addition, in many cases the same optimized instruction pattern matched multiple times on the same line of the program trace. This occurs because, for many optimized instructions, especially SIMD operations, the pattern matcher will find each unique combination of instructions that fits the pattern. For example, given a set of four instructions {a, b, c, d}, where each could be part of a SIMD operation in which two operations are done in parallel, the pattern matcher would discover the matches {{a, b}, {a, c}, {a, d}, {b, c}, {b, d}, {c, d}}. In this instance, it would report that each instruction line was matched three times and that six SIMD operations are possible. In practice, it would only be possible to reduce the original sequence to two SIMD operations. The designer of the program can, after analyzing the match output, modify the program based on their knowledge of the system, making an informed decision about which of the set of reported matches best fits into the program.
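To make the {a, b, c, d} example concrete, the following small sketch (illustrative only, not part of the tool) counts the reported pairwise matches and then greedily selects non-overlapping pairs:

from itertools import combinations

candidates = list(combinations("abcd", 2))
print(len(candidates))        # 6 reported matches: ab, ac, ad, bc, bd, cd

used, chosen = set(), []
for pair in candidates:       # greedily keep pairs that share no instruction
    if not used & set(pair):
        chosen.append(pair)
        used |= set(pair)
print(chosen)                 # [('a', 'b'), ('c', 'd')]: only two SIMD ops are realizable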
Histograms were collected for each of the many hot spots in the targeted applications. Figures 9, 10, 11, and 12 show histograms for a single hot spot of each application that contained interesting results.
Figures 9, 10, 11, and 12: Per-Line Match Histograms for selected applu, compress, mpeg_play, and mpg123 hot spots.
Several conclusions can be drawn from the collected data. Table 4 below restates the instruction match count data shown in the Results section above. From this data, we can draw conclusions about both the properties of the targeted applications and the IODL pattern-matching system.
Instruction                               | applu | compress | mpeg_play | mpg123
compress: sub; add; cmov; s8add           | -     | 2        | -         | -
L_integer SIMD scalar subtraction         | -     | -        | -         | 1040
L_integer SIMD vector addition            | -     | -        | 7         | -
Logical SIMD scalar and                   | -     | -        | 300       | -
Logical SIMD scalar sll                   | -     | -        | -         | 720
Logical SIMD vector bis                   | -     | 301      | 23        | -
mpeg_play: extract high; shift right      | -     | -        | 32        | 14
mpg123: extracts; ors; shifts             | -     | -        | -         | 18
mpg123: insert; mask; or                  | -     | 2        | 4         | 2
Q_integer SIMD vector addition            | -     | 47       | -         |
T_float fused multiplication addition     | 13    | -        | -         | -
T_float fused multiplication subtraction  | 8     | -        | -         | 2
T_float fused subtraction multiplication  | 6     | -        | -         | -
T_float SIMD scalar division              | -     | 2        | -         | -
T_float SIMD scalar multiplication        | 481   | -        | -         | -
T_float SIMD vector addition              | 272   | -        | -         | -
T_float SIMD vector multiplication        | 264   | -        | -         | 2100
T_float SIMD vector subtraction           | 51    | -        | -         | -
Table 4: Instruction Match Results Summary
First, we can see from the applu results that this program can be heavily optimized through the use of optimized versions of T_Float instructions. Since this application does intensive differential equation processing, it fits the profile of an optimizable program. It is interesting to note that both scalar and vector versions of the optimizations can be done; in addition, fused instructions appear to be possible as well as SIMD instructions.
Next, it appears that the compress program can be optimized through a number of integer optimizations. A SIMD bis instruction (logical or) and a SIMD vector addition are most common and could probably contribute to a program speed-up. In addition, it is interesting to note that it is possible to construct an instruction specific to the compress program that could be used to speed up the main compression loop. This instruction is named “compress: sub; add; cmov; s8add” and is shown in Table 5 below:
Instruction                           | IODL pattern
Compress: sub; add; cmov; s8add       | subq a, *, a; addq a, *, b; cmovlt a, b, a; s8addq a, *, c
mpeg_play: extract high; shift right  | extqh *, *, a; sra a, *, a
mpg123: extracts; ors; shifts         | extbl a, *, a; extbl b, *, b; sll a, *, a; bis a, b, a; sll a, *, c
mpg123: insert; mask; or              | inswl *, a, b; mskwl c, a, c; bis c, b, c
Table 5: Application Specific IODL descriptions
Finally, the results of the pattern match searches on the mpeg_play video application and the mpg123 audio tool show that there are many opportunities to optimize these applications. The mpeg_play video player can take advantage of many SIMD logical and instructions, vector additions, and vector bis instructions. The mpg123 audio player primarily benefits from a separate set of optimizations: it can use SIMD shift logical left, SIMD multiplication, and SIMD subtraction. Although these programs differ in some regards, they can both take advantage of several uniquely configured instructions. First, they both can use a ‘word insert, mask, and or’ instruction that we originally found while choosing possible optimizations for the mpg123 application; in fact, this combination can also be used to improve the compress program. In addition, both the mpeg_play and mpg123 programs can take advantage of an ‘extract high, shift right’ combination that was originally found to optimize the mpeg_play code. In short, many unusual combinations of instructions can be used to optimize code. We were surprised that the application-specific combinations we designed for one program could also be used in other applications. This is of particular interest for reconfigurable computing environments, where an application may be able to select specific optimizations that improve its own execution speed.
As instructions specialized for particular classes of applications such as multimedia applications become more common, the importance of determining their potential benefits increases. As shown above, our system can effectively assist in determining the benefits of these specialized instructions by gathering traces of application hot spots, and determining potential uses of new instructions in these hot spots. Our system can also assist reconfigurable systems by determining the potential benefit of instructions created on the fly.
Further, our results have shown that while many applications have obvious uses for specialized instructions, many other uses of specialized instructions are non-obvious. Hence, an automatic analysis tool is critical in order to use many specialized instructions effectively.
Though our overall approach and implementation have been shown to be effective, several enhancements could be made to our system. This section briefly describes some of the enhancements that would be most beneficial.
While our IODL descriptions are not limited to the Alpha instruction set, they are limited to Alpha-like instruction sets. Namely, IODL descriptions can only represent instructions that have at most two inputs and a single output. A more powerful scheme could allow for arbitrary numbers of inputs and outputs. Further, our IODL statements require either a virtual register name or a wildcard value. A more powerful scheme could allow logical expressions stating which virtual registers are and are not acceptable. This would allow for more compact IODL descriptions since, in many instances, users would no longer be required to create multiple descriptions for a single instruction. In addition, allowing logical expressions would increase the accuracy of the generated matches by removing false matches caused by the limited expressiveness of IODL descriptions.
Also, when a single trace is analyzed many suggested instruction matches are found that conflict with each other. For instance, a fused multiply-accumulate and a SIMD vector multiply may both share a common source register. As another example, consider that multiple matches may be found for a four-way SIMD vector multiply, but only a subset of these matches may be used since some of them have the same source registers. Our current system simply reports all possible matches. A more useful system would report tradeoffs among matches (e.g. which matches cannot coexist) as well as suggest the relative benefit available from each match.
The detection of hot spots can also be improved significantly. The current system assumes that the first occurrence of a frequently executed block is representative of its behavior throughout the program. While this makes sense for loops, it is less true if the most frequent block is a procedure that is called in one pattern during startup and another during the body of the program. Considering the period between occurrences of a given block when counting them might solve this problem. A block that occurs within a loop that has different behaviors at different times would then effectively be considered as two separate candidate hot spots. For example, block A in the trace ABABABAB would be considered distinct from A in ABCDABCD. The hot spot detector could then know that ABAB... is the better choice, since it contains more iterations of A. Faced with the worst case of ABBB...AAAA, where there are more A's than B's, the current simple detector will output ABBB as the trace of hot spot A. Using the minimum iteration parameter, however, will ensure that the trace contains at least that number of occurrences of A. Another improvement would be searching for the most common sequences of blocks, which would go a step further and distinguish ABCABC from ADEADE. This way, the pattern matcher would be run on both cases, instead of only the one that occurs first.
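As an illustration of the "most common sequence" idea (a sketch of the proposed improvement, not something implemented in our tool), counting fixed-length runs of block identifiers separates the two contexts in which block A appears:

from collections import Counter

def common_sequences(trace_blocks, length=3):
    # Count every consecutive run of `length` basic blocks in the trace.
    runs = (tuple(trace_blocks[i:i + length])
            for i in range(len(trace_blocks) - length + 1))
    return Counter(runs).most_common()

print(common_sequences(list("ABCABCADEADE")))
# ('A','B','C') and ('A','D','E') each occur twice: two distinct candidate hot spots for A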
Finally, our current system determines how beneficial candidate instructions are for a particular program, but our system does not provide any mechanism for suggesting candidate instructions on its own. Adding this capability would greatly assist reconfigurable computing.
[1] Alpha Architecture Handbook, Version 3, Digital Equipment Corporation, Maynard, Massachusetts, October 1996.
[2] Assembly Language Programmer’s Guide, Digital Equipment Corporation, Maynard, Massachusetts, March 1996.
[3] Digital Unix Programmer's Guide, Digital Equipment Corporation, Maynard, Massachusetts, 1999.
[4] VTune™ Performance Analyzer 4.0, Intel Corporation, 1999.
[5] Shreekant Thakkar and Tom Huff, "The Internet Streaming SIMD Extensions", Intel Technology Journal, 2nd quarter 1999, August 1999.
[6] Seth Copen Goldstein, Herman Schmit, Matthew Moe, Mihai Budiu, Srihari Cadambi, R. Reed Taylor, and Ronald Laufer, "PipeRench: A Coprocessor for Streaming Multimedia Acceleration", Proceedings of the International Symposium on Computer Architecture, 1999.
The following table lists the specialized instructions that we attempted to search for in each of the traces we analyzed. The instructions are grouped into categories describing their general type of operation and the datatype on which they operate.
Longword Fused                              | Quadword Fused
L_integer fused addition division           | Q_integer fused addition division
L_integer fused addition multiplication     | Q_integer fused addition multiplication
L_integer fused division addition           | Q_integer fused division addition
L_integer fused division subtraction        | Q_integer fused division subtraction
L_integer fused multiplication addition     | Q_integer fused multiplication addition
L_integer fused multiplication subtraction  | Q_integer fused multiplication subtraction
L_integer fused subtraction division        | Q_integer fused subtraction division
L_integer fused subtraction multiplication  | Q_integer fused subtraction multiplication

Longword SIMD Scalar                 | Quadword SIMD Scalar
L_integer SIMD scalar addition       | Q_integer SIMD scalar addition
L_integer SIMD scalar division       | Q_integer SIMD scalar division
L_integer SIMD scalar multiplication | Q_integer SIMD scalar multiplication
L_integer SIMD scalar subtraction    | Q_integer SIMD scalar subtraction

Longword SIMD Vector                           | Quadword SIMD Vector
L_integer SIMD vector addition                 | Q_integer SIMD vector multiplication
L_integer SIMD vector division                 | Q_integer SIMD vector division
L_integer SIMD vector multiplication           | Q_integer SIMD vector remainder
L_integer SIMD vector remainder                | Q_integer SIMD vector addition
L_integer SIMD vector scaled by 4 addition     | Q_integer SIMD vector subtraction
L_integer SIMD vector scaled by 4 subtraction  | Q_integer SIMD vector scaled by 4 addition
L_integer SIMD vector scaled by 8 addition     | Q_integer SIMD vector scaled by 4 subtraction
L_integer SIMD vector scaled by 8 subtraction  | Q_integer SIMD vector scaled by 8 addition
L_integer SIMD vector subtraction              | Q_integer SIMD vector scaled by 8 subtraction
T_Float Fused                             | S_Float Fused
T_float fused addition division           | S_float fused addition division
T_float fused addition multiplication     | S_float fused addition multiplication
T_float fused division addition           | S_float fused division addition
T_float fused division subtraction        | S_float fused division subtraction
T_float fused multiplication addition     | S_float fused multiplication addition
T_float fused multiplication subtraction  | S_float fused multiplication subtraction
T_float fused subtraction division        | S_float fused subtraction division
T_float fused subtraction multiplication  | S_float fused subtraction multiplication

T_Float SIMD Scalar                | S_Float SIMD Scalar
T_float SIMD scalar addition       | S_float SIMD scalar addition
T_float SIMD scalar division       | S_float SIMD scalar division
T_float SIMD scalar multiplication | S_float SIMD scalar multiplication
T_float SIMD scalar subtraction    | S_float SIMD scalar subtraction

T_Float SIMD Vector                | S_Float SIMD Vector
T_float SIMD vector addition       | S_float SIMD vector addition
T_float SIMD vector division       | S_float SIMD vector division
T_float SIMD vector multiplication | S_float SIMD vector multiplication
T_float SIMD vector subtraction    | S_float SIMD vector subtraction
Logical SIMD Scalar        | Logical SIMD Vector
Logical SIMD scalar and    | Logical SIMD vector and
Logical SIMD scalar bic    | Logical SIMD vector bic
Logical SIMD scalar bis    | Logical SIMD vector bis
Logical SIMD scalar eqv    | Logical SIMD vector eqv
Logical SIMD scalar ornot  | Logical SIMD vector ornot
Logical SIMD scalar sll    | Logical SIMD vector sll
Logical SIMD scalar sra    | Logical SIMD vector sra
Logical SIMD scalar srl    | Logical SIMD vector srl
Logical SIMD scalar xor    | Logical SIMD vector xor
Application Specific Instructions
mpeg_play: extract high; shift left
mpeg_play: extract high; shift right
mpeg_play: extract low; shift left
mpeg_play: extract low; shift right
mpg123: insert; mask; or
mpg123: extracts; ors; shifts