Fast Block Copy in DRAM
Ningning Hu & Jichuan Chang
Abstract
Although many techniques have been exploited to improve DRAM performance, the ratio of processor speed to DRAM speed continues to grow rapidly, which makes DRAM a performance bottleneck in modern computer systems. In this report, we explore the possibility of implementing a fast block copy operation in DRAM. By writing the latched data into another row during the DRAM refresh period, we can copy a whole row of data in only two memory cycles. In order to quantify the usefulness of such an operation, we first study the memory copy behavior of typical applications. Aligned and unaligned row copy and subrow copy instructions are then implemented in SimpleScalar 2.0. The performance improvement on our benchmarks and SPECint95 is measured and analyzed.
1. Introduction
DRAM has been used as main memory for over 20 years. During this time, DRAM density has increased from 1 Kb/chip to 256 Mb/chip, a factor of 256,000, while DRAM access latency has only been reduced to about one tenth of its original value. The ratio of processor speed to DRAM speed in modern computers continues to grow rapidly. All of this makes DRAM a performance bottleneck in modern computer systems.
One possible way of improving DRAM performance is to exploit the wide bandwidth within the DRAM chip (2-3 orders of magnitude larger than the CPU-memory bandwidth) to copy large amounts of data quickly inside a single DRAM chip. Such operations can occur in a DRAM read cycle. In a traditional DRAM chip, a row of bits is read into a latch upon a RAS signal. After the individual bits within this row are read, the latched row must be written back to the same row (since the read operation is destructive). If we can instead write the content of the latch into an arbitrarily specified row in the same DRAM chip during this refresh, we can copy hundreds of bytes of data in just two DRAM cycles.
Memory copy operations (memcpy(), bcopy(), etc.) are used intensively in text processing applications, networking services, video/image streaming, and operating systems, with different block sizes and alignment properties. In our project, we focus on the opportunity for, and usefulness of, fast block copy operations in DRAM. This report introduces our implementation of DRAMs supporting fast block copy and the simulation results obtained using fast block copy operations.
The rest of this paper is organized as follows. In Section 2, we discuss related work. In Section 3, we describe the methodology used in our experiments. In Section 4, we briefly introduce traditional DRAM organization and how to extend the DRAM hardware to support different kinds of block copies. In Section 5, we present the experimental results and our analysis. In Section 6, we suggest future directions and conclude.
2. Related Work
Many techniques have been used by memory manufacturers to improve the performance of DRAM. Extended data out (EDO) DRAM and synchronous DRAM (SDRAM) have long been popular. Advanced interface technologies, such as Rambus DRAM (RDRAM), RamLink, and SyncLink, are quickly emerging. Application-specific memories, such as cache DRAM (CDRAM), enhanced DRAM (EDRAM), video DRAM (VRAM) [3, 4], and synchronous pipelined DRAM (SPDRAM) [5], have achieved better performance in their intended areas.
The research community has also proposed novel methods to integrate DRAM with computation logic. For example, Computational RAM (C-RAM) brings SIMD computation logic into DRAM by implementing logic-in-memory at the sense amps [10]. The Intelligent RAM (IRAM) architecture merges the processor and memory into a single chip to reduce memory access latency and increase memory size while saving power and board area [9].
There are also many conventional methods to improve the performance of block copy operations. Non-blocking caches allow the processor to overlap cache miss stall times with block movement instructions [11]. Data prefetching is also used to improve the performance of block copy, and the data cache can usually be bypassed when block transfer operations are performed. [12] studies the operating system behavior of block copy within the IRAM architecture, which provides results complementary to our project (due to the restrictions of SimpleScalar, we can only observe the behavior of applications at user level).
3. Methodology
We use SimpleScalar 2.0 as our simulator, to which we add an assembly instruction, blkcp. This instruction relies on hardware support for block copy within the DRAM chip. It can copy a whole row of DRAM into another row during the refresh period of a DRAM operation in 2 cycles, or perform a subrow copy in 3 cycles. We run our simulator on office PCs: a Pentium III processor running at 733 MHz with 256 MB of main memory.
There are at least two important factors that influence the relevance and usefulness of the blkcp instruction: (1) the frequency of block copy operations in different kinds of applications, and (2) the block sizes and alignment properties of these block copy operations. These data should be collected in both kernel and user modes, by observing all library functions that use block copy operations (including bcopy() and memcpy()). But currently SimpleScalar 2.0 does not simulate kernel-mode operations, and it provides limited support for library customization on the Linux platform. This has limited our approach to observing only user applications and instrumenting only two library functions: bcopy() and memcpy(). One piece of future work is to collect data in all possible cases, and to do a comprehensive simulation and analysis.
We did not have the chance to do a detailed hardware simulation, but block diagrams and sequence tables are described in Section 4 to show that the DRAM can be extended to support fast copy operations.
To start from the simplest case, we require that both the source and destination addresses have the same offset with respect to the DRAM row size, so that we can copy the source to the destination row by row. But according to Figure 1, we realize that an aligned whole-row copy is seldom the case in real-world applications. So it is necessary to consider how to support unaligned whole-row copy, as well as aligned or unaligned subrow copy. In our simulation experiments, we inspect the performance improvement for all these kinds of block copy operations.
Figure 1. Block sizes of memory copy operations in GCC and Perl.
1. Aligned row copy: In this mode, blkcp assumes the source and destination addresses are both aligned with the row boundary. This is the simplest case, and it makes full use of the existing DRAM hardware.
2. Unaligned row copy: In this case, we remove the strict restriction of aligned row copy by allowing the source address not to be aligned with the row boundary. Consequently, for general memory copies, we can first process the beginning part of the destination, and then process the remaining memory with an aligned destination address. We will show that our design needs only three cycles to finish the work[1].
3. Subrow copy: Commodity DRAM has a very large row size, for example 1024 bits. Consequently, in the above two modes, only very few memory copies can use the blkcp instruction. To alleviate this situation, subrow copy is implemented. By using mask and shift registers in the DRAM, blkcp can copy one subrow into another subrow quickly (a bit-level sketch of this masking follows the list). Subrow copy can be implemented in hardware to support only aligned (with respect to a 2^n memory address boundary) or unaligned copy; both can finish in three cycles, so our hardware diagram only presents the more general, unaligned case. In our software simulation we consider both cases; the difference is that aligned subrow copy can be used less often, which becomes significant when the block size is large (>64 bytes), but for our benchmarks the difference is small. So in Section 5 we present the results using unaligned copy, except where explicitly mentioned (for example, in fileread).
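To make the subrow mechanism concrete, the following is a word-level C illustration (not the hardware itself, and with arbitrary example field positions) of how the shift and mask registers merge a source subrow into a destination row: the source field is shifted to the destination position and ORed in under a mask.

/* Word-level illustration of subrow merging via shift and mask.
 * Field positions and widths are arbitrary example values. */
#include <stdint.h>
#include <stdio.h>

static uint32_t subrow_merge(uint32_t dst_row, uint32_t src_row,
                             int src_pos, int dst_pos, int width)
{
    uint32_t field = (src_row >> src_pos) & ((1u << width) - 1); /* extract source subrow */
    uint32_t mask  = ((1u << width) - 1) << dst_pos;             /* destination bits to overwrite */
    return (dst_row & ~mask) | (field << dst_pos);               /* OR shifted source into masked dst */
}

int main(void)
{
    /* copy an 8-bit subrow from bit 4 of src into bit 16 of dst */
    printf("%08x\n", (unsigned) subrow_merge(0xFFFFFFFFu, 0x00000AB0u, 4, 16, 8));
    return 0;
}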
4. DRAMs Supporting Fast Block Copy Operation
4.1 DRAM Organization and Operations
In a traditional DRAM, any storage location can be randomly accessed for read/write by supplying the address of the corresponding storage location. A typical DRAM of bit capacity 2^N x 2^M consists of an array of memory cells arranged in 2^N rows (word lines) and 2^M columns (bit lines). Each memory cell has a unique location represented by the intersection of a word line and a bit line. A memory cell consists of a transistor and a capacitor; the charge on the capacitor represents a 0 or 1 for the memory cell. The support circuitry of the DRAM chip is used to read/write a memory cell. It includes:
a) Address decoders to select a row and a column.
b) Sense amps to detect and amplify the charge in the
capacitor of the memory cell.
c) Read/Write logic to read/store information in the memory
cell.
d) Output Enable logic that controls whether data should
appear at the outputs.
e) Refresh counters to keep track of refresh sequence.
DRAM memory is arranged in an X-Y grid pattern of rows and columns. First, the row address is sent to the memory chip and latched; then the column address is sent in a similar fashion. This row and column addressing scheme (called multiplexing) allows a large memory address to use fewer pins. The charge stored in the chosen memory cell is amplified by the sense amplifier and then routed to the output pin. Read/write is controlled using the read/write logic. [1]
Figure 2. Hardware Diagram of a Typical DRAM (2^N x 2^M x 1)
A typical DRAM read operation includes the following steps (refer to Figure 2):
1. The row address is placed on the address pins via the address bus.
2. The RAS pin is activated, which places the row address in the Row Address Latch.
3. The Row Address Decoder selects the proper row to be sent to the sense amps.
4. The Write Enable is deactivated, so the DRAM knows that it is not being written to.
5. The column address is placed on the address pins via the address bus.
6. The CAS pin is activated, which places the column address in the Column Address Latch.
7. The CAS pin also serves as the Output Enable, so once the CAS signal has stabilized, the sense amps place the data from the selected row and column on the Data Out pin so that it can travel the data bus back out into the system.
8. RAS and CAS are both deactivated so that the cycle can begin again. [2]
4.2 Block Diagrams
Some important assumptions are introduced to simplify the implementation of blkcp:
· The source and destination blocks are in the same DRAM chip; we currently do not support block copy across DRAM chips.
· There is no overlap between the source and destination blocks.
· The blkcp operation uses the register file and is not cacheable.
A 1M x 1 DRAM is chosen to illustrate our implementation.
Figure 3. DRAM chip supporting aligned row copy (1M x 1)
4.2.1 Aligned DRAM Row Copy
The block diagram of the DRAM is shown in Figure 3. We add two new components to the DRAM chip: a Buffer Register and a MUX (multiplexer). The Buffer Register is used to temporarily store the source row, and the MUX is used to choose the write-back data used in the refresh period: under normal conditions the column latch is chosen for refresh, but during row copy mode, WS is raised and the Buffer Register is chosen. The steps of the copy operation are listed in the table below; it finishes the block copy in 2 cycles:
Cycle | Action | Result
------|--------|-------
1 | Place the SRC row address on A0-A9; raise RAS | Column latch and Buffer Register now contain the source row data
  | Raise R/W | The SRC row is refreshed (column latch written back to SRC)
2 | Place the DST row address on A0-A9; raise RAS |
  | Raise R/W, raise WS | Data in the Buffer Register is written into DST during refresh
4.2.2 Unaligned DRAM Row Copy
The DRAM block diagram supporting unaligned row copy is included in Appendix A (Figure A1). More hardware is added, including one shift register, two mask registers, one buffer register, one OR gate array, and one MUX. As the source address is not aligned, we need 2 cycles to read a row-sized buffer out before writing to the destination row. The steps of the copy operation are shown below:
Cycle | Action | Result
------|--------|-------
1 | Place the SRC row address on A0-A9; raise RAS | Column latch and Shift Reg. store the SRC row
  | Raise R/W; Shift Reg. shifts SRC's data to the higher half and transfers it to Mask Reg. 1 | Column latch is written back to the cell array (refresh)
2 | Place the next row address on A0-A9; raise RAS; Mask Reg. 1 sets the lower half bits to 0 and writes its content to Buffer Reg. | Column latch and Shift Reg. store the next source row
  | Raise R/W, raise L/S; Shift Reg. shifts the row content to the lower half and transfers it to Mask Reg. 2 | Column latch is written back to the cell array (refresh)
3 | Place the DST row address on A0-A9; raise RAS, raise L/S; Mask Reg. 2 clears the higher bits | Mask Reg. 2 stores the latter half of the SRC data that will be written to the DST row
  | Raise R/W, raise WS; Buffer Reg. is ORed with Mask Reg. 2 | The combined SRC row is written to the DST row
4.2.3 DRAM Subrow Copy
In this mode, part of the source row (e.g. 32 bits of the 1024 bits in a row) can be copied into a subrow destination. It assumes that both the source subrow and the destination subrow stay within a single row; that is, they do not cross a row boundary. The DRAM chip diagram is shown in Appendix A (Figure A2). The design is similar to that of the unaligned row copy DRAM. The steps of the copy operation are shown below:
Cycle | Action | Result
------|--------|-------
1 | Place the SRC row address on A0-A9; raise RAS | Mask Reg. is filled with the SRC row
  | Place the SRC subrow column address on A0-A9; raise CAS and R/W; Mask Reg. sets all bits other than the source subrow to 0; raise the SRC signal for MUX1 | Column latch is written back to the cell array (refresh); Shift Reg. is filled with the source subrow
2 | Place the DST row address on A0-A9; raise RAS | Mask Reg. is filled with DST's data
  | Place the DST column address on A0-A9; raise CAS; Shift Reg. shifts SRC to DST's column position; Mask Reg. fills the DST subrow with 0; raise the DST signal for MUX1 | Shift Reg. is filled with the shifted SRC data; Buffer Reg. is filled with the masked DST data
3 | Place the DST row address on A0-A9; raise RAS and R/W; combine Buffer Reg. and Shift Reg. using OR; raise WS | The SRC subrow is written into DST
5. Simulation Results and Analysis
5.1 Simulation Method
5.1.1 Extending the Simulator
SimpleScalar 2.0 is used to simulate the effect of our new block copy operations. We add an instruction (blkcp) to SimpleScalar's instruction set, so that assembly programmers can use it to do fast block copies. Two commonly used block copy functions (bcopy() and memcpy()) are re-implemented using the blkcp instruction, and the SimpleScalar library is updated so that C programs can use blkcp by calling these library functions. The implementation is described below:
a. Add blkcp to SimpleScalar's instruction set architecture
In SimpleScalar 2.0, an assembly instruction is defined as a C procedure [8]. For example, in the aligned row copy case (block size = 1024 bytes), blkcp can be defined as follows:
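As an illustration, a minimal sketch of the blkcp semantics for the aligned row-copy case might look as follows. The macros GPR(), READ_BYTE(), and WRITE_BYTE() are stand-ins for SimpleScalar's register- and memory-access macros (the real names and signatures in machine.def differ); the procedure only models the functional effect, with the 2-cycle cost accounted for separately.

/* Sketch of a blkcp implementation for aligned row copy (ROW_SIZE = 1024 bytes).
 * The stand-in register file and memory below only make the sketch
 * self-contained; SimpleScalar's real macros differ. */
#define ROW_SIZE 1024

static unsigned int  regs[32];
static unsigned char mem[1 << 20];
#define GPR(n)            (regs[n])
#define READ_BYTE(a)      (mem[(a) % sizeof mem])
#define WRITE_BYTE(v, a)  (mem[(a) % sizeof mem] = (v))

void blkcp_impl(void)
{
    unsigned int src = GPR(4);   /* source row address from a register operand */
    unsigned int dst = GPR(5);   /* destination row address from a register operand */
    int i;

    /* Functionally, blkcp moves one whole DRAM row; the 2-cycle timing
     * is modeled outside this procedure. */
    for (i = 0; i < ROW_SIZE; i++)
        WRITE_BYTE(READ_BYTE(src + i), dst + i);
}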
b. Modify library functions to utilize the new instruction
In Linux, memory copy operations are performed by calling the library functions memcpy() and bcopy(). We rewrite them using blkcp and replace their implementations in the SimpleScalar library. Our experience (later confirmed by SimpleScalar's authors) shows that it is hard to rebuild glibc for SimpleScalar 2.0, so we simply substitute memcpy.o and bcopy.o in SimpleScalar 2.0's precompiled library libc.a. Our selected benchmarks are linked with the new library to use blkcp.
An effective way to implement the block copy is to do a small number of byte copies at the block's beginning and end, so that a maximal aligned destination block remains in the middle. Our memcpy() and bcopy() implementations follow this method. They first judge whether it is beneficial to use the blkcp instruction. For example, in aligned row copy, they check that: (a) the memory block to be copied includes at least one full physical row (not merely that it is larger than the row size); (b) the source and destination addresses have the same row offset. For operations satisfying these requirements, the block copy uses blkcp; otherwise the data are copied byte-by-byte (or word-by-word).
if (src and dst meet requirements) {
    /* non-overlapping, and buffer long enough */
    copy the beginning (or ending) unaligned bytes;
    block copy the aligned chunks using blkcp;
    copy the remaining unaligned bytes;
} else {
    do the memory copy byte-by-byte;
}
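A runnable C sketch of this wrapper for the aligned row-copy mode is given below. ROW_SIZE and the blkcp_row() helper (which would expand to the blkcp instruction, e.g. via inline assembly) are illustrative assumptions; the substituted library code differs in detail.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ROW_SIZE 1024U                      /* assumed DRAM row size in bytes */

extern void blkcp_row(void *dst, const void *src);  /* hypothetical wrapper around blkcp */

void *memcpy_blk(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    size_t off  = (uintptr_t)d % ROW_SIZE;
    size_t head = off ? ROW_SIZE - off : 0; /* bytes before the first row boundary */

    /* Use blkcp only if both addresses share the same row offset and the
     * region still contains at least one full physical row after the head. */
    if ((uintptr_t)s % ROW_SIZE == off && n >= head + ROW_SIZE) {
        memcpy(d, s, head);                 /* leading unaligned bytes */
        d += head; s += head; n -= head;

        while (n >= ROW_SIZE) {             /* aligned rows copied inside DRAM */
            blkcp_row(d, s);
            d += ROW_SIZE; s += ROW_SIZE; n -= ROW_SIZE;
        }
    }
    memcpy(d, s, n);                        /* trailing bytes, or the whole copy */
    return dst;
}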
5.1.2 Relevant Metrics
In our experiments, we focus on three performance metrics: (1) the total number of instructions executed (IN); (2) the total number of memory references (MRN); and (3) the total number of blkcp instructions used. IN and MRN cannot reflect the performance changes directly, because blkcp may need two or three cycles to finish, which cannot be modeled in SimpleScalar. Execution time and memory operation time would be the best metrics to evaluate the performance improvement, but neither can be used as a metric in SimpleScalar 2.0: each instruction is implemented as a C procedure, so an instruction assumed to finish in one cycle actually executes several lines of C code, which takes several real machine cycles and makes the measured execution time meaningless. Nevertheless, the frequencies of these operations still provide much information about the performance.
5.1.3 Benchmarks
A suitable benchmark for our experiments should meet three requirements:
(1) it uses a lot of block copy operations;
(2) its block sizes are large enough to utilize our blkcp instruction;
(3) it can be built on SimpleScalar 2.0 to use the special version of bcopy() and memcpy().
So far, we have not found a benchmark that satisfies all these requirements simultaneously. We tried to use SPECint95 as our benchmark suite, but of the eight benchmarks for which we could get source code, only four could be rebuilt (the others need libraries that are not supported by SimpleScalar), and only one of them has enough block copy operations to be used in our experiments. We also tried other benchmarks (such as SPLASH water and ocean), but similar problems remained. Due to the above limitations, we had to write our own benchmarks. Although self-designed benchmarks have many problems in terms of generality and comparability, they are useful for getting an idea of how and when the block copy scheme can improve performance. Below we briefly introduce our benchmarks.
Memcopy
This benchmark simulates the execution of a memory-intensive application. It mainly uses two classes of instructions: arithmetic/logic and load/store. The numbers of ALU/MEM instructions executed at runtime are chosen randomly (but are reasonably large). We want to test the adaptability of blkcp to various block sizes using this benchmark. Its pseudocode is shown below:
memcopy
{
    pick a random number n1 from 100 to 100,000;
    do ALU operations n1 times;
    fill the source buffer;
    pick another random number buf_len from 1 to 2048;
    /* dst, src aligned with the biggest possible row size */
    memcpy(dst, src, buf_len);
}
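For reference, a runnable approximation of this benchmark might look as follows, assuming the modified memcpy() described above is linked in; the iteration count and the way the ALU work is simulated are illustrative choices rather than the original benchmark code.

#include <stdlib.h>
#include <string.h>

#define MAX_BUF 2048

int main(void)
{
    static char src[MAX_BUF], dst[MAX_BUF]; /* the original benchmark aligns these to the largest row size */
    volatile long acc = 0;
    int iter;

    for (iter = 0; iter < 1000; iter++) {
        long n1 = 100 + rand() % 99901;     /* 100 .. 100,000 ALU operations */
        long i;

        for (i = 0; i < n1; i++)            /* dummy ALU work */
            acc += i ^ n1;

        memset(src, iter & 0xFF, sizeof src);   /* fill the source buffer */

        size_t buf_len = 1 + rand() % MAX_BUF;  /* 1 .. 2048 bytes */
        memcpy(dst, src, buf_len);          /* exercises blkcp when profitable */
    }
    return (int)(acc & 1);
}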
Fileread
This benchmark is used to simulate the behavior of a network server, for example an HTTP or FTP server. A generic HTTP server (supporting static content only) could be implemented as follows. In steps (b), (c), and (d), there are generally a lot of memory copy operations carried out by the operating system to move data between buffers.
network_server_simulator
{
    (a) listen for a request, parse it, and check access rights;
    (b) read the static file;
    (c) transform the content into the desired format;
    (d) send the result back to the client;
}
To our surprise, we also found that when the application's read buffer is large enough (20 bytes in SimpleScalar's glibc implementation), fread() also calls memcpy() to transfer data from the system buffer to the local buffer. Since all our source data are large files, a lot of memcpy() operations occur. To simplify the analysis and isolate the effect of blkcp on the file system, we wrote the fileread benchmark, which only reads files. It is more general because file reading is heavily used in both kernel mode and user mode.
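The core of such a benchmark can be sketched in a few lines; the file name and buffer size below are illustrative, and the point is simply that each fread() call copies data from the stdio buffer into the user buffer via memcpy().

#include <stdio.h>

int main(void)
{
    char buf[4096];
    size_t n, total = 0;
    FILE *fp = fopen("bigfile.dat", "rb");  /* any sufficiently large input file */

    if (fp == NULL)
        return 1;

    /* each fread() copies from the stdio buffer via memcpy() */
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        total += n;

    fclose(fp);
    printf("read %lu bytes\n", (unsigned long) total);
    return 0;
}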
Perl
Perl is the only benchmark
chosen from SPECint95. It is a Perl language interpreter, which reads in a Perl
script and executes it by interpretation.
5.2 Results and Analysis
In our experiments, we combine the data obtained from the three block copy modes. The experimental results with a block size of 1024 bytes are collected using whole-row block copy; the others use subrow copy. Whether aligned or unaligned copy is used is specified explicitly. Below we discuss each benchmark in turn.
5.2.1 Memcopy
As memcopy only involves aligned block copies, it is used to determine the best performance improvement that can be expected. Experimental data are shown in Figure 4. We can see that both IN and MRN are greatly reduced; performance is improved by factors of 2 to 30.
But the improvement is not monotonic: it is best when the block size is 16 bytes or 32 bytes. When the block size is small (e.g., 4 or 8 bytes), most memory copies can make use of blkcp, but the total number of blkcp instructions used is also large (Fig. 4c). On the other hand, when the block size is large (e.g., 512 or 1024 bytes), blkcp can copy a lot of memory at once, but it can be used less frequently, and a large portion of the block copy is done byte-by-byte. The bigger the block size, the more bytes have to be treated in the normal way; that is why the performance gets worse when the block size is too large. The top line shows our implementation of the naive memcpy algorithm (copy byte-by-byte), which performs a large number of memory operations.
Figure 4. Experimental results of memcopy using unaligned blkcp. In (a) and (b) the top line does not use blkcp; the lower curve is the result using blkcp. The x-axis represents the block size supported by blkcp.
5.2.2 Fileread
Figure 5 shows the results of fileread; the curves are very similar to those of the aligned mode in Figure 4. The results from fileread tell us that, in normal programs, unaligned blkcp can achieve a good improvement for systems with intensive memory operations (say, reading large files).
Figure 5. Experimental results for fileread, using unaligned blkcp. In (a) and (b) the top line is the data obtained without using blkcp; the lower curve is the one using blkcp. The x-axis represents the block size supported by blkcp.
Figure 6. Experimental results for fileread, using aligned blkcp. In (a) and (b) the top line is the data obtained without using blkcp; the lower curve is the one using blkcp. The x-axis represents the block size supported by blkcp.
In the aligned case (Figure 6), there are significant performance improvements for all block sizes, especially when the block size is 4 bytes or 8 bytes. The reason is that the system buffer used by fread() can be aligned or unaligned with respect to different block sizes, and this is not controlled by the user application. When the I/O buffer addresses are only aligned to 8 bytes but the block size of blkcp is 16 bytes or larger, memcpy() can seldom use blkcp, as illustrated by Fig. 6c[2]. Consequently, both IN and MRN for those large block sizes are much larger than those for 4 bytes and 8 bytes.
The experimental results for the aligned case actually vary across executions (sometimes a block size of 16 bytes or even 32 bytes also improves performance significantly); Figure 6 is a typical execution snapshot selected from among them. This strongly suggests that if the operating system can intentionally allocate buffers aligned to the block size of blkcp, kernel-mode performance will also be improved.
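At user level, the same idea can be illustrated with a standard aligned allocation; BLKCP_BLOCK below is an assumed block size, and posix_memalign() is the ordinary POSIX call, so this is only a sketch of the allocation policy, not kernel code.

#include <stdlib.h>

#define BLKCP_BLOCK 64   /* assumed blkcp block size in bytes */

char *alloc_copy_buffer(size_t len)
{
    void *p = NULL;

    /* round the length up to a whole number of blkcp blocks as well */
    size_t rounded = (len + BLKCP_BLOCK - 1) / BLKCP_BLOCK * BLKCP_BLOCK;

    /* aligning the buffer lets the modified memcpy() use blkcp for the whole transfer */
    if (posix_memalign(&p, BLKCP_BLOCK, rounded) != 0)
        return NULL;
    return p;
}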
5.2.3 Perl
Figure 7 shows the results for Perl. For this benchmark, the performance improvement is not obvious, because there is only a small fraction of memory copy operations (compared with memcopy and fileread) and the block sizes are usually within 64 bytes (Figure 7d). What makes things worse is that this benchmark failed to finish its execution when the block size was 4, 32, or 64 bytes; these runs are marked with hollow triangles in Figure 7. We hypothesize that some execution mechanisms in Perl conflict with the design of SimpleScalar in these three cases.
Omitting the incomplete data, we can still see that, within the range of buffer sizes, blkcp does improve performance, but the improvement is rather small compared with memcopy and fileread.
Figure 7. Experimental results for perl, using unaligned blkcp. In (a) and (b) the top line is the data obtained without using blkcp; the lower curve is the one using blkcp. The x-axis represents the block size supported by blkcp. Hollow triangles are incomplete execution results.
6. Conclusion
Our experiments show that for systems that frequently use large block memory copy operations, blkcp can indeed improve performance significantly, as illustrated by our first two benchmarks. This also suggests that the performance of block copy in the operating system, such as in file systems, the memory management subsystem, and networking protocol stacks, can be significantly enhanced as well. But for applications in which memory copy operations do not dominate the performance overhead, we cannot expect such an optimistic improvement just from using block copy.
There are also some limitations in our approach. First, we did not consider the overhead caused by the new hardware introduced in our design. We have not yet tested the feasibility of our hardware design, but it is apparent that implementing such a DRAM chip is neither easy nor cheap. Also, as we only consider one memory bank in our design, interleaved memory organization is ignored in our study. The restrictions of our simulator also limit the scope of our study: currently only user-mode applications are simulated, and little improvement has been achieved on most conventional benchmarks. Some might say that the result is too conservative because we only modified the implementations of bcopy and memcpy used by user applications, while others might say that our result is too optimistic because our own benchmarks stress block copy operations too heavily. We should consider combining user- and kernel-mode simulation with more realistic benchmarks in future research.
It’s still difficult to
conclude whether the hardware cost of such a block copy operation could be
justified. But we can say that in some cases, the performance improvement is
really large and can be used in some domain-specific application (say, NFS).
Future work should be done in comparing the result with approaches using data
prefetching and non-blocking cache. Furthermore, some other logic and
arithmetic operations should also be considered to make full use of the new
hardware we added.
References
[1]
Tulika Mitra, Dynamic Random Access Memory: A Survey. Research
Proficiency Examination Report, SUNY Stony Brook, March 1999
[2] RAM Guide.
http://arstechnica.com/paedia/r/ram_guide/ram_guide.part1-4.html
[3] Yoichi
Oshima, Bing Sheu, Teve H. Jen. High-Speed Architectures for Multimedia
Applications. Circuit & Device, pp 8-13, Jan. 1997
[4] Hiroaki Ikeda and Hidemori
Inukai. High-Speed DRAM Architecture Development. IEEE Journal of Solid-State
Circuits, pp 685-692, Vol. 34, No. 5, May 1999
[5] Chi-Weon Yoon, Yon-Kyun Im, Seon-Ho Han, Hoi-Jun
Yoo and Tae-Sung Jung. A Fast Synchronous Pipelined DRAM (SP-DRAM)
Architecture With SRAM Buffers. ICVC, pp 285-288, Oct, 1999.
[6] Doug Burger, Todd M. Austin. The SimpleScalar Tool Set, Version 2.0. University of Wisconsin-Madison Computer Sciences Department Technical Report #1342, June 1997.
[7] Todd M. Austin. Hardware and Software
Mechanisms for Reducing Load Latency. Ph.D. Thesis, April 1996.
[8] Todd M. Austin. A Hacker's Guide to the SimpleScalar Architectural Research Tool Set. ftp://ftp.simplescalar.org/pub/doc/hack_guide.pdf.
[9] David Patterson, Thomas Anderson, Neal Cardwell,
Richard Fromm, Kimberly Keeton, Christoforos Kozyrakis, Randi Thomas, and
Katherine Yelick. A Case for Intelligent
RAM: IRAM. IEEE Micro, April 1997.
[10] Duncan G. Elliott, W. Martin Snelgrove, and Michael Stumm. Computational RAM: A Memory-SIMD Hybrid and its Application to DSP. In Custom Integrated Circuits Conference, pp 30.6.1-30.6.4, Boston, MA, May 1992.
[11] Rosenblum, M., et al., The Impact of Architectural
Trends on Operating System Performance, 15th ACM SOSP, pp 285-298, Dec.
1995.
[12] Richard Fromm. Utilizing the On-chip IRAM Bandwidth. Course Project Report, UC Berkeley, 1996.
Appendix A: DRAMs supporting unaligned row copy and subrow copy
[1] The design is not symmetric; that is, if the source address is aligned while the destination is not, the chip design will be more difficult, and we also hypothesize that it could not do the same amount of work as fast as our current design.
[2] Because we are not able to rebuild glibc, the library used by SimpleScalar, we could not record the real buffer addresses and buffer sizes when fread() calls memcpy().