DOC: Doc's Our Cluster
DOC is the compute cluster and data server for the Medical Robotics
Technology Center at the Robotics Institute at Carnegie Mellon University.
- Why DOC?
- Because Doc's Our Cluster. The name is inspired by
recursive acronyms such
as GNU.
- What makes up DOC?
- 16 compute nodes with 2.4 GHz Intel Xeon processors (4 nodes
have dual processors, 12 have a single processor), a fast
533 MHz front-side bus, and dual-channel memory. The
single-processor nodes have 1 GB of memory; the dual-processor
nodes have 2 GB. Each node has an 80 GB disk.
- 1 server with a 2.4 GHz Intel Xeon processor, 512 MB of memory,
2 mirrored 180 GB system disks, and a large 12-disk IDE RAID-5
array of 180 GB disks on a 3ware hardware RAID controller. The
raw storage capacity across all 14 disks is 2.5 terabytes.
- All machines are connected with a 24-port Dell gigabit switch.
- The clustering software is OSCAR (Open Source Cluster
Application Resources) running on Red Hat GNU/Linux 7.3, along
with some parallel computing libraries, including PETSc (the
Portable, Extensible Toolkit for Scientific Computation).
Don't miss the photos of DOC.
Condor Queueing System
Condor is software for
harnessing the power of numerous machines while minimizing the
need for the user to worry about where a given job runs. While
you can just run a binary directly on Condor using the "vanilla
universe", it is far more equitable, efficient, and reliable to
run your jobs under the "standard universe". The standard
universe allows jobs to checkpoint, that is, to save the
in-memory state of the running program to disk so it can later
resume where it left off. This is extremely useful in case of a
power outage or maintenance, and for pausing long-running jobs
that would otherwise prevent shorter multi-processor jobs from
running. In addition, because the cluster nodes of DOC are
hidden behind a firewall, vanilla jobs are limited to the
cluster nodes, whereas standard jobs can also take advantage of
idle processing time on workstations that belong to our virtual
cluster, or Condor pool. The Condor pool includes the cluster
nodes, and more! Using the standard universe only requires
relinking your executable with "condor_compile".
Golden rule: If possible, please submit your job using the standard
universe!
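For concreteness, here is a minimal sketch of the standard
universe workflow; the program name, source file, and submit
file name are hypothetical stand-ins for your own:

    # Relink an ordinary C program with condor_compile, which
    # wraps your usual compile/link command:
    condor_compile gcc -o myprog myprog.c

    # myprog.submit, a minimal submit description file:
    universe   = standard
    executable = myprog
    output     = myprog.out
    error      = myprog.err
    log        = myprog.log
    queue

    # Hand the job to Condor:
    condor_submit myprog.submit

After submitting, condor_q shows the state of your jobs, and the
log file records events such as checkpoints and evictions.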
MPI-based multi-processor
jobs and Java jobs also have their own "universes". MATLAB jobs
can be run by submitting a shell script as the executable (do not
make "matlab" itself the executable), as sketched below.
For an overview of Condor, take a look at this tutorial.
Funding
This material is based upon work supported by the National
Science Foundation under Grant No. 0305719. Any opinions,
findings, and conclusions or recommendations expressed in this
material are those of the author(s) and do not necessarily
reflect the views of the National Science Foundation.