Michelle Goodstein, Michael Ashley-Rollman, Paul Zagieboylo
Introduction
Our project focused on parallelizing the Open Dynamics Engine (ODE), a physics simulation engine. Our hope was to use parallelization to achieve a significant reduction in simulation running time, without inventing new algorithms or changing the outcome of the simulation.
Open Dynamics Engine (ODE)
ODE is an open source physics engine for simulating rigid body dynamics.
It is used in many games and 3D simulation tools to provide physics simulation support. As games strive to be more realistic, physics simulation becomes increasingly important for believably modeling everything from complex scenarios, such as troops under mortar attack, to simple ones, such as dropped items falling to the ground in a natural manner.
At present, ODE is single-threaded. This is a problem for 3D simulation tools, as it prevents them from gaining significant benefit from running on a supercomputer, a cluster, or even just a multi-processor machine. This can be extremely inconvenient and frustrating for people running these simulations, which frequently take a very long time to execute. The DPRSim simulator for dynamic physical rendering (DPR), in particular, is known to suffer from this problem. Also, while less urgent, a multi-processor capable version of ODE would greatly benefit games in the long run. While most games are presently single-threaded, the move towards multi-core processors rather than faster processors will require games to become multi-threaded in order to benefit from newer hardware. As this happens, it will become necessary to have a multi-processor capable physics simulator.
In addition, ODE is used by the Claytronics project to model the dynamics of how catoms affect each other. Achieving a general speedup would allow Claytronics to simulate larger numbers of catoms at a time.
Approach
We decided to use multi-threading as our approach to parallelization, rather than integrating ODE into an OpenMP framework. One motivation was ease of use: ODE was already built to link against a standard pthread library, even though the original code never made any pthread calls. This meant we could incorporate multi-threading without significantly changing how the code is compiled. In addition, most pthread libraries can take advantage of multi-core machines, such as an Intel Core Duo laptop, to run each thread on a separate core. Debugging was therefore simple, since we could run the code natively on our laptops and then test it later on 8-way machines.
Once we had decided on the tools for parallelization, we focused on profiling the code using GNU gprof and identifying the areas where the most time was spent. Roughly, there were two such areas: collision detection, and solving instances of the linear complementarity problem, essentially a linear algebra problem. We decided to focus our parallelization efforts on collision detection. ODE provides different functionality for different kinds of spaces. Different spaces store objects in slightly different ways and can behave in slightly more sophisticated manners for collision detection. Since all of our examples operated in a HashSpace, meaning a hash table was used in collision detection, we focused our attention on its methods only.
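To make the idea of a hash space concrete, the following is a minimal, schematic sketch of spatial hashing: each object's bounding-box cells are hashed into buckets, so only objects sharing a bucket need to be tested against each other. The function name and constants below are illustrative and are not taken from ODE's actual HashSpace implementation.
\begin{verbatim}
// Schematic sketch of spatial hashing -- not ODE's actual HashSpace code.
#include <cstdint>

// Hash one integer grid cell (ix, iy, iz) to a bucket index. The large
// prime multipliers are a common choice for spatial hashing; objects whose
// bounding boxes overlap the same cell land in the same bucket and become
// candidate collision pairs.
static uint32_t cellHash(int32_t ix, int32_t iy, int32_t iz, uint32_t numBuckets) {
    uint32_t h = 73856093u * static_cast<uint32_t>(ix)
               ^ 19349663u * static_cast<uint32_t>(iy)
               ^ 83492791u * static_cast<uint32_t>(iz);
    return h % numBuckets;
}
\end{verbatim}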
Related Work
PhysX is a parallel physics simulator, but it is not open source. ODE, on the other hand, is open source but is not readily available in a parallel form. One paper pointed out the desirable qualities a parallelized physics engine would provide, but its contributions were not immediately applicable to ODE [1].
Contributions
One of the main contributions of our work is profiling ODE and identifying the areas where parallelization would do the most good. We also contribute a methodology for parallelizing ODE that preserves simulation results and identifies the sections that can run in parallel, along with a partial implementation of these ideas that demonstrates their potential.
Profiling the Code
We used the GNU gprof utility to profile our code and identify areas where we spent the most time. At a high level, there were two types of calculations that were consuming most of our time: collision detection, and solving instances of the linear complementarity problem.
Parallelizing ODE
We identified two good candidates for parallelization within ODE: collision detection, and solving instances of the linear complementarity problem.
Collision Detection
There are two types of collision detection in ODE. The first, collide(), deals with colliding all objects in a space against each other. The second, collide2(), deals with colliding one object against all the others in a space. Different simulations rely on one form of collision detection more than the other.
We decided to use a static approach to work allocation. We allocated n threads. Once spawned, each child thread entered a run function that did three things. First, it waited at a barrier until it was needed, to avoid threads starting before their arguments had been initialized. Then, it ran a parallel version of collision detection (we wrote separate functions for collide() and collide2()). Finally, it waited at a second barrier, as a way to signal the main thread that all data was available for it to work on. These three steps were repeated in a loop until the program halted.
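A minimal sketch of this run loop is shown below. It assumes POSIX pthread barriers; the names (WorkerArgs, collideWorker, parallelCollide) are illustrative stand-ins rather than the identifiers used in our code.
\begin{verbatim}
// Sketch of the per-thread run loop described above (illustrative names;
// assumes POSIX pthread barriers).
#include <pthread.h>

struct WorkerArgs {
    pthread_barrier_t* startBarrier; // main thread signals "arguments are ready"
    pthread_barrier_t* doneBarrier;  // workers signal "results are ready"
    volatile bool*     shutdown;     // set by the main thread when the program halts
    // ... pointers to this thread's share of the collision work
};

static void* collideWorker(void* p) {
    WorkerArgs* args = static_cast<WorkerArgs*>(p);
    for (;;) {
        // 1. Wait until the main thread has initialized this iteration's work.
        pthread_barrier_wait(args->startBarrier);
        if (*args->shutdown)
            break;

        // 2. Run this thread's slice of the parallel collision detection
        //    (separate variants exist for collide() and collide2()).
        // parallelCollide(args);

        // 3. Signal the main thread that this thread's results are ready.
        pthread_barrier_wait(args->doneBarrier);
    }
    return nullptr;
}
\end{verbatim}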
One of the problems we initially faced when attempting to parallelize these functions was the data structures. In both cases, the objects being collided were stored in a linked list. While this is efficient in a sequential program, it makes parallelization harder. The high-level changes to both collide() and collide2() are described below.
collide2()
Collide2() accepts a pointer l to a list of objects residing in one space, as well as an object o to collide against all of them. This was originally accomplished in a single linear loop, and the order of collisions was not important. We parallelized at the loop level: we first found pointers to the first n elements in the space that were going to be collided against. Each of the n threads then received o as well as a pointer l_i, where l_i points to the ith element of l. Thread i set l' = l_i, collided o against the element at the head of l', and then advanced l' by n nodes; if this reached NULL because the list ended, the loop terminated. Once this was done, the child threads waited on a barrier, and once all the children had completed, the master thread resumed.
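A sketch of this strided traversal, under the assumptions above, follows; GeomNode and collidePair() are hypothetical stand-ins, not ODE types.
\begin{verbatim}
// Sketch of the strided list walk used by the parallel collide2():
// thread i starts at the i-th element of the list and then advances n
// nodes at a time.
struct GeomNode {
    GeomNode* next;
    // ... geometry data
};

// Narrow-phase test between two objects (placeholder body for this sketch).
static void collidePair(GeomNode* /*a*/, GeomNode* /*b*/) {}

// 'o' is the object being collided against the list, 'start' is this
// thread's first element (the i-th node of l), and 'n' is the thread count.
static void collide2Strided(GeomNode* o, GeomNode* start, int n) {
    GeomNode* cur = start;
    while (cur != nullptr) {
        collidePair(o, cur);                 // test o against this element
        for (int step = 0; step < n && cur != nullptr; ++step)
            cur = cur->next;                 // skip elements owned by other threads
    }
}
\end{verbatim}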
collide()
Collide() does substantially more work. First, it has to determine all the elements in the world. Then, it builds a hash table and stores the elements in it, provided they are not too big. Elements considered oversized are instead placed in a special linked list of ``big-boxes'' and handled as a special case at the end, under the assumption that most elements will not be too big. The algorithm breaks down into five steps; for each one below, we describe the original sequential behavior and our parallel version.
Step 1: Create linked lists of objects in the space and fill in all the correct fields
In the sequential version of ODE, this happens in the main thread in one large for loop, with only one linked list.
In our parallelized version of ODE, we create n linked lists, one per thread. The main thread creates the lists but does not fill out all the fields. We then create the child threads if they do not already exist, and the child threads fill in the fields while the main thread waits. At this point, we identify which objects are oversized and place them in a separate list, which we make available globally.
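A sketch of this per-thread partitioning follows. The names (Item, partitionRoundRobin) are illustrative, and round-robin dealing is one plausible way of splitting the objects evenly rather than the exact split our code uses.
\begin{verbatim}
// Sketch of step 1 (illustrative, not ODE's code): the main thread deals
// the space's objects out across n per-thread lists; the worker threads
// later fill in each item's derived fields, and oversized objects are
// then moved onto a separate, globally visible list.
struct Item {
    void* geom;      // the underlying object
    Item* next;      // next item in the owning thread's list
    bool  oversized; // determined once the fields have been filled in
    // ... other derived fields (bounds, hash levels, ...)
};

// Deal 'count' items out round-robin across 'n' per-thread list heads.
static void partitionRoundRobin(Item* items, int count, Item** heads, int n) {
    for (int t = 0; t < n; ++t)
        heads[t] = nullptr;
    for (int i = 0; i < count; ++i) {
        int t = i % n;              // owning thread for this item
        items[i].next = heads[t];   // push onto that thread's list
        heads[t] = &items[i];
    }
}
\end{verbatim}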
Step 2: Create the hash table data structure and an array to track whether two objects have already been collided against each other
In the sequential version of ODE, the main thread proceeds to create the hash table and an array that tracks whether two objects have been collided against each other, and then fills in the hash table. Keeping this part of the code sequential seemed like a good idea.
In our version of ODE, once all the child threads have filled out the fields for the objects they are tracking, the main thread takes over and creates the hash table data structure. The hash table is only used to track collisions between objects that are not considered oversized. Therefore, while the master thread creates and fills in the hash table, the children can proceed with step 4 and collide all their non-oversized elements against the oversized elements without waiting.
All threads then wait at a barrier until the children finish colliding their non-oversized objects against the oversized objects and the master thread finishes filling in the hash table.
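The following sketch shows how the main thread's step 2 can overlap with the child threads' step 4, with both sides meeting at a shared barrier. All names are illustrative and the bodies are placeholders.
\begin{verbatim}
// Sketch of overlapping step 2 (main thread) with step 4 (child threads);
// illustrative names with placeholder bodies, not our actual code.
#include <pthread.h>

static void hashTableInsertAll()         { /* step 2: build and fill the hash table */ }
static void collideAgainstBig(int /*t*/) { /* step 4: non-oversized vs. oversized   */ }

// The main thread and the n child threads share one barrier for this phase.
static void mainThreadPhase(pthread_barrier_t* phaseBarrier) {
    hashTableInsertAll();                // only touches non-oversized objects
    pthread_barrier_wait(phaseBarrier);  // meet the children once step 4 is done
    // ... the main thread can now run step 5 while the children run step 3
}

static void childThreadPhase(int threadIndex, pthread_barrier_t* phaseBarrier) {
    collideAgainstBig(threadIndex);      // does not need the hash table
    pthread_barrier_wait(phaseBarrier);  // meet the main thread once the table is filled
    // ... this child can now run step 3 (hash-table collisions)
}
\end{verbatim}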
Step 3: Collide the elements in the hash table against each other
In the sequential version of ODE, this proceeds in a for loop. If two elements have not yet been tested for a collision, we collide them and update the array that tracks collisions.
In the parallel version we implemented, the child threads are responsible for these collisions: each thread collides the non-oversized objects it owns against each other, and then collides its non-oversized objects against every other thread's non-oversized objects. We created O(n) locks, n being the number of threads, to protect the array that tracks collisions, essentially locking rows in groups (a sketch of this locking scheme follows this step).
While this was happening, we observed that the main thread could execute step 5, since it only involves the oversized objects. All threads wait at a barrier until all children have finished handling their assigned collisions and the master thread completes step 5.
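The row-group locking mentioned above can be sketched as follows. The names are illustrative, assigning rows to locks by modulo is one plausible grouping, and symmetric handling of the (j, i) entry is elided for brevity.
\begin{verbatim}
// Sketch of row-group locking for the "already collided" array
// (illustrative names; assumes POSIX pthread mutexes).
#include <pthread.h>

struct TestedMatrix {
    bool*            tested;     // numObjects x numObjects entries, row-major
    int              numObjects;
    int              numLocks;   // equal to the number of worker threads, n
    pthread_mutex_t* locks;      // locks[k] guards rows r with r % numLocks == k
};

// Atomically check-and-mark the pair (i, j); returns true if the pair had
// not been tested yet, i.e., the caller should perform the collision test.
static bool claimPair(TestedMatrix* m, int i, int j) {
    pthread_mutex_t* lock = &m->locks[i % m->numLocks];
    pthread_mutex_lock(lock);
    bool needed = !m->tested[i * m->numObjects + j];
    m->tested[i * m->numObjects + j] = true;
    pthread_mutex_unlock(lock);
    return needed;
}
\end{verbatim}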
Step 4: Collide the elements in the hash table against the oversized elements
Originally in ODE, this occurred after step 3. It now occurs in the child threads, in parallel with step 2. The larger elements are collided against all the non-oversized elements in parallel.
Step 5: Collide all oversized elements against each other
ODE originally had this occur at the very end, with a note that hopefully there were not too many oversized elements. Assuming there are not too many of the larger elements, this should not take much time. Since the changes it makes are independent, this step now occurs in parallel with step 3.
Linear Complementarity Problem (LCP)
An instance of the linear complementarity problem (LCP) is solved in ODE when we need to determine where objects are moving to. We spent time researching LCP and possible parallel methods of solving it. Most of the literature on LCP is older and harder to obtain. We did find a paper by Medhi [2] proposing a possible solution, but implementing it was outside the scope of this project. In particular, Medhi proposes a parallel version of Lemke's algorithm; however, ODE does not use Lemke's algorithm, and instead uses one attributed to Dantzig. We were unable to determine what the original Dantzig algorithm was, or whether it handled cases that Medhi's algorithm did not cover.
Because of this, implementing Medhi's algorithm would have required a complete algorithmic change, with new data structures. Doing so while ensuring that no mistakes were introduced and that simulation results remained identical seemed like a project in its own right, and outside the scope of this one.
Experimental Setup
In order to profile ODE and evaluate potential methods for parallelizing the engine, a number of example cases were selected to exercise different aspects of the engine. These examples were meant to be representative of typical applications that might make use of a physics engine. They exercise large numbers of objects in the presence of gravity, large numbers of objects in the presence of a complex set of forces, and small numbers of objects in the presence of a complex set of forces. These examples were the tennis balls, magnetic tennis balls, and buggy examples.
Experiments
Tennis balls
The tennis ball application made use of the DPRSim [3] modular robot simulator. The simulator uses the ODE engine to support a world with many spherical robots. It supports running a program on each of the robots to direct their operations and allows the robots to turn on and off magnetic monopoles on their surfaces to cause the robots to be attracted to one another. For this example, no program was run on the individual catoms and all of the magnetic monopoles were disabled, to simulate a collection of tennis balls. The worlds used were made up of 1,000--10,000 spherical robots arranged in a rectangular prism. Tennis balls arranged in this pattern are not stable, so when the experiment is run the structure collapses, with the tennis balls rolling out across the floor. These experiments were run for 1000 time steps before stopping. This application stresses ODE's ability to deal with a large number of interacting objects in the presence of gravity.
Magnetic tennis balls
The magnetic tennis balls application was very similar to the tennis balls application. It consisted of running experiments on the same worlds in the same simulator, but this time the magnetic monopoles on the robots were enabled, causing the balls to stick together. This is enough to prevent the stack from collapsing, although the balls do move around a bit before settling into an imperfect rectangular prism. This application stresses ODE's ability to deal with a large number of interacting objects in the presence of a large collection of forces.
Buggy
The buggy example is packaged with ODE and consists of a vehicle made from a rectangle and some wheels. The vehicle can be accelerated and decelerated by applying a rotational force to the wheel and can be steered by rotating the wheel in the ground plane. This example demonstrates a small number of objects interacting under a set of complex forces.
Running experiments
Collecting profiles
To make use of gprof, each experiment was compiled with gprof support enabled and then run for at least a minute to collect enough data. For the DPRSim examples, the tests were precisely controlled through the use of experiment configuration files, and each experiment was run for 10--1000 timesteps, depending on the number of modules present. It was non-trivial to create uniform experiments with the buggy example, so the buggy was driven around in figure eights at varying velocities until a minute had passed.
Timing Experiments
For timing experiments, the unix time command was used. This worked well for the DPRSim-based example because the simulation starts when the program starts, runs for a consistent number of time steps, and then the program automatically terminates when the simulation ends. This allowed us to avoid instrumenting the tests directly while generating consistent results. The buggy test was not timed because its implementation did not facilitate running consistent experiments and the portions of the parallelization plans that were implemented were not prevalent in the buggy example.
Results
In running the experiments on the tennis ball worlds, gprof revealed that collision detection accounted for 40\% of the CPU utilization on a world with 8000 tennis balls. The parallel version of the algorithm showed a substantial linear speedup in collision detection, as shown in Figure 1. This led to a very noticeable improvement in the performance of the tennis ball application when run on large worlds, as shown in Figure 2. Unfortunately, smaller worlds with only 1000 objects showed only a very modest improvement.
Figure 1: Speedup in Collision Detection
Figure 2: Speedup in Entire Application
Surprises and Lessons Learned
ODE is a large system and not very well documented. Learning enough about the system to be able to incorporate profiling, and to understand what the profiling was suggesting, was a challenge. Once we had profiled the functions that were the most likely to benefit from parallelization, the data structures used became a problem. A lot of the code uses strongly sequential data structures (e.g., linked lists that must be walked), which makes parallelization more difficult.
The programs available for profiling, such as gprof, do not produce easy-to-understand output. Profiling and identifying areas well suited for parallelization was much more time consuming than expected. Furthermore, identifying bottlenecks and evaluating lock contention are not easy with these tools, as the granularity is too coarse. This made synchronization particularly challenging, because we had a hard time determining what negative impacts our synchronization methods were having on performance. Also, we not only had to synchronize across our own threads in ODE, but also needed to make the callbacks provided to ODE by the client application thread-safe.
We learned a lot about synchronization and thread safety in implementing this project. At first, our concern was ensuring that launching all these threads still maintained correct program behavior, and we needed two barriers to accomplish that. Later, when we wanted to achieve more parallelism, we had to incorporate two more barriers or else determine how to build and fill a hash table in parallel. This meant that both the main thread and its child threads initially had two back-to-back barrier calls, wasting a lot of time with nothing between them. The cost of using four barriers per collide() call was balanced in the end by achieving much higher degrees of parallelism. Also, later examination allowed us to move certain independent blocks of code between two ``empty'' barrier\_wait calls, achieving even more parallelism.
Synchronization was also a problem with one shared array, which tracked whether we had already collided two objects. Initially, our approach was simply to wrap the array with a single lock. However, while previously the only possible contention had been whether we would compute the collision of object i with object j twice, now all the threads had to wait for one lock in order to continue, introducing a lot of needless wait time. Creating n locks, one per thread, and adding slightly more logic so that we requested the lock for row i whenever object i might collide with object j, reduced this slowdown considerably. The most surprising thing was that having n^2 locks, which theoretically seems like it would reduce contention the most by reducing the demand for any one lock to a constant, was actually slower than the sequential case.
Conclusions and Future Work
Our work offers several contributions. First, we identified the main sections of ODE that would benefit from parallelization: collision detection and the LCP solver. Second, we created a detailed method for parallelizing collision detection, in addition to identifying a possible way to parallelize the LCP solver. Finally, we implemented our approach and were able to show a speedup as a result of our parallelization, without changing the results of the simulations. This is the first parallel implementation of ODE that we are aware of, making our speedup significant.
There are many possible directions for future work. First, implementing the parallel version of LCP would likely yield a speedup. In addition, our collision detection parallelization scheme is static; there are n threads, each gets an approximately equal share of objects to process, and each thread is only given one rather large task to complete per iteration. Using a more dynamic allocation, by chopping the task unit into smaller pieces, may yield a further speedup if the overhead for dynamic allocation is sufficiently low. Also, different ways of statically allocating the threads could be investigated. Currently, we don't use the structure of the hash table in dividing up ownership of the objects; that could possibly be leveraged for additional speedup.
Finally, we focused on HashSpaces, since that was the space all of our test cases used. However, a better speedup might be possible using a more naive space. Starting with an algorithm that collides all objects against each other, instead of using the hash structure, might be easier to parallelize because there is less overhead to track and fewer barriers are required.
References
[1] J. Choi, D. Shin, and D. Shin. Research on Design and Implementation of Adaptive Physics Game Agent for 3D Physics Game. Lecture Notes in Computer Science, 4088:614, 2006.
[2] K. T. Medhi. Parallel pivotal algorithm for solving the linear complementarity problem. Journal of Optimization Theory and Applications, 69(2):285–296, 1991.
[3] Intel Corporation and Carnegie Mellon University. DPRSim: The Dynamic Physical Rendering Simulator. http://www.pittsburgh.intel-research.net/dprweb/, 2006.