[FLASH-USERS] Sedov scaling test
Artur Gawryszczak
gawrysz at camk.edu.pl
Tue Feb 12 01:29:43 EST 2008
Hi Mateusz,
On Monday, 11 February 2008, mateuszr at umich.edu wrote:
> I am testing a very small Dell cluster (8 cpus total). The machine is a
> dual quad core Xeon with 2 GB per core. [...] The setup does not seem to
> scale with the number of processors at all.
I noticed similar behavior. There was a significant speedup when I went
from one to two cores (the others were idle), some speedup when I tried
four cores, and almost none when I tried all eight. The same setup (code
and parameters) scaled well up to 8 processes on a cluster of dual
Opterons with InfiniBand interconnects, despite the use of the self-gravity
module (lots of communication, so scaling is significantly more difficult
than in hydro).
I tried pinning threads to specific cores (taskset/schedtool) and found
that the scaling problem is most likely due to insufficient bandwidth
between the CPU cores and main memory. All four cores of a quad-core Xeon
share a single memory bus, so when two or more cores of the same physical
CPU need to read or write main memory, only one can do the transfer and the
others have to wait. This is a bottleneck of Intel systems and may in
general be a problem for other multi-core CPUs. If a dual quad-core Intel
system has a NUMA architecture, then some additional memory transfers are
required for communication between cores on different physical CPUs.
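
For anyone who wants to repeat the pinning experiment from inside a script
rather than with taskset, here is a minimal sketch in Python. It assumes
Linux, where os.sched_setaffinity is available; the helper name pin_to_core
is made up for this example and is not part of FLASH or PARAMESH.

```python
import os


def pin_to_core(core):
    """Restrict the calling process to one logical core (Linux only)."""
    if not hasattr(os, "sched_setaffinity"):
        raise OSError("CPU affinity control is not available on this platform")
    allowed = os.sched_getaffinity(0)  # cores this process may currently use
    if core not in allowed:
        raise ValueError(f"core {core} is not in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(0, {core})  # pid 0 means "the calling process"
    return os.sched_getaffinity(0)  # report the affinity now in effect


if __name__ == "__main__":
    first = min(os.sched_getaffinity(0))
    print("now restricted to:", pin_to_core(first))
```

In an MPI run each process would pick a different core, for example to
compare two processes on the same physical CPU against two processes on
different CPUs. Which logical core numbers belong to which physical CPU is
machine specific, so check lscpu or /proc/cpuinfo first.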
Things get worse when one needs many state variables, which are stored in
the unk(:,:,:,:,:) array (dens, ener, etc.). The PARAMESH developers decided
to put the variable index first, so all state quantities for a given cell
occupy a contiguous region of memory. For solvers that use only some of the
quantities, this results in poor CPU cache performance, because a lot of
unnecessary data has to be loaded due to the way the cache works.
On systems where each core has its own bus to memory, the transfer overhead
may be hidden by well-planned cache prefetch instructions. Systems with a
shared bus, like the quad-core Xeon, suffer more CPU stalls here. Newer
PARAMESH versions have a utility to reorder unk and other arrays, but I have
no idea how it can be used in FLASH.
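
To make the cache argument concrete, here is a rough sketch in Python that
counts how many cache lines a sweep over one block has to touch for the two
index orders. The numbers are assumptions for illustration only: 64-byte
cache lines, 8-byte reals, an array that starts on a cache-line boundary,
16 variables per cell and 512 cells per block. The function lines_touched
is hypothetical and models only the addressing, not a real cache.

```python
def lines_touched(nvar, ncells, used, var_first, line_bytes=64, word_bytes=8):
    """Count distinct cache lines read when a sweep visits every cell
    but needs only the variables listed in `used`."""
    lines = set()
    for cell in range(ncells):
        for v in used:
            if var_first:
                # unk(var, cell): all variables of one cell are adjacent
                index = cell * nvar + v
            else:
                # unk(cell, var): one variable of all cells is adjacent
                index = v * ncells + cell
            lines.add(index * word_bytes // line_bytes)
    return len(lines)


if __name__ == "__main__":
    nvar, ncells = 16, 512  # assumed sizes, not taken from any FLASH setup
    for used in ([0], list(range(nvar))):
        first = lines_touched(nvar, ncells, used, var_first=True)
        last = lines_touched(nvar, ncells, used, var_first=False)
        print(len(used), "variable(s) used:", first, "vs", last, "cache lines")
```

Under these assumptions a solver that reads a single variable touches 512
cache lines with the variable index first and 64 with it last, so eight
times as much data crosses the memory bus. When all 16 variables are used,
both orderings touch the same 1024 lines and the layout makes no difference.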
--
Cheers,
Artur