[FLASH-USERS] Sedov scaling test

Artur Gawryszczak gawrysz at camk.edu.pl
Tue Feb 12 01:29:43 EST 2008


Hi Mateusz,

On Monday, 11 February 2008, mateuszr at umich.edu wrote:
> I am testing a very small Dell cluster (8 cpus total). The machine is a
> dual quad core Xeon with 2 GB per core. [...] The setup does not seem to
> scale with the number of processors at all. 

I noticed similar behavior. There was a significant speedup when I went 
from one core to two (the others were idle), some speedup with four cores, 
and almost none with all eight. The same setup (code and parameters) scaled 
well up to 8 processes on a cluster of dual Opterons with InfiniBand 
interconnects, despite the use of the self-gravity module (lots of 
communication, so scaling is significantly harder than in hydro).

I tried pinning threads to specific cores (taskset/schedtool) and found 
that the scaling problem is most likely due to insufficient bandwidth 
between the CPU cores and main memory. All four cores of a Quad Core Xeon 
share a single memory bus, so when two or more cores of the same physical 
CPU need to read or write main memory, only one can do the transfer and the 
others have to wait. This is a bottleneck of Intel systems and may in 
general be a problem for other multi-core CPUs. If a Dual Quad Core Intel 
system has a NUMA architecture, then some additional memory transfers are 
required for communication between cores on different physical CPUs.
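
To make the saturation argument concrete, here is a toy model in Python. It 
is only a sketch: it assumes the run is purely memory-bound, ignores caches 
and the second socket, and the function name and both bandwidth figures are 
made up for illustration, not measured on any machine.

```python
# Toy model: cores on one socket share a memory bus of fixed bandwidth.
# All numbers are hypothetical and only illustrate the saturation effect.

def modeled_speedup(cores, bus_bandwidth, per_core_demand):
    """Speedup over one core when the shared bus caps the total transfer rate."""
    wanted = cores * per_core_demand        # what the cores would like to move
    delivered = min(wanted, bus_bandwidth)  # the bus cannot deliver more
    return delivered / per_core_demand


BUS_GBS = 10.0    # hypothetical bandwidth of the shared bus, GB/s
CORE_GBS = 4.0    # hypothetical demand of one memory-bound process, GB/s

for cores in (1, 2, 4):
    print(cores, modeled_speedup(cores, BUS_GBS, CORE_GBS))
```

In this model the speedup grows linearly until the cores together ask for 
more than the bus can deliver, and stays flat after that.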

Things get worse when one needs many state variables stored in the 
unk(:,:,:,:,:) array (like dens, ener etc.). The PARAMESH developers decided 
to put the variable index first, so all state quantities for a given cell 
occupy a contiguous region in memory. For solvers that use only some of the 
quantities this results in poor CPU cache performance, because lots of 
unnecessary data has to be loaded due to the way the cache works. On systems 
where every core has its own bus to memory, the transfer overhead may be 
hidden by well-planned cache prefetch instructions. Systems with a shared 
bus, like the Quad Xeon, suffer more CPU stalls here. New PARAMESH versions 
have a utility to reorder unk and other arrays, but I have no idea how it 
can be used in FLASH.
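
The cost of the variable-first layout can be estimated by counting cache 
lines. The Python sketch below does that for a solver that reads a single 
variable in every cell. Everything in it is an assumption for illustration: 
64-byte cache lines, 8-byte reals, 16 variables, an array that starts on a 
cache-line boundary, and the spatial and block indices flattened into one 
cell index. The function name is hypothetical and nothing here is PARAMESH 
or FLASH code.

```python
# Count the 64-byte cache lines touched when one variable is read for every
# cell, under two array layouts. All sizes are illustrative assumptions.

CACHE_LINE = 64   # bytes per cache line (assumed)
REAL = 8          # bytes per double-precision value


def lines_touched(nvar, ncells, var_first):
    """Return the number of distinct cache lines read for variable 0."""
    lines = set()
    for cell in range(ncells):
        if var_first:
            # like unk(ivar, cell): all variables of one cell are adjacent
            index = cell * nvar
        else:
            # like unk(cell, ivar): one variable of all cells is adjacent
            index = cell
        lines.add(index * REAL // CACHE_LINE)
    return len(lines)


nvar, ncells = 16, 4096   # hypothetical: 16 variables, 16x16x16 cells
first = lines_touched(nvar, ncells, var_first=True)
last = lines_touched(nvar, ncells, var_first=False)
print(first, last, first / last)
```

With these numbers the variable-first layout touches 4096 cache lines and 
the variable-last layout 512, so eight times more data crosses the memory 
bus for the same useful values.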

-- 
Cheers,
        Artur


