[FLASH-USERS] Flash 4 memory errors on Stampede

Rukmani Vijayaraghavan vijayar2 at illinois.edu
Mon Aug 18 19:22:37 EDT 2014


Hi,

I've been having problems running large FLASH jobs (more than 7000 cores)
on Stampede. I frequently see the following error message:

********************************************************************************************************
[c499-401.stampede.tacc.utexas.edu:mpispawn_285][readline] Unexpected End-Of-File on file descriptor 7. MPI process died?
[c499-401.stampede.tacc.utexas.edu:mpispawn_285][mtpmi_processops] Error while reading PMI socket. MPI process died?
[c499-401.stampede.tacc.utexas.edu:mpispawn_285][child_handler] MPI process (rank: 0, pid: 9819) terminated with signal 9 -> abort job

*******************************************************************************************************
This error message is fairly opaque, and I see no other errors in the FLASH
log files or other output files. Since signal 9 is SIGKILL, it looks like
some of the nodes are running out of memory. Sometimes the issue goes away
if I decrease the load per process (e.g., lower pt_maxPerProc or maxblocks)
or run my job on fewer cores per node (e.g., 12 of the 16 cores on each node
rather than all 16). However, I haven't been able to find a reliable fix or
figure out what triggers these errors. Have others running FLASH encountered
them? I'd appreciate any help.
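
To illustrate why running fewer ranks per node might help, here is a
minimal sketch of the per-rank memory budget. It assumes 32 GB of memory
per node (my assumption for these nodes, not something the error output
reports) and that memory is shared evenly among ranks. The function name
and the reserved_gb parameter are hypothetical, and actual FLASH memory use
depends on maxblocks, pt_maxPerProc, and the units compiled in.

```python
def memory_per_rank_gb(node_memory_gb, ranks_per_node, reserved_gb=0.0):
    """Return the memory (GB) available to each MPI rank on one node.

    Assumes the node's memory is split evenly among ranks after setting
    aside reserved_gb for the OS and other overhead (hypothetical knob).
    """
    if ranks_per_node <= 0:
        raise ValueError("ranks_per_node must be positive")
    usable_gb = node_memory_gb - reserved_gb
    if usable_gb <= 0:
        raise ValueError("reserved_gb must be smaller than node_memory_gb")
    return usable_gb / ranks_per_node


# Assumed node size; adjust to match the actual hardware.
NODE_MEMORY_GB = 32.0

for ranks in (16, 12):
    budget = memory_per_rank_gb(NODE_MEMORY_GB, ranks)
    print(f"{ranks} ranks/node -> {budget:.2f} GB per rank")
```

Under these assumptions, going from 16 to 12 ranks per node raises the
budget from 2.00 GB to about 2.67 GB per rank, which would be consistent
with the problem sometimes going away at 12 ranks per node.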

Best,

Rukmani


-- 
Rukmani Vijayaraghavan
PhD Student
Dept. of Astronomy
University of Illinois