[Hpc-forum] MPI problems

Ferenc Bartha barthaf at sol.cc.u-szeged.hu
Fri Sep 27 08:36:25 CEST 2013


Again, only a partial assessment:

> Error: NCE_PACKAGES not set
module use /opt/nce/modulefiles alone is not enough; you also need
module load nce/global after it.
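
As a rough illustration (not taken from the cluster documentation), the two
commands could sit at the top of a job script as sketched below. The path
and module name are the ones quoted in this thread; nce_env_ok is a
hypothetical helper added here only to check that NCE_PACKAGES ended up set.

```shell
#!/bin/bash
# Sketch of the environment setup at the top of a job script.
# Assumption: the path and module name are those quoted in this thread.

# Hypothetical helper: succeeds only if NCE_PACKAGES is set and non-empty.
nce_env_ok() {
    if [ -z "${NCE_PACKAGES:-}" ]; then
        echo "NCE_PACKAGES is not set; was nce/global loaded?" >&2
        return 1
    fi
}

# `module` only exists where Environment Modules is installed,
# so guard the calls to keep the sketch runnable elsewhere.
if type module >/dev/null 2>&1; then
    module use /opt/nce/modulefiles   # make the NCE modulefiles visible
    module load nce/global            # then load the global NCE settings
fi

# A real job script might `exit 1` here instead of only warning.
nce_env_ok || echo "warning: continuing without the NCE environment" >&2
```

The check mirrors the "Error: NCE_PACKAGES not set" message: if the variable
is still empty after the two module commands, the warning is printed before
any mpirun call is attempted.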

There is also something suspicious about how two MPI programs started in
the background from a single job may behave. I would rather not go into
that now, but I advise you not to use that setup in the meantime.

Regards, Feri
------------:
Dr. Ferenc BARTHA, tel: 62/54-6821, E-mail: barthaf at sol.cc.u-szeged.hu
SKYPE: ferenc.bartha, WWW: http://www.staff.u-szeged.hu/~barthaf/
SZTE DNT - High Performance Computing Group, 6725 Szeged, Szikra u. 2.
SZTE, Department of Medical Chemistry, 6720 Szeged, Dóm tér 8.

----- Original Message ----- 
From: "etele molnar" <molnar at fias.uni-frankfurt.de>
To: "Rőczei Gábor" <roczei at niif.hu>; "etele molnar" <etele.molnar at gmail.com>
Cc: <hpc-forum at listserv.niif.hu>
Sent: Friday, September 27, 2013 7:50 AM
Subject: Re: [Hpc-forum] MPI problems


Dear Gábor and everyone,

First of all, thank you very much for the replies and the help;
we are almost there.

module use /opt/nce/modulefiles
did indeed help, and the program now runs on 12, 24 and 48 slots as well.
However, there is now a new error that appears before the program even
starts. The program then runs (successfully), but the next call no longer
starts and the job is aborted.

I called the current test programs from bash like this:

mpirun -np 48 ./ program ... ;
mpirun -np 48 ./ program ... ;
...
wait


Error: NCE_PACKAGES not set
...
Error: NCE_PACKAGES not set

real    11m7.497s
user    39m5.807s
sys    0m1.416s
[r1i0n6:09241] opal_os_dirpath_create: Error: Unable to create the
sub-directory
(/scratch/tmp/425790.1.parallel.q/openmpi-sessions-emolnar at r1i0n6_0/39007)
of
(/scratch/tmp/425790.1.parallel.q/openmpi-sessions-emolnar at r1i0n6_0/39007/0/0),
mkdir failed [1]
[r1i0n6:09241] [[39007,0],0] ORTE_ERROR_LOG: Error in file
util/session_dir.c at line 106
[r1i0n6:09241] [[39007,0],0] ORTE_ERROR_LOG: Error in file
util/session_dir.c at line 399
[r1i0n6:09241] [[39007,0],0] ORTE_ERROR_LOG: Error in file
ess_hnp_module.c at line 320
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

   orte_session_dir failed
   --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[r1i0n6:09241] [[39007,0],0] ORTE_ERROR_LOG: Error in file
runtime/orte_init.c at line 128
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

   orte_ess_set_name failed
   --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[r1i0n6:09241] [[39007,0],0] ORTE_ERROR_LOG: Error in file orterun.c at
line 694


Thanks in advance,
Regards,
e

On 25-Sep-13 11:17 AM, Rőczei Gábor wrote:
> Dear Etele,
>
>> I recompiled the program with mpicxx or mpic++ (gcc)
>> and tried to run it, but unfortunately I still get an error message.
>>
>> I used mpirun with 1 job, 8G of memory and np=12:
>>
>> Warning: Permanently added 
>> '[r1i1n3.ice.debrecen.hpc.niif.hu]:58158,[10.148.0.21]:58158' (RSA) to 
>> the list of known hosts.
>> Warning: Permanently added 
>> '[r1i0n15.ice.debrecen.hpc.niif.hu]:45141,[10.148.0.17]:45141' (RSA) to 
>> the list of known hosts.
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>> 'openmpi/1.6.3-gcc-4.7.2'
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>> 'openmpi/1.6.3-gcc-4.7.2'
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>> 'gcc/4.7.2'
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>> 'gcc/4.7.2'
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>> 'openmpi/1.6.3-gcc-4.7.2'
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>> 'gcc/4.7.2'
>> --------------------------------------------------------------------------
>> WARNING: It appears that your OpenFabrics subsystem is configured to only
>> allow registering part of your physical memory.  This can cause MPI jobs 
>> to
>> run with erratic performance, hang, and/or crash.
>>
>> This may be caused by your OpenFabrics vendor limiting the amount of
>> physical memory that can be registered.  You should investigate the
>> relevant Linux kernel module parameters that control how much physical
>> memory can be registered, and increase them to allow registering all
>> physical memory on your machine.
>>
>> See this Open MPI FAQ item for more information on these Linux kernel 
>> module
>> parameters:
>>
>>      http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>
>>    Local host:              r1i0n14
>>    Registerable memory:     32768 MiB
>>    Total memory:            49143 MiB
>>
>> Your MPI job will continue, but may be behave poorly and/or hang.
>> --------------------------------------------------------------------------
>> [r1i0n14:21422] 47 more processes have sent help message 
>> help-mpi-btl-openib.txt / reg mem limit low
>> [r1i0n14:21422] Set MCA parameter "orte_base_help_aggregate" to 0 to see 
>> all help / error messages
> I managed to figure out what the problem was. The mlx4_core kernel
> module configuration on the Debrecen CN machines needs this parameter:
>
> options mlx4_core log_mtts_per_seg=5
>
> Since not many parallel jobs were running in Debrecen this morning, I
> had the opportunity to reboot most of the CN machines so that this
> setting takes effect. The ones I could not reboot yet are now in
> disabled state until the jobs running on them finish.
>
>> Furthermore, if I try to run the same program with mpirun.sge
>> (it was not recompiled; the gcc build was kept), I get the same
>> error message, only much longer, and the job stops immediately...
> Please do not use mpirun.sge for OpenMPI jobs; it only works with SGI
> MPT.
>
> With OpenMPI you have to use mpirun. Example:
>
> #!/bin/bash
> #$ -N CONNECTIVITY
> #$ -pe mpi 120
>
> mpirun -np $NSLOTS ./connectivity -v
>
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>> 'openmpi/1.6.3-gcc-4.7.2'
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>> 'gcc/4.7.2'
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>> 'openmpi/1.6.3-gcc-4.7.2'
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>> 'openmpi/1.6.3-gcc-4.7.2'
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>> 'gcc/4.7.2'
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>> 'gcc/4.7.2'
> You should put this in your .bashrc file:
>
> module use /opt/nce/modulefiles
>
>> Two last "theoretical" questions.
>> For example: a program needs 4 GB of memory. If I want to run 2
>> programs in parallel (./ program & ./ program &),
>> should I request 2*4 GB of memory or only 4?
>>> #$ -l h_vmem=8G or 4G
> With h_vmem you specify how much memory is needed for 1 slot.
> So: #$ -l h_vmem=4G
>
>> The same program in an MPI version, say on 6 slots:
>> mpirun -np 6 program
>>
>> In that case:
>>> #$ -l h_vmem=4G or 6*4=24G
> Solution:
>
>   #$ -l h_vmem=4G
>
> Gábor
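
To make the per-slot rule concrete, here is a minimal sketch that is not
part of the original messages. It assumes the scheduler charges h_vmem once
per slot, which is how Gábor describes it above. total_vmem_gb is a
hypothetical helper name used only for this illustration.

```shell
#!/bin/bash
# Sketch: total memory implied by a per-slot h_vmem request.
# Assumption: h_vmem is counted once per slot, as described in this thread.

# total_vmem_gb SLOTS PER_SLOT_GB -> prints SLOTS * PER_SLOT_GB
total_vmem_gb() {
    local slots=$1 per_slot_gb=$2
    echo $(( slots * per_slot_gb ))
}

# The MPI case from the thread: 6 slots with h_vmem=4G each.
echo "h_vmem=4G on 6 slots -> $(total_vmem_gb 6 4)G in total"
```

Under that assumption the request line carries the per-slot figure
(#$ -l h_vmem=4G), not the product; the multiplication is implied by the
number of slots requested with -pe.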


_______________________________________________
Hpc-forum mailing list
Hpc-forum at listserv.niif.hu
https://listserv.niif.hu/mailman/listinfo/hpc-forum 



