[Hpc-forum] MPI problems

Rőczei Gábor roczei at niif.hu
Fri Sep 27 13:27:11 CEST 2013


On 2013.09.27., at 8:36, Ferenc Bartha wrote:

> Again, only a partial assessment:
> 
>> Error: NCE_PACKAGES not set
> module use /opt/nce/modulefiles alone is not enough; module load nce/global is also needed afterwards.
> 
> There is also something suspicious about how two MPI programs started in the background from a single job may behave. I will not go into it now, but I would advise you not to use that setup in the meantime.

I also think this is the problem.

I have now added this to /home/emolnar/.bashrc as well. It currently contains:

module use /opt/nce/modulefiles
module load nce/global 
module load openmpi/1.6.3-gcc-4.7.2  gcc/4.7.2

The HPC_run_cooperfrye_LHC2760_visc_all_serial.sh and HPC_run_cooperfrye_RHIC200_visc_all_serial.sh scripts started correctly.
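
A job script could also guard against the "NCE_PACKAGES not set" error before it calls mpirun. This is only a sketch: require_env is a hypothetical helper name, not part of the NCE setup, and it assumes that a successful module load nce/global exports NCE_PACKAGES, which the error message suggests but which is not confirmed here.

```shell
# Hypothetical helper: succeed only if the variable named in $1 is set
# and non-empty; otherwise print an error and return 1.
require_env() {
    # POSIX-compatible indirection: copy the value of the variable named in $1.
    eval "_value=\${$1:-}"
    if [ -z "$_value" ]; then
        echo "Error: $1 not set" >&2
        return 1
    fi
    return 0
}

# Usage in a job script (assumption: nce/global exports NCE_PACKAGES):
#   require_env NCE_PACKAGES || exit 1
```

Failing early this way stops the job before the first mpirun call rather than partway through the run.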

Etele,

Please test it yourself as well.
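
For the test, the concern about two MPI programs started in the background from one job can be avoided by running them strictly one after another. A minimal sketch, assuming each run is passed as one command string; run_in_order is a hypothetical helper and the program names in the usage comment are placeholders.

```shell
# Run each command string in order; stop at the first one that fails.
run_in_order() {
    for cmd in "$@"; do
        echo "starting: $cmd"
        # Each command string is executed by a child shell.
        sh -c "$cmd" || { echo "failed: $cmd" >&2; return 1; }
    done
}

# Usage (placeholders; NSLOTS is set by the scheduler inside a job):
#   run_in_order "mpirun -np $NSLOTS ./program_a" "mpirun -np $NSLOTS ./program_b"
```

Only one mpirun is active at any time, and a failed run stops the sequence so that later runs do not start in a broken environment.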

Gábor

> ----- Original Message ----- From: "etele molnar" <molnar at fias.uni-frankfurt.de>
> To: "Rőczei Gábor" <roczei at niif.hu>; "etele molnar" <etele.molnar at gmail.com>
> Cc: <hpc-forum at listserv.niif.hu>
> Sent: Friday, September 27, 2013 7:50 AM
> Subject: Re: [Hpc-forum] MPI problems
> 
> 
> Dear Gabor and everyone,
> 
> First of all, thank you very much for the answers and the help;
> we are almost there.
> 
> module use /opt/nce/modulefiles
> did indeed help, and the program now runs on 12, 24 and 48 slots as well,
> but there is a new error that appears before the program starts.
> The program then runs (successfully), but the next invocation
> does not start and the job aborts.
> 
> I invoked the current test programs from bash like this:
> 
> mpirun -np 48 ./ program ... ;
> mpirun -np 48 ./ program ... ;
> ...
> wait
> 
> 
> Error: NCE_PACKAGES not set
> ...
> Error: NCE_PACKAGES not set
> 
> real    11m7.497s
> user    39m5.807s
> sys    0m1.416s
> [r1i0n6:09241] opal_os_dirpath_create: Error: Unable to create the
> sub-directory
> (/scratch/tmp/425790.1.parallel.q/openmpi-sessions-emolnar at r1i0n6_0/39007)
> of
> (/scratch/tmp/425790.1.parallel.q/openmpi-sessions-emolnar at r1i0n6_0/39007/0/0),
> mkdir failed [1]
> [r1i0n6:09241] [[39007,0],0] ORTE_ERROR_LOG: Error in file
> util/session_dir.c at line 106
> [r1i0n6:09241] [[39007,0],0] ORTE_ERROR_LOG: Error in file
> util/session_dir.c at line 399
> [r1i0n6:09241] [[39007,0],0] ORTE_ERROR_LOG: Error in file
> ess_hnp_module.c at line 320
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>  orte_session_dir failed
>  --> Returned value Error (-1) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> [r1i0n6:09241] [[39007,0],0] ORTE_ERROR_LOG: Error in file
> runtime/orte_init.c at line 128
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>  orte_ess_set_name failed
>  --> Returned value Error (-1) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> [r1i0n6:09241] [[39007,0],0] ORTE_ERROR_LOG: Error in file orterun.c at
> line 694
> 
> 
> Thanks in advance
> Regards
> e
> 
> On 25-Sep-13 11:17 AM, Rőczei Gábor wrote:
>> Dear Etele,
>> 
>>> I recompiled the program with mpicxx or mpic++ (gcc)
>>> and am trying to run it, but unfortunately I still get an error message.
>>> 
>>> I used mpirun, 1 job, 8G memory, np=12:
>>> 
>>> Warning: Permanently added '[r1i1n3.ice.debrecen.hpc.niif.hu]:58158,[10.148.0.21]:58158' (RSA) to the list of known hosts.
>>> Warning: Permanently added '[r1i0n15.ice.debrecen.hpc.niif.hu]:45141,[10.148.0.17]:45141' (RSA) to the list of known hosts.
>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 'openmpi/1.6.3-gcc-4.7.2'
>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 'openmpi/1.6.3-gcc-4.7.2'
>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 'gcc/4.7.2'
>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 'gcc/4.7.2'
>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 'openmpi/1.6.3-gcc-4.7.2'
>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 'gcc/4.7.2'
>>> --------------------------------------------------------------------------
>>> WARNING: It appears that your OpenFabrics subsystem is configured to only
>>> allow registering part of your physical memory.  This can cause MPI jobs to
>>> run with erratic performance, hang, and/or crash.
>>> 
>>> This may be caused by your OpenFabrics vendor limiting the amount of
>>> physical memory that can be registered.  You should investigate the
>>> relevant Linux kernel module parameters that control how much physical
>>> memory can be registered, and increase them to allow registering all
>>> physical memory on your machine.
>>> 
>>> See this Open MPI FAQ item for more information on these Linux kernel module
>>> parameters:
>>> 
>>>     http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>> 
>>>   Local host:              r1i0n14
>>>   Registerable memory:     32768 MiB
>>>   Total memory:            49143 MiB
>>> 
>>> Your MPI job will continue, but may be behave poorly and/or hang.
>>> --------------------------------------------------------------------------
>>> [r1i0n14:21422] 47 more processes have sent help message help-mpi-btl-openib.txt / reg mem limit low
>>> [r1i0n14:21422] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>> I managed to figure out what the problem was. The mlx4_core kernel module configuration on the Debrecen CN machines needs this parameter:
>> 
>> options mlx4_core log_mtts_per_seg=5
>> 
>> Since not many parallel jobs were running in Debrecen this morning, I was able to reboot most of the CN machines so that this setting takes effect. The ones I could not reboot yet I have set to disabled until the jobs running on them finish.
>> 
>>> Furthermore, if I try to run the same program with
>>> 
>>> mpirun.sge (it was not recompiled; the gcc build was kept), then I likewise
>>> get the same error message, only much longer, and the job stops immediately...
>> Please do not use mpirun.sge for OpenMPI jobs; it is only suitable for SGI MPT.
>> 
>> For OpenMPI you need to use mpirun. Example:
>> 
>> #!/bin/bash
>> #$ -N CONNECTIVITY
>> #$ -pe mpi 120
>> 
>> mpirun -np $NSLOTS ./connectivity -v
>> 
>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 'openmpi/1.6.3-gcc-4.7.2'
>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 'gcc/4.7.2'
>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 'openmpi/1.6.3-gcc-4.7.2'
>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 'openmpi/1.6.3-gcc-4.7.2'
>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 'gcc/4.7.2'
>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 'gcc/4.7.2'
>> You should add this to your .bashrc file:
>> 
>> module use /opt/nce/modulefiles
>> 
>>> Two last "theoretical" questions
>>> For example: a program needs 4 GB of memory. If I want to run 2 programs in parallel (./ program & ./ program &),
>>> should I request 2*4 GB of memory or just 4?
>>>> #$ -l h_vmem=8G or 4 GB
>> With h_vmem you specify how much memory is needed per slot. So: #$ -l h_vmem=4G
>> 
>>> The same program in its MPI version, say on 6 slots
>>> mpirun -np 6 program
>>> 
>>> then
>>>> #$ -l h_vmem=4G or 6*4=24 GB
>> Answer:
>> 
>>  #$ -l h_vmem=4G
>> 
>> Gábor
> 
> 
> _______________________________________________
> Hpc-forum mailing list
> Hpc-forum at listserv.niif.hu
> https://listserv.niif.hu/mailman/listinfo/hpc-forum 
> 
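
On the h_vmem question in the quoted message: since h_vmem is given per slot, the total memory available to a job scales with the number of slots. A small sketch of that arithmetic, assuming the scheduler simply multiplies the per-slot value by the slot count; total_vmem_gb is a hypothetical helper name.

```shell
# Total memory implied by a per-slot h_vmem request.
# $1 = h_vmem per slot in GB, $2 = number of slots
total_vmem_gb() {
    echo $(( $1 * $2 ))
}

# 6 slots at h_vmem=4G each:
total_vmem_gb 4 6   # prints 24
```

So the 6-slot MPI run is still requested with h_vmem=4G; the 24 GB total follows from the slot count rather than from a larger h_vmem value.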



