[Hpc-forum] MPI problems
Ferenc Bartha
barthaf at sol.cc.u-szeged.hu
Sat Sep 28 13:19:49 CEST 2013
Dear Etele,
Based on the attachment I cannot tell what the problem actually is.
I will have to look at the jobs and the SGE entries. That will not happen
now, over the weekend.
Until then, I suggest trying one mpirun per job: run each calculation in
its own independent job rather than several in a single job. MPI jobs can
interfere with each other, not only when you incorrectly run them in the
background, but also when two are started too quickly one after another in
the same MPI environment.
If this resolves the error, there will be no need to investigate next week.
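A minimal sketch of the one-mpirun-per-job layout, in case it helps. It only
writes the job scripts; the qsub call is commented out. The function name, the
run_N.sh file names, ./program and its input_N arguments are placeholders, not
taken from the actual runs. The #$ directives follow Gábor's earlier example.

```shell
#!/bin/bash
# Sketch: write one SGE job script per MPI run, so every job holds a single mpirun.
# Assumptions: write_job_script, run_N.sh, ./program and input_N are placeholders.

write_job_script() {
    local index="$1" slots="$2" outdir="$3"
    local script="$outdir/run_${index}.sh"
    cat > "$script" <<EOF
#!/bin/bash
#\$ -N RUN_${index}
#\$ -pe mpi ${slots}

# Exactly one mpirun in this job.
mpirun -np \$NSLOTS ./program input_${index}
EOF
    echo "$script"
}

outdir="$(mktemp -d)"
for i in 1 2 3; do
    script="$(write_job_script "$i" 48 "$outdir")"
    # Submission is left commented out so the sketch can be tried without a scheduler:
    # qsub "$script"
    echo "generated $script"
done
```

Each generated script can then be submitted separately, so no two mpirun calls
share the same job environment.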
Regards,
Feri
------------:
Dr. Ferenc BARTHA, tel: 62/54-6821, E-mail: barthaf at sol.cc.u-szeged.hu
SKYPE: ferenc.bartha, WWW: http://www.staff.u-szeged.hu/~barthaf/
SZTE DNT - High Performance Computing Group, 6725 Szeged, Szikra u. 2.
SZTE, Department of Medical Chemistry, 6720 Szeged, Dóm tér 8.
----- Original Message -----
From: "etele molnar" <etele.molnar at gmail.com>
To: "Rőczei Gábor" <roczei at niif.hu>; "Ferenc Bartha"
<barthaf at sol.cc.u-szeged.hu>
Cc: <hpc-forum at listserv.niif.hu>
Sent: Saturday, September 28, 2013 12:12 PM
Subject: Re: [Hpc-forum] MPI problems
> Dear Gabor and Ferenc,
>
> We have now reached the point where the code runs one call after another,
> 1, 2, 3 ..., but after a while
> errors still appear and the job stops... this morning (28.09.2013)
>
> I will not go into the details here; instead I will try to send the files
> as attachments
>
> thanks in advance
> regards
> e
>
>
>
> On 27-Sep-13 1:27 PM, Rőczei Gábor wrote:
>> On 2013.09.27., at 8:36, Ferenc Bartha wrote:
>>
>>> Again, only a partial assessment:
>>>
>>>> Error: NCE_PACKAGES not set
>>> module use /opt/nce/modulefiles is not enough; module load nce/global
>>> is also needed after it.
>>>
>>> There is one more suspicious thing: how two MPI programs started in the
>>> background from a single job may behave. I will not go into that now,
>>> but I would advise you not to use this in the meantime.
>> I also think this is the problem.
>>
>> I have now added this to the /home/emolnar/.bashrc file as well. It
>> currently contains:
>>
>> module use /opt/nce/modulefiles
>> module load nce/global
>> module load openmpi/1.6.3-gcc-4.7.2 gcc/4.7.2
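A small guard like the following could go at the top of a job script to catch
an incomplete module setup before the first mpirun. This is only a sketch:
NCE_PACKAGES is the variable named in the error message, and it is presumably
set by module load nce/global; the function name check_nce_env is made up.

```shell
#!/bin/bash
# Sketch: fail early when the module environment is incomplete.
# Assumption: NCE_PACKAGES is set once nce/global is loaded; check_nce_env is a hypothetical name.

check_nce_env() {
    if [ -z "${NCE_PACKAGES:-}" ]; then
        echo "NCE_PACKAGES not set: run 'module use /opt/nce/modulefiles'" \
             "and 'module load nce/global' first" >&2
        return 1
    fi
    return 0
}

# Intended use inside a job script, before the first mpirun:
# check_nce_env || exit 1
```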
>>
>> The HPC_run_cooperfrye_LHC2760_visc_all_serial.sh and
>> HPC_run_cooperfrye_RHIC200_visc_all_serial.sh programs started fine.
>>
>> Etele,
>>
>> Please test it yourself as well.
>>
>> Gábor
>>
>>> ----- Original Message ----- From: "etele molnar"
>>> <molnar at fias.uni-frankfurt.de>
>>> To: "Rőczei Gábor" <roczei at niif.hu>; "etele molnar"
>>> <etele.molnar at gmail.com>
>>> Cc: <hpc-forum at listserv.niif.hu>
>>> Sent: Friday, September 27, 2013 7:50 AM
>>> Subject: Re: [Hpc-forum] MPI problems
>>>
>>>
>>> Dear Gabor and everyone,
>>>
>>> First of all, thank you very much for the answers and the help;
>>> we are almost there.
>>>
>>> module use /opt/nce/modulefiles
>>> did indeed help, and the program now runs on 12, 24 and 48 slots, but
>>> there is now a new error,
>>> which appears before the program runs; the program then runs
>>> (successfully), but on the next call
>>> it does not start and the job is aborted.
>>>
>>> I called the current test programs from bash like this:
>>>
>>> mpirun -np 48 ./ program ... ;
>>> mpirun -np 48 ./ program ... ;
>>> ...
>>> wait
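In the snippet above each mpirun ends with `;`, so the calls already run in
the foreground one after another, and the trailing wait has no background
process to wait for. A hedged sketch of a stricter variant that stops at the
first failing call instead of moving on to the next one (the function name
run_in_sequence is made up, and the mpirun lines in the usage comment are
placeholders):

```shell
#!/bin/bash
# Sketch: run command lines strictly one after another and stop at the first failure.
# Assumption: run_in_sequence is a hypothetical helper; each argument is one full command line.

run_in_sequence() {
    local cmd status
    for cmd in "$@"; do
        echo "starting: $cmd"
        bash -c "$cmd" || {
            status=$?
            echo "failed with status $status: $cmd" >&2
            return "$status"
        }
    done
    return 0
}

# Intended use in the job script (program name and arguments are placeholders):
# run_in_sequence "mpirun -np 48 ./program a" "mpirun -np 48 ./program b"
```

This does not address a possible session-directory clash by itself; it only
makes sure a failed run is not silently followed by the next one.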
>>>
>>>
>>> Error: NCE_PACKAGES not set
>>> ...
>>> Error: NCE_PACKAGES not set
>>>
>>> real 11m7.497s
>>> user 39m5.807s
>>> sys 0m1.416s
>>> [r1i0n6:09241] opal_os_dirpath_create: Error: Unable to create the
>>> sub-directory
>>> (/scratch/tmp/425790.1.parallel.q/openmpi-sessions-emolnar@r1i0n6_0/39007)
>>> of
>>> (/scratch/tmp/425790.1.parallel.q/openmpi-sessions-emolnar@r1i0n6_0/39007/0/0),
>>> mkdir failed [1]
>>> [r1i0n6:09241] [[39007,0],0] ORTE_ERROR_LOG: Error in file
>>> util/session_dir.c at line 106
>>> [r1i0n6:09241] [[39007,0],0] ORTE_ERROR_LOG: Error in file
>>> util/session_dir.c at line 399
>>> [r1i0n6:09241] [[39007,0],0] ORTE_ERROR_LOG: Error in file
>>> ess_hnp_module.c at line 320
>>> --------------------------------------------------------------------------
>>> It looks like orte_init failed for some reason; your parallel process is
>>> likely to abort. There are many reasons that a parallel process can
>>> fail during orte_init; some of which are due to configuration or
>>> environment problems. This failure appears to be an internal failure;
>>> here's some additional information (which may only be relevant to an
>>> Open MPI developer):
>>>
>>> orte_session_dir failed
>>> --> Returned value Error (-1) instead of ORTE_SUCCESS
>>> --------------------------------------------------------------------------
>>> [r1i0n6:09241] [[39007,0],0] ORTE_ERROR_LOG: Error in file
>>> runtime/orte_init.c at line 128
>>> --------------------------------------------------------------------------
>>> It looks like orte_init failed for some reason; your parallel process is
>>> likely to abort. There are many reasons that a parallel process can
>>> fail during orte_init; some of which are due to configuration or
>>> environment problems. This failure appears to be an internal failure;
>>> here's some additional information (which may only be relevant to an
>>> Open MPI developer):
>>>
>>> orte_ess_set_name failed
>>> --> Returned value Error (-1) instead of ORTE_SUCCESS
>>> --------------------------------------------------------------------------
>>> [r1i0n6:09241] [[39007,0],0] ORTE_ERROR_LOG: Error in file orterun.c at
>>> line 694
>>>
>>>
>>> thanks in advance
>>> regards
>>> e
>>>
>>> On 25-Sep-13 11:17 AM, Rőczei Gábor wrote:
>>>> Dear Etele,
>>>>
>>>>> I recompiled the program with mpicxx or mpic++ (gcc)
>>>>> and am trying to run it, but unfortunately I still get an error message
>>>>>
>>>>> I used mpirun, 1 job, 8G memory, np=12:
>>>>>
>>>>> Warning: Permanently added
>>>>> '[r1i1n3.ice.debrecen.hpc.niif.hu]:58158,[10.148.0.21]:58158' (RSA) to
>>>>> the list of known hosts.
>>>>> Warning: Permanently added
>>>>> '[r1i0n15.ice.debrecen.hpc.niif.hu]:45141,[10.148.0.17]:45141' (RSA)
>>>>> to the list of known hosts.
>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>>>>> 'openmpi/1.6.3-gcc-4.7.2'
>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>>>>> 'openmpi/1.6.3-gcc-4.7.2'
>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>>>>> 'gcc/4.7.2'
>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>>>>> 'gcc/4.7.2'
>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>>>>> 'openmpi/1.6.3-gcc-4.7.2'
>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>>>>> 'gcc/4.7.2'
>>>>> --------------------------------------------------------------------------
>>>>> WARNING: It appears that your OpenFabrics subsystem is configured to
>>>>> only
>>>>> allow registering part of your physical memory. This can cause MPI
>>>>> jobs to
>>>>> run with erratic performance, hang, and/or crash.
>>>>>
>>>>> This may be caused by your OpenFabrics vendor limiting the amount of
>>>>> physical memory that can be registered. You should investigate the
>>>>> relevant Linux kernel module parameters that control how much physical
>>>>> memory can be registered, and increase them to allow registering all
>>>>> physical memory on your machine.
>>>>>
>>>>> See this Open MPI FAQ item for more information on these Linux kernel
>>>>> module
>>>>> parameters:
>>>>>
>>>>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>>>>
>>>>> Local host: r1i0n14
>>>>> Registerable memory: 32768 MiB
>>>>> Total memory: 49143 MiB
>>>>>
>>>>> Your MPI job will continue, but may be behave poorly and/or hang.
>>>>> --------------------------------------------------------------------------
>>>>> [r1i0n14:21422] 47 more processes have sent help message
>>>>> help-mpi-btl-openib.txt / reg mem limit low
>>>>> [r1i0n14:21422] Set MCA parameter "orte_base_help_aggregate" to 0 to
>>>>> see all help / error messages
>>>> I managed to work out what the problem was. The mlx4_core kernel module
>>>> configuration on the Debrecen CN machines needs this parameter:
>>>>
>>>> options mlx4_core log_mtts_per_seg=5
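The arithmetic behind this setting can be sketched with the formula from the
Open MPI FAQ item linked in the warning: max_reg_mem = 2^log_num_mtt *
2^log_mtts_per_seg * PAGE_SIZE. The values log_num_mtt=20, a previous
log_mtts_per_seg of 3 and a 4096-byte page are assumptions, chosen only
because they reproduce the reported 32768 MiB; the actual node settings are
not stated in this thread.

```shell
#!/bin/bash
# Sketch of the registerable-memory formula from the Open MPI FAQ:
#   max_reg_mem = 2^log_num_mtt * 2^log_mtts_per_seg * PAGE_SIZE
# Assumptions: log_num_mtt=20, an earlier log_mtts_per_seg=3 and PAGE_SIZE=4096
# are guesses that happen to match the 32768 MiB in the warning.

reg_mem_mib() {
    local log_num_mtt="$1" log_mtts_per_seg="$2" page_size="$3"
    echo $(( (1 << log_num_mtt) * (1 << log_mtts_per_seg) * page_size / 1024 / 1024 ))
}

echo "log_mtts_per_seg=3: $(reg_mem_mib 20 3 4096) MiB"
echo "log_mtts_per_seg=5: $(reg_mem_mib 20 5 4096) MiB"
```

Under these assumptions the new value gives 131072 MiB of registerable
memory, well above the 49143 MiB of total memory reported for r1i0n14.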
>>>>
>>>> Since not many parallel jobs were running in Debrecen this morning, I
>>>> was able to reboot most of the CN machines so that this setting takes
>>>> effect. The ones I could not reboot yet I have put into the disabled
>>>> state until the jobs running on them have finished.
>>>>
>>>>> Furthermore, if I try to run the same program with
>>>>>
>>>>> mpirun.sge (it was not recompiled, the gcc build was kept), then I
>>>>> likewise
>>>>> get the same error message, only much longer, and the job stops
>>>>> immediately...
>>>> Please do not use mpirun.sge for OpenMPI jobs; it is only suitable for
>>>> SGI MPT.
>>>>
>>>> With OpenMPI you need to use mpirun. Example:
>>>>
>>>> #!/bin/bash
>>>> #$ -N CONNECTIVITY
>>>> #$ -pe mpi 120
>>>>
>>>> mpirun -np $NSLOTS ./connectivity -v
>>>>
>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>>>>> 'openmpi/1.6.3-gcc-4.7.2'
>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>>>>> 'gcc/4.7.2'
>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>>>>> 'openmpi/1.6.3-gcc-4.7.2'
>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>>>>> 'openmpi/1.6.3-gcc-4.7.2'
>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>>>>> 'gcc/4.7.2'
>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>>>>> 'gcc/4.7.2'
>>>> This should be added to your .bashrc file:
>>>>
>>>> module use /opt/nce/modulefiles
>>>>
>>>>> Two last "theoretical" questions
>>>>> For example: one program needs 4 GB of memory; if I want to run 2
>>>>> programs in parallel (./ program & ./ program &)
>>>>> should I request 2*4 GB of memory or only 4?
>>>>>> #$ -l h_vmem=8G or 4 GB
>>>> With h_vmem you specify how much memory is needed for 1 slot.
>>>> So: #$ -l h_vmem=4G
>>>>
>>>>> The same program in its MPI version, say on 6 slots
>>>>> mpirun -np 6 program
>>>>>
>>>>> then
>>>>>> #$ -l h_vmem=4G or 6*4=24 GB
>>>> Solution:
>>>>
>>>> #$ -l h_vmem=4G
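Since h_vmem is given per slot, the total for a parallel job is the slot count
times the per-slot value. A tiny sketch of that arithmetic (the function name
total_vmem_gb is made up; the multiplication assumes h_vmem is accounted per
slot, as described above):

```shell
#!/bin/bash
# Sketch: h_vmem is requested per slot, so the whole job gets slots * h_vmem.
# Assumption: total_vmem_gb is a hypothetical helper; values are whole gigabytes.

total_vmem_gb() {
    local slots="$1" per_slot_gb="$2"
    echo $(( slots * per_slot_gb ))
}

echo "mpirun -np 6 with h_vmem=4G: $(total_vmem_gb 6 4)G in total"
```

So the 6*4=24 GB from the question is what the job receives in total, while
the value written in the script stays at 4G.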
>>>>
>>>> Gábor
>>>
>>> _______________________________________________
>>> Hpc-forum mailing list
>>> Hpc-forum at listserv.niif.hu
>>> https://listserv.niif.hu/mailman/listinfo/hpc-forum
>
>