[Hpc-forum] MPI problems

Ferenc Bartha barthaf at sol.cc.u-szeged.hu
Sun Sep 29 12:02:19 CEST 2013


It would have been good to see whether the runs go smoothly when started 
independently (each in its own SGE job).
It would also help to know whether the same combined job dies reproducibly 
at the same point, or sometimes here and sometimes there.

The error message (the first one is always the one that matters; here it is 
the CANNOT CREATE... line) suggests that SGE is struggling to start the next 
part of the job.

If you insist on running several consecutive MPI calls within a single SGE 
job, then:
You could try (this will also tell us more by tomorrow) giving the system 
time between two mpirun calls to shut the previous one down completely. For 
example, put a "sleep 100" line between the mpirun calls.
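
A minimal sketch of that idea for a bash job script. The function name 
run_with_pause and the PAUSE_SECONDS variable are made up for illustration; 
only the 100-second pause comes from the suggestion above. Whether the pause 
actually avoids the session directory error still has to be tested.

```shell
#!/bin/bash
# Sketch only: run several commands one after another, pausing between them
# so the previous run can shut down completely before the next one starts.
# run_with_pause and PAUSE_SECONDS are hypothetical names.

PAUSE_SECONDS="${PAUSE_SECONDS:-100}"

run_with_pause() {
    local first=1
    local cmd
    for cmd in "$@"; do
        # No pause before the first command, only between consecutive ones.
        if [ "$first" -eq 0 ]; then
            sleep "$PAUSE_SECONDS"
        fi
        first=0
        echo "starting: $cmd"
        # Stop at the first failure so the first error message is easy to find.
        bash -c "$cmd" || { echo "failed: $cmd" >&2; return 1; }
    done
}

# Example call inside the job script (program names are placeholders):
# run_with_pause "mpirun -np $NSLOTS ./program_a" "mpirun -np $NSLOTS ./program_b"
```

Stopping at the first failed call is a deliberate choice here: the later 
errors are usually just consequences of the first one.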

----- Original Message ----- 
From: "etele molnar" <etele.molnar at gmail.com>
To: "Ferenc Bartha" <barthaf at sol.cc.u-szeged.hu>; "Rőczei Gábor" 
<roczei at niif.hu>
Cc: <hpc-forum at listserv.niif.hu>
Sent: Sunday, September 29, 2013 11:50 AM
Subject: Re: [Hpc-forum] MPI problems


> Dear all,
>
> I restarted the jobs yesterday as well, but unfortunately the result is the 
> same as before;
> even if one or two jobs start, after that they all die...
>
> [r1i1n7:07616] CANNOT CREATE FIFO 
> /scratch/tmp/425892.1.parallel.q/openmpi-sessions-emolnar@r1i1n7_0/42592/0/debugger_attach_fifo: 
> errno 2
> [r1i1n7:07628] opal_os_dirpath_create: Error: Unable to create the 
> sub-directory 
> (/scratch/tmp/425892.1.parallel.q/openmpi-sessions-emolnar@r1i1n7_0/42592/1) 
> of 
> (/scratch/tmp/425892.1.parallel.q/openmpi-sessions-emolnar@r1i1n7_0/42592/1/3), 
> mkdir failed [1]
> [r1i1n7:07628] [[42592,1],3] ORTE_ERROR_LOG: Error in file 
> util/session_dir.c at line 106
> [r1i1n7:07628] [[42592,1],3] ORTE_ERROR_LOG: Error in file 
> util/session_dir.c at line 399
> [r1i1n7:07628] [[42592,1],3] ORTE_ERROR_LOG: Error in file 
> base/ess_base_std_app.c at line 130
> ...
> [r1i1n7:07630] [[42592,1],5] ORTE_ERROR_LOG: A message is attempting to be 
> sent to a process whose contact information is unknown in file 
> rml_oob_send.c at line 104
> [r1i1n7:07630] [[42592,1],5] could not get route to [[INVALID],INVALID]
> [r1i1n7:07630] [[42592,1],5] ORTE_ERROR_LOG: A message is attempting to be 
> sent to a process whose contact information is unknown in file 
> util/show_help.c at line 627
> [r1i1n7:07630] [[42592,1],5] ORTE_ERROR_LOG: Error in file 
> ess_env_module.c at line 167
> [r1i1n7:07625] [[42592,1],0] ORTE_ERROR_LOG: A message is attempting to be 
> sent to a process whose contact information is unknown in file 
> rml_oob_send.c at line 104
>
> Of course I do not need to run just one or two things, but at least 2*25 
> things in 2 scripts, and for now this is only testing. The real runs are 
> about 100 MPI program calls, which cannot be handled manually.
> On the PC the scripts do execute, whether in parallel or one after another, 
> only of course 2*25 takes more time.
>
> Regards,
> e
>
>
> On 28-Sep-13 1:19 PM, Ferenc Bartha wrote:
>> Dear Etele,
>>
>> Based on the attachment I cannot determine the nature of the problem.
>> I will have to look at the jobs and the SGE entries. That will not happen 
>> now, over the weekend.
>>
>> Until then I suggest you try runs with one mpirun per job, that is, run 
>> in independent jobs rather than several things in one job. MPI jobs can 
>> interfere with each other, not only when you incorrectly run them in the 
>> background, but also when two start too quickly one after the other in 
>> the same MPI environment.
>>
>> If this solves the error, there will be no need to investigate next week.
>>
>> Regards, Feri
>> ------------:
>> Dr. Ferenc BARTHA, tel: 62/54-6821, E-mail: barthaf at sol.cc.u-szeged.hu
>> SKYPE: ferenc.bartha, WWW: http://www.staff.u-szeged.hu/~barthaf/
>> SZTE DNT - High Performance Computing Group, 6725 Szeged, Szikra u. 2.
>> SZTE, Department of Medical Chemistry, 6720 Szeged, Dóm tér 8.
>>
>> ----- Original Message ----- From: "etele molnar" 
>> <etele.molnar at gmail.com>
>> To: "Rőczei Gábor" <roczei at niif.hu>; "Ferenc Bartha" 
>> <barthaf at sol.cc.u-szeged.hu>
>> Cc: <hpc-forum at listserv.niif.hu>
>> Sent: Saturday, September 28, 2013 12:12 PM
>> Subject: Re: [Hpc-forum] MPI problems
>>
>>
>>> Dear Gabor and Ferenc,
>>>
>>> We have now got as far as the code running one call after another, 1,
>>> 2, 3 ..., but after a while
>>> errors still appear and it stops... this morning (28.09.2013)
>>>
>>> I will not go into details here; instead I will try to send the files
>>> as an attachment.
>>>
>>> Thanks in advance,
>>> Regards,
>>> e
>>>
>>>
>>>
>>> On 27-Sep-13 1:27 PM, Rőczei Gábor wrote:
>>>> On 2013.09.27., at 8:36, Ferenc Bartha wrote:
>>>>
>>>>> Again, only a partial assessment:
>>>>>
>>>>>> Error: NCE_PACKAGES not set
>>>>> module use /opt/nce/modulefiles is not enough; you also need 
>>>>> module load nce/global after it.
>>>>>
>>>>> There is one more suspicious thing, concerning the possible behaviour 
>>>>> of two MPI programs started in the background from one job. I would 
>>>>> rather not elaborate on it now, but I advise you not to use this in 
>>>>> the meantime.
>>>> I also think this is the error.
>>>>
>>>> I have now added this to the /home/emolnar/.bashrc file as well. It 
>>>> currently contains:
>>>>
>>>> module use /opt/nce/modulefiles
>>>> module load nce/global
>>>> module load openmpi/1.6.3-gcc-4.7.2  gcc/4.7.2
>>>>
>>>> The HPC_run_cooperfrye_LHC2760_visc_all_serial.sh and 
>>>> HPC_run_cooperfrye_RHIC200_visc_all_serial.sh programs started fine.
>>>>
>>>> Etele,
>>>>
>>>> Please test it yourself as well.
>>>>
>>>> Gábor
>>>>
>>>>> ----- Original Message ----- From: "etele molnar" 
>>>>> <molnar at fias.uni-frankfurt.de>
>>>>> To: "Rőczei Gábor" <roczei at niif.hu>; "etele molnar" 
>>>>> <etele.molnar at gmail.com>
>>>>> Cc: <hpc-forum at listserv.niif.hu>
>>>>> Sent: Friday, September 27, 2013 7:50 AM
>>>>> Subject: Re: [Hpc-forum] MPI problems
>>>>>
>>>>>
>>>>> Dear Gabor and everyone,
>>>>>
>>>>> First of all, thank you very much for the answers and the help;
>>>>> we are almost there.
>>>>>
>>>>> module use /opt/nce/modulefiles
>>>>> did indeed help, and the program now runs on 12, 24 and 48 slots, but
>>>>> there is a new error,
>>>>> which appears even before the program runs. After that the program runs
>>>>> (successfully), but on the next call
>>>>> it does not start and the job is aborted.
>>>>>
>>>>> This is how I called the current test programs from bash:
>>>>>
>>>>> mpirun -np 48 ./ program ... ;
>>>>> mpirun -np 48 ./ program ... ;
>>>>> ...
>>>>> wait
>>>>>
>>>>>
>>>>> Error: NCE_PACKAGES not set
>>>>> ...
>>>>> Error: NCE_PACKAGES not set
>>>>>
>>>>> real    11m7.497s
>>>>> user    39m5.807s
>>>>> sys    0m1.416s
>>>>> [r1i0n6:09241] opal_os_dirpath_create: Error: Unable to create the
>>>>> sub-directory
>>>>> (/scratch/tmp/425790.1.parallel.q/openmpi-sessions-emolnar@r1i0n6_0/39007)
>>>>> of
>>>>> (/scratch/tmp/425790.1.parallel.q/openmpi-sessions-emolnar@r1i0n6_0/39007/0/0),
>>>>> mkdir failed [1]
>>>>> [r1i0n6:09241] [[39007,0],0] ORTE_ERROR_LOG: Error in file
>>>>> util/session_dir.c at line 106
>>>>> [r1i0n6:09241] [[39007,0],0] ORTE_ERROR_LOG: Error in file
>>>>> util/session_dir.c at line 399
>>>>> [r1i0n6:09241] [[39007,0],0] ORTE_ERROR_LOG: Error in file
>>>>> ess_hnp_module.c at line 320
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> It looks like orte_init failed for some reason; your parallel process 
>>>>> is
>>>>> likely to abort.  There are many reasons that a parallel process can
>>>>> fail during orte_init; some of which are due to configuration or
>>>>> environment problems.  This failure appears to be an internal failure;
>>>>> here's some additional information (which may only be relevant to an
>>>>> Open MPI developer):
>>>>>
>>>>>   orte_session_dir failed
>>>>>   --> Returned value Error (-1) instead of ORTE_SUCCESS
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> [r1i0n6:09241] [[39007,0],0] ORTE_ERROR_LOG: Error in file
>>>>> runtime/orte_init.c at line 128
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> It looks like orte_init failed for some reason; your parallel process 
>>>>> is
>>>>> likely to abort.  There are many reasons that a parallel process can
>>>>> fail during orte_init; some of which are due to configuration or
>>>>> environment problems.  This failure appears to be an internal failure;
>>>>> here's some additional information (which may only be relevant to an
>>>>> Open MPI developer):
>>>>>
>>>>>   orte_ess_set_name failed
>>>>>   --> Returned value Error (-1) instead of ORTE_SUCCESS
>>>>> --------------------------------------------------------------------------
>>>>>
>>>>> [r1i0n6:09241] [[39007,0],0] ORTE_ERROR_LOG: Error in file orterun.c 
>>>>> at
>>>>> line 694
>>>>>
>>>>>
>>>>> Thanks in advance,
>>>>> Regards,
>>>>> e
>>>>>
>>>>> On 25-Sep-13 11:17 AM, Rőczei Gábor wrote:
>>>>>> Dear Etele,
>>>>>>
>>>>>>> I recompiled the program with mpicxx or mpic++ (gcc)
>>>>>>> and am trying to run it, but unfortunately I still get an error message.
>>>>>>>
>>>>>>> I used mpirun, 1 job, 8G memory, np=12:
>>>>>>>
>>>>>>> Warning: Permanently added 
>>>>>>> '[r1i1n3.ice.debrecen.hpc.niif.hu]:58158,[10.148.0.21]:58158' (RSA) 
>>>>>>> to the list of known hosts.
>>>>>>> Warning: Permanently added 
>>>>>>> '[r1i0n15.ice.debrecen.hpc.niif.hu]:45141,[10.148.0.17]:45141' (RSA) 
>>>>>>> to the list of known hosts.
>>>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>>>>>>> 'openmpi/1.6.3-gcc-4.7.2'
>>>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>>>>>>> 'openmpi/1.6.3-gcc-4.7.2'
>>>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>>>>>>> 'gcc/4.7.2'
>>>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>>>>>>> 'gcc/4.7.2'
>>>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>>>>>>> 'openmpi/1.6.3-gcc-4.7.2'
>>>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>>>>>>> 'gcc/4.7.2'
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> WARNING: It appears that your OpenFabrics subsystem is configured to 
>>>>>>> only
>>>>>>> allow registering part of your physical memory.  This can cause MPI 
>>>>>>> jobs to
>>>>>>> run with erratic performance, hang, and/or crash.
>>>>>>>
>>>>>>> This may be caused by your OpenFabrics vendor limiting the amount of
>>>>>>> physical memory that can be registered.  You should investigate the
>>>>>>> relevant Linux kernel module parameters that control how much 
>>>>>>> physical
>>>>>>> memory can be registered, and increase them to allow registering all
>>>>>>> physical memory on your machine.
>>>>>>>
>>>>>>> See this Open MPI FAQ item for more information on these Linux 
>>>>>>> kernel module
>>>>>>> parameters:
>>>>>>>
>>>>>>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>>>>>>
>>>>>>>    Local host:              r1i0n14
>>>>>>>    Registerable memory:     32768 MiB
>>>>>>>    Total memory:            49143 MiB
>>>>>>>
>>>>>>> Your MPI job will continue, but may be behave poorly and/or hang.
>>>>>>> --------------------------------------------------------------------------
>>>>>>>
>>>>>>> [r1i0n14:21422] 47 more processes have sent help message 
>>>>>>> help-mpi-btl-openib.txt / reg mem limit low
>>>>>>> [r1i0n14:21422] Set MCA parameter "orte_base_help_aggregate" to 0 to 
>>>>>>> see all help / error messages
>>>>>> I managed to find out what was wrong. On the Debrecen CN machines, 
>>>>>> this parameter must be set in the mlx4_core kernel module configuration:
>>>>>>
>>>>>> options mlx4_core log_mtts_per_seg=5
>>>>>>
>>>>>> Since not many parallel jobs were running in Debrecen this morning, 
>>>>>> I was able to reboot most of the CN machines so that this setting 
>>>>>> takes effect. The ones I could not reboot yet I have set to the 
>>>>>> disabled state until the jobs running on them finish.
>>>>>>
>>>>>>> Furthermore, if I try to run the same program with
>>>>>>>
>>>>>>> mpirun.sge (it was not recompiled, the gcc build was kept), I likewise 
>>>>>>> get this same error message, only much longer, and the job stops 
>>>>>>> immediately...
>>>>>> Please do not use mpirun.sge for OpenMPI jobs; it is only suitable for 
>>>>>> SGI MPT.
>>>>>>
>>>>>> With OpenMPI you need to use mpirun. Example:
>>>>>>
>>>>>> #!/bin/bash
>>>>>> #$ -N CONNECTIVITY
>>>>>> #$ -pe mpi 120
>>>>>>
>>>>>> mpirun -np $NSLOTS ./connectivity -v
>>>>>>
>>>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>>>>>>> 'openmpi/1.6.3-gcc-4.7.2'
>>>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>>>>>>> 'gcc/4.7.2'
>>>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>>>>>>> 'openmpi/1.6.3-gcc-4.7.2'
>>>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>>>>>>> 'openmpi/1.6.3-gcc-4.7.2'
>>>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>>>>>>> 'gcc/4.7.2'
>>>>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for 
>>>>>>> 'gcc/4.7.2'
>>>>>> You should put this in your .bashrc file:
>>>>>>
>>>>>> module use /opt/nce/modulefiles
>>>>>>
>>>>>>> Two last "theoretical" questions.
>>>>>>> For example: a program needs 4 GB of memory. If I want to run 2 
>>>>>>> programs in parallel (./ program & ./ program &),
>>>>>>> should I request 2*4 GB of memory or only 4?
>>>>>>>> #$ -l h_vmem=8G or 4 GB
>>>>>> With h_vmem you specify how much memory one slot needs. 
>>>>>> So: #$ -l h_vmem=4G
>>>>>>
>>>>>>> The same program in its MPI version, say on 6 slots:
>>>>>>> mpirun -np 6 program
>>>>>>>
>>>>>>> then
>>>>>>>> #$ -l h_vmem=4G or 6*4=24 GB
>>>>>> Solution:
>>>>>>
>>>>>>   #$ -l h_vmem=4G
>>>>>>
>>>>>> Gábor
>>>>>
>>>>> _______________________________________________
>>>>> Hpc-forum mailing list
>>>>> Hpc-forum at listserv.niif.hu
>>>>> https://listserv.niif.hu/mailman/listinfo/hpc-forum
>>>>>
>>>
>>>
>>
> 



