[Hpc-forum] MPI problems
etele molnar
etele.molnar at gmail.com
Tue, 24 Sep 2013, 22:06:26 CEST
Of course, that is how it should work.
I tried reducing h_vmem to 2 GB and even to 512 MB, but I don't know what the
upper limit is that can be requested
(in theory, if there is free memory, why not request as much as
we need?).
In every case the error remained, and new errors appeared as well:
ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
'openmpi/1.6.3-gcc-4.7.2'
ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
'gcc/4.7.2'
ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
'openmpi/1.6.3-gcc-4.7.2'
ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
'gcc/4.7.2'
ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
'openmpi/1.6.3-gcc-4.7.2'
ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
'gcc/4.7.2'
[r1i0n15][[62467,1],5][btl_openib_component.c:660:start_async_event_thread]
Failed to create async event thread
[r1i0n15][[62467,1],8][btl_openib_component.c:660:start_async_event_thread]
Failed to create async event thread
[r1i0n15][[62467,1],6][btl_openib_component.c:660:start_async_event_thread]
Failed to create async event thread
[r1i0n15][[62467,1],7][btl_openib_component.c:660:start_async_event_thread]
Failed to create async event thread
[r1i0n15][[62467,1],4][btl_openib_component.c:660:start_async_event_thread]
Failed to create async event thread
[r1i0n15][[62467,1],9][btl_openib_component.c:660:start_async_event_thread]
Failed to create async event thread
--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.
Local host: r1i0n15
Local device: mlx4_0
--------------------------------------------------------------------------
[r1i1n3][[62467,1],13][btl_openib_component.c:660:start_async_event_thread]
Failed to create async event thread
[r1i1n3][[62467,1],12][btl_openib_component.c:660:start_async_event_thread]
Failed to create async event thread
[r1i1n3][[62467,1],14][btl_openib_component.c:660:start_async_event_thread]
Failed to create async event thread
[r1i1n3][[62467,1],15][btl_openib_component.c:660:start_async_event_thread]
Failed to create async event thread
[r1i1n3][[62467,1],10][btl_openib_component.c:660:start_async_event_thread]
Failed to create async event thread
[r1i1n3][[62467,1],11][btl_openib_component.c:660:start_async_event_thread]
Failed to create async event thread
[r1i1n5][[62467,1],18][btl_openib_component.c:660:start_async_event_thread]
Failed to create async event thread
[r1i1n5][[62467,1],16][btl_openib_component.c:660:start_async_event_thread]
Failed to create async event thread
[r1i1n5][[62467,1],19][btl_openib_component.c:660:start_async_event_thread]
Failed to create async event thread
[r1i1n5][[62467,1],23][btl_openib_component.c:660:start_async_event_thread]
Failed to create async event thread
[r1i1n5][[62467,1],21][btl_openib_component.c:660:start_async_event_thread]
Failed to create async event thread
[r1i1n5][[62467,1],17][btl_openib_component.c:660:start_async_event_thread]
Failed to create async event thread
[r1i1n5][[62467,1],22][btl_openib_component.c:660:start_async_event_thread]
Failed to create async event thread
[r1i1n5][[62467,1],20][btl_openib_component.c:660:start_async_event_thread]
Failed to create async event thread
[r1i0n6:18501] 19 more processes have sent help message
help-mpi-btl-openib.txt / error in device init
[r1i0n6:18501] Set MCA parameter "orte_base_help_aggregate" to 0 to see
all help / error messages
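
The "Failed to create async event thread" lines showed up only after h_vmem was lowered. One plausible reading, which is an assumption and not confirmed anywhere in this thread, is that the processes ran out of address space while creating threads, since Grid Engine usually enforces h_vmem as a virtual-memory limit on the job's processes. A minimal sketch for printing the limit that is actually in effect inside a job follows; the helper name vmem_limit_mib is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: report the virtual-memory (address space) limit
# of the current shell in MiB. `ulimit -v` prints KiB or "unlimited".
vmem_limit_mib() {
    local kib
    kib=$(ulimit -v)
    if [ "$kib" = "unlimited" ]; then
        echo "unlimited"
    else
        echo $(( kib / 1024 ))
    fi
}

echo "virtual memory limit (MiB): $(vmem_limit_mib)"
```

Placing such a line at the top of the job script, before mpirun, would show whether the requested h_vmem value reaches the processes as expected.
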
I read the following on the niif.hu site, but I don't know whether it is
still the case (see the table):
"No default h_vmem value is set in Pécs and Debrecen, because
SGI MPI (MPT) applications "die" because of it."
http://www.niif.hu/node/316
That said, the usage instructions on the niif site are not worth much either.
Regards,
e
On 24-Sep-13 9:20 PM, Tamas Hegedus wrote:
> Could the problem be that you are accidentally trying to allocate 12x8G of memory?
>
> --
> Tamas Hegedus
> tamas at hegelab.org
>
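
The arithmetic behind the 12x8G remark can be sketched as follows. It assumes that h_vmem is a per-slot consumable, so that the scheduler multiplies the requested value by the number of slots; this is common on Grid Engine but site-dependent and not confirmed in this thread. The function names are hypothetical.

```shell
#!/bin/bash
# Assumption: h_vmem is requested per slot, so a parallel job reserves
# slots * h_vmem in total.

# Total memory (GiB) reserved by a job with the given per-slot request.
total_vmem_gib() {
    local slots=$1 per_slot_gib=$2
    echo $(( slots * per_slot_gib ))
}

# Smallest whole-GiB per-slot request that covers a total need (rounds up).
per_slot_vmem_gib() {
    local slots=$1 total_gib=$2
    echo $(( (total_gib + slots - 1) / slots ))
}

total_vmem_gib 12 8      # 12 slots with h_vmem=8G
per_slot_vmem_gib 12 24  # per-slot request if the whole job needs 24 GiB
```

Under that assumption, 12 slots with h_vmem=8G reserve 96 GiB in total, while a job that needs 24 GiB overall would only have to request 2G per slot.
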
> etele molnar <etele.molnar at gmail.com> wrote:
>
>> Dear Gabor,
>>
>>
>> First of all, thank you for your help and effort, but there is still a problem.
>>
>>
>> True, the new gcc 4.7.2 and openmpi 1.6.3 are available now, but the problem has
>> not gone away. Note that the /sgi/mpt/2.04 module is loaded by default, so I only
>> added the former two to .bashrc.
>>
>> I recompiled the program with mpicxx or mpic++ (gcc)
>> and tried to run it, but unfortunately I still get an error message.
>>
>> I launched it with mpirun, as 1 job with 8G memory and np=12:
>>
>> Warning: Permanently added
>> '[r1i1n3.ice.debrecen.hpc.niif.hu]:58158,[10.148.0.21]:58158'
>> (RSA) to the list of known hosts.
>> Warning: Permanently added
>> '[r1i0n15.ice.debrecen.hpc.niif.hu]:45141,[10.148.0.17]:45141'
>> (RSA) to the list of known hosts.
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>> 'openmpi/1.6.3-gcc-4.7.2'
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>> 'openmpi/1.6.3-gcc-4.7.2'
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>> 'gcc/4.7.2'
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>> 'gcc/4.7.2'
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>> 'openmpi/1.6.3-gcc-4.7.2'
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>> 'gcc/4.7.2'
>> --------------------------------------------------------------------------
>> WARNING: It appears that your OpenFabrics subsystem is configured to only
>> allow registering part of your physical memory. This can cause MPI jobs to
>> run with erratic performance, hang, and/or crash.
>>
>> This may be caused by your OpenFabrics vendor limiting the amount of
>> physical memory that can be registered. You should investigate the
>> relevant Linux kernel module parameters that control how much physical
>> memory can be registered, and increase them to allow registering all
>> physical memory on your machine.
>>
>> See this Open MPI FAQ item for more information on these Linux kernel module
>> parameters:
>>
>> http://www.open-mpi.org/faq/?category=openfabrics#ib-locked-pages
>>
>> Local host: r1i0n14
>> Registerable memory: 32768 MiB
>> Total memory: 49143 MiB
>>
>> Your MPI job will continue, but may be behave poorly and/or hang.
>> --------------------------------------------------------------------------
>> [r1i0n14:21422] 47 more processes have sent help message
>> help-mpi-btl-openib.txt / reg mem limit low
>> [r1i0n14:21422] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>> all help / error messages
>>
>>
>>
>> Furthermore, if I try to run the same program with
>>
>> mpirun.sge (it was not recompiled, the gcc build was kept), then I get
>> the same error message, only much longer, and the job stops
>> immediately...
>>
>>
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>> 'openmpi/1.6.3-gcc-4.7.2'
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>> 'gcc/4.7.2'
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>> 'openmpi/1.6.3-gcc-4.7.2'
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>> 'openmpi/1.6.3-gcc-4.7.2'
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>> 'gcc/4.7.2'
>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>> 'gcc/4.7.2'
>> --------------------------------------------------------------------------
>> mpirun was unable to launch the specified application as it could not find
>> an executable:
>>
>> Executable: r1i0n14.ice.debrecen.hpc.niif.hu
>> Node: r1i0n14
>>
>> while attempting to start process rank 0.
>> --------------------------------------------------------------------------
>>
>> real 0m1.436s
>> user 0m0.012s
>> sys 0m0.016s
>>
>> etc...
>>
>> ***
>>
>> Two last "theoretical" questions.
>> For example, a program needs 4 GB of memory. If I want to run 2 programs in
>> parallel (./program & ./program &),
>> should I request 2*4 GB of memory or only 4?
>>> #$ -l h_vmem=8G or 4G
>> The same program in its MPI version, say on 6 slots:
>> mpirun -np 6 program
>>
>> In that case:
>>> #$ -l h_vmem=4G or 6*4=24G
>>
>> Thanks in advance,
>> Regards,
>> e
>>
>>
>> 2013/9/24 Rőczei Gábor <roczei at niif.hu>
>>
>>> Dear Etele Molnár,
>>>
>>>> I have a set of urgent problems that I have not been able to solve for days.
>>>> One of my C++ codes uses MPI for integration, but unfortunately it does not
>>>> always work properly on the Debrecen machine.
>>>> First of all, it does not compile with the SGE-flavored MPI, since it does not
>>>> recognize the MPI part?
>>>> mpi-selector --query -> mpt-2.04
>>> Instead of mpi-selector, use the SGI MPT module in Debrecen:
>>> module load sgi/mpt/2.04
>>>
>>> In the job script you need to launch with mpirun.sge (this is only needed
>>> with SGI MPT; with OpenMPI, use mpirun). Example:
>>>
>>> #!/bin/bash
>>> #$ -N TEST
>>> #$ -pe mpi 60
>>>
>>> mpirun.sge -np $NSLOTS program
>>>
>>> Set these at compile time (if MPI is not detected
>>> automatically):
>>>
>>> CXXFLAGS=-I/opt/nce/packages/global/sgi/mpt/2.04/include
>>> CFLAGS=-I/opt/nce/packages/global/sgi/mpt/2.04/include
>>> FFLAGS=-I/opt/nce/packages/global/sgi/mpt/2.04/include
>>> LDFLAGS=-L/opt/nce/packages/global/sgi/mpt/2.04/lib -lmpi
>>>
>>>> My other problem is that I am trying to use the newer gcc/4.6.4 compiler
>>>> package. I added
>>>> module load gcc/4.6.4 openmpi/1.6.3-gcc-4.7.2
>>>> to .bashrc, and the modules do get loaded, since they show up in the output of
>>>> module list. But when I start the code I always get this error:
>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>>>> 'gcc/4.6.4'
>>>> ModuleCmd_Load.c(199):ERROR:105: Unable to locate a modulefile for
>>>> 'openmpi/1.6.3-gcc-4.7.2'
>>>
>>> The OpenMPI build for gcc 4.7.2 was not installed in Debrecen. I have
>>> fixed this now.
>>>
>>> You can load them like this:
>>>
>>> DEBRECEN[service0] ~ (0)$ module load openmpi/1.6.3-gcc-4.7.2 gcc/4.7.2
>>> DEBRECEN[service0] ~ (0)$ type gcc
>>> gcc is /opt/nce/packages/global/gcc/4.7.2/bin/gcc
>>> DEBRECEN[service0] ~ (0)$ type mpirun
>>> mpirun is /opt/nce/packages/global/openmpi/1.6.3-gcc-4.7.2/bin/mpirun
>>> DEBRECEN[service0] ~ (0)$
>>>
>>>> It is also odd that the supercomputer does not recognize
>>>> mpicxx.openmpi!
>>>
>>> With OpenMPI, use mpirun. OpenMPI is built so that it can interpret the
>>> PE_HOSTFILE environment variable defined by SGE:
>>>
>>> DEBRECEN[service0] mpi (0)$ ompi_info -all | grep gridengine
>>> MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.6.3)
>>> MCA ras: parameter "ras_gridengine_debug" (current value:
>>> <0>, data source: default value)
>>> Enable debugging output for the gridengine ras
>>> component
>>> MCA ras: parameter "ras_gridengine_priority" (current
>>> value: <100>, data source: default value)
>>> Priority of the gridengine ras component
>>> MCA ras: parameter "ras_gridengine_verbose" (current
>>> value: <0>, data source: default value)
>>> Enable verbose output for the gridengine ras
>>> component
>>> MCA ras: parameter "ras_gridengine_show_jobid" (current
>>> value: <0>, data source: default value)
>>> DEBRECEN[service0] mpi (0)$
>>>
>>> Note: use SGI MPT instead of OpenMPI if you can, because it makes much
>>> better use of the SGI architecture in Debrecen.
>>>
>>> I changed this: mpiexec --> mpirun
>>>
>>>> What happens if I run 2 MPI programs at the same time, e.g.:
>>>>
>>>> #!/bin/sh
>>>> #$ -N test2
>>>> #$ -l h_vmem=4G
>>>> #$ -l h_rt=23:59:59
>>>> #$ -pe mpi 24
>>>> #$ -q parallel.q
>>>>
>>>> time mpirun -np 12 ./kod $writedir01/ > kod.out1 &
>>>> time mpirun -np 12 ./kod $writedir11/ > kod.out2 &
>>>> wait
>>>>
>>>> Is this self-explanatory? 24 threads split into two parts? And with "&" they
>>>> run in parallel?
>>>
>>> Yes, that is what will happen.
>>>
>>> This, on the other hand, will run them one after the other:
>>>
>>>> time mpirun -np 24 ./kod $writedir01/ > kod.out1 ;
>>>> time mpirun -np 24 ./kod $writedir11/ > kod.out2 ;
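
The difference between the two forms can be reproduced without MPI. In this minimal sketch the hypothetical function task stands in for an mpirun invocation and sleep stands in for the computation; all names are illustrative only.

```shell
#!/bin/bash
# Stand-in for "mpirun -np N ./kod ...": sleeps, then reports completion.
task() {
    sleep "$2"
    echo "$1 done"
}

# Parallel: both start at once; `wait` blocks until every background
# child has exited, so the job script does not end early.
run_parallel() {
    task A 1 &
    task B 0 &
    wait
}

# Sequential: the second command starts only after the first has finished.
run_sequential() {
    task A 1
    task B 0
}

run_parallel     # B reports first, because it needs less time
run_sequential   # A reports first, because B has to wait for it
```

Without the final wait, the parallel variant would return while the background tasks are still running, which in a batch job typically means the job ends before the programs do.
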
>>> Best regards,
>>>
>>> Rőczei Gábor
>> _______________________________________________
>> Hpc-forum mailing list
>> Hpc-forum at listserv.niif.hu
>> https://listserv.niif.hu/mailman/listinfo/hpc-forum