[Hpc-forum] defected nodes | solved
Ferenc Bartha
barthaf at sol.cc.u-szeged.hu
Sat, 5 Mar 2016, 08:34:09 CET
Debrecen2 has been cleaned up by NIIF, thanks.
Keep an eye out for possible new orphans with "about orp".
----- Original Message -----
From: "Ferenc Bartha" <barthaf at physx.u-szeged.hu>
To: <hpc-forum at listserv.niif.hu>
Sent: Tuesday, February 16, 2016 10:03 AM
Subject: [Hpc-forum] defected nodes
Dear Users,
Some jobs terminate but leave "orphan" processes behind on the compute
nodes. SLURM is not aware of them and launches new jobs without cleaning
up. The orphan processes interfere with the new job, sometimes seriously.
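
As a rough illustration of what such leftovers look like, here is a minimal
Python sketch that lists your own processes whose parent is PID 1 on the
node where it is run. It assumes a Linux node with a procps-style ps, and
it assumes that orphans get re-parented to PID 1, which is common but not
guaranteed (some systems re-parent to another reaper process). The function
name find_orphans is made up for this example; it is not part of SLURM or
of the about tool.

```python
import os
import subprocess


def find_orphans(ps_output, uid):
    """Return (pid, command) pairs for processes owned by `uid` with parent PID 1.

    `ps_output` is expected to hold one process per line in the form
    "PID PPID UID COMMAND" with no header line.
    """
    orphans = []
    for line in ps_output.splitlines():
        fields = line.split(None, 3)
        # Skip blank or malformed lines instead of failing on them.
        if len(fields) < 4 or not all(f.isdigit() for f in fields[:3]):
            continue
        pid, ppid, owner, command = fields
        if int(owner) == uid and ppid == "1":
            orphans.append((int(pid), command.strip()))
    return orphans


if __name__ == "__main__":
    # Must run on the compute node itself: ps only sees local processes.
    listing = subprocess.run(
        ["ps", "-eo", "pid=,ppid=,uid=,comm="],
        capture_output=True, text=True, check=True,
    ).stdout
    for pid, command in find_orphans(listing, os.getuid()):
        print(pid, command)
```

A process with parent PID 1 is not necessarily an orphan of a finished job
(legitimate background services look the same), so treat the output as a
list of candidates to check, not as proof.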
These nodes can be cured using root privileges. Until NIIF does so, here
is a hint on how to exclude the infected nodes.
Budapest, Budapest2, Debrecen, Debrecen2:
Refer to the about command, located in
/opt/nce/packages/global/barthaf/bin. You can create an alias for it, copy
it to your own bin, or use any other solution that makes it easier to use.
If you see an "orp" info topic, take it seriously. I typically update
these files daily. Remark: at the moment only DB2 is infected, but badly.
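
Once you know which nodes are affected, one way to avoid them is the
--exclude option of sbatch. The following Python sketch only assembles the
command line; the node names cn05 and cn12, the script name job.sh and the
helper name sbatch_command are made up for illustration, and the node list
has to be copied by hand from the "orp" info, since its format is not
described here.

```python
import shlex


def sbatch_command(script, bad_nodes):
    """Build an sbatch argument list that skips the given nodes."""
    cmd = ["sbatch"]
    if bad_nodes:
        # --exclude takes a comma-separated node list; drop duplicates.
        cmd.append("--exclude=" + ",".join(sorted(set(bad_nodes))))
    cmd.append(script)
    return cmd


# Hypothetical node names; replace them with the ones reported as infected.
print(shlex.join(sbatch_command("job.sh", ["cn12", "cn05", "cn12"])))
# prints: sbatch --exclude=cn05,cn12 job.sh
```

The same list can also go into the batch script itself as an
"#SBATCH --exclude=..." line, so that it does not have to be retyped on
every submission.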
Pecs, Miskolc:
UV machines are not inspected.
Szeged:
No info is needed on this topic; I clean up immediately when orphans are
detected.
Regards,
Feri
------------:
Dr. Ferenc BARTHA, tel: 62/54-6821, E-mail: barthaf at sol.cc.u-szeged.hu
SKYPE: ferenc.bartha, WWW: http://www.staff.u-szeged.hu/~barthaf/
SZTE DNT - High Performance Computing Group
6725 Szeged, Tisza Lajos krt. 113 (Szikra u. 2.)
SZTE, Department of Medical Chemistry, 6720 Szeged, Dóm tér 8.
_______________________________________________
Hpc-forum mailing list
Hpc-forum at listserv.niif.hu
https://listserv.niif.hu/mailman/listinfo/hpc-forum