HTCondorWiki: Run Cms Jobs At Bsc

Running CMS Jobs at BSC via a Shared Filesystem

CMS researchers at PIC (Port d’Informació Científica) are interested in running jobs originating at CERN on the MareNostrum system at BSC (Barcelona Supercomputing Center). Access to MareNostrum is highly restrictive, particularly with regard to networking. We are experimenting with a solution in which jobs are forwarded to MareNostrum execute nodes via a shared filesystem link. The approach is outlined in this document:

https://docs.google.com/document/d/1GGOc3pgidfv_qunaKqzjbvKPF8CD_RF30KkqMtwBC0U/edit?usp=sharing

Implementation Details

We have a working proof-of-concept implementation, which has been tested with 4 MareNostrum nodes running sleep jobs. Here, we give details of the current code.

Machines:

vocms0824.cern.ch: A CMS submit point that flocks to the test pool at PIC.
cm05-hpc.pic.es: The central manager for the test pool.
htcbridge01.pic.es: Runs the proxy startds that talk with the schedd and actual execute nodes (via filesystem).
td513.pic.es: A stand-in execute node to allow testing without any BSC machines.
dt01.bsc.es: MareNostrum data transfer machine, used for sshfs.
mn1.bsc.es: MareNostrum login machine, used for Slurm job submission.

All of the code is available under /home/jfrey/bsc_glidein/ on htcbridge01.pic.es. A copy is saved at /afs/cs.wisc.edu/p/condor/workspaces/jfrey/bsc_glidein.tar.gz. It relies on some changes to the condor_starter, which currently sit on an unmerged branch in the HTCondor repository (see #6843). The Condor binaries used on htcbridge01.pic.es are in /home/jfrey/condor-remoteproc/. The condor_starter (plus libraries) that runs on the MareNostrum nodes is pre-installed in /home/ifae96/ifae96618/release_dir/.

To run the current system, a user must have accounts on cm05-hpc.pic.es and mn1.bsc.es. They then run the following command on cm05-hpc.pic.es:

/home/jfrey/bsc_glidein/bin/launch_glidein <bsc-username> <node-cnt>

This will create an sshfs mount to BSC, submit a Slurm job at MareNostrum, and configure and launch a startd locally.
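The three steps above can be sketched roughly as follows. This is not the actual launch_glidein script; it is a minimal illustration under stated assumptions. The mount point, the remote working directory, the Slurm batch script name (glidein_job.sh), and the use of condor_master to start the local daemons are all hypothetical. The sketch defaults to a dry-run mode that only prints the commands it would run.

```shell
#!/bin/sh
# Hypothetical sketch of the three launch steps; NOT the real launch_glidein.
# DRY_RUN defaults to 1, so commands are printed instead of executed.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

launch_sketch() {
    bsc_user="$1"                       # BSC account name
    node_cnt="$2"                       # number of MareNostrum nodes to request
    mount_dir="${3:-$HOME/bsc_mount}"   # assumed local mount point
    remote_dir="${4:-glidein_work}"     # assumed directory under the BSC home

    # 1. Mount the BSC filesystem through the data transfer machine.
    run mkdir -p "$mount_dir"
    run sshfs "$bsc_user@dt01.bsc.es:$remote_dir" "$mount_dir"

    # 2. Submit a Slurm job on the login machine, requesting node_cnt nodes.
    #    glidein_job.sh is a placeholder name for the batch script.
    run ssh "$bsc_user@mn1.bsc.es" sbatch --nodes="$node_cnt" "$remote_dir/glidein_job.sh"

    # 3. Start the local HTCondor daemons (assumed to include the startd).
    run condor_master
}
```

Calling `launch_sketch alice 4` in dry-run mode prints the four commands in order, which makes it easy to review what would touch BSC before anything is executed.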

To shut down an instance of the current system, the user should run the following command from the same directory on cm05-hpc.pic.es:

/home/jfrey/bsc_glidein/bin/stop_glidein

This will remove the sshfs mount and stop the Condor daemons. It doesn't remove the Slurm job.
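Because the Slurm job is left in place, it may need to be cleaned up by hand. The helper below is a hypothetical sketch, not part of the bsc_glidein scripts. It only prints the standard Slurm commands (squeue to list the user's jobs, scancel to cancel one), assuming they are reachable over ssh on the login machine mn1.bsc.es. The job ID must be taken from the squeue output.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the commands for cleaning up
# a leftover Slurm job at MareNostrum.
slurm_cleanup_cmds() {
    bsc_user="$1"   # BSC account name
    job_id="$2"     # Slurm job ID, as reported by squeue
    # List the user's jobs to find the glidein job.
    printf 'ssh %s@mn1.bsc.es squeue -u %s\n' "$bsc_user" "$bsc_user"
    # Cancel the job once its ID is known.
    printf 'ssh %s@mn1.bsc.es scancel %s\n' "$bsc_user" "$job_id"
}
```

For example, `slurm_cleanup_cmds alice 12345` prints the two ssh command lines, which can then be reviewed and run manually.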

Attachments:

bsc_glidein.tar.gz 6699 bytes added by jfrey on 2021-Jun-29 19:16:29 UTC.
Scripts for BSC split-starter setup.