Running CMS Jobs at BSC via a Shared Filesystem

CMS researchers at PIC (Port d’Informació Científica) are interested in running jobs originating at CERN on the MareNostrum system at BSC (Barcelona Supercomputing Center). Access to MareNostrum is highly restrictive, particularly with regard to networking. We are experimenting with a solution in which jobs are forwarded to MareNostrum execute nodes via a shared filesystem link. The approach is outlined in this document:

https://docs.google.com/document/d/1GGOc3pgidfv_qunaKqzjbvKPF8CD_RF30KkqMtwBC0U/edit?usp=sharing

Implementation Details

We have a working proof-of-concept implementation, which has been tested with 4 MareNostrum nodes running sleep jobs. Here, we give details of the current code.

Machines:

All of the code is available under /home/jfrey/bsc_glidein/ on htcbridge01.pic.es. A copy is saved at /afs/cs.wisc.edu/p/condor/workspaces/jfrey/bsc_glidein.tar.gz. It relies on some changes to the condor_starter, which currently sit on an unmerged branch in the HTCondor repository (see #6843). The Condor binaries used on htcbridge01.pic.es are in /home/jfrey/condor-remoteproc/. The condor_starter (plus libraries) that runs on the MareNostrum nodes is pre-installed in /home/ifae96/ifae96618/release_dir/.

To run the current system, a user must have accounts on cm05-hpc.pic.es and mn1.bsc.es. They then run the following command on cm05-hpc.pic.es:

This will create an sshfs mount to BSC, submit a Slurm job at MareNostrum, and configure and launch a startd locally.
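
The three steps can be sketched roughly as follows. This is an illustration only, not the actual launch script: the function name, the directory paths, the Slurm script name, and the use of CONDOR_CONFIG to point at a locally generated configuration are all assumptions. Only the host name mn1.bsc.es comes from the text above. The sketch builds the command lines without executing them.

```python
import shlex

BSC_LOGIN_HOST = "mn1.bsc.es"  # BSC login node named in the text


def build_launch_commands(bsc_user, remote_dir, mount_point,
                          slurm_script, condor_config):
    """Return the command lines for the three launch steps (dry run only).

    All arguments are hypothetical placeholders chosen for illustration.
    """
    return [
        # 1. Mount a directory at BSC locally over sshfs.
        ["sshfs", f"{bsc_user}@{BSC_LOGIN_HOST}:{remote_dir}", mount_point],
        # 2. Submit the Slurm job through the BSC login node.
        ["ssh", f"{bsc_user}@{BSC_LOGIN_HOST}", "sbatch", slurm_script],
        # 3. Start the local HTCondor daemons (including the startd)
        #    with a configuration file generated for this instance.
        ["env", f"CONDOR_CONFIG={condor_config}", "condor_master"],
    ]


# Print the commands instead of running them.
for cmd in build_launch_commands("alice", "/path/at/bsc", "/tmp/bsc_mount",
                                 "glidein.sh", "/tmp/condor_config"):
    print(shlex.join(cmd))
```

Returning the commands as lists rather than executing them makes it easy to inspect what would be run before touching either site.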

To shut down an instance of the current system, the user should run the following command from the same directory on cm05-hpc.pic.es:

This will remove the sshfs mount and stop the Condor daemons. It doesn't remove the Slurm job.
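
A rough sketch of the teardown, plus one possible way to cancel the leftover Slurm job by hand, is given below. It is not the actual shutdown script: the function names, paths, and the use of CONDOR_CONFIG are assumptions, and the manual scancel step assumes the Slurm job ID was noted at submission time. The commands are only built, not executed.

```python
import shlex

BSC_LOGIN_HOST = "mn1.bsc.es"  # BSC login node named in the text


def build_teardown_commands(mount_point, condor_config):
    """Return the command lines for the two teardown steps (dry run only).

    Both arguments are hypothetical placeholders.
    """
    return [
        # Unmount the sshfs (FUSE) mount.
        ["fusermount", "-u", mount_point],
        # Ask the local condor_master to shut down along with its daemons.
        ["env", f"CONDOR_CONFIG={condor_config}", "condor_off", "-master"],
    ]


def build_scancel_command(bsc_user, slurm_job_id):
    """Return a command that cancels the Slurm job, which teardown leaves behind."""
    return ["ssh", f"{bsc_user}@{BSC_LOGIN_HOST}", "scancel", str(slurm_job_id)]


# Print the commands instead of running them.
for cmd in build_teardown_commands("/tmp/bsc_mount", "/tmp/condor_config"):
    print(shlex.join(cmd))
print(shlex.join(build_scancel_command("alice", 12345)))
```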
