This project is now finished. SLURM is now working in Galaxy, and most of the code presented here has been polished and merged back into Roman's repo. Please see his bitbucket for the finished code. I will NOT update the code changes to the Galaxy files on this page anymore. Checking out Roman's code is the way to go for that:
https://bitbucket.org/brainstorm/galaxy-central
I have updated the other steps in the guide on April 18th 2012. Most things work out of the box now, so no mucking around with code changes in python eggs and whatnot.
We will try to get this code included in the official Galaxy release, so keep your fingers crossed :)
Link to the bitbucket: https://bitbucket.org/dahlo/galaxy-central
Connecting to an installation of Galaxy running on Uppmax
Since Uppmax is behind a firewall it is necessary to tunnel the HTTP connection to
Galaxy through an ssh tunnel. Fortunately this is quite easy in Linux.
Create an ssh tunnel from port 8080 on your computer to port 8080 on uppmax. The ssh
connection will then hide in the background.
ssh -f <user>@<uppmax> -L 8080:localhost:8080 -N
Go to this address in your browser
http://127.0.0.1:8080
(8080 is the port specified in universe_wsgi.ini in the galaxy dist. If you have changed it there, use that port in the tunnel and in the address instead.)
Run Galaxy on a node
Oh, and if you run Galaxy as a job (to avoid the 30 min time limit), you have to
create another ssh tunnel inside of uppmax.
First book a node for a suitable time
salloc -A <project id> -t 09:00:00 -p node -n 8 -J GalaxyServer --bell --no-shell &
When the job is granted the allocation, check which node you got using
jobinfo -u <your username>
Then connect to the node from a LOCAL TERMINAL (that is, a freshly opened terminal on your own computer), tunneling in two steps (the node address is e.g. q11, q153 etc)
ssh -L 8080:localhost:8080 <user>@<uppmax> 'ssh -t -t -L 8080:localhost:8080 <node address> "source ~/.bash_profile ; sh /path/to/galaxy-central/run.sh"'
Now your Galaxy instance will start on the node, and it will not be shut down after
30 minutes, as on the login nodes. Btw, .bash_profile is where your python path might be
stored (where dependencies like yaml are installed). This should not be required later, if
you have all dependencies centrally installed and append that path to the pythonpath when
you load the galaxy module. As before, use the address http://127.0.0.1:8080 in your
browser.
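The nested quoting in the two-step tunnel is easy to get wrong by hand. Here is a minimal sketch of how the same command could be assembled in Python. The function name two_hop_tunnel_command and all its parameters are made up for this illustration and are not part of Galaxy or Uppmax.

```python
import shlex


def two_hop_tunnel_command(user, login_host, node, galaxy_dir, port=8080):
    """Build the argv for a two-step ssh tunnel that also starts Galaxy on the node.

    All names here are hypothetical; adjust them to your own site.
    """
    forward = "{0}:localhost:{0}".format(port)
    # Command that should run on the compute node.
    on_node = "source ~/.bash_profile ; sh {}/run.sh".format(galaxy_dir)
    # Command that runs on the login node: hop on to the compute node.
    on_login = "ssh -t -t -L {} {} {}".format(forward, node, shlex.quote(on_node))
    # The first hop goes from your own computer to the login node.
    return ["ssh", "-L", forward, "{}@{}".format(user, login_host), on_login]
```

The returned list can be handed to subprocess.call() as is, so your local shell never has to parse the inner quotes.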
You have to register a user, and set the DRMAA options for your user before being able to use Galaxy.
1. Register and login
2. In top menu: User - Preferences
3. Native job runner options
4. Set options for your system.
NOTE: To change the available Partition options, modify row 55 in database/compiled_templates/user/native_runner_params.mako.py
Installing the DRMAA module
The SLURM-DRMAA module is not globally installed at Uppmax, so I needed to make a
personal installation of it.
Downloaded and compiled slurm_drmaa 1.0.3
./configure --with-slurm-inc=/usr/include/slurm --with-slurm-lib=/usr/lib64/slurm --prefix=/home/username/glob/work/userspace
make
make install
Making Galaxy SLURM friendly
There are some code modifications involved when getting Galaxy to speak with the
DRMAA module.
Setting some environment variables to be able to use DRMAA and other things.
Added the following to the head of run.sh in galaxy-dist. The plan is to add some
menus in the job wizard UI where you can choose these options (account, time
requirement etc). The DRMAA paths are required for DRMAA to work.
./run.sh (row 5) changed:
#!/bin/sh

cd `dirname $0`

python ./scripts/check_python.py
[ $? -ne 0 ] && exit 1

SAMPLES="
    external_service_types_conf.xml.sample
    datatypes_conf.xml.sample
.
.
.
to
#!/bin/sh

cd `dirname $0`

python ./scripts/check_python.py
[ $? -ne 0 ] && exit 1

# import environment variables
source ./startup_settings

SAMPLES="
    external_service_types_conf.xml.sample
    datatypes_conf.xml.sample
.
.
.
./startup_settings: (create this file)
## set required variables. Very site specific, so change these to suit your needs.

# should be ok on all systems
export TEMP=database/tmp

# location of drmaa-slurm library
export DRMAA_LIBRARY_PATH=/home/dahlo/glob/work/userspace/lib/libdrmaa.so

# copy previous
export DRMAA_PATH=$DRMAA_LIBRARY_PATH

# Specify which modules to load before starting galaxy.
# Remove if you do not have the module system.
# Most don't, and you would know if you did.
# It's a system to handle installed software on a system, forcing you to load a module for a software before being able to use it.
export SLURM_MOD="bioinfo-tools bowtie samtools tophat"
module load $SLURM_MOD
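Since these variables are very site specific, a quick sanity check before starting Galaxy can save some head scratching. Here is a minimal sketch, assuming the three variables from startup_settings are the ones that matter. check_startup_settings is a hypothetical helper, not something that ships with Galaxy.

```python
import os


def check_startup_settings(environ, path_exists=os.path.exists):
    """Return a list of problems found in the DRMAA related environment variables.

    Hypothetical helper for illustration; an empty list means nothing looked wrong.
    """
    problems = []
    lib = environ.get("DRMAA_LIBRARY_PATH")
    if not lib:
        problems.append("DRMAA_LIBRARY_PATH is not set")
    elif not path_exists(lib):
        problems.append("DRMAA library not found: " + lib)
    # startup_settings copies DRMAA_LIBRARY_PATH into DRMAA_PATH
    if environ.get("DRMAA_PATH") != lib:
        problems.append("DRMAA_PATH differs from DRMAA_LIBRARY_PATH")
    if not environ.get("TEMP"):
        problems.append("TEMP is not set")
    return problems


if __name__ == "__main__":
    for problem in check_startup_settings(os.environ):
        print(problem)
```

Run it in the same shell that sourced startup_settings. If nothing is printed, the variables are at least set and the library file exists.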
Configuring Galaxy to use DRMAA as a job runner
If you just downloaded Galaxy, you probably need to rename ./universe_wsgi.ini.sample to just universe_wsgi.ini
In universe_wsgi.ini
Changed (row 518)
#start_job_runners = None
to
start_job_runners = drmaa
Changed (row 522)
#default_cluster_job_runner = local:///
to
default_cluster_job_runner = drmaa:///
Letting native variables be injected later on in the drmaa job runner.
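To double check that both rows really got uncommented, the ini file can be read back with Python. This is a minimal sketch which assumes the two options live in the [app:main] section of universe_wsgi.ini and that the file parses with the standard configparser. job_runner_settings is a hypothetical helper.

```python
import configparser


def job_runner_settings(ini_text, section="app:main"):
    """Return the two job runner options, or None for those still commented out.

    The section name is an assumption; change it if your file differs.
    """
    # interpolation=None so that stray % signs in the file do not cause errors
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    return {
        "start_job_runners": parser.get(section, "start_job_runners", fallback=None),
        "default_cluster_job_runner": parser.get(section, "default_cluster_job_runner", fallback=None),
    }
```

Pass it the contents of universe_wsgi.ini, for example job_runner_settings(open("universe_wsgi.ini").read()). Both values should come back as drmaa and drmaa:/// after the change.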
Changes in Galaxys DRMAA job runner (NO LONGER MAINTAINED)
Check out this code instead: https://bitbucket.org/dahlo/galaxy-central
I have changed a couple of code blocks to make it work with SLURM. The plan is to
have an if-statement check if SLURM is to be used (through an environment variable),
and use my code if that is the case. Otherwise it will use the standard code, to
avoid causing conflicts with other queue systems I have not been able to test run it
on.
I added some information about the job in the head of the sh file containing the job
itself. Mainly a copy of the SLURM settings used. A modules command is also added to
be able to load the correct modules on the node running the job.
Changed (row 35)
drm_template = """#!/bin/sh
#$ -S /bin/sh
GALAXY_LIB="%s"
if [ "$GALAXY_LIB" != "None" ]; then
    if [ -n "$PYTHONPATH" ]; then
        PYTHONPATH="$GALAXY_LIB:$PYTHONPATH"
    else
        PYTHONPATH="$GALAXY_LIB"
    fi
    export PYTHONPATH
fi
cd %s
%s
"""
to
if os.getenv("GALAXY_SLURM"):
    # The sbatch rows below does NOT influence the job
    # They are there to be able to see which options were used
    # when submitting the job.
    drm_template = """#!/bin/bash -l
#$ -S /bin/bash -l
#SBATCH -A %s
#SBATCH -p %s
#SBATCH -t %s
%s
module load %s
GALAXY_LIB="%s"
if [ "$GALAXY_LIB" != "None" ]; then
    if [ -n "$PYTHONPATH" ]; then
        PYTHONPATH="$GALAXY_LIB:$PYTHONPATH"
    else
        PYTHONPATH="$GALAXY_LIB"
    fi
    export PYTHONPATH
fi
cd %s
%s
"""
else:
    #~ ORIGINAL
    drm_template = """#!/bin/sh
#$ -S /bin/sh
GALAXY_LIB="%s"
if [ "$GALAXY_LIB" != "None" ]; then
    if [ -n "$PYTHONPATH" ]; then
        PYTHONPATH="$GALAXY_LIB:$PYTHONPATH"
    else
        PYTHONPATH="$GALAXY_LIB"
    fi
    export PYTHONPATH
fi
cd %s
%s
"""
and further down, to get the SLURM settings from environment variables, i changed (row 185, after modification above)
native_spec = self.get_native_spec( runner_url )
if native_spec is not None:
    jt.nativeSpecification = native_spec

script = drm_template % (job_wrapper.galaxy_lib_dir, os.path.abspath( job_wrapper.working_directory ), command_line)
to
# check if slurm is activated
if os.getenv("GALAXY_SLURM"):

    ## get saved slurm variables
    from ConfigParser import SafeConfigParser
    import re

    # save the slurm variables together with the sh files
    os.system("cp %s/database/pbs/slurm_settings.tmp %s/database/pbs/galaxy_%s.slurm" % (os.getcwd(), os.getcwd(), job_wrapper.get_id_tag()))

    # load a parser and read the slurm settings
    parser = SafeConfigParser()
    parser.read("%s/database/pbs/galaxy_%s.slurm" % (os.getcwd(), job_wrapper.get_id_tag()))

    # check if time is given in days (can not be handled by slurm-drmaa)
    slurm_t = parser.get('slurm','t')  # get user specified time
    t_search = re.search("(\d+)-(\d+):(.+)",slurm_t)  # check if it has days in it (3-12:00:00)
    if t_search:  # if it has
        slurm_t = "%s:%s" % ((int(t_search.group(1))*24 + int(t_search.group(2))), t_search.group(3))  # convert to hours

    # check for memory request
    slurm_c = parser.get('slurm','c')
    c_search = re.search("#SBATCH ([^']*)",slurm_c)  # check if any special memory is requested
    if c_search:  # if it is
        slurm_c = c_search.group(1)  # save the request for later in nativeSpecification

    # set job variables
    jt.nativeSpecification = "-A %s -p %s %s" % (parser.get('slurm','a'),parser.get('slurm','p'),slurm_c)  # insert account, partition and memory request
    jt.hardWallclockTimeLimit = slurm_t  # store time requirement
    jt.jobName = "GalaxyJob_%s" % (job_wrapper.get_id_tag())  # store job name

    # add the same information to the script template. Will not affect anything, just for future reference
    script = drm_template % (parser.get('slurm','a'),parser.get('slurm','p'),parser.get('slurm','t'),parser.get('slurm','c'),os.environ.get("SLURM_MOD"),job_wrapper.galaxy_lib_dir, os.path.abspath( job_wrapper.working_directory ), command_line)

else:
    #~ ORIGINAL
    native_spec = self.get_native_spec( runner_url )
    if native_spec is not None:
        jt.nativeSpecification = native_spec

    script = drm_template % (job_wrapper.galaxy_lib_dir, os.path.abspath( job_wrapper.working_directory ), command_line)
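The two conversions in the block above are easier to follow (and to test) when pulled out on their own. Here is a minimal standalone sketch of the same idea. The function names slurm_time_to_hours and strip_sbatch_prefix are made up for this example, and the regular expressions are slightly stricter than the ones in the runner code.

```python
import re


def slurm_time_to_hours(slurm_t):
    """Convert 'D-HH:MM:SS' to 'H:MM:SS'; times without days are returned unchanged."""
    match = re.match(r"(\d+)-(\d+):(.+)$", slurm_t)
    if match is None:
        return slurm_t
    days, hours, rest = match.groups()
    # fold the days into the hour field, since slurm-drmaa can not handle days
    return "{}:{}".format(int(days) * 24 + int(hours), rest)


def strip_sbatch_prefix(slurm_c):
    """Turn '#SBATCH -C fat' into '-C fat' so it fits in the native specification."""
    match = re.match(r"#SBATCH (.*)$", slurm_c)
    return match.group(1) if match else slurm_c


if __name__ == "__main__":
    print(slurm_time_to_hours("3-12:00:00"))      # 84:00:00
    print(strip_sbatch_prefix("#SBATCH -C fat"))  # -C fat
```

So a request for 3 days and 12 hours reaches slurm-drmaa as 84 hours, while something like 12:00:00 passes through untouched.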
A temporary fix for HOME not being set in the shell created by drmaa-slurm. Inserted a code block.
Changed (row 232, after modifications above):
.
.
fh = file( jt.remoteCommand, "w" )
fh.write( script )
fh.close()
os.chmod( jt.remoteCommand, 0750 )

# job was deleted while we were preparing it
if job_wrapper.get_state() == model.Job.states.DELETED:
    log.debug( "Job %s deleted by user before it entered the queue" % job_wrapper.get_id_tag() )
    self.cleanup( ( ofile, efile, jt.remoteCommand ) )
    job_wrapper.cleanup()
    return
.
.
to
.
.
fh = file( jt.remoteCommand, "w" )
fh.write( script )
fh.close()
os.chmod( jt.remoteCommand, 0750 )

# check if slurm is activated
if os.getenv("GALAXY_SLURM"):
    # the remoteCommand was used to decide the filename in the rows above, so i had to insert this afterwards :) This should be removed as soon as uppmax support figures out what is causing HOME to be empty.
    jt.remoteCommand = "export HOME=/home/dahlo ; %s/database/pbs/galaxy_%s.sh" % (os.getcwd(), job_wrapper.get_id_tag())

# job was deleted while we were preparing it
if job_wrapper.get_state() == model.Job.states.DELETED:
    log.debug( "Job %s deleted by user before it entered the queue" % job_wrapper.get_id_tag() )
    self.cleanup( ( ofile, efile, jt.remoteCommand ) )
    job_wrapper.cleanup()
    return
.
.
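The hardcoded /home/dahlo only works for one user. Here is a minimal sketch of one possible alternative, assuming the job runs as the same user as the Galaxy process on a Unix system: look the home directory up in the password database instead of relying on $HOME. remote_command_with_home is a hypothetical helper and has not been tried against slurm-drmaa.

```python
import os
import pwd
import shlex


def remote_command_with_home(script_path, home=None):
    """Prefix the job script with an explicit 'export HOME=...'.

    Hypothetical helper; 'home' defaults to the current user's home directory
    as listed in the password database.
    """
    if home is None:
        # does not depend on the HOME environment variable being set
        home = pwd.getpwuid(os.getuid()).pw_dir
    return "export HOME={} ; {}".format(shlex.quote(home), shlex.quote(script_path))
```

The result has the same shape as the remoteCommand string above, for example "export HOME=/home/alice ; /path/to/database/pbs/galaxy_42.sh".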
Getting SLURM options when configuring job (NO LONGER MAINTAINED)
Check out this code instead: https://bitbucket.org/dahlo/galaxy-central
When a job is configured, it is important to be able to add SLURM options like account and time requirement. Additional options which are site-dependent are also added.
In templates/tools_form.mako
Added a section with options needed for our SLURM.
Changed (row 245):
.
.
.
%if tool.display_by_page[tool_state.page]:
    ${trans.fill_template_string( tool.display_by_page[tool_state.page], context=tool.get_param_html_map( trans, tool_state.page, tool_state.inputs ) )}
    <input type="submit" class="primary-button" name="runtool_btn" value="Execute">
%else:
    ${do_inputs( tool.inputs_by_page[ tool_state.page ], tool_state.inputs, errors, "" )}
    <div class="form-row">
        %if tool_state.page == tool.last_page:
            <input type="submit" class="primary-button" name="runtool_btn" value="Execute">
        %else:
.
.
.
to
.
.
.
%if tool.display_by_page[tool_state.page]:
    ${trans.fill_template_string( tool.display_by_page[tool_state.page], context=tool.get_param_html_map( trans, tool_state.page, tool_state.inputs ) )}
    <input type="submit" class="primary-button" name="runtool_btn" value="Execute">
%else:
    ${do_inputs( tool.inputs_by_page[ tool_state.page ], tool_state.inputs, errors, "" )}

    <%
        import os
        slurm = os.getenv("GALAXY_SLURM")
    %>
    % if slurm=="1":
        <h3>SLURM Settings</h3>
        <table border="0">
            <tr>
                <td>
                    <b>Account:</b>
                </td>
                <td>
                    <select name="slurm_a">
                    <%
                        # generate the list of groups
                        import subprocess as sub
                        # run the groups command to get all the groups the user belongs to
                        p = sub.Popen('groups',stdout=sub.PIPE,stderr=sub.PIPE)
                        # get the output, strip the trailing newline, and split it on spaces
                        output = p.communicate()[0].strip().split(" ")
                        # remove the group uppmax from the list. Very specific for the development site. Feel free to add your own here
                        if "uppmax" in output:
                            output.remove("uppmax")
                    %>
                    ## make an entry for each group
                    %for group in output:
                        <option value=${group}>${group}</option>
                    %endfor
                    </select>
                </td>
            </tr>
            <tr>
                <td>
                    <b>Time reservation:</b>
                </td>
                <td>
                    <input type="text" name="slurm_t" value="12:00:00" size=10/> (Ex. 48:00:00 or 2-00:00:00)
                </td>
            </tr>
        </table>
        <b><br>Memory usage:</b><br>
        <input type="radio" name="slurm_c" value="" checked>Normal
        <input type="radio" name="slurm_c" value="#SBATCH -C fat">48G or 72G
        <input type="radio" name="slurm_c" value="#SBATCH -C mem72GB">72G only<br><br>
        <b>Partition</b><br>
        <input type="radio" name="slurm_p" value="node" checked>Node
        <input type="radio" name="slurm_p" value="core">Core<br><br>
    % endif

    <div class="form-row">
        %if tool_state.page == tool.last_page:
            <input type="submit" class="primary-button" name="runtool_btn" value="Execute">
        %else:
.
.
.
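The account drop-down is built from the output of the groups command. Here is a minimal sketch of just that parsing step, kept separate from the template so it can be tried on its own. account_choices and the example project name are made up for this illustration.

```python
def account_choices(groups_output, exclude=("uppmax",)):
    """Turn the output of the 'groups' command into a list of selectable accounts.

    Hypothetical helper; 'exclude' holds site specific groups that are not
    valid SLURM accounts.
    """
    lines = groups_output.splitlines()
    # keep the first line only and split it on whitespace
    first_line = lines[0] if lines else ""
    return [group for group in first_line.split() if group not in exclude]


if __name__ == "__main__":
    print(account_choices("uppmax b2010001 staff\n"))  # ['b2010001', 'staff']
```

Splitting on any whitespace also gets rid of the trailing newline, which would otherwise end up inside the last option value.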
These SLURM options will be sent, together with all the other options, to the function "index" specified in lib/galaxy/web/controllers/tool_runner.py
It is in tool_runner.py that the SLURM options are extracted and written to a file in the same directory as the .sh file that specifies the submitted job (), to be accessible to the drmaa script that configures the job. This is to avoid fiddling with the code in all scripts along the chain to get them to pass the settings on.
In tool_runner.py
Changed (row ~80):
.
.
if from_noframe is not None:
    add_frame.wiki_url = trans.app.config.wiki_url
    add_frame.from_noframe = True
return trans.fill_template( template, history=history, toolbox=toolbox, tool=tool, util=util, add_frame=add_frame, **vars )
.
.
to
.
.
if from_noframe is not None:
    add_frame.wiki_url = trans.app.config.wiki_url
    add_frame.from_noframe = True

if os.getenv("GALAXY_SLURM"):

    # check if the params string contains the right things
    if "slurm" in str(params) :

        str_params = str(params)  # convert to a proper string

        # open the temporary file for writing settings (i sure hope there won't be any conflicts here, overwriting eachothers settings.. Should test with workflows)
        file = open("%s/database/pbs/slurm_settings.tmp" % (os.getcwd()),"w")
        file.write("[slurm]\n")  # write section header

        # extract the slurm settings
        slurm_setting = ''  # reset
        slurm_setting = re.search("'slurm_a': u'([^']*)",str_params)  # check for this specific setting
        if slurm_setting:  # if something was found
            file.write("%s = %s\n" % ("a",str(slurm_setting.group(1))))  # write the setting to the file

        slurm_setting = ''  # reset
        slurm_setting = re.search("'slurm_t': u'([^']*)",str_params)  # check for this specific setting
        if slurm_setting:  # if something was found
            file.write("%s = %s\n" % ("t",str(slurm_setting.group(1))))  # write the setting to the file

        slurm_setting = ''  # reset
        slurm_setting = re.search("'slurm_c': u'([^']*)",str_params)  # check for this specific setting
        if slurm_setting:  # if something was found
            file.write("%s = %s\n" % ("c",str(slurm_setting.group(1))))  # write the setting to the file

        slurm_setting = ''  # reset
        slurm_setting = re.search("'slurm_p': u'([^']*)",str_params)  # check for this specific setting
        if slurm_setting:  # if something was found
            file.write("%s = %s\n" % ("p",str(slurm_setting.group(1))))  # write the setting to the file

        # close the file so the settings are on disk before the drmaa runner copies it
        file.close()

return trans.fill_template( template, history=history, toolbox=toolbox, tool=tool, util=util, add_frame=add_frame, **vars )
.
.
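The four near identical regex searches can be avoided if the form values are available as a dictionary. Here is a minimal sketch of that idea. It assumes params can be treated as a plain dict, which is not how the code above gets at the values, and slurm_settings_ini is a hypothetical helper.

```python
import configparser
import io


def slurm_settings_ini(params):
    """Return the text of a [slurm] ini section built from all 'slurm_*' form values.

    'params' is assumed to be a plain dict here, e.g. {'slurm_a': 'b2010001'}.
    """
    parser = configparser.ConfigParser(interpolation=None)
    parser["slurm"] = {
        key[len("slurm_"):]: str(value)  # 'slurm_a' becomes 'a', and so on
        for key, value in params.items()
        if key.startswith("slurm_")
    }
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()
```

The returned text has the same layout as slurm_settings.tmp, with a [slurm] header followed by a, t, c and p, so the parser in the drmaa runner should be able to read it unchanged.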
Current
Finished
Future
- Have a cup of tea
Encountered problems
- Various python modules missing, like yaml. Fixed by simply installing them.
- Module system not initiated in the sh shell created by slurm-drmaa. Fixed by configuring the initiation scripts to be run when shell == sh as well.
- HOME not set in the shell created by slurm-drmaa. Still working on that. The solution to include an 'export HOME..' in the remoteCommand is not so elegant..
- Some modules print to stderr when loading, causing Galaxy to interpret the job as failed. Fixed by removing these print commands. There is a workaround called the 'Gordon patch', if memory serves, that is used to wrap all jobs and remove any stderr print outs, but it seems a bit overkill at the moment. We'll see how things develop.
- Got "python: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_auth_get_arg_desc" error whenever i tried to submit jobs. Turned out i forgot to rename drmaa-0.4b3.egg back to .egg after having it as .zip when editing wrappers.py
- Environment variables from the submitting node were not inherited by the worker node. This was quickly fixed in an update to slurm-drmaa (job.c file) by the developer Mariusz Mamoński.
Martin, I still think the best option is to go for python's VirtualENV, as suggested by official galaxy documentation, while removing the rot from the module system on the postactivate hook:
# User specific aliases and functions
export PATH=$PATH:~/opt/mypython/bin
export PYTHONPATH=~/opt/mypython/lib/python2.6/site-packages
source ~/opt/mypython/bin/virtualenvwrapper.sh
mkdir -p $HOME/opt/mypython/lib/python2.6/site-packages
easy_install-2.6 --prefix=~/opt/mypython pip
pip install virtualenvwrapper --install-option="--prefix=~/opt/mypython"
mkvirtualenv --python=python2.6 --no-site-packages
Finally, you should define the following code in ~/.virtualenvs//bin/postactivate:
#!/bin/bash
# This hook is run after this virtualenv is activated.
source ~/bin/reload_uppmax_modules.sh
# We don't want UPPMAX's python
RPATH="/sw/comp/python/2.6.6_kalkyl/bin"
PATH=$( echo ${PATH} | tr -s ":" "\n" | grep -vwE "(${RPATH})" | tr -s "\n" ":" | sed "s/:$//" )
unset PYTHONHOME
Look/merge my last commit, contains your $HOME hack... I was trying to circumvent it with other means that's why it was not there... any progress on determining why $HOME is gone on the nodes when the job is run ? I'm quite puzzled by it I must say :-S I wouldn't be surprised if it's related to some weird side-effect with the module system.
There was an old UPPMAX ticket/weird interaction with screen (unexpected clearing of environment variables), I'll see if it's related somehow.
Humm... could it be that slurmd is the one clearing the variables for security reasons ?
http://superuser.com/questions/235760/ld-library-path-unset-by-screen
There was a brief discussion about this during a uppmax-meeting a couple of weeks ago, after i had sent a support ticket about it. I don't remember exactly why, but one of the sys admins thought it had something to do with the initialization scripts that runs for different shells. He would look into it, but then July came..
Now everyone is away on vacation and should be back in a couple of weeks. There is only one person managing the whole support system at uppmax right now, and he is understandably quite busy :/
drmaa-python issue has been reported upstream, so hopefully those changes will not be needed anymore in the future:
http://code.google.com/p/drmaa-python/issues/detail?id=25
Thanks Martin and Mariusz (slurm-drmaa developer) for their support.
Regarding the shell issue, there's a documented "-shell yes" nativespecification parameter that I haven't managed to get working yet together with slurm-drmaa:
http://linux.die.net/man/3/drmaa_attributes
I sent an email to the slurm-drmaa developer and I'll look at it myself shortly.
Hi it seems you have been able to fix the SGE_ROOT issue but somehow I am stuck:
SGE_ROOT environment variable required.....
http://code.google.com/p/drmaa-python/issues/detail?id=29
any ideas?
I just got back from easter, and I see in the bug report you linked to that you resolved your problem.
Dahlö,
how would you judge your slurm-galaxy implementation after using it for a year? Has it been stable? What are the main pain points?
Hi Dipe
I'll ask around if anyone has been using it, since i don't run much analysis myself nowadays. Our GUI developer has been busy with other more acute tasks the last 6-12 months, so we have not started pushing people to using galaxy yet.
I guess the biggest pain point is how to launch it for inexperienced users without a GUI or script.
I'll add more here if i learn something new when i ask around.
Hello again. I have not heard about anyone actually using this yet, so I can't give you more than i already have, sorry.
If it was easier to start up a galaxy instance at our site, maybe people would start using it.
Martin, I submitted some modifications to your code these guys made. May be that gets included.
https://bitbucket.org/galaxy/galaxy-central/issue/778/add-new-slurm-drmaa-runner-to-galaxy
Nice, thank you for helping slurm support on the way to be included in galaxy :)
Hi Dahlo,
Thanks for your brilliant job! It helps me a lot.
Now I get stuck with the error "python: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_auth_get_arg_desc".
I see your explanation above but I cannot understand what do you mean by rename the egg after editing wrappers.py? How did you edit the wrappers.py?
Thanks,
Jingchao
Hello Jingchao
Sorry for the late response, i have had a lot on my plate lately. I am afraid i have no idea where your problem might be..
The "renaming of the egg"-part was removed from the instruction after the developer of slurm-drmaa updated his code and the error went away. What i did was that i went to galaxy's eggs folder (eggs/) and renamed the drmaa-slurm egg (drmaa-0.4b3-py2.6.egg) to a zip file (drmaa-0.4b3-py2.6.zip). I then unzipped the file, edited the wrappers.py, and then zipped it again. After that i renamed the zip file (drmaa-0.4b3-py2.6.zip) to an egg again (drmaa-0.4b3-py2.6.egg). The egg then contained an edited version of wrappers.py.
But as i said, this step is not necessary anymore since the slurm-drmaa updated his code. Maybe you could send him an email and ask if he knows what is causing the problem? http://apps.man.poznan.pl/trac/slurm-drmaa/