Monday, June 20, 2011

Galaxy on Uppmax

PROJECT FINISHED, CODE ON THIS PAGE NO LONGER MAINTAINED!


This project has now finished. SLURM is now working in Galaxy and most of the code presented here has been polished and merged back into Roman's repo. Please see his bitbucket for the finished code. I will NOT update the code changes to the Galaxy files on this page anymore. Checking out Roman's code is the way to go for that:

https://bitbucket.org/brainstorm/galaxy-central

I updated the other steps in the guide on April 18th, 2012. Most things work out of the box now, so no mucking around with code changes in Python eggs and whatnot.

We will try to get this code included in the official Galaxy release, so keep your fingers crossed :)

Link to the bitbucket: https://bitbucket.org/dahlo/galaxy-central

Connecting to an installation of Galaxy running on Uppmax


Since Uppmax is behind a firewall, it is necessary to tunnel the HTTP connection to
Galaxy through an SSH tunnel. Fortunately, this is quite easy in Linux.

Create an SSH tunnel from port 8080 on your computer to port 8080 on Uppmax. The SSH
connection will then stay hidden in the background.
ssh -f <user>@<uppmax> -L 8080:localhost:8080 -N

Go to this address in your browser
http://127.0.0.1:8080

(8080 is the port specified in universe_wsgi.ini in the Galaxy dist)
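If you are unsure which port your instance uses, or want to forward a different local port, here is a quick sketch (assuming a stock config where the setting lives in universe_wsgi.ini):

# show the (possibly commented out) port setting in the Galaxy config
grep -nE "^#?port" universe_wsgi.ini

# the local and remote ports do not have to match; here local port 8888 forwards to Galaxy on 8080
ssh -f <user>@<uppmax> -L 8888:localhost:8080 -N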

Run Galaxy on a node


Oh, and if you run Galaxy as a job (to avoid the 30 minute time limit on the login nodes), you have to
create another SSH tunnel inside of Uppmax.

First, book a node for a suitable amount of time:

salloc -A <project id> -t 09:00:00 -p node -n 8 -J GalaxyServer --bell --no-shell &
When the job is granted the allocation, check which node you got using
jobinfo -u <your username>
Then connect to the node from a LOCAL TERMINAL (that is, a freshly created terminal on your own computer), tunneling in two steps (the node address looks like q11, q153 etc.):
ssh -L 8080:localhost:8080 <user>@<uppmax> 'ssh -t -t -L 8080:localhost:8080 <node address> "source ~/.bash_profile ; sh /path/to/galaxy-central/run.sh"'
Now your Galaxy instance will start on the node, and it will not be shut down after
30 minutes, as on the login nodes. By the way, .bash_profile is where your Python path might be
stored (where dependencies like yaml are installed). This should not be required later, if
you have all dependencies centrally installed and append that path to PYTHONPATH when
you load the galaxy module. As before, use the address http://127.0.0.1:8080 in your
browser.
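When you are done for the day it is a good idea to tear everything down again. A rough sketch (the job id is the one reported by salloc/jobinfo, and the backgrounded tunnel is the one started with -f earlier):

# stop Galaxy with Ctrl-C in the tunnelling terminal, then release the node allocation
scancel <job id>

# find and kill the backgrounded ssh tunnel, if you used one
ps aux | grep "[s]sh -f"
kill <pid of the tunnel>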

You have to register a user and set the DRMAA options for your user before you can use Galaxy.


1. Register and login
2. In top menu: User - Preferences
3. Native job runner options
4. Set options for your system.


NOTE: To change the available Partition options, modify row 55 in database/compiled_templates/user/native_runner_params.mako.py



Installing the DRMAA module


The SLURM-DRMAA module is not globally installed at Uppmax, so I needed to make a
personal installation of it.

Downloaded and compiled slurm-drmaa 1.0.3:

./configure --with-slurm-inc=/usr/include/slurm --with-slurm-lib=/usr/lib64/slurm --prefix=/home/username/glob/work/userspace
make
make install
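To sanity check the installation before involving Galaxy, you can point DRMAA_LIBRARY_PATH at the freshly built library and try to open a DRMAA session from Python. A minimal sketch (assuming the drmaa Python module is importable, e.g. from Galaxy's eggs or a separate install):

export DRMAA_LIBRARY_PATH=/home/username/glob/work/userspace/lib/libdrmaa.so
python -c "import drmaa; s = drmaa.Session(); s.initialize(); print('drmaa session initialized ok'); s.exit()"

If this prints the message without errors, the library and SLURM can talk to each other.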

Making Galaxy SLURM friendly


There are some code modifications involved when getting Galaxy to speak with the
DRMAA module.

Setting some environment variables to be able to use DRMAA and other things.


Added the following to the head of run.sh in galaxy-dist. The plan is to add some
menus in the job wizard UI where you can choose these options (account, time
requirement etc.). The DRMAA paths are required for DRMAA to work.

./run.sh (row 5) changed:

#!/bin/sh

cd `dirname $0`

python ./scripts/check_python.py
[ $? -ne 0 ] && exit 1

SAMPLES="
    external_service_types_conf.xml.sample
    datatypes_conf.xml.sample
.
.
.

to

#!/bin/sh

cd `dirname $0`

python ./scripts/check_python.py
[ $? -ne 0 ] && exit 1

# import environment variables
source ./startup_settings

SAMPLES="
    external_service_types_conf.xml.sample
    datatypes_conf.xml.sample
.
.
.

./startup_settings: (create this file)
## set required variables. Very site specific, so change these to suit your needs.

# should be ok on all systems
export TEMP=database/tmp

# location of drmaa-slurm library
export DRMAA_LIBRARY_PATH=/home/dahlo/glob/work/userspace/lib/libdrmaa.so

# copy previous
export DRMAA_PATH=$DRMAA_LIBRARY_PATH

# Specify which modules to load before starting Galaxy.
# Remove this if you do not have the module system.
# Most sites don't, and you would know if you did.
# It's a system for managing installed software, requiring you to load a module for a program before you can use it.
export SLURM_MOD="bioinfo-tools bowtie samtools tophat"
module load $SLURM_MOD
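Note that the code changes further down (the job runner and the tool form) check for a GALAXY_SLURM environment variable before taking the SLURM code path. It is not set anywhere in the snippets above, so make sure it is exported somewhere, for example here in startup_settings:

# enable the SLURM code path in the patched job runner and tool form
export GALAXY_SLURM=1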

Configuring Galaxy to use DRMAA as a job runner


If you just downloaded Galaxy, you probably need to rename ./universe_wsgi.ini.sample to just universe_wsgi.ini

In universe_wsgi.ini

Changed (row 518)
#start_job_runners = None

to
start_job_runners = drmaa

Changed (row 522)
#default_cluster_job_runner = local:///

to
default_cluster_job_runner = drmaa:///

This lets the native variables be injected later on in the DRMAA job runner.
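Once this is in place you can verify that jobs really end up in SLURM (and not the local runner) by running a small tool in Galaxy and watching the queue, for example:

# the patched runner names the jobs GalaxyJob_<id>, so they are easy to spot
squeue -u <your username>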

Changes in Galaxy's DRMAA job runner (NO LONGER MAINTAINED)

Check out this code instead: https://bitbucket.org/dahlo/galaxy-central


I have changed a couple of code blocks to make it work with SLURM. The plan is to
have an if-statement check whether SLURM is to be used (through an environment variable),
and use my code if that is the case. Otherwise it will use the standard code, to
avoid causing conflicts with other queue systems I have not been able to test it
on.

I added some information about the job in the head of the sh file containing the job
itself, mainly a copy of the SLURM settings used. A module load command is also added to
be able to load the correct modules on the node running the job.

Changed (row 35)
drm_template = """#!/bin/sh
#$ -S /bin/sh
GALAXY_LIB="%s"
if [ "$GALAXY_LIB" != "None" ]; then
    if [ -n "$PYTHONPATH" ]; then
        PYTHONPATH="$GALAXY_LIB:$PYTHONPATH"
    else
        PYTHONPATH="$GALAXY_LIB"
    fi
    export PYTHONPATH
fi
cd %s
%s
"""

to
if os.getenv("GALAXY_SLURM"):
  # The sbatch rows below do NOT influence the job.
  # They are only there so you can see which options were used
  # when submitting the job.
  drm_template = """#!/bin/bash -l
  #$ -S /bin/bash -l
  #SBATCH -A %s
  #SBATCH -p %s
  #SBATCH -t %s
  %s
  
  module load %s
  
  GALAXY_LIB="%s"
  if [ "$GALAXY_LIB" != "None" ]; then
      if [ -n "$PYTHONPATH" ]; then
          PYTHONPATH="$GALAXY_LIB:$PYTHONPATH"
      else
          PYTHONPATH="$GALAXY_LIB"
      fi
      export PYTHONPATH
  fi
  cd %s
  %s
  """

else:
  #~ ORIGINAL
  drm_template = """#!/bin/sh
  #$ -S /bin/sh
  GALAXY_LIB="%s"
  if [ "$GALAXY_LIB" != "None" ]; then
      if [ -n "$PYTHONPATH" ]; then
          PYTHONPATH="$GALAXY_LIB:$PYTHONPATH"
      else
          PYTHONPATH="$GALAXY_LIB"
      fi
      export PYTHONPATH
  fi
  cd %s
  %s
  """


and further down, to get the saved SLURM settings, I changed (row 185, after the modification above)
native_spec = self.get_native_spec( runner_url )
if native_spec is not None:
    jt.nativeSpecification = native_spec
script = drm_template % (job_wrapper.galaxy_lib_dir, os.path.abspath( job_wrapper.working_directory ), command_line)

to
# check if slurm is activated
if os.getenv("GALAXY_SLURM"):
  ## get saved slurm variables
  from ConfigParser import SafeConfigParser
  import re
  
  # save the slurm variables together with the sh files
  os.system("cp %s/database/pbs/slurm_settings.tmp %s/database/pbs/galaxy_%s.slurm" % (os.getcwd(), os.getcwd(), job_wrapper.get_id_tag()))
  
  # load a parser and read the slurm settings
  parser = SafeConfigParser()
  parser.read("%s/database/pbs/galaxy_%s.slurm" % (os.getcwd(), job_wrapper.get_id_tag()))
  
  # check if time is given in days (can not be handled by slurm-drmaa)
  slurm_t = parser.get('slurm','t')  # get user specified time
  t_search = re.search("(\d+)-(\d+):(.+)",slurm_t)  # check if it has days in it (3-12:00:00)
  if t_search:  # if it has
    slurm_t = "%s:%s" % ((int(t_search.group(1))*24 + int(t_search.group(2))), t_search.group(3))  # convert to hours
  
  # check for memory request
  slurm_c = parser.get('slurm','c')
  c_search = re.search("#SBATCH ([^']*)",slurm_c)  # check if any special memory is requested
  if c_search:   # if it is
    slurm_c = c_search.group(1)  # save the request for later in nativeSpecification
  
  # set job variables
  jt.nativeSpecification = "-A %s -p %s %s" % (parser.get('slurm','a'),parser.get('slurm','p'),slurm_c)  # insert account, partition and memory request
  jt.hardWallclockTimeLimit = slurm_t  # store time requirement
  jt.jobName = "GalaxyJob_%s" % (job_wrapper.get_id_tag())  # store job name
  
  # add the same information to the script template. Will not affect anything, just for future reference
  script = drm_template % (parser.get('slurm','a'),parser.get('slurm','p'),parser.get('slurm','t'),parser.get('slurm','c'),os.environ.get("SLURM_MOD"),job_wrapper.galaxy_lib_dir, os.path.abspath( job_wrapper.working_directory ), command_line)
  
else:
  #~ ORIGINAL
  native_spec = self.get_native_spec( runner_url )
  if native_spec is not None:
      jt.nativeSpecification = native_spec
  script = drm_template % (job_wrapper.galaxy_lib_dir, os.path.abspath( job_wrapper.working_directory ), command_line)
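For reference, the native specification and wall clock limit assembled above correspond roughly to what you would pass to sbatch by hand. With, say, account b2011999, partition node, the memory option '#SBATCH -C fat' and a requested time of 2-12:00:00 (converted to 60:00:00, since slurm-drmaa cannot handle the day syntax), the submission is roughly equivalent to (values illustrative):

sbatch -A b2011999 -p node -C fat -t 60:00:00 galaxy_<id>.sh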
 

A temporary fix for HOME not being set in the shell created by slurm-drmaa. I inserted a code block.

Changed (row 232, after modifications above):

.
.
fh = file( jt.remoteCommand, "w" )
fh.write( script )
fh.close()
os.chmod( jt.remoteCommand, 0750 )

# job was deleted while we were preparing it
if job_wrapper.get_state() == model.Job.states.DELETED:
  log.debug( "Job %s deleted by user before it entered the queue" % job_wrapper.get_id_tag() )
  self.cleanup( ( ofile, efile, jt.remoteCommand ) )
  job_wrapper.cleanup()
  return
.
.

to

.
.
fh = file( jt.remoteCommand, "w" )
fh.write( script )
fh.close()
os.chmod( jt.remoteCommand, 0750 )

# check if slurm is activated
if os.getenv("GALAXY_SLURM"):
  
  # the remoteCommand was used to decide the filename in the rows above, so I had to insert this afterwards :) This should be removed as soon as Uppmax support figures out what is causing HOME to be empty.
  jt.remoteCommand = "export HOME=/home/dahlo ; %s/database/pbs/galaxy_%s.sh" % (os.getcwd(), job_wrapper.get_id_tag())


# job was deleted while we were preparing it
if job_wrapper.get_state() == model.Job.states.DELETED:
  log.debug( "Job %s deleted by user before it entered the queue" % job_wrapper.get_id_tag() )
  self.cleanup( ( ofile, efile, jt.remoteCommand ) )
  job_wrapper.cleanup()
  return
.
.



Getting SLURM options when configuring job (NO LONGER MAINTAINED)

Check out this code instead: https://bitbucket.org/dahlo/galaxy-central

When a job is configured, it is important to be able to add SLURM options like account and time requirement. Additional site-dependent options are also added.

In templates/tools_form.mako

Added a section with options needed for our SLURM.

Changed (row 245):

.
.
.
%if tool.display_by_page[tool_state.page]:
    ${trans.fill_template_string( tool.display_by_page[tool_state.page], context=tool.get_param_html_map( trans, tool_state.page, tool_state.inputs ) )}
    <input type="submit" class="primary-button" name="runtool_btn" value="Execute">
%else:
    ${do_inputs( tool.inputs_by_page[ tool_state.page ], tool_state.inputs, errors, "" )}
    <div class="form-row">
        %if tool_state.page == tool.last_page:
            <input type="submit" class="primary-button" name="runtool_btn" value="Execute">
        %else:
.
.
.

to

.
.
.
%if tool.display_by_page[tool_state.page]:
    ${trans.fill_template_string( tool.display_by_page[tool_state.page], context=tool.get_param_html_map( trans, tool_state.page, tool_state.inputs ) )}
    <input type="submit" class="primary-button" name="runtool_btn" value="Execute">
%else:
    ${do_inputs( tool.inputs_by_page[ tool_state.page ], tool_state.inputs, errors, "" )}
    
<%
import os
slurm = os.getenv("GALAXY_SLURM")
%>
    % if slurm=="1":
      <h3>SLURM Settings</h3>
      <table border="0">
        <tr>
          <td>
            <b>Account:</b>
          </td>
          <td>
            <select name="slurm_a">
            
              <%
              # generate the list of groups
              
              import subprocess as sub
              
              # run the groups command to get all the groups the user belongs to
              p = sub.Popen('groups',stdout=sub.PIPE,stderr=sub.PIPE)
              
              # get the output and split it on whitespace
              output = p.communicate()[0].split()
              
              # remove the group uppmax from the list. Very specific for the development site. Feel free to add your own here
              if "uppmax" in output:
                output.remove("uppmax")
              %>
              
              ## make an entry for each group
              %for group in output:
                <option value=${group}>${group}</option>
              %endfor
           
            </select>
          </td>
        </tr>
        <tr>
          <td>
            <b>Time reservation:</b>
          </td>
          <td>
            <input type="text" name="slurm_t" value="12:00:00" size=10/> (Ex. 48:00:00 or 2-00:00:00)
          </td>
        </tr>
        </table>
        <b><br>Memory usage:</b><br>
        <input type="radio" name="slurm_c" value="" checked>Normal
        <input type="radio" name="slurm_c" value="#SBATCH -C fat">48G or 72G
        <input type="radio" name="slurm_c" value="#SBATCH -C mem72GB">72G only<br><br>
        <b>Partition</b><br>
        <input type="radio" name="slurm_p" value="node" checked>Node
        <input type="radio" name="slurm_p" value="core">Core<br><br>
      % endif

    <div class="form-row">
            %if tool_state.page == tool.last_page:
                <input type="submit" class="primary-button" name="runtool_btn" value="Execute">
            %else:
.
.
.


These SLURM options will be sent, together with all the other options, to the function "index" in lib/galaxy/web/controllers/tool_runner.py

It is in tool_runner.py that the SLURM options are extracted and written to a file in the same directory as the .sh file that specifies the submitted job, so that they are accessible to the DRMAA code that configures the job. This is to avoid fiddling with the code in all scripts along the chain to get them to pass the settings on.
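For reference, the settings file written below (database/pbs/slurm_settings.tmp, later copied to galaxy_<id>.slurm) is a plain ConfigParser file. With the form above it might look something like this (values illustrative):

[slurm]
a = b2011999
t = 2-12:00:00
c = #SBATCH -C fat
p = node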

In tool_runner.py
Changed (row ~80):
.
.
if from_noframe is not None:
    add_frame.wiki_url = trans.app.config.wiki_url
    add_frame.from_noframe = True

return trans.fill_template( template, history=history, toolbox=toolbox, tool=tool, util=util, add_frame=add_frame, **vars )
.
.

to

.
.
if from_noframe is not None:
    add_frame.wiki_url = trans.app.config.wiki_url
    add_frame.from_noframe = True 
 
if os.getenv("GALAXY_SLURM"):
  # check if the params string contains the right things
  if "slurm" in str(params) :
    str_params = str(params)  # convert to a proper string
    
    # open the temporary file for writing settings (I sure hope there won't be any conflicts here, overwriting each other's settings... should test with workflows)
    file = open("%s/database/pbs/slurm_settings.tmp" % (os.getcwd()),"w")
    file.write("[slurm]\n")  # write section header
    
    # extract the slurm settings
    slurm_setting = ''  # reset
    slurm_setting = re.search("'slurm_a': u'([^']*)",str_params)  # check for this specific setting
    if slurm_setting:  # if something was found
      file.write("%s = %s\n" % ("a",str(slurm_setting.group(1))))  # write the account setting to the file
      
    slurm_setting = ''  # reset
    slurm_setting = re.search("'slurm_t': u'([^']*)",str_params)  # check for this specific setting
    if slurm_setting:  # if something was found
      file.write("%s = %s\n" % ("t",str(slurm_setting.group(1))))  # write the time setting to the file
    
    slurm_setting = ''  # reset
    slurm_setting = re.search("'slurm_c': u'([^']*)",str_params)  # check for this specific setting
    if slurm_setting:  # if something was found
      file.write("%s = %s\n" % ("c",str(slurm_setting.group(1))))  # write the memory setting to the file
    
    slurm_setting = ''  # reset
    slurm_setting = re.search("'slurm_p': u'([^']*)",str_params)  # check for this specific setting
    if slurm_setting:  # if something was found
      file.write("%s = %s\n" % ("p",str(slurm_setting.group(1))))  # write the partition setting to the file

return trans.fill_template( template, history=history, toolbox=toolbox, tool=tool, util=util, add_frame=add_frame, **vars )
.
.

Current


Finished

Future

  • Have a cup of tea

Encountered problems

  • Various python modules missing, like yaml. Fixed by simply installing them. 
  • Module system not initiated in the sh shell created by slurm-drmaa. Fixed by configuring the initiation scripts to be run when shell == sh as well.
  • HOME not set in the shell created by slurm-drmaa. Still working on that. The solution to include an 'export HOME..' in the remoteCommand is not so elegant..
  • Some modules print to stderr when loading, causing Galaxy to interpret the job as failed. Fixed by removing these print commands. There is a workaround called the 'Gordon patch', if memory serves, that is used to wrap all jobs and remove any stderr print outs, but it seems a bit overkill at the moment. We'll see how things develop.
  • Got "python: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_auth_get_arg_desc" error whenever I tried to submit jobs. Turned out I forgot to rename drmaa-0.4b3.egg back to .egg after having it as .zip when editing wrappers.py
  • Environment variables from the submitting node were not inherited by the worker node. This was quickly fixed in an update to slurm-drmaa (job.c file) by the developer Mariusz Mamoński.



15 comments:

  1. Martin, I still think the best option is to go for Python's virtualenv, as suggested by the official Galaxy documentation, while removing the rot from the module system in the postactivate hook:


    # User specific aliases and functions

    export PATH=$PATH:~/opt/mypython/bin
    export PYTHONPATH=~/opt/mypython/lib/python2.6/site-packages
    source ~/opt/mypython/bin/virtualenvwrapper.sh



    mkdir -p $HOME/opt/mypython/lib/python2.6/site-packages
    easy_install-2.6 --prefix=~/opt/mypython pip
    pip install virtualenvwrapper --install-option="--prefix=~/opt/mypython"



    mkvirtualenv --python=python2.6 --no-site-packages


    Finally, you should define the following code in ~/.virtualenvs//bin/postactivate:


    #!/bin/bash
    # This hook is run after this virtualenv is activated.

    source ~/bin/reload_uppmax_modules.sh

    # We don't want UPPMAX's python
    RPATH="/sw/comp/python/2.6.6_kalkyl/bin"
    PATH=$( echo ${PATH} | tr -s ":" "\n" | grep -vwE "(${RPATH})" | tr -s "\n" ":" | sed "s/:$//" )

    unset PYTHONHOME

  2. Look/merge my last commit, it contains your $HOME hack... I was trying to circumvent it by other means, that's why it was not there... any progress on determining why $HOME is gone on the nodes when the job is run? I'm quite puzzled by it, I must say :-S I wouldn't be surprised if it's related to some weird side-effect with the module system.

    There was an old UPPMAX ticket/weird interaction with screen (unexpected clearing of environment variables), I'll see if it's related somehow.

  3. Humm... could it be that slurmd is the one clearing the variables for security reasons?

    http://superuser.com/questions/235760/ld-library-path-unset-by-screen

  4. There was a brief discussion about this during an Uppmax meeting a couple of weeks ago, after I had sent a support ticket about it. I don't remember exactly why, but one of the sysadmins thought it had something to do with the initialization scripts that run for different shells. He would look into it, but then July came...

    Now everyone is away on vacation and should be back in a couple of weeks. There is only one person managing the whole support system at Uppmax right now, and he is understandably quite busy :/

  5. The drmaa-python issue has been reported upstream, so hopefully those changes will not be needed anymore in the future:

    http://code.google.com/p/drmaa-python/issues/detail?id=25

    Thanks Martin and Mariusz (slurm-drmaa developer) for their support.

  6. Regarding the shell issue, there's a documented "-shell yes" native specification parameter that I haven't managed to get working yet together with slurm-drmaa:

    http://linux.die.net/man/3/drmaa_attributes


    I sent an email to the slurm-drmaa developer and I'll look at it myself shortly.

  7. Hi, it seems you have been able to fix the SGE_ROOT issue, but somehow I am stuck:

    SGE_ROOT environment variable required.....
    http://code.google.com/p/drmaa-python/issues/detail?id=29

    any ideas?

    Replies
    1. I just got back from Easter, and I see in the bug report you linked to that you resolved your problem.

  8. Dahlö,

    how would you judge your slurm-galaxy implementation after using it for a year? Has it been stable? What are the main pain points?

    Replies
    1. Hi Dipe

      I'll ask around to see if anyone has been using it, since I don't run much analysis myself nowadays. Our GUI developer has been busy with other, more acute tasks the last 6-12 months, so we have not started pushing people to use Galaxy yet.

      I guess the biggest pain point is how to launch it for inexperienced users without a GUI or script.

      I'll add more here if I learn something new when I ask around.

    2. Hello again. I have not heard about anyone actually using this yet, so I can't give you more than I already have, sorry.

      If it was easier to start up a Galaxy instance at our site, maybe people would start using it.

  9. Martin, I submitted some modifications these guys made to your code. Maybe that gets included.

    https://bitbucket.org/galaxy/galaxy-central/issue/778/add-new-slurm-drmaa-runner-to-galaxy

    Replies
    1. Nice, thank you for helping SLURM support on its way to being included in Galaxy :)

  10. Hi Dahlo,

    Thanks for your brilliant job! It helps me a lot.

    Now I get stuck with the error "python: symbol lookup error: /usr/lib64/slurm/auth_munge.so: undefined symbol: slurm_auth_get_arg_desc".

    I see your explanation above, but I cannot understand what you mean by renaming the egg after editing wrappers.py? How did you edit wrappers.py?

    Thanks,
    Jingchao

    Replies
    1. Hello Jingchao

      Sorry for the late response, I have had a lot on my plate lately. I am afraid I have no idea where your problem might be...

      The "renaming of the egg" part was removed from the instructions after the developer of slurm-drmaa updated his code and the error went away. What I did was go to Galaxy's eggs folder (eggs/) and rename the drmaa-slurm egg (drmaa-0.4b3-py2.6.egg) to a zip file (drmaa-0.4b3-py2.6.zip). I then unzipped the file, edited wrappers.py, and zipped it again. After that I renamed the zip file (drmaa-0.4b3-py2.6.zip) to an egg again (drmaa-0.4b3-py2.6.egg). The egg then contained an edited version of wrappers.py.

      But as I said, this step is not necessary anymore since the slurm-drmaa developer updated his code. Maybe you could send him an email and ask if he knows what is causing the problem? http://apps.man.poznan.pl/trac/slurm-drmaa/
