Using computational tools for data preprocessing and analysis is a daily activity for any bioinformatician. Because bioinformatics is an interdisciplinary field, a plethora of tools exist in the domain, written by developers from various backgrounds and in various languages. Often there are multiple candidate tools for a given task, and it is up to the developer or researcher to choose among them on the basis of each tool’s pros and cons.
More broadly, we see tools written in, but not limited to, R, Python, Bash, and Perl. Carrying out a complete data analysis or bioinformatics pipeline may require combining tools written in several of these languages. When it comes to automating the whole pipeline, it is convenient to implement it in a single programming language. This makes the automation easier to build and modify.
Python wrapper
Python is an excellent choice for writing wrapper code around tools built in various languages. Its utility functions and libraries allow us to write wrappers for R scripts, Perl scripts, binary executables, and so on. This way the whole data analysis and processing pipeline can be controlled from a single language. It also improves the repeatability of experiments and makes it easier to run them extensively.
In this tutorial, let us see how to write a Python wrapper for a Linux command-line program.
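As a rough illustration of this idea, the sketch below picks an interpreter from a script's file extension and builds the argument list that would be handed to `subprocess.run`. The names `INTERPRETERS`, `build_command`, and `run_tool` are hypothetical helpers introduced here for illustration, and the sketch assumes that `Rscript`, `perl`, `bash`, and `python3` are available on the PATH when a command is actually run.

```python
import subprocess
from pathlib import Path

# Hypothetical mapping from file extension to the interpreter that runs it.
INTERPRETERS = {
    ".R": ["Rscript"],
    ".pl": ["perl"],
    ".sh": ["bash"],
    ".py": ["python3"],
}


def build_command(tool, *args):
    """Return the argument list needed to launch `tool` with `args`."""
    # Binary executables have no known extension and get no interpreter prefix.
    prefix = INTERPRETERS.get(Path(tool).suffix, [])
    return prefix + [str(tool)] + [str(a) for a in args]


def run_tool(tool, *args):
    """Run the tool and raise CalledProcessError if it exits with an error."""
    return subprocess.run(build_command(tool, *args), check=True)
```

With this in place, an R script, a Perl script, and a binary can all be launched through the same `run_tool` call, which is what lets a single Python program drive the whole pipeline.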
Simple Command Line Tool
For demonstration, let us use a simple built-in Linux command called ‘ls’. ‘ls’ lists the files and folders in a directory. It accepts options and arguments that modify its behaviour slightly. By default, ls lists the files and folders in the current directory.
(base) Sajils-Air:~ sajil$ ls
Applications Downloads Movies Desktop Dropbox Music
For example, here the command lists all the folders in my home directory. My home directory happens to contain only non-empty folders and no files. Let’s look at a couple of modified use cases with parameters, so that we can write Python wrappers around those commands.
(base) Sajils-Air:~ sajil$ ls -ls
total 0
0 drwx------@ 3 sajil staff 96 Feb 25 2020 Applications
0 drwx------+ 4 sajil staff 128 Jan 29 19:13 Desktop
0 drwx------+ 68 sajil staff 2176 Feb 2 11:25 Downloads
0 drwx------@ 31 sajil staff 992 Dec 8 10:11 Dropbox
0 drwx------+ 5 sajil staff 160 Sep 19 19:39 Movies
0 drwx------+ 7 sajil staff 224 Sep 6 2020 Music
Here ‘ls’ with the ‘-ls’ options (‘-l’ for the long format and ‘-s’ for the allocated size in blocks) lists all the files and folders along with their metadata (size, date, user permissions, etc.). Another use case is specifying the directory whose contents we want to list. The usage is ‘ls <path_to_directory_to_be_listed>’.
(base) Sajils-Air:~ sajil$ ls Documents/
Books
Workshop
Python Wrapper Code
To write Python wrapper code, we have to parse the command line arguments, if there are any. Don’t worry, Python provides libraries to deal with this sort of thing. The following wrapper code (saved as run_comds.py) calls ‘ls’ and ‘ls <path_to_folder>’ using the sys and subprocess modules from the Python standard library.
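import subprocess
import sys


def run():
    # sys.argv[0] is the script name; the wrapped command and its
    # optional argument follow it.
    if len(sys.argv) == 2:
        # Only the command was given, e.g. "ls".
        # subprocess.run waits for the command to finish before returning.
        subprocess.run([sys.argv[1]])
    elif len(sys.argv) == 3:
        # Command plus one argument, e.g. "ls Graphs".
        # Passing a list avoids shell=True, so the argument is handed
        # to the command as-is instead of being interpreted by a shell.
        subprocess.run([sys.argv[1], sys.argv[2]])
    else:
        print('Invalid usage')


if __name__ == '__main__':
    run()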
Now it can be called for both cases as shown below.
# To simply list the contents of the current directory
python run_comds.py ls
# To list out contents in a specific directory (e.g. Graphs)
python run_comds.py ls Graphs
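The wrapper above prints the output of ‘ls’ straight to the terminal. In a pipeline it is often more useful to capture that output in Python so the next step can use it. A minimal sketch, assuming the `ls` command is available on the system; the function name `list_directory` is a hypothetical helper, not part of run_comds.py:

```python
import subprocess


def list_directory(path=".", options=()):
    """Run `ls` on `path` and return its output lines as a list of strings."""
    result = subprocess.run(
        ["ls", *options, str(path)],
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # decode the output bytes to str
        check=True,           # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.splitlines()
```

Calling `list_directory("Graphs")` would return the entry names as a list, and `list_directory(".", options=["-ls"])` would return the long-format lines shown earlier, one string per line.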
This use case is for demonstration only; we would probably never need a wrapper for a command as simple as ‘ls’. The point is that the same approach makes it possible to write wrapper code for bioinformatics tools that have long and complex lists of arguments.
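For a tool with more options, counting `sys.argv` entries by hand quickly becomes fragile, and the `argparse` module from the standard library can take over the parsing. The sketch below is only an illustration: `mytool`, its `--threads` and `--output` flags, and the function names are hypothetical stand-ins, not the interface of any real bioinformatics program.

```python
import argparse
import subprocess


def parse_args(argv=None):
    """Parse the wrapper's own command line arguments."""
    parser = argparse.ArgumentParser(description="Wrapper around a hypothetical 'mytool'")
    parser.add_argument("input", help="input file to process")
    parser.add_argument("-o", "--output", default="result.txt", help="output file")
    parser.add_argument("-t", "--threads", type=int, default=1, help="number of threads")
    parser.add_argument("--dry-run", action="store_true", help="print the command only")
    return parser.parse_args(argv)


def build_tool_command(args):
    """Translate the parsed arguments into the argument list for 'mytool'."""
    # 'mytool' and its flags are assumptions made for this example.
    return ["mytool", "--threads", str(args.threads), "--output", args.output, args.input]


def main(argv=None):
    args = parse_args(argv)
    command = build_tool_command(args)
    if args.dry_run:
        # Show what would be executed without running anything.
        print(" ".join(command))
        return 0
    return subprocess.run(command).returncode
```

Keeping the parsing and the command construction in separate functions makes it possible to check the generated command line, for example with `--dry-run`, before any real tool is launched.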
Now we have the automation power of Python to tweak the wrapper and reuse it for many folders, use cases, etc. Using the same technique, you can build bioinformatics pipelines by combining tools and scripts from various languages and calling them via Python wrapper code. For more complex bioinformatics pipelines, there are open-source workflow management systems such as Nextflow and Snakemake, as well as the Workflow Description Language (WDL), which are widely used.
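For a small pipeline, the combination of tools described above could be sketched as a list of named steps that are run one after another. This is a minimal sketch under stated assumptions, not a replacement for a workflow manager: the function name `run_pipeline` and the step format (a name paired with an argument list) are choices made for this example.

```python
import subprocess


def run_pipeline(steps):
    """Run each (name, command) step in order; stop at the first failure.

    Returns True if every step exited with code 0, otherwise False.
    """
    for name, command in steps:
        print(f"Running step: {name}")
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"Step '{name}' failed with exit code {result.returncode}")
            return False
    return True
```

A call could look like `run_pipeline([("quality control", ["Rscript", "qc.R", "reads.fastq"]), ("filtering", ["perl", "filter.pl", "reads.fastq"])])`, where `qc.R`, `filter.pl`, and `reads.fastq` are hypothetical file names. Stopping at the first failing step keeps later tools from running on incomplete results.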