rnalib: a python-based genomics library

Rnalib is a python library for handling transcriptomics data. It implements a transcriptome model and provides efficient iterators for the annotation of its features (genes, transcripts, exons, etc.). It also provides a number of utility functions for working with genomics data.

Design

Here are our main rnalib design considerations:

  • In rnalib, genomic data is represented by tuples of genomic locations and associated data.

  • Genomic locations are represented by immutable genomic intervals (GI) that can safely be used in indexing and hashing. Genomic intervals are named tuples (chromosome, start, end, strand).

  • Chromosome order is determined by reference dictionaries (RefDict) that store chromosome names, their order and (possibly) lengths. Reference dictionaries are used to validate and merge genomic datasets from different sources.

  • Associated annotation data are represented by arbitrary, mutable objects (e.g., dicts, numpy arrays or pandas dataframes).

  • Rnalib implements a Transcriptome class that explicitly models genomic features (e.g., genes, transcripts, exons, etc.) and their relationships (e.g., parent/child relationships) using dynamically created python dataclasses that inherit from the GI class. Associated annotations are stored in a separate dictionary that maps features to annotation data.

  • A transcriptome can be instantiated from a GFF/GTF file and rnalib understands various popular GFF/GTF ‘flavours’ (e.g., gencode, ensembl, refseq, flybase, etc.). Users can then incrementally add annotation data to transcriptomes, either by direct assignment or by using LocationIterators that yield genomic locations and associated data.

  • Rnalib implements a number of LocationIterators for iterating genomic data (location/data tuples) via a common interface. Most are based on respective pysam classes and leverage associated indexing data structures (e.g., .tbi or .bai files) for efficient random access. This enables users to quickly switch between genomic sub regions (e.g., for focussing on difficult/complex regions) and whole transcriptome analyses during development.

  • Annotated transcriptomes can be exported in various formats (e.g., GFF, BED, pandas dataframes etc.) for further processing using other tools/libraries.

  • Most importantly, rnalib was not designed to replace the great work of others but to integrate with it and fill gaps. For example, rnalib provides interfaces for integrating with pybedtools, bioframe and HTSeq.
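The interval and reference-dictionary ideas above can be illustrated without rnalib itself. The following is a minimal sketch using only the standard library. The names Interval, CHROM_LENGTHS and sort_key are hypothetical stand-ins, not rnalib's actual GI or RefDict API, and the chromosome lengths are toy values.

```python
from collections import namedtuple

# Hypothetical stand-in for an immutable genomic interval (not rnalib's GI class).
Interval = namedtuple("Interval", ["chromosome", "start", "end", "strand"])

# Hypothetical stand-in for a reference dictionary: chromosome name -> length.
# Insertion order defines the chromosome order; lengths are toy values.
CHROM_LENGTHS = {"chr1": 1000, "chr2": 800, "chrX": 600}
CHROM_RANK = {name: rank for rank, name in enumerate(CHROM_LENGTHS)}


def sort_key(interval):
    """Order intervals by reference chromosome order, then by coordinates."""
    return (CHROM_RANK[interval.chromosome], interval.start, interval.end)


# Location/data tuples: an immutable location paired with mutable annotation data.
records = [
    (Interval("chrX", 100, 200, "+"), {"name": "c"}),
    (Interval("chr2", 50, 80, "-"), {"name": "b"}),
    (Interval("chr1", 500, 900, "+"), {"name": "a"}),
]
records.sort(key=lambda rec: sort_key(rec[0]))

# Intervals are hashable, so they can serve as dictionary keys,
# while the associated annotation data stays mutable.
annotations = {loc: data for loc, data in records}
annotations[Interval("chr2", 50, 80, "-")]["score"] = 0.5

sorted_names = [data["name"] for _, data in records]
print(sorted_names)
```

Keeping locations immutable and the annotation data in a separate mapping means a location can be looked up by value while its annotations are extended incrementally.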

Rnalib’s target audience is bioinformatics analysts and developers, and its main design goal is to enable fast, readable, reproducible and robust development of novel bioinformatics tools and methods.

Installation

Rnalib is hosted on PyPI and can be installed via pip:

$ pip install rnalib

The source code is available on GitHub.

You can import the library as follows:

>>> import rnalib as rna
>>> print(f"imported rnalib {rna.__version__}")

To use rnalib in jupyter lab (recommended), you should:

  • Install jupyter lab

  • Install and create a new virtual environment (venv)

  • Activate the venv and install the required packages from the requirements.txt file

  • Add the venv to jupyter lab

  • Start jupyter lab, create/load a notebook and select the venv as kernel

Here is an example of how to use rnalib in jupyter lab (adapt paths to your system):

$ cd /Users/myusername/.virtualenvs # change to your venv directory
$ python3 -m venv rnalib      # create venv with name 'rnalib'
$ source rnalib/bin/activate  # activate venv
(rnalib) $ python3 -m pip install ipykernel ipywidgets # install required ipython packages
(rnalib) $ python3 -m pip install -r https://raw.githubusercontent.com/popitsch/rnalib/main/requirements.txt # install required packages
(rnalib) $ python3 -m ipykernel install --user --name=rnalib # add currently activated venv to jupyter
(rnalib) $ deactivate # deactivate venv
$ jupyter lab # start jupyter lab

Now you can load an rnalib notebook and select ‘rnalib’ as kernel. All basic requirements of rnalib should be installed; however, some notebook-specific requirements might still need to be installed separately. Respective instructions are provided at the beginning of each notebook.

Test data

The rnalib test suite and the tutorial ipython notebooks use various genomic test data files that are not included in the GitHub repository due to size restrictions and potential licensing issues. These test resources are ‘configured’ in the rnalib.testdata module (i.e., their source file/URL, the contained genomic region(s) and a short description of the data).

You can get final test data files in one of the following ways:

  • A zipped version (~260M) of the files can be downloaded from the GitHub release page of the rnalib repository (use the most recent release that has a ZIP file attached).

  • The files can also be created by running rnalib create_testdata from the commandline. This downloads the source files from public sources and creates the test files by slicing, sorting, compressing and indexing them. For this to work, however, the external tools bedtools, bgzip, tabix and samtools must be installed.

  • The tutorial notebooks provide code snippets for creating the test files via rna.testdata.create_testdata() which does the same as rnalib create_testdata. Again, this is only possible if you have the required external tools installed.
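Before creating the test data, it can be useful to check whether the external tools named above are available. The following is a minimal sketch using only the standard library. The helper find_missing_tools is hypothetical and not part of rnalib, and it only checks that the executables are on the PATH, not their versions.

```python
import shutil

# External tools named in the text above as prerequisites for creating test data.
REQUIRED_TOOLS = ["bedtools", "bgzip", "tabix", "samtools"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
if missing:
    print("Missing external tools:", ", ".join(missing))
else:
    print("All external tools found.")
```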

Note

The test data files are not required for using the rnalib package itself but only for testing it or for running the tutorial notebooks. The additional tools (e.g., tabix) required for creating the test data files are also not required for using the rnalib package itself.

Usage

An introduction to the API and its design, as well as several usage examples, are provided in the README.ipynb and AdvancedUsage.ipynb notebooks.

If you don’t have jupyter installed, you can also view the notebooks on GitHub or run them on Google Colab. On Google Colab, you need to install rnalib and its dependencies first. You also need to upload the required test data files to your Google Drive and mount the drive or upload the files directly to the Colab runtime.

Quick Start

Here are some examples of how to use rnalib:

Introduction to rnalib

And how to use rnalib LocationIterators:

Introduction to rnalib LocationIterators

Commandline tools

Rnalib provides a growing number of commandline tools for working with genomics data. These tools are implemented in the rnalib tools module and can be called from the commandline via rnalib <tool> or from within python scripts. Here is a list of the available tools:

Note

Call rnalib <tool> --help for more information on the respective tool.

Tutorials

We also provide a set of tutorials for further demonstrating rnalib’s API:

We compare rnalib to other genomics libraries with a focus on performance and memory usage in the following notebook:

We provide a set of tutorials for demonstrating rnalib in realistic usage scenarios:

Getting Help

If you have a question about how to use rnalib that is not addressed in the documentation, please post it on StackOverflow using the rnalib tag. For bugs and feature requests, please open a GitHub issue.

Contributing

Contributions to rnalib are highly welcome. Please contact the main author directly or open an issue or a pull request on the GitHub repository.

Testing

We use pytest and tox for testing rnalib against different python versions as configured in the tox.ini file. We also use black for code formatting. You can run the tests by running the following command in the rnalib source directory:

$ RNALIB_TESTDATA=<testdata_dir> tox

To run a specific test with a specific python version, you can use the following command:

$ RNALIB_TESTDATA=<testdata_dir> tox -epy312 -- tests/test_gi.py::test_loc_simple

To skip missing interpreters, you can use the --skip-missing-interpreters switch.
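Since the commands above pass the test data location via the RNALIB_TESTDATA environment variable, it can help to verify beforehand that the variable points to an existing directory. The following is a minimal sketch. The helper get_testdata_dir is hypothetical and not part of rnalib, and how the rnalib tests themselves read the variable is an assumption that is not shown here.

```python
import os
from pathlib import Path


def get_testdata_dir(env_var="RNALIB_TESTDATA"):
    """Return the directory named by env_var, or None if unset or not a directory."""
    value = os.environ.get(env_var)
    if not value:
        return None
    path = Path(value).expanduser()
    return path if path.is_dir() else None


testdata_dir = get_testdata_dir()
if testdata_dir is None:
    print("RNALIB_TESTDATA is not set or does not point to a directory")
else:
    print(f"Using test data from {testdata_dir}")
```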

Documentation

We use sphinx to generate the documentation. The documentation can be built by running the build_docs.sh script in the docs/ directory. The documentation of official releases is hosted on ReadTheDocs and is built automatically via an AutomationRule.

Screencasts

We use terminalizer to create animated GIF screencasts that demonstrate rnalib’s API. All required resources can be found in the docs/_static/screencasts directory. The screencasts are created by running record_screencasts.sh. This script uses the execute_screencast() method (implemented in utils.py) that simulates user interactions with the rnalib API. Note that the current version requires every line of a multi-line command after the first to be indented; see the existing examples. Note also that all python files in the screencasts directory are excluded from reformatting with black (see tox.ini).