rnalib package

Rnalib is a python utilities library for handling genomics data with a focus on transcriptomics.

It implements a transcriptome model that can be instantiated from various popular GFF ‘flavours’ as published in encode, ensembl, ucsc, chess, mirgendb and flybase databases. Several filtering options are available to filter transcripts by gene type, biotype, etc.

The transcriptome model is implemented as a hierarchical tree of frozen dataclasses (derived from the ‘Feature’ class) and supports incremental and flexible annotation of transcriptome features. It also supports the extraction of genomic sequences for transcriptome features.

The transcriptome model supports genomic range queries via query() which are implemented by a combination of interval and linear search queries. A transcriptome object maintains one intervaltree per chromosome built from gene annotations. Overlap/envelop queries will first be applied to the respective intervaltree and the (typically small result sets) will then be filtered, e.g., for requested sub-feature types.

class rnalib.GI(chromosome: str = None, start: int = 0, end: int = 2147483647, strand: str = None)[source]

Bases: NamedTuple

Genomic intervals (GI) in rnalib are inclusive, continuous and 1-based. This model differs from other genomics libraries but was chosen to make interpretation of GIs straightforward: start and end coordinates represent the first/last included nucleotide as also seen in a genome browser (such as IGV).

GIs are implemented as (readonly) namedtuples and can safely be used as keys in a dict. They are instantiated by the gi() factory method that either parses from a string representation or by passing explicitly chrom/start/stop coordinates. Coordinate components can be ‘unrestricted’ by setting them to None, e.g., gi(‘chr1’, 1, None) refers to the whole chromosome 1, gi(‘chr1’, 100000) refers to the section on chr1 from (and including) position 100k on, gi(start=100, end=200) refers to positions 100-200 (inclusive) on any chromosome.

Intervals can be stranded. The strand can be ‘+’ or ‘-’ or None if unstranded. Note that ‘.’ will be converted to None. Individual genomic positions (points) are represented by intervals with same start and stop coordinate. Empty intervals can be represented by start>end coordinates (e.g., gi(‘chr1’, 1,0).is_empty() -> True) but note that this feature is still experimental and may not be supported in all methods.

GIs can be compared for equality, overlap, envelopment, etc. and can be sorted by chromosome and coordinates.

GIs are grouped by chromosomes/contigs of a given reference genome and their order is defined by a RefDict (that extends regular python dicts). RefDicts are typically instantiated automatically from indexing data structures (e.g., .tbi or .bai files) and keep track of the order of chromosomes/contigs and their lengths. They are used throughout rnalib to assert the compatibility of different genomic datasets and to sort GIs by chromosome and coordinate, e.g., via sorted(gis, key=lambda x: (refdict.index(x.chromosome), x)). Note that the index of chromosome ‘None’ is always 0, i.e., it is the first chromosome in the sorted list.

GIs can be iterated over to yield individual positions (points) as GIs, e.g., tuple(gi(‘chr2’, 1, 3)) -> (gi(‘chr2’, 1, 1), gi(‘chr2’, 2, 2), gi(‘chr2’, 3, 3)).

GIs should be instantiated by the gi() factory method that allows to configure each coordinate component individually or can parse from a string representation, e.g., gi(‘chr1:1-10 (+)’). The factory method conducts type and value checking and conversions (e.g., ‘.’ to None for strand).

Variables:
  • chromosome (str) – Chromosome (default: None)

  • start (int) – First included position, 1-based (default: 0).

  • end (int) – Last included position, 1-based (default: MAX_INT)

  • strand (str) – Strand, either ‘+’ or ‘-’ or None if unstranded. Note that ‘.’ will be converted to None. Default: None

Examples

>>> gi('chr3')
>>> gi('chr2', 1, 10)
>>> gi('1', 1, 10, strand='+')
>>> gi('chr4:1-10 (-)')
>>> assert gi('chrX:2-1').is_empty()

Notes

Note that there is also a frozen (immutable) dataclass version of this class, GI_dataclass that shares most functionality with GIs and is used as superclass for genomic features (genes, transcripts, etc.) as instantiated by the Transcriptome class. The reason for this separation is that dataclass instantiation is relatively slow and has performance impacts when instantiating large sets of GIs.

chromosome: str

Alias for field number 0

start: int

Alias for field number 1

end: int

Alias for field number 2

strand: str

Alias for field number 3

classmethod from_str(loc_string)[source]

Parse from <chr>:<start>-<end> (<strand>). Strand is optional

static sort(intervals, refdict, reverse: bool = False)[source]

Returns a chromosome + coordinate sorted iterable over the passed intervals. Chromosome order is defined by the passed reference dict.

get_stranded(strand)[source]

Get a new object with same coordinates; the strand will be set according to the passed variable.

to_file_str()[source]

returns a sluggified string representation “<chrom>_<start>_<end>_<strand>”

is_unbounded()[source]
is_empty()[source]
is_stranded()[source]
cs_match(other, strand_specific=False)[source]

True if this location is on the same chrom/strand as the passed one. will not compare chromosomes if they are unrestricted in one of the intervals. Empty intervals always return False hee

left_match(other, strand_specific=False)[source]
right_match(other, strand_specific=False)[source]
left_pos()[source]
right_pos()[source]
envelops(other, strand_specific=False) bool[source]

Tests whether this interval envelops the passed one.

overlaps(other, strand_specific=False) bool[source]

Tests whether this interval overlaps the passed one. Supports unrestricted start/end coordinates and optional strand check

overlap(other, strand_specific=False) int[source]

Calculates the overlap with the passed genomic interval. Supports unrestricted start/end coordinates and optional strand check If the intervals are not on the same chromosome or strand, 0 is returned. If the intervals are empty, 0 is returned. If the intervals are adjacent, 0 is returned. If the intervals do not overlap, 0 is returned.

split_coordinates() -> (<class 'str'>, <class 'int'>, <class 'int'>)[source]
classmethod merge(loc)[source]

Merges a list of intervals. If intervals are not on the same chromosome or if strand is not matching, None is returned The resulting interval will inherit the chromosome and strand of the first passed one.

Examples

>>> GI.merge([gi('chr1', 1, 10), gi('chr1', 11, 20)])
is_adjacent(other, strand_specific=False)[source]

true if intervals are directly next to each other (not overlapping!)

get_downstream(width=100)[source]

Returns an upstream genomic interval

get_upstream(width=100)[source]

Returns an upstream genomic interval

get_extended(width=100)[source]

Returns an genomic interval that is extended up- and downstream by width nt

split_by_maxwidth(maxwidth)[source]

Splits this into n intervals of maximum width

copy()[source]

Returns a copy of this gi

distance(other, strand_specific=False)[source]

Distance to other interval. - None if chromosomes do not match - 0 if intervals overlap, 1 if they are adjacent - negative if other < self

to_pybedtools()[source]

Returns a corresponding pybedtools interval object. Note that this will fail on open or empty intervals as those are not supported by pybedtools.

Examples

>>> gi('chr1',1,10).to_pybedtools()
>>> gi('chr1',1,10, strand='-').to_pybedtools()
>>> gi('chr1').to_pybedtools() # uses maxint as end coordinate

Warning

Note that len(gi(‘chr1’,12,10).to_pybedtools()) reports wrong length 4294967295 for this empty interval!

to_htseq()[source]

Returns a corresponding HTSeq interval object. Note that this will fail on open or empty intervals as those are not supported by HTSeq.

Examples

>>> gi('chr1',1,10).to_htseq()
>>> gi('chr1',1,10, strand='-').to_htseq()
>>> gi('chr1').to_htseq()
rnalib.gi(chromosome: str = None, start: int = 0, end: int = 2147483647, strand: str = None)[source]

Factory function for genomic intervals (GI).

Examples

>>> gi('chr1', 1, 10, strand='-') # explicit coordinates
>>> gi('chr1:1-10 (-)') # parsed from string
>>> gi('chr1') # whole chromosome
>>> gi() # unbounded interval
class rnalib.GI_dataclass(chromosome: str = None, start: int = 0, end: int = 2147483647, strand: str = None)[source]

Bases: object

Dataclass for genomic intervals (GIs) in rnalib. Copies the functionality of the namedtuple GI, but is slower to instantiate due to the post_init assertions. Needed for Feature hierarchies and other dataclasses that need to be frozen.

chromosome: str
start: int
end: int
strand: str
get_stranded(strand)

Get a new object with same coordinates; the strand will be set according to the passed variable.

to_file_str()

returns a sluggified string representation “<chrom>_<start>_<end>_<strand>”

is_unbounded()
is_empty()
is_stranded()
cs_match(other, strand_specific=False)

True if this location is on the same chrom/strand as the passed one. will not compare chromosomes if they are unrestricted in one of the intervals. Empty intervals always return False hee

left_match(other, strand_specific=False)
right_match(other, strand_specific=False)
left_pos()
right_pos()
envelops(other, strand_specific=False) bool

Tests whether this interval envelops the passed one.

overlaps(other, strand_specific=False) bool

Tests whether this interval overlaps the passed one. Supports unrestricted start/end coordinates and optional strand check

overlap(other, strand_specific=False) int

Calculates the overlap with the passed genomic interval. Supports unrestricted start/end coordinates and optional strand check If the intervals are not on the same chromosome or strand, 0 is returned. If the intervals are empty, 0 is returned. If the intervals are adjacent, 0 is returned. If the intervals do not overlap, 0 is returned.

split_coordinates() -> (<class 'str'>, <class 'int'>, <class 'int'>)
classmethod merge(loc)

Merges a list of intervals. If intervals are not on the same chromosome or if strand is not matching, None is returned The resulting interval will inherit the chromosome and strand of the first passed one.

Examples

>>> GI.merge([gi('chr1', 1, 10), gi('chr1', 11, 20)])
is_adjacent(other, strand_specific=False)

true if intervals are directly next to each other (not overlapping!)

get_downstream(width=100)

Returns an upstream genomic interval

get_upstream(width=100)

Returns an upstream genomic interval

get_extended(width=100)

Returns an genomic interval that is extended up- and downstream by width nt

split_by_maxwidth(maxwidth)

Splits this into n intervals of maximum width

copy()

Returns a copy of this gi

distance(other, strand_specific=False)

Distance to other interval. - None if chromosomes do not match - 0 if intervals overlap, 1 if they are adjacent - negative if other < self

to_pybedtools()

Returns a corresponding pybedtools interval object. Note that this will fail on open or empty intervals as those are not supported by pybedtools.

Examples

>>> gi('chr1',1,10).to_pybedtools()
>>> gi('chr1',1,10, strand='-').to_pybedtools()
>>> gi('chr1').to_pybedtools() # uses maxint as end coordinate

Warning

Note that len(gi(‘chr1’,12,10).to_pybedtools()) reports wrong length 4294967295 for this empty interval!

to_htseq()

Returns a corresponding HTSeq interval object. Note that this will fail on open or empty intervals as those are not supported by HTSeq.

Examples

>>> gi('chr1',1,10).to_htseq()
>>> gi('chr1',1,10, strand='-').to_htseq()
>>> gi('chr1').to_htseq()
class rnalib.FixedKeyTypeDefaultdict(*args, **kwargs)[source]

Bases: defaultdict

A defaultdict that allows only keys of a certain type.

Examples

>>> d = rna.FixedKeyTypeDefaultdict(defaultdict, allowed_key_type=int)
>>> d[1] = 1 # this works, key is an int
>>> import pytest
>>> with pytest.raises(TypeError) as e_info:
>>>     d['x'] = 2 # this raises a TypeError as key is a string
>>> d.allowed_key_type = str # you can change the allowed key type
>>> d['x'] = 2 # now it works
>>> assert dill.loads(dill.dumps(d)).allowed_key_type == str # pickling/dill works
class rnalib.Transcriptome(annotation_gff: str, annotation_flavour: str, genome_fa: str = None, gene_name_alias_file: str = None, annotation_fun_alias: Callable = None, copied_fields: tuple = (), load_sequence_data: bool = False, calc_introns: bool = True, disable_progressbar: bool = False, genome_offsets: dict = None, name='Transcriptome', feature_filter=None, custom_feature_types: list = None)[source]

Bases: object

Represents a transcriptome as modelled by a GTF/GFF file.

  • Model contains genes, transcripts and arbitrary sub-features (e.g., exons, intron, 3’/5’-UTRs, CDS) as defined in the GFF file. Frozen dataclasses (derived from the ‘Feature’ class) are created for all parsed feature types automatically and users may configure which GTF/GFF attributes will be added to those (and are thus accessible via dot notation, e.g., gene.gene_type).

  • This transcriptome implementation exploits the hierarchical relationship between genes and their sub-features to optimize storage and computational requirements, see the Feature documentation for examples. To enable this, however, parent features must envelop (i.e., completely contain) child feature locations and this requirement is asserted when building the transcriptome.

  • A transcriptome maintains an anno dict mapping (frozen) features to dicts of arbitrary annotation values. This supports incremental and flexible annotation of transcriptome features. Values can directly be accessed via dot notation <feature>.<attribute> and can be stored/loaded to/from a (pickled) file.

  • Feature sequences can be added via load_sequences() which will extract the sequence of the top-level feature (‘gene’) from the configured reference genome. Sequences can then be accessed via get_sequence(). For sub-features (e.g., transcripts, exons, etc.) the respective sequence will be sliced from the gene sequence. If mode=’rna’ is passed, the sequence is returned in 5’-3’ orientation, i.e., they are reverse-complemented for minus-strand transcripts. The returned sequence will, however, still use the DNA alphabet (ACTG) to enable direct alignment/comparison with genomic sequences. if mode=’spliced’, the spliced 5’-3’ sequence will be returned. if mode=’translated’, the spliced 5’-3’ CDS sequence will be returned.

  • Genomic range queries via query() are supported by a combination of interval and linear search queries. A transcriptome object maintains one intervaltree per chromosome built from gene annotations. Overlap/envelop queries will first be applied to the respective intervaltree and the (typically small result sets) will then be filtered, e.g., for requested sub-feature types.

  • When building a transcriptome model from a GFF/GTF file, contained transcripts can be filtered using a TranscriptFilter.

  • The current implementation does not implement the full GFF3 format as specified in
    but currently supports various popular gff ‘flavours’ as published in
    encode, ensembl, ucsc, chess, mirgendb and flybase databases (see
    GFF_FLAVOURS). As such this implementation will likely be extended
    in the future.

@see the README.ipynb jupyter notebook for various querying and iteration examples

Parameters:
  • annotation_gff (str) – Path to a GFF/GTF file containing the transcriptome annotations.

  • annotation_flavour (str) – The annotation flavour of the GFF/GTF file. Currently supported: encode, ensembl, ucsc, chess, mirgendb and flybase.

  • genome_fa (str) – Path to a FASTA file containing the reference genome. If provided, sequences will be loaded and can be accessed via get_sequence().

  • gene_name_alias_file (str) – Path to a gene name alias file. If provided, gene symbols will be normalized using the alias file.

  • annotation_fun_alias (Callable) – Name of a function that will be used to alias gene names. The function must be defined in the global namespace and must accept a gene name and return the alias. If provided, gene symbols will be normalized using the alias function.

  • copied_fields (tuple) – List of GFF/GTF fields that will be copied to the respective feature annotations.

  • calc_introns (bool) – If true, intron features will be added to transcripts that do not have them.

  • load_sequence_data (bool) – If true, sequences will be loaded from the genome_fa file.

  • disable_progressbar (bool) – If true, progressbars will be disabled.

  • genome_offsets (dict) – A dict mapping chromosome names to offsets. If provided, the offset will be added to all coordinates.

  • name (str) – An (optional) human-readable name of the transcriptome object. default: ‘Transcriptome’

  • feature_filter (TranscriptFilter) – A TranscriptFilter object that will be used to filter transcripts.

Variables:
  • genes (List[Gene]) – List of genes in the transcriptome.

  • transcripts (List[Transcript]) – List of transcripts in the transcriptome.

  • anno (Dict[Feature, Dict[str, Any]]) – Dictionary mapping features (e.g., genes, transcripts) to their annotations.

Examples

>>> fb_gff = ... # path to flybase gff file
>>> t = rna.Transcriptome(annotation_gff = fb_gff, annotation_flavour ='flybase')
>>> # download full flybase GTF to temporary directory, use the create_testdata function to download, sort and index
>>> with tempfile.TemporaryDirectory() as tmpdir: # noqa
>>>     res = {"flybase_full": {
>>>            "uri": "https://ftp.flybase.net/releases/current/dmel_r6.55/gtf/dmel-all-r6.55.gtf.gz",
>>>            "filename": "dmel-all-r6.55.gtf.gz"}}
>>>    rna.testdata.create_testdata(tmpdir, res) # download, sort + index; needs bgzip, bedtools, tabix
>>>    t2 = rna.Transcriptome(annotation_gff = rna.get_resource("flybase_full", tmpdir, res), annotation_flavour ='flybase')
>>>  # instantiate Ensembl with chrom aliasing using a sliced genome FASTA file
>>> config = {
>>>    'genome_fa': get_resource('ACTB+SOX2_genome'),  # a sliced genome FASTA file; part of rnalib testdata
>>>    'genome_offsets': {'chr3': 181711825, 'chr7': 5526309}, # genome offsets for the sliced genome
>>>    'annotation_gff': get_resource('ensembl_gff'), # ensembl GFF file; part of rnalib testdata
>>>    'annotation_flavour': 'ensembl', # ensembl annotation flavour
>>>    'annotation_fun_alias': 'toggle_chr', # aliasing function for chromosome names; will all/remove 'chr' prefix
>>>    'load_sequence_data': True # load sequences from genome_fa
>>>    }
>>> t3 = Transcriptome(**config)  # instantiate transcriptome with config dict
build()[source]
load_sequences()[source]

Loads feature sequences from a genome FASTA file. Requires a ‘genome_fa’ config entry.

add(location: GI, feature_id: str, feature_type: str, parent=None, children: tuple = (), sort=True)[source]

Adds a feature to the transcriptome

Parameters:
  • location (GI) – Genomic interval of the feature

  • feature_id (str) – Feature id

  • feature_type (str) – Feature type

  • parent (Feature) – Parent feature

  • children (tuple) – Children features

  • sort (bool) – If true, the annotation dict will be sorted by location

sort()[source]
find_attr_rec(f, attr)[source]

recursively finds attribute from parent(s)

get_sequence(feature, mode='dna', show_exon_boundaries=False)[source]

Returns the sequence of the passed feature.

  • If mode is ‘rna’ then the reverse complement of negative-strand features (using a DNA alphabet) will be returned.

  • if mode is ‘spliced’, the fully spliced sequence of a transcript will be returned. This will always use ‘rna’ mode and is valid only for containers of exons.

  • if mode is ‘translated’ then the CDS sequence is reported. To, e.g., calculate the amino-acid sequence of a transcript using biopython’s Seq() implementation, you can do: Seq(t.transcript[my_tid].translated_sequence).translate()

  • else, the 5’-3’ DNA sequence (as shown in a genome browser) of this feature is returned

show_exon_boundaries=True can be used to insert ‘*’ characters at splicing boundaries of spliced/translated sequences.

slice_from_parent(f, attr, default_value=None)[source]

Gets an attr from the passed feature or its predecessors (by traversing the parent/child relationships). If retrieved from an (enveloping) parent interval, the returned value will be sliced. Use only to access attributes that contain one item per genomic position (e.g, arrays of per-position values)

gene_triples(max_dist=None)[source]

Convenience method that yields genes and their neighbouring (up-/downstream) genes. If max_dist is set and the neighbours are further away (or on other chromosomes), None is returned.

To iterate over all neighbouring genes within a given genomic window, consider query() or implement a custom iterator.

query(query, feature_types=None, envelop=False, sort=True)[source]

Query features of the passed class at the passed query location.

Parameters:
  • query (GenomicInterval or string that is parsed by GI.from_str()) – Query interval

  • feature_types (str or List[str]) – Feature types to query. If None, all feature types will be queried.

  • envelop (bool) – If true, only features fully contained in the query interval are returned.

  • sort (bool) – If true, the returned features will be sorted by chromosome and start coordinate.

annotate(anno_its, fun_anno, labels=None, region=None, feature_types=None, disable_progressbar=True)[source]

Annotates all features of the configured type and in the configured genomic region using the passed fun_anno function. NOTE: consider removing previous annotations with the clear_annotations() functions before (re-)annotating a transcriptome.

annotate_with_mygene(fields, attribute='feature_id', species=None, anno_prefix='', region=None, batch_size=1000, default_value=None, disable_progressbar=True)[source]

Annotates all gene features in the configured genomic region using the MyGene.info API. This method requires an internet connection and will send batch requests to MyGene.info for the configured fields, see https://docs.mygene.info/en/latest/doc/data.html#available-fields for available annotation fields.

The submitted gene ids will be accessed via the configured “attribute” (default: feature_id) and must be in Entrez (“1017”) or Ensembl (“ENSG00000123374.1”) format. For Ensembl ids, the method will remove the minor qualifier (e.g., ENSG00000123374.1 -> ENSG00000123374).

Please see here for mygene.info terms of use: https://docs.mygene.info/en/latest/terms.html

Parameters:
  • fields (list or dict) – Data fields that will be retrieved from mygene. If a {field: annotation_name} dict is passed, the values will be used as annotation names. Otherwise, the field names will be used as annotation names with dots replaced by underscores (e.g., if ‘pathway.kegg’ is requested, the annotation name will be ‘pathway_kegg’). Final annotation names will be prefixed by the anno_prefix parameter. Available fields and data types are in rna.mygeneinfo_fields. Note that for keyword and object fields, _rnalib_ converts the results always to a list using mygene.alwayslist().

  • attribute (str) – The attribute to be used for gene id retrieval. Default: ‘feature_id’

  • species (str) – Species to be used for annotation. Default: None

  • anno_prefix (str) – Prefix for the annotation keys. Default: ‘’. Annotation fields will be named as f’{anno_prefix}{annotation_name}’

  • region (GenomicInterval) – Genomic region to be annotated. If None, the whole transcriptome will be annotated.

  • batch_size (int) – Batch size for the API requests. Default: 1000

  • default_value (Any) – Default value for missing annotations. Default: None

  • disable_progressbar (bool) – If true (default), progressbars will be disabled.

Return type:

The number of annotated genes

save(out_file)[source]

Stores this transcriptome and all annotations as dill (pickle) object. Note that this can be slow for large-scaled transcriptomes and will produce large ouput files. Consider using save_annotations()/load_annotations() to save/load only the annotation dictionary.

classmethod load(in_file)[source]

Load transcriptome from pickled file

clear_annotations(retain_keys=('dna_seq',))[source]

Clears this transcriptome’s annotations (except for retain_keys annotations (by default: ‘dna_seq’)).

save_annotations(out_file, excluded_keys=('dna_seq',))[source]

Stores this transcriptome annotations as dill (pickle) object. Note that the data is stored not by object reference but by comparison key, so it can be assigned to newly created transcriptome objects.

Parameters:
  • out_file (str) – The output file name

  • excluded_keys (tuple or set) – The keys to be excluded from the output file. Default: (‘dna_seq’,)

load_annotations(in_file, update=False)[source]

Loads annotation data from the passed dill (pickle) object. If update is true, the current annotation dictionary will be updated.

Parameters:
  • in_file (str) – The input file name

  • update (bool) – If true, the current annotation dictionary will be updated. So, existing annotations with different keys will be kept.

to_gff3(out_file, bgzip=True, feature_types=('gene', 'transcript', 'exon', 'intron', 'CDS', 'three_prime_UTR', 'five_prime_UTR'))[source]

Writes a GFF3 file with all features of the configured types.

Parameters:
Return type:

the name of the (bgzipped) output file

Examples

>>> transcriptome = ...
>>> # create a file containing all intron annotations
>>> transcriptome.to_gff3('introns.gff3', feature_types=['intron'])
to_bed(out, region=None, feature_types=('transcript', ), fun_anno=<function _transcript_to_bed>, bed_header=None, disable_progressbar=True, no_header=False)[source]

Outputs transcripts of this transcriptome in BED format. Pass your custom annotation function via fun_anno to output custom BED fields (e.g., color based on transcript type).

Parameters:
  • out (file-like object) – The output file-like object.

  • region (gi or str) – The genomic region to be included in the output file. If None (default), all features will be included.

  • feature_types (tuple) – The feature types to be included in the output file (e.g., feature_types=(‘transcript’,)). Custom feature types need to be added here if they should be serialized. Default: (‘transcript’,) (all transcripts).

  • fun_anno (function) – A function that takes a feature index, a location and a feature as input and returns a tuple of strings that will be added to the BED output file.

  • bed_header (dict) – A dict containing the BED header fields. If None (default), a default header will be used.

  • disable_progressbar (bool) – If true, the progressbar will be disabled.

  • no_header (bool) – If true, no header will be written to the output file.

Example

>>> transcriptome = ...
>>> transcriptome.to_bed('transcripts.bed')
iterator(region=None, feature_types=None)[source]

returns a TranscriptomeIterator for iterating over all features of the passed type(s). If feature_types is None (default), all features will be returned.

get_struct()[source]

Return a dict mapping feature to child feature types

class rnalib.Feature(chromosome: str = None, start: int = 0, end: int = 2147483647, strand: str = None, transcriptome: Transcriptome = None, feature_id: str = None, feature_type: str = None, parent: object = None, subfeature_types: tuple = ())[source]

Bases: GI_dataclass

A (frozen) genomic feature, e.g., a gene, a transcript, an exon, an intron, a CDS or a three_prime_UTR/five_prime_UTR. Features are themselves containers of sub-features (e.g., a gene contains transcripts, a transcript contains exons, etc.). Sub-features are stored in the respective child tuples (e.g., gene.transcript, transcript.exon, etc.).

Features are built by the Transcriptome class and are typically not instantiated directly. They contain a reference to their parent feature (e.g., a transcript’s parent is a gene) and a list of child features (e.g., a gene’s children are transcripts). The parent-child relationships are defined by the transcriptome implementation and are typically defined by the GFF file used to build the transcriptome. For example, a GFF file may contain a gene feature with a child transcript feature and the transcript feature may contain a child exon feature. In this case, the transcriptome implementation will define the parent-child relationships as follows: gene.parent = None gene.children = {‘transcript’: [transcript1, transcript2, …]} transcript.parent = gene transcript.children = {‘exon’: [exon1, exon2, …]} exon.parent = transcript exon.children = {}

Features will also contain a dynamically created dataclass (see create_sub_class()) that contains annotations parsed from the GFF file. For example, a gene feature may contain a ‘gene_name’ annotation that was parsed from the GFF file. Annotations are stored in the feature’s anno dict and can be accessed via <feature>.<annotation> (e.g., gene.gene_name). Annotations are typically strings but can be of any type.

Equality of features is defined by comparing their genomic coordinates and strand, as well as their feature_type (e.g., transcript, exon, five_prime_UTR) and feature_id (which should be unique within a transcriptome), as returned by the key() method.

Features are typically associated with the Transcriptome object used to create them and (mutable) additional annotations are stored in the respective transcriptome’s anno dict can be directly accessed via <feature>.<annotation>. This hybrid approach allows efficient and transparent access to readonly (immutable) annotations parsed from the GFF file as well as to mutable annotations that are added later (e.g, using the Transcriptome.annotate() function).

This also includes dynamically calculated (derived) annotations such as feature sequences that are sliced from a feature’s predecessor via the get_sequence() method. Rnalib stores sequences only at the gene level and slices the subsequences of child features from those to minimize storage requirements.

For example, if the sequence of an exon is requested via exon.sequence then the Feature implementation will search for a ‘sequence’ annotation in the exons super-features by recursively traversing ‘parent’ relationships. The exon sequence will then be sliced form this sequence by comparing the respective genomic coordinates (which works only if parent intervals always envelop their children as asserted by the transcriptome implementation).

This mechanism can also be used for any other sliceable annotation (e.g., a numpy array) that is stored in the transcriptome’s anno dict. To access such annotations, use the get(…, slice_from_parent=True) method (e.g., exon.get(‘my_array’, slice_from_parent=True)). This will recursively traverse the parent/child relationships until the my_array annotation is found and then slice the respective section from there (assuming that array indices correspond to genomic coordinates).

Subfeatures of a feature can be accessed via the features() method which returns a generator over all such features that is unsorted by default. To iterate over features in a sorted order, consider using the TranscriptomeIterator class.

To inspect a features methods and attributes, use vars(feature), dir(feature) or help(feature).

transcriptome: Transcriptome = None
feature_id: str = None
feature_type: str = None
parent: object = None
subfeature_types: tuple = ()
key() tuple[source]

Returns a tuple containing feature_id, feature_type and genomic coordinates including strand

get(attr, default_value=None, slice_from_parent=False)[source]

Safe getter supporting default value and slice-from-parent

classmethod from_gi(loc)[source]

Init from gi

get_location()[source]

Returns a genomic interval representing the genomic location of this feature.

get_rnk()[source]

Rank (1-based index) of feature in this feature’s parent children list

features(feature_types=None)[source]

Yields all sub-features (not sorted). To get a coordinate-sorted iterator, use sorted(feature.features()). Sorting by chromosome is not required as subfeatures are enveloped by their parents by convention.

classmethod create_sub_class(feature_type, annotations: dict = None, child_feature_types: list = None)[source]

Create a subclass of feature with additional fields (as defined in the annotations dict) and child tuples

chromosome: str
start: int
end: int
strand: str
class rnalib.AbstractFeatureFilter[source]

Bases: ABC

For filtering genes/transcripts entries from a GFF when building a transcriptome.

abstract filter(loc, info)[source]

Returns true if the passed feature should be filtered, false otherwise. The filter can also return a filter message that will be added to the log.

get_chromosomes()[source]

Returns the set of chromosomes to be included or None if all (no filtering applied).

class rnalib.TranscriptFilter(config=None)[source]

Bases: AbstractFeatureFilter

A transcript filter that can be used to filter genes/transcripts by location and/or feature type specific info fields. It can be configured by chaining include_* methods or by passing a dict to the constructor. Features will first be filtered by location, then by included and then by excluded feature type specific info fields.

For location filtering, the following rules apply:

  • if included_chrom is not None, only transcripts on these chromosomes will be included

  • if excluded_chrom is not None, transcripts on these chromosomes will be excluded

  • if included_regions is not None, only transcripts overlapping these regions will be included

  • if excluded_regions is not None, transcripts overlapping these regions will be excluded

For feature type specific filtering, the following rules apply:

  • in the feature is a transcript and included_tids is not None, it will be filtered unless its feature_id is in

    included_tids

  • if included is not None, only transcripts with the specified feature type specific info fields

    will be included. Info fields are parsed to sets of values by splitting on ‘,’. if the feature type specific info field is not present, the feature will be filtered unless None is added to the included list.

  • if excluded is not None, transcripts with the specified feature type specific info fields

    will be excluded. Info fields are parsed to sets of values by splitting on ‘,’.

Examples

>>> # create a filter that includes only protein coding genes
>>> tf1 = TranscriptFilter().include_gene_types({'protein_coding'})
>>> # create a filter that includes only protein coding genes and transcripts on chr1
>>> tf2 = TranscriptFilter().include_gene_types({'protein_coding'}).include_chromosomes({'chr1'})
>>> # create a filter that includes only canonical genes and transcripts in a given region.
>>> # Features without tags will be included
>>> tf3 = TranscriptFilter().include_tags({'Ensembl_canonical', None}).include_regions({'chr1:1-10000'})
>>> # build only for given set of transcript ids
>>> tf4 = TranscriptFilter().include_transcript_ids({'ENST00000367770', 'ENST00000367771'})
>>> # create a filtered Transcriptome
>>> config = { ... }
>>> t = Transcriptome(config, feature_filter=tf4)

Notes

TODO

  • add greater, smaller than filters for feature type specific info fields

  • redesign?

filter(loc, info)[source]

Returns true if the passed feature should be filtered, false otherwise. The filter can also return a filter message that will be added to the log.

get_chromosomes()[source]

Returns the set of chromosomes to be included

include_chromosomes(chromosomes: set)[source]

Convenience method to add chromosomes to the included_chrom set

include_transcript_ids(tids: set)[source]

Convenience method to add transcript ids to the included_tids set

include_regions(regions: set)[source]

Convenience method to add regions to the included_regions set

include_gene_ids(ids: set)[source]

Convenience method to add included gene ids

include_gene_types(gene_types: set, include_missing=True, feature_type='gene')[source]

Convenience method to add included gene_types to gene+transcript inclusion rules. Use, e.g., {‘protein_coding’} to load only protein coding genes. If include_missing is True then genes/transcripts without gene_type will also be included.

include_transcript_types(transcript_types: set, include_missing=True, feature_type='transcript')[source]

Convenience method to add included transcript_types. Use, e.g., {‘miRNA’} to load only miRNA transcripts. If include_missing is True (default) then transcripts without transcript_type will also be included.

include_tags(gene_tags: set, include_missing=True, feature_type='transcript')[source]

Convenience method to add included gene tags. Use, e.g., {‘Ensembl_canonical’} to load only canonical genes. If include_missing is True then genes/transcripts without tags will also be included.

class rnalib.RefDict(d, name=None, fun_alias=None)[source]

Bases: Mapping[str, int]

Named mapping for representing a set of references (contigs) and their lengths.

Supports aliasing by passing a function (e.g., fun_alias=toggle_chr which will add/remove ‘chr’ prefixes) to easily integrate genomic files that use different (but compatible) reference names. If an aliasing function is passed, original reference names are accessible via the orig property. An aliasing function must be reversible, i.e., fun_alias(fun_alias(str))==str and support None.

Note that two reference dicts match if their (aliased) contig dicts match (name of RefDict is not compared).

has_len(chrom=None)[source]

Returns true if the passed chromosome (or all if chrom is None) has an assigned length, false otherwise.

set_len(chrom: str = None, length: int = None)[source]

Sets the length of the passed chromosome. If chrom is None, all chromosomes will be set to the passed length. Use set_len() to unset the length of all chromosomes.

Parameters:
  • chrom (str) – The chromosome name. If None, all chromosomes will be set to the passed length.

  • length (int) – The chromosome length or None if not set

tile(tile_size=1000000)[source]

Iterates in an ordered fashion over the reference dict, yielding non-overlapping genomic intervals of the given tile_size (or smaller at chromosome ends).

chromosomes()[source]

Returns a list of chromosome names

chromosomes_orig()[source]

Returns a list of chromosome names

alias(chrom)[source]
index(chrom)[source]

Index of the passed chromosome, None if chromosome not in refdict or -1 if None was passed. Useful, e.g., for sorting genomic coordinates

static merge_and_validate(*refdicts, check_order=False, included_chrom=None)[source]

Checks whether the passed reference sets are compatible and returns the merged reference set containing the intersection of common references.

Parameters:
  • refdicts – list of RefDicts

  • check_order – if True, the order of the common references is asserted. default: False

  • included_chrom – if passed, only the passed chromosomes are considered. default: None

Return type:

RefDict containing the intersection of common references

static load(fh, fun_alias=None, calc_chromlen=False)[source]

Extracts chromosome names, order and (where possible) length from pysam objects.

Parameters:
  • fh (pysam object or file path (str))

  • fun_alias (aliasing functions (see RefDict))

  • calc_chromlen (bool, if True, chromosome lengths will be calculated from the file (if required))

Returns:

dict

Return type:

chromosome name to length

Raises:

NotImplementedError – if input type is not supported yet

class rnalib.Item(location: gi, data: Any)[source]

Bases: NamedTuple

A location, data tuple, returned by an LocationIterator

location: gi

Alias for field number 0

data: Any

Alias for field number 1

class rnalib.LocationIterator(file, region: GI = None, file_format: str = None, per_position: bool = False, fun_alias: Callable = None, refdict: RefDict = None)[source]

Bases: object

Superclass for genomic iterators (mostly based on pysam) for efficient, indexed iteration over genomic datasets. Most rnalib iterables inherit from this superclass.

A LocationIterator iterates over a genomic dataset and yields Item’s, that are tuples of genomic intervals and associated data. The data can be any object (e.g., a string, a dict, a list, etc.). Location iterators can be restricted to a genomic region.

LocationIterators keep track of their current genomic location (self.location) as well as statistics about the iterated and yielded items (stats()). They can be consumed and converted to lists (self.to_list()), pandas dataframes (self.to_dataframe()), interval trees (self.to_intervaltrees()) or bed files (self.to_bed()).

LocationIterators can be grouped or tiled, see the group() and tile() methods.

LocationIterators are iterable, implement the context manager protocol (‘with’ statement) and provide a standard interface for region-type filtering (by passing chromosome sets or genomic regions of interest). Where feasible and not supported by the underlying (pysam) implementation, they implement chunked I/O where for efficient iteration (e.g., FastaIterator).

The maximum number of iterated items can be queried via the max_items() method which tries to guesstimate this number (or an upper bound) from the underlying index data structures.

Examples

>>> # Creates a LocationIterator via factory method and consume its items, splitting into locs and data lists.
>>> locs, data = zip(*it('myfile.bed'))
Variables:
  • _stats (Counter) – A counter that collects statistics (e.g., yielded items per chromosome) for QC and reporting.

  • location (GI) – The current genomic location of this iterator

Parameters:
  • file (str or pysam file object) – A file path or an open file object. In the latter case, the object will not be closed

  • region (GI or str) – genomic region to iterate

  • file_format (str) – optional, will be determined from filename if omitted

  • per_position (bool) – optional, if True, the iterator will yield per-position items (e.g., FASTA files)

  • fun_alias (Callable) – optional, if set, the iterator will use this function for aliasing chromosome names

  • refdict (RefDict) – optional, if set, the iterator will use the passed reference dict instead of reading it from the file

Notes

TODOs:

  • strand specific iteration

  • url streaming

  • add is_exhausted flag?

region: GI = None
chromosomes: list = None
location: GI = None
per_position: bool = False
file: Any = None
fun_alias: Callable = None
refdict: RefDict = None
property stats

Returns the collected stats

set_region(region)[source]

Update the iterated region of this iterator. Note that the region’s chromosome must be in this anno_its refdict (if any)

to_list(style='item')[source]

Convenience method that consumes iterator and returns a list of items, locations or data.

Parameters:

style (str) – ‘item’ (default): all items with be returned. Same as list(iterator). ‘location’: only the locations of the items will be returned ‘data’: only the data of the items will be returned

Examples

>>> it('myfile.bed').to_list()
to_bed(out, fun_anno=<function LocationIterator.<lambda>>, bed_header=None, disable_progressbar=True, no_header=False, n_col=12)[source]

Consumes iterator and returns results in BED format to the passed output stream.

Parameters:
  • out (file-like object) – The output file-like object. If None, then the results will be returned as a list.

  • fun_anno (function) – A function that takes a feature index, a location and a feature as input and returns a tuple of strings that will be added to the BED output file.

  • bed_header (dict) – A dict containing the BED header fields. If None (default), a default header will be used.

  • disable_progressbar (bool) – If true, the progressbar will be disabled.

  • no_header (bool) – If true, no header will be written to the output file.

  • n_col (int) – The number of columns to be written to the output file. Default: 12

Examples

>>> out_file = ...
>>> with open_file_obj(out_file, 'wt') as wout:
>>>     with LocationIterator(...) as lit: # use respective subclass here
>>>         lit.to_bed(wout)
>>> bgzip_and_tabix(out_file)
to_dataframe(fun=<function LocationIterator.<lambda>>, fun_col=('Value', ), coord_inc=(0, 0), coord_colnames=('Chromosome', 'Start', 'End', 'Strand'), excluded_columns=None, included_columns=None, dtypes=None, default_value=None, max_items=None, disable_progressbar=True)[source]

Consumes iterator (up to max_items items) and returns results in a dataframe. Start/stop Coordinates will be corrected by the passed coord_inc tuple.

Parameters:
  • fun (function) – a function that converts the yielded items of this iterator into a tuple of values that represent fun_col column values in the created dataframe

  • fun_col (tuple) – a tuple of column names for the created dataframe

  • coord_inc (tuple) – a tuple of values to be added to start and end coordinates

  • coord_colnames (tuple) – a tuple of column names for the created dataframe

  • excluded_columns (tuple) – optional, a tuple of column names to be excluded from the created dataframe

  • included_columns (tuple) – optional, a tuple of column names to be included in the created dataframe

  • dtypes (dict) – optional, a dict of column names and their respective dtypes

  • default_value (any) – optional, a default value to be used if a column value is not present in the iterated item

  • max_items (int) – maximum number of included items (None: all)

  • disable_progressbar (bool) – optional, if True, no progressbar will be shown

describe(fun=<function LocationIterator.<lambda>>, fun_col=('Value', ), coord_inc=(0, 0), coord_colnames=('Chromosome', 'Start', 'End', 'Strand'), excluded_columns=None, included_columns=None, dtypes=None, default_value=None, max_items=None, disable_progressbar=True) Tuple[DataFrame, dict][source]

Converts this iterator to a pandas dataframe and calls describe(include=’all’)

to_intervaltrees(disable_progressbar=False)[source]

Consumes iterator and returns results in a dict of intervaltrees.

Warning

NOTE that this will silently drop empty intervals

Notes

  • TODO

    improve performance with from_tuples method

merge(iterables: Iterable[LocationIterator], labels: Iterable[str] = None, refdict: RefDict = None)[source]

Merges this iterator with the passed iterables and returns a new iterator.

Parameters:
  • iterables (iterable) – an iterable of LocationIterators to be merged with this iterator

  • labels (iterable) – an iterable of labels for the passed iterables. If None, f”it{idx}” will be used.

  • refdict (RefDict) – optional, if set, this refdict will be used for the merged iterator, otherwise a merged refdict will be used

group(strategy='both')[source]

Wraps this iterator in a GroupedLocationIterator

tile(regions_iterable: Iterable[GI] = None, tile_size=100000000.0)[source]

Wraps this iterator in a TiledIterator.

Parameters:
  • regions_iterable (iterable) – an iterable of genomic regions (GI) that defines the tiles. Will be calculated from the refdict if None.

  • tile_size (int) – the size of the tiles

abstract max_items()[source]

Maximum number of items yielded by this iterator or None if unknown. Note that this is the upper bound of yielded items but less (or even no) items may be yielded based on filter settings, etc. Useful, e.g., for progressbars or time estimates

close()[source]

Closes the underlying file object if it was opened by this iterator

class rnalib.MemoryIterator(d, region=None, fun_alias=None)[source]

Bases: LocationIterator

A location iterator that iterates over intervals retrieved from one of the following datastructures:

  • {str:gi} dict: yields (gi,str); note that the passed strings must be unique

  • {gi:any} dict: yields (gi,any)

  • iterable of gis: yields (gi, index in input iterable). If a single GI object is passed, its contained

    positions are iterated

Regions will be sorted according to a refdict that is created from the data automatically. For associative datastructures (dicts), the key or value that is not the GI will be yielded as data item. For others, the index in the sorted iterable will be yielded as data item.

Notes

  • reported stats:
    iterated_items, chromosome: (int, str)

    Number of iterated items

    yielded_items, chromosome: (int, str)

    Number of yielded items (remaining are filtered, e.g., due to region constraints)

to_bed(out, fun_anno=<function MemoryIterator.<lambda>>, bed_header=None, disable_progressbar=True, no_header=False, n_col=4)[source]

Consumes iterator and returns results in BED format to the passed output stream.

Parameters:
  • out (file-like object) – The output file-like object. If None, then the results will be returned as a list.

  • fun_anno (function) – A function that takes a feature index, a location and a feature as input and returns a tuple of strings that will be added to the BED output file.

  • bed_header (dict) – A dict containing the BED header fields. If None (default), a default header will be used.

  • disable_progressbar (bool) – If true, the progressbar will be disabled.

  • no_header (bool) – If true, no header will be written to the output file.

  • n_col (int) – The number of columns to be written to the output file. Default: 12

Examples

>>> out_file = ...
>>> with open_file_obj(out_file, 'wt') as wout:
>>>     with LocationIterator(...) as lit: # use respective subclass here
>>>         lit.to_bed(wout)
>>> bgzip_and_tabix(out_file)
max_items()[source]

Maximum number of items yielded by this iterator or None if unknown. Note that this is the upper bound of yielded items but less (or even no) items may be yielded based on filter settings, etc. Useful, e.g., for progressbars or time estimates

class rnalib.TranscriptomeIterator(transcriptome, region=None, feature_types=None, features=None)[source]

Bases: LocationIterator

Iterates over (sorted) features in a transcriptome object. Yielded items contain the respective feature (location) and its transcriptome annotation (data). Note that no chromosome aliasing is possible with this iterator as returned features are immutable.

The transcriptome features will be filtered by feature_type and sorted if the transcriptome object is not sorted (i.e., if features were added after it was built from a (sorted) GFF3 file).

A transcriptome iterator can be converted to a pandas dataframe via the to_dataframe() method which consumes the remaining items. Column names of the dataframe are automatically estimated from the transcriptome model. Data columns may be excluded via the ‘excluded_columns’ parameter; by default the reserved ‘dna_seq’ field is excluded as this field may contain very long sequence strings. The default conversion function (fun) extracts values from the iterated location objects (=feature). To add custom columns, one needs to provide (i) a list of column names via the ‘included_columns’ parameter and (ii) a custom extraction methof (fun) that return the respective values. Example:

Examples

>>> # advanced usage: convert to pandas dataframe with a custom conversion function
>>> # here we add a 'feature length' column
>>> def my_fun(tx, item, fun_col, default_value):
>>>     return [len(tx) if col == 'feature_len' else tx.get(col, default_value) for col in fun_col] # noqa
>>> t = Transcriptome(...)
>>> TranscriptomeIterator(t).to_dataframe(fun=my_fun, included_annotations=['feature_len']).head()
Parameters:
  • transcriptome (Transcriptome) – The transcriptome object to iterate over

  • region (gi or str) – optional, if set, only features overlapping with this region will be iterated

  • feature_types (list) – optional, if set, only features of these types will be iterated

  • features (list) – optional, if set, only these features will be iterated

Notes

  • reported stats:
    iterated_items, chromosome: (int, str)

    Number of iterated items

    yielded_items, chromosome: (int, str)

    Number of yielded items (remaining are filtered, e.g., due to region constraints)

max_items()[source]

Maximum number of items yielded by this iterator or None if unknown. Note that this is the upper bound of yielded items but less (or even no) items may be yielded based on filter settings, etc. Useful, e.g., for progressbars or time estimates

to_dataframe(fun=<function TranscriptomeIterator.<lambda>>, fun_col=None, coord_inc=(0, 0), coord_colnames=('Chromosome', 'Start', 'End', 'Strand'), excluded_columns=('dna_seq', ), included_columns=None, dtypes=None, default_value=None, max_items=None, disable_progressbar=True)[source]

Consumes iterator and returns results in a dataframe. Start/stop Coordinates will be corrected by the passed coord_inc tuple.

fun_col is a tuple of column names for the created dataframe. fun is a function that converts the yielded items of this iterator into a tuple of values that represent fun_col column values in the created dataframe.

Example

>>> t = Transcriptome(...)
>>> TranscriptomeIterator(t).to_dataframe().head()
describe(fun=<function TranscriptomeIterator.<lambda>>, fun_col=('Value', ), coord_inc=(0, 0), coord_colnames=('Chromosome', 'Start', 'End', 'Strand'), excluded_columns=('dna_seq', ), included_columns=None, dtypes=None, default_value=None, max_items=None, disable_progressbar=True) Tuple[DataFrame, dict][source]

Converts this iterator to a pandas dataframe and calls describe(include=’all’)

class rnalib.FastaIterator(fasta_file, region=None, width=1, step=1, file_format=None, chunk_size: int = 1024, fill_value='N', padding=False, fun_alias=None)[source]

Bases: LocationIterator

Iterates over a FASTA file yielding sequence strings and keeps track of the covered genomic location.

The yielded sequence of length <width> will be padded with <fillvalue> characters if padding is True. This generator will yield every step_size window, for a tiling window approach set step_size = width. FASTA files will be automatically indexed if no existing index file is found and will be read in chunked mode.

Parameters:
  • fasta_file (str or pysam FastaFile) – A file path or an open FastaFile object. In the latter case, the object will not be closed

  • region (gi or str) – optional, if set, only features overlapping with this region will be iterated

  • width (int) – sequence window size

  • step (int) – increment for each yield

  • file_format (str) – optional, will be determined from filename if omitted

  • chunk_size (int) – optional, chunk size for reading fasta files. default: 1024

  • fill_value (str) – optional, fill value for padding. default: ‘N’

  • padding (bool) – optional, if True, the sequence will be padded with fill_value characters. default: False Padding may prefix/postfix the reported sequences at the region boundaries and accordingly, first and last reported location GIs may extend beyond the region boundaries. Especially, if the region is close to the chromosome start or end, the first and last reported locations may start/end outside the chromosome boundaries and start coordinates may become negative.

  • fun_alias (Callable) – optional, if set, the iterator will use this function for aliasing chromosome names

Yields:

sequence (str) – The extracted sequence in including sequence context. The core sequence w/o context can be accessed via seq[context_size:context_size+width].

Notes

  • reported stats:
    iterated_items, chromosome: (int, str)

    Number of iterated and yielded items

max_items()[source]

Maximum number of items yielded by this iterator or None if unknown.

iterate_data(chromosome)[source]

Reads pysam data in chunks and yields individual data items

class rnalib.TabixIterator(tabix_file, region=None, fun_alias=None, per_position=False, coord_inc=(0, 0), pos_indices=(0, 1, 2), refdict=None)[source]

Bases: LocationIterator

Iterates over a bgzipped + tabix-indexed file and returns location/tuple pairs. Genomic locations will be parsed from the columns with given pos_indices and interval coordinates will be converted to 1-based inclusive coordinates by adding values from the configured coord_inc tuple to start and end coordinates. Note that this class serves as super-class for various file format specific anno_its (e.g., BedIterator, VcfIterator, etc.) which use proper coord_inc/pos_index default values.

Parameters:
  • tabix_file (str or pysam.TabixFile) – A file path or an open TabixFile object. In the latter case, the object will not be closed

  • region (gi or str) – optional, if set, only features overlapping with this region will be iterated

  • fun_alias (Callable) – optional, if set, the iterator will use this function for aliasing chromosome names

  • per_position (bool) – optional, if True, the iterator will yield per-position items (e.g., FASTA files)

  • coord_inc (tuple) – optional, a tuple of values to be added to start and end coordinates

  • pos_indices (tuple) – optional, a tuple of column indices for the interval start, end and chromosome

  • refdict (RefDict) – optional, if set, the iterator will use the passed reference dict instead of reading it from the file

Notes

  • stats:
    iterated_items, chromosome: (int, str)

    Number of iterated/yielded items

  • TODO
    • add slop

    • improve docs

max_items()[source]

Maximum number of items yielded by this iterator or None if unknown.

class rnalib.BedGraphIterator(bedgraph_file, region=None, fun_alias=None, strand=None, refdict=None)[source]

Bases: TabixIterator

Iterates a bgzipped and indexed bedgraph file and yields float values If a strand is passed, all yielded intervals will have this strand assigned.

Parameters:
  • bedgraph_file (str or pysam.TabixFile) – A file path or an open TabixFile object. In the latter case, the object will not be closed

  • region (gi or str) – optional, if set, only features overlapping with this region will be iterated

  • fun_alias (Callable) – optional, if set, the iterator will use this function for aliasing chromosome names

  • strand (str) – optional, if set, all yielded intervals will have this strand assigned

  • refdict (RefDict) – optional, if set, the iterator will use the passed reference dict instead of reading it from the file

Notes

  • reported stats:
    yielded_items, chromosome: (int, str)

    Number of yielded items (remaining are filtered, e.g., due to region constraints)

class rnalib.BedRecord(tup, refdict=None)[source]

Bases: object

Parsed and mutable version of pysam.BedProxy See https://genome.ucsc.edu/FAQ/FAQformat.html#format1 for BED format details

name: str
score: int
location: gi
thick_start: int
thick_end: int
item_rgb: str
block_count: int
block_sizes: List[int]
block_starts: List[int]
class rnalib.BedIterator(bed_file, region=None, fun_alias=None)[source]

Bases: TabixIterator

Iterates a BED file and yields 1-based coordinates and pysam BedProxy objects @see https://pysam.readthedocs.io/en/latest/api.html#pysam.asBed

Parameters:
  • bed_file (str or pysam.TabixFile) – A file path or an open TabixFile object. In the latter case, the object will not be closed

  • region (gi or str) – optional, if set, only features overlapping with this region will be iterated

  • fun_alias (Callable) – optional, if set, the iterator will use this function for aliasing chromosome names

Notes

  • NOTE that empty intervals (i.e., start==stop coordinate) will not be iterated.

  • reported stats:
    yielded_items, chromosome: (int, str)

    Number of yielded items (remaining are filtered, e.g., due to region constraints)

class rnalib.BigBedRecord(tup, refdict=None)[source]

Bases: object

Parsed BigBed record See https://github.com/deeptools/pyBigWig/tree/master for format details

thick_start: int
name: str
score: float
location: gi
level: float
signif: float
score2: int
class rnalib.BigBedIterator(file, region=None, fun_alias=None)[source]

Bases: LocationIterator

Iterates over a BigBed file via pyBigWig.

Parameters:
  • file (str, PathLike or pyBigWig) – The pyBigWig file to iterate over

  • region (gi or str) – optional, if set, only features overlapping with this region will be iterated

  • fun_alias (Callable) – optional, if set, the iterator will use this function for aliasing chromosome names

max_items()[source]

Maximum number of items yielded by this iterator or None if unknown.

header()[source]

Returns the header of the underlying BigBed file.

class rnalib.BigWigIterator(file, region=None, fun_alias=None, per_position=False, strand=None)[source]

Bases: LocationIterator

Iterates over a pyBigWig object. If per_position is True, the iterator will yield per-position values, otherwise it will yield interval values.

Parameters:
  • file (str, PathLike or pyBigWig) – The pyBigWig file to iterate over

  • region (gi or str) – optional, if set, only features overlapping with this region will be iterated

  • fun_alias (Callable) – optional, if set, the iterator will use this function for aliasing chromosome names

  • strand (str) – optional, if set, all yielded intervals will have this strand assigned

  • per_position (bool) – optional, if True, the iterator will yield per-position values, otherwise it will yield interval values

Notes

Note that you can also access remote files, but this is slow and not recommended unless files are small. Example: >>> # iterate a remote bigwig file and report intervals. This file is ~100M >>> import itertools >>> uri = ‘https://hgdownload.soe.ucsc.edu/goldenPath/hg19/encodeDCC/wgEncodeMapability/wgEncodeCrgMapabilityAlign100mer.bigWig’ # noqa >>> for loc, val in itertools.islice(rna.it(uri),0,10): >>> display(f”{loc}: {val}”)

max_items()[source]

Maximum number of items yielded by this iterator or None if unknown.

header()[source]

Returns the header of the underlying BigWig file.

class rnalib.VcfRecord(pysam_var, samples, sample_indices, formats, refdict)[source]

Bases: object

Parsed version of pysam VCFProxy, no type conversions for performance reasons.

Variables:
  • location (gi) – genomic interval representing this record

  • pos (int) – 1-based genomic (start) position. For deletions, this is the first deleted genomic position

  • is_indel (bool) – True if this is an INDEL

  • ref/alt (str) – reference/alternate allele string

  • qual (float) – variant call quality

  • info (dict) – dict of info fields/values

  • dicts (genotype (per-sample)) – for each FORMAT field (including ‘GT’), a {sample_id: value} dict will be created

  • zyg (dict) – zygosity information per sample. Created by mapping genotypes to zygosity values using gt2zyg() (0=nocall, 1=heterozygous call, 2=homozygous call).

  • n_calls (int) – number of called alleles (among all considered samples)

  • returned. (Attributes can be accessed via dot notation, f no such attribute exists,, None will be)

id: str
ref: str
alt: str
filter: str
qual: float
info: dict
pos: int
location: gi
format: List[str]
class rnalib.VcfIterator(vcf_file, region=None, fun_alias=None, samples=None, filter_nopass: bool = True, filter_nocalls: bool = True)[source]

Bases: TabixIterator

Iterates a VCF file and yields 1-based coordinates and VcfRecord objects (wrapping pysam VcfProxy object) @see https://pysam.readthedocs.io/en/latest/api.html#pysam.asVCF

Besides coordinate-related filtering, users cam also pass a list of sample ids to be considered (samples). The iterator will by default report only records with at least 1 call in one of the configured samples. To force reporting of all records, set filter_nocalls=False.

Parameters:
  • vcf_file (str or pysam.VariantFile) – A file path or an open VariantFile object. In the latter case, the object will not be closed

  • region (gi or str) – optional, if set, only features overlapping with this region will be iterated

  • fun_alias (Callable) – optional, if set, the iterator will use this function for aliasing chromosome names

  • samples (list) – optional, if set, only records with calls in these samples will be reported

  • filter_nopass (bool) – if True (default), the iterator will report only records with the PASS filter flag

  • filter_nocalls (bool) – if True (default), the iterator will report only records with at least 1 call in one of the configured samples.

Variables:
  • header – pysam VariantFile header

  • allsamples (list) – list of all contained samples

  • shownsampleindices – indices of all configured samples

Notes

  • reported stats:
    iterated_items, chromosome: (int, str)

    Number of iterated items

    filtered_nopass_{filter}, chromosome: (int, str)

    Number of filtered non-PASS records (filter is the filter value)

    filtered_nocalls, chromosome: (int, str)

    Number of filtered no-call records

    yielded_items, chromosome: (int, str)

    Number of yielded items (remaining are filtered, e.g., due to region constraints)

class rnalib.GFF3Iterator(gtf_file, region=None, fun_alias=None, feature_types=None)[source]

Bases: TabixIterator

Iterates a GTF/GFF3 file and yields 1-based coordinates and dicts containing key/value pairs parsed from the respective info sections. The feature_type, source, score and phase fields from the GFF/GTF entries are copied to this dict (NOTE: attribute fields with the same name will be overloaded). @see https://pysam.readthedocs.io/en/latest/api.html#pysam.asGTF

This iterator is used to build a transcriptome object from a GFF3/GTF file.

Parameters:
  • gtf_file (str or pysam.TabixFile) – A file path or an open TabixFile object. In the latter case, the object will not be closed

  • region (gi or str) – optional, if set, only features overlapping with this region will be iterated

  • fun_alias (Callable) – optional, if set, the iterator will use this function for aliasing chromosome names

Notes

  • reported stats:
    yielded_items, chromosome: (int, str)

    Number of iterated/yielded items

class rnalib.PandasIterator(df, feature: str = None, region: GI = None, coord_columns: tuple = ('Chromosome', 'Start', 'End', 'Strand'), is_sorted=False, coord_off=(0, 0), fun_alias: Callable = None, calc_chromlen=False, refdict=None)[source]

Bases: LocationIterator

Iterates over a pandas dataframe that contains three columns with chromosome/start/end coordinates. Notable subclasses are PyrangesIterator and BioframeIterator. If the passed dataframe is not coordinate sorted, is_sorted must be set to False.

Parameters:
  • df (dataframe) – pandas datafram with at least 4 columns names as in coord_columns and feature parameter. This dataframe will be sorted by chromosome and start values unless is_sorted is set to True

  • feature (str) – Name of column to yield. If null, the whole row will be yielded

  • region (GI or str) – optional, if set, only features overlapping with this region will be iterated

  • coord_columns (list) – Names of coordinate columns, default: [‘Chromosome’, ‘Start’, ‘End’, ‘Strand’]

  • is_sorted (bool) – optional, if set, the iterator will assume that the dataframe is already sorted by chromosome and start

  • coord_off (list) – Coordinate offsets, default: (1, 0). These offsets will be added to the read start/end coordinates to convert to the rnalib convention.

  • fun_alias (Callable) – optional, if set, the iterator will use this function for aliasing chromosome names

  • calc_chromlen (bool) – optional, if set, the iterator will calculate the maximum end coordinate for each chromosome and use this information to create a reference dictionary

  • refdict (RefDict) – optional, if set, the iterator will use the passed reference dict instead of estimating from the data

Yields:
  • location (gi) – Location object describing the current coordinates

  • value (any) – The extracted feature value

Notes

  • Note, that iteration over dataframes is generally discouraged as there are much more efficient methods for data manipulation.

  • reported stats:
    iterated_items, chromosome: (int, str)

    Number of iterated items

    yielded_items, chromosome: (int, str)

    Number of yielded items (remaining are filtered, e.g., due to region constraints)

Examples

Here is another (efficient) way to use a pandas iterator:

>>> gencode_gff = ... # a gff file
>>> with BioframeIterator(gencode_gff, chromosome='chr2') as lit: # create a filtered DF from the passed gff
>>>     lit.df = lit.df.query("strand=='-' ") # further filter the dataframe with pandas
>>>     print(Counter(lit.df['feature'])) # count minus strand features
>>>     # you can also use this code to then iterate the data which may be convenient/readable if the
>>>     # dataframe is small:
>>>     for tx, row in lit: # now iterate with rnalib iterator
>>>         # do something with location and pandas data row
to_dataframe(**kwargs)[source]

Consumes iterator (up to max_items items) and returns results in a dataframe. Start/stop Coordinates will be corrected by the passed coord_inc tuple.

Parameters:
  • fun (function) – a function that converts the yielded items of this iterator into a tuple of values that represent fun_col column values in the created dataframe

  • fun_col (tuple) – a tuple of column names for the created dataframe

  • coord_inc (tuple) – a tuple of values to be added to start and end coordinates

  • coord_colnames (tuple) – a tuple of column names for the created dataframe

  • excluded_columns (tuple) – optional, a tuple of column names to be excluded from the created dataframe

  • included_columns (tuple) – optional, a tuple of column names to be included in the created dataframe

  • dtypes (dict) – optional, a dict of column names and their respective dtypes

  • default_value (any) – optional, a default value to be used if a column value is not present in the iterated item

  • max_items (int) – maximum number of included items (None: all)

  • disable_progressbar (bool) – optional, if True, no progressbar will be shown

max_items()[source]

Maximum number of items yielded by this iterator or None if unknown. Note that this is the upper bound of yielded items but less (or even no) items may be yielded based on filter settings, etc. Useful, e.g., for progressbars or time estimates

class rnalib.BioframeIterator(df, feature: str = None, region: GI = None, is_sorted=False, fun_alias=None, schema=None, coord_columns: tuple = ('chrom', 'start', 'end', 'strand'), coord_off=(1, 0), calc_chromlen=False, refdict=None)[source]

Bases: PandasIterator

Iterates over a [bioframe](https://bioframe.readthedocs.io/) dataframe. The genomic coordinates of yielded locations are corrected automatically.

Parameters:
  • df (dataframe) – bioframe dataframe with at least 4 columns names as in coord_columns and feature parameter. This dataframe will be sorted by chromosome and start values unless is_sorted is set to True

  • feature (str) – Name of column to yield. If null, the whole row will be yielded

  • region (gi or str) – optional, if set, only features overlapping with this region will be iterated

  • coord_columns (list) – Names of coordinate columns, default: [‘chrom’, ‘start’, ‘end’, ‘strand’]

  • is_sorted (bool) – optional, if set, the iterator will assume that the dataframe is already sorted by chromosome and start

  • coord_off (list) – Coordinate offsets, default: (1, 0). These offsets will be added to the read start/end coordinates to convert to the rnalib convention.

  • calc_chromlen (bool) – optional, if set, the iterator will calculate chromosome lengths from the data (if required)

  • schema (str) – optional, if set, it is passed to by bioframe.read_table to parse the dataframe

  • fun_alias (Callable) – optional, if set, the iterator will use this function for aliasing chromosome names

  • refdict (RefDict) – optional, if set, the iterator will use this reference dictionary for chromosome aliasing

class rnalib.PyrangesIterator(probj, feature=None, region=None, is_sorted=False, fun_alias=None, coord_columns=('Chromosome', 'Start', 'End', 'Strand'), coord_off=(1, 0), calc_chromlen=False, refdict=None)[source]

Bases: PandasIterator

Iterates over a [pyranges](https://pyranges.readthedocs.io/) object. The genomic coordinates of yielded locations are corrected automatically.

Parameters:
  • probj (pyranges.PyRanges) – A pyranges object

  • feature (str) – Name of column to yield. If null, the whole row will be yielded

  • region (gi or str) – optional, if set, only features overlapping with this region will be iterated

  • coord_columns (list) – Names of coordinate columns, default: [‘Chromosome’, ‘Start’, ‘End’, ‘Strand’]

  • is_sorted (bool) – optional, if set, the iterator will assume that the dataframe is already sorted by chromosome and start

  • fun_alias (Callable) – optional, if set, the iterator will use this function for aliasing chromosome names

  • coord_off (list) – Coordinate offsets, default: (1, 0). These offsets will be added to the read start/end coordinates to convert to the rnalib convention.

  • calc_chromlen (bool) – optional, if set, the iterator will calculate chromosome lengths from the data (if required)

  • refdict (RefDict) – optional, if set, the iterator will use this reference dictionary for chromosome aliasing

class rnalib.PybedtoolsIterator(bedtool, region=None, fun_alias=None, calc_chromlen=False, refdict=None)[source]

Bases: LocationIterator

Iterates over a pybedtools BedTool

Parameters:
  • bedtool (str or pybedtools.BedTool) – A file path or an open BedTool object. In the latter case, the object will not be closed

  • region (gi or str) – optional, if set, only features overlapping with this region will be iterated

  • fun_alias (Callable) – optional, if set, the iterator will use this function for aliasing chromosome names

  • calc_chromlen (bool) – optional, if set, the iterator will calculate chromosome lengths from the file (if required)

  • refdict (RefDict) – optional, if set, the iterator will use this reference dictionary for chromosome aliasing

Notes

  • reported stats:
    yielded_items, chromosome: (int, str)

    Number of yielded items

max_items()[source]

Maximum number of items yielded by this iterator or None if unknown.

close()[source]

Closes the underlying file object if it was opened by this iterator

class rnalib.ReadIterator(bam_file, region=None, file_format=None, min_mapping_quality=0, flag_filter=3844, tag_filters=None, max_span=None, report_mismatches=False, min_base_quality=0, fun_alias=None, include_unmapped=False)[source]

Bases: LocationIterator

Iterates over a BAM alignment.

Parameters:
  • bam_file (str) – BAM file name

  • region (gi or str) – optional, if set, only features overlapping with this region will be iterated

  • file_format (str) – optional, if set, the iterator will assume this file format (e.g., ‘bam’ or ‘sam’)

  • min_mapping_quality (int) – Minimum mapping quality. If set, reads with lower mapping quality will be filtered.

  • flag_filter (int) – Bitwise flag filter. If set, reads with any of the specified flags will be filtered.

  • tag_filters (list) – List of TagFilter objects. If set, reads with any of the specified tags will be filtered.

  • max_span (int) – Maximum alignment span. If set, reads with a larger alignment span will be filtered.

  • report_mismatches (bool) – If set, this iterator will additionally yield (read, mismatches) tuples where ‘mismatches’ is a list of (read_position, genomic_position, ref_allele, alt_allele) tuples that describe differences wrt. the aligned reference sequence. This options requires MD BAM tags to be present.

  • min_base_quality (int) – Useful only in combination with report_mismatches; filters mismatches based on minimum per-base quality

  • fun_alias (Callable) – optional, if set, the iterator will use this function for aliasing chromosome names

  • include_unmapped (bool) – If set, this iterator will also yield unmapped reads after all mapped ones.

Notes

  • Reported stats
    iterated_items, chromosome: (int, str)

    Number of iterated items

    yielded_items, chromosome: (int, str)

    Number of yielded items

    n_fil_flag, chromosome:

    Number of reads filtered due to a FLAG filter

    n_fil_mq, chromosome:

    Number of reads filtered due to low mapping quality

    n_fil_tag chromosome:

    Number of reads filtered due to a TAG filter

    n_fil_max_span, chromosome:

    Number of reads filtered due to exceeding a configured maximum alignment span

  • If you need to non-default settings for opening the BAM file, you should open the pysam.AlignmentFile object yourself and pass the file object to this iterator.

max_items()[source]

Returns number of reads retrieved from the SAM/BAM index

class rnalib.ReadPair(r1: AlignedSegment, r2: AlignedSegment, mm1: List[Tuple[int, int, str, str]], mm2: List[Tuple[int, int, str, str]])[source]

Bases: NamedTuple

namedtuple for paired reads

r1: AlignedSegment

Alias for field number 0

r2: AlignedSegment

Alias for field number 1

mm1: List[Tuple[int, int, str, str]]

Alias for field number 2

mm2: List[Tuple[int, int, str, str]]

Alias for field number 3

class rnalib.PairedReadIterator(bam_file, region, file_format=None, min_mapping_quality=0, flag_filter=3844, tag_filters=None, max_span=None, report_mismatches=False, min_base_quality=0, fun_alias=None, filter_pcr_duplicates=False)[source]

Bases: ReadIterator

Iterates over a BAM alignment and yields paired reads as readpair objects. This iterator is used to extract paired reads from a BAM file. It is assumed that the input BAM file is coordinate sorted and that the reads are paired. This iterator stores incomplete pairs in a dictionary and yields them when the second read is found. This approach is more memory efficient than the naive approach of loading all reads into memory and then pairing them. It may, however, still result in high memory consumption if the input BAM file contains many orphaned reads or pairs with large insert sizes. It is thus recommended to use this iterator with a region parameter to limit the number of reads that are loaded into memory.

Parameters:
  • bam_file (str) – BAM file name

  • region (gi or str) – optional, if set, only features overlapping with this region will be iterated

  • file_format (str) – optional, if set, the iterator will assume this file format (e.g., ‘bam’ or ‘sam’)

  • min_mapping_quality (int) – Minimum mapping quality. If set, reads with lower mapping quality will be filtered.

  • flag_filter (int) – Bitwise flag filter. If set, reads with any of the specified flags will be filtered.

  • tag_filters (list) – List of TagFilter objects. If set, reads with any of the specified tags will be filtered.

  • max_span (int) – Maximum alignment span. If set, reads with a larger alignment span will be filtered.

  • report_mismatches (bool) – If set, this iterator will additionally yield (read, mismatches) tuples where ‘mismatches’ is a list of (read_position, genomic_position, ref_allele, alt_allele) tuples that describe differences wrt. the aligned reference sequence. This options requires MD BAM tags to be present.

  • min_base_quality (int) – Useful only in combination with report_mismatches; filters mismatches based on minimum per-base quality

  • fun_alias (Callable) – optional, if set, the iterator will use this function for aliasing chromosome names

  • filter_pcr_duplicates (bool) – If set, this iterator will filter PCR duplicates based on the read sequence and same start/end coordinates.

Returns:

The (merged) location of the read pair and a readpair namedtuple that contains the two reads as pysam.AlignedSegment objects. If report_mismatches is set, mm1 and mm2 contain the mismatch lists of the two mates. Note that r1 always contains read1 and r2 always read2. Both can be None for orphaned reads (e.g., if the mate was not mapped or maps outside the region).

Return type:

Item[gi, readpair]

Notes

  • Reported stats (additional to ReadIterator stats)
    • (“found_pairs”, “complete”): number of complete read pairs found

    • (“found_pairs”, “incomplete”): number of incomplete (i.e., read1 or read2 missing) read pairs found

    • (“filtered_duplicates”, chromosome): number of filtered PCR duplicates per chromosome

class rnalib.FastPileupIterator(bam_file, chromosome: str = None, reported_positions: set = None, region: GI = None, file_format: str = None, min_mapping_quality: int = 0, flag_filter: int = 3844, tag_filters: list[TagFilter] = None, max_span: int = None, min_base_quality: int = 0, fun_alias=None)[source]

Bases: LocationIterator

Fast pileup iterator that yields a complete pileup (w/o insertions) over a set of genomic positions. This is more lightweight and considerably faster than pysam’s pileup() but lacks some features (such as ignore_overlaps or ignore_orphans). By default, it basically reports what is seen in the default IGV view (using, e.g., the same read flag filter).

This iterator uses a ReadIterator to iterate the BAM file and then builds a pileup from the reads that overlap the reported positions. The pileup is a dict with genomic positions as keys and a Counter object as values. The Counter object contains the counts of each base at the respective position.

Parameters:
  • bam_file (str) – BAM file name

  • chromosome (str) – Chromosome name. If set, only this chromosome will be iterated and reported_positions will be derived from the respective parameter. Note, that either chromosome and reported_positions or region must be set.

  • reported_positions (range or set) – Range or set of genomic positions for which counts will be reported. The chromosome is derived from the respective parameter. Note, that either chromosome and reported_positions or region must be set.

  • region (gi) – Genomic region. If set, chromosome and reported_positions will be derived from this region. Note, that either chromosome and reported_positions or region must be set.

  • file_format (str) – optional, if set, the iterator will assume this file format (e.g., ‘bam’ or ‘sam’)

  • min_mapping_quality (int) – Filters pileup reads based on minimum mapping quality

  • flag_filter (int) – Filters pileup reads based on read flags (see utils.BamFlag for details)

  • tag_filters (list) – Filters pileup reads based on BAM tags (see utils.TagFilter for details)

  • max_span (int) – Restricts maximum pileup depth.

  • min_base_quality (int) – Filters pileup based on minimum per-base quality

  • fun_alias (function) – Optional function for aliasing chromosome names

Returns:

A <base>

Return type:

<count> Counter object. Base is 'None' for deletions.

Notes

  • initial performance tests that used a synchronized iterator over a FASTA and 2 BAMs showed ~50kpos/sec

  • reported stats:
    iterated_items, chromosome: (int, str)

    Number of iterated items (reads)

    yielded_items, chromosome: (int, str)

    Number of yielded items (positions)

Examples

>>> pileup_data = rna.it("test.bam", style="pileup", region=gi('chr1:1-100')).to_list()
>>> rna.it("test.bam", style='pileup',
>>>        region='1:22418244-22418244', # pileup single position
>>>        min_mapping_quality=30, flag_filter=0, min_base_quality=40, # read filtering
>>>        tag_filters=[rna.TagFilter('NM', [0], True, inverse=True)], # include only reads w/o mismatches
>>>        ).to_list()
max_items()[source]

Maximum number of items yielded by this iterator or None if unknown.

class rnalib.GroupedLocationIterator(lit: LocationIterator, strategy='both')[source]

Bases: LocationIterator

Wraps another location iterator and yields groups of items sharing (parts of) the same location given a matching strategy (e.g., same start, same end, same coords, overlapping). The iterator yields tuples of (merged) group location and a (locations, items) tuple containing lists of locations/items yielded from the wrapped iterator.

This iterator is useful for grouping features that share ovelapping genomic locations (e.g., overlapping features, features with shared start coordinates, etc.). It is typically instantiated from an existing LocationIterator via its group() method.

Parameters:
  • lit (LocationIterator) – The wrapped location iterator

  • strategy (str) – The grouping strategy to use: left (start coordinate match), right (end coordinate match), both (complete location match; default), overlap (coordinate overlap).

max_items()[source]

Maximum number of items yielded by this iterator or None if unknown.

close()[source]

Closes the underlying file object if it was opened by this iterator

class rnalib.MergedLocationIterator(iterables, labels=None, refdict=None)[source]

Bases: LocationIterator

Merges multiple location iterators into a single iterator that yields sorted locations and corresponding result items (named tuples) that contain the data from the yielding iterator and an associated label. This class uses a heap queue to merge-sort the iterators in a memory-efficient way.

It is typically instantiated from an existing LocationIterator via its merge() method.

Parameters:
  • iterables (list) – A list of LocationIterators to merge

  • labels (list) – A list of labels for the merged iterators. If None, default labels (f”it{idx}”) will be used.

  • refdict (RefDict) – A reference dictionary to use. If not set, the merged refdict of the passed iterators will be used.

max_items()[source]

Maximum number of items yielded by this iterator or None if unknown. Note that this is the upper bound of yielded items but less (or even no) items may be yielded based on filter settings, etc. Useful, e.g., for progressbars or time estimates

class rnalib.AnnotationIterator(lit, anno_its, labels=None, refdict=None, disable_progressbar=False)[source]

Bases: LocationIterator

Annotates locations in the first iterator with data from the ano_its location anno_its. The returned data is a namedtuple with the following fields: Item(location=gi, data=Result(anno=dat_from_it, label1=[Item(tx, dat_from_anno_it1)], …, labeln=[Item(tx, dat_from_anno_itn)])

This enables access to the following data:

  • item.location: gi of the currently annotated location

  • item.data.anno: data of the currently annotated location

  • item.data.<label_n>: list of items from <iterator_n> that overlap the currently annotated position.

if no labels are provided, the following will be used: it0, it1, …, itn.

Parameters:
  • lit (LocationIterator) – The main location iterator. The created AnnotationIterator will yield each item from this iterator alongside all overlapping items from the configured anno_its anno_its

  • anno_its (List[LocationIterator]) – A list of LocationIterators for annotating the main iterator

  • labels (List[str]) – A list of labels for storing data from the annotating anno_its

  • refdict (RefDict) – A reference dict for the main iterator. If None, the refdict of the main iterator will be used

  • disable_progressbar (bool) – If True, disables the tqdm progressbar

Yields:
  • Item(location=gi, data=Result(anno=dat_from_it, label1=[Item(tx, dat_from_anno_it1)], ..., labeln=[Item(tx,

  • dat_from_anno_itn)])

__init__(lit, anno_its, labels=None, refdict=None, disable_progressbar=False)[source]
Parameters:
  • lit (the main location iterator. The created AnnotationIterator will yield each item from this iterator) – alongside all overlapping items from the configured anno_its anno_its

  • anno_its (a list of location anno_its for annotating the main iterator)

  • labels (a list of labels for storing data from the annotating anno_its)

  • refdict (a reference dict for the main iterator. If None, the refdict of the main iterator will be used)

  • disable_progressbar (if True, disables the tqdm progressbar)

max_items()[source]

Maximum number of items yielded by this iterator or None if unknown.

stats()[source]

Return stats of wrapped main iterator

update(ref, anno_its)[source]

Updates the current buffer of overlapping intervals wrt. the passed reference location

close()[source]

Closes all wrapped anno_its

class rnalib.TiledIterator(location_iterator, regions_iterable: Iterable[GI] = None, tile_size=100000000.0)[source]

Bases: LocationIterator

Wraps a location iterator and reports tuples of items yielded for each genomic tile. Tiles are either iterated from the passed regions_iterable or calculated from the reference dict via the RefDict.tile(tile_size=…) method.

The iterator yields the tile location and a tuple of overlapping items from the location iterator. During iteration, the locations and data of the last yielded item can be access via it.tile_locations and it.tile_items.

Note that this can be slow if tile size is too small.

Parameters:
  • location_iterator (LocationIterator) – The location iterator to be wrapped

  • regions_iterable (Iterable[GI]) – An iterable of genomic intervals that define the tiles. Note, that if the location_iterator has a region set then only tiles fully enveloped by this region will be iterated. If not set, tiles will be calculated from the location iterators region (or its reference dict if none set) based on the passed tile_size.

  • tile_size (int) – The size of the tiles in base pairs. Used only if regions_iterable is not set.

max_items()[source]

Maximum number of items yielded by this iterator or None if unknown.

class rnalib.FastqIterator(fastq_file)[source]

Bases: object

Iterates a fastq file and yields FastqRead objects (named tuples containing sequence name, sequence and quality string (omitting line 3 in the FASTQ file)).

Parameters:

fastq_file (str) – The FASTQ file name

Notes

Note that this is not a location iterator.

reported stats:

  • yielded_items: int

    Number of yielded items

property stats
to_list()[source]

Exhausts iterator and returns results in a list. For debugging/testing only

rnalib.read_aligns_to_loc(loc: gi, read: AlignedSegment)[source]

Tests whether a read aligns to the passed location by checking the respective alignment block coordinates and the read strand. Note that the chromosome is not checked.

rnalib.it(obj, **kwargs)[source]

Factory function for creating LocationIterators for the passed object. In most cases, the type of the created iterator will be determined by the type of the passed object. If there are multiple possibilities, the style parameter can be used to select the desired iterator type. For example, if a pandas dataframe is passed then the style=... parameter determines whether a PandasIterator(), a BioframeIterator() or a PyrangesIterator() is returned.

The following object types are supported:

Parameters:
  • obj (any) – The object for which an iterator should be created

  • style (str) – The (optional) style of the iterator for disambiguation. This parameter will be removed from kwargs before the iterator is created.

  • kwargs (dict) – Additional parameters that will be passed to the iterator constructor

Return type:

LocationIterator

Examples

>>> it('test.bed') # creates a BedIterator
>>> it('test.bed', style='bedgraph') # creates a BedGraphIterator
>>> it('test.bam', style='pileup', region='chr1:1000-2000', min_mapping_quality=30) # creates FastPileupIterator
>>> it(pd.DataFrame(...), style='bioframe') # creates a BioframeIterator
>>> it(rna.Transcriptome(...), feature_types='exon') # creates a TranscriptomeIterator
>>> vars(it("test.bam")) # inspect the created iterator

Submodules