rnalib.utils module

This module implements various general (low-level) utility methods

class rnalib.utils.GeneSymbol(symbol, name, taxid)

Bases: tuple

A named tuple representing a gene symbol with symbol, name, and taxid fields.

name: Alias for field number 1

symbol: Alias for field number 0

taxid: Alias for field number 2

rnalib.utils.ensure_outdir(outdir=None) → PathLike[source]: Ensures that the configured output dir exists (will use current dir if none provided)

rnalib.utils.check_list(lst, mode='inc1') → bool[source]: Tests whether the (numeric, comparable) items in a list are * mode==’inc’: increasing (strictly monotonic) * mode==’dec’: decreasing (strictly monotonic) * mode==’inceq’: increasing (monotonic) * mode==’deceq’: decreasing (monotonic) * mode==’inc1’: increasing by one * mode==’dec1’: decreasing by one * mode==’eq’: all equal

rnalib.utils.split_list(lst, n, is_chunksize=False) → list[source]

Splits a list into sublists.

Parameters:

lst (list) – input list to be split. If no list is passed, it is tried to convert via the list(…) method.
n (int) – number of sublists
is_chunksize (bool (default: False)) – if is_chunksize is True: splits into chunks of length n if is_chunksize is False: splits it into n (approx) equal parts

Examples

>>> list(split_list(range(0,10), 3))
>>> list(split_list(range(0,10), 3, is_chunksize=True))

Return type:: generator of lists

Notes

Will probably be replaced by more_iterools chunked and chunked_even

rnalib.utils.intersect_lists(*lists, check_order=False) → list[source]

Intersects lists (iterables) while preserving order. Order is determined by the last provided list. If check_order is True, the order of all input lists wrt. shared elements is asserted.

Examples

>>> intersect_lists([1,2,3,4],[3,2,1]) # [3,2,1]
>>> list_of_lists = ...
>>> intersect_lists(*list_of_lists)
>>> intersect_lists([1,2,3,4],[1,4],[3,1], check_order=True)
>>> intersect_lists((1,2,3,5),(1,3,4,5)) # [1,3,5]

rnalib.utils.cmp_sets(a, b) -> (<class 'bool'>, <class 'bool'>, <class 'bool'>)[source]: Set comparison. Returns shared, unique(a) and unique(b) items

rnalib.utils.get_unique_keys(dict_of_dicts)[source]: Returns all unique key names from a dict of dicts. Example: get_unique_keys({‘a’:{‘1’:12,’2’:13}, ‘b’: {‘1’:14,’3’:43}})

rnalib.utils.calc_set_overlap(a, b) → float[source]: Calculates the overlap between two sets

rnalib.utils.powerset(it)[source]

Returns all possible subsets of the passed iterable. :Parameters: it (iterable)

Return type:: The powerset of the passed iterable

Examples

>>> list(powerset([1,2,3])) # [(), (1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]

rnalib.utils.grouper(iterable, n, fill_value=None)[source]: Groups n lines into a list

rnalib.utils.to_set(x, sep=',') → set[source]: Converts the passed object to a set if not already. If None is passed, an empty set is returned. if a string is passed, it will be split by the specified separator. if a set, list or tuple is passed, it will be converted to a set. If a number or iterable is passed, it will be converted to a set with one element. :rtype: set

rnalib.utils.to_list(x, sep=',') → list[source]

Converts the passed object to a list if not already. If None is passed, an empty list is returned. if a string is passed, it will be split by the specified separator. if a set, list or tuple is passed, it will be converted to a list. If a number or iterable is passed, it will be converted to a list with one element.

Return type:: list

rnalib.utils.to_str(*args, sep=',', na='NA') → str[source]

Converts an object to a string representation. Iterables will be joined by the configured separator. Objects with zero length or None will be converted to the configured ‘NA’ representation.

Examples

>>> to_str([12,[None,[1,'',[]]]]) # 12,NA,1,NA,NA

Notes

This implementation is different from dm-tree.flatten() or treevalue.flatten() but slower?

rnalib.utils.write_data(dat, out=None, sep='\t', na='NA')[source]: Write data to TSV file. Values will be written via thee to_str() method (i.e., None will become ‘NA’). Collections will be written as comma-separated strings, nested structures will be flattened. If out is None, the respective string will be written, Otherwise it will be written to the respective output handle.

rnalib.utils.format_fasta(string, ncol=80) → str[source]: Format a string for FASTA files

rnalib.utils.convert_size(size_bytes)[source]: Convert bytes to human-readable format, see https://stackoverflow.com/questions/5194057/better-way-to-convert-file-sizes-in-python

rnalib.utils.dir_tree(root: Path, prefix: str = '', space=' ', branch='│ ', tee='├── ', last='└── ', max_lines=10, glob=None, show_size=True)[source]

A recursive generator yielding a visual tree structure line by line.

Parameters:

root – Path instance pointing to a directory
prefix – optional prefix string to prepend to each output line
space – optional prefix string to prepend to lines indicating a regular file
branch – optional prefix string to prepend to lines indicating the last item in a directory
tee – optional prefix string to prepend to lines indicating an item in a directory
last – optional prefix string to prepend to lines indicating the last item in the entire tree
max_lines – maximum yielded lines per directory.
glob – optional glob param to filter. Use, e.g., ‘**/*.py’ to list all .py files. default: None (all files)
show_size – optional; if True, the size of the file will be printed in brackets after the filename
@see https (//stackoverflow.com/questions/9727673/list-directory-tree-structure-in-python)

rnalib.utils.print_dir_tree(root: Path, max_lines=10, glob=None)[source]

rnalib.utils.count_lines(file) → int[source]: Counts lines in (gzipped) files. Slow.

rnalib.utils.gunzip(in_file, out_file=None) → str[source]: Gunzips a file and returns the filename of the resulting file

rnalib.utils.slugify(value, allow_unicode=False) → str[source]: Slugify a string (e.g., to get valid filenames w/o extensions). @see https://github.com/django/django/blob/master/django/utils/text.py Convert to ASCII if ‘allow_unicode’ is False. Convert spaces or repeated dashes to single dashes. Remove characters that aren’t alphanumerics, underscores, or hyphens. Convert to lowercase. Also strip leading and trailing whitespace, dashes, and underscores.

rnalib.utils.remove_extension(p, remove_gzip=True) → str[source]: Returns a string representation of a resolved PosixPath of the passed path with removed file extension. Will also remove ‘.gz’ extensions if remove_gzip is True. example remove_extension(‘b/c.txt.gz’) -> ‘<pwd>/b/c’

class rnalib.utils.UrlretrieveTqdm(filename)[source]: Bases: object

rnalib.utils.download_file(url, filename, show_progress=True)[source]

Downloads a file from the passed (https) url into a file with the given path

Parameters:

url (str) – URL to download from
filename (str) – Path to the file to be created
show_progress (bool) – If True, a progress bar will be shown

Examples

>>> import tempfile
>>> with tempfile.TemporaryDirectory() as tempdirname:
>>>     fn=download_file(..., f"{tempdirname}/{filename}")
>>>     # do something with this file
>>>     # Note that the temporary created dir will be removed once the context manager is closed.

rnalib.utils.reverse_complement(seq, tmap='dna') → str[source]: Calculate reverse complement DNA/RNA sequence. Returned sequence is uppercase, N’s are kept. seq_type can be ‘dna’ (default) or ‘rna’

rnalib.utils.complement(seq, tmap='dna') → str[source]: Calculate complement DNA/RNA sequence. Returned sequence is uppercase, N’s are kept. seq_type can be ‘dna’ (default) or ‘rna’

rnalib.utils.pad_n(seq, minlen, padding_char='N') → str[source]: Adds up-/downstream padding with the configured padding character to ensure a given minimum length of the passed sequence.

rnalib.utils.rnd_seq(n, alpha='ACTG', m=1)[source]: Creates m random sequence of length n using the provided alphabet (default: DNA bases). To use different character frequencies, pass each character as often as expected in the frequency distribution. Example: rnd_seq(100, ‘GC’* 60 + ‘AT’ * 40, 5) # 5 sequences of length 100 with 60% GC

rnalib.utils.count_gc(s) -> (<class 'int'>, <class 'float'>)[source]

Counts number of G+C bases in string.: Returns the number of GC bases and the length-normalized fraction.

rnalib.utils.count_rest(s, rest=('GGTACC', 'GAATTC', 'CTCGAG', 'CATATG', 'ACTAGT')) → int[source]: Counts number of restriction sites, see https://portals.broadinstitute.org/gpp/public/resources/rules

rnalib.utils.longest_hp_gc_len(seq) -> (<class 'int'>, <class 'int'>)[source]: Counts HP length (any allele) from start. This method counts any character including N’s

rnalib.utils.longest_GC_len(seq) → int[source]: Counts HP length (any allele) from start

rnalib.utils.find_all(a, sub) → int[source]: Finds all indices of the passed substring

rnalib.utils.kmer_search(seq, kmer_set, include_revcomp=False) → dict[source]: Exact kmer search. Can include revcomp if configured

rnalib.utils.find_gpos(genome_fa, kmers, included_chrom=None) → defaultdict[list][source]: Returns a dict that maps the passed kmers to their (exact match) genomic positions (1-based). Positions are returned as (chr, pos1) tuples. included_chromosomes (list): if configured, then only the respective chromosomes will be considered.

rnalib.utils.parse_gff_attributes(info, fmt='gff3')[source]: parses GFF3/GTF info sections

rnalib.utils.compact_bedgraph_file(bedgraph_file, out_file=None)[source]: Compacts a bedgraph file by merging adjacent entries with the same value.

rnalib.utils.bgzip_and_tabix(in_file, out_file=None, sort=False, create_index=True, del_uncompressed=True, preset='auto', seq_col=0, start_col=1, end_col=1, meta_char=35, line_skip=0, zerobased=False)[source]

BGZIP the input file and create a tabix index with the given parameters if create_index is True. File is sorted with pybedtools if sort is True.

Parameters:

in_file (str) – The input file to be compressed.
out_file (str, optional) – The output file name. Default is in_file + ‘.gz’.
sort (bool, optional) – Whether to sort the input file. Default is False.
create_index (bool, optional) – Whether to create a tabix index. Default is True.
del_uncompressed (bool, optional) – Whether to delete the uncompressed input file. Default is True.
preset (str, optional) – The preset format for the tabix index. Default is ‘auto’. presets: * ‘gff’ : (TBX_GENERIC, 1, 4, 5, ord(‘#’), 0), * ‘bed’ : (TBX_UCSC, 1, 2, 3, ord(‘#’), 0), * ‘psltbl’ : (TBX_UCSC, 15, 17, 18, ord(‘#’), 0), * ‘sam’ : (TBX_SAM, 3, 4, 0, ord(‘@’), 0), * ‘vcf’ : (TBX_VCF, 1, 2, 0, ord(‘#’), 0), * ‘auto’: guess from file extension
seq_col (int, optional) – The column number for the sequence name. Default is 0.
start_col (int, optional) – The column number for the start position. Default is 1.
end_col (int, optional) – The column number for the end position. Default is 1.
meta_char (int, optional) – The character that indicates a comment line. Default is ord(‘#’).
line_skip (int, optional) – The number of lines to skip at the beginning of the file. Default is 0.
zerobased (bool, optional) – Whether the start position is zero-based. Default is False.

Return type:

The filename of the compressed file.

Raises:

AssertionError – If out_file does not end with ‘.gz’.

Notes

This function requires the pysam package to be installed. TODO add csi support

Examples

>>> bgzip_and_tabix('input.vcf', create_index=True)

rnalib.utils.get_rel_coord(feature, subfeature, query, strand_specific=False)[source]

Returns a GI with coordinates that are relative to the passed feature by taking the specified subfeature blocks into account. This method can, e.g., be used to map from genomic to transcriptomic coordinates or to calculate codon numbers, see examples.

Parameters:

feature (rna.GI) – The feature to map to
subfeature (str) – The subfeature type to consider (e.g., ‘CDS’, ‘exon’, ‘UTR’)
query (rna.GI) – The query interval (genomic coordinates) to map
strand_specific (bool, optional) – Whether to test for matching strand between feature and query. Default is False.

Returns:

rna.GI or None – The relative coordinates of the query interval in the feature or None if no overlap was found. The chromosome of the returned GI will be set to the feature_id (e.g., the transcript id) of the feature.
Example
>>> get_rel_coord(t[‘ENST00000621592.8’], “CDS”, rna.gi(“chr8 (127740698-127750856 (+)"))))

rnalib.utils.print_fast5_tree(fast5_file, max_lines=10, n_reads=1, show_attrs=True)[source]: Prints the structure of a fast5 file.

rnalib.utils.print_small_file(filename, show_linenumber=False, max_lines=10)[source]: Prints the first 10 lines of a file

rnalib.utils.get_bcgs(fast5_file)[source]: Returns a list of basecall groups from the 1st read

rnalib.utils.display_animated_gif(gif_url)[source]

rnalib.utils.display_textarea(txt, rows=4, cols=120)[source]: Display a (long) text in a scrollable HTML text area

rnalib.utils.display_list(lst)[source]: Display a list as an HTML list

rnalib.utils.display_popover(msg, bg_color='red', clear=True)[source]

rnalib.utils.display_help(obj, icon='help_icon.png', bg_color='#dcf6fa')[source]: Display the docstring of the passed object

rnalib.utils.head_counter(cnt, non_empty=True)[source]: Displays n items from the passed counter. If non_empty is true then only items with len>0 are shown

rnalib.utils.plot_times(title, times, n=None, reference_method=None, show_speed=True, show_fastest=False, ax=None, orientation='h', highlight_bar=None)[source]: Helper method to plot a dict with timings (seconds). If n is passed and show_speed is true, the method will display iterations per second. If reference_method is set then it will also display the speed increase of the fastest method compared to the reference method/

rnalib.utils.guess_file_format(file_name, file_extensions=None)[source]: Guesses the file format from the file extension :param file_name: :param file_extensions: :return:

class rnalib.utils.ParseMap(*args, **kwargs)[source]

Bases: dict

Extends default dict to return ‘missing char’ for missing keys (similar to defaultdict, but does not enter new values). Should be used with translate()

class rnalib.utils.AutoDict[source]

Bases: dict

Implementation of perl’s autovivification feature. https://stackoverflow.com/questions/651794/whats-the-best-way-to-initialize-a-dict-of-dicts-in-python/651879#651879

rnalib.utils.get_config(config, keys, default_value=None, required=False)[source]

Gets a configuration value from a config dict (e.g., loaded from JSON). Keys can be a list of keys (that will be traversed) or a single value. If the key is missing and required is True, an error will be thrown. Otherwise, the configured default value will be returned.

Examples

>>> cmd_bcftools = get_config(config, ['params','cmd','cmd_bcftools'], required=True)
>>> threads = get_config(config, 'config/process/threads', default_value=1, required=False)

Notes

Consider replacing this with omegaconf

rnalib.utils.open_file_obj(fh, file_format=None, file_extensions=None) → object[source]

Opens a file object. If a pysam compatible file format was detected, the respective pysam object is instantiated.

Parameters:

fh (str or file path object)
file_format (str) – Can be any <supported_formats> or None for auto-detection from filename (valid file extensions can be configured).
file_extensions (dict) – a dict of file extension mappings

Returns:

file_handle

Return type:

instance (file/pysam/pyBigWig object)

rnalib.utils.toggle_chr(s)[source]: Simple function that toggle ‘chr’ prefixes. If the passed reference name start with ‘chr’, then it is removed, otherwise it is added. If None is passed, None is returned. Can be used as fun_alias function.

rnalib.utils.yield_unaligned_reads(dat_file: str)[source]

Convenience method that yields FastqRead items representing unaligned reads from either a FASTQ or an unaligned BAM file.

Examples

>>> for r in yield_unaligned_reads('unaligned_reads.bam'):
>>>     print(r)
>>> for r in yield_unaligned_reads('unaligned_reads.fastq.gz'):
>>>     print(r)

rnalib.utils.aligns_to(anno_blocks, read_blocks, min_overlap=0.95)[source]

Calculates the fraction of aligned read bases that overlap with the passed annotation and returns True if the overlap is greater or equal to the configured min_overlap

Parameters:

anno_blocks (list) – List of annotation blocks (e.g., exons)
read_blocks (list) – List of read blocks (e.g., [rna.gi(read.reference_name,a+1,b) for a,b in read.get_blocks()])
min_overlap (float) – The minimum overlap fraction required for a positive result

Returns:

True if the fraction of aligned read bases that overlap with the passed annotation is >= min_frac

Return type:

bool

rnalib.utils.count_reads(in_file)[source]: Counts reads in different file types

rnalib.utils.get_softclip_seq(read: AlignedSegment, report_seq=False) → tuple[int | None, int | None][source]

Extracts soft-clipped sequences from the passed read.

Parameters:

read (pysam.AlignedSegment) – The read to extract soft-clipped sequences from.
report_seq (bool) – If True, the soft-clipped sequences will be returned, otherwise their length is returned.

Returns:

A tuple containing the left and right soft-clipped sequences/lengths, respectively. If no soft-clipped sequence is found, the corresponding value in the tuple is None.

Return type:

Tuple[Optional[int], Optional[int]]

Examples

>>> r = pysam.AlignedSegment()
>>> r.cigartuples = [(4, 5), (0, 10)]
>>> get_softclip_seq(r)
(5, None)

rnalib.utils.get_softclipped_seq_and_qual(read)[source]: Returns the sequence+qualities of this read w/o softclipped bases

rnalib.utils.get_covered_contigs(bam_files)[source]

Returns all contigs that have some coverage across a set of BAMs.

Parameters:: bam_files (str) – File paths of input BAM files. If a single path is passed, it will be converted to a list.
Returns:: A set of strings representing all contigs that have data in at least one of the BAM files.
Return type:: set

rnalib.utils.get_covered_regions(bam_file)[source]: Returns all covered regions in a BAM file. This is slow as it iterates over all reads in the BAM file and should be used for small files only.

rnalib.utils.downsample_per_chrom(bam_file, max_reads, out_file_bam=None)[source]

Simple convenience method that randomly subsamples reads to ensure max_reads per chromosome. The resulting BAM file will be sorted and indexed.

Parameters:

bam_file (str) – The input BAM file.
max_reads (int) – The maximum number of reads to keep per chromosome.
out_file_bam (str) – The output file name. If None, it will be set to bam_file + ‘.subsampled_max<max_reads>.bam’.

rnalib.utils.is_paired(bam_file, n=10000)[source]: Returns True if the BAM contains PE reads. The method samples the first n reads and checks if any of them are paired.

rnalib.utils.extract_aligned_reads_from_fastq(bam_file, fastq1_file, fastq2_file=None, region=None, out_file_prefix=None, max_reads=None)[source]

Extracts reads from one (SE) or two (PE) FASTQ files that are mapped in a BAM file at the provided region. If region is None, all reads are considered, if max_reads is set, only the first max_reads reads are extracted. Returns the filename(s) of the FASTQ file(s).

Parameters:

bam_file (str) – The input BAM file.
fastq1_file (str) – The input FASTQ file for the first read.
fastq2_file (str or None) – The input FASTQ file for the second read (for PE data)
region (str or None) – The region to extract reads from. If None, all reads are considered.
out_file_prefix (str or None) – The prefix for the output FASTQ file(s). If None, it will be set to bam_file + ‘.<region>.<fastq1/2>.fq’.
max_reads (int or None) – The maximum number of reads to extract. If None, all reads are considered.

Returns:

The number of written reads (sum if PE reads) and the filename of the FASTQ file if single-end data, or both filenames if paired-end data.

Return type:

tuple

rnalib.utils.get_tx_indices(tx, sequence_type='spliced_sequence')[source]

Returns the splice sequence of the transcript, a numpy array with genomic indices of the respective nucleotides in the genome and a numpy array with the distances to the next consecutive nucleotide in the genome for each nucleotide in the transcript.

Parameters:

tx (rnalib.Transcript) – The transcript object
sequence_type (str) – The type of sequence to return. Can be ‘spliced_sequence’ or ‘rna_sequence’. Default is ‘spliced_sequence’.

Returns:

splice_seq (str) – The spliced transcript sequence
idx (np.array) – The genomic indices of the respective nucleotides in the genome
idx0 (np.array) – The distances to the next consecutive nucleotide in the genome for each nucleotide in the transcript

rnalib.utils.get_aligned_blocks(tx, spliced_seq_start: int, spliced_seq_end: int, splice_seq=None, idx=None, idx0=None, sequence_type='spliced_sequence')[source]

Return the aligned blocks of a read sequence to the genomic sequence. The read sequence is defined by the passed transcript and the spliced_seq_start and spliced_seq_end indices.

Examples

>>> t = rna.Transcriptome(...)
>>> rs, bl = get_aligned_blocks(t['ENST00000674681.1'], 0, 100)

Parameters:

tx (rnalib.Transcript) – The transcript object
spliced_seq_start (int) – The start index of the read in the spliced transcript sequence
spliced_seq_end (int) – The end index of the read in the spliced transcript sequence
splice_seq (str) – The spliced transcript sequence. If None, tx.spliced_sequence is used.
idx (np.array) – The genomic indices of the respective nucleotides in the genome. If None, it is calculated from the transcript.
idx0 (np.array) – The distances to the next consecutive nucleotide in the genome for each nucleotide in the transcript. If None, it is calculated from the transcript.
sequence_type (str) – The type of sequence to return. Can be ‘spliced_sequence’ or ‘rna_sequence’. Default is ‘spliced_sequence’.

Returns:

read_seq (str) – The read sequence
blocks (list) – The (sorted) aligned blocks of the read sequence to the genomic sequence

class rnalib.utils.MismatchProfile(d: dict = None, alpha=None, *args, **kwargs)[source]

Bases: Counter

Mismatch profile for sequence error handling.

classmethod get_flat_profile(seq_err=0.001, alpha=None)[source]: Returns a flat sequencing error profile with the given mismatch error probability.

add(ref: str, alt: str, strand: str)[source]: Adds a reference/alternative pair to the profile. Converts the passed reference/alternative pair to upper case and checks if they are valid.

get_prob(ref: str, alt: set = None, strand: str = '+', revcomp=True)[source]: Returns the probability for observing the passed reference/alternative pair. if alt is None, any of the other bases is assumed.

get_mismatch_prob(strand: str)[source]: Returns the probability of observing a mismatch.

get_n_mismatches()[source]: Returns the number of mismatches.

to_dataframe()[source]: Converts the mismatch profile to a DataFrame.

plot_mm_profile(ax=None, prob=True)[source]

save(file)[source]: Saves the mismatch profile to a file.

add_seq_err(seq, strand: str = '+')[source]: Adds a sequencing error according to the profile to the passed sequence. By convention, the passed sequence should always be in 5’->3’ direction (as written in a BAM file). If strand is ‘-’, the probabilities are calculated for the reverse complement of the sequence.

classmethod load(file)[source]: Loads the mismatch profile from a file.

classmethod from_bam(bam_file, features, min_cov=10, max_mm_frac=0.1, max_sample=1000000.0, strand_specific=True, alpha=None, fasta_file=None, disable_progressbar=False)[source]

Creates a mismatch profile from a BAM file. The mismatch profile is created by analyzing the passed regions in the BAM file. Only positions with a minimum coverage of min_cov are considered. Positions with a mismatch fraction > max_mm_frac are skipped. The analysis stops after max_mm_cnt mismatches have been counted.

Parameters:

bam_file (str) – The input BAM file.
features (list) – List of genomic features to analyze (e.g., genes). The features will be randomly permutated and analyzed in this order. They must have an associated sequence attribute in order to calculate the reference base. If None, all covered regions in the BAM file will be analyzed (slow) and “N” will be used as reference base.
min_cov (int) – The minimum coverage required for a position to be considered.
max_mm_frac (float) – The maximum mismatch fraction allowed for a position to be considered.
max_sample (int) – The maximum number of mismatches to count.
strand_specific (bool) – If True, a strand-specific profile is created.
alpha (set) – The set of valid nucleotides.
disable_progressbar (bool) – If True, the progress bar is disabled.

class rnalib.utils.BamWriter(genome_fa: str, out_file_bam: str, sort_and_index=True)[source]

Bases: object

A simple helper class for creating custom BAM files. The BAM header is initialized from the passed genome FASTA file.

write(aligned_blocks: list, query_sequence: str = None, query_qualities: str = None, query_name: str = None, mm: list[tuple] = None, mapping_quality=255, tags=None)[source]: Writes a read to the BAM file.

write_read(read: AlignedSegment)[source]

close()[source]

rnalib.utils.sort_and_index_bam(bam_file)[source]: Sort and index a BAM file

rnalib.utils.merge_bam_files(out_file: str, bam_files: list, sort_output: bool = False, del_in_files: bool = False)[source]

Merge multiple BAM files, sort and index results.

Parameters:

out_file (str) – The output file name.
bam_files (list) – A list of input BAM files.
sort_output (bool) – Whether to sort the output file. Default is False.
del_in_files (bool) – Whether to delete the input files after merging. Default is False.

Return type:

The filename of the merged BAM file.

rnalib.utils.move_id_to_info_field(vcf_in, info_field_name, vcf_out, desc=None)[source]: move all ID entries from a VCF to a new info field

rnalib.utils.add_contig_headers(vcf_in, ref_fasta, vcf_out)[source]: Add missing contig headers to a VCF file

rnalib.utils.get_read_attr_info(fast5_file, rn, path)[source]

Returns a dict with attribute values. For debugging only. .. admonition:: Example

>>> get_read_attr_info(fast5_file, 'read_00262802-9463-45c8-b22d-f68d1047c6fc',
>>>                    'Analyses/Basecall_1D_000/BaseCalled_template/Trace')

class rnalib.utils.Timer(tdict, name)[source]

Bases: object

Simple class for collecting timings of code blocks.

Examples

>>> import time
>>> times=Counter()
>>> with Timer(times, "sleep1") as t:
>>>     time.sleep(.1)
>>> print('executed ', t.name)
>>> with Timer(times, "sleep2") as t:
>>>     time.sleep(.2)
>>> print(times)

rnalib.utils.read_alias_file(gene_name_alias_file, disable_progressbar=False) -> (<class 'dict'>, <class 'set'>)[source]: Reads a gene name aliases from the passed file. Supports the download format from genenames.org and vertebrate.genenames.org. Returns an alias dict and a set of currently known (active) gene symbols

rnalib.utils.norm_gn(g, current_symbols=None, aliases=None) → str[source]: Normalizes gene names. Will return a stripped version of the passed gene name. The gene symbol will be updated to the latest version if an alias table is passed (@see read_alias_file()). If a set of current (up-to date) symbols is passed that contains the passed symbol, no aliasing will be done.

rnalib.utils.geneid2symbol(gene_ids)[source]: Queries gene names for the passed gene ids from MyGeneInfo via https://pypi.org/project/mygene/ Gene ids can be, e.g., EntrezIds (e.g., 60) or ensembl gene ids (e.g., ‘ENSMUSG00000029580’) or a mixed list. Returns a dict { entrezid : GeneSymbol } Example: geneid2symbol([‘ENSMUSG00000029580’, 60]) - will return mouse and human actin beta

rnalib.utils.calc_3end(tx, width=200, report_partial=True, na_value=None)[source]

Utility function that returns a (sorted) list of genomic intervals containing the last <width> bases of the passed transcript or None if not possible (e.g., if transcript is too short). TODO: should report a “Feature” with parent tx so that, e.g., sequence slicing works

Parameters:

tx (rna.Feature) – The transcript to calculate the 3’ end for.
width (int) – The number of bases to return. Default is 200.
report_partial – If True, 3’ends shorter than width will be reported, otherwise the na_value will be returned in this case.
na_value (any) – The value to return if the 3# end could not be calculated (e.g., if the transcript is too short. Default is None.)

rnalib.utils.gt2zyg(gt) -> (<class 'int'>, <class 'int'>)[source]

Extracts zygosity information from a genotype string.

Parameters:

gt genotype

Returns:

zygosity is 2 if all called alleles are the same, 1 if there are mixed called alleles or 0 if no call
call: 0 if no-call or homref, 1 otherwise

Return type:

a tuple (zygosity, call)

class rnalib.utils.FastqRead(name, seq, qual)

Bases: tuple

A read in a FASTQ file. Print in FASTQ format, e.g., like this: print(’n’.join([r.name, r.seq, ‘+’, r.qual]))

name: Alias for field number 0

qual: Alias for field number 2

seq: Alias for field number 1

class rnalib.utils.TagFilter(tag: str, filter_values: ~typing.List = <factory>, filter_if_no_tag: bool = False, inverse: bool = False)[source]

Bases: object

Filter reads if the specified tag has one of the provided filter_values. Can be inverted for filtering if specified values is found. If filter_if_no_tag is True (default: False), then the read is filtered if the tag is not present.

tag: str

filter_values: List

filter_if_no_tag: bool = False

inverse: bool = False

filter(r)[source]

rnalib.utils.get_sample_meta_keys(file='data/human_gene_v2.2.h5')[source]

rnalib.utils.get_archs4_sample_dict(file='data/human_gene_v2.2.h5', remove_sc=True)[source]

Returns a dict of GSM ids and sample indices. If remove_sc is True (default), then single cell samples are removed. .. admonition:: Example

>>> nosc_samples = get_archs4_sample_dict()
>>> ten_random_ids = random.sample(list(nosc_samples), 10)

rnalib.utils.get_sample_metadata(samples, sample_dict=None, keys=None, file='data/human_gene_v2.2.h5', disable_progressbar=False)[source]

rnalib.utils.random_sample(conf_str, rng=Generator(PCG64) at 0x727C1E31B840)[source]

Draws random samples from a distribution that is configured by the passed (configuration) string. If the conf_str is numeric (i.e., a constant value), it is cast to a float and returned. Supported distributions are all numpy.random functions (e.g., normal, uniform, etc.). .. admonition:: Examples

>>> random_sample(12), random_sample(12.0), random_sample('12') # constant value
>>> random_sample('uniform(1,2,100)') # 100 random numbers, uniformly sampled from [1; 2]
>>> random_sample('normal(3, 0.8, size=(2, 4))') # 2 x 4 random numbers from a normal distribution around 3 with sd 0.8
>>> for x in random_sample("normal(3000,10,size=(5,1000))"):
>>>     plt.hist(x, 30, density=True, histtype=u'step') # plot 5 histograms # noqa

rnalib.utils.random_intervals(chromosomes=('1',), start_range=range(0, 1000), len_range=range(1, 100), n=10)[source]

Generates random genomic intervals.

Parameters:

chromosomes (list) – The chromosomes to sample from.
start_range (range) – The range of start positions.
len_range (range) – The range of lengths.
n (int) – The number of intervals to generate.

Return type:

A sorted list of genomic intervals.

rnalib.utils.execute_screencast(command_file, col=True, default_delay=1, console_prompt='>>> ', min_typing_delay=0.001, max_typing_delay=0.05)[source]

Executes the passed command file in a python shell. .. admonition:: Example

>>> execute_screencast('screencast.txt')
To record a screencast, create a file with the commands to execute and add comments with the delay in seconds. Then run the following command to record the screencast (make sure that rnalib is installed): >>> terminalizer record -c myconfig.yml –skip-sharing –command “python3 -c “import rnalib as rna; rna.execute_screencast(‘myscript.py’)”” ${name} # noqa where myconfig.yml is a terminalizer configuration file and myscript.py is the python script to execute.

Todo: Use ast to parse and highlight the commands