rnalib.utils module
This module implements various general (low-level) utility methods
- class rnalib.utils.GeneSymbol(symbol, name, taxid)
Bases:
tupleA named tuple representing a gene symbol with symbol, name, and taxid fields.
- name
Alias for field number 1
- symbol
Alias for field number 0
- taxid
Alias for field number 2
- rnalib.utils.ensure_outdir(outdir=None) PathLike[source]
Ensures that the configured output dir exists (will use current dir if none provided)
- rnalib.utils.check_list(lst, mode='inc1') bool[source]
Tests whether the (numeric, comparable) items in a list are * mode==’inc’: increasing (strictly monotonic) * mode==’dec’: decreasing (strictly monotonic) * mode==’inceq’: increasing (monotonic) * mode==’deceq’: decreasing (monotonic) * mode==’inc1’: increasing by one * mode==’dec1’: decreasing by one * mode==’eq’: all equal
- rnalib.utils.split_list(lst, n, is_chunksize=False) list[source]
Splits a list into sublists.
- Parameters:
Examples
>>> list(split_list(range(0,10), 3)) >>> list(split_list(range(0,10), 3, is_chunksize=True))
- Return type:
generatoroflists
Notes
Will probably be replaced by more_iterools chunked and chunked_even
- rnalib.utils.intersect_lists(*lists, check_order=False) list[source]
Intersects lists (iterables) while preserving order. Order is determined by the last provided list. If check_order is True, the order of all input lists wrt. shared elements is asserted.
Examples
>>> intersect_lists([1,2,3,4],[3,2,1]) # [3,2,1] >>> list_of_lists = ... >>> intersect_lists(*list_of_lists) >>> intersect_lists([1,2,3,4],[1,4],[3,1], check_order=True) >>> intersect_lists((1,2,3,5),(1,3,4,5)) # [1,3,5]
- rnalib.utils.cmp_sets(a, b) -> (<class 'bool'>, <class 'bool'>, <class 'bool'>)[source]
Set comparison. Returns shared, unique(a) and unique(b) items
- rnalib.utils.get_unique_keys(dict_of_dicts)[source]
Returns all unique key names from a dict of dicts. Example: get_unique_keys({‘a’:{‘1’:12,’2’:13}, ‘b’: {‘1’:14,’3’:43}})
- rnalib.utils.powerset(it)[source]
Returns all possible subsets of the passed iterable. :Parameters: it (
iterable)- Return type:
The powersetofthe passed iterable
Examples
>>> list(powerset([1,2,3])) # [(), (1,), (2,), (3,), (1, 2), (1, 3), (2, 3), (1, 2, 3)]
- rnalib.utils.to_set(x, sep=',') set[source]
Converts the passed object to a set if not already. If None is passed, an empty set is returned. if a string is passed, it will be split by the specified separator. if a set, list or tuple is passed, it will be converted to a set. If a number or iterable is passed, it will be converted to a set with one element. :rtype:
set
- rnalib.utils.to_list(x, sep=',') list[source]
Converts the passed object to a list if not already. If None is passed, an empty list is returned. if a string is passed, it will be split by the specified separator. if a set, list or tuple is passed, it will be converted to a list. If a number or iterable is passed, it will be converted to a list with one element.
- Return type:
- rnalib.utils.to_str(*args, sep=',', na='NA') str[source]
Converts an object to a string representation. Iterables will be joined by the configured separator. Objects with zero length or None will be converted to the configured ‘NA’ representation.
Examples
>>> to_str([12,[None,[1,'',[]]]]) # 12,NA,1,NA,NA
Notes
This implementation is different from dm-tree.flatten() or treevalue.flatten() but slower?
- rnalib.utils.write_data(dat, out=None, sep='\t', na='NA')[source]
Write data to TSV file. Values will be written via thee to_str() method (i.e., None will become ‘NA’). Collections will be written as comma-separated strings, nested structures will be flattened. If out is None, the respective string will be written, Otherwise it will be written to the respective output handle.
- rnalib.utils.convert_size(size_bytes)[source]
Convert bytes to human-readable format, see https://stackoverflow.com/questions/5194057/better-way-to-convert-file-sizes-in-python
- rnalib.utils.dir_tree(root: Path, prefix: str = '', space=' ', branch='│ ', tee='├── ', last='└── ', max_lines=10, glob=None, show_size=True)[source]
A recursive generator yielding a visual tree structure line by line.
- Parameters:
root – Path instance pointing to a directory
prefix – optional prefix string to prepend to each output line
space – optional prefix string to prepend to lines indicating a regular file
branch – optional prefix string to prepend to lines indicating the last item in a directory
tee – optional prefix string to prepend to lines indicating an item in a directory
last – optional prefix string to prepend to lines indicating the last item in the entire tree
max_lines – maximum yielded lines per directory.
glob – optional glob param to filter. Use, e.g., ‘**/*.py’ to list all .py files. default: None (all files)
show_size – optional; if True, the size of the file will be printed in brackets after the filename
@see https (
//stackoverflow.com/questions/9727673/list-directory-tree-structure-in-python)
- rnalib.utils.gunzip(in_file, out_file=None) str[source]
Gunzips a file and returns the filename of the resulting file
- rnalib.utils.slugify(value, allow_unicode=False) str[source]
Slugify a string (e.g., to get valid filenames w/o extensions). @see https://github.com/django/django/blob/master/django/utils/text.py Convert to ASCII if ‘allow_unicode’ is False. Convert spaces or repeated dashes to single dashes. Remove characters that aren’t alphanumerics, underscores, or hyphens. Convert to lowercase. Also strip leading and trailing whitespace, dashes, and underscores.
- rnalib.utils.remove_extension(p, remove_gzip=True) str[source]
Returns a string representation of a resolved PosixPath of the passed path with removed file extension. Will also remove ‘.gz’ extensions if remove_gzip is True. example remove_extension(‘b/c.txt.gz’) -> ‘<pwd>/b/c’
- rnalib.utils.download_file(url, filename, show_progress=True)[source]
Downloads a file from the passed (https) url into a file with the given path
- Parameters:
Examples
>>> import tempfile >>> with tempfile.TemporaryDirectory() as tempdirname: >>> fn=download_file(..., f"{tempdirname}/{filename}") >>> # do something with this file >>> # Note that the temporary created dir will be removed once the context manager is closed.
- rnalib.utils.reverse_complement(seq, tmap='dna') str[source]
Calculate reverse complement DNA/RNA sequence. Returned sequence is uppercase, N’s are kept. seq_type can be ‘dna’ (default) or ‘rna’
- rnalib.utils.complement(seq, tmap='dna') str[source]
Calculate complement DNA/RNA sequence. Returned sequence is uppercase, N’s are kept. seq_type can be ‘dna’ (default) or ‘rna’
- rnalib.utils.pad_n(seq, minlen, padding_char='N') str[source]
Adds up-/downstream padding with the configured padding character to ensure a given minimum length of the passed sequence.
- rnalib.utils.rnd_seq(n, alpha='ACTG', m=1)[source]
Creates m random sequence of length n using the provided alphabet (default: DNA bases). To use different character frequencies, pass each character as often as expected in the frequency distribution. Example: rnd_seq(100, ‘GC’* 60 + ‘AT’ * 40, 5) # 5 sequences of length 100 with 60% GC
- rnalib.utils.count_gc(s) -> (<class 'int'>, <class 'float'>)[source]
- Counts number of G+C bases in string.
Returns the number of GC bases and the length-normalized fraction.
- rnalib.utils.count_rest(s, rest=('GGTACC', 'GAATTC', 'CTCGAG', 'CATATG', 'ACTAGT')) int[source]
Counts number of restriction sites, see https://portals.broadinstitute.org/gpp/public/resources/rules
- rnalib.utils.longest_hp_gc_len(seq) -> (<class 'int'>, <class 'int'>)[source]
Counts HP length (any allele) from start. This method counts any character including N’s
- rnalib.utils.kmer_search(seq, kmer_set, include_revcomp=False) dict[source]
Exact kmer search. Can include revcomp if configured
- rnalib.utils.find_gpos(genome_fa, kmers, included_chrom=None) defaultdict[list][source]
Returns a dict that maps the passed kmers to their (exact match) genomic positions (1-based). Positions are returned as (chr, pos1) tuples. included_chromosomes (list): if configured, then only the respective chromosomes will be considered.
- rnalib.utils.compact_bedgraph_file(bedgraph_file, out_file=None)[source]
Compacts a bedgraph file by merging adjacent entries with the same value.
- rnalib.utils.bgzip_and_tabix(in_file, out_file=None, sort=False, create_index=True, del_uncompressed=True, preset='auto', seq_col=0, start_col=1, end_col=1, meta_char=35, line_skip=0, zerobased=False)[source]
BGZIP the input file and create a tabix index with the given parameters if create_index is True. File is sorted with pybedtools if sort is True.
- Parameters:
in_file (
str) – The input file to be compressed.out_file (
str, optional) – The output file name. Default is in_file + ‘.gz’.sort (
bool, optional) – Whether to sort the input file. Default is False.create_index (
bool, optional) – Whether to create a tabix index. Default is True.del_uncompressed (
bool, optional) – Whether to delete the uncompressed input file. Default is True.preset (
str, optional) – The preset format for the tabix index. Default is ‘auto’. presets: * ‘gff’ : (TBX_GENERIC, 1, 4, 5, ord(‘#’), 0), * ‘bed’ : (TBX_UCSC, 1, 2, 3, ord(‘#’), 0), * ‘psltbl’ : (TBX_UCSC, 15, 17, 18, ord(‘#’), 0), * ‘sam’ : (TBX_SAM, 3, 4, 0, ord(‘@’), 0), * ‘vcf’ : (TBX_VCF, 1, 2, 0, ord(‘#’), 0), * ‘auto’: guess from file extensionseq_col (
int, optional) – The column number for the sequence name. Default is 0.start_col (
int, optional) – The column number for the start position. Default is 1.end_col (
int, optional) – The column number for the end position. Default is 1.meta_char (
int, optional) – The character that indicates a comment line. Default is ord(‘#’).line_skip (
int, optional) – The number of lines to skip at the beginning of the file. Default is 0.zerobased (
bool, optional) – Whether the start position is zero-based. Default is False.
- Return type:
The filenameofthe compressed file.- Raises:
AssertionError – If out_file does not end with ‘.gz’.
Notes
This function requires the pysam package to be installed. TODO add csi support
Examples
>>> bgzip_and_tabix('input.vcf', create_index=True)
- rnalib.utils.get_rel_coord(feature, subfeature, query, strand_specific=False)[source]
Returns a GI with coordinates that are relative to the passed feature by taking the specified subfeature blocks into account. This method can, e.g., be used to map from genomic to transcriptomic coordinates or to calculate codon numbers, see examples.
- Parameters:
feature (
rna.GI) – The feature to map tosubfeature (
str) – The subfeature type to consider (e.g., ‘CDS’, ‘exon’, ‘UTR’)query (
rna.GI) – The query interval (genomic coordinates) to mapstrand_specific (
bool, optional) – Whether to test for matching strand between feature and query. Default is False.
- Returns:
rna.GIorNone– The relative coordinates of the query interval in the feature or None if no overlap was found. The chromosome of the returned GI will be set to the feature_id (e.g., the transcript id) of the feature.Example>>> get_rel_coord(t[‘ENST00000621592.8’], “CDS”, rna.gi(“chr8 (
127740698-127750856 (+)"))))
- rnalib.utils.print_fast5_tree(fast5_file, max_lines=10, n_reads=1, show_attrs=True)[source]
Prints the structure of a fast5 file.
- rnalib.utils.print_small_file(filename, show_linenumber=False, max_lines=10)[source]
Prints the first 10 lines of a file
- rnalib.utils.display_textarea(txt, rows=4, cols=120)[source]
Display a (long) text in a scrollable HTML text area
- rnalib.utils.display_help(obj, icon='help_icon.png', bg_color='#dcf6fa')[source]
Display the docstring of the passed object
- rnalib.utils.head_counter(cnt, non_empty=True)[source]
Displays n items from the passed counter. If non_empty is true then only items with len>0 are shown
- rnalib.utils.plot_times(title, times, n=None, reference_method=None, show_speed=True, show_fastest=False, ax=None, orientation='h', highlight_bar=None)[source]
Helper method to plot a dict with timings (seconds). If n is passed and show_speed is true, the method will display iterations per second. If reference_method is set then it will also display the speed increase of the fastest method compared to the reference method/
- rnalib.utils.guess_file_format(file_name, file_extensions=None)[source]
Guesses the file format from the file extension :param file_name: :param file_extensions: :return:
- class rnalib.utils.ParseMap(*args, **kwargs)[source]
Bases:
dictExtends default dict to return ‘missing char’ for missing keys (similar to defaultdict, but does not enter new values). Should be used with translate()
- class rnalib.utils.AutoDict[source]
Bases:
dictImplementation of perl’s autovivification feature. https://stackoverflow.com/questions/651794/whats-the-best-way-to-initialize-a-dict-of-dicts-in-python/651879#651879
- rnalib.utils.get_config(config, keys, default_value=None, required=False)[source]
Gets a configuration value from a config dict (e.g., loaded from JSON). Keys can be a list of keys (that will be traversed) or a single value. If the key is missing and required is True, an error will be thrown. Otherwise, the configured default value will be returned.
Examples
>>> cmd_bcftools = get_config(config, ['params','cmd','cmd_bcftools'], required=True) >>> threads = get_config(config, 'config/process/threads', default_value=1, required=False)
Notes
Consider replacing this with omegaconf
- rnalib.utils.open_file_obj(fh, file_format=None, file_extensions=None) object[source]
Opens a file object. If a pysam compatible file format was detected, the respective pysam object is instantiated.
- Parameters:
- Returns:
file_handle
- Return type:
instance (file/pysam/pyBigWig object)
- rnalib.utils.toggle_chr(s)[source]
Simple function that toggle ‘chr’ prefixes. If the passed reference name start with ‘chr’, then it is removed, otherwise it is added. If None is passed, None is returned. Can be used as fun_alias function.
- rnalib.utils.yield_unaligned_reads(dat_file: str)[source]
Convenience method that yields FastqRead items representing unaligned reads from either a FASTQ or an unaligned BAM file.
Examples
>>> for r in yield_unaligned_reads('unaligned_reads.bam'): >>> print(r) >>> for r in yield_unaligned_reads('unaligned_reads.fastq.gz'): >>> print(r)
- rnalib.utils.aligns_to(anno_blocks, read_blocks, min_overlap=0.95)[source]
Calculates the fraction of aligned read bases that overlap with the passed annotation and returns True if the overlap is greater or equal to the configured min_overlap
- Parameters:
- Returns:
True if the fraction of aligned read bases that overlap with the passed annotation is >= min_frac
- Return type:
- rnalib.utils.get_softclip_seq(read: AlignedSegment, report_seq=False) tuple[int | None, int | None][source]
Extracts soft-clipped sequences from the passed read.
- Parameters:
read (
pysam.AlignedSegment) – The read to extract soft-clipped sequences from.report_seq (
bool) – If True, the soft-clipped sequences will be returned, otherwise their length is returned.
- Returns:
A tuple containing the left and right soft-clipped sequences/lengths, respectively. If no soft-clipped sequence is found, the corresponding value in the tuple is None.
- Return type:
Tuple[Optional[int],Optional[int]]
Examples
>>> r = pysam.AlignedSegment() >>> r.cigartuples = [(4, 5), (0, 10)] >>> get_softclip_seq(r) (5, None)
- rnalib.utils.get_softclipped_seq_and_qual(read)[source]
Returns the sequence+qualities of this read w/o softclipped bases
- rnalib.utils.get_covered_contigs(bam_files)[source]
Returns all contigs that have some coverage across a set of BAMs.
- rnalib.utils.get_covered_regions(bam_file)[source]
Returns all covered regions in a BAM file. This is slow as it iterates over all reads in the BAM file and should be used for small files only.
- rnalib.utils.downsample_per_chrom(bam_file, max_reads, out_file_bam=None)[source]
Simple convenience method that randomly subsamples reads to ensure max_reads per chromosome. The resulting BAM file will be sorted and indexed.
- rnalib.utils.is_paired(bam_file, n=10000)[source]
Returns True if the BAM contains PE reads. The method samples the first n reads and checks if any of them are paired.
- rnalib.utils.extract_aligned_reads_from_fastq(bam_file, fastq1_file, fastq2_file=None, region=None, out_file_prefix=None, max_reads=None)[source]
Extracts reads from one (SE) or two (PE) FASTQ files that are mapped in a BAM file at the provided region. If region is None, all reads are considered, if max_reads is set, only the first max_reads reads are extracted. Returns the filename(s) of the FASTQ file(s).
- Parameters:
bam_file (
str) – The input BAM file.fastq1_file (
str) – The input FASTQ file for the first read.fastq2_file (
strorNone) – The input FASTQ file for the second read (for PE data)region (
strorNone) – The region to extract reads from. If None, all reads are considered.out_file_prefix (
strorNone) – The prefix for the output FASTQ file(s). If None, it will be set to bam_file + ‘.<region>.<fastq1/2>.fq’.max_reads (
intorNone) – The maximum number of reads to extract. If None, all reads are considered.
- Returns:
The number of written reads (sum if PE reads) and the filename of the FASTQ file if single-end data, or both filenames if paired-end data.
- Return type:
- rnalib.utils.get_tx_indices(tx, sequence_type='spliced_sequence')[source]
Returns the splice sequence of the transcript, a numpy array with genomic indices of the respective nucleotides in the genome and a numpy array with the distances to the next consecutive nucleotide in the genome for each nucleotide in the transcript.
- Parameters:
tx (
rnalib.Transcript) – The transcript objectsequence_type (
str) – The type of sequence to return. Can be ‘spliced_sequence’ or ‘rna_sequence’. Default is ‘spliced_sequence’.
- Returns:
splice_seq (
str) – The spliced transcript sequenceidx (
np.array) – The genomic indices of the respective nucleotides in the genomeidx0 (
np.array) – The distances to the next consecutive nucleotide in the genome for each nucleotide in the transcript
- rnalib.utils.get_aligned_blocks(tx, spliced_seq_start: int, spliced_seq_end: int, splice_seq=None, idx=None, idx0=None, sequence_type='spliced_sequence')[source]
Return the aligned blocks of a read sequence to the genomic sequence. The read sequence is defined by the passed transcript and the spliced_seq_start and spliced_seq_end indices.
Examples
>>> t = rna.Transcriptome(...) >>> rs, bl = get_aligned_blocks(t['ENST00000674681.1'], 0, 100)
- Parameters:
tx (
rnalib.Transcript) – The transcript objectspliced_seq_start (
int) – The start index of the read in the spliced transcript sequencespliced_seq_end (
int) – The end index of the read in the spliced transcript sequencesplice_seq (
str) – The spliced transcript sequence. If None, tx.spliced_sequence is used.idx (
np.array) – The genomic indices of the respective nucleotides in the genome. If None, it is calculated from the transcript.idx0 (
np.array) – The distances to the next consecutive nucleotide in the genome for each nucleotide in the transcript. If None, it is calculated from the transcript.sequence_type (
str) – The type of sequence to return. Can be ‘spliced_sequence’ or ‘rna_sequence’. Default is ‘spliced_sequence’.
- Returns:
- class rnalib.utils.MismatchProfile(d: dict = None, alpha=None, *args, **kwargs)[source]
Bases:
CounterMismatch profile for sequence error handling.
- classmethod get_flat_profile(seq_err=0.001, alpha=None)[source]
Returns a flat sequencing error profile with the given mismatch error probability.
- add(ref: str, alt: str, strand: str)[source]
Adds a reference/alternative pair to the profile. Converts the passed reference/alternative pair to upper case and checks if they are valid.
- get_prob(ref: str, alt: set = None, strand: str = '+', revcomp=True)[source]
Returns the probability for observing the passed reference/alternative pair. if alt is None, any of the other bases is assumed.
- add_seq_err(seq, strand: str = '+')[source]
Adds a sequencing error according to the profile to the passed sequence. By convention, the passed sequence should always be in 5’->3’ direction (as written in a BAM file). If strand is ‘-’, the probabilities are calculated for the reverse complement of the sequence.
- classmethod from_bam(bam_file, features, min_cov=10, max_mm_frac=0.1, max_sample=1000000.0, strand_specific=True, alpha=None, fasta_file=None, disable_progressbar=False)[source]
Creates a mismatch profile from a BAM file. The mismatch profile is created by analyzing the passed regions in the BAM file. Only positions with a minimum coverage of min_cov are considered. Positions with a mismatch fraction > max_mm_frac are skipped. The analysis stops after max_mm_cnt mismatches have been counted.
- Parameters:
bam_file (
str) – The input BAM file.features (
list) – List of genomic features to analyze (e.g., genes). The features will be randomly permutated and analyzed in this order. They must have an associated sequence attribute in order to calculate the reference base. If None, all covered regions in the BAM file will be analyzed (slow) and “N” will be used as reference base.min_cov (
int) – The minimum coverage required for a position to be considered.max_mm_frac (
float) – The maximum mismatch fraction allowed for a position to be considered.max_sample (
int) – The maximum number of mismatches to count.strand_specific (
bool) – If True, a strand-specific profile is created.alpha (
set) – The set of valid nucleotides.disable_progressbar (
bool) – If True, the progress bar is disabled.
- class rnalib.utils.BamWriter(genome_fa: str, out_file_bam: str, sort_and_index=True)[source]
Bases:
objectA simple helper class for creating custom BAM files. The BAM header is initialized from the passed genome FASTA file.
- rnalib.utils.merge_bam_files(out_file: str, bam_files: list, sort_output: bool = False, del_in_files: bool = False)[source]
Merge multiple BAM files, sort and index results.
- Parameters:
- Return type:
The filenameofthe merged BAM file.
- rnalib.utils.move_id_to_info_field(vcf_in, info_field_name, vcf_out, desc=None)[source]
move all ID entries from a VCF to a new info field
- rnalib.utils.add_contig_headers(vcf_in, ref_fasta, vcf_out)[source]
Add missing contig headers to a VCF file
- rnalib.utils.get_read_attr_info(fast5_file, rn, path)[source]
Returns a dict with attribute values. For debugging only. .. admonition:: Example
>>> get_read_attr_info(fast5_file, 'read_00262802-9463-45c8-b22d-f68d1047c6fc', >>> 'Analyses/Basecall_1D_000/BaseCalled_template/Trace')
- class rnalib.utils.Timer(tdict, name)[source]
Bases:
objectSimple class for collecting timings of code blocks.
Examples
>>> import time >>> times=Counter() >>> with Timer(times, "sleep1") as t: >>> time.sleep(.1) >>> print('executed ', t.name) >>> with Timer(times, "sleep2") as t: >>> time.sleep(.2) >>> print(times)
- rnalib.utils.read_alias_file(gene_name_alias_file, disable_progressbar=False) -> (<class 'dict'>, <class 'set'>)[source]
Reads a gene name aliases from the passed file. Supports the download format from genenames.org and vertebrate.genenames.org. Returns an alias dict and a set of currently known (active) gene symbols
- rnalib.utils.norm_gn(g, current_symbols=None, aliases=None) str[source]
Normalizes gene names. Will return a stripped version of the passed gene name. The gene symbol will be updated to the latest version if an alias table is passed (@see read_alias_file()). If a set of current (up-to date) symbols is passed that contains the passed symbol, no aliasing will be done.
- rnalib.utils.geneid2symbol(gene_ids)[source]
Queries gene names for the passed gene ids from MyGeneInfo via https://pypi.org/project/mygene/ Gene ids can be, e.g., EntrezIds (e.g., 60) or ensembl gene ids (e.g., ‘ENSMUSG00000029580’) or a mixed list. Returns a dict { entrezid : GeneSymbol } Example: geneid2symbol([‘ENSMUSG00000029580’, 60]) - will return mouse and human actin beta
- rnalib.utils.calc_3end(tx, width=200, report_partial=True, na_value=None)[source]
Utility function that returns a (sorted) list of genomic intervals containing the last <width> bases of the passed transcript or None if not possible (e.g., if transcript is too short). TODO: should report a “Feature” with parent tx so that, e.g., sequence slicing works
- Parameters:
tx (
rna.Feature) – The transcript to calculate the 3’ end for.width (
int) – The number of bases to return. Default is 200.report_partial – If True, 3’ends shorter than width will be reported, otherwise the na_value will be returned in this case.
na_value (
any) – The value to return if the 3# end could not be calculated (e.g., if the transcript is too short. Default is None.)
- rnalib.utils.gt2zyg(gt) -> (<class 'int'>, <class 'int'>)[source]
Extracts zygosity information from a genotype string.
- Parameters:
gt genotype
- Returns:
zygosity is 2 if all called alleles are the same, 1 if there are mixed called alleles or 0 if no call
call: 0 if no-call or homref, 1 otherwise
- Return type:
a tuple (zygosity,call)
- class rnalib.utils.FastqRead(name, seq, qual)
Bases:
tupleA read in a FASTQ file. Print in FASTQ format, e.g., like this: print(’n’.join([r.name, r.seq, ‘+’, r.qual]))
- name
Alias for field number 0
- qual
Alias for field number 2
- seq
Alias for field number 1
- class rnalib.utils.TagFilter(tag: str, filter_values: ~typing.List = <factory>, filter_if_no_tag: bool = False, inverse: bool = False)[source]
Bases:
objectFilter reads if the specified tag has one of the provided filter_values. Can be inverted for filtering if specified values is found. If filter_if_no_tag is True (default: False), then the read is filtered if the tag is not present.
- rnalib.utils.get_archs4_sample_dict(file='data/human_gene_v2.2.h5', remove_sc=True)[source]
Returns a dict of GSM ids and sample indices. If remove_sc is True (default), then single cell samples are removed. .. admonition:: Example
>>> nosc_samples = get_archs4_sample_dict() >>> ten_random_ids = random.sample(list(nosc_samples), 10)
- rnalib.utils.get_sample_metadata(samples, sample_dict=None, keys=None, file='data/human_gene_v2.2.h5', disable_progressbar=False)[source]
- rnalib.utils.random_sample(conf_str, rng=Generator(PCG64) at 0x727C1E31B840)[source]
Draws random samples from a distribution that is configured by the passed (configuration) string. If the conf_str is numeric (i.e., a constant value), it is cast to a float and returned. Supported distributions are all numpy.random functions (e.g., normal, uniform, etc.). .. admonition:: Examples
>>> random_sample(12), random_sample(12.0), random_sample('12') # constant value >>> random_sample('uniform(1,2,100)') # 100 random numbers, uniformly sampled from [1; 2] >>> random_sample('normal(3, 0.8, size=(2, 4))') # 2 x 4 random numbers from a normal distribution around 3 with sd 0.8 >>> for x in random_sample("normal(3000,10,size=(5,1000))"): >>> plt.hist(x, 30, density=True, histtype=u'step') # plot 5 histograms # noqa
- rnalib.utils.random_intervals(chromosomes=('1',), start_range=range(0, 1000), len_range=range(1, 100), n=10)[source]
Generates random genomic intervals.
- rnalib.utils.execute_screencast(command_file, col=True, default_delay=1, console_prompt='>>> ', min_typing_delay=0.001, max_typing_delay=0.05)[source]
Executes the passed command file in a python shell. .. admonition:: Example
>>> execute_screencast('screencast.txt')
To record a screencast, create a file with the commands to execute and add comments with the delay in seconds. Then run the following command to record the screencast (make sure that rnalib is installed): >>> terminalizer record -c myconfig.yml –skip-sharing –command “python3 -c “import rnalib as rna; rna.execute_screencast(‘myscript.py’)”” ${name} # noqa where myconfig.yml is a terminalizer configuration file and myscript.py is the python script to execute.
Todo: Use ast to parse and highlight the commands