rnalib.tools module

Implementation of rnalib command line tools.

rnalib.tools.prune_tags(bam_in, bam_out, kept_tags=('NH', 'HI', 'AS'))[source]

Parameters:

bam_in – input bam file
bam_out – output bam file
kept_tags – list of tags to be kept

Example

>>> prune_tags('small.ACTB+SOX2.bam', 'small.ACTB+SOX2.clean.bam')

rnalib.tools.tag_tc(bam_file, included_chrom=None, snp_vcf_file=None, out_prefix=None, fractional_counts=False, write_density_histogram=True, tags=None, min_mapping_quality=0, flag_filter=3844, min_base_quality=10, reverse_strand=False)[source]

Determines the T/C status per read while masking SNP positions and create density histogram per chrom. Will output two files: a BAM file with the T/C status encoded in the XC tag and a TSV file containing the T/C density per chrom.

Parameters:

bam_file (str) – bam file to be analysed
included_chrom (list) – list of chromosomes to be included (subset of reference dict)
snp_vcf_file (str) – VCF file containing variant calls from the respective cell line. T/C and A/G SNPs will be masked
out_prefix (str) – output file prefix
fractional_counts (bool) – if set, reads will be counted by 1/NH
write_density_histogram (bool) – if set, a (gzipped) TSV file containing T/C density data will be written. default is True
tags (dict) – dictionary containing the BAM tags to be used for the output BAM file. Default is {‘ntc’: ‘xc’, ‘ntt’: ‘xt’, ‘col’: ‘YC’}
min_mapping_quality (int) – minimum mapping quality
flag_filter (int) – flag filter for filtering reads.
min_base_quality (int) – minimum base quality
reverse_strand (bool) – if set, the reverse strand will be considered as the conversion strand. Default is False.

rnalib.tools.quantise_values(values, bins=10)[source]

Quantise values into bins.

Parameters:

values (list) – list of values to be quantised
bins (int) – number of bins

Returns:

list of quantised values

Return type:

list

rnalib.tools.filter_tc(bam_file, out_file=None, min_tc=1, tags=None)[source]

Filters a T/C annotated BAM file for reads with at least min_tc T/C conversions.

Parameters:

bam_file (str) – T/C annotated bam file to be analysed
out_file (str) – output file. Default is {bam_file_name}_tc-only.bam
min_tc (int) – minimum number of T/C conversions
tags (dict) – dictionary containing the BAM tags to be used for the output BAM file. Default is {‘ntc’: ‘xc’, ‘ntt’: ‘xt’, ‘col’: ‘YC’}

rnalib.tools.build_amplicon_resources(transcriptome_name, bed_file, fasta_file, out_dir, padding=100, amp_extension=100)[source]

Builds resources for amplicon analyses from a bed file.

This tool creates all required genomics resources for downstream analyses, such as the FASTA file with the (possibly padded) sequences, a GFF file that annotates the amplicons in transcriptome coordinates and various indexing data structures.

Intervals are parsed from the bed file (and possibly extended and padded= and sequences are extracted from the fasta file. The sequences are written to a FASTA file. For downstream analyses, a chrom.sizes file, a .dict file, a GFF3 file and a BED file are created. Additionally, a log file is written to the output directory.

The parsed intervals are merged and only one ‘amplicon’ per merged interval is written to the FASTA file. The GFF3 file contains the amplicon coordinates and the BED file contains the amplicon coordinates and the original bed item score and strand information.

The created amplicon sequences may be padded 3’ and 5’ with Ns (needed by some mappers). The amplicon sequences can also be extended 3’ and 5’ which is recommended to allow mappers to map reads that extend beyond the amplicon coordinates. The created FASTA file is indexed with samtools faidx. The GFF3 and BED files are bgzipped and tabix indexed.

Parameters:

transcriptome_name (str) – name of the transcriptome
bed_file (str) – bed file containing the amplicon coordinates
fasta_file (str) – fasta file containing the reference genome
out_dir (str) – output directory
padding (int) – padding to be added to the amplicon sequences
amp_extension (int) – extension of the amplicon sequences

Example

>>> import tempfile
>>> with tempfile.TemporaryDirectory() as tmpdir:
>>>     # create a small BED file with two defined amplicons that map to the genomic positions in the FASTA file
>>>     with open(f'{tmpdir}/test.bed', 'wt') as out:
>>>         rna.MemoryIterator({'amp1.1': rna.gi('chr3:1-100'), 'amp1.2': rna.gi('chr3:50-150'),
>>>             'amp2': rna.gi('chr7:10-50')}).to_bed(out, no_header=True)
>>>     # build amplicon resources
>>>     rna.tools.build_amplicon_resources('test_transcriptome', bed_file=...,
>>>         fasta_file=rna.get_resource('ACTB+SOX2_genome'), padding=50, amp_extension=10, out_dir=tmpdir)
>>>     rna.print_dir_tree(tmpdir) # show the created resources

Returns:: log – a dictionary containing information about the created resources
Return type:: dict

rnalib.tools.calculate_mm_profile(bam_file, gff_file, fasta_file, annotation_flavour='gencode', min_cov=10, max_mm_frac=0.1, max_sample=1000000.0, strand_specific=True, out_file=None)[source]

Calculates the mismatch profile of a BAM file from random genes.

Parameters:

bam_file (str) – input BAM file
gff_file (str) – gene annotation GFF file
fasta_file – Reference sequence FASTA file
annotation_flavour (str) – The annotation flavour of the GFF file. Default is ‘gencode’.
min_cov (int) – The minimum coverage required for a position to be considered.
max_mm_frac (float) – The maximum mismatch fraction allowed for a position to be considered.
max_sample (int) – The maximum number of mismatches to count.
strand_specific (bool) – If True, the strand-specific profile is created.
out_file (str) – TSV file to write the mismatch profile to. Deafult: {bam_file}.mm.tsv