rnalib.tools module

Implementation of rnalib command line tools.

rnalib.tools.prune_tags(bam_in, bam_out, kept_tags=('NH', 'HI', 'AS'))[source]
Parameters:
  • bam_in – input bam file

  • bam_out – output bam file

  • kept_tags – list of tags to be kept

Example

>>> prune_tags('small.ACTB+SOX2.bam', 'small.ACTB+SOX2.clean.bam')
rnalib.tools.tag_tc(bam_file, included_chrom=None, snp_vcf_file=None, out_prefix=None, fractional_counts=False, write_density_histogram=True, tags=None, min_mapping_quality=0, flag_filter=3844, min_base_quality=10, reverse_strand=False)[source]

Determines the T/C status per read while masking SNP positions and create density histogram per chrom. Will output two files: a BAM file with the T/C status encoded in the XC tag and a TSV file containing the T/C density per chrom.

Parameters:
  • bam_file (str) – bam file to be analysed

  • included_chrom (list) – list of chromosomes to be included (subset of reference dict)

  • snp_vcf_file (str) – VCF file containing variant calls from the respective cell line. T/C and A/G SNPs will be masked

  • out_prefix (str) – output file prefix

  • fractional_counts (bool) – if set, reads will be counted by 1/NH

  • write_density_histogram (bool) – if set, a (gzipped) TSV file containing T/C density data will be written. default is True

  • tags (dict) – dictionary containing the BAM tags to be used for the output BAM file. Default is {‘ntc’: ‘xc’, ‘ntt’: ‘xt’, ‘col’: ‘YC’}

  • min_mapping_quality (int) – minimum mapping quality

  • flag_filter (int) – flag filter for filtering reads.

  • min_base_quality (int) – minimum base quality

  • reverse_strand (bool) – if set, the reverse strand will be considered as the conversion strand. Default is False.

rnalib.tools.quantise_values(values, bins=10)[source]

Quantise values into bins.

Parameters:
  • values (list) – list of values to be quantised

  • bins (int) – number of bins

Returns:

list of quantised values

Return type:

list

rnalib.tools.filter_tc(bam_file, out_file=None, min_tc=1, tags=None)[source]

Filters a T/C annotated BAM file for reads with at least min_tc T/C conversions.

Parameters:
  • bam_file (str) – T/C annotated bam file to be analysed

  • out_file (str) – output file. Default is {bam_file_name}_tc-only.bam

  • min_tc (int) – minimum number of T/C conversions

  • tags (dict) – dictionary containing the BAM tags to be used for the output BAM file. Default is {‘ntc’: ‘xc’, ‘ntt’: ‘xt’, ‘col’: ‘YC’}

rnalib.tools.build_amplicon_resources(transcriptome_name, bed_file, fasta_file, out_dir, padding=100, amp_extension=100)[source]

Builds resources for amplicon analyses from a bed file.

This tool creates all required genomics resources for downstream analyses, such as the FASTA file with the (possibly padded) sequences, a GFF file that annotates the amplicons in transcriptome coordinates and various indexing data structures.

Intervals are parsed from the bed file (and possibly extended and padded= and sequences are extracted from the fasta file. The sequences are written to a FASTA file. For downstream analyses, a chrom.sizes file, a .dict file, a GFF3 file and a BED file are created. Additionally, a log file is written to the output directory.

The parsed intervals are merged and only one ‘amplicon’ per merged interval is written to the FASTA file. The GFF3 file contains the amplicon coordinates and the BED file contains the amplicon coordinates and the original bed item score and strand information.

The created amplicon sequences may be padded 3’ and 5’ with Ns (needed by some mappers). The amplicon sequences can also be extended 3’ and 5’ which is recommended to allow mappers to map reads that extend beyond the amplicon coordinates. The created FASTA file is indexed with samtools faidx. The GFF3 and BED files are bgzipped and tabix indexed.

Parameters:
  • transcriptome_name (str) – name of the transcriptome

  • bed_file (str) – bed file containing the amplicon coordinates

  • fasta_file (str) – fasta file containing the reference genome

  • out_dir (str) – output directory

  • padding (int) – padding to be added to the amplicon sequences

  • amp_extension (int) – extension of the amplicon sequences

Example

>>> import tempfile
>>> with tempfile.TemporaryDirectory() as tmpdir:
>>>     # create a small BED file with two defined amplicons that map to the genomic positions in the FASTA file
>>>     with open(f'{tmpdir}/test.bed', 'wt') as out:
>>>         rna.MemoryIterator({'amp1.1': rna.gi('chr3:1-100'), 'amp1.2': rna.gi('chr3:50-150'),
>>>             'amp2': rna.gi('chr7:10-50')}).to_bed(out, no_header=True)
>>>     # build amplicon resources
>>>     rna.tools.build_amplicon_resources('test_transcriptome', bed_file=...,
>>>         fasta_file=rna.get_resource('ACTB+SOX2_genome'), padding=50, amp_extension=10, out_dir=tmpdir)
>>>     rna.print_dir_tree(tmpdir) # show the created resources
Returns:

log – a dictionary containing information about the created resources

Return type:

dict

rnalib.tools.calculate_mm_profile(bam_file, gff_file, fasta_file, annotation_flavour='gencode', min_cov=10, max_mm_frac=0.1, max_sample=1000000.0, strand_specific=True, out_file=None)[source]

Calculates the mismatch profile of a BAM file from random genes.

Parameters:
  • bam_file (str) – input BAM file

  • gff_file (str) – gene annotation GFF file

  • fasta_file – Reference sequence FASTA file

  • annotation_flavour (str) – The annotation flavour of the GFF file. Default is ‘gencode’.

  • min_cov (int) – The minimum coverage required for a position to be considered.

  • max_mm_frac (float) – The maximum mismatch fraction allowed for a position to be considered.

  • max_sample (int) – The maximum number of mismatches to count.

  • strand_specific (bool) – If True, the strand-specific profile is created.

  • out_file (str) – TSV file to write the mismatch profile to. Deafult: {bam_file}.mm.tsv