rnalib.tools module
Implementation of rnalib command line tools.
- rnalib.tools.prune_tags(bam_in, bam_out, kept_tags=('NH', 'HI', 'AS'))[source]
- Parameters:
bam_in – input bam file
bam_out – output bam file
kept_tags – list of tags to be kept
Example
>>> prune_tags('small.ACTB+SOX2.bam', 'small.ACTB+SOX2.clean.bam')
- rnalib.tools.tag_tc(bam_file, included_chrom=None, snp_vcf_file=None, out_prefix=None, fractional_counts=False, write_density_histogram=True, tags=None, min_mapping_quality=0, flag_filter=3844, min_base_quality=10, reverse_strand=False)[source]
Determines the T/C status per read while masking SNP positions and create density histogram per chrom. Will output two files: a BAM file with the T/C status encoded in the XC tag and a TSV file containing the T/C density per chrom.
- Parameters:
bam_file (
str) – bam file to be analysedincluded_chrom (
list) – list of chromosomes to be included (subset of reference dict)snp_vcf_file (
str) – VCF file containing variant calls from the respective cell line. T/C and A/G SNPs will be maskedout_prefix (
str) – output file prefixfractional_counts (
bool) – if set, reads will be counted by 1/NHwrite_density_histogram (
bool) – if set, a (gzipped) TSV file containing T/C density data will be written. default is Truetags (
dict) – dictionary containing the BAM tags to be used for the output BAM file. Default is {‘ntc’: ‘xc’, ‘ntt’: ‘xt’, ‘col’: ‘YC’}min_mapping_quality (
int) – minimum mapping qualityflag_filter (
int) – flag filter for filtering reads.min_base_quality (
int) – minimum base qualityreverse_strand (
bool) – if set, the reverse strand will be considered as the conversion strand. Default is False.
- rnalib.tools.filter_tc(bam_file, out_file=None, min_tc=1, tags=None)[source]
Filters a T/C annotated BAM file for reads with at least min_tc T/C conversions.
- Parameters:
bam_file (
str) – T/C annotated bam file to be analysedout_file (
str) – output file. Default is {bam_file_name}_tc-only.bammin_tc (
int) – minimum number of T/C conversionstags (
dict) – dictionary containing the BAM tags to be used for the output BAM file. Default is {‘ntc’: ‘xc’, ‘ntt’: ‘xt’, ‘col’: ‘YC’}
- rnalib.tools.build_amplicon_resources(transcriptome_name, bed_file, fasta_file, out_dir, padding=100, amp_extension=100)[source]
Builds resources for amplicon analyses from a bed file.
This tool creates all required genomics resources for downstream analyses, such as the FASTA file with the (possibly padded) sequences, a GFF file that annotates the amplicons in transcriptome coordinates and various indexing data structures.
Intervals are parsed from the bed file (and possibly extended and padded= and sequences are extracted from the fasta file. The sequences are written to a FASTA file. For downstream analyses, a chrom.sizes file, a .dict file, a GFF3 file and a BED file are created. Additionally, a log file is written to the output directory.
The parsed intervals are merged and only one ‘amplicon’ per merged interval is written to the FASTA file. The GFF3 file contains the amplicon coordinates and the BED file contains the amplicon coordinates and the original bed item score and strand information.
The created amplicon sequences may be padded 3’ and 5’ with Ns (needed by some mappers). The amplicon sequences can also be extended 3’ and 5’ which is recommended to allow mappers to map reads that extend beyond the amplicon coordinates. The created FASTA file is indexed with samtools faidx. The GFF3 and BED files are bgzipped and tabix indexed.
- Parameters:
transcriptome_name (
str) – name of the transcriptomebed_file (
str) – bed file containing the amplicon coordinatesfasta_file (
str) – fasta file containing the reference genomeout_dir (
str) – output directorypadding (
int) – padding to be added to the amplicon sequencesamp_extension (
int) – extension of the amplicon sequences
Example
>>> import tempfile >>> with tempfile.TemporaryDirectory() as tmpdir: >>> # create a small BED file with two defined amplicons that map to the genomic positions in the FASTA file >>> with open(f'{tmpdir}/test.bed', 'wt') as out: >>> rna.MemoryIterator({'amp1.1': rna.gi('chr3:1-100'), 'amp1.2': rna.gi('chr3:50-150'), >>> 'amp2': rna.gi('chr7:10-50')}).to_bed(out, no_header=True) >>> # build amplicon resources >>> rna.tools.build_amplicon_resources('test_transcriptome', bed_file=..., >>> fasta_file=rna.get_resource('ACTB+SOX2_genome'), padding=50, amp_extension=10, out_dir=tmpdir) >>> rna.print_dir_tree(tmpdir) # show the created resources
- Returns:
log – a dictionary containing information about the created resources
- Return type:
- rnalib.tools.calculate_mm_profile(bam_file, gff_file, fasta_file, annotation_flavour='gencode', min_cov=10, max_mm_frac=0.1, max_sample=1000000.0, strand_specific=True, out_file=None)[source]
Calculates the mismatch profile of a BAM file from random genes.
- Parameters:
bam_file (
str) – input BAM filegff_file (
str) – gene annotation GFF filefasta_file – Reference sequence FASTA file
annotation_flavour (
str) – The annotation flavour of the GFF file. Default is ‘gencode’.min_cov (
int) – The minimum coverage required for a position to be considered.max_mm_frac (
float) – The maximum mismatch fraction allowed for a position to be considered.max_sample (
int) – The maximum number of mismatches to count.strand_specific (
bool) – If True, the strand-specific profile is created.out_file (
str) – TSV file to write the mismatch profile to. Deafult: {bam_file}.mm.tsv