rnalib.interfaces module

Interfaces to other libraries:

- Archs4Dataset: a class to access Archs4 datasets (https://maayanlab.cloud/archs4).

class rnalib.interfaces.Archs4Dataset(location, transcript_specific=None)[source]

Bases: object

A class to access the Archs4 dataset.

Parameters: location (str) – The file path or S3 URL of the h5 file containing the data. Note that direct access via an S3 bucket is slow and not recommended except for testing.

Examples

>>> from rnalib.interfaces import Archs4Dataset
>>> location = 'data/human_gene_v2.2.h5' # or e.g., 'https://s3.dev.maayanlab.cloud/archs4/files/mouse_gene_v2.2.h5'
>>> with Archs4Dataset(location) as a4: # load the dataset
...     a4.describe() # prints the number of unique values for each metadata field
...     df = a4.get_sample_metadata(filter_string = "readsaligned>5000000") # pandas filtering with query
...     df.groupby('series_id').size().reset_index(name='counts') # a df with GEO series ids and counts
...     df.query("series_id=='GSE124076,GSE222593'") # query from one series (byte strings!)
...     df_sample = df.query("instrument_model.str.contains('HiSeq')").sample(10).index # ids of 10 random HiSeq samples
...     df_cnt = a4.get_counts(samples = df_sample) # get counts for the 10 random samples

describe()[source]: Gets metadata for 1k random samples and prints the number of unique values for each metadata field.

get_meta_keys()[source]: Returns a list of all archs4 sample metadata keys.

get_sample_dict(remove_sc=True)[source]: Returns a dict of GSM ids and sample indices. If remove_sc is True (default), then single cell samples are removed.

get_sample_metadata(filter_string=None, samples=None, cols=('characteristics_ch1', 'data_processing', 'extract_protocol_ch1', 'instrument_model', 'last_update_date', 'library_selection', 'library_source', 'molecule_ch1', 'platform_id', 'readsaligned', 'relation', 'sample', 'series_id', 'singlecellprobability', 'source_name_ch1', 'status', 'submission_date', 'title'), disable_progressbar=False)[source]

Creates a pandas DataFrame with sample metadata for all samples matching the passed filter query. To count the returned samples per series_id, use df.groupby('series_id').size().reset_index(name='counts').

Parameters:

filter_string (str) – A query string to filter the samples by. See pandas.DataFrame.query for details (if None, a sample list must be set).
samples (list) – A list of sample ids to retrieve metadata for (if None, all samples will be considered).
cols (list) – A list of metadata fields to retrieve. If None, all fields will be retrieved.
disable_progressbar (bool) – Whether to disable the progress bar.
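
Because filter_string follows pandas.DataFrame.query syntax, expressions can be tried out on any small DataFrame before running them against the full dataset. The following is a minimal sketch on a stand-in table: the values are made up for illustration and are not real Archs4 records, and the index is only assumed to play the role of the sample ids. Passing engine="python" is a cautious choice so that the string-method query does not depend on numexpr being installed.

```python
import pandas as pd

# Stand-in metadata table (made-up values, NOT real Archs4 records).
# The index plays the role of the sample ids.
meta = pd.DataFrame(
    {
        "series_id": ["GSE1", "GSE1", "GSE2", "GSE3", "GSE3"],
        "instrument_model": [
            "Illumina HiSeq 2500",
            "Illumina HiSeq 2500",
            "Illumina NovaSeq 6000",
            "Illumina HiSeq 4000",
            "NextSeq 500",
        ],
        "readsaligned": [7_500_000, 8_000_000, 3_000_000, 6_200_000, 8_000_000],
    },
    index=["GSM1", "GSM2", "GSM3", "GSM4", "GSM5"],
)

# The same kind of expression that would be passed as filter_string
deep = meta.query("readsaligned > 5000000")

# Number of remaining samples per series
per_series = deep.groupby("series_id").size().reset_index(name="counts")

# String methods inside a query expression
hiseq = deep.query("instrument_model.str.contains('HiSeq')", engine="python")

# Reproducibly pick two samples and keep only their ids
picked = hiseq.sample(n=2, random_state=0).index.tolist()

print(per_series)
print(picked)
```

The list in picked has the shape expected by the samples argument of get_counts in the Examples section.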

get_counts(samples, gene_symbols=None, disable_progressbar=False, row_encoding=None)[source]

Retrieves gene/transcript expression data from the dataset file for the given samples and genes (or transcript ids).

Parameters:

samples (list) – A list of sample ids to retrieve gene expression data for.
gene_symbols (list) – A list of gene symbols to retrieve gene expression data for (if None, all genes will be considered). If this is a transcript specific dataset, then all transcript ids corresponding to this gene symbol are returned.
disable_progressbar (bool) – Whether to disable the progress bar.
row_encoding (str) – The h5 path to the gene or transcript symbols (default depends on whether this is a transcript specific dataset).

Returns:

A pandas DataFrame containing the gene expression data.

Return type:

pd.DataFrame
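
The two retrieval methods are typically chained: filter the metadata first, then request counts for the matching sample ids. The sketch below wraps that pattern in a helper. The helper name counts_for_filter and the FakeDataset class are hypothetical and not part of rnalib; FakeDataset only mimics the documented method signatures so the sketch runs without an h5 file, and its layout (genes as rows, samples as columns) is an assumption rather than a documented property of get_counts.

```python
import pandas as pd


def counts_for_filter(a4, filter_string, gene_symbols=None):
    """Return counts for all samples whose metadata matches filter_string."""
    meta = a4.get_sample_metadata(filter_string=filter_string)
    # The metadata index is used as the list of sample ids,
    # mirroring the Examples section.
    return a4.get_counts(samples=list(meta.index), gene_symbols=gene_symbols)


class FakeDataset:
    """Hypothetical stand-in with made-up values; not part of rnalib."""

    _meta = pd.DataFrame(
        {"readsaligned": [7_000_000, 2_000_000]}, index=["GSM1", "GSM2"]
    )
    # Assumed layout: genes as rows, samples as columns
    _counts = pd.DataFrame(
        {"GSM1": [10, 0], "GSM2": [3, 5]}, index=["TP53", "ACTB"]
    )

    def get_sample_metadata(self, filter_string=None, samples=None):
        return self._meta.query(filter_string) if filter_string else self._meta

    def get_counts(self, samples, gene_symbols=None):
        rows = gene_symbols if gene_symbols is not None else list(self._counts.index)
        return self._counts.loc[rows, list(samples)]


df_cnt = counts_for_filter(
    FakeDataset(), "readsaligned > 5000000", gene_symbols=["TP53"]
)
print(df_cnt)
```

With a real dataset, the same helper would be called inside the with Archs4Dataset(location) as a4: block, passing a4 in place of the stand-in.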