rnalib.interfaces module
Interfaces to other libraries:
- Archs4Dataset: a class to access Archs4 datasets (https://maayanlab.cloud/archs4).
- class rnalib.interfaces.Archs4Dataset(location)
Bases: object
A class to access the Archs4 dataset.
- Parameters:
location (str) – The file path or an S3 URL referencing the h5 file containing the data. Note that direct access via an S3 bucket is slow and not recommended except for testing.
Examples
>>> location = 'data/human_gene_v2.2.h5'  # or e.g., 'https://s3.dev.maayanlab.cloud/archs4/files/mouse_gene_v2.2.h5'
>>> with Archs4Dataset(location) as a4:  # load the dataset
...     a4.describe()  # prints the number of unique values for each metadata field
...     df = a4.get_sample_metadata(filter_string="readsaligned>5000000")  # pandas filtering with query
...     df.groupby('series_id').size().reset_index(name='counts')  # a df with GEO series ids and counts
...     df.query("series_id=='GSE124076,GSE222593'")  # query from one series (byte strings!)
...     df_sample = df.query("instrument_model.str.contains('HiSeq')").sample(10).index  # 10 random HiSeq samples
...     df_cnt = a4.get_counts(samples=df_sample)  # get counts for 10 random samples
- describe()
Gets metadata for 1k random samples and prints the number of unique values for each metadata field.
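The idea behind describe() can be illustrated without the library. The following is a minimal sketch, not rnalib's implementation: it draws a random subset of toy metadata records and counts the distinct values per field. The helper name count_unique_values, the record layout, and all values are assumptions made up for illustration.

```python
import random


def count_unique_values(records, n=1000, seed=0):
    """Count distinct values per field in a random subset of `records`.

    Hypothetical helper for illustration; not part of rnalib.
    """
    rng = random.Random(seed)
    # Never ask for more records than are available.
    subset = rng.sample(records, min(n, len(records)))
    unique = {}
    for rec in subset:
        for field, value in rec.items():
            unique.setdefault(field, set()).add(value)
    return {field: len(values) for field, values in unique.items()}


# Toy metadata records (made-up values).
records = [
    {"instrument_model": "Illumina HiSeq 2000", "library_source": "transcriptomic"},
    {"instrument_model": "Illumina HiSeq 2500", "library_source": "transcriptomic"},
    {"instrument_model": "Illumina HiSeq 2000", "library_source": "transcriptomic"},
]

counts = count_unique_values(records)
for field, n_unique in counts.items():
    print(f"{field}: {n_unique}")
```

Because only a random subset is inspected, counts obtained this way are a lower bound on the number of distinct values in the full dataset.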
- get_sample_dict(remove_sc=True)
Returns a dict of GSM ids and sample indices. If remove_sc is True (default), then single cell samples are removed.
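As a rough illustration of the returned structure, the sketch below builds a mapping from GSM ids to positional sample indices and optionally drops single-cell samples. How rnalib actually identifies single-cell samples is not stated here; thresholding a per-sample single-cell probability at 0.5 is an assumption, as are the helper name build_sample_dict and the toy ids.

```python
def build_sample_dict(gsm_ids, sc_probabilities, remove_sc=True, sc_threshold=0.5):
    """Map GSM ids to their positional index, optionally skipping single-cell samples.

    Hypothetical helper; the threshold-based filter is an assumption, not rnalib's method.
    """
    sample_dict = {}
    for idx, (gsm, p_sc) in enumerate(zip(gsm_ids, sc_probabilities)):
        if remove_sc and p_sc >= sc_threshold:
            continue  # treat as a single-cell sample and skip it
        sample_dict[gsm] = idx  # the index refers to the position in the unfiltered list
    return sample_dict


# Toy data (made-up ids and probabilities).
gsm_ids = ["GSM0000001", "GSM0000002", "GSM0000003"]
sc_probabilities = [0.01, 0.97, 0.10]

bulk_only = build_sample_dict(gsm_ids, sc_probabilities)
all_samples = build_sample_dict(gsm_ids, sc_probabilities, remove_sc=False)
```

Note that the indices of the remaining samples are unchanged by the filtering, so they can still be used to address columns in the original data.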
- get_sample_metadata(filter_string=None, samples=None, cols=('characteristics_ch1', 'data_processing', 'extract_protocol_ch1', 'instrument_model', 'last_update_date', 'library_selection', 'library_source', 'molecule_ch1', 'platform_id', 'readsaligned', 'relation', 'sample', 'series_id', 'singlecellprobability', 'source_name_ch1', 'status', 'submission_date', 'title'), disable_progressbar=False)
Creates a pandas DataFrame with sample metadata for all samples matching the passed filter query. To group the returned data by series_id, use df.groupby('series_id').size().reset_index(name='counts')
- Parameters:
filter_string (str) – A query string to filter the samples by. See pandas.DataFrame.query for details (if None, a sample list must be set).
samples (list) – A list of sample ids to retrieve metadata for (if None, all samples will be considered).
cols (list) – A list of metadata fields to retrieve. If None, all fields will be retrieved.
disable_progressbar (bool) – Whether to disable the progress bar.
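To show what a filter_string does, the sketch below applies the same kind of pandas query to a small hand-made DataFrame instead of a real Archs4 file. The column names are taken from the cols default above; the ids and values are made up, and the frame only stands in for what get_sample_metadata might return.

```python
import pandas as pd

# Toy stand-in for a metadata table (made-up ids and values).
meta = pd.DataFrame(
    {
        "series_id": ["GSE0000001", "GSE0000001", "GSE0000002"],
        "instrument_model": [
            "Illumina HiSeq 2000",
            "Illumina NovaSeq 6000",
            "Illumina HiSeq 2500",
        ],
        "readsaligned": [7_500_000, 3_000_000, 12_000_000],
    },
    index=["GSM0000001", "GSM0000002", "GSM0000003"],
)

# Same expression syntax as a filter_string: keep samples with >5M aligned reads.
filtered = meta.query("readsaligned>5000000")

# Number of remaining samples per GEO series.
per_series = filtered.groupby("series_id").size().reset_index(name="counts")

# Restrict to HiSeq instruments using plain boolean indexing.
hiseq = filtered[filtered["instrument_model"].str.contains("HiSeq")]
```

Boolean indexing is used for the string match here because method calls inside query strings can depend on the evaluation engine pandas selects.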