scranpy package¶
Module contents¶
- scranpy.aggregate_across_cells(x, factors, num_threads=1)[source]¶
Aggregate expression values across cells based on one or more grouping factors. This is primarily used to create pseudo-bulk profiles for each cluster/sample combination.
- Parameters:
x (
Any) – A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. Values are expected to be counts.factors (
Union[dict,Sequence,NamedList,BiocFrame]) – One or more grouping factors, seecombine_factors(). Each entry should be a sequence of length equal to the number of columns inx.num_threads (
int) – Number of threads to use for aggregation.
- Return type:
- Returns:
A
NamedListcontaining the following entries.sum: double-precision NumPy matrix where each row corresponds to a gene and each column corresponds to a unique combination of grouping levels. Each matrix entry contains the summed expression across all cells with that combination.detected: integer NumPy matrix where each row corresponds to a gene and each column corresponds to a unique combination of grouping levels. Each matrix entry contains the number of cells with detected expression in that combination.combinations: aBiocFramecontaining all unique combinations of levels acrossfactors. Each column corresponds to an entry offactorswhile each row corresponds to a combination. Specifically, thei-th combination is defined as thei-th elements of all columns. Combinations are in the same order as the columns ofsumanddetected.counts: an integer NumPy array containing the number of cells associated with each combination incombinations.index: an Integer NumPy array of length equal to the number of cells. This specifies the combinationcombinationsassociated with each cell inx.
References
The
aggregate_across_cellsfunction in the scran_aggregate C++ library.Examples
>>> import numpy >>> mat = numpy.random.rand(100, 20) >>> import scranpy >>> clusters = ["A", "B", "C", "D"] * 5 >>> blocks = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4 + [5] * 4 >>> aggr = scranpy.aggregate_across_cells(mat, { "clusters": clusters, "blocks": blocks }) >>> aggr["sum"][:5,] >>> print(aggr["combinations"])
- scranpy.aggregate_across_cells_se(x, factors, num_threads=1, more_aggr_args={}, assay_type='counts', output_prefix='factor_', counts_name='counts', meta_name='aggregated', include_coldata=True, more_coldata_args={}, altexps=None, copy_altexps=False)¶
Aggregate expression values across groups of cells for each gene, storing the result in a
SummarizedExperiment. This callsaggregate_across_cells()along withaggregate_column_data().- Parameters:
x – A
SummarizedExperimentobject or one of its subclasses. Rows correspond to genes and columns correspond to cells.factors (
Union[dict,Sequence,NamedList,BiocFrame]) – One or more grouping factors, see the argument of the same name inaggregate_across_cells().num_threads (
int) – Passed toaggregate_across_cells().more_aggr_args (
dict) – Further arguments to pass toaggregate_across_cells().assay_type (
Union[str,int]) – Name or index of the assay ofxto be aggregated.output_prefix (
Optional[str]) – Prefix to add to the names of the columns containing the factor combinations in the column data of the output object. IfNone, no prefix is added.counts_name (
str) – Name of the column in which to store the cell count for each factor combination in the column data of the output object. IfNone, the cell counts are not reported.meta_name (
Optional[str]) – Name of the metadata entry in which to store additional information like the combination indices in the output object. IfNone, additional outputs are not reported.include_coldata (
bool) – Whether to add the aggregated column data fromxto the output. IfTrue, this adds the output ofaggregate_column_data()to the column data of the output object.more_coldata_args (
dict) – Additional arguments to pass toaggregate_column_data(). Only relevant ifinclude_coldata = True.altexps (
Union[str,int,dict,Sequence,NamedList,None]) –List of integers or strings, containing the indices or names of alternative experiments of
xto aggregate. The aggregated assay from each alternative experiment is determined byassay_type.Alternatively, this may be a single integer or string containing the index or name of one alternative experiment to aggregate. Again, the aggregated assay from each alternative experiment is determined by
assay_type.Alternatively, this may be a dictionary where keys are strings and values are integers or strings. Each key should be the name of an alternative experiment while each value is the index/name of the assay to aggregate from that experiment.
Alternatively, a
NamedListof integers or strings. If named, this is treated as a dictionary, otherwise it is treated as a list.Only relevant if
xis aSingleCellExperimentor one of its subclasses.copy_altexps (
bool) – Whether to copy the column data and metadata of the outputSingleCellExperimentinto each of its alternative experiments. Only relevant ifxis aSingleCellExperimentor one of its subclasses.
- Return type:
- Returns:
A
SummarizedExperimentwhere each column corresponds to a factor combination. Each row corresponds to a gene inx, and the row data is taken directly fromx. The assays contain the sum of counts (sum) and the number of detected cells (detected) in each combination for each gene. The column data contains:The factor combinations, with column names prefixed by
output_prefix.The cell count for each combination, named by
counts_name.Additional column data from
xifinclude_coldata = True. This is aggregated withaggregate_column_data`()on the combination indices.
The metadata contains a
meta_nameentry, which is a list with anindexinteger vector of length equal to the number of cells inx. Each entry of this vector is an index of the factor combination (i.e., column of the output object) to which the corresponding cell was assigned.If
altexpsis specified, aSingleCellExperimentis returned instead. The same aggregation described above for the main experiment is applied to each alternative experiment. Ifcopy_altexps = True, the column data for each alternative experiment will contain a copy of the factor combinations and cell counts, and themetadatawill contain a copy of the index vector.
Examples
>>> import scranpy >>> sce = scranpy.get_test_rna_data_se("start") >>> aggr = scranpy.aggregate_across_cells_se(sce, factors=[sce.get_column_data()["level1class"]]) >>> aggr.get_assay(0)[:5,] >>> print(aggr.get_column_data()) >>> >>> # We can also aggregate within alternative experiments. >>> aggr2 = scranpy.aggregate_across_cells_se(sce, factors=[sce.get_column_data()["level1class"]], altexps=["ERCC"]) >>> aggr2.get_alternative_experiment("ERCC").get_assay(0)[:5,]
- scranpy.aggregate_across_genes(x, sets, row_names=None, average=False, num_threads=1)[source]¶
Aggregate expression values across genes, potentially with weights. This is typically used to summarize expression values for gene sets into a single per-cell score.
- Parameters:
x (
Any) – Matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. Values are expected to be log-expression values.sets (
Union[dict,Sequence,NamedList]) –Sequence of gene sets. Each gene set may be represented by:
A sequence of integers, specifying the row indices of the genes in that set.
A sequence of strings, specifying the row names of the genes in that set. If any strings are present,
row_namesshould also be supplied.A tuple of length 2, containing a sequence of strings/integers (row names/indices) and a numeric array (weights).
A
BiocFramewhere each row corresponds to a gene. The first column contains the row names/indices and the second column contains the weights.
Alternatively, a dictionary may be supplied where each key is the name of a gene set and each value is a sequence/tuple as described above. The keys will be used to name the output
NamedList.Alternatively, a
NamedListwhere each entry is a gene set represented by a sequence/tuple as described above. If names are available, they will be used to name the outputNamedList.row_names (
Optional[Sequence]) – Sequence of strings of length equal to the number of rows ofx, containing the name of each gene. Duplicate names are allowed but only the first occurrence will be used. IfNone, rows are assumed to be unnamed.average (
bool) – Whether to compute the average rather than the sum.num_threads (
int) – Number of threads to be used for aggregation.
- Return type:
- Returns:
A
NamedListof length equal to that ofsets. Each entry is a numeric vector of length equal to the number of columns inx, containing the (weighted) sum/mean of expression values for the corresponding set across all cells.
References
The
aggregate_across_genesfunction in the scran_aggregate C++ library, which implements the aggregation.Examples
>>> import numpy >>> mat = numpy.random.rand(100, 20) >>> import scranpy >>> sets = { >>> "foo": [ 1, 3, 5, 7, 9 ], >>> "bar": range(10, 40, 2) >>> } >>> aggr = scranpy.aggregate_across_genes(mat, sets) >>> print(aggr.get_names()) >>> print(aggr[0])
- scranpy.aggregate_across_genes_se(x, sets, num_threads=1, more_aggr_args={}, assay_type='logcounts', output_name=None)[source]¶
Aggregate expression values across sets of genes for each cell. This calls
aggregate_across_genes()on an assay from aSummarizedExperiment.- Parameters:
x (
SummarizedExperiment) – ASummarizedExperimentobject or one of its subclasses. Rows correspond to genes and columns correspond to cells.sets (
Sequence) – Sequence of gene sets, seeaggregate_across_genes()for details.num_threads (
int) – Passed toaggregate_across_genes().more_aggr_args (
dict) – Further arguments to pass toaggregate_across_genes().assay_type (
Union[str,int]) – Name or index of the assay ofxto be aggregated across genes.output_name (
Optional[str]) – String specifying the assay name of the aggregated values in the output object. IfNone, it defaults toassay_typeif that argument is a string, otherwise it is set to"aggregated".
- Return type:
- Returns:
A
SummarizedExperimentwith number of rows equal to the number of gene sets. The lone assay contains the aggregated values for each gene set for all cells. The column data is the same as that ofx. Ifsetsis named, the names are used as the row names of the output.
Examples
>>> import scranpy >>> sce = scranpy.get_test_rna_data_se("norm") >>> sets = { "foo": [ 0, 2, 5, 10 ], "bar": [ 1, 3, 11, 17, 23 ] } >>> aggregated = scranpy.aggregate_across_genes_se(sce, sets) >>> aggregated.get_assay(0)[:,:10]
- scranpy.aggregate_column_data(coldata, index, number, only_simple=True, placeholder=None)¶
Aggregate the column data from a
SummarizedExperimentfor groups of cells.- Parameters:
coldata (
BiocFrame) – ABiocFramecontaining the column data for aSummarizedExperiment. Each row should correspond to a cell.index (
Sequence) – Vector of length equal to the number of cells. Each entry should be the index of the factor combination to which each cell incoldatawas assigned, e.g., the index vector produced bycombine_factors().number (
int) – Total number of unique factor combinations. All elements ofindexshould be less thannumber.only_simple (
bool) – Whether to skip a column ofcoldatathat is not a list, NumPy array,NamedListorFactor.placeholder – Placeholder value to store in the output column when a factor combination does not have a single unique value.
- Return type:
- Returns:
A
BiocFramewith number of rows equal tonumber. Each “simple” column incoldata(i.e., list, NumPy array,NamedListorFactor) is represented by a column in the outputBiocFrame. In each column, thej-th entry is equal to the unique value of all rows whereindex == j, orplaceholderif there is not exactly one unique value. Ifonly_simple = False, any non-simple columns ofcoldataare represented in the outputBiocFrameby a list ofplaceholdervalues. Otherwise, ifonly_simple = True, any non-simple columns ofcoldataare skipped.
Examples
>>> import biocframe >>> df = biocframe.BiocFrame({ >>> "X": ["a", "a", "b", "b", "c", "c"], >>> "Y": [ 1, 1, 1, 2, 2, 2], >>> "Z": [True, False, True, False, True, False] >>> }) >>> import scranpy >>> print(scranpy.aggregate_column_data(df, [0, 0, 1, 1, 2, 2], 3))
- scranpy.analyze(*args, **kwargs)[source]¶
Deprecated, use
analyze_se()instead.- Return type:
- scranpy.analyze_se(x, rna_altexp=False, adt_altexp=None, crispr_altexp=None, rna_assay_type='counts', adt_assay_type='counts', crispr_assay_type='counts', block=None, block_name='block', rna_qc_subsets=[], rna_qc_output_prefix=None, more_rna_qc_args={}, adt_qc_subsets=[], adt_qc_output_prefix=None, more_adt_qc_args={}, crispr_qc_output_prefix=None, more_crispr_qc_args={}, filter_cells=True, rna_norm_output_name='logcounts', more_rna_norm_args={}, adt_norm_output_name='logcounts', more_adt_norm_args={}, crispr_norm_output_name='logcounts', more_crispr_norm_args={}, rna_hvg_output_prefix=None, more_rna_hvg_args={}, rna_pca_output_name='PCA', more_rna_pca_args={}, adt_pca_output_name='PCA', more_adt_pca_args={}, use_rna_pcs=True, use_adt_pcs=True, scale_output_name='combined', more_scale_args={}, mnn_output_name='MNN', more_mnn_args={}, more_umap_args={}, more_tsne_args={}, cluster_graph_output_name='graph_cluster', more_build_graph_args={}, more_cluster_graph_args={}, more_neighbor_args={}, kmeans_clusters=None, kmeans_clusters_output_name='kmeans_cluster', more_kmeans_args={}, clusters_for_markers=['graph', 'kmeans'], more_rna_marker_args={}, more_adt_marker_args={}, more_crispr_marker_args={}, nn_parameters=<knncolle.annoy.AnnoyParameters object>, num_threads=3)[source]¶
Execute a simple single-cell analysis pipeline, starting from a count matrix and ending with clusters, visualizations and markers. This is equivalent to:
Running
quick_rna_qc_se(),quick_adt_qc_se()and/orquick_crispr_qc_se(), for quality control.Subsetting
xto only retain the high-quality cells in all modalities.Running
normalize_rna_counts_se(),normalize_adt_counts_se()and/ornormalize_crispr_counts_se(), for normalization.Running
choose_rna_hvgs_se()to identify highly variable genes.Running
run_pca_se()on the RNA and/or ADT data.Running
scale_by_neighbors_se()if multiple modalities are present.Running
correct_mnn_se()if multiple batches are present.Running
run_all_neighbor_steps_se()to obtain t-SNE and UMAP coordinates, and to perform graph-based clustering.Running
cluster_kmeans_se()to perform k-means clustering, if requested.Running
score_markers_se()to compute markers for each modality based on one of the clusterings.
- Parameters:
x (
SummarizedExperiment) – ASummarizedExperimentobject or one of its subclasses. Rows correspond to genomic features (genes, ADTs, or CRISPR guides) and columns correspond to cells.rna_altexp (
Union[int,str,bool,None]) – Name or index of the alternative experiment ofxcontaining the RNA data. AlternativelyFalse, in which case the main experiment is assumed to contain the RNA data. AlternativelyNone, in which case it is assumed that no RNA data is available.adt_altexp (
Union[int,str,bool,None]) – Name or index of the alternative experiment ofxcontaining the ADT data. AlternativelyFalse, in which case the main experiment is assumed to contain the ADT data. AlternativelyNone, in which case it is assumed that no ADT data is available.crispr_altexp (
Union[int,str,bool,None]) – Name or index of the alternative experiment ofxcontaining the CRISPR data. AlternativelyFalse, in which case the main experiment is assumed to contain the CRISPR data. AlternativelyNone, in which case it is assumed that no CRISPR data is available.rna_assay_type (
Union[str,int]) – Name or index of the assay containing the RNA count data. Only used ifrna_altexpis notNone.adt_assay_type (
Union[str,int]) – Name or index of the assay containing the ADT count data. Only used ifrna_altexpis notNone.crispr_assay_type (
Union[str,int]) – Name or index of the assay containing the CRISPR count data. Only used ifrna_altexpis notNone.block (
Optional[Sequence]) – Blocking factor, as a sequence containing the block of origin (e.g., batch, sample) for each cell inx. AlternativelyNone, if all cells are from the same block.block_name (
Optional[str]) – Name of the column in which to store the blocking factor in the column data of the outputSummarizedExperiment. Only used ifblockis notNone. IfNone, the blocking factor is not stored in the output.rna_qc_subsets (
Union[dict,Sequence,NamedList]) – Passed toquick_rna_qc_se()as thesubsetsargument. Only used ifrna_altexpis notNone.rna_qc_output_prefix (
Optional[str]) – Passed toquick_rna_qc_se()as theoutput_prefixargument. Only used ifrna_altexpis notNone.more_rna_qc_args (
dict) – Additional arguments to pass toquick_rna_qc_se(). Only used ifrna_altexpis notNone.adt_qc_subsets (
Union[dict,Sequence,NamedList]) – Passed toquick_adt_qc_se()as thesubsetsargument. Only used ifadt_altexpis notNone.adt_qc_output_prefix (
Optional[str]) – Passed toquick_adt_qc_se()as theoutput_prefixargument. Only used ifadt_altexpis notNone.more_adt_qc_args (
dict) – Additional arguments to pass toquick_adt_qc_se(). Only used ifadt_altexpis notNone.crispr_qc_output_prefix (
Optional[str]) – Passed toquick_crispr_qc_se()as theoutput_prefixargument. Only used ifcrispr_altexpis notNone.more_crispr_qc_args (
dict) – Additional arguments to pass toquick_crispr_qc_se(). Only used ifcrispr_altexpis notNone.filter_cells (
bool) – Whether to filterxto only retain high-quality cells in all modalities. IfFalseQC metrics and thresholds are still computed but are not used to filter the count matrices.rna_norm_output_name (
str) – Passed tonormalize_rna_counts_se()as theoutput_nameargument. Only used ifrna_altexpis notNone.more_rna_norm_args (
dict) – Additional arguments to pass tonormalize_rna_counts_se(). Only used ifrna_altexpis notNone.adt_norm_output_name (
str) – Passed tonormalize_adt_counts_se()as theoutput_nameargument. Only used ifadt_altexpis notNone.more_adt_norm_args (
dict) – Additional arguments to pass tonormalize_adt_counts_se(). Only used ifadt_altexpis notNone.crispr_norm_output_name (
str) – Passed tonormalize_crispr_counts_se()as theoutput_nameargument. Only used ifcrispr_altexpis notNone.more_crispr_norm_args (
dict) – Additional arguments to pass tonormalize_crispr_counts_se(). Only used ifcrispr_altexpis notNone.rna_hvg_output_prefix (
Optional[str]) – Passed tochoose_rna_hvgs_se()as theoutput_prefixargument. Only used ifrna_altexpis notNone.more_rna_hvg_args (
dict) – Additional arguments to pass tochoose_rna_hvgs_se(). Only used ifrna_altexpis notNone.rna_pca_output_name (
str) – Passed torun_pca_se()as theoutput_nameargument. Only used ifrna_altexpis notNone.more_rna_pca_args (
dict) – Additional arguments to pass torun_pca_se(). Only used ifrna_altexpis notNone.adt_pca_output_name (
str) – Passed torun_pca_se()as theoutput_nameargument. Only used ifadt_altexpis notNone.more_adt_pca_args (
dict) – Additional arguments to pass torun_pca_se(). Only used ifadt_altexpis notNone.use_rna_pcs (
bool) – Whether to use the RNA-derived PCs for downstream steps (i.e., clustering, visualization). Only used ifrna_altexpis notNone.use_adt_pcs (
bool) – Whether to use the ADT-derived PCs for downstream steps (i.e., clustering, visualization). Only used ifadt_altexpis notNone.scale_output_name (
str) – Passed toscale_by_neighbors_se()as theoutput_nameargument. Only used if multiple modalities are available and their correspondinguse_*_pcsarguments areTrue.more_scale_args (
dict) – Additional arguments to pass toscale_by_neighbors_se(). Only used if multiple modalities are available and their correspondinguse_*_pcsarguments areTrue.mnn_output_name (
str) – Passed tocorrect_mnn_se()as theoutput_nameargument. Only used ifblockis notNone.more_mnn_args (
dict) – Additional arguments to pass tocorrect_mnn_se(). Only used ifblockis notNone.more_tsne_args (
dict) – Passed torun_all_neighbor_steps_se().more_umap_args (
dict) – Passed torun_all_neighbor_steps_se().more_build_graph_args (
dict) – Passed torun_all_neighbor_steps_se().cluster_graph_output_name (
str) – Passed torun_all_neighbor_steps_se()ascluster_output_name.more_cluster_graph_args (
dict) – Passed torun_all_neighbor_steps_se().more_neighbor_args (
dict) – Passed torun_all_neighbor_steps_se().kmeans_clusters (
Optional[int]) – Passed tocluster_kmeans_se()as thekargument. IfNone, k-means clustering is not performed.kmeans_clusters_output_name – Passed to
cluster_kmeans_se()as theoutput_nameargument. Ignored ifkmeans_clustersisNone.more_kmeans_args (
dict) – Additional arguments to pass tocluster_kmeans_se(). Ignored ifkmeans_clustersisNone.clusters_for_markers (
Sequence[Literal['graph','kmeans']]) – List of clustering algorithms, specifying the clusters to be used for marker detection. The first available clustering will be chosen. If no clustering is available from the list, markers will not be computed.more_rna_marker_args (
dict) – Additional arguments to pass toscore_markers_se()for the RNA data. Ignored if no suitable clusterings are available or ifrna_altexpisNonemore_adt_marker_args (
dict) – Additional arguments to pass toscore_markers_se()for the ADT data. Ignored if no suitable clusterings are available or ifadt_altexpisNonemore_crispr_marker_args (
dict) – Additional arguments to pass toscore_markers_se()for the CRISPR data. Ignored if no suitable clusterings are available or ifcrispr_altexpisNonenn_parameters – Parameters for the nearest-neighbor search.
num_threads (
int) – Number of threads to use in each step.
- Returns:
x: aSingleCellExperimentthat is a copy of the inputx. It is also decorated with the results of each analysis step.markers: a list of list ofBiocFrameobjects containing the marker statistics for each modality. Each inner list corresponds to a modality (RNA, ADT, etc.) while eachBiocFramecorresponds to a group.
- Return type:
A
NamedListcontaining
Examples
>>> import scranpy >>> sce = scranpy.get_test_rna_data_se() >>> is_mito = [ s.startswith("mt-") for s in sce.get_row_names() ] >>> res = scranpy.analyze_se( >>> sce, >>> rna_qc_subsets={ "mito": is_mito } >>> ) >>> res["x"].get_assay_names() >>> res["x"].get_reduced_dimension_names() >>> print(res["x"].get_column_data()) >>> import biocutils >>> print(biocutils.table(res["x"].get_column_data()["graph_cluster"])) >>> print(scranpy.preview_markers(res["markers"]["rna"][0]))
- scranpy.build_snn_graph(x, num_neighbors=10, weight_scheme='ranked', num_threads=1, nn_parameters=<knncolle.annoy.AnnoyParameters object>)¶
Build a shared nearest neighbor (SNN) graph where each node is a cell. Edges are formed between cells that share one or more nearest neighbors, weighted by the number or importance of those shared neighbors.
- Parameters:
x (
Union[ndarray,FindKnnResults,Index]) –Numeric matrix where rows are dimensions and columns are cells, typically containing a low-dimensional representation from, e.g.,
run_pca().Alternatively, a
FindKnnResultsobject containing existing neighbor search results.Alternatively, a
Indexobject.num_neighbors (
int) – Number of neighbors in the nearest-neighbor graph. Larger values generally result in broader clusters during community detection. Ignored ifxis aFindKnnResultsobject.weight_scheme (
Literal['ranked','number','jaccard']) – Weighting scheme to use for the edges of the SNN graph, based on the number or ranking of the shared nearest neighbors.num_threads (
int) – Number of threads to use.nn_parameters (
Parameters) – The algorithm to use for the nearest-neighbor search. Only used ifxis not a pre-built nearest-neighbor search index or a list of existing nearest-neighbor search results.
- Return type:
- Returns:
A
NamedListcontaining the components of a (possibly weighted) graph.vertices: integer specifying the number of vertices (i.e., cells) in the graph.edges: integer NumPy array containing the graph edges. Pairs of values represent the endpoints of an (undirected) edge, i.e.,edges[0:2]form the first edge,edges[2:4]form the second edge and so on.weights: double-precision NumPy array of edge weights. This has length equal to half the length ofedges; the first weight corresponds to the first edge, and so on.
References
The
build_snn_graphfunction in the scran_graph_cluster C++ library.Examples
>>> import numpy >>> pcs = numpy.random.rand(10, 200) >>> import scranpy >>> graph = scranpy.build_snn_graph(pcs) >>> print(graph.get_names())
- scranpy.center_size_factors(size_factors, block=None, mode='lowest', in_place=False)¶
Center size factors before computing normalized values from the count matrix. This ensures that the normalized values are on the same scale as the original counts for easier interpretation.
- Parameters:
size_factors (
ndarray) – Floating-point array containing size factors for all cells.block (
Optional[Sequence]) – Block assignment for each cell. If provided, this should have length equal to the number of cells, where cells have the same value if and only if they are in the same block. Defaults toNone, where all cells are treated as being part of the same block.mode (
Literal['lowest','per-block']) – How to scale size factors across blocks.lowestwill scale all size factors by the lowest per-block average.per-blockwill center the size factors in each block separately. This argument is only used ifblockis provided.in_place (
bool) – Whether to modifysize_factorsin place. IfFalse, a new array is returned. This argument only used ifsize_factorsis double-precision, otherwise a new array is always returned.
- Return type:
- Returns:
Double-precision NumPy array containing centered size factors. If
in_place = True, this is a reference tosize_factors.
References
The
center_size_factorsandcenter_size_factors_blockedfunctions in the scran_norm C++ library, which describes the rationale behind centering.Examples
>>> import numpy >>> sf = numpy.random.rand(100) >>> import scranpy >>> sf2 = scranpy.center_size_factors(sf) >>> sf2.mean()
- scranpy.choose_highly_variable_genes(stats, top=4000, larger=True, keep_ties=True, bound=None)[source]¶
Choose highly variable genes (HVGs), typically based on a variance-related statistic.
- Parameters:
stats (
ndarray) – Array of variances (or a related statistic) across all genes. Typically the residuals frommodel_gene_variances()used here.top (
int) – Number of top genes to retain. Note that the actual number of retained genes may not be equal totop, depending on the other options.larger (
bool) – Whether larger values ofstatsrepresent more variable genes. If true, HVGs are defined from the largest values ofstats.keep_ties (
bool) – Whether to keep ties at thetop-th most variable gene. This avoids arbitrary breaking of tied values.bound (
Optional[float]) – The lower bound (iflarger = True) or upper bound (otherwise) to be applied tostats. Genes are not considered to be HVGs if they do not pass this bound, even if they are within thetopgenes. Ignored ifNone.
- Return type:
- Returns:
Integer NumPy array containing the indices of genes in
statsthat are considered to be highly variable.
References
The
choose_highly_variable_genesfunction from the scran_variances library, which provides the underlying implementation.Examples
>>> import numpy >>> resids = numpy.random.rand(1000) >>> import scranpy >>> chosen = scranpy.choose_highly_variable_genes(resids, top=100) >>> chosen
- scranpy.choose_pseudo_count(size_factors, quantile=0.05, max_bias=1, min_value=1)[source]¶
Choose a suitable pseudo-count to control the bias introduced by log-transformation of normalized counts.
- Parameters:
- Return type:
- Returns:
Choice of pseudo-count, for use in
normalize_counts().
References
The
choose_pseudo_countfunction in the scran_norm C++ library, which describes the rationale behind the choice of pseudo-count.Examples
>>> import numpy >>> sf = numpy.random.rand(1000) >>> import scranpy >>> sf = scranpy.center_size_factors(sf) >>> pseudo = scranpy.choose_pseudo_count(sf) >>> print(pseudo)
- scranpy.choose_rna_hvgs_se(x, block=None, num_threads=1, more_var_args={}, top=4000, more_choose_args={}, assay_type='logcounts', output_prefix=None, include_per_block=False)¶
Model the mean-variance relationship across genes and choose highly variable genes (HVGs) based on the residuals of the fitted trend.
- Parameters:
x (
SummarizedExperiment) – ASummarizedExperimentobject or any of its subclasses. Rows correspond to genes and columns correspond to cells.block (
Optional[Sequence]) – Block assignment for each cell inx, passed tomodel_gene_variances().num_threads (
int) – Number of threads, passed tomodel_gene_variances().more_var_args (
dict) – Additional arguments to pass tomodel_gene_variances().top (
int) – Number of top genes to use as HVGs, passed tochoose_highly_variable_genes().more_choose_args (
dict) – Additional arguments to pass tochoose_highly_variable_genes().assay_type (
Union[str,int]) – Name or index of the assay ofxcontaining the log-expression data to use for computing variances.output_prefix (
Optional[str]) – Prefix to add to the names of the columns of the row data in which to store the output statistics.include_per_block (
bool) – Whether the per-block statistics should be stored in the row data of the output object. Only relevant ifblockis specified.
- Return type:
- Returns:
A copy of
x, with per-gene variance modelling statistics added to its row data. Anhvgcolumn indicates the genes that were chosen as HVGs. Ifinclude_per_block = Trueandblockis specified, the per-block statistics are stored as a nestedBiocFramein theper_blockcolumn.
Examples
>>> import scranpy >>> sce = scranpy.get_test_rna_data_se("norm") >>> sce = scranpy.choose_rna_hvgs_se(sce, more_var_args={ "use_min_width": True }) >>> import biocutils >>> print(biocutils.table(sce.get_row_data()["hvg"]))
- scranpy.cluster_graph(x, method='multilevel', multilevel_resolution=1, leiden_resolution=1, leiden_objective='modularity', walktrap_steps=4, seed=42)[source]¶
Identify clusters of cells using a variety of community detection methods from a graph where similar cells are connected.
- Parameters:
x (
NamedList) – Components of the graph to be clustered, typically produced bybuild_snn_graph(). Each node of the graph should be a cell.method (
Literal['multilevel','leiden','walktrap']) – Community detection algorithm to use.multilevel_resolution (
float) – Resolution of the clustering whenmethod = "multilevel". Larger values result in finer clusters.leiden_resolution (
float) – Resolution of the clustering whenmethod = "leiden". Larger values result in finer clusters.leiden_objective (
Literal['modularity','cpm','er']) – Objective function to use whenmethod = "leiden".walktrap_steps (
int) – Number of steps to use whenmethod = "walktrap".seed (
int) – Random seed to use formethod = "multilevel"or"leiden".
- Return type:
- Returns:
A
NamedListcontaining the following entries.membership: an integer NumPy array containing the cluster assignment for each vertex, i.e., cell. All values are in [0, N) where N is the total number of clusters.
For
method = "multilevel", the output also contains:levels: a list containing the clustering at each level of the algorithm. Each entry corresponds to one level and is an integer NumPy array that contains the cluster assignment for each cell at that level.modularity: a double-precision NumPy array containing the modularity at each level. This has length equal to that oflevels, and the largest value corresponds to the assignments reported inmembership.
For
method = "walktrap", the output also contains:merges: an integer NumPy matrix with two columns. Each row corresponds to a merge step and specifies the pair of cells or clusters that were merged at that step.modularity: a double-precision NumPy array that contains the modularity score at each merge step.
For
method = "leiden", the output also contains:quality: quality of the clustering. This is the same as the modularity ifleiden_objective = "modularity".
References
https://igraph.org/c/html/latest/igraph-Community.html, for the underlying implementation of each clustering method.
The various
cluster_*functions in the scran_graph_cluster C++ library.Examples
>>> import numpy >>> pcs = numpy.random.rand(10, 200) >>> import scranpy >>> graph = scranpy.build_snn_graph(pcs) >>> clust = scranpy.cluster_graph(graph) >>> import biocutils >>> print(biocutils.table(clust["membership"]))
- scranpy.cluster_graph_se(x, num_neighbors=10, num_threads=1, more_build_args={}, method='multilevel', resolution=None, more_cluster_args={}, reddim_type='PCA', output_name='clusters', meta_name=None, graph_name=None)[source]¶
Construct a shared-nearest neighbor (SNN) graph from an existing low-dimensional embedding, and apply community detection algorithms to obtain clusters of cells. This calls
build_snn_graph()followed bycluster_graph().- Parameters:
x (
SingleCellExperiment) – ASingleCellExperimentor one of its subclasses. Rows correspond to genomic features and columns correspond to cells.num_neighbors (
int) – Number of neighbors for constructing the SNN graph, seebuild_snn_graph()for details.num_threads (
int) – Number of threads for the neighbor search when constructing the SNN graph, seebuild_snn_graph()for details.more_build_args (
dict) – Additional arguments to be passed tobuild_snn_graph().method (
Literal['multilevel','leiden','walktrap']) – Community detection method to be used bycluster_graph().resolution (
Optional[float]) – Clustering resolution to be used bycluster_graph(). This is either passed asmultilevel_resolutionorleiden_resolution, depending onmethod.more_cluster_args (
dict) – Additional arguments to be passed tocluster_graph().reddim_type (
Union[int,str,tuple]) – Name or index of the existing reduced dimension ofxto be used for clustering. Alternatively a tuple, where the first element contains the name of an alternative experiment ofx, and the second element contains the name/index of an embedding in that alternative experiment.output_name (
str) – Name of the column of the column data in which to store the cluster assignments.meta_name (
Optional[str]) – Name of the metadta entry in which to store extra clustering output. IfNone, no extra clustering output is stored.graph_name (
Optional[str]) – Name of the metadta entry in which to store the SNN graph. IfNone, the graph is not stored.
- Return type:
- Returns:
xis returned with the cluster assignment for each cell stored in its column data. Additional clustering output is stored in its metadata.
Examples
>>> import scranpy >>> sce = scranpy.get_test_rna_data_se("pca") >>> sce = scranpy.cluster_graph_se(sce) >>> import biocutils >>> print(biocutils.table(sce.get_column_data()["clusters"]))
- scranpy.cluster_kmeans(x, k, init_method='var-part', refine_method='hartigan-wong', var_part_optimize_partition=True, var_part_size_adjustment=1, lloyd_iterations=100, hartigan_wong_iterations=10, hartigan_wong_quick_transfer_iterations=50, hartigan_wong_quit_quick_transfer_failure=False, seed=5489, num_threads=1)[source]¶
Perform k-means clustering with a variety of different initialization and refinement algorithms.
- Parameters:
x (
ndarray) – Input data matrix where rows are dimensions and columns are observations (i.e., cells).k (
int) – Number of clusters.init_method (
Literal['var-part','kmeans++','random']) – Initialization method for defining the initial centroid coordinates. Choices are variance partitioning (var-part), kmeans++ (kmeans++) or random initialization (random).refine_method (
Literal['hartigan-wong','lloyd']) – Method to use to refine the cluster assignments and centroid coordinates. Choices are Lloyd’s algorithm (lloyd) or the Hartigan-Wong algorithm (hartigan-wong).var_part_optimize_partition (
bool) – Whether each partition boundary should be optimized to reduce the sum of squares in the child partitions. Only used ifinit_method = "var-part".var_part_size_adjustment (
float) – Floating-point value between 0 and 1, specifying the adjustment to the cluster size when prioritizing the next cluster to partition. Setting this to 0 will ignore the cluster size while setting this to 1 will generally favor larger clusters. Only used ifinit_method = "var-part".lloyd_iterations (
int) – Maximmum number of iterations for the Lloyd algorithm.hartigan_wong_iterations (
int) – Maximmum number of iterations for the Hartigan-Wong algorithm.hartigan_wong_quick_transfer_iterations (
int) – Maximmum number of quick transfer iterations for the Hartigan-Wong algorithm.hartigan_wong_quit_quick_transfer_failure (
bool) – Whether to quit the Hartigan-Wong algorithm upon convergence failure during quick transfer iterations.seed (
int) – Seed to use for random or kmeans++ initialization.num.threads – Number of threads to use.
- Return type:
- Returns:
A
NamedListcontaining the following entries.clusters: an integer NumPy array containing the cluster assignment for each cell. Values are integers in [0, N) where N is the total number of clusters.centers: a double-precision NumPy matrix containing the coordinates of the cluster centroids. Dimensions are in the rows while centers are in the columns.iterations: integer specifying the number of refinement iterations that were performed.status: convergence status. Any non-zero value indicates a convergence failure though the exact meaning depends on the choice ofrefine_method.For Lloyd, a value of 2 indicates convergence failure.
For Hartigan-Wong, a value of 2 indicates convergence failure in the optimal transfer iterations. A value of 4 indicates convergence failure in the quick transfer iterations when
hartigan_wong_quit_quick_transfer_failure = True.
References
https://ltla.github.io/CppKmeans, which describes the various initialization and refinement algorithms in more detail.
Examples
>>> import numpy >>> pcs = numpy.random.rand(10, 200) >>> import scranpy >>> clust = scranpy.cluster_kmeans(pcs, k=3) >>> import biocutils >>> print(biocutils.table(clust["clusters"]))
- scranpy.cluster_kmeans_se(x, k, num_threads=1, more_kmeans_args={}, reddim_type='PCA', output_name='clusters', meta_name=None)[source]¶
Perform k-means clustering on an existing low-dimensional embedding. This calls
cluster_kmeans()on reduced dimensions from aSingleCellExperiment.- Parameters:
x (
SingleCellExperiment) – ASingleCellExperimentobject or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.k (
int) – Number of clusters, to be passed tocluster_kmeans().num_threads (
int) – Number of threads, to be passed tocluster_kmeans().more_kmeans_args (
dict) – Additional arguments to be passed tocluster_kmeans().reddim_type (
Union[str,int,tuple]) – Name or index of an existing embedding inx, on which to perform the clustering. Alternatively a tuple, where the first element contains the name of an alternative experiment ofx, and the second element contains the name/index of an embedding in that alternative experiment.output_name (
str) – Name of the column of the column data in which to store the cluster assignments.meta_name (
Optional[str]) – Name of the metadta entry in which to store extra clustering output. IfNone, no extra clustering output is stored.
- Return type:
- Returns:
A copy of
x, with the cluster assignment for each cell stored in its column data. Additional clustering output is stored in its metadata.
Examples
>>> import scranpy >>> sce = scranpy.get_test_rna_data_se("pca") >>> sce = scranpy.cluster_kmeans_se(sce, k=10) >>> import biocutils >>> print(biocutils.table(sce.get_column_data()["clusters"]))
- scranpy.combine_factors(factors, keep_unused=False)¶
Combine multiple categorical factors based on the unique combinations of levels from each factor.
- Parameters:
factors (
Union[dict,Sequence,NamedList,BiocFrame]) –Sequence of factors of interest. Each entry corresponds to a factor and should be a sequence of the same length. Corresponding elements across all factors represent the combination of levels for a single observation.
Alternatively, a dictionary where each entry corresponds to a factor. Each key is the name of the factor and value is a sequence containing the factor.
Alternatively, a
NamedListwhere each entry is a factor. This may or may not be named.Alternatively, a
BiocFramewhere each column is a factor.keep_unused (
bool) – Whether to report unused combinations of levels. If any entry offactorsis aFactorobject, any unused levels will also be preserved.
- Return type:
- Returns:
NamedListcontaining the following entries.levels: aBiocFramecontaining the sorted and unique combinations of levels as a tuple. Each column corresponds to a factor infactorswhile each row represents a unique combination. Corresponding elements of each column define a single combination, i.e., thei-th combination is defined by taking thei-th element of each column.index: an integer NumPy array specifying the index intolevelsfor each observation.
For observation
iand factorj,levels[j][index[i]]will recoverfactors[j][i].
References
The
combine_factorsfunction in the scran_aggregate library.Examples
>>> import scranpy >>> import random >>> x = random.choices(["A", "B", "C"], k=20) >>> y = random.choices([True, False], k = 20) >>> combined = scranpy.combine_factors({ "foo": x, "bar": y }) >>> print(combined["levels"]) >>> import biocutils >>> print(biocutils.table(combined["index"]))
- scranpy.compute_adt_qc_metrics(x, subsets, row_names=None, num_threads=1)[source]¶
Compute quality control metrics from ADT count data.
- Parameters:
x (
Any) – A matrix-like object containing ADT counts. Each row should correspond to an ADT while each column should correspond to a cell.subsets (
Union[dict,NamedList]) –Subsets of ADTs corresponding to control features like IgGs. This may be either:
A list of arrays. Each array corresponds to an ADT subset and can either contain boolean or integer values.
For booleans, the sequence should be of length equal to the number of rows, and values should be truthy for rows that belong in the subset. If the sequence contains booleans, it should not contain any other type.
For integers, the value is the row index of an ADT in the subset.
For strings, the value is the name of an ADT in the subset. This should match at least one element in
row_names.
A dictionary where keys are the names of each ADT subset and the values are arrays as described above.
A
NamedListwhere each element is an array as described above, possibly with names.
row_names (
Optional[Sequence]) – Sequence of strings of length equal to the number of rows ofx, containing the name of each ADT. Duplicate names are allowed but only the first occurrence will be used. IfNone, rows are assumed to be unnamed.num_threads (
int) – Number of threads to use.
- Return type:
- Returns:
A
BiocFramewith number of rows equal to the number of cells (i.e., columns) inx. It contains the following columns:sum, a double-precision NumPy array containing the sum of counts across all ADTs for each cell.detected, an integer NumPy array containing the number of detected ADTs in each cell.subset_sum, a nestedBiocFramewith one column per subset insubsets. Each column is a double-precision NumPy array that contains the sum of counts for the corresponding subset in each cell.
References
The
compute_adt_qc_metricsfunction in the scran_qc C++ library, which describes the rationale behind these QC metrics.Examples
>>> import numpy >>> mat = numpy.reshape(numpy.random.poisson(lam=5, size=1000), (50, 20)) >>> import scranpy >>> res = scranpy.compute_adt_qc_metrics(mat, { "IgG": [ 1, 10, 20, 40 ] }) >>> print(res)
- scranpy.compute_clrm1_factors(x, num_threads=1)[source]¶
Compute size factors from an ADT count matrix using the CLRm1 method.
- Parameters:
- Return type:
- Returns:
Double-precision NumPy array containing the CLRm1 size factor for each cell. Note that these size factors are not centered and should be passed through, e.g.,
center_size_factors()before normalization.
References
https://github.com/libscran/clrm1, for a description of the CLRm1 method.
- Examples
>>> import numpy >>> counts = numpy.random.poisson(lam=5, size=1000).reshape(20, 50) >>> import scranpy >>> sf = scranpy.compute_clrm1_factors(counts) >>> print(scranpy.center_size_factors(sf))
- scranpy.compute_crispr_qc_metrics(x, num_threads=1)[source]¶
Compute quality control metrics from CRISPR count data.
- Parameters:
- Return type:
- Returns:
A
BiocFramewith number of rows equal to the number of cells (i.e., columns) inx. It contains the following columns.sum, a double-precision NumPy array containing the sum of counts across all guides for each cell.detected, an integer NumPy array containing the number of guides with non-zero counts in each cell.max_value, a double-precision NumPy array containing the maximum count for each cell.max_index, an integer NumPy array containing the row index of the guide with the maximum count in each cell.
References
The
compute_crispr_qc_metricsfunction in the scran_qc C++ library, which describes the rationale behind these QC metrics.Examples
>>> import numpy >>> mat = numpy.reshape(numpy.random.poisson(lam=5, size=1000), (50, 20)) >>> import scranpy >>> res = scranpy.compute_crispr_qc_metrics(mat) >>> print(res)
- scranpy.compute_rna_qc_metrics(x, subsets, row_names=None, num_threads=1)[source]¶
Compute quality control metrics from RNA count data.
- Parameters:
x (
Any) – A matrix-like object containing RNA counts. Each row should correspond to a gene while each column should correspond to a cell.subsets (
Union[dict,Sequence,NamedList]) –Subsets of genes corresponding to “control” features like mitochondrial genes. This may be either:
A list of sequences. Each sequence corresponds to a gene subset and can contain booleans, integers or strings.
For booleans, the sequence should be of length equal to the number of rows, and values should be truthy for rows that belong in the subset. If the sequence contains booleans, it should not contain any other type.
For integers, the value is the row index of a gene in the subset.
For strings, the value is the name of a gene in the subset. This should match at least one element in
row_names.
A dictionary where keys are the names of each gene subset and the values are arrays as described above.
A
NamedListwhere each element is an array as described above, possibly with names.
row_names (
Optional[Sequence]) – Sequence of strings of length equal to the number of rows ofx, containing the name of each gene. Duplicate names are allowed but only the first occurrence will be used. IfNone, rows are assumed to be unnamed.num_threads (
int) – Number of threads to use.
- Return type:
- Returns:
A
BiocFramewith number of rows equal to the number of cells (i.e., columns) inx. It contains the following columns:sum: a double-precision NumPy array containing the sum of counts across all genes for each cell.detected: an integer NumPy array containing the number of genes with non-zero expression in each cell.subset_proportion: a nestedBiocFramewith one column per subset insubsets. Each column is a double-precision NumPy array that contains the proportion of counts in the corresponding subset in each cell.
References
The
compute_rna_qc_metricsfunction in the scran_qc C++ library, which describes the rationale behind these QC metrics.Examples
>>> import numpy >>> mat = numpy.reshape(numpy.random.poisson(lam=5, size=1000), (50, 20)) >>> import scranpy >>> res = scranpy.compute_rna_qc_metrics(mat, { "mito": [ 1, 10, 20, 40 ] }) >>> print(res)
- scranpy.compute_rna_qc_metrics_with_altexps(x, subsets, altexp_proportions=None, num_threads=1, assay_type='counts')¶
Compute RNA-related QC metrics for the main and alternative experiments of a
SingleCellExperiment. This callscompute_rna_qc_metrics()on each experiment.- Parameters:
x (
SummarizedExperiment) – ASummarizedExperimentobject or one of its subclasses. Rows correspond to genes and columns correspond to cells.subsets (
Union[Mapping,Sequence]) – List of subsets of control genes, passed tocompute_rna_qc_metrics()for the main experiment.altexp_proportions (
Union[str,int,dict,Sequence,NamedList,None]) –Indices or names of alternative experiments for which to compute QC metrics. This is typically used to refer to alternative experiments holding spike-in data. An additional proportion is computed for each alternative experiment relative to the main experiment, i.e., X/(X+Y) where X is the alternative experiment’s total and Y is the main experiment’s total. These proportions will be used for filtering in the same manner as the proportions computed from
subsets.If a list or unnamed
NamedListis supplied, it should contain the indices/names of alternative experiments for which to compute QC metrics. The assay to use from each alternative experiment is determined byassay_type.If a string or integer is supplied, it is assumed to be a name/index of one alternative experiment for which to compute QC metrics. Again, the assay to use from each alternative experiment is determined by
assay_type.If a dictionary or named
NamedListis supplied, each key is a string specifying the name of an alternative experiment for which to compute QC metrics, while each value is an integer/string specifying the index/name of the assay to use from that experiment.This option is only relevant if
xis aSingleCellExperiment.num_threads (
int) – Number of threads, passed tocompute_rna_qc_metrics().assay_type (
Union[int,str]) – Index or name of the assay ofxcontaining the ADT count matrix.
- Returns:
main: aBiocFramecontaining QC statistics for the main experiment, seecompute_rna_qc_metrics()for details. The proportion of counts for each alternative experiment inaltexp_proportionsis stored in thesubset_proportionscolumn.altexps: aNamedListwith one entry per alternative experiment listed inaltexp_proportions. Each entry is named after its corresponding alternative experiment and is aBiocFrameof QC statistics for that experiment.
- Return type:
A
NamedListcontaining
Examples
>>> import scranpy >>> sce = scranpy.get_test_rna_data_se() >>> is_mito = list(y.startswith("mt-") for y in sce.get_row_names()) >>> res = scranpy.compute_rna_qc_metrics_with_altexps(sce, subsets={ "mito": is_mito }, altexp_proportions="ERCC") >>> print(res["main"]) >>> print(res["main"]["subset_proportion"]) >>> print(res["altexps"][0])
- scranpy.correct_mnn(x, block, num_neighbors=15, num_steps=1, merge_policy='rss', num_mads=None, robust_iterations=None, robust_trim=None, mass_cap=None, order=None, reference_policy=None, nn_parameters=<knncolle.annoy.AnnoyParameters object>, num_threads=1)[source]¶
Perform mutual nearest neighbor (MNN) correction to remove batch effects from a matrix of low-dimensional embeddings.
- Parameters:
x (
ndarray) – Matrix of coordinates where rows are dimensions and columns are cells, typically generated byrun_pca().block (
Sequence) – Factor specifying the block of origin (e.g., batch, sample) for each cell. Length should equal the number of columns inx.num_neighbors (
int) – Number of neighbors to use when identifying MNN pairs.num_steps (
int) – Number of steps for the recursive neighbor search to compute the center of mass. Larger values mitigate the kissing effect but increase the risk of including inappropriately distant subpopulations into the center of mass.merge_policy (
Literal['rss','size','variance','input']) –Policy for choosing the merge order.
inputuses the input order of the batches. Observations in the last batch are corrected first, and then the second-last batch, and so on. This allows users to control the merge order by simply changing the inputs.sizemerges batches in order of increasing size (i.e., the number of observations). So, the smallest batch is corrected first while the largest batch is unchanged. The aim is to lower compute time by reducing the number of observations that need to be reprocessed in later merge steps.variancemerges batches in order of increasing variance between observations. So, the batch with the lowest variance is corrected first while the batch with the highest variance is unchanged. The aim is to lower compute time by encouraging more observations to be corrected to the most variable batch, thus avoid reprocessing in later merge steps.rssmerges batches in order of increasing residual sum of squares (RSS). This is effectively a compromise betweenvarianceandsize.
order (
Optional[Sequence]) – Deprecated and ignored. Now the merge order is always determinted automatically.reference_policy (
Optional[str]) – Deprecated, seemerge_policyinstead.nn_parameters (
Parameters) – The nearest-neighbor algorithm to use.num_threads (
int) – Number of threads to use.
- Return type:
- Returns:
A
NamedListcontaining the following entries.corrected, a double-precision NumPy array of the same dimensions asx, containing the corrected values.
References
https://libscran.github.io/mnncorrect, which describes the MNN correction algorithm in more detail.
Examples
>>> import numpy >>> pcs = numpy.random.rand(10, 200) >>> block = ["A", "B"] * 100 >>> import scranpy >>> mnn_out = scranpy.correct_mnn(pcs, block) >>> mnn_out["corrected"].shape
- scranpy.correct_mnn_se(x, block, nn_parameters=<knncolle.annoy.AnnoyParameters object>, num_threads=1, more_mnn_args={}, reddim_type='PCA', output_name='MNN')[source]¶
Correct batch effects from an existing embedding with mutual nearest neighbors (MNNs). This calls
correct_mnn()on reduced dimensions from aSingleCellExperiment.- Parameters:
x (
SingleCellExperiment) – ASingleCellExperimentobject or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.block (
Sequence) – Block assignment of each cell, seecorrect_mnn()for more details.nn_parameters (
Parameters) – Algorithm for nearest neighbor search, seecorrect_mnn()for more details.num_threads (
int) – Number of threads, seecorrect_mnn()for more details.more_mnn_args (
dict) – Additional arguments to pass tocorrect_mnn().reddim_type (
Union[str,int,tuple]) – Name or index of an existing embedding inx, on which to perform the batch correction. Alternatively a tuple, where the first element contains the name of an alternative experiment ofx, and the second element contains the name/index of an embedding in that alternative experiment.output_name (
str) – Name of the reduced dimension entry which to store the corrected coordinates.
- Return type:
- Returns:
A copy of
x, with the corrected embedding stored as a reduced dimension entry.
Examples
>>> import scranpy >>> sce = scranpy.get_test_rna_data_se("pca") >>> # Treating the tissue of origin as the batch. >>> sce = scranpy.correct_mnn_se(sce, block=sce.get_column_data()["tissue"]) >>> sce.get_reduced_dimension_names()
- scranpy.filter_adt_qc_metrics(thresholds, metrics, block=None)[source]¶
Filter for high-quality cells based on ADT-derived QC metrics.
- Parameters:
thresholds (
NamedList) – Filter thresholds on the QC metrics, typically computed withsuggest_adt_qc_thresholds().metrics (
BiocFrame) – ADT-derived QC metrics, typically computed withcompute_adt_qc_metrics().block (
Optional[Sequence]) – Blocking factor specifying the block of origin (e.g., batch, sample) for each cell inmetrics. The levels should be a subset of those used insuggest_adt_qc_thresholds().
- Return type:
- Returns:
A boolean NumPy vector of length equal to the number of cells in
metrics, containing truthy values for putative high-quality cells.
References
The
AdtQcFiltersandAdtQcBlockedFiltersfunctions in the scran_qc C++ library.Examples
>>> import numpy >>> mat = numpy.reshape(numpy.random.poisson(lam=5, size=1000), (50, 20)) >>> import scranpy >>> res = scranpy.compute_adt_qc_metrics(mat, { "IgG": [ 1, 10, 20, 40 ] }) >>> filt = scranpy.suggest_adt_qc_thresholds(res) >>> keep = scranpy.filter_adt_qc_metrics(filt, res) >>> import biocutils >>> print(biocutils.table(keep))
- scranpy.filter_crispr_qc_metrics(thresholds, metrics, block=None)[source]¶
Filter for high-quality cells based on CRISPR-derived QC metrics.
- Parameters:
thresholds (
NamedList) – Filter thresholds on the QC metrics, typically computed withsuggest_crispr_qc_thresholds().metrics (
BiocFrame) – CRISPR-derived QC metrics, typically computed withcompute_crispr_qc_metrics().block (
Optional[Sequence]) – Blocking factor specifying the block of origin (e.g., batch, sample) for each cell inmetrics. The levels should be a subset of those used insuggest_crispr_qc_thresholds().
- Return type:
- Returns:
A boolean NumPy vector of length equal to the number of cells in
metrics, containing truthy values for putative high-quality cells.
References
The
CrisprQcFiltersandCrisprQcBlockedFiltersfunctions in the scran_qc C++ library.Examples
>>> import numpy >>> mat = numpy.reshape(numpy.random.poisson(lam=5, size=1000), (50, 20)) >>> import scranpy >>> res = scranpy.compute_crispr_qc_metrics(mat) >>> filt = scranpy.suggest_crispr_qc_thresholds(res) >>> keep = scranpy.filter_crispr_qc_metrics(filt, res) >>> keep.sum()
- scranpy.filter_rna_qc_metrics(thresholds, metrics, block=None)[source]¶
Filter for high-quality cells based on RNA-derived QC metrics.
- Parameters:
thresholds (
NamedList) – Filter thresholds on the QC metrics, typically computed withsuggest_rna_qc_thresholds().metrics (
BiocFrame) – RNA-derived QC metrics, typically computed withcompute_rna_qc_metrics().block (
Optional[Sequence]) – Blocking factor specifying the block of origin (e.g., batch, sample) for each cell inmetrics. The levels should be a subset of those used insuggest_rna_qc_thresholds().
- Return type:
- Returns:
A boolean NumPy vector of length equal to the number of cells in
metrics, containing truthy values for putative high-quality cells.
References
The
RnaQcFiltersandRnaQcBlockedFiltersfunctions in the scran_qc C++ library.Examples
>>> import numpy >>> mat = numpy.reshape(numpy.random.poisson(lam=5, size=1000), (50, 20)) >>> import scranpy >>> res = scranpy.compute_rna_qc_metrics(mat, { "mito": [ 1, 10, 20, 40 ] }) >>> filt = scranpy.suggest_rna_qc_thresholds(res) >>> keep = scranpy.filter_rna_qc_metrics(filt, res) >>> print(biocutils.table(keep))
- scranpy.fit_variance_trend(mean, variance, mean_filter=True, min_mean=0.1, transform=True, span=0.3, use_min_width=False, min_width=1, min_window_count=200, num_threads=1)[source]¶
Fit a trend to the per-gene variances with respect to the mean.
- Parameters:
mean (
ndarray) – Array containing the mean (log-)expression for each gene.variance (
ndarray) – Array containing the variance in the (log-)expression for each gene. This should have length equal tomean.mean_filter (
bool) – Whether to filter on the means before trend fitting.min_mean (
float) – The minimum mean of genes to use in trend fitting. Only used ifmean_filter = True.transform (
bool) – Whether a quarter-root transformation should be applied before trend fitting.span (
float) – Span of the LOWESS smoother. Ignored ifuse_min_width = True.use_min_width (
bool) – Whether a minimum width constraint should be applied to the LOWESS smoother. This is useful to avoid overfitting in high-density intervals.min_width (
float) – Minimum width of the window to use whenuse_min_width = True.min_window_count (
int) – Minimum number of observations in each window. Only used ifuse_min_width = True.num_threads (
int) – Number of threads to use.
- Return type:
- Returns:
A
NamedListcontaining the following entries.fitted: a double-precision NumPy array of length equal tomean, containing the fitted value of the trend for each gene.residual: a double-precision NumPy array of length equal tomean, containing the residual from the trend for each gene.
References
The
fit_variance_trendfunction in the scran_variances C++ library, for the underlying implementation.Examples
>>> import numpy >>> # Mock up a typical mean-variance relationship for count data. >>> mean = numpy.random.exponential(size=1000) >>> variance = mean / (1 + mean**2) * 2**(numpy.random.normal(size=len(mean)) / 10) >>> >>> import scranpy >>> fit = scranpy.fit_variance_trend(mean, variance) >>> print(fit["fitted"]) >>> print(fit["residual"])
- scranpy.format_compute_adt_qc_metrics_result(df, flatten=True)¶
Pretty-format the results of
compute_adt_qc_metrics().- Parameters:
df (
BiocFrame) – Result ofcompute_adt_qc_metrics().flatten (
bool) – Whether to flatten the nestedBiocFrameof subset sums.
- Return type:
- Returns:
A
BiocFramecontaining per-cell QC statistics. Ifflatten = True, the subset sums are stored as top-level columns with namesubset_sum_<SUBSET>where<SUBSET>is the name of the subset.
Examples
>>> import scranpy >>> sce = scranpy.get_test_adt_data_se().get_alternative_experiment("ADT") >>> is_igg = list(y.find("IgG") >= 0 for y in sce.get_row_names()) >>> qc = scranpy.compute_adt_qc_metrics(sce.get_assay(0), subsets={ "igg": is_igg }) >>> print(scranpy.format_compute_adt_qc_metrics_result(qc))
- scranpy.format_compute_crispr_qc_metrics_result(df)¶
Pretty-format the results of
compute_crispr_qc_metrics().- Parameters:
df (
BiocFrame) – Result ofcompute_crispr_qc_metrics().- Return type:
- Returns:
A
BiocFramecontaining per-cell QC statistics.
Examples
>>> import scranpy >>> sce = scranpy.get_test_crispr_data_se().get_alternative_experiment("CRISPR Guide Capture") >>> df = scranpy.compute_crispr_qc_metrics(sce.get_assay(0)) >>> print(scranpy.format_compute_crispr_qc_metrics_result(df))
- scranpy.format_compute_rna_qc_metrics_result(df, flatten=True)¶
Pretty-format the results of
compute_rna_qc_metrics().- Parameters:
df (
BiocFrame) – Result ofcompute_rna_qc_metrics().flatten (
bool) – Whether to flatten the nestedBiocFrameof subset proportions.
- Return type:
- Returns:
A
BiocFramecontaining per-cell QC statistics. Ifflatten = True, the subset proportions are stored as top-level columns with namesubset_proportion_<SUBSET>where<SUBSET>is the name of the subset.
Examples
>>> import scranpy >>> sce = scranpy.get_test_rna_data_se() >>> is_mito = list(y.startswith("mt-") for y in sce.get_row_names()) >>> qc = scranpy.compute_rna_qc_metrics(sce.get_assay(0), subsets={ "mito": is_mito }) >>> print(scranpy.format_compute_rna_qc_metrics_result(qc))
- scranpy.format_score_markers_result(res, extra_columns=None, order_by=True, row_names=None)¶
Reformat the output of
score_markers()into a list of per-groupBiocFrameobjects.- Parameters:
res – Results of
score_markers().extra_columns (
Union[str,Sequence,BiocFrame,None]) – ABiocFramewith the same number of rows asx, containing extra columns to add eachBiocFrame.order_by (
Union[str,bool,None]) – Name of the column to use for ordering the rows of each outputBiocFrame. AlternativelyTrue, in which case a column is automatically chosen from the effect size summaries. IfNoneorFalse, no ordering is performed.row_names (
Optional[Sequence]) – Sequence of strings containing the row names to add to eachBiocFrame. This should correspond to the gene names corresponding to the rows ofxused inscore_markers().
- Return type:
- Returns:
A
NamedListofBiocFrameobjects. EachBiocFramecorresponds to a unique group ingroups. Each row of eachBiocFramecontains statistics for a gene inx. EachBiocFramecontains the following columns:mean, the mean expression in the current group.detected, the proportion of cells with detected expression in the current group.<effect>_<summary>, a summary statistic for an effect size, e.g.,cohens_d_meancontains the mean Cohen’s d across comparisons involving the current group.
Examples
>>> import scranpy >>> sce = scranpy.get_test_rna_data_se("cluster") >>> raw_markers = scranpy.score_markers(sce.get_assay("logcounts"), sce.get_column_data()["clusters"]) >>> formatted = scranpy.format_score_markers_result(raw_markers, row_names=sce.get_row_names())
- scranpy.get_test_adt_data_se(at='start')¶
Get a CITE-seq dataset with varying levels of processing. This contains human PBMCs obtained with
scrnaseq.fetch_dataset("kotliarov-pbmc-2020", "2024-04-18"). The main experiment contains RNA counts and the alternative experiment contains ADT counts.- Parameters:
at (
Literal['start','qc','norm','hvg','pca']) – The desired level of processing. Forstart, no processing is performed. Otherwise, the dataset is returned after quality control (qc), normalization (norm), feature selection (hvg) or PCA (pca).- Return type:
- Returns:
The dataset as a
SingleCellExperiment. This uses caching to avoid recomputation.
Examples
>>> import scranpy >>> sce = scranpy.get_test_adt_data_se()
- scranpy.get_test_crispr_data_se(at='start')¶
Get a Perturb-seq dataset with varying levels of processing. This contains a pancreatic beta cell line obtained with
scrnaseq.fetch_dataset("cao-pancreas-2025", "2025-10-10", "rqc"). The main experiment contains RNA counts and the alternative experiment contains CRISPR guide counts.- Parameters:
at (
Literal['start','qc']) – The desired level of processing. Forstart, no processing is performed. Otherwise, the dataset is returned after quality control (qc), normalization (norm), feature selection (hvg) or PCA (pca).- Return type:
- Returns:
The dataset as a
SingleCellExperiment. This uses caching to avoid recomputation.
Examples
>>> import scranpy >>> sce = scranpy.get_test_crispr_data_se()
- scranpy.get_test_rna_data_se(at='start')¶
Get a single-cell RNA-seq dataset with varying levels of processing. This contains cells from the mouse brain, obtained with
scrnaseq.fetch_dataset("zeisel-brain-2015", "2023-12-14"). The main experiment contains RNA counts and the alternative experiments contain ERCC and repeat element counts.- Parameters:
at (
Literal['start','qc','norm','hvg','pca','cluster']) – The desired level of processing. Forstart, no processing is performed. Otherwise, the dataset is returned after quality control (qc), normalization (norm), feature selection (hvg), PCA (pca) or clustering (cluster).- Return type:
- Returns:
The dataset as a
SingleCellExperiment. This uses caching to avoid recomputation.
Examples
>>> import scranpy >>> sce = scranpy.get_test_rna_data_se()
- scranpy.model_gene_variances(x, block=None, block_average_policy='mean', block_weight_policy='variable', variable_block_weight=(0, 1000), block_quantile=0.8, mean_filter=True, min_mean=0.1, transform=True, span=0.3, use_min_width=False, min_width=1, min_window_count=200, num_threads=1)[source]¶
Compute the variance in (log-)expression values for each gene, and model the trend in the variances with respect to the mean.
- Parameters:
x (
Any) – A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. It is typically expected to contain log-expression values, e.g., fromnormalize_counts().block (
Optional[Sequence]) – Array of length equal to the number of columns ofx, containing the block of origin (e.g., batch, sample) for each cell. AlternativelyNone, if all cells are from the same block.block_average_policy (
Literal['mean','quantile']) – Policy for averaging statistics across blocks. This can either use a (weighted)meanor aquantile.block_weight_policy (
Literal['variable','equal','none']) – Policy for weighting different blocks when computing the weighted mean across blocks for each statistic. Only used ifblockis provided andblock_average_policy = "mean".variable_block_weight (
Tuple) – Parameters for variable block weighting. This should be a tuple of length 2 where the first and second values are used as the lower and upper bounds, respectively, for the variable weight calculation. Only used ifblockis provided,block_average_policy = "mean", andblock_weight_policy = "variable".block_quantile (
float) – Probability for computing the quantile across blocks. Defaults to 0.5, i.e., the median of per-block statistics. Only used ifblockis provided andblock_average_policy = "quantile".mean_filter (
bool) – Whether to filter on the means before trend fitting.min_mean (
float) – The minimum mean of genes to use in trend fitting. Only used ifmean_filter = True.transform (
bool) – Whether a quarter-root transformation should be applied before trend fitting.span (
float) – Span of the LOWESS smoother for trend fitting, seefit_variance_trend().use_min_width (
bool) – Whether a minimum width constraint should be applied during trend fitting, seefit_variance_trend().min_width (
float) – Minimum width of the smoothing window for trend fitting, seefit_variance_trend().min_window_count (
int) – Minimum number of observations in each smoothing window for trend fitting, seefit_variance_trend().num_threads (
int) – Number of threads to use.
- Return type:
- Returns:
A
NamedListcontainingstatistics. This is aBiocFramewith one row per gene and the following columns:mean: a double-precision NumPy array containing the mean (log-)expression for each gene.variance: a double-precision NumPy array containing the mean (log-)expression for each gene.fitted: a double-precision NumPy array containing the fitted value of the mean-variance trend for each gene.residual: a double-precision NumPy array containing the residual from the mean-variance trend for each gene.
If
blockis supplied, theNamedListwill also contain:per_block: aNamedListcontaining the per-block statistics. Each entry is aBiocFramethat contains themean,variance,fittedandresidualfor each block.block_ids: a list containing the identities of the blocks. This corresponds to the entries ofper_block.
References
The
model_gene_variancesfunction in the scran_variances C++ library.Examples
>>> import numpy >>> # mock up some log-normalized data with an interesting mean-variance relationship. >>> mu = numpy.random.rand(200) * 5 >>> normed = numpy.ndarray((200, 10)) >>> for c in range(10): >>> normed[:,c] = numpy.log1p(numpy.random.poisson(lam=mu, size=200)) / numpy.log(2) >>> >>> import scranpy >>> res = scranpy.model_gene_variances(normed) >>> print(res["statistics"])
- scranpy.normalize_adt_counts_se(x, size_factors=None, num_threads=1, center=True, block=None, mode='lowest', log=True, pseudo_count=1, assay_type='counts', output_name='logcounts', factor_name='size_factor')[source]¶
Compute (log-)normalized expression values after performing scaling normalization of an ADT count matrix. This calls
compute_clrm1_factors()to compute CLRm1 size factors,center_size_factors()to center the size factors, and thennormalize_counts()to compute normalized log-expression values.- Parameters:
x (
SummarizedExperiment) – ASummarizedExperimentobject or one of its subclasses. Rows correspond to antibody-derived tags (ADTs) and columns correspond to cells.size_factors (
Optional[Sequence]) – Size factor for each cell inx, to be passed tonormalize_counts(). IfNone, size factors are computed withcompute_clrm1_factors().num_threads (
int) – Number of threads, to be passed tonormalize_counts().center (
bool) – Whether to center thesize_factorsby passing them tocenter_size_factors().block (
Optional[Sequence]) – Block assignment for each cell inx, to be passed tocenter_size_factors().mode (
Literal['lowest','per-block']) – How to scale the size factors across blocks, seecenter_size_factors(). Only relevant ifblockis provided.log (
bool) – Whether to log-transform the normalized expression values, seenormalize_counts().pseudo_count (
float) – Pseudo-count for the log-transformation, seenormalize_counts(). Only relevant iflog = True.assay_type (
Union[int,str]) – Name or index specifying the assay ofxthat contains the count matrix to be normalized.output_name (
str) – Name of the assay in which to store the normalized matrix in the output object.factor_name (
Optional[str]) – Name of the column of the column data in which to store the size factors in the output object. IfNone, the size factors are not stored.
- Return type:
- Returns:
A copy of
x, with an additional assay containing the (log-)normalized matrix. Size factors are also stored in the column data.
Examples
>>> import scranpy >>> sce = scranpy.get_test_adt_data_se("qc").get_alternative_experiment("ADT") >>> sce = scranpy.normalize_adt_counts_se(sce) >>> sce.get_assay_names() >>> sce.get_column_data()["size_factor"]
- scranpy.normalize_counts(x, size_factors, log=True, pseudo_count=1, log_base=2, preserve_sparsity=False, delayed=True)¶
Create a matrix of (log-transformed) normalized expression values. The normalization removes uninteresting per-cell differences due to sequencing efficiency and library size. The log-transformation ensures that any differences represent log-fold changes in downstream analysis steps; such relative changes in expression are more relevant than absolute changes.
- Parameters:
x (
Any) –Matrix-like object containing cells in columns and features in rows, typically with count data.
Alternatively, a
InitializedMatrixrepresenting a count matrix, typically created byinitialize().size_factors (
Sequence) – Size factor for each cell. This should have length equal to the number of columns inx.log (
bool) – Whether log-transformation should be performed.pseudo_count (
float) – Positive pseudo-count to add before log-transformation. Ignored iflog = False.log_base (
float) – Base of the log-transformation, ignored iflog = False.preserve_sparsity (
bool) – Whether to preserve sparsity whenpseudo_count != 1. IfTrue, users should manually addlog(pseudo_count, log_base)to the returned matrix to obtain the desired log-transformed expression values. Ignored iflog = Falseorpseudo_count = 1.delayed (
bool) – Whether operations on a matrix-likexshould be delayed. This improves memory efficiency at the cost of some speed in downstream operations.
- Return type:
Union[DelayedArray,InitializedMatrix]- Returns:
If
xis a matrix-like object anddelayed = True, aDelayedArrayis returned containing the (log-transformed) normalized expression matrix. Ifdelayed = False, the type of the (log-)normalized matrix will depend on the operations applied tox.If
xis anInitializedMatrix, a newInitializedMatrixis returned containing the normalized expression matrix.
References
The
normalize_countsfunction in the scran_norm C++ library, for the rationale behind normalization and log-transformation.Examples
>>> import numpy >>> counts = numpy.random.poisson(lam=2, size=1000).reshape(50, 20) >>> import scranpy >>> sf = scranpy.center_size_factors(counts.sum(axis=0)) >>> normed = scranpy.normalize_counts(counts, sf) >>> print(normed)
- scranpy.normalize_crispr_counts_se(x, size_factors=None, center=True, block=None, mode='lowest', log=True, pseudo_count=1, assay_type='counts', output_name='logcounts', factor_name='size_factor')[source]¶
Compute (log-)normalized expression values after performing scaling normalization of a CRISPR guide count matrix. This calls
center_size_factors()to center the library sizes, and thennormalize_counts()to compute normalized log-expression values.- Parameters:
x (
SummarizedExperiment) – ASummarizedExperimentobject or one of its subclasses. Rows correspond to guides and columns correspond to cells.size_factors (
Optional[Sequence]) – Size factor for each cell inx, to be passed tonormalize_counts(). IfNone, this defaults to the column sums of the count matrix inx.center (
bool) – Whether to center thesize_factorsby passing them tocenter_size_factors().block (
Optional[Sequence]) – Block assignment for each cell inx, to be passed tocenter_size_factors().mode (
Literal['lowest','per-block']) – How to scale the size factors across blocks, seecenter_size_factors(). Only relevant ifblockis provided.log (
bool) – Whether to log-transform the normalized expression values, seenormalize_counts().pseudo_count (
float) – Pseudo-count for the log-transformation, seenormalize_counts(). Only relevant iflog = True.assay_type (
Union[str,int]) – Name or index specifying the assay ofxthat contains the count matrix to be normalized.output_name (
str) – Name of the assay in which to store the normalized matrix in the output object.factor_name (
Optional[str]) – Name of the column of the column data in which to store the size factors in the output object. IfNone, the size factors are not stored.
- Return type:
- Returns:
A copy of
x, with an additional assay containing the (log-)normalized matrix. Size factors are also stored in the column data.
Examples
>>> import scranpy >>> sce = scranpy.get_test_crispr_data_se("qc").get_alternative_experiment("CRISPR Guide Capture") >>> sce = scranpy.normalize_crispr_counts_se(sce, size_factors=sce.get_column_data()["sum"]) >>> sce.get_assay_names() >>> sce.get_column_data()["size_factor"]
- scranpy.normalize_rna_counts_se(x, size_factors=None, center=True, block=None, mode='lowest', log=True, pseudo_count=1, assay_type='counts', output_name='logcounts', factor_name='size_factor')[source]¶
Compute (log-)normalized expression values after performing scaling normalization of an RNA count matrix. This calls
center_size_factors()to center the library sizes, and thennormalize_counts()to compute normalized log-expression values.- Parameters:
x (
SummarizedExperiment) – ASummarizedExperimentobject or one of its subclasses. Rows correspond to genes and columns correspond to cells.size_factors (
Optional[Sequence]) – Size factor for each cell inx, to be passed tonormalize_counts(). IfNone, this defaults to the column sums of the count matrix inx.center (
bool) – Whether to center thesize_factorsby passing them tocenter_size_factors().block (
Optional[Sequence]) – Block assignment for each cell inx, to be passed tocenter_size_factors().mode (
Literal['lowest','per-block']) – How to scale the size factors across blocks, seecenter_size_factors(). Only relevant ifblockis provided.log (
bool) – Whether to log-transform the normalized expression values, seenormalize_counts().pseudo_count (
float) – Pseudo-count for the log-transformation, seenormalize_counts(). Only relevant iflog = True.assay_type (
Union[str,int]) – Name or index specifying the assay ofxthat contains the count matrix to be normalized.output_name (
str) – Name of the assay in which to store the normalized matrix in the output object.factor_name (
Optional[str]) – Name of the column of the column data in which to store the size factors in the output object. IfNone, the size factors are not stored.
- Return type:
- Returns:
A copy of
x, with an additional assay containing the (log-)normalized matrix. Size factors are also stored in the column data.
Examples
>>> import scranpy >>> sce = scranpy.get_test_rna_data_se("qc") >>> sce = scranpy.normalize_rna_counts_se(sce, size_factors=sce.get_column_data()["sum"]) >>> sce.get_assay_names() >>> sce.get_column_data()["size_factor"]
- scranpy.preview_markers(df, columns=['mean', 'detected', ('lfc', 'delta_mean_mean')], rows=10, order_by=None, include_order_by=True)¶
Preview the top markers for a group in a pretty format.
- Parameters:
df (
BiocFrame) –BiocFramecontaining the marker statistics for a single group.columns (
Optional[Sequence]) –Names of columns of
dfto retain in the preview.Alternatively, each entry may be a tuple of two strings. The first string is the name of the column in the output
BiocFrame, and the second string is the name of the column ofdfto retain.rows (
Optional[int]) – Number of rows to show. IfNone, all rows are returned.order_by (
Union[str,bool,None]) – Name of the column to use for ordering the rows of the outputAlternatively ``True`, in which case a column is automatically chosen from the effect size summaries. IfNoneorFalse, no ordering is performed.include_order_by (
bool) – Whether to include the column named byorder_byin the outputBiocFrame. This may also be a string, which is treated asTrue; the value is used as the name of the column in the outputBiocFrame.
- Return type:
- Returns:
A
BiocFramecontaining important columns for the top markers.
Examples
>>> import scranpy >>> sce = scranpy.get_test_rna_data_se("cluster") >>> markers = scranpy.score_markers_se(sce, sce.get_column_data()["clusters"]) >>> print(scranpy.preview_markers(markers["1"]))
- scranpy.quick_adt_qc_se(x, subsets, num_threads=1, more_suggest_args={}, block=None, assay_type='counts', output_prefix=None, meta_name='qc', flatten=True)¶
Quickly compute quality control (QC) metrics, thresholds and filters from ADT data in a
SummarizedExperiment. This callscompute_adt_qc_metrics(),suggest_adt_qc_thresholds(), andfilter_adt_qc_metrics().- Parameters:
x (
SummarizedExperiment) – ASummarizedExperimentor one of its subclasses. Rows correspond to antibody-derived tags (ADTs) and columns correspond to cells.subsets (
Union[Mapping,Sequence]) – List of subsets of control features, passed tocompute_adt_qc_metrics().num_threads (
int) – Number of threads, passed tocompute_adt_qc_metrics().more_suggest_args (
dict) – Additional arguments to pass tosuggest_adt_qc_thresholds().block (
Optional[Sequence]) – Blocking factor specifying the block of origin (e.g., batch, sample) for each cell inmetrics. If supplied, a separate threshold is computed from the cells in each block. AlternativelyNone, if all cells are from the same block.assay_type (
Union[int,str]) – Index or name of the assay ofxcontaining the ADT count matrix.output_prefix (
Optional[str]) – Prefix to add to the column names of the column data containing the output QC statistics. IfNone, no prefix is added.meta_name (
Optional[str]) – Name of the metadata entry in which to store additional outputs like the filtering thresholds. IfNone, additional outputs are not reported.flatten (
bool) – Whether to flatten the subset proportions into separate columns of the column data. IfFalse, the subset proportions are stored in a nestedBiocFrame.
- Return type:
- Returns:
A copy of
xwith additional columns added to its column data. Each column contains per-cell values for one of the ADT-related QC metrics, seecompute_adt_qc_metrics()for details. The suggested thresholds are stored as a list in the metadata. The column data also contains akeepcolumn, specifying which cells are to be retained.
Examples
>>> import scranpy >>> sce = scranpy.get_test_adt_data_se().get_alternative_experiment("ADT") >>> is_igg = list(y.find("IgG") >= 0 for y in sce.get_row_names()) >>> sce = scranpy.quick_adt_qc_se(sce, subsets={ "igg": is_igg }) >>> print(sce.get_column_data()[:,["sum", "detected", "subset_sum_igg"]]) >>> print(sce.get_metadata()["qc"]["thresholds"]) >>> import biocutils >>> print(biocutils.table(sce.get_column_data()["keep"]))
- scranpy.quick_crispr_qc_se(x, more_suggest_args={}, num_threads=1, block=None, assay_type='counts', output_prefix=None, meta_name='qc')¶
Quickly compute quality control (QC) metrics, thresholds and filters from CRISPR data in a
SummarizedExperiment. This callscompute_crispr_qc_metrics(),suggest_crispr_qc_thresholds(), andfilter_crispr_qc_metrics().- Parameters:
x (
SummarizedExperiment) – ASummarizedExperimentor one of its subclasses. Rows correspond to antibody-derived tags (ADTs) and columns correspond to cells.subsets – Passed to
compute_crispr_qc_metrics().more_suggest_args (
dict) – Additional arguments to pass tosuggest_crispr_qc_thresholds().num_threads (
int) – Passed tocompute_crispr_qc_metrics().block (
Optional[Sequence]) – Blocking factor specifying the block of origin (e.g., batch, sample) for each cell inmetrics. If supplied, a separate threshold is computed from the cells in each block. AlternativelyNone, if all cells are from the same block.assay_type (
Union[int,str]) – Index or name of the assay ofxcontaining the ADT count matrix.output_prefix (
Optional[str]) – Prefix to add to the column names of the column data containing the output QC statistics. IfNone, no prefix is added.meta_name (
str) – Name of the metadata entry in which to store additional outputs like the filtering thresholds. IfNone, additional outputs are not reported.
- Return type:
- Returns:
x, with additional columns added to its column data. Each column contains per-cell values for one of the ADT-related QC metrics, seecompute_crispr_qc_metrics()for details. The suggested thresholds are stored as a list in the metadata. The column data also contains akeepcolumn, specifying which cells are to be retained.
Examples
>>> import scranpy >>> sce = scranpy.get_test_crispr_data_se().get_alternative_experiment("CRISPR Guide Capture") >>> sce = scranpy.quick_crispr_qc_se(sce) >>> print(sce.get_column_data()[:,["sum", "detected", "max_value", "max_index"]]) >>> print(sce.get_metadata()["qc"]["thresholds"]) >>> sce.get_column_data()["keep"].sum()
- scranpy.quick_rna_qc_se(x, subsets, altexp_proportions=None, num_threads=1, more_suggest_args={}, block=None, assay_type='counts', output_prefix=None, meta_name='qc', flatten=True)¶
Quickly compute quality control (QC) metrics, thresholds and filters from RNA data in a
SummarizedExperiment. This callscompute_rna_qc_metrics(),suggest_rna_qc_thresholds(), andfilter_rna_qc_metrics().- Parameters:
x (
SummarizedExperiment) – ASummarizedExperimentor one of its subclasses. Rows correspond to genes and columns correspond to cells.subsets (
Union[Mapping,Sequence]) – List of subsets of control genes, passed tocompute_rna_qc_metrics_with_altexps().altexp_proportions (
Union[str,int,dict,Sequence,NamedList,None]) – Indices or names of alternative experiments for which to compute QC metrics, seecompute_rna_qc_metrics_with_altexps()for details.num_threads (
int) – Number of threads, passed tocompute_rna_qc_metrics_with_altexps().more_suggest_args (
dict) – Additional arguments to pass tosuggest_rna_qc_thresholds().block (
Optional[Sequence]) – Blocking factor specifying the block of origin (e.g., batch, sample) for each cell inmetrics. If supplied, a separate threshold is computed from the cells in each block. AlternativelyNone, if all cells are from the same block.assay_type (
Union[int,str]) – Index or name of the assay ofxcontaining the RNA count matrix.output_prefix (
Optional[str]) – Prefix to add to the column names of the column data containing the output QC statistics. IfNone, no prefix is added.meta_name (
Optional[str]) – Name of the metadata entry in which to store additional outputs like the filtering thresholds. IfNone, additional outputs are not reported.flatten (
bool) – Whether to flatten the subset proportions into separate columns of the column data. IfFalse, the subset proportions are stored in a nestedBiocFrame.
- Return type:
- Returns:
x, with additional columns added to its column data. Each column contains per-cell values for one of the RNA-related QC metrics, seecompute_rna_qc_metrics()for details. The suggested thresholds are stored as a list in the metadata. The column data also contains akeepcolumn, specifying which cells are to be retained.
Examples
>>> import scranpy >>> sce = scranpy.get_test_rna_data_se() >>> is_mito = list(y.startswith("mt-") for y in sce.get_row_names()) >>> sce = scranpy.quick_rna_qc_se(sce, subsets={ "mito": is_mito }) >>> print(sce.get_column_data()[:,["sum", "detected", "subset_proportion_mito"]]) >>> print(sce.get_metadata()["qc"]["thresholds"]) >>> import biocutils >>> print(biocutils.table(sce.get_column_data()["keep"])) >>> >>> # Computing spike-in proportions, if available. >>> sce = scranpy.get_test_rna_data_se() >>> sce = scranpy.quick_rna_qc_se(sce, subsets={ "mito": is_mito }, altexp_proportions=["ERCC"]) >>> print(sce.get_column_data()[:,["sum", "detected", "subset_proportion_mito", "subset_proportion_ERCC"]]) >>> print(sce.get_alternative_experiment("ERCC").get_column_data()[:,["sum", "detected"]])
- scranpy.run_all_neighbor_steps(x, run_umap_options={}, run_tsne_options={}, build_snn_graph_options={}, cluster_graph_options={}, nn_parameters=<knncolle.annoy.AnnoyParameters object>, collapse_search=True, num_threads=3)¶
Run all steps that depend on the nearest neighbor search, i.e.,
run_tsne(),run_umap(),build_snn_graph(), andcluster_graph(). This builds the index once and re-uses it for the neighbor search in each step; the various steps are also run in parallel to save more time.- Parameters:
Matrix of principal components where rows are cells and columns are PCs, typically produced by
run_pca().Alternatively, a
Indexinstance containing a prebuilt search index for the cells.run_umap_options (
Optional[dict]) – Optional arguments forrun_umap(). IfNone, UMAP is not performed.run_tsne_options (
Optional[dict]) – Optional arguments forrun_tsne(). IfNone, t-SNE is not performed.build_snn_graph_options (
Optional[dict]) – Optional arguments forbuild_snn_graph(). Ignored ifcluster_graph_options = None.cluster_graph_options (
dict) – Optional arguments forcluster_graph(). IfNone, graph-based clustering is not performed.nn_parameters (
Parameters) – Parameters for the nearest-neighbor search.collapse_search (
bool) – Whether to collapse the nearest-neighbor search for each step into a single search. Steps that need fewer neighbors will use a subset of the neighbors from the collapsed search. This is faster but may not give the same results as separate searches for some approximate search algorithms.num_threads (
int) – Number of threads to use for the parallel execution of UMAP, t-SNE and SNN graph construction. This overrides the specified number of threads in the various*_optionsarguments.
- Return type:
- Returns:
A
NamedListcontaining one entry for each step.run_tsne: results ofrun_tsne(). Omitted if t-SNE was not performed.run_umap: results ofrun_tsne(). Omitted if UMAP was not performed.build_snn_graph: results ofbuild_snn_graph(). Omitted if graph-based clustering was not performed.cluster_graph: results ofcluster_graph(). Omitted if graph-based clustering was not performed.
If
collapse_search = False, results should be identical to the result of running each step in serial.
Examples
>>> import numpy >>> pcs = numpy.random.rand(10, 200) >>> import scranpy >>> output = scranpy.run_all_neighbor_steps(pcs) >>> print(output["run_tsne"][:5,:]) >>> print(output["run_umap"][:5,:]) >>> import biocutils >>> print(biocutils.table(output["cluster_graph"]["membership"]))
- scranpy.run_all_neighbor_steps_se(x, umap_output_name='UMAP', more_umap_args={}, tsne_output_name='TSNE', more_tsne_args={}, build_graph_name=None, more_build_graph_args={}, cluster_output_name='clusters', cluster_meta_name=None, more_cluster_graph_args={}, nn_parameters=<knncolle.annoy.AnnoyParameters object>, num_threads=3, more_neighbor_args={}, reddim_type='PCA')[source]¶
Concurrently run all steps involving a nearest-neighbor search, using the same nearest-neighbor search index built from an existing embedding. This includes the t-SNE, UMAP and graph-based clustering. Internally, this uses
run_all_neighbor_steps()on a reduced dimension entry inSingleCellExperiment.- Parameters:
x (
SingleCellExperiment) – ASingleCellExperimentobject or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.umap_output_name (
Optional[str]) – Name of the reduced dimension entry in which to store the UMAP coordinates in the output object. IfNone, the UMAP is not computed.more_umap_args (
Optional[dict]) – Additional arguments for UMAP, to pass torun_all_neighbor_steps()asrun_umap_options. IfNone, the UMAP is not computed.tsne_output_name (
Optional[str]) – Name of the reduced dimension entry in which to store the t-SNE coordinates in the output object. IfNone, the t-SNE is not computed.more_tsne_args (
Optional[dict]) – Additional arguments for t-SNE, to pass torun_all_neighbor_steps()asrun_tsne_options. IfNone, the t-SNE is not computed.build_graph_name (
Optional[str]) – Name of the metadata entry in which to store the nearest neighbor graph. IfNone, the graph is not stored.more_build_graph_args (
Optional[dict]) – Additional arguments for graph construction, to pass torun_all_neighbor_steps()asmore_build_graph_args.cluster_output_name (
Optional[str]) – Name of the column of the column data in which to store the cluster assignments. IfNone, graph-based clustering is not performed.cluster_meta_name (
Optional[str]) – Name of the metadata entry in which to store additional clustering outputs. IfNone, additional outputs are not stored.more_cluster_graph_args (
Optional[dict]) – Additional arguments for graph-based clustering, to pass torun_all_neighbor_steps()asmore_cluster_graph_args. IfNone, the graph-based clustering is not performed.nn_parameters (
Parameters) – Parameters for the nearest-neighbor search.num_threads (
int) – Number of threads to use, passed torun_all_neighbor_steps().more_neighbor_args (
dict) – Additional arguments to pass torun_all_neighbor_steps().reddim_type (
Union[str,int]) – Name or index of the reduced dimensions ofxon which to perform the nearest neighbor search. Alternatively a tuple, where the first element contains the name of an alternative experiment ofx, and the second element contains the name/index of an embedding in that alternative experiment.
- Return type:
- Returns:
A copy of
x, with additional coordinates in its reduced dimensions and clustering output in its column data. Additional information may also be stored in its metadata.
Examples
>>> import scranpy >>> sce = scranpy.get_test_rna_data_se("pca") >>> sce = scranpy.run_all_neighbor_steps_se( >>> sce, >>> more_tsne_args={ "max_iterations": 50 }, # turned down for brevity >>> more_umap_args={ "num_epochs": 50 } >>> ) >>> sce.get_reduced_dimension_names() >>> import biocutils >>> print(biocutils.table(sce.get_column_data()["clusters"]))
- scranpy.run_pca(x, number=25, scale=False, block=None, block_weight_policy='variable', variable_block_weight=(0, 1000), subset=None, components_from_residuals=False, extra_work=7, iterations=1000, seed=5489, realized=True, num_threads=1)[source]¶
Run a PCA on the gene-by-cell log-expression matrix to obtain a low-dimensional representation for downstream analyses.
- Parameters:
x (
Any) – A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. Typically, the matrix is expected to contain log-expression values, and the rows should be filtered to relevant (e.g., highly variable) genes.number (
int) – Number of PCs to retain.scale (
bool) – Whether to scale all genes to have the same variance.block (
Optional[Sequence]) – Array of length equal to the number of columns ofx, containing the block of origin (e.g., batch, sample) for each cell. AlternativelyNone, if all cells are from the same block.block_weight_policy (
Literal['variable','equal','none']) – Policy to use for weighting different blocks when computing the average for each statistic. Only used ifblockis provided.variable_block_weight (
Tuple) – Parameters for variable block weighting. This should be a tuple of length 2 where the first and second values are used as the lower and upper bounds, respectively, for the variable weight calculation. Only used ifblockis provided andblock_weight_policy = "variable".components_from_residuals (
bool) – Whether to compute the PC scores from the residuals in the presence of a blocking factor. IfFalse, the residuals are only used to compute the rotation matrix, and the original expression values of the cells are projected onto this new space. Only used ifblockis provided.extra_work (
int) – Number of extra dimensions for the IRLBA workspace.iterations (
int) – Maximum number of restart iterations for IRLBA.seed (
int) – Seed for the initial random vector in IRLBA.realized (
bool) – Whether to realizexinto an optimal memory layout for IRLBA. This speeds up computation at the cost of increased memory usage.num_threads (
int) – Number of threads to use.
- Return type:
- Returns:
A
NamedListcontaining the following entries.components: a double-precision NumPy matrix of principal component (PC) scores. Rows are dimensions (i.e., PCs) and columns are cells.rotation: a double-precision NumPy matrix containing the rotation vectors. Rows are genes and columns are dimensions (i.e., PCs).variance_explained: a double-precision NumPy array containing the variance explained by each successive PC.total_variance: total variance in the input data. Guaranteed to be greater than the sum ofvariance_explained.center: a double-precision NumPy array containing the mean for each gene across all cells. Ifblockwas supplied, this is instead a matrix containing the mean for each gene (column) in each block of cells (row).block_ids: list containing the levels of the blocking factor corresponding to each row ofcenter. Only reported ifblockwas supplied.scale: a double-precision NumPy arrary containing the scaling for each gene. Only reported ifscale = Truein the function call.
References
https://libscran.github.io/scran_pca, which describes the approach in more detail. In particular, the documentation for the
blocked_pcafunction explains the blocking strategy.Examples
>>> import numpy >>> normed = numpy.random.rand(500, 100) >>> import scranpy >>> res = scranpy.run_pca(normed) >>> print(res["variance_explained"] / res["total_variance"])
- scranpy.run_pca_se(x, features, number=25, block=None, num_threads=1, more_pca_args={}, assay_type='logcounts', output_name='PCA', meta_name='PCA')[source]¶
Compact and denoise the dataset by performing PCA on the (log-)normalized expression matrix. This calls
run_pca()on an assay of aSummarizedExperiment.- Parameters:
x (
SummarizedExperiment) – ASummarizedExperimentobject or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.features (
Optional[Sequence]) – Features of interest to use in the PCA. This can be a sequence of row indices, row names, or booleans indicating the rows ofxto use. For RNA data, this is typically thehvgvector added to the row data ofxbychoose_rna_hvgs_se(). IfNone, all available features are used.number (
int) – Number of PCs to retain, passed torun_pca().block (
Optional[Sequence]) – Block assignment of each cell, passed torun_pca().more_pca_args (
dict) – Additional arguments passed torun_pca().assay_type (
Union[int,str]) – Name or index of the assay ofxto be used for PCA. This is typically the log-normalized expression matrix created bynormalize_rna_counts_se()or equivalent.output_name (
str) – Name of the reduced dimension entry in which to store the PC scores in the output object.meta_name (
Optional[str]) – Name of the metadata entry in which to store other PCA statistics in the output object.
- Return type:
- Returns:
A copy of
xwith the newly computed principal component scores in the reduced dimensions. Additional outputs (e.g., rotation matrix, variance explained) are stored in the metadata.
Examples
>>> import scranpy >>> sce = scranpy.get_test_rna_data_se("hvg") >>> sce = scranpy.run_pca_se(sce, features=sce.get_row_data()["hvg"]) >>> sce.get_reduced_dimension("PCA").shape >>> pcameta = sce.get_metadata()["PCA"] >>> pcameta["variance_explained"] / pcameta["total_variance"]
- scranpy.run_tsne(x, perplexity=30, num_neighbors=None, theta=1, early_exaggeration_iterations=250, exaggeration_factor=12, momentum_switch_iterations=250, start_momentum=0.5, final_momentum=0.8, eta=200, max_depth=7, leaf_approximation=False, max_iterations=500, seed=42, num_threads=1, nn_parameters=<knncolle.annoy.AnnoyParameters object>)¶
Compute t-SNE coordinates to visualize similarities between cells.
- Parameters:
x (
Union[ndarray,FindKnnResults,Index]) –Numeric matrix where rows are dimensions and columns are cells, typically containing a low-dimensional representation from, e.g.,
run_pca().Alternatively, a
FindKnnResultsobject containing existing neighbor search results.Alternatively, a
Indexobject.perplexity (
float) – Perplexity to use in the t-SNE algorithm. Larger values cause the embedding to focus on global structure.num_neighbors (
Optional[int]) – Number of neighbors in the nearest-neighbor graph. Typically derived fromperplexityusingtsne_perplexity_to_neighbors(). Ifxis aFindKnnResultsobject and the number of neighbors is not equal tonum_neighbors, a warning is raised; this can be suppressed by settingnum_neighbors = None.theta (
float) – Approximation level for the Barnes-Hut calculation of repulsive forces. Lower values increase accuracy at the cost of computational time.early_exaggeration_iterations (
int) – Number of iterations of the early exaggeration phase, where the conditional probabilities are multiplied byexaggeration_factor. In this phase, the empty space between clusters is increased so that clusters can easily relocate to find a good global organization. Larger values improve convergence within this phase at the cost of reducing the remaining iterations inmax_iterations.exaggeration_factor (
float) – Exaggeration factor to scale the probabilities during the early exaggeration phase (seeearly_exaggeration_iterations). Larger values increase the attraction between nearest neighbors to favor local structure during this phase.momentum_switch_iterations (
int) – Number of iterations to perform before switching from the starting momentum (start_momentum) to the final momentum (final_momentum). Greater momentum can improve convergence by increasing the step size and smoothing over local oscillations, at the risk of potentially skipping over relevant minima.start_momentum (
float) – Starting momentum in [0, 1) to be used in the iterations before the momentum switch atmomentum_switch_iterations. This is usually lower thanfinal_momentumto avoid skipping over suitable local minima.final_momentum (
float) – Final momentum in [0, 1) to be used in the iterations after the momentum switch atmomentum_switch_iterations. This is usually higher thanstart_momentumto accelerate convergence to the local minima once the observations are moderately well-organized.eta (
float) – The learning rate, used to scale the updates to the coordinates at each iteration. Larger values can speed up convergence at the cost of potentially skipping over local minima.max_depth (
int) – Maximum depth of the Barnes-Hut quadtree. If neighboring observations cannot be separated before the maximum depth is reached, they will be assigned to the same leaf node. This effectively approximates each observation’s coordinates with the center of mass of its leaf node. Smaller values (7-10) improve speed at the cost of accuracy.leaf_approximation (
bool) – Whether to use the “leaf approximation” approach, which sacrifices some accuracy for greater speed. This replaces a observation with the center of mass of its leaf node when computing the repulsive forces to all other observations. Only effective whenmax_depthis small enough for multiple cells to be assigned to the same leaf node of the quadtree.max_iterations (
int) – Maximum number of iterations to perform.seed (
int) – Random seed to use for generating the initial coordinates.num_threads (
int) – Number of threads to use.nn_parameters (
Parameters) – The algorithm to use for the nearest-neighbor search. Only used ifxis not a pre-built nearest-neighbor search index or a list of existing nearest-neighbor search results.
- Return type:
- Returns:
Double-precision NumPy matrix containing the coordinates of each cell in a 2-dimensional embedding. Each row corresponds to a cell and each column corresponds to a dimension.
References
https://libscran.github.io/qdtsne, for some more details on the approximations.
Examples
>>> import numpy >>> pcs = numpy.random.rand(20, 500) >>> import scranpy >>> tout = scranpy.run_tsne(pcs) >>> print(tout[:5,:])
- scranpy.run_tsne_se(x, perplexity=30, num_threads=1, more_tsne_args={}, reddim_type='PCA', output_name='TSNE')[source]¶
Generate a t-SNE visualization from an existing embedding. This calls
run_tsne()on a reduced dimension entry in aSingleCellExperiment.- Parameters:
x (
SingleCellExperiment) – ASingleCellExperimentobject or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.perplexity (
float) – Perplexity for the t-SNE algorithm, passed torun_tsne().num_threads (
int) – Number of threads for the neighbor search and optimization, passed torun_tsne().more_tsne_args (
dict) – Additional arguments to pass torun_tsne().reddim_type (
Union[int,str]) – Name or index of the existing reduced dimension embedding inxfrom which to generate t-SNE coordinates. Alternatively a tuple, where the first element contains the name of an alternative experiment ofx, and the second element contains the name/index of an embedding in that alternative experiment.output_name (
str) – Name of the reduced dimension entry in which to store the t-SNE coordinates in the output object.
- Return type:
- Returns:
A copy of
xwith a new reduced dimension entry containing the t-SNE coordinates.
Examples
>>> import scranpy >>> sce = scranpy.get_test_rna_data_se("pca") >>> # Using fewer iterations for a faster-running example. >>> sce = scranpy.run_tsne_se(sce, more_tsne_args={ "max_iterations": 50 }) >>> sce.get_reduced_dimension("TSNE")[:5,:]
- scranpy.run_umap(x, num_dim=2, parallel_optimization=False, local_connectivity=1, bandwidth=1, mix_ratio=1, spread=1, min_dist=0.1, a=None, b=None, repulsion_strength=1, initialize_method='spectral', initial_coordinates=None, initialize_random_on_spectral_fail=True, initialize_spectral_scale=10, initialize_spectral_jitter=False, initialize_spectral_jitter_sd=0.0001, initialize_random_scale=10, initialize_seed=9876543210, num_epochs=None, learning_rate=1, negative_sample_rate=5, num_neighbors=15, optimize_seed=1234567890, num_threads=1, nn_parameters=<knncolle.annoy.AnnoyParameters object>)¶
Compute UMAP coordinates to visualize similarities between cells.
- Parameters:
x (
Union[ndarray,FindKnnResults,Index]) –Numeric matrix where rows are dimensions and columns are cells, typically containing a low-dimensional representation from, e.g.,
run_pca().Alternatively, a
FindKnnResultsobject containing existing neighbor search results.Alternatively, a
Indexobject.num_dim (
int) – Number of dimensions in the UMAP embedding.local_connectivity (
float) – Number of nearest neighbors that are assumed to be always connected, with maximum membership confidence. Larger values increase the connectivity of the embedding and reduce the focus on local structure. This may be a fractional number of neighbors, in which case interpolation is performed when computing the membership confidence.bandwidth (
float) – Effective bandwidth of the kernel when converting the distance to a neighbor into a fuzzy set membership confidence. Larger values reduce the decay in confidence with respect to distance, increasing connectivity and favoring global structure.mix_ratio (
float) – Number between 0 and 1 specifying the mixing ratio when combining fuzzy sets. A mixing ratio of 1 will take the union of confidences, a ratio of 0 will take the intersection, and intermediate values will interpolate between them. Larger values favor connectivity and more global structure.spread (
float) – Scale of the coordinates of the final low-dimensional embedding. Ignored ifaandbare provided.min_dist (
float) – Minimum distance between observations in the final low-dimensional embedding. Smaller values will increase local clustering while larger values favor a more even distribution of observations throughout the low-dimensional space. This is interpreted relative tospread. Ignored ifaandbare provided.a (
Optional[float]) – Theaparameter for the fuzzy set membership strength calculations. Larger values yield a sharper decay in membership strength with increasing distance between observations. If this orbareNone, a suitable value for this parameter is automatically determined fromspreadandmin_dist.b (
Optional[float]) – Thebparameter for the fuzzy set membership strength calculations. Larger values yield an earlier decay in membership strength with increasing distance between observations. If this oraareNone, a suitable value for this parameter is automatically determined fromspreadandmin_dist.repulsion_strength (
float) – Modifier for the repulsive force. Larger values increase repulsion and favor local structure.initialize_method (
Literal['spectral','random','none']) –How to initialize the embedding.
spectral: spectral decomposition of the normalized graph Laplacian. Specifically, the initial coordinates are defined from the eigenvectors corresponding to the smallest non-zero eigenvalues. This fails in the presence of multiple graph components or if the approximate SVD fails to converge.random: fills the embedding with random draws from a normal distribution.none: uses initial values frominitial_coordinates.
initialize_random_on_spectral_fail (
bool) – Whether to fall back to random sampling (i.e.,random) if spectral initialization fails due to the presence of multiple components in the graph. If False, the values ininitial_coordinateswill be used instead, i.e., same asnone. Only relevant ifinitialize_method = "spectral"and spectral initialization fails.initialize_spectral_scale (
float) – Maximum absolute magnitude of the coordinates after spectral initialization. The default is chosen to avoid outlier observations with large absolute distances that may interfere with optimization. Only relevant ifinitialize_method = "spectral"and spectral initialization does not fail.initialize_spectral_jitter (
bool) – Whether to jitter coordinates after spectral initialization to separate duplicate observations (e.g., to avoid overplotting). This is done using normally-distributed noise of mean zero and standard deviation ofinitialize_spectral_jitter_sd. Only relevant ifinitialize_method = "spectral"and spectral initialization does not fail.initialize_spectral_jitter_sd (
float) – Standard deviation of the jitter to apply after spectral initialization. Only relevant ifinitialize_method = "spectral"and spectral initialization does not fail andinitialize_spectral_jitter = True.initialize.random.scale – Scale of the randomly generated initial coordinates. Coordinates are sampled from a uniform distribution from [-x, x) where x is
initialize_random_scale. Only relevant ifinitialize_method = "random", orinitialize_method = "spectral"and spectral initialization fails andinitialize_random_on_spectral_fail = True.initialize_seed (
int) – Seed for the random number generation during initialization. Only relevant ifinitialize_method = "random", orinitialize_method = "spectral"andinitialize_spectral_jitter = True; orinitialize_method = "spectral"and spectral initialization fails andinitialize_random_on_spectral_fail = True.initial_coordinates (
Optional[array]) – Double-precision matrix of initial coordinates with number of rows and columns equal to the number of observations andnum_dim, respectively. Only relevant ifinitialize_method = "none"; orinitialize_method = "spectral"and spectral initialization fails andinitialize_random_on_spectral_fail = False.num_epochs (
Optional[int]) – Number of epochs to perform. If set to None, an appropriate number of epochs is chosen based on the number of points inx.num_neighbors (
int) – Number of neighbors to use in the UMAP algorithm. Larger values cause the embedding to focus on global structure. Ignored ifxis aFindKnnResultsobject.optimize_seed (
int) – Integer scalar specifying the seed to use.num_threads (
int) – Number of threads to use.nn_parameters (
Parameters) – The algorithm to use for the nearest-neighbor search. Only used ifxis not a pre-built nearest-neighbor search index or a list of existing nearest-neighbor search results.
- Return type:
- Returns:
Double-precision NumPy matrix containing the coordinates of each cell in a 2-dimensional embedding. Each row corresponds to a cell and each column corresponds to a dimension.
References
https://libscran.github.io/umappp, for the underlying implementation.
Examples
>>> import numpy >>> pcs = numpy.random.rand(20, 500) >>> import scranpy >>> uout = scranpy.run_umap(pcs) >>> print(uout[:5,:])
- scranpy.run_umap_se(x, num_dim=2, min_dist=0.1, num_neighbors=15, num_threads=1, more_umap_args={}, reddim_type='PCA', output_name='UMAP')[source]¶
Generate a UMAP visualization from an existing embedding. This calls
run_umap()on a reduced dimension entry in aSingleCellExperiment.- Parameters:
x (
SingleCellExperiment) – ASingleCellExperimentobject or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.num_dim (
int) – Number of dimensions in the UMAP, passed torun_umap().min_dist (
float) – Minimum distance between cells in the UMAP, passed torun_umap().num_neighbors (
int) – Number of neighbors, passed torun_umap().num_threads (
int) – Number of threads for the neighbor search and optimization, passed torun_umap().more_umap_args (
dict) – Additional arguments to pass torun_umap().reddim_type (
Union[int,str]) – Name or index of the existing reduced dimension embedding inxfrom which to generate UMAP coordinates. Alternatively a tuple, where the first element contains the name of an alternative experiment ofx, and the second element contains the name/index of an embedding in that alternative experiment.output_name (
str) – Name of the reduced dimension entry in which to store the UMAP coordinates in the output object.
- Return type:
- Returns:
A copy of
xwith a new reduced dimension entry containing the UMAP coordinates.
Examples
>>> import scranpy >>> sce = scranpy.get_test_rna_data_se("pca") >>> # Using fewer iterations for a faster-running example. >>> sce = scranpy.run_umap_se(sce, more_umap_args={ "num_epochs": 50 }) >>> sce.get_reduced_dimension("UMAP")[:5,:]
- scranpy.sanitize_size_factors(size_factors, replace_zero=True, replace_negative=True, replace_infinite=True, replace_nan=True, in_place=False)¶
Replace invalid size factors, i.e., zero, negative, infinite or NaNs.
- Parameters:
size_factors (
ndarray) – Floating-point array containing size factors for all cells.replace_zero (
bool) – Whether to replace size factors of zero with the lowest positive factor. IfFalse, zeros are retained.replace_negative (
bool) – Whether to replace negative size factors with the lowest positive factor. IfFalse, negative values are retained.replace_infinite (
bool) – Whether to replace infinite size factors with the largest positive factor. IfFalse, infinite values are retained.replace_nan (
bool) – Whether to replace NaN size factors with unity. IfFalse, NaN values are retained.in_place (
bool) – Whether to modifysize_factorsin place. IfFalse, a new array is returned. This argument only used ifsize_factorsis double-precision, otherwise a new array is always returned.
- Return type:
- Returns:
Double-precision NumPy array containing sanitized size factors. If
in_place = True, this is a reference tosize_factors.
References
The
sanitize_size_factorsfunction in the scran_norm C++ library, which provides the underlying implementation.Examples
>>> import numpy >>> sf = numpy.random.rand(1000) >>> sf[:4] = [0, -1, numpy.inf, numpy.nan] >>> import scranpy >>> san = scranpy.sanitize_size_factors(sf) >>> san[:4]
- scranpy.scale_by_neighbors(x, num_neighbors=20, block=None, block_weight_policy='variable', variable_block_weight=(0, 1000), num_threads=1, weights=None, nn_parameters=<knncolle.annoy.AnnoyParameters object>)[source]¶
Scale multiple embeddings (usually derived from different modalities across the same set of cells) so that their within-population variances are comparable. Then, combine them into a single embedding matrix for combined downstream analysis.
- Parameters:
x (
Sequence) – Sequence of of numeric matrices of principal components or other embeddings, one for each modality. For each entry, rows are dimensions and columns are cells. All entries should have the same number of columns but may have different numbers of rows.num_neighbors (
int) – Number of neighbors to use to define the scaling factor.num_threads (
int) – Number of threads to use.nn_parameters (
Parameters) – Algorithm for the nearest-neighbor search.weights (
Optional[Sequence]) – Array of length equal tox, specifying the weights to apply to each modality. Each value represents a multiplier of the within-population variance of its modality, i.e., larger values increase the contribution of that modality in the combined output matrix. The default ofNoneis equivalent to an all-1 vector, i.e., all modalities are scaled to have the same within-population variance.
- Return type:
- Returns:
A
NamedListcontaining the following entries.scaling: a double-precision NumPy array containing the scaling factor to be applied to each embedding inx.combined: a double-precision NumPy matrix of scaled and concatenated embeddings. This is formed by scaling each embedding inxby its corresponding entry ofscaling, and then concatenating them together by row. Each row corresponds to a dimension and each column corresponds to a cell.
References
https://libscran.github.io/mumosa, for the basis and caveats of this approach.
Examples
>>> import numpy >>> rna_pcs = numpy.random.rand(25, 200) >>> adt_pcs = numpy.random.rand(15, 200) >>> other_pcs = numpy.random.rand(10, 200) >>> import scranpy >>> res = scranpy.scale_by_neighbors([rna_pcs, adt_pcs, other_pcs]) >>> print(res["scaling"])
- scranpy.scale_by_neighbors_se(x, altexp_reddims, main_reddims='PCA', num_neighbors=20, block=None, nn_parameters=<knncolle.annoy.AnnoyParameters object>, num_threads=1, more_scale_args={}, output_name='combined', meta_name='combined')[source]¶
Scale embeddings for different modalities to equalize their intra-population variance, and then combine them into a single embedding for downstream analysis. This calls
scale_by_neighbors()on the main/alternative experiments of aSingleCellExperiment.- Parameters:
x (
SingleCellExperiment) – ASingleCellExperimentobject or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.altexp_reddims (
Union[dict,NamedList]) –Dictionary specifying the reduced dimensions of the alternative experiments of
xto be combined. Specifically, each key is the name of an alternative experiment and the value is a list of strings/integers specifying the names/indices of the reduced dimensions of that experiment.Alternatively, each value may be a single string or integer that will be converted into a list of length 1. Each value may also be
Nonein which case the corresponding alternative experiment is ignored.Alternatively, a
NamedListwhere each element is named after one or more alternative experiments, and each value is as described above for a dictionary.main_reddims (
Union[int,str,Sequence]) –List of strings/integers specifying the names/indices of the reduced dimensions from the main experiment of
x.Alternatively, a single string or integer that will be converted into a list of length 1.
Alternatively
None, which will be treated as anum_neighbors (
int) – Number of neighbors to compute the scaling, seescale_by_neighbors().block (
Optional[Sequence]) – Block assignment for each cell, passed toscale_by_neighbors().nn_parameters (
Parameters) – Algorithm for the nearest neighbor search, passed toscale_by_neighbors().num_threads (
int) – Number of threads, passed toscale_by_neighbors().more_scale_args (
dict) – Additional arguments to pass toscale_by_neighbors().output_name (
str) – Name of the reduced dimension entry in which to store the combined embeddings in the output object.meta_name (
Optional[str]) – Name of the metadata entry in which to store additional metrics in the output object. IfNone, additional metrics are not stored.
- Return type:
- Returns:
A copy of
xwith the combined embeddings stored in its row data. The scaling factors for all embeddings are stored in the metadata.
Examples
>>> import scranpy >>> sce = scranpy.get_test_adt_data_se("pca") >>> sce = scranpy.scale_by_neighbors_se(sce, altexp_reddims={ "ADT": "PCA" }) >>> sce.get_reduced_dimension_names() >>> sce.get_metadata()["combined"]
- scranpy.score_gene_set(x, set, rank=1, scale=False, block=None, block_weight_policy='variable', variable_block_weight=(0, 1000), extra_work=7, iterations=1000, seed=5489, realized=True, num_threads=1)[source]¶
Compute per-cell scores for a gene set, defined as the column sums of a rank-1 approximation to the submatrix for the feature set. This uses the same approach implemented in the GSDecon package by Jason Hackney.
- Parameters:
x (
Any) – A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. The matrix is expected to contain log-expression values.set (
Sequence) – Array of integer indices specifying the rows ofxbelonging to the gene set. Alternatively, a sequence of boolean values of length equal to the number of rows, where truthy elements indicate that the corresponding row belongs to the gene set.rank (
int) – Rank of the approximation.scale (
bool) – Whether to scale all genes to have the same variance.block (
Optional[Sequence]) – Array of length equal to the number of columns ofx, containing the block of origin (e.g., batch, sample) for each cell. AlternativelyNone, if all cells are from the same block.block_weight_policy (
Literal['variable','equal','none']) – Policy to use for weighting different blocks when computing the average for each statistic. Only used ifblockis provided.variable_block_weight (
Tuple) – Parameters for variable block weighting. This should be a tuple of length 2 where the first and second values are used as the lower and upper bounds, respectively, for the variable weight calculation. Only used ifblockis provided andblock_weight_policy = "variable".extra_work (
int) – Number of extra dimensions for the IRLBA workspace.iterations (
int) – Maximum number of restart iterations for IRLBA.seed (
int) – Seed for the initial random vector in IRLBA.realized (
bool) – Whether to realizexinto an optimal memory layout for IRLBA. This speeds up computation at the cost of increased memory usage.num_threads (
int) – Number of threads to use.
- Return type:
- Returns:
A
NamedListcontaining the following entries.scores: a double-precision NumPy array containing the gene set score for each cell.weights: a double-precision NumPy array containing containing the weight of each gene inset. A larger weight indicates that the corresponding gene insethas more influence onscores.
References
https://libscran.github.io/gsdecon, which describes the approach in more detail. In particular, the documentation for the
compute_blockedfunction explains the blocking strategy.Examples
>>> import numpy >>> normed = numpy.random.rand(200, 100) >>> import scranpy >>> res = scranpy.score_gene_set(normed, set=range(20)) >>> print(res)
- scranpy.score_gene_set_se(x, set, block=None, num_threads=1, more_score_args={}, assay_type='logcounts')[source]¶
Compute a gene set activity score for each cell based on the expression values of the genes in the set. This calls
score_gene_set()on an assay of aSummarizedExperiment.- Parameters:
x (
SummarizedExperiment) – ASummarizedExperimentobject or one of its subclasses. Rows correspond to genes and columns correspond to cells.set (
Sequence) – Names or indices of genes in the set, seescore_gene_set()for details.block (
Optional[Sequence]) – Block assignment for each cell, passed toscore_gene_set().num_threads (
int) – Number of threads, passed toscore_gene_set().more_score_args (
dict) – Additional arguments to pass toscore_gene_set().assay_type (
Union[str,int]) – Name or index of the assay ofxfrom which to compute the gene set scores.
- Return type:
- Returns:
A
NamedListcontaining per-cell scores and per-gene weights, seescore_gene_set()for details.
Examples
>>> import scranpy >>> sce = scranpy.get_test_rna_data_se("norm") >>> custom_set = [0, 1, 4, 5, 7] >>> custom_scores = scranpy.score_gene_set_se(sce, custom_set) >>> custom_scores["scores"]
- scranpy.score_markers(x, groups, block=None, block_average_policy='mean', block_weight_policy='variable', variable_block_weight=(0, 1000), block_quantile=0.5, threshold=0, compute_group_mean=True, compute_group_detected=True, compute_cohens_d=True, compute_auc=True, compute_delta_mean=True, compute_delta_detected=True, compute_summary_min=True, compute_summary_mean=True, compute_summary_median=True, compute_summary_max=True, compute_summary_quantiles=None, compute_summary_min_rank=True, min_rank_limit=500, all_pairwise=False, num_threads=1)[source]¶
Score marker genes for each group using a variety of effect sizes from pairwise comparisons between groups. This includes Cohen’s d, the area under the curve (AUC), the difference in the means (delta-mean) and the difference in the proportion of detected cells (delta-detected).
- Parameters:
x (
Any) – A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. It is typically expected to contain log-expression values, e.g., fromnormalize_counts().groups (
Sequence) – Group assignment for each cell inx. This should have length equal to the number of columns inx.block (
Optional[Sequence]) – Array of length equal to the number of columns ofx, containing the block of origin (e.g., batch, sample) for each cell. AlternativelyNone, if all cells are from the same block.block_average_policy (
Literal['mean','quantile']) – Policy to use for average statistics across blocks. This can either be a (weighted)meanor aquantile. Only used ifblockis supplied.block_weight_policy (
Literal['variable','equal','none']) – Policy to use for weighting different blocks when computing the average for each statistic. Only used ifblockis provided.variable_block_weight (
Tuple) – Parameters for variable block weighting. This should be a tuple of length 2 where the first and second values are used as the lower and upper bounds, respectively, for the variable weight calculation. Only used ifblockis provided andblock_weight_policy = "variable".block_quantile (
float) – Probability of the quantile of statistics across blocks. Defaults to 0.5, i.e., the median of per-block statistics. Only used ifblockis provided andblock_average_policy ="quantile".threshold (
float) – Non-negative value specifying the minimum threshold on the differences in means (i.e., the log-fold change, ifxcontains log-expression values). This is incorporated into the calculation for Cohen’s d and the AUC.compute_group_mean (
bool) – Whether to compute the group-wise mean expression for each gene.compute_group_detected (
bool) – Whether to compute the group-wise proportion of detected cells for each gene.compute_cohens_d (
bool) – Whether to compute Cohen’s d, i.e., the ratio of the difference in means to the standard deviation.compute_auc (
bool) – Whether to compute the AUC. Setting this toFalsecan improve speed and memory efficiency.compute_delta_mean (
bool) – Whether to compute the delta-means, i.e., the log-fold change whenxcontains log-expression values.compute_delta_detected (
bool) – Whether to compute the delta-detected, i.e., differences in the proportion of cells with detected expression.compute_summary_min (
bool) – Whether to compute the minimum as a summary statistic for each effect size. Only used ifall_pairwise = False.compute_summary_mean (
bool) – Whether to compute the mean as a summary statistic for each effect size. Only used ifall.pairwise = False.compute_summary_median (
bool) – Whether to compute the median as a summary statistic for each effect size. Only used ifall_pairwise = False.compute_summary_max (
bool) – Whether to compute the maximum as a summary statistic for each effect size. Only used ifall_pairwise = False.compute_summary_quantiles (
Optional[Sequence]) – Probabilities of quantiles to compute as summary statistics for each effect size. This should be in [0, 1] and sorted in order of increasing size. IfNone, no quantiles are computed. Only used ifall_pairwise = False.compute_summary_min_rank (
bool) – Whether to compute the mininum rank as a summary statistic for each effect size. IfNone, no quantiles are computed. Only used ifall_pairwise = False.min_rank_limit (
int) – Maximum value of the min-rank to report. Lower values improve memory efficiency at the cost of discarding information about lower-ranked genes. Only used ifall_pairwise = Falseandcompute_summary_min_rank = True.all_pairwise (
bool) – Whether to report the full effects for every pairwise comparison between groups. Alternatively, an integer scalar indicating the number of top markers to report from each pairwise comparison between groups. IfFalse, only summaries are reported.num_threads (
int) – Number of threads to use.
- Return type:
- Returns:
A
NamedListcontaining various marker statistics for each group. This has the following entries:nrow: integer specifying the number of genes in the dataset.group_ids: list containing the identities of the groups.means: double-precision NumPy matrix containing the mean expression for each gene in each group. Each row is a gene and each column is a group, ordered as ingroup_ids. Omitted ifcompute_group_means = False.means: double-precision NumPy matrix containing the proportion of cells with detected expression for each gene in each group. Each row is a gene and each column is a group, ordered as ingroup_ids. Omitted ifcompute_group_means = False.
If
all_pairwise = False, theNamedListcontains the following additional entries:cohens_d: aNamedListwith the same structure as returned bysummarized_effects().Briefly, each entry corresponds to a group in
group_idsand is aBiocFramewith one row per gene. Each column contains a summary statistic of the Cohen’s d from pairwise comparisons to all other groups, e.g., min, mean, median, max, min-rank, and any requested quantiles. Columns are omitted if the relevantcompute_summary_*option is set toFalse.
auc: Same ascohens_dbut for the AUCs.delta_mean: Same ascohens_dbut for the delta-mean.delta_detected: Same ascohens_dbut for the delta-detected.
If
all_pairwise = True, theNamedListcontains the following addditional entries:cohens_d: a 3-dimensional double-precision NumPy array containing the Cohen’s d from each pairwise comparison between groups. The extents of the first two dimensions are equal to the number of groups, while the extent of the final dimension is equal to the number of genes. Specifically, the entry[i, j, k]represents Cohen’s d from the comparison of groupjover groupifor genek.auc: Same ascohens_dbut for the AUCs.delta_mean: Same ascohens_dbut for the delta-mean.delta_detected: Same ascohens_dbut for the delta-detected.
If
all_pairwiseis an integer, theNamedListcontains the following additional entries:cohens_d: aNamedListlist of named lists ofBiocFrameobjects. TheBiocFrameatcohens_d[m][n]contains the top markers for the comparison of groupmover groupn. EachBiocFramehas theindexandeffectcolumns, containing the row indices and effect sizes of the top genes, respectively.auc: Same ascohens_dbut for the AUCs.delta_mean: Same ascohens_dbut for the delta-mean.delta_detected: Same ascohens_dbut for the delta-detected.
Entries will be omitted if the relevant
compute_*option is set toFalse. For example, ifcompute_cohens_d = False, the output will not contain anycohens_dentry.
References
The
score_markers_summaryandscore_markers_pairwisefunctions in the scran_markers C++ library, which describes the rationale behind the choice of effect sizes and summary statistics. Also see their blocked equivalentsscore_markers_summary_blockedandscore_markers_pairwise_blockedwhenblockis provided.Examples
>>> import numpy >>> normed = numpy.random.rand(200, 100) >>> import scranpy >>> group = ["A", "B", "C", "D"] * 25 >>> res = scranpy.score_markers(normed, group) >>> print(res["cohens_d"]["A"])
- scranpy.score_markers_se(x, groups, block=None, num_threads=1, more_marker_args={}, assay_type='logcounts', extra_columns=None, order_by=True)¶
Identify candidate marker genes based on effect sizes from pairwise comparisons between groups of cells. This calls
score_markers()on an assay of aSummarizedExperiment, and then usesformat_score_markers_result()to reformat the results.- Parameters:
x (
SummarizedExperiment) – ASummarizedExperimentobject or one of its subclasses. Rows correspond to genes and columns correspond to cells.groups (
Sequence) – Group assignment for each cell, passed toscore_markers().block (
Optional[Sequence]) – Block assignment for each cell, passed toscore_markers().num_threads (
int) – Number of threads for marker scoring, passed toscore_markers().more_marker_args (
dict) – Additional arguments to pass toscore_markers().assay_type (
Union[str,int]) – Name or index of the assay ofxto use for marker detection, usually containing log-normalized expression values.extra_columns (
Union[Sequence,str,NamedList,None]) – ABiocFramewith the same number of rows asx, containing extra columns to add each DataFrame. Alternatively, a list of strings specifying the columns of the row data ofxto be added. A single string is treated as a list of length 1.order_by (
Union[str,bool,None]) – Name of the column to use for ordering the rows of each outputBiocFrame. AlternativelyTrue, in which case a column is automatically chosen from the effect size summaries. IfNoneorFalse, no ordering is performed.
- Return type:
- Returns:
A
NamedListofEach ``BiocFrame`corresponds to a unique group ingroups. Each row contains statistics for a gene inx, with the following columns:mean, the mean expression in the current group.detected, the proportion of cells with detected expression in the current group.<effect>_<summary>, a summary statistic for an effect size. For example,cohens_d_meancontains the mean Cohen’s d across comparisons involving the current group.
Examples
>>> import scranpy >>> sce = scranpy.get_test_rna_data_se("cluster") >>> markers = scranpy.score_markers_se(sce, sce.get_column_data()["clusters"]) >>> print(scranpy.preview_markers(markers["0"]))
- scranpy.subsample_by_neighbors(x, num_neighbors=20, min_remaining=10, nn_parameters=<knncolle.annoy.AnnoyParameters object>, num_threads=1)¶
Subsample a dataset by selecting cells to represent all of their nearest neighbors.
- Parameters:
x (
Union[ndarray,FindKnnResults,Index]) –Numeric matrix where rows are dimensions and columns are cells, typically containing a low-dimensional representation from, e.g.,
run_pca().Alternatively, a
Indexobject containing a pre-built search index for a dataset.Alternatively, a
FindKnnResultsobject containing pre-computed search results for a dataset.num_neighbors (
int) – Number of neighbors to use. Larger values result in greater downsampling. Ignored ifxis aFindKnnResultsobject.nn_parameters (
Parameters) – Neighbor search algorithm to use. Only used ifxdoes not contain existing neighbor search results.min_remaining (
int) – Minimum number of remaining (i.e., unselected) neighbors that a cell must have in order to be considered for selection. This should be less than or equal tonum_neighbors.num_threads (
int) – Number of threads to use for the nearest-neighbor search. Only used ifxdoes not contain existing neighbor search results.
- Return type:
- Returns:
Integer NumPy array containing the indices of the cells selected to be in the subsample.
References
https://libscran.github.io/nenesub, for the rationale behind this approach.
Examples
>>> import numpy >>> pcs = numpy.random.rand(20, 500) >>> import scranpy >>> keep = scranpy.subsample_by_neighbors(pcs) >>> print(keep)
- scranpy.suggest_adt_qc_thresholds(metrics, block=None, min_detected_drop=0.1, num_mads=3.0)[source]¶
Suggest filter thresholds for the ADT-derived QC metrics from
compute_adt_qc_metrics().- Parameters:
metrics (
BiocFrame) – ADT-derived QC metrics generated bycompute_adt_qc_metrics(). This should contain thesum,detectedandsubset_sumcolumns.block (
Optional[Sequence]) – Blocking factor specifying the block of origin (e.g., batch, sample) for each cell inmetrics. If supplied, a separate threshold is computed from the cells in each block. AlternativelyNone, if all cells are from the same block.min_detected_drop (
float) – Minimum proportional drop in the number of detected ADTs to consider a cell to be of low quality. This avoids overly aggressive filtering when the MAD is zero.num_mads (
float) – Number of MADs from the median to define the threshold for outliers in each QC metric.
- Return type:
- Returns:
If
block = None, aNamedListis returned, containing the following entries.detected: a number specifying the lower threshold on the number of detected ADTs.subsets: aFloatListof length equal to the number of control subsets (and named accordingly). Each entry represents the upper bound on the sum of counts in the corresponding control subset.
If
blockis provided, theNamedListinstead contains:detected, aFloatListof length equal to the number of blocks (and named accordingly). Each entry represents the lower threshold on the number of detected ADTs in the corresponding block.subset_sum, aNamedListof length equal to the number of control subsets. Each entry is anotherFloatListthat contains the upper threshold on the sum of counts for that subset in each block.block_ids, a list containing the unique levels of the blocking factor. This is in the same order as the blocks indetectedandsubset_sum.
References
The
compute_adt_qc_filtersandcompute_adt_qc_filters_blockedfunctions in the scran_qc C++ library, which describe the rationale behind the suggested filters.Examples
>>> import numpy >>> mat = numpy.reshape(numpy.random.poisson(lam=5, size=1000), (50, 20)) >>> import scranpy >>> res = scranpy.compute_adt_qc_metrics(mat, { "IgG": [ 1, 10, 20, 40 ] }) >>> filt = scranpy.suggest_adt_qc_thresholds(res) >>> print(filt)
- scranpy.suggest_crispr_qc_thresholds(metrics, block=None, num_mads=3.0)[source]¶
Suggest filter thresholds for the CRISPR-derived QC metrics, typically generated from
compute_crispr_qc_metrics().- Parameters:
metrics (
BiocFrame) – CRISPR-derived QC metrics fromcompute_crispr_qc_metrics(). This should contain thesum,detected,max_valueandmax_indexcolumns.block (
Optional[Sequence]) – Blocking factor specifying the block of origin (e.g., batch, sample) for each cell inmetrics. If supplied, a separate threshold is computed from the cells in each block. AlternativelyNone, if all cells are from the same block.num_mads (
float) – Number of MADs from the median to define the threshold for outliers in each QC metric.
- Return type:
- Returns:
If
block = None, aNamedListis returned, containing the following entries.max_value, a number specifying the lower threshold on the maximum count in each cell.
If
blockis provided, theNamedListinstead contains:max_value, a FloatList of length equal to the number of blocks (and named accordingly). Each entry represents the lower threshold on the maximum count in the corresponding block.block_ids, a list containing the unique levels of the blocking factor. This is in the same order as the blocks indetectedandsubset_sum.
References
The
compute_crispr_qc_filtersandcompute_crispr_qc_filters_blockedfunctions in the scran_qc C++ library, which describes the rationale behind the suggested filters.Examples
>>> import numpy >>> mat = numpy.reshape(numpy.random.poisson(lam=5, size=1000), (50, 20)) >>> import scranpy >>> res = scranpy.compute_crispr_qc_metrics(mat) >>> filt = scranpy.suggest_crispr_qc_thresholds(res) >>> print(filt)
- scranpy.suggest_rna_qc_thresholds(metrics, block=None, num_mads=3.0)[source]¶
Suggest filter thresholds for the RNA-derived QC metrics, typically generated from
compute_rna_qc_metrics().- Parameters:
metrics (
BiocFrame) – RNA-derived QC metrics fromcompute_rna_qc_metrics(). This should contain thesum,detectedandsubset_proportioncolumns.block (
Optional[Sequence]) – Blocking factor specifying the block of origin (e.g., batch, sample) for each cell inmetrics. If supplied, a separate threshold is computed from the cells in each block. AlternativelyNone, if all cells are from the same block.num_mads (
float) – Number of MADs from the median to define the threshold for outliers in each QC metric.
- Return type:
- Returns:
If
block = None, aNamedListis returned, containing the following entries.sum, a number specifying the lower threshold on the sum of counts in each cell.detected, a number specifying the lower threshold on the number of detected genes.subset_proportion, aFloatListof length equal to the number of control subsets (and named accordingly). Each entry represents the upper bound on the proportion of counts in the corresponding control subset.
If
blockis provided, theNamedListinstead contains:sum, a FloatList of length equal to the number of blocks (and named accordingly). Each entry represents the lower threshold on the sum of counts in the corresponding block.detected, a FloatList of length equal to the number of blocks (and named accordingly). Each entry represents the lower threshold on the number of detected genes in the corresponding block.subset_proportion, aNamedListof length equal to the number of control subsets. Each entry is another FloatList that contains the upper threshold on the proportion of counts for that subset in each block.block_ids, a list containing the unique levels of the blocking factor. This is in the same order as the blocks indetectedandsubset_sum.
References
The
compute_rna_qc_filtersandcompute_rna_qc_filters_blockedfunctions in the scran_qc C++ library, which describes the rationale behind the suggested filters.Examples
>>> import numpy >>> mat = numpy.reshape(numpy.random.poisson(lam=5, size=1000), (50, 20)) >>> import scranpy >>> res = scranpy.compute_rna_qc_metrics(mat, { "mito": [ 1, 10, 20, 40 ] }) >>> filt = scranpy.suggest_rna_qc_thresholds(res) >>> print(filt)
- scranpy.summarize_effects(effects, compute_min=True, compute_mean=True, compute_median=True, compute_max=True, compute_quantiles=None, compute_min_rank=True, num_threads=1)[source]¶
For each group, summarize the effect sizes for all pairwise comparisons to other groups. This yields a set of summary statistics that can be used to rank marker genes for each group.
- Parameters:
effects (
ndarray) – A 3-dimensional numeric containing the effect sizes from each pairwise comparison between groups. The extents of the first two dimensions should be equal to the number of groups, while the extent of the final dimension is equal to the number of genes. The entry[i, j, k]should represent the effect size from the comparison of groupjagainst groupifor genek. See also the output ofscore_markers()withall_pairwise = True.compute_min (
bool) – Whether to compute the minimum as a summary statistic for each effect size.compute_mean (
bool) – Whether to compute the mean as a summary statistic for each effect size.compute_median (
bool) – Whether to compute the median as a summary statistic for each effect size.compute_max (
bool) – Whether to compute the maximum as a summary statistic for each effect size.compute_quantiles (
Optional[Sequence]) – Probabilities of quantiles to compute as summary statistics for each effect size. This should be in [0, 1] and sorted in order of increasing size. IfNone, no quantiles are computed.compute_min_rank (
bool) – Whether to compute the mininum rank as a summary statistic for each effect size. IfNone, no quantiles are computed.num_threads (
int) – Number of threads to use.
- Return type:
- Returns:
A
NamedListof length equal to the number of groups (i.e., the extents of the first two dimensions ofeffects). Each entry is aBiocFramewhere each row corresponds to a gene ineffects. Each column contain a summary statistic of the effect sizes of the comparisons involving its corresponding group.min: double-precision NumPy array containing the minimum effect size across all pairwise comparisons to other groups. Only present ifcompute_min = true.median: double-precision NumPy array containing the median effect size across all pairwise comparisons to other groups. Only present ifcompute_median = True.mean: double-precision NumPy array containing the mean effect size from across all pairwise comparisons to other groups. Only present ifcompute_median = True.quantile: nestedBiocFramecontaining the specified quantiles of the effect sizes across all pairwise comparisons to other groups. Only present ifcompute_quantilesis provided.max: double-precision NumPy array containing the maximum effect size across all pairwise comparisons to other groups. Only present ifcompute_max = true.min_rank: integer array containing the minimum rank of each gene across all pairwise comparisons to other groups. Only present ifcompute_min_rank = true.
References
The
summarize_effectsfunction in the scran_markers C++ library, for more details on the statistics.Examples
>>> import numpy >>> normed = numpy.random.rand(200, 100) >>> import scranpy >>> group = ["A", "B", "C", "D"] * 25 >>> res = scranpy.score_markers(normed, group, all_pairwise=True) >>> summaries = scranpy.summarize_effects(res["cohens_d"]) >>> print(summaries[0])
- scranpy.test_enrichment(x, sets, universe, log=False, num_threads=1)[source]¶
Perform a hypergeometric test for enrichment of interesting genes (e.g., markers) in one or more pre-defined gene sets.
- Parameters:
x (
Sequence) – Sequence of identifiers for the interesting genes.sets (
Union[dict,Sequence]) –Sequence of gene sets, where each entry corresponds to a gene set and contains a sequence of identifiers for genes in that set.
Alternatively, a dictionary where each key is the name of a gene set and each value is a sequence of identifiers for that gene set.
universe (
Union[int,Sequence]) –Sequence of identifiers for the universe of genes in the dataset. It is assumed that
xis a subset ofuniverse. Identifiers insetsthat are not inuniversewill be ignored.Alternatively, an integer specifying the number of genes in the universe.
log (
bool) – Whether to report the log-transformed p-values.num_threads (
int) – Number of threads to use.
- Return type:
- Returns:
Double-precision NumPy array of (log-transformed) p-values to test for significant enrichment of
xin each entry ofsets.
References
https://libscran.github.io/phyper, for the underlying implementation.
Examples
>>> LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" >>> import scranpy >>> scranpy.test_enrichment( >>> x=["A", "B", "C", "D", "E"], >>> sets={ >>> "set1": LETTERS[:10], >>> "set2": ["B", "D", "F", "H", "J"], >>> "set3": LETTERS[10:20] >>> }, >>> universe=LETTERS >>> )
- scranpy.tsne_perplexity_to_neighbors(perplexity)¶
Determine the number of nearest neighbors required to support a given perplexity in the t-SNE algorithm.
- Parameters:
perplexity (
float) – Perplexity to use inrun_tsne().- Return type:
- Returns:
The corresponding number of nearest neighbors.
Examples
>>> import scranpy >>> scranpy.tsne_perplexity_to_neighbors(10) >>> scranpy.tsne_perplexity_to_neighbors(30) >>> scranpy.tsne_perplexity_to_neighbors(50)