scranpy package

Module contents

scranpy.aggregate_across_cells(x, factors, num_threads=1)[source]

Aggregate expression values across cells based on one or more grouping factors. This is primarily used to create pseudo-bulk profiles for each cluster/sample combination.

Parameters:
  • x (Any) – A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. Values are expected to be counts.

  • factors (Union[dict, Sequence, NamedList, BiocFrame]) – One or more grouping factors, see combine_factors(). Each entry should be a sequence of length equal to the number of columns in x.

  • num_threads (int) – Number of threads to use for aggregation.

Return type:

NamedList

Returns:

A NamedList containing the following entries.

  • sum: double-precision NumPy matrix where each row corresponds to a gene and each column corresponds to a unique combination of grouping levels. Each matrix entry contains the summed expression across all cells with that combination.

  • detected: integer NumPy matrix where each row corresponds to a gene and each column corresponds to a unique combination of grouping levels. Each matrix entry contains the number of cells with detected expression in that combination.

  • combinations: a BiocFrame containing all unique combinations of levels across factors. Each column corresponds to an entry of factors while each row corresponds to a combination. Specifically, the i-th combination is defined as the i-th elements of all columns. Combinations are in the same order as the columns of sum and detected.

  • counts: an integer NumPy array containing the number of cells associated with each combination in combinations.

  • index: an Integer NumPy array of length equal to the number of cells. This specifies the combination combinations associated with each cell in x.

References

The aggregate_across_cells function in the scran_aggregate C++ library.

Examples

>>> import numpy
>>> mat = numpy.random.rand(100, 20)
>>> import scranpy
>>> clusters = ["A", "B", "C", "D"] * 5
>>> blocks = [1] * 4 + [2] * 4 + [3] * 4 + [4] * 4 + [5] * 4
>>> aggr = scranpy.aggregate_across_cells(mat, { "clusters": clusters, "blocks": blocks })
>>> aggr["sum"][:5,]
>>> print(aggr["combinations"])
scranpy.aggregate_across_cells_se(x, factors, num_threads=1, more_aggr_args={}, assay_type='counts', output_prefix='factor_', counts_name='counts', meta_name='aggregated', include_coldata=True, more_coldata_args={}, altexps=None, copy_altexps=False)

Aggregate expression values across groups of cells for each gene, storing the result in a SummarizedExperiment. This calls aggregate_across_cells() along with aggregate_column_data().

Parameters:
  • x – A SummarizedExperiment object or one of its subclasses. Rows correspond to genes and columns correspond to cells.

  • factors (Union[dict, Sequence, NamedList, BiocFrame]) – One or more grouping factors, see the argument of the same name in aggregate_across_cells().

  • num_threads (int) – Passed to aggregate_across_cells().

  • more_aggr_args (dict) – Further arguments to pass to aggregate_across_cells().

  • assay_type (Union[str, int]) – Name or index of the assay of x to be aggregated.

  • output_prefix (Optional[str]) – Prefix to add to the names of the columns containing the factor combinations in the column data of the output object. If None, no prefix is added.

  • counts_name (str) – Name of the column in which to store the cell count for each factor combination in the column data of the output object. If None, the cell counts are not reported.

  • meta_name (Optional[str]) – Name of the metadata entry in which to store additional information like the combination indices in the output object. If None, additional outputs are not reported.

  • include_coldata (bool) – Whether to add the aggregated column data from x to the output. If True, this adds the output of aggregate_column_data() to the column data of the output object.

  • more_coldata_args (dict) – Additional arguments to pass to aggregate_column_data(). Only relevant if include_coldata = True.

  • altexps (Union[str, int, dict, Sequence, NamedList, None]) –

    List of integers or strings, containing the indices or names of alternative experiments of x to aggregate. The aggregated assay from each alternative experiment is determined by assay_type.

    Alternatively, this may be a single integer or string containing the index or name of one alternative experiment to aggregate. Again, the aggregated assay from each alternative experiment is determined by assay_type.

    Alternatively, this may be a dictionary where keys are strings and values are integers or strings. Each key should be the name of an alternative experiment while each value is the index/name of the assay to aggregate from that experiment.

    Alternatively, a NamedList of integers or strings. If named, this is treated as a dictionary, otherwise it is treated as a list.

    Only relevant if x is a SingleCellExperiment or one of its subclasses.

  • copy_altexps (bool) – Whether to copy the column data and metadata of the output SingleCellExperiment into each of its alternative experiments. Only relevant if x is a SingleCellExperiment or one of its subclasses.

Return type:

SummarizedExperiment

Returns:

A SummarizedExperiment where each column corresponds to a factor combination. Each row corresponds to a gene in x, and the row data is taken directly from x. The assays contain the sum of counts (sum) and the number of detected cells (detected) in each combination for each gene. The column data contains:

  • The factor combinations, with column names prefixed by output_prefix.

  • The cell count for each combination, named by counts_name.

  • Additional column data from x if include_coldata = True. This is aggregated with aggregate_column_data`() on the combination indices.

The metadata contains a meta_name entry, which is a list with an index integer vector of length equal to the number of cells in x. Each entry of this vector is an index of the factor combination (i.e., column of the output object) to which the corresponding cell was assigned.

If altexps is specified, a SingleCellExperiment is returned instead. The same aggregation described above for the main experiment is applied to each alternative experiment. If copy_altexps = True, the column data for each alternative experiment will contain a copy of the factor combinations and cell counts, and the metadata will contain a copy of the index vector.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_rna_data_se("start")
>>> aggr = scranpy.aggregate_across_cells_se(sce, factors=[sce.get_column_data()["level1class"]])
>>> aggr.get_assay(0)[:5,]
>>> print(aggr.get_column_data())
>>>
>>> # We can also aggregate within alternative experiments.
>>> aggr2 = scranpy.aggregate_across_cells_se(sce, factors=[sce.get_column_data()["level1class"]], altexps=["ERCC"])
>>> aggr2.get_alternative_experiment("ERCC").get_assay(0)[:5,]
scranpy.aggregate_across_genes(x, sets, row_names=None, average=False, num_threads=1)[source]

Aggregate expression values across genes, potentially with weights. This is typically used to summarize expression values for gene sets into a single per-cell score.

Parameters:
  • x (Any) – Matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. Values are expected to be log-expression values.

  • sets (Union[dict, Sequence, NamedList]) –

    Sequence of gene sets. Each gene set may be represented by:

    • A sequence of integers, specifying the row indices of the genes in that set.

    • A sequence of strings, specifying the row names of the genes in that set. If any strings are present, row_names should also be supplied.

    • A tuple of length 2, containing a sequence of strings/integers (row names/indices) and a numeric array (weights).

    • A BiocFrame where each row corresponds to a gene. The first column contains the row names/indices and the second column contains the weights.

    Alternatively, a dictionary may be supplied where each key is the name of a gene set and each value is a sequence/tuple as described above. The keys will be used to name the output NamedList.

    Alternatively, a NamedList where each entry is a gene set represented by a sequence/tuple as described above. If names are available, they will be used to name the output NamedList.

  • row_names (Optional[Sequence]) – Sequence of strings of length equal to the number of rows of x, containing the name of each gene. Duplicate names are allowed but only the first occurrence will be used. If None, rows are assumed to be unnamed.

  • average (bool) – Whether to compute the average rather than the sum.

  • num_threads (int) – Number of threads to be used for aggregation.

Return type:

NamedList

Returns:

A NamedList of length equal to that of sets. Each entry is a numeric vector of length equal to the number of columns in x, containing the (weighted) sum/mean of expression values for the corresponding set across all cells.

References

The aggregate_across_genes function in the scran_aggregate C++ library, which implements the aggregation.

Examples

>>> import numpy
>>> mat = numpy.random.rand(100, 20)
>>> import scranpy
>>> sets = {
>>>     "foo": [ 1, 3, 5, 7, 9 ],
>>>     "bar": range(10, 40, 2)
>>> }
>>> aggr = scranpy.aggregate_across_genes(mat, sets)
>>> print(aggr.get_names())
>>> print(aggr[0])
scranpy.aggregate_across_genes_se(x, sets, num_threads=1, more_aggr_args={}, assay_type='logcounts', output_name=None)[source]

Aggregate expression values across sets of genes for each cell. This calls aggregate_across_genes() on an assay from a SummarizedExperiment.

Parameters:
Return type:

SummarizedExperiment

Returns:

A SummarizedExperiment with number of rows equal to the number of gene sets. The lone assay contains the aggregated values for each gene set for all cells. The column data is the same as that of x. If sets is named, the names are used as the row names of the output.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_rna_data_se("norm")
>>> sets = { "foo": [ 0, 2, 5, 10 ], "bar": [ 1, 3, 11, 17, 23 ] }
>>> aggregated = scranpy.aggregate_across_genes_se(sce, sets)
>>> aggregated.get_assay(0)[:,:10]
scranpy.aggregate_column_data(coldata, index, number, only_simple=True, placeholder=None)

Aggregate the column data from a SummarizedExperiment for groups of cells.

Parameters:
  • coldata (BiocFrame) – A BiocFrame containing the column data for a SummarizedExperiment. Each row should correspond to a cell.

  • index (Sequence) – Vector of length equal to the number of cells. Each entry should be the index of the factor combination to which each cell in coldata was assigned, e.g., the index vector produced by combine_factors().

  • number (int) – Total number of unique factor combinations. All elements of index should be less than number.

  • only_simple (bool) – Whether to skip a column of coldata that is not a list, NumPy array, NamedList or Factor.

  • placeholder – Placeholder value to store in the output column when a factor combination does not have a single unique value.

Return type:

BiocFrame

Returns:

A BiocFrame with number of rows equal to number. Each “simple” column in coldata (i.e., list, NumPy array, NamedList or Factor) is represented by a column in the output BiocFrame. In each column, the j-th entry is equal to the unique value of all rows where index == j, or placeholder if there is not exactly one unique value. If only_simple = False, any non-simple columns of coldata are represented in the output BiocFrame by a list of placeholder values. Otherwise, if only_simple = True, any non-simple columns of coldata are skipped.

Examples

>>> import biocframe
>>> df = biocframe.BiocFrame({
>>>     "X": ["a", "a", "b", "b", "c", "c"],
>>>     "Y": [  1,   1,   1,   2,   2,   2],
>>>     "Z": [True, False, True, False, True, False]
>>> })
>>> import scranpy
>>> print(scranpy.aggregate_column_data(df, [0, 0, 1, 1, 2, 2], 3))
scranpy.analyze(*args, **kwargs)[source]

Deprecated, use analyze_se() instead.

Return type:

NamedList

scranpy.analyze_se(x, rna_altexp=False, adt_altexp=None, crispr_altexp=None, rna_assay_type='counts', adt_assay_type='counts', crispr_assay_type='counts', block=None, block_name='block', rna_qc_subsets=[], rna_qc_output_prefix=None, more_rna_qc_args={}, adt_qc_subsets=[], adt_qc_output_prefix=None, more_adt_qc_args={}, crispr_qc_output_prefix=None, more_crispr_qc_args={}, filter_cells=True, rna_norm_output_name='logcounts', more_rna_norm_args={}, adt_norm_output_name='logcounts', more_adt_norm_args={}, crispr_norm_output_name='logcounts', more_crispr_norm_args={}, rna_hvg_output_prefix=None, more_rna_hvg_args={}, rna_pca_output_name='PCA', more_rna_pca_args={}, adt_pca_output_name='PCA', more_adt_pca_args={}, use_rna_pcs=True, use_adt_pcs=True, scale_output_name='combined', more_scale_args={}, mnn_output_name='MNN', more_mnn_args={}, more_umap_args={}, more_tsne_args={}, cluster_graph_output_name='graph_cluster', more_build_graph_args={}, more_cluster_graph_args={}, more_neighbor_args={}, kmeans_clusters=None, kmeans_clusters_output_name='kmeans_cluster', more_kmeans_args={}, clusters_for_markers=['graph', 'kmeans'], more_rna_marker_args={}, more_adt_marker_args={}, more_crispr_marker_args={}, nn_parameters=<knncolle.annoy.AnnoyParameters object>, num_threads=3)[source]

Execute a simple single-cell analysis pipeline, starting from a count matrix and ending with clusters, visualizations and markers. This is equivalent to:

Parameters:
  • x (SummarizedExperiment) – A SummarizedExperiment object or one of its subclasses. Rows correspond to genomic features (genes, ADTs, or CRISPR guides) and columns correspond to cells.

  • rna_altexp (Union[int, str, bool, None]) – Name or index of the alternative experiment of x containing the RNA data. Alternatively False, in which case the main experiment is assumed to contain the RNA data. Alternatively None, in which case it is assumed that no RNA data is available.

  • adt_altexp (Union[int, str, bool, None]) – Name or index of the alternative experiment of x containing the ADT data. Alternatively False, in which case the main experiment is assumed to contain the ADT data. Alternatively None, in which case it is assumed that no ADT data is available.

  • crispr_altexp (Union[int, str, bool, None]) – Name or index of the alternative experiment of x containing the CRISPR data. Alternatively False, in which case the main experiment is assumed to contain the CRISPR data. Alternatively None, in which case it is assumed that no CRISPR data is available.

  • rna_assay_type (Union[str, int]) – Name or index of the assay containing the RNA count data. Only used if rna_altexp is not None.

  • adt_assay_type (Union[str, int]) – Name or index of the assay containing the ADT count data. Only used if rna_altexp is not None.

  • crispr_assay_type (Union[str, int]) – Name or index of the assay containing the CRISPR count data. Only used if rna_altexp is not None.

  • block (Optional[Sequence]) – Blocking factor, as a sequence containing the block of origin (e.g., batch, sample) for each cell in x. Alternatively None, if all cells are from the same block.

  • block_name (Optional[str]) – Name of the column in which to store the blocking factor in the column data of the output SummarizedExperiment. Only used if block is not None. If None, the blocking factor is not stored in the output.

  • rna_qc_subsets (Union[dict, Sequence, NamedList]) – Passed to quick_rna_qc_se() as the subsets argument. Only used if rna_altexp is not None.

  • rna_qc_output_prefix (Optional[str]) – Passed to quick_rna_qc_se() as the output_prefix argument. Only used if rna_altexp is not None.

  • more_rna_qc_args (dict) – Additional arguments to pass to quick_rna_qc_se(). Only used if rna_altexp is not None.

  • adt_qc_subsets (Union[dict, Sequence, NamedList]) – Passed to quick_adt_qc_se() as the subsets argument. Only used if adt_altexp is not None.

  • adt_qc_output_prefix (Optional[str]) – Passed to quick_adt_qc_se() as the output_prefix argument. Only used if adt_altexp is not None.

  • more_adt_qc_args (dict) – Additional arguments to pass to quick_adt_qc_se(). Only used if adt_altexp is not None.

  • crispr_qc_output_prefix (Optional[str]) – Passed to quick_crispr_qc_se() as the output_prefix argument. Only used if crispr_altexp is not None.

  • more_crispr_qc_args (dict) – Additional arguments to pass to quick_crispr_qc_se(). Only used if crispr_altexp is not None.

  • filter_cells (bool) – Whether to filter x to only retain high-quality cells in all modalities. If False QC metrics and thresholds are still computed but are not used to filter the count matrices.

  • rna_norm_output_name (str) – Passed to normalize_rna_counts_se() as the output_name argument. Only used if rna_altexp is not None.

  • more_rna_norm_args (dict) – Additional arguments to pass to normalize_rna_counts_se(). Only used if rna_altexp is not None.

  • adt_norm_output_name (str) – Passed to normalize_adt_counts_se() as the output_name argument. Only used if adt_altexp is not None.

  • more_adt_norm_args (dict) – Additional arguments to pass to normalize_adt_counts_se(). Only used if adt_altexp is not None.

  • crispr_norm_output_name (str) – Passed to normalize_crispr_counts_se() as the output_name argument. Only used if crispr_altexp is not None.

  • more_crispr_norm_args (dict) – Additional arguments to pass to normalize_crispr_counts_se(). Only used if crispr_altexp is not None.

  • rna_hvg_output_prefix (Optional[str]) – Passed to choose_rna_hvgs_se() as the output_prefix argument. Only used if rna_altexp is not None.

  • more_rna_hvg_args (dict) – Additional arguments to pass to choose_rna_hvgs_se(). Only used if rna_altexp is not None.

  • rna_pca_output_name (str) – Passed to run_pca_se() as the output_name argument. Only used if rna_altexp is not None.

  • more_rna_pca_args (dict) – Additional arguments to pass to run_pca_se(). Only used if rna_altexp is not None.

  • adt_pca_output_name (str) – Passed to run_pca_se() as the output_name argument. Only used if adt_altexp is not None.

  • more_adt_pca_args (dict) – Additional arguments to pass to run_pca_se(). Only used if adt_altexp is not None.

  • use_rna_pcs (bool) – Whether to use the RNA-derived PCs for downstream steps (i.e., clustering, visualization). Only used if rna_altexp is not None.

  • use_adt_pcs (bool) – Whether to use the ADT-derived PCs for downstream steps (i.e., clustering, visualization). Only used if adt_altexp is not None.

  • scale_output_name (str) – Passed to scale_by_neighbors_se() as the output_name argument. Only used if multiple modalities are available and their corresponding use_*_pcs arguments are True.

  • more_scale_args (dict) – Additional arguments to pass to scale_by_neighbors_se(). Only used if multiple modalities are available and their corresponding use_*_pcs arguments are True.

  • mnn_output_name (str) – Passed to correct_mnn_se() as the output_name argument. Only used if block is not None.

  • more_mnn_args (dict) – Additional arguments to pass to correct_mnn_se(). Only used if block is not None.

  • more_tsne_args (dict) – Passed to run_all_neighbor_steps_se().

  • more_umap_args (dict) – Passed to run_all_neighbor_steps_se().

  • more_build_graph_args (dict) – Passed to run_all_neighbor_steps_se().

  • cluster_graph_output_name (str) – Passed to run_all_neighbor_steps_se() as cluster_output_name.

  • more_cluster_graph_args (dict) – Passed to run_all_neighbor_steps_se().

  • more_neighbor_args (dict) – Passed to run_all_neighbor_steps_se().

  • kmeans_clusters (Optional[int]) – Passed to cluster_kmeans_se() as the k argument. If None, k-means clustering is not performed.

  • kmeans_clusters_output_name – Passed to cluster_kmeans_se() as the output_name argument. Ignored if kmeans_clusters is None.

  • more_kmeans_args (dict) – Additional arguments to pass to cluster_kmeans_se(). Ignored if kmeans_clusters is None.

  • clusters_for_markers (Sequence[Literal['graph', 'kmeans']]) – List of clustering algorithms, specifying the clusters to be used for marker detection. The first available clustering will be chosen. If no clustering is available from the list, markers will not be computed.

  • more_rna_marker_args (dict) – Additional arguments to pass to score_markers_se() for the RNA data. Ignored if no suitable clusterings are available or if rna_altexp is None

  • more_adt_marker_args (dict) – Additional arguments to pass to score_markers_se() for the ADT data. Ignored if no suitable clusterings are available or if adt_altexp is None

  • more_crispr_marker_args (dict) – Additional arguments to pass to score_markers_se() for the CRISPR data. Ignored if no suitable clusterings are available or if crispr_altexp is None

  • nn_parameters – Parameters for the nearest-neighbor search.

  • num_threads (int) – Number of threads to use in each step.

Returns:

  • x: a SingleCellExperiment that is a copy of the input x. It is also decorated with the results of each analysis step.

  • markers: a list of list of BiocFrame objects containing the marker statistics for each modality. Each inner list corresponds to a modality (RNA, ADT, etc.) while each BiocFrame corresponds to a group.

Return type:

A NamedList containing

Examples

>>> import scranpy
>>> sce = scranpy.get_test_rna_data_se()
>>> is_mito = [ s.startswith("mt-") for s in sce.get_row_names() ]
>>> res = scranpy.analyze_se(
>>>     sce,
>>>     rna_qc_subsets={ "mito": is_mito }
>>> )
>>> res["x"].get_assay_names()
>>> res["x"].get_reduced_dimension_names()
>>> print(res["x"].get_column_data())
>>> import biocutils
>>> print(biocutils.table(res["x"].get_column_data()["graph_cluster"]))
>>> print(scranpy.preview_markers(res["markers"]["rna"][0]))
scranpy.build_snn_graph(x, num_neighbors=10, weight_scheme='ranked', num_threads=1, nn_parameters=<knncolle.annoy.AnnoyParameters object>)

Build a shared nearest neighbor (SNN) graph where each node is a cell. Edges are formed between cells that share one or more nearest neighbors, weighted by the number or importance of those shared neighbors.

Parameters:
  • x (Union[ndarray, FindKnnResults, Index]) –

    Numeric matrix where rows are dimensions and columns are cells, typically containing a low-dimensional representation from, e.g., run_pca().

    Alternatively, a FindKnnResults object containing existing neighbor search results.

    Alternatively, a Index object.

  • num_neighbors (int) – Number of neighbors in the nearest-neighbor graph. Larger values generally result in broader clusters during community detection. Ignored if x is a FindKnnResults object.

  • weight_scheme (Literal['ranked', 'number', 'jaccard']) – Weighting scheme to use for the edges of the SNN graph, based on the number or ranking of the shared nearest neighbors.

  • num_threads (int) – Number of threads to use.

  • nn_parameters (Parameters) – The algorithm to use for the nearest-neighbor search. Only used if x is not a pre-built nearest-neighbor search index or a list of existing nearest-neighbor search results.

Return type:

NamedList

Returns:

A NamedList containing the components of a (possibly weighted) graph.

  • vertices: integer specifying the number of vertices (i.e., cells) in the graph.

  • edges: integer NumPy array containing the graph edges. Pairs of values represent the endpoints of an (undirected) edge, i.e., edges[0:2] form the first edge, edges[2:4] form the second edge and so on.

  • weights: double-precision NumPy array of edge weights. This has length equal to half the length of edges; the first weight corresponds to the first edge, and so on.

References

The build_snn_graph function in the scran_graph_cluster C++ library.

Examples

>>> import numpy
>>> pcs = numpy.random.rand(10, 200)
>>> import scranpy
>>> graph = scranpy.build_snn_graph(pcs)
>>> print(graph.get_names())
scranpy.center_size_factors(size_factors, block=None, mode='lowest', in_place=False)

Center size factors before computing normalized values from the count matrix. This ensures that the normalized values are on the same scale as the original counts for easier interpretation.

Parameters:
  • size_factors (ndarray) – Floating-point array containing size factors for all cells.

  • block (Optional[Sequence]) – Block assignment for each cell. If provided, this should have length equal to the number of cells, where cells have the same value if and only if they are in the same block. Defaults to None, where all cells are treated as being part of the same block.

  • mode (Literal['lowest', 'per-block']) – How to scale size factors across blocks. lowest will scale all size factors by the lowest per-block average. per-block will center the size factors in each block separately. This argument is only used if block is provided.

  • in_place (bool) – Whether to modify size_factors in place. If False, a new array is returned. This argument only used if size_factors is double-precision, otherwise a new array is always returned.

Return type:

ndarray

Returns:

Double-precision NumPy array containing centered size factors. If in_place = True, this is a reference to size_factors.

References

The center_size_factors and center_size_factors_blocked functions in the scran_norm C++ library, which describes the rationale behind centering.

Examples

>>> import numpy
>>> sf = numpy.random.rand(100)
>>> import scranpy
>>> sf2 = scranpy.center_size_factors(sf)
>>> sf2.mean()
scranpy.choose_highly_variable_genes(stats, top=4000, larger=True, keep_ties=True, bound=None)[source]

Choose highly variable genes (HVGs), typically based on a variance-related statistic.

Parameters:
  • stats (ndarray) – Array of variances (or a related statistic) across all genes. Typically the residuals from model_gene_variances() used here.

  • top (int) – Number of top genes to retain. Note that the actual number of retained genes may not be equal to top, depending on the other options.

  • larger (bool) – Whether larger values of stats represent more variable genes. If true, HVGs are defined from the largest values of stats.

  • keep_ties (bool) – Whether to keep ties at the top-th most variable gene. This avoids arbitrary breaking of tied values.

  • bound (Optional[float]) – The lower bound (if larger = True) or upper bound (otherwise) to be applied to stats. Genes are not considered to be HVGs if they do not pass this bound, even if they are within the top genes. Ignored if None.

Return type:

ndarray

Returns:

Integer NumPy array containing the indices of genes in stats that are considered to be highly variable.

References

The choose_highly_variable_genes function from the scran_variances library, which provides the underlying implementation.

Examples

>>> import numpy
>>> resids = numpy.random.rand(1000)
>>> import scranpy
>>> chosen = scranpy.choose_highly_variable_genes(resids, top=100)
>>> chosen
scranpy.choose_pseudo_count(size_factors, quantile=0.05, max_bias=1, min_value=1)[source]

Choose a suitable pseudo-count to control the bias introduced by log-transformation of normalized counts.

Parameters:
  • size_factors (ndarray) – Floating-point array of size factors for all cells.

  • quantile (float) – Quantile to use for defining extreme size factors.

  • max_bias (float) – Maximum allowed bias in the log-fold changes between cells.

  • min_value (float) – Minimum value for the pseudo-count.

Return type:

float

Returns:

Choice of pseudo-count, for use in normalize_counts().

References

The choose_pseudo_count function in the scran_norm C++ library, which describes the rationale behind the choice of pseudo-count.

Examples

>>> import numpy
>>> sf = numpy.random.rand(1000)
>>> import scranpy
>>> sf = scranpy.center_size_factors(sf)
>>> pseudo = scranpy.choose_pseudo_count(sf)
>>> print(pseudo)
scranpy.choose_rna_hvgs_se(x, block=None, num_threads=1, more_var_args={}, top=4000, more_choose_args={}, assay_type='logcounts', output_prefix=None, include_per_block=False)

Model the mean-variance relationship across genes and choose highly variable genes (HVGs) based on the residuals of the fitted trend.

Parameters:
Return type:

SummarizedExperiment

Returns:

A copy of x, with per-gene variance modelling statistics added to its row data. An hvg column indicates the genes that were chosen as HVGs. If include_per_block = True and block is specified, the per-block statistics are stored as a nested BiocFrame in the per_block column.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_rna_data_se("norm")
>>> sce = scranpy.choose_rna_hvgs_se(sce, more_var_args={ "use_min_width": True })
>>> import biocutils
>>> print(biocutils.table(sce.get_row_data()["hvg"]))
scranpy.cluster_graph(x, method='multilevel', multilevel_resolution=1, leiden_resolution=1, leiden_objective='modularity', walktrap_steps=4, seed=42)[source]

Identify clusters of cells using a variety of community detection methods from a graph where similar cells are connected.

Parameters:
  • x (NamedList) – Components of the graph to be clustered, typically produced by build_snn_graph(). Each node of the graph should be a cell.

  • method (Literal['multilevel', 'leiden', 'walktrap']) – Community detection algorithm to use.

  • multilevel_resolution (float) – Resolution of the clustering when method = "multilevel". Larger values result in finer clusters.

  • leiden_resolution (float) – Resolution of the clustering when method = "leiden". Larger values result in finer clusters.

  • leiden_objective (Literal['modularity', 'cpm', 'er']) – Objective function to use when method = "leiden".

  • walktrap_steps (int) – Number of steps to use when method = "walktrap".

  • seed (int) – Random seed to use for method = "multilevel" or "leiden".

Return type:

NamedList

Returns:

A NamedList containing the following entries.

  • membership: an integer NumPy array containing the cluster assignment for each vertex, i.e., cell. All values are in [0, N) where N is the total number of clusters.

For method = "multilevel", the output also contains:

  • levels: a list containing the clustering at each level of the algorithm. Each entry corresponds to one level and is an integer NumPy array that contains the cluster assignment for each cell at that level.

  • modularity: a double-precision NumPy array containing the modularity at each level. This has length equal to that of levels, and the largest value corresponds to the assignments reported in membership.

For method = "walktrap", the output also contains:

  • merges: an integer NumPy matrix with two columns. Each row corresponds to a merge step and specifies the pair of cells or clusters that were merged at that step.

  • modularity: a double-precision NumPy array that contains the modularity score at each merge step.

For method = "leiden", the output also contains:

  • quality: quality of the clustering. This is the same as the modularity if leiden_objective = "modularity".

References

https://igraph.org/c/html/latest/igraph-Community.html, for the underlying implementation of each clustering method.

The various cluster_* functions in the scran_graph_cluster C++ library.

Examples

>>> import numpy
>>> pcs = numpy.random.rand(10, 200)
>>> import scranpy
>>> graph = scranpy.build_snn_graph(pcs)
>>> clust = scranpy.cluster_graph(graph)
>>> import biocutils
>>> print(biocutils.table(clust["membership"]))
scranpy.cluster_graph_se(x, num_neighbors=10, num_threads=1, more_build_args={}, method='multilevel', resolution=None, more_cluster_args={}, reddim_type='PCA', output_name='clusters', meta_name=None, graph_name=None)[source]

Construct a shared-nearest neighbor (SNN) graph from an existing low-dimensional embedding, and apply community detection algorithms to obtain clusters of cells. This calls build_snn_graph() followed by cluster_graph().

Parameters:
  • x (SingleCellExperiment) – A SingleCellExperiment or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.

  • num_neighbors (int) – Number of neighbors for constructing the SNN graph, see build_snn_graph() for details.

  • num_threads (int) – Number of threads for the neighbor search when constructing the SNN graph, see build_snn_graph() for details.

  • more_build_args (dict) – Additional arguments to be passed to build_snn_graph().

  • method (Literal['multilevel', 'leiden', 'walktrap']) – Community detection method to be used by cluster_graph().

  • resolution (Optional[float]) – Clustering resolution to be used by cluster_graph(). This is either passed as multilevel_resolution or leiden_resolution, depending on method.

  • more_cluster_args (dict) – Additional arguments to be passed to cluster_graph().

  • reddim_type (Union[int, str, tuple]) – Name or index of the existing reduced dimension of x to be used for clustering. Alternatively a tuple, where the first element contains the name of an alternative experiment of x, and the second element contains the name/index of an embedding in that alternative experiment.

  • output_name (str) – Name of the column of the column data in which to store the cluster assignments.

  • meta_name (Optional[str]) – Name of the metadta entry in which to store extra clustering output. If None, no extra clustering output is stored.

  • graph_name (Optional[str]) – Name of the metadta entry in which to store the SNN graph. If None, the graph is not stored.

Return type:

SingleCellExperiment

Returns:

x is returned with the cluster assignment for each cell stored in its column data. Additional clustering output is stored in its metadata.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_rna_data_se("pca")
>>> sce = scranpy.cluster_graph_se(sce)
>>> import biocutils
>>> print(biocutils.table(sce.get_column_data()["clusters"]))
scranpy.cluster_kmeans(x, k, init_method='var-part', refine_method='hartigan-wong', var_part_optimize_partition=True, var_part_size_adjustment=1, lloyd_iterations=100, hartigan_wong_iterations=10, hartigan_wong_quick_transfer_iterations=50, hartigan_wong_quit_quick_transfer_failure=False, seed=5489, num_threads=1)[source]

Perform k-means clustering with a variety of different initialization and refinement algorithms.

Parameters:
  • x (ndarray) – Input data matrix where rows are dimensions and columns are observations (i.e., cells).

  • k (int) – Number of clusters.

  • init_method (Literal['var-part', 'kmeans++', 'random']) – Initialization method for defining the initial centroid coordinates. Choices are variance partitioning (var-part), kmeans++ (kmeans++) or random initialization (random).

  • refine_method (Literal['hartigan-wong', 'lloyd']) – Method to use to refine the cluster assignments and centroid coordinates. Choices are Lloyd’s algorithm (lloyd) or the Hartigan-Wong algorithm (hartigan-wong).

  • var_part_optimize_partition (bool) – Whether each partition boundary should be optimized to reduce the sum of squares in the child partitions. Only used if init_method = "var-part".

  • var_part_size_adjustment (float) – Floating-point value between 0 and 1, specifying the adjustment to the cluster size when prioritizing the next cluster to partition. Setting this to 0 will ignore the cluster size while setting this to 1 will generally favor larger clusters. Only used if init_method = "var-part".

  • lloyd_iterations (int) – Maximmum number of iterations for the Lloyd algorithm.

  • hartigan_wong_iterations (int) – Maximmum number of iterations for the Hartigan-Wong algorithm.

  • hartigan_wong_quick_transfer_iterations (int) – Maximmum number of quick transfer iterations for the Hartigan-Wong algorithm.

  • hartigan_wong_quit_quick_transfer_failure (bool) – Whether to quit the Hartigan-Wong algorithm upon convergence failure during quick transfer iterations.

  • seed (int) – Seed to use for random or kmeans++ initialization.

  • num.threads – Number of threads to use.

Return type:

NamedList

Returns:

A NamedList containing the following entries.

  • clusters: an integer NumPy array containing the cluster assignment for each cell. Values are integers in [0, N) where N is the total number of clusters.

  • centers: a double-precision NumPy matrix containing the coordinates of the cluster centroids. Dimensions are in the rows while centers are in the columns.

  • iterations: integer specifying the number of refinement iterations that were performed.

  • status: convergence status. Any non-zero value indicates a convergence failure though the exact meaning depends on the choice of refine_method.

    • For Lloyd, a value of 2 indicates convergence failure.

    • For Hartigan-Wong, a value of 2 indicates convergence failure in the optimal transfer iterations. A value of 4 indicates convergence failure in the quick transfer iterations when hartigan_wong_quit_quick_transfer_failure = True.

References

https://ltla.github.io/CppKmeans, which describes the various initialization and refinement algorithms in more detail.

Examples

>>> import numpy
>>> pcs = numpy.random.rand(10, 200)
>>> import scranpy
>>> clust = scranpy.cluster_kmeans(pcs, k=3)
>>> import biocutils
>>> print(biocutils.table(clust["clusters"]))
scranpy.cluster_kmeans_se(x, k, num_threads=1, more_kmeans_args={}, reddim_type='PCA', output_name='clusters', meta_name=None)[source]

Perform k-means clustering on an existing low-dimensional embedding. This calls cluster_kmeans() on reduced dimensions from a SingleCellExperiment.

Parameters:
  • x (SingleCellExperiment) – A SingleCellExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.

  • k (int) – Number of clusters, to be passed to cluster_kmeans().

  • num_threads (int) – Number of threads, to be passed to cluster_kmeans().

  • more_kmeans_args (dict) – Additional arguments to be passed to cluster_kmeans().

  • reddim_type (Union[str, int, tuple]) – Name or index of an existing embedding in x, on which to perform the clustering. Alternatively a tuple, where the first element contains the name of an alternative experiment of x, and the second element contains the name/index of an embedding in that alternative experiment.

  • output_name (str) – Name of the column of the column data in which to store the cluster assignments.

  • meta_name (Optional[str]) – Name of the metadta entry in which to store extra clustering output. If None, no extra clustering output is stored.

Return type:

SingleCellExperiment

Returns:

A copy of x, with the cluster assignment for each cell stored in its column data. Additional clustering output is stored in its metadata.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_rna_data_se("pca")
>>> sce = scranpy.cluster_kmeans_se(sce, k=10)
>>> import biocutils
>>> print(biocutils.table(sce.get_column_data()["clusters"]))
scranpy.combine_factors(factors, keep_unused=False)

Combine multiple categorical factors based on the unique combinations of levels from each factor.

Parameters:
  • factors (Union[dict, Sequence, NamedList, BiocFrame]) –

    Sequence of factors of interest. Each entry corresponds to a factor and should be a sequence of the same length. Corresponding elements across all factors represent the combination of levels for a single observation.

    Alternatively, a dictionary where each entry corresponds to a factor. Each key is the name of the factor and value is a sequence containing the factor.

    Alternatively, a NamedList where each entry is a factor. This may or may not be named.

    Alternatively, a BiocFrame where each column is a factor.

  • keep_unused (bool) – Whether to report unused combinations of levels. If any entry of factors is a Factor object, any unused levels will also be preserved.

Return type:

NamedList

Returns:

NamedList containing the following entries.

  • levels: a BiocFrame containing the sorted and unique combinations of levels as a tuple. Each column corresponds to a factor in factors while each row represents a unique combination. Corresponding elements of each column define a single combination, i.e., the i-th combination is defined by taking the i-th element of each column.

  • index: an integer NumPy array specifying the index into levels for each observation.

For observation i and factor j, levels[j][index[i]] will recover factors[j][i].

References

The combine_factors function in the scran_aggregate library.

Examples

>>> import scranpy
>>> import random
>>> x = random.choices(["A", "B", "C"], k=20)
>>> y = random.choices([True, False], k = 20)
>>> combined = scranpy.combine_factors({ "foo": x, "bar":  y })
>>> print(combined["levels"])
>>> import biocutils
>>> print(biocutils.table(combined["index"]))
scranpy.compute_adt_qc_metrics(x, subsets, row_names=None, num_threads=1)[source]

Compute quality control metrics from ADT count data.

Parameters:
  • x (Any) – A matrix-like object containing ADT counts. Each row should correspond to an ADT while each column should correspond to a cell.

  • subsets (Union[dict, NamedList]) –

    Subsets of ADTs corresponding to control features like IgGs. This may be either:

    • A list of arrays. Each array corresponds to an ADT subset and can either contain boolean or integer values.

      • For booleans, the sequence should be of length equal to the number of rows, and values should be truthy for rows that belong in the subset. If the sequence contains booleans, it should not contain any other type.

      • For integers, the value is the row index of an ADT in the subset.

      • For strings, the value is the name of an ADT in the subset. This should match at least one element in row_names.

    • A dictionary where keys are the names of each ADT subset and the values are arrays as described above.

    • A NamedList where each element is an array as described above, possibly with names.

  • row_names (Optional[Sequence]) – Sequence of strings of length equal to the number of rows of x, containing the name of each ADT. Duplicate names are allowed but only the first occurrence will be used. If None, rows are assumed to be unnamed.

  • num_threads (int) – Number of threads to use.

Return type:

BiocFrame

Returns:

A BiocFrame with number of rows equal to the number of cells (i.e., columns) in x. It contains the following columns:

  • sum, a double-precision NumPy array containing the sum of counts across all ADTs for each cell.

  • detected, an integer NumPy array containing the number of detected ADTs in each cell.

  • subset_sum, a nested BiocFrame with one column per subset in subsets. Each column is a double-precision NumPy array that contains the sum of counts for the corresponding subset in each cell.

References

The compute_adt_qc_metrics function in the scran_qc C++ library, which describes the rationale behind these QC metrics.

Examples

>>> import numpy
>>> mat = numpy.reshape(numpy.random.poisson(lam=5, size=1000), (50, 20))
>>> import scranpy
>>> res = scranpy.compute_adt_qc_metrics(mat, { "IgG": [ 1, 10, 20, 40 ] })
>>> print(res)
scranpy.compute_clrm1_factors(x, num_threads=1)[source]

Compute size factors from an ADT count matrix using the CLRm1 method.

Parameters:
  • x (Any) – A matrix-like object containing ADT count data. Rows correspond to tags and columns correspond to cells.

  • num_threads (int) – Number of threads to use.

Return type:

ndarray

Returns:

Double-precision NumPy array containing the CLRm1 size factor for each cell. Note that these size factors are not centered and should be passed through, e.g., center_size_factors() before normalization.

References

https://github.com/libscran/clrm1, for a description of the CLRm1 method.

Examples
>>> import numpy
>>> counts = numpy.random.poisson(lam=5, size=1000).reshape(20, 50)
>>> import scranpy
>>> sf = scranpy.compute_clrm1_factors(counts)
>>> print(scranpy.center_size_factors(sf))
scranpy.compute_crispr_qc_metrics(x, num_threads=1)[source]

Compute quality control metrics from CRISPR count data.

Parameters:
  • x (Any) – A matrix-like object containing CRISPR counts. Each row should correspond to a guide while each column should correspond to a cell.

  • num_threads (int) – Number of threads to use.

Return type:

BiocFrame

Returns:

A BiocFrame with number of rows equal to the number of cells (i.e., columns) in x. It contains the following columns.

  • sum, a double-precision NumPy array containing the sum of counts across all guides for each cell.

  • detected, an integer NumPy array containing the number of guides with non-zero counts in each cell.

  • max_value, a double-precision NumPy array containing the maximum count for each cell.

  • max_index, an integer NumPy array containing the row index of the guide with the maximum count in each cell.

References

The compute_crispr_qc_metrics function in the scran_qc C++ library, which describes the rationale behind these QC metrics.

Examples

>>> import numpy
>>> mat = numpy.reshape(numpy.random.poisson(lam=5, size=1000), (50, 20))
>>> import scranpy
>>> res = scranpy.compute_crispr_qc_metrics(mat)
>>> print(res)
scranpy.compute_rna_qc_metrics(x, subsets, row_names=None, num_threads=1)[source]

Compute quality control metrics from RNA count data.

Parameters:
  • x (Any) – A matrix-like object containing RNA counts. Each row should correspond to a gene while each column should correspond to a cell.

  • subsets (Union[dict, Sequence, NamedList]) –

    Subsets of genes corresponding to “control” features like mitochondrial genes. This may be either:

    • A list of sequences. Each sequence corresponds to a gene subset and can contain booleans, integers or strings.

      • For booleans, the sequence should be of length equal to the number of rows, and values should be truthy for rows that belong in the subset. If the sequence contains booleans, it should not contain any other type.

      • For integers, the value is the row index of a gene in the subset.

      • For strings, the value is the name of a gene in the subset. This should match at least one element in row_names.

    • A dictionary where keys are the names of each gene subset and the values are arrays as described above.

    • A NamedList where each element is an array as described above, possibly with names.

  • row_names (Optional[Sequence]) – Sequence of strings of length equal to the number of rows of x, containing the name of each gene. Duplicate names are allowed but only the first occurrence will be used. If None, rows are assumed to be unnamed.

  • num_threads (int) – Number of threads to use.

Return type:

BiocFrame

Returns:

A BiocFrame with number of rows equal to the number of cells (i.e., columns) in x. It contains the following columns:

  • sum: a double-precision NumPy array containing the sum of counts across all genes for each cell.

  • detected: an integer NumPy array containing the number of genes with non-zero expression in each cell.

  • subset_proportion: a nested BiocFrame with one column per subset in subsets. Each column is a double-precision NumPy array that contains the proportion of counts in the corresponding subset in each cell.

References

The compute_rna_qc_metrics function in the scran_qc C++ library, which describes the rationale behind these QC metrics.

Examples

>>> import numpy
>>> mat = numpy.reshape(numpy.random.poisson(lam=5, size=1000), (50, 20))
>>> import scranpy
>>> res = scranpy.compute_rna_qc_metrics(mat, { "mito": [ 1, 10, 20, 40 ] })
>>> print(res)
scranpy.compute_rna_qc_metrics_with_altexps(x, subsets, altexp_proportions=None, num_threads=1, assay_type='counts')

Compute RNA-related QC metrics for the main and alternative experiments of a SingleCellExperiment. This calls compute_rna_qc_metrics() on each experiment.

Parameters:
  • x (SummarizedExperiment) – A SummarizedExperiment object or one of its subclasses. Rows correspond to genes and columns correspond to cells.

  • subsets (Union[Mapping, Sequence]) – List of subsets of control genes, passed to compute_rna_qc_metrics() for the main experiment.

  • altexp_proportions (Union[str, int, dict, Sequence, NamedList, None]) –

    Indices or names of alternative experiments for which to compute QC metrics. This is typically used to refer to alternative experiments holding spike-in data. An additional proportion is computed for each alternative experiment relative to the main experiment, i.e., X/(X+Y) where X is the alternative experiment’s total and Y is the main experiment’s total. These proportions will be used for filtering in the same manner as the proportions computed from subsets.

    If a list or unnamed NamedList is supplied, it should contain the indices/names of alternative experiments for which to compute QC metrics. The assay to use from each alternative experiment is determined by assay_type.

    If a string or integer is supplied, it is assumed to be a name/index of one alternative experiment for which to compute QC metrics. Again, the assay to use from each alternative experiment is determined by assay_type.

    If a dictionary or named NamedList is supplied, each key is a string specifying the name of an alternative experiment for which to compute QC metrics, while each value is an integer/string specifying the index/name of the assay to use from that experiment.

    This option is only relevant if x is a SingleCellExperiment.

  • num_threads (int) – Number of threads, passed to compute_rna_qc_metrics().

  • assay_type (Union[int, str]) – Index or name of the assay of x containing the ADT count matrix.

Returns:

  • main: a BiocFrame containing QC statistics for the main experiment, see compute_rna_qc_metrics() for details. The proportion of counts for each alternative experiment in altexp_proportions is stored in the subset_proportions column.

  • altexps: a NamedList with one entry per alternative experiment listed in altexp_proportions. Each entry is named after its corresponding alternative experiment and is a BiocFrame of QC statistics for that experiment.

Return type:

A NamedList containing

Examples

>>> import scranpy
>>> sce = scranpy.get_test_rna_data_se()
>>> is_mito = list(y.startswith("mt-") for y in sce.get_row_names())
>>> res = scranpy.compute_rna_qc_metrics_with_altexps(sce, subsets={ "mito": is_mito }, altexp_proportions="ERCC")
>>> print(res["main"])
>>> print(res["main"]["subset_proportion"])
>>> print(res["altexps"][0])
scranpy.correct_mnn(x, block, num_neighbors=15, num_steps=1, merge_policy='rss', num_mads=None, robust_iterations=None, robust_trim=None, mass_cap=None, order=None, reference_policy=None, nn_parameters=<knncolle.annoy.AnnoyParameters object>, num_threads=1)[source]

Perform mutual nearest neighbor (MNN) correction to remove batch effects from a matrix of low-dimensional embeddings.

Parameters:
  • x (ndarray) – Matrix of coordinates where rows are dimensions and columns are cells, typically generated by run_pca().

  • block (Sequence) – Factor specifying the block of origin (e.g., batch, sample) for each cell. Length should equal the number of columns in x.

  • num_neighbors (int) – Number of neighbors to use when identifying MNN pairs.

  • num_steps (int) – Number of steps for the recursive neighbor search to compute the center of mass. Larger values mitigate the kissing effect but increase the risk of including inappropriately distant subpopulations into the center of mass.

  • merge_policy (Literal['rss', 'size', 'variance', 'input']) –

    Policy for choosing the merge order.

    • input uses the input order of the batches. Observations in the last batch are corrected first, and then the second-last batch, and so on. This allows users to control the merge order by simply changing the inputs.

    • size merges batches in order of increasing size (i.e., the number of observations). So, the smallest batch is corrected first while the largest batch is unchanged. The aim is to lower compute time by reducing the number of observations that need to be reprocessed in later merge steps.

    • variance merges batches in order of increasing variance between observations. So, the batch with the lowest variance is corrected first while the batch with the highest variance is unchanged. The aim is to lower compute time by encouraging more observations to be corrected to the most variable batch, thus avoid reprocessing in later merge steps.

    • rss merges batches in order of increasing residual sum of squares (RSS). This is effectively a compromise between variance and size.

  • num_mads (Optional[int]) – Deprecated and ignored.

  • robust_iterations (Optional[int]) – Deprecated and ignored.

  • robust_trim (Optional[float]) – Deprecated and ignored.

  • mass_cap (Optional[int]) – Deprecated and ignored.

  • order (Optional[Sequence]) – Deprecated and ignored. Now the merge order is always determinted automatically.

  • reference_policy (Optional[str]) – Deprecated, see merge_policy instead.

  • nn_parameters (Parameters) – The nearest-neighbor algorithm to use.

  • num_threads (int) – Number of threads to use.

Return type:

NamedList

Returns:

A NamedList containing the following entries.

  • corrected, a double-precision NumPy array of the same dimensions as x, containing the corrected values.

References

https://libscran.github.io/mnncorrect, which describes the MNN correction algorithm in more detail.

Examples

>>> import numpy
>>> pcs = numpy.random.rand(10, 200)
>>> block = ["A", "B"] * 100
>>> import scranpy
>>> mnn_out = scranpy.correct_mnn(pcs, block)
>>> mnn_out["corrected"].shape
scranpy.correct_mnn_se(x, block, nn_parameters=<knncolle.annoy.AnnoyParameters object>, num_threads=1, more_mnn_args={}, reddim_type='PCA', output_name='MNN')[source]

Correct batch effects from an existing embedding with mutual nearest neighbors (MNNs). This calls correct_mnn() on reduced dimensions from a SingleCellExperiment.

Parameters:
  • x (SingleCellExperiment) – A SingleCellExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.

  • block (Sequence) – Block assignment of each cell, see correct_mnn() for more details.

  • nn_parameters (Parameters) – Algorithm for nearest neighbor search, see correct_mnn() for more details.

  • num_threads (int) – Number of threads, see correct_mnn() for more details.

  • more_mnn_args (dict) – Additional arguments to pass to correct_mnn().

  • reddim_type (Union[str, int, tuple]) – Name or index of an existing embedding in x, on which to perform the batch correction. Alternatively a tuple, where the first element contains the name of an alternative experiment of x, and the second element contains the name/index of an embedding in that alternative experiment.

  • output_name (str) – Name of the reduced dimension entry which to store the corrected coordinates.

Return type:

SingleCellExperiment

Returns:

A copy of x , with the corrected embedding stored as a reduced dimension entry.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_rna_data_se("pca")
>>> # Treating the tissue of origin as the batch.
>>> sce = scranpy.correct_mnn_se(sce, block=sce.get_column_data()["tissue"])
>>> sce.get_reduced_dimension_names()
scranpy.filter_adt_qc_metrics(thresholds, metrics, block=None)[source]

Filter for high-quality cells based on ADT-derived QC metrics.

Parameters:
Return type:

ndarray

Returns:

A boolean NumPy vector of length equal to the number of cells in metrics, containing truthy values for putative high-quality cells.

References

The AdtQcFilters and AdtQcBlockedFilters functions in the scran_qc C++ library.

Examples

>>> import numpy
>>> mat = numpy.reshape(numpy.random.poisson(lam=5, size=1000), (50, 20))
>>> import scranpy
>>> res = scranpy.compute_adt_qc_metrics(mat, { "IgG": [ 1, 10, 20, 40 ] })
>>> filt = scranpy.suggest_adt_qc_thresholds(res)
>>> keep = scranpy.filter_adt_qc_metrics(filt, res)
>>> import biocutils
>>> print(biocutils.table(keep))
scranpy.filter_crispr_qc_metrics(thresholds, metrics, block=None)[source]

Filter for high-quality cells based on CRISPR-derived QC metrics.

Parameters:
Return type:

ndarray

Returns:

A boolean NumPy vector of length equal to the number of cells in metrics, containing truthy values for putative high-quality cells.

References

The CrisprQcFilters and CrisprQcBlockedFilters functions in the scran_qc C++ library.

Examples

>>> import numpy
>>> mat = numpy.reshape(numpy.random.poisson(lam=5, size=1000), (50, 20))
>>> import scranpy
>>> res = scranpy.compute_crispr_qc_metrics(mat)
>>> filt = scranpy.suggest_crispr_qc_thresholds(res)
>>> keep = scranpy.filter_crispr_qc_metrics(filt, res)
>>> keep.sum()
scranpy.filter_rna_qc_metrics(thresholds, metrics, block=None)[source]

Filter for high-quality cells based on RNA-derived QC metrics.

Parameters:
Return type:

ndarray

Returns:

A boolean NumPy vector of length equal to the number of cells in metrics, containing truthy values for putative high-quality cells.

References

The RnaQcFilters and RnaQcBlockedFilters functions in the scran_qc C++ library.

Examples

>>> import numpy
>>> mat = numpy.reshape(numpy.random.poisson(lam=5, size=1000), (50, 20))
>>> import scranpy
>>> res = scranpy.compute_rna_qc_metrics(mat, { "mito": [ 1, 10, 20, 40 ] })
>>> filt = scranpy.suggest_rna_qc_thresholds(res)
>>> keep = scranpy.filter_rna_qc_metrics(filt, res)
>>> print(biocutils.table(keep))
scranpy.fit_variance_trend(mean, variance, mean_filter=True, min_mean=0.1, transform=True, span=0.3, use_min_width=False, min_width=1, min_window_count=200, num_threads=1)[source]

Fit a trend to the per-gene variances with respect to the mean.

Parameters:
  • mean (ndarray) – Array containing the mean (log-)expression for each gene.

  • variance (ndarray) – Array containing the variance in the (log-)expression for each gene. This should have length equal to mean.

  • mean_filter (bool) – Whether to filter on the means before trend fitting.

  • min_mean (float) – The minimum mean of genes to use in trend fitting. Only used if mean_filter = True.

  • transform (bool) – Whether a quarter-root transformation should be applied before trend fitting.

  • span (float) – Span of the LOWESS smoother. Ignored if use_min_width = True.

  • use_min_width (bool) – Whether a minimum width constraint should be applied to the LOWESS smoother. This is useful to avoid overfitting in high-density intervals.

  • min_width (float) – Minimum width of the window to use when use_min_width = True.

  • min_window_count (int) – Minimum number of observations in each window. Only used if use_min_width = True.

  • num_threads (int) – Number of threads to use.

Return type:

NamedList

Returns:

A NamedList containing the following entries.

  • fitted: a double-precision NumPy array of length equal to mean, containing the fitted value of the trend for each gene.

  • residual: a double-precision NumPy array of length equal to mean, containing the residual from the trend for each gene.

References

The fit_variance_trend function in the scran_variances C++ library, for the underlying implementation.

Examples

>>> import numpy
>>> # Mock up a typical mean-variance relationship for count data.
>>> mean = numpy.random.exponential(size=1000)
>>> variance = mean / (1 + mean**2) * 2**(numpy.random.normal(size=len(mean)) / 10)
>>>
>>> import scranpy
>>> fit = scranpy.fit_variance_trend(mean, variance)
>>> print(fit["fitted"])
>>> print(fit["residual"])
scranpy.format_compute_adt_qc_metrics_result(df, flatten=True)

Pretty-format the results of compute_adt_qc_metrics().

Parameters:
Return type:

BiocFrame

Returns:

A BiocFrame containing per-cell QC statistics. If flatten = True, the subset sums are stored as top-level columns with name subset_sum_<SUBSET> where <SUBSET> is the name of the subset.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_adt_data_se().get_alternative_experiment("ADT")
>>> is_igg = list(y.find("IgG") >= 0 for y in sce.get_row_names())
>>> qc = scranpy.compute_adt_qc_metrics(sce.get_assay(0), subsets={ "igg": is_igg })
>>> print(scranpy.format_compute_adt_qc_metrics_result(qc))
scranpy.format_compute_crispr_qc_metrics_result(df)

Pretty-format the results of compute_crispr_qc_metrics().

Parameters:

df (BiocFrame) – Result of compute_crispr_qc_metrics().

Return type:

BiocFrame

Returns:

A BiocFrame containing per-cell QC statistics.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_crispr_data_se().get_alternative_experiment("CRISPR Guide Capture")
>>> df = scranpy.compute_crispr_qc_metrics(sce.get_assay(0))
>>> print(scranpy.format_compute_crispr_qc_metrics_result(df))
scranpy.format_compute_rna_qc_metrics_result(df, flatten=True)

Pretty-format the results of compute_rna_qc_metrics().

Parameters:
Return type:

BiocFrame

Returns:

A BiocFrame containing per-cell QC statistics. If flatten = True, the subset proportions are stored as top-level columns with name subset_proportion_<SUBSET> where <SUBSET> is the name of the subset.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_rna_data_se()
>>> is_mito = list(y.startswith("mt-") for y in sce.get_row_names())
>>> qc = scranpy.compute_rna_qc_metrics(sce.get_assay(0), subsets={ "mito": is_mito })
>>> print(scranpy.format_compute_rna_qc_metrics_result(qc))
scranpy.format_score_markers_result(res, extra_columns=None, order_by=True, row_names=None)

Reformat the output of score_markers() into a list of per-group BiocFrame objects.

Parameters:
  • res – Results of score_markers().

  • extra_columns (Union[str, Sequence, BiocFrame, None]) – A BiocFrame with the same number of rows as x, containing extra columns to add each BiocFrame.

  • order_by (Union[str, bool, None]) – Name of the column to use for ordering the rows of each output BiocFrame. Alternatively True, in which case a column is automatically chosen from the effect size summaries. If None or False, no ordering is performed.

  • row_names (Optional[Sequence]) – Sequence of strings containing the row names to add to each BiocFrame. This should correspond to the gene names corresponding to the rows of x used in score_markers().

Return type:

NamedList

Returns:

A NamedList of BiocFrame objects. Each BiocFrame corresponds to a unique group in groups. Each row of each BiocFrame contains statistics for a gene in x. Each BiocFrame contains the following columns:

  • mean, the mean expression in the current group.

  • detected, the proportion of cells with detected expression in the current group.

  • <effect>_<summary>, a summary statistic for an effect size, e.g., cohens_d_mean contains the mean Cohen’s d across comparisons involving the current group.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_rna_data_se("cluster")
>>> raw_markers = scranpy.score_markers(sce.get_assay("logcounts"), sce.get_column_data()["clusters"])
>>> formatted = scranpy.format_score_markers_result(raw_markers, row_names=sce.get_row_names())
scranpy.get_test_adt_data_se(at='start')

Get a CITE-seq dataset with varying levels of processing. This contains human PBMCs obtained with scrnaseq.fetch_dataset("kotliarov-pbmc-2020", "2024-04-18"). The main experiment contains RNA counts and the alternative experiment contains ADT counts.

Parameters:

at (Literal['start', 'qc', 'norm', 'hvg', 'pca']) – The desired level of processing. For start, no processing is performed. Otherwise, the dataset is returned after quality control (qc), normalization (norm), feature selection (hvg) or PCA (pca).

Return type:

SingleCellExperiment

Returns:

The dataset as a SingleCellExperiment. This uses caching to avoid recomputation.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_adt_data_se()
scranpy.get_test_crispr_data_se(at='start')

Get a Perturb-seq dataset with varying levels of processing. This contains a pancreatic beta cell line obtained with scrnaseq.fetch_dataset("cao-pancreas-2025", "2025-10-10", "rqc"). The main experiment contains RNA counts and the alternative experiment contains CRISPR guide counts.

Parameters:

at (Literal['start', 'qc']) – The desired level of processing. For start, no processing is performed. Otherwise, the dataset is returned after quality control (qc), normalization (norm), feature selection (hvg) or PCA (pca).

Return type:

SingleCellExperiment

Returns:

The dataset as a SingleCellExperiment. This uses caching to avoid recomputation.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_crispr_data_se()
scranpy.get_test_rna_data_se(at='start')

Get a single-cell RNA-seq dataset with varying levels of processing. This contains cells from the mouse brain, obtained with scrnaseq.fetch_dataset("zeisel-brain-2015", "2023-12-14"). The main experiment contains RNA counts and the alternative experiments contain ERCC and repeat element counts.

Parameters:

at (Literal['start', 'qc', 'norm', 'hvg', 'pca', 'cluster']) – The desired level of processing. For start, no processing is performed. Otherwise, the dataset is returned after quality control (qc), normalization (norm), feature selection (hvg), PCA (pca) or clustering (cluster).

Return type:

SingleCellExperiment

Returns:

The dataset as a SingleCellExperiment. This uses caching to avoid recomputation.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_rna_data_se()
scranpy.model_gene_variances(x, block=None, block_average_policy='mean', block_weight_policy='variable', variable_block_weight=(0, 1000), block_quantile=0.8, mean_filter=True, min_mean=0.1, transform=True, span=0.3, use_min_width=False, min_width=1, min_window_count=200, num_threads=1)[source]

Compute the variance in (log-)expression values for each gene, and model the trend in the variances with respect to the mean.

Parameters:
  • x (Any) – A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. It is typically expected to contain log-expression values, e.g., from normalize_counts().

  • block (Optional[Sequence]) – Array of length equal to the number of columns of x, containing the block of origin (e.g., batch, sample) for each cell. Alternatively None, if all cells are from the same block.

  • block_average_policy (Literal['mean', 'quantile']) – Policy for averaging statistics across blocks. This can either use a (weighted) mean or a quantile.

  • block_weight_policy (Literal['variable', 'equal', 'none']) – Policy for weighting different blocks when computing the weighted mean across blocks for each statistic. Only used if block is provided and block_average_policy = "mean".

  • variable_block_weight (Tuple) – Parameters for variable block weighting. This should be a tuple of length 2 where the first and second values are used as the lower and upper bounds, respectively, for the variable weight calculation. Only used if block is provided, block_average_policy = "mean", and block_weight_policy = "variable".

  • block_quantile (float) – Probability for computing the quantile across blocks. Defaults to 0.5, i.e., the median of per-block statistics. Only used if block is provided and block_average_policy = "quantile".

  • mean_filter (bool) – Whether to filter on the means before trend fitting.

  • min_mean (float) – The minimum mean of genes to use in trend fitting. Only used if mean_filter = True.

  • transform (bool) – Whether a quarter-root transformation should be applied before trend fitting.

  • span (float) – Span of the LOWESS smoother for trend fitting, see fit_variance_trend().

  • use_min_width (bool) – Whether a minimum width constraint should be applied during trend fitting, see fit_variance_trend().

  • min_width (float) – Minimum width of the smoothing window for trend fitting, see fit_variance_trend().

  • min_window_count (int) – Minimum number of observations in each smoothing window for trend fitting, see fit_variance_trend().

  • num_threads (int) – Number of threads to use.

Return type:

BiocFrame

Returns:

A NamedList containing statistics. This is a BiocFrame with one row per gene and the following columns:

  • mean: a double-precision NumPy array containing the mean (log-)expression for each gene.

  • variance: a double-precision NumPy array containing the mean (log-)expression for each gene.

  • fitted: a double-precision NumPy array containing the fitted value of the mean-variance trend for each gene.

  • residual: a double-precision NumPy array containing the residual from the mean-variance trend for each gene.

If block is supplied, the NamedList will also contain:

  • per_block: a NamedList containing the per-block statistics. Each entry is a BiocFrame that contains the mean, variance, fitted and residual for each block.

  • block_ids: a list containing the identities of the blocks. This corresponds to the entries of per_block.

References

The model_gene_variances function in the scran_variances C++ library.

Examples

>>> import numpy
>>> # mock up some log-normalized data with an interesting mean-variance relationship.
>>> mu = numpy.random.rand(200) * 5
>>> normed = numpy.ndarray((200, 10))
>>> for c in range(10):
>>>     normed[:,c] = numpy.log1p(numpy.random.poisson(lam=mu, size=200)) / numpy.log(2)
>>>
>>> import scranpy
>>> res = scranpy.model_gene_variances(normed)
>>> print(res["statistics"])
scranpy.normalize_adt_counts_se(x, size_factors=None, num_threads=1, center=True, block=None, mode='lowest', log=True, pseudo_count=1, assay_type='counts', output_name='logcounts', factor_name='size_factor')[source]

Compute (log-)normalized expression values after performing scaling normalization of an ADT count matrix. This calls compute_clrm1_factors() to compute CLRm1 size factors, center_size_factors() to center the size factors, and then normalize_counts() to compute normalized log-expression values.

Parameters:
  • x (SummarizedExperiment) – A SummarizedExperiment object or one of its subclasses. Rows correspond to antibody-derived tags (ADTs) and columns correspond to cells.

  • size_factors (Optional[Sequence]) – Size factor for each cell in x, to be passed to normalize_counts(). If None, size factors are computed with compute_clrm1_factors().

  • num_threads (int) – Number of threads, to be passed to normalize_counts().

  • center (bool) – Whether to center the size_factors by passing them to center_size_factors().

  • block (Optional[Sequence]) – Block assignment for each cell in x, to be passed to center_size_factors().

  • mode (Literal['lowest', 'per-block']) – How to scale the size factors across blocks, see center_size_factors(). Only relevant if block is provided.

  • log (bool) – Whether to log-transform the normalized expression values, see normalize_counts().

  • pseudo_count (float) – Pseudo-count for the log-transformation, see normalize_counts(). Only relevant if log = True.

  • assay_type (Union[int, str]) – Name or index specifying the assay of x that contains the count matrix to be normalized.

  • output_name (str) – Name of the assay in which to store the normalized matrix in the output object.

  • factor_name (Optional[str]) – Name of the column of the column data in which to store the size factors in the output object. If None, the size factors are not stored.

Return type:

SummarizedExperiment

Returns:

A copy of x, with an additional assay containing the (log-)normalized matrix. Size factors are also stored in the column data.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_adt_data_se("qc").get_alternative_experiment("ADT")
>>> sce = scranpy.normalize_adt_counts_se(sce)
>>> sce.get_assay_names()
>>> sce.get_column_data()["size_factor"]
scranpy.normalize_counts(x, size_factors, log=True, pseudo_count=1, log_base=2, preserve_sparsity=False, delayed=True)

Create a matrix of (log-transformed) normalized expression values. The normalization removes uninteresting per-cell differences due to sequencing efficiency and library size. The log-transformation ensures that any differences represent log-fold changes in downstream analysis steps; such relative changes in expression are more relevant than absolute changes.

Parameters:
  • x (Any) –

    Matrix-like object containing cells in columns and features in rows, typically with count data.

    Alternatively, a InitializedMatrix representing a count matrix, typically created by initialize().

  • size_factors (Sequence) – Size factor for each cell. This should have length equal to the number of columns in x.

  • log (bool) – Whether log-transformation should be performed.

  • pseudo_count (float) – Positive pseudo-count to add before log-transformation. Ignored if log = False.

  • log_base (float) – Base of the log-transformation, ignored if log = False.

  • preserve_sparsity (bool) – Whether to preserve sparsity when pseudo_count != 1. If True, users should manually add log(pseudo_count, log_base) to the returned matrix to obtain the desired log-transformed expression values. Ignored if log = False or pseudo_count = 1.

  • delayed (bool) – Whether operations on a matrix-like x should be delayed. This improves memory efficiency at the cost of some speed in downstream operations.

Return type:

Union[DelayedArray, InitializedMatrix]

Returns:

If x is a matrix-like object and delayed = True, a DelayedArray is returned containing the (log-transformed) normalized expression matrix. If delayed = False, the type of the (log-)normalized matrix will depend on the operations applied to x.

If x is an InitializedMatrix, a new InitializedMatrix is returned containing the normalized expression matrix.

References

The normalize_counts function in the scran_norm C++ library, for the rationale behind normalization and log-transformation.

Examples

>>> import numpy
>>> counts = numpy.random.poisson(lam=2, size=1000).reshape(50, 20)
>>> import scranpy
>>> sf = scranpy.center_size_factors(counts.sum(axis=0))
>>> normed = scranpy.normalize_counts(counts, sf)
>>> print(normed)
scranpy.normalize_crispr_counts_se(x, size_factors=None, center=True, block=None, mode='lowest', log=True, pseudo_count=1, assay_type='counts', output_name='logcounts', factor_name='size_factor')[source]

Compute (log-)normalized expression values after performing scaling normalization of a CRISPR guide count matrix. This calls center_size_factors() to center the library sizes, and then normalize_counts() to compute normalized log-expression values.

Parameters:
  • x (SummarizedExperiment) – A SummarizedExperiment object or one of its subclasses. Rows correspond to guides and columns correspond to cells.

  • size_factors (Optional[Sequence]) – Size factor for each cell in x, to be passed to normalize_counts(). If None, this defaults to the column sums of the count matrix in x.

  • center (bool) – Whether to center the size_factors by passing them to center_size_factors().

  • block (Optional[Sequence]) – Block assignment for each cell in x, to be passed to center_size_factors().

  • mode (Literal['lowest', 'per-block']) – How to scale the size factors across blocks, see center_size_factors(). Only relevant if block is provided.

  • log (bool) – Whether to log-transform the normalized expression values, see normalize_counts().

  • pseudo_count (float) – Pseudo-count for the log-transformation, see normalize_counts(). Only relevant if log = True.

  • assay_type (Union[str, int]) – Name or index specifying the assay of x that contains the count matrix to be normalized.

  • output_name (str) – Name of the assay in which to store the normalized matrix in the output object.

  • factor_name (Optional[str]) – Name of the column of the column data in which to store the size factors in the output object. If None, the size factors are not stored.

Return type:

SummarizedExperiment

Returns:

A copy of x, with an additional assay containing the (log-)normalized matrix. Size factors are also stored in the column data.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_crispr_data_se("qc").get_alternative_experiment("CRISPR Guide Capture")
>>> sce = scranpy.normalize_crispr_counts_se(sce, size_factors=sce.get_column_data()["sum"])
>>> sce.get_assay_names()
>>> sce.get_column_data()["size_factor"]
scranpy.normalize_rna_counts_se(x, size_factors=None, center=True, block=None, mode='lowest', log=True, pseudo_count=1, assay_type='counts', output_name='logcounts', factor_name='size_factor')[source]

Compute (log-)normalized expression values after performing scaling normalization of an RNA count matrix. This calls center_size_factors() to center the library sizes, and then normalize_counts() to compute normalized log-expression values.

Parameters:
  • x (SummarizedExperiment) – A SummarizedExperiment object or one of its subclasses. Rows correspond to genes and columns correspond to cells.

  • size_factors (Optional[Sequence]) – Size factor for each cell in x, to be passed to normalize_counts(). If None, this defaults to the column sums of the count matrix in x.

  • center (bool) – Whether to center the size_factors by passing them to center_size_factors().

  • block (Optional[Sequence]) – Block assignment for each cell in x, to be passed to center_size_factors().

  • mode (Literal['lowest', 'per-block']) – How to scale the size factors across blocks, see center_size_factors(). Only relevant if block is provided.

  • log (bool) – Whether to log-transform the normalized expression values, see normalize_counts().

  • pseudo_count (float) – Pseudo-count for the log-transformation, see normalize_counts(). Only relevant if log = True.

  • assay_type (Union[str, int]) – Name or index specifying the assay of x that contains the count matrix to be normalized.

  • output_name (str) – Name of the assay in which to store the normalized matrix in the output object.

  • factor_name (Optional[str]) – Name of the column of the column data in which to store the size factors in the output object. If None, the size factors are not stored.

Return type:

SummarizedExperiment

Returns:

A copy of x, with an additional assay containing the (log-)normalized matrix. Size factors are also stored in the column data.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_rna_data_se("qc")
>>> sce = scranpy.normalize_rna_counts_se(sce, size_factors=sce.get_column_data()["sum"])
>>> sce.get_assay_names()
>>> sce.get_column_data()["size_factor"]
scranpy.preview_markers(df, columns=['mean', 'detected', ('lfc', 'delta_mean_mean')], rows=10, order_by=None, include_order_by=True)

Preview the top markers for a group in a pretty format.

Parameters:
  • df (BiocFrame) – BiocFrame containing the marker statistics for a single group.

  • columns (Optional[Sequence]) –

    Names of columns of df to retain in the preview.

    Alternatively, each entry may be a tuple of two strings. The first string is the name of the column in the output BiocFrame, and the second string is the name of the column of df to retain.

  • rows (Optional[int]) – Number of rows to show. If None, all rows are returned.

  • order_by (Union[str, bool, None]) – Name of the column to use for ordering the rows of the output Alternatively ``True`, in which case a column is automatically chosen from the effect size summaries. If None or False, no ordering is performed.

  • include_order_by (bool) – Whether to include the column named by order_by in the output BiocFrame. This may also be a string, which is treated as True; the value is used as the name of the column in the output BiocFrame.

Return type:

BiocFrame

Returns:

A BiocFrame containing important columns for the top markers.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_rna_data_se("cluster")
>>> markers = scranpy.score_markers_se(sce, sce.get_column_data()["clusters"])
>>> print(scranpy.preview_markers(markers["1"]))
scranpy.quick_adt_qc_se(x, subsets, num_threads=1, more_suggest_args={}, block=None, assay_type='counts', output_prefix=None, meta_name='qc', flatten=True)

Quickly compute quality control (QC) metrics, thresholds and filters from ADT data in a SummarizedExperiment. This calls compute_adt_qc_metrics(), suggest_adt_qc_thresholds(), and filter_adt_qc_metrics().

Parameters:
  • x (SummarizedExperiment) – A SummarizedExperiment or one of its subclasses. Rows correspond to antibody-derived tags (ADTs) and columns correspond to cells.

  • subsets (Union[Mapping, Sequence]) – List of subsets of control features, passed to compute_adt_qc_metrics().

  • num_threads (int) – Number of threads, passed to compute_adt_qc_metrics().

  • more_suggest_args (dict) – Additional arguments to pass to suggest_adt_qc_thresholds().

  • block (Optional[Sequence]) – Blocking factor specifying the block of origin (e.g., batch, sample) for each cell in metrics. If supplied, a separate threshold is computed from the cells in each block. Alternatively None, if all cells are from the same block.

  • assay_type (Union[int, str]) – Index or name of the assay of x containing the ADT count matrix.

  • output_prefix (Optional[str]) – Prefix to add to the column names of the column data containing the output QC statistics. If None, no prefix is added.

  • meta_name (Optional[str]) – Name of the metadata entry in which to store additional outputs like the filtering thresholds. If None, additional outputs are not reported.

  • flatten (bool) – Whether to flatten the subset proportions into separate columns of the column data. If False, the subset proportions are stored in a nested BiocFrame.

Return type:

SummarizedExperiment

Returns:

A copy of x with additional columns added to its column data. Each column contains per-cell values for one of the ADT-related QC metrics, see compute_adt_qc_metrics() for details. The suggested thresholds are stored as a list in the metadata. The column data also contains a keep column, specifying which cells are to be retained.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_adt_data_se().get_alternative_experiment("ADT")
>>> is_igg = list(y.find("IgG") >= 0 for y in sce.get_row_names())
>>> sce = scranpy.quick_adt_qc_se(sce, subsets={ "igg": is_igg })
>>> print(sce.get_column_data()[:,["sum", "detected", "subset_sum_igg"]])
>>> print(sce.get_metadata()["qc"]["thresholds"])
>>> import biocutils
>>> print(biocutils.table(sce.get_column_data()["keep"]))
scranpy.quick_crispr_qc_se(x, more_suggest_args={}, num_threads=1, block=None, assay_type='counts', output_prefix=None, meta_name='qc')

Quickly compute quality control (QC) metrics, thresholds and filters from CRISPR data in a SummarizedExperiment. This calls compute_crispr_qc_metrics(), suggest_crispr_qc_thresholds(), and filter_crispr_qc_metrics().

Parameters:
  • x (SummarizedExperiment) – A SummarizedExperiment or one of its subclasses. Rows correspond to antibody-derived tags (ADTs) and columns correspond to cells.

  • subsets – Passed to compute_crispr_qc_metrics().

  • more_suggest_args (dict) – Additional arguments to pass to suggest_crispr_qc_thresholds().

  • num_threads (int) – Passed to compute_crispr_qc_metrics().

  • block (Optional[Sequence]) – Blocking factor specifying the block of origin (e.g., batch, sample) for each cell in metrics. If supplied, a separate threshold is computed from the cells in each block. Alternatively None, if all cells are from the same block.

  • assay_type (Union[int, str]) – Index or name of the assay of x containing the ADT count matrix.

  • output_prefix (Optional[str]) – Prefix to add to the column names of the column data containing the output QC statistics. If None, no prefix is added.

  • meta_name (str) – Name of the metadata entry in which to store additional outputs like the filtering thresholds. If None, additional outputs are not reported.

Return type:

SummarizedExperiment

Returns:

x, with additional columns added to its column data. Each column contains per-cell values for one of the ADT-related QC metrics, see compute_crispr_qc_metrics() for details. The suggested thresholds are stored as a list in the metadata. The column data also contains a keep column, specifying which cells are to be retained.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_crispr_data_se().get_alternative_experiment("CRISPR Guide Capture")
>>> sce = scranpy.quick_crispr_qc_se(sce)
>>> print(sce.get_column_data()[:,["sum", "detected", "max_value", "max_index"]])
>>> print(sce.get_metadata()["qc"]["thresholds"])
>>> sce.get_column_data()["keep"].sum()
scranpy.quick_rna_qc_se(x, subsets, altexp_proportions=None, num_threads=1, more_suggest_args={}, block=None, assay_type='counts', output_prefix=None, meta_name='qc', flatten=True)

Quickly compute quality control (QC) metrics, thresholds and filters from RNA data in a SummarizedExperiment. This calls compute_rna_qc_metrics(), suggest_rna_qc_thresholds(), and filter_rna_qc_metrics().

Parameters:
  • x (SummarizedExperiment) – A SummarizedExperiment or one of its subclasses. Rows correspond to genes and columns correspond to cells.

  • subsets (Union[Mapping, Sequence]) – List of subsets of control genes, passed to compute_rna_qc_metrics_with_altexps().

  • altexp_proportions (Union[str, int, dict, Sequence, NamedList, None]) – Indices or names of alternative experiments for which to compute QC metrics, see compute_rna_qc_metrics_with_altexps() for details.

  • num_threads (int) – Number of threads, passed to compute_rna_qc_metrics_with_altexps().

  • more_suggest_args (dict) – Additional arguments to pass to suggest_rna_qc_thresholds().

  • block (Optional[Sequence]) – Blocking factor specifying the block of origin (e.g., batch, sample) for each cell in metrics. If supplied, a separate threshold is computed from the cells in each block. Alternatively None, if all cells are from the same block.

  • assay_type (Union[int, str]) – Index or name of the assay of x containing the RNA count matrix.

  • output_prefix (Optional[str]) – Prefix to add to the column names of the column data containing the output QC statistics. If None, no prefix is added.

  • meta_name (Optional[str]) – Name of the metadata entry in which to store additional outputs like the filtering thresholds. If None, additional outputs are not reported.

  • flatten (bool) – Whether to flatten the subset proportions into separate columns of the column data. If False, the subset proportions are stored in a nested BiocFrame.

Return type:

SummarizedExperiment

Returns:

x, with additional columns added to its column data. Each column contains per-cell values for one of the RNA-related QC metrics, see compute_rna_qc_metrics() for details. The suggested thresholds are stored as a list in the metadata. The column data also contains a keep column, specifying which cells are to be retained.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_rna_data_se()
>>> is_mito = list(y.startswith("mt-") for y in sce.get_row_names())
>>> sce = scranpy.quick_rna_qc_se(sce, subsets={ "mito": is_mito })
>>> print(sce.get_column_data()[:,["sum", "detected", "subset_proportion_mito"]])
>>> print(sce.get_metadata()["qc"]["thresholds"])
>>> import biocutils
>>> print(biocutils.table(sce.get_column_data()["keep"]))
>>>
>>> # Computing spike-in proportions, if available.
>>> sce = scranpy.get_test_rna_data_se()
>>> sce = scranpy.quick_rna_qc_se(sce, subsets={ "mito": is_mito }, altexp_proportions=["ERCC"])
>>> print(sce.get_column_data()[:,["sum", "detected", "subset_proportion_mito", "subset_proportion_ERCC"]])
>>> print(sce.get_alternative_experiment("ERCC").get_column_data()[:,["sum", "detected"]])
scranpy.run_all_neighbor_steps(x, run_umap_options={}, run_tsne_options={}, build_snn_graph_options={}, cluster_graph_options={}, nn_parameters=<knncolle.annoy.AnnoyParameters object>, collapse_search=True, num_threads=3)

Run all steps that depend on the nearest neighbor search, i.e., run_tsne(), run_umap(), build_snn_graph(), and cluster_graph(). This builds the index once and re-uses it for the neighbor search in each step; the various steps are also run in parallel to save more time.

Parameters:
  • x (Union[Index, ndarray]) –

    Matrix of principal components where rows are cells and columns are PCs, typically produced by run_pca().

    Alternatively, a Index instance containing a prebuilt search index for the cells.

  • run_umap_options (Optional[dict]) – Optional arguments for run_umap(). If None, UMAP is not performed.

  • run_tsne_options (Optional[dict]) – Optional arguments for run_tsne(). If None, t-SNE is not performed.

  • build_snn_graph_options (Optional[dict]) – Optional arguments for build_snn_graph(). Ignored if cluster_graph_options = None.

  • cluster_graph_options (dict) – Optional arguments for cluster_graph(). If None, graph-based clustering is not performed.

  • nn_parameters (Parameters) – Parameters for the nearest-neighbor search.

  • collapse_search (bool) – Whether to collapse the nearest-neighbor search for each step into a single search. Steps that need fewer neighbors will use a subset of the neighbors from the collapsed search. This is faster but may not give the same results as separate searches for some approximate search algorithms.

  • num_threads (int) – Number of threads to use for the parallel execution of UMAP, t-SNE and SNN graph construction. This overrides the specified number of threads in the various *_options arguments.

Return type:

NamedList

Returns:

A NamedList containing one entry for each step.

  • run_tsne: results of run_tsne(). Omitted if t-SNE was not performed.

  • run_umap: results of run_tsne(). Omitted if UMAP was not performed.

  • build_snn_graph: results of build_snn_graph(). Omitted if graph-based clustering was not performed.

  • cluster_graph: results of cluster_graph(). Omitted if graph-based clustering was not performed.

If collapse_search = False, results should be identical to the result of running each step in serial.

Examples

>>> import numpy
>>> pcs = numpy.random.rand(10, 200)
>>> import scranpy
>>> output = scranpy.run_all_neighbor_steps(pcs)
>>> print(output["run_tsne"][:5,:])
>>> print(output["run_umap"][:5,:])
>>> import biocutils
>>> print(biocutils.table(output["cluster_graph"]["membership"]))
scranpy.run_all_neighbor_steps_se(x, umap_output_name='UMAP', more_umap_args={}, tsne_output_name='TSNE', more_tsne_args={}, build_graph_name=None, more_build_graph_args={}, cluster_output_name='clusters', cluster_meta_name=None, more_cluster_graph_args={}, nn_parameters=<knncolle.annoy.AnnoyParameters object>, num_threads=3, more_neighbor_args={}, reddim_type='PCA')[source]

Concurrently run all steps involving a nearest-neighbor search, using the same nearest-neighbor search index built from an existing embedding. This includes the t-SNE, UMAP and graph-based clustering. Internally, this uses run_all_neighbor_steps() on a reduced dimension entry in SingleCellExperiment.

Parameters:
  • x (SingleCellExperiment) – A SingleCellExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.

  • umap_output_name (Optional[str]) – Name of the reduced dimension entry in which to store the UMAP coordinates in the output object. If None, the UMAP is not computed.

  • more_umap_args (Optional[dict]) – Additional arguments for UMAP, to pass to run_all_neighbor_steps() as run_umap_options. If None, the UMAP is not computed.

  • tsne_output_name (Optional[str]) – Name of the reduced dimension entry in which to store the t-SNE coordinates in the output object. If None, the t-SNE is not computed.

  • more_tsne_args (Optional[dict]) – Additional arguments for t-SNE, to pass to run_all_neighbor_steps() as run_tsne_options. If None, the t-SNE is not computed.

  • build_graph_name (Optional[str]) – Name of the metadata entry in which to store the nearest neighbor graph. If None, the graph is not stored.

  • more_build_graph_args (Optional[dict]) – Additional arguments for graph construction, to pass to run_all_neighbor_steps() as more_build_graph_args.

  • cluster_output_name (Optional[str]) – Name of the column of the column data in which to store the cluster assignments. If None, graph-based clustering is not performed.

  • cluster_meta_name (Optional[str]) – Name of the metadata entry in which to store additional clustering outputs. If None, additional outputs are not stored.

  • more_cluster_graph_args (Optional[dict]) – Additional arguments for graph-based clustering, to pass to run_all_neighbor_steps() as more_cluster_graph_args. If None, the graph-based clustering is not performed.

  • nn_parameters (Parameters) – Parameters for the nearest-neighbor search.

  • num_threads (int) – Number of threads to use, passed to run_all_neighbor_steps().

  • more_neighbor_args (dict) – Additional arguments to pass to run_all_neighbor_steps().

  • reddim_type (Union[str, int]) – Name or index of the reduced dimensions of x on which to perform the nearest neighbor search. Alternatively a tuple, where the first element contains the name of an alternative experiment of x, and the second element contains the name/index of an embedding in that alternative experiment.

Return type:

SingleCellExperiment

Returns:

A copy of x, with additional coordinates in its reduced dimensions and clustering output in its column data. Additional information may also be stored in its metadata.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_rna_data_se("pca")
>>> sce = scranpy.run_all_neighbor_steps_se(
>>>     sce,
>>>     more_tsne_args={ "max_iterations": 50 }, # turned down for brevity
>>>     more_umap_args={ "num_epochs": 50 }
>>> )
>>> sce.get_reduced_dimension_names()
>>> import biocutils
>>> print(biocutils.table(sce.get_column_data()["clusters"]))
scranpy.run_pca(x, number=25, scale=False, block=None, block_weight_policy='variable', variable_block_weight=(0, 1000), subset=None, components_from_residuals=False, extra_work=7, iterations=1000, seed=5489, realized=True, num_threads=1)[source]

Run a PCA on the gene-by-cell log-expression matrix to obtain a low-dimensional representation for downstream analyses.

Parameters:
  • x (Any) – A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. Typically, the matrix is expected to contain log-expression values, and the rows should be filtered to relevant (e.g., highly variable) genes.

  • number (int) – Number of PCs to retain.

  • scale (bool) – Whether to scale all genes to have the same variance.

  • block (Optional[Sequence]) – Array of length equal to the number of columns of x, containing the block of origin (e.g., batch, sample) for each cell. Alternatively None, if all cells are from the same block.

  • block_weight_policy (Literal['variable', 'equal', 'none']) – Policy to use for weighting different blocks when computing the average for each statistic. Only used if block is provided.

  • variable_block_weight (Tuple) – Parameters for variable block weighting. This should be a tuple of length 2 where the first and second values are used as the lower and upper bounds, respectively, for the variable weight calculation. Only used if block is provided and block_weight_policy = "variable".

  • components_from_residuals (bool) – Whether to compute the PC scores from the residuals in the presence of a blocking factor. If False, the residuals are only used to compute the rotation matrix, and the original expression values of the cells are projected onto this new space. Only used if block is provided.

  • extra_work (int) – Number of extra dimensions for the IRLBA workspace.

  • iterations (int) – Maximum number of restart iterations for IRLBA.

  • seed (int) – Seed for the initial random vector in IRLBA.

  • realized (bool) – Whether to realize x into an optimal memory layout for IRLBA. This speeds up computation at the cost of increased memory usage.

  • num_threads (int) – Number of threads to use.

Return type:

NamedList

Returns:

A NamedList containing the following entries.

  • components: a double-precision NumPy matrix of principal component (PC) scores. Rows are dimensions (i.e., PCs) and columns are cells.

  • rotation: a double-precision NumPy matrix containing the rotation vectors. Rows are genes and columns are dimensions (i.e., PCs).

  • variance_explained: a double-precision NumPy array containing the variance explained by each successive PC.

  • total_variance: total variance in the input data. Guaranteed to be greater than the sum of variance_explained.

  • center: a double-precision NumPy array containing the mean for each gene across all cells. If block was supplied, this is instead a matrix containing the mean for each gene (column) in each block of cells (row).

  • block_ids: list containing the levels of the blocking factor corresponding to each row of center. Only reported if block was supplied.

  • scale: a double-precision NumPy arrary containing the scaling for each gene. Only reported if scale = True in the function call.

References

https://libscran.github.io/scran_pca, which describes the approach in more detail. In particular, the documentation for the blocked_pca function explains the blocking strategy.

Examples

>>> import numpy
>>> normed = numpy.random.rand(500, 100)
>>> import scranpy
>>> res = scranpy.run_pca(normed)
>>> print(res["variance_explained"] / res["total_variance"])
scranpy.run_pca_se(x, features, number=25, block=None, num_threads=1, more_pca_args={}, assay_type='logcounts', output_name='PCA', meta_name='PCA')[source]

Compact and denoise the dataset by performing PCA on the (log-)normalized expression matrix. This calls run_pca() on an assay of a SummarizedExperiment.

Parameters:
  • x (SummarizedExperiment) – A SummarizedExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.

  • features (Optional[Sequence]) – Features of interest to use in the PCA. This can be a sequence of row indices, row names, or booleans indicating the rows of x to use. For RNA data, this is typically the hvg vector added to the row data of x by choose_rna_hvgs_se(). If None, all available features are used.

  • number (int) – Number of PCs to retain, passed to run_pca().

  • block (Optional[Sequence]) – Block assignment of each cell, passed to run_pca().

  • num_threads (int) – Number of threads, passed to run_pca().

  • more_pca_args (dict) – Additional arguments passed to run_pca().

  • assay_type (Union[int, str]) – Name or index of the assay of x to be used for PCA. This is typically the log-normalized expression matrix created by normalize_rna_counts_se() or equivalent.

  • output_name (str) – Name of the reduced dimension entry in which to store the PC scores in the output object.

  • meta_name (Optional[str]) – Name of the metadata entry in which to store other PCA statistics in the output object.

Return type:

SingleCellExperiment

Returns:

A copy of x with the newly computed principal component scores in the reduced dimensions. Additional outputs (e.g., rotation matrix, variance explained) are stored in the metadata.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_rna_data_se("hvg")
>>> sce = scranpy.run_pca_se(sce, features=sce.get_row_data()["hvg"])
>>> sce.get_reduced_dimension("PCA").shape
>>> pcameta = sce.get_metadata()["PCA"]
>>> pcameta["variance_explained"] / pcameta["total_variance"]
scranpy.run_tsne(x, perplexity=30, num_neighbors=None, theta=1, early_exaggeration_iterations=250, exaggeration_factor=12, momentum_switch_iterations=250, start_momentum=0.5, final_momentum=0.8, eta=200, max_depth=7, leaf_approximation=False, max_iterations=500, seed=42, num_threads=1, nn_parameters=<knncolle.annoy.AnnoyParameters object>)

Compute t-SNE coordinates to visualize similarities between cells.

Parameters:
  • x (Union[ndarray, FindKnnResults, Index]) –

    Numeric matrix where rows are dimensions and columns are cells, typically containing a low-dimensional representation from, e.g., run_pca().

    Alternatively, a FindKnnResults object containing existing neighbor search results.

    Alternatively, a Index object.

  • perplexity (float) – Perplexity to use in the t-SNE algorithm. Larger values cause the embedding to focus on global structure.

  • num_neighbors (Optional[int]) – Number of neighbors in the nearest-neighbor graph. Typically derived from perplexity using tsne_perplexity_to_neighbors(). If x is a FindKnnResults object and the number of neighbors is not equal to num_neighbors, a warning is raised; this can be suppressed by setting num_neighbors = None.

  • theta (float) – Approximation level for the Barnes-Hut calculation of repulsive forces. Lower values increase accuracy at the cost of computational time.

  • early_exaggeration_iterations (int) – Number of iterations of the early exaggeration phase, where the conditional probabilities are multiplied by exaggeration_factor. In this phase, the empty space between clusters is increased so that clusters can easily relocate to find a good global organization. Larger values improve convergence within this phase at the cost of reducing the remaining iterations in max_iterations.

  • exaggeration_factor (float) – Exaggeration factor to scale the probabilities during the early exaggeration phase (see early_exaggeration_iterations). Larger values increase the attraction between nearest neighbors to favor local structure during this phase.

  • momentum_switch_iterations (int) – Number of iterations to perform before switching from the starting momentum (start_momentum) to the final momentum (final_momentum). Greater momentum can improve convergence by increasing the step size and smoothing over local oscillations, at the risk of potentially skipping over relevant minima.

  • start_momentum (float) – Starting momentum in [0, 1) to be used in the iterations before the momentum switch at momentum_switch_iterations. This is usually lower than final_momentum to avoid skipping over suitable local minima.

  • final_momentum (float) – Final momentum in [0, 1) to be used in the iterations after the momentum switch at momentum_switch_iterations. This is usually higher than start_momentum to accelerate convergence to the local minima once the observations are moderately well-organized.

  • eta (float) – The learning rate, used to scale the updates to the coordinates at each iteration. Larger values can speed up convergence at the cost of potentially skipping over local minima.

  • max_depth (int) – Maximum depth of the Barnes-Hut quadtree. If neighboring observations cannot be separated before the maximum depth is reached, they will be assigned to the same leaf node. This effectively approximates each observation’s coordinates with the center of mass of its leaf node. Smaller values (7-10) improve speed at the cost of accuracy.

  • leaf_approximation (bool) – Whether to use the “leaf approximation” approach, which sacrifices some accuracy for greater speed. This replaces a observation with the center of mass of its leaf node when computing the repulsive forces to all other observations. Only effective when max_depth is small enough for multiple cells to be assigned to the same leaf node of the quadtree.

  • max_iterations (int) – Maximum number of iterations to perform.

  • seed (int) – Random seed to use for generating the initial coordinates.

  • num_threads (int) – Number of threads to use.

  • nn_parameters (Parameters) – The algorithm to use for the nearest-neighbor search. Only used if x is not a pre-built nearest-neighbor search index or a list of existing nearest-neighbor search results.

Return type:

ndarray

Returns:

Double-precision NumPy matrix containing the coordinates of each cell in a 2-dimensional embedding. Each row corresponds to a cell and each column corresponds to a dimension.

References

https://libscran.github.io/qdtsne, for some more details on the approximations.

Examples

>>> import numpy
>>> pcs = numpy.random.rand(20, 500)
>>> import scranpy
>>> tout = scranpy.run_tsne(pcs)
>>> print(tout[:5,:])
scranpy.run_tsne_se(x, perplexity=30, num_threads=1, more_tsne_args={}, reddim_type='PCA', output_name='TSNE')[source]

Generate a t-SNE visualization from an existing embedding. This calls run_tsne() on a reduced dimension entry in a SingleCellExperiment.

Parameters:
  • x (SingleCellExperiment) – A SingleCellExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.

  • perplexity (float) – Perplexity for the t-SNE algorithm, passed to run_tsne().

  • num_threads (int) – Number of threads for the neighbor search and optimization, passed to run_tsne().

  • more_tsne_args (dict) – Additional arguments to pass to run_tsne().

  • reddim_type (Union[int, str]) – Name or index of the existing reduced dimension embedding in x from which to generate t-SNE coordinates. Alternatively a tuple, where the first element contains the name of an alternative experiment of x, and the second element contains the name/index of an embedding in that alternative experiment.

  • output_name (str) – Name of the reduced dimension entry in which to store the t-SNE coordinates in the output object.

Return type:

SingleCellExperiment

Returns:

A copy of x with a new reduced dimension entry containing the t-SNE coordinates.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_rna_data_se("pca")
>>> # Using fewer iterations for a faster-running example.
>>> sce = scranpy.run_tsne_se(sce, more_tsne_args={ "max_iterations": 50 })
>>> sce.get_reduced_dimension("TSNE")[:5,:]
scranpy.run_umap(x, num_dim=2, parallel_optimization=False, local_connectivity=1, bandwidth=1, mix_ratio=1, spread=1, min_dist=0.1, a=None, b=None, repulsion_strength=1, initialize_method='spectral', initial_coordinates=None, initialize_random_on_spectral_fail=True, initialize_spectral_scale=10, initialize_spectral_jitter=False, initialize_spectral_jitter_sd=0.0001, initialize_random_scale=10, initialize_seed=9876543210, num_epochs=None, learning_rate=1, negative_sample_rate=5, num_neighbors=15, optimize_seed=1234567890, num_threads=1, nn_parameters=<knncolle.annoy.AnnoyParameters object>)

Compute UMAP coordinates to visualize similarities between cells.

Parameters:
  • x (Union[ndarray, FindKnnResults, Index]) –

    Numeric matrix where rows are dimensions and columns are cells, typically containing a low-dimensional representation from, e.g., run_pca().

    Alternatively, a FindKnnResults object containing existing neighbor search results.

    Alternatively, a Index object.

  • num_dim (int) – Number of dimensions in the UMAP embedding.

  • local_connectivity (float) – Number of nearest neighbors that are assumed to be always connected, with maximum membership confidence. Larger values increase the connectivity of the embedding and reduce the focus on local structure. This may be a fractional number of neighbors, in which case interpolation is performed when computing the membership confidence.

  • bandwidth (float) – Effective bandwidth of the kernel when converting the distance to a neighbor into a fuzzy set membership confidence. Larger values reduce the decay in confidence with respect to distance, increasing connectivity and favoring global structure.

  • mix_ratio (float) – Number between 0 and 1 specifying the mixing ratio when combining fuzzy sets. A mixing ratio of 1 will take the union of confidences, a ratio of 0 will take the intersection, and intermediate values will interpolate between them. Larger values favor connectivity and more global structure.

  • spread (float) – Scale of the coordinates of the final low-dimensional embedding. Ignored if a and b are provided.

  • min_dist (float) – Minimum distance between observations in the final low-dimensional embedding. Smaller values will increase local clustering while larger values favor a more even distribution of observations throughout the low-dimensional space. This is interpreted relative to spread. Ignored if a and b are provided.

  • a (Optional[float]) – The a parameter for the fuzzy set membership strength calculations. Larger values yield a sharper decay in membership strength with increasing distance between observations. If this or b are None, a suitable value for this parameter is automatically determined from spread and min_dist.

  • b (Optional[float]) – The b parameter for the fuzzy set membership strength calculations. Larger values yield an earlier decay in membership strength with increasing distance between observations. If this or a are None, a suitable value for this parameter is automatically determined from spread and min_dist.

  • repulsion_strength (float) – Modifier for the repulsive force. Larger values increase repulsion and favor local structure.

  • initialize_method (Literal['spectral', 'random', 'none']) –

    How to initialize the embedding.

    • spectral: spectral decomposition of the normalized graph Laplacian. Specifically, the initial coordinates are defined from the eigenvectors corresponding to the smallest non-zero eigenvalues. This fails in the presence of multiple graph components or if the approximate SVD fails to converge.

    • random: fills the embedding with random draws from a normal distribution.

    • none: uses initial values from initial_coordinates.

  • initialize_random_on_spectral_fail (bool) – Whether to fall back to random sampling (i.e., random) if spectral initialization fails due to the presence of multiple components in the graph. If False, the values in initial_coordinates will be used instead, i.e., same as none. Only relevant if initialize_method = "spectral" and spectral initialization fails.

  • initialize_spectral_scale (float) – Maximum absolute magnitude of the coordinates after spectral initialization. The default is chosen to avoid outlier observations with large absolute distances that may interfere with optimization. Only relevant if initialize_method = "spectral" and spectral initialization does not fail.

  • initialize_spectral_jitter (bool) – Whether to jitter coordinates after spectral initialization to separate duplicate observations (e.g., to avoid overplotting). This is done using normally-distributed noise of mean zero and standard deviation of initialize_spectral_jitter_sd. Only relevant if initialize_method = "spectral" and spectral initialization does not fail.

  • initialize_spectral_jitter_sd (float) – Standard deviation of the jitter to apply after spectral initialization. Only relevant if initialize_method = "spectral" and spectral initialization does not fail and initialize_spectral_jitter = True.

  • initialize.random.scale – Scale of the randomly generated initial coordinates. Coordinates are sampled from a uniform distribution from [-x, x) where x is initialize_random_scale. Only relevant if initialize_method = "random", or initialize_method = "spectral" and spectral initialization fails and initialize_random_on_spectral_fail = True.

  • initialize_seed (int) – Seed for the random number generation during initialization. Only relevant if initialize_method = "random", or initialize_method = "spectral" and initialize_spectral_jitter = True; or initialize_method = "spectral" and spectral initialization fails and initialize_random_on_spectral_fail = True.

  • initial_coordinates (Optional[array]) – Double-precision matrix of initial coordinates with number of rows and columns equal to the number of observations and num_dim, respectively. Only relevant if initialize_method = "none"; or initialize_method = "spectral" and spectral initialization fails and initialize_random_on_spectral_fail = False.

  • num_epochs (Optional[int]) – Number of epochs to perform. If set to None, an appropriate number of epochs is chosen based on the number of points in x.

  • num_neighbors (int) – Number of neighbors to use in the UMAP algorithm. Larger values cause the embedding to focus on global structure. Ignored if x is a FindKnnResults object.

  • optimize_seed (int) – Integer scalar specifying the seed to use.

  • num_threads (int) – Number of threads to use.

  • nn_parameters (Parameters) – The algorithm to use for the nearest-neighbor search. Only used if x is not a pre-built nearest-neighbor search index or a list of existing nearest-neighbor search results.

Return type:

ndarray

Returns:

Double-precision NumPy matrix containing the coordinates of each cell in a 2-dimensional embedding. Each row corresponds to a cell and each column corresponds to a dimension.

References

https://libscran.github.io/umappp, for the underlying implementation.

Examples

>>> import numpy
>>> pcs = numpy.random.rand(20, 500)
>>> import scranpy
>>> uout = scranpy.run_umap(pcs)
>>> print(uout[:5,:])
scranpy.run_umap_se(x, num_dim=2, min_dist=0.1, num_neighbors=15, num_threads=1, more_umap_args={}, reddim_type='PCA', output_name='UMAP')[source]

Generate a UMAP visualization from an existing embedding. This calls run_umap() on a reduced dimension entry in a SingleCellExperiment.

Parameters:
  • x (SingleCellExperiment) – A SingleCellExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.

  • num_dim (int) – Number of dimensions in the UMAP, passed to run_umap().

  • min_dist (float) – Minimum distance between cells in the UMAP, passed to run_umap().

  • num_neighbors (int) – Number of neighbors, passed to run_umap().

  • num_threads (int) – Number of threads for the neighbor search and optimization, passed to run_umap().

  • more_umap_args (dict) – Additional arguments to pass to run_umap().

  • reddim_type (Union[int, str]) – Name or index of the existing reduced dimension embedding in x from which to generate UMAP coordinates. Alternatively a tuple, where the first element contains the name of an alternative experiment of x, and the second element contains the name/index of an embedding in that alternative experiment.

  • output_name (str) – Name of the reduced dimension entry in which to store the UMAP coordinates in the output object.

Return type:

SingleCellExperiment

Returns:

A copy of x with a new reduced dimension entry containing the UMAP coordinates.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_rna_data_se("pca")
>>> # Using fewer iterations for a faster-running example.
>>> sce = scranpy.run_umap_se(sce, more_umap_args={ "num_epochs": 50 })
>>> sce.get_reduced_dimension("UMAP")[:5,:]
scranpy.sanitize_size_factors(size_factors, replace_zero=True, replace_negative=True, replace_infinite=True, replace_nan=True, in_place=False)

Replace invalid size factors, i.e., zero, negative, infinite or NaNs.

Parameters:
  • size_factors (ndarray) – Floating-point array containing size factors for all cells.

  • replace_zero (bool) – Whether to replace size factors of zero with the lowest positive factor. If False, zeros are retained.

  • replace_negative (bool) – Whether to replace negative size factors with the lowest positive factor. If False, negative values are retained.

  • replace_infinite (bool) – Whether to replace infinite size factors with the largest positive factor. If False, infinite values are retained.

  • replace_nan (bool) – Whether to replace NaN size factors with unity. If False, NaN values are retained.

  • in_place (bool) – Whether to modify size_factors in place. If False, a new array is returned. This argument only used if size_factors is double-precision, otherwise a new array is always returned.

Return type:

ndarray

Returns:

Double-precision NumPy array containing sanitized size factors. If in_place = True, this is a reference to size_factors.

References

The sanitize_size_factors function in the scran_norm C++ library, which provides the underlying implementation.

Examples

>>> import numpy
>>> sf = numpy.random.rand(1000)
>>> sf[:4] = [0, -1, numpy.inf, numpy.nan]
>>> import scranpy
>>> san = scranpy.sanitize_size_factors(sf)
>>> san[:4]
scranpy.scale_by_neighbors(x, num_neighbors=20, block=None, block_weight_policy='variable', variable_block_weight=(0, 1000), num_threads=1, weights=None, nn_parameters=<knncolle.annoy.AnnoyParameters object>)[source]

Scale multiple embeddings (usually derived from different modalities across the same set of cells) so that their within-population variances are comparable. Then, combine them into a single embedding matrix for combined downstream analysis.

Parameters:
  • x (Sequence) – Sequence of of numeric matrices of principal components or other embeddings, one for each modality. For each entry, rows are dimensions and columns are cells. All entries should have the same number of columns but may have different numbers of rows.

  • num_neighbors (int) – Number of neighbors to use to define the scaling factor.

  • num_threads (int) – Number of threads to use.

  • nn_parameters (Parameters) – Algorithm for the nearest-neighbor search.

  • weights (Optional[Sequence]) – Array of length equal to x, specifying the weights to apply to each modality. Each value represents a multiplier of the within-population variance of its modality, i.e., larger values increase the contribution of that modality in the combined output matrix. The default of None is equivalent to an all-1 vector, i.e., all modalities are scaled to have the same within-population variance.

Return type:

NamedList

Returns:

A NamedList containing the following entries.

  • scaling: a double-precision NumPy array containing the scaling factor to be applied to each embedding in x.

  • combined: a double-precision NumPy matrix of scaled and concatenated embeddings. This is formed by scaling each embedding in x by its corresponding entry of scaling, and then concatenating them together by row. Each row corresponds to a dimension and each column corresponds to a cell.

References

https://libscran.github.io/mumosa, for the basis and caveats of this approach.

Examples

>>> import numpy
>>> rna_pcs = numpy.random.rand(25, 200)
>>> adt_pcs = numpy.random.rand(15, 200)
>>> other_pcs = numpy.random.rand(10, 200)
>>> import scranpy
>>> res = scranpy.scale_by_neighbors([rna_pcs, adt_pcs, other_pcs])
>>> print(res["scaling"])
scranpy.scale_by_neighbors_se(x, altexp_reddims, main_reddims='PCA', num_neighbors=20, block=None, nn_parameters=<knncolle.annoy.AnnoyParameters object>, num_threads=1, more_scale_args={}, output_name='combined', meta_name='combined')[source]

Scale embeddings for different modalities to equalize their intra-population variance, and then combine them into a single embedding for downstream analysis. This calls scale_by_neighbors() on the main/alternative experiments of a SingleCellExperiment.

Parameters:
  • x (SingleCellExperiment) – A SingleCellExperiment object or one of its subclasses. Rows correspond to genomic features and columns correspond to cells.

  • altexp_reddims (Union[dict, NamedList]) –

    Dictionary specifying the reduced dimensions of the alternative experiments of x to be combined. Specifically, each key is the name of an alternative experiment and the value is a list of strings/integers specifying the names/indices of the reduced dimensions of that experiment.

    Alternatively, each value may be a single string or integer that will be converted into a list of length 1. Each value may also be None in which case the corresponding alternative experiment is ignored.

    Alternatively, a NamedList where each element is named after one or more alternative experiments, and each value is as described above for a dictionary.

  • main_reddims (Union[int, str, Sequence]) –

    List of strings/integers specifying the names/indices of the reduced dimensions from the main experiment of x.

    Alternatively, a single string or integer that will be converted into a list of length 1.

    Alternatively None, which will be treated as a

  • num_neighbors (int) – Number of neighbors to compute the scaling, see scale_by_neighbors().

  • block (Optional[Sequence]) – Block assignment for each cell, passed to scale_by_neighbors().

  • nn_parameters (Parameters) – Algorithm for the nearest neighbor search, passed to scale_by_neighbors().

  • num_threads (int) – Number of threads, passed to scale_by_neighbors().

  • more_scale_args (dict) – Additional arguments to pass to scale_by_neighbors().

  • output_name (str) – Name of the reduced dimension entry in which to store the combined embeddings in the output object.

  • meta_name (Optional[str]) – Name of the metadata entry in which to store additional metrics in the output object. If None, additional metrics are not stored.

Return type:

SingleCellExperiment

Returns:

A copy of x with the combined embeddings stored in its row data. The scaling factors for all embeddings are stored in the metadata.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_adt_data_se("pca")
>>> sce = scranpy.scale_by_neighbors_se(sce, altexp_reddims={ "ADT": "PCA" })
>>> sce.get_reduced_dimension_names()
>>> sce.get_metadata()["combined"]
scranpy.score_gene_set(x, set, rank=1, scale=False, block=None, block_weight_policy='variable', variable_block_weight=(0, 1000), extra_work=7, iterations=1000, seed=5489, realized=True, num_threads=1)[source]

Compute per-cell scores for a gene set, defined as the column sums of a rank-1 approximation to the submatrix for the feature set. This uses the same approach implemented in the GSDecon package by Jason Hackney.

Parameters:
  • x (Any) – A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. The matrix is expected to contain log-expression values.

  • set (Sequence) – Array of integer indices specifying the rows of x belonging to the gene set. Alternatively, a sequence of boolean values of length equal to the number of rows, where truthy elements indicate that the corresponding row belongs to the gene set.

  • rank (int) – Rank of the approximation.

  • scale (bool) – Whether to scale all genes to have the same variance.

  • block (Optional[Sequence]) – Array of length equal to the number of columns of x, containing the block of origin (e.g., batch, sample) for each cell. Alternatively None, if all cells are from the same block.

  • block_weight_policy (Literal['variable', 'equal', 'none']) – Policy to use for weighting different blocks when computing the average for each statistic. Only used if block is provided.

  • variable_block_weight (Tuple) – Parameters for variable block weighting. This should be a tuple of length 2 where the first and second values are used as the lower and upper bounds, respectively, for the variable weight calculation. Only used if block is provided and block_weight_policy = "variable".

  • extra_work (int) – Number of extra dimensions for the IRLBA workspace.

  • iterations (int) – Maximum number of restart iterations for IRLBA.

  • seed (int) – Seed for the initial random vector in IRLBA.

  • realized (bool) – Whether to realize x into an optimal memory layout for IRLBA. This speeds up computation at the cost of increased memory usage.

  • num_threads (int) – Number of threads to use.

Return type:

NamedList

Returns:

A NamedList containing the following entries.

  • scores: a double-precision NumPy array containing the gene set score for each cell.

  • weights: a double-precision NumPy array containing containing the weight of each gene in set. A larger weight indicates that the corresponding gene in set has more influence on scores.

References

https://libscran.github.io/gsdecon, which describes the approach in more detail. In particular, the documentation for the compute_blocked function explains the blocking strategy.

Examples

>>> import numpy
>>> normed = numpy.random.rand(200, 100)
>>> import scranpy
>>> res = scranpy.score_gene_set(normed, set=range(20))
>>> print(res)
scranpy.score_gene_set_se(x, set, block=None, num_threads=1, more_score_args={}, assay_type='logcounts')[source]

Compute a gene set activity score for each cell based on the expression values of the genes in the set. This calls score_gene_set() on an assay of a SummarizedExperiment.

Parameters:
Return type:

NamedList

Returns:

A NamedList containing per-cell scores and per-gene weights, see score_gene_set() for details.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_rna_data_se("norm")
>>> custom_set = [0, 1, 4, 5, 7]
>>> custom_scores = scranpy.score_gene_set_se(sce, custom_set)
>>> custom_scores["scores"]
scranpy.score_markers(x, groups, block=None, block_average_policy='mean', block_weight_policy='variable', variable_block_weight=(0, 1000), block_quantile=0.5, threshold=0, compute_group_mean=True, compute_group_detected=True, compute_cohens_d=True, compute_auc=True, compute_delta_mean=True, compute_delta_detected=True, compute_summary_min=True, compute_summary_mean=True, compute_summary_median=True, compute_summary_max=True, compute_summary_quantiles=None, compute_summary_min_rank=True, min_rank_limit=500, all_pairwise=False, num_threads=1)[source]

Score marker genes for each group using a variety of effect sizes from pairwise comparisons between groups. This includes Cohen’s d, the area under the curve (AUC), the difference in the means (delta-mean) and the difference in the proportion of detected cells (delta-detected).

Parameters:
  • x (Any) – A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. It is typically expected to contain log-expression values, e.g., from normalize_counts().

  • groups (Sequence) – Group assignment for each cell in x. This should have length equal to the number of columns in x.

  • block (Optional[Sequence]) – Array of length equal to the number of columns of x, containing the block of origin (e.g., batch, sample) for each cell. Alternatively None, if all cells are from the same block.

  • block_average_policy (Literal['mean', 'quantile']) – Policy to use for average statistics across blocks. This can either be a (weighted) mean or a quantile. Only used if block is supplied.

  • block_weight_policy (Literal['variable', 'equal', 'none']) – Policy to use for weighting different blocks when computing the average for each statistic. Only used if block is provided.

  • variable_block_weight (Tuple) – Parameters for variable block weighting. This should be a tuple of length 2 where the first and second values are used as the lower and upper bounds, respectively, for the variable weight calculation. Only used if block is provided and block_weight_policy = "variable".

  • block_quantile (float) – Probability of the quantile of statistics across blocks. Defaults to 0.5, i.e., the median of per-block statistics. Only used if block is provided and block_average_policy ="quantile".

  • threshold (float) – Non-negative value specifying the minimum threshold on the differences in means (i.e., the log-fold change, if x contains log-expression values). This is incorporated into the calculation for Cohen’s d and the AUC.

  • compute_group_mean (bool) – Whether to compute the group-wise mean expression for each gene.

  • compute_group_detected (bool) – Whether to compute the group-wise proportion of detected cells for each gene.

  • compute_cohens_d (bool) – Whether to compute Cohen’s d, i.e., the ratio of the difference in means to the standard deviation.

  • compute_auc (bool) – Whether to compute the AUC. Setting this to False can improve speed and memory efficiency.

  • compute_delta_mean (bool) – Whether to compute the delta-means, i.e., the log-fold change when x contains log-expression values.

  • compute_delta_detected (bool) – Whether to compute the delta-detected, i.e., differences in the proportion of cells with detected expression.

  • compute_summary_min (bool) – Whether to compute the minimum as a summary statistic for each effect size. Only used if all_pairwise = False.

  • compute_summary_mean (bool) – Whether to compute the mean as a summary statistic for each effect size. Only used if all.pairwise = False.

  • compute_summary_median (bool) – Whether to compute the median as a summary statistic for each effect size. Only used if all_pairwise = False.

  • compute_summary_max (bool) – Whether to compute the maximum as a summary statistic for each effect size. Only used if all_pairwise = False.

  • compute_summary_quantiles (Optional[Sequence]) – Probabilities of quantiles to compute as summary statistics for each effect size. This should be in [0, 1] and sorted in order of increasing size. If None, no quantiles are computed. Only used if all_pairwise = False.

  • compute_summary_min_rank (bool) – Whether to compute the mininum rank as a summary statistic for each effect size. If None, no quantiles are computed. Only used if all_pairwise = False.

  • min_rank_limit (int) – Maximum value of the min-rank to report. Lower values improve memory efficiency at the cost of discarding information about lower-ranked genes. Only used if all_pairwise = False and compute_summary_min_rank = True.

  • all_pairwise (bool) – Whether to report the full effects for every pairwise comparison between groups. Alternatively, an integer scalar indicating the number of top markers to report from each pairwise comparison between groups. If False, only summaries are reported.

  • num_threads (int) – Number of threads to use.

Return type:

NamedList

Returns:

A NamedList containing various marker statistics for each group. This has the following entries:

  • nrow: integer specifying the number of genes in the dataset.

  • group_ids: list containing the identities of the groups.

  • means: double-precision NumPy matrix containing the mean expression for each gene in each group. Each row is a gene and each column is a group, ordered as in group_ids. Omitted if compute_group_means = False.

  • means: double-precision NumPy matrix containing the proportion of cells with detected expression for each gene in each group. Each row is a gene and each column is a group, ordered as in group_ids. Omitted if compute_group_means = False.

If all_pairwise = False, the NamedList contains the following additional entries:

  • cohens_d: a NamedList with the same structure as returned by summarized_effects().

    Briefly, each entry corresponds to a group in group_ids and is a BiocFrame with one row per gene. Each column contains a summary statistic of the Cohen’s d from pairwise comparisons to all other groups, e.g., min, mean, median, max, min-rank, and any requested quantiles. Columns are omitted if the relevant compute_summary_* option is set to False.

  • auc: Same as cohens_d but for the AUCs.

  • delta_mean: Same as cohens_d but for the delta-mean.

  • delta_detected: Same as cohens_d but for the delta-detected.

If all_pairwise = True, the NamedList contains the following addditional entries:

  • cohens_d: a 3-dimensional double-precision NumPy array containing the Cohen’s d from each pairwise comparison between groups. The extents of the first two dimensions are equal to the number of groups, while the extent of the final dimension is equal to the number of genes. Specifically, the entry [i, j, k] represents Cohen’s d from the comparison of group j over group i for gene k.

  • auc: Same as cohens_d but for the AUCs.

  • delta_mean: Same as cohens_d but for the delta-mean.

  • delta_detected: Same as cohens_d but for the delta-detected.

If all_pairwise is an integer, the NamedList contains the following additional entries:

  • cohens_d: a NamedList list of named lists of BiocFrame objects. The BiocFrame at cohens_d[m][n] contains the top markers for the comparison of group m over group n. Each BiocFrame has the index and effect columns, containing the row indices and effect sizes of the top genes, respectively.

  • auc: Same as cohens_d but for the AUCs.

  • delta_mean: Same as cohens_d but for the delta-mean.

  • delta_detected: Same as cohens_d but for the delta-detected.

Entries will be omitted if the relevant compute_* option is set to False. For example, if compute_cohens_d = False, the output will not contain any cohens_d entry.

References

The score_markers_summary and score_markers_pairwise functions in the scran_markers C++ library, which describes the rationale behind the choice of effect sizes and summary statistics. Also see their blocked equivalents score_markers_summary_blocked and score_markers_pairwise_blocked when block is provided.

Examples

>>> import numpy
>>> normed = numpy.random.rand(200, 100)
>>> import scranpy
>>> group = ["A", "B", "C", "D"] * 25
>>> res = scranpy.score_markers(normed, group)
>>> print(res["cohens_d"]["A"])
scranpy.score_markers_se(x, groups, block=None, num_threads=1, more_marker_args={}, assay_type='logcounts', extra_columns=None, order_by=True)

Identify candidate marker genes based on effect sizes from pairwise comparisons between groups of cells. This calls score_markers() on an assay of a SummarizedExperiment, and then uses format_score_markers_result() to reformat the results.

Parameters:
  • x (SummarizedExperiment) – A SummarizedExperiment object or one of its subclasses. Rows correspond to genes and columns correspond to cells.

  • groups (Sequence) – Group assignment for each cell, passed to score_markers().

  • block (Optional[Sequence]) – Block assignment for each cell, passed to score_markers().

  • num_threads (int) – Number of threads for marker scoring, passed to score_markers().

  • more_marker_args (dict) – Additional arguments to pass to score_markers().

  • assay_type (Union[str, int]) – Name or index of the assay of x to use for marker detection, usually containing log-normalized expression values.

  • extra_columns (Union[Sequence, str, NamedList, None]) – A BiocFrame with the same number of rows as x, containing extra columns to add each DataFrame. Alternatively, a list of strings specifying the columns of the row data of x to be added. A single string is treated as a list of length 1.

  • order_by (Union[str, bool, None]) – Name of the column to use for ordering the rows of each output BiocFrame. Alternatively True, in which case a column is automatically chosen from the effect size summaries. If None or False, no ordering is performed.

Return type:

NamedList

Returns:

A NamedList of Each ``BiocFrame` corresponds to a unique group in groups. Each row contains statistics for a gene in x, with the following columns:

  • mean, the mean expression in the current group.

  • detected, the proportion of cells with detected expression in the current group.

  • <effect>_<summary>, a summary statistic for an effect size. For example, cohens_d_mean contains the mean Cohen’s d across comparisons involving the current group.

Examples

>>> import scranpy
>>> sce = scranpy.get_test_rna_data_se("cluster")
>>> markers = scranpy.score_markers_se(sce, sce.get_column_data()["clusters"])
>>> print(scranpy.preview_markers(markers["0"]))
scranpy.subsample_by_neighbors(x, num_neighbors=20, min_remaining=10, nn_parameters=<knncolle.annoy.AnnoyParameters object>, num_threads=1)

Subsample a dataset by selecting cells to represent all of their nearest neighbors.

Parameters:
  • x (Union[ndarray, FindKnnResults, Index]) –

    Numeric matrix where rows are dimensions and columns are cells, typically containing a low-dimensional representation from, e.g., run_pca().

    Alternatively, a Index object containing a pre-built search index for a dataset.

    Alternatively, a FindKnnResults object containing pre-computed search results for a dataset.

  • num_neighbors (int) – Number of neighbors to use. Larger values result in greater downsampling. Ignored if x is a FindKnnResults object.

  • nn_parameters (Parameters) – Neighbor search algorithm to use. Only used if x does not contain existing neighbor search results.

  • min_remaining (int) – Minimum number of remaining (i.e., unselected) neighbors that a cell must have in order to be considered for selection. This should be less than or equal to num_neighbors.

  • num_threads (int) – Number of threads to use for the nearest-neighbor search. Only used if x does not contain existing neighbor search results.

Return type:

ndarray

Returns:

Integer NumPy array containing the indices of the cells selected to be in the subsample.

References

https://libscran.github.io/nenesub, for the rationale behind this approach.

Examples

>>> import numpy
>>> pcs = numpy.random.rand(20, 500)
>>> import scranpy
>>> keep = scranpy.subsample_by_neighbors(pcs)
>>> print(keep)
scranpy.suggest_adt_qc_thresholds(metrics, block=None, min_detected_drop=0.1, num_mads=3.0)[source]

Suggest filter thresholds for the ADT-derived QC metrics from compute_adt_qc_metrics().

Parameters:
  • metrics (BiocFrame) – ADT-derived QC metrics generated by compute_adt_qc_metrics(). This should contain the sum, detected and subset_sum columns.

  • block (Optional[Sequence]) – Blocking factor specifying the block of origin (e.g., batch, sample) for each cell in metrics. If supplied, a separate threshold is computed from the cells in each block. Alternatively None, if all cells are from the same block.

  • min_detected_drop (float) – Minimum proportional drop in the number of detected ADTs to consider a cell to be of low quality. This avoids overly aggressive filtering when the MAD is zero.

  • num_mads (float) – Number of MADs from the median to define the threshold for outliers in each QC metric.

Return type:

NamedList

Returns:

If block = None, a NamedList is returned, containing the following entries.

  • detected: a number specifying the lower threshold on the number of detected ADTs.

  • subsets: a FloatList of length equal to the number of control subsets (and named accordingly). Each entry represents the upper bound on the sum of counts in the corresponding control subset.

If block is provided, the NamedList instead contains:

  • detected, a FloatList of length equal to the number of blocks (and named accordingly). Each entry represents the lower threshold on the number of detected ADTs in the corresponding block.

  • subset_sum, a NamedList of length equal to the number of control subsets. Each entry is another FloatList that contains the upper threshold on the sum of counts for that subset in each block.

  • block_ids, a list containing the unique levels of the blocking factor. This is in the same order as the blocks in detected and subset_sum.

References

The compute_adt_qc_filters and compute_adt_qc_filters_blocked functions in the scran_qc C++ library, which describe the rationale behind the suggested filters.

Examples

>>> import numpy
>>> mat = numpy.reshape(numpy.random.poisson(lam=5, size=1000), (50, 20))
>>> import scranpy
>>> res = scranpy.compute_adt_qc_metrics(mat, { "IgG": [ 1, 10, 20, 40 ] })
>>> filt = scranpy.suggest_adt_qc_thresholds(res)
>>> print(filt)
scranpy.suggest_crispr_qc_thresholds(metrics, block=None, num_mads=3.0)[source]

Suggest filter thresholds for the CRISPR-derived QC metrics, typically generated from compute_crispr_qc_metrics().

Parameters:
  • metrics (BiocFrame) – CRISPR-derived QC metrics from compute_crispr_qc_metrics(). This should contain the sum, detected, max_value and max_index columns.

  • block (Optional[Sequence]) – Blocking factor specifying the block of origin (e.g., batch, sample) for each cell in metrics. If supplied, a separate threshold is computed from the cells in each block. Alternatively None, if all cells are from the same block.

  • num_mads (float) – Number of MADs from the median to define the threshold for outliers in each QC metric.

Return type:

NamedList

Returns:

If block = None, a NamedList is returned, containing the following entries.

  • max_value, a number specifying the lower threshold on the maximum count in each cell.

If block is provided, the NamedList instead contains:

  • max_value, a FloatList of length equal to the number of blocks (and named accordingly). Each entry represents the lower threshold on the maximum count in the corresponding block.

  • block_ids, a list containing the unique levels of the blocking factor. This is in the same order as the blocks in detected and subset_sum.

References

The compute_crispr_qc_filters and compute_crispr_qc_filters_blocked functions in the scran_qc C++ library, which describes the rationale behind the suggested filters.

Examples

>>> import numpy
>>> mat = numpy.reshape(numpy.random.poisson(lam=5, size=1000), (50, 20))
>>> import scranpy
>>> res = scranpy.compute_crispr_qc_metrics(mat)
>>> filt = scranpy.suggest_crispr_qc_thresholds(res)
>>> print(filt)
scranpy.suggest_rna_qc_thresholds(metrics, block=None, num_mads=3.0)[source]

Suggest filter thresholds for the RNA-derived QC metrics, typically generated from compute_rna_qc_metrics().

Parameters:
  • metrics (BiocFrame) – RNA-derived QC metrics from compute_rna_qc_metrics(). This should contain the sum, detected and subset_proportion columns.

  • block (Optional[Sequence]) – Blocking factor specifying the block of origin (e.g., batch, sample) for each cell in metrics. If supplied, a separate threshold is computed from the cells in each block. Alternatively None, if all cells are from the same block.

  • num_mads (float) – Number of MADs from the median to define the threshold for outliers in each QC metric.

Return type:

NamedList

Returns:

If block = None, a NamedList is returned, containing the following entries.

  • sum, a number specifying the lower threshold on the sum of counts in each cell.

  • detected, a number specifying the lower threshold on the number of detected genes.

  • subset_proportion, a FloatList of length equal to the number of control subsets (and named accordingly). Each entry represents the upper bound on the proportion of counts in the corresponding control subset.

If block is provided, the NamedList instead contains:

  • sum, a FloatList of length equal to the number of blocks (and named accordingly). Each entry represents the lower threshold on the sum of counts in the corresponding block.

  • detected, a FloatList of length equal to the number of blocks (and named accordingly). Each entry represents the lower threshold on the number of detected genes in the corresponding block.

  • subset_proportion, a NamedList of length equal to the number of control subsets. Each entry is another FloatList that contains the upper threshold on the proportion of counts for that subset in each block.

  • block_ids, a list containing the unique levels of the blocking factor. This is in the same order as the blocks in detected and subset_sum.

References

The compute_rna_qc_filters and compute_rna_qc_filters_blocked functions in the scran_qc C++ library, which describes the rationale behind the suggested filters.

Examples

>>> import numpy
>>> mat = numpy.reshape(numpy.random.poisson(lam=5, size=1000), (50, 20))
>>> import scranpy
>>> res = scranpy.compute_rna_qc_metrics(mat, { "mito": [ 1, 10, 20, 40 ] })
>>> filt = scranpy.suggest_rna_qc_thresholds(res)
>>> print(filt)
scranpy.summarize_effects(effects, compute_min=True, compute_mean=True, compute_median=True, compute_max=True, compute_quantiles=None, compute_min_rank=True, num_threads=1)[source]

For each group, summarize the effect sizes for all pairwise comparisons to other groups. This yields a set of summary statistics that can be used to rank marker genes for each group.

Parameters:
  • effects (ndarray) – A 3-dimensional numeric containing the effect sizes from each pairwise comparison between groups. The extents of the first two dimensions should be equal to the number of groups, while the extent of the final dimension is equal to the number of genes. The entry [i, j, k] should represent the effect size from the comparison of group j against group i for gene k. See also the output of score_markers() with all_pairwise = True.

  • compute_min (bool) – Whether to compute the minimum as a summary statistic for each effect size.

  • compute_mean (bool) – Whether to compute the mean as a summary statistic for each effect size.

  • compute_median (bool) – Whether to compute the median as a summary statistic for each effect size.

  • compute_max (bool) – Whether to compute the maximum as a summary statistic for each effect size.

  • compute_quantiles (Optional[Sequence]) – Probabilities of quantiles to compute as summary statistics for each effect size. This should be in [0, 1] and sorted in order of increasing size. If None, no quantiles are computed.

  • compute_min_rank (bool) – Whether to compute the mininum rank as a summary statistic for each effect size. If None, no quantiles are computed.

  • num_threads (int) – Number of threads to use.

Return type:

NamedList

Returns:

A NamedList of length equal to the number of groups (i.e., the extents of the first two dimensions of effects). Each entry is a BiocFrame where each row corresponds to a gene in effects. Each column contain a summary statistic of the effect sizes of the comparisons involving its corresponding group.

  • min: double-precision NumPy array containing the minimum effect size across all pairwise comparisons to other groups. Only present if compute_min = true.

  • median: double-precision NumPy array containing the median effect size across all pairwise comparisons to other groups. Only present if compute_median = True.

  • mean: double-precision NumPy array containing the mean effect size from across all pairwise comparisons to other groups. Only present if compute_median = True.

  • quantile: nested BiocFrame containing the specified quantiles of the effect sizes across all pairwise comparisons to other groups. Only present if compute_quantiles is provided.

  • max: double-precision NumPy array containing the maximum effect size across all pairwise comparisons to other groups. Only present if compute_max = true.

  • min_rank: integer array containing the minimum rank of each gene across all pairwise comparisons to other groups. Only present if compute_min_rank = true.

References

The summarize_effects function in the scran_markers C++ library, for more details on the statistics.

Examples

>>> import numpy
>>> normed = numpy.random.rand(200, 100)
>>> import scranpy
>>> group = ["A", "B", "C", "D"] * 25
>>> res = scranpy.score_markers(normed, group, all_pairwise=True)
>>> summaries = scranpy.summarize_effects(res["cohens_d"])
>>> print(summaries[0])
scranpy.test_enrichment(x, sets, universe, log=False, num_threads=1)[source]

Perform a hypergeometric test for enrichment of interesting genes (e.g., markers) in one or more pre-defined gene sets.

Parameters:
  • x (Sequence) – Sequence of identifiers for the interesting genes.

  • sets (Union[dict, Sequence]) –

    Sequence of gene sets, where each entry corresponds to a gene set and contains a sequence of identifiers for genes in that set.

    Alternatively, a dictionary where each key is the name of a gene set and each value is a sequence of identifiers for that gene set.

  • universe (Union[int, Sequence]) –

    Sequence of identifiers for the universe of genes in the dataset. It is assumed that x is a subset of universe. Identifiers in sets that are not in universe will be ignored.

    Alternatively, an integer specifying the number of genes in the universe.

  • log (bool) – Whether to report the log-transformed p-values.

  • num_threads (int) – Number of threads to use.

Return type:

ndarray

Returns:

Double-precision NumPy array of (log-transformed) p-values to test for significant enrichment of x in each entry of sets.

References

https://libscran.github.io/phyper, for the underlying implementation.

Examples

>>> LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
>>> import scranpy
>>> scranpy.test_enrichment(
>>>     x=["A", "B", "C", "D", "E"],
>>>     sets={
>>>         "set1": LETTERS[:10],
>>>         "set2": ["B", "D", "F", "H", "J"],
>>>         "set3": LETTERS[10:20]
>>>     },
>>>     universe=LETTERS
>>> )
scranpy.tsne_perplexity_to_neighbors(perplexity)

Determine the number of nearest neighbors required to support a given perplexity in the t-SNE algorithm.

Parameters:

perplexity (float) – Perplexity to use in run_tsne().

Return type:

int

Returns:

The corresponding number of nearest neighbors.

Examples

>>> import scranpy
>>> scranpy.tsne_perplexity_to_neighbors(10)
>>> scranpy.tsne_perplexity_to_neighbors(30)
>>> scranpy.tsne_perplexity_to_neighbors(50)