scranpy package

Submodules

scranpy.adt_quality_control module

class scranpy.adt_quality_control.ComputeAdtQcMetricsResults(sum, detected, subset_sum)

Bases: object

Results of compute_adt_qc_metrics().

detected: ndarray
subset_sum: NamedList
sum: ndarray
to_biocframe(flatten=True)

Convert the results into a BiocFrame.

Parameters:

flatten (bool) – Whether to flatten the subset sums into separate columns. If True, each entry of subset_sum is represented by a subset_sum_<NAME> column, where <NAME> is the name of the entry (if available) or its index (otherwise). If False, subset_sum is represented by a nested BiocFrame.

Returns:

A BiocFrame where each row corresponds to a cell and each column is one of the metrics.

class scranpy.adt_quality_control.SuggestAdtQcThresholdsResults(detected, subset_sum, block)

Bases: object

Results of suggest_adt_qc_thresholds().

block: Optional[list]
detected: Union[NamedList, float]
subset_sum: NamedList
scranpy.adt_quality_control.compute_adt_qc_metrics(x, subsets, num_threads=1)

Compute quality control metrics from ADT count data.

Parameters:
  • x (Any) – A matrix-like object containing ADT counts.

  • subsets (Union[Mapping, Sequence]) –

    Subsets of ADTs corresponding to control features like IgGs. This may be either:

    • A list of arrays. Each array corresponds to an ADT subset and can contain either boolean or integer values. For booleans, the array should have length equal to the number of rows, with truthy values for rows that belong to the subset. For integers, each element of the array is treated as the row index of an ADT in the subset.

    • A dictionary where keys are the names of each ADT subset and the values are arrays as described above.

    • A NamedList where each element is an array as described above, possibly with names.

  • num_threads (int) – Number of threads to use.

Return type:

ComputeAdtQcMetricsResults

Returns:

QC metrics computed from the ADT count matrix for each cell.

References

The compute_adt_qc_metrics function in the scran_qc C++ library, which describes the rationale behind these QC metrics.
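
Example

A minimal usage sketch, assuming that a dense NumPy array is accepted as the matrix-like input; the IgG control subset is made up for illustration.

import numpy
from scranpy.adt_quality_control import compute_adt_qc_metrics

rng = numpy.random.default_rng(42)
adt_counts = rng.poisson(10, size=(50, 200))  # 50 tags x 200 cells

is_igg = numpy.zeros(50, dtype=bool)
is_igg[:5] = True  # hypothetical IgG control tags

metrics = compute_adt_qc_metrics(adt_counts, subsets={"igg": is_igg})
print(metrics.to_biocframe())  # one row per cell, one column per metric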

scranpy.adt_quality_control.filter_adt_qc_metrics(thresholds, metrics, block=None)

Filter for high-quality cells based on ADT-derived QC metrics.

Parameters:
  • thresholds (SuggestAdtQcThresholdsResults) – Filter thresholds on the QC metrics, typically computed by suggest_adt_qc_thresholds().

  • metrics (ComputeAdtQcMetricsResults) – ADT-derived QC metrics, typically computed by compute_adt_qc_metrics().

  • block (Optional[Sequence]) – Blocking factor specifying the block of origin (e.g., batch, sample) for each cell in metrics. This should be the same as that used in suggest_adt_qc_thresholds(). Alternatively None, if all cells are from the same block.

Return type:

ndarray

Returns:

A NumPy vector of length equal to the number of cells in metrics, containing truthy values for putative high-quality cells.

scranpy.adt_quality_control.suggest_adt_qc_thresholds(metrics, block=None, min_detected_drop=0.1, num_mads=3.0)

Suggest filter thresholds for the ADT-derived QC metrics, typically generated from compute_adt_qc_metrics().

Parameters:
  • metrics (ComputeAdtQcMetricsResults) – ADT-derived QC metrics from compute_adt_qc_metrics().

  • block (Optional[Sequence]) – Blocking factor specifying the block of origin (e.g., batch, sample) for each cell in metrics. If supplied, a separate threshold is computed from the cells in each block. Alternatively None, if all cells are from the same block.

  • min_detected_drop (float) – Minimum proportional drop in the number of detected ADTs required to consider a cell to be of low quality. Specifically, the filter threshold on metrics.detected must be no higher than the product of (1 - min_detected_drop) and the median number of detected ADTs, regardless of num_mads.

  • num_mads (float) – Number of MADs from the median to define the threshold for outliers in each QC metric.

Return type:

SuggestAdtQcThresholdsResults

Returns:

Suggested filters on the relevant QC metrics.

References

The compute_adt_qc_filters and compute_adt_qc_filters_blocked functions in the scran_qc C++ library, which describe the rationale behind the suggested filters.
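
Example

A hedged end-to-end sketch chaining compute_adt_qc_metrics(), suggest_adt_qc_thresholds() and filter_adt_qc_metrics() on simulated counts from a single block; the control subset is made up.

import numpy
from scranpy.adt_quality_control import (
    compute_adt_qc_metrics,
    filter_adt_qc_metrics,
    suggest_adt_qc_thresholds,
)

rng = numpy.random.default_rng(0)
adt_counts = rng.poisson(20, size=(30, 500))  # 30 tags x 500 cells
igg = numpy.arange(3)  # indices of hypothetical IgG control tags

metrics = compute_adt_qc_metrics(adt_counts, subsets={"igg": igg})
thresholds = suggest_adt_qc_thresholds(metrics)  # single block, default num_mads
keep = filter_adt_qc_metrics(thresholds, metrics)
filtered = adt_counts[:, keep.astype(bool)]  # retain putative high-quality cells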

scranpy.aggregate_across_cells module

class scranpy.aggregate_across_cells.AggregateAcrossCellsResults(sum, detected, combinations, counts, index)

Bases: object

Results of aggregate_across_cells().

combinations: NamedList
counts: ndarray
detected: ndarray
index: ndarray
sum: ndarray
to_summarizedexperiment(include_counts=True)

Convert the results to a SummarizedExperiment.

Parameters:

include_counts (bool) – Whether to include counts in the column data. Users may need to set this to False if a "counts" factor is present in combinations.

Returns:

A SummarizedExperiment where sum and detected are assays and combinations is stored in the column data.

scranpy.aggregate_across_cells.aggregate_across_cells(x, factors, num_threads=1)

Aggregate expression values across cells based on one or more grouping factors. This is primarily used to create pseudo-bulk profiles for each cluster/sample combination.

Parameters:
  • x (Any) – A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. Values are expected to be counts.

  • factors (Sequence) – One or more grouping factors, see combine_factors(). If this is a NamedList, any names will be retained in the output.

  • num_threads (int) – Number of threads to use for aggregation.

Return type:

AggregateAcrossCellsResults

Returns:

Results of the aggregation, including the sum and the number of detected cells in each group for each gene.

References

The aggregate_across_cells function in the scran_aggregate C++ library, which implements the aggregation.
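
Example

A small sketch of pseudo-bulk aggregation over made-up cluster and sample labels, assuming a dense NumPy count matrix is accepted as the matrix-like input.

import numpy
from scranpy.aggregate_across_cells import aggregate_across_cells

rng = numpy.random.default_rng(1)
counts = rng.poisson(2, size=(100, 300))  # 100 genes x 300 cells
clusters = rng.integers(0, 4, size=300)  # hypothetical cluster assignments
samples = rng.choice(["A", "B"], size=300)  # hypothetical sample labels

agg = aggregate_across_cells(counts, [clusters, samples])
print(agg.counts)  # number of cells in each cluster/sample combination
se = agg.to_summarizedexperiment()  # sums and detected counts as assays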

scranpy.aggregate_across_genes module

scranpy.aggregate_across_genes.aggregate_across_genes(x, sets, average=False, num_threads=1)[source]

Aggregate expression values across genes, potentially with weights. This is typically used to summarize expression values for gene sets into a single per-cell score.

Parameters:
  • x (Any) – Matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. Values are expected to be log-expression values.

  • sets (Sequence) – Sequence of integer arrays containing the row indices of genes in each set. Alternatively, each entry may be a tuple of length 2, containing an integer vector (row indices) and a numeric vector (weights). If this is a NamedList, the names will be preserved in the output.

  • average (bool) – Whether to compute the average rather than the sum.

  • num_threads (int) – Number of threads to be used for aggregation.

Return type:

NamedList

Returns:

A list of length equal to that of sets. Each entry is a numeric vector of length equal to the number of columns in x, containing the (weighted) sum or mean of expression values across the corresponding set's genes for each cell.

References

The aggregate_across_genes function in the scran_aggregate C++ library, which implements the aggregation.
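
Example

A sketch of per-cell gene set scoring on simulated log-expression values; the gene sets and weights below are arbitrary.

import numpy
from scranpy.aggregate_across_genes import aggregate_across_genes

rng = numpy.random.default_rng(2)
logexp = rng.random(size=(200, 100))  # 200 genes x 100 cells

sets = [
    numpy.array([0, 5, 9]),  # row indices only
    (numpy.array([10, 11, 12]), numpy.array([0.5, 1.0, 2.0])),  # indices plus weights
]
scores = aggregate_across_genes(logexp, sets, average=True)
print(len(scores), len(scores[0]))  # one vector per set, one value per cell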

scranpy.analyze module

class scranpy.analyze.AnalyzeResults(rna_qc_metrics, rna_qc_thresholds, rna_qc_filter, adt_qc_metrics, adt_qc_thresholds, adt_qc_filter, crispr_qc_metrics, crispr_qc_thresholds, crispr_qc_filter, combined_qc_filter, rna_filtered, adt_filtered, crispr_filtered, rna_size_factors, rna_normalized, adt_size_factors, adt_normalized, crispr_size_factors, crispr_normalized, rna_gene_variances, rna_highly_variable_genes, rna_pca, adt_pca, crispr_pca, combined_pca, block, mnn_corrected, tsne, umap, snn_graph, graph_clusters, kmeans_clusters, clusters, rna_markers, adt_markers, crispr_markers, rna_row_names, adt_row_names, crispr_row_names, column_names)

Bases: object

Results of analyze().

adt_filtered: Optional[DelayedArray]
adt_markers: Optional[RunPcaResults]
adt_normalized: Optional[DelayedArray]
adt_pca: Optional[RunPcaResults]
adt_qc_filter: Optional[ndarray]
adt_qc_metrics: Optional[ComputeAdtQcMetricsResults]
adt_qc_thresholds: Optional[SuggestAdtQcThresholdsResults]
adt_row_names: Names
adt_size_factors: Optional[ndarray]
block: Optional[Sequence]
clusters: Optional[ndarray]
column_names: Names
combined_pca: Union[Literal['rna_pca', 'adt_pca', 'crispr_pca'], ScaleByNeighborsResults]
combined_qc_filter: ndarray
crispr_filtered: Optional[DelayedArray]
crispr_markers: Optional[RunPcaResults]
crispr_normalized: Optional[DelayedArray]
crispr_pca: Optional[RunPcaResults]
crispr_qc_filter: Optional[ndarray]
crispr_qc_metrics: Optional[ComputeCrisprQcMetricsResults]
crispr_qc_thresholds: Optional[SuggestCrisprQcThresholdsResults]
crispr_row_names: Names
crispr_size_factors: Optional[ndarray]
graph_clusters: Optional[ClusterGraphResults]
kmeans_clusters: Optional[ClusterGraphResults]
mnn_corrected: Optional[CorrectMnnResults]
rna_filtered: Optional[DelayedArray]
rna_gene_variances: Optional[ModelGeneVariancesResults]
rna_highly_variable_genes: Optional[ndarray]
rna_markers: Optional[RunPcaResults]
rna_normalized: Optional[DelayedArray]
rna_pca: Optional[RunPcaResults]
rna_qc_filter: Optional[ndarray]
rna_qc_metrics: Optional[ComputeRnaQcMetricsResults]
rna_qc_thresholds: Optional[SuggestRnaQcThresholdsResults]
rna_row_names: Optional[Names]
rna_size_factors: Optional[ndarray]
snn_graph: Optional[GraphComponents]
to_singlecellexperiment(main_modality=None, flatten_qc_subsets=True, include_per_block_variances=False)

Convert the results into a SingleCellExperiment.

Parameters:
  • main_modality (Optional[Literal['rna', 'adt', 'crispr']]) – Modality to use as the main experiment. If other modalities are present, they are stored in the alternative experiments. If None, it defaults to RNA, then ADT, then CRISPR, depending on which modalities are available.

  • flatten_qc_subsets (bool) – Whether to flatten QC feature subsets, see the to_biocframe() method of the ComputeRnaQcMetricsResults class for more details.

  • include_per_block_variances (bool) – Whether to compute the per-block variances, see the to_biocframe() method of the ModelGeneVariancesResults class for more details.

Returns:

A SingleCellExperiment containing the filtered and normalized matrices in the assays. QC metrics, size factors and clustering results are stored in the column data. PCA and other low-dimensional embeddings are stored in the reduced dimensions. Additional modalities are stored as alternative experiments.

tsne: Optional[ndarray]
umap: Optional[ndarray]
scranpy.analyze.analyze(rna_x, adt_x=None, crispr_x=None, block=None, rna_subsets=[], adt_subsets=[], suggest_rna_qc_thresholds_options={}, suggest_adt_qc_thresholds_options={}, suggest_crispr_qc_thresholds_options={}, filter_cells=True, center_size_factors_options={}, compute_clrm1_factors_options={}, normalize_counts_options={}, model_gene_variances_options={}, choose_highly_variable_genes_options={}, run_pca_options={}, use_rna_pcs=True, use_adt_pcs=True, use_crispr_pcs=True, scale_by_neighbors_options={}, correct_mnn_options={}, run_umap_options={}, run_tsne_options={}, build_snn_graph_options={}, cluster_graph_options={}, run_all_neighbor_steps_options={}, kmeans_clusters=None, cluster_kmeans_options={}, clusters_for_markers=['graph', 'kmeans'], score_markers_options={}, nn_parameters=<knncolle.annoy.AnnoyParameters object>, rna_assay=0, adt_assay=0, crispr_assay=0, num_threads=3)

Run through a simple single-cell analysis pipeline, starting from a count matrix and ending with clusters, visualizations and markers. This also supports integration of multiple modalities and correction of batch effects.

Parameters:
  • rna_x (Optional[Any]) –

    A matrix-like object containing RNA counts. This should have the same number of columns as the other *_x arguments.

    Alternatively, a SummarizedExperiment object containing such a matrix in its rna_assay.

    Alternatively None, if no RNA counts are available.

  • adt_x (Optional[Any]) –

    A matrix-like object containing ADT counts. This should have the same number of columns as the other *_x arguments.

    Alternatively, a SummarizedExperiment object containing such a matrix in its adt_assay.

    Alternatively None, if no ADT counts are available.

  • crispr_x (Optional[Any]) –

    A matrix-like object containing CRISPR counts. This should have the same number of columns as the other *_x arguments.

    Alternatively, a SummarizedExperiment object containing such a matrix in its crispr_assay.

    Alternatively None, if no CRISPR counts are available.

  • block (Optional[Sequence]) – Factor specifying the block of origin (e.g., batch, sample) for each cell in the *_x matrices. Alternatively None, if all cells are from the same block.

  • rna_subsets (Union[Mapping, Sequence]) – Gene subsets for quality control, typically used for mitochondrial genes. Check out the subsets argument in compute_rna_qc_metrics() for details.

  • adt_subsets (Union[Mapping, Sequence]) – ADT subsets for quality control, typically used for IgG controls. Check out the subsets argument in compute_adt_qc_metrics() for details.

  • suggest_rna_qc_thresholds_options (dict) – Arguments to pass to suggest_rna_qc_thresholds().

  • suggest_adt_qc_thresholds_options (dict) – Arguments to pass to suggest_adt_qc_thresholds().

  • suggest_crispr_qc_thresholds_options (dict) – Arguments to pass to suggest_crispr_qc_thresholds().

  • filter_cells (bool) – Whether to filter the count matrices to only retain high-quality cells in all modalities. If False, QC metrics and thresholds are still computed but are not used to filter the count matrices.

  • center_size_factors_options (dict) – Arguments to pass to center_size_factors().

  • compute_clrm1_factors_options (dict) – Arguments to pass to compute_clrm1_factors(). Only used if adt_x is provided.

  • normalize_counts_options (dict) – Arguments to pass to normalize_counts().

  • model_gene_variances_options (dict) – Arguments to pass to model_gene_variances(). Only used if rna_x is provided.

  • choose_highly_variable_genes_options (dict) – Arguments to pass to choose_highly_variable_genes(). Only used if rna_x is provided.

  • run_pca_options (dict) – Arguments to pass to run_pca().

  • use_rna_pcs (bool) – Whether to use the RNA-derived PCs for downstream steps (i.e., clustering, visualization). Only used if rna_x is provided.

  • use_adt_pcs (bool) – Whether to use the ADT-derived PCs for downstream steps (i.e., clustering, visualization). Only used if adt_x is provided.

  • use_crispr_pcs (bool) – Whether to use the CRISPR-derived PCs for downstream steps (i.e., clustering, visualization). Only used if crispr_x is provided.

  • scale_by_neighbors_options (dict) – Arguments to pass to scale_by_neighbors(). Only used if multiple modalities are available and their corresponding use_*_pca arguments are True.

  • correct_mnn_options (dict) – Arguments to pass to correct_mnn(). Only used if block is supplied.

  • run_tsne_options (Optional[dict]) – Arguments to pass to run_tsne(). If None, t-SNE is not performed.

  • run_umap_options (Optional[dict]) – Arguments to pass to run_umap(). If None, UMAP is not performed.

  • build_snn_graph_options (Optional[dict]) – Arguments to pass to build_snn_graph(). Ignored if cluster_graph_options = None.

  • cluster_graph_options (Optional[dict]) – Arguments to pass to cluster_graph(). If None, graph-based clustering is not performed.

  • run_all_neighbor_steps_options (dict) – Arguments to pass to run_all_neighbor_steps().

  • kmeans_clusters (Optional[int]) – Number of clusters to use in k-means clustering. If None, k-means clustering is not performed.

  • cluster_kmeans_options (dict) – Arguments to pass to cluster_kmeans(). Ignored if kmeans_clusters = None.

  • clusters_for_markers (list) – List of clustering algorithms (either graph or kmeans), specifying the clustering to be used for marker detection. The first available clustering will be chosen.

  • score_markers_options (dict) – Arguments to pass to score_markers(). Ignored if no suitable clusterings are available.

  • nn_parameters (Parameters) – Algorithm to use for nearest-neighbor searches in the various steps.

  • rna_assay (Union[int, str]) – Integer or string specifying the assay to use if rna_x is a SummarizedExperiment.

  • adt_assay (Union[int, str]) – Integer or string specifying the assay to use if adt_x is a SummarizedExperiment.

  • crispr_assay (Union[int, str]) – Integer or string specifying the assay to use if crispr_x is a SummarizedExperiment.

  • num_threads (int) – Number of threads to use in each step.

Return type:

AnalyzeResults

Returns:

The results of the entire analysis, including the results from each step.

References

C++ libraries in the libscran GitHub organization, which implement all of these steps.
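
Example

A hedged sketch of the full pipeline on simulated RNA counts only. The data are random, so the resulting clusters are not meaningful, and the mitochondrial subset is made up; this only illustrates the call shape.

import numpy
from scranpy.analyze import analyze

rng = numpy.random.default_rng(3)
rna_counts = rng.poisson(1.5, size=(1000, 300))  # 1000 genes x 300 cells
is_mito = numpy.zeros(1000, dtype=bool)
is_mito[:10] = True  # hypothetical mitochondrial genes

res = analyze(rna_counts, rna_subsets={"mito": is_mito}, num_threads=2)
print(res.clusters)  # per-cell cluster assignments
sce = res.to_singlecellexperiment()  # package everything into one object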

scranpy.build_snn_graph module

class scranpy.build_snn_graph.GraphComponents(vertices, edges, weights)

Bases: object

Components of a (possibly weighted) graph. Typically, nodes are cells and edges are formed between cells with similar expression profiles.

as_igraph()

Convert to a Graph from the igraph package.

Returns:

A Graph for use with methods in the igraph package.

edges: ndarray
vertices: int
weights: Optional[ndarray]
scranpy.build_snn_graph.build_snn_graph(x, num_neighbors=10, weight_scheme='ranked', num_threads=1, nn_parameters=<knncolle.annoy.AnnoyParameters object>)

Build a shared nearest neighbor (SNN) graph where each node is a cell. Edges are formed between cells that share one or more nearest neighbors, weighted by the number or importance of those shared neighbors.

Parameters:
  • x (Union[ndarray, FindKnnResults, Index]) –

    Numeric matrix where rows are dimensions and columns are cells, typically containing a low-dimensional representation from, e.g., run_pca().

    Alternatively, a FindKnnResults object containing existing neighbor search results. The number of neighbors should be the same as num_neighbors, otherwise a warning is raised.

    Alternatively, an Index object.

  • num_neighbors (int) – Number of neighbors in the nearest-neighbor graph. Larger values generally result in broader clusters during community detection.

  • weight_scheme (Literal['ranked', 'number', 'jaccard']) – Weighting scheme to use for the edges of the SNN graph, based on the number or ranking of the shared nearest neighbors.

  • num_threads (int) – Number of threads to use.

  • nn_parameters (Parameters) – The algorithm to use for the nearest-neighbor search. Only used if x is not a pre-built nearest-neighbor search index or a list of existing nearest-neighbor search results.

Return type:

GraphComponents

Returns:

The components of the SNN graph, to be used in community detection.

References

The build_snn_graph function in the scran_graph_cluster C++ library, which provides some more details on the weighting.
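
Example

A minimal sketch on a random embedding, assuming rows are dimensions and columns are cells (as produced by run_pca()).

import numpy
from scranpy.build_snn_graph import build_snn_graph

rng = numpy.random.default_rng(4)
pcs = rng.normal(size=(20, 500))  # 20 dimensions x 500 cells

graph = build_snn_graph(pcs, num_neighbors=15)
print(graph.vertices, len(graph.edges))  # number of cells and size of the edge array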

scranpy.center_size_factors module

scranpy.center_size_factors.center_size_factors(size_factors, block=None, mode='lowest', in_place=False)

Center size factors before computing normalized values from the count matrix. This ensures that the normalized values are on the same scale as the original counts for easier interpretation.

Parameters:
  • size_factors (ndarray) – Floating-point array containing size factors for all cells.

  • block (Optional[Sequence]) – Block assignment for each cell. If provided, this should have length equal to the number of cells, where cells have the same value if and only if they are in the same block. Defaults to None, where all cells are treated as being part of the same block.

  • mode (Literal['lowest', 'per-block']) – How to scale size factors across blocks. lowest will scale all size factors by the lowest per-block average. per-block will center the size factors in each block separately. This argument is only used if block is provided.

  • in_place (bool) – Whether to modify size_factors in place. If False, a new array is returned. This argument is only used if size_factors is double-precision; otherwise, a new array is always returned.

Return type:

ndarray

Returns:

Array containing centered size factors. If in_place = True, this is a reference to size_factors.

References

The center_size_factors and center_size_factors_blocked functions in the scran_norm C++ library, which describe the rationale behind centering.
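
Example

A sketch of centering simulated size factors across two made-up batches.

import numpy
from scranpy.center_size_factors import center_size_factors

rng = numpy.random.default_rng(5)
sf = rng.gamma(2.0, 1.0, size=400)  # raw, uncentered size factors
block = ["batch1"] * 200 + ["batch2"] * 200  # hypothetical batch labels

centered = center_size_factors(sf, block=block, mode="lowest")
print(centered.mean())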

scranpy.choose_highly_variable_genes module

scranpy.choose_highly_variable_genes.choose_highly_variable_genes(stats, top=4000, larger=True, keep_ties=True, bound=None)[source]

Choose highly variable genes (HVGs), typically based on a variance-related statistic.

Parameters:
  • stats (ndarray) – Array of variances (or a related statistic) across all genes. Typically, the residuals from model_gene_variances() are used here.

  • top (int) – Number of top genes to retain. Note that the actual number of retained genes may not be equal to top, depending on the other options.

  • larger (bool) – Whether larger values of stats represent more variable genes. If True, HVGs are defined from the largest values of stats.

  • keep_ties (bool) – Whether to keep ties at the top-th most variable gene. This avoids arbitrary breaking of tied values.

  • bound (Optional[float]) – The lower bound (if larger = True) or upper bound (otherwise) to be applied to stats. Genes are not considered to be HVGs if they do not pass this bound, even if they are within the top genes. Ignored if None.

Return type:

ndarray

Returns:

Array containing the indices of genes in stats that are considered to be highly variable.

References

The choose_highly_variable_genes function from the scran_variances library, which provides the underlying implementation.
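
Example

A sketch of HVG selection from simulated variance residuals standing in for the output of model_gene_variances().

import numpy
from scranpy.choose_highly_variable_genes import choose_highly_variable_genes

rng = numpy.random.default_rng(6)
residuals = rng.normal(size=5000)  # placeholder for per-gene residuals

hvgs = choose_highly_variable_genes(residuals, top=2000, bound=0)
print(len(hvgs))  # indices of the retained genes, all with residuals above the bound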

scranpy.choose_pseudo_count module

scranpy.choose_pseudo_count.choose_pseudo_count(size_factors, quantile=0.05, max_bias=1, min_value=1)[source]

Choose a suitable pseudo-count to control the bias introduced by log-transformation of normalized counts.

Parameters:
  • size_factors (ndarray) – Floating-point array of size factors for all cells.

  • quantile (float) – Quantile to use for defining extreme size factors.

  • max_bias (float) – Maximum allowed bias in the log-fold changes between cells.

  • min_value (float) – Minimum value for the pseudo-count.

Return type:

float

Returns:

Choice of pseudo-count, for use in normalize_counts().

References

The choose_pseudo_count function in the scran_norm C++ library, which describes the rationale behind the choice of pseudo-count.
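
Example

A sketch of choosing a pseudo-count from simulated size factors; in practice these would be the centered factors from center_size_factors().

import numpy
from scranpy.choose_pseudo_count import choose_pseudo_count

rng = numpy.random.default_rng(7)
size_factors = rng.gamma(2.0, 1.0, size=1000)
size_factors /= size_factors.mean()  # crude centering for illustration

pseudo = choose_pseudo_count(size_factors, max_bias=0.5)
print(pseudo)  # at least min_value, larger for more variable size factors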

scranpy.cluster_graph module

class scranpy.cluster_graph.ClusterGraphLeidenResults(membership, quality)

Bases: ClusterGraphResults

Clustering results from cluster_graph() when method = "leiden".

quality: float
class scranpy.cluster_graph.ClusterGraphMultilevelResults(membership, levels, modularity)

Bases: ClusterGraphResults

Clustering results from cluster_graph() when method = "multilevel".

levels: tuple[ndarray]
modularity: ndarray
class scranpy.cluster_graph.ClusterGraphResults(membership)

Bases: object

Clustering results from cluster_graph().

membership: ndarray
class scranpy.cluster_graph.ClusterGraphWalktrapResults(membership, merges, modularity)

Bases: ClusterGraphResults

Clustering results from cluster_graph() when method = "walktrap".

merges: ndarray
modularity: ndarray
scranpy.cluster_graph.cluster_graph(x, method='multilevel', multilevel_resolution=1, leiden_resolution=1, leiden_objective='modularity', walktrap_steps=4, seed=42)

Identify clusters of cells using a variety of community detection methods from a graph where similar cells are connected.

Parameters:
  • x (GraphComponents) – Components of the graph to be clustered, typically produced by build_snn_graph().

  • method (Literal['multilevel', 'leiden', 'walktrap']) – Community detection algorithm to use.

  • multilevel_resolution (float) – Resolution of the clustering when method = "multilevel". Larger values result in finer clusters.

  • leiden_resolution (float) – Resolution of the clustering when method = "leiden". Larger values result in finer clusters.

  • leiden_objective (Literal['modularity', 'cpm', 'er']) – Objective function to use when method = "leiden".

  • walktrap_steps (int) – Number of steps to use when method = "walktrap".

  • seed (int) – Random seed to use for method = "multilevel" or "leiden".

Return type:

ClusterGraphMultilevelResults, ClusterGraphLeidenResults or ClusterGraphWalktrapResults, depending on method.

Returns:

Clustering results. All result objects contain at least membership, an array of cluster assignments for each node in x; the remaining fields depend on the chosen method.

References

https://igraph.org/c/html/latest/igraph-Community.html, for the underlying implementation of each clustering method.

The various cluster_* functions in the scran_graph_cluster C++ library, which wrap the igraph functions.
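
Example

A sketch of graph-based clustering on a random embedding: build the SNN graph, then run multilevel community detection.

import numpy
from scranpy.build_snn_graph import build_snn_graph
from scranpy.cluster_graph import cluster_graph

rng = numpy.random.default_rng(8)
pcs = rng.normal(size=(10, 400))  # 10 dimensions x 400 cells

graph = build_snn_graph(pcs, num_neighbors=10)
res = cluster_graph(graph, method="multilevel", multilevel_resolution=1)
print(res.membership[:10])  # cluster assignment for the first ten cells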

scranpy.cluster_kmeans module

class scranpy.cluster_kmeans.ClusterKmeansResults(clusters, centers, iterations, status)

Bases: object

Results of cluster_kmeans().

centers: ndarray
clusters: ndarray
iterations: int
status: int
scranpy.cluster_kmeans.cluster_kmeans(x, k, init_method='var-part', refine_method='hartigan-wong', var_part_optimize_partition=True, var_part_size_adjustment=1, lloyd_iterations=100, hartigan_wong_iterations=10, hartigan_wong_quick_transfer_iterations=50, hartigan_wong_quit_quick_transfer_failure=False, seed=5489, num_threads=1)

Perform k-means clustering with a variety of different initialization and refinement algorithms.

Parameters:
  • x (ndarray) – Input data matrix where rows are dimensions and columns are observations (i.e., cells).

  • k (int) – Number of clusters.

  • init_method (Literal['var-part', 'kmeans++', 'random']) – Initialization method for defining the initial centroid coordinates. Choices are variance partitioning (var-part), kmeans++ (kmeans++) or random initialization (random).

  • refine_method (Literal['hartigan-wong', 'lloyd']) – Method to use to refine the cluster assignments and centroid coordinates. Choices are Lloyd’s algorithm (lloyd) or the Hartigan-Wong algorithm (hartigan-wong).

  • var_part_optimize_partition (bool) – Whether each partition boundary should be optimized to reduce the sum of squares in the child partitions. Only used if init_method = "var-part".

  • var_part_size_adjustment (float) – Floating-point value between 0 and 1, specifying the adjustment to the cluster size when prioritizing the next cluster to partition. Setting this to 0 will ignore the cluster size while setting this to 1 will generally favor larger clusters. Only used if init_method = "var-part".

  • lloyd_iterations (int) – Maximum number of iterations for the Lloyd algorithm.

  • hartigan_wong_iterations (int) – Maximum number of iterations for the Hartigan-Wong algorithm.

  • hartigan_wong_quick_transfer_iterations (int) – Maximum number of quick transfer iterations for the Hartigan-Wong algorithm.

  • hartigan_wong_quit_quick_transfer_failure (bool) – Whether to quit the Hartigan-Wong algorithm upon convergence failure during quick transfer iterations.

  • seed (int) – Seed to use for random or kmeans++ initialization.

  • num_threads (int) – Number of threads to use.

Return type:

ClusterKmeansResults

Returns:

Results of k-means clustering on the observations.

References

https://ltla.github.io/CppKmeans, which describes the various initialization and refinement algorithms in more detail.
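
Example

A sketch of k-means clustering on a random embedding, assuming rows are dimensions and columns are cells.

import numpy
from scranpy.cluster_kmeans import cluster_kmeans

rng = numpy.random.default_rng(9)
pcs = rng.normal(size=(15, 600))  # 15 dimensions x 600 cells

res = cluster_kmeans(pcs, k=8, init_method="var-part", refine_method="hartigan-wong")
print(res.clusters[:10])  # per-cell cluster labels
print(res.iterations, res.status)  # refinement diagnostics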

scranpy.combine_factors module

scranpy.combine_factors.combine_factors(factors, keep_unused=False)

Combine multiple categorical factors based on the unique combinations of levels from each factor.

Parameters:
  • factors (Sequence) – Sequence containing factors of interest. Each entry corresponds to a factor and should be a sequence of the same length. Corresponding elements across all factors represent the combination of levels for a single observation.

  • keep_unused (bool) – Whether to report unused combinations of levels. If any entry of factors is a Factor object, any unused levels will also be preserved.

Return type:

tuple

Returns:

A tuple containing:

  • Sorted and unique combinations of levels, as a tuple of lists. Each list corresponds to a factor in factors; corresponding elements across the lists define a single combination, i.e., the i-th combination is defined by taking the i-th element of each list.

  • An integer array of length equal to each sequence in factors, specifying the combination for each observation. Each entry is an index i into the lists of the previous tuple.

References

The combine_factors function in the scran_aggregate library, which provides the underlying implementation.
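
Example

A small sketch of combining two factors into their unique level combinations.

from scranpy.combine_factors import combine_factors

clusters = [0, 0, 1, 1, 2]
samples = ["A", "B", "A", "B", "A"]

combinations, index = combine_factors([clusters, samples])
print(combinations)  # parallel lists of cluster and sample levels, one entry per combination
print(index)  # index of each observation's combination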

scranpy.compute_clrm1_factors module

scranpy.compute_clrm1_factors.compute_clrm1_factors(x, num_threads=1)[source]

Compute size factors from an ADT count matrix using the CLRm1 method.

Parameters:
  • x (Any) – A matrix-like object containing ADT count data. Rows correspond to tags and columns correspond to cells.

  • num_threads (int) – Number of threads to use.

Return type:

ndarray

Returns:

Array containing the CLRm1 size factor for each cell. Note that these size factors are not centered and should be passed through, e.g., center_size_factors() before normalization.

References

https://github.com/libscran/clrm1, for a description of the CLRm1 method.
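
Example

A sketch of CLRm1 size factor computation on simulated ADT counts, followed by centering before normalization; a dense NumPy matrix is assumed to be accepted as the matrix-like input.

import numpy
from scranpy.center_size_factors import center_size_factors
from scranpy.compute_clrm1_factors import compute_clrm1_factors

rng = numpy.random.default_rng(10)
adt_counts = rng.poisson(30, size=(40, 250))  # 40 tags x 250 cells

raw = compute_clrm1_factors(adt_counts)
size_factors = center_size_factors(raw)  # CLRm1 factors are not centered by default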

scranpy.correct_mnn module

class scranpy.correct_mnn.CorrectMnnResults(corrected)

Bases: object

Results of correct_mnn().

corrected: ndarray
scranpy.correct_mnn.correct_mnn(x, block, num_neighbors=15, num_steps=1, merge_policy='rss', num_mads=None, robust_iterations=None, robust_trim=None, mass_cap=None, order=None, reference_policy=None, nn_parameters=<knncolle.annoy.AnnoyParameters object>, num_threads=1)

Apply mutual nearest neighbor (MNN) correction to remove batch effects from a low-dimensional matrix.

Parameters:
  • x (ndarray) – Matrix of coordinates where rows are dimensions and columns are cells, typically generated by run_pca().

  • block (Sequence) – Factor specifying the block of origin (e.g., batch, sample) for each cell. Length should equal the number of columns in x.

  • num_neighbors (int) – Number of neighbors to use when identifying MNN pairs.

  • num_mads (Optional[int]) – Number of median absolute deviations to use for removing outliers in the center-of-mass calculations.

  • robust_iterations (Optional[int]) – Number of iterations for robust calculation of the center of mass.

  • robust_trim (Optional[float]) – Trimming proportion for robust calculation of the center of mass. This should be a value in [0, 1).

  • mass_cap (Optional[int]) – Cap on the number of observations to use for center-of-mass calculations on the reference dataset. A value of 100,000 may be appropriate for speeding up correction of very large datasets. If None, no cap is used.

  • order (Optional[Sequence]) – Sequence containing the unique levels of block in the desired merge order. If None, a suitable merge order is automatically determined.

  • merge_policy (Literal['rss', 'size', 'variance', 'input']) – Policy to use to choose the first reference batch. This can be the largest batch (size), the most variable batch (variance), the batch with the largest residual sum of squares (rss), or the first batch in the supplied input (input). Only used for automatic merges, i.e., when order = None.

  • nn_parameters (Parameters) – The nearest-neighbor algorithm to use.

  • num_threads (int) – Number of threads to use.

Return type:

CorrectMnnResults

Returns:

The results of the MNN correction, including a matrix of the corrected coordinates.

References

https://libscran.github.io/mnncorrect, which describes the MNN correction algorithm in more detail.
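
Example

A minimal sketch using simulated coordinates for two batches, assuming a dense NumPy matrix is an acceptable input; the shift applied to the second batch is a crude stand-in for a batch effect.

import numpy
from scranpy.correct_mnn import correct_mnn

# Hypothetical low-dimensional matrix: 10 dimensions (rows) x 200 cells (columns).
rng = numpy.random.default_rng(42)
pcs = rng.normal(size=(10, 200))
pcs[:, 100:] += 2  # crude batch effect in the second batch
block = ["batch1"] * 100 + ["batch2"] * 100

res = correct_mnn(pcs, block, num_neighbors=15)
corrected = res.corrected  # same shape as 'pcs', with the batch effect removed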

scranpy.crispr_quality_control module

class scranpy.crispr_quality_control.ComputeCrisprQcMetricsResults(sum, detected, max_value, max_index)

Bases: object

Results of compute_crispr_qc_metrics().

__annotate_func__()
__annotations_cache__ = {'detected': <class 'numpy.ndarray'>, 'max_index': <class 'numpy.ndarray'>, 'max_value': <class 'numpy.ndarray'>, 'sum': <class 'numpy.ndarray'>}
__dataclass_fields__ = {'detected': Field(name='detected',type=<class 'numpy.ndarray'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'max_index': Field(name='max_index',type=<class 'numpy.ndarray'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'max_value': Field(name='max_value',type=<class 'numpy.ndarray'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'sum': Field(name='sum',type=<class 'numpy.ndarray'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD)}
__dataclass_params__ = _DataclassParams(init=True,repr=True,eq=True,order=False,unsafe_hash=False,frozen=False,match_args=True,kw_only=False,slots=False,weakref_slot=False)
__eq__(other)

Return self==value.

__firstlineno__ = 11
__hash__ = None
__match_args__ = ('sum', 'detected', 'max_value', 'max_index')
__replace__(**changes)
__repr__()

Return repr(self).

__static_attributes__ = ()
detected: ndarray
max_index: ndarray
max_value: ndarray
sum: ndarray
to_biocframe()

Convert the results into a BiocFrame.

Returns:

A BiocFrame where each row corresponds to a cell and each column is one of the metrics.

class scranpy.crispr_quality_control.SuggestCrisprQcThresholdsResults(max_value, block)

Bases: object

Results of suggest_crispr_qc_thresholds().

__annotate_func__()
__annotations_cache__ = {'block': list | None, 'max_value': biocutils.NamedList.NamedList | float}
__dataclass_fields__ = {'block': Field(name='block',type=list | None,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'max_value': Field(name='max_value',type=biocutils.NamedList.NamedList | float,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD)}
__dataclass_params__ = _DataclassParams(init=True,repr=True,eq=True,order=False,unsafe_hash=False,frozen=False,match_args=True,kw_only=False,slots=False,weakref_slot=False)
__eq__(other)

Return self==value.

__firstlineno__ = 62
__hash__ = None
__match_args__ = ('max_value', 'block')
__replace__(**changes)
__repr__()

Return repr(self).

__static_attributes__ = ()
block: Optional[list]
max_value: Union[NamedList, float]
scranpy.crispr_quality_control.compute_crispr_qc_metrics(x, num_threads=1)

Compute quality control metrics from CRISPR count data.

Parameters:
  • x (Any) – A matrix-like object containing CRISPR counts.

  • num_threads (int) – Number of threads to use.

Returns:

QC metrics computed from the count matrix for each cell.

References

The compute_crispr_qc_metrics function in the scran_qc C++ library, which describes the rationale behind these QC metrics.

scranpy.crispr_quality_control.filter_crispr_qc_metrics(thresholds, metrics, block=None)

Filter for high-quality cells based on CRISPR-derived QC metrics.

Parameters:
  • thresholds (SuggestCrisprQcThresholdsResults) – Filter thresholds, typically generated by suggest_crispr_qc_thresholds().

  • metrics (ComputeCrisprQcMetricsResults) – CRISPR-derived QC metrics from compute_crispr_qc_metrics().

  • block (Optional[Sequence]) – Blocking factor specifying the block of origin (e.g., batch, sample) for each cell in metrics. This should be the same as that used in suggest_crispr_qc_thresholds(), if any. Alternatively None, if all cells are from the same block.

Return type:

ndarray

Returns:

A NumPy vector of length equal to the number of cells in metrics, containing truthy values for putative high-quality cells.

scranpy.crispr_quality_control.suggest_crispr_qc_thresholds(metrics, block=None, num_mads=3.0)

Suggest filter thresholds for the CRISPR-derived QC metrics, typically generated from compute_crispr_qc_metrics().

Parameters:
  • metrics (ComputeCrisprQcMetricsResults) – CRISPR-derived QC metrics from compute_crispr_qc_metrics().

  • block (Optional[Sequence]) – Blocking factor specifying the block of origin (e.g., batch, sample) for each cell in metrics. If supplied, a separate threshold is computed from the cells in each block. Alternatively None, if all cells are from the same block.

  • num_mads (float) – Number of MADs from the median to define the threshold for outliers in each QC metric.

Return type:

SuggestCrisprQcThresholdsResults

Returns:

Suggested filters on the relevant QC metrics.

References

The compute_crispr_qc_filters and compute_crispr_qc_filters_blocked functions in the scran_qc C++ library, which describe the rationale behind the suggested filters.
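
Example

A minimal sketch of the CRISPR QC workflow (compute metrics, suggest thresholds, filter), assuming a dense NumPy array of counts is an acceptable matrix-like input; the guide counts are simulated for illustration.

import numpy
from scranpy.crispr_quality_control import (
    compute_crispr_qc_metrics,
    suggest_crispr_qc_thresholds,
    filter_crispr_qc_metrics,
)

# Hypothetical CRISPR guide count matrix: 50 guides (rows) x 500 cells (columns).
rng = numpy.random.default_rng(1)
counts = rng.poisson(2, size=(50, 500))

metrics = compute_crispr_qc_metrics(counts, num_threads=2)
thresholds = suggest_crispr_qc_thresholds(metrics, num_mads=3.0)
keep = filter_crispr_qc_metrics(thresholds, metrics)
filtered = counts[:, keep.astype(bool)]  # retain putative high-quality cells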

scranpy.fit_variance_trend module

scranpy.fit_variance_trend.fit_variance_trend(mean, variance, mean_filter=True, min_mean=0.1, transform=True, span=0.3, use_min_width=False, min_width=1, min_window_count=200, num_threads=1)[source]

Fit a trend to the per-gene variances with respect to the mean.

Parameters:
  • mean (ndarray) – Array containing the mean (log-)expression for each gene.

  • variance (ndarray) – Array containing the variance in the (log-)expression for each gene. This should have length equal to mean.

  • mean_filter (bool) – Whether to filter on the means before trend fitting.

  • min_mean (float) – The minimum mean of genes to use in trend fitting. Only used if mean_filter = True.

  • transform (bool) – Whether a quarter-root transformation should be applied before trend fitting.

  • span (float) – Span of the LOWESS smoother. Ignored if use_min_width = True.

  • use_min_width (bool) – Whether a minimum width constraint should be applied to the LOWESS smoother. This is useful to avoid overfitting in high-density intervals.

  • min_width (float) – Minimum width of the window to use when use_min_width = True.

  • min_window_count (int) – Minimum number of observations in each window. Only used if use_min_width = True.

  • num_threads (int) – Number of threads to use.

Return type:

Tuple

Returns:

A tuple of two arrays. The first array contains the fitted value of the trend for each gene while the second array contains the residual.

References

The fit_variance_trend function in the scran_variances C++ library, for the underlying implementation.
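
Example

A minimal sketch with simulated per-gene means and variances; the relationship between them is arbitrary and only serves to illustrate the documented inputs and outputs.

import numpy
from scranpy.fit_variance_trend import fit_variance_trend

# Hypothetical per-gene statistics for 1000 genes.
rng = numpy.random.default_rng(2)
means = rng.uniform(0, 5, size=1000)
variances = 0.5 * means + rng.uniform(0, 0.5, size=1000)

fitted, residuals = fit_variance_trend(means, variances, span=0.3)
# 'fitted' holds the value of the trend at each gene's mean, while
# 'residuals' holds the difference between each observed variance and the trend.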

scranpy.lib_scranpy module

scranpy.model_gene_variances module

class scranpy.model_gene_variances.ModelGeneVariancesResults(mean, variance, fitted, residual, per_block)

Bases: object

Results of model_gene_variances().

__annotate_func__()
__annotations_cache__ = {'fitted': <class 'numpy.ndarray'>, 'mean': <class 'numpy.ndarray'>, 'per_block': biocutils.NamedList.NamedList | None, 'residual': <class 'numpy.ndarray'>, 'variance': <class 'numpy.ndarray'>}
__dataclass_fields__ = {'fitted': Field(name='fitted',type=<class 'numpy.ndarray'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'mean': Field(name='mean',type=<class 'numpy.ndarray'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'per_block': Field(name='per_block',type=biocutils.NamedList.NamedList | None,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'residual': Field(name='residual',type=<class 'numpy.ndarray'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'variance': Field(name='variance',type=<class 'numpy.ndarray'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD)}
__dataclass_params__ = _DataclassParams(init=True,repr=True,eq=True,order=False,unsafe_hash=False,frozen=False,match_args=True,kw_only=False,slots=False,weakref_slot=False)
__eq__(other)

Return self==value.

__firstlineno__ = 11
__hash__ = None
__match_args__ = ('mean', 'variance', 'fitted', 'residual', 'per_block')
__replace__(**changes)
__repr__()

Return repr(self).

__static_attributes__ = ()
fitted: ndarray
mean: ndarray
per_block: Optional[NamedList]
residual: ndarray
to_biocframe(include_per_block=False)

Convert the results into a BiocFrame.

Parameters:

include_per_block (bool) – Whether to include the per-block results as a nested dataframe.

Returns:

A BiocFrame where each row corresponds to a gene and each column corresponds to a statistic.

variance: ndarray
scranpy.model_gene_variances.model_gene_variances(x, block=None, block_average_policy='mean', block_weight_policy='variable', variable_block_weight=(0, 1000), block_quantile=0.8, mean_filter=True, min_mean=0.1, transform=True, span=0.3, use_min_width=False, min_width=1, min_window_count=200, num_threads=1)

Compute the variance in (log-)expression values for each gene, and model the trend in the variances with respect to the mean.

Parameters:
  • x (Any) – A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. It is typically expected to contain log-expression values, e.g., from normalize_counts().

  • block (Optional[Sequence]) – Array of length equal to the number of columns of x, containing the block of origin (e.g., batch, sample) for each cell. Alternatively None, if all cells are from the same block.

  • block_average_policy (Literal['mean', 'quantile']) – Policy for averaging statistics across blocks. This can either use a (weighted) mean or a quantile.

  • block_weight_policy (Literal['variable', 'equal', 'none']) – Policy for weighting different blocks when computing the weighted mean across blocks for each statistic. Only used if block is provided and block_average_policy == "mean".

  • variable_block_weight (Tuple) – Parameters for variable block weighting. This should be a tuple of length 2 where the first and second values are used as the lower and upper bounds, respectively, for the variable weight calculation. Only used if block is provided, block_average_policy == "mean", and block_weight_policy = "variable".

  • block_quantile (float) – Probability for computing the quantile across blocks, e.g., 0.5 yields the median of the per-block statistics. Only used if block is provided and block_average_policy == "quantile".

  • mean_filter (bool) – Whether to filter on the means before trend fitting.

  • min_mean (float) – The minimum mean of genes to use in trend fitting. Only used if mean_filter = True.

  • transform (bool) – Whether a quarter-root transformation should be applied before trend fitting.

  • span (float) – Span of the LOWESS smoother for trend fitting, see fit_variance_trend().

  • use_min_width (bool) – Whether a minimum width constraint should be applied during trend fitting, see fit_variance_trend().

  • min_width (float) – Minimum width of the smoothing window for trend fitting, see fit_variance_trend().

  • min_window_count (int) – Minimum number of observations in each smoothing window for trend fitting, see fit_variance_trend().

  • num_threads (int) – Number of threads to use.

Return type:

ModelGeneVariancesResults

Returns:

The results of the variance modelling for each gene.

References

The model_gene_variances function in the scran_variances C++ library, for the underlying implementation.
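
Example

A minimal sketch using a simulated log-expression matrix, assuming a dense NumPy array is an acceptable matrix-like input. Taking the top 500 genes by residual is only an illustration of how the results might be used for feature selection.

import numpy
from scranpy.model_gene_variances import model_gene_variances

# Hypothetical log-expression matrix: 1000 genes (rows) x 200 cells (columns).
rng = numpy.random.default_rng(3)
logexp = numpy.log1p(rng.poisson(2, size=(1000, 200)))

res = model_gene_variances(logexp, num_threads=2)
df = res.to_biocframe()  # one row per gene: mean, variance, fitted, residual
top_hvgs = numpy.argsort(-res.residual)[:500]  # indices of the most variable genes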

scranpy.normalize_counts module

scranpy.normalize_counts.normalize_counts(x, size_factors, log=True, pseudo_count=1, log_base=2, preserve_sparsity=False, delayed=True)

Create a matrix of (log-transformed) normalized expression values. The normalization removes uninteresting per-cell differences due to sequencing efficiency and library size. The log-transformation ensures that any differences represent log-fold changes in downstream analysis steps; such relative changes in expression are more relevant than absolute changes.

Parameters:
  • x (Any) –

    Matrix-like object containing cells in columns and features in rows, typically with count data.

    Alternatively, an InitializedMatrix representing a count matrix, typically created by initialize.

  • size_factors (Sequence) – Size factor for each cell. This should have length equal to the number of columns in x.

  • log (bool) – Whether log-transformation should be performed.

  • pseudo_count (float) – Positive pseudo-count to add before log-transformation. Ignored if log = False.

  • log_base (float) – Base of the log-transformation, ignored if log = False.

  • preserve_sparsity (bool) – Whether to preserve sparsity when pseudo_count != 1. If True, users should manually add log(pseudo_count, log_base) to the returned matrix to obtain the desired log-transformed expression values. Ignored if log = False or pseudo_count = 1.

  • delayed (bool) – Whether operations on a matrix-like x should be delayed. This improves memory efficiency at the cost of some speed in downstream operations.

Return type:

Union[DelayedArray, InitializedMatrix]

Returns:

If x is a matrix-like object and delayed = True, a DelayedArray is returned containing the (log-transformed) normalized expression matrix. If delayed = False, the type of the (log-)normalized matrix will depend on the operations applied to x.

If x is an InitializedMatrix, a new InitializedMatrix is returned containing the normalized expression matrix.

References

The normalize_counts function in the scran_norm C++ library, for the rationale behind normalization and log-transformation.
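
Example

A minimal sketch using library size factors computed directly from a simulated count matrix, assuming a dense NumPy array is an acceptable matrix-like input; the manual centering step is only for illustration.

import numpy
from scranpy.normalize_counts import normalize_counts

# Hypothetical count matrix: 1000 genes (rows) x 200 cells (columns).
rng = numpy.random.default_rng(4)
counts = rng.poisson(1, size=(1000, 200))

# Library size factors, centered to a mean of 1 for illustration.
size_factors = counts.sum(axis=0).astype(numpy.float64)
size_factors /= size_factors.mean()

lognorm = normalize_counts(counts, size_factors, log=True, pseudo_count=1)
# With delayed=True (the default), 'lognorm' is a DelayedArray of
# log2-transformed normalized expression values.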

scranpy.rna_quality_control module

class scranpy.rna_quality_control.ComputeRnaQcMetricsResults(sum, detected, subset_proportion)

Bases: object

Results of compute_rna_qc_metrics().

__annotate_func__()
__annotations_cache__ = {'detected': <class 'numpy.ndarray'>, 'subset_proportion': <class 'biocutils.NamedList.NamedList'>, 'sum': <class 'numpy.ndarray'>}
__dataclass_fields__ = {'detected': Field(name='detected',type=<class 'numpy.ndarray'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'subset_proportion': Field(name='subset_proportion',type=<class 'biocutils.NamedList.NamedList'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'sum': Field(name='sum',type=<class 'numpy.ndarray'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD)}
__dataclass_params__ = _DataclassParams(init=True,repr=True,eq=True,order=False,unsafe_hash=False,frozen=False,match_args=True,kw_only=False,slots=False,weakref_slot=False)
__eq__(other)

Return self==value.

__firstlineno__ = 13
__hash__ = None
__match_args__ = ('sum', 'detected', 'subset_proportion')
__replace__(**changes)
__repr__()

Return repr(self).

__static_attributes__ = ()
detected: ndarray
subset_proportion: NamedList
sum: ndarray
to_biocframe(flatten=True)

Convert the results into a BiocFrame.

Parameters:

flatten (bool) – Whether to flatten the subset proportions into separate columns. If True, each entry of subset_proportion is represented by a subset_proportion_<NAME> column, where <NAME> is the name of each entry (if available) or its index (otherwise). If False, subset_proportion is represented by a nested BiocFrame.

Returns:

A BiocFrame where each row corresponds to a cell and each column is one of the metrics.

class scranpy.rna_quality_control.SuggestRnaQcThresholdsResults(sum, detected, subset_proportion, block)

Bases: object

Results of suggest_rna_qc_thresholds().

__annotate_func__()
__annotations_cache__ = {'block': list | None, 'detected': biocutils.NamedList.NamedList | float, 'subset_proportion': <class 'biocutils.NamedList.NamedList'>, 'sum': biocutils.NamedList.NamedList | float}
__dataclass_fields__ = {'block': Field(name='block',type=list | None,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'detected': Field(name='detected',type=biocutils.NamedList.NamedList | float,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'subset_proportion': Field(name='subset_proportion',type=<class 'biocutils.NamedList.NamedList'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'sum': Field(name='sum',type=biocutils.NamedList.NamedList | float,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD)}
__dataclass_params__ = _DataclassParams(init=True,repr=True,eq=True,order=False,unsafe_hash=False,frozen=False,match_args=True,kw_only=False,slots=False,weakref_slot=False)
__eq__(other)

Return self==value.

__firstlineno__ = 103
__hash__ = None
__match_args__ = ('sum', 'detected', 'subset_proportion', 'block')
__replace__(**changes)
__repr__()

Return repr(self).

__static_attributes__ = ()
block: Optional[list]
detected: Union[NamedList, float]
subset_proportion: NamedList
sum: Union[NamedList, float]
scranpy.rna_quality_control.compute_rna_qc_metrics(x, subsets, num_threads=1)

Compute quality control metrics from RNA count data.

Parameters:
  • x (Any) – A matrix-like object containing RNA counts.

  • subsets (Union[Mapping, Sequence]) –

    Subsets of genes corresponding to “control” features like mitochondrial genes. This may be either:

    • A list of arrays. Each array corresponds to a gene subset and can contain either boolean or integer values. For booleans, the array should be of length equal to the number of rows, and values should be truthy for rows that belong to the subset. For integers, each element of the array is treated as the row index of a gene in the subset.

    • A dictionary where keys are the names of each gene subset and the values are arrays as described above.

    • A NamedList where each element is an array as described above, possibly with names.

  • num_threads (int) – Number of threads to use.

Returns:

QC metrics computed from the count matrix for each cell.

References

The compute_rna_qc_metrics function in the scran_qc C++ library, which describes the rationale behind these QC metrics.

scranpy.rna_quality_control.filter_rna_qc_metrics(thresholds, metrics, block=None)

Filter for high-quality cells based on RNA-derived QC metrics.

Parameters:
  • thresholds (SuggestRnaQcThresholdsResults) – Filter thresholds, typically generated by suggest_rna_qc_thresholds().

  • metrics (ComputeRnaQcMetricsResults) – RNA-derived QC metrics from compute_rna_qc_metrics().

  • block (Optional[Sequence]) – Blocking factor specifying the block of origin (e.g., batch, sample) for each cell in metrics. This should be the same as that used in suggest_rna_qc_thresholds(), if any. Alternatively None, if all cells are from the same block.

Return type:

ndarray

Returns:

A NumPy vector of length equal to the number of cells in metrics, containing truthy values for putative high-quality cells.

scranpy.rna_quality_control.suggest_rna_qc_thresholds(metrics, block=None, num_mads=3.0)

Suggest filter thresholds for the RNA-derived QC metrics, typically generated from compute_rna_qc_metrics().

Parameters:
  • metrics (ComputeRnaQcMetricsResults) – RNA-derived QC metrics from compute_rna_qc_metrics().

  • block (Optional[Sequence]) – Blocking factor specifying the block of origin (e.g., batch, sample) for each cell in metrics. If supplied, a separate threshold is computed from the cells in each block. Alternatively None, if all cells are from the same block.

  • num_mads (float) – Number of MADs from the median to define the threshold for outliers in each QC metric.

Return type:

SuggestRnaQcThresholdsResults

Returns:

Suggested filters on the relevant QC metrics.

References

The compute_rna_qc_filters and compute_rna_qc_filters_blocked functions in the scran_qc C++ library, which describe the rationale behind the suggested filters.
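
Example

A minimal sketch of the RNA QC workflow, assuming a dense NumPy array of counts is an acceptable matrix-like input; the first 50 rows are arbitrarily treated as the mitochondrial subset.

import numpy
from scranpy.rna_quality_control import (
    compute_rna_qc_metrics,
    suggest_rna_qc_thresholds,
    filter_rna_qc_metrics,
)

# Hypothetical RNA count matrix: 1000 genes (rows) x 500 cells (columns).
rng = numpy.random.default_rng(5)
counts = rng.poisson(1, size=(1000, 500))
subsets = {"mito": numpy.arange(50)}  # hypothetical mitochondrial gene indices

metrics = compute_rna_qc_metrics(counts, subsets, num_threads=2)
thresholds = suggest_rna_qc_thresholds(metrics, num_mads=3.0)
keep = filter_rna_qc_metrics(thresholds, metrics)
filtered = counts[:, keep.astype(bool)]  # retain putative high-quality cells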

scranpy.run_all_neighbor_steps module

class scranpy.run_all_neighbor_steps.RunAllNeighborStepsResults(run_tsne, run_umap, build_snn_graph, cluster_graph)

Bases: object

Results of run_all_neighbor_steps().

__annotate_func__()
__annotations_cache__ = {'build_snn_graph': scranpy.build_snn_graph.GraphComponents | None, 'cluster_graph': scranpy.cluster_graph.ClusterGraphResults | None, 'run_tsne': numpy.ndarray | None, 'run_umap': numpy.ndarray | None}
__dataclass_fields__ = {'build_snn_graph': Field(name='build_snn_graph',type=scranpy.build_snn_graph.GraphComponents | None,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'cluster_graph': Field(name='cluster_graph',type=scranpy.cluster_graph.ClusterGraphResults | None,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'run_tsne': Field(name='run_tsne',type=numpy.ndarray | None,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'run_umap': Field(name='run_umap',type=numpy.ndarray | None,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD)}
__dataclass_params__ = _DataclassParams(init=True,repr=True,eq=True,order=False,unsafe_hash=False,frozen=False,match_args=True,kw_only=False,slots=False,weakref_slot=False)
__eq__(other)

Return self==value.

__firstlineno__ = 25
__hash__ = None
__match_args__ = ('run_tsne', 'run_umap', 'build_snn_graph', 'cluster_graph')
__replace__(**changes)
__repr__()

Return repr(self).

__static_attributes__ = ()
build_snn_graph: Optional[GraphComponents]
cluster_graph: Optional[ClusterGraphResults]
run_tsne: Optional[ndarray]
run_umap: Optional[ndarray]
scranpy.run_all_neighbor_steps.run_all_neighbor_steps(x, run_umap_options={}, run_tsne_options={}, build_snn_graph_options={}, cluster_graph_options={}, nn_parameters=<knncolle.annoy.AnnoyParameters object>, collapse_search=False, num_threads=3)

Run all steps that depend on the nearest neighbor search - namely, run_tsne(), run_umap(), build_snn_graph(), and cluster_graph(). This builds the index once and re-uses it for the neighbor search in each step; the various steps are also run in parallel to save more time.

Parameters:
  • x (Union[Index, ndarray]) –

    Matrix of principal components where rows are cells and columns are PCs, typically produced by run_pca().

    Alternatively, a Index instance containing a prebuilt search index for the cells.

  • run_umap_options (Optional[dict]) – Optional arguments for run_umap(). If None, UMAP is not performed.

  • run_tsne_options (Optional[dict]) – Optional arguments for run_tsne(). If None, t-SNE is not performed.

  • build_snn_graph_options (Optional[dict]) – Optional arguments for build_snn_graph(). Ignored if cluster_graph_options = None.

  • cluster_graph_options (dict) – Optional arguments for cluster_graph(). If None, graph-based clustering is not performed.

  • nn_parameters (Parameters) – Parameters for the nearest-neighbor search.

  • collapse_search (bool) – Whether to collapse the nearest-neighbor search for each step into a single search. Steps that need fewer neighbors will use a subset of the neighbors from the collapsed search. This is faster but may not give the same results as separate searches for some approximate search algorithms.

  • num_threads (int) – Number of threads to use for the parallel execution of UMAP, t-SNE and SNN graph construction. This overrides the specified number of threads in the various *_options arguments.

Return type:

RunAllNeighborStepsResults

Returns:

The results of each step. These should be equivalent to the result of running each step in serial.
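
Example

A minimal sketch on a simulated PC matrix. The options passed to each step are arbitrary; any other arguments accepted by run_tsne(), run_umap(), build_snn_graph() or cluster_graph() could be supplied in the same way.

import numpy
from scranpy.run_all_neighbor_steps import run_all_neighbor_steps

# Hypothetical PC matrix: 200 cells (rows) x 20 PCs (columns).
rng = numpy.random.default_rng(6)
pcs = rng.normal(size=(200, 20))

res = run_all_neighbor_steps(
    pcs,
    run_tsne_options={"perplexity": 20},
    run_umap_options={"num_neighbors": 15},
    num_threads=3,
)
tsne = res.run_tsne           # t-SNE coordinates, or None if disabled
clusters = res.cluster_graph  # graph-based clustering results, or None if disabled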

scranpy.run_pca module

class scranpy.run_pca.RunPcaResults(components, rotation, variance_explained, total_variance, center, scale, block)

Bases: object

Results of run_pca().

__annotate_func__()
__annotations_cache__ = {'block': list | None, 'center': <class 'numpy.ndarray'>, 'components': <class 'numpy.ndarray'>, 'rotation': <class 'numpy.ndarray'>, 'scale': numpy.ndarray | None, 'total_variance': <class 'float'>, 'variance_explained': <class 'numpy.ndarray'>}
__dataclass_fields__ = {'block': Field(name='block',type=list | None,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'center': Field(name='center',type=<class 'numpy.ndarray'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'components': Field(name='components',type=<class 'numpy.ndarray'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'rotation': Field(name='rotation',type=<class 'numpy.ndarray'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'scale': Field(name='scale',type=numpy.ndarray | None,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'total_variance': Field(name='total_variance',type=<class 'float'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'variance_explained': Field(name='variance_explained',type=<class 'numpy.ndarray'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD)}
__dataclass_params__ = _DataclassParams(init=True,repr=True,eq=True,order=False,unsafe_hash=False,frozen=False,match_args=True,kw_only=False,slots=False,weakref_slot=False)
__eq__(other)

Return self==value.

__firstlineno__ = 11
__hash__ = None
__match_args__ = ('components', 'rotation', 'variance_explained', 'total_variance', 'center', 'scale', 'block')
__replace__(**changes)
__repr__()

Return repr(self).

__static_attributes__ = ()
block: Optional[list]
center: ndarray
components: ndarray
rotation: ndarray
scale: Optional[ndarray]
total_variance: float
variance_explained: ndarray
scranpy.run_pca.run_pca(x, number=25, scale=False, block=None, block_weight_policy='variable', variable_block_weight=(0, 1000), subset=None, components_from_residuals=False, extra_work=7, iterations=1000, seed=5489, realized=True, num_threads=1)

Run a PCA on the gene-by-cell log-expression matrix to obtain a low-dimensional representation for downstream analyses.

Parameters:
  • x (Any) – A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. Typically, the matrix is expected to contain log-expression values, and the rows should be filtered to relevant (e.g., highly variable) genes.

  • number (int) – Number of PCs to retain.

  • scale (bool) – Whether to scale all genes to have the same variance.

  • block (Optional[Sequence]) – Array of length equal to the number of columns of x, containing the block of origin (e.g., batch, sample) for each cell. Alternatively None, if all cells are from the same block.

  • block_weight_policy (Literal['variable', 'equal', 'none']) – Policy to use for weighting different blocks when computing the average for each statistic. Only used if block is provided.

  • variable_block_weight (Tuple) – Parameters for variable block weighting. This should be a tuple of length 2 where the first and second values are used as the lower and upper bounds, respectively, for the variable weight calculation. Only used if block is provided and block_weight_policy = "variable".

  • components_from_residuals (bool) – Whether to compute the PC scores from the residuals in the presence of a blocking factor. If False, the residuals are only used to compute the rotation matrix, and the original expression values of the cells are projected onto this new space. Only used if block is provided.

  • extra_work (int) – Number of extra dimensions for the IRLBA workspace.

  • iterations (int) – Maximum number of restart iterations for IRLBA.

  • seed (int) – Seed for the initial random vector in IRLBA.

  • realized (bool) – Whether to realize x into an optimal memory layout for IRLBA. This speeds up computation at the cost of increased memory usage.

  • num_threads (int) – Number of threads to use.

Return type:

RunPcaResults

Returns:

The results of the PCA.

References

https://libscran.github.io/scran_pca, which describes the approach in more detail. In particular, the documentation for the blocked_pca function explains the blocking strategy.
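
Example

A minimal sketch on a simulated log-expression matrix, assuming a dense NumPy array is an acceptable matrix-like input; in practice the rows would already be filtered to highly variable genes.

import numpy
from scranpy.run_pca import run_pca

# Hypothetical log-expression matrix: 500 genes (rows) x 300 cells (columns).
rng = numpy.random.default_rng(7)
logexp = numpy.log1p(rng.poisson(2, size=(500, 300)))

res = run_pca(logexp, number=25)
pcs = res.components  # matrix of PC scores for the cells
explained = res.variance_explained / res.total_variance  # proportion of variance per PC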

scranpy.run_tsne module

scranpy.run_tsne.run_tsne(x, perplexity=30, num_neighbors=None, theta=1, early_exaggeration_iterations=250, exaggeration_factor=12, momentum_switch_iterations=250, start_momentum=0.5, final_momentum=0.8, eta=200, max_depth=7, leaf_approximation=False, max_iterations=500, seed=42, num_threads=1, nn_parameters=<knncolle.annoy.AnnoyParameters object>)

Compute t-SNE coordinates to visualize similarities between cells.

Parameters:
  • x (Union[ndarray, FindKnnResults, Index]) –

    Numeric matrix where rows are dimensions and columns are cells, typically containing a low-dimensional representation from, e.g., run_pca().

    Alternatively, a FindKnnResults object containing existing neighbor search results. The number of neighbors should be the same as num_neighbors, otherwise a warning is raised.

    Alternatively, a Index object.

  • perplexity (float) – Perplexity to use in the t-SNE algorithm. Larger values cause the embedding to focus on global structure.

  • num_neighbors (Optional[int]) – Number of neighbors in the nearest-neighbor graph. Typically derived from perplexity using tsne_perplexity_to_neighbors().

  • theta (float) – Approximation level for the Barnes-Hut calculation of repulsive forces. Lower values increase accuracy at the cost of computational time.

  • early_exaggeration_iterations (int) – Number of iterations of the early exaggeration phase, where the conditional probabilities are multiplied by exaggeration_factor. In this phase, the empty space between clusters is increased so that clusters can easily relocate to find a good global organization. Larger values improve convergence within this phase at the cost of reducing the remaining iterations in max_iterations.

  • exaggeration_factor (float) – Exaggeration factor to scale the probabilities during the early exaggeration phase (see early_exaggeration_iterations). Larger values increase the attraction between nearest neighbors to favor local structure during this phase.

  • momentum_switch_iterations (int) – Number of iterations to perform before switching from the starting momentum (start_momentum) to the final momentum (final_momentum). Greater momentum can improve convergence by increasing the step size and smoothing over local oscillations, at the risk of potentially skipping over relevant minima.

  • start_momentum (float) – Starting momentum in [0, 1) to be used in the iterations before the momentum switch at momentum_switch_iterations. This is usually lower than final_momentum to avoid skipping over suitable local minima.

  • final_momentum (float) – Final momentum in [0, 1) to be used in the iterations after the momentum switch at momentum_switch_iterations. This is usually higher than start_momentum to accelerate convergence to the local minima once the observations are moderately well-organized.

  • eta (float) – The learning rate, used to scale the updates to the coordinates at each iteration. Larger values can speed up convergence at the cost of potentially skipping over local minima.

  • max_depth (int) – Maximum depth of the Barnes-Hut quadtree. If neighboring observations cannot be separated before the maximum depth is reached, they will be assigned to the same leaf node. This effectively approximates each observation’s coordinates with the center of mass of its leaf node. Smaller values (7-10) improve speed at the cost of accuracy.

  • leaf_approximation (bool) – Whether to use the “leaf approximation” approach, which sacrifices some accuracy for greater speed. This replaces an observation with the center of mass of its leaf node when computing the repulsive forces to all other observations. Only effective when max_depth is small enough for multiple cells to be assigned to the same leaf node of the quadtree.

  • max_iterations (int) – Maximum number of iterations to perform.

  • seed (int) – Random seed to use for generating the initial coordinates.

  • num_threads (int) – Number of threads to use.

  • nn_parameters (Parameters) – The algorithm to use for the nearest-neighbor search. Only used if x is not a pre-built nearest-neighbor search index or a list of existing nearest-neighbor search results.

Return type:

ndarray

Returns:

Array containing the coordinates of each cell in a 2-dimensional embedding. Each row corresponds to a dimension and each column represents a cell.

References

https://libscran.github.io/qdtsne, for some more details on the approximations.

scranpy.run_tsne.tsne_perplexity_to_neighbors(perplexity)

Determine the number of nearest neighbors required to support a given perplexity in the t-SNE algorithm.

Parameters:

perplexity (float) – Perplexity to use in run_tsne().

Return type:

int

Returns:

The corresponding number of nearest neighbors.
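
Example

A minimal sketch that derives the neighbor count from the perplexity and then computes the embedding on a simulated low-dimensional matrix; all values are arbitrary.

import numpy
from scranpy.run_tsne import run_tsne, tsne_perplexity_to_neighbors

# Hypothetical low-dimensional matrix: 20 dimensions (rows) x 300 cells (columns).
rng = numpy.random.default_rng(8)
pcs = rng.normal(size=(20, 300))

k = tsne_perplexity_to_neighbors(30)
coords = run_tsne(pcs, perplexity=30, num_neighbors=k)
# 'coords' holds the t-SNE embedding, with one column per cell.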

scranpy.run_umap module

scranpy.run_umap.run_umap(x, num_dim=2, parallel_optimization=False, local_connectivity=1, bandwidth=1, mix_ratio=1, spread=1, min_dist=0.1, a=None, b=None, repulsion_strength=1, initialize_method='spectral', initial_coordinates=None, initialize_random_on_spectral_fail=True, initialize_spectral_scale=10, initialize_spectral_jitter=False, initialize_spectral_jitter_sd=0.0001, initialize_random_scale=10, initialize_seed=9876543210, num_epochs=None, learning_rate=1, negative_sample_rate=5, num_neighbors=15, optimize_seed=1234567890, num_threads=1, nn_parameters=<knncolle.annoy.AnnoyParameters object>)

Compute UMAP coordinates to visualize similarities between cells.

Parameters:
  • x (Union[ndarray, FindKnnResults, Index]) –

    Numeric matrix where rows are dimensions and columns are cells, typically containing a low-dimensional representation from, e.g., run_pca().

    Alternatively, a FindKnnResults object containing existing neighbor search results. The number of neighbors should be the same as num_neighbors, otherwise a warning is raised.

    Alternatively, a Index object.

  • num_dim (int) – Number of dimensions in the UMAP embedding.

  • local_connectivity (float) – Number of nearest neighbors that are assumed to be always connected, with maximum membership confidence. Larger values increase the connectivity of the embedding and reduce the focus on local structure. This may be a fractional number of neighbors, in which case interpolation is performed when computing the membership confidence.

  • bandwidth (float) – Effective bandwidth of the kernel when converting the distance to a neighbor into a fuzzy set membership confidence. Larger values reduce the decay in confidence with respect to distance, increasing connectivity and favoring global structure.

  • mix_ratio (float) – Number between 0 and 1 specifying the mixing ratio when combining fuzzy sets. A mixing ratio of 1 will take the union of confidences, a ratio of 0 will take the intersection, and intermediate values will interpolate between them. Larger values favor connectivity and more global structure.

  • spread (float) – Scale of the coordinates of the final low-dimensional embedding. Ignored if a and b are provided.

  • min_dist (float) – Minimum distance between observations in the final low-dimensional embedding. Smaller values will increase local clustering while larger values favor a more even distribution of observations throughout the low-dimensional space. This is interpreted relative to spread. Ignored if a and b are provided.

  • a (Optional[float]) – The a parameter for the fuzzy set membership strength calculations. Larger values yield a sharper decay in membership strength with increasing distance between observations. If this or b are None, a suitable value for this parameter is automatically determined from spread and min_dist.

  • b (Optional[float]) – The b parameter for the fuzzy set membership strength calculations. Larger values yield an earlier decay in membership strength with increasing distance between observations. If this or a are None, a suitable value for this parameter is automatically determined from spread and min_dist.

  • repulsion_strength (float) – Modifier for the repulsive force. Larger values increase repulsion and favor local structure.

  • initialize_method (Literal['spectral', 'random', 'none']) –

    How to initialize the embedding.

    • spectral: spectral decomposition of the normalized graph Laplacian. Specifically, the initial coordinates are defined from the eigenvectors corresponding to the smallest non-zero eigenvalues. This fails in the presence of multiple graph components or if the approximate SVD fails to converge.

    • random: fills the embedding with random draws from a normal distribution.

    • none: uses initial values from initial_coordinates.

  • initialize_random_on_spectral_fail (bool) – Whether to fall back to random sampling (i.e., random) if spectral initialization fails due to the presence of multiple components in the graph. If False, the values in initial_coordinates will be used instead, i.e., same as none. Only relevant if initialize_method = "spectral" and spectral initialization fails.

  • initialize_spectral_scale (float) – Maximum absolute magnitude of the coordinates after spectral initialization. The default is chosen to avoid outlier observations with large absolute distances that may interfere with optimization. Only relevant if initialize_method = "spectral" and spectral initialization does not fail.

  • initialize_spectral_jitter (bool) – Whether to jitter coordinates after spectral initialization to separate duplicate observations (e.g., to avoid overplotting). This is done using normally-distributed noise of mean zero and standard deviation of initialize_spectral_jitter_sd. Only relevant if initialize_method = "spectral" and spectral initialization does not fail.

  • initialize_spectral_jitter_sd (float) – Standard deviation of the jitter to apply after spectral initialization. Only relevant if initialize_method = "spectral" and spectral initialization does not fail and initialize_spectral_jitter = True.

  • initialize_random_scale (float) – Scale of the randomly generated initial coordinates. Coordinates are sampled from a uniform distribution over [-initialize_random_scale, initialize_random_scale). Only relevant if initialize_method = "random", or if initialize_method = "spectral" and spectral initialization fails and initialize_random_on_spectral_fail = True.

  • initialize_seed (int) – Seed for the random number generation during initialization. Only relevant if initialize_method = "random"; or initialize_method = "spectral" and initialize_spectral_jitter = True; or initialize_method = "spectral" and spectral initialization fails and initialize_random_on_spectral_fail = True.

  • initial_coordinates (Optional[array]) – Double-precision matrix of initial coordinates with number of rows and columns equal to the number of observations and num_dim, respectively. Only relevant if initialize_method = "none"; or initialize_method = "spectral" and spectral initialization fails and initialize_random_on_spectral_fail = False.

  • num_epochs (Optional[int]) – Number of epochs to perform. If set to None, an appropriate number of epochs is chosen based on the number of points in x.

  • num_neighbors (int) – Number of neighbors to use in the UMAP algorithm. Larger values cause the embedding to focus on global structure.

  • optimize_seed (int) – Integer scalar specifying the seed to use.

  • num_threads (int) – Number of threads to use.

  • nn_parameters (Parameters) – The algorithm to use for the nearest-neighbor search. Only used if x is not a pre-built nearest-neighbor search index or a list of existing nearest-neighbor search results.

Return type:

ndarray

Returns:

Array containing the coordinates of each cell in the low-dimensional embedding. Each row corresponds to a dimension and each column represents a cell.

References

https://libscran.github.io/umappp, for the underlying implementation.
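
Example

A minimal sketch on a simulated low-dimensional matrix; the parameter values shown are simply the documented defaults made explicit.

import numpy
from scranpy.run_umap import run_umap

# Hypothetical low-dimensional matrix: 20 dimensions (rows) x 300 cells (columns).
rng = numpy.random.default_rng(9)
pcs = rng.normal(size=(20, 300))

coords = run_umap(pcs, num_dim=2, num_neighbors=15, min_dist=0.1)
# 'coords' holds the UMAP embedding, with one column per cell.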

scranpy.sanitize_size_factors module

scranpy.sanitize_size_factors.sanitize_size_factors(size_factors, replace_zero=True, replace_negative=True, replace_infinite=True, replace_nan=True, in_place=False)

Replace invalid size factors, i.e., zero, negative, infinite or NaNs.

Parameters:
  • size_factors (ndarray) – Floating-point array containing size factors for all cells.

  • replace_zero (bool) – Whether to replace size factors of zero with the lowest positive factor. If False, zeros are retained.

  • replace_negative (bool) – Whether to replace negative size factors with the lowest positive factor. If False, negative values are retained.

  • replace_infinite (bool) – Whether to replace infinite size factors with the largest positive factor. If False, infinite values are retained.

  • replace_nan (bool) – Whether to replace NaN size factors with unity. If False, NaN values are retained.

  • in_place (bool) – Whether to modify size_factors in place. If False, a new array is returned. This argument is only used if size_factors is double-precision; otherwise a new array is always returned.

Return type:

ndarray

Returns:

Array containing sanitized size factors. If in_place = True, this is a reference to size_factors.

References

The sanitize_size_factors function in the scran_norm C++ library, which provides the underlying implementation.
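
Example

A minimal sketch on a handful of hypothetical size factors containing each type of invalid value.

import numpy
from scranpy.sanitize_size_factors import sanitize_size_factors

sf = numpy.array([1.2, 0.0, -0.5, numpy.nan, numpy.inf, 0.8])
clean = sanitize_size_factors(sf)
# Zeros and negative values are replaced by the smallest positive factor,
# infinite values by the largest positive factor, and NaNs by 1.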

scranpy.scale_by_neighbors module

class scranpy.scale_by_neighbors.ScaleByNeighborsResults(scaling, combined)

Bases: object

Results of scale_by_neighbors().

__annotate_func__()
__annotations_cache__ = {'combined': <class 'numpy.ndarray'>, 'scaling': <class 'numpy.ndarray'>}
__dataclass_fields__ = {'combined': Field(name='combined',type=<class 'numpy.ndarray'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'scaling': Field(name='scaling',type=<class 'numpy.ndarray'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD)}
__dataclass_params__ = _DataclassParams(init=True,repr=True,eq=True,order=False,unsafe_hash=False,frozen=False,match_args=True,kw_only=False,slots=False,weakref_slot=False)
__eq__(other)

Return self==value.

__firstlineno__ = 11
__hash__ = None
__match_args__ = ('scaling', 'combined')
__replace__(**changes)
__repr__()

Return repr(self).

__static_attributes__ = ()
combined: ndarray
scaling: ndarray
scranpy.scale_by_neighbors.scale_by_neighbors(x, num_neighbors=20, block=None, block_weight_policy='variable', variable_block_weight=(0, 1000), num_threads=1, weights=None, nn_parameters=<knncolle.annoy.AnnoyParameters object>)

Scale multiple embeddings (usually derived from different modalities across the same set of cells) so that their within-population variances are comparable. Then, combine them into a single embedding matrix for combined downstream analysis.

Parameters:
  • x (Sequence) – Sequence of numeric matrices of principal components or other embeddings, one for each modality. For each entry, rows are dimensions and columns are cells. All entries should have the same number of columns but may have different numbers of rows.

  • num_neighbors (int) – Number of neighbors to use to define the scaling factor.

  • num_threads (int) – Number of threads to use.

  • nn_parameters (Parameters) – Algorithm for the nearest-neighbor search.

  • weights (Optional[Sequence]) – Array of length equal to x, specifying the weights to apply to each modality. Each value represents a multiplier of the within-population variance of its modality, i.e., larger values increase the contribution of that modality in the combined output matrix. The default of None is equivalent to an all-1 vector, i.e., all modalities are scaled to have the same within-population variance.

Return type:

ScaleByNeighborsResults

Returns:

Scaling factors and the combined matrix from all modalities.

References

https://libscran.github.io/mumosa, for the basis and caveats of this approach.
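
Example

A minimal sketch with simulated embeddings for two modalities measured on the same cells; dense NumPy arrays are assumed to be acceptable inputs.

import numpy
from scranpy.scale_by_neighbors import scale_by_neighbors

# Hypothetical embeddings for the same 200 cells from two modalities.
rng = numpy.random.default_rng(10)
rna_pcs = rng.normal(size=(20, 200))  # 20 dimensions x 200 cells
adt_pcs = rng.normal(size=(10, 200))  # 10 dimensions x 200 cells

res = scale_by_neighbors([rna_pcs, adt_pcs], num_neighbors=20)
combined = res.combined  # single matrix combining both (scaled) embeddings
scaling = res.scaling    # scaling factor applied to each modality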

scranpy.score_gene_set module

class scranpy.score_gene_set.ScoreGeneSetResults(scores, weights)

Bases: object

Results of score_gene_set().

__annotate_func__()
__annotations_cache__ = {'scores': <class 'numpy.ndarray'>, 'weights': <class 'numpy.ndarray'>}
__dataclass_fields__ = {'scores': Field(name='scores',type=<class 'numpy.ndarray'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'weights': Field(name='weights',type=<class 'numpy.ndarray'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD)}
__dataclass_params__ = _DataclassParams(init=True,repr=True,eq=True,order=False,unsafe_hash=False,frozen=False,match_args=True,kw_only=False,slots=False,weakref_slot=False)
__eq__(other)

Return self==value.

__firstlineno__ = 12
__hash__ = None
__match_args__ = ('scores', 'weights')
__replace__(**changes)
__repr__()

Return repr(self).

__static_attributes__ = ()
scores: ndarray
weights: ndarray
scranpy.score_gene_set.score_gene_set(x, set, rank=1, scale=False, block=None, block_weight_policy='variable', variable_block_weight=(0, 1000), extra_work=7, iterations=1000, seed=5489, realized=True, num_threads=1)

Compute per-cell scores for a gene set, defined as the column sums of a rank-1 approximation to the submatrix for the feature set. This uses the same approach implemented in the GSDecon package by Jason Hackney.

Parameters:
  • x (Any) – A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. The matrix is expected to contain log-expression values.

  • set (Sequence) – Array of integer indices specifying the rows of x belonging to the gene set. Alternatively, a sequence of boolean values of length equal to the number of rows, where truthy elements indicate that the corresponding row belongs to the gene set.

  • rank (int) – Rank of the approximation.

  • scale (bool) – Whether to scale all genes to have the same variance.

  • block (Optional[Sequence]) – Array of length equal to the number of columns of x, containing the block of origin (e.g., batch, sample) for each cell. Alternatively None, if all cells are from the same block.

  • block_weight_policy (Literal['variable', 'equal', 'none']) – Policy to use for weighting different blocks when computing the average for each statistic. Only used if block is provided.

  • variable_block_weight (Tuple) – Parameters for variable block weighting. This should be a tuple of length 2 where the first and second values are used as the lower and upper bounds, respectively, for the variable weight calculation. Only used if block is provided and block_weight_policy = "variable".

  • extra_work (int) – Number of extra dimensions for the IRLBA workspace.

  • iterations (int) – Maximum number of restart iterations for IRLBA.

  • seed (int) – Seed for the initial random vector in IRLBA.

  • realized (bool) – Whether to realize x into an optimal memory layout for IRLBA. This speeds up computation at the cost of increased memory usage.

  • num_threads (int) – Number of threads to use.

Return type:

ScoreGeneSetResults

Returns:

Array of per-cell scores and per-gene weights.

References

https://libscran.github.io/gsdecon, which describes the approach in more detail. In particular, the documentation for the compute_blocked function explains the blocking strategy.
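
Example

A minimal sketch on a simulated log-expression matrix, assuming a dense NumPy array is an acceptable matrix-like input; the gene set is arbitrarily taken to be the first 20 rows.

import numpy
from scranpy.score_gene_set import score_gene_set

# Hypothetical log-expression matrix: 1000 genes (rows) x 300 cells (columns).
rng = numpy.random.default_rng(11)
logexp = numpy.log1p(rng.poisson(2, size=(1000, 300)))
gene_set = numpy.arange(20)

res = score_gene_set(logexp, gene_set)
scores = res.scores    # one score per cell
weights = res.weights  # per-gene weights for the rank-1 approximation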

scranpy.score_markers module

class scranpy.score_markers.ScoreMarkersResults(groups, mean, detected, cohens_d, auc, delta_mean, delta_detected)

Bases: object

Results of score_markers().

__annotate_func__()
__annotations_cache__ = {'auc': numpy.ndarray | biocutils.NamedList.NamedList | None, 'cohens_d': numpy.ndarray | biocutils.NamedList.NamedList | None, 'delta_detected': numpy.ndarray | biocutils.NamedList.NamedList | None, 'delta_mean': numpy.ndarray | biocutils.NamedList.NamedList | None, 'detected': numpy.ndarray | None, 'groups': <class 'list'>, 'mean': numpy.ndarray | None}
__dataclass_fields__ = {'auc': Field(name='auc',type=numpy.ndarray | biocutils.NamedList.NamedList | None,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'cohens_d': Field(name='cohens_d',type=numpy.ndarray | biocutils.NamedList.NamedList | None,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'delta_detected': Field(name='delta_detected',type=numpy.ndarray | biocutils.NamedList.NamedList | None,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'delta_mean': Field(name='delta_mean',type=numpy.ndarray | biocutils.NamedList.NamedList | None,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'detected': Field(name='detected',type=numpy.ndarray | None,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'groups': Field(name='groups',type=<class 'list'>,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'mean': Field(name='mean',type=numpy.ndarray | None,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD)}
__dataclass_params__ = _DataclassParams(init=True,repr=True,eq=True,order=False,unsafe_hash=False,frozen=False,match_args=True,kw_only=False,slots=False,weakref_slot=False)
__eq__(other)

Return self==value.

__firstlineno__ = 13
__hash__ = None
__match_args__ = ('groups', 'mean', 'detected', 'cohens_d', 'auc', 'delta_mean', 'delta_detected')
__replace__(**changes)
__repr__()

Return repr(self).

__static_attributes__ = ()
auc: Union[ndarray, NamedList, None]
cohens_d: Union[ndarray, NamedList, None]
delta_detected: Union[ndarray, NamedList, None]
delta_mean: Union[ndarray, NamedList, None]
detected: Optional[ndarray]
groups: list
mean: Optional[ndarray]
to_biocframes(effect_sizes=None, summaries=None, include_mean=True, include_detected=True)

Convert the effect size summaries into a BiocFrame for each group. This should only be used if all_pairwise = False in score_markers().

Parameters:
  • effect_sizes (Optional[list]) – List of effect sizes to include in each BiocFrame. This can contain any of cohens_d, auc, delta_mean, and delta_detected. If None, all non-None effect sizes are reported.

  • summaries (Optional[list]) – List of summary statistics to include in each BiocFrame. This can contain any of min, mean, median, max, and min_rank. If None, all summary statistics are reported.

  • include_mean (bool) – Whether to include the mean for each group.

  • include_detected (bool) – Whether to include the detected proportion for each group.

Return type:

NamedList

Returns:

A list of length equal to the number of groups, where each entry is a BiocFrame with the effect size summaries for the corresponding group. Each row of the BiocFrame corresponds to a gene. Each effect size summary is represented by a column named <EFFECT>_<SUMMARY>. If include_mean = True or include_detected = True, additional columns will be present with the mean and detected proportion, respectively.

The list itself is named according to groups if the elements can be converted to strings, otherwise it is unnamed.
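
As a quick illustration, the sketch below assumes that res is a ScoreMarkersResults object produced by score_markers() with all_pairwise = False (see the next entry); the choice of effect sizes and summaries here is arbitrary, and the inspection of column names at the end is only for demonstration.

    # Minimal sketch, assuming `res` came from score_markers() with all_pairwise=False.
    frames = res.to_biocframes(
        effect_sizes=["cohens_d", "auc"],   # restrict to two effect sizes
        summaries=["mean", "min_rank"],     # restrict to two summary statistics
        include_detected=False,
    )

    # One BiocFrame per group; effect-size columns follow the <EFFECT>_<SUMMARY>
    # convention (e.g. 'cohens_d_min_rank'), plus 'mean' since include_mean=True.
    first = frames[0]
    print(first.get_column_names())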

scranpy.score_markers.score_markers(x, groups, block=None, block_average_policy='mean', block_weight_policy='variable', variable_block_weight=(0, 1000), block_quantile=0.5, threshold=0, compute_group_mean=True, compute_group_detected=True, compute_cohens_d=True, compute_auc=True, compute_delta_mean=True, compute_delta_detected=True, compute_summary_min=True, compute_summary_mean=True, compute_summary_median=True, compute_summary_max=True, compute_summary_quantiles=None, compute_summary_min_rank=True, min_rank_limit=500, all_pairwise=False, num_threads=1)

Score marker genes for each group using a variety of effect sizes from pairwise comparisons between groups. This includes Cohen’s d, the area under the curve (AUC), the difference in the means (delta-mean) and the difference in the proportion of detected cells (delta-detected).

Parameters:
  • x (Any) – A matrix-like object where rows correspond to genes or genomic features and columns correspond to cells. It is typically expected to contain log-expression values, e.g., from normalize_counts().

  • groups (Sequence) – Group assignment for each cell in x. This should have length equal to the number of columns in x.

  • block (Optional[Sequence]) – Array of length equal to the number of columns of x, containing the block of origin (e.g., batch, sample) for each cell. Alternatively None, if all cells are from the same block.

  • block_average_policy (Literal['mean', 'quantile']) – Policy to use for averaging statistics across blocks. This can either be a (weighted) mean or a quantile. Only used if block is supplied.

  • block_weight_policy (Literal['variable', 'equal', 'none']) – Policy to use for weighting different blocks when computing the average for each statistic. Only used if block is provided.

  • variable_block_weight (Tuple) – Parameters for variable block weighting. This should be a tuple of length 2 where the first and second values are used as the lower and upper bounds, respectively, for the variable weight calculation. Only used if block is provided and block_weight_policy = "variable".

  • block_quantile (float) – Probability of the quantile of statistics across blocks. Defaults to 0.5, i.e., the median of per-block statistics. Only used if block is provided and block_average_policy = "quantile".

  • threshold (float) – Non-negative value specifying the minimum threshold on the differences in means (i.e., the log-fold change, if x contains log-expression values). This is incorporated into the calculation for Cohen’s d and the AUC.

  • compute_group_mean (bool) – Whether to compute the group-wise mean expression for each gene.

  • compute_group_detected (bool) – Whether to compute the group-wise proportion of detected cells for each gene.

  • compute_cohens_d (bool) – Whether to compute Cohen’s d, i.e., the ratio of the difference in means to the standard deviation.

  • compute_auc (bool) – Whether to compute the AUC. Setting this to False can improve speed and memory efficiency.

  • compute_delta_mean (bool) – Whether to compute the delta-means, i.e., the log-fold change when x contains log-expression values.

  • compute_delta_detected (bool) – Whether to compute the delta-detected, i.e., differences in the proportion of cells with detected expression.

  • compute_summary_min (bool) – Whether to compute the minimum as a summary statistic for each effect size. Only used if all_pairwise = False.

  • compute_summary_mean (bool) – Whether to compute the mean as a summary statistic for each effect size. Only used if all_pairwise = False.

  • compute_summary_median (bool) – Whether to compute the median as a summary statistic for each effect size. Only used if all_pairwise = False.

  • compute_summary_max (bool) – Whether to compute the maximum as a summary statistic for each effect size. Only used if all_pairwise = False.

  • compute_summary_quantiles (Optional[Sequence]) – Probabilities of quantiles to compute as summary statistics for each effect size. This should be in [0, 1] and sorted in order of increasing size. If None, no quantiles are computed. Only used if all_pairwise = False.

  • compute_summary_min_rank (bool) – Whether to compute the minimum rank as a summary statistic for each effect size. Only used if all_pairwise = False.

  • min_rank_limit (int) – Maximum value of the min-rank to report. Lower values improve memory efficiency at the cost of discarding information about lower-ranked genes. Only used if all_pairwise = False and compute_summary_min_rank = True.

  • all_pairwise (bool) – Whether to report the full effects for every pairwise comparison between groups. Alternatively, an integer scalar indicating the number of top markers to report from each pairwise comparison between groups. If False, only summaries are reported.

  • num_threads (int) – Number of threads to use.

Return type:

ScoreMarkersResults

Returns:

Scores for ranking marker genes in each group, based on the effect sizes for pairwise comparisons between groups.

References

The score_markers_summary and score_markers_pairwise functions in the scran_markers C++ library, which describe the rationale behind the choice of effect sizes and summary statistics. Also see their blocked equivalents, score_markers_summary_blocked and score_markers_pairwise_blocked, which are used when block is provided.
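
For orientation, here is a minimal, hypothetical sketch of a call to score_markers(): the log-expression matrix and cluster assignments are randomly generated stand-ins for real data, so the output shown depends entirely on that toy input.

    import numpy
    from scranpy.score_markers import score_markers

    # Hypothetical toy data: 1000 genes x 300 cells of log-expression-like values,
    # with a made-up assignment of each cell to one of three clusters.
    rng = numpy.random.default_rng(42)
    x = rng.exponential(1.0, size=(1000, 300))
    groups = rng.integers(0, 3, size=300)

    res = score_markers(x, groups, num_threads=2)

    # With all_pairwise=False (the default), each effect size holds per-group
    # summaries; to_biocframes() lays them out as one BiocFrame per group.
    print(res.groups)              # the distinct group identities
    frames = res.to_biocframes()
    print(frames[0])               # effect-size summaries for the first group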

scranpy.subsample_by_neighbors module

scranpy.subsample_by_neighbors.subsample_by_neighbors(x, num_neighbors=20, min_remaining=10, nn_parameters=<knncolle.annoy.AnnoyParameters object>, num_threads=1)

Subsample a dataset by selecting cells to represent all of their nearest neighbors.

Parameters:
  • x (Union[ndarray, FindKnnResults, Index]) –

    Numeric matrix where rows are dimensions and columns are cells, typically containing a low-dimensional representation from, e.g., run_pca().

    Alternatively, a Index object containing a pre-built search index for a dataset.

    Alternatively, a FindKnnResults object containing pre-computed search results for a dataset. The number of neighbors should be equal to num_neighbors, otherwise a warning is raised.

  • num_neighbors (int) – Number of neighbors to use. Larger values result in greater downsampling. Only used if x does not contain existing neighbor search results.

  • nn_parameters (Parameters) – Neighbor search algorithm to use. Only used if x does not contain existing neighbor search results.

  • min_remaining (int) – Minimum number of remaining (i.e., unselected) neighbors that a cell must have in order to be considered for selection. This should be less than or equal to num_neighbors.

  • num_threads (int) – Number of threads to use for the nearest-neighbor search. Only used if x does not contain existing neighbor search results.

Return type:

ndarray

Returns:

Integer array with indices of the cells selected to be in the subsample.

References

https://libscran.github.io/nenesub, for the rationale behind this approach.
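
A small, hypothetical sketch of how this might be used is shown below; the matrix of principal components is random filler standing in for an actual run_pca() result.

    import numpy
    from scranpy.subsample_by_neighbors import subsample_by_neighbors

    # Hypothetical low-dimensional representation: 20 dimensions x 1000 cells,
    # as might come from run_pca(); here it is just random data for illustration.
    rng = numpy.random.default_rng(1)
    pcs = rng.normal(size=(20, 1000))

    keep = subsample_by_neighbors(pcs, num_neighbors=20, min_remaining=10)
    print(keep.size)        # number of cells retained in the subsample
    subset = pcs[:, keep]   # columns of the original matrix for the selected cells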

scranpy.summarize_effects module

class scranpy.summarize_effects.GroupwiseSummarizedEffects(min, mean, median, max, quantiles, min_rank)

Bases: object

Summarized effect sizes for a single group, typically created by summarize_effects() or score_markers().

__annotate_func__()
__annotations_cache__ = {'max': numpy.ndarray | None, 'mean': numpy.ndarray | None, 'median': numpy.ndarray | None, 'min': numpy.ndarray | None, 'min_rank': numpy.ndarray | None, 'quantiles': biocutils.NamedList.NamedList | None}
__dataclass_fields__ = {'max': Field(name='max',type=numpy.ndarray | None,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'mean': Field(name='mean',type=numpy.ndarray | None,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'median': Field(name='median',type=numpy.ndarray | None,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'min': Field(name='min',type=numpy.ndarray | None,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'min_rank': Field(name='min_rank',type=numpy.ndarray | None,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD), 'quantiles': Field(name='quantiles',type=biocutils.NamedList.NamedList | None,default=<dataclasses._MISSING_TYPE object>,default_factory=<dataclasses._MISSING_TYPE object>,init=True,repr=True,hash=None,compare=True,metadata=mappingproxy({}),kw_only=False,doc=None,_field_type=_FIELD)}
__dataclass_params__ = _DataclassParams(init=True,repr=True,eq=True,order=False,unsafe_hash=False,frozen=False,match_args=True,kw_only=False,slots=False,weakref_slot=False)
__eq__(other)

Return self==value.

__firstlineno__ = 15
__hash__ = None
__match_args__ = ('min', 'mean', 'median', 'max', 'quantiles', 'min_rank')
__replace__(**changes)
__repr__()

Return repr(self).

__static_attributes__ = ()
max: Optional[ndarray]
mean: Optional[ndarray]
median: Optional[ndarray]
min: Optional[ndarray]
min_rank: Optional[ndarray]
quantiles: Optional[NamedList]
to_biocframe()

Convert the results to a BiocFrame.

Returns:

A BiocFrame where each row is a gene and each column is a summary statistic.

scranpy.summarize_effects.summarize_effects(effects, compute_min=True, compute_mean=True, compute_median=True, compute_max=True, compute_quantiles=None, compute_min_rank=True, num_threads=1)

For each group, summarize the effect sizes for all pairwise comparisons to other groups. This yields a set of summary statistics that can be used to rank marker genes for each group.

Parameters:
  • effects (ndarray) – A 3-dimensional numeric array containing the effect sizes from each pairwise comparison between groups. The extents of the first two dimensions should be equal to the number of groups, while the extent of the final dimension is equal to the number of genes. The entry [i, j, k] should represent the effect size from the comparison of group j against group i for gene k. See also the output of score_markers() with all_pairwise = True.

  • compute_min (bool) – Whether to compute the minimum as a summary statistic for each effect size.

  • compute_mean (bool) – Whether to compute the mean as a summary statistic for each effect size.

  • compute_median (bool) – Whether to compute the median as a summary statistic for each effect size.

  • compute_max (bool) – Whether to compute the maximum as a summary statistic for each effect size.

  • compute_quantiles (Optional[Sequence]) – Probabilities of quantiles to compute as summary statistics for each effect size. This should be in [0, 1] and sorted in order of increasing size. If None, no quantiles are computed.

  • compute_min_rank (bool) – Whether to compute the minimum rank as a summary statistic for each effect size.

  • num_threads (int) – Number of threads to use.

Return type:

list[GroupwiseSummarizedEffects]

Returns:

List of length equal to the number of groups (i.e., the extents of the first two dimensions of effects). Each entry contains the summary statistics of the effect sizes of the comparisons involving the corresponding group.

References

The summarize_effects function in the scran_markers C++ library, for more details on the statistics.
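
The sketch below is a hypothetical example of summarizing a random 3-dimensional array of effect sizes; in practice this array would typically come from score_markers() with all_pairwise = True.

    import numpy
    from scranpy.summarize_effects import summarize_effects

    # Hypothetical effect sizes for 4 groups and 500 genes; entry [i, j, k] is the
    # effect from comparing group j against group i for gene k, as described above.
    rng = numpy.random.default_rng(0)
    effects = rng.normal(size=(4, 4, 500))

    summaries = summarize_effects(effects, compute_quantiles=[0.25, 0.75])
    print(len(summaries))               # one GroupwiseSummarizedEffects per group
    print(summaries[0].to_biocframe())  # genes x summary statistics for group 0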

scranpy.test_enrichment module

scranpy.test_enrichment.test_enrichment(x, sets, universe, log=False, num_threads=1)

Perform a hypergeometric test for enrichment of interesting genes (e.g., markers) in one or more pre-defined gene sets.

Parameters:
  • x (Sequence) – Sequence of identifiers for the interesting genes.

  • sets (Sequence) – Sequence of gene sets, where each entry corresponds to a gene set and contains a sequence of identifiers for genes in that set.

  • universe (Union[int, Sequence]) – Sequence of identifiers for the universe of genes in the dataset. It is expected that x is a subset of universe. Alternatively, an integer specifying the number of genes in the universe.

  • log (bool) – Whether to report the log-transformed p-values.

  • num_threads (int) – Number of threads to use.

Return type:

ndarray

Returns:

Array of (log-transformed) p-values to test for significant enrichment of x in each entry of sets.

References

https://libscran.github.io/phyper, for the underlying implementation.
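
As a toy example, the sketch below tests two made-up gene sets for enrichment of a handful of made-up marker identifiers; all names are hypothetical.

    from scranpy.test_enrichment import test_enrichment

    # Hypothetical markers, gene sets, and universe of gene identifiers.
    markers = ["GeneA", "GeneB", "GeneC", "GeneD"]
    sets = [
        ["GeneA", "GeneB", "GeneX", "GeneY"],  # contains two of the markers
        ["GeneQ", "GeneR", "GeneS"],           # contains none of the markers
    ]
    universe = ["Gene" + suffix for suffix in "ABCDQRSXYZ"]

    pvalues = test_enrichment(markers, sets, universe)
    print(pvalues)  # one hypergeometric p-value per entry of `sets`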

Module contents