scran_markers
Marker detection for single-cell data
Classes | |
struct | ScoreMarkersPairwiseBuffers | Buffers for score_markers_pairwise() and friends. |
struct | ScoreMarkersPairwiseOptions | Options for score_markers_pairwise() and friends. |
struct | ScoreMarkersPairwiseResults | Results for score_markers_pairwise() and friends. |
struct | ScoreMarkersSummaryBuffers | Buffers for score_markers_summary() and friends. |
struct | ScoreMarkersSummaryOptions | Options for score_markers_summary() and friends. |
struct | ScoreMarkersSummaryResults | Results for score_markers_summary() and friends. |
struct | SummarizeEffectsOptions | Options for summarize_effects(). |
struct | SummaryBuffers | Pointers to arrays to hold the summary statistics. |
struct | SummaryResults | Container for the summary statistics. |
Functions | |
template<typename Value_ , typename Index_ , typename Group_ , typename Stat_ > | |
void | score_markers_pairwise (const tatami::Matrix< Value_, Index_ > &matrix, const Group_ *const group, const ScoreMarkersPairwiseOptions &options, const ScoreMarkersPairwiseBuffers< Stat_ > &output) |
template<typename Value_ , typename Index_ , typename Group_ , typename Block_ , typename Stat_ > | |
void | score_markers_pairwise_blocked (const tatami::Matrix< Value_, Index_ > &matrix, const Group_ *const group, const Block_ *const block, const ScoreMarkersPairwiseOptions &options, const ScoreMarkersPairwiseBuffers< Stat_ > &output) |
template<typename Stat_ = double, typename Value_ , typename Index_ , typename Group_ > | |
ScoreMarkersPairwiseResults< Stat_ > | score_markers_pairwise (const tatami::Matrix< Value_, Index_ > &matrix, const Group_ *const group, const ScoreMarkersPairwiseOptions &options) |
template<typename Stat_ = double, typename Value_ , typename Index_ , typename Group_ , typename Block_ > | |
ScoreMarkersPairwiseResults< Stat_ > | score_markers_pairwise_blocked (const tatami::Matrix< Value_, Index_ > &matrix, const Group_ *const group, const Block_ *const block, const ScoreMarkersPairwiseOptions &options) |
template<typename Gene_ , typename Stat_ , typename Rank_ > | |
void | summarize_effects (const Gene_ ngenes, const std::size_t ngroups, const Stat_ *const effects, const std::vector< SummaryBuffers< Stat_, Rank_ > > &summaries, const SummarizeEffectsOptions &options) |
template<typename Stat_ = double, typename Rank_ = int, typename Gene_ > | |
std::vector< SummaryResults< Stat_, Rank_ > > | summarize_effects (const Gene_ ngenes, const std::size_t ngroups, const Stat_ *const effects, const SummarizeEffectsOptions &options) |
template<typename Value_ , typename Index_ , typename Group_ , typename Stat_ , typename Rank_ > | |
void | score_markers_summary (const tatami::Matrix< Value_, Index_ > &matrix, const Group_ *const group, const ScoreMarkersSummaryOptions &options, const ScoreMarkersSummaryBuffers< Stat_, Rank_ > &output) |
template<typename Value_ , typename Index_ , typename Group_ , typename Block_ , typename Stat_ , typename Rank_ > | |
void | score_markers_summary_blocked (const tatami::Matrix< Value_, Index_ > &matrix, const Group_ *const group, const Block_ *const block, const ScoreMarkersSummaryOptions &options, const ScoreMarkersSummaryBuffers< Stat_, Rank_ > &output) |
template<typename Stat_ = double, typename Rank_ = int, typename Value_ , typename Index_ , typename Group_ > | |
ScoreMarkersSummaryResults< Stat_, Rank_ > | score_markers_summary (const tatami::Matrix< Value_, Index_ > &matrix, const Group_ *const group, const ScoreMarkersSummaryOptions &options) |
template<typename Stat_ = double, typename Rank_ = int, typename Value_ , typename Index_ , typename Group_ , typename Block_ > | |
ScoreMarkersSummaryResults< Stat_, Rank_ > | score_markers_summary_blocked (const tatami::Matrix< Value_, Index_ > &matrix, const Group_ *const group, const Block_ *const block, const ScoreMarkersSummaryOptions &options) |
Marker detection for single-cell data.
ScoreMarkersPairwiseResults< Stat_ > scran_markers::score_markers_pairwise | ( | const tatami::Matrix< Value_, Index_ > & | matrix, |
const Group_ *const | group, | ||
const ScoreMarkersPairwiseOptions & | options ) |
Overload of score_markers_pairwise() that allocates memory for the output statistics.
Stat_ | Floating-point type of the statistics. |
Value_ | Matrix data type. |
Index_ | Matrix index type. |
Group_ | Integer type of the group assignments. |
matrix | A matrix of expression values, typically normalized and log-transformed. Rows should contain genes while columns should contain cells. | |
[in] | group | Pointer to an array of length equal to the number of columns in matrix , containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups. |
options | Further options. |
void scran_markers::score_markers_pairwise | ( | const tatami::Matrix< Value_, Index_ > & | matrix, |
const Group_ *const | group, | ||
const ScoreMarkersPairwiseOptions & | options, | ||
const ScoreMarkersPairwiseBuffers< Stat_ > & | output ) |
Score potential marker genes based on the effect sizes for the pairwise comparisons between groups. For each group, the strongest markers are those genes with the largest effect sizes (i.e., upregulated) when compared to all other groups. The pairwise effect sizes computed by this function can be used to identify markers to distinguish two specific groups, or the effect sizes for multiple comparisons involving a group can be passed to summarize_effects() to obtain a single ranking for that group.
The delta-mean is the difference in the mean expression between groups. This is fairly straightforward to interpret - a positive delta-mean corresponds to increased expression in the first group compared to the second. The delta-mean can also be treated as the log-fold change if the input matrix contains log-transformed normalized expression values.
The delta-detected is the difference in the proportion of cells with detected expression between groups. This lies between 1 and -1, with the extremes occurring when a gene is silent in one group and detected in all cells of the other group. For this interpretation, we assume that the input matrix contains non-negative expression values, where a value of zero corresponds to lack of detectable expression.
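The two difference-based effect sizes described above can be sketched for a single gene and a single pair of groups as follows. This is a minimal illustration rather than the library's implementation; the helper names (`group_mean`, `group_detected`, `delta_mean`, `delta_detected`) are hypothetical, and each vector is assumed to hold one gene's non-negative expression values for the cells of one group.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: mean expression of one gene within one group.
inline double group_mean(const std::vector<double>& values) {
    assert(!values.empty());
    double total = 0;
    for (const double v : values) {
        total += v;
    }
    return total / static_cast<double>(values.size());
}

// Hypothetical helper: proportion of cells with detected (non-zero) expression.
inline double group_detected(const std::vector<double>& values) {
    assert(!values.empty());
    std::size_t count = 0;
    for (const double v : values) {
        count += (v != 0);
    }
    return static_cast<double>(count) / static_cast<double>(values.size());
}

// Positive values indicate higher mean expression in the first group.
inline double delta_mean(const std::vector<double>& first, const std::vector<double>& second) {
    return group_mean(first) - group_mean(second);
}

// Lies in [-1, 1]; the extremes occur when the gene is silent in one group
// and detected in every cell of the other.
inline double delta_detected(const std::vector<double>& first, const std::vector<double>& second) {
    return group_detected(first) - group_detected(second);
}
```

Swapping the two groups flips the sign of both statistics, which is why each ordered pair of groups has its own entry in the output.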
Cohen's d is the standardized difference between two groups. This is defined as the difference in the mean for each group scaled by the average standard deviation across the two groups. (Technically, we should use the pooled variance; however, this introduces some unintuitive asymmetry depending on the variance of the larger group, so we take a simple average instead.) A positive value indicates that the gene has increased expression in the first group compared to the second. Cohen's d is analogous to the t-statistic in a two-sample t-test and avoids spuriously large effect sizes from comparisons between highly variable groups. We can also interpret Cohen's d as the number of standard deviations between the two group means.
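As a rough sketch of this definition, the code below computes Cohen's d for one gene and one pair of groups. It assumes that the "simple average" refers to the square root of the unweighted mean of the two sample variances; that reading, and the helper names `mean_and_variance` and `cohens_d`, are assumptions for illustration only. Zero variance in both groups is not handled here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct MeanVar {
    double mean;
    double variance;
};

// Sample mean and (n - 1)-denominator variance of one group.
inline MeanVar mean_and_variance(const std::vector<double>& x) {
    assert(x.size() >= 2);
    double mean = 0;
    for (const double v : x) {
        mean += v;
    }
    mean /= static_cast<double>(x.size());

    double ss = 0;
    for (const double v : x) {
        ss += (v - mean) * (v - mean);
    }
    return MeanVar{mean, ss / static_cast<double>(x.size() - 1)};
}

// Difference in means scaled by the square root of the unweighted average
// of the two variances (an assumption, see the text above), so that group
// sizes do not influence the denominator.
inline double cohens_d(const std::vector<double>& first, const std::vector<double>& second) {
    const MeanVar a = mean_and_variance(first);
    const MeanVar b = mean_and_variance(second);
    const double denom = std::sqrt((a.variance + b.variance) / 2.0);
    return (a.mean - b.mean) / denom;
}
```

Because the two variances are averaged without weights, the result is antisymmetric: swapping the groups only changes the sign.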
The area under the curve (AUC) is the probability that a randomly chosen observation in one group is greater than a randomly chosen observation in the other group. Values greater than 0.5 indicate that a gene is upregulated in the first group. The AUC is closely related to the U-statistic used in the Wilcoxon rank sum test. The key difference between the AUC and Cohen's d is that the former is less sensitive to the variance within each group, e.g., if two distributions exhibit no overlap, the AUC is the same regardless of the variance of each distribution. This may or may not be desirable as it improves robustness to outliers but reduces the information available to obtain a fine-grained ranking.
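The probabilistic definition of the AUC can be illustrated with a brute-force comparison of every pair of observations. This is only a sketch: the function name `auc` is hypothetical, counting ties as half a win is an assumption (it is the usual convention for the U-statistic), and a real implementation would use a rank-based algorithm rather than this quadratic loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Probability that a random observation from `first` exceeds a random
// observation from `second`, with ties counted as half (an assumption).
inline double auc(const std::vector<double>& first, const std::vector<double>& second) {
    assert(!first.empty() && !second.empty());
    double wins = 0;
    for (const double a : first) {
        for (const double b : second) {
            if (a > b) {
                wins += 1;
            } else if (a == b) {
                wins += 0.5;
            }
        }
    }
    return wins / static_cast<double>(first.size()) / static_cast<double>(second.size());
}
```

When the two distributions do not overlap, the result is 1 (or 0) no matter how spread out each group is, which illustrates the insensitivity to within-group variance described above.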
Setting a minimum change threshold (ScoreMarkersPairwiseOptions::threshold) prioritizes genes with large shifts in expression instead of those with low variances. Currently, only positive thresholds are supported, which focuses on genes that are upregulated in the first group compared to the second. The effect size definitions are generalized when testing against a non-zero threshold.
We report the mean expression of all cells in each group as well as the proportion of cells with detectable expression in each group. These statistics are useful for quickly interpreting the differences in expression driving the effect sizes.
The effect sizes for all comparisons involving a particular group can be summarized into a few key statistics with summarize_effects(). Ranking by a selected summary statistic can identify candidate markers for the group of interest compared to any, some or all other groups. See also score_markers_summary(), which efficiently obtains effect size summaries for each group.
Value_ | Matrix data type. |
Index_ | Matrix index type. |
Group_ | Integer type of the group assignments. |
Stat_ | Floating-point type of the statistics. |
matrix | A matrix of expression values, typically normalized and log-transformed. Rows should contain genes while columns should contain cells. | |
[in] | group | Pointer to an array of length equal to the number of columns in matrix , containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups. |
options | Further options. | |
[out] | output | Collection of buffers in which to store the computed statistics. Each buffer is filled with the corresponding statistic for each group or pairwise comparison. Any of ScoreMarkersPairwiseBuffers::cohens_d , ScoreMarkersPairwiseBuffers::auc , ScoreMarkersPairwiseBuffers::delta_mean or ScoreMarkersPairwiseBuffers::delta_detected may be NULL, in which case the corresponding statistic is not computed. |
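The requirement on `group` (0-based identifiers covering every integer in \([0, N)\)) can be checked before calling the function. The helper below is a hypothetical pre-flight check and is not part of scran_markers; it returns \(N\) if the assignments are valid and throws otherwise.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of scran_markers): validates group
// assignments and returns the number of unique groups N.
inline std::size_t count_groups(const int* group, std::size_t ncells) {
    assert(group != nullptr || ncells == 0);

    // N is one more than the largest identifier.
    std::size_t ngroups = 0;
    for (std::size_t i = 0; i < ncells; ++i) {
        if (group[i] < 0) {
            throw std::invalid_argument("group identifiers must be non-negative");
        }
        ngroups = std::max(ngroups, static_cast<std::size_t>(group[i]) + 1);
    }

    // Every integer in [0, N) must be used by at least one cell.
    std::vector<bool> seen(ngroups, false);
    for (std::size_t i = 0; i < ncells; ++i) {
        seen[static_cast<std::size_t>(group[i])] = true;
    }
    for (const bool s : seen) {
        if (!s) {
            throw std::invalid_argument("group identifiers must cover every integer in [0, N)");
        }
    }
    return ngroups;
}
```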
ScoreMarkersPairwiseResults< Stat_ > scran_markers::score_markers_pairwise_blocked | ( | const tatami::Matrix< Value_, Index_ > & | matrix, |
const Group_ *const | group, | ||
const Block_ *const | block, | ||
const ScoreMarkersPairwiseOptions & | options ) |
Overload of score_markers_pairwise_blocked() that allocates memory for the output statistics.
Stat_ | Floating-point type of the statistics. |
Value_ | Matrix data type. |
Index_ | Matrix index type. |
Group_ | Integer type of the group assignments. |
Block_ | Integer type of the block assignments. |
matrix | A matrix of expression values, typically normalized and log-transformed. Rows should contain genes while columns should contain cells. | |
[in] | group | Pointer to an array of length equal to the number of columns in matrix , containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups. |
[in] | block | Pointer to an array of length equal to the number of columns in matrix , containing the blocking factor. Block identifiers should be 0-based and should contain all integers in \([0, B)\) where \(B\) is the number of unique blocking levels. |
options | Further options. |
void scran_markers::score_markers_pairwise_blocked | ( | const tatami::Matrix< Value_, Index_ > & | matrix, |
const Group_ *const | group, | ||
const Block_ *const | block, | ||
const ScoreMarkersPairwiseOptions & | options, | ||
const ScoreMarkersPairwiseBuffers< Stat_ > & | output ) |
Score potential marker genes as described for score_markers_pairwise() after accounting for any blocking factor in the dataset. Comparisons are only performed between the groups of cells in the same level of the blocking factor. The block-specific effect sizes are then combined into a single aggregate value for output. This strategy avoids most problems related to batch effects as we never directly compare across different blocking levels.
Specifically, for each gene and each pair of groups, we obtain one effect size per blocking level. We consolidate these into a single statistic by computing the weighted mean across levels. The weight for each level is defined as the product of the weights of the two groups involved in the comparison, where each weight is derived from the size of the group using the policy in ScoreMarkersPairwiseOptions::block_weight_policy.
Blocking levels in which either group has no cells do not contribute to the weighted mean. If two groups never co-occur in the same blocking level, no effect size will be computed and a NaN is reported in the output. We do not attempt to reconcile batch effects in a partially confounded scenario.
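The consolidation across blocking levels can be sketched as follows for a single gene and a single pair of groups. The per-group weights are taken as inputs because their derivation depends on the weighting policy, which is not reproduced here; the struct and function names are hypothetical, and it is assumed that a group with no cells in a block is given a weight of zero.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// One blocking level's contribution to a comparison between two groups.
struct BlockComparison {
    double effect;        // effect size computed within this block
    double weight_first;  // weight of the first group (assumed 0 if it has no cells)
    double weight_second; // weight of the second group (assumed 0 if it has no cells)
};

// Weighted mean of the per-block effect sizes, where each block's weight is
// the product of the two group weights. Returns NaN if no block contains
// both groups.
inline double combine_across_blocks(const std::vector<BlockComparison>& blocks) {
    double total = 0;
    double total_weight = 0;
    for (const BlockComparison& b : blocks) {
        assert(b.weight_first >= 0 && b.weight_second >= 0);
        const double w = b.weight_first * b.weight_second;
        if (w == 0) {
            continue; // at least one group is absent from this block
        }
        total += w * b.effect;
        total_weight += w;
    }
    if (total_weight == 0) {
        return std::numeric_limits<double>::quiet_NaN();
    }
    return total / total_weight;
}
```

Using the product means that a block only matters when both groups are present in it, which matches the statement that comparisons are never made across blocking levels.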
For the mean and detected proportion in each group, we compute a weighted average of each statistic across blocks for each gene. Again, the weight for each group is derived from the size of that group using the policy in ScoreMarkersPairwiseOptions::block_weight_policy.
Value_ | Matrix data type. |
Index_ | Matrix index type. |
Group_ | Integer type of the group assignments. |
Block_ | Integer type of the block assignments. |
Stat_ | Floating-point type of the statistics. |
matrix | A matrix of expression values, typically normalized and log-transformed. Rows should contain genes while columns should contain cells. | |
[in] | group | Pointer to an array of length equal to the number of columns in matrix , containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups. |
[in] | block | Pointer to an array of length equal to the number of columns in matrix , containing the blocking factor. Block identifiers should be 0-based and should contain all integers in \([0, B)\) where \(B\) is the number of unique blocking levels. |
options | Further options. | |
[out] | output | Collection of buffers in which to store the computed statistics. Each buffer is filled with the corresponding statistic for each group or pairwise comparison. Any of ScoreMarkersPairwiseBuffers::cohens_d , ScoreMarkersPairwiseBuffers::auc , ScoreMarkersPairwiseBuffers::delta_mean or ScoreMarkersPairwiseBuffers::delta_detected may be NULL, in which case the corresponding statistic is not computed. |
ScoreMarkersSummaryResults< Stat_, Rank_ > scran_markers::score_markers_summary | ( | const tatami::Matrix< Value_, Index_ > & | matrix, |
const Group_ *const | group, | ||
const ScoreMarkersSummaryOptions & | options ) |
Overload of score_markers_summary() that allocates memory for the output statistics.
Stat_ | Floating-point type to store the statistics. |
Rank_ | Numeric type to store the minimum rank. |
Value_ | Matrix data type. |
Index_ | Matrix index type. |
Group_ | Integer type of the group assignments. |
matrix | A matrix of expression values, typically normalized and log-transformed. Rows should contain genes while columns should contain cells. | |
[in] | group | Pointer to an array of length equal to the number of columns in matrix , containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups. |
options | Further options. |
void scran_markers::score_markers_summary | ( | const tatami::Matrix< Value_, Index_ > & | matrix, |
const Group_ *const | group, | ||
const ScoreMarkersSummaryOptions & | options, | ||
const ScoreMarkersSummaryBuffers< Stat_, Rank_ > & | output ) |
Score each gene as a candidate marker for each group of cells, based on summaries of effect sizes from pairwise comparisons between groups.
Markers are identified by differential expression analyses between pairs of groups of cells (e.g., clusters, cell types). Given \(N\) groups, each group is involved in \(N - 1\) pairwise comparisons and thus has \(N - 1\) effect sizes for each gene. We summarize each group's effect sizes into a small set of descriptive statistics like the minimum, median or mean. Users can then sort genes by any of these summaries to obtain a ranking of potential markers for the group.
The choice of effect size and summary statistic determines the characteristics of the marker ranking. The effect sizes include Cohen's d, the area under the curve (AUC), the delta-mean and the delta-detected (see score_markers_pairwise()). The summary statistics include the minimum, mean, median, maximum and min-rank of the effect sizes across each group's pairwise comparisons (see summarize_effects()). For example, ranking by the delta-detected with the minimum summary will promote markers that are silent in every other group.
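Of the summaries listed above, the min-rank is the least obvious, so a sketch may help. The code assumes the usual definition: within each pairwise comparison, genes are ranked by decreasing effect size (rank 1 is the largest), and a gene's min-rank is its smallest rank across all comparisons involving the group. That definition, the function name `min_rank`, the tie-breaking by gene order, and the absence of NaN handling are all assumptions of this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// `comparisons[k][g]` is the effect size of gene `g` when the group of
// interest is compared against the k-th other group.
inline std::vector<int> min_rank(const std::vector<std::vector<double>>& comparisons, std::size_t ngenes) {
    // Start above the worst possible rank so any real rank replaces it.
    std::vector<int> best(ngenes, static_cast<int>(ngenes) + 1);
    std::vector<std::size_t> order(ngenes);

    for (const std::vector<double>& effects : comparisons) {
        assert(effects.size() == ngenes);

        // Sort gene indices by decreasing effect size; ties keep gene order.
        std::iota(order.begin(), order.end(), std::size_t{0});
        std::stable_sort(order.begin(), order.end(), [&effects](std::size_t l, std::size_t r) {
            return effects[l] > effects[r];
        });

        for (std::size_t pos = 0; pos < ngenes; ++pos) {
            const int rank = static_cast<int>(pos) + 1;
            best[order[pos]] = std::min(best[order[pos]], rank);
        }
    }
    return best;
}
```

Under this definition, taking all genes with a min-rank of at most T yields the union of the top T genes from each comparison.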
The behavior of this function is equivalent to - but more efficient than - calling score_markers_pairwise() followed by summarize_effects() on each array of effect sizes.
Value_ | Matrix data type. |
Index_ | Matrix index type. |
Group_ | Integer type of the group assignments. |
Stat_ | Floating-point type to store the statistics. |
Rank_ | Numeric type to store the minimum rank. |
matrix | A matrix of expression values, typically normalized and log-transformed. Rows should contain genes while columns should contain cells. | |
[in] | group | Pointer to an array of length equal to the number of columns in matrix , containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups. |
options | Further options. | |
[out] | output | Collection of buffers in which to store the computed statistics. Each buffer is filled with the corresponding statistic for each group or pairwise comparison. Any of ScoreMarkersSummaryBuffers::cohens_d , ScoreMarkersSummaryBuffers::auc , ScoreMarkersSummaryBuffers::delta_mean or ScoreMarkersSummaryBuffers::delta_detected may be empty, in which case the corresponding statistic is not computed or summarized. |
ScoreMarkersSummaryResults< Stat_, Rank_ > scran_markers::score_markers_summary_blocked | ( | const tatami::Matrix< Value_, Index_ > & | matrix, |
const Group_ *const | group, | ||
const Block_ *const | block, | ||
const ScoreMarkersSummaryOptions & | options ) |
Overload of score_markers_summary_blocked() that allocates memory for the output statistics.
Stat_ | Floating-point type to store the statistics. |
Rank_ | Numeric type to store the minimum rank. |
Value_ | Matrix data type. |
Index_ | Matrix index type. |
Group_ | Integer type of the group assignments. |
Block_ | Integer type of the block assignments. |
matrix | A matrix of expression values, typically normalized and log-transformed. Rows should contain genes while columns should contain cells. | |
[in] | group | Pointer to an array of length equal to the number of columns in matrix , containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups. |
[in] | block | Pointer to an array of length equal to the number of columns in matrix , containing the blocking factor. Block identifiers should be 0-based and should contain all integers in \([0, B)\) where \(B\) is the number of unique blocking levels. |
options | Further options. |
void scran_markers::score_markers_summary_blocked | ( | const tatami::Matrix< Value_, Index_ > & | matrix, |
const Group_ *const | group, | ||
const Block_ *const | block, | ||
const ScoreMarkersSummaryOptions & | options, | ||
const ScoreMarkersSummaryBuffers< Stat_, Rank_ > & | output ) |
Score potential marker genes by computing summary statistics across pairwise comparisons between groups, accounting for any blocking factor in the dataset. Comparisons are only performed between the groups of cells in the same level of the blocking factor, as described in score_markers_pairwise_blocked(). This strategy avoids most problems related to batch effects as we never directly compare across different blocking levels. The block-specific effect sizes are combined into a single aggregate value per comparison, and these values are in turn summarized as described in summarize_effects(). The behavior of this function is equivalent to - but more efficient than - calling score_markers_pairwise_blocked() followed by summarize_effects() on each array of effect sizes.
Value_ | Matrix data type. |
Index_ | Matrix index type. |
Group_ | Integer type of the group assignments. |
Block_ | Integer type of the block assignments. |
Stat_ | Floating-point type to store the statistics. |
Rank_ | Numeric type to store the minimum rank. |
matrix | A matrix of expression values, typically normalized and log-transformed. Rows should contain genes while columns should contain cells. | |
[in] | group | Pointer to an array of length equal to the number of columns in matrix , containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups. |
[in] | block | Pointer to an array of length equal to the number of columns in matrix , containing the blocking factor. Block identifiers should be 0-based and should contain all integers in \([0, B)\) where \(B\) is the number of unique blocking levels. |
options | Further options. | |
[out] | output | Collection of buffers in which to store the computed statistics. Each buffer is filled with the corresponding statistic for each group or pairwise comparison. Any of ScoreMarkersSummaryBuffers::cohens_d , ScoreMarkersSummaryBuffers::auc , ScoreMarkersSummaryBuffers::delta_mean or ScoreMarkersSummaryBuffers::delta_detected may be empty, in which case the corresponding statistic is not computed or summarized. |
void scran_markers::summarize_effects | ( | const Gene_ | ngenes, |
const std::size_t | ngroups, | ||
const Stat_ *const | effects, | ||
const std::vector< SummaryBuffers< Stat_, Rank_ > > & | summaries, | ||
const SummarizeEffectsOptions & | options ) |
Given \(N\) groups, each group is involved in \(N - 1\) pairwise comparisons and thus has \(N - 1\) effect sizes (e.g., as computed by score_markers_pairwise()). We summarize each group's effect sizes into a small set of descriptive statistics like the minimum, median or mean. Users can then sort genes by any of these summaries to obtain a ranking of potential markers for that group.
The choice of summary statistic determines the interpretation of the ranking. Given a group \(X\):
The exact definition of "large" and "small" depends on the choice of effect size. For signed effects like Cohen's d, delta-mean and delta-detected, the value must be positive to be considered "large", and negative to be considered "small". For the AUC, a value greater than 0.5 is considered "large" and less than 0.5 is considered "small".
The interpretation above is also contingent on the threshold used (see score_markers_pairwise() for details). For positive thresholds, small summary statistics cannot be unambiguously interpreted as downregulation, as the effect is already adjusted to account for the threshold. Only large summary statistics can be safely interpreted, i.e., as evidence for upregulation.
NaN effect sizes are allowed, e.g., if two groups do not exist in the same block for a blocked analysis in score_markers_pairwise_blocked(). This function will ignore NaN values when computing each summary. If all effects are NaN for a particular group, the summary statistic will also be NaN.
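The NaN handling described above can be sketched for a single gene and a single group as follows. This is an illustration under stated assumptions rather than the library's code: `EffectSummary` and `summarize_one` are hypothetical names, and the median of an even number of values is assumed to be the average of the two middle values.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct EffectSummary {
    double min;
    double mean;
    double median;
    double max;
};

// Summarize one gene's effect sizes for one group, ignoring NaN entries.
// All summaries are NaN if no non-NaN effect is available.
inline EffectSummary summarize_one(const double* effects, std::size_t n) {
    assert(effects != nullptr || n == 0);

    std::vector<double> kept;
    kept.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        if (!std::isnan(effects[i])) {
            kept.push_back(effects[i]);
        }
    }

    const double nan = std::numeric_limits<double>::quiet_NaN();
    if (kept.empty()) {
        return EffectSummary{nan, nan, nan, nan};
    }

    std::sort(kept.begin(), kept.end());
    double total = 0;
    for (const double v : kept) {
        total += v;
    }

    const std::size_t m = kept.size();
    const double median = (m % 2 == 1) ? kept[m / 2] : (kept[m / 2 - 1] + kept[m / 2]) / 2.0;
    return EffectSummary{kept.front(), total / static_cast<double>(m), median, kept.back()};
}
```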
All choices of summary statistics are enumerated by Summary.
Gene_ | Integer type of the number of genes. |
Stat_ | Floating-point type of the statistics. |
Rank_ | Numeric type of the minimum rank. |
ngenes | Number of genes. | |
ngroups | Number of groups. | |
[in] | effects | Pointer to a 3-dimensional array containing the pairwise statistics, see ScoreMarkersPairwiseBuffers::cohens_d for the expected contents. The entry \((i, j, k)\) (i.e., effects[i * N * N + j * N + k] ) represents the effect size of gene \(i\) upon comparing group \(j\) against group \(k\). |
[out] | summaries | Vector of length equal to the number of groups. Each entry corresponds to a group and is used to store the summary statistics for that group. Each pointer in any given SummaryBuffers should either point to an array of length equal to the number of genes, or be NULL to indicate that the corresponding summary statistic should not be computed for that group. |
options | Further options. |
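The layout of `effects` follows directly from the indexing expression given for that parameter. The sketch below shows how the \(N - 1\) effect sizes for one gene and one group could be gathered from the 3-dimensional array; `effect_offset` and `effects_for_group` are hypothetical helpers, and skipping the diagonal entry (a group compared against itself) is an assumption about which entries are meaningful.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Offset of entry (gene, first, second) in the flattened array, i.e.,
// gene * N * N + first * N + second, with N the number of groups.
inline std::size_t effect_offset(std::size_t gene, std::size_t first, std::size_t second, std::size_t ngroups) {
    assert(first < ngroups && second < ngroups);
    return gene * ngroups * ngroups + first * ngroups + second;
}

// Gather the effect sizes for comparisons of `target` against every other
// group for one gene. The self-comparison on the diagonal is skipped.
inline std::vector<double> effects_for_group(const double* effects, std::size_t gene, std::size_t target, std::size_t ngroups) {
    assert(effects != nullptr);
    std::vector<double> out;
    out.reserve(ngroups > 0 ? ngroups - 1 : 0);
    for (std::size_t other = 0; other < ngroups; ++other) {
        if (other == target) {
            continue;
        }
        out.push_back(effects[effect_offset(gene, target, other, ngroups)]);
    }
    return out;
}
```

Because the second group is the fastest-changing dimension, all comparisons for a given gene and first group are contiguous in memory.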
std::vector< SummaryResults< Stat_, Rank_ > > scran_markers::summarize_effects | ( | const Gene_ | ngenes, |
const std::size_t | ngroups, | ||
const Stat_ *const | effects, | ||
const SummarizeEffectsOptions & | options ) |
Overload of summarize_effects() that allocates memory for the output summary statistics.
Gene_ | Integer type of the number of genes. |
Stat_ | Floating-point type of the statistics. |
Rank_ | Numeric type of the minimum rank. |
ngenes | Number of genes. | |
ngroups | Number of groups. | |
[in] | effects | Pointer to a 3-dimensional array containing the pairwise statistics, see ScoreMarkersPairwiseBuffers::cohens_d for the expected contents. The entry \((i, j, k)\) (i.e., effects[i * N * N + j * N + k] ) represents the effect size of gene \(i\) upon comparing group \(j\) against group \(k\). |
options | Further options. |
Returns a vector of SummaryResults, where each SummaryResults corresponds to a group and contains the summary statistics (depending on options) for that group.