scran_markers
Marker detection for single-cell data
Classes
struct | ScoreMarkersPairwiseBuffers | Buffers for score_markers_pairwise() and friends. |
struct | ScoreMarkersPairwiseOptions | Options for score_markers_pairwise() and friends. |
struct | ScoreMarkersPairwiseResults | Results for score_markers_pairwise() and friends. |
struct | ScoreMarkersSummaryBuffers | Buffers for score_markers_summary() and friends. |
struct | ScoreMarkersSummaryOptions | Options for score_markers_summary() and friends. |
struct | ScoreMarkersSummaryResults | Results for score_markers_summary() and friends. |
struct | SummarizeEffectsOptions | Options for summarize_effects(). |
struct | SummaryBuffers | Pointers to arrays to hold the summary statistics. |
struct | SummaryResults | Container for the summary statistics. |
Marker detection for single-cell data.
ScoreMarkersPairwiseResults< Stat_ > scran_markers::score_markers_pairwise | ( | const tatami::Matrix< Value_, Index_ > & | matrix, |
const Group_ * | group, | ||
const ScoreMarkersPairwiseOptions & | options | ||
) |
Overload of score_markers_pairwise() that allocates memory for the output statistics.
Stat_ | Floating-point type to store the statistics. |
Value_ | Matrix data type. |
Index_ | Matrix index type. |
Group_ | Integer type for the group assignments. |
matrix | A tatami matrix instance. | |
[in] | group | Pointer to an array of length equal to the number of columns in matrix , containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups. |
options | Further options. |
void scran_markers::score_markers_pairwise | ( | const tatami::Matrix< Value_, Index_ > & | matrix, |
const Group_ * | group, | ||
const ScoreMarkersPairwiseOptions & | options, | ||
const ScoreMarkersPairwiseBuffers< Stat_ > & | output | ||
) |
Compute the effect sizes for the pairwise comparisons between groups. This can be used to identify marker genes based on a specific comparison between two groups of interest. Alternatively, the pairwise effects can be passed to summarize_effects() to obtain summaries for each group (though it would be more efficient to use score_markers_summary() to do so).
The delta-mean is the difference in the mean expression between groups. This is fairly straightforward to interpret, where a positive delta-mean corresponds to increased expression in the first group compared to the second. The delta-mean can also be treated as the log-fold change if the input matrix contains log-transformed normalized expression values.
The delta-detected is the difference in the proportion of cells with detected expression between groups. This lies between 1 and -1, with the extremes occurring when a gene is silent in one group and detected in all cells of the other group. For this interpretation, we assume that the input matrix contains non-negative expression values, where a value of zero corresponds to lack of detectable expression.
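The two difference-based effect sizes above can be illustrated with a minimal standalone sketch. This is not the library's implementation; the helper names (`mean_expression`, `detected_proportion`, `delta_mean`, `delta_detected`) are hypothetical, and we assume that each group's values for a single gene are supplied as a plain vector of non-negative numbers where zero means "not detected".

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: mean expression of one gene within one group.
// Returns 0 for an empty group, an arbitrary choice for this sketch.
double mean_expression(const std::vector<double>& values) {
    if (values.empty()) {
        return 0.0;
    }
    double total = 0;
    for (double v : values) {
        total += v;
    }
    return total / values.size();
}

// Hypothetical helper: proportion of cells with detected (positive) expression.
// Assumes non-negative values where zero means a lack of detectable expression.
double detected_proportion(const std::vector<double>& values) {
    if (values.empty()) {
        return 0.0;
    }
    std::size_t detected = 0;
    for (double v : values) {
        if (v > 0) {
            ++detected;
        }
    }
    return static_cast<double>(detected) / values.size();
}

// Positive values indicate higher mean expression in the first group.
double delta_mean(const std::vector<double>& first, const std::vector<double>& second) {
    return mean_expression(first) - mean_expression(second);
}

// Lies in [-1, 1]; the extremes occur when a gene is detected in every cell
// of one group and in no cell of the other.
double delta_detected(const std::vector<double>& first, const std::vector<double>& second) {
    return detected_proportion(first) - detected_proportion(second);
}
```

For a gene detected in every cell of the first group and silent in the second, `delta_detected()` reaches its upper extreme of 1.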
Cohen's d is the standardized difference between two groups. This is defined as the difference in the mean for each group scaled by the average standard deviation across the two groups. (Technically, we should use the pooled variance; however, this introduces some unintuitive asymmetry depending on the variance of the larger group, so we take a simple average instead.) A positive value indicates that the gene has increased expression in the first group compared to the second. Cohen's d is analogous to the t-statistic in a two-sample t-test and avoids spuriously large effect sizes from comparisons between highly variable groups. We can also interpret Cohen's d as the number of standard deviations between the two group means.
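As a rough illustration of the definition above, the following sketch computes Cohen's d for one gene between two groups. It assumes that the denominator is the square root of the simple (unweighted) average of the two sample variances, which is our reading of "a simple average instead" of the pooled variance; the names `MeanVar`, `mean_and_variance` and `cohens_d` are hypothetical and not part of the library.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct MeanVar {
    double mean;
    double variance;
};

// Sample mean and (n - 1)-denominator variance.
// Requires at least two observations; not checked in this sketch.
MeanVar mean_and_variance(const std::vector<double>& x) {
    double mean = 0;
    for (double v : x) {
        mean += v;
    }
    mean /= x.size();

    double ss = 0;
    for (double v : x) {
        ss += (v - mean) * (v - mean);
    }
    return { mean, ss / (x.size() - 1) };
}

// Assumption: the two variances are averaged without weighting by group size,
// so neither group dominates the denominator.
// Zero variance in both groups is not handled here and would divide by zero.
double cohens_d(const std::vector<double>& first, const std::vector<double>& second) {
    MeanVar a = mean_and_variance(first);
    MeanVar b = mean_and_variance(second);
    double denom = std::sqrt((a.variance + b.variance) / 2);
    return (a.mean - b.mean) / denom;
}
```

With groups `{1, 2, 3}` and `{0, 1, 2}`, both variances are 1 and the means differ by 1, so the two group means are one standard deviation apart.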
The area under the curve (AUC) can be interpreted as the probability that a randomly chosen observation in one group is greater than a randomly chosen observation in the other group. Values greater than 0.5 indicate that a gene is upregulated in the first group. The AUC is closely related to the U-statistic used in the Wilcoxon rank sum test. The key difference between the AUC and Cohen's d is that the former is less sensitive to the variance within each group, e.g., if two distributions exhibit no overlap, the AUC is the same regardless of the variance of each distribution. This may or may not be desirable as it improves robustness to outliers but reduces the information available to obtain a highly resolved ranking.
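The probabilistic interpretation of the AUC can be sketched by brute force over all pairs of observations. This is only an illustration and not how the library computes it; the function name `auc` is hypothetical, and counting ties as half a "win" is an assumption made here so that identical distributions yield 0.5.

```cpp
#include <vector>

// Probability that a random observation from `first` exceeds a random
// observation from `second`. Ties contribute 0.5 (an assumption of this sketch).
// Both groups must be non-empty; not checked here.
double auc(const std::vector<double>& first, const std::vector<double>& second) {
    double wins = 0;
    for (double a : first) {
        for (double b : second) {
            if (a > b) {
                wins += 1;
            } else if (a == b) {
                wins += 0.5;
            }
        }
    }
    return wins / (static_cast<double>(first.size()) * second.size());
}
```

Note that `{3, 4}` versus `{1, 2}` and `{30, 400}` versus `{1, 2}` give the same AUC, as the two groups do not overlap in either case regardless of their variances.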
Setting a minimum change threshold (see ScoreMarkersPairwiseOptions::threshold) can be helpful as it prioritizes genes with large shifts in expression instead of those with low variances. Currently, only positive thresholds are supported, which focuses the analysis on genes that are upregulated in the first group compared to the second. The effect size definitions are generalized when testing against a non-zero threshold.
We report the mean expression of all cells in each group, as well as the proportion of cells with detectable expression in each group. These statistics are useful for quickly interpreting the differences in expression driving the effect size summaries.
Value_ | Matrix data type. |
Index_ | Matrix index type. |
Group_ | Integer type for the group assignments. |
Stat_ | Floating-point type to store the statistics. |
matrix | A tatami matrix instance. | |
[in] | group | Pointer to an array of length equal to the number of columns in matrix , containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups. |
options | Further options. | |
[out] | output | Collection of buffers in which to store the computed statistics. Each buffer is filled with the corresponding statistic for each group or pairwise comparison. Any of ScoreMarkersPairwiseBuffers::cohens_d , ScoreMarkersPairwiseBuffers::auc , ScoreMarkersPairwiseBuffers::delta_mean or ScoreMarkersPairwiseBuffers::delta_detected may be NULL, in which case the corresponding statistic is not computed. |
ScoreMarkersPairwiseResults< Stat_ > scran_markers::score_markers_pairwise_blocked | ( | const tatami::Matrix< Value_, Index_ > & | matrix, |
const Group_ * | group, | ||
const Block_ * | block, | ||
const ScoreMarkersPairwiseOptions & | options | ||
) |
Overload of score_markers_pairwise_blocked() that allocates memory for the output statistics.
Stat_ | Floating-point type to store the statistics. |
Value_ | Matrix data type. |
Index_ | Matrix index type. |
Group_ | Integer type for the group assignments. |
Block_ | Integer type for the block assignments. |
matrix | A tatami matrix instance. | |
[in] | group | Pointer to an array of length equal to the number of columns in matrix , containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups. |
[in] | block | Pointer to an array of length equal to the number of columns in matrix , containing the blocking factor. Block identifiers should be 0-based and should contain all integers in \([0, B)\) where \(B\) is the number of unique blocking levels. |
options | Further options. |
void scran_markers::score_markers_pairwise_blocked | ( | const tatami::Matrix< Value_, Index_ > & | matrix, |
const Group_ * | group, | ||
const Block_ * | block, | ||
const ScoreMarkersPairwiseOptions & | options, | ||
const ScoreMarkersPairwiseBuffers< Stat_ > & | output | ||
) |
Compute effect sizes for pairwise comparisons between groups, accounting for any blocking factor in the dataset. Comparisons are only performed between the groups of cells in the same level of the blocking factor. The block-specific effect sizes are then combined into a single aggregate value for output. This strategy avoids most problems related to batch effects as we never directly compare across different blocking levels.
Specifically, for each gene and each pair of groups, we obtain one effect size per blocking level. We consolidate these into a single statistic by computing the weighted mean across levels. The weight for each level is defined as the product of the weights of the two groups involved in the comparison, where each weight is computed from the size of the group using the logic described in scran_blocks::compute_weights().
A blocking level in which either group has no cells does not contribute to the weighted mean. If two groups never co-occur in the same blocking level, no effect size will be computed and a NaN is reported in the output. We do not attempt to reconcile batch effects in a partially confounded scenario.
For the mean and detected proportion in each group, we compute a weighted average of each statistic across blocks for each gene. Again, the weight for each block is defined from scran_blocks::compute_weights() on the size of the group in that block.
Value_ | Matrix data type. |
Index_ | Matrix index type. |
Group_ | Integer type for the group assignments. |
Block_ | Integer type for the block assignments. |
Stat_ | Floating-point type to store the statistics. |
matrix | A tatami matrix instance. | |
[in] | group | Pointer to an array of length equal to the number of columns in matrix , containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups. |
[in] | block | Pointer to an array of length equal to the number of columns in matrix , containing the blocking factor. Block identifiers should be 0-based and should contain all integers in \([0, B)\) where \(B\) is the number of unique blocking levels. |
options | Further options. | |
[out] | output | Collection of buffers in which to store the computed statistics. Each buffer is filled with the corresponding statistic for each group or pairwise comparison. Any of ScoreMarkersPairwiseBuffers::cohens_d , ScoreMarkersPairwiseBuffers::auc , ScoreMarkersPairwiseBuffers::delta_mean or ScoreMarkersPairwiseBuffers::delta_detected may be NULL, in which case the corresponding statistic is not computed. |
ScoreMarkersSummaryResults< Stat_, Rank_ > scran_markers::score_markers_summary | ( | const tatami::Matrix< Value_, Index_ > & | matrix, |
const Group_ * | group, | ||
const ScoreMarkersSummaryOptions & | options | ||
) |
Overload of score_markers_summary() that allocates memory for the output statistics.
Stat_ | Floating-point type to store the statistics. |
Rank_ | Numeric type to store the minimum rank. |
Value_ | Matrix data type. |
Index_ | Matrix index type. |
Group_ | Integer type for the group assignments. |
matrix | A tatami matrix instance. | |
[in] | group | Pointer to an array of length equal to the number of columns in matrix , containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups. |
options | Further options. |
void scran_markers::score_markers_summary | ( | const tatami::Matrix< Value_, Index_ > & | matrix, |
const Group_ * | group, | ||
const ScoreMarkersSummaryOptions & | options, | ||
const ScoreMarkersSummaryBuffers< Stat_, Rank_ > & | output | ||
) |
Score each gene as a candidate marker for each group of cells. Markers are identified by differential expression analyses between pairs of groups of cells (e.g., clusters, cell types). Given \(N\) groups, each group is involved in \(N - 1\) pairwise comparisons and thus has \(N - 1\) effect sizes for each gene. We summarize each group's effect sizes into a small set of descriptive statistics like the minimum, median or mean. Users can then sort genes by any of these summaries to obtain a ranking of potential markers for the group.
The choice of effect size and summary statistic determines the characteristics of the marker ranking. The effect sizes include Cohen's d, the area under the curve (AUC), the delta-mean and the delta-detected (see score_markers_pairwise()). The summary statistics include the minimum, mean, median, maximum and min-rank of the effect sizes across each group's pairwise comparisons (see summarize_effects()). For example, ranking by the delta-detected with the minimum summary will promote markers that are silent in every other group.
The behavior of this function is equivalent to, but more efficient than, calling score_markers_pairwise() followed by summarize_effects() on each array of effect sizes.
Value_ | Matrix data type. |
Index_ | Matrix index type. |
Group_ | Integer type for the group assignments. |
Stat_ | Floating-point type to store the statistics. |
Rank_ | Numeric type to store the minimum rank. |
matrix | A tatami matrix instance. | |
[in] | group | Pointer to an array of length equal to the number of columns in matrix , containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups. |
options | Further options. | |
[out] | output | Collection of buffers in which to store the computed statistics. Each buffer is filled with the corresponding statistic for each group or pairwise comparison. Any of ScoreMarkersSummaryBuffers::cohens_d , ScoreMarkersSummaryBuffers::auc , ScoreMarkersSummaryBuffers::delta_mean or ScoreMarkersSummaryBuffers::delta_detected may be empty, in which case the corresponding statistic is not computed or summarized. |
ScoreMarkersSummaryResults< Stat_, Rank_ > scran_markers::score_markers_summary_blocked | ( | const tatami::Matrix< Value_, Index_ > & | matrix, |
const Group_ * | group, | ||
const Block_ * | block, | ||
const ScoreMarkersSummaryOptions & | options | ||
) |
Overload of score_markers_summary_blocked() that allocates memory for the output statistics.
Stat_ | Floating-point type to store the statistics. |
Rank_ | Numeric type to store the minimum rank. |
Value_ | Matrix data type. |
Index_ | Matrix index type. |
Group_ | Integer type for the group assignments. |
Block_ | Integer type for the block assignments. |
matrix | A tatami matrix instance. | |
[in] | group | Pointer to an array of length equal to the number of columns in matrix , containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups. |
[in] | block | Pointer to an array of length equal to the number of columns in matrix , containing the blocking factor. Block identifiers should be 0-based and should contain all integers in \([0, B)\) where \(B\) is the number of unique blocking levels. |
options | Further options. |
void scran_markers::score_markers_summary_blocked | ( | const tatami::Matrix< Value_, Index_ > & | matrix, |
const Group_ * | group, | ||
const Block_ * | block, | ||
const ScoreMarkersSummaryOptions & | options, | ||
const ScoreMarkersSummaryBuffers< Stat_, Rank_ > & | output | ||
) |
Score potential marker genes by computing summary statistics across pairwise comparisons between groups, accounting for any blocking factor in the dataset. Comparisons are only performed between the groups of cells in the same level of the blocking factor. The block-specific effect sizes are combined into a single aggregate value per comparison, and these aggregate values are in turn summarized as described in summarize_effects(). This strategy avoids most problems related to batch effects as we never directly compare across different blocking levels.
Value_ | Matrix data type. |
Index_ | Matrix index type. |
Group_ | Integer type for the group assignments. |
Stat_ | Floating-point type to store the statistics. |
Rank_ | Numeric type to store the minimum rank. |
matrix | A tatami matrix instance. | |
[in] | group | Pointer to an array of length equal to the number of columns in matrix , containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups. |
[in] | block | Pointer to an array of length equal to the number of columns in matrix , containing the blocking factor. Block identifiers should be 0-based and should contain all integers in \([0, B)\) where \(B\) is the number of unique blocking levels. |
options | Further options. | |
[out] | output | Collection of buffers in which to store the computed statistics. Each buffer is filled with the corresponding statistic for each group or pairwise comparison. Any of ScoreMarkersSummaryBuffers::cohens_d , ScoreMarkersSummaryBuffers::auc , ScoreMarkersSummaryBuffers::delta_mean or ScoreMarkersSummaryBuffers::delta_detected may be empty, in which case the corresponding statistic is not computed or summarized. |
void scran_markers::summarize_effects | ( | Index_ | ngenes, |
size_t | ngroups, | ||
const Stat_ * | effects, | ||
const std::vector< SummaryBuffers< Stat_, Rank_ > > & | summaries, | ||
const SummarizeEffectsOptions & | options | ||
) |
Given \(N\) groups, each group is involved in \(N - 1\) pairwise comparisons and thus has \(N - 1\) effect sizes (e.g., as computed by score_markers_pairwise()). We summarize each group's effect sizes into a small set of descriptive statistics like the minimum, median or mean. Users can then sort genes by any of these summaries to obtain a ranking of potential markers for the group.
The choice of summary statistic dictates the interpretation of the ranking. Given a group \(X\):
The exact definition of "large" and "small" depends on the choice of effect size. For signed effects like Cohen's d, delta-mean and delta-detected, the value must be positive to be considered "large", and negative to be considered "small". For the AUC, a value greater than 0.5 is considered "large" and less than 0.5 is considered "small".
The interpretation above is also contingent on the threshold used (see score_markers_pairwise() for details). For positive thresholds, small effects cannot be unambiguously interpreted as downregulation, as the effect is already adjusted to account for the threshold. As a result, only large effects can be interpreted as evidence for upregulation.
NaN effect sizes are allowed, e.g., if two groups do not exist in the same block for a blocked analysis in score_markers_pairwise_blocked(). This function will ignore NaN values when computing each summary. If all effects are NaN for a particular group, the summary statistic will also be NaN.
All choices of summary statistics are enumerated by Summary.
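The NaN handling described above can be sketched for one gene and one group as follows. This is a standalone illustration rather than the library's implementation; the `EffectSummary` struct and `summarize_one` function are hypothetical names, the min-rank is omitted, and taking the median of an even number of values as the average of the two middle values is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

struct EffectSummary {
    double minimum;
    double mean;
    double median;
    double maximum;
};

// Summarize the effect sizes of one group against every other group for one gene.
// The input is taken by value so that it can be filtered and sorted locally.
EffectSummary summarize_one(std::vector<double> effects) {
    // Drop NaN effects, e.g., from comparisons that could not be performed.
    effects.erase(
        std::remove_if(effects.begin(), effects.end(), [](double x) { return std::isnan(x); }),
        effects.end()
    );

    // If nothing is left, every summary is NaN.
    if (effects.empty()) {
        const double nan = std::numeric_limits<double>::quiet_NaN();
        return { nan, nan, nan, nan };
    }

    std::sort(effects.begin(), effects.end());
    double sum = 0;
    for (double x : effects) {
        sum += x;
    }

    std::size_t n = effects.size();
    double median = (n % 2 == 1)
        ? effects[n / 2]
        : (effects[n / 2 - 1] + effects[n / 2]) / 2;

    return { effects.front(), sum / n, median, effects.back() };
}
```

Ranking genes by `minimum` would favor those with a large effect against every other group, while `maximum` only requires a large effect against at least one.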
Index_ | Integer type for the number of genes. |
Stat_ | Floating-point type for the statistics. |
Rank_ | Numeric type for the minimum rank. |
ngenes | Number of genes. | |
ngroups | Number of groups. | |
[in] | effects | Pointer to a 3-dimensional array containing the pairwise statistics, see ScoreMarkersPairwiseBuffers::cohens_d for the expected contents. The entry \((i, j, k)\) (i.e., effects[i * N * N + j * N + k] ) represents the effect size of gene \(i\) upon comparing group \(j\) against group \(k\). |
[out] | summaries | Vector of length equal to the number of groups. Each entry corresponds to a group and is used to store the summary statistics for that group. Each pointer in any given SummaryBuffers should either point to an array of length equal to the number of genes, or be NULL to indicate that the corresponding summary statistic should not be computed for that group. |
options | Further options. |
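The layout of `effects` described above can be illustrated with a small sketch that extracts, for a given gene and group, the \(N - 1\) effect sizes against the other groups. The function name `effects_for_group` is hypothetical; we only rely on the documented indexing `effects[i * N * N + j * N + k]` and assume that the self-comparison entry (`j == k`) is to be skipped.

```cpp
#include <cstddef>
#include <vector>

// Collect the effects of `group` versus every other group for `gene`,
// using the documented layout: gene is the slowest-changing dimension,
// then the first group, then the second group.
std::vector<double> effects_for_group(
    const double* effects,
    std::size_t ngroups,
    std::size_t gene,
    std::size_t group)
{
    std::vector<double> output;
    if (ngroups > 1) {
        output.reserve(ngroups - 1);
    }

    // Start of the row holding all comparisons of `group` for this gene.
    const double* row = effects + gene * ngroups * ngroups + group * ngroups;
    for (std::size_t other = 0; other < ngroups; ++other) {
        if (other != group) { // skip the comparison of a group against itself.
            output.push_back(row[other]);
        }
    }
    return output;
}
```

The resulting vector is what would be reduced to a minimum, mean, median and so on for that gene and group.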
std::vector< SummaryResults< Stat_, Rank_ > > scran_markers::summarize_effects | ( | Index_ | ngenes, |
size_t | ngroups, | ||
const Stat_ * | effects, | ||
const SummarizeEffectsOptions & | options | ||
) |
Overload of summarize_effects() that allocates memory for the output summary statistics.
Index_ | Integer type for the number of genes. |
Stat_ | Floating-point type for the statistics. |
Rank_ | Numeric type for the minimum rank. |
ngenes | Number of genes. | |
ngroups | Number of groups. | |
[in] | effects | Pointer to a 3-dimensional array containing the pairwise statistics, see ScoreMarkersPairwiseBuffers::cohens_d for the expected contents. The entry \((i, j, k)\) (i.e., effects[i * N * N + j * N + k] ) represents the effect size of gene \(i\) upon comparing group \(j\) against group \(k\). |
options | Further options. |
Returns a vector in which each SummaryResults corresponds to a group and contains the summary statistics (depending on options) for that group.