scran_markers
Marker detection for single-cell data
Loading...
Searching...
No Matches
Classes | Functions
scran_markers Namespace Reference

Marker detection for single-cell data. More...

Classes

struct  ScoreMarkersPairwiseBuffers
 Buffers for score_markers_pairwise() and friends. More...
 
struct  ScoreMarkersPairwiseOptions
 Options for score_markers_pairwise() and friends. More...
 
struct  ScoreMarkersPairwiseResults
 Results for score_markers_pairwise() and friends. More...
 
struct  ScoreMarkersSummaryBuffers
 Buffers for score_markers_summary() and friends. More...
 
struct  ScoreMarkersSummaryOptions
 Options for score_markers_summary() and friends. More...
 
struct  ScoreMarkersSummaryResults
 Results for score_markers_summary() and friends. More...
 
struct  SummarizeEffectsOptions
 Options for summarize_effects(). More...
 
struct  SummaryBuffers
 Pointers to arrays to hold the summary statistics. More...
 
struct  SummaryResults
 Container for the summary statistics. More...
 

Functions

template<typename Value_ , typename Index_ , typename Group_ , typename Stat_ >
void score_markers_pairwise (const tatami::Matrix< Value_, Index_ > &matrix, const Group_ *group, const ScoreMarkersPairwiseOptions &options, const ScoreMarkersPairwiseBuffers< Stat_ > &output)
 
template<typename Value_ , typename Index_ , typename Group_ , typename Block_ , typename Stat_ >
void score_markers_pairwise_blocked (const tatami::Matrix< Value_, Index_ > &matrix, const Group_ *group, const Block_ *block, const ScoreMarkersPairwiseOptions &options, const ScoreMarkersPairwiseBuffers< Stat_ > &output)
 
template<typename Stat_ = double, typename Value_ , typename Index_ , typename Group_ >
ScoreMarkersPairwiseResults< Stat_score_markers_pairwise (const tatami::Matrix< Value_, Index_ > &matrix, const Group_ *group, const ScoreMarkersPairwiseOptions &options)
 
template<typename Stat_ = double, typename Value_ , typename Index_ , typename Group_ , typename Block_ >
ScoreMarkersPairwiseResults< Stat_score_markers_pairwise_blocked (const tatami::Matrix< Value_, Index_ > &matrix, const Group_ *group, const Block_ *block, const ScoreMarkersPairwiseOptions &options)
 
template<typename Index_ , typename Stat_ , typename Rank_ >
void summarize_effects (Index_ ngenes, size_t ngroups, const Stat_ *effects, const std::vector< SummaryBuffers< Stat_, Rank_ > > &summaries, const SummarizeEffectsOptions &options)
 
template<typename Stat_ = double, typename Rank_ = int, typename Index_ >
std::vector< SummaryResults< Stat_, Rank_ > > summarize_effects (Index_ ngenes, size_t ngroups, const Stat_ *effects, const SummarizeEffectsOptions &options)
 
template<typename Value_ , typename Index_ , typename Group_ , typename Stat_ , typename Rank_ >
void score_markers_summary (const tatami::Matrix< Value_, Index_ > &matrix, const Group_ *group, const ScoreMarkersSummaryOptions &options, const ScoreMarkersSummaryBuffers< Stat_, Rank_ > &output)
 
template<typename Value_ , typename Index_ , typename Group_ , typename Block_ , typename Stat_ , typename Rank_ >
void score_markers_summary_blocked (const tatami::Matrix< Value_, Index_ > &matrix, const Group_ *group, const Block_ *block, const ScoreMarkersSummaryOptions &options, const ScoreMarkersSummaryBuffers< Stat_, Rank_ > &output)
 
template<typename Stat_ = double, typename Rank_ = int, typename Value_ , typename Index_ , typename Group_ >
ScoreMarkersSummaryResults< Stat_, Rank_score_markers_summary (const tatami::Matrix< Value_, Index_ > &matrix, const Group_ *group, const ScoreMarkersSummaryOptions &options)
 
template<typename Stat_ = double, typename Rank_ = int, typename Value_ , typename Index_ , typename Group_ , typename Block_ >
ScoreMarkersSummaryResults< Stat_, Rank_score_markers_summary_blocked (const tatami::Matrix< Value_, Index_ > &matrix, const Group_ *group, const Block_ *block, const ScoreMarkersSummaryOptions &options)
 

Detailed Description

Marker detection for single-cell data.

Function Documentation

◆ score_markers_pairwise() [1/2]

ScoreMarkersPairwiseResults< Stat_ > scran_markers::score_markers_pairwise ( const tatami::Matrix< Value_, Index_ > &  matrix,
const Group_ group,
const ScoreMarkersPairwiseOptions options 
)

Overload of score_markers_pairwise() that allocates memory for the output statistics.

Template Parameters
Stat_Floating-point type to store the statistics.
Value_Matrix data type.
Index_Matrix index type.
Group_Integer type for the group assignments.
Parameters
matrixA tatami matrix instance.
[in]groupPointer to an array of length equal to the number of columns in matrix, containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups.
optionsFurther options.
Returns
Object containing the pairwise effects, plus the mean expression and detected proportion in each group.

◆ score_markers_pairwise() [2/2]

void scran_markers::score_markers_pairwise ( const tatami::Matrix< Value_, Index_ > &  matrix,
const Group_ group,
const ScoreMarkersPairwiseOptions options,
const ScoreMarkersPairwiseBuffers< Stat_ > &  output 
)

Compute the effect sizes for the pairwise comparisons between groups. This can be used to identify marker genes based on a specific comparison between two groups of interest. Alternatively, the pairwise effects can be passed to summarize_effects() to obtain summaries for each group (though it would be more efficient to use `score_markers_summary() to do so).

Choice of effect sizes

The delta-mean is the difference in the mean expression between groups. This is fairly straightforward to interpret, where a positive delta-mean corresponds to increased expression in the first group compared to the second. The delta-mean can also be treated as the log-fold change if the input matrix contains log-transformed normalized expression values.

The delta-detected is the difference in the proportion of cells with detected expression between groups. This lies between 1 and -1, with the extremes occurring when a gene is silent in one group and detected in all cells of the other group. For this interpretation, we assume that the input matrix contains non-negative expression values, where a value of zero corresponds to lack of detectable expression.

Cohen's d is the standardized difference between two groups. This is defined as the difference in the mean for each group scaled by the average standard deviation across the two groups. (Technically, we should use the pooled variance; however, this introduces some unintuitive asymmetry depending on the variance of the larger group, so we take a simple average instead.) A positive value indicates that the gene has increased expression in the first group compared to the second. Cohen's d is analogous to the t-statistic in a two-sample t-test and avoids spuriously large effect sizes from comparisons between highly variable groups. We can also interpret Cohen's d as the number of standard deviations between the two group means.

The area under the curve (AUC) can be interpreted as the probability that a randomly chosen observation in one group is greater than a randomly chosen observation in the other group. Values greater than 0.5 indicate that a gene is upregulated in the first group. The AUC is closely related to the U-statistic used in the Wilcoxon rank sum test. The key difference between the AUC and Cohen's d is that the former is less sensitive to the variance within each group, e.g., if two distributions exhibit no overlap, the AUC is the same regardless of the variance of each distribution. This may or may not be desirable as it improves robustness to outliers but reduces the information available to obtain a highly resolved ranking.

With a minimum change threshold

Setting a minimum change threshold (see ScoreMarkersPairwiseOptions::threshold) can be helpful as it prioritizes genes with large shifts in expression instead of those with low variances. Currently, only positive thresholds are supported - this focuses on genes that are upregulated in the first group compared to the second. The effect size definitions are generalized when testing against a non-zero threshold.

  • Cohen's d is redefined as the standardized difference between the difference in means and the specified threshold, analogous to the TREAT method from limma. Large positive values are only obtained when the observed difference in means is significantly greater than the threshold. For example, if we had a threshold of 2 and we obtained a Cohen's d of 3, this means that the observed difference in means was 3 standard deviations greater than 2. Importantly, a negative Cohen's d cannot be intepreted as downregulation, as the difference may still be positive but less than the threshold.
  • The AUC is generalized to the probability of obtaining a random observation in one group that is greater than a random observation plus the threshold in the other group. For example, if we had a threshold of 2 and we obtained an AUC of 0.8, this means that - 80% of the time - the random observation from the first group would be greater than a random observation from the second group by 2 or more. Again, AUCs below 0.5 cannot be interpreted as downregulation, as it may be caused by a positive shift that is less than the threshold.

Other statistics

We report the mean expression of all cells in each group, as well as the proportion of cells with detectable expression in each group. These statistics are useful for quickly interpreting the differences in expression driving the effect size summaries.

Template Parameters
Value_Matrix data type.
Index_Matrix index type.
Group_Integer type for the group assignments.
Stat_Floating-point type to store the statistics.
Parameters
matrixA tatami matrix instance.
[in]groupPointer to an array of length equal to the number of columns in matrix, containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups.
optionsFurther options.
[out]outputCollection of buffers in which to store the computed statistics. Each buffer is filled with the corresponding statistic for each group or pairwise comparison. Any of ScoreMarkersPairwiseBuffers::cohens_d, ScoreMarkersPairwiseBuffers::auc, ScoreMarkersPairwiseBuffers::delta_mean or ScoreMarkersPairwiseBuffers::delta_detected may be NULL, in which case the corresponding statistic is not computed.

◆ score_markers_pairwise_blocked() [1/2]

ScoreMarkersPairwiseResults< Stat_ > scran_markers::score_markers_pairwise_blocked ( const tatami::Matrix< Value_, Index_ > &  matrix,
const Group_ group,
const Block_ block,
const ScoreMarkersPairwiseOptions options 
)

Overload of score_markers_pairwise_blocked() that allocates memory for the output statistics.

Template Parameters
Stat_Floating-point type to store the statistics.
Value_Matrix data type.
Index_Matrix index type.
Group_Integer type for the group assignments.
Block_Integer type for the block assignments.
Parameters
matrixA tatami matrix instance.
[in]groupPointer to an array of length equal to the number of columns in matrix, containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups.
[in]blockPointer to an array of length equal to the number of columns in matrix, containing the blocking factor. Block identifiers should be 0-based and should contain all integers in \([0, B)\) where \(B\) is the number of unique blocking levels.
optionsFurther options.
Returns
Object containing the pairwise effects, plus the mean expression and detected proportion in each group.

◆ score_markers_pairwise_blocked() [2/2]

void scran_markers::score_markers_pairwise_blocked ( const tatami::Matrix< Value_, Index_ > &  matrix,
const Group_ group,
const Block_ block,
const ScoreMarkersPairwiseOptions options,
const ScoreMarkersPairwiseBuffers< Stat_ > &  output 
)

Compute effect sizes for pairwise comparisons between groups, accounting for any blocking factor in the dataset. Comparisons are only performed between the groups of cells in the same level of the blocking factor. The batch-specific effect sizes are then combined into a single aggregate value for output. This strategy avoids most problems related to batch effects as we never directly compare across different blocking levels.

Specifically, for each gene and each pair of groups, we obtain one effect size per blocking level. We consolidate these into a single statistic by computing the weighted mean across levels. The weight for each level is defined as the product of the weights of the two groups involved in the comparison, where each weight is computed from the size of the group using the logic described in scran_blocks::compute_weights().

Obviously, blocking levels with no cells in either group will not contribute anything to the weighted mean. If two groups never co-occur in the same blocking level, no effect size will be computed and a NaN is reported in the output. We do not attempt to reconcile batch effects in a partially confounded scenario.

For the mean and detected proportion in each group, we compute a weighted average of each statistic across blocks for each gene. Again, the weight for each block is defined from scran_blocks::compute_weights() on the size of the group in that block.

Template Parameters
Value_Matrix data type.
Index_Matrix index type.
Group_Integer type for the group assignments.
Block_Integer type for the block assignments.
Stat_Floating-point type to store the statistics.
Parameters
matrixA tatami matrix instance.
[in]groupPointer to an array of length equal to the number of columns in matrix, containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups.
[in]blockPointer to an array of length equal to the number of columns in matrix, containing the blocking factor. Block identifiers should be 0-based and should contain all integers in \([0, B)\) where \(B\) is the number of unique blocking levels.
optionsFurther options.
[out]outputCollection of buffers in which to store the computed statistics. Each buffer is filled with the corresponding statistic for each group or pairwise comparison. Any of ScoreMarkersPairwiseBuffers::cohens_d, ScoreMarkersPairwiseBuffers::auc, ScoreMarkersPairwiseBuffers::delta_mean or ScoreMarkersPairwiseBuffers::delta_detected may be NULL, in which case the corresponding statistic is not computed.

◆ score_markers_summary() [1/2]

ScoreMarkersSummaryResults< Stat_, Rank_ > scran_markers::score_markers_summary ( const tatami::Matrix< Value_, Index_ > &  matrix,
const Group_ group,
const ScoreMarkersSummaryOptions options 
)

Overload of score_markers_pairwise() that allocates memory for the output statistics.

Template Parameters
Stat_Floating-point type to store the statistics.
Rank_Numeric type to store the minimum rank.
Value_Matrix data type.
Index_Matrix index type.
Group_Integer type for the group assignments.
Parameters
matrixA tatami matrix instance.
[in]groupPointer to an array of length equal to the number of columns in matrix, containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups.
optionsFurther options.
Returns
Object containing the summary statistics and the other per-group statistics.

◆ score_markers_summary() [2/2]

void scran_markers::score_markers_summary ( const tatami::Matrix< Value_, Index_ > &  matrix,
const Group_ group,
const ScoreMarkersSummaryOptions options,
const ScoreMarkersSummaryBuffers< Stat_, Rank_ > &  output 
)

Score each gene as a candidate marker for each group of cells. Markers are identified by differential expression analyses between pairs of groups of cells (e.g., clusters, cell types). Given \(N\) groups, each group is involved in \(N - 1\) pairwise comparisons and thus has \(N - 1\) effect sizes for each gene. We summarize each group's effect sizes into a small set of desriptive statistics like the mininum, median or mean. Users can then sort genes by any of these summaries to obtain a ranking of potential markers for the group.

The choice of effect size and summary statistic determines the characteristics of the marker ranking. The effect sizes include Cohen's d, the area under the curve (AUC), the delta-mean and the delta-detected (see score_markers_pairwise()). The summary statistics include the minimum, mean, median, maximum and min-rank of the effect sizes across each group's pairwise comparisons (see summarize_effects()). For example, ranking by the delta-detected with the minimum summary will promote markers that are silent in every other group.

This behavior of this function is equivalent to - but more efficient than - calling score_markers_pairwise() followed by summarize_effects() on each array of effect sizes.

Template Parameters
Value_Matrix data type.
Index_Matrix index type.
Group_Integer type for the group assignments.
Stat_Floating-point type to store the statistics.
Rank_Numeric type to store the minimum rank.
Parameters
matrixA tatami matrix instance.
[in]groupPointer to an array of length equal to the number of columns in matrix, containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups.
optionsFurther options.
[out]outputCollection of buffers in which to store the computed statistics. Each buffer is filled with the corresponding statistic for each group or pairwise comparison. Any of ScoreMarkersSummaryBuffers::cohens_d, ScoreMarkersSummaryBuffers::auc, ScoreMarkersSummaryBuffers::delta_mean or ScoreMarkersSummaryBuffers::delta_detected may be empty, in which case the corresponding statistic is not computed or summarized.

◆ score_markers_summary_blocked() [1/2]

ScoreMarkersSummaryResults< Stat_, Rank_ > scran_markers::score_markers_summary_blocked ( const tatami::Matrix< Value_, Index_ > &  matrix,
const Group_ group,
const Block_ block,
const ScoreMarkersSummaryOptions options 
)

Overload of score_markers_pairwise_blocked() that allocates memory for the output statistics.

Template Parameters
Stat_Floating-point type to store the statistics.
Rank_Numeric type to store the minimum rank.
Value_Matrix data type.
Index_Matrix index type.
Group_Integer type for the group assignments.
Block_Integer type for the block assignments.
Parameters
matrixA tatami matrix instance.
[in]groupPointer to an array of length equal to the number of columns in matrix, containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups.
[in]blockPointer to an array of length equal to the number of columns in matrix, containing the blocking factor. Block identifiers should be 0-based and should contain all integers in \([0, B)\) where \(B\) is the number of unique blocking levels.
optionsFurther options.
Returns
Object containing the pairwise effects, plus the mean expression and detected proportion in each group.

◆ score_markers_summary_blocked() [2/2]

void scran_markers::score_markers_summary_blocked ( const tatami::Matrix< Value_, Index_ > &  matrix,
const Group_ group,
const Block_ block,
const ScoreMarkersSummaryOptions options,
const ScoreMarkersSummaryBuffers< Stat_, Rank_ > &  output 
)

Score potential marker genes by computing summary statistics across pairwise comparisons between groups, accounting for any blocking factor in the dataset. Comparisons are only performed between the groups of cells in the same level of the blocking factor. The block-specific effect sizes are combined into a single aggregate value per comparison, which are in turn summarized as described in summarize_effects(). This strategy avoids most problems related to batch effects as we never directly compare across different blocking levels.

Template Parameters
Value_Matrix data type.
Index_Matrix index type.
Group_Integer type for the group assignments.
Stat_Floating-point type to store the statistics.
Rank_Numeric type to store the minimum rank.
Parameters
matrixA tatami matrix instance.
[in]groupPointer to an array of length equal to the number of columns in matrix, containing the group assignments. Group identifiers should be 0-based and should contain all integers in \([0, N)\) where \(N\) is the number of unique groups.
[in]blockPointer to an array of length equal to the number of columns in matrix, containing the blocking factor. Block identifiers should be 0-based and should contain all integers in \([0, B)\) where \(B\) is the number of unique blocking levels.
optionsFurther options.
[out]outputCollection of buffers in which to store the computed statistics. Each buffer is filled with the corresponding statistic for each group or pairwise comparison. Any of ScoreMarkersSummaryBuffers::cohens_d, ScoreMarkersSummaryBuffers::auc, ScoreMarkersSummaryBuffers::delta_mean or ScoreMarkersSummaryBuffers::delta_detected may be empty, in which case the corresponding statistic is not computed or summarized.

◆ summarize_effects() [1/2]

void scran_markers::summarize_effects ( Index_  ngenes,
size_t  ngroups,
const Stat_ effects,
const std::vector< SummaryBuffers< Stat_, Rank_ > > &  summaries,
const SummarizeEffectsOptions options 
)

Given \(N\) groups, each group is involved in \(N - 1\) pairwise comparisons and thus has \(N - 1\) effect sizes (e.g., as computed by score_markers_pairwise()). We summarize each group's effect sizes into a small set of desriptive statistics like the mininum, median or mean. Users can then sort genes by any of these summaries to obtain a ranking of potential markers for the group.

The choice of summary statistic dictates the interpretation of the ranking. Given a group \(X\):

  • A large mean effect size indicates that the gene is upregulated in \(X\) compared to the average of the other groups. A small value indicates that the gene is downregulated in \(X\) instead. This is a good general-purpose summary statistic for ranking, usually by decreasing size to obtain upregulated markers in \(X\).
  • A large median effect size indicates that the gene is upregulated in \(X\) compared to most (>50%) other groups. A small value indicates that the gene is downregulated in \(X\) instead. This is also a good general-purpose summary, with the advantage of being more robust to outlier effects compared to the mean. However, it also has the disadvantage of being less sensitive to strong effects in a minority of comparisons.
  • A large minimum effect size indicates that the gene is upregulated in \(X\) compared to all other groups. A small value indicates that the gene is downregulated in \(X\) compared to at least one other group. For upregulation, this is the most stringent summary as markers will only have extreme values if they are uniquely upregulated in \(X\) compared to every other group. However, it may not be effective if \(X\) is closely related to any of the groups.
  • A large maximum effect size indicates that the gene is upregulated in \(X\) compared to at least one other group. A small value indicates that the gene is downregulated in \(X\) compared to all other groups. For downregulation, this is the most stringent summary as markers will only have extreme values if they are uniquely downregulated in \(X\) compared to every other group. However, it may not be effective if \(X\) is closely related to any of the groups.
  • The "minimum rank" (a.k.a. min-rank) is defined by ranking genes based on decreasing effect size within each comparison, and then taking the smallest rank across comparisons. A minimum rank of 1 means that the gene is the top upregulated gene in at least one comparison to another group. More generally, a minimum rank of \(T\) indicates that the gene is the \(T\)-th upregulated gene in at least one comparison. Applying a threshold on the minimum rank is useful for obtaining a set of genes that, in combination, are guaranteed to distinguish \(X\) from every other group.

The exact definition of "large" and "small" depends on the choice of effect size. For signed effects like Cohen's d, delta-mean and delta-detected, the value must be positive to be considered "large", and negative to be considered "small". For the AUC, a value greater than 0.5 is considered "large" and less than 0.5 is considered "small".

The interpretation above is also contingent on the threshold used (see score_markers_pairwise() for details). For positive thresholds, small effects cannot be unambiguously interpreted as downregulation, as the effect is already adjusted to account for the threshold. As a result, only large effects can be interpreted as evidence for upregulation.

NaN effect sizes are allowed, e.g., if two groups do not exist in the same block for a blocked analysis in score_markers_pairwise_blocked(). This class will ignore NaN values when computing each summary. If all effects are NaN for a particular group, the summary statistic will also be NaN.

All choices of summary statistics are enumerated by Summary.

Template Parameters
Index_Integer type for the number of genes.
Stat_Floating-point type for the statistics.
Rank_Numeric type for the minimum rank.
Parameters
ngenesNumber of genes.
ngroupsNumber of groups.
[in]effectsPointer to a 3-dimensional array containing the pairwise statistics. The first dimension is the fastest changing, is of length equal to the number of groups, and represents the first group. The second dimension is the next fastest changing, is also of length equal to the number of groups, and represents the second group. The third dimension is the slowest changing, is of length equal to the number of genes, and represents the gene. Thus, an entry \((i, j, k)\) represents the effect size of gene \(k\) for group \(i\) against group \(j\).
[out]summariesVector of length equal to the number of groups. Each entry corresponds to a group and is used to store the summary statistics for that group. Each pointer in any given SummaryBuffers should either point to an array of length equal to the number of genes, or be NULL to indicate that the corresponding summary statistic should not be computed for that group.
optionsFurther options.

◆ summarize_effects() [2/2]

template<typename Stat_ = double, typename Rank_ = int, typename Index_ >
std::vector< SummaryResults< Stat_, Rank_ > > scran_markers::summarize_effects ( Index_  ngenes,
size_t  ngroups,
const Stat_ effects,
const SummarizeEffectsOptions options 
)

Overload of summarize_effects() that allocates memory for the output summary statistics.

Template Parameters
Index_Integer type for the number of genes.
StatFloating point type for the statistics.
Rank_Numeric type for the minimum rank.
Parameters
ngenesNumber of genes.
ngroupsNumber of groups.
[in]effectsPointer to a 3-dimensional array containing the pairwise statistics. The first dimension is the fastest changing, is of length equal to the number of groups, and represents the first group. The second dimension is the next fastest changing, is also of length equal to the number of groups, and represents the second group. The third dimension is the slowest changing, is of length equal to the number of genes, and represents the gene. Thus, an entry \((i, j, k)\) represents the effect size of gene \(k\) for group \(i\) against group \(j\).
optionsFurther options.
Returns
A vector of length equal to the number of groups. Each SummaryResults corresponds to a group and contains the summary statistics (depending on options) for that group.