mumosa
Multi-modal analyses of single-cell data
In multi-modal single-cell experiments, we obtain data of different modalities (e.g., RNA, protein) from the same set of cells. Naturally, we would like to combine data from different modalities to increase the information available for each cell in further analyses. This is most relevant to analysis steps that operate on cells, e.g., clustering, visualization with t-SNE or UMAP. The simplest combining strategy is to just concatenate the per-modality data matrices together into a single matrix for further analysis. While convenient and compatible with many downstream procedures, this is complicated by the differences in the variance between modalities. Higher noise in one modality might drown out biological signal in another modality that has lower variance.
The mumosa algorithm scales the data from each modality to equalize "uninteresting" noise prior to concatenation. First, we compute the median distance to the $k$-th nearest neighbor across all cells for each modality. This distance is used as a measure of the modality-specific variance within each cell's local neighborhood. We define a scaling factor for each modality as the ratio of the median distance for a "reference" modality to the median distance for that modality. We scale the modality's coordinates by this factor, removing differences in variance due to irrelevant factors like the scale of expression values, dimensionality, etc. We then concatenate data across modalities into a single matrix for further analysis.
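In symbols (the notation here is introduced for illustration only), let $x_{m,i}$ be the coordinates of cell $i$ in modality $m$, and let $d_{m,i}$ be the distance from that cell to its $k$-th nearest neighbor within modality $m$. Assuming every median distance is positive and the scaling factor is oriented so that each modality ends up with the reference's median distance, the procedure can be written as:

$$
\begin{aligned}
\tilde{d}_m &= \operatorname{median}_i \, d_{m,i}, \\
s_m &= \tilde{d}_{\mathrm{ref}} / \tilde{d}_m, \\
y_i &= (s_1 x_{1,i}, \; s_2 x_{2,i}, \; \ldots, \; s_M x_{M,i}).
\end{aligned}
$$

Euclidean distances scale linearly with the coordinates, so the median distance of each scaled modality is $s_m \tilde{d}_m = \tilde{d}_{\mathrm{ref}}$.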
Each modality should be represented as a low-dimensional embedding (e.g., after PCA) for more efficient neighbor searches. Given the embedding coordinates for multiple modalities, we compute the median distance to the $k$-th nearest neighbor for each modality:
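As a library-independent illustration of this step, the sketch below computes the median distance to the $k$-th nearest neighbor with a brute-force search. The function name `median_distance_to_kth_neighbor` and the column-major layout (one column of `ndim` coordinates per cell) are assumptions made for this example and are not mumosa's API. A real application would normally use an indexed neighbor search rather than the quadratic loop shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration; not part of the mumosa API.
// 'embedding' is column-major: 'ndim' coordinates for each of 'nobs' cells.
double median_distance_to_kth_neighbor(
    std::size_t ndim,
    std::size_t nobs,
    const std::vector<double>& embedding,
    std::size_t k)
{
    assert(embedding.size() == ndim * nobs);
    assert(k >= 1 && k < nobs);

    std::vector<double> kth(nobs);
    std::vector<double> dist;
    dist.reserve(nobs - 1);

    for (std::size_t i = 0; i < nobs; ++i) {
        // Euclidean distances from cell 'i' to every other cell.
        dist.clear();
        for (std::size_t j = 0; j < nobs; ++j) {
            if (i == j) {
                continue;
            }
            double d2 = 0;
            for (std::size_t d = 0; d < ndim; ++d) {
                double diff = embedding[i * ndim + d] - embedding[j * ndim + d];
                d2 += diff * diff;
            }
            dist.push_back(std::sqrt(d2));
        }

        // Partial sort so that position k - 1 holds the k-th smallest distance.
        std::nth_element(dist.begin(), dist.begin() + (k - 1), dist.end());
        kth[i] = dist[k - 1];
    }

    // Median across cells, averaging the two middle values if 'nobs' is even.
    std::size_t half = nobs / 2;
    std::nth_element(kth.begin(), kth.begin() + half, kth.end());
    double med = kth[half];
    if (nobs % 2 == 0) {
        double lower = *std::max_element(kth.begin(), kth.begin() + half);
        med = (med + lower) / 2;
    }
    return med;
}
```

Calling this once per modality yields one median distance per embedding, which is the only quantity needed to derive the scaling factors.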
We compute scaling factors for each modality:
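A minimal sketch of this calculation, assuming that all median distances are positive and that the caller chooses which modality acts as the reference. The name `compute_scaling_factors` and its `reference` parameter are hypothetical and only serve this example; they are not taken from the mumosa API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration; not part of the mumosa API.
// Returns one factor per modality such that multiplying a modality's
// coordinates by its factor gives it the reference's median distance.
std::vector<double> compute_scaling_factors(
    const std::vector<double>& median_distances,
    std::size_t reference = 0)
{
    assert(reference < median_distances.size());

    std::vector<double> scale(median_distances.size());
    const double ref = median_distances[reference];
    for (std::size_t m = 0; m < median_distances.size(); ++m) {
        // Assumption: zero distances are not handled in this sketch.
        assert(median_distances[m] > 0);
        scale[m] = ref / median_distances[m];
    }
    return scale;
}
```

For example, median distances of 2, 4 and 1 with the first modality as the reference give factors of 1, 0.5 and 2, so the noisier second modality is shrunk and the tighter third modality is expanded.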
And combine the scaled per-modality embeddings into a single matrix, which can be used for downstream steps like k-means clustering:
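The concatenation itself can be sketched as follows, again independently of the library. The function name `concatenate_scaled` and the column-major layout are assumptions for this example: each cell's column in the output holds its scaled coordinates from the first modality, followed by those from the second, and so on.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration; not part of the mumosa API.
// Each embeddings[m] is column-major with ndims[m] coordinates per cell.
std::vector<double> concatenate_scaled(
    const std::vector<std::size_t>& ndims,
    std::size_t nobs,
    const std::vector<std::vector<double>>& embeddings,
    const std::vector<double>& scale)
{
    assert(ndims.size() == embeddings.size());
    assert(ndims.size() == scale.size());

    std::size_t ntotal = 0;
    for (std::size_t n : ndims) {
        ntotal += n;
    }

    std::vector<double> combined(ntotal * nobs);
    std::size_t offset = 0;  // first output row used by the current modality
    for (std::size_t m = 0; m < ndims.size(); ++m) {
        assert(embeddings[m].size() == ndims[m] * nobs);
        for (std::size_t c = 0; c < nobs; ++c) {
            for (std::size_t d = 0; d < ndims[m]; ++d) {
                combined[c * ntotal + offset + d] =
                    embeddings[m][c * ndims[m] + d] * scale[m];
            }
        }
        offset += ndims[m];
    }
    return combined;
}
```

The result is an ordinary embedding matrix with as many rows as the sum of the per-modality dimensions, so it can be passed to any downstream step that accepts such a matrix.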
Check out the reference documentation for more details.
The premise of the mumosa approach is that the distance to the $k$-th nearest neighbor is a suitable measure of (uninteresting) variation. By quantifying the spread of cells in each local neighborhood, we capture the effects of dimensionality, scale, etc. without much contribution from biological variance. Scaling by this distance removes differences in the magnitude of noise while preserving modality-specific biological signal in the concatenated matrix. In contrast, the total variance for each embedding includes the biological heterogeneity of interest. Scaling by the total variance would therefore reduce the contribution of the most informative modalities, which is not desirable.
Ideally, the median distance-to-neighbor would serve as a proxy for the average variance within subpopulations of at least $k + 1$ cells. This provides an intuitive rationale for scaling each modality to equalize the within-population variance. However, this interpretation has several caveats:
One appeal of mumosa is its simplicity relative to other approaches, e.g., multi-modal factor analyses, intersection of simplicial sets. No further transformations beyond scaling are performed, ensuring that population structure within each modality is faithfully represented in the combined embedding. It is very easy to implement and the result is directly compatible with any downstream analysis step that can operate on an embedding matrix. In fact, we only care about the median distance so we could save even more time by only performing the neighbor search for a subset of cells.
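To illustrate the subsetting idea, the sketch below queries only every `step`-th cell while still searching for neighbors among all cells, then takes the median of those distances. This is one possible reading of the suggestion above rather than a description of what mumosa does; the function name `subsampled_median_distance`, the deterministic every-`step`-th selection and the brute-force search are all assumptions for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration; not part of the mumosa API.
// 'embedding' is column-major: 'ndim' coordinates for each of 'nobs' cells.
double subsampled_median_distance(
    std::size_t ndim,
    std::size_t nobs,
    const std::vector<double>& embedding,
    std::size_t k,
    std::size_t step)
{
    assert(embedding.size() == ndim * nobs);
    assert(k >= 1 && k < nobs);
    assert(step >= 1);

    std::vector<double> kth;
    std::vector<double> dist;
    dist.reserve(nobs - 1);

    // Only every 'step'-th cell is used as a query point...
    for (std::size_t i = 0; i < nobs; i += step) {
        dist.clear();
        // ...but its neighbors are still drawn from the full set of cells.
        for (std::size_t j = 0; j < nobs; ++j) {
            if (i == j) {
                continue;
            }
            double d2 = 0;
            for (std::size_t d = 0; d < ndim; ++d) {
                double diff = embedding[i * ndim + d] - embedding[j * ndim + d];
                d2 += diff * diff;
            }
            dist.push_back(std::sqrt(d2));
        }
        std::nth_element(dist.begin(), dist.begin() + (k - 1), dist.end());
        kth.push_back(dist[k - 1]);
    }

    // Median of the sampled distances, averaging the middle pair if needed.
    std::size_t half = kth.size() / 2;
    std::nth_element(kth.begin(), kth.begin() + half, kth.end());
    double med = kth[half];
    if (kth.size() % 2 == 0) {
        med = (med + *std::max_element(kth.begin(), kth.begin() + half)) / 2;
    }
    return med;
}
```

With `step` set to 1 this reduces to the full calculation. Larger values cut the number of queries proportionally, at the cost of a noisier estimate of the median.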
FetchContent
If you're using CMake, you just need to add something like this to your CMakeLists.txt:
Then you can link to mumosa to make the headers available during compilation:
By default, this will use FetchContent to fetch all external dependencies. Applications should consider pinning versions of all dependencies - see extern/CMakeLists.txt for suggested versions. If you want to install them manually, use -DMUMOSA_FETCH_EXTERN=OFF.
find_package()
To install the library, use:
Again, this will use FetchContent to retrieve dependencies; see the comments above.
If you're not using CMake, the simple approach is to just copy the files in include/ - either directly or with Git submodules - and include their path during compilation with, e.g., GCC's -I. This also requires the external dependencies listed in extern/CMakeLists.txt.