What’s AnnData?
If you’re familiar with AnnData, feel free to skip this section.
AnnData¹ is a Python package for handling annotated data matrices in computational biology, designed first and foremost for single-cell data. Structurally, it’s a collection of arrays and DataFrames aligned along two common dimensions: `obs`, the observations, and `var`, the variables. The central array `X` is the data matrix, and its labels are stored in the index components `obs_names` (rows) and `var_names` (columns). `layers` is a mapping from a string to another array with the same shape and index components as `X`. Layers are primarily used for storing transformations of your data, such as `log2` or `arcsinh`.
When working with AnnData, the goal is to generate some insight about your data matrix `X`. These insights are stored in the components surrounding `X`. One-dimensional features for observations and variables are stored in `obs` and `var` respectively, with `obs_names` and `var_names` as their associated indices. In practice, `obs` and `var` are Pandas DataFrames, with `obs_names` and `var_names` as their respective indices. For example, in `obs` we could store data such as clustering information, or patient-related info such as risk, while in `var` we would put data about the markers or genes we are interested in.
Multidimensional representations get stored in `obsm` and `varm`. Commonly you’ll see embeddings such as UMAP, t-SNE, and PCA in `obsm` as key-value pairs (the value being an array, just like `layers` for `X`). So `AnnData.obsm["X_umap"]` stores the UMAP embedding of the original data matrix `X`. In addition, in the realm of spatial single-cell analysis, you’d find the coordinates of each observation (usually a cell, though it remains up to the user what the smallest atomic “thing” they’re generating knowledge from is) stored here as `AnnData.obsm["spatial"]`.
When working with single-cell data, you often want to compare observations to other observations (or variables to variables). These pairwise relations are stored in `obsp` (`p` for pairwise) and `varp`. These are graph adjacency matrices, and they are also key-value stores. In practice, you might run some algorithm to figure out which cells are similar to others based on the values in `X`, and store the result in `AnnData.obsp["neighbors"]`.
The last bit, `uns`, is for anything unstructured, or anything else you feel like storing. Suppose you have a categorical column in `obs` denoting the cell type, `"CellType"`. When it comes to plotting, the colors for each cell type would live in `AnnData.uns["CellType_colors"]`, a list of colors sorted by `obs["CellType"].cat.categories`.
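To make this concrete, here’s a minimal sketch of an AnnData object with these components (the shapes, labels, and colors are made up for illustration):

```python
import anndata as ad
import numpy as np
import pandas as pd

adata = ad.AnnData(
    X=np.random.rand(3, 2),  # the data matrix: 3 observations × 2 variables
    obs=pd.DataFrame({"CellType": pd.Categorical(["B", "T", "B"])}, index=["cell_1", "cell_2", "cell_3"]),
    var=pd.DataFrame(index=["gene_a", "gene_b"]),
)
adata.layers["log2"] = np.log2(adata.X + 1)            # same shape and indices as X
adata.obsm["X_umap"] = np.random.rand(3, 2)            # per-observation embedding
adata.uns["CellType_colors"] = ["#1f77b4", "#ff7f0e"]  # one color per category in obs["CellType"]
```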
In addition, you can index an AnnData object as you would a Pandas DataFrame, with the first dimension being observations and the second being variables. Indexing with integer arguments behaves like `pandas.DataFrame.iloc`:

```python
adata[:5, :]  # Selects observations 0 through 5 (exclusive) and all variables
```

Indexing with string arguments behaves like `pandas.DataFrame.loc`:

```python
adata[:, ["Var_1", "Var_2"]]  # Selects all observations and variables Var_1 and Var_2
```
You work with the data structure on disk through HDF5³ or Zarr², which allows for neat features like lazy loading and data compression. All of this makes AnnData scalable well beyond millions of observations.
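For example, round-tripping through HDF5 looks like this (the filename is hypothetical; `backed="r"` keeps `X` on disk):

```python
import anndata as ad

adata.write_h5ad("example.h5ad")                  # persist to HDF5
adata = ad.read_h5ad("example.h5ad", backed="r")  # lazily load; X stays on disk
```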
Why Annsel?
I find one of the few things lacking with AnnData is manipulating the structure as I would with traditional one-dimensional DataFrame libraries like Pandas or Polars. Like, uh, how do I do a group by? Or filter my AnnData based on a property in the `obs` DataFrame? This does not happen often, but when I end up needing these features I stop myself and just go “Ah fuck, here we go again”.
Here’s how to filter your AnnData object on, say, the cell type column in the `obs` DataFrame:

```python
import anndata as ad

adata: ad.AnnData

# Get the names of the observations whose cell type matches
filtered_obs_names = adata.obs.index[adata.obs["CellType"].isin(["Type I", "Type II"])]
filtered_adata = adata[filtered_obs_names, :]
```
Obviously this doesn’t look too bad, because it isn’t. But if you end up with more complex queries involving multiple columns in `obs` or `var`, you’ll end up wrangling with all of those indices as well.
I made Annsel as a small tool to help make this easier. The package supports three common operations and one utility method: filter, select, and group by, plus pipe as the utility function. Pipe is inspired by the same method in Pandas and Xarray. These methods are applied to `AnnData.obs` and `AnnData.var` respectively and can be used to slice and dice your AnnData object.
Installation
You can install Annsel using `uv` or `pip`.
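For example:

```bash
uv add annsel
# or
pip install annsel
```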
How to use Annsel
Just import the package as you would with any other:

```python
import annsel as an
```

Annsel works as an extension to AnnData; it follows the accessor style that Xarray uses. An accessor is a way of adding extra functionality to an object via composition over inheritance. It provides a clear separation between the “core” API of your object and the custom “extension” API, helps to limit naming conflicts, and makes it easier to handle updates to your core package. Generally you’ll access the methods an accessor provides by, well, its name. For example you may have `MyObject.my_accessor.a_cool_function()`.
For Annsel, you can access the methods under the `an` accessor: `AnnData.an.<method_name>`. In order to select columns in the `obs` or `var` DataFrames, you use the included `an.col(*names: Iterable[str] | str)` function, where `names` is a list of column names, or a single column name as a string. Note that these are positional arguments, not keyword arguments: `an.col(names=["Cell Type I"])` will throw an exception.
The `an.obs_names` and `an.var_names` are special cases of `an.col` expressions. These are only used for the `filter` operation.

```python
an.obs_names = an.col("obs_names")
an.var_names = an.col("var_names")
```
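For example, since these are just column expressions over the index, a filter on the observation names should look something like this (a sketch; `is_in` is one of the supported expression methods, and the cell names here are made up):

```python
# Keep only the observations whose names appear in the given list
adata.an.filter(obs_names=an.obs_names.is_in(["cell_1", "cell_42"]))
```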
Overview of Supported Operations
The package contains a small dataset which can be downloaded for users to easily test it out. It consists of 31K cells and 458 genes from a bone marrow dataset⁴. You can view it on CZI’s CELLxGENE Explorer.
```python
import annsel as an

adata = an.datasets.leukemic_bone_marrow_dataset()
adata
```
```
AnnData object with n_obs × n_vars = 31586 × 458
    obs: 'Cluster_ID', 'donor_id', 'Sample_Tag', 'Cell_label', 'is_primary_data', 'organism_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'tissue_ontology_term_id', 'Genotype', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'disease_ontology_term_id', 'cell_type_ontology_term_id', 'suspension_type', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable', 'feature_is_filtered', 'Unnamed: 0', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
    uns: 'cell_type_ontology_term_id_colors', 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_bothumap', 'X_pca', 'X_projected', 'X_projectedmean', 'X_tsneni', 'X_umapni'
```
Here you can see the components of our AnnData object. Embeddings live in `obsm`, and we have one-dimensional features in `obs` and `var`. This dataset does not include pairwise observations (`obsp`) or variables (`varp`).
Selection
Let’s say we want to select the `Cell_label` column from the `AnnData.obs` DataFrame, and return a new `AnnData` object with just that column in the `obs` DataFrame.
```python
adata.an.select(obs=an.col(["Cell_label"]))
```
```
AnnData object with n_obs × n_vars = 31586 × 458
    obs: 'Cell_label'
    var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable', 'feature_is_filtered', 'Unnamed: 0', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
    uns: 'cell_type_ontology_term_id_colors', 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_bothumap', 'X_pca', 'X_projected', 'X_projectedmean', 'X_tsneni', 'X_umapni'
```
This selects the `Cell_label` column from the `AnnData.obs` DataFrame, constructs a new `AnnData` object with the subsetted `obs` DataFrame, and returns it. You’ll find this useful when you don’t need to work with all the columns in `obs` at the moment; during analysis workflows it’s not uncommon to have dozens of columns in `obs`.
We can also select multiple columns at once, even from both the `obs` and `var` DataFrames at the same time.
```python
adata.an.select(
    obs=an.col(["Cell_label", "donor_id"]),
    var=an.col(["feature_name", "feature_type"]),
)
```
```
AnnData object with n_obs × n_vars = 31586 × 458
    obs: 'donor_id', 'Cell_label'
    var: 'feature_name', 'feature_type'
    uns: 'cell_type_ontology_term_id_colors', 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_bothumap', 'X_pca', 'X_projected', 'X_projectedmean', 'X_tsneni', 'X_umapni'
```
Filter
We can filter the `AnnData` object by columns in the `obs` or `var` DataFrames.
Let’s step things up a notch. In the following example we want to filter both the `obs` and `var` DataFrames, and return this small chunk of the dataset. There isn’t really much of a reason for these filters in particular, but they’re a good illustration of writing more complex expressions.
For the observations (`obs`), we want to filter for two things:
- Select specific cell types of interest: Classical Monocytes, and CD8+CD103+ tissue resident memory T cells
- Select cells from patients who have either:
  - APL (Acute Promyelocytic Leukemia, a specific type of blood cancer), or
  - A combination of two mutations: FLT3-ITD and NPM1 (common genetic changes found in leukemia)
For the variables (`var`), we want to filter for two things:
- Select genes with high mean expression: `vst.mean >= 3`
- Select genes of specific biological types: `feature_type` is in `["IG_C_gene", "lncRNA", "protein_coding"]`
  - `IG_C_gene`: Immunoglobulin constant region genes (antibody-related)
  - `lncRNA`: Long non-coding RNA genes (regulatory RNA molecules)
  - `protein_coding`: Standard genes that code for proteins
```python
adata.an.filter(
    obs=(
        an.col(["Cell_label"]).is_in(["Classical Monocytes", "CD8+CD103+ tissue resident memory T cells"]),
        an.col(["Genotype"]).is_in(["APL", "FLT3-ITD,NPM1-mut"]),
    ),
    var=(
        an.col(["vst.mean"]) >= 3,
        an.col("feature_type").is_in(["IG_C_gene", "lncRNA", "protein_coding"]),
    ),
)
```
```
AnnData object with n_obs × n_vars = 8017 × 67
    obs: 'Cluster_ID', 'donor_id', 'Sample_Tag', 'Cell_label', 'is_primary_data', 'organism_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'tissue_ontology_term_id', 'Genotype', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'disease_ontology_term_id', 'cell_type_ontology_term_id', 'suspension_type', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
    var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable', 'feature_is_filtered', 'Unnamed: 0', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
    uns: 'cell_type_ontology_term_id_colors', 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
    obsm: 'X_bothumap', 'X_pca', 'X_projected', 'X_projectedmean', 'X_tsneni', 'X_umapni'
```
As an aside, how would we implement this without Annsel?
```python
# First create the observation (obs) filters
cell_label_filter = adata.obs["Cell_label"].isin(["Classical Monocytes", "CD8+CD103+ tissue resident memory T cells"])
genotype_filter = adata.obs["Genotype"].isin(["APL", "FLT3-ITD,NPM1-mut"])
obs_index_filter = cell_label_filter & genotype_filter

# Then create the variable (var) filters
vst_mean_filter = adata.var["vst.mean"] >= 3
feature_type_filter = adata.var["feature_type"].isin(["IG_C_gene", "lncRNA", "protein_coding"])
var_index_filter = vst_mean_filter & feature_type_filter

# Apply both indices to get the final subset
filtered_adata = adata[obs_index_filter, var_index_filter]
```
Annsel provides a structured way to perform these operations, and handles the indices for you. Realistically the main difference is just syntax and brevity.
Groupby
Groupby is supported for both the `obs` and `var` DataFrames as well. Just like in Pandas and Polars, you can also run group-bys on multiple columns for each DataFrame.
```python
groups = adata.an.groupby(obs=an.col("Cell_label"), var=an.col("feature_type"), return_group_names=True)
```
This returns a generator of `AnnData` objects, one for each group, along with a tuple of the group names, with the `obs` group first, then the `var` group.
```python
next(groups)
```

```
(('Lymphomyeloid prog',),
 ('protein_coding',),
 View of AnnData object with n_obs × n_vars = 913 × 445
     obs: 'Cluster_ID', 'donor_id', 'Sample_Tag', 'Cell_label', 'is_primary_data', 'organism_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'tissue_ontology_term_id', 'Genotype', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'disease_ontology_term_id', 'cell_type_ontology_term_id', 'suspension_type', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
     var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable', 'feature_is_filtered', 'Unnamed: 0', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
     uns: 'cell_type_ontology_term_id_colors', 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
     obsm: 'X_bothumap', 'X_pca', 'X_projected', 'X_projectedmean', 'X_tsneni', 'X_umapni')
```
```python
next(groups)
```

```
(('Lymphomyeloid prog',),
 ('lncRNA',),
 View of AnnData object with n_obs × n_vars = 913 × 7
     obs: 'Cluster_ID', 'donor_id', 'Sample_Tag', 'Cell_label', 'is_primary_data', 'organism_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'tissue_ontology_term_id', 'Genotype', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'disease_ontology_term_id', 'cell_type_ontology_term_id', 'suspension_type', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
     var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable', 'feature_is_filtered', 'Unnamed: 0', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
     uns: 'cell_type_ontology_term_id_colors', 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
     obsm: 'X_bothumap', 'X_pca', 'X_projected', 'X_projectedmean', 'X_tsneni', 'X_umapni')
...
```
How does it work?
Implementation Details
Under the hood, when you call one of these methods, Annsel relies on the DataFrame compatibility layer Narwhals. We are limited to a subset of the Polars API for interacting with our DataFrames, but in return we get much more, like a guarantee that our index won’t get messed up, and excellent static typing support.
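As a rough illustration of what Narwhals buys us, independent of Annsel (a minimal sketch):

```python
import narwhals as nw
import pandas as pd

df = pd.DataFrame({"CellType": ["Type I", "Type II", "Type I"]})

# Wrap the native DataFrame, express the logic in Polars-flavored syntax,
# then unwrap back to the same native type (here, Pandas).
filtered = nw.from_native(df, eager_only=True).filter(nw.col("CellType") == "Type I").to_native()
```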
For each of the supported methods, the general implementation is the same:
- Perform the operation on the necessary attributes of `AnnData`:
  - `AnnData.obs` (Pandas DataFrame)
  - `AnnData.var` (Pandas DataFrame)
  - `AnnData.X` (NDArray, converted to a Pandas DataFrame)
  - `AnnData.obs_names` (Pandas Index)
  - `AnnData.var_names` (Pandas Index)
- Merge the indices of the results if necessary (based on the number of expressions)
- Create a new `AnnData` object with the new indices
There are two special cases, `obs_names` and `var_names`, the Pandas Index objects of AnnData. These are only supported for the `filter` operation and require a small implementation change. Because Narwhals doesn’t directly support manipulating Pandas Index objects, they get converted to a DataFrame with `pd.Index.to_frame()` beforehand.
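That conversion is a Pandas one-liner:

```python
import pandas as pd

obs_names = pd.Index(["cell_1", "cell_2"])
obs_names.to_frame(name="obs_names")  # a one-column DataFrame holding the index values
```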
Filter Walkthrough
Let’s take a look at the implementation of the `filter` operation. Filter needs to work on the following attributes of AnnData:
- `obs`: `pd.DataFrame`
- `var`: `pd.DataFrame`
- `obs_names`: `pd.Index`, converted to a `pd.DataFrame` with `pd.Index.to_frame()`
- `var_names`: `pd.Index`, converted to a `pd.DataFrame` with `pd.Index.to_frame()`
- `X`: `NDArray`, converted to a `pd.DataFrame` with `anndata.to_df()`, with `var_names` as the columns and `obs_names` as the index
Narwhals provides the `filter` method for `DataFrame` objects.
```python
import narwhals as nw
from narwhals.typing import IntoDataFrameT


@nw.narwhalify
def _filter_df(df: IntoDataFrameT, *predicates: Predicates) -> IntoDataFrameT:
    return df.filter(*predicates)
```
As an aside, `@nw.narwhalify` is a decorator which allows us to express logic using the subset of the Polars API supported by Narwhals.
This function simply takes anything which can be interpreted as a `DataFrame` and applies `filter` with a list of predicates/expressions. We use `IntoDataFrameT` as it guarantees that the input and output will be the same type, in this case a Pandas DataFrame.
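For example, feeding it a Pandas DataFrame and a Narwhals expression gives back a Pandas DataFrame (a small sketch using the definition above):

```python
import narwhals as nw
import pandas as pd

df = pd.DataFrame({"vst.mean": [0.5, 3.2, 4.1]})
_filter_df(df, nw.col("vst.mean") >= 3)  # returns the two rows where vst.mean >= 3
```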
A Predicate is anything which can become a Narwhals Expression, or an iterable of Narwhals Expressions.
```python
from collections.abc import Iterator
from typing import Any, Protocol, TypeAlias, runtime_checkable

from narwhals.typing import IntoExpr


@runtime_checkable
class ExprCollection(Protocol):
    """Protocol for collections of predicates."""

    def __iter__(self) -> Iterator[IntoExpr]: ...


# Final recursive type
Predicates: TypeAlias = IntoExpr | ExprCollection
```
These can consist of Narwhals expressions (`nw.Expr` objects) such as `>`, `>=`, `==`, `is_in`, etc., tied to a column name.
Now that we have the general implementation of `filter` independent of the DataFrame type, we can write the specific implementations for each of the `AnnData` attributes. All of these functions look like the following: `_filter_<attribute>(adata: ad.AnnData, *predicates: Predicates) -> pd.Index`.
```python
def _filter_obs(adata: ad.AnnData, *predicates: Predicates) -> pd.Index:
    return _filter_df(adata.obs, *predicates).index


def _filter_var(adata: ad.AnnData, *predicates: Predicates) -> pd.Index:
    return _filter_df(adata.var, *predicates).index


def _filter_x(adata: ad.AnnData, *predicates: Predicates, layer: str | None = None) -> pd.Index:
    return _filter_df(adata.to_df(layer=layer), *predicates).index
```
Filtering by `X` means filtering the observations by their expression values (each predicate references a variable column); you can specify a `layer` to filter by. These functions return the index of the filtered DataFrame, which is used to construct the new `AnnData` object.
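For example, assuming the user-facing `filter` mirrors the internal `_filter` signature shown later (the gene and layer names here are hypothetical):

```python
# Keep the cells whose expression of the gene "FLT3" in the "counts" layer is at least 1
adata.an.filter(x=an.col("FLT3") >= 1, layer="counts")
```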
Next we have to account for filtering by `obs_names` and `var_names`.
```python
def _filter_obs_names(adata: ad.AnnData, *predicates: Predicates) -> pd.Index:
    return _filter_df(adata.obs_names.to_frame(name="obs_names"), *predicates).index


def _filter_var_names(adata: ad.AnnData, *predicates: Predicates) -> pd.Index:
    return _filter_df(adata.var_names.to_frame(name="var_names"), *predicates).index
```
The only way that I figured out how to filter the Pandas Index objects `obs_names` and `var_names` was to convert them to a DataFrame with their names as the column. On the user-facing side, this means that `an.obs_names` and `an.var_names` are `an.col` objects with the name of the column as the name parameter of the `an.col` function, i.e. `obs_names = an.col("obs_names")` and `var_names = an.col("var_names")`. The design decisions for `an.col` are explained in its own section.
Next we need to define a function which routes the expressions to the appropriate helper functions and applies them.
```python
import anndata as ad

from annsel.core.utils import _get_final_indices


def _filter(
    adata: ad.AnnData,
    obs: Predicates | None = None,
    var: Predicates | None = None,
    x: Predicates | None = None,
    obs_names: Predicates | None = None,
    var_names: Predicates | None = None,
    layer: str | None = None,
) -> ad.AnnData:
    obs_names_idx = []
    var_names_idx = []

    if obs:
        obs_names_idx.append(_filter_obs(adata, obs))

    if obs_names:
        obs_names_idx.append(_filter_obs_names(adata, obs_names))

    if var:
        var_names_idx.append(_filter_var(adata, var))

    if var_names:
        var_names_idx.append(_filter_var_names(adata, var_names))

    if x:
        obs_names_idx.append(_filter_x(adata, x, layer=layer))

    final_obs_idx = adata.obs_names if not obs_names_idx else _get_final_indices(adata.obs_names, obs_names_idx)
    final_var_idx = adata.var_names if not var_names_idx else _get_final_indices(adata.var_names, var_names_idx)
    ...
```
`_get_final_indices` is a simple helper function which takes the intersection of the indices.
```python
from functools import reduce
from operator import and_

import pandas as pd


def _get_final_indices(
    obj_names: pd.Index,
    *idx: pd.Index,
) -> pd.Index:
    return obj_names.intersection(pd.Index(list(reduce(and_, map(set, *idx)))))
```
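For example:

```python
import pandas as pd

obs_names = pd.Index(["a", "b", "c", "d"])
filters = [pd.Index(["a", "b", "c"]), pd.Index(["b", "c", "d"])]

# Intersect the filter results, then keep the original ordering of obs_names
_get_final_indices(obs_names, filters)  # Index(['b', 'c'], dtype='object')
```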
Now that we have the final indices for `obs_names` and `var_names`, we can construct the new, filtered `AnnData` object by indexing into it.
```python
def _filter(...):
    ...
    adata_subset = adata[final_obs_idx, final_var_idx]  # Subsets the AnnData object
    return adata_subset
```
What is `an.col()`?
`an.col` is technically not a function, but a callable class similar in functionality to `pl.col()` in Polars. The `Col` class wraps `nw.col()` from Narwhals.
It is used as a function in order to select columns from the `obs` or `var` DataFrames. An instance of `Col` is exported under the name `col`, just like in Polars, and is the preferred way to use it.
When you run `an.col("column_name")` it returns a subclassed Narwhals Expression.
```python
class Col:
    """A wrapper around the `narwhals.col` function."""

    names: Iterable[str]

    def __call__(self, *names: str | Iterable[str]) -> AnnselExpr:
        """Select columns from a DataFrame component of an AnnData object.

        Parameters
        ----------
        *names
            Names of columns to select. Can be strings or iterables of strings.

        Returns
        -------
        A wrapper around the `narwhals.Expr` class.
        """

        def column_selector(plx: Any):
            return plx.col(*flatten(names))

        return AnnselExpr(column_selector, *flatten(names))


col = Col()
```
The inner function `column_selector` in `__call__` mirrors how Narwhals creates its own column expressions in `nw.col()`:
```python
def col(*names: str | Iterable[str]) -> Expr:
    def func(plx: Any) -> Any:
        return plx.col(*flatten(names))

    return Expr(func)
```
One design decision in Narwhals which proved to be a bit tricky to work around is that `nw.DataFrame.group_by` only works with an iterable of strings, not `nw.col` expressions. The solution was to implement a custom `Expr` subclass to handle the group by method when required. The column names are stored in that `AnnselExpr` class.
For `filter` and `select`, `AnnselExpr` works just like the Narwhals `Expr` class, and for group by, the `names` attribute is accessed when Narwhals calls the `Expr` class.
```python
class AnnselExpr(nw.Expr):
    """A wrapper for the `narwhals.Expr` class."""

    def __init__(self, call: Callable[[Any], Any], *names: Iterable[str]) -> None:
        super().__init__(call)
        self.names = names


var_names = col("var_names")
obs_names = col("obs_names")
```
This way `an.obs_names = an.col("obs_names")` matches `adata.obs_names.to_frame(name="obs_names")`, and `an.var_names = an.col("var_names")` matches `adata.var_names.to_frame(name="var_names")`.
The user can work with the column expressions using the same syntax as `pl.col()`. Common operations such as `>`, `<`, `>=`, `<=`, `==`, `is_in`, and others are supported. For a full list of supported operations on Narwhals expressions, see the Narwhals documentation.
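A few examples of the expressions you can build:

```python
an.col("vst.mean") >= 3                                      # numeric comparison
an.col("Cell_label") == "Classical Monocytes"                # equality
an.col("feature_type").is_in(["lncRNA", "protein_coding"])   # membership
```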
The ideal implementation would be to allow grouping with Narwhals expressions. I’m currently giving this feature a shot, but we’ll see how it goes.
Insights & Reflections
This was pretty fun to develop and I learned quite a bit about the newer tools in Python development. Most of the packages I help maintain are still using some older toolsets, so it was nice to get a chance to use some of the newer tools. I learned the most about documentation.
What went wrong?
I overengineered this whole thing in a few ways early on. I ended up adding a bunch of stuff which I thought looked cool but was unnecessary and made my experience worse.
The first was having separate column variables for the `obs` and `var` DataFrames (`obs_col`, `var_col`), which allowed more flexibility in mixing and matching predicates. You were able to do something like the following:
```python
adata.an.select(an.obs_col(["Cell_label"]), an.var_col(["feature_name"]), an.obs_col(["donor_id"]))
```
This worked without needing to specify the `obs` or `var` keyword argument. While it allowed flexible ordering of predicates, it made implementing the `filter` and `groupby` operations much harder.
I also thought I would try my hand at using ABC methods, since I haven’t really worked with them in Python. This proved to be rather troublesome, and I ended up ditching it entirely in favor of explicit, custom implementations per function. I did it because I thought it would be cool, and it seemed handy if I wanted a skeleton for all the operations.
Reflections on the Toolbox
This subsection briefly covers the tools and packages I used to develop Annsel.
Scverse provides a great template for scientific Python packages, called cookiecutter-scverse. It was wonderful to use and required minimal tweaking to get it to where I wanted it to be. If you’re developing a scientific Python package, I recommend using it even if your package isn’t related to the scverse ecosystem. Unless you prefer using MkDocs over Sphinx.
Documentation
I haven’t really done extensive and aesthetic documentation like this before, so it was a great learning experience. Jupyter Notebooks for tutorials were great, as I could run them in the docs and push outputs alongside the code to Read the Docs. I modified the dark mode to be pure black, which looks pretty decent. When writing accessors, the Sphinx extension `sphinx_autosummary_accessors` was a necessity.
And since this is a relatively simple package, I took a page from Patrick Kidger’s All of Equinox tutorial and briefly went over all the features of the package in a single tutorial.
I thought it would be pretty cool if users could try the package out without having to install it. For example, Xarray has a nice WASM interactive sandbox where you can play around with the API. I wanted to do something similar using Marimo, but as a guided interactive notebook to supplement the documentation. For this use case, I do not believe Marimo is ready for nontrivial work, as you cannot explicitly pin the dependencies of the WASM-HTML notebooks you create and deploy. I had to use a workaround, installing the package with `micropip`, and even then this caused issues, as `annsel` requires a more recent version of Narwhals than what Marimo initializes with. So you uninstall Narwhals with micropip, and then reinstall it again with micropip, then…
```python
import micropip

micropip.uninstall("narwhals")
await micropip.install(["narwhals==1.25.0", "anndata", "lzma", "annsel"], deps=True)
```
And you end up with an error loading the example dataset, because Pyodide doesn’t support HTTP `GET` requests, so we can’t download it. Oh well, the workaround is simple: just manually create an RNG-powered `AnnData` object with a random `X`, and `obs` and `var` DataFrames with categorical columns.
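A minimal sketch of such a fallback dataset (the sizes and labels are made up):

```python
import anndata as ad
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_obs, n_var = 200, 50

adata = ad.AnnData(
    X=rng.poisson(1.0, size=(n_obs, n_var)).astype(np.float32),
    obs=pd.DataFrame(
        {"Cell_label": pd.Categorical(rng.choice(["Type I", "Type II"], size=n_obs))},
        index=[f"cell_{i}" for i in range(n_obs)],
    ),
    var=pd.DataFrame(
        {"feature_type": pd.Categorical(rng.choice(["protein_coding", "lncRNA"], size=n_var))},
        index=[f"gene_{i}" for i in range(n_var)],
    ),
)
```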
And, finally… you can’t even run filter, due to an import error with `narwhals.utils`. I have no idea why this was happening. I spent a few days trying to figure it out, and I just decided to scrap it and wait it out for a bit.
Notable Packages
- `narwhals` is the workhorse of Annsel. It provides a nice API and has good documentation. It’s also seeing more and more adoption in the Scientific Python ecosystem, as several packages are becoming DataFrame-agnostic (i.e. bring your own DataFrame library, like Polars, Dask, cuDF, DuckDB, Pandas, and more). Packages such as `altair` and `plotly` have been using Narwhals, and recently `formulaic` has started implementing support for it.
- `pooch` is a package for downloading and caching files from the internet. It has a simple API which made downloading the CELLxGENE dataset extremely easy. It’s pretty incredible, to be honest.
Build Systems, CI and Package Managers
- `uv`: I switched off of Conda, pyenv, poetry, and whatever else I’ve been using for the past few years to use `uv`, and I haven’t looked back since. The CLI is quite intuitive.
- `ruff`: A well developed linter and formatter for Python that is known for being fast (and it is, yeah). Besides performance (which doesn’t really matter for a small package like this), the real significance is taking the most recent decade of Python linting and formatting and making it “just work” under one tool with minimal dependencies that is extremely easy to configure and use.
  - I’ve moderately customized the configuration with the Pandas (`"PD"`) and NumPy (`"NP"`) linting rules.
  - I’m looking forward to their new static type checker, codenamed `red-knot`, as well.
- GitHub Actions: Perfectly good, mainly as there is just so much documentation and so many large Python packages using it to refer to. I did shoot myself in the foot a few times with `uv`, such as when I was installing only the packages required for building (not all extras). This caused the lock file to change, and after a while I learned you can freeze the lock file so it doesn’t change: adding the `--frozen` flag to `uv sync` fixed my issue.
- `hatch`: Works well. I didn’t really mess with the build system much since there isn’t much of a need. The best feature is being able to assign scripts to different environments; my favorite is the one for building and cleaning the docs for local development, which comes standard with the `scverse-cookiecutter` template. I’m still not sure how to properly integrate `hatch` and `uv`: in CI I use uv for running tests, but I know this can be done with hatch as well. For me, they exist as two separate, somewhat complementary tools.
- `git-cliff`: A pretty cool utility to generate a changelog. While I haven’t gotten it to commit the generated changelog in CI yet, I do run it locally and it’s great, as it autogenerates the changelog from your commit history.
- `gitmoji`: This is just a silly tool. You can use it to add emojis to your commit messages, and it comes with plenty of preset ones. Quite fun.
- `pre-commit`: It’s good. Just make sure pre-commit also runs as part of your CI workflow. I know some people don’t like it, but it’s fairly common in the Scientific Python ecosystem and doesn’t make my developer experience worse.
What’s next?
AnnData currently supports Dask, so next on the chopping block is adding support for it. In addition, I might as well expand the accessor to other scverse data objects such as MuData⁵ and SpatialData⁶.
I’m thinking of taking Annsel and developing something useful with it. One idea is a small plotting library to replicate `scanpy`’s plots. I quite like the declarative nature of marsilea⁷, and `scanpy` already has a tutorial on using it with AnnData. It shouldn’t be too difficult, but famous last words, I guess.
Out of Scope features
There are no plans for aggregations for now, as I’m not sure what their use case would be for `AnnData` objects. In addition, the scope will be limited to the `obs`, `var`, and `X` attributes. Applying these methods to `obsm` and `varm` is out of scope, since those can be any array-api-compatible data structure; they can be DataFrames, NumPy arrays, Xarrays, and more. `uns` also cannot be supported, since it can be almost anything.
Footnotes
1. Virshup, I., et al. (2024). anndata: Access and store annotated data matrices. Journal of Open Source Software, 9(101), 4371. https://doi.org/10.21105/joss.04371
2. Miles, A., Kirkham, J., Durant, M., et al. (2020). zarr-developers/zarr-python: v2.4.0. Zenodo. https://doi.org/10.5281/zenodo.3773450
3. The HDF Group. Hierarchical Data Format, version 5 [Computer software]. https://github.com/HDFGroup/hdf5
4. Triana, S., Vonficht, D., Jopp-Saile, L., et al. (2021). Single-cell proteo-genomic reference maps of the hematopoietic system enable the purification and massive profiling of precisely defined cell states. Nat Immunol 22, 1577-1589. https://doi.org/10.1038/s41590-021-01059-0
5. Bredikhin, D., Kats, I., & Stegle, O. (2022). MUON: multimodal omics analysis framework. Genome Biology. https://doi.org/10.1186/s13059-021-02577-8
6. Marconato, L., Palla, G., Yamauchi, K.A., et al. (2024). SpatialData: an open and universal data framework for spatial omics. Nat Methods. https://doi.org/10.1038/s41592-024-02212-x
7. Zheng, Y., Zheng, Z., Rendeiro, A., & Cheung, E. Marsilea [Computer software]. https://github.com/Marsilea-viz/marsilea