skip to content
Logo curved-cluster/srivarra

Devlog: AnnData Selection, Filtering and Groupbys

23 min read
Annsel Schema

What’s AnnData?

If you’re familiar with AnnData, feel free to skip this section.

AnnData is a Python Package for handling annotated data matrices in computational biology. It’s best designed to handle single cell data. Structurally it’s a collection of arrays and DataFrames aligned with two common dimensions: obs the observations, and var the variables. The center green Array X is the Data Matrix, and the labels are stored in Index components obs_names (rows) and var_names (columns). For X, layers is a mapping from a string to another array the same shape as X with the same index components. Layers are primarily used for storing transformations of your data, such as log2 or arcsinh.

X Matrix Diagram
Observation Diagram
Variable Diagram

When working with AnnData, the goal is to generate some insight about your Data Matrix X. These are stored in the accompanying components surrounding X. One Dimensional features for observations and variables are stored in obs and var respectively, with their associated indices being obs_names and var_names. In practice obs and var are Pandas DataFrames, with obs_names and var_names as their respective indices. For example in obs we could store data such as clustering information, or patient related info such as risk. While in var we would put data about the markers or genes we are interested in.

Multidimensional Observation Diagram
Multidimensional Variable Diagram

Multidimensional representations get stored in obsm and varm. Commonly you’ll see embeddings such as UMAP, TSNE and PCA in obsm with a key-value pair (the value being the array, just like layers for X). So AnnData.obsm["X_umap"] stores the UMAP embedding of the original Data Matrix X. In addition in the realm of Spatial Single Cell Analysis, you’d find the coordinates of the observation (usually a cell, but it remains up to the user what the smallest atomic “thing” they’re generating knowledge from) store here as AnnData.obsm["spatial"].

Pairwise Observation Diagram
Pairwise Variable Diagram

When working with single cell data, often you want to compare observations to other observations (or variables to variables). These pairwise relations are stored in obsp, p for pairwise, and varp. These are graph adjacency matrices. And they are also a key-value store. In practice, you might run some algorithm to figure out which cells are similar to others based on the values in X, stored in AnnData.obsp["neighbors"].

The last bit, uns is for anything unstructured, or anything else you feel like. Suppose you have a categorical column in obs denoting the cell type "CellType". When it comes to plotting, the colors for each cell type in "CellType" would be in AnnData.uns["CellType_colors"] which contains a list of colors, sorted by obs["CellType"].cat.categories.

In addition you can index an AnnData object as you would with a Pandas DataFrame with the first dimension being observations, and the second variables.

Indexing with integer arguments behaves like Pandas.DataFrame.iloc:

adata[:5, :] # Selects observations 0 through 5 exclusive and all variables

Indexing with string arguments behaves like Pandas.DataFrame.loc

adata[:, ["Var_1", "Var_2"]] # Selects all observations and variables Var_1 and Var_2

You work with the data structure on disk with HDF52 or Zarr3 which allows for neat features like lazy loading and data compression. All of this leads to it being scalable well beyond millions of observations.

Why Annsel?

I find one of the few things lacking with AnnData is manipulating the structure as I would with traditional one dimensional Dataframes like Pandas or Polars. Like, uh how do I do a group by? Or filter my AnnData based on a property in the obs DataFrame? This does not happen often, but when end up needing these features I stop my self and just go “Ah fuck, here we go again”.

Here’s how to filter your AnnData object on, say the cell type column in the obs DataFrame.

import anndata as ad
adata: ad.AnnData
filtered_obs_names = adata.obs["CellType"].isin(["Type I", "Type II"]).index
filtered_adata = adata[filtered_obs_names, :]

Obviously this doesn’t look too bad because it isn’t. If you end up with more complex queries involving multiple columns in obs or vars, then you’ll end up wrangling with all of those indices as well.

I made Annsel as a small tool to help make this easier. This package supports 3 common operations and one utility method. We have filter, select and group by, and pipe as the utility function. Pipe is inspired by the same method in Pandas and Xarray. These methods are applied to the AnnData.obs and AnnData.var respectively and can be used to slice and dice your AnnData object.

Installation

You can install Annsel using uv or pip.

How to use Annsel

Just import the package as you would with any other.

import annsel as an

Annsel works as an extension to AnnData, it follows the accessor style that Xarray uses. An accessor is a way of adding extra functionality to an object via composition over inheritance. It provides a clear separation between the “core” API of your object and the custom “extension” API. It helps to limit naming conflicts and makes it easier to handle updates with your core package. Generally you’ll access the methods an accessor provides by, well it’s name. For example you may have MyObject.my_accessor.a_cool_function().

For Annsel, you can access the methods under the an accessor, AnnData.an.<method_name>. In order to select columns in the obs or var DataFrames, you use the included an.col(*names: Iterable[str] | str) function, where names is a list of column names, or a single column name as a string. Note that they are positional arguments, not keyword arguments. an.col(names=["Cell Type I"]) will throw an exception.

The an.obs_names and an.var_names are special cases of an.col Expressions. These are only used for the filter operation.

an.obs_names = an.col("obs_names")
an.var_names = an.col("var_names")

Overview of Supported Operations

The package contains a small dataset which can be downloaded for users to easily test it out. It consists of 31K cells and 458 genes from a bone marrow dataset 4. You can view it on CZI’s CELLxGENE Explorer.

import annsel as an
adata = an.datasets.leukemic_bone_marrow_dataset()
adata
AnnData object with n_obs × n_vars = 31586 × 458
obs: 'Cluster_ID', 'donor_id', 'Sample_Tag', 'Cell_label', 'is_primary_data', 'organism_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'tissue_ontology_term_id', 'Genotype', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'disease_ontology_term_id', 'cell_type_ontology_term_id', 'suspension_type', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable', 'feature_is_filtered', 'Unnamed: 0', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
uns: 'cell_type_ontology_term_id_colors', 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
obsm: 'X_bothumap', 'X_pca', 'X_projected', 'X_projectedmean', 'X_tsneni', 'X_umapni'

Here you can see the components of our AnnData object. Embeddings live in obsm, and we have one dimensional features in obs and var. This dataset does not include pairwise observations (obsp) and variables (varp).

Selection

Let’s say we want to select the Cell_label column from the AnnData.obs DataFrame, and return a new AnnData object with the just that column in the obs DataFrame.

adata.an.select(obs=an.col(["Cell_label"]))
AnnData object with n_obs × n_vars = 31586 × 458
obs: 'Cell_label'
var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable', 'feature_is_filtered', 'Unnamed: 0', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
uns: 'cell_type_ontology_term_id_colors', 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
obsm: 'X_bothumap', 'X_pca', 'X_projected', 'X_projectedmean', 'X_tsneni', 'X_umapni'

This selects the Cell_label column from the AnnData.obs DataFrame, constructs a new AnnData object with the subsetted obs DataFrame and returns it back. You’d find this useful if you don’t need to work with all the columns in obs at the moment. During analysis workflows it’s not uncommon to have dozens of columns in obs.

We can also select multiple columns at once, even from both obs and var DataFrames at the same time.

adata.an.select(obs=an.col(["Cell_label", "donor_id"]), var=an.col(["feature_name", "feature_type"]))
AnnData object with n_obs × n_vars = 31586 × 458
obs: 'donor_id', 'Cell_label'
var: 'feature_name', 'feature_type'
uns: 'cell_type_ontology_term_id_colors', 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
obsm: 'X_bothumap', 'X_pca', 'X_projected', 'X_projectedmean', 'X_tsneni', 'X_umapni'

Filter

We can filter the AnnData object by columns in the obs or var DataFrames.

Let’s step things up a notch. In the following example we want to filter both the obs and var DataFrames, and return this small chunk of the dataset. There isn’t really much of a reason for these filters in particular, but it’s a good illustration of writing more complex expressions.

For the observations (obs), we want to filter for two things:

  • Select specific cell types of interest: Classical Monocytes, and CD8+CD103+ tissue resident memory T cells
  • Select cells from patients who have either:
    • APL (Acute Promyelocytic Leukemia - a specific type of blood cancer), or
    • A combination of two mutations: FLT3-ITD and NPM1 (common genetic changes found in leukemia)

For the variables (var), we want to filter for two things:

  • Select genes with high mean expression: vst.mean >= 3
  • Select genes of specific biological types: feature_type is in ["IG_C_gene", "lncRNA", "protein_coding"]
    • IG_C_gene: Immunoglobulin constant region genes (antibody-related)
    • lncRNA: Long non-coding RNA genes (regulatory RNA molecules)
    • protein_coding: Standard genes that code for proteins
adata.an.filter(
obs=(
an.col(["Cell_label"]).is_in(["Classical Monocytes", "CD8+CD103+ tissue resident memory T cells"]),
an.col(["Genotype"]).is_in(["APL", "FLT3-ITD,NPM1-mut"]),
),
var=(an.col(["vst.mean"]) >= 3, an.col("feature_type").is_in(["IG_C_gene", "lncRNA", "protein_coding"])),
)
AnnData object with n_obs × n_vars = 8017 × 67
obs: 'Cluster_ID', 'donor_id', 'Sample_Tag', 'Cell_label', 'is_primary_data', 'organism_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'tissue_ontology_term_id', 'Genotype', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'disease_ontology_term_id', 'cell_type_ontology_term_id', 'suspension_type', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable', 'feature_is_filtered', 'Unnamed: 0', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
uns: 'cell_type_ontology_term_id_colors', 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
obsm: 'X_bothumap', 'X_pca', 'X_projected', 'X_projectedmean', 'X_tsneni', 'X_umapni'

As an aside, how would we implement this without Annsel?

Without Annsel, standard AnnData implementation
# First create the observation (obs) filters
cell_label_filter = adata.obs["Cell_label"].isin(["Classical Monocytes", "CD8+CD103+ tissue resident memory T cells"])
genotype_filter = adata.obs["Genotype"].isin(["APL", "FLT3-ITD,NPM1-mut"])
obs_index_filter = cell_label_filter & genotype_filter
# Then create the variable (var) filters
vst_mean_filter = adata.var["vst.mean"] >= 3
feature_type_filter = adata.var["feature_type"].isin(["IG_C_gene", "lncRNA", "protein_coding"])
var_index_filter = vst_mean_filter & feature_type_filter
# Apply both indices to get the final subset
filtered_adata = adata[obs_index_filter, var_index_filter]

Annsel provides a structured way to perform these operations, and handles the indices for you. Realistically the main difference is just syntax and brevity.

Groupby

Groupby is supported for both the obs and var DataFrames as well. Just like in Pandas and Polars you can also run groupby’s on multiple columns for each DataFrame.

groups = adata.an.groupby(obs=an.col("Cell_label"), var=an.col("feature_type"), return_group_names=True)

This returns a generator of AnnData objects, one for each group. along with a tuple of the group names, with the obs group first, then var group.

next(groups)
(('Lymphomyeloid prog',), ('protein_coding',), View of AnnData object with n_obs × n_vars = 913 × 445
obs: 'Cluster_ID', 'donor_id', 'Sample_Tag', 'Cell_label', 'is_primary_data', 'organism_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'tissue_ontology_term_id', 'Genotype', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'disease_ontology_term_id', 'cell_type_ontology_term_id', 'suspension_type', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable', 'feature_is_filtered', 'Unnamed: 0', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
uns: 'cell_type_ontology_term_id_colors', 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
obsm: 'X_bothumap', 'X_pca', 'X_projected', 'X_projectedmean', 'X_tsneni', 'X_umapni')
next(groups)
(('Lymphomyeloid prog',), ('lncRNA',), View of AnnData object with n_obs × n_vars = 913 × 7
obs: 'Cluster_ID', 'donor_id', 'Sample_Tag', 'Cell_label', 'is_primary_data', 'organism_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'tissue_ontology_term_id', 'Genotype', 'development_stage_ontology_term_id', 'sex_ontology_term_id', 'disease_ontology_term_id', 'cell_type_ontology_term_id', 'suspension_type', 'tissue_type', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage', 'observation_joinid'
var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable', 'feature_is_filtered', 'Unnamed: 0', 'feature_name', 'feature_reference', 'feature_biotype', 'feature_length', 'feature_type'
uns: 'cell_type_ontology_term_id_colors', 'citation', 'default_embedding', 'schema_reference', 'schema_version', 'title'
obsm: 'X_bothumap', 'X_pca', 'X_projected', 'X_projectedmean', 'X_tsneni', 'X_umapni')
...

How does it work?

Implementation Details

Under the hood, when you call one of these methods, Annsel relies on the DataFrame compatibility layer Narwhals. We are limited to a subset of the Polars API for interacting with our DataFrames, but in return we get so much more like a guarantee of our index not getting messed up excellent static typing support.

For each of the supported methods, the general implementation is the same:

  1. Perform the operation on the necessary attributes of AnnData:
    • AnnData.obs (Pandas DataFrame)
    • AnnData.var (Pandas DataFrame)
    • AnnData.X (NDArray converted to a Pandas DataFrame)
    • AnnData.obs_names (Pandas Index)
    • AnnData.var_names (Pandas Index)
  2. Merge the indices of the results if necessary (based on the number of expressions)
  3. Create a new AnnData object with the new indices

There are two special cases, obs_names and var_names, the Pandas Index objects of AnnData. These are only supported for the filter operation and require a small implementation change. Because Narwhals doesn’t directly support manipulating Pandas Index objects, they get converted to a DataFrame with pd.Index.to_frame() beforehand.

Filter Walkthrough

Lets take a look at the implementation of the filter operation. Filter needs to work on the following attributes of AnnData:

  • obs: pd.DataFrame
  • var: pd.DataFrame
  • obs_names: pd.Index, converted to a pd.DataFrame with pd.Index.to_frame()
  • var_names: pd.Index, converted to a pd.DataFrame with pd.Index.to_frame()
  • X: NDArray, converted to a pd.DataFrame with anndata.to_df() with var_names as the columns, and obs_names as the index.

Narwhals provides the filter method for DataFrame objects.

annsel/tl/_filter.py | Main Filtering Function
import narwhals as nw
from narwhals.typing import IntoDataFrameT
@nw.narwhalify
def _filter_df(df: IntoDataFrameT, *predicates: Predicates) -> IntoDataFrameT:
return df.filter(*predicates)

As an aside, @nw.narwhalify is a decorator which allows us to express logic using the subset of the Polars API supported by Narwhals.

This function simply takes anything which can be interpreted as a DataFrame and applies filter with a list of predicates / expressions. We use the IntoDataFrameT as it guarantees that the input and output will be the same type. In this case a Pandas DataFrame.

A Predicate is anything which can become a Narwhals Expression, or an iterable of Narwhals Expressions.

annsel/core/typing.py | Type Aliases
from collections.abc import Iterator
from typing import Any, Protocol, TypeAlias, runtime_checkable
from narwhals.typing import IntoExpr
@runtime_checkable
class ExprCollection(Protocol):
"""Protocol for collections of predicates."""
def __iter__(self) -> Iterator[IntoExpr]: ...
# Final recursive type
Predicates: TypeAlias = IntoExpr | ExprCollection

These can consists of Narwhals expressions nw.Expr objects such as >, >=, ==, is_in, etc… tied to a column name.

Now that we have the general implementation of filter independent of the DataFrame type, we can implement the specific implementations for each of the AnnData attributes. All of these functions look like the following: _filter_<attribute>(adata: ad.AnnData, *predicates: Predicates) -> pd.Index.

annsel/tl/_filter.py | DataFrame Filtering
def _filter_obs(adata: ad.AnnData, *predicates: Predicates) -> pd.Index:
return _filter_df(adata.obs, *predicates).index
def _filter_var(adata: ad.AnnData, *predicates: Predicates) -> pd.Index:
return _filter_df(adata.var, *predicates).index
def _filter_x(adata: ad.AnnData, *predicates: Predicates, layer: str | None = None) -> pd.Index:
return _filter_df(adata.to_df(layer=layer), *predicates).index

Filtering by X means to filter the variables by their expression values. You can specify a layer to filter by. These functions return the index of the filtered DataFrame which are use to construct the new AnnData object.

Next we have to account for filtering by obs_names and var_names.

annsel/tl/_filter.py | Index Filtering
def _filter_obs_names(adata: ad.AnnData, *predicates: Predicates) -> pd.Index:
return _filter_df(adata.obs_names.to_frame(name="obs_names"), *predicates).index
def _filter_var_names(adata: ad.AnnData, *predicates: Predicates) -> pd.Index:
return _filter_df(adata.var_names.to_frame(name="var_names"), *predicates).index

The only way that I figured out how to filter the Pandas Index objects obs_names and var_names was to convert them to a DataFrame with their names as the column. On the user facing side, this means that an.obs_names and an.var_names are an.col objects with the name of the column as the name parameter of the an.col function, i.e. obs_names = an.col(names="obs_names") and var_names = an.col(names="var_names"). The design decisions for an.col are explained in it’s own section.

Next we need to define a function which handles and applies the expressions to the necessary helper functions.

annsel/tl/_filter.py | Main Filtering Function
import anndata as ad
from annsel.core.utils import _get_final_indices
def _filter(
adata: ad.AnnData,
obs: Predicates | None = None,
var: Predicates | None = None,
x: Predicates | None = None,
obs_names: Predicates | None = None,
var_names: Predicates | None = None,
layer: str | None = None,
) -> ad.AnnData:
obs_names_idx = []
var_names_idx = []
if obs:
obs_names_idx.append(_filter_obs(adata, obs))
if obs_names:
obs_names_idx.append(_filter_obs_names(adata, obs_names))
if var:
var_names_idx.append(_filter_var(adata, var))
if var_names:
var_names_idx.append(_filter_var_names(adata, var_names))
if x:
obs_names_idx.append(_filter_x(adata, x, layer=layer))
final_obs_idx = adata.obs_names if not obs_names_idx else _get_final_indices(adata.obs_names, obs_names_idx)
final_var_idx = adata.var_names if not var_names_idx else _get_final_indices(adata.var_names, var_names_idx)
...

_get_final_indices is a simple helper function which takes the intersection of the indices.

annsel/core/utils.py | Get Final Indices
from functools import reduce
from operator import and_
import pandas as pd
def _get_final_indices(
obj_names: pd.Index,
*idx: pd.Index,
) -> pd.Index:
return obj_names.intersection(pd.Index(list(reduce(and_, map(set, *idx)))))

Now that we have the final indices for obs_names and var_names, we can construct the new filtered, AnnData object by indexing it.

annsel/core/utils.py | Construct New AnnData
def _filter(...):
...
adata_subset = adata[final_obs_idx, final_var_idx] # Subsets the AnnData object
return adata_subset

What is an.col()?

an.col is a technically not a function, but a callable class similar in functionality to pl.col() in Polars. The Col class implements nw.col() in Narwhals. It is used as a function in order to select columns from the obs or var DataFrames. An instance of Col is exported under the name col, just like in Polars, and is the preferred way to use it.

When you run an.col("column_name") it returns a subclassed Narwhals Expression.

annsel/core/expr.py | Col Class
class Col:
"""A wrapper around the `narwhals.col` function."""
names: Iterable[str]
def __call__(self, *names: str | Iterable[str]) -> AnnselExpr:
"""
Select columns from a DataFrame component of an AnnData object.
Parameters
----------
*names
Names of columns to select. Can be strings or iterables of strings.
Returns
-------
A wrapper around the `narwhals.Expr` class.
"""
def column_selector(plx: Any):
return plx.col(*flatten(names))
return AnnselExpr(column_selector, *flatten(names))
col = Col()

The inner method column_selector in __call__ is how Narwhals creates it’s column expressions in nw.col().

narwhals/expr.py | Narwhals Col
def col(*names: str | Iterable[str]) -> Expr:
def func(plx: Any) -> Any:
return plx.col(*flatten(names))
return Expr(func)

One design decision in Narwhals which proved to be a bit tricky to work around is that nw.DataFrame.group_by only works with an iterable of strings, and not nw.col expressions. A solution was to implement a custom Expr subclass to handle the group by method when required. The column names are stored in that AnnselExpr class. For the filter and select the AnnselExpr works just like the Narwhals Expr class, and for group by the names attribute is accessed when Narwhals calls the Expr class.

annsel/core/expr.py | AnnselExpr Class
class AnnselExpr(nw.Expr):
"""A wrapper for the `narwhals.Expr` class."""
def __init__(self, call: Callable[[Any], Any], *names: Iterable[str]) -> None:
super().__init__(call)
self.names = names
var_names = col("var_names")
obs_names = col("obs_names")

This way an.obs_names = an.col("obs_names") matches with adata.obs_names.to_frame(name="obs_names") and an.var_names = an.col("var_names") matches with adata.var_names.to_frame(name="var_names").

The user can work with the column expressions with the same syntax as pl.col(). Common operations such as >, <, >=, <=, ==, is_in and others are supported. For a full list of supported operations on Narwhals expressions, see the Narwhals documentation.

The ideal implementation would be to allow grouping with Narwhals expressions. I’m currently giving this feature a shot, but we’ll see how it goes.

Insights & Reflections

This was pretty fun to develop and I learned quite a bit about the newer tools in Python development. Most of the packages I help maintain are still using some older toolsets, so it was nice to get a chance to use some of the newer tools. I learned the most about documentation.

What went wrong?

I overengineered this whole thing in a few ways early on. I ended up adding a bunch of stuff which I thought looked cool but was unnecessary and made my experience worse.

The first was by having separate column variables for obs, and var DataFrames (obs_col, var_col) which would allow for more flexibility in mixing and matching predicates. You were able to do something like the following:

adata.an.select(an.obs_col(["Cell_label"]), an.var_col(["feature_name"]), an.obs_col(["donor_id"]))

Without needing to specify the obs or var keyword argument. While this allowed flexible ordering of predicates, it made implementing filter and groupby operations much harder.

I thought that I would try my hand at using ABC methods, since I haven’t’ really worked with them in Python. This proved to be rather troublesome, and I ended up ditching it entirely in favor of the explicit, custom implementations per function. I did it because I thought it would be cool and seemed handy if I wanted a skeleton for all the operations.

Reflections on the Toolbox

This subsection is dedicated to briefly talking about the tools and packages I used to develop Annsel.

Scverse provides a great template for Scientific Python packages, called cookiecutter-scverse. It was wonderful to use and required minimal tweaking to get it to where I wanted it to be. If you’re developing a scientific Python package, I recommend using it even if your packages isn’t related to the scverse ecosystem. Unless you prefer using MkDocs over Sphinx.

Documentation

I haven’t really done extensive and aesthetic documentation like this before, so it was a great learning experience. Jupyter Notebooks for tutorials are were great, as I could run them in the docs and push outputs alongside the code to readthedocs.I modified the dark mode to be pure black which looks pretty decent. When writing accessors, the Sphinx extension sphinx_autosummary_accessors was a necessity. And since this is a relatively simple package, I took a page from Patrick Kidger’s All of Equinox tutorial and briefly went over all the features of the package in a single tutorial.

I thought it would pretty cool if users could try it out without having to install the package. For example, Xarray has a nice WASM interactive sandbox where you can play around with the API. I wanted to do something similar using Marimo but as a guided interactive notebook to supplement the documentation. For this use case, I do not believe that Marimo is ready for a nontrivial use case as you cannot explicitly pin dependencies of the wasm-html notebooks you create and deploy. I had to use the workaround by installing the package with micropip, and even then this caused issues as annsel requires a more recent version of Narwhals than what Marimo initializes it with. So you uninstall Narwhals with micropip, and then reinstall it again with micropip, then…

import micropip
micropip.uninstall("narwhals")
await micropip.install(["narwhals==1.25.0", "anndata", "lzma", "annsel"], deps=True)

And you end up with an error loading the example dataset cause Pydiode doesn’t support http GET requests. Therefore we can’t download the example dataset. Oh well, the workaround is simple, just manually create an RNG powered AnnData object with random X, and obs and var DataFrames with categorical columns. And, finally… you can’t even run filter due to an import error with narwhals.utils. I have no idea why this was happening. I spent a few days trying to figure this out, and I just decided to scrap it and wait it out for a bit.

Notable Packages

  • narwhals, is the workhorse of Annsel. It provides a nice API and has good documentation. It’s also seeing more and more adoption in the Scientific Python ecosystem, as several packages are becoming DataFrame agnostic (i.e. bring your own DataFrame library, like Polars, Dask, CuDF,DuckDB, Pandas, and more). Packages such as altair and plotly have been using Narwhals, and recently formulaic has started implementing support for Narwhals.

  • pooch is a package for downloading and caching files from the internet. It has a simple API which made downloading the CELLxGENE dataset extremely easy. It’s pretty incredible to be honest.

Build Systems, CI and Package Managers

  • uv: I switched off of Conda, pyenv, poetry, and whatever else I’ve been using for the past few years to use uv and haven’t looked back since. The CLI is quite intuitive.

  • ruff: A well developed linter and formatter for Python is know for being (and it is yeah). Besides performance (which doesn’t really matter for a small package like this), the real significance is taking the most recent decade of Python linting and formatting, and making it “just work” under one tool with minimal dependencies and be extremely easy to configure and use.

    • I’ve moderately customized configuration with Pandas linting "PD" and Numpy linting "NP".
    • I’m looking forward to their new static type checker codename red-knot as well.
  • Github Actions: Perfectly good, mainly as there is just so much documentation and large Python packages using it to refer to. I did shoot myself in the foot a few times with uv, such as when I was installing only the required packages for building (not all extras). This caused the lock file to change, and after a while I learned you can freeze the lock file so it doesn’t change. Adding the --frozen parameter in uv sync fixed my issue.

  • hatch: Works well, I didn’t really mess with the build system much since there isn’t much of a need. The best feature is being able to set scripts to different environments. My favorite is the one for building and cleaning the docs for local development which comes standard with the scverse-cookiecutter template. I’m still not sure on how to properly integrate hatch and uv as in CI I use uv for running tests, but I know this can be done with hatch as well. For me, these exist as two separate, somewhat complementary tools.

  • git-cliff: Pretty cool utility to generate a changelog. While I haven’t gotten it to commit the generated changelog in CI yet, I do run it locally and it’s great as it’ll autogenerate the changelog from your commit history.

  • gitmoji: This is just a silly tool. You can use it to add emojis to your commit messages, and it comes with plenty of preset ones. Quite fun.

  • pre-commit: It’s good. Just make sure that pre-commit runs as a part of your CI workflow as well. I know some people don’t like it, but it’s fairly common in the Scientific Python ecosystem and doesn’t make the my developer experience worse.

What’s next?

AnnData currently supports Dask, so next on the chopping block will be to add support for it. In addition I might as well expand the accessor to other Scverse Data objects such as MuData5 and SpatialData6.

I’m thinking of taking Annsel and developing something useful with it. One idea is a small plotting library to replicate scanpy’s plots. I quite like the declarative nature of marsilea7 and scanpy already has a tutorial on using it with AnnData. It shouldn’t be too difficult, but famous last words I guess.

Out of Scope features

There are no plans for aggregations for now, as I’m not sure what their use case would be for AnnData objects. In addition the scope will be limited to obs, var, and X attributes. Applying these methods to obsm and varm are out of scope since they can be any array-api compatible data structure. For example they can be DataFrames, Numpy arrays, Xarrays, and more. uns also cannot be supported since it can be almost anything.

Footnotes

  1. Virshup et al., (2024). anndata: Access and store annotated data matrices. Journal of Open Source Software, 9(101), 4371, https://doi.org/10.21105/joss.04371

  2. Alistair Miles, John Kirkham, Martin Durant, James Bourbeau, Tarik Onalan, Joe Hamman, Zain Patel, shikharsg, Matthew Rocklin, raphael dussin, Vincent Schut, Elliott Sales de Andrade, Ryan Abernathey, Charles Noyes, sbalmer, pyup.io bot, Tommy Tran, Stephan Saalfeld, Justin Swaney, … Anderson Banihirwe. (2020). zarr-developers/zarr-python: v2.4.0 (v2.4.0). Zenodo. https://doi.org/10.5281/zenodo.3773450

  3. The HDF Group. Hierarchical Data Format, version 5 [Computer software]. https://github.com/HDFGroup/hdf5

  4. Triana, S., Vonficht, D., Jopp-Saile, L. et al. Single-cell proteo-genomic reference maps of the hematopoietic system enable the purification and massive profiling of precisely defined cell states. Nat Immunol 22, 1577-1589 (2021). https://doi.org/10.1038/s41590-021-01059-0

  5. Bredikhin, D., Kats, I., & Stegle, O. (2022). MUON: multimodal omics analysis framework. Genome Biology. https://doi.org/10.1186/s13059-021-02577-8

  6. Marconato, L., Palla, G., Yamauchi, K.A. et al. SpatialData: an open and universal data framework for spatial omics. Nat Methods (2024). https://doi.org/10.1038/s41592-024-02212-x

  7. Zheng, Y., Zheng, Z., Rendeiro, A., & Cheung, E. Marsilea [Computer software]. https://github.com/Marsilea-viz/marsilea