harmonypy

harmonypy is a Python package for the Harmony algorithm for integrating multiple high-dimensional datasets. It uses a C++ backend (Armadillo) for fast linear algebra, matching the R harmony2 package step-by-step.

This animation shows Harmony aligning three single-cell RNA-seq datasets from different donors (→ How to make this animation). Before Harmony, cells from each of the three donors are clearly distinguishable. After Harmony, cells from different donors are mixed while the overall shape of the data is preserved.

Installation

Install from PyPI (pre-built wheels for Linux and macOS):

pip install harmonypy

Building from source

Building from source requires a C++ compiler, CMake, and a BLAS library:

macOS (uses Apple Accelerate, no extra dependencies):

pip install .

Linux (requires OpenBLAS):

# Debian/Ubuntu
sudo apt install libopenblas-dev cmake

# RHEL/Fedora
sudo dnf install openblas-devel cmake

pip install .

Quick Start

import harmonypy as hm
import pandas as pd

# Load the principal components and metadata
pcs = pd.read_csv("data/pbmc_3500_pcs.tsv.gz", sep="\t")
meta = pd.read_csv("data/pbmc_3500_meta.tsv.gz", sep="\t")

# Run Harmony to correct for batch effects (donor)
harmony_out = hm.run_harmony(pcs, meta, "donor")

# Save corrected PCs (same shape as input)
result = pd.DataFrame(harmony_out.Z_corr, columns=pcs.columns)
result.to_csv("pbmc_3500_pcs_harmony.tsv", sep="\t", index=False)
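
Before calling run_harmony, it can help to confirm that the PCs and the metadata describe the same cells. The sketch below is one way to do that, assuming one row per cell in both tables; check_harmony_inputs is a hypothetical helper, not part of harmonypy, and the toy tables stand in for the PBMC files.

```python
import pandas as pd


def check_harmony_inputs(pcs, meta, batch_key):
    """Hypothetical pre-flight check; not part of harmonypy."""
    if batch_key not in meta.columns:
        raise KeyError(f"metadata has no column named {batch_key!r}")
    if len(pcs) != len(meta):
        raise ValueError(
            f"{len(pcs)} rows of PCs but {len(meta)} rows of metadata"
        )
    # Number of cells per batch level, e.g. per donor
    return meta[batch_key].value_counts().to_dict()


# Toy stand-ins for the PBMC tables used above (assumed layout: one row per cell)
pcs = pd.DataFrame({"PC1": [0.1, 0.2, 0.3, 0.4], "PC2": [1.0, 0.9, 0.8, 0.7]})
meta = pd.DataFrame({"donor": ["A", "A", "B", "C"]})

batch_sizes = check_harmony_inputs(pcs, meta, "donor")
print(batch_sizes)
```

The returned counts also show whether any batch is very small, which is worth knowing before interpreting the corrected embedding.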

Usage with Scanpy

import scanpy as sc
import harmonypy as hm

# Load and preprocess your data
adata = sc.read_h5ad("my_data.h5ad")
sc.pp.pca(adata)

# Get PCs from the AnnData object
pcs = adata.obsm['X_pca']
print(pcs.shape)  # (n_cells, n_pcs)

# Run Harmony on the PCA embedding
harmony_out = hm.run_harmony(pcs, adata.obs, "batch")

# Store corrected PCs back in the AnnData object
adata.obsm['X_pca_harmony'] = harmony_out.Z_corr

# Use harmonized PCs for downstream analysis
sc.pp.neighbors(adata, use_rep='X_pca_harmony')
sc.tl.umap(adata)
sc.tl.leiden(adata)

Parameters

run_harmony accepts the same parameters as the R package:

  Parameter          Default          Description
  ------------------ ---------------- -------------------------------------
  theta              2                Diversity penalty per batch variable
  sigma              0.1              Kernel bandwidth for soft clustering
  nclust             min(N/30, 100)   Number of clusters
  max_iter_harmony   10               Maximum Harmony iterations
  max_iter_kmeans    4                K-means iterations per Harmony round
  epsilon_harmony    1e-2             Convergence threshold
  ncores             0                BLAS threads (0 = all cores)
  lamb               None             Ridge penalty (None = auto-estimate)

The ncores parameter controls BLAS threading (Accelerate on macOS, OpenBLAS on Linux). Default is 0 (use all available cores). Set ncores=1 for single-threaded execution.
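
As a rough sketch of how these defaults fit together, the snippet below spells them out as keyword arguments, with ncores set to 1 for single-threaded execution. default_nclust is a hypothetical helper, not part of harmonypy, and rounding N/30 to the nearest integer is an assumption; the table only gives min(N/30, 100).

```python
def default_nclust(n_cells):
    """Default cluster count from the table: min(N/30, 100).

    Rounding N/30 to the nearest integer is an assumption of this sketch.
    """
    return int(min(round(n_cells / 30), 100))


# Values copied from the parameter table, except ncores=1 (single-threaded).
harmony_kwargs = dict(
    theta=2,
    sigma=0.1,
    nclust=default_nclust(3500),  # 3500 cells, as in the small PBMC example
    max_iter_harmony=10,
    max_iter_kmeans=4,
    epsilon_harmony=1e-2,
    ncores=1,
    lamb=None,
)
print(harmony_kwargs["nclust"])

# With harmonypy installed and pcs/meta loaded as in Quick Start:
# import harmonypy as hm
# harmony_out = hm.run_harmony(pcs, meta, "donor", **harmony_kwargs)
```

Passing the defaults explicitly is not required; it only makes the chosen settings visible in a script or notebook.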

Performance

Running the script tests/test_harmony.py on an Apple M1 (2022) chip reports:

  Dataset                    Time    RSS delta
  ---------------------- -------- ------------
  Small (3.5k cells)        0.23s     45.2 MB
  Medium (69k cells)        4.76s    262.3 MB
  Large (858k cells)       29.29s   1969.5 MB
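
How tests/test_harmony.py takes these measurements is not shown here. One way to collect comparable numbers, assuming a Unix-like system where Python's resource module is available, is sketched below. measure and peak_rss_mb are hypothetical helpers, and the workload is a stand-in for a Harmony run.

```python
import resource
import sys
import time


def peak_rss_mb():
    """Peak resident set size of this process, in MB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def measure(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds, growth in peak RSS in MB)."""
    rss_before = peak_rss_mb()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    seconds = time.perf_counter() - start
    return result, seconds, peak_rss_mb() - rss_before


# Stand-in workload; replace the lambda with a call such as
# lambda: hm.run_harmony(pcs, meta, "donor")
result, seconds, rss_delta = measure(lambda: sum(range(1_000_000)))
print(f"{seconds:.2f}s  {rss_delta:.1f} MB")
```

This sketch reports growth in peak RSS, which may differ from the "RSS delta" column above if the test script samples memory another way.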

Citation

If you use Harmony in your work, please cite the original paper:

Korsunsky, I., Millard, N., Fan, J. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat Methods 16, 1289–1296 (2019). https://doi.org/10.1038/s41592-019-0619-0

The Supplementary Information PDF provides detailed mathematical descriptions and implementation notes.

To learn more about Harmony 2, please see the preprint here:

Patikas, Nikolaos, Hongcheng Yao, Roopa Madhu, Soumya Raychaudhuri, Martin Hemberg, and Ilya Korsunsky. 2026. Integration of Large, Complex Single-Cell Datasets with Harmony2. bioRxiv. https://doi.org/10.64898/2026.03.16.711825
