PyPair

PyPair computes pairwise association measures between variables (binary, categorical, ordinal, and continuous), with local NumPy-first implementations and Spark dataframe support.

Modern toolchain (Python 3.13 + uv)

uv venv --python 3.13
source .venv/bin/activate
uv sync
uv run pytest

Build wheel/sdist:

uv build

Publish a release:

make publish-check
make publish

Put your PyPI credentials file at ./.pypirc in the repository root. The repo already ignores .pypirc. Use make publish-testpypi if you want to upload to TestPyPI instead.
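A minimal `.pypirc` in the standard distutils format looks like the following (the `pypi-...` token values are placeholders for your own API tokens):

```ini
[distutils]
index-servers =
    pypi
    testpypi

[pypi]
username = __token__
password = pypi-...

[testpypi]
repository = https://test.pypi.org/legacy/
username = __token__
password = pypi-...
```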

Profiling

Run the built-in profiling workload:

make profile

This writes cProfile stats to .profiles/pypair.prof, prints the hottest frames, and can also emit internal timings for the decorated contingency measures. It also writes a memory report with tracemalloc allocation hotspots and process peak RSS to .profiles/pypair.memory.txt. Override the default workload or scale with PROFILE_FLAGS when needed:

make profile PROFILE_FLAGS="--workload corr --size 8000 --width 20 --limit 40 --output .profiles/corr.prof --memory-output .profiles/corr.memory.txt"
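The emitted `.prof` file can be inspected later with the standard library. A minimal sketch, profiling a toy stand-in workload into a temporary file (the real path, `.profiles/pypair.prof`, is produced by `make profile`):

```python
import cProfile
import pstats
import tempfile


def workload():
    # Toy stand-in for the profiled association measures; any callable works.
    return sum(i * i for i in range(10_000))


# Pick a temporary path so the sketch is self-contained.
with tempfile.NamedTemporaryFile(suffix=".prof", delete=False) as f:
    path = f.name

# Collect stats and dump them to disk, as the profiling target does.
pr = cProfile.Profile()
pr.enable()
workload()
pr.disable()
pr.dump_stats(path)

# Load the stats file and print the five hottest frames by cumulative time.
stats = pstats.Stats(path)
stats.sort_stats("cumulative").print_stats(5)
```

Point `pstats.Stats` at `.profiles/pypair.prof` (or whatever `--output` you passed) to drill into a real run the same way.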

Quick usage

from pypair.association import (
    binary_binary,
    confusion,
    categorical_categorical,
    binary_continuous,
    concordance,
    categorical_continuous,
    continuous_continuous,
)

# a and b are 1D array-like inputs (see "Input types" below)
jaccard = binary_binary(a, b, measure='jaccard')
acc = confusion(a, b, measure='acc')
phi = categorical_categorical(a, b, measure='phi')
biserial = binary_continuous(a, b, measure='biserial')
tau = concordance(a, b, measure='kendall_tau')
eta = categorical_continuous(a, b, measure='eta')
pearson = continuous_continuous(a, b, measure='pearson')
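For intuition on what one of these measures computes: for binary 0/1 vectors, the Jaccard coefficient is the count of positions where both vectors are 1 divided by the count of positions where either is 1. A NumPy sketch of that definition (not PyPair's internal implementation):

```python
import numpy as np


def jaccard_sketch(a, b):
    # |a AND b| / |a OR b| over binary 0/1 vectors.
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.count_nonzero(a | b)
    return np.count_nonzero(a & b) / union if union else 0.0


a = [1, 0, 1, 1, 0]
b = [1, 1, 1, 0, 0]
print(jaccard_sketch(a, b))  # 2 shared 1s / 4 positions with any 1 -> 0.5
```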

Input types

Local pairwise APIs accept 1D array-like inputs:

  • numpy.ndarray
  • pandas.Series
  • Python lists / tuples
  • Other iterables that can be consumed once

Use categorical/object-like inputs for categorical metrics and numeric inputs for continuous or concordance metrics. For best runtime and lowest allocation overhead, prefer already-materialized numpy.ndarray or pandas.Series inputs with the right dtype instead of generators or mixed-object containers.

The shared type aliases live in pypair.typing, for example ArrayLike1D and NumericArrayLike1D.
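Following the advice above, a small sketch of normalizing arbitrary 1D input to a typed `ndarray` once, before repeated pairwise calls (the helper name is ours for illustration, not part of PyPair's API):

```python
import numpy as np


def as_numeric_1d(values):
    # Materialize once: generators are consumed here; lists and Series are
    # converted to a float64 ndarray only if a dtype change is needed.
    if not hasattr(values, "__len__"):
        values = list(values)
    arr = np.asarray(values, dtype=np.float64)
    if arr.ndim != 1:
        raise ValueError(f"expected 1D input, got shape {arr.shape}")
    return arr


a = as_numeric_1d(x * 0.5 for x in range(4))  # generator -> float64 ndarray
b = as_numeric_1d([1, 2, 3, 4])               # list -> float64 ndarray
print(a.dtype, b.shape)  # float64 (4,)
```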

DataFrame integration

  • Pandas: use pypair.util.corr(df, func) to build pairwise association matrices.
  • PySpark: use pypair.spark.* methods for distributed pairwise computations.
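The pandas integration builds an n-by-n matrix by applying a measure to every column pair. A sketch of that pattern with a plain Pearson callable (this re-implements the loop for illustration; in practice you would hand the callable to `pypair.util.corr(df, func)`):

```python
import numpy as np
import pandas as pd


def pairwise_matrix(df, func):
    # Apply func(x, y) to every ordered pair of columns.
    cols = df.columns
    out = pd.DataFrame(index=cols, columns=cols, dtype=float)
    for i in cols:
        for j in cols:
            out.loc[i, j] = func(df[i].to_numpy(), df[j].to_numpy())
    return out


def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])


df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0]})
m = pairwise_matrix(df, pearson)
print(m.loc["x", "y"])  # perfectly linear columns -> ~1.0
```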

Notes on internals

  • Internals now prefer NumPy for local numeric workflows where possible.
  • Pandas remains supported as a dataframe input/output layer.
  • PySpark APIs are preserved for distributed dataframe workflows.