PyPair computes pairwise association measures between variables (binary, categorical, ordinal, and continuous), with NumPy-first local implementations and Spark DataFrame support.
Set up a development environment and run the tests:

```shell
uv venv --python 3.13
source .venv/bin/activate
uv sync
uv run pytest
```

Build wheel/sdist:

```shell
uv build
```

Publish a release:

```shell
make publish-check
make publish
```

Put your PyPI credentials file at `./.pypirc` in the repository root; the repo already ignores `.pypirc`. Use `make publish-testpypi` if you want to upload to TestPyPI instead.
Run the built-in profiling workload:

```shell
make profile
```

This writes cProfile stats to `.profiles/pypair.prof`, prints the hottest frames, and can also emit internal timings for the decorated contingency measures. It also writes a memory report with tracemalloc allocation hotspots and process peak RSS to `.profiles/pypair.memory.txt`. Override the default workload or scale with `PROFILE_FLAGS` when needed:
```shell
make profile PROFILE_FLAGS="--workload corr --size 8000 --width 20 --limit 40 --output .profiles/corr.prof --memory-output .profiles/corr.memory.txt"
```

```python
from pypair.association import (
    binary_binary,
    confusion,
    categorical_categorical,
    binary_continuous,
    concordance,
    categorical_continuous,
    continuous_continuous,
)

# same public convenience API
jaccard = binary_binary(a, b, measure='jaccard')
acc = confusion(a, b, measure='acc')
phi = categorical_categorical(a, b, measure='phi')
biserial = binary_continuous(a, b, measure='biserial')
tau = concordance(a, b, measure='kendall_tau')
eta = categorical_continuous(a, b, measure='eta')
pearson = continuous_continuous(a, b, measure='pearson')
```

Local pairwise APIs accept 1D array-like inputs:
- `numpy.ndarray`
- `pandas.Series`
- Python lists / tuples
- Other iterables that can be consumed once
Use categorical/object-like inputs for categorical metrics and numeric inputs for continuous or concordance metrics. For the best runtime and lowest allocation overhead, prefer already-materialized `numpy.ndarray` or `pandas.Series` inputs with the right dtype over generators or mixed-object containers.
Shared type aliases live in `pypair.typing`, for example `ArrayLike1D` and `NumericArrayLike1D`.
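To make the binary measures concrete, here is a small, hypothetical re-implementation of two of them in plain NumPy (a sketch of what these measures compute, not pypair's actual code; the function names are invented for illustration):

```python
import numpy as np

def jaccard_sketch(a, b):
    """Jaccard similarity for binary 0/1 vectors: |a AND b| / |a OR b|."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.count_nonzero(a | b)
    return np.count_nonzero(a & b) / union if union else 0.0

def phi_sketch(a, b):
    """Phi coefficient for binary vectors (equals Pearson's r on 0/1 data)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.corrcoef(a, b)[0, 1])

a = np.array([1, 1, 0, 1, 0, 0])
b = np.array([1, 0, 0, 1, 0, 1])
print(jaccard_sketch(a, b))  # 2 shared ones out of 4 in the union -> 0.5
print(phi_sketch(a, b))
```

Note how both helpers materialize their inputs with `np.asarray` and a concrete dtype up front, which is exactly why passing an already-typed array avoids a conversion pass.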
- Pandas: use `pypair.util.corr(df, func)` to build pairwise association matrices.
- PySpark: use `pypair.spark.*` methods for distributed pairwise computations.
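As a sketch of what a pairwise-matrix helper produces, the loop below builds a symmetric column-by-column matrix from any `func(x, y) -> float`. This is a hypothetical stand-in written for illustration, not `pypair.util.corr` itself, and it assumes a correlation-like measure where the diagonal is 1.0:

```python
import numpy as np
import pandas as pd

def corr_matrix_sketch(df, func):
    """Apply func(x, y) to every column pair of df; symmetric result."""
    cols = list(df.columns)
    # Diagonal fixed at 1.0 -- a simplification valid for correlation-like measures.
    out = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    for i, x in enumerate(cols):
        for y in cols[i + 1:]:
            v = func(df[x].to_numpy(), df[y].to_numpy())
            out.loc[x, y] = out.loc[y, x] = v
    return out

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # perfectly correlated with a
    "c": [4.0, 3.0, 2.0, 1.0],   # perfectly anti-correlated with a
})
m = corr_matrix_sketch(df, pearson)
print(m.round(2))
```

Any of the pypair convenience functions shown earlier can be dropped in as `func`, since they take two 1D array-likes and return a scalar.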
- Internals now prefer NumPy for local numeric workflows where possible.
- Pandas remains supported as a dataframe input/output layer.
- PySpark APIs are preserved for distributed dataframe workflows.
