Race conditions in split-cache dir when parallel jobs share data_split_dir #310

@zaz

Summary

topobench/data/utils/split_utils.py has two concurrency bugs in the shared code path used by random_splitting and k_fold_split. Both can surface when multiple workers (e.g. parallel Optuna trials) point at the same data_split_dir and reach the split-generation branch simultaneously, typically on the first run against an uncached dataset.

Bug 1. Directory-creation TOCTOU (time-of-check to time-of-use) race

if not os.path.isdir(split_dir):
    os.makedirs(split_dir)

If two workers observe isdir == False within the same short window and both call os.makedirs, the second raises FileExistsError: [Errno 17]. One sweep trial crashes visibly during setup and is recorded as failed.
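One possible fix, sketched here rather than taken from the forthcoming PR, is to drop the separate existence check and let os.makedirs tolerate an existing directory through its exist_ok parameter. The helper name ensure_split_dir and the thread-based demo are hypothetical and only illustrate the idea.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def ensure_split_dir(split_dir):
    """Create split_dir if needed; hypothetical helper, not part of topobench."""
    # exist_ok=True removes the check-then-create window: a worker that loses
    # the race sees the directory already present and returns without raising.
    os.makedirs(split_dir, exist_ok=True)
    return split_dir


# Simulate several workers racing to create the same, not-yet-existing directory.
split_dir = os.path.join(tempfile.mkdtemp(), "splits")
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(ensure_split_dir, [split_dir] * 8))
```

Every call returns normally whether or not another worker created the directory first, so no trial is lost to setup ordering.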

Bug 2. Non-atomic fold-file writes

np.savez(os.path.join(split_dir, f"{fold_n}.npz"), **split_idx)

If the process dies mid-write (SIGKILL, OOM, preemption), the canonical {fold_n}.npz is left partially written. Every subsequent run then loads the corrupt file and either raises a cryptic BadZipFile or EOFError, or returns an NpzFile with missing keys that downstream code misinterprets as valid splits. The corruption persists until the directory is manually removed.
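A common remedy for this class of bug is to write to a temporary file in the same directory and then rename it over the final path with os.replace, which is atomic when source and destination are on the same filesystem. The sketch below is an assumption about how that could look, not the actual patch; atomic_write and its write_fn callback are hypothetical names.

```python
import contextlib
import os
import tempfile


def atomic_write(path, write_fn):
    """Write path so readers see either the old file or the complete new one.

    Hypothetical helper: write_fn receives a binary file object and writes to it.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the final rename
    # never crosses a filesystem boundary.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            write_fn(f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # atomically swap in the finished file
    except BaseException:
        # On any failure, discard the partial temp file and leave path untouched.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_path)
        raise
```

Because np.savez accepts an open file object, the call from the issue could plausibly become atomic_write(os.path.join(split_dir, f"{fold_n}.npz"), lambda f: np.savez(f, **split_idx)). A process killed by SIGKILL can still leave a stray .tmp file behind, but the canonical {fold_n}.npz is never observed half-written.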

PR to follow.
