
Add option to enforce minimum segment length #150

Open
StephanDollberg wants to merge 1 commit into apache:master from StephanDollberg:stephan/min-segment-len

Conversation

@StephanDollberg

One problem we often see in practice is that single spikes cause
changepoints. These are just false positive noise which we want to
avoid.

This patch adds a config option to reject changepoints whose enclosing
segments are shorter than a given length.

This lets us filter out these one-event changepoints and avoid the
noise.

Of course this will mute true changepoints in a short segment, but that's
fine if they are followed by another true changepoint. For example [100,
100, 130, 130, 150, 150, 150 ...] would only report the change to 150.
A single alert is good enough to get someone to look at the data.

Default behaviour is unchanged.

Let me know what you think.

@henrikingo
Contributor

Hi Stephan

The original e-divisive implementation (in R) actually included such an option, and I think MongoDB's signal_processing_algorithms likewise required 2 points at each end of the segment, meaning that it was only possible to find a change point in segments that had at least 5 points, and in that case it would have to be point #3 that is the change point.

The Hunter implementation then modified this to allow any point in arbitrarily short segments to be a change point. "In practice the minimum segment is 3, since with only 2 points it is not possible to establish the 'normal' range that the other point would be a change from." The Hunter modifications were specifically introduced to correctly find two nearby change points, since a common use case (in Cassandra development, anyway) was that a regression would be observed and then immediately fixed in a nearby commit. The Datastax team observed that they could make the algorithm more sensitive by dividing a long series into smaller windows. A byproduct of this is that even individual outlier points get marked as change points much more easily than in the original e-divisive.

The relevant parameter is window_len and by default it is set to 50. Before adding a new parameter, it would be interesting to hear from you whether you get "better" behavior by increasing this parameter. In principle you should be able to get original e-divisive behavior by setting window_len to a really large value, larger than the length of your time series.
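A minimal sketch of the windowing idea described above (illustrative only; `split_into_windows` is a hypothetical helper, not Hunter's actual code): the series is cut into chunks of at most window_len points and each chunk is analysed separately, so a window_len larger than the series collapses everything back into a single whole-series analysis.

```python
def split_into_windows(series, window_len=50):
    """Split a series into consecutive windows of at most window_len points.

    Hypothetical sketch: the real implementation may overlap windows or
    handle boundaries differently.
    """
    return [series[i:i + window_len] for i in range(0, len(series), window_len)]


short = list(range(30))
print(len(split_into_windows(short, window_len=50)))  # 1 window: 50 > 30
print(len(split_into_windows(short, window_len=10)))  # 3 windows of 10
```

With window_len larger than the series, the whole series is analysed as one segment, which is the original e-divisive behavior.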

If you do this and still observe individual outliers getting marked as change points (or even two changepoints), could you please share a data sample?

@StephanDollberg
Author

Thanks for the background, interesting to know!

Sure, I can give that a try. I had tried without specifying window_len in the past and wasn't entirely happy (I know, not a very qualitative statement, haha), but I also didn't know 50 was the default until yesterday and assumed it was actually a lot bigger. Let me try again with that.

@StephanDollberg
Author

Mega windows (1000) certainly seem to behave better, though I am still seeing some extra changepoints that don't make sense at first sight (anyway, not relevant here).

I guess what it doesn't solve is very recent changepoints, where the number of points to the right of them is very small.

Something like the following shows the issue:

diff --git a/tests/analysis_test.py b/tests/analysis_test.py
index 4f40e2f..bd3fc97 100644
--- a/tests/analysis_test.py
+++ b/tests/analysis_test.py
@@ -66,6 +66,23 @@ def test_single_series():
     assert indexes == [10]


+def test_large_window_reports_tail_spike_change_point():
+    series = np.random.default_rng(2).normal(loc=100.0, scale=5.0, size=303)
+    series[-4] = 150.0
+
+    cps, _ = compute_change_points(
+        series,
+        window_len=1000,
+        max_pvalue=0.001,
+        min_magnitude=0.01,
+    )
+
+    assert len(series) - 4 in [cp.index for cp in cps]

The min segment length thing does work around that.

@henrikingo
Contributor

Ok thanks!

Yes so the test case you supply, and also the ones in the patch itself, do match familiar behavior: e-divisive gets more volatile and therefore sensitive towards the ends of a segment. A good way to illustrate this is to use your repro from above, and modify it to have the outlier spike in the middle:

series[151] = 150.0

At least in my attempts, this will never be marked as a change point.

And this is also why the original e-divisive requires change points to be a minimum amount of data points away from each other. I'll use min_segment_len as you've done in the patch.

The trade-off is that:

  1. A newly introduced regression can only be detected after min_segment_len/2 additional data points have accumulated. For example, if min_segment_len=10 and you run benchmarks in the nightly build, then you need to wait 5 days and nights before you can expect to find a change point.
  2. Of two nearby changepoints, only one can be found. There need to be min_segment_len/2 points between two change points.

As for 1, I guess this is generally bad for everyone. The only argument for it is that the results at the very end of a segment are not stable; I think the paper even says that the proof only works in the middle of the segment, and the ends must be suppressed.

2 is subjective. For Datastax, the whole point of their modifications is that they want the algorithm to find and flag two changes that are close to each other. You, on the other hand, say it is fine: as long as one point is marked, a human is alerted and will see the other change point too. Both of these are valid opinions.
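The kind of post-filter being discussed can be sketched in a few lines (illustrative only; `filter_short_segments` is a hypothetical name, the PR may implement this differently, and a fuller version would measure each right segment against the next change point rather than the series end):

```python
def filter_short_segments(indexes, series_len, min_segment_len=0):
    """Keep only change points with at least min_segment_len points
    on each side.

    Simplified sketch: the right segment is measured to the end of the
    series, ignoring any following change point.
    """
    kept = []
    prev = 0  # start of the segment to the left of the next change point
    for idx in indexes:
        left_len = idx - prev
        right_len = series_len - idx
        if left_len >= min_segment_len and right_len >= min_segment_len:
            kept.append(idx)
            prev = idx
    return kept


# A lone tail spike near the end of a 303-point series is dropped,
# while a change point with long segments on both sides survives.
print(filter_short_segments([299], 303, min_segment_len=5))      # []
print(filter_short_segments([10, 299], 303, min_segment_len=5))  # [10]
```

With min_segment_len=0 (the proposed default) nothing is filtered, so the existing behavior is preserved.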

In conclusion: Adding a new option is precisely the right solution.

Before we proceed with this PR, I would like to introduce a new variant of the algorithm. Currently we have

def compute_change_points(

and

def compute_change_points_orig(series: Sequence[SupportsFloat], max_pvalue: float = 0.001, seed: Optional[int] = None) -> Tuple[PermCPList, Optional[PermCPList]]:

...the latter attempts to be a literal implementation of the Matteson & James algorithm as described in their publication. Generally, your modification makes more sense against this latter variant of the algorithm. Adding it to the current default variant, you are kind of fighting against behavior that the algorithm was specifically designed to have.

So, I would like to add a variant of the algorithm, let's call it --deterministic-edivisive variant, which is like --orig_edivisive but uses the Student T-test instead of the random permutations tester.

Then your patch can be added both to --orig_edivisive and --deterministic-edivisive and when doing so we are actually just adding back a parameter that is in the original paper anyway.

And to remain backward compatible, we default this to 0, but you would set it to e.g. 3.
