
Add option to enforce minimum segment length #150

Open
StephanDollberg wants to merge 1 commit into apache:master from StephanDollberg:stephan/min-segment-len

Conversation

@StephanDollberg

One problem we often see in practice is that single spikes cause
changepoints. These are just false positive noise which we want to
avoid.

This patch adds a config option to reject changepoints whose enclosing
segments are shorter than a given length.

This lets us filter out these one-event changepoints and avoid the
noise.

Of course this will mute true changepoints in a short segment, but that's
fine if they are followed by another true changepoint. For example [100,
100, 130, 130, 150, 150, 150 ...] would only report the change to 150.
A single alert is good enough to get someone to look at the data.

Default behaviour is unchanged.

Let me know what you think.

@henrikingo
Contributor

Hi Stephan

The original e-divisive implementation (in R) actually included such an option, and I think MongoDB's signal_processing_algorithms likewise required 2 points at each end of the segment, meaning that it was only possible to find a change point in segments that had at least 5 points, and in that case it would have to be point #3 that is the change point.

The Hunter implementation then modified this to allow any point in arbitrarily short segments to be a change point. "In practice the minimum segment is 3, since with only 2 points it is not possible to establish the 'normal' range that the other point would be a change from." The Hunter modifications were specifically introduced to correctly find two nearby change points, since a common use case (in Cassandra development, anyway) was that a regression would be observed and then immediately fixed in a nearby commit. The Datastax team observed that they could make the algorithm more sensitive by dividing a long series into smaller windows. A byproduct of this is that even individual outlier points get marked as change points much more easily than in the original e-divisive.

The relevant parameter is window_len and by default it is set to 50. Before adding a new parameter, it would be interesting to hear from you whether you get "better" behavior by increasing this parameter. In principle you should be able to get original e-divisive behavior by setting window_len to a really large value, larger than the length of your time series.
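A minimal sketch of the windowing idea described above (illustrative only; `split_into_windows` is a hypothetical helper, not Hunter's actual code): the series is cut into chunks of at most window_len points and each chunk is analysed separately, so a window_len larger than the series collapses everything back into a single whole-series analysis.

```python
def split_into_windows(series, window_len=50):
    """Split a series into consecutive windows of at most window_len points.

    Hypothetical sketch: the real implementation may overlap windows or
    handle boundaries differently.
    """
    return [series[i:i + window_len] for i in range(0, len(series), window_len)]


short = list(range(30))
print(len(split_into_windows(short, window_len=50)))  # 1 window: 50 > 30
print(len(split_into_windows(short, window_len=10)))  # 3 windows of 10
```

With window_len larger than the series, the whole series is analysed as one segment, which is the original e-divisive behavior.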

If you do this and still observe individual outliers getting marked as change points (or even two changepoints), could you please share a data sample?

@StephanDollberg
Author

Thanks for the background, interesting to know!

Sure, I can give that a try. I had tried without specifying window_len in the past and wasn't entirely happy (I know, not a very qualitative statement, haha), but I also didn't know 50 was the default until yesterday and assumed it was actually a lot bigger. Let me try again with that.

@StephanDollberg
Author

Mega windows (1000) certainly seem to behave better, though I am still seeing some extra changepoints that don't make sense at first sight (anyway, not relevant here).

I guess what it doesn't solve is very recent changepoints, where the number of points to the right of them is very small.

Something like the following shows the issue:

diff --git a/tests/analysis_test.py b/tests/analysis_test.py
index 4f40e2f..bd3fc97 100644
--- a/tests/analysis_test.py
+++ b/tests/analysis_test.py
@@ -66,6 +66,23 @@ def test_single_series():
     assert indexes == [10]


+def test_large_window_reports_tail_spike_change_point():
+    series = np.random.default_rng(2).normal(loc=100.0, scale=5.0, size=303)
+    series[-4] = 150.0
+
+    cps, _ = compute_change_points(
+        series,
+        window_len=1000,
+        max_pvalue=0.001,
+        min_magnitude=0.01,
+    )
+
+    assert len(series) - 4 in [cp.index for cp in cps]

The min segment length thing does work around that.

@henrikingo
Contributor

Ok thanks!

Yes so the test case you supply, and also the ones in the patch itself, do match familiar behavior: e-divisive gets more volatile and therefore sensitive towards the ends of a segment. A good way to illustrate this is to use your repro from above, and modify it to have the outlier spike in the middle:

series[151] = 150.0

At least in my attempts, this will never be marked as a change point.

And this is also why the original e-divisive requires change points to be a minimum amount of data points away from each other. I'll use min_segment_len as you've done in the patch.

The trade-off is that:

  1. A newly introduced regression can only be detected after min_segment_len/2 additional data points have accumulated. For example, if min_segment_len=10 and you run benchmarks in the nightly build, then you need to wait 5 days and nights before you can expect to find a change point.
  2. Of two nearby changepoints, only one can be found. There need to be min_segment_len/2 points between two change points.

As for 1, I guess this is generally bad for everyone. The only argument for it is that the results at the very end of a segment are not stable; I think the paper even says that the proof only works in the middle of the segment, and the ends must be suppressed.

2 is subjective. For Datastax, the whole point of their modifications is that they want the algorithm to find and flag two changes that are close to each other. You, on the other hand, say it is fine: as long as one point is marked, a human is alerted and will see the other change point too. Both of these are valid opinions.
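The kind of post-filter being discussed can be sketched in a few lines (illustrative only; `filter_short_segments` is a hypothetical name, the PR may implement this differently, and a fuller version would measure each right segment against the next change point rather than the series end):

```python
def filter_short_segments(indexes, series_len, min_segment_len=0):
    """Keep only change points with at least min_segment_len points
    on each side.

    Simplified sketch: the right segment is measured to the end of the
    series, ignoring any following change point.
    """
    kept = []
    prev = 0  # start of the segment to the left of the next change point
    for idx in indexes:
        left_len = idx - prev
        right_len = series_len - idx
        if left_len >= min_segment_len and right_len >= min_segment_len:
            kept.append(idx)
            prev = idx
    return kept


# A lone tail spike near the end of a 303-point series is dropped,
# while a change point with long segments on both sides survives.
print(filter_short_segments([299], 303, min_segment_len=5))      # []
print(filter_short_segments([10, 299], 303, min_segment_len=5))  # [10]
```

With min_segment_len=0 (the proposed default) nothing is filtered, so the existing behavior is preserved.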

In conclusion: Adding a new option is precisely the right solution.

Before we proceed with this PR, I would like to introduce a new variant of the algorithm. Currently we have

def compute_change_points(

and

def compute_change_points_orig(series: Sequence[SupportsFloat], max_pvalue: float = 0.001, seed: Optional[int] = None) -> Tuple[PermCPList, Optional[PermCPList]]:

...the latter attempts to be a literal implementation of the Matteson & James algorithm as described in their publication. Generally, your modification makes more sense against this latter variant of the algorithm. Adding it to the current default variant, you are kind of fighting against behavior that the algorithm was specifically designed to have.

So, I would like to add a variant of the algorithm, let's call it --deterministic-edivisive variant, which is like --orig_edivisive but uses the Student T-test instead of the random permutations tester.

Then your patch can be added both to --orig_edivisive and --deterministic-edivisive and when doing so we are actually just adding back a parameter that is in the original paper anyway.

And to remain backward compatible, we default this to 0, but you would set it to e.g. 3.
