# Consistent and powerful non-Euclidean graph-based change-point test with applications to segmenting random interfered video data

^{a}Department of Mathematics and Statistics, Thompson Rivers University, Kamloops, BC, Canada V2C0C8;^{b}Department of Mathematics and Statistics, York University, Toronto, ON, Canada M3J1P3;^{c}Department of Biostatistics, University at Buffalo, The State University of New York, Buffalo, NY 14221-3000;^{d}CR RAO Advanced Institute of Mathematics, Statistics and Computer Science, Hyderabad 500046, India

See allHide authors and affiliations

Contributed by Calyampudi Radhakrishna Rao, April 13, 2018 (sent for review March 19, 2018; reviewed by Miklós Csörgő and Runze Li)

## Significance

A webcam collects tons of short video clips for home security, fire inspection, and animal’s behavior study, etc. For efficiency, effectiveness, and economy of the variety of purposes of the use of the video data, we need to remove informationless data. To solve this problem, we propose two non-Euclidean graph-based change-point tests: SHP* and MST*. The proposed test SHP* is not only distribution-free but also, consistent. As shown in the real data example of a bee’s flower visitation, it has successfully detected both landing and departure times of a bee from the video data with random interference.

## Abstract

The change-point detection has been carried out in terms of the Euclidean minimum spanning tree (MST) and shortest Hamiltonian path (SHP), with successful applications in the determination of authorship of a classic novel, the detection of change in a network over time, the detection of cell divisions, etc. However, these Euclidean graph-based tests may fail if a dataset contains random interferences. To solve this problem, we present a powerful non-Euclidean SHP-based test, which is consistent and distribution-free. The simulation shows that the test is more powerful than both Euclidean MST- and SHP-based tests and the non-Euclidean MST-based test. Its applicability in detecting both landing and departure times in video data of bees’ flower visits is illustrated.

Webcams are used in both daily life and research projects. Such examples include visitors’ motion detection via a home security system, video fire detection via a fire detection system, study of the behavior of a rare and big but dangerous animal, magnetic resonance imaging analysis of some functions (e.g., learning function) of a part of a brain, and investigation of bees’ flower visitation. A webcam collects tons of short video clips. However, many of video clips do not contain any information of interest, and hence, they can be and should be removed for efficiency, effectiveness, and economy of the variety of purposes. For example, to investigate the movement pattern of bees, one may use webcams to record a bee’s flower visitation to find the duration of the bee on the flower (1). Huge video data need to be analyzed to accurately investigate bees’ flower visitation. However, in all of the above examples, there may exist unexpected (i.e., random) interferences. For instance, in the bees’ flower visitation example, relatively smaller insects, such as ants, may also visit the flower unexpectedly; these are the random interferences and are not avoidable. Fig. 1 displays four selected frames from those extracted every second from a recorded video. It can be seen that the flower was visited by both bees and ants, and it can also be seen that they were landed in different places of the flower. Thus, we are interested in keeping the video data only containing bees’ flower visitation.

The removal of the informationless video data in the examples given above can actually be converted into a change-point detection problem for high-dimensional data that contain random interferences. Consider the bees’ flower visitation example for demonstration. Here, the sequence of data consists of a sequence of vectorized pixel values from frames, and both landing and leaving times of bees make large changes in vectorized pixel values and can thus be considered as change points in the data sequence. A well-performed change-point detection method would allow users to remove large but informationless video data in such examples, which can be carried out daily or other regular basis.

Many change-point detection methods can be found in literature. Refs. 2 and 3, among others, gave change-point analyses for high-dimensional time series, where the data structure changes in a fixed subset of components. Graph-based change-point tests have been developed recently for their advantage in describing high-dimensional data, which date back to Friedman and Rafsky (4) for a two-sample test via applying the minimum spanning tree (MST). In terms of Friedman and Rafsky (4), Chen and Zhang (5) proposed a change-point test. However, the test based on MST is not distribution-free and is not consistent when there is a shift in variance in high-dimensional settings as shown in ref. 6. By applying the shortest Hamiltonian path (SHP) introduced in ref. 7 for a two-sample test, ref. 6 proposed a distribution-free and consistent change-point test. We remark that both of the above change-point tests are constructed in terms of Euclidean graph, which may not perform for the problem considered in this paper. Consider the bees’ flower visitation example. The issue arises because (*i*) the random interferences by small insects may lead to the changes of vectorized pixel values, which could not be ignored, and (*ii*) the pixel values change at a relatively large number of unknown image points for a bee but a small number of unknown image points for a small insect, such as an ant. Both locations and amounts of changes in pixel values are unknown, which make the Euclidean graph-based change-point tests not to perform. The issue may be resolved by replacing the Euclidean distance by a non-Euclidean distance.

## Non-Euclidean Graph-Based Change-Point Test

We first vectorized the matrix of pixel values of the tth image into a d-dimensional vector, with d being the number of pixels in this image. To show how to remove large but informationless video data in any example described above, we use the bees’ flower visitation example for demonstration. We assume that, at most, a single bee has visited the flower for simple presentation. The possible scenarios include (*i*) there is no insect on the flower; (*ii*) there are only a few small insects, such as ants, on the flower; and (*iii*) there is a bee on the flower. It is noted that both small insects and a bee may land on any place of the flower. The following model can be used to model such video data, where the three parts of this model correspond to the above three scenarios, respectively:

To detect whether there is a change point

For example, a graph G can be an MST or an SHP based on the Euclidean edge weight Eq. **2** or the non-Euclidean edge weight Eq. **3** denoted **2**, we define**MST** in this paper, with the test statistic**4**;

For the same Euclidean edge weight Eq. **2**, ref. 6 proposed an SHP-based test, denoted **SHP** in this paper, with the test statistic**4**,

We now define both non-Euclidean SHP- and MST-based test statistics by replacing the Euclidean edge weights Eq. **2** with the non-Euclidean edge weights Eq. **3** in both test statistics Eq. **7** and Eq. **5**. We denote so-defined test statistics **SHP*** and **MST***, respectively, in this paper. If the null hypothesis is rejected by **SHP*** or **MST***, the change-point estimate is given by the ratio cut as in Eq. **8** or by Eq. **6**, respectively. **SHP*** and **MST*** will also denote the non-Euclidean SHP-based change-point detection and the non-Euclidean MST-based change-point detection, respectively, for convenience.

To illustrate the different edge weights, we consider a mean shift model by setting **1**, with a change point at five. Note that the Euclidean MST-based change-point detection was implemented in the R package gSeg (9) with default settings

An illustration is given in Fig. 2, from which it can be seen that the change-point estimate by **SHP*** is the same as the true one, but the other three tests produce biased change-point estimates that are eight, six, and eight, respectively.

It seems that the model Eq. **1** would only be suitable for studying a bee’s landing. In fact, it is also the right model for studying a bee’s departure, because we may set **SHP*** and **MST*** remain unchanged.

## Consistency of the Non-Euclidean SHP-Based Test

Consider **SHP*** given above. We will show that it is consistent with fixed N and

### Assumption 1.

*Suppose that* *as* *, where* *and*

### Assumption 2.

*There exists an* *, such that*

We remark that *Assumption 1* stems from the bee’s flower visitation example, in which the following are met: (*i*) a bee is quite large compared with other small insects, such as ants (*ii*) a bee is not much smaller than the flower (*iii*) the color of a bee is different from that of the flower (*iv*) the whole flower has almost the same color (*Assumption 2* comes from ref. 6.

The following theorem shows that **SHP*** is consistent.

### Theorem.

*Under* Assumptions 1 *and* 2*, for a predefined positive number* α*, the power of the non-Euclidean SHP-based test of significance level* α *converges to one as*

### Proof:

By Eq. **3**,**1**, *Assumption 1*. Thus, we have

We remark that the conditions in Eq. **1** may be relaxed by letting *Assumption 1*, the test can still be shown to be consistent.

It is noted that the change-point estimate given by using the ratio cut as in ref. 6 has the same error bound as given in theorem 2 in ref. 6.

## Data Examples

### Simulations.

For simple presentation, we only carry out the simulation studies for the model Eq. **1**, with

In Table 1, the simulated type I errors for **SHP*** are compared based on 1,000 simulations, with **1**, where the critical values are taken from table 1 of ref. 6, and

To examine the powers of **MST**, **SHP**, **MST***, and **SHP***, 200 simulations are carried out for **MST**, **SHP**, **MST***, and **SHP*** in the simulation study.

It can be seen from Fig. 3*A* that the powers of **MST*** and **SHP*** monotonically increase as d increases, which suggests that both of them may be consistent. Compared with the powers of both **MST** and **SHP**, both MST* and SHP* have much better performances, especially when the change point is located at *A* also reveals that both **MST** and **SHP** may not be consistent.

Further comparisons are carried out for each of the four change-point estimates based on **MST**, **MST***, **SHP**, and **SHP***. Fig. 3*B* displays the boxplots of these estimates for each of the dimensions

It can be seen from Fig. 3*B* that both change-point estimates given by non-Euclidean graph-based methods, which are comparable, outperform both change-point estimates by Euclidean graph-based methods when d increases. It is noted that the power and change-point estimate based on **MST** and **MST*** are dependent on the constraints of

### Video Data.

We still use the bees’ flower visitation example for demonstration. As we described before, relatively smaller insects, such as ants, may also visit the flowers unexpectedly. We analyze the original video 09-10-2010_15h_49_27.17.mpg (faculty.tru.ca/xshi/09-10-2010_15h_49_27.17.mpg); some of 49 frames extracted from the video data are shown in Fig. 1 with one frame per second. The details of the experiment that produced the data are given in the work by Lihoreau et al. (1). As shown in Fig. 1, there were only some ants in the first and fourth frames. Our aim is to locate the frame where a bee appears.

To proceed, we first convert the three intensities of the R, G, and B components (three-dimensional array) to one intensity of the grayscale (one-dimensional array). As reviewed by Kanan and Cottrell (11), there are two popular methods for the conversion. The simpler one is to convert three intensities by their average as follows (12):

The quality of the videos was variable depending on climatic conditions, such as light. To have the same contrast, we make the same-scale transformations on the pixel values **10** or **11** of 49 frames:**3** using *A* displays *B*. Thus, there are two change-point estimates at 4 and 41 (*A*. As a matter of fact, 4 and 41 are the true change points corresponding to the landing and departure times, respectively, of a bee.

If we replace **SHP***, which are displayed in Fig. 4 *C* and *D*. This suggests that the impact of different weighted algorithms for converting RGB to grayscale may be negligible.

The application of **MST*** yields the same change-point estimates that are detected by **SHP***, which are displayed in Fig. 4 *E*–*H*. However, if we apply **MST** or **SHP**, their performances are not satisfactory, as shown in Fig. 5 *A* and *B* for *C* and *D* for

## Discussion and Conclusions

Using the bees’ flower visitation example as a demonstration, two non-Euclidean graph-based change-point tests are developed for high-dimensional data with random interferences. The proposed non-Euclidean SHP-based change-point test **SHP*** is not only distribution-free but also, consistent. For analyzing the video data, we first extract it to a sequence of frames and make the same-scale transformation of the data; then, we apply **SHP*** to detect the bee’s landing and departure times. It is noted that the performance of the resulting change-point estimate depends on the number of frames per second. We have limited the discussion to the case that there is, at most, one bee on the flower. If we remove this assumption and allow multiple bees to visit a flower, the following are possible cases: (*i*) they show up at almost same time, or (*ii*) they visit the flower at different times. For both cases *i* and *ii*, **SHP*** can be used to find out the landing of the first bee and departure of the last bee. We remark that we may also use **MST***; however, its performance is heavily dependent on the constraints

The model Eq. **1** can be modified to adapt to other video data examples containing random interferences, which can be used to remove the informationless data. The change-point detection method can be given similarly as above. The details are omitted.

## Acknowledgments

We thank Dr. Mathieu Lihoreau for data sharing. The research is partially supported by the Natural Sciences and Engineering Research Council of Canada.

## Footnotes

- ↵
^{1}To whom correspondence may be addressed. Email: xshi{at}tru.ca, wuyh{at}mathstat.yorku.ca, or crr1{at}psu.edu.

Author contributions: X.S., Y.W., and C.R.R. designed research; X.S., Y.W., and C.R.R. performed research; X.S. analyzed data; and X.S., Y.W., and C.R.R. wrote the paper.

Reviewers: M.C., Carleton University; and R.L., Pennsylvania State University.

The authors declare no conflict of interest.

Published under the PNAS license.

## References

- ↵
- ↵
- Cho H,
- Fryzlewicz P

- ↵
- Wang T,
- Samworth RJ

- ↵
- ↵
- Chen H,
- Zhang N

- ↵
- Shi X,
- Wu Y,
- Rao CR

- ↵
- ↵
- ↵
- Chen H,
- Zhang N

- ↵
- Fontenla M

- ↵
- ↵
- Jack K

- ↵
- Pratt W

## Citation Manager Formats

## Article Classifications

- Physical Sciences
- Statistics