class: monash-bg-blue center middle hide-slide-number

<div class="bg-black white" style="width:45%;right:0;bottom:0;padding-left:5px;border: solid 4px white;margin: auto;">
<i class="fas fa-exclamation-circle"></i> These slides are best viewed in Chrome and occasionally need to be refreshed if they do not load properly.
<!-- See here for <a href=index.pdf>PDF <i class="fas fa-file-pdf"></i></a>. -->
</div>
<br>

.white[Press the **right arrow/space** to progress to the next slide!]

---
count: false
background-image: url(images/cover.png)
background-size: cover
class: hide-slide-number title-slide

.item.center[

# <span style="text-shadow: 2px 2px 50px white;">Manifold Learning with<br/>Approximate Nearest Neighbors</span>

]

.center.shade_monash_blue.animated.bounceInUp.slower[
<!-- <br><br> -->
<!-- ## <span style="color: #ccf2ff; text-shadow: 10px 10px 100px white;"></span> -->
<br>

Presented by Fan Cheng (Monash University)

Rob J Hyndman (Monash University)<br/>Anastasios Panagiotelis (University of Sydney)

Department of Econometrics and Business Statistics

<img src="images/monash-one-line-reversed.png" style="width:500px">
<i class="fas fa-envelope faa-float animated "></i>
Fan.Cheng@monash.edu
<i class="fab fa-twitter faa-float animated faa-fast "></i>
@fanchengfc
<br><br>

.bottom_abs.width100.bg-black[
Early Career & Student Statisticians Conference 2021, July 30
]

]

???

<div class="grid-row" style="grid: 1fr / 2fr;">

.item.center[

# <span style="text-shadow: 2px 2px 30px white;">Manifold Learning with<br/>Approximate Nearest Neighbors</span>

<!-- # .monash-blue.outline-text[Manifold Learning with<br/>Approximate Nearest Neighbors] -->

## <span style="color:;text-shadow: 2px 2px 30px black;"></span>

]

</div>

<style>
p.caption {
  font-size: 0.8em;
}
</style>

---
# Outline

- Motivation
- Manifold learning
- Approximate nearest neighbors
- Embedding quality measures
- MNIST data experiments
- Application to smart meter data
- Conclusions

---
class: split-two

# Irish Smart Meter Data

.row[
.split-five[
.row[
]
.row[.content[
- Problem of interest: electricity usage patterns of households
]]
.row[.content[
- Half-hourly data for 535 days across 3639 households
]]
.row[.content[
- Identify typical and anomalous households from their demand distributions
]]
.row[.content[
- Empirical discrete distributions: `\(48 \times 7=336\)` distributions per household
]]
]
]
.row[
.split-two[
.column[
]
.column[
<img src="images/electricity_hdrbox_3id_7dow.png" width="85%" style="display: block; margin: auto;" />
]
]
]

???
## How to estimate empirical distributions?

- *Objective*: approximate the distribution of electricity demand `\(d_{i,t}\)` for household `\(i\)` over time period `\(t\)`
- *Steps*:
  - Group all data and construct an evenly spaced grid `\(\kappa_0,\kappa_1,\dots,\kappa_G\)` of size `\(G=200\)` with `\(\kappa_0=\underset{i,t}{\min} d_{i,t}\)` and `\(\kappa_G=\underset{i,t}{\max} d_{i,t}\)` as endpoints.
  - Compute `\(\pi_{i,g}=(1/T)\sum_t I(\kappa_{g-1}<d_{i,t}<\kappa_g)\)`, where `\(T\)` is the total number of time periods.
  - The vector `\(\pi_i\)` represents a probability mass function over the discrete bins.
- Why quantiles rather than a kernel density estimate?
  - Missing values; highly skewed data; non-negative support

---
class: inverse, middle

.center[# Why manifold learning?]

.pull-left[.content[
- `\(3639\)` households ( `\(i\)` )
- `\(336\)` distributions ( `\(j\)` ) per household
- Each distribution `\(\boldsymbol{p}_{ij} = (p_{i,j,1}, p_{i,j,2}, \dots, p_{i,j,l})^\prime\)` is a vector of length `\(l=200\)`
- `\(\mathbb{R}^{200} \rightarrow \mathbb{R}^{2}\)`
]]

---
class: split-two

.row[
.center.monash-bg-blue.white[
# Manifold learning
]

- *Manifold*: a topological space that is locally *homeomorphic* to Euclidean space at each point
- Manifold learning: a non-linear dimension reduction tool that *preserves the local geometric structure (distances)*
<!-- - Statistical manifold: manifold with each observation as a PDF -->
]
.row[.center[
![Swiss Roll](images/manifold.gif)
]
]

???
Wikipedia: In mathematics, a **manifold** is a topological space that locally resembles Euclidean space near each point.

**Manifold**: the data set lies on a low-dimensional subspace embedded in a high-dimensional space

## Isometry

- Manifold learning assumption: recover the geometry (distances)
- Isometry: a manifold learning algorithm `\(f:\mathcal{M}\rightarrow \mathbb{R}^r\)` is an isometry if
$$ d_\mathcal{M}(p_i, p_j) = d(f(p_i),f(p_j)), \text{ for all } i,j $$
- To preserve the geometry of the manifold we need to preserve distances.
- Distances between two locations measured on a flat world map are not the true distances on the globe.

---
class: split-two
count: true

.row[
.center.monash-bg-blue.white[
# Manifold learning
]

- *Manifold*: a topological space that is locally *homeomorphic* to Euclidean space at each point
- Manifold learning: a non-linear dimension reduction tool that *preserves the local geometric structure (distances)*
- **Statistical manifold**: a manifold where each observation is a PDF
]
.row[
<img src="images/sr_unroll.png" width="80%" style="display: block; margin: auto;" />
]

???
Manifold learning extracts a representation of data lying along a low-dimensional manifold embedded in a high-dimensional space.

# Manifold Learning for a Single Household

- Similar times of day are grouped closely together
- Cyclical patterns are observed in the representation
- Three clusters roughly corresponding to three phases of a typical day: working during the day (08:00 - 17:00); recreation during the evening (17:00 - 00:00); sleeping (00:00 - 08:00)

<img src="images/tod_1id1row_1id336tow.png" width="70%" style="display: block; margin: auto;" />

The cyclical pattern in the representation reflects the fact that the first half-hours of the day (00:00, 00:30, 01:00) are temporally proximate to the last half-hours of the day (22:30, 23:00, 23:30) and are therefore similar.
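
As a concrete illustration of the discretisation described in the "How to estimate empirical distributions?" notes above, and of the neighbor search on the statistical manifold discussed on the following slides, here is a minimal R sketch. It is not the paper's implementation: `demand` is a hypothetical long-format data frame with columns `id` and `d` holding the demand values `\(d_{i,t}\)`, and the neighbor search uses the fact that the Hellinger distance between two pmfs equals the Euclidean distance between their element-wise square roots, up to a factor of `\(1/\sqrt{2}\)`, so any Euclidean (approximate) nearest neighbor library can be used.

```r
# Minimal sketch, not the paper's code. `demand` is a hypothetical long-format
# data frame with one row per household-period: columns `id` and `d` (= d_{i,t}).
library(RcppHNSW)   # approximate nearest neighbors via HNSW graphs

G <- 200
breaks <- seq(min(demand$d), max(demand$d), length.out = G + 1)  # kappa_0, ..., kappa_G

# One length-G probability vector pi_i per household, stacked into an N x G matrix
pmat <- t(sapply(split(demand$d, demand$id), function(d) {
  counts <- tabulate(cut(d, breaks, include.lowest = TRUE, labels = FALSE), nbins = G)
  counts / length(d)       # pi_{i,g} = proportion of periods falling in bin g
}))

# Hellinger distance = Euclidean distance between sqrt(pmf)s (up to 1/sqrt(2)),
# so an approximate Euclidean K-NN search on sqrt(pmat) yields Hellinger neighbors
knn <- hnsw_knn(sqrt(pmat), k = 11, distance = "euclidean")
nn_idx <- knn$idx[, -1]    # drop the first column: each point is returned as its own neighbor
```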
---
# Manifold learning

- Manifold learning methods

  <span style="font-size: 85%;">Isomap, LLE, Laplacian eigenmaps, Hessian LLE, etc.</span>

- Most methods involve a nearest neighbor computation

  <span style="font-size: 85%;">Pairwise Euclidean distance</span>

- Computational complexity: `\(\mathcal{O}(N^2)\)`

- Statistical manifold learning: we consider discrete-domained distributions

  **Distances between distributions**: Hellinger/Total Variation distance

???
Nearest neighbor searching requires computing all pairwise distances. This works for small `\(N\)`, but becomes a computational bottleneck for large `\(N\)`.

In a similar fashion to Lee, Abbott & Araman (2007), a discrete-domained distribution is approximated for each element of the statistical manifold and the values of the probability mass function (pmf) are stacked into vectors.

1. The first step is to group all data together and construct an evenly spaced grid `\(\kappa_0,\kappa_1,\dots,\kappa_G\)` of size `\(G=200\)` with `\(\kappa_0=\underset{i,t}{\min} d_{i,t}\)` and `\(\kappa_G=\underset{i,t}{\max} d_{i,t}\)` as endpoints. This provides 200 bins which can be used to construct a discrete distribution approximation to each `\(F_i\)` over time.
2. Each `\(\pi_{i,g}\)` is computed as the average over the two nearby time periods for smoothing.
3. The resulting vector `\(\pi_i\)` represents a probability mass function over the discrete bins.

---
class: inverse, center, middle

# But what about **8 million** households in VIC?

## Computational efficiency matters

---
class: split-two

# <span style="color: white; font-size: 90%; text-shadow: 10px 10px 50px black;"> Manifold learning with approximate nearest neighbors </span>

.pull-left[.content[
- We propose to improve efficiency by searching for approximate nearest neighbors (ANN) instead of exact ones.
- **Goal**: substantially decrease computation time without losing much accuracy (i.e. keep the embedding quality high)
- ANN methods: k-d trees, Annoy, HNSW
]]
.pull-right[
<div class="figure" style="text-align: center">
<img src="images/ann.png" alt="Illustration of k-d tree approximate nearest neighbors compared to exact ones." width="50%" />
<p class="caption">Illustration of k-d tree approximate nearest neighbors compared to exact ones.</p>
</div>
]

???
k-d trees are one of the classic ways to find ANN. Annoy and HNSW have been introduced in recent years, and both typically work better than k-d trees. [ann-benchmark.com]

---
# Embedding quality measures

- Trustworthiness (Venna & Kaski 2006) captures one type of error, where

  <span style="font-size: 85%;"> the embedding point `\(y_j\)` is among the `\(K\)`-nearest neighbors of `\(y_i\)` ( `\(V_K(i)\)` ) but </span>
  <span style="font-size: 85%;"> the data point `\(x_j\)` is not among the `\(K\)`-nearest neighbors of `\(x_i\)` ( `\(U_K(i)\)` ) </span>

- Using all such points for each `\(i\)`, the trustworthiness of the embedding is

`$$M_{T}(K)=1-\frac{2}{G_{K}} \sum_{i=1}^{N} \sum\limits_{\begin{subarray}{c}j \in V_{K}(i)\\j \notin U_{K}(i)\end{subarray}}(\rho_{ij}-K), \ M_{T}(K) \in [0,1]$$`

- `\(G_{K}\)` is a normalizing factor calculated from `\(N\)` and `\(K\)`

- `\(\rho_{ij}\)` is the neighborhood ranking of `\(x_j\)` with respect to `\(x_i\)`

???
A higher value indicates a higher-quality representation.

`\(U_K(i)\)` was defined as the set of points `\(j\)` such that `\(x_j\)` is one of the `\(K\)`-nearest neighbors of `\(x_i\)`. We now also define `\(V_K(i)\)` as the nearest neighborhood of observation `\(i\)` in the output space, i.e. the set of points `\(j\)` such that `\(y_j\)` is one of the `\(K\)`-nearest neighbors of `\(y_i\)`.
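
A minimal R sketch (not the paper's implementation) of the trustworthiness formula on the slide, computed by brute force for small `\(N\)`: `X` and `Y` are hypothetical matrices holding the input data and the embedding, one row per observation, and the normalizing factor takes the standard Venna & Kaski form `\(G_K = NK(2N - 3K - 1)\)`, valid for `\(K < N/2\)`.

```r
# Minimal sketch, not the paper's code: trustworthiness M_T(K) by brute force.
# `X` (N x p input data) and `Y` (N x d embedding) are hypothetical matrices.
trustworthiness <- function(X, Y, K) {
  N  <- nrow(X)
  Dx <- as.matrix(dist(X))            # input-space distances delta_ij
  Dy <- as.matrix(dist(Y))            # embedding-space distances
  rho <- t(apply(Dx, 1, rank)) - 1    # neighborhood ranking: rho_ii = 0, nearest neighbor has rho = 1
  total <- 0
  for (i in seq_len(N)) {
    U <- order(Dx[i, ])[2:(K + 1)]    # U_K(i): K nearest neighbors in the input space
    V <- order(Dy[i, ])[2:(K + 1)]    # V_K(i): K nearest neighbors in the embedding
    false_nbrs <- setdiff(V, U)       # j in V_K(i) but not in U_K(i)
    total <- total + sum(rho[i, false_nbrs] - K)
  }
  G_K <- N * K * (2 * N - 3 * K - 1)  # normalizing factor, assuming K < N/2
  1 - 2 * total / G_K
}
```

For the statistical manifold, `X` can hold the square roots of the pmf vectors so that Euclidean distances correspond to Hellinger distances; the exact neighbor sets `\(U_K(i)\)` computed here are exactly what the ANN methods (k-d trees, Annoy, HNSW) approximate.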
We define the neighborhood ranking of `\(x_j\)` with respect to `\(x_i\)` as `\(\rho_{ij} =|\{\ell: \delta_{i \ell}<\delta_{i j}\} |\)`. For example, if `\(x_j\)` is the nearest neighbor of `\(x_i\)`, then `\(\rho_{ij}=1\)` since only `\(\delta_{ii}<\delta_{ij}\)`, where we assume without loss of generality that all input points are distinct.

---
# MNIST embedding quality vs computation time

<div class="figure" style="text-align: center">
<img src="images/compareml.png" alt="Trustworthiness comparison of three ANN methods for four ML methods on the MNIST dataset." width="70%" />
<p class="caption">Trustworthiness comparison of three ANN methods for four ML methods on the MNIST dataset.</p>
</div>

---
class: inverse, center, middle

# Application to electricity data

---
# For all households: anomaly detection

- Distances between distributions
- Highest density region plots (Hyndman, 1996)
- Anomalies: points with the lowest densities from a kernel density estimate
<!-- - Identify the most unusual observations (*"anomalies"*) with the lowest density -->

<div class="figure" style="text-align: center">
<img src="images/hdr10_compare4ml_kdtreeannoy_allids_nt50_3639id_336tow_201length.png" alt="High density region plot for all households from the Isomap and LLE embeddings." width="60%" />
<p class="caption">High density region plot for all households from the Isomap and LLE embeddings.</p>
</div>

---
# Anomaly detection
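
A minimal sketch (not the paper's code) of the approach on the previous slide, assuming a hypothetical matrix `emb` holding the 2-d embedding coordinates, one row per household: anomalies are the embedded points with the lowest kernel density estimates, and `hdrscatterplot()` from **hdrcde** draws the corresponding highest density region plot.

```r
library(ks)       # kernel density estimation
library(hdrcde)   # highest density region plots (Hyndman, 1996)

fhat <- kde(x = emb, eval.points = emb)    # density estimated at each embedded household
anomalies <- order(fhat$estimate)[1:10]    # the 10 lowest-density households

hdrscatterplot(emb[, 1], emb[, 2], noutliers = 10)  # HDR plot with outliers labelled
```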
.center[
<font size="5em">Quantile region plots of electricity demand against the time of week for one typical household and two anomalies.</font>
]

---
# Conclusions

- Apply a broad range of **approximate nearest neighbor methods** in manifold learning algorithms

  <span style="font-size: 80%;"> The use of ANN methods significantly reduces computation time while maintaining a high level of embedding quality across a range of accuracy measures </span>

- Propose the **Hellinger and Total Variation distances** to measure the distance between discrete probability distributions in manifold learning

- Use **highest density region plots** to identify both typical and anomalous times of week and households

- **Open questions**: the selection of tuning parameters in ANN methods; the choice of intrinsic dimension `\(d\)`; the choice of manifold learning algorithm

---
class: inverse, middle
# Thanks!
<!-- Slides<sup>1</sup> available at https://talks.fancheng.me/ECSSC2021 -->

.center[
<i class="fab fa-safari faa-float animated "></i>
: [fancheng.me](https://fancheng.me)
<i class="fab fa-github faa-float animated "></i>
: [@ffancheng/paper-mlann](https://github.com/ffancheng/paper-mlann)
<i class="fab fa-twitter faa-float animated "></i>
: [@fanchengfc](https://twitter.com/fanchengfc)
]

.footnote[
Slides created via the R package [**xaringan**](https://github.com/yihui/xaringan) with the [**ninja theme**](https://github.com/emitanaka/ninja-theme).
]