C++
C++

Contents

Introduction

libfolding is initially a C++ library which implements the folding test of unimodality (FTU). It is designed to work on streaming data but you can also perform the test on a batch dataset. This library is based on Armadillo for linear algebra.

Installation

Manual installation

As usual, you just have to clone the repo

git clone https://github.com/asiffer/libfolding.git

And then build and install it

cd libfolding/
make
make install

During the installation, you can set the location of the library and its headers. By default, there are put in /usr/lib and /usr/include/libfolding. You can change it with the variable INSTALL_LIB_DIR and INSTALL_HEAD_DIR.

Installation through debian package

The easiest way is to use the debian package. You just need to add the following ppa:

sudo add-apt-repository ppa:asiffer/libfolding
sudo apt-get update

Then you can install the library

sudo apt install libfolding

and the headers (if desired)

sudo apt install libfolding-dev

Getting started

Batch example

The first thing you want to do is probably to perform the FTU on a dataset. As libfolding is built upon libarmadillo (for linear algebra), you need to use its objects to load your data. An example is given below.

// file batch_iris.cpp
#include "folding.h"

int main(int argc, const char* argv[]) {
    arma::mat X; // declare the matrix object
    // load the 'iris' dataset
    X.load("../data/iris.csv",arma::csv_ascii); 
    X.shed_col(4); // remove the species columns
    BatchFolding bf(X.t()); // perform the FTU
    bf.print(); // print results
    return 0;
}

So, you need to use the arma::mat object provided by libarmadillo to import and manipulate data. Then, you may notice that the FTU is performed on the transposed matrix. Actually, as Armadillo uses column-major ordering convention, we do consider a column as an observation. On linux system you might compile as follows:

g++ -std=c++11 -Wall -pedantic -I/usr/include/libfolding -o ./batch_iris  ./batch_iris.cpp -lfolding -larmadillo

Finally, by running ./batch_iris, the output may look like this:

Folding Test of Unimodality
Number of observations: 151 (dim: 4)
Computation time: 261 µs (~ 578544 it/s)
Folding Statistics: 3.414
        The distribution is then unimodal (with p-value: 0.000e+00)
Folding pivot
   4.4659
   2.9085
   3.6206
   1.8279

In a nutshell, the test claims that the distribution is rather unimodal. Details about the output of the FTU are given here.

The BatchFolding object does not offer more methods but you can access to the data printed above.

bool unimodal; // the FTU result
double Phi; // the folding statistics
double p_value; // the p-value of the test (lower is better)
arma::vec s2star; // the folding pivot
long int usec; // the time to perform the test (~ debug information)

Streaming example

One great feature of the FTU is that it can be computed/updated incrementally over a sliding window. Then it is possible to monitor over time the unimodality character of streaming data.

The following example in more theoretical and deals about a drifting mode. Data are initially under two modes which are initially at the same position (0). Then a mode drifts to the right while the other one remains the same. In this example, data are initially unimodal and become multimodal.

In this example, we want to perform the FTU several times on this stream to check if we monitor the modality change (unimodality -> multimodality).

For the next part we will use the following headers.

#include "folding.h"
#include <iostream>
#include <sstream>
#include <iomanip>
#include <random>

The generation of the data is given by the code below.

/**
 * @brief   Generates N streaming points X_1, ... X_n such that 
 *          X_i is drawn from (1-ratio)*N(0,1) + ratio*N(i*speed, 1)
 * @param   N     The number of data to generate
 * @param   ratio The proportion of data inside the drifiting mode 
 *                (0.5 means that the modes are balanced)
 * @param   speed The drifting speed
 * @returns A 1xN armadillo matrix
*/
arma::mat binormal_with_drift(size_t N, double ratio = 0.5, double speed = 1e-3) {
    // random device
    std::random_device rd;
    std::default_random_engine gen(rd());
    std::uniform_real_distribution<double> u(0, 1);

    // fill the data
    arma::mat X(1, N, arma::fill::zeros);
    for (size_t i = 0; i < N; i++) {
        X.col(i) = arma::randn(1, 1); // N(0,1) -> first mode (fixed)
        if (u(gen) < ratio) {
            X.col(i) += speed * i; // N(speed*i , 1) -> second mode (drifting)
        }
    }

    return X;
}

The object StreamFolding is designed to carry out the folding test of unimodality within a sliding window. At any time we can do the test and check how the data behave in this time slot.

int main(int argc, const char *argv[]) {

   arma::arma_rng::set_seed_random();

   // data parameters
   size_t N = 100000;
   double ratio = 0.5;
   double speed = 1e-4;

   // window size
   size_t win_size = 2000;

   // data
   arma::mat X = binormal_with_drift(N, ratio, speed);

   // initialization
   StreamFolding sf(win_size);

   // temp or side variables
   double p;
   double Phi;
   bool u;
   auto mod = (size_t) (N / 50);

   const int wi = 9;
   const int wphi = 8;
   const int wp = 12;

   // Printing the results header
   std::cout << std::right;
   std::cout << setw(wi) << "Iteration";
   std::cout << setw(wphi) << "Phi";
   std::cout << setw(wp) << "p-value" << std::endl;
   std::cout.fill('-');
   std::cout << setw(wi + wphi + wp) << '-' << std::endl;
   std::cout.fill(' ');

   // loop over the observations
   for (size_t i = 0; i < N; i++) {
      sf.update(X.col(i));

      // breakpoint to perform the test
      if ((i > 0) && (i % mod == 0)) {
         // compute the test!
         Phi = sf.folding_test(&u, &p);

         // printing stuff
         std::cout << std::fixed << std::setprecision(4);
         std::cout << setw(wi) << i;
         std::cout << setw(wphi) << Phi;
         std::cout << std::scientific << setw(wp) << p << std::endl;
      }
   }
   return 0;
}

We compile it as before. The execution result may look like this:

Iteration     Phi     p-value
-----------------------------
     5000  1.4418  0.0000e+00
    10000  1.4753  0.0000e+00
    15000  1.3866  0.0000e+00
    20000  1.3149  9.0961e-13
    25000  1.1633  1.5735e-06
    30000  1.0396  3.3863e-01
    35000  0.9137  9.2496e-03
    40000  0.7511  9.7316e-11
    45000  0.6506  9.0961e-13
    50000  0.5474  0.0000e+00
    55000  0.4672  0.0000e+00
    60000  0.3944  0.0000e+00
    65000  0.3341  0.0000e+00
    70000  0.3111  0.0000e+00
    75000  0.2662  0.0000e+00
    80000  0.2470  0.0000e+00
    85000  0.2263  0.0000e+00
    90000  0.1865  0.0000e+00
    95000  0.1765  0.0000e+00

Obviously, the folding statistics (Phi) decreases over time (unimodal -> multimodal).

The modality change occurs around iteration 30000: as the speed is set to $10^{-4}$, it means that the second mode is centered at $x=3$ at this time. Actually, this moment visually corresponds to the modes split.

Close to 1 the output of the test is naturally less significant (higher p-value).

  • C++
  • R
  • ···