
What is Vertical Data Mining?

Most modern analytics systems work by considering each data point sequentially, one at a time. In the case of a bank, each data point might represent a new loan application; in the case of an image analyst, it might be a pixel in a satellite image. The analytics must execute once for every data point. Unfortunately, if an image grows from one million pixels to two million pixels, you are doubling the amount of work required (and potentially quadrupling the execution time!). We call this point-driven classification - each point in the dataset requires a loop through an analytics engine. As a result, as datasets grow, analytics execution times grow as well. Two million points, two million loops. Or more. This is the root of the scalability challenges for modern data mining systems.

Treeminer literally turns the problem on its side - we look at data not in rows, but in thin vertical strips. Consider the examples above: a simple mortgage application may have a handful of fields, or attributes: name, social security number, credit score, property zip code, and value. If, instead of looping over each mortgage application to perform the analytics, you loop over the attributes, then you have a loop of five, no matter how many applications you have. Similarly, in the case of an image, each attribute is perhaps a color band - again, as images grow, the number of color bands does not. This is the basis for the scalability advantage of vertical data mining.
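As a rough illustration of the row-versus-column idea, the sketch below pivots a few hypothetical mortgage records into one list per attribute and then loops over the attributes rather than the applications. The field names, the sample values, and the `to_columns` helper are assumptions made for this example, not Treeminer's data format. Each per-column operation still touches every value in that column; the point is only that the outer loop no longer grows with the number of applications.

```python
# Hypothetical mortgage applications, stored row by row.
rows = [
    {"credit_score": 720, "zip_code": "02139", "value": 450_000},
    {"credit_score": 610, "zip_code": "94110", "value": 820_000},
    {"credit_score": 680, "zip_code": "60614", "value": 300_000},
]


def to_columns(records):
    """Pivot a list of row dicts into one list per attribute."""
    columns = {}
    for record in records:
        for name, val in record.items():
            columns.setdefault(name, []).append(val)
    return columns


columns = to_columns(rows)

# The outer loop runs once per attribute, not once per application.
summary = {name: (min(vals), max(vals)) for name, vals in columns.items()}
```

Adding more applications lengthens each column, but `summary` is still built in three iterations of the outer loop, one per attribute.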

Vertical Algebra

For vertical data mining to work, the complex algebra behind the actual analytics must be computable on columns rather than rows. It turns out that another special property of our thin vertical strips is that by combining them with simple logical operators, like AND, OR, XOR, NOT, etc., you can build an incredibly diverse and complex mathematics that can support very sophisticated data mining processes. Returning to the mortgage example above, we apply these logical operators to the 5 vertical attributes - that is, we take each attribute, combine them with logical operators, and get a result.
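One well-known way to build arithmetic-style predicates out of nothing but logical operators is a bit-sliced layout, sketched below. This is an assumption about what a "thin vertical strip" could look like, not a description of Treeminer's actual algebra. Each attribute is split into one Python integer per bit position, where bit i of the integer belongs to row i. The helpers `bit_slices` and `greater_equal`, the sample data, and the thresholds are all hypothetical; the sketch also assumes non-negative integers that fit in `width` bits.

```python
def bit_slices(values, width):
    """Return `width` integers; slice b has bit i set when bit b of values[i] is 1."""
    slices = []
    for b in range(width):
        s = 0
        for i, v in enumerate(values):
            if (v >> b) & 1:
                s |= 1 << i
        slices.append(s)
    return slices


def greater_equal(slices, threshold, n_rows):
    """Mask of rows with value >= threshold, using only AND, OR, NOT on slices.

    Assumes threshold fits in len(slices) bits.
    """
    all_rows = (1 << n_rows) - 1
    gt = 0          # rows already known to be strictly greater
    eq = all_rows   # rows equal to the threshold on the bits seen so far
    for b in reversed(range(len(slices))):
        s = slices[b]
        if (threshold >> b) & 1:
            eq &= s                 # a 0 here means the row is smaller
        else:
            gt |= eq & s            # a 1 here means the row is greater
            eq &= ~s & all_rows
    return gt | eq


# Hypothetical data: four applications, property values in thousands.
credit_score = [720, 610, 680, 750]
value_k = [450, 820, 300, 900]
n_rows = len(credit_score)
all_rows = (1 << n_rows) - 1

credit_slices = bit_slices(credit_score, width=10)
value_slices = bit_slices(value_k, width=10)

credit_ok = greater_equal(credit_slices, 650, n_rows)              # score >= 650
value_ok = ~greater_equal(value_slices, 501, n_rows) & all_rows    # value <= 500
selected = credit_ok & value_ok                                    # both conditions

selected_rows = [i for i in range(n_rows) if (selected >> i) & 1]
```

Here `selected_rows` is `[0, 2]`. The comparison loop runs once per bit position (ten times in this example) regardless of how many rows there are; each step is a single logical operation over the entire column.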

By working over a dataset vertically, you get the analytics results for the whole dataset at once for each class you are looking for.

Using our vertical algebra, Treeminer designs and develops in-house analytics methods designed specifically to leverage the advantages of our vertical data organization.

Classification

Treeminer's classification algorithm, Oblique, operates in a similar fashion to many classification methods - a separating hyperplane is constructed, and points are identified to be on one side of the hyperplane or the other, determining predicted class membership. So if our algorithm is logically similar to alternative methods, why is it faster?

The key is our vertical algebra: all distance calculations (e.g., the distances of the points from the hyperplane) can be calculated across all data points in the dataset in a single logical operation, rather than once for each point in the dataset. A mask is created for the entire dataset indicating which side of the hyperplane each point is on. The result: not only significant speed advantages, but more importantly, speed advantages that only grow as the dataset gets larger.
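The mask idea can be sketched in a simplified, column-at-a-time form. The code below is not the Oblique algorithm and does not use Treeminer's vertical algebra; it assumes an ordinary linear hyperplane w·x + b = 0 and plain Python lists as columns. The function name `side_mask`, the weights, the bias, and the data are hypothetical. It accumulates each attribute's contribution to the signed score for all rows at once, then derives one mask for the whole dataset.

```python
def side_mask(columns, weights, bias):
    """Return one boolean per row: True when w.x + bias >= 0."""
    n_rows = len(next(iter(columns.values())))
    scores = [bias] * n_rows
    # Loop over attributes; each pass updates the score of every row.
    for name, weight in weights.items():
        col = columns[name]
        scores = [s + weight * x for s, x in zip(scores, col)]
    return [s >= 0 for s in scores]


# Hypothetical two-attribute dataset with four points.
columns = {
    "x": [1.0, 3.0, 0.0, 4.0],
    "y": [1.0, 0.0, 3.0, 4.0],
}
weights = {"x": 1.0, "y": 1.0}   # assumed hyperplane normal
bias = -2.5                      # assumed offset

mask = side_mask(columns, weights, bias)
```

With these values, `mask` is `[False, True, True, True]`: only the first point lies on the negative side of the plane x + y = 2.5. The outer loop has two iterations, one per attribute, however many points the dataset holds.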

Anomaly Detection

Anomaly detection algorithms identify points in a dataset that are unlike the others. As such, they can indicate potential fraud (transactions that seem out of place), network intrusions (activity that does not follow a normal pattern), potential component failures (sensor readings that are not normal), or other similar conditions. Using Treeminer's One-Class SPS method, anomalous behavior can be detected significantly faster and, as a result, action can be taken more quickly than with existing methods.

Clustering

Clustering methods group data points that are similar to each other. Using vertical methods, clusters can be isolated across all points in the dataset at once, rather than point by point. Again, the result is significant performance gains without sacrificing accuracy.

For more details on Treeminer's high performance, scalable analytics, please download our White Paper, An Introduction to Vertical Data Mining.