**Location:** DIMA, TU Berlin, E-N 719

Skip to main content### New Directions for Data Mining on Sequences and Matrices

### Abstract

Prof. Dr. Rainer Gemulla, Universität Mannheim

31.10.2106, 4 pm

In this talk, I will summarize our current research at the Data Analytics chair, University of Mannheim. After a general overview, I'll talk in more detail about two specific directions: declarative sequential pattern mining on the one hand and matrix factorization on the other hand. I'll briefly summarize these two directions below.

Sequential pattern mining is a fundamental task in data mining. Given a database of sequences (e.g., customer transactions, event logs, or natural-language sentences), the goal of sequential pattern mining is to detect relevant and interesting patterns in the data. The stated goal of our research is to make sequential pattern mining useable, useful, and efficient. I will introduce Desq, a general-purpose system for declarative pattern mining. Desq allows data scientists to specify what they want, but abstracts away algorithmic and implementation details. Desq unifies many of the existing variants of sequential pattern mining---including length, gap, span, regular-expression, and hierarchy constraints---and additionally goes beyond what was possible before. I describe how Desq improves usability, how mining can be performed efficiently and scalably, and outline directions for maximizing the usefulness of the found patterns.

Matrix factorization methods have been called the Swiss Army Knife of data mining. In general, matrix factorization methods represent each row (e.g., objects) and each column (e.g., attributes) of a data matrix with latent feature vectors (or "embeddings"). They are an effective tool for tasks such as denoising, compression, imputation of missing data, clustering, link prediction, and more. In this talk, I'll focus on our recent work on factorizing large Boolean matrices. Here we often have to make a choice between using expensive combinatorial methods that retain the discrete nature of the data and using continuous methods that can be more efficient but destroy the discrete structure. I present an alternative approach that first computes a continuous factorization and subsequently applies a rounding procedure to obtain a discrete representation. I discuss our current answers to questions such as what can be gained by rounding, whether this approach achieves lower reconstruction errors, and how hard it is to obtain a good factorization. A key concept to approach these questions is the notion of the "rounding rank" of a binary matrix, which has relationships to linear classification, dimensionality reduction, and nested matrices.

**Location:** DIMA, TU Berlin, E-N 719