In principle, machine learning comprises techniques for analyzing large amounts of data.
However, academic research has so far focused mainly on efficient learning algorithms that hold the data in main memory and whose implementations are hand-optimized. Technically speaking, machine-learning methods can be roughly divided into three categories:
- methods that can be formulated as optimization problems (e.g., a prediction error between a model and the data is minimized as a function of the parameters),
- methods (often derived from Bayesian theory) that require the computation of integrals, and
- other iterative methods (e.g., the perceptron algorithm).
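To make the third category concrete, the perceptron algorithm mentioned above can be sketched in a few lines. This is a minimal illustrative implementation; the toy data, learning rate, and epoch count are assumptions, not part of the original text.

```python
# Minimal perceptron sketch (an example of the "iterative methods"
# category). Data, learning rate, and epoch count are illustrative.

def perceptron(samples, labels, epochs=10, lr=1.0):
    """Learn a linear separator (w, b) via the classic perceptron update."""
    dim = len(samples[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):  # labels y are in {-1, +1}
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified: shift the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Linearly separable toy data: +1 above the diagonal, -1 below it.
X = [(2.0, 3.0), (1.0, 4.0), (3.0, 1.0), (4.0, 2.0)]
Y = [1, 1, -1, -1]
w, b = perceptron(X, Y)
```

Note that, unlike the optimization-based methods of the first category, no explicit objective function is minimized here; the update rule itself defines the iteration.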
The algorithms are often formulated in the language of linear algebra or in the form of statistical or probabilistic models. To achieve scalability, it is necessary to find a computation scheme that can be evaluated efficiently, either by decomposing the problem into parallelizable sub-problems or by a suitable approximation. This can be done with a range of approaches, from mathematical optimization, such as plain gradient descent, up to complex state-of-the-art optimization algorithms, such as bundle methods; in Bayesian methods, examples are variational methods and sampling.

Especially with Big Data, the approach of applying efficient hand-optimized algorithms to data held in main memory reaches its limits. The scalability aspect (in the sense of massive parallelization, and of seamless migration from in-memory algorithms to secondary-storage algorithms during the execution of a data-analysis program, depending on problem size and data distribution) is still neglected, because such datasets exceed the capabilities of current machine-learning methods. The main problems are that the modifications needed to decouple dependencies in the basic algorithm, a prerequisite for scalability, are non-trivial and difficult to express in the usual formalisms. Furthermore, a suitable connection to data-management concepts and abstractions is missing.
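The decomposition idea can be illustrated with plain gradient descent: for many loss functions, the gradient is a sum over data points, so partial gradients can be computed independently per data chunk (e.g., per worker, or per block fetched from secondary storage) and then merged. The following sketch uses a one-dimensional least-squares model; the chunking and toy data are illustrative assumptions.

```python
# Sketch of a parallelizable decomposition: the least-squares gradient
# is a sum over data points, so it splits into independent per-chunk
# partial gradients that are merged before each update step.

def chunk_gradient(chunk, w):
    """Gradient of 0.5 * (w*x - y)^2 summed over one data chunk."""
    g = 0.0
    for x, y in chunk:
        g += (w * x - y) * x
    return g

def gradient_step(chunks, w, lr=0.01):
    # The per-chunk gradients are independent of each other, so this
    # sum could be computed by parallel workers and then reduced.
    total = sum(chunk_gradient(c, w) for c in chunks)
    return w - lr * total

# Toy data with true relation y = 2*x, split into two chunks.
chunks = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]
w = 0.0
for _ in range(200):
    w = gradient_step(chunks, w)
# w converges toward 2.0
```

The key point is that the decomposition changes only where the partial sums are computed, not the result of the update, which is why this pattern maps naturally onto parallel and out-of-core execution.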
As a first step, massive parallelization has been shown to be an appropriate means for some learning methods. Approaches that stem directly from an online-learning setting, in which the data is processed as a stream, are also promising.
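The streaming setting can be sketched as follows: the model is updated one example at a time, so the data never needs to reside in main memory as a whole. This minimal sketch uses online stochastic gradient descent on a one-dimensional linear model; the simulated stream and step size are assumptions for illustration.

```python
# Minimal online-learning sketch: one SGD update per streamed example,
# so memory use is constant in the stream length. The simulated data
# source and the step size are illustrative assumptions.

def data_stream(n):
    """Simulate a stream of (x, y) pairs with true relation y = 3*x."""
    for i in range(1, n + 1):
        x = (i % 10) / 10.0 + 0.1
        yield x, 3.0 * x

def online_sgd(stream, lr=0.1):
    w = 0.0
    for x, y in stream:
        w += lr * (y - w * x) * x  # one gradient step per example
    return w

w = online_sgd(data_stream(5000))
# w converges toward 3.0
```

Because each example is consumed and discarded immediately, the same loop works whether the stream comes from memory, disk, or a network source.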