Data analysis and predictive analytics today are driven by large-scale distributed deployments of complex pipelines that guide data cleaning, model training, and evaluation. In this work, we focus on modelling such a pipeline framework and on providing algorithms built on top of basic abstractions that are fundamental to stream processing. We design a streaming machine learning pipeline as a series of stages, such as model building, concept drift detection, and continuous evaluation. As a proof of concept, we build our prototype on Apache Flink, a distributed data processing system with streaming capabilities, together with a state-of-the-art implementation of a variation of the Vertical Hoeffding Tree (VHT), a distributed decision tree classification algorithm. Furthermore, we compare our version of VHT with the current state-of-the-art implementations on distributed data processing systems. Our experimental results on real-world data sets show significant performance benefits of our pipeline while maintaining low classification error. We believe that this pipeline framework can offer a good baseline for a full-fledged implementation of streaming algorithms that can work in parallel.
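To make the staged design concrete, the following is a minimal single-process sketch of how model building, concept drift detection, and continuous evaluation might be chained in a test-then-train loop. It is not the Flink-based prototype described above: the names `MajorityClassModel`, `WindowDriftDetector`, and `run_pipeline` are hypothetical, the majority-class learner is only a stand-in for VHT, and the window-based drift rule is an assumed placeholder rather than the detector used in this work.

```python
from collections import Counter, deque


class MajorityClassModel:
    """Stand-in learner (assumption): predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        if not self.counts:
            return None  # no prediction before the first training example
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        self.counts[y] += 1


class WindowDriftDetector:
    """Illustrative rule (assumption): flag drift when the error rate over the
    last `window` examples exceeds the overall error rate by more than `margin`."""

    def __init__(self, window=50, margin=0.3):
        self.recent = deque(maxlen=window)
        self.errors = 0
        self.seen = 0
        self.margin = margin

    def update(self, error):
        self.recent.append(error)
        self.errors += error
        self.seen += 1
        if len(self.recent) < self.recent.maxlen:
            return False  # wait until the window is full
        recent_rate = sum(self.recent) / len(self.recent)
        overall_rate = self.errors / self.seen
        return recent_rate - overall_rate > self.margin


def run_pipeline(stream, window=50, margin=0.3):
    """Process (features, label) pairs one at a time: evaluate, check drift, train."""
    model = MajorityClassModel()
    detector = WindowDriftDetector(window, margin)
    mistakes = 0
    total = 0
    drift_points = []
    for i, (x, y) in enumerate(stream):
        # Continuous evaluation: score the example before learning from it.
        error = int(model.predict(x) != y)
        mistakes += error
        total += 1
        # Concept drift detection: on drift, discard the model and the detector state.
        if detector.update(error):
            drift_points.append(i)
            model = MajorityClassModel()
            detector = WindowDriftDetector(window, margin)
        # Model building: incremental update with the labelled example.
        model.learn(x, y)
    error_rate = mistakes / total if total else 0.0
    return error_rate, drift_points
```

As a usage illustration, calling `run_pipeline` on a synthetic stream whose label switches abruptly, for example 200 examples labelled `"a"` followed by 200 labelled `"b"`, should report a drift point shortly after the switch and a low overall error rate. In a distributed setting each stage would presumably run as a separate parallel operator; this sketch keeps everything in one loop purely for readability.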