ApacheCon North America 2020 Talk 1


by Shekhar Prasad Rajak — Posted on September 30, 2020


ApacheCon @Home 2020

Title: Running ML algorithms with ML tools available in Apache Ecosystem


Libraries that expose machine learning algorithms through high-level methods are important, but training a model effectively with less time and fewer resources, especially for a custom algorithm of our own, matters even more.

Machine learning technology changes every day, so let’s spend time on how researchers and software developers can leverage the powerful features provided by Apache libraries and frameworks.

In this talk we will focus on the Apache libraries and frameworks available for distributed training and for large-scale, low-cost data transfer throughout the model training life cycle.

Fundamentals of, and motivation behind, the following Apache projects:

  • Apache Spark MLlib: Simplifies large-scale machine learning pipelines using Spark's distributed, memory-based architecture. A strong choice for building and experimenting with new algorithms.

  • Apache MXNet: A lean, flexible, and ultra-scalable deep learning framework that supports state-of-the-art deep learning models.

  • Apache SINGA: Provides an intelligent database system and distributed deep learning, partitioning the model and data across the nodes of a cluster to parallelize training.

  • Apache Ignite: A distributed database, caching, and processing platform designed to store and compute on large volumes of data across a cluster of nodes, which can be very useful for performing distributed training and inference instantly without massive data transfers.

  • Apache Mahout: A distributed linear algebra framework that supports multiple distributed backends, such as Apache Spark, and lets data scientists quickly implement algorithms and statistical analysis of data.
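To make the idea of partitioning data across the nodes of a cluster concrete, here is a minimal pure-Python sketch of synchronous data-parallel training. It does not use the API of any of the projects above: the "nodes" are simulated as list shards in one process, and the names `partition`, `local_gradient`, and `train`, the toy dataset, and the learning rate are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair


def partition(data: List[Sample], num_nodes: int) -> List[List[Sample]]:
    """Split the dataset round-robin into one shard per (simulated) node."""
    return [data[i::num_nodes] for i in range(num_nodes)]


def local_gradient(shard: List[Sample], w: float, b: float) -> Tuple[float, float]:
    """Gradient of the mean squared error of y ~ w*x + b on a single shard."""
    n = len(shard)
    grad_w = sum(2.0 * (w * x + b - y) * x for x, y in shard) / n
    grad_b = sum(2.0 * (w * x + b - y) for x, y in shard) / n
    return grad_w, grad_b


def train(data: List[Sample], num_nodes: int = 4,
          lr: float = 0.02, epochs: int = 2000) -> Tuple[float, float]:
    """Synchronous data-parallel gradient descent.

    Each node computes a gradient on its own shard; a coordinator averages
    them and updates the shared parameters. Assumes equal-size shards, so
    the average of shard gradients equals the full-data gradient.
    """
    shards = partition(data, num_nodes)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        # On a real cluster this loop would run in parallel, one task per node.
        grads = [local_gradient(shard, w, b) for shard in shards]
        w -= lr * sum(g[0] for g in grads) / num_nodes
        b -= lr * sum(g[1] for g in grads) / num_nodes
    return w, b


# Hypothetical toy dataset generated from y = 3x + 1.
data = [(float(x), 3.0 * x + 1.0) for x in range(8)]
w, b = train(data)
print(f"w={w:.3f}, b={b:.3f}")
```

Only the two gradient values travel between a node and the coordinator in each step, never the raw samples, which is the property that keeps data transfer small when the shards are large.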

A practical guide to the above Apache projects, focusing on the following points:

  • Data processing

  • Implementing existing and custom ML algorithms

  • Tuning, scaling up, and finally deploying and optimising them using Apache cluster management tools and/or Kubernetes

  • Performance and benchmarks with Kubernetes

  • Handling large-scale batch, streaming, and real-time data processing

  • Caching data in memory for faster ML predictions
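As a small illustration of in-memory caching for faster predictions, the sketch below wraps a stand-in model with Python's standard `functools.lru_cache`. The model, its `WEIGHTS`, and the function names are hypothetical. A distributed cache such as Apache Ignite would play a similar role across a cluster, but its API is not shown here.

```python
from functools import lru_cache
from typing import Tuple

WEIGHTS = (0.5, -1.2, 2.0)   # hypothetical weights of an already trained linear model
model_calls = {"count": 0}   # counts how often the "expensive" model actually runs


def expensive_predict(features: Tuple[float, ...]) -> float:
    """Stand-in for a slow model call (for example, a remote inference service)."""
    model_calls["count"] += 1
    return sum(weight * x for weight, x in zip(WEIGHTS, features))


@lru_cache(maxsize=1024)
def cached_predict(features: Tuple[float, ...]) -> float:
    """Serve repeated requests from memory; features must be hashable (a tuple)."""
    return expensive_predict(features)


first = cached_predict((1.0, 2.0, 3.0))    # cache miss: the model runs
second = cached_predict((1.0, 2.0, 3.0))   # cache hit: answered from memory
print(model_calls["count"], cached_predict.cache_info())
```

A cache like this only helps when identical feature vectors recur, and cached entries need to be cleared (here with `cached_predict.cache_clear()`) whenever the model is retrained.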
