Topic: SCALABLE DATA SCIENCE AND MACHINE LEARNING
Abstract: Distributed machine learning and data science have their own unique challenges and approaches.
R has been a de facto language for statistics, but with the rise of big data, many of its ML and statistical functions do not scale.
In this talk, we will introduce R4ML, a new open source scalable machine learning framework built on Apache Spark and Apache SystemML, which allows R scripts to invoke custom algorithms developed in Apache SystemML. R4ML integrates seamlessly with SparkR, so data scientists can use the best features of SparkR and SystemML together in the same scripts. In addition, the R4ML package provides a number of useful new R functions that simplify common data cleaning and statistical analysis tasks.
Our meeting will begin with an overview of the R4ML R package, its API, supported canned algorithms, and integration with Spark and SystemML. In this tutorial-style presentation, we will walk through:
– A small example of creating a custom algorithm
– A typical end-to-end machine learning flow using the built-in scalable algorithms to gain deeper insights, including:
  – Exploratory and analytical analysis
  – Typical preprocessing and data cleaning
  – Dimension reduction
  – Classification using SVM
  – Cross-validation
The talk will conclude with pointers on how the audience can try out R4ML.
Bio: Alok Singh is a Principal Engineer at the IBM Spark Technology Center, where he leads the R4ML project. He has built and architected multiple analytical frameworks and implemented various machine learning algorithms. His interest is in creating big data and scalable machine learning software, and he has worked in the software industry for the last 20 years in various roles.