Spark is a recent addition to the data scientist's toolkit. In this article we discuss why it can be the tool of choice for data scientists and how it helps them solve day-to-day problems on data analysis projects.
Spark is a general-purpose cluster computing platform, designed mainly to speed up interactive and iterative computation. It supports in-memory computing, can use data stored in Hadoop, and can run on clusters managed by Mesos. Spark jobs are often reported to be 10 to 20 times faster than the equivalent Hadoop MapReduce jobs, which is in itself a strong reason for it to become a tool of choice.
Data scientists mostly use R, Python and SQL for data wrangling and statistical modeling. The Spark API for Python, also known as PySpark, gives them easy access to Spark's features. The Spark libraries (Spark SQL, Spark Streaming, MLlib and GraphX) are an added advantage when working on data science applications. MLlib implements highly parallelized versions of major machine learning algorithms such as logistic regression, support vector machines, decision trees for classification and regression, latent Dirichlet allocation and random forests, which data scientists use for modeling. Spark Streaming helps build real-time big data analytics applications, which are increasingly in demand. The Spark shell allows interactive data analysis through the Python API, which helps in exploring data and tuning ML algorithms.
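To make the point about in-memory, iterative computation concrete, here is a minimal PySpark sketch, not a definitive implementation. It fits a logistic regression by plain gradient descent over an RDD that is cached once and reused on every pass. The function names (gradient, add_vectors, train, run_local), the toy data, the step size and the "local[2]" master are all assumptions made for illustration; in practice MLlib's own logistic regression would normally be the better choice.

```python
import math


def gradient(point, weights):
    """Gradient of the logistic loss for one (label, features) pair, label in {0, 1}."""
    label, features = point
    margin = sum(w * x for w, x in zip(weights, features))
    prediction = 1.0 / (1.0 + math.exp(-margin))
    return [(prediction - label) * x for x in features]


def add_vectors(a, b):
    """Element-wise sum of two equal-length lists."""
    return [x + y for x, y in zip(a, b)]


def train(points, num_features, iterations=50, step=0.5):
    """Gradient descent over `points`, an RDD of (label, features) pairs.

    Every iteration scans the same RDD, so caching it in memory avoids
    re-reading or recomputing the data on each pass.
    """
    n = points.count()
    weights = [0.0] * num_features
    for _ in range(iterations):
        # map computes per-point gradients in parallel; reduce sums them up
        total = points.map(lambda p: gradient(p, weights)).reduce(add_vectors)
        weights = [w - step * g / n for w, g in zip(weights, total)]
    return weights


def run_local():
    """Hypothetical driver: trains on four toy points using a local Spark master."""
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "cached-logistic-regression")
    try:
        # Toy data (an assumption): first feature is a constant bias term.
        data = [(0.0, [1.0, 0.5]), (0.0, [1.0, 1.0]),
                (1.0, [1.0, 3.0]), (1.0, [1.0, 3.5])]
        points = sc.parallelize(data).cache()  # keep the dataset in memory
        return train(points, num_features=2)
    finally:
        sc.stop()
```

With PySpark installed, calling run_local() returns the fitted weights. The same functions can be pasted into the PySpark shell, where a SparkContext named sc already exists, and train can then be called directly on any cached RDD of (label, features) pairs.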
Spark's active forums and communities give data scientists a place to turn when they get stuck while building solutions. The recent Spark 1.4.x release also includes the SparkR library, which gives data science professionals a further reason to use Spark in future projects.
Asad Ahamad is a data science professional with more than five years of experience. He has held various positions in different organizations in India. He holds a master's degree in Industrial Mathematics with Computer Application from Jamia Millia Islamia, New Delhi. He admires mathematics and is always looking for business problems to which he can apply his knowledge. He has worked on data mining, machine learning and data science projects across multiple domains. He mainly uses R and Python for data wrangling and modeling, and he is fond of open source tools for data analysis. He is an active social media user. Feel free to connect with him on Twitter @asadtaj88.