Thank you for your time. In this post we would like to share the 3 S’s of Spark. Most of us already know the 3 V’s of Big Data: Volume, Variety, and Velocity, to which Veracity and Value are often added.
Big Data is defined as a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
Big Data describes an all-inclusive information management strategy that includes and integrates many new types of data and data management alongside traditional data. While many of the techniques used to process and analyze these data types have existed for some time, it has been the massive explosion of data and lower-cost computing models that have encouraged broader adoption. Big Data also brought two foundational storage and processing technologies to the fore: Apache Hadoop and NoSQL databases. To name a few of the tools involved: Hadoop, Spark, Cassandra, MongoDB, Neo4j, Titan, and many more.
Spark is a framework for performing general data analytics on a distributed computing cluster such as Hadoop. It provides in-memory computation, which speeds up data processing compared with MapReduce. It can run on top of an existing Hadoop cluster and access the Hadoop data store (HDFS). It can also process structured data in Hive and streaming data from HDFS, Flume, Kafka, and Twitter.
Apache Spark is an open source processing engine built around speed, ease of use, and analytics. If you have large amounts of data that require low-latency processing that a typical MapReduce program cannot provide, Spark is the alternative. Spark can run up to 100 times faster than MapReduce for iterative algorithms or interactive data mining. Spark provides in-memory cluster computing for lightning-fast speed and supports Java, Scala, and Python APIs for ease of development.
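To see why keeping data in memory helps iterative algorithms in particular, here is a small toy model in plain Python. It is not Spark code: `ToyDataset` and `iterative_mean` are hypothetical names made up for this illustration, and the "disk" is only a counter. The sketch simply counts how many full scans from storage an iterative job needs with and without an in-memory copy. In Spark itself, the comparable idea is keeping a dataset in memory across iterations (for example with `cache()`).

```python
# Toy model, NOT Spark: counts simulated disk scans for an iterative job
# that runs with and without an in-memory cache.

class ToyDataset:
    """Hypothetical stand-in for a dataset stored on disk."""

    def __init__(self, values):
        self._values = list(values)
        self.disk_reads = 0   # number of simulated full scans from disk
        self._cache = None    # in-memory copy, filled on the first cached read

    def read(self, use_cache):
        # Serve from memory if a cached copy already exists.
        if use_cache and self._cache is not None:
            return self._cache
        self.disk_reads += 1
        data = list(self._values)
        if use_cache:
            self._cache = data
        return data


def iterative_mean(dataset, iterations, use_cache):
    """Estimate the mean by repeated gradient steps, scanning the data once per step."""
    estimate = 0.0
    for _ in range(iterations):
        data = dataset.read(use_cache)
        gradient = sum(estimate - x for x in data) / len(data)
        estimate -= 0.5 * gradient
    return estimate


no_cache = ToyDataset(range(1, 101))
cached = ToyDataset(range(1, 101))

result_no_cache = iterative_mean(no_cache, 10, use_cache=False)
result_cached = iterative_mean(cached, 10, use_cache=True)

print(no_cache.disk_reads, cached.disk_reads)  # 10 1
```

Both runs compute the same estimate, but the uncached run goes back to storage on every iteration while the cached run reads the data once. Real clusters are far more complicated, so treat this only as an intuition for where the speedup on iterative workloads comes from.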
Spark combines SQL, streaming, and complex analytics seamlessly in the same application to handle a wide range of data processing scenarios. Spark runs on Hadoop, on Mesos, in standalone mode, or in the cloud. It can access diverse data sources such as HDFS, Cassandra, HBase, and S3.
So what are the 3 S’s of Spark? They are the three key reasons to choose Spark:
Simplicity: Spark’s capabilities are accessible via a set of rich APIs, all designed specifically for interacting quickly and easily with data at scale. These APIs are well documented, and structured in a way that makes it straightforward for data scientists and application developers to quickly put Spark to work.
Speed: Spark is designed for speed, operating both in memory and on disk. In 2014, Spark was used to win the Daytona GraySort benchmarking challenge, processing 100 terabytes of data stored on solid-state drives in just 23 minutes. The previous winner used Hadoop on a different cluster configuration and took 72 minutes. This win was the result of processing a static data set. Spark’s performance can be even greater when supporting interactive queries of data stored in memory, with claims that Spark can be 100 times faster than Hadoop’s MapReduce in these situations.
Support: Spark supports a range of programming languages, including Java, Python, R, and Scala. Although often closely associated with Hadoop’s underlying storage system, HDFS, Spark includes native support for tight integration with a number of leading storage solutions in the Hadoop ecosystem and beyond. Additionally, the Apache Spark community is large, active, and international. A growing set of commercial providers including Databricks, IBM, and all of the main Hadoop vendors deliver comprehensive support for Spark-based solutions.
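The sort benchmark figures quoted under Speed can be put into perspective with a little arithmetic. The short Python sketch below uses only the numbers from this post (100 terabytes, 23 minutes, 72 minutes). It assumes, for the sake of comparison, that the earlier Hadoop run sorted roughly the same amount of data, and it ignores the different cluster configurations, so it compares wall-clock time only. The function name `throughput_tb_per_min` is made up for this example.

```python
# Figures quoted in the Speed paragraph of this post.
DATA_TB = 100         # terabytes sorted
SPARK_MINUTES = 23    # Spark run, 2014
HADOOP_MINUTES = 72   # previous Hadoop-based winner
# Assumption: the Hadoop run sorted about the same 100 TB.


def throughput_tb_per_min(data_tb, minutes):
    """Average sort throughput in terabytes per minute."""
    return data_tb / minutes


spark_rate = throughput_tb_per_min(DATA_TB, SPARK_MINUTES)
hadoop_rate = throughput_tb_per_min(DATA_TB, HADOOP_MINUTES)
speedup = HADOOP_MINUTES / SPARK_MINUTES

print(f"Spark:  {spark_rate:.2f} TB/min")     # Spark:  4.35 TB/min
print(f"Hadoop: {hadoop_rate:.2f} TB/min")    # Hadoop: 1.39 TB/min
print(f"Wall-clock speedup: {speedup:.2f}x")  # Wall-clock speedup: 3.13x
```

On these numbers the on-disk sort was about three times faster in wall-clock terms. That is well short of the "100 times" figure, which is claimed for in-memory and interactive workloads rather than for a one-pass sort of a static data set.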
Reference: Open Source Community, mapr.com & aptuz blog.
Please feel free to comment and suggest.