
APACHE SPARK

What is Apache Spark?

Apache Spark is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Unified engine for large-scale data analytics.
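
As a minimal illustration (assuming the pyspark package is installed locally), the sketch below starts a single-node session and runs a tiny DataFrame computation. Pointing the master URL at YARN, Kubernetes, or Mesos instead of local[*] runs the same code on a cluster.

from pyspark.sql import SparkSession

# Start a local, single-node Spark session; on a cluster, the master
# URL would point at YARN, Kubernetes, or Mesos instead of local[*].
spark = (SparkSession.builder
         .appName("hello-spark")
         .master("local[*]")
         .getOrCreate())

# A tiny DataFrame, just to show the engine executing a computation.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])
print(df.count())  # 3

spark.stop()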

  • What are the four main problem areas of Big Data processing? Data Collection, Data Storage, Data Processing, and Data Access.
  • Select the correct statements (all of the following are correct):
  • Horizontal scalability allows you to start small and add more resources and capacity as your requirements grow.
  • Apache Spark is a distributed data processing tool.
  • Spark can run on a YARN cluster, a Kubernetes cluster, or a Mesos cluster.
  • Spark recommends using the DataFrame APIs and SQL for data processing (see the first sketch after this list).
  • The Spark APIs are available in Scala, Java, Python, and R.
  • The Spark engine is a distributed computing engine and framework.
  • Why does Spark not recommend using the Core APIs? They are difficult to learn and use, and they lack some performance optimizations (see the second sketch after this list).
  • Which Google whitepaper was implemented as HDFS? The Google File System.
  • The core offering of HDFS is to provide fault-tolerant data storage and access on a horizontally scalable distributed cluster.
  • The data warehouse was expensive and vertically scalable.
  • Data Lakes can be scaled by adding more machines to the cluster.
  • A Data Lake can store data in HDFS, Amazon S3, Microsoft Azure Blob Storage, and Google Cloud Storage.
  • Data Ingestion means collecting data from sources and storing it in raw format (see the third sketch after this list).
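
To illustrate the DataFrame-and-SQL recommendation, here is a minimal sketch (the data and column names are made up for illustration) that expresses the same aggregation through the DataFrame API and through Spark SQL; both compile to the same optimized plan.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("df-vs-sql")
         .master("local[*]")
         .getOrCreate())

# Hypothetical sales data, just for illustration.
sales = spark.createDataFrame(
    [("US", 100.0), ("US", 250.0), ("IN", 80.0)],
    ["country", "amount"],
)

# DataFrame API: declarative transformations the optimizer can rewrite.
by_country = sales.groupBy("country").agg(F.sum("amount").alias("total"))
by_country.show()

# The same query through Spark SQL.
sales.createOrReplaceTempView("sales")
spark.sql("SELECT country, SUM(amount) AS total FROM sales GROUP BY country").show()

spark.stop()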

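The point about the Core (RDD) APIs can also be seen in code: in this rough sketch, the RDD version spells out the key-value mechanics by hand, while the DataFrame version leaves the execution plan to Spark's optimizer.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("rdd-vs-df")
         .master("local[*]")
         .getOrCreate())
sc = spark.sparkContext

pairs = [("a", 1), ("b", 2), ("a", 3)]

# Core (RDD) API: manual key-value mechanics, no optimizer help.
rdd_sums = sc.parallelize(pairs).reduceByKey(lambda x, y: x + y)
print(sorted(rdd_sums.collect()))  # [('a', 4), ('b', 2)]

# DataFrame API: the same aggregation, planned by the Catalyst optimizer.
df_sums = (spark.createDataFrame(pairs, ["key", "value"])
           .groupBy("key").agg(F.sum("value").alias("sum")))
df_sums.show()

spark.stop()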

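For the Data Lake and Data Ingestion items, a hedged sketch of the read-and-land step: the directories and bucket name below are hypothetical, and writing to an s3a:// path additionally requires the hadoop-aws connector and credentials to be configured.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingestion-sketch").getOrCreate()

# Hypothetical landing directory where source systems drop files.
raw = spark.read.json("/incoming/orders/")

# Store the data unchanged (raw format) in the lake's raw zone.
# The same writer API works for HDFS, S3, Azure Blob, or GCS paths,
# provided the matching connector jars and credentials are configured.
raw.write.mode("append").json("s3a://my-data-lake/raw/orders/")

spark.stop()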