What is Apache Spark™?

Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.

Unified engine for large-scale data analytics.

What are the four main problem areas of Bigdata processing Data Collection, Data Storage, Data Processing, and Data Access
Select the correct statements
Horizontal Scalability allows you to start small and add more resources and capacity as your requirement grows
Apache Spark is a Distributed data processing tool
Spark can run on YARN Cluster, Kubernetes Cluster, and Mesos Cluster
Spark recommends using Dataframe APIs and SQL for data processing
The Spark APIs are available in Scala, Java, python and R
The spark engine is distributed computing engine and framework
Why Spark does not recommend using Core APIs:
They are difficult to learn and use
They lack some performance optimizations
Which Google Whitepaper was implemented as HDFS Google File System
The core offering of HDFS is Provide fault-tolerant data storage and access on a horizontally scalable distributed cluster
Datawarehouse was expensive and vertically scalable
DataLakes can be scaled by adding more machines to the cluster
A DataLake can store data in HDFS, Amazon S3, Microsoft Azure blob, and Google cloud storage
DataIngestion is Collect data from sources and store it in raw format