Spark Interview Questions And Answers -2020

This is my first blog and am going to start with current trending topic : Spark.
Apache Spark is a booming technology nowadays. Hence it is very important to know each and every aspect of Apache Spark as well as Spark Interview Questions.

Here I am going to share the top 35 Spark Questions with answers which are asked now a days in IT companies Interview.

All these questions are based on my personal experience by attending multiple IT companies Interview in the Year of 2020.

I assume, reader would have good understanding of all Spark concept prior going through this content. Answer to each question is very precise and can be referred by all Spark experienced folks who are looking for job change and hopefully it can help them to crack most of the Spark Interviews. Please note, spark fresher / beginner should understand spark concepts in details first before going through it.

So let's start:

Question-1: What is the difference between Mapreduce and Spark? OR How does Spark achieve 100x times faster performance than Mapreduce ?

Answer:
Main difference between Hadoop MapReduce and Spark processing are mentioned below:
1. MapReduce has to read from and write to a disk, as a result, it slows down the processing speed while Spark can do it in-memory. As a result, the speed of processing differs significantly ans hence Spark may be up to 100 times faster than MapReduce.
2. The volume of data processed also differs: Hadoop MapReduce is able to work with far larger data sets than Spark.
3. Map reduce process data in batch mode whereas with Spark Streaming, near real time data can also be processed
4. MapReduce is complex. As a result, we need to handle low-level APIs to process the data, which requires lots of hand coding whereas Spark process the data using high-level operators. It also provides rich APIs in Java, Scala, Python, and R.
5.Hadoop MapReduce is developed in Java whereas Spark is developed in Scala.
6. There is no interactive shell in Hadoop MapReduce. Spark provides REPL(Read Evaluate Print Loop).

Question-2: What is the difference between RDD and Dataframe? OR What is there in DF but not in RDD?

Answer:

There are lot many differences between the two. Few are mentioned below:

1. RDDs are Resilient distributed Datasets where data is stored in form of objects. It has no schema. Whereas in Dataframes, data is organized into named columns similar to table structure.

~~2. RDDs are immutable collections and Dataframes are mutable in nature.~~

3. RDD is low level of API, whereas Dataframe is high level API.

4. No inbuilt optimization engine is available in RDD. But Optimization takes place in dataframes using catalyst optimizer.

5. RDDs are typed-safe hence can get analysis errors at compile time. On the other hand, a DataFrames are untyped and can lead to runtime errors.

6. RDD is slower in performing simple grouping and aggregation operations as compared to DataFrame.

Question-3: What is the difference between Dataframe and Datasets?

Answer:

1. Dataframe and Datasets both are high level APIs.

2. Data is organized into named columns both in Dataframe and Datasets. Basically, they are similar to tables in a relational database.

3. Dataset is an extension of DataFrame API that provides the functionality of – type-safe, object-oriented programming interface of the RDD API and performance benefits of the Catalyst query optimizer and off heap storage mechanism of a DataFrame API.

4. Datasets are DataFrames with the types of the columns defined. When converting a Dataset to DataFrame only the type info is lost otherwise the object is the same and vica versa i.e. you can convert a DataFrame to Dataset by defining the types of the columns in a case class.

5. Dataset brings the best of both worlds with a mix of relational (DataFrame) and functional (RDD) transformations.

6. If you are trying to access the column which does not exist in the table in such case Dataframe APIs does not support compile-time error. It detects attribute error only at runtime. whereas Datasets provides compile-time type safety.

Question-4: Define RDD in Spark.

Answer:

Spark uses RDD (Resilient Distributed Dataset) which is a fault-tolerant collection of elements that can be operated in parallel. RDDs are distributed collection of objects. Distributed means, each RDD is divided into multiple partitions. Each of these partitions can reside in memory or stored on disk of different machines in a cluster.

They are immutable (Read Only) in nature. You can’t change original RDD, but you can always transform it into different RDD with all changes you want

RDDs can be created by 2 ways:

1. Parallelizing existing collection(sc.parallelize())

2. Loading external dataset from HDFS(sc.textFile())

Question-5: What are RDD Operations in Spark?

Answer:

Spark RDD supports two types of Operations:

Transformations - are lazy operations on a RDD that create one or many new RDDs. It takes RDD as input and produces one or more RDD as output. Each time it creates new RDD when we apply any transformation.

After the transformation, the resultant RDD is always different from its parent RDD.

It can be smaller (e.g. filter, count, distinct, sample), bigger (e.g. flatMap(), union(), Cartesian()) or the same size (e.g. map).

Some frequently used RDD Transformations are - map(), flatmap(), filter(), union(), distinct(), reduceByKey(), groupByKey(),coalesce(),repartition(), join() etc.

Actions - Actions are the functions that perform some kind of computation over the transformed RDD and sends the computed result from executors to driver.

Actions triggers execution using lineage graph to load the data into original RDD, carry out all intermediate transformations and return final results to Driver program or write it out to file system.

Some frequently used RDD Actions are - reduce(func), collect(), count(), first(), take(), takeSample(), saveAsTextFile(), foreach(func) etc.

Question-6: What are Narrow and Wide transformations?

Answer:

Transformations can further be divided into 2 types:

Narrow transformation: A pipeline of operations that can be executed as one stage and does not require the data to be shuffled across the partitions —

For example, Map, flatmap, filter, sample, union, etc..

Wide transformation: Here each operation requires the data to be shuffled, henceforth for each wide transformation a new stage will be created —

For example, Intersection, distinct, reduceByKey, groupByKey, join, repartition, coalesce, etc..

Note: When compared to Narrow transformations, wider transformations are expensive operations due to shuffling.

Question-7: What are the Shared Variables in Spark?

Answer:

There are two different types of Shared Variables in spark -Broadcast Variable and Accumulator:

Broadcast Variable - Broadcast variables are read-only variables that are distributed across worker nodes in-memory instead of shipping a copy of data with tasks. The data broadcasted this way is cached in a serialized form and deserialized before running each task. They are used to cache a value in memory on all nodes. Generally small datasets can be broadcasted.

Accumulator - Accumulators are shared variables which are used to update the variables in parallel during execution and share the results from workers to the driver program running on master node. They are similar to counters in map-reduce. Are those variables that are used to perform associative and commutative operations such as counters or sums.

Question-8: What is the difference between Persist and Cache?

Answer:

Cache and Persist are optimization techniques for both iterative and interactive Spark applications to improve the performance of the jobs or applications.

Iterative means Reuse intermediate results. Whereas interactive means allowing a two-way flow of information.

Spark provides an optimization mechanism to store the intermediate results of an RDD, DataFrame, and Dataset, so they can be reused in subsequent stages.

Both caching and persisting are used to save the intermediate resultsets but, the difference among them is that cache() will cache the RDD into memory(MEMORY_ONLY), whereas persist(level) can cache in memory, on disk, or off-heap memory according to the caching strategy specified by level. Those storage levels are like MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY

Note:

- persist() without an argument is equivalent with cache().

- unpersist() can be used to Freeing up space from the Storage memory.

Question-9: What is Spark checkpointing and how it is different from Persist to disk?

Answer:

“When should I cache or checkpoint?”

- to determining if the results of a set of transformations can be reused for a very long time or not.

If the answer is yes, use checkpointing. If the answer is no, use caching.

Caching is extremely effective and more useful than checkpointing, as it’s typically much faster when used properly, when you have a lot of available memory, as RDDs and DataFrames can be massive in size, like giga or terabyte size. Essentially, caching will maintain the result of applying a set of transformations to an RDD or DataFrame so that those transformations will not have to be recomputed again when an additional transformation is applied or fault occurs within a RDD or DataFrame.

Checkpointing will persist the transformed RDD or DataFrame forever. Now you’re thinking “this is great, I can throw away old history for free and not have Spark need to redo transformations from time to time like caching.” Well not for free exactly. The main problem with checkpointing is that Spark must be able to persist any checkpoint RDD or DataFrame to HDFS which is slower and less flexible than caching. You also need to setup checkpointing to a location on HDFS, where a RDD or DataFrame’s transformations can be persisted, whereas caching is part of Spark’s implicit default setup.

Question-10: What is SparkSession and how it is different from SparkContext?

Answer:

Spark Context -

SparkContext is the entry point of Apache Spark functionality.

It allows your Spark Application to access Spark Cluster with the help of Resource Manager

To create SparkContext, first SparkConf should be made, which had all the cluster configs and parameters to create a Spark Context object.

Only one SparkContext may be active per JVM, its a expensive operation

Spark Session -

Spark session is a unified entry point of a spark application from Spark 2.0

Spark session is a combination of spark context, hive context and SQL context

Spark session provides with spark.implicits._ which is one of the most useful imports in all of the spark packages which comes in handy with a lot of implicit methods for converting Scala objects to Datasets and some other handy utils.

You can think of Spark Session as a data structure where the driver maintains all the information including the executor location and their status.

Question-11: What parameters to specify in spark-submit command -

Answer:

To submit a Spark job, below command has to be triggered:

Parameters can be passed as per your project configuration

spark-submit --num-executers a --executor-cores b --driver-memory c -- executor-memory d --class com.spark.examples.TestSparkJob --master yarn --deploy-mode cluster /path/to/examples.jar

Values of a,b,c,d can be substituted as below:

a= No of executors required. Eg: 15

b= No of cores per executor. 5 cores(default)

c= Driver memory. Eg: 1G

d= Executor memery. Eg: 2G

Question-12: What are the different types of Cluster manager available in Spark?

Answer:

Apache Spark supports four different cluster managers:

Apache YARN - YARN is the cluster manager for Hadoop. As of date, YARN is the most widely used cluster manager for Apache Spark.

Apache Mesos - Apache Mesos is another general-purpose cluster manager. If you are not using Hadoop, you might be using Mesos for your Spark cluster.

Kubernetes - it's a general purpose container orchestration platform from Google.

Standalone - Finally, the standalone. The Standalone is a simple and basic cluster manager that comes with Apache Spark and makes it easy to set up a Spark cluster very quickly. This is basically used during development.

No matter which cluster manager do we use, primarily, all of them delivers the same purpose.

Question-13: What are deploy modes in Spark?

Answer:

When you start an application, you have a choice to specify the execution mode, and there are three options - Local, Client & Cluster :

1. Client Mode -

In Client mode, driver starts on the local machine and then as soon as the driver create a Spark Session, a request goes to YARN resource manager to create a YARN application. The YARN resource manager starts an Application Master. For the client mode, the AM acts as an Executor Launcher. So, the YARN application master will reach out to YARN resource manager and request for further containers. The resource manager will allocate new containers, and the Application Master starts an executor in each container. After the initial setup, these executors directly communicate with the driver.

Client Deploy Mode

2. Cluster Mode -

In Cluster mode, the spark-submit utility will send a YARN application request to the YARN resource manager. The YARN resource manager starts an application master. And then, the driver starts in the AM container. That's where the client mode and cluster mode differs.

In the client mode, the YARN AM acts as an executor launcher, and the driver resides on your local machine, but in the cluster mode, the YARN AM starts the driver, and you don't have any dependency on your local computer. Once started, the driver will reach out to resource manager with a request for more Containers. The resource manager will allocate new Containers, and the driver starts an executor in each Container.

Cluster Deploy Mode

3. Local Mode -

Local mode is a for debugging purpose. The local mode doesn't use the cluster at all and everything runs in a single JVM on your local machine.

Question-14: What is the difference between executor & executor core?

Answer:

An Executor is a JVM process within a YARN container and an executor core is a thread within the JVM. Each task is processed by a thread. Each task work on a subset of the data. An Executor is dedicated to a specific Spark application and terminated when the application completes. A Spark program normally consists of many Executors, often working in parallel.

Executor Core defines number of concurrent tasks each executor can run. If we give executors cores very high, then spark will create very high number of tasks for each executor. These tasks will compete each other for resources and It will reduce data I/O throughput. The more cores we have, the more work we can do. In spark, this controls the number of parallel tasks an executor can run.

Question-15: What is Shuffling concept in Spark?

Answer:

Shuffling is a process of redistributing data across multiple partitions. Its a process of data transfer between stages. By default, shuffling doesn’t change the number of partitions, but their content. It is a costly and complex operation.

When an “Action” is called, Spark sends out tasks to the worker nodes. If the action is a reduction, data shuffling takes place.

When the DAGScheduler generates the physical execution plan from our logical plan it pipelines together all RDDs into one stage that are connected by a narrow dependency. Consequently, it cuts the logical plan between all RDDs with wide dependencies

In general, avoiding shuffle will make your program run faster. All shuffle data must be written to disk and then transferred over the network.

Repartition, Join, cogroup, distinct, and any of the *By or *ByKey transformations can result in shuffles.

map, filter and union generate a only stage (no shuffling).

Question-16: How to decide number of stages created in Spark job?

Answer:

DAG is a collection of stages in spark, which contains different task.

A Stage is a sequence of Tasks that don't require a Shuffle in-between Or you can say a Stage is collection of transformations which does not include shuffling.

When an wide transformation executed then data will shuffled between partitions. Due to this a new stage is created in DAG.

In spark, stages are of two types:

-ShuffleMapStage

-ResultStage

Basically, Number of stages to be created completely depends on shuffling, i.e. whenever you perform any transformation where Spark needs to shuffle the data by communicating to the other partitions, it creates other stages for such transformations. And the transformation does not require the shuffling of your data; it creates a single stage for it.

And the number of tasks to be created depends on your number of partitions. Lets say we have only two partitions, so each stage is divided into two tasks. And a single task runs on a single partition.

The number of tasks for a job is = ( no of your stages * no of your partitions )

Question-17: What happens when Spark job is submitted OR How Spark Internally Executes a Program?

Answer:

When we do a transformation on any RDD, it gives us a new RDD. But it does not start the execution of those transformations. The execution is performed only when an action is performed on the new RDD and gives us a final result.

So once you perform any action on an RDD, Spark context gives your program to the driver.

The driver creates the DAG (directed acyclic graph) or execution plan (job) for your program. Once the DAG is created, the driver divides this DAG into a number of stages. These stages are then divided into smaller tasks and all the tasks are given to the executors for execution on the worker node.

Question-18: Can output be directly stored without sending it back to driver?

Answer:

Yes. Code needs to be written in such a way that all explicit result collection at the driver should be avoided. We can very well delegate this task to one of the executors.

E.g., if we want to save the results to a particular file, either we can collect it at the driver or assign an executor to do that for you.

Question-19: What is coalesce and repartition used for ?

Answer:

These are the functions used to increase/decrease the number of partitions generated.

Coalesce is used to decrease the number of partitions without invoking Shuffling.It should be used if the number of output partitions is less than the input. It can trigger RDD shuffling depending on the shuffle flag which is disabled by default (i.e. false).

While Repartition function helps to increase or decrease the number of partitions by doing Shuffling of data.

Question-20: What is physical and logical plan optimization in Spark?

Answer:

Logical Plan:

In this phase, an RDD is created using a set of transformations, It keeps track of those transformations in the driver program by building a computing chain (a series of RDD)as a Graph of transformations to produce one RDD called a Lineage Graph.

Logical Plan is divided into three parts:

a. Unresolved Logical Plan OR Parsed Logical Plan

- When code is valid and the syntax is correct.

b. Resolved Logical Plan OR Analyzed Logical Plan OR Logical Plan

- This plan is then passed on to a “Catalyst Optimizer” which will apply its own rule and will try to optimize the plan.

c. Optimized Logical Plan

- Once our Optimized Logical Plan is created then further, Physical Plan is generated.

Catalyst Optimizer

Physical Plan:

It simply specifies how our Logical Plan is going to be executed on the cluster. It generates different kinds of execution strategies and then keeps comparing them in the “Cost Model”. Spark decides which partitions should be joined first (basically it decides the order of joining the partitions), the type of join, etc for better optimization.

Physical Plan is specific to Spark operation and for this it will do check up of multiple physical plans and decide the best optimal physical plan. And finally the Best Physical Plan runs in our cluster.

Question-21: How catalyst optimizer work internally?

Answer:

The Catalyst optimizer handles: analysis, logical optimization, physical planning, and code generation to compile parts of queries to Java byte code. Catalyst now supports both rule-based and cost-based optimization.

- Analyzing logical plan -- Rule based

- Logical plan optimization -- Rule based

- Physical planning -- Cost based

- Code generation to generate Java bytecode-- Rule based

*Refer Question(20) for more detailed answer on this.

Question-22: What are new features have Tungsten brought?

Answer:

Tungsten Engine has been introduced in Spark 2.x. The goal of Project Tungsten is to improve Spark execution by optimizing Spark jobs for CPU and memory efficiency. It does so by offering the following optimization features:

a. Memory Management and Binary Processing: to eliminate the overhead of JVM object model and garbage collection

- Tungsten is a Spark SQL component that provides increased performance by rewriting Spark operations in bytecode, at runtime.

b. Cache-aware Computation:

c. Code generation: For performance reasons, Spark tries to group together multiple operators inside a Whole-Stage CodeGen.

Question-23: What is Predicate Pushdown in Spark?

Answer:

Predicate pushdown is a technique to process only the required data. Predicates can be applied to SparkSQL by defining filters in where conditions. By using explain command to query we can check the query processing stages. If the query plan contains PushedFilter than the query is optimized to select only required data as every predicate returns either True or False. If there is no PushedFilter found in query plan than better is to cast the where condition. Predicate Pushdowns limits the number of files and partitions that SparkSQL reads while querying, thus reducing disk I/O. Querying on data in buckets with predicate push downs produce results faster with less shuffle.

*Predicate Pushdown does not work with all the datatypes in Parquet, rather when used with Int, pushes down the values.

Question-24: How to handle skewed data? What are the steps?

Answer:

Data skew means that data distribution is uneven or asymmetric. Data skew can severely downgrade performance of queries, especially those with joins.

Data skew is not an issue with Spark, rather it is a data problem.

- Common Symptoms of data skew are -

- Stuck stages & tasks

- Low utilization of CPU

- Out of memory errors

There are several ways to solve data skew problem.

- Data Broadcast - If we are doing a join operation on a skewed dataset one of the tricks is to increase the spark.sql.autoBroadcastJoinThreshold value so that smaller tables get broadcasted. This should be done to ensure sufficient driver and executor memory.

- Data processes - If there are too many null values in a join or group-by key they would skew the operation. Try to preprocess the null values with some random ids and handle them in the application

- Salting - The idea is to invent a new key, which guarantees an even distribution of data. In a SQL join operation, the join key is changed to redistribute data in an even manner so that processing for a partition does not take more time. This technique is called salting.

Salting is a technique where we will add random values to join key of one of the tables. In the other table, we need to replicate the rows to match the random keys.The idea is if the join condition is satisfied by key1 == key1, it should also get satisfied by key1_<salt> = key1_<salt>. The value of salt will help the dataset to be more evenly distributed.

Question-25: What are Spark optimization techniques? OR What are the best practices and optimization tips used in Spark?

Answer:

Optimization in Spark can be done in many ways. Few and important ones are mentioned below:

- Partitioning of data - If you join a very large table with a smaller table, pre partition the large table before joining with the smaller table to minimize the amount of data getting shuffled (i.e. only the data in the smaller table will be shuffled).

- Broadcast small dataset - Utilize Broadcast Joining for joining Smaller Datasets to Larger Ones. Broadcast when the data is less than 1 GB or 1 million rows. Load small dataset into memory for joining with large dataset to avoid shuffling.

- Using Accumulators

- Caching of data - Caching is the best technique for Apache Spark Optimization when we need some data again and again. But it is always not acceptable to cache data.

- Use DataFrame’s instead of RDDs

- Predicate Pushdown Optimization

- Avoid or minimize shuffle - Shuffling is an expensive operation since it involves disk I/O, data serialization, and network I/O. The primary goal when choosing an arrangement of operators is to reduce the number of shuffles and the amount of data shuffled.

- Skewed data: Skewing happens when you don’t distribute the data evenly across the cluster nodes. One or two nodes will be processing most of the data. For example, when you join two tables by keys and say 60% of the keys are say either null or a particular value (e.g. most traded stock id), then 60% of all the data will be processed by a single node. This will adversely impact performance in a distributed computing environment.

- Define Right number of Executors and Executor cores

Question-26: What Points to be taken care while writing Spark Application?

Answer:

- Prefer ReduceByKey than groupByKey

- Choosing Right File formats

- Handling multiple output files by using repartition/coalesce

- Load small dataset into memory for joining with large dataset to avoid shuffling using Broadcast variables.

- Spark applications should be continuously monitored using DAG(a data structure used in Spark that describes various stages of tasks in Graph format)

- Use the power of Tungsten - Tungsten is a Spark SQL component that provides increased performance by rewriting Spark operations in byte code, at runtime

Question-27: How DAG optimization takes place in Spark?

Answer:

DAGScheduler is the scheduling layer of Apache Spark that implements stage-oriented scheduling. It transforms a logical execution plan (i.e. RDD lineage of dependencies built using RDD transformations) to a physical execution plan (using stages).

The DAG scheduler splits the graph into multiple stages, the stages are created based on the transformations.

The narrow transformations will be grouped together into a single stage. Wide transformation define the boundary of 2 stages. Stages are separated by 2 shuffle operations.

The DAG scheduler will then submit the stages into the task scheduler. The number of tasks submitted depends on the number of partitions present in the textFile.

Once the Best Physical Plan is selected, it’s the time to generate the executable code (DAG of RDDs) for the query that is to be executed in a cluster in a distributed fashion. This process is called Codegen and that’s the job of Spark’s Tungsten Execution Engine.

- Spark creates DAG.

- Once action is triggered on the RDD, DAG is submitted to the DAGScheduler.

- DAGScheduler looks at RDD lineage and comes up with the best execution plan by dividing it into stages of the task

- Stages set is given to TaskScheduler and TaskScheduler will launch tasks through Cluster manager.

- TaskScheduler with help of cluster manager will check the data/resources availability in different nodes to execute the tasks. It will distribute the tasks to different executors.

Question-28: Why we prefer treereduce and treeaggregate over reduceByKey and aggregateByKey?

Answer:

In a regular reduce or aggregate functions in Spark (and the original MapReduce) all partitions have to send their reduced value to the driver machine, and that machine spends linear time on the number of partitions. It becomes a bottleneck when there are many partitions and the data from each partition is big.

So Spark has introduced a new aggregation communication pattern based on multi-level aggregation trees.

In this setup, data are combined partially on a small set of executors before they are sent to the driver, which dramatically reduces the load the driver has to deal with.

So, in treeReduce and in treeAggregate, the partitions talk to each other in a logarithmic number of rounds.

Also would like to mention difference between reduceByKey and treeReduce -

reduceByKey is only available on key-value pair RDDs, while treeReduce is a generalization of reduce operation on any RDD.

reduceByKey performs reduction for each key, resulting in an RDD; it is not an action but a transformation that returns a Shuffled RDD.

On the other hand, treeReduce perform the reduction in parallel using reduceByKey .

Question-29: In what cases Accumulators are reliable? OR If executors are crashed and we are using accumulators, then can we rely on accumulators output. If yes, why and if no, why?

Answer:

Accumulators are variables that are used for aggregating information across the executors. For example, this information can be used to find out how many records are corrupted or how many times a particular library API was called.

Spark automatically deals with failed or slow machines by re-executing failed or slow tasks.

Example :- if the node running a partition of a map() operation crashes, Spark will rerun it on another node; and even if the node does not crash but is simply much slower than other nodes, Spark can preemptively launch a “speculative” copy of the task on another node, and take its result if that finishes.

Even if no nodes fail,Spark may have to rerun a task to rebuild a cached value that falls out of memory.The net result is therefore that the same function may run multiple times on the same data depending on what happens on the cluster.

So Accumulators are truely reliable when they are present in an Action operation.

Accumulator updates are sent back to the driver when a task is successfully completed. So your accumulator results are guaranteed to be correct when you are certain that each task will have been executed exactly once and each task did as you expected.

Question-30: What are Stragglers? How to handle stragglers and how to detect them.

Answer:

“Stragglers” are tasks within a stage that take much longer to execute than other tasks.

Stragglers in general are those tasks that take more time to complete than their peers. This could happen due to many reasons such as load imbalance, I/O blocks, garbage collections, etc.

Once a Spark job is submitted, it gets broken down into multiple stages and each stage spawns many tasks which will be executed on each node. One or more stage might be dependent on the previous stages. So, when a task of the earlier stage takes a longer time to complete, then the dependent stages execution in the pipeline also gets delayed. Such tasks that takes longer time to complete are termed as straggler.

As the size of the cluster and the size of the stages/tasks grows, the impact of stragglers increases dramatically impacting the job completion time.

Thus, addressing the problem of straggler becomes prudent in order to speed up the job completion time and also improve the cluster efficiency

The Straggler Mitigation strategy in Spark is based on Speculation. A task is identified as a Straggler (or Speculatable) and is duplicated.

The Straggler identification strategy starts when spark.speculation.quantile fraction of tasks in a stage have completed.

By default, spark.speculation.quantile is set as 0.75. Thereafter, if a currently running task exceeds spark.speculation.multiplier(1.5) times the median task time from the set of successful task completions of this particular stage, then this task is identified as a straggler.

The default value of spark.speculation.multiplier is 1.5.

Question-31: Difference between groupByKey and reduceByKey

Answer:

Both groupByKey and reduceByKey produce the same answer but concept to produce results are different.

reduceByKey works much better on a large dataset because Spark can be combine output with a common key on each partition before shuffling the data.

While on the other side in groupByKey, all the key-value pairs are shuffled around. This is a lot of unnecessary data to being transferred over the network.

Question-32: How will you handle OOM errors in Spark?

Answer:

Out of memory issues can be observed for the driver nodes or for executor nodes:

- Out-of-memory errors in the driver. - Caused by actions

Common causes which result in driver OOM are:

- rdd.collect()

- sparkContext.broadcast

- Low driver memory configured as per the application requirements

- Misconfiguration of spark.sql.autoBroadcastJoinThreshold.

- Out-of-memory errors on the executor nodes. - Caused by shuffles associated with wide transformations

Common causes which result in executor OOM are:

- HIGH CONCURRENCY

- INEFFICIENT QUERIES

- INCORRECT CONFIGURATION

**OutofMemoryError due to operations: Increase parallelism so each task input is smaller. Since Spark reuses executor JVM for many tasks, increase parallelism to more than the number of cores.

Question-33: What is the command to see RDD lineage ?

Answer:

"ToDebugString" Method - Returns “A description of the RDD and its recursive dependencies for debugging.

Question-34: What is Explode function and how it can be used?

Answer:

explode() takes in an array (or a map) as an input and outputs the elements of the array (map) as separate rows.

In Spark, we can use “explode” method to convert single column values into multiple rows.

+---+----+---------+

| id|col1| col2|

+---+----+---------+

| 1| abc|[p, q, r]|

| 2| def|[x, y, z]|

+---+----+---------+

df.withColumn("col2",explode($"col2")).show()

+---+----+----+

| id|col1|col2|

+---+----+----+

| 1| abc| p|

| 1| abc| q|

| 1| abc| r|

| 2| def| x|

| 2| def| y|

| 2| def| z|

+---+----+----+

Question-35: Does Parquet file contains the header of data?

Answer: Yes

--------------------------------------------------------------------------------------------------------------------------------------------------------

Feel free to drop your comments / feedback below :)

19 comments:

VickyJuly 20, 2020 at 10:49 AM
Yeah its really helpful 😊
B K GUPTAJuly 20, 2020 at 7:05 PM
Really you did hard work to prepare this.Definetily it would be helpful for others
UnknownJuly 21, 2020 at 1:46 PM
It's really helpful Manushree. Just one small point, in second question you have mentioned that Dataframes are mutable but dataframes are not mutable because at the core it's rdd only, you can perform any transformation and derive a new dataframe out of an existing one . Let me know your thoughts on this
Puneet naikJuly 21, 2020 at 9:34 PM
Great explaination to evry questions. Thanks for taking your time in preparing this document. Really helpful
Shantanu KherJuly 24, 2020 at 12:34 AM
This is what I was looking for. I have bookmarked it.
A concise list of fundamental terms being used in Spark world!! I appreciate your efforts!
Kaushik DJuly 24, 2020 at 12:58 PM
Thanks for curating the questions, I went through the entire content & its really helpful. However i have doubt on
Question 9 - still not clear on the difference b/w check pointing & persist to disk. Both of them are used to save data to disk, then how are these different. Can you give some example?

Swati PagadJuly 24, 2020 at 8:43 PM
Very Informative and helpful. Thank you investing so much time and effort.
UnknownJuly 25, 2020 at 6:06 PM
This is what I am looking for.thanks a lot and very well explained. Appreciate your efforts.
UnknownJuly 26, 2020 at 1:00 AM
Thanks for covering almost all topics with detailed explanation.It is very very helpful.
AnonymousAugust 2, 2020 at 10:42 AM
In Question No 15 Answer(By default, shuffling doesn’t change the number of partitions, but their content).

As i worked spark change the shuffle partition By default:200 and you can change using property spark.sql.shuffle.partitions.
UnknownOctober 10, 2020 at 11:18 AM
I have more questions. Can I post here? It would be helpful if you can provide that answer and add to this post also?
UnknownNovember 12, 2020 at 8:44 PM
Awesome , Thank you ManuShree