This is one of the great features of the MapReduce framework, and it works because tasks are independent of each other. It is the framework (Hadoop) that monitors everything while a job is executing; with that in place, the tasks simply do their respective work.
We know about fault tolerance in M/R: in a nutshell, the framework starts a new task on another tasktracker node if the node running a task fails, or if the task itself fails for some reason (for example, an exception).
Now think of a situation where your job has completed all of its map tasks except one, which is taking a long time because of a CPU limitation, a slow disk controller, or some other reason that is not a failure. To get out of such a hung situation, where your job would otherwise sit in a waiting state, the jobtracker starts that same slow task on another available node/tasktracker in parallel. Whichever attempt finishes first is the one allowed to submit its result, and the rest are terminated. This smart play is termed speculative execution.
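To make the "whichever finishes first wins" idea concrete, here is a minimal sketch in plain Java. It is not Hadoop code: the class SpeculativeRace, the attempt names and the sleep times are all made up for illustration. It also starts both attempts at the same moment, whereas the framework launches the backup only after it notices a slow task. It relies on ExecutorService.invokeAny, which returns the result of the first task to complete successfully and cancels the others.

```java
import java.util.Arrays;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class SpeculativeRace {
    // Hypothetical stand-in for one task attempt: "works" for delayMillis,
    // then reports which attempt it was.
    static Callable<String> attempt(String name, long delayMillis) {
        return () -> {
            Thread.sleep(delayMillis);
            return name;
        };
    }

    // Runs the original attempt and a backup attempt of the same task in
    // parallel and returns the name of whichever finishes first.
    static String runWithBackup(long originalMillis, long backupMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(2);
        try {
            // invokeAny returns the first successful result and cancels the rest.
            return pool.invokeAny(Arrays.asList(
                    attempt("original", originalMillis),
                    attempt("backup", backupMillis)));
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("no attempt finished", e);
        } finally {
            pool.shutdownNow(); // make sure the losing attempt is stopped
        }
    }

    public static void main(String[] args) {
        // The original attempt is the straggler here, so the backup wins.
        System.out.println(runWithBackup(2000, 100)); // prints: backup
    }
}
```

If the original attempt is the faster one, the same call returns "original" and the backup is the one that gets cancelled.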
Excerpt from Wikipedia:
“Speculative execution is an optimization technique where a computer system performs some task that may not be actually needed. The main idea is to do work before it is known whether that work will be needed at all, so as to prevent a delay that would have to be incurred by doing the work after it is known whether it is needed.”
Speculative Execution and its side effects:
Speculative execution has side effects. Because it is active by default, it should be left on only when it is needed and turned off otherwise. Let's see when it should not be used in map tasks and reduce tasks:
Reduce Task: It is a best practice to turn speculative execution off for reduce tasks. Let's take an example to understand why.
Suppose the only job of your reducer is to write map output values directly to HDFS. Now, one reduce task is taking longer than the other reduce tasks, but it has already written some values, let's say a few hundred lines so far. Because this task is taking too long, the jobtracker assigns the same task to another node in parallel. For convenience, we will call the slow node that was running the task N1 and the node running it in parallel N2. Thanks to speculative execution happening in the background, we get our result through N2. But did you notice that the task running on N1 had already written 100+ lines? That means there are now some 100+ duplicate lines in the output.
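The N1/N2 scenario above can be sketched in a few lines of plain Java. This is only a simulation under stated assumptions: the class DuplicateOutputDemo, the record names and the shared list standing in for the output location are all hypothetical, and both attempts are assumed to append to the same sink on their own rather than going through the job's output committer.

```java
import java.util.ArrayList;
import java.util.List;

class DuplicateOutputDemo {
    // Simulates one reduce attempt appending its first `linesWritten`
    // records to a sink shared by every attempt of the same task.
    static void runAttempt(List<String> sharedSink, int linesWritten) {
        for (int i = 0; i < linesWritten; i++) {
            sharedSink.add("record-" + i);
        }
    }

    // Returns how many duplicate lines end up in the shared sink when the
    // slow attempt (N1) is killed partway and the backup (N2) completes.
    static int duplicates(int totalLines, int linesWrittenBySlowAttempt) {
        List<String> sink = new ArrayList<>();
        runAttempt(sink, linesWrittenBySlowAttempt); // N1: killed after a partial write
        runAttempt(sink, totalLines);                // N2: writes the full output
        long distinct = sink.stream().distinct().count();
        return sink.size() - (int) distinct;
    }

    public static void main(String[] args) {
        // N1 wrote 100 of 1000 lines before N2 finished the whole task.
        System.out.println(duplicates(1000, 100)); // prints: 100
    }
}
```

Every line N1 managed to write before it was terminated shows up a second time once N2 completes, which is exactly the duplication described above.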
When you are running multiple reduce tasks and a case like the one above is possible, it is advisable to turn speculative execution off on the reduce side.
Map Task: The same goes for map tasks. Examples include a job with zero reduce tasks, a job with a dedicated map task for each file (irrespective of its size), and a few more cases which I might be missing here. At this stage you really have to think about your use case and the effect speculative execution will have on it.
None of this is of much use until we know how to disable or enable speculative execution. The mapreduce.map.speculative pr
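As a rough sketch of how the two switches could be passed per job: on Hadoop 2.x and later the properties are mapreduce.map.speculative and mapreduce.reduce.speculative (older releases used mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution). The helper below is hypothetical and deliberately has no Hadoop dependency; it only builds the -D generic options you would put on the command line, and it assumes your driver is run through ToolRunner (or otherwise uses GenericOptionsParser) so that those options are actually picked up.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class SpeculationFlags {
    // Property names as used by Hadoop 2.x and later.
    static final String MAP_SPECULATIVE = "mapreduce.map.speculative";
    static final String REDUCE_SPECULATIVE = "mapreduce.reduce.speculative";

    // Builds the -D generic options that switch speculative execution
    // on or off for map tasks and reduce tasks independently.
    static String genericOptions(boolean mapSpeculative, boolean reduceSpeculative) {
        Map<String, Boolean> settings = new LinkedHashMap<>();
        settings.put(MAP_SPECULATIVE, mapSpeculative);
        settings.put(REDUCE_SPECULATIVE, reduceSpeculative);

        StringJoiner options = new StringJoiner(" ");
        settings.forEach((name, value) -> options.add("-D" + name + "=" + value));
        return options.toString();
    }

    public static void main(String[] args) {
        // Keep speculation for map tasks, turn it off for reduce tasks.
        System.out.println(genericOptions(true, false));
        // prints: -Dmapreduce.map.speculative=true -Dmapreduce.reduce.speculative=false
    }
}
```

The same two properties can also be set cluster-wide in mapred-site.xml, or from the driver through the Job methods setMapSpeculativeExecution and setReduceSpeculativeExecution.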
Rahul Mathur has more than 4 years of experience in developing and designing large-scale enterprise applications based on Java/J2EE, open source and Big Data technologies. He has extensive hands-on experience in modeling, refining, analysing and mining data using Big Data technologies such as MapReduce, Hadoop, Hive and Pig, and expertise in working with layered architectures and enterprise applications.
He has good exposure to various Java/J2EE and enterprise design patterns, good knowledge of NoSQL data stores (HBase, Cassandra), and a good understanding of software development processes, having worked with both Waterfall and Agile methodologies.