hdinsight.github.io

title description keywords services documentationcenter author manager editor ms.assetid ms.service ms.workload ms.tgt_pltfrm ms.devlang ms.topic ms.date ms.author
Both ResourceManger stuck in standby mode? Microsoft Docs   Azure HDInsight, Yarn, Resource Manger, stadnby mode, wasb, adls, abfs. Azure HDInsight na marshall shravan marshall na multiple na na na article 12/09/2019 zhaya

Both RM in standby mode?

You can check the /var/log/hadoop-yarn/yarn/yarn-yarn-resourcemanager.log from both headnodes. If both headnode are saying can not transit to active that means you’re in the right TSG. Reason mentioning this is because, can not transit to active node is expected log if the other RM is taking the active role.

What is the reason behind that?

We’ve seen several ICMs about the standy mode issue. All of them are due to the node-label.mirror file missing from HDFS issue. We introduced node-label recently which is using the hdfs to maintain the node-label store. However, during the scale down event, this file might lost. HDFS would reject all the request until this missing block is recovered.

What is the fix?

We are working on a fix to use remote storage account instead of local hdfs to maintain the node-label mirror, which might take a while to test with all the storage types.

What is the mitigation?

You can check

$ hdfs fsck hdfs://mycluster/

if it says some files are under replica, or there’re missing blocks in hdfs. You can run

$ hdfs fsck hdfs://mycluster/ -delete

to forcefully clean up the hdfs. After this you should be able to get rid of the standby RM issue.