
Azure HDInsight Solutions | Spark | Application Failure Probing Questions

This page lists the minimum data required to understand any Spark application performance or failure issue.

Spark applications are usually submitted to HDInsight clusters from Azure Data Factory (ADF), Jupyter, Zeppelin, JDBC, SSH, or directly through Livy using a curl command.

| Front End | Service Used on HDInsight |
|-----------|---------------------------|
| Jupyter   | Livy                      |
| ADF       | Livy                      |
| Zeppelin  | Interpreter (Livy, Spark) |
| Curl      | Livy                      |
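
When reproducing a Livy-based submission outside ADF or Jupyter, it can help to see what the raw request looks like. The following is a minimal sketch, not a definitive client: it only builds the request and does not send it. The endpoint form `https://<cluster>.azurehdinsight.net/livy/batches` is an assumption to verify for your cluster, and the function name, cluster name, credentials, JAR path, and class name are all hypothetical.

```python
import base64
import json
import urllib.request


def build_livy_batch_request(cluster_name, user, password, jar_path, class_name):
    """Build (but do not send) a Livy batch-submission request."""
    # Assumed HDInsight Livy endpoint; confirm it for the cluster in question.
    url = f"https://{cluster_name}.azurehdinsight.net/livy/batches"
    # Livy's batch API takes the application file and main class as JSON.
    body = json.dumps({"file": jar_path, "className": class_name}).encode("utf-8")
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Livy expects this header on POSTs when CSRF protection is enabled.
            "X-Requested-By": user,
            "Authorization": f"Basic {token}",
        },
    )


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    req = build_livy_batch_request(
        "mycluster", "admin", "secret",
        "wasbs:///example/jars/app.jar", "com.example.App",
    )
    print(req.get_method(), req.full_url)
    # Sending it would be: urllib.request.urlopen(req)
```

Keeping the request construction separate from the network call makes it easy to compare the payload against what ADF or Jupyter actually sent.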

The details below are required to start troubleshooting the following Spark application issues:

*   Slow performance: the Spark application takes longer than on another HDInsight cluster but still completes successfully.
*   Unexpected failure: the Spark application starts processing data but fails to complete with an exception.
*   Application fails with exception: the Spark application starts processing data but fails to complete with an exception.
*   Application hangs: it never reaches a finished state.
*   Spark application fails to start when submitted from Spark-CLI.
  1. If the Spark application is submitted from Azure Data Factory, Jupyter, or any other client application that uses Livy (such as curl), follow the steps below.
    • Confirm from the Ambari UI that the Livy server is started on HN0; if it is stopped, start the service.
    • If the Livy server does not start and you see java.lang.OutOfMemoryError: unable to create new native thread in the Livy log /var/log/livy/livy-livy-server.out, follow the steps detailed in livy-nativethread-exhaustion.
    • If that exception is not found, capture the Livy logs from the cluster (/var/log/livy/livy-livy-server.out).
    • Get the Jupyter logs (/var/log/jupyter/) when troubleshooting Spark applications that were submitted from a Jupyter notebook.
      • Jupyter uses Livy to submit Spark applications, so get the Livy logs as well.
  2. If the application is submitted over JDBC, which uses the Spark Thrift service, get the Spark Thrift driver logs from /var/log/spark/sparkthriftdriver.log

  3. If the Spark job is submitted from spark-shell, get the complete spark-submit command.

  4. For any Spark application performance issue (including the scenarios listed above), first note the Application ID, then capture the YARN logs for the application that is experiencing the performance issue (slow or hang) or failure.
    a. The article How do I download Yarn logs from HDInsight cluster? details the different options for capturing YARN logs.
    b. Download all Application Master logs.
    c. Get the logs for all containers (driver and executors).

  5. Get a screenshot of the YARN UI showing the start datetime, end datetime, and status of the failed application.

  6. If this application completed successfully earlier, capture the start datetime, end datetime, application status, and the YARN logs for that successful run (see How do I download Yarn logs from HDInsight cluster?).

  7. Apart from the YARN logs, get details about the environment.
    • Number of worker nodes.
    • Executors
    • Source and sink (for example, Kafka to Storage)
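
Two of the steps above lend themselves to a small helper: checking a Livy log for the native-thread exception quoted in step 1, and assembling the `yarn logs` invocations for the Application Master and container logs in step 4. The sketch below is illustrative only. The function names are hypothetical, the application ID in the demo is made up, and the `-am ALL` option should be confirmed against the YARN version on the cluster.

```python
import re

# Error string quoted in step 1 for the Livy server log.
NATIVE_THREAD_OOM = "java.lang.OutOfMemoryError: unable to create new native thread"
# YARN application IDs have the form application_<clusterTimestamp>_<sequence>.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")


def find_native_thread_oom(log_lines):
    """Return (line_number, text) pairs for log lines containing the OOM error."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(log_lines, start=1)
        if NATIVE_THREAD_OOM in line
    ]


def yarn_log_commands(app_id):
    """Return yarn CLI argument lists for AM logs and for all container logs."""
    if not APP_ID_PATTERN.fullmatch(app_id):
        raise ValueError(f"not a YARN application ID: {app_id!r}")
    return [
        # Application Master logs only.
        ["yarn", "logs", "-applicationId", app_id, "-am", "ALL"],
        # Logs for all containers (driver and executors).
        ["yarn", "logs", "-applicationId", app_id],
    ]


if __name__ == "__main__":
    # Sample log lines, for illustration only.
    sample = [
        "INFO LivyServer: starting\n",
        'Exception in thread "main" java.lang.OutOfMemoryError: '
        "unable to create new native thread\n",
    ]
    print(find_native_thread_oom(sample))
    # Hypothetical application ID.
    for cmd in yarn_log_commands("application_1585844667193_0001"):
        print(" ".join(cmd))
```

In practice the log lines would come from opening /var/log/livy/livy-livy-server.out on the head node, and each command list could be passed to `subprocess.run` with its output redirected to a file.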

## Spark Streaming

For General Spark Tuning Refer