Debug Map Reduce program locally before uploading to Hadoop cluster

  • Question

  • I was going through one of the MVA videos and found a good sample Map Reduce program, but I can't find any details on debugging an MR program before actually uploading it to a real cluster. Any input would be appreciated.

    The course is at the link below:

    https://mva.microsoft.com/en-US/training-courses/getting-started-with-microsoft-big-data-8252?l=1MluYUKy_4204984382

    Wednesday, November 16, 2016 8:21 AM

All replies

  • @Ritesh - You can create a small test HDInsight cluster to test and debug your map reduce program if you have an Azure subscription. You will be able to access your test HDInsight cluster from the data science virtual machine. Alternatively, you can create a HortonWorks Sandbox (a one node install of a Hadoop/Spark on another Azure VM). Hope this helps.

    What technology and language are you using to write your map reduce program?

    Regards/Gopi


    Thursday, November 17, 2016 12:44 AM
  • Thanks for the response, Gopi. I am using C# 6.0 and am currently in the learning phase. I created both a Hortonworks sandbox and an HDInsight cluster earlier, when I had an Azure subscription, but I no longer have one and need to look into getting one again. I reckon you are suggesting that I connect to the Hadoop cluster using Azure blob storage and HDInsight credentials. Is that correct?

    I was wondering whether there's a way to debug an MR program on a local machine that has VS2015, before actually connecting to a cluster.

    On a side note, if I had an Azure subscription, would I be able to debug the MR program? I'm not sure about the "data science virtual machine".

    Your input would certainly give me more insight into this amazing technology.

    Thursday, November 17, 2016 5:25 AM
  • @Ritesh, if you have a Visual Studio MSDN subscription, do activate your included free Azure credits and then use them to provision the test and dev environment that you need, as per Gopi's guidance. Follow this link to activate your monthly Azure credits if you are a Visual Studio MSDN subscriber: Activate MSDN Visual Studio Subscription Monthly Azure Credits

    Additionally, if it is for debugging only and you just want to do it on your own local machine, you can download the Hortonworks Sandbox and run it locally, or use the one-month free trial in Azure. Go to this link to download the sandbox: Hortonworks Sandbox Download.

    The Data Science Virtual Machine (DSVM) is a custom virtual machine image from Microsoft that comes pre-installed with popular data science tools for modeling and development activities. The DSVM is offered in both Windows and Linux editions. There’s been a tremendous response to this offering from the data analytics community worldwide, and we continue to iterate and improve the experience. You can try the DSVM for free before adopting it from the Linux DSVM Test Drive, obtain community-based support here on the Forum from users within and outside Microsoft, and run Deep learning tools on Azure GPUs.

    Thursday, November 17, 2016 11:56 PM
  • Thanks a lot, guys, for the guidance. I will set up the cluster and get back to you if I need any further assistance.
    Friday, November 18, 2016 5:12 AM
  • Additionally, I guess "Hortonworks Sandbox for your own machine" should be equivalent to setting up the SyncFusion BigData Platform on a local machine. I am asking because I've already downloaded that setup and was thinking of doing my analysis on top of it.
    Friday, November 18, 2016 5:43 AM
  • I now have the HDInsight cluster and am trying to create a virtual machine to access it. I was under the impression that I would be able to debug my simple square root MR program from my local Visual Studio 2015, but that doesn't work: it gives me errors related to the environment. So I believe I need to create a virtual machine. When I try to create one from portal.azure.com, I see plenty of options and I'm not sure which one to pick. Please suggest. I think I should be able to debug the sample MR program from that VM. Just to add, I connect to my cluster using HDInsight/Storage key credentials.

    Hadoop.Connect(
                    new Uri("https://ritjainHDI.azurehdinsight.net:563"),
                    "admin", "admin", "pwd", "sa.blob.core.windows.net",
                    "sakey",
                    "container", true
                    );

    • Edited by Ritesh Jain Tuesday, November 22, 2016 9:29 AM
    Tuesday, November 22, 2016 9:01 AM
  • Any suggestions would be appreciated.
    Wednesday, November 23, 2016 1:24 PM
  • I haven't had much luck so far, but I thought I'd post my analysis:

    I recreated the HDInsight Hadoop cluster and the Azure VM, and installed VS2015 Community edition in the VM. I created a simple MR program in the VM and rebuilt the solution. Its target build is "Any CPU" and the mode is Debug (I tried other modes too). I connected to the cluster using valid syntax and credentials:

    Hadoop.Connect(
                    new Uri("https://ritjainHDI.azurehdinsight.net:563"),
                    "admin", "Hadoop", "pwd", "sa.blob.core.windows.net",
                    "sakey",
                    "container", true
                    );

    But while executing, I get the error below. I also copied all the DLLs/EXEs directly onto my cluster and tried running the MRRunner.exe command, but in vain.

    At this point I have tried every approach I can think of, without success. I'm not sure whether anyone else has faced such issues or whether I'm missing something.

    Wednesday, November 30, 2016 9:41 AM
  • Thanks for trying various options and documenting them Ritesh. 

    The HDInsight forum may be a better place for this question, since most data science VM users deal with higher level abstractions like Hive, Pig, and Spark while working with HDInsight. In fact, it may help to consider first using higher level abstractions (which can still benefit from the scaling and big data processing of HDInsight/Hadoop) if you can, and use the lower level APIs only if you need that level of control.

    Samples in HDInsight on the Azure HDInsight documentation pages may be of help. Pasting a couple of starting points. 

    https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-hive-pig-udf-dotnet-csharp 

    https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-run-samples#hdinsight-sample-csharp-streaming


    Thursday, December 1, 2016 2:24 AM
  • Thanks Gopi. I can certainly try higher level abstractions, but I wanted to practice and focus on MR for customized logic. I'm sure that will be possible only through an MR job.

    I've already gone through the provided links. They are really useful, but they do not contain details about running a C# MR job, i.e. how and where. They use the streaming API via a PowerShell script to run the MR job. Now I am thinking of installing HDP on a VM and then trying to run the MR job there. Hopefully that, at least, will work.

     
    Thursday, December 1, 2016 6:36 AM
  • Brainstorming, but not sure of myself...

    As for the error you saw, I see you are connecting on port :563. What is that one for? I wasn't sure if that is the problem. I thought that unless you are running inside the cluster and talking to the same HDInsight, traffic from outside HDInsight normally goes to :443, to a secure gateway. The gateway then handles the inbound request and sends it to the right port for Templeton, which hosts the work and makes sure the job starts and completes.

    There are two .Net scenarios. I wasn't entirely sure which kind you wanted to do, or whether you are mixing them both.

    Are you talking about your .Net app launching an M/R job from outside the HDInsight cluster, or about your .Net code running inside M/R, doing the data parsing and work?

    Like these examples:
    A> Starting a Templeton job to run Map/Reduce job using a .Net application. Calling SubmitMRJob()
    https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-hadoop-use-mapreduce-dotnet-sdk

    B> Using your custom .Net app compiled as .exe to be run as the streaming Map & Reduce classes, piping the data on the console stdin/stdout to communicate between the Hadoop java code and your .Net apps.

    https://docs.microsoft.com/en-us/azure/hdinsight/hdinsight-run-samples

    See Section "Word count - C# streaming"

    There were two options, but it's been a long time since I tested. I think the compiled .Net .exe could be uploaded into blob storage and referenced (like a .jar file) by its relative path in the cloud's blob storage, or the .exes could be included in the -files listing that is part of the job. In the second case, PowerShell would upload them to the head node for job submission through Templeton, which would distribute the executables to the worker nodes where the job was hosted.

             -Files "/example/apps/cat.exe","/example/apps/wc.exe"


    Didn't get enough help here? Submit a case with the Microsoft Customer Support teams for deeper investigation - Azure service support: https://manage.windowsazure.com/?getsupport=true For on Premise software support go here instead: http://support.microsoft.com/select/default.aspx?target=assistance

    Friday, December 2, 2016 12:31 AM
  • Thanks Jason. I have gone through almost everything you suggested, but without any real success. I tried both ports, 563 and 443. :(

    Regarding the example with the .exe/.jar files: it uses streaming and is geared towards running MR via a PS script, but I want to actually see the flow in debug mode.

    As I mentioned in my previous post, I will be setting up HDP (which I gather people call local HDInsight) and will try there. If that does not help, I will approach support as per the provided links.

    Friday, December 2, 2016 10:32 AM
  • On a side note, not sure if there will be any charge for Azure support. Do you have any idea?
    Friday, December 2, 2016 10:33 AM
  • Microsoft Azure Support plans for developer stuff like this start at $29/mo. They don't really touch the Hortonworks clusters that you may host, only HDInsight in Azure. If you have an MSDN account you may already have some support options within that offering.

    Hortonworks Data Platform is the Hadoop distro that Hortonworks builds and tests.
    HDInsight is based on HDP and hosted in Microsoft's cloud Azure.

    You can install the HDP sandbox locally on your own VM or computer, without having to have a bunch of head nodes/worker nodes. http://hortonworks.com/downloads/

    If your .Net app is used as a .exe for streaming map reduce, then you can just pass in the data on the console to debug and test it outside of Hadoop. I don't see much need in attaching the debugger to the .Net app when running inside Hadoop. Seems like you could test the input and output key/value pairs from your computer with Visual Studio and make sure it gives reasonable input and output.

    I am a little confused why the map reduce streaming app .exe would connect into Hadoop again.

    I can understand to start the job you could write a .Net app to do so, to connect and launch the work which seems to be the error shown above, but that's a totally separate app than a streaming map/reduce app, so just trying to get clarity on which piece is the main goal.
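    As a rough illustration of the point about testing a streaming .exe outside Hadoop, here is a minimal sketch of a square root mapper. The class name SqrtMapper, the one-number-per-line input format, and the tab-separated output are assumptions made for this example; they are not taken from the MVA course or from any SDK.

```csharp
using System;
using System.Globalization;
using System.IO;

// Hypothetical streaming mapper: reads one number per line from its input
// and writes "number<TAB>square root" per line to its output.
public static class SqrtMapper
{
    // The map logic takes a reader and a writer so that it can be driven by
    // the console (as Hadoop streaming would) or by in-memory text in a test.
    public static void Map(TextReader input, TextWriter output)
    {
        string line;
        while ((line = input.ReadLine()) != null)
        {
            double value;
            bool parsed = double.TryParse(
                line.Trim(), NumberStyles.Float, CultureInfo.InvariantCulture, out value);

            // Skip lines that are not non-negative numbers.
            if (!parsed || value < 0)
            {
                continue;
            }

            output.WriteLine(
                "{0}\t{1}",
                value.ToString(CultureInfo.InvariantCulture),
                Math.Sqrt(value).ToString(CultureInfo.InvariantCulture));
        }
    }

    public static void Main()
    {
        // When run as a streaming task, stdin and stdout are supplied by the caller.
        Map(Console.In, Console.Out);
    }
}
```

    Once built, something like `SqrtMapper.exe < sqrt.txt` at a command prompt feeds a local file through the same code path, and a breakpoint inside Map can be hit in Visual Studio without any cluster involved.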


    Friday, December 2, 2016 10:16 PM
  • I just wanted to understand the MR job flow in detail, and I thought debugging would be the best way to do that. So I created a simple square root program and uploaded a sqrt.txt file to a container in my blob storage. I have read many articles on this, but could not get it to work with the HDInsight cluster.

    What do you think of this approach: install HDP on an Azure VM, install VS2015, import the NuGet package for mapreduce, and try debugging? Let me know if that would help me.

    Tuesday, December 6, 2016 6:56 AM
  • I believe there is already a template with HDP in the Azure VM Marketplace to quickly deploy HDP in Azure on VMs. I think you click [New +], then search for Hadoop, and you'll see the options for Hortonworks / Cloudera, etc.

    You can't debug the map reduce framework itself using Visual Studio. That's java code running in a java.exe VM. You can see it working with jstack and other java debuggers attached to a specific process, but there will be many java processes, so it's not trivial to find the right process on the right computer.

    You can attach VS to your .exe if you can find it. Let's say you have a 3 worker node cluster: there will most likely be multiple copies of your streaming map reduce .exe on each node, each handling a split of the data in a separate Yarn container. Hadoop is a distributed system, so attaching a live debugger is not a normal use case. There are moving parts running here and there on the various worker nodes, and the processes come and go as the job progresses, leaving little time to investigate and get a debugger attached. It's scattered and it's transient.

    You are better off debugging manually: run your app on the command line, pass in the text file on the pipe, and get the result on the stdout pipe. I haven't seen people debug their own runtimes within Hadoop using Visual Studio unless it's a hang or an advanced scenario where, in the worst case, they have hours of time to get things attached.
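    To make the "run it on the command line" idea concrete, here is a minimal sketch that chains a word count mapper, a sort, and a reducer in a single process, so the whole flow can be stepped through in a debugger. The class and method names are hypothetical, and the in-process sort is only an assumed stand-in for the sort Hadoop performs between map and reduce; nothing here reproduces how work is split across nodes.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical local harness: map -> sort by key -> reduce, all in one process.
public static class LocalWordCount
{
    // Mapper: emits "word<TAB>1" for every whitespace-separated token.
    public static IEnumerable<string> Map(IEnumerable<string> lines)
    {
        foreach (var line in lines)
        {
            foreach (var word in line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries))
            {
                yield return word.ToLowerInvariant() + "\t1";
            }
        }
    }

    // Reducer: assumes pairs arrive sorted by key and sums runs of equal keys.
    public static IEnumerable<string> Reduce(IEnumerable<string> sortedPairs)
    {
        string currentKey = null;
        int sum = 0;
        foreach (var pair in sortedPairs)
        {
            var parts = pair.Split('\t');
            if (currentKey != null && parts[0] != currentKey)
            {
                yield return currentKey + "\t" + sum;
                sum = 0;
            }
            currentKey = parts[0];
            sum += int.Parse(parts[1]);
        }
        if (currentKey != null)
        {
            yield return currentKey + "\t" + sum;
        }
    }

    // Runs the three stages; the ordinal sort by key plays the role of the shuffle.
    public static List<string> Run(IEnumerable<string> lines)
    {
        var mapped = Map(lines);
        var sorted = mapped.OrderBy(p => p.Split('\t')[0], StringComparer.Ordinal);
        return Reduce(sorted).ToList();
    }

    public static void Main()
    {
        // Read all of stdin, then print the reduced counts.
        var lines = new List<string>();
        string line;
        while ((line = Console.In.ReadLine()) != null)
        {
            lines.Add(line);
        }
        foreach (var result in Run(lines))
        {
            Console.WriteLine(result);
        }
    }
}
```

    Keeping Map and Reduce as plain methods means the same logic can later be wrapped in separate mapper and reducer executables, while the key/value handling is checked locally first.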


    Thursday, December 8, 2016 3:57 PM
  • First of all, thanks for spending time on my posts and replying.

    I don't want to debug the map reduce framework itself, only my own MR code (a simple word count example), as shown in the MVA courses. In a practical production scenario we wouldn't deploy the dll/exe directly: the development team would write the code, verify the output in debug mode, and only then upload the files to HDFS. I think your suggestion of HDP on an Azure VM should help me,

    i.e. install HDP on an Azure VM, install VS2015, import the NuGet package for mapreduce, and try debugging.

    You mentioned debugging manually by running the app on the command line. I'd appreciate some more input on this. Not sure if you're referring to MRRunner.exe in HCL.

    Friday, December 9, 2016 12:09 PM
  • It was suggested earlier that "you can create a HortonWorks Sandbox (a one node install of a Hadoop/Spark on another Azure VM)". I tried that today: I created an Azure VM, installed HDP 2.4, and installed VS2015 in the VM. After that I tried running a simple MR job, but had no luck. Is there anything I am missing? Please suggest.
    Monday, December 12, 2016 12:41 PM
  • I found a link and followed it, but still no luck. It's becoming a nightmare now.

    https://blogs.msdn.microsoft.com/data_otaku/2013/09/07/hadoop-for-net-developers-implementing-a-simple-mapreduce-job/#comment-3105

    Please help, guys.

    Tuesday, December 13, 2016 7:26 AM
  • You didn't share any context on what went wrong, any error, etc. Not going to be able to help much for Hortonworks Sandbox. More than what can be done in a forum thread I'm afraid. Sounds like you need to work with support or a consultant.

    In my own opinion, setting up Hadoop is non-trivial. It is not really a turn-key product, and it has a lot of ports to open and settings to configure. HDInsight simplifies that greatly, but you can't install your own software as easily on Windows on the worker nodes. Maybe on Linux you could debug java more easily.

    Programming Map Reduce is a DIY technology. That's why most people choose higher level apps like Hive / Sqoop / Oozie / Pig / Spark that are quite flexible.

    I don't think debugging with Visual Studio is a common scenario, and I do not recommend you keep going down the difficult path.

    Wish I knew how to help more.

    Thanks, Jason


    Wednesday, December 14, 2016 3:51 AM
  • The SDK changed a lot since 2013, so I don't think that blog is relevant in the current SDK. It relies on incubator libraries that were never finished. 

    https://hadoopsdk.codeplex.com/

    Please note:  The following .NET SDK packages have been deprecated and will no longer be supported starting on January 1, 2017:

      • Microsoft.WindowsAzure.Management.HDInsight
      • Microsoft.Hadoop.Client


    Wednesday, December 14, 2016 3:55 AM
  • I think I found the root cause of the issue. I created an Azure VM on Windows 10, installed VS2015, wrote a simple MR job to calculate square roots, and installed the Hortonworks Sandbox in that VM. After much analysis, I found that the sandbox is Linux based, and I was trying to connect to it from VS2015 in debug mode. That, I feel, is where I was making a mistake.

    When we say local cluster, it looks like it's HDP 2.3.4, which is a cluster to be installed on Windows Server 2012. There we can install VS2015 and debug a job. I need to download HDP for Windows and try that as well. I'm sure that will give me success.

    Appreciate your patience. :)

    Wednesday, December 14, 2016 12:04 PM
  • Bad luck again. Even after installing HDP locally, I was unable to succeed. I posted the issue at the link below:

    https://community.hortonworks.com/questions/72672/unable-to-connect-to-the-remote-server-while-conne.html#answer-form

    Wednesday, December 28, 2016 1:49 PM
  • I installed HDP 2.4 locally on an Azure VM running Windows 2012 R2 and tried the MR job. Unfortunately I could not succeed. I think I have tried every possible way to get an MR job running in MS VS, but in vain. Does anyone have any input that could resolve this snag?

    Monday, January 23, 2017 12:32 PM