Working with PDF and other types of files while working with DSC fileset

  • Question

  • Hi,

    I understand that to query data with LINQ to HPC, the data must be represented either as plain text or as serialized objects. I have also worked through the samples that come with the Programmer's Guide.

    Now suppose I have a set of PDF files, or any other files that are not in plain text format. Is it mandatory to convert them to text before working with them? Is there any way to work with these file types directly using LINQ to HPC? (As far as I understand, we can't.)

    Please let me know if there is any way to handle such requirements using LINQ to HPC.

    Regards,


    Phani
    Wednesday, October 26, 2011 11:33 AM

Answers

  • Hi,

    There is no native support for working with files that contain neither plain text nor serialized objects (with or without compression).

    The recommended way to handle input files like this is to write your query lambdas so that they read and analyze each PDF directly, driven by a list of PDF file names (either stored in DSC or supplied as an IEnumerable). The output of this query operator can then be chained into other operators to perform additional analysis and/or persist the results to DSC.
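
    A minimal sketch of this first approach, written against plain LINQ to Objects rather than the LINQ to HPC API, might look like the following. The names PdfInfo, PdfQuerySketch, Inspect and Analyze are hypothetical, and the "analysis" is only a stand-in: it reads each file inside the lambda and checks for the "%PDF-" marker that PDF files start with. It also assumes every path in the list is readable from wherever the lambda executes.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical result type: one small record per inspected file.
public sealed class PdfInfo
{
    public string FilePath;
    public long Length;
    public bool HasPdfHeader;
}

public static class PdfQuerySketch
{
    // "%PDF-" in ASCII; PDF files begin with this marker.
    private static readonly byte[] PdfMagic = { 0x25, 0x50, 0x44, 0x46, 0x2D };

    // Stand-in for real PDF analysis: reads the binary file directly
    // and reports its size and whether it starts with the PDF marker.
    public static PdfInfo Inspect(string filePath)
    {
        byte[] bytes = File.ReadAllBytes(filePath);
        bool hasHeader = bytes.Length >= PdfMagic.Length
            && bytes.Take(PdfMagic.Length).SequenceEqual(PdfMagic);
        return new PdfInfo
        {
            FilePath = filePath,
            Length = bytes.Length,
            HasPdfHeader = hasHeader
        };
    }

    // The query operates on file names only; the file contents are
    // read inside the Select lambda, and later operators chain onto it.
    public static IEnumerable<PdfInfo> Analyze(IEnumerable<string> pdfPaths)
    {
        return pdfPaths
            .Select(p => Inspect(p))
            .Where(info => info.HasPdfHeader);
    }
}
```

    The point of the sketch is the shape of the query: the input sequence holds only file names, so nothing binary has to be stored as text, and the lambda passed to Select does the file I/O. The calls needed to run such a query on the cluster and persist its output to DSC are not shown here.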

    An alternative solution, which may be best if you want to persist the original unmodified contents of the PDFs for multiple queries, would be to read the contents of the PDF and generate either LineRecords or some custom serializable objects, which can then be ingressed into DSC.
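
    As a rough sketch of this second approach, the conversion step could produce one serializable record per line of extracted text. PdfTextRecord and PdfIngressSketch are hypothetical names, and extractText is a placeholder for whatever PDF library does the actual text extraction. The step that ingresses the resulting records into DSC is not shown.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical custom record: one line of text extracted from one PDF.
[Serializable]
public sealed class PdfTextRecord
{
    public string FileName;
    public int LineNumber;
    public string Text;
}

public static class PdfIngressSketch
{
    // extractText is an assumption: a caller-supplied function that
    // returns the text of the PDF at the given path.
    public static IEnumerable<PdfTextRecord> ToRecords(
        IEnumerable<string> pdfPaths,
        Func<string, string> extractText)
    {
        return pdfPaths.SelectMany(path =>
            extractText(path)
                .Split(new[] { "\r\n", "\n" }, StringSplitOptions.None)
                .Select((line, index) => new PdfTextRecord
                {
                    FileName = Path.GetFileName(path),
                    LineNumber = index + 1, // 1-based line numbers
                    Text = line
                }));
    }
}
```

    Keeping the file name and line number on each record lets later queries group or filter by source document without going back to the original PDFs.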

    Hope this helps,

    Jeremy

    Tuesday, November 1, 2011 8:50 PM