PDF to TXT conversion: Chinese text encoding problem

  • Question

  • I implemented this with the BinaryWriter class, but any Chinese text always comes out garbled. How can I fix this?

     

    • Moved by Sheng Jiang 蒋晟 Friday, May 13, 2011 5:07 PM Adobe file format question (From: General Discussion)
    Tuesday, December 18, 2007 12:58 PM

Answers

All replies

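  • How are the lines that convert to txt written?

    Specifically, how do you read the PDF and then write it out to the txt file? Some code would make this easier to analyze.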

    Tuesday, December 18, 2007 1:54 PM
  • I call the TET_dotnet.dll dynamic-link library.

    class PdfToTxt
        {
            public static void PDFtoTXT(string tagetTxt, string sourcePdf)
            {
                string globaloptlist = "searchpath=../../../resource/cmap";

                /* document-specific  option list */
                string docoptlist = "";

                /* page-specific option list */
                string pageoptlist = "granularity=page";

                TET tet;
                int pageno = 0;

                FileStream outfile;
                BinaryWriter w;
               // StreamWriter w ;
                Encoding gb=  Encoding.GetEncoding("GB2312");
                Byte[] byteOrderMark = gb.GetPreamble();
                outfile = File.Create(tagetTxt);
                w = new BinaryWriter(outfile, Encoding.GetEncoding("GB2312"));
               // w = new StreamWriter(outfile, Encoding.GetEncoding("GB2312"));
                //w.Write(byteOrderMark);
                w.Write(Convert.ToString(byteOrderMark));
                tet = new TET();

                try
                {
                    int n_pages;

                      tet.set_option(globaloptlist);
                    int doc = tet.open_document(sourcePdf, docoptlist);

                    /* get number of pages in the document */
                    n_pages = (int)tet.pcos_get_number(doc, "length:pages");

                    for (pageno = 1; pageno <= n_pages; ++pageno) /* loop over pages */
                    {
                        string text;
                        int page;

                        page = tet.open_page(doc, pageno, pageoptlist);


                        /* Retrieve all text fragments; This is actually not required
                         * for granularity=page, but must be used for other
                         * granularities.
                         */
                        while ((text = tet.get_text(page)) != null)
                        {
                            /* loop over all characters  */
                            while ((tet.get_char_info(page)) != -1)
                            {
                                string fontname;
                                StringBuilder path = new StringBuilder();
                                /* The following shows how to query the fontname;
                                 * The position could be fetched from ci->x and ci->y.
                                 */
                                path.Length = 0;
                                path.AppendFormat("fonts[{0}]/name", tet.fontid);
                                fontname = tet.pcos_get_string(doc, path.ToString());
                            }

                            /* print the retrieved text */
                            w.Write(gb.GetBytes(Convert.ToString(text)));
                        }

                    
                        tet.close_page(page);
                    }
                    tet.close_document(doc);
                }
                catch (TETException err)
                {
                    // caught exception thrown by TET
                
                }
                finally
                {
                    outfile.Close();
                    if (tet != null)
                    {
                        tet.Dispose();
                    }
                  
                }
            }
        }

    Wednesday, December 19, 2007 9:45 AM
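Two calls in the code above are worth checking against what they actually put on disk: `Convert.ToString(byteOrderMark)` and `BinaryWriter.Write(string)`. The following is a minimal sketch, not the poster's code. It uses UTF-8 and in-memory streams as stand-ins so that it runs without TET or a GB2312 code page, and the class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;

static class WriterDemo
{
    // Encode text with BinaryWriter.Write(string), which emits a length prefix
    // before the encoded characters.
    public static byte[] ViaBinaryWriter(string text)
    {
        var ms = new MemoryStream();
        using (var bw = new BinaryWriter(ms, new UTF8Encoding(false)))
        {
            bw.Write(text);
        }
        return ms.ToArray(); // ToArray still works after the stream is closed
    }

    // Encode text with StreamWriter, which emits only the encoded characters
    // (no BOM here, because UTF8Encoding(false) has an empty preamble).
    public static byte[] ViaStreamWriter(string text)
    {
        var ms = new MemoryStream();
        using (var sw = new StreamWriter(ms, new UTF8Encoding(false)))
        {
            sw.Write(text);
        }
        return ms.ToArray();
    }

    static void Main()
    {
        byte[] preamble = Encoding.UTF8.GetPreamble();

        // Convert.ToString on a byte[] returns the type name, not the bytes.
        Console.WriteLine(Convert.ToString(preamble));      // System.Byte[]

        // "中文" is 6 bytes in UTF-8; BinaryWriter adds a 1-byte length prefix.
        Console.WriteLine(ViaBinaryWriter("中文").Length);   // 7
        Console.WriteLine(ViaStreamWriter("中文").Length);   // 6
    }
}
```

If the same holds for the posted code, the output file would begin with a length byte followed by the literal text `System.Byte[]`, which a text editor may well show as garbage. The commented-out `StreamWriter` lines in the post, or `w.Write(byte[])` without the string conversion, would avoid both effects.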
  • Was the PDF file you are reading generated with GB2312? Try a different encoding, for example the UnicodeEncoding class, or UTF8 or UTF16 in place of GB2312.

    Wednesday, December 19, 2007 10:24 AM
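To make the suggestion of trying other encodings concrete, here is a small sketch of how one .NET string turns into different bytes depending on the encoding chosen at write time. It assumes the extracted text arrives as an ordinary .NET string, as the `string text` variable in the posted code suggests. `EncodingDemo` and `Hex` are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Return the bytes that the given encoding produces for text, as hex.
    public static string Hex(string text, Encoding encoding)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }

    static void Main()
    {
        string text = "中文";

        Console.WriteLine(Hex(text, Encoding.UTF8));     // E4-B8-AD-E6-96-87
        Console.WriteLine(Hex(text, Encoding.Unicode));  // 2D-4E-87-65 (UTF-16 LE)

        // Reading bytes back with a different encoding than the one used
        // for writing is what produces garbled characters.
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(text);
        string wrong = Encoding.Unicode.GetString(utf8Bytes);
        string right = Encoding.UTF8.GetString(utf8Bytes);

        Console.WriteLine(wrong == text);  // False
        Console.WriteLine(right == text);  // True
    }
}
```

Under that assumption, the encoding passed to the writer only has to match what the program opening the txt file expects. It does not need to match whatever encoding was used inside the PDF.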
  •  

    I have already tried other encodings, including the ones you just mentioned. How can I find out the default encoding of the PDF file?
    Thursday, December 20, 2007 8:46 AM
  • Tuesday, December 25, 2007 3:32 AM