Is Microsoft Translator designed to translate whole web pages?

  • Question

  • Hi there

    My app needs to grab, translate and save web pages at different times of the day so that they can be read in English during working hours.

    I've written a simple Translator App using the SOAP API that takes a URL, streams the page into a string and passes it to the SOAP API.

    It works perfectly for very simple web pages. For larger pages (20-100K) it fails every time with 'Bad Request', even though the same pages translate successfully using Translator manually via IE. Most of the pages I need to process are in the 50-200K size range.

    For example, the app works with this simple URL: http://www.lecanardenchaine.fr/trvlap.html

    But fails with every other one I try (eg: http://fr.wikipedia.org/wiki/Albert_Camus)

    Neither of the above pages passes the W3C validator test, but both render correctly in all browsers and in other HTML pre-processors.

    Am I wrong to assume that Translator can be used for such tasks or am I missing something?

    I've written the same programme using the HTTP API and get the same behaviour - the 'Bad Request' is accompanied by a WebException status of 'ProtocolError' (7).

    The code I'm using is very simple:

                     WebClient wb = new WebClient();
                     string sourceText = wb.DownloadString(url);
                     string  translationResult = client.Translate("", sourceText, "fr", "en", "text/html", "general");

    where client is an instance of TranslatorService.

    any help much appreciated - I need to decide whether I can continue using Translator.

    best wishes

    Rob Macdonald

    Thursday, April 12, 2012 4:26 PM

All replies

  • Hello, Rob,

    You see this issue consistently when you try to translate web pages with a lot of content.

    This is because you are hitting our API call size limit. Each translation request has a size limit of 10K characters.

    In this case, you need to break "sourceText" down into multiple chunks and translate each one separately.

    Thursday, April 12, 2012 9:29 PM
  • Hi,

    You have actually reached the WCF limit on how much data you can send per request (i.e., sourceText is too long). Our recommendation is to send the data in chunks of under 5000 bytes each. A good way to do this is to break the page at <p> tags and then send the "translate" requests iteratively.
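
    The chunking described above could be sketched roughly as follows. This is a minimal illustration, not an official sample: the HtmlChunker class and SplitIntoChunks method are hypothetical names, the limit is counted in characters rather than bytes (an assumption, since the two replies quote different limits), and a paragraph that is longer than the limit on its own is simply cut at the limit, which may split a tag.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Text.RegularExpressions;

    // Hypothetical helper, not part of the Translator API.
    public static class HtmlChunker
    {
        // Splits html after each closing </p> tag, then greedily packs the
        // pieces into chunks of at most maxChars characters each.
        public static List<string> SplitIntoChunks(string html, int maxChars)
        {
            if (html == null) throw new ArgumentNullException(nameof(html));
            if (maxChars <= 0) throw new ArgumentOutOfRangeException(nameof(maxChars));

            var chunks = new List<string>();
            var current = new StringBuilder();

            // The lookbehind keeps each </p> attached to the piece it closes.
            foreach (string piece in Regex.Split(html, @"(?<=</p>)", RegexOptions.IgnoreCase))
            {
                string rest = piece;
                while (rest.Length > 0)
                {
                    int room = maxChars - current.Length;
                    if (rest.Length <= room)
                    {
                        // The piece fits into the chunk being built.
                        current.Append(rest);
                        rest = "";
                    }
                    else if (current.Length > 0)
                    {
                        // No room left: close the current chunk and retry.
                        chunks.Add(current.ToString());
                        current.Clear();
                    }
                    else
                    {
                        // Assumption: an oversized piece is hard-split at the limit.
                        chunks.Add(rest.Substring(0, maxChars));
                        rest = rest.Substring(maxChars);
                    }
                }
            }

            if (current.Length > 0) chunks.Add(current.ToString());
            return chunks;
        }
    }
    ```

    Each chunk could then be passed to the same Translate call shown in the question, for example client.Translate("", chunk, "fr", "en", "text/html", "general"), and the results concatenated in order. A limit such as 5000 leaves headroom under the figures quoted in this thread.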

    hope that helps

    Thursday, April 12, 2012 9:32 PM
  • thank you for this reply.

    I couldn't find any reference to such limits in the documentation, nor in the error messages presented. Perhaps this could be addressed at some time?

    Friday, April 13, 2012 7:33 AM
  • thank you for this reply.

    I couldn't find any reference to such limits in the documentation, nor in the error messages presented. Perhaps this could be addressed at some time?

    Please note that it is not specifically a WCF constraint - the same error occurred when using the HTTP API. My understanding is that a WCF client contract should specifically report on message size violations with a meaningful error message. I didn't get such behaviour.

    I appreciate the recommendation and tip on how to work round it. Are there other such recommendations? There was little of this nature in the FAQ.

    many thanks

    rob

    Friday, April 13, 2012 7:42 AM
    Your app.config is where you define the client-side configuration:

    <binding name="BasicHttpBinding_ITranslationServiceContract"
        closeTimeout="00:01:00" openTimeout="00:01:00" receiveTimeout="00:10:00"
        sendTimeout="00:01:00" allowCookies="false" bypassProxyOnLocal="false"
        hostNameComparisonMode="StrongWildcard" maxBufferSize="65536"
        maxBufferPoolSize="524288" maxReceivedMessageSize="65536"
        messageEncoding="Text" textEncoding="utf-8" transferMode="Buffered"
        useDefaultWebProxy="true">
        <readerQuotas maxDepth="32" maxStringContentLength="8192" maxArrayLength="16384"
            maxBytesPerRead="4096" maxNameTableCharCount="16384" />
        <security mode="None">
            <transport clientCredentialType="None" proxyCredentialType="None"
                realm="" />
            <message clientCredentialType="UserName" algorithmSuite="Default" />
        </security>
    </binding>

    So by default the maximum package size you can send is 65536 bytes (including all the header information, other parameters, etc.). If you send text of length 50000, you will get an error message like this from our API:

    The formatter threw an exception while trying to deserialize the message: Error in deserializing body of request message for operation 'Translate'. The maximum string content length quota (30720) has been exceeded while reading XML data. This quota may be increased by changing the MaxStringContentLength property on the XmlDictionaryReaderQuotas object used when creating the XML reader. Line 190, position 148

    That message is meaningful to you. But when your packet exceeds the limit on the client side, it never reaches us, and you just get the default WCF error message. Hope that helps.

    (Side note: the HTTP/REST/SOAP APIs are all implemented on top of WCF.)

    Friday, April 13, 2012 5:26 PM
    When making large translation requests using TranslateArray, I got this error message:

    "The maximum message size quota for incoming messages (65536) has been exceeded. To increase the quota, use the MaxReceivedMessageSize property on the appropriate binding element."

    A programmatic way to overcome it, which worked for me, was to raise the limits on the binding:

    // Requires: using System.ServiceModel;
    var translator = new MicrosoftTranslatorService.LanguageServiceClient();
    var binding = (BasicHttpBinding)translator.Endpoint.Binding;
    // Raise the client-side quotas for incoming (response) messages.
    binding.MaxReceivedMessageSize = int.MaxValue;
    binding.MaxBufferSize = int.MaxValue;

    Thursday, May 31, 2012 10:38 AM
  • A different way to approach this would be to use the HTML Agility pack. http://htmlagilitypack.codeplex.com/

    This allows you to pull down the whole document, and treat it like it is well formed. You can then use XPath syntax to derive the text nodes and translate them. I'm working on a command-line tool that will allow users to do this.

    The code will look like:

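    private void processDocument(HtmlAgilityPack.HtmlDocument html)
    {
        // Select every non-whitespace text node in the document.
        HtmlNodeCollection coll = html.DocumentNode.SelectNodes("//text()[normalize-space(.) != '']");
        if (coll == null) return; // SelectNodes returns null when nothing matches

        foreach (HtmlNode node in coll)
        {
            // Only translate nodes whose text contains no nested markup.
            if (node.InnerText == node.InnerHtml)
            {
                node.InnerHtml = translateText(node.InnerText);
            }
        }
    }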

    The translateText function uses the Microsoft Translator API to translate the text. I have to assign the result to InnerHtml because InnerText is read-only.

    This makes it easy to translate even huge pages. I've tested it with a 500 KB page and it worked nicely.

    Thursday, May 31, 2012 3:15 PM