Symfony Dom Crawler Component and UTF-8 HTML

I used Symfony's DomCrawler component to parse results from Google SERP pages. On the first try, I found that the result of $crawler->html() was not what I expected: all non-ASCII UTF-8 characters were displayed incorrectly. The problem is easy to reproduce with this script (assuming all classes have been loaded via Composer's autoloader or some other means):

$html = <<<'HTML'
<!DOCTYPE html>
<html>
  <head>
    <title>Đây là một chuỗi UTF-8</title>
  </head>
  <body>
  </body>
</html>
HTML;

$crawler = new \Symfony\Component\DomCrawler\Crawler($html);
echo $crawler->html();

The output will look like this, except that the non-ASCII characters in the title come out garbled:

<head><title>Đây là một chuỗi UTF-8</title></head>
<body>

</body>

This is absolutely not what we want.
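
To see what this kind of corruption looks like independently of the crawler, here is a minimal sketch that decodes UTF-8 bytes as if they were ISO-8859-1 (the charset discussed in the explanation further down). It assumes the mbstring extension is available and that the file is saved as UTF-8; it only illustrates the effect and is not the crawler's actual code.

```php
<?php
// Assumption: mbstring is loaded and this source file is UTF-8 encoded.
$original = 'Đây'; // 3 characters, 5 bytes in UTF-8

// Treat the UTF-8 bytes as ISO-8859-1 and convert them "back" to UTF-8.
// Every byte becomes its own character, which is what a parser
// working with the wrong charset effectively does.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo strlen($original), ' bytes, ', mb_strlen($original, 'UTF-8'), " characters\n";
echo mb_strlen($garbled, 'UTF-8'), " characters after decoding with the wrong charset\n";
echo bin2hex($garbled), "\n"; // c384c290c383c2a279
```

The three original characters turn into five, because each of the five bytes is read as a separate character.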

Quick fix:

Instead of passing the HTML content to the crawler constructor, pass it in a separate call to the addHtmlContent method:

$html = <<<'HTML'
<!DOCTYPE html>
<html>
<head>
  <title>Đây là một chuỗi UTF-8</title>
</head>
<body>

</body>
</html>
HTML;

$crawler = new \Symfony\Component\DomCrawler\Crawler();
$crawler->addHtmlContent($html); // this line is added
echo $crawler->html();

 

And the output will be fine:

<head><title>Đây là một chuỗi UTF-8</title></head>
<body>

</body>

Note: html() returns only the contents of the <html> element, so the <!DOCTYPE> declaration and the <html> tags have to be added back manually.
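
As a minimal sketch of adding that markup back by hand, the helper below wraps the inner HTML in a doctype and an <html> element. The function name wrapAsDocument is hypothetical, and any attributes on the original <html> tag are not restored.

```php
<?php
/**
 * Hypothetical helper: rebuilds a full document around the inner HTML
 * (the <head> and <body> parts), e.g. the string returned by $crawler->html().
 */
function wrapAsDocument(string $innerHtml): string
{
    return "<!DOCTYPE html>\n<html>\n" . $innerHtml . "\n</html>\n";
}

echo wrapAsDocument('<head><title>Đây là một chuỗi UTF-8</title></head><body></body>');
```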

Explanation

This unexpected behavior occurs with HTML content that has no meta element telling the crawler that the content is UTF-8 encoded. In the constructor, the component checks for such a declaration to decide whether to use UTF-8 or the default ISO-8859-1 charset (the default defined in the HTTP 1.1 spec). The addHtmlContent method, on the other hand, lets us choose which charset is used; its charset parameter defaults to UTF-8, which is why the quick fix works without passing one explicitly.
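
The decision described above can be sketched roughly as follows. Both function names (guessCharset and ensureMetaCharset) are hypothetical, and the regular expression is only an approximation; the component's real detection logic may differ between Symfony versions. The second helper shows an alternative workaround under the same assumption: declare the charset in the markup itself before handing it to the constructor.

```php
<?php
/**
 * Hypothetical helper: looks for a charset declaration in a <meta> element
 * and falls back to a default charset when none is found.
 */
function guessCharset(string $html, string $default = 'ISO-8859-1'): string
{
    // Covers <meta charset="..."> as well as
    // <meta http-equiv="Content-Type" content="text/html; charset=...">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_:.\-]+)/i', $html, $m)) {
        return strtoupper($m[1]);
    }

    return $default;
}

/**
 * Hypothetical helper: inserts <meta charset="..."> right after the opening
 * <head> tag when no charset is declared. HTML without a <head> tag is
 * returned unchanged.
 */
function ensureMetaCharset(string $html, string $charset = 'UTF-8'): string
{
    if (guessCharset($html, '') !== '') {
        return $html; // a charset is already declared
    }

    // Replace only the first <head ...> tag; "$0" is the matched tag itself.
    $result = preg_replace('/<head(\s[^>]*)?>/i', '$0<meta charset="' . $charset . '">', $html, 1);

    return $result ?? $html;
}

$html = '<html><head><title>Đây là một chuỗi UTF-8</title></head><body></body></html>';

echo guessCharset($html), "\n";                    // ISO-8859-1
echo guessCharset(ensureMetaCharset($html)), "\n"; // UTF-8
```

The patched string could then be passed to the constructor, for example new \Symfony\Component\DomCrawler\Crawler(ensureMetaCharset($html)), although calling addHtmlContent with an explicit charset remains the more direct fix.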

7 thoughts on “Symfony Dom Crawler Component and UTF-8 HTML”

  1. Yes, but you should show that addHtmlContent() has a second parameter, which takes the charset of the input string.
    Example:
    addHtmlContent($html, 'UTF-8');

  3. If you use Crawler for nested content and crawl the nodes' HTML or XML, be sure to use addHtmlContent() for the parent Crawler(s), too.

    $crawler = new Crawler();
    $crawler->addHtmlContent($xml);

    $crawler->filter('Question')->each(function ($node) use (&$question) {
        // Build the child crawler with addHtmlContent() as well
        $crawler = new Crawler();
        $crawler->addHtmlContent($node->html());
        // ...
    });
