Symfony Dom Crawler Component and UTF-8 HTML

I used the Symfony DomCrawler component to parse some results from Google SERP pages. On the first try, I found that the result of $crawler->html() was not what I expected: all non-ASCII UTF-8 characters were displayed incorrectly. The problem can be reproduced easily with this script (assuming all classes have been loaded via Composer's autoloader or some other means):

$html = <<<'HTML'
<!DOCTYPE html>
<html>
  <head>
    <title>Đây là một chuỗi UTF-8</title>
  </head>
  <body>
  </body>
</html>
HTML;

$crawler = new \Symfony\Component\DomCrawler\Crawler($html);
echo $crawler->html();

The output will look like this, except that the accented characters in the title come out garbled:

<head><title>Đây là một chuỗi UTF-8</title></head>
<body>

</body>

Which is absolutely not what we want.
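
To make the damage concrete, here is a small sketch that uses only PHP's mbstring extension, with no Symfony involved. It assumes the garbling comes from UTF-8 bytes being read as ISO-8859-1, which is the usual cause of this kind of output; the variable names are only illustrative.

```php
<?php
// Stdlib-only illustration (needs the mbstring extension).
$original = 'Đây là một chuỗi UTF-8';

// Pretend every UTF-8 byte is one ISO-8859-1 character and re-encode as UTF-8.
// Each multi-byte character becomes two or three unrelated characters.
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

// Undo the mistake: the bytes were never lost, only misinterpreted.
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');

var_dump($garbled === $original);  // bool(false)
var_dump($restored === $original); // bool(true)
```

The garbled string has one character per byte of the original, which is why the title looks longer and unreadable once the wrong charset has been applied.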

Quick fix:

Instead of passing the HTML content to the Crawler constructor, pass it in a separate call to the addHtmlContent method:

$html = <<<'HTML'
<!DOCTYPE html>
<html>
<head>
  <title>Đây là một chuỗi UTF-8</title>
</head>
<body>

</body>
</html>
HTML;

$crawler = new \Symfony\Component\DomCrawler\Crawler();
$crawler->addHtmlContent($html); # this line is added
echo $crawler->html();

And the output will be fine:

<head><title>Đây là một chuỗi UTF-8</title></head>
<body>

</body>

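Note: html() returns only the inner markup of the root element, so the <!DOCTYPE> declaration and the <html> tags have to be added back manually if you need the full document.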
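
The charset can also be stated explicitly instead of relying on the default. The following is a minimal sketch, assuming symfony/dom-crawler is installed through Composer and the autoloader sits at vendor/autoload.php (that path is an assumption; adjust it to your project). It reads the title back with filterXPath so that no extra package is needed.

```php
<?php
// Assumption: Composer autoloader at this path, symfony/dom-crawler installed.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = '<html><head><title>Đây là một chuỗi UTF-8</title></head><body></body></html>';

$crawler = new Crawler();
// The second argument names the charset of $html; 'UTF-8' is also the default.
$crawler->addHtmlContent($html, 'UTF-8');

// filterXPath() works without symfony/css-selector, unlike filter().
$title = $crawler->filterXPath('//title')->text();
echo $title, PHP_EOL;
```

Passing the charset explicitly makes the intent visible to the next reader and is the place to put a different value if the source page is not UTF-8.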

Explanation

This unexpected behavior occurs with HTML content that has no meta element declaring the content as UTF-8 encoded. When content is passed to the constructor, the component looks for such a declaration to decide whether to use UTF-8 or fall back to the default ISO-8859-1 charset (the default defined in the HTTP 1.1 spec). The addHtmlContent method, by contrast, lets us choose which charset is used through its second argument, which defaults to UTF-8.
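
Following that reasoning, declaring the charset inside the markup should let the constructor pick UTF-8 on its own. Here is a hedged sketch, again assuming a Composer autoloader at vendor/autoload.php: the only change to the original reproduction script is the added meta element.

```php
<?php
// Assumption: Composer autoloader at this path, symfony/dom-crawler installed.
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

$html = <<<'HTML'
<!DOCTYPE html>
<html>
  <head>
    <meta charset="utf-8">
    <title>Đây là một chuỗi UTF-8</title>
  </head>
  <body>
  </body>
</html>
HTML;

// The constructor is used again; the meta element carries the charset hint.
$crawler = new Crawler($html);

$title = $crawler->filterXPath('//title')->text();
echo $title, PHP_EOL;
```

This variant only helps when you control the markup or can prepend the declaration yourself; for pages fetched from elsewhere, addHtmlContent is the more direct route.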

5 thoughts on “Symfony Dom Crawler Component and UTF-8 HTML”

  1. Yes, but you should show that addHtmlContent takes a second parameter that gives the charset of the input string.
    Example:
    addHtmlContent($html, 'UTF-8');
