Symfony Dom Crawler Component and UTF-8 HTML

I used Symfony's DomCrawler component to parse results from Google search result pages (SERPs). On the first try, I found that the result of $crawler->html() was not what I expected: all non-ASCII UTF-8 characters were displayed incorrectly. The problem can be reproduced easily with this script (assuming all classes have been loaded via Composer's autoloader or some other means):

$html = <<<'HTML'
<!DOCTYPE html>
<html>
<head>
  <title>Đây là một chuỗi UTF-8</title>
</head>
<body>

</body>
</html>
HTML;

$crawler = new \Symfony\Component\DomCrawler\Crawler($html);
echo $crawler->html();

The output will be:

<head><title>Đây là một chuỗi UTF-8</title></head>
<body>

</body>

This is absolutely not what we want: the non-ASCII characters in the title come out garbled.

Quick fix:

Instead of passing the HTML content to the Crawler constructor, pass it in a separate call to the addHtmlContent method:

$html = <<<'HTML'
<!DOCTYPE html>
<html>
<head>
  <title>Đây là một chuỗi UTF-8</title>
</head>
<body>

</body>
</html>
HTML;

$crawler = new \Symfony\Component\DomCrawler\Crawler();
$crawler->addHtmlContent($html); // this line is added
echo $crawler->html();

And the output will be fine:

<head><title>Đây là một chuỗi UTF-8</title></head>
<body>

</body>

Note: html() returns only the markup inside the <html> element, so the <!DOCTYPE> declaration and the <html> tags need to be added back manually.
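If a complete document is needed, the missing wrapper can be put back around whatever html() returns. Here is a minimal sketch, assuming the HTML5 doctype from the example above and a source file saved as UTF-8. wrapDocument is a hypothetical helper name, not part of the component.

```php
<?php

/**
 * Hypothetical helper (not part of Symfony): re-add the doctype and the
 * <html> element around the inner markup returned by $crawler->html().
 */
function wrapDocument(string $innerHtml): string
{
    return "<!DOCTYPE html>\n<html>\n" . $innerHtml . "\n</html>";
}

// Stand-in for the string returned by $crawler->html() in the example above.
$inner = "<head><title>Đây là một chuỗi UTF-8</title></head>\n<body>\n\n</body>";

echo wrapDocument($inner), "\n";
```

In real code the argument would be $crawler->html(). Any attributes the original <html> tag carried (lang, for instance) are not restored by this sketch.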

Explanation

This unexpected behavior happens with HTML content that has no meta element declaring the content as UTF-8. When content is passed to the constructor, the component looks for such a declaration to decide whether to use UTF-8 or fall back to the default ISO-8859-1 charset (the default defined in the HTTP 1.1 spec). The addHtmlContent method lets us choose the charset ourselves: its optional second argument defaults to UTF-8, which is why the call above works without passing it.
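To see why stating the charset up front helps, here is a rough sketch of the general idea using only PHP's built-in DOM and mbstring extensions. It is not the component's actual source, which may differ between versions. titleOf is a hypothetical helper, and the sketch assumes the mbstring extension is enabled and the source file is saved as UTF-8. Because the caller names the charset, every non-ASCII character can be rewritten as a numeric entity before parsing, so the parser never has to guess an encoding.

```php
<?php

/**
 * Hypothetical helper (not Symfony's code): read the <title> of an HTML
 * string whose charset the caller already knows.
 */
function titleOf(string $html, string $charset = 'UTF-8'): string
{
    // Rewrite every non-ASCII character as a numeric entity (for example
    // "Đ" becomes "&#272;"), so the parser only ever sees ASCII bytes.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], $charset);

    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $dom = new DOMDocument();
    $dom->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $title = $dom->getElementsByTagName('title')->item(0);

    return $title === null ? '' : $title->textContent;
}

$html = '<!DOCTYPE html><html><head><title>Đây là một chuỗi UTF-8</title></head><body></body></html>';

echo titleOf($html), "\n";
```

Adding a meta charset declaration to the markup is the other route the explanation above points to. The entity approach is shown here only because it does not depend on how a given parser version guesses encodings.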

2 comments

  1. jason on

    Yes, but you should show that addHtmlContent() takes a second parameter, which gives the charset of the input string.
    Example:
    addHtmlContent($html, 'UTF-8');
