I have used Symfony's DomCrawler component to parse some results from Google SERPs. After the first try, I learned that the result of $crawler->html() was not what I expected: all non-ASCII UTF-8 characters were displayed incorrectly. The problem can be reproduced easily with this script (assuming that all classes have been loaded via Composer's autoloader or some other means):
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<head>
<title>Đây là một chuỗi UTF-8</title>
</head>
<body>
</body>
</html>
HTML;
$crawler = new \Symfony\Component\DomCrawler\Crawler($html);
echo $crawler->html();
The output will be:
<head><title>Äây là má»t chuá»i UTF-8</title></head>
<body>
</body>
Which is absolutely not what we want.
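The garbling above is what you would expect if the UTF-8 bytes of the title were read one byte at a time as ISO-8859-1. The following is a minimal sketch in plain PHP that reproduces the effect on the first letter of the title. It does not use Symfony, it assumes the mbstring extension is available, and the variable names are purely illustrative:

```php
<?php
// "Đ" is U+0110, stored in UTF-8 as the two bytes 0xC4 0x90.
$original = "\u{0110}";

// Treat those bytes as ISO-8859-1 and re-encode them as UTF-8:
// each byte becomes its own character (0xC4 -> "Ä", 0x90 -> an invisible control character).
$garbled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');

echo bin2hex($original), PHP_EOL; // c490
echo bin2hex($garbled), PHP_EOL;  // c384c290

// Reversing the conversion restores the original character.
$restored = mb_convert_encoding($garbled, 'ISO-8859-1', 'UTF-8');
var_dump($restored === $original); // bool(true)
```

The single character "Đ" turns into "Ä" followed by a non-printing character, which matches the "Äây" at the start of the broken title.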
Quick fix:
Instead of passing the HTML content to the Crawler constructor, pass it in a separate call to the addHtmlContent method:
$html = <<<'HTML'
<!DOCTYPE html>
<html>
<head>
<title>Đây là một chuỗi UTF-8</title>
</head>
<body>
</body>
</html>
HTML;
$crawler = new \Symfony\Component\DomCrawler\Crawler();
$crawler->addHtmlContent($html); // this line is added
echo $crawler->html();
And the output will be fine:
<head><title>Đây là một chuỗi UTF-8</title></head>
<body>
</body>
Note: the <!DOCTYPE> and <html> markup are not part of the output and need to be added manually.
Explanation
This unexpected behavior occurs with HTML content that has no meta element telling the crawler that the content is UTF-8 encoded. When content is passed to the constructor, the component checks for such a declaration to decide whether to use UTF-8 or the default ISO-8859-1 charset (the default defined in the HTTP/1.1 spec). The addHtmlContent method lets us choose which charset is used.
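To make that decision concrete, here is a simplified sketch of the logic as described above. It is not the component's actual source: guessCharset is a hypothetical helper written for illustration, and its regular expression is an assumption about what counts as a charset declaration.

```php
<?php
/**
 * Hypothetical helper, not part of Symfony: picks a charset the way the
 * explanation above describes.
 */
function guessCharset(string $html): string
{
    // Look for a charset declared in a <meta> element, e.g.
    // <meta charset="UTF-8"> or
    // <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">.
    if (preg_match('/<meta[^>]+charset\s*=\s*["\']?([a-zA-Z0-9_:.\-]+)/i', $html, $matches)) {
        return $matches[1];
    }

    // No declaration found: fall back to the HTTP/1.1 default.
    return 'ISO-8859-1';
}

$withoutMeta = '<html><head><title>Đây là một chuỗi UTF-8</title></head><body></body></html>';
$withMeta    = '<html><head><meta charset="UTF-8"><title>Đây là một chuỗi UTF-8</title></head><body></body></html>';

echo guessCharset($withoutMeta), PHP_EOL; // ISO-8859-1
echo guessCharset($withMeta), PHP_EOL;    // UTF-8
```

Under this description, declaring the charset in the document itself, for example with a meta charset element, should be another way to avoid the problem when you control the HTML.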
Yes, but you should show that addHtmlContent() takes a second parameter, which specifies the charset of the input string.
Example
$crawler->addHtmlContent($html, 'UTF-8');
You saved me! Kudos
Thank you, solved my problem
Finally, this is no longer necessary: https://github.com/symfony/dom-crawler/commit/28c1dd075b63f470d17f2a421b984ba5761aa55d
If you use Crawler for nested things and crawl nodes’ HTML or XML, be sure to do this for parent Crawler(s), too.
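use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler();
$crawler->addHtmlContent($xml);
$crawler->filter('Question')->each(function ($node) use (&$question) {
    // Re-parse the node's markup with addHtmlContent() as well
    $crawler = new Crawler();
    $crawler->addHtmlContent($node->html());
});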