How to parse large XML files in PHP?

Some time ago I faced the problem of parsing large XML files in PHP. Small files are no problem and parse quickly, but an attempt to parse larger files often ends in a timeout or an Internal Server Error. Such large files, however, are often used for remote updates of offers (e.g. those published by wholesalers).

This is because PHP enforces resource limits (such as memory and execution time), and the standard parsing methods (such as DOMDocument) load the whole document into memory, so they can easily exhaust those limits.
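To see which limits apply on a given host, the current values can be read at runtime. This is only a minimal sketch: it assumes that memory_limit and max_execution_time are the settings being hit, which the errors above suggest but do not prove.

```php
<?php
// Assumption: these two settings are the limits a large parse runs into.
$limits = [
    'memory_limit'       => ini_get('memory_limit'),
    'max_execution_time' => ini_get('max_execution_time'),
];

foreach ($limits as $name => $value) {
    echo $name . ' = ' . $value . PHP_EOL;
}

// Peak memory allocated by the script so far, in megabytes.
$peakMb = memory_get_peak_usage(true) / 1048576;
echo 'peak memory: ' . round($peakMb, 1) . ' MB' . PHP_EOL;
```

Printing the peak memory again after a parse gives a rough idea of how close the script comes to memory_limit.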

The solution is to use the XMLReader class, which has been available by default in a standard PHP configuration since PHP 5.1.0.
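On shared hosting it may be worth confirming that the extension is really enabled before relying on it. A small check could look like this; xmlReaderAvailable() is a hypothetical helper name used only for this sketch.

```php
<?php
// Hypothetical helper: true when the xmlreader extension is loaded
// and the XMLReader class can be used.
function xmlReaderAvailable(): bool
{
    return extension_loaded('xmlreader') && class_exists('XMLReader');
}

echo xmlReaderAvailable()
    ? 'XMLReader is available' . PHP_EOL
    : 'XMLReader is missing on this host' . PHP_EOL;
```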

I ran a quick speed comparison of DOMDocument and XMLReader on 4 different machines.

Comparison

XML file size: 208 MB
Number of entries: 148723

Machine                       DOMDocument            XMLReader
Laptop                        269 sec.               41 sec.
Dedicated server / Hetzner    264 sec.               15 sec.
Shared server / vipserv.org   error 500 / timeout    15 sec.
Shared server / IQ.pl         277 sec.               33 sec.

As you can see, the difference is enormous: in these tests XMLReader was roughly 6 to 18 times faster, and on one shared server DOMDocument did not finish at all. For large files, XMLReader is the clear choice.
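A comparison like this can be reproduced with a small timing wrapper. The sketch below is not the measurement code behind the numbers above; timeIt() is a hypothetical helper, and the closure is a stand-in workload to be replaced with the parsing code under test.

```php
<?php
// Hypothetical helper: runs $parse and returns the elapsed wall-clock
// seconds together with the peak memory (in bytes) reported afterwards.
function timeIt(callable $parse): array
{
    $start = microtime(true);
    $parse();
    return [
        'seconds'    => microtime(true) - $start,
        'peak_bytes' => memory_get_peak_usage(true),
    ];
}

// Stand-in workload; put the DOMDocument or XMLReader loop here instead.
$result = timeIt(function () {
    usleep(10000); // sleep for 10 ms
});

printf("%.3f sec., %d bytes peak\n", $result['seconds'], $result['peak_bytes']);
```

Running each parser in a separate request or process keeps the peak memory figure of one from leaking into the other.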

Code used

The code used for parsing with DOMDocument:

$doc = new DOMDocument();
$doc->load($localurl); // builds the whole document tree in memory
$items = $doc->getElementsByTagName("item");
$countItems = $items->length;
$count = 0; // number of items processed

foreach($items as $item)
{
	// Read the text of the first matching child element for each field.
	$id = $item->getElementsByTagName("id")->item(0)->nodeValue;
	$url = $item->getElementsByTagName("url")->item(0)->nodeValue;
	$title = $item->getElementsByTagName("title")->item(0)->nodeValue;
	$author = $item->getElementsByTagName("author")->item(0)->nodeValue;
	$isbn = $item->getElementsByTagName("isbn")->item(0)->nodeValue;
	$image = $item->getElementsByTagName("image")->item(0)->nodeValue;
	$ean = $item->getElementsByTagName("ean")->item(0)->nodeValue;
	$published = $item->getElementsByTagName("published")->item(0)->nodeValue;
	$publisher = $item->getElementsByTagName("publisher")->item(0)->nodeValue;
	$pages = $item->getElementsByTagName("pages")->item(0)->nodeValue;
	$price = $item->getElementsByTagName("price")->item(0)->nodeValue;
	$description = $item->getElementsByTagName("description")->item(0)->nodeValue;
	$status = $item->getElementsByTagName("status")->item(0)->nodeValue;
	$count++;
}
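One thing to watch in the DOMDocument version: if an item lacks one of the elements, item(0) returns null and reading nodeValue from it fails. A small guard avoids that. The sketch below uses a made-up two-item document, and childText() is a hypothetical helper, not part of the DOM API.

```php
<?php
// Hypothetical helper: text of the first <$tag> descendant, or null if absent.
function childText(DOMElement $item, string $tag): ?string
{
    $node = $item->getElementsByTagName($tag)->item(0);
    return $node === null ? null : $node->nodeValue;
}

// Made-up sample data; the second item has no <title>.
$xml = '<items><item><id>1</id><title>First</title></item>'
     . '<item><id>2</id></item></items>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$rows = [];
foreach ($doc->getElementsByTagName('item') as $item) {
    $rows[] = [
        'id'    => childText($item, 'id'),
        'title' => childText($item, 'title'),
    ];
}

var_dump($rows[1]['title']); // NULL, because the second item has no <title>
```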

The code used for parsing with XMLReader:

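$reader = new XMLReader();
$reader->open($localurl);

$nodeName = ''; // name of the most recently opened element
$count = 0;     // number of complete items seen

while($reader->read())
{
	// Remember which element was opened so the text that follows can be assigned.
	if($reader->nodeType == XMLReader::ELEMENT) $nodeName = $reader->name;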
	if($reader->nodeType == XMLReader::TEXT || $reader->nodeType == XMLReader::CDATA)
	{
		if ($nodeName == 'id') $id = $reader->value;
		if ($nodeName == 'url') $url = $reader->value;
		if ($nodeName == 'title') $title = $reader->value;
		if ($nodeName == 'author') $author = $reader->value;
		if ($nodeName == 'isbn') $isbn = $reader->value;
		if ($nodeName == 'image') $image = $reader->value;
		if ($nodeName == 'ean') $ean = $reader->value;
		if ($nodeName == 'published') $published = $reader->value;
		if ($nodeName == 'publisher') $publisher = $reader->value;
		if ($nodeName == 'pages') $pages = $reader->value;
		if ($nodeName == 'price') $price = $reader->value;
		if ($nodeName == 'description') $description = $reader->value;
		if ($nodeName == 'status') $status = $reader->value;
	}

	if($reader->nodeType == XMLReader::END_ELEMENT && $reader->name == 'item')
	{
		$count++;
	}
}
$reader->close();
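A common variation, which is not the code benchmarked above, is to let XMLReader walk the file and hand each item to SimpleXML, so that fields can be read by name while only one item is held in memory at a time. The sketch below runs on a made-up in-memory sample; for a real feed the XML() call would be replaced with open($localurl).

```php
<?php
// Made-up sample data standing in for a large feed.
$xml = '<items>'
     . '<item><id>1</id><title>First</title></item>'
     . '<item><id>2</id><title><![CDATA[Second & last]]></title></item>'
     . '</items>';

$reader = new XMLReader();
$reader->XML($xml); // for a real file: $reader->open($localurl);

// Skip ahead to the first <item> element.
while ($reader->read() && $reader->name !== 'item') {
}

$rows = [];
while ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
    // Parse just this item; LIBXML_NOCDATA turns CDATA sections into plain text.
    $item = simplexml_load_string(
        $reader->readOuterXml(),
        'SimpleXMLElement',
        LIBXML_NOCDATA
    );
    $rows[] = [
        'id'    => (string) $item->id,
        'title' => (string) $item->title,
    ];
    $reader->next('item'); // jump to the next <item>, skipping this subtree
}
$reader->close();

print_r($rows);
```

This trades a little speed for readability, so it is worth timing against the plain read() loop before switching.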
