Jan 02 2012

How to parse large XML files in PHP?

Some time ago I faced the problem of parsing large XML files in PHP. Small files are no problem and parse quickly, but an attempt to parse larger files often causes a timeout or an Internal Server Error. Such large files, however, are often used for remote updates of offers (e.g. those published by wholesalers).

This is because PHP enforces resource limits on every script (such as memory and execution time), and a standard parsing method such as DOMDocument, which loads the whole document into memory, can easily exhaust them.
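
To see which limits apply on a given host, the values can be read at runtime. This is a minimal sketch, assuming the limits in question are `memory_limit` and `max_execution_time` (the article does not name them explicitly):

```php
<?php
// Read the limits PHP enforces on the current script.
$memoryLimit = ini_get('memory_limit');       // e.g. "128M"; "-1" means unlimited
$timeLimit   = ini_get('max_execution_time'); // in seconds; "0" means unlimited

printf("memory_limit: %s\n", $memoryLimit);
printf("max_execution_time: %s s\n", $timeLimit);

// Peak memory allocated by the script so far, in megabytes,
// to compare against memory_limit after a parsing run.
$peakMb = memory_get_peak_usage(true) / 1048576;
printf("peak memory so far: %.1f MB\n", $peakMb);
```

Printing the peak memory right after a parsing run gives a rough idea of how close the script came to the limit.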

The solution is to use the XMLReader class, which reads the file node by node instead of loading it whole, and which has been available by default in a standard PHP configuration since version 5.1.0.
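
On shared hosting it may be worth confirming that the extension is really enabled before relying on it. A small sketch of such a check (the variable names are only illustrative):

```php
<?php
// Check that the XML classes discussed here are available on this host.
$hasReader = extension_loaded('xmlreader') && class_exists('XMLReader');
$hasDom    = extension_loaded('dom') && class_exists('DOMDocument');

printf("PHP version: %s\n", PHP_VERSION);
printf("XMLReader available: %s\n", $hasReader ? 'yes' : 'no');
printf("DOMDocument available: %s\n", $hasDom ? 'yes' : 'no');
```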

I did a quick speed comparison of DOMDocument and XMLReader on 4 different machines.

XML filesize: 208 MB
Number of entries: 148723

Machine                        DOMDocument            XMLReader
Local computer                 269 s                  41 s
Dedicated server / Hetzner     264 s                  15 s
Shared server / vipserv.org    error 500 / timeout    15 s
Shared server / IQ.pl          277 s                  33 s

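As you can see, the difference is enormous: where both methods finished, XMLReader was roughly 6 to 18 times faster, and on the vipserv.org shared server it was the only one that completed at all. For large files we therefore choose XMLReader.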
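
A comparison like this can be reproduced with a small timing harness. The sketch below is not the script used for the measurements above; `timeIt` is a hypothetical helper, and a tiny generated XML string stands in for the real 208 MB file so that the example runs anywhere:

```php
<?php
// Hypothetical helper: runs $fn once and returns [seconds elapsed, result].
function timeIt(callable $fn): array
{
    $start = microtime(true);
    $result = $fn();
    return [microtime(true) - $start, $result];
}

// Small in-memory sample; a real test would point both parsers at the large file.
$xml = '<items>' . str_repeat('<item><id>1</id></item>', 1000) . '</items>';

// DOMDocument: parse everything first, then count the <item> elements.
[$domSeconds, $domCount] = timeIt(function () use ($xml) {
    $doc = new DOMDocument();
    $doc->loadXML($xml);
    return $doc->getElementsByTagName('item')->length;
});

// XMLReader: walk the document node by node and count opening <item> tags.
[$readerSeconds, $readerCount] = timeIt(function () use ($xml) {
    $reader = new XMLReader();
    $reader->XML($xml);
    $count = 0;
    while ($reader->read()) {
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === 'item') {
            $count++;
        }
    }
    $reader->close();
    return $count;
});

printf("DOMDocument: %d items in %.4f s\n", $domCount, $domSeconds);
printf("XMLReader:   %d items in %.4f s\n", $readerCount, $readerSeconds);
```

On such a small input the timings say very little; they only become meaningful on a file of realistic size.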

The code that parses the file with DOMDocument:

$doc = new DOMDocument();
// Loads and parses the whole file into memory at once.
$doc->load($localurl);
$items = $doc->getElementsByTagName("item");
$countItems = $items->length;
$count = 0; // items processed so far

foreach($items as $item)
{
	// Each lookup searches the children of the current <item>.
	$id = $item->getElementsByTagName("id")->item(0)->nodeValue;
	$url = $item->getElementsByTagName("url")->item(0)->nodeValue;
	$title = $item->getElementsByTagName("title")->item(0)->nodeValue;
	$author = $item->getElementsByTagName("author")->item(0)->nodeValue;
	$isbn = $item->getElementsByTagName("isbn")->item(0)->nodeValue;
	$image = $item->getElementsByTagName("image")->item(0)->nodeValue;
	$ean = $item->getElementsByTagName("ean")->item(0)->nodeValue;
	$published = $item->getElementsByTagName("published")->item(0)->nodeValue;
	$publisher = $item->getElementsByTagName("publisher")->item(0)->nodeValue;
	$pages = $item->getElementsByTagName("pages")->item(0)->nodeValue;
	$price = $item->getElementsByTagName("price")->item(0)->nodeValue;
	$description = $item->getElementsByTagName("description")->item(0)->nodeValue;
	$status = $item->getElementsByTagName("status")->item(0)->nodeValue;
	$count++;
}

 

The code that parses the file with XMLReader:

$reader = new XMLReader();
$reader->open($localurl);

$count = 0;     // items read so far
$nodeName = ''; // name of the most recently opened element

while($reader->read())
{
	// Remember which element we are currently inside.
	if($reader->nodeType == XMLReader::ELEMENT) $nodeName = $reader->name;
	// Text and CDATA nodes carry the value of that element.
	if($reader->nodeType == XMLReader::TEXT || $reader->nodeType == XMLReader::CDATA)
	{
		if ($nodeName == 'id') $id = $reader->value;
		if ($nodeName == 'url') $url = $reader->value;
		if ($nodeName == 'title') $title = $reader->value;
		if ($nodeName == 'author') $author = $reader->value;
		if ($nodeName == 'isbn') $isbn = $reader->value;
		if ($nodeName == 'image') $image = $reader->value;
		if ($nodeName == 'ean') $ean = $reader->value;
		if ($nodeName == 'published') $published = $reader->value;
		if ($nodeName == 'publisher') $publisher = $reader->value;
		if ($nodeName == 'pages') $pages = $reader->value;
		if ($nodeName == 'price') $price = $reader->value;
		if ($nodeName == 'description') $description = $reader->value;
		if ($nodeName == 'status') $status = $reader->value;
	}

	// A closing </item> tag marks the end of one record.
	if($reader->nodeType == XMLReader::END_ELEMENT && $reader->name == 'item')
	{
		$count++;
	}
}
$reader->close();
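
The loop above keeps each field in its own variable. One possible variation, sketched here as an illustration rather than as the code used in the test, replaces the chain of comparisons with a lookup table and hands each complete record to the caller. The `readItems` helper, the field list and the sample XML are all hypothetical:

```php
<?php
// Hypothetical helper: yields one associative array per <item> element.
function readItems(XMLReader $reader, array $fields): Generator
{
    $wanted = array_flip($fields); // field name => index, for fast isset() lookups
    $item = [];
    $nodeName = '';

    while ($reader->read()) {
        switch ($reader->nodeType) {
            case XMLReader::ELEMENT:
                // Remember which element we are currently inside.
                $nodeName = $reader->name;
                break;
            case XMLReader::TEXT:
            case XMLReader::CDATA:
                // Keep the value only if the element is one of the wanted fields.
                if (isset($wanted[$nodeName])) {
                    $item[$nodeName] = $reader->value;
                }
                break;
            case XMLReader::END_ELEMENT:
                // A closing </item> tag completes one record.
                if ($reader->name === 'item') {
                    yield $item;
                    $item = [];
                }
                break;
        }
    }
}

// Small sample document so the sketch is self-contained.
$xml = '<items>'
     . '<item><id>1</id><title>First</title></item>'
     . '<item><id>2</id><title><![CDATA[Second]]></title></item>'
     . '</items>';

$reader = new XMLReader();
$reader->XML($xml);

$items = [];
foreach (readItems($reader, ['id', 'title']) as $item) {
    $items[] = $item;
}
$reader->close();

print_r($items);
```

Because the generator yields one record at a time, the caller can write each item to a database and discard it, so memory use stays flat regardless of file size. For a real file, `$reader->open($localurl)` would take the place of `$reader->XML($xml)`.
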
3 Responses to How to parse large XML files in PHP?
  1. Anonymous

    In the loop that compares $nodeName it is worth adding else branches and continue statements; it will be even faster ;)

  2. grzegorz

    True, good suggestion.

  3. krzysztof

    I have to test this. So far, whenever I had to process large XML files, the script execution unfortunately failed. Even when I imported the XML into a MySQL database using phpMyAdmin.
