Web Scraping using PHP and cURL


This article explains web scraping in PHP by data-mining Google search results. Given a few search terms, we want to get the first 5 URLs that Google returns for each term.

To mine data from a site you must know its HTML structure: view the source of a page and look for common patterns around the data you want. At the time of writing, Google renders each search result URL inside an <h3 class="r"> tag. I suggest disabling JavaScript in your browser while you inspect the page, because PHP cURL does not execute JavaScript, so the markup it receives is the version served without it. To extract the search results you can use DOMDocument, regular expressions, or a PHP library such as simplehtmldom.

Web Scraping using Regular Expressions

The following example shows the use of a regular expression:

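<?php
header('Content-Type: text/plain');

// Study the URL and its parameters: q=php, or q={your keyword here}.
$curl = curl_init('http://www.google.com/search?output=search&q=php');

// Return the response body as a string instead of printing it.
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

$data = curl_exec($curl);

curl_close($curl);

// curl_exec() returns false when the request fails.
if ($data === false) {
    exit('Request failed');
}

$matches = array();

// Capture the target URL from each <h3 class="r"> result link.
preg_match_all('|<h3 class="r">.*?href="/url\?q=(.*?)&amp;.*?".*?</h3>|', $data, $matches);

print_r($matches[1]);

?>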

Your output will look similar to this:

Array
(
    [0] => http://php.net/
    [1] => http://en.wikipedia.org/wiki/PHP
    [2] => http://www.w3schools.com/php/
    [3] => http://www.w3schools.com/php/php_intro.asp
    [4] => http://www.codecademy.com/tracks/php
    [5] => http://www.php.com/
    [6] => http://www.zend.com/
    [7] => http://www.php.org/
    [8] => http://www.planet-php.net/
    [9] => http://www.phpmyadmin.net/
)
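
The goal stated in the introduction was the first 5 URLs for each search term, while the listing above prints every match for one hard-coded term. The following is a minimal sketch of how that could be done, not a definitive implementation. The helper names build_search_url() and extract_result_urls() are hypothetical, and the sample markup is made up so that the code runs without a network request.

```php
<?php
// Hypothetical helper: build the search URL for one term.
function build_search_url(string $term): string
{
    // http_build_query() percent-encodes the values (a space becomes "+").
    return 'http://www.google.com/search?' . http_build_query(
        ['output' => 'search', 'q' => $term],
        '',
        '&'
    );
}

// Hypothetical helper: return at most $limit result URLs found in $html.
function extract_result_urls(string $html, int $limit = 5): array
{
    $matches = [];

    // Same pattern as in the listing above; the "s" modifier also lets "." match newlines.
    preg_match_all('|<h3 class="r">.*?href="/url\?q=(.*?)&amp;.*?".*?</h3>|s', $html, $matches);

    // The captured value is a percent-encoded query parameter, so decode it.
    $urls = array_map('urldecode', $matches[1]);

    return array_slice($urls, 0, $limit);
}

// Made-up sample markup standing in for a downloaded result page.
$sample = '';
foreach (['a', 'b', 'c', 'd', 'e', 'f'] as $name) {
    $sample .= '<h3 class="r"><a href="/url?q=http://example.com/' . $name
             . '&amp;sa=U">Result ' . $name . '</a></h3>' . "\n";
}

echo build_search_url('php curl'), "\n";
print_r(extract_result_urls($sample));
```

In a real run, $sample would be replaced by the string returned from curl_exec() for each URL produced by build_search_url().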

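Writing a regular expression for HTML can be quite challenging, and the pattern breaks easily when the markup changes. Alternatively, you can use DOMDocument or simplehtmldom, which parse the document structure and usually give more reliable results.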

Web Scraping using DOMDocument

PHP provides another powerful way to parse and access an HTML or XML document: the DOMDocument class.

We have already explained how to grab links from Google search results using a regular expression. The following example demonstrates the same process using DOMDocument:

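<?php

header('Content-Type: text/html');

$curl = curl_init('http://www.google.com/search?output=search&q=php');

// Return the response body as a string instead of printing it.
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);

$data = curl_exec($curl);

curl_close($curl);

$domdoc = new DOMDocument();

// loadHTML() needs a non-empty string; "@" hides warnings about malformed markup.
if ($data !== false && $data !== '' && @$domdoc->loadHTML($data)) {

	// Check every <a> element in the document.
	foreach ($domdoc->getElementsByTagName('a') as $link) {

		// Keep only links whose parent element has class="r".
		$class = $link->parentNode->attributes->getNamedItem('class');

		if ($class !== null && $class->nodeValue == 'r' && $link->hasAttribute('href')) {

			// Escape the URL because the output is served as HTML.
			echo htmlspecialchars($link->getAttribute('href')), '<br />';

		}

	}

}

?>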

The example above prints the href of every link whose parent element has the class "r", that is, each search result link. The "@" character in front of $domdoc->loadHTML($data) suppresses the warnings that the parser generates when it encounters malformed HTML.

DOMDocument is a class for reading and manipulating HTML or XML documents. Its most useful properties and methods are explained below.
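
The "@" operator hides every warning raised by the call, not only those about malformed markup. As an alternative sketch, libxml_use_internal_errors() can collect parser messages silently, and DOMXPath can select the result links with one query instead of a loop over all anchors. The sample markup below is made up so that the code runs without a network request.

```php
<?php
// Made-up sample markup standing in for a downloaded result page.
$data = '<html><body>'
      . '<h3 class="r"><a href="/url?q=http://php.net/&amp;sa=U">PHP</a></h3>'
      . '<h3 class="other"><a href="/ignored">Not a result</a></h3>'
      . '<h3 class="r"><a href="/url?q=http://www.zend.com/&amp;sa=U">Zend</a></h3>'
      . '</body></html>';

// Collect parser messages internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$domdoc = new DOMDocument();
$domdoc->loadHTML($data);

// Discard the collected messages and restore the previous setting.
libxml_clear_errors();
libxml_use_internal_errors($previous);

// Select the href attribute of every <a> directly inside <h3 class="r">.
$xpath = new DOMXPath($domdoc);
$hrefs = [];
foreach ($xpath->query('//h3[@class="r"]/a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

print_r($hrefs);
```

The XPath expression assumes the class attribute is exactly "r"; an element with several classes would need a contains() test instead.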

Important Properties of DOMDocument

  • doctype:          the Document Type Declaration associated with the document
  • documentElement:  the root element of the document
  • childNodes:       a node list containing the children of the node
  • nodeName:         the name of the node
  • nodeValue:        the value of the node
  • xmlVersion:       the XML version of the document
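
The properties listed above can be tried on a small document. This is a minimal sketch; the XML content is made up for illustration.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<?xml version="1.0"?><!DOCTYPE note><note><to>Reader</to><body>Hello</body></note>');

// documentElement is the root element of the document.
$root = $doc->documentElement;

echo 'doctype:    ', $doc->doctype->name, "\n";
echo 'root:       ', $root->nodeName, "\n";
echo 'xmlVersion: ', $doc->xmlVersion, "\n";

// childNodes is a node list; each child has its own nodeName and nodeValue.
foreach ($root->childNodes as $child) {
    echo $child->nodeName, ' = ', $child->nodeValue, "\n";
}
```

Note that childNodes, nodeName and nodeValue are available on every node, not only on the document object itself.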

Important Methods of DOMDocument

  • getElementById():        returns the element whose ID matches the 1st parameter
  • getElementsByTagName():  returns a node list of all elements whose tag name matches the 1st parameter
  • renameNode():            renames the node passed as the 1st parameter, with the new namespace as the 2nd and the new name as the 3rd parameter (PHP never implemented this method and removed it in PHP 8)
  • createElement():         creates a new element node, with the name as the 1st parameter
  • createAttribute():       creates a new attribute node
  • createTextNode():        creates a new text node
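
The lookup and creation methods above can be combined as in the following minimal sketch. The markup, the "results" ID and the list contents are made up for illustration.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<html><body><ul id="results"><li>php.net</li></ul></body></html>');

// Find the existing list by its ID.
$list = $doc->getElementById('results');

// Build <li class="r">zend.com</li> from separate nodes.
$item = $doc->createElement('li');
$item->appendChild($doc->createTextNode('zend.com'));

$class = $doc->createAttribute('class');
$class->value = 'r';
$item->setAttributeNode($class);

// New nodes are not part of the document until they are appended.
$list->appendChild($item);

// getElementsByTagName() now also finds the new element.
$items = $doc->getElementsByTagName('li');
echo $items->length, "\n";

// Print the markup of the modified list.
echo $doc->saveHTML($list), "\n";
```

Nodes created with createElement(), createAttribute() or createTextNode() belong to the document that created them, which is why they can be appended without importing them first.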