Data scraping using cURL in PHP

on Saturday, 11 June 2011. Posted in Technology


First step: You will need to log in to the site using cURL. Inspect the login form with a tool such as Firebug, or view the page source, to see which fields are sent and what the endpoint of the request is. You will need all of this information to send the login request with cURL. It helps to make a small HTML copy of the form (using code from the site) on your local machine and try logging in with that. If it works, part of the job is done.
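As a rough complement to inspecting the form by hand, the sketch below lists the action URL and the named input fields of the first form in a page, using PHP's DOM extension. The helper name describe_form and the sample form are made up for illustration; a real login page may have several forms or fields added by JavaScript, which this does not handle.

```php
<?php
// Hypothetical helper: describe the first <form> found in an HTML string.
function describe_form(string $html): array
{
    // Real pages are rarely valid HTML, so collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $form = $doc->getElementsByTagName('form')->item(0);
    if ($form === null) {
        return ['action' => null, 'method' => null, 'fields' => []];
    }

    $fields = [];
    foreach ($form->getElementsByTagName('input') as $input) {
        $name = $input->getAttribute('name');
        if ($name !== '') {
            // Hidden fields (tokens and the like) usually carry a preset value.
            $fields[$name] = $input->getAttribute('value');
        }
    }

    return [
        'action' => $form->getAttribute('action'),
        'method' => strtoupper($form->getAttribute('method') ?: 'GET'),
        'fields' => $fields,
    ];
}

// Made-up login form, used only to show the shape of the result.
$sample = '<form action="/login.php" method="post">'
        . '<input type="text" name="username">'
        . '<input type="password" name="password">'
        . '<input type="hidden" name="token" value="abc123">'
        . '</form>';

print_r(describe_form($sample));
```

The result gives you the endpoint, the request method and every field name, including hidden ones, which is the information the login request needs.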

Another helpful trick: if the form uses a POST request, change it to GET and submit it to a local URL to see all the parameters that are passed. Sometimes there are hidden fields that are not easy to track. Inspect the query string from the GET request and build the cURL POST fields from it. When writing the login request, do not forget to save the cookie: set CURLOPT_COOKIEJAR to a filename so cURL writes the session cookies there, and set CURLOPT_COOKIEFILE to the same filename on later requests so the cookies are sent back. You can get a filename with $tmp_fname = tempnam(sys_get_temp_dir(), "COOKIE"); which works across platforms (Windows, Linux, Mac).
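A minimal sketch of such a login request follows. The URL, the field names and the function names login_options and login are assumptions for illustration only; take the real endpoint and fields from the form you inspected. The options are built in a separate function so they can be checked without touching the network.

```php
<?php
// Build the cURL options for a login POST (hypothetical helper).
function login_options(string $url, array $fields, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => http_build_query($fields), // URL-encoded body
        CURLOPT_COOKIEJAR      => $cookieFile, // cookies are written here when the handle is released
        CURLOPT_COOKIEFILE     => $cookieFile, // cookies are read from here for requests
        CURLOPT_FOLLOWLOCATION => true,        // many sites redirect after login
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
    ];
}

// Send the login request; returns the response body, or false on failure.
function login(string $url, array $fields, string $cookieFile)
{
    $ch = curl_init();
    curl_setopt_array($ch, login_options($url, $fields, $cookieFile));
    $body = curl_exec($ch);
    unset($ch); // releasing the handle flushes the cookie jar to disk
    return $body;
}

// Assumed values; replace them with the ones from the real form.
$cookieFile = tempnam(sys_get_temp_dir(), 'COOKIE');
$options = login_options(
    'https://example.com/login.php',
    ['username' => 'alice', 'password' => 'secret', 'token' => 'abc123'],
    $cookieFile
);
echo $options[CURLOPT_POSTFIELDS], "\n";
```

Calling login() with the same arguments would perform the actual request. Keep $cookieFile around, since every later request needs it to stay logged in.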

Every site comes with its own set of rules for login, but with the above points in mind you should be able to log in to most sites using cURL.

Second step: After login, the next step is to get the file and save it to disk. If the file's URL is simple (not dynamic), just send a GET request for that URL with cURL and save the response to disk. If the URL is dynamic, fetch the page that contains the link, search the page for the link by its text (knowledge of regular expressions comes in handy here) and extract the dynamic URL. Then invoke cURL again on that URL to get the data. One point to keep in mind with dynamic URLs: in the page source any & in the URL is encoded as &amp;, so the string you extract into a PHP variable will not work if you request it directly. Use htmlspecialchars_decode to get the actual URL and you should be able to save the data.
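The sketch below shows one way the second step could look. The function names, the sample page and the link text are hypothetical, and the regular expression assumes a plain double-quoted href directly on the link; real markup may need a different pattern.

```php
<?php
// Find the href of the first link whose text is $linkText and decode
// HTML entities in it. Returns null when no such link is found.
function extract_download_url(string $html, string $linkText): ?string
{
    $pattern = '/<a\s[^>]*href="([^"]+)"[^>]*>\s*'
             . preg_quote($linkText, '/')
             . '\s*<\/a>/i';
    if (!preg_match($pattern, $html, $m)) {
        return null;
    }
    // In page source "&" appears as "&amp;"; decode it before requesting.
    return htmlspecialchars_decode($m[1]);
}

// Download $url to $target, sending the cookies saved during login.
function download_file(string $url, string $cookieFile, string $target): bool
{
    $fp = fopen($target, 'wb');
    if ($fp === false) {
        return false;
    }
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_COOKIEFILE     => $cookieFile,
        CURLOPT_FILE           => $fp, // stream the response body straight to disk
        CURLOPT_FOLLOWLOCATION => true,
    ]);
    $ok = curl_exec($ch);
    unset($ch); // release the handle before closing the file
    fclose($fp);
    return $ok !== false;
}

// Made-up page fragment containing an entity-encoded dynamic link.
$page = '<p><a class="dl" href="/export.php?id=42&amp;format=csv">Download report</a></p>';
echo extract_download_url($page, 'Download report'), "\n";
```

If the extracted URL is relative, as in this sample, prepend the site's base URL before passing it to download_file together with the cookie file from the login step.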

