Web scraping is the extraction of data from websites. Although I have been working on this for quite a long time, I will just walk you through the basics. You can find my github repository on web scraping here.
Giving a demonstration without an aim is futile, so let's set a target: get user data from the website http://travelers.bootsnall.com.
urllib/urllib2: Since we are going through the basics, we will start with the very simple Python libraries. We will use urllib/urllib2 to send a request to a web page and get its data.
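A minimal sketch of fetching a page with urllib2 (Python 2), using our target site from above:

    import urllib2

    # Send a GET request and read the response body as a string
    response = urllib2.urlopen('http://travelers.bootsnall.com')
    html = response.read()
    print html[:200]  # peek at the beginning of the page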
Beautiful Soup:
In PHP, you would probably use cURL and search with regular expressions. In Python, Beautiful Soup comes to the rescue. You can find the proper documentation with loads of examples here.
The next step is to get some data out of the webpage. Let's try to get all the profile links on this very page.
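A sketch of how that could look; note that the '/profiles/' substring used to spot profile links is an assumption about the site's URL scheme, not something confirmed from the site itself:

    import urllib2
    from bs4 import BeautifulSoup

    html = urllib2.urlopen('http://travelers.bootsnall.com').read()
    soup = BeautifulSoup(html)

    # Walk every anchor tag and keep the hrefs that look like profile pages.
    # The '/profiles/' pattern is an assumption about the site's URL layout.
    profile_links = []
    for a in soup.find_all('a', href=True):
        if '/profiles/' in a['href']:
            profile_links.append(a['href'])

    print profile_links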
You can find the object-oriented implementation of this example here.
When to use mechanize:
The example that I described was pretty simple. Note that when we send HTTP requests with urllib, it identifies itself as a Python library in the User-Agent header. Sometimes, some (cunning) web developers block such requests that do not come from a browser. That is when we need to get our hands dirty!
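To see the mechanism at work, here is a minimal sketch of overriding that header with urllib2 alone; the 'Mozilla/5.0' string is just an example browser identifier:

    import urllib2

    # urllib2 sends 'Python-urllib/2.x' as the User-Agent by default,
    # which is exactly what naive server-side filters look for.
    req = urllib2.Request('http://travelers.bootsnall.com',
                          headers={'User-Agent': 'Mozilla/5.0'})
    html = urllib2.urlopen(req).read()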
Enter mechanize to make our lives simpler. We will emulate a browser with the help of mechanize and manage cookies with cookielib. I read about mechanize here (full credit to the original author for introducing me to mechanize!)
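A minimal sketch of that setup, again against our target site; the Firefox User-Agent string below is only an example:

    import mechanize
    import cookielib

    # Create a browser object and attach a cookie jar so cookies persist
    br = mechanize.Browser()
    cj = cookielib.LWPCookieJar()
    br.set_cookiejar(cj)

    # Ignore robots.txt and masquerade as Firefox so that naive
    # user-agent filters treat us like a real browser
    br.set_handle_robots(False)
    br.addheaders = [('User-Agent',
                      'Mozilla/5.0 (X11; Linux x86_64; rv:18.0) '
                      'Gecko/20100101 Firefox/18.0')]

    response = br.open('http://travelers.bootsnall.com')
    html = response.read()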
You can also log into a website using mechanize, but I will not be concentrating on that. What I will tell you is this. Consider this page (http://www.everytrail.com/guide/the-olomana-trail). urllib is not able to load the right column for some reason (I think it is disabled if the screen width is less than a particular value). Let's say I want the link to the author's profile (which is in the right column); here is the code I would use.
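Something along these lines, combining the mechanize browser from above with Beautiful Soup; the '/profile/' href pattern is an assumption about everytrail's URL layout:

    import mechanize
    from bs4 import BeautifulSoup

    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.addheaders = [('User-Agent',
                      'Mozilla/5.0 (X11; Linux x86_64; rv:18.0) '
                      'Gecko/20100101 Firefox/18.0')]

    html = br.open('http://www.everytrail.com/guide/the-olomana-trail').read()
    soup = BeautifulSoup(html)

    # Hunt for the author's profile link in the fully rendered markup.
    # The '/profile/' substring is an assumption about the site's URLs.
    for a in soup.find_all('a', href=True):
        if '/profile/' in a['href']:
            print a['href']
            break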