Jun 18, 2013

For the love of Python | Scraping IndiBlogger and other tasks

The title of this post is inspired by MIT professor Walter Lewin's book 'For the love of Physics | From the end of the rainbow to the edge of time'. I admit that my knowledge of Python is not even comparable to Walter Lewin's knowledge of Physics, but I couldn't think of a better title.

Let me start by telling you about the time I tried to learn Java. Someone told me Java is the easiest programming language in the world to learn. I had a background in C++ and one whole semester full of hardcore PHP and PostgreSQL. Naturally, I started by asking Google for tutorials. However, all the results seemed to shout at me about why Python is better than Java. After nudging my seniors a bit, I finally got a project in Python/Django, and that was it for Java.

There are many reasons why I find Python better than anything else. Take for example how we swap numbers in Python.
>>> x = 2
>>> y = 3
>>> x, y = y, x  # that's it!
Another example. This is how you check if a word is a palindrome.
>>> word = 'madam'  # any string will do
>>> is_palindrome = word == word[::-1]  # True when the word reads the same backwards
One more very useful function that I use extensively is zip. Check this.
>>> x = [1, 2, 3]
>>> y = [4, 5, 6]
>>> print zip(x, y)
[(1, 4), (2, 5), (3, 6)]
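A couple more things zip can do, as a small sketch. Note that this one is written for Python 3, where zip returns a lazy iterator and needs list() around it to print like the output above. The names pairs, lookup, xs and ys are just mine for illustration.

```python
x = [1, 2, 3]
y = [4, 5, 6]

# In Python 3, zip() is lazy; wrap it in list() to see the pairs.
pairs = list(zip(x, y))
print(pairs)

# Build a dictionary from two parallel lists.
lookup = dict(zip(x, y))
print(lookup)

# "Unzip" the pairs back into two tuples using the * operator.
xs, ys = zip(*pairs)
print(xs, ys)
```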
Quoting an answer from Quora,
There's nothing wrong with falling in love with a programming language for her looks. I mean, let's face it - Python does have a rockin' body of modules, and a damn good set of utilities and interpreters on various platforms. Her whitespace-sensitive syntax is easy on the eyes, and it's a beautiful sight to wake up to in the morning after a long night of debugging. The way she sways those releases on a consistent cycle - she knows how to treat you right, you know?
It's not just these examples but the whole structure of Python that makes it so appealing to people all around the world. Take the importance of indentation in Python: it makes a programmer disciplined! Plus, it makes for great readability: you can comprehend others' code in no time!

So coming back to the reason I am writing this, it all started yesterday. At this point, I literally despise Java, where every little thing seems to be performed by functions like SomeLongPath.SomeSubPath.SomeClass.SomeOtherCrap.BlahBlahBlah()! Someone posted a graph of the "Number of posts per person vs No of persons" in an IndiBlogger forum, and apparently it was done in Java (and the program was not made open source!). Anyway, I had already commented there that I could scrape the data with Python, and that I just needed some motivation. And motivation I got!

I humbly asked a question regarding the results, and no one bothered to reply. At some point, someone mentioned that we should do it manually. That was it: the motivation I needed. I thought it was time I showed an example of how programs in Python are at least ten times smaller than their Java counterparts (ask Sandeep, he coded Internship Online first in Java/Struts and then Placement Online in Python/Django the following year!). Secondly, I believe in Open Source! Also, to prove my love for Python, I had to do it.

Naturally, I used the module BeautifulSoup, which has its own beauty, as I mentioned in an older blog post. IndiBlogger is not exactly the greatest of sites in terms of web development standards (with very poor accessibility as well, but that's a different topic for a different blog...), and getting data off its webpages using urllib was not really what you would call difficult.

My primary motive was to find the profile with the maximum number of posts. Every page had 10 links. I extracted each of them by searching for the CSS class they had, and the first link in each of the results gave me the blogger's profile.
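
A rough sketch of that extraction step follows. It is not my actual script: to keep it self-contained it uses the standard library's html.parser on a made-up HTML string instead of BeautifulSoup and urllib, it is written for Python 3, and the class name "listing", the sample markup and the ProfileLinkParser class are all assumptions for illustration.

```python
from html.parser import HTMLParser

# Made-up markup: the real IndiBlogger HTML is not shown in this post,
# and "listing" is an assumed class name.
SAMPLE_PAGE = """
<div class="listing">
  <a href="/blogger/1/">Alice</a> <a href="http://alice.example/">blog</a>
</div>
<div class="listing">
  <a href="/blogger/2/">Bob</a> <a href="http://bob.example/">blog</a>
</div>
"""

# Tags that never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ProfileLinkParser(HTMLParser):
    """Collect the first link inside every element carrying a given class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.depth = 0           # nesting depth inside a matching element
        self.found_link = False  # has the current result already given a link?
        self.profiles = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attrs = dict(attrs)
        if self.depth:
            self.depth += 1
            if tag == "a" and not self.found_link and "href" in attrs:
                self.profiles.append(attrs["href"])
                self.found_link = True
        elif self.css_class in (attrs.get("class") or "").split():
            # Entering a new result block: start looking for its first link.
            self.depth = 1
            self.found_link = False

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1


parser = ProfileLinkParser("listing")
parser.feed(SAMPLE_PAGE)
print(parser.profiles)
```

Only the first link of each result is kept, so the second link in every block (the blog itself, in this made-up markup) is ignored.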

Here is the script that did the work. It contains just 45 lines of pure Python beauty, including newlines and proper variable and class declarations! I wonder how long its Java counterpart was (500 lines?). I guess I will never know...

Coming back to the status of IndiBlogger, I have noticed many Java-related posts. The problem is that the Indian public is way too fascinated by what is already established. People may argue that Java is object-oriented, but Python isn't far behind. In fact, everything is an object in Python: variables, functions, everything! Let's take two small examples.
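
Before those two, here is a quick sketch of what "everything is an object" looks like in practice. It is written for Python 3 (the other snippets in this post are Python 2), and the function shout and the attribute author are names I made up for illustration.

```python
def shout(text):
    """Return the text in upper case."""
    return text.upper()


# Integers, strings and functions are all instances of `object`.
print(isinstance(42, object), isinstance("hi", object), isinstance(shout, object))

# Even a plain integer has methods.
print((255).bit_length())

# Functions can be stored in variables, given attributes and passed around.
alias = shout
alias.author = "me"  # hypothetical attribute, only to show that it works
results = list(map(alias, ["java", "python"]))
print(results)
```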

The Hello World Program:
Programmers exploring a new language usually start with a program that prints 'Hello World' on the screen. Take a look at the Java version first.
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello World!");
    }
}
In Python, you simply write:
print 'Hello World'

Program to reverse a number:
Let's do it in Java first.
import java.util.Scanner;

class ReverseNumber
{
   public static void main(String args[])
   {
      int n, reverse = 0;

      System.out.println("Enter the number to reverse");
      Scanner in = new Scanner(System.in);
      n = in.nextInt();

      while( n != 0 )
      {
          reverse = reverse * 10;
          reverse = reverse + n%10;
          n = n/10;
      }

      System.out.println("Reverse of entered number is "+reverse);
   }
}
How long is that? Well, I would rather write a program in Python to count the words there; it would be faster than counting them manually. Anyway, here is the Python counterpart.
# raw_input() already returns a string, so slicing reverses it directly
n = raw_input('Enter the number to reverse: ')
print "Reverse of the number is: " + n[::-1]
How much time did I save? You tell me. Which language is way cooler? Again, you decide!
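
To be fair, the two programs are not exactly equivalent: the Python one reverses the characters that were typed, while the Java one works on the integer, so a number like 120 comes out differently. Here is a sketch that follows the Java logic more closely. It is written for Python 3, and reverse_number is just a name I picked.

```python
def reverse_number(n):
    """Reverse the decimal digits of an integer, keeping its sign."""
    sign = -1 if n < 0 else 1
    n = abs(n)
    reverse = 0
    while n != 0:
        reverse = reverse * 10 + n % 10  # append the last digit of n
        n //= 10                         # drop the last digit of n
    return sign * reverse


print(reverse_number(1234))
print(reverse_number(120))  # arithmetic reversal drops the leading zero
print(str(120)[::-1])       # string reversal keeps it
```

Even with the loop spelled out, it is still a lot shorter than the Java version.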

Liked this post? Have any suggestions? Just let me know. Feel free to comment below!
