Crawling the Web...
I was previously taking a look at the Go programming language and finding it quite an enjoyable experience, but I have drifted away from it a little over the past week or two.
I have become interested in writing an Internet crawler, and the first thing that became apparent was that this is much easier to do in Python. It is almost what Python was designed for: a network-bound task.
Now, I am sure there are plenty of crawlers out there, but I wanted the experience of writing my own, even if it is a little simple.
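The fetch step at the heart of a simple crawler needs nothing beyond the standard library. This is just a minimal sketch of how such a download function might look, not the actual code from my crawler:

```python
from urllib.request import urlopen

def fetch(url):
    """Download a page and return its raw HTML bytes, or None on failure."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read()
    except OSError:
        # URLError, timeouts, and connection resets all subclass OSError.
        return None

# A failed lookup simply yields None rather than crashing the crawl loop.
page = fetch("http://nonexistent.invalid/")
```

Swallowing errors into None keeps the crawl loop simple: a dead link is skipped rather than aborting the whole run.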
This is the first project where I have decided to use a database, to store the downloaded HTML, obviously. The zero setup of SQLite won me over instantly. While setting up a different database would not be that hard, using SQLite let me get straight to the fun stuff, like pushing data into it.
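"Zero setup" really does mean zero: a single connect call creates the database file if it does not exist. The table and column names below are my own illustration, not the schema I actually used:

```python
import sqlite3

# ":memory:" keeps this sketch self-contained; a file path would persist to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (url TEXT PRIMARY KEY, html BLOB)")

# Push a downloaded page in; the BLOB column takes raw bytes directly.
conn.execute(
    "INSERT INTO pages VALUES (?, ?)",
    ("https://example.com", b"<html>...</html>"),
)
conn.commit()

# And pull it back out by URL.
row = conn.execute(
    "SELECT html FROM pages WHERE url = ?", ("https://example.com",)
).fetchone()
```

No server process, no credentials, no schema migration tooling: one import and you are storing pages.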
After a bit of experimentation I decided to compress the data before pushing it into the database. With my rather small data set of 10 websites, compression made the database about 6 times smaller. Obviously the trade-off is increased CPU usage, but hey, it is nearly winter, so using the computer to heat the house is a bit of a bonus.
I admit this is not the usual thing I like to code, but I am finding it rather good fun, and the task is pulling me to my desktop in the evenings, which many other ideas I have had recently just have not done. Although part of it may be that Python is just such good fun to code in.