Posts

Showing posts from November, 2015

Iterate then iterate on your mistakes

I love this title as it really defines the way I am working on my little search engine. I am making a series of decisions that amount to getting each iteration completed as fast as possible. This means making decisions that are at best maximally simple and at worst dumb, based on the idea that I will fix them in a later iteration. When faced with a problem the first question I ask is "Is there a dumb solution that lets me progress?". If there is a dumb solution I use it, with the expectation that I will revisit it in the future. Yes, I have already started to revisit some of these decisions, and usually it is because the performance was not great once the data got bigger. Making things fast is an enjoyable process, so if revisiting means making things fast then I am probably going to enjoy this approach to coding. One feature of this simple-as-can-be approach is that the code can get a bit messy. So I make sure each iteration includes a task to clean up the code in some way. J...

Attempt to Create a Mini Search Engine...

In my last post I talked a little bit about a small web crawler I have been coding in Python, but I didn't really mention why I was coding it other than that it would be a fun little hack. Well, I have sort of decided to explore the world of Internet search engines by writing my own. That is, one that will be tiny and simple and generally not very useful. I will probably only ever index a couple of thousand sites, but it will search nonetheless. The current status is that I have a simple web crawler that pushes websites into a sqlite database. Another script then converts the HTML to text and splits this text into words. Each word is pushed into another sqlite database along with the url it came from. The second database effectively contains, for every word encountered, all the web pages it appears on. I then query this database with a single word and it returns a list of the first 10 entries. I do the query in Python code and just output the urls to the shell. I think you could say that is pretty...
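The word-to-url index described above can be sketched roughly as follows. This is only an illustration, not the code from the post: the table name `word_index`, the helper names, the `UNIQUE(word, url)` constraint and the use of `html.parser` for the HTML-to-text step are all assumptions, and it uses a single in-memory database to stay self-contained.

```python
import re
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_words(html):
    """Convert HTML to text, then split it into lowercase words."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"[a-z0-9]+", " ".join(parser.parts).lower())


def create_index(conn):
    # Hypothetical schema: one row per (word, url) pair.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS word_index "
        "(word TEXT, url TEXT, UNIQUE(word, url))"
    )


def index_page(conn, url, html):
    """Push every distinct word on the page into the index with its url."""
    rows = [(word, url) for word in set(html_to_words(html))]
    conn.executemany(
        "INSERT OR IGNORE INTO word_index (word, url) VALUES (?, ?)", rows
    )
    conn.commit()


def search(conn, word, limit=10):
    """Return the first `limit` urls that contain the given single word."""
    cur = conn.execute(
        "SELECT url FROM word_index WHERE word = ? ORDER BY rowid LIMIT ?",
        (word.lower(), limit),
    )
    return [row[0] for row in cur]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_index(conn)
    index_page(conn, "http://a.example", "<p>A tiny search engine</p>")
    index_page(conn, "http://b.example", "<p>Search again</p>")
    for url in search(conn, "search"):
        print(url)
```

There is no ranking here: "first 10" simply means the first ten rows in insertion order, which matches the dumb-but-working spirit of the description.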

Crawling the Web...

I was previously taking a look at the Go programming language and finding it quite an enjoyable experience, but I drifted away from it a little bit over the past week or two. I have sort of got interested in writing an Internet crawler, and the first thing that became apparent was that this was much easier to do using Python. It is almost what Python was designed for, a network bound task. Now I am sure there are plenty of crawlers out there, but I wanted the experience of writing my own, even if it is a little simple. This is the first project where I have decided to use a database, to store the downloaded HTML obviously. The zero set up of sqlite won me over instantly; while setting up a different database would not be that hard, using sqlite let me instantly get to doing the fun stuff like pushing data into it. After a bit of experimentation I decided that I would compress the data before pushing it into the database. With my rather small data set of 10 websites the compression made my datab...
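Compressing a page before pushing it into sqlite might look something like the sketch below. The post does not say which compression was used, so `zlib` is an assumption here, as are the `pages` table and the function names.

```python
import sqlite3
import zlib


def create_store(conn):
    # Hypothetical schema: the compressed HTML is stored as a BLOB.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html BLOB)"
    )


def save_page(conn, url, html):
    """Compress the HTML and store it under its url."""
    blob = zlib.compress(html.encode("utf-8"))
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, html) VALUES (?, ?)", (url, blob)
    )
    conn.commit()


def load_page(conn, url):
    """Return the decompressed HTML for a url, or None if it is not stored."""
    row = conn.execute(
        "SELECT html FROM pages WHERE url = ?", (url,)
    ).fetchone()
    if row is None:
        return None
    return zlib.decompress(row[0]).decode("utf-8")


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_store(conn)
    page = "<html><body>" + "<p>hello crawler</p>" * 100 + "</body></html>"
    save_page(conn, "http://a.example", page)
    stored = conn.execute("SELECT length(html) FROM pages").fetchone()[0]
    print(len(page.encode("utf-8")), "bytes raw,", stored, "bytes stored")
```

How much space this saves depends entirely on the pages; repetitive markup compresses well, while very short pages may barely shrink at all.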