Attempt to Create a Mini Search Engine...
In my last post I talked a little bit about a small web crawler I have been coding in Python, but I didn't really mention why I was coding it, other than that it would be a fun little hack. Well, I have sort of decided to explore the world of Internet search engines by writing my own: one that will be tiny and simple and generally not very useful. I will probably only ever index a couple of thousand sites, but it will search nonetheless.
The current status is that I have a simple web crawler that pushes websites into an SQLite database. Another script then converts the HTML to text and splits this text into words. Each word is pushed into another SQLite database along with the URL it came from. The second database effectively contains, for every word encountered, all the web pages it appears on.
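The indexing step might look something like the sketch below. This is only a guess at the shape of the code, using just the standard library; the `word_index` table and the crude regex-based tag stripping are my assumptions, not the actual implementation.

```python
import re
import sqlite3

def index_page(conn, url, html):
    """Strip tags from the HTML, split the text into words, and
    record a (word, url) row for each distinct word on the page."""
    text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    conn.executemany(
        "INSERT INTO word_index (word, url) VALUES (?, ?)",
        [(word, url) for word in words],
    )
    conn.commit()

# Tiny demonstration with an in-memory database and a hard-coded page.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_index (word TEXT, url TEXT)")
index_page(conn, "http://example.com/",
           "<html><body><p>A tiny search engine</p></body></html>")
```

Storing one row per (word, URL) pair is the simplest possible inverted index; it gets slow and large quickly, but it is good enough to search a few thousand pages.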
I then query this database with a single word and it returns a list of the first 10 entries. I do the query in Python code and just output the URLs to the shell.
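A single-word lookup over that table is one SQL statement. Again a sketch rather than the real code, assuming the same hypothetical `word_index(word, url)` schema as above:

```python
import sqlite3

# Stand-in database so the example is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE word_index (word TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO word_index (word, url) VALUES (?, ?)",
    [("engine", "http://example.com/a"),
     ("engine", "http://example.com/b")],
)

# Look up a single word and print the first 10 matching URLs.
for (url,) in conn.execute(
        "SELECT url FROM word_index WHERE word = ? LIMIT 10",
        ("engine",)):
    print(url)
```

There is no ranking here at all; `LIMIT 10` just truncates the results in whatever order SQLite happens to return them.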
I think you could say that is pretty basic and most definitely useless for actually finding things.
As I mentioned before, I initially came up with the idea for this project when I was playing with the Go language, but it rapidly became apparent that the mature libraries in Python would let me progress faster.
I am trying to take the approach of not wasting time on things that don't move the project forward. Learning Go well enough to recreate libraries that already exist in Python is something that, while fun, slows down progress. It is very easy in a project like this for me to get distracted by some detailed part and try to create a really good solution for it, rather than something that is good enough for now and can be changed later if need be. In truth, I tried to create a mini search engine about 12 years ago but failed because I got lost in the detail. Actually, I think I decided it would be a good idea to design my own file format for holding the data rather than sitting down and learning about databases, or something like that; my memory is hazy.
So Python it is, and simple solutions that may get refined with each iteration into something more complicated as my understanding in each area grows.