Showing posts from December, 2015

Hitting iteration 20

As I mentioned in my previous post, I am doing small, targeted iterations and, to my surprise, I am actually managing to keep to this. Over the past few iterations I have added support for robots.txt files to my crawler, so I am slowly becoming a good web citizen. The rate at which sites are downloaded is still pretty slow. This may be down to the time.sleep(1) I have put between each download. Having more than a few thousand websites downloaded is not that useful to me at the moment. Having a large dataset of pages in my database would mean I would attach value to it and feel I should attempt to correct mistakes rather than simply blowing away the data. Really, it still feels like I am just trying to arrange the bones and am nowhere near trying to flesh it out. It has also dawned on me that I should limit the engine to just sites written in English. The rationale is that it is the only language I understand, which will make attempting to create scoring algorithms a little easier. Of course detecting