Hitting iteration 20
As I mentioned in my previous post, I am doing small, targeted iterations, and to my surprise I am actually managing to keep to this.
Over the past few iterations I have added support for robots.txt files to my crawler, so I am slowly becoming a good web citizen. The rate at which sites are downloaded is still pretty slow. This may be down to the time.sleep(1) I have put between each download. Having more than a few thousand websites downloaded is not that useful to me at the moment.
Having a large dataset of pages in my database would mean I would attach value to it and feel I should attempt to correct mistakes rather than simply blowing away the data. Really, it still feels like I am just trying to arrange the bones and am nowhere near trying to flesh it out.
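For anyone wanting to try the same thing, here is a minimal sketch of the robots.txt check plus the one second pause, using the standard library's urllib.robotparser. It is not my actual crawler code: the names build_parser and crawl, the "MyCrawler" user agent and the fetch callback are all made up for illustration, and the robots.txt text is passed in as a string so the example runs without touching the network.

```python
import time
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt):
    """Build a parser from the text of a robots.txt file (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def crawl(urls, parser, fetch, user_agent="MyCrawler", delay=1.0, sleep=time.sleep):
    """Fetch each URL that robots.txt allows, pausing after every download.

    fetch is whatever function actually downloads a page; it is a
    placeholder here, as are the user agent name and the delay value.
    """
    pages = {}
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # the site asked crawlers to stay out of this path
        pages[url] = fetch(url)
        sleep(delay)  # be polite: one download, then a pause
    return pages
```

Passing sleep in as a parameter is only there so the pause can be swapped out when testing; a real crawl would leave it as time.sleep.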
It has also dawned on me that I should limit the engine to just sites written in English. The rationale is that it is the only language I understand, which will make attempting to create scoring algorithms a little easier. Of course, detecting whether a page is written in English is fun.
In the end I just arrived at grabbing a great big list of English words off the internet and then looking at each page. After scrubbing out the CSS and JavaScript sections, I just look to see if at least 15% of the remaining text is in the list of words; if so, it passes.
Sure, it is a hacky little heuristic, meaning we will get both false positives and false negatives. It does seem to work reasonably well and lets me progress. As always, I can return to it if needed.
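Roughly, the heuristic above could be sketched like this. Treat it as an illustration rather than the code I am running: the function names, the regular expressions and the tiny word set in the usage note are all assumptions, and a real run would load the big word list from a file instead.

```python
import re

# Drop whole <script>...</script> and <style>...</style> sections first,
# then any remaining tags, leaving roughly the visible text.
SCRIPT_OR_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
TAG = re.compile(r"<[^>]+>")
WORD = re.compile(r"[a-z]+")


def visible_words(html):
    """Return the lowercase words left after scrubbing scripts, styles and tags."""
    text = SCRIPT_OR_STYLE.sub(" ", html)
    text = TAG.sub(" ", text)
    return WORD.findall(text.lower())


def looks_english(html, english_words, threshold=0.15):
    """Pass the page if at least `threshold` of its words are in the word list."""
    words = visible_words(html)
    if not words:
        return False  # nothing to judge, so do not let the page through
    hits = sum(1 for word in words if word in english_words)
    return hits / len(words) >= threshold
```

With a toy word set such as {"the", "cat", "sat", "on", "mat"}, a page containing "The cat sat on the mat" passes, while a page with no words from the set does not. Keeping english_words as a set makes each lookup cheap, which matters once the list has a few hundred thousand entries.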
Python seems to be shaping up reasonably well at the moment; at least I can iterate on things quite fast, and I have flipped over to using the community edition of PyCharm. This is a really good editor and really should be the first IDE you try for Python. If you don't like it then fair enough, move on to something else.
My experience with it is gradually growing. I like the optional typing in 3.5; well, I like that PyCharm supports it. For my C++-warped mind this will be a great feature as my code grows. Having said that, my code base is only growing at a slow rate: as I learn more about Python, it seems to let me use less code, in turn slowing down the rate of growth in my code base.