web crawling and doc

So that’s the end of the lesson on vector spaces and related ideas. Now let’s discuss web crawling and preparing documents.

You could write a web crawler in an introductory computer science class; probably lots of classes have set it as an exercise, and lots of crawlers have crawled around based on it. You start with a list of URLs you have already visited, which could be empty, and at least one URL to start with. You fetch a URL off the list and check whether it has been done. If it hasn’t, you go off to the web, collect the page, and bring it back. You hand the document to a document analyzer, which does all sorts of nifty things with the terms in the document; among other things, it extracts all the URLs in the document and adds them to the list of new URLs to visit. You proceed ad infinitum until you’ve crawled the whole web. As we know, there are isolated parts of the web, so you need to seed the crawl properly; you won’t reach every URL this way. There are also all sorts of subtleties: encrypted pages; pages demanding that you be polite and not load the server down by fetching lots of information from it; pages that are dynamic; and the so-called deep web, which is effectively a collection of databases with a front end, and things like that. All of that makes crawling more complicated, but that’s engineering; I don’t think there are many principles to discuss there.
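The crawl loop described above can be sketched in a few lines. This is a minimal illustration, not a production crawler: the `PAGES` dictionary, `extract_urls`, and `crawl` are all hypothetical names, and the dictionary stands in for actually fetching pages over HTTP (with none of the politeness, robots.txt, or deep-web subtleties just mentioned).

```python
import re
from collections import deque

# Stand-in for the web: URL -> page text. A real crawler would fetch
# over HTTP and handle politeness, robots.txt, dynamic pages, etc.
PAGES = {
    "http://a.example/": '<a href="http://b.example/">b</a> some words',
    "http://b.example/": '<a href="http://c.example/">c</a> more words',
    "http://c.example/": "a leaf page with no links",
}

def extract_urls(html):
    """Part of the document-analyzer step: pull every URL out of the page."""
    return re.findall(r'href="([^"]+)"', html)

def crawl(seed_urls):
    visited = set()                  # URLs we have done already
    frontier = deque(seed_urls)      # the list of URLs to visit
    documents = {}                   # URL -> fetched document
    while frontier:
        url = frontier.popleft()
        if url in visited:           # fetch a URL off the list, check if done
            continue
        visited.add(url)
        page = PAGES.get(url)        # go off to the web and collect it
        if page is None:
            continue
        documents[url] = page        # hand the document to the analyzer
        for link in extract_urls(page):
            if link not in visited:  # add new URLs to the list to visit
                frontier.append(link)
    return documents
```

On this three-page toy web, `crawl(["http://a.example/"])` reaches all three pages by following links; a seed that couldn’t reach some isolated page would simply never visit it.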
So, we’ve fetched a document from the web; what do we do with it? We need to get at terms, the canonical words of the language. The document might be HTML in various flavors, or it could be Word, PDF, or another format, and it has to be converted to a text document; you run it through a PDF-to-text converter, for example, or a Word-to-text converter. Then you do so-called tokenization: you view the document as a string of words, and you remove the formatting, the punctuation, and the capitals. Even where capitals are logically important, they’re not so important for query matching, because people don’t use capitals very often in queries. You convert common forms, using USA rather than U.S.A.; you drop accents, as in naive; and then you have canonical tokens, which is what your document has become. Then you may or may not remove stop words like is,
a, to, the, and for. One reason to keep stop words is that some phrases consist almost entirely of them: to be or not to be is full of stop words. If you’ve removed stop words, you’ll never be able to respond to the query to be or not to be, whereas it is actually likely to be a perfectly reasonable query to respond to if you happen to know that to occurs at two positions such that to is followed by be, which is followed by or, which is followed by not, and so on. So to be or not to be is a relatively easy phrase to find, and it doesn’t require too many changes to your search engine to find it.
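That position-based phrase matching can be sketched as follows. This is an illustration under stated assumptions, not the lecture’s exact method: `tokenize`, `build_positional_index`, and `phrase_match` are hypothetical names, and the tokenizer is the crude lowercase-and-strip-punctuation kind described earlier.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Crude canonical tokens: lowercase, keep only word characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_positional_index(doc):
    """Map each token to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos, tok in enumerate(tokenize(doc)):
        index[tok].append(pos)
    return index

def phrase_match(index, phrase):
    """True if the phrase's tokens occur at consecutive positions."""
    words = tokenize(phrase)
    if not words or words[0] not in index:
        return False
    for start in index[words[0]]:
        # to at position start, be at start+1, or at start+2, and so on
        if all(start + i in index.get(w, []) for i, w in enumerate(words)):
            return True
    return False

idx = build_positional_index("To be, or not to be, that is the question.")
```

Because the stop words were kept in the index, `phrase_match(idx, "to be or not to be")` succeeds, while a shuffled phrase like "be to" does not.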
Then there is stemming and normalization. We convert singing and sung to sing, and better and best to good, and we cope with nontrivial synonyms such as automobile and car, with are not and aren’t, and with different date formats. You end up with the terms for your lexicon. Those terms are the things in your bag of words, which is really a bag of terms, and they define each dimension of the vector space model.
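The bag-of-terms view can be sketched like this. It is a toy example: the `NORMALIZE` table is a tiny hand-written stand-in for real stemming and synonym handling (a real system would use something like the Porter stemmer), and the function names are assumptions for illustration.

```python
import re
from collections import Counter

# Hypothetical hand-written table standing in for real stemming
# and synonym normalization.
NORMALIZE = {
    "singing": "sing", "sung": "sing",
    "better": "good", "best": "good",
    "automobiles": "car", "automobile": "car",
}

def terms(text):
    """Tokenize, then map each token to its canonical term."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [NORMALIZE.get(t, t) for t in tokens]

def bag_of_terms(text):
    """The document as term counts: one count per lexicon term,
    i.e. one coordinate per dimension of the vector space model."""
    return Counter(terms(text))

bag = bag_of_terms("The best automobiles; better singing than we had sung.")
```

Here singing and sung collapse into a single sing dimension, and better and best into good, so the document’s vector has one coordinate per term rather than per surface word.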
want to do for say Google scholar we want to actually identify
specially the citations and specially the abstract and specially
the authors and specially the title. So that comes from
a segmentation algorithm which chops the document up into parts. And site seer of course
probably pioneered this, at least this is one of the early
analyses of scholarly works. So that’s important if you want to do the many,
many documents and many, many information retrieval systems
like those looking for figures in a document and things like that,
they need to do this segmentation. So that’s the end of our
short discussion of actually how we crawl the web and
convert it into canonical form. Now we have a little
discussion of indices, which will be the next lesson.
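Before moving on, the segmentation step just mentioned might be sketched as follows. This is a deliberately naive, heading-based splitter for illustration only; systems like CiteSeer use far more robust methods, and the `segment` function and `HEADINGS` list are assumptions, not anyone’s actual implementation.

```python
# Naive segmentation sketch: split a plain-text scholarly document into
# parts by looking for a few conventional section headings.
HEADINGS = ("abstract", "introduction", "references")

def segment(text):
    sections = {"title": ""}
    current = "title"             # the first non-blank line is the title
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.lower() in HEADINGS:
            current = stripped.lower()   # start a new section
            sections[current] = ""
        elif stripped:
            sections[current] = (sections[current] + " " + stripped).strip()
    return sections

doc = """A Study of Crawling
Abstract
We crawl things.
Introduction
Crawling is fun.
References
[1] Someone, 1998.
"""
seg = segment(doc)
```

On this toy input, the title, abstract, and references come out as separate segments, which is exactly the information a scholarly search engine wants to index specially.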
