Crawl Web – Intro to Computer Science



Now we’re ready to write the code for crawling the web. Our goal is to define a procedure, which we’ll call crawl_web, that takes as input the url of a seed page and outputs a list of all the urls that can be reached by following links starting from that seed page.

If you’re really ambitious, you should try to do this yourself without any more help. That’s going to be a pretty tough challenge, so we’re also going to step through one way to do this as a series of quizzes. But feel free at any point, when you feel confident that you can do it yourself, to try to finish on your own rather than following the step-by-step quizzes that I’ll show you.

We’ll start by defining our procedure crawl_web, and we’re going to introduce two variables: tocrawl, which keeps track of the pages that we still need to crawl, and crawled, which is a list of the pages that we have already crawled. For the first step, your goal is to figure out how to initialize these variables. What should the initial values of tocrawl and crawled be?
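If you would rather work it out yourself, skip this part. Otherwise, here is a rough sketch of one way crawl_web could look in Python, using the tocrawl and crawled variables described above. It is only an illustration under a few assumptions: FAKE_WEB and get_all_links are hypothetical stand-ins for actually fetching pages and extracting their links, and the example.com urls are made up.

```python
# Hypothetical in-memory "web" (an assumption for illustration):
# each url maps to the list of links found on that page.
FAKE_WEB = {
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/c"],
    "http://example.com/c": ["http://example.com/a"],
    "http://example.com/d": ["http://example.com/a"],  # not reachable from /a
}


def get_all_links(url):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_WEB.get(url, [])


def crawl_web(seed):
    tocrawl = [seed]  # pages we still need to crawl; starts with only the seed
    crawled = []      # pages we have already crawled; starts empty
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            # Queue every link on this page that is not already waiting.
            for link in get_all_links(page):
                if link not in tocrawl:
                    tocrawl.append(link)
            crawled.append(page)
    return crawled


print(crawl_web("http://example.com/a"))
```

Starting tocrawl with just the seed and crawled as an empty list means the loop has exactly one page to begin with, and the check against crawled keeps it from going around the /a to /c to /a cycle forever.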
