• Improving Crawler – Intro to Computer Science

    If we want more confidence, we could also look at the documentation. The challenge is that if we don’t know what we’re looking for, the documentation won’t tell us. We can see that we have union, which returns a union of the sets. What we don’t see yet is update. Now, having found update and guessed what it does, we can see this description, though it’s not clear enough to know for sure that it does what we want. We could do some more searches. Let’s see what we get when we search for Python and update. Well, we got some discussion here; we’re looking at the…
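
    To pin down the difference under discussion, here is a minimal sketch (the page names are made up): union builds and returns a new set, while update modifies the set in place and returns None.

        crawled = {'page1.html', 'page2.html'}
        found = {'page2.html', 'page3.html'}

        # union returns a brand-new set; crawled is left unchanged
        both = crawled.union(found)
        print(both)      # {'page1.html', 'page2.html', 'page3.html'} (order may vary)

        # update adds the elements in place and returns None
        crawled.update(found)
        print(crawled)   # now the same three pages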

  • Grace Hopper

    One of the pioneers in computing was Admiral Grace Hopper. She was famous for walking around with nanosticks, pieces of wire cut to the length light travels in a nanosecond: about 30 cm. Grace Hopper wrote one of the first languages, and the language COBOL, which she is seen holding here next to UNIVAC, was for a long time the most widely used computer language. She was one of the first people to think about writing languages this way, [“Nobody believed that I had a running compiler and nobody would touch it. They told me computers could only do arithmetic.” Grace Hopper] and you have this quote when…

  • Crawling Process – Intro to Computer Science

    So I’m going to describe that process, and I’m going to write it out in a fairly precise way, but not as actual Python code, because it will end up being your job to finish the Python code to do this. But I want to describe it precisely enough that we can ask some questions about it. So we’re going to start with some seed page, and tocrawl will be just that page, the list containing just the seed page, and crawled will be empty. And we’re going to keep going as long as there are more pages to crawl. And for each step we’re going to pick one…
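
    A rough sketch of that description in Python, deliberately left unfinished since completing it is the exercise:

        def crawl_web(seed):
            tocrawl = [seed]    # the list containing just the seed page
            crawled = []        # no pages crawled yet
            while tocrawl:      # keep going while there are more pages to crawl
                page = tocrawl.pop()    # pick one page to work on
                if page not in crawled:
                    # finding this page's links and adding them to tocrawl
                    # is the part left for you to finish
                    crawled.append(page)
            return crawled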

  • User Agent Header – Web Development

    Okay, so I want to add a little more about user agents. It’s one of the most important headers in an HTTP request, and when we were doing Reddit, user agents were really important to us. We had a site that was online and really popular, and users were often writing scripts to pull content out of Reddit. Mostly they were doing good things: building tools to do data collection so they could write some cool blog post about how Reddit works, that sort of thing. Sometimes they were doing bad things…
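
    A script that identifies itself makes that distinction possible. Here is a minimal sketch in Python; the URL and the agent string are placeholders, but naming the bot, its version, and a contact is the usual courtesy:

        import urllib.request

        # A made-up but descriptive User-Agent: name the bot and give the
        # site's operators a way to reach you.
        req = urllib.request.Request(
            'https://example.com/',
            headers={'User-Agent': 'my-data-bot/0.1 (contact: you@example.com)'},
        )
        with urllib.request.urlopen(req) as response:
            print(response.status)   # 200 if the request succeeded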

  • The Internet: How Search Works

    Hi, my name is John. I lead the search and machine learning teams at Google. I think it’s amazingly inspiring that people all over the world turn to search engines to ask trivial questions and incredibly important questions, so it’s a huge responsibility to give them the best answers that we can. Hi, my name is Akshaya, and I work on the Bing search team. There are many times when we’ll start looking into artificial intelligence and machine learning, but we have to address how users are going to use this, because at the end of the day we want to make an impact on society. Let’s ask a simple question…

  • Crawl Web Loop – Intro to Computer Science

    The next step is to write the loop that’s going to do the crawling. And we said the process we want to follow is to keep going as long as there are more pages to crawl. We can do that with a while loop, and we can use tocrawl like this in our test condition. If a list is empty, that’s interpreted as False; if the list is not empty, that’s interpreted as True. So this means the same thing as testing whether the length of the list is not zero, but it’s a cleaner way to write it: just while tocrawl. Inside the loop, well, we want…
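
    That truthiness rule in miniature (the URLs are made up):

        tocrawl = ['http://example.com/a', 'http://example.com/b']
        while tocrawl:               # same as: while len(tocrawl) != 0
            page = tocrawl.pop()     # take one page off the list
            print('crawling', page)
        # the empty list is interpreted as False, so the loop has now stopped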

  • Finishing the Web Crawler Solution – Intro to Computer Science

    So the answer is we should use the add_page_to_index procedure we just defined, and we should pass in the index; the page, that’s the URL that identifies the location; and the content. And that’s all we need. So we’re done with our web crawler. From a seed, we can find a set of pages by following all the links we find on the pages we reach, starting from that seed. For each page, we’re going to add the content that we find on that page to an index, and we’re going to return that index. And we’ve already written…
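
    For reference, a sketch of an indexing procedure consistent with that description, where the index is a list of [keyword, [url, ...]] entries (the URL and content below are made up):

        def add_to_index(index, keyword, url):
            # record that keyword appears on the page at url
            for entry in index:
                if entry[0] == keyword:
                    entry[1].append(url)
                    return
            index.append([keyword, [url]])

        def add_page_to_index(index, url, content):
            # index every word that appears in the page's content
            for word in content.split():
                add_to_index(index, word, url)

        index = []
        add_page_to_index(index, 'http://example.com', 'crawl the web')
        print(index)   # [['crawl', ['http://example.com']],
                       #  ['the', ['http://example.com']],
                       #  ['web', ['http://example.com']]]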

  • Finishing the Web Crawler – Intro to Computer Science

    So let’s remember the code we had at the end of Unit 2 for crawling the web. We used two variables. We initialized “tocrawl” to a list containing just the seed, and we’re going to use “tocrawl” to keep track of the pages to crawl. We initialized “crawled” to the empty list, and we’re keeping track of the pages we’ve found using “crawled.” Then we had a loop that would continue as long as there were pages left to crawl. We’d pop the last page off the “tocrawl” list. If it’s not already crawled, then we’ll union into “tocrawl” all the links that we can find on…
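
    Putting that recollection together as runnable code, with a toy link table standing in for real pages (in the course, get_all_links extracts links from fetched HTML):

        def union(a, b):
            # append to a each element of b that isn't already in a
            for e in b:
                if e not in a:
                    a.append(e)

        # hypothetical stand-in for real pages and their outgoing links
        links = {'seed': ['a', 'b'], 'a': ['c'], 'b': [], 'c': ['seed']}

        def get_all_links(page):
            return links.get(page, [])

        def crawl_web(seed):
            tocrawl = [seed]             # pages left to crawl
            crawled = []                 # pages already crawled
            while tocrawl:
                page = tocrawl.pop()     # pop the last page off the list
                if page not in crawled:
                    union(tocrawl, get_all_links(page))
                    crawled.append(page)
            return crawled

        print(crawl_web('seed'))   # ['seed', 'b', 'a', 'c']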