Web Search: Crash Course AI #17

Hi, I’m Jabril and welcome to Crash Course
AI! There used to be a time when a group of friends at dinner could ask a question like
“is a hot dog a sandwich?” and it would turn into a basic shouting match with lots
of gesturing and hypothetical examples. But now, we have access to a LOT of human
knowledge in the palm of our hands… so our friends can look up memes and dictionary definitions
and pictures of sandwiches to prove that none of them have a connected bun like hot dogs
(disappointed). Search engines are a huge part of modern life.
They help us access information, find directions to places, shop, and participate in sandwich
arguments. But how does Google find answers to questions?
How are Siri and Alexa so smart but also easily stumped? How did IBM’s Watson beat the best
Jeopardy players in the world? Well, search engines are just AI systems that
are getting better and better at helping us find what we’re looking for.
[INTRO]
When we talk about search engines, we typically
think about the AI systems online, like Google, Bing, DuckDuckGo and Ask Jeeves. But the basic ideas behind non-AI search engines
have existed for centuries. Essentially, search engines gather data, create organization systems
to sort that data, and find results to a question. For example, when you needed an answer to
a question and couldn’t search online, you could go to the library! Libraries gather
data in the form of books and newspapers that are stacked neatly on the shelves. Librarians have organization systems to help
you find what you’re looking for. Knowing that magazines are on shelves by the water
fountain, while kids’ books are on the second floor, is a kind of organization. Plus, fiction
books are sorted by the author’s last name, while nonfiction has the Dewey Decimal System,
and so on. Once you (or the librarian) have the resources
you need, you’ll be able to find results to your question! Now, rather than looking through books, web
search engines look through all the data on the World Wide Web, aka “the Web”. And
instead of asking a human librarian where to find information, we ask an AI like John Green Bot.
Jabril: Oh, John Green Bot? [JGB dialup beeps] Alright, John Green Bot, you’re all set. We’re going to need that later.
And just so we’re clear, we’re using “Web”
throughout this video even though it might sound a little old-fashioned. That’s because
the Internet and the Web are not the same thing. The Internet is a collection of computers
that send messages to each other. Video services like Netflix that play on your TV, for example,
use the Internet, not the Web. The Web, on the other hand, is part of the
Internet and uses the Internet’s connections to send documents and other content in a format
that can be displayed by a browser like Chrome or Safari. As with most AI systems, the first step is
to gather lots of data. To gather data on the Web, we can use a computer program called
a Web crawler, which systematically finds and downloads Web pages. This is a HUGE task
and happens before the search engine AI can take any questions. It starts on some Web page that we pick, called
a seed, and downloads that page and finds all its links. Then, the crawler downloads
each of the linked Web pages and finds their links, and so on… until we’ve crawled
the whole Web. After we have collected all the data, the
AI’s next step is to organize it by building an index, which is a kind of lookup system.
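Before we get to the index, here is a rough sketch of the crawling step we just described. It is only an illustration: instead of downloading real pages, it walks a small made-up "Web" stored in a dictionary, and the names `TOY_WEB` and `crawl` are hypothetical. A real crawler would fetch each page over the Internet and pull the links out of it.

```python
from collections import deque

# Hypothetical stand-in for the Web: each page name maps to the links found on it.
# A real crawler would download each page and extract its links instead.
TOY_WEB = {
    "genghis-khan": ["marco-polo", "mongols"],
    "marco-polo": ["genghis-khan", "marco-polo-travels"],
    "mongols": ["genghis-khan"],
    "marco-polo-travels": ["water-polo"],
    "water-polo": [],
    "unlinked-page": ["genghis-khan"],  # no page links here
}


def crawl(seed, web):
    """Start at the seed, then follow links breadth-first until none are new."""
    visited = []           # pages in the order they were "downloaded"
    seen = {seed}          # pages already queued, so none is fetched twice
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        visited.append(page)
        for link in web.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


crawl_order = crawl("genghis-khan", TOY_WEB)
print(crawl_order)
```

In this toy setup the crawler reaches five pages but never sees "unlinked-page", because it can only find pages that something it already visited links to.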
The kind that’s used for organizing Web pages is called an inverted index, which is
like the index in the back of a textbook. For each word, it lists all of the Web pages
that contain that word. Usually, the Web pages are represented by I.D. numbers so we don’t
have a long, messy list of URLs. Let’s say page 0 is the seed – which happens
to be a page about Genghis Khan. It has a lot of words on it like “the, mongol, Khan,
Genghis, who, and is”. In this inverted index, page 1 is about Marco Polo, but it
mentions the word “Genghis” along with words like “the, Marco, Polo, who, are,
and is.” Page 2 is about the Mongols, page 3 is a different webpage about Marco Polo,
and page 4 is about Water Polo. So, let’s say we type “Who is Genghis
Khan?” into a search engine. Our AI can use this inverted index to find
results, which in this case, are links to Web pages. The AI will look at the words “who”,
“is”, “Genghis”, and “Khan” and use the inverted index to find relevant pages. Our AI might find that Web pages zero, one,
two and five have at least one of the words from the question “who is Genghis Khan?”
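To make that lookup concrete, here is a minimal sketch of an inverted index and a search over it. The page subjects follow our example, but the exact wording of each page is made up, and the helper names (`tokenize`, `build_inverted_index`, `search`) are just for illustration. Real search engines do a lot more than this.

```python
import re
from collections import defaultdict

# Toy pages keyed by I.D. number. The subjects follow the example above;
# the text of each page is invented for illustration.
PAGES = {
    0: "Who is Genghis Khan? Genghis is the Mongol Khan.",
    1: "Marco Polo is the explorer who wrote about Genghis. Who are the Polo family?",
    2: "The Mongol empire is vast.",
    3: "Marco Polo travels east.",
    4: "Water polo rules.",
}


def tokenize(text):
    """Lowercase the text and split it into plain words."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(pages):
    """Map each word to the set of page I.D.s that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(query, index):
    """Return the I.D.s of pages containing at least one word from the query."""
    matches = set()
    for word in tokenize(query):
        matches |= index.get(word, set())
    return sorted(matches)


index = build_inverted_index(PAGES)
results = search("Who is Genghis Khan?", index)
print(sorted(index["genghis"]))
print(results)
```

With this made-up wording, the word "genghis" points to pages 0 and 1, and the full question matches pages 0, 1 and 2. Nothing here decides which of those pages is the best answer. It only finds pages that share words with the question.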
When Siri says “I found this for you,” the AI is just returning a list of Web pages
that contain the same terms as the question. Except… most search engines include one
more step. There are millions of pages online that contain the same terms. So it’s important
for search engines to rank Web pages, so that the top result is more likely to be relevant than the tenth result or the hundredth. Of course, Google and Bing don’t hire “supervisors”
to grade each possible question and answer to help their AI systems learn from training
data. That would take forever, and they wouldn’t be able to keep up with all the new content
that gets created every day. Really, regular users like us do this training
for free all the time. Every time we use a search engine, our behavior tells the AI
whether or not the results answered our question. For example, if we type in “who is Genghis
Khan” into a search engine, and click on a Web page about Star Trek II: The Wrath of
Khan, we might be disappointed to find Genghis Khan isn’t ANYWHERE in that movie. So we’ll
bounce back to the search results, and try again until we find a page that answers our
question. A bounce indicates a bad result. But if we
click on a Wikipedia article about Genghis Khan and stay for a while reading, that’s
a click through, which probably means that we found what we were looking for… so that
indicates a good result. Human behavior like bounces and click throughs
gives AI systems the training data they need to learn how to rank search results and better answer our questions. Data from the Web and data from how we use the Web help make
better and better search engines. Now, sometimes we ask our smart devices questions
and we want actual answers… not links to Web pages. When I say “OK Google, what’s
the weather like in Indianapolis?” I don’t want to scroll through results. For this kind of problem, instead of using
an inverted index, AIs rely on knowledge bases, which you might remember from our video about
Symbolic AI. A knowledge base encodes information about the universe as relationships between objects like “chocolate donut” and “John Green Bot wears polo”. One of the main problems with knowledge bases
is that it’s really hard to write down all of the facts in the universe, especially common
sense things that humans take for granted but computers need to be told. Enter AI researcher Tom Mitchell and his team
of scientists from Carnegie Mellon University. In 2010, they created a huge knowledge base
called the Never Ending Language Learner or NELL, which was able to extract hundreds of
thousands of facts from random Web pages. The way it works is really clever, so let’s
go to the Thought Bubble to see how. NELL starts with some facts provided by a
human, for example, the genre of music that Mozart plays is classical. Which was represented
like this: Mozart. musicGenre. Classical. Similarly, Jimi Hendrix. plays. Guitar. And Darth Vader. hasChild. Luke Skywalker. Then, NELL gets to work and reads through
each Web page one-by-one for words mentioned in those facts. Maybe it finds the text “Mozart plays the piano.” NELL doesn’t know much about these symbols,
but this text matches the same pattern as one of the facts provided by a human, specifically,
the “plays” relationship. So NELL learns a new object: Piano. And a new fact: Mozart.
plays. Piano. By searching over the entire Web, NELL can
learn lots of facts based on just the three original ones that humans gave it! Some facts might appear hundreds or thousands
of times online, like Lenny Kravitz. hasChild. Zoë Kravitz. But NELL might also find facts
that are mentioned SOMEWHERE online and extract them as potentially true. Like, for example,
Darth Vader. plays. Kloo Horn. We just don’t know! Just like how we look for multiple sources
when writing a paper, NELL uses repetition and multiple sources to build confidence that
the facts it’s finding are actually true. To consider other relationships, NELL uses
the highly confident facts it learned and searches through the Web again. Only this
time, NELL is looking for new relationships. Maybe it finds the text “Darth Vader cuts
off Luke Skywalker’s hand,” and NELL learns a new (very specific) relationship: cutsOffHand. Over and over again, NELL will use known relationships
to find new objects, and known objects to find new relationships — creating a huge
knowledge base. Thanks, Thought Bubble! AI systems can use
huge knowledge bases, like this one extracted by NELL, to answer our questions directly. Instead of using the words from our questions
to search through an inverted index, an AI like Siri can reformulate our questions into
incomplete facts and then look for matches in a knowledge base.
Jabril: Hey, John Green Bot…
John Green Bot: Yes, Jabril?
Jabril: “Who wrote The Bluest Eye?”
His AI could then reformulate that question
into an incomplete fact, replacing “who” with a question mark. If John Green Bot extracted
that information earlier, he can find matches in his knowledge base and return the most
confident result.
John Green Bot: Toni Morrison wrote The Bluest Eye!
Jabril: Hey. Thanks, John Green Bot!
Different words are categorized differently,
so an AI like John Green Bot can tell the difference between questions asking “who”
and “when” and “where.” But that gets more complicated, so we’re not going to
dive into the details here. If you want to learn more, you can read about part-of-speech
tagging systems. Using all these strategies, search engines
have become really good at answering common questions. But questions like “How many
trees are in Ohio?” or “How many hot dogs are eaten in the South Sandwich Islands annually?”
still stump most AI systems, because not enough people ask them and AI hasn’t learned how
to answer them well yet. It’s also important to watch out for search
engine answers to questions like “Who invented the time machine?” because AI systems have
a tough time with nuance and incomplete data. Sorry Doc Brown. And a big, sort of hidden, problem is that
search engine AI systems are influenced by any biases in data online. For example, if
I ask Google for images of “nurses,” it will mostly show pictures of female nurses.
So next time, we’ll talk about how an algorithm can be biased, where bias comes from, and
what we can do to address bias in AI. I’ll see ya then. Crash Course AI is produced in association
with PBS Digital Studios! If you want to help keep all Crash Course free for everybody,
forever, you can join our community on Patreon. And if you want to learn more about the history
of the World Wide Web, check out this episode of Crash Course Computer Science.

45 Comments

  • IceMetalPunk

    Submarine sandwiches often have a connected bread, and they're sandwiches. I also don't think hot dogs are sandwiches, but I don't quite know why… Maybe it's the way you hold it? Like, even with a connected-bread sub, you hold it so the bread is horizontal with a top and bottom, but with a hot dog, you hold it with the connection at the bottom, open at the top, and bun on the sides. Maybe that's the distinction?

    Also… I wonder what percentage of CrashCourse's demographic are old enough to recognize and remember that dial-up sound? 😂

  • -

    0:21 – Sandwiches don't have connected buns? So Subway's submarine sandwiches are really hotdogs? 🤨
    2:19 – Typical John Green Bot. Hank Green Bot upgraded to broadband a long time ago.

  • Orome

    I'm sorry but NO, search engines haven't been getting better and better at helping us find what we are looking for.

    Back when google used and actually obeyed boolean search operators, it used to be an extremely powerful research tool. Now it's basically useless unless you are looking for advertisements or really basic information. For any kind of esoteric or very specific information that doesn't fit under google/scholar, google has become completely useless. It's good at showing you what it wants you to be searching for or what it thinks you are searching for, but not what you are actually searching for.

  • Rikkun Brouwers

    The scriptwriters have done a great job explaining AI in this series and don't get this negatively, because it is a lot better than I could ever do, but it feels so extremely oversimplified that it makes me angry. :'D

  • Abraham Morrison

    Some questions i ask take out parts of my question and give me what i think i want. It is not even close sometimes.

  • Mr. Wallet

    Oh boy, bias… here's comes the episode I've been dreading. I usually really like Crash Course but sometimes you folks go too far on the "woke"mentality.

    According to the United States Bureau of Labor, about 90% of nurses are female. If this holds up globally, then showing 90% images of female nurses would not be biased – it would be an accurate reflection of reality. 85% or 95% female nurses would be biased. A more even gender distribution may be desirable, but that's not the point; if I say "show me pictures of nurses," I either want to know what a stereotypical nurse is (in which case bias is desirable), or I want to know what real nurses actually look like, in which case if I see a large collection of images I want them to have the same distribution of features (e.g. gender) as reality.

    Trying to push an agenda by deliberately biasing the presentation of information, rather than doing the tough work of going out and effecting change to the underlying facts, is putting the cart before the horse. You can even do more harm than good by suggesting that the problem doesn't even exist: "Of course there's plenty of male nurses; look what happens when I search for pictures of them!"

  • DragoniteSpam

    I'm bouncing through my YouTube subscriptions, and the one I watched right before this one was Computerphile's video on language parsing. Good one to watch alongside this one, if you're into Computerphile.

  • A GOD FOR THE ATHEIST

    AI can an artificial intelligence experience fear, anger, love, erotica, empathy ?!?!?!? if not then it is not really very intelligent when it comes to the hierarchy of human needs of love belonging purpose esteem self actualization and self ascendancy is it? Humans learn mainly from ONE TEACHER as Jesus said, Personal Experience, The Golden Rule , the ONE AGREED ON MORAL AUTHORITY for over 2800 YEARS … if you consider Jesus saying before Abraham I am , but at least since 2200 years according to Confucius buddha and Zarathustra

  • Eric Alvaro

    "Bias" is not necessarily a bad thing and oftentimes it just reflects the reality. To give another example: most elementary school teachers are female, that's a fact and it shouldn't be changed (or should I say lied to) in search engines to reflect some kind of ""utopian society"".
