Web Scraping Using Python | Python Web Scraping | Web Scraping with BeautifulSoup
Good morning, good afternoon, and good evening, ladies and gentlemen. I'm Atul from Intellipaat, and I welcome you all to this session on web scraping using Python. Web scraping is a term used to describe the use of a program or an algorithm to extract and process large amounts of data from the web. Whether you are a data scientist, an engineer, or anybody who wants to analyze large data sets, the ability to scrape data from the web is a useful skill to have. Say you find data on the web and there is no direct way to download it; web scraping using Python is a skill you can use to extract that data into a useful form, so that it can be imported for further analysis.

That was just a quick overview of web scraping, so before we dive deep, let me go through the agenda for today's session. We'll start with a basic introduction to web scraping, where you'll learn what exactly web scraping is, how it differs from web crawling, and what the various applications of web scraping are. During our session we will also discuss the legality of web scraping, to know whether web scraping is a legal thing to do or not. Once done with that, we will move to the demo part, where I will show you how you can scrape data from the web using the Python libraries requests and Beautiful Soup. For this tutorial session it would be good if you have a basic knowledge of HTML tags; if not, don't worry, I will cover some of the basic HTML tags during this session.
I hope the agenda is clear to you guys, so without further ado let's jump into web scraping basics.

So what is web scraping? Web scraping, also known as screen scraping, web data extraction, or web harvesting, is basically a technique used to extract large amounts of data from websites. This data is mostly unstructured in nature; once extracted, you can transform it into structured data and save it to a local file on your computer, or to a database, in a tabular or spreadsheet format. The data displayed by most websites can only be viewed using a web browser, and most often they do not offer the functionality to save a copy of this data for personal use. The only option you then have is to manually copy and paste the data, which is a very tedious job and can take many hours, or sometimes days, to complete. Web scraping lets you automate this process: instead of you manually copying the data from the website, a web scraping algorithm performs the same task in a fraction of the time. So that was what web scraping is; let's move on. Next: are web
scraping and web crawling the same thing? Well, there is a very subtle difference between web scraping and web crawling, and the two are interrelated. The terms may look similar, and many people use them interchangeably, but there are real differences between them. In simple terms, web crawling is the process of repetitively finding and fetching hyperlinks, starting from a list of seed URLs. Broadly speaking, web crawling is the process of locating information on the World Wide Web: indexing all the words in a document, adding them to a database, then following all the hyperlinks and indexing and adding that information to the database as well. Major search engines like Yahoo, Google, and Bing all run such a program, which is also known as a web spider or a bot. Web scraping, on the other hand, is the process of automatically requesting a web document and collecting information from it. Strictly speaking, to do web scraping you have to do some degree of web crawling to move around the websites. Web crawling is generally what the major search engines do to find any kind of information; web scraping is essentially targeted at specific websites, for specific data, for example stock market data, business leads, or supplier product listings. But an important thing to know is that a web scraper can do things a web crawler wouldn't do; for example, scrapers often do not obey robots.txt. I'll tell you in detail what that is later in this session; for now, just understand that robots.txt is a file which tells you what you can crawl and what you cannot, and a web scraper does not have to obey that robots.txt file.
Don't worry, we'll read about it in detail later. A scraper can also submit forms with data, execute JavaScript, transform the data into a required form and format, and even save the extracted data into a database; these are things a web scraper can do that a web crawler cannot. I hope you are now clear on the difference.

Moving on, let's learn about some of the use cases of web scraping, or the business scenarios in which you could use it. Number one: tracking competitive pricing. Web scraping helps in extracting the product or service prices of your competitors, to stay ahead of cutthroat competition.
Next is sentiment analysis, which is analyzing the reaction or response of a customer or consumer. Using web scraping, this can be easily traced by extracting ratings, reviews, and feedback from forums as well as e-commerce websites. Next we have market research: when you are planning a product or service launch, you can use web scraping to study the market in advance, which can help you with the product campaign launch. Next is industry scrutiny: much of the time, you and your business will need to know who is present as a market player, and here again web scraping can surely help in a big way. Next is content aggregation: if you want to gather information from multiple documents or web servers for further processing, you can scrape the data, process the unstructured data into organized, structured data, and use it further as real-time information. Next is monitoring brand value: you can make future decisions easily and accurately by analyzing your brand value on scraped, filtered data related to your brand, along with some positive and negative keywords. Next we have lead generation: you can use the scraped data to identify whom to target in a company for your outbound sales campaign.
You can even use it to locate possible leads in your target market, or to identify the right contacts within each one. Those were some of the business applications of web scraping; if you do some research, you will find any number of others.

Now that you have seen various business use cases of web scraping, and you know that web scraping means fetching data from someone else's website, a question might arise in your mind: is web scraping, or web crawling, a legal thing to do? Well, the answer is: maybe. Crawling means fetching content from web pages in an automated manner, rather than manually opening each page in your browser. The calls made by the browser agent to the target server that hosts the web page are quite similar to the way a bot hits a page to grab its content. So why is crawling considered taboo by some who have only just learned to use it? Mostly because it is quite often used against a website's policies and breaks the ground rules of crawling. So here are some thumb rules that you must follow if you want
your bot to behave humanely. The first one is robots.txt. Be it a commercial or a non-commercial website, one of the easiest ways to find out which websites allow scraping and which don't is simply to check their robots.txt file. You can do that by appending /robots.txt to the end of the URL of the website you wish to scrape. For example, if I want to check the robots.txt file for amazon.com, I just open www.amazon.com/robots.txt, and as you can see, it gives me a list of allowed and disallowed links that can be crawled. You can consider this a consent form that you should abide by if you want to crawl that site: it tells you which URLs you can crawl and which you cannot, and it can be really specific. Even Googlebot cannot crawl a blocked page, unless the site is worried about that page's SEO. Next is public
content: crawl only public content, keeping copyright policies in mind. If you are crawling a site only to reproduce the same content on a new site of yours, then good luck with that: it is not a legal thing to do, and someone can oppose you or sue you for it. Next is Terms of Use: check the website's Terms of Use and make sure all is well between them and you. You should definitely read the terms and conditions of a website you want to scrape, and check whether the data is under a Creative Commons licence that lets you use it commercially. Basically, if the terms and conditions state that scraping is not allowed, then you shouldn't scrape that website. Next is authentication-based websites: some sites need authentication before you can access their content, and they will mostly discourage crawling, because they only want real human beings to log in. Next is crawl delay: robots.txt also lists the delay to be maintained between consecutive crawls, to ensure you are not hitting their servers too hard. If you load them with requests, chances are that your IPs will be blocked.

Let me tell you about an ongoing case, LinkedIn vs. hiQ. In August 2017, LinkedIn blocked hiQ from accessing its data available on public LinkedIn profiles. What these hiQ guys were doing was extracting information from LinkedIn: all the public information, all the public profiles. So LinkedIn blocked the company from accessing its data; hiQ took it to court, and the case is still ongoing. The case hinges on the question of who owns a piece of data, and the circumstances under which the information can be viewed as residing in the public domain, accessible by all and sundry. The appeals court judges may rule that LinkedIn owns exclusive rights to the data, which would not have been compiled without the entrepreneurial talents of the LinkedIn founders. Conversely, the judges may conclude that since LinkedIn users set their profiles to public, placing them in full view of search engines and general web surfers, they are giving companies like hiQ free rein to view and use the data as they see fit. So it's a knife-edge decision, with strong arguments on both sides, and either ruling could have profound implications for how people like you and me interact with data in our daily lives. That was the story of how hiQ breached LinkedIn's policy by web scraping their data.
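The robots.txt rule above is easy to automate: Python's standard library ships urllib.robotparser, which reads the Allow/Disallow lines for you. Here's a minimal sketch with hypothetical rules; a real site's file will differ.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only
robots_txt = """User-agent: *
Disallow: /checkout/
Allow: /products/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# can_fetch(user_agent, url) answers: may this bot crawl this URL?
print(rp.can_fetch("*", "https://www.example.com/products/tea"))   # True
print(rp.can_fetch("*", "https://www.example.com/checkout/cart"))  # False
```

In practice you would point RobotFileParser at the live file with set_url() and read() instead of parsing an inline string.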
As you know, for this tutorial I'll be scraping using Python, so what are the basic Python libraries that I'll be using for web scraping? The first one is requests, which is used for fetching URLs: it defines functions and classes to help with basic URL actions like digest authentication, redirection, cookies, and so on. Next is Beautiful Soup, an incredible tool for pulling information out of a web page: you can use it to extract tables, lists, and paragraphs, and you can also apply filters to extract information from web pages. Beautiful Soup does not fetch the web page for us, which is why I am using the combination of both the requests and the Beautiful Soup libraries. Python has several other options for HTML scraping in addition to Beautiful Soup, such as mechanize, scrapemark, and Scrapy; don't worry, we'll discuss them in our next tutorial. For this tutorial we'll be focusing on just Beautiful Soup and requests, so let's move on.

While performing web scraping you will have to deal with HTML tags, so it's very important that you have a good understanding of them. If you already know the basics of HTML, great; for those who don't, I'll try to cover the basics for you in just two minutes.
Here's a basic HTML document. The first line, <!DOCTYPE html>, indicates that this particular HTML file is written in HTML5; basically, all HTML5 documents must start with it. Next is the html tag: all HTML documents are contained between an opening and a closing html tag. Next we have the body tag, which is the visible part of the HTML document: whatever you see on a web page is enclosed between the opening and closing body tags. Next are the heading tags: HTML headings are defined with the tags h1 through h6, so there are six different levels of heading, h1, h2, h3, h4, h5, and h6; h1 is the main heading, h2 is a subheading, and so on. After headings come paragraphs: the p tag is used to define a paragraph. Then come the closing body tag and the closing html tag.

Next we have the table tag. Don't get confused by the styling: the main tag here is table, and style is just one of the attributes of the table, in this case saying that the width of the table should be 100%. We represent a row with the tr tag, and rows are divided into data cells using the td tag. For example, I mention a name here, James: James is one data cell of a row. We add another data cell to the same row, Smith, and another cell to the same row, 45, so first name, last name, and age all belong to the same row. Now, what if I want to add elements to my next row? Again I define another row with tr, again write the first name, last name, and age, and then close the row with the closing tr tag. Those are some of the basics; let's see what we have next.

Next we have the a tag, which is used for hyperlinking text to some website. For example, what we have here is a href equals www.intellipaat.com: that is the opening tag, and the text that is to be hyperlinked to this website is "Visit to learn more", after which we close the a tag. This entire thing is known as an element: the a href part is the opening tag, and inside it we have the attribute name href and the attribute value www.intellipaat.com. Then we have the enclosed content, "Visit to learn more", which will be hyperlinked to www.intellipaat.com, and finally we have the closing a tag.

That was all about the basic HTML tags I will be using in this session. If you want to learn more about HTML tags in detail, you will find multiple online sources freely available; you can learn HTML in detail from those websites. For this session I am assuming all of you now have a brief idea of the HTML basics, so let's start scraping our website. We'll be scraping a web page using Beautiful Soup: I'll be scraping data from a Wikipedia page, and our final goal is to extract the list of countries from it. Let's scrape that Wikipedia page step by step.
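Before the demo, the basics above can be tied together in one small sketch: a made-up HTML snippet mirroring the heading, paragraph, table, and link tags just described, parsed with the Beautiful Soup library we'll use shortly (assuming bs4 is installed).

```python
from bs4 import BeautifulSoup

# A minimal HTML document mirroring the tags discussed above
html_doc = """<!DOCTYPE html>
<html>
<body>
<h1>My Heading</h1>
<p>My paragraph.</p>
<table style="width:100%">
  <tr><td>James</td><td>Smith</td><td>45</td></tr>
  <tr><td>Jill</td><td>Smith</td><td>50</td></tr>
</table>
<a href="https://www.intellipaat.com">Visit to learn more</a>
</body>
</html>"""

soup = BeautifulSoup(html_doc, "html.parser")

print(soup.h1.string)     # My Heading
print(soup.a["href"])     # https://www.intellipaat.com
first_row = soup.table.tr # the first <tr> of the table
print([td.string for td in first_row.find_all("td")])  # ['James', 'Smith', '45']
```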
We'll be writing our Python code in a Jupyter notebook. For those of you who don't have Jupyter Notebook installed on your machine, I'd suggest you go to www.anaconda.com/downloads and install Anaconda Navigator from there; Anaconda Navigator comes with Jupyter Notebook pre-installed, and you can just hit the Launch button to start it. So this is my Anaconda Navigator: I hit the Launch button, and the Jupyter notebook starts in my default browser. Let's start.

The very first thing I want to do is import the library used to query a website, so I'll import requests. Now, what do I want to fetch? I need to mention a URL, so I specify one; it will be a Wikipedia link, the page listing the different Asian countries by area. What I want to do is extract one particular column from it: the country column of that table, all the different countries. So let's copy the page address and assign it here: that's my wiki link, and it's a string. I then fetch that URL and take the response as text, so I write link = requests.get(wiki_link).text. I execute it; no error, so let's move ahead. If you want to check whether the link was fetched correctly or not, just print link. It's executed, and it fetched the whole content of that particular web page, so it's working fine. My next task is to parse the data returned from the website, and for that I'll be using
Beautiful Soup, so I'll import it: from bs4 import BeautifulSoup. Then I define a variable, soup = BeautifulSoup(link). What I did here is pass the HTML in the link variable and store it in Beautiful Soup format. Now, if you print soup, let's see the output: it's not very readable yet, but it has parsed the HTML from the link variable. So next, let's prettify it. Using the prettify() function I get a nested structure of the HTML page; this will help me see the structure of the HTML tags, and it will also help you learn about the different available tags and how you can play with them to extract information. For that, all you have to do is write print(soup.prettify()) and execute it, and there you get a nested structure of the HTML page.
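Put together, the fetch-and-parse steps so far look roughly like the sketch below. The exact Wikipedia URL is my assumption of the page used in the session, and running it needs the requests and bs4 packages plus network access.

```python
import requests
from bs4 import BeautifulSoup

# Assumed URL of the Wikipedia page used in this session
wiki_link = "https://en.wikipedia.org/wiki/List_of_Asian_countries_by_area"

link = requests.get(wiki_link).text        # fetch the page's raw HTML as text
soup = BeautifulSoup(link, "html.parser")  # parse it into a BeautifulSoup object

print(soup.prettify()[:500])               # indented, nested view of the HTML
```

Passing "html.parser" explicitly (the session omits it) avoids a parser warning from newer versions of Beautiful Soup.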
Basically, the general syntax for accessing a tag is just soup.tag; that's it. For example, if I want to fetch the title, I write soup.title: it gives me the title, "List of Asian countries by area - Wikipedia". Now, what if I just want the string part of it, without the tags around it, just the text "List of Asian countries by area - Wikipedia"? For that I take soup.title and ask for just the string part: soup.title.string. Execute it, and yes, you get the string without the tags.
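For example, with a small inline stand-in for the fetched page (the title string is the one shown in the session), soup.title and soup.title.string behave like this:

```python
from bs4 import BeautifulSoup

# Stand-in for the fetched Wikipedia HTML (illustrative snippet)
html = ("<html><head><title>List of Asian countries by area - Wikipedia"
        "</title></head><body></body></html>")
soup = BeautifulSoup(html, "html.parser")

print(soup.title)         # <title>List of Asian countries by area - Wikipedia</title>
print(soup.title.string)  # List of Asian countries by area - Wikipedia
```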
Now let's move on and check what the various links on this particular web page are. For that I want to look at the a tags, so I write soup.a, and all you can see is that we get only one output: soup.a returns just the first a tag. To extract all the links that have a tags, we use the find_all function: soup.find_all("a"). Execute it, and it shows all the a elements, including titles, links, and other information. Now, to show only the links, we need to iterate over each tag and then return the link using the href attribute with get(). So for that, say I just want the links: let's define a variable, all_links = soup.find_all("a"), and then loop, for link in all_links: print(link.get("href")). Execute it (after fixing a small syntax slip), and what you get here is all the href links.
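A minimal offline sketch of the same link-extraction calls, using a made-up two-link snippet in place of the fetched page:

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for the fetched page: two anchors
html = ('<html><body><a href="/wiki/Russia">Russia</a> '
        '<a href="/wiki/China">China</a></body></html>')
soup = BeautifulSoup(html, "html.parser")

print(soup.a)  # only the FIRST <a> tag

all_links = soup.find_all("a")   # every <a> tag on the page
for link in all_links:
    print(link.get("href"))      # just the href attribute of each
```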
Now coming to our main task, which was to extract the countries from that particular table. First I want to find the right table: since we are seeking one specific table to extract the country information from, we should identify the right one first. So let's write the command to extract the information within all the table tags: all_tables = soup.find_all("table"). This fetches all the table tags, and if you print all_tables you'll see all the different tables of the page. Now we have to identify which one is the right table, so we'll use the class attribute of the table and use it to filter for the right one. How do you find that class? Open Chrome, go to that particular web page; you can check the class name by right-clicking on the required table of the web page and selecting "Inspect element". For example, I need the class name of this table, so I right-click on it and select "Inspect element"; there is my table's class. The tag is table, and the class name is "wikitable sortable". I'll just copy that class name and define another variable: right_table = soup.find("table", class_="wikitable sortable"), which searches the table tags for the one whose class name is "wikitable sortable". So that's my right table, from which I want to extract the information. Now print right_table and let's check it's the correct one: rank, country, area in square kilometres, notes; so this one is the correct table.
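The table-filtering step can be sketched offline like this; the decoy table and its contents are made up for illustration:

```python
from bs4 import BeautifulSoup

# Illustrative stand-in: a page with a decoy table plus the target one
html = """
<table class="infobox"><tr><td>decoy</td></tr></table>
<table class="wikitable sortable">
  <tr><th>Rank</th><th>Country</th></tr>
  <tr><td>1</td><td><a href="/wiki/Russia" title="Russia">Russia</a></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

all_tables = soup.find_all("table")  # every table on the page
print(len(all_tables))               # 2

# Filter down to the one whose class attribute is "wikitable sortable"
right_table = soup.find("table", class_="wikitable sortable")
print(right_table.find("th").string)  # Rank
```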
Now, as you can see here on the screen, every country in the table has a link on it, which means each entry has an a tag in it. I have already fetched the right table, and all I want to do is get the a tags from it. So I define table_links = right_table.find_all("a"): what I want to find is the a tags. Execute it, and let's check table_links.
And yes, we have the complete list of the countries here. Next, you can collect them into a list: for example, I define an empty list, country, and then loop, for links in table_links (that was our variable), country.append(links.get("title")), because I just want the title part of each link. Execute it, then print the entire list with print(country). The first attempt gave the wrong output because of a typo in the variable name; after fixing it and executing again, we get the entire list of countries. Now all I want to do is represent them in a DataFrame.
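Here is the extraction loop on a made-up three-row stand-in for the real table; the country names are just examples:

```python
from bs4 import BeautifulSoup

# Illustrative stand-in for the "wikitable sortable" table found above
html = """
<table class="wikitable sortable">
  <tr><td><a href="/wiki/Russia" title="Russia">Russia</a></td></tr>
  <tr><td><a href="/wiki/China" title="China">China</a></td></tr>
  <tr><td><a href="/wiki/India" title="India">India</a></td></tr>
</table>
"""
right_table = BeautifulSoup(html, "html.parser").find("table", class_="wikitable sortable")

table_links = right_table.find_all("a")   # one link per country cell

country = []                              # collect the title of each link
for links in table_links:
    country.append(links.get("title"))

print(country)  # ['Russia', 'China', 'India']
```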
Let's see. For converting it into a DataFrame, I need to import pandas, so: import pandas as pd. Next, I'll define my DataFrame as df = pd.DataFrame(), and what goes inside it? A column; it's a country list, right, so df = pd.DataFrame({"Country": country}). That's it; now print your DataFrame.
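And the final DataFrame step, sketched with a short sample list standing in for the full scraped result (pandas must be installed):

```python
import pandas as pd

# Stand-in for the scraped list of country names built above
country = ["Russia", "China", "India"]

df = pd.DataFrame({"Country": country})  # one column named "Country"
print(df)  # prints the three names under a "Country" header, with an integer index
```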
It's executed, and there is your DataFrame of countries. And that's how you scrape data from a Wikipedia page. So thank you, guys; this was all for this session. In case you have any query or doubt, please feel free to add it in the comments section below. Thank you.

