Scrapy vs. Selenium vs. BeautifulSoup vs. Requests vs. LXML – Tutorial

Hey there! Today we are going to learn about Scrapy: what Scrapy is overall, how it compares to other Python-based scraping tools, why you should use it, when it makes sense to use some other tool instead, and the pros and cons of Scrapy. So let's begin!

Scrapy is a web crawling framework written in Python. One of its main advantages is that it's built on top of Twisted, an asynchronous networking framework, which means that Scrapy is a) really efficient and b) asynchronous. To illustrate why that's a great feature, for those of you who don't know what an asynchronous scraping framework means, here's an enlightening example. Imagine you have to call a hundred different people by phone. Normally you'd sit down, dial the first number, and then patiently wait for a response on the other end before moving on. In an asynchronous world, you can dial the first 20 or 50 numbers right away and then process each call only once the person on the other end picks up. Hopefully it makes sense now.

Scrapy supports Python 2.7 and Python 3.3+, so depending on your version of Python, you are pretty much good to go. An important thing to note: Python 2.6 support was dropped starting with Scrapy 0.20, and Python 3 support was added in Scrapy 1.1. Scrapy is in some ways similar to Django, so those of you who have used Django before will definitely benefit.
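To give you a rough feel for what a Scrapy spider looks like, here is a minimal sketch. It targets quotes.toscrape.com, a public practice site for scrapers; the CSS selectors match that site's markup, so treat them as illustrative:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    # quotes.toscrape.com is a public sandbox meant for scraping practice.
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        # Scrapy downloads pages asynchronously and calls parse()
        # as each response arrives.
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").extract_first(),
                "author": quote.css("small.author::text").extract_first(),
            }
        # Queue the next page; the crawl keeps going without blocking.
        next_page = response.css("li.next a::attr(href)").extract_first()
        if next_page is not None:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
```

Save it as quotes_spider.py and run it with scrapy runspider quotes_spider.py -o quotes.json to get the scraped items as JSON.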
Now let's talk more about the other Python-based scraping tools. Bear in mind that these are older, specialized libraries with very focused functionality; they don't claim to be, and are not, a complete web scraping solution like Scrapy is. The first two, urllib2 and Requests, are modules for opening and reading web pages, i.e. HTTP modules. The other two, Beautiful Soup and lxml, handle the fun part of a scraping job: extracting data points from the pages loaded with urllib2 or Requests.

Let's start with urllib2. Its biggest advantage is that it's included in the Python standard library, so it's batteries-included: as long as you have Python installed, you are good to go. It used to be the more popular of the two, but it has since been displaced by a tool called, believe it or not, Requests. The Requests documentation is superb; I think Requests may even be the most popular Python module, period. If you haven't already, give the docs a read. Requests, unfortunately, doesn't come pre-installed with Python, so you'll have to install it yourself. I personally use it for quick and dirty scraping jobs, and both urllib2 and Requests work with Python 2 and Python 3.
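Here's what one of those quick and dirty fetches looks like with Requests; the URL is just a placeholder:

```python
import requests

# Fetch a page (the URL is a placeholder for whatever site you're scraping).
response = requests.get("https://example.com")
response.raise_for_status()  # fail loudly on HTTP errors

print(response.status_code)  # e.g. 200
print(response.text[:200])   # first 200 characters of the HTML
```

Install it first with pip install requests.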
The next tool is called Beautiful Soup, and once again, it's used for extracting data points from pages that have already been loaded. Beautiful Soup is quite robust and handles malformed markup nicely. In other words, if you have a page that doesn't validate as proper HTML, but you know for a fact that it is an HTML page, give Beautiful Soup a try for scraping data from it. In fact, the name comes from the expression 'tag soup', which describes badly invalid markup. Beautiful Soup creates a parse tree that can be used to extract data from HTML. The official docs are comprehensive and easy to read, with lots of examples, so just like Requests, it's really beginner-friendly. And just like the other scraping tools, Beautiful Soup works with Python 2 and Python 3.
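Here's a small sketch of how Beautiful Soup copes with tag soup; the broken HTML snippet is made up for illustration:

```python
from bs4 import BeautifulSoup

# Deliberately malformed HTML: unclosed <li> and <b> tags.
tag_soup = "<ul><li>First item<li>Second <b>bold item</ul>"

soup = BeautifulSoup(tag_soup, "html.parser")

# Beautiful Soup repairs the parse tree, so we can query it normally.
for item in soup.find_all("li"):
    print(item.get_text())
# First item
# Second bold item
```

Install it with pip install beautifulsoup4.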
And now let's talk about lxml. It's similar to Beautiful Soup in that, once again, it's used for extracting data, but it's the most feature-rich Python library for processing both XML and HTML, and it's also really fast and memory efficient. A fun fact: Scrapy selectors are built on top of lxml, and Beautiful Soup, for example, also supports it as a parser. Just as with Requests, I personally use lxml paired with Requests for, as previously mentioned, quick and dirty jobs. Bear in mind that the official documentation is not that beginner-friendly, to be honest, so if you haven't used a similar tool before, examples from blogs or other sites will probably make more sense than the official docs.
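A minimal sketch of that Requests-plus-lxml pairing; the URL and XPath expression are placeholders:

```python
import requests
from lxml import html

# Load the page with Requests (placeholder URL).
response = requests.get("https://example.com")

# Parse the HTML with lxml and query it with XPath.
tree = html.fromstring(response.content)
titles = tree.xpath("//h1/text()")  # placeholder XPath expression
print(titles)
```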
The last scraping tool is called Selenium. Selenium is, first of all, a tool for writing automated tests for web applications. It's used for web scraping mainly because it's beginner-friendly, and because if a site is heavy on JavaScript (which more and more sites are), Selenium is a good option: it makes it easy to extract the data even if you are a beginner, or when the JavaScript interactions are very complex, with a bunch of GET and POST requests. I use it sometimes on its own, or paired with Scrapy. Most of the time when I use it with Scrapy, I let Selenium iterate over JavaScript-heavy pages and then use Scrapy Selectors to grab the HTML that Selenium produces. Currently, the supported Python versions for Selenium are 2.7 and 3.5+. Overall, Selenium support is really extensive, and it provides bindings for languages such as Java, C#, Ruby, Python of course, and JavaScript. The official Selenium docs are, once again, great and easy to grasp, and you can probably give them a read even as a complete beginner; in two hours you'll have pretty much figured it all out.
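Here's a sketch of that Selenium-plus-Scrapy-Selectors combination; the URL and CSS selector are made up, and it assumes you have Chrome plus a matching chromedriver on your PATH:

```python
from selenium import webdriver
from scrapy.selector import Selector

# Assumes Chrome and a matching chromedriver are installed.
driver = webdriver.Chrome()
driver.get("https://example.com")  # placeholder URL for a JavaScript-heavy page

# Hand the rendered HTML to a Scrapy Selector for extraction.
sel = Selector(text=driver.page_source)
headlines = sel.css("h2.headline::text").extract()  # placeholder selector
print(headlines)

driver.quit()
```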
Bear in mind that, from my testing, scraping a thousand pages from Wikipedia was 20 times faster, believe it or not, in Scrapy than in Selenium. On top of that, Scrapy consumed a lot less memory, and its CPU usage was a lot lower than Selenium's.

So, back to Scrapy's main pros: first and foremost, it's asynchronous. If you are building something robust and want it to be as efficient as possible, with lots of flexibility and a bunch of functions, then you should definitely use Scrapy. One example of a case where some other tool, like the previously mentioned ones, makes more sense: say you have a project where you just need to load the home page of, let's say, your favorite restaurant and check whether your favorite dish is on the menu. For that type of case you should not use Scrapy because, to be honest, it would be overkill.
One of the drawbacks of Scrapy is that, since it's a full-fledged framework, it's not that beginner-friendly, and the learning curve is a little steeper than with some other tools. Also, installing Scrapy can be a tricky process, especially on Windows. But bear in mind that there are a lot of resources online for this; there are, I'm not even kidding, probably a thousand blog posts about installing Scrapy on your specific operating system.

And that's it for this video. Thanks for watching, and I'll see you in the very next video, where I will discuss installing Scrapy. Bye!

Comments

  • Chawn Neal

    Great explanation. I made sure to like and subscribe. I'm also in the process of working with tools like UiPath and OctoParse, and I have a great deal of programming experience: web dev, my own projects parsing data, database programming, etc.

    So I was looking to make my own tool to help me web scrape, because no existing tool seems to handle the job I need. Which framework or library do you think would be best for this?

    If you know a name for this problem so I can research it more easily, that would also be great. I describe it as web navigation with different depths to extract data, or multiple cases:

    going through a list of links that are in some container, such as a table or div, then going through multiple pages to get to a detail page and extract the data.

    Some may be at different depths, e.g.:
    case 1: categories>boxes>redbox, categories>chairs>red chairs
    case 2: categories>boxes>cardboard_boxes>red_cardboard_boxes, categories>chairs>bluechair

    In my example, in the first case both paths have the same depth to the detail page, whereas in the second case they have different depths.

    I can see this could be a long project, which I am prepared for, but I'm looking for some direction: either a tool that can handle these cases, or a framework to start coding this project with. I've mainly looked at:
    Python: Scrapy, and (not in the same class) BeautifulSoup
    Ruby: Nokogiri, but I don't have as much experience with Ruby (though I have the freedom to learn)
    Node.js: Django, but I haven't looked at this one

  • introspection

    Hello, did you use proxies with the Selenium remote driver for Selenium Grid with Chrome nodes? Is it true that Chrome nodes don't support proxies? I'm stuck here. Thank you.

  • TRBO

    I would like to add a calculator to my website that pulls vehicle registration info from the Florida DMV. Instead of my customers going to the DMV website, they could use mine. Can you help me?

  • XenoContact

    Nice explanation. However, I disagree that Selenium is for beginners. I have years of experience using it and I can assure you it is the most advanced and reliable of the bunch.

  • Anshuman Kumar

    BeautifulSoup is garbage. I wrote a script to download all the class work PDFs from my college site; it worked, but in a lot of cases the tags were not interpreted properly.
