How to write a simple crawler

Hello everyone. We have already learned to import data from the web in Excel; however, that approach applies almost only to data in a list, like financial data. In practice, data on a website is generally scattered across the page, and we often have to gather data from multiple pages. In these situations, manually importing data from the web into Excel can be very time-consuming. So I want to show you how to make a simple crawler using Python.

This crawler mainly focuses on tasks that collect information from a list of web pages with a similar structure, so it could be applied to review websites, social websites like Twitter, Q&A forums, and so on. The basic idea is to use requests to repeatedly access pages and to parse the HTML files to get the data.

The first question is how to get the data of interest from complex HTML files. We take advantage of two third-party packages. The first one is the Requests package: it sends requests to the server and returns the requested web pages. The second one is Bs4: it parses the HTML files and picks out the information we are interested in.

We choose the famous movie review website Douban as our target, which is known as the IMDb of China. For each film, the website provides features such as the name of the film, the ratings, the director, the actors, the category it belongs to, the release date, and so on. Take The Shawshank Redemption as an example.

In order to collect the information from the web page, first of all we need to gain access to it. The Requests package completes this task in a very simple way: we just provide it with a URL, and it returns a response object, here denoted as r. We can access the information of the page via its attributes, such as status_code, which tells us the status of the page (200 for a successful access and 404 for file not found), and text, which gives us the HTML file, as we can see here. In the program, the HTML file is passed to the BeautifulSoup function of the Bs4 package.
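The request-and-parse step just described might look roughly like the sketch below. It is only an illustration, not the program shown in the video: the helper names fetch_page and make_soup, the User-Agent string, and the timeout are assumptions, and the parsing is demonstrated on a tiny hand-written page so that it runs without touching the network.

```python
import requests
from bs4 import BeautifulSoup


def fetch_page(url, timeout=10):
    """Request a page and return (status_code, html_text).

    The header and timeout values are assumptions for this sketch.
    """
    headers = {"User-Agent": "Mozilla/5.0 (simple-crawler tutorial)"}
    r = requests.get(url, headers=headers, timeout=timeout)
    # r.status_code is 200 on success, 404 if the page was not found.
    return r.status_code, r.text


def make_soup(html):
    """Parse HTML text into a BeautifulSoup tree using the built-in parser."""
    return BeautifulSoup(html, "html.parser")


# Offline demonstration on a hand-written page (not real Douban markup).
sample_html = "<html><head><title>Demo film page</title></head><body></body></html>"
soup = make_soup(sample_html)
page_title = soup.title.string
```

On a real page you would call fetch_page with the address of the film page, check that the returned status is 200, and only then hand the text to make_soup.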
This is where we begin exploring the HTML file. From the web page it is clear that the features are not in a single list, so we want to know where they are. In general, an HTML file consists of tags and contents, with tags specifying the format and the function of a module. We can find exactly which content is connected to which tag using Firefox's developer tools. The name of the film is located inside a span tag, as we can see here. Since the span tag is used frequently in HTML, there may be many span tags in the file. The attribute “property” helps us find the specific tag correctly, as its value is unique.

The Bs4 package follows the same rule. When using the find function, the first parameter states what kind of tag we want to find, namely span, and the second parameter identifies the unique tag of interest; we then extract the content of this tag. Other features of the film, like the ratings and the number of votes, can be found using the same rule with small modifications to the names of the tag or attributes. Here we run the program, and we can see that we have successfully got the result we want: the name of the film, the ratings, and the number of votes are all listed here. With that, we have completed our task on a single page.

Next, to get information about more films, we need to decide which page to visit next. In this task our interest lies in finding films similar to the first film, so every time we visit a page we record the 8 related films on that page and then repeatedly visit each stored page. One problem that arose during the actual implementation of the crawler is duplication: one film may be related to many films, which means it may appear twice or more in the records.
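The single-page extraction and the page-to-page traversal described above could be sketched as follows. Everything specific here is an assumption made for illustration: the property values (v:itemreviewed, v:average, v:votes), the "related" container, the /film/N addresses, and the PAGES dictionary that stands in for real requests. The seen set is one simple way to keep a film from entering the records twice.

```python
from collections import deque
from bs4 import BeautifulSoup

# Hand-written stand-ins for film pages; the markup is assumed, not copied from Douban.
PAGES = {
    "/film/1": '<span property="v:itemreviewed">Film One</span>'
               '<strong property="v:average">9.0</strong>'
               '<span property="v:votes">300</span>'
               '<div class="related"><a href="/film/2">2</a><a href="/film/3">3</a></div>',
    "/film/2": '<span property="v:itemreviewed">Film Two</span>'
               '<strong property="v:average">8.0</strong>'
               '<span property="v:votes">200</span>'
               '<div class="related"><a href="/film/1">1</a><a href="/film/3">3</a></div>',
    "/film/3": '<span property="v:itemreviewed">Film Three</span>'
               '<strong property="v:average">7.0</strong>'
               '<span property="v:votes">100</span>'
               '<div class="related"><a href="/film/1">1</a></div>',
}


def parse_film(html):
    """Return ((name, rating, votes), related_links) for one film page."""
    soup = BeautifulSoup(html, "html.parser")
    # First argument: the tag name. Second: the attribute that singles out one tag.
    name = soup.find("span", attrs={"property": "v:itemreviewed"}).get_text(strip=True)
    rating = soup.find("strong", attrs={"property": "v:average"}).get_text(strip=True)
    votes = soup.find("span", attrs={"property": "v:votes"}).get_text(strip=True)
    box = soup.find("div", attrs={"class": "related"})
    related = [a["href"] for a in box.find_all("a")] if box else []
    return (name, rating, votes), related[:8]  # keep at most 8 related films


def crawl(start, get_html, max_films=50):
    """Visit related films breadth-first, skipping pages already recorded."""
    seen = {start}
    queue = deque([start])
    films = []
    while queue and len(films) < max_films:
        url = queue.popleft()
        record, related = parse_film(get_html(url))
        films.append(record)
        for link in related:
            if link not in seen:  # duplicate check before adding to the records
                seen.add(link)
                queue.append(link)
    return films


films = crawl("/film/1", PAGES.__getitem__)
```

To crawl a live site, get_html would be replaced by a function that fetches the page with Requests and returns its text; the max_films cap is there so the loop always stops.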
Thus we need to test whether a film has already been visited before adding it to the records.

To store the data in an Excel file, we used the xlwt package to create a workbook, which represents an Excel file, and added a blank sheet to it. We treat every film as a tuple consisting of multiple features, and then use the “write” method of the sheet object, cell by cell, to add one row to the table. The results are shown here: first the ID, then the name of the film, the ratings, the number of ratings, and so on. With a simple crawler, we have successfully collected this useful information from the website. Thank you very much.
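For anyone following along, the Excel step described above might be sketched like this. The function name save_films, the sheet name, the column headers, the output file name, and the sample tuples are all placeholders invented for the example; only the workbook, sheet, and write calls come from xlwt itself.

```python
import os
import tempfile

import xlwt

# Made-up sample records in the form (name, rating, votes); not real data.
sample_films = [
    ("Film One", "9.0", "300"),
    ("Film Two", "8.0", "200"),
]


def save_films(films, path):
    """Write one header row plus one row per film tuple to an .xls file."""
    workbook = xlwt.Workbook()           # represents the Excel file
    sheet = workbook.add_sheet("films")  # a blank sheet inside it
    header = ("ID", "Name", "Rating", "Votes")
    for col, title in enumerate(header):
        sheet.write(0, col, title)
    for row, film in enumerate(films, start=1):
        sheet.write(row, 0, row)         # running ID in the first column
        for col, value in enumerate(film, start=1):
            sheet.write(row, col, value)  # write(row, column, value) fills one cell
    workbook.save(path)
    return len(films)


out_path = os.path.join(tempfile.gettempdir(), "films_demo.xls")
rows_written = save_films(sample_films, out_path)
```

Note that xlwt produces the older .xls format, so the output file should carry that extension.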
