Build a corpus from the web

This screencast shows how to build a text corpus in a few minutes from texts available on the web.

Log in to Sketch Engine and switch to the new interface. The corpus building tool can be reached from the select corpus screen or from the corpus selector at the top. Give your corpus a name, select the language and, optionally, provide a description. This list shows which functionality will be available with your corpus.

You can add data by having Sketch Engine find relevant texts on the internet for you or by uploading your own texts. This screencast focuses on the first option. Use the web search option to make Sketch Engine find suitable texts using an internet search engine. The URLs option lets you download texts from a list of web addresses which you need to provide. Links on those pages leading to other pages will not be followed. The website option lets you download the content of a whole website. There is a limit of two thousand pages from one website. For technical reasons, the website is downloaded at a speed of about 6 pages per minute, which makes this option time-consuming. The good news is that the process will be completed in the background even if the user logs out.

This screencast will focus on the web search option. And this is how it works: all you need to do is provide at least 3 seed words or phrases that define the topic of your future corpus. Sketch Engine combines the seed words into random groups of three and submits them to Bing. Bing searches the internet and sends the addresses of matching web pages back to Sketch Engine. Sketch Engine downloads the pages, removes advertising, navigation menus and other linguistically irrelevant content, and processes the texts into a corpus.

Type your seed words and hit Enter after each phrase or word. This number indicates how many groups of three can be made with your seed words. There is no correct answer to how many seed words should be used. More seed words produce more searches and a larger corpus, but the topic coverage might be too wide. Fewer seed words produce a smaller but more focussed corpus. This setting will affect the size of the corpus. Bing will normally search the whole internet, but you can limit it to only certain websites.

Sketch Engine now communicates with Bing to find and download texts for you. Now all pages have been downloaded and the main text has been extracted from them. Click again to process the texts into a corpus. The corpus is ready now. Check the size of the new corpus and other details. Extract terminology to quickly check that the texts in the corpus are indeed related to the intended topic. Use the left menu or the dashboard to start searching and analysing your corpus. You can manage your corpus here to make it bigger, grant other users access to it, download it, delete individual documents or the whole corpus, and take some other actions.

Did you find what you’ve just seen interesting? Try Sketch Engine. Register for a free trial on sketchengine.eu. Thank you for watching. To see more videos like this one, don’t forget to click the Subscribe button below.
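
As a footnote to the seed-word step in the screencast: the number shown next to the seed words, and the way seed words are grouped for searching, can be illustrated with a short sketch. This is not Sketch Engine's actual code. It assumes that the displayed number is the count of unordered groups of three (n choose 3) and that groups are drawn at random without repetition. The function names `count_triples` and `random_triples` and the example seed words are made up for illustration.

```python
import math
import random
from itertools import combinations


def count_triples(seed_words):
    """Number of distinct unordered groups of three seed words (assumed n choose 3)."""
    return math.comb(len(seed_words), 3)


def random_triples(seed_words, how_many, rng=None):
    """Pick up to `how_many` distinct groups of three at random (hypothetical helper)."""
    rng = rng or random.Random()
    all_triples = list(combinations(seed_words, 3))
    # Never ask for more groups than actually exist.
    return rng.sample(all_triples, min(how_many, len(all_triples)))


# Example seed words, chosen only for illustration.
seeds = ["corpus", "concordance", "collocation", "lemma", "word sketch"]

print(count_triples(seeds))  # 5 seed words give 10 possible groups of three

# Each group would be submitted to the search engine as one query.
for triple in random_triples(seeds, 3):
    print(" ".join(triple))
```

Under these assumptions the count grows quickly: 3 seed words give 1 group, 5 give 10 and 10 give 120, which is why more seed words lead to more searches and a larger but less focussed corpus.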
