PDF Data Extraction and Automation 3.1


Hello and welcome to our specialized tutorial about
working with PDF files. In this video we’ll take a look at specific
methods of extracting data from this widely popular file format. Whether in native-text format or as scanned
images, UiPath allows you to navigate, identify and use PDF data however you may need.

Before we go on, you should already be familiar
with extracting data, and with how to use and edit Selectors. For both of these subjects there are separate
videos that go into much more detail, so make sure you check these in case you don’t
fully understand something along the way.

To start things off, first make sure you have
all the actions and dependencies required for working with PDF files. If a simple search in the activities panel
comes up empty, it means you have to install them. Simply go to the package manager, search for
PDF, and install the UiPath PDF Activities package. Great.

Moving on, you are probably aware of the fact
that PDF files can contain text, images, and sometimes text that is actually an image
in disguise. A basic method of identifying which is which
is by simply selecting the elements you are interested in. As you can see, text can be easily selected,
while images are immediately apparent as blocks. Later on we’ll see how to deal with both situations.

UiPath has various activities and methods
for handling PDF files, and we split them into 2 categories by their intended use: the
first category is for larger chunks of text or whole documents, and the other is for extracting
specific text items from a PDF file – like a name, a product, an invoice value, and so
on. We’ll start with the first category since
it’s the easiest.

For reading whole PDF documents or pages you
can use the Read PDF Text activity. It’s pretty straightforward: choose the
file to be read, and the action will output a text variable with the contents of the file. We’ll save the result as a text file… and also show it in a message
box… but you could use other string operations to modify or extract information out of the
generated text.

The Range parameter is an important one, because
it defines what to actually read. It can be set to All pages, as it is now,
or a specific page… like 5, or 12… or a range of pages… say… from the 3rd to the 7th
page. We have a one-page document so we can set
it to All, or 1.

Now… when we take a look at the result for
the Read PDF Text action, we notice that only the text part of the document has been converted. The top 2 columns of text are present, but
the lower half, which is actually an image, is completely ignored.

That’s okay, there’s a different action
for reading images inside PDFs, and it’s called Read PDF with OCR. As its name suggests, it uses Optical Character
Recognition to “scan” the images inside the PDF document and output all the text as
a variable. It is slightly different from its non-OCR
sibling, in that it requires an OCR engine. You can find the available ones by simply
searching for OCR in the activities panel. I have just these two installed, Google and
Microsoft, but others such as ABBYY FineReader can be added too. The engine itself has the usual OCR parameters
encountered throughout the app: things like allowed characters, denied characters, language,
scale, and so on. Different engines may have different parameters,
so make sure to watch the Advanced UI Interactions video if you need a detailed description of
how they work. We’ll just run with the default settings here.

Straight away you can see that the lower half
– the image part of the PDF – is now also converted to text, so that’s good. But on closer inspection, the 2 columns for
both the text and the image part of the document have been interleaved. That’s because OCR is not yet smart enough
to automatically recognize the 2-column layout in the document. If a certain layout is a recurring pattern,
you will see a bit later how to make a special automation to read it correctly.

One thing you should be aware of with OCR
technology in general is that its quality degrades quickly as the quality of the source image drops. As you can see in this example, the end result
is highly dependent on font size, font face, and image resolution – things that are not
always in your control. So whenever possible, use the non-OCR Read PDF Text
action.

It’s important to note that both Read PDF
actions are self-contained: they don’t need other applications open, so they can run in
the background. Most other PDF methods you’ll see today
don’t share this quality, so if background operation is important to you, look no further.

Okay. The second method for grabbing larger and smaller
blocks of text is the handy screen scraping tool. Accessed here in the main toolbar, it’s
actually an interactive wizard that generates the required actions for you. Simply indicate the text elements that you
need scraped and UiPath will show this preview window with a few options. If this is your first encounter with it, here’s
how it works: this is the preview area showing you the text elements identified inside the
selection you just made. This is the current scraping method used…
and these two are the other available methods. And this button is used to indicate on the
screen another element to scrape.

Generally UiPath detects the best method for
your situation. In this case all 3 methods work fine, although, just
like we saw before, only the OCR method reads the image along with the text. When we change the scraping method by clicking
its name on the right hand side, the preview updates accordingly. We’ll go with the default FullText method,
and hit continue.

In UiPath, connect the newly created sequence
to the start node, and take a look inside it. Here, we see the generated actions from the Screen Scraping wizard: the optional Attach Window,
and the Get Full Text action with the output variable set. We can display that variable in a message
box… or use it elsewhere in the automation. And if in the previous step we had
chosen a different scraping method… like OCR… we would have gotten these actions
as a result.

Right, so these 3 techniques can be used to
extract larger pieces of text. Next, let’s see how to extract a single
piece of information out of an entire PDF document, or extract the same data from multiple files
with the same layout.

First of all, we’ll be talking mostly about
PDF documents that are in the most common format, native text, which means their elements
are directly accessible to UiPath. If you’re dealing with scanned documents,
you’ll see a bit later some image-based techniques that you can use.

So, for normal PDFs there are a few options
for getting to the data, the first one being the well-known GetText action, also accessible in the recorder, here. Simply point to the element you are interested
in… and UiPath will generate the GetText action and its output variable for you. There’s nothing more to it, and we’ll display it in a message box.

As it is now, this action will only get the
value of this specific text element from this specific file. But let’s say that you actually want to extract
the total value from a series of similar PDF invoices, instead of just one file. The GetText action, like most UI interactions,
uses a selector to identify the correct element and get its value. So as you probably guessed, we’ll need to
tweak it a bit to extend its scope.

The automatic way to do that is by using the
Attach to Live Element feature. Simply point to another similar element that
should also match the current selector, and UiPath will try to fix the selector for you. In this case it worked, but since that’s not
always the case, let’s also manually modify it to get an idea of what that entails.

A bit of a warning though: we won’t go into
the general aspects of selectors; I’ll just explain this specific example. But I strongly encourage you to watch the
“Selectors” video detailing how they work and how to edit & debug them. They are a central part of UI
automation, so understanding them better will help in many other situations.

So we’ll cancel this to not save the changes,
and open the selector again. This time we’ll choose to open it in UI
Explorer for a better view. The checked containers are the ones that actually
make up the selector so we’ll focus on those. The last element has its actual value present,
so we’ll want to remove that, to make it work for other values as well. We’ll also remove the Title parameter to
use other files too… and through some trial and error we found it’s better to use the
more distinctive Row-name attribute for this item. Then we’ll just copy the selector and paste
it over the old one. Now it should work for both files. It got the correct value from this invoice…
and also from the other one. Great.

There is another method of achieving the same
result – that is, extracting a fluctuating value from a series of PDF files… or invoices
in our case. That is the Anchor Base activity. From the start you’ll notice it looks a
bit different from other activities. It is made up of 2 actions because it performs
an action in relation to another fixed element, or anchor.

A typical anchor is the Find Element action. We’ll use it to pinpoint a fixed element
close to our target element, usually its name; here, that’s Grand Total. And the action we want to perform is to Get
the text.

While both of these actions have associated
selectors, you will notice they are a bit simpler. The second one just needs to find its way
from the anchor element, so it’s only the last row of the full selector. It’s a bit too specific right now, so we’ll
replace the number with the wildcard asterisk to accommodate any values. And for the anchor element… same as before:
replace the fluctuating part of the file name with an asterisk, and remove this row because
it has no unique identifiers.

The Anchor Base also has an optional parameter,
the anchor position. It’s used to define more clearly where to
look for the data. We could leave it as it is, or change it to
Left because that’s where our anchor is in relation to the text. Ok, now this workflow should work on both
of these files, and any other file with the same structure.

The Anchor Base activity is pretty flexible,
which means you can use various actions inside it. For example, you can replace the Find Element
action with a Find Image. The advantage is that now the structure
of the PDF document is not that important anymore; what’s important is just that it
contains the specific indicated image anywhere in the visible document. Also, you don’t have to deal with selectors
as much anymore; and because PDF files look the same on all systems, you can use Find Image
without its usual drawbacks.

But before indicating the image to look for,
it’s good practice to set the zoom of the document to actual size, to make sure you
have a complete and accurate image. Simply go to view, zoom, and actual size. Then back in UiPath, click to indicate an
image, and make a selection around your anchor, in our case “grand total”. And… it got it.

This method can sometimes be more reliable
than the others because it can handle even major structural changes in the document,
as long as the image and data are present and in the same relationship. That is especially true since the Find Image action can
handle a reasonable amount of scale variation, and PDF documents are otherwise pretty stable.

Another important thing to note is that these
last 2 methods require the PDF document to be open, and the data you are trying
to interact with must be visible, otherwise they will fail. So make sure you take that into account when
building the final automation.

That’s pretty much all for this video. These
4 or 5 activities that you’ve seen should allow you to handle most PDF extractions you
will be faced with. There are a couple more, like Find Relative
Element and Scrape Relative, but I will let you discover them on your own. Furthermore, if you’re dealing a lot with
scanned documents, you may want to have a look at our “Image-based Automations”
video, for specifics on handling images.

And finally, let’s quickly do a recap to make
sure everything has settled in. We started with two methods for reading larger
blocks of text: Read PDF Text and Read PDF with OCR. They’re straightforward actions that read
entire PDF documents (or pages) and output a string variable. Then we looked at Screen Scraping, which is
an interactive wizard that is useful for comparing & choosing a scraping method, and that also generates
the actions for you.

And for extracting specific text elements,
your first option should be the GetText activity. Depending on your situation, it may or may
not require some selector tweaking, but it’s generally a quick and solid technique. Lastly, you experimented with the Anchor Base
activity. It is used to locate a fixed element and perform
an action relative to it. It’s a very powerful and flexible activity:
you can look for an image or element, and get the data that is related to it.

That’s it for now, I’ll see you later. Bye!
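
For anyone reading rather than watching, the selector tweak from the invoice example (replacing the fluctuating part of a value with an asterisk) can be illustrated with a rough Python sketch. This is not UiPath code and not how UiPath evaluates selectors internally. It only shows the wildcard idea using Python’s standard `fnmatch` module, and the window titles and the `matching` helper are made up for illustration. Note that `fnmatch` also treats `?` and `[...]` as special characters.

```python
from fnmatch import fnmatchcase

# Hypothetical window titles, one per open PDF (not taken from the video).
titles = [
    "invoice_001.pdf - PDF Reader",
    "invoice_002.pdf - PDF Reader",
    "report.pdf - PDF Reader",
]


def matching(pattern, candidates):
    """Return the candidates that the wildcard pattern matches."""
    return [t for t in candidates if fnmatchcase(t, pattern)]


# Recorded as-is: tied to the one file it was created on.
print(matching("invoice_001.pdf - PDF Reader", titles))

# Fluctuating part replaced with an asterisk: matches every invoice,
# but still leaves out the unrelated report.
print(matching("invoice_*.pdf - PDF Reader", titles))
```

The trade-off is the same one made in the video: the more of the value you replace with a wildcard, the more files the pattern accepts, so keep enough fixed text to rule out unrelated windows.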

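The Anchor Base idea (find a fixed label, then read the value sitting next to it) can also be sketched in plain Python. Again, this is only an illustration under stated assumptions: the `TextBox` type, the coordinates, the `row_tolerance` parameter and the `text_right_of_anchor` helper are all hypothetical, and UiPath’s own matching logic may work quite differently. The anchor is assumed to sit to the left of the target, as in the Grand Total example.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    """A piece of text with the position of its top-left corner (hypothetical)."""
    text: str
    x: float
    y: float


def text_right_of_anchor(boxes, anchor_text, row_tolerance=5.0):
    """Return the text of the nearest box to the right of the anchor, on the same row."""
    anchor = next((b for b in boxes if b.text == anchor_text), None)
    if anchor is None:
        return None  # anchor not found, so nothing can be read relative to it
    same_row = [
        b for b in boxes
        if b is not anchor
        and b.x > anchor.x
        and abs(b.y - anchor.y) <= row_tolerance
    ]
    if not same_row:
        return None
    # Pick the closest box to the right of the anchor.
    return min(same_row, key=lambda b: b.x - anchor.x).text


# Two made-up invoices whose totals sit at different positions on the page.
invoice_a = [
    TextBox("Credit", 40, 300), TextBox("0.00", 200, 300),
    TextBox("Grand Total", 40, 320), TextBox("$40.00", 200, 320),
]
invoice_b = [
    TextBox("Grand Total", 60, 500), TextBox("$1,250.75", 260, 502),
]

print(text_right_of_anchor(invoice_a, "Grand Total"))
print(text_right_of_anchor(invoice_b, "Grand Total"))
```

Because the lookup is relative to the anchor, it keeps working when the whole block moves on the page, which is the property that makes the anchor approach tolerant of layout changes.
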
16 Comments

  • ilyas el khattab

    uipath is a great software but i think he needs adjusments in my case its not working i have 366 pdf of a ware house in a table there is the product and near to it there is the stock stiuation . well i was willing to that the software check the color in the table and verify the product using the color to determine wich part are ocupied in an excel file

  • Junior Fumi

    00:00 – Introduction
    00:39 – Start
    04:45 – Screen Scraping
    06:21 – Extracting Specific Elements
    09:15 – Anchor Base
    12:34 – Resume

  • u dont stand a chance

    Its awesome. However I need help with the pdf on an entirely different matter. My pdf has 74 pages but ReadPDF reads only 4 pages for me. I wish to read all the pages and convert it to an XML. Any help please

  • jennish gurumayum

    Hi., Pliz someone help me. I have a problem about pdf automaion with uipath. How can i extract only a specific colored text(font color) from pdf file and put in world file… Pliz anyone.

  • Francisco Morales

    WoW!
    how Could I read many pdf documents in a specific folder? I open them and extract the field "total invoice" and Store it in datatable or excel?,,, is it possible?

    Thanks for the video. It was awesome

  • balireddy madhu

    hi sir this is madhu can you please tell me sir how to convert the pdf file to csv file and extrating data by using python code

  • Surya Avantsa

    I am having issues with selecting exactly the $40.00 element or the "Grand Total" element. It always select the entire two lines starting with the Credit 0.00. Is this process different with the 2019 version of UiPath Studio? Please help.
