Webinar: Entrez Direct (EDirect) Part 2 of 4, PubMed Author Sorting


The way I will do this is talk about the example
and the strategy first. Then we will go to the command line. The first one is — I have a bunch of PubMed
articles and I would like to know who are the most prolific authors. I would like to
take all the authors on those publications and sort them by the number of times they
appear in the citation list. How do we do that? We need to be able to find
where this information is. I tried to blow up a piece of the XML that you will actually
get back from a PubMed record. This is where you can find this information. An easy
way to do this is to get the XML from PubMed. You will find a lot of the information there.
If you look, there is an AuthorList tag and an Author tag. We have last name, forename
and initials. There is more affiliation information. There is a variety of things you can do here.
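To make the tag layout concrete, here is a heavily trimmed sketch of one record, written to a temporary file. The author names, affiliation, and PMID are made up for illustration, and a real record carries many more elements and attributes than shown.

```shell
# Illustrative only: a trimmed PubMed-style record with made-up authors.
sample_xml=$(mktemp)
cat > "$sample_xml" <<'EOF'
<PubmedArticle>
  <MedlineCitation>
    <PMID>00000000</PMID>
    <Article>
      <AuthorList>
        <Author>
          <LastName>Smith</LastName>
          <ForeName>Jane Q</ForeName>
          <Initials>JQ</Initials>
          <AffiliationInfo>
            <Affiliation>Example University</Affiliation>
          </AffiliationInfo>
        </Author>
        <Author>
          <LastName>Lee</LastName>
          <ForeName>Kim</ForeName>
          <Initials>K</Initials>
        </Author>
      </AuthorList>
    </Article>
  </MedlineCitation>
</PubmedArticle>
EOF

# Show only the two tags this example cares about.
grep -E '<(LastName|Initials)>' "$sample_xml"
```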
What we will do is look at the last name and the initials. We will be happy if we can get
those out. The strategy we would use to do this — we
will have a query that goes after the PubMed records we want. We will use esearch against PubMed. We will get a set of PMIDs back. We’ll take that set of PMIDs and pass it to efetch to get the PubMed XML. Then we will use the xtract
utility to take the last name and initial tags out of each of those records and do something
interesting with them. That is the basic approach. As we go through I want to talk about a few
things before we get to the command line that are tips and tricks about using xtract. This
may not make sense right now but it will make more sense when we actually get to the point
of looking at the data. There are a number of different flags that xtract takes. They
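can help you organize the output of the command. There is one called -sep. It specifies the separator placed between fields that are grouped in one -element argument, such as LastName,Initials.
And there is one called -tab, which replaces the tab that otherwise separates one extracted item from the next. We will see how these work. The trick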
is to play with them. You will get a sense of the way that they work. I would like to
show you both of those first. And see how they work. I will move to the command line. What I have
is a directory where I have taken down the tar file. I have taken that file and expanded
it. I have a variety of things here. All of the different EDirect utilities and different
example files. What I thought I would do to start is do something simple. The way that you can
approach this, if esearch is our first thing. I can type ./esearch and give it a database, PubMed and a query.
I will say childhood asthma for a query to start with. What happens is EDirect gives
me some XML back. It does something similar to what esearch does. It gives me the database back, a web environment,
where the results are stored. It will give me a count. So there are 9,397 records. That is not all that useful to me right
now. What is handy about EDirect is you can pipe the output into one of the other E-Utilities.
I piped to esummary and boom there are docsums. It will dump these to the screen.
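The search-and-pipe step just shown can be sketched as a small shell function. This assumes EDirect is installed and on the PATH (in the webinar the tools are run as ./esearch from the unpacked directory); the function name is a hypothetical helper, not part of EDirect.

```shell
# Search PubMed and return document summaries for every hit.
# pubmed_docsums is a made-up helper name for this sketch.
pubmed_docsums() {
  esearch -db pubmed -query "$1" | esummary
}

# Usage (makes live requests to NCBI, so the function is only defined here):
#   pubmed_docsums "childhood asthma" > asthma_docsums.xml
```

Redirecting to a file, as in the usage comment, avoids dumping thousands of docsums to the screen.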
I can put these in a file or somewhere else. It is easy and I do not have to keep track
of bookkeeping, I just pipe it to esummary. That was one thing we could do. What I will
now try to show you is one of the example scripts that I have for the first example. What we can do is we can do the search in
PubMed. Here’s the query about phospholipases. We can pipe that into efetch and give the format as xml. I will show
it to you. This is the type of output you will get. What we are trying to get is some
of the author information. You will see the author information, author list and last name,
forename and initials. What we are trying to do is pull out the authors and do something
to sort them. What is something we can do? We will look
at the next step. What we can do is use xtract. What you do
is pipe the output which will be the XML — we will pipe the output into xtract. It has a
few things we want to be aware of. One is pattern. It is the overall record that we
want to extract things from. PubmedArticle is the high-level tag that corresponds to a particular
article. We want to say that first. That will loop essentially over each of the records.
Block is a secondary section within that article. Here it is the author, so we want to go to the Author tag. Within Author
there are two elements we want to get, last name and initials. We say -pattern PubmedArticle, -block Author, -element LastName,Initials. Here is what we get. We get a whole bunch
of author names. You will notice it looks kind of strange because the last name and
the initials have been separated by tab characters. Each line corresponds to one article
and all of the different authors in that particular article. One line per article. The last name
and the initials have been separated by tabs. Maybe we do not want them separated that way. Maybe
we want them separated by spaces. That would be more natural. One way you can do that is to use the -sep flag. Since I have multiple fields we will add a space between them instead
of the default tab. What will I get? We can look at the output. Now you will notice
that I get things that have a space. Each of the authors is separated by a tab because it is a separate item. But the individual last name and initials are separated by a space. I can
use any character I want and specify that in -sep. I am trying to find all the different
authors sorted by frequency. I would like to get them into some configuration where
I can do that. It may be better if I could get one author per line rather than all of
the authors on the same line. That is another thing that EDirect and xtract
can do for you, with -tab. What we can do now is, in addition to using -sep, set -tab to a new line.
What that will do is after each element, the last name and initials, it will insert a new
line. Instead of the new line being at the end of the data for one article you will get
a new line after each author. We will look at what happens. Here is what
we get. Now we have a file that has each of the different authors on a separate line. What are we trying to do? We are trying to
see how many of these authors are appearing multiple times in the file and rank them.
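The three xtract variants walked through above can be sketched side by side. Each helper reads PubMed XML on standard input; the function names and the output file name are hypothetical, the query text in the usage comment is only a stand-in for the one used in the demo, and EDirect is assumed to be on the PATH.

```shell
# Default: one line per article, every value separated by a tab.
authors_tabbed() {
  xtract -pattern PubmedArticle -block Author -element LastName,Initials
}

# -sep " " joins LastName and Initials with a space;
# the authors themselves stay tab-separated on one line per article.
authors_spaced() {
  xtract -pattern PubmedArticle -block Author \
    -sep " " -element LastName,Initials
}

# Adding -tab "\n" ends each author with a newline: one author per line.
authors_one_per_line() {
  xtract -pattern PubmedArticle -block Author \
    -sep " " -tab "\n" -element LastName,Initials
}

# Usage (live NCBI requests):
#   esearch -db pubmed -query "phospholipase" |
#     efetch -format xml |
#     authors_one_per_line > authors.txt
```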
If you are a UNIX guy, this is the advantage of using EDirect. You can think about this
simply by taking that file and doing some things on the UNIX command line. I have the
output file. I will play UNIX now. This is not doing anything
with EDirect. I can cat that file and pipe it to sort. I can pipe it to uniq -c. What will that do for me?
I will also pipe to more. Now what it does is it will sort the authors alphabetically
and it puts a number in front of them saying it is the frequency. Now I have it. I have
all the authors. I have their frequencies. Now what I would like to do is say what is
the most popular author. It is sorted alphabetically. I could pipe something at the end: another
sort, a reverse, numeric sort, and then pipe to more. Those are the most prolific authors. I can do that in my script I have been working
with. Or what Jonathan has done is, you can look in the EDirect implementation. There
are a number of utilities. One is called sort-uniq-count-rank. This is something he observed as a common
use case. That is, taking them out and sorting and ranking. The whole business of piping to
sort, then uniq -c, and then sort again he has combined into this sort-uniq-count-rank thing.
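The manual sort, count, and re-sort steps can be tried without EDirect at all. The sketch below uses a small made-up author list in place of the real one-author-per-line output, and passes the file straight to sort rather than going through cat.

```shell
# Hypothetical sample standing in for the one-author-per-line xtract output.
authors_file=$(mktemp)
printf '%s\n' "Smith J" "Lee K" "Smith J" "Garcia M" "Lee K" "Smith J" \
  > "$authors_file"

# 1. sort puts identical names next to each other
#    (uniq only collapses adjacent duplicate lines).
# 2. uniq -c prefixes each distinct name with its count.
# 3. sort -rn orders by that count, largest first.
ranked=$(sort "$authors_file" | uniq -c | sort -rn)

printf '%s\n' "$ranked"
```

The first sort matters: without it, uniq -c would count only runs of consecutive identical lines.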
The final version of this is, after xtract, to pipe the output of xtract into sort-uniq-count-rank. It does the same thing. You will see it is the same thing we just generated with the command line utilities. That is a first case. It is a way of taking the output that we get from the E-Utilities and using UNIX command line tricks and tools to do a little post-processing
of those data. Let’s try something else now.
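Putting the whole first example together, the pipeline might look like the sketch below. It assumes EDirect is installed and on the PATH; the function name is hypothetical, and the query in the usage comment is only a stand-in for the one used in the demo.

```shell
# Rank the authors of all PubMed records matching a query by how often
# they appear. prolific_authors is a made-up helper name for this sketch.
prolific_authors() {
  esearch -db pubmed -query "$1" |
    efetch -format xml |
    xtract -pattern PubmedArticle -block Author \
      -sep " " -tab "\n" -element LastName,Initials |
    sort-uniq-count-rank
}

# Usage (live NCBI requests; large result sets take a while to fetch):
#   prolific_authors "phospholipase" | head -n 20
```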
