How Search Engines Index Your Websites


Hello, and thanks for joining us today for today's webinar, How Search Engines Index Your Websites. My name is Arantxa Recio, and I'm a search analyst for Search.gov. I'd like to go through a couple of things before we get started. We'll take questions at various points in this presentation, and you can type those in the YouTube live chat box to the right of the video. We are recording today's session, and it will be available immediately on the DigitalGov YouTube channel. You will receive an email immediately after the event with the slides and an event evaluation; please give us your feedback, as it helps us make sure we are creating content that is worthwhile.

I'd also like to give a little context for today's event. In this webinar we will walk you through the fundamentals of how search engines monitor your content and pull relevant data from your pages. Now I'd like to introduce our presenter, Dawn McCleskey. She's a professional librarian on a mission to help people find what they're looking for.
She's also the program manager for Search.gov, where she works to improve customers' experience with the service and the public's experience when searching on government websites. So Dawn, take it away.

Thanks, Arantxa, I appreciate that introduction. As she mentioned, we are going to talk today about the high-level functionality of a search engine, including how search engines monitor your website for content, as Arantxa said, and get that content into their indexes; the role that sitemaps and robots.txt files play in that process; some special considerations for sites that publish their content on multiple platforms, which is pretty common across government; and then the relationship between the indexing process and the search process that an end user goes through. So first, what is a search engine?
I thought it would be good to start off with some definitions so that we're all on the same page. I'll be using the word "index" a lot, and we all have slightly different understandings of what that word means, but basically it's a pointer. (I'm sorry, one moment, did that say share your screen? Am I not sharing? Okay, we're good.) So an index is a pointer to something else, and I was surprised to find that my primary definition of an index is actually the second definition according to the Merriam-Webster dictionary. What we'll be talking about today is a list of items that points you in the direction of the information, so that you can get the full content where you need it.

This is a page from the index of my copy of Herodotus's Histories, and as you know, an index in the back of a book shows on what pages you will find certain information. We're going to zoom in on the top portion here. Some indexes show just the keyword and the page numbers, but you'll see this index shows not only the keyword but also some context that allows you to make an informed decision about which page to turn to to find what you're looking for. This is an old-school search result with snippets, and I was really pleased about this. Each of these entries shows metadata about the page. Metadata, of course, is data about other data; in this case we have information on a page, and then we have that information's topic, subtopic, and the number of the page. None of this is the information on the page itself, and if we want that, we have to go read it directly. So this is my very simplified definition of a search engine,
and the thing that makes them different from
regular database is that they self populate and this can make them appear
very magical but other than the scale and sophistication of results
presentation and how information gets into them using Google isn’t all that
different from doing a find in all sheets in Excel so just like The Wizard
of Oz there are people behind the curtain used using certain technology to
make it look like magic but it’s really just technology so let’s draw back that
curtain there are two sides to any search engine indexing content which
consists of discovering what content is available and building a useful index of
that content and then they also allow queries against the index so that end
users can discover the content for themselves so the top two items here are
the indexing side and then the the bottom is the search the two primary
ways that search engines discover your content are through crawling and through
sitemaps crawling is when the search bot opens a page detects all the links on
that page makes a note of them and then goes to all of those pages as well XML
sitemaps are machine friendly lists of the items on a website and the bot reads
the list notes all of the URLs and for example ours is search gov slash o and
we have a resource available on our website about the details that they look
for in a sitemap so both of these crawling in sitemaps a result in the
search engine knowing what exists on your website one very important thing to
note is that Google and Bing have clearly stated that they don’t crawl
everything and they don’t curl all sites they don’t crawl every part of every
site and that they don’t index everything that they crawl through so
there’s there’s everything on the web the subset of that that that gets
crawled and then the subset of that that ends up in the Google index
Google can also discover content via feeds, and you'll see I've got that kind of grayed out. I won't cover this method today, but there's a link in the slide notes if you want to learn more about that.

You can indicate whether you want portions of your site indexed or not using robots settings, and we'll talk about these in detail later on. To mark entire folders as not to be indexed, you want to use your robots.txt file. For individual files, you can list those in the robots.txt file if you want, or you can use a robots meta tag in the head of the page. More on both of these later.

Once the bot has gone through your site and collected your URLs, it's going to parse the data on each of those pages. The bot requests the page, reads the code, and sorts through what's there to see what can be used, according to the developers who programmed the bot: what links are on the page, what text-based content is there, are there images or other content types. They look to see what's there, they sort it out, map it into their own model, and save the information to their indexes. That's the actual indexing stage, when the parsed data gets saved. So when your very beautiful homepage gets opened up, it's going to look like this to the robot. As some of you may have heard me say before, computers are really stupid and can only do what they've been told to do. So far, at least.
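To make that concrete, here is a sketch of the kind of bare markup a bot actually receives. This is a hypothetical agency.gov page, not any real site; the title, description, and headings are illustrative only:

```html
<!-- Hypothetical example page; names and URLs are illustrative only -->
<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Passport Applications | Agency.gov</title>
  <meta name="description" content="How to apply for or renew a passport.">
</head>
<body>
  <header>Site banner and navigation live here</header>
  <main>
    <h1>Passport Applications</h1>
    <h2>First-time applicants</h2>
    <p>The content the bot should index goes here.</p>
  </main>
  <footer>Footer links the bot can skip</footer>
</body>
</html>
```

Note the single h1 and the main element; both come up again in a moment.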
So parsing of content is helped by structure. The more explicit structure you can put into your pages, the easier it will be for the computer to interpret your intent and present good results to searchers. Do you have metadata in tags in your page heads, or inline in the tags within the body? I won't talk much about structured markup today, but if you have questions about that, let's follow up afterwards. If you use a content management system, there are plugins that will insert structured markup for you.

Also, do you use header tags, the H tags, in your page in the proper way, with the page title in the h1, h2s containing section headings, and so on down from there? I've seen page titles in h2s while the site section name is in the h1, and there are plenty of pages still in government that have the title and sections just as bold text in the body. You'll get better results in search if the title of your page is in the h1 tag and there's only one h1 tag on the page.

And lastly, are you using a main element in your body tag? The body tag, of course, includes all the visible stuff on the page, but we don't want headers, footers, and navigation elements indexed along with the main content of a given page. So we wrap that portion of the HTML body in a main tag, or you can give a div the role of main, and that lets bots know that anything outside of it is fluff and should be ignored in terms of indexing the content of that particular page. This you want to be particularly sure you've implemented properly. The other day I was trying to figure out why a particular search result was so sparse for a page we had indexed, and it was because the only thing inside the main element was the h1 tag. All our system took was the title of the page, and we didn't have any of the other important content, because it was outside of the main element. So watch out for that.

You'll also be helped getting into the index by high-quality content: the higher the quality of the content, the better your SEO ranking will be, because the search engines will be better able to match searcher needs with your content. So what does high-quality content mean? It should be unique to the page; this includes the main content of the page, but also the title tags and meta descriptions in the head. It should be written in plain language, using the words your audiences use in addition to whatever technical terms or acronyms you may need to use, and you'll want to go easy on the jargon. That way you get all of those terms in, but the primary way searchers express their interest matches the primary description of that content on your page. This is really win-win, because it'll be easier for the search engine to match your content with searcher needs, and then once searchers get to the page, they'll be better able to understand it.

It should also be well written. In the age of Grammarly and other quality checkers, it's no surprise that search engines would start running these kinds of tools over the content in their indexes and scoring pages accordingly. I know some agencies provide plain-language and quality-review tools to their staff, so inquire with your web management team if there's something like that available for you.
And this applies to new pages that you're writing now, but also to pages that have been published for a while; it never hurts to make something clearer. Then finally, you want to ask whether the content is useful. Search engines track what people are looking for and what they click on from results, and they're analyzing what makes a particular item click-worthy. So if a page appears to be redundant, or too fluffy, or incomprehensible, it shouldn't be too surprising if it doesn't end up getting indexed. I was talking with someone the other day, and they said, you know, we have 900,000 items on our website and Google keeps stopping after about 300,000. Given the nature of this particular site, I wasn't too surprised, because it could appear to Google as very repetitive content.
Let's recap what we've gone over so far, and I'll take questions in a minute, so send them in if you have any in mind. This is your website, and here's Google, over in some other part of the web. Googlebot visits your website to review your sitemap and crawl around. They parse the data, sort it out, and then decide what of it they'll bring into their index, or not, for each of your pages. And then here comes Allison from Ohio. She wants to get a passport, so she goes to Google and searches for "passport application." She gets a bunch of results, she clicks on one, and is brought over to that site to view the information. It's pretty straightforward. Have we gotten any questions so far? Nope, none so far. Okay, let's move on then.
No, no questions at this time, Dawn was just checking. Okay, thank you. So let's talk briefly about crawling. We're not going to talk in depth about crawling today, but you do want to make your site as crawlable as possible. First, you want to make sure that you avoid crawler traps. Crawler traps occur when your site can generate an infinite number of URLs, and this usually happens when any given URL can have parameters appended to it, like tags, referring pages, a Google Tag Manager token, and so on. Each one of those URLs will look like a different URL; the crawler will open it and note the links on it, which will themselves, once followed, have these additional parameters on them, and the crawler won't be able to figure out what constitutes the entirety of the website. Most if not all search engines will stop crawling if they detect a crawler trap, because they don't want to waste their resources, and I've got a reference in the notes so you can learn more about crawler traps.

You can help avoid crawler traps, as well as alleviate other potential confusion over which version of a URL is the version of record, by inserting a canonical link in the head of the page. This particular example tells the search engine that even though this page can be accessed via http://www.agency.gov/topic1/, the version without the subdomain and without the trailing slash should be the one indexed. This helps with crawler traps too, because it says: disregard all those parameters and index just the real URL.

And if your site is built with JavaScript, note that even though the search engines can now render your pages to see the content, they're still ignoring anything past a hash sign in the URL, because within-page anchor links use a hash sign as well. So you'll want to change it to the format known as "hashbang" (#!), or see if you can drop the hash symbol from your URLs entirely. I understand it's easier to turn a hash into a hashbang than to remove it, so take a look at that if you've got hashes and you've been wondering why your crawl coverage is poor. Also, if you have a noscript tag in your JavaScript display, you know, the text you want to have in place in case someone has JavaScript turned off, so they know to turn JavaScript on or have other access to your site, take a look inside Google Search Console to see whether Googlebot is tripping on that or not. I was looking at a site last week where all Googlebot was seeing was that noscript tag; they also had hashes in their URLs.

Oh, do we have a question? Okay, go for it. Joe from NSF is asking: are there meta tags that are no longer paid attention to by Google? There are. I will talk about that later, and there is a link in the slide notes that will send you to the list of meta tags that Google pays attention to, so we'll loop back to that in a bit. All right, that's the question, thank you.
Okay, great, keep them coming, everybody. So, XML sitemaps. As I said earlier, a sitemap is a list of the items on a website; hopefully it will also state when each was last modified, and it may also indicate their relative importance to one another. An XML sitemap is not a shopping list for Google of all the things that they should pick up from your site. It's more like a weekly specials flier saying, here's what we have, don't you want to come get it? So there's a philosophical debate here: do you want to include only the most important pages, to try to draw attention to them, or do you want to include all your stuff? Googlebot will follow its programming to determine whether to index your content, so from their perspective, including only the most important pages or everything is a little bit six of one, half a dozen of the other, because they have other determinations that they make as to whether to index something or not. For Search.gov, we will take all of your stuff; we don't editorialize for you in determining what content should be available for search. That's because we have a different mission, which is to make it easier to find things on a particular government website; we're not trying to provide access across the whole web. Because of this, we have a much smaller universe to cover than Google, so we don't have to be as conservative with our index size, even though it might sound funny to think about Google's index as being a conservative size. So
why do you want an XML sitemap? Because you want to make it easier for bots to discover your content, and it will cut the URL discovery time down significantly. And as a reward for helping them, and for showing you can present your content in an organized way, Google gives a bit of a ranking boost, or so it's thought; most of SEO is just reading tea leaves, because of course Google doesn't give away their secrets.

For XML sitemaps there's a protocol that you're supposed to follow, and it is this: it needs to be an XML-format file; the file location needs to be listed in the robots.txt file; it needs to include only files for the location where the sitemap file is; and it needs to have clean URLs. The date each file was last modified is optional, but it's really, really useful. Then totally optional are the file change frequency and the priority. That's it. It's pretty basic, but each of these is tricky, which is why we're having this session today.
So your sitemap might look like this. This is ours from Search.gov; it was generated using a Jekyll tool on GitHub Pages, and it includes just our URLs and the last-modified date. Or it might look like this one, generated by the Drupal XML sitemap module; you can see the change frequency and the priority settings over on the far right. Let's look at the elements of the protocol and how they can each go wrong.
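As a sketch (with hypothetical agency.gov URLs), a minimal sitemap that follows the protocol looks roughly like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical URLs; lastmod is optional but very useful -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://agency.gov/topic1</loc>
    <lastmod>2019-06-01</lastmod>
  </url>
  <url>
    <loc>https://agency.gov/topic2</loc>
    <lastmod>2019-05-15</lastmod>
  </url>
</urlset>
```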
All right, an XML-format file. That seems pretty straightforward, and yes, there are some alternative formats mentioned in the protocol, but XML is the standard. So what could go wrong with this? First, it needs an opening XML declaration with the version stated, and you need to close your tags, including the closing urlset tag. (You can infer from all of the advice I'm giving that I've seen something to the contrary in my travels.) If you're using a sitemap-generating tool, it should handle this for you, but I have seen sitemaps where the tool that generated them was missing these features, so take a look to make sure what you've got in place has them. If you don't do these things, the poor stupid search engine bots won't know what they're looking at; they'll just see code and won't be able to interpret it without these declarations.

It needs to be listed in the robots.txt file. We'll talk in detail about robots.txt in the next section, but search engines are going to check your robots.txt file first, and the line for the sitemap entry should look exactly like this: one line, and it says "Sitemap:", a space, and then the path of the sitemap. In theory this means that you could put the sitemap anywhere and search engines would still find it, because it's listed here. That's true, but it also violates the location element of the protocol, which we'll look at next. You don't want to list the sitemap file as an Allow line in your robots.txt. You do want to list multiple sitemaps if you have multiple publishing platforms for a given subdomain, one on each line: "Sitemap:" and the path, then on the next line "Sitemap:" and the next path. And don't list sitemaps for multiple domains or multiple subdomains on the same robots.txt; more on that later.
A sitemap should include files for the location where it is. If a sitemap is at www.agency.gov/sitemap.xml, then it's only supposed to have URLs for the www subdomain; blog.agency.gov should have its own sitemap, at blog.agency.gov/sitemap.xml. Similarly, if you put your sitemap in a folder, like www.agency.gov/sitemaps/sitemap.xml, then according to the protocol it would only have URLs from within that /sitemaps/ folder, so you want to make sure the sitemap sits at the level that includes the URLs you want indexed. For us at Search.gov, please list all of the URLs that you want us to index, as many as you can, in a sitemap, at least for the stage where we're seeding your index at the very start of indexing. For the commercial engines it's your call, especially if you're happy with your crawl coverage, in that most-important-or-everything debate I was talking about earlier. But for sure, don't include URLs from folders that you've excluded from indexing in your robots.txt file; we'll talk more about that later.

Clean URLs. If you're using a content management system and a plugin to generate your sitemap, this should be taken care of for you. It should be, but just to go over the essentials so that you can double-check that it's working right: the file needs to be UTF-8 encoded, and you need to declare it as such in your opening XML declaration. Any special characters in your URLs need to be encoded using standard HTML encoding. And the protocol of the URLs in the sitemap needs to match the protocol of the site itself. This comes up a lot if the sitemap was created before the site moved to HTTPS: the sitemap file is served over HTTPS, the pages on the site are HTTPS, but the pages as listed on the sitemap are still HTTP. This makes double work for the search engine, because even though your 301 redirect might be getting the user to the right place, the search engine will create a record for the HTTP version and then have to update it right away to the HTTPS version. The same goes for any URL on your sitemap that redirects to another URL, so to keep the resource load down for the search engines, please don't rely on redirects to take care of inaccuracies in your sitemap.

Then finally, if you have a content management system, it's likely you have directory landing pages that are accessible via agency.gov/section/index.html, and also agency.gov/section/ and agency.gov/section. Each of these URLs might show the same page, but they'll be seen by bots as different URLs, because, again, bots are dumb. This is the time to decide which version is the preferred version; you then list that as the canonical URL in the head of the page, which I mentioned earlier, and also set that version to be the one included in the sitemap.

If your system stores a date for when pages were last edited, do include it on the sitemap, in a lastmod field. It'll indicate to the bot whether there's any work to be done on that item since the last time it came to the site, and it can also be used in search and in filtering of results if you don't have publication dates in your page metadata. If you don't include it, it'll be harder for the search engine to pick up updates to the pages, and then what happens is we get desperate calls about a leadership bio page that is still showing the name and information of a person who had a very public departure the previous week. If you're indexed with our system, we can manually trigger the new page to be indexed and the old page to be removed; with the commercial search engines you can request it, but you're at their mercy.

Change frequency isn't really used anymore, mostly because people are people and tried to game the system by saying that a page updated daily when there wouldn't be any actual changes to the page for months and months.
So most search engines are ignoring this field. Don't worry about adding change frequency; focus on the lastmod date. Priority is only marginally useful: because Google and Bing are comparing content across many, many sites when they're compiling a list of search results for a given query, knowing that one page on a site is more important than another page on that same site isn't that helpful for their search algorithm.
For them, high-quality content will be more useful as you compete with other sites for ranking. In terms of what I was mentioning earlier about well-written content using the language that your searchers use: there are ways to incorporate all the different ways people might talk about a topic without doing something as blatant as keyword stuffing. Basically, you want to integrate those keywords into the content of your page, into your link text to the page, all of those kinds of things. When you're writing for your public and you're getting searched through Google or Bing, then from their perspective you're competing with other sites for position in those search results. But for Search.gov, we are thinking about using this priority field, because since we're usually just looking for search results within one domain (sometimes we have many, but usually just one), knowing the relative importance of pages would actually be very informative to us, so that we could weight them accordingly.
So if you've got it, don't take it out, but you also don't need to worry about adding it if you don't have it.

There are a couple more considerations about sitemaps. Google and Bing got together and decided that there should be a maximum size for a given sitemap, which makes sense, because there again they're being conservative with their resource use. In the last year or two they decided to raise the maximum size to either 50,000 URLs or 50 megabytes, whichever comes first. Anything bigger than this is going to need multiple sitemaps, which are listed on a sitemap index file. Most government websites are very, very large, so this probably applies to your site. The sitemap index should meet all the same standards as an individual sitemap, particularly in that it should be location-specific and the URLs on it need to be clean, and in the slide notes I've got links to that for you.

Here's an example of a sitemap index. We've got an opening XML declaration, and after that, instead of the urlset tag we have a sitemapindex tag, and then instead of a url tag within that, we have a sitemap tag. So: here comes a sitemap; here's the location of the sitemap; here's the last time this sitemap was updated; this is the end of the information about that sitemap; here comes another sitemap; and so on.

Just like any protocol, it's only as good as people are willing to enforce it. Manners make the world go around, except maybe not: Google was one of the groups behind the sitemap protocol, but they also accept sitemaps in Google Search Console that don't follow the protocol, particularly around whether the sitemap is at the root of the section it represents. There are humans behind every bot, and there are humans behind every website, so there's going to be messiness. Here are some of the things that we have seen: sitemaps hidden away in totally obscure subfolders; sitemaps on one domain with URLs for different domains or subdomains, totally other domains; the same URL listed multiple times on the same sitemap; URLs with spaces in them; URLs for staging environments, relative URLs, or local dev-computer URLs on the publicly published sitemap; URLs with ports declared, like agency.gov:443; URLs missing file extensions where they're needed, so that if you click on one it goes to a 404, but if you add a file extension it works; and, my personal favorite, URLs beginning "http:/" with just the one slash. So don't let this happen to you, and please don't do this to anybody else. Do we have any questions at this time? Not at this time, Dawn. Okay, thank you very much.

All right, so let's dive into the
robots.txt. So what is a robots.txt file? It's a signal to bots of what you want indexed and what you don't want indexed, as well as the posted speed limit for requests to the site. This file is not a setting that can actually control bots: bots that have been programmed with bad manners will ignore your requests entirely and attempt to go all over your site as fast as they can. So you'll want to make good use of your firewall to shut things down if the requests go over your rate limit, and if you really need to keep bots out of a particular area of your site, you need to put that area behind authentication.

So why would you have one, then? Bots programmed with good manners will follow your settings: Google, Bing, Search.gov, and the Internet Archive will all definitely follow your robots.txt settings. Because of this, you can mostly control what will appear in the Google index, and you can totally control what appears in Search.gov. The robots exclusion protocol was first published in 1994, because even all the way back then there was a clear need to control the bots that people were sending out to other people's websites. There are two ways you can use it: you can place a robots.txt file at the root of your domain, or you can put a robots meta tag in the head of a given page.

This is the entirety of the formal protocol. There are only two field types: User-agent, specifying which bots you're targeting with particular commands, and Disallow, saying what you want bots to stay out of. You can also add comments following a little hash sign. There are a few other fields that have become standardized, though they're not part of the official standard. Crawl-delay is your posted speed limit; the number here is the number of seconds between requests by the bot. Allow is the opposite of Disallow, and is helpful for letting a particular bot into a location where you want others to stay out, or for adding an exception for a file in a folder that you've disallowed. And then Sitemap, of course, we talked about earlier.

A note about crawl delay: this applies to all requests, whether the bot is actually crawling your site or requesting pages on your sitemap. Ten is the crawl delay I've seen most often, and I think it must be the default in some content management systems, but if your site is really big, it can have a huge impact on the crawl of your site. It's basic math: if you've got 500,000 pages and a 10-second crawl delay, it would take almost 58 days to get through the whole site just one time. So take a look at what you've got and do some math against your content. Again, Google and Bing don't promise to crawl the entirety of sites. We do, though, so it's something to consider; to get around this, we've been requesting that sites let us work in their environment more rapidly than the other bots. You can add different settings for different bots, like this: a User-agent line for our bot with a lower crawl delay, and then a User-agent line for anybody else with a crawl delay of ten. I mentioned that you can place robots
robots tags in the page-level in the head this is done through a meta tag and
you can tell about to not index the page or follow the links on the page or you
could tell it to follow links but not index the page or to index it but to not
follow the links and so that’s what these they’re basically the default is
index and follow and so there are three alternatives that you might set in the
head of the page no index no follow no index follow index no follow no index
follow is a really great setting to put on your content index pages like for a
given tag or list where the items on the list would be good search results but
the list itself wouldn’t be that helpful as a result and there are also some
additional meta tags that Google pays attention to this is what the question
about earlier was about so the reference for that is goes along with this page all right what could go wrong again mind
All right, what could go wrong? Again, mind the slash. You need to be really careful with your slashes, because in robots.txt they have different meanings at different levels. Here we've got "Disallow: /", Disallow with a path and no slash, and Disallow with a path plus a slash on the end. The first one means don't crawl anything at all. The second one blocks everything within that folder. The third one says you could still get topic1.html, or topic1 with whatever other file extension you have, but not anything below the slash. So be careful; as I said, we will pay attention to this in our indexing.
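Spelled out, the three patterns might look like this in robots.txt ("topic1" is a placeholder path):

```text
# Block the entire site
Disallow: /

# Block everything whose path starts with /topic1,
# including /topic1/... and /topic1.html
Disallow: /topic1

# Block only what's inside the folder: /topic1/anything is blocked,
# but /topic1.html is still allowed
Disallow: /topic1/
```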
Another way you need to be careful with the slashes is whether you're allowing or disallowing everything. Adding just a slash means everything, and leaving the value blank means nothing. So "Disallow: /" is the same as "Allow:" left blank: disallow everything, allow nothing. And "Disallow:" left blank, or "Allow: /", means the site is totally open.
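Side by side, the two configurations differ only in that single slash:

```text
# Blocks EVERYTHING: a bare "/" matches the whole site
User-agent: *
Disallow: /

# Allows EVERYTHING: an empty Disallow value blocks nothing
User-agent: *
Disallow:
```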
We've had people accidentally block indexing of their whole site thinking they were allowing indexing of their whole site.

All right, more pitfalls: please try not to be sloppy. I know we all do our best, but try hard, and don't get creative. As I mentioned, the computers are dumb and they won't be able to infer what you're talking about, so be sure to follow the field names that are in the protocol, and watch your spaces and your colons and all of that. Have we gotten any questions in so far about robots? No? Okay.

All right, so what about multiple publishing platforms?
Simply put, you want to follow the protocols: each system's subdomain will have its own robots.txt and its own sitemap.xml or sitemap index. If you use different publishing platforms, get as much of this automated as you possibly can, so you want to use a sitemap generator. Drupal has a module that's more or less standard; Yoast has a WordPress plugin and a Drupal module that you could look at; and static sites have tools as well, like the Jekyll sitemap gem we use on GitHub Pages to make ours. If you have to, you could generate a sitemap manually using your SEO tool: after they crawl your site, most of them are able to display what they found in the format of an XML sitemap, which you can then save and post to the web server, and then just lather, rinse, repeat as often as you want to have your sitemap updated.
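Whatever generator you use, the output follows the same XML sitemap format; a minimal example might look like this (URLs and dates are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://agency.gov/topic1.html</loc>
    <lastmod>2019-06-01</lastmod>
  </url>
  <url>
    <loc>https://agency.gov/topic2.html</loc>
  </url>
</urlset>
```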
If you don't have a content management system, but you do have a publication log or other inventory that you keep updated as you push things out, you might be able to write a script that generates a sitemap when new entries appear in that log.
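A rough sketch of that idea in Python, assuming a CSV publication log with "url" and "date" columns (the column names and file layout are invented for illustration):

```python
import csv
import xml.etree.ElementTree as ET

def sitemap_from_log(log_path, sitemap_path):
    """Build a sitemap.xml from a CSV publication log with url/date columns."""
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = row["url"]
            ET.SubElement(url, "lastmod").text = row["date"]
    # Write the sitemap with an XML declaration, ready to post at the web root
    ET.ElementTree(urlset).write(sitemap_path, encoding="utf-8",
                                 xml_declaration=True)
```

You could run a script like this on a schedule, or trigger it whenever a new entry lands in the log.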
That's something to think about. Some sites have multiple systems supporting the same subdomain: say the main site has migrated to Drupal, but there are a bunch of folders that are still static, and they will be for the foreseeable future. Each system should have its own sitemap, placed at the root for that system. So you might have one at agency.gov/sitemap.xml, another at agency.gov/subtopic-a/sitemap.xml, and the last one at agency.gov/subtopic-b/sitemap.xml. You would then list all three of those in the robots.txt file for the domain, one per line, like we were talking about earlier.
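The resulting robots.txt for the domain might list them like this (the subtopic folder names are illustrative):

```text
User-agent: *
Disallow:

Sitemap: https://agency.gov/sitemap.xml
Sitemap: https://agency.gov/subtopic-a/sitemap.xml
Sitemap: https://agency.gov/subtopic-b/sitemap.xml
```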
So what's the relationship between sitemaps and search? This question is why I wanted to do this session: I've gotten a lot of questions asking how, in Search.gov, the sitemap controls the search. As we've established, sitemaps support the indexing side of search engines, and people querying the search engine is actually a totally separate process. So let's go back to the pictures we were using earlier. Now, instead of Google, the search engine is Search.gov, and in between is the Search.gov Admin Center, where you have gone in to declare what you want to include in your search. For those of you familiar with Google Site Search, this is just like logging into the Google Custom Search admin UI and telling it that you want to include this domain; that's just what you do in the Search.gov Admin Center. And just like Google, we will discover the URLs on your website, pull data from them, and stick it in our indexes along with the content from other agencies: we leverage your sitemaps, we crawl if needed, and we honor your
robots.txt settings.

So here comes Allison, and she wants to check on the status of her application. She knows what agency she's dealing with, so she goes right to their website, uses the search box on that website, and types in "application status". The difference between now and when she was searching Google for passport application info is that the agency whose site she's searching works with us and has used our Admin Center to set up a search configuration that will target only the relevant content from our big index. So the query gets passed to us from the website, we process the query according to the search settings they've put in place, we show her a page of results, and then she selects the status check tool and is brought back to the agency's website to get her status.
So the relationship between sitemaps and your search configurations is minimal: the sitemaps support indexing, and they will inform what is in the index to be searched, but the Admin Center settings control what exactly a person is searching against when they use your search box. And that brings us to the end of the presentation. Did we get any questions, or do we want to take a moment to let people type some in?
We do have a question. Okay, great. David Kaplan from Jessie is asking, "Would you use noindex, follow for parent navigation pages that link to children but don't have content themselves?" Yes, absolutely, that's exactly the case: if you've got anything that is solely a navigational element, meant to help people find their way through your site, that page would get a noindex, follow tag. And note it's not "yes follow", it's just "follow". But yeah, very good. No other questions at this point? All right, well.
So, I covered a lot in this session. We talked about search engines discovering URLs and parsing them into their indexes; we talked about sitemaps and robots.txt; and we talked about indexing and search being flip sides of the same coin. If any questions come up in the next while, of course reach out, you know where to find us, and we look forward to hearing from you. Thanks very much.
