Understanding the techniques used to
capture this meaning helps to provide better signals as to what our content
relates to, and ultimately helps it to rank higher in search results. This post
explores a series of on-page techniques that not only build upon one
another, but can be combined in sophisticated ways.
While Google doesn't reveal the
exact details of its algorithm, over the years we've collected evidence from
interviews, research papers,
As you read, keep in mind these are
only some of the ways in which Google could determine on-page relevancy,
and they aren't absolute law! Experimenting on your own is always the best
policy.
We'll start with the simple, and
move to the more advanced.
1.
Keyword Usage
In the beginning, there were
keywords. All over the page.
The concept was this: If your page
focused on a certain topic, search engines would discover keywords in important
areas. These locations included the title tag, headlines, alt attributes of
images, and throughout the text. SEOs helped their pages rank by placing
keywords in these areas.
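As a rough illustration of that idea, here is a minimal sketch that checks which of those areas contain a keyword. It uses Python's standard `html.parser`. The names `AreaCollector` and `keyword_locations` are our own inventions for this example, and this is in no way how a search engine actually does it.

```python
from html.parser import HTMLParser

HEADLINES = {"h1", "h2", "h3", "h4", "h5", "h6"}

class AreaCollector(HTMLParser):
    """Sorts page text into title, headline, alt, and body buckets."""

    def __init__(self):
        super().__init__()
        self.areas = {"title": [], "headline": [], "alt": [], "body": []}
        self._area = "body"  # anything outside a title or headline counts as body

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.areas["alt"].append(alt)
        elif tag == "title":
            self._area = "title"
        elif tag in HEADLINES:
            self._area = "headline"

    def handle_endtag(self, tag):
        if tag == "title" or tag in HEADLINES:
            self._area = "body"

    def handle_data(self, data):
        if data.strip():
            self.areas[self._area].append(data.strip())

def keyword_locations(html, keyword):
    """Return {area: True/False} showing where the keyword appears."""
    collector = AreaCollector()
    collector.feed(html)
    collector.close()
    needle = keyword.lower()
    return {area: needle in " ".join(texts).lower()
            for area, texts in collector.areas.items()}

page = ('<html><head><title>Dog Photos</title></head><body>'
        '<h1>Best dog photos</h1><img src="a.jpg" alt="puppy">'
        '<p>Lots of pictures here.</p></body></html>')
print(keyword_locations(page, "dog photos"))
```

A simple substring match like this treats every area equally, which is exactly the limitation the rest of this post moves beyond.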
Even today, we start with keywords,
and keyword usage remains the most basic form of on-page optimization.

While it's important to ensure your
page at a bare minimum contains the keywords you want to rank for, it is
unlikely that keyword placement by itself will have much of an influence on
your page's ranking potential.
2.
TF-IDF
It's not keyword density, it's term
frequency–inverse document frequency (TF-IDF).
Google researchers recently described TF-IDF as "long used to
index web pages" and variations of TF-IDF appear as a component in several
well-known Google patents.
TF-IDF doesn't simply measure how often a
keyword appears. It offers a measurement of importance by comparing how
often a keyword appears on a page with how often we'd expect it to appear,
based on a larger set of documents.
If we compare the phrase
"basket" with "basketball player" in Google's Ngram viewer, we see that
"basketball player" is rarer, while "basket" is more
common. Based on this frequency, we might conclude that "basketball
player" is significant on a page that contains that term, while the
threshold for "basket" remains much higher.
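To make the comparison concrete, here is a minimal sketch of one common textbook form of TF-IDF (term frequency multiplied by a smoothed inverse document frequency). The toy corpus and the `tf_idf` function are assumptions made up for this example. It handles single words only, so "basketball" stands in for "basketball player", and it is not the variant any particular search engine uses.

```python
import math
from collections import Counter

def tf_idf(term, document, corpus):
    """Term frequency in `document` times smoothed IDF across `corpus`."""
    words = document.lower().split()
    tf = Counter(words)[term] / len(words)
    # Number of corpus documents that contain the term at least once
    containing = sum(1 for doc in corpus if term in doc.lower().split())
    # Smoothed IDF: rarer terms get a larger multiplier
    idf = math.log((1 + len(corpus)) / (1 + containing)) + 1
    return tf * idf

# Toy corpus: "basket" is common, "basketball" is rare
corpus = [
    "the basket is full",
    "a basket of fruit",
    "the basketball player scored",
    "put it in the basket",
]
page = "the basketball player dropped the basket"

print(tf_idf("basketball", page, corpus))
print(tf_idf("basket", page, corpus))
```

Both words appear once on the page, yet the rarer word earns the higher score, which is the intuition behind the Ngram comparison above.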

For SEO purposes, when we measure
TF-IDF's correlation with higher rankings, it performs
only moderately better than individual keyword usage. In other words,
generating a high TF-IDF score by itself generally isn't enough to expect much
of an SEO boost. Instead, we should think of TF-IDF as an important component
of other more advanced on-page concepts.
3.
Synonyms and Close Variants
With over 6 billion searches per
day, Google has a wealth of information to determine what searchers actually
mean when typing queries into a search box. Google's own research shows
that synonyms actually play a role in up to 70% of searches.
To handle this variation in wording, search
engines maintain vast corpora of synonyms and close variants for
billions of phrases, which allows them to match content to queries even when
searchers use different words than your text does. An example is the query dog pics, which can
mean the same thing as:
• Dog Photos
• Pictures of Dogs
• Dog Pictures
• Canine Photos
• Dog Photographs
On the other hand, the query Dog
Motion Picture means something else entirely, and it's important for search
engines to know the difference.
From an SEO point of view, this
means creating content using natural language and variations,
instead of employing the same strict keywords over and over again.
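Here is a toy sketch of how close variants might collapse to the same meaning. The `CANONICAL` table is hand-written and hypothetical; real search engines learn these relationships from enormous amounts of query data rather than from a lookup table like this one.

```python
# Hypothetical, hand-written variant table for illustration only
CANONICAL = {
    "pics": "photo", "photos": "photo", "photographs": "photo",
    "picture": "photo", "pictures": "photo",
    "dogs": "dog", "canine": "dog",
}
STOPWORDS = {"of", "the", "a"}

def normalize(query):
    """Reduce a query to a set of canonical terms, ignoring word order."""
    tokens = [t for t in query.lower().split() if t not in STOPWORDS]
    return frozenset(CANONICAL.get(t, t) for t in tokens)

def same_meaning(query_a, query_b):
    return normalize(query_a) == normalize(query_b)

for variant in ["Dog Photos", "Pictures of Dogs", "Canine Photos", "Dog Motion Picture"]:
    print(variant, same_meaning("dog pics", variant))
```

Notice that "Dog Motion Picture" does not match, because the extra term "motion" survives normalization and changes the set.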

Under Hummingbird, co-occurrence is
used to identify words that may be synonyms of each other in certain contexts.
The process follows certain rules, according to which a page is more likely
to be selected in response to a query where such a substitution has taken
place.
4.
Page Segmentation
Where you place your words on a page is often as important as the
words themselves.
Each web page is made up of
different parts—headers, footers, sidebars, and more. Search engines have long
worked to determine the most important part of a given page. Both Microsoft and
Google hold several patents suggesting that content
in the more relevant sections of HTML carries more weight.
Content located in the main body
text likely holds more importance than text placed in sidebars or alternative
positions. Repeating text placed in boilerplate locations, or chrome, runs the
risk of being discounted even more.
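One way to picture this is to weight each keyword occurrence by the section of the page it sits in. The sketch below does that with Python's standard `html.parser`. The weights in `SECTION_WEIGHTS` are numbers we made up purely for illustration; nobody outside the search engines knows the real values, or whether segmentation even works this way.

```python
from html.parser import HTMLParser

# Hypothetical weights chosen only for illustration
SECTION_WEIGHTS = {"main": 1.0, "article": 1.0, "aside": 0.3,
                   "header": 0.2, "nav": 0.1, "footer": 0.1}
DEFAULT_WEIGHT = 0.5  # text outside any recognized section

class SegmentScorer(HTMLParser):
    """Adds up term occurrences, weighted by the enclosing page section."""

    def __init__(self, term):
        super().__init__()
        self.term = term.lower()
        self.score = 0.0
        self._sections = []  # stack of enclosing sectioning elements

    def handle_starttag(self, tag, attrs):
        if tag in SECTION_WEIGHTS:
            self._sections.append(tag)

    def handle_endtag(self, tag):
        if self._sections and self._sections[-1] == tag:
            self._sections.pop()

    def handle_data(self, data):
        if self._sections:
            weight = SECTION_WEIGHTS[self._sections[-1]]
        else:
            weight = DEFAULT_WEIGHT
        self.score += weight * data.lower().count(self.term)

def weighted_term_score(html, term):
    scorer = SegmentScorer(term)
    scorer.feed(html)
    scorer.close()
    return scorer.score

page = "<main><p>labrador puppies</p></main><footer>labrador labrador</footer>"
print(weighted_term_score(page, "labrador"))
```

In this example, one mention in the main content outweighs two mentions in the footer.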

Page segmentation becomes
significantly more important as we move toward mobile devices, which
often hide portions of the page. Search engines want to serve users the portions
of your pages that are visible and important, so text in these areas deserves
the most focus.
5.
Semantic Distance and Term Relationships
When talking about on-page
optimization, semantic distance refers to the relationships
between different words and phrases in the text. This differs from the physical
distance between phrases, and focuses on how terms connect within sentences,
paragraphs, and other HTML elements.
How do search engines know that
"Labrador" relates to "dog breeds" when the two phrases
aren't in the same sentence?
Search engines solve this problem by
measuring the distance between different words and phrases within
different HTML elements. The smaller the semantic distance between two
concepts, the more closely they may be related. Phrases located in the same
paragraph are closer semantically than phrases separated by several blocks of text.

Additionally, HTML elements may
shorten the semantic distance between concepts, pulling them closer together.
For example, list items can be considered equally distant to one
another, and "the title of a document may be
considered to be close to every other term in document".
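The rules described above can be sketched in a few lines. In this toy model, a page is a list of `(kind, text)` blocks, the title sits one step away from everything, items in the same list sit one step away from each other, and everything else is separated by the number of blocks between them. The block model and the distance values are our own simplifying assumptions, not anything taken from a patent.

```python
def block_distance(blocks, i, j):
    """Distance between block i and block j under the toy rules."""
    if i == j:
        return 0
    lo, hi = sorted((i, j))
    kinds = [kind for kind, _ in blocks[lo:hi + 1]]
    if "title" in (kinds[0], kinds[-1]):
        return 1  # the title is treated as close to every other block
    if all(kind == "li" for kind in kinds):
        return 1  # items in the same run of list items are equally distant
    return hi - lo  # otherwise, count the blocks in between

def semantic_distance(blocks, term_a, term_b):
    """Smallest block distance between any occurrences of the two terms."""
    def find(term):
        return [i for i, (_, text) in enumerate(blocks)
                if term.lower() in text.lower()]
    pairs = [(i, j) for i in find(term_a) for j in find(term_b)]
    if not pairs:
        return None  # at least one term is missing from the page
    return min(block_distance(blocks, i, j) for i, j in pairs)

page = [
    ("title", "Dog Breeds"),
    ("p", "Intro text about pets."),
    ("p", "More filler."),
    ("p", "The Labrador is friendly."),
    ("li", "Poodle"),
    ("li", "Beagle"),
    ("li", "Terrier"),
]
print(semantic_distance(page, "Labrador", "dog breeds"))  # 1
print(semantic_distance(page, "pets", "Labrador"))        # 2
print(semantic_distance(page, "Poodle", "Terrier"))       # 1
```

Even though "Labrador" appears three blocks below the title, the title rule keeps the two phrases close together.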
Now is a good time to mention
Schema.org. Schema markup provides a
way to semantically structure portions of your text in a manner that explicitly
defines the relationships between terms.
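For a feel of what this looks like, here is a small sketch that builds a JSON-LD snippet using the Schema.org `Article` type with its `headline`, `about`, and `author` properties. The headline, topic, and author name are placeholder values invented for this example, and `to_json_ld_script` is our own helper, not part of any library.

```python
import json

# Placeholder values; only the property names come from Schema.org
markup = {
    "@context": "https://schema.org",
    "@type": "Article",
    "headline": "Choosing a Labrador",
    "about": {"@type": "Thing", "name": "Labrador Retriever"},
    "author": {"@type": "Person", "name": "Jane Doe"},
}

def to_json_ld_script(data):
    """Wrap a dict as a JSON-LD script element for a page's HTML."""
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"

print(to_json_ld_script(markup))
```

Here the page states outright that the article is about a Labrador Retriever, rather than leaving that relationship to be inferred from the text.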
The great advantage schema offers is
that it leaves no guesswork for the search engines. Relationships are clearly
defined. The challenge is it requires webmasters to employ special
6.
Co-occurrence and Phrase-Based Indexing
Up to this point, we've discussed
individual keywords and relationships between them. Search engines also
index pages based on complete phrases, and rank
pages on the relevance of those phrases.
What's most interesting about this
process is not how Google determines the important phrases for a webpage, but
how Google can use these phrases to rank a webpage based on how relevant
they are.
Using the concept of co-occurrence,
search engines know that certain phrases tend to predict other phrases.
If your main topic targets "John Oliver," this phrase often co-occurs
with other phrases like "late night comedian," "Daily
Show," and "HBO." A page that contains these related terms is
more likely to be about "John Oliver" than a page that doesn't
contain related terms.
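A minimal sketch of the idea: first learn which phrases tend to appear alongside a topic in a set of documents, then score a page by how many of those phrases it contains. The tiny corpus, the candidate vocabulary, and the 50% threshold are all assumptions made up for this example. Real phrase-based indexing works at a vastly larger scale and with more sophisticated statistics.

```python
from collections import Counter

def related_phrases(topic, corpus, vocabulary, min_share=0.5):
    """Phrases found in at least `min_share` of the documents mentioning `topic`."""
    topic_docs = [doc.lower() for doc in corpus if topic.lower() in doc.lower()]
    if not topic_docs:
        return set()
    counts = Counter(phrase for doc in topic_docs
                     for phrase in vocabulary if phrase.lower() in doc)
    return {phrase for phrase, n in counts.items()
            if n / len(topic_docs) >= min_share}

def cooccurrence_score(page, related):
    """Share of the related phrases that appear on the page."""
    if not related:
        return 0.0
    text = page.lower()
    return sum(1 for phrase in related if phrase.lower() in text) / len(related)

corpus = [
    "John Oliver is a late night comedian on HBO",
    "John Oliver left the Daily Show for HBO",
    "HBO renewed the late night comedian John Oliver",
    "A recipe for olive bread",
]
vocabulary = ["late night comedian", "Daily Show", "HBO", "bread"]

related = related_phrases("John Oliver", corpus, vocabulary)
print(sorted(related))
print(cooccurrence_score("John Oliver, the late night comedian, returns to HBO", related))
print(cooccurrence_score("John Oliver visited the zoo", related))
```

Both test pages mention "John Oliver," but only the first contains the phrases that usually travel with that name.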

Add to this incoming links
from pages with related, co-occurring phrases and you've given your page
powerful contextual signals.
7.
Entity Salience
Looking to the future, search
engines are exploring ways of using relationships between entities, not just
keywords, to determine topical relevance.
Entity salience goes beyond
traditional keyword techniques, like TF-IDF, for finding relevant terms in a
document by leveraging known relationships between entities. An entity
is anything in the document that is distinct and well defined.
The stronger an entity's
relationship to other entities on the page, the more significant that entity
becomes.

In the diagram above, an article
contains the topics Iron Man, Tony Stark, Pepper Potts and
Science Fiction. The phrase "Marvel Comics" has a
strong entity relationship to all these terms. Even if it only appears once, it's
likely significant in the document.
On the other hand, even though the
phrase "Cinerama" appears multiple times (because the film showed
there), this phrase has weaker entity relationships, and likely isn't as
significant.
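To see why a single mention can outweigh several, here is a toy sketch that scores each entity by its mention count plus the strength of its relationships to other entities on the page. The mention counts, relationship strengths, and `relation_weight` are all invented for illustration. This is a simplified stand-in for the idea, not the salience model described in Google's research.

```python
def entity_salience(mentions, relations, relation_weight=2.0):
    """Score = mention count + weighted strength of ties to other on-page entities."""
    scores = {}
    for entity, count in mentions.items():
        related = 0.0
        for (a, b), strength in relations.items():
            if entity not in (a, b):
                continue
            other = b if a == entity else a
            if other in mentions:  # only ties to entities on this page count
                related += strength
        scores[entity] = count + relation_weight * related
    return scores

# Hypothetical mention counts for the article in the example
mentions = {
    "Iron Man": 5, "Tony Stark": 4, "Pepper Potts": 2,
    "Science Fiction": 1, "Marvel Comics": 1, "Cinerama": 3,
}
# Hypothetical relationship strengths between entity pairs (0 to 1)
relations = {
    ("Marvel Comics", "Iron Man"): 1.0,
    ("Marvel Comics", "Tony Stark"): 1.0,
    ("Marvel Comics", "Pepper Potts"): 1.0,
    ("Marvel Comics", "Science Fiction"): 0.5,
    ("Iron Man", "Tony Stark"): 1.0,
    ("Cinerama", "Iron Man"): 0.1,
}

scores = entity_salience(mentions, relations)
print(scores["Marvel Comics"], scores["Cinerama"])
```

With these made-up numbers, "Marvel Comics" outscores "Cinerama" despite appearing only once, because it connects strongly to nearly everything else on the page.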