Fundamentals of SEO in the age of AI
As we all know, in the previous era Google's ranking worked in two stages: first retrieval via the inverted index, then ranking by PageRank score. Today many people say PageRank is outdated and that you no longer need to build backlinks; the SEO figure nicknamed "Big Bald Head" says Google has more than 200 ranking factors, and some people say that is too many. So today, from an algorithm engineer's point of view, let's talk about the basic principles of SEO in the AI era.
The basic logic of SEO in the age of AI
- Perform semantic search.
- Query the inverted index.
- For the pages retrieved from the inverted index, let a deep learning algorithm such as BERT predict a relevance probability for each page, then rank the pages using both the inverted-index score and the predicted probability.
Semantic search
Semantic search is primarily used to expand keywords. Suppose a user enters the query "best smartphone" into Google. In this example, we will see how semantic search uses deep learning techniques to improve the relevance of search results.
- Word embedding: Google first converts the words in the query ("best", "smartphone") into word vectors. Word embeddings capture the semantic relationships between words, so that words with similar meanings have similar vector representations. For example, word embeddings capture the relationship between "smartphone" and "cell phone".
- Query expansion: Using word embeddings, Google can find other words that are semantically similar to the words in the query. For example, "best" can be expanded to "top-notch", "first-class", etc., and "smartphone" can be expanded to "cell phone", "mobile device", etc. In this way, Google can widen the search scope and improve the relevance of search results.
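To make this concrete, here is a minimal sketch of embedding-based query expansion. The toy 3-dimensional vectors and the expand_query helper are invented for illustration; a real system would use embeddings trained on a large corpus (e.g., word2vec or BERT) and a tuned similarity threshold.

```python
import math

# Toy word vectors; real systems use embeddings trained on large corpora.
EMBEDDINGS = {
    "best":       [0.9, 0.1, 0.0],
    "top-notch":  [0.85, 0.15, 0.05],
    "smartphone": [0.1, 0.9, 0.3],
    "cell phone": [0.12, 0.88, 0.28],
    "banana":     [0.0, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def expand_query(term, threshold=0.95):
    """Return terms whose vectors are close to the query term's vector."""
    vec = EMBEDDINGS[term]
    return [w for w, v in EMBEDDINGS.items()
            if w != term and cosine(vec, v) >= threshold]

print(expand_query("smartphone"))  # ['cell phone']
print(expand_query("best"))        # ['top-notch']
```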
Inverted index
Although it's an old standby, the inverted index still deserves discussion.
An inverted index is a data structure used by search engines such as Google to quickly find web pages that contain specific terms. Here is how Google uses the inverted index to find web pages from keywords:
- Web crawling: Google first crawls web pages on the Internet with its web crawler (Googlebot). The crawler follows hyperlinks from page to page and gradually crawls the entire web.
- Web page preprocessing: Each crawled page is preprocessed: HTML tags, JavaScript code, etc. are removed and the plain text content is extracted. The text is then tokenized, i.e., broken into words or phrases. English text is usually split on whitespace; other languages such as Chinese may require more complex processing.
- Index construction: The tokenized terms are processed to build the inverted index, a mapping from each term to the list of
documents containing that term. During index construction, the search engine also performs stemming, stopword filtering, and other operations to reduce the size of the index and improve search efficiency.
- Query processing: When a user enters keywords, Google preprocesses the query, e.g., tokenization and synonym replacement, and then looks up documents containing the resulting terms in the inverted index.
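Here is a minimal sketch of building and querying an inverted index in Python; the three toy documents are invented for illustration:

```python
from collections import defaultdict

docs = {
    "Doc1": "apple releases a new iphone",
    "Doc2": "apple opens a new store",
    "Doc3": "best cell phone of the year",
}

# Build: map each term to the set of documents that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# Query: intersect the posting lists of all query terms.
def search(query):
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(search("apple new"))  # {'Doc1', 'Doc2'}
```

Real posting lists also store term weights and positions, as described later, but the lookup-and-intersect core is the same.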
Now let's elaborate on each of these techniques.
Web crawler
Google's web crawler (Googlebot) is an automated program that crawls web pages on the Internet so the search engine can index them. Below is a detailed account of how it crawls:
- Seed URLs: The crawler starts from a set of seed URLs, which come from previous crawl results, sitemaps submitted by webmasters, newly discovered URLs, and so on. The seed URLs are the starting point of the crawl.
- URL queue: The crawler maintains a queue of URLs to be crawled; initially the queue contains the seed URLs. The crawler takes URLs off the queue for subsequent processing.
- HTTP requests: For each URL taken from the queue, the crawler issues an HTTP request (e.g., a GET request). The server responds to the request and returns the HTML source of the page.
- Content processing: After receiving the HTML source, the crawler parses it and extracts the hyperlinks on the page (typically the href attributes of <a> tags). These hyperlinks are the primary way crawlers discover new URLs.
- URL normalization: Extracted hyperlinks are normalized: fragment identifiers are removed, relative URLs are converted to absolute URLs, URL encodings are decoded, and so on. This eliminates superficial variation between URLs and reduces the chance of repeated crawling.
- URL de-duplication: To avoid crawling the same pages repeatedly, the crawler de-duplicates newly discovered URLs, usually by maintaining a set of visited URLs or a Bloom filter.
- Speed limits and delays: To comply with each site's crawling policy (e.g., its robots.txt file) and avoid overloading servers, the crawler limits its crawl rate, for example by setting a delay between requests or capping the number of concurrent requests.
- Updating the URL queue: De-duplicated new URLs are added to the queue of URLs to be crawled.
- Storing web pages: The content of crawled pages is stored for the subsequent indexing and analysis stages.
- Looping: The crawler keeps taking new URLs from the queue and repeating this process until the queue is empty or some stopping condition is reached.
Through this process, Google's web crawlers can crawl a huge number of pages on the Internet and feed their content into the search engine's indexing pipeline. To keep the index fresh, Google periodically re-crawls and updates pages it has already crawled.
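Here is a minimal, single-threaded sketch of this crawl loop, assuming the requests and beautifulsoup4 packages are installed; URL normalization, robots.txt handling, and persistent storage are reduced to their simplest form for illustration:

```python
import time
from collections import deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=50, delay=1.0):
    queue = deque(seed_urls)          # URL queue, initialized with seeds
    visited = set(seed_urls)          # de-duplication set
    pages = {}                        # stored page content

    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        pages[url] = resp.text        # store page for later indexing

        # Extract hyperlinks (href attributes of <a> tags).
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            # Normalize: resolve relative URLs, drop fragment identifiers.
            link, _ = urldefrag(urljoin(url, a["href"]))
            if link.startswith("http") and link not in visited:
                visited.add(link)
                queue.append(link)

        time.sleep(delay)             # politeness delay
    return pages
```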
Page preprocessing
The Google search engine needs to pre-process the content of crawled web pages before
indexing them. This preprocessing process helps extract valuable information, reduces the
complexity of indexing, and improves search results. Below is the detailed process of how Google
preprocesses web pages:
- Character encoding detection: Since web pages may use different character encodings, such as UTF-8 or GBK, Google first needs to detect the character encoding of the page. This can be done by parsing the charset attribute in the HTML meta tag or by using an encoding-detection algorithm.
- HTML parsing: Google parses the HTML source to build a DOM (Document Object Model) tree representing the structure of the page. This includes identifying HTML tags, attributes, nesting relationships, and so on.
- Removing extraneous content: To extract plain text, Google removes extraneous content such as HTML tags, JavaScript code, and CSS styles from the DOM tree. This reduces indexing complexity and focuses on the page's actual textual content.
- Extracting valuable information: During preprocessing, Google also extracts valuable information from the page, such as the title, meta information, headings, and hyperlinks. This information helps in understanding the theme and structure of the page and plays an important role in ranking and displaying search results.
- Text segmentation: Google tokenizes the extracted plain text, breaking it down into words or phrases. English text is usually split on whitespace; other languages such as Chinese may require more complex processing (a minimal sketch of these text-processing steps follows this list).
- Stemming and lemmatization: To reduce lexical diversity, Google applies stemming or lemmatization to the tokenized terms. These operations unify different forms of a word into its base form and improve search results.
- Stopword filtering: Google filters out common words that contribute little to search quality, such as "the", "and", and "is". This reduces the size of the index and improves search efficiency.
- Generating keyword vectors: Google may use the tokenized terms to generate a keyword vector for each page for subsequent search, ranking, and other operations. Keyword vectors can be expressed as term frequency (TF) or term frequency-inverse document frequency (TF-IDF), among other forms.
- Entity recognition and linking: During preprocessing, Google also performs entity recognition and linking, which identifies entities in the text (e.g., names of people, places, and organizations) and links them to relevant knowledge bases (e.g., the Google Knowledge Graph). This helps in understanding the semantics of page content and improves the relevance of search results.
- NLP analysis: Google may apply natural language processing (NLP) techniques beyond segmentation, such as syntactic analysis and sentiment analysis, to extract richer semantic information and further understand the themes and viewpoints of page content.
- Image, video, and other media processing: For images, videos, and other media embedded in pages, Google may use computer vision and related technologies to extract relevant information. This enriches search results and provides users with more diverse content.
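Below is a minimal sketch of the text-oriented steps (tag stripping, tokenization, stopword filtering, crude stemming) using only the Python standard library; the tiny stopword list and suffix-stripping rule are simplifications for illustration, not Google's actual pipeline:

```python
import re
from collections import Counter

STOPWORDS = {"the", "and", "is", "a", "of", "its"}  # tiny illustrative list

def strip_tags(html):
    """Remove script/style blocks and HTML tags, keeping plain text."""
    html = re.sub(r"<(script|style)[^>]*>.*?</\1>", " ", html, flags=re.S | re.I)
    return re.sub(r"<[^>]+>", " ", html)

def stem(token):
    """Crude suffix stripping; real systems use e.g. the Porter stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(html):
    text = strip_tags(html).lower()
    tokens = re.findall(r"[a-z0-9]+", text)             # tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword filtering
    return [stem(t) for t in tokens]                    # stemming

html = "<html><body><h1>Apple Inc.</h1><p>Apple is selling the new iPhone.</p></body></html>"
print(Counter(preprocess(html)))
# Counter({'apple': 2, 'inc': 1, 'sell': 1, 'new': 1, 'iphone': 1})
```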
Through these preprocessing steps, Google extracts the key information from the original page and generates data suitable for search engine indexing, ranking, and display. This helps improve search results, achieve more accurate matches, and provide users with high-quality results.
Here is an example of a keyword vector:
Suppose we have an article about Apple Inc., and after preprocessing and tokenization the following (partial) list of terms is extracted:

["apple", "company", "technology", "iphone", "innovation", "steve", "jobs", "apple", "store", "iphone", "sales", "revenue"]

Next, we can calculate the term frequency (TF) of each term and get:
```
{
  "apple": 2,
  "company": 1,
  "technology": 1,
  "iphone": 2,
  "innovation": 1,
  "steve": 1,
  "jobs": 1,
  "store": 1,
  "sales": 1,
  "revenue": 1
}
```
Assuming we have a collection of documents containing multiple articles, we can compute the inverse document frequency (IDF) of each term. Here, suppose we obtain the following IDF values:
```
{
  "apple": 1.5,
  "company": 2.0,
  "technology": 1.7,
  "iphone": 1.8,
  "innovation": 2.2,
  "steve": 2.5,
  "jobs": 2.5,
  "store": 2.3,
  "sales": 1.9,
  "revenue": 2.1
}
```
Then we can calculate the TF-IDF value of each term, i.e., the term frequency (TF) multiplied by the inverse document frequency (IDF):
```
{
  "apple": 2 * 1.5 = 3.0,
  "company": 1 * 2.0 = 2.0,
  "technology": 1 * 1.7 = 1.7,
  "iphone": 2 * 1.8 = 3.6,
  "innovation": 1 * 2.2 = 2.2,
  "steve": 1 * 2.5 = 2.5,
  "jobs": 1 * 2.5 = 2.5,
  "store": 1 * 2.3 = 2.3,
  "sales": 1 * 1.9 = 1.9,
  "revenue": 1 * 2.1 = 2.1
}
```
Finally, we get the keyword vector for this article, which represents the article's topics and their weights:
```
{
  "apple": 3.0,
  "company": 2.0,
  "technology": 1.7,
  "iphone": 3.6,
  "innovation": 2.2,
  "steve": 2.5,
  "jobs": 2.5,
  "store": 2.3,
  "sales": 1.9,
  "revenue": 2.1
}
```
Keyword vectors can be used in search, ranking, clustering, and other operations to help improve the accuracy and efficiency of a search engine.
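As a sanity check, here is a short sketch that reproduces the TF and TF-IDF numbers above; the IDF values are taken as given from the example rather than computed from a real corpus:

```python
from collections import Counter

terms = ["apple", "company", "technology", "iphone", "innovation",
         "steve", "jobs", "apple", "store", "iphone", "sales", "revenue"]

# IDF values assumed from the example above.
idf = {"apple": 1.5, "company": 2.0, "technology": 1.7, "iphone": 1.8,
       "innovation": 2.2, "steve": 2.5, "jobs": 2.5, "store": 2.3,
       "sales": 1.9, "revenue": 2.1}

tf = Counter(terms)                      # term frequency
tfidf = {t: tf[t] * idf[t] for t in tf}  # TF * IDF

print(tfidf["iphone"])  # 2 * 1.8 = 3.6
```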
Indexing
Now let's explain in detail, from a technical point of view, how Google builds its index:
Through preprocessing and analysis, Google converts web page content into structured data suitable for a search engine. On this basis, Google constructs an index to enable fast retrieval and ranking. The following is a detailed account of the index-building process, continuing with the earlier example.
- Inverted index: Google uses an inverted index as its core indexing structure, which allows quick lookup of all documents that contain a particular term. The inverted index is a mapping whose key is a term and whose value is the list of documents containing that term, along with related information. For example, we can build a simple inverted index from the keyword vector above:
```
{
  "apple": ["Doc1"],
  "company": ["Doc1"],
  "technology": ["Doc1"],
  "iphone": ["Doc1"],
  "innovation": ["Doc1"],
  "steve": ["Doc1"],
  "jobs": ["Doc1"],
  "store": ["Doc1"],
  "sales": ["Doc1"],
  "revenue": ["Doc1"]
}
```
In a real index, each term is associated with the many documents that contain it, along with information such as the term's weight and positions within each document.
- Document index: To speed up document retrieval and ranking, Google may also build a document index for each document. The document index contains metadata about the document, such as its URL, title, description, and keyword vector. This information helps present document summaries in search results and sort results by relevance.
- Index building: Pages crawled by Google's web crawlers are preprocessed and analyzed, and keyword vectors are generated. Google then adds this data to the inverted index and the document index. Index building involves multiple data structures, algorithms, and optimizations to ensure efficient storage and retrieval.
- Index updates: To keep the index fresh, Google regularly updates the inverted index and the document index. This may include adding newly crawled pages, deleting defunct pages, and updating the content and weights of existing pages. Index updates must be performed without compromising search availability, to avoid any impact on user experience.
- Distributed indexing: Given the massive amount of data on the Internet, a single server cannot hold the entire inverted index and document index, so Google uses a distributed system to store and manage them. The index is divided into multiple shards distributed across different servers, which lets Google store the index on many machines and process search requests in parallel, improving search efficiency and scalability.
- Index compression: To reduce the index's storage footprint and retrieval time, Google applies various compression techniques to the index data, for example variable-length encoding to compress document IDs, or prefix compression to reduce the storage space of term strings (a small sketch of such compression follows this list).
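As an illustration of index compression, here is a minimal sketch of variable-length (varint) encoding applied to the gaps between sorted document IDs, a classic posting-list compression technique; the exact schemes Google uses are not public, so treat this as a generic example:

```python
def encode_varint(n):
    """Encode a non-negative integer as a variable-length byte string."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def compress_postings(doc_ids):
    """Store gaps between sorted doc IDs instead of the IDs themselves."""
    doc_ids = sorted(doc_ids)
    gaps = [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]
    return b"".join(encode_varint(g) for g in gaps)

postings = [5, 905, 906, 1200, 100000]
compressed = compress_postings(postings)
print(len(compressed), "bytes instead of", len(postings) * 4)  # 9 bytes instead of 20
```

Because gaps are much smaller than raw IDs, most of them fit in one or two bytes, which is exactly what makes gap-plus-varint encoding effective on long posting lists.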
Through these steps, Google builds an efficient, continuously updated search index that can quickly retrieve documents containing specific keywords and rank them by relevance. The index-building process involves multiple data structures, algorithms, and optimizations to ensure storage and retrieval efficiency and provide users with high-quality search results.
Now let's look at an example of a document index. Suppose we have an article about Apple Inc. with the URL https://example.com/apple-company, the title "Apple Company: The Tech Giant", and the description "A comprehensive overview of Apple Inc., its history, products, and innovations." After preprocessing and analysis, we get the article's keyword vector. We can then construct a document index entry for this article:

```
Doc1: {
  "url": "https://example.com/apple-company",
  "title": "Apple Company: The Tech Giant",
  "description": "A comprehensive overview of Apple Inc., its history, products, and innovations.",
  "keyword_vector": {
    "apple": 3.0,
    "company": 2.0,
    "technology": 1.7,
    "iphone": 3.6,
    "innovation": 2.2,
    "steve": 2.5,
    "jobs": 2.5,
    "store": 2.3,
    "sales": 1.9,
    "revenue": 2.1
  }
}
```
The document index contains metadata about documents, such as URLs, titles, descriptions,
and keyword vectors. This information helps to present document summaries in search results
and to sort search results according to relevance.
In practice, the document index may contain additional information, such as PageRank, link structure, and click-through rate (CTR), to help the search engine assess the relevance and weight of documents more accurately.
Query processing
Now let's look at query processing.
Suppose we have a simplified inverted index containing four documents (Doc1, Doc2, Doc3, and Doc4) associated with different terms. We will demonstrate how the inverted index is used to retrieve relevant pages for the keywords "apple" and "cell phone" in the Google search engine.
Example inverted index:

```
{
  "apple": {
    "Doc1": {"weight": 3.0, "position": [1, 12]},
    "Doc2": {"weight": 2.5, "position": [5]},
    "Doc4": {"weight": 1.0, "position": [3, 8]}
  },
  "cell phone": {
    "Doc1": {"weight": 2.8, "position": [7]},
    "Doc3": {"weight": 3.5, "position": [1, 5, 10]},
    "Doc4": {"weight": 2.2, "position": [15]}
  },
  ...
}
```
In this example, the inverted index is a dictionary whose keys are terms and whose values list the documents containing each term, together with related information such as the term's weight and positions in the document.
- Query parsing: The user enters the keywords "apple cell phone", and Google first parses the query. Here we assume the query has been correctly parsed into two terms: "apple" and "cell phone".
- Inverted index lookup: Based on the parsed terms, Google queries the inverted index to find documents containing those terms. In this example we get: documents containing "apple": Doc1, Doc2, Doc4; documents containing "cell phone": Doc1, Doc3, Doc4.
- Merging results: The document lists for the different terms are merged to find the documents that contain all of the query terms. In this example, the documents containing both "apple" and "cell phone" are Doc1 and Doc4.
- Document scoring: For the retrieved documents, Google computes a relevance score. This can involve many factors, such as term weights, term positions, and the document's PageRank value. In this simplified example we use only term weights as the basis for scoring, so:
  Doc1 score: 3.0 ("apple" weight) + 2.8 ("cell phone" weight) = 5.8
  Doc4 score: 1.0 ("apple" weight) + 2.2 ("cell phone" weight) = 3.2
- Result ranking: Search results are sorted by document score. Here Doc1 scores higher than Doc4, so Doc1 is ranked before Doc4.
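A minimal sketch that reproduces this retrieval-and-scoring flow over the toy index above (structure and numbers taken directly from the example, with positions omitted for brevity):

```python
index = {
    "apple": {"Doc1": 3.0, "Doc2": 2.5, "Doc4": 1.0},
    "cell phone": {"Doc1": 2.8, "Doc3": 3.5, "Doc4": 2.2},
}

def search(terms):
    # Merge: keep only documents containing every query term.
    postings = [set(index.get(t, {})) for t in terms]
    candidates = set.intersection(*postings)
    # Score: sum of term weights, then rank by score.
    scores = {d: sum(index[t][d] for t in terms) for d in candidates}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(search(["apple", "cell phone"]))
# [('Doc1', 5.8), ('Doc4', 3.2)]
```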
Deep learning ranking techniques
Deep learning ranking techniques are mainly used to order web pages. In this example we illustrate how Google might use a deep learning ranking model, covering label selection, feature selection, model training, and the ranking process; a runnable sketch appears at the end of this section.
- Label selection: In search, the label is primarily whether the user clicked on the result: clicked is 1, not clicked is 0.
- Feature selection: Features are variables that describe the relationship between a query and a document. A deep learning ranking model can use many features, for example:
  a. Term weights: e.g., TF-IDF, BM25.
  b. Term similarity: e.g., cosine similarity computed from word embeddings.
  c. Document quality: e.g., using the PageRank value to indicate page quality.
  d. Content length: the length of the document.
  e. Click-through rate: how often users click on a particular search result.
- Model training: A neural network is used as the ranking model. Query-document pairs and their feature values from the training dataset are fed into the network, which is trained using the click labels as the supervision signal. After several rounds of iterative optimization, the network learns to capture the complex relationships between features and labels, so that it can predict the relevance of new query-document pairs.
- Ranking process: Suppose the user queries "best smartphone 2023". From the inverted index, Google retrieves three candidate documents: DocA, DocB, and DocC. For each query-document pair we compute the feature values selected above and feed them into the trained network. The model outputs a prediction score indicating the relevance of the document to the query.
Suppose we get the following predicted scores:
DocA: 0.85
DocB: 0.60
DocC: 0.75
Based on the prediction scores, we can rank the candidate documents:
DocA > DocC > DocB
Finally, the ranked documents are presented to the user. In this example, DocA is considered the most relevant document for the query "best smartphone 2023" and will therefore appear at the top of the search results.
With deep learning ranking models, Google can effectively integrate many scoring factors and automatically discover useful new features to improve the quality of search results. This approach provides high accuracy and efficiency in web page ranking.
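Here is a minimal sketch of a pointwise ranking model in this spirit: logistic regression on query-document features, trained on click labels. A real system would use a deep network and far more features; all feature values here are made up for illustration.

```python
import numpy as np

# Each row: [tf-idf score, embedding similarity, pagerank, ctr] (invented values).
X = np.array([
    [3.2, 0.91, 0.8, 0.30],   # clicked results (label 1)
    [2.9, 0.85, 0.6, 0.25],
    [0.8, 0.30, 0.2, 0.02],   # skipped results (label 0)
    [1.1, 0.40, 0.1, 0.01],
])
y = np.array([1.0, 1.0, 0.0, 0.0])   # click labels

w = np.zeros(X.shape[1])
b = 0.0
lr = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Gradient descent on the log loss.
for _ in range(2000):
    p = sigmoid(X @ w + b)
    grad = p - y
    w -= lr * (X.T @ grad) / len(y)
    b -= lr * grad.mean()

# Score unseen candidates and rank them by predicted relevance.
candidates = {"DocA": [3.0, 0.9, 0.7, 0.2],
              "DocB": [1.0, 0.4, 0.2, 0.05],
              "DocC": [2.0, 0.7, 0.5, 0.1]}
scores = {d: float(sigmoid(np.dot(w, f) + b)) for d, f in candidates.items()}
print(sorted(scores, key=scores.get, reverse=True))  # ['DocA', 'DocC', 'DocB']
```

The pointwise setup shown here is the simplest option; production rankers often use pairwise or listwise objectives instead, but the feature-in, score-out shape is the same.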
Let's go over the ranking model in a little more detail:
Google's search algorithm involves hundreds of factors; here is a conceptual formula that covers some of the major page-level and site-level SEO factors.
Let A, B, C, etc. denote page-level SEO factors and X, Y, Z, etc. denote site-level SEO factors, with weights denoted by W. We can then write a conceptual formula:

Score = W1 * A + W2 * B + W3 * C + W4 * X + W5 * Y + W6 * Z
Where:
A: Content quality (e.g., originality, usefulness, depth)
B: Keyword use (e.g., keyword density, placement, relevance)
C: Page metadata (e.g., title tags, meta descriptions, URL structure)
X: Technical SEO (e.g., site speed, mobile-friendliness, sitemaps, HTTPS)
Y: Internal link structure (e.g., anchor text, link quality, hierarchy)
Z: External links and authority (e.g., quality of external links, sources, social signals)
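A toy evaluation of this conceptual formula; the factor scores and weights below are entirely hypothetical:

```python
# Hypothetical factor scores on a 0-1 scale and hand-picked weights.
factors = {"A": 0.9, "B": 0.7, "C": 0.8, "X": 0.95, "Y": 0.6, "Z": 0.5}
weights = {"A": 0.30, "B": 0.15, "C": 0.10, "X": 0.20, "Y": 0.10, "Z": 0.15}

# Score = W1*A + W2*B + W3*C + W4*X + W5*Y + W6*Z
score = sum(weights[k] * factors[k] for k in factors)
print(round(score, 3))  # 0.78
```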
Deep learning algorithms may be involved both in computing and evaluating the individual factors (A, B, C, X, Y, and Z) and in learning their weights (W1 through W6). The specific implementation may involve a variety of deep learning techniques such as natural language processing, entity recognition, and user behavior analysis. In reality there may be hundreds of factors, each with a trained weight; of course, actual systems are not this simple, and they also use one-hot encoding, embeddings, and other feature-processing techniques.