
Fundamentals of SEO in the Age of AI

As we all know, in the previous era Google's ranking worked in two stages: first an inverted-index
lookup, then ranking by PageRank score. Today, many people say PageRank is outdated and that you
no longer need to build links; a well-known SEO nicknamed "Big Bald Head" claims Google has more
than 200 ranking factors, while others say that is far too many. In this post I'll take an
algorithm engineer's point of view and walk through the basic principles of SEO in the AI era.

The basic logic of SEO in the age of AI

  1. Perform semantic search to expand the query.
  2. Query the inverted index for candidate pages.
  3. Rank the results: let a deep learning model such as BERT predict a relevance probability
    for each page retrieved from the inverted index, then rank the pages by combining the
    inverted-index score with the predicted probability.

Semantic search

Semantic search is primarily used to expand keywords. Suppose a user enters the query
"best smartphone" in Google; this example shows how deep learning techniques can be used to
perform semantic search and improve the relevance of search results.

  1. Word embedding: Google first converts the words in the query ("best", "smartphone")
    into word vectors. Word embeddings capture the semantic relationships between words,
    so that similar words have similar vector representations. For example, word embeddings
    capture the relationship between "smartphone" and "cell phone".
  2. Query expansion: Using word embeddings, Google can find other words that are
    semantically similar to the words in the query. For example, "best" can be expanded to
    "top-notch", "first-class", etc., and "smartphone" can be expanded to "cell phone",
    "mobile device", etc. In this way, Google can broaden the search scope and improve the
    relevance of search results, as sketched below.
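
Here is a minimal sketch of embedding-based query expansion. The toy vectors, the tiny
vocabulary, and the similarity threshold are all illustrative assumptions, not Google's actual
embeddings:

import numpy as np

# Toy word embeddings (made-up values, not real trained vectors).
embeddings = {
    "best":          np.array([0.9, 0.1, 0.0]),
    "top-notch":     np.array([0.85, 0.15, 0.05]),
    "first-class":   np.array([0.8, 0.2, 0.1]),
    "smartphone":    np.array([0.1, 0.9, 0.3]),
    "cell phone":    np.array([0.15, 0.85, 0.35]),
    "mobile device": np.array([0.2, 0.8, 0.4]),
    "banana":        np.array([0.0, 0.1, 0.9]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def expand(term, threshold=0.95):
    """Return vocabulary words whose vectors are close to the query term's."""
    q = embeddings[term]
    return [w for w, v in embeddings.items()
            if w != term and cosine(q, v) >= threshold]

for term in ["best", "smartphone"]:
    print(term, "->", expand(term))
# best -> ['top-notch', 'first-class']
# smartphone -> ['cell phone', 'mobile device']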
Inverted index

Although it is an old workhorse, the inverted index still has to be discussed. An inverted
index (Inverted Index) is a data structure used by search engines such as Google to quickly
find the web pages that contain specific terms. Here is, in outline, how Google uses the
inverted index to find web pages from keywords:
  1. Web crawling: Google first crawls web pages on the Internet with its web crawler
    (Googlebot). The crawler follows the hyperlinks on each page and gradually crawls the
    entire Internet.
  2. Web page preprocessing: Each crawled page is preprocessed: HTML tags, JavaScript code,
    etc. are removed, and the plain text content is extracted. Next, the text is tokenized
    (Tokenization), i.e. broken down into words or phrases. For English text, tokenization
    is usually done by splitting on spaces; for other languages such as Chinese, it may
    involve more complex processing.
  3. Index construction: The terms produced by tokenization are processed to build an
    inverted index, a mapping from each term to the list of documents that contain it.
    While building the index, the search engine also performs stemming (Stemming), stopword
    filtering (Stopword Filtering), and other operations to reduce the size of the index
    and improve search efficiency.
  4. Query processing: When a user enters keywords, Google preprocesses the query, e.g.
    tokenization and synonym substitution, and then looks up the documents containing those
    terms in the inverted index.

Now let's elaborate on each of these techniques.

Web crawler

Google's web crawler (Googlebot) is an automated program that crawls web pages on the Internet
so that the search engine can index them. Below is the detailed procedure by which the crawler
works:

  1. Seed URLs: The crawler starts with a set of seed URLs, which come from previous crawl
    results, sitemaps submitted by webmasters (Sitemap), newly discovered URLs, etc. The
    seed URLs are the starting point from which the crawler begins crawling.
  2. URL queue: The crawler maintains a queue of URLs to be crawled; initially the queue
    contains the seed URLs. The crawler retrieves URLs from this queue for subsequent
    processing.
  3. HTTP requests: The crawler issues HTTP requests, such as GET requests, for the URLs it
    takes from the queue. The server responds to these requests and returns the HTML source
    code of the page.
  4. Content processing: After receiving the HTML source code, the crawler parses it and
    extracts the hyperlinks on the page (usually the href attribute of the <a> tag). These
    hyperlinks are the primary way crawlers discover new URLs.
  5. URL normalization: The extracted hyperlinks undergo URL normalization, including
    removing fragment identifiers, converting relative URLs to absolute URLs, decoding URL
    encodings, and so on. These steps canonicalize variant forms of the same URL and reduce
    the probability of repeated crawling.
  6. URL de-duplication: In order to avoid crawling the same web pages repeatedly, the
    crawler needs to de-duplicate newly discovered URLs. This is usually done by
    maintaining a collection of visited URLs or a Bloom Filter.
  7. Speed Limits and Delays: In order to comply with a site's crawling policy (e.g.,
    robots.txt file) and to prevent overstressing the server, the crawler will limit the
    crawling speed. This can be accomplished by setting a delay time, limiting the number of
    concurrent requests, etc.
  8. Update URL queue: new URLs after de-duplication will be added to the queue of URLs to be
    crawled for subsequent crawling by the crawler.
  9. Storing Web Pages: The content of the web pages crawled by the crawler is stored for
    subsequent indexing and analysis processes.
  10. Cyclic crawling: The crawler keeps taking new URLs from the URL queue and repeating
    the process until the queue is empty or some stopping condition is reached.

Through this process, Google's web crawlers can crawl a huge number of web pages on the
Internet and feed their content into search engine indexing. To keep the index fresh, Google
regularly re-crawls and updates pages it has already crawled.
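
Here is a minimal sketch of this crawl loop, assuming the requests and beautifulsoup4 packages.
The seed URL, delay, and page limit are illustrative choices; a real crawler adds robots.txt
handling, per-host politeness, Bloom filters, and distributed queues:

import time
from collections import deque
from urllib.parse import urljoin, urldefrag

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=50, delay=1.0):
    queue = deque(seed_urls)          # URL queue (step 2)
    visited = set(seed_urls)          # de-duplication set (step 6)
    pages = {}                        # stored page content (step 9)

    while queue and len(pages) < max_pages:   # cyclic crawling (step 10)
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=10)   # HTTP request (step 3)
        except requests.RequestException:
            continue
        pages[url] = resp.text

        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):      # extract hyperlinks (step 4)
            absolute = urljoin(url, a["href"])       # relative -> absolute (step 5)
            normalized, _frag = urldefrag(absolute)  # drop fragment (step 5)
            if normalized not in visited:            # de-duplicate (step 6)
                visited.add(normalized)
                queue.append(normalized)             # update URL queue (step 8)

        time.sleep(delay)                            # politeness delay (step 7)

    return pages

pages = crawl(["https://example.com/"])
print(len(pages), "pages fetched")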

Page preprocessing

The Google search engine needs to pre-process the content of crawled web pages before
indexing them. This preprocessing process helps extract valuable information, reduces the
complexity of indexing, and improves search results. Below is the detailed process of how Google
preprocesses web pages:

  1. Character encoding detection: Since web pages may use different character encodings,
    such as UTF-8, GBK, etc., Google first needs to detect the character encoding of the web
    page. This can be done by parsing the charset attribute in the HTML's <meta> tag or by
    using an encoding detection algorithm.

  2. HTML parsing: Google will parse the HTML source code to build a DOM (Document Object
    Model) tree that represents the structure of the web page. This process includes
    identifying HTML tags, attributes, nesting relationships, and so on.

  3. Remove extraneous content: In order to extract plain text content, Google needs to
    remove extraneous content such as HTML tags, JavaScript code, CSS styles, etc. from the
    DOM tree. This helps reduce the complexity of indexing and focuses on the actual textual
    content of the page.

  4. Extracting valuable information: During preprocessing, Google also extracts valuable
    information from the web page, such as the title, meta information, headings,
    hyperlinks, and so on. This information helps in understanding the topic and structure
    of the page and plays an important role in ranking and displaying search results.

  5. Text segmentation: Google tokenizes the extracted plain text (Tokenization), breaking
    it down into words or phrases. For English text this usually means splitting on spaces;
    for other languages such as Chinese, it may involve more complex processing.

  6. Stemming and lemmatization: In order to reduce lexical diversity, Google applies
    stemming (Stemming) or lemmatization (Lemmatization) to the tokens. These operations
    unify different forms of a word into its base form and improve recall.

  7. Stopword filtering: Google filters out stopwords (Stopword Filtering), i.e. common
    words that contribute little to search quality, such as "the", "and", "is". This
    reduces the size of the index and improves search efficiency.

  8. Keyword vector generation: Google may use the tokens to generate a keyword vector for
    each web page for subsequent search, ranking, and other operations. Keyword vectors can
    be expressed as term frequency (TF, Term Frequency), term frequency times inverse
    document frequency (TF-IDF, Term Frequency-Inverse Document Frequency), and other
    forms.

  9. Entity recognition and linking: During preprocessing, Google also performs entity
    recognition and linking (Entity Recognition and Linking), which identifies entities in
    the text (e.g., names of people, places, organizations) and links them to relevant
    knowledge bases (e.g., the Google Knowledge Graph). This helps in understanding the
    semantics of web content and improves the relevance of search results.
  10. NLP analysis: Google may use natural language processing (NLP) techniques for deeper
    analysis of web text to extract richer semantic information, for example syntactic
    analysis and sentiment analysis, to further understand the topics and viewpoints of web
    content.
  11. Image, video, and other media content: For images, videos, and other media in web
    pages, Google may use computer vision and related technologies to extract relevant
    information. This helps enrich search results and provide users with more diverse
    content.

Through these preprocessing steps, Google extracts key information from the original web page
and generates data suitable for search engine indexing, ranking, and display. This reduces the
complexity of indexing, focuses on the actual textual content of the page, and thus improves
search results and user experience.
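
A minimal sketch of steps 3, 5, and 7 using only the Python standard library. The stopword
list and the crude suffix-stripping stemmer are illustrative stand-ins for real linguistic
processing:

from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects text while skipping <script> and <style> contents (step 3)."""
    def __init__(self):
        super().__init__()
        self.parts, self.skip = [], False
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip = True
    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skip = False
    def handle_data(self, data):
        if not self.skip:
            self.parts.append(data)

STOPWORDS = {"the", "and", "is", "a", "of"}   # tiny illustrative list

def stem(token):
    """Very crude suffix stripping, standing in for a real stemmer (step 6)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[:-len(suffix)]
    return token

def preprocess(html):
    extractor = TextExtractor()
    extractor.feed(html)
    text = " ".join(extractor.parts)
    tokens = text.lower().split()              # tokenization (step 5)
    return [stem(t) for t in tokens if t not in STOPWORDS]  # steps 6-7

print(preprocess("<html><body><h1>Apple</h1><p>The iPhone is selling well.</p></body></html>"))
# ['apple', 'iphone', 'sell', 'well.']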

Here is an example of a keyword vector. Suppose we have an article about Apple Inc., and after
preprocessing and tokenization the following terms (among others) are extracted:

["apple", "company", "technology", "iphone", "innovation", "steve", "jobs", "apple", "store", "iphone", "sales", "revenue"]

Next, we can calculate the term frequency (TF) of each term:

{
    "apple": 2,
    "company": 1,
    "technology": 1,
    "iphone": 2,
    "innovation": 1,
    "steve": 1,
    "jobs": 1,
    "store": 1,
    "sales": 1,
    "revenue": 1
}

Assuming that we have a collection of documents containing multiple articles, we can compute
the Inverse Document Frequency (IDF) of each term. Here, suppose the IDF values are:

{
    "apple": 1.5,
    "company": 2.0,
    "technology": 1.7,
    "iphone": 1.8,
    "innovation": 2.2,
    "steve": 2.5,
    "jobs": 2.5,
    "store": 2.3,
    "sales": 1.9,
    "revenue": 2.1
}

Then we can calculate the TF-IDF value for each term, i.e. the term frequency (TF) multiplied
by the inverse document frequency (IDF):

{
    "apple": 2 * 1.5 = 3.0,
    "company": 1 * 2.0 = 2.0,
    "technology": 1 * 1.7 = 1.7,
    "iphone": 2 * 1.8 = 3.6,
    "innovation": 1 * 2.2 = 2.2,
    "steve": 1 * 2.5 = 2.5,
    "jobs": 1 * 2.5 = 2.5,
    "store": 1 * 2.3 = 2.3,
    "sales": 1 * 1.9 = 1.9,
    "revenue": 1 * 2.1 = 2.1
}

Finally, we get the keyword vector for this article, which represents the article's topics and
their weights:

{
    "apple": 3.0,
    "company": 2.0,
    "technology": 1.7,
    "iphone": 3.6,
    "innovation": 2.2,
    "steve": 2.5,
    "jobs": 2.5,
    "store": 2.3,
    "sales": 1.9,
    "revenue": 2.1
}

Keyword vectors can be used in search, ranking, clustering, and other operations, helping to
improve the accuracy and efficiency of search engines.
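
The arithmetic above is easy to reproduce. A minimal sketch, assuming the same toy term list
and IDF table; real engines compute IDF over the whole corpus and usually apply smoothing and
normalization:

from collections import Counter

terms = ["apple", "company", "technology", "iphone", "innovation",
         "steve", "jobs", "apple", "store", "iphone", "sales", "revenue"]

# Assumed IDF values from the example; in practice IDF is computed as
# log(N / df) over the whole document collection.
idf = {"apple": 1.5, "company": 2.0, "technology": 1.7, "iphone": 1.8,
       "innovation": 2.2, "steve": 2.5, "jobs": 2.5, "store": 2.3,
       "sales": 1.9, "revenue": 2.1}

tf = Counter(terms)                       # term frequency
tfidf = {t: tf[t] * idf[t] for t in tf}   # TF * IDF

print(tfidf["iphone"])   # 2 * 1.8 = 3.6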

Indexing

Now let's explain in detail, from a technical point of view, how Google builds its index.
Google converts web content into structured data suitable for search engines by preprocessing
and analyzing web pages. On this basis, Google constructs an index that enables fast retrieval
and ranking. The following is a detailed look at index construction, continuing the example
above.

Inverted index: Google uses an inverted index (Inverted Index) as its core indexing structure,
which allows quick lookup of the documents that contain a particular term. The inverted index
is a mapping whose keys are terms and whose values are lists of the documents containing each
term, together with related information. For example, from the keyword vector above we can
build a simple inverted index:

{
    "apple": [Doc1],
    "company": [Doc1],
    "technology": [Doc1],
    "iphone": [Doc1],
    "innovation": [Doc1],
    "steve": [Doc1],
    "jobs": [Doc1],
    "store": [Doc1],
    "sales": [Doc1],
    "revenue": [Doc1]
}

In a real index, each term is associated with multiple documents that contain it, together
with information such as the term's weight and its positions within each document.
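
A minimal sketch of building such a postings structure from per-document token lists; the
documents, the raw-frequency weight, and the field names are illustrative:

from collections import defaultdict

# Illustrative tokenized documents.
docs = {
    "Doc1": ["apple", "iphone", "apple", "store"],
    "Doc2": ["apple", "sales"],
}

inverted = defaultdict(dict)
for doc_id, tokens in docs.items():
    for pos, term in enumerate(tokens, start=1):
        posting = inverted[term].setdefault(doc_id, {"weight": 0, "position": []})
        posting["weight"] += 1            # toy weight: raw term frequency
        posting["position"].append(pos)   # 1-based token positions

print(dict(inverted)["apple"])
# {'Doc1': {'weight': 2, 'position': [1, 3]}, 'Doc2': {'weight': 1, 'position': [1]}}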

  1. Document Index: In order to speed up the document retrieval and sorting process,
    Google may build a document index for each document. The document index contains
    metadata about the document, such as URL, title, description, keyword vectors, and so
    on. This information helps to present a summary of the document in the search results,
    as well as to sort them according to relevance.

  2. Indexing: Web pages crawled by Google's web crawlers are pre-processed, analyzed, and
    keyword vectors are generated. Subsequently, Google will add this data to the inverted
    index and document index. The index building process involves multiple data structures,
    algorithms and optimizations to ensure efficient storage and retrieval.

  3. Updating the Index: In order to keep the index real-time, Google regularly updates the
    inverted index and document index. This may include adding new crawled pages, deleting
    defunct pages, updating the content and weights of existing pages, etc. The index
    updating process needs to be done in a way that ensures search engine usability to avoid
    any impact on user experience.

  4. Distributed indexing: Due to the massive amount of data on the Internet, a single server
    cannot hold the entire inverted index and document index, so Google uses a distributed
    system to store and manage indexes. The index is divided into multiple shards (Shards)
    distributed across different servers, allowing search requests to be processed in
    parallel and improving search efficiency and scalability.

  5. Index compression: In order to reduce the index's storage space and retrieval time,
    Google applies various compression techniques to the index data. For example,
    variable-length encoding (Variable-Length Encoding) compresses document IDs, and prefix
    compression (Prefix Compression) reduces the storage space of term strings; a sketch of
    variable-length encoding follows below.
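
As an illustration of variable-length encoding, here is a sketch of the common varint scheme
applied to gaps between sorted document IDs. Gap encoding is a standard inverted-index trick,
though Google's exact on-disk format is not public:

def varint_encode(n: int) -> bytes:
    """Encode a non-negative integer 7 bits per byte; the high bit marks continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_postings(doc_ids):
    """Delta-encode sorted doc IDs, then varint-encode each gap."""
    data, prev = bytearray(), 0
    for doc_id in doc_ids:
        data += varint_encode(doc_id - prev)
        prev = doc_id
    return bytes(data)

# Small gaps compress to single bytes; only the gap of 295 needs two.
print(encode_postings([3, 5, 300]).hex())  # '0302a702' (gaps 3, 2, 295)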
Through these steps, Google builds an efficient, real-time search engine index that can
quickly retrieve documents containing specific keywords and rank them by relevance. Index
construction involves multiple data structures, algorithms, and optimizations to ensure
storage and retrieval efficiency and provide users with high-quality search results.

Now let's look at an example of a document index. Suppose we have an article about Apple Inc.
with the URL https://example.com/apple-company, the title "Apple Company: The Tech Giant", and
the description "A comprehensive overview of Apple Inc., its history, products, and
innovations." After preprocessing and analysis, we get the article's keyword vector. Next, we
can construct a document index for this article:

    Doc1: {
    "url": "https://example.com/apple-company",
    "title": "Apple Company: The Tech Giant",
    "description": "A comprehensive overview of Apple Inc., its history, products, and innovations.",
    "keyword_vector": {
        "apple": 3.0,
        "company": 2.0,
        "technology": 1.7,
        "iphone": 3.6,
        "innovation": 2.2,
        "steve": 2.5,
        "jobs": 2.5,
        "store": 2.3,
        "sales": 1.9,
        "revenue": 2.1
    }
    }

The document index contains metadata about each document, such as its URL, title, description,
and keyword vector. This information helps present document summaries in search results and
rank results by relevance.
In practice, the document index may contain additional information, such as PageRank, link
structure, and click-through rate (CTR), to help the search engine assess the relevance and
weight of documents more accurately.

Query processing

Now let's look at the query side. Suppose we have a simplified inverted index containing four
documents (Doc1, Doc2, Doc3, and Doc4) associated with different terms. We will demonstrate
how Google uses the inverted index to retrieve relevant web pages for the keywords "apple" and
"cell phone".
Example of an inverted index:

{
    "apple": {
        "Doc1": {"weight": 3.0, "position": [1, 12]},
        "Doc2": {"weight": 2.5, "position": [5]},
        "Doc4": {"weight": 1.0, "position": [3, 8]}
    },
    "cell phone": {
        "Doc1": {"weight": 2.8, "position": [7]},
        "Doc3": {"weight": 3.5, "position": [1, 5, 10]},
        "Doc4": {"weight": 2.2, "position": [15]}
    },
    ...
}

In this example, the inverted index is a dictionary whose keys are terms; each value maps the
documents containing that term to related information, such as the term's weight and its
positions within the document.

  1. Query parsing: The user enters the keyword "apple cell phone", and Google first parses
    the query. Here we assume that the query has been correctly parsed into two terms:
    "apple" and "cell phone".

  2. Retrieve the inverted index: Based on the parsed terms, Google queries the inverted
    index to find documents that contain them. In this example we get the following
    document lists:
    Documents containing "apple": Doc1, Doc2, Doc4
    Documents containing "cell phone": Doc1, Doc3, Doc4

  3. Merge Results: Merge the list of documents containing different terms to find the
    document that contains all the query terms. In this example, the documents that contain
    both "apple" and "cell phone" are Doc1 and Doc4.

  4. Document scoring: For the retrieved documents, Google calculates a relevance score.
    This can involve many factors, such as term weights, positions, the document's PageRank
    value, and so on. In this simplified example we only use term weights as the basis for
    scoring, so the scores are:
    Doc1 score: 3.0 ("apple" weight) + 2.8 ("cell phone" weight) = 5.8
    Doc4 score: 1.0 ("apple" weight) + 2.2 ("cell phone" weight) = 3.2

  5. Result ranking: Sort the search results by relevance according to the document scores.
    In this example Doc1 scores higher than Doc4, so Doc1 is ranked before Doc4.
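
Putting steps 1-5 together, here is a minimal sketch of conjunctive (AND) retrieval and
weight-based scoring over the toy index above; real engines use far richer scoring:

# Toy index: term -> {doc_id: weight}, mirroring the example above.
index = {
    "apple": {"Doc1": 3.0, "Doc2": 2.5, "Doc4": 1.0},
    "cell phone": {"Doc1": 2.8, "Doc3": 3.5, "Doc4": 2.2},
}

def search(terms, index):
    # Step 3: keep only documents containing ALL query terms.
    candidates = set.intersection(*(set(index[t]) for t in terms))
    # Step 4: score = sum of the term weights in each document.
    scores = {doc: sum(index[t][doc] for t in terms) for doc in candidates}
    # Step 5: rank by descending score.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(search(["apple", "cell phone"], index))
# [('Doc1', 5.8), ('Doc4', 3.2)]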

Deep learning ranking techniques

Deep learning ranking techniques are mainly used to order web pages. In this example we will
illustrate how Google can use a deep learning ranking model, covering label selection, feature
selection, model training, and the ranking process.

  1. Label selection: In search, the label is mainly whether the user clicked on the page:
    clicked is 1, not clicked is 0.
  2. Feature selection: Features are variables that describe the relationship between a
    query and a document. A deep learning ranking model can use many features, for example:
    a. Term weights: e.g. TF-IDF, BM25, etc.
    b. Term similarity: e.g. cosine similarity computed from word embeddings.
    c. Document quality: e.g. using the PageRank value to indicate the quality of a web
    page.
    d. Content length: the length of the document.
    e. Click-through rate: how often users click on a particular search result.
  3. Training the model: A neural network is used as the ranking model. Query-document pairs
    and their feature values from the training dataset are fed into the network, which is
    trained using the labels as supervision signals. After several rounds of iterative
    optimization, the network learns to capture the complex relationships between features
    and labels, so it can predict the relevance of new query-document pairs.
  4. Ranking process: Suppose the user query is "best smartphone 2023". From the inverted
    index, Google retrieves three candidate documents: DocA, DocB, and DocC. For each
    query-document pair, we compute the feature values selected above and feed them into
    the trained neural network, which outputs a prediction score indicating the relevance
    of the document to the query (a sketch of such a model follows this section).
    Suppose we get the following predicted scores:
DocA: 0.85
DocB: 0.60
DocC: 0.75

Based on the prediction scores, we can rank the candidate documents:

DocA > DocC > DocB

Finally, the ranked documents are presented to the user. In this example, DocA is considered
the most relevant document for the query "best smartphone 2023" and therefore appears at the
top of the search results.
With deep learning ranking models, Google can effectively integrate multiple scoring factors
and automatically discover useful new features, improving the quality of search results. This
approach provides high accuracy and efficiency in web page ranking.
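
Here is a minimal sketch of a pointwise ranking model in PyTorch: five input features
(matching the feature list above), a click/no-click label, and a sigmoid output used as the
relevance score. The architecture, training data, and hyperparameters are all illustrative:

import torch
import torch.nn as nn

# Toy training data: one row per query-document pair; columns are the five
# features above (term weight, term similarity, PageRank, length, CTR).
X = torch.tensor([[3.2, 0.91, 0.8, 0.5, 0.30],
                  [1.1, 0.40, 0.2, 0.9, 0.02],
                  [2.5, 0.75, 0.6, 0.4, 0.18]])
y = torch.tensor([[1.0], [0.0], [1.0]])    # labels: clicked (1) or not (0)

model = nn.Sequential(
    nn.Linear(5, 16), nn.ReLU(),
    nn.Linear(16, 1), nn.Sigmoid(),        # output: predicted relevance in (0, 1)
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.BCELoss()                     # binary cross-entropy on click labels

for _ in range(200):                       # a few rounds of iterative optimization
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

# Score new candidate documents and rank them by predicted relevance.
candidates = {"DocA": [3.0, 0.88, 0.7, 0.6, 0.25],
              "DocB": [1.5, 0.52, 0.3, 0.8, 0.05],
              "DocC": [2.2, 0.70, 0.5, 0.5, 0.15]}
with torch.no_grad():
    scores = {d: model(torch.tensor([f])).item() for d, f in candidates.items()}
print(sorted(scores, key=scores.get, reverse=True))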
Let's go over the ranking model in a little more detail:

Google's search engine algorithm involves hundreds of factors; here is a conceptual formula
covering some of the major page-level and site-level SEO factors.
Let A, B, C, etc. denote page-level SEO factors and X, Y, Z, etc. denote site-level SEO
factors, with weight factors denoted by W. We can then write a conceptual formula:

S = W1 * A + W2 * B + W3 * C + W4 * X + W5 * Y + W6 * Z

Where:
A: content quality (e.g., originality, usefulness, depth)
B: keyword use (e.g., keyword density, placement, relevance)
C: page metadata (e.g., title tags, meta descriptions, URL structure)
X: technical SEO (e.g., site speed, mobile-friendliness, sitemaps, HTTPS)
Y: internal link structure (e.g., anchor text, link quality, hierarchy)
Z: external links and authority (e.g., quality of external links, sources, social signals)
Deep learning algorithms may be involved both in computing and evaluating the individual
factors (A, B, C, X, Y, and Z) and in learning their weights (W1 through W6). The concrete
implementation may draw on a variety of deep learning techniques, such as natural language
processing, entity recognition, and user behavior analysis. In practice there may be hundreds
of factors, each with a trained weight; of course, real systems are not this simple, and
feature-processing techniques such as one-hot encoding and embeddings are also used. A toy
version of the weighted sum is sketched below.
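
A toy computation of the conceptual score, with made-up factor values and weights purely for
illustration:

# Illustrative factor scores (0-1) and weights; the values are made up.
factors = {"A": 0.9, "B": 0.7, "C": 0.8, "X": 0.6, "Y": 0.5, "Z": 0.4}
weights = {"A": 0.25, "B": 0.15, "C": 0.10, "X": 0.20, "Y": 0.10, "Z": 0.20}

S = sum(weights[k] * factors[k] for k in factors)   # S = W1*A + W2*B + ... + W6*Z
print(round(S, 3))  # 0.66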
