Getting a web page indexed doesn’t guarantee that it will rank well in search engines and bring organic traffic to your site. However, not getting your web pages indexed guarantees that they will not rank. In other words, you’ve got to be in it to win it: your pages need to be discoverable, crawled, and indexed if you want traffic. This article focuses on some of the key insights about how search engines crawl and index pages from our recent webinar Indexing and Crawling: What You Should Know.
The Finite Resources of Search Engines
Indexing is only possible if your website’s pages are discoverable so that search engines can crawl them. A critical point is that search engines have finite resources—their budgets for crawling are constrained by three factors:
- Financial costs, such as electricity for running servers and hiring staff to maintain those resources
- Computational costs stemming from the need to increase computing resources, such as servers, to crawl growing volumes of web pages
- Environmental costs: sustainability is a tenet of how companies like Google operate now and plan to operate in the future, so crawling more pages leaves a larger environmental footprint
These three factors push search engines to treat crawling and indexing as a trade-off between efficiency and effectiveness. Starting from an initial fetch of web pages that meet a quality and relevance threshold for a particular search term, the search engine then fine-tunes the top results over multiple phases of crawling.
Even if a search engine had infinite resources, its value proposition would remain the same: providing customers (anyone searching for something online) with the best results. Even then, it would still select only the most relevant, high-quality pages for particular search terms. It’s the job of website owners and content creators to create those high-quality pages.
How Search Engines Determine Importance
Search engines such as Google or Bing ultimately want to answer one question when choosing whether to crawl, index, and rank a page:
Is this URL worth it?
Given the size of the Internet, the volume of pages published each day, and finite resources, search engines need a systematic approach to crawling and indexing. For context, Bing discovers 70 billion new URLs every single day. The system for crawling (sketched after the list below) is based on:
- Discovery: going out and finding new URLs
- Schedules: machine learning systems predict when pages meaningfully change and recrawl them accordingly
- Queues: ordering discovered URLs by crawl priority
- Thresholds for being crawled and indexed
- Tiers of importance
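None of this internal machinery is public, but conceptually these components fit together like a prioritized crawl queue. Here is a minimal Python sketch of that idea; the class, scores, and thresholds are purely illustrative assumptions, not values any search engine has published:

```python
import heapq
import time
from dataclasses import dataclass, field

# Illustrative thresholds -- not real search engine values.
CRAWL_THRESHOLD = 0.3   # minimum importance score to spend crawl budget at all
INDEX_THRESHOLD = 0.6   # higher bar a crawled page must clear to be indexed

@dataclass(order=True)
class CrawlTask:
    priority: float                 # lower value = crawled sooner
    url: str = field(compare=False)

class CrawlScheduler:
    """Toy model of the discovery -> queue -> threshold -> tier pipeline."""

    def __init__(self):
        self.queue = []  # min-heap ordered by priority

    def discover(self, url: str, importance: float, recrawl_interval: float):
        """Queue a newly discovered URL if it clears the crawl threshold."""
        if importance < CRAWL_THRESHOLD:
            return  # not worth the crawl budget
        # Higher importance and sooner predicted change -> smaller priority value.
        priority = (time.time() + recrawl_interval) / importance
        heapq.heappush(self.queue, CrawlTask(priority, url))

    def next_to_crawl(self):
        """Pop the most urgent URL from the queue."""
        return heapq.heappop(self.queue).url if self.queue else None

    def should_index(self, importance: float) -> bool:
        """A crawled page still has to clear a higher bar to be indexed."""
        return importance >= INDEX_THRESHOLD

scheduler = CrawlScheduler()
scheduler.discover("https://example.com/", importance=0.9, recrawl_interval=3600)
scheduler.discover("https://example.com/old-tag-page", importance=0.1, recrawl_interval=86400)
print(scheduler.next_to_crawl())  # https://example.com/ (the low-importance URL never made the queue)
```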
The prioritization for crawling is mostly driven by importance. This importance is influenced by two factors:
- Demand: dynamically figuring out what customers want, and crawling or pruning pages from results as appropriate based on search trends, seasonality, and so on
- Safe crawling: protecting website owners by not overburdening their site’s performance with excessive crawl requests (a politeness sketch follows this list)
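Safe crawling is also the one part of this you can observe from the website owner’s side: well-behaved crawlers read robots.txt and throttle how often they request pages. A minimal sketch of that politeness logic using Python’s standard-library robots.txt parser; the user agent name and fallback delay are assumptions for illustration:

```python
import time
import urllib.robotparser

USER_AGENT = "ExampleBot"   # hypothetical crawler name
DEFAULT_DELAY = 1.0         # assumed fallback seconds between requests

# Fetch and parse the site's robots.txt rules.
robots = urllib.robotparser.RobotFileParser("https://example.com/robots.txt")
robots.read()

def polite_fetch(urls):
    """Fetch URLs only where robots.txt allows it, pausing between requests."""
    delay = robots.crawl_delay(USER_AGENT) or DEFAULT_DELAY
    for url in urls:
        if not robots.can_fetch(USER_AGENT, url):
            continue  # respect disallow rules
        # ... fetch and parse the page here ...
        time.sleep(delay)  # avoid overburdening the site

polite_fetch(["https://example.com/", "https://example.com/admin/"])
```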
When crawling a specific website, search engines use a breadth-first approach, which finds pages along the shortest available paths by following layers downward (e.g. homepage -> pillar page -> sub-page). This contrasts with a depth-first alternative, which would mean going straight from the homepage to lower-level pages deep within the URL hierarchy.
The idea of the breadth-first approach is to ensure search engines follow the most important URLs. Many lower-level URLs could be redundant, boilerplate pages, and indexing all of them in search results would worsen the user experience for people searching online. Search engines attempt to detect clusters of very similar pages on your site and find a canonical (master copy) of the page among them to index.
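Breadth-first traversal itself is a standard graph algorithm, so the layer-by-layer crawl order described above is easy to illustrate. A minimal sketch over a toy link graph; the URLs are placeholders:

```python
from collections import deque

# Toy link graph standing in for a site's internal links (illustrative only).
LINKS = {
    "/": ["/pillar-a", "/pillar-b"],
    "/pillar-a": ["/pillar-a/sub-1", "/pillar-a/sub-2"],
    "/pillar-b": ["/pillar-b/sub-1"],
    "/pillar-a/sub-1": [], "/pillar-a/sub-2": [], "/pillar-b/sub-1": [],
}

def breadth_first_crawl(start="/"):
    """Visit pages layer by layer: homepage, then pillar pages, then sub-pages."""
    queue, seen, order = deque([start]), {start}, []
    while queue:
        url = queue.popleft()            # shallowest undiscovered URL first
        order.append(url)
        for link in LINKS.get(url, []):
            if link not in seen:         # each URL is queued only once
                seen.add(link)
                queue.append(link)
    return order

print(breadth_first_crawl())
# ['/', '/pillar-a', '/pillar-b', '/pillar-a/sub-1', '/pillar-a/sub-2', '/pillar-b/sub-1']
```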
How To Make Your Pages Discoverable
In order for your web pages to be discoverable in the first place, you need to get the technical basics in place. Some actionable ways to make your content discoverable are:
- Make sure pages are well connected through internal links
- Generate an XML sitemap that indicates the structure of your site to search engines and includes all pages you want indexed (see the sketch after this list)
- Consistently use intent- and query-driven anchor text when linking pages internally
- Group topics into tightly related clusters
- Make sure pages are crawlable and renderable
- Consolidate weaker related pages into one stronger page about a topic
- Indicate in URLs and file names what those resources are about (e.g. example.com/best-smartwatches)
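For the XML sitemap item above, the file itself is straightforward to generate. A minimal sketch using Python’s standard library and the sitemaps.org format; the page URLs and output path are placeholders:

```python
import xml.etree.ElementTree as ET

# Placeholder list of the pages you want indexed.
PAGES = [
    "https://example.com/",
    "https://example.com/best-smartwatches",
    "https://example.com/blog/how-we-test-wearables",
]

def build_sitemap(urls, path="sitemap.xml"):
    """Write a minimal XML sitemap following the sitemaps.org protocol."""
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    ET.ElementTree(urlset).write(path, encoding="utf-8", xml_declaration=True)

build_sitemap(PAGES)
```

Once generated, the sitemap can be referenced from robots.txt or submitted through the search engine’s webmaster tools so crawlers can find it.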
Consider the search engine as your customer—you want to have a solid information architecture in place that helps your customer easily find and navigate the various pages on your website. In other words, by viewing the search engine as a user trying to digest your content, you can better serve the search engine’s needs and improve your chances of being crawled and indexed.
Why Does Consistency Matter to Search Engines?
If you can be consistent in applying these practices across your web pages, you’re in a better position to be discovered, crawled, and indexed. The reason that consistency in a site’s structure matters comes down to the machine learning algorithms that underpin how search engines function.
Search engine machine learning algorithms essentially use judgment to determine which URLs among a set of discoverable pages are worth indexing for specific queries and which aren’t. The training set these algorithms use to improve that judgment is the entire Internet.
Consistency in site structure and information architecture matters because similar consistencies already exist in currently ranked websites. If you take a random sample of the Internet, patterns emerge across sites, such as the use of internal search, about pages, product pages, blog pages, etc. Machine learning algorithms notice these patterns, which means that consistently applying them to your own site helps the algorithms more easily retrieve and index your pages.
The role of consistency is particularly important when considering whether to use a content management system for your site. You should have a really compelling business reason to use a bespoke, custom-coded website. Search engine algorithms already recognize the site information architecture patterns of popular CMS platforms and website builders, which makes it easier to get indexed out of the box.
How Search Engines Judge Quality
The advice for getting your indexed pages to rank well in search engines often focuses on providing quality content, but how exactly do search engines algorithmically judge quality as they crawl a website? Here are some pointers:
- Write original content
- Don’t stuff keywords into pages
- Avoid publishing machine-generated content
Understand that each page matters. If you have a website that is 20 percent great content and 80 percent junk or spun content, the site will ultimately be flagged as low quality, and it will be very hard to rank well with any page. This brings the discussion full circle to the customer analogy: if you provide customers with low-quality products most of the time and high-quality products only occasionally, they won’t return to do business with you.
The Future of Crawling and Indexing
Over the next 5-10 years, search engines are likely to make far better use of supervised and semi-supervised machine learning techniques and become more accurate at extracting content through natural language processing. Websites will also play a more prominent role in helping search engines crawl and index content through API-driven integrations.
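Some of this push model already exists: protocols such as IndexNow let sites notify participating search engines about new or changed URLs rather than waiting to be crawled. The sketch below shows the general shape of that kind of integration; the endpoint, key, and payload here are illustrative placeholders rather than any protocol’s exact specification:

```python
import json
import urllib.request

# Illustrative values only -- the endpoint, key, and host are placeholders.
ENDPOINT = "https://search-engine.example/submit-urls"
PAYLOAD = {
    "host": "www.example.com",
    "key": "your-verification-key",
    "urlList": ["https://www.example.com/best-smartwatches"],
}

def notify_search_engine():
    """Push changed URLs to a search engine's submission API instead of waiting for a crawl."""
    request = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(PAYLOAD).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status  # a 2xx status typically means the submission was accepted

print(notify_search_engine())
```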
Ultimately, search engines will aim to become more efficient at crawling and indexing web pages as the web continues to grow exponentially. For website owners, having a solid information architecture and publishing original, high-quality content will only become more important to get crawled and indexed amongst the mass of new content published each day.