Webinar Insights: Indexing and Crawling

October 7, 2021

Getting a web page indexed doesn’t guarantee that it will rank well in search engines and bring organic traffic to your site. However, not getting your web pages indexed guarantees that they will not rank. In other words, you’ve got to be in it to win it: your pages need to be discoverable, crawled, and indexed if you want traffic. This article focuses on some of the key insights about how search engines crawl and index pages from our recent webinar Indexing and Crawling: What You Should Know. 

The Finite Resources of Search Engines

Indexing is only possible if your website’s pages are discoverable so that search engines can crawl them. A critical point is that search engines have finite resources—their budgets for crawling are constrained by three factors:

  1. Financial costs, such as electricity for running servers and hiring staff to maintain those resources
  2. Computational costs stemming from the need to increase computing resources, such as servers, to crawl growing volumes of web pages
  3. Environmental costs: sustainability is a tenet of how companies like Google operate now and intend to operate in the future, so crawling more pages leaves a larger environmental footprint


These three factors lead search engines to weigh efficiency against effectiveness when crawling and indexing websites. Starting from an initial fetch of web pages that meet a quality and relevance threshold for a particular search term, the search engine then fine-tunes the top results over multiple phases of crawling.


Even if a search engine had infinite resources, its value proposition would remain the same: providing customers (anyone searching for something online) with the best results. In a world of infinite resources, the search engine would still only select the most relevant, high-quality pages for particular search terms. It’s the job of website owners and content creators to create those high-quality pages.

How Search Engines Determine Importance

Search engines such as Google or Bing ultimately want to answer one question when choosing whether to crawl, index, and rank a page:


Is this URL worth it?


Given the size of the Internet, the volume of pages published each day, and finite resources, search engines need a systematic approach to crawling and indexing. For context, Bing discovers 70 billion new URLs every single day. The system for crawling is based on:


  • Discovery: going out and finding new URLs
  • Schedules: machine learning systems predict when pages meaningfully change and recrawl them
  • Queues
  • Thresholds for being crawled and indexed
  • Tiers of importance

The prioritization for crawling is mostly driven by importance. This importance is influenced by two factors (see the sketch after this list):


  1. Demand: dynamically figuring out what customers want, and crawling or pruning pages from results as appropriate based on search trends, seasonality, and so on
  2. Safe crawling: protecting website owners by not overburdening their sites with excessive crawl requests

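To make the queue-and-importance model above concrete, here is a minimal Python sketch of how a crawl scheduler along these lines could work. It only illustrates the ideas from the webinar, not how any real search engine is implemented; the threshold, politeness delay, and importance scores are invented for the example.

```python
import heapq
import time
from urllib.parse import urlparse

# Invented values for illustration: a minimum importance score to be crawled
# at all, and a per-host politeness delay for "safe crawling".
CRAWL_THRESHOLD = 0.5
MIN_SECONDS_BETWEEN_HITS = 10

class CrawlScheduler:
    """Toy crawl queue ordered by importance, with a crawl threshold and a
    simple rule that avoids overburdening any single host."""

    def __init__(self):
        self._queue = []       # max-heap via negated importance scores
        self._last_hit = {}    # host -> timestamp of the most recent request

    def discover(self, url, importance):
        # Demand-driven scoring happens upstream; URLs below the threshold
        # are never queued (the "is this URL worth it?" question).
        if importance >= CRAWL_THRESHOLD:
            heapq.heappush(self._queue, (-importance, url))

    def next_url(self):
        # Return the most important URL whose host was not hit too recently;
        # deferred URLs go back into the queue rather than being dropped.
        deferred, chosen = [], None
        while self._queue:
            neg_score, url = heapq.heappop(self._queue)
            host = urlparse(url).netloc
            if time.time() - self._last_hit.get(host, 0.0) >= MIN_SECONDS_BETWEEN_HITS:
                self._last_hit[host] = time.time()
                chosen = url
                break
            deferred.append((neg_score, url))
        for item in deferred:
            heapq.heappush(self._queue, item)
        return chosen

scheduler = CrawlScheduler()
scheduler.discover("https://example.com/best-smartwatches", importance=0.9)
scheduler.discover("https://example.com/tag/page-97", importance=0.2)  # below threshold, never queued
print(scheduler.next_url())  # https://example.com/best-smartwatches
```

In this toy model, low-demand URLs are pruned before they ever reach the queue, and URLs from a recently visited host are deferred rather than dropped, mirroring the demand and safe-crawling factors above.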

When crawling a specific website, search engines use a breadth-first approach, which finds pages along the shortest available paths by following layers downward (e.g. homepage -> pillar page -> sub-page). This approach contrasts with the alternative of going deep, which would mean going straight from the home page to lower-level pages deep within the URL hierarchy. 
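As a rough illustration, a breadth-first traversal of a site can be sketched in a few lines of Python. The `get_internal_links` helper is a placeholder for fetching a page and extracting its internal links; the point is the layer-by-layer order in which URLs are visited.

```python
from collections import deque

def breadth_first_crawl(homepage, get_internal_links, max_pages=100):
    """Visit pages layer by layer (homepage -> pillar pages -> sub-pages),
    so shallow, well-linked URLs are reached before deep ones.
    `get_internal_links(url)` is a placeholder for fetching a page and
    extracting its internal links."""
    queue = deque([homepage])
    seen = {homepage}
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()          # FIFO queue = breadth-first
        order.append(url)
        for link in get_internal_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)     # deeper layers wait their turn
    return order
```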


The idea of the breadth-first approach is to ensure search engines follow the most important URLs. Many lower-level URLs could be redundant, boilerplate pages, and indexing all of them in search results would worsen the user experience for people searching online. Search engines attempt to detect clusters of very similar pages on your site and find a canonical (master copy) of the page among them to index. 
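The clustering itself is far more sophisticated in practice, but the basic idea can be sketched as grouping pages with effectively identical content and choosing one representative per group. In the simplified Python sketch below, "identical after crude normalization" stands in for real similarity measures, and "shortest URL" stands in for the signals (internal links, sitemaps, rel=canonical hints) a search engine actually weighs when picking the canonical.

```python
import hashlib
from collections import defaultdict

def normalize(text):
    # Crude normalization: lowercase and collapse whitespace. Real systems
    # compare content with far richer similarity signals than an exact hash.
    return " ".join(text.lower().split())

def pick_canonicals(pages):
    """Group pages whose normalized body text is identical and pick one
    canonical URL per group. `pages` maps URL -> body text; the shortest
    URL is an arbitrary stand-in for real canonical-selection signals."""
    clusters = defaultdict(list)
    for url, body in pages.items():
        digest = hashlib.sha256(normalize(body).encode("utf-8")).hexdigest()
        clusters[digest].append(url)
    return {min(urls, key=len): urls for urls in clusters.values()}
```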

How To Make Your Pages Discoverable

For your web pages to be discoverable in the first place, you need to get the technical basics right. Some actionable ways to make your content discoverable are:


  • Make sure pages are well connected
  • Generate an XML sitemap that indicates the structure of your site to search engines and includes all pages you want indexed (see the sketch after this list)
  • Consistently use intent- and query-driven anchor text when linking pages internally
  • Group topics into strongly related clusters
  • Make sure pages are crawlable and renderable
  • Consolidate weaker related pages into one stronger page about a topic
  • Indicate in URLs and file names what those resources are about (e.g. example.com/best-smartwatches)

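For the XML sitemap mentioned above, the format itself is simple: a list of <url> entries, each with a required <loc> and optional hints such as <lastmod>. The snippet below is a minimal Python sketch of generating one with the standard library; the URLs and dates are placeholders.

```python
from xml.etree.ElementTree import Element, SubElement, tostring

def build_sitemap(urls):
    """Build a minimal XML sitemap following the sitemaps.org protocol.
    Only <loc> is required; <lastmod> helps search engines schedule recrawls."""
    urlset = Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, lastmod in urls:
        url = SubElement(urlset, "url")
        SubElement(url, "loc").text = loc
        if lastmod:
            SubElement(url, "lastmod").text = lastmod
    return tostring(urlset, encoding="unicode")

# Placeholder URLs for illustration.
print(build_sitemap([
    ("https://example.com/", "2021-10-07"),
    ("https://example.com/best-smartwatches", "2021-10-01"),
]))
```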

Consider the search engine as your customer—you want to have a solid information architecture in place that helps your customer easily find and navigate the various pages on your website. In other words, by viewing the search engine as a user trying to digest your content, you can better serve the search engine’s needs and improve your chances of being crawled and indexed. 

Why Does Consistency Matter to Search Engines?

If you can be consistent in applying these practices across your web pages, you’re in a better position to be discovered, crawled, and indexed. The reason that consistency in a site’s structure matters comes down to the machine learning algorithms that underpin how search engines function. 


Search engines’ machine learning algorithms essentially judge which URLs among a set of discoverable pages are good and worth indexing, and which aren’t worth indexing for specific queries. The training set these algorithms use to improve that judgment is the entire Internet.


Consistency in site structure and information architecture matters because similar consistencies already exist in currently ranked websites. If you take a random sample of the Internet, patterns emerge across sites, such as the use of internal search, about pages, product pages, blog pages, etc. Machine learning algorithms notice these patterns, which means that consistently applying them to your own site helps the algorithms more easily retrieve and index your pages.


The role of consistency is particularly important when considering whether to use a content management system for your site. You should have a really compelling business reason to use a bespoke, custom-coded website. Search engine algorithms already recognize the site information architecture patterns of popular CMS platforms and website builders, which makes it easier to get indexed out-of-the-box. 

How Search Engines Judge Quality

The advice for getting your indexed pages ranking well in search engines often focuses on providing quality content, but how exactly do search engines algorithmically judge quality as they crawl a website? Here are some pointers:


  • Write original content
  • Don’t stuff keywords into pages
  • Avoid publishing machine-generated content 


Understand that each page matters. If you have a website that is 20 percent great content and 80 percent junk or spun content, the site will ultimately be flagged as low quality and it will be very hard to rank well with any page. This brings the discussion full circle to the customer analogy—if you provide customers with low-quality products most of the time and high-quality products only a small fraction of the time, they won’t return to do business with you.

The Future of Crawling and Indexing

Over the next five to ten years, search engines are likely to get far better at supervised and semi-supervised machine learning and more accurate at content extraction through natural language processing. Websites will play a more prominent role in helping search engines crawl and index content through API-driven integrations.
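As a hypothetical illustration of what such an API-driven integration could look like, the Python sketch below pushes a list of new or updated URLs to an indexing endpoint. The endpoint, key, and payload shape are invented for the example (existing protocols such as IndexNow work along broadly similar lines), so treat it as a sketch rather than a working integration.

```python
import json
import urllib.request

# Hypothetical endpoint and key: real indexing APIs define their own URLs,
# payloads, and key-verification steps.
INDEXING_ENDPOINT = "https://search-engine.example/indexnow"
API_KEY = "your-site-verification-key"

def notify_search_engine(host, changed_urls):
    """Push a list of new or updated URLs so the crawler doesn't have to
    rediscover them on its own schedule."""
    payload = json.dumps({
        "host": host,
        "key": API_KEY,
        "urlList": changed_urls,
    }).encode("utf-8")
    request = urllib.request.Request(
        INDEXING_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status

# Example call (would fail against the made-up endpoint above):
# notify_search_engine("example.com", ["https://example.com/best-smartwatches"])
```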


Ultimately, search engines will aim to become more efficient at crawling and indexing web pages as the web continues to grow exponentially. For website owners, having a solid information architecture and publishing original, high-quality content will only become more important to get crawled and indexed amongst the mass of new content published each day.       

