Indexing is the only bridge between publishing and ranking. This guide walks you through the real pipeline — discovery, crawl, parse, index — and gives you the diagnostics to fix the failures that keep pages invisible. No fluff. Just the bottleneck.
Indexing is not crawling. Crawling is the delivery truck. Indexing is the warehouse putting each box on the right shelf. Most site owners conflate the two, then wonder why a page with 500 crawl requests still shows Crawled - currently not indexed.
The index is Google's central database. If your page is not in it, you do not exist for organic search. Period. The pipeline has four stages: discovery (Google finds a URL), crawl (fetcher downloads the content), parse and render (extract text, execute JS), indexing decision (store or discard). Failure at any stage means no index entry.
A common situation we see: a site with 200k product pages, 80k indexed, 120k in the crawled - currently not indexed state. The owner blames server speed. In reality, the issue is index bloat — Google decided those pages were too similar to existing ones and dropped them. The fix is not more crawling. The fix is pruning and canonical consolidation.
One more nuance: indexing is not ranking. A page can be fully indexed and rank on page 12. But indexing is the prerequisite. No index, no chance.
URL must be found via sitemap, internal link, backlink, or Search Console submission. If none exist, the gate stays closed.
Googlebot fetches the URL. Server must return 200 OK (not 3xx/4xx/5xx) within a few seconds. Slow responses reduce crawl frequency.
Googlebot extracts raw HTML, then renders JavaScript. Blocked CSS/JS or heavy client-side rendering can cause empty or partial content.
Google evaluates content quality, uniqueness, and site authority. Pages with thin content, duplicate body, or soft 404s are often excluded.
If accepted, the URL enters the index. It can now appear in search results. But it can be removed later if the page changes or expires.
| Gate / Failure Mode | Primary Diagnostic Tool | Key Metric to Check | Common Mistake / Risk | Immediate Action |
|---|---|---|---|---|
| Discovery failure: URL not found by Google | Search Console > Sitemaps report | Submitted vs indexed count: large gap (e.g., 10k submitted, 2k indexed) | Submitting URLs that are reachable only through nofollow internal links, or that are listed in the sitemap but blocked by robots.txt | Add internal links from high-traffic pages; ensure sitemap URLs are reachable and not disallowed |
| Crawl failure: Googlebot gets 4xx/5xx or timeout | Crawl Stats report in Search Console | Average response time > 2s; error rate > 5% | Assuming a CDN fixes all timeouts; ignoring server-level rate limiting or WAF blocks | Optimize server TTFB; check server logs for 503 or 429 responses; use PageSpeed Insights to identify render-blocking resources |
| Parse failure: Google renders an empty page or JS fails | URL Inspection tool > Live Test | Screenshot shows blank or partial content; rendered HTML is missing key text | Using lazy-load without fallback content; relying on client-side rendered meta tags | Pre-render critical content on the server; use dynamic rendering for JS-heavy sites; test with JavaScript disabled |
| Indexing rejection: page crawled but not indexed | Search Console > Pages report | Count in the 'Crawled - currently not indexed' category (often > 10k pages) | Adding more pages instead of pruning; ignoring duplicate content signals | Consolidate thin pages into cluster hubs; add canonical tags; improve content depth to > 300 words of unique value |
Index bloat is the silent killer of crawl budget. Here is a worked example from a real audit.
The site: an e-commerce store selling 50k products with 5 color variants each = 250k product URLs. They had a filter system that generated URLs like /shoes?color=red&size=10&sort=price for every combination. Total URLs: 1.2 million.
The crawl: Googlebot attempted 80k requests per day. 60k of those hit filter/facet pages. Only 20k reached real product pages. Result: 15k product pages indexed out of 50k. Revenue lost.
The fix: we added noindex tags on all facet URLs with more than two parameters. We also set rel=canonical on variant pages to point to the main product. After 3 weeks, indexed product pages went from 15k to 41k. Crawl budget shifted from junk to real inventory.
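The facet rule from this audit (noindex on any URL with more than two parameters) can be expressed as a small check. This is a minimal sketch, not the audit's actual implementation: the function name `should_noindex` and the choice to count distinct parameter names are assumptions made for illustration.

```python
from urllib.parse import urlsplit, parse_qsl

MAX_PARAMS = 2  # threshold from the audit: more than two parameters -> noindex


def should_noindex(url: str, max_params: int = MAX_PARAMS) -> bool:
    """Return True when a URL carries more query parameters than allowed."""
    query = urlsplit(url).query
    # keep_blank_values=True so that "?sort=" still counts as a parameter
    params = parse_qsl(query, keep_blank_values=True)
    # Assumption: repeated keys (color=red&color=blue) count once
    distinct_keys = {key for key, _ in params}
    return len(distinct_keys) > max_params


if __name__ == "__main__":
    for candidate in ("/shoes", "/shoes?color=red", "/shoes?color=red&size=10&sort=price"):
        print(candidate, "->", "noindex" if should_noindex(candidate) else "index")
```

A template could call a check like this to decide whether to emit the robots meta tag; where the tag is actually rendered depends on your platform.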
The lesson: indexing is a finite resource. Every low-value URL you let into the index steals a slot from a high-value page. Use the Free XML Sitemap URL Extractor to audit your sitemap and identify which URLs are actually needed. Remove the rest.
| Option | What happens | Verdict |
|---|---|---|
| Prune approach | Remove low-value pages from the index via noindex, canonical, or 410. Frees crawl budget. Reduces index bloat. Works for 80% of large sites. | Prune first. Pruning is reversible. |
| Push approach | Keep all pages and try to force Google to index them via more sitemap submissions, internal links, and paid indexing services. Can work for high-authority sites but often fails for mid-tier domains. | Push second. Pushing without pruning is like pouring water into a leaky bucket. |
| Index Status Label | What It Actually Means | Likely Cause | User Impact | Recommended Fix |
|---|---|---|---|---|
| Submitted and indexed | URL is in the Google index. It can appear in search results. | Page is healthy, unique, and reachable. | Positive. Page has a chance to rank. | None. Monitor for changes. |
| Submitted but not indexed | Google has the URL from sitemap but decided not to index it yet or ever. | Low content quality, duplicate body, or site authority too low. | Page is invisible. Zero organic visitors. | Improve content depth; remove thin pages; build site authority. |
| Discovered - currently not indexed | Google found the URL via a link or sitemap, but has not crawled it yet (or crawl queue is full). | High crawl queue depth; low crawl priority; server responsiveness issues. | Page may appear later, but often stays in this state for months. | Increase internal links from high-authority pages; improve server speed. |
| Crawled - currently not indexed | Google fetched the page, parsed it, but chose not to index it. | Index bloat, weak content, or soft 404 (page looks empty to Google). | Page has been examined and rejected. Unlikely to ever be indexed without changes. | Rewrite or merge content; add unique value; check for empty templates. |
| Page with redirect | URL returns a 3xx status code. Google follows the redirect and evaluates the target for indexing. | Intentional redirect (301/302) or accidental chain. | The original URL is not indexed. The target URL may be indexed instead. | Update internal links to point directly to the target URL. |
| Not found (404) | URL returns a 404. Google drops it from the index after a short period. | Deleted page, broken link, or typo in sitemap. | Page is gone. No traffic from that URL. | Set up 301 redirects to relevant pages or remove from sitemap. |
Blocked CSS/JS: Google needs to render your page to understand it. If your robots.txt blocks /assets/ or /scripts/, Googlebot may see a blank or broken page. Use the URL Inspection tool's Live Test to check the rendered HTML. If it is empty, blocked or failing CSS/JS is the most likely cause.
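A quick way to spot this before opening Search Console is to test asset URLs against your robots.txt rules. The sketch below uses Python's standard `urllib.robotparser`; the helper name `blocked_assets` and the example rules and URLs are assumptions for illustration. Note that the standard-library parser does simple prefix matching and may not mirror every detail of how Google interprets wildcards.

```python
from urllib.robotparser import RobotFileParser


def blocked_assets(robots_txt: str, asset_urls: list[str], agent: str = "Googlebot") -> list[str]:
    """Return the asset URLs that the given robots.txt disallows for `agent`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in asset_urls if not parser.can_fetch(agent, url)]


# Hypothetical robots.txt that blocks the two directories named above
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /assets/
Disallow: /scripts/
"""

if __name__ == "__main__":
    urls = [
        "https://example.com/assets/app.css",
        "https://example.com/scripts/main.js",
        "https://example.com/index.html",
    ]
    for url in blocked_assets(EXAMPLE_ROBOTS, urls):
        print("blocked for Googlebot:", url)
```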
Soft 404s: A page that returns 200 but has no useful content (e.g., a search results page with 'no results found'). Google treats this as a soft 404 and drops it from the index. Check for pages with very low word count (< 50 words) and no user interaction.
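The word-count check can be automated as a rough first filter. The sketch below is a heuristic only, assuming you already have the status code and HTML of each page; the name `looks_like_soft_404` is hypothetical, and Google's own soft 404 detection is not public and certainly uses more than word count.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible words, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.words: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.split())


def looks_like_soft_404(status: int, html: str, min_words: int = 50) -> bool:
    """Flag pages that return 200 but carry fewer than `min_words` visible words."""
    if status != 200:
        return False  # real error codes are not *soft* 404s
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return len(extractor.words) < min_words
```

Pages flagged this way still need a manual look: a short but useful page (a contact page, for example) is not a soft 404.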
Hreflang mismatches: If you have multi-language pages and the hreflang tags point to non-indexed pages, Google may refuse to index any of them. We saw a site with 12 language variants where only 3 were indexed because the hreflang cluster was broken.
Canonical loops: Page A canonicals to Page B, Page B canonicals to Page A. Google sees ambiguity and may index neither. Use a crawler to detect canonical chains longer than 2 hops.
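If your crawler exports a URL-to-canonical mapping, loops and long chains can be detected by following the targets. This is a minimal sketch assuming that mapping is available as a plain dictionary; the function name `canonical_chain` and the `max_hops` default of 2 (taken from the threshold above) are illustrative choices.

```python
def canonical_chain(start: str, canonicals: dict[str, str], max_hops: int = 2):
    """Follow rel=canonical targets from `start`.

    Returns (chain, problem) where problem is "loop", "too_long", or None.
    A loop longer than `max_hops` is reported as "too_long" first.
    """
    chain = [start]
    seen = {start}
    current = start
    while True:
        target = canonicals.get(current)
        if target is None or target == current:
            return chain, None  # no tag or self-canonical: chain ends cleanly
        if target in seen:
            return chain + [target], "loop"
        chain.append(target)
        seen.add(target)
        current = target
        if len(chain) - 1 > max_hops:
            return chain, "too_long"


if __name__ == "__main__":
    mapping = {"/a": "/b", "/b": "/a"}  # Page A -> Page B -> Page A
    print(canonical_chain("/a", mapping))
```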
'Crawled - currently not indexed' means Google fetched the page but chose not to store it. Common causes: thin content (under 300 words), duplicate body text, or low site authority. Merge the page into a more comprehensive guide, add original research or data, and build internal links from high-traffic pages. Avoid adding more pages; prune and consolidate first.
Crawl budget is the number of URLs Googlebot will try to fetch per day (it varies by site authority). Index budget is a softer limit: Google stops indexing new pages once it decides the site has too many low-value URLs. For large sites (100k+ pages), index budget is the real bottleneck. Prune thin pages to free up index slots.
Use the Google Indexing API only for job posting or livestream pages (limited scope). For bulk checks, use the Search Console URL Inspection API, which returns the index status of one URL per request and is subject to a daily quota; it does not export the Pages report lists directly. For a free manual workflow, paste URLs into the URL Inspection tool one by one, or use the <a href='https://en.speedyindex.com/free-xml-sitemap-url-extractor/'>Free XML Sitemap URL Extractor</a> to export your sitemap URLs and then run a batch index check with a script that calls the Search Console API.
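The sitemap-export half of that workflow can also be done with the standard library. The sketch below assumes a standard `urlset` sitemap (not a sitemap index) already loaded as a string; the function name `extract_sitemap_urls` is hypothetical. The index check itself needs authenticated Search Console API calls and is deliberately left out.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> value found under <url> entries in a urlset sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text  # skip empty <loc/> elements
    ]


# Hypothetical two-URL sitemap for demonstration
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/shoes/red-runner</loc></url>
  <url><loc>https://example.com/shoes/blue-trail</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_sitemap_urls(SAMPLE):
        print(url)
```

The resulting list is what you would feed, one URL at a time, into whichever index check you use.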
'Discovered - currently not indexed' means Google knows the URL exists but has not crawled it yet (or the crawl queue is full). It often happens for pages with low internal link authority, a very deep site structure, or when the site has many URLs that Google considers low priority. To accelerate crawling, add direct links from the homepage or top navigation, and ensure the page loads fast. If a page stays in this state for 6+ months, it is unlikely to ever be crawled without changes.
Use noindex tags on all filter, sort, and facet URLs (e.g., /category?color=red). Set rel=canonical on product variant pages to the main product page. Regularly audit the sitemap to exclude out-of-stock or discontinued items. Aim for one product URL per physical product, not per combination. Monitor the Search Console Pages report for a spike in 'Crawled - currently not indexed'.
Top errors: (1) 'Submitted URL not found (404)': the page was deleted but is still in the sitemap. Remove it from the sitemap or set up a 301. (2) 'Soft 404': the page returns 200 but is empty. Add content or redirect. (3) 'Blocked by robots.txt': check robots.txt for disallow directives. (4) 'Crawled - currently not indexed': prune or improve content. (5) 'Alternate page with proper canonical tag': Google chose the canonical. If you disagree, fix the canonical tag.
The Indexing API is only available for pages with JobPosting structured data or BroadcastEvent structured data embedded in a VideoObject. You must verify ownership in Search Console and set up a service account in Google Cloud. It allows you to notify Google of new or updated content immediately. For all other page types, use sitemap submission and the URL Inspection tool. Do not use the API for regular content; Google does not support it for other page types.
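For eligible pages, a notification is a small JSON body posted to the API's publish endpoint. The sketch below only builds that body and does not send anything: the helper name `build_notification` is hypothetical, and an actual request additionally needs an OAuth 2.0 access token from your service account, which is not shown here.

```python
import json

# Publish endpoint of the Indexing API (v3); requests must be authenticated
INDEXING_ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"
VALID_TYPES = {"URL_UPDATED", "URL_DELETED"}


def build_notification(url: str, notification_type: str = "URL_UPDATED") -> str:
    """Return the JSON body for one URL notification (not sent by this function)."""
    if notification_type not in VALID_TYPES:
        raise ValueError(f"type must be one of {sorted(VALID_TYPES)}")
    return json.dumps({"url": url, "type": notification_type})


if __name__ == "__main__":
    # Hypothetical job posting URL
    print("POST", INDEXING_ENDPOINT)
    print(build_notification("https://example.com/jobs/42"))
```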
Googlebot can render JavaScript, but rendering adds a second pass (crawl the HTML first, then render the JS). This can delay indexing by days or weeks. If your JS fails or is blocked, Google sees an empty page. Best fix: server-side render (SSR) critical content. Dynamic rendering (serving pre-rendered HTML to Googlebot and full JS to users) also works, but Google describes it as a workaround rather than a long-term solution. Test with Google's URL Inspection tool in 'Live Test' mode.
Googlebot uses a time budget per crawl session. Slow pages (load > 3s) consume more time, so Googlebot crawls fewer pages per session. Check your site's performance using <a href='https://pagespeed.web.dev/'>PageSpeed Insights</a>. Pages with poor Core Web Vitals scores (LCP > 2.5s, CLS > 0.1) often sit on slow servers, so they tend to be crawled less frequently and may be deprioritized for indexing. Improving server speed usually increases crawl depth and index coverage.
1. Check the Search Console Pages report: note the counts for 'submitted not indexed' and 'crawled not indexed'.
2. Crawl your sitemap with a tool and verify all URLs return 200.
3. Review robots.txt for accidental blocks.
4. Test 5 random new pages with the URL Inspection tool for index status.
5. Check server logs for Googlebot error rates > 5%.
6. Monitor Core Web Vitals for any regression.

The weekly check takes 15 minutes and prevents index drift.
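The server-log step of this checklist can be approximated with a short script. This is a sketch under assumptions: it expects access logs in the common "combined" format with the user agent as the last quoted field, and it trusts the user-agent string, which can be spoofed (verifying real Googlebot traffic via reverse DNS is not shown). The function name `googlebot_error_rate` is hypothetical.

```python
import re

# Matches: "METHOD /path PROTOCOL" STATUS ... "user agent" at end of line
LOG_PATTERN = re.compile(r'"\S+ \S+ \S+" (?P<status>\d{3}) .*"(?P<agent>[^"]*)"$')


def googlebot_error_rate(log_lines) -> float:
    """Share of Googlebot requests that received a 4xx or 5xx status."""
    total = errors = 0
    for line in log_lines:
        match = LOG_PATTERN.search(line.strip())
        if not match or "Googlebot" not in match.group("agent"):
            continue  # unparseable line or a different client
        total += 1
        if match.group("status")[0] in "45":
            errors += 1
    return errors / total if total else 0.0


if __name__ == "__main__":
    # "access.log" is a placeholder path; point it at your own log file
    with open("access.log", encoding="utf-8", errors="replace") as handle:
        rate = googlebot_error_rate(handle)
    print(f"Googlebot error rate: {rate:.1%}", "(above 5% threshold)" if rate > 0.05 else "")
```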