Stop throwing 50,000 URLs into one file and hoping Google figures it out. Real sitemap architecture means splitting by category, enforcing priority logic, and catching errors before they waste your crawl budget. This guide covers the operational details that matter when you manage 10,000+ pages.
A single sitemap file can hold up to 50,000 URLs and must not exceed 50 MB uncompressed. Exceed either limit and Google will refuse to process the file. That is not a suggestion — it is a hard boundary.
For an e-commerce site with 200,000 product pages, that means at least four sitemap files. But cutting the list into four arbitrary chunks and listing them in an index file is a mistake. The real work begins when you decide how to split those URLs. We recommend splitting by logical category: products, categories, brands, blog, and static pages. Each category gets its own sitemap (or several, if it exceeds the limits), all referenced in a single sitemap index file.
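The limit arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not a full generator: `files_needed` and `chunk_urls` are hypothetical helper names, and the sketch checks only the 50,000-URL limit, so the 50 MB size limit still has to be verified on the files you write.

```python
from math import ceil

MAX_URLS_PER_SITEMAP = 50_000  # per-file URL limit from the sitemap protocol


def files_needed(url_count, limit=MAX_URLS_PER_SITEMAP):
    """Return the minimum number of sitemap files for url_count URLs."""
    return ceil(url_count / limit)


def chunk_urls(urls, limit=MAX_URLS_PER_SITEMAP):
    """Split a URL list into consecutive chunks of at most `limit` entries."""
    return [urls[i:i + limit] for i in range(0, len(urls), limit)]


print(files_needed(200_000))  # 4
```

Within a single category (for example, products), the same chunking applies whenever that category alone exceeds the limit.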
In practice, when you split by category you also gain the ability to set per-category priority and change frequency. A product page with a price update every day should have a change frequency of 'daily'. A static 'About Us' page can be 'monthly'. This granularity signals which sections change most often; crawlers treat it as a hint and may ignore it.
A common situation we see is teams generating one massive sitemap with mixed priorities — 0.8 for the homepage, 0.9 for a blog post, and 0.3 for a product. That inconsistency confuses the crawler. Set a rule: priority 1.0 only for the homepage and top-level category pages. Products get 0.6-0.8. Blog posts get 0.5. Static pages get 0.3. Stick to it.
| Sitemap Category | Recommended Priority Range | Change Frequency | Hidden Risk / Failure Mode |
|---|---|---|---|
| Products (one sitemap per 50k URLs max, split by subcategory when needed) | 0.6 - 0.8 | Daily or weekly depending on stock/price updates | Thin product pages (no description, no reviews) still get indexed. Filter out products with zero stock or no images before including them. |
| Categories (include only the canonical category page, not faceted filters) | 0.8 - 0.9 | Weekly | Faceted URLs (color=red&size=large) can leak into the sitemap and cause massive duplicate content. Block them at the XML generation step. |
| Brands / Designers (one sitemap per brand if you have 10k+ brand pages) | 0.5 - 0.7 | Weekly | Brand pages with fewer than 5 products should be excluded. They are weak pages and waste crawl budget. |
| Blog / Articles (date-based splitting for news-heavy sites) | 0.5 - 0.6 | Daily for news, weekly for evergreen | Old articles with broken links or no internal links. Set a max age (e.g., 2 years) and remove outdated posts. |
| Static Pages (about, contact, shipping, returns, FAQ) | 0.3 - 0.5 | Monthly | These rarely change, but they often contain thin content. Add a canonical tag and ensure they are not noindexed. |
1. Extract all live, indexable URLs from your CMS or database. Exclude noindex, canonicalized, and redirecting URLs at this stage.
2. Remove thin pages (under 200 words), out-of-stock products, and duplicate faceted URLs. Apply your category split rules.
3. Generate one XML file per category, respecting the 50k URL and 50 MB limits. Set lastmod, changefreq, and priority per category.
4. Reference all category sitemaps in a single index file. Validate the XML against the sitemap schema using an online validator before submission.
5. Submit to Google Search Console and Bing Webmaster Tools. Check for errors (404s, blocked URLs, schema issues) weekly.
6. Regenerate sitemaps on a schedule (daily for e-commerce, hourly for news). Only re-submit if the index file changes.
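The generation and index steps above could look roughly like the following sketch, which uses only Python's standard library. The function names `build_urlset` and `build_index` and the example.com URLs are assumptions for illustration; a production script would also stream large files to disk and check the 50 MB limit.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_urlset(entries, changefreq, priority):
    """Build one category sitemap; entries is an iterable of (loc, lastmod) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod  # W3C date, e.g. 2024-05-01
        ET.SubElement(url, "changefreq").text = changefreq
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


def build_index(sitemap_urls):
    """Build the sitemap index that references every category sitemap."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
    return ET.tostring(index, encoding="utf-8", xml_declaration=True)


products_xml = build_urlset(
    [("https://www.example.com/p/1", "2024-05-01")], "daily", 0.7
)
index_xml = build_index(["https://www.example.com/sitemap-products-1.xml"])
```

Both functions return bytes, so `len(products_xml)` gives the uncompressed size to compare against the 50 MB limit before the file is published.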
Scenario: An online furniture store has 125,000 indexable URLs: 95,000 product pages, 15,000 category/filter pages, 12,000 brand pages, and 3,000 blog posts.
Step 1: Filter out 20,000 product pages that are permanently out of stock. Remove 5,000 faceted filter URLs (e.g., /sofas?material=leather&color=brown). Remove 500 blog posts older than 2 years with zero internal links. Final count: 99,500 URLs.
Step 2: Split into category sitemaps: Products (75,000 URLs -> 2 files of 37,500 each), Brands (12,000 URLs -> 1 file), Categories (10,000 canonical category pages -> 1 file), Blog (2,500 articles -> 1 file), plus Static (500 pages that were not part of the scenario count above -> 1 file). Total: 6 sitemap files covering 100,000 URLs.
Step 3: Set priorities: Products 0.7, Brands 0.5, Categories 0.9, Blog 0.5, Static 0.3. Change frequencies: Products daily, Brands weekly, Categories weekly, Blog weekly, Static monthly.
Step 4: Create a sitemap index file with <sitemap> entries for each of the 6 files. Submit it to Search Console. Result: the submitted sitemaps list only clean, indexable URLs, so Google spends no sitemap-driven requests on thin pages or duplicates.
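The split in Step 2 can be reproduced with a short calculation. The sketch below assumes an even split across the minimum number of files, which is how 75,000 product URLs become two files of 37,500; `balanced_sizes` is a hypothetical helper name.

```python
from math import ceil

LIMIT = 50_000  # URLs per sitemap file


def balanced_sizes(url_count, limit=LIMIT):
    """Return per-file URL counts, spread evenly over the minimum number of files."""
    if url_count <= 0:
        return []
    files = ceil(url_count / limit)
    base, extra = divmod(url_count, files)
    return [base + 1] * extra + [base] * (files - extra)


filtered_counts = {
    "products": 75_000,
    "brands": 12_000,
    "categories": 10_000,
    "blog": 2_500,
    "static": 500,
}
plan = {name: balanced_sizes(count) for name, count in filtered_counts.items()}
total_files = sum(len(sizes) for sizes in plan.values())
print(plan["products"], total_files)  # [37500, 37500] 6
```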
Blocked URLs in sitemap. A common situation we see is a sitemap containing URLs that are blocked by robots.txt or return a 403 status. Google will report these as 'Blocked' in Search Console. Fix: run a crawl with Screaming Frog or a similar tool and cross-reference the sitemap URLs against the crawl log. Remove or unblock any URL that returns a non-200 status.
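A first pass at the robots.txt part of this check can be done offline with Python's standard `urllib.robotparser`. This is a sketch under assumptions: `blocked_urls` is a hypothetical helper, the robots.txt content is passed in as a string, and the standard-library parser does simple prefix matching, so it may not reproduce every wildcard rule the way Googlebot does.

```python
from urllib.robotparser import RobotFileParser


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the sitemap URLs that the given robots.txt disallows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


robots = "User-agent: *\nDisallow: /checkout/\n"
sitemap_urls = [
    "https://www.example.com/sofas",
    "https://www.example.com/checkout/cart",
]
print(blocked_urls(robots, sitemap_urls))
# ['https://www.example.com/checkout/cart']
```

Status codes such as 403 still need a live crawl; this only covers the robots.txt side.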
Wrong filters. One e-commerce client included every product variant (color, size) in the sitemap. That created 300,000 URLs for 1,000 products, and Google wasted three weeks re-crawling duplicates before we cleaned it up. The fix: include only the canonical product URL in the sitemap, and point each variant URL at it with <link rel="canonical">.
Bad data from CMS. If your CMS exports a list of URLs that includes draft pages or staging URLs (e.g., /products/draft-123), your sitemap will contain 404s. Set a strict filter: only export pages with status 'published' and production hostname.
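A strict export filter of this kind might look like the sketch below. The record shape (a dict with `url` and `status` keys) and the hostname `www.example.com` are assumptions; adapt both to whatever your CMS export actually produces.

```python
from urllib.parse import urlsplit

PRODUCTION_HOST = "www.example.com"  # hypothetical production hostname


def is_exportable(record):
    """Keep only published pages that live on the production hostname."""
    host = urlsplit(record["url"]).hostname
    return record.get("status") == "published" and host == PRODUCTION_HOST


def exportable_urls(records):
    """Return the URLs that are safe to write into a sitemap."""
    return [record["url"] for record in records if is_exportable(record)]
```

Filtering on an explicit allowlist (status and hostname) is safer than trying to pattern-match draft paths, because new draft URL patterns cannot slip through.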
Duplicate lists. When two sitemaps list the same URL with different priority or lastmod values, Google receives conflicting signals and you cannot control which one it uses. Deduplicate across all sitemaps before building the index.
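Cross-sitemap deduplication can be sketched as a single pass with a shared set. The rule used here, that the first sitemap to list a URL keeps it, is an assumption; order the input so the sitemap you want to win comes first.

```python
def dedupe_across(sitemaps):
    """Remove repeated URLs across sitemaps; the first sitemap listing a URL keeps it.

    sitemaps maps a sitemap name to its list of URLs, in precedence order.
    """
    seen = set()
    result = {}
    for name, urls in sitemaps.items():
        kept = []
        for url in urls:
            if url not in seen:
                seen.add(url)
                kept.append(url)
        result[name] = kept
    return result
```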
Limits and slow vendors. Some third-party sitemap generators (especially WordPress plugins) break at 10,000 URLs. If your site has 100,000 URLs, you need a custom script or a tool like the free XML sitemap URL extractor to audit what is actually in your generated files before submission.
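Priority and change frequency are hints, not directives. Google's current sitemap documentation goes further and says it ignores both values, while it does use lastmod when that field is consistently accurate; other search engines may still read priority and changefreq. Treat them as a low-cost signal and as a discipline for your own generation rules. If you have 10,000 product pages and only 500 category pages, setting category priority to 0.9 and product priority to 0.6 expresses the intent 'crawl categories first' to any crawler that honors the field.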
For video content, the rules are stricter. Each entry needs a <video:video> element with video:title, video:description, and video:thumbnail_loc, plus either video:content_loc or video:player_loc. Google's video sitemap documentation lists these as required, so an entry missing any of them may be skipped. If your e-commerce site has product videos, a separate video sitemap keeps them easier to audit, although Google also accepts video tags inside a standard sitemap.
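As an illustration, a single video entry can be built with Python's standard `xml.etree.ElementTree`. The sketch assumes the video namespace `http://www.google.com/schemas/sitemap-video/1.1` and includes `video:content_loc` alongside the three fields named above, since Google's documentation asks for either a content or a player location; `video_entry` and all example.com URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
VIDEO_NS = "http://www.google.com/schemas/sitemap-video/1.1"
ET.register_namespace("", SITEMAP_NS)       # serialize sitemap tags without a prefix
ET.register_namespace("video", VIDEO_NS)    # serialize video tags as video:*


def video_entry(page_url, title, description, thumbnail_url, content_url):
    """Build one <url> element carrying a <video:video> block."""
    url = ET.Element(f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
    video = ET.SubElement(url, f"{{{VIDEO_NS}}}video")
    ET.SubElement(video, f"{{{VIDEO_NS}}}thumbnail_loc").text = thumbnail_url
    ET.SubElement(video, f"{{{VIDEO_NS}}}title").text = title
    ET.SubElement(video, f"{{{VIDEO_NS}}}description").text = description
    ET.SubElement(video, f"{{{VIDEO_NS}}}content_loc").text = content_url
    return url
```

Append the returned elements to a `urlset` root in the same namespace to produce the complete video sitemap.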
Regenerate sitemaps daily and auto-submit via Search Console API.
Cross-reference sitemap URLs against live crawl data every week.
Remove any URL that returns 4xx, 5xx, or is blocked by robots.txt.
Do not include paginated URLs (page=2, page=3) — only the first page.
Verify that the sitemap index file is valid XML and references only 200-status sitemaps.
For news sites, split by date (one sitemap per day or week) and set lastmod to the article's publish date, or to its last edit date if the article has been updated since.
Monitor Search Console for 'Submitted URLs not indexed' — if the count exceeds 5%, investigate.
You need at least 4 sitemaps to stay under the 50k URL limit. But we recommend splitting further: one sitemap per product category (furniture, lighting, decor) and additional sitemaps for brands, blog, and static pages. That gives you 6-10 sitemaps, all referenced in one index file. This structure lets you set per-category priority and makes debugging easier when a category has indexing issues.
Do not include faceted URLs in your sitemap. They create duplicate content and waste crawl budget. Use canonical tags on the faceted pages pointing back to the main category page. If you must include some filtered pages (e.g., 'sofas under $500'), create a separate curated sitemap for those and set priority lower than the main category. Block all other faceted patterns at the XML generation step.
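One way to block faceted patterns at the generation step is to reject every URL that carries a query string unless it is on a curated allowlist. This sketch assumes your canonical pages never use query strings, which is not true of every site; `CURATED_FILTER_URLS` and the example URLs are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist of filtered pages that deserve their own sitemap entry.
CURATED_FILTER_URLS = {
    "https://www.example.com/sofas?max_price=500",
}


def allowed_in_sitemap(url, curated=CURATED_FILTER_URLS):
    """Allow clean URLs and curated filter pages; reject every other query-string URL."""
    if url in curated:
        return True
    return urlsplit(url).query == ""


candidates = [
    "https://www.example.com/sofas",
    "https://www.example.com/sofas?material=leather&color=brown",
    "https://www.example.com/sofas?max_price=500",
]
print([u for u in candidates if allowed_in_sitemap(u)])
```

An allowlist fails safe: a new facet parameter added by the storefront is excluded by default and does not leak into the sitemap.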
Regenerate your sitemap index every hour if you publish more than 10 articles per day. Use a cron job or serverless function to build the XML and upload it to your server. Only resubmit to Google when the index file changes (new sitemap added or removed). For individual article sitemaps, update the lastmod field each time the article is edited. Google will re-crawl it faster if lastmod changes.
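The "only resubmit when the index file changes" rule can be implemented by comparing a digest of the freshly built index against the digest stored from the previous run. This is a sketch: `index_changed` is a hypothetical helper, and where you persist the previous digest (file, database, key-value store) is left to your setup.

```python
import hashlib


def index_changed(new_index_bytes, previous_digest):
    """Return (changed, digest) for a freshly generated sitemap index.

    previous_digest is the hex digest saved by the last run, or None on the first run.
    """
    digest = hashlib.sha256(new_index_bytes).hexdigest()
    return digest != previous_digest, digest
```

The cron job or serverless function would call this after each build, resubmit only when the first value is true, and then store the returned digest for the next run.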
Check three things: 1) The sitemap URL returns a 200 HTTP status, not a redirect or 404. 2) The response matches the file name: Google accepts gzip-compressed sitemaps, but a URL ending in .xml should be served as XML (Content-Type text/xml or application/xml), not as a gzip archive. 3) The server has no rate limiting or IP blocking for Googlebot. If all three are fine, use the [free XML sitemap URL extractor](https://en.speedyindex.com/free-xml-sitemap-url-extractor/) to verify the file contents manually.
Use a rule-based system: 1.0 for homepage and top-level categories, 0.9 for subcategories, 0.7 for product pages, 0.5 for blog posts, 0.3 for static pages. Do not assign 1.0 to every page — that tells Google nothing. For e-commerce, set product priority based on sales velocity: best-sellers at 0.8, slow movers at 0.5. Implement this logic in your sitemap generation script, not manually.
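The rule set above could be encoded in the generation script roughly as follows. The page-type names and the unit thresholds that define a best-seller or slow mover are assumptions made for this sketch; the source only gives the resulting priority values.

```python
BASE_PRIORITY = {
    "home": 1.0,
    "top_category": 1.0,
    "subcategory": 0.9,
    "product": 0.7,
    "blog": 0.5,
    "static": 0.3,
}

# Hypothetical sales-velocity thresholds (units sold per month).
BEST_SELLER_MIN_UNITS = 100
SLOW_MOVER_MAX_UNITS = 5


def priority_for(page_type, monthly_units=None):
    """Return the sitemap priority for a page; products may be adjusted by sales."""
    if page_type == "product" and monthly_units is not None:
        if monthly_units >= BEST_SELLER_MIN_UNITS:
            return 0.8  # best-seller
        if monthly_units <= SLOW_MOVER_MAX_UNITS:
            return 0.5  # slow mover
    return BASE_PRIORITY[page_type]
```

An unknown page type raises `KeyError`, which is deliberate here: a page without a rule should stop the build, not silently receive a default.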
No. Out-of-stock pages are thin content and waste crawl budget. Remove them from the sitemap until they are back in stock. If you have a 'notify when available' page, keep it in the sitemap only if it has unique content (e.g., reviews, size guide). Otherwise, exclude it. Google will still find the URL through internal links if it matters.
Download your current sitemap index and all sub-sitemaps. Use a tool like the free XML sitemap URL extractor to pull every URL into a CSV. Then run a crawl of your live site and compare the two lists. Look for URLs in the sitemap that return 4xx/5xx, are blocked by robots.txt, or are noindexed. Those are the errors you need to fix before migration. Expect to find 5-15% of URLs with issues.
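If you prefer to script the comparison, the sketch below parses a sitemap with the standard library and checks each URL against crawl results. The crawl data shape (a dict mapping URL to `status`, `blocked`, and `noindex` fields) is an assumption; map your crawler's export into it.

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Extract every <loc> value from a urlset document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]


def find_issues(urls, crawl):
    """Return {url: reason} for sitemap URLs that should not be submitted."""
    issues = {}
    for url in urls:
        result = crawl.get(url)
        if result is None:
            issues[url] = "missing from crawl"
        elif result["status"] >= 400:
            issues[url] = f"HTTP {result['status']}"
        elif result.get("blocked"):
            issues[url] = "blocked by robots.txt"
        elif result.get("noindex"):
            issues[url] = "noindex"
    return issues
```

Dividing `len(issues)` by `len(urls)` gives the issue rate to compare against the range mentioned above.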
Create one sitemap per language (e.g., /sitemap-en.xml, /sitemap-de.xml). Within each sitemap, include the hreflang annotations using <xhtml:link> tags for every URL that has a language variant. The sitemap index file should list all language-specific sitemaps. Do not mix languages in the same sitemap. Google uses these annotations to serve the correct language result to users.
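A single annotated entry might be generated like this. The sketch uses the XHTML namespace `http://www.w3.org/1999/xhtml` for the `xhtml:link` elements and lists every language variant, including the page itself, on each entry; `hreflang_entry` and the example.com URLs are hypothetical.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", SITEMAP_NS)
ET.register_namespace("xhtml", XHTML_NS)


def hreflang_entry(loc, variants):
    """Build a <url> element with one xhtml:link per language variant.

    variants maps a language code (e.g. "en", "de") to that variant's URL.
    """
    url = ET.Element(f"{{{SITEMAP_NS}}}url")
    ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = loc
    for lang, href in variants.items():
        ET.SubElement(
            url, f"{{{XHTML_NS}}}link", rel="alternate", hreflang=lang, href=href
        )
    return url


variants = {
    "en": "https://www.example.com/en/sofas",
    "de": "https://www.example.com/de/sofas",
}
en_entry = hreflang_entry(variants["en"], variants)  # goes into /sitemap-en.xml
de_entry = hreflang_entry(variants["de"], variants)  # goes into /sitemap-de.xml
```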
Yes, within limits. Create a separate 'new products' sitemap that contains only products added in the last 30 days. Set priority to 0.9 and change frequency to daily. Reference this sitemap in your main index file. A small, frequently updated sitemap makes new URLs easy for Google to discover, though it does not guarantee faster crawling. Remove products from this sitemap after 30 days and let them appear only in the main product sitemap.
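The 30-day window can be applied at generation time with a simple date filter. The product record shape (a dict with `url` and an `added` date) is an assumption for this sketch.

```python
from datetime import date, timedelta


def new_product_urls(products, today, window_days=30):
    """Return URLs of products added within the last window_days days."""
    cutoff = today - timedelta(days=window_days)
    return [p["url"] for p in products if p["added"] >= cutoff]


catalog = [
    {"url": "https://www.example.com/p/new-lamp", "added": date(2024, 5, 20)},
    {"url": "https://www.example.com/p/old-sofa", "added": date(2024, 1, 10)},
]
print(new_product_urls(catalog, today=date(2024, 6, 1)))
# ['https://www.example.com/p/new-lamp']
```

Because the list is rebuilt on every run, products drop out of the 'new products' sitemap automatically once they pass the cutoff.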
For sites over 100k URLs, avoid WordPress plugins and use a custom script (Python with the lxml library, or a server-side cron job). For smaller sites, Screaming Frog can generate sitemaps but it has a 500 URL limit in the free version. The [free XML sitemap URL extractor](https://en.speedyindex.com/free-xml-sitemap-url-extractor/) is useful for auditing existing sitemaps, not generating new ones. For generation, consider a headless CMS with a sitemap module or a dedicated tool like Sitemap Generator Pro.