What Crawl Budget Means for Googlebot
Monday, January 16, 2017
Recently, we've heard a number of definitions for "crawl budget"; however, the term alone doesn't convey everything that "crawl budget" stands for externally. With this post we'd like to clarify what we actually have and what it means for Googlebot.
First, we'd like to emphasize that crawl budget, as described below, is not something most publishers have to worry about. If new pages tend to be crawled the same day they're published, crawl budget is not something webmasters need to focus on. Likewise, if a site has fewer than a few thousand URLs, most of the time it will be crawled efficiently.
Prioritizing what to crawl, when, and how much resource the hosting server can allocate to crawling matters more for bigger sites, or for sites that auto-generate pages based on URL parameters, for example.
Crawl rate limit
Googlebot is designed to be a good citizen of the web. Crawling is its main priority, while making sure it doesn't degrade the experience of users visiting the site. We call this the "crawl rate limit", which caps the maximum fetching rate for a given site.
Simply put, the crawl rate limit is the number of simultaneous parallel connections Googlebot may use to crawl the site, as well as the time it waits between fetches (see the rough sketch after this list). The crawl rate can go up and down based on two factors:
- Crawl health: if the site responds quickly for a while, the limit goes up, meaning more connections can be used to crawl. If the site slows down or responds with server errors, the limit goes down and Googlebot crawls less.
- Limit set in Search Console: website owners can reduce Googlebot's crawling of their site. Note that setting higher limits doesn't automatically increase crawling.
Crawl demand
Even if the crawl rate limit isn't reached, Googlebot won't crawl more if there's no demand from indexing. The two factors that play a significant role in determining crawl demand are:
- Popularity: URLs that are more popular on the Internet tend to be crawled more often to keep them fresher in our index.
- Staleness: our systems attempt to prevent URLs from becoming stale in the index.
Additionally, site-wide events such as site moves may trigger an increase in crawl demand, because the content needs to be reindexed under the new URLs.
Taking crawl rate and crawl demand together, we define crawl budget as the number of URLs Googlebot can and wants to crawl.
Factors affecting crawl budget
According to our analysis, having many low-value-add URLs can negatively affect a site's crawling and indexing. We found that low-value-add URLs fall into these categories, in order of significance:
- Faceted navigation and session identifiers
- On-site duplicate content
- Soft error pages
- Hacked pages
- Infinite spaces and proxies
- Low quality and spam content
Wasting server resources on pages like these drains crawl activity from pages that do have value, which can significantly delay the discovery of great content on a site.
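One practical way to spot this kind of waste is to look at which URLs crawlers actually fetch from your server. Below is a minimal sketch, assuming a combined-format access log at a hypothetical path access.log, that tallies Googlebot requests by path and by URL parameter; session IDs and facet filters tend to surface quickly in such a tally:

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Combined log format: the request is quoted, e.g. "GET /shoes?color=red&sid=123 HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[^"]*"')

path_hits = Counter()
param_hits = Counter()

with open("access.log") as log:            # hypothetical log file path
    for line in log:
        if "Googlebot" not in line:        # crude user-agent filter
            continue
        match = REQUEST_RE.search(line)
        if not match:
            continue
        url = urlsplit(match.group(1))
        path_hits[url.path] += 1
        for param in parse_qs(url.query):  # e.g. session IDs, facet filters
            param_hits[param] += 1

print("Most-crawled paths:", path_hits.most_common(10))
print("Most-crawled URL parameters:", param_hits.most_common(10))
```

Parameters or paths that dominate the counts while adding little unique content are good candidates for cleanup.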
Top questions
Crawling is the entry point for sites into Google's search results. Efficient crawling of a website helps with its indexing in Google Search.
Does site speed affect my crawl budget? How about errors?
Making a site faster improves the user experience while also increasing the crawl rate. For Googlebot, a speedy site is a sign of healthy servers, so it can get more content over the same number of connections. On the flip side, a significant number of 5xx errors or connection timeouts signal the opposite, and crawling slows down.
We recommend paying attention to the Crawl Errors report in Search Console and keeping the number of server errors low.
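As a quick spot check alongside those reports, a small sketch using the third-party requests library that fetches a handful of representative URLs (placeholders below) and reports status codes and response times, so slow pages and 5xx responses stand out:

```python
import time
import requests  # third-party: pip install requests

# Hypothetical sample of representative URLs on your own site.
SAMPLE_URLS = [
    "https://example.com/",
    "https://example.com/category/widgets",
]

for url in SAMPLE_URLS:
    start = time.monotonic()
    try:
        resp = requests.get(url, timeout=10)
    except requests.RequestException as exc:
        print(f"{url}: request failed ({exc})")
        continue
    elapsed = time.monotonic() - start
    flag = "  <-- server error" if resp.status_code >= 500 else ""
    print(f"{url}: HTTP {resp.status_code} in {elapsed:.2f}s{flag}")
```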
Is crawling a ranking factor?
An increased crawl rate will not necessarily lead to better positions in Search results. Google uses hundreds of signals to rank results, and while crawling is necessary for a page to appear in the results, it's not a ranking signal.
Do alternate URLs and embedded content count toward the crawl budget?
Generally, any URL that Googlebot crawls counts toward a site's crawl budget. Alternate URLs, such as AMP or hreflang, as well as embedded content, such as CSS and JavaScript (including AJAX XHR calls), have to be crawled and therefore consume a site's crawl budget. Similarly, long redirect chains may have a negative effect on crawling.
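To see whether any of your own URLs sit behind long redirect chains, here is a small sketch using the third-party requests library (the URL below is a placeholder):

```python
import requests  # third-party: pip install requests

def redirect_chain(url: str) -> list[str]:
    """Return every URL visited on the way to the final response, in order."""
    resp = requests.get(url, allow_redirects=True, timeout=10)
    return [r.url for r in resp.history] + [resp.url]

chain = redirect_chain("https://example.com/old-page")  # placeholder URL
print(" -> ".join(chain))
if len(chain) > 3:
    print("Long redirect chain: every extra hop costs another fetch.")
```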
Can I control Googlebot with the crawl-delay rule?
The non-standard crawl-delay robots.txt rule is not processed by Googlebot.
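For illustration only: while Googlebot ignores crawl-delay, some other crawlers honor it, and generic parsers such as Python's standard urllib.robotparser will read the value (the robots.txt content below is made up):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse("""\
User-agent: *
Crawl-delay: 10
Disallow: /private/
""".splitlines())

print(rp.crawl_delay("*"))                      # 10 -- read by some crawlers, ignored by Googlebot
print(rp.can_fetch("Googlebot", "/private/x"))  # False: disallow rules still apply to Googlebot
```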
Does the nofollow rule affect crawl budget?
It depends. Any URL that is crawled affects crawl budget, so even if your page marks a link to a URL as nofollow, that URL can still be crawled if another page on your site, or any page on the web, doesn't label the link as nofollow.
Do URLs I disallow through robots.txt affect my crawl budget in any way?
No, disallowed URLs do not affect the crawl budget.
For more information on how to optimize crawling of your site, take a look at our blog post on optimizing crawling from 2009, which is still applicable. If you have questions, ask in the forums!
Posted by Gary Illyes, Crawling and Indexing teams