Googlebot Only Indexes The First 15 MB Of An HTML Page: Googlebot Secrets

Beware if your page is too big: Googlebot only indexes the first 15 MB of an HTML page.

Googlebot: Google has once again made an update that affects how it crawls websites around the world.

Recently, Google updated the Google Search Central documentation to state that Googlebot crawls only the first 15 MB of an HTML file for inclusion in the index (which, of course, also feeds into search rankings). Anything beyond that 15 MB limit is not taken into account for ranking.

What does this mean? If you want important content on your website to be indexed by Googlebot, it must appear within the first 15 MB of your HTML file.

Is the 15 MB limit big or small? What are the files indexed by Googlebot? Are files such as images, JavaScript, and videos also included in the indexed files section of the HTML? What should be done regarding this update? Let’s discuss it together in this article!

What is Googlebot – Types & How To Control It

Googlebot is the web crawler software used by Google to collect information from the web and build its index (the crawl process). Googlebot is a generic name referring to two different types of web crawlers: a desktop crawler (which simulates a desktop user) and a mobile crawler (which simulates a mobile user).

If your site has been converted to mobile-first indexing on Google, most Googlebot crawl requests will be made with the mobile crawler. For sites that haven’t been converted, most crawls will be made with the desktop crawler. In both cases, the minority crawler only crawls URLs that have already been crawled by the majority crawler.

Read also: What are Long Tail Keywords? Short Tail vs Long Tail Keyword

Why is Googlebot Important?

Googlebot largely determines the success of Google’s crawl process. Crawling is the first step in how a search engine like Google works, so if this process runs into problems it can affect your position on the Google search results page.

Please note that search engines work in three stages, namely Crawl, Indexing, and Ranking:

  • Crawl is the process search engine bots use to visit new and updated pages to add to the index.
  • Indexing is the process by which Google or other search engines understand the elements on a page.
  • Ranking is the process by which Google orders its search results and serves them to users.

Types of Googlebot

  • Googlebot Image – User-agent token: Googlebot-Image, Googlebot. Full user agent string: Googlebot-Image/1.0
  • Googlebot News – User-agent token: Googlebot-News. The Googlebot-News user agent uses the various Googlebot user agent strings.
  • Googlebot Video – User-agent token: Googlebot-Video, Googlebot. Full user agent string: Googlebot-Video/1.0
  • Googlebot Desktop – User-agent token: Googlebot. Full user agent strings:
      Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
      Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/W.X.Y.Z Safari/537.36
      Googlebot/2.1 (+http://www.google.com/bot.html)
  • Googlebot Smartphone – User-agent token: Googlebot. Full user agent string: Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Your site is likely to be crawled by both Googlebot Desktop and Googlebot Smartphone. You can identify the Googlebot subtype by looking at the user agent string. However, both crawler types use the same product token (user agent token) in robots.txt, so you can’t selectively target Googlebot Smartphone or Googlebot Desktop using robots.txt.
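
For illustration, here is a minimal Python sketch of how you might tell the two apart from a raw user agent string in your server logs. The classification rule is my own assumption based on the strings listed above, and user agents can be spoofed, so treat it only as a first pass (see "Verifying Googlebot" below).

```python
# Minimal sketch (not Google's code): classify a raw user agent string as
# Googlebot Desktop, Googlebot Smartphone, or neither, based on the strings above.
def classify_googlebot(user_agent: str) -> str:
    ua = user_agent.lower()
    if "googlebot" not in ua:
        return "not Googlebot"
    if "mobile safari" in ua or "android" in ua:
        return "Googlebot Smartphone"
    return "Googlebot Desktop"

# Example: the smartphone crawler string from the list above.
ua = ("Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
      "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.0.0 Mobile Safari/537.36 "
      "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)")
print(classify_googlebot(ua))  # -> Googlebot Smartphone
```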

What Files Are Indexed By Google

Google can index more than just HTML files. Some of the most common file types indexed by Google, as listed in Search Console Help, are:

  • Adobe Portable Document Format (.pdf)
  • HTML (.htm, .html)
  • Microsoft Excel
  • Microsoft PowerPoint
  • Microsoft Word
  • Text (.txt, .text), including source code in common programming languages such as .java, .cs, .pl, .py, .xml, .bas, .c, .cc, etc.
  • Wireless Markup Language (.wap, .wml)
  • XML (.xml)

What to Do to Maximize SEO?

In light of this update, you should optimize the HTML size of your website so that it keeps the best possible ranking on search engines.

According to Search Engine Journal, you should place important content at the top of the page. That means structuring your HTML well so that the content that matters most for SEO appears early in the HTML file.

In addition, if you have pages that are very large, fix them first, because oversized markup eats into the limit.

Still citing SEJ, the best way to deal with this latest update from Google is to keep website pages at 100 KB or less. This keeps your pages well clear of the limit.

To find out your current page size, you can check it directly with Google PageSpeed Insights.

Another approach, suggested by Search Engine Land, is to check your pages with the URL Inspection tool in Google Search Console. After inspecting a URL, you can see which part of the page Google renders and sees in the debugging tools.

Does This 15 MB Include Files Like Images & Videos That Are On Our Web Pages?

At the beginning of the article, I raised questions about files such as images, videos, and JavaScript. Do those files also count toward the 15 MB of HTML indexed by Google?

The answer is no. As reported by Search Engine Journal, Google states that files such as images, videos, CSS, and JavaScript are not counted toward the 15 MB limit; they are fetched and indexed separately by Google.

How to Check Website HTML File Size?

You can find out by using the Pingdom website speed test:

1. Go to https://tools.pingdom.com/. Enter the URL you want to check, then press the “Start Test” button

2. Wait a moment, then review the results. In this example, the page size is 552.5 KB. But is this the HTML size only, or does it include other file types?

3. Scroll down to the bottom of the results to see the size broken down by file type. You can see that the HTML alone is only about 100 KB, still very far from Google’s 15 MB limit, even though this sample page contains a lot of text.
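
If you prefer to check from the command line, here is a minimal Python sketch that reports the uncompressed HTML size of a page against the 15 MB limit. It assumes the requests library is installed, and https://example.com is just a placeholder for your own URL.

```python
import requests

# Minimal sketch: fetch a page and compare its uncompressed HTML size
# to Googlebot's 15 MB limit (the limit applies to uncompressed data).
LIMIT_BYTES = 15 * 1024 * 1024  # 15 MB

response = requests.get("https://example.com", timeout=30)
html_bytes = len(response.content)  # requests decompresses gzip/deflate for you

print(f"HTML size: {html_bytes / 1024:.1f} KB "
      f"({html_bytes / LIMIT_BYTES:.2%} of the 15 MB limit)")
if html_bytes > LIMIT_BYTES:
    print("Warning: content beyond the first 15 MB may not be indexed.")
```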

How to Control Googlebot

Google gives you several ways to control what is crawled and indexed.

Ways to control crawling
  1. Robots.txt – This file on your website lets you control what is crawled (a small example file is sketched just after this list).
  2. Nofollow – Nofollow is a link attribute or robots meta tag value that suggests a link should not be followed. It is only treated as a hint, so it may be ignored.
  3. Crawl Budget – The crawl rate feature in the old Google Search Console allows you to slow down Google’s crawling of your site.
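
To illustrate the robots.txt option above, here is a minimal sketch of such a file. The directory names and sitemap URL are hypothetical placeholders, not recommendations for your site.

```
# Hypothetical example: keep Googlebot out of internal search results,
# keep all crawlers out of /admin/, and point them to the sitemap.
User-agent: Googlebot
Disallow: /internal-search/

User-agent: *
Disallow: /admin/

Sitemap: https://www.example.com/sitemap.xml
```
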
Ways to control indexing
  1. Delete your content – If you delete a page, there is nothing left to index. The downside is that no one else can access it either.
  2. Restrict access to content – If you restrict access to content, Google cannot get to it, so any kind of protection such as passwords or authentication will prevent Google from seeing the content.
  3. Noindex – A noindex rule in the robots meta tag tells search engines not to index your page (an example tag is shown just after this list).
  4. URL removal tool – The name of this Google tool is a bit misleading, as it only temporarily hides the content. Google will still see and crawl the content, but the page will not appear in search results.
  5. Robots.txt (Images Only) – Blocking Googlebot-Image from crawling your images means they will not be indexed.
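
To illustrate the noindex option above, this is the standard robots meta tag, placed in the <head> of the (hypothetical) page you want kept out of the index:

```html
<!-- Inside the <head> of the page you do not want indexed -->
<meta name="robots" content="noindex">
```

For non-HTML files such as PDFs, the same rule can be sent as an X-Robots-Tag: noindex HTTP response header instead.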

How Googlebot Accesses Your Site

For most sites, Googlebot on average won’t access your site more than once every few seconds. However, due to network delays, the crawl rate may appear slightly higher over short periods.

Googlebot is designed to run concurrently on thousands of machines to improve performance. To reduce bandwidth usage, Google runs many crawlers on machines located near the server of the site being crawled.

This can cause your site logs to show visits from several different devices but with the same user agent, Googlebot. Google’s goal is to crawl as many web pages as possible without burdening your server bandwidth. If your site is having trouble keeping up with Google’s crawling requests, you can request a crawl speed change as outlined above.

HTTP

Generally, Googlebot crawls over HTTP/1.1. However, starting in November 2020, Googlebot can crawl sites over HTTP/2 if the site supports it. This can save computing resources (such as CPU and RAM) for both the site and Googlebot.

But remember, this doesn’t affect your site’s indexing or ranking. You can check HTTP/2 support for your site with one of several free services, such as:

https://tools.keycdn.com/http2-test
https://http2.pro

If you don’t want to be crawled over HTTP/2, you can ask the server hosting your site to respond with HTTP status code 421 when Googlebot tries to crawl your site over HTTP/2. If that doesn’t work, you can message the Googlebot team (but this workaround is temporary).
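
If you prefer a quick local check instead of the online tools above, here is a minimal Python sketch using the httpx library (an assumption: it is a third-party package and needs its http2 extra, e.g. pip install "httpx[http2]"). Again, https://example.com is just a placeholder.

```python
import httpx

# Minimal sketch: request the page with HTTP/2 enabled and report which
# protocol the server actually negotiated.
with httpx.Client(http2=True) as client:
    response = client.get("https://example.com")
    print(response.http_version)  # "HTTP/2" if supported, otherwise "HTTP/1.1"
```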

First 15 MB

Googlebot can crawl the first 15 MB of supported HTML files or text-based files. Each resource referenced in HTML such as images, videos, CSS, and JavaScript will be fetched separately.

After the first 15 MB of the file, Googlebot stops crawling, and only that first 15 MB is considered for indexing. The size limit applies to uncompressed data. Other crawlers may have different limits.

15 MB is a very large size for a web page. For comparison, the page of the article you are reading is no more than 1 MB. If a page exceeds roughly 3 MB, visitors usually already complain that it is slow to load.

Verifying Googlebot

Before deciding to block Googlebot, keep in mind that the user agent string used by Googlebot is often spoofed by other crawlers. It’s important to verify that a problematic request really comes from Google. The best way to do so is to run a reverse DNS lookup on the request’s source IP, or to match the source IP against Google’s published Googlebot IP ranges.
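
As a rough illustration of that reverse DNS check, here is a minimal Python sketch. The IP address at the bottom is just an example of the kind of address you might pull from your server logs.

```python
import socket

# Minimal sketch: verify a "Googlebot" request by reverse DNS, then confirm
# with a forward lookup that the hostname really maps back to the same IP.
def is_googlebot(ip: str) -> bool:
    try:
        hostname = socket.gethostbyaddr(ip)[0]  # reverse DNS lookup
    except socket.herror:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        forward_ips = socket.gethostbyname_ex(hostname)[2]  # forward lookup
    except socket.gaierror:
        return False
    return ip in forward_ips

print(is_googlebot("66.249.66.1"))  # example IP as it might appear in your logs
```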

Read Also: 20+ Online Web Tools for Freelancers, Writers, and Bloggers

Closing

That’s a look at the update Google has recently issued. Hopefully you can adapt to it and keep the best possible ranking on search engines after this update.

Of course, one of the biggest to-dos after this update is to pay attention to your website pages. If an HTML file exceeds the 15 MB limit, fix it right away, although 15 MB is actually a very generous limit for HTML file sizes, as the example above shows.

If you need a discussion partner about this update, you can join the Rossgram Telegram group.

Don’t forget to join now, OK?
