In this episode of SEO Mythbusting, Martin Splitt of Google and Alexis Sanders discuss crawl budget. The video goes into detail explaining what crawl budget is, whether businesses need a crawl budget, how Google crawling a website affects it, and much more.

Topics discussed in this episode include:

  • Why is crawl budget an interesting topic to discuss?
  • What is crawl budget?
  • What is crawl rate, and what is crawl demand?
  • How does Googlebot make its crawl rate and crawl demand decisions?
  • ETags, HTTP headers, last modified dates, and similar.
  • What size of sites should worry about crawl budget?
  • Server setup vs crawl budget.
  • Crawl frequency vs quality of content.
  • What to expect to see in one’s log files if Google is testing one’s server?
  • Tips on how to get your site crawled accurately during a site migration.
  • Crawl budget and the different levels of one’s site’s infrastructure.
  • Does crawl budget affect rendering as well?
  • Caching of resources and crawl budget.
  • Crawl budget and specific industries such as publishing.
  • What can be, generally speaking, recommended to help Googlebot out when crawling one’s site?
  • What are the usual pitfalls people get into with crawl budget?
  • Can one tell Googlebot to crawl one’s site more?

Video Transcript

MARTIN SPLITT: How about crawl budget?

ALEXIS SANDERS: I would love to talk about crawl budget.

MARTIN SPLITT: All right.

ALEXIS SANDERS: Yeah, I think it’s a term that’s thrown around a lot in the industry. And it seems like an abstraction of something that is much more technical and concrete. And I think we can get a lot of clarity through this video.

MARTIN SPLITT: Fair enough.

[MUSIC PLAYING]

Hello, and welcome to “SEO Mythbusting.” With me today is Alexis Sanders, from Merkle.

ALEXIS SANDERS: Thanks for having me.

MARTIN SPLITT: And you are a senior account manager?

ALEXIS SANDERS: Yes.

MARTIN SPLITT: OK, so you work with a lot of different companies and accounts and have different kinds of work. Yesterday we were discussing fermented food because that’s a hobby of yours.

[LAUGHS]

ALEXIS SANDERS: Yeah, definitely.

I went on a trip, came back, and my boyfriend had fermented everything.

MARTIN SPLITT: Everything.

ALEXIS SANDERS: Yes.

MARTIN SPLITT: So, lots of kimchi?

ALEXIS SANDERS: Oh, my god. We have so much kimchi.

MARTIN SPLITT: [LAUGHS]

ALEXIS SANDERS: It’s insane. Like in a huge jar.

What is Crawl Budget?

MARTIN SPLITT: But we’re not here today to talk about kimchi. So, what are the things that your clients and customers are dealing with, and what’s a question that you often get? Or what’s a topic you would like to discuss?

ALEXIS SANDERS: I would love to talk about crawl budget today. So to get started, before we even talk about anything, can we start with what is crawl budget, kind of recapping some of the stuff that Gary covered in his Google Webmaster article.

What is Crawl Rate and What is Crawl Demand?

MARTIN SPLITT: That’s a very, very good point. Actually, the article is a fantastic starting point to this entire journey. So, when we are talking Google Search and indexing and thus, crawling, we have to have a bit of a– what’s it called– trade-off to make. And the trade-off that we have to make there is we want to get as much information crawled as quickly as possible, but we don’t want to overwhelm the servers.

ALEXIS SANDERS: Definitely. And would you categorize that part of it as crawl limit?

MARTIN SPLITT: Yeah, pretty much. Or crawl rate.

ALEXIS SANDERS: Almost like an abstraction under an [? objection. ?]

MARTIN SPLITT: Yes. That’s the crawl rate, basically– how much stress can we put on your server without crashing anything or suffering from killing your server too much. That’s one thing. The other thing is we just have to be reasonable with our resources. The web is really, really large, so instead of just crawling everything all the time, we need to make some distinctions there. A news website probably changes quite often, and we probably need to be catching up with them every now and then, whereas a website on the history of kimchi is probably not changing as often. I mean–

ALEXIS SANDERS: Definitely.

MARTIN SPLITT: –nothing against kimchi, but I think the history isn’t as fast-paced as the world news are.

ALEXIS SANDERS: Yeah, definitely.

MARTIN SPLITT: So that’s the crawl demand. So we try to figure out is this something that we need to crawl more often, or is this something where it’s OK if we check-up every now and then.

How does Googlebot make its Crawl Rate and Crawl Demand Decisions?

ALEXIS SANDERS: And what goes into that decision-making process? For instance, if you had that kimchi website, or if you had something like archive.org, where it’s really just something that’s going to be constant for a long period of time and there really isn’t going to be any changes, how do you guys determine that?

MARTIN SPLITT: We basically fingerprint content as we crawl it. So we see what is this page about.

We use that for deduplication later on, but we also look into when was this last changed. So you can tell us, and things like structured data has possibilities. You can have date elements somewhere in your page. And we more or less keep track of how often or how frequently things are changing. And if we detect that the frequency of change is really low, then we don’t have to crawl as much. That has nothing to do with quality. You can have fantastic content that will rank really high or whatever that never changes. It’s more about is this information that we need to revisit every now and then, or is this something that we can leave alone for a longer period of time.

ETags, HTTP Headers, Last Modified Dates, and Similar

ALEXIS SANDERS: Now, do you guys use something like the ETag or the last-modified network header, or is it more about taking that fingerprint of the content?

MARTIN SPLITT: You can give us various hints. As I said, structured data dates are one. ETag is an interesting one, as well. The HTTP headers are useful for this last-modified date in sitemap. Right? And the more useful you’re making these– if you just automatically update the last modified date in the sitemap and we figure out that does not correspond to actual changes on the website, or the website changes are minimal, then that’s not helping us.

ALEXIS SANDERS: Don’t want to waste our resources.

MARTIN SPLITT: Exactly. Not only that. At some point we’re just going to be like, OK, this is not useful for this particular site, so. But as much information as we get we will use to figure out what we can reasonably expect as a change frequency.

What Size of Sites should Worry about Crawl Budget?

ALEXIS SANDERS: So going backwards a little bit, what size of sites should be worried about crawl budget, should be worried that hitting that crawl limit?

MARTIN SPLITT: Large sites.

ALEXIS SANDERS: Large sites. And if we were to give–

MARTIN SPLITT: Millions URLs.

ALEXIS SANDERS: Millions of pages. OK, cool. So if you have under a million pages, really, crawl budget is not something that you should be concerned about.

Server Setup vs. Crawl Budget

MARTIN SPLITT: Yes. Unless you have a really flaky server setup. But then, again, crawl budget isn’t your problem. Your problem is your server setup, right?

Crawl Frequency vs. Quality of Content

ALEXIS SANDERS: No, definitely. So do you typically see that the issue isn’t really with the server setup but more with the crawl demand? Or do you see it more with the server setup?

MARTIN SPLITT: It depends on what you say the issue is. Normally, crawl budget gets cited in many, many different times when it actually wasn’t an issue to begin with. It was just like the quality of the content was bad and that’s why we didn’t index it, so we crawled it. We figured out, oh, this is not worth keeping. Then we didn’t index it. And people were like, so is this a crawl budget problem? And we’re like, no, we have crawled it. It was just not good.

What to Expect to See in one’s Log Files if Google is Testing one’s Server

ALEXIS SANDERS: And is that what you would typically see in the Excluded section of the GSA report? Something along those lines?

MARTIN SPLITT: Yeah, exactly. These things are really helpful. Sometimes you also see people are like, oh, yeah. My crawl budget is really low. And then we’re like, yeah, but is that a problem? Are you having lots of changing content? And they’re like, actually, no. It’s a blog that updates once a week. And then we’re like, hmm. How big of a problem really is that? Sometimes it’s also people just barely missing the crawl window where it does not discover the URL. And if you’re not submitting it in sitemaps, then we have to crawl another page to find that link to that URL, and then we’re going to crawl that one. So using sitemap reasonably is a good idea for these things. But again, if you’re not having millions of pages then crawl budget is rarely an issue.

Tips on How to Get Your Site Crawled Accurately During a Site Migration

ALEXIS SANDERS: Let’s say you’re working with a major e-commerce, or something like real estate, where you have hundreds of millions of pages that are a little bit smaller, maybe a little bit similar. Is there anything that those industries can do to make their pages crawled more often or more appealing to Google?

MARTIN SPLITT: The thing is, being crawled more often is not necessarily helping. It’s not giving us a signal for quality, or it’s not meaning like, oh, this is fantastic. Having something crawled, and then indexed, and then never changing is OK as far as we’re concerned. But for e-commerce, I would highly recommend– you were saying something really interesting like lots of smaller pages that are very similar to each other. If they are very similar to each other, should they exist is my first question. Can you extend the content? For instance, if it’s product variations of it, maybe you just describe them in the table inside one page rather than having 10 pages for all the variations.

ALEXIS SANDERS: Definitely.

MARTIN SPLITT: Which colors are available? Well, that’s not–

ALEXIS SANDERS: It almost brings the fact that there’s so many different issues that fit into crawl budget.

MARTIN SPLITT: Exactly.

ALEXIS SANDERS: Whether it’s duplication, or finding the pages, or having a service.

MARTIN SPLITT: Yes. Server speed is an issue, right?

ALEXIS SANDERS: Yeah.

MARTIN SPLITT: If you have a server that every now and then just flakes on us, then we don’t know. Is that because it’s a flaky server? Is that because we’re about to overwhelm you?

ALEXIS SANDERS: Got a flaky server.

MARTIN SPLITT: Yeah, it’s like it’s just wobbly and falling over every now and then, then we have to be careful.

ALEXIS SANDERS: No!

MARTIN SPLITT: We have to touch very lightly on that server.

ALEXIS SANDERS: So let’s say someone is switching over their site to a more powerful server. What is something they should expect to see in their log files if Google is testing their servers to see how much they can handle?

MARTIN SPLITT: They might see an increase in crawl activity and then a small decrease again, and basically there’s this wave motion that’s going to happen. But normally, when you just switch the server and nothing else changes, you might not even see us doing anything. You just see it just continues to go through.

ALEXIS SANDERS: Definitely.

MARTIN SPLITT: Unless you change from something that was broken to something that works. Then you probably see more crawling happening.

ALEXIS SANDERS: OK. So what about in a situation of migrations? A lot of sites are really hyper-aware about how their site is being crawled during a migration. Are there any tips that you guys have to making sure that your site is being crawled accurately?

MARTIN SPLITT: What you can definitely do is you can make sure that you’re, more or less, progressively updating your sitemap and saying this has now changed. This has now changed. And this has now changed. And it’s now time to crawl that. And we crawl it. We get a redirect. That’s actually a change that is interesting for us. And then you can control a little bit on how we discover the change in the migration. But generally speaking, just make sure that both servers are up on par and running smoothly, and not flaking out every now and then or giving us error codes.

Make sure that your redirects are set up correctly, as well, and that you’re not accidentally blocking important things for us to understand once we are landing on the new site. If you have a robots.txt beforehand, and then you change that to completely different URLs being blocked and put that on the other thing, that we’re like, what’s happening here?

ALEXIS SANDERS: What is going on?

MARTIN SPLITT: So that’s not a great idea. So try to do one thing at a time. Do not be like, we’re going to change the text deck, we’re going to change the server, we’re going to change the URLs, we’re going to change the content, and we’re going to migrate to a different domain. That’s a lot.

Crawl Budget and the Different Levels of one’s Site’s Infrastructure

ALEXIS SANDERS: Too much for one year. Too much for one project. No, for sure. So, for crawl budget, what levels of a site does it affect? So if you have a CDN, will it affect your site? If you have a subdomain, or a domain, at what levels of your infrastructure is it affecting?

MARTIN SPLITT: That depends. So, generally speaking, we go on the site level, so everything that is on the same domain. Sometimes subdomains might be counted towards that. Sometimes they are not. CDNs are one where it’s kind of counted against your site, but not really. You should normally not worry about it. I think crawl budget–

ALEXIS SANDERS: Stop stressing so much, SEOs.

[LAUGHS]

MARTIN SPLITT: Enjoy yourselves. What’s interesting is, when it comes to crawl budget, where I would say you should consider this, especially if you’re dealing with user-generated content, you can try to tell us not to index or not to crawl certain things that you know are bad. So it’s more of a content side of things rather than the technical infrastructure side of things.

Does Crawl Budget Affect Rendering as well?

ALEXIS SANDERS: Yeah, definitely. So does crawl budget actually affect both your crawling phase and your rendering phase?

MARTIN SPLITT: Crawl budget does actually touch on rendering because as we render, we will fetch additional resources that comes with the crawl budget. Because, again, the trade-off is still there. When we’re crawling initially, we don’t want to overwhelm your server, but the same goes to when we are rendering. We don’t want to overwhelm your server there. What’s the point in killing your e-commerce website just because we are making thousands of requests at the same time to crawl all these resources? So it can happen, especially if you have a large site with millions of URLs and you get your caching wrong, for instance, that we actually have to fetch everything over and over and over again, which means we have a lot of simultaneous requests to your server. If your server then slows down significantly, then we’d be like, oh, OK. Careful. And then resource fetches might fail. So we might be able to fetch and render the first– I don’t know– 500,000 pages, but then suddenly the resources seem to fail because we’re like, ah. We don’t want to make more requests at this point. And then we can’t really render them. So that would be bad.

Caching of Resources and Crawl Budget

ALEXIS SANDERS: Yeah. So what do we do in terms of caching for those resources? Obviously, with pages, we can put them in [INAUDIBLE] sitemaps, and all the stuff you mentioned before with potentially structured data, and then maybe even adding some network headers. But what do you think about those resources that could be really consuming of Google’s resources?

MARTIN SPLITT: Absolutely. The thing is, we are trying to be as aggressive as possible when we are caching sub-resources, such as CSS, JavaScript, API calls, all that kind of stuff. If you have API calls that are not GET requests, then we can’t cache them. So you want to be very careful with doing POST requests or something like that.

ALEXIS SANDERS: Don’t do POST.

MARTIN SPLITT: Some APIs are doing that by default. And then I’m like, ay. That’s an interesting one because we can’t cache it, so that’s going to consume your crawl budget quicker. The other thing is you can help us by making sure that URLs, optimally, never change. So what is a really good idea is you can have content hashes in your resources. So instead of saying application.JavaScript, you can say application dot and then this hash that describes the content, like AEF, CE–

ALEXIS SANDERS: Nice. So, basically, 16 characters.

MARTIN SPLITT: Exactly. Like these [? MB5 ?] hashes dot JavaScript. And this will never change. So when you’re making an update to your JavaScript, you get a different URL. So we can cache the other thing forever, and once we get the new HTML, we get the new URL for the JavaScript, and then–

ALEXIS SANDERS: It’s almost like versioning, right?

MARTIN SPLITT: It’s like versioning in the URL. Exactly.

ALEXIS SANDERS: Nice.

MARTIN SPLITT: So that’s a great idea.

ALEXIS SANDERS: So for that first version that you would have, or any particular version that you would have, how long is that typically cached for?

MARTIN SPLITT: It depends, but we’re trying to cache as aggressively as possible, so this can be like a day. This can be a week. This can be a month. And we might ignore caching headers. So if you’re telling us this expires tomorrow and we’re like, no, it does not, then we might just cache it nonetheless.

Crawl Budget and Specific Industries such as Publishing

ALEXIS SANDERS: Yeah. Awesome. So do you find that any particular industries complain more about crawl budget than others? Or talk more about it?

MARTIN SPLITT: That’s true. I think e-commerce and publishers are quite prone to this because they have, historically, typically large sites with lots of pages, and they might actually run into this. We had a really interesting situation with one webmaster presented that at a Webmaster conference in Zurich once. In Japan, it’s a website that has lots of user-generated content. And they do machine learning to figure out if the quality of the content that was produced is considered good or not. And if it’s not good, then they put it in noindex, and they put it in robots.txt, so that we’re not wasting crawl budget on it. And that way they steer– because they had a crawl budget of 200,000 pages per day or something like that, but they had a million coming in, and most of it was spam. So they’re like, ugh, it’s annoying if we just waste our crawl budget for the day on spam that will not get indexed anyway. So that’s an interesting one.

ALEXIS SANDERS: Do you believe that there is an overall quality metric that Google uses? So, for that particular site in general, because they had so much non-high-quality content, did that affect Google’s outlook on their content overall?

MARTIN SPLITT: Not really. It was more like this page is not good. We’re not going to put it in the index. So that’s the outcome, really.

ALEXIS SANDERS: OK, nice.

MARTIN SPLITT: Or if it’s light on content. If someone just literally posts a picture and says ha, ha, then that’s not great content necessarily, especially if the picture is just a repost of some meme or some stock photo, right?

What can be Recommended to Help Googlebot Out When Crawling One’s Site?

ALEXIS SANDERS: Yeah, definitely. So what are some things that we, as webmasters, and SEOs, and agency people can recommend to our clients on how to help Googlebot out, and help me your rendering system out, as well?

MARTIN SPLITT: If you know that there’s URLs that you don’t need to render the page– and that’s a very big be careful here because if you’re blocking the entire API because you’re like, oh, we don’t need the API, and then the JavaScript does need the API to render the page, then that you shouldn’t block.

ALEXIS SANDERS: You don’t get the full experience. Ugh!

MARTIN SPLITT: Exactly. Do not block that because then we might actually not see much content. But if you have something that definitely– like internal analytics, or some internal tools, or some chat tool that pops up, or whatever, don’t let that be crawled because what’s the point. Also, you can use–

ALEXIS SANDERS: And do you recommend blocking at the robots.txt for those particular resources, like kind of throwing them in a folder and–

MARTIN SPLITT: If you don’t want them crawled, sure. Yes. Yes. Just be careful that you’re not putting something in robots.txt that turns out to be useful and important later.

ALEXIS SANDERS: Yeah. Big caution there.

MARTIN SPLITT: Yes. You can also– especially with client-side rendered apps, you oftentimes have a bunch of API calls going to and from. You can proxy that through a facade kind of thing. You basically build a little application, and all it does is it takes one request for one specific piece of data that you need to render the page, makes all the API requests and all the back-end requests that it needs, maybe has a layer of caching, as well, to make this faster, and then has one response going back. That way, you have one request, one response cycle.

ALEXIS SANDERS: So I may sound a bit goofy, but this is something like GraphQL?

MARTIN SPLITT: Right, GraphQL. Exactly, yes. But with GraphQL, again, be careful to not use the POST mode.

ALEXIS SANDERS: No POST mode, people.

MARTIN SPLITT: Make sure that it uses GET requests, not POST requests.

ALEXIS SANDERS: OK. Cool.

MARTIN SPLITT: Yeah, there’s a bunch of stuff you can do. And, basically, cut off the bits that you don’t want crawled. And give us information of the frequency of change in sitemaps and stuff, so that we can get a better grasp of where we should be spending our time.

What are the Usual Pitfalls People Get into with Crawl Budget?

ALEXIS SANDERS: Definitely. And do you see any pitfalls that people fall into with crawl budget quite often?

MARTIN SPLITT: Well, definitely the robots one. That happens all the time, that they block something that turns out to be an important resource. Like, it does not need to load the CSS. And we’re like, ah.

ALEXIS SANDERS: We don’t know what the page looked like, so we need the layout, or something like that.

MARTIN SPLITT: Exactly. People sometimes fall for it when they do A/B testing, like something gets no indexed when they don’t want it, and then they waste crawl budget on something that doesn’t end up in the index. And I’m like, hmm. Yeah, all sorts of things can go wrong. My favorite is when people are dealing with servers that are not configured correctly, and then they just give you a 500-something. And we’re like, OK, if this happens, then we’re not crawling as much anymore because we don’t want to overwhelm your server. And then people are like, why is Google not crawling this? And we’re like, well, because your server constantly tells us that it’s overwhelmed, so hmm.

Can one tell Googlebot to Crawl one’s Site More?

ALEXIS SANDERS: Yeah. So let’s say you have a very large site, and you know you have these mega-powerful servers with gigabytes and gigabytes of data. Is there any way that we can tell Google that, hey, you can crawl us more, like we expect you to crawl us more here?

MARTIN SPLITT: You can’t, really. We are detecting that. What you can do is you can limit us. But you can’t really say more, please, more, please.

ALEXIS SANDERS: OK.

MARTIN SPLITT: This crawler scheduler is relatively clever when it comes to these kind of things. And normally, if we detect like, oh, there’s lots of good content out there on this page, and the site map is full of URLs, then we will try to grab as much as we can. And as long as your server doesn’t tell us not to, we’ll continue doing that. So, eventually, the crawl budget might ramp up to what you would expect to see.

ALEXIS SANDERS: Ah, lovely. So just keep creating fresh, great-quality content. That’s really what it comes down to to get [INAUDIBLE]

MARTIN SPLITT: That’s generally the answer to pretty much everything. Good quality content. If it’s fresh, also great.

ALEXIS SANDERS: Awesome.

MARTIN SPLITT: Yes. Alexis, thank you so much for being with me and having this fantastic conversation. And I think there was a bunch of nuggets in there. And it was great having you here, so.

ALEXIS SANDERS: Thank you so much.

MARTIN SPLITT: We did this.

ALEXIS SANDERS: Yes!

MARTIN SPLITT: And I hope you enjoyed it. And see you soon again.

ALEXIS SANDERS: Thanks so much, everyone.

[MUSIC PLAYING]

MARTIN SPLITT: Next episode, we have Eric with us. And we’re going to be talking about page speed– how it works in ranking, what to look out for, what to not do, and how to not get it wrong.

ERIC: It’s such a great topic because so many people do get it wrong, I’m afraid.

MARTIN SPLITT: So, don’t miss out on the next episode.

Newsletter



Allan McNabb, COO, Internet marketing by Image Building Media

Subscribe to our newsletter for the information you need to generate leads and develop new business, while increasing customer loyalty.