So we’ve covered what web scraping is. We’ve also covered how to cycle proxy information to disguise our IP Address. But there is another step in masking the scraper that we now must consider. To illustrate, think of it this way:
Imagine that somebody is bugging you on any social media platform and you’ve blocked them. Let’s say you’re an influencer so you get hundreds of follows every day and it is really hard to do a “profile check” of every follower to make sure that they are not the person who you’ve blocked. What would be the easiest way for you to make a preliminary guess? Obviously, checking the name of the new followers would be your first line of defense. Well, what if somebody with the same name and profile picture started following you? Would the fact that their profile snippet listed them as being from a different country than before stop you from blocking them right away? Not likely. In all likelihood, it would make you more suspicious of this person because they are being deceitful.
Think of a web server bot filter as the “influencer with an annoying follower”. By cycling our IP address, we change our apparent location, but we still show up as a connection from the same type of laptop, with the same OS version, from the same browser, etc. That makes us pretty easy to peg and filter. So what is a budding web scraper to do? Cycle your User-Agent.
A User-Agent is essentially a field in your HTTP(S) request headers that tells the web server the browser version, host OS version, host machine form factor, and some other small tidbits about every request that comes through. This serves two purposes: diagnostics of the web content (to see if a certain bug is only reported on certain OSes, browsers, or some combination of the two) and security (to see who’s connected and what they are connecting from, to track possible offenders during and/or after a security incident). So in layman’s terms, a User-Agent tells the web server about the connecting machine, whereas an IP address tells it more about where you’re coming from. Make sense? Good!
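To make the header concrete, here is a minimal sketch that attaches a User-Agent to a request object without sending anything. It uses the standard library's urllib rather than Requests so it runs with no extra installs, and the User-Agent string and URL are illustrative assumptions, not values from this series.

```python
from urllib.request import Request

# Illustrative desktop User-Agent string (an assumption for this sketch).
EXAMPLE_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/120.0.0.0 Safari/537.36"
)

# Build the request object only; nothing is sent over the network here.
request = Request("https://example.com/", headers={"User-Agent": EXAMPLE_UA})

# urllib normalizes header names with str.capitalize(), hence "User-agent".
print(request.get_header("User-agent"))
```

With Requests, the same dictionary goes into the headers= argument of requests.get().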
To begin cycling User-Agents, head to https://deviceatlas.com/blog/list-of-user-agent-strings#desktop for an up-to-date collection of desktop User-Agents.
Pro Tip: We can also scrape the mobile versions of websites by changing the User-Agent. That is actually how the “show desktop site” option on mobile browsers works. The browser spoofs your User-Agent as if the request came from a desktop device despite the fact that you are on mobile, thus showing you the “desktop version” of any content. The same works in reverse for scraping mobile-only content.
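As a rough sketch of that idea, the snippet below keeps one desktop and one mobile User-Agent and builds headers for whichever we want to present as. The two strings and the headers_for name are illustrative assumptions, and whether a site actually serves different content depends on that site.

```python
# Illustrative strings only; real ones should come from a current list.
EXAMPLE_USER_AGENTS = {
    "desktop": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0.0.0 Safari/537.36"
    ),
    "mobile": (
        "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 like Mac OS X) "
        "AppleWebKit/605.1.15 (KHTML, like Gecko) "
        "Version/17.0 Mobile/15E148 Safari/604.1"
    ),
}


def headers_for(device):
    """Return request headers that present as the given device type."""
    return {"User-Agent": EXAMPLE_USER_AGENTS[device]}


print(headers_for("mobile")["User-Agent"])
```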
Now, we could just copy/paste all of these user-agents into a file, but this is a web scraping post series after all, so instead, create a new project file and call it user_agent.py:
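from bs4 import BeautifulSoup
import random
import requests


class UserAgent:
    # Class-level attribute: shared by every instance of UserAgent.
    ua_source_url = 'https://deviceatlas.com/blog/list-of-user-agent-strings#desktop'

    def __init__(self):
        # Pick one random User-Agent string for this instance.
        self.new_ua = random.choice(self.get_ua_list())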
So first we’re making the UserAgent class. If you’ve made it this far into the series you can probably put together that I am a big fan of encapsulation. This is largely because I came from the network engineering and IT worlds before I started writing code. There, it is assumed that an incident WILL wake you up EVERY TIME you are on call at no earlier than 4 AM. Because of this, I am a huge fan of creating things that are easily debugged. Getting paged at 4, getting to the office by 4:30, and leaving by 4:45 to actually have some semblance of a morning with my family is preferred to taking 3 hours to fix a simple issue that was buried under layers of junk. Because if it takes me 3 hours, then I have solved my problem by 7:30. You know, just in time to start “normal” work for the day…
That being said, the UserAgent class is going to handle all the user-agenty things for us! The class variable we’ve created, ua_source_url, is outside of the __init__ function, as you might have noticed. Why is that? Because everything assigned to self inside __init__ only “exists” after an instance of the class is created (aka instantiated) in your code. Class-level variables can be referenced directly on the class and, more importantly, are shared among all instances of that class. Meaning when we multithread this bad boy, we won’t have to store that particular string in memory more than once.
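A quick way to see the difference between class-level and instance-level attributes is the small sketch below. The Demo class and its attribute names are made up for illustration and are not part of user_agent.py.

```python
class Demo:
    # Class attribute: one object, shared by every instance.
    shared_url = "https://example.com/list"

    def __init__(self, label):
        # Instance attribute: created separately for each instance.
        self.label = label


first = Demo("a")
second = Demo("b")

print(Demo.shared_url)                        # readable without any instance
print(first.shared_url is second.shared_url)  # True
print(first.label, second.label)              # a b
print("shared_url" in vars(first))            # False
```

The last line shows that the shared string is stored on the class, not copied into each instance.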
Within our __init__ function is the new_ua attribute, which we want to be different for each instance of this class but not be pulled into existence until an instance is created.
Side Note: For a further understanding of __init__, what it does, and how to use it, stick around. There is a post coming on that very soon!
The new_ua attribute holds a random user agent picked from a list of user agents. Where will we get that list, you ask? From the code you’re about to type:
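    # Method of the UserAgent class, so it is indented one level.
    def get_ua_list(self, source=ua_source_url):
        # Download the page and parse its HTML.
        r = requests.get(source)
        soup = BeautifulSoup(r.content, "html.parser")
        # Each User-Agent sits in its own table; take the first cell's text.
        tables = soup.find_all('table')
        return [table.find('td').text for table in tables]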
So what have we done here? Well, we have scraped the page above (as we do) for the user-agent information listed on that page. As with all of our little scraping scripts, it started with a Requests get() call. We then fed the response content to our HTML parser (BeautifulSoup), found the table markup for each user-agent, and extracted the text of the first data cell (td) of each table. Now that this function is complete, does the __init__ make a bit more sense? If not, allow me to explain:
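To see what “the first td of each table” yields without hitting the network, here is a sketch that runs the same extraction on a small made-up HTML string. It swaps BeautifulSoup for the standard library's html.parser so it runs with no extra installs. The markup, the ExampleBrowser strings, and the FirstCellParser name are all assumptions, and the real page's layout may differ.

```python
from html.parser import HTMLParser

# Made-up markup imitating "one table per User-Agent".
SAMPLE_HTML = """
<table><tr><td>Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/1.0</td></tr></table>
<table><tr><td>Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/2.0</td><td>ignored</td></tr></table>
"""


class FirstCellParser(HTMLParser):
    """Collect the text of the first <td> inside each <table>."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._seen_cell = False   # has this table's first <td> been seen?
        self._capturing = False   # are we currently inside that <td>?
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._seen_cell = False
        elif tag == "td" and not self._seen_cell:
            self._seen_cell = True
            self._capturing = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "td" and self._capturing:
            self._capturing = False
            self.cells.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._capturing:
            self._buffer.append(data)


parser = FirstCellParser()
parser.feed(SAMPLE_HTML)
parser.close()
print(parser.cells)
```

Only the first cell of each table is kept, which mirrors what table.find('td').text does in get_ua_list.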
As soon as we create an instance of the UserAgent class ANYWHERE in our code, it will run this function without us needing to touch it. When the function completes, __init__ picks one User-Agent at random from the returned list and assigns it to the attribute new_ua. How is this relevant? Take a look at the terminal screen capture below, which shows the output of the following code:
for _ in range(4):
    # Each UserAgent() call fetches the list and picks a random entry.
    print(UserAgent().new_ua)
As you can see, we do not need to write the user agents to a file, which involves opening the file, reading it, then closing it again, relying on disk storage. Instead, each new instance reads the list into RAM (the faster, easier-to-clean storage) and hands us a randomly chosen User-Agent through the random.choice() function.
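One thing worth keeping in mind: as written, every UserAgent() call downloads the source page again. If that becomes a bottleneck, one possible variation is to fetch the list once and reuse it. The sketch below is only that, a sketch: CachedUserAgent, the fetch parameter, and fake_fetch are hypothetical names, and the stand-in fetcher replaces the real scraping so the example runs offline.

```python
import random


class CachedUserAgent:
    """Hypothetical variant of UserAgent that loads the list only once."""

    _ua_cache = None  # class-level, so every instance shares one list

    def __init__(self, fetch):
        # `fetch` is any zero-argument callable returning a list of strings,
        # e.g. a function wrapping the scraping logic from get_ua_list.
        if CachedUserAgent._ua_cache is None:
            CachedUserAgent._ua_cache = list(fetch())
        self.new_ua = random.choice(CachedUserAgent._ua_cache)


# Stand-in fetcher so the sketch runs offline; it records each call.
calls = []


def fake_fetch():
    calls.append(1)
    return ["agent-a", "agent-b", "agent-c"]


agents = [CachedUserAgent(fake_fetch).new_ua for _ in range(4)]
print(agents)      # four random picks from the cached list
print(len(calls))  # 1
```

This version is not thread-safe on its own. With multiple threads, the first fetch would need a lock around it.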
Pro Tip: For those of you who are unfamiliar with random.choice(), it basically does the same thing as indexing a list with random.randint(0, len(sample_list) - 1). Note the - 1: randint() includes both endpoints, so using len(sample_list) as the upper bound can raise an IndexError. Simply put, random.choice() selects a pseudo-random item from any non-empty sequence (such as a list or tuple) you pass to it. In instances like this, random.choice() is my preference because both approaches do roughly the same thing and .choice() is fewer keystrokes.
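Here are the two approaches side by side, as a small sketch with a made-up sample_list. The detail to watch is that random.randint() includes both endpoints, so the index version has to stop at len(sample_list) - 1.

```python
import random

sample_list = ["ua-one", "ua-two", "ua-three"]

# Approach 1: let random.choice() pick an item directly.
picked_with_choice = random.choice(sample_list)

# Approach 2: pick an index, then look it up.
# randint includes both endpoints, so the upper bound must be len - 1.
index = random.randint(0, len(sample_list) - 1)
picked_with_randint = sample_list[index]

print(picked_with_choice, picked_with_randint)
```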
Alright, Mavericks (Season of the Drifter, anyone?), you’ve got the tools to fool 99% of bot filters. But do you know how to use them? Head out into the Wild and find out, then report back. I’d love to hear your stories of the datasets you’re gathering and what you plan to do with them! Successes and failures are equally welcome because this completes the first set of countermeasures we’re going to discover. These are referred to (by me at least) as the “hard skills” of web scraping: the skills we need to develop our knowledge. But how are your instincts? Find out next week in part four when we break into the intermediate stuff: Timing and Throttling!
Original article: Bye Bye 403 – Building a Filter Resistant Web Crawler Part III: User Agents