Google Patents contains a vast database of patents granted worldwide and is used to explore them for a particular query. It has a vast collection of documents and consists of patents issued by the United States Patent and Trademark Office, the European Patent Office (EPO), and many other patent offices globally.
In this article, we will try to access and scrape Google Patents, a valuable resource for accessing patent information from the internet.
Let’s Start Scraping Google Patents Using Python
We’ll mainly focus on scraping organic results from Google Patents.
Let’s get started by installing the required libraries which we will use further in this tutorial.
pip install selenium
pip install beautifulsoup4
Enter fullscreen mode Exit fullscreen mode
-
Selenium — Web Driver for opening the URL in the Chrome browser.
-
Beautiful Soup — For parsing the HTML data.
Next, we will import these libraries into our program file.
import time
import json
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
from selenium import webdriver
Enter fullscreen mode Exit fullscreen mode
After that, we will initialize our Service Path where the Chrome driver is installed.
SERVICE_PATH = "E:\chromedriver.exe"
service = Service(SERVICE_PATH)
driver = webdriver.Chrome(service=service)
Enter fullscreen mode Exit fullscreen mode
Then, we will define our target URL to make a get request with the driver and wait for two seconds till the page completely loads.
driver.get("https://patents.google.com/?q=(ball+bearings)&oq=ball+bearings")
time.sleep(2)
Enter fullscreen mode Exit fullscreen mode
After that, we will create a Beautiful Soup instance to parse the extracted HTML data.
soup = BeautifulSoup(driver.page_source, "html.parser")
Enter fullscreen mode Exit fullscreen mode
Then, we will inspect the target web page to find the respective tags of the required data points.
As you can see, every patent result is under the tag search-result-item. We will use this tag as a reference to find other elements.
patent_results = [];
for el in soup.select("search-result-item"):
Enter fullscreen mode Exit fullscreen mode
Scraping Patent Title
So, the title is present under the h3
tag. Add this title inside the for loop block.
title = el.select_one("h3").get_text(),
Enter fullscreen mode Exit fullscreen mode
Scraping Patent Link
In the above image, we can see that the anchor link is under the tag a with the class state-modifier.
Let us study this link before adding it to our code.
https://patents.google.com/patent/US10945783B2/en?q=(ball+bearings)&oq=ball+bearings&peid=61296253f5d80%3Af%3A51b23c37
Enter fullscreen mode Exit fullscreen mode
Remove the noisy part, you will get this as the link.
https://patents.google.com/patent/US10945783B2/
Enter fullscreen mode Exit fullscreen mode
The unique entity in this URL is the patent ID which is US10945783B2
. So, we need a patent ID to create a link.
If you inspect the title again, you will find that the h3
header is inside the tag state-modifier
which consists of an attribute data-result
that has the patent ID.
We will add this to our code to create a link.
link = "https://patents.google.com/" + element.select_one("state-modifier")['data-result']
Enter fullscreen mode Exit fullscreen mode
Scraping Metadata and Dates
Metadata is additional information on the Google Patents page for categorization, indexing, and search purposes. You can notice it below the title of the organic patent result.
Let us add this in our code also.
metadata = ' '.join(element.select_one('h4.metadata').get_text().split())
Enter fullscreen mode Exit fullscreen mode
Similarly, we will select the dates that are present with the same tag h4
but with different class name dates.
dates = element.select_one('h4.dates').get_text().strip()
Enter fullscreen mode Exit fullscreen mode
Finally, we will extract the snippet part of the patent.
Scraping Snippet
The snippet is contained inside the span
tag with the id htmlContent
.
At last, we will add this to our code.
snippet = element.select_one('span#htmlContent').get_text()
Enter fullscreen mode Exit fullscreen mode
We have parsed every data point we need from the web page.
At the end, we will append these data points to our patent_results
array.
for el in soup.select('search-result-item'):
title = el.select_one('h3').get_text().strip()
link = "https://patents.google.com/" + el.select_one("state-modifier")['data-result']
metadata = ' '.join(el.select_one('h4.metadata').get_text().split())
date = el.select_one('h4.dates').get_text().strip()
snippet = el.select_one('span#htmlContent').get_text()
patent_results.append({
'title': title,
'link': link,
'metadata': metadata,
'date': date,
'snippet': snippet
})
Enter fullscreen mode Exit fullscreen mode
Complete Code
You can modify the below code as per your requirements. But for this tutorial, we will go with this one:
import time
import json
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
from selenium import webdriver
SERVICE_PATH = "E:\chromedriver.exe"
service = Service(SERVICE_PATH)
driver = webdriver.Chrome(service=service)
driver.get("https://patents.google.com/?q=(ball+bearings)&oq=ball+bearings")
time.sleep(2);
soup = BeautifulSoup(driver.page_source, "html.parser")
patent_results = []
for el in soup.select('search-result-item'):
title = el.select_one('h3').get_text().strip()
link = "https://patents.google.com/" + el.select_one("state-modifier")['data-result']
metadata = ' '.join(el.select_one('h4.metadata').get_text().split())
date = el.select_one('h4.dates').get_text().strip()
snippet = el.select_one('span#htmlContent').get_text()
patent_results.append({
'title': title,
'link': link,
'metadata': metadata,
'date': date,
'snippet': snippet
})
print(json.dumps(patent_results, indent=2))
driver.quit()
Enter fullscreen mode Exit fullscreen mode
Run this program in your terminal and you will get the below desired results.
[
{
title: 'Surgical instrument with modular shaft and end effector',
link: 'https://patent.google.com/patent/US10945783B2/en',
metadata: 'WO EP US CN JP AU CA US10945783B2 Kevin L. Houser Ethicon Llc',
date: 'Priority 2010-11-05 • Filed 2019-08-02 • Granted 2021-03-16 • Published 2021-03-16',
snippet: ' Surgical instrument with modular shaft and end effectorKevin L. HouserEthicon Llc A surgical instrument operable to sever tissue includes a body assembly and a selectively coupleable end effector assembly. The end effector assembly may include a transmission assembly, an end effector, and a rotational knob operable to rotate the transmission assembly and the end effector. The …'
},
{
title: 'Disk drive executing jerk seeks to rotate pivot ball bearings relative to races',
link: 'https://patent.google.com/patent/US8780479B1/en',
metadata: 'US CN HK US8780479B1 Daniel L. Helmick Western Digital Technologies, Inc.',
date: 'Priority 2013-05-17 • Filed 2013-06-24 • Granted 2014-07-15 • Published 2014-07-15',
snippet: ' Disk drive executing jerk seeks to rotate pivot ball bearings relative to racesDaniel L. HelmickWestern Digital Technologies, Inc. a voice coil motor (VCM) operable to rotate the actuator arm about a pivot bearing including a race and a plurality of ball bearings; and control circuitry operable to: control the VCM to execute a first jerk seek in a first radial direction so that the ball bearings slip within the race by a first …'
},
.....
Enter fullscreen mode Exit fullscreen mode
Conclusion
In a nutshell, scraping Google Patents can be a powerful tool for researchers who seek to create a patent database for market research, data analysis, academic projects, and much more.
In this article, we learned to scrape Google Patent results using Python. Shortly, we will keep updating this article to add more insights into scraping Google Patents data more efficiently.
Feel free to contact us for anything you need clarification on. Follow me on Twitter. Thanks for reading!
Additional Resources
I have prepared a complete list of blogs on Google scraping, which can help you in your data extraction journey:
暂无评论内容