Scraping NHK News Web Easy with Python: A Step-by-Step Guide

If you are learning Japanese, NHK News Web Easy is a great resource. Want to extract real-time news from it using Python? In this tutorial, we'll use Selenium + BeautifulSoup to scrape the latest articles, save them in a structured format, and even export them as a Word document.

By the end of this tutorial, you will learn how to:

  • Fetch and parse dynamic web pages using Selenium
  • Extract news titles, links, and content with BeautifulSoup
  • Export data into a structured Word document
  • Avoid getting blocked and optimize your scraper

Demo: What We Are Building

Before we dive into the code, here’s what our scraper will do:

  1. Visit NHK News Web Easy
  2. Extract the latest news headlines & links
  3. Scrape the full article content
  4. Save the data into a structured Word file

The output is a Word file (NHK_News.docx) with each article's title as a heading, followed by its Japanese text.

Step 1: Install Required Packages

We will use the following Python libraries:

  • requests: To fetch webpage content
  • selenium: To handle dynamic JavaScript-loaded content
  • bs4 (BeautifulSoup): To parse HTML
  • python-docx (imported as docx): To save news articles into a Word document
  • webdriver-manager: To automatically download a matching ChromeDriver

Install them with pip:

pip install requests selenium bs4 python-docx webdriver-manager


Step 2: Fetching the NHK News Web Easy Homepage

Since NHK News Web Easy loads content dynamically using JavaScript, we need Selenium to handle the page rendering.

Here’s how to launch a headless Chrome browser and fetch the homepage:


from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time

# Configure Selenium WebDriver
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run in background
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")

# Launch the WebDriver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=chrome_options)

# Visit NHK News Web Easy
NHK_URL = "https://www3.nhk.or.jp/news/easy/"
driver.get(NHK_URL)

# Wait for JavaScript to load content
time.sleep(5)

# Get the HTML source code
html = driver.page_source
driver.quit()

print(html[:500])  # Display first 500 characters of the HTML


What this does:
• Starts a headless Chrome browser to load JavaScript content
• Fetches the entire rendered webpage (including dynamically loaded articles)
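
A fixed time.sleep(5) works, but it either wastes time or breaks on slow connections. A more robust approach is Selenium's explicit waits, which block until the element you need actually appears. Here is a minimal sketch, assuming the homepage renders the article.news-list__item elements we target in Step 3:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for at least one news item to render,
# instead of sleeping for a fixed 5 seconds
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "article.news-list__item"))
)
html = driver.page_source
driver.quit()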

Step 3: Extracting News Links

Once we have the full page source, we use BeautifulSoup to extract all news articles.

from bs4 import BeautifulSoup

# Parse HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

# Extract news article links
articles = soup.select("article.news-list__item a")
news_links = ["https://www3.nhk.or.jp" + a["href"] for a in articles]

print(f"Found {len(news_links)} news articles.")
print(news_links[:5])  # Preview first 5 links


This code:
• Finds all article links on the homepage
• Extracts absolute URLs for further scraping
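
One caveat: the string concatenation above assumes every href is site-relative. urljoin handles both relative and absolute hrefs, and a seen set drops duplicate links while preserving order. A small optional refinement:

from urllib.parse import urljoin

news_links = []
seen = set()
for a in articles:
    # urljoin resolves relative hrefs against the base URL
    # and leaves absolute URLs untouched
    url = urljoin(NHK_URL, a["href"])
    if url not in seen:
        seen.add(url)
        news_links.append(url)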

Step 4: Scraping Full Article Content

Now, let’s fetch each article and extract title + content.

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

for news_url in news_links[:5]:  # Limit to 5 articles
    response = requests.get(news_url, headers=headers)
    response.encoding = "utf-8"

    news_soup = BeautifulSoup(response.text, "html.parser")
    title_tag = news_soup.find("h1", class_="article-title")
    title = title_tag.text.strip() if title_tag else "No Title"
    content_blocks = news_soup.find("div", class_="article-body")

    content = "\n".join([p.text.strip() for p in content_blocks.find_all("p")]) if content_blocks else "No content available."

    print(f" {title}\n{content[:200]}...\n")


This script:
• Fetches each news page
• Extracts the title and article body
• Prints a preview of the first 200 characters
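
One quirk worth knowing: NHK News Web Easy annotates kanji with furigana using HTML <ruby>/<rt> tags, so the extracted text may interleave kana readings with the base text. If you see that, strip the <rt> (and <rp>) elements before joining the paragraphs. A minimal sketch, assuming standard ruby markup:

if content_blocks:
    # <rt> holds the furigana reading, <rp> holds fallback parentheses;
    # removing both leaves only the base text
    for tag in content_blocks.find_all(["rt", "rp"]):
        tag.decompose()
    content = "\n".join(p.text.strip() for p in content_blocks.find_all("p"))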

Step 5: Saving News to Word

Finally, let’s store the scraped articles in a structured Word document.

from docx import Document

doc = Document()
doc.add_heading("NHK News Web Easy Articles", level=1)

for news_url in news_links[:5]:  # Limit to 5 articles
    response = requests.get(news_url, headers=headers)
    response.encoding = "utf-8"
    news_soup = BeautifulSoup(response.text, "html.parser")

    title_tag = news_soup.find("h1", class_="article-title")
    title = title_tag.text.strip() if title_tag else "No Title"
    content_blocks = news_soup.find("div", class_="article-body")
    content_jp = "\n".join([p.text.strip() for p in content_blocks.find_all("p")]) if content_blocks else "No content."

    doc.add_heading(title, level=2)
    doc.add_paragraph(f"Japanese Text:\n{content_jp}")
    doc.add_paragraph("-" * 50)

doc.save("NHK_News.docx")
print(" NHK_News.docx saved successfully!")

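Depending on your Word setup, the Japanese text may render with a fallback font. python-docx can set the document's default East Asian font explicitly; here is a hedged sketch ("MS Mincho" is just an example, substitute any Japanese font installed on your machine):

from docx.oxml.ns import qn

style = doc.styles["Normal"]
style.font.name = "MS Mincho"  # Latin font slot
# East Asian text uses a separate font slot in the underlying XML
style._element.rPr.rFonts.set(qn("w:eastAsia"), "MS Mincho")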

Now you have a fully automated news scraper!

Conclusion

• Selenium fetches dynamically loaded content
• BeautifulSoup extracts articles
• python-docx saves content in Word format


This is the final code. You can change the number 20 in for news_url in news_links[:20]: to adjust how many news articles you want to scrape.

import requests
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time
from bs4 import BeautifulSoup
from docx import Document

# Fake browser request to prevent NHK from blocking
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}

# Launch Selenium
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run in the background
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("start-maximized")
chrome_options.add_argument("disable-infobars")
chrome_options.add_argument("--disable-extensions")
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=chrome_options)

# Access NHK Easy News
NHK_URL = "https://www3.nhk.or.jp/news/easy/"
driver.get(NHK_URL)

# Wait 5 seconds for JavaScript to load
time.sleep(5)

# Get the page HTML
html = driver.page_source
driver.quit()  # Close the browser

# Parse the HTML
soup = BeautifulSoup(html, "html.parser")

# Extract news links
articles = soup.select("article.news-list__item a")
news_links = ["https://www3.nhk.or.jp" + a["href"] for a in articles]

print(f"Retrieved {len(news_links)} news articles")
if not news_links:
    print(" No news found, please check the HTML structure!")
    exit()

# Create a Word document
doc = Document()
doc.add_heading("NHK News Web Easy Article Collection", level=1)

for news_url in news_links[:20]:  # Scrape only the first 20 news articles
    print(f"Fetching: {news_url}")

    # Retrieve the news page
    response = requests.get(news_url, headers=headers)
    response.encoding = "utf-8"  # Ensure UTF-8 decoding
    news_soup = BeautifulSoup(response.text, "html.parser")

    # Get news title (updated class)
    title_tag = news_soup.find("h1", class_="article-title")
    title = title_tag.text.strip() if title_tag else "No Title"

    # Get news content (updated class)
    content_blocks = news_soup.find("div", class_="article-body")
    if content_blocks:
        content_jp = "\n".join([p.text.strip() for p in content_blocks.find_all("p")])
    else:
        content_jp = "No Content"

    print(f" Successfully retrieved: {title}")

    # Write to Word document
    doc.add_heading(title, level=2)
    doc.add_paragraph(f"Original Japanese Text:\n{content_jp}")
    doc.add_paragraph("-" * 50)

# Save Word file
doc.save("NHK_News.docx")
print(" Article collection completed, saved as NHK_News.docx")

