Scraping NHK News Web Easy with Python: A Step-by-Step Guide

If you are learning Japanese, NHK News Web Easy is a very nice resource: it publishes current news stories written in simplified Japanese. Want to extract the latest news from NHK News Web Easy using Python? In this tutorial, we’ll use Selenium + BeautifulSoup to scrape the latest articles, save them in a structured format, and export them as a Word document.

By the end of this tutorial, you will learn how to:

  • Fetch and parse dynamic web pages using Selenium
  • Extract news titles, links, and content with BeautifulSoup
  • Export data into a structured Word document
  • Send a browser-like User-Agent header so your requests are less likely to be blocked

Demo: What We Are Building

Before we dive into the code, here’s what our scraper will do:

  1. Visit NHK News Web Easy
  2. Extract the latest news headlines & links
  3. Scrape the full article content
  4. Save the data into a structured Word file

Step 1: Install Required Packages

We will use the following Python libraries:

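  • requests: To fetch the individual article pages
  • selenium: To handle dynamic JavaScript-loaded content
  • bs4 (BeautifulSoup): To parse HTML
  • python-docx (imported as docx): To save news articles into a Word document
  • webdriver-manager: To download a ChromeDriver that matches your Chrome version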

Install them with pip (shown here on a Mac):

pip install requests selenium bs4 python-docx webdriver-manager
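Note that the name you pass to pip is not always the name you import: python-docx is imported as docx, and webdriver-manager as webdriver_manager. A quick way to confirm the install worked is to ask Python which modules it can find. The sketch below uses only the standard library; missing_modules is a hypothetical helper name chosen for this example, not part of any of these packages.

```python
from importlib.util import find_spec

# Import names (not pip names) used in this tutorial
REQUIRED_MODULES = ["requests", "selenium", "bs4", "docx", "webdriver_manager"]


def missing_modules(names):
    """Return the top-level module names that Python cannot find."""
    return [name for name in names if find_spec(name) is None]


# An empty list means every package is importable
print("Missing modules:", missing_modules(REQUIRED_MODULES))
```

If the list is not empty, rerun the pip command for the packages that are missing.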

Step 2: Fetching the NHK News Web Easy Homepage

Since NHK News Web Easy loads content dynamically using JavaScript, we need Selenium to handle the page rendering.

Here’s how to launch a headless Chrome browser and fetch the homepage:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
import time

# Configure Selenium WebDriver
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run in background

# Launch the WebDriver
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=chrome_options)

# Visit NHK News Web Easy
NHK_URL = ""  # Set this to the NHK News Web Easy homepage URL
driver.get(NHK_URL)

# Wait for JavaScript to load content
time.sleep(5)

# Get the HTML source code
html = driver.page_source
driver.quit()  # Close the browser once the HTML is captured

print(html[:500])  # Display first 500 characters of the HTML


What this does:
• Starts a headless Chrome browser to load JavaScript content
• Fetches the entire rendered webpage (including dynamically loaded articles)
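A fixed pause is simple, but it either wastes time or, on a slow connection, ends too early. A common alternative is to poll until the content you need shows up. Selenium provides WebDriverWait for this purpose; the sketch below shows the same idea with the standard library only, so it can be tried without a browser. wait_until and its parameters are hypothetical names chosen for illustration.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)
```

With a running driver, a call such as wait_until(lambda: "<a " in driver.page_source), made before reading driver.page_source, could stand in for the fixed pause. The substring to look for is an assumption and depends on the page's actual markup.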

Step 3: Extracting News Links

Once we have the full page source, we use BeautifulSoup to extract the links to the news articles.

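from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Parse HTML with BeautifulSoup
soup = BeautifulSoup(html, "html.parser")

# Extract news article links
# Assumption: "a[href]" matches every link on the page; narrow the selector
# to the article list container after inspecting the homepage markup.
articles = soup.select("a[href]")
# urljoin turns relative hrefs into absolute URLs based on the homepage URL
news_links = [urljoin(NHK_URL, a["href"]) for a in articles]

print(f"Found {len(news_links)} news articles.")
print(news_links[:5])  # Preview first 5 links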


This code:
• Finds all article links on the homepage
• Extracts absolute URLs for further scraping
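A homepage often links to the same article more than once (for example from an image and from a headline), and href values may be relative. The sketch below resolves each href against a base URL with urllib.parse.urljoin and drops duplicates while keeping the original order. The function name and the example.com URLs are placeholders for illustration, not NHK's real paths.

```python
from urllib.parse import urljoin


def absolute_unique_links(base_url, hrefs):
    """Resolve hrefs against base_url, dropping duplicates but keeping order."""
    seen = set()
    links = []
    for href in hrefs:
        url = urljoin(base_url, href)
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links


# Placeholder data for illustration only
example = absolute_unique_links(
    "https://example.com/news/",
    ["a1.html", "/news/a1.html", "https://example.com/other/b2.html"],
)
print(example)
```

In this step it could be called as absolute_unique_links(NHK_URL, [a["href"] for a in articles]).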

Step 4: Scraping Full Article Content

Now, let’s fetch each article and extract title + content.

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36"
}

for news_url in news_links[:5]:  # Limit to 5 articles
    response = requests.get(news_url, headers=headers, timeout=10)
    response.encoding = "utf-8"

    news_soup = BeautifulSoup(response.text, "html.parser")
    # Guard against pages where the title element is missing
    title_tag = news_soup.find("h1", class_="article-title")
    title = title_tag.text.strip() if title_tag else "No Title"
    content_blocks = news_soup.find("div", class_="article-body")

    content = "\n".join([p.text.strip() for p in content_blocks.find_all("p")]) if content_blocks else "No content available."

    print(f"{title}\n{content[:200]}...\n")


This script:
• Fetches each news page
• Extracts the title and article body
• Prints a preview of the first 200 characters
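The loop above sends its requests back to back and stops at the first network error. To keep the load on the server low and make the scraper less likely to be blocked, it helps to pause between requests and retry failed ones a few times. The sketch below is one way to do that; fetch_with_retry, its parameters, and the injected fetch callable are hypothetical names, and the delays are arbitrary starting values.

```python
import time


def fetch_with_retry(fetch, url, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(url), making at most `retries` attempts.

    The pause doubles after each failure (delay, 2 * delay, ...). The last
    exception is re-raised if every attempt fails.
    """
    if retries < 1:
        raise ValueError("retries must be at least 1")
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception:
            if attempt == retries - 1:
                raise
            sleep(delay * (2 ** attempt))
```

With requests, fetch could be lambda u: requests.get(u, headers=headers, timeout=10). In real use, narrowing the except clause to requests.RequestException would avoid retrying on programming errors.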

Step 5: Saving News to Word

Finally, let’s store the scraped articles in a structured Word document.

from docx import Document

doc = Document()
doc.add_heading("NHK News Web Easy Articles", level=1)

for news_url in news_links[:5]:  # Limit to 5 articles
    response = requests.get(news_url, headers=headers, timeout=10)
    response.encoding = "utf-8"
    news_soup = BeautifulSoup(response.text, "html.parser")

    # Guard against pages where the title element is missing
    title_tag = news_soup.find("h1", class_="article-title")
    title = title_tag.text.strip() if title_tag else "No Title"
    content_blocks = news_soup.find("div", class_="article-body")
    content_jp = "\n".join([p.text.strip() for p in content_blocks.find_all("p")]) if content_blocks else "No content."

    doc.add_heading(title, level=2)
    doc.add_paragraph(f"Japanese Text:\n{content_jp}")
    doc.add_paragraph("-" * 50)

# Save the Word file once all articles have been added
doc.save("NHK_News.docx")
print("NHK_News.docx saved successfully!")


Now you have a fully automated news scraper!


• Selenium fetches dynamically loaded content
• BeautifulSoup extracts articles
• python-docx saves content in Word format

Here is the complete script. To change how many articles are collected, edit the number 20 in the line for news_url in news_links[:20]: to any other value.

import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from docx import Document
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# Browser-like User-Agent so the requests are less likely to be blocked
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/ Safari/537.36"
}

# Launch Selenium
chrome_options = Options()
chrome_options.add_argument("--headless")  # Run in the background
service = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=service, options=chrome_options)

# Access NHK Easy News
NHK_URL = ""  # Set this to the NHK News Web Easy homepage URL
driver.get(NHK_URL)

# Wait 5 seconds for JavaScript to load
time.sleep(5)

# Get the page HTML
html = driver.page_source
driver.quit()  # Close the browser

# Parse the HTML
soup = BeautifulSoup(html, "html.parser")

# Extract news links
# Assumption: "a[href]" matches every link on the page; narrow the selector
# to the article list container after inspecting the homepage markup.
articles = soup.select("a[href]")
news_links = [urljoin(NHK_URL, a["href"]) for a in articles]

print(f"Retrieved {len(news_links)} news articles")
if not news_links:
    print("No news found, please check the HTML structure!")

# Create a Word document
doc = Document()
doc.add_heading("NHK News Web Easy Article Collection", level=1)

for news_url in news_links[:20]:  # Scrape only the first 20 news articles
    print(f"Fetching: {news_url}")

    # Retrieve the news page
    response = requests.get(news_url, headers=headers, timeout=10)
    response.encoding = "utf-8"  # Ensure UTF-8 decoding
    news_soup = BeautifulSoup(response.text, "html.parser")

    # Get news title (updated class)
    title_tag = news_soup.find("h1", class_="article-title")
    title = title_tag.text.strip() if title_tag else "No Title"

    # Get news content (updated class)
    content_blocks = news_soup.find("div", class_="article-body")
    if content_blocks:
        content_jp = "\n".join([p.text.strip() for p in content_blocks.find_all("p")])
    else:
        content_jp = "No Content"

    print(f"Successfully retrieved: {title}")

    # Write to Word document
    doc.add_heading(title, level=2)
    doc.add_paragraph(f"Original Japanese Text:\n{content_jp}")
    doc.add_paragraph("-" * 50)

# Save Word file
doc.save("NHK_News.docx")
print("Article collection completed, saved as NHK_News.docx")
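Editing the slice by hand works, but the limit can also be read from the command line. The sketch below uses argparse from the standard library; the --limit option name and the default of 20 (matching the script above) are choices made here for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line options for the scraper."""
    parser = argparse.ArgumentParser(
        description="Scrape NHK News Web Easy articles into a Word file"
    )
    parser.add_argument(
        "--limit",
        type=int,
        default=20,
        help="maximum number of articles to save (default: 20)",
    )
    args = parser.parse_args(argv)
    if args.limit < 1:
        # parser.error prints a usage message and exits the program
        parser.error("--limit must be a positive integer")
    return args
```

After args = parse_args(), the loop header would become for news_url in news_links[:args.limit]:.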


