Building a Review Scraper with Python using BeautifulSoup and Sentiment Analysis with NLTK

Introduction

Analyzing online reviews has become an important practice for many businesses. Understanding customer sentiment helps identify areas for improvement and gauge overall customer satisfaction. In this article, we’ll build a review scraper in Python with BeautifulSoup and analyze the sentiment of the collected reviews with NLTK.

Creating the Review Scraper with BeautifulSoup

To begin, we used Python with the BeautifulSoup library to extract the reviews of a leading Italian company from its Trustpilot page. BeautifulSoup parses the HTML markup of a web page and lets us pick out elements by tag or attribute. We fetched each results page with requests, located the review elements, and stored each review’s title and text in a pandas DataFrame for further analysis.


import requests
from bs4 import BeautifulSoup
import pandas as pd

# Range of pages to scrape
page_start = 1
page_end = 49

# Collect one dict per review (DataFrame.append was removed in pandas 2.0)
rows = []

# Loop through the pages
for page_num in range(page_start, page_end + 1):
    # Construct the URL for the current page
    url = f'https://it.trustpilot.com/review/www.companyname.it?page={page_num}'

    # Make an HTTP request to fetch the page content
    response = requests.get(url, timeout=10)
    if response.status_code == 200:
        # Use BeautifulSoup to parse the HTML of the page
        soup = BeautifulSoup(response.content, 'html.parser')

        # Find all review elements
        reviews = soup.find_all(attrs={"data-review-content": True})

        # Extract title and text of each review
        for review in reviews:
            title_element = review.find(attrs={"data-service-review-title-typography": True})
            content_element = review.find(attrs={"data-service-review-text-typography": True})

            if title_element and content_element:
                rows.append({"title": title_element.text, "text": content_element.text})
            else:
                print("Title or text element not found.")
    else:
        print(f"Page {page_num} returned status {response.status_code}")

# Build the DataFrame with all review data
df = pd.DataFrame(rows, columns=["title", "text"])

# Display the DataFrame (a notebook renders the last expression)
df
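
The attribute-based lookups used by the scraper can be tried without any network access. The sketch below is only an illustration: the SAMPLE_HTML markup and the extract_reviews helper are hypothetical and simply mimic the data attributes targeted above, so the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the attributes targeted by the scraper.
SAMPLE_HTML = """
<section data-review-content="true">
  <h2 data-service-review-title-typography="true">Great service</h2>
  <p data-service-review-text-typography="true">Fast delivery and friendly support.</p>
</section>
<section data-review-content="true">
  <h2 data-service-review-title-typography="true">No text here</h2>
</section>
"""


def extract_reviews(html):
    """Return a list of {'title', 'text'} dicts for complete reviews."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # attrs={name: True} matches any element that carries the attribute.
    for review in soup.find_all(attrs={"data-review-content": True}):
        title = review.find(attrs={"data-service-review-title-typography": True})
        text = review.find(attrs={"data-service-review-text-typography": True})
        if title is None or text is None:
            # Skip reviews that are missing a title or a body.
            continue
        rows.append({
            "title": title.get_text(strip=True),
            "text": text.get_text(strip=True),
        })
    return rows


reviews = extract_reviews(SAMPLE_HTML)
print(reviews)
# [{'title': 'Great service', 'text': 'Fast delivery and friendly support.'}]
```

Keeping the parsing in a function that takes an HTML string makes it easy to check the selectors against a saved page before running the full 49-page loop.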

Review Analysis with NLTK

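Once the reviews were extracted, we employed the Natural Language Toolkit (NLTK), a widely used Python library for Natural Language Processing (NLP). NLTK provides a range of tools for text analysis, including sentiment analysis.

We used NLTK’s SentimentIntensityAnalyzer, which is based on the VADER lexicon, to assess the sentiment of the reviews. For each text it returns negative, neutral, and positive proportions plus a compound score between -1 and 1; we label a review positive when the compound score is at least 0.05, negative when it is at most -0.05, and neutral otherwise. Note that the VADER lexicon is English, so Italian-language reviews will tend to score as neutral unless they are translated first. With that caveat, the labels give a quick overview of customer sentiment towards the company.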


import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the VADER lexicon for sentiment analysis
nltk.download('vader_lexicon')

# Create a SentimentIntensityAnalyzer object
sid = SentimentIntensityAnalyzer()

# Define a function to get the sentiment of a text
def get_sentiment(text):
    # Calculate the sentiment score of the text
    scores = sid.polarity_scores(text)

    # Determine the sentiment based on the compound score
    if scores['compound'] >= 0.05:
        return 'positive'
    elif scores['compound'] <= -0.05:
        return 'negative'
    else:
        return 'neutral'
        return 'neutral'
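
The thresholding step can be separated from the analyzer, which makes it easy to test on its own. The following is a minimal sketch under stated assumptions: label_from_compound, label_reviews, and fake_scorer are hypothetical names, and fake_scorer is only a stand-in for sid.polarity_scores so that the example runs without downloading the lexicon.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def label_reviews(texts, scorer):
    """Label each text; scorer(text) must return a dict with a 'compound' key."""
    return [label_from_compound(scorer(text)["compound"]) for text in texts]


# Hypothetical stand-in for sid.polarity_scores (not a real sentiment model).
def fake_scorer(text):
    lowered = text.lower()
    if "great" in lowered:
        return {"compound": 0.6}
    if "terrible" in lowered:
        return {"compound": -0.5}
    return {"compound": 0.0}


texts = ["Great service", "Terrible delay", "Order arrived"]
labels = label_reviews(texts, fake_scorer)
print(labels)  # ['positive', 'negative', 'neutral']
```

To use the real analyzer, pass sid.polarity_scores as the scorer argument instead of fake_scorer.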

Visualizing the Results

Finally, we used the analyzed data to create bar and pie charts displaying the percentages of negative, positive, and neutral reviews. These charts offer a visual representation of the overall sentiment of the reviews and allow for easy identification of trends.


import matplotlib.pyplot as plt

# Label each review with get_sentiment (defined above) so the 'sentiment' column exists
df['sentiment'] = df['text'].apply(get_sentiment)

# Count how many reviews fall into each sentiment category
value_counts = df['sentiment'].value_counts()

# Define colors for each category
colors = {'positive': 'green', 'negative': 'red', 'neutral': 'blue'}

# Create a pie chart using the defined colors
plt.pie(value_counts, labels=value_counts.index, colors=[colors[value] for value in value_counts.index], autopct='%1.1f%%')

# Add title
plt.title('Sentiment Analysis of Reviews for Company XYZ')

# Show the chart
plt.show()
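
The same percentages can also be computed directly and drawn as a bar chart. This is a rough sketch rather than the original plotting code: sentiment_percentages, plot_sentiment_bar, and the sample labels are hypothetical, and the labels stand in for the values of the 'sentiment' column.

```python
from collections import Counter


def sentiment_percentages(labels):
    """Return each label's share of the total as a percentage (1 decimal)."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


def plot_sentiment_bar(percentages, title="Sentiment share (%)"):
    """Draw a bar chart of the percentages and return the Figure."""
    # Imported lazily so sentiment_percentages works without matplotlib.
    import matplotlib.pyplot as plt

    colors = {"positive": "green", "negative": "red", "neutral": "blue"}
    labels = list(percentages)
    fig, ax = plt.subplots()
    ax.bar(
        labels,
        [percentages[label] for label in labels],
        color=[colors.get(label, "gray") for label in labels],
    )
    ax.set_ylabel("Share of reviews (%)")
    ax.set_title(title)
    return fig


# Hypothetical labels standing in for df['sentiment'].
sample_labels = ["positive", "positive", "negative", "neutral"]
percentages = sentiment_percentages(sample_labels)
print(percentages)  # {'positive': 50.0, 'negative': 25.0, 'neutral': 25.0}
```

Calling plot_sentiment_bar(percentages) followed by plt.show() displays the chart; with the real data, pass df['sentiment'] to sentiment_percentages.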

Conclusion

In this article, we’ve seen how to use Python along with the BeautifulSoup and NLTK libraries to create a review scraper and analyze online sentiment. The combination of these powerful libraries allowed us to gain valuable insights into customer sentiment and visualize the results clearly and comprehensively.

By employing similar techniques, businesses can actively monitor customer feedback and make informed decisions to enhance overall customer experience. The combination of web scraping and sentiment analysis is a powerful tool for online reputation monitoring and customer relationship management.
