Building a Review Scraper with Python using BeautifulSoup and Sentiment Analysis with NLTK

Introduction

In recent years, analyzing online reviews has become crucial for many businesses. Understanding customer sentiment helps identify areas for improvement and gauge overall customer satisfaction. In this article, we’ll use Python to build a review scraper with BeautifulSoup and analyze the sentiment of the collected reviews with NLTK.

Creating the Review Scraper with BeautifulSoup

To begin, we used Python and the BeautifulSoup library to extract reviews of a leading Italian company from its Trustpilot page. BeautifulSoup parses the HTML markup of a web page so that the elements of interest can be located and extracted. We collected the title and text of each review across the paginated results and stored them in a pandas DataFrame for further analysis.

import requests
from bs4 import BeautifulSoup
import pandas as pd

# First and last page to scrape
page_start = 1
page_end = 49

# Collect one dict per review; the DataFrame is built once at the end
rows = []

# Loop through the pages
for page_num in range(page_start, page_end + 1):
    # Construct the URL for the current page
    url = f'https://it.trustpilot.com/review/www.companyname.it?page={page_num}'

    # Make an HTTP request to fetch the page content
    response = requests.get(url, timeout=10)
    if response.status_code != 200:
        print(f"Page {page_num} returned status {response.status_code}; skipping.")
        continue

    # Use BeautifulSoup to parse the HTML of the page
    soup = BeautifulSoup(response.content, 'html.parser')

    # Find all review elements
    reviews = soup.find_all(attrs={"data-review-content": True})

    # Extract title and text of each review
    for review in reviews:
        title_element = review.find(attrs={"data-service-review-title-typography": True})
        content_element = review.find(attrs={"data-service-review-text-typography": True})

        if title_element and content_element:
            # DataFrame.append was removed in pandas 2.0, so accumulate rows in a list
            rows.append({"title": title_element.text, "text": content_element.text})
        else:
            print("Title or text element not found.")

# Build the DataFrame with all review data and display it
df = pd.DataFrame(rows, columns=["title", "text"])
df
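
To check the selectors without hitting the live site, the same extraction logic can be run against a small static snippet. The sketch below is only an illustration: `SAMPLE_HTML` and `extract_reviews` are hypothetical names, and the markup merely imitates the attributes the scraper looks for. The real page structure may differ and can change without notice.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the attributes used by the scraper.
SAMPLE_HTML = """
<div data-review-content="true">
  <h2 data-service-review-title-typography="true">Ottimo servizio</h2>
  <p data-service-review-text-typography="true">Consegna rapida e puntuale.</p>
</div>
<div data-review-content="true">
  <h2 data-service-review-title-typography="true">Solo titolo</h2>
</div>
"""


def extract_reviews(html):
    """Return a list of {"title", "text"} dicts for reviews that have both parts."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # attrs={"name": True} matches any element carrying that attribute,
    # whatever its value.
    for review in soup.find_all(attrs={"data-review-content": True}):
        title = review.find(attrs={"data-service-review-title-typography": True})
        text = review.find(attrs={"data-service-review-text-typography": True})
        if title is not None and text is not None:
            rows.append({
                "title": title.get_text(strip=True),
                "text": text.get_text(strip=True),
            })
    return rows


rows = extract_reviews(SAMPLE_HTML)
print(rows)
```

The second sample review has no text element, so it is skipped, which mirrors the "Title or text element not found." branch of the scraper.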

Review Analysis with NLTK

Once the reviews were extracted, we turned to the Natural Language Toolkit (NLTK), a widely used Python library for natural language processing (NLP). NLTK provides a range of tools for text analysis, including sentiment analysis.

We used NLTK’s SentimentIntensityAnalyzer, which relies on the VADER lexicon, to assess the sentiment of the reviews. For each text, the analyzer returns negative, neutral, positive, and compound scores. We labeled a review as positive when its compound score was at least 0.05, negative when it was at most -0.05, and neutral otherwise. This analysis gave us a clear picture of customer sentiment towards the company.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the VADER lexicon for sentiment analysis
nltk.download('vader_lexicon')

# Create a SentimentIntensityAnalyzer object
sid = SentimentIntensityAnalyzer()

# Define a function to get the sentiment of a text
def get_sentiment(text):
    # Calculate the sentiment scores of the text
    scores = sid.polarity_scores(text)
    # Determine the sentiment based on the compound score
    if scores['compound'] >= 0.05:
        return 'positive'
    elif scores['compound'] <= -0.05:
        return 'negative'
    else:
        return 'neutral'

# Label every review; the plotting code reads this 'sentiment' column
df['sentiment'] = df['text'].apply(get_sentiment)
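
The thresholding step can also be separated from the analyzer, which makes the boundary cases easy to check without downloading the lexicon. This is a minimal sketch: `label_from_compound` and its `threshold` parameter are hypothetical names, and the input is assumed to be a VADER compound score in the range -1 to 1. Note that the VADER lexicon is English, so scores for non-English text may be less informative.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label.

    The default threshold of 0.05 matches the cut-offs used in the article.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up scores chosen to exercise both boundaries.
for score in (0.6, 0.05, 0.0, -0.049, -0.05):
    print(score, label_from_compound(score))
```

Both boundaries are inclusive, so a score of exactly 0.05 counts as positive and exactly -0.05 as negative.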

Visualizing the Results

Finally, we used the analyzed data to create bar and pie charts displaying the percentages of negative, positive, and neutral reviews. These charts offer a visual representation of the overall sentiment of the reviews and allow for easy identification of trends.

import matplotlib.pyplot as plt

# Count unique values in the 'sentiment' column
value_counts = df['sentiment'].value_counts()

# Define colors for each category
colors = {'positive': 'green', 'negative': 'red', 'neutral': 'blue'}

# Create a pie chart using the defined colors
plt.pie(value_counts, labels=value_counts.index, colors=[colors[value] for value in value_counts.index], autopct='%1.1f%%')

# Add title
plt.title('Sentiment Analysis of Reviews for Company XYZ')

# Show the chart
plt.show()
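
The percentages shown on the chart can also be computed directly, which is handy for reporting or for a quick sanity check of the plot. The sketch below uses only the standard library. `sentiment_percentages` is a hypothetical helper, and the labels are made-up sample data rather than results from the scraped reviews.

```python
from collections import Counter


def sentiment_percentages(labels):
    """Return the share of each label as a percentage, rounded to one decimal."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


# Made-up labels for illustration only.
labels = ["positive", "positive", "negative", "neutral"]
print(sentiment_percentages(labels))
```

With a DataFrame at hand, passing `df['sentiment']` as `labels` should give the same figures as the pie chart's `autopct` labels, up to rounding.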

Conclusion

In this article, we’ve seen how to use Python along with the BeautifulSoup and NLTK libraries to create a review scraper and analyze online sentiment. The combination of these powerful libraries allowed us to gain valuable insights into customer sentiment and visualize the results clearly and comprehensively.

By employing similar techniques, businesses can actively monitor customer feedback and make informed decisions to improve the overall customer experience. Web scraping paired with sentiment analysis is a practical tool for online reputation monitoring and customer relationship management.
