Introduction
In recent years, analyzing online reviews has become crucial for many businesses. Understanding customer sentiment helps identify areas for improvement and gauge overall customer satisfaction. In this article, we’ll explore how to use Python to build a review scraper with BeautifulSoup and analyze the sentiment of the collected reviews with NLTK.
Creating the Review Scraper with BeautifulSoup
To begin, we used Python and the BeautifulSoup library to extract reviews from the Trustpilot page of a leading Italian company. BeautifulSoup parses the HTML markup of a web page and lets us pull out the elements of interest. Here, each review is located through its data attributes, and its title and text are collected into a pandas DataFrame for further analysis.
    import requests
    from bs4 import BeautifulSoup
    import pandas as pd

    # Number of pages to scrape
    page_start = 1
    page_end = 49

    # Collect one dict per review; the DataFrame is built once at the end
    # (DataFrame.append was removed in pandas 2.0)
    rows = []

    # Loop through the pages
    for page_num in range(page_start, page_end + 1):
        # Construct the URL for the current page
        url = f'https://it.trustpilot.com/review/www.companyname.it?page={page_num}'

        # Make an HTTP request to fetch the page content
        response = requests.get(url)

        if response.status_code == 200:
            # Use BeautifulSoup to parse the HTML of the page
            soup = BeautifulSoup(response.content, 'html.parser')

            # Find all review elements
            reviews = soup.find_all(attrs={"data-review-content": True})

            # Extract the title and text of each review
            for review in reviews:
                title_element = review.find(attrs={"data-service-review-title-typography": True})
                content_element = review.find(attrs={"data-service-review-text-typography": True})

                if title_element and content_element:
                    # Store the review as one row
                    rows.append({"title": title_element.text, "text": content_element.text})
                else:
                    print("Title or text element not found.")

    # Build the DataFrame with all review data and print it
    df = pd.DataFrame(rows, columns=["title", "text"])
    print(df)
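The attribute-based lookup used by the scraper can be tried offline before sending any requests. The following is a minimal sketch, not the article's scraper: `SAMPLE_HTML` and the helper `extract_reviews` are hypothetical names introduced for illustration, and the markup only mimics the data attributes referenced above, so the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: it mimics the data attributes used by the scraper,
# not the real page structure, which may differ or change over time.
SAMPLE_HTML = """
<article data-review-content="true">
  <h2 data-service-review-title-typography="true">Ottimo servizio</h2>
  <p data-service-review-text-typography="true">Consegna rapida e puntuale.</p>
</article>
<article data-review-content="true">
  <h2 data-service-review-title-typography="true">Solo titolo</h2>
</article>
"""


def extract_reviews(html):
    """Return a list of {"title", "text"} dicts for every complete review."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    # attrs={name: True} matches any element that carries the attribute,
    # whatever its value
    for review in soup.find_all(attrs={"data-review-content": True}):
        title = review.find(attrs={"data-service-review-title-typography": True})
        text = review.find(attrs={"data-service-review-text-typography": True})
        # Skip reviews that lack either a title or a body
        if title is not None and text is not None:
            rows.append({
                "title": title.get_text(strip=True),
                "text": text.get_text(strip=True),
            })
    return rows


reviews = extract_reviews(SAMPLE_HTML)
print(reviews)
```

Keeping the parsing in a function that takes an HTML string makes it possible to check the selectors against a saved page, and to notice quickly when the site changes its attributes.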
Review Analysis with NLTK
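Once the reviews were extracted, we turned to the Natural Language Toolkit (NLTK), a widely used Python library for Natural Language Processing (NLP). NLTK provides a range of tools for text analysis, including sentiment analysis.
We used NLTK’s SentimentIntensityAnalyzer, which is based on the VADER lexicon, to assess the sentiment of the reviews. For each text, the analyzer returns negative, neutral, and positive proportions together with a compound score between -1 and 1. We label a review positive when the compound score is at least 0.05, negative when it is at most -0.05, and neutral otherwise. Because the VADER lexicon is built for English, scores on Italian-language reviews should be treated with caution. Applied to every review, this gives an overview of customer sentiment towards the company.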
    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    # Download the VADER lexicon for sentiment analysis
    nltk.download('vader_lexicon')

    # Create a SentimentIntensityAnalyzer object
    sid = SentimentIntensityAnalyzer()

    # Define a function to get the sentiment of a text
    def get_sentiment(text):
        # Calculate the sentiment scores of the text
        scores = sid.polarity_scores(text)

        # Determine the sentiment based on the compound score
        if scores['compound'] >= 0.05:
            return 'positive'
        elif scores['compound'] <= -0.05:
            return 'negative'
        else:
            return 'neutral'

    # Label every review; the 'sentiment' column is used for the charts
    df['sentiment'] = df['text'].apply(get_sentiment)
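The thresholding step can be looked at in isolation, without downloading the lexicon. The sketch below assumes the compound scores have already been computed; `label_from_compound`, `sentiment_shares`, and the sample scores are hypothetical and only illustrate how the ±0.05 cut-offs turn scores into labels and percentages.

```python
from collections import Counter

LABELS = ("positive", "negative", "neutral")


def label_from_compound(compound, threshold=0.05):
    """Map a compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


def sentiment_shares(compounds):
    """Return the percentage of scores that fall under each label."""
    labels = [label_from_compound(score) for score in compounds]
    counts = Counter(labels)
    total = len(labels)
    if total == 0:
        # No reviews: report zero for every label instead of dividing by zero
        return {label: 0.0 for label in LABELS}
    return {label: 100 * counts[label] / total for label in LABELS}


# Hypothetical compound scores, chosen to include both boundary values
sample_scores = [0.8, 0.05, -0.3, 0.0, -0.05, 0.02]
shares = sentiment_shares(sample_scores)
print(shares)
```

Exposing the threshold as a parameter makes it easy to see how sensitive the positive, negative, and neutral shares are to the chosen cut-off.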
Visualizing the Results
Finally, we used the labeled data to create bar and pie charts showing the percentages of positive, negative, and neutral reviews. These charts summarize the overall sentiment of the reviews at a glance and make trends easy to spot.
    import matplotlib.pyplot as plt

    # Count unique values in the 'sentiment' column
    value_counts = df['sentiment'].value_counts()

    # Define colors for each category
    colors = {'positive': 'green', 'negative': 'red', 'neutral': 'blue'}

    # Create a pie chart using the defined colors
    plt.pie(value_counts, labels=value_counts.index, colors=[colors[value] for value in value_counts.index], autopct='%1.1f%%')

    # Add title
    plt.title('Sentiment Analysis of Reviews for Company XYZ')

    # Show the chart
    plt.show()
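The text mentions bar charts as well as pie charts. A bar chart of the same percentages could be sketched as follows; this is an illustration rather than the original plotting code, and `plot_sentiment_bars` and the sample labels are hypothetical. In practice the function would receive the `sentiment` column of the DataFrame.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical labels standing in for the 'sentiment' column
sample = pd.Series(
    ["positive", "positive", "negative", "neutral", "positive"],
    name="sentiment",
)


def plot_sentiment_bars(sentiments):
    """Draw a bar chart of label percentages; return the Axes and the data."""
    order = ["positive", "negative", "neutral"]
    colors = {"positive": "green", "negative": "red", "neutral": "blue"}

    # normalize=True yields fractions; reindex fixes the bar order and
    # fills in labels that never occur
    percentages = (
        sentiments.value_counts(normalize=True).reindex(order, fill_value=0) * 100
    )

    fig, ax = plt.subplots()
    ax.bar(
        list(percentages.index),
        percentages.values,
        color=[colors[label] for label in percentages.index],
    )
    ax.set_ylabel("Share of reviews (%)")
    ax.set_title("Sentiment Analysis of Reviews for Company XYZ")
    return ax, percentages


ax, percentages = plot_sentiment_bars(sample)

if __name__ == "__main__":
    plt.show()
```

Fixing the label order with `reindex` keeps the bars and colors in the same position across runs, which helps when comparing charts produced from different scrapes.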
Conclusion
In this article, we’ve seen how to use Python with the BeautifulSoup and NLTK libraries to build a review scraper and analyze the sentiment of online reviews. Together, these libraries let us label each review as positive, negative, or neutral and visualize the overall distribution.
By employing similar techniques, businesses can actively monitor customer feedback and make informed decisions to improve the overall customer experience. The combination of web scraping and sentiment analysis is a powerful tool for online reputation monitoring and customer relationship management.
Original article: Building a Review Scraper with Python using BeautifulSoup and Sentiment Analysis with NLTK