Natural Language Processing
spaCy Workbook: Natural Language Processing Analysis
This notebook demonstrates various natural language processing techniques using the spaCy library. We'll explore text analysis, entity recognition, and visualization of linguistic features.
Table of Contents
- Setup and Installation
- Text Processing
- Entity Recognition
- Visualization
- Analysis Results
Setup and Installation
# Import the UCI ML Repository interface
from ucimlrepo import fetch_ucirepo
# Fetch the Drug Reviews dataset from druglib.com (Dataset ID: 461)
drug_reviews_druglib_com = fetch_ucirepo(id=461)
# Split the data into features (X) and targets (y)
# X contains the drug name, rating, condition, and the three review text fields
# The metadata lists no target column ('target_col': None), so y is expected to be None
X = drug_reviews_druglib_com.data.features
y = drug_reviews_druglib_com.data.targets
Data Loading and Exploration
In this section, we load the Drug Reviews dataset from the UCI Machine Learning Repository. This dataset contains user reviews of various medications, along with their ratings and other metadata. We'll use this data to perform natural language processing analysis using SpaCy.
# metadata
print(drug_reviews_druglib_com.metadata)
# variable information
print(drug_reviews_druglib_com.variables)
# print the first few rows of the dataset
print(X.head())
{'uci_id': 461, 'name': 'Drug Reviews (Druglib.com)', 'repository_url': 'https://archive.ics.uci.edu/dataset/461/drug+review+dataset+druglib+com', 'data_url': 'https://archive.ics.uci.edu/static/public/461/data.csv', 'abstract': 'The dataset provides patient reviews on specific drugs along with related conditions. Reviews and ratings are grouped into reports on the three aspects benefits, side effects and overall comment.', 'area': 'Health and Medicine', 'tasks': ['Classification', 'Regression', 'Clustering'], 'characteristics': ['Multivariate', 'Text'], 'num_instances': 4143, 'num_features': 8, 'feature_types': ['Integer'], 'demographics': [], 'target_col': None, 'index_col': ['reviewID'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2018, 'last_updated': 'Wed Apr 03 2024', 'dataset_doi': '10.24432/C55G6J', 'creators': ['Surya Kallumadi', 'Felix Grer'], 'intro_paper': {'ID': 457, 'type': 'NATIVE', 'title': 'Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning', 'authors': 'F. Gräßer, Surya Kallumadi, H. Malberg, S. Zaunseder', 'venue': 'Digital Humanities Conference', 'year': 2018, 'journal': None, 'DOI': '10.1145/3194658.3194677', 'URL': 'https://www.semanticscholar.org/paper/4d7c25fe6131a79dfec9f45b70b270f400ac2b4f', 'sha': None, 'corpus': None, 'arxiv': None, 'mag': None, 'acl': None, 'pmid': None, 'pmcid': None}, 'additional_info': {'summary': "The dataset provides patient reviews on specific drugs along with related conditions. Furthermore, reviews are grouped into reports on the three aspects benefits, side effects and overall comment. Additionally, ratings are available concerning overall satisfaction as well as a 5 step side effect rating and a 5 step effectiveness rating. The data was obtained by crawling online pharmaceutical review sites. The intention was to study \r\n\r\n(1) sentiment analysis of drug experience over multiple facets, i.e. 
sentiments learned on specific aspects such as effectiveness and side effects,\r\n(2) the transferability of models among domains, i.e. conditions, and \r\n(3) the transferability of models among different data sources (see 'Drug Review Dataset (Drugs.com)').\r\n\r\nThe data is split into a train (75%) a test (25%) partition (see publication) and stored in two .tsv (tab-separated-values) files, respectively.\r\n\r\nImportant notes:\r\n\r\nWhen using this dataset, you agree that you\r\n1) only use the data for research purposes\r\n2) don't use the data for any commerical purposes\r\n3) don't distribute the data to anyone else\r\n4) cite us", 'purpose': None, 'funded_by': None, 'instances_represent': None, 'recommended_data_splits': None, 'sensitive_data': None, 'preprocessing_description': None, 'variable_info': '1. urlDrugName (categorical): name of drug\r\n2. condition (categorical): name of condition\r\n3. benefitsReview (text): patient on benefits\r\n4. sideEffectsReview (text): patient on side effects\r\n5. commentsReview (text): overall patient comment\r\n6. rating (numerical): 10 star patient rating\r\n7. sideEffects (categorical): 5 step side effect rating\r\n8. 
effectiveness (categorical): 5 step effectiveness rating', 'citation': None}}

   name               role     type         demographic  description  units  missing_values
0  reviewID           ID       Integer      None         None         None   no
1  urlDrugName        Feature  Categorical  None         None         None   no
2  rating             Feature  Integer      None         None         None   no
3  effectiveness      Feature  Categorical  None         None         None   no
4  sideEffects        Feature  Categorical  None         None         None   no
5  condition          Feature  Categorical  None         None         None   no
6  benefitsReview     Feature  Categorical  None         None         None   no
7  sideEffectsReview  Feature  Categorical  None         None         None   no
8  commentsReview     Feature  Categorical  None         None         None   no

   urlDrugName       rating  effectiveness         sideEffects
0  enalapril         4       Highly Effective      Mild Side Effects
1  ortho-tri-cyclen  1       Highly Effective      Severe Side Effects
2  ponstel           10      Highly Effective      No Side Effects
3  prilosec          3       Marginally Effective  Mild Side Effects
4  lyrica            2       Marginally Effective  Severe Side Effects

   condition
0  management of congestive heart failure
1  birth prevention
2  menstrual cramps
3  acid reflux
4  fibromyalgia

   benefitsReview
0  slowed the progression of left ventricular dys...
1  Although this type of birth control has more c...
2  I was used to having cramps so badly that they...
3  The acid reflux went away for a few months aft...
4  I think that the Lyrica was starting to help w...

   sideEffectsReview
0  cough, hypotension , proteinuria, impotence , ...
1  Heavy Cycle, Cramps, Hot Flashes, Fatigue, Lon...
2  Heavier bleeding and clotting than normal.
3  Constipation, dry mouth and some mild dizzines...
4  I felt extremely drugged and dopey. Could not...

   commentsReview
0  monitor blood pressure , weight and asses for ...
1  I Hate This Birth Control, I Would Not Suggest...
2  I took 2 pills at the onset of my menstrual cr...
3  I was given Prilosec prescription at a dose of...
4  See above
import numpy as np  # needed for np.random.seed in the configuration cell
import pandas as pd
import spacy
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
import seaborn as sns
import matplotlib.pyplot as plt
# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
# Load spaCy model
nlp = spacy.load("en_core_web_sm")
[nltk_data] Downloading package punkt to
[nltk_data] C:\Users\ubuntu\AppData\Roaming\nltk_data...
[nltk_data] Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data] C:\Users\ubuntu\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\stopwords.zip.
# Configuration Parameters
CONFIG = {
    # Visualization settings
    'viz': {
        'figure_size': (12, 8),
        'style': 'seaborn',
        'palette': 'viridis',
        'title_fontsize': 14,
        'label_fontsize': 12
    },
    # Analysis settings
    'analysis': {
        'min_review_length': 10,  # Minimum characters for valid review
        'top_n_drugs': 10,        # Number of top drugs to show in plots
        'random_seed': 42         # For reproducibility
    },
    # Text processing settings
    'text': {
        'min_word_length': 3,       # Minimum length for valid words
        'remove_numbers': True,     # Whether to remove numbers from text
        'remove_punctuation': True  # Whether to remove punctuation
    }
}
# Set random seed for reproducibility
np.random.seed(CONFIG['analysis']['random_seed'])
Library Imports and Setup
We'll use several Python libraries for our analysis:
- spaCy: For advanced NLP tasks
- TextBlob: For sentiment analysis
- NLTK: For text preprocessing
- scikit-learn: For text vectorization
- seaborn & matplotlib: For visualization
The import cell above loads these libraries and downloads the required NLTK data.
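The notebook loads `en_core_web_sm` but does not show spaCy in use. As a minimal sketch of rule-based entity recognition on a review, the example below uses a blank English pipeline with an `entity_ruler`, so it runs without a model download. The `DRUG` label, the pattern list, and the names `nlp_rules` and `extract_drugs` are hypothetical choices for illustration, not part of the original analysis.

```python
import spacy

# Blank English pipeline: tokenizer only, no trained model required.
nlp_rules = spacy.blank("en")

# The entity_ruler component tags spans that match hand-written patterns.
ruler = nlp_rules.add_pipe("entity_ruler")

# Hypothetical patterns: drug names taken from the sample rows shown earlier.
ruler.add_patterns([
    {"label": "DRUG", "pattern": [{"LOWER": "lyrica"}]},
    {"label": "DRUG", "pattern": [{"LOWER": "prilosec"}]},
])

def extract_drugs(text):
    """Return (text, label) pairs for every entity the ruler finds."""
    doc = nlp_rules(text)
    return [(ent.text, ent.label_) for ent in doc.ents]

sample = "I was given Prilosec first and later switched to Lyrica."
print(extract_drugs(sample))
```

Matching on `LOWER` keeps the rule case-insensitive, which matters for reviews that write drug names inconsistently. A trained pipeline such as `en_core_web_sm` would add general-purpose labels on top of rules like these.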
# Data Preprocessing
print("Starting text preprocessing...")
try:
    # Load the dataset
    df = X.copy()  # Create a copy of the original data
    # Combine different review fields into a single text field
    print("Combining review fields...")
    df['full_review'] = df['benefitsReview'].fillna('') + ' ' + \
        df['sideEffectsReview'].fillna('') + ' ' + \
        df['commentsReview'].fillna('')
    # Text Cleaning
    print("Cleaning text...")
    stop_words = set(stopwords.words('english'))
    def clean_text(text):
        """
        Clean and preprocess text data:
        1. Convert to lowercase
        2. Tokenize
        3. Remove stopwords and non-alphabetic tokens
        Args:
            text (str): Input text to clean
        Returns:
            str: Cleaned text
        """
        if not isinstance(text, str):
            return ""
        try:
            tokens = word_tokenize(text.lower())
            tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
            return ' '.join(tokens)
        except Exception as e:
            print(f"Error processing text: {e}")
            return ""
    df['clean_review'] = df['full_review'].apply(clean_text)
    # Sentiment Analysis
    print("Performing sentiment analysis...")
    def analyze_sentiment(text):
        """
        Calculate sentiment polarity score using TextBlob.
        Args:
            text (str): Input text to analyze
        Returns:
            float: Sentiment score between -1 (negative) and 1 (positive)
        """
        try:
            blob = TextBlob(text)
            return blob.sentiment.polarity
        except Exception as e:
            print(f"Error analyzing sentiment: {e}")
            return 0.0
    df['sentiment_score'] = df['clean_review'].apply(analyze_sentiment)
    print("Preprocessing complete!")
except Exception as e:
    print(f"Error during preprocessing: {e}")
    raise
Top Sentiment by Drug:
urlDrugName
norpramin    0.444444
combipatch   0.410250
amerge       0.369231
proair-hfa   0.355000
gonal-f-rff  0.343333
Name: sentiment_score, dtype: float64

Themes by Condition:
theme                       cost  mental  other  pain  sleep
condition
1mg                         0.0   0.0     1.0    0.0   0.0
2                           0.0   0.0     1.0    0.0   0.0
2 broken arms               0.0   0.0     1.0    0.0   0.0
2 compressed discs in neck  0.0   0.0     0.0    1.0   0.0
20 year pack a day smoker   0.0   0.0     0.0    0.0   1.0
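The output above includes a theme breakdown, but the cell that assigns themes is not part of this export. The following is only a sketch of one way such a column could be produced, assuming simple keyword matching. The keyword lists and the names `THEME_KEYWORDS`, `assign_theme`, and `theme_shares` are hypothetical; only the theme labels (cost, mental, other, pain, sleep) come from the output.

```python
from collections import Counter, defaultdict

# Hypothetical keyword lists; the notebook's actual theme rules are not shown.
THEME_KEYWORDS = {
    "cost": {"cost", "price", "expensive", "insurance"},
    "mental": {"anxiety", "depression", "mood"},
    "pain": {"pain", "ache", "cramps"},
    "sleep": {"sleep", "insomnia", "drowsy"},
}

def assign_theme(clean_text):
    """Return the theme whose keywords occur most often, or 'other'."""
    tokens = clean_text.split()
    hits = {
        theme: sum(token in keywords for token in tokens)
        for theme, keywords in THEME_KEYWORDS.items()
    }
    best_theme, best_count = max(hits.items(), key=lambda item: item[1])
    return best_theme if best_count > 0 else "other"

def theme_shares(rows):
    """Map each condition to the share of its reviews assigned to each theme."""
    counts = defaultdict(Counter)
    for condition, clean_text in rows:
        counts[condition][assign_theme(clean_text)] += 1
    return {
        condition: {theme: n / sum(c.values()) for theme, n in c.items()}
        for condition, c in counts.items()
    }

# Toy rows standing in for (condition, clean_review) pairs.
rows = [
    ("fibromyalgia", "pain returned felt drugged"),
    ("fibromyalgia", "could not sleep insomnia worse"),
    ("acid reflux", "reflux went away months"),
]
shares = theme_shares(rows)
print(shares)
```

With the notebook's DataFrame, the same idea might be applied as `df['theme'] = df['clean_review'].apply(assign_theme)`, followed by `pd.crosstab(df['condition'], df['theme'], normalize='index')` to obtain a row-normalized matrix like `themes_by_condition`.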
# Data Quality Checks
print("\nPerforming data quality checks...")
# Check for missing values
print("\nMissing values summary:")
print(df.isnull().sum())
# Check sentiment distribution
print("\nSentiment score distribution:")
print(df['sentiment_score'].describe())
# Check review lengths
df['review_length'] = df['full_review'].str.len()
print("\nReview length statistics:")
print(df['review_length'].describe())
# Create a boxplot of review lengths
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, y='review_length')
plt.title('Distribution of Review Lengths')
plt.ylabel('Number of Characters')
plt.show()
Text Processing and Sentiment Analysis
In this section, we perform several text processing steps:
- Data Integration: Combine different types of reviews (benefits, side effects, and comments) into a single text field
- Text Cleaning:
- Convert text to lowercase
- Remove stopwords
- Keep only alphabetic tokens
- Sentiment Analysis: Calculate sentiment scores using TextBlob
- Scores range from -1 (negative) to 1 (positive)
- This helps us understand the emotional tone of reviews
The processed data will be used for further analysis and visualization.
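The cleaning steps listed above can be sketched without NLTK. The version below swaps in a regular-expression tokenizer and a tiny hand-picked stopword set (both assumptions, not the notebook's `word_tokenize` and NLTK stopword list) to show what each step removes. The name `clean_text_sketch` is hypothetical.

```python
import re

# Tiny illustrative stopword set; NLTK's English list is much longer.
STOPWORDS = {"i", "the", "was", "and", "not", "to", "a", "of", "did"}

def clean_text_sketch(text):
    """Lowercase, tokenize on letter runs, and drop stopwords."""
    if not isinstance(text, str):
        return ""
    # Letter-only tokens: digits and punctuation never make it through.
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

cleaned = clean_text_sketch("The pain did NOT improve after 2 weeks.")
print(cleaned)
```

The negation disappears here: "did NOT improve" is reduced to "improve". NLTK's English stopword list also contains "not", so scoring `full_review` rather than `clean_review` may preserve more of the cues TextBlob relies on. That is a suggestion to test, not something the notebook evaluates.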
# Prepare data for visualization
print("Preparing visualization data...")
# Calculate average sentiment by drug
sentiment_by_drug = df.groupby('urlDrugName')['sentiment_score'].mean().sort_values(ascending=False)
# Set up the plotting style
sns.set_theme()  # replaces plt.style.use('seaborn'), a style name that recent Matplotlib releases no longer provide
plt.figure(figsize=(12, 6))
Data Visualization
This section presents visual analyses of our drug review data:
- Top Drugs by Sentiment: Bar chart showing the drugs with the highest average sentiment scores
- Sentiment Distribution: Distribution of sentiment scores across all reviews
- Key Insights: Visual representation of the most important findings
The visualizations help us understand patterns and trends in the drug review data.
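A ranking by mean sentiment can be dominated by drugs that have only one or two reviews. As a hedged sketch, the helper below ranks drugs by mean score only after requiring a minimum number of reviews. The `min_reviews` parameter, the function name, and the toy scores are hypothetical, and the notebook itself applies no such filter.

```python
from collections import defaultdict
from statistics import mean

def top_drugs_by_sentiment(records, top_n=10, min_reviews=2):
    """Rank drugs by mean sentiment, skipping drugs with too few reviews."""
    scores = defaultdict(list)
    for drug, score in records:
        scores[drug].append(score)
    averages = [
        (drug, mean(values))
        for drug, values in scores.items()
        if len(values) >= min_reviews
    ]
    # Highest mean first.
    averages.sort(key=lambda pair: pair[1], reverse=True)
    return averages[:top_n]

# Toy (drug, sentiment_score) pairs; not taken from the dataset.
records = [
    ("lyrica", 0.5), ("lyrica", -0.25),
    ("prilosec", 0.25), ("prilosec", 0.25),
    ("ponstel", 1.0),  # single review, filtered out
]
ranking = top_drugs_by_sentiment(records, top_n=2)
print(ranking)
```

In pandas, a comparable filter could start from `df.groupby('urlDrugName')['sentiment_score'].agg(['mean', 'count'])` and keep only rows whose `count` meets the chosen threshold before sorting by `mean`.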
# Create visualization for top drugs by sentiment
plt.figure(figsize=(12, 8))
top_sentiment = sentiment_by_drug.head(10) # Top 10 drugs
# Create bar plot
ax = sns.barplot(x=top_sentiment.values, y=top_sentiment.index, palette='viridis')
# Customize the plot
plt.title("Top 10 Drugs by Average Sentiment Score", fontsize=14, pad=20)
plt.xlabel("Average Sentiment Score", fontsize=12)
plt.ylabel("Drug Name", fontsize=12)
# Add value labels on the bars
for i, v in enumerate(top_sentiment.values):
    ax.text(v, i, f'{v:.3f}', va='center', fontsize=10)
plt.tight_layout()
plt.show()
# Heatmap of themes by condition
# plt.figure(figsize=(10, 6))
# sns.heatmap(themes_by_condition, cmap="YlGnBu", annot=True, fmt=".0f")
# plt.title("Themes by Condition")
# plt.xlabel("Theme")
# plt.ylabel("Condition")
# plt.tight_layout()
# plt.show()
# ===========================================================================================
# Heatmap of themes by condition for top N conditions
# ===========================================================================================
# Set N for top conditions
N = 15 # You can change this value
# Find top N conditions by number of reviews
top_conditions = df['condition'].value_counts().head(N).index
# Filter theme data to only those top conditions
filtered_themes = themes_by_condition.loc[themes_by_condition.index.isin(top_conditions)]
plt.figure(figsize=(10, 6))
sns.heatmap(filtered_themes, cmap="YlGnBu", annot=True, fmt=".0f")
plt.title(f"Themes by Top {N} Conditions")
plt.xlabel("Theme")
plt.ylabel("Condition")
plt.tight_layout()
plt.show()
# Group low-frequency conditions
condition_counts = df['condition'].value_counts()
threshold = 15 # you can adjust this
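# Map each row's condition to its review count so the mask aligns with df's index
df['condition_grouped'] = df['condition'].where(df['condition'].map(condition_counts) > threshold, 'Other')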
# Recalculate theme matrix
grouped_themes = df.groupby('condition_grouped')['theme'].value_counts().unstack().fillna(0)
# Plot
plt.figure(figsize=(10, 6))
sns.heatmap(grouped_themes, cmap="coolwarm", annot=True, fmt=".0f")
plt.title("Themes by Grouped Condition")
plt.xlabel("Theme")
plt.ylabel("Grouped Condition")
plt.tight_layout()
plt.show()
theme_counts = df['theme'].value_counts()
sns.barplot(x=theme_counts.index, y=theme_counts.values)
plt.title("Distribution of Themes Across All Conditions")
plt.xlabel("Theme")
plt.ylabel("Count")
plt.tight_layout()
plt.show()
from wordcloud import WordCloud
# Combine all cleaned text into one string
all_text = ' '.join(df['clean_review'].dropna())
# Generate Word Cloud
wordcloud = WordCloud(
    width=1000,
    height=500,
    background_color='white',
    colormap='viridis',
    max_words=25
).generate(all_text)
# Display the image
plt.figure(figsize=(15, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Most Common Words in Drug Reviews", fontsize=18)
plt.tight_layout()
plt.show()
# Save for use on your website
wordcloud.to_file("project_wordcloud.png")
from wordcloud import WordCloud
import os
# Ensure output folder exists
os.makedirs("wordclouds/themes", exist_ok=True)
themes_list = df['theme'].unique()
for theme in themes_list:
    text = ' '.join(df[df['theme'] == theme]['clean_review'].dropna())
    if len(text.strip()) < 10:
        continue  # Skip themes with almost no text
    wc = WordCloud(
        width=1000,
        height=500,
        background_color='white',
        colormap='plasma',
        max_words=25  # Limit to the top 25 words
    ).generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(f"Top 25 Words – Theme: {theme}", fontsize=16)
    plt.tight_layout()
    plt.show()
    wc.to_file(f"wordclouds/themes/wordcloud_{theme}.png")
os.makedirs("wordclouds/drugs", exist_ok=True)
top_drugs = df['urlDrugName'].value_counts().head(5).index
for drug in top_drugs:
    text = ' '.join(df[df['urlDrugName'] == drug]['clean_review'].dropna())
    if len(text.strip()) < 10:
        continue  # Skip drugs with almost no text
    wc = WordCloud(
        width=1000,
        height=500,
        background_color='white',
        colormap='cool',
        max_words=25  # Limit to the top 25 words
    ).generate(text)
    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(f"Top 25 Words – Drug: {drug}", fontsize=16)
    plt.tight_layout()
    plt.show()
    wc.to_file(f"wordclouds/drugs/wordcloud_{drug.replace('/', '_')}.png")
Conclusion
This notebook has demonstrated several key aspects of natural language processing with drug reviews:
- Data Processing: We processed and cleaned the text data from drug reviews
- Sentiment Analysis: We analyzed the emotional content of reviews using TextBlob
- Visualization: We created visual representations of our findings
- Key Insights: Average TextBlob polarity varied by drug; in the output shown earlier, the five highest-scoring drugs averaged between 0.34 and 0.44
Next Steps
- Further analysis of specific drug categories
- More detailed sentiment analysis by condition
- Topic modeling to identify common themes in reviews