SpaCy Workbook: Natural Language Processing Analysis

This notebook applies natural language processing techniques to patient drug reviews using spaCy, NLTK, and TextBlob. We'll cover text cleaning, sentiment analysis, and visualization of the results.

Setup and Installation
Data Loading and Exploration
Requirements
Library Imports and Setup
Text Processing and Sentiment Analysis
Data Visualization
Conclusion
Next Steps

Setup and Installation

In [ ]:

# Import the UCI ML Repository interface
from ucimlrepo import fetch_ucirepo 

# Fetch the Drug Reviews dataset from druglib.com (Dataset ID: 461)
drug_reviews_druglib_com = fetch_ucirepo(id=461) 

# Split the data into features (X) and targets (y)
# X contains the review text and other features
# y contains the rating/sentiment information
X = drug_reviews_druglib_com.data.features 
y = drug_reviews_druglib_com.data.targets

Data Loading and Exploration

In this section, we load the Drug Reviews dataset from the UCI Machine Learning Repository. This dataset contains user reviews of various medications, along with their ratings and other metadata. We'll use this data to perform natural language processing analysis using SpaCy.

In [32]:

# metadata 
print(drug_reviews_druglib_com.metadata) 
  
# variable information 
print(drug_reviews_druglib_com.variables) 

# print the first few rows of the dataset
print(X.head())

{'uci_id': 461, 'name': 'Drug Reviews (Druglib.com)', 'repository_url': 'https://archive.ics.uci.edu/dataset/461/drug+review+dataset+druglib+com', 'data_url': 'https://archive.ics.uci.edu/static/public/461/data.csv', 'abstract': 'The dataset provides patient reviews on specific drugs along with related conditions. Reviews and ratings are grouped into reports on the three aspects benefits, side effects and overall comment.', 'area': 'Health and Medicine', 'tasks': ['Classification', 'Regression', 'Clustering'], 'characteristics': ['Multivariate', 'Text'], 'num_instances': 4143, 'num_features': 8, 'feature_types': ['Integer'], 'demographics': [], 'target_col': None, 'index_col': ['reviewID'], 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2018, 'last_updated': 'Wed Apr 03 2024', 'dataset_doi': '10.24432/C55G6J', 'creators': ['Surya Kallumadi', 'Felix Grer'], 'intro_paper': {'ID': 457, 'type': 'NATIVE', 'title': 'Aspect-Based Sentiment Analysis of Drug Reviews Applying Cross-Domain and Cross-Data Learning', 'authors': 'F. Gräßer, Surya Kallumadi, H. Malberg, S. Zaunseder', 'venue': 'Digital Humanities Conference', 'year': 2018, 'journal': None, 'DOI': '10.1145/3194658.3194677', 'URL': 'https://www.semanticscholar.org/paper/4d7c25fe6131a79dfec9f45b70b270f400ac2b4f', 'sha': None, 'corpus': None, 'arxiv': None, 'mag': None, 'acl': None, 'pmid': None, 'pmcid': None}, 'additional_info': {'summary': "The dataset provides patient reviews on specific drugs along with related conditions. Furthermore, reviews are grouped into reports on the three aspects benefits, side effects and overall comment. Additionally, ratings are available concerning overall satisfaction as well as a 5 step side effect rating and a 5 step effectiveness rating. The data was obtained by crawling online pharmaceutical review sites. The intention was to study \r\n\r\n(1) sentiment analysis of drug experience over multiple facets, i.e. 
sentiments learned on specific aspects such as effectiveness and side effects,\r\n(2) the transferability of models among domains, i.e. conditions, and \r\n(3) the transferability of models among different data sources (see 'Drug Review Dataset (Drugs.com)').\r\n\r\nThe data is split into a train (75%) a test (25%) partition (see publication) and stored in two .tsv (tab-separated-values) files, respectively.\r\n\r\nImportant notes:\r\n\r\nWhen using this dataset, you agree that you\r\n1) only use the data for research purposes\r\n2) don't use the data for any commerical purposes\r\n3) don't distribute the data to anyone else\r\n4) cite us", 'purpose': None, 'funded_by': None, 'instances_represent': None, 'recommended_data_splits': None, 'sensitive_data': None, 'preprocessing_description': None, 'variable_info': '1. urlDrugName (categorical): name of drug\r\n2. condition (categorical): name of condition\r\n3. benefitsReview (text): patient on benefits\r\n4. sideEffectsReview (text): patient on side effects\r\n5. commentsReview (text): overall patient comment\r\n6. rating (numerical): 10 star patient rating\r\n7. sideEffects (categorical): 5 step side effect rating\r\n8. effectiveness (categorical): 5 step effectiveness rating', 'citation': None}}
name role type demographic description units \
0 reviewID ID Integer None None None
1 urlDrugName Feature Categorical None None None
2 rating Feature Integer None None None
3 effectiveness Feature Categorical None None None
4 sideEffects Feature Categorical None None None
5 condition Feature Categorical None None None
6 benefitsReview Feature Categorical None None None
7 sideEffectsReview Feature Categorical None None None
8 commentsReview Feature Categorical None None None

missing_values
0 no
1 no
2 no
3 no
4 no
5 no
6 no
7 no
8 no
urlDrugName rating effectiveness sideEffects \
0 enalapril 4 Highly Effective Mild Side Effects
1 ortho-tri-cyclen 1 Highly Effective Severe Side Effects
2 ponstel 10 Highly Effective No Side Effects
3 prilosec 3 Marginally Effective Mild Side Effects
4 lyrica 2 Marginally Effective Severe Side Effects

condition \
0 management of congestive heart failure
1 birth prevention
2 menstrual cramps
3 acid reflux
4 fibromyalgia

benefitsReview \
0 slowed the progression of left ventricular dys...
1 Although this type of birth control has more c...
2 I was used to having cramps so badly that they...
3 The acid reflux went away for a few months aft...
4 I think that the Lyrica was starting to help w...

sideEffectsReview \
0 cough, hypotension , proteinuria, impotence , ...
1 Heavy Cycle, Cramps, Hot Flashes, Fatigue, Lon...
2 Heavier bleeding and clotting than normal.
3 Constipation, dry mouth and some mild dizzines...
4 I felt extremely drugged and dopey. Could not...

commentsReview
0 monitor blood pressure , weight and asses for ...
1 I Hate This Birth Control, I Would Not Suggest...
2 I took 2 pills at the onset of my menstrual cr...
3 I was given Prilosec prescription at a dose of...
4 See above

In [ ]:

import pandas as pd
import spacy
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import nltk
import seaborn as sns
import matplotlib.pyplot as plt

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\ubuntu\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ubuntu\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.

In [ ]:

# Configuration Parameters
CONFIG = {
    # Visualization settings
    'viz': {
        'figure_size': (12, 8),
        'style': 'seaborn',
        'palette': 'viridis',
        'title_fontsize': 14,
        'label_fontsize': 12
    },
    
    # Analysis settings
    'analysis': {
        'min_review_length': 10,  # Minimum characters for valid review
        'top_n_drugs': 10,        # Number of top drugs to show in plots
        'random_seed': 42         # For reproducibility
    },
    
    # Text processing settings
    'text': {
        'min_word_length': 3,     # Minimum length for valid words
        'remove_numbers': True,    # Whether to remove numbers from text
        'remove_punctuation': True # Whether to remove punctuation
    }
}

# Set random seed for reproducibility
import numpy as np  # required for np.random.seed; NumPy is not imported in the earlier cells
np.random.seed(CONFIG['analysis']['random_seed'])

Requirements

To run this notebook, you need the following Python packages:

spacy>=3.0.0
textblob>=0.15.3
nltk>=3.6.0
pandas>=1.2.0
scikit-learn>=0.24.0
seaborn>=0.11.0
matplotlib>=3.3.0
numpy
ucimlrepo
wordcloud

You'll also need to download the English language model for spaCy:

python -m spacy download en_core_web_sm
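
Before running the notebook, it can help to confirm which of these packages are actually installed. The following is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is illustrative and not part of the original notebook, and the package list is copied from the requirements above.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the requirements list above.
REQUIRED = ["spacy", "textblob", "nltk", "pandas",
            "scikit-learn", "seaborn", "matplotlib"]

def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

# Print a short report; missing packages are flagged rather than raising.
for name, ver in installed_versions(REQUIRED).items():
    print(f"{name:15s} {ver or 'NOT INSTALLED'}")
```

This only reports what is installed. It does not compare against the minimum versions listed above, and it does not check that the `en_core_web_sm` model has been downloaded.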

Library Imports and Setup

We'll use several Python libraries for our analysis:

spaCy: For advanced NLP tasks
TextBlob: For sentiment analysis
NLTK: For text preprocessing
scikit-learn: For text vectorization
seaborn & matplotlib: For visualization

The import cell loads these libraries and downloads the required NLTK data.

In [ ]:

# Data Preprocessing
print("Starting text preprocessing...")

try:
    # Load the dataset
    df = X.copy()  # Create a copy of the original data
    
    # Combine different review fields into a single text field
    print("Combining review fields...")
    df['full_review'] = df['benefitsReview'].fillna('') + ' ' + \
                        df['sideEffectsReview'].fillna('') + ' ' + \
                        df['commentsReview'].fillna('')
    
    # Text Cleaning
    print("Cleaning text...")
    stop_words = set(stopwords.words('english'))
    
    def clean_text(text):
        """
        Clean and preprocess text data:
        1. Convert to lowercase
        2. Tokenize
        3. Remove stopwords and non-alphabetic tokens
        
        Args:
            text (str): Input text to clean
            
        Returns:
            str: Cleaned text
        """
        if not isinstance(text, str):
            return ""
        try:
            tokens = word_tokenize(text.lower())
            tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
            return ' '.join(tokens)
        except Exception as e:
            print(f"Error processing text: {e}")
            return ""
    
    df['clean_review'] = df['full_review'].apply(clean_text)
    
    # Sentiment Analysis
    print("Performing sentiment analysis...")
    def analyze_sentiment(text):
        """
        Calculate sentiment polarity score using TextBlob.
        
        Args:
            text (str): Input text to analyze
            
        Returns:
            float: Sentiment score between -1 (negative) and 1 (positive)
        """
        try:
            blob = TextBlob(text)
            return blob.sentiment.polarity
        except Exception as e:
            print(f"Error analyzing sentiment: {e}")
            return 0.0
    
    df['sentiment_score'] = df['clean_review'].apply(analyze_sentiment)
    print("Preprocessing complete!")
    
except Exception as e:
    print(f"Error during preprocessing: {e}")
    raise

Top Sentiment by Drug:
urlDrugName
norpramin      0.444444
combipatch     0.410250
amerge         0.369231
proair-hfa     0.355000
gonal-f-rff    0.343333
Name: sentiment_score, dtype: float64

Themes by Condition:
theme                       cost  mental  other  pain  sleep
condition                                                   
1mg                          0.0     0.0    1.0   0.0    0.0
2                            0.0     0.0    1.0   0.0    0.0
2 broken arms                0.0     0.0    1.0   0.0    0.0
2 compressed discs in neck   0.0     0.0    0.0   1.0    0.0
20 year pack a day smoker    0.0     0.0    0.0   0.0    1.0

In [ ]:

# Data Quality Checks
print("\nPerforming data quality checks...")

# Check for missing values
print("\nMissing values summary:")
print(df.isnull().sum())

# Check sentiment distribution
print("\nSentiment score distribution:")
print(df['sentiment_score'].describe())

# Check review lengths
df['review_length'] = df['full_review'].str.len()
print("\nReview length statistics:")
print(df['review_length'].describe())

# Create a boxplot of review lengths
plt.figure(figsize=(10, 6))
sns.boxplot(data=df, y='review_length')
plt.title('Distribution of Review Lengths')
plt.ylabel('Number of Characters')
plt.show()

Text Processing and Sentiment Analysis

In this section, we perform several text processing steps:

Data Integration: Combine different types of reviews (benefits, side effects, and comments) into a single text field
Text Cleaning:
- Convert text to lowercase
- Remove stopwords
- Keep only alphabetic tokens
Sentiment Analysis: Calculate sentiment scores using TextBlob
- Scores range from -1 (negative) to 1 (positive)
- This helps us understand the emotional tone of reviews

The processed data will be used for further analysis and visualization.
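
To make the cleaning steps concrete, here is a small standalone sketch of the same idea. It is an approximation under stated assumptions: it uses a regular expression instead of NLTK's `word_tokenize`, and `STOP_WORDS` is a tiny hand-picked stand-in for NLTK's English stopword list (which is much longer and, like this one, includes "not"). The name `clean_text_sketch` is illustrative.

```python
import re

# Tiny stand-in for NLTK's English stopword list (assumption: the real list
# is far longer; like this one, it contains negations such as "not").
STOP_WORDS = {"i", "the", "a", "was", "is", "it", "and", "not", "this", "for", "my"}

def clean_text_sketch(text):
    """Lowercase, keep runs of letters as tokens, and drop stopwords."""
    if not isinstance(text, str):
        return ""
    tokens = re.findall(r"[a-z]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)

print(clean_text_sketch("The drug was NOT effective for my pain."))
print(clean_text_sketch("I took 2 pills; it was effective!"))
```

The first example reduces to "drug effective pain": the negation is removed along with the other stopwords. Because the notebook scores sentiment on the cleaned text, polarity for reviews like this one may read more positive than the original wording. Scoring `full_review` instead and comparing the two is one way to check how much this matters.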

In [ ]:

# Prepare data for visualization
print("Preparing visualization data...")

# Calculate average sentiment by drug
sentiment_by_drug = df.groupby('urlDrugName')['sentiment_score'].mean().sort_values(ascending=False)

# Set up the plotting style (the 'seaborn' matplotlib style name is unavailable in recent matplotlib releases)
sns.set_theme()
plt.figure(figsize=(12, 6))

Data Visualization

This section presents visual analyses of our drug review data:

Top Drugs by Sentiment: Bar chart showing the drugs with the highest average sentiment scores
Sentiment Distribution: Distribution of sentiment scores across all reviews
Key Insights: Visual representation of the most important findings

The visualizations help us understand patterns and trends in the drug review data.
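
The notebook does not include a cell for the sentiment distribution itself. As a rough sketch of what that view summarizes, the function below counts polarity scores in equal-width bins over [-1, 1]. The function name `bin_scores` and the sample scores are made up for illustration.

```python
from collections import Counter

def bin_scores(scores, n_bins=4):
    """Count scores in equal-width bins covering the range [-1, 1]."""
    width = 2.0 / n_bins
    counts = Counter()
    for s in scores:
        # Clamp the index so that -1.0 maps to the first bin and 1.0 to the last.
        idx = max(0, min(int((s + 1.0) / width), n_bins - 1))
        counts[idx] += 1
    return [counts.get(i, 0) for i in range(n_bins)]

# Made-up polarity scores, for illustration only.
sample = [-0.8, -0.1, 0.0, 0.2, 0.35, 0.44, 1.0]
for i, count in enumerate(bin_scores(sample)):
    low = -1.0 + i * 0.5
    print(f"[{low:+.1f}, {low + 0.5:+.1f}) {'#' * count}")
```

In the notebook, a call such as `sns.histplot(df['sentiment_score'], bins=20)` would likely be the more natural way to draw the same distribution.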

In [ ]:

# Create visualization for top drugs by sentiment
plt.figure(figsize=(12, 8))
top_sentiment = sentiment_by_drug.head(10)  # Top 10 drugs

# Create bar plot
ax = sns.barplot(x=top_sentiment.values, y=top_sentiment.index, palette='viridis')

# Customize the plot
plt.title("Top 10 Drugs by Average Sentiment Score", fontsize=14, pad=20)
plt.xlabel("Average Sentiment Score", fontsize=12)
plt.ylabel("Drug Name", fontsize=12)

# Add value labels on the bars
for i, v in enumerate(top_sentiment.values):
    ax.text(v, i, f'{v:.3f}', va='center', fontsize=10)

plt.tight_layout()
plt.show()

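
The cells that follow rely on a `theme` column and a `themes_by_condition` table, but the cell that creates them is not shown in this notebook. The sketch below is one possible way to produce them using simple keyword matching. The theme names come from the output shown earlier (cost, mental, other, pain, sleep); the keyword lists and the function name `assign_theme` are hypothetical and are not the original assignment rule.

```python
# Hypothetical keyword lists. The theme names match the earlier output,
# but the original assignment rule is not included in this notebook.
THEME_KEYWORDS = {
    "pain": {"pain", "ache", "cramps", "migraine"},
    "sleep": {"sleep", "insomnia", "drowsy", "tired"},
    "mental": {"anxiety", "depression", "mood", "panic"},
    "cost": {"cost", "price", "expensive", "insurance"},
}

def assign_theme(clean_review):
    """Return the theme with the most keyword hits, or 'other' if none match."""
    tokens = clean_review.split()
    hits = {
        theme: sum(token in words for token in tokens)
        for theme, words in THEME_KEYWORDS.items()
    }
    best = max(hits, key=hits.get)  # ties go to the first theme in the dict
    return best if hits[best] > 0 else "other"

print(assign_theme("cramps went away pain gone"))
print(assign_theme("felt drugged dopey"))

# With the DataFrame from the preprocessing cell, the two objects could be built as:
# df['theme'] = df['clean_review'].apply(assign_theme)
# themes_by_condition = df.groupby('condition')['theme'].value_counts().unstack().fillna(0)
```

Each review gets exactly one theme here, which matches the one-hot rows in the "Themes by Condition" output, but the actual rule used to generate that output may differ.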

In [24]:

# Heatmap of themes by condition
# plt.figure(figsize=(10, 6))
# sns.heatmap(themes_by_condition, cmap="YlGnBu", annot=True, fmt=".0f")
# plt.title("Themes by Condition")
# plt.xlabel("Theme")
# plt.ylabel("Condition")
# plt.tight_layout()
# plt.show()

# ===========================================================================================
# Heatmap of themes by condition for top N conditions
# ===========================================================================================


# Set N for top conditions
N = 15  # You can change this value

# Find top N conditions by number of reviews
top_conditions = df['condition'].value_counts().head(N).index

# Filter theme data to only those top conditions
filtered_themes = themes_by_condition.loc[themes_by_condition.index.isin(top_conditions)]

plt.figure(figsize=(10, 6))
sns.heatmap(filtered_themes, cmap="YlGnBu", annot=True, fmt=".0f")
plt.title(f"Themes by Top {N} Conditions")
plt.xlabel("Theme")
plt.ylabel("Condition")
plt.tight_layout()
plt.show()

In [25]:

# Group low-frequency conditions
condition_counts = df['condition'].value_counts()
threshold = 15  # you can adjust this
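# Map each row's condition to its review count so the mask aligns with df's index
df['condition_grouped'] = df['condition'].where(df['condition'].map(condition_counts) > threshold, 'Other')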

# Recalculate theme matrix
grouped_themes = df.groupby('condition_grouped')['theme'].value_counts().unstack().fillna(0)

# Plot
plt.figure(figsize=(10, 6))
sns.heatmap(grouped_themes, cmap="coolwarm", annot=True, fmt=".0f")
plt.title("Themes by Grouped Condition")
plt.xlabel("Theme")
plt.ylabel("Grouped Condition")
plt.tight_layout()
plt.show()

In [26]:

theme_counts = df['theme'].value_counts()

sns.barplot(x=theme_counts.index, y=theme_counts.values)
plt.title("Distribution of Themes Across All Conditions")
plt.xlabel("Theme")
plt.ylabel("Count")
plt.tight_layout()
plt.show()

In [27]:

from wordcloud import WordCloud

# Combine all cleaned text into one string
all_text = ' '.join(df['clean_review'].dropna())

# Generate Word Cloud
wordcloud = WordCloud(
    width=1000,
    height=500,
    background_color='white',
    colormap='viridis',
    max_words=25
).generate(all_text)

# Display the image
plt.figure(figsize=(15, 7))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Most Common Words in Drug Reviews", fontsize=18)
plt.tight_layout()
plt.show()

# Save for use on your website
wordcloud.to_file("project_wordcloud.png")

Out[27]:

<wordcloud.wordcloud.WordCloud at 0x286f9b93ec0>

In [30]:

from wordcloud import WordCloud
import os

# Ensure output folder exists
os.makedirs("wordclouds/themes", exist_ok=True)

themes_list = df['theme'].unique()

for theme in themes_list:
    text = ' '.join(df[df['theme'] == theme]['clean_review'].dropna())
    if len(text.strip()) < 10:
        continue  # Skip short ones

    wc = WordCloud(
        width=1000,
        height=500,
        background_color='white',
        colormap='plasma',
        max_words=25  # limit the cloud to the 25 most frequent words
    ).generate(text)

    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(f"Top 25 Words – Theme: {theme}", fontsize=16)
    plt.tight_layout()
    plt.show()

    wc.to_file(f"wordclouds/themes/wordcloud_{theme}.png")

In [31]:

os.makedirs("wordclouds/drugs", exist_ok=True)

top_drugs = df['urlDrugName'].value_counts().head(5).index

for drug in top_drugs:
    text = ' '.join(df[df['urlDrugName'] == drug]['clean_review'].dropna())
    if len(text.strip()) < 10:
        continue

    wc = WordCloud(
        width=1000,
        height=500,
        background_color='white',
        colormap='cool',
        max_words=25  # limit the cloud to the 25 most frequent words
    ).generate(text)

    plt.figure(figsize=(10, 5))
    plt.imshow(wc, interpolation='bilinear')
    plt.axis('off')
    plt.title(f"Top 25 Words – Drug: {drug}", fontsize=16)
    plt.tight_layout()
    plt.show()

    wc.to_file(f"wordclouds/drugs/wordcloud_{drug.replace('/', '_')}.png")
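
The drug loop replaces `/` in file names, while the theme loop writes the raw label. A small helper can cover both cases and a few more characters that are problematic on some file systems. This is a sketch; `safe_filename` is an illustrative name, and the set of characters it replaces is an assumption rather than a complete list.

```python
import re

def safe_filename(name, replacement="_"):
    """Replace path separators, reserved characters, and whitespace in a label."""
    cleaned = re.sub(r'[\\/:*?"<>|\s]+', replacement, str(name)).strip(replacement)
    # Fall back to a fixed name if nothing usable is left.
    return cleaned or "unnamed"

print(safe_filename("ortho-tri-cyclen"))
print(safe_filename("a/b: c"))
```

It could then be used in either loop, for example `wc.to_file(f"wordclouds/drugs/wordcloud_{safe_filename(drug)}.png")`.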

Conclusion

This notebook has demonstrated several key aspects of natural language processing with drug reviews:

Data Processing: We processed and cleaned the text data from drug reviews
Sentiment Analysis: We analyzed the emotional content of reviews using TextBlob
Visualization: We created visual representations of our findings
Key Insights: Average sentiment varied by drug; in this run the highest averages were norpramin (0.44), combipatch (0.41), and amerge (0.37)

Next Steps

Further analysis of specific drug categories
More detailed sentiment analysis by condition
Topic modeling to identify common themes in reviews
