Word Clouds

Data Viz
EDA
Web Scraping
Word Cloud
Author

Tim Anderson

Published

May 24, 2020

Find Meaning From Comments / Reviews

Spending hours and hours reading through forums and other places where customers leave feedback is a common task for Product Managers. But sometimes there's just too much data to read word for word. It's useful to take a big slice of that data and throw it into some analysis that quickly parses it for general themes.

This algorithmic approach to text also helps set aside opinion and bias and get right into the data.

Let’s find some customer comments to parse:

For this example I'm going to point to a Best Buy webpage for HP's Instant Ink product. The code below reads in their page, grabs just the comments, and adds them to a data frame. It then repeats that process for the next 329 pages of comments, capturing about 6,600 customer comments in total. Imagine reading through all 6,600 comments manually, looking for themes!

Note…this type of scraping is generally frowned upon by the sites that own the data. That's why I commented out the scraping parts after running them once and saving the raw data as a csv. It's probably not a huge deal to scrape once, but if you build something like this into your daily Product Management workflow you'll probably end up getting your IP address blocked by the site owner. Be nice and be careful when scraping sites.
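If you want a quick courtesy check before scraping at all, the robotstxt package (not used in this post's original run) can tell you whether a site's robots.txt permits crawling a given path:

# Check the site's robots.txt before scraping (requires the robotstxt package)
library(robotstxt)
paths_allowed("https://www.bestbuy.com/site/reviews/hp-instant-ink-50-page-monthly-plan-for-select-hp-printers/5119176")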

# Environment
suppressMessages(library(rvest))
suppressMessages(library(tidyverse))
suppressMessages(library(stringr))
suppressMessages(library(tidytext))
suppressMessages(library(stopwords))
suppressMessages(library(tm))
suppressMessages(library(wordcloud2))

# Target URL to scrape
#base_url <- "https://www.bestbuy.com/site/reviews/hp-instant-ink-50-page-monthly-plan-for-select-hp-printers/5119176"

# Load page
#page <- read_html(base_url)

# Scrape just the comments from the page
#comments <- page %>%
#     html_nodes(".pre-white-space") %>%
#     html_text() %>%
#     tibble(value = .)

# Be nice if you're using this approach...don't overtax someone's website.
# Loop to do the same over pages 2 to 330
#for (i in 2:330) {
#     url <- paste0(base_url, "?page=", i)
#     page <- read_html(url)
#
#     # Pause between requests so we don't hammer the server
#     Sys.sleep(2)
#
#     new_comments <- page %>%
#          html_nodes(".pre-white-space") %>%
#          html_text() %>%
#          tibble(value = .)
#
#     comments <- bind_rows(comments, new_comments)
#}


#write_csv(comments, "../../data/ink_comments.csv")

comments <- read_csv("ink_comments.csv")
Rows: 6622 Columns: 1
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): value

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(comments)
spc_tbl_ [6,622 × 1] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ value: chr [1:6622] "I have found this is excellent way to insure that to be your using HP INK for printer . Program Hp instant ink "| __truncated__ "I like Hewlett Packard products printer and computer . I Bought HP computer and Printer in 1991 and still using"| __truncated__ "It is easy to apply the refill plan. I just got the ink in the mail." "Not good for me. Watch out for overages at 5c a page! Really? I have given HP a total of 260 dollars and receiv"| __truncated__ ...
 - attr(*, "spec")=
  .. cols(
  ..   value = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 
head(comments, 5)
# A tibble: 5 × 1
  value                                                                         
  <chr>                                                                         
1 "I have found this is excellent way to insure that to be your using HP INK fo…
2 "I like Hewlett Packard products printer and computer . I Bought HP computer …
3 "It is easy to apply the refill plan. I just got the ink in the mail."        
4 "Not good for me. Watch out for overages at 5c a page! Really? I have given H…
5 "A $5 discount card for free at check out...it doesn’t get better than this! …

The code above created a data frame with 6,622 rows and one column; each cell contains a single review, as partially shown in the output above.

Unpack all of those comments

To make use of all of those comments, the first step is to break the comments into a long list of individual words.

Next, using the tidytext package we can obtain a list of ‘stop words’ and filter those out of our long list. Stop words are common words like ‘the’, ‘and’, and ‘but’ that don't really help our analysis.
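If you're curious what gets filtered, get_stopwords() returns a small tibble (the Snowball lexicon by default) that you can inspect directly:

# Peek at the default (Snowball) stop word list
head(get_stopwords())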

Once we’ve filtered out the stop words we can count the instances of each word.

# Unnest each word from the long list of comments
text <- unnest_tokens(comments, word, value)

dim(text)
[1] 193187      1
# Summarize words rejecting stopwords
word_count <- text %>%
     anti_join(get_stopwords(), by = "word") %>%
     count(word, sort = TRUE) 

dim(word_count)
[1] 4892    2
head(word_count)
# A tibble: 6 × 2
  word        n
  <chr>   <int>
1 ink      6828
2 printer  2190
3 great    1514
4 hp       1426
5 program  1227
6 plan     1047

Once we ‘unnest’ all of the comments, we get a frame that is 193k rows by one column…that's every word that appears across all of the comments. Note that unnest_tokens() also lowercases tokens by default, which is why ‘hp’ shows up in lowercase in the counts above.

Once we remove the ‘stop’ words and count up the instances of each word, we get a frame with 4,892 unique words.

Determining the Sentiment for Each Word

The tidytext package also provides sentiment lexicons covering thousands of words. For example, “convenient” is tagged as a positive word, whereas “hassle” is considered negative.
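As a quick sanity check, we can look those two words up in the Bing lexicon directly:

# Look up a couple of example words in the Bing sentiment lexicon
get_sentiments("bing") %>%
     filter(word %in% c("convenient", "hassle"))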

We can join our word_count frame with the Bing sentiment data to get separate data frames for our positive and negative words.

# Join sentiments
word_count <- word_count %>%
     inner_join(get_sentiments("bing"), by = "word")



# data frames for Positive and negative words
pos_words <- word_count %>%
     filter(sentiment == "positive")

# "worry" is excluded here because it's almost always used positively
# in these reviews (see the note at the end of this post)
neg_words <- word_count %>%
     filter(word != "worry") %>%
     filter(sentiment == "negative")

Render a word cloud

I like the wordcloud2 package for rendering these illustrations. wordcloud2 wants a data frame whose first column holds the words and whose second column holds each word's frequency (it only reads the first two columns, so passing neg_words directly works here). The tool renders an interactive image that becomes useful for additional exploration. The size of each word is proportional to how frequently it showed up in the comments…so here, the most common negative word was “expensive”.

wordcloud2(neg_words, color = 'random-dark')

Charting Word Counts Traditionally

While word clouds may look nice and can be a great way to start a conversation, it’s also good to look at the word data in a more traditional way.

# Positive and negative frequency bar plots
pos_words %>%
     arrange(-n) %>%
     head(n = 20) %>%
     ggplot(aes(x = reorder(word, n), y = n)) +
     geom_col(fill = 'blue', alpha = .6) +
     coord_flip() +
     theme_minimal() +
     labs(title = "Frequency of top 20 Positive words", x = "")

neg_words %>%
     arrange(-n) %>%
     head(n = 20) %>%
     ggplot(aes(x = reorder(word, n), y = n)) +
     geom_col(fill = 'red', alpha = .6) +
     coord_flip() +
     theme_minimal() +
     labs(title = "Frequency of top 20 Negative words", x = "")

# quick summary of number of words
sum(pos_words$n)
[1] 11540
sum(neg_words$n)
[1] 1894
# ratio of good to bad
sum(pos_words$n)/sum(neg_words$n)
[1] 6.092925

Finally

These tools can be helpful, but you still need to do some manual work here. Going by the code alone, many comments included the word “worry”, which the lexicon tags as negative. However, reading through a few comments we see things like “It's great not to worry about…” and “I never have to worry about running to the store”…that's why “worry” was filtered out of neg_words earlier. Clearly there are ways to look at a word in relation to the words around it, but that's mostly outside the scope of this brief blog entry.
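As a rough sketch of one such approach (not part of the analysis above), unnest_tokens() can also emit bigrams…pairs of adjacent words…which makes it easy to see what usually surrounds “worry”:

# Sketch: break comments into bigrams and inspect the ones containing "worry"
comments %>%
     unnest_tokens(bigram, value, token = "ngrams", n = 2) %>%
     filter(str_detect(bigram, "worry")) %>%
     count(bigram, sort = TRUE) %>%
     head()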