I have only watched the first two episodes of Game of Thrones. I have
seen that a huge fan base and culture followed the series, which leaves
me curious about what was going on in Game of Thrones that made it so
appealing. There are some questions I have been thinking about that
might help me understand the appeal more:
* What are the main ideas or concepts for each episode?
* What are the main ideas or concepts in each season?
* Can I develop a summary for the show from the data and research on topics that are analysis based (one that does not give away too much)?
* Who are the primary characters in each episode and season, based upon the amount of dialog each character has?
* What is the overall sentiment for each episode and season?
* What is the sentiment for the top characters in each episode and season?
I will address the problem statement questions: the theme(s) for the
show; who the primary characters are in each episode and season; and
the overall sentiment for each season, episode, and character.
1. The sentiment analysis: during my literature review I found that
Sunita Parajuli’s Text Mining Project had already answered all the
sentiment questions I had, so I moved on to the next questions.
2. The main concepts/ideas for the seasons and episodes:
relationships, loyalty, betrayal, sex, death and killing, kings and
queens, factions of families, and ruthlessness.
The series is set in a medieval period in a mythical place where various factions are fighting for a throne and/or kingdom. There is a considerable amount of killing in each season, and very few episodes in which no one dies, according to Akshay Goel’s Applying analytics to HBO’s Game of Thrones (GoT) article that I researched. There is also a considerable amount of sex in the series, which keeps the characters developing new networks and transforming the stories further (Lancaster, n.d.). I would have expected to find some sort of magic elements in the story, given the period of the series, but the text analysis and research I conducted did not reveal magic as a theme.

Determining the main characters is subjective. The main character could be determined by the amount of dialog, which would make Tyrion Lannister the main character in nearly every episode and in the show overall (Parajuli, n.d.). Another measure could be degree centrality, which would make Jon Snow the main character (Saleem, 2022). Yet another metric could be who becomes king at the end of the series, which would be Bran Stark. All of these measures are valid: as the series progresses the writers refresh the main character to keep the series dynamic, kill off major characters to keep the audience interested, and offer a dynamic plot where people may or may not like the results in the end.
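Degree centrality is simple to compute. As a minimal sketch in base R (the character names and edges below are a made-up toy network, not the real script data), counting how many edges touch each character reproduces the measure Saleem uses; the igraph package loaded later in this analysis offers the same calculation via `degree()`:

```r
# Toy character edge list (hypothetical pairs, for illustration only)
edges <- data.frame(
  from = c("Jon Snow", "Jon Snow", "Jon Snow",
           "Tyrion Lannister", "Tyrion Lannister"),
  to   = c("Sansa Stark", "Arya Stark", "Daenerys Targaryen",
           "Cersei Lannister", "Daenerys Targaryen")
)

# Degree centrality of an undirected graph: the number of edges
# incident on each node, i.e. tabulate both endpoint columns
deg <- sort(table(c(edges$from, edges$to)), decreasing = TRUE)
deg  # Jon Snow has the highest degree (3) in this toy network
```

On this toy network the most-connected character is Jon Snow, matching the conclusion Saleem draws from the full data.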
– I searched for text and data mining on the Game of Thrones series,
which led me to a plethora of examples; however, I had to narrow the
results down to sources that provide good visualizations and do not
take too much time to read. Once I began researching Game of Thrones
I came across Jeffrey Lancaster’s 19 More Game of Thrones Data
Visualizations page, which I rank as the gold standard. Jeffrey
Lancaster has access to a massive amount of data, including screen
time, character genders, event locations, and other details. Mr.
Lancaster provides links to where he found the code he modified to
create his visualizations, but no firsthand code. There is certainly a
point where too much information becomes unusable, and just because
you can do something does not mean you should. Here is one example
from Mr. Lancaster’s exhibition where a visualization is overwhelming:
If Mr. Lancaster had reduced this network visualization to a maximum
of ten characters it would have been easier to understand; as it is,
it looks like a giant yarn ball to me. There are at least twelve other
visualizations where I am not sure what information I can easily gain
from them. With that in mind, Sunita Parajuli’s Text Mining Project
came up next, and it was a wonderful example with some excellent R
code in it.
– Sunita Parajuli’s Text Mining Project answers some great questions,
such as who the top ten characters are, what the most frequent words
are, and three questions around sentiment analysis regarding
characters. The answers are great on their own, but Sunita’s
techniques are great too, for example: a character’s per-capita dialog
percentage, a word cloud of the top 200 words, and a sentiment
analysis of the top 200 words as well. I am not sure the word clouds
translate into usable information, but they are attention grabbing.
Sunita uses a great visualization for the overall character sentiment
for the first five seasons:
and after filtering for season and then episode this would add a
greater understanding of Game of Thrones. Sunita effectively covered:
the most used words in the series with a horizontal bar chart, the
number of lines of dialog per character with another horizontal bar
chart, a sentiment analysis for the entire series in a word cloud and
horizontal bar charts, and the most frequent terms and top ten
characters in another horizontal bar chart. Sunita does an excellent
job of exploratory analysis on the data set with R; the next
exploratory analysis is done with Python by Ayesha Saleem.
– Ayesha Saleem’s Network theory and Game of Thrones analysis uses
graph theory to evaluate two data sets, nodes and edges, through
Python, with entirely different data than what I have in my data set.
Ayesha Saleem’s Python analysis is a Circos plot that resembles Mr.
Lancaster’s plot, which must also have been a node networking
technique. To find the most important person in the show Ayesha uses
the degree centrality of each node, which measures the “number of
edges that are incident upon a node (for an undirected graph this is
the same as the outgoing nodes)”. If we compare the two charts, who is
the top character, and why? Is it the number of connections a
character has, or the amount of dialog a character has over the course
of the series? I suppose both could be right, depending on the
question asked. I think degree centrality is good at showing the
networks in the series, which I was not seeing in my own data set.
While Ayesha’s analysis provided a completely different perspective, I
found another analytical evaluation, by Akshay Goel, that focuses on
the deaths in Game of Thrones.
– Akshay Goel’s Applying analytics to HBO’s Game of Thrones (GoT)
article focuses on the deaths in the Game of Thrones series and how
they affect the show’s ratings on the popular media rating website
IMDB. Akshay measured the number of episodes with zero deaths versus
the total number of episodes and compared their ratings. The more
deaths that occurred, the higher the ratings: each episode in which a
character was killed received a higher rating, and episodes in which a
more important character was killed received an even higher rating.
Akshay mentions that he used machine learning to predict the death of
the “first major character” from “significant variables,” including:
“time of death, character house, consumer sentiment, airtime (till
death), character importance (subjective), character age, # of deaths
in the house in previous seasons,” which covers a ton of data. I wish
Akshay had mentioned where he found his data so the investigation
could be replicated. Akshay provided a great opinion piece with more
data on the deaths in Game of Thrones, and I will summarize the
articles next.
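Akshay’s zero-death comparison can be sketched in a few lines of base R. The ratings and death counts below are invented placeholders (his underlying data source is not published), but the grouping logic is the same:

```r
# Hypothetical episode ratings and death counts (made up for
# illustration; Akshay Goel does not publish his underlying data)
episodes <- data.frame(
  rating = c(8.1, 9.0, 9.5, 8.3, 9.7, 8.0),
  deaths = c(0,   2,   5,   0,   8,   1)
)

# Mean IMDB-style rating for zero-death episodes vs. episodes with deaths
means <- tapply(episodes$rating, episodes$deaths > 0, mean)
means
# FALSE  TRUE
#  8.20  9.05
```

With these placeholder numbers the episodes with at least one death average a higher rating, which is the pattern Akshay reports.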
– These were all great articles on Game of Thrones that provided
different perspectives and even different data on the series. The
authors provided a greater understanding of what is happening in Game
of Thrones, and that is what data mining is about: gaining a greater
understanding of your subject. A significant amount of exploratory
analysis was conducted to understand the subject(s) better, and
different data can add greater depth to the overall understanding.
With each additional layer of information provided by the
investigators, I now have a much greater understanding of Game of
Thrones than I did before I began this project. There has been only
one interpretation of the data, by Mr. Lancaster, about who would sit
on the throne at the end; with no other summaries or predictions, the
investigators are providing the facts.
invisible(rm(list = ls())) # clear the global environment first
invisible(gc()) # then release the unused memory
start_time <- Sys.time()
library(jsonlite)
#library(dplyr)
library(tidyr)
library(tidytext)
library(tidyverse, quietly = TRUE)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ dplyr 1.1.0
## ✔ tibble 3.1.8 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ purrr::flatten() masks jsonlite::flatten()
## ✖ dplyr::lag() masks stats::lag()
library(widyr)
library(ggplot2)
library(igraph)
library(ggraph)
library(topicmodels)
library(corpus)
library(gridExtra) #viewing multiple plots together
library(wordcloud) # word cloud generator
## Loading required package: RColorBrewer
if (!require("wordcloud2")){
install.packages("wordcloud2")
library(wordcloud2)} # word cloud generator
## Loading required package: wordcloud2
library(SnowballC) # for stemming the text
library(tm) # text mining
## Loading required package: NLP
library(reshape2) # reshapes a data frame
library(stm)
## stm v1.3.6 successfully loaded. See ?stm for help.
## Papers, resources, and other materials at structuraltopicmodel.com
I retrieved Alben Tumanggor’s Game of Thrones data set from Kaggle on April 22, 2023; it was last updated three years ago. I evaluated the number of variables and observations before and after cleaning. I renamed the variable “Sentence” to “text” to reduce overall work, then checked all the variable names in the data set. Finally, I checked each variable’s type to see whether any cleaning needed to be performed. The data structure is:
# got <- read_csv("Game_of_Thrones_Script.csv")
# #names(got)
# got <- got %>%
# mutate(text = Sentence, Sentence=NULL )
#
# #names(got)
#
# saveRDS(got,"got.rds")
#
# fix.contractions <- function(doc) {
# # "won't" is a special case as it does not expand to "wo not"
# doc <- gsub("won't", "will not", doc)
# doc <- gsub("can't", "can not", doc)
# doc <- gsub("n't", " not", doc)
# doc <- gsub("'ll", " will", doc)
# doc <- gsub("'re", " are", doc)
# doc <- gsub("'ve", " have", doc)
# doc <- gsub("'m", " am", doc)
# doc <- gsub("'d", " would", doc)
# # 's could be 'is' or could be possessive: it has no expansion
# doc <- gsub("'s", "", doc)
# return(doc)
# }
#
# # fix (expand) contractions
# got$text <- sapply(got$text, fix.contractions)
#
# got1<- got
# got1 <- got1 %>%
# mutate(line = row_number()) %>% # create a new col with a row number for each existing line
# unnest_tokens(word, text) %>% # tidy the words, text
# anti_join(stop_words) # filter out the stop words
#
# #distinct() #%>% # get rid of any duplicate records using
#
# got1$word <- text_tokens(got1$word, stemmer = "en") # english stemmer
# got1$word<- gsub("[^a-zA-Z]","",got1$word) # remove special characters
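Because the cleaning chunk above is commented out (it was run once and its result cached to got1.rds), here is a standalone copy of the same contraction-expansion step so it can be sanity-checked on a sample line:

```r
# Standalone copy of the contraction expansion used in the cached chunk above
fix.contractions <- function(doc) {
  # "won't" is a special case as it does not expand to "wo not"
  doc <- gsub("won't", "will not", doc)
  doc <- gsub("can't", "can not", doc)
  doc <- gsub("n't", " not", doc)
  doc <- gsub("'ll", " will", doc)
  doc <- gsub("'re", " are", doc)
  doc <- gsub("'ve", " have", doc)
  doc <- gsub("'m", " am", doc)
  doc <- gsub("'d", " would", doc)
  # 's could be 'is' or could be possessive, so it is simply dropped
  doc <- gsub("'s", "", doc)
  doc
}

fix.contractions("You won't kill the king's guard")
# "You will not kill the king guard"
```

Note the trade-off in the final rule: dropping every `'s` keeps ambiguous cases out of the token counts, at the cost of losing possessives such as “king’s”.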
my_colors <- c("#E69F00", "#56B4E9", "#009E73", "salmon", "#DAF7A6", "#FDD300","#89A8BF","darkblue","darkred")
#names(got)
got <- readRDS("got.rds")
#saveRDS(got1,"got1.rds")
got1 <- readRDS("got1.rds")
str(got1)
## tibble [88,392 × 7] (S3: tbl_df/tbl/data.frame)
## $ Release Date : Date[1:88392], format: "2011-04-17" "2011-04-17" ...
## $ Season : chr [1:88392] "Season 1" "Season 1" "Season 1" "Season 1" ...
## $ Episode : chr [1:88392] "Episode 1" "Episode 1" "Episode 1" "Episode 1" ...
## $ Episode Title: chr [1:88392] "Winter is Coming" "Winter is Coming" "Winter is Coming" "Winter is Coming" ...
## $ Name : chr [1:88392] "waymar royce" "waymar royce" "waymar royce" "waymar royce" ...
## $ line : int [1:88392] 1 1 1 1 1 1 1 1 2 2 ...
## $ word : chr [1:88392] "expect" "savag" "lot" "steal" ...
# summary(got1)
# glimpse(got1)
# head(got1,3)
write.csv(got1, "got.csv", row.names=FALSE)
The Game of Thrones data set begins with 23911 observations and 6
variables. After cleaning and tidying, the data set has 88392
observations and 7 variables. The variable names are Release Date,
Season, Episode, Episode Title, Name, and text. The Release Date
variable is a date, and the rest of the variables are character
variables.
After reviewing Sunita Parajuli’s Text Mining Project, I felt applying a similar approach to Sunita’s dialog-per-character visual was a good place to begin the Game of Thrones series investigation. I also borrowed Sunita’s bit of code for the gradient color scheme because it adds a nicer visual effect. I conducted a character word count for the series as a whole to see who was driving the stories, then a character word count for each season to see which characters were driving the series in each season.
# Character total word count
got1%>%
count(Name, sort = TRUE) %>% # get the n top words from the tidied, clean, filtered dataset using count() and top_n()
# slice(1:20) %>%
top_n(20) %>% # keep the top 20 characters
ungroup() %>%
mutate(Name = reorder(Name, n)) %>% # sort words according to the count using reorder()and reassign the ordered value to word using mutate()
ggplot(aes(Name, n)) +
geom_col(aes(fill=n), show.legend=FALSE) + # geom_col already uses stat = "identity"; with aes here, geom_text works properly
geom_text(aes(label= n ),color= "black", hjust=1.10)+
xlab("") +
scale_fill_gradient(low=my_colors[7], high=my_colors[1]) +
ggtitle("Characters' word count") +
coord_flip()+
theme_bw()+
theme(axis.text.x = element_blank())
## Selecting by n
# this is Sunita Parajuli's code for her "Lines of dialogue per character" visual
# got1 %>%
# count(Name) %>%
# arrange(desc(n)) %>%
# slice(1:20) %>%
# ggplot(aes(y=reorder(Name, n), x=n)) +
# geom_bar(stat="identity", aes(fill=n), show.legend=FALSE) +
# geom_label(aes(label=n)) +
# scale_fill_gradient(low="dodgerblue", high="dodgerblue4") +
# labs(x="Character", y="Lines of dialogue",
# title="Lines of dialogue per character") +
# theme_bw()
# Count names in got
# a table:
# matrix <-got1 %>%
# group_by(Season) %>%
# count(Name, sort = TRUE) %>% # get the n top words from the tidied, clean, filtered dataset using count() and top_n()
# slice(1:10) %>% # use top (1:digits)
# # top_n(10) %>% # counts the top 10
# ungroup() %>%
# mutate(Name = reorder(Name, n)) #%>% # reorder the counted subject by n
# library(data.table) # may or may not be necessary for the as.matrix
# matrix
# as.matrix(matrix[],rownames = "Season")
# top 10 character word count by season
got1 %>%
group_by(Season) %>%
count(Name, sort = TRUE) %>% # get the n top words from the tidied, clean, filtered dataset using count() and top_n()
slice(1:10) %>% # use top (1:digits)
# top_n(10) %>% # counts the top 10
ungroup() %>%
mutate(Name = reorder(Name, n)) %>% # reorder the counted subject by n
ggplot(aes(Name, n)) +
geom_col(aes(fill=n), show.legend=FALSE) + # geom_col already uses stat = "identity"; with aes here, geom_text works properly
scale_fill_gradient(low=my_colors[7], high=my_colors[1]) +
geom_text(aes(label= n ),color= "black",size=2.5, hjust=1.10)+
xlab("") +
#ylab("Word Count") +
coord_flip()+
labs(title="Characters' word count for each season")+
theme_bw()+
facet_wrap(~Season, ncol=3, scales= "free")+
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) # theme() has no x/y arguments; axis labels are set via xlab()/labs()
# top 15 words in season 8 Episode 6
got1 %>%
group_by(Season) %>%
filter(grepl("8",Season), grepl("6",Episode)) %>%
count(word, sort = TRUE) %>% # get the n top words from the tidied, clean, filtered dataset using count() and top_n()
slice(1:15) %>% # keep the top 15 words, matching the plot title
# top_n(10) %>% # counts the top 10
ungroup() %>%
mutate(word = reorder(word, n)) %>% # reorder the counted subject by n
ggplot(aes(word, n)) +
geom_col(aes(fill=n), show.legend=FALSE) + # geom_col already uses stat = "identity"; with aes here, geom_text works properly
scale_fill_gradient(low=my_colors[7], high=my_colors[1]) +
geom_text(aes(label= n ),color= "black",size=2.5, hjust=1.10)+
xlab("") +
#ylab("Word Count") +
coord_flip()+
labs(title="Top 15 words for Season 8 Episode 6")+
theme_bw()+
#facet_wrap(~Episode, ncol=2, scales= "free")+
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) # theme() has no x/y arguments; axis labels are set via xlab()/labs()
The character word count was helpful for figuring out who the drivers of the series are. The character word count per season helped show which character was driving the conversations in each season. The top 15 words in Season 8 Episode 6 were helpful in determining an overall theme for the end of the series.
I thought using an STM would provide additional information on the themes or activities in the series, and it did provide basic insight into the Game of Thrones themes. The gamma model did not produce information as usable as the beta model, so I focused on the beta model to complement the STM model with additional insights.
# an STM
# processed <- textProcessor(got$text, metadata = got)
# out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
# docs <- out$documents
# vocab <- out$vocab
# meta <-out$meta
#
# First_STM <- stm(documents = out$documents, vocab = out$vocab,
# K = 4,
# max.em.its = 75, data = out$meta,
# init.type = "Spectral", verbose = FALSE)
#
# saveRDS(First_STM,"First_STM.rds")
First_STM <- readRDS("First_STM.rds")
# plot(First_STM)
# labelTopics(First_STM, topics=c(1,2,3), n=10)# complete list of top 10 words per topics 3-5-9
plot.STM(First_STM, "labels", topics=c(1,2,3,4), label="frex", n=10, width=65) # top 10 FREX words for topics 1-4
word_counts <- got1 %>%
count(line, word, sort = TRUE) %>%
ungroup()
#word_counts
got_dtm <- word_counts %>%
cast_dtm(line, word, n)
#got_dtm
got_lda <- LDA(got_dtm, k = 4, control = list(seed = 1234))
#got_lda
tidy_lda <- tidy(got_lda) %>%
distinct()
#tidy_lda
top_terms <- tidy_lda %>%
group_by(topic) %>%
slice_max(beta, n = 20, with_ties = FALSE) %>% # maximum number of figures per topic
ungroup() %>% # ungroup the group_by(topic)
arrange(desc(beta)) %>% # arrange beta in desc order
unique() # allow only unique observations
#top_terms
#visual makes it easier to interpret
top_terms %>%
filter(!grepl("ser",term)) %>%
mutate(term = reorder_within(term, beta, topic)) %>%
group_by(topic, term) %>%
arrange(desc(beta)) %>%
ungroup() %>%
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 20 terms in each LDA topic",
x = expression(beta), y = NULL) +
theme_bw()+
facet_wrap(~ topic, ncol = 4, scales = "free")+
theme(axis.text.x = element_blank(),axis.ticks.x = element_blank())
#8.4.3 interpreting the topic model
lda_gamma <- tidy(got_lda, matrix = "gamma") %>%
arrange(desc(gamma))
#lda_gamma # verify
# how are the probabilities distributed?
# ggplot(lda_gamma, aes(gamma)) +
# stat_bin(bins = 30)+
# # geom_histogram(alpha = 0.8) +
# scale_y_log10() +
# labs(title = "Distribution of probabilities for all topics",
# y = "Number of documents", x = expression(gamma))+
# theme_bw()
The topic modeling provided eighty items that, when analyzed, give definite insight into the series. There is a recurring set of themes across the topics: a king, Lannister, a son, a brother, killing and dying, sex, and Stark. These results show the main characters in the series, their relationships, significant locations, and the characters’ activities. This is helpful for informing me about the content of the series.
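The commented-out gamma plot above looks at the per-document topic probabilities. A minimal base-R sketch (with a made-up gamma matrix, not the fitted model’s values) shows how a dominant topic would be read off per line of dialog:

```r
# Hypothetical gamma matrix: rows = lines of dialog, cols = topics;
# each row gives the probability that the line belongs to each topic
gamma <- matrix(c(0.70, 0.10, 0.15, 0.05,
                  0.05, 0.80, 0.10, 0.05,
                  0.25, 0.25, 0.30, 0.20),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("line1", "line2", "line3"),
                                paste0("topic", 1:4)))

# Dominant topic per line: the column with the largest probability
apply(gamma, 1, which.max)
# line1 -> topic 1, line2 -> topic 2, line3 -> topic 3
```

Note that line3’s probabilities are nearly uniform; when most rows look like that, as they did here, the gamma matrix adds little beyond the beta (per-topic word) view.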
Bigrams will be used to look for word networks that provide insight into the series. The bigram analysis will first be conducted on the whole series to reveal any series-wide connections. The next bigram analysis will be on Season 8 Episodes 5 and 6, to provide context for the end of the season and insight into the series ending.
# complete Series bigrams
got_bigrams0 <-got %>%
#filter(Season %in% c("Season 7","Season 8")) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% # tokenize the bigrams
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(nchar(word1) >3) %>%
filter(nchar(word2) >3) %>%
filter(!word1 %in% stop_words$word) %>% # remove stop_words
filter(!word2 %in% stop_words$word) %>% # remove stop_words
count(word1, word2, sort = TRUE) %>%
filter(n > 17) %>%
graph_from_data_frame(directed = FALSE)
set.seed(2323)
graph0<-ggraph(got_bigrams0, layout = "fr") +
geom_edge_link(aes(edge_alpha = n,edge_width=n), edge_colour = my_colors[7],show.legend = TRUE,
end_cap = circle(.07, 'inches')) +
geom_node_point(color = my_colors[2], size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()+
labs(title= "Game of Thrones Complete Series Bigrams")
# # Season 1&2 Bigrams
# got_bigrams <-got %>%
# filter(grepl("1|2",Season)) %>%
# unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% # tokenize the bigrams
# separate(bigram, c("word1", "word2"), sep = " ") %>%
# filter(nchar(word1) >3) %>%
# filter(nchar(word2) >3) %>%
# filter(!word1 %in% stop_words$word) %>% # remove stop_words
# filter(!word2 %in% stop_words$word) %>% # remove stop_words
# count(word1, word2, sort = TRUE) %>%
# filter(n > 8) %>%
# graph_from_data_frame(directed = FALSE)
#
#
# set.seed(2323)
# graph1<-ggraph(got_bigrams, layout = "fr") +
# geom_edge_link(aes(edge_alpha = n,edge_width=n), edge_colour = my_colors[7],show.legend = TRUE,
# end_cap = circle(.07, 'inches')) +
# geom_node_point(color = my_colors[2], size = 5) +
# geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
# theme_void()+
# labs(title= "Game of Thrones Season 1 & 2 Bigrams")
#
# # Season 3&4 bigrams
# got_bigrams2 <-got %>%
# filter(grepl("3|4",Season)) %>%
# unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% # tokenize the bigrams
# separate(bigram, c("word1", "word2"), sep = " ") %>%
# filter(nchar(word1) >3) %>%
# filter(nchar(word2) >3) %>%
# filter(!word1 %in% stop_words$word) %>% # remove stop_words
# filter(!word2 %in% stop_words$word) %>% # remove stop_words
# count(word1, word2, sort = TRUE) %>%
# filter(n > 8) %>%
# graph_from_data_frame(directed = FALSE)
#
#
# set.seed(2323)
# graph2<-ggraph(got_bigrams2, layout = "fr") +
# geom_edge_link(aes(edge_alpha = n,edge_width=n), edge_colour = my_colors[7],show.legend = TRUE,
# end_cap = circle(.07, 'inches')) +
# geom_node_point(color = my_colors[2], size = 5) +
# geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
# theme_void()+
# labs(title= "Game of Thrones Season 3 & 4 Bigrams")
#
#
# # Season 5&6 bigrams
# got_bigrams3 <-got %>%
# filter(grepl("5|6",Season)) %>%
# unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% # tokenize the bigrams
# separate(bigram, c("word1", "word2"), sep = " ") %>%
# filter(nchar(word1) >3) %>%
# filter(nchar(word2) >3) %>%
# filter(!word1 %in% stop_words$word) %>% # remove stop_words
# filter(!word2 %in% stop_words$word) %>% # remove stop_words
# count(word1, word2, sort = TRUE) %>%
# filter(n > 8) %>%
# graph_from_data_frame(directed = FALSE)
#
#
# set.seed(2323)
# graph3<-ggraph(got_bigrams3, layout = "fr") +
# geom_edge_link(aes(edge_alpha = n,edge_width=n), edge_colour = my_colors[7],show.legend = TRUE,
# end_cap = circle(.07, 'inches')) +
# geom_node_point(color = my_colors[2], size = 5) +
# geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
# theme_void()+
# labs(title= "Game of Thrones Season 5 & 6 Bigrams")
#
# # season 7 & 8 bigrams
# got_bigrams4 <-got %>%
# filter(grepl("7|8",Season)) %>%
# unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% # tokenize the bigrams
# separate(bigram, c("word1", "word2"), sep = " ") %>%
# filter(nchar(word1) >3) %>%
# filter(nchar(word2) >3) %>%
# filter(!word1 %in% stop_words$word) %>% # remove stop_words
# filter(!word2 %in% stop_words$word) %>% # remove stop_words
# count(word1, word2, sort = TRUE) %>%
# filter(n > 5) %>%
# graph_from_data_frame(directed = FALSE)
#
#
# set.seed(2323)
# graph4<-ggraph(got_bigrams4, layout = "fr") +
# geom_edge_link(aes(edge_alpha = n,edge_width=n), edge_colour = my_colors[7],show.legend = TRUE,
# end_cap = circle(.07, 'inches')) +
# geom_node_point(color = my_colors[2], size = 5) +
# geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
# theme_void()+
# labs(title= "Game of Thrones Season 7 & 8 Bigrams")
# Season 8, final 2 episodes bigrams
got_bigrams5 <-got %>%
filter(grepl("8",Season), grepl("5|6",Episode)) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% # tokenize the bigrams
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(nchar(word1) >3) %>%
filter(nchar(word2) >3) %>%
filter(!word1 %in% stop_words$word) %>% # remove stop_words
filter(!word2 %in% stop_words$word) #%>% # remove stop_words
got_bigrams5 %>% filter(grepl("kil", word1)|grepl("kil", word2)|grepl("king",word1)|grepl("king",word2)) %>%
select(Name, word1,word2,Season,Episode)
got_bigrams5 <- got_bigrams5 %>%
filter(grepl("kil", word1)|grepl("kil", word2)|grepl("king",word1)|grepl("king",word2)|grepl("tyrio",word1)|grepl("tyrio",word2)|grepl("queen",word1)|grepl("queen",word2)) %>%
count(word1, word2, sort = TRUE) %>%
filter(n >= 1) %>%
graph_from_data_frame(directed = FALSE)
set.seed(2323)
graph5<-ggraph(got_bigrams5, layout = "fr") +
geom_edge_link(aes(edge_alpha = n,edge_width=n), edge_colour = my_colors[7],show.legend = TRUE,
end_cap = circle(.07, 'inches')) +
geom_node_point(color = my_colors[2], size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()+
labs(title= "Game of Thrones Season 8 Episode 5 & 6 Bigrams")
# visual calls
graph0
# graph1
# graph2
# graph3
# graph4
graph5
The search through the bigrams for the whole series gave me a good
network for seeing word relationships, how characters were
interacting, and what their titles may have been. When I filter down
to a more precise subject, I get a more concise result. I filtered
Season 8 down to the last two episodes and then searched for the
words “kill”, “king”, “queen”, and “tyrion”. I figured the character
with the most dialog would be relevant to the end, and someone would
be on the throne. Based on the bigram visual there could be a King
Bran or Robert, and a Queen Daenerys. Based on the trigrams, which
come up next, Cersei was killed, so she will not be queen.
I tried trigrams to complement the bigrams, thinking the extra context would help me understand the bigrams more. I found the trigram network chart for the final season and episodes helpful in adding context for how the show ended.
got_trigrams0 <-got %>%
#filter(Season %in% c("Season 7","Season 8")) %>%
unnest_tokens(output = trigram, input = text, token = "ngrams", n = 3) %>% # tokenize the trigrams
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(nchar(word1) >3) %>%
filter(nchar(word2) >3) %>%
filter(nchar(word3) >3) %>%
filter(!word1 %in% stop_words$word) %>% # remove stop_words
filter(!word2 %in% stop_words$word) %>% # remove stop_words
filter(!word3 %in% stop_words$word) %>% # remove stop_words
count(word1, word2, word3, sort = TRUE) %>%
filter(n > 2) %>%
graph_from_data_frame(directed = FALSE)
# got_trigrams0
set.seed(2016)
trigraph0<-ggraph(got_trigrams0, layout = "fr") +
geom_edge_link(aes(edge_alpha = n,edge_width=n), edge_colour = my_colors[7],show.legend = TRUE,
end_cap = circle(.07, 'inches')) +
geom_node_point(color = my_colors[2], size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()+
labs(title= "Game of Thrones Complete Series trigrams")
# Season 8, Episodes 5 & 6 trigrams
got_trigrams7 <-got %>%
filter(grepl("8",Season),grepl("5|6",Episode), grepl("jon s|tyrio|bran|dae|sans|star|lann",Name)) %>%
unnest_tokens(output = trigram, input = text, token = "ngrams", n = 3) %>% # tokenize the trigrams
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(nchar(word1) >3) %>%
filter(nchar(word2) >3) %>%
filter(nchar(word3) >3) %>%
filter(!word1 %in% stop_words$word) %>% # remove stop_words
filter(!word2 %in% stop_words$word) %>% # remove stop_words
filter(!word3 %in% stop_words$word) #%>% # remove stop_words
got_trigrams7 %>%
select(Name, word1,word2,word3,Season,Episode)
graph_got_trigrams7<- got_trigrams7 %>%
select(word1,word2,word3)
graph_got_trigrams7 <- graph_got_trigrams7 %>%
# filter(grepl("kil", word1)|grepl("kil", word2)|
# grepl("king",word1)|grepl("king",word2)|
# grepl("tyrio",word1)|grepl("tyrio",word2)|
# grepl("queen",word1)|grepl("queen",word2)) %>%
count(word1, word2, word3, sort = TRUE) %>%
#filter(n > 1) %>%
graph_from_data_frame(directed = FALSE)
# graph_got_trigrams7
# visual to make it easier to read
set.seed(2016)
trigraph7<-ggraph(graph_got_trigrams7, layout = "fr") +
geom_edge_link(aes(edge_alpha = n,edge_width=n), edge_colour = my_colors[7],show.legend = TRUE,
end_cap = circle(.07, 'inches')) +
geom_node_point(color = my_colors[2], size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()+
labs(title= "Game of Thrones Season 8 Episode 5 & 6 trigrams")
trigraph0
trigraph7
The trigram investigation into the entire series did not provide any
insight beyond what the bigrams provided. I was unable to use the same
search and filtering parameters on Season 8 Episodes 5 & 6; there were
too few data points to be useful. I do see that Arya Stark gave the
order to “kill Queen Cersei,” which was significant.
I planned to conduct a tf-idf analysis of character word counts for all seasons; a second analysis of words across all seasons, to see whether there was any information to be gained; and one final analysis of Season 8, to see whether its six episodes could tell me a little more about what was happening near the end of the series.
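Before running the full analysis, tf-idf itself can be checked by hand. This base-R sketch uses a two-“season” toy corpus (invented words, not the script data) and mirrors what tidytext’s `bind_tf_idf()` computes: term frequency within a document times the log inverse document frequency across documents:

```r
# Hand-rolled tf-idf on a tiny toy corpus (two "seasons" of a few words each)
corpus <- list(
  season1 = c("king", "king", "stark", "winter"),
  season2 = c("dragon", "queen", "king")
)

tf_idf <- function(term, doc, corpus) {
  # term frequency: share of the document's words that are this term
  tf  <- sum(corpus[[doc]] == term) / length(corpus[[doc]])
  # inverse document frequency: log of (documents / documents containing term)
  idf <- log(length(corpus) / sum(sapply(corpus, function(d) term %in% d)))
  tf * idf
}

tf_idf("winter", "season1", corpus)  # 0.25 * log(2), distinctive to season1
tf_idf("king",   "season1", corpus)  # 0: "king" appears in every season
```

This is why season-distinguishing names like Ygritte score highly while words used everywhere score zero.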
# tfidf Names
got_names_tf_idf <- got1 %>%
filter(nchar(word)>2) %>%
count(Season, Name, sort = TRUE) %>%
bind_tf_idf(Name, Season, n) %>%
arrange(-tf_idf) %>%
group_by(Season) %>%
top_n(10) %>%
ungroup %>%
filter(!grepl("sol|qyb|dav|ring|gend",Name)) %>%
arrange(Season)
got_names_tf_idf %>%
mutate(Name = reorder_within(Name, tf_idf,Season)) %>%
ggplot(aes(Name, tf_idf, fill = Season)) +
geom_col(alpha = 0.4, show.legend = FALSE) +
geom_text(aes(label= round(tf_idf,2) ),color= "black", hjust=1.0) +
facet_wrap(~ Season, scales = "free", ncol = 4) +
coord_flip() +
theme(strip.text=element_text(size=7),axis.text.x = element_blank(),axis.ticks.x = element_blank()) +
labs(x = NULL, y = NULL,
title = "Highest tf-idf Names in Game of Thrones Seasons")+
scale_x_reordered()
# tf-idf Seasons
got_tf_idf <- got1 %>%
filter(nchar(word)>2) %>%
count(Season, word, sort = TRUE) %>%
bind_tf_idf(word, Season, n) %>%
arrange(-tf_idf) %>%
group_by(Season) %>%
top_n(10) %>%
filter(!grepl("dani",word)) %>%
ungroup %>%
arrange(Season)
got_tf_idf %>%
  mutate(word = reorder_within(word, tf_idf, Season)) %>%
  ggplot(aes(word, tf_idf, fill = Season)) +
  geom_col(alpha = 0.4, show.legend = FALSE) +
  geom_text(aes(label = round(tf_idf, 3)), color = "black", hjust = 1.0) +
  facet_wrap(~ Season, scales = "free", ncol = 4) +
  coord_flip() +
  theme(strip.text = element_text(size = 7),
        axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
  labs(x = NULL, y = "tf-idf",
       title = "Highest tf-idf words in Game of Thrones Seasons") +
  scale_x_reordered()
# tf-idf for the episodes in Season 8
got_tf_idf <- got1 %>%
  filter(grepl("8", Season)) %>%  # equivalently: filter(Season %in% "Season 8")
  filter(nchar(word) > 2) %>%
  count(Episode, word, sort = TRUE) %>%
  bind_tf_idf(word, Episode, n) %>%
  arrange(-tf_idf) %>%
  group_by(Episode) %>%
  top_n(8) %>%
  filter(!grepl("dani", word)) %>%  # drop "dani" name fragments
  ungroup() %>%
  arrange(Episode)
got_tf_idf %>%
  mutate(word = reorder_within(word, tf_idf, Episode)) %>%
  ggplot(aes(word, tf_idf, fill = Episode)) +
  geom_col(alpha = 0.4, show.legend = FALSE) +
  geom_text(aes(label = round(tf_idf, 2)), color = "black", hjust = 1.0) +
  facet_wrap(~ Episode, scales = "free", ncol = 4) +
  coord_flip() +
  theme(strip.text = element_text(size = 7),
        axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
  labs(x = NULL, y = "tf-idf",
       title = "Highest tf-idf words in Game of Thrones Season 8 Episodes") +
  scale_x_reordered()
The tf-idf analysis yielded a little information in each run I
conducted, and when I synthesized all the results into a larger list I
was able to gather more from this part of the analysis. The names
analysis surfaced Robert and Stannis Baratheon, Stark continuously,
Ygritte in Season 3, Tywin Lannister three times, Drogon, Lady Crane,
Harry, and Sam Tarly; there were others, but these stood out the most.
The words from all the seasons did not make a lot of sense to me, and I
do not think they furthered my understanding of Game of Thrones.
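One likely reason the season-level word lists looked odd is how tf-idf is defined: a word that appears in every season has an inverse document frequency of zero, so series-wide themes (kings, death, family) vanish from the rankings and only season-specific terms remain. A minimal demonstration of `bind_tf_idf()` on toy data (the seasons and words here are made up for illustration):

```r
library(dplyr)
library(tidytext)

# Two tiny "seasons": "throne" appears in both, the others in only one
toy <- tibble(
  Season = c("S1", "S1", "S2", "S2"),
  word   = c("throne", "dragon", "throne", "winter")
) %>%
  count(Season, word) %>%
  bind_tf_idf(word, Season, n)

# "throne" occurs in both documents, so idf = log(2/2) = 0 and tf-idf = 0;
# "dragon" and "winter" are season-specific and receive positive scores.
```

This is why the highest-scoring entries tend to be names and one-off terms rather than the show's recurring themes.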
# Analysis Findings
This analysis began with questions I wanted answered about Game of Thrones, which led to a review of four pieces of literature on the series and then into my own analysis. The questions were what the main ideas or themes of the show, its seasons, and its episodes are, and who the key characters in the series were. After developing these questions, I set about researching whether anyone had already conducted data analysis on Game of Thrones. That research was very productive: I found Sunita Parajuli's Text Mining Project almost immediately. Parajuli covered the most-used words, the characters with the most dialog, and sentiment analysis so thoroughly that I did not think I could offer any additional sentiment analysis without getting into extremely specific elements of the show. Jeffrey Lancaster's Game of Thrones was the next report I reviewed, and I recognized that its data came from a very different source. Lancaster did not provide any code to review, and while his visuals were thorough, many were too busy to make sense of and answered questions no one had asked. The next analysis I reviewed was Ayesha Saleem's network study, which used the same data as Lancaster but performed the analysis in Python; her network analysis provided a unique perspective on who the main characters were. Finally, I came across Akshay Goel's Applying analytics to HBO's Game of Thrones, an article primarily focused on character deaths and how they drove ratings: the more important the character, the higher the ratings climbed when that character died. With these four articles reviewed, I felt I needed to shift my questions away from sentiment analysis and toward the themes and characters of Game of Thrones.
It was unexpected to find so many useful sources of information on the series, so my analysis shifted slightly. I started with Alben Tumanggor's Game of Thrones Script All Seasons dataset: cleaning, removing stopwords, correcting contractions, stemming, removing special characters from the script, and finally tidying the data. My analysis began with a top-twenty character word count for the series, and then a top-ten character word count for each season, to understand whose dialog was driving the show overall and in each season. After learning which characters were driving the show, I modeled the data and learned a great deal about it. The modeling showed that the significant factors were the amount of death and killing in the show, the Lannister and Stark families, kings and queens, the people, families, and sex as the main themes. The modeling was a good start to adding context, so next up were bigrams. I continued to explore the themes and main characters with a bigram exploration of the whole series and of episodes five and six of Season 8. I fine-tuned the Season 8 episode 5 & 6 exploration by searching for bigrams containing "kill" or "king," which surfaced "kill Cersei," "kills Cersei," "king Robert," and finally "king Bran." I found the chart of bigrams easier to read than the network visualizations, though the visuals did show how the elements joined together. The bigram chart described a cutthroat series, with every major character going for the throne and Bran apparently king in the end. While bigrams helped surface additional information, I still wanted to try trigrams; they provided nothing beyond the bigrams. I also tried term frequency-inverse document frequency (tf-idf) and gained a little more insight into the characters in particular seasons.
More research into additional literature could be conducted to further the understanding of the series. A more specific analysis could also be run on this data set, and on additional data sets, to increase the amount of data analyzed. The most interesting finding was that tools I expected to yield impressive results, such as tf-idf and more in-depth bigrams, did not provide the insight they had in previous analyses.
Goel, A. (2019, April 12). Applying analytics to HBO's Game of Thrones (GoT). LinkedIn. Retrieved April 26, 2023, from https://www.linkedin.com/pulse/applying-analytics-hbos-game-thrones-got-akshay-goel
Lancaster, J. (n.d.). Game of Thrones. Retrieved April 25, 2023, from https://jeffreylancaster.com/game-of-thrones/
Parajuli, S. (n.d.). Text Mining Project. RPubs. Retrieved April 25, 2023, from https://rpubs.com/shinygalena/854607
Saleem, A. (2022, September 1). Network theory and Game of Thrones - a perfect combination. Data Science Dojo. Retrieved April 26, 2023, from https://datasciencedojo.com/blog/network-theory-game-of-thrones/
Tumanggor, A. (2019, November 19). Game of Thrones script all seasons. Kaggle. Retrieved April 26, 2023, from https://www.kaggle.com/datasets/albenft/game-of-thrones-script-all-seasons
This report took 15.12 seconds to complete.