I have only watched the first two episodes of Game of Thrones. I have
seen that a huge fan base and culture followed the series, which leaves
me curious about what was going on in Game of Thrones that made it so
appealing. There are some questions I have been thinking about that
might help me understand the appeal more:
* What are the main ideas or concepts for each episode?
* What are the main ideas or concepts in each season?
* Can I develop a summary for the show from the data and research on topics that are analysis based (one that does not give away too much)?
* Who are the primary characters in each episode and season, based upon the amount of dialog each character has?
* What is the overall sentiment for each episode and season?
* What is the sentiment for the top characters in each episode and season?
I will address the problem statement questions: the theme(s) for the
show; who the primary characters are in each episode and season; and
the overall sentiment for each season, episode, and character.
1. The sentiment analysis: during my literature review I found that
Sunita Parajuli’s Text Mining Project had already answered all the
sentiment questions I had, so I moved on to the next questions.
2. The main concepts/ideas for the seasons and episodes:
relationships, loyalty, betrayal, sex, death and killing, kings and
queens, factions of families, and ruthlessness.
The series is set in a medieval period in a mythical place where various factions are fighting for a throne and/or kingdom. There is a considerable amount of killing in each season, and very few episodes in which no one dies, according to Akshay Goel’s Applying analytics to HBO’s Game of Thrones (GoT) article that I researched. There is also a considerable amount of sex in the series, which keeps the characters developing new networks and transforming the stories further (Lancaster, n.d.). I would have expected to find some sort of magic elements in the story, given the period of the series, but the text analysis and research I conducted did not reveal magic as a theme.

Determining the main characters is subjective. The main character could be determined by the amount of dialog, which would make Tyrion Lannister the main character in nearly every episode and in the show overall (Parajuli, n.d.). Another measure could be degree centrality, which would make Jon Snow the main character (Saleem, 2022). Yet another metric could be who becomes king at the end of the series, which would be Bran Stark. All of these measures are valid: as the series progresses the writers refresh the main character to keep the series dynamic, kill off major characters to keep the audience interested, and offer a dynamic plot where people may or may not like the results in the end.
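Degree centrality is simple to compute. As a minimal sketch in base R (the character names and edges below are a made-up toy network, not the real script data), counting how many edges touch each character reproduces the measure Saleem uses; the igraph package loaded later in this analysis offers the same calculation via `degree()`:

```r
# Toy character edge list (hypothetical pairs, for illustration only)
edges <- data.frame(
  from = c("Jon Snow", "Jon Snow", "Jon Snow",
           "Tyrion Lannister", "Tyrion Lannister"),
  to   = c("Sansa Stark", "Arya Stark", "Daenerys Targaryen",
           "Cersei Lannister", "Daenerys Targaryen")
)

# Degree centrality of an undirected graph: the number of edges
# incident on each node, i.e. tabulate both endpoint columns
deg <- sort(table(c(edges$from, edges$to)), decreasing = TRUE)
deg  # Jon Snow has the highest degree (3) in this toy network
```

On this toy network the most-connected character is Jon Snow, matching the conclusion Saleem draws from the full data.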
– I searched for text and data mining on the Game of Thrones series,
which led me to a plethora of examples; however, I had to narrow the
results down to sources that provide good visualizations and do not
take too much time to read. Once I began researching Game of Thrones
I came across Jeffrey Lancaster’s 19 More Game of Thrones Data
Visualizations page, which I rank as the gold standard. Jeffrey
Lancaster has access to a massive amount of data, including screen
time, character genders, event locations, and other details. Mr.
Lancaster provides links to where he found the code he modified to
create his visualizations, but no firsthand code. There is certainly a
point where too much information becomes unusable, and just because
you can do something does not mean you should. Here is one example
from Mr. Lancaster’s exhibition where a visualization is overwhelming:
If Mr. Lancaster had reduced this network visualization to a maximum
of ten characters it would have been easier to understand; as it is,
it looks like a giant yarn ball to me. There are at least twelve other
visualizations where I am not sure what information I can easily gain
from them. With that in mind, Sunita Parajuli’s Text Mining Project
came up next, and it was a wonderful example with some excellent R
code in it.
– Sunita Parajuli’s Text Mining Project answers some great questions,
such as who the top ten characters are, what the most frequent words
are, and three questions around sentiment analysis regarding
characters. The answers are great on their own, but Sunita’s
techniques are great too, for example: a character’s per-capita dialog
percentage, a word cloud of the top 200 words, and a sentiment
analysis of the top 200 words as well. I am not sure the word clouds
translate into usable information, but they are attention grabbing.
Sunita uses a great visualization for the overall character sentiment
for the first five seasons:
and after filtering for season and then episode this would add a
greater understanding of Game of Thrones. Sunita effectively covered:
the most used words in the series with a horizontal bar chart, the
number of lines of dialog per character with another horizontal bar
chart, a sentiment analysis for the entire series in a word cloud and
horizontal bar charts, and the most frequent terms and top ten
characters in another horizontal bar chart. Sunita does an excellent
job of exploratory analysis on the data set with R; the next
exploratory analysis is done with Python by Ayesha Saleem.
– Ayesha Saleem’s Network theory and Game of Thrones analysis uses
graph theory to evaluate two data sets, nodes and edges, through
Python, with entirely different data than what I have in my data set.
Ayesha Saleem’s Python analysis is a Circos plot that resembles Mr.
Lancaster’s plot, which must also have been a node networking
technique. To find the most important person in the show Ayesha uses
the degree centrality of each node, which measures the “number of
edges that are incident upon a node (for an undirected graph this is
the same as the outgoing nodes)”. If we compare the two charts, who is
the top character, and why? Is it the number of connections a
character has, or the amount of dialog a character has over the course
of the series? I suppose both could be right, depending on the
question asked. I think degree centrality is good at showing the
networks in the series, which I was not seeing in my own data set.
While Ayesha’s analysis provided a completely different perspective, I
found another analytical evaluation, by Akshay Goel, that focuses on
the deaths in Game of Thrones.
– Akshay Goel’s Applying analytics to HBO’s Game of Thrones (GoT)
article focuses on the deaths in the Game of Thrones series and how
they affect the show’s ratings on the popular media rating website
IMDB. Akshay measured the number of episodes with zero deaths versus
the total number of episodes and compared their ratings. The more
deaths that occurred, the higher the ratings: each episode in which a
character was killed received a higher rating, and episodes in which a
more important character was killed received an even higher rating.
Akshay mentions that he used machine learning to predict the death of
the “first major character” from “significant variables,” including:
“time of death, character house, consumer sentiment, airtime (till
death), character importance (subjective), character age, # of deaths
in the house in previous seasons,” which covers a ton of data. I wish
Akshay had mentioned where he found his data so the investigation
could be replicated. Akshay provided a great opinion piece with more
data on the deaths in Game of Thrones, and I will summarize the
articles next.
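Akshay’s zero-death comparison can be sketched in a few lines of base R. The ratings and death counts below are invented placeholders (his underlying data source is not published), but the grouping logic is the same:

```r
# Hypothetical episode ratings and death counts (made up for
# illustration; Akshay Goel does not publish his underlying data)
episodes <- data.frame(
  rating = c(8.1, 9.0, 9.5, 8.3, 9.7, 8.0),
  deaths = c(0,   2,   5,   0,   8,   1)
)

# Mean IMDB-style rating for zero-death episodes vs. episodes with deaths
means <- tapply(episodes$rating, episodes$deaths > 0, mean)
means
# FALSE  TRUE
#  8.20  9.05
```

With these placeholder numbers the episodes with at least one death average a higher rating, which is the pattern Akshay reports.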
– These were all great articles on Game of Thrones that provided
different perspectives and even different data on the series. The
authors provided a greater understanding of what is happening in Game
of Thrones, and that is what data mining is about: gaining a greater
understanding of your subject. A significant amount of exploratory
analysis was conducted to understand the subject(s) better, and
different data can add greater depth to the overall understanding.
With each additional layer of information provided by the
investigators, I now have a much greater understanding of Game of
Thrones than I did before I began this project. There has been only
one interpretation of the data, by Mr. Lancaster, about who would sit
on the throne at the end; with no other summaries or predictions, the
investigators are providing the facts.
invisible(rm(list = ls())) # clear the global environment first
invisible(gc()) # then release the unused memory
start_time <- Sys.time()
library(jsonlite)
#library(dplyr)
library(tidyr)
library(tidytext)
library(tidyverse, quietly = TRUE)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ dplyr 1.1.0
## ✔ tibble 3.1.8 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ purrr::flatten() masks jsonlite::flatten()
## ✖ dplyr::lag() masks stats::lag()
library(widyr)
library(ggplot2)
library(igraph)
library(ggraph)
library(topicmodels)
library(corpus)
library(gridExtra) #viewing multiple plots together
library(wordcloud) # word cloud generator
## Loading required package: RColorBrewer
if (!require("wordcloud2")){
install.packages("wordcloud2")
library(wordcloud2)} # word cloud generator
## Loading required package: wordcloud2
library(SnowballC) # for stemming the text
library(tm) # text mining
## Loading required package: NLP
library(reshape2) # reshapes a data frame
library(stm)
## stm v1.3.6 successfully loaded. See ?stm for help.
## Papers, resources, and other materials at structuraltopicmodel.com
I retrieved Alben Tumanggor’s Game of Thrones data set from Kaggle on April 22, 2023; it was last updated three years ago. I evaluated the number of variables and observations before and after cleaning. I renamed the variable “Sentence” to “text” to reduce overall work, then checked all the variable names in the data set. Finally, I checked each variable’s type to see whether any cleaning needed to be performed. The data structure is:
# got <- read_csv("Game_of_Thrones_Script.csv")
# #names(got)
# got <- got %>%
# mutate(text = Sentence, Sentence=NULL )
#
# #names(got)
#
# saveRDS(got,"got.rds")
#
# fix.contractions <- function(doc) {
# # "won't" is a special case as it does not expand to "wo not"
# doc <- gsub("won't", "will not", doc)
# doc <- gsub("can't", "can not", doc)
# doc <- gsub("n't", " not", doc)
# doc <- gsub("'ll", " will", doc)
# doc <- gsub("'re", " are", doc)
# doc <- gsub("'ve", " have", doc)
# doc <- gsub("'m", " am", doc)
# doc <- gsub("'d", " would", doc)
# # 's could be 'is' or could be possessive: it has no expansion
# doc <- gsub("'s", "", doc)
# return(doc)
# }
#
# # fix (expand) contractions
# got$text <- sapply(got$text, fix.contractions)
#
# got1<- got
# got1 <- got1 %>%
# mutate(line = row_number()) %>% # create a new col with a row number for each existing line
# unnest_tokens(word, text) %>% # tidy the words, text
# anti_join(stop_words) # filter out the stop words
#
# #distinct() #%>% # get rid of any duplicate records using
#
# got1$word <- text_tokens(got1$word, stemmer = "en") # english stemmer
# got1$word<- gsub("[^a-zA-Z]","",got1$word) # remove special characters
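Because the cleaning chunk above is commented out (it was run once and its result cached to got1.rds), here is a standalone copy of the same contraction-expansion step so it can be sanity-checked on a sample line:

```r
# Standalone copy of the contraction expansion used in the cached chunk above
fix.contractions <- function(doc) {
  # "won't" is a special case as it does not expand to "wo not"
  doc <- gsub("won't", "will not", doc)
  doc <- gsub("can't", "can not", doc)
  doc <- gsub("n't", " not", doc)
  doc <- gsub("'ll", " will", doc)
  doc <- gsub("'re", " are", doc)
  doc <- gsub("'ve", " have", doc)
  doc <- gsub("'m", " am", doc)
  doc <- gsub("'d", " would", doc)
  # 's could be 'is' or could be possessive, so it is simply dropped
  doc <- gsub("'s", "", doc)
  doc
}

fix.contractions("You won't kill the king's guard")
# "You will not kill the king guard"
```

Note the trade-off in the final rule: dropping every `'s` keeps ambiguous cases out of the token counts, at the cost of losing possessives such as “king’s”.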
my_colors <- c("#E69F00", "#56B4E9", "#009E73", "salmon", "#DAF7A6", "#FDD300","#89A8BF","darkblue","darkred")
#names(got)
got <- readRDS("got.rds")
#saveRDS(got1,"got1.rds")
got1 <- readRDS("got1.rds")
str(got1)
## tibble [88,392 × 7] (S3: tbl_df/tbl/data.frame)
## $ Release Date : Date[1:88392], format: "2011-04-17" "2011-04-17" ...
## $ Season : chr [1:88392] "Season 1" "Season 1" "Season 1" "Season 1" ...
## $ Episode : chr [1:88392] "Episode 1" "Episode 1" "Episode 1" "Episode 1" ...
## $ Episode Title: chr [1:88392] "Winter is Coming" "Winter is Coming" "Winter is Coming" "Winter is Coming" ...
## $ Name : chr [1:88392] "waymar royce" "waymar royce" "waymar royce" "waymar royce" ...
## $ line : int [1:88392] 1 1 1 1 1 1 1 1 2 2 ...
## $ word : chr [1:88392] "expect" "savag" "lot" "steal" ...
# summary(got1)
# glimpse(got1)
# head(got1,3)
write.csv(got1, "got.csv", row.names=FALSE)
The Game of Thrones data set begins with 23911 observations and 6
variables. After cleaning and tidying, the data set has 88392
observations and 7 variables. The variable names are Release Date,
Season, Episode, Episode Title, Name, and text. The Release Date
variable is a date, and the rest of the variables are character
variables.
After reviewing Sunita Parajuli’s Text Mining Project, I felt applying a similar approach to Sunita’s dialog-per-character visual was a good place to begin the Game of Thrones series investigation. I also borrowed Sunita’s bit of code for the gradient color scheme because it adds a nicer visual effect. I conducted a character word count for the series as a whole to see who was driving the stories, then a character word count for each season to see which characters were driving the series in each season.
# Character total word count
got1%>%
count(Name, sort = TRUE) %>% # get the n top words from the tidied, clean, filtered dataset using count() and top_n()
# slice(1:20) %>%
top_n(20) %>% # keep the top 20 characters
ungroup() %>%
mutate(Name = reorder(Name, n)) %>% # sort words according to the count using reorder()and reassign the ordered value to word using mutate()
ggplot(aes(Name, n)) +
geom_col(aes(fill=n), show.legend=FALSE) + # geom_col already uses stat = "identity"; with aes here, geom_text works properly
geom_text(aes(label= n ),color= "black", hjust=1.10)+
xlab("") +
scale_fill_gradient(low=my_colors[7], high=my_colors[1]) +
ggtitle("Characters' word count") +
coord_flip()+
theme_bw()+
theme(axis.text.x = element_blank())
## Selecting by n
# this is Sunita Parajuli's code for her "Lines of dialogue per character" visual
# got1 %>%
# count(Name) %>%
# arrange(desc(n)) %>%
# slice(1:20) %>%
# ggplot(aes(y=reorder(Name, n), x=n)) +
# geom_bar(stat="identity", aes(fill=n), show.legend=FALSE) +
# geom_label(aes(label=n)) +
# scale_fill_gradient(low="dodgerblue", high="dodgerblue4") +
# labs(x="Character", y="Lines of dialogue",
# title="Lines of dialogue per character") +
# theme_bw()
# Count names in got
# a table:
# matrix <-got1 %>%
# group_by(Season) %>%
# count(Name, sort = TRUE) %>% # get the n top words from the tidied, clean, filtered dataset using count() and top_n()
# slice(1:10) %>% # use top (1:digits)
# # top_n(10) %>% # counts the top 10
# ungroup() %>%
# mutate(Name = reorder(Name, n)) #%>% # reorder the counted subject by n
# library(data.table) # may or may not be necessary for the as.matrix
# matrix
# as.matrix(matrix[],rownames = "Season")
# top 10 character word count by season
got1 %>%
group_by(Season) %>%
count(Name, sort = TRUE) %>% # get the n top words from the tidied, clean, filtered dataset using count() and top_n()
slice(1:10) %>% # use top (1:digits)
# top_n(10) %>% # counts the top 10
ungroup() %>%
mutate(Name = reorder(Name, n)) %>% # reorder the counted subject by n
ggplot(aes(Name, n)) +
geom_col(aes(fill=n), show.legend=FALSE) + # geom_col already uses stat = "identity"; with aes here, geom_text works properly
scale_fill_gradient(low=my_colors[7], high=my_colors[1]) +
geom_text(aes(label= n ),color= "black",size=2.5, hjust=1.10)+
xlab("") +
#ylab("Word Count") +
coord_flip()+
labs(title="Characters' word count for each season")+
theme_bw()+
facet_wrap(~Season, ncol=3, scales= "free")+
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) # theme() has no x/y arguments; axis labels are set via xlab()/labs()
# top 15 words in season 8 Episode 6
got1 %>%
group_by(Season) %>%
filter(grepl("8",Season), grepl("6",Episode)) %>%
count(word, sort = TRUE) %>% # get the n top words from the tidied, clean, filtered dataset using count() and top_n()
slice(1:15) %>% # keep the top 15 words, matching the plot title
# top_n(10) %>% # counts the top 10
ungroup() %>%
mutate(word = reorder(word, n)) %>% # reorder the counted subject by n
ggplot(aes(word, n)) +
geom_col(aes(fill=n), show.legend=FALSE) + # geom_col already uses stat = "identity"; with aes here, geom_text works properly
scale_fill_gradient(low=my_colors[7], high=my_colors[1]) +
geom_text(aes(label= n ),color= "black",size=2.5, hjust=1.10)+
xlab("") +
#ylab("Word Count") +
coord_flip()+
labs(title="Top 15 words for Season 8 Episode 6")+
theme_bw()+
#facet_wrap(~Episode, ncol=2, scales= "free")+
theme(axis.text.x = element_blank(), axis.ticks.x = element_blank()) # theme() has no x/y arguments; axis labels are set via xlab()/labs()
The character word count was helpful for figuring out who the drivers of the series are. The character word count per season helped show which character was driving the conversations in each season. The top 15 words in Season 8 Episode 6 were helpful in determining an overall theme for the end of the series.
I thought using an STM would provide additional information on the themes or activities in the series, and it did provide basic insight into the Game of Thrones themes. The gamma model did not produce information as usable as the beta model, so I focused on the beta model to complement the STM model with additional insights.
# an STM
# processed <- textProcessor(got$text, metadata = got)
# out <- prepDocuments(processed$documents, processed$vocab, processed$meta)
# docs <- out$documents
# vocab <- out$vocab
# meta <-out$meta
#
# First_STM <- stm(documents = out$documents, vocab = out$vocab,
# K = 4,
# max.em.its = 75, data = out$meta,
# init.type = "Spectral", verbose = FALSE)
#
# saveRDS(First_STM,"First_STM.rds")
First_STM <- readRDS("First_STM.rds")
# plot(First_STM)
# labelTopics(First_STM, topics=c(1,2,3), n=10)# complete list of top 10 words per topics 3-5-9
plot.STM(First_STM, "labels", topics=c(1,2,3,4), label="frex", n=10, width=65) # top 10 FREX words for topics 1-4
word_counts <- got1 %>%
count(line, word, sort = TRUE) %>%
ungroup()
#word_counts
got_dtm <- word_counts %>%
cast_dtm(line, word, n)
#got_dtm
got_lda <- LDA(got_dtm, k = 4, control = list(seed = 1234))
#got_lda
tidy_lda <- tidy(got_lda) %>%
distinct()
#tidy_lda
top_terms <- tidy_lda %>%
group_by(topic) %>%
slice_max(beta, n = 20, with_ties = FALSE) %>% # maximum number of figures per topic
ungroup() %>% # ungroup the group_by(topic)
arrange(desc(beta)) %>% # arrange beta in desc order
unique() # allow only unique observations
#top_terms
#visual makes it easier to interpret
top_terms %>%
filter(!grepl("ser",term)) %>%
mutate(term = reorder_within(term, beta, topic)) %>%
group_by(topic, term) %>%
arrange(desc(beta)) %>%
ungroup() %>%
ggplot(aes(beta, term, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
scale_y_reordered() +
labs(title = "Top 20 terms in each LDA topic",
x = expression(beta), y = NULL) +
theme_bw()+
facet_wrap(~ topic, ncol = 4, scales = "free")+
theme(axis.text.x = element_blank(),axis.ticks.x = element_blank())
#8.4.3 interpreting the topic model
lda_gamma <- tidy(got_lda, matrix = "gamma") %>%
arrange(desc(gamma))
#lda_gamma # verify
# how are the probabilities distributed?
# ggplot(lda_gamma, aes(gamma)) +
# stat_bin(bins = 30)+
# # geom_histogram(alpha = 0.8) +
# scale_y_log10() +
# labs(title = "Distribution of probabilities for all topics",
# y = "Number of documents", x = expression(gamma))+
# theme_bw()
The topic modeling provided eighty items that, when analyzed, give definite insight into the series. There is a recurring set of themes across the topics: a king, Lannister, a son, a brother, killing and dying, sex, and Stark. These results show the main characters in the series, their relationships, significant locations, and the characters’ activities. This is helpful for informing me about the content of the series.
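The commented-out gamma plot above looks at the per-document topic probabilities. A minimal base-R sketch (with a made-up gamma matrix, not the fitted model’s values) shows how a dominant topic would be read off per line of dialog:

```r
# Hypothetical gamma matrix: rows = lines of dialog, cols = topics;
# each row gives the probability that the line belongs to each topic
gamma <- matrix(c(0.70, 0.10, 0.15, 0.05,
                  0.05, 0.80, 0.10, 0.05,
                  0.25, 0.25, 0.30, 0.20),
                nrow = 3, byrow = TRUE,
                dimnames = list(c("line1", "line2", "line3"),
                                paste0("topic", 1:4)))

# Dominant topic per line: the column with the largest probability
apply(gamma, 1, which.max)
# line1 -> topic 1, line2 -> topic 2, line3 -> topic 3
```

Note that line3’s probabilities are nearly uniform; when most rows look like that, as they did here, the gamma matrix adds little beyond the beta (per-topic word) view.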
Bigrams will be used to look for word networks that provide insight into the series. The bigram analysis will first be conducted on the whole series to reveal any series-wide connections. The next bigram analysis will be on Season 8 Episodes 5 and 6, to provide context for the end of the season and insight into the series ending.
# complete Series bigrams
got_bigrams0 <-got %>%
#filter(Season %in% c("Season 7","Season 8")) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% # tokenize the bigrams
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(nchar(word1) >3) %>%
filter(nchar(word2) >3) %>%
filter(!word1 %in% stop_words$word) %>% # remove stop_words
filter(!word2 %in% stop_words$word) %>% # remove stop_words
count(word1, word2, sort = TRUE) %>%
filter(n > 17) %>%
graph_from_data_frame(directed = FALSE)
set.seed(2323)
graph0<-ggraph(got_bigrams0, layout = "fr") +
geom_edge_link(aes(edge_alpha = n,edge_width=n), edge_colour = my_colors[7],show.legend = TRUE,
end_cap = circle(.07, 'inches')) +
geom_node_point(color = my_colors[2], size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()+
labs(title= "Game of Thrones Complete Series Bigrams")
# # Season 1&2 Bigrams
# got_bigrams <-got %>%
# filter(grepl("1|2",Season)) %>%
# unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% # tokenize the bigrams
# separate(bigram, c("word1", "word2"), sep = " ") %>%
# filter(nchar(word1) >3) %>%
# filter(nchar(word2) >3) %>%
# filter(!word1 %in% stop_words$word) %>% # remove stop_words
# filter(!word2 %in% stop_words$word) %>% # remove stop_words
# count(word1, word2, sort = TRUE) %>%
# filter(n > 8) %>%
# graph_from_data_frame(directed = FALSE)
#
#
# set.seed(2323)
# graph1<-ggraph(got_bigrams, layout = "fr") +
# geom_edge_link(aes(edge_alpha = n,edge_width=n), edge_colour = my_colors[7],show.legend = TRUE,
# end_cap = circle(.07, 'inches')) +
# geom_node_point(color = my_colors[2], size = 5) +
# geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
# theme_void()+
# labs(title= "Game of Thrones Season 1 & 2 Bigrams")
#
# # Season 3&4 bigrams
# got_bigrams2 <-got %>%
# filter(grepl("3|4",Season)) %>%
# unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% # tokenize the bigrams
# separate(bigram, c("word1", "word2"), sep = " ") %>%
# filter(nchar(word1) >3) %>%
# filter(nchar(word2) >3) %>%
# filter(!word1 %in% stop_words$word) %>% # remove stop_words
# filter(!word2 %in% stop_words$word) %>% # remove stop_words
# count(word1, word2, sort = TRUE) %>%
# filter(n > 8) %>%
# graph_from_data_frame(directed = FALSE)
#
#
# set.seed(2323)
# graph2<-ggraph(got_bigrams2, layout = "fr") +
# geom_edge_link(aes(edge_alpha = n,edge_width=n), edge_colour = my_colors[7],show.legend = TRUE,
# end_cap = circle(.07, 'inches')) +
# geom_node_point(color = my_colors[2], size = 5) +
# geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
# theme_void()+
# labs(title= "Game of Thrones Season 3 & 4 Bigrams")
#
#
# # Season 5&6 bigrams
# got_bigrams3 <-got %>%
# filter(grepl("5|6",Season)) %>%
# unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% # tokenize the bigrams
# separate(bigram, c("word1", "word2"), sep = " ") %>%
# filter(nchar(word1) >3) %>%
# filter(nchar(word2) >3) %>%
# filter(!word1 %in% stop_words$word) %>% # remove stop_words
# filter(!word2 %in% stop_words$word) %>% # remove stop_words
# count(word1, word2, sort = TRUE) %>%
# filter(n > 8) %>%
# graph_from_data_frame(directed = FALSE)
#
#
# set.seed(2323)
# graph3<-ggraph(got_bigrams3, layout = "fr") +
# geom_edge_link(aes(edge_alpha = n,edge_width=n), edge_colour = my_colors[7],show.legend = TRUE,
# end_cap = circle(.07, 'inches')) +
# geom_node_point(color = my_colors[2], size = 5) +
# geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
# theme_void()+
# labs(title= "Game of Thrones Season 5 & 6 Bigrams")
#
# # season 7 & 8 bigrams
# got_bigrams4 <-got %>%
# filter(grepl("7|8",Season)) %>%
# unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% # tokenize the bigrams
# separate(bigram, c("word1", "word2"), sep = " ") %>%
# filter(nchar(word1) >3) %>%
# filter(nchar(word2) >3) %>%
# filter(!word1 %in% stop_words$word) %>% # remove stop_words
# filter(!word2 %in% stop_words$word) %>% # remove stop_words
# count(word1, word2, sort = TRUE) %>%
# filter(n > 5) %>%
# graph_from_data_frame(directed = FALSE)
#
#
# set.seed(2323)
# graph4<-ggraph(got_bigrams4, layout = "fr") +
# geom_edge_link(aes(edge_alpha = n,edge_width=n), edge_colour = my_colors[7],show.legend = TRUE,
# end_cap = circle(.07, 'inches')) +
# geom_node_point(color = my_colors[2], size = 5) +
# geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
# theme_void()+
# labs(title= "Game of Thrones Season 7 & 8 Bigrams")
# Season 8, final 2 episodes bigrams
got_bigrams5 <-got %>%
filter(grepl("8",Season), grepl("5|6",Episode)) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% # tokenize the bigrams
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(nchar(word1) >3) %>%
filter(nchar(word2) >3) %>%
filter(!word1 %in% stop_words$word) %>% # remove stop_words
filter(!word2 %in% stop_words$word) #%>% # remove stop_words
got_bigrams5 %>% filter(grepl("kil", word1)|grepl("kil", word2)|grepl("king",word1)|grepl("king",word2)) %>%
select(Name, word1,word2,Season,Episode)
got_bigrams5 <- got_bigrams5 %>%
filter(grepl("kil", word1)|grepl("kil", word2)|grepl("king",word1)|grepl("king",word2)|grepl("tyrio",word1)|grepl("tyrio",word2)|grepl("queen",word1)|grepl("queen",word2)) %>%
count(word1, word2, sort = TRUE) %>%
filter(n >= 1) %>%
graph_from_data_frame(directed = FALSE)
set.seed(2323)
graph5<-ggraph(got_bigrams5, layout = "fr") +
geom_edge_link(aes(edge_alpha = n,edge_width=n), edge_colour = my_colors[7],show.legend = TRUE,
end_cap = circle(.07, 'inches')) +
geom_node_point(color = my_colors[2], size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()+
labs(title= "Game of Thrones Season 8 Episode 5 & 6 Bigrams")
# visual calls
graph0
# graph1
# graph2
# graph3
# graph4
graph5
The search through the bigrams for the whole series gave me a good
network for seeing word relationships, how characters were
interacting, and what their titles may have been. When I filter down
to a more precise subject, I get a more concise result. I filtered
Season 8 down to the last two episodes and then searched for the
words “kill”, “king”, “queen”, and “tyrion”. I figured the character
with the most dialog would be relevant to the end, and someone would
be on the throne. Based on the bigram visual there could be a King
Bran or Robert, and a Queen Daenerys. Based on the trigrams, which
come up next, Cersei was killed, so she will not be queen.
I tried trigrams to complement the bigrams, thinking the extra context would help me understand the bigrams more. I found the trigram network chart for the final season and episodes helpful in adding context for how the show ended.
got_trigrams0 <-got %>%
#filter(Season %in% c("Season 7","Season 8")) %>%
unnest_tokens(output = trigram, input = text, token = "ngrams", n = 3) %>% # tokenize the trigrams
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(nchar(word1) >3) %>%
filter(nchar(word2) >3) %>%
filter(nchar(word3) >3) %>%
filter(!word1 %in% stop_words$word) %>% # remove stop_words
filter(!word2 %in% stop_words$word) %>% # remove stop_words
filter(!word3 %in% stop_words$word) %>% # remove stop_words
count(word1, word2, word3, sort = TRUE) %>%
filter(n > 2) %>%
graph_from_data_frame(directed = FALSE)
# got_trigrams0
set.seed(2016)
trigraph0<-ggraph(got_trigrams0, layout = "fr") +
geom_edge_link(aes(edge_alpha = n,edge_width=n), edge_colour = my_colors[7],show.legend = TRUE,
end_cap = circle(.07, 'inches')) +
geom_node_point(color = my_colors[2], size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()+
labs(title= "Game of Thrones Complete Series trigrams")
# Season 8, Episodes 5 & 6 trigrams
got_trigrams7 <-got %>%
filter(grepl("8",Season),grepl("5|6",Episode), grepl("jon s|tyrio|bran|dae|sans|star|lann",Name)) %>%
unnest_tokens(output = trigram, input = text, token = "ngrams", n = 3) %>% # tokenize the trigrams
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(nchar(word1) >3) %>%
filter(nchar(word2) >3) %>%
filter(nchar(word3) >3) %>%
filter(!word1 %in% stop_words$word) %>% # remove stop_words
filter(!word2 %in% stop_words$word) %>% # remove stop_words
filter(!word3 %in% stop_words$word) #%>% # remove stop_words
got_trigrams7 %>%
select(Name, word1,word2,word3,Season,Episode)
graph_got_trigrams7<- got_trigrams7 %>%
select(word1,word2,word3)
graph_got_trigrams7 <- graph_got_trigrams7 %>%
# filter(grepl("kil", word1)|grepl("kil", word2)|
# grepl("king",word1)|grepl("king",word2)|
# grepl("tyrio",word1)|grepl("tyrio",word2)|
# grepl("queen",word1)|grepl("queen",word2)) %>%
count(word1, word2, word3, sort = TRUE) %>%
#filter(n > 1) %>%
graph_from_data_frame(directed = FALSE)
# graph_got_trigrams7
# visual to make it easier to read
set.seed(2016)
trigraph7<-ggraph(graph_got_trigrams7, layout = "fr") +
geom_edge_link(aes(edge_alpha = n,edge_width=n), edge_colour = my_colors[7],show.legend = TRUE,
end_cap = circle(.07, 'inches')) +
geom_node_point(color = my_colors[2], size = 5) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()+
labs(title= "Game of Thrones Season 8 Episode 5 & 6 trigrams")
trigraph0
trigraph7
The trigram investigation into the entire series did not provide any
insight beyond what the bigrams provided. I was unable to use the same
search and filtering parameters on Season 8 Episodes 5 & 6; there were
too few data points to be useful. I do see that Arya Stark gave the
order to “kill Queen Cersei,” which was significant.
I planned to conduct a tf-idf analysis of character word counts for all seasons; a second analysis of words across all seasons, to see whether there was any information to be gained; and one final analysis of Season 8, to see whether its six episodes could tell me a little more about what was happening near the end of the series.
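Before running the full analysis, tf-idf itself can be checked by hand. This base-R sketch uses a two-“season” toy corpus (invented words, not the script data) and mirrors what tidytext’s `bind_tf_idf()` computes: term frequency within a document times the log inverse document frequency across documents:

```r
# Hand-rolled tf-idf on a tiny toy corpus (two "seasons" of a few words each)
corpus <- list(
  season1 = c("king", "king", "stark", "winter"),
  season2 = c("dragon", "queen", "king")
)

tf_idf <- function(term, doc, corpus) {
  # term frequency: share of the document's words that are this term
  tf  <- sum(corpus[[doc]] == term) / length(corpus[[doc]])
  # inverse document frequency: log of (documents / documents containing term)
  idf <- log(length(corpus) / sum(sapply(corpus, function(d) term %in% d)))
  tf * idf
}

tf_idf("winter", "season1", corpus)  # 0.25 * log(2), distinctive to season1
tf_idf("king",   "season1", corpus)  # 0: "king" appears in every season
```

This is why season-distinguishing names like Ygritte score highly while words used everywhere score zero.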
# tfidf Names
got_names_tf_idf <- got1 %>%
filter(nchar(word)>2) %>%
count(Season, Name, sort = TRUE) %>%
bind_tf_idf(Name, Season, n) %>%
arrange(-tf_idf) %>%
group_by(Season) %>%
top_n(10) %>%
ungroup %>%
filter(!grepl("sol|qyb|dav|ring|gend",Name)) %>%
arrange(Season)
got_names_tf_idf %>%
mutate(Name = reorder_within(Name, tf_idf,Season)) %>%
ggplot(aes(Name, tf_idf, fill = Season)) +
geom_col(alpha = 0.4, show.legend = FALSE) +
geom_text(aes(label= round(tf_idf,2) ),color= "black", hjust=1.0) +
facet_wrap(~ Season, scales = "free", ncol = 4) +
coord_flip() +
theme(strip.text=element_text(size=7),axis.text.x = element_blank(),axis.ticks.x = element_blank()) +
labs(x = NULL, y = NULL,
title = "Highest tf-idf Names in Game of Thrones Seasons")+
scale_x_reordered()
# tf-idf Seasons
got_tf_idf <- got1 %>%
filter(nchar(word)>2) %>%
count(Season, word, sort = TRUE) %>%
bind_tf_idf(word, Season, n) %>%
arrange(-tf_idf) %>%
group_by(Season) %>%
top_n(10) %>%
filter(!grepl("dani",word)) %>%
ungroup %>%
arrange(Season)
got_tf_idf %>%
  mutate(word = reorder_within(word, tf_idf, Season)) %>%
  ggplot(aes(word, tf_idf, fill = Season)) +
  geom_col(alpha = 0.4, show.legend = FALSE) +
  geom_text(aes(label = round(tf_idf, 3)), color = "black", hjust = 1.0) +
  facet_wrap(~ Season, scales = "free", ncol = 4) +
  coord_flip() +
  theme(strip.text = element_text(size = 7),
        axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
  labs(x = NULL, y = "tf-idf",
       title = "Highest tf-idf words in Game of Thrones Seasons") +
  scale_x_reordered()
# tf-idf for the episodes in Season 8
got_tf_idf <- got1 %>%
  filter(grepl("8", Season)) %>%  # equivalently: filter(Season %in% "Season 8")
  filter(nchar(word) > 2) %>%
  count(Episode, word, sort = TRUE) %>%
  bind_tf_idf(word, Episode, n) %>%
  arrange(-tf_idf) %>%
  group_by(Episode) %>%
  top_n(8) %>%
  filter(!grepl("dani", word)) %>%  # drop "dani" name fragments
  ungroup() %>%
  arrange(Episode)
got_tf_idf %>%
  mutate(word = reorder_within(word, tf_idf, Episode)) %>%
  ggplot(aes(word, tf_idf, fill = Episode)) +
  geom_col(alpha = 0.4, show.legend = FALSE) +
  geom_text(aes(label = round(tf_idf, 2)), color = "black", hjust = 1.0) +
  facet_wrap(~ Episode, scales = "free", ncol = 4) +
  coord_flip() +
  theme(strip.text = element_text(size = 7),
        axis.text.x = element_blank(), axis.ticks.x = element_blank()) +
  labs(x = NULL, y = "tf-idf",
       title = "Highest tf-idf words in Game of Thrones Season 8 Episodes") +
  scale_x_reordered()
The tf-idf analysis yielded a little information in each run I
conducted, and when I synthesized all the results into a larger list I
was able to gather more from this part of the analysis. The names
analysis surfaced Robert and Stannis Baratheon, Stark continuously,
Ygritte in Season 3, Tywin Lannister three times, Drogon, Lady Crane,
Harry, and Sam Tarly; there were others, but these stood out the most.
The words from all the seasons did not make a lot of sense to me, and I
do not think they furthered my understanding of Game of Thrones.
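One likely reason the season-level word lists looked odd is how tf-idf is defined: a word that appears in every season has an inverse document frequency of zero, so series-wide themes (kings, death, family) vanish from the rankings and only season-specific terms remain. A minimal demonstration of `bind_tf_idf()` on toy data (the seasons and words here are made up for illustration):

```r
library(dplyr)
library(tidytext)

# Two tiny "seasons": "throne" appears in both, the others in only one
toy <- tibble(
  Season = c("S1", "S1", "S2", "S2"),
  word   = c("throne", "dragon", "throne", "winter")
) %>%
  count(Season, word) %>%
  bind_tf_idf(word, Season, n)

# "throne" occurs in both documents, so idf = log(2/2) = 0 and tf-idf = 0;
# "dragon" and "winter" are season-specific and receive positive scores.
```

This is why the highest-scoring entries tend to be names and one-off terms rather than the show's recurring themes.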
# Analysis Findings
This analysis began with questions I wanted answered about Game of Thrones, which led to a review of four pieces of literature on the series and then into my own analysis. The questions were what the main ideas or themes of the show, its seasons, and its episodes are, and who the key characters in the series were. After developing these questions, I set about researching whether anyone had already conducted data analysis on Game of Thrones. That research was very productive: I found Sunita Parajuli's Text Mining Project almost immediately. Parajuli covered the most-used words, the characters with the most dialog, and sentiment analysis so thoroughly that I did not think I could offer any additional sentiment analysis without getting into extremely specific elements of the show. Jeffrey Lancaster's Game of Thrones was the next report I reviewed, and I recognized that its data came from a very different source. Lancaster did not provide any code to review, and while his visuals were thorough, many were too busy to make sense of and answered questions no one had asked. The next analysis I reviewed was Ayesha Saleem's network study, which used the same data as Lancaster but performed the analysis in Python; her network analysis provided a unique perspective on who the main characters were. Finally, I came across Akshay Goel's Applying analytics to HBO's Game of Thrones, an article primarily focused on character deaths and how they drove ratings: the more important the character, the higher the ratings climbed when that character died. With these four articles reviewed, I felt I needed to shift my questions away from sentiment analysis and toward the themes and characters of Game of Thrones.
It was unexpected to find so many useful sources of information on the series, so my analysis shifted slightly. I started with Alben Tumanggor's Game of Thrones Script All Seasons dataset: cleaning, removing stopwords, correcting contractions, stemming, removing special characters from the script, and finally tidying the data. My analysis began with a top-twenty character word count for the series, and then a top-ten character word count for each season, to understand whose dialog was driving the show overall and in each season. After learning which characters were driving the show, I modeled the data and learned a great deal about it. The modeling showed that the significant factors were the amount of death and killing in the show, the Lannister and Stark families, kings and queens, the people, families, and sex as the main themes. The modeling was a good start to adding context, so next up were bigrams. I continued to explore the themes and main characters with a bigram exploration of the whole series and of episodes five and six of Season 8. I fine-tuned the Season 8 episode 5 & 6 exploration by searching for bigrams containing "kill" or "king," which surfaced "kill Cersei," "kills Cersei," "king Robert," and finally "king Bran." I found the chart of bigrams easier to read than the network visualizations, though the visuals did show how the elements joined together. The bigram chart described a cutthroat series, with every major character going for the throne and Bran apparently king in the end. While bigrams helped surface additional information, I still wanted to try trigrams; they provided nothing beyond the bigrams. I also tried term frequency-inverse document frequency (tf-idf) and gained a little more insight into the characters in particular seasons.
More research into additional literature could be conducted to further the understanding of the series. A more specific analysis could also be run on this data set, and on additional data sets, to increase the amount of data analyzed. The most interesting finding was that tools I expected to yield impressive results, such as tf-idf and more in-depth bigrams, did not provide the insight they had in previous analyses.
Goel, A. (2019, April 12). Applying analytics to HBO's Game of Thrones (GoT). LinkedIn. Retrieved April 26, 2023, from https://www.linkedin.com/pulse/applying-analytics-hbos-game-thrones-got-akshay-goel
Lancaster, J. (n.d.). Game of Thrones. Retrieved April 25, 2023, from https://jeffreylancaster.com/game-of-thrones/
Parajuli, S. (n.d.). Text Mining Project. RPubs. Retrieved April 25, 2023, from https://rpubs.com/shinygalena/854607
Saleem, A. (2022, September 1). Network theory and Game of Thrones - a perfect combination. Data Science Dojo. Retrieved April 26, 2023, from https://datasciencedojo.com/blog/network-theory-game-of-thrones/
Tumanggor, A. (2019, November 19). Game of Thrones script all seasons. Kaggle. Retrieved April 26, 2023, from https://www.kaggle.com/datasets/albenft/game-of-thrones-script-all-seasons
This report took 15.12 seconds to complete.