\(~\)
This is probably the nerdiest thing you’ll ever read about Pokémon.
The beauty of this game is that, fundamentally, it’s one of numbers. A bevy of formulas and calculations goes into every nuance of what happens in-game: stat modifiers, critical hits, how much damage each move does (which takes into account many variables based on stats) and even held items and their effects.
It’s what makes it such a rich game, with a simple and great lore (especially for a game made for children!).
As an avid player of the Pokémon franchise of games since childhood, I pondered: “What could I do to show off my skill in a domain that I could really sink my teeth into and relate with data-wise for this portfolio piece?”
Having played through the Gen 2 games (think Gold, Silver and Crystal), I wanted to do a project that analyzed trends in the Pokémon in my storage boxes.
As I journeyed around the Johto and Kanto regions, even back in the early days, I became entrenched in a never-ending journey.
…
\(~\)
Before anything, make sure your working directory is set where you want to store all of your project’s data.
You can do that with these functions:
getwd()
# Tells us where your working directory currently is
and:
setwd()
# Tells R to set your working directory to a specific path
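As a small sketch of how these two functions fit together, the snippet below switches to a temporary folder and back again. It relies on setwd() invisibly returning the previous working directory; tempdir() is only a stand-in for wherever your project actually lives.

```r
# Remember where we started
original_dir <- getwd()

# setwd() returns the previous working directory (invisibly),
# so it can be captured and restored later.
# tempdir() is just a placeholder path for illustration.
previous_dir <- setwd(tempdir())

# We are now working inside the temporary folder
current_dir <- getwd()
print(current_dir)

# Go back to where we started
setwd(previous_dir)
```

Capturing the return value this way means you never have to retype the original path by hand.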
Then, for setup, we’ll need to load the needed libraries and turn off scientific notation.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.0
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
options(scipen = 99)
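To see what that option changes, here is a minimal sketch using a made-up large number (not a value from the data set). With R's default settings, format() falls back to scientific notation for a value like this, while a high scipen penalty makes it print the digits in full.

```r
# Set the default penalty and keep the previous setting for later
old_opts <- options(scipen = 0)

# Hypothetical large value, for illustration only
big_number <- 1e10

# Default penalty: scientific notation is shorter, so R uses it
default_text <- format(big_number)

# High penalty against scientific notation: fixed notation wins
options(scipen = 99)
fixed_text <- format(big_number)

print(default_text)
print(fixed_text)

# Restore whatever was set before this sketch ran
options(old_opts)
```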
I’ll assign the .csv of my Pokémon Crystal box data to its own data set…
CrystalBox <- read.csv(file = "Data/CrystalBoxData.csv",
stringsAsFactors = FALSE)
If you want (and I recommend this), use the code below to get rid of the row names column:
# CrystalBox$X <- NULL
And then we can look at the data using these functions:
# Look at the entire data set
View(CrystalBox)
# Look at the first few rows and their columns
head(CrystalBox)
# This gives quick summary statistics for each column in the data
summary(CrystalBox)
The code below allows us to view the class for each column in the data set, and I’ll do so for the first 6 columns.
This should give us an idea of what we’re working with, as each class of variable comes with its own set of rules.
str(CrystalBox[,1:6])
## 'data.frame': 95 obs. of 6 variables:
## $ Sprite : logi NA NA NA NA NA NA ...
## $ Position: chr "BOX1:01" "BOX1:02" "BOX1:03" "BOX1:04" ...
## $ Nickname: chr "LUGIA" "SCIZOR" "MILTANK" "ABRA" ...
## $ Species : chr "Lugia" "Scizor" "Miltank" "Abra" ...
## $ Nature : chr "Hardy" "Hardy" "Hardy" "Hardy" ...
## $ Gender : chr "-" "M" "F" "M" ...
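As a hedged illustration of why the class matters, the sketch below uses a toy data frame (levels borrowed from three of the Pokémon in this post, but deliberately stored as text) to show that a character column has to be converted before it can be averaged. The toy_box name is just a placeholder.

```r
# Toy data frame where Level was read in as text
toy_box <- data.frame(
  Species = c("Lugia", "Abra", "Ditto"),
  Level   = c("100", "10", "10"),
  stringsAsFactors = FALSE
)

# Check the class of every column at once
column_classes <- sapply(toy_box, class)
print(column_classes)

# Character columns cannot be averaged directly, so convert first
toy_box$Level <- as.integer(toy_box$Level)
mean_level <- mean(toy_box$Level)
print(mean_level)
```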
The .csv that the data was sourced from does not include types for the Pokémon listed, so it would serve us well to add them, as those columns will help us answer many interesting questions when analyzing the data. I’ve further elaborated on the need for this a little ways down.
CrystalBox$Type1 <- NA
CrystalBox$Type2 <- NA
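If you would rather fill these columns in code than by hand, one option is a small lookup table matched on Species. The sketch below uses a toy data frame and a hypothetical type_lookup table covering only three species; a real lookup would need one row for every species in the boxes.

```r
# Toy stand-in for the box data, with empty type columns
toy_box <- data.frame(
  Species = c("Lugia", "Scizor", "Miltank", "Lugia"),
  Type1 = NA_character_,
  Type2 = NA_character_,
  stringsAsFactors = FALSE
)

# Hypothetical lookup table: one row per species
type_lookup <- data.frame(
  Species = c("Lugia", "Scizor", "Miltank"),
  Type1 = c("Psychic", "Bug", "Normal"),
  Type2 = c("Flying", "Steel", NA),
  stringsAsFactors = FALSE
)

# match() finds the lookup row for each entry in the box
lookup_row <- match(toy_box$Species, type_lookup$Species)
toy_box$Type1 <- type_lookup$Type1[lookup_row]
toy_box$Type2 <- type_lookup$Type2[lookup_row]

print(toy_box)
```

Because match() works per row, repeated species (like the second Lugia here) pick up the same types automatically.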
Next, we’ll select specific columns and then rewrite into a new data set. Some of the columns in the original data set aren’t even applicable to the way some of the mechanics in generation 2 Pokémon games work. That, or they’re just not useful. So for the purpose of this script and analysis they can be removed.
I’ll subset the needed columns using this script in dplyr:
CrystalBox <-
CrystalBox %>%
select(Species, Gender, HP_Type, Type1, Type2, Move1, Move2, Move3, Move4,
HeldItem, EXP, Level, HP, ATK, DEF, SPA, SPD, SPE,
MetLoc, MetLevel, OT, HP_IV, ATK_IV, DEF_IV, SPD_IV,
SPE_IV, HP_EV, ATK_EV, DEF_EV, SPA_EV, SPD_EV, SPE_EV,
IsNicknamed, IsShiny)
Once everything is completed, for this part, it’s time to write this into a .csv!
…
We’ll now view the changed dataset. As you can see, it’s been cut from 90 columns to 32.
View(CrystalBox)
Above, I created two new columns, as some Pokémon have two types.
Now you can either export the current .csv to Excel and edit the data manually, or use c("Psychic", "Bug", etc.) to enter the types in R. However, for the 95 rows of different Pokémon available, doing that will take some time.
So instead, I am using write.csv() to export the data and edit it in Excel, using the call below.
It’s important to use row.names = FALSE, because when importing the edited dataset, an unwanted column usually called “X” will appear as a placeholder. While it’s simple to remove in the console, just do this step instead.
write.csv(x = CrystalBox,
"Data/CrystalBoxData3.csv",
row.names = FALSE)
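To make the “X” column issue concrete, here is a small round-trip sketch. It writes a toy data frame to temporary files (so nothing in the project folder is touched) with and without row names, then reads both back. The toy_box name and file handles are placeholders.

```r
# Toy data frame with two of the Pokémon from this post
toy_box <- data.frame(
  Species = c("Lugia", "Scizor"),
  Level = c(100L, 50L),
  stringsAsFactors = FALSE
)

# Temporary files for the two variants
with_names <- tempfile(fileext = ".csv")
without_names <- tempfile(fileext = ".csv")

# Default behaviour: row names are written as an unnamed first column
write.csv(toy_box, with_names)
cols_with <- names(read.csv(with_names, stringsAsFactors = FALSE))

# row.names = FALSE: only the real columns are written
write.csv(toy_box, without_names, row.names = FALSE)
cols_without <- names(read.csv(without_names, stringsAsFactors = FALSE))

print(cols_with)     # includes the placeholder column "X"
print(cols_without)  # matches the original column names
```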
EDA may seem boring to many, I’m sure, but it’s important to always do this when encountering a new dataset, just to see what outliers or incongruities may exist in the data before actually working with it. Plus, doing this step first can save everyone time later if done correctly.
…
\(~\)
Now that we have looked at our data, cleaned it, doused it in Pine-Sol and given it a good scrub, we’ll reload tidyverse and take a look at some sample questions we can answer with our data to gain some insights.
library(tidyverse)
library(knitr)
# Import data
CrystalBox <- read.csv(file = "Data/CrystalBoxData3.csv",
stringsAsFactors = FALSE)
…
\(~\)
Let’s run a summary to glean some insights. …
Species | Gender | Type1 | Type2 | Move1 | Move2 | Move3 | Move4 | EXP | Level | HeldItem |
---|---|---|---|---|---|---|---|---|---|---|
Lugia | - | Psychic | Flying | Aeroblast | Recover | Safeguard | Toxic | 1250000 | 100 | Leftovers |
Scizor | M | Bug | Steel | Swords Dance | Hidden Power | Agility | Baton Pass | 125000 | 50 | Gold Berry |
Miltank | F | Normal | NA | Body Slam | Rollout | Defense Curl | Milk Drink | 156250 | 50 | King’s Rock |
Abra | M | Psychic | NA | Teleport | (None) | (None) | (None) | 560 | 10 | (None) |
Granbull | M | Normal | NA | Curse | Shadow Ball | Return | Hidden Power | 100000 | 50 | TwistedSpoon |
Ditto | - | Normal | NA | Transform | (None) | (None) | (None) | 1000 | 10 | (None) |
\(~\)
Columns 1-8 are very interesting, especially the gender and type columns, so I’ll run table() to investigate further.
table(CrystalBox$Gender)
##
## - F M
## 13 18 64
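According to the data, I have 64 male Pokémon, 18 female Pokémon and 13 genderless. As an aside, there is a programming quirk in Gen 2 that biases stats towards male Pokémon: gender is derived from the Attack DV, so female Pokémon end up with lower Attack DVs than males of the same species.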
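Raw counts are easier to compare as shares of the whole box. The sketch below rebuilds a gender vector from the three counts reported above (rather than reading the real file) and converts it to percentages with prop.table().

```r
# Rebuild the Gender column from the counts reported above
gender <- rep(c("-", "F", "M"), times = c(13, 18, 64))

gender_counts <- table(gender)

# Convert counts to percentages of the whole box
gender_pct <- round(100 * prop.table(gender_counts), 1)
print(gender_pct)
```

On the real data, the same idea would be prop.table(table(CrystalBox$Gender)).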
So next, let’s take a better look at the types of Pokémon we have.
## # A tibble: 95 × 3
## Species Type1 Type2
## <chr> <chr> <chr>
## 1 Lugia Psychic Flying
## 2 Scizor Bug Steel
## 3 Miltank Normal <NA>
## 4 Abra Psychic <NA>
## 5 Granbull Normal <NA>
## 6 Ditto Normal <NA>
## 7 Metapod Bug <NA>
## 8 Kakuna Bug <NA>
## 9 Unown Psychic <NA>
## 10 Vulpix Fire <NA>
## # … with 85 more rows
I recommend assigning the above to a different data set to view everything in detail.
Apparently, I have an overwhelming number of Water types in my boxes, at 21, as we can see here:
CrystalBox %>%
filter(Type1 == "Water") %>%
count(Type1)
## Type1 n
## 1 Water 21
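Filtering for one type at a time works, but every type can also be counted in a single pass. This is a sketch on a made-up Type1 vector (not the real box contents), using base R's table() and sort().

```r
# Toy Type1 column (made-up values)
type1 <- c("Water", "Bug", "Water", "Normal", "Water", "Bug")

# Count every type at once and put the most common first
type_counts <- sort(table(type1), decreasing = TRUE)
print(type_counts)

# Name of the most common type
most_common <- names(type_counts)[1]
print(most_common)
```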
Most of my Pokémon that do have a second type are Flying; they’re likely to be legendary Pokémon.
It seems many of the Pokémon I have are of only one type, judging by the number of zeros in this table, which account for the NAs being reported. In fact, we can run the code below to make sure…
CrystalBox %>%
filter(!complete.cases(Type1, Type2)) %>%
pull(Species)
## [1] "Miltank" "Abra" "Granbull" "Ditto" "Metapod"
## [6] "Kakuna" "Unown" "Vulpix" "Sandshrew" "Ampharos"
## [11] "Snorlax" "Sentret" "Koffing" "Dratini" "Suicune"
## [16] "Furret" "Remoraid" "Lickitung" "Marill" "Persian"
## [21] "Electabuzz" "Krabby" "Hitmonlee" "Alakazam" "Dragonair"
## [26] "Raikou" "Entei" "Sudowoodo" "Stantler" "Seel"
## [31] "Arcanine" "Ursaring" "Magmar" "Blissey" "Meganium"
## [36] "Typhlosion" "Machamp" "Togepi" "Tauros" "Vaporeon"
## [41] "Pinsir" "Psyduck" "Pikachu" "Smeargle" "Tangela"
## [46] "Seaking" "Mewtwo" "Donphan" "Feraligatr" "Umbreon"
## [51] "Espeon"
The above gives us a list of Pokémon that do not have a 2nd type, but that won’t negatively affect what we need to do with our data as it’s commonplace in Pokémon for them to only have one Type.
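For what it's worth, since Type1 is never missing, the complete.cases() check above boils down to asking whether Type2 is NA. The sketch below shows the equivalence on a toy data frame built from four of the Pokémon in this post.

```r
toy_box <- data.frame(
  Species = c("Lugia", "Miltank", "Scizor", "Abra"),
  Type1 = c("Psychic", "Normal", "Bug", "Psychic"),
  Type2 = c("Flying", NA, "Steel", NA),
  stringsAsFactors = FALSE
)

# complete.cases() is TRUE only when every supplied column is non-missing
incomplete <- !complete.cases(toy_box$Type1, toy_box$Type2)

# Type1 is never missing here, so this is the same as checking Type2 alone
single_typed <- is.na(toy_box$Type2)

print(toy_box$Species[incomplete])
```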
Let’s take a look at EXP and Levels:
## EXP Level
## Min. : 100 Min. : 5.00
## 1st Qu.: 3763 1st Qu.: 15.50
## Median : 120500 Median : 50.00
## Mean : 121588 Mean : 37.73
## 3rd Qu.: 146879 3rd Qu.: 50.00
## Max. :1250000 Max. :100.00
That’s about right considering the volume of Pokémon in my boxes that are around level 50.
Let’s take a look at items:
##
## (None) Amulet Coin Berry Juice Berserk Gene Bitter Berry Black Belt
## 46 1 1 3 1 1
## Brick Piece BrightPowder Burnt Berry Charcoal Focus Band Gold Berry
## 1 1 1 1 1 2
## Hard Stone King's Rock Leftovers Light Ball Lucky Egg Metal Coat
## 1 1 8 1 1 1
## Metal Powder Miracle Seed MiracleBerry Mystic Water Pink Bow Poison Barb
## 1 3 1 3 1 2
## Quick Claw Sacred Ash Scope Lens Silver Leaf SilverPowder Smoke Ball
## 2 1 1 1 1 2
## Soft Sand Thick Club TwistedSpoon
## 1 1 1
It seems as though Leftovers is the most common item among my boxed Pokémon. A lot of the sets I’ve made were purpose-built for competitive battling and other nerdiness…so that’s about right.
…
\(~\)
Now let’s take a look at the Pokémon’s stats. For ease of use, I’ll use tidyverse functions.
\(~\)
For starters:
Which Pokémon has the highest Attack stat?
CrystalBox %>%
filter(ATK == max(ATK)) %>%
pull(Species)
## [1] "Ho-Oh"
This gives us Ho-Oh the Rainbow Pokémon!
…
Which Pokémon has the highest level?
CrystalBox %>%
filter(Level == max(Level)) %>%
pull(Species)
## [1] "Lugia" "Ho-Oh" "Mewtwo"
It’s a three-way tie between Lugia, Ho-Oh and Mewtwo.
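Ties like this are why comparing against max() is handy. As a sketch on a toy data frame (levels taken from Pokémon mentioned in this post), which.max() stops at the first maximum, while the comparison keeps every tied row.

```r
toy_box <- data.frame(
  Species = c("Lugia", "Scizor", "Ho-Oh", "Abra", "Mewtwo"),
  Level = c(100, 50, 100, 10, 100),
  stringsAsFactors = FALSE
)

# which.max() returns only the first position of the maximum
first_only <- toy_box$Species[which.max(toy_box$Level)]

# Comparing against max() keeps every tied row
all_tied <- toy_box$Species[toy_box$Level == max(toy_box$Level)]

print(first_only)
print(all_tied)
```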
…
Which Pokémon has the highest Stat totals among them?
CrystalBox %>% group_by(Species) %>%
summarise(total_stats = sum(HP, ATK, DEF, SPA, SPD, SPE)) %>%
filter(total_stats == max(total_stats))
## # A tibble: 3 × 2
## Species total_stats
## <chr> <int>
## 1 Ho-Oh 2053
## 2 Lugia 2053
## 3 Mewtwo 2053
Again, we have a three-way tie between Lugia, Ho-Oh and Mewtwo. All three have the same base stat total, so at level 100 (the maximum) with maximum StatEXP, they all end up with the same total stats.
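As a rough check on that tie, the sketch below plugs base stats into the Gen 2 stat formula as it is commonly documented by the community. The formula, the gen2_stat() helper name and the base stat values are my assumptions rather than something read from the box data; with maximum DVs and StatEXP at level 100, they reproduce the 2053 total reported above for all three.

```r
# Gen 2 stat formula as commonly documented by the community (an assumption,
# not something taken from the save data). DVs range from 0 to 15 and
# Stat EXP from 0 to 65535.
gen2_stat <- function(base, level = 100, dv = 15, stat_exp = 65535, is_hp = FALSE) {
  # Stat EXP contributes floor(ceiling(sqrt(stat_exp)) / 4), capped at 255 before dividing
  exp_bonus <- floor(pmin(255, ceiling(sqrt(stat_exp))) / 4)
  core <- floor(((base + dv) * 2 + exp_bonus) * level / 100)
  # HP adds level + 10; every other stat adds a flat 5
  ifelse(is_hp, core + level + 10, core + 5)
}

# Base stats in the order HP, ATK, DEF, SPA, SPD, SPE (assumed values)
base_stats <- list(
  "Lugia"  = c(106,  90, 130,  90, 154, 110),
  "Ho-Oh"  = c(106, 130,  90, 110, 154,  90),
  "Mewtwo" = c(106, 110,  90, 154,  90, 130)
)
is_hp <- c(TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)

# Total of all six stats at level 100 with maximum DVs and Stat EXP
max_totals <- sapply(base_stats, function(b) sum(gen2_stat(b, is_hp = is_hp)))
print(max_totals)
```

Each species spreads its base stats differently, but because the three base stat totals match, the six-stat sums match too.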
However, what do “base stat totals” and “StatEXP” mean? Let’s take a look.
…
\(~\)
This section highlights visualizations that give further context to the data we have, while also giving insight into which visualization method is best suited for a given use case.
ggplot(CrystalBox, aes(y = Gender, fill = OT)) +
geom_bar() +
xlim(0,10) +
xlab("Count") +
ylab("Gender") +
ggtitle("Gender of Pokémon by Type & Original Trainer") +
theme_light() +
facet_wrap(Type1 ~ .)
ggplot(CrystalBox, aes(y = HeldItem, fill = OT)) +
geom_bar() +
ggtitle("Held Items by Sex & Original Trainer") +
xlab("Count") +
ylab("Item") +
facet_wrap(Gender ~ .)
ggplot(CrystalBox, aes(x = Gender, fill = OT)) +
geom_bar() +
ggtitle("# of Pokémon by Sex and Original Trainer") +
xlab("Sex") +
ylab("Count")
#One variable, continuous
ggplot(CrystalBox, aes(x = Level, fill = Gender)) +
geom_histogram(bins = 5, position = "stack") +
ggtitle("Count of Pokémon by Level, Original Trainer & Gender",
subtitle = "From Levels 0 to 100") +
ylab("Count") +
theme(axis.text.x = element_text(angle = 90),
axis.text.y = element_text(angle = 0)) +
facet_grid(cols = vars(OT))
…
#Ver. 1
ggplot(CrystalBox, aes(x = OT, y = Level, color = EXP)) + geom_count() +
ggtitle("Level of Pokémon by Original Trainer",
subtitle = "w/ Experience Points") +
xlab("Original Trainer") +
theme_minimal()
#Ver. 2 — Using Facet Wrap
gender.labs <- c("Genderless", "Male", "Female")
# Names must match the values stored in the Gender column ("-", "M", "F")
names(gender.labs) <- c("-", "M", "F")
ggplot(CrystalBox, aes(x = Level, y = OT, color = EXP)) + geom_count() +
ggtitle("Level of Pokémon by Original Trainer",
subtitle = "w/ Experience Points") +
xlab("Level") +
ylab("Original Trainer") +
theme_linedraw() +
facet_wrap(Gender ~ ., labeller = labeller(Gender = gender.labs))
ggplot(CrystalBox, aes(x = Level, y = after_stat(density), fill = OT)) +
geom_histogram(bins = 6) +
geom_density(kernel = "gaussian") +
ggtitle("Boxed Pokémon Levels & Density by Original Trainer") +
scale_fill_discrete(name = "Trainer") +
xlab("Level") +
ylab("Density") +
theme_light()
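For context on what a density y axis means, here is a sketch in base R rather than ggplot2, on made-up levels. hist(..., plot = FALSE) returns both counts and densities, and the densities are scaled so that the bar areas add up to 1, which is what lets a histogram share an axis with a density curve.

```r
# Toy levels (made-up values)
toy_levels <- c(5, 10, 10, 50, 50, 50, 50, 100)

# Base R histogram computed without drawing it
h <- hist(toy_levels, breaks = seq(0, 100, by = 20), plot = FALSE)

# Counts add up to the number of observations...
total_count <- sum(h$counts)

# ...while densities are scaled so that bar areas add up to 1
bar_areas <- h$density * diff(h$breaks)
total_area <- sum(bar_areas)

print(total_count)
print(total_area)
```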
ggplot(CrystalBox, aes(x = StatTotals, y = after_stat(density), fill = OT)) +
geom_histogram(bins = 6) +
geom_density(kernel = "gaussian") +
# Apply the complete theme first; calling theme_light() last would reset the axis-title tweaks below
theme_light() +
theme(axis.title.x = element_text(face = "bold",
margin = margin(t = 0, r = 0, b = 10, l = 0)),
axis.title.y = element_text(face = "bold",
margin = margin(t = 0, r = 10, b = 0, l = 0))) +
ggtitle("Boxed Pokémon Total Stats & Density by Original Trainer") +
scale_fill_discrete(name = "Trainer") +
xlab("Total Stats") +
ylab("Density")
…
ggplot(CrystalBox, aes(x = Level, y = StatTotals)) +
geom_violin(draw_quantiles = c(0.25, 0.50, 0.75), trim = FALSE) +
xlim(0,100) +
ylab("StatTotals") +
ggtitle("Stat Totals by Level") +
coord_flip()
…
#This shows the correlation between Hit points and offensive stats by level
ggplot(CrystalBox, aes(x = HP, y = ATK, size = Level)) +
geom_point() +
geom_line(linewidth = 1, color = "sky blue") +
xlab("Hit Points") +
ylab("Attack") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 20)) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 15)) +
ggtitle("Hit Points by Attack stat and Level") +
theme(axis.text.x = element_text(face = "bold"),
axis.text.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold",
margin = margin(t = 8, r = 10, b = 0, l = 0)),
axis.title.y = element_text(face = "bold",
margin = margin(t = 0, r = 10, b = 0, l = 0)),
axis.line = element_line(colour = "black",
linewidth = 1, linetype = "solid"))
ggplot(CrystalBox, aes(x = HP, y = SPA, size = Level)) +
geom_point() +
geom_line(linewidth = .75, color = "red") +
xlab("Hit Points") +
ylab("Special Attack") +
scale_x_continuous(breaks = scales::pretty_breaks(n = 20)) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 15)) +
ggtitle("Hit Points by Special Attack stat and Level") +
theme(axis.text.x = element_text(face = "bold"),
axis.text.y = element_text(face = "bold"),
axis.title.x = element_text(face = "bold",
margin = margin(t = 8, r = 10, b = 0, l = 0)),
axis.title.y = element_text(face = "bold",
margin = margin(t = 0, r = 10, b = 0, l = 0)),
axis.line = element_line(colour = "black",
linewidth = 1, linetype = "solid"))
…
ggplot(CrystalBox, aes(x = ATK, y = SPE, color = Gender)) +
ggtitle("Attack and Speed by Gender") +
xlab("Attack") +
ylab("Speed") +
theme(axis.text.x = element_text(angle = 30),
axis.text.y = element_text(angle = 30)) +
# geom_jitter() already draws every point, so no extra geom_point() layer is needed
geom_jitter() +
geom_smooth(method = lm, se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
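The trend lines drawn by geom_smooth(method = lm) come from an ordinary linear model of y on x. As a sketch, the snippet below fits the same kind of model by hand on made-up stats that follow a known line, so the recovered intercept and slope are easy to sanity-check.

```r
# Toy stats (made-up values that follow SPE = 2 * ATK + 10 exactly)
toy_stats <- data.frame(ATK = c(10, 20, 30, 40))
toy_stats$SPE <- 2 * toy_stats$ATK + 10

# geom_smooth(method = lm) fits this kind of model for each colour group
fit <- lm(SPE ~ ATK, data = toy_stats)
fit_coefs <- coef(fit)
print(fit_coefs)
```

On the real data, the equivalent would be one lm(SPE ~ ATK) fit per Gender value.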
ggplot(CrystalBox, aes(x = SPA, y = SPE, color = Gender)) +
geom_jitter() +
xlab("Special Attack") +
ylab("Speed") +
ggtitle("Speed by Special Attack Stats and Gender") +
geom_smooth(method = lm, se = FALSE) +
theme(axis.text.x = element_text(angle = 30),
axis.text.y = element_text(angle = 30)
)
## `geom_smooth()` using formula = 'y ~ x'
#Attack and Speed density plot
ggplot(CrystalBox, aes(x = ATK, y = SPE)) +
ggtitle("Density of Attack by Speed Stats") +
xlab("Attack") +
ylab("Speed") +
stat_density_2d(aes(fill = after_stat(density)), geom = "raster", contour = FALSE) +
scale_fill_distiller(palette = 4, direction = -1) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0)) +
theme(
legend.position = 'none',
axis.title.x = element_text(margin = margin(
t = 8,
r = 10,
b = 0,
l = 0
)),
axis.title.y = element_text(margin = margin(
t = 0,
r = 10,
b = 0,
l = 0
))
)
…
I’ll admit, making a heatmap using base R was a lot more difficult than I expected. So much has to go into properly formatting whatever you’re putting on the x axis if your data isn’t properly aggregated. If you’re looking to make more nuanced heatmaps, I recommend heatmap.2() from the gplots package or pheatmap().
Firstly, you have to make sure all the variables you’re putting into the heatmap are numeric, due to the design of how the chart works. So if you have variables that are non-numeric like character-based columns that you want to have accounted for, you’ll need to make counts for each instance and then coalesce them into the main data set you want to chart.
Here we go!
# Subset main dataset
CrystalBoxHM <- subset(CrystalBox[,c(3:4,9:10,12:17,19,21:31)])
# Edit the source data set for use with the heatmap, as heatmaps can only be used with matrices containing
# numeric variables
#Replace NAs with blank spaces for sum functions below
CrystalBoxHM[,2] <-
replace(CrystalBoxHM$Type2,is.na(CrystalBoxHM$Type2),"")
#Make Type columns
CrystalBoxHM2 <- unique( CrystalBoxHM[ , 1 ] )
CrystalBoxHM2 <- as.data.frame(CrystalBoxHM2)
CrystalBoxHM2$Type2 <- NA
colnames(CrystalBoxHM2) <- c("Type1", "Type2")
CrystalBoxHM2$Type2 = CrystalBoxHM2$Type1 #Run first
CrystalBoxHM2 <- CrystalBoxHM2 %>% add_row(Type1 = "Flying")
CrystalBoxHM2$Type2 = CrystalBoxHM2$Type1 #Run again
#See above: the main data set has no Flying types in Type1, as there are no pure Flying types in Gen 2, so Flying has to be added manually.
#Make counts of Type 1 & 2 columns
CrystalBoxHM2$Type1_Count <-
sapply(CrystalBoxHM2[,1], function(string)
sum(string == CrystalBoxHM[,1]))
CrystalBoxHM2$Type2_Count <-
sapply(CrystalBoxHM2[,2], function(string)
sum(string == CrystalBoxHM[,2]))
#Edit row names for heatmap
row.names(CrystalBoxHM2) <- c("Psychic", "Bug", "Normal", "Fire",
"Ground", "Water", "Electric",
"Poison", "Dragon", "Ice", "Dark",
"Fighting", "Rock", "Steel", "Grass",
"Ghost", "Flying")
#Make other columns for the matrix that the heatmap will be based on...
#Make Type counts for the other data set we subsetted originally:
CrystalBoxHM$Type1_Count <-
sapply(CrystalBoxHM[,1], function(string)
sum(string == CrystalBoxHM[,1]))
CrystalBoxHM$Type2_Count <-
sapply(CrystalBoxHM[,2], function(string)
sum(string == CrystalBoxHM[,2]))
#Splitting data, merging new datasets, summarizing and subsetting:
Type1_meanEXP <-
CrystalBoxHM %>% group_by(Type1) %>% summarize(mean_EXP = mean(EXP))
Type1_meanEXP <-
Type1_meanEXP %>% remove_rownames %>%
column_to_rownames(var = "Type1") %>% as.data.frame()
meanEXP2 <- merge(CrystalBoxHM2, Type1_meanEXP, by = 0, all = TRUE)
meanEXP2 <-
meanEXP2 %>% remove_rownames %>%
column_to_rownames(var = "Row.names") %>% as.data.frame()
#Note above that we must change the row names to the types that will display in the heatmap. Remember, heatmaps can only be used with data sets whose columns are all numeric; row names are not affected by this.
#Merge and null unneeded columns
meanEXP2[,1:2] <- NULL
#New data with all the means we need, computed per primary type
#(assumes Level and HP through SPE are numeric, with HP:SPE adjacent as in the subset above)
all_means <- CrystalBoxHM %>%
group_by(Type1) %>%
summarise(across(c(Level, HP:SPE), mean, .names = "{.col}_Avg")) %>%
#Move the type names into row names so only numeric columns remain
column_to_rownames(var = "Type1")
#Merge everything and fix row names:
meanEXP2 <- merge(meanEXP2, all_means, by = 0, all = TRUE)
meanEXP2 <-
meanEXP2 %>% remove_rownames %>% column_to_rownames(var = "Row.names") %>%
as.data.frame()
CrystalBoxHM3 <- meanEXP2 #I renamed the merged dataset to maintain consistency.
#The heatmap itself
par(cex.main = .85)
heatmap(as.matrix(x = CrystalBoxHM3), scale="col",
margins = c(6, 6),
Colv = NA,
na.rm = TRUE,
main = "Aggregations by Type",
xlab= "Aggregates",
ylab= "Type",
cexCol=.75, col = cm.colors(256))
#Note how I edited the code to remove the column dendrogram from the chart.
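The numeric-matrix requirement described earlier can also be seen in a much smaller sketch. The toy values below are made up, and aggregate() from base R stands in for the dplyr steps: average per type, move the type names into row names, and hand heatmap() a purely numeric matrix. scale() shows the per-column standardisation that heatmap() applies when scale is set to "column".

```r
# Toy box data (made-up values)
toy_box <- data.frame(
  Type1 = c("Water", "Water", "Bug", "Bug", "Normal"),
  Level = c(50, 30, 10, 20, 50),
  HP    = c(150, 90, 30, 50, 200),
  stringsAsFactors = FALSE
)

# Average every numeric column within each type
type_means <- aggregate(cbind(Level, HP) ~ Type1, data = toy_box, FUN = mean)

# Move the type names into row names so only numbers remain
heat_matrix <- as.matrix(type_means[, c("Level", "HP")])
rownames(heat_matrix) <- type_means$Type1

# Center and scale each column, as heatmap() does for scale = "column"
scaled <- scale(heat_matrix)
print(round(scaled, 2))
```

Passing heat_matrix to heatmap() would then work without any further cleanup.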
…
\(~\)
When it comes to the intricate details of Pokémon, you’ll start to unravel many layers to a game that seems rather innocuous on the outside.
Frankly, when I started to delve deeper and deconstruct one of my all-time favorite games in this manner, it took a good amount of self-reflection. Working with data is certainly something I’m passionate about doing well, but sometimes when it comes to games, the fun lies in the mystery of discovering novel things and the joy found in the experience of the present moment.
While the fun my childhood self had playing Pokémon will always be steeped in nostalgia, I hope that finding novel experiences through my pursuits in data and uncovering useful insights can give me a similar feeling in my adult life.
I’ll always appreciate what Satoshi Tajiri gave me and millions of other youth worldwide.
\(~\)
Thank you for reading, Bruce A. Lee