March 14, 2022

Scrape Your Own Discord Data with a Simple Bot

I chat with my friends on Discord basically all day long. I thought it would be fun to run some basic stats on our chats and put them together in some sort of fun graphics package vis-a-vis Spotify Wrapped. I soon found out that Discord does not directly expose a way for a server admin to download their own data. Lame, but I did notice that Discord bots are able to access chat logs programatically. I was able to get a simple analysis up and running, by solving three larger problems.

  1. Figure out how to download the data
  2. Parse the incoming data into a usable format
  3. Analyze the data

In Hadley Wickham’s excellent R for Data Science, this is a small part of the “Data Science Workflow”

Prerequisite Software

1) Figure out how to download the data

My first goal is to download all of my server’s chat data in a plain text format, ideally in a csv.

Discord Chat Explorer has a command exportguild which lets you export all the chat data in a single server as csv. All you need in order to export data is to run this command with your auth token and guild ID.

dotnet DiscordChatExporter.Cli/DiscordChatExporter.Cli.dll exportguild -t "token" -g guild_id -b -f csv

The only tricky part about this command is gaining an authorization token. In order to get a valid token, you need to create a Discord bot that has permission to read your server’s chat history. You can do this by going to https://discord.com/developers/applications and then following New Application > New Bot. The new bot will need to be granted “Read Messages/View Channels” and “Read Message History” privileges by going to OAuth2 > URL Generator on the Bot’s Application Page. The URL Generator will then provide a link that finally allows you to add a bot to your server.

(Note: You’re able to access your server ID by right clicking your server and selecting “Copy ID”)

Once you run the exportguild command, all of your channels should be downloaded in separate csv files!

2) Parse the incoming data into a usable format

My second goal is to take the csvs that were generated in step 1 and store them all in a single dataframe. A dataframe is a representation of the data that we can manipulate in R, much like we can represent and manipulate data in Excel. Much like an Excel table, the dataframe will have rows and columns. Each row will contain a single message that was sent as well as message metadata like who sent it, what channel it was sent in and when it was sent.

We can use the excellent R package purrr to read all of our csvs at once and stack them together. All of the files exported in step 1 will have the naming format ServerName - Folder - channel_name [channel_id].csv. We can extract this information from the filenames in order to extract the channel names and put them into our chat dataframe.

chat_data <- list.files('data') %>% #list out all the data files
  map_df(function(filename){
    channel_name <- filename %>%
      str_split('- ') %>%
      unlist %>%
      pluck(3) %>%
      str_split(' ') %>%
      unlist %>%
      pluck(1)
    
    channel_category <-  filename %>%
      str_split('- ') %>%
      unlist %>%
      pluck(2) 
    
    read_csv(paste0('data/', filename)) %>%
      mutate(channel = channel_name)  %>%
      mutate(date_time = dmy_hm(Date)
             , date = as_date(date_time)
             , hour = hour(date)
             , time = as_hms(date_time)
             , author = word(Author, sep = '#')
             , channel_category = channel_category) 
  })

3) Analyze the data

Now that we have the data in-memory, what do we actually do with it? My Discord group has a channel where we post our Wordle scores every day. The Wordle posts are all automatically generated by the app, so we can always rely on the Wordle posts to look the exact same. For example, every Wordle scoring post looks like this

Wordle 268 4/6

🟩⬛⬛⬛🟨

🟩⬛🟩⬛⬛

🟩🟨🟩⬛🟨

🟩🟩🟩🟩🟩

We can search all the messages that look like “Wordle wordle_id guess_attempts/6”. Further, since we have the channel information and we only post our Wordle scores in the channel titled “wordle-gods”, we can easily filter out non-Wordle messages.

wordle_data <- chat_data %>%
  filter(channel %in% c('wordle-gods'), str_detect(Content, 'Wordle [0-9]+ [0-9]/[0-9]'))  %>% # detect Wordle score pattern
  mutate(content =  str_extract(Content, 'Wordle [0-9]+ [0-9]/[0-9]')) %>%
  separate(content, ' ', into = c('wordle', 'id', 'attempts')) %>%
  separate(attempts, '/', into = c('guesses', 'junk')) %>%
  select(author, date, id,  guesses) %>%
  mutate(guesses = as.numeric(guesses), id = as.numeric(id))

So now we have a dataframe that has author, date, wordle ID and guesses. To get a sense for our rankings, we can construct a rolling average of each person’s guesses - the lower the number of guesses, the better.

wordle_data %>%
  mutate(rec = 1) %>%
  group_by(author) %>%
  mutate(total_guesses = cumsum(guesses)
         , total_days = cumsum(rec)
         , running_avg = total_guesses/total_days) %>%
  ggplot(aes(x = id, y = running_avg, colour = author, linetype = author)) +
  geom_line(size = 1.3) +
  ylab('Average # of Guesses') +
  xlab('Wordle ID') +
  theme_bw()

bad at wordle

Oh-it turns out I’m bad at this game.