/ data science

Navigating the WordPress Multisite database labyrinth

Who is the owner of this blog? You'd think that's an easy question for a multisite administrator to answer. And it is, if you only have a handful of sites. Just visit your dashboard, go to sites, select the site, and click on the number in the users column, then look at the users labeled "administrator" in the list that pops up.

But what if you have over 10,000 sites on your WordPress Multisite platform? (like we do on our legacy UMW Blogs platform)

Surely there's a table in the database that lists the administrators for each blog...

Or so I thought.

Turns out, the information is there, but not at all in a way that is readily accessible to someone trying to list all the owners of all the blogs on the platform. (Let alone ask "Which blogs belong to people who are current students, faculty, or staff at the university?")

I spent the better part of yesterday morning pulling tables and merging them in order to create such a list. On the off chance that someone else administers a mammoth WordPress Multisite instance and wants a straightforward way to list blog owners (and possibly their institutional/company status), I thought I'd write up the process and code I used, in the hopes that it will help someone else.

So here goes...

Which tables?

Our WordPress Multisite has tens of thousands of tables in its database. As I found today, digging through that to find the single table containing post data for a particular blog can be a nightmare. But for site metadata across the system, there are just a few easy-to-find tables you'll need.

On our platform, they are all in a database called ***_wpmu_global. (Replace the *** with your main MySQL database name.) Even this database has dozens of tables in it, but the ones we need to connect users and the blogs they administer are wp_blogs, wp_users, and wp_usermeta. Using your database admin tool of choice, download these complete tables to your local machine as CSV files. (Watch out that your database admin tool -- in my case MySQL Workbench for Windows and Linux -- doesn't limit your download to, say, the first 1000 rows. You need the whole thing.)

Merging the data

Now we need to merge the data. My tool of choice for this kind of work is R, using TidyVerse tools like dplyr and stringr. The code that follows will work in R, but you could easily use Python or SQL.

First we load the libraries we need.

library(tidyverse)
library(lubridate)
library(stringr)

Then we import the data and do a little pre-processing on the dates to make them easy to work with in R.

wp_blogs <- read_csv('wp_blogs.csv', col_types = 'iicccciiiiii') %>%
  mutate(reg_date = ymd(substring(registered, 1, 10)),
         last_updated_date = ymd(substring(last_updated, 1, 10))) %>%
  select(blog_id, site_id, domain, path, public, reg_date, last_updated_date)
wp_users <- read_csv('wp_users.csv')
wp_user_meta <- read_csv('wp_user_meta.csv')

I'm also going to pull in a table from our campus information system, containing basic user info about current students, faculty, and staff, so I can parse the list of blogs by current/former UMW people, and by students/faculty/staff.

umw_users <- read_csv('User_List.csv')

To get all the info I need about users, I need to merge some info from the wp_users and the wp_usermeta tables. wp_usermeta contains way more information than I need, so I'll limit it to just a few items that are necessary to link the users with their sites and with their institutional record (namely their personal name info, their primary blog activity, and their date of most recent activity).

Screenshot--9-

Sample data from wp_usermeta.

The following code takes only the records from wp_usermeta that contain this information, and then transposes it so that instead of one column with keys and another column with values, we get several columns (one per relevant key), with the values corresponding to each user in those columns. It then takes the resulting table and joins it with the more basic user information in wp_users and cleans up the dates. Finally, it takes their email address and, if it's a UMW email, it creates a new netid value (which will allow me to merge it with the campus user info database).

# function to find or generate netid

generate_netid <- function (email) {
  new_netid <- ifelse(grepl('umw.edu', email, ignore.case = TRUE),
                      tolower(unlist(strsplit(email, '@'))[1]),
                      NA)
  return(new_netid)
}

# create master users table

users <- wp_user_meta %>%
  filter(meta_key %in% c('nickname',
                         'first_name',
                         'last_name',
                         'source_domain',
                         'primary_blog',
                         'last_activity'),
         user_id != 0) %>%
  select(user_id, meta_key, meta_value) %>%
  spread(meta_key, meta_value) %>%
  full_join(wp_users, by = c('user_id' = 'ID')) %>%
  mutate(reg_date = ymd(substring(user_registered, 1, 10)),
         netid = sapply(user_email, generate_netid)) %>%
  left_join(umw_users) 

The last thing I need to do is connect this user data with specific blogs on the platform. (That's the fun part...)

Like I mentioned above, there is no table in the database that links users and their sites directly. Instead, the wp_usermeta table contains a key called wp_*_capabilities, where the * is an integer corresponding to the blog_id of the blog in question (wp_453_capabilities for the blog with blog_id 453, for example). The value that corresponds to these keys is a PHP serialization of a potentially long list of parameters. Neither R nor MySQL have a built-in tool (that I could find) for de-serializing this data. Fortunately, it is plain text, and WordPress only puts the Administrator (or, sometimes, administrator) parameter in this serialization if the user in an administrator of that site.

Screenshot--8-

Sample wp_*_capabilities data from wp_usermeta.

So to connect users with the blogs they administer, I extracted the key-value pairs where the key contained wp_ and the value contained administrator (case-insensitive). Then I extracted the integer from wp_*_capabilities and designated it the blog_id to associate with the user_id in question. Then I merged the resulting table with the users table I created above, resulting in one massive table containing the institutional and blog metadata I need for each blog-administrator pair in our WordPress multisite.

# function to extract site id

find_site_id <- function(mkey) {
  return(as.integer(unlist(str_split(mkey, '_'))[2]))
}

admins <- wp_user_meta %>%
  filter(str_detect(str_to_lower(meta_value), 'administrator'),
         str_detect(meta_key, 'wp_')) %>%
  mutate(blog_id = sapply(meta_key, find_site_id)) %>%
  select(user_id, blog_id) %>%
  left_join(wp_blogs %>%
              select(blog_id, domain, public, blog_reg_date = reg_date, blog_last_updated_date = last_updated_date)) %>%
  left_join(users %>%
              select(user_id, first_name, last_name, user_reg_date = reg_date, netid, group_name, stst, department, employee_class, student_class))

Analysis

Now I can do some helpful analysis on our platform. (Note: the results of the following queries are omitted to protect user privacy.)

For example, which users have the most blogs on our site?

top <- admins %>%
  count(netid, sort = TRUE)

Which sites are "active" (updated in the past year and/or administered by a current UMW student/faculty/staff member)?

active_sites <- admins %>%
  filter(stst == 'AS' |
           group_name %in% c('Faculty', 'Staff') |
           blog_last_updated_date >= '2016-09-22')

Which sites are "inactive" (the opposite of "active")?

inactive_last_two_years <- admins %>%
  anti_join(active_sites)

How many sites were abandoned the same day they were created? And whose were they?

never_updated <- admins %>%
  filter(blog_reg_date == blog_last_updated_date) %>%
  arrange(desc(blog_last_updated_date))

We also have some analytics running on the platform, so I merged those analytics with the wp_blogs table, so the resulting admins table will contain the number of unique sessions all time, and in the past two years. That means I can redefine "active" sites according to reader engagement, as well as author/administrator activity.

For example, which sites are administered by a current UMW student/faculty/staff member and have been visitied at least once in the past two years?

active_sites <- admins %>%
  filter(stst == 'AS' |
           group_name %in% c('Faculty', 'Staff') |
           sessions_two_years > 0)

Or have had at least 100 visits all time?

active_sites <- admins %>%
  filter(stst == 'AS' |
           group_name %in% c('Faculty', 'Staff') |
           sessions_all_time > 100)

And which sites qualify as "low activity" sites? (not administered by a current UMW student/faculty/staff member, not updated in the past year, and visited less than 100 times in its entire existence)

low_activity_all_time <- admins %>%
  anti_join(admins %>%
              filter(stst == 'AS' |
                       group_name %in% c('Faculty', 'Staff') |
                       blog_last_updated_date >= '2016-09-22' |
                       sessions_all_time > 100))

These are just a few of the things we can do once we have this information in one place. We could also ask how many of the currently active sites are administed by freshmen, sophomores, etc.; which sites have been active the longest; which sites were abandoned at the end of a semester; etc. The hard work, though, was getting all of this information into one place. WordPress can do a lot of really cool things, but its popularity and diversity of use cases mean that even some seemingly straightforward things are kind of hacked together on the backend. But with a little data wrangling, WordPress Multisite administrators can do some pretty cool analysis.