Call them what you will
Will Lowe (2013-10-28 17:44)
I've been playing around with the R package
{texreg}
for creating combined regression tables for multiple
models. It's not the only package to do that - see the R to LateX
packages overview for a
review - but it's often handy to be able to generate both ascii art,
latex, and html versions of the same table using almost identical
syntax. Also, the ascii art creating screenreg
function allows me to
bypass the pdf construction cycle I previously described
here. The coefficient plots from
plotreg
are pretty cool too.
This post is about making the variables listed in those combined regression tables more readable. That is particularly important when data comes from variable-mangling statistical software or from co-authors whose idea of a descriptive name could pass for an online banking password. Even R will cheerfully mash up your carefully chosen variable names through formulas, factors, and interactions. So for work people are going to see, variables should have sensible names.
First I'll walk through the existing texreg
machinery for renaming,
omitting, and reordering variables, and then propose a hopefully more
intuitive implementation. I'll demonstrate all this using screenreg
on a classic data set on job prestige.
Job prestige
Consider the 'Prestige' data in the
car package. This dataset gives information about
various occupations collected from census data in Canada in the early
1970s. For each occupation there's a level of job prestige
and
some covariates that potentially explain that level. The covariates
are the average income
of that occupation in dollars, the average
years of education
in years, the percentage of women
it
contained, a census occupational code that we can ignore, and a
nominal variable called type
that takes the values blue collar
('bc'), white collar ('wc'), and professional ('prof'). Let's tidy
the data and construct a couple of models to show texreg
at work.
require(car)
require(texreg)
data(Prestige)
# make professional the base category and rebase income
prest <- transform(Prestige,
type=relevel(type, 'prof'),
income=income/1000)
# Income has a single relationship to job prestige
mod1 <- lm(prestige ~ education + type + income + women, data=prest)
# Income has different relationship to job prestige depending on job type
mod2 <- lm(prestige ~ education + type * income + women, data=prest)
Now let's compare these two models
screenreg(list(mod1, mod2))
====================================
Model 1 Model 2
------------------------------------
(Intercept) 5.09 17.47 *
(8.87) (8.19)
education 3.66 *** 2.80 ***
(0.65) (0.59)
typebc -5.91 -27.55 ***
(3.94) (5.41)
typewc -8.82 ** -24.12 ***
(2.79) (5.35)
income 1.04 *** 0.85 ***
(0.26) (0.23)
women 0.01 0.08 *
(0.03) (0.03)
typebc:income 3.01 ***
(0.58)
typewc:income 1.91 *
(0.81)
------------------------------------
R^2 0.83 0.87
Adj. R^2 0.83 0.86
Num. obs. 98 98
====================================
*** p < 0.001, ** p < 0.01, * p < 0.05
So far so good. But these variable names are automatically constructed and inevitably a bit weird-looking. I mean, 'typebc'?
Making better variable names
We can rename variables before constructing the table using the
custom.coef.names
argument.
screenreg(list(mod1, mod2),
custom.coef.names=c('intercept', 'education', 'type: blue collar',
'type: white collar', 'income', 'prop. female',
'income x blue collar', 'income x white collar'))
============================================
Model 1 Model 2
--------------------------------------------
intercept 5.09 17.47 *
(8.87) (8.19)
education 3.66 *** 2.80 ***
(0.65) (0.59)
type: blue collar -5.91 -27.55 ***
(3.94) (5.41)
type: white collar -8.82 ** -24.12 ***
(2.79) (5.35)
income 1.04 *** 0.85 ***
(0.26) (0.23)
prop. female 0.01 0.08 *
(0.03) (0.03)
income x blue collar 3.01 ***
(0.58)
income x white collar 1.91 *
(0.81)
--------------------------------------------
R^2 0.83 0.87
Adj. R^2 0.83 0.86
Num. obs. 98 98
============================================
*** p < 0.001, ** p < 0.01, * p < 0.05
That's more readable already.
Omitting things
Actually maybe we don't care about the intercept because not all of these variables are meaningful at zero, so let's omit that.
screenreg(list(mod1, mod2),
custom.coef.names=c('intercept', 'education', 'type: blue collar',
'type: white collar', 'income', 'prop. female',
'income x blue collar', 'income x white collar'),
omit.coef='intercept')
============================================
Model 1 Model 2
--------------------------------------------
education 3.66 *** 2.80 ***
(0.65) (0.59)
type: blue collar -5.91 -27.55 ***
(3.94) (5.41)
type: white collar -8.82 ** -24.12 ***
(2.79) (5.35)
income 1.04 *** 0.85 ***
(0.26) (0.23)
prop. female 0.01 0.08 *
(0.03) (0.03)
income x blue collar 3.01 ***
(0.58)
income x white collar 1.91 *
(0.81)
--------------------------------------------
R^2 0.83 0.87
Adj. R^2 0.83 0.86
Num. obs. 98 98
============================================
*** p < 0.001, ** p < 0.01, * p < 0.05
this doesn't seem such a big deal, but in other analyses when you've
got, for example, 120 country fixed effects in a variable called
country
that you don't want to see then it's nice to be able to
say
omit.coef='country'
This gets treated like a regular expression and omits all the 'countryAngola' type things. If you've also got, say, regional fixed effects and you want to skip them too then it would be
omit.coef='country|region'
We'll use this little trick later.
Reordering things
This prettification is all very nice, but arguably the coefficients
appear in the wrong order. Specifically, we should probably see the
interaction between type
and income
directly after the
income
and type
'main effects' without the intervening
women
variable. To do this we can use the reorder.coef
argument
screenreg(list(mod1, mod2),
custom.coef.names=c('intercept', 'education', 'type: blue collar',
'type: white collar', 'income', 'prop. female',
'income x blue collar', 'income x white collar'),
omit.coef='intercept',
reorder.coef=c(1, 4, 2, 3, 6, 7, 5))
============================================
Model 1 Model 2
--------------------------------------------
education 3.66 *** 2.80 ***
(0.65) (0.59)
income 1.04 *** 0.85 ***
(0.26) (0.23)
type: blue collar -5.91 -27.55 ***
(3.94) (5.41)
type: white collar -8.82 ** -24.12 ***
(2.79) (5.35)
income x blue collar 3.01 ***
(0.58)
income x white collar 1.91 *
(0.81)
prop. female 0.01 0.08 *
(0.03) (0.03)
--------------------------------------------
R^2 0.83 0.87
Adj. R^2 0.83 0.86
Num. obs. 98 98
============================================
*** p < 0.001, ** p < 0.01, * p < 0.05
Here we explicitly specified a reordering of the indexes of the original variable names.
This works, but it takes a little bit of trial and error to get all
the indexing right. Basically, it's brittle. What happens if we decide
to drop the female
variable? Or move it to the top of the variable
list? It's right in the middle of the reordering sequence and the list
of new variable names so we'd have to construct these arguments again,
more or less from scratch.
Let's think about this again and instead of exploring the existing functionality let's ask for what we want.
Asking for what we want
First, we'd like to be able to rename variables. That is, really
rename them, not just assign a string to whatever variable happens to
be number 2 in a list of other strings. Also, we'd usually like to be
implicit about omission. A reasonable convention might be that if we
don't bother to mention a variable then we don't ever want to see it
in the table. And we'd like to be implicit about ordering. Normally we
can provide a total ordering of variables, e.g. we can assert that if
type
and income
are included we want income
to appear
first, and if they are interacted we want to see the interaction terms
directly after "type".
A straightforward way to express all these things would be to define a list that describes what new name each possible variable name should be mapped to, what sequence they should appear in, and which variables should be omitted in the final table. That is to say, something like
name.map <- list(education="education",
income="income",
typewc="type: white collar",
typebc="type: blue collar",
"typewc:income"="income x white collar",
"typebc:income"="income x blue collar",
women="prop. female")
This is just an ordinary R list, although note that R's default colon-containing interaction term names need to be quoted to stop R thinking they are sequences or some other foolishness. We could get around that with a little fancy programming, but let's not bother.
The order of the list is a total ordering over things we'd like to see in a regression table. Regression tables should show the subset of names in this list, in the order they appear there. Variable names that are not mapped are ones we never wish to see; there is no intercept mentioned because we always want to omit that.
Building something to get what we want
So much for asking for what we want. Now to construct a function that
turns this little specification into something that screenreg
can
deal with.
The following function has two arguments: final.rnames
: an array
containing the variable names that would appear in screenreg output if
we didn't change anything, and name.map
which we discussed above. It
returns a list with three elements, ccn
: a suitable value for the
custom.coef.names
argument, oc
: a suitable value for the
omit.coef
argument, and rc
: a suitable value for the
reorder.coef
argument.
build.ror <- function(final.rnames, name.map){
keep <- final.rnames %in% names(name.map)
mapper <- function(x){
mp <- name.map[[x]]
ifelse(is.null(mp), x, mp)
}
newnames <- sapply(final.rnames, mapper)
omit <- paste0(final.rnames[!keep], collapse="|")
reorder <- na.omit(match(unlist(name.map), newnames[keep]))
list(ccn=newnames, oc=omit, rc=reorder)
}
Aside
What is final.rnames
here? It's
c("(Intercept)", "education", "typebc", "typewc", "income",
"women", "typebc:income", "typewc:income")
But where did it come from?
Well, we could have just typed in these names. But really, life is short, and it's more fun to use a probably-ill-advised and certainly contrary-to-all-the-etiquette-of-R function like the one below to figure out the variable names
all.varnames.dammit <- function(model.list){
mods <- texreg:::get.data(model.list)
gofers <- texreg:::get.gof(mods)
mm <- texreg:::aggregate.matrix(mods, gofers, digits=3)
rownames(mm)
}
This takes the same list of models as you'd give screenreg
and tells
you what the final variable list would be. Handy, eh? Maybe there is a
kosher method of getting hold of this info, but I don't know what it
is.
As it happens, this function just does what texreg
does
internally. All those triple colons are there because the author
didn't really want us messing around in the unexported functions the
way he does. Shenanigan-free exposure of this variable name
information would, of course, be preferable. I'll ask... In the
meantime, use at your own risk.
Getting what we want
Now we've now got our original variable name list and a mapping
function let's put build.ror
to use.
# bundle up some models
model.list <- list(mod1, mod2)
# get the names of the coefficients in all the models
oldnames <- all.varnames.dammit(model.list)
# construct three suitable arguments for screenreg
ror <- build.ror(oldnames, name.map)
# add them as arguments to a screenreg call
screenreg(model.list, custom.coef.names=ror$ccn,
omit.coef=ror$oc, reorder.coef=ror$rc)
============================================
Model 1 Model 2
--------------------------------------------
education 3.66 *** 2.80 ***
(0.65) (0.59)
income 1.04 *** 0.85 ***
(0.26) (0.23)
type: white collar -8.82 ** -24.12 ***
(2.79) (5.35)
type: blue collar -5.91 -27.55 ***
(3.94) (5.41)
income x white collar 1.91 *
(0.81)
income x blue collar 3.01 ***
(0.58)
prop. female 0.01 0.08 *
(0.03) (0.03)
--------------------------------------------
R^2 0.83 0.87
Adj. R^2 0.83 0.86
Num. obs. 98 98
============================================
*** p < 0.001, ** p < 0.01, * p < 0.05
and there we are.
To put the proportion of women variable first, just move it to the
front of the name.map
and run the whole thing again. No more index
calculations in your head. To omit a variable, just take it out of the
list.
It doesn't matter if there are names in name.map
that never appear
as variables in any of your models. You should feel free to put a
project's worth of variable name mappings in there, just in case.
Finally, all this works with texreg
and htmlreg
too (but not for
plotreg
because that doesn't combine models). If you use it to create
latex, do make sure your new variable names are properly escaped
before they meet the latex compiler. Percentage and dollars signs are
the usual culprits.
Extensions
It would be natural to build something like these functions into
texreg
itself. Or at the very least, bundle the two functions above
together so that we could just hand in a list of models and a variable
name mapping. Maybe I'll save that for another post.