Making available replication materials for the research you do is A Good Thing. It’s also work, and it’s quite easy to never get around to. Certainly I claim no special virtue in this department so I am always happy when there’s an institutional stick to prod my better nature in the right direction. One such institutional prod comes from academic journals and their data policies. If you *have* to give them your replication data before you they’ll publish your paper, then you probably will. What sorts of journals have data policies?

In a recent paper looking at the relationship between Thomson ISI impact factor, audience, language, age, and issue frequency and the existence of a data policy, Gherghina and Katsanidou find that in a sample of political science journals the presence of a data policy mostly covaries with the journal’s target audience – general or specialised – and with its impact factor. And to their eternal credit, the authors make their data available.

As one might hope, their analysis has already been replicated. (You should read this link for some useful discussion and citations). Ghergina and Katsanidou only present correlations, so we can dig a bit further with some modelling with their data. I do that below. But first, here’s the headline.

The plot below represents the predicted probability of having a data policy for English-language general (red, higher) and specific (blueish-green, lower) journals as a function of their impact factor. The bands are two standard error pointwise confidence intervals, with approximately 95% coverage under the usual assumptions.

Higher impact factors strongly predict higher probabilities of having a data policy, with general audience journals slightly more likely to have a data policy than specific journals.

Now it could be that better journals are better because they have a data policy, or that they try to signal their quality by having one, or that a data policy increases citation rates and thus the journal’s impact factor, or that some other factors generate high impact factors and the desire for a data policy. That is, of course, not settled by anything we’ve done here. Nevertheless, knowing the impact factor and audience of a journal apparently tells you quite a lot about whether it’s going to have a data policy.

### Trying it at Home

In the model represented in the plot I’ve assumed that only **language** (english or not), **audience** (general or specific) and **impact.factor** (from 0 to 5.22) matter for predicting a data policy.

#### Fitting a Model

Here’s a straightforward logistic regression using my own version of the data

load('jpols.Rdata') ## jpols is my tidied-up version of the original data glm.mod <- glm(data.policy ~ language + audience + impact.factor, data=jpols, family=binomial)

R's default listwise deletion procedure drops journals without impact factors when this model is fitted, which leaves 95 journals.

There's no need to dwell here on the model interpretation except perhaps to note that, in contrast to the correlations in the paper, the language of a journal no longer significantly predicts having a data policy. This is because of the tight connection between being in English, having an ISI ranking, and having a higher impact factor if ranked. (Similar arguments hold for the age of the journal, which I've simply left out.)

Perhaps the most interesting single quantity in this model is the predicted probability of a data policy for a unit increase in impact factor. A unit more impact makes it

logodds.impact <- coef(glm.mod)['impact.factor'] prop.increase <- exp(logodds.impact)-1 ## about 2.7

nearly three times more likely that a journal has a data policy.

#### Constructing Data for Prediction

The figure requires predicted probabilities for English language journals with impact factors between zero and five for both general and specific audiences, so I make up that data:

imps <- seq(0,5,by=0.1) ## range of impact factors eng.nd <- data.frame(language="english", audience=rep(c("general", "specific"), each=length(imps)), impact.factor=rep(imps, 2))

and get predictions from the fitted model with standard errors.

preds <- predict(glm.mod, newdata=eng.nd, se.fit=TRUE)

This gives standard errors for the link (a log odds) rather than the response (a probability) so I define the inverse link function **invl** to transform log odds into probabilities

invl <- function(x){ 1/(1+exp(-x)) }

Next I make a data frame with everything needed for the plot: the audience, the impact factor, and the upper, lower and expected predicted probabilities suitably transformed

dd <- with(preds, data.frame(audience=eng.nd$audience, impact=eng.nd$impact.factor, lower=invl(fit-2*se.fit), pred=invl(fit), upper=invl(fit+2*se.fit)) )

**An Aside**: Why bother with this **invl** business? Why not ask directly for standard errors on the response by specifying **type="response"** in the **predict** function? Because this will return only one column of standard errors but with data where the predicted probabilities are close to zero or one, the correct predictive intervals will be strongly asymmetrical. For example, in this model the predicted probability for a general journal with impact factor 4 is 0.94 with a two standard error interval of [0.66, 0.99]. On the response scale, this would have been [0.82, 1.06] which is too high at one end and impossible at the other.

#### Plotting Predictive Probabilities

Plotting is relatively straightforward using the ggplot2 package

library(ggplot2) ## load package ggplot(dd, aes(impact, pred, colour=audience)) + geom_ribbon(aes(x=impact, ymin=lower, ymax=upper, fill=audience, linetype=NA), alpha=0.3) + geom_line(size=1) + theme_bw() + theme(legend.position="none") + labs(x = "Impact Factor", y = "P(Data Policy)")

And that's the final plot. I'm still getting my head around ggplot2 so there may be neater ways to do this.

Will, this is such a worthy quest! I recently had a mission to obtain the data for this paper. The author stats that the data are available from the company that sponsored the study. Big surprise! The data is not for release according to the sponsor. I contacted the author, no response. I contacted the American Journal of Medicine, the response was essentially, “not our problem.”

ugh

if you take the excel data, and plot IF on the x axis and data policy (0,1) on the y axis…you get prtty much 99% of hte meat with otu the fancy graphics and data anlysis that are interpretable to only an handfull of cognoscenti

In other words, this is not an example of how to do things, but an example of how not to do things.

(berfore you ding me for being an arrogant witless know nothing, LOOK at the excel plot)

Rather then just publishing the data, which would be a big step, the code used to manipulate the data to get to their conclusions is just as important.

Small coding errors are often impossible to trace but can have a large impact on the final outcome. Adding their code in the supplementary data would have a tremendously added value.