library(papercheck)
library(dplyr)
set.seed(8675309)
You can build a simple generalized linear model to predict the classification of text strings.
Set up ground truth
See the metascience vignette for an explanation of how to set up a ground truth table. Here, we’re going to split our data into a training and test set.
ground_truth <- readxl::read_excel("power/power_screening_coded.xlsx")
train <- dplyr::slice_sample(ground_truth, prop = 0.5)
test <- anti_join(ground_truth, train, by = "text")
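If your classification groups are unbalanced, a plain random split can leave one set short of positive cases. A stratified alternative (a minimal sketch, assuming the classification column is `power_computation`, as used below) samples 50% within each group:

# alternative: stratify the split so both sets keep the class balance
train <- ground_truth |>
  dplyr::group_by(power_computation) |>
  dplyr::slice_sample(prop = 0.5) |>
  dplyr::ungroup()
test <- dplyr::anti_join(ground_truth, train, by = "text")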
Get important words
You can use any method for finding the words you want to use in your model, but papercheck has a built-in function to find the words that are most distinctive in your classification groups. The classification values here are 0 and 1, but can be TRUE/FALSE or any two text values.
`n_X` is the total number of incidents of the word in category X, while `freq_X` is the average number of incidents per text string in category X (so it can be higher than 1 if a word tends to be found several times per sentence). The table gives you the top `n` words with the largest absolute difference in frequency.
words <- distinctive_words(
text = train$text,
classification = train$power_computation,
n = 10
)
| word | n_0 | n_1 | total | freq_0 | freq_1 | difference |
|---|---|---|---|---|---|---|
| ### | 348 | 341 | 689 | 1.56 | 3.22 | 1.66 |
| size | 69 | 101 | 170 | 0.31 | 0.95 | 0.64 |
| a | 177 | 148 | 325 | 0.79 | 1.40 | 0.60 |
| sampl | 55 | 71 | 126 | 0.25 | 0.67 | 0.42 |
| the | 343 | 122 | 465 | 1.54 | 1.15 | 0.39 |
| effect | 69 | 72 | 141 | 0.31 | 0.68 | 0.37 |
| detect | 31 | 47 | 78 | 0.14 | 0.44 | 0.30 |
| to | 172 | 114 | 286 | 0.77 | 1.08 | 0.30 |
| particip | 36 | 46 | 82 | 0.16 | 0.43 | 0.27 |
| an | 36 | 43 | 79 | 0.16 | 0.41 | 0.24 |
By default, the function will “stem” words using the “porter” algorithm. For example, “sampl” will match “sample”, “samples”, and “sampling”. If your text is not English, check `SnowballC::getStemLanguages()` for other supported languages, or set `stem_language = FALSE`.
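If you want to check what the stemmer will do to a particular word, you can call the SnowballC stemmer directly (a quick illustration; SnowballC supplies the stemming languages, but treat the exact internals of papercheck as an assumption here):

# what the Porter stemmer does to related word forms
SnowballC::wordStem(c("sample", "samples", "sampling"), language = "porter")
#> [1] "sampl" "sampl" "sampl"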
You can get rid of words that you think will be irrelevant (even if they are predictive of classification in this data set) by adding them to `stop_words`. The `tidytext::stop_words` object gives you a list of common stop words, but this includes words like “above”, “according”, or “small”, so use it with caution.
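For example, if you want most of the tidytext list but need to keep a few terms that can matter in power-analysis sentences, you could remove them before passing the list on (a minimal sketch; the kept words are just illustrative):

# start from the tidytext list, but keep some potentially meaningful words
keep <- c("above", "according", "small")
my_stop_words <- setdiff(tidytext::stop_words$word, keep)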
The “###” value represents any number (the default setting for the `numbers` argument). We can set the `numbers` argument to “specific” to see if there are any specific numbers associated with power analyses.
words <- distinctive_words(
text = train$text,
classification = train$power_computation,
n = 10,
numbers = "specific",
stop_words = c("the", "a", "of", "an", "and")
)
| word | n_0 | n_1 | total | freq_0 | freq_1 | difference |
|---|---|---|---|---|---|---|
| size | 69 | 101 | 170 | 0.31 | 0.95 | 0.64 |
| sampl | 55 | 71 | 126 | 0.25 | 0.67 | 0.42 |
| effect | 69 | 72 | 141 | 0.31 | 0.68 | 0.37 |
| detect | 31 | 47 | 78 | 0.14 | 0.44 | 0.30 |
| to | 172 | 114 | 286 | 0.77 | 1.08 | 0.30 |
| particip | 36 | 46 | 82 | 0.16 | 0.43 | 0.27 |
| we | 74 | 58 | 132 | 0.33 | 0.55 | 0.22 |
| analysi | 37 | 39 | 76 | 0.17 | 0.37 | 0.20 |
| power | 242 | 136 | 378 | 1.09 | 1.28 | 0.20 |
| d | 12 | 26 | 38 | 0.05 | 0.25 | 0.19 |
Code text features
Next, code the features of your ground truth text using `text_features()`. This will give you a data frame that codes 0 or 1 for the absence or presence of each word or feature.
- `word_count` defaults to TRUE and returns the number of words in each text string.
- `has_number` defaults to TRUE and checks for any number in your text. If “###” is in your words list, this will be set to TRUE automatically.
- `has_symbol` is a named vector of non-word strings (use regex) that you want to detect.
- `values` defaults to “presence” and returns 0 or 1 for the presence of a word in each text string, while “count” returns the number of incidences of the word per string.
has_symbols <- c(has_equals = "=",
has_percent = "%")
features <- text_features(
text = train$text,
words = words,
word_count = FALSE,
has_number = TRUE,
has_symbol = has_symbols,
values = "presence" # presence or count
)
# show the first row
features[1, ] |> str()
#> 'data.frame': 1 obs. of 13 variables:
#> $ has_number : num 0
#> $ has_equals : num 0
#> $ has_percent: num 0
#> $ size : num 0
#> $ sampl : num 0
#> $ effect : num 0
#> $ detect : num 0
#> $ to : num 0
#> $ particip : num 0
#> $ we : num 0
#> $ analysi : num 0
#> $ power : num 1
#> $ d : num 0
Train a model
You can then use this feature data to train a model. Here, we’re using a simple binomial logistic regression to predict the classification from all of the features.
# Train logistic regression model
model <- glm(train$power_computation ~ .,
data = features,
family = "binomial")
summary(model)
#>
#> Call:
#> glm(formula = train$power_computation ~ ., family = "binomial",
#> data = features)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) -19.19030 884.81326 -0.022 0.98270
#> has_number 0.77063 0.45081 1.709 0.08737 .
#> has_equals 0.95531 0.40654 2.350 0.01878 *
#> has_percent 0.03501 0.41426 0.085 0.93266
#> size 1.02841 0.41502 2.478 0.01321 *
#> sampl 0.81864 0.38491 2.127 0.03343 *
#> effect 0.38253 0.38295 0.999 0.31784
#> detect 0.95231 0.44076 2.161 0.03072 *
#> to -0.92623 0.38247 -2.422 0.01545 *
#> particip 0.86969 0.36439 2.387 0.01700 *
#> we 0.98129 0.32641 3.006 0.00264 **
#> analysi 0.89200 0.34875 2.558 0.01054 *
#> power 16.09475 884.81319 0.018 0.98549
#> d 0.20994 0.51380 0.409 0.68283
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 413.56 on 328 degrees of freedom
#> Residual deviance: 272.41 on 315 degrees of freedom
#> AIC: 300.41
#>
#> Number of Fisher Scoring iterations: 16
You can use any model you like and any method to assess and choose the best model.
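For example, the huge standard error and near-zero z value for `power` in the summary above suggest quasi-complete separation; one quick check (an illustrative sketch, not a papercheck feature) is to refit without that predictor and compare AIC:

# refit without the near-separated "power" predictor and compare fits
model2 <- glm(train$power_computation ~ . - power,
              data = features,
              family = "binomial")
AIC(model, model2)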
Predict classification
Now you can classify any text using this model. First, we’re going to predict the classification of the original training data. Use `text_features()` to get the feature data and `predict()` to get a prediction for each text string, then compare this result to a threshold (here 0.5) to generate the predicted classification. Note that `predict()` on a glm returns values on the link (log-odds) scale by default; set `type = "response"` if you want predicted probabilities, and pick your threshold on the matching scale.
train$model_response <- predict(model, features)
train$power_computation_predict <-
train$model_response > 0.5
dplyr::count(train,
power_computation,
power_computation_predict)
#> # A tibble: 4 × 3
#> power_computation power_computation_predict n
#> <dbl> <lgl> <int>
#> 1 0 FALSE 210
#> 2 0 TRUE 13
#> 3 1 FALSE 51
#> 4 1 TRUE 55
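You can summarise this confusion table into standard classification metrics; for instance (a quick base-R sketch; the numbers follow from the counts above):

# accuracy, sensitivity and specificity for the training predictions
conf <- table(truth = train$power_computation,
              predicted = train$power_computation_predict)
round(c(accuracy = sum(diag(conf)) / sum(conf),
        sensitivity = conf["1", "TRUE"] / sum(conf["1", ]),
        specificity = conf["0", "FALSE"] / sum(conf["0", ])), 2)
#>    accuracy sensitivity specificity 
#>        0.81        0.52        0.94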
Now test the model on the held-out data; performance on text the model has never seen gives a fairer estimate than the training set.
test_features <- text_features(
text = test$text,
words = words,
word_count = FALSE,
has_number = TRUE,
has_symbol = has_symbols,
values = "presence" # presence or count
)
test$model_response <- predict(model, test_features)
test$power_computation_predict <-
test$model_response > 0.5
dplyr::count(test,
power_computation,
power_computation_predict)
#> # A tibble: 4 × 3
#> power_computation power_computation_predict n
#> <dbl> <lgl> <int>
#> 1 0 FALSE 201
#> 2 0 TRUE 19
#> 3 1 FALSE 58
#> 4 1 TRUE 50
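The same summary for the held-out test set shows the expected small drop from the training figures:

# overall accuracy on the test set
conf_test <- table(truth = test$power_computation,
                   predicted = test$power_computation_predict)
round(sum(diag(conf_test)) / sum(conf_test), 2)
#> [1] 0.77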