
You can build a simple generalized linear model to predict the classification of text strings.

Set up ground truth

See the metascience vignette for an explanation of how to set up a ground truth table. Here, we’re going to split our data into a training and test set.

ground_truth <- readxl::read_excel("power/power_screening_coded.xlsx")

train <- dplyr::slice_sample(ground_truth, prop = 0.5)
test <- dplyr::anti_join(ground_truth, train, by = "text")

Get important words

You can use any method for finding the words you want to use in your model, but papercheck has a built-in function, distinctive_words(), to find the words that are most distinctive between your classification groups. The classification values here are 0 and 1, but they can be TRUE/FALSE or any two text values.

n_X is the total number of occurrences of the word in category X, while freq_X is the average number of occurrences per text string in category X (so it can be higher than 1 if a word tends to appear several times per string). The table gives you the top n words with the largest absolute difference in frequency.

words <- distinctive_words(
  text = train$text,
  classification = train$power_computation,
  n = 10
)
word      n_0  n_1  total  freq_0  freq_1  difference
###       348  341    689    1.56    3.22        1.66
size       69  101    170    0.31    0.95        0.64
a         177  148    325    0.79    1.40        0.60
sampl      55   71    126    0.25    0.67        0.42
the       343  122    465    1.54    1.15        0.39
effect     69   72    141    0.31    0.68        0.37
detect     31   47     78    0.14    0.44        0.30
to        172  114    286    0.77    1.08        0.30
particip   36   46     82    0.16    0.43        0.27
an         36   43     79    0.16    0.41        0.24

By default, the function will “stem” words using the “porter” algorithm. For example, “sampl” will match “sample”, “samples” and “sampling”. If your text is not English, check SnowballC::getStemLanguages() for other supported languages, or set stem_language = FALSE.
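
For example, a minimal sketch of checking which stemming languages are supported and turning stemming off entirely (using the same training data as above):

# list the languages supported by the Snowball stemmer
SnowballC::getStemLanguages()

# turn stemming off if none of the supported languages fit your text
words_unstemmed <- distinctive_words(
  text = train$text,
  classification = train$power_computation,
  n = 10,
  stem_language = FALSE
)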

You can get rid of words that you think will be irrelevant (even if they are predictive of classification in this data set) by adding them to stop_words. The tidytext::stop_words object gives you a list of common English stop words, but this includes words like “above”, “according”, or “small” that can be meaningful in scientific text, so use it with caution.
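
If you do want to start from that list, one way (a sketch; my_stop_words is just an illustrative name) is to drop the entries you would rather keep as predictors before passing the rest to stop_words:

# start from tidytext's common English stop words, but keep words like
# "above", "according", and "small" that may be meaningful here
my_stop_words <- setdiff(tidytext::stop_words$word,
                         c("above", "according", "small"))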

The “###” value represents any number (the default setting for the numbers argument). We can set the numbers argument to “specific” to see if there are any specific numbers associated with power analyses.

words <- distinctive_words(
  text = train$text,
  classification = train$power_computation,
  n = 10,
  numbers = "specific",
  stop_words = c("the", "a", "of", "an", "and")
)
word      n_0  n_1  total  freq_0  freq_1  difference
size       69  101    170    0.31    0.95        0.64
sampl      55   71    126    0.25    0.67        0.42
effect     69   72    141    0.31    0.68        0.37
detect     31   47     78    0.14    0.44        0.30
to        172  114    286    0.77    1.08        0.30
particip   36   46     82    0.16    0.43        0.27
we         74   58    132    0.33    0.55        0.22
analysi    37   39     76    0.17    0.37        0.20
power     242  136    378    1.09    1.28        0.20
d          12   26     38    0.05    0.25        0.19

Code text features

Next, code the features of your ground truth text using text_features(). This will give you a data frame that codes 0 or 1 for the absence or presence of each word or feature.

  • word_count defaults to TRUE, and returns the number of words in each text string.
  • has_number defaults to TRUE, and checks for any number in your text. If “###” is in your words list, this will be automatically set to TRUE.
  • has_symbol is a named vector of non-word strings (regular expressions) that you want to detect.
  • values defaults to “presence” and returns 0 or 1 for the presence of a word in each text string, while “count” returns the number of occurrences of the word per string.

has_symbols <- c(has_equals = "=", 
                 has_percent = "%")

features <- text_features(
  text = train$text, 
  words = words,
  word_count = FALSE, 
  has_number = TRUE,
  has_symbol = has_symbols, 
  values = "presence" # presence or count
)

# show the first row
features[1, ] |> str()
#> 'data.frame':    1 obs. of  13 variables:
#>  $ has_number : num 0
#>  $ has_equals : num 0
#>  $ has_percent: num 0
#>  $ size       : num 0
#>  $ sampl      : num 0
#>  $ effect     : num 0
#>  $ detect     : num 0
#>  $ to         : num 0
#>  $ particip   : num 0
#>  $ we         : num 0
#>  $ analysi    : num 0
#>  $ power      : num 1
#>  $ d          : num 0

Train a model

You can then use this feature data to train a model. Here, we’re using a simple binomial logistic regression to predict the classification from all of the features.

# Train logistic regression model
model <- glm(train$power_computation ~ .,
             data = features,
             family = "binomial")

summary(model)
#> 
#> Call:
#> glm(formula = train$power_computation ~ ., family = "binomial", 
#>     data = features)
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)   
#> (Intercept) -19.19030  884.81326  -0.022  0.98270   
#> has_number    0.77063    0.45081   1.709  0.08737 . 
#> has_equals    0.95531    0.40654   2.350  0.01878 * 
#> has_percent   0.03501    0.41426   0.085  0.93266   
#> size          1.02841    0.41502   2.478  0.01321 * 
#> sampl         0.81864    0.38491   2.127  0.03343 * 
#> effect        0.38253    0.38295   0.999  0.31784   
#> detect        0.95231    0.44076   2.161  0.03072 * 
#> to           -0.92623    0.38247  -2.422  0.01545 * 
#> particip      0.86969    0.36439   2.387  0.01700 * 
#> we            0.98129    0.32641   3.006  0.00264 **
#> analysi       0.89200    0.34875   2.558  0.01054 * 
#> power        16.09475  884.81319   0.018  0.98549   
#> d             0.20994    0.51380   0.409  0.68283   
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 413.56  on 328  degrees of freedom
#> Residual deviance: 272.41  on 315  degrees of freedom
#> AIC: 300.41
#> 
#> Number of Fisher Scoring iterations: 16

You can use any model you like and any method to assess and choose the best model.
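
For example, one simple comparison (a sketch; the reduced formula below is purely illustrative) is to refit the model with a subset of the features and compare AIC:

# an illustrative reduced model using only some of the features
model2 <- glm(train$power_computation ~ size + sampl + detect + to +
                particip + we + analysi + has_equals,
              data = features,
              family = "binomial")

# lower AIC suggests a better trade-off between fit and complexity
AIC(model, model2)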

Predict classification

Now you can classify any text using this model. First, we’re going to predict the classification of the original training data. Use text_features() to get the feature data and predict() to return the model response, then compare this response to a threshold to generate the predicted classification. Note that predict() on a binomial glm returns log-odds by default, so the 0.5 threshold used here is on the log-odds scale (a predicted probability of about 0.62).

train$model_response <- predict(model, features)

train$power_computation_predict <-
  train$model_response > 0.5

dplyr::count(train, 
             power_computation, 
             power_computation_predict)
#> # A tibble: 4 × 3
#>   power_computation power_computation_predict     n
#>               <dbl> <lgl>                     <int>
#> 1                 0 FALSE                       210
#> 2                 0 TRUE                         13
#> 3                 1 FALSE                        51
#> 4                 1 TRUE                         55
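
If you would rather threshold predicted probabilities than log-odds, a minimal sketch (model_prob and power_computation_predict2 are just illustrative names) is:

# return predicted probabilities instead of log-odds
train$model_prob <- predict(model, features, type = "response")

# classify as a power computation when the predicted probability is > .5
train$power_computation_predict2 <- train$model_prob > 0.5

Note that a probability threshold of 0.5 is not identical to a log-odds threshold of 0.5, so the two approaches can classify borderline text strings differently.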

Now test the model on the held-out test data, coding the text features in the same way.

test_features <- text_features(
  text = test$text, 
  words = words,
  word_count = FALSE, 
  has_number = TRUE,
  has_symbol = has_symbols, 
  values = "presence" # presence or count
)
test$model_response <- predict(model, test_features)

test$power_computation_predict <-
  test$model_response > 0.5

dplyr::count(test, 
             power_computation, 
             power_computation_predict)
#> # A tibble: 4 × 3
#>   power_computation power_computation_predict     n
#>               <dbl> <lgl>                     <int>
#> 1                 0 FALSE                       201
#> 2                 0 TRUE                         19
#> 3                 1 FALSE                        58
#> 4                 1 TRUE                         50
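
As a quick summary of test-set performance, you could also compute overall accuracy from these predictions (a sketch; you may prefer other metrics, such as sensitivity and specificity):

# proportion of test strings where the predicted class matches the
# ground truth (0/1 values compare directly with FALSE/TRUE in R)
mean(test$power_computation == test$power_computation_predict)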