
You can build a simple generalized linear model to predict the classification of text strings.

Set up ground truth

See the metascience vignette for an explanation of how to set up a ground truth table. Here, we’re going to split our data into a training and test set.

ground_truth <- readr::read_csv("power/power_screening_coded.csv", 
                         show_col_types = FALSE)

train <- dplyr::slice_sample(ground_truth, prop = 0.5)
test <- dplyr::anti_join(ground_truth, train, by = "text")

Get important words

You can use any method for finding the words you want to use in your model, but papercheck has a built-in function to find the words that are most distinctive in your classification groups. The classification values here are 0 and 1, but they can be TRUE/FALSE or any two text values.

n_X is the total number of occurrences of the word in category X, while freq_X is the average number of occurrences per text string in category X (so it can be higher than 1 if a word tends to be found several times per sentence). The table gives you the top n words with the largest absolute difference in frequency.

words <- distinctive_words(
  text = train$text,
  classification = train$power_computation,
  n = 10
)
word       n_1  n_0  total  freq_1  freq_0  difference
###        363  232    595    3.02    1.67        1.36
size        96   29    125    0.80    0.21        0.59
of         183  135    318    1.52    0.97        0.55
a          147  105    252    1.23    0.76        0.47
the        157  242    399    1.31    1.74        0.43
sampl       68   32    100    0.57    0.23        0.34
effect      74   42    116    0.62    0.30        0.31
an          52   17     69    0.43    0.12        0.31
particip    49   26     75    0.41    0.19        0.22
α           26    0     26    0.22    0.00        0.22

By default, the function will “stem” words using the “porter” algorithm. For example, “sampl” will match “sample”, “samples” and “sampling”. If your text is not English, check SnowballC::getStemLanguages() for other supported languages, or set stem_language = FALSE.
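
For example, you can list the supported languages and turn stemming off entirely. A minimal sketch reusing the arguments from above (words_unstemmed is just an illustrative name):

SnowballC::getStemLanguages()

words_unstemmed <- distinctive_words(
  text = train$text,
  classification = train$power_computation,
  n = 10,
  stem_language = FALSE
)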

You can get rid of words that you think will be irrelevant (even if they are predictive of classification in this data set) by adding them to stop_words. The tidytext::stop_words object gives you a list of common stop words, but this includes words like “above”, “according”, or “small”, so use this with caution.
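
For example, a minimal sketch that takes the SMART lexicon from tidytext::stop_words, keeps a few words that may matter here, and passes the rest to stop_words (this assumes tidytext is installed; words_filtered is just an illustrative name):

# take the SMART lexicon, but keep words that could be predictive here
smart <- tidytext::stop_words$word[tidytext::stop_words$lexicon == "SMART"]
my_stop_words <- setdiff(smart, c("above", "according", "small"))

words_filtered <- distinctive_words(
  text = train$text,
  classification = train$power_computation,
  n = 10,
  stop_words = my_stop_words
)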

The “###” value represents any number (the default setting for the numbers argument). We can set the numbers argument to “specific” to see if there are any specific numbers associated with power analyses.

words <- distinctive_words(
  text = train$text,
  classification = train$power_computation,
  n = 10,
  numbers = "specific",
  stop_words = c("the", "a", "of", "an", "and")
)
word      n_0  n_1  total  freq_0  freq_1  difference
size       29   96    125    0.21    0.80        0.59
sampl      32   68    100    0.23    0.57        0.34
effect     42   74    116    0.30    0.62        0.31
particip   26   49     75    0.19    0.41        0.22
α           0   26     26    0.00    0.22        0.22
analysi    24   45     69    0.17    0.38        0.20
80          6   29     35    0.04    0.24        0.20
in        105   67    172    0.76    0.56        0.20
detect     27   46     73    0.19    0.38        0.19
we         46   62    108    0.33    0.52        0.19

Code text features

Next, code the features of your ground truth text using text_features(). This will give you a data frame that codes 0 or 1 for the absence or presence of each word or feature.

  • word_count defaults to TRUE, and returns the number of words in each text string.
  • has_number defaults to TRUE, and checks for any number in your text. If “###” is in your words list, this will be automatically set to TRUE.
  • has_symbol is a named vector of non-word strings (use regex) that you want to detect.
  • values defaults to “presence” and returns 0 or 1 for the presence of a word in each text string, while “count” returns the number of occurrences of the word per string.
has_symbols <- c(has_equals = "=", 
                 has_percent = "%")

features <- text_features(
  text = train$text, 
  words = words$word,
  word_count = FALSE, 
  has_number = TRUE,
  has_symbol = has_symbols, 
  values = "presence" # presence or count
)

# show the first row
features[1, ] |> str()
#> 'data.frame':    1 obs. of  13 variables:
#>  $ has_number : num 0
#>  $ has_equals : num 0
#>  $ has_percent: num 0
#>  $ size       : num 0
#>  $ sampl      : num 1
#>  $ effect     : num 0
#>  $ particip   : num 0
#>  $ α          : num 0
#>  $ analysi    : num 0
#>  $ 80         : num 0
#>  $ in         : num 0
#>  $ detect     : num 0
#>  $ we         : num 0
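
If counts are more useful to your model than presence/absence, a minimal sketch with the same arguments and values = "count" (features_count is just an illustrative name):

features_count <- text_features(
  text = train$text,
  words = words$word,
  word_count = FALSE,
  has_number = TRUE,
  has_symbol = has_symbols,
  values = "count" # occurrences per string instead of 0/1
)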

Train a model

You can then use this feature data to train a model. Here, we’re using a simple binomial logistic regression to predict the classification from all of the features.

# Train logistic regression model
model <- glm(train$power_computation ~ .,
             data = features,
             family = "binomial")

summary(model)
#> 
#> Call:
#> glm(formula = train$power_computation ~ ., family = "binomial", 
#>     data = features)
#> 
#> Coefficients:
#>               Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)   -2.03224    0.38135  -5.329 9.87e-08 ***
#> has_number     0.69528    0.40851   1.702   0.0888 .  
#> has_equals     0.39358    0.43149   0.912   0.3617    
#> has_percent    0.10123    0.51281   0.197   0.8435    
#> size           1.60936    0.49690   3.239   0.0012 ** 
#> sampl         -0.08325    0.44324  -0.188   0.8510    
#> effect        -0.51222    0.44972  -1.139   0.2547    
#> particip       0.37127    0.39893   0.931   0.3520    
#> α             18.11996 1144.10286   0.016   0.9874    
#> analysi        0.86501    0.38961   2.220   0.0264 *  
#> `80`           0.78556    0.58414   1.345   0.1787    
#> `in`          -0.17688    0.33034  -0.535   0.5923    
#> detect         0.67571    0.46059   1.467   0.1424    
#> we             0.51235    0.33593   1.525   0.1272    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 357.66  on 258  degrees of freedom
#> Residual deviance: 244.87  on 245  degrees of freedom
#> AIC: 272.87
#> 
#> Number of Fisher Scoring iterations: 17

You can use any model you like and any method to assess and choose the best model.
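
For example, a minimal sketch comparing the full model to a smaller one by AIC; the reduced formula is just for illustration, using a few of the stronger features from the summary above:

# a smaller model with only a few features
model_small <- glm(train$power_computation ~ has_number + size + analysi,
                   data = features,
                   family = "binomial")

# lower AIC suggests a better trade-off between fit and complexity
AIC(model, model_small)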

Predict classification

Now you can classify any text using this model. First, we’re going to predict the classification of the original training data. Use text_features() to get the feature data and predict() to return the model response. The response is on the log-odds (link) scale, so a threshold of 0 corresponds to a probability of 0.5; compare the response to this threshold to generate the predicted classification.

train$model_response <- predict(model, features)
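
If you prefer to work on the probability scale, plogis() converts the log-odds response to a probability; a threshold of 0 on the link scale corresponds to 0.5 here (model_prob is just an illustrative name):

train$model_prob <- plogis(train$model_response)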

You can see that it’s just below 0 that Yes values become more probable than No values. Remember, this will change a bit depending on your sample, so you don’t need to optimize the threshold to two decimal places or anything.

[!NOTE] You need to consider the use of your classification when setting a threshold. Do you want to be over-inclusive and have another step to weed out false positives? Or under-inclusive and be absolutely sure that you only classify a sentence as having a power analysis when it definitely does? Or do you want to balance these two types of error?
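
A minimal sketch for checking that trade-off at a candidate threshold (variable names are just for illustration):

threshold <- 0
pred <- train$model_response > threshold

# sensitivity: proportion of true power sentences detected
mean(pred[train$power_computation == 1])

# specificity: proportion of non-power sentences correctly rejected
mean(!pred[train$power_computation == 0])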

train$power_computation_predict <- train$model_response > 0

dplyr::count(train, 
             power_computation, 
             power_computation_predict)
#> # A tibble: 4 × 3
#>   power_computation power_computation_predict     n
#>               <dbl> <lgl>                     <int>
#> 1                 0 FALSE                       115
#> 2                 0 TRUE                         24
#> 3                 1 FALSE                        35
#> 4                 1 TRUE                         85

Now you should test this on a new set of data.

test_features <- text_features(
  text = test$text, 
  words = words$word,
  word_count = FALSE, 
  has_number = TRUE,
  has_symbol = has_symbols, 
  values = "presence" # presence or count
)
test$model_response <- predict(model, test_features)

test$power_computation_predict <- test$model_response > 0

dplyr::count(test, 
             power_computation, 
             power_computation_predict)
#> # A tibble: 4 × 3
#>   power_computation power_computation_predict     n
#>               <dbl> <lgl>                     <int>
#> 1                 0 FALSE                       109
#> 2                 0 TRUE                         43
#> 3                 1 FALSE                        17
#> 4                 1 TRUE                         90
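
From this confusion table you can compute whatever metric matters most for your use case, for example overall accuracy:

# proportion of test sentences classified correctly
mean(test$power_computation_predict == (test$power_computation == 1))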

Create a module

To create a module for this, save your model to a file and write a module script that loads it.

saveRDS(model, "power/power_log_model.Rds")

#' Power Log Model
power_log_model <- function(paper) {
  # initial text search
  table <- paper |>
    papercheck::search_text("power(ed)?\\b") |>
    papercheck::search_text("(\\.[0-9]|[0-9]%)")

  # get text features
  words <- c("size", "sampl", "effect", "particip", "α", "analysi", "80",
             "in", "detect", "we")

  has_symbols <- c(has_equals = "=",
                   has_percent = "%")

  features <- papercheck::text_features(
    text = table$text,
    words = words,
    word_count = FALSE,
    has_number = TRUE,
    has_symbol = has_symbols,
    values = "presence" # presence or count
  )

  # predict response from model
  model <- readRDS("power_log_model.Rds")
  table$model_response <- predict(model, features)
  table$power_computation_predict <- table$model_response > 0

  # summarise by paper
  summary_table <- table[table$power_computation_predict, ] |>
    dplyr::count(id, name = "n_power")

  list(
    table = table,
    summary = summary_table,
    na_replace = 0
  )
}

Now you can use module_run() to run the module on a new paper or set of papers.

m <- demodir() |> read_grobid() |>
  module_run("power/power_log_model.R")
m$table[, c("id", "model_response", "power_computation_predict", "text")]
#> # A tibble: 2 × 4
#>   id     model_response power_computation_predict text                          
#>   <chr>           <dbl> <lgl>                     <chr>                         
#> 1 prereg         -0.256 FALSE                     The efficiency benefit can be…
#> 2 prereg          0.290 TRUE                      They require a smaller sample…
m$summary
#>         id n_power
#> 1 eyecolor       0
#> 2   incest       0
#> 3   prereg       1