Converts downloaded files to text and evaluates screening term matches before ingestion.

Usage

screen_servcat_downloads(
  downloads,
  screening_terms = DEFAULT_SCREENING_TERMS,
  min_screening_term_hits = if (length(screening_terms) > 0) 1L else 0L,
  target_size = 2200L,
  target_overlap = 0.15,
  segment_by_heading_levels = c(1L, 2L),
  screening_cache_dir = NULL,
  use_screening_cache = !is.null(screening_cache_dir),
  refresh_screening_cache = FALSE,
  verbose = FALSE
)

Arguments

downloads

Download log from download_servcat_files().

screening_terms

Character vector of required file-screening terms.

min_screening_term_hits

Minimum number of screening-term hits required. Defaults to 1L when screening_terms is non-empty and to 0L otherwise.

target_size, target_overlap, segment_by_heading_levels

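Chunking controls: the target chunk size, the target overlap between adjacent chunks, and the heading levels used to segment the text.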

screening_cache_dir

Optional directory for cached screening text.

use_screening_cache

Logical; use cached screening text when available. Defaults to TRUE when screening_cache_dir is supplied and to FALSE otherwise.

refresh_screening_cache

Logical; ignore existing cache entries and rewrite them.

verbose

Logical; emit progress messages.

Value

A tibble screening log.

Examples

if (FALSE) { # \dontrun{
downloads <- tibble::tibble(
  referenceId = 12345,
  resourceId = 67890,
  fileName = "report.pdf",
  localPath = file.path("data", "servcat_downloads", "12345", "report.pdf"),
  downloadLink = NA_character_,
  success = TRUE,
  error = NA_character_
)
screen_servcat_downloads(
  downloads,
  screening_terms = c("habitat", "survey"),
  verbose = TRUE
)
} # }
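
Two defaults in the Usage block, min_screening_term_hits and use_screening_cache, are expressions that depend on other arguments. The base-R sketch below reproduces only that default logic so it can be inspected without the package installed. resolve_defaults() is a hypothetical stand-in, not a package function, and it uses character() in place of DEFAULT_SCREENING_TERMS, whose contents are not shown here.

```r
# Hypothetical helper (not part of the package): it copies the two default
# expressions from the Usage block and returns the values they resolve to.
resolve_defaults <- function(
  screening_terms = character(),  # assumption: stands in for DEFAULT_SCREENING_TERMS
  min_screening_term_hits = if (length(screening_terms) > 0) 1L else 0L,
  screening_cache_dir = NULL,
  use_screening_cache = !is.null(screening_cache_dir)
) {
  list(
    min_screening_term_hits = min_screening_term_hits,
    use_screening_cache = use_screening_cache
  )
}

# Terms and a cache directory supplied: one hit is required, cache is used.
with_terms <- resolve_defaults(
  screening_terms = c("habitat", "survey"),
  screening_cache_dir = file.path(tempdir(), "screening_cache")
)

# No terms and no cache directory: zero hits required, cache is not used.
no_terms <- resolve_defaults(screening_terms = character())

str(with_terms)
str(no_terms)
```

Passing either argument explicitly overrides the computed default, for example min_screening_term_hits = 2L to require two hits.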