End-to-end workflow for discovering ServCat references, downloading files, screening ServCat and local files, ingesting verified sources, and building a ragnar DuckDB index.

Usage

build_rag_store(
  topic = "custom_rag",
  store_dir = RAG_STORE_DIR,
  store_location = file.path(store_dir, RAG_STORE_FILENAME),
  secure = FALSE,
  reference_ids = NULL,
  search_terms = DEFAULT_SEARCH_TERMS,
  screening_terms = DEFAULT_SCREENING_TERMS,
  min_screening_term_hits = if (length(screening_terms) > 0) 1L else 0L,
  servcat_top = 100,
  max_servcat_queries = Inf,
  servcat_max_pages = Inf,
  overwrite_downloads = FALSE,
  overwrite_store = TRUE,
  include_servcat_metadata = FALSE,
  local_file_path = NULL,
  urls = DEFAULT_URLS,
  embedding = c("azure-openai-small", "openai-small", "ollama-default", "none"),
  store_name = make_store_identifier(topic),
  store_title = paste0(topic, " RAG Store"),
  target_size = 1800L,
  target_overlap = 0.35,
  segment_by_heading_levels = c(1L, 2L),
  screening_cache_dir = NULL,
  use_screening_cache = TRUE,
  refresh_screening_cache = FALSE,
  keep_store_open = FALSE,
  verbose = TRUE
)

Arguments

topic

Descriptive topic name. Also used to derive the defaults for store_name and store_title.

store_dir

Directory for store artifacts and logs.

store_location

Path to the DuckDB-backed ragnar store.

secure

Logical passed to ServCat functions.

reference_ids

Optional explicit ServCat reference IDs.

search_terms

Character vector of ServCat Quick Search terms. If empty, build_rag_store() skips ServCat search; explicit reference_ids can still be downloaded.

screening_terms

Character vector of terms used to screen URL text, downloaded ServCat files, and local files before ingestion.

min_screening_term_hits

Minimum number of screening_terms that must be found for downloaded ServCat files and local files to pass screening. Defaults to 1 when screening_terms is non-empty and to 0 otherwise.
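The screening rule described above can be sketched as follows. This is an illustration only, not the package's implementation: the helper names count_screening_term_hits(), default_min_hits(), and passes_screening() are hypothetical, and case-insensitive literal matching is an assumption. Only the default rule for min_screening_term_hits is taken from the Usage signature.

```r
# Hypothetical helper: count how many distinct screening terms occur in a text.
# Case-insensitive, literal (non-regex) matching is an assumption.
count_screening_term_hits <- function(text, screening_terms) {
  if (length(screening_terms) == 0) return(0L)
  text <- tolower(paste(text, collapse = "\n"))
  found <- vapply(
    tolower(screening_terms),
    function(term) grepl(term, text, fixed = TRUE),
    logical(1)
  )
  sum(found)
}

# Same default rule as in the Usage signature:
# 1 when any screening terms are supplied, otherwise 0.
default_min_hits <- function(screening_terms) {
  if (length(screening_terms) > 0) 1L else 0L
}

# Hypothetical helper: a source passes when enough terms are found.
passes_screening <- function(text, screening_terms,
                             min_hits = default_min_hits(screening_terms)) {
  count_screening_term_hits(text, screening_terms) >= min_hits
}

passes_screening("Habitat survey of wetlands", c("habitat", "survey"))
```

With an empty screening_terms vector the threshold is 0, so in this sketch every source passes.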

servcat_top, max_servcat_queries, servcat_max_pages

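ServCat paging limits. Defaults: servcat_top = 100, max_servcat_queries = Inf, and servcat_max_pages = Inf, so the number of queries and pages is uncapped unless set.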

overwrite_downloads, overwrite_store

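Logical overwrite controls for existing downloads (overwrite_downloads, default FALSE) and an existing store (overwrite_store, default TRUE).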

include_servcat_metadata

Logical; ingest ServCat metadata for verified references.

local_file_path

Optional path to one or more directories containing local source files. If NULL or empty, build_rag_store() skips local file discovery, screening, and ingestion. Supplied directories are scanned recursively for supported document types using ragcat defaults; generated store artifacts are excluded automatically.

urls

Optional character vector of webpage URLs to ingest directly.

embedding

Embedding backend: azure-openai-small, openai-small, ollama-default, or none. The default, azure-openai-small, uses ragnar::embed_azure_openai() with model text-embedding-3-small, endpoint https://api-dev.ai.doi.net/, and API version 2024-02-15-preview.
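A multi-value default such as the one for embedding is commonly resolved with base R's match.arg(), which selects the first choice when the argument is not supplied. The sketch below assumes that idiom; the source only states that azure-openai-small is the default, and resolve_embedding() is a hypothetical name used for illustration.

```r
# Hypothetical wrapper showing how a choices vector resolves to one value.
# Assumes match.arg()-style resolution; not confirmed by the documentation.
resolve_embedding <- function(
    embedding = c("azure-openai-small", "openai-small", "ollama-default", "none")) {
  match.arg(embedding)
}

resolve_embedding()        # first choice is used when nothing is supplied
resolve_embedding("none")  # an explicit choice is returned unchanged
```

Under this idiom, a value outside the four listed choices raises an error rather than silently falling back to the default.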

store_name, store_title

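Store metadata. store_name defaults to make_store_identifier(topic); store_title defaults to paste0(topic, " RAG Store").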

target_size, target_overlap, segment_by_heading_levels

Chunking controls. Defaults: target_size = 1800L, target_overlap = 0.35, and segment_by_heading_levels = c(1L, 2L).
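To get a feel for how the default target_size and target_overlap in the Usage signature interact, the sketch below computes chunk start offsets for an idealized sliding window. It assumes target_size is measured in characters and target_overlap is the fraction shared between consecutive chunks. The real chunker may snap to headings and other boundaries, so actual chunks will differ; approx_chunk_starts() is a hypothetical helper.

```r
# Hypothetical helper: start offsets of an idealized sliding window.
# Assumes sizes are in characters and ignores boundary snapping.
approx_chunk_starts <- function(n_chars, target_size = 1800L, target_overlap = 0.35) {
  # Each new chunk advances by the non-overlapping share of target_size.
  stride <- round(target_size * (1 - target_overlap))
  if (n_chars <= target_size) return(1L)
  last_start <- n_chars - target_size + 1
  starts <- seq(1, last_start, by = stride)
  # Add a final window so the end of the document is covered.
  if (tail(starts, 1) < last_start) starts <- c(starts, last_start)
  as.integer(starts)
}

approx_chunk_starts(5000)
```

With the defaults the stride is 1170 characters, so a larger target_overlap yields more chunks for the same document.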

screening_cache_dir

Directory for cached Markdown/text used during file screening. Defaults to a .ragcat_cache/screening_text folder next to store_dir, so the cache survives deleting/rebuilding the store folder.
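One reading of "next to store_dir" is a sibling folder under the parent directory of store_dir. The sketch below builds that path; the exact construction is an assumption, and default_screening_cache_dir() is a hypothetical helper rather than a function documented here.

```r
# Hypothetical helper: place the cache beside store_dir, not inside it,
# so removing the store folder leaves the cache untouched.
default_screening_cache_dir <- function(store_dir) {
  file.path(dirname(store_dir), ".ragcat_cache", "screening_text")
}

default_screening_cache_dir(file.path("data", "example_rag_store"))
```

Pass screening_cache_dir explicitly if the cache should live somewhere else.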

use_screening_cache

Logical; reuse cached screening text when source content and chunking settings match.

refresh_screening_cache

Logical; force reconversion and rewrite cache entries.

keep_store_open

Logical; keep the writable ragnar connection open in the returned object. The default, FALSE, checkpoints and disconnects the DuckDB connection so the store is immediately visible on disk and can be reopened by ask_rag_store().

verbose

Logical; emit progress messages.

Value

Invisibly, a list containing the store connection, manifest, logs, and verified sources.

Examples

if (FALSE) { # \dontrun{
build <- build_rag_store(
  topic = "Example topic",
  store_dir = file.path("data", "example_rag_store"),
  local_file_path = file.path("data", "local_sources"),
  screening_terms = c("habitat", "survey")
)
} # }