End-to-end workflow for discovering ServCat references, downloading files, screening ServCat and local files, ingesting verified sources, and building a ragnar DuckDB index.
Usage
build_rag_store(
  topic = "custom_rag",
  store_dir = RAG_STORE_DIR,
  store_location = file.path(store_dir, RAG_STORE_FILENAME),
  secure = FALSE,
  reference_ids = NULL,
  search_terms = DEFAULT_SEARCH_TERMS,
  screening_terms = DEFAULT_SCREENING_TERMS,
  min_screening_term_hits = if (length(screening_terms) > 0) 1L else 0L,
  servcat_top = 100,
  max_servcat_queries = Inf,
  servcat_max_pages = Inf,
  overwrite_downloads = FALSE,
  overwrite_store = TRUE,
  include_servcat_metadata = FALSE,
  local_file_path = NULL,
  urls = DEFAULT_URLS,
  embedding = c("azure-openai-small", "openai-small", "ollama-default", "none"),
  store_name = make_store_identifier(topic),
  store_title = paste0(topic, " RAG Store"),
  target_size = 1800L,
  target_overlap = 0.35,
  segment_by_heading_levels = c(1L, 2L),
  screening_cache_dir = NULL,
  use_screening_cache = TRUE,
  refresh_screening_cache = FALSE,
  keep_store_open = FALSE,
  verbose = TRUE
)
Arguments
- topic
Descriptive topic name.
- store_dir
Directory for store artifacts and logs.
- store_location
Path to the DuckDB-backed ragnar store.
- secure
Logical passed to ServCat functions.
- reference_ids
Optional explicit ServCat reference IDs.
- search_terms
Character vector of ServCat Quick Search terms. If empty, build_rag_store() skips ServCat search; explicit reference_ids can still be downloaded.
- screening_terms
Character vector of terms used to screen URL text, downloaded ServCat files, and local files before ingestion.
- min_screening_term_hits
Minimum number of screening_terms that must be found for downloaded ServCat files and local files to pass screening.
- servcat_top, max_servcat_queries, servcat_max_pages
ServCat paging limits.
- overwrite_downloads, overwrite_store
Overwrite controls.
- include_servcat_metadata
Logical; ingest ServCat metadata for verified references.
- local_file_path
Optional path to one or more directories containing local source files. If NULL or empty, build_rag_store() skips local file discovery, screening, and ingestion. Supplied directories are scanned recursively for supported document types using ragcat defaults; generated store artifacts are excluded automatically.
- urls
Optional character vector of webpage URLs to ingest directly.
- embedding
Embedding backend: azure-openai-small, openai-small, ollama-default, or none. The default, azure-openai-small, uses ragnar::embed_azure_openai() with model text-embedding-3-small, endpoint https://api-dev.ai.doi.net/, and API version 2024-02-15-preview.
- store_name, store_title
Store metadata.
- target_size, target_overlap, segment_by_heading_levels
Chunking controls.
- screening_cache_dir
Directory for cached Markdown/text used during file screening. Defaults to a .ragcat_cache/screening_text folder next to store_dir, so the cache survives deleting/rebuilding the store folder.
- use_screening_cache
Logical; reuse cached screening text when source content and chunking settings match.
- refresh_screening_cache
Logical; force reconversion and rewrite cache entries.
- keep_store_open
Logical; keep the writable ragnar connection open in the returned object. The default, FALSE, checkpoints and disconnects the DuckDB connection so the store is immediately visible on disk and can be reopened by ask_rag_store().
- verbose
Logical; emit progress messages.
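
As a rough illustration of how the arguments above combine, the sketch below sets up a local-only build: an empty search_terms vector skips ServCat search, and local_file_path points at a directory of source documents. It rests on several assumptions not confirmed here: that build_rag_store() is exported from a package named ragcat, that an empty urls vector skips URL ingestion, and that embedding = "none" builds the store without calling an embedding service. The topic, screening terms, and directory are hypothetical placeholders. The call only runs when the package and the directory are both available.

```r
# Hypothetical inputs: the topic, screening terms, and directory are placeholders.
screening_terms <- c("wetland", "hydrology")

build_args <- list(
  topic = "wetland_monitoring",
  search_terms = character(0),        # empty: ServCat search is skipped
  urls = character(0),                # assumption: empty vector skips URL ingestion
  local_file_path = file.path("docs", "reports"),
  screening_terms = screening_terms,
  # Same expression as the documented default, written out for clarity.
  min_screening_term_hits = if (length(screening_terms) > 0) 1L else 0L,
  embedding = "none",                 # assumption: no embedding service is called
  keep_store_open = FALSE             # documented default: checkpoint and disconnect
)

# Guarded call: assumes build_rag_store() is exported by a package named ragcat.
if (requireNamespace("ragcat", quietly = TRUE) &&
    dir.exists(build_args$local_file_path)) {
  store <- do.call(ragcat::build_rag_store, build_args)
}
```

Collecting the arguments in a list first makes it easy to inspect or log the configuration before committing to a build. Because keep_store_open is left at FALSE, the store should be written to disk and ready to be reopened by ask_rag_store() once the call returns.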
