Merchandising Pipeline Configuration File Guide
Who this guide is for
Anyone who needs to set up or change a configuration file for the merchandising pipeline. You do not need to know any programming language to use this guide. If you can edit a YAML file in a text editor, you have everything you need.
What this guide covers
Every setting you can put in a pipeline configuration file: what it does, when to use it, what each option means, and what the pipeline does at every stage when it processes your products. This is the complete reference. It is long because it covers everything; you do not have to read it end-to-end. Use the table of contents to jump to what you need.
Table of Contents
- The basics of configuration files
- What the pipeline does to each row of your spreadsheet
- Matching your spreadsheet columns to pipeline fields
- Research: gathering product information
- Search workflows: ready-made research recipes
- Customizing your own search steps
- Scraping product pages
- Generating product content
- Quality controls: rules, examples, and length limits
- Running the same generation step twice with different settings
- Find-and-replace terms
- Protecting manufacturer part numbers (MPN)
- Category-driven attributes
- Marketplace channels (Amazon, eBay, Walmart)
- A complete working example
- Glossary of terms
1. The basics of configuration files
Every run of the merchandising pipeline is controlled by a configuration file. This file is written in YAML, a plain-text format that uses indentation (spaces, not tabs) to show how settings are grouped together.
You do not edit any application code. You edit one YAML file. The pipeline reads that file at the start of every run and uses it to decide what research to perform, what content to generate, how to validate the results, and how to adapt content for marketplace channels.
Two starter files we provide
| Filename | When to use it |
|---|---|
| example_config.yaml | A full-featured starting point. Contains every available setting at least once: multi-step research with fallback strategies, several generation steps, smart scrapers, find-and-replace terms, attribute generation, and marketplace channels for Amazon, eBay, and Walmart. Copy this when you need the complete feature set; you can always delete the parts you don't need. |
| example_scraper_config.yaml | A simpler starting point for a search-scrape-generate workflow. Defines one research step, one smart scraper, and four generation steps. No fallback strategies, no marketplace channels, no replacement terms. Copy this when you want a straightforward pipeline without the extra machinery. |
Both files are valid and ready to run. Pick the one closest to what you need and customize from there.
The top-level structure
Every configuration file must have one top-level setting: apiConfig. Everything else lives indented underneath it. Anything you put outside apiConfig is ignored when the pipeline runs.
```yaml
apiConfig:  # All of your settings go here, indented underneath apiConfig.
  generationEndpoints:
    - "generate_title"
  generate_title:
    rules:
      - "Keep it under 60 characters."
    max_characters: 60
```

Underneath apiConfig you will add a mix of three kinds of keys:
1. Orchestration keys — these tell the pipeline what to do, and in what order.
| Key | What it does |
|---|---|
| generationEndpoints | The ordered list of content generation steps you want to run (for example: title first, then description, then bullets). Required if you want any content generated. Each name in this list should also have a matching settings block elsewhere in the file. |
| researchStepEndpoints | The list of research steps to run before content generation begins. Optional. If you leave it out, the pipeline skips research entirely and goes straight to generation. |
2. Settings blocks — one block per generation or research step. Each block holds the rules, examples, and limits for that step. If generationEndpoints includes "generate_title", you should also have a generate_title: block at the same indentation level holding that step's settings.
3. Cross-cutting keys — settings that apply across the whole pipeline.
| Key | What it does |
|---|---|
| replacements | Global find-and-replace terms applied to every piece of generated text. |
| fieldMapping | Tells the pipeline which of your spreadsheet columns hold which product fields. |
| thirdPartyChannels | Marketplace channel settings (Amazon, eBay, Walmart) for generating channel-specific content. |
| universal_search | Settings for web search during research. |
| scrape_results | Settings for scraping URLs that are already in your spreadsheet. |
| smart_scrape | A placeholder for the smart-scrape research step. Usually just {}. |
| get_images | A placeholder for the image-fetching research step. Usually just {}. |
| attribute_mapping_file_path | Set automatically by the system when you upload a category-to-attribute mapping CSV. You do not set this yourself. |
The rest of this guide walks through each of these in detail.
2. What the pipeline does to each row of your spreadsheet
When you submit a job, the pipeline reads your input CSV one row at a time and runs each row through five stages, in this fixed order:
| Stage | What happens |
|---|---|
| 1. Field mapping | Rename your spreadsheet columns to the field names the pipeline expects. |
| 2. Research | Look up product data on the web, in the internal catalog, in uploads, etc. |
| 3. Content generation | Write titles, descriptions, bullets, attributes, and categories. |
| 4. Category-driven attributes | If you uploaded an attribute mapping file, look up the category and generate that category's attributes in batches. |
| 5. Marketplace channels | If marketplace channels are enabled, re-run generation with Amazon/eBay/Walmart-specific rules. |
The pipeline then produces an output row containing your original input columns, the research results, and every piece of generated content.
Each row is processed independently, so the result for one row never depends on the result for another.
Here is what happens at each stage in plain language.
Stage 1 — Field mapping
If your configuration file contains a fieldMapping block, the pipeline renames your spreadsheet columns first, before any other work happens. This lets the rest of the pipeline use a single set of internal field names regardless of how your spreadsheet headers are written.
Columns in your spreadsheet that are not listed in fieldMapping are kept exactly as-is, just under their original names. If you have a product_name column but no mpn column, the pipeline automatically uses product_name wherever it would normally use mpn, so search templates that reference {mpn} keep working.
Stage 2 — Research
The pipeline runs research steps only when you list them under researchStepEndpoints. Each step group has a strategy that controls how its endpoints are executed:
- primary_then_fallback — Try the primary endpoint first; if it returns nothing, try each fallback in order until one succeeds.
- use_both — Run every endpoint in the group and store every result.
- standalone — Run each endpoint independently. This is the default.
- parallel — Currently behaves the same as use_both. (The name reflects the intent that one day these calls may be made concurrently.)
A step group can also declare a required field naming another endpoint. The group is only run if that named endpoint already produced results. After the search steps finish, any inline scrapers that you defined under universal_search.scrapers automatically run on the URLs that were discovered.
Stage 3 — Content generation
Each step listed in generationEndpoints runs in order. For each step, the pipeline picks up the matching settings block, generates content, checks the result against any length limits you set, and applies your find-and-replace terms. If a length-limit check fails, the pipeline retries the generation up to three more times before giving up and using the last result anyway.
Stage 4 — Category-driven attributes
This stage runs only if all three of the following are true:
- You uploaded an attribute mapping file with your job.
- categorization is in your generationEndpoints list.
- Either generate_attributes or extract_attributes is in your generationEndpoints list.
When all three are met, the pipeline reads the category that the categorization step produced, looks up that exact category in your mapping file, gathers the attributes the file says belong to that category, and generates them in groups of five.
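Put together, a configuration that satisfies the second and third conditions needs at least these two entries in its generation list (the attribute mapping file itself is uploaded with the job, and attribute_mapping_file_path is filled in automatically):

```yaml
apiConfig:
  generationEndpoints:
    - "categorization"        # produces the category to look up in the mapping file
    - "generate_attributes"   # generates the mapped attributes in groups of five
```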
Stage 5 — Marketplace channels
This stage runs only when thirdPartyChannels.enabled is set to true in your config file. The pipeline iterates over each marketplace you configured (any of: amazon, ebay, walmart) and re-runs the generation steps you specified for that channel using channel-specific rules and limits. Research is reused — no extra search calls are made — so this stage is purely about generating new variations of content. Results are saved into the output under channel-prefixed names like amazon_generate_title and ebay_generate_description.
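At minimum, enabling this stage looks like the sketch below. Only the enabled flag is shown here; the per-channel settings that sit alongside it are covered in the marketplace channels section of this guide:

```yaml
apiConfig:
  thirdPartyChannels:
    enabled: true
    # Channel-specific settings (amazon, ebay, walmart) go here;
    # see the marketplace channels section for their exact shape.
```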
3. Matching your spreadsheet columns to pipeline fields
The pipeline expects to find product information under specific internal field names like product_name, mpn, domain, and description. Your spreadsheet probably uses different headers like "Product Name", "MPN", "Manufacturer Website", and "Product Description". The fieldMapping block tells the pipeline how to translate one to the other.
Format
```yaml
apiConfig:
  fieldMapping:
    pipeline_field_name: "Your Spreadsheet Column Header"
```

The key on the left is the internal field name the pipeline expects. The value on the right is the header text as it appears in your spreadsheet.
Example
Suppose your spreadsheet looks like this:
| Product Name | MPN | Manufacturer Website |
|---|---|---|
| Office Chair | 12345-ABC | acme.com |
To make the pipeline read this spreadsheet, add this block:
```yaml
apiConfig:
  fieldMapping:
    product_name: "Product Name"
    mpn: "MPN"
    domain: "Manufacturer Website"
    description: "Product Description"
```

After this mapping is applied, the pipeline sees:
| product_name | mpn | domain |
|---|---|---|
| Office Chair | 12345-ABC | acme.com |
Things to know
- Field mapping is the very first thing that happens for every row. It runs before research, generation, or anything else.
- Unmapped columns are preserved. If your spreadsheet has columns that you do not list in fieldMapping (for example, an internal note column or a customer ID), they stay in the row unchanged.
- You only need to map columns that the pipeline cares about. If your headers happen to already match the expected names (product_name, mpn, domain, description), you do not need a fieldMapping block at all.
4. Research: gathering product information
Research is the stage where the pipeline collects information about each product before generating any content. The richer the research results, the better the generated titles, descriptions, bullets, and attributes will be.
There are five research sources you can use, one or more at a time:
| Source | What it does | Required input |
|---|---|---|
| universal_search | Web search using configurable query templates. Returns search hits (title, link, snippet). | Whatever fields your search templates reference (typically domain, mpn, product_name). |
| catalog_search | Searches the internal product catalog for matching manufacturer products. | product_name (or title), mpn, and domain. |
| data_source_upload | Returns data from a file you uploaded with the job. | The system must have pre-loaded the upload into the row. |
| get_images | Returns product image URLs. | The image URLs must already be in the row (typically from a prior catalog search or directly in the CSV). |
| scrape_results | A standalone scraper for when product URLs are already in your spreadsheet, rather than discovered via search. | A column containing the URL(s) to scrape. |
When researchStepEndpoints is omitted from your config, no research runs and the pipeline goes straight to generation.
How research is configured
Research is configured under the researchStepEndpoints key as a list of step groups. Each group bundles one or more research sources together with a strategy that says how they should be executed.
```yaml
apiConfig:
  researchStepEndpoints:
    - endpoints:
        - "universal_search"
        - "catalog_search"
        - "data_source_upload"
      strategy: "primary_then_fallback"
      primaryEndpoint: "universal_search"
      fallbacks:
        - "catalog_search"
        - "data_source_upload"
    - endpoints:
        - "get_images"
      strategy: "standalone"
      required: "universal_search"
```

Each step group accepts these fields:
| Field | Required? | What it does |
|---|---|---|
| endpoints | Yes | The names of the research sources in this group. |
| strategy | No (defaults to standalone) | One of: primary_then_fallback, use_both, parallel, standalone. See below. |
| primaryEndpoint | Only with primary_then_fallback | The endpoint to try first. |
| fallbacks | Only with primary_then_fallback | The ordered list of endpoints to try if the primary returns nothing. |
| required | No | Names another research source that must have already produced results. The whole group is skipped if that source did not. |
The four strategies
primary_then_fallback — Try the primary endpoint first. If it fails or returns nothing, try each fallback in order. Stop at the first fallback that succeeds. The result is stored under the primary endpoint's name. Use this when you have a preferred data source with one or more "if that fails, try this instead" alternatives.
```yaml
- endpoints:
    - "universal_search"
    - "catalog_search"
    - "data_source_upload"
  strategy: "primary_then_fallback"
  primaryEndpoint: "universal_search"
  fallbacks:
    - "catalog_search"
    - "data_source_upload"
```

use_both — Run every endpoint in the group and store every result independently. Each endpoint that returns data gets its own entry in the research results. Use this when you want to combine data from multiple sources, for example running both a web search and a catalog lookup so generation has both kinds of information.
```yaml
- endpoints:
    - "universal_search"
    - "catalog_search"
  strategy: "use_both"
```

parallel — Same behavior as use_both today. The name signals the intent that this group's endpoints may eventually run concurrently. Treat it as a synonym for use_both for now.
```yaml
- endpoints:
    - "universal_search"
    - "catalog_search"
  strategy: "parallel"
```

standalone — Run each endpoint in the group independently. This is the default if you do not specify a strategy. Functionally identical to use_both. Use this for endpoints that don't need fallback logic and don't share results with each other.
```yaml
- endpoints:
    - "get_images"
  strategy: "standalone"
```

The required field
The required field lets you say "only run this group if some other research source already produced results." It is useful for chains where one step depends on another.
```yaml
- endpoints:
    - "get_images"
  strategy: "standalone"
  required: "universal_search"
```

In this example, get_images only runs if universal_search already returned at least one result. If universal_search was skipped, failed, or returned nothing, the entire get_images group is skipped.
What each research source returns
| Source | What it returns |
|---|---|
| universal_search | A list of web search hits, each typically containing a link, a title, and a snippet. |
| catalog_search | A bundle of catalog data with matched products and image URLs from the internal manufacturer catalog. |
| data_source_upload | Whatever data was attached to the row from your upload. The structure depends on the upload format. |
| get_images | A list of product image URLs, drawn from earlier catalog or upload results. |
| scrape_results | Scraped page data, organized by scraper name. Each scraper produces its own named bundle of results. |
All of these results are available to the generation steps that come next, and they are also reused (without re-running) when marketplace channel generation happens later in the pipeline. Research happens once per row and is never repeated.
5. Search workflows: ready-made research recipes
Most users do not need to write search queries from scratch. The project ships with six ready-made search workflows that cover the most common research patterns. You reference one of them by name and the pipeline uses its built-in search steps automatically.
To use a workflow, set the workflow key inside your universal_search block:
```yaml
apiConfig:
  universal_search:
    workflow: "standard_manufacturer_search"
```

When the pipeline sees a workflow key, it loads that workflow's search steps and ignores any inline searchSteps you may have written. The workflow's settings replace your inline settings.
The six built-in workflows
| Workflow name | Description | Steps | When to use it |
|---|---|---|---|
| standard_manufacturer_search | Searches the manufacturer's domain by part number, then by part number plus product name, then a fallback domain, then an open web search. | 4 | The default for most products. |
| multi_domain_search | Searches across multiple domains at once, then falls back to an open web search. | 3 | Products sold on several distributor websites. |
| aggressive_search | Tries every strategy: single domain, fallback domain, multi-domain, and open search, in sequence. | 7 | Hard-to-find products where earlier, more targeted searches are likely to miss. |
| simple_mpn_search | One step: search the open web for the part number alone. | 1 | Quick lookups when you only have a part number. |
| domain_focused_search | Searches only the primary manufacturer domain using three different query patterns. | 3 | When you are confident the product is on a specific manufacturer's website and want to avoid open-web noise. |
| standard_manufacturer_search_with_scraping | The same four steps as the standard workflow, plus two built-in scrapers that automatically extract product data and images from the discovered pages. | 4 + scrapers | When you want both search and scraping bundled into a single reference. |
Each workflow expects specific spreadsheet columns to be present. Here is what each one needs.
standard_manufacturer_search
The default workflow for most product research. It progressively widens the search until something hits.
| Step | What it searches |
|---|---|
| 1 | Site search on the manufacturer domain by part number. |
| 2 | Site search on the manufacturer domain by part number plus product name. |
| 3 | Site search on the fallback domain by part number. |
| 4 | Open web search by part number plus product name. |
Required spreadsheet columns: domain, mpn, product_name, fallback_domain
When to use it: This is the recommended starting point. It first looks at the manufacturer's website by part number, then adds the product name for broader matching, then tries a fallback domain, then performs an open web search. Because it stops at the first step that finds anything, it returns quickly when the product is well-known.
multi_domain_search
Searches across several domains at the same time using one combined query.
| Step | What it searches |
|---|---|
| 1 | Combined site search across up to three domains by part number. |
| 2 | Combined site search across up to three domains by part number plus product name. |
| 3 | Open web search by part number plus product name. |
Required spreadsheet columns: domains (a list column with up to three domain strings), mpn, product_name
When to use it: When a product is sold by several distributors or retailers and you want to search all of their websites in a single query before falling back to the open web.
aggressive_search
Combines every available search strategy into one workflow with seven steps.
| Step | What it searches |
|---|---|
| 1 | Manufacturer domain by part number. |
| 2 | Manufacturer domain by part number plus product name. |
| 3 | Fallback domain by part number. |
| 4 | Fallback domain by part number plus product name. |
| 5 | Multiple domains combined by part number. |
| 6 | Multiple domains combined by part number plus product name. |
| 7 | Open web search by part number plus product name. |
Required spreadsheet columns: domain, mpn, product_name, fallback_domain, domains (list)
When to use it: For hard-to-find products where the more targeted searches are likely to miss. This workflow tries single-domain, fallback-domain, multi-domain, and open search patterns in sequence. Because it stops at the first step that succeeds, it still returns quickly when the early steps work, but the breadth of strategies maximizes the chance of finding something.
simple_mpn_search
The simplest workflow — a single open web search by part number.
| Step | What it searches |
|---|---|
| 1 | Open web search by part number alone. |
Required spreadsheet columns: mpn
When to use it: When you only have a part number and no domain information. Also useful for quick lookups where speed matters more than precision.
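As a sketch, a minimal end-to-end configuration built around this workflow might combine it with a single generation step. The rule text and length limit below are illustrative values, not requirements:

```yaml
apiConfig:
  researchStepEndpoints:
    - endpoints:
        - "universal_search"
      strategy: "standalone"
  universal_search:
    workflow: "simple_mpn_search"
  generationEndpoints:
    - "generate_title"
  generate_title:
    rules:
      - "Keep it under 60 characters."
    max_characters: 60
```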
domain_focused_search
Restricts every search to the primary manufacturer domain with three different query patterns.
| Step | What it searches |
|---|---|
| 1 | Manufacturer domain by part number. |
| 2 | Manufacturer domain by part number plus product name. |
| 3 | Manufacturer domain by product name only. |
Required spreadsheet columns: domain, mpn, product_name
When to use it: When you are confident the product exists on a specific manufacturer's website and you want to avoid open-web results. Useful for manufacturers with extensive catalogs where you want only authoritative product data from the source.
standard_manufacturer_search_with_scraping
Identical search steps to standard_manufacturer_search, but with two built-in scrapers attached that automatically extract product data and images from any discovered pages.
Search steps: Same four steps as standard_manufacturer_search (see above).
Built-in scrapers:
| Scraper name | What it extracts | Pages scraped |
|---|---|---|
| product_data | Title, description, attributes/specs, and SKU for the given product name and part number. | Up to 3 |
| image_extraction | All image URLs and accessory information for the given product name. | Up to 1 |
Required spreadsheet columns: domain, mpn, product_name, fallback_domain
When to use it: When you want a single workflow reference that handles both search and scraping. Instead of defining search steps in the config and separately listing scrapers, this workflow bundles everything together, and the scrapers run automatically on the URLs found by the search steps.
Adding your own custom workflow
If none of the six built-in workflows fit, you can add a new one to the workflow templates file (workflow_templates.yaml). The format mirrors the inline format described in the next section.
To add a new workflow:
- Open workflow_templates.yaml.
- Add a new entry under the workflows: key with a unique name. That name is what you reference from your configuration file.
- Add a description explaining what the workflow does and when to use it.
- Add searchSteps as an ordered list of search step entries (each with a query template and a useFields mapping — see the next section).
- Set stopOnFirstSuccess to true (stop after the first hit) or false (run every step and combine results).
- Optionally add a scrapers list to attach inline scrapers to the workflow.
- In your configuration file, reference the new workflow by name:
```yaml
apiConfig:
  universal_search:
    workflow: "your_custom_workflow_name"
```

Example — a custom workflow that searches a distributor by SKU and falls back to open search:
```yaml
workflows:
  # ... existing workflows ...
  distributor_search:
    description: "Search a distributor website by SKU, then fall back to open web search by SKU + product name"
    searchSteps:
      - query: "site:{distributor_url} {sku}"
        useFields:
          distributor_url: "distributor_url"
          sku: "sku"
      - query: "{sku} {product_name}"
        useFields:
          sku: "sku"
          product_name: "product_name"
    stopOnFirstSuccess: true
```

This workflow expects distributor_url, sku, and product_name columns in the spreadsheet (or those names mapped via fieldMapping).
6. Customizing your own search steps
If you would rather define your own search steps directly in your configuration file instead of using a workflow, you can. Use the searchSteps key inside the universal_search block.
How a search step works
Every search step is a small object with two fields:
| Field | Required? | What it does |
|---|---|---|
| query | Yes | The query template. Contains placeholders like {mpn} and {domain} that will be replaced with values from each row. |
| useFields | Yes | A mapping that says which spreadsheet column supplies each placeholder. The key on the left is the placeholder name; the value on the right is the column name. |
Example
```yaml
universal_search:
  searchSteps:
    - query: "site:{domain} {mpn}"
      useFields:
        domain: "domain"
        mpn: "mpn"
    - query: "site:{domain} {mpn} {product_name}"
      useFields:
        domain: "domain"
        mpn: "mpn"
        product_name: "product_name"
```

The pipeline runs the steps in the order you list them. For each step, every {placeholder} in the query is replaced with the matching value from the spreadsheet row before the search runs.
Worked example: Given a row with domain = manufacturer.com and mpn = 12345, this step:
```yaml
- query: "site:{domain} {mpn}"
  useFields:
    domain: "domain"
    mpn: "mpn"
```

becomes the search query: site:manufacturer.com 12345
What happens when fields are missing
If a row is missing one of the fields the query needs (the value is empty or the column is absent), the entire search step is skipped for that row, and the pipeline moves on to the next step. You will see a log message indicating which fields were missing.
Searching multiple domains in one query
If a row has a list of several domains (e.g., the product is sold by three different distributors), you can search all of them in a single query using multi-domain expansion.
In useFields, set the num_domains value to how many domain placeholders you want, and point domain at the spreadsheet column that holds the list of domains:
```yaml
- query: "(site:{domain1} OR site:{domain2} OR site:{domain3}) {mpn}"
  useFields:
    domain: "domains"   # The column containing the list of domains
    num_domains: 3      # How many domain placeholders to fill
    mpn: "mpn"
```

The pipeline expects the domains column to contain a list of domain strings. It creates numbered placeholders (domain1, domain2, domain3, and so on), filling each with the next domain in the list. If the list has fewer entries than num_domains, the extra placeholders are left empty.
Worked example: Given domains = ["site1.com", "site2.com", "site3.com"] and mpn = 12345, the resulting query is:
(site:site1.com OR site:site2.com OR site:site3.com) 12345
Stopping early or running every step
The stopOnFirstSuccess flag controls whether the pipeline stops at the first successful step or runs every step.
```yaml
universal_search:
  searchSteps: [...]
  stopOnFirstSuccess: true  # or false
```

| Value | Behavior |
|---|---|
| true (default) | Steps run in order. As soon as one step returns results, those results are kept and no further steps run. Best when your earlier steps are more targeted and your later steps are broader fallbacks. |
| false | Every step runs, regardless of whether earlier steps succeeded. Results from each successful step are collected together. |
If you do not set stopOnFirstSuccess, it defaults to true.
What happens when no search steps fit
If you do not define any searchSteps and you do not reference a workflow, or if every step gets skipped because of missing fields, the pipeline falls back to a simple search built from the query, product_name, or title field of the row, in that order of preference.
Workflow vs. inline searchSteps
If you set both a workflow and an inline searchSteps list inside universal_search, the workflow wins. The inline searchSteps are ignored. Pick one or the other, not both.
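For example, in a config like the following, only the workflow's built-in steps run and the inline step is silently ignored:

```yaml
universal_search:
  workflow: "standard_manufacturer_search"  # this wins
  searchSteps:                              # ignored because workflow is set
    - query: "{mpn}"
      useFields:
        mpn: "mpn"
```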
7. Scraping product pages
Scrapers extract structured information from product web pages. There are two ways to use them:
- Inline scrapers, attached to your universal_search block, run automatically on URLs that the search steps discover.
- Standalone scraping (scrape_results), which scrapes URLs that are already in your spreadsheet, without doing a search first.
Both use the same scraper settings.
Inline scrapers (search-then-scrape)
When you list one or more scrapers inside your universal_search block, they automatically run on the URLs discovered by the search steps. This bundles search and extraction into a single research operation.
```yaml
universal_search:
  searchSteps: [...]
  stopOnFirstSuccess: true
  scrapers:
    - type: smart_scraper
      name: product_data
      prompt: >
        For the product: {product_name} Number: {mpn}
        extract ALL product data including title, description,
        attributes/specs, and SKU.
      maxResults: 3
      retries: 2
      timeout: 120
    - type: smart_scraper
      name: image_extraction
      prompt: >
        Extract all image URLs from the page for {product_name}.
      maxResults: 1
```

Each scraper entry has these fields:
| Field | Required? | Default | What it does |
|---|---|---|---|
| type | Yes | — | The kind of scraper to use. Currently supported: smart_scraper. |
| name | Yes | — | A unique name for this scraper's output (for example, product_data or image_extraction). Each scraper produces its own named bundle of results. |
| prompt | Yes | — | The instructions sent to the scraper telling it what to extract. Supports {placeholder} template variables that are filled in from your spreadsheet columns at run time (for example, {product_name} and {mpn}). |
| maxResults | No | 3 | The maximum number of search-result URLs to scrape. The pipeline takes the top N URLs from the search results and runs the scraper on each. |
| retries | No | 2 | How many times to retry on failure or timeout. |
| timeout | No | 120 | How long to wait per scrape, in seconds. |
Prompt placeholders. Just like search query templates, the prompt field can contain {placeholder} variables. These are filled in from your spreadsheet columns for each row, so you can tailor the prompt to the specific product being scraped.
Multiple scrapers, one set of pages. You can list multiple scraper entries with different names to run different extraction prompts against the same set of discovered pages. Each scraper runs independently, has exactly one prompt, and produces its own named bundle of results.
Standalone scraping (URLs already in your spreadsheet)
If your spreadsheet already contains the product URLs you want to scrape, you can skip the search step entirely and use the scrape_results block to scrape directly.
```yaml
apiConfig:
  researchStepEndpoints:
    - endpoints: [scrape_results]
      strategy: standalone
  scrape_results:
    urlField: "product_url"
    scrapers:
      - type: smart_scraper
        name: product_data
        prompt: "Extract all product data including title, description, specs..."
        retries: 2
        timeout: 120
```

Fields under scrape_results:
| Field | Required? | Default | What it does |
|---|---|---|---|
| urlField | No | "url" | The spreadsheet column that contains the URL to scrape. The value can be a single URL string or a list of URLs. |
| scrapers | Yes | — | A list of scraper entries. Same shape as inline scrapers (see above), except maxResults does not apply because URLs come from the spreadsheet, not from search. |
How standalone scraping decides what to do:
- The pipeline reads the column named by urlField (defaults to "url" if you do not set it).
- If the value is missing or empty for that row, scraping is skipped.
- If the value is a single URL, it is treated as a one-item list. If it is a list, it is used as-is.
- Each scraper in the scrapers list runs against every URL in the list.
- Results are bundled by scraper name (one bundle per scraper, holding one entry per URL).
Key difference from inline scrapers: Inline scrapers run on URLs discovered by search steps. Standalone scraping runs on URLs that are already in your spreadsheet. Both use the same scraper settings.
8. Generating product content
Content generation is the heart of the pipeline. Each entry in your generationEndpoints list is a separate generation step. The pipeline runs them in the order you list them, and each one writes one piece of content for every row.
The standard generation steps
| Step name | What it produces |
|---|---|
generate_title | A product title (a string). |
generate_description | A product description (a string). |
generate_bullets | Bullet points (a list of strings). |
categorization | A product category (structured data — typically a dictionary or string). |
generate_attributes | Product attributes such as color, material, weight (a dictionary of attribute names to values). |
extract_attributes | Attributes pulled from the existing description rather than generated freshly (a dictionary). |
rewrite_product | A rewritten version of an existing product description. |
generate_validator | Specialized validation content. |
product_variant_field_standardization | Standardized variant field text. |
How a generation step is configured
Each step name in generationEndpoints should also have a settings block with the same name elsewhere in your apiConfig. For example, if generationEndpoints contains "generate_title", you need a generate_title: block.
| Field | Type | What it does |
|---|---|---|
rules | List of strings | Plain-language instructions sent to the model during generation. These guide the writing style, tone, structure, and content. Examples: "Keep under 60 characters", "Begin with the main keyword", "Use a friendly tone". |
examples | List of strings | Sample outputs that show the model what good results look like. The model uses them as guidance. |
max_characters | Whole number | A hard maximum on the length of the result. If the result is longer, the pipeline retries the step. |
min_characters | Whole number | A hard minimum on the length of the result. If the result is shorter, the pipeline retries the step. |
endpoint | String | An override that lets you call a built-in generation step under a custom name. See section 10. |
model_id | String | A custom taxonomy name. Only used by categorization. Leave it out to use the default taxonomy. |
attribute_list | List of strings | The list of attribute names to generate or extract. Used by generate_attributes and extract_attributes. |
replacements | Mapping of strings | Find-and-replace terms specific to this step. Combined with the global replacements (see section 11). |
A simple example
```yaml
apiConfig:
  generationEndpoints:
    - "generate_title"
    - "generate_description"
  generate_title:
    rules:
      - "Keep under 60 characters."
      - "Lead with the brand name."
    examples:
      - "Acme Pro 2000 Ergonomic Office Chair, Black"
    max_characters: 60
    min_characters: 10
  generate_description:
    rules:
      - "Write 3 to 4 sentences."
      - "Highlight key benefits, not just features."
    max_characters: 500
    min_characters: 100
```

Categorization is a little different
The categorization step has a few special rules:
- It accepts an optional `model_id` field that points at a custom category taxonomy. Leave it out to use the default.
- It still uses `rules` and `examples` like the other steps, plus your product's title, description, and image.
- It does not use `max_characters` or `min_characters`. Categorization returns structured data, not plain text, so length checks don't apply. Don't set those fields on a `categorization` block.
```yaml
categorization:
  model_id: "my-custom-taxonomy"  # optional
  rules:
    - "Pick the most specific applicable category."
    - "Use both the title and the description when deciding."
  examples:
    - "Electronics > Computers > Laptops"
```

Generating versus extracting attributes
Two different steps handle product attributes, and they behave differently:
- `generate_attributes` writes attribute values based on whatever product information is available, even if those values are not stated explicitly in the source text. It can fill in attributes the model can reasonably infer.
- `extract_attributes` only pulls attribute values that are explicitly mentioned in the existing product description. If a value is not stated, it stays empty.
Both steps use the same settings, including an attribute_list field that names the attributes to work with.
```yaml
generate_attributes:
  rules:
    - "Generate detailed product attributes."
  examples:
    - "Color: Black, Material: Aluminum, Weight: 2.5 lbs"
  attribute_list:
    - "Color"
    - "Material"
    - "Weight"
    - "Dimensions"

extract_attributes:
  rules:
    - "Extract only attributes explicitly mentioned in the description."
  attribute_list:
    - "Brand"
    - "Model Number"
    - "Voltage"
```

9. Quality controls: rules, examples, and length limits
The pipeline supports two different kinds of quality control on every generation step.
Two kinds of rules
1. Plain-language rules. These are the entries in the rules list. They are passed to the model as instructions during generation. Examples: "Keep under 60 characters", "Begin with the brand name", "Focus on key benefits". These influence how the model writes, but they are not strictly enforced after the fact — the model does its best to follow them.
2. Hard length limits. These are max_characters and min_characters. They are enforced after generation. If the result is too long or too short, the result is rejected and the step is retried.
You can use both kinds at once. The plain-language rules guide the writing style; the length limits act as a safety net.
How retries work
When a length limit is set and the generated content fails to meet it, the pipeline tries again. Here is the exact sequence:
- The pipeline calls the generation step to write the content.
- If no length limits are set on this step, the pipeline accepts the result immediately and moves on.
- If length limits are set, the pipeline checks the result against them.
- If the result passes, the pipeline accepts it and moves on.
- If the result fails, and there are still retries available, the pipeline logs the failure and tries again from step 1.
- If the result fails and the retry budget is exhausted, the pipeline keeps the last result anyway, logs a warning, and moves on. Nothing is silently dropped.
By default the pipeline tries the original generation plus three additional retries for a total of four attempts.
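The retry sequence above can be sketched in Python (an illustrative sketch assembled from the description; the function and parameter names, and the default of three retries, are assumptions, not the pipeline's actual code):

```python
def generate_with_retries(generate, min_chars=None, max_chars=None, retries=3):
    """Sketch of the retry loop: one initial attempt plus `retries` extra tries."""
    result = generate()
    if min_chars is None and max_chars is None:
        return result                      # no limits set: accept immediately
    for _ in range(retries):
        too_short = min_chars is not None and len(result) < min_chars
        too_long = max_chars is not None and len(result) > max_chars
        if not (too_short or too_long):
            return result                  # passes the length checks
        result = generate()                # log the failure and try again
    return result                          # budget exhausted: keep the last result
```

Note the last line: when every attempt fails, the last result is kept anyway, matching the "nothing is silently dropped" behavior.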
When length limits do not apply
The categorization step does not use max_characters or min_characters because its output is structured data, not plain text. If you put length limits on a categorization block, they will be ignored — but it is cleaner to leave them out.
The same is true for generate_attributes and extract_attributes, which return a dictionary of attribute values rather than a single piece of text.
What you see in the output
If a generated value fails its length checks even after every retry, you will get the value the model produced on its last attempt. The output is never empty just because validation failed — you always get something. This is on purpose: the pipeline assumes a not-quite-perfect result is more useful than an empty cell.
10. Running the same generation step twice with different settings
Sometimes you want two versions of the same kind of content with different rules — for example, a short description and a long description. The pipeline supports this through an aliasing pattern.
How it works
You list a custom name in generationEndpoints, then create a settings block with that custom name and use the endpoint field to point at the underlying built-in step.
```yaml
apiConfig:
  generationEndpoints:
    - "short_description"
    - "long_description"
  short_description:
    endpoint: "generate_description"  # use the description generator
    rules:
      - "Keep it concise, under 100 words."
    max_characters: 500
    min_characters: 50
  long_description:
    endpoint: "generate_description"  # same generator, different rules
    rules:
      - "Be detailed and comprehensive."
      - "Include technical specifications."
    max_characters: 3000
    min_characters: 1000
```

In the output, both `short_description` and `long_description` will appear as separate columns. Both are produced by the description generator, but each follows its own rules and length limits.
When to use this pattern
- When you want multiple lengths of the same kind of content (short vs. long descriptions).
- When you want the same step to produce content for different audiences with different tones.
- When you need separate versions for separate downstream uses (for example, one for your website and a different one for a partner export).
When the endpoint field is not needed
If your settings block name already matches a built-in generation step name (generate_title, generate_description, generate_bullets, categorization, generate_attributes, extract_attributes, etc.), you do not need to set endpoint. The pipeline finds the matching step automatically. The endpoint field is only needed when you want to use a custom name.
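The lookup rule amounts to one line: an explicit `endpoint` field wins, and otherwise the block name itself is used. A hypothetical helper, purely for illustration:

```python
def resolve_endpoint(step_name, settings_block):
    """Which built-in generator a step name maps to (illustrative sketch)."""
    return settings_block.get("endpoint", step_name)
```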
11. Find-and-replace terms
Find-and-replace terms let you swap specific words or phrases in generated content for replacements you control. They are useful for brand renaming, replacing prohibited terms with safe alternatives, and standardizing terminology.
Where you can define replacement terms
Replacement terms can come from four places. They are applied in this order, where each later source can override the earlier ones for the same term:
- Global replacements. Defined under `apiConfig.replacements`. Applied to every generation step.
- Per-step replacements. Defined inside a specific generation step's settings block, under `replacements`. Combined with the global replacements; per-step values win for any term that exists in both.
- CSV-uploaded replacements. Uploaded with your job through the file upload field. Override anything from the inline settings for the same term.
- Per-channel replacements. Defined inside a marketplace channel block under `thirdPartyChannels`. Applied during marketplace channel generation, on top of the global and per-step replacements.
```yaml
apiConfig:
  # Global replacements applied to every generation step
  replacements:
    "old brand name": "new brand name"
    "rifle": "sporting equipment"
  generate_title:
    rules:
      - "Keep under 60 characters."
    # Per-step replacements (combined with the global ones)
    replacements:
      "endpoint-specific term": "replacement for titles only"
```

How matching works
- Whole-word matching. A term matches only when it appears as a complete word, not as part of a larger word. So `"rifle"` does not accidentally match `"rifles"` unless you explicitly add `"rifles"` too.
- Case insensitive. `"Brand"`, `"brand"`, and `"BRAND"` all match a term written as `"brand"`.
- Longer terms win. If two terms could match the same text, the longer one is tried first. So `"high power"` will match before `"high"`.
- Replacement is applied to text only. Replacements affect string outputs (like titles and descriptions) and string items inside lists (like bullet points). They are skipped for structured outputs like categories and attribute dictionaries. Empty values are left alone.
Loading replacements from a CSV
You can upload a CSV file of replacement terms with your job. Two column-header formats are accepted:
| Format | "Find" column | "Replace with" column |
|---|---|---|
| Preferred | term | replacement |
| Legacy | RPK | Safe Alternative |
Column names are matched case-insensitively. If a CSV file uses neither format, the upload is rejected with an error.
CSV-uploaded values override anything you put in your YAML for the same term. The reasoning is that CSV uploads usually represent the most recent user intent.
Per-channel replacements work the same way
Marketplace channel blocks (Amazon, eBay, Walmart) can also define their own replacements. These are applied after the global and per-step replacements during marketplace generation. See section 14 for the full marketplace channel reference.
12. Protecting manufacturer part numbers (MPN)
Replacement terms are powerful, but they create one risk: a replacement term could accidentally overlap with a manufacturer part number (MPN) and corrupt it. For example, if you have a replacement that turns the letter sequence "abc" into something else, and a part number happens to contain abc somewhere, your part number would be silently mangled.
To prevent this, the pipeline automatically protects part numbers whenever a row has an mpn field (or a product_name field as a fallback). This is called MPN masking.
How MPN masking works
For every piece of generated text where replacements would be applied, the pipeline:
- Hides every occurrence of the part number by replacing it with a special placeholder string that no replacement term can ever match.
- Runs your replacement terms against the hidden text. Any term that would have matched the part number now matches the placeholder instead, which has no effect.
- Restores the original part number in place of the placeholder.
The result: even if one of your replacement terms could have matched part of an MPN, the MPN comes out exactly as it went in.
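A minimal sketch of the hide, replace, restore sequence (the placeholder string, helper name, and regex-based matching are illustrative assumptions, not the pipeline's actual code):

```python
import re

def replace_with_mpn_mask(text, terms, mpn=None):
    """Sketch of MPN masking: hide the part number, run replacements, restore."""
    placeholder = "\x00MPN\x00"        # a token no replacement term can match
    if mpn:
        text = text.replace(mpn, placeholder)          # 1. hide the MPN
    for find, replace in sorted(terms.items(), key=lambda kv: -len(kv[0])):
        text = re.sub(r"\b" + re.escape(find) + r"\b", replace, text,
                      flags=re.IGNORECASE)             # 2. run replacements
    if mpn:
        text = text.replace(placeholder, mpn)          # 3. restore the MPN
    return text
```

Without the mask, a term like `"abc"` would corrupt a part number such as `ABC-123`; with it, the part number comes out exactly as it went in.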
What gets used as the part number
The pipeline looks for the part number in this order:
- The `mpn` field in the row.
- If `mpn` is empty, the `product_name` field.
If neither is present, MPN masking simply does nothing for that row, and replacements run normally.
When MPN masking applies
MPN masking is automatic and always on. You do not configure it. It applies wherever replacement terms are applied: standard generation, per-step replacements, and per-channel marketplace replacements.
13. Category-driven attributes
If you have a CSV file that lists which attributes belong to which product categories, the pipeline can use it to drive attribute generation. This means you can have one configuration that works across many product categories — the pipeline figures out which attributes to generate for each row based on which category it belongs to.
How to enable it
You enable category-driven attributes by uploading an attribute mapping file with your job. The system stores the file location for you; you do not edit your YAML to point at it.
The pipeline runs the category-driven attribute step only when all three of these conditions are met:
- You uploaded an attribute mapping file with the job.
- `categorization` is in your `generationEndpoints` list.
- Either `generate_attributes` or `extract_attributes` is in your `generationEndpoints` list.
If any one of those is missing, the category-driven step is silently skipped and you get the standard attribute generation (or no attribute generation at all, if no attribute step is in the list).
What the mapping file looks like
The mapping file is a CSV with these columns (the column names are matched case-insensitively):
| Column | Required? | What it holds |
|---|---|---|
category | Yes | The exact product category string. Must match what categorization returns exactly, character for character. |
attribute key | Yes | The name of an attribute (for example, Color, Material, Screen Size). |
attribute possible value | Yes | A comma-separated list of allowed values for this attribute. |
rules | No | Optional plain-language guidance for the model when generating this specific attribute (for example, "Only use standard RAM sizes"). |
Example CSV:
```csv
category,attribute key,attribute possible value,rules
Electronics > Laptops,Screen Size,"13 inch,14 inch,15 inch,17 inch",
Electronics > Laptops,RAM,"8GB,16GB,32GB,64GB",Only use standard RAM sizes
Electronics > Laptops,Color,"Silver,Space Gray,Black",
```

In this example, when a product is categorized as `Electronics > Laptops`, the pipeline will generate the three attributes Screen Size, RAM, and Color, restricted to the listed possible values, and the RAM attribute will receive the extra rule "Only use standard RAM sizes".
How attribute generation runs
For each row that meets all three conditions:
- The pipeline reads the category that the `categorization` step produced.
- It looks up the matching category in the mapping file (exact string match).
- It collects every attribute defined for that category in the file.
- It groups the attributes into batches of 5 and generates each batch in turn.
- The results from every batch are merged together into one combined attribute dictionary.

Each batch automatically gets two extra rules built from the file contents:

- For every attribute that has a `rules` value in the file, a rule is added: `"For 'AttributeName': <your rules>"`.
- For every attribute that has possible values, a constraint is added: `"For attribute 'AttributeName', return only values from this list: <values>"`.
What the output looks like
The merged attribute results replace whatever the standard generate_attributes (or extract_attributes) step would have produced. They are stored under the same column name in the output, so downstream consumers see one consistent attribute column.
When the category is not found
If the category that categorization produced does not appear in the mapping file (no exact match), the category-driven step does nothing. In that case, whatever the standard attribute generation step produced (if any) is preserved unchanged.
14. Marketplace channels (Amazon, eBay, Walmart)
The marketplace channels feature lets you generate channel-specific content for third-party marketplaces in a single pipeline run. You can have one set of standard titles and descriptions and also have Amazon-optimized titles and descriptions, eBay-optimized titles, and Walmart-optimized content — all from the same job.
What it does
The pipeline first runs your standard generation steps as usual. Then, if marketplace channels are enabled, it iterates over each marketplace you configured and re-runs the listed generation steps with marketplace-specific rules, examples, and length limits. Research is reused. No additional web searches or scraping happens for marketplace generation — only the content generation step runs again.
The output of each marketplace generation step appears as a new column in your output, named with a marketplace prefix: amazon_generate_title, ebay_generate_description, and so on.
Enabling marketplace channels
The marketplace channels block lives at the same level as your other top-level settings inside apiConfig:
```yaml
apiConfig:
  thirdPartyChannels:
    enabled: true
    channels:
      amazon:
        # channel-specific generation step settings...
      ebay:
        # channel-specific generation step settings...
```

Two things must be true for marketplace generation to run:
- `enabled: true` is set. If `enabled` is false or missing, the entire marketplace step is skipped and no channel content is produced.
- `channels` is present and non-empty. If `enabled` is true but `channels` is missing or empty, the job is rejected up front with an error.
Allowed marketplace names
Only three marketplace names are recognized:
| Marketplace | Name in config |
|---|---|
| Amazon | amazon |
| eBay | ebay |
| Walmart | walmart |
If you use any other name (for example, etsy), the job is rejected up front with a clear error message saying which name was invalid and which names are allowed.
Per-channel generation step settings
Inside each marketplace block, you list the same kinds of generation step settings blocks you would use at the top level. The block names should match the names in your generationEndpoints list (or use the endpoint field to alias one to a different built-in step, exactly like the standard aliasing pattern in section 10).
```yaml
channels:
  amazon:
    generate_title:
      rules:
        - "Title must be optimized for Amazon search."
        - "Include brand, product type, and key attributes."
      examples:
        - "BrandName Ergonomic Office Chair with Lumbar Support, Black"
      max_characters: 200
      min_characters: 10
    generate_description:
      rules:
        - "Use HTML formatting with <p>, <b>, <li> tags only."
      max_characters: 2000
```

You can use the same fields you use at the top level:
| Field | What it does |
|---|---|
rules | Plain-language instructions specific to this marketplace. |
examples | Example outputs specific to this marketplace. |
max_characters | A length cap specific to this marketplace's listing rules. |
min_characters | A length floor specific to this marketplace. |
endpoint | An alias to a different built-in step (same pattern as standard aliasing). |
model_id | A custom taxonomy for the categorization step. |
attribute_list | The attributes to generate or extract (for attribute steps). |
Each marketplace's length limits are independent of the standard step's length limits. You might have a standard generate_title with a 60-character cap and an amazon.generate_title with a 200-character cap. Both will be enforced separately during their respective generation passes.
Per-channel replacement terms
Each marketplace can define its own replacement terms, either inline in the YAML or via an uploaded CSV file:
```yaml
channels:
  amazon:
    generate_title:
      rules:
        - "Optimize for Amazon search."
      max_characters: 200
    replacements:
      "rifle": "sporting equipment"
      "gun": "item"
```

Things to know about per-channel replacements:
- They are applied after the global and per-step replacements have already run.
- They use the same whole-word, case-insensitive matching as global replacements.
- They benefit from the same automatic MPN protection.
- They apply to string outputs and to string items inside lists (like bullet points), and they are skipped for structured outputs like categories.
- You can also upload per-channel replacement CSV files at job submission time using upload fields named `replacement_amazon`, `replacement_ebay`, and `replacement_walmart`. Uploaded CSV values override anything in the YAML for the same term.
Output column naming
Each marketplace's results show up in the output as a new column. The naming pattern is <marketplace>_<step name>.
| Standard column | Amazon column | eBay column | Walmart column |
|---|---|---|---|
generate_title | amazon_generate_title | ebay_generate_title | walmart_generate_title |
generate_description | amazon_generate_description | ebay_generate_description | walmart_generate_description |
generate_bullets | amazon_generate_bullets | ebay_generate_bullets | walmart_generate_bullets |
A run that uses two standard steps and configures two marketplaces with the same two steps produces six output columns total (two standard plus two per marketplace times two marketplaces).
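The column arithmetic can be checked with a short sketch (illustrative only; the function name is an assumption):

```python
def output_columns(steps, marketplaces):
    """All output columns: standard steps plus <marketplace>_<step> (sketch)."""
    cols = list(steps)
    for marketplace in marketplaces:
        cols += [f"{marketplace}_{step}" for step in steps]
    return cols
```

Two standard steps across two marketplaces yields 2 + 2 × 2 = 6 columns, matching the count above.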
What the pipeline does for each marketplace
For each marketplace listed in channels, the pipeline:
- Skips the marketplace if its name is not in the allowed set (and logs a clear warning).
- Skips the marketplace if its settings block is empty or invalid.
- Loads any inline replacement terms from the marketplace block, then layers any uploaded marketplace CSV replacements on top.
- Iterates over each generation step settings block in the marketplace.
- For each step, builds the channel-specific length checks, runs the generation, applies the standard replacement terms, then applies the marketplace-specific replacement terms on top.
- Saves the result under the prefixed column name (`amazon_generate_title`, etc.).
If enabled is false or the section is missing, none of this happens. The marketplace stage is a complete no-op.
Job-submission validation for marketplace channels
The system validates marketplace channel configuration when you submit a job, before any processing starts. If anything is wrong, the job is rejected up front with a clear error message and no work is done. Validation checks, in order:
- Is the marketplace section enabled? If `enabled` is true, validation continues; otherwise validation is skipped entirely.
- Is the `channels` block present and non-empty? If not, the job is rejected with `"thirdPartyChannels.enabled is true but 'channels' is missing or empty."`
- Is every marketplace name allowed? Each marketplace name must be one of `amazon`, `ebay`, or `walmart`. Any other name is rejected with a message naming the invalid entry.
- Is every marketplace block a settings block? Each marketplace value must be an object containing generation step settings. If not, the job is rejected.
- Does every marketplace have at least one generation step? A marketplace with no generation steps is rejected.
- Replacement CSV uploads are processed last. After validation passes, any per-marketplace replacement CSV files you uploaded are saved and attached to the corresponding marketplace block automatically.
You will see clear error messages for any validation failure, so you do not have to guess what is wrong.
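The validation order can be sketched as a function that returns a list of error messages (the first error string is quoted from the description above; the other messages and the function name are illustrative assumptions, not the system's actual output):

```python
ALLOWED_MARKETPLACES = {"amazon", "ebay", "walmart"}

def validate_channels(config):
    """Sketch of up-front marketplace validation. Returns a list of errors."""
    tpc = config.get("thirdPartyChannels", {})
    if not tpc.get("enabled"):
        return []                                    # disabled: skip validation
    channels = tpc.get("channels")
    if not channels:
        return ["thirdPartyChannels.enabled is true but 'channels' is missing or empty."]
    errors = []
    for name, block in channels.items():
        if name not in ALLOWED_MARKETPLACES:
            errors.append(f"Unknown marketplace '{name}'; allowed names: "
                          f"{sorted(ALLOWED_MARKETPLACES)}")
        elif not isinstance(block, dict) or not block:
            errors.append(f"Marketplace '{name}' must contain at least one "
                          f"generation step settings block.")
    return errors
```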
15. A complete working example
Here is a complete configuration that produces standard titles and descriptions plus Amazon and eBay marketplace versions, all in one run:
```yaml
apiConfig:
  generationEndpoints:
    - "generate_title"
    - "generate_description"
  generate_title:
    rules:
      - "Keep under 60 characters."
    max_characters: 60
  generate_description:
    rules:
      - "Keep under 500 characters."
    max_characters: 500
  thirdPartyChannels:
    enabled: true
    channels:
      amazon:
        generate_title:
          rules:
            - "Optimize for Amazon search. Include brand and key specs."
          max_characters: 200
        generate_description:
          rules:
            - "Use only <p>, <b>, <li> HTML tags."
          max_characters: 2000
      ebay:
        generate_title:
          rules:
            - "Short and keyword-rich."
          max_characters: 80
        generate_description:
          rules:
            - "Plain text only, no HTML."
          max_characters: 4000
```

The output of this configuration produces these columns for each row:
| Column | Where it comes from |
|---|---|
generate_title | The standard title generation step. |
generate_description | The standard description generation step. |
amazon_generate_title | The Amazon marketplace title generation step. |
amazon_generate_description | The Amazon marketplace description generation step. |
ebay_generate_title | The eBay marketplace title generation step. |
ebay_generate_description | The eBay marketplace description generation step. |
You would expand this example by adding research steps, replacement terms, attribute generation, and any other settings you need from earlier sections. Start small and add complexity one piece at a time.
16. Glossary of terms
apiConfig — The single top-level setting in every configuration file. Everything else is indented underneath it. Anything outside apiConfig is ignored.
Aliasing — Using a custom name in generationEndpoints together with the endpoint field to call a built-in generation step under a different name. Lets you run the same generation step twice with different rules and have both results in the output.
Attribute mapping file — A CSV file you upload with your job that lists which attributes (with allowed values) belong to which product categories. When uploaded, it drives the category-driven attribute generation step.
Category-driven attributes — A pipeline stage that runs after standard generation when an attribute mapping file has been uploaded. It looks up the row's category in the file and generates that category's specific attributes in batches of five.
Configuration file — The YAML file that controls everything the pipeline does for a given job. You edit it in any text editor.
fieldMapping — A configuration block that translates your spreadsheet column headers into the internal field names the pipeline expects.
Generation step — A single named action that produces one piece of content for each row. Examples: generate_title, generate_description, generate_bullets, categorization, generate_attributes. Each generation step has a settings block that holds its rules, examples, and length limits.
Inline scraper — A scraper attached to your universal_search block that automatically runs on URLs discovered by the search steps.
Length limits (max_characters, min_characters) — Hard limits on the length of generated text. If the result is too long or too short, the pipeline retries the step. After the retry budget is exhausted, the last result is kept anyway.
Marketplace channel — A named third-party marketplace (Amazon, eBay, or Walmart) for which the pipeline can generate marketplace-specific versions of your content in addition to the standard versions. Configured under thirdPartyChannels.channels.
MPN — Manufacturer Part Number. The unique identifier used by a manufacturer for a specific product.
MPN masking — An automatic protection that prevents replacement terms from accidentally corrupting manufacturer part numbers. Always on; you do not configure it.
Per-channel replacements — Find-and-replace terms specific to one marketplace. Applied after the global and per-step replacements during marketplace generation.
Per-step replacements — Find-and-replace terms specific to a single generation step. Combined with the global replacements; per-step values win for any term defined in both.
Plain-language rules (rules) — Instructions you give the model in natural language to guide how it writes a particular piece of content. Examples: "Lead with the brand name", "Keep under 60 characters", "Use a friendly tone".
Replacement terms — Find-and-replace pairs that swap specific words or phrases in generated content for replacements you control. Defined in four places (global, per-step, CSV upload, per-channel) with a clear precedence order.
Research — The pipeline stage where information about each product is gathered from sources like web search, the internal catalog, uploaded files, or page scraping. Configured under researchStepEndpoints.
Research strategy — How the research sources in a step group are executed. One of: primary_then_fallback, use_both, parallel, standalone. The default is standalone.
Scraper — A research tool that extracts structured data from a product web page using a plain-language extraction prompt. Comes in two flavors: inline (runs on URLs found by search) and standalone (runs on URLs already in your spreadsheet).
Search step — A single search query template, with placeholders that get filled in from each spreadsheet row.
Search workflow — A pre-built bundle of search steps that you reference by name from your configuration file. Six are included; you can also add your own.
Settings block — A YAML block that holds the rules, examples, length limits, and other settings for a particular generation step or research source. Each block sits at the same indentation level under apiConfig and is named after the step it configures.
Standalone scraping — Scraping URLs that are already present in your spreadsheet, configured under scrape_results. Different from inline scraping, which scrapes URLs discovered by search.
stopOnFirstSuccess — A flag inside universal_search that says whether to stop running search steps as soon as one returns results (true, the default) or to run every step regardless (false).
Step group — A bundle of one or more research sources together with the strategy that controls how they are executed. Each entry under researchStepEndpoints is a step group.
thirdPartyChannels — The configuration block where you enable and configure marketplace channels (Amazon, eBay, Walmart).
Workflow templates file — The file workflow_templates.yaml that ships with the project. It holds the six built-in search workflows, and you can add your own to it.
YAML — The plain-text format used for configuration files. It uses indentation (spaces, not tabs) to show how settings are grouped together.