Merchandising Pipeline Configuration File Guide
Who this guide is for
Anyone who needs to set up or change a configuration file for the merchandising pipeline. You do not need to know any programming language to use this guide. If you can edit a YAML file in a text editor, you have everything you need.
What this guide covers
Every setting you can put in a pipeline configuration file: what it does, when to use it, what each option means, and what the pipeline does at every stage when it processes your products. This is the complete reference. It is long because it covers everything; you do not have to read it end-to-end. Use the table of contents to jump to what you need.
Table of Contents
- The basics of configuration files
- What the pipeline does to each row of your spreadsheet
- Matching your spreadsheet columns to pipeline fields
- Research: gathering product information
- Search workflows: ready-made research recipes
- Customizing your own search steps
- Scraping product pages
- Generating product content
- Quality controls: rules, examples, and length limits
- Running the same generation step twice with different settings
- Find-and-replace terms
- Protecting manufacturer part numbers (MPN)
- Category-driven attributes
- Marketplace channels (Amazon, eBay, Walmart)
- A complete working example
- Glossary of terms
1. The basics of configuration files
Every run of the merchandising pipeline is controlled by a configuration file. This file is written in YAML, a plain-text format that uses indentation (spaces, not tabs) to show how settings are grouped together.
You do not edit any application code. You edit one YAML file. The pipeline reads that file at the start of every run and uses it to decide what research to perform, what content to generate, how to validate the results, and how to adapt content for marketplace channels.
Two starter files we provide
| Filename | When to use it |
|---|---|
| example_config.yaml | A full-featured starting point. Contains every available setting at least once: multi-step research with fallback strategies, several generation steps, smart scrapers, find-and-replace terms, attribute generation, and marketplace channels for Amazon, eBay, and Walmart. Copy this when you need the complete feature set; you can always delete the parts you don't need. |
| example_scraper_config.yaml | A simpler starting point for a search-scrape-generate workflow. Defines one research step, one smart scraper, and four generation steps. No fallback strategies, no marketplace channels, no replacement terms. Copy this when you want a straightforward pipeline without the extra machinery. |
Both files are valid and ready to run. Pick the one closest to what you need and customize from there.
The top-level structure
Every configuration file must have one top-level setting: apiConfig. Everything else lives indented underneath it. Anything you put outside apiConfig is ignored when the pipeline runs.
```yaml
apiConfig:  # All of your settings go here, indented underneath apiConfig.
  generationEndpoints:
    - "generate_title"
  generate_title:
    rules:
      - "Keep it under 60 characters."
    max_characters: 60
```

Underneath apiConfig you will add a mix of three kinds of keys:
1. Orchestration keys — these tell the pipeline what to do, and in what order.
| Key | What it does |
|---|---|
| generationEndpoints | The ordered list of content generation steps you want to run (for example: title first, then description, then bullets). Required if you want any content generated. Each name in this list should also have a matching settings block elsewhere in the file. |
| researchStepEndpoints | The list of research steps to run before content generation begins. Optional. If you leave it out, the pipeline skips research entirely and goes straight to generation. |
2. Settings blocks — one block per generation or research step. Each block holds the rules, examples, and limits for that step. If generationEndpoints includes "generate_title", you should also have a generate_title: block at the same indentation level holding that step's settings.
3. Cross-cutting keys — settings that apply across the whole pipeline.
| Key | What it does |
|---|---|
| replacements | Global find-and-replace terms applied to every piece of generated text. |
| fieldMapping | Tells the pipeline which of your spreadsheet columns hold which product fields. |
| thirdPartyChannels | Marketplace channel settings (Amazon, eBay, Walmart) for generating channel-specific content. |
| universal_search | Settings for web search during research. |
| scrape_results | Settings for scraping URLs that are already in your spreadsheet. |
| smart_scrape | A placeholder for the smart-scrape research step. Usually just {}. |
| get_images | A placeholder for the image-fetching research step. Usually just {}. |
| attribute_mapping_file_path | Set automatically by the system when you upload a category-to-attribute mapping CSV. You do not set this yourself. |
The rest of this guide walks through each of these in detail.
2. What the pipeline does to each row of your spreadsheet
When you submit a job, the pipeline reads your input CSV one row at a time and runs each row through five stages, in this fixed order:
| Stage | What happens |
|---|---|
| 1. Field mapping | Rename your spreadsheet columns to the field names the pipeline expects. |
| 2. Research | Look up product data on the web, in the internal catalog, in uploads, etc. |
| 3. Content generation | Write titles, descriptions, bullets, attributes, and categories. |
| 4. Category-driven attributes | If you uploaded an attribute mapping file, look up the category and generate that category's attributes in batches. |
| 5. Marketplace channels | If marketplace channels are enabled, re-run generation with Amazon/eBay/Walmart-specific rules. |
The pipeline then produces an output row containing your original input columns, the research results, and every piece of generated content.
Each row is processed independently, so the result for one row never depends on the result for another.
Here is what happens at each stage in plain language.
Stage 1 — Field mapping
If your configuration file contains a fieldMapping block, the pipeline renames your spreadsheet columns first, before any other work happens. This lets the rest of the pipeline use a single set of internal field names regardless of how your spreadsheet headers are written.
Columns in your spreadsheet that are not listed in fieldMapping are kept exactly as-is, just under their original names. If you have a product_name column but no mpn column, the pipeline automatically uses product_name wherever it would normally use mpn, so search templates that reference {mpn} keep working.
Stage 2 — Research
The pipeline runs research steps only when you list them under researchStepEndpoints. Each step group has a strategy that controls how its endpoints are executed:
- primary_then_fallback — Try the primary endpoint first; if it returns nothing, try each fallback in order until one succeeds.
- use_both — Run every endpoint in the group and store every result.
- standalone — Run each endpoint independently. This is the default.
- parallel — Currently behaves the same as use_both. (The name reflects the intent that one day these calls may be made concurrently.)
A step group can also declare a required field naming another endpoint. The group is only run if that named endpoint already produced results. After the search steps finish, any inline scrapers that you defined under universal_search.scrapers automatically run on the URLs that were discovered.
Stage 3 — Content generation
Each step listed in generationEndpoints runs in order. For each step, the pipeline picks up the matching settings block, generates content, checks the result against any length limits you set, and applies your find-and-replace terms. If a length-limit check fails, the pipeline retries the generation up to three more times before giving up and using the last result anyway.
Stage 4 — Category-driven attributes
This stage runs only if all three of the following are true:
- You uploaded an attribute mapping file with your job.
- categorization is in your generationEndpoints list.
- Either generate_attributes or extract_attributes is in your generationEndpoints list.
When all three are met, the pipeline reads the category that the categorization step produced, looks up that exact category in your mapping file, gathers the attributes the file says belong to that category, and generates them in groups of five.
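Put together, a configuration that satisfies the second and third conditions needs at least these two entries in its generation list (the attribute mapping file itself is uploaded with the job, and attribute_mapping_file_path is filled in automatically):

```yaml
apiConfig:
  generationEndpoints:
    - "categorization"        # produces the category to look up in the mapping file
    - "generate_attributes"   # generates the mapped attributes in groups of five
```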
Stage 5 — Marketplace channels
This stage runs only when thirdPartyChannels.enabled is set to true in your config file. The pipeline iterates over each marketplace you configured (any of: amazon, ebay, walmart) and re-runs the generation steps you specified for that channel using channel-specific rules and limits. Research is reused — no extra search calls are made — so this stage is purely about generating new variations of content. Results are saved into the output under channel-prefixed names like amazon_generate_title and ebay_generate_description.
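At minimum, enabling this stage looks like the sketch below. Only the enabled flag is shown here; the per-channel settings that sit alongside it are covered in the marketplace channels section of this guide:

```yaml
apiConfig:
  thirdPartyChannels:
    enabled: true
    # Channel-specific settings (amazon, ebay, walmart) go here;
    # see the marketplace channels section for their exact shape.
```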
3. Matching your spreadsheet columns to pipeline fields
The pipeline expects to find product information under specific internal field names like product_name, mpn, domain, and description. Your spreadsheet probably uses different headers like "Product Name", "MPN", "Manufacturer Website", and "Product Description". The fieldMapping block tells the pipeline how to translate one to the other.
Format
```yaml
apiConfig:
  fieldMapping:
    pipeline_field_name: "Your Spreadsheet Column Header"
```

The key on the left is the internal field name the pipeline expects. The value on the right is the header text as it appears in your spreadsheet.
Example
Suppose your spreadsheet looks like this:
| Product Name | MPN | Manufacturer Website |
|---|---|---|
| Office Chair | 12345-ABC | acme.com |
To make the pipeline read this spreadsheet, add this block:
```yaml
apiConfig:
  fieldMapping:
    product_name: "Product Name"
    mpn: "MPN"
    domain: "Manufacturer Website"
    description: "Product Description"
```

After this mapping is applied, the pipeline sees:
| product_name | mpn | domain |
|---|---|---|
| Office Chair | 12345-ABC | acme.com |
Things to know
- Field mapping is the very first thing that happens for every row. It runs before research, generation, or anything else.
- Unmapped columns are preserved. If your spreadsheet has columns that you do not list in fieldMapping (for example, an internal note column or a customer ID), they stay in the row unchanged.
- You only need to map columns that the pipeline cares about. If your headers happen to already match the expected names (product_name, mpn, domain, description), you do not need a fieldMapping block at all.
4. Research: gathering product information
Research is the stage where the pipeline collects information about each product before generating any content. The richer the research results, the better the generated titles, descriptions, bullets, and attributes will be.
There are five research sources you can use, one or more at a time:
| Source | What it does | Required input |
|---|---|---|
| universal_search | Web search using configurable query templates. Returns search hits (title, link, snippet). | Whatever fields your search templates reference (typically domain, mpn, product_name). |
| catalog_search | Searches the internal product catalog for matching manufacturer products. | product_name (or title), mpn, and domain. |
| data_source_upload | Returns data from a file you uploaded with the job. | The system must have pre-loaded the upload into the row. |
| get_images | Returns product image URLs. | The image URLs must already be in the row (typically from a prior catalog search or directly in the CSV). |
| scrape_results | A standalone scraper for when product URLs are already in your spreadsheet, rather than discovered via search. | A column containing the URL(s) to scrape. |
When researchStepEndpoints is omitted from your config, no research runs and the pipeline goes straight to generation.
How research is configured
Research is configured under the researchStepEndpoints key as a list of step groups. Each group bundles one or more research sources together with a strategy that says how they should be executed.
```yaml
apiConfig:
  researchStepEndpoints:
    - endpoints:
        - "universal_search"
        - "catalog_search"
        - "data_source_upload"
      strategy: "primary_then_fallback"
      primaryEndpoint: "universal_search"
      fallbacks:
        - "catalog_search"
        - "data_source_upload"
    - endpoints:
        - "get_images"
      strategy: "standalone"
      required: "universal_search"
```

Each step group accepts these fields:
| Field | Required? | What it does |
|---|---|---|
| endpoints | Yes | The names of the research sources in this group. |
| strategy | No (defaults to standalone) | One of: primary_then_fallback, use_both, parallel, standalone. See below. |
| primaryEndpoint | Only with primary_then_fallback | The endpoint to try first. |
| fallbacks | Only with primary_then_fallback | The ordered list of endpoints to try if the primary returns nothing. |
| required | No | Names another research source that must have already produced results. The whole group is skipped if that source did not. |
The four strategies
primary_then_fallback — Try the primary endpoint first. If it fails or returns nothing, try each fallback in order. Stop at the first fallback that succeeds. The result is stored under the primary endpoint's name. Use this when you have a preferred data source with one or more "if that fails, try this instead" alternatives.
```yaml
- endpoints:
    - "universal_search"
    - "catalog_search"
    - "data_source_upload"
  strategy: "primary_then_fallback"
  primaryEndpoint: "universal_search"
  fallbacks:
    - "catalog_search"
    - "data_source_upload"
```

use_both — Run every endpoint in the group and store every result independently. Each endpoint that returns data gets its own entry in the research results. Use this when you want to combine data from multiple sources, for example running both a web search and a catalog lookup so generation has both kinds of information.
```yaml
- endpoints:
    - "universal_search"
    - "catalog_search"
  strategy: "use_both"
```

parallel — Same behavior as use_both today. The name signals the intent that this group's endpoints may eventually run concurrently. Treat it as a synonym for use_both for now.
```yaml
- endpoints:
    - "universal_search"
    - "catalog_search"
  strategy: "parallel"
```

standalone — Run each endpoint in the group independently. This is the default if you do not specify a strategy. Functionally identical to use_both. Use this for endpoints that don't need fallback logic and don't share results with each other.
```yaml
- endpoints:
    - "get_images"
  strategy: "standalone"
```

The required field
The required field lets you say "only run this group if some other research source already produced results." It is useful for chains where one step depends on another.
```yaml
- endpoints:
    - "get_images"
  strategy: "standalone"
  required: "universal_search"
```

In this example, get_images only runs if universal_search already returned at least one result. If universal_search was skipped, failed, or returned nothing, the entire get_images group is skipped.
What each research source returns
| Source | What it returns |
|---|---|
| universal_search | A list of web search hits, each typically containing a link, a title, and a snippet. |
| catalog_search | A bundle of catalog data with matched products and image URLs from the internal manufacturer catalog. |
| data_source_upload | Whatever data was attached to the row from your upload. The structure depends on the upload format. |
| get_images | A list of product image URLs, drawn from earlier catalog or upload results. |
| scrape_results | Scraped page data, organized by scraper name. Each scraper produces its own named bundle of results. |
All of these results are available to the generation steps that come next, and they are also reused (without re-running) when marketplace channel generation happens later in the pipeline. Research happens once per row and is never repeated.
5. Search workflows: ready-made research recipes
Most users do not need to write search queries from scratch. The project ships with six ready-made search workflows that cover the most common research patterns. You reference one of them by name and the pipeline uses its built-in search steps automatically.
To use a workflow, set the workflow key inside your universal_search block:
```yaml
apiConfig:
  universal_search:
    workflow: "standard_manufacturer_search"
```

When the pipeline sees a workflow key, it loads that workflow's search steps and ignores any inline searchSteps you may have written. The workflow's settings replace your inline settings.
The six built-in workflows
| Workflow name | Description | Steps | When to use it |
|---|---|---|---|
| standard_manufacturer_search | Searches the manufacturer's domain by part number, then by part number plus product name, then a fallback domain, then an open web search. | 4 | The default for most products. |
| multi_domain_search | Searches across multiple domains at once, then falls back to an open web search. | 3 | Products sold on several distributor websites. |
| aggressive_search | Tries every strategy: single domain, fallback domain, multi-domain, and open search, in sequence. | 7 | Hard-to-find products where earlier, more targeted searches are likely to miss. |
| simple_mpn_search | One step: search the open web for the part number alone. | 1 | Quick lookups when you only have a part number. |
| domain_focused_search | Searches only the primary manufacturer domain using three different query patterns. | 3 | When you are confident the product is on a specific manufacturer's website and want to avoid open-web noise. |
| standard_manufacturer_search_with_scraping | The same four steps as the standard workflow, plus two built-in scrapers that automatically extract product data and images from the discovered pages. | 4 + scrapers | When you want both search and scraping bundled into a single reference. |
Each workflow expects specific spreadsheet columns to be present. Here is what each one needs.
standard_manufacturer_search
The default workflow for most product research. It progressively widens the search until something hits.
| Step | What it searches |
|---|---|
| 1 | Site search on the manufacturer domain by part number. |
| 2 | Site search on the manufacturer domain by part number plus product name. |
| 3 | Site search on the fallback domain by part number. |
| 4 | Open web search by part number plus product name. |
Required spreadsheet columns: domain, mpn, product_name, fallback_domain
When to use it: This is the recommended starting point. It first looks at the manufacturer's website by part number, then adds the product name for broader matching, then tries a fallback domain, then performs an open web search. Because it stops at the first step that finds anything, it returns quickly when the product is well-known.
multi_domain_search
Searches across several domains at the same time using one combined query.
| Step | What it searches |
|---|---|
| 1 | Combined site search across up to three domains by part number. |
| 2 | Combined site search across up to three domains by part number plus product name. |
| 3 | Open web search by part number plus product name. |
Required spreadsheet columns: domains (a list column with up to three domain strings), mpn, product_name
When to use it: When a product is sold by several distributors or retailers and you want to search all of their websites in a single query before falling back to the open web.
aggressive_search
Combines every available search strategy into one workflow with seven steps.
| Step | What it searches |
|---|---|
| 1 | Manufacturer domain by part number. |
| 2 | Manufacturer domain by part number plus product name. |
| 3 | Fallback domain by part number. |
| 4 | Fallback domain by part number plus product name. |
| 5 | Multiple domains combined by part number. |
| 6 | Multiple domains combined by part number plus product name. |
| 7 | Open web search by part number plus product name. |
Required spreadsheet columns: domain, mpn, product_name, fallback_domain, domains (list)
When to use it: For hard-to-find products where the more targeted searches are likely to miss. This workflow tries single-domain, fallback-domain, multi-domain, and open search patterns in sequence. Because it stops at the first step that succeeds, it still returns quickly when the early steps work, but the breadth of strategies maximizes the chance of finding something.
simple_mpn_search
The simplest workflow — a single open web search by part number.
| Step | What it searches |
|---|---|
| 1 | Open web search by part number alone. |
Required spreadsheet columns: mpn
When to use it: When you only have a part number and no domain information. Also useful for quick lookups where speed matters more than precision.
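As a sketch, a minimal end-to-end configuration built around this workflow might combine it with a single generation step. The rule text and length limit below are illustrative values, not requirements:

```yaml
apiConfig:
  researchStepEndpoints:
    - endpoints:
        - "universal_search"
      strategy: "standalone"
  universal_search:
    workflow: "simple_mpn_search"
  generationEndpoints:
    - "generate_title"
  generate_title:
    rules:
      - "Keep it under 60 characters."
    max_characters: 60
```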
domain_focused_search
Restricts every search to the primary manufacturer domain with three different query patterns.
| Step | What it searches |
|---|---|
| 1 | Manufacturer domain by part number. |
| 2 | Manufacturer domain by part number plus product name. |
| 3 | Manufacturer domain by product name only. |
Required spreadsheet columns: domain, mpn, product_name
When to use it: When you are confident the product exists on a specific manufacturer's website and you want to avoid open-web results. Useful for manufacturers with extensive catalogs where you want only authoritative product data from the source.
standard_manufacturer_search_with_scraping
Identical search steps to standard_manufacturer_search, but with two built-in scrapers attached that automatically extract product data and images from any discovered pages.
Search steps: Same four steps as standard_manufacturer_search (see above).
Built-in scrapers:
| Scraper name | What it extracts | Pages scraped |
|---|---|---|
| product_data | Title, description, attributes/specs, and SKU for the given product name and part number. | Up to 3 |
| image_extraction | All image URLs and accessory information for the given product name. | Up to 1 |
Required spreadsheet columns: domain, mpn, product_name, fallback_domain
When to use it: When you want a single workflow reference that handles both search and scraping. Instead of defining search steps in the config and separately listing scrapers, this workflow bundles everything together, and the scrapers run automatically on the URLs found by the search steps.
Adding your own custom workflow
If none of the six built-in workflows fit, you can add a new one to the workflow templates file (workflow_templates.yaml). The format mirrors the inline format described in the next section.
To add a new workflow:
- Open workflow_templates.yaml.
- Add a new entry under the workflows: key with a unique name. That name is what you reference from your configuration file.
- Add a description explaining what the workflow does and when to use it.
- Add searchSteps as an ordered list of search step entries (each with a query template and a useFields mapping — see the next section).
- Set stopOnFirstSuccess to true (stop after the first hit) or false (run every step and combine results).
- Optionally add a scrapers list to attach inline scrapers to the workflow.
- In your configuration file, reference the new workflow by name:
```yaml
apiConfig:
  universal_search:
    workflow: "your_custom_workflow_name"
```

Example — a custom workflow that searches a distributor by SKU and falls back to open search:
```yaml
workflows:
  # ... existing workflows ...
  distributor_search:
    description: "Search a distributor website by SKU, then fall back to open web search by SKU + product name"
    searchSteps:
      - query: "site:{distributor_url} {sku}"
        useFields:
          distributor_url: "distributor_url"
          sku: "sku"
      - query: "{sku} {product_name}"
        useFields:
          sku: "sku"
          product_name: "product_name"
    stopOnFirstSuccess: true
```

This workflow expects distributor_url, sku, and product_name columns in the spreadsheet (or those names mapped via fieldMapping).
6. Customizing your own search steps
If you would rather define your own search steps directly in your configuration file instead of using a workflow, you can. Use the searchSteps key inside the universal_search block.
How a search step works
Every search step is a small object with two fields:
| Field | Required? | What it does |
|---|---|---|
| query | Yes | The query template. Contains placeholders like {mpn} and {domain} that will be replaced with values from each row. |
| useFields | Yes | A mapping that says which spreadsheet column supplies each placeholder. The key on the left is the placeholder name; the value on the right is the column name. |
Example
```yaml
universal_search:
  searchSteps:
    - query: "site:{domain} {mpn}"
      useFields:
        domain: "domain"
        mpn: "mpn"
    - query: "site:{domain} {mpn} {product_name}"
      useFields:
        domain: "domain"
        mpn: "mpn"
        product_name: "product_name"
```

The pipeline runs the steps in the order you list them. For each step, every {placeholder} in the query is replaced with the matching value from the spreadsheet row before the search runs.
Worked example: Given a row with domain = manufacturer.com and mpn = 12345, this step:
```yaml
- query: "site:{domain} {mpn}"
  useFields:
    domain: "domain"
    mpn: "mpn"
```

becomes the search query: site:manufacturer.com 12345
What happens when fields are missing
If a row is missing one of the fields the query needs (the value is empty or the column is absent), the entire search step is skipped for that row, and the pipeline moves on to the next step. You will see a log message indicating which fields were missing.
Searching multiple domains in one query
If a row has a list of several domains (e.g., the product is sold by three different distributors), you can search all of them in a single query using multi-domain expansion.
In useFields, set the num_domains value to how many domain placeholders you want, and point domain at the spreadsheet column that holds the list of domains:
```yaml
- query: "(site:{domain1} OR site:{domain2} OR site:{domain3}) {mpn}"
  useFields:
    domain: "domains"   # The column containing the list of domains
    num_domains: 3      # How many domain placeholders to fill
    mpn: "mpn"
```

The pipeline expects the domains column to contain a list of domain strings. It creates numbered placeholders (domain1, domain2, domain3, and so on), filling each with the next domain in the list. If the list has fewer entries than num_domains, the extra placeholders are left empty.
Worked example: Given domains = ["site1.com", "site2.com", "site3.com"] and mpn = 12345, the resulting query is:
(site:site1.com OR site:site2.com OR site:site3.com) 12345
Stopping early or running every step
The stopOnFirstSuccess flag controls whether the pipeline stops at the first successful step or runs every step.
```yaml
universal_search:
  searchSteps: [...]
  stopOnFirstSuccess: true  # or false
```

| Value | Behavior |
|---|---|
| true (default) | Steps run in order. As soon as one step returns results, those results are kept and no further steps run. Best when your earlier steps are more targeted and your later steps are broader fallbacks. |
| false | Every step runs, regardless of whether earlier steps succeeded. Results from each successful step are collected together. |
If you do not set stopOnFirstSuccess, it defaults to true.
What happens when no search steps fit
If you do not define any searchSteps and you do not reference a workflow, or if every step gets skipped because of missing fields, the pipeline falls back to a simple search built from the query, product_name, or title field of the row, in that order of preference.
Workflow vs. inline searchSteps
If you set both a workflow and an inline searchSteps list inside universal_search, the workflow wins. The inline searchSteps are ignored. Pick one or the other, not both.
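For example, in a config like the following, only the workflow's built-in steps run and the inline step is silently ignored:

```yaml
universal_search:
  workflow: "standard_manufacturer_search"  # this wins
  searchSteps:                              # ignored because workflow is set
    - query: "{mpn}"
      useFields:
        mpn: "mpn"
```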
7. Scraping product pages
Scrapers extract structured information from product web pages. There are two ways to use them:
- Inline scrapers, attached to your universal_search block, run automatically on URLs that the search steps discover.
- Standalone scraping (scrape_results), which scrapes URLs that are already in your spreadsheet, without doing a search first.
Both use the same scraper settings.
Inline scrapers (search-then-scrape)
When you list one or more scrapers inside your universal_search block, they automatically run on the URLs discovered by the search steps. This bundles search and extraction into a single research operation.
```yaml
universal_search:
  searchSteps: [...]
  stopOnFirstSuccess: true
  scrapers:
    - type: smart_scraper
      name: product_data
      prompt: >
        For the product: {product_name} Number: {mpn}
        extract ALL product data including title, description,
        attributes/specs, and SKU.
      maxResults: 3
      retries: 2
      timeout: 120
    - type: smart_scraper
      name: image_extraction
      prompt: >
        Extract all image URLs from the page for {product_name}.
      maxResults: 1
```

Each scraper entry has these fields:
| Field | Required? | Default | What it does |
|---|---|---|---|
| type | Yes | — | The kind of scraper to use. Currently supported: smart_scraper. |
| name | Yes | — | A unique name for this scraper's output (for example, product_data or image_extraction). Each scraper produces its own named bundle of results. |
| prompt | Yes | — | The instructions sent to the scraper telling it what to extract. Supports {placeholder} template variables that are filled in from your spreadsheet columns at run time (for example, {product_name} and {mpn}). |
| maxResults | No | 3 | The maximum number of search-result URLs to scrape. The pipeline takes the top N URLs from the search results and runs the scraper on each. |
| retries | No | 2 | How many times to retry on failure or timeout. |
| timeout | No | 120 | How long to wait per scrape, in seconds. |
Prompt placeholders. Just like search query templates, the prompt field can contain {placeholder} variables. These are filled in from your spreadsheet columns for each row, so you can tailor the prompt to the specific product being scraped.
Multiple scrapers, one set of pages. You can list multiple scraper entries with different names to run different extraction prompts against the same set of discovered pages. Each scraper runs independently, has exactly one prompt, and produces its own named bundle of results.
Standalone scraping (URLs already in your spreadsheet)
If your spreadsheet already contains the product URLs you want to scrape, you can skip the search step entirely and use the scrape_results block to scrape directly.
```yaml
apiConfig:
  researchStepEndpoints:
    - endpoints: [scrape_results]
      strategy: standalone
  scrape_results:
    urlField: "product_url"
    scrapers:
      - type: smart_scraper
        name: product_data
        prompt: "Extract all product data including title, description, specs..."
        retries: 2
        timeout: 120
```

Fields under scrape_results:
| Field | Required? | Default | What it does |
|---|---|---|---|
| urlField | No | "url" | The spreadsheet column that contains the URL to scrape. The value can be a single URL string or a list of URLs. |
| scrapers | Yes | — | A list of scraper entries. Same shape as inline scrapers (see above), except maxResults does not apply because URLs come from the spreadsheet, not from search. |
How standalone scraping decides what to do:
- The pipeline reads the column named by urlField (defaults to "url" if you do not set it).
- If the value is missing or empty for that row, scraping is skipped.
- If the value is a single URL, it is treated as a one-item list. If it is a list, it is used as-is.
- Each scraper in the scrapers list runs against every URL in the list.
- Results are bundled by scraper name (one bundle per scraper, holding one entry per URL).
Key difference from inline scrapers: Inline scrapers run on URLs discovered by search steps. Standalone scraping runs on URLs that are already in your spreadsheet. Both use the same scraper settings.
8. Generating product content
Content generation is the heart of the pipeline. Each entry in your generationEndpoints list is a separate generation step. The pipeline runs them in the order you list them, and each one writes one piece of content for every row.
The standard generation steps
| Step name | What it produces |
|---|---|
generate_title | A product title (a string). |
generate_description | A product description (a string). |
generate_bullets | Bullet points (a list of strings). |
categorization | A product category (structured data — typically a dictionary or string). |
generate_attributes | Product attributes such as color, material, weight (a dictionary of attribute names to values). |
extract_attributes | Attributes pulled from the existing description rather than generated freshly (a dictionary). |
rewrite_product | A rewritten version of an existing product description. |
generate_validator | Specialized validation content. |
product_variant_field_standardization | Standardized variant field text. |
How a generation step is configured
Each step name in generationEndpoints should also have a settings block with the same name elsewhere in your apiConfig. For example, if generationEndpoints contains "generate_title", you need a generate_title: block.
| Field | Type | What it does |
|---|---|---|
rules | List of strings | Plain-language instructions sent to the model during generation. These guide the writing style, tone, structure, and content. Examples: "Keep under 60 characters", "Begin with the main keyword", "Use a friendly tone". |
examples | List of strings | Sample outputs that show the model what good results look like. The model uses them as guidance. |
max_characters | Whole number | A hard maximum on the length of the result. If the result is longer, the pipeline retries the step. |
min_characters | Whole number | A hard minimum on the length of the result. If the result is shorter, the pipeline retries the step. |
endpoint | String | An override that lets you call a built-in generation step under a custom name. See section 10. |
model_id | String | A custom taxonomy name. Only used by categorization. Leave it out to use the default taxonomy. |
attribute_list | List of strings | The list of attribute names to generate or extract. Used by generate_attributes and extract_attributes. |
replacements | Mapping of strings | Find-and-replace terms specific to this step. Combined with the global replacements (see section 11). |
A simple example
```yaml
apiConfig:
  generationEndpoints:
    - "generate_title"
    - "generate_description"
  generate_title:
    rules:
      - "Keep under 60 characters."
      - "Lead with the brand name."
    examples:
      - "Acme Pro 2000 Ergonomic Office Chair, Black"
    max_characters: 60
    min_characters: 10
  generate_description:
    rules:
      - "Write 3 to 4 sentences."
      - "Highlight key benefits, not just features."
    max_characters: 500
    min_characters: 100
```

Categorization is a little different
The categorization step has a few special rules:
- It accepts an optional `model_id` field that points at a custom category taxonomy. Leave it out to use the default.
- It still uses `rules` and `examples` like the other steps, plus your product's title, description, and image.
- It does not use `max_characters` or `min_characters`. Categorization returns structured data, not plain text, so length checks don't apply. Don't set those fields on a `categorization` block.
```yaml
categorization:
  model_id: "my-custom-taxonomy"  # optional
  rules:
    - "Pick the most specific applicable category."
    - "Use both the title and the description when deciding."
  examples:
    - "Electronics > Computers > Laptops"
```

Generating versus extracting attributes
Two different steps handle product attributes, and they behave differently:
- `generate_attributes` writes attribute values based on whatever product information is available, even if those values are not stated explicitly in the source text. It can fill in attributes the model can reasonably infer.
- `extract_attributes` only pulls attribute values that are explicitly mentioned in the existing product description. If a value is not stated, it stays empty.
Both steps use the same settings, including an attribute_list field that names the attributes to work with.
```yaml
generate_attributes:
  rules:
    - "Generate detailed product attributes."
  examples:
    - "Color: Black, Material: Aluminum, Weight: 2.5 lbs"
  attribute_list:
    - "Color"
    - "Material"
    - "Weight"
    - "Dimensions"

extract_attributes:
  rules:
    - "Extract only attributes explicitly mentioned in the description."
  attribute_list:
    - "Brand"
    - "Model Number"
    - "Voltage"
```

9. Quality controls: rules, examples, and length limits
The pipeline supports two different kinds of quality control on every generation step.
Two kinds of rules
1. Plain-language rules. These are the entries in the rules list. They are passed to the model as instructions during generation. Examples: "Keep under 60 characters", "Begin with the brand name", "Focus on key benefits". These influence how the model writes, but they are not strictly enforced after the fact — the model does its best to follow them.
2. Hard length limits. These are max_characters and min_characters. They are enforced after generation. If the result is too long or too short, the result is rejected and the step is retried.
You can use both kinds at once. The plain-language rules guide the writing style; the length limits act as a safety net.
How retries work
When a length limit is set and the generated content fails to meet it, the pipeline tries again. Here is the exact sequence:
- The pipeline calls the generation step to write the content.
- If no length limits are set on this step, the pipeline accepts the result immediately and moves on.
- If length limits are set, the pipeline checks the result against them.
- If the result passes, the pipeline accepts it and moves on.
- If the result fails, and there are still retries available, the pipeline logs the failure and tries again from step 1.
- If the result fails and the retry budget is exhausted, the pipeline keeps the last result anyway, logs a warning, and moves on. Nothing is silently dropped.
By default the pipeline tries the original generation plus three additional retries for a total of four attempts.
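The retry sequence above can be sketched in Python (an illustrative sketch assembled from the description; the function and parameter names, and the default of three retries, are assumptions, not the pipeline's actual code):

```python
def generate_with_retries(generate, min_chars=None, max_chars=None, retries=3):
    """Sketch of the retry loop: one initial attempt plus `retries` extra tries."""
    result = generate()
    if min_chars is None and max_chars is None:
        return result                      # no limits set: accept immediately
    for _ in range(retries):
        too_short = min_chars is not None and len(result) < min_chars
        too_long = max_chars is not None and len(result) > max_chars
        if not (too_short or too_long):
            return result                  # passes the length checks
        result = generate()                # log the failure and try again
    return result                          # budget exhausted: keep the last result
```

Note the last line: when every attempt fails, the last result is kept anyway, matching the "nothing is silently dropped" behavior.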
When length limits do not apply
The categorization step does not use max_characters or min_characters because its output is structured data, not plain text. If you put length limits on a categorization block, they will be ignored — but it is cleaner to leave them out.
The same is true for generate_attributes and extract_attributes, which return a dictionary of attribute values rather than a single piece of text.
What you see in the output
If a generated value fails its length checks even after every retry, you will get the value the model produced on its last attempt. The output is never empty just because validation failed — you always get something. This is on purpose: the pipeline assumes a not-quite-perfect result is more useful than an empty cell.
10. Running the same generation step twice with different settings
Sometimes you want two versions of the same kind of content with different rules — for example, a short description and a long description. The pipeline supports this through an aliasing pattern.
How it works
You list a custom name in generationEndpoints, then create a settings block with that custom name and use the endpoint field to point at the underlying built-in step.
```yaml
apiConfig:
  generationEndpoints:
    - "short_description"
    - "long_description"
  short_description:
    endpoint: "generate_description"  # use the description generator
    rules:
      - "Keep it concise, under 100 words."
    max_characters: 500
    min_characters: 50
  long_description:
    endpoint: "generate_description"  # same generator, different rules
    rules:
      - "Be detailed and comprehensive."
      - "Include technical specifications."
    max_characters: 3000
    min_characters: 1000
```

In the output, both `short_description` and `long_description` will appear as separate columns. Both are produced by the description generator, but each follows its own rules and length limits.
When to use this pattern
- When you want multiple lengths of the same kind of content (short vs. long descriptions).
- When you want the same step to produce content for different audiences with different tones.
- When you need separate versions for separate downstream uses (for example, one for your website and a different one for a partner export).
When the endpoint field is not needed
If your settings block name already matches a built-in generation step name (generate_title, generate_description, generate_bullets, categorization, generate_attributes, extract_attributes, etc.), you do not need to set endpoint. The pipeline finds the matching step automatically. The endpoint field is only needed when you want to use a custom name.
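The lookup rule amounts to one line: an explicit `endpoint` field wins, and otherwise the block name itself is used. A hypothetical helper, purely for illustration:

```python
def resolve_endpoint(step_name, settings_block):
    """Which built-in generator a step name maps to (illustrative sketch)."""
    return settings_block.get("endpoint", step_name)
```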
11. Find-and-replace terms
Find-and-replace terms let you swap specific words or phrases in generated content for replacements you control. They are useful for brand renaming, replacing prohibited terms with safe alternatives, and standardizing terminology.
Where you can define replacement terms
Replacement terms can come from four places. They are applied in this order, where each later source can override the earlier ones for the same term:
- Global replacements. Defined under `apiConfig.replacements`. Applied to every generation step.
- Per-step replacements. Defined inside a specific generation step's settings block, under `replacements`. Combined with the global replacements; per-step values win for any term that exists in both.
- CSV-uploaded replacements. Uploaded with your job through the file upload field. Override anything from the inline settings for the same term.
- Per-channel replacements. Defined inside a marketplace channel block under `thirdPartyChannels`. Applied during marketplace channel generation, on top of the global and per-step replacements.
```yaml
apiConfig:
  # Global replacements applied to every generation step
  replacements:
    "old brand name": "new brand name"
    "rifle": "sporting equipment"
  generate_title:
    rules:
      - "Keep under 60 characters."
    # Per-step replacements (combined with the global ones)
    replacements:
      "endpoint-specific term": "replacement for titles only"
```

How matching works
- Whole-word matching. A term matches only when it appears as a complete word, not as part of a larger word. So `"rifle"` does not accidentally match `"rifles"` unless you explicitly add `"rifles"` too.
- Case insensitive. `"Brand"`, `"brand"`, and `"BRAND"` all match a term written as `"brand"`.
- Longer terms win. If two terms could match the same text, the longer one is tried first. So `"high power"` will match before `"high"`.
- Replacement is applied to text only. Replacements affect string outputs (like titles and descriptions) and string items inside lists (like bullet points). They are skipped for structured outputs like categories and attribute dictionaries. Empty values are left alone.
Loading replacements from a CSV
You can upload a CSV file of replacement terms with your job. Two column-header formats are accepted:
| Format | "Find" column | "Replace with" column |
|---|---|---|
| Preferred | term | replacement |
| Legacy | RPK | Safe Alternative |
Column names are matched case-insensitively. If a CSV file uses neither format, the upload is rejected with an error.
CSV-uploaded values override anything you put in your YAML for the same term. The reasoning is that CSV uploads usually represent the most recent user intent.
Per-channel replacements work the same way
Marketplace channel blocks (Amazon, eBay, Walmart) can also define their own replacements. These are applied after the global and per-step replacements during marketplace generation. See section 14 for the full marketplace channel reference.
12. Protecting manufacturer part numbers (MPN)
Replacement terms are powerful, but they create one risk: a replacement term could accidentally overlap with a manufacturer part number (MPN) and corrupt it. For example, if you have a replacement that turns the letter sequence "abc" into something else, and a part number happens to contain abc somewhere, your part number would be silently mangled.
To prevent this, the pipeline automatically protects part numbers whenever a row has an mpn field (or a product_name field as a fallback). This is called MPN masking.
How MPN masking works
For every piece of generated text where replacements would be applied, the pipeline:
- Hides every occurrence of the part number by replacing it with a special placeholder string that no replacement term can ever match.
- Runs your replacement terms against the hidden text. Any term that would have matched the part number now matches the placeholder instead, which has no effect.
- Restores the original part number in place of the placeholder.
The result: even if one of your replacement terms could have matched part of an MPN, the MPN comes out exactly as it went in.
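A minimal sketch of the hide, replace, restore sequence (the placeholder string, helper name, and regex-based matching are illustrative assumptions, not the pipeline's actual code):

```python
import re

def replace_with_mpn_mask(text, terms, mpn=None):
    """Sketch of MPN masking: hide the part number, run replacements, restore."""
    placeholder = "\x00MPN\x00"        # a token no replacement term can match
    if mpn:
        text = text.replace(mpn, placeholder)          # 1. hide the MPN
    for find, replace in sorted(terms.items(), key=lambda kv: -len(kv[0])):
        text = re.sub(r"\b" + re.escape(find) + r"\b", replace, text,
                      flags=re.IGNORECASE)             # 2. run replacements
    if mpn:
        text = text.replace(placeholder, mpn)          # 3. restore the MPN
    return text
```

Without the mask, a term like `"abc"` would corrupt a part number such as `ABC-123`; with it, the part number comes out exactly as it went in.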
What gets used as the part number
The pipeline looks for the part number in this order:
- The `mpn` field in the row.
- If `mpn` is empty, the `product_name` field.
If neither is present, MPN masking simply does nothing for that row, and replacements run normally.
When MPN masking applies
MPN masking is automatic and always on. You do not configure it. It applies wherever replacement terms are applied: standard generation, per-step replacements, and per-channel marketplace replacements.
13. Category-driven attributes
If you have a CSV file that lists which attributes belong to which product categories, the pipeline can use it to drive attribute generation. This means you can have one configuration that works across many product categories — the pipeline figures out which attributes to generate for each row based on which category it belongs to.
How to enable it
You enable category-driven attributes by uploading an attribute mapping file with your job. The system stores the file location for you; you do not edit your YAML to point at it.
The pipeline runs the category-driven attribute step only when all three of these conditions are met:
- You uploaded an attribute mapping file with the job.
- `categorization` is in your `generationEndpoints` list.
- Either `generate_attributes` or `extract_attributes` is in your `generationEndpoints` list.
If any one of those is missing, the category-driven step is silently skipped and you get the standard attribute generation (or no attribute generation at all, if no attribute step is in the list).
What the mapping file looks like
The mapping file is a CSV with these columns (the column names are matched case-insensitively):
| Column | Required? | What it holds |
|---|---|---|
category | Yes | The exact product category string. Must match what categorization returns exactly, character for character. |
attribute key | Yes | The name of an attribute (for example, Color, Material, Screen Size). |
attribute possible value | Yes | A comma-separated list of allowed values for this attribute. |
rules | No | Optional plain-language guidance for the model when generating this specific attribute (for example, "Only use standard RAM sizes"). |
Example CSV:
```csv
category,attribute key,attribute possible value,rules
Electronics > Laptops,Screen Size,"13 inch,14 inch,15 inch,17 inch",
Electronics > Laptops,RAM,"8GB,16GB,32GB,64GB",Only use standard RAM sizes
Electronics > Laptops,Color,"Silver,Space Gray,Black",
```

In this example, when a product is categorized as `Electronics > Laptops`, the pipeline will generate the three attributes Screen Size, RAM, and Color, restricted to the listed possible values, and the RAM attribute will receive the extra rule "Only use standard RAM sizes".
How attribute generation runs
For each row that meets all three conditions:
- The pipeline reads the category that the `categorization` step produced.
- It looks up the matching category in the mapping file (exact string match).
- It collects every attribute defined for that category in the file.
- It groups the attributes into batches of 5 and generates each batch in turn.
- The results from every batch are merged together into one combined attribute dictionary.

Each batch automatically gets two extra rules built from the file contents:

- For every attribute that has a `rules` value in the file, a rule is added: `"For 'AttributeName': <your rules>"`.
- For every attribute that has possible values, a constraint is added: `"For attribute 'AttributeName', return only values from this list: <values>"`.
What the output looks like
The merged attribute results replace whatever the standard generate_attributes (or extract_attributes) step would have produced. They are stored under the same column name in the output, so downstream consumers see one consistent attribute column.
When the category is not found
If the category that categorization produced does not appear in the mapping file (no exact match), the category-driven step does nothing. In that case, whatever the standard attribute generation step produced (if any) is preserved unchanged.
14. Marketplace channels (Amazon, eBay, Walmart)
The marketplace channels feature lets you generate channel-specific content for third-party marketplaces in a single pipeline run. You can have one set of standard titles and descriptions and also have Amazon-optimized titles and descriptions, eBay-optimized titles, and Walmart-optimized content — all from the same job.
What it does
The pipeline first runs your standard generation steps as usual. Then, if marketplace channels are enabled, it iterates over each marketplace you configured and re-runs the listed generation steps with marketplace-specific rules, examples, and length limits. Research is reused. No additional web searches or scraping happens for marketplace generation — only the content generation step runs again.
The output of each marketplace generation step appears as a new column in your output, named with a marketplace prefix: amazon_generate_title, ebay_generate_description, and so on.
Enabling marketplace channels
The marketplace channels block lives at the same level as your other top-level settings inside apiConfig:
```yaml
apiConfig:
  thirdPartyChannels:
    enabled: true
    channels:
      amazon:
        # channel-specific generation step settings...
      ebay:
        # channel-specific generation step settings...
```

Two things must be true for marketplace generation to run:
- `enabled: true` is set. If `enabled` is false or missing, the entire marketplace step is skipped and no channel content is produced.
- `channels` is present and non-empty. If `enabled` is true but `channels` is missing or empty, the job is rejected up front with an error.
Allowed marketplace names
Only three marketplace names are recognized:
| Marketplace | Name in config |
|---|---|
| Amazon | amazon |
| eBay | ebay |
| Walmart | walmart |
If you use any other name (for example, etsy), the job is rejected up front with a clear error message saying which name was invalid and which names are allowed.
Per-channel generation step settings
Inside each marketplace block, you list the same kinds of generation step settings blocks you would use at the top level. The block names should match the names in your generationEndpoints list (or use the endpoint field to alias one to a different built-in step, exactly like the standard aliasing pattern in section 10).
```yaml
channels:
  amazon:
    generate_title:
      rules:
        - "Title must be optimized for Amazon search."
        - "Include brand, product type, and key attributes."
      examples:
        - "BrandName Ergonomic Office Chair with Lumbar Support, Black"
      max_characters: 200
      min_characters: 10
    generate_description:
      rules:
        - "Use HTML formatting with <p>, <b>, <li> tags only."
      max_characters: 2000
```

You can use the same fields you use at the top level:
| Field | What it does |
|---|---|
rules | Plain-language instructions specific to this marketplace. |
examples | Example outputs specific to this marketplace. |
max_characters | A length cap specific to this marketplace's listing rules. |
min_characters | A length floor specific to this marketplace. |
endpoint | An alias to a different built-in step (same pattern as standard aliasing). |
model_id | A custom taxonomy for the categorization step. |
attribute_list | The attributes to generate or extract (for attribute steps). |
Each marketplace's length limits are independent of the standard step's length limits. You might have a standard generate_title with a 60-character cap and an amazon.generate_title with a 200-character cap. Both will be enforced separately during their respective generation passes.
Per-channel replacement terms
Each marketplace can define its own replacement terms, either inline in the YAML or via an uploaded CSV file:
```yaml
channels:
  amazon:
    generate_title:
      rules:
        - "Optimize for Amazon search."
      max_characters: 200
    replacements:
      "rifle": "sporting equipment"
      "gun": "item"
```

Things to know about per-channel replacements:
- They are applied after the global and per-step replacements have already run.
- They use the same whole-word, case-insensitive matching as global replacements.
- They benefit from the same automatic MPN protection.
- They apply to string outputs and to string items inside lists (like bullet points), and they are skipped for structured outputs like categories.
- You can also upload per-channel replacement CSV files at job submission time using upload fields named `replacement_amazon`, `replacement_ebay`, and `replacement_walmart`. Uploaded CSV values override anything in the YAML for the same term.
Output column naming
Each marketplace's results show up in the output as a new column. The naming pattern is <marketplace>_<step name>.
| Standard column | Amazon column | eBay column | Walmart column |
|---|---|---|---|
generate_title | amazon_generate_title | ebay_generate_title | walmart_generate_title |
generate_description | amazon_generate_description | ebay_generate_description | walmart_generate_description |
generate_bullets | amazon_generate_bullets | ebay_generate_bullets | walmart_generate_bullets |
A run that uses two standard steps and configures two marketplaces with the same two steps produces six output columns total (two standard plus two per marketplace times two marketplaces).
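The column arithmetic can be checked with a short sketch (illustrative only; the function name is an assumption):

```python
def output_columns(steps, marketplaces):
    """All output columns: standard steps plus <marketplace>_<step> (sketch)."""
    cols = list(steps)
    for marketplace in marketplaces:
        cols += [f"{marketplace}_{step}" for step in steps]
    return cols
```

Two standard steps across two marketplaces yields 2 + 2 × 2 = 6 columns, matching the count above.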
What the pipeline does for each marketplace
For each marketplace listed in channels, the pipeline:
- Skips the marketplace if its name is not in the allowed set (and logs a clear warning).
- Skips the marketplace if its settings block is empty or invalid.
- Loads any inline replacement terms from the marketplace block, then layers any uploaded marketplace CSV replacements on top.
- Iterates over each generation step settings block in the marketplace.
- For each step, builds the channel-specific length checks, runs the generation, applies the standard replacement terms, then applies the marketplace-specific replacement terms on top.
- Saves the result under the prefixed column name (`amazon_generate_title`, etc.).
If enabled is false or the section is missing, none of this happens. The marketplace stage is a complete no-op.
Job-submission validation for marketplace channels
The system validates marketplace channel configuration when you submit a job, before any processing starts. If anything is wrong, the job is rejected up front with a clear error message and no work is done. Validation checks, in order:
- Is the marketplace section enabled? If `enabled` is true, validation continues; otherwise validation is skipped entirely.
- Is the `channels` block present and non-empty? If not, the job is rejected with `"thirdPartyChannels.enabled is true but 'channels' is missing or empty."`
- Is every marketplace name allowed? Each marketplace name must be one of `amazon`, `ebay`, or `walmart`. Any other name is rejected with a message naming the invalid entry.
- Is every marketplace block a settings block? Each marketplace value must be an object containing generation step settings. If not, the job is rejected.
- Does every marketplace have at least one generation step? A marketplace with no generation steps is rejected.
- Replacement CSV uploads are processed last. After validation passes, any per-marketplace replacement CSV files you uploaded are saved and attached to the corresponding marketplace block automatically.
You will see clear error messages for any validation failure, so you do not have to guess what is wrong.
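The validation order can be sketched as a function that returns a list of error messages (the first error string is quoted from the description above; the other messages and the function name are illustrative assumptions, not the system's actual output):

```python
ALLOWED_MARKETPLACES = {"amazon", "ebay", "walmart"}

def validate_channels(config):
    """Sketch of up-front marketplace validation. Returns a list of errors."""
    tpc = config.get("thirdPartyChannels", {})
    if not tpc.get("enabled"):
        return []                                    # disabled: skip validation
    channels = tpc.get("channels")
    if not channels:
        return ["thirdPartyChannels.enabled is true but 'channels' is missing or empty."]
    errors = []
    for name, block in channels.items():
        if name not in ALLOWED_MARKETPLACES:
            errors.append(f"Unknown marketplace '{name}'; allowed names: "
                          f"{sorted(ALLOWED_MARKETPLACES)}")
        elif not isinstance(block, dict) or not block:
            errors.append(f"Marketplace '{name}' must contain at least one "
                          f"generation step settings block.")
    return errors
```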
15. A complete working example
Here is a complete configuration that produces standard titles and descriptions plus Amazon and eBay marketplace versions, all in one run:
```yaml
apiConfig:
  generationEndpoints:
    - "generate_title"
    - "generate_description"
  generate_title:
    rules:
      - "Keep under 60 characters."
    max_characters: 60
  generate_description:
    rules:
      - "Keep under 500 characters."
    max_characters: 500
  thirdPartyChannels:
    enabled: true
    channels:
      amazon:
        generate_title:
          rules:
            - "Optimize for Amazon search. Include brand and key specs."
          max_characters: 200
        generate_description:
          rules:
            - "Use only <p>, <b>, <li> HTML tags."
          max_characters: 2000
      ebay:
        generate_title:
          rules:
            - "Short and keyword-rich."
          max_characters: 80
        generate_description:
          rules:
            - "Plain text only, no HTML."
          max_characters: 4000
```

The output of this configuration produces these columns for each row:
| Column | Where it comes from |
|---|---|
generate_title | The standard title generation step. |
generate_description | The standard description generation step. |
amazon_generate_title | The Amazon marketplace title generation step. |
amazon_generate_description | The Amazon marketplace description generation step. |
ebay_generate_title | The eBay marketplace title generation step. |
ebay_generate_description | The eBay marketplace description generation step. |
You would expand this example by adding research steps, replacement terms, attribute generation, and any other settings you need from earlier sections. Start small and add complexity one piece at a time.
16. Glossary of terms
apiConfig — The single top-level setting in every configuration file. Everything else is indented underneath it. Anything outside apiConfig is ignored.
Aliasing — Using a custom name in generationEndpoints together with the endpoint field to call a built-in generation step under a different name. Lets you run the same generation step twice with different rules and have both results in the output.
Attribute mapping file — A CSV file you upload with your job that lists which attributes (with allowed values) belong to which product categories. When uploaded, it drives the category-driven attribute generation step.
Category-driven attributes — A pipeline stage that runs after standard generation when an attribute mapping file has been uploaded. It looks up the row's category in the file and generates that category's specific attributes in batches of five.
Configuration file — The YAML file that controls everything the pipeline does for a given job. You edit it in any text editor.
fieldMapping — A configuration block that translates your spreadsheet column headers into the internal field names the pipeline expects.
Generation step — A single named action that produces one piece of content for each row. Examples: generate_title, generate_description, generate_bullets, categorization, generate_attributes. Each generation step has a settings block that holds its rules, examples, and length limits.
Inline scraper — A scraper attached to your universal_search block that automatically runs on URLs discovered by the search steps.
Length limits (max_characters, min_characters) — Hard limits on the length of generated text. If the result is too long or too short, the pipeline retries the step. After the retry budget is exhausted, the last result is kept anyway.
Marketplace channel — A named third-party marketplace (Amazon, eBay, or Walmart) for which the pipeline can generate marketplace-specific versions of your content in addition to the standard versions. Configured under thirdPartyChannels.channels.
MPN — Manufacturer Part Number. The unique identifier used by a manufacturer for a specific product.
MPN masking — An automatic protection that prevents replacement terms from accidentally corrupting manufacturer part numbers. Always on; you do not configure it.
Per-channel replacements — Find-and-replace terms specific to one marketplace. Applied after the global and per-step replacements during marketplace generation.
Per-step replacements — Find-and-replace terms specific to a single generation step. Combined with the global replacements; per-step values win for any term defined in both.
Plain-language rules (rules) — Instructions you give the model in natural language to guide how it writes a particular piece of content. Examples: "Lead with the brand name", "Keep under 60 characters", "Use a friendly tone".
Replacement terms — Find-and-replace pairs that swap specific words or phrases in generated content for replacements you control. Defined in four places (global, per-step, CSV upload, per-channel) with a clear precedence order.
Research — The pipeline stage where information about each product is gathered from sources like web search, the internal catalog, uploaded files, or page scraping. Configured under researchStepEndpoints.
Research strategy — How the research sources in a step group are executed. One of: primary_then_fallback, use_both, parallel, standalone. The default is standalone.
Scraper — A research tool that extracts structured data from a product web page using a plain-language extraction prompt. Comes in two flavors: inline (runs on URLs found by search) and standalone (runs on URLs already in your spreadsheet).
Search step — A single search query template, with placeholders that get filled in from each spreadsheet row.
Search workflow — A pre-built bundle of search steps that you reference by name from your configuration file. Six are included; you can also add your own.
Settings block — A YAML block that holds the rules, examples, length limits, and other settings for a particular generation step or research source. Each block sits at the same indentation level under apiConfig and is named after the step it configures.
Standalone scraping — Scraping URLs that are already present in your spreadsheet, configured under scrape_results. Different from inline scraping, which scrapes URLs discovered by search.
stopOnFirstSuccess — A flag inside universal_search that says whether to stop running search steps as soon as one returns results (true, the default) or to run every step regardless (false).
Step group — A bundle of one or more research sources together with the strategy that controls how they are executed. Each entry under researchStepEndpoints is a step group.
thirdPartyChannels — The configuration block where you enable and configure marketplace channels (Amazon, eBay, Walmart).
Workflow templates file — The file workflow_templates.yaml that ships with the project. It holds the six built-in search workflows, and you can add your own to it.
YAML — The plain-text format used for configuration files. It uses indentation (spaces, not tabs) to show how settings are grouped together.