News Generation from RSS: End-to-End
End-to-end means every hop is owned: fetch schedules, normalization, dedupe keys, novelty scoring, drafting, media upload, WordPress categories, publish, and archival for audit.
If you cannot draw a box diagram from fetch to live URL—with owners and SLOs per box—you are not “doing RSS news” yet; you are experimenting. Production systems need clear handoffs and rollback points.
Reference pipeline
The reference pipeline follows the chain in the opening paragraph: fetch → normalize → dedupe → score novelty → draft → attach media → publish → archive. If any step fails, you need idempotent retries and alerts. Human review queues are optional; logging is not.
Design each step to be restartable: a failed image upload should not corrupt the post record; a failed publish should leave the draft in a quarantine state with a clear error code.
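A minimal sketch of one way to express that restartability: each post record carries a status, and a failed step moves it to a quarantine state with an error code instead of corrupting partial results. The `PostRecord` fields and status names here are assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional

class Status(Enum):
    DRAFTED = "drafted"
    MEDIA_ATTACHED = "media_attached"
    PUBLISHED = "published"
    QUARANTINED = "quarantined"

@dataclass
class PostRecord:
    draft_id: str
    status: Status = Status.DRAFTED
    error_code: Optional[str] = None

def run_step(record: PostRecord, step: Callable[[PostRecord], None],
             success_status: Status, error_code: str) -> PostRecord:
    """Run one pipeline step; on failure, quarantine with a clear code."""
    try:
        step(record)
        record.status = success_status
        record.error_code = None
    except Exception:
        record.status = Status.QUARANTINED
        record.error_code = error_code  # operators can requeue from this state
    return record
```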
Scheduling and fairness
Stagger fetches to respect upstream rate limits, and back off globally when error rates spike; this protects both your IP reputation and your partner relationships.
Prioritize feeds by business value: breaking wires may get shorter intervals than long-tail blogs. Fairness inside your pipeline matters less than not harming publishers’ infrastructure.
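One possible shape for such a scheduler: shorter intervals for high-value feeds, a stable per-feed offset so fetches do not fire in lockstep, and a global backoff multiplier when upstream errors are elevated. The interval values and error threshold below are illustrative assumptions.

```python
import hashlib

# Illustrative intervals (seconds) by business priority -- tune per feed contract.
INTERVALS = {"breaking_wire": 60, "default": 300, "long_tail_blog": 1800}

def next_fetch_time(feed_url: str, priority: str, now: float,
                    global_error_rate: float) -> float:
    interval = INTERVALS.get(priority, INTERVALS["default"])
    # Global backoff: double the interval while upstream error rates are elevated.
    if global_error_rate > 0.10:  # assumed threshold
        interval *= 2
    # Stagger: hash the URL into a stable offset so feeds spread across the window.
    offset = int(hashlib.sha256(feed_url.encode()).hexdigest(), 16) % interval
    # Next time t > now with t aligned to the feed's offset within its interval.
    return ((now - offset) // interval + 1) * interval + offset
```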
Audit
Store immutable copies of feed items and model outputs. When someone asks “why did we publish this?” you can answer with data.
Retention policies should support legal holds: if a story becomes contested, freezing the exact inputs and outputs is not optional.
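A sketch of content-addressed, write-once storage with a legal-hold marker that deletion jobs must respect. The local directory stands in for an object store; the layout and marker convention are assumptions.

```python
import hashlib
import json
import pathlib

ARCHIVE = pathlib.Path("archive")  # stand-in for an object storage bucket

def archive_blob(raw_bytes: bytes, kind: str) -> str:
    """Store bytes under their content hash; re-archiving identical bytes is a no-op."""
    digest = hashlib.sha256(raw_bytes).hexdigest()
    path = ARCHIVE / kind / digest
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(raw_bytes)
    return digest

def place_legal_hold(digest: str, kind: str, reason: str) -> None:
    """Freeze a blob; lifecycle/deletion jobs must check for this marker first."""
    marker = ARCHIVE / kind / f"{digest}.hold"
    marker.write_text(json.dumps({"reason": reason}))
```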
Observability and on-call
Export metrics to your existing monitoring stack: fetch success, queue depth, publish latency, and WordPress HTTP codes. PagerDuty should fire on sustained degradation, not on single blips.
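A minimal sketch using the `prometheus_client` library, assuming you scrape with Prometheus or a compatible agent; metric names are illustrative, and the sustained-degradation alert rules would live in your monitoring stack, not in the exporter.

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Illustrative metric names -- align with your existing naming conventions.
FETCH_OK = Counter("rss_fetch_success_total", "Successful feed fetches", ["feed"])
QUEUE_DEPTH = Gauge("rss_queue_depth", "Items waiting per pipeline stage", ["stage"])
PUBLISH_LATENCY = Histogram("rss_publish_latency_seconds", "Fetch-to-live latency")
WP_HTTP = Counter("rss_wp_http_total", "WordPress responses by HTTP code", ["code"])

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for scraping
    FETCH_OK.labels(feed="example").inc()
    QUEUE_DEPTH.labels(stage="drafting").set(12)
    PUBLISH_LATENCY.observe(42.5)
    WP_HTTP.labels(code="201").inc()
```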
Runbooks should cover “feed poisoning” (malicious items injected into RSS) and “CMS partial failure” (posts live but images missing). Practice those scenarios quarterly.
Cost and capacity planning
Model costs scale with tokens; image generation scales with pixels. Budget per thousand published articles and set hard caps with graceful degradation—e.g., text-only publish when image quota is exhausted.
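A sketch of such a budget gate: hard caps hold the item, an exhausted image quota degrades to text-only publishing rather than blocking the article. Budget figures and mode names are placeholders.

```python
from dataclasses import dataclass

@dataclass
class Budget:
    text_tokens_left: int
    images_left: int

def publish_mode(budget: Budget, wants_image: bool) -> str:
    """Decide how to publish under hard caps; degrade rather than silently overspend."""
    if budget.text_tokens_left <= 0:
        return "hold"            # hard cap reached: queue the item, page an owner
    if wants_image and budget.images_left > 0:
        budget.images_left -= 1
        return "text_and_image"
    return "text_only"           # image quota exhausted: graceful degradation

budget = Budget(text_tokens_left=1_000_000, images_left=0)
assert publish_mode(budget, wants_image=True) == "text_only"
```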
Right-size workers for peak news days. Black Friday, for a commerce beat, is not the day to discover your queue cannot drain.
Idempotency patterns in detail
Derive deterministic draft IDs from the source item ID plus the template version. Publishing twice should either no-op or create an explicit new version; it should never silently overwrite without an audit trail. WordPress clients should use transactional patterns where available: create the draft, upload media, attach it, then publish, with each step retryable without duplicating posts.
Store idempotency keys at the edge of your system, where external APIs are called, so retries are safe even when internal job runners do not guarantee exactly-once execution.
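A sketch of both ideas together: a deterministic draft ID from source item plus template version, and an idempotency-key check wrapped around the single point where the external API is hit. The `create_or_get_draft` shape is an assumption, not a specific WordPress client API, and the in-memory dict stands in for a durable key store.

```python
import hashlib

def draft_id(source_item_id: str, template_version: str) -> str:
    """Same source item + same template version always yields the same draft ID."""
    return hashlib.sha256(f"{source_item_id}:{template_version}".encode()).hexdigest()[:16]

# Idempotency keys stored at the edge, keyed by the external call they guard.
_sent: dict[str, int] = {}  # key -> CMS post ID (stand-in for a durable store)

def create_or_get_draft(key: str, create_fn) -> int:
    """Retries with the same key return the existing post instead of duplicating it."""
    if key in _sent:
        return _sent[key]
    post_id = create_fn()  # the only place the external API is actually called
    _sent[key] = post_id
    return post_id
```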
Data model for audits
Prefer append-only logs for generation decisions over mutable rows that “just get updated.” Immutability makes disputes tractable. If storage costs worry you, tier to object storage with lifecycle policies rather than deleting history.
Link every published URL to a bundle identifier that points to all inputs—feed item bytes, model output, editor overrides.
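A sketch of an append-only decision log plus a bundle record tying a live URL back to every input. Field names are assumptions; the point is that rows are appended, never updated in place.

```python
import json
import time

def append_event(log_path: str, event: dict) -> None:
    """Append-only: one JSON line per decision, never rewritten."""
    event["ts"] = time.time()
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event) + "\n")

# A bundle links the published URL to all inputs that produced it.
append_event("audit.jsonl", {
    "bundle_id": "b-20240101-0001",                      # placeholder identifier
    "published_url": "https://example.com/story",        # placeholder URL
    "feed_item_sha256": "sha256-of-archived-item-bytes", # placeholder digest
    "model_output_sha256": "sha256-of-model-output",     # placeholder digest
    "editor_overrides": ["headline"],
})
```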
Human-in-the-loop when you need it
Design queues for review without blocking the entire pipeline: high-risk topics go to review; low-risk items flow. Use SLA timers—items should not sit unseen for days because your queue is a dumping ground.
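One way to implement that routing: a topic-based risk check decides whether an item blocks on review, with a deadline attached so the SLA timer has something to fire on. The topic list and four-hour SLA are illustrative assumptions.

```python
import time

HIGH_RISK_TOPICS = {"elections", "health", "legal"}  # illustrative list
REVIEW_SLA_SECONDS = 4 * 3600                        # assumed four-hour SLA

def route(item: dict) -> dict:
    """High-risk items block on human review with a deadline; low-risk items flow."""
    if set(item.get("topics", [])) & HIGH_RISK_TOPICS:
        item["queue"] = "human_review"
        item["review_deadline"] = time.time() + REVIEW_SLA_SECONDS
    else:
        item["queue"] = "auto_publish"
    return item

def overdue(item: dict, now: float) -> bool:
    """SLA timer: alert on anything that breaches its review deadline."""
    return item.get("queue") == "human_review" and now > item["review_deadline"]
```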
Measure reviewer throughput and error rates; reviewers need breaks and rotation too.
Migration and backfills
Backfilling old RSS history can explode cost and create duplicate URLs. If you must backfill, do it in waves, with dedupe against already published URLs and explicit canonical rules, and prefer running the backfill in draft-only mode first.
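A sketch of that wave-based approach; `already_published` stands in for a lookup against your canonical URL index, and the wave size is an arbitrary placeholder.

```python
from typing import Callable, Iterable

def backfill_in_waves(
    items: Iterable[dict],
    already_published: Callable[[str], bool],   # assumed lookup against live URL index
    process_draft_only: Callable[[dict], None],
    wave_size: int = 500,
) -> None:
    """Backfill in bounded waves, skipping items whose canonical URL already exists."""
    wave: list[dict] = []
    for item in items:
        if already_published(item["canonical_url"]):
            continue  # dedupe: never mint a second URL for the same story
        wave.append(item)
        if len(wave) >= wave_size:
            for it in wave:
                process_draft_only(it)  # draft-only first; publish in a later pass
            wave.clear()
    for it in wave:
        process_draft_only(it)
```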
Communicate internally when backfills run—support and sales should know why old stories suddenly appear.
End-to-end acceptance tests
Before major releases, run a full acceptance test in staging: fetch → generate → publish → verify structured data → verify sitemap entry. Automate what you can; keep a human sign-off checklist for what you cannot.
Save signed test reports—auditors love artifacts.
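A sketch of the staging run as ordered checks that stop at the first failure and emit a report artifact for the sign-off checklist. The step names match the flow above; the callables themselves are placeholders for your actual suite.

```python
import json
import time

def run_acceptance(steps: dict) -> dict:
    """Run fetch -> generate -> publish -> verify in order; stop at first failure."""
    names = ["fetch", "generate", "publish", "verify_structured_data", "verify_sitemap"]
    report = {"started": time.time(), "results": []}
    for name in names:
        try:
            steps[name]()  # placeholder callables supplied by your test suite
            report["results"].append({"step": name, "ok": True})
        except Exception as exc:
            report["results"].append({"step": name, "ok": False, "error": str(exc)})
            break
    report["passed"] = (len(report["results"]) == len(names)
                        and all(r["ok"] for r in report["results"]))
    with open("acceptance_report.json", "w", encoding="utf-8") as f:
        json.dump(report, f, indent=2)  # artifact for the human sign-off checklist
    return report
```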
Closing: the printable master checklist
RSS news generation is a chain; the chain fails at the weakest link. Print this article, walk the chain in order, and mark risks: scheduling, normalization, dedupe, generation, media, SEO fields, publish, archive, monitoring, cost controls, and governance. If every box has an owner and a test, you have a system. If not, you have a demo waiting for a bad news day to become a real incident. Build the system.
Appendix: field troubleshooting matrix
Use the table as a first-response map; drill into logs when a row matches your symptom.
| Symptom | Likely cause | First checks |
|---|---|---|
| Posts publish but metadata is wrong | Field mapping or merge-on-retry defaults | WP custom field map; retry path; template version tags |
| Intermittent duplicates | Races between workers or inconsistent dedupe keys per locale | Serialize by source cluster; distributed locks; compare keys across languages |
| Cost spike on quiet news days | Retry storm, accidental backfill, runaway cron | Retry counts per item; scheduled jobs; token usage by job name |
| Good HTML in feed, garbage in CMS | Encoding / double-encoding in parser | Hex-dump a sample; add tests with non-Latin and entities |
Appendix: minimum documentation set (keep in your repo)
Maintain: architecture diagram; data retention policy; list of feeds with owners; template changelog; on-call runbook; disaster recovery steps; and an incident log with lessons learned. Documentation is not bureaucracy—it is how you survive personnel changes and audits.
If your documentation is shorter than this article, you probably have not written down enough reality yet.
Appendix: reader note on length
This article is long on purpose: RSS pipelines fail in edge cases, and edge cases only show up when you operate at volume. Use the printed copy as a workbook—add your environment-specific notes, IP allowlists, API account identifiers (redacted in shared copies), and escalation phone numbers. The goal is a single place your team can trust when production misbehaves at the worst possible time—which, by Murphy’s law, is also when leadership is watching.
Appendix: capacity table (fill quarterly)
Estimate the peak items per hour your pipeline can process end-to-end at current worker counts, model limits, and CMS throughput, then compare to historical peaks. If headroom is under twenty percent, expand capacity before the next predictable news spike, not after queues overflow; see the sketch after the table.
| Line item | Current estimate | Historical peak observed | Headroom % | Owner |
|---|---|---|---|---|
| Peak items/hour (end-to-end) | — | — | — | — |
| Worker / job concurrency | — | — | — | — |
| Model quota (RPM/TPM limits) | — | — | — | — |
| CMS publish throughput | — | — | — | — |
Attach this table to your capacity reviews; numbers beat vibes when finance asks why you need more budget.
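A minimal sketch of the headroom computation referenced above; the example numbers are placeholders for your measured values.

```python
def headroom_pct(capacity_items_per_hour: float, historical_peak: float) -> float:
    """Headroom as a percentage of current capacity; expand when it drops below 20%."""
    return (capacity_items_per_hour - historical_peak) / capacity_items_per_hour * 100

# Example: 1,200 items/hour capacity vs. a 1,050 items/hour historical peak.
print(round(headroom_pct(1200, 1050), 1))  # 12.5 -> under the 20% threshold, expand now
```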
