Methodology

How PORID collects, processes, and presents OR solver and tool data

1. Data Collection

PORID aggregates data from 10 sources via a Python pipeline that runs daily at 07:00 UTC through GitHub Actions. Each source is fetched by a dedicated module with its own rate-limiting and error handling.
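To illustrate the per-source structure, here is a minimal sketch of what one fetcher module could look like. The class name, method names, and rate-limiting scheme are assumptions for illustration, not the actual PORID interface:

```python
import time

class SourceFetcher:
    """Illustrative shape of a per-source module with its own
    rate limiting and error handling (names are hypothetical)."""

    def __init__(self, name: str, min_interval: float = 1.0):
        self.name = name
        self.min_interval = min_interval  # seconds between requests
        self._last_request = 0.0

    def _throttle(self) -> None:
        """Sleep until min_interval has elapsed since the last request."""
        wait = self.min_interval - (time.monotonic() - self._last_request)
        if wait > 0:
            time.sleep(wait)
        self._last_request = time.monotonic()

    def fetch(self) -> list[dict]:
        """Fetch raw items; a failing source returns an empty list
        so it cannot abort the whole pipeline run."""
        try:
            self._throttle()
            return self._request_items()
        except Exception:
            return []  # the real pipeline would log the error here

    def _request_items(self) -> list[dict]:
        raise NotImplementedError  # each source implements its own request
```

Isolating failures per source means one flaky API (a timeout, a schema change) degrades a single feed rather than the whole daily run.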

Publications (5 APIs)

arXiv (math.OC, cs.AI, cs.DS) — 7-day lookback via OAI-PMH.
Crossref — 28 OR journals, 14-day lookback window.
OpenAlex — concept-based queries for OR topics.
Semantic Scholar — keyword search across their full corpus.
Optimization Online — RSS feed parsing.

Software (2 APIs)

GitHub Releases API — tracks 19 open-source solver and modeling-tool repositories (OR-Tools, CBC, HiGHS, SCIP, GLPK, JuMP, Pyomo, CVXPY, and more).
PyPI JSON API — download statistics and release metadata for 8 Python OR packages.

Conferences (1 source)

WikiCFP RSS feeds plus manually curated entries in config.yaml. Tracks call-for-papers deadlines, venues, and formats.

Funding & Opportunities (2 sources)

NSF Awards API — active awards in OR, combinatorial optimization, and mathematical programming.
OpenAIRE / CORDIS — EU-funded Horizon Europe and Horizon 2020 projects in optimization.

2. Classification

Every item passes through a tag classifier that scans its title and abstract (lowercased) for keyword matches. The taxonomy covers 16 OR subdomains:

linear-programming       integer-programming       metaheuristics
network-optimization     scheduling                vehicle-routing
stochastic               ml-for-or                 healthcare-or
supply-chain             facility-location         multi-objective
decomposition            constraint-programming    game-theory
survey

Each subdomain has 5–15 keywords defined in config.yaml. Short keywords (4 characters or fewer) use word-boundary matching to prevent false positives. Longer keywords use substring matching. Items with no keyword hits receive a general-or fallback tag.
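The matching rule described above can be sketched as follows. The taxonomy dict stands in for a small subset of config.yaml; the tag names are real, but the keyword lists here are illustrative:

```python
import re

# Illustrative subset of the config.yaml taxonomy (keywords are examples).
TAXONOMY = {
    "linear-programming": ["linear programming", "simplex", "lp"],
    "vehicle-routing": ["vehicle routing", "vrp", "cvrp"],
}

def classify(title: str, abstract: str) -> list[str]:
    """Return every subdomain tag whose keywords appear in the text."""
    text = f"{title} {abstract}".lower()
    tags = []
    for tag, keywords in TAXONOMY.items():
        for kw in keywords:
            if len(kw) <= 4:
                # Short keywords: word-boundary match so e.g. "lp"
                # does not fire inside "help" or "vrp".
                hit = re.search(rf"\b{re.escape(kw)}\b", text) is not None
            else:
                hit = kw in text
            if hit:
                tags.append(tag)
                break  # one hit per subdomain is enough
    return tags or ["general-or"]  # fallback tag when nothing matches
```

Word-boundary matching only for short keywords keeps the common case (long, specific phrases) cheap while guarding the abbreviations most prone to false positives.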

3. Deduplication

Because PORID pulls from overlapping sources, deduplication runs in two passes:

  • Pass 1 — DOI exact match: items sharing the same DOI are deduplicated (first seen wins).
  • Pass 2 — Title Jaccard similarity: titles are normalized (lowercased, punctuation stripped, words shorter than 3 characters removed as stop words), then compared as word sets. Pairs with Jaccard similarity above 0.85 are treated as duplicates (first seen wins).

An additional ID-based merge prevents re-adding items already present in the stored dataset from prior pipeline runs.
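A minimal sketch of the two passes, collapsed into a single loop for brevity. It assumes items are dicts with a title and an optional doi field (field names are assumptions):

```python
import re

STOP_LEN = 3  # words shorter than this are dropped during normalization

def normalize(title: str) -> set[str]:
    """Lowercase, strip punctuation, drop short stop words; return a word set."""
    words = re.sub(r"[^\w\s]", " ", title.lower()).split()
    return {w for w in words if len(w) >= STOP_LEN}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two word sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def dedupe(items: list[dict]) -> list[dict]:
    """Pass 1: exact DOI match. Pass 2: title Jaccard > 0.85. First seen wins."""
    kept, seen_dois, seen_titles = [], set(), []
    for item in items:
        doi = item.get("doi")
        if doi and doi in seen_dois:
            continue  # duplicate DOI
        words = normalize(item["title"])
        if any(jaccard(words, prev) > 0.85 for prev in seen_titles):
            continue  # near-identical title
        if doi:
            seen_dois.add(doi)
        seen_titles.append(words)
        kept.append(item)
    return kept
```

Comparing word sets rather than raw strings makes the check robust to punctuation, casing, and small wording differences between sources.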

4. Quality Scoring

Each item receives a relevance score (0–100) used for default sorting and the Pulse view. The score is computed as follows:

  • Base score: 10 points
  • Tag matches: +5 per matched subdomain tag (max +40)
  • Freshness: +20 if published in last 24 hours, +10 if last 3 days, +5 if last 7 days
  • DOI present: +10 (indicates peer-reviewed or formally published)
  • Abstract length: +5 if abstract exceeds 20 characters

Items with future dates (bad metadata) are filtered out entirely. Validation also drops items missing required fields (id, title, type, date).
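The scoring rules above translate directly into a small function. This is a sketch of the published formula, not the actual implementation; the field names (tags, date, doi, abstract) are assumptions:

```python
from datetime import datetime, timezone

def relevance_score(item: dict, now: datetime) -> int:
    """Compute the 0-100 relevance score described above.
    Assumes future-dated items were already filtered out."""
    score = 10  # base score
    score += min(5 * len(item.get("tags", [])), 40)  # +5 per tag, capped
    age = now - item["date"]
    if age.days < 1:       # published in the last 24 hours
        score += 20
    elif age.days < 3:
        score += 10
    elif age.days < 7:
        score += 5
    if item.get("doi"):    # peer-reviewed / formally published signal
        score += 10
    if len(item.get("abstract", "")) > 20:
        score += 5
    return min(score, 100)
```

The maximum reachable score under these rules is 85 (10 + 40 + 20 + 10 + 5), so the 0-100 scale leaves headroom.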

5. Archival Policy

Items older than 90 days are removed from the active dataset during each pipeline run. They are not discarded: each is appended to a monthly archive file at data/archive/YYYY-MM.json.

This keeps the dashboard fast and focused on recent developments while preserving historical data for analysis. The incremental changelog retains the last 90 entries.
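The archival step can be sketched as a partition by month. The function signature and on-disk structure here are assumptions consistent with the data/archive/YYYY-MM.json layout above:

```python
import json
from collections import defaultdict
from datetime import date, timedelta
from pathlib import Path

def archive_old(items: list[dict], today: date, archive_dir: Path) -> list[dict]:
    """Move items older than 90 days into monthly archive files;
    return the remaining active items."""
    cutoff = today - timedelta(days=90)
    active, by_month = [], defaultdict(list)
    for item in items:
        d = date.fromisoformat(item["date"])
        if d < cutoff:
            by_month[d.strftime("%Y-%m")].append(item)  # e.g. "2024-01"
        else:
            active.append(item)
    for month, old in by_month.items():
        path = archive_dir / f"{month}.json"
        # Append to any existing archive for that month.
        existing = json.loads(path.read_text()) if path.exists() else []
        path.write_text(json.dumps(existing + old, indent=2))
    return active
```

Appending to monthly files keeps each archive small enough to diff and load independently, while the active dataset stays bounded at roughly 90 days of items.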

6. Transparency & Open Source

PORID is fully open source. The entire pipeline, configuration, and frontend code are available on GitHub.

Found an error? Use the Report button on any card to open a pre-filled GitHub Issue, or visit the issues page directly.