Methodology
How PORID collects, processes, and presents OR solver and tool data
1. Data Collection
PORID aggregates data from 10 sources via a Python pipeline that runs daily at 07:00 UTC through GitHub Actions. Each source is fetched by a dedicated module with its own rate-limiting and error handling.
arXiv (math.OC, cs.AI, cs.DS) — 7-day lookback via OAI-PMH.
Crossref — 28 OR journals, 14-day lookback window.
OpenAlex — concept-based queries for OR topics.
Semantic Scholar — keyword search across their full corpus.
Optimization Online — RSS feed parsing.
GitHub Releases API — tracks 19 open-source solver repositories (OR-Tools, CBC, HiGHS, SCIP, GLPK, JuMP, Pyomo, CVXPY, and more).
PyPI JSON API — download statistics and release metadata for 8 Python OR packages.
WikiCFP — RSS feeds plus manually curated entries in config.yaml; tracks call-for-papers deadlines, venues, and formats.
NSF Awards API — active awards in OR, combinatorial optimization, and mathematical programming.
OpenAIRE / CORDIS — EU-funded Horizon Europe and Horizon 2020 projects in optimization.
2. Classification
Every item passes through a tag classifier that scans its title and abstract (lowercased) for keyword matches. The taxonomy covers 16 OR subdomains:
linear-programming, integer-programming, metaheuristics, network-optimization, scheduling, vehicle-routing, stochastic, ml-for-or, healthcare-or, supply-chain, facility-location, multi-objective, decomposition, constraint-programming, game-theory, survey
Each subdomain has 5–15 keywords defined in config.yaml. Short keywords (4 characters or fewer) use word-boundary matching to prevent false positives; longer keywords use substring matching. Items with no keyword hits receive a general-or fallback tag.
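The matching rules above can be sketched in a few lines. The taxonomy fragment below is a made-up stand-in for the real keyword lists in config.yaml:

```python
import re

# Hypothetical taxonomy fragment; the real keywords live in config.yaml.
TAXONOMY = {
    "linear-programming": ["linear programming", "simplex", "lp"],
    "vehicle-routing": ["vehicle routing", "vrp", "cvrp"],
}


def classify(title: str, abstract: str) -> list[str]:
    """Return subdomain tags whose keywords appear in the item's text."""
    text = f"{title} {abstract}".lower()
    tags = []
    for tag, keywords in TAXONOMY.items():
        for kw in keywords:
            if len(kw) <= 4:
                # Short keywords: word-boundary match, so "lp" does not
                # fire inside words like "help".
                if re.search(rf"\b{re.escape(kw)}\b", text):
                    tags.append(tag)
                    break
            elif kw in text:
                # Longer keywords: plain substring match is safe enough.
                tags.append(tag)
                break
    return tags or ["general-or"]  # fallback when nothing matches
```

The word-boundary rule for short keywords is what keeps an acronym like "lp" from tagging every abstract that happens to contain "help".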
3. Deduplication
Because PORID pulls from overlapping sources, deduplication runs in two passes:
- Pass 1 — DOI exact match: items sharing the same DOI are deduplicated (first seen wins).
- Pass 2 — Title Jaccard similarity: titles are normalized (lowercased, punctuation stripped, stop words under 3 characters removed), then compared as word sets. Pairs with Jaccard similarity above 0.85 are treated as duplicates (first seen wins).
An additional ID-based merge prevents re-adding items already present in the stored dataset from prior pipeline runs.
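The two passes can be sketched as follows — field names like "doi" and "title" are assumptions about the item schema, and the real pipeline's normalization details may differ:

```python
import re

STOP_MIN_LEN = 3  # words under 3 characters are treated as stop words


def normalize(title: str) -> frozenset[str]:
    """Lowercase, strip punctuation, drop short stop words, return a word set."""
    words = re.sub(r"[^\w\s]", " ", title.lower()).split()
    return frozenset(w for w in words if len(w) >= STOP_MIN_LEN)


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard similarity of two word sets: |A ∩ B| / |A ∪ B|."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def deduplicate(items: list[dict], threshold: float = 0.85) -> list[dict]:
    """Two-pass dedup: DOI exact match, then title Jaccard. First seen wins."""
    kept, seen_dois, seen_titles = [], set(), []
    for item in items:
        doi = item.get("doi")
        if doi and doi in seen_dois:
            continue  # pass 1: an item with this DOI is already stored
        words = normalize(item["title"])
        if any(jaccard(words, prev) > threshold for prev in seen_titles):
            continue  # pass 2: a near-identical title is already stored
        kept.append(item)
        if doi:
            seen_dois.add(doi)
        seen_titles.append(words)
    return kept
```

Because items arrive in source order, "first seen wins" simply means later duplicates are skipped rather than merged.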
4. Quality Scoring
Each item receives a relevance score (0–100) used for default sorting and the Pulse view. The score is computed as follows:
- Base score: 10 points
- Tag matches: +5 per matched subdomain tag (max +40)
- Freshness: +20 if published in last 24 hours, +10 if last 3 days, +5 if last 7 days
- DOI present: +10 (indicates peer-reviewed or formally published)
- Abstract length: +5 if abstract exceeds 20 characters
Items with future dates (bad metadata) are filtered out entirely. Validation also drops items missing required fields (id, title, type, date).
5. Archival Policy
Items older than 90 days are removed from the active dataset during each pipeline run. They are not discarded: each is appended to a monthly archive file at data/archive/YYYY-MM.json.
This keeps the dashboard fast and focused on recent developments while preserving historical data for analysis. The incremental changelog retains the last 90 entries.
6. Transparency & Open Source
PORID is fully open source. The entire pipeline, configuration, and frontend code are available on GitHub:
- github.com/mghnasiri/PORID — source code, pipeline scripts, and configuration
- GitHub Actions logs — every pipeline run is logged and auditable
- config.yaml — full list of sources, journals, keywords, and solver repos
Found an error? Use the Report button on any card to open a pre-filled GitHub Issue, or visit the issues page directly.