Project: CIREN Facilitator — CTI Knowledge Graph (CTIKG) bootstrapping for SOL HPC security
Authors: Trevor Whipple + ChatGPT 5 Thinking
Date: 9-10-25
We designed a reproducible pipeline to assemble, pre‑rank, and curate cybersecurity articles relevant to an HPC environment like SOL. The outputs will seed a simple, testable CTI knowledge graph (CTIKG) focused on four operational categories:

- SSH & Credential Abuse
- JupyterHub / Open OnDemand
- NFS / File-Share Exposure
- Cryptomining on HPC

Dependencies: requests, feedparser, python-dateutil, pandas (optional).

Inputs:
- Source list: Sources_Config_Expanded_v2.json (RSS feeds + HTML index pages; weighted by domain)
- Category keywords: Category_Keywords_Expanded.json (include/exclude terms per category)
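The exact schema of the keywords file isn't shown here; a plausible minimal shape, with entirely hypothetical example terms, might look like:

```json
{
  "Cryptomining on HPC": {
    "include": ["xmrig", "cryptominer", "monero"],
    "exclude": ["cryptocurrency price"]
  }
}
```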
We favor reputable advisories, vendor incident-response blogs, and core project/distro pages; news sites are added at lower weight.
Pipeline steps:

1. Pre‑ranking (no full scraping): pre_rank_links_v3.py collects recent and historical items from RSS feeds and select HTML index pages, generating Links_Queue.csv with lightweight metadata.
2. Merge & de‑dupe: merge_dedupe.py combines batches and drops duplicate URLs.
3. Flagging for triage: make_helper_flags.py adds RepFlag, SigFlag, and composite Quality2/Quality4 signals.
4. Selection: select_winners.py picks N winners per category into per‑category CSVs plus a master list.
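The merge & de‑dupe step can be sketched as follows; the `URL` column name and the function signature are assumptions for illustration, not merge_dedupe.py's actual interface:

```python
import glob

import pandas as pd


def merge_dedupe(pattern: str, out_path: str) -> int:
    """Combine all batch CSVs matching `pattern` and drop duplicate URLs."""
    frames = [pd.read_csv(p) for p in sorted(glob.glob(pattern))]
    merged = pd.concat(frames, ignore_index=True)
    # Keep the first occurrence of each URL; later batches lose ties.
    deduped = merged.drop_duplicates(subset="URL", keep="first")
    deduped.to_csv(out_path, index=False)
    return len(deduped)
```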
For each feed item we compute:

- recency_score: exponential decay with a configurable half‑life (30–9999 days).
- category_hits: keyword matches in the title/summary.
- signal_score: presence of CVE IDs / MITRE T‑IDs / IOC tokens.
- domain_weight: taken from the source config.

The final score (roughly on a 0–1 scale) is:

Score = 0.35*domain_weight + 0.30*recency_score + 0.25*(category_hits/3) + 0.10*min(signal_score/3, 1)
We use this only to prioritize; we don’t fetch article bodies at this stage.
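As a sanity check, the formula above can be sketched directly in Python. The weights and normalizers come straight from the formula; the function and argument names are illustrative, not pre_rank_links_v3.py's actual API:

```python
def recency_score(age_days: float, half_life_days: float = 9999.0) -> float:
    # Exponential decay: 1.0 for an item published today, 0.5 at one half-life.
    return 0.5 ** (age_days / half_life_days)


def score(domain_weight: float, age_days: float, category_hits: int,
          signal_score: int, half_life_days: float = 9999.0) -> float:
    # Weighted blend per the formula above; signal_score is capped, while
    # category_hits above 3 can push the total slightly past 1 ("0-1-ish").
    return (0.35 * domain_weight
            + 0.30 * recency_score(age_days, half_life_days)
            + 0.25 * (category_hits / 3)
            + 0.10 * min(signal_score / 3, 1.0))
```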
We target 100 winners per category. Items are ranked by Quality4, then Quality2, RepFlag, SigFlag, and finally Score. Items with Status=Rejected are never selected unless explicitly requested.

From inside the project folder:
# 1) Install deps
python3 -m pip install --upgrade pip
python3 -m pip install requests feedparser python-dateutil pandas
# 2) Pre-rank (big sweep)
python3 pre_rank_links_v3.py \
--sources Sources_Config_Expanded_v2.json \
--categories Category_Keywords_Expanded.json \
--out batch_extra.csv \
--limit_per_feed 500 \
--half_life_days 9999 \
--verbose
# 3) Merge & de-dupe into working queue
python3 merge_dedupe.py Links_Queue_master.csv Links_Queue.csv batch*.csv
mv Links_Queue_master.csv Links_Queue.csv
# 4) Sort by Score (optional view)
python3 - << 'PY'
import pandas as pd
df=pd.read_csv('Links_Queue.csv')
df['Score']=pd.to_numeric(df['Score'],errors='coerce')
df.sort_values('Score',ascending=False).to_csv('Links_Queue_sorted.csv',index=False)
print('Wrote Links_Queue_sorted.csv with',len(df),'rows.')
PY
# 5) Add flags
python3 make_helper_flags.py Links_Queue_sorted.csv
# 6) Auto-select winners (100/category by default)
python3 select_winners.py \
--in Links_Queue_sorted_flags.csv \
--out Links_Queue_with_selected.csv
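The selection order can be sketched with pandas; the column names mirror the flags file described above, but this function is an illustration, not select_winners.py itself:

```python
import pandas as pd


def select_winners(df: pd.DataFrame, per_category: int = 100) -> pd.DataFrame:
    """Pick up to `per_category` items per category by the tie-break chain."""
    # Never select rejected items.
    eligible = df[df["Status"] != "Rejected"]
    # Rank by Quality4, Quality2, RepFlag, SigFlag, then Score, best first.
    ranked = eligible.sort_values(
        ["Quality4", "Quality2", "RepFlag", "SigFlag", "Score"],
        ascending=False,
    )
    # Take the top N within each category.
    return ranked.groupby("Category").head(per_category)
```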
Queue size (pre‑dedupe): 889 rows.
Queue size (unique URLs): 889 rows.
Queue composition by category:

| Category | Count |
|---|---|
| SSH & Credential Abuse | 704 |
| JupyterHub / Open OnDemand | 111 |
| NFS / File-Share Exposure | 71 |
| Cryptomining on HPC | 3 |
Winners selected per category (target: 100):

| Category | SelectedCount |
|---|---|
| SSH & Credential Abuse | 100 |
| JupyterHub / Open OnDemand | 100 |
| NFS / File-Share Exposure | 71 |
| Cryptomining on HPC | 3 |
Top source domains among selected items:

| Source_Domain | Count |
|---|---|
| www.huntress.com | 125 |
| unit42.paloaltonetworks.com | 17 |
| blog.talosintelligence.com | 16 |
| ubuntu.com | 11 |
| www.uptycs.com | 10 |
| www.crowdstrike.com | 10 |
| www.darkreading.com | 10 |
| www.microsoft.com | 10 |
| thedfirreport.com | 9 |
| jupyter.org | 9 |
| googleprojectzero.blogspot.com | 9 |
| www.schneier.com | 5 |
We selected items that are directly actionable for SOC workflows on SOL. This balances breadth (news and general analysis at lower weights) with depth (advisories and post‑mortems at higher weights).
Next: run the winner‑scraper to fetch PDFs and cleaned text for Status=Selected items, then convert them to CTIKG JSONL and link them to actual SOL logs for a pilot evaluation.