Job Fit & Salary Estimator

Design explainer — every step, every calculation, every decision

Headline architecture: Official wage anchor + transparent seniority model + bounded LLM explanation.
Version 0.1.0  ·  CV PDF/DOCX → seniority score 0–100 + salary range + +30% growth plan  ·  Czech ISPV anchor  ·  bilingual CZ/EN

Pipeline overview

The system is a pure-function pipeline with Pydantic-typed I/O at each step. Every step is auditable in isolation: same input → same output. The LLM is bounded to evidence extraction and soft subscores within strict JSON schemas; final score and salary math are deterministic.

```mermaid
flowchart TD
    A[PDF or DOCX] --> B["extract<br/>pypdf / python-docx"]
    B --> C["redact<br/>regex strip PII<br/>+ sha256"]
    C --> D["parse<br/>Claude tool-use<br/>→ CVJson"]
    D --> E["classify<br/>ESCO retrieval<br/>+ Claude pick<br/>+ confidence-gated rollup"]
    E --> F["score<br/>5 subscores<br/>weights sum to 1.0"]
    F --> G["salary<br/>anchored interpolation<br/>through ISPV deciles"]
    G --> H["recommend<br/>+30% branching<br/>+ LLM actions"]
    H --> I[ResultJson]
    I --> J[Streamlit UI / CLI / FastAPI]
    style A fill:#fef3c7,stroke:#b45309
    style I fill:#ecfdf5,stroke:#047857
    style F fill:#eef2ff,stroke:#2563eb
    style G fill:#eef2ff,stroke:#2563eb
```

All seven steps run in sequence. Extract, redact, salary, and the deterministic math inside score and recommend contain no randomness: they're pure functions and produce identical output for identical input. The LLM-backed parts (parse, classify, and the two soft subscores inside score) use Anthropic Claude with structured tool-use; their JSON output is validated against a Pydantic schema before downstream code sees it.

1. Extract — text from PDF/DOCX

What it does: reads a binary file and returns plain text.

Logic

Decision: why no OCR

OCR is out of scope for v1. Image-only PDFs result in low text length, which the parse step flags as "PDF extraction likely failed". The score's confidence label drops to low, and the salary range widens accordingly. Honest under-fitting is safer than fragile OCR.
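
The extract logic fits in a few lines. A minimal sketch under the doc's stated library choices (pypdf, python-docx) — `extract_text` and the low-text heuristic are illustrative names, not the project's actual API:

```python
def looks_like_failed_extraction(text: str, min_chars: int = 200) -> bool:
    """Heuristic: image-only PDFs yield almost no extractable text."""
    return len(text.strip()) < min_chars

def extract_text(path: str) -> str:
    """Dispatch on extension; OCR is deliberately out of scope."""
    lower = path.lower()
    if lower.endswith(".pdf"):
        from pypdf import PdfReader
        return "\n".join((page.extract_text() or "") for page in PdfReader(path).pages)
    if lower.endswith(".docx"):
        from docx import Document  # python-docx
        return "\n".join(p.text for p in Document(path).paragraphs)
    raise ValueError(f"unsupported file type: {path}")
```

Downstream, a True from the heuristic becomes the "PDF extraction likely failed" warning that drags the confidence label down.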

2. Redact — PII pre-pass

What it does: strips emails, phone numbers, URLs, birth dates, and Czech ID numbers from the text before any LLM call. Original file gets a SHA-256 hash for traceability.

Why redact at all? GDPR Art. 5 data minimisation. The system does not need names/contacts/addresses to estimate seniority and salary. Redacting before the LLM call means the model never sees them. This is also a defensibility win for the interview: data minimisation is something any production reviewer expects.

Patterns (in order — order matters)

| Label | Pattern | Why this order |
|---|---|---|
| BIRTH_DATE | `(?:datum narození\|date of birth\|born\|narozen[aá]?)\s*[:\-]?\s*\d{1,2}[./-]\d{1,2}[./-]\d{2,4}` | First — date numerals could otherwise be eaten by the phone regex |
| EMAIL | `\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b` | Standard RFC-ish |
| LINKEDIN | `(?:https?://)?(?:www\.)?linkedin\.com/[^\s)]+` | Before generic URL — LinkedIn is more specific |
| URL | `https?://\S+\|www\.\S+` | Generic fallback |
| PHONE | `(?:(?:\+\|00)\d{1,3}[\s.-]?)?\d{3}[\s.-]\d{3}[\s.-]\d{3}(?!\d)` | Tightened to require explicit separators between three groups of 3 digits, so 03/2018 employment dates aren't captured |
| CZ_ID | `\b\d{6}\s?/\s?\d{3,4}\b` | Rodné číslo (Czech birth ID) |

Known limitation: regex doesn't catch names or addresses. Production would use NER (Named Entity Recognition). For an interview prototype, the regex pass + an explicit "production would use NER" note in the README is acceptable — the methodology is honest.
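
Applied in order, the whole pass is a fold over (label, pattern) pairs. A sketch — the real code hashes the original file bytes, here the input text stands in:

```python
import hashlib
import re

# (label, pattern) in the table's order — BIRTH_DATE before PHONE, LINKEDIN before URL
PII_PATTERNS = [
    ("BIRTH_DATE", r"(?:datum narození|date of birth|born|narozen[aá]?)\s*[:\-]?\s*\d{1,2}[./-]\d{1,2}[./-]\d{2,4}"),
    ("EMAIL", r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    ("LINKEDIN", r"(?:https?://)?(?:www\.)?linkedin\.com/[^\s)]+"),
    ("URL", r"https?://\S+|www\.\S+"),
    ("PHONE", r"(?:(?:\+|00)\d{1,3}[\s.-]?)?\d{3}[\s.-]\d{3}[\s.-]\d{3}(?!\d)"),
    ("CZ_ID", r"\b\d{6}\s?/\s?\d{3,4}\b"),
]

def redact(text: str) -> tuple[str, str]:
    """Replace PII with [LABEL] placeholders; return (redacted text, sha256 of input)."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    for label, pattern in PII_PATTERNS:
        text = re.sub(pattern, f"[{label}]", text, flags=re.IGNORECASE)
    return text, digest
```

Note the PHONE pattern's separator requirement: `03/2018` survives because `/` is not in the separator class.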

3. Parse — text → CVJson

What it does: Claude reads the redacted text and emits a structured CVJson via tool-use. The schema enforces the shape; Pydantic validates the result.

Tool schema (abbreviated)

{
  "name": "parse_cv",
  "input_schema": {
    "type": "object",
    "required": ["roles", "skills", "education", "languages", "certifications",
                 "detected_language", "parse_confidence", "extraction_warnings"],
    "properties": {
      "roles": {"type": "array", "items": {Role schema with title, dates, description, is_current}},
      "skills": {"type": "array", "items": {"type": "string"}},
      "education": {"type": "array", "items": {Education schema}},
      "languages": [...],
      "certifications": [...],
      "detected_language": {"enum": ["cs", "en", "other"]},
      "parse_confidence": {"type": "number", "minimum": 0, "maximum": 1},
      "extraction_warnings": {"type": "array", "items": {"type": "string"}}
    }
  }
}

Decisions

4. Classify — CV → ISCO with confidence-gated rollup

What it does: picks the best ISCO-08 occupation code for the candidate's anchor role. The output drives salary lookup downstream.

Process

  1. Pick anchor role (current role with longest duration; fallback to most recent role ≥ 12 months; fallback to most recent role).
  2. Build a query string from anchor's title + description.
  3. Query the local ESCO occupation index (28 hand-curated ISCO codes covering common occupations across all major groups) → top 8 candidates by Jaccard token overlap + substring bonus.
  4. Pass the shortlist + CV skills + detected language to Claude. Claude picks ONE code, returns {isco_code, role_label, confidence, top1_margin, alternatives}.
  5. Apply the confidence-gated rollup:
| Confidence | top1_margin | ISCO level returned |
|---|---|---|
| ≥ 0.80 | ≥ 0.15 | 4 (4-digit, e.g. 2512) |
| ≥ 0.60 | any | 3 (3-digit, e.g. 251) |
| any | any | 2 (2-digit, e.g. 25) |

Why rollup matters: a 4-digit ISCO has narrow salary distribution. If the LLM is unsure, returning the more aggregate 2-digit median is more honest than committing to a possibly-wrong 4-digit. Salary range automatically widens via the confidence label.

Edge case: high confidence + tiny margin

Confidence 0.85 with margin 0.05 means "I'm sure it's one of these two, I just don't know which." The gate returns level 3, not 4 — both candidates likely share a 3-digit prefix (e.g. 2511 Systems analyst and 2512 Software developer both roll up to 251).
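
The gate itself is tiny; a sketch with assumed names:

```python
def rollup_isco(code: str, confidence: float, top1_margin: float) -> str:
    """Confidence-gated rollup: return the ISCO code at 4-, 3-, or 2-digit granularity."""
    if confidence >= 0.80 and top1_margin >= 0.15:
        return code        # e.g. "2512"
    if confidence >= 0.60:
        return code[:3]    # e.g. "251"
    return code[:2]        # e.g. "25"
```

The edge case above falls through the first guard on margin alone, landing at 3 digits.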

5. Score — five subscores blended

Five subscores, each 0–100, blended by fixed weights summing to 1.0:

| Subscore | Weight | Computed by | Source |
|---|---|---|---|
| relevant_experience | 0.25 | Interval-union YoE × ISCO similarity | deterministic |
| skills_match | 0.25 | Keyword-group overlap with ISCO expected skills | deterministic |
| impact_scope | 0.20 | Numeric outcomes, scope of work, against a rubric | LLM-bounded |
| leadership_ownership_growth | 0.20 | 5 sub-dims × 0–20, summed and capped | LLM-bounded |
| education | 0.10 | Role-sensitive ordinal mapping | deterministic |

total = 0.25·rel_exp + 0.25·skills + 0.20·impact + 0.20·leadership + 0.10·education

The total is mapped to a band:

| Band | Total range |
|---|---|
| Junior | t < 40 |
| Mid | 40 ≤ t < 60 |
| Senior | 60 ≤ t < 80 |
| Lead/Principal | 80 ≤ t < 95 |
| Exec | t ≥ 95 |
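
The blend and the band cut-offs, as a sketch (function names assumed):

```python
WEIGHTS = {
    "relevant_experience": 0.25,
    "skills_match": 0.25,
    "impact_scope": 0.20,
    "leadership_ownership_growth": 0.20,
    "education": 0.10,
}

def total_score(subscores: dict) -> float:
    """Weighted blend; weights sum to 1.0."""
    return sum(w * subscores[k] for k, w in WEIGHTS.items())

def band(total: float) -> str:
    if total < 40:
        return "Junior"
    if total < 60:
        return "Mid"
    if total < 80:
        return "Senior"
    if total < 95:
        return "Lead/Principal"
    return "Exec"
```

Plugging in the worked example's subscores (55, 70, 55, 60, 90) gives 63.25 → Senior.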

5a. relevant_experience (deterministic)

Two YoE measures, both with explicit handling of overlapping/parallel roles:

Total YoE — interval union

Sum the union of all dated role intervals. Parallel jobs don't double-count.

def total_yoe(roles):
    intervals = [(r.start_date, r.end_date or today) for r in roles if r.start_date]
    merged = merge_intervals(sorted(intervals))    # union: [(a, max(b, c)), ...]
    return sum(months_between(s, e) for s, e in merged) / 12.0
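
The `merge_intervals` helper it relies on is the standard sorted sweep; a sketch for completeness:

```python
def merge_intervals(sorted_intervals):
    """Union of (start, end) intervals, pre-sorted by start."""
    merged = []
    for start, end in sorted_intervals:
        if merged and start <= merged[-1][1]:
            # overlaps (or touches) the previous interval — extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

Integers or `date` objects both work here, since only comparison and `max` are used.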

Relevant YoE — FTE-equivalent capped

For each calendar month a role was active, contribute isco_similarity × max(confidence, 0.5), capped at 1.0 per month so overlapping relevant roles can't sum to more than full-time. This is the correct way to handle freelance overlap.

def relevant_yoe(roles, anchor_isco):
    month_weights = {}
    for r in roles:
        weight = isco_similarity(r.isco_code, anchor_isco) * max(r.isco_confidence, 0.5)
        for ym in iter_months(r.start_date, r.end_date or today):
            month_weights[ym] = min(1.0, month_weights.get(ym, 0) + weight)
    return sum(month_weights.values()) / 12.0

ISCO similarity

| Match | Similarity | Example |
|---|---|---|
| identical 4-digit | 1.00 | 2512 ↔ 2512 |
| same 3-digit prefix (minor group) | 0.75 | 2511 ↔ 2512 (Systems analysts / Software developers) |
| same 2-digit prefix (sub-major group) | 0.50 | 2511 ↔ 2521 (both ICT professionals) |
| same occupation code at a different level (suffix match) | 0.25 | 2511 ↔ 3511 (Systems analyst ↔ ICT technician) |
| unknown ISCO on either side | 0.25 | None ↔ 2512 |
| entirely different occupation | 0.10 | 2221 (Nurse) ↔ 2512 (Software developer) |

Design note: the suffix-match tier (0.25) was a deliberate fix. ISCO-08 is hierarchical by major-group (first digit), but the same occupation type at different seniority levels (Professional vs Technician) lives in different major groups (e.g. 2511 vs 3511). Strict prefix matching would say these are unrelated; suffix matching correctly says they're related. Conversely, two different occupations within the same major group (Nurse 2221 vs Software dev 2512, both "Professionals") are correctly scored as unrelated (0.10).

YoE → 0–100 score (piecewise interpolation)

Five anchor points calibrate years of relevant experience to a subscore:

| Years | Score |
|---|---|
| 0 | 0 |
| 2 | 30 |
| 5 | 55 |
| 10 | 80 |
| 15 | 95 |
| 20+ | 100 |

Linear interpolation between anchors. Concave (faster early gains, diminishing returns past 15y) — this matches market reality for IC roles.
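
As code (sketch):

```python
YOE_ANCHORS = [(0, 0), (2, 30), (5, 55), (10, 80), (15, 95), (20, 100)]

def yoe_to_score(years: float) -> float:
    """Piecewise-linear map from relevant YoE to a 0–100 subscore."""
    years = max(0.0, min(20.0, years))
    for (x0, y0), (x1, y1) in zip(YOE_ANCHORS, YOE_ANCHORS[1:]):
        if x0 <= years <= x1:
            return y0 + (years - x0) / (x1 - x0) * (y1 - y0)
    return 100.0  # unreachable after the clamp, kept as a safety net
```

The worked example's ≈5.0 relevant years land exactly on the 5y anchor → 55.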

Pipeline glue: ISCO propagation to roles

Subtle integration bug + fix. The parse step doesn't classify each role individually — only the classify step picks ONE ISCO for the candidate. Without correction, every role has isco_code=None, so isco_similarity returns the 0.25 unknown-default for all of them, deflating relevant experience to roughly a quarter of its real value.

The pipeline now propagates the classified ISCO to the anchor role at full confidence, and to any other role whose title shares a meaningful word with the classified role_label at 70% confidence. Career-changer roles (e.g. "Registered Nurse" vs "Software developer") have no shared title keywords, so they correctly stay None and contribute at the 0.25 baseline — preserving the property that career-changer's relevant YoE < total YoE.

5b. skills_match (deterministic, two-tier)

Three signals combined to 0–100: skill breadth, senior depth markers, and keyword overlap with the anchor ISCO's expected-skill set (see the worked example's 5b row for how the split plays out).

Two-tier confidence

5c. impact_scope (LLM-bounded)

The LLM scores 0–100 against this rubric, returns 1–4 evidence quotes from the CV:

| Range | Description |
|---|---|
| 0–20 | Vague responsibilities, no measurable outcomes |
| 21–40 | Concrete deliverables, no numbers |
| 41–60 | Some measurable results (counts, percentages) |
| 61–80 | Business-level impact (revenue, cost, reliability, scale) |
| 81–100 | Cross-org / multi-million-scale impact |

The schema enforces integer 0–100; Python clamps any out-of-range output as a safety net.

5d. leadership_ownership_growth (5 sub-dimensions × 0–20)

Originally called "personality" in the brief — renamed because CV text cannot defensibly infer personality. CV-observable signals are the honest proxy. Five dimensions, each scored 0/10/20 with evidence:

| Dimension | 0 | 10 | 20 |
|---|---|---|---|
| Ownership | Task executor | Owns features | Owns outcomes / budgets |
| Leadership | No signal | Mentored / coordinated | Led people / strategy / hiring |
| Learning trajectory | Stagnant | Visible progression | Repeated upskilling / domain shifts |
| Impact clarity | Vague | Concrete deliverables | Measurable business results |
| Stability / execution | Unexplained hops | Normal transitions | Sustained delivery + promotions |

Sum of the five = 0–100. Each clamped before summing as a safety net.

5e. education (deterministic, role-sensitive)

Highest formal degree → base score, then adjustment for relevance to the anchor ISCO:

Base score

| Degree | Base score |
|---|---|
| None / unknown | 30 |
| High school / vocational | 45 |
| Bachelor's | 65 |
| Master's | 80 |
| PhD / Doctorate | 90 |

Relevance adjustment

  • +10 if degree field matches anchor ISCO 2-digit keyword set
  • −20 if unrelated to anchor occupation (capped at floor 0)
  • 0 otherwise
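
A sketch of the mapping — the level labels and keyword arguments are illustrative, not the project's schema:

```python
EDU_BASE = {"none": 30, "secondary": 45, "bachelor": 65, "master": 80, "phd": 90}

def education_score(level: str, *, field_matches: bool = False,
                    field_unrelated: bool = False) -> int:
    score = EDU_BASE.get(level, 30)
    if field_matches:
        score += 10      # degree field in the anchor ISCO 2-digit keyword set
    elif field_unrelated:
        score -= 20      # unrelated to the anchor occupation
    return max(0, min(100, score))
```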

No institution-prestige scoring — deliberately excluded to avoid bias and the maintenance overhead of an "elite institution" allowlist. Reviewers see this in the README's "Limitations" section as a deliberate choice.

6. Salary math — the load-bearing logic

Two-stage interpolation: score → percentile → salary. The percentile axis is "where in your ISCO group's salary distribution does this candidate fall." The salary axis is the actual CZK figure read from MPSV ISPV for that ISCO.

6a. Score → percentile (anchored interpolation)

The most important design decision in the whole system. A 75/100 candidate is not P75 of their ISCO cohort. The cohort already includes candidates from every score band; if 75/100 mapped to P75 linearly, the math would be silently wrong. We map named seniority bands to realistic salary percentiles:
| Seniority band | Score anchor | Salary percentile |
|---|---|---|
| Entry / weak match | 20 | P10 |
| Junior | 35 | P25 |
| Solid mid | 55 | P50 |
| Senior | 75 | P75 |
| Lead / Principal | 90 | P90 |

Linear interpolation between anchors; clamped to P5–P95 at the edges (no extrapolation past observed deciles).
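
A sketch of the anchored map. The outer endpoints (score 0 → P5, score 100 → P95) are one reading of "clamped to P5–P95"; the five inner anchors come from the table above:

```python
SCORE_PCT_ANCHORS = [(0, 5), (20, 10), (35, 25), (55, 50), (75, 75), (90, 90), (100, 95)]

def score_to_percentile(score: float) -> float:
    """Piecewise-linear map; never extrapolates past P5/P95."""
    score = max(0.0, min(100.0, score))
    for (s0, p0), (s1, p1) in zip(SCORE_PCT_ANCHORS, SCORE_PCT_ANCHORS[1:]):
        if s0 <= score <= s1:
            return p0 + (score - s0) / (s1 - s0) * (p1 - p0)
```

Score 75 maps to P75 only by coincidence of the anchor choice; the worked example's score 63 maps to P60, not P63.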

(Figure: anchored score→percentile curve through the Entry/Junior/Mid/Senior/Lead anchors, plotted against the naive linear mapping for contrast; score 0–100 on the x-axis, percentile P5–P95 on the y-axis.)

6b. Percentile → salary (linear through ISPV deciles)

MPSV ISPV publishes D1, Q1, median, Q3, D9 per CZ-ISCO occupation. Linear interpolation between observed deciles, clamped to [P10, P90] — never extrapolates past observed data:

def percentile_to_salary(p, *, d1, q1, median, q3, d9):
    p = max(10, min(90, p))                    # clamp to observed range
    points = [(10, d1), (25, q1), (50, median), (75, q3), (90, d9)]
    for (p0, v0), (p1, v1) in zip(points, points[1:]):
        if p0 <= p <= p1:
            t = (p - p0) / (p1 - p0)
            return v0 + t * (v1 - v0)

Why no extrapolation past D1/D9: ISPV doesn't publish percentiles below D1 or above D9. Any "P95 of nurses" or "P3 of CEOs" would be invented, which is exactly the kind of fake precision the methodology is meant to avoid.
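
The growth plan (step 7) needs the inverse, salary → percentile. A sketch consistent with the same decile points; callers compare against D9 separately to detect the above-distribution branch:

```python
def salary_to_percentile(salary: float, *, d1, q1, median, q3, d9) -> float:
    """Inverse lookup through the same decile points, clamped to [P10, P90]."""
    points = [(10, d1), (25, q1), (50, median), (75, q3), (90, d9)]
    if salary <= d1:
        return 10.0
    if salary >= d9:
        return 90.0
    for (p0, v0), (p1, v1) in zip(points, points[1:]):
        if v0 <= salary <= v1:
            return p0 + (salary - v0) / (v1 - v0) * (p1 - p0)
```

With the worked example's deciles, the ×1.30 target of 141,700 CZK comes out at ≈P78.5.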

6c. Confidence-driven salary range width

Confidence labels have a mathematical effect, not just a UI chip. Lower confidence widens the percentile band on each side of the point estimate before re-mapping to salary:

| Confidence | ± width (percentile points) | Effect on a P50 estimate |
|---|---|---|
| high | 10 | P40–P60 |
| medium | 15 | P35–P65 |
| low | 25 | P25–P75 |

Confidence label is high only with zero confidence reasons; it drops to medium on one reason, low on two or more.
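
Putting 6b and 6c together (`percentile_to_salary` repeated from above so the sketch is self-contained):

```python
PCT_WIDTH = {"high": 10, "medium": 15, "low": 25}

def percentile_to_salary(p, *, d1, q1, median, q3, d9):
    p = max(10, min(90, p))                    # clamp to observed range
    points = [(10, d1), (25, q1), (50, median), (75, q3), (90, d9)]
    for (p0, v0), (p1, v1) in zip(points, points[1:]):
        if p0 <= p <= p1:
            t = (p - p0) / (p1 - p0)
            return v0 + t * (v1 - v0)

def salary_range(point_pct: float, confidence: str, deciles: dict) -> tuple:
    """Widen the percentile band by confidence, then re-map both ends to CZK."""
    w = PCT_WIDTH[confidence]
    return (percentile_to_salary(point_pct - w, **deciles),
            percentile_to_salary(point_pct + w, **deciles))
```

With the worked example's deciles, a P60 point at medium confidence widens to P45–P75, roughly 90k–130k CZK.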

7. +30% growth plan — three branches

Compute target_salary = current_salary × 1.30, find where it lands in the ISCO's salary distribution, and branch on that:

```mermaid
flowchart TD
    S["current_salary × 1.30"] --> P["salary_to_percentile<br/>through ISPV deciles"]
    P --> Q{above D9?}
    Q -- yes --> A["role_family_change<br/>can't get +30% in current ISCO"]
    Q -- no --> R{P85 or above?}
    R -- yes --> B["stretch_within_role_or_market_change<br/>realistic only with company/industry/geography move"]
    R -- no --> C["skill_up_within_role<br/>compute required score delta + allocate to subscores"]
    style A fill:#fef3c7,stroke:#b45309
    style B fill:#eef2ff,stroke:#2563eb
    style C fill:#ecfdf5,stroke:#047857
```

7a. role_family_change

Target salary exceeds the top decile of the candidate's current ISCO group. Even being P99 in this role won't get there. Recommendation is to change occupation (different ISCO, higher-paying industry) or change geography/comp model (foreign client, equity). Skill-up alone is mathematically impossible.

7b. stretch_within_role_or_market_change

Target sits in P85–P90. Mathematically reachable inside the current ISCO, but realistically requires a combination of demonstrated impact + a market move (changing companies, industries, or geographies). Skill-up alone usually isn't enough at this band.

7c. skill_up_within_role

Target lands below the P85 threshold, inside the current ISCO's observed distribution. Compute:

  1. target_percentile from salary_to_percentile(target_salary, ...)
  2. required_score = percentile_to_required_score(target_p) (inverse of the anchored interpolation)
  3. score_delta = required_score − current_score
  4. Allocate score_delta across subscores by weighted capacity (gap × weight), skipping relevant_experience — you can't fast-track years of experience.
  5. Pass the deltas + redacted CV text to Claude → 3–5 concrete actions referencing CV evidence.
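
Steps 1–3 plus the branch decision, sketched; the threshold values follow the flowchart above:

```python
def percentile_to_required_score(p: float) -> float:
    """Inverse of the anchored score→percentile map from 6a."""
    anchors = [(10, 20), (25, 35), (50, 55), (75, 75), (90, 90)]
    p = max(10.0, min(90.0, p))
    for (p0, s0), (p1, s1) in zip(anchors, anchors[1:]):
        if p0 <= p <= p1:
            return s0 + (p - p0) / (p1 - p0) * (s1 - s0)

def growth_branch(target_salary: float, d9: float, target_pct: float) -> str:
    if target_salary > d9:
        return "role_family_change"
    if target_pct >= 85:
        return "stretch_within_role_or_market_change"
    return "skill_up_within_role"
```

Note how the thresholds make a ≈P78 target a borderline case, just under the stretch cutoff.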

Subscore-delta allocation

def allocate_subscore_deltas(subscores, required_total_delta, weights, skip):
    gaps = {k: 100 - subscores[k] for k in subscores if k not in skip}
    weighted_capacity = {k: gaps[k] * weights[k] for k in gaps}
    total_capacity = sum(weighted_capacity.values()) or 1.0
    plan = {k: 0.0 for k in subscores}
    for k in gaps:
        share = required_total_delta * (weighted_capacity[k] / total_capacity)
        plan[k] = min(gaps[k], share / weights[k])      # convert total-points back to subscore-points
    return plan
Why this allocation rule: the candidate gets the most "score per unit of effort" by investing in subscores that are both (1) low (large gap) and (2) heavily weighted in the total. Skills and impact dominate. Education's tiny 0.10 weight means a degree change rarely shows up as a recommendation.

Worked example: mid_dev_4y.docx

A 4-year-experienced software engineer in Czechia. The CV (paragraph form):

Jane Smith
Software Engineer

EXPERIENCE
Software Engineer — FinTech Plus, 05/2022 — present
- Owned the payment ingestion service handling 50k tx/day.
- Reduced p95 latency by 35% by introducing async Postgres pool.
- Mentored 1 intern.
Junior Developer — StartupX, 06/2020 — 04/2022
- Built REST APIs in Flask; helped migrate to FastAPI.
- Wrote integration tests; maintained CI pipeline.

SKILLS
Python, FastAPI, Flask, PostgreSQL, Redis, Docker, AWS, Kubernetes, Git, pytest, GitHub Actions

EDUCATION
Master's — Software Engineering, CTU Prague (2018–2020)

Step-by-step

| Step | Output |
|---|---|
| 1. extract | Plain text, ~600 chars (DOCX → string). |
| 2. redact | No emails/phones/URLs in this CV → unchanged. SHA-256 of the original DOCX recorded. |
| 3. parse | 2 roles, 11 skills, 1 education entry. parse_confidence = 0.95. 3 minor warnings (no languages, no certs, education dates inferred). |
| 4. classify | ISCO 2512 "Software developer". Confidence 0.95, top1_margin 0.75 → kept at 4-digit (no rollup). |
| 5a. relevant_experience | Anchor role "Software Engineer" gets ISCO 2512; non-anchor "Junior Developer" shares the word "developer" with role_label "Software developer" → gets 2512 at 0.7× confidence. relevant_yoe ≈ 4.0y · 0.95 + 1.83y · 0.665 ≈ 5.0y → score ≈ 55. |
| 5b. skills_match | 11 skills → breadth 40, no senior depth markers → 0, ISCO-25 keyword overlap (python, fastapi, postgresql, redis, docker, aws, kubernetes, git, pytest) → 30. Total 70, confidence medium. |
| 5c. impact_scope | "Reduced p95 latency by 35%", "50k tx/day" → measurable but not multi-million-scale → LLM scored 55. |
| 5d. leadership_ownership_growth | ownership 14 (owned ingestion service), leadership 10 (mentored 1 intern), learning trajectory 12 (Flask → FastAPI), impact clarity 12 (concrete numbers), stability 12 (clean progression) → total 60. |
| 5e. education | Master's (80) + Software Engineering field matches ISCO 25 keywords (+10) → 90. |
| total | 0.25·55 + 0.25·70 + 0.20·55 + 0.20·60 + 0.10·90 = 13.75 + 17.5 + 11 + 12 + 9 = 63.25 → Senior. |
| 6. salary | Score ≈ 63 → percentile P60 (anchored interpolation). ISCO 2512 ISPV deciles: D1 ~50k, Q1 ~70k, median ~95k, Q3 ~130k, D9 ~180k. P60 ≈ 109,000 CZK/month. Medium confidence (extraction warnings) → ±15 percentile points → range ~90,000–130,000 CZK. |
| 7. growth plan | target = 109,000 × 1.30 ≈ 141,700 CZK. Salary→percentile ≈ P78 — inside the distribution and just below the P85 threshold. Branch: skill_up_within_role (a borderline case, close to the stretch branch). LLM generates 3–5 actions referencing CV evidence (e.g. "Lead a cross-functional initiative for 6 months and document business impact in numbers"). |

Note on numbers. The exact figures above are illustrative — they reflect the methodology after the ISCO-propagation fix. The actual numbers from a live run will vary slightly with the LLM's exact subscore values for impact and leadership. The deterministic parts (relevant_experience, skills, education, salary math) are reproducible.

Validation

No labelled ground-truth dataset exists for "correct" CV scores. Instead, the system is validated by rank-order assertions on a synthetic CV pack: hand-crafted CVs at known seniority bands run through the full pipeline (live LLM calls), and the relative ordering must be sensible.

Synthetic pack (5 CVs in v1 core)

| CV | Profile | Expected band |
|---|---|---|
| junior_dev_1y.docx | 1y dev, strong skills, low impact | Junior (25–40) |
| mid_dev_4y.docx | 4y dev, normal progression | Mid (45–60) or Senior |
| senior_dev_8y.docx | 8y dev, ownership, architecture | Senior (65–80) |
| nurse_to_dev_5y_2y.docx | Career changer (5y nurse → 2y dev) | Mid; salary anchors to dev ISCO, not nurse |
| buzzword_no_evidence.docx | Verbose, no concrete results | Junior or low Mid |

Rank-order assertions (live E2E tests)

# tests/test_e2e_synthetic.py — gated on ANTHROPIC_API_KEY

assert score("junior_dev_1y") < score("mid_dev_4y") < score("senior_dev_8y")
assert score("buzzword_no_evidence") < score("mid_dev_4y")
assert relevant_yoe("nurse_to_dev_5y_2y") < total_yoe("nurse_to_dev_5y_2y")
assert classification("nurse_to_dev_5y_2y").isco_code.startswith("25")  # dev, not nurse
assert score("senior_dev_8y").band in ("Senior", "Lead/Principal")
for cv in all_cvs: assert len(growth_plan(cv).actions) >= 3

Assertions are intentionally loose (band ranges, not exact scores) — the goal is verifying the methodology rank-orders correctly, not that the LLM produces specific numbers.

Latest run: all 6 E2E rank assertions passed in 2:11 across 5 synthetic CVs (~20 LLM calls total). Plus 71 unit tests for the deterministic math (interval union, anchored interpolation, percentile↔salary inverse, +30% branching, etc.) running in < 1 second without an API key.

Pure-function unit tests run without an API key

$ uv run pytest tests/ --ignore=tests/test_e2e_synthetic.py -q
71 passed, 1 skipped in 0.75s

$ uv run pytest tests/ --html=docs/reports/test-report.html \
    --cov=src/job_fit --cov-report=html:docs/reports/coverage --cov-report=term
TOTAL coverage: 86%

Data sources

| Country | Source | Format | Refresh | Status |
|---|---|---|---|---|
| CZ | MPSV ISPV open data (ispv-zamestnani.json) | JSON | semi-annual | implemented |
| EU | Eurostat SES 2022 (earn_ses_main) via the eurostat PyPI package | API | quadrennial | designed (stretch) |
| US | BLS OEWS + official SOC↔ISCO crosswalk | XLS/CSV | annual | designed (stretch) |
| UK | ONS SOC 2020 earnings + CASCOT mapping | CSV | annual | designed (stretch) |
| else | Eurostat fallback + explicit "low confidence" label | — | — | designed (stretch) |

What ISPV gives us

440 rows after fetch, period "rok 2025". Per CZ-ISCO occupation × wage/pay sphere (MZDOVA = private sector, PLATOVA = public sector), with fields: medianMzda, diferenciaceD1M, diferenciaceQ1M, diferenciaceQ3M, diferenciaceD9M, mzdaPrumer, pocetZamestnancuMzda, obdobi, czIsco.

Caveat: the public dataset is CZ-ISCO × sphere only — it is not a full region × education × age cube. Region/education/age salary adjustments are designed but not in v1. This is stated explicitly in the README's "Limitations" section.

What the local ESCO index gives us

A hand-curated CSV of 28 ISCO-08 occupation codes covering all major groups (managers, professionals, technicians, clerical, services, agricultural, trades, drivers). Each row has EN+CZ labels and bilingual keywords for keyword-overlap retrieval. The classify step uses this for top-k retrieval before passing to Claude. Stretch task 29 would extend this to the full ESCO occupation→skill mapping via the ESCO API.

Limitations & honest disclosures

What this is, what this isn't.

This is an interview-grade prototype, not a production compensation engine. The methodology is designed to be auditable and honest about its uncertainty (confidence labels, range widening, rollup on low confidence) rather than maximally precise. A reviewer should be able to follow the math and disagree productively with a specific weight or anchor while still trusting that the architecture is sound.