PlanGraph Construction

Module: src-tauri/src/pipeline/plan_graph.rs Function: build_plan_graph(wall_graph, vlm_response, image_width, image_height) -> PlanGraphJSON

The PlanGraph stage merges CV wall segments with VLM semantic data into a unified intermediate representation called PlanGraphJSON. This is the central data structure of the hybrid pipeline.

Data Structure

pub struct PlanGraphJSON {
    pub wall_segments: Vec<WallSegment>,   // CV-extracted walls
    pub faces: Vec<Face>,                   // Room polygons
    pub labels: Vec<RoomLabel>,             // VLM room labels
    pub doors: Vec<DoorCandidate>,          // VLM doors
    pub windows: Vec<WindowCandidate>,      // VLM windows
    pub scale_candidates: Vec<ScaleCandidate>, // Scale estimates
    pub alignment_scores: Option<AlignmentScores>,
    pub source: String,                     // "hybrid_cv_vlm" or "vlm_only"
}

Wall Segments

CV wall segments are converted directly from the WallGraphResult:

WallSegment {
    id: seg.id.clone(),           // "cv_wall_1"
    start: seg.start,             // [x, y] in pixels
    end: seg.end,                 // [x, y] in pixels
    thickness_px: 3.0,            // default thickness
    source: "cv_mask".into(),
    confidence: seg.confidence,
}

Room Label Extraction

Labels are extracted from the VLM response's detected_rooms array. Supports two formats:

Polygon format (legacy VLM): room has a polygon field, centroid is computed from it
Centroid format (hybrid VLM): room has a centroid field directly

pub struct RoomLabel {
    pub id: String,           // "label_1"
    pub room_type: String,    // "living_room"
    pub name: String,         // "客厅"
    pub centroid: [f64; 2],   // [x, y] in pixels
    pub confidence: f64,
    pub source: String,       // "vlm"
}

Room Face Generation

Faces are the polygon representations of rooms. The system uses a 3-level fallback strategy:

Level 1: VLM Polygons (Preferred)

If the VLM response includes polygon arrays for rooms, those are used directly as faces. This is the highest quality source.

Level 2: Wall Grid Topology

If no VLM polygons are available but labels and wall segments exist, the system generates faces from wall positions:

fn generate_faces_from_walls(labels, wall_segments, image_width, image_height)

Algorithm:

Collect X coordinates from vertical walls (segments where |dx| < |dy| and length >= 50px)
Collect Y coordinates from horizontal walls (segments where |dy| < |dx| and length >= 50px)
Snap nearby coordinates within 20px (average them)
Add image bounds as boundary coordinates
Form grid cells from the X and Y divider arrays
Assign each cell to the nearest room centroid (Euclidean distance)
Ensure coverage: if any label has no cells, reassign the cell nearest to its centroid
Merge cells per label into one polygon (bounding box of all owned cells)
Clip to wall bounding box

Grid formation example:

    X dividers: [100, 300, 500, 700]
    Y dividers: [50, 250, 450]

    Cells:
    +--------+--------+--------+
    | cell   | cell   | cell   |
    | (0,0)  | (1,0)  | (2,0)  |
    +--------+--------+--------+
    | cell   | cell   | cell   |
    | (0,1)  | (1,1)  | (2,1)  |
    +--------+--------+--------+

    Each cell assigned to nearest room centroid.

Level 3: Centroid Subdivision

If the wall grid fails (too few dividers), falls back to pure centroid-based subdivision:

fn generate_faces_from_centroids(labels, wall_segments)

Algorithm:

Cluster centroids by X coordinate (tolerance 50px)
Sort clusters left to right
For each cluster:
- If single room: spans full height of wall bounding box
- If multiple rooms: subdivide vertically by Y midpoints between consecutive centroids
Compute X boundaries as midpoints between cluster centers

Centroid subdivision example:

    Labels: A(200,300), B(200,500), C(600,400)

    X clusters: [A,B] at x~200, [C] at x~600

    +----------+----------+
    |          |          |
    |    A     |          |
    |          |    C     |
    +----------+          |
    |          |          |
    |    B     |          |
    |          |          |
    +----------+----------+

Fallback: Single Room from Bounding Box

If all face generation fails but wall segments exist, generates a single rectangular face from the wall bounding box with a 20px margin.

Opening Binding

Doors and windows from the VLM are snapped to the nearest CV wall segment:

fn snap_openings_to_walls(doors, windows, wall_segments, labels, max_snap_distance: 120.0)

For each opening:

Find the nearest point on any wall segment
If distance <= 120px, snap the opening position to that point
If distance > 120px, downgrade confidence by 50% (capped at 0.3)
Filter out invalid room references (room names not in labels)
Update wall_side for windows based on nearest wall orientation

Scale Extraction

The system collects multiple scale candidates and selects the one with highest confidence:

pub struct ScaleCandidate {
    pub meters_per_pixel: f64,
    pub source_text: String,
    pub confidence: f64,
}

Source 1: VLM scale_info (confidence: 0.4-0.8)

From the VLM response's scale_info.meters_per_pixel. Confidence is 0.8 if detected=true and the resulting dimensions are plausible (0.5-20m per axis), otherwise 0.2-0.4.

Source 2: Overall Dimensions (confidence: 0.75)

From overall_dimensions.width_pixels and width_meters. Computes meters_per_pixel = width_meters / width_pixels.

Source 3: Dimension Annotations (confidence: 0.9)

Cross-validates VLM-reported dimension annotations against the CV wall bounding box extent:

// For each annotation like "3600" (horizontal):
let dim_m = 3600.0 / 1000.0;  // 3.6m
let extent_px = wall_bbox_width;  // e.g., 480px
let mpp = dim_m / extent_px;  // 0.0075

Uses the average meters_per_pixel across all valid annotations. This is the highest-confidence source because it uses concrete numbers from the floor plan.

Source 4: CV Wall Extent Fallback (confidence: 0.45)

Assumes the longest wall extent represents a typical residential dimension of ~8 meters:

let cv_mpp = 8.0 / max_extent_px;

Only used if the resulting dimensions are plausible (1-20m per axis).

Scale Selection

The downstream convert::convert_plan_graph_to_scene() function selects the candidate with the highest confidence value.

Source Field

The source field indicates the geometry quality:

"hybrid_cv_vlm" -- 3+ CV wall segments were detected
"vlm_only" -- fewer than 3 CV segments (degraded quality)

Output Example

{
  "wall_segments": [
    {
      "id": "cv_wall_1",
      "start": [100.0, 50.0],
      "end": [800.0, 50.0],
      "thickness_px": 3.0,
      "source": "cv_mask",
      "confidence": 0.8
    }
  ],
  "faces": [
    {
      "id": "face_1",
      "polygon": [[100.0, 50.0], [450.0, 50.0], [450.0, 400.0], [100.0, 400.0], [100.0, 50.0]],
      "area_px": 122500.0,
      "label_ref": "label_1",
      "source": "wall_grid"
    }
  ],
  "labels": [
    {
      "id": "label_1",
      "room_type": "living_room",
      "name": "客厅",
      "centroid": [275.0, 225.0],
      "confidence": 0.9,
      "source": "vlm"
    }
  ],
  "doors": [
    {
      "id": "door_1",
      "position": [300.0, 50.0],
      "width_meters": 0.9,
      "connected_rooms": ["living_room", "kitchen"],
      "swing_direction": "left_inward",
      "confidence": 0.8
    }
  ],
  "windows": [],
  "scale_candidates": [
    {
      "meters_per_pixel": 0.0075,
      "source_text": "dimension_annotations",
      "confidence": 0.9
    }
  ],
  "alignment_scores": null,
  "source": "hybrid_cv_vlm"
}

Saved to data/pipeline/{project_id}/plan_graph.json.

Data Structure​

Wall Segments​

Room Label Extraction​

Room Face Generation​

Level 1: VLM Polygons (Preferred)​

Level 2: Wall Grid Topology​

Level 3: Centroid Subdivision​

Fallback: Single Room from Bounding Box​

Opening Binding​

Scale Extraction​

Source 1: VLM scale_info (confidence: 0.4-0.8)​

Source 2: Overall Dimensions (confidence: 0.75)​

Source 3: Dimension Annotations (confidence: 0.9)​

Source 4: CV Wall Extent Fallback (confidence: 0.45)​

Scale Selection​

Source Field​

Output Example​