talan-hdf/semantic-suggestion

TYPO3 extension for suggesting semantically related pages

Installs: 1 551

Dependents: 0

Suggesters: 0

Security: 0

Stars: 10

Watchers: 4

Forks: 5

Open Issues: 1

Type:typo3-cms-extension

3.1.0 2025-09-18 13:16 UTC

README

TYPO3 12 TYPO3 13 Latest Stable Version License

Elevate your TYPO3 website with intelligent, AI-powered content recommendations using advanced NLP technology.

Introduction

The Semantic Suggestion extension revolutionizes the way related content is presented on TYPO3 websites. Moving beyond traditional category-based functionalities, this extension employs advanced NLP (Natural Language Processing) and TF-IDF (Term Frequency-Inverse Document Frequency) analysis to create genuinely relevant content connections.

Since version 2.0.0, similarity scores are stored in a dedicated database table (tx_semanticsuggestion_similarities) instead of the TYPO3 cache. A Scheduler task handles calculation and storage, ensuring persistence and performance.

๐Ÿš€ NEW in v3.0.0: Enhanced with nlp_tools integration for professional-grade text analysis, featuring intelligent language detection for TYPO3 12/13 multilingual sites and optimized German language support.

Key Benefits:

  • ๐ŸŽฏ Highly Relevant Links: TF-IDF vectorization with advanced NLP creates genuinely relevant content connections
  • ๐ŸŒ Multilingual Intelligence: Smart language detection for TYPO3 12/13 sites with automatic content analysis
  • ๐Ÿ‡ฉ๐Ÿ‡ช German Language Optimized: Professional stemming and handling of compound words (Automobilindustrie โ†” Automobil)
  • ๐Ÿ“ˆ Increased User Engagement: Keep visitors on your site longer by offering truly related content
  • ๐Ÿ” Semantic Cocoon: Contributes to a high-quality semantic network, enhancing SEO and navigation
  • โšก Intelligent Automation: Reduces manual linking work while improving internal link quality by 30-50%

Performance Considerations

  • The similarity calculation process performed by the Scheduler task can take time, especially on sites with a large number of pages (>500 pages might require 30s or more depending on the server).
  • Displaying suggestions and statistics (reading from the database) is optimized.
  • Use the backend module to assess the performance and relevance of suggestions for your specific setup.

๐Ÿ†• What's New in Version 3.0.0

๐Ÿง  Advanced NLP Integration

  • TF-IDF Vectorization: Professional-grade similarity calculation replacing basic cosine similarity
  • Intelligent Language Detection: Hybrid approach using TYPO3 configuration + content analysis
  • Advanced Text Processing: Stemming, stop word removal, and text normalization via nlp_tools

๐ŸŒ Enhanced Multilingual Support

  • TYPO3 12/13 Compatibility: Uses modern Context API and Site Configuration
  • Smart Language Detection:
    • Priority 1: Uses TYPO3 site language configuration (de_DE.UTF-8 โ†’ de)
    • Priority 2: Falls back to intelligent content analysis when TYPO3 config is uncertain
    • Priority 3: Confidence scoring prevents misdetection of mixed-language content
  • Multi-site Ready: Supports complex multilingual TYPO3 setups

๐Ÿ‡ฉ๐Ÿ‡ช German Language Excellence

  • Professional Stemming: Uses Wamania\Snowball for German morphology
  • Compound Word Support: Recognizes relationships between "Automobil" and "Automobilindustrie"
  • Umlaut Handling: Perfect support for รค, รถ, รผ, รŸ characters
  • German Stop Words: Optimized list including der, die, das, ein, eine, mit, von, etc.

๐Ÿ“Š Enhanced Algorithm

  • TF-IDF Vectors: More accurate similarity scoring than traditional word counting
  • Confidence Thresholds: Prevents false positives in language detection
  • Fallback Mechanisms: Graceful degradation if NLP processing fails
  • Performance Caching: TF-IDF vectors and stemming results are cached

Table of Contents

Features

๐Ÿ”ง Core Functionality

  • Advanced TF-IDF Analysis: Professional semantic similarity calculation using vectorization
  • Database Storage: Persistent similarity scores in tx_semanticsuggestion_similarities table
  • Automated Processing: Scheduler task for background calculation and updates
  • Frontend Integration: Clean API for displaying suggestions (title, media, excerpt)
  • Backend Analytics: Comprehensive statistics and performance metrics

๐ŸŒ Multilingual & NLP Features

  • Intelligent Language Detection: Hybrid TYPO3 + content analysis approach
  • Professional Text Processing: Stemming, stop word removal, text normalization
  • German Language Excellence: Optimized for compound words and umlauts
  • Multi-site Support: Handles complex TYPO3 multilingual configurations
  • Confidence Scoring: Prevents false language detection in mixed content

โš™๏ธ Configuration & Performance

  • Highly Configurable: TypoScript settings for display and Scheduler for analysis scope
  • Performance Optimized: TF-IDF vector caching and intelligent fallbacks
  • Flexible Exclusions: Page-level exclusions for analysis and/or display
  • Debug Mode: Comprehensive logging for development and troubleshooting

Requirements

  • TYPO3: 12.0.0 - 13.9.99
  • PHP: 8.0 or higher
  • Dependencies:
    • cywolf/nlp-tools (automatically installed via Composer)
    • wamania/php-stemmer (for advanced stemming support)

Installation

Composer Installation (recommended)
  1. Install the extension with dependencies:
    composer require talan-hdf/semantic-suggestion
    composer require cywolf/nlp-tools
  2. Activate both extensions in the TYPO3 Extension Manager.
  3. Clear TYPO3 cache: ./vendor/bin/typo3 cache:flush
Manual Installation
  1. Download the extension from the TER or GitHub.
  2. Upload the archive to typo3conf/ext/.
  3. Activate the extension in the Extension Manager.

Language Detection System

๐Ÿง  How Language Detection Works

The extension uses a hybrid approach that combines TYPO3's multilingual configuration with intelligent content analysis:

Detection Priority (Cascade System):

  1. ๐ŸŽฏ Priority 1: TYPO3 Site Configuration (for TYPO3 12/13)

    // Uses TYPO3's Context API and Site Configuration
    $site = $request->getAttribute('site');
    $siteLanguage = $site->getLanguageById($languageId);
    $locale = $siteLanguage->getLocale(); // e.g., "de_DE.UTF-8"
    return strtolower(substr($locale->getLanguageCode(), 0, 2)); // โ†’ "de"
  2. ๐Ÿ” Priority 2: Content Analysis (when TYPO3 config is uncertain)

    • Extracts character trigrams from text content
    • Compares against language profiles built from stop words
    • Uses confidence scoring to prevent false positives
  3. โš–๏ธ Priority 3: Confidence Verification

    // If confidence difference < 30%, trust TYPO3 context
    if (($firstScore - $secondScore) / $firstScore < 0.3) {
        return $this->getTypo3LanguageContext();
    }

๐ŸŒ Multilingual Site Examples

Example 1: Well-configured TYPO3 Site

# site/config.yaml
languages:
  - languageId: 0
    locale: 'en_US.UTF-8'  # โ†’ nlp_tools detects "en"
  - languageId: 1  
    locale: 'de_DE.UTF-8'  # โ†’ nlp_tools detects "de"
  - languageId: 2
    locale: 'fr_FR.UTF-8'  # โ†’ nlp_tools detects "fr"

Result: nlp_tools uses TYPO3 configuration directly via Site API

Example 2: Mixed Content Detection

Page with:
- sys_language_uid = 0 (English)
- But content: "Die deutsche Automobilindustrie ist sehr wichtig"

Detection results:
- Content analysis: 85% German confidence
- TYPO3 context: English
- Final decision: German (high confidence overrides TYPO3)

Example 3: Uncertain Content

Detection scores: French: 45%, German: 42%
Confidence difference: (45-42)/45 = 6.7% < 30%
โ†’ Falls back to TYPO3 language context

๐Ÿš€ nlp_tools Integration

Yes, the extension fully integrates with nlp_tools:

// In PageAnalysisService.php
protected function getCurrentLanguage(): string
{
    // Uses nlp_tools LanguageDetectionService
    $detectedLanguage = $this->languageDetector->detectLanguage('');
    return $detectedLanguage; // Smart hybrid detection
}

The language detection leverages:

  • LanguageDetectionService: Trigram analysis and confidence scoring
  • TextAnalysisService: Advanced text processing and stemming
  • TextVectorizerService: TF-IDF vectorization for similarity calculation

๐ŸŒ Language Support Matrix

All languages benefit from nlp_tools integration, but with different optimization levels:

๐Ÿฅ‡ EXPERT Level Support (Maximum Optimization)

Language Code Stemming Stop Words TF-IDF Improvement
๐Ÿ‡ฉ๐Ÿ‡ช German de โœ… Advanced (Wamania\Snowball) โœ… Specialized โœ… Full +40-50%

German Features:

  • Professional compound word handling (Automobilindustrie โ†” Automobil)
  • Perfect umlaut support (รค, รถ, รผ, รŸ)
  • Specialized stop words (der, die, das, ein, eine, mit, von, zu...)

๐Ÿฅˆ ADVANCED Level Support (Professional Optimization)

Language Code Stemming Stop Words TF-IDF Improvement
๐Ÿ‡ซ๐Ÿ‡ท French fr โœ… Wamania\Snowball โœ… Specialized โœ… Full +30-40%
๐Ÿ‡ฌ๐Ÿ‡ง English en โœ… Wamania\Snowball โœ… Specialized โœ… Full +30-40%
๐Ÿ‡ช๐Ÿ‡ธ Spanish es โœ… Wamania\Snowball โœ… Specialized โœ… Full +30-40%

๐Ÿฅ‰ STANDARD Level Support (TF-IDF Optimization)

Language Code Stemming Stop Words TF-IDF Improvement
๐Ÿ‡ฎ๐Ÿ‡น Italian it โŒ Basic tokenization โš ๏ธ Generic โœ… Full +20-30%
๐Ÿ‡ต๐Ÿ‡น Portuguese pt โŒ Basic tokenization โš ๏ธ Generic โœ… Full +20-30%
๐Ÿ‡ณ๐Ÿ‡ฑ Dutch nl โŒ Basic tokenization โš ๏ธ Generic โœ… Full +20-30%
All Others * โŒ Basic tokenization โš ๏ธ Generic โœ… Full +20-25%

๐Ÿ“Š What Every Language Gets

โœ… All languages benefit from:

  1. TF-IDF Vectorization: Professional semantic similarity calculation
  2. Intelligent Language Detection: Hybrid TYPO3 + content analysis
  3. Advanced Text Cleaning: Unicode normalization, accent removal
  4. Performance Caching: TF-IDF vectors and processing results cached
// ALL languages get TF-IDF processing
$tfidfResult = $textVectorizer->createTfIdfVectors([$text1, $text2], $language);
$similarity = $textVectorizer->cosineSimilarity($vector1, $vector2);

// Only de/fr/en/es get advanced stemming
if (in_array($language, ['de', 'fr', 'en', 'es'])) {
    $stemmedWords = $textAnalyzer->stem($text, $language);
} else {
    $stemmedWords = $textAnalyzer->tokenize($text); // Basic tokenization
}

๐ŸŽฏ Language-Specific Configuration Examples

๐Ÿ‡ฉ๐Ÿ‡ช German Sites (Maximum Performance)

# Scheduler Configuration: minimumSimilarity = 0.15
plugin.tx_semanticsuggestion_suggestions.settings {
    enableStemming = 1                # CRITICAL for compound words
    proximityThreshold = 0.25         # Lower threshold (compound matching)
    minTextLength = 50                # German compound words in short text

    analyzedFields {
        title = 2.0                   # German titles contain key compounds
        keywords = 2.5                # German keywords very specific
        content = 1.0                 # Standard weight
        description = 1.2             # Meta descriptions useful
        abstract = 1.3                # German abstracts well-structured
    }
}

๐Ÿ‡ซ๐Ÿ‡ท๐Ÿ‡ฌ๐Ÿ‡ง๐Ÿ‡ช๐Ÿ‡ธ French/English/Spanish Sites

# Scheduler Configuration: minimumSimilarity = 0.2
plugin.tx_semanticsuggestion_suggestions.settings {
    enableStemming = 1                # Advanced stemming available
    proximityThreshold = 0.3          # Standard TF-IDF threshold
    minTextLength = 50                # Standard minimum length

    analyzedFields {
        title = 1.5                   # Titles important but less compound
        keywords = 2.0                # Keywords still highly relevant
        content = 1.0                 # Main content
        description = 1.0             # Meta descriptions
        abstract = 1.1                # Abstracts helpful
    }
}

๐Ÿ‡ฎ๐Ÿ‡น๐Ÿ‡ต๐Ÿ‡น๐Ÿ‡ณ๐Ÿ‡ฑ Other Languages (Italian, Portuguese, Dutch, etc.)

# Scheduler Configuration: minimumSimilarity = 0.25
plugin.tx_semanticsuggestion_suggestions.settings {
    enableStemming = 0                # No advanced stemming, but TF-IDF still helps
    proximityThreshold = 0.35         # Higher threshold (less precise without stemming)
    minTextLength = 100               # Longer text needed for TF-IDF accuracy

    analyzedFields {
        title = 1.8                   # Rely more on titles without stemming
        keywords = 2.2                # Keywords become more important
        content = 0.8                 # Content less reliable without stemming
        description = 1.0             # Standard weight
        abstract = 1.0                # Standard weight
    }
}

Note: Even without stemming, languages like Italian and Portuguese still get significant improvements from TF-IDF vectorization compared to basic word counting.

Configuration

๐ŸŽฏ NEW in v3.1: Unified Quality Configuration replaces the old complex hierarchy! Now use a single qualityLevel parameter instead of managing separate minimumSimilarity and proximityThreshold. Full backward compatibility maintained.

๐Ÿš€ Unified Configuration (v3.1+)

The new unified configuration eliminates the confusing hierarchy and dual parameters. Now you only need ONE parameter for quality control:

๐ŸŽฏ UNIFIED CONFIGURATION FLOW:
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   qualityLevel  โ”‚โ”€โ”€โ”€โ–ถโ”‚  Storage: -0.1   โ”‚โ”€โ”€โ”€โ–ถโ”‚  Display: same   โ”‚
โ”‚   (0.1 - 1.0)   โ”‚    โ”‚  (broad range)   โ”‚    โ”‚  (quality)       โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
    ONE PARAMETER           AUTOMATIC            USER SEES QUALITY
    TO MAINTAIN             OPTIMIZATION         SUGGESTIONS

Benefits:

  • โœ… Single parameter instead of complex hierarchy
  • โœ… Automatic optimization (storage = qualityLevel - 0.1, display = qualityLevel)
  • โœ… No more conflicts between Scheduler and TypoScript
  • โœ… Backward compatibility with legacy configurations
  • โœ… Self-explanatory values (higher = more selective)

Legacy Configuration Hierarchy (v3.0 and earlier)

โš ๏ธ DEPRECATED: This section documents the old system for reference. Use unified qualityLevel instead!

Click to view legacy configuration details

Understanding the configuration hierarchy is critical for proper setup. The extension uses a two-tier system where Scheduler settings always take precedence over TypoScript settings:

๐Ÿ”„ CONFIGURATION FLOW:
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   Scheduler     โ”‚โ”€โ”€โ”€โ–ถโ”‚    Database      โ”‚โ”€โ”€โ”€โ–ถโ”‚   TypoScript    โ”‚โ”€โ”€โ”€โ–ถโ”‚   Frontend       โ”‚
โ”‚   Settings      โ”‚    โ”‚    Storage       โ”‚    โ”‚   Settings      โ”‚    โ”‚   Display        โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
     LEVEL 1                 LEVEL 2                LEVEL 3               LEVEL 4
   (Analysis &              (Persistent            (Display &             (User Sees
    Storage)                 Data)                 Filtering)             Results)

๐ŸŽฏ Level 1: Scheduler Configuration (Master Control)

  • Controls: What gets analyzed and stored in database
  • Authority: Absolute - cannot be overridden by TypoScript
  • Key Settings:
    • startPageId: Defines analysis scope
    • excludePages: Pages never analyzed (permanent exclusion)
    • minimumSimilarity: Minimum score to store in database
    • recursiveExclusion: How exclusions are applied

๐Ÿ” Level 2: Database Storage (Persistent Data)

  • Contains: Only similarities โ‰ฅ minimumSimilarity from Scheduler
  • Source: Generated by Scheduler task execution
  • Limitation: TypoScript cannot access data that was never stored

๐ŸŽจ Level 3: TypoScript Configuration (Display Filter)

  • Controls: What gets displayed from existing database data
  • Limitation: Can only filter/limit existing data, cannot create new data
  • Key Settings:
    • proximityThreshold: Minimum score to display (must be โ‰ฅ Scheduler minimumSimilarity)
    • maxSuggestions: Limit displayed results
    • excludePages: Pages excluded from display only (still analyzed/stored)

๐Ÿ‘๏ธ Level 4: Frontend Display (Final Output)

  • Shows: Final filtered and limited results
  • Source: Database data filtered by TypoScript settings

โš–๏ธ Priority Rules

  1. Scheduler ALWAYS wins:

    Scheduler excludePages = "42,56"
    TypoScript excludePages = "" (empty)
    โ†’ Pages 42,56 will NEVER appear (not analyzed at all)
    
  2. TypoScript cannot override Scheduler thresholds:

    Scheduler minimumSimilarity = 0.5
    TypoScript proximityThreshold = 0.3
    โ†’ Impossible! No similarity < 0.5 exists in database
    
  3. TypoScript can only be MORE restrictive:

    Scheduler minimumSimilarity = 0.3 โœ…
    TypoScript proximityThreshold = 0.5 โœ…
    โ†’ Works: Shows only similarities โ‰ฅ 0.5 from stored data โ‰ฅ 0.3
    

Scheduler Task Configuration

๐Ÿ”ง CRITICAL: This is the primary configuration that controls what gets analyzed and stored in the database. TypoScript cannot override these settings!

Create a "Semantic Suggestion: Generate Similarities" task in the TYPO3 Scheduler module with these settings:

โš ๏ธ WARNING: Choose your minimumSimilarity carefully - it cannot be lowered retroactively without re-running the entire analysis!

  • startPageId (required): The UID of the root page from which the analysis will begin. This defines the scope of the analysis for this task run. Each task execution is linked to a Start Page ID (stored as root_page_id in the DB).

    • Example: 1 (for site root page)
  • excludePages (optional): Comma-separated list of page UIDs that will not be analyzed, and their similarities will not be stored.

    • Example: 42,56,78
  • minimumSimilarity (required): Threshold (0.0 to 1.0) below which a pair of similar pages will not be saved to the database. This controls storage efficiency.

    • Example: 0.3 (saves only pairs with similarity โ‰ฅ 30%)

Recommended scheduling:

  • Frequency: Daily or weekly
  • Execution time: During off-peak hours (e.g., 2:00 AM)

TypoScript Settings

๐ŸŽจ DISPLAY FILTER: These settings control the frontend display and the analysis algorithm details. They can only filter/limit what was already stored by the Scheduler.

โŒ CONSTRAINT: proximityThreshold MUST be โ‰ฅ Scheduler minimumSimilarity (otherwise no suggestions will display)

Define them in your TypoScript Setup file under plugin.tx_semanticsuggestion_suggestions.settings.

Constants (constants.typoscript)

plugin.tx_semanticsuggestion_suggestions.settings {
    # --- Frontend Display Settings ---
    # โš ๏ธ IMPORTANT: Must be โ‰ฅ Scheduler minimumSimilarity
    proximityThreshold = 0.25    # TF-IDF optimized threshold (was 0.5 in v2.x)
    maxSuggestions = 3           # Maximum number of suggestions to display
    excerptLength = 100          # Max length of the text excerpt
    excludePages =               # Pages to exclude from DISPLAY only (comma-separated UIDs)

    # --- Analysis Algorithm Settings (Used by Scheduler task) ---
    recencyWeight = 0.2          # Weight of recency in the final score (0.0 to 1.0)

    # Fields analyzed and their weights (TF-IDF enhanced)
    analyzedFields {
        title = 1.5              # Page titles are highly relevant
        description = 1.0        # Meta descriptions
        keywords = 2.0           # Explicit keywords have high weight
        abstract = 1.2           # Page abstracts/summaries
        content = 1.0            # Main page content (can be noisy)
    }

    # --- NLP & Language Settings (v3.0+) ---
    enableStemming = 1           # Enable advanced stemming (especially for German)
    defaultLanguage = en         # Fallback language
    minTextLength = 50           # Minimum text length for analysis
    confidenceThreshold = 0.3    # Language detection confidence threshold

    # --- Debugging ---
    debugMode = 0                # Enable debug logs (0 or 1)
}

Setup (setup.typoscript)

# Main plugin configuration
plugin.tx_semanticsuggestion_suggestions {
    settings {
        proximityThreshold = {$plugin.tx_semanticsuggestion_suggestions.settings.proximityThreshold}
        maxSuggestions = {$plugin.tx_semanticsuggestion_suggestions.settings.maxSuggestions}
        excludePages = {$plugin.tx_semanticsuggestion_suggestions.settings.excludePages}
        excerptLength = {$plugin.tx_semanticsuggestion_suggestions.settings.excerptLength}
        recencyWeight = {$plugin.tx_semanticsuggestion_suggestions.settings.recencyWeight}
        debugMode = {$plugin.tx_semanticsuggestion_suggestions.settings.debugMode}
        
        analyzedFields {
            title = {$plugin.tx_semanticsuggestion_suggestions.settings.analyzedFields.title}
            description = {$plugin.tx_semanticsuggestion_suggestions.settings.analyzedFields.description}
            keywords = {$plugin.tx_semanticsuggestion_suggestions.settings.analyzedFields.keywords}
            abstract = {$plugin.tx_semanticsuggestion_suggestions.settings.analyzedFields.abstract}
            content = {$plugin.tx_semanticsuggestion_suggestions.settings.analyzedFields.content}
        }
    }
    view {
        # Paths to your Fluid templates if you wish to customize them
        templateRootPaths.10 = EXT:your_extension/Resources/Private/Templates/
        partialRootPaths.10 = EXT:your_extension/Resources/Private/Partials/
        layoutRootPaths.10 = EXT:your_extension/Resources/Private/Layouts/
    }
}

# Reusable TypoScript object
lib.semantic_suggestion = USER
lib.semantic_suggestion {
    userFunc = TYPO3\CMS\Extbase\Core\Bootstrap->run
    extensionName = SemanticSuggestion
    pluginName = Suggestions
    vendorName = TalanHdf
    
    view =< plugin.tx_semanticsuggestion_suggestions.view
    persistence =< plugin.tx_semanticsuggestion_suggestions.persistence
    settings =< plugin.tx_semanticsuggestion_suggestions.settings
}

# Content element integration
tt_content.list.20.semanticsuggestion_suggestions =< plugin.tx_semanticsuggestion_suggestions

NLP Configuration

Advanced Language & Text Processing Settings

plugin.tx_semanticsuggestion_suggestions {
    settings {
        # ๐Ÿง  NLP Features (NEW in v3.0)
        enableStemming = 1               # Enable advanced stemming (German compound words)
        defaultLanguage = en             # Fallback language
        minTextLength = 50               # Minimum text length for analysis
        confidenceThreshold = 0.3        # Language detection confidence threshold
        
        # ๐ŸŒ Language Mapping (TYPO3 12/13)
        languageMapping {
            0 = en    # Language UID 0 โ†’ English
            1 = fr    # Language UID 1 โ†’ French  
            2 = de    # Language UID 2 โ†’ German
            3 = es    # Language UID 3 โ†’ Spanish
            4 = it    # Language UID 4 โ†’ Italian
            5 = pt    # Language UID 5 โ†’ Portuguese
        }
        
        # ๐Ÿ‡ฉ๐Ÿ‡ช German Language Optimization
        proximityThreshold = 0.25        # Lower threshold for TF-IDF (more sensitive)
        
        # ๐Ÿ“Š TF-IDF Algorithm Settings  
        analyzedFields {
            title = 1.5      # German titles often contain key compound words
            keywords = 2.0   # German keywords are highly indicative
            content = 1.0    # Base content weight
            description = 1.0
            abstract = 1.2
        }
    }
}

Multilingual Site Example Configuration

For a German/English bilingual site:

# constants.typoscript
plugin.tx_semanticsuggestion_suggestions.settings {
    # ๐Ÿ‡ฉ๐Ÿ‡ช German language optimization
    enableStemming = 1                    # Critical for compound words
    proximityThreshold = 0.25             # Lower threshold for German compound matching
    confidenceThreshold = 0.3

    # Language detection mapping
    languageMapping {
        0 = en                           # TYPO3 language UID 0 = English
        1 = de                           # TYPO3 language UID 1 = German
    }

    # German-optimized field weights
    analyzedFields {
        title = 2.0                      # German titles contain key compounds
        keywords = 2.5                   # German keywords are very specific
        content = 1.0                    # Standard content weight
    }
}

# IMPORTANT: Create separate Scheduler tasks:
# Task 1: English content (startPageId = 1, minimumSimilarity = 0.3)
# Task 2: German content (startPageId = 10, minimumSimilarity = 0.25)

Configuration Interaction

  • Analysis Scope: Defined by the Scheduler task's startPageId.
  • DB Storage: Controlled by the Scheduler task's minimumSimilarity and excludePages.
  • Similarity Calculation: Performed by the PageAnalysisService (called by the Scheduler task), which uses the TypoScript settings analyzedFields and recencyWeight.
  • Frontend Display: Reads from the DB and filters/limits based on the TypoScript settings proximityThreshold, maxSuggestions, excludePages.
  • Backend Display: Reads from the DB (based on the selected root_page_id) and filters based on the TypoScript proximityThreshold.

Key Points:

  • The proximityThreshold (TypoScript) cannot display suggestions with a score lower than the minimumSimilarity (Scheduler) because they were not saved. For the TypoScript setting to be effective, it must be โ‰ฅ the Scheduler threshold.
  • A page excluded in the Scheduler will never be analyzed/stored. A page excluded only in TypoScript will be analyzed/stored (if not excluded in Scheduler) but not displayed. It's often simpler to keep the excludePages lists synchronized.
  • You can create multiple Scheduler tasks with different startPageId values to analyze different sections of the site.

Unified Configuration Setup

1. Scheduler Task Configuration (Simplified)

Create a "Semantic Suggestion: Generate Similarities" task with these settings:

  • startPageId (required): Root page UID for analysis scope
  • qualityLevel (required): Quality threshold (0.1-1.0) for suggestions
    • Storage: Automatically set to qualityLevel - 0.1 (broad data collection)
    • Display: Uses qualityLevel directly (quality suggestions to users)
  • excludePages (optional): Pages to exclude from both analysis and display
  • recursiveExclusion (optional): Apply exclusions recursively

Recommended Quality Levels:

๐Ÿ‡ฉ๐Ÿ‡ช German sites: 0.25 (compound word optimization)
๐ŸŒ Standard sites: 0.30 (balanced quality/quantity)
๐Ÿ“š Content sites: 0.35 (higher quality threshold)
๐Ÿ’Ž Premium sites: 0.40 (very selective suggestions)

2. TypoScript Configuration (Simplified)

plugin.tx_semanticsuggestion_suggestions {
    settings {
        # ๐ŸŽฏ UNIFIED QUALITY CONTROL
        qualityLevel = 0.3              # Single parameter for all quality control

        # Display Settings
        maxSuggestions = 3              # Number of suggestions to show
        excludePages =                  # Additional pages to exclude from display
        excerptLength = 100             # Text excerpt length

        # NLP Settings (v3.0+)
        enableStemming = 1              # Enable advanced text processing
        defaultLanguage = en            # Fallback language
        debugMode = 0                   # Debug logging
    }
}

Configuration Examples

Basic Setup (Most Common)

# Single quality level controls everything
qualityLevel = 0.3

# Result:
# - Storage threshold: 0.2 (collects broad range)
# - Display threshold: 0.3 (shows quality suggestions)
# - No configuration conflicts possible

German Site Optimization

# Lower threshold for German compound words
qualityLevel = 0.25
enableStemming = 1

# Result optimized for compound words like:
# "Automobil" โ†” "Automobilindustrie"

High-Quality Site

# Higher threshold for premium content
qualityLevel = 0.4

# Result:
# - Storage threshold: 0.3 (good range)
# - Display threshold: 0.4 (only excellent suggestions)

Migration from Legacy Configuration

Automatic Migration

The extension automatically migrates old configurations:

Legacy v3.0:
  Scheduler: minimumSimilarity = 0.4
  TypoScript: proximityThreshold = 0.5

Auto-migrated to v3.1:
  qualityLevel = 0.5 (migrated from proximityThreshold)
  Internal storage = 0.4 (preserved)

Manual Migration Guide

  1. Identify your old proximityThreshold from TypoScript
  2. Set qualityLevel to that value
  3. Remove old parameters:
    • Delete proximityThreshold from TypoScript
    • Delete minimumSimilarity from Scheduler (auto-computed)
    • Merge duplicate excludePages lists

Example Migration:

# OLD v3.0 configuration
plugin.tx_semanticsuggestion_suggestions.settings {
    proximityThreshold = 0.35          # OLD: Display threshold
    excludePages = 42,56,78            # OLD: Display exclusions only
}

# NEW v3.1 configuration
plugin.tx_semanticsuggestion_suggestions.settings {
    qualityLevel = 0.35               # NEW: Unified quality control
    excludePages = 42,56,78           # NEW: Unified exclusions (analysis + display)
}

# Scheduler task OLD: minimumSimilarity = 0.25, excludePages = ""
# Scheduler task NEW: qualityLevel = 0.35 (automatically computes storage = 0.25)

Legacy Configuration Constraints & Common Pitfalls

Understanding these technical constraints will save you hours of debugging:

๐Ÿšซ Critical Constraints

1. Threshold Hierarchy Rule
โŒ WRONG Configuration:
Scheduler: minimumSimilarity = 0.7
TypoScript: proximityThreshold = 0.3
โ†’ Result: NO suggestions displayed (none stored below 0.7)

โœ… CORRECT Configuration:
Scheduler: minimumSimilarity = 0.3
TypoScript: proximityThreshold = 0.7
โ†’ Result: Shows high-quality suggestions from broader stored data

Rule: proximityThreshold โ‰ฅ minimumSimilarity (or equal)

2. Exclusion Scope Difference
โŒ PROBLEMATIC:
Scheduler: excludePages = "" (empty)
TypoScript: excludePages = "42,56,78"
โ†’ Result: Pages 42,56,78 are analyzed/stored but never displayed (wasted processing)

โœ… EFFICIENT:
Scheduler: excludePages = "42,56,78"
TypoScript: excludePages = "" (empty or same list)
โ†’ Result: Pages 42,56,78 are never processed (faster, cleaner)
3. TF-IDF Score Ranges (NEW in v3.0)
โš ๏ธ TF-IDF produces LOWER scores than v2.x cosine similarity:
v2.x typical range: 0.3-0.9
v3.0 TF-IDF range: 0.05-0.4

โŒ Legacy Configuration:
minimumSimilarity = 0.8  โ†’ NO results with TF-IDF

โœ… TF-IDF Optimized:
minimumSimilarity = 0.15  โ†’ Good range for TF-IDF
proximityThreshold = 0.25  โ†’ Quality suggestions

๐Ÿ“Š Language-Specific Constraints

German Language (Compound Words)
๐Ÿ‡ฉ๐Ÿ‡ช German sites need LOWER thresholds due to compound word stemming:

โŒ Standard Configuration:
proximityThreshold = 0.5  โ†’ Misses compound relationships

โœ… German Optimized:
minimumSimilarity = 0.15
proximityThreshold = 0.25
enableStemming = 1
โ†’ Captures "Automobil" โ†” "Automobilindustrie" relationships
Multi-language Sites
โš ๏ธ CONSTRAINT: Each language needs separate analysis

โŒ Single Task Configuration:
Task 1: startPageId = 1 (analyzing both English + German pages)
โ†’ Result: Language mixing, poor similarity quality

โœ… Multi-Task Configuration:
Task 1: startPageId = 1 (English root) - minimumSimilarity = 0.3
Task 2: startPageId = 10 (German root) - minimumSimilarity = 0.25
โ†’ Result: Optimized per-language analysis

โšก Performance Constraints

Large Site Thresholds
Sites with >500 pages:

โŒ Permissive Configuration:
minimumSimilarity = 0.05  โ†’ Database bloat (millions of records)
maxSuggestions = 10  โ†’ Slow frontend queries

โœ… Performance Optimized:
minimumSimilarity = 0.25  โ†’ Quality storage
proximityThreshold = 0.4  โ†’ Fast display
maxSuggestions = 3  โ†’ Quick queries

๐Ÿ”ง Validation Rules Checklist

Before deploying, verify these constraints:

  • Threshold Check: proximityThreshold โ‰ฅ minimumSimilarity
  • Exclusion Sync: Scheduler excludePages includes all TypoScript exclusions
  • TF-IDF Adjustment: Thresholds lowered from v2.x values (โ‰ค 0.4 typically)
  • Language Separation: Each language has its own Scheduler task
  • Performance Test: Task execution time acceptable for your server
  • Storage Monitoring: Database table tx_semanticsuggestion_similarities size reasonable

๐Ÿšจ Warning Signs of Misconfiguration

Watch for these symptoms:

Symptom Likely Cause Solution
Zero suggestions displayed proximityThreshold > all stored similarities Lower proximityThreshold or minimumSimilarity
Poor suggestion quality Threshold too low Raise proximityThreshold
Missing expected pages Pages excluded in Scheduler Check excludePages settings
Scheduler timeouts Large site with low threshold Raise minimumSimilarity
Mixed language results Single task for multilingual site Create per-language tasks

Usage (Frontend)

Integrate the plugin into your Fluid templates to display suggestions:

<f:cObject typoscriptObjectPath='lib.semantic_suggestion' />

Or include it directly in TypoScript:

# Include on a page
page.10 =< lib.semantic_suggestion

# Or in a content element
lib.myContent = COA
lib.myContent {
    10 =< lib.semantic_suggestion
}

The plugin will read relevant suggestions for the current page from the database, applying filters defined in the TypoScript settings (proximityThreshold, maxSuggestions, excludePages).

Backend Module

Backend Module

A backend module ("Semantic Suggestion" under "Web") allows visualizing the results of the analyses stored in the database.

Features

  • Analysis Selection: Choose which analysis to view (based on the startPageId / root_page_id of executed Scheduler tasks).
  • Detailed Statistics: Most similar pairs, score distribution, pages with the most links, language statistics.
  • Configuration Overview: Reminder of the main parameters used (display threshold, etc.).
  • Performance Metrics: Module load time, number of stored pairs for the selected analysis.

Scheduler Task

The "Semantic Suggestion: Generate Similarities" task is essential for the extension's operation.

  • Role: Calculates similarities between pages (using PageAnalysisService) and saves relevant results (above the minimumSimilarity threshold) to the tx_semanticsuggestion_similarities table.
  • Configuration: Set the startPageId, excludePages, and minimumSimilarity via the Scheduler interface.
  • Frequency: Schedule its execution regularly (e.g., daily, weekly) during off-peak hours to keep suggestions up-to-date without impacting site performance.

Similarity Logic (TF-IDF Enhanced)

๐Ÿ”„ Processing Pipeline

  1. ๐ŸŽฏ Language Detection (NEW in v3.0)

    Text Input โ†’ TYPO3 Site Context โ†’ Content Analysis โ†’ Language: "de"
    
  2. ๐Ÿ“ Advanced Text Processing (via nlp_tools)

    Raw Text โ†’ Stop Words Removal โ†’ Stemming (German) โ†’ Clean Tokens
    
    Example:
    "Die deutschen Automobilhersteller" 
    โ†’ "deutsch automobilherstell" (stemmed)
    
  3. ๐Ÿ”ข TF-IDF Vectorization (NEW in v3.0)

    [Text1, Text2] โ†’ TF-IDF Vectors โ†’ Cosine Similarity Score
    
    Traditional: Simple word counting
    TF-IDF: Professional semantic analysis
    
  4. ๐Ÿ’พ Smart Storage & Display

    Similarity Score + Recency Boost โ†’ Database โ†’ Frontend Filter
    Scheduler threshold: 0.3 โ†’ Display threshold: 0.5
    

๐Ÿ“Š Algorithm Improvements

Before (v2.x): Basic Cosine Similarity

// Simple word frequency comparison
$similarity = $dotProduct / ($magnitude1 * $magnitude2);

After (v3.0): TF-IDF + NLP

// Professional semantic analysis
$tfidfResult = $this->textVectorizer->createTfIdfVectors([$text1, $text2], $language);
$similarity = $this->textVectorizer->cosineSimilarity($vector1, $vector2);

// + German stemming, stop word removal, confidence scoring

Performance Impact: 30-50% better accuracy, especially for German compound words.

Multilingual Sites Setup

๐ŸŒ TYPO3 12/13 Site Configuration

The extension automatically detects languages from your TYPO3 site configuration:

# config/sites/main/config.yaml
languages:
  - 
    languageId: 0
    title: 'English'
    hreflang: 'en-US'
    locale: 'en_US.UTF-8'    # โ†’ Auto-detected as "en"
    iso-639-1: 'en'
  -
    languageId: 1
    title: 'Deutsch'  
    hreflang: 'de-DE'
    locale: 'de_DE.UTF-8'    # โ†’ Auto-detected as "de"  
    iso-639-1: 'de'
  -
    languageId: 2
    title: 'Franรงais'
    hreflang: 'fr-FR' 
    locale: 'fr_FR.UTF-8'    # โ†’ Auto-detected as "fr"
    iso-639-1: 'fr'

๐ŸŽฏ Per-Language Scheduler Tasks

For optimal performance, create separate Scheduler tasks for each language:

Task 1: "Generate Similarities - English"
- startPageId: 1 (English root)
- minimumSimilarity: 0.3

Task 2: "Generate Similarities - German" 
- startPageId: 2 (German root)
- minimumSimilarity: 0.25  # Lower for German (compound words)

Task 3: "Generate Similarities - French"
- startPageId: 3 (French root) 
- minimumSimilarity: 0.3

๐Ÿ”ง Language-Specific Tuning

German Language Optimization

plugin.tx_semanticsuggestion_suggestions.settings {
    # German compound words need lower thresholds
    proximityThreshold = 0.25
    
    # Enable stemming for better compound word detection
    enableStemming = 1
    
    # German titles are often highly descriptive
    analyzedFields {
        title = 2.0      # Higher weight for German titles
        keywords = 2.5   # German keywords are very specific
    }
}

๐Ÿšจ Troubleshooting Multilingual Sites

Problem: Language always detected as "en"

Solution: Check your TypoScript language mapping:

plugin.tx_semanticsuggestion_suggestions.settings {
    languageMapping {
        0 = en
        1 = de    # Make sure this matches your site language UID
        2 = fr
    }
}

Problem: Mixed language content not detected properly

Enable debug mode to see detection process:

plugin.tx_semanticsuggestion_suggestions.settings {
    debugMode = 1
}

Check logs for:

Language detected via nlp_tools: de
TF-IDF similarity calculated: language=de, vocabularySize=156
Confidence verification: firstScore=0.85, secondScore=0.45, using content analysis

Display Customization

Modify the appearance of suggestions by overriding the plugin's Fluid template (List.html). Configure the paths to your custom templates in TypoScript (see Configuration section).

Debugging & Performance

๐Ÿ” Debug Mode

Enable comprehensive logging to troubleshoot language detection and similarity calculation:

plugin.tx_semanticsuggestion_suggestions.settings {
    debugMode = 1
}

Debug logs show:

[INFO] Language detected via nlp_tools: de
[DEBUG] TF-IDF similarity calculated: page1=123, page2=456, similarity=0.75, vocabularySize=245
[DEBUG] German stemming applied: "Automobilindustrie" โ†’ "automobilindustr"
[DEBUG] Confidence verification: using TYPO3 context due to low confidence difference
[ERROR] Failed to create TF-IDF vectors: text too short, using fallback calculation

โšก Performance Monitoring

Backend Module shows performance metrics:

  • TF-IDF processing time per page pair
  • Language detection accuracy
  • Vocabulary size per language
  • Cache hit rates for stemming/vectorization

Scheduler Task execution time:

  • Small sites (<100 pages): ~10-30 seconds
  • Medium sites (100-500 pages): ~1-5 minutes
  • Large sites (>500 pages): ~5-30 minutes

๐Ÿš€ Performance Optimization

Recommended Settings for Large Sites (>500 pages)

# Scheduler Configuration: minimumSimilarity = 0.3 (storage optimization)
plugin.tx_semanticsuggestion_suggestions.settings {
    # Performance optimizations
    minTextLength = 100               # Skip short content (faster processing)
    proximityThreshold = 0.4          # Higher display threshold (faster queries)
    maxSuggestions = 3                # Limit results (faster rendering)

    # Selective field analysis (reduce processing time)
    analyzedFields {
        title = 2.0                   # Titles are fast to process
        keywords = 2.0                # Keywords are lightweight
        content = 0.5                 # Reduce content weight (can be slow)
        description = 1.0             # Meta descriptions are fast
        abstract = 0                  # Disable if not commonly used
    }

    # Conservative language settings
    confidenceThreshold = 0.4         # Higher confidence reduces processing
    enableStemming = 1                # Keep enabled (cached results)
}

# SCHEDULER OPTIMIZATION for large sites:
# Split into multiple tasks:
# Task 1: Pages 1-100 (daily)
# Task 2: Pages 101-200 (every 2 days)
# Task 3: Pages 201+ (weekly)

Cache Configuration

Ensure TYPO3 cache is properly configured for optimal performance:

# Clear cache after configuration changes
./vendor/bin/typo3 cache:flush

# Monitor cache effectiveness
./vendor/bin/typo3 cache:listGroups

Configuration Troubleshooting

๐Ÿ” Step-by-Step Diagnosis

When suggestions aren't working as expected, follow this systematic approach:

Step 1: Verify Scheduler Task Execution

# Check if task has run successfully
./vendor/bin/typo3 scheduler:run

# Check TYPO3 logs for task errors
tail -f var/log/typo3_*.log | grep -i semantic

Expected Output:

[INFO] Starting similarity generation task, startPageId: 1, minimumSimilarity: 0.3
[INFO] Similarity generation task completed successfully

Step 2: Verify Database Content

-- Check if similarities are stored
SELECT COUNT(*) as total_similarities
FROM tx_semanticsuggestion_similarities;

-- Check score distribution
SELECT
    ROUND(similarity_score, 1) as score_range,
    COUNT(*) as count,
    sys_language_uid
FROM tx_semanticsuggestion_similarities
GROUP BY ROUND(similarity_score, 1), sys_language_uid
ORDER BY score_range DESC;

Healthy Output Example:

score_range | count | sys_language_uid
0.4         | 15    | 0
0.3         | 42    | 0
0.2         | 128   | 0
0.1         | 203   | 0

Step 3: Test Configuration Hierarchy

# Enable debug mode temporarily
plugin.tx_semanticsuggestion_suggestions.settings {
    debugMode = 1
}

Check Debug Logs:

tail -f typo3temp/logs/semantic_suggestion.log

๐Ÿšจ Common Problems & Solutions

Problem 1: "No suggestions displayed anywhere"

Symptoms:

  • Frontend shows empty suggestions list
  • Backend module shows 0 similar pairs

Diagnosis Checklist:

โœ“ Scheduler task executed successfully?
โœ“ Database contains similarities?
โœ“ proximityThreshold โ‰ค stored similarities?
โœ“ Current page has stored similarities?

Solutions:

  1. Threshold Too High

    โŒ Current: proximityThreshold = 0.8
    โœ… Fix: proximityThreshold = 0.3 (or lower)
    
    # Or check what's actually in database:
    SELECT MAX(similarity_score) FROM tx_semanticsuggestion_similarities;
    
  2. Scheduler Never Ran

    # Manual execution
    ./vendor/bin/typo3 scheduler:run <task_id>
    
    # Check task configuration
    SELECT * FROM tx_scheduler_task WHERE classname LIKE '%Similarities%';
  3. Wrong Root Page

    โŒ Current: startPageId = 1, viewing page = 42
    โœ… Fix: startPageId = 1, ensure page 42 is child of page 1
    
    # Verify page tree relationship
    SELECT pid, title FROM pages WHERE uid = 42;
    

Problem 2: "Suggestions displayed but poor quality"

Symptoms:

  • Suggestions shown but irrelevant
  • Mixed languages in suggestions
  • Very low similarity scores

Solutions:

  1. TF-IDF Score Adjustment (v3.0+)

    โŒ Legacy: proximityThreshold = 0.7
    โœ… TF-IDF: proximityThreshold = 0.25
    
    # TF-IDF scores are naturally lower
    
  2. Language Separation Required

    โŒ Single task: Pages 1-100 (mixed EN/DE content)
    โœ… Multi-task:
    - Task 1: English pages (1-50)
    - Task 2: German pages (51-100)
    
  3. Field Weight Optimization

    # Increase weight of reliable fields
    analyzedFields {
        title = 2.0        # Titles are usually accurate
        keywords = 2.5     # Keywords are intentional
        content = 0.5      # Content can be noisy
    }
    

Problem 3: "Missing expected pages in suggestions"

Symptoms:

  • Page A should suggest Page B (they're clearly related)
  • Page B exists in database but not suggested to Page A

Diagnosis:

-- Check if relationship exists in database
SELECT similarity_score
FROM tx_semanticsuggestion_similarities
WHERE page_id = A AND similar_page_id = B;

-- Check if excluded somewhere
SELECT exclude_pages FROM tx_scheduler_task WHERE classname LIKE '%Similarities%';

Solutions:

  1. Page Excluded in Scheduler

    โŒ Scheduler excludePages = "42,56,78" (contains Page B)
    โœ… Remove Page B from exclusions, re-run Scheduler
    
  2. Similarity Below Threshold

    -- Find actual similarity score
    SELECT similarity_score FROM tx_semanticsuggestion_similarities
    WHERE page_id = A AND similar_page_id = B;
    
    -- If score = 0.22 but proximityThreshold = 0.3
    -- Lower the threshold or improve content similarity
  3. Text Content Insufficient

    Check if Page A or B has minimal text content:
    - Minimum 50 characters required
    - Pure image pages won't generate similarities
    - Check 'minTextLength' setting
    

Problem 4: "Scheduler task timeouts or fails"

Symptoms:

  • Task shows "Failed" status
  • PHP timeout errors in logs
  • Task takes >5 minutes

Solutions:

  1. Increase Processing Limits

    # In Scheduler task or php.ini
    ini_set('max_execution_time', 300);  // 5 minutes
    ini_set('memory_limit', '512M');
  2. Reduce Analysis Scope

    โŒ Current: startPageId = 1 (1000+ pages)
    โœ… Split:
    - Task 1: startPageId = 1 (pages 1-100)
    - Task 2: startPageId = 101 (pages 101-200)
    
  3. Increase Threshold

    โŒ Current: minimumSimilarity = 0.05 (stores everything)
    โœ… Optimized: minimumSimilarity = 0.25 (quality only)
    

Problem 5: "Mixed language suggestions"

Symptoms:

  • English page suggests German pages
  • Suggestions ignore language boundaries

Solutions:

  1. Enable Language Mapping

    plugin.tx_semanticsuggestion_suggestions.settings {
        languageMapping {
            0 = en
            1 = de
            2 = fr
        }
    }
    
  2. Check Site Configuration

    # site/config.yaml should have proper locales
    languages:
      - languageId: 0
        locale: 'en_US.UTF-8'  # โ† Must be properly formatted
      - languageId: 1
        locale: 'de_DE.UTF-8'  # โ† Must be properly formatted
  3. Separate Scheduler Tasks

    Instead of: One task analyzing mixed language tree
    Use: One task per language branch
    

๐Ÿ”ง Quick Fix Commands

# Clear all caches
./vendor/bin/typo3 cache:flush

# Re-run all scheduler tasks
./vendor/bin/typo3 scheduler:run

# Check database table size
echo "SELECT COUNT(*) FROM tx_semanticsuggestion_similarities;" | mysql your_db

# Reset extension configuration (emergency)
./vendor/bin/typo3 configuration:remove --path="EXTENSIONS/semantic_suggestion"
./vendor/bin/typo3 extension:deactivate semantic_suggestion
./vendor/bin/typo3 extension:activate semantic_suggestion

๐Ÿ“‹ Pre-Deployment Checklist

Before going live, verify:

  • Scheduler Tasks: All tasks run successfully without timeouts
  • Database Check: tx_semanticsuggestion_similarities contains expected data
  • Threshold Validation: proximityThreshold โ‰ฅ minimumSimilarity
  • Language Testing: Each language shows appropriate suggestions
  • Performance Test: Frontend loads suggestions in <200ms
  • Content Quality: Manual review of suggestion relevance
  • Exclusion Review: All intentionally excluded pages work correctly

๐Ÿ›ก๏ธ Final Configuration Validation Checklist

Use this comprehensive checklist to ensure your configuration is optimal:

โœ… Scheduler Configuration Validation

  • startPageId exists and is accessible: SELECT title FROM pages WHERE uid = [startPageId];
  • minimumSimilarity appropriate for TF-IDF: Between 0.1 and 0.4 (not v2.x values like 0.8)
  • excludePages list verified: All UIDs exist and are intentionally excluded
  • Task execution successful: Check task history and logs for errors
  • Multilingual separation: Each language has its own task (recommended)

โœ… TypoScript Configuration Validation

  • Threshold hierarchy respected: proximityThreshold โ‰ฅ minimumSimilarity
  • TF-IDF thresholds updated: Not using v2.x legacy values (>0.5)
  • Language settings match site: languageMapping corresponds to TYPO3 language UIDs
  • Field weights optimized: Higher weights for title/keywords, lower for content
  • Performance settings: maxSuggestions and minTextLength appropriate for site size

โœ… Database & Storage Validation

-- Verify data exists and has reasonable distribution
SELECT
    sys_language_uid,
    MIN(similarity_score) as min_score,
    MAX(similarity_score) as max_score,
    AVG(similarity_score) as avg_score,
    COUNT(*) as total_pairs
FROM tx_semanticsuggestion_similarities
GROUP BY sys_language_uid;

-- Check for unexpected language mixing
SELECT DISTINCT root_page_id, sys_language_uid, COUNT(*) as pairs
FROM tx_semanticsuggestion_similarities
GROUP BY root_page_id, sys_language_uid;

โœ… Frontend Integration Validation

  • Plugin displays correctly: <f:cObject typoscriptObjectPath='lib.semantic_suggestion' /> works
  • Suggestions appear on pages: Test on multiple pages with different content
  • Language separation working: German pages don't show English suggestions
  • Exclusions effective: Excluded pages don't appear in suggestions
  • Performance acceptable: Page load time impact <100ms

โœ… NLP & Language Detection Validation

  • nlp_tools dependency installed: composer show cywolf/nlp-tools
  • Stemming working (German sites): Debug logs show stemmed words
  • Language detection accurate: Pages analyzed in correct language
  • Site configuration proper: Locales formatted as de_DE.UTF-8 (not just de)

๐Ÿšจ Red Flags to Watch For

Warning Sign Quick Test Solution
Zero suggestions anywhere SELECT COUNT(*) FROM tx_semanticsuggestion_similarities; Lower thresholds or check task
Very low similarity scores (all <0.1) Check debug logs for TF-IDF failures Verify text length and language detection
Mixed language results Test DE page shows EN suggestions Separate scheduler tasks
Poor suggestion quality Manual review of top suggestions Adjust field weights or thresholds
Slow performance Frontend timing >500ms Increase thresholds or reduce maxSuggestions

๐ŸŽฏ Configuration Quality Score

Rate your setup (aim for 80%+ before going live):

Basic Setup (50 points)

  • Scheduler task runs (20 pts)
  • Database contains data (15 pts)
  • Frontend shows suggestions (15 pts)

Optimization (30 points)

  • TF-IDF thresholds optimized (10 pts)
  • Language separation implemented (10 pts)
  • Performance <200ms (10 pts)

Advanced Features (20 points)

  • German stemming active (5 pts)
  • Debug logging configured (5 pts)
  • Exclusions properly managed (5 pts)
  • Field weights customized (5 pts)

Score: ___ / 100

Target: 80+ for production deployment Minimum: 60+ for staging/testing

Migration from v2.x

๐Ÿ”„ Automatic Migration

The extension includes automatic fallback mechanisms:

  1. TF-IDF Processing: Falls back to old cosine similarity if nlp_tools fails
  2. Language Detection: Falls back to TypoScript mapping if site detection fails
  3. Text Processing: Falls back to basic stop word removal if stemming fails

๐Ÿ“‹ Migration Checklist

  1. Install nlp_tools dependency:

    composer require cywolf/nlp-tools
  2. Update TypoScript configuration (add new NLP settings):

    plugin.tx_semanticsuggestion_suggestions.settings {
        enableStemming = 1
        languageMapping {
            0 = en
            1 = de
            # ... your language mapping
        }
    }
    
  3. Clear TYPO3 cache:

    ./vendor/bin/typo3 cache:flush
  4. Regenerate similarities (run Scheduler task):

    • All existing similarities will be recalculated with TF-IDF
    • German content should see immediate improvement
    • Check debug logs to verify language detection
  5. Monitor performance:

    • Check backend module for accuracy improvements
    • Compare similarity scores before/after migration
    • Verify multilingual sites work correctly

๐Ÿ†˜ Rollback Plan

If you experience issues, you can disable advanced features:

plugin.tx_semanticsuggestion_suggestions.settings {
    enableStemming = 0           # Disable stemming
    debugMode = 1                # Enable debug logging
    minTextLength = 200          # Increase minimum text length
}

The extension will automatically fall back to v2.x behavior while maintaining database compatibility.

Contributing

Contributions are welcome! Fork the repository, create a branch, make your changes, and submit a Pull Request.

License

This project is licensed under the GNU General Public License v2.0 or later. See the LICENSE file.

Support

Contact: Wolfangel Cyril (cyril.wolfangel@gmail.com) Bugs & Features: GitHub Issues Documentation & Updates: GitHub Repository