Preparing Data for Small Language Model Training

December 5, 2025

You've chosen your small language model training approach: distillation, RAG + LoRA, or something in between. Now comes the critical phase: preparing your data. This is where the quality of your training directly determines the quality of your model's performance.

Most organizations have data scattered across multiple systems: documents in shared drives, customer records in databases, operational procedures in spreadsheets, and procedural images in various formats. The challenge is converting this diverse data into clean, structured training datasets.

This article walks through preparing common data types, using example business scenarios to show how to approach data preparation for model fine-tuning.

Step 1: Data Inventory and Assessment

Before preparing data files for language model training, take inventory of your data and assess its quality. Doing this up front helps prevent accuracy issues from surfacing in later stages of the training process.

Create a Data Inventory

List possible data sources that are relevant to the capabilities you need for your small language model. Examples include:

Documents

  • Source: Runbooks, guides, procedures, handbooks, presentations stored on shared Drives, cloud storage, or local computers
  • Format: PDF, DOCX, DOC, TXT, MD, RTF, PPTX, HTML

Customer Data

  • Source: Account info, interaction logs stored on CRM or CDP systems
  • Format: Exported JSON, CSV

Operational Data

  • Source: Transaction history, metrics, analytics, reports stored in databases, spreadsheets, or BI platforms
  • Format: Exported CSV

Email

  • Source: Client communications, internal discussions, approval chains stored in cloud email provider or email servers
  • Format: EML, MSG, plain text

Images

  • Source: Screenshots, diagrams, product photos stored on local drives or cloud storage
  • Format: PNG, JPG, JPEG, TIF, GIF
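
To kick off the inventory, a short script can walk a storage location and tally files by extension. A minimal sketch (the directory layout and example extensions are assumptions about your environment):

```python
from collections import Counter
from pathlib import Path

def inventory_files(root_dir):
    """Recursively tally files by extension to seed a data inventory."""
    counts = Counter()
    for path in Path(root_dir).rglob('*'):
        if path.is_file():
            # Normalize extensions so PDF and pdf count together
            counts[path.suffix.lower() or '(no extension)'] += 1
    return counts
```

Running this against each source location gives you a first-pass picture of formats and volumes before any extraction work begins.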

Assess Data Quality

Evaluate each data source against these criteria, and prune records that fall short:

  • Completeness: How much data is missing or blank?
  • Consistency: Are naming conventions, formats, and structures consistent?
  • Accuracy: How confident are you in the correctness of the data?
  • Relevance: Does this data actually help your model learn your domain?
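
The completeness check in particular can be automated before any LLM calls. A minimal sketch using pandas (treating empty strings as missing is an assumption about how your exports encode blanks):

```python
import pandas as pd

def completeness_report(df):
    """Return the fraction of non-missing values per column (1.0 = fully complete)."""
    # Many CSV exports encode missing values as empty strings rather than NaN
    cleaned = df.replace('', pd.NA)
    return cleaned.notna().mean().round(3).to_dict()
```

Columns scoring well below 1.0 are candidates for pruning or manual cleanup before they reach the training set.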

Step 2: Data Preparation

Identify the Fine-Tuning Format

There are several formats for fine-tuning a language model. Before extracting data from sources, it’s helpful to determine the final fine-tuning format so that data sources can be prepped accordingly. The two most common are instruction-style and question-answer pairs, and one of these will fit most use cases. The other formats below apply to more specialized tasks.

  1. Instruction-Style / Instruction-Following
  • JSONL format: { "instruction": "Summarize the following company policy:", "input": "All employees must complete security training annually.", "output": "Employees must complete annual security training." }
  • Use: Teaching the model to follow commands, summarize, rewrite, or generate text.
  2. Question-Answer Pairs
  • JSONL format: { "question": "What is the company's security training policy?", "answer": "Employees must complete security training annually." }
  • Use: Classic supervised QA fine-tuning, retrieval-augmented generation (RAG).
  3. Text-to-Text / Seq2Seq
  • JSONL format: { "input_text": "Ths is the policy: All employees must complete security training annually.", "target_text": "This is the policy: Employees must complete security training annually." }
  • Use: Any transformation task, e.g., grammar correction, code completion, summarization.
  4. Masked Language / Cloze Tasks
  • JSONL format: {"text": "All employees must complete [MASK] training annually."}
  • Use: Originally for models like BERT; can be used to fine-tune knowledge or vocabulary.
  5. Classification Labels
  • CSV format:
text,label
"The security training policy is annual.","compliance"
  • Use: Sentiment analysis, topic classification, spam detection.

Example: "text": "The service was excellent!", "label": "positive"

LoRA can fine-tune LLMs for classification by adding a classification head on top of the model's embeddings.

  6. Pairwise / Ranking Data
  • Format: { "query": "security training frequency", "positive_example": "Employees must complete security training annually.", "negative_example": "Company cafeteria hours are 8-5." }
  • Use: Training models for retrieval, ranking, or recommendation.

Example: Search query → relevant vs irrelevant documents.

  7. Multi-Turn Conversations
  • Format: { "context": [ "User: Hi, what are the security training requirements?", "Bot: All employees need annual security training.", "User: How many days do employees have to complete it?" ], "response": "Bot: Employees have 30 days from notification to complete the training." }
  • Use: Chatbots or dialogue fine-tuning.
  8. Code / Structured Data
  • Format: { "input": "Generate SQL to list employees with annual training:", "output": "SELECT * FROM employees WHERE training_frequency='annual';" }
  • Use: Code completion, SQL generation, data extraction from structured formats.

Example: Converting human description → SQL query or code snippet.
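
Whichever format you choose, it pays to validate generated files before training, since LLM output does not always parse cleanly. A minimal validator for the two most common formats (the schema names and required keys simply mirror the examples above):

```python
import json

REQUIRED_KEYS = {
    'instruction': {'instruction', 'input', 'output'},
    'qa': {'question', 'answer'},
}

def validate_jsonl(lines, fmt='instruction'):
    """Split JSONL lines into parseable records and per-line error messages."""
    required = REQUIRED_KEYS[fmt]
    valid, errors = [], []
    for i, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            errors.append(f'line {i}: invalid JSON')
            continue
        if not isinstance(record, dict):
            errors.append(f'line {i}: not a JSON object')
            continue
        missing = required - record.keys()
        if missing:
            errors.append(f'line {i}: missing keys {sorted(missing)}')
        else:
            valid.append(record)
    return valid, errors
```

Running the validator after each generation batch lets you drop malformed lines early rather than discovering them mid-training.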

Generate Training Data

To fine-tune a language model, files need to be generated using one of the formats discussed earlier. Use a frontier LLM like Claude, GPT-4, or Llama to read your data sources and generate training examples in the desired format. Here are some practical examples to guide you through what this might look like.

Data Source: HTML

Business example: A DevOps team has 50 runbooks on deploying, monitoring, and troubleshooting production systems on an internal wiki with HTML pages.

The goal is to extract text from the HTML pages and generate instruction-style synthetic training data. This code reads the text content of each wiki page, asks an LLM to generate examples, and saves them to a single JSONL file.

from bs4 import BeautifulSoup
from pathlib import Path
import anthropic  # or openai, depending on your LLM provider
import json

def extract_wiki_content(html_file_path):
    """
    Extract title and content from wiki HTML page.
    Removes navigation, footers, headers, and non-content elements.
    """
    with open(html_file_path, 'r', encoding='utf-8') as f:
        html_content = f.read()

    soup = BeautifulSoup(html_content, 'html.parser')

    # Extract title
    title = soup.find('title').get_text() if soup.find('title') else 'Untitled'

    # Remove non-content elements (navigation, sidebars, footers, scripts, styles)
    for element in soup(['script', 'style', 'nav', 'footer', 'header', 'aside']):
        element.decompose()

    # Extract main content (adjust selector based on your wiki structure)
    main_content = soup.find('main') or soup.find('article') or soup.find('body')
    text = main_content.get_text(separator='\n', strip=True)

    # Clean up whitespace and remove wiki metadata
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    cleaned_text = '\n'.join(lines)

    return title, cleaned_text

def generate_synthetic_training_data(title, content, num_examples=10):
    """
    Generate instruction-style training data using an LLM.
    Constructs prompt based on page title for contextual relevance.
    """
    # Construct prompt based on page title and content
    prompt = f"""
Read this {title} document and generate {num_examples} instruction-style
training examples in JSONL format. Each example should include:
- instruction: A command or request related to {title}
- input: Additional context or parameters (can be empty string if not needed)
- output: The expected response or action

Create diverse examples covering different aspects: explanations, prerequisites,
troubleshooting steps, verification procedures, and edge cases.

Document Title: {title}
Document Content:
{content}

Generate {num_examples} examples in JSONL format (one JSON object per line).
"""

    # Call LLM (example using Anthropic's Claude)
    client = anthropic.Anthropic(api_key="your-api-key")

    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=4000,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )

    return response.content[0].text

def process_wiki_pages(wiki_dir, output_file):
    """
    Process all wiki HTML pages: extract text and generate synthetic training data.
    This unified approach handles both extraction and generation in a single pass.
    """
    wiki_path = Path(wiki_dir)
    html_files = list(wiki_path.glob('*.html'))

    all_examples = []

    for html_file in html_files:
        print(f"Processing: {html_file.name}")

        # Step 1: Extract clean text from HTML
        title, content = extract_wiki_content(html_file)
        print(f"  Extracted: {title} ({len(content)} characters)")

        # Step 2: Generate synthetic training data immediately
        synthetic_data = generate_synthetic_training_data(title, content, num_examples=10)

        # Parse and collect examples
        for line in synthetic_data.strip().split('\n'):
            if line.strip():
                try:
                    example = json.loads(line)
                    all_examples.append(example)
                    print(f"  Generated example: {example['instruction'][:50]}...")
                except json.JSONDecodeError:
                    continue

    # Save all examples to JSONL file
    with open(output_file, 'w', encoding='utf-8') as f:
        for example in all_examples:
            f.write(json.dumps(example) + '\n')

    print(f"\nCompleted!")
    print(f"Total pages processed: {len(html_files)}")
    print(f"Total training examples generated: {len(all_examples)}")
    print(f"Saved to: {output_file}")

# Run the script
process_wiki_pages('wiki_pages/', 'training_data/devops-runbooks.jsonl')

Generated JSONL output (devops-runbooks.jsonl):

{"instruction": "Explain the database failover process", "input": "Primary database is unresponsive", "output": "Initiate failover by first confirming primary database is truly down using health check scripts. Then promote the standby replica to primary, update DNS records to point to new primary, and verify all connections are routing correctly."}
{"instruction": "List prerequisites for database failover", "input": "", "output": "Prerequisites include: configured standby replica with replication enabled, DNS or connection pooling configured for automatic failover, health monitoring scripts in place, and documented rollback procedures."}
{"instruction": "Determine if manual failover is required", "input": "Primary database response time degraded by 80%", "output": "Yes, manual failover is recommended when primary database performance degrades beyond 75% of baseline. This prevents cascading failures and maintains service availability."}
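
Synthetic generation across 50 runbooks often produces near-duplicate examples, which can skew fine-tuning. A small deduplication pass over the combined output is worth adding (the normalization rule here, lowercasing and collapsing whitespace, is a simple assumption; fuzzier matching is also possible):

```python
def dedupe_examples(examples, key='instruction'):
    """Drop examples whose key field duplicates an earlier example's, ignoring case and spacing."""
    seen = set()
    unique = []
    for example in examples:
        normalized = ' '.join(str(example.get(key, '')).lower().split())
        if normalized not in seen:
            seen.add(normalized)
            unique.append(example)
    return unique
```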

Data Source: Documents

Business example: An insurance company has 200+ documents containing claims processing and procedures in PDF and Word file formats.

The goal is to extract text from PDF and Word documents and generate question-answer pair training data. This code reads document contents, uses an LLM to create Q&A pairs, and saves them to JSONL files.

from pathlib import Path
import anthropic  # or openai, depending on your LLM provider
import json
import pypdf
from docx import Document

def extract_document_content(file_path):
    """
    Extract text content from PDF or Word documents.
    Returns filename and cleaned text content.
    """
    file_path = Path(file_path)
    filename = file_path.stem  # Get filename without extension

    # Handle PDF files
    if file_path.suffix.lower() == '.pdf':
        with open(file_path, 'rb') as f:
            pdf_reader = pypdf.PdfReader(f)
            text_parts = []
            for page in pdf_reader.pages:
                text_parts.append(page.extract_text())
            content = '\n'.join(text_parts)

    # Handle Word documents (.docx)
    elif file_path.suffix.lower() == '.docx':
        doc = Document(file_path)
        paragraphs = [para.text for para in doc.paragraphs if para.text.strip()]
        content = '\n'.join(paragraphs)

    # Handle plain text files
    elif file_path.suffix.lower() == '.txt':
        with open(file_path, 'r', encoding='utf-8') as f:
            content = f.read()

    else:
        raise ValueError(f"Unsupported file format: {file_path.suffix}")

    # Clean up whitespace
    lines = [line.strip() for line in content.split('\n') if line.strip()]
    cleaned_content = '\n'.join(lines)

    # Anonymize sensitive information
    cleaned_content = anonymize_content(cleaned_content)

    return filename, cleaned_content

def anonymize_content(text):
    """
    Replace sensitive information with placeholders.
    """
    import re

    # Replace potential customer names (simple pattern)
    # In production, use more sophisticated NER or pattern matching
    text = re.sub(r'\b[A-Z][a-z]+ [A-Z][a-z]+\b', '[CUSTOMER_NAME]', text)

    # Replace email addresses
    text = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', '[EMAIL]', text)

    # Replace phone numbers
    text = re.sub(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', '[PHONE]', text)

    # Replace account numbers (assuming 8-12 digit patterns)
    text = re.sub(r'\b\d{8,12}\b', '[ACCOUNT_ID]', text)

    # Replace addresses (simple pattern)
    text = re.sub(r'\b\d+\s+[A-Za-z\s]+(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd)\b', '[ADDRESS]', text)

    return text

def generate_qa_pairs(filename, content, num_pairs=10):
    """
    Generate question-answer pairs from document content using an LLM.
    """
    prompt = f"""
Based on this insurance document titled "{filename}", generate {num_pairs} realistic
question-answer pairs in JSONL format that customers might ask. Include variations
in how questions are phrased (formal, casual, direct, indirect). Each pair should include:
- question: A customer question about the document's topic
- answer: A clear, accurate response based on the document content
- category: The topic category (e.g., "claims_filing", "claims_timeline", "coverage", "disputes")

Create diverse questions covering different aspects of the document: processes,
timelines, requirements, edge cases, and common concerns.

Document: {filename}
Content:
{content}

Generate {num_pairs} examples in JSONL format (one JSON object per line).
"""

    # Call LLM (example using Anthropic's Claude)
    client = anthropic.Anthropic(api_key="your-api-key")

    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=4000,
        messages=[
            {"role": "user", "content": prompt}
        ]
    )

    return response.content[0].text

def process_documents(docs_dir, output_dir):
    """
    Process all documents: extract text and generate Q&A pairs.
    Saves Q&A pairs to individual JSONL files per document.
    """
    docs_path = Path(docs_dir)
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    # Find all PDF, DOCX, and TXT files
    doc_files = list(docs_path.glob('*.pdf')) + \
                list(docs_path.glob('*.docx')) + \
                list(docs_path.glob('*.txt'))

    all_qa_pairs = []

    for doc_file in doc_files:
        print(f"Processing: {doc_file.name}")

        try:
            # Step 1: Extract text from document
            filename, content = extract_document_content(doc_file)
            print(f"  Extracted: {len(content)} characters")
            print(f"  Anonymized sensitive information")

            # Step 2: Generate Q&A pairs immediately
            qa_data = generate_qa_pairs(filename, content, num_pairs=10)

            # Parse Q&A pairs
            doc_qa_pairs = []
            for line in qa_data.strip().split('\n'):
                if line.strip():
                    try:
                        qa_pair = json.loads(line)
                        doc_qa_pairs.append(qa_pair)
                        all_qa_pairs.append(qa_pair)
                        print(f"  Generated Q&A: {qa_pair['question'][:50]}...")
                    except json.JSONDecodeError:
                        continue

            # Save Q&A pairs for this document to individual file
            output_file = output_path / f"{filename}_qa_pairs.jsonl"
            with open(output_file, 'w', encoding='utf-8') as f:
                for qa in doc_qa_pairs:
                    f.write(json.dumps(qa) + '\n')

            print(f"  Saved {len(doc_qa_pairs)} Q&A pairs to {output_file.name}\n")

        except Exception as e:
            print(f"  Error processing {doc_file.name}: {e}\n")
            continue

    # Save all Q&A pairs to combined file
    combined_output = output_path / "all_qa_pairs.jsonl"
    with open(combined_output, 'w', encoding='utf-8') as f:
        for qa in all_qa_pairs:
            f.write(json.dumps(qa) + '\n')

    print(f"Completed!")
    print(f"Total documents processed: {len(doc_files)}")
    print(f"Total Q&A pairs generated: {len(all_qa_pairs)}")
    print(f"Combined output: {combined_output}")

# Run the script
process_documents('insurance_docs/', 'training_data/qa_pairs/')

Generated JSONL output (claims-processing-policy_qa_pairs.jsonl):

{"question": "How do I file an insurance claim?", "answer": "You can file a claim through three methods: online portal at claims.example.com, by calling our 24/7 claims hotline at 1-800-CLAIMS, or by visiting any local branch office. You'll need your policy number and incident details.", "category": "claims_filing"}
{"question": "What's the process for submitting a claim after an accident?", "answer": "After an accident, document the scene with photos if safe to do so, exchange information with other parties, then file your claim within 24 hours using our online portal, phone hotline, or local branch. Our team will review and contact you within 5 business days.", "category": "claims_filing"}
{"question": "How long does it take for a claim to be processed?", "answer": "Standard claims are reviewed within 5 business days of submission. Complex claims requiring investigation may take 10-15 business days. You'll receive status updates via email and can check progress in your online account.", "category": "claims_timeline"}

Data Source: Spreadsheets

Business example: A sales organization needs its language model to categorize sales reporting data saved in 100 spreadsheets. The aim is to tune the model to recognize which of 25 products each sales report belongs to. This code reads report rows from the spreadsheets, uses an LLM to classify each report and generate synthetic variations, and saves the labeled examples to CSV.

from pathlib import Path
import anthropic  # or openai, depending on your LLM provider
import pandas as pd
import json

# Define 25 product categories
PRODUCT_CATEGORIES = [
    "Enterprise_CRM", "Marketing_Automation", "Sales_Analytics",
    "Customer_Support_Platform", "Email_Marketing", "Social_Media_Management",
    "Project_Management", "Time_Tracking", "Collaboration_Suite",
    "Document_Management", "Video_Conferencing", "HR_Management",
    "Payroll_Software", "Accounting_Software", "Inventory_Management",
    "E-commerce_Platform", "Payment_Processing", "Shipping_Management",
    "Business_Intelligence", "Data_Warehouse", "ETL_Tools",
    "Security_Software", "Cloud_Storage", "Backup_Solutions", "API_Management"
]

def extract_sales_report_data(file_path):
    """
    Extract sales report data from CSV or Excel files.
    Returns structured data for classification.
    """
    file_path = Path(file_path)

    # Read file based on extension
    if file_path.suffix.lower() == '.csv':
        df = pd.read_csv(file_path)
    elif file_path.suffix.lower() in ['.xlsx', '.xls']:
        df = pd.read_excel(file_path)
    else:
        raise ValueError(f"Unsupported file format: {file_path.suffix}")

    # Clean and standardize data
    df = df.dropna(subset=['report_id'])  # Remove rows without report ID
    df = df.fillna('')  # Fill other NaN values with empty string

    # Convert to list of report dictionaries
    reports = []
    for _, row in df.iterrows():
        report = {
            'report_id': str(row.get('report_id', '')),
            'revenue': str(row.get('revenue', '')),
            'units_sold': str(row.get('units_sold', '')),
            'region': str(row.get('region', '')),
            'customer_segment': str(row.get('customer_segment', '')),
            'description': str(row.get('description', ''))
        }
        reports.append(report)

    return reports

def generate_classification_labels(reports, num_synthetic_per_report=5):
    """
    Generate classification training data from sales reports.
    Uses LLM to classify reports and create synthetic variations.
    """
    all_training_examples = []

    for report in reports:
        print(f"Processing Report: {report['report_id']}")

        # Create summary text from report data
        report_text = f"""
        Report ID: {report['report_id']}
        Revenue: {report['revenue']}
        Units Sold: {report['units_sold']}
        Region: {report['region']}
        Customer Segment: {report['customer_segment']}
        Description: {report['description']}
        """

        # Construct prompt for LLM
        prompt = f"""
Based on this sales report, classify which product category it belongs to and generate
{num_synthetic_per_report} synthetic training examples in CSV format.

Product Categories (choose one):
{', '.join(PRODUCT_CATEGORIES)}

Sales Report:
{report_text}

For each synthetic example, create:
- text: A variation of the sales report summary or key metrics
- product_category: The classified product category
- confidence: high, medium, or low

Generate {num_synthetic_per_report} diverse examples that vary in:
- Phrasing (formal, casual, technical, executive summary)
- Focus (revenue-focused, usage-focused, region-focused)
- Length (brief, detailed)

Output in CSV format with header: text,product_category,confidence
"""

        # Call LLM
        client = anthropic.Anthropic(api_key="your-api-key")

        response = client.messages.create(
            model="claude-sonnet-4-5-20250929",
            max_tokens=2000,
            messages=[
                {"role": "user", "content": prompt}
            ]
        )

        # Parse LLM response
        csv_data = response.content[0].text

        # Parse CSV data (skip header)
        for line in csv_data.strip().split('\n')[1:]:
            if line.strip() and ',' in line:
                try:
                    # Simple CSV parsing (for production, use csv module)
                    parts = line.split(',', 2)
                    if len(parts) >= 3:
                        text = parts[0].strip('"').strip()
                        category = parts[1].strip('"').strip()
                        confidence = parts[2].strip('"').strip()

                        example = {
                            'text': text,
                            'product_category': category,
                            'confidence': confidence,
                            'source_report_id': report['report_id']
                        }
                        all_training_examples.append(example)
                        print(f"  Generated: {category} ({confidence} confidence)")
                except Exception as e:
                    print(f"  Error parsing line: {e}")
                    continue

    return all_training_examples

def process_sales_reports(reports_file, output_file, num_synthetic_per_report=5):
    """
    Process sales reports and generate classification training data.
    """
    print(f"Reading sales reports from: {reports_file}")

    # Extract sales report data
    reports = extract_sales_report_data(reports_file)
    print(f"Extracted {len(reports)} sales reports\n")

    # Generate classification labels
    training_examples = generate_classification_labels(
        reports,
        num_synthetic_per_report=num_synthetic_per_report
    )

    # Save to CSV
    output_path = Path(output_file)
    output_path.parent.mkdir(parents=True, exist_ok=True)

    df = pd.DataFrame(training_examples)
    df.to_csv(output_path, index=False)

    # Print summary statistics
    print(f"\nCompleted!")
    print(f"Total reports processed: {len(reports)}")
    print(f"Total training examples generated: {len(training_examples)}")
    print(f"Average examples per report: {len(training_examples) / len(reports):.1f}")

    # Show product category distribution
    print(f"\nProduct category distribution:")
    category_counts = df['product_category'].value_counts()
    for category, count in category_counts.head(10).items():
        print(f"  {category}: {count}")

    print(f"\nSaved to: {output_path}")

# Run the script
process_sales_reports(
    'sales_data/sales_reports_2024.csv',
    'training_data/product_classification.csv',
    num_synthetic_per_report=5
)

Generated CSV output (product_classification.csv):

text,product_category,confidence,source_report_id
"Enterprise CRM solution generated $2.4M in Q4 across North America enterprise segment with 1200 licenses sold",Enterprise_CRM,high,RPT-001
"Q4 revenue of $2.4M from enterprise customers in NA region - high adoption of CRM platform",Enterprise_CRM,high,RPT-001
"CRM product line: 1200 units, $2.4M revenue, enterprise focus",Enterprise_CRM,medium,RPT-001
...
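
Before handing any of these generated files to a trainer, hold out a validation split so you can measure whether the fine-tuned model generalizes beyond its training examples. A minimal sketch (the 10% fraction and fixed seed are arbitrary choices):

```python
import random

def train_val_split(examples, val_fraction=0.1, seed=42):
    """Shuffle examples deterministically and split off a validation set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]
```

Using a fixed seed keeps the split reproducible across runs, so validation scores remain comparable as you iterate on the data.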

Next Steps

Once the fine-tuning data is saved to files, the next step is training a language model on your prepared data. Open source tools like unsloth and axolotl can expedite tuning. These tools can be run on cloud VMs or locally (axolotl on M-series Macs or AMD GPUs, unsloth on Windows or Linux).
