# TextForge

Meta-feature extraction for text classification datasets.

TextForge computes dataset-level meta-features from text classification data. These features describe properties of the dataset itself -- vocabulary distribution, readability, class balance, embedding geometry, information content, and more -- and are designed for automated model selection, attention-mechanism recommendation, and AutoML pipelines.
## Installation

```bash
# Core (all extractors except embeddings)
pip install git+https://github.com/TextForge/TextForge.git

# With sentence-transformer embedding features
pip install "git+https://github.com/TextForge/TextForge.git#egg=TextForge[embeddings]"
```
After installation, download the required NLTK data once:

```python
import nltk

nltk.download("stopwords")
nltk.download("averaged_perceptron_tagger_eng")
nltk.download("universal_tagset")
```
## Quick Start

```python
import pandas as pd
from TextForge import extract_features

# Your data must have 'text' and 'label' columns
df = pd.read_csv("my_dataset.csv")  # columns: text, label

# Extract all default meta-features (long format)
features = extract_features(df)
print(features.head())
#        feature  value                     description          category
# 0  n_documents   1500      Total number of documents.  Basic Statistics
# 1    n_classes      5  Number of unique class labels.  Basic Statistics
# ...
```
### Wide Format (one row per dataset)

```python
features = extract_features(df, dataset_name="my_dataset", wide_format=True)
# Returns a single-row DataFrame where each meta-feature is a column
```
### Batch Extraction

```python
import os

all_features = pd.DataFrame()
for fname in os.listdir("datasets/"):
    df = pd.read_csv(f"datasets/{fname}")
    row = extract_features(df, dataset_name=fname, wide_format=True)
    all_features = pd.concat([all_features, row], ignore_index=True)

all_features.to_csv("meta_features.csv", index=False)
```
## Configuration

Control which extractor groups run via a config dictionary:

```python
config = {
    "basic": True,
    "vocabulary": True,
    "readability": True,
    "pos_tagging": True,
    "category_stats": True,
    "pca": True,
    "hardness": True,
    "landmarking": True,
    "mudof": True,
    "taxonomy": True,
    "automl": True,
    "embeddings": True,  # requires: pip install TextForge[embeddings]
    "info_theoretic": True,
}

features = extract_features(df, config=config)
```
List all registered extractors:

```python
from TextForge import list_extractors

print(list_extractors())
```
## Extractor Groups

| Extractor | Key | Description |
|---|---|---|
| Basic Statistics | basic | Document count, class count, token/character length distributions, type-token ratio |
| Vocabulary | vocabulary | Zipf divergence (SEM), hapax/dis legomena ratios, Yule's K, word frequency stats |
| Readability | readability | Flesch, Kincaid, SMOG, Coleman-Liau, ARI, Dale-Chall, Gunning Fog scores |
| POS Tagging | pos_tagging | Universal POS tag proportions, content/function word ratios |
| Category Statistics | category_stats | Class balance metrics, class entropy, imbalance ratio |
| PCA | pca | BoW/TF-IDF PCA statistics, sparsity, explained variance, intrinsic dimensionality |
| Hardness | hardness | SEM, UVB, SVB, normalised MRH-J, vocabulary ratios, class imbalance |
| Landmarking | landmarking | 1-NN, Decision Tree, Logistic Regression, Naive Bayes baseline performance |
| MUDOF | mudof | Per-class TF-IDF statistics: doc length, term values, information gain |
| Taxonomy | taxonomy | Sentiment (word/sentence/document), syllable stats, special characters, sentence structure |
| AutoML | automl | Auto-sklearn-inspired: combined feature matrix stats, PCA dimensionality, kurtosis/skewness |
| Embeddings | embeddings | Sentence-transformer centroids, inter/intra-class distances, class separability, embedding geometry |
| Information Theoretic | info_theoretic | Mutual information, feature entropy, class entropy, feature relevance counts |
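As a concrete example of what the `category_stats` group reports, class entropy and the imbalance ratio can be computed directly from the label column. This is an illustrative sketch of the standard definitions, not TextForge's internal code:

```python
import math
from collections import Counter

def class_balance_stats(labels):
    """Class entropy (in bits) and imbalance ratio (majority/minority count)."""
    counts = Counter(labels)
    n = len(labels)
    probs = [c / n for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probs)
    imbalance = max(counts.values()) / min(counts.values())
    return entropy, imbalance

# A skewed 2-class dataset: 6 docs of "a", 2 of "b"
entropy, imbalance = class_balance_stats(["a"] * 6 + ["b"] * 2)
# entropy ~ 0.81 bits (below the 1.0 of a balanced binary split), imbalance = 3.0
```

A perfectly balanced k-class dataset would give entropy log2(k) and imbalance 1.0.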
For a longer, implementation-oriented catalog (distributional summaries, per-extractor behaviour, optional embeddings), see METAFEATURES.md. For prior-work citations and a condensed feature-family table, see docs/metafeature_provenance.md.
## Embedding Features (n-class support)

The embeddings extractor solves the variable-class representation problem by:

- Encoding all documents with a sentence-transformer model (default: all-MiniLM-L6-v2)
- Computing per-class centroid embeddings
- Deriving inter-class statistics (pairwise centroid distances)
- Deriving intra-class statistics (document-to-centroid spread)
- Computing class separability ratios and embedding geometry stats

These features are independent of the number of classes and capture the geometric structure that determines which models and attention mechanisms will perform well.
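The geometry statistics above can be illustrated with plain NumPy on pre-computed vectors. This sketch substitutes synthetic 2-D points for sentence-transformer embeddings and is not the library's implementation; the metric names are chosen for illustration:

```python
import numpy as np

def embedding_geometry(embeddings, labels):
    """Per-class centroids, intra-class spread, inter-class centroid
    distance, and a separability ratio (inter / intra)."""
    labels = np.asarray(labels)
    classes = sorted(set(labels))
    centroids = {c: embeddings[labels == c].mean(axis=0) for c in classes}
    # Intra-class: mean distance from each document to its own class centroid
    intra = np.mean([np.linalg.norm(e - centroids[l])
                     for e, l in zip(embeddings, labels)])
    # Inter-class: mean pairwise distance between class centroids
    pairs = [(a, b) for i, a in enumerate(classes) for b in classes[i + 1:]]
    inter = np.mean([np.linalg.norm(centroids[a] - centroids[b])
                     for a, b in pairs])
    return {"intra_class_spread": intra,
            "inter_class_distance": inter,
            "separability": inter / intra if intra else float("inf")}

# Two tight, well-separated synthetic "classes" in 2-D
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
y = ["a"] * 20 + ["b"] * 20
stats = embedding_geometry(X, y)
```

A high separability ratio (inter-class distance much larger than intra-class spread) signals an easy classification problem; the number of classes only changes how many pairs are averaged, not the shape of the output.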
## Input Requirements

Your DataFrame must satisfy:

- Columns: `text` (string) and `label` (any hashable type)
- At least 2 unique labels
- At least 2 documents per label
- Not all documents consist purely of stopwords
- Each label's pooled text has at least 5 tokens

TextForge automatically drops null rows, lowercases text, and validates the data before extraction.
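A minimal sketch of what such validation might look like; the real checks live in `utils/validation.py`, the function name here is hypothetical, and the stopword check is omitted because it requires NLTK:

```python
import pandas as pd

def validate_input(df):
    """Check the input requirements listed above; raise ValueError on failure."""
    if not {"text", "label"}.issubset(df.columns):
        raise ValueError("DataFrame needs 'text' and 'label' columns")
    df = df.dropna(subset=["text", "label"])  # drop null rows
    counts = df["label"].value_counts()
    if len(counts) < 2:
        raise ValueError("need at least 2 unique labels")
    if counts.min() < 2:
        raise ValueError("need at least 2 documents per label")
    for label, group in df.groupby("label"):
        tokens = " ".join(group["text"].str.lower()).split()
        if len(tokens) < 5:
            raise ValueError(f"label {label!r}: pooled text has under 5 tokens")
    return df

ok = pd.DataFrame({"text": ["one two three", "four five six",
                            "seven eight nine", "ten eleven twelve"],
                   "label": ["a", "a", "b", "b"]})
validate_input(ok)  # passes
```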
## Migrating from v0.2.x

The v1.0.0 API is a complete rewrite. Here's how to update your code:

Before (v0.2.x):

```python
from TextForge import textForge

config = {
    "InitialFeatures": True,
    "TextStatMetrics": True,
    "VocabularyMetrics": True,
    # ...
}
output_df = textForge.extract_features(df, file_name, current_features, config)
wide_df = textForge.extract_features_only(output_df, file_name)
```

After (v1.0.0):

```python
from TextForge import extract_features

config = {
    "basic": True,        # was "InitialFeatures"
    "readability": True,  # was "TextStatMetrics"
    "vocabulary": True,   # was "VocabularyMetrics"
    # ...
}

# Long format (same as before)
output_df = extract_features(df, config=config)

# Wide format (replaces extract_features_only)
wide_df = extract_features(df, config=config, dataset_name="my_file", wide_format=True)
```
Key differences:

- `file_name` and `current_features` parameters have been removed (they were unused).
- Config keys use short lowercase names instead of class names.
- `extract_features_only` is replaced by `wide_format=True` on `extract_features`.
- Timing features (e.g. "Main 3 features Time") are no longer in the output; use Python logging to track runtime.
- The old `Depreciated.py` module (never part of the pipeline) has been removed.
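Since per-extractor timing is no longer emitted, runtime can be tracked with the standard library. A minimal sketch, where the commented line stands in for your own extraction call:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("textforge.timing")

start = time.perf_counter()
# features = extract_features(df)  # your extraction call goes here
elapsed = time.perf_counter() - start
log.info("meta-feature extraction took %.3f s", elapsed)
```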
See CHANGELOG.md for the full list of changes, bug fixes, and new features.
## Architecture

```
TextForge/
├── __init__.py               # Public API exports
├── forge.py                  # Main extract_features() orchestration
├── extractors/
│   ├── __init__.py           # BaseExtractor ABC + auto-registration
│   ├── basic.py              # Dataset size, token/char distributions
│   ├── vocabulary.py         # Zipf, hapax, Yule's K
│   ├── readability.py        # textstat readability scores
│   ├── pos_tagging.py        # Universal POS tag proportions
│   ├── category_stats.py     # Class balance and entropy
│   ├── pca.py                # TruncatedSVD on BoW/TF-IDF
│   ├── hardness.py           # SEM, UVB, SVB, MRH-J
│   ├── landmarking.py        # Baseline classifier performance
│   ├── mudof.py              # Per-class TF-IDF statistics
│   ├── taxonomy.py           # Sentiment, structure, style
│   ├── automl.py             # Auto-sklearn-style features
│   ├── embeddings.py         # Sentence-transformer geometry
│   └── information_theoretic.py  # MI, entropy, relevance
└── utils/
    ├── statistics.py         # Descriptive statistics helpers
    └── validation.py         # Input validation and cleaning
```
All extractors inherit from BaseExtractor and self-register via __init_subclass__. Adding a new extractor is as simple as creating a new file with a class that defines name, category, and extract().
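The self-registration pattern described above can be sketched as follows; the class and attribute names mirror the README's description, not necessarily the actual source:

```python
from abc import ABC, abstractmethod

class BaseExtractor(ABC):
    """Illustrative sketch of __init_subclass__-based auto-registration."""
    registry = {}

    def __init_subclass__(cls, **kwargs):
        # Called automatically when a subclass's class body finishes
        # executing, so cls.name is already defined at this point.
        super().__init_subclass__(**kwargs)
        BaseExtractor.registry[cls.name] = cls

    @abstractmethod
    def extract(self, df):
        ...

class BasicStats(BaseExtractor):
    name = "basic"
    category = "Basic Statistics"

    def extract(self, df):
        return {"n_documents": len(df)}

# Merely defining BasicStats registered it -- no explicit registration call
print(BaseExtractor.registry)  # {'basic': <class '...BasicStats'>}
```

Because registration happens at class-definition time, importing a new extractor module is enough to make it visible to the orchestrator.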
## License

MIT License. Copyright 2023 TextForge.