Title: | Automatic Text Classification via Supervised Learning |
---|---|
Description: | A machine learning package for automatic text classification that makes it simple for novice users to get started with machine learning, while allowing experienced users to easily experiment with different settings and algorithm combinations. The package includes eight algorithms for ensemble classification (svm, slda, boosting, bagging, random forests, glmnet, decision trees, neural networks), comprehensive analytics, and thorough documentation. |
Authors: | Timothy P. Jurka, Loren Collingwood, Amber E. Boydstun, Emiliano Grossman, Wouter van Atteveldt |
Maintainer: | Loren Collingwood <[email protected]> |
License: | GPL-3 |
Version: | 1.4.3 |
Built: | 2025-03-12 03:46:17 UTC |
Source: | https://github.com/cran/RTextTools |
An S4 class containing the analytics for a classified set of documents. This includes a label summary and a document summary. This class is returned if virgin=TRUE in create_container.
Objects could in principle be created by calls of the form new("analytics_virgin", ...). The preferred form is to have them created via a call to create_analytics.
label_summary: Object of class "data.frame"; stores the analytics for each label, including how many documents were classified with each label.
document_summary: Object of class "data.frame"; stores the analytics for each document, including all available raw data associated with the learning process.
Timothy P. Jurka <[email protected]>
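A minimal sketch of how these slots can be inspected, reusing the data preparation steps from the examples elsewhere in this documentation:
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
# virgin=TRUE, so create_analytics() returns an analytics_virgin object
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=TRUE)
models <- train_models(container, algorithms=c("SVM"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)
analytics@label_summary     # per-label analytics
analytics@document_summary  # per-document analytics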
An S4 class containing the analytics for a classified set of documents. This includes a label summary, document summary, ensemble summary, and algorithm summary. This class is returned if virgin=FALSE in create_container.
Objects could in principle be created by calls of the form new("analytics", ...). The preferred form is to have them created via a call to create_analytics.
label_summary: Object of class "data.frame"; stores the analytics for each label, including the percent coded accurately and how much overcoding occurred.
document_summary: Object of class "data.frame"; stores the analytics for each document, including all available raw data associated with the learning process.
algorithm_summary: Object of class "data.frame"; stores precision, recall, and F-score statistics for each algorithm, broken down by label.
ensemble_summary: Object of class "matrix"; stores the accuracy and coverage for an n-algorithm ensemble scoring.
Timothy P. Jurka <[email protected]>
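A minimal sketch of accessing all four slots when virgin=FALSE, following the examples elsewhere in this documentation:
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
models <- train_models(container, algorithms=c("RF","SVM"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)
analytics@label_summary      # accuracy and overcoding per label
analytics@document_summary   # raw per-document results
analytics@algorithm_summary  # precision, recall, F-score per algorithm and label
analytics@ensemble_summary   # coverage and accuracy by ensemble agreement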
Converts a DocumentTermMatrix or TermDocumentMatrix (package tm), Matrix (package Matrix), matrix.csr (SparseM), data.frame, or matrix into a matrix.csr representation to be used in the RTextTools functions.
as.compressed.matrix(DocumentTermMatrix)
DocumentTermMatrix | A class of type DocumentTermMatrix or TermDocumentMatrix (package tm), Matrix (package Matrix), matrix.csr (SparseM), data.frame, or matrix. |
A matrix.csr representation of the DocumentTermMatrix or TermDocumentMatrix (package tm), Matrix (package Matrix), matrix.csr (SparseM), data.frame, or matrix.
Timothy P. Jurka <[email protected]>
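A brief usage sketch, reusing the create_matrix call from the examples in this documentation:
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
dtm <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
# Convert the DocumentTermMatrix into a SparseM matrix.csr representation.
csr <- as.compressed.matrix(dtm)
class(csr)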
Uses a trained model from the train_model function to classify new data.
classify_model(container, model, s=0.01, ...)
container | Class of type matrix_container-class generated by the create_container function. |
model | Slot for a trained SVM, SLDA, boosting, bagging, RandomForests, glmnet, decision tree, neural network, or maximum entropy model generated by train_model. |
s | Penalty parameter lambda for glmnet classification. |
... | Additional parameters to be passed into the algorithm's predict function. |
Only one model may be passed in at a time for classification. See train_models and classify_models to train and classify using multiple algorithms.
Returns a data.frame of predicted codes and probabilities for the specified algorithm.
Loren Collingwood <[email protected]>, Timothy P. Jurka <[email protected]>
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
svm_model <- train_model(container, "SVM")
svm_results <- classify_model(container, svm_model)
Uses trained models from the train_models function to classify new data.
classify_models(container, models, ...)
container | Class of type matrix_container-class generated by the create_container function. |
models | List of models to be used for classification, generated by train_models. |
... | Other parameters to be passed on to classify_model. |
Use the list returned by train_models to use multiple models for classification.
Wouter Van Atteveldt <[email protected]>, Timothy P. Jurka <[email protected]>
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
models <- train_models(container, algorithms=c("RF","SVM"))
results <- classify_models(container, models)
Takes the results from the classify_model or classify_models functions and computes various statistics to help interpret the data.
create_analytics(container, classification_results, b=1)
container | Class of type matrix_container-class generated by the create_container function. |
classification_results | A data.frame of classification results generated by classify_model or classify_models. |
b | b-value for generating precision, recall, and F-score statistics. |
An object of class analytics_virgin-class or analytics-class, with either two or four slots respectively, depending on whether the virgin flag is set to TRUE or FALSE in create_container. The slots can be accessed using the @ operator for S4 classes (e.g. analytics@document_summary).
Timothy P. Jurka <[email protected]>, Loren Collingwood <[email protected]>
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
models <- train_models(container, algorithms=c("RF","SVM"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)
Given a DocumentTermMatrix from the tm package and corresponding document labels, creates a container of class matrix_container-class that can be used for training and classification (i.e. train_model, train_models, classify_model, classify_models).
create_container(matrix, labels, trainSize=NULL, testSize=NULL, virgin)
matrix | A document-term matrix of class DocumentTermMatrix created by the create_matrix function. |
labels | A factor or vector of labels corresponding to each document in the matrix. |
trainSize | A range (e.g. 1:75) specifying which rows of the document-term matrix should be used for training. |
testSize | A range (e.g. 76:100) specifying which rows of the document-term matrix should be used for testing or classification. |
virgin | A logical (TRUE or FALSE) specifying whether the classification set is virgin data, i.e. whether its true labels are unknown. |
A container of class matrix_container-class that can be passed into other functions such as train_model, train_models, classify_model, classify_models, and create_analytics.
Timothy P. Jurka <[email protected]>, Loren Collingwood <[email protected]>
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
Creates a summary of ensemble coverage and precision values for ensemble agreement at or above the specified threshold.
create_ensembleSummary(document_summary)
document_summary | The document_summary data.frame from an object of class analytics-class generated by create_analytics. |
This summary is created in the create_analytics function. Note that a threshold value of 3 will return ensemble coverage and precision statistics for topic codes that had 3 or more (i.e. >=3) algorithms agree on the same topic code.
Loren Collingwood, Timothy P. Jurka
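A minimal usage sketch, reusing the pipeline from the examples in this documentation and passing the document_summary slot produced by create_analytics:
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
models <- train_models(container, algorithms=c("RF","SVM"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)
# Coverage and precision by level of ensemble agreement.
ensemble_summary <- create_ensembleSummary(analytics@document_summary)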
Creates an object of class DocumentTermMatrix from tm that can be used in the create_container function.
create_matrix(textColumns, language="english", minDocFreq=1, maxDocFreq=Inf,
    minWordLength=3, maxWordLength=Inf, ngramLength=1, originalMatrix=NULL,
    removeNumbers=FALSE, removePunctuation=TRUE, removeSparseTerms=0,
    removeStopwords=TRUE, stemWords=FALSE, stripWhitespace=TRUE, toLower=TRUE,
    weighting=weightTf)
textColumns | Either a character vector (e.g. data$Title) or a data.frame of text columns to be used together (e.g. cbind(data["Title"], data["Subject"])). |
language | The language to be used for stemming the text data. |
minDocFreq | The minimum number of times a word should appear in a document for it to be included in the matrix. See package tm for more details. |
maxDocFreq | The maximum number of times a word should appear in a document for it to be included in the matrix. See package tm for more details. |
minWordLength | The minimum number of letters a word or n-gram should contain to be included in the matrix. See package tm for more details. |
maxWordLength | The maximum number of letters a word or n-gram should contain to be included in the matrix. See package tm for more details. |
ngramLength | The number of words to include per n-gram for the document-term matrix. |
originalMatrix | The original DocumentTermMatrix used to train the models; if supplied, the new matrix is built with the same terms as the original. |
removeNumbers | A logical parameter specifying whether to remove numbers. |
removePunctuation | A logical parameter specifying whether to remove punctuation. |
removeSparseTerms | See package tm for more details. |
removeStopwords | A logical parameter specifying whether to remove stopwords for the specified language. |
stemWords | A logical parameter specifying whether to stem words using the specified language. |
stripWhitespace | A logical parameter specifying whether to strip whitespace. |
toLower | A logical parameter specifying whether to convert all text to lowercase. |
weighting | Either weightTf or weightTfIdf from package tm, specifying term-frequency or tf-idf weighting of the matrix. |
Timothy P. Jurka <[email protected]>, Loren Collingwood <[email protected]>
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
Creates a summary with precision, recall, and F1 scores for each algorithm broken down by unique label.
create_precisionRecallSummary(container, classification_results, b_value = 1)
container | Class of type matrix_container-class generated by the create_container function. |
classification_results | A data.frame of classification results generated by classify_model or classify_models. |
b_value | b-value for generating precision, recall, and F-score statistics. |
Loren Collingwood, Timothy P. Jurka
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
models <- train_models(container, algorithms=c("RF","SVM"))
results <- classify_models(container, models)
precision_recall_f1 <- create_precisionRecallSummary(container, results)
Creates a summary with the best label for each document, determined by the highest algorithm certainty and the highest consensus (i.e. the label that the most algorithms agreed on).
create_scoreSummary(container, classification_results)
container | Class of type matrix_container-class generated by the create_container function. |
classification_results | A data.frame of classification results generated by classify_model or classify_models. |
Timothy P. Jurka <[email protected]>, Loren Collingwood <[email protected]>
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
models <- train_models(container, algorithms=c("RF","SVM"))
results <- classify_models(container, models)
score_summary <- create_scoreSummary(container, results)
Performs n-fold cross-validation of the specified algorithm.
cross_validate(container, nfold, algorithm = c("SVM", "SLDA", "BOOSTING", "BAGGING",
    "RF", "GLMNET", "TREE", "NNET"), seed = NA, method = "C-classification",
    cross = 0, cost = 100, kernel = "radial", maxitboost = 100, maxitglm = 10^5,
    size = 1, maxitnnet = 1000, MaxNWts = 10000, rang = 0.1, decay = 5e-04,
    ntree = 200, l1_regularizer = 0, l2_regularizer = 0, use_sgd = FALSE,
    set_heldout = 0, verbose = FALSE)
container | Class of type matrix_container-class generated by the create_container function. |
nfold | Number of folds to perform for cross-validation. |
algorithm | A string specifying which algorithm to use. Use print_algorithms() to see the available options. |
seed | Random seed number used to replicate cross-validation results. |
method | Method parameter for the SVM implementation. See e1071 documentation for more details. |
cross | Cross parameter for the SVM implementation. See e1071 documentation for more details. |
cost | Cost parameter for the SVM implementation. See e1071 documentation for more details. |
kernel | Kernel parameter for the SVM implementation. See e1071 documentation for more details. |
maxitboost | Maximum iterations parameter for the boosting implementation. See caTools documentation for more details. |
maxitglm | Maximum iterations parameter for the glmnet implementation. See glmnet documentation for more details. |
size | Size parameter for the neural network implementation. See nnet documentation for more details. |
maxitnnet | Maximum iterations for the neural network implementation. See nnet documentation for more details. |
MaxNWts | Maximum number of weights parameter for the neural network implementation. See nnet documentation for more details. |
rang | Range parameter for the neural network implementation. See nnet documentation for more details. |
decay | Decay parameter for the neural network implementation. See nnet documentation for more details. |
ntree | Number of trees parameter for the random forest implementation. See randomForest documentation for more details. |
l1_regularizer | A numeric turning on L1 regularization and setting the regularization parameter; a value of 0 disables L1 regularization. |
l2_regularizer | A numeric turning on L2 regularization and setting the regularization parameter; a value of 0 disables L2 regularization. |
use_sgd | A logical indicating whether stochastic gradient descent should be used for parameter estimation. |
set_heldout | An integer specifying the number of documents to hold out for model evaluation. |
verbose | A logical specifying whether to provide descriptive output about the training process. |
Loren Collingwood, Timothy P. Jurka
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
svm <- cross_validate(container, 2, algorithm="SVM")
This dynamically determines the names of the languages for which stemming is supported by this package. This is controlled when the package is created (not installed) by downloading the stemming algorithms for the different languages.
This language support requires more support for Unicode and more complex text than simple strings.
getStemLanguages()
This queries the C code for the list of languages that were compiled when the package was installed, which in turn is determined by the code included in the distributed package itself.
A character vector giving the names of the languages.
Duncan Temple Lang <[email protected]>
See http://snowball.tartarus.org/, wordStem, and inst/scripts/download in the source of the Rstem package.
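A one-line usage sketch:
library(RTextTools)
# Character vector of languages with compiled Snowball stemmers.
getStemLanguages()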
An S4 class containing all information necessary to train, classify, and generate analytics for a dataset.
Objects could in principle be created by calls of the form new("matrix_container", ...). The preferred form is to have them created via a call to create_container.
training_matrix: Object of class "matrix.csr"; stores the training set of the DocumentTermMatrix created by create_matrix.
training_codes: Object of class "factor"; stores the training labels for each document in the training_matrix slot of matrix_container-class.
classification_matrix: Object of class "matrix.csr"; stores the classification set of the DocumentTermMatrix created by create_matrix.
testing_codes: Object of class "factor"; if virgin=FALSE, stores the labels for each document in classification_matrix.
column_names: Object of class "vector"; stores the column names of the DocumentTermMatrix created by create_matrix.
virgin: Object of class "logical"; boolean specifying whether the classification set is virgin data (TRUE) or not (FALSE).
Timothy P. Jurka
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
container@training_matrix
container@training_codes
container@classification_matrix
container@testing_codes
container@column_names
container@virgin
A sample dataset containing labeled headlines from The New York Times, compiled by Professor Amber E. Boydstun at the University of California, Davis.
data(NYTimes)
A data.frame containing five columns:
1. Article_ID - A unique identifier for the headline from The New York Times.
2. Date - The date the headline appeared in The New York Times.
3. Title - The headline as it appeared in The New York Times.
4. Subject - A manually classified subject of the headline.
5. Topic.Code - A manually labeled topic code corresponding to the subject.
data(NYTimes)
An informative function that displays options for the algorithms parameter in train_model and train_models.
print_algorithms()
Prints a list of available algorithms.
Timothy P. Jurka
library(RTextTools)
print_algorithms()
Reads data from several types of data storage into an R data frame.
read_data(filepath, type=c("csv","delim","folder"), index=NULL, ...)
filepath | Character string giving the name of the file or folder; include the path if the file is not located in the working directory. |
type | Character vector specifying the file type. Options are "csv", "delim", and "folder". |
index | The path to a CSV file specifying the training label of each file in the folder of text files, with one filename and its label per line. |
... | Other arguments passed to R's underlying read functions (e.g. read.csv). |
A data.frame object is returned with the contents of the file.
Loren Collingwood, Timothy P. Jurka
library(RTextTools)
data <- read_data(system.file("data/NYTimes.csv.gz", package="RTextTools"), type="csv", sep=";")
Given the true labels to compare to the labels predicted by the algorithms, calculates the recall accuracy of each algorithm.
recall_accuracy(true_labels, predicted_labels)
true_labels | A vector containing the true labels, or known values, for each document in the classification set. |
predicted_labels | A vector containing the predicted labels, or classified values, for each document in the classification set. |
Loren Collingwood, Timothy P. Jurka
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
models <- train_models(container, algorithms=c("RF","SVM"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)
recall_accuracy(analytics@document_summary$MANUAL_CODE, analytics@document_summary$RF_LABEL)
recall_accuracy(analytics@document_summary$MANUAL_CODE, analytics@document_summary$SVM_LABEL)
Returns a summary of the contents within an object of class analytics-class.
## S3 method for class 'analytics'
summary(object, ...)
object | An object of class analytics-class. |
... | Additional parameters to be passed on to the summary function. |
Timothy P. Jurka
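A minimal usage sketch with virgin=FALSE, following the examples elsewhere in this documentation:
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
models <- train_models(container, algorithms=c("RF","SVM"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)
# Summarize the contents of the analytics object.
summary(analytics)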
Returns a summary of the contents within an object of class analytics_virgin-class.
## S3 method for class 'analytics_virgin'
summary(object, ...)
object | An object of class analytics_virgin-class. |
... | Additional parameters to be passed on to the summary function. |
Timothy P. Jurka
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=TRUE)
models <- train_models(container, algorithms=c("RF","SVM"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)
summary(analytics)
Creates a trained model using the specified algorithm.
train_model(container, algorithm=c("SVM","SLDA","BOOSTING","BAGGING","RF","GLMNET",
    "TREE","NNET"), method = "C-classification", cross = 0, cost = 100,
    kernel = "radial", maxitboost = 100, maxitglm = 10^5, size = 1,
    maxitnnet = 1000, MaxNWts = 10000, rang = 0.1, decay = 5e-04, trace=FALSE,
    ntree = 200, l1_regularizer = 0, l2_regularizer = 0, use_sgd = FALSE,
    set_heldout = 0, verbose = FALSE, ...)
container | Class of type matrix_container-class generated by the create_container function. |
algorithm | Character vector (i.e. a string) specifying which algorithm to use. Use print_algorithms() to see the available options. |
method | Method parameter for the SVM implementation. See e1071 documentation for more details. |
cross | Cross parameter for the SVM implementation. See e1071 documentation for more details. |
cost | Cost parameter for the SVM implementation. See e1071 documentation for more details. |
kernel | Kernel parameter for the SVM implementation. See e1071 documentation for more details. |
maxitboost | Maximum iterations parameter for the boosting implementation. See caTools documentation for more details. |
maxitglm | Maximum iterations parameter for the glmnet implementation. See glmnet documentation for more details. |
size | Size parameter for the neural network implementation. See nnet documentation for more details. |
maxitnnet | Maximum iterations for the neural network implementation. See nnet documentation for more details. |
MaxNWts | Maximum number of weights parameter for the neural network implementation. See nnet documentation for more details. |
rang | Range parameter for the neural network implementation. See nnet documentation for more details. |
decay | Decay parameter for the neural network implementation. See nnet documentation for more details. |
trace | Trace parameter for the neural network implementation. See nnet documentation for more details. |
ntree | Number of trees parameter for the random forest implementation. See randomForest documentation for more details. |
l1_regularizer | A numeric turning on L1 regularization and setting the regularization parameter; a value of 0 disables L1 regularization. |
l2_regularizer | A numeric turning on L2 regularization and setting the regularization parameter; a value of 0 disables L2 regularization. |
use_sgd | A logical indicating whether stochastic gradient descent should be used for parameter estimation. |
set_heldout | An integer specifying the number of documents to hold out for model evaluation. |
verbose | A logical specifying whether to provide descriptive output about the training process. |
... | Additional arguments to be passed on to algorithm function calls. |
Only one algorithm may be selected for training. See train_models and classify_models to train and classify using multiple algorithms.
Returns a trained model that can be subsequently used in classify_model to classify new data.
Timothy P. Jurka, Loren Collingwood
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
rf_model <- train_model(container, "RF")
svm_model <- train_model(container, "SVM")
Creates trained models using the specified algorithms.
train_models(container, algorithms, ...)
container | Class of type matrix_container-class generated by the create_container function. |
algorithms | List of algorithms as a character vector (e.g. c("SVM","RF")). |
... | Other parameters to be passed on to train_model. |
Calls the train_model function for each algorithm you list.
Returns a list of trained models that can be subsequently used in classify_models to classify new data.
Wouter Van Atteveldt <[email protected]>
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
models <- train_models(container, algorithms=c("RF","SVM"))
A sample dataset containing labeled bills from the United States Congress, compiled by Professor John D. Wilkerson at the University of Washington, Seattle and E. Scott Adler at the University of Colorado, Boulder.
data(USCongress)
A data.frame containing five columns:
1. ID - A unique identifier for the bill.
2. cong - The session of congress that the bill first appeared in.
3. billnum - The number of the bill as it appears in the congressional docket.
4. h_or_sen - A field specifying whether the bill was introduced in the House (HR) or the Senate (S).
5. major - A manually labeled topic code corresponding to the subject of the bill.
http://www.congressionalbills.org/
data(USCongress)
This function computes the stems of each of the given words in the vector. This reduces a word to its base component, making it easier to compare words like win, winning, winner. See http://snowball.tartarus.org/ for more information about the concept and algorithms for stemming.
wordStem(words, language = character(), warnTested = FALSE)
words | A character vector of words whose stems are to be computed. |
language | The name of a recognized language for the package. This should be a single string that is an element of the vector returned by getStemLanguages(). |
warnTested | An option to control whether a warning is issued about languages that have not been explicitly tested as part of the unit testing of the code. For the most part, one can ignore these warnings, so they are turned off. In the future, this might be controlled with a global option, but for now the warnings are suppressed by default. |
This uses Dr. Martin Porter's stemming algorithm and the interface generated by Snowball http://snowball.tartarus.org/.
A character vector with as many elements as the input vector, where each element is the stem of the corresponding word.
Duncan Temple Lang <[email protected]>
See http://snowball.tartarus.org/
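A minimal usage sketch, stemming the example words from the description above:
library(RTextTools)
# Reduce words to their common stem using the English stemmer.
wordStem(c("win", "winning", "winner"), language = "english")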