Title: | Automatic Text Classification via Supervised Learning |
---|---|
Description: | A machine learning package for automatic text classification that makes it simple for novice users to get started with machine learning, while allowing experienced users to easily experiment with different settings and algorithm combinations. The package includes eight algorithms for ensemble classification (svm, slda, boosting, bagging, random forests, glmnet, decision trees, neural networks), comprehensive analytics, and thorough documentation. |
Authors: | Timothy P. Jurka, Loren Collingwood, Amber E. Boydstun, Emiliano Grossman, Wouter van Atteveldt |
Maintainer: | Loren Collingwood <[email protected]> |
License: | GPL-3 |
Version: | 1.4.3 |
Built: | 2025-03-12 03:46:17 UTC |
Source: | https://github.com/cran/RTextTools |
An S4 class containing the analytics for a classified set of documents. This includes a label summary and a document summary. This class is returned if virgin=TRUE in create_container.
Objects could in principle be created by calls of the form new("analytics_virgin", ...). The preferred form is to have them created via a call to create_analytics.
label_summary: Object of class "data.frame"; stores the analytics for each label, including how many documents were classified with each label.
document_summary: Object of class "data.frame"; stores the analytics for each document, including all available raw data associated with the learning process.
Timothy P. Jurka <[email protected]>
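A minimal sketch of how these slots can be inspected, reusing the data preparation steps from the examples elsewhere in this documentation:
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
# virgin=TRUE, so create_analytics() returns an analytics_virgin object
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=TRUE)
models <- train_models(container, algorithms=c("SVM"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)
analytics@label_summary     # per-label analytics
analytics@document_summary  # per-document analytics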
An S4 class containing the analytics for a classified set of documents. This includes a label summary, document summary, ensemble summary, and algorithm summary. This class is returned if virgin=FALSE in create_container.
Objects could in principle be created by calls of the form new("analytics", ...). The preferred form is to have them created via a call to create_analytics.
label_summary: Object of class "data.frame"; stores the analytics for each label, including the percent coded accurately and how much overcoding occurred.
document_summary: Object of class "data.frame"; stores the analytics for each document, including all available raw data associated with the learning process.
algorithm_summary: Object of class "data.frame"; stores precision, recall, and F-score statistics for each algorithm, broken down by label.
ensemble_summary: Object of class "matrix"; stores the accuracy and coverage for an n-algorithm ensemble scoring.
Timothy P. Jurka <[email protected]>
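A minimal sketch of accessing all four slots when virgin=FALSE, following the examples elsewhere in this documentation:
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
models <- train_models(container, algorithms=c("RF","SVM"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)
analytics@label_summary      # accuracy and overcoding per label
analytics@document_summary   # raw per-document results
analytics@algorithm_summary  # precision, recall, F-score per algorithm and label
analytics@ensemble_summary   # coverage and accuracy by ensemble agreement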
Converts a DocumentTermMatrix or TermDocumentMatrix (package tm), Matrix (package Matrix), matrix.csr (SparseM), data.frame, or matrix into a matrix.csr representation to be used in the RTextTools functions.
as.compressed.matrix(DocumentTermMatrix)
DocumentTermMatrix | A class of type DocumentTermMatrix or TermDocumentMatrix (package tm), Matrix (package Matrix), matrix.csr (SparseM), data.frame, or matrix. |
A matrix.csr representation of the DocumentTermMatrix or TermDocumentMatrix (package tm), Matrix (package Matrix), matrix.csr (SparseM), data.frame, or matrix.
Timothy P. Jurka <[email protected]>
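A brief usage sketch, reusing the create_matrix call from the examples in this documentation:
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
dtm <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
# Convert the DocumentTermMatrix into a SparseM matrix.csr representation.
csr <- as.compressed.matrix(dtm)
class(csr)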
Uses a trained model from the train_model function to classify new data.
classify_model(container, model, s=0.01, ...)
container | Class of type matrix_container-class generated by the create_container function. |
model | Slot for a trained SVM, SLDA, boosting, bagging, RandomForests, glmnet, decision tree, neural network, or maximum entropy model generated by train_model. |
s | Penalty parameter lambda for glmnet classification. |
... | Additional parameters to be passed into the algorithm's predict function. |
Only one model may be passed in at a time for classification. See train_models and classify_models to train and classify using multiple algorithms.
Returns a data.frame of predicted codes and probabilities for the specified algorithm.
Loren Collingwood <[email protected]>, Timothy P. Jurka <[email protected]>
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
svm_model <- train_model(container, "SVM")
svm_results <- classify_model(container, svm_model)
Uses trained models from the train_models function to classify new data.
classify_models(container, models, ...)
container | Class of type matrix_container-class generated by the create_container function. |
models | List of models to be used for classification, generated by train_models. |
... | Other parameters to be passed on to classify_model. |
Use the list returned by train_models to use multiple models for classification.
Wouter Van Atteveldt <[email protected]>, Timothy P. Jurka <[email protected]>
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
models <- train_models(container, algorithms=c("RF","SVM"))
results <- classify_models(container, models)
Takes the results from the classify_model or classify_models functions and computes various statistics to help interpret the data.
create_analytics(container, classification_results, b=1)
container | Class of type matrix_container-class generated by the create_container function. |
classification_results | A data.frame of classification results generated by classify_model or classify_models. |
b | b-value for generating precision, recall, and F-score statistics. |
An object of class analytics_virgin-class or analytics-class, with either two or four slots respectively, depending on whether the virgin flag is set to TRUE or FALSE in create_container. The slots can be accessed using the @ operator for S4 classes (e.g. analytics@document_summary).
Timothy P. Jurka <[email protected]>, Loren Collingwood <[email protected]>
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
models <- train_models(container, algorithms=c("RF","SVM"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)
Given a DocumentTermMatrix from the tm package and corresponding document labels, creates a container of class matrix_container-class that can be used for training and classification (i.e. train_model, train_models, classify_model, classify_models).
create_container(matrix, labels, trainSize=NULL, testSize=NULL, virgin)
matrix | A document-term matrix of class DocumentTermMatrix created by the create_matrix function. |
labels | A factor or vector of labels corresponding to each document in the matrix. |
trainSize | A range (e.g. 1:75) specifying which rows of the document-term matrix should be used for training. |
testSize | A range (e.g. 76:100) specifying which rows of the document-term matrix should be used for testing or classification. |
virgin | A logical (TRUE or FALSE) specifying whether the classification set is virgin data, i.e. whether its true labels are unknown. |
A container of class matrix_container-class that can be passed into other functions such as train_model, train_models, classify_model, classify_models, and create_analytics.
Timothy P. Jurka <[email protected]>, Loren Collingwood <[email protected]>
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
Creates a summary of ensemble coverage and precision values for ensemble agreement at or above the specified threshold.
create_ensembleSummary(document_summary)
document_summary | The document_summary data.frame from an object of class analytics-class generated by create_analytics. |
This summary is created in the create_analytics function. Note that a threshold value of 3 will return ensemble coverage and precision statistics for topic codes that had 3 or more (i.e. >=3) algorithms agree on the same topic code.
Loren Collingwood, Timothy P. Jurka
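A minimal usage sketch, reusing the pipeline from the examples in this documentation and passing the document_summary slot produced by create_analytics:
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
models <- train_models(container, algorithms=c("RF","SVM"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)
# Coverage and precision by level of ensemble agreement.
ensemble_summary <- create_ensembleSummary(analytics@document_summary)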
Creates an object of class DocumentTermMatrix from tm that can be used in the create_container function.
create_matrix(textColumns, language="english", minDocFreq=1, maxDocFreq=Inf,
    minWordLength=3, maxWordLength=Inf, ngramLength=1, originalMatrix=NULL,
    removeNumbers=FALSE, removePunctuation=TRUE, removeSparseTerms=0,
    removeStopwords=TRUE, stemWords=FALSE, stripWhitespace=TRUE, toLower=TRUE,
    weighting=weightTf)
textColumns | Either a character vector (e.g. data$Title) or a data.frame of text columns to be used together (e.g. cbind(data["Title"], data["Subject"])). |
language | The language to be used for stemming the text data. |
minDocFreq | The minimum number of times a word should appear in a document for it to be included in the matrix. See package tm for more details. |
maxDocFreq | The maximum number of times a word should appear in a document for it to be included in the matrix. See package tm for more details. |
minWordLength | The minimum number of letters a word or n-gram should contain to be included in the matrix. See package tm for more details. |
maxWordLength | The maximum number of letters a word or n-gram should contain to be included in the matrix. See package tm for more details. |
ngramLength | The number of words to include per n-gram for the document-term matrix. |
originalMatrix | The original DocumentTermMatrix used to train the models; if supplied, the new matrix is built with the same terms as the original. |
removeNumbers | A logical parameter specifying whether to remove numbers. |
removePunctuation | A logical parameter specifying whether to remove punctuation. |
removeSparseTerms | See package tm for more details. |
removeStopwords | A logical parameter specifying whether to remove stopwords for the specified language. |
stemWords | A logical parameter specifying whether to stem words using the specified language. |
stripWhitespace | A logical parameter specifying whether to strip whitespace. |
toLower | A logical parameter specifying whether to convert all text to lowercase. |
weighting | Either weightTf or weightTfIdf from package tm, specifying term-frequency or tf-idf weighting of the matrix. |
Timothy P. Jurka <[email protected]>, Loren Collingwood <[email protected]>
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
Creates a summary with precision, recall, and F1 scores for each algorithm broken down by unique label.
create_precisionRecallSummary(container, classification_results, b_value = 1)
container | Class of type matrix_container-class generated by the create_container function. |
classification_results | A data.frame of classification results generated by classify_model or classify_models. |
b_value | b-value for generating precision, recall, and F-score statistics. |
Loren Collingwood, Timothy P. Jurka
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
models <- train_models(container, algorithms=c("RF","SVM"))
results <- classify_models(container, models)
precision_recall_f1 <- create_precisionRecallSummary(container, results)
Creates a summary with the best label for each document, determined by the highest algorithm certainty and the highest consensus (i.e. the label that the most algorithms agreed on).
create_scoreSummary(container, classification_results)
container | Class of type matrix_container-class generated by the create_container function. |
classification_results | A data.frame of classification results generated by classify_model or classify_models. |
Timothy P. Jurka <[email protected]>, Loren Collingwood <[email protected]>
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
models <- train_models(container, algorithms=c("RF","SVM"))
results <- classify_models(container, models)
score_summary <- create_scoreSummary(container, results)
Performs n-fold cross-validation of the specified algorithm.
cross_validate(container, nfold, algorithm = c("SVM", "SLDA", "BOOSTING", "BAGGING",
    "RF", "GLMNET", "TREE", "NNET"), seed = NA, method = "C-classification",
    cross = 0, cost = 100, kernel = "radial", maxitboost = 100, maxitglm = 10^5,
    size = 1, maxitnnet = 1000, MaxNWts = 10000, rang = 0.1, decay = 5e-04,
    ntree = 200, l1_regularizer = 0, l2_regularizer = 0, use_sgd = FALSE,
    set_heldout = 0, verbose = FALSE)
container | Class of type matrix_container-class generated by the create_container function. |
nfold | Number of folds to perform for cross-validation. |
algorithm | A string specifying which algorithm to use. Use print_algorithms() to see the available options. |
seed | Random seed number used to replicate cross-validation results. |
method | Method parameter for the SVM implementation. See e1071 documentation for more details. |
cross | Cross parameter for the SVM implementation. See e1071 documentation for more details. |
cost | Cost parameter for the SVM implementation. See e1071 documentation for more details. |
kernel | Kernel parameter for the SVM implementation. See e1071 documentation for more details. |
maxitboost | Maximum iterations parameter for the boosting implementation. See caTools documentation for more details. |
maxitglm | Maximum iterations parameter for the glmnet implementation. See glmnet documentation for more details. |
size | Size parameter for the neural network implementation. See nnet documentation for more details. |
maxitnnet | Maximum iterations for the neural network implementation. See nnet documentation for more details. |
MaxNWts | Maximum number of weights parameter for the neural network implementation. See nnet documentation for more details. |
rang | Range parameter for the neural network implementation. See nnet documentation for more details. |
decay | Decay parameter for the neural network implementation. See nnet documentation for more details. |
ntree | Number of trees parameter for the random forest implementation. See randomForest documentation for more details. |
l1_regularizer | A numeric turning on L1 regularization and setting the regularization parameter; a value of 0 disables L1 regularization. |
l2_regularizer | A numeric turning on L2 regularization and setting the regularization parameter; a value of 0 disables L2 regularization. |
use_sgd | A logical indicating whether stochastic gradient descent should be used for parameter estimation. |
set_heldout | An integer specifying the number of documents to hold out for model evaluation. |
verbose | A logical specifying whether to provide descriptive output about the training process. |
Loren Collingwood, Timothy P. Jurka
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
svm <- cross_validate(container, 2, algorithm="SVM")
This dynamically determines the names of the languages for which stemming is supported by this package. This is controlled when the package is created (not installed) by downloading the stemming algorithms for the different languages.
This language support requires more support for Unicode and more complex text than simple strings.
getStemLanguages()
This queries the C code for the list of languages that were compiled when the package was installed, which in turn is determined by the code included in the distributed package itself.
A character vector giving the names of the languages.
Duncan Temple Lang <[email protected]>
See http://snowball.tartarus.org/, wordStem, and inst/scripts/download in the source of the Rstem package.
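A one-line usage sketch:
library(RTextTools)
# Character vector of languages with compiled Snowball stemmers.
getStemLanguages()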
An S4 class containing all information necessary to train, classify, and generate analytics for a dataset.
Objects could in principle be created by calls of the form new("matrix_container", ...). The preferred form is to have them created via a call to create_container.
training_matrix: Object of class "matrix.csr"; stores the training set of the DocumentTermMatrix created by create_matrix.
training_codes: Object of class "factor"; stores the training labels for each document in the training_matrix slot of matrix_container-class.
classification_matrix: Object of class "matrix.csr"; stores the classification set of the DocumentTermMatrix created by create_matrix.
testing_codes: Object of class "factor"; if virgin=FALSE, stores the labels for each document in classification_matrix.
column_names: Object of class "vector"; stores the column names of the DocumentTermMatrix created by create_matrix.
virgin: Object of class "logical"; boolean specifying whether the classification set is virgin data (TRUE) or not (FALSE).
Timothy P. Jurka
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
container@training_matrix
container@training_codes
container@classification_matrix
container@testing_codes
container@column_names
container@virgin
A sample dataset containing labeled headlines from The New York Times, compiled by Professor Amber E. Boydstun at the University of California, Davis.
data(NYTimes)
A data.frame containing five columns:
1. Article_ID - A unique identifier for the headline from The New York Times.
2. Date - The date the headline appeared in The New York Times.
3. Title - The headline as it appeared in The New York Times.
4. Subject - A manually classified subject of the headline.
5. Topic.Code - A manually labeled topic code corresponding to the subject.
data(NYTimes)
An informative function that displays options for the algorithms parameter in train_model and train_models.
print_algorithms()
Prints a list of available algorithms.
Timothy P. Jurka
library(RTextTools)
print_algorithms()
Reads data from several types of data storage into an R data frame.
read_data(filepath, type=c("csv","delim","folder"), index=NULL, ...)
filepath | Character string giving the name of the file or folder; include the path if the file is not located in the working directory. |
type | Character vector specifying the file type. Options are "csv", "delim", and "folder". |
index | The path to a CSV file specifying the training label of each file in the folder of text files, with one filename and its label per line. |
... | Other arguments passed to R's underlying read functions (e.g. read.csv). |
A data.frame object is returned with the contents of the file.
Loren Collingwood, Timothy P. Jurka
library(RTextTools)
data <- read_data(system.file("data/NYTimes.csv.gz", package="RTextTools"), type="csv", sep=";")
Given the true labels to compare to the labels predicted by the algorithms, calculates the recall accuracy of each algorithm.
recall_accuracy(true_labels, predicted_labels)
true_labels | A vector containing the true labels, or known values, for each document in the classification set. |
predicted_labels | A vector containing the predicted labels, or classified values, for each document in the classification set. |
Loren Collingwood, Timothy P. Jurka
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
models <- train_models(container, algorithms=c("RF","SVM"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)
recall_accuracy(analytics@document_summary$MANUAL_CODE, analytics@document_summary$RF_LABEL)
recall_accuracy(analytics@document_summary$MANUAL_CODE, analytics@document_summary$SVM_LABEL)
Returns a summary of the contents within an object of class analytics-class.
## S3 method for class 'analytics'
summary(object, ...)
object | An object of class analytics-class. |
... | Additional parameters to be passed on to the summary function. |
Timothy P. Jurka
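A minimal usage sketch with virgin=FALSE, following the examples elsewhere in this documentation:
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
models <- train_models(container, algorithms=c("RF","SVM"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)
# Summarize the contents of the analytics object.
summary(analytics)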
Returns a summary of the contents within an object of class analytics_virgin-class.
## S3 method for class 'analytics_virgin'
summary(object, ...)
object | An object of class analytics_virgin-class. |
... | Additional parameters to be passed on to the summary function. |
Timothy P. Jurka
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=TRUE)
models <- train_models(container, algorithms=c("RF","SVM"))
results <- classify_models(container, models)
analytics <- create_analytics(container, results)
summary(analytics)
Creates a trained model using the specified algorithm.
train_model(container, algorithm=c("SVM","SLDA","BOOSTING","BAGGING","RF","GLMNET",
    "TREE","NNET"), method = "C-classification", cross = 0, cost = 100,
    kernel = "radial", maxitboost = 100, maxitglm = 10^5, size = 1,
    maxitnnet = 1000, MaxNWts = 10000, rang = 0.1, decay = 5e-04, trace=FALSE,
    ntree = 200, l1_regularizer = 0, l2_regularizer = 0, use_sgd = FALSE,
    set_heldout = 0, verbose = FALSE, ...)
container | Class of type matrix_container-class generated by the create_container function. |
algorithm | Character vector (i.e. a string) specifying which algorithm to use. Use print_algorithms() to see the available options. |
method | Method parameter for the SVM implementation. See e1071 documentation for more details. |
cross | Cross parameter for the SVM implementation. See e1071 documentation for more details. |
cost | Cost parameter for the SVM implementation. See e1071 documentation for more details. |
kernel | Kernel parameter for the SVM implementation. See e1071 documentation for more details. |
maxitboost | Maximum iterations parameter for the boosting implementation. See caTools documentation for more details. |
maxitglm | Maximum iterations parameter for the glmnet implementation. See glmnet documentation for more details. |
size | Size parameter for the neural network implementation. See nnet documentation for more details. |
maxitnnet | Maximum iterations for the neural network implementation. See nnet documentation for more details. |
MaxNWts | Maximum number of weights parameter for the neural network implementation. See nnet documentation for more details. |
rang | Range parameter for the neural network implementation. See nnet documentation for more details. |
decay | Decay parameter for the neural network implementation. See nnet documentation for more details. |
trace | Trace parameter for the neural network implementation. See nnet documentation for more details. |
ntree | Number of trees parameter for the random forest implementation. See randomForest documentation for more details. |
l1_regularizer | A numeric turning on L1 regularization and setting the regularization parameter; a value of 0 disables L1 regularization. |
l2_regularizer | A numeric turning on L2 regularization and setting the regularization parameter; a value of 0 disables L2 regularization. |
use_sgd | A logical indicating whether stochastic gradient descent should be used for parameter estimation. |
set_heldout | An integer specifying the number of documents to hold out for model evaluation. |
verbose | A logical specifying whether to provide descriptive output about the training process. |
... | Additional arguments to be passed on to algorithm function calls. |
Only one algorithm may be selected for training. See train_models and classify_models to train and classify using multiple algorithms.
Returns a trained model that can be subsequently used in classify_model to classify new data.
Timothy P. Jurka, Loren Collingwood
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
rf_model <- train_model(container, "RF")
svm_model <- train_model(container, "SVM")
Creates trained models using the specified algorithms.
train_models(container, algorithms, ...)
container | Class of type matrix_container-class generated by the create_container function. |
algorithms | List of algorithms as a character vector (e.g. c("SVM","RF")). |
... | Other parameters to be passed on to train_model. |
Calls the train_model function for each algorithm you list.
Returns a list of trained models that can be subsequently used in classify_models to classify new data.
Wouter Van Atteveldt <[email protected]>
library(RTextTools)
data(NYTimes)
data <- NYTimes[sample(1:3100, size=100, replace=FALSE), ]
matrix <- create_matrix(cbind(data["Title"], data["Subject"]), language="english",
    removeNumbers=TRUE, stemWords=FALSE, weighting=tm::weightTfIdf)
container <- create_container(matrix, data$Topic.Code, trainSize=1:75,
    testSize=76:100, virgin=FALSE)
models <- train_models(container, algorithms=c("RF","SVM"))
A sample dataset containing labeled bills from the United States Congress, compiled by Professor John D. Wilkerson at the University of Washington, Seattle and E. Scott Adler at the University of Colorado, Boulder.
data(USCongress)
A data.frame containing five columns:
1. ID - A unique identifier for the bill.
2. cong - The session of congress that the bill first appeared in.
3. billnum - The number of the bill as it appears in the congressional docket.
4. h_or_sen - A field specifying whether the bill was introduced in the House (HR) or the Senate (S).
5. major - A manually labeled topic code corresponding to the subject of the bill.
http://www.congressionalbills.org/
data(USCongress)
This function computes the stems of each of the given words in the vector. This reduces a word to its base component, making it easier to compare words like win, winning, winner. See http://snowball.tartarus.org/ for more information about the concept and algorithms for stemming.
wordStem(words, language = character(), warnTested = FALSE)
words | A character vector of words whose stems are to be computed. |
language | The name of a recognized language for the package. This should be a single string that is an element of the vector returned by getStemLanguages(). |
warnTested | An option to control whether a warning is issued about languages that have not been explicitly tested as part of the unit testing of the code. For the most part, one can ignore these warnings, so they are turned off. In the future, this might be controlled with a global option, but for now the warnings are suppressed by default. |
This uses Dr. Martin Porter's stemming algorithm and the interface generated by Snowball http://snowball.tartarus.org/.
A character vector with as many elements as the input vector, where each element is the stem of the corresponding word.
Duncan Temple Lang <[email protected]>
See http://snowball.tartarus.org/
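A minimal usage sketch, stemming the example words from the description above:
library(RTextTools)
# Reduce words to their common stem using the English stemmer.
wordStem(c("win", "winning", "winner"), language = "english")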