R_TextMining_WordCloud

Text Mining & Word Cloud using R (25 September 2016)

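I am not an expert and only looked into this briefly, but I used R to do some text mining and to build a word cloud for displaying the results.

Most of this simply imitates how other people use these tools. I have not mastered the tm (text mining) package, so instead of a mosaic of the two approaches, I ended up with two separate procedures.

The material is the first edition of Darwin's On the Origin of Species. I downloaded the text from Project Gutenberg and removed the Project Gutenberg notices at the top and bottom, which gave the text file Origin_of_Species_maintext.txt.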


Part 1: Processing without the tm (text mining) package

I followed another page for this; what follows is almost a straight copy of it. No special packages are needed.

Because stop words such as "the" and "is" are not removed, this approach is fine for counting tokens, but on its own it is not very useful for looking at word frequencies.
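To illustrate the point about stop words, here is a minimal base R sketch on a two-sentence toy text. It tokenises the same way as the code in this part and then drops a few stop words before counting. The toy sentences and the short `stop_words` vector are made up for illustration; a real list, such as the one in the tm package, is much longer.

```r
# Toy text standing in for the real file (hypothetical sentences)
txt <- c("The origin of species is the subject.",
         "Natural selection is the mechanism.")

# Tokenise: split on whitespace and punctuation, lower-case, drop empty strings
words <- tolower(unlist(strsplit(txt, "[[:space:]]|[[:punct:]]")))
words <- words[nchar(words) > 0]

# A tiny hand-written stop word list (an assumption, for illustration only)
stop_words <- c("the", "of", "is", "a", "and", "in", "to")

# Keep only the words that are not stop words
content_words <- words[!(words %in% stop_words)]

# Frequency table of the remaining words
freq <- sort(table(content_words), decreasing = TRUE)
print(freq)
```

The token count should still be taken from `words`, since filtering changes what is being counted.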


Building the word frequency table

# Read the text file
#
# Read it line by line into a character vector
txt <- readLines("Origin_of_Species_maintext.txt")
#
# Split on whitespace and punctuation
wordL <- strsplit(txt, "[[:space:]]|[[:punct:]]")
#
# Flatten the per-line list into a single vector
wordL <- unlist(wordL)
#
# Convert to lower case
wordL <- tolower(wordL)
#
# Remove empty strings ("")
wordL <- wordL[nchar(wordL) > 0]
#
# Number of tokens
tokens <- length(wordL)
tokens
#
# Number of types
#    unique() returns the distinct elements of the vector
types <- length(unique(wordL))
types
#
# TTR: Type-Token Ratio
# \[ TTR = \frac{types}{tokens} \times 100 \]
#
types/tokens * 100
#
#
# Word frequencies, sorted in decreasing order
freqL <- sort(table(wordL), decreasing = TRUE)
#
# Word frequencies (top 5 words)
#freqL[1:5]
#
# Write the frequency table to a file
write.table(freqL, "originofspeciesout.txt", quote=FALSE, col.names=FALSE)


Part 2: Processing with the tm (text mining) package

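Again I followed another page; what follows is almost a straight copy of it.

Stop words are removed, and the text is also stemmed.

Stemming, as far as I can tell, means trimming each word down to its (presumed) common part. As a result, plurals mostly disappear, and endings such as -ion are cut off as well, so a verb and the noun derived from it tend to end up as the same item. Stemmed words are not automatically restored to their original form. So this time I wrote the aggregated frequency distribution to a text file and replaced each stem with the most frequent of the original words it came from. For this I used originofspeciesout.txt, the output of Part 1.

The "Cleaning the text" step is repeated twice to work around (?) an error at the tolower step complaining about multibyte characters. Converting the file with nkf to EUC with Unix line endings gave the same error, so I tried simply repeating the step. When I left the error alone, the stop words were not removed properly.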
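The manual step of replacing each stem with its most frequent original word could in principle be scripted. The following is only a sketch under assumptions: the toy frequency table and its counts are made up, and it uses `wordStem()` from the SnowballC package with `language = "english"`, which may not match the stemmer settings of `stemDocument` exactly.

```r
library(SnowballC)

# Toy frequency table standing in for originofspeciesout.txt (counts are made up)
freq <- data.frame(
  word = c("species", "selection", "selections", "variation", "variations"),
  n    = c(50, 30, 2, 20, 25),
  stringsAsFactors = FALSE
)

# Stem each original word
freq$stem <- wordStem(freq$word, language = "english")

# For each stem, keep the most frequent original word and sum the counts
by_stem <- split(freq, freq$stem)
restored <- do.call(rbind, lapply(by_stem, function(g) {
  data.frame(stem = g$stem[1],
             word = g$word[which.max(g$n)],
             freq = sum(g$n),
             stringsAsFactors = FALSE)
}))

# Sort by total frequency, highest first
restored <- restored[order(-restored$freq), ]
print(restored)
```

A lookup built this way still needs a manual check, because unrelated words can share a stem.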
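One possible cause of such a tolower error is a byte that is not valid in the current locale's encoding. The sketch below assumes, purely for illustration, that the stray characters are Latin-1 bytes; the real encoding of the file was not determined here. It uses base R's `iconv()` to replace anything that cannot be expressed in ASCII with a space before lower-casing.

```r
# Two toy lines; the first contains a Latin-1 byte (\xef), which is an assumption
# about what the problematic characters might look like
x <- c("na\xefve species", "plain ascii line")

# Convert to ASCII; characters that cannot be converted are replaced by a space
clean <- iconv(x, from = "latin1", to = "ASCII", sub = " ")

# tolower now sees only ASCII input
lower <- tolower(clean)
print(lower)
```

If something like this works on the real file, applying it to the vector returned by readLines, before building the corpus, might make the second cleaning pass unnecessary.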


First, install the required packages

install.packages("tm")
install.packages("SnowballC") # needed for stemDocument
install.packages("wordcloud")


Building the frequency distribution

# Load
library("tm")
library("SnowballC")
library("wordcloud")
#
# Build the word frequency table
# Read the text file
#
# Read it line by line into a character vector
text <- readLines("Origin_of_Species_maintext.txt")
# Load the data as a corpus
docs <- Corpus(VectorSource(text))
#
# Text transformation
#
toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
#
#
# Cleaning the text
#
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop word
# specify your stopwords as a character vector
#docs <- tm_map(docs, removeWords, c("blabla1", "blabla2")) 
## Remove punctuations
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
docs <- tm_map(docs, stemDocument)
#
# Second pass over the same cleaning steps (deliberate: works around the tolower error)
# Convert the text to lower case
docs <- tm_map(docs, content_transformer(tolower))
# Remove numbers
docs <- tm_map(docs, removeNumbers)
# Remove english common stopwords
docs <- tm_map(docs, removeWords, stopwords("english"))
# Remove your own stop word
# specify your stopwords as a character vector
#docs <- tm_map(docs, removeWords, c("blabla1", "blabla2")) 
## Remove punctuations
docs <- tm_map(docs, removePunctuation)
# Eliminate extra white spaces
docs <- tm_map(docs, stripWhitespace)
# Text stemming
docs <- tm_map(docs, stemDocument)
#
# Build a term-document matrix
#
# A term-document matrix is a table containing the frequency of the
# words. Row names are words (terms) and column names are documents. The
# function TermDocumentMatrix() from the tm package can be used
# as follows:
#
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m),decreasing=TRUE)
d <- data.frame(word = names(v),freq=v)
#
# Keep every row (1000000 is far more than the number of terms) and write to a file
myhead <- head(d, 1000000)
write.table(myhead, "origintmout.txt", quote=FALSE, col.names=FALSE)
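The orientation of the matrix is easy to get wrong, so here is a small sketch with a two-document toy corpus (the sentences are made up). In a term-document matrix the rows are terms and the columns are documents, which is why `rowSums()` gives the total frequency of each word.

```r
library(tm)

# Two toy documents (hypothetical text, already lower case)
toy <- Corpus(VectorSource(c("natural selection acts",
                             "selection of species")))

# Rows are terms, columns are documents
tdm <- TermDocumentMatrix(toy)
m <- as.matrix(tdm)
print(dim(m))

# Total frequency of each term across all documents
totals <- sort(rowSums(m), decreasing = TRUE)
print(totals)
```

With the real text, each line read by readLines becomes one document, so the matrix has one column per line.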


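Part 3: Creating the word cloud

The parameters and other settings follow the same page as in Part 2.

origintmout.tsv is the data file in which I manually replaced the odd-looking words produced by stemming with high-frequency original words. I made it tab-separated (TSV, tab-separated values) because that is easy to load with read.table.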

library(wordcloud)
#
newfreq <- read.table("origintmout.tsv")
#
set.seed(1234)
wordcloud(words = newfreq$V2, freq = newfreq$V3, min.freq = 1, max.words=200,
         random.order=FALSE, rot.per=0.35,
         colors=brewer.pal(8, "Dark2"))
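The call above picks columns by position (V2 for the word, V3 for the frequency), because the file has no header and read.table assigns default names. As a sketch, the same file layout can be read with explicit column names. The temporary file, its two rows, and the names `stem`, `word` and `freq` are assumptions for illustration, not the real contents of origintmout.tsv.

```r
# Write a tiny stand-in for origintmout.tsv to a temporary file (made-up rows)
tmp <- tempfile(fileext = ".tsv")
writeLines(c("speci\tspecies\t1800",
             "select\tselection\t500"), tmp)

# Read it with a tab separator and descriptive column names
newfreq <- read.table(tmp, sep = "\t",
                      col.names = c("stem", "word", "freq"),
                      stringsAsFactors = FALSE)
print(newfreq)
```

With names like these, the wordcloud call would take `words = newfreq$word` and `freq = newfreq$freq` instead of V2 and V3.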


Summary

151191 tokens
6916 unique words
Type-Token Ratio (%): 4.574346

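In On the Origin of Species, a word for evolution is used directly in only one place, and it is the very last word of the main text (evolved)*1. The share of "evolution" words among all tokens (called the Evol-Token Ratio here) is therefore 1/151191, which gives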

Evol-Token Ratio (%): 0.000661415
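Both ratios can be reproduced from the counts reported above. The variable names in this short check are chosen here for illustration.

```r
# Counts reported in the summary
tokens <- 151191
types  <- 6916

# Type-Token Ratio in percent
ttr <- types / tokens * 100
print(ttr)

# Evol-Token Ratio in percent: a single occurrence ("evolved") among all tokens
evol_ratio <- 1 / tokens * 100
print(evol_ratio)
```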


Word Cloud
