*Text Mining & Word Cloud using R (25 September 2016) [#r024f442]

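I am not an expert and only looked into this briefly, but I used R to do some
text mining and to draw a word cloud that displays the result.

Most of this simply imitates how other people did it.
I have not mastered the tm (text mining) package, so rather than a mosaic
of the two approaches, it ended up as two separate procedures.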

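The material is the first edition of Darwin's On the Origin of Species.
I downloaded the data from the
[[gutenberg project>https://www.gutenberg.org/]]
and removed the gutenberg project notices at the top and bottom,
which gave the text file used here (Origin_of_Species_maintext.txt).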


** Part 1: Processing without the tm (text mining) package [#if68037f]

I followed
[[this page>http://rstudio-pubs-static.s3.amazonaws.com/10072_0cf8cff5e82c483298851b550c56fae0.html]].
What follows is almost a copy of it.
No special packages are needed.

Stop words such as "the" and "is" are not removed,
so this is fine for counting tokens,
but on its own it is not very useful for looking at word frequencies.

*** Building the word frequency table [#y1d6db6f]
 # Read the text file
 #
 # Read it line by line into a character vector
 txt <- readLines("Origin_of_Species_maintext.txt")
 #
 # Split on whitespace and punctuation
 wordL <- strsplit(txt, "[[:space:]]|[[:punct:]]")
 #
 # Flatten the per-line results into one vector
 wordL <- unlist(wordL)
 #
 # Convert to lower case
 wordL <- tolower(wordL)
 #
 # Remove empty strings ("")
 wordL <- wordL[wordL != ""]
 #
 # Number of tokens
 tokens <- length(wordL)
 tokens
 #
 # Number of types
 #    unique() returns the distinct elements of the vector
 types <- length(unique(wordL))
 types
 #
 # TTR: Type-Token Ratio
 #\[ TTR=\frac{types}{tokens} \times 100 \]
 #
 types/tokens * 100
 #
 #
 # Word frequencies
 freqL <- sort(table(wordL), decreasing = TRUE)
 #
 # Word frequencies (top 5 words)
 #freqL[1:5]
 #
 #
 write.table(freqL, "originofspeciesout.txt", quote = FALSE, col.names = FALSE)
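
Because Part 1 leaves stop words in place, the token vector could be filtered in base R before counting. The following is only a minimal sketch: the sample sentences are made up, and `my_stopwords` is a hypothetical, deliberately tiny list (a realistic one would be much longer, for example the list returned by `stopwords("english")` in the tm package).

```r
# Toy input standing in for the lines of the book (illustrative text only)
txt <- c("The origin of species is a book.", "It is the book on species.")

# Same tokenisation as above: split on whitespace or punctuation, lower-case,
# and drop the empty strings that strsplit() leaves behind
wordL <- tolower(unlist(strsplit(txt, "[[:space:]]|[[:punct:]]")))
wordL <- wordL[wordL != ""]

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch)
my_stopwords <- c("the", "is", "a", "of", "it", "on")

# Keep only the tokens that are not stop words, then count them
contentL <- wordL[!(wordL %in% my_stopwords)]
freq_content <- sort(table(contentL), decreasing = TRUE)
print(freq_content)
```

Filtering the token vector rather than the finished table keeps `tokens` and `types` from the original `wordL` untouched, so the TTR is still computed on the full text.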


** Part 2: Processing with the tm (text mining) package [#pff4a6ed]

I followed
[[this page>http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know]].
What follows is almost a copy of it.

Here stop words are removed, and stemming is applied as well.

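Stemming apparently means trimming each word down to its (presumed) common part.
As a result, plurals mostly disappear and endings such as -ion are stripped,
so a verb and the noun derived from it end up as the same term.
Stemmed words are not restored to their original form automatically.
So this time I wrote the aggregated frequency table to a text file
and replaced each stem by hand with the most frequent of its original words,
using the originofspeciesout.txt file produced in Part 1.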
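
The manual replacement described here could in principle be scripted. The sketch below is not what was done for this page; it only illustrates the idea of choosing, for each stem, the original word that occurs most often. The `words` and `stems` vectors are typed by hand for illustration, and `most_frequent`, `stem_to_word` and `restored` are hypothetical names.

```r
# Original tokens and their stems. The stems are typed by hand for
# illustration; in practice they would come from the stemming step.
words <- c("varieties", "variety", "varieties", "selection", "selected", "selection")
stems <- c("varieti",   "varieti", "varieti",   "select",    "select",   "select")

# For each stem, pick the original word that occurs most often
# (which.max() returns the first maximum, so ties go to the
# alphabetically first word)
most_frequent <- function(w) names(which.max(table(w)))
stem_to_word <- tapply(words, stems, most_frequent)

# Stemmed frequency table with the readable label attached
stem_freq <- table(stems)
restored <- data.frame(word = unname(stem_to_word[names(stem_freq)]),
                       freq = as.vector(stem_freq),
                       stringsAsFactors = FALSE)
print(restored)
```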

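The "Cleaning the text" step is run twice because, for some reason,
the tolower step raised an error about multibyte characters,
and repeating the step was my workaround (?). Converting the file to euc-unix with nkf
gave the same error, so I tried repeating the step instead.
When the error was left as it was, the stop words were not removed properly.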
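
One possible cause of such an error is non-ASCII characters (accented letters, typographic quotes) in the input. That is an assumption, not something verified for this file. Under that assumption, a minimal sketch using base R's `iconv()` would replace everything that cannot be represented in ASCII with a space before lower-casing. The sample strings are made up, and `from = "UTF-8"` is a guess about the file's encoding.

```r
# Sample lines containing non-ASCII characters (illustrative, written with
# \u escapes so the snippet itself stays plain ASCII)
text <- c("Caf\u00e9 and na\u00efve", "Plain ASCII line")

# Replace every byte that cannot be represented in ASCII with a space.
# from = "UTF-8" is an assumption about the file's encoding.
text_ascii <- iconv(text, from = "UTF-8", to = "ASCII", sub = " ")

# tolower() now only ever sees ASCII input
text_lower <- tolower(text_ascii)
print(text_lower)
```

If this were applied to the vector returned by `readLines()` before building the corpus, a single pass of the cleaning steps might be enough, but that has not been checked here.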


***First, install the required packages [#j7372f21]

 install.packages("tm")
 install.packages("SnowballC") # needed for stemDocument
 install.packages("wordcloud")

***Building the frequency distribution [#a86876e8]

 # Load
 library("tm")
 library("SnowballC")
 library("wordcloud")
 #
 # Build the word frequency table
 # Read the text file
 #
 # Read it line by line into a character vector
 text <- readLines("Origin_of_Species_maintext.txt")
 # Load the data as a corpus
 docs <- Corpus(VectorSource(text))
 #
 # Text transformation
 #
 toSpace <- content_transformer(function (x , pattern ) gsub(pattern, " ", x))
 docs <- tm_map(docs, toSpace, "/")
 docs <- tm_map(docs, toSpace, "@")
 docs <- tm_map(docs, toSpace, "\\|")
 #
 #
 # Cleaning the text
 #
 # Convert the text to lower case
 docs <- tm_map(docs, content_transformer(tolower))
 # Remove numbers
 docs <- tm_map(docs, removeNumbers)
 # Remove english common stopwords
 docs <- tm_map(docs, removeWords, stopwords("english"))
 # Remove your own stop word
 # specify your stopwords as a character vector
 #docs <- tm_map(docs, removeWords, c("blabla1", "blabla2")) 
 ## Remove punctuations
 docs <- tm_map(docs, removePunctuation)
 # Eliminate extra white spaces
 docs <- tm_map(docs, stripWhitespace)
 # Text stemming
 docs <- tm_map(docs, stemDocument)
 #
 # Convert the text to lower case
 docs <- tm_map(docs, content_transformer(tolower))
 # Remove numbers
 docs <- tm_map(docs, removeNumbers)
 # Remove english common stopwords
 docs <- tm_map(docs, removeWords, stopwords("english"))
 # Remove your own stop word
 # specify your stopwords as a character vector
 #docs <- tm_map(docs, removeWords, c("blabla1", "blabla2")) 
 ## Remove punctuations
 docs <- tm_map(docs, removePunctuation)
 # Eliminate extra white spaces
 docs <- tm_map(docs, stripWhitespace)
 # Text stemming
 docs <- tm_map(docs, stemDocument)
 #
 # Step 4 : Build a term-document matrix
 #
 #A term-document matrix is a table containing the frequency of the
 #words. Row names are words and column names are documents. The
 #function TermDocumentMatrix() from the text mining package can be used
 #as follows:
 #
 dtm <- TermDocumentMatrix(docs)
 m <- as.matrix(dtm)
 v <- sort(rowSums(m),decreasing=TRUE)
 d <- data.frame(word = names(v),freq=v)
 #
 myhead <- head(d, 1000000)
 write.table(myhead, "origintmout.txt", quote = FALSE, col.names = FALSE)

** Part 3: Creating the Word Cloud [#h2a13823]

As in Part 2, the parameters and so on follow
[[this page>http://www.sthda.com/english/wiki/text-mining-and-word-cloud-fundamentals-in-r-5-simple-steps-you-should-know]].

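origintmout.tsv is the data file in which the words that stemming had turned
into odd forms were changed back by hand to the most frequent original word.
I made it a tab-separated values (TSV) file
because that is easy to load with read.table.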
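
To show why the word cloud call uses columns `V2` and `V3`, here is a small round trip through a tab-separated file. It is a sketch under the assumption that the file keeps the row-name column written by `write.table()`. The values are illustrative, and the temporary file stands in for origintmout.tsv.

```r
# Small frequency table in the same shape as the tm output (illustrative values)
d <- data.frame(word = c("species", "variety"), freq = c(10, 4),
                row.names = c("species", "varieti"))

# Write it tab-separated without a header; row names become the first column
tsv_file <- tempfile(fileext = ".tsv")
write.table(d, tsv_file, sep = "\t", quote = FALSE, col.names = FALSE)

# Read it back: with no header, read.table() names the columns V1, V2, V3
newfreq <- read.table(tsv_file, sep = "\t", stringsAsFactors = FALSE)
print(newfreq)   # V1 = stem (row name), V2 = word, V3 = frequency
```

Passing `sep = "\t"` explicitly is a design choice for the sketch: the default separator of `read.table()` is any whitespace, which works as long as no field contains a space.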


 library(wordcloud)
 #
 newfreq <- read.table("origintmout.tsv")
 #
 set.seed(1234)
 wordcloud(words = newfreq$V2, freq = newfreq$V3, min.freq = 1, max.words=200,
          random.order=FALSE, rot.per=0.35,
          colors=brewer.pal(8, "Dark2"))


** Summary [#fe428cd4]

151191 tokens~
6916 unique words~
Type-Token Ratio (%): 4.574346

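In On the Origin of Species, a word for evolution is used directly in only one place,
and that is the very last word of the main text (evolved)((Probably a well-known bit of trivia among people familiar with the book.)).
So the share of "evolution" words among all tokens (called the Evol-Token Ratio
here) is 1/151191, which gives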

Evol-Token Ratio (%): 0.000661415
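
The two percentages above can be reproduced from the reported counts. The variable names in this short sketch are chosen for illustration only.

```r
tokens <- 151191   # total tokens reported above
types  <- 6916     # unique words reported above
evol   <- 1        # occurrences of "evolved"

# Both ratios are expressed as a percentage of all tokens
ttr        <- types / tokens * 100
evol_ratio <- evol / tokens * 100

print(ttr)
print(evol_ratio)
```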

***Word Cloud [#c3d114e1]

&ref(OriginWC.png);
