Interreg project GSS: Grenseløst samarbeid for sikkerhet (Borderless Cooperation for Security)
netGSS :: Analysis of text from the 2013 exercise :: R-statistics :: tm Text Mining
Overview: histogram | all participants | Swedish vs. Norwegian participants (analysis of words with 6 or more characters) (alternatively --> histogram of words with 5 or more characters)
Fig 1. Word cloud for all participants (2013 exercise: words with 6 or more characters)
Fig 2. Tree diagram (dendrogram) of all participants from Norway and Sweden (2013 exercise)
We see that the county authority in Norway (n_County) uses words and terms that have much in common with those used by NTE Nett (n_Power_Co).
We also see that the municipal fire crews in Norway (n_Fire_muni) use terms that have much in common with those used by 110 Brann (n_Fire), 113 AMK (n_Ambulance), and the Swedish SOS Alarm (s_SOS_alarm).
More information about dendrograms: https://eight2late.wordpress.com/2015/07/22/a-gentle-introduction-to-cluster-analysis-using-r/
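As a minimal sketch of how a dendrogram like this is built, the base R example below clusters a small, made-up count matrix using the same distance measure and linkage method as the analysis on this page (Euclidean distance, ward.D). The group names, terms, and counts are hypothetical and are not taken from the exercise data.

```r
# Toy document-term counts (hypothetical values, not the exercise data).
# Rows are participant groups, columns are terms.
counts <- matrix(
  c(5, 4, 0, 0,
    4, 5, 1, 0,
    0, 1, 6, 5,
    0, 0, 5, 6),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("group_A", "group_B", "group_C", "group_D"),
    c("term1", "term2", "term3", "term4")
  )
)

# Pairwise Euclidean distances between the groups (rows)
d <- dist(counts, method = "euclidean")

# Agglomerative hierarchical clustering with Ward linkage
fit <- hclust(d, method = "ward.D")

# Cut the tree into two clusters and show the group memberships
groups <- cutree(fit, k = 2)
print(groups)

# To draw the tree with the clusters outlined:
# plot(fit, hang = -1)
# rect.hclust(fit, k = 2, border = "red")
```

Groups that use similar terms at similar frequencies lie close together and are merged low in the tree, which is how pairings such as the ones described above show up in the figure.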
Fig 3. Principal component plot (k=2): R ClusterPlot of all participants from Norway and Sweden (2013 exercise)
The ClusterPlot shows the relative distance between all participant groups in terms of the words and terms used in the text communication during the exercise.
The link under Figure 2 has more information about ClusterPlot.
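The idea behind such a plot can be sketched in base R: cluster the groups with k-means (k=2), then project them onto the first two principal components so they can be drawn in two dimensions. This is only an approximation of what the clusplot function in the cluster package draws, and the data below are hypothetical, not the exercise data.

```r
set.seed(142)

# Toy document-term counts (hypothetical values, not the exercise data)
counts <- matrix(
  c(5, 4, 0, 0,
    4, 5, 1, 0,
    5, 5, 0, 1,
    0, 1, 6, 5,
    0, 0, 5, 6,
    1, 0, 6, 6),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    c("A1", "A2", "A3", "B1", "B2", "B3"),
    c("term1", "term2", "term3", "term4")
  )
)

# Distance matrix between groups, as in the analysis on this page
d <- as.matrix(dist(counts, method = "euclidean"))

# K-means with two clusters; nstart repeats the random initialisation
kfit <- kmeans(d, centers = 2, nstart = 10)

# Project the groups onto the first two principal components
pc <- prcomp(d)
coords <- pc$x[, 1:2]

print(data.frame(coords, cluster = kfit$cluster))

# To draw the projection, coloured by cluster:
# plot(coords, col = kfit$cluster, pch = 19)
# text(coords, labels = rownames(coords), pos = 3)
```

Points that end up far apart in this projection correspond to groups whose vocabulary differs the most.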
Fig 4. Number of clusters - all participants from Norway and Sweden (2013 exercise)
The optimal number of clusters is 3 in this case: beyond that point the reduction in "within groups sum of squares" levels off, which shows up as a "kink" (elbow) in the graph.
The link under Figure 2 has more information about ClusterPlot.
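The transcript below does not include the code for this figure, so the following is only a sketch of the usual way to produce such an "elbow" curve: run k-means for a range of k and record the total within-groups sum of squares. The data points are hypothetical and were chosen to form three well-separated groups.

```r
set.seed(142)

# Nine toy points in three well-separated groups (hypothetical data)
x <- rbind(
  c(0, 0),  c(1, 0),  c(0, 1),
  c(10, 0), c(11, 0), c(10, 1),
  c(5, 10), c(6, 10), c(5, 11)
)

max_k <- 5
wss <- numeric(max_k)

# k = 1: total sum of squared deviations from the overall mean
wss[1] <- sum(scale(x, center = TRUE, scale = FALSE)^2)

# k = 2..max_k: total within-groups sum of squares from k-means
for (k in 2:max_k) {
  wss[k] <- kmeans(x, centers = k, nstart = 20)$tot.withinss
}

print(round(wss, 2))

# To draw the curve:
# plot(1:max_k, wss, type = "b",
#      xlab = "Number of clusters",
#      ylab = "Within groups sum of squares")
```

With data like these, the curve drops steeply up to k=3 and is nearly flat afterwards, which is the kind of kink described above.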
Partial output of three participant groups in simulation
[13] "s_SOS_alarm"
> library(tm)
>
> library(SnowballC)
>
> docs <- Corpus(DirSource(cname, encoding = "UTF-8"))
>
> orig_docs <- Corpus(DirSource(cname, encoding = "UTF-8"))
> orig_docs <- tm_map(orig_docs,content_transformer(function(x) iconv(x, to='UTF-8', sub='byte')),mc.cores=1)
> orig_docs <- tm_map(orig_docs,removePunctuation)
> orig_docs <- tm_map(orig_docs, removeNumbers)
> orig_docs <- tm_map(orig_docs, stripWhitespace)
>
>
> ## docs <- tm_map(docs,content_transformer(function(x) iconv(x, to='UTF-8', sub='byte')),mc.cores=1)
> docs <- tm_map(docs, removePunctuation)
> docs <- tm_map(docs, removeNumbers)
>
> ## docs <- tm_map(docs, tolower)
> ## docs <- tm_map(docs, removeWords, stopwords("swedish"))
> ## docs <- tm_map(docs, removeWords, stopwords("norwegian"))
> ## docs <- tm_map(docs, stemDocument, language = "swedish")
> ## docs <- tm_map(docs, stemDocument, language = "norwegian")
> docs <- tm_map(docs, stripWhitespace)
> ## docs <- tm_map(docs, PlainTextDocument)
>
> ### Stage the Data
> dtm <- DocumentTermMatrix(docs, control = list(wordLengths = c(6, Inf)))
> ## inspect(dtm[1:4,20:50])
>
>
> tdm <- TermDocumentMatrix(docs, control = list(wordLengths = c(6, Inf)))
>
>
> tdm_orig <- TermDocumentMatrix(orig_docs, control = list(wordLengths = c(6, Inf)))
> inspect(tdm_orig[1:50,2:4])
<<TermDocumentMatrix (terms: 50, documents: 3)>>
Non-/sparse entries: 23/127
Sparsity : 85%
Maximal term length: 18
Weighting : term frequency (tf)
                    Docs
Terms                n_County n_Fire n_Fire_muni
  abonnenter                0      0           0
  adressa                   0      0           0
  aftenposten               0      0           0
  aggregat                  0      0           0
  akkurat                   0      0           0
  aksept                    0      1           0
  aksjon                    0      0           1
  aksjonen                  0      1           1
  aksjonerer                0      0           1
  aktcarer                  0      0           0
  aktcbrer                  0      0           0
  aktivera                  0      0           0
  aktiverade                0      0           0
  aktiverar                 0      0           0
  aktuell                   1      0           0
  aktuelle                  1      0           0
  alarmerer                 0      0           0
  alarmert                  0      0           0
  alarmerte                 0      1           0
  allvarliga                0      0           0
  ambpersonell              0      1           0
  ambulans                  0      0           0
  ambulanse                 0      0           1
  ambulansekapasitet        0      0           1
  ambulanser                0      0           0
  ambulasns                 0      0           0
  anbefaler                 0      0           0
  angcaende                 1      2           0
  ankommer                  0      0           1
  anlcaggningarna           0      0           0
  anmode                    0      0           0
  anmodet                   0      0           0
  ansatte                   0      0           0
  anscatten                 0      0           0
  anstrcangt                0      0           0
  ansvar                    0      0           0
  arbetar                   0      0           0
  arbetas                   0      0           0
  arbetet                   0      1           0
  assistans                 0      0           0
  assistanse                1      0           0
  asylmottak                2      2           0
  avbrudd                   1      0           0
  avgrening                 0      0           0
  avgrense                  1      0           0
  avhengig                  0      0           0
  avklart                   1      0           1
  avlastning                0      0           0
  avlcbp                    0      0           0
  avlcbpsvann               1      0           0
>
>
> ### Explore the data
>
> dtms <- removeSparseTerms(dtm, 0.5) # matrix with max 50% empty cells
> ## inspect(dtms)
>
> ## Word Frequency
>
> ## freq <- freq[head(ord)]
> ## freq <- freq[tail(ord)]
>
> ## head(table(freq), 20) # 20 least frequent words
> ## tail(table(freq), 20) # 20 most frequent words
>
> freq <- colSums(as.matrix(dtms))
> ## freq
>
> freq <- sort(colSums(as.matrix(tdm_orig)), decreasing=TRUE)
>
> ## head(freq, 6) # 6 most frequent words
>
> ## findFreqTerms(dtm, lowfreq=20) # terms occurring 20 or more times
>
> wf <- data.frame(word=names(freq), freq=freq)
> ## head(wf)
>
> library(ggplot2)
>
> ## Histogram of Word Frequencies
> ## creates the file Rplots.pdf, but it is not readable as a PDF ...
>
> png("fig1.png", width=500, height=1000)
>
> ## p <- qplot(names(termFrequency), termFrequency, geom="bar", stat="identity") + coord_flip()
>
> p <- ggplot(subset(wf, freq>8), aes(word, freq))
> ## p <- p + geom_bar(stat="identity", fill="darkred", color="black") + coord_flip()
> p <- p + geom_bar(stat="identity", fill="darkred", color="black")
> p <- p + theme(axis.text.x=element_text(colour="black", angle=90, hjust=0))
> plot(p)
> dev.off()
null device
1
>
> library(wordcloud)
>
> freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
>
>
> ## set.seed(142)
> ## wordcloud(names(freq), freq, min.freq=10)
>
> ## set.seed(142)
> ## wordcloud(names(freq), freq, max.words=100)
>
> ## set.seed(142)
> ## wordcloud(names(freq), freq, min.freq=20, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))
>
> png("fig2.png")
>
> set.seed(142)
> dark2 <- brewer.pal(6, "Dark2")
> wordcloud(names(freq), freq, max.words=100, rot.per=0.2, colors=dark2)
> dev.off()
null device
1
>
> dtmss <- removeSparseTerms(tdm, 0.50) # matrix with max 50% empty cells
> ## inspect(dtmss)
>
> ## Hierarchical Clustering
>
> png("fig3.png")
>
> library(cluster)
> d <- dist(t(dtmss), method="euclidean")
>
> fit <- hclust(d=d, method="ward.D")
> fit
Call:
hclust(d = d, method = "ward.D")
Cluster method : ward.D
Distance : euclidean
Number of objects: 13
>
> plot(fit, hang=-1)
> groups <- cutree(fit, k=3) ## k defines the number of clusters
> rect.hclust(fit, k=3, border="red") ## draws red borders around the k clusters in the dendrogram
> dev.off()
null device
1
>
> ## K-means clustering
>
> png("fig4.png", width=1000, height=1000)
> library(fpc)
> d <- dist(t(dtmss), method="euclidean")
> kfit <- kmeans(d, 2)
> clusplot(as.matrix(d), kfit$cluster, color=T, shade=T, labels=2, lines=0)