Week 36: Medium Posts

TidyTuesday

2018

Published

December 8, 2018

#loading packages
#load data
library(readr)
#manipulate data
library(dplyr)
library(magrittr)
# formatted tables with colour tiles
library(formattable)
# knitting the document
library(knitr)
# another type of table
library(kableExtra)
# playing with strings
library(stringr)
# combining two plots
library(grid)
library(gridExtra)
# that theme you wanted
library(ggthemr)
# text analysis
library(tm)
library(SnowballC)
library(RColorBrewer)
library(wordcloud)

# adding theme called fresh for plots
ggthemr("fresh")
#loading the data
medium <- read_csv("medium_datasci.csv")
attach(medium)

This post focuses on claps in relation to authors and publications: does writing a lot of posts earn you popularity and claps? The code looks at the Top 10 authors with the most posts and the claps they have received. It also asks whether having an image in the post matters. Finally, there are word clouds of the titles for the Top 10 authors and the Top 5 publications.

GitHub Code

Claps

The table shows the 15 most frequent clap counts. 25,729 posts have 0 claps, 7,093 posts have only one clap, and 3,044 posts have 2 claps. The one odd entry is 50 claps, which 970 posts received, more than any clap count from 7 to 13 except 10.

# extracting the 15 most frequent clap counts
claps_table<-table(claps) %>%
                     sort() %>%
                     tail(15)

# table it up 
kable(t(claps_table),"html") %>%
  kable_styling(bootstrap_options = "striped",full_width = TRUE) %>%
  row_spec(0,bold = TRUE,font_size = 13,color = "grey")

Claps	13	12	9	8	11	7	50	10	6	5	4	3	2	1	0
Posts	602	634	661	721	759	864	970	989	1124	1389	1402	2150	3044	7093	25729
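
The same kind of frequency table can also be built without relying on attach(), by counting clap values with dplyr. This is only a minimal sketch: the toy data frame and its values are invented for illustration and are not the Medium data, and the column name posts is my own choice.

```r
library(dplyr)

# toy stand-in for the medium data; values are invented for illustration
toy <- tibble(claps = c(0, 0, 0, 1, 1, 2, 50, 50, 0, 3))

# count how many posts share each clap value,
# then keep the three most frequent clap values
claps_freq <- toy %>%
  count(claps, name = "posts") %>%
  slice_max(posts, n = 3, with_ties = FALSE)

print(claps_freq)
```

On the real data, replacing toy with medium and n = 3 with n = 15 should give the same counts as the table above, sorted from most to least frequent.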

Top 10 Authors and Claps for their posts

Only two of these ten posts do not have an image in their content. The highest number of claps is 60,000, for a post written by Sophia Ciocca under the title “How Does Spotify Know You So Well?”. Second place goes to the article “Blockchain is not only crappy technology but a bad vision for the future”, written by Kai Stinchcombe, with 53,000 claps. De Xun is the only author with two articles in this list, in places 8 and 9, with 37,000 and 36,000 claps respectively.

# select author, title, subtitle, claps and image, then keep the 10 posts with the most claps
claps_A_I<-medium[,c("author","title","subtitle","claps","image")] %>%
              arrange(claps) %>%
              tail(10)
names(claps_A_I)<-c("Author","Title","Subtitle","Claps","Image")

# table it
formattable(claps_A_I[,-3],align=c("l","l","r","c"),
            list(
              Claps=color_tile("lightblue","blue"),
              Image=color_tile("red","green")
            ))

Author	Title	Claps	Image
Vishal Maini	A Beginners Guide to AI/ML	36000	1
De Xun	SWIPE Bi-Weekly Update, 16th-27th July	36000	1
De Xun	[PARTNERSHIP] SWIPE-Bluzelle: Building SWIPEs decentralized database	37000	1
Andrej Karpathy	Software 2.0	38000	0
Radu Raicea	Want to know how Deep Learning works? Heres a quick guide for everyone.	39000	1
Anything App	Fast-forward twenty years with Anything App.	42000	0
Michael Jordan	Artificial IntelligenceThe Revolution Hasnt Happened Yet	46000	1
Xiaohan Zeng	I interviewed at five top companies in Silicon Valley in five days, and luckily got five job offers	49000	1
Kai Stinchcombe	Blockchain is not only crappy technology but a bad vision for the future	53000	1
Sophia Ciocca	How Does Spotify Know You So Well?	60000	1
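
A top 10 list cannot say much about whether an image matters in general. One possible way to look at the question across all posts is to summarise claps by the image flag. The sketch below assumes image is coded 0/1 as in the table above; the toy data frame is invented for illustration, so its numbers say nothing about the real data.

```r
library(dplyr)

# toy stand-in for the medium data; values are invented for illustration
toy <- tibble(
  claps = c(0, 5, 120, 3, 0, 40, 2, 900),
  image = c(0, 1, 1, 0, 0, 1, 1, 1)
)

# number of posts and typical claps for posts without (0) and with (1) an image
image_summary <- toy %>%
  group_by(image) %>%
  summarise(
    posts = n(),
    median_claps = median(claps),
    mean_claps = mean(claps)
  )

print(image_summary)
```

The median is included next to the mean because clap counts are heavily skewed: most posts have very few claps and a handful have tens of thousands.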

Word Cloud for the Titles from Top 10 Authors

The word cloud below is built from the titles of posts by the Top 10 authors with the most posts. The words thing, read, data, drone and new are mentioned most often, followed by words such as big, telecom and tech. As restrictions, I set the word cloud to show at most 1500 words, and only words that have a frequency of at least 10.

Well, I can clearly see that there are nowhere near 1500 words here.

#convert into data frame
Top10_author<-data.frame(Top10_author)

# build the corpus from the titles
Top10_author.Corpus<-Corpus(VectorSource(Top10_author$title))

# clean the data: lower case first, then numbers, stop words,
# punctuation, extra whitespace and finally stemming
Top10_author.Clean<-tm_map(Top10_author.Corpus,content_transformer(tolower))
Top10_author.Clean<-tm_map(Top10_author.Clean,removeNumbers)
Top10_author.Clean<-tm_map(Top10_author.Clean,removeWords,stopwords("english"))
Top10_author.Clean<-tm_map(Top10_author.Clean,removePunctuation)
Top10_author.Clean<-tm_map(Top10_author.Clean,stripWhitespace)
Top10_author.Clean<-tm_map(Top10_author.Clean,stemDocument)

# save as png
#png(filename = "wordcloud1.png",width = 1024,height = 768)
# plot the word cloud
wordcloud(Top10_author.Clean,max.words = 1500,min.freq = 10,
          colors = brewer.pal(11,"Spectral"),random.color = FALSE,
          random.order = TRUE)
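
To check how many words can actually pass the minimum frequency restriction, the term frequencies can be counted directly with a term-document matrix from tm. This is a minimal sketch on invented titles, with a threshold of 2 instead of 10 so that the toy example has something to show; titles, term_freq and eligible are names made up for this illustration.

```r
library(tm)

# toy titles; invented for illustration
titles <- c("data science data", "data drone",
            "drone science data", "telecom thing")

corpus <- VCorpus(VectorSource(titles))
tdm <- TermDocumentMatrix(corpus)

# total frequency of each term across all titles
term_freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

# terms that would survive the minimum frequency threshold
# (2 here, 10 in the word cloud above)
eligible <- names(term_freq[term_freq >= 2])

print(term_freq)
print(length(eligible))
```

If the number of eligible terms is far below max.words, the min.freq setting is the restriction that decides what the word cloud shows.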

Word Cloud for the Titles from Top 5 publications

This word cloud has the same restrictions on the number of words and the minimum frequency of a word. Because the titles are stemmed, the words appear as stems: data, learn, use, machin, network, deep, scienc and artifici are the most frequent. Further, words such as neural, intellig, chatbot, part and python also appear with significant frequency. Here we can clearly see that there can be more than 1000 words.

#convert into data frame
Top5_pub<-data.frame(Top5_pub)

# build the corpus from the titles
Top5_pub.Corpus<-Corpus(VectorSource(Top5_pub$title))

# clean the data: lower case first, then numbers, stop words,
# punctuation, extra whitespace and finally stemming
Top5_pub.Clean<-tm_map(Top5_pub.Corpus,content_transformer(tolower))
Top5_pub.Clean<-tm_map(Top5_pub.Clean,removeNumbers)
Top5_pub.Clean<-tm_map(Top5_pub.Clean,removeWords,stopwords("english"))
Top5_pub.Clean<-tm_map(Top5_pub.Clean,removePunctuation)
Top5_pub.Clean<-tm_map(Top5_pub.Clean,stripWhitespace)
Top5_pub.Clean<-tm_map(Top5_pub.Clean,stemDocument)

# save as png
#png(filename = "wordcloud2.png",width = 1024,height = 768)
# plot the word cloud
wordcloud(Top5_pub.Clean,max.words = 1500,min.freq = 10,
          colors = brewer.pal(11,"Spectral"),random.color = FALSE,
          random.order = TRUE)
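
The cleaning steps are the same for both word clouds, so one option is to wrap them in a small helper. The function clean_titles below is hypothetical (it is not part of tm or of the original code) and is only a sketch of the same steps applied to a character vector of titles; the two example titles are invented.

```r
library(tm)
library(SnowballC)

# hypothetical helper: takes a character vector of titles
# and returns a cleaned, stemmed corpus
clean_titles <- function(titles) {
  corpus <- VCorpus(VectorSource(titles))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  tm_map(corpus, stemDocument)
}

# invented example titles
cleaned <- clean_titles(c("Learning Deep Networks in 2018!",
                          "The Data Science of Machines"))

# pull the cleaned text back out as a character vector
cleaned_text <- unlist(lapply(cleaned, as.character))
print(cleaned_text)
```

With such a helper, each word cloud would only need one call, for example clean_titles(Top5_pub$title), before being passed to wordcloud().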

Conclusion

My conclusions from the above plots and tables, in point form:

dplyr was useful for manipulating the data set when things got complicated.
The grid and gridExtra packages provide a safe way of combining multiple plots into one plot.
formattable and kableExtra were crucial in generating informative tables.
Word clouds and text analysis are very useful and flexible with the above packages.

Further Analysis

Similarly, we can repeat the above analysis for the Top 5 publications and other variables.
Word clouds for subtitles would also be interesting to see, especially focusing on authors and publications.

Please see that
This is my sixth post on the internet, so my mistakes in grammar and spelling should be far fewer than in previous posts. I intend to post more statistics-related material in the future and learn accordingly. Thank you for reading.

THANK YOU
