Next Word Prediction App – Milestone Report

by John Akwei, ECMp ERMp Data Science Specialist


Wednesday, July 08, 2015


This is a milestone report on the initial stage (the first month) of the creation of a Next Word Prediction App. The project is for the Data Science Capstone course from Coursera and Johns Hopkins University. The text prediction company SwiftKey is a partner in this phase of the Data Science Specialization course.

The objective of the Next Word Prediction App project (lasting two months) is to implement an application capable of predicting the most likely next word the application user will input, after one or more words have been entered. The application utilizes Natural Language Processing programmed in the R language, and is hosted on a web application platform. In order to perform Natural Language Processing, the application’s algorithm utilizes examples of natural language text from news, blogs, and Twitter, saved as .txt files.

This milestone report examines the .txt files in order to determine the characteristics of these datasets for Natural Language Processing. The datasets are statistically examined with the R programming language and the RStudio IDE.

The Natural Language Processing datasets (or “Corpora”) are provided by the Capstone course, and this project utilizes those files. The initial application development will concentrate on English-language text only.

Data Acquisition

The Corpus files for Natural Language Processing come from the Coursera-SwiftKey dataset provided by the course.

The news, blogs, and twitter datasets are imported as character datasets:

blogs <- readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding="UTF-8")
news <- readLines("Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding="UTF-8")
twitter <- readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding="UTF-8")

Data Optimization of the NLP Dataset via Tokenization

The datasets are filtered to remove extra whitespace, punctuation, and numbers, and are then converted to lower case.
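The report’s exact cleaning code is not shown here, so the following base-R helper is a minimal sketch of this step (one plausible implementation, not the project’s definitive code):

clean_text <- function(x) {
  x <- tolower(x)                  # convert to lower case
  x <- gsub("[[:punct:]]", "", x)  # remove punctuation
  x <- gsub("[[:digit:]]", "", x)  # remove numbers
  gsub("\\s+", " ", x)             # collapse extra whitespace
}

blogs <- clean_text(blogs)
news <- clean_text(news)
twitter <- clean_text(twitter)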

Optimized line from blogs dataset:

“in the years thereafter most of the oil fields and platforms were named after pagan gods “

Optimized line from news dataset:

“he wasnt home alone apparently”

Optimized line from twitter dataset:

“how are you btw thanks for the rt you gonna be in dc anytime soon love to see you been way way too long”

Exploratory Data Analysis of Blogs, News, and Twitter datasets

File size of blogs dataset: 200.42 MB

File size of news dataset: 196.28 MB

File size of twitter dataset: 159.36 MB

Lines in blogs dataset: 899288

Lines in news dataset: 77259

Lines in twitter dataset: 2360148

Characters in blogs dataset: 206824382

Characters in news dataset: 15639408

Characters in twitter dataset: 162096031

Summary of blogs dataset word count:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.00    9.00   28.00   41.75   60.00 6726.00

Summary of news dataset word count:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
1.00   19.00   32.00   34.62   46.00 1123.00

Summary of twitter dataset word count:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
1.00    7.00   12.00   12.75   18.00   47.00
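Statistics like those above can be computed in base R, as sketched below (assuming the datasets loaded during Data Acquisition; the report’s original code is not shown):

file.info("Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size / 1024^2  # file size in MB
length(blogs)                                   # lines in the dataset
sum(nchar(blogs))                               # characters in the dataset
wordsPerLine <- sapply(strsplit(blogs, "\\s+"), length)
summary(wordsPerLine)                           # per-line word-count summary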

Word Cloud for blogs dataset


Unigram Graph


Interesting Findings About The Datasets

  1. The twitter dataset is useful for short, general text input, and the news and blogs datasets are useful for higher-level text input.
  2. Combining and tokenizing the three datasets creates non sequiturs when the last word of one sentence is followed by the first word of the next sentence, as illustrated below. However, the genuine word sequences produced by tokenization probably outweigh these non sequiturs in frequency, thereby preserving the accuracy of the prediction algorithm.
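A small illustration of point 2, using two hypothetical cleaned lines:

lines <- c("the oil fields", "he wasnt home")   # hypothetical cleaned lines
words <- unlist(strsplit(paste(lines, collapse=" "), " "))
paste(head(words, -1), tail(words, -1))  # bigrams: "fields he" spans the sentence boundary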

Future Plan

  1. Create an Ngram Table of unigrams, bigrams, and trigrams, with preprocessed prediction unigrams and a word-frequency column used to sort the most reliable predictions (see the sketch after this list).
  2. Script application code to compare user input with the prediction table.
  3. Explore ngram-based NLP for predicting the word currently being typed from its initial letters.
  4. Expand the capabilities of the algorithm to process longer lines of text.
  5. Explore learning functions that update the ngram table based on user-specific next words.
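As a preview of item 1, here is a minimal sketch of a frequency-sorted bigram table in base R, reusing the clean_text() helper sketched earlier; the table format is an assumption, not the project’s final design:

make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse=" "))
}

words <- unlist(strsplit(clean_text(blogs), " "))
bigramTable <- sort(table(make_ngrams(words, 2)), decreasing=TRUE)
head(bigramTable)  # the most frequent bigrams, i.e. the most reliable predictions

Prediction then reduces to matching the user’s last word against the bigram prefixes and returning the highest-frequency completion.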


Initial Exploratory Data Analysis allows for an understanding of the scope of tokenization required for the final dataset. Then, via tokenization, it is possible to arrive at a final Corpus for Natural Language Processing via Ngrams.

There are about 3 million lines of text in the combined Corpora. Analysis of word frequency within the Corpora allows for reliable statistical inference to find likely next words. The total object size of the Corpora can very likely be reduced to a file size that prevents slow processing times, thereby allowing for real-time interaction via text input.
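One common way to achieve that reduction (an assumption here, not stated in the report) is to sample a fraction of each dataset before building the Ngram table:

set.seed(1234)  # for reproducibility
frac <- 0.05    # hypothetical sampling fraction
corpus <- c(sample(blogs, round(length(blogs) * frac)),
            sample(news, round(length(news) * frac)),
            sample(twitter, round(length(twitter) * frac)))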


Developing Data Products

I couldn’t share my Data Science Specialization Course “Data Product” until now. Interestingly, I had to get through the MOOC peer review process with only 25 hours per month of online time.

The Johns Hopkins University / Coursera peer review is finished, and I received a 100% score. I am glad for the opportunity to apply my R programming skills to an initial demonstration product.

Here is the Data Product, the Community Demographics Health Status app:

Here is the sales pitch presentation that was part of the submission:

Data Science Analysis of Health and Economic Effects of USA Weather Events

Analysis of Health and Economic Effects of USA Weather Events

(NOAA Storm Database)

John Akwei, ECMp ERMp

January 21, 2015


This project examines the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, in order to determine the effects of major weather events on the population of the USA, in terms of health (fatalities and injuries) and economic damage (property and crops).

By applying statistical processing in the R programming language, relevant information is extracted from the NOAA Storm Database to determine exactly which weather events require the development of resources and strategies to mitigate their effects on the health and property of US citizens.

Data Processing

Requirements of Processing

required <- function() {
  # Install and load each package only if it is not already available
  if (!require("data.table")) { install.packages("data.table"); require("data.table") }
  if (!require("plyr")) { install.packages("plyr"); require("plyr") }
  if (!require("dplyr")) { install.packages("dplyr"); require("dplyr") }
  if (!require("reshape2")) { install.packages("reshape2"); require("reshape2") }
  if (!require("xtable")) { install.packages("xtable"); require("xtable") }
  if (!require("knitr")) { install.packages("knitr"); require("knitr") }
  if (!require("ggplot2")) { install.packages("ggplot2"); require("ggplot2") }
  if (!require("R.utils")) { install.packages("R.utils"); require("R.utils") }
}
required()

Data download and extraction

unextracted <- "repdata-data-StormData.csv.bz2"
extracted  <- "repdata-data-StormData.csv"
dataLocation <- ""
if (!file.exists(unextracted)) { download.file(dataLocation, unextracted, mode="wb") }
if (!file.exists(extracted)) { bunzip2(unextracted) }  # bunzip2() is from R.utils

Formatting data variables for processing

StormData <- read.table("repdata-data-StormData.csv", sep=",", header=T)
StormData$BGN_DATE <- strptime(StormData$BGN_DATE, format="%m/%d/%Y 0:00:00")
StormData$FATALITIES <- as.numeric(StormData$FATALITIES)
StormData$INJURIES <- as.numeric(StormData$INJURIES)
StormData$PROPDMG <- as.numeric(StormData$PROPDMG)


Which types of events are most harmful with respect to population health?

names <- c('EVTYPE', 'SUM')
fatalities <- aggregate(StormData$FATALITIES~StormData$EVTYPE, FUN=sum)
names(fatalities) <- names
fatalities <- fatalities[order(fatalities$SUM, decreasing=T), ]

injuries <- aggregate(StormData$INJURIES~StormData$EVTYPE, FUN=sum)
names(injuries) <- names
injuries <- injuries[order(injuries$SUM, decreasing=T), ]

Major weather events for fatalities:

head(fatalities, 8)
##             EVTYPE  SUM
## 834        TORNADO 5633
## 130 EXCESSIVE HEAT 1903
## 153    FLASH FLOOD  978
## 275           HEAT  937
## 464      LIGHTNING  816
## 856      TSTM WIND  504
## 170          FLOOD  470
## 585    RIP CURRENT  368

Major weather events for injuries:

head(injuries, 8)
##             EVTYPE   SUM
## 834        TORNADO 91346
## 856      TSTM WIND  6957
## 170          FLOOD  6789
## 130 EXCESSIVE HEAT  6525
## 464      LIGHTNING  5230
## 275           HEAT  2100
## 427      ICE STORM  1975
## 153    FLASH FLOOD  1777

Graphs of Events with the Greatest Health Consequences:

ggplot(data=head(fatalities, 6), aes(EVTYPE, SUM)) +
  geom_bar(stat="identity", fill=rainbow(6)) +
  ggtitle("Graph of Major Weather Events Causing Fatalities") +
  xlab('Events') +
  ylab('Fatalities')


ggplot(data=head(injuries, 6), aes(EVTYPE, SUM)) +
  geom_bar(stat="identity", fill=rainbow(6)) +
  ggtitle("Graph of Major Weather Events Causing Injuries") +
  xlab('Events') +
  ylab('Injuries')


Across the US, which types of events have the greatest economic consequences?

propertyDamage <- aggregate(StormData$PROPDMG~StormData$EVTYPE, FUN=sum)
names(propertyDamage) <- names
propertyDamage <- propertyDamage[order(propertyDamage$SUM, decreasing=T), ]

cropDamage <- aggregate(StormData$CROPDMG~StormData$EVTYPE, FUN=sum)
names(cropDamage) <- names
cropDamage <- cropDamage[order(cropDamage$SUM, decreasing=T), ]

Major weather events for Property damage:

head(propertyDamage, 8)
##                 EVTYPE       SUM
## 834            TORNADO 3212258.2
## 153        FLASH FLOOD 1420124.6
## 856          TSTM WIND 1335965.6
## 170              FLOOD  899938.5
## 760  THUNDERSTORM WIND  876844.2
## 244               HAIL  688693.4
## 464          LIGHTNING  603351.8
## 786 THUNDERSTORM WINDS  446293.2

Major weather events for Crop damage:

head(cropDamage, 8)
##                 EVTYPE       SUM
## 244               HAIL 579596.28
## 153        FLASH FLOOD 179200.46
## 170              FLOOD 168037.88
## 856          TSTM WIND 109202.60
## 834            TORNADO 100018.52
## 760  THUNDERSTORM WIND  66791.45
## 95             DROUGHT  33898.62
## 786 THUNDERSTORM WINDS  18684.93

Graph of Events with the Greatest Economic Consequences:

econDamage <- merge(propertyDamage, cropDamage, by="EVTYPE")
names(econDamage) <- c("EVTYPE", "PROP", "CROP")
econDamage$SUM <- econDamage$PROP + econDamage$CROP  # total property + crop damage
econDamage <- econDamage[order(econDamage$SUM, decreasing=T), ]

ggplot(data=head(econDamage, 8), aes(EVTYPE, SUM)) +
  geom_bar(stat="identity", fill=rainbow(8)) +
  coord_flip() +
  ggtitle("Graph: Weather Events, Property/Crop Damage") +
  xlab('Events') +
  ylab('Damage')



The data extracted from the NOAA Storm Database determines that the weather events requiring counter-strategies, with the objective of mitigating effects on US citizens’ health, are Tornadoes, Storm-Force Winds, Floods, Excessive Heat, and Lightning.

The weather events damaging Property the most severely are Tornadoes, Flash Floods, Storm Force Winds, and Non-Flash Flooding.

The weather events severely damaging Crops in the USA are Hail, Flash Floods, Regular Floods, Storm Force Winds, and Tornadoes.

Update on Data Science Course

After a second month, I have completed a total of 2 sections of the 9-section Data Science Specialization Course (a Massive Online Open Course, or MOOC) from Johns Hopkins University / Coursera.

The completed course sections are “The Data Scientist’s Toolbox”, score: 98% (with Distinction), and “R Programming”, score: 100% (with Distinction).

I am looking forward to exploring the world of online, large-dataset Data Analysis, and to adding “Data Science” to my list of skills!

Learning Data Science from Johns Hopkins University online

I earned 98.0% (with Distinction) in The Data Scientist’s Toolbox Course from Johns Hopkins University.

Data Scientist's Toolbox Certificate

This was my first Massive Online Open Course. I found the course at Coursera, and discovered it was created as a Johns Hopkins University online course. Because this was my first MOOC, I had no idea what the structure of the course was. The website navigation had several possible interpretations, so the first week was spent learning exactly what was needed to access the entire learning environment. As a result, I scored 90% on the first week’s quiz.

The next three weeks of the Data Science online course didn’t involve learning a new online interface, so I scored 100% on each of the next three quizzes and on all four parts of the end-of-month course project.

Thanks to everyone who supported me by liking my test-result postings on Facebook during the one-month process.

Eight more one-month courses are required to gain the entire Data Science Specialization Certification from JHU. I am definitely looking forward to the journey!

Configuring a Google+ private Community as an ESN

In order to implement an ESN (Enterprise Social Network) for my new company, ContextBase, I decided to configure a Google+ private Community as an Enterprise Social Network. An important part of this concept is the use of discussion topics configured as company departments.

After the Community has been created as private (a private Community cannot later be set to Public), the company logo imported, the About section created, and the company members set up, the ESN can be integrated as a social layer over the company’s productivity programs.

The company members’ Google Drive accounts are configurable for data storage and productivity programs, with sharing via the Google+ Community ESN; Gmail with ESN sharing provides client database and correspondence functionality.

My methodology for working with a Google+ Community ESN is:

First, sign into the company member’s Google Account, then navigate to the Google+ Community ESN (ContextBase), and then open Gmail/Tasks, Calendar, Drive, and Google+ in separate tabs.

So that business processes do not mire in information technology confusion, I recommend that:

  1. All company communications take place through the Google+ ESN Community.
  2. All completed work is posted to a department discussion thread subgroup.
  3. Person-to-person communications take place through the Google Chat dialog box at the bottom right of the Community and Google+ screens.
  4. Video conferencing takes place via Google+ Hangouts at the Google+ tab.

These photographs show the mobile version of the ContextBase Google+ Community ESN:

  1) The first photo shows an accounting file (posted to the Accounting thread) opened successfully in an Android mobile browser.


2) The second photo shows the mobile ESN with the Accounting thread selected.