Developing Data Products

I couldn’t share my Data Science Specialization course project for “Developing Data Products” until now. Interestingly, I had to get through the MOOC peer review process with only 25 hours per month of online time.

The Johns Hopkins University / Coursera peer review is finished, and I received a 100% score. I am glad for the opportunity to apply my R programming skills to an initial demonstration product.

Here is the Data Product, the Community Demographics Health Status app:

Here is the sales pitch presentation that was part of the submission:

Data Science Analysis of Health and Economic Effects of USA Weather Events

(NOAA Storm Database)

John Akwei, ECMp ERMp

January 21, 2015


This project examines the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, in order to determine the effects of major weather events on the population of the USA in terms of health (fatalities and injuries) and economic damage (property and crops).

By applying statistical processing in the R programming language, relevant information is extracted from the NOAA Storm Database to determine exactly which weather events require the development of resources and strategies to mitigate their effects on the health and property of US citizens.

Data Processing

Requirements of Processing

required <- function() {
  if (!require("data.table")) { install.packages("data.table"); require("data.table") }
  if (!require("plyr")) { install.packages("plyr"); require("plyr") }
  if (!require("dplyr")) { install.packages("dplyr"); require("dplyr") }
  if (!require("reshape2")) { install.packages("reshape2"); require("reshape2") }
  if (!require("xtable")) { install.packages("xtable"); require("xtable") }
  if (!require("knitr")) { install.packages("knitr"); require("knitr") }
  if (!require("ggplot2")) { install.packages("ggplot2"); require("ggplot2") }
  if (!require("R.utils")) { install.packages("R.utils"); require("R.utils") }
}
required()

Data download and extraction

unextracted <- "repdata-data-StormData.csv.bz2"
extracted <- "repdata-data-StormData.csv"
dataLocation <- ""  # URL of the NOAA Storm Data archive
if (!file.exists(unextracted)) { download.file(dataLocation, unextracted, mode="wb") }
if (!file.exists(extracted)) { bunzip2(unextracted) }  # bunzip2() is from R.utils

Formatting data variables for processing

StormData <- read.csv("repdata-data-StormData.csv", header=TRUE)
StormData$BGN_DATE <- strptime(StormData$BGN_DATE, format="%m/%d/%Y 0:00:00")
StormData$FATALITIES <- as.numeric(StormData$FATALITIES)
StormData$INJURIES <- as.numeric(StormData$INJURIES)
StormData$PROPDMG <- as.numeric(StormData$PROPDMG)
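One caveat worth noting: in the raw NOAA file, PROPDMG and CROPDMG hold magnitudes that the companion PROPDMGEXP/CROPDMGEXP columns scale (codes such as K, M, and B). The analysis below sums the unscaled magnitudes; this is a minimal sketch of applying the common multipliers (the helper name, and the assumption that only K/M/B matter, are mine):

```r
# Hypothetical helper: scale damage magnitudes by their exponent codes.
# Assumes the common K (thousands), M (millions), B (billions) codes;
# any other code is left unscaled (multiplier of 1).
expandDamage <- function(dmg, expCode) {
  multiplier <- c(K=1e3, M=1e6, B=1e9)[toupper(expCode)]
  multiplier[is.na(multiplier)] <- 1
  dmg * unname(multiplier)
}

expandDamage(c(25, 2.5, 7), c("K", "M", "?"))  # 25000 2500000 7
```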


Which types of events are most harmful with respect to population health?

names <- c('EVTYPE', 'SUM')
fatalities <- aggregate(FATALITIES ~ EVTYPE, data=StormData, FUN=sum)
names(fatalities) <- names
fatalities <- fatalities[order(fatalities$SUM, decreasing=TRUE), ]

injuries <- aggregate(INJURIES ~ EVTYPE, data=StormData, FUN=sum)
names(injuries) <- names
injuries <- injuries[order(injuries$SUM, decreasing=TRUE), ]
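Since dplyr is already among the loaded packages, the same aggregate-and-sort steps can also be written as a pipeline. A sketch on a toy data frame (the demo data is invented for illustration):

```r
library(dplyr)

# Toy stand-in for StormData, invented for illustration
demo <- data.frame(EVTYPE = c("TORNADO", "HEAT", "TORNADO", "FLOOD"),
                   FATALITIES = c(3, 1, 2, 1))

# Sum fatalities per event type, then sort descending
fatalitiesDemo <- demo %>%
  group_by(EVTYPE) %>%
  summarise(SUM = sum(FATALITIES)) %>%
  arrange(desc(SUM))

fatalitiesDemo$EVTYPE[1]  # "TORNADO" (5 fatalities)
```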

Major weather events for fatalities:

head(fatalities, 8)
##             EVTYPE  SUM
## 834        TORNADO 5633
## 130 EXCESSIVE HEAT 1903
## 153    FLASH FLOOD  978
## 275           HEAT  937
## 464      LIGHTNING  816
## 856      TSTM WIND  504
## 170          FLOOD  470
## 585    RIP CURRENT  368

Major weather events for injuries:

head(injuries, 8)
##             EVTYPE   SUM
## 834        TORNADO 91346
## 856      TSTM WIND  6957
## 170          FLOOD  6789
## 130 EXCESSIVE HEAT  6525
## 464      LIGHTNING  5230
## 275           HEAT  2100
## 427      ICE STORM  1975
## 153    FLASH FLOOD  1777

Graphs of Events with the Greatest Health Consequences:

ggplot(data=head(fatalities, 6), aes(EVTYPE, SUM)) +
  geom_bar(stat="identity", fill=rainbow(6)) +
  ggtitle("Graph of Major Weather Events Causing Fatalities") +
  xlab('Events') +
  ylab('Fatalities')


ggplot(data=head(injuries, 6), aes(EVTYPE, SUM)) +
  geom_bar(stat="identity", fill=rainbow(6)) +
  ggtitle("Graph of Major Weather Events Causing Injuries") +
  xlab('Events') +
  ylab('Injuries')
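Note that ggplot2 orders a character x-axis alphabetically, so the bars will not appear in rank order unless the factor levels are reordered by magnitude first. A sketch, using the top fatality counts from the table above:

```r
# Reorder bars by magnitude instead of the default alphabetical order
df <- data.frame(EVTYPE = c("TORNADO", "EXCESSIVE HEAT", "FLASH FLOOD"),
                 SUM = c(5633, 1903, 978))
df$EVTYPE <- reorder(df$EVTYPE, -df$SUM)  # largest SUM becomes the first level
levels(df$EVTYPE)  # "TORNADO" "EXCESSIVE HEAT" "FLASH FLOOD"
```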


Across the US, which types of events have the greatest economic consequences?

propertyDamage <- aggregate(PROPDMG ~ EVTYPE, data=StormData, FUN=sum)
names(propertyDamage) <- names
propertyDamage <- propertyDamage[order(propertyDamage$SUM, decreasing=TRUE), ]

cropDamage <- aggregate(CROPDMG ~ EVTYPE, data=StormData, FUN=sum)
names(cropDamage) <- names
cropDamage <- cropDamage[order(cropDamage$SUM, decreasing=TRUE), ]

Major weather events for Property damage:

head(propertyDamage, 8)
##                 EVTYPE       SUM
## 834            TORNADO 3212258.2
## 153        FLASH FLOOD 1420124.6
## 856          TSTM WIND 1335965.6
## 170              FLOOD  899938.5
## 760  THUNDERSTORM WIND  876844.2
## 244               HAIL  688693.4
## 464          LIGHTNING  603351.8
## 786 THUNDERSTORM WINDS  446293.2

Major weather events for Crop damage:

head(cropDamage, 8)
##                 EVTYPE       SUM
## 244               HAIL 579596.28
## 153        FLASH FLOOD 179200.46
## 170              FLOOD 168037.88
## 856          TSTM WIND 109202.60
## 834            TORNADO 100018.52
## 760  THUNDERSTORM WIND  66791.45
## 95             DROUGHT  33898.62
## 786 THUNDERSTORM WINDS  18684.93

Graph of Events with the Greatest Economic Consequences:

econDamage <- merge(propertyDamage, cropDamage, by="EVTYPE")
# merge() suffixes the shared SUM columns as SUM.x (property) and SUM.y (crops)
econDamage$SUM <- econDamage$SUM.x + econDamage$SUM.y
econDamage <- econDamage[order(econDamage$SUM, decreasing=TRUE), ]
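As an aside, when two data frames share a non-key column name, merge() disambiguates the copies with .x/.y suffixes, and a combined total must be formed from those suffixed columns. A self-contained sketch (toy data invented for illustration):

```r
# merge() suffixes the shared SUM columns as SUM.x and SUM.y
a <- data.frame(EVTYPE = c("HAIL", "FLOOD"), SUM = c(10, 20))
b <- data.frame(EVTYPE = c("HAIL", "FLOOD"), SUM = c(1, 2))
m <- merge(a, b, by = "EVTYPE")  # rows come back sorted by EVTYPE
names(m)  # "EVTYPE" "SUM.x" "SUM.y"
m$SUM <- m$SUM.x + m$SUM.y  # FLOOD: 22, HAIL: 11
```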

ggplot(data=head(econDamage, 8), aes(EVTYPE, SUM)) +
  geom_bar(stat="identity", fill=rainbow(8)) +
  coord_flip() +
  ggtitle("Graph: Weather Events, Property/Crop Damage") +
  xlab('Events') +
  ylab('Damage')



The data extracted from the NOAA Storm Database determines that the weather events requiring counter-strategies, with the objective of mitigating effects on the health of US citizens, are Tornadoes, Storm-Force Winds, Floods, Excessive Heat, and Lightning.

The weather events causing the most severe Property damage are Tornadoes, Flash Floods, Storm-Force Winds, and Non-Flash Flooding.

The weather events most severely damaging Crops in the USA are Hail, Flash Floods, Regular Floods, Storm-Force Winds, and Tornadoes.

Update on Data Science Course

After a second month, I have completed a total of 2 of the 9 sections of the Data Science Specialization Course (a Massive Open Online Course, or MOOC) from Johns Hopkins University / Coursera.

The completed course sections are: “The Data Scientist’s Toolbox”, score: 98%,

Coursera Johns Hopkins University Data Sci Toolbox 98 percent with Distinction 2014

and “R Programming”, score: 100%.

Coursera Johns Hopkins University R Programming 100 percent with Distinction 2014

Looking forward to exploring the world of online, large-dataset Data Analysis, and to adding “Data Science” to my list of skills!

Learning Data Science from Johns Hopkins University online

I earned 98.0% (with Distinction) in The Data Scientist’s Toolbox Course from Johns Hopkins University.

Data Scientist's Toolbox Certificate

This was my first Massive Open Online Course. I found the course listing and discovered the course was created as a Johns Hopkins University online course. Because this was my first MOOC, I had no idea how the course was structured. The website navigation had several possible interpretations, so the first week was spent learning exactly what was needed to access the entire learning environment. As a result, I scored 90% on the first week’s quiz.

The next three weeks of the Data Science online course didn’t involve learning a new online interface, so I scored 100% on each of the next three quizzes and on all four parts of the end-of-month course project.

Thanks to everyone who supported me by liking my test-result postings on Facebook during the one-month process.

Eight more one-month courses are required to earn the complete Data Science Specialization Certification from JHU. I am definitely looking forward to the journey!

Configuring a Google+ private Community as an ESN

In order to implement an Enterprise Social Network (ESN) for my new company, ContextBase, I decided to configure a private Google+ Community as one. An important part of this concept is the use of discussion topics configured as Company departments.

After the Community has been created as a private Community (one that cannot later be set to Public), the company logo imported, the About section written, and the company members set up, the ESN can be integrated as a Social Layer over the Company’s productivity programs.

The company members’ Google Drive accounts can be configured for data storage and productivity programs, with sharing via the Google+ Community ESN, while Gmail with ESN sharing provides the client database and correspondence functionality.

My methodology for working with a Google+ Community ESN is:

First, sign into the Company Member Google Account, then navigate to the Google+ Community ESN (ContextBase), and then open Gmail/Tasks, Calendar, Drive, and Google+ in separate tabs.

To keep business processes from miring in information-technology confusion, I recommend that all company communications take place through the Google+ ESN Community, that all completed work be posted to a department discussion thread subgroup, that person-to-person communications take place through the Google Chat dialog box at the bottom right of the Community and Google+ screens, and that video conferencing take place via Google+ Hangouts at the Google+ tab.

These photographs show the Mobile version of the ContextBase Google+ Community ESN:

1) The first photo shows an accounting file (posted to the Accounting thread) opened successfully in an Android mobile browser.


2) The second photo shows the mobile ESN with the Accounting thread selected.


Possibilities of Activity Stream Computing

This is my assessment of the possibilities of Enterprise/Consumer Social Networks as a Top Layer application, and an Activity Stream Computing environment. In the future, persons will sit down to work, whether telecommuting or working from a specifically business-oriented location, and start up their Activity Stream Computers. The Basic Layer of this Activity Stream Computer, equivalent to the Operating System, will be an Activity Stream Programming environment.

Starting with the Basic Layer Activity Stream Operating System, the Activity Stream Computer will launch applications based on Social Networking only. These applications will be the equivalent of present Social Networks – Google+, Yammer, Facebook, LinkedIn, Twitter, etc. Enterprise Social Networks, like Yammer, will have become the same as personal Consumer Social Networks. In other words, in order to work, persons will operate the same Social Networking-based application as persons operate in order to shop, interact with government, plan their lives, socialize with friends, and read/listen/watch entertainment media.

In order to socialize, shop, conduct personal business, and for entertainment, persons will use multi-function Consumer Social Networking with the same capabilities found in Enterprise Social Networking. There will be no difference between a Consumer Social Network and an Enterprise Social Network. Both will be multi-functional, and will contain within them the multiple computing applications required by persons to work, or to live their personal lives.

With this form of Activity Stream Computing, employees using Personal Social Networks at Work will eventually not be considered in any way unusual. Employees will work and take care of their personal interactions concurrently. The personal Social Networking of Employees will appear in the Activity Streams of the same Enterprise Social Network they use for Work. The work they need to perform to earn an income will appear in the Activity Streams of the Consumer Social Network they use at home, for personal business.

When a Consumer is purchasing services from a Business via Social Networking, the Employees of that Business will see the relevant Consumer, and all the other Individuals interacting with that Business, within a Social Network context. The Consumer will also see all the other Consumers interacting with the same Business, via the Activity Streams of the Consumer/Enterprise Social Network application.

It is possible that Twitter will become only a back end for Twitter-like dialog box Applications within Multi-System Social Networks. LinkedIn could become only a back end for employment networking within a Top Layer Social Network.

Therefore the Layers would be: an Activity Stream Programming base layer; a Multi-Activity Stream Layer Social Network Application; multiple Single Activity Stream Social Networks providing integrated Activity Streams for the previous Layer; and Productivity Applications that operate within the Multi-Activity Stream Social Network, with the Files produced existing in a Social context from the initial creation of the Files. Legacy concepts of Productivity Applications, including spreadsheets and slide presentations, will eventually phase out as persons and businesses switch to Productivity with Social Networking dialogs.