New Startup — the WalkInJobSearch App

I would like to introduce my concept for a transformative app: the WalkInJobSearch App. The WalkInJobSearch App is a form of uberization; it employs the methodologies of gig-economy apps to transform the traditional job-search routine.

The mission of the WalkInJobSearch App is to transform the traditional method of job hunting via personal delivery of your resume. With the WalkInJobSearch App, Job Seekers can first see the Businesses where they are welcome to apply for employment in person. Businesses are able to schedule and track the arrivals of Job Seekers, personally receive their resumes, and discuss their applications for employment. The goal of the WalkInJobSearch App is to accomplish this transformation of employment searching internationally.

The App will have a Job Seeker interface and a Business interface. The App will be monetized at the point of successful application by a Job Seeker to a Business. When that occurrence is confirmed via the App, a Service Fee paid by the Business is automatically transferred. The App will thereby have a revenue stream that is not dependent on advertising. The target market for the WalkInJobSearch App will be young adults and teens looking for their first jobs, and older individuals looking for service-industry work, or work in other street-level industries. The potential revenue of the App is therefore substantial.

Investment in the WalkInJobSearch App is now open. Information about the App has been posted to AngelList — (https://angel.co/walkinjobsearch), and Crunchbase — (https://www.crunchbase.com/organization/walkinjobsearch-app#/entity). A Business Plan on Google Docs is available, and App specifications have been completed, with App prototyping via R programming.

As the App’s founder and Lead Data Scientist, I have many years of Data Management and Financial Management experience. If you (or your company) are interested in investing at a beginning equity rate of 1% per $10,000, please contact me, John Akwei ECMp ERMp Data Scientist, via LinkedIn at https://www.linkedin.com/in/john-akwei-8138b02, or via email at johnakwei1@gmail.com. The website of my Data Science consultancy is http://contextbase.github.io. Thank you very much.

ContextBase Predictive Analytics

John Akwei, ECMp ERMp Data Scientist

This document contains examples of the Predictive Analytics capabilities of ContextBase, http://contextbase.github.io.

Predictive Analytics Example 1: Linear Regression

Linear Regression allows for the prediction of future values of a response variable derived from a single explanatory variable.
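The data and model for this example could be prepared along the following lines (a minimal sketch; the assumption that the Price Index and Market Potential values come from R's built-in freeny dataset, and the column names used here, are mine):

# Assemble the example data from R's built-in 'freeny' dataset (assumed source)
freenyTable <- data.frame(PriceIndex = freeny$price.index,
                          MarketPotential = freeny$market.potential)

# Fit a simple linear regression of Market Potential on Price Index
model <- lm(MarketPotential ~ PriceIndex, data = freenyTable)

# Predict the Market Potential for a new Price Index value
test <- 4.57592
result <- predict(model, newdata = data.frame(PriceIndex = test))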

freenyTable

cat("The Intercept =", model$coefficients[1])
## The Intercept = 15.21788

(Graph: linear regression of Market Potential on Price Index)

Example 1 — Linear Regression Conclusion:

cat("For a Price Index of ", as.character(test), ", the predicted Market Potential = ", round(result, 2), ".", sep="")

## For a Price Index of 4.57592, the predicted Market Potential = 13.03.

In conclusion to ContextBase Predictive Analytics Example 1, a clear correlation of Price Index with Market Potential was found (see the graph above). As a test of the predictive algorithm, a Price Index of 4.57592 was processed, and a Market Potential of 13.03 was predicted. The source R dataset shows this prediction to be accurate.

 

Predictive Analytics Example 2: Logistic Regression

 

Logistic Regression allows for the prediction of a logical (Yes or No) outcome based on the effect of an explanatory variable on a response variable, for example the probability of winning a congressional election versus campaign expenditures.

How does the amount of money spent on a campaign affect the probability that the candidate will win the election?

Source of Data Ranges: https://www.washingtonpost.com/news/the-fix/wp/2014/04/04/think-money-doesnt-matter-in-elections-this-chart-says-youre-wrong/
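A logistic model of this kind might be fit along the following lines (a minimal sketch; the data frame 'electionData' and its columns 'Win' and 'Expenditures' are assumed names, not the original code):

# Fit a binomial GLM of election outcome on campaign spending
# ('electionData', 'Win' (1 = won, 0 = lost), and 'Expenditures' in dollars are assumptions)
model <- glm(Win ~ Expenditures, family = binomial, data = electionData)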

The logistic regression analysis gives the following output:
model$coefficients
## (Intercept) Expenditures
## -7.615054e+00 4.098080e-06

The output indicates that campaign expenditures significantly affect the probability of winning the election.

The output provides the coefficients for Intercept = -7.615054e+00, and Expenditures = 4.098080e-06. These coefficients are entered in the logistic regression equation to estimate the probability of winning the election:

Probability of winning election = 1/(1+exp(-(-7.615054e+00+4.098080e-06*CampaignExpenses)))
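The same equation can be written as a small R function (a sketch; the coefficient values are taken from the model output above, and the function name is mine):

# Probability of winning as a function of campaign expenditures, using the fitted coefficients
winProbability <- function(CampaignExpenses) {
  1/(1 + exp(-(-7.615054e+00 + 4.098080e-06*CampaignExpenses)))
}
winProbability(1600000)   # approximately 0.26, as in the example below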

 

(Graph: logistic regression of the probability of winning an election vs. campaign expenditures)

For a Candidate that has $1,600,000 in expenditures:
CampaignExpenses <- 1600000
ProbabilityOfWinningElection <- 1/(1+exp(-(-7.615054e+00+4.098080e-06*CampaignExpenses)))
cat("Probability of winning Election = 1/(1+exp(-(-7.615054e+00+4.098080e-06*",
    CampaignExpenses, "))) = ", round(ProbabilityOfWinningElection, 2), ".", sep="")
## Probability of winning Election = 1/(1+exp(-(-7.615054e+00+4.098080e-06*1600000))) = 0.26.

For a Candidate that has $2,100,000 in expenditures:
CampaignExpenses <- 2100000
ProbabilityOfWinningElection <- 1/(1+exp(-(-7.615054e+00+4.098080e-06*CampaignExpenses)))
cat("Probability of winning Election = 1/(1+exp(-(-7.615054e+00+4.098080e-06*",
    CampaignExpenses, "))) = ", round(ProbabilityOfWinningElection, 2), ".", sep="")
## Probability of winning Election = 1/(1+exp(-(-7.615054e+00+4.098080e-06*2100000))) = 0.73.

Example 2 — Logistic Regression Conclusion:

# Table of campaign expenditure levels and the corresponding predicted win probabilities
ElectionWinTable <- data.frame(column1=c(1100000, 1400000, 1700000, 1900000,
                                         2300000),
                               column2=
  c(round(1/(1+exp(-(-7.615054e+00+4.098080e-06*1100000))), 2),
    round(1/(1+exp(-(-7.615054e+00+4.098080e-06*1400000))), 2),
    round(1/(1+exp(-(-7.615054e+00+4.098080e-06*1700000))), 2),
    round(1/(1+exp(-(-7.615054e+00+4.098080e-06*1900000))), 2),
    round(1/(1+exp(-(-7.615054e+00+4.098080e-06*2300000))), 2)))
names(ElectionWinTable) <- c("CampaignExpenses", "ProbabilityOfWinningElection")

ElectionWinTable

In conclusion to ContextBase Predictive Analytics Example 2, a direct correlation of Campaign Expenditures to Election Performance was verified. The above table displays the probabilities of winning an election corresponding to various levels of campaign expenses.

 

Predictive Analytics Example 3: Multiple Regression

Multiple Regression allows for the prediction of future values of a response variable based on the values of multiple explanatory variables.
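The data and model printed below could be prepared as follows (a minimal sketch; the assumption that the variables come from R's built-in state.x77 dataset is mine, inferred from the variable names in the output):

# Build the 'input' data frame from R's built-in state.x77 matrix (assumed source)
# and rename its columns to match the regression formula
input <- as.data.frame(state.x77[, c("Life Exp", "Population", "Income", "Illiteracy")])
names(input) <- c("Life_Exp", "Population", "Income", "Illiteracy")

# Fit the multiple regression of Life Expectancy on the three explanatory variables
lifeExpectancyTable <- lm(Life_Exp ~ Population + Income + Illiteracy, data = input)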

lifeExpectancyTable
## Call:
## lm(formula = Life_Exp ~ Population + Income + Illiteracy, data = input)
##
## Coefficients:
## (Intercept) Population Income Illiteracy
## 7.120e+01 -1.024e-05 2.477e-04 -1.179e+00


a <- lifeExpectancyTable$coefficients[1]
cat("The Multiple Regression Intercept = ", a, ".", sep="")

The Multiple Regression Intercept = 71.2023.

(Graph: multiple regression of Life Expectancy on Population, Income, and Illiteracy)

Example 3 — Multiple Regression Conclusion:

# XPopulation, XIncome, and XIlliteracy are the fitted slope coefficients from the model above
XPopulation <- lifeExpectancyTable$coefficients[2]; XIncome <- lifeExpectancyTable$coefficients[3]; XIlliteracy <- lifeExpectancyTable$coefficients[4]
popl <- 3100; Incm <- 5348; Illt <- 1.1    # test-case values for Population, Income, and Illiteracy
Y <- a + popl * XPopulation + Incm * XIncome + Illt * XIlliteracy
cat("For a City where Population = ", popl, ", Income = ", Incm, ", and Illiteracy = ", Illt,
    ", the predicted Life Expectancy is: ", round(Y, 2), ".", sep="")

 

## For a City where Population = 3100, Income = 5348, and Illiteracy = 1.1,
## the predicted Life Expectancy is: 71.2.

In conclusion to ContextBase Predictive Analytics Example 3, the multiple variables of “Population”, “Income”, and “Illiteracy” were used to determine the predicted “Life Expectancy” of an area corresponding to a USA State. For an area with a Population of 3100, a per capita Income Rate of 5348, and an Illiteracy Rate of 1.1, a Life Expectancy of 71.2 years was predicted.

Graduated Data Science Course

ICYMI, I have graduated from the 1-year Johns Hopkins University / Coursera Data Science Specialization Course. I have also started a career in internet-based R programming. Some of my internet projects are at www.rpubs.com/johnakwei and www.github.com/johnakwei/MyConsultingProjects.

I have used the GitHub skills from the Data Science course to create johnakwei.github.io and contextbase.github.io.

Also, I am grateful for my new LinkedIn contacts from the MOOC course.

 

Johns Hopkins University Coursera Data Science Specialization Certificate 2015

Next Word Prediction App – Milestone Report

by John Akwei, ECMp ERMp Data Science Specialist

ContextBase

Wednesday, July 08, 2015

Synopsis

This is a milestone report on the initial stage (first month) of the creation of a Next Word Prediction App. The project is for the Data Science Capstone course from Coursera and Johns Hopkins University. The text prediction company SwiftKey is a partner in this phase of the Data Science Specialization course.

The objective of the Next Word Prediction App project (lasting two months) is to implement an application capable of predicting the most likely next word the user will input, after one or more words have been entered. The application utilizes Natural Language Processing programmed in the R language, and is hosted on the shinyapps.io platform. In order to perform Natural Language Processing, the application’s algorithm utilizes examples of natural language text from news, blogs, and Twitter, saved as .txt files.

This milestone report examines the .txt files in order to determine the characteristics of these datasets for Natural Language Processing. The datasets are statistically examined with the R programming language and the RStudio IDE.

The Natural Language Processing datasets (or “Corpora”) are available from http://www.corpora.heliohost.org. This project utilizes the same files from http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. Initial application development concentrates on English-language text only.

Data Acquisition

The source of Corpus files for Natural Language Processing is http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
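As a sketch, the archive can be downloaded and extracted as follows (the local file and folder names are assumptions):

# Download and extract the corpus archive
zipFile <- "Coursera-SwiftKey.zip"
if (!file.exists(zipFile)) {
  download.file("http://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                zipFile, mode = "wb")
}
unzip(zipFile, exdir = "Coursera-SwiftKey")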

The news, blogs, and Twitter datasets are imported as character vectors:

blogs <- readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding="UTF-8")
news <- readLines("Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding="UTF-8")
twitter <- readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt",
                     encoding="UTF-8")

Data Optimization of the NLP Dataset via Tokenization

The datasets are filtered to remove extra whitespace, punctuation, and numbers, and are then converted to lower case.
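A minimal sketch of this cleaning step in base R (the function and object names here are my own; a package such as tm could equally be used):

# Clean a character vector: lower-case, strip punctuation and digits, collapse whitespace
cleanText <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)
  x <- gsub("[[:digit:]]", "", x)
  gsub("\\s+", " ", x)
}
blogsClean   <- cleanText(blogs)
newsClean    <- cleanText(news)
twitterClean <- cleanText(twitter)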

Optimized line from blogs dataset:

“in the years thereafter most of the oil fields and platforms were named after pagan gods “

Optimized line from news dataset:

“he wasnt home alone apparently”

Optimized line from twitter dataset:

“how are you btw thanks for the rt you gonna be in dc anytime soon love to see you been way way too long”

Exploratory Data Analysis of Blogs, News, and Twitter datasets
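The figures below can be computed along these lines (a sketch for the blogs dataset; the news and twitter datasets are handled identically, using the objects read in above):

# Summary statistics for the blogs dataset (news and twitter are analogous)
file.info("Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size / 1024^2   # file size in MB
length(blogs)                                     # number of lines
sum(nchar(blogs))                                 # number of characters
summary(sapply(strsplit(blogs, "\\s+"), length))  # words per line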

File Size for blogs dataset: 200.42 MB

File Size for news dataset: 196.28 MB

File Size for twitter dataset: 159.36 MB

Lines in blogs dataset: 899288

Lines in news dataset: 77259

Lines in twitter dataset: 2360148

Characters in blogs dataset: 206824382

Characters in news dataset: 15639408

Characters in twitter dataset: 162096031

Summary of blogs dataset word count:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.00    9.00   28.00   41.75   60.00 6726.00

Summary of news dataset word count:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
1.00   19.00   32.00   34.62   46.00 1123.00

Summary of twitter dataset word count:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
1.00    7.00   12.00   12.75   18.00   47.00

Word Cloud for blogs dataset

(Figure: word cloud of the most frequent words in the blogs dataset)

Unigram Graph

(Figure: unigram frequency graph)

Interesting Findings About The Datasets

  1. The Twitter dataset is useful for general, limited text input, while the news and blogs datasets are useful for higher-level text input.
  2. Combining and tokenizing the three datasets creates non sequiturs, because the last word of one sentence is followed by the first word of the next sentence. However, the valid word sequences created by the tokenization process probably outweigh the non sequiturs in frequency, and thereby preserve the accuracy of the prediction algorithm.

Future Plan

  1. Create an Ngram Table of unigrams, bigrams, and trigrams, with preprocessed prediction unigrams and a word-frequency column to sort the most reliable predictions (see the sketch after this list).
  2. Script application code to compare user input with the prediction table.
  3. Explore ngram-based NLP for prediction of the word being typed from initial typed letters.
  4. Expand the capabilities of the algorithm to process longer lines of text.
  5. Explore learning functions to update the ngram table based on user specific next words.
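
As a minimal, self-contained sketch of the planned Ngram Table (a toy token vector stands in here for the tokenized Corpora):

# Toy example: build a frequency-sorted bigram table from tokenized text
tokens  <- c("thanks", "for", "the", "rt", "thanks", "for", "the", "follow")
bigrams <- paste(head(tokens, -1), tail(tokens, -1))   # adjacent word pairs
bigramFreq <- sort(table(bigrams), decreasing = TRUE)  # frequency column for ranking predictions
head(bigramFreq, 3)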

Summary

Initial Exploratory Data Analysis allows for an understanding of the scope of tokenization required for the final dataset. Then, via tokenization, it is possible to arrive at a final Corpus for Natural Language Processing via Ngrams.

There are about 3 million lines of text in the combined Corpora. Analysis of word frequency within the Corpora allows for reliable Statistical Inference, in order to find possible Next Words. The total object size of the Corpora can very likely be reduced to a size that prevents slow processing times, thereby allowing for real-time interaction via text input.
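A sketch of how the in-memory size might be checked, and reduced by sampling (object names are those from the Data Acquisition step; the 10% sampling rate is an assumption):

# Measure the in-memory size of one corpus and sample it down to speed up processing
print(object.size(blogs), units = "MB")
set.seed(101)
blogsSample <- sample(blogs, length(blogs) %/% 10)   # keep roughly 10% of the lines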

References

http://cran.r-project.org/web/views/NaturalLanguageProcessing.html
http://www.corpora.heliohost.org/aboutcorpus.html
http://weka.wikispaces.com/
https://en.wikipedia.org/wiki/N-gram

Developing Data Products

I couldn’t share my Data Science Specialization Course “Data Product” until now. Interestingly, I had to get through the MOOC peer review process with only 25 hours per month of online time at shinyapps.io.

The Johns Hopkins University / Coursera peer review is finished, and I received a 100% score. I am glad for the opportunity to apply my R programming skills to an initial demonstration product.

Here is the Data Product, the Community Demographics Health Status app:

https://johnakwei1.shinyapps.io/CourseProject/

Here is the sales pitch presentation that was part of the submission:

http://www.rpubs.com/johnakwei/80052

Data Science Analysis of Health and Economic Effects of USA Weather Events

Analysis of Health and Economic Effects of USA Weather Events

(NOAA Storm Database)

John Akwei, ECMp ERMp

January 21, 2015

Synopsis

This project examines the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database in order to determine the effects of major weather events on the population of the USA, in terms of health (fatalities and injuries) and economic damage (property and crops).

By applying statistical processing in the R programming language, relevant information is extracted from the NOAA Storm Database to determine the weather events that most require the development of resources and strategies to mitigate their effects on the health and property of US citizens.

Data Processing

Requirements of Processing

options(scipen=999)   # avoid scientific notation in printed output
required <- function(wd) {
  setwd(wd)
  if (!require("data.table")) { install.packages("data.table"); require("data.table") }
  if (!require("plyr")) { install.packages("plyr"); require("plyr") }
  if (!require("dplyr")) { install.packages("dplyr"); require("dplyr") }
  if (!require("reshape2")) { install.packages("reshape2"); require("reshape2") }
  if (!require("xtable")) { install.packages("xtable"); require("xtable") }
  if (!require("knitr")) { install.packages("knitr"); require("knitr") }
  if (!require("ggplot2")) { install.packages("ggplot2"); require("ggplot2") }
  if (!require("R.utils")) { install.packages("R.utils"); require("R.utils") }
}
suppressMessages(required("C:/Users/johnakwei/Desktop/Coursera/ReproducibleResearch/Week3/RepData_PeerAssessment2"))

Data download and extraction

unextracted <- "repdata-data-StormData.csv.bz2"
extracted  <- "repdata-data-StormData.csv"
dataLocation <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists(unextracted)) { download.file(dataLocation, unextracted, mode="wb") }
if (!file.exists(extracted)) { bunzip2(unextracted) }

Formatting data variables for processing

StormData <- read.table("repdata-data-StormData.csv", sep=",", header=T)
StormData$BGN_DATE <- strptime(StormData$BGN_DATE, format="%m/%d/%Y 0:00:00")
StormData$FATALITIES <- as.numeric(StormData$FATALITIES)
StormData$INJURIES <- as.numeric(StormData$INJURIES)
StormData$PROPDMG <- as.numeric(StormData$PROPDMG)

Results

Which types of events are most harmful with respect to population health?

names <- c('EVTYPE', 'SUM')
fatalities <- aggregate(StormData$FATALITIES~StormData$EVTYPE, FUN=sum)
names(fatalities) <- names
fatalities <- fatalities[order(fatalities$SUM, decreasing=T), ]

injuries <- aggregate(StormData$INJURIES~StormData$EVTYPE, FUN=sum)
names(injuries) <- names
injuries <- injuries[order(injuries$SUM, decreasing=T), ]

Major weather events for fatalities:

head(fatalities, 8)
##             EVTYPE  SUM
## 834        TORNADO 5633
## 130 EXCESSIVE HEAT 1903
## 153    FLASH FLOOD  978
## 275           HEAT  937
## 464      LIGHTNING  816
## 856      TSTM WIND  504
## 170          FLOOD  470
## 585    RIP CURRENT  368

Major weather events for injuries:

head(injuries, 8)
##             EVTYPE   SUM
## 834        TORNADO 91346
## 856      TSTM WIND  6957
## 170          FLOOD  6789
## 130 EXCESSIVE HEAT  6525
## 464      LIGHTNING  5230
## 275           HEAT  2100
## 427      ICE STORM  1975
## 153    FLASH FLOOD  1777

Graphs of Events with the Greatest Health Consequences:

ggplot(data=head(fatalities, 6), aes(EVTYPE, SUM)) +
  geom_bar(aes(), stat="identity", fill=rainbow(6)) +
  ggtitle("Graph of Major Weather Events Causing Fatalities") +
  xlab('Events') +
  ylab("Fatalities")


ggplot(data=head(injuries, 6), aes(EVTYPE, SUM)) +
  geom_bar(aes(), stat="identity", fill=rainbow(6)) +
  ggtitle("Graph of Major Weather Events Causing Injuries") +
  xlab('Events') +
  ylab("Injuries")


Across the US, which types of events have the greatest economic consequences?

propertyDamage <- aggregate(StormData$PROPDMG~StormData$EVTYPE, FUN=sum)
names(propertyDamage) <- names
propertyDamage <- propertyDamage[order(propertyDamage$SUM, decreasing=T), ]

cropDamage <- aggregate(StormData$CROPDMG~StormData$EVTYPE, FUN=sum)
names(cropDamage) <- names
cropDamage <- cropDamage[order(cropDamage$SUM, decreasing=T), ]

Major weather events for Property damage:

head(propertyDamage, 8)
##                 EVTYPE       SUM
## 834            TORNADO 3212258.2
## 153        FLASH FLOOD 1420124.6
## 856          TSTM WIND 1335965.6
## 170              FLOOD  899938.5
## 760  THUNDERSTORM WIND  876844.2
## 244               HAIL  688693.4
## 464          LIGHTNING  603351.8
## 786 THUNDERSTORM WINDS  446293.2

Major weather events for Crop damage:

head(cropDamage, 8)
##                 EVTYPE       SUM
## 244               HAIL 579596.28
## 153        FLASH FLOOD 179200.46
## 170              FLOOD 168037.88
## 856          TSTM WIND 109202.60
## 834            TORNADO 100018.52
## 760  THUNDERSTORM WIND  66791.45
## 95             DROUGHT  33898.62
## 786 THUNDERSTORM WINDS  18684.93

Graph of Events with the Greatest Economic Consequences:

econDamage <- merge(propertyDamage, cropDamage, by="EVTYPE")
names(econDamage) <- c("EVTYPE", "PROPDMG", "CROPDMG")
econDamage$SUM <- econDamage$PROPDMG + econDamage$CROPDMG   # combined property and crop damage
econDamage <- econDamage[order(econDamage$SUM, decreasing=T), ]

ggplot(data=head(econDamage, 8), aes(EVTYPE, SUM)) +
  geom_bar(aes(), stat="identity", fill=rainbow(8)) +
  coord_flip() +
  ggtitle("Graph: Weather Events, Property/Crop Damage") +
  xlab('Events') +
  ylab("Costs")


Summary

The data extracted from the NOAA Storm Database determines that the weather events requiring counter-strategies, with the objective of mitigating effects on the health of US citizens, are Tornadoes, Storm-Force Winds, Floods, Excessive Heat, and Lightning.

The weather events damaging Property the most severely are Tornadoes, Flash Floods, Storm Force Winds, and Non-Flash Flooding.

The weather events severely damaging Crops in the USA are Hail, Flash Floods, Regular Floods, Storm Force Winds, and Tornadoes.