Augmenting R Programming/Data Science with Tableau

After years of R programming and Data Science experience, I decided to study Tableau. I was motivated by the prospect of creating improved Data Science plots with Tableau’s resources. I discovered that Tableau does, in many ways, allow for the creation of graphs that are visually preferable to ggplot2 in R. The quality of Tableau’s graphs is comparable to that of the googleVis and ggvis R packages. After previewing the R language integration in Tableau, I am glad that I have ggplot2 graphing skills for ease of coding.

With that explanation taken care of, here are samples of my new Tableau skills:

1 – A Plot of pollution readings from an Oil Production facility:

2 – A plot comparing pollution readings from 3 sources:

3 – Clustering, by Tableau, of pollution readings:

4 – A map of IoT device locations:

5 – A graph of IoT location stream durations:

6 – An IoT graph of Vehicle Count per Garage Code:

7 – An IoT graph of Vehicle Count by Total Spaces over time:

8 – A Linear Regression Plot of Movie Budgets by Rating:

9 – A Box Plot of E.Coli/S.Aureus absorbance experiments:

10 – A Quantile Distribution Plot of E.Coli/S.Aureus absorbances:

11 – A map of acquisitions by Portugal Ventures:

12 – Finally, an IoT Dashboard with selectable checkboxes:

Thanks for your interest! Examples of the R programming and Data Science expertise of my consultancy, ContextBase, are at, and the ContextBase website is at

ContextBase is available for Data Science projects. Also offered is “Diversified Portfolio Predictive Analytics”, which can predict future extreme events that might occur within your diversified portfolio.

Thank you very much!

New StartUp — the WalkInJobSearch App

I would like to introduce my concept for a transformative App – the WalkInJobSearch App. The WalkInJobSearch app is a form of uberization: it employs methodologies found in gig-economy apps in order to transform the traditional job-search regimen.

The mission of the WalkInJobSearch app is to transform the traditional method of job hunting via personal delivery of your resume. With the WalkInJobSearch app, Job Seekers can first see the Businesses where they are welcome to apply for employment in person. Businesses are able to schedule and track the arrivals of Job Seekers, personally receive their resumes, and discuss the application for employment. The goal of the WalkInJobSearch app is to accomplish this transformation of employment searching on an international basis.

The app will have a Job Seeker interface and a Business interface. The App will be monetized at the point of successful application by a Job Seeker to a Business. When that occurrence is confirmed via the App, a Service Fee paid by the Business is automatically transferred. The App will thereby have a revenue stream that is not dependent on advertising. The target market for the WalkInJobSearch App will be young adults and teens looking for their first jobs, and older individuals looking for service industry work, or work in other street-level industries. The App’s potential revenue is therefore substantial.

Development of the WalkInJobSearch app has begun via prototyping in the R programming language, using the Shinyapps R package. A GitHub repository for the app has been created, and initial R scripts uploaded. An AngelList account has been created:, and a Crunchbase account has been created:



Investing in the WalkInJobSearch app is available at a beginning equity rate of a 1% share of total valuation per $10,000 invested. Sale of the WalkInJobSearch app concept is available to investors or developers via two methods. Method 1 requires a $1,000 initial payment, plus 5% of total revenue (payable monthly) from products based on the WalkInJobSearch app concept. Method 2 requires a $10,000 initial payment, plus 1% of total revenue (payable monthly) from products based on the WalkInJobSearch app concept.

As the App’s founder and Lead Data Scientist, I have many years of Data Management and Financial Management experience. If you (or your company) are interested in investing in, or acquiring, the app concept, please contact me, John Akwei ECMp ERMp Data Scientist, via LinkedIn at, or via email at The website of my Data Science consultancy is Thank you very much!

ContextBase Predictive Analytics

John Akwei, ECMp ERMp Data Scientist

This document contains examples of the Predictive Analytics capabilities of ContextBase.

Predictive Analytics Example 1: Linear Regression

Linear Regression allows for prediction of future occurrences from the relationship between one explanatory variable and one response variable.
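The original dataset behind this example is not shown, so the fit below is a minimal sketch using hypothetical Price_Index and Market_Potential data; the variable names `model`, `test`, and `result` match those used in the output fragments that follow:

```r
# Hypothetical data: Price Index (explanatory) vs Market Potential (response)
set.seed(7)
Price_Index <- runif(30, 3, 7)
Market_Potential <- 15 - 0.5 * Price_Index + rnorm(30, sd = 0.2)

# Fit the simple linear regression
model <- lm(Market_Potential ~ Price_Index)
cat("The Intercept =", model$coefficients[1], "\n")

# Predict the response for a new Price Index value
test <- 4.57592
result <- predict(model, newdata = data.frame(Price_Index = test))
```

With simulated data the fitted intercept will of course differ from the published 15.21788; the point is the `lm()` / `predict()` workflow.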


cat("The Intercept =", model$coefficients[1])
## The Intercept = 15.21788


Example 1 — Linear Regression Conclusion:

cat("For a Price Index of ", as.character(test), ", the predicted Market Potential = ", round(result, 2), ".", sep="")

## For a Price Index of 4.57592, the predicted Market Potential = 13.03.

In conclusion to ContextBase Predictive Analytics Example 1, a direct correlation of Price Index to Market Potential was found (see the graph above). As a test of the predictive algorithm, a Price Index of 4.57592 was processed, and a Market Potential of 13.03 was predicted. The source R dataset shows this prediction to be accurate.


Predictive Analytics Example 2: Logistic Regression


Logistic Regression allows for prediction of a binary (Yes or No) outcome based on the effects of an explanatory variable on a response variable. For example, the probability of winning a congressional election vs. campaign expenditures.

How does the amount of money spent on a campaign affect the probability that the candidate will win the election?
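The original election dataset is not shown, so the sketch below simulates a hypothetical win/loss dataset and fits the logistic regression with `glm()`; the data frame `elections` and its `Win` column are assumptions for illustration:

```r
# Hypothetical election data: Expenditures vs binary Win outcome
set.seed(7)
Expenditures <- runif(200, 500000, 3000000)
Win <- rbinom(200, 1, 1/(1+exp(-(-7.6 + 4.1e-06 * Expenditures))))
elections <- data.frame(Expenditures, Win)

# Fit the logistic regression; family=binomial gives the logit link
model <- glm(Win ~ Expenditures, data = elections, family = binomial)
model$coefficients

# Predicted win probability for a given expenditure level
predict(model, data.frame(Expenditures = 1600000), type = "response")
```

`predict(…, type = "response")` applies the inverse-logit transformation automatically, which is equivalent to the hand-computed probability formula used below.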

Source of Data Ranges:

The logistic regression analysis gives the following output:
model$coefficients
## (Intercept) Expenditures
## -7.615054e+00 4.098080e-06

The output indicates that campaign expenditures significantly affect the probability of winning the election.

The output provides the coefficients for Intercept = -7.615054e+00, and Expenditures = 4.098080e-06. These coefficients are entered in the logistic regression equation to estimate the probability of winning the election:

Probability of winning election = 1/(1+exp(-(-7.615054e+00+4.098080e-06*CampaignExpenses)))



For a Candidate that has $1,600,000 in expenditures:
CampaignExpenses <- 1600000
ProbabilityOfWinningElection <- 1/(1+exp(-(-7.615054e+00+4.098080e-06*CampaignExpenses)))
cat("Probability of winning Election = 1/(1+exp(-(-7.615054e+00+4.098080e-06*",
CampaignExpenses, "))) = ", round(ProbabilityOfWinningElection, 2), ".", sep="")
## Probability of winning Election = 1/(1+exp(-(-7.615054e+00+4.098080e-06*1600000))) = 0.26.

For a Candidate that has $2,100,000 in expenditures:
CampaignExpenses <- 2100000
ProbabilityOfWinningElection <- 1/(1+exp(-(-7.615054e+00+4.098080e-06*CampaignExpenses)))
cat("Probability of winning Election = 1/(1+exp(-(-7.615054e+00+4.098080e-06*",
CampaignExpenses, "))) = ", round(ProbabilityOfWinningElection, 2), ".", sep="")
## Probability of winning Election = 1/(1+exp(-(-7.615054e+00+4.098080e-06*2100000))) = 0.73.

Example 2 — Logistic Regression Conclusion:

# Reconstructed table of campaign expenditures vs win probability
ElectionWinTable <- data.frame(
  CampaignExpenses = c(1100000, 1400000, 1700000, 1900000, 2300000),
  ProbabilityOfWinning = c(round(1/(1+exp(-(-7.615054e+00+4.098080e-06*1100000))), 2),
    round(1/(1+exp(-(-7.615054e+00+4.098080e-06*1400000))), 2),
    round(1/(1+exp(-(-7.615054e+00+4.098080e-06*1700000))), 2),
    round(1/(1+exp(-(-7.615054e+00+4.098080e-06*1900000))), 2),
    round(1/(1+exp(-(-7.615054e+00+4.098080e-06*2300000))), 2)))
ElectionWinTable


In conclusion to ContextBase Predictive Analytics Example 2, a direct correlation of Campaign Expenditures to Election Performance was verified. The above table displays the corresponding probabilities of winning an election for a range of campaign expenditures.


Predictive Analytics Example 3: Multiple Regression

Multiple Regression allows for prediction of the future values of a response variable, based on the values of multiple explanatory variables.
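The model call and coefficients below appear consistent with R's built-in `state.x77` dataset; assuming that is the source, the fit can be sketched as follows (the `names()` cleanup converts the column "Life Exp" into the `Life_Exp` identifier used in the formula):

```r
# state.x77 is a built-in matrix of 1970s US state statistics
input <- as.data.frame(state.x77)
names(input) <- gsub(" ", "_", names(input))  # "Life Exp" -> "Life_Exp"

# Regress life expectancy on three explanatory variables
model <- lm(Life_Exp ~ Population + Income + Illiteracy, data = input)
model$coefficients
```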

## Call:
## lm(formula = Life_Exp ~ Population + Income + Illiteracy, data = input)
## Coefficients:
## (Intercept) Population Income Illiteracy
## 7.120e+01 -1.024e-05 2.477e-04 -1.179e+00


a <- model$coefficients[1]
cat("The Multiple Regression Intercept = ", a, ".", sep="")

The Multiple Regression Intercept = 71.2023.


Multiple Regression Conclusion:

# Explanatory values for the prediction, and the fitted coefficients
popl <- 3100; Incm <- 5348; Illt <- 1.1
XPopulation <- model$coefficients[2]; XIncome <- model$coefficients[3]; XIlliteracy <- model$coefficients[4]
Y <- a + popl * XPopulation + Incm * XIncome + Illt * XIlliteracy
cat("For a City where Population = ", popl, ", Income = ", Incm, ", and Illiteracy = ", Illt, ",
the predicted Life Expectancy is: ", round(Y, 2), ".", sep="")


## For a City where Population = 3100, Income = 5348, and Illiteracy = 1.1,
## the predicted Life Expectancy is: 71.2.

In conclusion to ContextBase Predictive Analytics Example 3, the multiple variables “Population”, “Income”, and “Illiteracy” were used to determine the predicted “Life Expectancy” of an area corresponding to a US state. For an area with a Population of 3100, a per-capita Income of 5348, and an Illiteracy Rate of 1.1, a Life Expectancy of 71.2 years was predicted.

Graduated Data Science Course

ICYMI, I have graduated from the 1-year Johns Hopkins University / Coursera Data Science Specialization Course. Also, I have started on a career in internet R programming. Some of my internet projects are at, and

I have used my GitHub skills from the Data Science course to create, and

Also, I am grateful for my new LinkedIn contacts from the MOOC course.


Johns Hopkins University Coursera Data Science Specialization Certificate 2015

Next Word Prediction App – Milestone Report

by John Akwei, ECMp ERMp Data Science Specialist


Wednesday, July 08, 2015


This is a milestone report on the initial stage (1st month) of the creation of a Next Word Prediction App. The project is for the Data Science Capstone course from Coursera and Johns Hopkins University. The text-prediction company SwiftKey is a partner in this phase of the Data Science Specialization course.

The objective of the Next Word Prediction App project (lasting two months) is to implement an application capable of predicting the most likely next word the user will input, after one or more words have been entered. The application utilizes Natural Language Processing programmed in the R language, and is hosted by the platform. In order to perform Natural Language Processing, the application’s algorithm utilizes examples of natural-language text from news, blogs, and Twitter, saved into .txt files.

This milestone report examines the .txt files in order to determine the characteristics of these datasets for Natural Language Processing. The datasets are statistically examined with the R programming language and the RStudio IDE.

The Natural Language Processing datasets (or “Corpora”) are available from This project utilizes the same files from The initial application development will concentrate on English-language text only.

Data Acquisition

The source of Corpus files for Natural Language Processing is

The news, blogs, and twitter datasets are imported as character datasets:

blogs <- readLines("Coursera-SwiftKey/final/en_US/en_US.blogs.txt", encoding="UTF-8")
news <- readLines("Coursera-SwiftKey/final/en_US/en_US.news.txt", encoding="UTF-8")
twitter <- readLines("Coursera-SwiftKey/final/en_US/en_US.twitter.txt", encoding="UTF-8")

Data Optimization of the NLP Dataset via Tokenization

The datasets are filtered to remove whitespace, punctuation, and numbers, and then converted to lower case.
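The original preprocessing code is not shown; a minimal sketch of this cleaning step in base R (the `clean_line` helper is hypothetical) is:

```r
# Hypothetical cleaning helper: lower-case, then strip punctuation,
# digits, and extra whitespace
clean_line <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)   # remove punctuation
  x <- gsub("[[:digit:]]", "", x)   # remove numbers
  trimws(gsub("\\s+", " ", x))      # collapse and trim whitespace
}

clean_line("He wasn't home, alone... Apparently!")
# "he wasnt home alone apparently"
```

The result matches the style of the optimized lines shown below.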

Optimized line from blogs dataset:

“in the years thereafter most of the oil fields and platforms were named after pagan gods “

Optimized line from news dataset:

“he wasnt home alone apparently”

Optimized line from twitter dataset:

“how are you btw thanks for the rt you gonna be in dc anytime soon love to see you been way way too long”

Exploratory Data Analysis of Blogs, News, and Twitter datasets

File Size for blogs dataset: 200.424207687378 MB

File Size for news dataset: 196.277512550354 MB

File Size for twitter dataset: 159.364068984985 MB

Lines in blogs dataset: 899288

Lines in news dataset: 77259

Lines in twitter dataset: 2360148

Characters in blogs dataset: 206824382

Characters in news dataset: 15639408

Characters in twitter dataset: 162096031

Summary of blogs dataset word count:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.00    9.00   28.00   41.75   60.00 6726.00

Summary of news dataset word count:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
1.00   19.00   32.00   34.62   46.00 1123.00

Summary of twitter dataset word count:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
1.00    7.00   12.00   12.75   18.00   47.00

Word Cloud for blogs dataset


Unigram Graph


Interesting Findings About The Datasets

  1. The twitter dataset is useful for general, limited text input, and the news and blogs datasets are useful for higher-level text input.
  2. Combining and tokenizing the three datasets creates non sequiturs, via the last word of one sentence being followed by the first word of the following sentence. However, the valid word sequences created by the tokenization process probably outweigh the non sequiturs in frequency, and thereby preserve the accuracy of the prediction algorithm.

Future Plan

  1. Create an Ngram Table of unigrams, bigrams, and trigrams with preprocessed prediction unigrams, and a word frequency column to sort the most reliable predictions.
  2. Script application code to compare user input with the prediction table.
  3. Explore ngram-based NLP for prediction of the word being typed from initial typed letters.
  4. Expand the capabilities of the algorithm to process longer lines of text.
  5. Explore learning functions to update the ngram table based on user specific next words.
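Step 1 of the plan above can be sketched in base R with a tiny hypothetical corpus (the real Ngram table would be built from the tokenized blogs, news, and twitter data):

```r
# Tiny hypothetical corpus, already cleaned and lower-cased
corpus <- c("the cat sat", "the cat ran", "the dog sat")

# Build bigrams within each line, so sentence boundaries are respected
line_bigrams <- function(line) {
  w <- strsplit(line, " ")[[1]]
  if (length(w) < 2) return(character(0))
  paste(head(w, -1), tail(w, -1))
}
bigrams <- unlist(lapply(corpus, line_bigrams))

# Frequency table sorted so the most frequent (most reliable) predictions come first
sort(table(bigrams), decreasing = TRUE)
```

Here "the cat" appears twice and every other bigram once, so "cat" would be the top next-word prediction after "the"; the same idea extends to trigrams.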


Initial Exploratory Data Analysis allows for an understanding of the scope of tokenization required for the final dataset. Then, via tokenization, it is possible to arrive at a final Corpora for Natural Language Processing via Ngrams.

There are about 3 million lines of text in the combined Corpora. Analysis of word frequency within the Corpora allows for reliable Statistical Inference in order to find possible Next Words. The total object size of the Corpora can very likely be reduced to a file size that avoids slow processing times, thereby allowing for real-time interaction via text input.


Developing Data Products

I couldn’t share my Data Science Specialization Course “Data Product” until now. Interestingly, I had to get through the MOOC peer-review process with only 25 hours per month of online time, at

The Johns Hopkins University / Coursera peer review is finished, and I received a 100% score. I am glad for the opportunity to apply my R programming skills to an initial demonstration product.

Here is the Data Product, the Community Demographics Health Status app:

Here is the sales pitch presentation that was part of the submission: