PORTFOLIO (IN DEPTH) | Mknight4714

PUBLIC TRUST IN MEDIA

This group project tests 3 hypotheses in R using survey data, focusing on 6 features that measure how people consume news (with a heavy emphasis on social media) and their level of trust in various forms of media as well as in other people. The goal is to explain why people continue to get their news frequently from social networking sites even though they do not trust them.

WINE QUALITY PREDICTOR

The dataset covers 4,898 white wines from the northwest region of Portugal, a type of wine known as Vinho Verde. The overall goal of my analysis was to predict the quality of this wine from a set of 11 objective quantitative variables. Working in R, I used 2 regression methods (Multiple Linear Regression and Shrinkage Methods, predicting quality as a quantitative variable) and 2 classification methods (Support Vector Machine and Bayes Classifiers, predicting quality as a qualitative variable), then compared the models to determine which performed better.

DIGITAL CHEESE SOMMELIER

More than 3,000 cheeses exist around the world, in around 500 different varieties. Because cheese varies so greatly and has changed so much in the 10,000 years that it has existed, there is no universally agreed-upon means of classification. The challenge I decided to tackle was how to teach a computer to recommend a cheese based on its characteristics. I built a Python script for a recommender that returns 10 cheeses given a cheese name and/or up to 13 desired features.
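As a rough illustration of the idea (not the project's actual script), a content-based recommender can treat each cheese as a set of features and rank the others by cosine similarity to the query. The catalogue, feature names, and the `recommend` function below are all hypothetical, and the similarity measure is an assumption.

```python
from math import sqrt

# Hypothetical toy catalogue: cheese name -> set of descriptive features.
CHEESES = {
    "Brie": {"soft", "cow", "creamy", "mild"},
    "Camembert": {"soft", "cow", "creamy", "earthy"},
    "Cheddar": {"hard", "cow", "sharp"},
    "Manchego": {"hard", "sheep", "nutty"},
    "Feta": {"soft", "sheep", "salty", "crumbly"},
}


def cosine(a, b):
    """Cosine similarity between two feature sets treated as binary vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / sqrt(len(a) * len(b))


def recommend(name=None, features=(), k=10):
    """Return up to k cheese names ranked by similarity to the query.

    The query combines the features of the named cheese (if any)
    with any extra desired features.
    """
    query = set(features)
    if name is not None:
        query |= CHEESES[name]
    scored = [
        (cosine(query, feats), other)
        for other, feats in CHEESES.items()
        if other != name  # never recommend the input cheese itself
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for score, other in scored[:k] if score > 0]


by_name = recommend("Brie", k=2)
by_features = recommend(features=["sheep", "hard"])
```

Here `by_name` ranks cheeses closest to Brie, while `by_features` ranks cheeses against a list of desired traits alone. Cheeses with no overlap at all are left out of the results.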

EMERGENCY RESPONSE VIA TWITTER

For this client-facing project, my group was asked to use social media data to map natural disasters. We gathered our data specifically from the Twitter API. This was a great opportunity for us to be exposed to a real organization doing real work with real data.

TESTING STANDARDIZED TESTING

I examined aggregate SAT and ACT scores and participation rates for each state in the United States for 2017 and 2018. I sought to identify trends in the data and to combine my analysis with outside research to identify likely factors influencing participation rates and scores in various states.

HOUSE PRICING PREDICTOR

Using the well-known Ames housing data, with an Ames Housing .txt file as a guide, I created a regression model that predicts the price of houses in Ames, Iowa. The work required a lot of data cleaning (handling null values appropriately and dropping unneeded information), and I evaluated the correlations to determine which features had the most impact on price, both positively and negatively. This model was able to predict price with 89.9% accuracy.
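The correlation step can be sketched as follows. This is a minimal illustration, not the project's code: it assumes Pearson correlation, and the column names and numbers are made up rather than taken from the Ames data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def rank_by_correlation(columns, target):
    """Sort features by the absolute strength of their correlation with target."""
    scores = {name: pearson(values, target) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Hypothetical toy data (not real Ames rows); prices in thousands of dollars.
columns = {
    "living_area": [1000, 1500, 2000, 2500],
    "age": [50, 30, 20, 5],
}
price = [100, 150, 200, 250]

ranked = rank_by_correlation(columns, price)
```

Sorting by absolute value keeps strongly negative features (such as age in this toy data) near the top alongside strongly positive ones, while the sign of each score still shows the direction of the relationship.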

SUBREDDIT LOCATOR

My goal here was to create a model that takes a Reddit post and determines which of two possible subreddits it came from. I built a Random Forest model that predicted with 82% accuracy and a Naive Bayes model that predicted with 84% accuracy.
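To show the kind of classifier involved, here is a minimal multinomial Naive Bayes written from scratch. It is a sketch under stated assumptions, not the project's model: the subreddit labels and posts are invented, tokenization is a plain whitespace split, and Laplace (add-one) smoothing is assumed.

```python
from collections import Counter, defaultdict
from math import log


class NaiveBayes:
    """Tiny multinomial Naive Bayes text classifier with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter(labels)        # label -> number of posts
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label in sorted(self.doc_counts):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the log likelihood of each known word.
            score = log(self.doc_counts[label] / total_docs)
            for word in text.lower().split():
                if word in self.vocab:  # skip words never seen in training
                    score += log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training posts from two made-up subreddits.
posts = [
    "how long should i bake sourdough bread",
    "best pan for searing steak",
    "when to plant tomato seedlings",
    "my roses need more sun and water",
]
labels = ["cooking", "cooking", "gardening", "gardening"]

model = NaiveBayes().fit(posts, labels)
```

Calling `model.predict("bake bread in a pan")` scores the post against each subreddit and returns the label with the higher log probability. A real project would more likely use a library vectorizer and classifier than hand-rolled counts.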

Michael Knight, Data Scientist