Henry Sue

Henry Sue's Website



Hello!

Hello! Welcome to my personal website. I've created this page as a way to document my past work and provide information about myself for friends, family, and potential employers!

Feel free to check out my "About Me" page for more information about me, or find my contact info on my "Contact Me" page. I'll be periodically updating this website as I continue my data science journey, so keep your eye out for more! Feel free to reach out to me about anything and everything; I'd be more than happy to chat.

You can also download my resume here.

Latest Posts


Learning React

Posted on 12/27/2020

After taking an entire day to binge-complete the 'Learn JavaScript' module on Codecademy, I feel that I have a good grasp of JavaScript syntax, though I'm sure there is much, much more to learn. The latter half of the module focused heavily on working with GET and POST requests, both with and without async/await. This feels tedious while going through it, but it helps ingrain JavaScript-y procedures. I can't help but actually like the curly-bracket syntax, even though I used to hate trying to read it (compared to a sugary language like Python). As my next step, I will try learning some React fundamentals to see if I can throw together an MVP of a full-stack project. Node.js is also on my radar, since it's a popular backend framework, but for now I'll focus on the front end, since that's enough to boilerplate an MVP. Wish me luck!

Learning Javascript

Posted on 12/26/2020

Work has slowed to a crawl due to the holidays, and it is Winter Break for my Master's program, so I have decided to make use of the free time to learn JavaScript. It is important to me to keep working on my software development skills: Data Science is super interesting, but it is useless to a business without the skills to take it to production. In today's software engineering climate, JavaScript reigns supreme, as it is a lightweight, high-level programming language that works in browsers. Comparatively, Python is fairly clunky, even though it is very simple syntactically. I love the Julia language, as it has the best of Python (simple syntax, easy prototyping with Jupyter notebooks) while being blazing fast for a high-level language. Unfortunately, adoption of Julia has been slow (except among academics, ironically). Therefore, I think learning JavaScript is the best use of my time! After I get familiar with the language, I want to learn Node.js and React or Vue to work across the full stack. Wish me luck!

Naive Bayes Classifier from Scratch

Posted on 12/16/2020

Hello, a quick update for my projects page. I have added my paper for a Naive Bayes Classifier from scratch. This project covers the basic principles behind the Naive Bayes Classifier, as well as a comparison of my implementation against the Naive Bayes module in scikit-learn. Click here for the project paper.
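The full write-up is in the linked paper; purely as a flavor of the idea (my own toy sketch here, not the implementation from the paper), a Gaussian Naive Bayes classifier boils down to per-class feature means and variances plus class priors, with prediction by maximum log posterior:

```python
import numpy as np

class GaussianNB:
    """Minimal Gaussian Naive Bayes sketch: model each feature within
    each class as an independent Gaussian, then pick the class with the
    highest log posterior."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # Per-class feature means and variances (tiny epsilon avoids
        # division by zero for constant features).
        self.theta_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        self.var_ = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes_])
        self.priors_ = np.array([(y == c).mean() for c in self.classes_])
        return self

    def predict(self, X):
        # log P(c) + sum_j log N(x_j | mu_cj, var_cj), maximized over c.
        log_post = np.log(self.priors_) - 0.5 * (
            np.log(2 * np.pi * self.var_).sum(axis=1)
            + (((X[:, None, :] - self.theta_) ** 2) / self.var_).sum(axis=2)
        )
        return self.classes_[np.argmax(log_post, axis=1)]
```

The "naive" part is the independence assumption between features, which is what lets the joint likelihood factor into that per-feature sum of log densities.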

Classifying Online Shopper’s Intention

Posted on 07/01/2020

Hello! Thank you for visiting my website. I have uploaded my project in which I use machine learning to attempt to predict whether an online shopper will complete a purchase or exit the website they are on. The data comes from a dataset in the UCI Machine Learning Repository. You can check this project out by visiting my "Projects" page in the navbar or by clicking here.

Multi-Sample Dropout for Better Generalization

Posted on 02/17/2020

Across multiple top entries on Kaggle, a common theme has emerged for increasing model generalization while decreasing training time: multi-sample dropout. A prominent example is the 1st place model for the Google QUEST Q&A Labeling challenge on Kaggle.

Dropout is a technique where, during training, a deep neural network randomly discards a fraction of its neurons to avoid overfitting. In the original implementation (Hinton et al., 2012), the authors omitted 50% of the neurons on each training instance. The fundamental idea behind dropout is that it discourages co-adaptations, where a feature detector is only useful in the context of one or several other feature detectors. Introducing dropout encourages a more robust model that identifies distinct features, rather than coagulating feature detection within the hidden layers of a deep neural net.
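To make the mechanism concrete, here is a minimal sketch of dropout on an activation vector in NumPy. (This is the "inverted" variant, which rescales survivors by 1/(1-p) during training so nothing needs to change at inference time; Hinton et al.'s original formulation instead scaled weights at test time.)

```python
import numpy as np

def dropout(activations, p=0.5, rng=None):
    """Inverted dropout: zero each neuron with probability p during
    training, scaling the survivors by 1/(1-p) so the expected
    activation is unchanged at inference time."""
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(activations.shape) >= p  # keep with probability 1-p
    return activations * mask / (1.0 - p)

# With p=0.5, roughly half the activations are zeroed and the
# survivors are doubled.
a = np.ones(10)
print(dropout(a, p=0.5, rng=np.random.default_rng(0)))
```

Because the surviving neurons change on every training instance, no single neuron can rely on a fixed partner being present, which is exactly the co-adaptation effect described above.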

"Multi-Sample Dropout" (Inoue 2019) is an extended dropout technique that involves creating multiple dropout samples, then calculating the average loss and emsembling the dropout samples. This process relies on the idea that an ensemble will perform better than any single sample. Using this ensemble method, the number of iterations is drastically reduced. A note by author indicates that this method does not provide gain without diversity in the dropout samples. Despite dropout inherently introducing diversity, the author recommends applying other methods to bolster the diversity in samples, such as horizontal flipping (in images that are not affected by horizontal mirroring) and zero padding at the pooling layer. These are typical in data augmentation and are widely used to aid in adding robustness to a model as well as to synthesize additional unique training data for the model to generalize on. Zero padding is necessary when a pooling layer is not a direct multiple of the window size. Thus, zero padding must be added to the margins to ensure that the pooling layer is able to read the image.

It is no surprise, then, that this method shows up in winning Kaggle entries: it both reduces runtime and increases model accuracy and generalization. It is an almost necessary technique for top-level models and will continue to add value to models to come.

Uploading projects to this page

Posted on 02/06/2020

I have been working on uploading my projects so that my projects page can serve as a portfolio for potential employers to view. Please take a look at my projects page to see some of my previous work, and keep your eye out for updates to existing projects. As of 02/06, I only have two projects uploaded, but I will work to get more up as soon as possible.

Feature Selection: Sample Size > 10

Posted on 01/06/2020

I recently read an article about how a sample size of 10 is not an arbitrarily chosen number for assessing whether independent variables are useful when fitting a regression model. The article can be found at http://nicholasjjackson.com/2012/03/08/sample-size-is-10-a-magic-number/. It presents a very interesting idea about how to select valuable variables to prevent underfitting in regression models. On further research into best practices, what I observe is that different disciplines select variables differently, but most quantitative disciplines follow the same rule of thumb as statistics, where the threshold of usability begins at 10 observations per variable.

Consider how the MedCalc software decides whether to select or drop characteristics for logistic regression (https://www.medcalc.org/manual/logistic_regression.php). The program judges whether a variable is valuable based on its regression coefficient, and drops a candidate feature if the coefficient is not statistically significant (P < 0.05).
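As a sketch of what such a significance filter computes (my own toy implementation, not MedCalc's): fit a logistic regression by Newton-Raphson, then test each coefficient against zero with a two-sided Wald test; features whose p-value comes out at or above 0.05 would be candidates to drop.

```python
import math
import numpy as np

def logistic_fit_pvalues(X, y, n_iter=25):
    """Fit logistic regression by Newton-Raphson and return coefficient
    estimates with two-sided Wald p-values (coefficient / standard
    error, tested against a standard normal). Index 0 is the intercept."""
    X = np.column_stack([np.ones(len(X)), X])   # add intercept column
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))     # predicted probabilities
        W = p * (1 - p)                         # per-observation weights
        H = X.T @ (X * W[:, None])              # observed information matrix
        beta += np.linalg.solve(H, X.T @ (y - p))  # Newton step
    cov = np.linalg.inv(H)                      # asymptotic covariance
    se = np.sqrt(np.diag(cov))
    z = beta / se
    # Two-sided normal tail probability: p = erfc(|z| / sqrt(2)).
    pvals = np.array([math.erfc(abs(v) / math.sqrt(2)) for v in z])
    return beta, pvals
```

Real statistical packages typically report a likelihood-ratio or Wald chi-square test here; the Wald z-test above is the simplest version of the same idea.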

I believe that for any problem, one should consider the context of the problem and the dataset when selecting the variables to include in a model. For a dataset with 10,000 entries, a variable with 10 observations probably would not be a statistically significant feature, but for a dataset with fewer than 300 entries, a variable with 10 observations may give your model a notable boost in confidence.

Ultimately, feature selection remains an important part of building a model, and sometimes a feature must be either engineered or selected somewhat arbitrarily. Investigation and deep consideration of model features is crucial to a model's success, and can be the difference between a mediocre model and a 'good' one.

Hello World?

Posted on 01/03/2020

Hi! Welcome to my website. In an effort to aid my journey in Data Science, I've created this website and hosted it on Github, which thankfully, is free. My goal with this website is to document my experiences learning data science and to have a homepage that I can show people who are interested in data science as well as to aid in my job search. Eventually, I would also like to start uploading guides and tutorials to both assist with my own understanding of topics, as well as to help future data science students.

Website Authored by Henry Sue. Website hosted on Github.io, CSS Sheet / Styles from w3.css