Last night as I was dozing off, I had a sudden urge to add to the attempts to identify the anonymous New York Times op-ed writer. I’ve had some success in the past with machine learning and stylometry, and this is one of the most intriguing authorship questions in years.
By 2:00 am I was convinced the data had singled out Mike Pence. I even started to wonder what the ethical thing to do is when one has hard evidence about the source of text whose author wished to remain anonymous.
Most dashboards display the same variables each day. I wanted something fresh, something that changed based on spikes in the data.
Kaggle is a forum for interacting with other data scientists and competing to see who can write code that will best predict features of data. It’s a way to test your skills at statistics and machine learning, and to do a lot of human learning in the process (sorry, bad pun).
When I entered the contest to categorize crimes that occurred in San Francisco, my initial goal was to do better than random chance.
I wanted to see if it was possible to train a model to detect the difference between two fictional authors created by the same novelist based only on the frequency of common stop words, e.g., “the,” “at,” and “is.” It worked: the randomForest model correctly selected Nick 93% of the time and Amy 91%.
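To make the stop-word idea concrete, here is a minimal Python sketch of the feature step only: turning a passage into a vector of stop-word frequencies that a classifier such as a random forest could then be trained on. It is an illustration, not the code behind the result above (that work used R's randomForest); the function name `stop_word_frequencies` and the short word list are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real study would use a longer one.
STOP_WORDS = ["the", "at", "is", "a", "of", "and", "to", "in"]


def stop_word_frequencies(text, stop_words=STOP_WORDS):
    """Return each stop word's share of all tokens in `text`, in list order."""
    # Lowercase and split into simple word tokens (letters and apostrophes only).
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return [0.0] * len(stop_words)
    counts = Counter(tokens)
    # Dividing by the token count makes passages of different lengths comparable.
    return [counts[word] / len(tokens) for word in stop_words]


sample = "The cat is at the door and the dog is in the yard."
features = stop_word_frequencies(sample)
print(dict(zip(STOP_WORDS, features)))
```

Each passage becomes one row of such frequencies, labeled with its narrator, and those rows are what the model learns from.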
Background
When I first started using R for data analysis, I was mesmerized by all of the packages and what they made possible.