Data Science Boot Camp Spark Activity: Building a Linear Regression Classifier

1) Now we’ll use our old code as the basis for a linear regression-based classifier that tells us whether or not a string is about religion. First, check out http://cmj4.web.rice.edu/DSDay2/Activity8Out.py and read through the code. I have removed a bunch of the Activity 7 code and added a couple of lines that attempt to map each 20,000-dimensional TF-IDF vector to a 1000-dimensional vector. The answer is at http://cmj4.web.rice.edu/DSDay2/Activity8Answer.py.
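If it helps to see the idea outside of Spark first, here is a small NumPy sketch of one possible way to shrink a long vector. It is only an illustration: the function name reduce_dimensions and the choice to sum blocks of adjacent entries are assumptions made for this example, and the Activity 8 answer file may well do the mapping differently.

```python
import numpy as np

def reduce_dimensions(vec, target_dim):
    """Shrink a vector to target_dim entries by summing equal-sized blocks
    of adjacent entries. This block-sum scheme is an assumption for
    illustration, not necessarily what the activity code does."""
    vec = np.asarray(vec, dtype=float)
    if vec.size % target_dim != 0:
        raise ValueError("vector length must be a multiple of target_dim")
    # Row i of the reshaped array holds the i-th block of adjacent entries.
    return vec.reshape(target_dim, -1).sum(axis=1)

# Tiny example: 12 dimensions folded down to 4.
small = reduce_dimensions(np.arange(12), 4)

# Same call at the sizes used in this activity: 20,000 down to 1000.
big = reduce_dimensions(np.ones(20000), 1000)
```

In a Spark job, a function like this would typically be applied to each document's vector with a map over the RDD.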

2) Next, we add to this code to build the inverse of the Gram matrix. Check out http://cmj4.web.rice.edu/DSDay2/Activity9Out.py; the answer is at http://cmj4.web.rice.edu/DSDay2/Activity9Answer.py.
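As a reminder of what is being computed: the Gram matrix is the sum of the outer products of the feature vectors with themselves (that is, X^T X when the vectors are the rows of X). The sketch below shows this on a toy data set with plain NumPy. The function name inverse_gram_matrix and the three 2-dimensional vectors are made up for the example; it is not the Activity 9 answer.

```python
import numpy as np
from functools import reduce

def inverse_gram_matrix(vectors):
    """Sum the outer product x x^T over all vectors, then invert the sum."""
    # In PySpark the same pattern would be a map to np.outer(x, x)
    # followed by a reduce that adds the matrices together.
    gram = reduce(lambda a, b: a + b, (np.outer(x, x) for x in vectors))
    return np.linalg.inv(gram)

# Toy feature vectors (hypothetical data, 2 dimensions instead of 1000).
features = [np.array([1.0, 0.0]),
            np.array([1.0, 1.0]),
            np.array([0.0, 2.0])]

inv_gram = inverse_gram_matrix(features)
```

Note that np.linalg.inv raises an error when the Gram matrix is singular, which happens if the vectors do not span every dimension. np.linalg.pinv is a common fallback in that case.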

3) Next, we add to this code to compute the 1000 regression parameters, one per dimension of the reduced vectors. Check out http://cmj4.web.rice.edu/DSDay2/Activity10Out.py; the answer is at http://cmj4.web.rice.edu/DSDay2/Activity10Answer.py.
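For reference, the standard least-squares solution multiplies the inverse Gram matrix by the sum of each feature vector times its label: beta = (X^T X)^(-1) X^T y. The following is a minimal NumPy sketch of that formula on made-up data. The function name regression_parameters and the toy numbers are assumptions for illustration, and the Activity 10 answer may organize the computation differently.

```python
import numpy as np

def regression_parameters(vectors, labels):
    """Ordinary least squares via the normal equations:
    beta = (X^T X)^(-1) X^T y."""
    X = np.vstack(vectors)                 # one row per document
    y = np.asarray(labels, dtype=float)    # one label per document
    gram_inv = np.linalg.inv(X.T @ X)      # inverse Gram matrix
    xty = X.T @ y                          # sum over documents of x_i * y_i
    return gram_inv @ xty

# Toy data (hypothetical): the labels were generated as 2*x0 - 1*x1,
# so the fit should recover the parameters [2, -1].
toy_vectors = [np.array([1.0, 0.0]),
               np.array([0.0, 1.0]),
               np.array([1.0, 1.0])]
toy_labels = [2.0, -1.0, 1.0]

beta = regression_parameters(toy_vectors, toy_labels)
```

For a classifier, the labels would be two fixed values, for example +1 for documents about religion and -1 for everything else. That encoding is an assumption here; check the activity code for the one it actually uses.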

4) Finally, look at http://cmj4.web.rice.edu/DSDay2/Activity11.py. There is no code for you to write here, but this file includes the code that actually uses the regression parameters to make a prediction. Try a few queries. Examples: getPrediction("god jesus allah"), getPrediction("I have a fish on the back of my car"), and getPrediction("I have a baby on board bumper sticker on the back of my car").
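Conceptually, a prediction of this kind turns the query string into a feature vector, takes its dot product with the regression parameters, and compares the score to a cutoff. Here is a toy sketch of that idea. Everything in it is hypothetical: the four-word vocabulary, the parameter values, the zero cutoff, and the function names are invented for illustration and are not taken from Activity11.py, whose getPrediction works on the real TF-IDF vectors.

```python
import numpy as np

# Hypothetical four-word vocabulary and made-up regression parameters.
VOCAB = {"god": 0, "jesus": 1, "car": 2, "fish": 3}
BETA = np.array([0.8, 0.7, -0.6, -0.1])

def to_vector(text):
    """Count how often each vocabulary word appears in the text."""
    vec = np.zeros(len(VOCAB))
    for word in text.lower().split():
        if word in VOCAB:
            vec[VOCAB[word]] += 1.0
    return vec

def predict_is_religion(text, threshold=0.0):
    """Return (decision, score); the decision is True when the
    regression score exceeds the threshold."""
    score = float(to_vector(text) @ BETA)
    return score > threshold, score

religious, religious_score = predict_is_religion("god jesus allah")
other, other_score = predict_is_religion("I have a fish on the back of my car")
```

Words outside the vocabulary (such as "allah" in this toy setup) contribute nothing to the score, which is one reason a larger dictionary matters in the real activity.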