Data Analytics Introduction and Practicum Assignments & Projects

Assignment1

data manipulation and aggregation

{The goals of this assignment are: to practice data manipulation with Pandas; to develop intuition about the interplay of precision, accuracy, and bias when making predictions; and to better understand how election forecasts are constructed}

visualization

{Applying different visualization techniques to Part 1} {Project Proposal also attached below}

Results Research Proposal
Assignment1 Research Proposal

Assignment2

scientific computing

{The goal of this assignment is to introduce Scikit-Learn and its functions, with a focus on regression and PCA. All objects within scikit-learn share a uniform basic API consisting of three complementary interfaces: an estimator interface for building and fitting models, a predictor interface for making predictions, and a transformer interface for converting data. The estimator interface is at the core of the library. It defines how objects are instantiated and exposes a fit method for learning a model from training data. All supervised and unsupervised learning algorithms (e.g., for classification, regression, or clustering) are offered as objects implementing this interface. Machine learning tasks like feature extraction, feature selection, and dimensionality reduction are also provided as estimators.}
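
The three interfaces described above can be illustrated with a toy, stdlib-only estimator (this class is not part of scikit-learn; it is a minimal sketch that simply follows the same fit/predict/transform convention):

```python
class MeanRegressor:
    """Toy estimator following scikit-learn's API convention:
    fit() learns from training data, predict() makes predictions,
    transform() converts data."""

    def fit(self, X, y):
        # Estimator interface: learn a model (here, just the mean of y)
        self.mean_ = sum(y) / len(y)
        return self  # returning self allows chaining, as in scikit-learn

    def predict(self, X):
        # Predictor interface: predict the learned mean for every sample
        return [self.mean_ for _ in X]

    def transform(self, X):
        # Transformer interface: convert data (here, center each feature)
        return [[xi - self.mean_ for xi in row] for row in X]
```

The trailing underscore on `mean_` mirrors scikit-learn's convention that attributes learned during `fit` are suffixed with `_`.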

statistical analysis

{In this lab, and in homework 2, we alluded to cross-validation with a weak explanation about finding the right hyperparameters, some of which were regularization parameters. We will have more to say about regularization soon, but let's tackle the reasons we do cross-validation. The bottom line is: finding the model that has an appropriate mix of bias and variance. We usually want to sit at the point of the tradeoff between the two: be simple, but no simpler than necessary. We do not want a model with too much variance: it would not generalize well. This phenomenon is also called overfitting. There is no point doing prediction if we can't generalize well. At the same time, if we have too much bias in our model, we will systematically underpredict or overpredict values and miss most predictions. This is also known as underfitting. Cross-validation provides a way to find the "hyperparameters" of our model such that we achieve the balance point.}
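
The core mechanic behind cross-validation is splitting the data into k folds, holding out each fold in turn. Scikit-learn provides this via `KFold` and `GridSearchCV`; the sketch below is a stdlib-only illustration of the splitting idea, not the library's implementation:

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds; return (train, test) index
    pairs, one per fold, so each sample is held out exactly once."""
    # Distribute any remainder across the first n % k folds
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i in range(k):
        test = folds[i]
        # Training set is everything outside the held-out fold
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits
```

To tune a hyperparameter, one would fit the model on each `train` set for each candidate value and average the error on the corresponding `test` sets, keeping the value with the lowest average error.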

Results Repository
Assignment2 [assignment2 repository]

Assignment3

machine learning part1

{Classification

Identifying the category to which an object belongs. Applications: Spam detection, Image recognition. Algorithms: SVM, nearest neighbors, random forest, …

Regression

Predicting a continuous-valued attribute associated with an object. Applications: Drug response, Stock prices. Algorithms: SVR, ridge regression, Lasso, …

Clustering

Automatic grouping of similar objects into sets. Applications: Customer segmentation, Grouping experiment outcomes. Algorithms: k-Means, spectral clustering, mean-shift, … }
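
As a concrete taste of classification, here is a stdlib-only sketch of the nearest-neighbors idea mentioned above (1-NN: predict the label of the closest training point); scikit-learn's `KNeighborsClassifier` is the production version:

```python
def nearest_neighbor_predict(train_X, train_y, x):
    """Predict the label of x as the label of its nearest
    training point (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Index of the training point closest to x
    best = min(range(len(train_X)), key=lambda i: sq_dist(train_X[i], x))
    return train_y[best]
```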

machine learning part2

{Dimensionality reduction

Reducing the number of random variables to consider. Applications: Visualization, Increased efficiency. Algorithms: PCA, feature selection, non-negative matrix factorization

Model selection

Comparing, validating and choosing parameters and models. Goal: Improved accuracy via parameter tuning. Modules: grid search, cross validation, metrics.

Preprocessing

Feature extraction and normalization. Application: Transforming input data such as text for use with machine learning algorithms. Modules: preprocessing, feature extraction. }
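
Dimensionality reduction via PCA boils down to finding the direction of greatest variance in centered data. Scikit-learn's `PCA` does this with an SVD; the sketch below uses power iteration on the covariance matrix, a stdlib-only illustration of the same idea:

```python
def first_principal_component(X, iters=200):
    """Return a unit vector along the direction of greatest
    variance in X (rows are samples, columns are features)."""
    n, d = len(X), len(X[0])
    # Center each feature at zero mean
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    # Covariance matrix (d x d)
    cov = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / n
            for b in range(d)] for a in range(d)]
    # Power iteration converges to the leading eigenvector of cov
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v
```

Projecting each centered sample onto this vector gives the one-dimensional representation that preserves the most variance.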

Results Repository
Assignment3 Assignment3 Repository

Extra Lab

network analysis

{In this lab we will do the following:

1. Get a LinkedIn API key
2. Use oauth2 to get an access token
3. Download our own LinkedIn data using the LinkedIn API
4. Export this data as a CSV file so it can be imported into Gephi
5. Do some network analysis directly in python before starting Gephi
6. Analyze our data with the external tool Gephi }

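
Steps 4 and 5 above can be sketched without any network library: exporting an edge list with the `Source`/`Target` headers Gephi expects, and computing degree centrality (the fraction of other nodes each node connects to) in plain Python. The function names here are illustrative, not from the lab:

```python
import csv
import io
from collections import defaultdict

def degree_centrality(edges):
    """Degree centrality of an undirected graph given as (a, b) edge pairs:
    each node's neighbor count divided by the number of other nodes."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    n = len(adj)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}

def edges_to_gephi_csv(edges):
    """Serialize an edge list as CSV with the Source/Target
    header row that Gephi's spreadsheet importer recognizes."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return buf.getvalue()
```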
big data analytics

{In this week’s lab, we will mostly ignore statistics and instead focus on some practical issues that you will encounter on Homework 4. Section 4 of that homework includes new python techniques (classes, inheritance), an unfamiliar approach to breaking up large computing problems (MapReduce), code that has to be run outside the friendly confines of an ipython notebook, and then you are asked to put it all to use on Amazon’s Elastic Compute Cloud (EC2). This sounds very complicated, but the end result is a simpler algorithm for the problem of calculating similarity scores, as well as the ability to expand to arbitrarily large data sets.}
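
The MapReduce approach mentioned above splits a computation into a map phase (emit key-value pairs per record), a shuffle (group pairs by key), and a reduce phase (combine each key's values). A single-machine sketch of the pattern, shown here as a word count rather than the homework's similarity-score problem:

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    """Run mapper over every record, shuffle the emitted (key, value)
    pairs by key, then apply reducer to each key's value list."""
    # Map phase: each record may emit any number of (key, value) pairs
    pairs = [kv for rec in records for kv in mapper(rec)]
    # Shuffle: groupby requires pairs sorted by key
    pairs.sort(key=itemgetter(0))
    # Reduce phase: collapse each key's values to a single result
    return {key: reducer(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))}

def word_mapper(line):
    for word in line.split():
        yield (word.lower(), 1)

def count_reducer(word, counts):
    return sum(counts)
```

On a real cluster the map and reduce calls run in parallel on different machines, which is what allows the same code shape to scale to arbitrarily large data sets.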


Extra Lab2

web scraping
sampling and text processing

{In this example we will see how to sample data and do text processing}
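
A minimal stdlib-only sketch of both steps: draw a reproducible random sample of lines, then tokenize and count words. The helper name and the seeded-sampling choice are illustrative, not from the lab:

```python
import random
from collections import Counter

def sample_and_count(lines, k, seed=0):
    """Draw a reproducible random sample of k lines, then
    lowercase, strip trailing punctuation, and count words."""
    rng = random.Random(seed)  # fixed seed makes the sample repeatable
    sample = rng.sample(lines, k)
    counts = Counter(word.lower().strip(".,!?")
                     for line in sample for word in line.split())
    return sample, counts
```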


EMSE 6992 Labs

lab assignments

Programming For Analytics Assignments & Projects


Data Analysis for Eng & Sci Assignments & Projects


Design and Analysis of Algorithms Assignments & Projects


DBMS For Data Analytics Assignments & Projects


Applied Machine Learning For Analytics Assignments & Projects



Introduction to Big Data & Analytics Assignments & Projects


Data-Driven Policy Assignments & Projects


Systems Thinking and Policy Modeling Assignments & Projects


Computer System Architecture Assignments & Projects


Bias In AI Assignments & Projects


Capstone: Opioid Mapping Project Assignments