Data Analytics Introduction and Practicum Assignments & Projects

Assignment1

data manipulation and aggregation

{The goals of this assignment are: to practice data manipulation with Pandas; to develop intuition about the interplay of precision, accuracy, and bias when making predictions; and to better understand how election forecasts are constructed}

visualization

{Applying different visualization techniques to Part 1} {Project Proposal also attached below}

Results Research Proposal
Assignment1 Research Proposal

Assignment2

scientific computing

{The goal of this assignment is to introduce Scikit-Learn and its functions, with a focus on regression and PCA. All objects within scikit-learn share a uniform basic API consisting of three complementary interfaces: an estimator interface for building and fitting models, a predictor interface for making predictions, and a transformer interface for converting data. The estimator interface is at the core of the library. It defines how objects are instantiated and exposes a fit method for learning a model from training data. All supervised and unsupervised learning algorithms (e.g., for classification, regression, or clustering) are offered as objects implementing this interface. Machine learning tasks like feature extraction, feature selection, and dimensionality reduction are also provided as estimators.}
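
The three interfaces described above can be illustrated with a toy, stdlib-only estimator (this class is not part of scikit-learn; it is a minimal sketch that simply follows the same fit/predict/transform convention):

```python
class MeanRegressor:
    """Toy estimator following scikit-learn's API convention:
    fit() learns from training data, predict() makes predictions,
    transform() converts data."""

    def fit(self, X, y):
        # Estimator interface: learn a model (here, just the mean of y)
        self.mean_ = sum(y) / len(y)
        return self  # returning self allows chaining, as in scikit-learn

    def predict(self, X):
        # Predictor interface: predict the learned mean for every sample
        return [self.mean_ for _ in X]

    def transform(self, X):
        # Transformer interface: convert data (here, center each feature)
        return [[xi - self.mean_ for xi in row] for row in X]
```

The trailing underscore on `mean_` mirrors scikit-learn's convention that attributes learned during `fit` are suffixed with `_`.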

statistical analysis

{In this lab, and in homework 2, we alluded to cross-validation with a weak explanation about finding the right hyperparameters, some of which were regularization parameters. We will have more to say about regularization soon, but let's tackle the reasons we do cross-validation. The bottom line is: finding the model that has an appropriate mix of bias and variance. We usually want to sit at the point of the tradeoff between the two: be simple, but no simpler than necessary. We do not want a model with too much variance: it would not generalize well. This phenomenon is also called overfitting. There is no point doing prediction if we can't generalize well. At the same time, if we have too much bias in our model, we will systematically underpredict or overpredict values and miss most predictions. This is also known as underfitting. Cross-validation provides a way to find the "hyperparameters" of our model such that we achieve the balance point.}
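
The core mechanic behind cross-validation is splitting the data into k folds, holding out each fold in turn. Scikit-learn provides this via `KFold` and `GridSearchCV`; the sketch below is a stdlib-only illustration of the splitting idea, not the library's implementation:

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds; return (train, test) index
    pairs, one per fold, so each sample is held out exactly once."""
    # Distribute any remainder across the first n % k folds
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i in range(k):
        test = folds[i]
        # Training set is everything outside the held-out fold
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits
```

To tune a hyperparameter, one would fit the model on each `train` set for each candidate value and average the error on the corresponding `test` sets, keeping the value with the lowest average error.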

Results Repository
Assignment2 [assignment2 repository]

Assignment3

machine learning part1

{Classification

Identifying the category to which an object belongs. Applications: Spam detection, Image recognition. Algorithms: SVM, nearest neighbors, random forest, …

Regression

Predicting a continuous-valued attribute associated with an object. Applications: Drug response, Stock prices. Algorithms: SVR, ridge regression, Lasso, …

Clustering

Automatic grouping of similar objects into sets. Applications: Customer segmentation, Grouping experiment outcomes. Algorithms: k-Means, spectral clustering, mean-shift, … }
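
As a concrete taste of classification, here is a stdlib-only sketch of the nearest-neighbors idea mentioned above (1-NN: predict the label of the closest training point); scikit-learn's `KNeighborsClassifier` is the production version:

```python
def nearest_neighbor_predict(train_X, train_y, x):
    """Predict the label of x as the label of its nearest
    training point (squared Euclidean distance)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    # Index of the training point closest to x
    best = min(range(len(train_X)), key=lambda i: sq_dist(train_X[i], x))
    return train_y[best]
```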

machine learning part2

{Dimensionality reduction

Reducing the number of random variables to consider. Applications: Visualization, Increased efficiency. Algorithms: PCA, feature selection, non-negative matrix factorization

Model selection

Comparing, validating and choosing parameters and models. Goal: Improved accuracy via parameter tuning. Modules: grid search, cross validation, metrics.

Preprocessing

Feature extraction and normalization. Application: Transforming input data such as text for use with machine learning algorithms. Modules: preprocessing, feature extraction. }
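
Dimensionality reduction via PCA boils down to finding the direction of greatest variance in centered data. Scikit-learn's `PCA` does this with an SVD; the sketch below uses power iteration on the covariance matrix, a stdlib-only illustration of the same idea:

```python
def first_principal_component(X, iters=200):
    """Return a unit vector along the direction of greatest
    variance in X (rows are samples, columns are features)."""
    n, d = len(X), len(X[0])
    # Center each feature at zero mean
    means = [sum(row[j] for row in X) / n for j in range(d)]
    Xc = [[row[j] - means[j] for j in range(d)] for row in X]
    # Covariance matrix (d x d)
    cov = [[sum(Xc[i][a] * Xc[i][b] for i in range(n)) / n
            for b in range(d)] for a in range(d)]
    # Power iteration converges to the leading eigenvector of cov
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v
```

Projecting each centered sample onto this vector gives the one-dimensional representation that preserves the most variance.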

Results Repository
Assignment3 Assignment3 Repository

Extra Lab

network analysis

{In this lab we will do the following:

1. Get a LinkedIn API key
2. Use oauth2 to get an access token
3. Download our own LinkedIn data using the LinkedIn API
4. Export this data as a CSV file so it can be imported into Gephi
5. Do some network analysis directly in python before starting Gephi
6. Analyze our data with the external tool Gephi }

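
Steps 4 and 5 above can be sketched without any network library: exporting an edge list with the `Source`/`Target` headers Gephi expects, and computing degree centrality (the fraction of other nodes each node connects to) in plain Python. The function names here are illustrative, not from the lab:

```python
import csv
import io
from collections import defaultdict

def degree_centrality(edges):
    """Degree centrality of an undirected graph given as (a, b) edge pairs:
    each node's neighbor count divided by the number of other nodes."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    n = len(adj)
    return {node: len(nbrs) / (n - 1) for node, nbrs in adj.items()}

def edges_to_gephi_csv(edges):
    """Serialize an edge list as CSV with the Source/Target
    header row that Gephi's spreadsheet importer recognizes."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return buf.getvalue()
```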
big data analytics

{In this week’s lab, we will mostly ignore statistics and instead focus on some practical issues that you will encounter on Homework 4. Section 4 of that homework includes new python techniques (classes, inheritance), an unfamiliar approach to breaking up large computing problems (MapReduce), code that has to be run outside the friendly confines of an ipython notebook, and then you are asked to put it all to use on Amazon’s Elastic Compute Cloud (EC2). This sounds very complicated, but the end result is a simpler algorithm for the problem of calculating similarity scores, as well as the ability to expand to arbitrarily large data sets.}
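
The MapReduce approach mentioned above splits a computation into a map phase (emit key-value pairs per record), a shuffle (group pairs by key), and a reduce phase (combine each key's values). A single-machine sketch of the pattern, shown here as a word count rather than the homework's similarity-score problem:

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    """Run mapper over every record, shuffle the emitted (key, value)
    pairs by key, then apply reducer to each key's value list."""
    # Map phase: each record may emit any number of (key, value) pairs
    pairs = [kv for rec in records for kv in mapper(rec)]
    # Shuffle: groupby requires pairs sorted by key
    pairs.sort(key=itemgetter(0))
    # Reduce phase: collapse each key's values to a single result
    return {key: reducer(key, [v for _, v in group])
            for key, group in groupby(pairs, key=itemgetter(0))}

def word_mapper(line):
    for word in line.split():
        yield (word.lower(), 1)

def count_reducer(word, counts):
    return sum(counts)
```

On a real cluster the map and reduce calls run in parallel on different machines, which is what allows the same code shape to scale to arbitrarily large data sets.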


Extra Lab2

web scraping
sampling and text processing

{In this example we will see how to sample data and do text processing}
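
A minimal stdlib-only sketch of both steps: draw a reproducible random sample of lines, then tokenize and count words. The helper name and the seeded-sampling choice are illustrative, not from the lab:

```python
import random
from collections import Counter

def sample_and_count(lines, k, seed=0):
    """Draw a reproducible random sample of k lines, then
    lowercase, strip trailing punctuation, and count words."""
    rng = random.Random(seed)  # fixed seed makes the sample repeatable
    sample = rng.sample(lines, k)
    counts = Counter(word.lower().strip(".,!?")
                     for line in sample for word in line.split())
    return sample, counts
```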


EMSE 6992 Labs

lab assignments

Programming For Analytics Assignments & Projects


Data Analysis for Eng & Sci Assignments & Projects


Design and Analysis of Algorithms Assignments & Projects


DBMS For Data Analytics Assignments & Projects


Applied Machine Learning For Analytics Assignments & Projects



Introduction to Big Data & Analytics Assignments & Projects


Data-Driven Policy Assignments & Projects


Systems Thinking and Policy Modeling Assignments & Projects


Computer System Architecture Assignments & Projects


Bias In AI Assignments & Projects


Capstone: Opioid Mapping Project Assignments