617044 - Data Science in The Wild

Basic Information

The course covers the tools needed to handle data-science tasks, focusing on problem formulation, data preparation and cleaning, data modeling, implementation of analysis tasks, and presentation. The following methods and topics are taught:

Data preparation: sampling methods, similarity join, record linkage, locality-sensitive hashing (LSH).
Data modeling: binning, non-linear transformation, domain-knowledge feature extraction, handling a noisy environment.
Model selection and evaluation: ranking evaluation (AUC, ACLC), classification (lift, accuracy, recall, precision, F-score), density estimation (mean absolute error, mean squared error, cross entropy/log likelihood). Causality vs. correlation.
Supervised and unsupervised learning methods: naive Bayes, linear regression, logistic regression, decision tree, k-nearest neighbors, linear hyperplane, random forests, SVM, non-linear SVM.
Processing data streams: stochastic gradient descent (SGD), sampling and filtering data streams, Bloom filter, Flajolet-Martin algorithm, moments and the AMS method, DGIM method.
PageRank for link analysis.
Community detection in social networks: Girvan-Newman algorithm, spectral clustering, counting triangles.
Recommender systems: content-based, collaborative filtering, hybrid methods. Latent factor models (SVD).
Spam filtering.
Data presentation and visualization.
Ethical issues.

The course includes large programming assignments using Python and related technologies such as SciPy, NumPy, and IPython Notebook. Tools for processing large amounts of data, such as Apache Hadoop (MapReduce), Apache Spark, Spark SQL, Pig, Hive, and Apache Mahout, are taught and used in the homework assignments.

Learning outcomes: upon completing the course, the student will be able to:

1. Select an appropriate model for different data science tasks and give the appropriate weight to different model-selection criteria.
2. Validate a selected model and apply the model, e.g., partition the data into a training set and a testing set in a way that satisfies the desired criteria.
3. Clean data for data processing tasks.
4. Sample large-scale datasets effectively and without bias.
5. Apply supervised and unsupervised learning using naive Bayes, linear regression, logistic regression, decision tree, random forest, SVM, k-nearest neighbors, etc.
6. Choose the right method for each analysis task and validate the chosen method using common criteria.
7. Carry out tasks of processing large datasets under different distribution models, such as MapReduce (Hadoop) and RDDs (Spark), and apply ML methods over datasets using Apache Mahout.
8. Verify the existence of causation and distinguish it from correlation.
9. Apply algorithms over data streams and use Spark Streaming for that.
10. Discover communities in a social network at large scale.
11. Build a recommender system.
12. Build a classifier such as a spam filter.
13. Present analysis results visually in a non-misleading manner.

An additional outcome is that students will be aware of ethical issues that may arise when conducting data analysis and will be able to detect unethical misuse of data.

Faculty: Applied Sciences | Graduate Studies

Semester Information

Weekly Hours

Responsible(s)

No Registration Groups
