Description
(Read all the instructions carefully and adhere to them.)
Instructions:
-
All assignments must be completed and uploaded by 11:00 pm.
-
Marking will be based on the correctness and soundness of the outputs. Marks will be deducted in case of plagiarism.
-
Prepare a detailed report of the assignment. Be precise in your explanations; unnecessary verbosity will be penalized.
-
Code should be written in Python.
-
You should zip all the required files and name the zip file as
rollno1_rollno2_rollno3_assignment1.zip, e.g., 1811cs01_1811cs02_1811cs03_assignment1.zip.
-
Upload your solution (zip file) to the following link: https://www.dropbox.com/request/eWE7CiUXKsTma79iwf43
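The naming rule above can be sketched in a few lines of Python. This is only an illustration: the helper names `archive_name` and `make_submission` are hypothetical, and the list of files to include is whatever your group decides to submit.

```python
import zipfile
from pathlib import Path


def archive_name(roll_numbers, assignment_no=1):
    """Build the file name rollno1_rollno2_..._assignmentN.zip."""
    return "_".join(roll_numbers) + f"_assignment{assignment_no}.zip"


def make_submission(roll_numbers, files, out_dir="."):
    """Zip the given files under the required name and return the zip path."""
    target = Path(out_dir) / archive_name(roll_numbers)
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in files:
            # Store each file by its base name, without local directories.
            zf.write(f, arcname=Path(f).name)
    return target
```

For example, `archive_name(["1811cs01", "1811cs02", "1811cs03"])` gives the name shown in the instruction above.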
Questions:
(1) The crucial task before applying any machine learning algorithm is to understand the given data, i.e., a thorough data analysis and data visualization is always necessary. As part of this assignment, you are given a dataset from which the following information is to be extracted.
Dataset: stackOverflow.csv
Information to be extracted:
-
Find the number of questions asked for each of the given tags.
-
Find the most commonly used tags and the trend in data science tags.
-
Find the average time taken to answer a question.
-
Find how the number of views relates to the number of answers.
-
Which tags get the highest/lowest ratings in questions?
-
Which tags get the highest/lowest ratings in answers?
-
Find the most active/inactive in answering questions.
-
Which tags draw the highest/lowest views?
Points to be noted:
-
Infer the above information using appropriate graphs wherever necessary.
-
All code must be written in Python only.
The dataset is to be downloaded from the link below:
https://drive.google.com/file/d/0B1AC_DBfxZmWS0pMbWsyNUJrV083akMtVV81NmViRjcxbmhj/view?usp=sharing
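As a starting point for the Stack Overflow analysis above, the sketch below counts questions per tag and averages views per tag using only the standard library. It is a minimal illustration, not a reference solution: the column names `Tags` and `ViewCount` and the comma tag separator are assumptions, since the handout does not describe the layout of stackOverflow.csv. Check the file header and adjust them.

```python
import csv
from collections import Counter, defaultdict
from statistics import mean

# ASSUMPTION: these column names and the tag separator are hypothetical;
# inspect the header of stackOverflow.csv and change them as needed.
TAG_COL = "Tags"
VIEWS_COL = "ViewCount"
TAG_SEP = ","


def load_rows(path):
    """Read the CSV into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def split_tags(cell, sep=TAG_SEP):
    """Split one tag cell into a list of clean, non-empty tag names."""
    return [t.strip() for t in cell.split(sep) if t.strip()]


def questions_per_tag(rows):
    """Count how many questions carry each tag."""
    counts = Counter()
    for row in rows:
        # set() so a tag repeated within one question is counted once.
        counts.update(set(split_tags(row[TAG_COL])))
    return counts


def mean_views_per_tag(rows):
    """Average view count of the questions carrying each tag."""
    views = defaultdict(list)
    for row in rows:
        for tag in set(split_tags(row[TAG_COL])):
            views[tag].append(float(row[VIEWS_COL]))
    return {tag: mean(v) for tag, v in views.items()}
```

With `rows = load_rows("stackOverflow.csv")`, `questions_per_tag(rows).most_common(10)` lists the ten most used tags, and sorting `mean_views_per_tag(rows)` by value gives the tags with the highest and lowest views. The resulting counts can then be passed to a plotting library of your choice for the required graphs.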
(2) Consider the training dataset data.csv, which has 8 variables, as follows.
“NumPreg”, “PlasmaGlucose”, “DiastolicBP”, “TricepSkin”, “BodyMassIndex”, “Pedigree”, “Age”, “Diabetic”
The goal is to fit a logistic regression model to predict the “Diabetic” variable from the other 7 variables. Answer the following questions in the given sequence.
-
Develop the best model to predict the categorical response variable “Diabetic” for the given dataset. Justify your choice of best model.
-
Suppose you choose a threshold t and classify an observation as “Diabetic” = Yes when P(Diabetic | X) > t. How would you choose the optimal threshold t so that this classification achieves maximum accuracy for your best model? Justify your choice.
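One possible way to approach the threshold question is an exhaustive sweep, sketched below. It assumes you already have the predicted probabilities P(Diabetic | X) from your fitted model (for instance from scikit-learn's `LogisticRegression.predict_proba`) and that the true labels are coded 0/1. The function names are hypothetical, and this is only one of several defensible strategies. To avoid an optimistic estimate, the sweep would normally be run on held-out or cross-validated predictions rather than on the data used for fitting.

```python
def accuracy_at(probs, labels, t):
    """Fraction of cases classified correctly when predicting 1 iff p > t."""
    correct = sum((p > t) == bool(y) for p, y in zip(probs, labels))
    return correct / len(labels)


def best_threshold(probs, labels):
    """Return (t, accuracy) with the highest accuracy over candidate thresholds.

    Accuracy can only change when t crosses one of the predicted
    probabilities, so it is enough to try t = 0.0 (every case with p > 0 is
    labelled positive) and each distinct predicted probability.
    Ties are resolved in favour of the smallest threshold.
    """
    candidates = [0.0] + sorted(set(probs))
    scored = ((t, accuracy_at(probs, labels, t)) for t in candidates)
    return max(scored, key=lambda pair: pair[1])
```

Because only finitely many thresholds produce different classifications, the sweep is exact for the data it is given; whether the chosen t generalizes is a separate point to justify in the report.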
This dataset is to be downloaded from the link below:
https://drive.google.com/file/d/0B1AC_DBfxZmWNkZ2QXVSVnVRbXQzVldQNFJsTnloRVlvN0Rv/view?usp=sharing