Fundamentals of Data Mining: Analyzing Diabetes Dataset
University | National University of Singapore (NUS) |
Subject | Data Mining |
Fundamentals of Data Mining
Question 1
In this question, you will use the dataset “diabetes.csv” to learn more about diabetes. Table 1 describes the fields in the dataset, which contains 768 records. Each record is a medical record of a patient who has had several tests to determine whether they have diabetes.
The dataset was extracted from an internal healthcare organisation database. The technical staff who assisted in the data collection process shared that if the data cannot be captured successfully, 0 will be used. In short, for some fields, a value of 0 indicates that the values are not captured in the dataset.
Table 1. Description of the dataset
Field | Description |
Pregnancies | Number of times pregnant |
Glucose | Plasma glucose concentration after 2 hours in an oral glucose tolerance test |
BloodPressure | Diastolic blood pressure (mm Hg) |
SkinThickness | Triceps skin fold thickness (mm) |
Insulin | 2-Hour serum insulin (mu U/ml) |
BMI | Body mass index (weight in kg/(height in m)^2) |
DiabetesPedigreeFunction | Diabetes pedigree function |
Age | Age (years) |
Outcome | 0 (non-diabetic) or 1 (diabetic) |
(a) Assess the quality of the dataset. If needed, perform data cleaning. In fewer than 200 words, discuss how you identify the data quality issues and clean the data. Also, justify your choice of data cleaning method.
You are expected to use the cleaned dataset obtained from part (a) when attempting the subsequent parts of the question.
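As an illustration of the "0 means not captured" rule described above, the following is a minimal sketch of one possible cleaning approach: count the zeros per field, then replace them with the median of the observed values. The list of affected fields, the choice of median imputation, and the helper names `count_zeros` and `impute_zeros_with_median` are assumptions made for this sketch and are not prescribed by the assignment. The sample rows are made up and are not taken from diabetes.csv.

```python
from statistics import median

# Assumption: fields where 0 cannot be a real measurement.
# Pregnancies and Outcome are excluded because 0 is a valid value there.
ZERO_MEANS_MISSING = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def count_zeros(records, fields=ZERO_MEANS_MISSING):
    """Return how many records hold 0 in each field (the data quality check)."""
    return {f: sum(1 for r in records if r[f] == 0) for f in fields}


def impute_zeros_with_median(records, fields=ZERO_MEANS_MISSING):
    """Return copies of the records with each 0 replaced by the field's non-zero median."""
    cleaned = [dict(r) for r in records]  # copy so the raw data stays untouched
    for f in fields:
        observed = [r[f] for r in records if r[f] != 0]
        if not observed:
            continue  # nothing to impute from; leave the field as it is
        fill = median(observed)
        for r in cleaned:
            if r[f] == 0:
                r[f] = fill
    return cleaned


# Made-up rows for illustration only.
sample = [
    {"Pregnancies": 0, "Glucose": 120, "BloodPressure": 70,
     "SkinThickness": 20, "Insulin": 0, "BMI": 31.0, "Outcome": 1},
    {"Pregnancies": 2, "Glucose": 0, "BloodPressure": 80,
     "SkinThickness": 30, "Insulin": 100, "BMI": 25.0, "Outcome": 0},
    {"Pregnancies": 1, "Glucose": 100, "BloodPressure": 0,
     "SkinThickness": 0, "Insulin": 140, "BMI": 0, "Outcome": 0},
]

print(count_zeros(sample))
cleaned = impute_zeros_with_median(sample)
print(cleaned[1]["Glucose"])
```

The median is used here because it is less sensitive to extreme values than the mean. Other choices, such as dropping incomplete records or excluding a field with too many zeros, may be equally defensible and should be justified against the zero counts actually observed.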
(b) Determine the obesity level for each patient according to the following categories:
• “Underweight” if the BMI is below 18.5
• “Normal” if the BMI is 18.5 and above but below 25
• “Overweight” if the BMI is 25 and above but below 30
• “Obese” if the BMI is 30 and above
Then, present one (1) graphical display that can answer the following:
Which obesity level has the highest number of diabetic patients?
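The category boundaries above can be sketched as a small function, followed by a count of diabetic patients per level that a bar chart could then display. The function names `obesity_level` and `diabetic_counts_by_level` are hypothetical, and the sample rows are made up for illustration rather than taken from diabetes.csv.

```python
from collections import Counter

LEVELS = ["Underweight", "Normal", "Overweight", "Obese"]


def obesity_level(bmi):
    """Map a BMI value to the category defined in part (b)."""
    if bmi < 18.5:
        return "Underweight"
    if bmi < 25:
        return "Normal"
    if bmi < 30:
        return "Overweight"
    return "Obese"


def diabetic_counts_by_level(records):
    """Count diabetic patients (Outcome == 1) in each obesity level."""
    return Counter(obesity_level(r["BMI"]) for r in records if r["Outcome"] == 1)


# Made-up rows for illustration only.
sample = [
    {"BMI": 17.0, "Outcome": 0},
    {"BMI": 22.0, "Outcome": 1},
    {"BMI": 27.5, "Outcome": 1},
    {"BMI": 33.0, "Outcome": 1},
    {"BMI": 35.2, "Outcome": 1},
    {"BMI": 31.0, "Outcome": 0},
]

counts = diabetic_counts_by_level(sample)
# A text stand-in for the bar chart: one '#' per diabetic patient.
for level in LEVELS:
    print(f"{level:<12}{'#' * counts[level]} ({counts[level]})")
```

Filtering to diabetic patients before counting keeps the display focused on the question being asked. A stacked or side-by-side bar chart that also shows non-diabetic patients would be another reasonable choice.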
(c) Construct a K-Means model that can help you identify the profile of patients diagnosed with diabetes. In your answer, discuss the following:
• How do you decide the input(s) and parameter(s) to be used in the model
• How do you determine your model is the final best model
• What are the profiles of the clusters
• How do you identify the cluster to be the target cluster
• Data preparation steps and post-model analysis, if any
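The discussion points above can be illustrated with a minimal sketch of K-Means (Lloyd's algorithm) in plain Python. It assumes, as one possible design, that inputs are standardized first, that Outcome is kept out of the clustering inputs and used only afterwards to profile the clusters, and that the target cluster is the one with the highest share of diabetic patients. Seeding the centroids with the first k rows is a simplification; library implementations typically use smarter seeding. All function names and the sample values are hypothetical and not taken from diabetes.csv.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column so that no input dominates the distance by scale."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, max_iter=100):
    """Return (labels, centroids, within-cluster sum of squares)."""
    centroids = [list(r) for r in rows[:k]]  # simplistic seeding: first k rows
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sq_dist(r, centroids[j]))
                      for r in rows]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [mean(c) for c in zip(*members)]
    wcss = sum(sq_dist(r, centroids[lab]) for r, lab in zip(rows, labels))
    return labels, centroids, wcss


def diabetic_share(labels, outcomes, k):
    """Proportion of diabetic patients in each cluster (post-model profiling)."""
    share = {}
    for j in range(k):
        members = [o for lab, o in zip(labels, outcomes) if lab == j]
        share[j] = mean(members) if members else 0.0
    return share


# Made-up [Glucose, BMI] rows and outcomes for illustration only.
rows = [[90, 22], [160, 34], [95, 24], [170, 36], [100, 23], [165, 35]]
outcomes = [0, 1, 0, 1, 0, 1]

z = standardize(rows)
for k in (1, 2, 3):  # compare within-cluster sum of squares across k
    print(k, round(kmeans(z, k)[2], 3))

labels, centroids, wcss = kmeans(z, 2)
share = diabetic_share(labels, outcomes, 2)
target = max(share, key=share.get)
print(labels, share, target)
```

Comparing the within-cluster sum of squares across several values of k is one common way to argue for a final model. Cluster quality measures such as the silhouette score, and how interpretable the resulting profiles are, can support the same decision.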
(d) Construct an Apriori model that can help you identify the profile of patients diagnosed with diabetes. In your answer, discuss the following:
• How do you decide the input(s) and parameter(s) to be used in your model
• How do you determine your model is the final best model
• Report the total number of association rules obtained
• Pick one interesting association rule and explain it in terms of support, rule support and confidence
• Data preparation steps and post-model analysis, if any.
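To make the three rule measures concrete, here is a minimal sketch that computes them for a single rule. It assumes the convention in which "support" is the proportion of records where the antecedent holds, "rule support" is the proportion where both antecedent and consequent hold, and "confidence" is rule support divided by support. Other tools define "support" differently, so check the definitions used by your software. This is not the Apriori algorithm itself, only the evaluation of one rule; the function name `rule_metrics`, the item labels and the transactions are made up for illustration.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Compute support, rule support and confidence for antecedent -> consequent."""
    antecedent, consequent = set(antecedent), set(consequent)
    n = len(transactions)
    # Records in which every antecedent item is present.
    n_ante = sum(1 for t in transactions if antecedent <= t)
    # Records in which the antecedent and the consequent are both present.
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    return {
        "support": n_ante / n,
        "rule_support": n_both / n,
        "confidence": n_both / n_ante if n_ante else 0.0,
    }


# Made-up transactions: each set lists the categorical flags true for a patient.
transactions = [
    {"Obese", "HighGlucose", "Diabetic"},
    {"Obese", "HighGlucose", "Diabetic"},
    {"Obese", "HighGlucose"},
    {"Obese", "Diabetic"},
    {"Normal"},
    {"Normal", "HighGlucose", "Diabetic"},
    {"Overweight"},
    {"Overweight", "HighGlucose"},
]

m = rule_metrics(transactions, {"Obese", "HighGlucose"}, {"Diabetic"})
print(m)
```

In this toy data, 3 of 8 patients are both obese and high in glucose (support 0.375), 2 of 8 are additionally diabetic (rule support 0.25), and so 2 of those 3 are diabetic (confidence of about 0.67). Because Apriori works on categorical items, continuous fields such as Glucose would first need to be binned, which is a data preparation step worth documenting.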
Your writing should be succinct but not at the expense of excluding relevant details. Highlight only the points that are relevant to your discussion. Use plain and simple language. Some questions may not come with absolutely right or wrong answers. For such questions, you have the liberty to express your views about the problem. However, your points have to be supported by evidence and good reasoning. It’s the quality and not the length that counts. Make sure you follow the report guidelines and style specified in this assignment.
The topics in the main report should be presented in the order according to the sequence of the tasks/questions listed in the assignment; that is, in the order of (a), (b), …, etc. You can have several sub-sections within a section if you deem it appropriate.
The report must be self-contained. It is important to include all relevant tables and figures in the report as evidence to support the answers given.
The following are some details of the report format:
• Length: should not exceed 10 pages (including the relevant graphs, tables, references, screenshots and appendices (if any), but excluding the cover page)
• Font Style: Times New Roman
• Font size: 12
• Line spacing: 1.5
• Margins: 1” for the top, bottom, right and left
• Include the page number on each page
Some further suggestions:
• Ensure minimal grammatical and typographical errors
• Write clearly in plain English
• Write appropriately to the context
• Cite appropriate sources
• Provide a reference or bibliography at the end of the main report
• Include less relevant details in the Appendix
• Good overall presentation of the report