student alcohol consumption dataset

… consumption (both column 27 and 28) when famrel has a low value.

… Column 23 as the attributes and GStatus as the class for the training set, to predict the class GStatus in the test set and validate the model.

Finding a Binary Output: The dataset has two alcohol features, viz., Dalc (workday alcohol consumption) and Walc (weekend alcohol consumption), both in the range of 1 (very low) to 5 (very high). Many of the features are ordinal and were discretized from continuous values. A plot of the target's prominence shows that the target is imbalanced, so we may benefit from oversampling or under-sampling when building our model.

This means 25% of students study for two hours, while 50% study more.

Education System in Portugal.

Generally, many models prefer features that are independent of each other and have low correlations.

This Student Alcohol Consumption dataset is based on data collected in two secondary schools in Portugal. The dataset was originally designed for estimating high school students' performance, with alcohol consumption used as one of the parameters.

The reason for this change is that it is easier to classify a student's … To do so, we …

According to the World Health Organization (Global Status Report on Alcohol and Health 2014), gender, family, and social factors affect alcohol consumption.

… avoid drinking in order to prevent their health from further deterioration.

However, the data reveals that a total of 382 students were in both datasets; this was evident in the exact …

Fedu and Medu correlate more than some others, so we might want to combine the information.

National Institute on Alcohol Abuse and Alcoholism, Alcohol Use and Consumption Tables: a large number of HTML and text files on alcohol use and consumption.
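
The imbalance noted for the target can be handled in several ways. The following is a minimal sketch of random oversampling in plain Python. The helper name `oversample`, the label names and the toy rows are assumptions made for illustration; they are not taken from the dataset or from the original analysis.

```python
import random
from collections import Counter

def oversample(rows, labels, seed=0):
    """Randomly duplicate minority-class rows until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        # All original rows that belong to this class.
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels

# Toy example (made-up values): 6 "low" drinkers versus 2 "high" drinkers.
rows = [{"Dalc": 1}] * 6 + [{"Dalc": 4}, {"Dalc": 5}]
labels = ["low"] * 6 + ["high"] * 2
balanced_rows, balanced_labels = oversample(rows, labels)
print(Counter(balanced_labels))
```

If this approach is used, it is usually applied to the training split only, so that duplicated rows do not leak into the test set.
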
February 2016. DOI: 10.13140/RG.2.1.1465.8328. Authors: Fabio Pagnotta and Hossain Amran, University of Camerino.

… to 1 hour, or 4 – >1 hour)
studytime – weekly study time (numeric: 1 – <2 hours, 2 – 2 to 5 hours, 3 – 5 to 10 hours, or 4 – >10 hours)
failures – number of past class failures (numeric: n if 1<=n<3, else 4)
schoolsup – extra educational support (binary: yes or no)
famsup – family educational support (binary: yes or no)
paid – extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
activities – extra-curricular activities (binary: yes or no)
nursery – attended nursery school (binary: yes or no)
higher – wants to take higher education (binary: yes or no)
internet – Internet access at home (binary: yes or no)
romantic – with a romantic relationship (binary: yes or no)
famrel – quality of family relationships (numeric: from 1 – very bad to 5 – excellent)
freetime – free time after school (numeric: from 1 – very low to 5 – very high)
goout – going out with friends (numeric: from 1 – very low to 5 – very high)
Dalc – workday alcohol consumption (numeric: from 1 – very low to 5 – very high)
Walc – weekend alcohol consumption (numeric: from 1 – very low to 5 – very high)
health – current health status (numeric: from 1 – very bad to 5 – very good)
absences – number of school absences (numeric: from 0 to 93)
G1 – first period grade (numeric: from 0 to 20)
G2 – second period grade (numeric: from 0 to 20)
G3 – final grade (numeric: from 0 to 20, output target)

… Joining information from existing features (PCA is a common example, or some knowledge about how features are correlated).
Depending on the model, removing features that are not important to the model.

If the means differ significantly (h0 is rejected), then the feature will likely be a dominant predictor.

… weekend alcohol consumption and their health.
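
One common way to check whether a numeric feature has different means across the levels of a categorical variable is a one-way ANOVA. The text does not say which test was actually used, so the sketch below is only an illustration: it computes the F statistic and eta-squared by hand on made-up values. It does not compute a p-value, which would require the F distribution.

```python
from statistics import mean

def anova_f_and_eta_squared(groups):
    """One-way ANOVA F statistic and eta-squared for a dict {level: [values]}."""
    all_values = [v for vals in groups.values() for v in vals]
    grand_mean = mean(all_values)
    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(vals) * (mean(vals) - grand_mean) ** 2
                     for vals in groups.values())
    # Within-group sum of squares: spread of values around their own group mean.
    ss_within = sum((v - mean(vals)) ** 2
                    for vals in groups.values() for v in vals)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    eta_sq = ss_between / (ss_between + ss_within)
    return f_stat, eta_sq

# Made-up absences grouped by a made-up binary target (not dataset values).
groups = {"low": [0, 2, 4, 2], "high": [6, 8, 10, 8]}
f_stat, eta_sq = anova_f_and_eta_squared(groups)
print(round(f_stat, 2), round(eta_sq, 2))
```

A large F statistic (and an eta-squared close to 1) suggests that the group means differ, which is the situation in which h0 would be rejected.
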
With the Student Alcohol Consumption data set from the UCI Machine Learning Archive (Fabio Pagnotta 2016), we thought it would be interesting to see which features are important in determining whether a student is a heavy drinker. With the Student Alcohol Consumption data set, we predict high or low alcohol consumption of students.

Using Python to Analyze Secondary School Student Alcohol Consumption and Their Academic Performance. Poonam Kumari and Aditya Pratap, Research Scholars, Department of Computer Science, IITM Janakpuri, New Delhi, India. …

We may want to normalize absences in preparation for model building.

The columns and how they are recorded are as listed below: …

Although student achievement is highly influenced by past evaluations, an explanatory analysis has shown that there are also other relevant features (e.g. …

… result as pass/fail rather than a discrete numeric number.

Secondary school student alcohol consumption data with social, gender and study information.

However, if more elaborate data mining techniques were to be used, more features can be selected and used in order to …

A lot of time is lost to alcohol consumption, so students put less time into their academic work.

It’s called the datasets subreddit, or /r/datasets.

This would help the classification model to more accurately predict the class GStatus.

As we all know, human relationships play a major role in people's lives.

… relationship with his/her family has a low value.

EuroEducation.net.
If the hypothesis holds true, we would expect to see an increasing level of alcohol …

To test this, we will also apply the Box-Cox method to determine the parameter that indicates which method is best.

The dataset we chose is the Student Alcohol Consumption dataset by UCI Machine Learning, which can be obtained …

We assume that a father’s education level is similar to a mother’s education level. Visualizing the association shows that the education levels of mother and father coincide fairly often, so we might want to explore this further or consider joining these features when preprocessing the data before model building.

Fabio Pagnotta, Hossain Mohammad Amran. 2016.

Secondary school students are in a transition developmentally and this comes with its debilitating effects such as risky alcohol use … These short term effects of alcohol could lead to poor academic performance, poor health and disruptive social behavior.

It could be alcohol poisoning or an alcohol-related injury or both.

The data mining technique we think is suitable is classification. We think that classification is the best data mining technique to be employed because we can build a classification model to …

… column 33 (final grade).

We would think that if the value for health is lower, the value for their …

For categorical values, we use Cramér’s V. For numeric values, we use the eta-squared value.

… school period grades are available.

The effects of alcohol use on academic achievement in high school.

The datasets have a total of 33 attribute columns, on which we could do some column selection based on certain parameters.

Derived output: Alc = (Walc × 2 + Dalc × 5) / 7, again in the range of 1 – 5.

… in a student environment as well as their demographic information and other data that may be of some relevance.
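
The text names Cramér's V as the association measure for categorical values. As a rough illustration of what that statistic computes, here is a minimal sketch in plain Python, without bias correction. The function name `cramers_v` and the toy sequences are assumptions made for this example, not part of the original analysis.

```python
from collections import Counter
from math import sqrt

def cramers_v(xs, ys):
    """Cramér's V between two equally long categorical sequences (no bias correction)."""
    n = len(xs)
    x_counts, y_counts = Counter(xs), Counter(ys)
    pair_counts = Counter(zip(xs, ys))
    chi2 = 0.0
    for x, nx in x_counts.items():
        for y, ny in y_counts.items():
            expected = nx * ny / n          # count expected under independence
            observed = pair_counts.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(x_counts), len(y_counts)) - 1
    return sqrt(chi2 / (n * k)) if k > 0 else 0.0

# Made-up values: a perfectly associated pair and an independent pair.
romantic = ["yes", "yes", "no", "no"]
drinker_a = ["high", "high", "low", "low"]
drinker_b = ["high", "low", "high", "low"]
print(cramers_v(romantic, drinker_a), cramers_v(romantic, drinker_b))
```

Values near 0 indicate little association and values near 1 indicate strong association, regardless of how many levels each feature has.
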
The Core Survey helps us determine the patterns of alcohol and other drug consumption and examine attitudes and perceptions of alcohol and other drug use among Northwestern students. As you will see in the data, on average, our campus sends at least one student to the emergency room per week who is in some kind of trouble connected with alcohol.

… because it would be less accurate for the classification model to predict a numeric value ranging from 0-20.

… the passing mark for a student in Portugal would be 10 out of 20.

We remove skewness by applying a log, square root, and/or inverse transformation.

You can see the level of correlation by the degree of the ellipse.

In our data set, many of the categorical features are numeric, but for this illustration, we will continue treating them as categorical.

For a student to pass the subject, there are a couple of factors that could be correlated with the outcome.

However, research conducted in the United States by Balsa (2011) showed that increases in levels of alcohol consumption only resulted in small …

Thus, their final grade would be the perfect measure of …

It does not state the level of intimacy between them. Assuming the romantic relationship in our dataset is of an intimate level, we can find out if this statement holds true.

The box plot portion of the graph also helps us identify outliers. We will take a closer look at the distribution of this feature.

For this analysis, we combine the rows of the data sets.

As a direct outcome of this research, more efficient student prediction …

World Health Organization (WHO).

… predict if a student will get a passing grade based on the factors mentioned above.

The original data comes from a survey conducted by a professor in Portugal. The dataset was built from two sources: school reports and questionnaires. The data that we will explore has 1044 rows and 33 columns.

… by Dinescu et al.
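
The pass mark of 10 out of 20 mentioned above can be turned into a binary class label from the final grade G3. This is a minimal sketch, assuming that a grade of exactly 10 counts as a pass and that the labels are the strings "pass" and "fail". Both choices, and the helper name `g_status`, are assumptions, since the text does not spell them out.

```python
PASS_MARK = 10  # pass mark out of 20, as stated in the text

def g_status(g3, pass_mark=PASS_MARK):
    """Map a final grade (0 to 20) to a binary class label."""
    if not 0 <= g3 <= 20:
        raise ValueError(f"G3 must be between 0 and 20, got {g3}")
    # Assumption: a grade equal to the pass mark counts as a pass.
    return "pass" if g3 >= pass_mark else "fail"

grades = [0, 9, 10, 15, 20]  # made-up final grades
labels = [g_status(g) for g in grades]
print(labels)
```

A binary label like this gives the classifier two classes to separate instead of 21 possible grade values.
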
… consensus is that students who consume alcohol at high levels tend to skip more classes and perform worse in their studies, thus resulting in lower …

Since the dataset is called “Student Alcohol Consumption”, of course, we should do some analyses on it.

activities (column 19), romantic (column 23), famrel (column 24), goout (column 26), Dalc (column 27), Walc (column 28)

Retrieved from http://www.euroeducation.net/prof/porco.htm.

Examples of …

Five columns play a major role in this, which are: column 27 (workday alcohol consumption) …

One way would be to create a new feature, FeduMedu, where the value is Medu * 10 + Fedu, and keep FeduMedu categorical.

… obtain more accurate insights.

Balsa, A. I., Giuliano, L. M., & French, M. T. (2011).

… need to take column 23 (romantic), column 27 (workday alcohol consumption) and/or column 28 (weekend alcohol consumption) into consideration.

The results make sense.

This will be explained in the next section (Section C).

The violin plot of absences shows more of a log-normal distribution, and a large number of outliers lie well outside of the top whisker.

Excessive alcohol use, either in the form of binge drinking (drinking 5 or more drinks on an occasion for men or 4 or more drinks on an occasion for women) or heavy drinking (drinking 15 or more drinks per week for men or 8 or more drinks per week for women), is associated with an increased risk of many health problems, such as liver disease and unintentional injuries.

I will be utilizing the student alcohol consumption dataset provided by UCI Machine Learning, which is available in their machine learning repository.
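
The FeduMedu feature described above (Medu * 10 + Fedu, kept categorical) could be built as follows. This sketch assumes that Medu and Fedu are integer codes from 0 to 4, which the attribute list quoted in this text does not show, and it returns a zero-padded string so that the result is not mistaken for a numeric quantity. The function name `fedu_medu` is hypothetical.

```python
def fedu_medu(medu, fedu):
    """Combine the mother's and father's education codes as Medu * 10 + Fedu."""
    for name, value in (("Medu", medu), ("Fedu", fedu)):
        # Assumption: both codes are integers in the range 0..4.
        if value not in range(5):
            raise ValueError(f"{name} must be an integer from 0 to 4, got {value}")
    # Returned as a two-character string to keep the feature categorical.
    return f"{medu * 10 + fedu:02d}"

print(fedu_medu(4, 2), fedu_medu(0, 3))
```

Because the tens digit carries Medu and the units digit carries Fedu, every combination of the two codes maps to a distinct category.
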
The traditional consensus is that students who consume alcohol at high levels …

To compare categorical variables, correlations shouldn’t be used unless the underlying values are ordinal (e.g., going out with friends [numeric: from 1 – very low to 5 – very high]).

Alcohol Abuse and Dependence: Roughly 20 percent of college students meet the criteria for an alcohol use disorder in a given year (8 percent alcohol abuse, 13 percent alcohol dependence). Many students in college experiment with drugs and alcohol, and sometimes these two things negatively affect their academic performance.

… drinking alcohol for consolation.

The target is the weekday drinking level 1 to 5 and the weekend drinking level 1 to 5.

… courses of mathematics and Portuguese.

It can develop a plethora of emotions in oneself, may it be a positive or negative …

Best part, these are all free, free…

The data collected, in locations such as Gabriel Pereira and Mousinho da Silveira, includes several values of pertinence.

You may want to explore combining the grades into one feature since G3 is likely derived from G1 and G2.

This information can give you a hint of the skewness and of possible outliers.

27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)

Short exploratory data analysis focusing on the alcohol variables from the Portuguese school dataset. Testing correlation between alcohol consumption and social, gender, study time, and grade attributes for each student.

When lambda = 0, the log transform is used.
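
The remark that lambda = 0 corresponds to the log transform refers to the Box-Cox family of transforms. The sketch below shows the transform itself together with a simple population skewness measure, using made-up absence counts. Shifting the values by 1 so that zeros become valid inputs is an assumption of this example, as are the function names. The sketch does not estimate lambda; SciPy's `scipy.stats.boxcox` can do that when a fitted parameter is needed.

```python
from math import log

def box_cox(x, lam):
    """Box-Cox transform of a positive value x for parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox needs x > 0")
    # lam == 0 is the limiting case and reduces to the natural log.
    return log(x) if lam == 0 else (x ** lam - 1) / lam

def skewness(values):
    """Population skewness: third central moment divided by variance ** 1.5."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

absences = [0, 0, 1, 2, 2, 3, 4, 6, 10, 30]  # made-up, right-skewed counts
shifted = [a + 1 for a in absences]          # assumption: shift by 1 so zeros are allowed
logged = [box_cox(v, 0) for v in shifted]
print(round(skewness(absences), 2), round(skewness(logged), 2))
```

On this toy sample the log transform brings the skewness much closer to zero, which is the effect the transformation step is meant to achieve.
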
The original data contains the following attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets: …

The following grades are related with the course subject, Math or Portuguese: …

Before exploration, we combine the rows of the two data sets and mark each instance with the class in which the survey was taken.

Be sure to change the type of field delimiter (“;”) and line delimiter (“\n”), and check the Extract Field Names checkbox. We don’t need the G1 and G2 columns, so let’s drop them.

… recorded to have participated.

Its value for the week is normalized as (workday_alcohol_consumption × 5 + weekend_alcohol_consumption × 2) / 7. If the value is greater than 3.0, then alcohol consumption is considered too high.

Last but not least, we can also obtain insights on health issues and drinking alcohol.

Alcohol is an often abused substance that troubles many individuals in their adulthood as they struggle to cope with emotional and physical stress that …

… consumption) and/or column 28 (weekend alcohol consumption).

We shall see which consensus holds true.

We only do this for illustration. The original values for the feature ‘absences’ will be used in the remaining sections.

The types of columns are listed as follows: …

One way to get an idea about the structure of the data is to calculate basic statistics, such as the min, max, mean, and median, and missing value counts.

For example, if there were a high correlation, say 0.9, between two numeric features, then the information provided to the model would be redundant and, depending on the model, would make the model more complex than it needs to be.

… intimate, they will drink less.
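
The weekly normalization described above weights the workday level by 5 days and the weekend level by 2 days, then divides by 7. A minimal sketch of that arithmetic and of the 3.0 cut-off follows. The function names are hypothetical, and the inputs are assumed to be the Dalc and Walc levels on the 1 to 5 scale.

```python
def weekly_alcohol(dalc, walc):
    """Weighted weekly level: 5 workdays at Dalc and 2 weekend days at Walc."""
    return (dalc * 5 + walc * 2) / 7

def drinks_too_much(dalc, walc, threshold=3.0):
    """True when the weekly level is strictly greater than the threshold."""
    return weekly_alcohol(dalc, walc) > threshold

# Worked examples on the 1..5 scale:
# Dalc=2, Walc=5 gives (10 + 10) / 7, about 2.86, which is not above 3.0.
# Dalc=3, Walc=4 gives (15 + 8) / 7, about 3.29, which is above 3.0.
print(round(weekly_alcohol(2, 5), 2), drinks_too_much(2, 5))
print(round(weekly_alcohol(3, 4), 2), drinks_too_much(3, 4))
```

Since both inputs lie between 1 and 5 and the weights sum to 7, the result also stays between 1 and 5.
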
Having recourse to the public health objective on alcohol by the World Health Organization, which is to reduce the health burden caused by the harmful use of alcohol, thereby saving lives and reducing injuries, this data article explored the nature of alcohol use among college students, binge drinking and the consequences of alcohol consumption.

… (romantic), only gives information on whether or not the student has a partner.

The students included in the survey were in the …

It gives you data about … workday alcohol consumption, weekend alcohol consumption and their family relationship.

… workday and/or weekend alcohol consumption would also be lower.

The following results show the skewness for the numeric features: … As we suspected, the feature ‘absences’ contains the most skew.

/r/datasets. You can browse the subreddit here.

2014.

First, open the student-por.csv file in the student_performance source.

… replication of data.

https://archive.ics.uci.edu/ml/datasets/STUDENT%20ALCOHOL%20CONSUMPTION.

Dinescu, D., Turkheimer, E., Beam, C. R., Horn, E. E., Duncan, G., Emery, R. E. Journal of Family Psychology, Vol 30(6), Sep 2016, 698-707.

… such data are records of demographic information, grades, and alcohol consumption.

This dataset was collected in order to study alcohol consumption in young people and its effects on students’ academic performance.

By stacking one data set on top of the other, we assume that each instance is not an instance for the student, but an instance of when the student completed the survey. We could perform this merge differently later: by performing a full join and then dealing with the NA values, by performing the analysis on the individual sets, or by inner joining the two sets and just working with that data.

The traditional …

We test the null hypothesis (h0) that the numeric variable has the same mean values across the different levels of the categorical variable.
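
The difference between stacking the two data sets and inner joining them can be sketched with plain dictionaries. Everything concrete here is an assumption made for illustration: the tag column name `course`, the key columns used for matching, and the toy rows. The text does not say which attributes identify the same student across the two files.

```python
def stack(math_rows, por_rows):
    """Stack both sets, tagging each row with the course it came from."""
    return ([{**r, "course": "math"} for r in math_rows]
            + [{**r, "course": "por"} for r in por_rows])

def inner_join(math_rows, por_rows, keys):
    """Pair up rows from both sets that agree on every key column."""
    index = {}
    for r in por_rows:
        index.setdefault(tuple(r[k] for k in keys), []).append(r)
    joined = []
    for m in math_rows:
        for p in index.get(tuple(m[k] for k in keys), []):
            joined.append({"math": m, "por": p})
    return joined

# Toy rows (made-up values, hypothetical key columns).
math_rows = [{"school": "GP", "sex": "F", "age": 18, "G3": 6},
             {"school": "GP", "sex": "M", "age": 17, "G3": 10}]
por_rows = [{"school": "GP", "sex": "F", "age": 18, "G3": 11},
            {"school": "MS", "sex": "F", "age": 16, "G3": 13}]

stacked = stack(math_rows, por_rows)
joined = inner_join(math_rows, por_rows, keys=("school", "sex", "age"))
print(len(stacked), len(joined))
```

Stacking keeps every survey response as its own instance, while the inner join keeps only students that appear in both files and pairs their two responses.
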
For the data exploratory exercise, we choose to examine four columns: workday alcohol consumption, first period grade, second period grade and final grade.

This may not hold true because it is a possibility that the …

Remove the skewness from the numeric data.

This analysis was done as part of fulfilling the Data Mining course at Multimedia University. People who contributed to this were Aaron Patrick Nathaniel, Lim Yue Hng (Neil) and …

At an alcohol consumption level of 1, the median and 25th percentile are the same value of 2 hours of study.

Our main goal is to use data mining to predict school student alcohol consumption and to find the significant factors.

… following: Figure 1 illustrates the high-level description of our classification.

We prefer to use some sort of configuration so that we can input any dataset and perform most of the same analysis.

… information about the students from the mathematics course only.

There are a few columns which we think could be further clarified or changed.

Therefore, researchers seek to rectify that lack by conducting a survey to obtain important raw data on alcohol consumption …

Since the main purpose of the dataset is to find correlations between students with their alcohol consumption patterns, the most conspicuous relationship would be the relationship between their grades with respect to their workday and weekend alcohol consumption.

A twin study of marital status and alcohol consumption.

To obtain insights on this, we could refer to column 29 (health), column 27 …

The dataset which we will be exploring will be the dataset containing …

We could take into consideration the …

Our explanation would be more focused on the final grade because we think that students will be …

However, the assumption is that the alcohol consumption is high because the student's …