The methodology of Data Mining. An application to alcohol consumption in teenagers

Authors

  • Elena Gervilla García, Area of Methodology of the Behavioural Sciences, Department of Psychology, Universitat de les Illes Balears.
  • Rafael Jiménez López, Area of Methodology of the Behavioural Sciences, Department of Psychology, Universitat de les Illes Balears.
  • Juan José Montaño Moreno, Area of Methodology of the Behavioural Sciences, Department of Psychology, Universitat de les Illes Balears.
  • Albert Sesé Abad, Area of Methodology of the Behavioural Sciences, Department of Psychology, Universitat de les Illes Balears.
  • Berta Cajal Blasco, Area of Methodology of the Behavioural Sciences, Department of Psychology, Universitat de les Illes Balears.
  • Alfonso Palmer Pol, Area of Methodology of the Behavioural Sciences, Department of Psychology, Universitat de les Illes Balears.

DOI:

https://doi.org/10.20882/adicciones.253

Keywords:

Artificial Neural Networks, Decision Trees, Naive Bayes, Association Rules, alcohol

Abstract

This paper aims mainly to introduce researchers in the field of drug addiction to a data analysis methodology geared towards knowledge discovery in databases (KDD). KDD is a process made up of a series of phases, the most characteristic of which is data mining (DM), in which different modelling techniques are applied to detect patterns and relationships in the data. The common and distinguishing features of the most widely used DM techniques are analysed, mainly from a methodological viewpoint, and their use is illustrated with data on alcohol consumption in teenagers and its possible relationship with personality variables (N = 7030). Although overall accuracy (percentage of correct predictions) is very similar across the three models analysed, the Artificial Neural Network (ANN) yields the most accurate model (64.1%), followed by Decision Trees (DT) (62.3%) and Naïve Bayes (NB) (59.9%).
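As a rough illustration of the comparison criterion mentioned in the abstract, overall accuracy is simply the percentage of cases whose predicted class matches the observed class. The sketch below is not the authors' code: the function name `overall_accuracy`, the labels, and the per-model predictions are all made up for illustration and do not reproduce the study's data or its reported figures.

```python
from typing import Sequence


def overall_accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Return the percentage of predictions that match the observed labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one case is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


# Hypothetical labels for illustration only; these are NOT the study's data.
observed = ["yes", "no", "yes", "yes", "no", "no", "yes", "no"]

# Hypothetical predictions from three already-fitted models.
predictions = {
    "ANN": ["yes", "no", "yes", "no", "no", "no", "yes", "yes"],
    "DT": ["yes", "no", "no", "no", "no", "no", "yes", "yes"],
    "NB": ["no", "no", "no", "no", "no", "yes", "yes", "yes"],
}

# Score every model on the same cases, then rank from most to least accurate.
accuracy = {name: overall_accuracy(observed, pred) for name, pred in predictions.items()}
ranking = sorted(accuracy, key=accuracy.get, reverse=True)

for name in ranking:
    print(f"{name}: {accuracy[name]:.1f}% correct predictions")
```

Scoring all models on the same held-out cases is what makes such percentages comparable. A single overall figure hides how errors are distributed between classes, so it is usually read alongside a confusion matrix.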

Published

2009-03-01

Section

Originals