Knowledge Discovery in Databases

Lecturer (Coordinator):
Juan Pedro Caraça-Valente
jpvalente@fi.upm.es
Lecturer:
Aurora Pérez
aurora@fi.upm.es

Semester

First semester

Credits

4 ECTS

Outline

Techniques for knowledge discovery (or data mining) in large volumes of information are widely used today in domains such as medicine, banking and industrial systems. Their applications include data analysis, fraud detection, risk analysis and mailing campaigns.

This subject will review all the stages of the knowledge discovery process and list the most important techniques for each stage. It will highlight data cleaning and preprocessing techniques, which are often overlooked.

It will then address major data mining techniques, including classification, clustering and association rules. It will also explore genetic algorithms, which have become very popular in recent years and have often been applied in the field of knowledge discovery.

There is a recent trend towards building temporal information into large databases in order to preserve historical information, analyse the evolution of a variable or determine when a data item is valid. Additionally, there are domains where the information mainly takes the form of time series, and such domains require specialized treatment. This subject addresses information discovery techniques for time series, as this data type poses a major challenge to traditional data mining techniques and calls for new solutions.

Learning Goals

  • Be aware of and know how to apply all the knowledge discovery process stages and the major techniques of each stage in a particular domain
  • Know how to analyse a domain (problem, data and goals) to determine its key characteristics and how they influence the choice of data mining techniques
  • Be aware of data mining techniques and know how to apply them to specific problems
  • Evaluate the operation and results of a knowledge discovery system

Syllabus

  1. Introduction
    1. Data types
    2. Basic concepts
  2. Knowledge discovery process
    1. Knowledge discovery process stages
    2. Data preprocessing
  3. KDD Tools
    1. Background
    2. A KDD tool: WEKA
  4. Data mining techniques
    1. Classification
    2. Clustering
    3. Genetic Algorithms
    4. Time Series Techniques
  5. Evaluation
    1. Objectives
    2. Verification techniques

Recommended reading

  • WEKA. http://www.cs.waikato.ac.nz/ml/weka/
  • J. Han and M. Kamber: "Data Mining: Concepts and Techniques". Morgan Kaufmann. 2006.
  • M. Kantardzic: "Data Mining: Concepts, Models, Methods, and Algorithms". John Wiley & Sons. 2003.
  • U. Fayyad, G. Piatetsky-Shapiro and P. Smyth: "From Data Mining to Knowledge Discovery in Databases". 1996.

Assessment Method

Assessment of this subject is based on two components: continuous assessment (attendance and participation in class) and the Data Mining Project.

Continuous assessment takes into account class attendance, the student's active participation and the graded exercises set in class.

The Data Mining Project will be evaluated according to the three phases described below and their corresponding weights.

Data Mining Project

This project will be done individually or in groups of two. The work will be done incrementally and presented in the following phases:

  • Phase 1: students will choose a domain whose data they have access to, analyse its characteristics and establish the objectives to be achieved through the Data Mining Project. They will write a report indicating the tasks to be carried out in each stage of the knowledge discovery process according to the specific needs of the domain and the objectives.
  • Phase 2: using a knowledge discovery software tool, students will apply data mining algorithms to the data of their domain. In addition, they will analyse the limitations of the algorithms available in the tool and possible improvements.
  • Phase 3: an evaluation plan will be made to assess the results obtained and the plan will be executed.

The three deliverables of the Data Mining Project are mandatory and will be evaluated.

Grading criteria

The Data Mining Project will be presented in class. Each group will have 15 minutes for the oral presentation plus 5 minutes for questions.

Qualification standards

The subject is graded out of 10 points: 3 points for continuous assessment and 7 for the Data Mining Project. To pass the subject, students must attend at least 70% of the classes and obtain a final grade of at least 5 points.

Tuition language

English

Subject-Specific Competences

Code, description and proficiency level for each subject-specific competence
Code | Competence | Proficiency level
CEM2 | Acquisition of an advanced level of knowledge in order to analyse and synthesize solutions to problems requiring innovative approaches to the definition of the computational infrastructure, processing and analysis of heterogeneous data types | S
CEM7 | Knowledge of the theoretical foundations and training in the many available techniques for knowledge extraction and discovery from large datasets and related research topics | S

Learning Outcomes

Code, description and proficiency level for each subject learning outcome
Code | Learning outcome | Associated competences | Proficiency level
RA-APDI-68 | Be able to analyse a domain to determine the relevance of its temporal characteristics and the knowledge discovery tasks worth undertaking | CEM2, CEM7 | S
RA-APDI-69 | Be able to use knowledge discovery techniques and their applicability in each case | CEM2, CEM7 | S
RA-APDI-70 | Be able to conduct a complete evaluation of the operation and usefulness of such a project | CEM2, CEM7 | S

Learning Guide

Learning Guide: Knowledge Discovery in Databases