StudyNp | CSIT Seventh Semester – 2081 Question for Data Warehousing and Data Mining

Tribhuvan University

Institute of Science and Technology

2081

Bachelor Level / Fourth Year / Seventh Semester / Science

B.Sc in Computer Science and Information Technology (CSC420)

(Data Warehousing and Data Mining)

Full Marks: 60

Pass Marks: 24

Time: 3 Hours

Candidates are required to give their answers in their own words as far as practicable.

The figures in the margin indicate full marks.

Section A

Long Answer Questions

Attempt any TWO questions.

[2*10=20]

When do we prefer the trimmed mean for statistical description of data? Justify with an example. Describe the multidimensional data model and the conceptual modeling of a data warehouse. [10]

How do you generate strong association rules? From the following dataset, find the frequent itemsets using the FP-growth algorithm with 3 as the minimum support.

\begin{array}{c|c} \text{Transaction ID} & \text{Items} \\ \hline \text{T1} & \{K, E, M, O, Y\} \\ \text{T2} & \{K, E, O, Y\} \\ \text{T3} & \{K, E, M\} \\ \text{T4} & \{K, M, Y\} \\ \text{T5} & \{K, E, O\} \\ \end{array}

[10]

Define overfitting and underfitting. Train a decision tree classifier using the ID3 algorithm on the following training data.

\begin{array}{c|c|c|c} \text{TID} & \text{Age} & \text{Car Type} & \text{Class} \\ \hline 1 & \leq 30 & \text{Family} & \text{High} \\ 2 & \leq 30 & \text{Sports} & \text{High} \\ 3 & >30 & \text{Sports} & \text{High} \\ 4 & >30 & \text{Family} & \text{Low} \\ 5 & >30 & \text{Truck} & \text{Low} \\ 6 & \leq 30 & \text{Family} & \text{High} \\ \end{array}

[10]

Section B

Short Answer Questions

Attempt any EIGHT questions.

[8*5=40]

Using the k-means++ algorithm and Euclidean distance, find the initial 3 cluster centroids from the points A1 = (3, 11), A2 = (3, 6), A3 = (9, 5), A4 = (6, 9), A6 = (7, 5), A7 = (2, 3), A8 = (5, 10). Choose (3, 11) as one of the initial centroids. [5]

Explain the general strategies for cube computation. [5]

Distinguish between data characterization and data discrimination. What are the challenges of multimedia mining? [5]

Define graph mining. Discuss the conflict between the theory of balance and the theory of status. [5]

What is a support vector? How do you evaluate the accuracy of a classifier? Describe. [5]

Differentiate between the k-means and k-medoids clustering algorithms. [5]

10. List any two OLAP operations with examples. How do you compute rule coverage and rule accuracy? [5]

11. Define link mining. What are the roles of epsilon and MinPts in DBSCAN? [5]

12. Describe any two methods of handling noisy data. [5]