Information Gain (= Mutual Informaion), conditional entropy, lift

일/Data Mining

Information Gain (= Mutual Informaion), conditional entropy, lift

LEEHK 2016. 4. 21. 19:02

Information Gain (= Mutual Informaion)

* 관련해서 과거에 적었던 포스트 : Information Theory : Entropy, KL-divergence (Cross Entorpy), Mutual Information

1. 정의

: 서로 다른 두 랜덤벡터 x,y에 대한 의존도를 측정한다.

* 비고 : kl - divergence 정의 - 같은 랜덤 벡터 x의 서로 다른 두 확률분포 간 거리를 측정.

2. 의미.

: 두 랜덤벡터 x,y가 독립일 때와, 독립이 아닐때의 비교.

: p(x.y) 와 p(x).p(y)의 kl - divergence

* 둘이 독립이라면 p(x.y) = p(x).p(y) 이므로.

* p(x.y) 는 p(x교집합y)를 약식 표기한 것.

: IG(X;Y) = H(Y) - H(Y|X) = H(X) - H(X|Y)

* 둘이 독립이면 H(Y|X) = H(Y)

* H(Y|X) : Expected number of bits to transmit Y if both side will know the value of X

: 독립이면 IG=0, 연관이 높으면 IG>0, 연관이 낮으면 IG<0

* 비고 : 독립이면 LIFT = 1. 연관이 높으면 LIFT>1. 연관이 낮으면 0<LIFT<1. LIFT = P(B|A) / P(B) = P(A|B) / P(A)

3. 수식 증명.

4. Information Gain 계산하는 파이썬 코드

#!coding=utf8
#/usr/bin/python
# all the imports
import sys
import math


def Entropy(dict):
    #Calculate entropy
    # dict : {key1:cnt,key2:cnt, ... }

    Entropy = 0.0
    totcnt = 0

    for key in dict:
        totcnt += dict[key]

    for key in dict:
        probs = dict[key]/float(totcnt)
        Entropy_part = -probs*math.log(probs)
        Entropy += Entropy_part
        #print probs, Entropy_part, Entropy

    return Entropy


def info_gain(dict):
    #Calculate information gain
    # dict : { (x1,y1):cnt,(x2,y2):cnt, ... (xn,yn):cnt }

    # IG(X;Y) = H(Y) - H(X|Y)                                    --- calculation 1)
    #         = ΣxΣy p(x,y) * log ( p(x,y) / ( p(x)*p(y) ) )     --- calculation 2)

    dic_x_cnt={}
    dic_y_cnt={}
    totcnt = 0

    for key in dict:
        totcnt += dict[key]

        if dic_x_cnt.has_key(key[0]):
            dic_x_cnt[key[0]] += dict[key]
        else:
            dic_x_cnt[(key[0])] = dict[key]

        if dic_y_cnt.has_key(key[1]):
            dic_y_cnt[key[1]] += dict[key]
        else:
            dic_y_cnt[(key[1])] = dict[key]

    # -------------------- calculation 1)
    Information_Gain = 0.0
    Conditionanl_entropy = 0.0

    # Conditionanl_entropy : H(X|Y) = Σj p(x=vj) * H(Y|x=vj)
    for var_x in dic_x_cnt:
        # 하둡 python UDF로 돌릴 때는 아래 문법이 안 먹힌다 -_-
        # filtered_dict = {(x,y):cnt for ((x,y),cnt) in dict.iteritems() if x == var_x}
        # 그래서 4줄로 풀어서 다시 씀
        filtered_dict = {}
        for var_xy in dict:
            if var_xy[0] == var_x:
                filtered_dict[var_xy] = dict[var_xy]

        p_x = dic_x_cnt[var_x]/float(totcnt)

        Conditionanl_entropy += p_x * Entropy(filtered_dict)

    Information_Gain = Entropy(dic_y_cnt) - Conditionanl_entropy

    '''
    # -------------------- calculation 2)
    IG = 0.0
    CE = 0.0
    for xy in dict:
        p_xy = dict[xy]/float(totcnt)
        p_x = dic_x_cnt[xy[0]]/float(totcnt)
        p_y = dic_y_cnt[xy[1]]/float(totcnt)

        CE += p_xy * math.log ( p_x / p_xy )
        IG += p_xy * math.log( p_xy / (p_x*p_y) )

    # 검증 : calculation 1)은 첫번쨰 줄로 구현  calculation 2)는 두번쨰 줄로 구현 둘이 같음을 확인함
    #print Entropy(dic_y_cnt)
    #print Conditionanl_entropy, Information_Gain
    #print CE, IG
    #0.0433636193042
    #0.0427109609299 0.00065265837421
    #0.0427109609299 0.00065265837421
    '''

    return Information_Gain

최근에 사용할 일이 있었는데, 남이 짜놓은 코드를 보는 건 정합성 체크가 어려워서 그냥 간만에 수식 다시 풀어보고 짰음.

기억이 사라지가 전에 기록해둠. 다음에 다시 일 할 때 참고하게. :)

'일 > Data Mining' 카테고리의 다른 글

Fundamental of Reinforcement Learning (0)	2017.08.25
모두를 위한 머신러닝/딥러닝 강의 (0)	2017.05.26
데이터 사이언티스트? (0)	2016.02.20
Triangle numbers (0)	2014.08.09
일의 의미. (0)	2014.03.07

현재글Information Gain (= Mutual Informaion), conditional entropy, lift

정을 붙이면 어디든 고향

Data Mining, 북경, 동방신기, 준수, 베이징, 자막, daum, 야스다 쇼타, 칸쟈니, 에이또,

Today :
Yesterday :

일	월	화	수	목	금	토
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31

정을 붙이면 어디든 고향