
Udacity Week5: Improving Search Engine Performance (Python)

This post continues my series on How To Build a Search Engine, a Udacity course. Week 1 dealt with extracting the first link from a web document, and week 2 with extracting all the links from a preloaded HTML source. Week 3 built a web crawler. I am skipping week 4 and moving on to week 5: improving search engine performance.

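Search engine performance can be improved with better data structures. Here the index is a Python dictionary (a hash table) that maps each keyword to a list of [url, click count] entries, so looking up a keyword does not require scanning every entry in the index.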
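As a rough illustration of why the dictionary helps, the sketch below contrasts a list-based index with a dictionary-based one. The helper names `list_lookup` and `dict_lookup` and the example.com URLs are made up for this comparison; they are not part of the course code.

```python
# Hypothetical helpers and placeholder URLs, for illustration only.

def list_lookup(index, keyword):
    # index is a list of [keyword, urls] pairs: each lookup scans the list,
    # so the cost grows with the number of keywords.
    for entry in index:
        if entry[0] == keyword:
            return entry[1]
    return None


def dict_lookup(index, keyword):
    # index is a dict mapping keyword -> urls: Python dicts are hash tables,
    # so a lookup takes roughly constant time on average.
    return index.get(keyword)


list_index = [['udacity', ['http://example.com/a']],
              ['search', ['http://example.com/b']]]
dict_index = {'udacity': ['http://example.com/a'],
              'search': ['http://example.com/b']}

print(list_lookup(list_index, 'search'))
print(dict_lookup(dict_index, 'search'))
print(dict_lookup(dict_index, 'missing'))
```

Both lookups return the same answer; the difference only shows up in running time once the index holds many keywords.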

import urllib

def get_page(url):
        # Download the page and return its raw HTML (Python 2: urllib.urlopen).
        x=urllib.urlopen(url)
        page=x.read()
        return page


def get_all_links(page):
        links=[]
        end_quote=0
        while(True):
                link=page.find('<a href=',end_quote)
                if(link!=-1):
                        start_quote=page.find('"',link) 
                        end_quote=page.find('"',start_quote+1) 
                        url=page[start_quote+1:end_quote]
                        links.append(url)
                else:
                        break
        return links

def union(p,q):
    for e in q:
        if e not in p:
            p.append(e)




def web_crawl(index,seed):
        crawled=[]
        tocrawl=[seed]
        while tocrawl:
                url=tocrawl.pop()
                if url not in crawled:
                    content=get_page(url)
                    # Index the page's words, then queue the links found on it.
                    add_page_to_index(index,url,content)
                    union(tocrawl,get_all_links(content))
                    crawled.append(url)
        return index


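def add_to_index(index,keyword,url):
        if keyword in index:
                # Skip urls already listed so a word repeated on a page
                # does not produce duplicate [url,clicks] entries.
                for entry in index[keyword]:
                        if entry[0]==url:
                                return
                index[keyword].append([url,0])
        else:
                index[keyword]=[[url,0]]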

def add_page_to_index(index,url,content):

    x=content.split()
    for e in x:
        add_to_index(index,e,url)




def lookup(index,keyword):
        if keyword in index:
                return index[keyword]
        else:
                return None

def record_user_click(index,keyword,url):
        if keyword in index:
                e=index[keyword]
                for x in e:
                        if x[0]==url:
                                x[1]+=1


index={}
seed='http://www.udacity.com/cs101x/index.html'
web_crawl(index,seed)


print lookup(index, 'good')
record_user_click(index, 'good', 'http://www.udacity.com/cs101x/crawling.html')
print lookup(index, 'good')
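
To try the indexing and click counting without hitting the network, here is a minimal sketch that feeds two made-up pages straight into the index. The page contents and the example.com URLs are invented, and the index functions are restated in condensed form so the block runs by itself (it is written to work under Python 3 as well).

```python
def add_to_index(index, keyword, url):
    # Each entry is [url, click_count].
    if keyword in index:
        index[keyword].append([url, 0])
    else:
        index[keyword] = [[url, 0]]


def add_page_to_index(index, url, content):
    # Split on whitespace and index every word of the page.
    for word in content.split():
        add_to_index(index, word, url)


def lookup(index, keyword):
    # Returns the list of [url, click_count] entries, or None if absent.
    return index.get(keyword)


def record_user_click(index, keyword, url):
    # Bump the click count of the matching url for this keyword.
    for entry in index.get(keyword, []):
        if entry[0] == url:
            entry[1] += 1


# Invented pages standing in for crawled content.
pages = {
    'http://example.com/a': 'good search engines',
    'http://example.com/b': 'good crawlers',
}

index = {}
for url in sorted(pages):
    add_page_to_index(index, url, pages[url])

# Copy the entries so the "before" snapshot is not mutated by the click.
before = [list(entry) for entry in lookup(index, 'good')]
record_user_click(index, 'good', 'http://example.com/b')
after = lookup(index, 'good')

print(before)
print(after)
```

After the recorded click, only the entry for the clicked URL should show a count of 1; the other entry stays at 0.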
