World University Names Database
I’ve been searching for university names database, but i didn’t find complete database that covers at least universities in 240 countries. I decide to crawl university names in LinkedIn and Webometric, since both give university names based on country code. If you are in hurry to get university names database, you can download them here. If you are interested to know how to crawl the LinkedIn and Webometric university names data and build the database, then hang on with me.
I used Python to crawl LinkedIn and Webometric. It was accidental, when i edit my profile on my LinkedIn, firebug shows ajax get request to populate university names on select element. So here is my python code to get LinkedIn university names based on country code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 | import urllib from xml.dom import minidom import MySQLdb WS_URL = 'http://www.linkedin.com/wsSchoolDir?q=&country=%s' DB_HOST = 'localhost' DB_NAME = 'bell' DB_USER = 'root' DB_PASSWD = '123456' TBL_COUNTRY = 'countries' COUNTRY_ID = 'id' COUNTRY_CODE = 'iso2' TBL_SCHOOL = 'universities' def buildSchools(): conn = MySQLdb.connect( host = DB_HOST, user = DB_USER, passwd = DB_PASSWD, db = DB_NAME, charset = 'utf8' ) print "Creating table %sn" % TBL_SCHOOL cursor = conn.cursor() cursor.execute("DROP TABLE IF EXISTS %s" % (TBL_SCHOOL)) cursor.execute("CREATE TABLE %s(`id` INT(11) NOT NULL AUTO_INCREMENT,`country_id` INT(5) NOT NULL,`name` VARCHAR(150) NOT NULL, PRIMARY KEY (`id`), KEY `country_id` (`country_id`)) ENGINE=MyISAM DEFAULT CHARSET=utf8" % (TBL_SCHOOL)) print "Get countries from table %snn" % TBL_COUNTRY cursor.execute("SELECT %s, %s FROM %s WHERE 1" % (COUNTRY_ID, COUNTRY_CODE, TBL_COUNTRY)) countries = cursor.fetchall() print "Trying to build schools..n" for country in countries: print "Populate schools in %sn" % country[1] schools = getSchools(country[1].lower()) for school in schools: if len(school.getAttribute('v')) > 0 and school.getAttribute('v') != '0': univ = school.childNodes[0].data.replace('"', '\"') cursor.execute('''INSERT INTO %s VALUES(NULL, %d, "%s")''' % (TBL_SCHOOL, country[0], univ)) print "nEnd building schools." cursor.close() conn.close() def getSchools(code): dom = minidom.parse(urllib.urlopen(WS_URL % code)) return dom.getElementsByTagName('s') if __name__ == "__main__": buildSchools() |
I got 9,229 university names from LinkedIn. In case you are curious how the code operates, it just iterate country code (2 digit code) in my countries table and requests the web service URL based on that country code. The web service will return XML containing universities name and, i guest, their id. The code then parse the XML, iterates each university names and insert it to table universities. Webometric is more complete than LinkedIn, I got 16,773 university names. Webometric provides catalogue of world universities consisting html pages of university names based on country code. Basically, crawling proccess is the same, request each html page of country and parse the university name and, this is the bonus, university url. Each country might be more than one page consisting 50 universities per page. This time i used SGMLParser to parse html page. So here’s the code:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 | from sgmllib import SGMLParser from urllib2 import urlopen, Request, BaseHandler from httplib import BadStatusLine import time import MySQLdb WS_URL = 'http://www.webometrics.info/university_by_country.asp?country=%s' DB_HOST = 'localhost' DB_NAME = 'bell' DB_USER = 'root' DB_PASSWD = '123456' TBL_COUNTRY = 'countries' COUNTRY_ID = 'id' COUNTRY_CODE = 'iso2' TBL_SCHOOL = 'webometric_universities' class UnivParser(SGMLParser): domain = 'http://www.webometrics.info/' path = '/university_by_country.asp?country=%s' univ = [] errorURL = [] insideRowData = 0 insideColOfUnivName = 0 passedColOfUnivName = 0 insideColOfUnivLink = 0 insideUnivLink = 0 currentUniv = {} sleepCount = 1; insideColNav = 0 insideLinkOfNav = 0 currentLinkOfNav = '' nextPage = '' pageNumber = 1 endOfPage = 0 def __init__(self): SGMLParser.__init__(self, verbose=0) def parse(self, url): self.univ = [] self.pageNumber = 1 self.goToNextPage(url) return self.univ def goToNextPage(self, url): print "crawl page %d" % self.pageNumber self.insideRowData = 0 self.insideColOfUnivName = 0 self.passedColOfUnivName = 0 self.insideColOfUnivLink = 0 self.insideUnivLink = 0 self.currentUniv = {} self.insideColNav = 0 self.insideLinkOfNav = 0 self.currentLinkOfNav = '' self.nextPage = '' self.endOfPage = 0 try: req = urlopen(url) except IOError, e: print "Oops, we got HTTPError." reason = "" if hasattr(e, 'reason'): reason = 'Failed to reach a server. Reason: %d' % e.reason elif hasattr(e, 'code'): reason = 'The server couldn't fulfill the request. Error code: %d' % e.code print reason self.errorURL.append({'url': url, 'reason': reason}) return except BadStatusLine, e: reason = "" if hasattr(e, 'reason'): reason = 'Failed to reach a server. Reason: %d' % e.reason elif hasattr(e, 'code'): reason = 'The server couldn't fulfill the request. Error code: %d' % e.code print reason self.errorURL.append({'url': url, 'reason': reason}) return finally: self.sleepCount = 1 self.feed(req.read()) req.close() if len(self.nextPage): self.goToNextPage(self.domain + self.nextPage) def __sleep(self): time.sleep(0.5 * self.sleepCount) def getErrorURL(self): return self.errorURL def start_tr(self, attrs): if self.insideRowData == 0: for name, val in attrs: if name == 'class' and val == 'nav6a': self.insideRowData = 1 def start_td(self, attrs): if self.insideRowData: if self.passedColOfUnivName == 0: self.insideColOfUnivName = 1 else: self.insideColOfUnivLink = 1 else: # inside col of nav for name, val in attrs: if name == 'class' and val == 'nav6a': self.insideColNav = 1 def start_a(self, attrs): if self.insideColOfUnivName: self.insideUnivLink = 1 elif self.insideColNav: for name, val in attrs: if name == 'class' and val == 'nav6a': self.insideLinkOfNav = 1 if name == 'href': self.currentLinkOfNav = val def end_tr(self): if self.insideRowData: self.insideRowData = 0 self.passedColOfUnivName = 0 self.univ.append( self.currentUniv ) self.currentUniv = {} def end_td(self): if self.insideRowData: if self.insideColOfUnivName: self.insideColOfUnivName = 0 self.passedColOfUnivName = 1 else: self.insideColOfUnivLink = 0 elif self.insideColNav: self.insideColNav = 0 def end_a(self): if self.insideUnivLink: self.insideUnivLink = 0 elif self.insideLinkOfNav: self.insideLinkOfNav = 0 self.currentLinkOfNav = '' def handle_data(self, data): if self.insideUnivLink: self.currentUniv['name'] = unicode( data.strip().replace('"', '\"'), 'latin-1') print self.currentUniv['name'] elif self.insideColOfUnivLink: self.currentUniv['link'] = u'%s' % data.strip() elif self.insideLinkOfNav and data.lower() == 'next' and self.nextPage == '': self.nextPage = self.currentLinkOfNav self.pageNumber += 1 def buildSchools(): conn = MySQLdb.connect( host = DB_HOST, user = DB_USER, passwd = DB_PASSWD, db = DB_NAME, charset = 'latin1' ) print "Creating table %sn" % TBL_SCHOOL cursor = conn.cursor() cursor.execute("DROP TABLE IF EXISTS %s" % (TBL_SCHOOL)) cursor.execute("CREATE TABLE %s(`id` INT(11) NOT NULL AUTO_INCREMENT,`country_id` INT(5) NOT NULL,`name` VARCHAR(150) NOT NULL, `url` VARCHAR(150) NOT NULL, PRIMARY KEY (`id`), KEY `country_id` (`country_id`)) ENGINE=MyISAM DEFAULT CHARSET=utf8" % (TBL_SCHOOL)) print "Get countries from table %snn" % TBL_COUNTRY cursor.execute("SELECT %s, %s FROM %s WHERE 1" % (COUNTRY_ID, COUNTRY_CODE, TBL_COUNTRY)) countries = cursor.fetchall() w = UnivParser() print "Trying to build schools from webometric..n" for country in countries: print "nPopulate schools in %s" % country[1] schools = w.parse(WS_URL % country[1].lower()) for school in schools: cursor.execute('''INSERT INTO %s VALUES(NULL, %d, "%s", "%s")''' % (TBL_SCHOOL, country[0], school['name'], school['link'])) print "nEnd building schools." """ print error """ for err in w.getErrorURL(): print "%sn" % err w.close() cursor.close() conn.close() def insertSchools(url, cc): conn = MySQLdb.connect( host = DB_HOST, user = DB_USER, passwd = DB_PASSWD, db = DB_NAME, charset = 'latin1' ) cursor = conn.cursor() cursor.execute("SELECT %s FROM %s WHERE iso2 = '%s'" % (COUNTRY_ID, TBL_COUNTRY, cc.upper())) country = cursor.fetchall() w = UnivParser() schools = w.parse(url) for school in schools: cursor.execute('''INSERT INTO %s VALUES(NULL, %d, "%s", "%s")''' % (TBL_SCHOOL, country[0][0], school['name'], school['link'])) print "nEnd building schools." """ print error """ for err in w.getErrorURL(): print "%sn" % err w.close() cursor.close() conn.close() if __name__ == "__main__": buildSchools() |
Sorry, if the code is messed up. Since i didn’t care about error handling, you might encounter some problems when running the scripts.
Seminar Go Open Source di SMA Sudirman Bekasi svn:ignore yang menjengkelkan
Back to top