The China surface climate daily dataset V3.0 (中国地面气候资料日值数据集(V3.0)) is a common data source in remote sensing, GIS, hydrology, and climate-change research. Download page: 中国地面气候资料日值数据集(V3.0). The dataset contains daily observations of pressure, air temperature, precipitation, evaporation, relative humidity, wind direction and speed, sunshine duration, and 0 cm ground temperature from 699 reference and basic weather stations in China, starting in January 1951.
The dataset is distributed as TXT text files; each file holds the daily records of all stations for one month. The storage structure is simple, but the data still need further extraction and processing before use. For example, to extract precipitation for a given region and time period, you have to locate that region's stations in every file within the period, pull out the values, and write them to a formatted file. That is the purpose of this post: extracting the data quickly with a Python program.
Dataset description
Chinese name of the dataset: 中国地面气候资料日值数据集(V3.0)
Dataset code: SURF_CLI_CHN_MUL_DAY
Dataset version: V3.0
Dataset created: 20120804
Naming of the data TXT files:
Each data file name is composed of the dataset code (SURF_CLI_CHN_MUL_DAY), an element code (XXX), a project code (XXXXX), a year (YYYY), and a month (MM). In the dataset code, SURF stands for surface meteorological data, CLI for surface climate data, CHN for China, MUL for multiple elements, and DAY for daily values.
File names:
No. | Variable | File name |
---|---|---|
1 | Pressure | SURF_CLI_CHN_MUL_DAY-PRS-10004-YYYYMM.TXT |
2 | Air temperature | SURF_CLI_CHN_MUL_DAY-TEM-12001-YYYYMM.TXT |
3 | Relative humidity | SURF_CLI_CHN_MUL_DAY-RHU-13003-YYYYMM.TXT |
4 | Precipitation | SURF_CLI_CHN_MUL_DAY-PRE-13011-YYYYMM.TXT |
5 | Evaporation | SURF_CLI_CHN_MUL_DAY-EVP-13240-YYYYMM.TXT |
6 | Wind direction and speed | SURF_CLI_CHN_MUL_DAY-WIN-11002-YYYYMM.TXT |
7 | Sunshine duration | SURF_CLI_CHN_MUL_DAY-SSD-14032-YYYYMM.TXT |
8 | 0 cm ground temperature | SURF_CLI_CHN_MUL_DAY-GST-12030-0cm-YYYYMM.TXT |
Example: SURF_CLI_CHN_MUL_DAY-TEM-12001-201812.TXT
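To make the naming convention concrete, the short sketch below assembles a monthly file name from an element code and a year-month; the codes are copied from the table above, and the helper name is only illustrative.

```python
# Sketch: build a monthly data file name following the naming convention above.
ELEMENT_CODES = {
    "PRS": "10004", "TEM": "12001", "RHU": "13003", "PRE": "13011",
    "EVP": "13240", "WIN": "11002", "SSD": "14032", "GST": "12030-0cm",
}

def monthly_file_name(element, year, month):
    # dataset code - element code - project code - YYYYMM
    return "SURF_CLI_CHN_MUL_DAY-%s-%s-%04d%02d.TXT" % (
        element, ELEMENT_CODES[element], year, month)

print(monthly_file_name("TEM", 2018, 12))  # SURF_CLI_CHN_MUL_DAY-TEM-12001-201812.TXT
```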
Special (flag) values:
Item | Flag value and meaning |
---|---|
Station elevation | +100000: when the station elevation is an estimate, 100000 is added to the estimated value |
Any element | 32766: missing data or no observation task |
Daily pressure extremes | +20000: when the extreme is taken from a fixed-hour observation, 20000 is added to the original value |
Daily minimum relative humidity | +300: when the minimum is taken from a fixed-hour observation, 300 is added to the original value |
Wind speed | +1000: when the wind speed exceeds the instrument's upper limit, 1000 is added to the upper-limit value |
Wind direction | 1-17: direction encoded as a sector number, 17 means calm |
 | +100: when the direction is reported on an eight-sector scale, 100 is added to the original value |
 | 90X: X directions occurred; the count X is stored as the direction value |
 | 95X: at least X directions occurred; the count X is stored as the direction value |
Precipitation | 32700: trace precipitation |
 | 32XXX: XXX is purely fog, dew, or frost |
 | 31XXX: XXX is the combined amount of rain and snow |
 | 30XXX: XXX is the snow amount (sleet and snowstorm only) |
Evaporation | 32700: the evaporation pan was frozen |
 | +1000: all water added to the pan evaporated; 1000 is added to the amount of water added |
0 cm ground temperature | +10000: the actual (above-zero) temperature exceeded the instrument's upper scale limit; 10000 is added to the upper-limit value |
 | -10000: the actual (below-zero) temperature exceeded the instrument's lower scale limit; 10000 is subtracted from the lower-limit value |
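As a hedged illustration of how these flag values can be handled in code, the sketch below decodes the precipitation flags listed above. The function and the -9999 missing marker are choices made for this example; they mirror (but are not identical to) the logic used later in `ExtractData`.

```python
# Illustrative decoding of precipitation flag values (see table above).
# Raw values are in 0.1 mm; -9999 marks missing data in this sketch.
def decode_precipitation(raw):
    if raw == 32766:              # missing or no observation task
        return -9999
    if raw == 32700:              # trace precipitation
        return 0.0
    if raw >= 30000:              # 30XXX / 31XXX / 32XXX: keep the amount part XXX
        return (raw % 1000) * 0.1
    return raw * 0.1              # ordinary value, scaled from 0.1 mm to mm

print(decode_precipitation(32700))  # 0.0
print(decode_precipitation(31025))  # 2.5
print(decode_precipitation(87))     # 8.7
```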
Data extraction
Once the storage format is clear, the required data can be extracted to suit your needs. A convenient layout for meteorological data has one column per variable and one row per date (or time).
Example:
DATE | TEM | PRE | RHU | PRS |
---|---|---|---|---|
2018/01/01 | xx | xx | xx | xx |
2018/01/02 | xx | xx | xx | xx |
2018/01/03 | xx | xx | xx | xx |
... | ... | ... | ... | ... |
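Once the data are saved in this layout they load straight into analysis tools. For example, a small sketch with pandas (pandas is not used by the extraction script itself, and the file name below is hypothetical):

```python
import pandas as pd

# Hypothetical output file produced by the extraction script in the next section.
df = pd.read_csv("53463_data_200001_201812.csv", parse_dates=["date"], index_col="date")
df = df.replace(-9999, float("nan"))   # treat fill values as missing (see the caveats below)
print(df.loc["2010", "PRE"].sum())     # e.g. total precipitation in 2010
```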
Python code for the extraction
Main program
```python
# -*- coding: utf-8 -*-
import os
import time
import datetime

import numpy


class ClimateData:
    '''
    Read the downloaded raw daily station data and write it out in a
    formatted (date-by-variable) layout.
    '''
    def __init__(self, dir, dir_out, sid, fields, period, days):
        self.dataDir = dir
        self.dataDir_out = dir_out
        self.sid = sid
        self.period = period
        self.days = days
        self.fieldName = fields
        # code: project code used in the file name, ind: column index of the value
        # in a data row, frc: scale factor to recover the true value, ev: threshold
        # for abnormal values. See the dataset documentation; entries can be added
        # or removed as needed.
        self.fieldInfo = {
            "TEM": {"code": "12001", "ind": 7, "frc": 0.1, "ev": 30000},
            "TMX": {"code": "12001", "ind": 8, "frc": 0.1, "ev": 30000},
            "TMN": {"code": "12001", "ind": 9, "frc": 0.1, "ev": 30000},
            "PRE": {"code": "13011", "ind": 9, "frc": 0.1, "ev": 30000},  # 7: 8-20h, 8: 20-8h, 9: 20-20h
            "EVP": {"code": "13240", "ind": 7, "frc": 0.1, "ev": 1000},
            "RHU": {"code": "13003", "ind": 7, "frc": 1.0, "ev": 300},
            "WIN": {"code": "11002", "ind": 7, "frc": 0.1, "ev": 1000},
            "SSD": {"code": "14032", "ind": 7, "frc": 0.1, "ev": 99},
            "GST": {"code": "12030-0cm", "ind": 7, "frc": 0.1, "ev": 10000},
            "PRS": {"code": "10004", "ind": 7, "frc": 0.1, "ev": 20000}
        }
        self.data = {}        # all extracted data
        self.data_date = {}   # dates of the extracted data
        self.data_y = {}      # data grouped by year
        self.data_d = {}      # data for user-specified dates
        # Build the lists of months and years to process
        self.GetDateArr()
        for i in self.fieldName:
            self.data[i] = []
            self.data_date[i] = []
        for t in self.years:
            self.data_y[t] = {}
            for i in self.fieldName:
                self.data_y[t][i] = []
        for dd in self.days:
            self.data_d[dd] = {}
            for i in self.fieldName:
                self.data_d[dd][i] = 0

    def GetDateArr(self):
        '''
        Build the list of YYYYMM strings (and years) covered by the start/end dates.
        '''
        self.date = []
        self.years = []
        startDT_y = int(self.period[0][0:4])
        startDT_m = int(self.period[0][4:6])
        endDT_y = int(self.period[1][0:4])
        endDT_m = int(self.period[1][4:6])
        if startDT_y == endDT_y:
            self.years.append(startDT_y)
            for j in range(startDT_m, endDT_m + 1):
                self.date.append("%d%02d" % (startDT_y, j))
        else:
            for i in range(startDT_y, endDT_y + 1):
                self.years.append(i)
                if i == startDT_y:
                    months = range(startDT_m, 13)
                elif i < endDT_y:
                    months = range(1, 13)
                else:
                    months = range(1, endDT_m + 1)
                for j in months:
                    self.date.append("%d%02d" % (i, j))

    def ExtractData(self, sr=0):
        '''
        Extract data for the configured station.
        :param sr: start row of the search (speeds things up), default is 0
        '''
        print("Data extracting...")
        s_time = time.time()
        # Loop over each variable ...
        for fn in self.fieldName:
            # ... and over each month
            for dt in self.date:
                print(fn, dt)
                yr = int(dt[0:4])
                # Build the path of the monthly data file
                field = fn
                if fn == "TMN" or fn == "TMX":
                    field = "TEM"  # daily min/max temperature live in the TEM files
                fileName = self.dataDir + os.sep + "SURF_CLI_CHN_MUL_DAY-" + \
                           field + "-" + self.fieldInfo[fn]['code'] + "-" + dt + ".TXT"
                if not os.path.isfile(fileName):
                    raise Exception("Can not find %s" % fileName)
                txtFile = open(fileName, 'r')
                linesList = txtFile.read().split('\n')
                txtFile.close()
                iffind = False
                hasdata = True
                # Walk through the file line by line
                for i in range(sr, len(linesList)):
                    if len(linesList[i]) > 0:
                        # Split the row into fields and pick the value by its index
                        lineArr = SplitStr(linesList[i], spliters=' ')
                        if int(lineArr[0]) == self.sid:
                            iffind = True
                            od = float(lineArr[self.fieldInfo[fn]['ind']])
                            if fn == "PRS":
                                # pressure: flag values above the threshold
                                if od >= self.fieldInfo[fn]['ev']:
                                    od = -100
                            elif fn == "PRE":
                                # precipitation flag values (see the table above)
                                if od == 32766:
                                    od = -100
                                elif od == 32700:
                                    od = 0
                                elif od > 99999:
                                    od = 0
                                else:
                                    od = od - int(od / 1000) * 1000
                            else:
                                # other variables: abnormal values are replaced by -100
                                if od >= float(self.fieldInfo[fn]['ev']) / float(self.fieldInfo[fn]['frc']):
                                    od = -100.
                            # Store the (scaled) value
                            if od != -100:
                                self.data[fn].append(od * float(self.fieldInfo[fn]['frc']))
                                self.data_y[yr][fn].append(od * float(self.fieldInfo[fn]['frc']))
                            else:
                                self.data[fn].append(od)
                                self.data_y[yr][fn].append(od)
                            # Store the corresponding date (columns 4/5/6 are year/month/day)
                            data_date_str = lineArr[4] + "-" + lineArr[5] + "-" + lineArr[6]
                            data_date_date = datetime.datetime.strptime(data_date_str, "%Y-%m-%d")
                            data_date_fmt = datetime.datetime.strftime(data_date_date, "%Y-%m-%d")
                            self.data_date[fn].append(data_date_fmt)
                        # Stop once all rows of the requested station have been read
                        if int(lineArr[0]) != self.sid and iffind:
                            break
                        # Flag the month if the station was not found at all
                        if i == len(linesList) - 27 and not iffind:
                            hasdata = False
                            break
                # If the station has no data in this month, fill with -9999
                if not hasdata:
                    firstrow = SplitStr(linesList[0], spliters=' ')
                    s0 = firstrow[0]
                    for k in range(len(linesList)):
                        lineArr_s0 = SplitStr(linesList[k], spliters=' ')
                        if lineArr_s0 and int(lineArr_s0[0]) == int(s0):
                            # Fill value plus the matching date
                            self.data[fn].append(-9999)
                            self.data_y[yr][fn].append(-9999)
                            data_date_str = lineArr_s0[4] + "-" + lineArr_s0[5] + "-" + lineArr_s0[6]
                            data_date_date = datetime.datetime.strptime(data_date_str, "%Y-%m-%d")
                            data_date_fmt = datetime.datetime.strftime(data_date_date, "%Y-%m-%d")
                            self.data_date[fn].append(data_date_fmt)
                        else:
                            break
        e_time = time.time()
        print("\t<Run time: %.3f s>" % (e_time - s_time))

    def SaveData(self, period_days, avg=True, d=True):
        '''
        Write the extracted data to CSV files.
        :param avg: also write yearly means (totals for precipitation)
        :param d: also write the values for the dates listed in self.days
        '''
        print("Save as file...", end='')
        # Daily data: header line first ...
        outStr = "date,"
        for s in range(len(self.fieldName)):
            if s != len(self.fieldName) - 1:
                outStr += self.fieldName[s] + ","
            else:
                outStr += self.fieldName[s] + "\n"
        # ... then one row per day, one column per variable
        for k in range(len(self.data[self.fieldName[0]])):
            for s in range(len(self.fieldName)):
                if s == 0:
                    outStr += str(self.data_date[self.fieldName[s]][k]) + ","
                if s != len(self.fieldName) - 1:
                    outStr += str(self.data[self.fieldName[s]][k]) + ","
                else:
                    outStr += str(self.data[self.fieldName[s]][k]) + "\n"
        createForld(self.dataDir_out)
        outputFile = self.dataDir_out + os.sep + str(self.sid) + "_data_" + \
                     self.period[0] + "_" + self.period[1] + ".csv"
        DeleteFile(outputFile)
        WriteLog(outputFile, outStr, MODE='append')
        # Yearly averages (totals for precipitation)
        if avg:
            outStr = "DATE,"
            for s in range(len(self.fieldName)):
                if s != len(self.fieldName) - 1:
                    outStr += self.fieldName[s] + ","
                else:
                    outStr += self.fieldName[s] + "\n"
            for yr in self.years:
                outStr += str(yr) + ","
                for s in range(len(self.fieldName)):
                    if self.fieldName[s] == "PRE":
                        # precipitation is accumulated, not averaged
                        data_avg = numpy.sum(self.data_y[yr][self.fieldName[s]])
                    else:
                        data_avg = numpy.average(self.data_y[yr][self.fieldName[s]])
                    if s != len(self.fieldName) - 1:
                        outStr += str(data_avg) + ","
                    else:
                        outStr += str(data_avg) + "\n"
            outputFile_avg = self.dataDir_out + os.sep + str(self.sid) + "_data_" + \
                             self.period[0] + "_" + self.period[1] + "_avg.csv"
            DeleteFile(outputFile_avg)
            WriteLog(outputFile_avg, outStr, MODE='append')
        # Data for user-specified dates
        if d:
            outStr = "DATE,"
            for s in range(len(self.fieldName)):
                if s != len(self.fieldName) - 1:
                    outStr += self.fieldName[s] + ","
                else:
                    outStr += self.fieldName[s] + "\n"
            for dd in self.days:
                outStr += str(dd) + ","
                for s in range(len(self.fieldName)):
                    data_d = self.data_d[dd][self.fieldName[s]]
                    if s != len(self.fieldName) - 1:
                        outStr += str(data_d) + ","
                    else:
                        outStr += str(data_d) + "\n"
            outputFile_d = self.dataDir_out + os.sep + "data_" + \
                           self.period[0] + "_" + self.period[1] + "_days.csv"
            DeleteFile(outputFile_d)
            WriteLog(outputFile_d, outStr, MODE='append')
        print("Completed!")
```
Helper functions used above
```python
## DateTime helpers
def GetDateArr_days(timeStart, timeEnd):
    TIME_Start = datetime.datetime.strptime(timeStart, "%Y-%m-%d")
    TIME_End = datetime.datetime.strptime(timeEnd, "%Y-%m-%d")
    dateArr = getDayByDay(TIME_Start, TIME_End)
    return dateArr


def GetDateArr_strdays(timeStart, timeEnd, fmt="%Y-%m-%d"):
    days = GetDateArr_days(timeStart, timeEnd)
    dateArr_str = []
    for d in days:
        dateArr_str.append(datetime.datetime.strftime(d, fmt))
    return dateArr_str


def getDayByDay(timeStart, timeEnd):
    oneday = datetime.timedelta(days=1)
    timeArr = [timeStart]
    while timeArr[len(timeArr) - 1] < timeEnd:
        tempday = timeArr[len(timeArr) - 1] + oneday
        timeArr.append(tempday)
    return timeArr


# Remove spaces (' ') and tabs ('\t') at the beginning and end of a string
def StripStr(str):
    oldStr = ''
    newStr = str
    while oldStr != newStr:
        oldStr = newStr
        newStr = oldStr.strip('\t')
        newStr = newStr.strip(' ')
    return newStr


# Split a string by spaces (' ') and tabs ('\t') by default
def SplitStr(str, spliters=None):
    if spliters is None:
        spliters = [' ', '\t']
    destStrs = []
    srcStrs = [str]
    while True:
        oldDestStrs = srcStrs[:]
        for s in spliters:
            for srcS in srcStrs:
                tempStrs = srcS.split(s)
                for tempS in tempStrs:
                    tempS = StripStr(tempS)
                    if tempS != '':
                        destStrs.append(tempS)
            srcStrs = destStrs[:]
            destStrs = []
        if oldDestStrs == srcStrs:
            destStrs = srcStrs[:]
            break
    return destStrs


# Write a string (or a list of strings) to a file
def WriteLog(logfile, contentlist, MODE='replace'):
    if os.path.exists(logfile):
        if MODE == 'replace':
            os.remove(logfile)
            logStatus = open(logfile, 'w')
        else:
            logStatus = open(logfile, 'a')
    else:
        logStatus = open(logfile, 'w')
    if isinstance(contentlist, list) or isinstance(contentlist, tuple):
        for content in contentlist:
            # note: the separator was lost in the original post; "\n" is assumed here
            logStatus.write("%s%s" % (content, "\n"))
    else:
        logStatus.write(contentlist)
    logStatus.flush()
    logStatus.close()


# Create a folder if it does not exist
def createForld(forldPath):
    if not os.path.isdir(forldPath):
        os.makedirs(forldPath)


# Delete a file if it exists
def DeleteFile(fp):
    if os.path.exists(fp):
        os.remove(fp)
```
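A quick sanity check of the helpers above (illustrative only, not part of the original script):

```python
# Illustrative use of the helper functions above.
print(GetDateArr_strdays("2018-12-29", "2019-01-02"))
# ['2018-12-29', '2018-12-30', '2018-12-31', '2019-01-01', '2019-01-02']

print(SplitStr("53463   41.7  123.5   49  2018 12  1   25"))
# ['53463', '41.7', '123.5', '49', '2018', '12', '1', '25']
```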
Main function (program entry point)
```python
if __name__ == "__main__":
    # Paths (placeholders -- point them at your own folders)
    dataDir = r"<data directory>"        # folder holding the downloaded .TXT files
    dataDir_out = r"<output directory>"  # folder for the extracted .csv files
    sidArr = ["53463", "53478", "53480", "53487"]  # station IDs to extract
    fields = ["TEM", "TMN", "TMX", "PRE", "RHU", "WIN", "PRS", "SSD"]  # variables to extract
    START = "2000-01-01"  # start date
    END = "2018-12-31"    # end date
    period = [START.split('-')[0] + START.split('-')[1], END.split('-')[0] + END.split('-')[1]]
    period_days = GetDateArr_strdays(START, END)
    days = []
    # Extract station by station
    for sid in sidArr:
        print(sid)
        # Starting row of the file search, to speed up the extraction
        sr = sidArr.index(sid) * 28
        c = ClimateData(dataDir, dataDir_out, int(sid), fields, period, days)
        c.ExtractData(sr=sr)
        c.SaveData(period_days, avg=False, d=False)
```
Additional notes
In the first snippet (the ClimateData class), self.fieldInfo is the index dictionary for the variables: code is the project code used in the file name, ind is the (zero-based) column index of the variable within a data row, frc is the scale factor used to recover the true value (see the variable/unit description attached to the dataset), and ev is the threshold above which a value is treated as abnormal. Entries can be added to (or removed from) the dictionary to match your needs.
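To make the ind index concrete, here is a hypothetical raw record (the numbers are made up for illustration); the column positions are those implied by the extraction code above, with the station ID in column 0, year/month/day in columns 4-6, and the element values from column 7 onwards.

```python
# A made-up raw record, only to show which column "ind" selects.
line = "53463  4144 12325  1112 2018 12  1   -75   -25  -131     0     0     0"
arr = SplitStr(line)
print(arr[0])                  # '53463' -> station ID
print(arr[4], arr[5], arr[6])  # '2018' '12' '1' -> date
print(float(arr[7]) * 0.1)     # -7.5 -> value at ind=7 after applying frc=0.1
```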
In the third snippet (the main block), START and END define the time range of the extraction, and continuous daily data within that range are extracted. To extract non-consecutive dates instead, list them in the days variable.
Paste the three snippets above into a single .py file. To run it, you only need to edit the main block: the file paths, the list of station IDs to extract, the list of variables (keep only the variables you need and delete the rest), and the dates; an illustrative set of settings is sketched below.
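For instance, a minimal edit of the main block might look like this (all paths, the station ID, and the dates are illustrative, not part of the original script):

```python
# Illustrative settings only -- adjust to your own data and needs.
dataDir = r"D:\SURF_CLI_CHN_MUL_DAY"          # folder with the downloaded .TXT files
dataDir_out = r"D:\SURF_CLI_CHN_MUL_DAY_out"  # folder for the extracted .csv files
sidArr = ["54511"]                            # a single example station ID
fields = ["TEM", "PRE"]                       # keep only the variables you need
START = "2010-01-01"
END = "2015-12-31"
```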
Post-processing
After the required meteorological variables have been extracted, the data are stored in separate files named by station ID. With many stations, some post-processing is needed.
The post-processing code is meant to be placed inside the main function of the script above; comment out the extraction calls when you use it.
It handles one variable per run (run it once per variable if several are needed) and merges the long-term records of multiple stations into a single file, with dates as rows and stations as columns:
```python
### Merge data from multiple stations
stations_datadir = r"<extracted data directory>"  # folder with the per-station .csv files
outputfile = r"<output file>"
# Index of the variable column in the extracted CSV.
# Column 0 is the date, so the first variable (e.g. TEM above) is column 1.
var_index = 1

stations_data = {}
for sid in sidArr:
    stations_data[int(sid)] = []
for sid in sidArr:
    dataFile = stations_datadir + os.sep + str(int(sid)) + "_data_" + \
               period[0] + "_" + period[1] + ".csv"
    print(dataFile)
    if os.path.isfile(dataFile):
        txtFile = open(dataFile, 'r')
        linesList = txtFile.read().split('\n')
        txtFile.close()
        for k in range(1, len(linesList)):
            if len(linesList[k]) > 0:
                stationInfo = linesList[k].split(',')
                stations_data[int(sid)].append(float(stationInfo[var_index]))

# Build the output table: one row per date, one column per station
stations_data_str = "date,"
for k in range(len(sidArr)):
    if k != len(sidArr) - 1:
        stations_data_str += "S%d," % int(sidArr[k])
    else:
        stations_data_str += "S%d\n" % int(sidArr[k])
for t in range(len(period_days)):
    stations_data_str += period_days[t] + ","
    for s in range(len(sidArr)):
        if s != len(sidArr) - 1:
            stations_data_str += "%.3f," % stations_data[int(sidArr[s])][t]
        else:
            stations_data_str += "%.3f\n" % stations_data[int(sidArr[s])][t]
WriteLog(outputfile, stations_data_str)
print("Finished!")
```
Caveats
Note in particular: when extracting data from before 1960, the program may mismatch data and dates at some stations, because many stations were added during the 1960s and the program makes no special provision for that. You can pre-process the station list and extract the data in two runs so that the set of stations is consistent within each period, or, as suggested by a reader in the comments, extract the date column that accompanies the data and use it to replace the period_days variable.
The bug above has since been fixed: dates with no observations are now automatically filled with -9999 as the NoData value; keep this in mind when using the output.
The example code treats the flag values rather simply; refine the handling according to the flag-value table above if your application requires it.
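If the extracted series is analysed statistically, the -9999 fill values (and any -100 flags left by the simple handling above) should be masked first. One possible sketch with numpy, assuming a hypothetical output file in which TEM is the first variable column:

```python
import numpy as np

# Hypothetical extracted file; column 0 is the date, column 1 the first variable (TEM).
tem = np.loadtxt("53463_data_200001_201812.csv", delimiter=",", skiprows=1, usecols=1)
tem = np.where(np.isin(tem, (-9999.0, -100.0)), np.nan, tem)  # mask fill/flag values
print(np.nanmean(tem))  # statistics now ignore the masked days
```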
If you run into any problems while using the program, leave a comment below, send a private message on Sina Weibo (@斩之浪), or email gispie@163.com.
Downloads
If you need meteorological data, contact me via Sina Weibo private message or email (gispie@163.com).
Fighting, GISer!