Automatic file encoding detection in SciTE

Reposted

玉宽 2023-04-26 13:57:28 Blog category: Python

########################## Start of file fileDetect.py ##########################

#encoding:utf8
# Detect file encoding
# Simple method that just checks that the first 1000 lines are valid in each encoding
# and chooses the first from the set that is valid for all lines checked.
# A better version would allow for a small proportion of failures and rank encodings
# depending on how well they match the input.
import sys
import os

# Each entry: [Python codec name, SciTE code.page value, SciTE character.set value]
encodings = [
    ['utf-8', 65001, 0],
    ['cp932', 932, 128],
    ['cp936', 936, 134],
    ['cp949', 949, 129],
    ['cp950', 950, 136],
]

codings = [e[0] for e in encodings]

def EncodingWorks(encoding, text):
    # text is a bytes object, because the file is opened in binary mode
    try:
        text.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False

# Read up to first 1000 lines of file
if len(sys.argv) > 1 and os.path.isfile(sys.argv[1]):
    with open(sys.argv[1], "rb") as f:
        lineNumber = 1
        # Iterate lazily so a huge file is not read into memory in full
        for line in f:
            # Filter out any encodings that fail
            codings = [c for c in codings if EncodingWorks(c, line)]
            lineNumber += 1
            if lineNumber > 1000:
                break

codingsKnown = False

comment = ''
for c in codings:
    for e in encodings:
        if e[0] == c:
            codingsKnown = True
            codePage, characterSet = e[1:]
            if codePage:
                print('%scode.page=%s' % (comment, codePage))
            if characterSet:
                print('%scharacter.set=%s' % (comment, characterSet))
            # Display other matches as comments so can check results
            comment = '#'

# If the encoding cannot be detected, fall back to cp936 (GBK)
if not codingsKnown:
    print('code.page=936')
    print('character.set=134')

# Change the caret colour so we can see that something happened
print('caret.fore=#4499FF')

############################ End of file ############################
Then add the following line to the SciTEGlobal.properties configuration file:
command.discover.properties=python /path/to/fileDetect.py "$(FilePath)"
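SciTE will then detect the file encoding automatically. The script above can detect UTF-8, GBK, Big5 and similar encodings, which is enough for everyday use.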

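P.S. The code above was written by someone else. It was tested on Linux and requires a Python installation.
Because it was pasted straight into a web page, copying it directly into a source file may cause problems.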
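
The comments at the top of the script mention that a better version would tolerate a small proportion of failures and rank the encodings by how well they match. Below is a minimal sketch of that idea for Python 3. It is not part of the original script: the names `failure_rate` and `rank_encodings` and the 1% `max_failure_rate` threshold are assumptions made for illustration.

```python
# Candidate codecs in the same priority order as the script above.
CANDIDATES = ['utf-8', 'cp932', 'cp936', 'cp949', 'cp950']

def failure_rate(encoding, lines):
    """Return the fraction of byte lines that fail to decode with the codec."""
    if not lines:
        return 0.0
    failures = 0
    for line in lines:
        try:
            line.decode(encoding)
        except UnicodeDecodeError:
            failures += 1
    return failures / len(lines)

def rank_encodings(lines, candidates=CANDIDATES, max_failure_rate=0.01):
    """Return (codec, failure rate) pairs, best match first.

    Codecs whose failure rate exceeds max_failure_rate (a hypothetical
    threshold) are dropped. Ties keep the original candidate order.
    """
    scored = [(failure_rate(c, lines), i, c) for i, c in enumerate(candidates)]
    scored = [s for s in scored if s[0] <= max_failure_rate]
    scored.sort()
    return [(c, rate) for rate, i, c in scored]

if __name__ == '__main__':
    sample = ['\u4e2d\u6587\n'.encode('utf-8')] * 3
    print(rank_encodings(sample))
```

Counting failures line by line is still a coarse measure. Several double-byte code pages accept many of the same byte sequences, so the candidate order continues to decide ties.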

Encodings

SciTE will automatically detect the encoding scheme used for Unicode files that start with a Byte Order Mark (BOM). The UTF-8 and UTF-16 encodings are recognised including both Little Endian and Big Endian variants of UTF-16.
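
The BOM check described above can be illustrated with a short Python sketch. This is not SciTE's actual implementation. The function name `sniff_bom` is hypothetical, and only the three BOMs named in the paragraph are covered.

```python
import codecs

# BOMs named in the paragraph above. UTF-32 is left out on purpose, so a
# UTF-32 LE file (whose BOM begins with FF FE) would be reported as UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]

def sniff_bom(data):
    """Return the encoding implied by a leading BOM in bytes, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

Only the first few bytes of a file are needed, for example `sniff_bom(open(path, 'rb').read(4))`.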

UTF-8 files will also be recognised when they contain a coding cookie on one of the first two lines. A coding cookie looks similar to "coding: utf-8" ("coding" followed by ':' or '=', optional whitespace, optional quote, "utf-8") and is normally contained in a comment:

# -*- coding: utf-8 -*-

For XML there is a declaration:

<?xml version='1.0' encoding='utf-8'?>
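
The coding cookie rule described above ("coding" followed by ':' or '=', optional whitespace, optional quote, "utf-8", on one of the first two lines) can be approximated with a regular expression. The sketch below is an illustration in Python, not SciTE's own code. The name `has_utf8_cookie` is hypothetical, and the match is assumed to be case-sensitive.

```python
import re

# "coding", then ':' or '=', optional whitespace, optional quote, "utf-8"
COOKIE = re.compile(rb"coding[:=]\s*[\"']?utf-8")

def has_utf8_cookie(data):
    """Return True if one of the first two lines of the bytes holds a UTF-8 cookie."""
    first_two = data.splitlines()[:2]
    return any(COOKIE.search(line) for line in first_two)
```

Because "encoding" ends in "coding", the same pattern also matches the XML declaration shown above.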

For other encodings set the code.page and character.set properties.