(Java) How can I read in a text file that could use various encodings and output the contents in a text file that looks normal?


I'm reading in a file, replacing some text, and writing a new file, line by line. I use the following code to read and write the file. There are no issues with files that are CP1252 or UTF-8 encoded, but when I try reading in a file encoded in "UCS-2 LE BOM", the file that gets saved starts with the BOM characters and contains a whole lot of whitespace. I know this is due to the encoding, but I don't know whether I need to read it in differently or save it differently. Also, I know I can set the encoding when I read the file in, but how can I handle differently-encoded files without knowing which one is coming? I have no control over the file until it hits the Java code. Any help is appreciated, thank you.

        FileInputStream sourceFileInputStream = new FileInputStream(sourceFile);
        DataInputStream sourceDataInputStream = new DataInputStream(sourceFileInputStream);
        BufferedReader sourceBufferedReader = new BufferedReader(
                new InputStreamReader(sourceDataInputStream));
        FileWriter targetFileWriter = new FileWriter(new File(targetFileLocation));
        BufferedWriter targetBufferedWriter = new BufferedWriter(
                targetFileWriter);
                  .
                  .
                  .
        targetBufferedWriter.write(newTextLine);

  1. The BOM can indicate several encodings, not just UTF-8. See the Wikipedia article on the byte order mark.

  2. In the absence of a BOM, you don't need to read the whole file; you can read as much as needed until you have meaningful statistics. 100 or so bytes may be enough - I once wrote a program that did just that. On the other hand, there is a chance that even if you read the entire file the statistics will not be conclusive. The method I used was based on letter frequency - unigram, bigram and trigram frequencies for each language, and the relationship of encoding to language. When calculating bigram and trigram frequencies, I suggest that whitespace should be considered as a character in its own right. This accounts for the frequency of letters at the beginning and at the end of words. For example, writing whitespace as _, "now is the" has the bigrams no, ow, w_, _i, is, s_, _t, th, he. See monogram, bigram and trigram frequency counts.
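The two points above can be sketched in Java roughly as follows. This is only an illustration under stated assumptions, not a complete detector: the class `EncodingSketch` and its methods `charsetFromBom`, `transcode` and `bigramCounts` are hypothetical names, only the UTF-8 and UTF-16 BOMs are checked (UTF-32 is left out), the fallback charset for BOM-less files is whatever the caller guesses, and the output is always written as UTF-8. `bigramCounts` shows only the counting step with whitespace treated as a character; comparing the counts against per-language frequency tables is not shown.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class EncodingSketch {

    /** Returns the charset named by a leading BOM, or null if no known BOM is present. */
    static Charset charsetFromBom(byte[] head, int len) {
        if (len >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (len >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE; // what editors label "UCS-2 LE BOM"
        }
        if (len >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        return null; // assumption: UTF-32 BOMs are not handled in this sketch
    }

    /** Copies source to target line by line, decoding by BOM (or fallback) and writing UTF-8. */
    static void transcode(Path source, Path target, Charset fallback) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(source))) {
            // Peek at the first bytes, then rewind so the reader sees the whole file.
            in.mark(4);
            byte[] head = new byte[3];
            int len = 0;
            while (len < head.length) {
                int n = in.read(head, len, head.length - len);
                if (n < 0) break;
                len += n;
            }
            in.reset();

            Charset detected = charsetFromBom(head, len);
            Charset charset = (detected != null) ? detected : fallback;

            try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset));
                 BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
                String line;
                boolean first = true;
                while ((line = reader.readLine()) != null) {
                    // The decoders used here leave the BOM in the text as U+FEFF; drop it.
                    if (first && line.startsWith("\uFEFF")) {
                        line = line.substring(1);
                    }
                    first = false;
                    // The text replacement from the question would be applied to line here.
                    writer.write(line);
                    writer.newLine();
                }
            }
        }
    }

    /** Counts bigrams, collapsing each whitespace run to '_' so word boundaries are counted. */
    static Map<String, Integer> bigramCounts(String text) {
        String normalized = text.toLowerCase(Locale.ROOT).replaceAll("\\s+", "_");
        Map<String, Integer> counts = new TreeMap<>();
        for (int i = 0; i + 2 <= normalized.length(); i++) {
            counts.merge(normalized.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }
}
```

A call might look like `EncodingSketch.transcode(sourcePath, targetPath, Charset.forName("windows-1252"))`, where the fallback is a guess for BOM-less input; a statistical check such as the bigram approach would be one way to replace that guess.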

