(Java) How can I read in a text file that could use various encodings and output the contents in a text file that looks normal?

I'm reading in a file, replacing text, and writing a new file, line by line. I use the following code to read and write the file. There are no issues with files that are CP1252 or UTF-8 encoded, but when I try reading in a file encoded in "UCS-2 LE BOM", the file that is saved starts with BOM characters and contains a whole lot of whitespace. I know this is due to the encoding, but I don't know if I need to read it in differently or save it differently. Also, I know I can set the encoding when I read the file in, but how can I handle differently-encoded files without knowing which one is coming? I have no control over the file until it hits the Java code. Any help is appreciated, thank you.

        FileInputStream sourceFileInputStream = new FileInputStream(sourceFile);
        DataInputStream sourceDataInputStream = new DataInputStream(sourceFileInputStream);
        BufferedReader sourceBufferedReader = new BufferedReader(
                new InputStreamReader(sourceDataInputStream));
        FileWriter targetFileWriter = new FileWriter(new File(targetFileLocation));
        BufferedWriter targetBufferedWriter = new BufferedWriter(
                targetFileWriter);
                  .
                  .
                  .
        targetBufferedWriter.write(newTextLine);

The BOM can indicate several encodings, not just UTF-8. See the Wikipedia article on the byte order mark.
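As a rough illustration of the BOM check, here is a minimal sketch. `BomSniffer` and its methods are hypothetical names of my own, not a JDK API, and the fallback charset is whatever you decide to assume for files without a BOM. It only covers the UTF-8 and UTF-16 marks; UTF-32 (whose little-endian BOM also begins with FF FE) is deliberately left out.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper (not part of the JDK): picks a charset from a leading BOM. */
final class BomSniffer {

    /** Number of BOM bytes at the start of head, or 0 if no known BOM is present. */
    static int bomLength(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return 3; // UTF-8
        if (startsWith(head, 0xFE, 0xFF)) return 2;       // UTF-16, big-endian
        if (startsWith(head, 0xFF, 0xFE)) return 2;       // UTF-16, little-endian ("UCS-2 LE BOM")
        return 0;
    }

    /** Charset implied by the BOM, or the caller's fallback when there is no BOM. */
    static Charset detect(byte[] head, Charset fallback) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return StandardCharsets.UTF_8;
        if (startsWith(head, 0xFE, 0xFF)) return StandardCharsets.UTF_16BE;
        if (startsWith(head, 0xFF, 0xFE)) return StandardCharsets.UTF_16LE;
        return fallback;
    }

    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    /** Opens a reader that starts after the BOM and decodes with the detected charset. */
    static BufferedReader open(Path file, Charset fallback) throws IOException {
        PushbackInputStream in = new PushbackInputStream(Files.newInputStream(file), 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < head.length) {                 // read up to 3 bytes, tolerating short reads
            int r = in.read(head, n, head.length - n);
            if (r < 0) break;
            n += r;
        }
        byte[] seen = Arrays.copyOf(head, n);
        int skip = bomLength(seen);
        in.unread(seen, skip, n - skip);          // push back everything that is not BOM
        return new BufferedReader(new InputStreamReader(in, detect(seen, fallback)));
    }
}
```

With something like this, the reader in the question could be replaced by `BomSniffer.open(path, fallback)`. On the output side, `FileWriter` as used in the question writes in the platform default encoding, so choosing the charset explicitly, for example `Files.newBufferedWriter(target, StandardCharsets.UTF_8)`, keeps the result predictable.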
In the absence of a BOM, you don't need to read the whole file; you can read as much as needed until you have meaningful statistics. 100 or so bytes can be enough - I once wrote a program that did that. On the other hand, there is a chance that even if you read the entire file the statistics will not be conclusive. The method I used was based on letter frequency - unigram, bigram and trigram frequencies for the language, and the relationship of the encoding to the language. When calculating bigram and trigram frequencies, I suggest that whitespace should be considered a character in its own right. That will account for the frequency of letters at the beginning and at the end of words. For "now is the", writing whitespace as _, the bigrams are _n, no, ow, w_, _i, is, s_, _t, th, he, e_. See Monogram, Bigram and Trigram frequency counts.
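To make the n-gram idea concrete, here is a small sketch under my own assumptions. `NgramSketch`, `ngrams` and `score` are hypothetical names, and the frequency table is something the caller has to supply (for example, built from the published counts for the expected language); no real frequencies are included here. It is not the program mentioned above, only one way the approach could look.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch: n-grams with whitespace counted as a character. */
final class NgramSketch {

    /** Lists the n-grams of text; each run of whitespace, and both ends, become one '_'. */
    static List<String> ngrams(String text, int n) {
        String padded = "_"
                + text.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", "_")
                + "_";
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            out.add(padded.substring(i, i + n));
        }
        return out;
    }

    /**
     * Decodes the sample with a candidate charset and sums the table weight of every
     * n-gram found. The table (n-gram to relative frequency) is an assumed input.
     */
    static double score(byte[] sample, Charset candidate, Map<String, Double> table, int n) {
        String decoded = new String(sample, candidate);
        double total = 0.0;
        for (String gram : ngrams(decoded, n)) {
            total += table.getOrDefault(gram, 0.0);
        }
        return total;
    }
}
```

The intended use would be to call `score` on the first hundred or so bytes once per candidate encoding and keep the highest-scoring one, treating a near tie as "not conclusive" and reading more of the file.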
