(Java) How can I read in a text file that could use various encodings and output the contents in a text file that looks normal?
I'm reading in a file, replacing some text, and writing a new file, line by line. I use the following code to read and write the file. There are no issues with files that are CP1252 or UTF-8 encoded, but when I try reading in a file encoded in "UCS-2 LE BOM", the file that is saved starts with the BOM characters and contains a whole lot of whitespace. I know this is due to the encoding, but I don't know if I need to read it in differently or save it differently. Also, I know I can set the encoding when I read the file in, but how can I handle differently-encoded files without knowing which one is coming? I have no control over the file until it hits my Java code. Any help is appreciated, thank you.
FileInputStream sourceFileInputStream = new FileInputStream(sourceFile);
DataInputStream sourceDataInputStream = new DataInputStream(sourceFileInputStream);
BufferedReader sourceBufferedReader = new BufferedReader(
        new InputStreamReader(sourceDataInputStream));
FileWriter targetFileWriter = new FileWriter(new File(targetFileLocation));
BufferedWriter targetBufferedWriter = new BufferedWriter(targetFileWriter);
. . .
targetBufferedWriter.write(newTextLine);
The BOM can indicate several encodings, not just UTF-8. See the Wikipedia article on the byte order mark.
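As a rough illustration of BOM sniffing, the sketch below peeks at the first bytes of the file, picks a charset if a BOM is recognised, and skips the BOM before decoding. The class name `BomSniffer`, its methods, and the `fallback` parameter are made up for this example. It assumes Java 9+ (for `readNBytes`) and only handles the UTF-8 and UTF-16 BOMs. UTF-32 is left out, so a UTF-32LE file would be mistaken for UTF-16LE.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the JDK.
final class BomSniffer {

    /** Returns the charset implied by a BOM in the first n bytes, or null if none is recognised. */
    static Charset fromBom(byte[] b, int n) {
        if (n >= 3 && (b[0] & 0xFF) == 0xEF && (b[1] & 0xFF) == 0xBB && (b[2] & 0xFF) == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (n >= 2 && (b[0] & 0xFF) == 0xFF && (b[1] & 0xFF) == 0xFE) {
            return StandardCharsets.UTF_16LE; // what editors label "UCS-2 LE BOM"
        }
        if (n >= 2 && (b[0] & 0xFF) == 0xFE && (b[1] & 0xFF) == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        return null;
    }

    /** Opens a reader using the BOM's charset, or the caller's fallback when there is no BOM. */
    static BufferedReader open(Path file, Charset fallback) throws IOException {
        PushbackInputStream in = new PushbackInputStream(Files.newInputStream(file), 3);
        byte[] head = new byte[3];
        int n = in.readNBytes(head, 0, head.length); // Java 9+; returns 0 for an empty file
        Charset cs = fromBom(head, n);
        // Drop the BOM itself and push back only the bytes that belong to the text.
        int skip = (cs == null) ? 0 : (cs == StandardCharsets.UTF_8 ? 3 : 2);
        in.unread(head, skip, n - skip);
        return new BufferedReader(new InputStreamReader(in, cs != null ? cs : fallback));
    }
}
```

A caller might use `BomSniffer.open(path, Charset.forName("windows-1252"))` for reading and `Files.newBufferedWriter(target, StandardCharsets.UTF_8)` for writing, so that the output encoding is chosen explicitly rather than taken from the platform default as `FileWriter` does.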
In the absence of a BOM, you don't need to read the whole file; you can read as much as needed until you have meaningful statistics. Even 100 or so bytes may be enough - I once wrote a program that did that. On the other hand, there is a chance that even if you read the entire file the statistics will not be conclusive. The method I used was based on letter frequency - the unigram, bigram and trigram frequencies for the language, and the relationship of the encoding to the language. When calculating bigram and trigram frequencies, I suggest that whitespace should be considered a character in its own right. This will account for the frequency of letters at the beginning and at the end of words. For "now is the " the bigrams are no, ow, w_, _i, is, s_, _t, th, he, e_. See monogram, bigram and trigram frequency counts.
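To make the n-gram idea concrete, here is a minimal sketch of the counting step only, with each run of whitespace treated as a single `_` character. The class name `NGrams` and its methods are invented for this example. Comparing the counts against per-language frequency tables for each candidate encoding, which is the actual detection step, is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not the answerer's original program.
final class NGrams {

    /** Lists the n-grams of the text in order, with each whitespace run collapsed to '_'. */
    static List<String> ngrams(String text, int n) {
        String s = text.replaceAll("\\s+", "_");
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= s.length(); i++) {
            out.add(s.substring(i, i + n));
        }
        return out;
    }

    /** Counts how often each n-gram occurs, sorted by n-gram for stable output. */
    static Map<String, Integer> counts(String text, int n) {
        Map<String, Integer> freq = new TreeMap<>();
        for (String gram : ngrams(text, n)) {
            freq.merge(gram, 1, Integer::sum);
        }
        return freq;
    }
}
```

With this sketch, `NGrams.ngrams("now is the ", 2)` yields no, ow, w_, _i, is, s_, _t, th, he, e_. Decoding the same sample bytes under each candidate charset and checking which decoding produces the most plausible counts is one way the statistics could then be used.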