Converting a file to UTF-8

Ricky Spanish · 10.10.2005

Hi.

I'm looking for a way to convert arbitrary files on Windows that are stored in the usual ISO-8859-whatever format to UTF-8.

Does anyone know a good tool for this? Preferably freeware or open source.
It would also be great if the tool worked unattended and could be started from the command line. You would pass one or more files as parameters, and the tool would convert them to UTF-8 automatically.

Regards

m.a.k.

Ricky Spanish · 10.10.2005

Never mind, this is solved.
I wrote my own tool for it.

Regards
m.a.k.

Brzelius · 10.10.2005

Funny. Could you post the source, please?

Ricky Spanish · 11.10.2005

Originally posted by Brzelius
Funny. Could you post the source, please?

I'll only post the most important method, because the whole program is probably a bit too much.
It converts a single character to UTF-8.
For a complete string, it then has to be called once per character.
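
The per-character loop could look roughly like the sketch below. The class name StringToUtf8Sketch, the method convertString, and the idea of passing the converter in as a parameter are assumptions made for illustration, not part of the original tool. The sketch walks the string by code point so that characters outside the BMP (stored as two Java chars) reach the converter as a single value.

```java
import java.util.function.IntFunction;

class StringToUtf8Sketch {

    /**
     * Applies a per-character converter to every code point of the text and
     * concatenates the results. Each returned char is expected to hold one
     * UTF-8 byte (0..255); a null result is treated as an error.
     */
    static String convertString(String text, IntFunction<char[]> perCharacter) {
        StringBuilder out = new StringBuilder(text.length() * 2);
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            char[] utf8 = perCharacter.apply(codePoint);
            if (utf8 == null) {
                throw new IllegalArgumentException(
                        "cannot convert U+" + Integer.toHexString(codePoint).toUpperCase());
            }
            out.append(utf8);
            // Characters outside the BMP occupy two Java chars (a surrogate pair).
            i += Character.charCount(codePoint);
        }
        return out.toString();
    }
}
```

The converter argument would simply forward each value to the single-character method from this post. The result is a string in which every char carries one byte, so it still has to be written out byte by byte.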

The files can then be converted simply by reading them line by line with a BufferedReader, converting each line, and writing it back out with a DataOutputStream.
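
For comparison, here is a minimal sketch of the same file conversion done with the JDK's built-in charset support instead of the hand-written converter. This is a swapped-in technique, not the code of the tool described here. It assumes the input really is ISO-8859-1, and the class name Latin1ToUtf8 and the ".utf8" output suffix are made up for the example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class Latin1ToUtf8 {

    // Re-encodes one file. Assumes the source is ISO-8859-1 and that
    // source and target are different files.
    static void convert(Path source, Path target) throws IOException {
        try (Reader in = Files.newBufferedReader(source, StandardCharsets.ISO_8859_1);
             Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                // Copying raw chars instead of lines keeps the line endings untouched.
                out.write(buffer, 0, n);
            }
        }
    }

    // Hypothetical usage: java Latin1ToUtf8 a.txt b.txt
    // writes a.txt.utf8 and b.txt.utf8 next to the originals.
    public static void main(String[] args) throws IOException {
        for (String name : args) {
            convert(Paths.get(name), Paths.get(name + ".utf8"));
        }
    }
}
```

Writing to a separate target file is deliberate: opening the same path for reading and writing would truncate the input before it has been read.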

The method for converting a single character looks like this:

/**
 * Converts the specified Unicode code point to UTF-8.
 * The UTF-8 code can consist of more than one byte, depending on the code point.
 * If the code point is between 0 and 127, the UTF-8 code equals the Unicode value.
 * Otherwise it is converted according to the following table:
 *
 * Unicode:    0000 0000 - 0000 007F
 * UTF-8 code: 0xxxxxxx
 *
 * Unicode:    0000 0080 - 0000 07FF
 * UTF-8 code: 110xxxxx 10xxxxxx
 *
 * Unicode:    0000 0800 - 0000 FFFF
 * UTF-8 code: 1110xxxx 10xxxxxx 10xxxxxx
 *
 * Unicode:    0001 0000 - 0010 FFFF
 * UTF-8 code: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
 *
 * Notice: The x's represent the bits of the original Unicode bit string. They are assigned to the UTF-8 bytes
 * from right to left, which means that the least significant bit of the Unicode bit string becomes the least
 * significant bit of the rightmost UTF-8 byte, and so on. The most significant bits may be 0, depending on the
 * length of the Unicode bit string.
 *
 * @param codePoint Unicode code point that should be converted. The parameter is an int because a Java char
 *                  holds only 16 bits and could never reach the four-byte range; a char argument is widened
 *                  automatically.
 *
 * @return char array representing the UTF-8 code for the specified code point, one byte (0 to 255) per array
 *         element. The UTF-8 code consists of 1 to 4 bytes, depending on the code point.
 *         Returns null in case of an error (negative value, surrogate, or value above 0x10FFFF)
 */
public static char[] convertToUTF8(int codePoint)
{
    //invalid: negative values and surrogates (D800 - DFFF) must not be encoded in UTF-8
    if((codePoint < 0) || ((codePoint >= 0xD800) && (codePoint <= 0xDFFF)))
    {
        return(null);
    }
    //Unicode: 0000 0000 - 0000 007F
    //UTF-8 code: 0xxxxxxx
    else if(codePoint <= 0x7F)
    {
        //if the code point is between 0 and 127, the UTF-8
        //code equals the Unicode value
        return(new char[] {(char) codePoint});
    }
    //Unicode: 0000 0080 - 0000 07FF
    //UTF-8 code: 110xxxxx 10xxxxxx
    else if(codePoint <= 0x7FF)
    {
        return(new char[] {(char) (0xC0 | (codePoint >> 6)),
                           (char) (0x80 | (codePoint & 0x3F))});
    }
    //Unicode: 0000 0800 - 0000 FFFF
    //UTF-8 code: 1110xxxx 10xxxxxx 10xxxxxx
    else if(codePoint <= 0xFFFF)
    {
        return(new char[] {(char) (0xE0 | (codePoint >> 12)),
                           (char) (0x80 | ((codePoint >> 6) & 0x3F)),
                           (char) (0x80 | (codePoint & 0x3F))});
    }
    //Unicode: 0001 0000 - 0010 FFFF
    //UTF-8 code: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    else if(codePoint <= 0x10FFFF)
    {
        return(new char[] {(char) (0xF0 | (codePoint >> 18)),
                           (char) (0x80 | ((codePoint >> 12) & 0x3F)),
                           (char) (0x80 | ((codePoint >> 6) & 0x3F)),
                           (char) (0x80 | (codePoint & 0x3F))});
    }
    //invalid: above the Unicode range
    else
    {
        return(null);
    }
}
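
To make the table concrete, here is a small worked example for the two-byte row only, using 'ü' (U+00FC). The class TwoByteExample and the method encodeTwoBytes are hypothetical names for this illustration; the bit operations are the same ones used in the two-byte branch above.

```java
class TwoByteExample {

    // Encodes a code point from the range U+0080..U+07FF by hand.
    static int[] encodeTwoBytes(int codePoint) {
        int high = 0xC0 | (codePoint >> 6);    // 110xxxxx: marker plus the upper 5 bits
        int low  = 0x80 | (codePoint & 0x3F);  // 10xxxxxx: marker plus the lower 6 bits
        return new int[] {high, low};
    }

    public static void main(String[] args) {
        // U+00FC is binary 00011 111100 when padded to 11 bits:
        // upper 5 bits 00011 -> 110 00011 = 0xC3
        // lower 6 bits 111100 -> 10 111100 = 0xBC
        int[] bytes = encodeTwoBytes(0xFC);
        System.out.printf("%02X %02X%n", bytes[0], bytes[1]);   // prints "C3 BC"
    }
}
```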

Regards
m.a.k.

p.s.: sorry if the indentation doesn't come through properly, I couldn't be bothered to format it in HTML as well. The function is fairly simple, though, so it should be easy enough to follow anyway...
