
Jun 13

Removing HTML tags from a file

Soft!

Let's see how to strip the tags from a file in Java and return plain text.

Regular expressions

A regular expression with a non-greedy quantifier removes everything between angle brackets (< and >):

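import java.io.*;

public class Html2TextWithRegExp {
   private Html2TextWithRegExp() {}

   public static void main (String[] args) throws Exception{
     StringBuilder sb = new StringBuilder();
     // try-with-resources closes the reader even if reading fails
     try (BufferedReader br = new BufferedReader(new FileReader("java-new.html"))) {
       String line;
       while ( (line=br.readLine()) != null) {
         // readLine() drops the line terminator, so consecutive lines are joined directly
         sb.append(line);
         // or, to keep the line breaks:
         //  sb.append(line).append(System.lineSeparator());
         //  (then use "(?s)<.*?>" so that tags spanning several lines still match)
       }
     }
     // "<.*?>" is non-greedy: it matches from each '<' to the nearest '>'
     String nohtml = sb.toString().replaceAll("<.*?>","");
     System.out.println(nohtml);
   }
}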
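
To see what the regular expression does and does not handle, here is a minimal sketch that applies the same pattern to in-memory strings instead of a file. The class name TagStripSketch and the method stripTags are made up for this illustration; the DOTALL flag is an addition of mine so that a tag broken across lines is still matched.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original listing.
final class TagStripSketch {
    // Same non-greedy pattern as above, compiled once.
    // DOTALL (an assumption added here) lets '.' match line breaks inside a tag.
    private static final Pattern TAG = Pattern.compile("<.*?>", Pattern.DOTALL);

    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // Plain markup is removed as expected.
        System.out.println(stripTags("<p>Hello <b>world</b></p>"));   // Hello world
        // Entities are left untouched: the regex does not decode them.
        System.out.println(stripTags("<p>a &lt; b</p>"));             // a &lt; b
        // A '>' inside an attribute value ends the match too early.
        System.out.println(stripTags("<a title=\"x > y\">link</a>")); //  y">link
    }
}
```

The last two calls show why this approach is only suitable for simple, predictable markup: entities stay encoded and a '>' inside an attribute value leaves debris in the output.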

javax.swing.text.html.HTMLEditorKit

HTMLEditorKit works well if the HTML is well-formed.

import java.io.IOException;
import java.io.FileReader;
import java.io.Reader;
import java.util.List;
import java.util.ArrayList;

import javax.swing.text.html.parser.ParserDelegator;
import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.HTML.Tag;
import javax.swing.text.MutableAttributeSet;

public class HTMLUtils {
  private HTMLUtils() {}

  public static List<String> extractText(Reader reader) throws IOException {
    final List<String> list = new ArrayList<String>();

    ParserDelegator parserDelegator = new ParserDelegator();
    // Only handleText collects output; the other callbacks are deliberately empty
    ParserCallback parserCallback = new ParserCallback() {
      @Override
      public void handleText(final char[] data, final int pos) {
        list.add(new String(data));
      }
      @Override
      public void handleStartTag(Tag tag, MutableAttributeSet attribute, int pos) { }
      @Override
      public void handleEndTag(Tag t, final int pos) {  }
      @Override
      public void handleSimpleTag(Tag t, MutableAttributeSet a, final int pos) { }
      @Override
      public void handleComment(final char[] data, final int pos) { }
      @Override
      public void handleError(final java.lang.String errMsg, final int pos) { }
    };
    // true = ignore any charset declared inside the document
    parserDelegator.parse(reader, parserCallback, true);
    return list;
  }

  public final static void main(String[] args) throws Exception{
    // try-with-resources closes the file when parsing is done
    try (FileReader reader = new FileReader("java-new.html")) {
      List<String> lines = HTMLUtils.extractText(reader);
      for (String line : lines) {
        System.out.println(line);
      }
    }
  }
}
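
The same parser can be fed from any Reader, not only a file. As a rough sketch, the variant below parses an in-memory string through a StringReader and joins the text fragments with a single space. The names SwingTextSketch and textOf are invented for this example, and joining with a space is my own choice, not something the Swing API prescribes.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

import javax.swing.text.html.HTMLEditorKit.ParserCallback;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper, not part of the original listing.
final class SwingTextSketch {
    static String textOf(String html) {
        final StringBuilder out = new StringBuilder();
        // ParserCallback already provides empty methods,
        // so only handleText needs to be overridden.
        ParserCallback callback = new ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (out.length() > 0) {
                    out.append(' ');   // separator between fragments (an assumption)
                }
                out.append(data);
            }
        };
        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Not expected when reading from a String
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(textOf("<html><body><p>Hello <b>world</b></p></body></html>"));
    }
}
```

Overriding only handleText keeps the callback short; the explicit empty methods in the listing above are equivalent, just more verbose.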

Using an HTML parser

This is probably the best solution. I use the open source project Jsoup.

import java.io.IOException;
import java.io.FileReader;
import java.io.Reader;
import java.io.BufferedReader;
import org.jsoup.Jsoup;

public class HTMLUtils {
  private HTMLUtils() {}

  public static String extractText(Reader reader) throws IOException {
    StringBuilder sb = new StringBuilder();
    BufferedReader br = new BufferedReader(reader);
    String line;
    while ( (line=br.readLine()) != null) {
      // keep a line break so words on consecutive lines are not glued together
      sb.append(line).append('\n');
    }
    // Jsoup builds a document tree; text() returns its text without the markup
    String textOnly = Jsoup.parse(sb.toString()).text();
    return textOnly;
  }

  public final static void main(String[] args) throws Exception{
    // try-with-resources closes the file when extraction is done
    try (FileReader reader = new FileReader
          ("C:/RealHowTo/topics/java-language.html")) {
      System.out.println(HTMLUtils.extractText(reader));
    }
  }
}

That's all, happy coding.
