Java – How to remove HTML from a String

Tuesday, May 19th, 2009

HTML tags can be removed from a string with the String replaceAll() method and a regular expression. The method below strips the tags, replaces carriage returns with <br/> and newlines with a space, escapes single and double quotes as HTML entities, and returns the result as plain text.

public class RemoveHTML {

    public static String removeHTML(String htmlString)
    {
        // Remove HTML tags: non-greedy match from '<' to the next '>'
        String noHTMLString = htmlString.replaceAll("<.*?>", "");

        // Replace each carriage return with an HTML line break
        noHTMLString = noHTMLString.replaceAll("\r", "<br/>");

        // Replace each newline with a space
        noHTMLString = noHTMLString.replaceAll("\n", " ");

        // Escape single and double quotes as HTML entities
        noHTMLString = noHTMLString.replaceAll("'", "&#39;");
        noHTMLString = noHTMLString.replaceAll("\"", "&quot;");
        return noHTMLString;
    }

    public static void main(String[] args) {

        String strHTML = "<html>" +
                         "<head>" +
                         "<title>Convert HTML to Text String</title>" +
                         "</head>" +
                         "<body>" +
                         "This is HTML String of java's source code \"my program\"" +
                         "</body>" +
                         "</html>";

        String stringWithoutHTML = removeHTML(strHTML);

        System.out.println(stringWithoutHTML);
    }
}

Output

Convert HTML to Text StringThis is HTML String of java&#39;s source code &quot;my program&quot;
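
One limitation of the pattern above: in a Java regular expression, "." does not match line terminators by default, so a tag that is split across two lines is left in place. The sketch below is one possible variation, not part of the original program. It assumes a hypothetical class HtmlStripper with a hypothetical method stripTags, and compiles the same pattern once with Pattern.DOTALL so that multi-line tags are removed as well.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, introduced only for this illustration.
final class HtmlStripper {

    // Compiled once and reused. DOTALL makes '.' match line terminators,
    // so a tag that spans several lines is still matched.
    private static final Pattern TAG = Pattern.compile("<.*?>", Pattern.DOTALL);

    private HtmlStripper() {
    }

    // Returns the input with every "<...>" run removed.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        // The <a> tag below is split across two lines.
        String html = "<p>Hello <a\n href=\"#\">world</a></p>";
        System.out.println(stripTags(html)); // prints: Hello world
    }
}
```

Like the original, this is still a plain regular expression and not an HTML parser, so it treats any "<" followed later by ">" as a tag, including in text such as "a < b and c > d".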