Note: document content extraction with Lucene + Tika

Tika is a library for file type detection and file content extraction. Its main features are:

  • Unified parser interface. Tika wraps its third-party parsing libraries behind a single parser interface, so users no longer have to choose a parsing library for each document type.
  • Low memory usage. Tika consumes few memory resources, which makes it easy to embed in Java applications.
  • Fast processing. Applications can expect quick content detection and information extraction.
  • Flexible metadata. Tika understands the metadata models commonly used to describe files.
  • Parser integration. Tika can use several parser libraries, one or more per file type, within a single application.
  • MIME type detection. Tika can detect, and extract content from, the media types included in the MIME standards.
  • Language detection. Tika includes language identification, so a multilingual website can handle documents according to their language.
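
The MIME type detection point above can be tried directly through the Tika facade class. This is a minimal sketch, assuming tika-core is on the classpath; the class name MimeDetectDemo, its helper methods, and the sample file name are made up for illustration.

```java
import org.apache.tika.Tika;

import java.nio.charset.StandardCharsets;

// Hypothetical demo class; assumes tika-core is on the classpath.
class MimeDetectDemo {
    private static final Tika TIKA = new Tika();

    // Detect the media type from the file name alone (extension match).
    static String detectByName(String fileName) {
        return TIKA.detect(fileName);
    }

    // Detect the media type from the leading bytes of the content (magic-number match).
    static String detectByContent(byte[] prefix) {
        return TIKA.detect(prefix);
    }

    public static void main(String[] args) {
        // "report.pdf" is an illustrative name; no such file needs to exist.
        System.out.println(detectByName("report.pdf"));

        // A PDF file starts with the signature "%PDF-".
        byte[] pdfHeader = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detectByContent(pdfHeader));
    }
}
```

Name-based detection is cheap but trusts the extension, while content-based detection inspects the bytes. Both calls here should report application/pdf.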
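
The following program extracts the text of every file in a folder:

    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.SAXException;

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    // The class name is illustrative; the original snippet showed only main().
    public class TikaExtractDemo {
        public static void main(String[] args) throws IOException, TikaException, SAXException {
            // Folder that holds the files to parse
            File dir = new File("/Users/fxl/IdeaProjects/learning-pro/lucene/src/main/resources/doc");
            // listFiles() returns null if the folder is missing or is not a directory
            File[] fileArr = dir.listFiles();
            if (fileArr == null) {
                System.err.println("Folder does not exist, please check!");
                System.exit(1);
            }
            // Method 1: the Tika facade (org.apache.tika.Tika) returns the text in one call
    //        Tika tika = new Tika();
    //        for (File f : fileArr) {
    //            String fileContent = tika.parseToString(f);
    //            System.out.println("Extracted Content: " + fileContent);
    //        }
            // Method 2: AutoDetectParser with an explicit content handler and metadata
            Parser parser = new AutoDetectParser();
            ParseContext parseContext = new ParseContext();
            for (File f : fileArr) {
                if (!f.isFile()) {
                    continue; // skip subfolders
                }
                // Use a fresh handler (10 MB write limit) and metadata per file,
                // otherwise text and metadata from earlier files accumulate
                BodyContentHandler handler = new BodyContentHandler(10 * 1024 * 1024);
                Metadata metadata = new Metadata();
                // try-with-resources closes the stream even if parsing fails
                try (InputStream in = new FileInputStream(f)) {
                    parser.parse(in, handler, metadata, parseContext);
                }
                System.out.println(f.getName() + ":\n" + handler.toString());
            }
        }
    }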
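
The program above fills a Metadata object but never reads it. As a rough sketch of how the collected metadata could be inspected, the example below parses an in-memory text and prints every metadata entry. It assumes tika-core is on the classpath (tika-parsers is also needed for real document formats); the class name MetadataDemo, the helper parse(), and the sample text are made up for illustration.

```java
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; assumes tika-core is on the classpath.
class MetadataDemo {
    // Parse the given bytes and return the metadata Tika collected.
    static Metadata parse(byte[] data) throws Exception {
        Metadata metadata = new Metadata();
        try (InputStream in = new ByteArrayInputStream(data)) {
            // -1 disables the write limit of the content handler
            new AutoDetectParser().parse(in, new BodyContentHandler(-1), metadata, new ParseContext());
        }
        return metadata;
    }

    public static void main(String[] args) throws Exception {
        // Illustrative sample input instead of a file on disk
        Metadata metadata = parse("Hello, Tika!".getBytes(StandardCharsets.UTF_8));
        // Print every metadata key with its first value
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}
```

Which keys appear depends on the file type and on the parsers available. A Content-Type entry is normally present because AutoDetectParser records the detected media type.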

Tags: Programming less Java

Posted on Sat, 18 Apr 2020 07:41:19 -0700 by Bryan Ando