sindresorhus/file-type

False-positives for the `msi` detection #162

fazouane-marouane posted on GitHub

Hello

It seems that doc files (in fact all CFB-based files: msi, doc, xls, ppt, oft...) are recognized as "msi" files. Example of a doc file: http://www.softdoteducation.com/upload/study-notes/STNOTES_DOC_5.doc

Checking the code, I found the following:

    if (check([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1])) {
        return {
            ext: 'msi',
            mime: 'application/x-msi'
        };
    }

The issue here is that 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 is the header signature of the CFB format in general, not of msi specifically. See https://msdn.microsoft.com/en-us/library/dd941946.aspx
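To make the point concrete, here is a minimal sketch of what a check on these eight bytes can and cannot establish. The helper names `hasCfbSignature` and `detectContainer` are hypothetical, and the `cfb` / `application/x-cfb` values are placeholders for "some CFB container", not something this thread has agreed on.

```javascript
// Header signature shared by every Compound File Binary (CFB) container.
const CFB_SIGNATURE = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1];

// Hypothetical helper: true if the bytes start with the CFB signature.
function hasCfbSignature(bytes) {
    return bytes.length >= CFB_SIGNATURE.length &&
        CFB_SIGNATURE.every((byte, index) => bytes[index] === byte);
}

// Hypothetical detector: the signature alone only proves "CFB container".
// It matches .msi, .doc, .xls and .ppt files alike, so it cannot return 'msi'.
function detectContainer(bytes) {
    if (hasCfbSignature(bytes)) {
        return {ext: 'cfb', mime: 'application/x-cfb'}; // placeholder values
    }

    return undefined;
}
```

Anything more specific than "CFB container" needs information from beyond the signature.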

Can we please remove this entry and discuss other ways of recognizing such files? Thanks


posted by karlhiramoto over 6 years ago

mmmagic detects .doc files correctly as application/msword

posted by thehappycoder almost 6 years ago

I have something that detects doc/xls/ppt/msi..., but it needs to parse the whole CFB container file and doesn’t use magic values, which is not ideal for most people (about +1 kB gzipped, if my memory serves me right).
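The implementation mentioned here is not shown in the thread. As a rough illustration of why the container has to be read beyond its first bytes, the sketch below uses a simpler technique than parsing the whole container: it locates the root directory entry via the CFB header and compares its CLSID. The function names are hypothetical, and the MSI CLSID value ({000C1084-0000-0000-C000-000000000046}) is an assumption that should be verified against real .msi files before relying on it.

```javascript
const CFB_SIGNATURE = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1];

// Assumed root-storage CLSID of Windows Installer packages,
// {000C1084-0000-0000-C000-000000000046}, in its on-disk byte order.
const MSI_ROOT_CLSID = '84100c0000000000c000000000000046';

// Hypothetical helper: returns the root entry's CLSID as a hex string,
// or undefined if the input is not a readable CFB file.
function readRootClsid(bytes) {
    if (bytes.length < 512 || !CFB_SIGNATURE.every((byte, i) => bytes[i] === byte)) {
        return undefined;
    }

    const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);

    // Sector shift: 9 means 512-byte sectors, 12 means 4096-byte sectors.
    const sectorShift = view.getUint16(0x1E, true);
    if (sectorShift !== 9 && sectorShift !== 12) {
        return undefined;
    }

    const sectorSize = 2 ** sectorShift;
    const firstDirectorySector = view.getUint32(0x30, true);

    // Sector n starts at (n + 1) * sectorSize because the header occupies the
    // first sector. The root entry is the first directory entry, and its
    // CLSID sits 80 bytes into that entry.
    const clsidOffset = ((firstDirectorySector + 1) * sectorSize) + 80;
    if (clsidOffset + 16 > bytes.length) {
        return undefined;
    }

    return Array.from(
        bytes.subarray(clsidOffset, clsidOffset + 16),
        byte => byte.toString(16).padStart(2, '0')
    ).join('');
}

// Hypothetical check built on the assumed CLSID above.
function looksLikeMsi(bytes) {
    return readRootClsid(bytes) === MSI_ROOT_CLSID;
}
```

Because the directory can live anywhere in the file, this approach needs access to more than a fixed-size prefix, which is the trade-off against plain magic values.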

posted by fazouane-marouane almost 6 years ago

I have the same issue.

wrong (using "file-type-cli"):

    $ node cli.js my.doc
    msi
    application/x-msi

correct (using Ubuntu's "file" command):

    $ file --mime-type -b my.doc
    application/msword

posted by ilsundal almost 6 years ago

I have something that detects doc/xls/ppt/msi..., but it needs to parse the whole CFB container file and doesn’t use magic values, which is not ideal for most people (about +1 kB gzipped, if my memory serves me right).

If this is the only way then so be it :)

Note that Ubuntu's "file" command/utility also works nicely (and super-fast). One option could be to investigate how it does it and then do the same in the "file-type" library.

posted by ilsundal almost 6 years ago
posted by PhantomSophia almost 6 years ago

The fix would not be to add doc support, which I'm not interested in, but rather to improve the msi detection.

posted by sindresorhus almost 6 years ago

@issuehunt has funded $40.00 to this issue.


posted by issuehunt-app[bot] almost 6 years ago

Hi, is anyone able to provide true positive .msi files?

Using the following check, I've been able to detect the msi files I've found without any false positives on .doc, .xls or .ppt files.

Full disclosure: I started with the commented out magic bytes from https://github.com/sindresorhus/file-type/issues/162#issuecomment-500106450 but ended up just reading the byte stream of msi files.

check([0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x3E, 0x00, 0x04, 0x00, 0xFE, 0xFF, 0x0C, 0x00, 0x06])
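For readers wondering what the extra bytes in this check correspond to, the sketch below decodes them using the CFB header layout (8-byte signature, 16-byte header CLSID, then little-endian 16-bit fields). The function name `describeCfbHeader` is hypothetical and the offsets are taken from the CFB header layout, not from this thread.

```javascript
// The 33 bytes used in the check above.
const checkedBytes = Uint8Array.from([
    0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1, // CFB signature
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, // header CLSID (all zero)
    0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
    0x3E, 0x00, // minor version
    0x04, 0x00, // major version
    0xFE, 0xFF, // byte-order mark
    0x0C, 0x00, // sector shift
    0x06 // first byte of the mini sector shift
]);

// Hypothetical helper: decode the fixed-position header fields that
// follow the signature and the header CLSID.
function describeCfbHeader(bytes) {
    const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
    return {
        minorVersion: view.getUint16(24, true), // 0x003E
        majorVersion: view.getUint16(26, true), // 4
        byteOrder: view.getUint16(28, true), // 0xFFFE, little-endian marker
        sectorShift: view.getUint16(30, true) // 12, i.e. 4096-byte sectors
    };
}

console.log(describeCfbHeader(checkedBytes));
```

Read this way, the longer check pins the match to CFB major version 4 with 4096-byte sectors. None of these fields is specific to msi, so a CFB file of another type written with the same header values would presumably match as well; the thread does not test that case.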
posted by HugoDF over 5 years ago

@sindresorhus has rewarded $36.00 to @hugodf. See it on IssueHunt

  • :moneybag: Total deposit: $40.00
  • :tada: Repository reward(0%): $0.00
  • :wrench: Service fee(10%): $4.00
posted by issuehunt-app[bot] over 5 years ago
