Media import not matching existing object identifiers

edited March 14 in Troubleshooting

My installation (Providence 1.6.2) uses object identifiers of the form DSYHC.2003.2 or DSYHC.2003.4.1.

These object identifiers definitely exist in the database and I am now trying to do a bulk import of images to match up against these identifiers using appropriate filenames, e.g. DSYHC.2003.2.jpg and DSYHC.2003.4.1.jpg.

The files are correctly located in the import directory (see attached image).

When I run the batch import, I get an error message saying that each image was skipped because it couldn't be matched (also see attached image).

My import settings are as below (and in the attached image):

  • Import Mode = Import only media which can be matched with existing records
  • Object Identifier = Set object identifier to file name without extension

I believe everything is named and configured correctly on the media import page so am now not sure what I am missing.

Thanks for any help.

Andy


Comments

  • edited March 20
    Still no luck with this. Does anyone have any thoughts on what might be going on? Thanks
  • Do I need to modify a regex expression in a configuration file somewhere to make a successful match?
  • Most likely you do. What modifications are required will depend upon the structure of the filenames on the files you're importing.
  • edited March 22

    Hi Seth. The structure of the filenames and the configuration are in the original post.

    e.g. filenames of media are DSYHC.2003.2.jpg, DSYHC.2003.4.1.jpg, etc.

    Existing idnos in the database are the same as these but without the extension, e.g. DSYHC.2003.2, DSYHC.2003.4.1

    I thought this would have matched without any problem when "Object Identifier = Set object identifier to file name without extension"

    Is that not the case?

    Thanks

    Andy

  • From what I can see, the standard regex in app.conf should work fine with the above examples for matching against filename without extension, i.e.

     filename_without_extension = {
       displayName = _(Filename without extension),
       regexes = { "(.*?)\.[A-Za-z0-9]+$" }
     }

    I've tested this code outside CollectiveAccess with the above filenames and it seems to match OK yet for some reason it doesn't match the idno for the objects.
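
For anyone who wants to repeat that check, here is a minimal sketch in Python (not CollectiveAccess code) that applies the same pattern to the filenames from this thread. It assumes Python's `re` module handles a pattern this simple the same way as the regex engine CollectiveAccess uses; `extract_idno` is a hypothetical helper name.

```python
import re

# Pattern copied from the filename_without_extension entry quoted above.
FILENAME_WITHOUT_EXTENSION = re.compile(r"(.*?)\.[A-Za-z0-9]+$")


def extract_idno(filename):
    """Return the first captured group, or None if the pattern does not match."""
    match = FILENAME_WITHOUT_EXTENSION.match(filename)
    return match.group(1) if match else None


for name in ["DSYHC.2003.2.jpg", "DSYHC.2003.4.1.jpg"]:
    print(name, "->", extract_idno(name))
```

Both filenames yield the full identifier with only the final extension removed, so the pattern itself does not look like the problem.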

  • I'd have to see the system to tell what's wrong. If things don't line up exactly it won't work at all.
  • Hi Seth. As far as I can see the idno of the existing record exactly matches the filename (minus the extension) of the media I'm trying to import but it says it can't find a match. I could give you access to the system if that helps?
    Thanks
    Andy
  • edited March 24

    OK, I tried again with a 1.6.3 installation and get the same result. I'm trying to import one media file to match against an existing idno.

    idno = DSYHC.2003.4.1

    media filename = DSYHC.2003.4.1.jpg

    My import settings are as above, i.e.

    • Import Mode = Import only media which can be matched with existing records
    • Object Identifier = Set object identifier to file name without extension

    It still fails to match. The debug log file is attached here.

    Any ideas? Thanks

  • Did the log file help? I'm really stuck now on this so would appreciate any pointers.

    Thanks

    Andy

  • I am having the exact same issue in 1.7.

    My object accession idnos are 2003.1.1, 2003.1.2, 2003.1.3, etc. My media is in the "import" directory and the filenames are 2003.1.1.pdf, 2003.1.2.pdf, 2003.1.3.pdf, etc. I've used the same settings as pictured in Andy's sample screenshots in his original post, although I have "Document" selected under "Limit to types", as the correct object is a document type. Screenshots attached.

    idno = 2003.1.2

    media filename = 2003.1.2.pdf


    In my case, all of the media is matching to a random non-document object with a completely different idno, with the message: matched using expression "Filename with page number - page number stripped". This is the same whether I select "set identifier to filename" with or without extension.

    The installation is on the CA hosted server, so it should be easy to poke around. I have a "test" directory with 3 of these files, but the full import will be over 8000 records.

    Also, strangely, although "delete after import" is checked, nothing is deleted from the import directory.

    Thanks for any direction.
  • edited March 28

    It's good to hear I'm not the only one experiencing this but I have completely run out of ideas on what's going wrong!

    The idno is exactly the same as the media filename but without the extension. I agree that selecting the option with or without extension doesn't seem to make any difference.

    Looking at the debug log attached above it seems to go through all possible matching selections even though I chose "set identifier to filename without extension".

    I also see the following attempt at matching.

    2017-03-24 19:23:41 - DEBUG --> Processing mediaFilenameToObjectIdnoRegexes entry filename_without_extension
    2017-03-24 19:23:41 - DEBUG --> Trying to match on file name 'DSYHC.2003.4.1.jpg'
    2017-03-24 19:23:41 - DEBUG --> Names to match: Array
    (
        [0] => DSYHC.2003.4.1.jpg
    )

    2017-03-24 19:23:41 - DEBUG --> Matched name DSYHC.2003.4.1.jpg on regex (.*?).[A-Za-z0-9]+$
    2017-03-24 19:23:41 - DEBUG --> Trying to find records using boolean OR and values Array
    (
        [idno] => %DSYHC.2003.4.1%
    )

    Why does the idno get shown as %DSYHC.2003.4.1% rather than DSYHC.2003.4.1? Is that why no match occurs? The database clearly shows the correct idno.
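
One way to reason about the `%` characters: in SQL, `%` is the wildcard used by `LIKE`. The log does not say how the value is actually used, so the following is only a sketch under the assumption that it ends up in a `LIKE`-style comparison. The `objects` table here is hypothetical and is not the real CollectiveAccess schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical one-column table, used only to illustrate LIKE wildcards.
conn.execute("CREATE TABLE objects (idno TEXT)")
conn.executemany(
    "INSERT INTO objects VALUES (?)",
    [("DSYHC.2003.4.1",), ("DSYHC.2003.4.10",), ("DSYHC.2003.2",)],
)


def find_like(pattern):
    """Return all idnos matching the given LIKE pattern, sorted."""
    rows = conn.execute(
        "SELECT idno FROM objects WHERE idno LIKE ? ORDER BY idno", (pattern,)
    )
    return [row[0] for row in rows]


print(find_like("%DSYHC.2003.4.1%"))  # wildcard on both sides
print(find_like("DSYHC.2003.4.1"))    # no wildcard
```

Under that assumption the wrapped value matches the exact idno and also anything containing it, so the `%` wrapping would widen the search rather than prevent a match.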

    Thanks

    Andy

  • I wondered if the "decimals" were throwing something off, but I switched them to "-"s [hyphens] on one file and the same thing happened. I haven't done this import in quite a while, but I know it did work about two years ago [last time I tried].
  • edited March 28
    I tried exactly the same thing with the same result. I've not tried this import before and thought it would be straightforward. I'm positive everything is being done correctly so something seems not right.
  • Any progress on this one?  I'm still stumped.
  • I'm also keen to see what might be causing this issue. Any thoughts?
  • edited April 7

    I managed to get an import working though it's not how I would expect to do it. My media files are still named using the format as above, i.e.

    media filename = DSYHC.2003.4.1.jpg

    ... and I'm trying to match against ID numbers of the same form but without the extension, i.e.:

    idno = DSYHC.2003.4.1

    All the objects in the database have been previously assigned to a sub-type (i.e. a sub-division of one of the principal types). These types and sub-types were configured in the installation profile.

    e.g. a principal type could be 'Artefact' or 'Publication', where 'Publication' could have sub-types 'Book', 'Magazine', 'Postcard', etc.

    My menu selections on the media import page are:

    Import Mode = Import media that can be matched with existing records

    Set = Do not associate imported media with a set

    Object Identifier = Set object Identifier to filename without extension

    Then show Advanced Options

    Matching = Match using file name where identifier matches exactly

    Limit to Types = Select all the principal types in the list (holding down CTRL to select multiple types). I couldn't see a way of de-selecting every option so that no limit was imposed, but selecting all the principal types includes all their sub-types too, so it effectively covers all objects.

    Object Representation Identifier = Set object Representation Identifier to filename without extension

    When I ran the media import, it now seemed to match existing objects correctly, but I'd still be interested in the CA team's thoughts.

    Andy

  • Andy, I tried your method and still get the same wrong results.

    One thing I have noticed is that the matching that I do get is not really matching at all. It simply attaches every single media item to the very first object listed in the database when sorted by "CollectiveAccess id", without using the object identifier at all.

    In my case this first object has the very first CollectiveAccess id of #834 and object identifier of 2003.7.1, [the first 833 objects were previously deleted], rather than matching against the correct object identifier of 2003.1.1.  See attached screenshot.

    In the end, all three test media files were attached to the same object with CollectiveAccess id #834, even though they have differing object identifiers.  Also, nothing is deleted when I have "Delete media after import" checked.

    See previous screenshots from March 28.  All of the other settings are just as you describe.

    Seth .... Is there a setting in app.conf that determines the field media is matched on?
  • Well it's hard for me to say what's going on without mucking with your actual system. The media importer uses a series of regular expressions against the file name of an imported file. The first regular expression that matches is used and the matched text in the first parenthesized group is what is actually searched for.

    The default app.conf has this regular expression first: (.*?)\.[A-Za-z0-9\-]+\.[A-Za-z]+$
    This expression only matches when the file name ends in two period-separated segments (for example a page number followed by the file extension), and it extracts the portion of the file name before those two segments. E.g. a2017.1.jpg would match, but only "a2017" would be used to match against idno.

    I think this pattern is causing problems for some of your files, as you actually want the period separated numeric extensions included, and only the file extension omitted. 
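
To make the effect of that first pattern concrete, here is a small Python sketch (an illustration only, not the importer's code) that prints what its first parenthesized group captures for file names of the kinds mentioned in this thread. `extracted_value` is a hypothetical helper name, and `photo.jpg` is a made-up file name included to show a non-match.

```python
import re

# The first pattern in the default list, as quoted above.
PAGE_NUMBER_STRIPPED = re.compile(r"(.*?)\.[A-Za-z0-9\-]+\.[A-Za-z]+$")


def extracted_value(filename):
    """Return the text captured by the first group, or None if no match."""
    match = PAGE_NUMBER_STRIPPED.match(filename)
    return match.group(1) if match else None


for name in [
    "a2017.1.jpg",
    "DSYHC.2003.2.jpg",
    "DSYHC.2003.4.1.jpg",
    "2003.1.2.pdf",
    "photo.jpg",
]:
    print(name, "->", extracted_value(name))
```

In each matching case the last numeric segment is dropped along with the extension, so the value searched for is shorter than the real idno.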

    Another issue, which is definitely affecting Den's work is a bug in how type restrictions are handled. It's mangling the query such that it's effectively OR'ing type and identifier, which casts a very wide net. This is why you're getting odd matches.
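
As a rough illustration of why OR'ing the two conditions casts such a wide net, here is a sketch in Python over a hypothetical in-memory record list. It does not reproduce the actual query or schema; the record values and the names `ObjectRecord`, `match_and` and `match_or` are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class ObjectRecord:
    object_id: int
    idno: str
    type_code: str


# Hypothetical records for illustration only.
RECORDS = [
    ObjectRecord(1, "2003.1.1", "document"),
    ObjectRecord(2, "2003.1.2", "document"),
    ObjectRecord(3, "2003.7.1", "artefact"),
    ObjectRecord(4, "2003.1.3", "document"),
]


def match_and(records, idno, allowed_types):
    """Intended behaviour: the type restriction narrows the idno match."""
    return [r.object_id for r in records
            if r.type_code in allowed_types and r.idno == idno]


def match_or(records, idno, allowed_types):
    """Mangled behaviour: either condition alone is enough to match."""
    return [r.object_id for r in records
            if r.type_code in allowed_types or r.idno == idno]


print(match_and(RECORDS, "2003.1.2", {"document"}))
print(match_or(RECORDS, "2003.1.2", {"document"}))
```

With AND only the one intended record comes back; with OR every record of an allowed type comes back regardless of its identifier.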

    I just patched the issue and will roll it out to hosting... you can also pick it up from the GitHub develop branch. The fix will also be in the 1.7 release, which looks to be coming in the next few days given the lack of complaint regarding the 1.7 release candidate.

    seth
  • edited April 10

    Thanks for the response Seth. You're correct that I want everything prior to the final period to be matched against the idno currently in the database. This would include the period separated "numeric extensions" you refer to. I only consider the characters after the final period to be the actual extension, e.g. jpg

    However, I thought that this would be achieved using the regex (.*?)\.[A-Za-z0-9]+$ from app.conf, not the one you put in your comment. The one I assumed seems to be the one used when matching against filename without extension, which is what I selected. It seems to work OK in an external regex tester (https://regex101.com/). According to app.conf, the one you included seems to be the one for "Filename with page number - page number stripped", though I'm not sure what that refers to as it's not given that name in the GUI.

    Thanks

    Andy


  • You might want to reorder them to get the one that behaves as you want up top. I wasn't commenting on your numbers specifically; if (.*?)\.[A-Za-z0-9]+$ works for you, use it. If all of your file names are in the format "DSYHC.2003.2.jpg" then this will work OK. On the other hand, (.*?)\.[A-Za-z0-9\-]+\.[A-Za-z]+$ will also match, but will only extract "DSYHC.2003". If the latter regular expression is first in the list, it is the one that will be used. The point I was trying to make is that you need to take the order of the regular expressions in the list into account and make sure the ones you want tried first are up top.
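
The ordering point can be sketched in a few lines of Python. This is an illustration of "first matching expression wins", not the importer's actual code. The two-entry list is a simplification (the real list in app.conf may contain more entries), and `first_match` is a hypothetical helper name.

```python
import re

# Two entries from this thread, in the order described above.
DEFAULT_ORDER = [
    ("Filename with page number - page number stripped",
     r"(.*?)\.[A-Za-z0-9\-]+\.[A-Za-z]+$"),
    ("Filename without extension",
     r"(.*?)\.[A-Za-z0-9]+$"),
]

# Same entries with the preferred pattern moved to the top.
REORDERED = list(reversed(DEFAULT_ORDER))


def first_match(filename, patterns):
    """Return (label, captured value) for the first pattern that matches."""
    for label, pattern in patterns:
        match = re.match(pattern, filename)
        if match:
            return label, match.group(1)
    return None


print(first_match("DSYHC.2003.2.jpg", DEFAULT_ORDER))
print(first_match("DSYHC.2003.2.jpg", REORDERED))
```

With the first ordering the value searched for is "DSYHC.2003"; with the patterns swapped it is "DSYHC.2003.2", which is the identifier actually stored.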

    If you were importing media with type restrictions set in "advanced options" then the fix I just pushed will help.

    seth
