Using Python to query EIDR registry to search multiple titles

12 min readOct 26, 2020

Before you start reading this article, please read my previous article Using Python to access the EIDR registry with EIDR REST APIs to understand to initial setup.

EIDR Query Service

EIDR Query service is one of the powerful service of the EIDR API. This service queries the EIDR content database. It allows you to query any element of an EIDR record. The query service takes a set of metadata criteria as input and returns xml response for all objects that match the criteria. The simplest query tests a single metadata field. More complex queries can be built up by grouping simple queries together with standard logical expressions.
Queries can be based on the content record’s metadata, its modification history (provenance), and it’s virtual fields.

To see all the possible elements that can be queried, search for an EIDR and go to the details page. On the details page, click on the “View XML” link on the top right hand side. This will show the XML layout of the EIDR content.

Below is the xml structure of the EIDR content. All elements in <BaseObjectData>and <ProvenanceMetadata> can be queried.

This article show how to build and run an EIDR query using REST APIs.

Pre-requisites

You have Python installed. For this tutorial, we are using Python version 3.7.4. If not, you can download Python from https://www.python.org/ Please refer to Python documentation to install Python on your machine.
You have a Text editor installed on your machine. There are many python friendly editor, which helps writing Python code easier.

For Mac, you can use SublimeText, Atom, TextWrangler
For Windows, you can use Notepad, Notepad++, SublimeText, Atom

3. You can also install Jupyter, which is a web-based interactive Python development environment. You can download from https://jupyter.org/

Assumption

You know basic Python programming.
You are a member of EIDR and have credentials to access EIDR Registry

EIDR Rest APIs

The EIDR system provides various services using a REST based interface in combination with HTTP 1.1 (see RFC 2616).

Here is the link to the EIDR Rest API document and EIDR Registry Technical Overview

Note: Public services do not necessarily mean open access. Ingesting or registering data into EIDR is controlled, while reading data from EIDR is generally not restricted.

EIDR Query Service API

Refer to 2.3.2 Query service (page 20) in the EIDR Rest API document for the request call and parameters. Also refer to EIDR Registry Technical Overview (page 20 and Appendix A on Page 33) to get additional details and examples.

Let’s get started…

Refer to Using Python to access the EIDR registry with EIDR REST APIs to do the following -

Install the required python packages
Authenticate and authorize to access EIDR Rest APIs

import requests, base64, hashlib
import pandas as pd
from bs4 import BeautifulSoupUserID = '10.5238/xxxxxxxx'.         # enter your EIDR User Id
Pwd = '************'                 # enter your EIDR password
PartyID = '10.5237/xxxxxxxxx'        # enter your EIDR party ID
url = 'https://sandbox1.eidr.org:443/EIDR/' # EIDR Registry URL#Encrypt the credentials
PasswordShadow = base64.b64encode(hashlib.md5(pwd.encode('utf-8')).digest()).decode('utf8')auth_str = '%s:%s:%s' % (UserID, PartyID, PasswordShadow)headers = {'Authorization' : 'Eidr {}'.format(auth_str), 'Accept': 'text/xml', 'Content-Type': 'text/xml'}

Now we write a function to run the query and get the results.

def postEidrQuery(qry, id_only=True):
    if id_only:
        req = url + 'query/?type=id'
    else:
        req = url + 'query/'
        
    resp = requests.post(req, headers=headers, data=qry)    return resp.content

The function takes two parameters. The first is the EIDR query and the second parameter determines the result output. The results can be a list of resulting EIDR IDs or it can be EIDR Ids along with basic EIDR content. By default, we set the result to show IDs only.

Query Examples

Here are a few examples of EIDR queries, which gives you an idea on the possible queries that can be created

Example 1: Simple query to search for a title by the title name. The below query will search for the word “Batman” in the ResourceName element. It will result in all titles which has the word “Batman” in it.

<Request xmlns="http://www.eidr.org/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Operation>
        <Query>
            <Expression>
                (/FullMetadata/BaseObjectData/ResourceName Batman)
            </Expression>
            <PageNumber>1</PageNumber>
            <PageSize>300</PageSize>
        </Query>
    </Operation>
</Request>

If you want an exact title search, you can change the query as below -

(/FullMetadata/BaseObjectData/ResourceName IS Batman) Or(/FullMetadata/BaseObjectData/ResourceName "Batman")

The “IS” operator is used for exact matches or you can use double quotes.

Example 2. Titles search with additional criteria.

<Request xmlns="http://www.eidr.org/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <Operation>
    <Query>
        <Expression>
          (/FullMetadata/BaseObjectData/ResourceName Batman) AND
          (/FullMetadata/BaseObjectData/StructuralType IS Abstraction) AND
          (/FullMetadata/BaseObjectData/ReferentType IS Movie) AND
          (/FullMetadata/BaseObjectData/ReleaseDate IS 2005)
        </Expression>
        <PageNumber>1</PageNumber>
        <PageSize>300</PageSize>
    </Query>
    </Operation>
</Request>

Here we are searching abstract eidrs for all movies with the word Batman that is released in 2005.

Example 3: Here we search for all EIDRs where the Registrant is “10.5237/A929-C667” and the last modification date of the EIDR content was greater than 2020–09–01. This may result in thousands of EIDRs. To limit the number of EIDRs in the resultset, use the <PageSize> element. In the below query the <PageSize> is set to 300.

<Request xmlns="http://www.eidr.org/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <Operation>
    <Query>
        <Expression>              (/ProvenanceMetadata/LastModificationDate &gt; 2020-09-01) AND
              (/ProvenanceMetadata/Administrators/Registrant  "10.5237/A929-C667")
        </Expression>
        <PageNumber>1</PageNumber>
        <PageSize>300</PageSize>
    </Query>
  </Operation>
</Request>

Example 4: Here we search for the exact title “Two and a Half Men” and to show only those EIDRs which has episodic information.

<Request xmlns="http://www.eidr.org/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<Operation>
    <Query>
        <Expression>
            (/FullMetadata/BaseObjectData/ResourceName "Two and a Half Men") 
            AND
            (/FullMetadata/ExtraObjectMetadata/EpisodeInfo EXISTS)    
        </Expression>
        <PageNumber>1</PageNumber>
        <PageSize>300</PageSize>
    </Query>
</Operation>
</Request>

I think, you get the idea on the possible queries that can be created. Refer to EIDR Registry Technical Overview (page 20 and Appendix A on Page 33) to get additional details and examples

To execute the queries, you can call the function we created. Below we save the example 1 query in a variable and then execute it.

qry1 = '''
    <Request xmlns="http://www.eidr.org/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <Operation>
            <Query>
                <Expression>(/FullMetadata/BaseObjectData/ResourceName Batman)</Expression>
                <PageNumber>1</PageNumber>
                <PageSize>300</PageSize>
            </Query>
        </Operation>
    </Request>'''eidr_query_resp = postEidrQuery(qry1)
soup = BeautifulSoup(eidr_query_resp, 'xml')
print(soup.prettify())

Below is the response XML for example 1-

<?xml version="1.0" encoding="utf-8"?>
<Response version="2.6.0" xmlns="http://www.eidr.org/schema">
 <Status>
  <Code>
   0
  </Code>
  <Type>
   success
  </Type>
 </Status>
 <Query>
  <Expression>
   (/FullMetadata/BaseObjectData/ResourceName Batman)
  </Expression>
  <PageNumber>
   1
  </PageNumber>
  <PageSize>
   300
  </PageSize>
 </Query>
 <QueryResults>
  <CurrentSize>
   300
  </CurrentSize>
  <TotalMatches>
   386
  </TotalMatches>
  <ID>
   10.5240/97EC-034D-5938-1904-1FCB-8
  </ID>
  <ID>
   10.5240/F41C-0D39-8C31-5F05-0619-B
  </ID>
  <ID>
   10.5240/B055-803F-5468-7F61-DAD6-K
...........

Since I did not pass the second parameter to our function, it defaults to show the EDIR IDs only. If you see the result XML, it show how many EIDR IDs were found matching the query criteria in the <TotalMatches> element. The XML will show only 300 IDs, since we have limited the results in the <PageSize> element.

We can pass the second parameter = False to see the Simple XML output of the resulting EIDR IDs

qry1 = '''
    <Request xmlns="http://www.eidr.org/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
        <Operation>
            <Query>
                <Expression>(/FullMetadata/BaseObjectData/ResourceName Batman)</Expression>
                <PageNumber>1</PageNumber>
                <PageSize>300</PageSize>
            </Query>
        </Operation>
    </Request>'''eidr_query_resp = postEidrQuery(qry1,False)
soup = BeautifulSoup(eidr_query_resp, 'xml')
print(soup.prettify())

Below is the XML Response —

<?xml version="1.0" encoding="utf-8"?>
<Response version="2.6.0" xmlns="http://www.eidr.org/schema">
 <Status>
  <Code>
   0
  </Code>
  <Type>
   success
  </Type>
 </Status>
 <Query>
  <Expression>
   (/FullMetadata/BaseObjectData/ResourceName Batman)
  </Expression>
  <PageNumber>
   1
  </PageNumber>
  <PageSize>
   300
  </PageSize>
 </Query>
 <QueryResults>
  <CurrentSize>
   300
  </CurrentSize>
  <TotalMatches>
   386
  </TotalMatches>
  <SimpleMetadata>
   <ID>
    10.5240/97EC-034D-5938-1904-1FCB-8
   </ID>
   <StructuralType>
    Abstraction
   </StructuralType>
   <ReferentType>
    Movie
   </ReferentType>
   <ResourceName lang="und">
    Alyas Batman En Robin
   </ResourceName>
   <OriginalLanguage mode="Audio">
    und
   </OriginalLanguage>
   <ReleaseDate>
    1991
   </ReleaseDate>
   <Status>
    valid
   </Status>
  </SimpleMetadata>
  <SimpleMetadata>
   <ID>
    10.5240/F41C-0D39-8C31-5F05-0619-B
   </ID>
   <StructuralType>
    Abstraction
   </StructuralType>
   <ReferentType>
    Movie
   </ReferentType>
.........

Here you see the basic information of each EIDR ID.

Let’s write the code to read the XML and parse the information. Here we can handle both the responses i.e. if only EIDR IDs are outputted or the Simple XML. The below code does the following -

Reads and checks the response xml status code to handle any errors.
If no errors, then check the total matches. If the total matches = 0, then we output it as “No Match found”
If total matches > 0 then we check if the output has Ids Only or has Simple XMLs and then parse the results accordingly.

#read the status code of the response XML
token_status_code = int(soup.Response.Status.Code.contents[0])#If the Status code > 0 means there was an error
if token_status_code > 0:
    print(soup.Response.Status.Type.contents[0])
else:
    #Check if the Total Matches > 0
    if int(soup.Response.TotalMatches.contents[0]) > 0:
        #check if the response has IDs only or Simple XML metadata
        if soup.Response.QueryResults.SimpleMetadata is not None:
            for metadata in soup.Response.QueryResults.findAll('SimpleMetadata'):
                print('{}: {} - {}'.format(metadata.ID.contents[0],metadata.ResourceName.contents[0],metadata.ReleaseDate.contents[0]))
        elif soup.Response.QueryResults.ID is not None:
            for id in soup.Response.QueryResults.findAll('ID'):
                print('EIDR ID = {}'.format(id.contents[0]))     
    else:
        print('No Match found')

Here is the result, when you run the above -

10.5240/97EC-034D-5938-1904-1FCB-8: Alyas Batman En Robin - 1991
10.5240/F41C-0D39-8C31-5F05-0619-B: Alyas Batman at Robin - 1965-05-25
10.5240/B055-803F-5468-7F61-DAD6-K: Barbecued Batman - 1966
10.5240/2094-C477-6FFE-987D-AC36-Q: Barbecued Batman - 1966-09-22
10.5240/5370-A733-CC28-5562-EFC4-Y: Batman - 1989
10.5240/F57A-623A-614A-8AC0-CA36-Q: Batman - 1989
10.5240/1AE0-9BF2-EF60-189D-86A5-J: Batman - 1989
10.5240/C13A-9E43-8024-8160-7D3C-D: Batman - 1989
10.5240/80E4-C273-7F5C-A38C-B214-B: Batman - 1996
10.5240/819A-E982-D744-AC70-5855-9: Batman - 1966
10.5240/A10D-B7BC-659F-FC37-0D73-O: Batman - 1966
10.5240/1437-598E-954B-5650-AF41-U: Batman - 1997
10.5240/4041-373F-869E-5195-8022-P: Batman - 1989
10.5240/B5D2-F2A1-0395-AF93-A460-H: Batman - 1966-07-30
10.5240/D2F5-4653-C517-A1DF-CC62-D: Batman & Bill - 2017-05-06
10.5240/371E-2369-F084-B3A3-D55B-3: Batman & Mr. Freeze: Subzero - 2015
10.5240/A6F2-5AF9-7F0C-53C9-1220-T: Batman & Mr. Freeze: Subzero - 1998-03-17
..............

Let’s consider a very common use case. We have a list of titles in an excel sheet and want to search the EIDR Ids for each title.

Step 1: Read the excel file with the titles and save as a dataframe.

Create an excel file with a list of titles and save the file as “input_query_titles.xlsx”. Below is a sample titles list. You can create your own.

Let’s open this excel file using pandas and create a dataframe. Here weopen the file input_query_titles.xlsx with sheet_name = Test EIDR Query and we specify our columns and the column types for the dataframe.

df_query_titles = pd.read_excel('input_query_titles.xlsx', 
            sheet_name='Test EIDR Query', 
            dtype={'Title': str, 'Type': str,'ReleaseYear': str,'StructuralType': str})df_query_titles.head()

Here’s the dataframe output -

Step 2: Setup the query

Here we write a function which builds and returns the query. Our query takes 4 parameters and therefore we pass 4 parameters to this function based on the columns we have in the excel file.

For the ReleaseDate, we create two new variables i.e. From Release Date and To Release Date. The reason being, EIDR may have either the Release Year in the EIDR content or a release date. If we just query by the release year and if it has a release date, it may not match. To cover all possibilities, we create the release date condition where it could either match the release year or can have the release date between 01/01/{ReleaseYear} and 12/31/{ReleaseYear)

#Parameters- Title, StructuralType, ReferentType and ReleaseDate
def getQuery(title, sType, rType, relDate ):
    
    #set the from and to dates to query the release dates
    from_reldate = relDate + '-01-01'
    to_reldate = relDate + '-12-31'
    
    #Set the query with placeholders and pass the parameters to build the query
    qry_search_titles = '''
        <Request xmlns="http://www.eidr.org/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
            <Operation>
                <Query>
                    <Expression>
                        (/FullMetadata/BaseObjectData/ResourceName {title}) AND
                        (/FullMetadata/BaseObjectData/StructuralType IS {stype}) AND
                        (/FullMetadata/BaseObjectData/ReferentType IS {rtype}) AND
                        (
                            (/FullMetadata/BaseObjectData/ReleaseDate = {reldate}) OR
                            (
                                /FullMetadata/BaseObjectData/ReleaseDate &gt;= {from_reldate} AND
                                /FullMetadata/BaseObjectData/ReleaseDate &lt;= {to_reldate}
                            )
                            
                        )
                    </Expression>
                    <PageNumber>1</PageNumber>
                    <PageSize>300</PageSize>
                </Query>
            </Operation>
        </Request>
    '''.format(title=title, 
               stype=sType, 
               rtype=rType, 
               reldate=relDate, 
               from_reldate=from_reldate, 
               to_reldate=to_reldate)return qry_search_titles

Step 3: Execute the query and parse the output

Below code does the following -

Loops thru each row in the dataframe, gets the title information and passes it to build the query
Execute the query function to get the results
Parse the result and save the results in the output_list

#create an empty list to store the query results
output_list = []  # holds the values of the requested titles and the results
output = []       # holds the values of the requeted titles#for each row in the dataframe (aka the excel sheet)
for index, row in df_query_titles.iterrows():
    ResourceName = row['Title']
    StructuralType = row['StructuralType']
    ReferentType = row['Type']
    ReleaseDate = row['ReleaseYear']#pass the parameters to build the query
    qry_search_titles = getQuery(ResourceName, StructuralType, ReferentType, ReleaseDate)#save the requested titles
    output = [ResourceName, StructuralType, ReferentType, ReleaseDate]
    
    #execute the query
    eidr_query_resp = postEidrQuery(qry_search_titles, False)
    soup = BeautifulSoup(eidr_query_resp, 'xml')
    
    #check the response token status
    token_status_code = int(soup.Response.Status.Code.contents[0])#If the Status code > 0 means there was an error
    if token_status_code > 0:
        output_list.append(output + [soup.Response.Status.Type.contents[0], 'N/A'])
    else:
        #Check if the Total Matches > 0
        if int(soup.Response.TotalMatches.contents[0]) > 0:
            #check to see if the response has IDs only or Simple XML metadata
            if soup.Response.QueryResults.SimpleMetadata is not None:
                for metadata in soup.Response.QueryResults.findAll('SimpleMetadata'):
                    output_list.append(output + [metadata.ID.contents[0], metadata.ResourceName.contents[0], metadata.ReleaseDate.contents[0]])
            elif soup.Response.QueryResults.ID is not None:
                for id in soup.Response.QueryResults.findAll('ID'):
                    output_list.append(output + [id.contents[0]])
        else:
            output_list.append(output + ['No Match found', 'N/A'])

Step 4. Create a new pandas dataframe using the results list

df_output = pd.DataFrame(output_list, columns = 
              ['Query Title', 'StructuralType', 'Type', 'ReleaseYear', 'EIDR ID', 'EIDR Title', 'EIDR Release Date'])df_output.head()

Here’s the output

Step 5: Once the dataframe is created, you can export it to excel

df_output.to_excel('query_result.xlsx')

Here is the complete python code

import requests, base64, hashlib
import pandas as pd
from bs4 import BeautifulSoupUserID = '10.5238/xxxxxxxx'.         # enter your EIDR User Id
Pwd = '************'                 # enter your EIDR password
PartyID = '10.5237/xxxxxxxxx'        # enter your EIDR party ID
url = 'https://sandbox1.eidr.org:443/EIDR/' # EIDR Registry URL#Encrypt the credentials
PasswordShadow = base64.b64encode(hashlib.md5(pwd.encode('utf-8')).digest()).decode('utf8')auth_str = '%s:%s:%s' % (UserID, PartyID, PasswordShadow)headers = {'Authorization' : 'Eidr {}'.format(auth_str), 'Accept': 'text/xml', 'Content-Type': 'text/xml'}#Execute the EIDR Query
def postEidrQuery(qry, id_only=True):
    if id_only:
        req = url + 'query/?type=id'
    else:
        req = url + 'query/'
        
    resp = requests.post(req, headers=headers, data=qry)    return resp.content
#Build the query
#Parameters- Title, StructuralType, ReferentType and ReleaseDate
def getQuery(title, sType, rType, relDate ):
    
    #set the from and to dates to query the release dates
    from_reldate = relDate + '-01-01'
    to_reldate = relDate + '-12-31'
    
    #Set the query with placeholders and pass the parameters to build the query
    qry_search_titles = '''
        <Request xmlns="http://www.eidr.org/schema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
            <Operation>
                <Query>
                    <Expression>
                        (/FullMetadata/BaseObjectData/ResourceName {title}) AND
                        (/FullMetadata/BaseObjectData/StructuralType IS {stype}) AND
                        (/FullMetadata/BaseObjectData/ReferentType IS {rtype}) AND
                        (
                            (/FullMetadata/BaseObjectData/ReleaseDate = {reldate}) OR
                            (
                                /FullMetadata/BaseObjectData/ReleaseDate &gt;= {from_reldate} AND
                                /FullMetadata/BaseObjectData/ReleaseDate &lt;= {to_reldate}
                            )
                            
                        )
                    </Expression>
                    <PageNumber>1</PageNumber>
                    <PageSize>300</PageSize>
                </Query>
            </Operation>
        </Request>
    '''.format(title=title, 
               stype=sType, 
               rtype=rType, 
               reldate=relDate, 
               from_reldate=from_reldate, 
               to_reldate=to_reldate) return qry_search_titles
#Open the input titles excel file
df_query_titles = pd.read_excel('input_query_titles.xlsx', 
            sheet_name='Test EIDR Query', 
            dtype={'Title': str, 'Type': str,'ReleaseYear': str,'StructuralType': str})#create an empty list to store the query results
output_list = []  # holds the values of the requested titles and the results
output = []       # holds the values of the requeted titles#for each row in the dataframe (aka the excel sheet)
for index, row in df_query_titles.iterrows():
    ResourceName = row['Title']
    StructuralType = row['StructuralType']
    ReferentType = row['Type']
    ReleaseDate = row['ReleaseYear']#pass the parameters to build the query
    qry_search_titles = getQuery(ResourceName, StructuralType, ReferentType, ReleaseDate)#save the requested titles
    output = [ResourceName, StructuralType, ReferentType, ReleaseDate]
    
    #execute the query
    eidr_query_resp = postEidrQuery(qry_search_titles, False)
    soup = BeautifulSoup(eidr_query_resp, 'xml')
    
    #check the response token status
    token_status_code = int(soup.Response.Status.Code.contents[0])#If the Status code > 0 means there was an error
    if token_status_code > 0:
        output_list.append(output + [soup.Response.Status.Type.contents[0], 'N/A'])
    else:
        #Check if the Total Matches > 0
        if int(soup.Response.TotalMatches.contents[0]) > 0:
            #check to see if the response has IDs only or Simple XML metadata
            if soup.Response.QueryResults.SimpleMetadata is not None:
                for metadata in soup.Response.QueryResults.findAll('SimpleMetadata'):
                    output_list.append(output + [metadata.ID.contents[0], metadata.ResourceName.contents[0], metadata.ReleaseDate.contents[0]])
            elif soup.Response.QueryResults.ID is not None:
                for id in soup.Response.QueryResults.findAll('ID'):
                    output_list.append(output + [id.contents[0]])
        else:
            output_list.append(output + ['No Match found', 'N/A'])#Create pandas dataframe using the output_list
df_output = pd.DataFrame(output_list, columns = 
              ['Query Title', 'StructuralType', 'Type', 'ReleaseYear', 'EIDR ID', 'EIDR Title', 'EIDR Release Date'])#export the results to an excel file
df_output.to_excel('query_result.xlsx')

This completes the tutorial on how to use the EIDR Query service to search titles

Also check out:

how to get status for multiple EIDR tokens.

how to get EIDRs from Alternate IDs