Comments on EPrint modifications

Registration - new cgi/registration script

This registration script requests all data, not just the e-mail, username and password. It warns if the username or password contains national characters (browsers and Apache handle them differently, so authentication does not work). Here is a sample registration page.
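A minimal sketch of such a check, assuming that "national characters" means anything outside printable ASCII. The helper name has_national_chars and the sample form values are hypothetical and not part of EPrints:

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string contains any character
# outside the printable ASCII range (space .. tilde).
sub has_national_chars
{
  my( $value ) = @_;
  return ( $value =~ /[^\x20-\x7E]/ ) ? 1 : 0;
}

# Example: warn about the fields that could break authentication.
my %form = ( username => "j\x{f3}zsef", password => "secret42" );
foreach my $field ( sort keys %form )
{
  print "Warning: $field contains national characters\n"
    if has_national_chars( $form{$field} );
}
```

Only the username triggers the warning here, since the sample password is plain ASCII.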

It would be nice to change the confirmation page, too: it should check the credentials (username/password), and then present /users/home with some extra lines (Your registration was successful, etc.).

User credentials

I found the following user classification more helpful than the original one. There are five user types:

templateuser
who is registered but has not yet confirmed the registration. The main difference is that templateuser has its password field required, while the other types cannot.
viewer
who is allowed to subscribe, but not to submit. All registered and confirmed users get automatically into this category.
archive contributor (user)
registered, confirmed, and can submit material to the archive.
editor
the usual editor; editors cannot submit papers either.
archive administrator
who has the right to do anything except submit documents.

The confirmation page, after checking that all data is in order, calls confirm_user($user,$session) defined in ArchiveValidateConfig.pm. This subroutine can then decide, depending on various user data, whether the registered user gets into the viewer or the contributor category:

sub confirm_user
{
  my( $user, $session ) = @_;
  if( $user->get_type() eq "templateuser" || $user->get_type() eq "viewer" )
  {
    $user->set_value("usertype",
      ( $user->get_value("email") =~ /[\.\@]ceu\.hu$/i ) ?
           "user":"viewer" );
    $user->commit();
  }
}
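
The pattern used above accepts addresses ending in "@ceu.hu" or in any subdomain of it. The following stand-alone sketch shows which addresses it lets through; the helper name type_for_email and the sample addresses are made up for illustration only:

```perl
use strict;
use warnings;

# Same pattern as in confirm_user: the address must end in "@ceu.hu"
# or in ".ceu.hu" (any subdomain), case-insensitively.
sub type_for_email
{
  my( $email ) = @_;
  return ( $email =~ /[\.\@]ceu\.hu$/i ) ? "user" : "viewer";
}

foreach my $email ( 'anna@ceu.hu', 'bela@Math.CEU.hu',
                    'eve@notceu.hu', 'joe@ceu.hu.example.org' )
{
  printf "%-24s -> %s\n", $email, type_for_email( $email );
}
```

The first two addresses become "user"; the last two stay "viewer", because the domain must be preceded by a dot or an at sign and must stand at the very end.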

The fields for templateuser, viewer and user are the same, except for the password field, which is required in the first case. For the other types the password field must not be required, as otherwise a personal record update would be accepted only if the password field were also filled in (which it usually is not).

Changing the password during personal record updating is allowed; I hope it will cause no problems later. Archive contributors and editors cannot change their e-mail address: this is a little paranoia, as otherwise anyone could submit material and later vanish into thin air. Of course, the admin can change that address.

Finally all user scripts check whether the user has a valid record. If not (possible only when the admin creates a new user, or changes his/her credentials), the home_fill script is called instead.

Submission - rewritten library module

Submitting a document is not a trivial task; thus it should be made as simple as possible. Users usually have no deep knowledge of the format of their submission; furthermore, Apache serves files not according to the specification (what the file was claimed to be) but according to the extension of the file name (Apache does not even consult the content of the file). Thus the "document_type" field can be -- and should be -- determined automatically, and not by the submitter. This is done by the new archive call get_document_type($main_filename).
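
A minimal sketch of what such a call might look like, assuming the type is derived from the file name extension alone. The extension table and the type names ("ascii", "other", and so on) are assumptions for illustration; a real archive would list its own document types:

```perl
use strict;
use warnings;

# Assumed extension-to-type table; not the archive's actual list.
my %TYPE_BY_EXT = (
  pdf  => "pdf",
  ps   => "ps",
  html => "html",
  htm  => "html",
  txt  => "ascii",
  link => "link",
);

sub get_document_type
{
  my( $main_filename ) = @_;
  # Take the text after the last dot as the extension, as Apache would.
  my( $ext ) = $main_filename =~ /\.([^.\/]+)$/;
  return "other" unless defined $ext;
  return $TYPE_BY_EXT{ lc $ext } || "other";
}

print get_document_type( "paper.PDF" ), "\n";
print get_document_type( "README" ), "\n";
```

The lookup is case-insensitive, and a file name without a known extension falls back to the assumed catch-all type "other".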

Uploading may come from two sources: either from a local file, or from an internet address. Whether the uploaded material should be uncompressed or not is independent of its source, and can be given by a checkbox. The method of uncompressing can be decided locally; requiring the submitter to know the exact method is unnecessary. Thus "uncompress" can be a check box only. I have chosen an even simpler method: the very first file uploaded for a format is uncompressed if necessary, and all the rest are not. This is what the average user might expect; a knowledgeable user could use it to upload any file she wishes.

Certain web addresses can only be used as links (their owners have no problem with the metadata, but they want to keep the file). Thus we have introduced the "link" document type. The link should also be specified in the "url" window, and ticking the "use as link only" box will prevent downloading the specified URL. Internally, links are stored as the contents of a file with extension ".link". When rendering, the content is copied into the href field (in ArchiveRenderConfig.pm):

  my $fmt=$doc->get_value( "format" );
  my $link="";
  if( $fmt eq "link" && open(TMP,$doc->local_path()."/".$doc->get_main() ) )
  {
      $link = <TMP>; chomp $link;
      close TMP;
  } 
  if( $link eq "" )
  {
      $link = $doc->get_url();
  }

The full process of submission has five stages; sample pages are available here

Stage 1 has an alternate form when the document is edited rather than created.

All stages but stage 4 are relatively straightforward. The upload stage splits into two subcases: either the document has (one or more) formats, or it has none. In the former case a table is presented for each format showing the type (determined from the main file), the commentary, a link to the main file (to preview in a separate window), the number of files belonging to that format, and two buttons: "Edit" and "Delete". Below the table there are three action buttons: "Back", "Next", and "Add New Format".

If no format is defined for the eprint, then the upload page is shown instead of the summary page. A similar, but slightly different page appears when the "Add New Format" button is pressed. Both pages let the user choose a local file (via the "Browser" field), enter a URL (text field), specify whether the URL should be downloaded or used as a link, and also fill in the field of extra format commentary. Clicking on "Next" starts the grabbing process and a new format is produced. The successor page is 4.

In the EPrints::SubmissionForm script the code for an uncompress button is commented out. This can be used to instruct the downloader to uncompress the grabbed file (the uncompressing method can be decided locally). At present the very first uploaded file is uncompressed if possible.
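
The "first file only" rule, together with deciding the method locally, can be sketched as follows. The method names "zip", "targz" and "gzip" are the ones used in the modified Document.pm; the helper names and the exact extension patterns are assumptions:

```perl
use strict;
use warnings;

# Guess the uncompress method from the file name; undef means
# "leave the file as it is". The patterns are assumed, not EPrints code.
sub guess_uncompress_method
{
  my( $filename ) = @_;
  return "targz" if $filename =~ /\.(?:tar\.gz|tgz)$/i;
  return "gzip"  if $filename =~ /\.gz$/i;
  return "zip"   if $filename =~ /\.zip$/i;
  return undef;
}

# Only the first file uploaded for a format is a candidate.
sub should_uncompress
{
  my( $filename, $files_already_in_format ) = @_;
  return 0 if $files_already_in_format > 0;
  return defined( guess_uncompress_method( $filename ) ) ? 1 : 0;
}

print should_uncompress( "thesis.tar.gz", 0 ), "\n";   # first file
print should_uncompress( "figures.zip", 3 ), "\n";     # later file
```

The ".tar.gz" test has to come before the plain ".gz" test, otherwise a tarball would be treated as a single gzipped file.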

On the format list, clicking on "Edit" next to a format leads to the edit format page. Here all files belonging to that format are listed, with buttons to delete any individual file, or to make it the "main" file (which, in turn, determines the format's type). The main file cannot be deleted. It is possible to edit the format's commentary, and to upload a new file to the format via a mechanism similar to that for creating a new format. In this case the "link" button is not available. Clicking on "Next" goes to the page of stage 4.

On the "upload-first" and "edit" pages the format description, language and security fields are presented depending on the settings in ArchiveConfig.pm. If the appropriate value is 0, the field is not shown; if the value is 1, it always appears. If the value is 2, it is shown only when the page is edited by an editor. To keep the submission page as simple as possible, only the format description is presented, and the security field is for editors only. Thus only editors can limit the availability of the document.
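
The three-valued setting can be sketched as a small predicate. The helper name field_is_shown and the configuration keys below are hypothetical; only the meaning of the values 0, 1 and 2 is taken from the description above:

```perl
use strict;
use warnings;

# 0 = never shown, 1 = always shown, 2 = shown to editors only.
sub field_is_shown
{
  my( $setting, $is_editor ) = @_;
  return 1 if $setting == 1;
  return ( $is_editor ? 1 : 0 ) if $setting == 2;
  return 0;
}

# Hypothetical configuration matching the setup described above.
my %show = ( format_desc => 1, language => 0, security => 2 );
foreach my $field ( sort keys %show )
{
  printf "%-12s submitter:%d editor:%d\n", $field,
    field_is_shown( $show{$field}, 0 ),
    field_is_shown( $show{$field}, 1 );
}
```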

In our case it has been requested that documents -- whenever possible -- should be converted into pdf. The upload procedures in the EPrints::Document library have been modified so that new files are automatically converted (whenever possible), and the resulting pdf file is made main. Using the format's Edit option this can be undone, and the pdf (or the original) file can be erased if necessary. We found this mechanism quite satisfactory, as postscript files regularly shrank by more than 50%.

The submission pages have much more help information than before. Also, several new pins were introduced to refer to the document under submission. At stages 2 and 3 the document type is available; at stage 4 the standard one-line document rendering is available.

Restricting Document Availability

Availability of documents now can be restricted. In "metadata-types.xml" the "security" dataset has two entries only: "locally" and "staffonly". As security is not required any more (modified in EPrints::Documents), undefined value means unlimited access. In ArchiveConfig.pm the subroutine can_user_view_document was modified a little: the new trivial case

return( 1 ) if( !defined $security );
was added, and the following text was inserted just before the last line:
  if( $security eq "locally" )
   {
      return $user->has_priv( "local_view_docs" ) ? 1 : 0 ;
   }
Those users who can view such documents also have the "local_view_docs" privilege in the userauth table a little above. As a local e-mail address automatically ensures the archive contributor type, archive contributors, editors, and admin are granted this privilege. This ensures local view only.

Setting the security to "staffonly" makes it available to editors and admin only. This feature can be used to hide the document rather than deleting it from the archive.
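
The resulting decision logic can be sketched in a self-contained form. The SketchUser class is only a stand-in for the EPrints user object, can_view is a hypothetical simplification of can_user_view_document, and the "editor" privilege name used for the "staffonly" case is an assumption:

```perl
use strict;
use warnings;

# Stand-in for the EPrints user object: only has_priv() is modelled.
package SketchUser;

sub new
{
  my( $class, @privs ) = @_;
  my %self = map { ( $_ => 1 ) } @privs;
  return bless \%self, $class;
}

sub has_priv
{
  my( $self, $priv ) = @_;
  return $self->{$priv} ? 1 : 0;
}

package main;

sub can_view
{
  my( $security, $user ) = @_;
  return 1 if !defined $security;               # undefined: unlimited access
  return $user->has_priv( "local_view_docs" ) if $security eq "locally";
  return $user->has_priv( "editor" ) if $security eq "staffonly";  # assumed priv name
  return 0;                                     # unknown value: deny
}

my $viewer      = SketchUser->new();
my $contributor = SketchUser->new( "local_view_docs" );
print can_view( "locally", $viewer ), can_view( "locally", $contributor ), "\n";
```

Denying access for unknown security values is a design choice of this sketch: a misspelt entry then hides a document rather than exposing it.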

Restricting document availability is reserved to editors (and admins) only, mainly because introducing a new box into the submission page made it confusing. Editors, however, can restrict the availability of the document at any time by re-editing it.

In the ArchiveRenderConfig.pm file those documents which have "staffonly" security are not listed at all. Those which have restricted availability are marked by a locked icon, thus warning the casual viewer.

SearchExpression

The search pages are rendered in a table format. The first column contains the name and help separated by a break, and the second column contains the input field. The name and help are formed similarly to other entries, and are not put together. This makes it possible to give separate help for fields of the same type (for example, two text fields might require quite different help); also, the name of multiple fields can be anything. In our case one field contains words from a relatively large list. The help info contains a button which presents all the words in a separate window to copy and paste from. This is achieved by the following entries in the phrase-en.xml file:

<ep:phrase ref="eprint_searchname_country">Country</ep:phrase>
<ep:phrase ref="eprint_searchhelp_country">Click on 
&amp;&amp;&lt;input type="button" 
value="List of countries"
onclick="open('&amp;&amp;&base_url;/help/countries.html&amp;&amp;',
'_blank','status=no,toolbar=no,resizable=yes,scrollbars=yes')"&gt;&amp;&amp;
to get a list to copy &amp; paste.</ep:phrase>
(See also the next section on escaping).

If a field is too wide, the second column would dominate the whole table. When the field has the (new) property one_column then it is rendered to occupy both columns.

Escaping XML

Material in the phrases.xml file finds its way into the final web page via two different mechanisms. The first one is used for big chunks of data, and is processed by the html_phrase() procedure. The material is parsed by an XML parser (that's why it should be properly formatted), and then it is copied formatted. Tags between < and > appear with no modification; the text between them is escaped, for example quotation marks are replaced by "&quot;". For example,

<p   align = "center"> "Text1"</p>
  <em> Emphasized </em>
  "Text2"
becomes
<p align="center"> &quot;Text1&quot;</p> <em> Emphasized </em>&quot;Text2&quot;
Observe that spaces collapsed and the line breaks disappeared.

The second method, performed by the phrase() procedure, evaluates the html tags, and then the resulting text is escaped. This means that in the above example the <p> introduces a new line, and <em> disappears as no emphasized text is available in ascii. In this case the above becomes

                      &quot;Text1&quot;
Emphasized &quot;Text2&quot;
in two lines; the first line is centered in a line of 80 characters.

As the name and help given for fields are processed by the second method, it is impossible to make certain words italic or bold there. To overcome this difficulty, the hack in the XML/DOM.pm library file can be used. Text between two double ampersands (&&) is not interpreted, and goes to the final text "as is". For example, to get "Text1 <em>Emphasized</em> Text2" in a help, the phrase should be

Text1 &amp;&amp;&lt;em&gt;&amp;&amp;Emphasized&amp;&amp;&lt;/em&gt;&amp;&amp; Text2
If you want a quotation mark " to appear in the final text, it should also be surrounded by double ampersand signs:
&amp;&amp;"&amp;&amp;
You get the same result if the quotation mark is replaced by "&quot;" as first the text is parsed.
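
The effect of the escape mechanism can be sketched as below. This is not the code of the XML/DOM.pm hack: escape_xml and encode_text are hypothetical helpers that follow the rules described in these notes (text between double ampersands is copied verbatim, "&&&" yields a single ampersand, everything else is escaped). Emitting that single ampersand unescaped is an assumption. The input strings are the phrase text as it looks after the XML parser has already resolved &amp; and &lt;:

```perl
use strict;
use warnings;

# Ordinary XML escaping for text outside the verbatim regions.
sub escape_xml
{
  my( $s ) = @_;
  $s =~ s/&/&amp;/g;
  $s =~ s/</&lt;/g;
  $s =~ s/>/&gt;/g;
  $s =~ s/"/&quot;/g;
  return $s;
}

# "&&&" yields one ampersand, "&&...&&" is copied verbatim,
# everything else goes through escape_xml().
sub encode_text
{
  my( $text ) = @_;
  my $out = "";
  while( $text =~ /\G(?:(&&&)|&&(.*?)&&|([^&]+|&))/gs )
  {
    if( defined $1 )    { $out .= "&"; }             # assumed: left unescaped
    elsif( defined $2 ) { $out .= $2; }              # verbatim region
    else                { $out .= escape_xml( $3 ); }
  }
  return $out;
}

print encode_text( 'Text1 &&<em>&&Emphasized&&</em>&& Text2' ), "\n";
print encode_text( 'a "quoted" word and &&"&& a raw one' ), "\n";
```

The first call prints Text1 <em>Emphasized</em> Text2; in the second, the plain quotation marks come out as &quot; while the one wrapped in double ampersands survives as a literal ".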

In the citation file this escaping comes in handy. When rendering a conference paper, the authors are typeset in bold face, the proceedings title in italics, and the volume number in bold face as well. This is achieved by the following extract:

<ep:citation type="eprint_confpaper"><span class="citation">
&amp;&amp;&lt;b&gt;&amp;&amp;@authors@&amp;&amp;&lt;/b&gt;&amp;&amp; (@year@):
<ep:linkhere>@title@</ep:linkhere>, in: 
<ep:ifset name="editors">@editors@, Eds. </ep:ifset>
<ep:ifset name="conference">&amp;&amp;&lt;i&gt;&amp;&amp;Proceedings of @conference@&amp;&amp;&lt;/i&gt;&amp;&amp;</ep:ifset>
<ep:ifset name="volume">Vol &amp;&amp;&lt;b&gt;&amp;&amp;@volume@&amp;&amp;&lt;/b&gt;&amp;&amp;</ep:ifset>
<ep:ifset name="number">(@number@)</ep:ifset>
<ep:ifset name="pages"> pp. @pages@</ep:ifset><ep:ifset name="confloc">, @confloc@</ep:ifset></span></ep:citation>

Context sensitive help

Each page generated by EPrints has a unique identifier, the pageid. It is used for different page hooks. It can also be used for a context sensitive help system. Adding this identifier to the URL of the help page makes it possible for every page to have different help. This is done by introducing a new pin when generating pages: <ep:pin ref="help" />. In the "template-en.xml" file the reference to the help page should be changed to

<a target="help" href="&base_url;/help/index.html#<ep:pin ref="help" />">HELP</a>
We expect this to come in the final HTML file as
<a target="help" href="/html/archive/en/help/index.html#submission">HELP</a>
Clicking on "HELP" positions the given page at the "submission" label in the separate designated window. If that window is not open yet, then a new window is opened. Unfortunately there are problems with the above line. First, the window will be a full-fledged viewer; it would be nicer not to have the upper lines, and we could also limit the size of the window. The second and bigger problem is that the XML parser does not allow embedded XML tags. The first problem can be solved by using a small JavaScript function, the second by using the escape mechanism. The following line is inserted into the <head> part of the page:
<script language="JavaScript">&&<!--&&
  function ow(win,prm){window.open(&&"&&&base_url;/help/index#<ep:pin ref="help" />&&"&&,win,prm,null);}
  &&//-->&&
  </script>
We quoted the start and end of the comment; also the quotation marks, as otherwise they are replaced by the string &quot;. The "HELP" reference then is the following:
<a href="javascript:ow('HELP','menubar=no,location=no,scrollbars=yes,resizable=yes,width=620,height=470')">HELP</a>
which works.

To differentiate between statically generated pages, the utility generate_static now calls build_page() with pageid as the last part of the file to be generated, and not as "static".

Importing XML

The import utility tries to import a full eprint database. The metadata is taken from the exported archive and document databases; the files are taken from the copy of the disk00/ filesystem. This makes it possible to have a backup independently of mysql.

Minor modifications/corrections to the perl_lib

Database.pm/919:
I prefer counters starting from 1000, not from 1. It looks silly to be user #1; being user #1000 is somewhat better.
Document.pm/144:
The "security" field is not required. Two new fields: "upload_plain" and "upload_graburl" are introduced to ease the generation of submission pages. The type of the former one is "file" which is rendered in MetaField.pm.
Document.pm/900:
After upload check that we have indeed received something. If the resulting file has length 0, then return with failure. This takes care of garbage filenames, or broken lines.
Document.pm/922:
Following the recommendations from eprints technical list, file names are handled not by fileparse(), but by hand.
Document.pm/1033:
The new "upload_link" subroutine stores the argument as an URL in a unique filename, to be used later as a link reference.
Document.pm/1198:
New procedures: one to uncompress the uploaded file, and the other to convert it to pdf. The first routine calls "zip", "targz" or "gzip" as guessed from the filename extension; the second routine calls "doctopdf". The latter is an (external) perl script which decides whether its only argument can be converted to pdf or not; and if yes, it does the conversion.
EPrint.pm/758:
The check whether a required field is filled is broken. If the field has property multiple then the returned value is not "undef" but an array of length zero.
MetaField.pm/971:
Even if a field is multiple, we must be able to unset it.
MetaField.pm/1000:
The "More Spaces" button is replaced by three buttons: the first is the default, the second adds 5 more entries, and the third one adds 10 more entries. Maybe this should also be configurable.
MetaField.pm/1577:
The "file" type is handled here; it is rendered as an input field of type "file" for uploading files.
MetaField.pm/1800:
Any value entered into a field of a type similar to "text" should not start or end with whitespace. Formerly, if the field had the property textarea (undocumented) then intervening white spaces were replaced by a single space. Now it is done for all fields except for longtext.
MetaField.pm/1853:
Same thing for names: multiple white spaces are replaced by a single space, white spaces at the front and the end are deleted.
SearchExpression.pm/142:
Following a recommendation in the eprints technical list, if order is not defined, take the value from the system default.
SearchExpression.pm/350-400:
Render search as a table in two columns. Comments should go after a break and not after inserting a space.
SearchExpression.pm/576:
The search field may be empty not only if it is not set. Use EPrints::Utils::is_set() for checking.
SearchExpression.pm/1114,1306:
There is no way to disable the Show and/Show all buttons for the simple search. Introducing the "$showall" parameter for "process_webpage()" does just this.
SearchField.pm/138:
Search fields should have the same naming conventions as field names. Namely, the name and help should not come from some general place; rather they should be local to the search field. The changes introduced here do exactly this.
SearchField.pm/947,1017:
Now as search fields are rendered in a table format, fields look nicer if the pull down menus are below the field, and not next to them. Spaces are replaced by breaks here.
SearchField.pm/1210:
Reference to display_name is replaced by the appropriate function call (as it has a different value)
SearchField.pm/1235,1269:
Procedures get_help() and get_display_name() now return a value produced similarly as for field names, only "field" is replaced by "search".
Session.pm/1183:
Reference to the text "continue" is replaced by an html_phrase() call.
Session.pm/1358:
A special "separator" field is rendered here. It can be used to insert special fields in the middle of a single search. The type should be "separator", and the "render_value" field is inserted here.
Session.pm/1397:
The missing comment field is rendered here. (Used in the succession/commentary stage at submission).
Session.pm/1459:
A hack to generate a context sensitive help system: when generating a page, the pageid is supplied to the generating procedure using the "help" pin.
Session.pm/1590:
All generated pages must start with an appropriate DOCTYPE header. It is included here verbatim. It can be achieved through DOM, but I don't know how.
SubmissionForm.pm
A lot of changes were made here, as the whole submission process has been rearranged. Several pins are inserted to give better information to the submitter.
UserForm.pm
The user form now looks like this: at the top the values are rendered, and below that a form is given where data can be edited. The procedure _update_from_form() now returns a list of problems instead of yes/no. The fields "email" and "username" are not presented, as they cannot be edited by the user directly. Also, a new "Cancel" button is presented next to the "Update" button. Values are committed only when no problems were found.
User.pm/112:
Both the "username" and "password" fields default to 20 characters.
XML.pm/598:
Recommendation from the eprints listserv: when printing out a page now it is done with a nice DOCTYPE header.
XML/DOM.pm/281
A hack which was requested by several people, also commented in the source. The encodeText() procedure does the XML conversion, and here the following escape mechanism is interpreted: three ampersand signs (&&&) are replaced by a single one; moreover, text enclosed between two double ampersand signs (such as &&<a>&&) is copied verbatim and not quoted. Be careful, as UTF-8 conversion does not take place in those places.
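
Two of the corrections in the list above, the required-field check for multiple fields (EPrint.pm/758) and the whitespace clean-up (MetaField.pm/1800), can be sketched in isolation. The helper names value_is_set and normalize_whitespace are hypothetical and do not reproduce the actual EPrints code:

```perl
use strict;
use warnings;

# A value counts as unset if it is undef, an empty string, or
# (for multiple fields) a reference to an empty array.
sub value_is_set
{
  my( $value ) = @_;
  return 0 if !defined $value;
  return ( scalar( @{$value} ) > 0 ? 1 : 0 ) if ref( $value ) eq "ARRAY";
  return ( $value ne "" ) ? 1 : 0;
}

# Trim both ends and collapse runs of whitespace into a single space.
sub normalize_whitespace
{
  my( $value ) = @_;
  $value =~ s/^\s+//;
  $value =~ s/\s+$//;
  $value =~ s/\s+/ /g;
  return $value;
}

print value_is_set( [] ), "\n";                            # empty multiple field
print "[", normalize_whitespace( "  Central \t European  " ), "]\n";
```

Testing only for undef is what breaks the required-field check: an empty array reference is defined, so a multiple field with no entries would pass.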

Minor modifications/corrections to the cgi library

in general:
Replace the "general:userhome_link" reference by a configurable "library_name:tail" reference.
confirm:
Do not associate the e-mail address with the user. Many users may share the same e-mail. After confirming the user, an archive call is made that can change the user's type depending on the e-mail address (and other data) confirmed.
search:
The call to process_webpage() is supplied with the extra parameter to suppress the "all/any" button.
set_password:
Direct link to the new "register" page.
users/home:
Invalid users are sent directly to the users/home_fill page. Deposits in the submission buffer now can be edited by the submitter; references are now rendered as links.
users/record:
Direct reference to the "home" directory is replaced by one in the archive configuration.
users/status:
A link is included at the end to the users' home directory.
users/subscribe:
A link is included to the users' home directory.
users/staff/edit_eprint:
Users who submitted an eprint can edit their submission in the submission buffer. However, only two buttons are presented: "clone" and "edit".