http://www.dotnetmafia.com/blogs/dotnettipoftheday/archive/2014/04/08/how-to-use-the-sharepoint-2013-content-enrichment-web-service.aspx
How to: Use the SharePoint 2013 Content Enrichment Web Service
The Content Enrichment Web Service (CEWS) allows you to extend the functionality of SharePoint 2013 Search. Using CEWS, a developer can send the values of managed properties to an external web service and return new or modified managed properties to include in the index. The process involves implementing a custom WCF service and then registering it with PowerShell. The PowerShell cmdlet specifies which properties go into and out of the service.
This post has been cross-posted to MSDN Code where you can download a working sample and deploy it.
This example will take the values of the Author and LastModifiedTime managed properties and write a new string such as "Modified by <author> on <LastModifiedTime>." to the managed property TestProperty. This property must be created before you try to use your Content Enrichment Web Service. Configure it as type Text with the following attributes: Query, Search, Retrieve, and Refine.
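If you prefer scripting the property creation over clicking through the Search Schema page, a sketch like the one below can do it. This is an assumption-laden example, not from the original post: the cmdlet name is real, -Type 1 corresponds to Text, and the flag property names are taken from the Search Schema attribute names, so verify them against your farm before relying on this.

$ssa = Get-SPEnterpriseSearchServiceApplication
# Type 1 = Text
$mp = New-SPEnterpriseSearchMetadataManagedProperty -SearchApplication $ssa -Name "TestProperty" -Type 1
$mp.Queryable = $true
$mp.Searchable = $true
$mp.Retrievable = $true
$mp.Refinable = $true
$mp.Update()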
To get started, create a new WCF Service Project called ContentEnrichmentExampleService.
Once the project is created, you can delete the default service files Service1.svc and IService.cs, as they won't be needed.
Next, you will need to add a reference to the following assembly.
- microsoft.office.server.search.contentprocessingenrichment.dll
Now, we need to create the service to do the content enrichment processing. Create a new service called ContentEnrichmentExampleService.svc.
Delete the file IContentEnrichmentExampleService.cs as it will not be needed. The custom service instead inherits from IContentProcessingEnrichmentService.
Now we can start adding our code to ContentEnrichmentExampleService.svc.cs. This code will retrieve the values from the input properties, create our new output property TestProperty, and send it back to the search index.
Start by adding using statements for the assembly we referenced.
using Microsoft.Office.Server.Search.ContentProcessingEnrichment;
using Microsoft.Office.Server.Search.ContentProcessingEnrichment.PropertyTypes;
The interface that the class inherits from will be shown as broken since you deleted it. Change the class to inherit from IContentProcessingEnrichmentService instead.
public class ContentEnrichmentExampleService : IContentProcessingEnrichmentService
Add a ProcessedItem collection to hold the output managed property values from the service.
private readonly ProcessedItem processedItemHolder = new ProcessedItem
{
ItemProperties = new List<AbstractProperty>()
};
Then, implement the ProcessItem method. This method receives the input managed properties and allows you to write code to generate the output managed properties.
public ProcessedItem ProcessItem(Item item)
{
}
Inside the ProcessItem method, initialize the ErrorCode and ItemProperties.
processedItemHolder.ErrorCode = 0;
processedItemHolder.ItemProperties.Clear();
We then need to create a new output managed property named TestProperty. The Property object is typed according to the type of the managed property you defined.
var testProperty = new Property<string>();
testProperty.Name = "TestProperty";
Now we are going to retrieve the input managed properties using a simple lambda expression. Remember that property names are case sensitive and need to match exactly how they appear on the Search Schema page. You also need to cast each object to the appropriate type. Since the Author managed property is a multi-valued property, we use Property<List<string>>. LastModifiedTime is a date, so we use a Property<DateTime>.
var authorProperty = item.ItemProperties.FirstOrDefault(i => i.Name == "Author") as Property<List<string>>;
var writeProperty = item.ItemProperties.FirstOrDefault(i => i.Name == "LastModifiedTime") as Property<DateTime>;
Now, we need to verify that the properties aren't null.
if ((authorProperty != null) && (writeProperty != null))
{
}
We are then going to write out a new string to TestProperty in the format "Modified by {Author} on {LastModifiedTime}." Since Author supports multiple values, only the first value is used. This string goes in the Value property. Once we set the value, we have to add it to processedItemHolder so that the values can be sent back to the search index.
testProperty.Value = string.Format("Modified by {0} on {1}.", authorProperty.Value.First(), writeProperty.Value);
processedItemHolder.ItemProperties.Add(testProperty);
Return the processedItemHolder.
return processedItemHolder;
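If you want to sanity-check your file against a complete listing, the snippets above assemble into roughly the following class (the namespace is assumed from the project name; the full working version is in the MSDN Code download):

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Office.Server.Search.ContentProcessingEnrichment;
using Microsoft.Office.Server.Search.ContentProcessingEnrichment.PropertyTypes;

namespace ContentEnrichmentExampleService
{
    public class ContentEnrichmentExampleService : IContentProcessingEnrichmentService
    {
        // Holds the output managed property values returned to the search index
        private readonly ProcessedItem processedItemHolder = new ProcessedItem
        {
            ItemProperties = new List<AbstractProperty>()
        };

        public ProcessedItem ProcessItem(Item item)
        {
            processedItemHolder.ErrorCode = 0;
            processedItemHolder.ItemProperties.Clear();

            var testProperty = new Property<string> { Name = "TestProperty" };

            // Property names are case sensitive and must match the Search Schema
            var authorProperty = item.ItemProperties.FirstOrDefault(i => i.Name == "Author") as Property<List<string>>;
            var writeProperty = item.ItemProperties.FirstOrDefault(i => i.Name == "LastModifiedTime") as Property<DateTime>;

            if ((authorProperty != null) && (writeProperty != null))
            {
                testProperty.Value = string.Format("Modified by {0} on {1}.",
                    authorProperty.Value.First(), writeProperty.Value);
                processedItemHolder.ItemProperties.Add(testProperty);
            }

            return processedItemHolder;
        }
    }
}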
At this point, we can run and debug our service using F5. Leave the service running, as it will be called when doing a full crawl.

To register the service with SharePoint, we use the New-SPEnterpriseSearchContentEnrichmentConfiguration cmdlet. Use the following PowerShell script to register the Content Enrichment Web Service. Verify that the Endpoint parameter contains the correct URL to your service. The example below has the location used in the source code I provided. If you start from scratch or you have deployed your service to a remote server, then you will need to update the address.
$ssa = Get-SPEnterpriseSearchServiceApplication
$config = New-SPEnterpriseSearchContentEnrichmentConfiguration
$config.Endpoint = "http://localhost:54641/ContentEnrichmentExampleService.svc"
$config.InputProperties = "Author", "LastModifiedTime"
$config.OutputProperties = "TestProperty"
$config.SendRawData = $false
$config.Timeout = 30000
$config
Set-SPEnterpriseSearchContentEnrichmentConfiguration –SearchApplication $ssa –ContentEnrichmentConfiguration $config

The InputProperties parameter specifies the managed properties sent to the service. The OutputProperties parameter specifies the managed properties returned by the service. Note that both are case sensitive. All managed properties referenced need to be created in advance. Set the Timeout property higher to give yourself sufficient time to debug. For a complete reference on parameters, see the MSDN reference.
After registering your content enrichment service, start a full crawl. Again, ensure that your Content Enrichment Web Service is running in the debugger. While it is crawling, you can set breakpoints as desired.
To verify the functionality after the crawl is complete, issue a query using REST in the browser like the one below.
http://server/_api/search/query?querytext='*'&selectproperties='title,path,author,testproperty'
This query will return every item in the index and include the new TestProperty field. You can verify that the new property was included and has the expected result as shown in the example below.
I hope this gets you started with Content Enrichment Web Services. I have a few follow-up posts planned covering more of the PowerShell parameters, but I hope this helps.
Again, you can find the complete source code and PowerShell script on MSDN Code. Feel free to leave me a comment if you run into an issue or have a question.
Advanced Content Enrichment in SharePoint 2013 Search
19 Jun 2013 11:55 AM
Microsoft re-engineered the search experience in SharePoint 2013 to take advantage of the best capabilities from FAST plus many new capabilities built from the ground up. Although much has been said about the query side changes of search (result sources, query rules, content by search web part, display templates, etc), the feed side of search got similar love from Redmond. In this post I’ll discuss a concept carried over from FAST that allows crawled content to be manually massaged before getting added to the search index. Several basic examples of this capability exist, so I’ll throw some advanced solution challenges at it. The solution adds a sentiment analysis score to indexed social activity as is outlined in the video below.
Below is the callout code in its entirety. Note that I leveraged Entity Framework for connecting to my enrichment queue database (the ContentEnrichmentEntities class below):
The content enrichment web service (CEWS) callout is a component of the content processing pipeline that enables organizations to augment content before it is added to the search index. CEWS can be any external SOAP-based web service that implements the IContentProcessingEnrichmentService interface. SharePoint can be configured to call CEWS with specific managed properties and (optionally) the raw binary file. CEWS can update existing managed property values and/or add completely new managed properties. The outputs of this enrichment service get merged into content before it is added to the search index. The CEWS callout can be used for numerous data cleansing, entity extraction, classification, and tagging scenarios such as:
- Perform sentiment analysis on social activity and augment activity with a sentiment score
- Translate a field or folder structure to a taxonomy term in the managed metadata service
- Derive an item property based on one or more other properties
- Perform lookups against line of business data and tag items with that data
- Parse the raw binary file for more advanced entity extraction
The solution outlined in this post addresses both of these challenges. It will deliver an asynchronous CEWS callout and a process for marking an indexed item as dirty so it can be re-crawled without touching/updating the actual item. The entire solution has three primary components…a content enrichment web service, a custom SharePoint timer job for marking content in the crawl log for re-crawl, and a database to queue asynchronous results that other components can reference.
[Figure: High-level architecture of the async CEWS solution]
Enrichment Queue (SQL Database)
Because of the asynchronous nature of the solution, operations will be running on different threads, some of which could be long running. In order to persist information between threads, I leveraged a single-table SQL database to queue asynchronously processed items. Here is the schema and description of that database table.

Column | Description
---|---
Id | integer identity column that serves as the unique id of the rows in the database
ItemPath | the absolute path to the item as provided by the crawler and crawl logs
ManagedProperty | the managed property that gets its value from an asynchronous operation
DataType | the data type of the managed property so we can cast the value correctly
CrawlDate | the date the item was sent through CEWS that serves as a crawl timestamp
Value | the value derived from the asynchronous operation
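For reference, a table matching this schema could be created with T-SQL along the lines below. This is a sketch inferred from the column descriptions above, not the author's actual script; the table name follows the EnrichmentAsyncData entity set used in the code, and the column sizes are assumptions.

CREATE TABLE dbo.EnrichmentAsyncData
(
    Id              int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    ItemPath        nvarchar(2048)    NOT NULL,  -- absolute path from the crawler/crawl logs
    ManagedProperty nvarchar(256)     NOT NULL,  -- managed property populated asynchronously
    DataType        nvarchar(128)     NOT NULL,  -- .NET type name used to cast the value
    CrawlDate       datetime          NOT NULL,  -- crawl timestamp when the item passed through CEWS
    Value           nvarchar(max)     NULL       -- result of the asynchronous operation
);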
Content Enrichment Web Service
As mentioned at the beginning of the post, the content enrichment web service callout is implemented by creating a web service that references the IContentProcessingEnrichmentService interface. There are a number of great examples of this online, including MSDN. Instead, this post will focus on calling asynchronous operations from this callout. The main objective of making the CEWS callout asynchronous is to prevent the negative impact a long running process could have on crawling content. The best way to do this in CEWS is to collect all the information we need in the callout, pass the information to a long running process queue, update any items that have values ready from the queue, and then release the callout thread (before the long running process completes).

[Figure: Process diagram of async CEWS]
Content Enrichment Web Service
using Microsoft.Office.Server.Search.Administration;
using Microsoft.Office.Server.Search.ContentProcessingEnrichment;
using Microsoft.Office.Server.Search.ContentProcessingEnrichment.PropertyTypes;
using Microsoft.SharePoint;
using System;
using System.Collections.Generic;
using System.Configuration;
using System.IO;
using System.Linq;
using System.Net;
using System.Runtime.Serialization;
using System.ServiceModel;
using System.Text;
using System.Threading;

namespace ContentEnrichmentServices
{
    public class Service1 : IContentProcessingEnrichmentService
    {
        private const int UNEXPECTED_ERROR = 2;

        private readonly ProcessedItem processedItem = new ProcessedItem()
        {
            ItemProperties = new List<AbstractProperty>()
        };

        public ProcessedItem ProcessItem(Item item)
        {
            //initialize the processedItem
            processedItem.ErrorCode = 0;
            processedItem.ItemProperties.Clear();
            try
            {
                //only process items where ContentType:Item
                var ct = item.ItemProperties.FirstOrDefault(i => i.Name.Equals("ContentType", StringComparison.Ordinal));
                if (ct != null && ct.ObjectValue.ToString().Equals("Item", StringComparison.CurrentCultureIgnoreCase))
                {
                    //get path and use database to process async enrichment data
                    var path = item.ItemProperties.FirstOrDefault(i => i.Name.Equals("Path", StringComparison.Ordinal));
                    var title = item.ItemProperties.FirstOrDefault(i => i.Name.Equals("Title", StringComparison.Ordinal));
                    var sentiment = item.ItemProperties.FirstOrDefault(i => i.Name.Equals("Sentiment", StringComparison.Ordinal));
                    if (path != null && title != null)
                    {
                        using (ContentEnrichmentEntities entities = new ContentEnrichmentEntities(ConfigurationManager.ConnectionStrings["ContentEnrichmentEntities"].ConnectionString))
                        {
                            //try to get the item from the database
                            string pathValue = path.ObjectValue.ToString();
                            var asyncItem = entities.EnrichmentAsyncData.FirstOrDefault(i => i.ItemPath.Equals(pathValue, StringComparison.CurrentCultureIgnoreCase));
                            if (asyncItem != null && !String.IsNullOrEmpty(asyncItem.Value))
                            {
                                //add the property to processedItem
                                Property<decimal> sentimentProperty = new Property<decimal>()
                                {
                                    Name = asyncItem.ManagedProperty,
                                    Value = Convert.ToDecimal(asyncItem.Value)
                                };
                                processedItem.ItemProperties.Add(sentimentProperty);

                                //delete the async item from the database
                                entities.EnrichmentAsyncData.DeleteObject(asyncItem);
                            }
                            else
                            {
                                if (sentiment != null && sentiment.ObjectValue != null)
                                    processedItem.ItemProperties.Add(sentiment);
                                if (asyncItem == null)
                                {
                                    //add to database
                                    EnrichmentAsyncData newAsyncItem = new EnrichmentAsyncData()
                                    {
                                        ManagedProperty = "Sentiment",
                                        DataType = "System.Decimal",
                                        ItemPath = path.ObjectValue.ToString(),
                                        CrawlDate = DateTime.Now.ToUniversalTime()
                                    };
                                    entities.EnrichmentAsyncData.AddObject(newAsyncItem);

                                    //Start a new thread for this async operation
                                    Thread thread = new Thread(GetSentiment);
                                    thread.Name = "Async - " + path;
                                    var data = new AsyncData()
                                    {
                                        Path = path.ObjectValue.ToString(),
                                        Data = title.ObjectValue.ToString()
                                    };
                                    thread.Start(data);
                                }
                            }

                            //save the changes
                            entities.SaveChanges();
                        }
                    }
                }
            }
            catch (Exception)
            {
                processedItem.ErrorCode = UNEXPECTED_ERROR;
            }
            return processedItem;
        }

        /// <summary>
        /// Called on a separate thread to perform sentiment analysis on text
        /// </summary>
        /// <param name="data">object containing the crawl path and text to analyze</param>
        public static void GetSentiment(object data)
        {
            AsyncData asyncData = (AsyncData)data;
            HttpWebRequest myRequest = (HttpWebRequest)HttpWebRequest.Create("http://text-processing.com/api/sentiment/");
            myRequest.Method = "POST";
            string text = "text=" + asyncData.Data;
            byte[] bytes = Encoding.UTF8.GetBytes(text);
            myRequest.ContentLength = bytes.Length;
            using (Stream requestStream = myRequest.GetRequestStream())
            {
                requestStream.Write(bytes, 0, bytes.Length);
                requestStream.Flush();
                requestStream.Close();
                using (WebResponse response = myRequest.GetResponse())
                {
                    using (StreamReader reader = new StreamReader(response.GetResponseStream()))
                    {
                        string result = reader.ReadToEnd();
                        using (ContentEnrichmentEntities entities = new ContentEnrichmentEntities(ConfigurationManager.ConnectionStrings["ContentEnrichmentEntities"].ConnectionString))
                        {
                            //try to get the item from the database
                            var asyncItem = entities.EnrichmentAsyncData.FirstOrDefault(i => i.ItemPath.Equals(asyncData.Path, StringComparison.CurrentCultureIgnoreCase));
                            if (asyncItem != null && String.IsNullOrEmpty(asyncItem.Value))
                            {
                                //calculate sentiment from result
                                string neg = result.Substring(result.IndexOf("\"neg\": ") + 7);
                                neg = neg.Substring(0, neg.IndexOf(','));
                                string pos = result.Substring(result.IndexOf("\"pos\": ") + 7);
                                pos = pos.Substring(0, pos.IndexOf('}'));
                                decimal negD = Convert.ToDecimal(neg);
                                decimal posD = Convert.ToDecimal(pos);
                                decimal sentiment = 5 + (-5 * negD) + (5 * posD);
                                asyncItem.Value = sentiment.ToString();
                                entities.SaveChanges();
                            }
                        }
                    }
                }
            }
        }
    }

    public class AsyncData
    {
        public string Path { get; set; }
        public string Data { get; set; }
    }
}
The content enrichment web service is associated with a search service application using Windows PowerShell. The configuration of this service has a lot of flexibility around the managed properties going in and out of CEWS and the criteria for triggering the callout. In my example the trigger is empty, indicating all items going through CEWS:
PowerShell to Register CEWS
$ssa = Get-SPEnterpriseSearchServiceApplication
Remove-SPEnterpriseSearchContentEnrichmentConfiguration –SearchApplication $ssa
$config = New-SPEnterpriseSearchContentEnrichmentConfiguration
$config.Endpoint = "http://localhost:8888/Service1.svc"
$config.DebugMode = $false
$config.SendRawData = $false
$config.InputProperties = "Path", "ContentType", "Title", "Sentiment"
$config.OutputProperties = "Sentiment"
Set-SPEnterpriseSearchContentEnrichmentConfiguration –SearchApplication $ssa –ContentEnrichmentConfiguration $config
Timer Job (Force Re-Crawl)
The biggest challenge with an asynchronous enrichment approach is updating the index after the CEWS thread is released. No API exists to directly update items in the search index, so CEWS is the last opportunity to augment an item before it becomes available to users executing queries. The best we can do is kick off an asynchronous thread that can queue enrichment data for the next crawl. Marking individual items for re-crawl is a critical component of the solution, because "the next crawl" will only crawl items if a full crawl occurs or if the search connector believes the source items have updated (which could be never). The crawl log in Central Administration provides a mechanism to mark individual indexed items for re-crawl.

[Figure: CrawlLogURLExplorer.aspx option to re-crawl an item]
I decompiled the CrawlLogURLExplorer.aspx page and was pleased to find it leveraged a Microsoft.Office.Server.Search.Administration.CrawlLog class with a public RecrawlDocument method to re-crawl items by path. This API basically updates an item in the crawl log so it looks like an error to the crawler, and it is thus picked up in the next incremental/continuous crawl.
So why a custom SharePoint timer job? An item may not yet be represented in the crawl log when our asynchronous thread completes (especially for new items). Calling RecrawlDocument on a path that does not exist in the crawl log would do nothing. The timer job allows us to mark items for re-crawl only if the most recent crawl is complete or has a start date after the crawl timestamp of the item. In short, it will take a minimum of two incremental crawls for a new item to get enrichment data with this asynchronous approach.
Custom Timer Job
using Microsoft.Office.Server.Search.Administration;
using Microsoft.SharePoint;
using Microsoft.SharePoint.Administration;
using System;
using System.Collections.Generic;
using System.Data.EntityClient;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace ContentEnrichmentTimerJob
{
    public class ContentEnrichmentJob : SPJobDefinition
    {
        public ContentEnrichmentJob() : base() { }

        public ContentEnrichmentJob(string jobName, SPService service, SPServer server, SPJobLockType targetType)
            : base(jobName, service, server, targetType) { }

        public ContentEnrichmentJob(string jobName, SPWebApplication webApplication)
            : base(jobName, webApplication, null, SPJobLockType.ContentDatabase)
        {
            this.Title = "Content Enrichment Timer Job";
        }

        public override void Execute(Guid targetInstanceId)
        {
            try
            {
                SearchServiceApplication application = SearchService.Service.SearchServiceApplications.FirstOrDefault();
                CrawlLog crawlLog = new CrawlLog(application);
                using (ContentEnrichmentEntities entities = new ContentEnrichmentEntities(GetEntityConnection()))
                {
                    //process all items in the database that were added before the current crawl
                    DateTime start, stop;
                    GetLatestCrawlTimes(WebApplication.Sites[0], out start, out stop); //use the first site collection for context
                    foreach (var item in entities.EnrichmentAsyncData.Where(i => i.CrawlDate < start || stop != DateTime.MaxValue))
                    {
                        crawlLog.RecrawlDocument(item.ItemPath.TrimEnd('/'));
                    }
                }
            }
            catch (Exception)
            {
                //TODO: log error
            }
        }

        private EntityConnection GetEntityConnection()
        {
            //build an Entity Framework connection string in code...too lazy to update OWSTIMER config
            EntityConnectionStringBuilder connBuilder = new EntityConnectionStringBuilder();
            connBuilder.Provider = "System.Data.SqlClient";
            connBuilder.ProviderConnectionString = "data source=SHPT01;initial catalog=ContentEnrichment;integrated security=True;MultipleActiveResultSets=True;App=EntityFramework";
            connBuilder.Metadata = "res://*/ContentEnrichmentModel.csdl|res://*/ContentEnrichmentModel.ssdl|res://*/ContentEnrichmentModel.msl";

            //return the formatted connection string
            return new EntityConnection(connBuilder.ToString());
        }

        private void GetLatestCrawlTimes(SPSite site, out DateTime start, out DateTime stop)
        {
            //look up the most recent crawl of the "Local SharePoint sites" content source
            SPServiceContext context = SPServiceContext.GetContext(site);
            SearchServiceApplication application = SearchService.Service.SearchServiceApplications.FirstOrDefault();
            Content content = new Content(application);
            ContentSource cs = content.ContentSources["Local SharePoint sites"];
            CrawlLog crawlLog = new CrawlLog(application);
            var history = crawlLog.GetCrawlHistory(1, cs.Id);
            start = Convert.ToDateTime(history.Rows[0]["CrawlStartTime"]);
            stop = Convert.ToDateTime(history.Rows[0]["CrawlEndTime"]);
        }
    }
}
With these three solution components in place, we get the following before/after experience in search:
[Figure: Before and after search results]