How Carrier predicts HVAC faults using AWS Glue and Amazon SageMaker



In their own words, “In 1902, Willis Carrier solved one of mankind’s most elusive challenges of controlling the indoor environment through modern air conditioning. Today, Carrier products create comfortable environments, safeguard the global food supply, and enable safe transport of vital medical supplies under exacting conditions.”

At Carrier, the foundation of our success is making products our customers can trust to keep them comfortable and safe year-round. High reliability and low equipment downtime are increasingly important as extreme temperatures become more common due to climate change. We have historically relied on threshold-based systems that alert us to abnormal equipment behavior, using parameters defined by our engineering team. Although such systems are effective, they are intended to identify and diagnose equipment issues rather than predict them. Predicting faults before they occur allows our HVAC dealers to proactively address issues and improve the customer experience.

To improve our equipment reliability, we partnered with the Amazon Machine Learning Solutions Lab to develop a custom machine learning (ML) model capable of predicting equipment issues prior to failure. Our teams developed a framework for processing over 50 TB of historical sensor data and predicting faults with 91% precision. We can now notify dealers of impending equipment failure so they can schedule inspections and minimize unit downtime. The solution framework is scalable as more equipment is installed and can be reused for a variety of downstream modeling tasks.

In this post, we show how the Carrier and AWS teams applied ML to predict faults across large fleets of equipment using a single model. We first highlight how we use AWS Glue for highly parallel data processing. We then discuss how Amazon SageMaker helps us with feature engineering and building a scalable supervised deep learning model.

Overview of use case, goals, and risks

The main goal of this project is to reduce downtime by predicting impending equipment failures and notifying dealers. This allows dealers to schedule maintenance proactively and provide exceptional customer service. We faced three major challenges when working on this solution:

Data scalability – Data processing and feature extraction need to scale across large, growing historical sensor data
Model scalability – The modeling approach needs to be capable of scaling across over 10,000 units
Model precision – Low false positive rates are needed to avoid unnecessary maintenance inspections

Scalability, both from a data and a modeling perspective, is a key requirement for this solution. We have over 50 TB of historical equipment data and expect this data to grow quickly as more HVAC units are connected to the cloud. Data processing and model inference need to scale as our data grows. For our modeling approach to scale across over 10,000 units, we need a model that can learn from a fleet of equipment rather than relying on anomalous readings for a single unit. This allows for generalization across units and reduces the cost of inference by hosting a single model.

The other concern for this use case is triggering false alarms. A false alarm means that a dealer or technician will go on-site to inspect the customer’s equipment and find everything to be operating correctly. The solution requires a high-precision model to ensure that when a dealer is alerted, the equipment is likely to fail. This helps earn the trust of dealers, technicians, and homeowners alike, and reduces the costs associated with unnecessary on-site inspections.

We partnered with the AI/ML experts at the Amazon ML Solutions Lab for a 14-week development effort. In the end, our solution consists of two major components. The first is a data processing module built with AWS Glue that summarizes equipment behavior and reduces the size of our training data for efficient downstream processing. The second is a model training interface managed through SageMaker, which allows us to train, tune, and evaluate our model before it is deployed to a production endpoint.

Data processing

Each HVAC unit we install generates data from 90 different sensors, with readings for RPMs, temperature, and pressures throughout the system. This amounts to roughly 8 million data points generated per unit per day, with tens of thousands of units installed. As more HVAC systems are connected to the cloud, we anticipate the volume of data to grow quickly, making it critical for us to manage its size and complexity for use in downstream tasks. The length of the sensor data history also presents a modeling challenge. A unit may start showing signs of impending failure months before a fault is actually triggered. This creates a significant lag between the predictive signal and the actual failure. A method for compressing the length of the input data becomes critical for ML modeling.

To address the size and complexity of the sensor data, we compress it into cycle features, as shown in Figure 1. This dramatically reduces the size of the data while capturing features that characterize the equipment’s behavior.

Figure 1: Sample of HVAC sensor data
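The exact cycle features used by Carrier aren’t described in detail, but a minimal sketch of this kind of compression, with hypothetical sensor channel names and toy values, could look like the following:

```python
from statistics import mean

def summarize_cycle(readings):
    """Compress one run cycle's raw sensor readings into a small
    feature vector: reading count plus simple per-channel statistics.
    Channel names here are illustrative, not Carrier's schema."""
    features = {"n_readings": len(readings)}
    for channel in ("rpm", "discharge_temp", "suction_pressure"):
        values = [r[channel] for r in readings]
        features[f"{channel}_mean"] = mean(values)
        features[f"{channel}_min"] = min(values)
        features[f"{channel}_max"] = max(values)
    return features

# One short cycle of raw readings (toy values)
cycle = [
    {"rpm": 3400, "discharge_temp": 71.0, "suction_pressure": 118.0},
    {"rpm": 3420, "discharge_temp": 74.5, "suction_pressure": 121.0},
    {"rpm": 3410, "discharge_temp": 76.0, "suction_pressure": 119.5},
]
print(summarize_cycle(cycle)["rpm_mean"])  # 3410
```

In production, a summary like this would be computed once per detected cycle, replacing thousands of raw readings with a handful of numbers.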

AWS Glue is a serverless data integration service for processing large quantities of data at scale. AWS Glue allowed us to easily run parallel data preprocessing and feature extraction. We used AWS Glue to detect cycles and summarize unit behavior using key features identified by our engineering team. This dramatically reduced the size of our dataset from over 8 million data points per day per unit down to roughly 1,200. Crucially, this approach preserves predictive information about unit behavior with a much smaller data footprint.

The output of the AWS Glue job is a summary of unit behavior for each cycle. We then use an Amazon SageMaker Processing job to calculate features across cycles and label our data. We formulate the ML problem as a binary classification task with the goal of predicting equipment faults in the next 60 days. This allows our dealer network to address potential equipment failures in a timely manner. It’s important to note that not all units fail within 60 days. A unit experiencing slow performance degradation could take more time to fail. We address this during the model evaluation step. We focused our modeling on summertime, because those months are when most HVAC systems in the US are in consistent operation and under more extreme conditions.
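The 60-day labeling rule can be sketched as follows; the function name and date handling are illustrative assumptions, not the actual processing code:

```python
from datetime import date, timedelta

HORIZON = timedelta(days=60)  # prediction horizon from the post

def label_cycle(cycle_date, fault_dates):
    """Label a cycle positive if any fault occurs within the next
    60 days of that cycle's date."""
    return any(cycle_date < f <= cycle_date + HORIZON for f in fault_dates)

faults = [date(2022, 8, 15)]
print(label_cycle(date(2022, 7, 1), faults))  # True  (fault 45 days ahead)
print(label_cycle(date(2022, 5, 1), faults))  # False (fault 106 days ahead)
```

A unit whose fault falls just beyond the horizon is labeled negative here, which is exactly the edge case the evaluation section revisits with the effective-precision metric.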


Transformer architectures have become the state-of-the-art approach for handling temporal data. They can use long sequences of historical data at each time step without suffering from vanishing gradients. The input to our model at a given point in time consists of the features for the previous 128 equipment cycles, which is roughly one week of unit operation. This is processed by a three-layer encoder whose output is averaged and fed into a multi-layer perceptron (MLP) classifier. The MLP classifier consists of three linear layers with ReLU activation functions and a final layer with LogSoftMax activation. We use weighted negative log-likelihood loss, with a different weight on the positive class, for our loss function. This biases our model toward high precision and avoids costly false alarms. It also incorporates our business goals directly into the model training process. Figure 2 illustrates the transformer architecture.

Figure 2: Temporal transformer architecture
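A minimal PyTorch sketch of the architecture described above follows. The feature dimension, model width, head count, and class weights are illustrative assumptions, not Carrier’s actual configuration:

```python
import torch
import torch.nn as nn

class TemporalTransformer(nn.Module):
    """3-layer transformer encoder over the last 128 cycle-feature
    vectors, mean-pooled, then a 3-layer MLP head ending in LogSoftmax."""
    def __init__(self, n_features=32, d_model=64, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.head = nn.Sequential(
            nn.Linear(d_model, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 2), nn.LogSoftmax(dim=-1),
        )

    def forward(self, x):                # x: (batch, 128, n_features)
        h = self.encoder(self.embed(x))  # (batch, 128, d_model)
        return self.head(h.mean(dim=1))  # (batch, 2) log-probabilities

model = TemporalTransformer()
log_probs = model(torch.randn(8, 128, 32))
# Weighted NLL loss: down-weighting the positive class (weights are a
# guess) makes false positives relatively costlier, biasing the model
# toward precision.
loss = nn.NLLLoss(weight=torch.tensor([1.0, 0.2]))(
    log_probs, torch.randint(0, 2, (8,)))
print(log_probs.shape)  # torch.Size([8, 2])
```

Because the head ends in LogSoftmax, `nn.NLLLoss` applies directly to its output; the same pairing could be expressed as raw logits plus `nn.CrossEntropyLoss(weight=...)`.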


One challenge when training this temporal learning model is data imbalance. Some units have a longer operational history than others and therefore have more cycles in our dataset. Because they are overrepresented, these units would have more influence on our model. We solve this by randomly sampling 100 cycles in each unit’s history at which we assess the probability of a failure. This ensures that each unit is equally represented during the training process. Besides removing the imbalanced data problem, this approach has the added benefit of replicating a batch processing approach that will be used in production. This sampling approach was applied to the training, validation, and test sets.
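A simple sketch of this balanced sampling is shown below; sampling with replacement for units that have fewer than 100 cycles is an assumption, since the post doesn’t say how short histories are handled:

```python
import random

def sample_unit_cycles(cycles_by_unit, n=100, seed=0):
    """Draw the same number of cycles from each unit so long-lived
    units don't dominate training."""
    rng = random.Random(seed)
    balanced = {}
    for unit_id, cycles in cycles_by_unit.items():
        if len(cycles) >= n:
            balanced[unit_id] = rng.sample(cycles, n)   # without replacement
        else:
            balanced[unit_id] = rng.choices(cycles, k=n)  # with replacement
    return balanced

# A unit with years of history and a recently installed one
fleet = {"unit_a": list(range(5000)), "unit_b": list(range(40))}
balanced = sample_unit_cycles(fleet)
print({u: len(c) for u, c in balanced.items()})
# {'unit_a': 100, 'unit_b': 100}
```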

Training was performed using a GPU-accelerated instance on SageMaker. Monitoring the loss shows that the model achieves its best results after 180 training epochs, as shown in Figure 3. Figure 4 shows that the area under the ROC curve for the resulting temporal classification model is 81%.

Figure 3: Training loss over epochs

Figure 4: ROC-AUC for 60-day lockout


While our model is trained at the cycle level, evaluation needs to take place at the unit level. That way, one unit with multiple true positive detections is still only counted as a single true positive at the unit level. To do this, we analyze the overlap between the predicted results and the 60-day window preceding a fault. This is illustrated in the following figures, which show four cases of prediction outcomes:

True negative – All of the prediction results are negative (purple) (Figure 5.1)
False positive – The positive predictions are false alarms (Figure 5.2)
False negative – Although the predictions are all negative, the actual labels could be positive (green) (Figure 5.3)
True positive – Some of the predictions could be negative (green), and at least one prediction is positive (yellow) (Figure 5.4)

Figure 5.1: True negative case

Figure 5.2: False positive case

Figure 5.3: False negative case

Figure 5.4: True positive case

After training, we use the evaluation set to tune the threshold for sending an alert. Setting the model confidence threshold at 0.99 yields a precision of roughly 81%. This falls short of our initial 90% criterion for success. However, we found that a significant portion of units failed just outside the 60-day evaluation window. This makes sense, because a unit may actively display faulty behavior but take longer than 60 days to fail. To address this, we defined a metric called effective precision, which combines the true positive precision (81%) with the added precision of lockouts that occurred in the 30 days beyond our target 60-day window.

For an HVAC dealer, what’s most important is that an on-site inspection helps prevent future HVAC issues for the customer. Using this model, we estimate that 81.2% of the time the inspection will prevent a lockout from occurring in the next 60 days. Additionally, 10.4% of the time the lockout would have occurred within 90 days of inspection. The remaining 8.4% would be false alarms. The effective precision of the trained model is 91.6%.
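The arithmetic behind effective precision is straightforward; here it is checked using the post’s percentages, treated as counts per 1,000 alerts for illustration:

```python
def effective_precision(tp_60, tp_60_90, fp):
    """Alerts followed by a lockout within 60 days, plus those in the
    30-day grace period (days 61-90), over all alerts raised."""
    return (tp_60 + tp_60_90) / (tp_60 + tp_60_90 + fp)

# 81.2% within 60 days, 10.4% within days 61-90, 8.4% false alarms
print(round(effective_precision(812, 104, 84), 3))  # 0.916
```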


In this post, we showed how our teams used AWS Glue and SageMaker to create a scalable supervised learning solution for predictive maintenance. Our model is capable of capturing trends across long-term histories of sensor data and accurately detecting hundreds of equipment failures weeks in advance. Predicting faults in advance will reduce curb-to-curb time, allowing our dealers to provide more timely technical assistance and improving the overall customer experience. The impact of this approach will grow over time as more cloud-connected HVAC units are installed every year.

Our next step is to integrate these insights into the upcoming release of Carrier’s Connected Dealer Portal. The portal combines these predictive alerts with other insights we derive from our AWS-based data lake in order to give our dealers more visibility into equipment health across their entire client base. We will continue to improve our model by integrating data from additional sources and extracting more advanced features from our sensor data. The methods employed in this project provide a strong foundation for our team to start answering other key questions that can help us reduce warranty claims and improve equipment efficiency in the field.

If you’d like help accelerating the use of ML in your products and services, please contact the Amazon ML Solutions Lab. To learn more about the services used in this project, refer to the AWS Glue Developer Guide and the Amazon SageMaker Developer Guide.

About the Authors

Ravi Patankar is a technical leader for IoT-related analytics at Carrier’s Residential HVAC Unit. He formulates analytics problems related to diagnostics and prognostics and provides direction for ML/deep learning-based analytics solutions and architecture.

Dan Volk is a Data Scientist at the AWS Generative AI Innovation Center. He has ten years of experience in machine learning, deep learning, and time-series analysis, and holds a Master’s in Data Science from UC Berkeley. He is passionate about transforming complex business challenges into opportunities by leveraging cutting-edge AI technologies.

Yingwei Yu is an Applied Scientist at the AWS Generative AI Innovation Center. He has experience working with multiple organizations across industries on various proofs of concept in machine learning, including NLP, time-series analysis, and generative AI technologies. Yingwei received his PhD in computer science from Texas A&M University.

Yanxiang Yu is an Applied Scientist at Amazon Web Services, working in the Generative AI Innovation Center. With over 8 years of experience building AI and machine learning models for industrial applications, he specializes in generative AI, computer vision, and time series modeling. His work focuses on finding innovative ways to apply advanced generative techniques to real-world problems.

Diego Socolinsky is a Senior Applied Science Manager with the AWS Generative AI Innovation Center, where he leads the delivery team for the Eastern US and Latin America regions. He has over twenty years of experience in machine learning and computer vision, and holds a PhD in mathematics from The Johns Hopkins University.

Kexin Ding is a fifth-year Ph.D. candidate in computer science at UNC-Charlotte. Her research focuses on applying deep learning methods to analyzing multi-modal data, including medical imaging and genomics sequencing data.

